
A Summary of the C++11 Regular Expression Reference

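I have learned very little about string handling in C; the most I had done was split strings with strtok. Whenever I needed to process text, I did it in Python. Recently I saw a colleague, while working on multi-language support, write a C++11 program that scanned all the source code for Chinese string literals and wrote the list to an Excel file. So C++ can now handle this kind of task as well. In this article I learn the C++11 regular expression library by reading through the reference.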

Reference

objects

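Defined in header: <regex>

Almost all regex operations can be characterized as operating on the following objects:

Target sequence

The target sequence is the character sequence the regex is applied to. It may be a range specified by two iterators, a null-terminated character string, or a std::string.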

Pattern

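The pattern is the regular expression itself.

C++11 regex grammar options

Matched array

The information about matches can be retrieved as an object of type std::match_results.

Replacement string

This string determines how the matched text is replaced.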

Main classes

basic_regex

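The regular expression object.

Narrow-character and wide-character support:

Type Definition
regex basic_regex<char>
wregex basic_regex<wchar_t>

flag_type: options that customize the behaviour of a regular expression object (see the flag documentation).

These options mainly control which grammar is used and how the expression is matched.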

Value Effect(s)
icase Character matching should be performed without regard to case.
nosubs When performing matches, all marked sub-expressions (expr) are treated as non-marking sub-expressions (?:expr). No matches are stored in the supplied std::match_results structure and mark_count() is zero.
optimize Instructs the regular expression engine to make matching faster, with the potential cost of making construction slower. For example, this might mean converting a non-deterministic FSA to a deterministic FSA.
collate Character ranges of the form “[a-b]” will be locale sensitive.
ECMAScript Use the Modified ECMAScript regular expression grammar
basic Use the basic POSIX regular expression grammar (grammar documentation).
extended Use the extended POSIX regular expression grammar (grammar documentation).
awk Use the regular expression grammar used by the awk utility in POSIX (grammar documentation)
grep Use the regular expression grammar used by the grep utility in POSIX. This is effectively the same as the basic option with the addition of newline ‘\n’ as an alternation separator.
egrep Use the regular expression grammar used by the grep utility, with the -E option, in POSIX. This is effectively the same as the extended option with the addition of newline '\n' as an alternation separator in addition to '|'.

For example:

std::regex("meow", std::regex::icase)
           

sub_match

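Identifies the sequence of characters matched by a sub-expression.

match_results

Identifies one regular expression match, including all of its sub-expression matches.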

Algorithms

regex_match

regex_match attempts to match the regular expression against the entire target sequence.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    // Simple regular expression matching
    std::string fnames[] = {"foo.txt", "bar.txt", "baz.dat", "zoidberg"};
    std::regex txt_regex("[a-z]+\\.txt");

    for (const auto &fname : fnames) {
        std::cout << fname << ": " << std::regex_match(fname, txt_regex) << '\n';
    }   

    // Extraction of a sub-match
    std::regex base_regex("([a-z]+)\\.txt");
    std::smatch base_match;

    for (const auto &fname : fnames) {
        if (std::regex_match(fname, base_match, base_regex)) {
            // The first sub_match is the whole string; the next
            // sub_match is the first parenthesized expression.
            if (base_match.size() == 2) {
                std::ssub_match base_sub_match = base_match[1];
                std::string base = base_sub_match.str();
                std::cout << fname << " has a base of " << base << '\n';
            }
        }
    }

    // Extraction of several sub-matches
    std::regex pieces_regex("([a-z]+)\\.([a-z]+)");
    std::smatch pieces_match;

    for (const auto &fname : fnames) {
        if (std::regex_match(fname, pieces_match, pieces_regex)) {
            std::cout << fname << '\n';
            for (size_t i = 0; i < pieces_match.size(); ++i) {
                std::ssub_match sub_match = pieces_match[i];
                std::string piece = sub_match.str();
                std::cout << "  submatch " << i << ": " << piece << '\n';
            }   
        }   
    }   
}
           

The example shows how to read the groups captured by a single match through sub_match.

foo.txt: 1
bar.txt: 1
baz.dat: 0
zoidberg: 0
foo.txt has a base of foo
bar.txt has a base of bar
foo.txt
  submatch 0: foo.txt
  submatch 1: foo
  submatch 2: txt
bar.txt
  submatch 0: bar.txt
  submatch 1: bar
  submatch 2: txt
baz.dat
  submatch 0: baz.dat
  submatch 1: baz
  submatch 2: dat
           

regex_search

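regex_search looks for a match anywhere in the target sequence. Called repeatedly on the remaining suffix, it can find every match.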

#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string lines[] = {"Roses are #ff0000",
                           "violets are #0000ff",
                           "all of my base are belong to you"};

    std::regex color_regex("#([a-f0-9]{2})"
                            "([a-f0-9]{2})"
                            "([a-f0-9]{2})");

    // simple match
    for (const auto &line : lines) {
        std::cout << line << ": " << std::boolalpha
                  << std::regex_search(line, color_regex) << '\n';
    }   
    std::cout << '\n';

    // show contents of marked subexpressions within each match
    std::smatch color_match;
    for (const auto& line : lines) {
        if(std::regex_search(line, color_match, color_regex)) {
            std::cout << "matches for '" << line << "'\n";
            std::cout << "Prefix: '" << color_match.prefix() << "'\n";
            for (size_t i = 0; i < color_match.size(); ++i) 
                std::cout << i << ": " << color_match[i] << '\n';
            std::cout << "Suffix: '" << color_match.suffix() << "\'\n\n";
        }
    }

    // repeated search (see also std::regex_iterator)
    std::string log(R"(
        Speed:  366
        Mass:   35
        Speed:  378
        Mass:   32
        Speed:  400
    Mass:   30)");
    std::regex r(R"(Speed:\s+\d*)"); // \s+ accepts a tab or spaces after the colon
    std::smatch sm;
    while(regex_search(log, sm, r))
    {
        std::cout << sm.str() << '\n';
        log = sm.suffix();
    }
}
           

Output:

Roses are #ff0000: true
violets are #0000ff: true
all of my base are belong to you: false

matches for 'Roses are #ff0000'
Prefix: 'Roses are '
0: #ff0000
1: ff
2: 00
3: 00
Suffix: ''

matches for 'violets are #0000ff'
Prefix: 'violets are '
: #0000ff
1: 00
2: 00
: ff
Suffix: ''

Speed:  366
Speed:  378
Speed:  400
           

regex_replace

Finds the matched text and replaces it. The usual capture-group replacement syntax is supported as well.

#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
   std::string text = "Quick brown fox";
   std::regex vowel_re("a|e|i|o|u");

   // write the results to an output iterator
   std::regex_replace(std::ostreambuf_iterator<char>(std::cout),
                      text.begin(), text.end(), vowel_re, "*");

   // construct a string holding the results
   std::cout << '\n' << std::regex_replace(text, vowel_re, "[$&]") << '\n';
}
           

Output:

Q**ck br*wn f*x
Q[u][i]ck br[o]wn f[o]x
           
void auto_test2()
{
    std::string erl_text = "-define(CMD_CONNECT_EXCHANGE_KEY_REQ, 3).";
    std::regex match_erl_define("-define\\(([a-zA-Z_]+), ([0-9]+)\\)\\.");

    // construct a string holding the results
    std::cout << '\n' << std::regex_replace(erl_text, match_erl_define, "#define $1 $2") << '\n';
}
           

Output:

#define CMD_CONNECT_EXCHANGE_KEY_REQ 3

For the full meaning of the built-in replacement-string tokens, see this reference:

RegExp.lastMatch
RegExp['$&']
RegExp.$1-$9
RegExp.input ($_)
RegExp.lastParen ($+)
RegExp.leftContext ($`)
RegExp.rightContext ($')
           

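Note that when a regex is written as an ordinary C++ string literal, the backslash itself has to be escaped, so the regex escape \. is written as \\. in the source. Capture groups written with () are supported directly.

The quantifiers * and + need no escaping in the pattern, and neither does a repetition count such as {2}.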

Wikipedia calls this quantification.

fmt - the regex replacement format string, exact syntax depends on the value of flags

Iterators


regex_iterator

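regex_iterator walks through the matches of a regular expression one at a time. It fits cases where you want to control how many matches are consumed during the search. For example, suppose a text lists how much each guest spent at a hotel, and you want the guests whose spending brought the hotel's takings up to 100 yuan. With an iterator you can stop matching as soon as the running total reaches 100, instead of collecting every match first, walking through them one by one, and keeping only the first few.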

#include <regex>
#include <iterator>
#include <iostream>
#include <string>

int main()
{
    const std::string s = "Quick brown fox.";

    std::regex words_regex("[^\\s]+");
    auto words_begin = 
        std::sregex_iterator(s.begin(), s.end(), words_regex);
    auto words_end = std::sregex_iterator();

    std::cout << "Found " 
              << std::distance(words_begin, words_end) 
              << " words:\n";

    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;                                                 
        std::string match_str = match.str(); 
        std::cout << match_str << '\n';
    }   
}
           

Output:

Found 3 words:
Quick
brown
fox.
           

regex_token_iterator

#include <fstream>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>

int main()
{
   std::string text = "Quick brown fox.";
   // tokenization (non-matched fragments)
   // Note that regex is matched only two times: when the third value is obtained
   // the iterator is a suffix iterator.
   std::regex ws_re("\\s+"); // whitespace
   std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

   // iterating the first submatches
   std::string html = "<p><a href=\"http://google.com\">google</a> "
                      "< a HREF =\"http://cppreference.com\">cppreference</a>\n</p>";
   std::regex url_re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", std::regex::icase);
   std::copy( std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}
           

The difference between the two iterators is that regex_token_iterator lets you choose which sub-matches are produced while iterating.

Example 1

Example 2

Exceptions

regex_error

Used to catch errors reported by the regular expression library, such as a malformed pattern.

#include <regex>
#include <iostream>

int main()
{
    try {
        std::regex re("[a-b][a");
    } 

    catch (const std::regex_error& e) {
        std::cout << "regex_error caught: " << e.what() << '\n';
        if (e.code() == std::regex_constants::error_brack) {
            std::cout << "The code was error_brack\n";
        }
    }
}
           
regex_error caught: The expression contained mismatched [ and ].
The code was error_brack
           

Summary

After reading through this material, you can basically use C++ for matching work of this kind.

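#include <iostream>
#include <regex>
#include <string>

using namespace std;

void test_chinese_re()
{
    string text = "vice jax teemo, 老張  武松 ";
    regex reg(" ([\u4e00-\u9fa5]+) ");

    sregex_iterator pos(text.cbegin(), text.cend(), reg);
    sregex_iterator end;
    for (; pos != end; ++pos) {
        cout << "match:  " << pos->str() << endl;   // whole match, including the surrounding spaces
        cout << " tag:   " << pos->str(1) << endl;  // first capture group only
    }
}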
           
match:   老張
 tag:   老張
match:   武松
 tag:   武松
           
