Regular Expressions in C++ with Boost.Regex(4)

Searching

Matching and parsing a single string in its entirety does not address the important and ubiquitous use case of searching a string that contains a substring you want, but possibly a lot of other characters you don't.

As with matching, Boost.Regex lets you search a string for a regular expression in two ways. In the simplest case, you may just want to know whether a given string contains a match for your regular expression. Example 3 is a trivial implementation of the grep program that reads each line from a file and prints it if it contains a string that satisfies the regular expression pattern.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/regex.hpp>

using namespace std;

const int BUFSIZE = 10000;

int main(int argc, char** argv) {
   // Safety checks omitted...
   boost::regex re(argv[1]);
   string file(argv[2]);
   char buf[BUFSIZE];

   ifstream in(file.c_str());
   // Loop on the read itself; testing eof() before reading would
   // process a stale buffer once more after the final read fails
   while (in.getline(buf, BUFSIZE)) {
      // True if any substring of the line satisfies the pattern
      if (boost::regex_search(buf, re)) {
         cout << buf << endl;
      }
   }
}

Example 3. Trivial grep

You can see that you use regex_search in the same way as regex_match.
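To make the contrast with regex_match concrete, here is a minimal sketch. It uses std::regex from C++11, which descends from Boost.Regex and keeps the same function names, so that it compiles without Boost. The helper names contains_match and matches_entirely are invented for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if ANY substring of text satisfies pattern.
bool contains_match(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: true only if the WHOLE of text satisfies pattern.
bool matches_entirely(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With the pattern hel+o, contains_match accepts "say hello world" while matches_entirely rejects it, because the extra characters around "hello" are not covered by the pattern.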

This comes in handy sometimes, but has limited appeal. More often, you will want to enumerate all substrings that match a given pattern. For example, maybe you are writing a web crawler and want to iterate over all anchor tags in a page. Craft a regular expression to grab anchor tags:

<a\s+href="([\-:\w\d\.\/]+)">

You don't want the whole line returned, though, as in the grep example above; you want the target URL. To get it, use the second element of match_results (index 1), which holds the text captured by the first parenthesized subexpression. Example 4, a slightly modified version of Example 3, does just that.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/regex.hpp>

using namespace std;

const int BUFSIZE = 10000;

int main(int argc, char** argv) {
   // Safety checks omitted...
   boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
   string file(argv[1]);
   char buf[BUFSIZE];
   // smatch pairs with std::string iterators (cmatch is for const char*)
   boost::smatch matches;
   string sbuf;
   string::const_iterator begin, end;

   ifstream in(file.c_str());
   // Loop on the read itself rather than on eof()
   while (in.getline(buf, BUFSIZE)) {
      sbuf = buf;
      // Both ends of the range must have the same iterator type
      begin = sbuf.begin();
      end = sbuf.end();
      while (boost::regex_search(begin, end, matches, re)) {
         string url(matches[1].first, matches[1].second);
         cout << "URL: " << url << endl;
         // Resume the search at the character following the captured URL
         begin = matches[1].second;
      }
   }
}

Example 4. Enumerating anchor tags

The hard-coded regular expression in Example 4 contains lots of backslashes. This is necessary because I am escaping certain characters twice: once for the compiler, and once for the regular expression engine.
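If your compiler supports C++11 raw string literals (an assumption; the article predates them), the compiler-level escaping can be dropped. The sketch below writes the anchor-tag pattern both ways, with backslashes as the paragraph above describes. The variable names and the same_pattern helper are made up for illustration.

```cpp
#include <string>

// Conventional literal: every backslash and double quote is escaped
// once more for the compiler.
const std::string escaped = "<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">";

// Raw literal: the text between R"re( and )re" is taken verbatim.
// The custom delimiter "re" is needed because the pattern itself
// contains the sequence )" which would end a plain R"( ... )" literal.
const std::string raw = R"re(<a\s+href="([\-:\w\d\.\/]+)">)re";

// Hypothetical check that both spellings yield the same characters.
bool same_pattern() {
    return escaped == raw;
}
```

Either string can be passed to the regex constructor; only the spelling in the source file differs.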

Example 4 uses a different overload of regex_search than Example 3; this version takes two bidirectional iterator arguments that refer to the beginning and end of the range of characters to be searched. To access every matching substring, all I have to do is update begin to point to the character following the captured URL, which is in matches[1].second.

This is not the only way to iterate over all occurrences of a pattern. If you prefer (or require) iterator semantics, use a regex_token_iterator, which is an iterator interface to the results from a regular expression search. In Example 4, you could just as easily have iterated over the results of the URL search:

   // Read the HTML file into the string s...
   boost::sregex_token_iterator p(s.begin(), s.end(), re, 0);
   boost::sregex_token_iterator end;

   for (; p != end; ++p) {
      string m(p->first, p->second);
      cout << m << endl;
   }

That's not all, though. The token iterator here passes a zero as the last argument to its constructor, which tells it to iterate over the substrings that satisfy the regular expression. Change it to -1 and you get the opposite: iteration over the substrings that do not satisfy the expression. In other words, it tokenizes the string, treating each match of the regular expression as a delimiter and returning the text between matches as the tokens. This is a cool feature, because it lets you tokenize a string of characters based on complex delimiters. When parsing a web page, for example, you could break the document into sections by its headers, using header tags such as <h1>...</h1>, <h3>...</h3>, etc.
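The effect of the last constructor argument is easiest to see side by side. This is a small sketch using std::sregex_token_iterator from C++11 (chosen so it builds without Boost; the Boost class has the same name and argument order). The tokens function and its parameter names are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect what a token iterator yields for the
// given submatch selector (0 = the matches, -1 = the text between them).
std::vector<std::string> tokens(const std::string& text,
                                const std::string& pattern,
                                int submatch) {
    // The regex must outlive the iterators, so it gets a name here
    // instead of being passed as a temporary.
    const std::regex re(pattern);
    std::sregex_token_iterator p(text.begin(), text.end(), re, submatch);
    std::sregex_token_iterator end;
    return std::vector<std::string>(p, end);
}
```

For the text "a, b;c" and the delimiter pattern [,;]\s*, a selector of -1 yields the three fields a, b, and c, while a selector of 0 yields the two delimiters themselves.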

Stuff to Check Out

There is, of course, more to Boost.Regex than I've presented here, but this should give you a good idea of what you can do with regular expressions in C++. The documentation on the Boost.Regex page is comprehensive, and there are plenty of examples you can copy and experiment with. In addition to searching strings as I did above, you can:

  • Search and replace using different Perl and Sed-style formatting conventions.
  • Use POSIX basic and extended regular expression format.
  • Use Unicode strings and other non-standard string formats.
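As a taste of the first item, here is a rough search-and-replace sketch. It relies on std::regex_replace from C++11 rather than Boost so that it is self-contained; the $1-style references in the format string are the kind of Perl-like convention the list mentions. The reformat_dates function and the date layout are invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrite every YYYY-MM-DD date as DD/MM/YYYY.
std::string reformat_dates(const std::string& text) {
    // Three capture groups: year, month, day.
    const std::regex date(R"((\d{4})-(\d{2})-(\d{2}))");
    // $1, $2, $3 in the format string refer to those groups.
    return std::regex_replace(text, date, "$3/$2/$1");
}
```

Text that does not match the pattern passes through unchanged, so the function can be applied to a whole line or document at once.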

Above all, you should experiment with regular expression syntax. There are different ways to do the same thing, and it's fun to see how concise you can make an expression that does what you want. Once you're a pro at regular expressions, you will be surprised at how often you can use them to validate, search, or parse a string.

Conclusion

Boost.Regex is the library in the Boost project that implements a regular expression engine in C++. You can use it to match, search, or search and replace with regular expressions against a target string, instead of writing ugly and cumbersome string-parsing code. Boost.Regex has been accepted as part of the next C++ standard library, and you will see it appearing in implementations of TR1 (in the tr1 namespace) from standard library vendors very soon. Check out Boost.Regex to get a feel for how useful it is, and while you're at it, take a look at many of the other libraries in Boost--there's a lot of good stuff there.

Ryan Stephens is a software engineer, writer, and student living in Tempe, Arizona. He enjoys programming in virtually any language, especially C++.