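Study Notes on flex, the Lexical Analysis Tool

Notes taken while reading the official flex documentation. I recorded only the parts I thought I might actually use.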

  • %top

    A %top block is similar to a ‘%{’ … ‘%}’ block, except that the code in a %top block is relocated to the top of the generated file, before any flex definitions. The %top block is useful when you want certain preprocessor macros to be defined or certain files to be included before the generated code. The single characters ‘{’ and ‘}’ are used to delimit the %top block, as shown in the example below:
    %top{
         /* This code goes at the "top" of the generated file. */
         #include <stdint.h>
         #include <inttypes.h>
     }
               

Multiple %top blocks are allowed, and their order is preserved.

  • %pointer, %array

    Note that yytext can be defined in two different ways: either as a character pointer or as a character array. You can control which definition flex uses by including one of the special directives %pointer or %array in the first (definitions) section of your flex input. The default is %pointer, unless you use the -l lex compatibility option, in which case yytext will be an array. The advantage of using %pointer is substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify yytext (see Actions), and calls to the unput() function destroy the present contents of yytext, which can be a considerable porting headache when moving between different lex versions.

The advantage of %array is that you can then modify yytext to your heart’s content, and calls to unput() do not destroy yytext (see Actions). Furthermore, existing lex programs sometimes access yytext externally using declarations of the form:

extern char yytext[];
           

This definition is erroneous when used with %pointer, but correct for %array.
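
Why the array declaration breaks under %pointer can be seen in plain C, without flex: an array object and a pointer object have different types and sizes, so a declaration of the form shown above only matches a definition that really is an array. The sketch below is a model only. The names demo_array_text, demo_pointer_text, and DEMO_LMAX are hypothetical stand-ins for yytext and YYLMAX, and 8192 is an arbitrary size picked for illustration.

```c
#include <stddef.h>

#define DEMO_LMAX 8192  /* hypothetical stand-in for YYLMAX; the value is arbitrary */

/* Models %array: the token text lives in a fixed-size array object. */
static char demo_array_text[DEMO_LMAX] = "token";

/* Models %pointer: the token text is reached through a pointer object. */
static char *demo_pointer_text = demo_array_text;

/* Size of the array object itself: DEMO_LMAX bytes. */
size_t demo_array_object_size(void) { return sizeof demo_array_text; }

/* Size of the pointer object: sizeof(char *), whatever the text length. */
size_t demo_pointer_object_size(void) { return sizeof demo_pointer_text; }

/* Used correctly, both spellings reach the same characters. */
char demo_first_char_via_pointer(void) { return demo_pointer_text[0]; }
```

Since the two objects are laid out differently, another translation unit that declares the pointer as an array would read the pointer's own bytes as if they were characters.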

The %array declaration defines yytext to be an array of YYLMAX characters, which defaults to a fairly large value. You can change the size by simply #define’ing YYLMAX to a different value in the first section of your flex input. As mentioned above, with %pointer yytext grows dynamically to accommodate large tokens. While this means your %pointer scanner can accommodate very large tokens (such as matching entire blocks of comments), bear in mind that each time the scanner must resize yytext it also must rescan the entire token from the beginning, so matching such tokens can prove slow. yytext presently does not dynamically grow if a call to unput() results in too much text being pushed back; instead, a run-time error results.

Also note that you cannot use %array with C++ scanner classes (see Cxx).

  • An action consisting solely of a vertical bar (‘|’) means “same as the action for the next rule”. See below for an illustration.
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
.|\n     /* eat up any unmatched character */
           
  • yymore()
%%
mega-    ECHO; yymore();
kludge   ECHO;
           

First ‘mega-’ is matched and echoed to the output. Then ‘kludge’ is matched, but the previous ‘mega-’ is still hanging around at the beginning of yytext so the ECHO for the ‘kludge’ rule will actually write ‘mega-kludge’.
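
The append behaviour described here can be modelled in plain C. This is a rough sketch, not flex-generated code: DemoScanner, demo_match, and demo_yymore_output are hypothetical names, the buffers are small fixed arrays with no bounds checks, and the "matching" is simply replayed by hand for the input ‘mega-kludge’.

```c
#include <string.h>

/* Hypothetical model of a scanner's token buffer; not generated by flex. */
typedef struct {
    char   text[64];   /* stands in for yytext */
    size_t leng;       /* stands in for yyleng */
    int    more;       /* nonzero if the previous action called yymore() */
    char   out[128];   /* collects everything ECHO would have written */
} DemoScanner;

/* Simulate matching `lexeme` with the action ECHO, optionally followed by
   yymore(). No bounds checks: the demo only feeds short fixed strings. */
static void demo_match(DemoScanner *s, const char *lexeme, int calls_yymore)
{
    size_t n = strlen(lexeme);
    if (!s->more)
        s->leng = 0;                           /* usual case: new token replaces the old */
    memcpy(s->text + s->leng, lexeme, n + 1);  /* append or overwrite, including '\0' */
    s->leng += n;
    strcat(s->out, s->text);                   /* ECHO writes the whole of the model yytext */
    s->more = calls_yymore;
}

/* Replays the two rules on the input "mega-kludge" and returns the echoed text. */
const char *demo_yymore_output(void)
{
    static DemoScanner s;
    memset(&s, 0, sizeof s);
    demo_match(&s, "mega-", 1);    /* mega-    ECHO; yymore(); */
    demo_match(&s, "kludge", 0);   /* kludge   ECHO;           */
    return s.out;
}
```

In this model the total output is ‘mega-’ followed by ‘mega-kludge’, because the second ECHO sees the combined buffer.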

  • yyless(n)

yyless(n) returns all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. yytext and yyleng are adjusted appropriately (e.g., yyleng will now be equal to n). For example, on the input ‘foobar’ the following will write out ‘foobarbar’:

%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
           

An argument of 0 to yyless() will cause the entire current input string to be scanned again. Unless you’ve changed how the scanner will subsequently process its input (using BEGIN, for example), this will result in an endless loop.
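
The ‘foobarbar’ result can be traced with a small plain-C model. It is only a sketch under stated assumptions: the first action is taken to call yyless with an argument of 3 (which is what the ‘foobarbar’ output implies), demo_yyless and demo_yyless_output are hypothetical names, and the two rule matches are replayed by hand instead of being found by a real scanner.

```c
#include <string.h>

/* Model of yyless(n): keep the first n characters of the current match and
   return the input position at which the next match will start. */
static size_t demo_yyless(size_t match_start, size_t n)
{
    return match_start + n;
}

/* Replays "foobar  ECHO; yyless(3);" followed by "[a-z]+  ECHO;" on "foobar". */
const char *demo_yyless_output(void)
{
    static char out[32];
    const char *input = "foobar";
    size_t start = 0;
    size_t leng  = strlen("foobar");      /* first rule matches all six characters */

    out[0] = '\0';
    strncat(out, input + start, leng);    /* ECHO runs before yyless: writes "foobar" */
    start = demo_yyless(start, 3);        /* "bar" goes back to the input */

    leng = strlen(input + start);         /* [a-z]+ now matches the remaining "bar" */
    strncat(out, input + start, leng);    /* ECHO writes "bar" */
    return out;
}
```

The order of the two actions matters: because ECHO comes first, the full ‘foobar’ is written before the token is shortened.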

  • yyterminate()

yyterminate() can be used in lieu of a return statement in an action. It terminates the scanner and returns a 0 to the scanner’s caller, indicating “all done”. By default, yyterminate() is also called when an end-of-file is encountered. It is a macro and may be redefined.
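
One way a macro can stand in for a return statement is by expanding to one. The sketch below assumes that shape purely for illustration; demo_terminate, DEMO_ALL_DONE, and demo_lex are hypothetical names, and the exact default definition in a given flex version may differ.

```c
/* Hypothetical names modelling a "terminate" macro that returns 0 ("all done"). */
#define DEMO_ALL_DONE 0
#define demo_terminate() return DEMO_ALL_DONE

/* Stand-in for a scanning routine: 0 at end of input, otherwise a token code. */
int demo_lex(int at_end_of_input)
{
    if (at_end_of_input)
        demo_terminate();   /* expands to: return 0; */
    return 1;               /* hypothetical token code */
}
```

Because it is a macro, a redefinition changes what every action that uses it returns, without touching the actions themselves.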

  • #define YY_DECL

The output of flex is the file lex.yy.c, which contains the scanning routine yylex(), a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By default, yylex() is declared as follows:

int yylex()
{
    ... various definitions and the actions in here ...
}
           

This definition may be changed by defining the YY_DECL macro. For example, you could use:

#define YY_DECL float lexscan(float a, float b )
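
To see what such a macro does at the C level, here is a minimal stand-alone sketch. The macro text is the one from the example; the function body is a hypothetical placeholder, since in a real scanner flex supplies the body.

```c
/* Same macro as in the example; in a real scanner it belongs in the
   definitions section of the flex input. */
#define YY_DECL float lexscan(float a, float b)

YY_DECL;   /* expands to the prototype: float lexscan(float a, float b); */

/* flex would emit the scanner body here; this placeholder body is hypothetical. */
YY_DECL
{
    return a + b;
}
```

The caller then invokes lexscan(a, b) instead of yylex(), and the parameters are visible inside the actions as ordinary function arguments.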
           
  • %option noyywrap

When the scanner receives an end-of-file indication from YY_INPUT, it then checks the yywrap() function. If yywrap() returns false (zero), then it is assumed that the function has gone ahead and set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), then the scanner terminates, returning 0 to its caller. Note that in either case, the start condition remains unchanged; it does not revert to INITIAL.

If you do not supply your own version of yywrap(), then you must either use %option noyywrap (in which case the scanner behaves as though yywrap() returned 1), or you must link with ‘-lfl’ to obtain the default version of the routine, which always returns 1.
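
A user-supplied routine that follows the return-value contract described above might look roughly like this. It is a stand-alone sketch, not a drop-in yywrap(): demo_yyin stands in for flex's yyin, and demo_files, demo_set_files, and demo_yywrap are hypothetical names.

```c
#include <stdio.h>

static FILE        *demo_yyin;    /* stand-in for flex's yyin */
static const char **demo_files;   /* hypothetical NULL-terminated list of paths still to scan */

/* Hypothetical setter so a caller can supply the list of remaining files. */
void demo_set_files(const char **files) { demo_files = files; }

/* Same contract as yywrap(): 0 means more input was set up, 1 means no more input. */
int demo_yywrap(void)
{
    while (demo_files != NULL && *demo_files != NULL) {
        FILE *next = fopen(*demo_files++, "r");
        if (next != NULL) {
            if (demo_yyin != NULL)
                fclose(demo_yyin);
            demo_yyin = next;     /* scanning would continue from this file */
            return 0;
        }
        /* this path could not be opened; try the next one */
    }
    return 1;                     /* nothing left: the scanner terminates */
}
```

Skipping unreadable files silently is a design choice made for brevity here; a real scanner would probably report them.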

  • End-of-File Rules

The special rule <<EOF>> indicates actions which are to be taken when an end-of-file is encountered and yywrap() returns non-zero (i.e., indicates no further files to process). The action must finish by doing one of the following things:

    • assigning yyin to a new input file (in previous versions of flex, after doing the assignment you had to call the special action YY_NEW_FILE; this is no longer necessary);
    • executing a return statement;
    • executing the special yyterminate() action;
    • or switching to a new buffer using yy_switch_to_buffer().

These rules are useful for catching things like unclosed comments. An example:

%x quote
%%

...other rules for dealing with quotes...

<quote><<EOF>>   {
         error( "unterminated quote" );
         yyterminate();
         }
<<EOF>>  {
         if ( *++filelist )
             yyin = fopen( *filelist, "r" );
         else
             yyterminate();
         }
           
  • Values Available To the User
    • char *yytext
    • int yyleng
    • FILE *yyin
    • void yyrestart( FILE *new_file )

      may be called to point yyin at the new input file. The switch-over to the new file is immediate (any previously buffered-up input is lost). Note that calling yyrestart() with yyin as an argument thus throws away the current input buffer and continues scanning the same input file.

    • FILE *yyout

      is the file to which ECHO actions are done. It can be reassigned by the user.

    • YY_CURRENT_BUFFER

      returns a YY_BUFFER_STATE handle to the current buffer.

    • YY_START

      returns an integer value corresponding to the current start condition. You can subsequently use this value with BEGIN to return to that start condition.

  • Index of Scanner Options

Even though there are many scanner options, a typical scanner might only specify the following options:

%option   8bit reentrant bison-bridge
%option   warn nodefault
%option   yylineno
%option   outfile="scanner.c" header-file="scanner.h"
           

The first line specifies the general type of scanner we want. The second line specifies that we are being careful. The third line asks flex to track line numbers. The last line tells flex what to name the files. (The options can be specified in any order. We just divided them.)

flex also provides a mechanism for controlling options within the scanner specification itself, rather than from the flex command-line. This is done by including %option directives in the first section of the scanner specification. You can specify multiple options with a single %option directive, and multiple directives in the first section of your flex input file.

Most options are given simply as names, optionally preceded by the word ‘no’ (with no intervening whitespace) to negate their meaning. The names are the same as their long-option equivalents (but without the leading ‘--’).

  • Performance Considerations

The main design goal of flex is that it generate high-performance scanners. It has been optimized for dealing well with large sets of rules. Aside from the effects on scanner speed of the table compression ‘-C’ options outlined above, there are a number of options/actions which degrade performance. These are, from most expensive to least:

REJECT
     arbitrary trailing context

     pattern sets that require backing up
     %option yylineno
     %array

     %option interactive
     %option always-interactive

     ^ beginning-of-line operator
     yymore()
           

There is one case when %option yylineno can be expensive. That is when your patterns match long tokens that could possibly contain a newline character. There is no performance penalty for rules that can not possibly match newlines, since flex does not need to check them for newlines.

In general, you should avoid rules such as [^f]+, which match very long tokens, including newlines, and may possibly match your entire file! A better approach is to separate [^f]+ into two rules:

%option yylineno
%%
[^f\n]+
\n+
           

The above scanner does not incur a performance penalty.
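
The cost being avoided can be pictured as a per-token loop over the matched text. The function below is a sketch of that idea only, not flex's generated code, and demo_count_newlines is a hypothetical name.

```c
#include <stddef.h>

/* Model of the bookkeeping needed to keep a line counter correct for one
   matched token: every character of a token that might contain '\n' has
   to be inspected. */
size_t demo_count_newlines(const char *text, size_t leng)
{
    size_t lines = 0;
    size_t i;
    for (i = 0; i < leng; ++i)
        if (text[i] == '\n')
            ++lines;
    return lines;
}
```

With the split rules, a token matched by [^f\n]+ can never contain a newline, so no such pass is needed for it, and a token matched by \n+ consists of nothing but newlines.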