参考资料

－《lex & yacc 2nd》：下载地址参考 http://blog.csdn.net/a_flying_bird/article/details/52486815

本文即此书的学习笔记。

Lex

要点

扩展名

lex文件通常使用的后缀名： .l, .ll, .lex。——实际上，可以是任意的名称。

文件结构

文件内容分为三部分，各个部分之间以 %% 分隔：

%{
    /* part 1: Definition Section. e.g.: Global declaration of C. */
%}

%%

/* part 2: Rules section. Rule = Pattern + Action. */

%%

part 3: C codes.

注意，%} 不要写成 }% 了，否则 premature EOF。

%{ 和 %} 之间的内容会原封不动地拷贝到最后生成的c文件中，所以这里可以是任何合法的C代码。通常而言，这里放lex文件后面C代码要用到的一些东西。

lex文件生成c文件

使用lex命令，把lex文件转换成c文件（lex.yy.c）；在生成可执行文件的时候，要链接库文件l。示例：

lex simplest.l 
gcc lex.yy.c -ll -o test

示例

最简单的例子

对应En Page 2.

代码（simplest.l）：

%%
.|\n ECHO;
%%

编译（把lex文件转换成c文件）&链接&运行：

$ ls
simplest.l
$ lex simplest.l 
$ ls
lex.yy.c    simplest.l
$ gcc lex.yy.c -ll -o test
$ ./test 
The simplest lex program. ------ 键盘输入内容
The simplest lex program. ------ 程序回显结果
^C
$

识别单词 Recognizing words

这个例子可以识别指定的这些单词，其他的不认识的直接回显。－对应原书 ch1-02.l

代码：

%{

/*
 * this sample demonstrates (very) simple recognition:
 * a verb, or not a verb.
 */

%}

%%

[\t ]+   /* ignore whitespace */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go     {printf("%s: is a verb\n", yytext);}

[a-zA-Z]+ {printf("%s: is not verb\n", yytext);}

.|\n  {ECHO; /* normal default anyway */ }

%%

int main()
{
    yylex();

    return 0;
}

运行：

$ lex recoginzing_word.l 
$ gcc lex.yy.c -ll -o test
$ ./test 
I am a student. You are a teacher. ------ 键盘输入内容
I: is not verb
am: is a verb
a: is not verb
student: is not verb
.You: is not verb
are: is a verb
a: is not verb
teacher: is not verb
.
^C
$

要点

lex文件的三部分：definition section, rules section, user subroutines section.

definition section可以有一段”%{“和”%}”，这中间用来放C代码，比如#include，函数原型，全局变量等等。在由lex生成lex.yy.c的时候，这部分原封不动拷贝到C文件中。

rules section: 每个规则由两部分组成，即 pattern + action. 两者由空格分开。其中pattern是正则表达式语法。lexer在识别到某个pattern后，就会执行其对应的action。——action: { C codes. }

user subroutines section: 拷贝到.c文件的最后。

特殊的action：

“;”: 同C语言的空余句，即什么也不做。——直接忽略这些输入
“ECHO;”: 缺省行为，将匹配的字符串打印到输出文件中（stdout，回显）。
“|”: 使用下一个pattern的action。——注意 | action的语法，会在pattern后面有一个空格。而作为正则表达式的|则不会有空格。

注意1: ;和ECHO;的区别：前者是忽略输入，后者是打印到输出。可以将示例中的ECHO;改成;后观察输出的变化情况。

注意2: | action不能像下面这种方法写到同一行：

had | go     {printf("%s: is a verb\n", yytext);}

变量：

yytext: 存储的是匹配到的字符串，其类型可以在生成的.c中看到，即 extern char *yytext;

无歧义规则：每个输入仅匹配一次＋最长匹配。英文描述如下：

Lex patterns only match a given input characer or string once.
Lex executes th action for the longest possible match for the current input.

缺省main：

这里的例子中，定义的main()调用了yylex()。yylex()是lex定义的函数，缺省情况下，如果lex文件中没有定义main()函数，lex系统有一个缺省的main，也会去调用yylex()。

程序退出：

缺省情况下，yylex()只有处理了所有的输入的时候，才会退出。对于控制台输入，则要等到Ctrl+C。当然，用户也可以主动return，即在action中增加return语句。为此，可以增加如下一个规则作验证：

quit {printf("Program will exit normally.\n"); return 0;}

注意：这句话写到a-zA-Z]+的前面，否则 warning, rule cannot be matched。

拓展

可以修改下面两点，做对比分析：

[\t ]+  {printf("%s: white space\n", yytext);}
.|\n  {printf("%s: Invalid word\n", yytext);}

示例：——注意观察最后有一个换行符。

I am a student. You are a teacher. !@#$%^&*
I: is not  verb
 : white space
am: is a verb
 : white space
a: is not  verb
 : white space
student: is not  verb
.: Invalid word
 : white space
You: is not  verb
 : white space
are: is a verb
 : white space
a: is not  verb
 : white space
teacher: is not  verb
.: Invalid word
 : white space
!: Invalid word
@: Invalid word
#: Invalid word
$: Invalid word
%: Invalid word
^: Invalid word
&: Invalid word
*: Invalid word

: Invalid word

识别更多的单词

对应 ch1-03.l

可以识别出动词、副词、介词等等。——只需要增加对应的rules即可。

代码：

%{

/*
 * this sample demonstrates (very) simple recognition:
 * a verb, or not a verb.
 */

%}

%%

[\t ]+   {printf("%s: white space\n", yytext);}

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go     {printf("%s: is a verb\n", yytext);}

very | 
simple | 
gently | 
quietly | 
calmly | 
angrily  {printf("%s: is an adverb\n", yytext);}

to |
from |
behind |
above |
below | 
between {printf("%s: is a preposition\n", yytext);}

if | 
then | 
and | 
but | 
or {printf("%s: is a conjunction\n", yytext);}

their | 
my | 
your | 
his | 
her | 
its {printf("%s: is a adjective\n", yytext);}

I | 
you | 
he | 
she | 
we | 
they {printf("%s: is a pronoun\n", yytext);}

QUIT {
        printf("Program will exit normally.\n"); 
        return 0;
    }

[a-zA-Z]+ {printf("%s: don't recognize\n", yytext);}

.|\n  {printf("%s: Invalid word\n", yytext);}


%%

int main()
{
    yylex();

    return 0;
}

运行：

he is a student. and he is a teacher. QUIT (ENTER)
he: is a pronoun
 : white space
is: is a verb
 : white space
a: don't recognize
 : white space
student: don't recognize
.: Invalid word
 : white space
and: is a conjunction
 : white space
he: is a pronoun
 : white space
is: is a verb
 : white space
a: don't recognize
 : white space
teacher: don't recognize
.: Invalid word
 : white space
Program will exit normally.

动态定义单词表 lexer with symbol table

对应 ch1-03.l, 这个例子说明如何在lex中写更复杂的C代码。

前面的例子是把每个单词都定义在lex文件中，接下来对其优化。

比如，可以在文件中按照特定语法来定义单词的词性：

noun dog cat horse cow
verb chew eat lick

即每行开头一个单词用来定义词性，接下来的每个单词都属于该词性。如此，可以在文件中作这种定义。当然，具体到这里的示例代码，暂时不处理文件输入，而仍然从控制台输入。这时，就有两种输入：

定义：即首字母表示词性，接下来是一系列属于该词性的单词；
识别：同前一个例子，要求识别出每个单词的词性。

代码：

%{

#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Word recognizer with a symbol table.
 */

enum {
    LOOKUP = 0, /* default - looking rather than defining. */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state; // global variable, default to 0(LOOKUP).

bool add_word(int type, char *word);
int lookup_word(char *word);

%}

%%

[\t ]+ ; /* ignore whitespace */

\n {state = LOOKUP;} // end of line, return to default state.

    /* Whenever a line starts with a reserved part of speech name */
    /* start defining words of that type */
^verb {state = VERB;}
^adj {state = ADJ;}
^adv {state = ADV;}
^noun {state = NOUN;}
^prep {state = PREP;}
^pron {state = PRON;}
^conj {state = CONJ;}

    /* a normal word, define it or look it up */
[a-zA-Z]+ {
        if (state != LOOKUP) {
            /* define the current word */
            add_word(state, yytext);
        } else {
            switch(lookup_word(yytext)) {
                case VERB: printf("%s: verb\n", yytext); break;
                case ADJ:  printf("%s: adjective\n", yytext); break;
                case ADV:  printf("%s: adverb\n", yytext); break;
                case NOUN: printf("%s: noun\n", yytext); break;
                case PREP: printf("%s: preposition\n", yytext); break;
                case PRON: printf("%s: pronoun\n", yytext); break;
                case CONJ: printf("%s: conjunction\n", yytext); break;
                default:
                    printf("%s: don't recognize\n", yytext);
                    break;
            }
        }
    }

[,:.] {printf("%s: punctuation, ignored.\n", yytext);}
. {printf("%s: invalid char\n", yytext);}

%% 

int main()
{
    yylex();
    return 0;
}

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

bool add_word(int type, char *word)
{
    struct word *wp; // wp: word pointer 

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined.\n", word);
        return false;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word*)malloc(sizeof(struct word));
    wp->next = word_list;
    wp->word_name = (char*)malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;

    word_list = wp;

    return true;
}

int lookup_word(char *word)
{
    struct word *wp = word_list;

    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0) {
            return wp->word_type;
        }
    }

    return LOOKUP;
}

这里的枚举值有两个含义：

状态。缺省是LOOKUP状态，即对当前输入行的每个单词，在词库/链表中查找其词性(lookup_word)，然后打印出来。但如果每一行的第一个单词是noun/verb等保留字，则说明要进入defining状态(细分为VERB等状态)，保留字后续的各个单词将会添加到词库/链表中(add_word)。——在添加词库的时候，会先检查该单词是否已经入库。
类型：词库中，每个单词每个单词对应的词性用VERB等表示。

运行：

noun pet dog cat cats [ENTER]
verb is are [ENTER]
adj my his their [ENTER]
my pet is dog. their pets are cats. that's ok. [ENTER]
my: adjective
pet: noun
is: verb
dog: noun
.: punctuation, ignored.
their: adjective
pets: don't recognize
are: verb
cats: noun
.: punctuation, ignored.
that: don't recognize
': invalid char
s: don't recognize
ok: don't recognize
.: punctuation, ignored.
^C

yacc

前面的例子把一串字符串识别成了一个个单词，接下来就是识别句子。

词法分析：从输入字符流中识别出一个个单词，就是所谓的词法分析，输出是token。其关键就是定义词法规则(正则表达式)；
语法分析：在得到一个个单词（包括词性）之后，就是做更高级的分析，比如某些词连在一起是否构成了一个正确的句子。——各个token如何组合或搭配在一起。对于不同的token 组合执行不同的action。

sentences

现在分析，如何由前面得到的noun&pronoun&verb等构造出句子（示例）：

主语：(假定只能是)名词或代词，即 subject -> noun | pronoun
宾语：(假定只能是)名词，即 object -> noun
句子（主谓宾）：谓语只支持动词形式，即 sentence -> subject verb object.

这里的subject&object&sentence就是基于词法分析得到的noun&pronoun&verb等token而构造出来的新的symbol。

parser和lexer之间的通信

在yacc&lex系统中，词法分析（lex/lexer）和语法分析（yacc/parser）是相对独立的两套子系统。词法分析对应的(库)函数是yylex()，这个函数对输入的字符流做词法分析，然后生成一个个token。语法分析对应的函数是parser()，其输入是yylex()产生的token。所以，要把lex/lexer/yylex()的输出作为yacc/parser/parser()的输入。

yylex()的原型：

int yylex (void);

这里的关键就在于yylex()的返回值，其表示了当前识别的token的类别。当parser()需要一个token的时候，就调用yylex()，根据其返回值，就知道这个token的类别，从而做进一步的处理。

需要注意的是，并非lexer要给parser返回所有的token。比如，注释部分或空白符号就不需要传给parser，或者说parser对此不感兴趣。这种情况下，lexer直接丢弃即可。

既然yacc和lex基于token通信，自然就需达成一致的规定。这就是所谓的token codes，即每一类token规定一个token code。在yacc&lex系统中，是由yacc来定义token codes，然后lex的代码include进来。具体地，

在yacc中用%token NOUN VERB语法定义token codes
yacc -d test.y 会生成y.tab.c和y.tab.h两个文件，其中后者就包括了token codes的宏定义
在lex中include这个y.tab.h文件。

注：取值为0的token code表示结束输入(a logical end of input)。

示例

test.l

%{

/*
 * We now build a lexical analyzer to be used by a higher-level parser.
 */

#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#include "y.tab.h"

#define LOOKUP 0 /* default - looking rather than defining. */

int state; // global variable, default to 0(LOOKUP).

bool add_word(int type, char *word);
int lookup_word(char *word);
const char* get_word_type(int type);
%}

%%

[\t ]+ ; /* ignore whitespace */

\n {state = LOOKUP;} // end of line, return to default state.

\.\n {
        state = LOOKUP;
        return 0; // end of sentence.
    }

    /* Whenever a line starts with a reserved part of speech name */
    /* start defining words of that type */
^verb {state = VERB;}
^adj {state = ADJECTIVE;}
^adv {state = ADVERB;}
^noun {state = NOUN;}
^prep {state = PREPOSITION;}
^pron {state = PRONOUN;}
^conj {state = CONJUNCTION;}

    /* a normal word, define it or look it up */
[a-zA-Z]+ {
        if (state != LOOKUP) {
            /* define the current word */
            add_word(state, yytext);
        } else {
            int type = lookup_word(yytext);
            printf("%s: %s\n", yytext, get_word_type(type));

            switch(type) {
                case VERB:
                case ADJECTIVE:
                case ADVERB:
                case NOUN:
                case PRONOUN:
                case PREPOSITION:
                case CONJUNCTION:
                    return type;
                default:
                    //printf("%s: don't recognize\n", yytext);
                    break; // don't return, just ignore it.
            }
        }
    }

. {printf("%s: ----\n", yytext);} // ignore it

%% 

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

bool add_word(int type, char *word)
{
    struct word *wp; // wp: word pointer 

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined.\n", word);
        return false;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word*)malloc(sizeof(struct word));
    wp->next = word_list;
    wp->word_name = (char*)malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;

    word_list = wp;

    return true;
}

int lookup_word(char *word)
{
    struct word *wp = word_list;

    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0) {
            return wp->word_type;
        }
    }

    return LOOKUP;
}

const char* get_word_type(int type)
{
    switch(type) {
        case VERB: return "verb";
        case ADJECTIVE: return "adjective";
        case ADVERB: return "adverb";
        case NOUN: return "noun";
        case PREPOSITION: return "preposition";
        case PRONOUN: return "pronoun";
        case CONJUNCTION: return "conjunction";
        default: return "unknown";
    }
}

test.y

%{
/*
 * A lexer for the basic grammer to use for recognizing English sentence.
 */
#include <stdio.h>  

extern int yylex (void);
void yyerror(const char *s, ...);
%}

%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%
sentence: subject VERB object {printf("Sentence is valid.\n");}
      ;

subject: NOUN
      |  PRONOUN
      ;

object:  NOUN
      ;

%%

extern FILE *yyin;

int main()
{
    //while(!feof(yyin)) {
        yyparse();
    //}
}

void yyerror(const char *s, ...)
{
    fprintf(stderr, "%s\n", s);
}

y.tab.h

此文件自动生成，如下：

/* A Bison parser, made by GNU Bison 2.3.  */

/* Skeleton interface for Bison's Yacc-like parsers in C

   ...

   This special exception was added by the Free Software Foundation in
   version 2.2 of Bison.  */

/* Tokens.  */
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
   /* Put the tokens into the symbol table, so that GDB and other debuggers
      know about them.  */
   enum yytokentype {
     NOUN = 258,
     PRONOUN = 259,
     VERB = 260,
     ADVERB = 261,
     ADJECTIVE = 262,
     PREPOSITION = 263,
     CONJUNCTION = 264
   };
#endif
/* Tokens.  */
#define NOUN 258
#define PRONOUN 259
#define VERB 260
#define ADVERB 261
#define ADJECTIVE 262
#define PREPOSITION 263
#define CONJUNCTION 264

#if ! defined YYSTYPE && ! defined YYSTYPE_IS_DECLARED
typedef int YYSTYPE;
# define yystype YYSTYPE /* obsolescent; will be withdrawn */
# define YYSTYPE_IS_DECLARED 1
# define YYSTYPE_IS_TRIVIAL 1
#endif

extern YYSTYPE yylval;

运行

noun dogs
noun dog
verb is are
pron they it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog.
it: pronoun
syntax error

其他尝试

增加一些打印

sentence: subject verb object {printf("Sentence is valid.\n");}
      ;

subject: NOUN {printf("subject of a noun.\n");}
      |  PRONOUN {printf("subject of a pronoun.\n");}
      ;

verb: VERB {printf("verb.\n");}
      ;

object:  NOUN {printf("object of a noun.\n");}
      ;

运行：

noun dog
verb is
pron it
it is dog
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.

或者：

noun dog dogs
verb is are
pron it they
it is dog they are dogs.
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.
they: pronoun
syntax error

识别多个句子

extern FILE *yyin;

int main()
{
    while(!feof(stdin/*yyin*/)) {
        yyparse();
    }
}

运行：

$ ./test 
noun dog
verb is
pron it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.

it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.

noun dogs
verb are
pron they
they are dogs.
they: pronoun
are: verb
dogs: noun
Sentence is valid.

改成如下的代码会运行错误：

int main()
{
    //while(!feof(stdin/*yyin*/)) {
    for (;;) {
        yyparse();
    }
}

运行：

$ ./test 
noun dog
verb is
pron it
it is dog
it: pronoun
is: verb
dog: noun
Sentence is valid.

it is dog
it: pronoun
syntax error
is: verb
syntax error
dog: noun

从文件中读数据

要从文件中读取，需要使用全局变量yyin。如下这种方式无效：

//extern FILE *yyin;

int main()
{
    FILE* f = NULL;
    f = fopen("test.txt", "rb");
    if (NULL == f) {
        printf("Open file failed.\n");
        return 1;
    }

    printf("Open file successfully.\n");

    while(!feof(f)) {
        yyparse();
    }
}

注：在yy.lex.c中，使用的是yyin全局变量。该变量初始化为0（NULL）。如果用户没有更改yyin，会程序跑起来之后会自动设置为stdin。

正确代码：

extern FILE *yyin;

int main()
{
    yyin = fopen("test.txt", "rb");
    if (NULL == yyin) {
        printf("Open file failed.\n");
        return 1;
    }

    printf("Open file successfully.\n");

    while(!feof(yyin)) {
        yyparse();
    }
}

测试文件test.txt的内容：

noun dog dogs
verb is are
pron it they

it is dog.
they are dogs.

运行：

$ ./test 
Open file successfully.
it: pronoun
is: verb
dog: noun
Sentence is valid.
they: pronoun
syntax error
$

简单语句和复合语句

对应 ch1-06.y

代码

%{
/*
 * A lexer for the basic grammer to use for recognizing English sentence.
 */
#include <stdio.h>  

extern int yylex (void);
void yyerror(const char *s, ...);
%}

%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%

sentence: simple_sentence { printf("Parsed a simple sentence.\n"); }
        | compound_sentence { printf("Parsed a compound sentence.\n"); }
        ;

simple_sentence: subject verb object {printf("simple sentence of type 1.\n");}
        |        subject verb object prep_phrase {printf("simple sentence of type 2.\n");}
        ;

compound_sentence: simple_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 1.\n");}
        |          compound_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 2.\n");}
        ;

subject: NOUN
      |  PRONOUN
      |  ADJECTIVE subject
      ;

verb:    VERB
      |  ADVERB VERB 
      |  verb VERB
      ;

object:  NOUN
      |  ADJECTIVE object
      ;

prep_phrase:  PREPOSITION NOUN
      ;

%%

extern FILE *yyin;

int main()
{
    yyin = fopen("test.txt", "rb");
    if (NULL == yyin) {
        printf("Open file failed.\n");
        return 1;
    }

    printf("Open file successfully.\n");

    while(!feof(yyin)) {
        yyparse();
    }

    fclose(yyin);
    yyin = NULL;

    return 0;
}

void yyerror(const char *s, ...)
{
    fprintf(stderr, "%s\n", s);
}

测试文件

noun dog dogs China
verb is are
pron it they
adj pretty
conj and
prep in

it is a pretty dog and they are dogs in China and it is dog.

运行结果

Open file successfully.
it: pronoun
is: verb
a: unknown
pretty: adjective
dog: noun
and: conjunction
simple sentence of type 1.
they: pronoun
are: verb
dogs: noun
in: preposition
China: noun
simple sentence of type 2.
compound sentence of type 1.
and: conjunction
it: pronoun
is: verb
dog: noun
.: ----
simple sentence of type 1.
compound sentence of type 2.
Parsed a compound sentence.

yacc&amp;lex-Chapter1参考资料Lexyacc