天天看点

parsley_Parsley:C#中的递归下降解析器生成器

parsley

parsley_Parsley:C#中的递归下降解析器生成器

介绍 (Introduction)

There are plenty of parser generators out there for .NET, but no language agnostic parser with syntax directed actions (code in the grammar), syntactic predicates, and custom parse routines. Parsley is different. Parsley allows you to write your code in the grammar in a C# subset that is then converted to the target parser's generated language, so if someone wants to use your parser source code in VB they can, even though your code was C#. There are also very few recursive descent parser generators out there, most preferring the table generated approach. The advantage of recursive descent is it's far more intuitive than the table driven approach and the parse routines are completely overridable - plus the code is actually readable. This generator works somewhat like Coco/R in than it is internally LL(1) but allows you to use code to parse more complex constructs than LL(1) allows for. I haven't really used Coco/R but I think it's table driven, unlike Parsley and Parsley has more code features as a result, such as virtual non-terminals for completely overriding a parse. I can't be certain Coco/R doesn't do these things, but I'm fairly sure. If I'm right, Parsley can potentially parse more grammars in more ways than Coco/R.

.NET有很多解析器生成器,但是没有与语言无关的解析器,它具有语法指导的动作(语法中的代码),语法谓词和自定义解析例程。 欧芹是不同的。 Parsley允许您在C#子集中以语法编写代码,然后将其转换为目标解析器的生成语言,因此,即使有人想要在VB中使用解析器源代码,即使您的代码是C#,也可以。 那里也很少有递归下降解析器生成器,最喜欢表生成的方法。 递归下降的优点是它比表驱动方法直观得多,并且解析例程是完全可重写的-而且代码实际上是可读的。 该生成器的工作方式与内部的LL(1)有点类似Coco / R ,但允许您使用代码来解析比LL(1)允许的更复杂的构造。 我没有真正使用过Coco / R,但我认为它是表驱动的,这与Parsley不同,而Parsley具有更多的代码功能,例如完全覆盖解析的虚拟非终端。 我不能肯定Coco / R不会做这些事情,但是我很确定。 如果我是对的,那么与Coco / R相比, Parsley可以以更多方式解析更多语法。

Update: Improved output code generation, added features. See below.

更新:改进了输出代码生成,增加了功能。 见下文。

Update 2: Bugfix in grammars, rewrite of part of article, feature adds 

更新2 :语法中的错误修复,重写文章的一部分,功能添加

Update 3: Added GPLEXv1 support!

更新3 :添加了GPLEXv1支持!

Update 4: Added experimental backtracking support

更新4 :添加了实验性回溯支持

Update 5: Added syntactic predicates, and virtual and abstract non-terminals, and /fast option

更新5 :添加了语法谓词,虚拟和抽象非终结符以及/ fast选项

Update 6: Added @options directive and the ParsleyDevstudio Visual Studio Integration Package for Parsley, Rolex and Gplex

更新6:添加了@options指令和用于Parsley,Rolex和Gplex的ParsleyDevstudio Visual Studio集成包

版权/许可声明: (Copyright / License Notice: )

While most of the code here is mine, GPLEXv1 is not my project, and I have yet to fork it per se (I haven't modified the source yet) but I probably will as it hasn't been maintained. It's also very difficult to build the first time because you need the gplex and gppg binaries to build it the first time around, and those binaries can be dicey to find. I also plan to use it for Slang because it supports Unicode, while Rolex does not. I've included it, in ready to build form simply to make life easier on me and you.

尽管这里的大多数代码是我的,但GPLEXv1并不是我的项目,我本身尚未对其进行分叉(我尚未修改源代码),但由于可能尚未维护,我可能会保留。 这也是非常困难的,因为你需要的gplex和GPPG二进制围绕打造它的第一次打造的第一次,这些二进制文件可以冒险找到。 我还计划将其用于Slang,因为它支持Unicode,而Rolex不支持。 我已将其包括在内,以易于构建的形式提供,旨在使我和您的生活更加轻松。

Gardens Point LEX Copyright © 2006-2011 Queensland University of Technology (QUT). All rights reserved.

Gardens Point LEX版权所有©2006-2011昆士兰科技大学(QUT)。 版权所有。

The full license is included in GPLEXcopyright.rtf in the Gplex folder.

完整的许可包含在Gplex文件夹GPLEXcopyright.rtf。

Background

背景

Parser generators build parsers which break down text into a meaningful hierarchy of symbols which can be easily processed programmatically. Compilers user parsers to parse your code. JSON libraries use parsers to parse JSON, and Excel uses a parser to decode the formulas in your cells.

解析器生成器构建解析器,该解析器将文本分解为有意义的符号层次结构,可以通过编程轻松地对其进行处理。 编译器用户解析器可以解析您的代码。 JSON库使用解析器来解析JSON,而Excel使用解析器来解码单元格中的公式。

Parsers come in a variety of algorithms, from simple LL(1), to the almost incomprehensible GLR parsing algorithm. Parsely is LL(1), which means it processes grammars from top to bottom creating a left derivation tree and one symbol of lookahead. The syntax, rather than the input directs the parse. This is the second most intuitive way to parse next to LL(k/*), which allows for arbitrary lookahead.

解析器有多种算法,从简单的LL(1)到几乎难以理解的GLR解析算法。 解析为LL(1),这意味着它从上到下处理语法,从而创建左派生树和一个前瞻符号。 语法而不是输入指导语法分析。 这是在LL(k / *)之后进行解析的第二种最直观的方法,它允许任意提前。

使用这个混乱 (Using this Mess)

建立这个混乱 (Building this Mess)

This code is provided as part of my larger Build Tools project. I've stripped the binaries out of the distribution, and yet they are used to build the distribution. To bootstrap it, you'll have to build a couple of times in Release. You'll get locking errors the first time. That's okay. The second time, you may still have a warning in your Error Panel, but it's stale. The build succeeded with no warnings.

此代码是我的大型构建工具项目的一部分。 我已经从发行版中删除了二进制文件,但是它们仍用于构建发行版。 要引导它,您必须在Release中构建几次。 第一次会出现锁定错误。 没关系。 第二次,您的“错误面板”中可能仍然有警告,但这是陈旧的。 构建成功,没有任何警告。

安装这个混乱 (Installing this Mess)

All of the tools included with this distro are useful in their own right. You may want to copy deslang, csbrick, rolex and parsley to your

PATH

somewhere. This makes it easier to use them in pre-build steps because you won't have to fully qualify the EXE path. You can run the ParsleyDevstudio.vsix installer to install the Visual Studio integration, though it is inferior to the standard way of using this as a pre-build step.

该发行版中包含的所有工具本身都是有用的。 您可能要deslang,csbrick, 劳力士和香菜复制到您的

PATH

地方。 这使在预构建步骤中更容易使用它们,因为您不必完全限定EXE路径。 您可以运行ParsleyDevstudio.vsix安装程序来安装Visual Studio集成,尽管它不如将其用作预构建步骤的标准方法。

运行这个混乱 (Running this Mess)

Parsley typically operates as a command line tool. It takes an input file, an optional output file and various parameters dictating the generated code options. Here is the usage screen:

Parsley通常作为命令行工具运行。 它需要一个输入文件,一个可选的输出文件以及指示生成的代码选项的各种参数。 这是用法屏幕:

Parsley generates a recursive descent parser and optional lexer spec

parsley.exe <inputfile> [/output <outputfile>] [/rolex <rolexfile>]
        [/gplex <gplexfile>] [/gplexclass <gplexcodeclass>]
        [/namespace <codenamespace>] [/class <codeclass>]
        [/langage <codelanguage>] [/fast]
        [/noshared] [/verbose] [/ifstale]

        <inputfile>             The XBNF input file to use.
        <outputfile>            The output file to use - default stdout.
        <rolexfile>             Output a Rolex lexer specification to the specified file
        <gplexfile>             Generate Gplex lexer specification to the specified file file. Will also generate supporting C# files (C# only)
        <gplexcodeclass>        Generate Gplex lexer specification with the specified class name. - default takes the name from <gplexfile>
        <codenamespace>         Generate code under the specified namespace - default none
        <codeclass>             Generate code with the specified class name - default derived from <outputfile> or the grammar.
        <codelanguage>          Generate code in the specified language - default derived from <outputfile> or C#.
        <fast>                  Generate code quickly, without resolution using C# only - not valid with the <codelanguage> option.
        <noshared>              Do not include shared library prerequisites
        <verbose>               Output all messages from the generation process
        <ifstale>               Do not generate unless <outputfile> or <rolexfile> is older than <inputfile>.

Any other switch displays this screen and exits.           

(sorry about the word wrapping!)

(对不起,自动换行!)

  • inputfile

    is the XBNF specification (explored below) that the parser will be generated from

    inputfile

    是解析器将根据其生成的XBNF规范(在下面进行探索)
  • outputfile

    is optional and specifies the output code file to generate. If it's not specified, the code will be dumped to the

    stdout

    .

    outputfile

    是可选的,它指定要生成的输出代码文件。 如果未指定,则代码将转储到

    stdout

  • rolexfile

    is optional and specifies the rolex lexer specification file to generate. If it's not specified, then no rolex specification will be generated.

    rolexfile

    是可选的,并且指定了劳力士词法分析器规范文件来生成。 如果未指定,则不会生成Rolex规范。
  • gplexfile

     is optional and specifies the gplex lexer specification file to generate. If it's not specified, it's not generated. Note that if it is, several files will be generated. This is necessary to support gplex.

    gplexfile

    是可选的,它指定要生成的gplex lexer规范文件。 如果未指定,则不会生成。 请注意,如果是,将生成几个文件。 这对于支持gplex是必需的。
  • gplexcodeclass

     is optional and specifies the base class names to use for the generated classes. "

    Expression

    " would create

    ExpressionTokenizer

    ,

    ExpressionTokenizerEnumerator

    and

    ExpressionScanner

    .

    gplexcodeclass

    是可选的,它指定要用于生成的类的基类名称。 “

    Expression

    ”将创建

    ExpressionTokenizer

    ExpressionTokenizerEnumerator

    ExpressionScanner

  • codenamespace

    is the namespace under which to generate the code. If it's not specified, the code won't be generated under a namespace.

    codenamespace

    是在其下生成代码的名称空间。 如果未指定,则不会在名称空间下生成代码。
  • codeclass

     is the name of the parser class to use. If you don't specify one, it will be the same as the

    outputfile

    , unless that's not specified, in which case it will take the name from the start element in the XBNF grammar.

    codeclass

    是要使用的解析器类的名称。 如果您未指定一个,它将与

    outputfile

    相同,除非未指定,否则它将使用XBNF语法中start元素的名称。
  • codelanguage

    is the language of the code to generate. If you don't specify one, it will be based on

    outputfile

    's file extension. If

    outputfile

    is not specified, it will default to C#.

    codelanguage

    是要生成的代码的语言。 如果您不指定一个,它将基于

    outputfile

    的文件扩展名。 如果未指定

    outputfile

    ,它将默认为C#。
  • fast

     indicates that the code generation should skip the resolution step and simply emit in C#. This is not valid when specifying a

    codelanguage

    .

    fast

    表示代码生成应跳过解析步骤并仅使用C#发出。 在指定

    codelanguage

    codelanguage

  • noshared

    skips generating the shared code. This is important to specify for generating a second parser in the same project. The first parser should be generated without this switch, and subsequent parsers should be generated with this switch. If noshared is specified here, it should also be specified to rolex when rolex is used to generate the lexer/tokenizer code. You don't want duplicates of shared code in your source, because it will cause compile errors. This option also applies to Gplex generation.

    noshared

    跳过生成共享代码。 指定在同一项目中生成第二个解析器很重要。 第一个解析器应在没有此开关的情况下生成,而后续的解析器应在此开关的情况下生成。 如果在此指定noshared,则当使用Rolex生成词法分析器/令牌生成器代码时,也应将其指定给Rolex 。 您不希望源中的共享代码重复,因为这会导致编译错误。 此选项也适用于Gplex生成。
  • verbose

     emits all grammar transformation and validation messages, including messages about refactored rules. This can create huge dumps in the case of large grammars with lots of refactoring, but it allows you to see what changes it made. You can also see the changes in the doc comments of the generated code, regardless.

    verbose

    发出所有语法转换和验证消息,包括有关重构规则的消息。 在具有大量重构的大型语法的情况下,这可能会产生巨大的转储,但它使您可以看到它进行了哪些更改。 无论如何,您也可以在生成的代码的文档注释中看到更改。
  • ifstale

    skips generating the output unless the input has been modified. This is useful for pre-build steps as it prevents the overhead of rebuilding the parser every time.

    除非输入已修改,否则

    ifstale

    跳过生成输出。 这对于预构建步骤很有用,因为它避免了每次重新构建解析器的开销。

You'll probably want to use this tool in tandem with one of the tokenizers included in the solution. Rolex, is the easiest to use, but Gplex supports Unicode and is basically more capable. These tools are lexer/tokenizer generators. Almost all parsers (excepting PEG parsers particularly) require lexers/tokenizers and Parsley is no exception. For further reading about Rolex, I wrote an article about creating Rolex here though you should use this codebase since it's much newer and uses Deslanged Slang technology to dramatically ease maintenance of the generated code while maintaining performance. For further reading about Gplex I'd just look at lex and flex resources as there's little to no documentation on Gplex itself but it's similar to those tools. Fortunately, I've included in Parsley the generation of a mostly identifical interface to that of Rolex lexers so coding against them is basically the same. The constructors are a little different is all. Eventually I'll unify them but it's not trivial because of the way Gplex works. If you use Gplex, it's often desirable to use the

fast

 option with Parsley since the tokenizer is C# anyway.

您可能希望与解决方案中包含的一个分词器一起使用此工具。 Rolex是最容易使用的,但是Gplex支持Unicode,并且基本上具有更强大的功能。 这些工具是词法分析器/令牌生成器。 几乎所有解析器(尤其是PEG解析器除外)都需要词法分析器/令牌生成器,并且Parsley也不例外。 为了进一步了解Rolex ,我在这里写了一篇有关创建Rolex的文章,尽管您应该使用此代码库,因为它较新,并且使用Deslanged Slang技术大大简化了对生成代码的维护,同时保持了性能。 为了进一步了解Gplex,我只看lex和flex资源,因为关于Gplex本身的文档很少甚至没有,但是与这些工具相似。 幸运的是,我在Parsley中包括了与Rolex lexers几乎相同的接口的生成,因此针对它们的编码基本上是相同的。 所有的构造函数都有些不同。 最终,我将它们统一起来,但是由于Gplex的工作方式,这并不简单。 如果您使用Gplex ,则通常需要在Parsley中使用

fast

选项,因为令牌生成器无论如何都是C#。

编码此混乱 (Coding this Mess)

Coming back to Parsley, let's examine the XBNF grammar format, which I use for all my parsers:

回到Parsley,让我们检查一下我用于所有解析器的XBNF语法格式:

The XBNF format is designed to be easy to learn if you know a little about composing grammars.

如果您对语法的了解不多,则XBNF格式旨在易于学习。

The productions take the form of:

生产形式为:

There are more advanced variations of the above production syntax we'll approach later.

我们稍后将介绍上述生产语法的更多高级变体。

So for example, here's a simple standards compliant JSON grammar:

因此,例如,这是一个简单的符合标准的JSON语法:

// based on spec @ json.org
Json<start>= Object | Array;
Object= "{" [ Field { "," Field } ] "}";
Field= string ":" Value;
Array= "[" [ Value { "," Value } ] "]";
Value<collapsed>= 
    string    |
    number    |
    Object    |
    Array     |
    Boolean   |
    null      ;
Boolean= true|false;
number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
string= '"([^\n"\\]|\\([btrnf"\\/]|(u[0-9A-Fa-f]{4})))*"';
true= "true";
false= "false";
null= "null";
lbracket<collapsed>= "[";
rbracket<collapsed>= "]";
lbrace<collapsed>= "{";
rbrace<collapsed>= "}";
colon<collapsed>= ":";
comma<collapsed>= ",";
whitespace<hidden>= '[\n\r\t ]+';           

The first thing to note is the

Json

production is marked with a

start

attribute. Since the value was not specified it is implicitly,

start=true

.

首先要注意的是,

Json

生产带有

start

属性。 由于未指定值,因此它是隐式的,

start=true

That tells the parser that

Json

is the start production. If it is not specified, the first non-terminal in the grammar will be used. Furthermore, this can cause a warning during generation since it's not a great idea to leave it implicit. Only the first occurrence of

start

will be honored.

这告诉解析器

Json

是开始生产的产品。 如果未指定,则将使用语法中的第一个非结尾。 此外,这可能会在生成过程中引起警告,因为将其隐式保留不是一个好主意。 只有第一次出现的

start

会受到尊重。

Object | Array

tells us the

Json

production is derived as an object or array. The

Object

production contains a repeat

{}

construct inside and optional

[]

construct, itself containing a reference to

Field

.

Array

 is similar, except it uses "[" and "]" and it refers to

Value

instead of

Field

.

Object | Array

Object | Array

告诉我们

Json

产生是作为对象或数组派生的。

Object

产生内部包含重复

{}

构造和可选

[]

构造,其本身包含对

Field

的引用。

Array

与之类似,除了它使用“ [”和“]”并且引用

Value

而不是

Field

表达方式 (Expressions)

  • ( )

    parentheses allow you to create subexpressions like

    Foo (Bar|Baz)

    ( )

    括号允许您创建子表达式,例如

    Foo (Bar|Baz)

  • [ ]

    optional expressions allow the subexpression to occur zero or once

    [ ]

    可选表达式允许子表达式出现零或一次
  • { }

    this repeat construct repeats a subexpression zero or more times

    { }

    此重复构造将子表达式重复零次或更多次
  • { }+

    this repeat construct repeats a subexpression one or more times

    { }+

    此重复构造重复一个或多次子表达式
  • |

    this alternation construct derives any one of the subexpressions

    |

    此替代构造派生任何一个子表达式
  • Concatenation is implicit, separated by whitespace

    串联是隐式的,由空格分隔

码头 (Terminals)

The terminals are all defined at the bottom but they can be anywhere in the document. XBNF considers any production that does not reference another production to be a terminal. You can force an element to be terminal with the

terminal

attribute (see below).

端子都在底部定义,但可以在文档中的任何位置。 XBNF将没有引用其他产品的任何产品视为终端。 您可以使用

terminal

属性强制元素成为终端(请参见下文)。

Regular expressions are between

'

single quotes and literal expressions are between

"

double quotes. You may declare a terminal by using XBNF constructs or by using regular expressions. The regular expressions follow a POSIX + std extensions paradigm but don't currently support all of POSIX. They support most of it. If a POSIX expression doesn't work, consider it a bug. Unicode will be supported in a future version. I've already started support for constructs like \p and \P but the regex engine needs further optimization before Rolex can create that lexer in a reasonable amount of time. The problem is Unicode's huge range of things like letters and numbers is choking my current implementation which is about 80% optimized for ranges. The other 20% is killing performance in this case, but optimizing it is non-trivial.

正则表达式之间的

'

单引号和文字表达是之间

"

双引号。您可以通过使用XBNF结构或使用正则表达式声明端,正则表达式遵循POSIX + STD扩展模式,但目前不支持所有的POSIX的他们支持大多数。如果POSIX表达式不起作用,认为它是一个错误。将来的版本将支持Unicode。我已经开始支持\ p和\ P这样的结构,但是正则表达式引擎需要进一步支持劳力士(Rolex)可以在合理的时间内创建该词法分析器之前进行优化,但问题是Unicode的大范围内容(例如字母和数字)使我当前的实现感到窒息,该实现的范围优化约为80%,在这种情况下,另外20%的性能会被破坏,但对其进行优化并非易事。

属性 (Attributes)

The

collapsed

element tells Parsley that this node should not appear in the parse tree. Instead, its children will be propagated to its parent. This is helpful if the grammar needs a nonterminal or a terminal in order to resolve a construct, but it's not useful to the consumer of the parse tree. During LL(1) factoring, generated rules must be made, and their associated non-terminals are typically collapsed. Above, we've used it to significantly trim the parse tree of nodes we won't need including collapsing unnecessary terminals like

:

in the JSON grammar. This is because they don't help us define anything - they just help the parser recognize the input, so we can throw them out to make the parse tree smaller.

collapsed

元素告诉Parsley该节点不应出现在解析树中。 相反,其子级将传播给其父级。 如果语法需要非终结符或终结符来解析构造,这将很有用,但对于解析树的使用者而言则无用。 在LL(1)分解期间,必须制定生成的规则,并且通常将其关联的非终端折叠。 上面,我们已经用它来显著修剪我们将不再需要包括折叠不必要的终端,如节点的分析树

:

在JSON语法。 这是因为它们没有帮助我们定义任何内容-它们只是帮助解析器识别输入,因此我们可以将它们丢弃以使解析树更小。

The

hidden

element tells Parsley that this terminal should be skipped. This is useful for things like comments and whitespace.

hidden

元素告诉Parsley该终端应被跳过。 这对于诸如注释和空格之类的东西很有用。

The

blockEnd

attribute is intended for terminals who have a multi character ending condition like C block comments, XML CDATA sections, and SGML/XML/HTML comments. If present, the lexer will continue until the literal specified as the

blockEnd

is matched.

blockEnd

属性适用于具有多字符结束条件的终端,例如C块注释,XML CDATA节和SGML / XML / HTML注释。 如果存在,则词法分析器将继续执行,直到匹配指定为

blockEnd

的文字为止。

The

terminal

attribute declares a production to be explicitly terminal. Such a production is considered terminal even if it references other productions. If it does, those other productions which will be included in their terminal form as though they were part of the original expression. This allows you to create composite terminals out of several terminal definitions.

terminal

属性将生产声明为显式终端。 即使此类产品引用了其他产品,也被视为终端产品。 如果是的话,那些其他作品将包含在其最终形式中,就像它们是原始表达的一部分一样。 这使您可以从多个端子定义中创建复合端子。

The

ignoreCase

attribute specifies that case shouldn't matter for matching. This only applies to terminals and only works with Rolex.

ignoreCase

属性指定大小写匹配无关紧要。 这仅适用于终端,并且仅适用于Rolex。

The

type

attribute specifies a .NET type or intrinsic C# type of the associated code block (see below). Values returned from such "typed" non-terminals are automatically converted to the target type, obviating the need to cast return values, or convert them using

int.Parse()

 or the like. It will be handled for you.

type

属性指定关联代码块的.NET类型或固有C#类型(请参见下文)。 从此类“类型化”非终端返回的值将自动转换为目标类型,从而无需转换返回值或使用

int.Parse()

或类似方法将其转换。 它将为您处理。

The

dependency

 attribute marks a non-terminal production as needed by the parser code you write.

dependency

属性标记您编写的解析器代码所需的非终端生产。

The

firsts

 and 

follows

  attributes are space delimited lists of terminal and non terminal symbols that hint to the parser what comes first and what can follow a particular non-terminal production. These are typically used in tandem with virtual productions (see the section toward the end)

firsts

follows

属性是空格分隔终端和非终端符号列出了提示什么是第一位的解析器,哪些可以遵循特定的非终端的生产。 这些通常与虚拟制作一起使用(请参阅最后一节)

The

priority

 attribute marks a terminal production with a particular priority. A negative number is lower in priority than a positive number. Must be integer. By default items are

priority

0.

priority

属性标记具有特定优先级的终端生产。 负数的优先级低于正数。 必须为整数。 默认情况下,项目的

priority

0。

评估代码块 (Evaluation Code Blocks)

Evaluation code blocks are specified in the grammar using a lambda/anonymous method-like syntax.

=> { ... }

at the end of a non-terminal production. They take the form of:

评估代码块在语法中使用lambda /类似匿名方法的语法指定。

=> { ... }

在非终端生产结束时。 它们的形式为:

Like:

喜欢:

This signifies a block of code to associate with that non-terminal. This code is for parse actions, triggered by the

EvaluateXXXX()

methods. In this code, you take a

ParseNode

object, indicated by

node

and evaluate it, returning the result. If you do not return a result, a default value will be returned. See the example below.

这表示与该非终端关联的代码块。 该代码用于解析由

EvaluateXXXX()

方法触发的动作。 在此代码中,您将获取一个由

node

表示的

ParseNode

对象并对其进行评估,并返回结果。 如果不返回结果,将返回默认值。 请参见下面的示例。

On grammar notation conventions: I use title case for the non-terminals in the grammar. This isn't necessary, but is better in the end because functions like

ParseXXXX()

and

EvaluateXXXX()

take the

XXXX

portion directly from the non-terminal name. Ergo, the non-terminal

foo

would end up generating

Parsefoo()

and

Evaluatefoo()

. This is obviously undesirable but it's not a show stopper. I thought about mangling the name to "correct" this, but in the end I decided not to make unexpected changes to the code. This way, it was straightforward to know what the eventual names of these methods will be. The symbol constants are generated directly from the symbol names as well, so if you wanted to be consistent with Microsoft's naming guidelines, you'd name your terminals with Pascal case/.NET case as well. However, this is far from important, especially since unlike with non-terminals, their names aren't appended to anything.

关于语法符号约定 :我对语法中的非终结符使用标题大小写。 这不是必需的,但最终会更好,因为

ParseXXXX()

EvaluateXXXX()

类的函数直接从非终端名称中获取

XXXX

部分。 因此,非终端

foo

最终将生成

Parsefoo()

Evaluatefoo()

。 这显然是不可取的,但这不是表演的制胜法宝。 我考虑过修改名称以“纠正”此问题,但最终我决定不对代码进行意外更改。 这样,很容易知道这些方法的最终名称是什么。 符号常量也直接从符号名称生成,因此,如果要与Microsoft的命名准则保持一致,则也可以使用Pascal case / .NET case命名终端。 但是,这远非重要,特别是因为与非终端不同,它们的名称不会附加到任何内容上。

Let's take a look at the grammar for an expression parser:

让我们看一下表达式解析器的语法:

Term<start>= Factor { ("+"|"-") Factor };
Factor= Unary { ("*"|"/") Unary };
Unary= ("+"|"-") Unary | Leaf;
Leaf= integer | identifier | "(" Term ")";
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// we don't have to explicitly declare anything but whitespace
// we can define it all inline above but this is nicer
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Now here's how the

_ParseXXXX()

methods that get generated work:

现在,这是获取生成的

_ParseXXXX()

方法的工作方式:

private static ParseNode _ParseLeaf(ParserContext context) {
    int line = context.Line;
    int column = context.Column;
    long position = context.Position;
    if ((ExpressionParser.integer == context.SymbolId)) {
        // Leaf -> integer
        ParseNode[] children = new ParseNode[1];
        children[0] = new ParseNode(ExpressionParser.integer, "integer", context.Value, line, column, position);
        context.Advance();
        return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
    }
    if ((ExpressionParser.identifier == context.SymbolId)) {
        // Leaf -> identifier
        ParseNode[] children = new ParseNode[1];
        children[0] = new ParseNode(ExpressionParser.identifier, "identifier", context.Value, line, column, position);
        context.Advance();
        return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
    }
    if ((ExpressionParser.Implicit3 == context.SymbolId)) {
        // Leaf -> Implicit3 Term Implicit4
        ParseNode[] children = new ParseNode[3];
        children[0] = new ParseNode(ExpressionParser.Implicit3, "Implicit3", context.Value, line, column, position);
        context.Advance();
        children[1] = ExpressionParser._ParseTerm(context);
        children[2] = new ParseNode(ExpressionParser.Implicit4, "Implicit4", context.Value, line, column, position);
        if ((ExpressionParser.Implicit4 == context.SymbolId)) {
            context.Advance();
            return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
        }
        context.Error("Expecting Implicit4");
    }
    context.Error("Expecting integer, identifier, or Implicit3");
    return null;
}           

You can see it resolving based on

context

's current symbol, and then parsing the child nodes, one by one. Terminals are parsed inline, while non-terminals are forwarded to their appropriate

_ParseXXXX()

 methods. Non-terminals with collapsed children generate a little differently, using

List<ParseNode>

instead of

ParseNode[]

to hold the children. This is because at the point when we're propagating children from the collapsed nodes, we can't know how many nodes there will be, unlike when the nodes are not collapsed. This is okay, as it only generates this alternative when necessary, and it only requires one additional copy. You'll note the appearance of

Implicit3

and

Implicit4

. Those are the names of the tokens we have not defined yet. If you want better error messages, give them a name in the grammar.

您可以看到它基于

context

的当前符号进行解析,然后一个一个地解析子节点。 终端被内联解析,而非终端被转发到它们适当的

_ParseXXXX()

方法。 具有折叠子级的非终端生成的方式略有不同,使用

List<ParseNode>

而不是

ParseNode[]

来容纳子级。 这是因为在从折叠节点传播子节点时,我们不知道会有多少个节点,这与未折叠节点不同。 可以,因为它仅在必要时生成此替代方法,并且仅需要一个附加副本。 您会注意到

Implicit3

Implicit4

的外观。 这些是我们尚未定义的令牌的名称。 如果您想要更好的错误消息,请在语法中给它们命名。

This is fine for parsing but will do nothing for evaluation. That is to say, the above will add a

Parse()

 method to the parser class, but there's not enough information in the grammar to create an

Evaluate()

method. Basically, we can get a parse tree back with the above, but we can't evaluate it. Lets fix that by adding some code to the grammar:

这对于解析是很好的,但不会做任何评估。 也就是说,以上代码将向Parser类添加一个

Parse()

方法,但是语法中没有足够的信息来创建

Evaluate()

方法。 基本上,我们可以通过上面的方法获得一个解析树,但是我们无法对其进行评估。 让我们通过在语法中添加一些代码来解决此问题:

Term<start>= Factor { ("+"|"-") Factor } => {
    int result = (int)EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += (int)EvaluateFactor(node.Children[i]);
        else // sub
            result -= (int)EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor= Unary { ("*"|"/") Unary } => { 
    int result = (int)EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateUnary(node.Children[i]);
            else // div
                result /= (int)EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateFactor(node.Children[i]);
            else // div
                result /= (int)EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -(int)EvaluateUnary(node.Children[1]);
}
Leaf= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[1].SymbolId==integer)
            return int.Parse(node.Children[0].Value);
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note the code blocks indicated by

=> { ... }

 which tell our parser what to do with the parse tree we made.

注意代码块指示由

=> { ... }

这告诉我们的解析器如何处理我们做出了解析树。

This indicates an integer expression parser. Because there is code in the grammar, it creates a

public

ExpressionParser.Evaluate()

method that can be used to call it, and thus evaluate a simple integer expression like

4+2*8

.

这表示整数表达式解析器。 因为语法中有代码,所以它会创建一个

public

ExpressionParser.Evaluate()

方法

ExpressionParser.Evaluate()

可用于调用它),从而求值一个简单的整数表达式,例如

4+2*8

Here is the code it generates for

EvaluateLeaf()

. You can see the code looks a lot like the associated code in the grammar above, with minor changes.

这是它为

EvaluateLeaf()

生成的代码。 您可以看到该代码看起来与上面的语法中的关联代码非常相似,但有少量更改。

public static object EvaluateLeaf(ParseNode node, object state) {
    if ((ExpressionParser.Leaf == node.SymbolId)) {
        if ((node.Children.Length == 1)) {
            if ((node.Children[1].SymbolId == ParsleyDemo.ExpressionParser.integer)) {
                return int.Parse(node.Children[0].Value);
            }
            else {
                throw new NotImplementedException("Variables are not implemented.");
            }
        }
        else {
            return ParsleyDemo.ExpressionParser.EvaluateTerm(node.Children[1]);
        }
    }
    throw new SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position);
}           

Here's that same code in VB.NET.

这是VB.NET中的相同代码。

Public Overloads Shared Function EvaluateLeaf(ByVal node As ParseNode, ByVal state As Object) As Object
    If (ExpressionParser.Leaf = node.SymbolId) Then
        If (node.Children.Length = 1) Then
            If (node.Children(1).SymbolId = ParsleyDemo.ExpressionParser.[integer]) Then
                Return Integer.Parse(node.Children(0).Value)
            Else
                Throw New NotImplementedException("Variables are not implemented.")
            End If
        Else
            Return ParsleyDemo.ExpressionParser.EvaluateTerm(node.Children(1))
        End If
    End If
    Throw New SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position)
End Function           

As you can see, the code block in the grammar was translated to the target language. This is Slang. Slang is still experimental, but it works well enough for doing simple evaluation inside code blocks. There's one problem here.

EvaluateUnary()

 (and

Evaluate()

 and all of the

EvaluateXXXX()

 methods return

object

! This is an integer expression parser so returning

int

 is far more appropriate. Fortunately, we can tell the parser what type to return using the

type

 attribute. Here we add the

type

attributes to the elements we've changed the grammar, which I've put in bold:

如您所见,语法中的代码块已翻译为目标语言。 这是S语 。 Slang仍处于试验阶段,但足以在代码块内进行简单评估。 这里有一个问题。

EvaluateUnary()

(和

Evaluate()

以及所有

EvaluateXXXX()

方法都返回

object

!这是一个整数表达式解析器,因此返回

int

更合适。幸运的是,我们可以使用

type

属性告诉解析器返回什么类型。我们将

type

属性添加到我们更改了语法的元素中,我将其以粗体显示:

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i]);
        else // sub
            result -= EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i]);
            else // div
                result /= EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i]);
            else // div
                result /= EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -EvaluateUnary(node.Children[1]);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[1].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note that not only do we now have the attribute 

type="int"

 marked up on each non-terminal production, we also have changed the code (particularly in

Leaf

) slightly, from:

请注意,我们现在不仅在每个非终端生产中都标记了属性

type="int"

,而且还从以下方面对代码(特别是在

Leaf

)进行了少许更改:

if(node.Children.Length==1) // integer | identifer
{
    if(node.Children[1].SymbolId==integer)
        return int.Parse(node.Children[0].Value);
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]);            

To:

至:

if(node.Children.Length==1) // integer | identifer
{
    if(node.Children[1].SymbolId==integer)
        return node.Children[0].Value;
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]);            

This change wasn't necessary, but it simplifies things slightly. This uses Microsoft's

TypeConverter

 framework in tandem with

Convert.ChangeType()

 to automatically translate your return values to the target type. 

不必进行此更改,但是可以稍微简化一下。 这与

Convert.ChangeType()

一起使用Microsoft的

TypeConverter

框架,将您的返回值自动转换为目标类型。

Regardless of all that code, using the generated code is pretty simple:

无论使用哪种代码,使用生成的代码都非常简单:

var text = "3*5+7*2"; 
var exprTokenizer = new ExpressionTokenizer(text);
var pt = ExpressionParser.Parse(exprTokenizer);
Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt));
Console.WriteLine();           

The reason creating the parser is a separate step than creating the tokenizer is because it's actually possible to use a tokenizer other than a Rolex tokenizer with this parser, but you'll have to provide your own

Token

struct and a class implementing

IEnumerable<Token>

. For performance reasons, there is no

IToken

interface. However, your token must have the following fields: (

int

)

SymbolId

, (

string

)

Symbol,

(

string

)

Value

, (

int

)

Line

, (

int

)

Column

, and (

long

)

Position and the constructor taking all of these things

. See the reference source under Rolex\Shared\Token.cs. Modifying the original file will change the behavior of Rolex.

创建解析器与创建令牌生成器是一个单独的步骤的原因,是因为实际上可以在此解析器中使用除Rolex令牌生成器以外的令牌生成器,但是您必须提供自己的

Token

结构和实现

IEnumerable<Token>

。 出于性能原因,没有

IToken

接口。 但是,您的令牌必须具有以下字段:(

int

)

SymbolId

,(

string

)

Symbol,

(

string

)

Value

,(

int

)

Line

,(

int

)

Column

和(

long

)

Position and the constructor taking all of these things

。 请参阅Rolex \ Shared \ Token.cs下的参考源。 修改原始文件将改变Rolex的行为。

Finally, now we have 

Evaluate()

, but there's one niggling issue to address, and that is variables, which were previously not implemented. We can implement variables using the

state

 argument in

Evaluate()

 which allows us to pass a user defined value in to our evaluation code. We'll have to change the code in the grammar in order to support it, so lets modify the grammar again.

最后,现在有了

Evaluate()

,但是有一个小问题要解决,那就是变量,以前没有实现。 我们可以使用

Evaluate()

state

参数来实现变量,该参数允许我们将用户定义的值传递到评估代码中。 为了支持它,我们必须更改语法中的代码,因此让我们再次修改语法。

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0],state);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i],state);
        else // sub
            result -= EvaluateFactor(node.Children[i],state);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0],state);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i],state);
            else // div
                result /= EvaluateUnary(node.Children[i],state);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i],state);
            else // div
                result /= EvaluateFactor(node.Children[i],state);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0],state);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1],state);
    else
        return -EvaluateUnary(node.Children[1],state);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[0].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(node.Children[0].Value,out val))
                    return val;
            }    
            throw new SyntaxException(string.Format("Reference to undefined variable {0}",node.Children[0].Value),node.Line,node.Column,node.Position);
        }
    } else // Term
        return EvaluateTerm(node.Children[1],state); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note we're now using the

state

 variable to hold an

IDictionary<string,int>

 instance which we use to hold our variables. We also pass state along to each of our evaluation methods. This is important, as the parser itself is stateless.

请注意,我们现在使用

state

变量来保存

IDictionary<string,int>

实例,该实例用于保存变量。 我们还将状态传递给我们的每种评估方法。 这很重要,因为解析器本身是无状态的。

When we encounter an identifier, we resolve it using a simple dictionary lookup.

当遇到标识符时,我们使用简单的字典查找来解析它。

We have to change the code we use to call it now, to pass in our variables:

我们必须更改用于立即调用的代码,以传递变量:

var text = "3*5+a*2";
var vars = new Dictionary<string, int>();
vars["a"] = 1;
var exprTokenizer = new ExpressionTokenizer(text);
var pt = ExpressionParser.ParseExpression(exprTokenizer);
Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt,vars));           

And Bob's your uncle. There you go.

而鲍勃是你的叔叔。 妳去

However, it's a bit complicated. So I've added some shorthand to simplify. For starters there are virtual variables named <Production>1 to <Production>N where <Production> is the name of a production in your grammar.

但是,这有点复杂。 因此,我添加了一些简化形式。 对于初学者,有一个名为<Production> 1到<Production> N的虚拟变量,其中<Production>是语法中的产品名称。

For the above grammar we have

Unary1

,

Unary2

,

UnaryN

, and

Expression1

,

Expression2

,

ExpressionN

, and

Factor1

and

Factor2

and so on. We also have

SymbolId1

,

SymbolId2

, etc.

对于以上语法,我们具有

Unary1

Unary2

UnaryN

Expression1

Expression2

ExpressionN

以及

Factor1

Factor2

等。 我们也有

SymbolId1

SymbolId2

等。

The numbers are 1 based positions into the child nodes.

SymbolId1

is an alias for

node.Children[0].SymbolId

. The terminals just point to

node.Children[x].Value

 and the non-terminal productions call the associated evaluate method for the non-terminal at the specified position like, so Factor3 from the above grammar would resolve to 

ExpressionParser.EvaluateFactor(node.Children[2], state)

!

数字是子节点中基于1的位置。

SymbolId1

node.Children[0].SymbolId

的别名。 终端仅指向

node.Children[x].Value

,非终端生成调用指定位置上非终端的关联评估方法,例如,因此上述语法中的Factor3将解析为

ExpressionParser.EvaluateFactor(node.Children[2], state)

You can also index in to these like

Unary[i]

. In addition there is a

Child

 macro (

Child1

..

ChildN

,

Child[i]

) that gives you the ability to evaluate the node type at run time and it will call the appropriate

EvaluateXXXX()

 method for you. This is useful in cases where we have collapsed nodes in the underlying grammar because sometimes that means we can't know which nodes we'll see in our evaluate function ahead of time. We'll visit this below:

您也可以像

Unary[i]

那样索引这些内容。 另外还有一个

Child

宏(

Child1

..

ChildN

Child[i]

让您评估在运行时节点类型的能力),它会调用相应的

EvaluateXXXX()

方法适合你。 这对于在基础语法中折叠了节点的情况很有用,因为有时这意味着我们无法提前知道在评估函数中会看到哪些节点。 我们将在下面访问:

Also, this may not seem very intuitive above until you look at the final grammar format:

另外,在您看完最终的语法格式后,上面的内容似乎不太直观:

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = Factor1;
    int i = 2;
    while (i<Length) 
    {
        if(SymbolId[i-1]==add)
            result += Factor[i];
        else // sub
            result -= Factor[i];
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = Unary1;
    int i = 2;
    // Because of the way the grammar is 
    // factored, this can end up being 
    // Factors when collapsed. We use
    // Child[n] which is a "smart"
    // macro that evaluates the symbol
    // at runtime and calls the right
    // EvaluateXXXX method
    while (i<Length) 
    {
        // Child always returns an object type so 
        // be sure to cast as necessary
        if(SymbolId[i-1]==mul) 
            result *= (int)Child[i];
        else // div
            result /= (int)Child[i];
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(Length==1)
        return Leaf1;
    if(SymbolId[0]==add)
        return Unary2;
    else
        return -Unary2;
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(Length==1) // integer | identifer
    {
        if(SymbolId[0]==integer)
            return integer1;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(identifier1,out val))
                    return val;
            }    
            throw new SyntaxException(string.Format("Reference to undefined variable {0}",identifier1),node.Line,node.Column,node.Position);
        }
    } else // Term
        return Term2;
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Take a look at that! That's pretty easy to follow once you get the hang of it. The macros clear things right up and keep it simple

看一看! 一旦掌握了这一点,就很容易遵循。 宏可以立即清除并保持简单

语法中的代码区域 (Code Regions In The Grammar)

Anywhere between productions you can specify regions of code to be added to the parser class using

{

}

. Any fields or methods must be static or const, and not named after any symbol in the grammar.

在生产之间的任何地方,您都可以使用

{

}

指定要添加到解析器类的代码区域。 任何字段或方法都必须是静态或const,并且不能以语法中的任何符号命名。

These regions are added as members of the class. We'll visit them in a moment when we make a helper method in the next section.

这些区域被添加为类的成员。 在下一节中,我们将在创建辅助方法时访问它们。

句法约束 (Syntactic Constraints)

Syntactic constrants, or syntactic predicates help disambiguate your grammar by indicating a

boolean

 function which tells the parser whether your production matches. Here's an example below, also using code regions from just above,  of using the

where

clause, which takes the form of:

句法约束或句法谓语通过指示

boolean

来告诉语法分析器您的结果是否匹配,从而帮助消除语法歧义。 这是下面的示例,也使用上面的代码区域,使用

where

子句,其形式为:

The following disambiguates between a C# keyword and an identifer: (note that the terminal definitions we're using are Unicode so they require a Gplex lexer)

以下代码在C#关键字和标识符之间进行了歧义消除:(请注意,我们使用的终端定义是Unicode,因此它们需要Gplex词法分析器)

Identifier<collapsed> = verbatimIdentifier | identifier : where { return !Keywords.Contains(context.Value); } 
verbatimIdentifier='@(_|[[:IsLetter:]])(_|[[:IsLetterOrDigit:]])*';
identifier='(_|[[:IsLetter:]])(_|[[:IsLetterOrDigit:]])*';
{
    static HashSet<string> Keywords=_BuildKeywords();
    static HashSet<string> _BuildKeywords() 
    {
        var result = new HashSet<string>();
        string[] sa = "abstract|as|ascending|async|await|base|bool|break|byte|case|catch|char|checked|class|const|continue|decimal|default|delegate|descending|do|double|dynamic|else|enum|equals|explicit|extern|event|false|finally|fixed|float|for|foreach|get|global|goto|if|implicit|int|interface|internal|is|lock|long|namespace|new|null|object|operator|out|override|params|partial|private|protected|public|readonly|ref|return|sbyte|sealed|set|short|sizeof|stackalloc|static|string|struct|switch|this|throw|true|try|typeof|uint|ulong|unchecked|unsafe|ushort|using|var|virtual|void|volatile|while|yield".Split(new char[] {'|'});
        
        for(var i = 0;i<sa.Length;++i) 
            result.Add(sa[i]);
        
        return result;
    }
}           

We can make the

_BuildKeywords()

 function using a fixed string lookup which would be more efficient, but a lot more code.

我们可以使用固定的字符串查找来制作

_BuildKeywords()

函数,该查找效率更高,但代码更多。

Here we've specified

verbatimIdentifier

before

identifier

, which gives it priority in the lexer/tokenizer. Our non-terminal 

Identifier

 (not to be confused with the terminal

identifier

) specifies either a verbatimIdentifier or an identifier and uses a

where

 clause to determine if it is a keyword, in which case, the parser will not consider the item under the cursor as an 

Identifier

while parsing. There's a trick for experienced users wherein you can override a first-first or first-follows conflict simply by specifying a

where

clause and always returning

true

. However, only use this when you know what you're doing as it can render parts of the grammar unreachable, and won't warn you. We didn't have to collapse

Identifier

 but we did it because it's not needed in the final parse tree, and just makes an unnecessary extra node. Also note the use of the pre-fix increment in the loop. Remember that Slang will not support post-fix increment because the underlying

CodeDOM

cannot do so in a language agnostic manner.

在这里,我们在

identifier

之前指定了

verbatimIdentifier

,从而在lexer / tokenizer中赋予了优先级。 我们的非终端

Identifier

(不要与终端

identifier

混淆)指定verbatimIdentifier或标识符,并使用

where

子句确定它是否为关键字,在这种情况下,解析器不会将光标下的项目视为解析时的

Identifier

。 对于有经验的用户,有一个技巧,您只需指定

where

子句并始终返回

true

就可以覆盖first-first或first-follows冲突。 但是,仅当您知道自己在做什么时才使用它,因为它会使部分语法无法访问,并且不会警告您。 我们不必折叠

Identifier

但我们这样做是因为最终的解析树中不需要它,而只是制作了一个不必要的额外节点。 还要注意在循环中使用前缀增量。 请记住, Slang不支持

CodeDOM

增量,因为基础

CodeDOM

无法以与语言无关的方式来实现。

抽象和虚拟非终端 (Abstract and Virtual Non-Terminals)

These are used together which is why we'll cover them together. This is a powerful feature to resolve parses that the generated code can't handle, like the difference between a cast expression and a parenthesized subexpression in C#. We'll start with the virtual non-terminals, which take the form of:

这些一起使用,这就是我们一起介绍它们的原因。 这是解析生成的代码无法处理的解析的强大功能,例如强制转换表达式和C#中带括号的子表达式之间的差异。 我们将从虚拟非终端开始,其形式为:

First, lets cover the

firsts

 attribute. This simply tells the parser which symbols can come first in your virtual non-terminal. These must match your parser function or the parser won't find your non-terminal correctly! I also mark the non-terminal with the no-op

virtual

 attribute simply for readability, but this is optional:

首先,让我们介绍一下

firsts

属性。 这只是告诉解析器哪些符号可以在您的虚拟非终端设备中首先出现。 这些必须与您的解析器功能匹配,否则解析器将无法正确找到您的非终端设备! 我也仅出于可读性的目的,使用no-op

virtual

属性标记了非终端,但这是可选的:

Cast<virtual,firsts="lparen"> { 
    return _ParseCast(context);
}           

Note that I forwarded to a function. This is so I can keep all my code at the bottom of the grammar in a code region (which we covered above). Note that I also put an underscore before my function. This is to ensure the name doesn't conflict with any of the generated parse functions, including

ParseCast()

 itself, which will be generated as a result of this virtual non-terminal declaration. Let's take a look at

_ParseCast()

:

请注意,我转发了一个函数。 这样一来,我就可以将我所有的代码都放在语法的底部,即代码区域(我们在上文中进行了介绍)。 请注意,我还在函数之前加了下划线。 这是为了确保名称不与任何生成的解析函数(包括

ParseCast()

本身

ParseCast()

冲突,该解析函数将由于此虚拟非终端声明而生成。 让我们看一下

_ParseCast()

static ParseNode _ParseCast(ParserContext context) {
    int line = context.Line;
    int column = context.Column;
    long position = context.Position;

    if("("!=context.Value)
        context.Error("Expecting ( as start of expression or cast");
    ParseNode lp = new ParseNode(SlangParser.lparen, "lparen", context.Value, context.Line, context.Column, context.Position);
    context.Advance();
    ParseNode type = ParseTypeCastPart(context);
    ParseNode expr = ParseExpression(context);
    return new ParseNode(SlangParser.Cast, "Cast", new ParseNode[] {type,expr}, line, column, position);
}           

In the code block we use the other parse routines and create parse nodes based on a parse, using the parser's API directly: You can see this is rather involved. With great power comes great responsibility. Here you are responsible for handling line numbers, manually parsing terminals, advancing the cursor, calling other

ParseXXXX()

 functions and reporting parse nodes accurately. Note how we

Advance()

to move past the terminals, but we don't need to do so to move past the non-terminals, because the other

ParseXXXX()

 functions do that for us. Remember to pass

context

 around. Note how we were careful to store the initial line, column and position information when we entered the function because we report them at the end. Note we use the current line, column, and position when reporting an error parsing a terminal. Be careful here or your position information will be wrong. Basically this allows you to entirely implement a

ParseXXXX()

 method yourself rather than using the generated code.

在代码块中,我们使用其他解析例程,并直接使用解析器的API基于解析创建解析节点:您可以看到这很复杂。 拥有权利的同时也被赋予了重大的责任。 在这里,您负责处理行号,手动解析终端,前进光标,调用其他

ParseXXXX()

函数以及准确报告解析节点。 请注意,我们如何

Advance()

越过终端,但是我们不需要这样做就越过非终端,因为其他

ParseXXXX()

函数为我们完成了此操作。 记住要传递

context

。 请注意,进入函数时,我们非常小心地存储了初始行,列和位置信息,因为我们在最后报告了它们。 请注意,在报告解析终端错误时,我们使用当前行,列和位置。 请注意此处,否则您的职位信息有误。 基本上,这使您可以自己完全实现

ParseXXXX()

方法,而不使用生成的代码。

See how we parsed TypeCastPart? I haven't shown you the definition yet, but here it is:

看看我们如何解析TypeCastPart? 我尚未显示定义,但这里是:

// this is necessary so we can get Parsley to generate a 
// method called ParseTypeCastPart() which we use
// when resolving casts
TypeCastPart<include,collapsed>= Type ")";           

Note the addition of the

include

attribute. This is a dummy attribute to make sure

TypeCastPart

 shows up in the grammar, because it's not referenced by anything in the grammar except code. Attributes force a non-terminal to be present in the grammar. We didn't need

include

 here but I specified it for readability. Had we not specified

collapsed

 we would have had to specify

include

 (or something else), otherwise

ParseTypeCastPart()

 would never show up in the grammar. Since we don't actually report it in the parse tree, but just use it for parsing we collapse it. That part however, is optional. See how it ends with ")"? That's because a cast looks like

(string)foo

 and so in order for the parser to know the ")" can follow a 

Type

we had to eat the closing parenthesis here. The other option would have been to parse

Type

manually which is simply too much work.

注意添加了

include

属性。 这是一个虚拟属性,可确保

TypeCastPart

出现在语法中,因为除代码外,语法中的任何内容均未引用它。 属性强制语法中出现非终结符。 我们不需要在此处

include

,但出于可读性考虑而指定了它。 如果没有指定

collapsed

,则必须指定

include

(或其他),否则

ParseTypeCastPart()

将永远不会出现在语法中。 由于我们实际上并没有在解析树中报告它,而是仅将其用于解析,所以我们将其折叠了。 但是,该部分是可选的。 怎么看呢结尾“)”? 这是因为铸造模样

(string)foo

,因此为了让解析器知道“)”可以按照

Type

,我们不得不在这里吃的右括号。 另一个选择是手动解析

Type

,这简直是太多的工作。

And finally we come to abstract non-terminals. Say you wanted to report a non-terminal above which wasn't actually in the grammar. Abstract non-terminals simply allow you to put an "empty/no-op" non-terminal in the grammar, which can then be reported by your parse function because it has a symbol constant associated with it. They take the form of:

最后我们来抽象非终结符。 假设您要报告一个非终止符,该终止符实际上不在语法中。 抽象非终结符仅允许您在语法中放入“空/无操作”非终结符,然后您的解析函数就可以报告该非终结符,因为它具有与之关联的符号常量。 它们的形式为:

See how nothing is specified for this production other than attributes. It's simply terminated with a semicolon. All it does is generate a symbol in the grammar, so this...

了解如何为此生产指定除属性以外的任何内容。 它只是用分号终止。 它所做的只是在语法中生成一个符号,所以这...

generates a constant symbol you can use, but no

ParseFoo()

 method. You'd return this from one of your virtual non-terminals. It wouldn't be used anywhere else. Note again, you have to specify an attribute for any non-terminal not-referenced in the grammar, and that includes abstract non-terminals which can't be referenced in the grammar.

生成可以使用的常量符号,但没有

ParseFoo()

方法。 您可以从您的虚拟非终端之一返回它。 它不会在其他任何地方使用。 再次注意,你必须指定在语法任何非终端没有引用一个属性,包括不能在语法中引用抽象的非终端。

See the ParsleyDemo and ParsleyDemoVB projects for more.

有关更多信息,请参见ParsleyDemo和ParsleyDemoVB项目。

@options指令 (The @options Directive)

The

@options

 directive allows you to override much of the command line of Parsley with your own settings in the grammar. In the case where a grammar imports other grammars, only the top level grammar's

@options

settings will be honored. The format is as follows:

@options

指令使您可以使用语法中的自己的设置来覆盖Parsley的大部分命令行。 在语法导入其他语法的情况下,仅

@options

使用顶级语法的

@options

设置。 格式如下:

The

@options

directive may appear once, and like

@import

 must appear before any productions. An option value can be a quoted string, an number, a

boolean

, or

null

. The current available options are as follows, and correspond to their command line counterparts:

outputfile

codenamespace

,

codeclass

,

codelanguage

,

rolexfile

,

gplexfile

,

gplexcodeclass

,

fast

and

verbose

. Like with attributes, boolean values that are true don't need their value specified explicitly, just their name, so

verbose

and

verbose=true

  are equivelent. Again, these override the command line options. You will typically use them with the Visual Studio integration, explored below.

@options

指令可能只出现一次,就像

@import

必须在任何生产之前出现。 选项值可以是带引号的字符串,数字,

boolean

null

。 当前可用的选项如下,并且对应于它们的命令行对应项:

outputfile

codenamespace

codeclass

codelanguage

rolexfile

gplexfile

gplexcodeclass

fast

verbose

。 像属性一样,布尔值true不需要显式指定其值,而只需其名称即可,因此

verbose

verbose=true

是等价的。 同样,它们会覆盖命令行选项。 您通常将它们与Visual Studio集成一起使用,如下所述。

使用此混乱作为预构建步骤 (Using this Mess as a Pre-Build Step)

Simply navigate to your project options, under Build Events (C#) and add it to the pre-build event command line. Finding the pre-build event location in your project options is more involved in VB than it is in C#, but dig around. It's there.

只需导航至构建事件(C#)下的项目选项,然后将其添加到构建前事件命令行即可。 与C#中相比,VB中要在项目选项中查找预生成事件的位置更多,但是要多加研究。 在那里。

Either way, you need to know what to feed it, so here you go. You need to make two things - a parser and a tokenizer/lexer. Each will have it's own command line to execute. For your parser, you'll use parsley.exe, and for your tokenizer/lexer you'll either use rolex.exe or gplex.exe. You'll want parsley.exe to come first, and make sure to give it the right arguments to generate a lexer file for you (assuming the grammar doesn't indicate one). Following that, you'll want to run rolex.exe or gplex.exe depending which lexer you're using. If you're making your own lexer from scratch, you can skip that second command line but most people won't ever do that.

无论哪种方式,您都需要知道要喂什么,所以就到这里。 您需要做两件事-解析器和令牌生成器/词法分析器。 每个人都有自己的命令行来执行。 对于解析器,您将使用parsley.exe,对于标记器/词法分析器,您将使用rolex.exe或gplex.exe。 您希望parsley.exe首先出现,并确保为其提供正确的参数以为您生成一个词法分析器文件(假设语法不表示一个)。 之后,您将要运行rolex.exe或gplex.exe,具体取决于您使用的是哪个词法分析器。 如果要从头开始创建自己的词法分析器,则可以跳过第二条命令行,但是大多数人永远不会这样做。

Anyway, here are the basics (see ParsleyDemo and ParsleyDemoVB)

无论如何,这是基础知识(请参阅ParsleyDemo和ParsleyDemoVB)

"parsley.exe" "$(ProjectDir)Expression.xbnf" /output "$(ProjectDir)ExpressionParser.cs" /rolex "$(ProjectDir)Expression.rl" /namespace ParsleyDemo /ifstale
"rolex.exe" "$(ProjectDir)Expression.rl" /output "$(ProjectDir)ExpressionTokenizer.cs" /namespace ParsleyDemo /ifstale           

First, put quotes around paths, including macros, and including your initial executable path. We simply specified "parsley.exe" and "rolex.exe" here but you may need to put in their full paths unless they're in your

PATH

. The demo projects simply use the binaries from the related projects in the solution, but that's not available to you in other situations. One option is to include the executables in your project's folder somewhere and reference those using relative paths. That way if you copy the project to another machine it will still build. I hope that's clear.

首先,在路径(包括宏)和初始可执行路径两边加上引号。 我们仅在此处指定了“ parsley.exe”和“ rolex.exe”,但是除非它们在您的

PATH

否则您可能需要输入它们的完整路径。 演示项目仅使用解决方案中相关项目的二进制文件,但在其他情况下则不可用。 一种选择是将可执行文件包含在项目文件夹中的某个位置,然后使用相对路径引用这些可执行文件。 这样,如果您将项目复制到另一台计算机上,它仍会生成。 我希望这很清楚。

Note we've used macros to refer to the project directory. The list of macros is available in the edit window of the pre-build events if you click around/click through.

注意,我们使用宏来引用项目目录。 如果单击前后单击,则宏列表在预生成事件的编辑窗口中可用。

The above was an example of using Parsley with Rolex. Using it with Gplex is similar:

上面是将Parsley与Rolex结合使用的示例。 将它与Gplex一起使用类似:

"parsley.exe" "$(ProjectDir)Slang.xbnf" /output "$(ProjectDir)SlangParser.cs" /gplex "$(ProjectDir)Slang.lex" /namespace CD /fast /ifstale
"gplex.exe" /out:"$(ProjectDir)SlangScanner.cs" "$(ProjectDir)Slang.lex"           

Note how the syntax of Gplex's command line arguments are different and options preceed the input file. Be careful with this. I didn't write Gplex, so its usage is quite a bit different than the ones in the programs I've written.

请注意,Gplex命令行参数的语法是如何不同的,并且选项在输入文件之前。 请注意这一点。 我没有写Gplex,因此它的用法与我编写的程序中的用法有很大不同。

Visual Studio整合 (Visual Studio Integration)

By request I've added devstudio integration for Parsley, Rolex and Gplex which can be installed via the ParsleyDevstudio VSIX package. In this package are 3 custom tools that each generate a log and the output of the particular program as files underneath the main file.

根据要求,我为Parsley,Rolex和Gplex添加了devstudio集成,可以通过ParsleyDevstudio VSIX软件包进行安装。 在此软件包中,有3个自定义工具,每个工具都会在主文件下面的文件中生成日志和特定程序的输出。

Navigate to the XBNF (Parsley, *.xbnf) input file, the Rolex lexer input file (*.rl), or the Gplex lexer input file (*.lex) and set the appropriate custom tool and then wait as these tools take some significant time to do their work. I recommend you use this only for small grammars or with the (C# only) 

fast

 option and using

gplex

lexers to cut your wait time down. Unfortunately, due to the infrastructure provided by Visual Studio, I can't make these custom tools work without freezing devstudio while they're working.

导航到XBNF(Parsley,* .xbnf)输入文件,Rolex lexer输入文件(* .rl)或Gplex lexer输入文件(* .lex),并设置适当的自定义工具,然后等待这些工具占用一些资源花费大量时间做他们的工作。 我建议您仅将其用于小型语法或与(仅C#)

fast

选项一起使用,并使用

gplex

词法分析器来缩短等待时间。 不幸的是,由于Visual Studio提供的基础结构,我无法在不冻结devstudio的情况下使这些自定义工具正常工作。

To set it up, install the VSIX package and then navigate to an XBNF document, click on it in solution explorer to get the document properties (usually on the lower right of the VS workspace), find

Custom Tool

 and then set it Parsley - as soon as you do, Parsley will begin the generation process, so VS will be unresponsive for a few seconds. The other custom tools are Gplex and Rolex.

要进行设置,请安装VSIX软件包,然后导航到XBNF文档,在解决方案资源管理器中单击它以获取文档属性(通常在VS工作区的右下方),找到“

Custom Tool

,然后将其设置为Parsley -as这样做后,Parsley将立即开始生成过程,因此VS将在几秒钟内无响应。 其他自定义工具是Gplex和Rolex 。

Visual Studio集成的局限性 (Limits of Visual Studio Integration)

The pre-build step method of integrating Parsley into your builds is still the recommended way to build your projects. However, if you choose to use the VS custom tool integration instead, please be mindful of these limitations:

将Parsley集成到构建中的预构建步骤方法仍然是构建项目的推荐方法。 但是,如果您选择使用VS自定义工具集成,请注意以下限制:

  • Parsley is kind of slow to generate, and sometimes VS will regenerate for no reason causing unnecessary churn

    荷兰芹的生成速度很慢,有时VS会无缘无故地重新生成,从而导致不必要的流失

  • The lexers will only support one lexer per namespace. This is due mainly to limitations in Gplex and Rolex, and with Rolex, using the pre-build method instead can sidestep this. Attempts to generate multiple lexers in one namespace will create compile errors.

    这些词法分析器每个名称空间仅支持一个词法分析器。 这主要是由于Gplex和Rolex的局限性所致,而对于Rolex,使用预构建方法可以避免这种情况。 尝试在一个名称空间中生成多个词法分析器将产生编译错误。

  • The custom tools cannot detect changes to dependent files, like imports. You must rerun the tool manually if these are changed.

    自定义工具无法检测对相关文件的更改,例如导入。 如果更改了这些,则必须手动重新运行该工具。

  • When removing the custom tool from a file, the generated files will not all be removed, only the log will. You must remove the other files manually.

    从文件中删除自定义工具时,生成的文件不会全部删除,只会删除日志。 您必须手动删除其他文件。

  • I haven't tested this integration as extensively as I'd like. Testing it is somewhat difficult, so it's a work in progress.

    我没有像我想要的那样对这种集成进行全面的测试。 测试它有些困难,因此它正在进行中。

  • You cannot use the custom tools during VS automated builds (running VS with no UI) - the tools only work in the design time environment.

    您不能在VS自动构建过程中使用自定义工具(在没有UI的情况下运行VS)-这些工具仅在设计时环境下工作。

  • I'm pretty sure renaming a top level file will break something.

    我很确定重命名顶级文件会破坏某些东西。

  • You simply do not have as much control as you do at the command line overall.

    您根本没有像在命令行上那样拥有太多的控制权。

Some of these limitations are due to a monumental hack I had to employ to get Visual Studio to recognize multiple output files from a single custom tool. It's not designed for it at all, and to do so requires some ugly magic. Some, like the periods of unresponsiveness are simply due to the nature of Parsley and/or custom tools.

这些限制中的一些是由于我不得不采用巨大的技巧来使Visual Studio从单个自定义工具中识别多个输出文件。 它根本不是为它设计的,而这样做需要一些难看的魔术。 某些情况(例如无响应的时间段)完全是由于Parsley和/或自定义工具的性质所致。

Note that you'll almost certainly want to use the

@options

 directive in your XBNF document at least to specifiy lexer file(s) to generate. If Parsley is used to generate lexer documents, the lexer documents will automatically be associated with their appropriate custom tools, so the lexer generators will be run on them as appropriate as well. Because of this, while there are 3 VS custom tools, you'll mainly only use "

Parsley

" yourself and the others will be set automatically as needed.

请注意,您几乎肯定会想要至少在XBNF文档中使用

@options

指令来指定要生成的词法分析器文件。 如果使用Parsley生成词法分析器文档,则词法分析器文档将自动与其适当的自定义工具相关联,因此词法分析器生成器也将在其上运行。 因此,尽管有3个VS自定义工具,但您主要只能自己使用“

Parsley

”,其他的将根据需要自动设置。

进一步阅读 (Further Reading)

For follow ups using advanced parsley features: Parse Anything Part 1 and Part 2.

有关使用高级欧芹功能的后续操作:解析第1 部分和第2部分 。

For more information about

firsts

and

follows

, and fundamentals of parsing LL(1), see How To Make a Parser, Lesson 1

有关更多信息,

firsts

follows

,并解析LL(1)的基本面,看如何做一个分析器,课程1

For more information on recursive descent parsing in general, see Recursive Descent Parser at Geeks for Geeks.

有关一般递归下降解析的更多信息,请参阅Geeks for Geeks的递归下降解析器 。

翻译自: https://www.codeproject.com/Articles/5254538/Parsley-A-Recursive-Descent-Parser-Generator-in-Cs

parsley