天天看點

parsley_Parsley:C#中的遞歸下降解析器生成器

parsley

parsley_Parsley:C#中的遞歸下降解析器生成器

介紹 (Introduction)

There are plenty of parser generators out there for .NET, but no language agnostic parser with syntax directed actions (code in the grammar), syntactic predicates, and custom parse routines. Parsley is different. Parsley allows you to write your code in the grammar in a C# subset that is then converted to the target parser's generated language, so if someone wants to use your parser source code in VB they can, even though your code was C#. There are also very few recursive descent parser generators out there, most preferring the table generated approach. The advantage of recursive descent is it's far more intuitive than the table driven approach and the parse routines are completely overridable - plus the code is actually readable. This generator works somewhat like Coco/R in than it is internally LL(1) but allows you to use code to parse more complex constructs than LL(1) allows for. I haven't really used Coco/R but I think it's table driven, unlike Parsley and Parsley has more code features as a result, such as virtual non-terminals for completely overriding a parse. I can't be certain Coco/R doesn't do these things, but I'm fairly sure. If I'm right, Parsley can potentially parse more grammars in more ways than Coco/R.

.NET有很多解析器生成器,但是沒有與語言無關的解析器,它具有文法指導的動作(文法中的代碼),文法謂詞和自定義解析例程。 歐芹是不同的。 Parsley允許您在C#子集中以文法編寫代碼,然後将其轉換為目标解析器的生成語言,是以,即使有人想要在VB中使用解析器源代碼,即使您的代碼是C#,也可以。 那裡也很少有遞歸下降解析器生成器,最喜歡表生成的方法。 遞歸下降的優點是它比表驅動方法直覺得多,并且解析例程是完全可重寫的-而且代碼實際上是可讀的。 該生成器的工作方式與内部的LL(1)有點類似Coco / R ,但允許您使用代碼來解析比LL(1)允許的更複雜的構造。 我沒有真正使用過Coco / R,但我認為它是表驅動的,這與Parsley不同,而Parsley具有更多的代碼功能,例如完全覆寫解析的虛拟非終端。 我不能肯定Coco / R不會做這些事情,但是我很确定。 如果我是對的,那麼與Coco / R相比, Parsley可以以更多方式解析更多文法。

Update: Improved output code generation, added features. See below.

更新:改進了輸出代碼生成,增加了功能。 見下文。

Update 2: Bugfix in grammars, rewrite of part of article, feature adds 

更新2 :文法中的錯誤修複,重寫文章的一部分,功能添加

Update 3: Added GPLEXv1 support!

更新3 :添加了GPLEXv1支援!

Update 4: Added experimental backtracking support

更新4 :添加了實驗性回溯支援

Update 5: Added syntactic predicates, and virtual and abstract non-terminals, and /fast option

更新5 :添加了文法謂詞,虛拟和抽象非終結符以及/ fast選項

Update 6: Added @options directive and the ParsleyDevstudio Visual Studio Integration Package for Parsley, Rolex and Gplex

更新6:添加了@options指令和用于Parsley,Rolex和Gplex的ParsleyDevstudio Visual Studio內建包

版權/許可聲明: (Copyright / License Notice: )

While most of the code here is mine, GPLEXv1 is not my project, and I have yet to fork it per se (I haven't modified the source yet) but I probably will as it hasn't been maintained. It's also very difficult to build the first time because you need the gplex and gppg binaries to build it the first time around, and those binaries can be dicey to find. I also plan to use it for Slang because it supports Unicode, while Rolex does not. I've included it, in ready to build form simply to make life easier on me and you.

盡管這裡的大多數代碼是我的,但GPLEXv1并不是我的項目,我本身尚未對其進行分叉(我尚未修改源代碼),但由于可能尚未維護,我可能會保留。 這也是非常困難的,因為你需要的gplex和GPPG二進制圍繞打造它的第一次打造的第一次,這些二進制檔案可以冒險找到。 我還計劃将其用于Slang,因為它支援Unicode,而Rolex不支援。 我已将其包括在内,以易于建構的形式提供,旨在使我和您的生活更加輕松。

Gardens Point LEX Copyright © 2006-2011 Queensland University of Technology (QUT). All rights reserved.

Gardens Point LEX版權所有©2006-2011昆士蘭科技大學(QUT)。 版權所有。

The full license is included in GPLEXcopyright.rtf in the Gplex folder.

完整的許可包含在Gplex檔案夾GPLEXcopyright.rtf。

Background

背景

Parser generators build parsers which break down text into a meaningful hierarchy of symbols which can be easily processed programmatically. Compilers user parsers to parse your code. JSON libraries use parsers to parse JSON, and Excel uses a parser to decode the formulas in your cells.

解析器生成器建構解析器,該解析器将文本分解為有意義的符号層次結構,可以通過程式設計輕松地對其進行處理。 編譯器使用者解析器可以解析您的代碼。 JSON庫使用解析器來解析JSON,而Excel使用解析器來解碼單元格中的公式。

Parsers come in a variety of algorithms, from simple LL(1), to the almost incomprehensible GLR parsing algorithm. Parsely is LL(1), which means it processes grammars from top to bottom creating a left derivation tree and one symbol of lookahead. The syntax, rather than the input directs the parse. This is the second most intuitive way to parse next to LL(k/*), which allows for arbitrary lookahead.

解析器有多種算法,從簡單的LL(1)到幾乎難以了解的GLR解析算法。 解析為LL(1),這意味着它從上到下處理文法,進而建立左派生樹和一個前瞻符号。 文法而不是輸入指導文法分析。 這是在LL(k / *)之後進行解析的第二種最直覺的方法,它允許任意提前。

使用這個混亂 (Using this Mess)

建立這個混亂 (Building this Mess)

This code is provided as part of my larger Build Tools project. I've stripped the binaries out of the distribution, and yet they are used to build the distribution. To bootstrap it, you'll have to build a couple of times in Release. You'll get locking errors the first time. That's okay. The second time, you may still have a warning in your Error Panel, but it's stale. The build succeeded with no warnings.

此代碼是我的大型建構工具項目的一部分。 我已經從發行版中删除了二進制檔案,但是它們仍用于建構發行版。 要引導它,您必須在Release中建構幾次。 第一次會出現鎖定錯誤。 沒關系。 第二次,您的“錯誤面闆”中可能仍然有警告,但這是陳舊的。 建構成功,沒有任何警告。

安裝這個混亂 (Installing this Mess)

All of the tools included with this distro are useful in their own right. You may want to copy deslang, csbrick, rolex and parsley to your

PATH

somewhere. This makes it easier to use them in pre-build steps because you won't have to fully qualify the EXE path. You can run the ParsleyDevstudio.vsix installer to install the Visual Studio integration, though it is inferior to the standard way of using this as a pre-build step.

該發行版中包含的所有工具本身都是有用的。 您可能要deslang,csbrick, 勞力士和香菜複制到您的

PATH

地方。 這使在預建構步驟中更容易使用它們,因為您不必完全限定EXE路徑。 您可以運作ParsleyDevstudio.vsix安裝程式來安裝Visual Studio內建,盡管它不如将其用作預建構步驟的标準方法。

運作這個混亂 (Running this Mess)

Parsley typically operates as a command line tool. It takes an input file, an optional output file and various parameters dictating the generated code options. Here is the usage screen:

Parsley通常作為指令行工具運作。 它需要一個輸入檔案,一個可選的輸出檔案以及訓示生成的代碼選項的各種參數。 這是用法螢幕:

Parsley generates a recursive descent parser and optional lexer spec

parsley.exe <inputfile> [/output <outputfile>] [/rolex <rolexfile>]
        [/gplex <gplexfile>] [/gplexclass <gplexcodeclass>]
        [/namespace <codenamespace>] [/class <codeclass>]
        [/langage <codelanguage>] [/fast]
        [/noshared] [/verbose] [/ifstale]

        <inputfile>             The XBNF input file to use.
        <outputfile>            The output file to use - default stdout.
        <rolexfile>             Output a Rolex lexer specification to the specified file
        <gplexfile>             Generate Gplex lexer specification to the specified file file. Will also generate supporting C# files (C# only)
        <gplexcodeclass>        Generate Gplex lexer specification with the specified class name. - default takes the name from <gplexfile>
        <codenamespace>         Generate code under the specified namespace - default none
        <codeclass>             Generate code with the specified class name - default derived from <outputfile> or the grammar.
        <codelanguage>          Generate code in the specified language - default derived from <outputfile> or C#.
        <fast>                  Generate code quickly, without resolution using C# only - not valid with the <codelanguage> option.
        <noshared>              Do not include shared library prerequisites
        <verbose>               Output all messages from the generation process
        <ifstale>               Do not generate unless <outputfile> or <rolexfile> is older than <inputfile>.

Any other switch displays this screen and exits.           

(sorry about the word wrapping!)

(對不起,自動換行!)

  • inputfile

    is the XBNF specification (explored below) that the parser will be generated from

    inputfile

    是解析器将根據其生成的XBNF規範(在下面進行探索)
  • outputfile

    is optional and specifies the output code file to generate. If it's not specified, the code will be dumped to the

    stdout

    .

    outputfile

    是可選的,它指定要生成的輸出代碼檔案。 如果未指定,則代碼将轉儲到

    stdout

  • rolexfile

    is optional and specifies the rolex lexer specification file to generate. If it's not specified, then no rolex specification will be generated.

    rolexfile

    是可選的,并且指定了勞力士詞法分析器規範檔案來生成。 如果未指定,則不會生成Rolex規範。
  • gplexfile

     is optional and specifies the gplex lexer specification file to generate. If it's not specified, it's not generated. Note that if it is, several files will be generated. This is necessary to support gplex.

    gplexfile

    是可選的,它指定要生成的gplex lexer規範檔案。 如果未指定,則不會生成。 請注意,如果是,将生成幾個檔案。 這對于支援gplex是必需的。
  • gplexcodeclass

     is optional and specifies the base class names to use for the generated classes. "

    Expression

    " would create

    ExpressionTokenizer

    ,

    ExpressionTokenizerEnumerator

    and

    ExpressionScanner

    .

    gplexcodeclass

    是可選的,它指定要用于生成的類的基類名稱。 “

    Expression

    ”将建立

    ExpressionTokenizer

    ExpressionTokenizerEnumerator

    ExpressionScanner

  • codenamespace

    is the namespace under which to generate the code. If it's not specified, the code won't be generated under a namespace.

    codenamespace

    是在其下生成代碼的名稱空間。 如果未指定,則不會在名稱空間下生成代碼。
  • codeclass

     is the name of the parser class to use. If you don't specify one, it will be the same as the

    outputfile

    , unless that's not specified, in which case it will take the name from the start element in the XBNF grammar.

    codeclass

    是要使用的解析器類的名稱。 如果您未指定一個,它将與

    outputfile

    相同,除非未指定,否則它将使用XBNF文法中start元素的名稱。
  • codelanguage

    is the language of the code to generate. If you don't specify one, it will be based on

    outputfile

    's file extension. If

    outputfile

    is not specified, it will default to C#.

    codelanguage

    是要生成的代碼的語言。 如果您不指定一個,它将基于

    outputfile

    的檔案擴充名。 如果未指定

    outputfile

    ,它将預設為C#。
  • fast

     indicates that the code generation should skip the resolution step and simply emit in C#. This is not valid when specifying a

    codelanguage

    .

    fast

    表示代碼生成應跳過解析步驟并僅使用C#發出。 在指定

    codelanguage

    codelanguage

  • noshared

    skips generating the shared code. This is important to specify for generating a second parser in the same project. The first parser should be generated without this switch, and subsequent parsers should be generated with this switch. If noshared is specified here, it should also be specified to rolex when rolex is used to generate the lexer/tokenizer code. You don't want duplicates of shared code in your source, because it will cause compile errors. This option also applies to Gplex generation.

    noshared

    跳過生成共享代碼。 指定在同一項目中生成第二個解析器很重要。 第一個解析器應在沒有此開關的情況下生成,而後續的解析器應在此開關的情況下生成。 如果在此指定noshared,則當使用Rolex生成詞法分析器/令牌生成器代碼時,也應将其指定給Rolex 。 您不希望源中的共享代碼重複,因為這會導緻編譯錯誤。 此選項也适用于Gplex生成。
  • verbose

     emits all grammar transformation and validation messages, including messages about refactored rules. This can create huge dumps in the case of large grammars with lots of refactoring, but it allows you to see what changes it made. You can also see the changes in the doc comments of the generated code, regardless.

    verbose

    發出所有文法轉換和驗證消息,包括有關重構規則的消息。 在具有大量重構的大型文法的情況下,這可能會産生巨大的轉儲,但它使您可以看到它進行了哪些更改。 無論如何,您也可以在生成的代碼的文檔注釋中看到更改。
  • ifstale

    skips generating the output unless the input has been modified. This is useful for pre-build steps as it prevents the overhead of rebuilding the parser every time.

    除非輸入已修改,否則

    ifstale

    跳過生成輸出。 這對于預建構步驟很有用,因為它避免了每次重新建構解析器的開銷。

You'll probably want to use this tool in tandem with one of the tokenizers included in the solution. Rolex, is the easiest to use, but Gplex supports Unicode and is basically more capable. These tools are lexer/tokenizer generators. Almost all parsers (excepting PEG parsers particularly) require lexers/tokenizers and Parsley is no exception. For further reading about Rolex, I wrote an article about creating Rolex here though you should use this codebase since it's much newer and uses Deslanged Slang technology to dramatically ease maintenance of the generated code while maintaining performance. For further reading about Gplex I'd just look at lex and flex resources as there's little to no documentation on Gplex itself but it's similar to those tools. Fortunately, I've included in Parsley the generation of a mostly identifical interface to that of Rolex lexers so coding against them is basically the same. The constructors are a little different is all. Eventually I'll unify them but it's not trivial because of the way Gplex works. If you use Gplex, it's often desirable to use the

fast

 option with Parsley since the tokenizer is C# anyway.

您可能希望與解決方案中包含的一個分詞器一起使用此工具。 Rolex是最容易使用的,但是Gplex支援Unicode,并且基本上具有更強大的功能。 這些工具是詞法分析器/令牌生成器。 幾乎所有解析器(尤其是PEG解析器除外)都需要詞法分析器/令牌生成器,并且Parsley也不例外。 為了進一步了解Rolex ,我在這裡寫了一篇有關建立Rolex的文章,盡管您應該使用此代碼庫,因為它較新,并且使用Deslanged Slang技術大大簡化了對生成代碼的維護,同時保持了性能。 為了進一步了解Gplex,我隻看lex和flex資源,因為關于Gplex本身的文檔很少甚至沒有,但是與這些工具相似。 幸運的是,我在Parsley中包括了與Rolex lexers幾乎相同的接口的生成,是以針對它們的編碼基本上是相同的。 所有的構造函數都有些不同。 最終,我将它們統一起來,但是由于Gplex的工作方式,這并不簡單。 如果您使用Gplex ,則通常需要在Parsley中使用

fast

選項,因為令牌生成器無論如何都是C#。

編碼此混亂 (Coding this Mess)

Coming back to Parsley, let's examine the XBNF grammar format, which I use for all my parsers:

回到Parsley,讓我們檢查一下我用于所有解析器的XBNF文法格式:

The XBNF format is designed to be easy to learn if you know a little about composing grammars.

如果您對文法的了解不多,則XBNF格式旨在易于學習。

The productions take the form of:

生産形式為:

There are more advanced variations of the above production syntax we'll approach later.

我們稍後将介紹上述生産文法的更多進階變體。

So for example, here's a simple standards compliant JSON grammar:

是以,例如,這是一個簡單的符合标準的JSON文法:

// based on spec @ json.org
Json<start>= Object | Array;
Object= "{" [ Field { "," Field } ] "}";
Field= string ":" Value;
Array= "[" [ Value { "," Value } ] "]";
Value<collapsed>= 
    string    |
    number    |
    Object    |
    Array     |
    Boolean   |
    null      ;
Boolean= true|false;
number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
string= '"([^\n"\\]|\\([btrnf"\\/]|(u[0-9A-Fa-f]{4})))*"';
true= "true";
false= "false";
null= "null";
lbracket<collapsed>= "[";
rbracket<collapsed>= "]";
lbrace<collapsed>= "{";
rbrace<collapsed>= "}";
colon<collapsed>= ":";
comma<collapsed>= ",";
whitespace<hidden>= '[\n\r\t ]+';           

The first thing to note is the

Json

production is marked with a

start

attribute. Since the value was not specified it is implicitly,

start=true

.

首先要注意的是,

Json

生産帶有

start

屬性。 由于未指定值,是以它是隐式的,

start=true

That tells the parser that

Json

is the start production. If it is not specified, the first non-terminal in the grammar will be used. Furthermore, this can cause a warning during generation since it's not a great idea to leave it implicit. Only the first occurrence of

start

will be honored.

這告訴解析器

Json

是開始生産的産品。 如果未指定,則将使用文法中的第一個非結尾。 此外,這可能會在生成過程中引起警告,因為将其隐式保留不是一個好主意。 隻有第一次出現的

start

會受到尊重。

Object | Array

tells us the

Json

production is derived as an object or array. The

Object

production contains a repeat

{}

construct inside and optional

[]

construct, itself containing a reference to

Field

.

Array

 is similar, except it uses "[" and "]" and it refers to

Value

instead of

Field

.

Object | Array

Object | Array

告訴我們

Json

産生是作為對象或數組派生的。

Object

産生内部包含重複

{}

構造和可選

[]

構造,其本身包含對

Field

的引用。

Array

與之類似,除了它使用“ [”和“]”并且引用

Value

而不是

Field

表達方式 (Expressions)

  • ( )

    parentheses allow you to create subexpressions like

    Foo (Bar|Baz)

    ( )

    括号允許您建立子表達式,例如

    Foo (Bar|Baz)

  • [ ]

    optional expressions allow the subexpression to occur zero or once

    [ ]

    可選表達式允許子表達式出現零或一次
  • { }

    this repeat construct repeats a subexpression zero or more times

    { }

    此重複構造将子表達式重複零次或更多次
  • { }+

    this repeat construct repeats a subexpression one or more times

    { }+

    此重複構造重複一個或多次子表達式
  • |

    this alternation construct derives any one of the subexpressions

    |

    此替代構造派生任何一個子表達式
  • Concatenation is implicit, separated by whitespace

    串聯是隐式的,由空格分隔

碼頭 (Terminals)

The terminals are all defined at the bottom but they can be anywhere in the document. XBNF considers any production that does not reference another production to be a terminal. You can force an element to be terminal with the

terminal

attribute (see below).

端子都在底部定義,但可以在文檔中的任何位置。 XBNF将沒有引用其他産品的任何産品視為終端。 您可以使用

terminal

屬性強制元素成為終端(請參見下文)。

Regular expressions are between

'

single quotes and literal expressions are between

"

double quotes. You may declare a terminal by using XBNF constructs or by using regular expressions. The regular expressions follow a POSIX + std extensions paradigm but don't currently support all of POSIX. They support most of it. If a POSIX expression doesn't work, consider it a bug. Unicode will be supported in a future version. I've already started support for constructs like \p and \P but the regex engine needs further optimization before Rolex can create that lexer in a reasonable amount of time. The problem is Unicode's huge range of things like letters and numbers is choking my current implementation which is about 80% optimized for ranges. The other 20% is killing performance in this case, but optimizing it is non-trivial.

正規表達式之間的

'

單引号和文字表達是之間

"

雙引号。您可以通過使用XBNF結構或使用正規表達式聲明端,正規表達式遵循POSIX + STD擴充模式,但目前不支援所有的POSIX的他們支援大多數。如果POSIX表達式不起作用,認為它是一個錯誤。将來的版本将支援Unicode。我已經開始支援\ p和\ P這樣的結構,但是正規表達式引擎需要進一步支援勞力士(Rolex)可以在合理的時間内建立該詞法分析器之前進行優化,但問題是Unicode的大範圍内容(例如字母和數字)使我目前的實作感到窒息,該實作的範圍優化約為80%,在這種情況下,另外20%的性能會被破壞,但對其進行優化并非易事。

屬性 (Attributes)

The

collapsed

element tells Parsley that this node should not appear in the parse tree. Instead, its children will be propagated to its parent. This is helpful if the grammar needs a nonterminal or a terminal in order to resolve a construct, but it's not useful to the consumer of the parse tree. During LL(1) factoring, generated rules must be made, and their associated non-terminals are typically collapsed. Above, we've used it to significantly trim the parse tree of nodes we won't need including collapsing unnecessary terminals like

:

in the JSON grammar. This is because they don't help us define anything - they just help the parser recognize the input, so we can throw them out to make the parse tree smaller.

collapsed

元素告訴Parsley該節點不應出現在解析樹中。 相反,其子級将傳播給其父級。 如果文法需要非終結符或終結符來解析構造,這将很有用,但對于解析樹的使用者而言則無用。 在LL(1)分解期間,必須制定生成的規則,并且通常将其關聯的非終端折疊。 上面,我們已經用它來顯著修剪我們将不再需要包括折疊不必要的終端,如節點的分析樹

:

在JSON文法。 這是因為它們沒有幫助我們定義任何内容-它們隻是幫助解析器識别輸入,是以我們可以将它們丢棄以使解析樹更小。

The

hidden

element tells Parsley that this terminal should be skipped. This is useful for things like comments and whitespace.

hidden

元素告訴Parsley該終端應被跳過。 這對于諸如注釋和空格之類的東西很有用。

The

blockEnd

attribute is intended for terminals who have a multi character ending condition like C block comments, XML CDATA sections, and SGML/XML/HTML comments. If present, the lexer will continue until the literal specified as the

blockEnd

is matched.

blockEnd

屬性适用于具有多字元結束條件的終端,例如C塊注釋,XML CDATA節和SGML / XML / HTML注釋。 如果存在,則詞法分析器将繼續執行,直到比對指定為

blockEnd

的文字為止。

The

terminal

attribute declares a production to be explicitly terminal. Such a production is considered terminal even if it references other productions. If it does, those other productions which will be included in their terminal form as though they were part of the original expression. This allows you to create composite terminals out of several terminal definitions.

terminal

屬性将生産聲明為顯式終端。 即使此類産品引用了其他産品,也被視為終端産品。 如果是的話,那些其他作品将包含在其最終形式中,就像它們是原始表達的一部分一樣。 這使您可以從多個端子定義中建立複合端子。

The

ignoreCase

attribute specifies that case shouldn't matter for matching. This only applies to terminals and only works with Rolex.

ignoreCase

屬性指定大小寫比對無關緊要。 這僅适用于終端,并且僅适用于Rolex。

The

type

attribute specifies a .NET type or intrinsic C# type of the associated code block (see below). Values returned from such "typed" non-terminals are automatically converted to the target type, obviating the need to cast return values, or convert them using

int.Parse()

 or the like. It will be handled for you.

type

屬性指定關聯代碼塊的.NET類型或固有C#類型(請參見下文)。 從此類“類型化”非終端傳回的值将自動轉換為目标類型,進而無需轉換傳回值或使用

int.Parse()

或類似方法将其轉換。 它将為您處理。

The

dependency

 attribute marks a non-terminal production as needed by the parser code you write.

dependency

屬性标記您編寫的解析器代碼所需的非終端生産。

The

firsts

 and 

follows

  attributes are space delimited lists of terminal and non terminal symbols that hint to the parser what comes first and what can follow a particular non-terminal production. These are typically used in tandem with virtual productions (see the section toward the end)

firsts

follows

屬性是空格分隔終端和非終端符号列出了提示什麼是第一位的解析器,哪些可以遵循特定的非終端的生産。 這些通常與虛拟制作一起使用(請參閱最後一節)

The

priority

 attribute marks a terminal production with a particular priority. A negative number is lower in priority than a positive number. Must be integer. By default items are

priority

0.

priority

屬性标記具有特定優先級的終端生産。 負數的優先級低于正數。 必須為整數。 預設情況下,項目的

priority

0。

評估代碼塊 (Evaluation Code Blocks)

Evaluation code blocks are specified in the grammar using a lambda/anonymous method-like syntax.

=> { ... }

at the end of a non-terminal production. They take the form of:

評估代碼塊在文法中使用lambda /類似匿名方法的文法指定。

=> { ... }

在非終端生産結束時。 它們的形式為:

Like:

喜歡:

This signifies a block of code to associate with that non-terminal. This code is for parse actions, triggered by the

EvaluateXXXX()

methods. In this code, you take a

ParseNode

object, indicated by

node

and evaluate it, returning the result. If you do not return a result, a default value will be returned. See the example below.

這表示與該非終端關聯的代碼塊。 該代碼用于解析由

EvaluateXXXX()

方法觸發的動作。 在此代碼中,您将擷取一個由

node

表示的

ParseNode

對象并對其進行評估,并傳回結果。 如果不傳回結果,将傳回預設值。 請參見下面的示例。

On grammar notation conventions: I use title case for the non-terminals in the grammar. This isn't necessary, but is better in the end because functions like

ParseXXXX()

and

EvaluateXXXX()

take the

XXXX

portion directly from the non-terminal name. Ergo, the non-terminal

foo

would end up generating

Parsefoo()

and

Evaluatefoo()

. This is obviously undesirable but it's not a show stopper. I thought about mangling the name to "correct" this, but in the end I decided not to make unexpected changes to the code. This way, it was straightforward to know what the eventual names of these methods will be. The symbol constants are generated directly from the symbol names as well, so if you wanted to be consistent with Microsoft's naming guidelines, you'd name your terminals with Pascal case/.NET case as well. However, this is far from important, especially since unlike with non-terminals, their names aren't appended to anything.

關于文法符号約定 :我對文法中的非終結符使用标題大小寫。 這不是必需的,但最終會更好,因為

ParseXXXX()

EvaluateXXXX()

類的函數直接從非終端名稱中擷取

XXXX

部分。 是以,非終端

foo

最終将生成

Parsefoo()

Evaluatefoo()

。 這顯然是不可取的,但這不是表演的制勝法寶。 我考慮過修改名稱以“糾正”此問題,但最終我決定不對代碼進行意外更改。 這樣,很容易知道這些方法的最終名稱是什麼。 符号常量也直接從符号名稱生成,是以,如果要與Microsoft的命名準則保持一緻,則也可以使用Pascal case / .NET case命名終端。 但是,這遠非重要,特别是因為與非終端不同,它們的名稱不會附加到任何内容上。

Let's take a look at the grammar for an expression parser:

讓我們看一下表達式解析器的文法:

Term<start>= Factor { ("+"|"-") Factor };
Factor= Unary { ("*"|"/") Unary };
Unary= ("+"|"-") Unary | Leaf;
Leaf= integer | identifier | "(" Term ")";
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// we don't have to explicitly declare anything but whitespace
// we can define it all inline above but this is nicer
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Now here's how the

_ParseXXXX()

methods that get generated work:

現在,這是擷取生成的

_ParseXXXX()

方法的工作方式:

private static ParseNode _ParseLeaf(ParserContext context) {
    int line = context.Line;
    int column = context.Column;
    long position = context.Position;
    if ((ExpressionParser.integer == context.SymbolId)) {
        // Leaf -> integer
        ParseNode[] children = new ParseNode[1];
        children[0] = new ParseNode(ExpressionParser.integer, "integer", context.Value, line, column, position);
        context.Advance();
        return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
    }
    if ((ExpressionParser.identifier == context.SymbolId)) {
        // Leaf -> identifier
        ParseNode[] children = new ParseNode[1];
        children[0] = new ParseNode(ExpressionParser.identifier, "identifier", context.Value, line, column, position);
        context.Advance();
        return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
    }
    if ((ExpressionParser.Implicit3 == context.SymbolId)) {
        // Leaf -> Implicit3 Term Implicit4
        ParseNode[] children = new ParseNode[3];
        children[0] = new ParseNode(ExpressionParser.Implicit3, "Implicit3", context.Value, line, column, position);
        context.Advance();
        children[1] = ExpressionParser._ParseTerm(context);
        children[2] = new ParseNode(ExpressionParser.Implicit4, "Implicit4", context.Value, line, column, position);
        if ((ExpressionParser.Implicit4 == context.SymbolId)) {
            context.Advance();
            return new ParseNode(ExpressionParser.Leaf, "Leaf", children, line, column, position);
        }
        context.Error("Expecting Implicit4");
    }
    context.Error("Expecting integer, identifier, or Implicit3");
    return null;
}           

You can see it resolving based on

context

's current symbol, and then parsing the child nodes, one by one. Terminals are parsed inline, while non-terminals are forwarded to their appropriate

_ParseXXXX()

 methods. Non-terminals with collapsed children generate a little differently, using

List<ParseNode>

instead of

ParseNode[]

to hold the children. This is because at the point when we're propagating children from the collapsed nodes, we can't know how many nodes there will be, unlike when the nodes are not collapsed. This is okay, as it only generates this alternative when necessary, and it only requires one additional copy. You'll note the appearance of

Implicit3

and

Implicit4

. Those are the names of the tokens we have not defined yet. If you want better error messages, give them a name in the grammar.

您可以看到它基于

context

的目前符号進行解析,然後一個一個地解析子節點。 終端被内聯解析,而非終端被轉發到它們适當的

_ParseXXXX()

方法。 具有折疊子級的非終端生成的方式略有不同,使用

List<ParseNode>

而不是

ParseNode[]

來容納子級。 這是因為在從折疊節點傳播子節點時,我們不知道會有多少個節點,這與未折疊節點不同。 可以,因為它僅在必要時生成此替代方法,并且僅需要一個附加副本。 您會注意到

Implicit3

Implicit4

的外觀。 這些是我們尚未定義的令牌的名稱。 如果您想要更好的錯誤消息,請在文法中給它們命名。

This is fine for parsing but will do nothing for evaluation. That is to say, the above will add a

Parse()

 method to the parser class, but there's not enough information in the grammar to create an

Evaluate()

method. Basically, we can get a parse tree back with the above, but we can't evaluate it. Lets fix that by adding some code to the grammar:

這對于解析是很好的,但不會做任何評估。 也就是說,以上代碼将向Parser類添加一個

Parse()

方法,但是文法中沒有足夠的資訊來建立

Evaluate()

方法。 基本上,我們可以通過上面的方法獲得一個解析樹,但是我們無法對其進行評估。 讓我們通過在文法中添加一些代碼來解決此問題:

Term<start>= Factor { ("+"|"-") Factor } => {
    int result = (int)EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += (int)EvaluateFactor(node.Children[i]);
        else // sub
            result -= (int)EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor= Unary { ("*"|"/") Unary } => { 
    int result = (int)EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateUnary(node.Children[i]);
            else // div
                result /= (int)EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= (int)EvaluateFactor(node.Children[i]);
            else // div
                result /= (int)EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -(int)EvaluateUnary(node.Children[1]);
}
Leaf= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[1].SymbolId==integer)
            return int.Parse(node.Children[0].Value);
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note the code blocks indicated by

=> { ... }

 which tell our parser what to do with the parse tree we made.

注意代碼塊訓示由

=> { ... }

這告訴我們的解析器如何處理我們做出了解析樹。

This indicates an integer expression parser. Because there is code in the grammar, it creates a

public

ExpressionParser.Evaluate()

method that can be used to call it, and thus evaluate a simple integer expression like

4+2*8

.

這表示整數表達式解析器。 因為文法中有代碼,是以它會建立一個

public

ExpressionParser.Evaluate()

方法

ExpressionParser.Evaluate()

可用于調用它),進而求值一個簡單的整數表達式,例如

4+2*8

Here is the code it generates for

EvaluateLeaf()

. You can see the code looks a lot like the associated code in the grammar above, with minor changes.

這是它為

EvaluateLeaf()

生成的代碼。 您可以看到該代碼看起來與上面的文法中的關聯代碼非常相似,但有少量更改。

public static object EvaluateLeaf(ParseNode node, object state) {
    if ((ExpressionParser.Leaf == node.SymbolId)) {
        if ((node.Children.Length == 1)) {
            if ((node.Children[1].SymbolId == ParsleyDemo.ExpressionParser.integer)) {
                return int.Parse(node.Children[0].Value);
            }
            else {
                throw new NotImplementedException("Variables are not implemented.");
            }
        }
        else {
            return ParsleyDemo.ExpressionParser.EvaluateTerm(node.Children[1]);
        }
    }
    throw new SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position);
}           

Here's that same code in VB.NET.

這是VB.NET中的相同代碼。

Public Overloads Shared Function EvaluateLeaf(ByVal node As ParseNode, ByVal state As Object) As Object
    If (ExpressionParser.Leaf = node.SymbolId) Then
        If (node.Children.Length = 1) Then
            If (node.Children(1).SymbolId = ParsleyDemo.ExpressionParser.[integer]) Then
                Return Integer.Parse(node.Children(0).Value)
            Else
                Throw New NotImplementedException("Variables are not implemented.")
            End If
        Else
            Return ParsleyDemo.ExpressionParser.EvaluateTerm(node.Children(1))
        End If
    End If
    Throw New SyntaxException("Expecting Leaf", node.Line, node.Column, node.Position)
End Function           

As you can see, the code block in the grammar was translated to the target language. This is Slang. Slang is still experimental, but it works well enough for doing simple evaluation inside code blocks. There's one problem here.

EvaluateUnary()

 (and

Evaluate()

 and all of the

EvaluateXXXX()

 methods return

object

! This is an integer expression parser so returning

int

 is far more appropriate. Fortunately, we can tell the parser what type to return using the

type

 attribute. Here we add the

type

attributes to the elements we've changed the grammar, which I've put in bold:

如您所見,文法中的代碼塊已翻譯為目智語言。 這是S語 。 Slang仍處于試驗階段,但足以在代碼塊内進行簡單評估。 這裡有一個問題。

EvaluateUnary()

(和

Evaluate()

以及所有

EvaluateXXXX()

方法都傳回

object

!這是一個整數表達式解析器,是以傳回

int

更合适。幸運的是,我們可以使用

type

屬性告訴解析器傳回什麼類型。我們将

type

屬性添加到我們更改了文法的元素中,我将其以粗體顯示:

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0]);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i]);
        else // sub
            result -= EvaluateFactor(node.Children[i]);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0]);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i]);
            else // div
                result /= EvaluateUnary(node.Children[i]);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i]);
            else // div
                result /= EvaluateFactor(node.Children[i]);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0]);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1]);
    else
        return -EvaluateUnary(node.Children[1]);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[1].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
            throw new NotImplementedException("Variables are not implemented.");
    } else // Term
        return EvaluateTerm(node.Children[1]); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note that not only do we now have the attribute 

type="int"

 marked up on each non-terminal production, we also have changed the code (particularly in

Leaf

) slightly, from:

請注意,我們現在不僅在每個非終端生産中都标記了屬性

type="int"

,而且還從以下方面對代碼(特别是在

Leaf

)進行了少許更改:

if(node.Children.Length==1) // integer | identifer
{
    if(node.Children[1].SymbolId==integer)
        return int.Parse(node.Children[0].Value);
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]);            

To:

至:

if(node.Children.Length==1) // integer | identifer
{
    if(node.Children[1].SymbolId==integer)
        return node.Children[0].Value;
    else // identifier
        throw new NotImplementedException("Variables are not implemented.");
} else // Term
    return EvaluateTerm(node.Children[1]);            

This change wasn't necessary, but it simplifies things slightly. This uses Microsoft's

TypeConverter

 framework in tandem with

Convert.ChangeType()

 to automatically translate your return values to the target type. 

不必進行此更改,但是可以稍微簡化一下。 這與

Convert.ChangeType()

一起使用Microsoft的

TypeConverter

架構,将您的傳回值自動轉換為目标類型。

Regardless of all that code, using the generated code is pretty simple:

無論使用哪種代碼,使用生成的代碼都非常簡單:

var text = "3*5+7*2"; 
var exprTokenizer = new ExpressionTokenizer(text);
var pt = ExpressionParser.Parse(exprTokenizer);
Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt));
Console.WriteLine();           

The reason creating the parser is a separate step than creating the tokenizer is because it's actually possible to use a tokenizer other than a Rolex tokenizer with this parser, but you'll have to provide your own

Token

struct and a class implementing

IEnumerable<Token>

. For performance reasons, there is no

IToken

interface. However, your token must have the following fields: (

int

)

SymbolId

, (

string

)

Symbol,

(

string

)

Value

, (

int

)

Line

, (

int

)

Column

, and (

long

)

Position and the constructor taking all of these things

. See the reference source under Rolex\Shared\Token.cs. Modifying the original file will change the behavior of Rolex.

建立解析器與建立令牌生成器是一個單獨的步驟的原因,是因為實際上可以在此解析器中使用除Rolex令牌生成器以外的令牌生成器,但是您必須提供自己的

Token

結構和實作

IEnumerable<Token>

。 出于性能原因,沒有

IToken

接口。 但是,您的令牌必須具有以下字段:(

int

)

SymbolId

,(

string

)

Symbol,

(

string

)

Value

,(

int

)

Line

,(

int

)

Column

和(

long

)

Position and the constructor taking all of these things

。 請參閱Rolex \ Shared \ Token.cs下的參考源。 修改原始檔案将改變Rolex的行為。

Finally, now we have 

Evaluate()

, but there's one niggling issue to address, and that is variables, which were previously not implemented. We can implement variables using the

state

 argument in

Evaluate()

 which allows us to pass a user defined value in to our evaluation code. We'll have to change the code in the grammar in order to support it, so lets modify the grammar again.

最後,現在有了

Evaluate()

,但是有一個小問題要解決,那就是變量,以前沒有實作。 我們可以使用

Evaluate()

state

參數來實作變量,該參數允許我們将使用者定義的值傳遞到評估代碼中。 為了支援它,我們必須更改文法中的代碼,是以讓我們再次修改文法。

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = EvaluateFactor(node.Children[0],state);
    int i = 2;
    while (i<node.Children.Length) 
    {
        if(node.Children[i-1].SymbolId==add)
            result += EvaluateFactor(node.Children[i],state);
        else // sub
            result -= EvaluateFactor(node.Children[i],state);
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = EvaluateUnary(node.Children[0],state);
    int i = 2;
    // Because of the way the grammar is factored, this can 
    // end up being Factors when collapsed. We have to check
    while (i<node.Children.Length) 
    {
        if(node.Children[i].SymbolId==Unary) // Unary
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateUnary(node.Children[i],state);
            else // div
                result /= EvaluateUnary(node.Children[i],state);
        } else // Factor
        {
            if(node.Children[i-1].SymbolId==mul) 
                result *= EvaluateFactor(node.Children[i],state);
            else // div
                result /= EvaluateFactor(node.Children[i],state);
        }
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(node.Children.Length==1)
        return EvaluateLeaf(node.Children[0],state);
    if(node.Children[0].SymbolId==add)
        return EvaluateUnary(node.Children[1],state);
    else
        return -EvaluateUnary(node.Children[1],state);
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(node.Children.Length==1) // integer | identifer
    {
        if(node.Children[0].SymbolId==integer)
            return node.Children[0].Value;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(node.Children[0].Value,out val))
                    return val;
            }    
            throw new SyntaxException(string.Format("Reference to undefined variable {0}",node.Children[0].Value),node.Line,node.Column,node.Position);
        }
    } else // Term
        return EvaluateTerm(node.Children[1],state); 
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Note we're now using the

state

 variable to hold an

IDictionary<string,int>

 instance which we use to hold our variables. We also pass state along to each of our evaluation methods. This is important, as the parser itself is stateless.

請注意,我們現在使用

state

變量來儲存

IDictionary<string,int>

執行個體,該執行個體用于儲存變量。 我們還将狀态傳遞給我們的每種評估方法。 這很重要,因為解析器本身是無狀态的。

When we encounter an identifier, we resolve it using a simple dictionary lookup.

當遇到辨別符時,我們使用簡單的字典查找來解析它。

We have to change the code we use to call it now, to pass in our variables:

我們必須更改用于立即調用的代碼,以傳遞變量:

var text = "3*5+a*2";
var vars = new Dictionary<string, int>();
vars["a"] = 1;
var exprTokenizer = new ExpressionTokenizer(text);
var pt = ExpressionParser.ParseExpression(exprTokenizer);
Console.WriteLine("{0} = {1}",text,ExpressionParser.Evaluate(pt,vars));           

And Bob's your uncle. There you go.

而鮑勃是你的叔叔。 妳去

However, it's a bit complicated. So I've added some shorthand to simplify. For starters there are virtual variables named <Production>1 to <Production>N where <Production> is the name of a production in your grammar.

但是,這有點複雜。 是以,我添加了一些簡化形式。 對于初學者,有一個名為<Production> 1到<Production> N的虛拟變量,其中<Production>是文法中的産品名稱。

For the above grammar we have

Unary1

,

Unary2

,

UnaryN

, and

Expression1

,

Expression2

,

ExpressionN

, and

Factor1

and

Factor2

and so on. We also have

SymbolId1

,

SymbolId2

, etc.

對于以上文法,我們具有

Unary1

Unary2

UnaryN

Expression1

Expression2

ExpressionN

以及

Factor1

Factor2

等。 我們也有

SymbolId1

SymbolId2

等。

The numbers are 1 based positions into the child nodes.

SymbolId1

is an alias for

node.Children[0].SymbolId

. The terminals just point to

node.Children[x].Value

 and the non-terminal productions call the associated evaluate method for the non-terminal at the specified position like, so Factor3 from the above grammar would resolve to 

ExpressionParser.EvaluateFactor(node.Children[2], state)

!

數字是子節點中基于1的位置。

SymbolId1

node.Children[0].SymbolId

的别名。 終端僅指向

node.Children[x].Value

,非終端生成調用指定位置上非終端的關聯評估方法,例如,是以上述文法中的Factor3将解析為

ExpressionParser.EvaluateFactor(node.Children[2], state)

You can also index in to these like

Unary[i]

. In addition there is a

Child

 macro (

Child1

..

ChildN

,

Child[i]

) that gives you the ability to evaluate the node type at run time and it will call the appropriate

EvaluateXXXX()

 method for you. This is useful in cases where we have collapsed nodes in the underlying grammar because sometimes that means we can't know which nodes we'll see in our evaluate function ahead of time. We'll visit this below:

您也可以像

Unary[i]

那樣索引這些内容。 另外還有一個

Child

宏(

Child1

..

ChildN

Child[i]

讓您評估在運作時節點類型的能力),它會調用相應的

EvaluateXXXX()

方法适合你。 這對于在基礎文法中折疊了節點的情況很有用,因為有時這意味着我們無法提前知道在評估函數中會看到哪些節點。 我們将在下面通路:

Also, this may not seem very intuitive above until you look at the final grammar format:

另外,在您看完最終的文法格式後,上面的内容似乎不太直覺:

Term<start,type="int">= Factor { ("+"|"-") Factor } => {
    int result = Factor1;
    int i = 2;
    while (i<Length) 
    {
        if(SymbolId[i-1]==add)
            result += Factor[i];
        else // sub
            result -= Factor[i];
        i+=2;
    }
    return result;
}

Factor<type="int">= Unary { ("*"|"/") Unary } => { 
    int result = Unary1;
    int i = 2;
    // Because of the way the grammar is 
    // factored, this can end up being 
    // Factors when collapsed. We use
    // Child[n] which is a "smart"
    // macro that evaluates the symbol
    // at runtime and calls the right
    // EvaluateXXXX method
    while (i<Length) 
    {
        // Child always returns an object type so 
        // be sure to cast as necessary
        if(SymbolId[i-1]==mul) 
            result *= (int)Child[i];
        else // div
            result /= (int)Child[i];
        i+=2;
    }
    return result;
}
Unary<type="int">= ("+"|"-") Unary | Leaf => {
    if(Length==1)
        return Leaf1;
    if(SymbolId[0]==add)
        return Unary2;
    else
        return -Unary2;
}
Leaf<type="int">= integer | identifier | "(" Term ")" => {
    if(Length==1) // integer | identifer
    {
        if(SymbolId[0]==integer)
            return integer1;
        else // identifier
        {
            if(state!=null) 
            {
                int val;
                var d = (IDictionary<string,int>)state;
                if(d.TryGetValue(identifier1,out val))
                    return val;
            }    
            throw new SyntaxException(string.Format("Reference to undefined variable {0}",identifier1),node.Line,node.Column,node.Position);
        }
    } else // Term
        return Term2;
}
// we only need to declare terminals explicitly if we want 
// constant names for them or to apply attributes to them
// in this case we don't need divide, subtract or 
// parentheses to be declared
add="+";
mul="*";
integer='[0-9]+';
identifier='[A-Z_a-z][0-9A-Z_a-z]*';
whitespace<hidden>='\s+';           

Take a look at that! That's pretty easy to follow once you get the hang of it. The macros clear things right up and keep it simple

看一看! 一旦掌握了這一點,就很容易遵循。 宏可以立即清除并保持簡單

文法中的代碼區域 (Code Regions In The Grammar)

Anywhere between productions you can specify regions of code to be added to the parser class using

{

}

. Any fields or methods must be static or const, and not named after any symbol in the grammar.

在生産之間的任何地方,您都可以使用

{

}

指定要添加到解析器類的代碼區域。 任何字段或方法都必須是靜态或const,并且不能以文法中的任何符号命名。

These regions are added as members of the class. We'll visit them in a moment when we make a helper method in the next section.

這些區域被添加為類的成員。 在下一節中,我們将在建立輔助方法時通路它們。

句法限制 (Syntactic Constraints)

Syntactic constrants, or syntactic predicates help disambiguate your grammar by indicating a

boolean

 function which tells the parser whether your production matches. Here's an example below, also using code regions from just above,  of using the

where

clause, which takes the form of:

句法限制或句法謂語通過訓示

boolean

來告訴文法分析器您的結果是否比對,進而幫助消除文法歧義。 這是下面的示例,也使用上面的代碼區域,使用

where

子句,其形式為:

The following disambiguates between a C# keyword and an identifer: (note that the terminal definitions we're using are Unicode so they require a Gplex lexer)

以下代碼在C#關鍵字和辨別符之間進行了歧義消除:(請注意,我們使用的終端定義是Unicode,是以它們需要Gplex詞法分析器)

Identifier<collapsed> = verbatimIdentifier | identifier : where { return !Keywords.Contains(context.Value); } 
verbatimIdentifier='@(_|[[:IsLetter:]])(_|[[:IsLetterOrDigit:]])*';
identifier='(_|[[:IsLetter:]])(_|[[:IsLetterOrDigit:]])*';
{
    static HashSet<string> Keywords=_BuildKeywords();
    static HashSet<string> _BuildKeywords() 
    {
        var result = new HashSet<string>();
        string[] sa = "abstract|as|ascending|async|await|base|bool|break|byte|case|catch|char|checked|class|const|continue|decimal|default|delegate|descending|do|double|dynamic|else|enum|equals|explicit|extern|event|false|finally|fixed|float|for|foreach|get|global|goto|if|implicit|int|interface|internal|is|lock|long|namespace|new|null|object|operator|out|override|params|partial|private|protected|public|readonly|ref|return|sbyte|sealed|set|short|sizeof|stackalloc|static|string|struct|switch|this|throw|true|try|typeof|uint|ulong|unchecked|unsafe|ushort|using|var|virtual|void|volatile|while|yield".Split(new char[] {'|'});
        
        for(var i = 0;i<sa.Length;++i) 
            result.Add(sa[i]);
        
        return result;
    }
}           

We can make the

_BuildKeywords()

 function using a fixed string lookup which would be more efficient, but a lot more code.

我們可以使用固定的字元串查找來制作

_BuildKeywords()

函數,該查找效率更高,但代碼更多。

Here we've specified

verbatimIdentifier

before

identifier

, which gives it priority in the lexer/tokenizer. Our non-terminal 

Identifier

 (not to be confused with the terminal

identifier

) specifies either a verbatimIdentifier or an identifier and uses a

where

 clause to determine if it is a keyword, in which case, the parser will not consider the item under the cursor as an 

Identifier

while parsing. There's a trick for experienced users wherein you can override a first-first or first-follows conflict simply by specifying a

where

clause and always returning

true

. However, only use this when you know what you're doing as it can render parts of the grammar unreachable, and won't warn you. We didn't have to collapse

Identifier

 but we did it because it's not needed in the final parse tree, and just makes an unnecessary extra node. Also note the use of the pre-fix increment in the loop. Remember that Slang will not support post-fix increment because the underlying

CodeDOM

cannot do so in a language agnostic manner.

在這裡,我們在

identifier

之前指定了

verbatimIdentifier

,進而在lexer / tokenizer中賦予了優先級。 我們的非終端

Identifier

(不要與終端

identifier

混淆)指定verbatimIdentifier或辨別符,并使用

where

子句确定它是否為關鍵字,在這種情況下,解析器不會将光标下的項目視為解析時的

Identifier

。 對于有經驗的使用者,有一個技巧,您隻需指定

where

子句并始終傳回

true

就可以覆寫first-first或first-follows沖突。 但是,僅當您知道自己在做什麼時才使用它,因為它會使部分文法無法通路,并且不會警告您。 我們不必折疊

Identifier

但我們這樣做是因為最終的解析樹中不需要它,而隻是制作了一個不必要的額外節點。 還要注意在循環中使用字首增量。 請記住, Slang不支援

CodeDOM

增量,因為基礎

CodeDOM

無法以與語言無關的方式來實作。

抽象和虛拟非終端 (Abstract and Virtual Non-Terminals)

These are used together which is why we'll cover them together. This is a powerful feature to resolve parses that the generated code can't handle, like the difference between a cast expression and a parenthesized subexpression in C#. We'll start with the virtual non-terminals, which take the form of:

這些一起使用,這就是我們一起介紹它們的原因。 這是解析生成的代碼無法處理的解析的強大功能,例如強制轉換表達式和C#中帶括号的子表達式之間的差異。 我們将從虛拟非終端開始,其形式為:

First, lets cover the

firsts

 attribute. This simply tells the parser which symbols can come first in your virtual non-terminal. These must match your parser function or the parser won't find your non-terminal correctly! I also mark the non-terminal with the no-op

virtual

 attribute simply for readability, but this is optional:

首先,讓我們介紹一下

firsts

屬性。 這隻是告訴解析器哪些符号可以在您的虛拟非終端裝置中首先出現。 這些必須與您的解析器功能比對,否則解析器将無法正确找到您的非終端裝置! 我也僅出于可讀性的目的,使用no-op

virtual

屬性标記了非終端,但這是可選的:

Cast<virtual,firsts="lparen"> { 
    return _ParseCast(context);
}           

Note that I forwarded to a function. This is so I can keep all my code at the bottom of the grammar in a code region (which we covered above). Note that I also put an underscore before my function. This is to ensure the name doesn't conflict with any of the generated parse functions, including

ParseCast()

 itself, which will be generated as a result of this virtual non-terminal declaration. Let's take a look at

_ParseCast()

:

請注意,我轉發了一個函數。 這樣一來,我就可以将我所有的代碼都放在文法的底部,即代碼區域(我們在上文中進行了介紹)。 請注意,我還在函數之前加了下劃線。 這是為了確定名稱不與任何生成的解析函數(包括

ParseCast()

本身

ParseCast()

沖突,該解析函數将由于此虛拟非終端聲明而生成。 讓我們看一下

_ParseCast()

static ParseNode _ParseCast(ParserContext context) {
    int line = context.Line;
    int column = context.Column;
    long position = context.Position;

    if("("!=context.Value)
        context.Error("Expecting ( as start of expression or cast");
    ParseNode lp = new ParseNode(SlangParser.lparen, "lparen", context.Value, context.Line, context.Column, context.Position);
    context.Advance();
    ParseNode type = ParseTypeCastPart(context);
    ParseNode expr = ParseExpression(context);
    return new ParseNode(SlangParser.Cast, "Cast", new ParseNode[] {type,expr}, line, column, position);
}           

In the code block we use the other parse routines and create parse nodes based on a parse, using the parser's API directly: You can see this is rather involved. With great power comes great responsibility. Here you are responsible for handling line numbers, manually parsing terminals, advancing the cursor, calling other

ParseXXXX()

 functions and reporting parse nodes accurately. Note how we

Advance()

to move past the terminals, but we don't need to do so to move past the non-terminals, because the other

ParseXXXX()

 functions do that for us. Remember to pass

context

 around. Note how we were careful to store the initial line, column and position information when we entered the function because we report them at the end. Note we use the current line, column, and position when reporting an error parsing a terminal. Be careful here or your position information will be wrong. Basically this allows you to entirely implement a

ParseXXXX()

 method yourself rather than using the generated code.

在代碼塊中,我們使用其他解析例程,并直接使用解析器的API基于解析建立解析節點:您可以看到這很複雜。 擁有權利的同時也被賦予了重大的責任。 在這裡,您負責處理行号,手動解析終端,前進光标,調用其他

ParseXXXX()

函數以及準确報告解析節點。 請注意,我們如何

Advance()

越過終端,但是我們不需要這樣做就越過非終端,因為其他

ParseXXXX()

函數為我們完成了此操作。 記住要傳遞

context

。 請注意,進入函數時,我們非常小心地存儲了初始行,列和位置資訊,因為我們在最後報告了它們。 請注意,在報告解析終端錯誤時,我們使用目前行,列和位置。 請注意此處,否則您的職位資訊有誤。 基本上,這使您可以自己完全實作

ParseXXXX()

方法,而不使用生成的代碼。

See how we parsed TypeCastPart? I haven't shown you the definition yet, but here it is:

看看我們如何解析TypeCastPart? 我尚未顯示定義,但這裡是:

// this is necessary so we can get Parsley to generate a 
// method called ParseTypeCastPart() which we use
// when resolving casts
TypeCastPart<include,collapsed>= Type ")";           

Note the addition of the

include

attribute. This is a dummy attribute to make sure

TypeCastPart

 shows up in the grammar, because it's not referenced by anything in the grammar except code. Attributes force a non-terminal to be present in the grammar. We didn't need

include

 here but I specified it for readability. Had we not specified

collapsed

 we would have had to specify

include

 (or something else), otherwise

ParseTypeCastPart()

 would never show up in the grammar. Since we don't actually report it in the parse tree, but just use it for parsing we collapse it. That part however, is optional. See how it ends with ")"? That's because a cast looks like

(string)foo

 and so in order for the parser to know the ")" can follow a 

Type

we had to eat the closing parenthesis here. The other option would have been to parse

Type

manually which is simply too much work.

注意添加了

include

屬性。 這是一個虛拟屬性,可確定

TypeCastPart

出現在文法中,因為除代碼外,文法中的任何内容均未引用它。 屬性強制文法中出現非終結符。 我們不需要在此處

include

,但出于可讀性考慮而指定了它。 如果沒有指定

collapsed

,則必須指定

include

(或其他),否則

ParseTypeCastPart()

将永遠不會出現在文法中。 由于我們實際上并沒有在解析樹中報告它,而是僅将其用于解析,是以我們将其折疊了。 但是,該部分是可選的。 怎麼看呢結尾“)”? 這是因為鑄造模樣

(string)foo

,是以為了讓解析器知道“)”可以按照

Type

,我們不得不在這裡吃的右括号。 另一個選擇是手動解析

Type

,這簡直是太多的工作。

And finally we come to abstract non-terminals. Say you wanted to report a non-terminal above which wasn't actually in the grammar. Abstract non-terminals simply allow you to put an "empty/no-op" non-terminal in the grammar, which can then be reported by your parse function because it has a symbol constant associated with it. They take the form of:

最後我們來抽象非終結符。 假設您要報告一個非終止符,該終止符實際上不在文法中。 抽象非終結符僅允許您在文法中放入“空/無操作”非終結符,然後您的解析函數就可以報告該非終結符,因為它具有與之關聯的符号常量。 它們的形式為:

See how nothing is specified for this production other than attributes. It's simply terminated with a semicolon. All it does is generate a symbol in the grammar, so this...

了解如何為此生産指定除屬性以外的任何内容。 它隻是用分号終止。 它所做的隻是在文法中生成一個符号,是以這...

generates a constant symbol you can use, but no

ParseFoo()

 method. You'd return this from one of your virtual non-terminals. It wouldn't be used anywhere else. Note again, you have to specify an attribute for any non-terminal not-referenced in the grammar, and that includes abstract non-terminals which can't be referenced in the grammar.

生成可以使用的常量符号,但沒有

ParseFoo()

方法。 您可以從您的虛拟非終端之一傳回它。 它不會在其他任何地方使用。 再次注意,你必須指定在文法任何非終端沒有引用一個屬性,包括不能在文法中引用抽象的非終端。

See the ParsleyDemo and ParsleyDemoVB projects for more.

有關更多資訊,請參見ParsleyDemo和ParsleyDemoVB項目。

@options指令 (The @options Directive)

The

@options

 directive allows you to override much of the command line of Parsley with your own settings in the grammar. In the case where a grammar imports other grammars, only the top level grammar's

@options

settings will be honored. The format is as follows:

@options

指令使您可以使用文法中的自己的設定來覆寫Parsley的大部分指令行。 在文法導入其他文法的情況下,僅

@options

使用頂級文法的

@options

設定。 格式如下:

The

@options

directive may appear once, and like

@import

 must appear before any productions. An option value can be a quoted string, an number, a

boolean

, or

null

. The current available options are as follows, and correspond to their command line counterparts:

outputfile

codenamespace

,

codeclass

,

codelanguage

,

rolexfile

,

gplexfile

,

gplexcodeclass

,

fast

and

verbose

. Like with attributes, boolean values that are true don't need their value specified explicitly, just their name, so

verbose

and

verbose=true

  are equivelent. Again, these override the command line options. You will typically use them with the Visual Studio integration, explored below.

@options

指令可能隻出現一次,就像

@import

必須在任何生産之前出現。 選項值可以是帶引号的字元串,數字,

boolean

null

。 目前可用的選項如下,并且對應于它們的指令行對應項:

outputfile

codenamespace

codeclass

codelanguage

rolexfile

gplexfile

gplexcodeclass

fast

verbose

。 像屬性一樣,布爾值true不需要顯式指定其值,而隻需其名稱即可,是以

verbose

verbose=true

是等價的。 同樣,它們會覆寫指令行選項。 您通常将它們與Visual Studio內建一起使用,如下所述。

使用此混亂作為預建構步驟 (Using this Mess as a Pre-Build Step)

Simply navigate to your project options, under Build Events (C#) and add it to the pre-build event command line. Finding the pre-build event location in your project options is more involved in VB than it is in C#, but dig around. It's there.

隻需導航至建構事件(C#)下的項目選項,然後将其添加到建構前事件指令行即可。 與C#中相比,VB中要在項目選項中查找預生成事件的位置更多,但是要多加研究。 在那裡。

Either way, you need to know what to feed it, so here you go. You need to make two things - a parser and a tokenizer/lexer. Each will have it's own command line to execute. For your parser, you'll use parsley.exe, and for your tokenizer/lexer you'll either use rolex.exe or gplex.exe. You'll want parsley.exe to come first, and make sure to give it the right arguments to generate a lexer file for you (assuming the grammar doesn't indicate one). Following that, you'll want to run rolex.exe or gplex.exe depending which lexer you're using. If you're making your own lexer from scratch, you can skip that second command line but most people won't ever do that.

無論哪種方式,您都需要知道要喂什麼,是以就到這裡。 您需要做兩件事-解析器和令牌生成器/詞法分析器。 每個人都有自己的指令行來執行。 對于解析器,您将使用parsley.exe,對于标記器/詞法分析器,您将使用rolex.exe或gplex.exe。 您希望parsley.exe首先出現,并確定為其提供正确的參數以為您生成一個詞法分析器檔案(假設文法不表示一個)。 之後,您将要運作rolex.exe或gplex.exe,具體取決于您使用的是哪個詞法分析器。 如果要從頭開始建立自己的詞法分析器,則可以跳過第二條指令行,但是大多數人永遠不會這樣做。

Anyway, here are the basics (see ParsleyDemo and ParsleyDemoVB)

無論如何,這是基礎知識(請參閱ParsleyDemo和ParsleyDemoVB)

"parsley.exe" "$(ProjectDir)Expression.xbnf" /output "$(ProjectDir)ExpressionParser.cs" /rolex "$(ProjectDir)Expression.rl" /namespace ParsleyDemo /ifstale
"rolex.exe" "$(ProjectDir)Expression.rl" /output "$(ProjectDir)ExpressionTokenizer.cs" /namespace ParsleyDemo /ifstale           

First, put quotes around paths, including macros, and including your initial executable path. We simply specified "parsley.exe" and "rolex.exe" here but you may need to put in their full paths unless they're in your

PATH

. The demo projects simply use the binaries from the related projects in the solution, but that's not available to you in other situations. One option is to include the executables in your project's folder somewhere and reference those using relative paths. That way if you copy the project to another machine it will still build. I hope that's clear.

首先,在路徑(包括宏)和初始可執行路徑兩邊加上引号。 我們僅在此處指定了“ parsley.exe”和“ rolex.exe”,但是除非它們在您的

PATH

否則您可能需要輸入它們的完整路徑。 示範項目僅使用解決方案中相關項目的二進制檔案,但在其他情況下則不可用。 一種選擇是将可執行檔案包含在項目檔案夾中的某個位置,然後使用相對路徑引用這些可執行檔案。 這樣,如果您将項目複制到另一台計算機上,它仍會生成。 我希望這很清楚。

Note we've used macros to refer to the project directory. The list of macros is available in the edit window of the pre-build events if you click around/click through.

注意,我們使用宏來引用項目目錄。 如果單擊前後單擊,則宏清單在預生成事件的編輯視窗中可用。

The above was an example of using Parsley with Rolex. Using it with Gplex is similar:

上面是将Parsley與Rolex結合使用的示例。 将它與Gplex一起使用類似:

"parsley.exe" "$(ProjectDir)Slang.xbnf" /output "$(ProjectDir)SlangParser.cs" /gplex "$(ProjectDir)Slang.lex" /namespace CD /fast /ifstale
"gplex.exe" /out:"$(ProjectDir)SlangScanner.cs" "$(ProjectDir)Slang.lex"           

Note how the syntax of Gplex's command line arguments are different and options preceed the input file. Be careful with this. I didn't write Gplex, so its usage is quite a bit different than the ones in the programs I've written.

請注意,Gplex指令行參數的文法是如何不同的,并且選項在輸入檔案之前。 請注意這一點。 我沒有寫Gplex,是以它的用法與我編寫的程式中的用法有很大不同。

Visual Studio整合 (Visual Studio Integration)

By request I've added devstudio integration for Parsley, Rolex and Gplex which can be installed via the ParsleyDevstudio VSIX package. In this package are 3 custom tools that each generate a log and the output of the particular program as files underneath the main file.

根據要求,我為Parsley,Rolex和Gplex添加了devstudio內建,可以通過ParsleyDevstudio VSIX軟體包進行安裝。 在此軟體包中,有3個自定義工具,每個工具都會在主檔案下面的檔案中生成日志和特定程式的輸出。

Navigate to the XBNF (Parsley, *.xbnf) input file, the Rolex lexer input file (*.rl), or the Gplex lexer input file (*.lex) and set the appropriate custom tool and then wait as these tools take some significant time to do their work. I recommend you use this only for small grammars or with the (C# only) 

fast

 option and using

gplex

lexers to cut your wait time down. Unfortunately, due to the infrastructure provided by Visual Studio, I can't make these custom tools work without freezing devstudio while they're working.

導航到XBNF(Parsley,* .xbnf)輸入檔案,Rolex lexer輸入檔案(* .rl)或Gplex lexer輸入檔案(* .lex),并設定适當的自定義工具,然後等待這些工具占用一些資源花費大量時間做他們的工作。 我建議您僅将其用于小型文法或與(僅C#)

fast

選項一起使用,并使用

gplex

詞法分析器來縮短等待時間。 不幸的是,由于Visual Studio提供的基礎結構,我無法在不當機devstudio的情況下使這些自定義工具正常工作。

To set it up, install the VSIX package and then navigate to an XBNF document, click on it in solution explorer to get the document properties (usually on the lower right of the VS workspace), find

Custom Tool

 and then set it Parsley - as soon as you do, Parsley will begin the generation process, so VS will be unresponsive for a few seconds. The other custom tools are Gplex and Rolex.

要進行設定,請安裝VSIX軟體包,然後導航到XBNF文檔,在解決方案資料總管中單擊它以擷取文檔屬性(通常在VS工作區的右下方),找到“

Custom Tool

,然後将其設定為Parsley -as這樣做後,Parsley将立即開始生成過程,是以VS将在幾秒鐘内無響應。 其他自定義工具是Gplex和Rolex 。

Visual Studio內建的局限性 (Limits of Visual Studio Integration)

The pre-build step method of integrating Parsley into your builds is still the recommended way to build your projects. However, if you choose to use the VS custom tool integration instead, please be mindful of these limitations:

将Parsley內建到建構中的預建構步驟方法仍然是建構項目的推薦方法。 但是,如果您選擇使用VS自定義工具內建,請注意以下限制:

  • Parsley is kind of slow to generate, and sometimes VS will regenerate for no reason causing unnecessary churn

    荷蘭芹的生成速度很慢,有時VS會無緣無故地重新生成,進而導緻不必要的流失

  • The lexers will only support one lexer per namespace. This is due mainly to limitations in Gplex and Rolex, and with Rolex, using the pre-build method instead can sidestep this. Attempts to generate multiple lexers in one namespace will create compile errors.

    這些詞法分析器每個名稱空間僅支援一個詞法分析器。 這主要是由于Gplex和Rolex的局限性所緻,而對于Rolex,使用預建構方法可以避免這種情況。 嘗試在一個名稱空間中生成多個詞法分析器将産生編譯錯誤。

  • The custom tools cannot detect changes to dependent files, like imports. You must rerun the tool manually if these are changed.

    自定義工具無法檢測對相關檔案的更改,例如導入。 如果更改了這些,則必須手動重新運作該工具。

  • When removing the custom tool from a file, the generated files will not all be removed, only the log will. You must remove the other files manually.

    從檔案中删除自定義工具時,生成的檔案不會全部删除,隻會删除日志。 您必須手動删除其他檔案。

  • I haven't tested this integration as extensively as I'd like. Testing it is somewhat difficult, so it's a work in progress.

    我沒有像我想要的那樣對這種內建進行全面的測試。 測試它有些困難,是以它正在進行中。

  • You cannot use the custom tools during VS automated builds (running VS with no UI) - the tools only work in the design time environment.

    您不能在VS自動建構過程中使用自定義工具(在沒有UI的情況下運作VS)-這些工具僅在設計時環境下工作。

  • I'm pretty sure renaming a top level file will break something.

    我很确定重命名頂級檔案會破壞某些東西。

  • You simply do not have as much control as you do at the command line overall.

    您根本沒有像在指令行上那樣擁有太多的控制權。

Some of these limitations are due to a monumental hack I had to employ to get Visual Studio to recognize multiple output files from a single custom tool. It's not designed for it at all, and to do so requires some ugly magic. Some, like the periods of unresponsiveness are simply due to the nature of Parsley and/or custom tools.

這些限制中的一些是由于我不得不采用巨大的技巧來使Visual Studio從單個自定義工具中識别多個輸出檔案。 它根本不是為它設計的,而這樣做需要一些難看的魔術。 某些情況(例如無響應的時間段)完全是由于Parsley和/或自定義工具的性質所緻。

Note that you'll almost certainly want to use the

@options

 directive in your XBNF document at least to specifiy lexer file(s) to generate. If Parsley is used to generate lexer documents, the lexer documents will automatically be associated with their appropriate custom tools, so the lexer generators will be run on them as appropriate as well. Because of this, while there are 3 VS custom tools, you'll mainly only use "

Parsley

" yourself and the others will be set automatically as needed.

請注意,您幾乎肯定會想要至少在XBNF文檔中使用

@options

指令來指定要生成的詞法分析器檔案。 如果使用Parsley生成詞法分析器文檔,則詞法分析器文檔将自動與其适當的自定義工具相關聯,是以詞法分析器生成器也将在其上運作。 是以,盡管有3個VS自定義工具,但您主要隻能自己使用“

Parsley

”,其他的将根據需要自動設定。

進一步閱讀 (Further Reading)

For follow ups using advanced parsley features: Parse Anything Part 1 and Part 2.

有關使用進階歐芹功能的後續操作:解析第1 部分和第2部分 。

For more information about

firsts

and

follows

, and fundamentals of parsing LL(1), see How To Make a Parser, Lesson 1

有關更多資訊,

firsts

follows

,并解析LL(1)的基本面,看如何做一個分析器,課程1

For more information on recursive descent parsing in general, see Recursive Descent Parser at Geeks for Geeks.

有關一般遞歸下降解析的更多資訊,請參閱Geeks for Geeks的遞歸下降解析器 。

翻譯自: https://www.codeproject.com/Articles/5254538/Parsley-A-Recursive-Descent-Parser-Generator-in-Cs

parsley