
Lexers vs Parsers

This article is translated from: lexers vs parsers

Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: coding horror, another blog post.

However, popular lexing-based tools: pygments, geshi, or prettify, all use regular expressions. They seem to lex anything...

When is lexing enough, and when do you need EBNF?

Has anyone used the tokens produced by these lexers with the bison or antlr parser generators?

#1

Reference: https://stackoom.com/question/BvXl/词法分析者与解析者

#2

When is lexing enough, when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / "syntactic sugar" over the standard Chomsky Normal Form (CNF) grammar rules.

For example, the EBNF alternative:
S --> A | B
           

you can achieve in CNF by just listing each alternative production separately:
S --> A      // `S` can be `A`,
S --> B      // or it can be `B`.
           

The optional element from EBNF:
S --> X?
           

you can achieve in CNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by just an empty production; others use epsilon or lambda or a crossed circle):
S --> B       // `S` can be `B`,
B --> X       // and `B` can be just `X`,
B -->         // or it can be empty.
           

A production in a form like the last B above is called an "erasure", because it can erase whatever it stands for in other productions (it produces an empty string instead of something else).

Zero-or-more repetition from EBNF:
S --> A*
           

you can obtain by using a recursive production, that is, one which embeds itself somewhere in it. It can be done in two ways. The first one is left recursion (which usually should be avoided, because top-down recursive-descent parsers cannot parse it):
S --> S A    // `S` is just itself ended with `A` (which can be done many times),
S -->        // or it can begin with empty-string, which stops the recursion.
           

Knowing that it generates just an empty string (ultimately) followed by zero or more A's, the same string (but not the same language!) can be expressed using right recursion:
S --> A S    // `S` can be `A` followed by itself (which can be done many times),
S -->        // or it can be just empty-string end, which stops the recursion.
           

And when it comes to + for one-or-more repetition from EBNF:
S --> A+
           

it can be done by factoring out one A and using * as before:
S --> A A*
           

which you can express in CNF as such (I use right recursion here; try to figure out the other one yourself as an exercise):
S --> A S   // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A     // or it could be just one single `A`.
           

Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols.

More generally, you can recognize regular grammars when you see productions similar to these:
A -->        // Empty (nullable) production (AKA erasure).
B --> x      // Single terminal symbol.
C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
G --> G u    // Left recursion.
H --> v H    // Right recursion.
           

That is, using only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind which doesn't branch tree-like).

Nothing more advanced than these, and you can be sure it's a regular syntax, so you can go with just a lexer for that.
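For illustration, here is a minimal Python sketch of what "just a lexer" means in practice (the token names and patterns are made up for this example): a handful of regular expressions tried at the current position, producing a flat stream of tokens, with no nesting tracked at all:

import re

# Hypothetical token specification: each token kind is a plain regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=<>!]=?"),
    ("SKIP",   r"\s+"),            # whitespace is matched and thrown away
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    """Yield (kind, lexeme) pairs; purely regular, no counting or nesting."""
    pos = 0
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(lex("x1 = 42 + y")))
# [('ID', 'x1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('ID', 'y')]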

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:
S --> a S b    // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S -->          // or it could be (ultimately) empty, which ends recursion.
           

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for S indefinitely, which will always add another a and another b on both sides.

Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many a's were there to match them evenly with so many b's.

Grammars like this are called context-free grammars (at the very least), and they require a parser. Context-free grammars are well understood when it comes to parsing, so they are widely used for describing programming languages' syntax.
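As a minimal illustration (a Python sketch, not tied to any particular parser generator), a recursive-descent parser for the grammar S --> a S b | empty above uses the call stack to do exactly the counting that a finite automaton cannot:

def parse_s(text, pos=0):
    """Recognize S --> a S b | empty. Returns the position after the match."""
    if pos < len(text) and text[pos] == "a":
        pos = parse_s(text, pos + 1)   # recurse into the nested S
        if pos < len(text) and text[pos] == "b":
            return pos + 1             # consume the matching 'b'
        raise SyntaxError(f"expected 'b' at position {pos}")
    return pos                         # empty production: match nothing

def matches(text):
    try:
        return parse_s(text) == len(text)
    except SyntaxError:
        return False

print(matches("aabb"))   # True:  every 'a' is matched by a 'b'
print(matches("aab"))    # False: one 'b' is missing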

But there's more. Sometimes a more general grammar is needed -- when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square braces interleaved, but they have to be paired up correctly with each other (braces with braces, round with round).

This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left (before the arrow). For example:
A R B --> A S B
           

You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions etc. For example, the above rule will substitute R into S, but only when it's in between A and B, leaving those A and B themselves unchanged.
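To make the "context" idea concrete, here is a small Python sketch (the symbol names A, R, S, B are just the ones from the rule above) that rewrites a list of symbols by applying the rule only where the surrounding context matches:

def apply_rule(symbols):
    """Apply the context-sensitive rule  A R B --> A S B  wherever it fits."""
    out = list(symbols)
    for i in range(len(out) - 2):
        # R is rewritten to S only when surrounded by A and B; A and B stay put.
        if out[i] == "A" and out[i + 1] == "R" and out[i + 2] == "B":
            out[i + 1] = "S"
    return out

print(apply_rule(["A", "R", "B"]))   # ['A', 'S', 'B']  -- context matches, R becomes S
print(apply_rule(["C", "R", "B"]))   # ['C', 'R', 'B']  -- no A on the left, R is untouched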

This kind of syntax is really hard to parse, because it needs a full-blown Turing machine. It's a whole other story, so I'll end here.

#3

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.
  1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
  2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
  3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Resource: Compilers (2nd Edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), Jeffrey D. Ullman (Stanford University).

#4

To answer the question as asked (without repeating unduly what appears in other answers):

Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. Both are associated with fairly simple computational models: the finite state automaton and the push-down stack automaton.

Regular languages are a special case of context-free languages, so lexers could be produced with the somewhat more complex CF technology. But it is not a good idea, for at least two reasons.

A fundamental point in programming is that a system component should be built with the most appropriate technology, so that it is easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal.

That is why "it seems fashionable to hate regular expressions". Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention the fact that various extensions and restrictions in implementation somewhat reduce their theoretical simplicity. Lexers do not usually do that, and are usually a simple, efficient, and appropriate technology for parsing tokens. Using CF parsers for tokens would be overkill, though it is possible.

Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. But that might raise structural problems regarding the reading of programs.

Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parsed sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way the syntax rules are composed to build the parse tree. Hence the tree structure is essential.

The fact that tokens are identified with a regular-set-based lexer does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, which transform a stream of characters into a stream of tokens).
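As a rough Python sketch of that composition (the grammar and names below are invented for illustration): the lexer acts as a regular transducer from characters to tokens, and a context-free parser consumes the resulting token stream; chaining the two still recognizes a context-free language:

import re

def lex(text):
    """Regular transducer: a stream of characters in, a stream of tokens out."""
    for m in re.finditer(r"\d+|[()+]", text):
        lexeme = m.group()
        yield ("NUM", lexeme) if lexeme.isdigit() else (lexeme, lexeme)

def parse(tokens):
    """Context-free consumer of the token stream: expr := NUM | '(' expr '+' expr ')'."""
    toks = list(tokens)

    def expr(i):
        kind, text = toks[i]
        if kind == "NUM":
            return int(text), i + 1
        assert kind == "(", f"expected NUM or '(' at token {i}"
        left, i = expr(i + 1)                # nested sub-expression on the left
        assert toks[i][0] == "+", f"expected '+' at token {i}"
        right, i = expr(i + 1)               # nested sub-expression on the right
        assert toks[i][0] == ")", f"expected ')' at token {i}"
        return left + right, i + 1

    value, _ = expr(0)
    return value

print(parse(lex("(1+(20+300))")))   # 321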

However, CF composed with CF (via CF transducers ... sorry for the math) does not necessarily give CF, and might make things more general, but less tractable in practice. So CF is not the appropriate tool for lexers, even though it can be used.

One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions).

(Note that regular transducers may have other uses, such as the formalization of some syntax error handling techniques.)

BNF is just a specific syntax for presenting CF grammars. EBNF is a syntactic sugar for BNF, using the facilities of regular notation to give terser versions of BNF grammars. It can always be transformed into an equivalent pure BNF.

However, the regular notation is often used in EBNF only to emphasize those parts of the syntax that correspond to the structure of lexical elements, which should be recognized with the lexer, while the rest is presented in straight BNF. But it is not an absolute rule.

To summarize, the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree-oriented structure of the language (of program syntax) is better handled by CF grammars. I would suggest also looking at AHR's answer.

But this leaves a question open: Why trees?

Trees are a good basis for specifying syntax because
  • they give a simple structure to the text
  • they are very convenient for associating semantics with the text on the basis of that structure, with a mathematically well understood technology (compositionality via homomorphisms), as indicated above. It is a fundamental algebraic tool for defining the semantics of mathematical formalisms.

Hence they are a good intermediate representation, as shown by the success of Abstract Syntax Trees (AST). Note that ASTs are often different from parse trees, because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are later corrected in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.

The statement that programming languages are context-sensitive (CS) rather than CF is arbitrary and disputable. The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen as either part of syntax or part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages where plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax. Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics.

So stating, as done by Ira Baxter, that CF parsers are being hacked to get the context sensitivity required by syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be. Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build and less efficient. They are also inadequate for expressing perspicuously the kind of context-sensitivity that might be needed. And they do not naturally produce a syntactic structure (such as parse trees) that is convenient to derive the semantics of the program, ie to generate the compiled code.

#5

Yes, they are very different in theory, and in implementation.

Lexers are used to recognize the "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.

Parsers are used to recognize the "structure" of a language's phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
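A tiny Python sketch of that kind of hack, loosely modeled on the classic C ambiguity of X * Y; (all names here are illustrative): the grammar itself stays context-free, and a side symbol table supplies the context-sensitive decision:

# A side symbol table holds the context-sensitive facts (what was declared as a type).
symbol_table = {"size_t": "type", "count": "variable"}

def interpret_star_statement(first_id, second_id):
    """Disambiguate  X * Y;  : the context-free grammar alone cannot tell a pointer
    declaration from a multiplication, so the parser consults the symbol table
    built from declarations seen earlier."""
    if symbol_table.get(first_id) == "type":
        return f"declaration: {second_id} is a pointer to {first_id}"
    return f"expression: {first_id} multiplied by {second_id}"

print(interpret_star_statement("size_t", "p"))   # declaration: p is a pointer to size_t
print(interpret_star_statement("count", "p"))    # expression: count multiplied by p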

Neither lexing nor parsing technology is likely to go away soon. They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter, and classical regular expression parsers will continue to be used.

#6

What parsers and lexers have in common:

  1. They read symbols of some alphabet from their input.

    • Hint: The alphabet doesn't necessarily have to be of letters. But it has to be of symbols which are atomic for the language understood by the parser/lexer.
    • Symbols for the lexer: ASCII characters.
    • Symbols for the parser: the particular tokens, which are terminal symbols of their grammar.

  2. They analyse these symbols and try to match them with the grammar of the language they understand.

    • Here's where the real difference usually lies. See below for more.
    • Grammar understood by lexers: regular grammar (Chomsky level 3).
    • Grammar understood by parsers: context-free grammar (Chomsky level 2).

  3. They attach semantics (meaning) to the language pieces they find.

    • Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens. E.g. all these lexemes: * , == , <= , ^ will be classified as the "operator" token by the C/C++ lexer.
    • Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and by building the parse tree. E.g. all these token strings: [number][operator][number] , [id][operator][id] , [id][operator][number][operator][number] will be classified as the "expression" nonterminal by the C/C++ parser.

  4. They can attach some additional meaning (data) to the recognized elements.

    • When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token.
    • Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree. (A small sketch of this follows after the list.)

  5. They all produce on their output the proper sentences of the language they recognize.

    • Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data and for the one which reads them.
    • Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there aren't any reasons why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text. So it'll tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]...
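As mentioned in point 4, here is a small Python sketch (illustrative only) of attaching extra data: the lexer stores the converted numeric value with the "number" token, and a parser node could likewise carry the computed value of the whole expression:

import re

def lex_numbers(text):
    """Attach extra data to tokens: the numeric value is computed once, in the lexer."""
    for m in re.finditer(r"\d+|\S", text):
        lexeme = m.group()
        if lexeme.isdigit():
            yield {"token": "number", "lexeme": lexeme, "value": int(lexeme)}
        else:
            yield {"token": "operator", "lexeme": lexeme}

tokens = list(lex_numbers("2 + 40"))
print(tokens[0])   # {'token': 'number', 'lexeme': '2', 'value': 2}

# A parser could do the analogous thing one level up: attach the computed value
# of the whole expression to its "expression" node in the syntax tree.
expression_node = {"node": "expression", "value": tokens[0]["value"] + tokens[2]["value"]}
print(expression_node)   # {'node': 'expression', 'value': 42}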

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet), in the same way as sentences of one language can be alphabetic symbols of some other, higher-level language.

For example, if * and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in Morse code.

The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (eg the "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English Sentences" language.
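A toy Python sketch of that layering (the Morse table below is only partial and uses * and - as in the text above): each stage treats the previous stage's sentences as its own atomic symbols:

# Layer 1: strings over the alphabet {"*", "-"} are "sentences" of the Morse-letter language.
MORSE = {"*-": "A", "-*-*": "C", "-": "T"}   # partial table, just enough for the demo

def morse_to_letters(code_groups):
    """Each dot/dash group is a sentence at layer 1 and a single symbol at layer 2."""
    return [MORSE[group] for group in code_groups]

def letters_to_words(letters):
    """Layer 2: letters are now atomic symbols; a run of them forms a 'word' sentence."""
    return "".join(letters)

word = letters_to_words(morse_to_letters(["*-", "-*-*", "-"]))
print(word)   # 'ACT' -- which could in turn be one token for an "English Sentences" parser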

And all these languages differ only in the complexity of the grammar. Nothing more.

So what is it all about with these "Chomsky grammar levels"? Well, Noam Chomsky classified grammars into four levels depending on their complexity:
  • Level 3: Regular grammars

    They use regular expressions, that is, they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (eg a|b).

    They can be implemented as finite state automata (FSA), like an NFA (Nondeterministic Finite Automaton) or the better DFA (Deterministic Finite Automaton). Regular grammars can't handle nested syntax, eg properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. That's because a state automaton dealing with it would have to have infinitely many states to handle infinitely many nesting levels.

  • Level 2: Context-free grammars

    They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well. They can be implemented as a state automaton with a stack. This stack is used to represent the nesting level of the syntax (see the sketch after this list). In practice, they're usually implemented as a top-down, recursive-descent parser which uses the machine's procedure call stack to track the nesting level, and uses recursively called procedures/functions for every non-terminal symbol in the grammar. But they can't handle a context-sensitive syntax. Eg when you have the expression x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function etc.

  • Level 1: Context-sensitive grammars

  • Level 0: Unrestricted grammars. Also called recursively enumerable grammars.
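As a follow-up to the Level 3 and Level 2 points above, here is a minimal Python sketch of the stack idea: a fixed set of states cannot count nesting depth, but one explicit stack (the "push-down" part) handles any depth:

def balanced(text):
    """Recognize properly nested/matched parentheses with one explicit stack,
    i.e. the push-down automaton idea behind context-free parsing."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)          # one more open nesting level
        elif ch == ")":
            if not stack:
                return False          # closing with nothing open
            stack.pop()               # close the innermost level
    return not stack                  # every opened level must have been closed

print(balanced("(()()(()()))"))   # True  -- the example from the Level 3 bullet
print(balanced("(()"))            # False -- an unclosed level remains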
