
Implementing Code Visualization Requires Prior Knowledge of the Compiler Front End

Author: JD Cloud developer

1. Preamble

The previous article described the concept of "code visualization" and industry cases; for an overview, see the earlier article on "code visualization", and more related material can be found in the "Code Visualization" column. This article sorts out the compiler front-end knowledge needed to develop "code visualization" features. Owing to the author's limited ability, please bear with anything unclear or incorrect; for a more thorough and formal treatment of the topic, see the further reading listed at the end of the article.

2. Compiler

This series mainly covers the theory behind the compiler front end and middle end; the back end deals with target machine code and specific machine architectures, which are rarely needed for visualization. This article focuses on the front end, and the middle end will be covered in a later article.


2.1 Compiler Working Steps

[Figure: compiler working steps]



2.2 Compiler Front End

2.2.1 Lexical Analysis (Scanning)

2.2.1.1 Theoretical knowledge

Lexical analysis, also known as scanning, reads in the stream of characters that makes up the source program and groups it into a sequence of meaningful lexemes. A lexeme is the smallest meaningful unit of the source program, such as a keyword, identifier, constant, operator, or delimiter. For each lexeme, the lexical analyzer produces a corresponding lexical unit (token) as output.

token: <kind code, attribute value>
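
For example, a simple statement might be broken into the token stream below (the kind names are only illustrative; real tools use their own kind codes and attribute encodings):

int count = count + 10;
→ <KEYWORD, "int"> <IDENTIFIER, "count"> <ASSIGN, "="> <IDENTIFIER, "count"> <PLUS, "+"> <INT_LITERAL, "10"> <SEMI, ";">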


The core logic of a lexer is based on finite automata, which can be understood as machines with a finite number of states that map the scanned characters onto a finite set of possibilities. The main types are:

•Nondeterministic finite automaton (NFA): for a given state and input symbol there may be several possible next states;

• Deterministic finite automaton (DFA): for any given state and input symbol there is exactly one next state.

The overall construction process (typically regular expression → NFA → DFA → minimized DFA) is shown in the figure below. A general understanding is enough here; if you want to learn the details of the individual algorithms, you can look them up yourself.

[Figure: finite automaton construction process]
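
To make the idea concrete, here is a minimal hand-written, DFA-style scanner sketch in Java. It is only an illustration under simplifying assumptions (the state and token names are made up, it recognizes nothing but identifiers and integer literals, and it simply skips every other character); it is not how an Antlr-generated lexer is implemented.

import java.util.ArrayList;
import java.util.List;

public class TinyLexer {
    enum State { START, IN_IDENTIFIER, IN_NUMBER }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder lexeme = new StringBuilder();
        State state = State.START;
        // The appended space guarantees the final lexeme is flushed.
        for (char c : (input + " ").toCharArray()) {
            switch (state) {
                case START:
                    if (Character.isLetter(c)) { state = State.IN_IDENTIFIER; lexeme.append(c); }
                    else if (Character.isDigit(c)) { state = State.IN_NUMBER; lexeme.append(c); }
                    // whitespace, operators, etc. are simply skipped in this sketch
                    break;
                case IN_IDENTIFIER:
                    if (Character.isLetterOrDigit(c)) { lexeme.append(c); }
                    else { tokens.add("<IDENTIFIER, " + lexeme + ">"); lexeme.setLength(0); state = State.START; }
                    break;
                case IN_NUMBER:
                    if (Character.isDigit(c)) { lexeme.append(c); }
                    else { tokens.add("<INT_LITERAL, " + lexeme + ">"); lexeme.setLength(0); state = State.START; }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [<IDENTIFIER, count>, <INT_LITERAL, 42>]
        System.out.println(tokenize("count 42"));
    }
}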



2.2.1.2 Put it into practice

Next, let's practice using Antlr to perform lexical analysis on Java source code. Antlr is an open-source tool that generates lexers and parsers from rule files and is itself implemented in Java; it can be installed with Homebrew on macOS or used directly through the IntelliJ IDEA plugin antlr-v4. The grammars-v4 repository provides many ready-made rule files for reference, and here we will directly use its lexical rules defined for Java 8.

•Lexical rule definition: Java8Lexer.g4

// Excerpt from the grammar
// Keyword definitions
ABSTRACT     : 'abstract';
ASSERT       : 'assert';
BOOLEAN      : 'boolean';
BREAK        : 'break';
BYTE         : 'byte';
CASE         : 'case';
CATCH        : 'catch';
CHAR         : 'char';
...
// String literal definitions
StringLiteral: '"' StringCharacters? '"';
fragment StringCharacters: StringCharacter+;
fragment StringCharacter: ~["\\\r\n] | EscapeSequence;
...           

• Java code to be parsed

public class HelloWorld { 
   public static void main(String[] args) { 
      System.out.println("Hello, World");
   }
}           

• Use Antlr to generate a lexical analyzer and perform analysis operations

# ① Compile the lexer rules
antlr Java8Lexer.g4 
# ② Compile the Java files generated in the previous step (note: the Antlr JAR must be on the CLASSPATH, or compilation will fail)
javac Java8Lexer.java
# ③ Run the generated lexer to obtain the analysis result
grun Java8Lexer tokens -tokens ./examples/helloworld.java            
[Figure: token stream produced by the generated lexer]



2.2.2 Syntax Analysis (Parsing)

2.2.2.1 Theoretical knowledge

Syntax analysis, also known as parsing, is performed after lexical analysis. It organizes the tokens into a syntactic structure, usually an Abstract Syntax Tree (AST), which represents the grammatical structure of the source code. The parser analyzes the token sequence against a set of predefined grammar rules, usually defined as a Context-Free Grammar (CFG), in which each rule describes how one construct of the language is composed of other constructs.


Here is a brief introduction to CFGs; for a deeper treatment, consult the references. A context-free grammar consists of the following four parts:

•Non-terminals: grammar variables, each of which represents a set of strings. They are usually written with capital letters, such as A, B, Expr;

• Terminals: the basic symbols of the grammar, which make up the strings of the language. In a programming language, terminals can be keywords, operators, identifiers, and so on. They are usually written as lowercase letters, digits, or other symbols;

•Production rules: rules that define how a non-terminal can be replaced by a sequence of terminals and/or non-terminals. A rule usually has the form A → B C, meaning that the non-terminal A can be replaced by B C;

•Start symbol: a special non-terminal that represents the starting point of the entire language or grammar.

----- An example -----
S → a S b
S → ε
----- Explanation -----
· The non-terminal is S.
· The terminals are a and b.
· There are two production rules: S → a S b means S can be replaced by a S b, and S → ε means S can be replaced by the empty string (ε denotes the empty string).
· The start symbol is S.
The language generated by this grammar is the set of balanced strings of a's followed by b's, e.g. ab, aabb, aaabbb.

The core task of syntax analysis is, given a grammar G and a sentence s, to decide whether s can be derived from G. Implementations can be broadly divided into top-down and bottom-up syntax analysis:

•Top-down syntax analysis: builds the parse tree from the top of the tree, i.e. the start symbol. The parser tries to determine which production the input string can start with and keeps expanding productions until the complete input string is derived. Top-down parsing is intuitive and easy to implement, especially for simple grammars, but it cannot directly handle left-recursive grammars and may not be efficient enough for complex ones. Common algorithms are recursive descent parsing and LL parsing; a minimal recursive-descent sketch is given after this list, and the detailed algorithms can be looked up separately.

• Bottom-up syntax analysis: builds the parse tree from the bottom of the tree, i.e. the input string. The parser looks for substrings of the input that match the right-hand side of some production and reduces them to the non-terminal on the left-hand side, gradually shrinking the whole string until only the start symbol remains. Bottom-up parsing is generally more powerful than top-down parsing because it can handle more complex grammars, including some that top-down parsers cannot, but its parsing tables are more complex and it is harder to implement by hand. A common algorithm is LR parsing.
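
As a concrete illustration of the top-down approach, the following is a minimal recursive-descent sketch in Java for the small grammar expr → term (('+' | '-') term)*, term → factor (('*' | '/') factor)*, factor → NUMBER. It is only a sketch under simplifying assumptions: the tokens are pre-split strings, the parser evaluates the expression directly instead of building an AST, and the left recursion of the textbook grammar has been rewritten as iteration, the usual workaround for top-down parsers.

public class TinyParser {
    private final String[] tokens;  // pre-split token texts, e.g. {"1", "+", "2", "*", "3"}
    private int pos = 0;

    TinyParser(String[] tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.length ? tokens[pos] : null; }
    private String next() { return tokens[pos++]; }

    // expr → term (('+' | '-') term)*
    int expr() {
        int value = term();
        while ("+".equals(peek()) || "-".equals(peek())) {
            value = next().equals("+") ? value + term() : value - term();
        }
        return value;
    }

    // term → factor (('*' | '/') factor)*
    int term() {
        int value = factor();
        while ("*".equals(peek()) || "/".equals(peek())) {
            value = next().equals("*") ? value * factor() : value / factor();
        }
        return value;
    }

    // factor → NUMBER
    int factor() { return Integer.parseInt(next()); }

    public static void main(String[] args) {
        // 1 + 2 * 3 evaluates to 7: term() is consumed before '+' is handled, so '*' binds tighter
        System.out.println(new TinyParser(new String[]{"1", "+", "2", "*", "3"}).expr());
    }
}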

2.2.2.2 Practice

After covering the basic concepts, let's again use Antlr for some practice, this time to perform syntax analysis. We will not use the Java grammar defined in grammars-v4, because the grammar of a full programming language is complex and the resulting AST is hard to read; instead, we use a small expression grammar.

•Grammar rule definitions (lexer rules: CommonLexer.g4; parser rules: PlayScript.g4)

grammar PlayScript;
import CommonLexer;   // import the lexer definitions

/* The following content is added to the top of the generated Java source file, e.g. the package name and import statements. */
@header {
package antlrtest;
}

expression
    :   assignmentExpression
    |   expression ',' assignmentExpression
    ;

assignmentExpression
    :   additiveExpression
    |   Identifier assignmentOperator additiveExpression
    ;

assignmentOperator
    :   '='
    |   '*='
    |  '/='
    |   '%='
    |   '+='
    |   '-='
    ;

additiveExpression
    :   multiplicativeExpression
    |   additiveExpression '+' multiplicativeExpression
    |   additiveExpression '-' multiplicativeExpression
    ;

multiplicativeExpression
    :   primaryExpression
    |   multiplicativeExpression '*' primaryExpression
    |   multiplicativeExpression '/' primaryExpression
    |   multiplicativeExpression '%' primaryExpression
    ;           

• Use Antlr to generate a parser and perform analysis operations

# ① Compile the grammar rules
antlr PlayScript.g4
# ② Compile the Java files generated in the previous step (note: the Antlr JAR must be on the CLASSPATH, or compilation will fail)
javac *.java
# ③ Run the generated parser to obtain the analysis result (after entering the expression, press ^D to trigger AST rendering)
grun antlrtest.PlayScript expression -gui
age + 10 * 2 + 10
^D           
[Figure: AST rendered by grun -gui]



2.2.3 Semantic Analysis

2.2.3.1 Theoretical knowledge

The semantic analyzer uses the information in the syntax tree and the symbol table to check whether the source program is consistent with the semantics defined by the language. It also collects type information and stores it in the syntax tree or the symbol table for later use during intermediate code generation. Semantic rules generally include, but are not limited to:

•Type checking: ensure that the types of operands are compatible with their operators, e.g. an integer may not be assigned to a variable of string type;

•Variable binding: ensure that declarations of variables and functions match their uses, that variables are declared before they are used, and that the number and types of arguments in a function call match the declaration;

•Control-flow checking: ensure that control-flow statements such as loops and conditionals are used legally in the program;

•Uniqueness checking: ensure that an identifier is declared only once within the same scope, e.g. two variables with the same name may not be declared in one scope;

•Permission and accessibility checking: ensure that accesses to variables and functions comply with their declared access levels, e.g. private members may only be accessed within their own class.

Each language has its own semantic rules and features: the type system, scoping rules, legal operator overloading, and so on are all language-specific. Semantic analysis must therefore be designed for the specification of each individual language.
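
As a concrete (purely illustrative) example, the following class is lexically and syntactically valid Java but fails several of the checks above; the comments paraphrase the kind of error javac would report, not its exact wording.

public class SemanticErrors {
    private int secret = 42;

    void demo() {
        String s = 10;      // type checking: an int cannot be assigned to a String variable
        undeclared = 5;     // variable binding: the symbol 'undeclared' was never declared
        int x = 1;
        int x = 2;          // uniqueness check: variable x is already defined in this scope
    }
}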

2.2.3.2 Put it into practice

Because the implementation of semantic analysis differs greatly between languages, there is no general-purpose semantic analyzer generator. Instead, let's look at the relevant source code in the Java compiler (javac) to understand the implementation logic. The source code for semantic analysis in javac lives in the com.sun.tools.javac.code and com.sun.tools.javac.comp packages.


The following lists some of the classes involved in semantic analysis; the source code is not pasted here and can be downloaded and read in the OpenJDK langtools repository. A small sketch that drives these checks through the standard compiler API follows the list:

•Symbol: Represents all language symbols, including variables, methods, classes, etc. These symbols are created and populated into the symbol table during the compilation process;

•Scope: Used to manage scopes, it holds all symbols in a scope. This is important for parsing variable names and method names to ensure that they are visible and valid in the current context;

•Type: Represents all types in the Java language, including primitive types, class and interface types, array types, and so on. javac uses these classes during type checking to determine type compatibility and perform type conversions;

•Attr: The core class for attribution, responsible for attaching type information and other attributes to nodes of the syntax tree. It performs tasks such as name resolution, type checking, and constant folding;

•Check: Used to perform various semantic checks, such as checking if a type has circular inheritance, or if a class implements all the methods of its interface;

•Resolve: Used to resolve method calls, type names, and variable names. It looks up the correct symbol in the symbol table and handles complex situations such as method overload resolution;

•Annotate: handles semantic analysis related to annotations, including the parsing and application of annotations;

•Types: provides a set of utility methods for operating on types, such as determining whether one type is assignable to another, or finding the nearest common supertype;

•Flow: Responsible for control flow analysis, such as checking whether a variable has been assigned a value before it is used, or whether the method always has a return value;

•LambdaToMethod: Java 8 introduced lambda expressions; this class is responsible for translating lambda expressions into methods (or anonymous classes);

•TransTypes and Lower: these classes are responsible for a number of type-related transformations, including bridge methods for generics and autoboxing/unboxing.
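
To see these checks in action without reading javac's internals, you can drive the compiler through the standard javax.tools API and inspect the diagnostics it produces. The sketch below is illustrative (the class names and the sample source are made up), but the API calls are the standard ones.

import javax.tools.*;
import java.net.URI;
import java.util.List;

public class SemanticCheckDemo {
    // An in-memory Java source file, so nothing needs to exist on disk.
    static class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + ".java"), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    public static void main(String[] args) {
        String code = "class Broken { void m() { String s = 10; } }"; // contains a semantic error
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavaCompiler.CompilationTask task = compiler.getTask(
                null, null, diagnostics, null, null,
                List.of(new StringSource("Broken", code)));
        task.call();
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            // e.g. ERROR at line 1: incompatible types: int cannot be converted to java.lang.String
            System.out.println(d.getKind() + " at line " + d.getLineNumber() + ": " + d.getMessage(null));
        }
    }
}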

3. Read more

• Classic books: the Dragon Book (Compilers: Principles, Techniques, and Tools), the Tiger Book (Modern Compiler Implementation), and the Whale Book (Advanced Compiler Design and Implementation)

•CS143 (Stanford's compilers course)

•Principles of Compilation, NetEase Cloud Classroom

• Principles of Compilation, China University MOOC