
Turing Award winner and author of the Dragon Book explains at length: What is "abstraction"?


Compiled by | bluemin

Editor | Chen Caixian

1

Abstraction

Computational thinking centers on designing abstract models of problems and then applying computational steps and efficient algorithms to solve them. This way of thinking serves not only computer science (CS) but is gradually permeating the sciences and everyday life.

"Abstraction" is at the heart of computational thinking and is the subject of this article. "Abstraction" has always been an important concept in computer science, and the emphasis on computational thinking in teaching computer knowledge to a wide audience highlights the importance of abstraction.

In computer science, abstractions are not constrained by physical reality, so we find useful abstractions everywhere, quantum mechanics being one example. It has a derived computational abstraction called the "quantum circuit": starting from a physical concept, it has catalyzed programming languages for simulation as well as theoretical algorithms that exploit its unique capabilities and that may one day be implemented on large-scale machines.

"Abstractions" in computer science often include the following:

A data model contains one or more types of data and the relationships that may exist between the data. For example, an undirected graph can be described as consisting of nodes and edges, each of which connects two nodes.

A programming language for manipulating that data. This may be a conventional programming language, or it may be limited to a few specific operations. The language always has a formal semantics: a specification of how programs affect the data.

Thus, each abstract model allows us to design algorithms that manipulate data in specific ways.

Our goal is to design "high-quality" abstractions, and quality has several dimensions. How easily the abstraction lets us design a solution is one important measure; for example, we will discuss in Section 3.1 how the relational model led to a spike in database usage. The resulting algorithms have other performance measures, such as running time on serial or parallel machines. Similarly, we favor abstractions that are easy to implement. Finally, some abstractions provide a simple way to measure the efficiency of an algorithm (just as, for a conventional programming language, we can estimate an upper bound on a program's running time), while others require us to descend to a lower-level implementation before we can even approximate an algorithm's efficiency.

1.1 Compilation

Some abstractions are too high-level to provide meaningful performance metrics. Therefore, high-level abstraction operations may need to be implemented at a lower level.

In fact, there may be several levels of abstraction, each progressively closer to the machine itself. As shown in Figure 1, the operations of a higher-level abstraction (Abstraction 1) can be implemented by a lower-level abstraction (Abstraction 2), which in turn can be implemented by a still lower-level abstraction (not shown in the figure). There is an interesting sequence of abstraction layers that takes us from high-level programs to machine instructions, physical hardware, logic gates, transistors, and finally to electrons. However, we focus only on the higher levels.


Figure 1. Layers of abstractions and their algorithms

Algorithms expressed in terms of Abstraction 1 are compiled into algorithms in the lower-level Abstraction 2. In this article we use the term compiler in a general sense: not just the conventional programming-language compiler featured in the Dragon Book, but any algorithm that converts a program in one abstraction into a program in another, presumably lower-level, abstraction. In some cases the compilation process is simple, with each operation at the higher level replaced by one or more specific operations at the lower level. In other cases, especially when compiling from a conventional language (such as C) to a machine-level language, the translation algorithm is very complex. In still other cases, such as when the high-level abstraction uses powerful algebraic operations like linear algebra or relational algebra, optimization is critical, because a naive compilation often yields an algorithm that runs orders of magnitude slower than the one produced by an optimizing compilation.

Abstraction 2 may be close enough to the machine level to have meaningful performance metrics. If so, Abstraction 1 can inherit those metrics, providing a good notion of quality for algorithms written in Abstraction 1. But a high-level abstraction can often be implemented by several different low-level abstractions. Each low-level abstraction may provide a completely different notion of running time or other measures, so at the high level there can be several different notions of what makes an algorithm good.

1.2 A taxonomy of abstractions

We can identify at least four different types of abstractions that can be divided according to their intended goals. In the discussion that constitutes the bulk of this article, we will give corresponding examples and explore their interactions.

1.2.1 Basic abstractions

Like all abstractions, basic abstractions consist of a data model and operations. These abstractions are often thought of as the abstract data types embodied in object-oriented programming languages. But a basic abstraction has no concrete implementation of its operations, nor a specific data structure to represent its data. One can also liken these abstractions to interfaces in Java, but unlike interfaces, they come with an intended meaning for their operations, not just the names of the operations.

Studying basic abstractions actually serves two rather different purposes. Some represent common operations that are worth studying on their own and that have several alternative implementations. For example, we discuss the dictionary (a set supporting insert, delete, and lookup operations) in Section 1.4. Other examples of this kind include stacks, queues, priority queues, and many more.

Other abstractions are very broad and can support large components of an application. Common examples include a variety of trees and graphs, such as directed graphs, undirected graphs, labeled graphs, and unlabeled graphs.

These abstractions have a broad set of operations that can be combined in many ways. The operations are not, however, Turing-complete by themselves. Rather, they are assumed to be embedded in a Turing-complete language, in which algorithms that use the model are written. For example, in a graph abstraction we might have an operation such as "find neighbor nodes." Outside the abstraction, we assume there is a programming language that allows, say, iteration over all the neighbors of a node. Neither the implementation of this operation nor the representation of the graph itself is specified, so we have no concrete notion of running time. We can compare these abstractions to typical classes and their methods in an object-oriented programming language; the difference is that the methods of a class have specific implementations in the underlying programming language. Likewise, we can think of things such as programming-language libraries or TeX packages as abstractions of this type.
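
As an illustration of how such a basic abstraction can be written down without committing to any implementation, here is a minimal Python sketch (not from the original article); the class name Graph and the operation names are assumptions made for this example.

from abc import ABC, abstractmethod
from typing import Hashable, Iterable

class Graph(ABC):
    """A basic abstraction of an undirected graph.

    Only the intended meaning of each operation is given; no data
    structure or implementation is chosen, so there is no notion of
    running time at this level.
    """

    @abstractmethod
    def add_node(self, v: Hashable) -> None:
        """Add node v to the graph (no effect if already present)."""

    @abstractmethod
    def add_edge(self, u: Hashable, v: Hashable) -> None:
        """Add an edge connecting nodes u and v."""

    @abstractmethod
    def neighbors(self, v: Hashable) -> Iterable[Hashable]:
        """Yield the nodes adjacent to v ("find neighbor nodes")."""

# An algorithm built on top of the abstraction, in the Turing-complete
# host language, can iterate over neighbors without knowing how the
# graph is represented:
def degree(g: Graph, v) -> int:
    return sum(1 for _ in g.neighbors(v))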

1.2.2 Abstract implementation

These abstractions represent approaches to implementation; each can serve as an implementation of one or more basic abstractions. Their languages are not Turing-complete in their own right, and they can typically be compiled to several different machine models, for example serial or parallel machines, or models that use main memory or secondary storage. Each machine model provides a notion of running time that can be translated into a running time for the abstract implementation, and from there into a running time for the basic abstraction being supported.

This type also includes automata, such as deterministic and non-deterministic finite automata (see Sections 2.1.1 and 2.1.3) and shift-reduce parsers (see Section 2.2.1).

1.2.3 Declarative abstractions

One of the most important uses of abstraction is to foster a style of programming that simply states what you want to do, not how to do it. As a result, we find many different abstractions, including a data model and a programming language that is more advanced than traditional languages; these languages are usually some sort of algebra. Examples include regular expressions (which will be discussed in Section 2.1) and relational algebra (which will be mentioned in Section 3). Context-free grammar (section 2.2), although not algebra in the strict sense, is another example of this type of abstraction.

The peculiarity of these abstractions is that their compilation requires serious optimization. For a conventional language, a good optimizer might roughly double a program's speed on the machine, but for these abstractions there can be orders of magnitude between the running times of good and bad implementations. Another feature is that the programming languages of declarative abstractions are not Turing-complete. The undecidability inherent in any Turing-complete language would rule out the kind of optimizer these abstractions rely on: one that can handle "what you want to do" efficiently and comprehensively without being told how to do it.

1.2.4 Computational Abstraction

In contrast to abstract implementations, these abstractions are close to machines that are physically implemented. That is, no one would build a machine just to realize an abstract implementation, but machines do implement computational abstractions, or something easily translated from them. Computational abstractions therefore provide meaningful performance metrics, even if they are not 100% accurate.

You are probably familiar with many computational abstractions, because they include all the general-purpose programming languages as well as machine instruction sets. Other abstractions of this type are more theoretical, such as the random-access machine (RAM) model and the parallel random-access machine (PRAM) model. Here, we will discuss a conventional machine model that emphasizes the role of secondary storage in Section 1.6. We will also discuss abstractions for parallel computing: bulk synchronization in Section 3.5 and the MapReduce model in Section 3.6.

While many computational abstractions are associated with conventional computers, there are exceptions. The Turing machine is one; some are not even Turing-complete yet still play an important role in computer science. For example, following Claude Shannon's master's thesis, Boolean circuits and Boolean algebra were among the earliest abstractions used in computing, while the quantum-circuit abstraction is among the newest.

1.3 Exploring the space of abstractions

To understand the nature of chains of abstractions and the relationships among them, let us look at a common example of a basic abstraction: the dictionary.

The dictionary is a common abstraction that has many alternative implementations and illustrates some of the issues that arise as high-level abstractions are compiled into lower-level ones. The dictionary's data model includes the following:

A universal set U.

A subset S of U, which is initially empty.

The dictionary's "programming language" consists of a sequence of statements using the following three operations:

Insert(x): If the element x of U is not already in the set S, insert it, i.e., S := S ∪ {x}.

Delete(x): Remove the element x from the set S, i.e., S := S - {x}.

Lookup(x): Return true if the element x is in the set S, and false otherwise.

For example, a dictionary can describe the behavior of the symbol table in a compiler. U would be the set of possible identifiers of the programming language. As the compiler scans the program, S is the set of identifiers that have a defined meaning at that point in the program. For a symbol table, data must actually be attached to each identifier, such as the type it was declared with and the nesting level of the block in which it appears (so that we can distinguish identifiers with the same name). When the compiler encounters a declaration, it inserts the declared identifier into the set S. When it reaches the end of a procedure or function, it deletes the identifiers associated with that program block. When an identifier is used in the program, the compiler looks it up and retrieves its type and other necessary information.

Note that the dictionary's programming language is quite simple. It does not have the power of a Turing machine, and there is no real notion of algorithm design, because a "program" simply reflects what some other process is doing. Likewise, there is no real notion of running time, since it is not specified how long each operation takes. We could define each operation to take one unit of time, but since we cannot control the length of the "program," such a running time would be meaningless.
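
To make the dictionary abstraction concrete, here is a minimal Python sketch of the interface just described (illustrative code, not from the article); it fixes only the intended meaning of the three operations and deliberately leaves the representation of S unspecified.

from abc import ABC, abstractmethod
from typing import Hashable

class Dictionary(ABC):
    """The dictionary abstraction: a subset S of a universal set U,
    initially empty, with three operations."""

    @abstractmethod
    def insert(self, x: Hashable) -> None:
        """If x is not in S, add it: S := S union {x}."""

    @abstractmethod
    def delete(self, x: Hashable) -> None:
        """Remove x from S: S := S - {x}."""

    @abstractmethod
    def lookup(self, x: Hashable) -> bool:
        """Return True if x is in S, False otherwise."""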

1.4 Dictionary implementation

A dictionary can be implemented by many different abstract implementations. The linked list is an abstract implementation that should be familiar to everyone; its data model includes the following:

Cells, each containing a value (a member of the universal set U) and a link to another cell (which may be null).

A header, which is simply a named link to a cell (possibly null).

We assume the reader is familiar with the typical operations that can be performed, such as creating cells or headers, inserting and removing cells from a list, and returning the data contained in a given cell. A dictionary can be implemented by keeping a linked list of all the elements currently in the set S. Compiling the three dictionary operations into list operations is straightforward.

If we assume the linked list is implemented on the RAM model of a computer, we get a realistic notion of running time. We can assign one unit of time to each basic operation on a list cell, because on a RAM each such operation takes constant time. This observation lets us lift the notion of running time from the RAM level to the linked-list level and then to the dictionary level. But the news is not good: on average we must traverse at least half the list, and often the entire list, to perform any one dictionary operation. The running time of a single dictionary operation is therefore proportional to the size of the set S at the time.
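
The following sketch, building on the Dictionary interface above, is one plausible rendering of the linked-list abstract implementation just described (an illustration, not the article's code); each operation walks the list, so its cost grows with the size of S.

class Cell:
    """A cell holds a value from U and a link to the next cell (or None)."""
    def __init__(self, value, link=None):
        self.value = value
        self.link = link

class LinkedListDictionary(Dictionary):   # Dictionary as sketched above
    def __init__(self):
        self.header = None          # link to the first cell, initially empty

    def lookup(self, x) -> bool:
        cell = self.header
        while cell is not None:     # on average, scans half the list
            if cell.value == x:
                return True
            cell = cell.link
        return False

    def insert(self, x) -> None:
        if not self.lookup(x):      # keep S a set: no duplicates
            self.header = Cell(x, self.header)

    def delete(self, x) -> None:
        prev, cell = None, self.header
        while cell is not None:
            if cell.value == x:
                if prev is None:
                    self.header = cell.link
                else:
                    prev.link = cell.link
                return
            prev, cell = cell, cell.link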

Another familiar abstract implementation of the dictionary is the search tree. If the algorithms for the three dictionary operations keep the tree balanced, as in an AVL tree or a red-black tree, the running time of each operation is logarithmic in the size of the set S at the time of the operation. But the usually preferred abstract implementation of the dictionary is the hash table.

1.5 Hash abstraction

The data model for hashing includes the following:

The universal set U.

A number B of hash buckets, numbered 0 through B-1.

A hash function h from U to {0, 1, ..., B-1}. Each hash bucket b holds the subset of elements x of U such that h(x) = b.

The typical operations are to compute h(x), where x is a member of U, and to insert, delete, or look up x in the hash bucket numbered h(x). For example, an insertion into the hash table would be written h-insert(x, b), where b = h(x). A hashing program thus consists of steps that each compute some h(x) and then perform one of the three operations on x and the hash bucket h(x).

Compiling a dictionary program into a hashing program is simple. For example, the dictionary operation insert(x) is translated into b := h(x); h-insert(x, b).

Hashing is still somewhat removed from the machine, and we cannot use it directly to determine running time. One problem is that hashing is rather unusual: its worst-case performance, in which all the elements of the set land in the same hash bucket, is far worse than the average case, taken over all possible hash functions. For simplicity, we shall assume (correctly, for the average case) that almost all hash buckets contain close to the average number of elements, i.e., S/B. But even if we agree to discuss only the average case, we still do not know how long each operation on an element and a hash bucket takes.

Essentially, each hash bucket is itself a small dictionary, so we must decide how to implement its operations. If the size of S stays on the order of B, we can implement each bucket as a linked list and expect each operation to take O(1) time on average, on a RAM or a real machine. If, however, S is much larger than B, the average length of a bucket's list is O(S/B). That is still better than the O(S) time per operation of a single list, but when S is too large to fit in main memory, the RAM model no longer applies and we need to consider another computational abstraction.
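
Here is a hedged sketch of the hashing abstraction compiled down to the linked-list implementation, reusing the classes sketched above; Python's built-in hash reduced modulo B stands in for the hash function h, which is an assumption of this example.

class HashTableDictionary(Dictionary):
    """Dictionary via hashing: bucket b holds the elements x of S with
    h(x) == b, and each bucket is itself a small dictionary (a linked
    list here)."""

    def __init__(self, num_buckets: int = 1024):
        self.B = num_buckets
        self.buckets = [LinkedListDictionary() for _ in range(self.B)]

    def _h(self, x) -> int:
        # Stand-in for the hash function h: U -> {0, ..., B-1}.
        return hash(x) % self.B

    # Each dictionary operation compiles into "compute h(x), then do the
    # operation on bucket h(x)", e.g. insert(x) becomes b := h(x); h-insert(x, b).
    def insert(self, x) -> None:
        self.buckets[self._h(x)].insert(x)

    def delete(self, x) -> None:
        self.buckets[self._h(x)].delete(x)

    def lookup(self, x) -> bool:
        return self.buckets[self._h(x)].lookup(x)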

1.6 Secondary storage abstraction

As an alternative to the RAM abstraction, in which any piece of data can be accessed in O(1) time, we can introduce models with locality of access at one or more levels. We will discuss only an abstraction with disk-based secondary memory, in which large blocks (say 64KB) move between disk and primary memory as a unit, and data must be in primary memory to be read or written. Moving a block between primary and secondary memory is expensive compared with the cost of a typical operation on the data itself in primary memory. In this model it is therefore reasonable to take the running time to be simply the number of disk I/Os, that is, the number of times a block is moved from secondary to primary memory or vice versa.

With a secondary-storage model of the underlying machine, the best way to implement a hash table differs somewhat from the method preferred under the RAM model. In particular, each hash bucket will consist of one or more whole disk blocks. To exploit locality, we want a typical hash bucket to consist of as few disk blocks as possible, yet we want those disk blocks to be as full as possible. So suppose primary memory can hold M elements of the universal set, while a disk block can hold P such elements. Then we want the number of hash buckets to be B = M/P, so we can keep one disk block per hash bucket in primary memory, and that block will tend to be nearly full.

As the size of the set S grows, we represent each hash bucket by a linked list of disk blocks, with only the first block of each bucket in main memory. In the worst case, each of the three dictionary operations requires examining all the disk blocks of a single bucket. On average, we therefore expect O(S/(BP)) disk I/Os per operation, since the elements of S will be distributed roughly evenly among the B buckets, and the elements of a single bucket are packed P to a disk block. Since B = M/P, the cost of each operation is O(S/M) disk I/Os.
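
As a back-of-the-envelope illustration of the analysis above, with invented parameter values, the expected number of disk I/Os per operation comes out to S/(BP) = S/M:

# Hypothetical parameters, for illustration only.
S = 10_000_000          # number of elements currently in the set
P = 4_000               # elements that fit in one disk block
M = 1_000_000           # elements that fit in primary memory
B = M // P              # number of buckets, one in-memory block each -> 250

blocks_per_bucket = S / (B * P)   # average length of a bucket's block list
print(blocks_per_bucket)          # 10.0, which equals S / M
print(S / M)                      # 10.0 expected disk I/Os per operation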

2

Abstractions in the compiler

Modern compilers refine the translation process into stages, each of which converts one representation of the source program into another semantically equivalent representation, often at a lower level of abstraction. Phases in the compiler typically include lexical analysis, syntactic analysis, semantic analysis, intermediate code generation, code optimization, and object code generation. Symbol tables shared by all stages are used to collect and provide information about various structures in the source program. The first four phases are often referred to as the compiler's front end, and the last two phases are called the back end.

Progress in compiler implementation has involved many important abstractions. We will discuss three of them in particular: regular expressions, context-free grammars, and flow graphs. The first two are declarative abstractions with interesting optimization stories. The third, while not declarative, presents interesting implementation challenges.

2.1 Regular expressions and lexical analysis

Lexical analysis is the first stage of the compiler. It reads the source program as a sequence of characters, maps it into a sequence of symbols called tokens, and passes them to the next stage, the parser.

Example 2.1 If the source program contains the statement fahrenheit = celsius * 1.8 + 32, the lexical analyzer can map it to the sequence of seven tokens id = id * real + int. Here id is the token for any program variable or identifier; the operators =, *, and + are tokens in themselves; and the two constants become the tokens real and int, respectively.

A big step forward in compiler construction was the creation of the lexical-analyzer generator: a program, such as Lex, that takes descriptions of the tokens as input and generates a program that breaks the source program into tokens and returns the sequence of tokens corresponding to the source program. The abstraction that makes Lex possible is the regular expression. Systems like Lex that use the regular-expression abstraction add many useful shorthands that make regular expressions easier to write without changing the set of string languages that can be defined.

Example 2.2 In some programming languages, the set of strings that are legal identifiers can be defined as follows:

letter = [a-zA-Z]

digit = [0-9]

id = letter (letter+digit)*

In this shorthand, an expression like [a-z] stands for the union of the one-character strings whose ASCII codes lie between those of a and z. Written with only the three basic regular-expression operators, letter is therefore:

a+b+...+z+A+B+...+Z

digit is defined similarly, and the set of identifier strings, id, is then defined as a letter followed by zero or more letters and/or digits.
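
A small sketch of the same definitions using Python's re module (an illustration, not Lex itself); note that | plays the role of the union operator written + in the shorthand above.

import re

letter = r"[a-zA-Z]"
digit = r"[0-9]"
id_re = rf"{letter}({letter}|{digit})*"   # a letter, then letters and/or digits

# Check a few candidate identifiers against the definition.
for s in ["x", "celsius2", "2bad", "fahrenheit"]:
    print(s, bool(re.fullmatch(id_re, s)))
# x True, celsius2 True, 2bad False, fahrenheit True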

2.1.1 Before Lex: bibliographic search

It was well understood from theoretical research that the regular-expression abstraction can be compiled into one of several abstract implementations, such as deterministic or non-deterministic finite automata (DFAs and NFAs). However, when it came to solving practical problems, some engineering advances were still needed.

Bell Labs took an interesting early step toward automatic literature search: the titles of the entire Bell Labs library were stored on tape, and software was developed that took a list of keywords and found the documents containing any of those keywords. However, given a long list of keywords, the search was slow, because it made a pass over the tape for each keyword.

The Aho-Corasick algorithm improved on this. Instead of searching for each keyword individually, the keyword list is treated as a single regular expression denoting the set of all strings that contain an occurrence of any keyword, namely .* (k1 + k2 + ... + kn) .*, where k1, ..., kn are the keywords and the dot stands for "any character." The expression is converted into a deterministic finite automaton. No matter how many keywords are involved, a single pass over the tape suffices: each title is examined once by the finite automaton to see whether any keyword occurs in it.
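
The sketch below illustrates the key idea with Python's re module rather than the actual Aho-Corasick automaton construction: all keywords are combined into one expression so that the collection of titles is scanned in a single pass. The keyword list and titles are invented for the example.

import re

keywords = ["automata", "compiler", "grammar"]     # hypothetical keyword list

# One expression for all keywords: equivalent in spirit to
# ".* (k1 + k2 + ... + kn) .*" with + as the union operator.
pattern = re.compile("|".join(re.escape(k) for k in keywords))

titles = [
    "On the translation of languages from left to right",
    "A technique for speeding up LR(k) parsers",
    "Finite automata and their decision problems",
]

# A single pass over the titles; each title is examined once.
for title in titles:
    if pattern.search(title):
        print("match:", title)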

2.1.2 The design of lexical-analyzer generators

Lexical-analyzer generators such as Lex embody essentially the same idea as in Section 2.1.1. We write a regular expression for each token and then apply the union operator to those expressions. The resulting expression is converted into a deterministic finite automaton, which reads characters until it has found a prefix of the input that matches a token; it then removes the characters read from the input, appends the token to the output stream, and repeats.

There are some additional considerations, because unlike keywords in the bibliographic search, tokens can interact in complicated ways. For example, while looks like an identifier, but it is actually a keyword used to control the flow of a program. So when this character sequence is seen, the lexical analyzer must return the token while, not id. In Lex, ambiguities such as this are resolved by the order in which the regular expressions are listed in its input file, so all that is needed is to list the keywords before the identifiers, ensuring that keywords are recognized as such and not treated as identifiers. Another problem is that one token can be a prefix of another; for example, < and <= might both be tokens. Having seen <, if the next input character is =, we do not want to return the token < but rather read the = and return the token <=.

2.1.3 Lazy evaluation of DFAs

There is another optimization that improves the running time of algorithms using the regular-expression abstraction: lazy evaluation.

You may be familiar with the standard approach to converting regular expressions into deterministic finite automata. The regular expression is first converted into a non-deterministic finite automaton by the McNaughton-Yamada algorithm. This conversion is simple: if the regular expression has length n, it produces an NFA with at most 2n states. Converting the NFA to a DFA is the potentially hard part, requiring the Rabin-Scott subset construction. In the worst case, this construction can turn an NFA with 2n states into a DFA with exponentially many states, which would be wildly impractical. In practice, however, anything close to the worst case rarely occurs.

However, in other applications of regular expressions, near-worst-case behavior can and does occur. grep is one of the earliest UNIX commands; its name stands for "global regular expression print." The command takes a string and determines whether it contains a substring in the language of a given regular expression. The simplest implementation converts the regular expression to an NFA, then to a DFA, and lets the DFA read the string. When the DFA is large, converting the NFA to a DFA can take far more time than scanning the string.

When a regular expression is used only once to scan a string, however, there are more efficient ways to implement a command such as grep. Ken Thompson's first research paper showed that it is more efficient to simulate the NFA directly rather than convert a small NFA into a large DFA. That is, the NFA reading a string is typically in a set of states after each character is read. So we need only keep track of this set of NFA states after each character and, on reading the next character, construct the set of states reachable on that character from the previous set.

Even greater efficiency can be achieved by converting the NFA to a DFA lazily. That is, read the input string one character at a time, and tabulate only those sets of NFA states that actually arise from the prefixes read so far. These sets of NFA states correspond to DFA states, so we construct only the part of the DFA transition table needed to process this particular input string. If the DFA for the given regular expression is not too large, most or all of it will have been built before the string is fully processed, and we reap the benefit of using the DFA directly rather than reconstructing a set of NFA states after each character. But if the DFA is larger than the string, most of the DFA will never be constructed, so we get the best of both cases. This improvement was implemented in a version of grep called egrep.
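
Below is a much-simplified, hedged sketch of the lazy idea: the NFA is given as a transition table, and each DFA state (a set of NFA states) and each DFA transition is built and cached only when the input actually requires it. The NFA encoding and the example automaton are invented for illustration, and epsilon-transitions are omitted for brevity.

def lazy_dfa_accepts(nfa_delta, start, accepting, text):
    """Simulate an NFA on `text`, caching each (state-set, character)
    transition so the DFA is constructed only where it is needed.

    nfa_delta: dict mapping (nfa_state, char) -> set of nfa_states
    start: set of initial NFA states
    accepting: set of accepting NFA states
    """
    cache = {}                                # (frozenset, char) -> frozenset
    current = frozenset(start)
    for ch in text:
        key = (current, ch)
        if key not in cache:                  # build this DFA transition lazily
            nxt = set()
            for q in current:
                nxt |= nfa_delta.get((q, ch), set())
            cache[key] = frozenset(nxt)
        current = cache[key]
    return bool(current & accepting)

# Tiny invented NFA for strings over {a, b} that end in "ab".
delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
}
print(lazy_dfa_accepts(delta, {0}, {2}, "abab"))   # True
print(lazy_dfa_accepts(delta, {0}, {2}, "abba"))   # False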


Figure 2. Syntax tree for the expression a + b * c

2.2 Context-free grammars and parsing

In the second stage of the compiler, the syntax analyzer, or "parser," maps the sequence of tokens produced by the lexical analyzer into a tree representation that makes the syntactic structure of the token sequence explicit. A typical representation is the syntax tree, in which each interior node represents some construct and the node's children represent the components of that construct.

Example 2.3 The parser can map the token sequence a+b*c into the syntax tree shown in Figure 2. Here, E stands for an expression. The operands a, b, and c are expressions by themselves. But b*c is also an expression, composed of the operator token * and the two expressions b and c. At the root we see yet another expression, this one using the operator + and the two operand expressions a and b*c.

There are many conventions regarding operator precedence that the parser must respect. In general, multiplication takes precedence over addition, which is why the syntax tree multiplies b by c before adding a, rather than first adding a and b.

The syntax-tree structure required for a given programming language is usually defined by a declarative abstraction, the context-free grammar (CFG), a concept with which we assume the reader is familiar. A CFG is a collection of rules called productions, which give ways of constructing the various syntactic categories, such as expressions or statements, from other syntactic categories and terminals (the tokens produced by the lexical analyzer). For example, if E represents the syntactic category of well-formed expressions of the language, then we might find the production E → E + E, which says that one way to construct an expression is to place a plus sign between two smaller expressions.

2.2.1 LR(k) parsing

In the 1960s there was a series of proposals for how to construct an efficient parser from a CFG. It was recognized that for typical programming languages, as long as the grammar has certain properties, it is possible to build the syntax tree in a single left-to-right scan of the program, without backtracking, guided by the grammar of the language.

Some of the decisions are tricky. For example, when processing the expression a+b*c, having read only a+b, you must decide whether to combine the expressions a and b with the plus sign to form a larger expression. If you look ahead one token and see *, you know that combining a and b is not correct; you must instead proceed and eventually combine b with c. Only then is it correct to combine a with the expression b*c.

This form of syntax analysis is called shift-reduce parsing. As the input is scanned, each step decides either to shift, by pushing the next input token onto a stack, or to reduce the symbols at the top of the stack. When we reduce, the symbols reduced must form the right side of some production of the CFG. These symbols are popped from the stack and replaced by the left side of the same production. In addition, a node of the syntax tree is created for the left side of that production; its children are the roots of the trees corresponding to the symbols just popped. If a token is popped, its tree is just a single node, but if a syntactic category is popped, its tree is the one previously constructed for that stack symbol.

Don Knuth proposed LR(k) parsing, which applies to the largest class of grammars that can be parsed correctly in a single left-to-right scan of the input, using the shift-reduce paradigm and looking ahead at most k symbols on the input. This work seemed to settle the question of how parsers should be constructed. However, not every CFG, nor even every CFG for a typical programming language, satisfies the conditions required to be an LR(k) grammar for any k. And while typical programming languages do seem to have LR(1) grammars, that is, grammars that can be shift-reduce parsed with only one symbol of lookahead on the input, such grammars are rather complex to design, often with many times as many syntactic categories as intuition suggests should be needed.

2.2.2 Yacc: a parser generator

So, after Knuth's paper, there were several attempts to find a way to use LR(1) parsing while making it work with simpler CFGs. We were inspired by Al Korenjak, a Princeton graduate student whose thesis was about ways to compress LR(1) parsers. The realization was that for a typical language, you can start with a grammar that is not LR(1) and still build a left-to-right shift-reduce parser for it. When the grammar is not LR(1), there will be situations where either of two productions could be used for a reduction, or where either a shift or a reduction is possible. In practice, these ambiguities can be resolved by considering operator precedence and looking one token ahead on the input.

Example 2.4 Consider the situation described in Example 2.3. After processing a+b of the input a+b*c, the stack will have E+E on top, since a and b have each already been reduced to expressions. There is a production E → E + E that would allow us to reduce E+E to E and build a parse-tree node labeled E with children E, +, and E. But * takes precedence over +, and we see * as the next input symbol, so the correct move is to shift * onto the stack. Later we also shift c and then reduce c to the expression E, leaving E+E*E on top of the stack. We correctly reduce the top three symbols, E*E, to E, leaving E+E. Now it is correct to reduce these symbols to E, because there is no more input (or the remaining input is not part of the expression, for example a semicolon that ends the statement). In this way we produce the syntax tree shown in Figure 2.
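
To make Example 2.4 concrete, here is a hedged toy sketch (not Yacc, and much simpler than a real LR parser) of a shift-reduce loop that uses operator precedence to choose between shifting and reducing while parsing a + b * c; syntax trees are represented as nested Python tuples.

PREC = {"+": 1, "*": 2}

def parse(tokens):
    """Shift-reduce parse of an expression with + and *, where * binds
    tighter. Returns a syntax tree, e.g. ('+', 'a', ('*', 'b', 'c'))."""
    stack = []          # holds trees (for E) and operator tokens

    def reduce_top():
        # Stack top is E op E: replace it with a single E node.
        right = stack.pop()
        op = stack.pop()
        left = stack.pop()
        stack.append((op, left, right))

    for tok in tokens:
        if tok in PREC:
            # Reduce while the operator already on the stack has precedence
            # at least that of the incoming one (so we shift '*' after
            # 'E + E', but reduce 'E * E' before shifting '+').
            while len(stack) >= 3 and stack[-2] in PREC \
                    and PREC[stack[-2]] >= PREC[tok]:
                reduce_top()
            stack.append(tok)       # shift the operator
        else:
            stack.append(tok)       # an identifier is immediately an E
    while len(stack) >= 3:          # end of input: reduce what remains
        reduce_top()
    return stack[0]

print(parse(["a", "+", "b", "*", "c"]))   # ('+', 'a', ('*', 'b', 'c'))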

Our colleague Steve Johnson at Bell Labs took up this idea and implemented a parser generator called Yacc. To resolve ambiguities between shift and reduce actions, or between reductions by two different productions, Yacc uses the order in which productions appear: when a reduction by either of two productions is possible, the one listed first is preferred. To resolve a conflict between shifting and reducing, the operator that appears first in the Yacc input file is assumed to take precedence.

Yacc quickly became an essential tool for compiler implementation, not only for compilers of traditional programming languages but also for compilers of the many "little languages" with more limited purposes. Together with Lex, Yacc offered a simple way to experiment with the syntactic design of a new language. The two tools made it feasible for a one-semester university compiler course to have students design and implement a new domain-specific programming language.

3

Abstractions for large-scale data

We need several new abstractions to consider the largest available data sets and the algorithms available to manipulate them. The secondary storage model in Section 1.6 is important, but there are other abstractions that represent various forms of parallel and distributed computing. We will outline the most relevant abstractions here.

3.1 Relational Model of Data

First, Codd's relational model has proven central to working with large-scale data. In brief, data is organized into a collection of tables, or relations, two examples of which are shown in Figure 3. On the left is a relation called Cities, which has two columns: City and State. The schema of a relation is its name together with the names of its columns, in this case Cities(City, State). The relation itself is the set of rows, or tuples, in the table. For example, (Toronto, Ontario) is one of the rows of the relation Cities. The second relation is called States and has columns State, Country, and Pop (the state's population, in millions).


Figure 3. Two relations: Cities(City, State) and States(State, Country, Pop).

Choosing a programming language for the relational model was an interesting decision. Codd could have treated the relational model as a basic abstraction embedded in a general-purpose language, much as trees or graphs are. The operations of such a relational language would be simple navigation steps, such as "find the value in a given row and column" or "find the row that follows a given row." In fact, early database abstractions, such as the network and hierarchical models, took exactly this approach. But Codd's view was of a declarative abstraction, and the declarative languages that followed helped make the relational model the dominant approach to database management.

In its original formulation, the programming language for the relational model was non-recursive first-order logic, or equivalently a set of five algebraic operations, union, difference, selection, projection, and join, known as relational algebra. The last three operations may be unfamiliar and are defined as follows (a small code sketch follows the definitions):

Selection: takes a condition C on the column names of a relation R and returns those rows of R that satisfy C. For example, applying the condition Country = India to the relation States of Figure 3 yields a new relation with columns State, Country, and Pop, but containing only the second and sixth rows of States.

Projection: takes a set of the column names of a relation and produces a new relation with the same set of rows, but restricted to the given columns.

Join: takes two relations and a condition C involving the column names of both. It produces a new relation by (1) considering each pair of rows, one from each relation, and (2) if the values in that pair of rows satisfy C, combining the two rows into a row of the result.
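
The following sketch implements selection, projection, and join over relations represented as lists of Python dicts; the sample rows merely follow the schemas of Figure 3 and are invented for illustration.

# Hypothetical sample relations with the schemas of Figure 3
# (the rows and population values are invented for this example).
cities = [
    {"City": "Toronto", "State": "Ontario"},
    {"City": "Mumbai",  "State": "Maharashtra"},
]
states = [
    {"State": "Ontario",     "Country": "Canada", "Pop": 14.6},
    {"State": "Maharashtra", "Country": "India",  "Pop": 112.4},
]

def select(relation, condition):
    """Selection: the rows of `relation` satisfying `condition`."""
    return [row for row in relation if condition(row)]

def project(relation, columns):
    """Projection: the same rows, restricted to the given columns."""
    return [{c: row[c] for c in columns} for row in relation]

def join(r1, r2, condition):
    """Join: combine each pair of rows (one from each relation)
    whose values satisfy `condition`."""
    return [{**a, **b} for a in r1 for b in r2 if condition(a, b)]

print(select(states, lambda r: r["Country"] == "India"))
print(project(cities, ["City"]))
print(join(cities, states, lambda c, s: c["State"] == s["State"]))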

3.2 SQL Abstraction

An important development shortly after the relational model was proposed was the programming language SQL. In its original formulation, SQL was still not a Turing-complete language, but it did support more than the original relational model. Its underlying data model supports not only sets but also bags, in which the same row may appear more than once, and relations whose rows are sorted by the values of one or more columns. In addition to the relational-algebra operators described above, SQL supports grouping and aggregation, which lets the programmer group the rows of a relation according to the values in one or more columns and then aggregate the values of one or more columns within each group, for example by summing or averaging them.

Example 3.2 Consider the relation States of Figure 3. We can group its rows by the value of the Country column and then, for each country, sum the populations of its states. The resulting table is shown in Figure 4.


Figure 4. Grouping by Country and summing Pop.
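
A sketch of the grouping and aggregation of Example 3.2 over the same list-of-dicts representation (sample rows again invented); in SQL this would be a GROUP BY on Country with SUM(Pop).

from collections import defaultdict

# Invented sample of States(State, Country, Pop), population in millions.
states = [
    {"State": "Ontario",     "Country": "Canada", "Pop": 14.6},
    {"State": "Maharashtra", "Country": "India",  "Pop": 112.4},
    {"State": "Quebec",      "Country": "Canada", "Pop": 8.5},
    {"State": "Bihar",       "Country": "India",  "Pop": 104.1},
]

# Group rows by Country, then sum Pop within each group.
totals = defaultdict(float)
for row in states:
    totals[row["Country"]] += row["Pop"]

for country, pop in totals.items():
    print(country, round(pop, 1))
# Canada 23.1
# India 216.5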

As SQL evolved, more features were added to the standard, including the ability to write recursive programs and to call code written in a general-purpose programming language. So, in principle, SQL is now Turing-complete. But the vast majority of SQL programs do not use the features that make it Turing-complete, so in practice SQL can still be compiled in a way that exploits the many optimization opportunities of a declarative abstraction.

3.3 SQL Compilation

Programs written in SQL are typically compiled into a lower-level language such as C, with the C code making heavy use of library functions for operations such as selection or join. The early phases of SQL compilation (lexical analysis, parsing, and so on) resemble the corresponding phases for any general-purpose language. Where SQL compilation differs markedly is the code-optimization phase, usually called query optimization. Recall that optimization for a language like C works by saving machine instructions wherever it can, so that doubling or tripling the speed is a good result. But the operations of SQL and the relational model are usually far more powerful than machine instructions; for example, a single operator in a syntax tree may join two enormous relations.

A SQL program therefore consists of relatively few steps compared with a C program, but those steps can take a very long time if implemented naively. As a result, SQL compilers typically perform an almost exhaustive search for equivalent syntax trees whose execution time is orders of magnitude smaller. It can even make sense to spend time exponential in the size of the SQL program to optimize a program that will be executed only once, because that program will usually be run on very large relations.

3.4 Distributed Computing Abstraction

Over the years it has become clear that the capabilities of a single processor are reaching their limits. To process ever-larger data sets, we must develop algorithms that use many independent machines. Several abstractions of distributed computation have shaped our thinking about such algorithms; they have been implemented and are in serious use.

Overall, these abstractions have some common features:

Their data model is that of a conventional programming language, but the data is distributed among many different tasks, which we call "compute nodes." In practice several tasks may run on the same processor, but it is convenient to think of each task as if it were a processor of its own.

Programs are likewise written in a conventional language, but the same program runs at each of the nodes, typically in parallel.

There is a facility by which a node can communicate with the other nodes. This communication takes place in phases that alternate with computation phases.

Several different performance measures matter for this type of abstraction. The obvious one is the wall-clock time needed to execute the programs at all the nodes in parallel. But sometimes the bottleneck is the time needed for communication between nodes, especially when large amounts of data must be shared among them. A third measure is the number of rounds of the algorithm, where one round is a computation phase followed by a communication phase.

3.5 The bulk-synchronous abstraction

Valiant's bulk-synchronous model is a popular abstraction that we will not discuss in detail. The model has recently gained renewed popularity through Google's Pregel system for computing clusters and now has many similar implementations.

In the bulk-synchronous model, the compute nodes can be viewed as a complete graph
