Introduction

The idea of simulator technology can be traced back to the Turing machine. Alan Turing proposed the concept in a paper published in 1937, in which one machine can simulate the behavior of another; this is also the theoretical basis of computer simulation technology.

A simulator helps developers understand in depth how software runs on the target machine, including instruction execution cycles, memory usage, and the correct behavior of peripherals such as serial ports. This information is important for hardware architects, instruction set designers, and system software developers.

1. Simulator requirements analysis

In the traditional chip development process, the design cycle of a chip is often very long. The functions of some modules or subsystems cannot be analyzed until the FPGA verification stage at the earliest, and if the performance turns out to be unsuitable, the original architecture has to be discarded and redesigned. This increases the cost of early chip development and slows the whole development process.

The simulator described in this paper is an important part of the AI chip R&D process. It was planned and developed mainly to address the following three R&D difficulties, improving the overall R&D efficiency of the chip and shortening the R&D cycle.

The first is to verify the correctness of the AI instruction set: the simulator identifies and executes the different instruction types and verifies the instruction flow and the function of the chip design. The second is to support inspection of the execution information of the internal instruction flow, which accelerates the exploration of candidate chip architectures and the rapid iteration of the RTL design, and improves design efficiency. The third is to adapt and verify a variety of parallel operators so that AI algorithms run correctly on the chip, meeting the various needs of AI processor research and development.

The simulator is intended to spare the development team the cumbersome hardware design process and to make early architecture exploration easier. It uses software, written in a serially executed language, to describe the behavior of the chip's modules, which execute in parallel in hardware, and the hardware architecture and toolchain development are adjusted according to the results. The simulator therefore focuses on simulating chip functions and does not provide clock-cycle-accurate simulation.

The target users of the simulator are mainly the processor design team, the toolchain R&D team, and the algorithm team. They have a good understanding of the structure of AI chips, the SIMT programming model, and the underlying principles of parallel computing, and have considerable professional R&D knowledge.

To shorten the R&D cycle and speed up the R&D process, the simulator uses an interpretive simulation strategy to establish a system-level functional model of the processor.

Although the simulator offers no general-purpose interface or operating environment for ordinary users, the developers on these teams can use it proficiently.

The chip under development has a distinctive structure, powerful computing functions, a large amount of self-developed IP, and closely matched software optimization. Because of time, performance, and other constraints in the R&D process, the design structure of the simulator is simplified: an instruction execution module and an instruction function simulation unit are designed to simulate the most important part, the AI instruction set.

The instruction programming model that the simulator implements is based on the SIMT (Single Instruction, Multiple Threads) idea, which organizes threads into three levels (Kernel-Block-Thread).

A kernel corresponds to a basic task performed by the processor. Each kernel is divided into multiple independent thread groups (Blocks), and each thread group contains a number of independent parallel threads that can cooperate with one another at a fine-grained level.

To support this, the simulator needs to provide the necessary control interfaces, such as simulator initialization, control of the thread dimensions, register reads and writes, and the initial address, so that developers can call them to change configuration parameters such as the thread dimensions at any time according to project requirements.

Since the simulator is positioned as part of the chip's integrated development environment, it must not only serve as a reference against which the processor instruction functions implemented by the hardware design team are compared, but also execute the instructions into which the toolchain converts parallel operators.

In both cases, the simulator also needs to expose the execution information of the internal instruction flow of any specific thread, so as to verify the correctness of instruction execution, of the compiler's translation of operators, and of the operators' functions. The simulator therefore mainly performs the functions shown in Figure 1.1 in the development process:

(1) Provide a convenient external configuration interface for R&D personnel to call.

(2) Correctly simulate the SIMT architecture, including the address and memory design of the multi-threaded structure, and functionally execute high-density parallel instruction flows.

(3) Completely simulate all instructions in the AI instruction set and verify the correctness of each instruction's execution, so as to provide correct simulation results.

(4) Allow the execution information of the internal instruction flow of any thread to be inspected in a targeted manner, so that related problems can be debugged promptly.

What are the key technologies of artificial intelligence (AI) instruction set simulators? What is the core structure?

Figure 1.1 Simulator requirements

The above is the functional breakdown of the simulator. The most important parts of the simulator are the detailed design of the input and output modules, the instruction execution modules, the thread hierarchy, and the storage hierarchy.

2. Instruction set architecture

2.1 Overall description of the instruction set

The instruction set used in this simulator is a typical reduced instruction set (RISC) in which each instruction is 64 bits wide. To facilitate pipelined operation, the instruction set divides instructions into several encoding formats, mainly ALU, SFU, MEM, TC, and CTRL. The encoding formats are shown in Figure 1.2:

Figure 1.2 Various instruction encoding formats

The acronyms in the figure have the following meanings:

@p: Predicate, the branch jump control bit; determines whether to jump when a branch condition occurs.

opcode/op: The machine code of the instruction.

Ctrl1/Ctrl2: Instruction control bits, used to parse and determine the specific format of the instruction.

dst: Destination register index.

src: Source register index.

rs1/rs2/rs3: The index of the register in which an operand is located.

rsvd: Reserved; the value of this field does not affect the behavior of the instruction.

sf1/sf2: Address offset in the MEM instruction format.

imm: An immediate value used as a source operand.
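As a rough illustration of how fields like those above might be separated from a 64-bit instruction word, the sketch below uses a generic bit-field helper. The field positions and widths in `HYPOTHETICAL_LAYOUT` are invented for illustration; the real positions are those of Figure 1.2. Only the 64-bit instruction width and the placement of @p in bit 63 are taken from the text, and the helper names `extract` and `decode` are likewise assumptions.

```python
# Sketch only: every field position except @p is hypothetical.
INSTR_BITS = 64

# name -> (lowest bit, width)
HYPOTHETICAL_LAYOUT = {
    "p":      (63, 1),  # predicate bit (@p), bit 63 per the text
    "opcode": (56, 7),  # assumed
    "ctrl1":  (48, 8),  # assumed
    "ctrl2":  (40, 8),  # assumed
    "dst":    (32, 8),  # assumed
    "rs1":    (24, 8),  # assumed
    "rs2":    (16, 8),  # assumed
    "rs3":    (8, 8),   # assumed
    "rsvd":   (0, 8),   # assumed
}


def extract(word, lsb, width):
    """Return the unsigned bit field of `width` bits starting at bit `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


def decode(word, layout=HYPOTHETICAL_LAYOUT):
    """Split a 64-bit instruction word into named fields."""
    if not 0 <= word < (1 << INSTR_BITS):
        raise ValueError("instruction word must fit in 64 bits")
    return {name: extract(word, lsb, width)
            for name, (lsb, width) in layout.items()}
```

Keeping the layout in a table rather than in hand-written shifts makes it straightforward to add one table per instruction format (ALU, SFU, MEM, TC, CTRL) once the real encodings are known.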

The data types currently supported by the simulator mainly include: U32, S32, U16, S16, U8, S8, FP32, and FP16.

In addition, some instructions support packed data types, in which several values are packed together; for example, two FP16 values are packed together to form FP16x2.

Instructions such as addition and subtraction, multiply-add, and absolute-value subtraction support FP16x2 calculation. When the operands of other ALU and SFU instructions are of a packed data type, certain bits of ctrl2 can specify the byte of the source register at which the operand starts.
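To make the packed type concrete, the following minimal sketch packs two FP16 values into one 32-bit word and performs a lane-wise addition. It relies on the `e` (IEEE 754 binary16) format of Python's standard `struct` module. The lane order (first value in the low 16 bits) and the function names are assumptions, and the hardware's rounding and overflow behavior may differ from what `struct` does.

```python
import struct


def pack_fp16x2(lo, hi):
    """Pack two FP16 values into a 32-bit word (lo in bits 15..0, assumed)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_fp16x2(word):
    """Split a 32-bit word back into its two FP16 lanes as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def add_fp16x2(a, b):
    """Lane-wise addition of two packed words, rounded back to FP16."""
    a_lo, a_hi = unpack_fp16x2(a)
    b_lo, b_hi = unpack_fp16x2(b)
    return pack_fp16x2(a_lo + b_lo, a_hi + b_hi)
```

For example, `unpack_fp16x2(add_fp16x2(pack_fp16x2(1.5, -2.0), pack_fp16x2(0.5, 4.0)))` yields the two lanes of the sum. Values outside the FP16 range make `struct.pack` raise `OverflowError`, which a real model would replace with the chip's own overflow rule.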

2.2 Instruction Classification

The simulator needs to simulate a total of 62 instructions: 35 data processing instructions, 1 format conversion instruction, 3 matrix operation instructions, 1 data mixing instruction, 15 load/store instructions, and 7 branch and system instructions. Instructions with different functions have different formats. By type, they fall mainly into the following categories:

(1) ALU instructions

ALU instructions are arithmetic and logical operation instructions, including addition, subtraction, multiplication, division, comparison, limit-value, AND/OR/NOT, number format conversion, and shift instructions.

As the format requires, two or three operands are taken from the source registers, converted to a common number format, and then used in the calculation. The ALU instruction encoding format consists mainly of the opcode, ctrl, and operands.

The opcode represents the specific calculation type, ctrl is used to select the number format, the immediate value, status bits, and so on, and ctrl2 is used to specify the operand offset.

Among the ALU instructions, the format conversion instruction is special. It converts an operand from src to dst, realizing conversions between integer and floating-point formats and between floating-point formats.

The conversion can specify saturation and rounding modes, and if the dst number format is floating-point, the output can also be aligned.

In addition, the convert instruction can convert two operands and concatenate the results.
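The effect of saturation and rounding modes on a floating-point-to-integer conversion can be sketched as follows. The mode names `"rne"` (round to nearest, ties to even) and `"rz"` (round toward zero), the wrap-around behavior when saturation is off, and the function name are all assumptions; the text does not list the modes the chip actually provides.

```python
import math


def convert_to_int(value, bits, signed=True, rounding="rne", saturate=True):
    """Convert a float to an integer of the given bit width.

    rounding: "rne" (nearest, ties to even) or "rz" (toward zero); assumed names.
    saturate: clamp to the representable range instead of wrapping.
    NaN and infinity are not handled in this sketch.
    """
    if rounding == "rne":
        result = round(value)        # Python's round() rounds ties to even
    elif rounding == "rz":
        result = math.trunc(value)
    else:
        raise ValueError(f"unknown rounding mode: {rounding}")

    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if saturate:
        return max(lo, min(hi, result))

    # Assumed behavior without saturation: keep the low `bits` bits.
    result &= (1 << bits) - 1
    if signed and result > hi:
        result -= 1 << bits
    return result
```

With saturation enabled, converting 300.7 to S8 gives 127 and converting -200.2 gives -128; with it disabled, the same inputs wrap around instead of clamping.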

(2) Special instructions

Special instructions can be roughly divided into three categories: SFU (transcendental function) instructions, TC (matrix operation) instructions, and shuffle (register value mixing) instructions. Each category has its own encoding format; as the format requires, the operands are taken from the source registers, converted to a common number format, and then used in the calculation.

The SFU instruction encoding format is the same as that of the ALU instructions and consists mainly of the opcode, ctrl, and operands. The opcode represents the specific calculation type, and the ctrl bits control the number format, status bits, and so on. According to the calculation type, the instruction computes on the operands and stores the result in the corresponding destination register.

In addition, SFU instructions will only run on thread-level computing units. When the destination register of the instruction is the Thread register, the source operand can come from the Thread register, the Warp register, and the constant register.

The TC (Tensor Core) instruction is mainly used to implement matrix operations. Following the instruction rules, it takes the corresponding register values of the two operands from all threads in the warp, forms two matrices for the calculation, and writes the result to the destination register of each thread in the same warp.

If the source operands are half-precision floating-point numbers and the destination operand is a single-precision floating-point number, the output requires two registers, and the part that does not fit is written to the register with the next index.

The format of the SHUFFLE instruction is basically the same as that of the TC instruction. Its main function is to redistribute the values of the source register across all threads in the warp in which the instruction runs, according to a specified rule.

(3) MEM instructions

The MEM instruction encoding format consists mainly of the opcode, atom, ctrl, and operands. The opcode distinguishes the MEM instruction types, and atom, which is 4 bits wide, distinguishes whether the instruction is a normal memory access instruction or an atomic operation instruction.

The LD instruction reads data from memory into the destination register at the calculated address, and the ST instruction saves the source register value to memory at the calculated address. The LD/ST address is calculated as follows:

(1-1)

(1-2)

From equations (1-1) and (1-2), the base address and the offset address of the instruction are obtained; combined with the decoded immediate value, the destination address is derived as follows:

(1-3)

MEM instructions can run on both thread-level and warp-level units. For LD instructions, the destination can only be a thread register or a warp register. The source register values of ST instructions can come from thread registers, warp registers, and constant registers. The rs1 operand can come from warp registers and constant registers, and the rs2 operand can come from thread registers and constant registers.

Since the processor design is based on the SIMT idea, a large computing task is decomposed into many small threads, each with an independent index and register space. The destination address of a MEM instruction therefore also supports the following application scenarios:

1. Ordinary address (warp/thread) + fixed offset (IMM)

2. Ordinary address (thread)

3. Ordinary address (warp/const)

4. Ordinary address (thread) + tid (threadid)

5. Ordinary address (warp) + tid (threadid)

6. Global base (warp) + tid (threadid)

In addition to LD/ST instructions, the MEM class also contains atomic operation instructions, which lock threads. An atomic operation consists of one or more consecutive atomicST instructions followed by one atomicLD instruction.

atomicST is used to set the operand, and atomicLD is used to perform the corresponding operation based on the atom field and the operand set earlier.

(4) CTRL instructions

CTRL instructions are control instructions, including exit (abort), nop (bubble), loop, JMP (jump), branch, and others. They provide logic control and are very important to program execution. Several typical CTRL instructions are described below.

The LOOP instruction is the loop instruction and consists mainly of the opcode, ctrl, and operands. The lower seven bits of ctrl give the number of instructions participating in the loop, and the highest bit is used to select the immediate value. The operands rss, rsi, and rse bound the number of iterations; their values can come from warp registers, constant registers, and immediate values.
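The split of the LOOP ctrl field described above can be sketched in a few lines. The sketch assumes that ctrl is 8 bits wide, so that "the highest bit" is bit 7, and that a value of 1 in that bit selects the immediate; neither detail is stated explicitly, and the function name is hypothetical.

```python
def decode_loop_ctrl(ctrl):
    """Split an (assumed 8-bit) LOOP ctrl field into its two parts."""
    if not 0 <= ctrl <= 0xFF:
        raise ValueError("ctrl is assumed to be 8 bits wide")
    body_length = ctrl & 0x7F        # lower seven bits: instructions in the loop
    use_immediate = bool(ctrl >> 7)  # highest bit: immediate selected (assumed 1 = yes)
    return body_length, use_immediate
```

For instance, `decode_loop_ctrl(0x85)` describes a loop body of 5 instructions with the immediate selected.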

The JMP instruction is the unconditional jump instruction. It supports absolute and relative address jumps, with the jump value taken from a register or an immediate value. Bit[0] selects the jump type: 0 means an address offset and 1 means an absolute address. Bit[7] selects the source: 0 means rs1 supplies the address and 1 means imm supplies the address.
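The two control bits of JMP map directly onto a next-PC calculation, sketched below. Bit[0] and Bit[7] follow the description above; that the offset is added to the current PC, that the PC counts instructions, and that the offset has already been sign-extended are assumptions, as is the function name.

```python
def jmp_target(pc, ctrl, rs1_value, imm):
    """Return the next PC for an unconditional jump."""
    use_imm = (ctrl >> 7) & 1  # Bit[7]: 0 = rs1 supplies the address, 1 = imm
    absolute = ctrl & 1        # Bit[0]: 0 = address offset, 1 = absolute address
    value = imm if use_imm else rs1_value
    # Assumption: a relative jump is taken from the current PC.
    return value if absolute else pc + value
```

A BRANCH instruction could reuse the same calculation and simply fall through to the next instruction when its predicate says not to jump.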

The BRANCH instruction is the conditional branch instruction. Like the unconditional jump, it supports absolute and relative address jumps, with the jump value taken from a register or an immediate value. In addition, BRANCH decides whether to jump based on @p in bit 63 of the instruction and on the ctrl bits.

The EXIT instruction terminates the current process and marks the end of the current program. The NOP instruction is an empty instruction: executing it performs no meaningful operation, and execution continues with the instruction after the NOP. It is used to clear flags and to play auxiliary roles alongside other instructions.

2.3 Thread organization and storage space

Because the processor is based on the SIMT architecture, it has a unique thread organization and register space compared to SIMD.

As shown in Figure 1.3, threads are organized into three levels (Kernel-Block-Thread). The kernel corresponds to the basic tasks performed by the simulator. Each kernel is divided into multiple independent thread blocks.

Each thread group contains multiple threads that work together at a fine-grained level. The instruction code corresponding to each thread is the same, but the output result will be different depending on the data participating in the operation. Threads support all C syntax internally.

Figure 1.3 SIMT programming model

Both Blocks and Threads are organized in three dimensions, and each thread, warp, and kernel has its own storage space. In addition, each thread can read its own blockIdx.* and threadIdx.* values through specific registers.
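The three-dimensional organization can be illustrated by enumerating every (blockIdx, threadIdx) pair of a kernel and flattening a 3-D index into a linear one. This is a minimal sketch: the x-fastest ordering is borrowed from common SIMT practice and is an assumption here, and the names `grid_dim`, `block_dim`, `linear_index`, and `enumerate_threads` are hypothetical.

```python
from itertools import product


def linear_index(idx, dim):
    """Flatten a 3-D index (x, y, z) with x varying fastest (assumed order)."""
    x, y, z = idx
    dx, dy, dz = dim
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("index out of range")
    return x + dx * (y + dy * z)


def enumerate_threads(grid_dim, block_dim):
    """Yield (blockIdx, threadIdx) for every thread of a kernel, x fastest."""
    gx, gy, gz = grid_dim
    bx_n, by_n, bz_n = block_dim
    for bz, by, bx in product(range(gz), range(gy), range(gx)):
        for tz, ty, tx in product(range(bz_n), range(by_n), range(bx_n)):
            yield (bx, by, bz), (tx, ty, tz)
```

A serial simulator can walk this sequence in order, which visits the threads of each block consecutively.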

In the SIMT programming model that the simulator implements, the instructions executed within a single thread are strictly ordered. When multiple threads execute instructions, the execution order in one thread is independent of the others and the threads do not affect each other. When different threads access the same piece of storage, the program behavior is undefined, because neither thread priority nor an arbitration rule is defined.

Since some programs need different threads to access the same piece of storage, a coordination mechanism between threads is required. Coordinating threads requires the programmer to call intrinsic functions, and the simulator supports implementing these functions through the related instructions.

The storage space that each thread can access is divided into several parts: registers, local memory, shared memory, and external memory. Registers and local memory are private to each thread, shared memory is shared within a Block, and external memory is shared by the entire kernel.

In addition, there is a certain address mapping relationship between local memory, shared memory and external memory. The storage structure model is shown in Figure 1.4:

Figure 1.4 SIMT storage space model

Each thread has its own register space (regfile), which currently comprises three categories: the thread register file (thread regfile), the thread group register file (warp regfile), and the constant register file (constant regfile). The thread regfile holds the general-purpose registers that each thread uses for calculation and is independent for each thread.

Each thread is allocated at most 128 thread registers. The warp regfile has the same status as the thread regfile in the programming abstraction; the core difference is that warp registers are kept synchronized at the warp level. The const regfile contains two parts: the program's formal parameters and the special parameters needed for the program to run.

Figure 1.5 Register encoding format

The encoding of register indices is shown in Figure 1.5. Indices 0-127 are the thread registers; indices 128-159 are the warp regfile; indices 160-191 are the indirect warp regfile, which corresponds one-to-one to the warp registers at 128-159 and stores the corresponding immediate values; and indices 192-255 are the constant regfile, whose registers are read-only.
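Reading the ranges of Figure 1.5 as ranges of an 8-bit register index (an interpretation, since the figure itself is not reproduced here), the mapping from an index to a register file can be sketched as below. The file names returned and the function name are hypothetical labels.

```python
def classify_register(index):
    """Map an 8-bit register index to (register file, offset within that file)."""
    if not 0 <= index <= 255:
        raise ValueError("register index must be in 0..255")
    if index < 128:
        return "thread", index               # 0-127: thread registers
    if index < 160:
        return "warp", index - 128           # 128-159: warp regfile
    if index < 192:
        return "indirect_warp", index - 160  # 160-191: indirect warp regfile
    return "const", index - 192              # 192-255: constant regfile (read-only)
```

Because the warp and indirect warp ranges have the same size, the offset returned for index 160 + k identifies the same warp register as index 128 + k, which reflects the one-to-one correspondence described above.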

3. Overall implementation of the simulator

The primary function of the simulator is to simulate the functional behavior of the processor. After an in-depth study of the chip architecture, and taking the project requirements and development environment into account, the design adopts an interpretive simulation strategy, which offers a clear structure, high simulation accuracy, and easy design and debugging, and uses SystemC to establish a system-level functional model of the AI chip.

Figure 1.6 Simulator framework

The overall framework of the simulator is shown in Figure 1.6. It consists mainly of three functional modules: the gen module, which provides the clock and reset signals, the instruction set core simulation module, and the storage module. Because the simulator is constrained by the development cycle and requirements, only a functional model of the processor needs to be established, and it is not cycle-accurate. The gen module therefore exists only as a placeholder to be completed in later development.

In addition, the simulator requires five files in total: a binary instruction file, a configuration file, an input data file, an input data address file, and an output data address file.

The binary instructions are generated from the program by the compiler and stored in program memory. The configuration file serves as the kernel information parameter table; it contains simulator configuration parameters, which are also stored in memory when the simulator is initialized. The input data file holds the data that is placed in memory to take part in the simulation. The input and output address files contain, respectively, the memory addresses at which the input data is stored and the addresses from which the output data is read.

Figure 1.7 Simulator top-level workflow diagram

The main module of the simulator is built around the core instruction execution module. Taking into account the characteristics of the SIMT programming model and the AI instructions, for both thread-level and thread-group-level instructions, the overall flow is shown in Figure 1.7:

(1) Define the clock and reset signals of the simulation platform, perform the reset, and instantiate the simulator's functional modules and the memory module.

(2) Call the initialization function to initialize the program counter, registers, MEM, and other storage structures.

(3) Run the read function to load the binary instruction file, the processor configuration file, and the HBM input data into the registers and storage space of the corresponding threads.

(4) The simulator reads the configuration from the storage module, generates a thread dimension structure of the corresponding size, and, taking the PC as the unit, runs the instruction execution stages (instruction fetch, decode, operand fetch, execute, and write-back) independently for each thread by traversing the dimensions sequentially.

(5) Check whether the current PC value is the end of the thread's executable instruction flow. If so, the thread waits for the other threads to finish before the next step; if not, the thread jumps to the calculated PC and continues executing instructions.

(6) Run the output function to write the data at the addresses listed in the output data address file to the specified file.

(7) End the simulation.

Before the instruction execution module runs, the simulator copies the binary instructions held in program memory into the program registers of each thread independently. In the instruction fetch stage, each thread reads a binary instruction from its program registers according to its PC. In the decode stage, each thread uses the decoding function to separate the binary instruction into opcode bits, ctrl bits, and status bits and stores this information in the corresponding registers. In the operand fetch stage, the simulator uses the register indices obtained by decoding to copy the corresponding register values into each thread's temporary variables and sends them to the execution module. The execution module parses the opcode in the intermediate variables and implements the corresponding instruction through its instruction function simulation function. In the final write-back stage, each thread stores the result of that function in the register selected by the destination index.
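The five stages can be sketched as an interpretive loop in which every active thread executes one instruction per pass, in thread order. This is a toy model, not the simulator's code: the tuple-encoded instructions (`add`, `mul`, `exit`), the per-thread dictionaries, and the function names are all hypothetical, and real instructions would be decoded from 64-bit binary words.

```python
def step(program, thread):
    """Run one instruction for one thread; return False once it has exited."""
    instr = program[thread["pc"]]            # 1. instruction fetch
    op, *fields = instr                      # 2. decode
    if op == "exit":
        return False
    dst, rs1, rs2 = fields
    regs = thread["regs"]
    a, b = regs[rs1], regs[rs2]              # 3. operand fetch
    if op == "add":                          # 4. execute
        result = a + b
    elif op == "mul":
        result = a * b
    else:
        raise ValueError(f"unknown opcode: {op}")
    regs[dst] = result                       # 5. write-back
    thread["pc"] += 1
    return True


def run(program, threads):
    """Step all threads in order, one instruction each, until all have exited.

    The program is assumed to end with an "exit" instruction.
    """
    active = list(threads)
    while active:
        active = [t for t in active if step(program, t)]
    return threads
```

A usage example with four threads whose register 0 holds the thread index:

```python
program = [("add", 2, 0, 1), ("mul", 3, 2, 2), ("exit",)]
threads = [{"pc": 0, "regs": [tid, 10, 0, 0]} for tid in range(4)]
run(program, threads)
```

Each thread keeps its own PC, so a jump or branch would only need to assign a new value to `thread["pc"]` instead of incrementing it.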

In the SIMT instruction set architecture that the simulator implements, the source and destination operands may come from different register hierarchies. When a thread's instruction involves only thread registers and constant registers, the threads' instruction execution does not affect one another, because each thread's thread registers are independent.

Although warp registers and thread registers are the same in the programming abstraction, warp registers must be kept synchronized within the thread group. Therefore, unlike thread-only instructions, instructions that involve warp registers require the simulator to synchronize the warp registers of the threads during execution.

The execution flow of instructions that involve only thread registers and that of instructions that involve warp registers are described separately below.

(1) Instruction execution involving only thread registers.

Figure 1.8 shows the execution flow of instructions that involve only thread registers. The simulator traverses each thread of each thread group in turn and executes the instruction through the following operations:

1. Instruction fetch. According to its PC, each thread reads a binary instruction from its program registers and stores it in an intermediate register.

2. Instruction decode. Determine the operation type, control information, source registers, destination register, source address, destination address, and other information.

3. Operand fetch. According to the register indices obtained by decoding, the thread register values of each thread are copied into that thread's temporary variables and sent to the execution module.

4. Instruction execution. Each thread implements the corresponding instruction through the instruction function simulation function.

5. Data write-back. The result of the instruction is written to the thread register of each thread.

Figure 1.8 Execution flow of instructions involving only thread registers

(2) Instruction execution involving warp registers.

Figure 1.9 shows the execution flow of instructions that involve warp registers. The simulator traverses the threads of each thread group in turn and executes the instruction through the following operations:

1. Instruction fetch. According to its PC, each thread reads a binary instruction from its program registers and stores it in an intermediate register.

2. Instruction decode. Determine the operation type, control information, source registers, destination register, source address, destination address, and other information.

3. Operand fetch. According to the register indices obtained by decoding, the thread register values or warp register values of each thread are copied into that thread's temporary variables and sent to the execution module.

4. Instruction execution. Each thread implements the corresponding instruction through the instruction function simulation function.

5. Data write-back. A reference thread in the thread group is selected according to the configuration parameters, and the result of that thread's instruction is written from the intermediate register to the warp register with the same destination index in every thread of the thread group.

Whether an instruction involves only thread registers or also involves warp registers, the simulator uses nested loops to execute it in every thread of every thread group by sequential traversal. Within a thread group, the simulator also traverses the threads sequentially, executing the instruction thread by thread from the smallest index to the largest.

Thread registers have no data conflicts, so their workflow needs no adjustment. When a warp register is used as a source register, the compiler guarantees that its value is uniform across the thread group. When a warp register is used as the destination register, the simulator keeps the warp register synchronized by using a configurable reference thread in the instruction execution loop, which improves the adaptability and performance of algorithm programs.

Figure 1.9 Execution flow of instructions involving warp registers
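The difference between the two write-back paths can be sketched as follows: thread registers receive each thread's own result, while a warp register in every thread of the group receives the result of a single reference thread. The per-thread dictionary layout, the `reference_lane` parameter name, and its default of 0 are assumptions made for illustration.

```python
def write_back_thread(warp_threads, dst, results):
    """Thread registers are private, so each thread keeps its own result."""
    for thread, value in zip(warp_threads, results):
        thread["thread_regs"][dst] = value


def write_back_warp(warp_threads, dst, results, reference_lane=0):
    """Copy the reference thread's result into every thread's warp register.

    warp_threads: per-thread dicts holding "thread_regs" and "warp_regs" lists
    results:      per-thread execution results, in the same order
    """
    value = results[reference_lane]
    for thread in warp_threads:
        thread["warp_regs"][dst] = value
```

After `write_back_warp`, every thread in the group reads the same value from that warp register, which is the synchronization property the warp regfile requires.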

Summary

This paper starts from the usage scenarios and functional requirements of the simulator and determines its implementation strategy and development direction. It then describes the instruction set to be simulated, including its overall structure, encoding formats, instruction types, thread organization, and storage architecture.

Finally, based on this information, it presents the overall framework, hierarchical structure, and basic workflow of the simulator, and explains its core structure and design ideas.
