Yiran Chen, Duke University: Hardware-Software Co-Design of Efficient Artificial Intelligence Systems

Speaker: Yiran Chen

Edited by: Machine Heart Editorial Department

Not long ago, Yiran Chen, a professor in the Department of Electrical and Computer Engineering at Duke University, delivered a keynote speech entitled "Hardware-Software Co-Design of Efficient Artificial Intelligence Systems," introducing how hardware and software can be co-designed to build high-performance AI systems, covering in-memory-computing deep learning accelerators, model optimization, distributed training systems, and automated neural network architecture design.

Not long ago, the Machine Heart AI Technology Annual Conference was held online, where Yiran Chen delivered a keynote speech on "Hardware-Software Co-Design of Efficient Artificial Intelligence Systems."

https://www.bilibili.com/video/BV1yP4y1T7hq?spm_id_from=333.999.0.0

The following is a transcript of Yiran Chen's speech at the Machine Heart AI Technology Annual Conference, edited by Machine Heart without changing the original meaning.

I often use this graph from The Age of Intelligent Machines, a book written about 32 years ago by Dr. Kurzweil, who predicted the growth of computing power that artificial intelligence would need in the future. We've extended this graph to show how much computing power can be integrated on a single device. Over the past 100 years, computing power has grown almost exponentially. Today, the latest GPUs announced at NVIDIA's conference, as well as the GPU chips Apple released last week, can essentially deliver more than human-brain-level computing power, which opens up enormous possibilities for us.

There are many kinds of AI computing platforms, from the familiar GPUs, FPGAs, and ASICs to new architectures. They all follow one principle: a design can be more efficient, more specialized, slower to develop, or more flexible, but it cannot excel in every dimension at once; there is always a design trade-off.

For example, GPUs offer the highest computing power but also the highest power consumption. FPGAs support easily reconfigurable computation, but their energy efficiency may not be as good. ASICs have very good energy efficiency but require a longer development cycle; they are essentially designed for specific applications and are not cost-effective when shipment volumes are small.

Everyone is familiar with the von Neumann bottleneck: computing power can be raised by adding computing units, but the ultimate bottleneck is whether data can be delivered to those units in time. Over the past four or five decades, the gap between on-chip computing power and the off-chip data that memory bandwidth can supply has grown increasingly wide, giving rise to the concept of the "memory wall." In actual designs, constraints such as being unable to remove generated heat in time, and being unable to raise the clock frequency without limit, also force us to look for new computing designs.

A popular approach now is near-memory or in-memory computing. The idea is simple: since the bottleneck comes from moving data, especially from storage to the computing units, can we find a way to make computation and storage happen in the same place? This is the opposite of the von Neumann architecture, which separates the two; here they are combined.

Why is this possible? Because new kinds of workloads, such as neural networks or graph computation, often have one operand that stays constant while the other keeps changing. For example, when A multiplies B, A changes constantly while B stays fixed; in that case, the computation can be performed where B is stored, and there is no need to move the data around.

People have tried many approaches, such as performing the computation inside DRAM. Recently, Alibaba appears to have published an article on in-memory computing, designing memory with embedded computing units that perform the corresponding calculations directly inside it. These are useful attempts. Compared with fetching data from outside for computation, such designs can be thousands of times more energy efficient, and this is a very promising direction for future development.

Some of what we're talking about today relates to these topics, including in-memory-computing deep learning accelerators, model optimization, distributed training systems, and automated neural network architecture design.

When you first see in-memory computing, the natural idea is to store the neural network parameters in one place and compute on the data directly where it arrives, avoiding data movement. A common scenario is to store these parameters on special nanoscale devices, where the resistance of a device can be programmed by applying a current or voltage so that it represents a parameter.

When this is extended to a relatively large matrix, all inputs and all outputs can be connected at crosspoint nodes, much like the weight matrix of a neural network. All inputs together form a vector; multiplying the vector by the matrix corresponds to applying voltages on all the input lines, and the currents flowing through the devices sum on each output line into a combined current. This implements vector-matrix multiplication in a very efficient way.
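The crossbar idea above can be sketched numerically. This is a minimal, idealized model (no device noise, wire resistance, or ADC effects): each crosspoint stores a conductance, and the current summed on each column wire is exactly a vector-matrix product.

```python
import numpy as np

# Idealized crossbar model: each crosspoint stores a conductance G[i, j]
# (the programmed "weight"), input voltages V[i] drive the rows, and by
# Ohm's and Kirchhoff's laws the current collected on output column j is
# I[j] = sum_i V[i] * G[i, j].

def crossbar_vmm(voltages, conductances):
    """Analog vector-matrix multiply: output currents = V @ G."""
    currents = np.zeros(conductances.shape[1])
    for j in range(conductances.shape[1]):
        # Each column wire sums the currents of all its crosspoint devices.
        currents[j] = np.sum(voltages * conductances[:, j])
    return currents

V = np.array([1.0, 0.5, 0.0])          # input vector applied as row voltages
G = np.array([[0.2, 0.4],
              [0.1, 0.3],
              [0.5, 0.0]])             # weight matrix stored as conductances
print(crossbar_vmm(V, G))              # identical to V @ G
```

The loop over columns mirrors what the physics does in parallel: every column produces its output current simultaneously, which is where the efficiency comes from.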

Over the past decade, we've run many experiments with this approach at Duke and designed several chips.

In 2019 we published a paper at VLSI showing that emerging memory can be integrated with traditional CMOS. After integration, an entire convolutional neural network can be mapped onto such a matrix structure, the precision can be selected, and a trade-off can be made between accuracy and energy efficiency. Compared with traditional CMOS designs, performance can ultimately be improved by dozens of times.

This year we also presented work at ISSCC, led mainly by our graduate Bonan Yan, who worked with a research group at Peking University; the same design idea can also be applied to traditional SRAM. One peculiarity of earlier designs is that they express parameters through resistance, which is really an analog computing method: a continuous physical state encodes a parameter, which requires analog-to-digital conversion, and that is very expensive. Our ISSCC work, an ADC-less SRAM, instead uses a binary integer representation, just zeros and ones, which removes the analog-to-digital conversion and carries out the computation directly in the digital domain. This was a technological breakthrough at the time.

As you can see in the figure below for Digital CIM (far right, This Work), the energy efficiency reaches about 27.4 TOPS/W in the 8-bit configuration, exceeding all previous designs, and the density is very high: in a 28-nanometer process, close to a megabit per square millimeter.

Beyond the circuit, support from the architecture and the compiler is needed to implement the design of the whole computing system. Around 2014-2015 we began doing related designs, such as building compilers, finding the parts of a program that can be accelerated, connecting the various on-chip arrays in certain ways, partitioning large networks into small ones, and converting values between different network layers.

It is worth mentioning that a large network often involves many arrays, and there are at least two ways to parallelize network computation: data parallelism and model parallelism. Data parallelism means there are many inputs, and batches of data can be split across different processing elements (PEs) in parallel. Model parallelism is analogous: a large model can be partitioned into chunks and computed in parallel.

However, these two parallelization approaches are not mutually exclusive. Even for a single layer, when you map that layer onto different PEs, each PE may still use a different parallelism. For example, in the figure below, expressed with black and white dots, most positions use model parallelism while a few use data parallelism.

Why is that? Only by combining the different parallel partitioning schemes can the computation reach a steady-state balance across the available computing power, making the overall energy efficiency the highest. With a single uniform scheme, some places sit idle or compute more slowly, dragging down the whole computation. This is our 2019 paper HyPar.
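The two parallelisms above can be sketched for a single matrix-multiply layer. This is a minimal NumPy illustration; the two-PE split and the tensor shapes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))   # batch of 8 inputs, 4 features each
W = rng.standard_normal((4, 6))   # layer weights: 4 -> 6

# Data parallelism: split the *batch* across two PEs,
# each PE holding a full copy of the weights W.
y_pe0 = X[:4] @ W                 # PE 0 processes the first half of the batch
y_pe1 = X[4:] @ W                 # PE 1 processes the second half
y_data_parallel = np.vstack([y_pe0, y_pe1])

# Model parallelism: split the *weights* across two PEs,
# each PE seeing the full batch.
y_left = X @ W[:, :3]             # PE 0 computes the first 3 output features
y_right = X @ W[:, 3:]            # PE 1 computes the last 3 output features
y_model_parallel = np.hstack([y_left, y_right])

# Both partitionings reconstruct the same layer output.
assert np.allclose(y_data_parallel, X @ W)
assert np.allclose(y_model_parallel, X @ W)
```

The two schemes trade off differently: data parallelism duplicates weights but needs no cross-PE communication within the layer, while model parallelism shards the weights but requires the partial outputs to be gathered, which is why mixing them per-PE can balance the hardware better than either alone.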

In 2020 we found a new problem: we hadn't thought much about the representation of the convolutional network itself. In a convolutional network you often compute layer by layer, and there are intermediate results that are not used until the end. This leads to a third form of parallelism, tensor parallelism. As in the figure below, input and output each have three parallelization choices, 3 times 3 for a total of 9 combinations.

If every PE has 9 possible configurations, there is no way to optimize this manually; the configuration of the whole system must be found by automated means (such as linear programming). At the same time, you cannot enumerate these configurations one by one; you need a hierarchical approach.

For example, you first make some coarse partitions, and then refine from coarse to fine all the way down to the final single configuration. If you express this with three colors, you will find that even under a hierarchical mapping, different PEs run with different configurations in order to achieve the optimal data flow overall. In this way, energy efficiency can be roughly doubled.

The same idea applies beyond deep learning. Deep learning is just a special case of graph computing: any computation whose data flow can be expressed as a graph is a graph computation, and deep learning, rich as it is, is still only a special form of it. Therefore, you can use in-memory computing to perform graph computation as well.

Here is another piece of work, from HPCA 2018. We found that graph computations, especially depth-first or breadth-first search algorithms, can be represented and executed on the matrix structure. Compared with computing on traditional CPU platforms, the energy efficiency is hundreds of times better.

Turning from the complete architecture design, how can we keep optimizing energy efficiency at the algorithm level? The next example is structured sparsity. Sparsity has long been known: when some weights of a neural network are small or close to zero, they have no effect on the output no matter how large the input is. In that case you don't need to compute the result at all; you can simply treat it as zero.

Before 2016, sparsity in neural networks was essentially all non-structured: wherever you saw a zero, you removed it. This creates a problem, because all data in a computer is stored with locality, both temporal and spatial; when a number is used, the expectation is that it will be used again soon, or that the numbers stored around it will be used next. When you remove a lot of zeros, you create a lot of holes. When you fetch a number and expect the next one, it isn't stored there at all. The whole cache then falls into a pattern of repeatedly fetching numbers from far away, discovering they are not needed, and fetching again.

How do we solve this? When introducing sparsity, we still want the removed zeros or computations to retain some locality, for example by removing an entire row or an entire column. In this way, the computation can still be optimized while the locality of the storage is preserved.

Easier said than done. We published a paper on structured sparsity at NeurIPS in 2016, which later became well known. The parameters can be matched to a particular storage structure so that the numbers are stored block by block. When zeroing out, an entire row or column is cleared at once, which preserves locality while still satisfying the optimization objective. This can be used in CNNs, LSTMs, RNNs, and even more complex computations. The technique is now essentially a standard part of neural network optimization.
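The contrast between unstructured and structured pruning can be shown on a small weight matrix. The magnitudes, thresholds, and shapes below are invented purely for illustration.

```python
import numpy as np

W = np.array([[0.9, 0.01, 0.8, 0.02],
              [0.7, 0.03, 0.6, 0.01],
              [0.8, 0.02, 0.9, 0.03]])

# Unstructured sparsity: zero out individual small weights.
# The zeros end up scattered, which destroys storage locality.
W_unstructured = np.where(np.abs(W) > 0.05, W, 0.0)

# Structured sparsity: score whole columns (here by their L2 norm)
# and remove entire columns below a threshold. The surviving weights
# stay contiguous, so the matrix can simply be stored as a smaller
# dense matrix with no holes.
col_norms = np.linalg.norm(W, axis=0)
keep = col_norms > 0.1
W_structured = W[:, keep]

print(W_structured.shape)   # the two near-zero columns are gone
```

The structured version keeps the hardware-friendly property the talk emphasizes: after pruning, the data layout is still a dense block, so caches and accelerators never chase holes.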

Another commonly used neural network optimization is quantization. Training requires high precision, but inference does not. This leads to a very interesting question: what precision is optimal, and how should that precision be represented?

Traditionally, people search for the best result exhaustively: one bit, two bits, four bits, and so on. But you will find that you must also consider how the bits are represented in storage. For example, if for a given layer there is a bit position that is zero across all numbers, there is no need to store that bit at all. As long as you guarantee that, say, only the second and fourth of the four bits exist, rather than needing every one of them, this enriches the optimization of the overall precision. This is also the first time structured sparsification has been applied to bit-level sparsity.

We use the Group LASSO method to remove an entire column, or an entire bit, of the data representation's structure, which greatly reduces the storage cost. This is from a 2021 paper of ours.
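The Group LASSO penalty referred to above sums the (unsquared) L2 norms of predefined groups, which is what drives whole groups, rather than individual entries, to zero. A minimal sketch with columns as the groups; the weight matrix and penalty coefficient are invented for the example.

```python
import numpy as np

def group_lasso_penalty(W, lam=0.1):
    """Group LASSO regularizer with columns as groups.

    Each group contributes lam * ||column||_2 (not squared), so
    optimizing a loss plus this term pushes entire columns toward
    exactly zero -- the source of *structured* sparsity.
    """
    return lam * np.sum(np.linalg.norm(W, axis=0))

W = np.array([[3.0, 0.0],
              [4.0, 0.0]])
# Column 0 has norm 5, column 1 has norm 0 -> penalty = 0.1 * 5 = 0.5
print(group_lasso_penalty(W))
```

The same formula applies to bit-level sparsity by treating all weights' k-th bit as one group, so that the regularizer can eliminate a whole bit plane at once.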

Next comes training, which is a very complicated matter. We often teach students to train until the loss function is close to saturation. But in a company there is never enough computing power to let you train to saturation; you basically get a hundred machines for 24 hours, and whatever you are training, you have to finish, which demands that the training itself be very efficient.

Traditionally, we use distributed parameter servers: the model is copied many times, but each copy is trained on only part of the data. So how do we ensure that the final result accounts for all the data? During training, each copy sends its gradients to the parameter server, which averages them and sends them back to update the local neural networks. This raises a problem: when there are many worker nodes, the entire system becomes saturated by the data traffic that gradient transmission generates.

What can be done? We later found that when there are enough parameters, the resulting gradients satisfy a certain distribution. There is no need to transmit the raw data at all; you only need to compute a few parameters of that distribution, plus statistics such as the amount of data, and send those over. The other end can then reproduce the distribution and obtain the corresponding result.
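As a rough illustration of shipping distribution parameters instead of raw gradients: the sketch below assumes, purely for the example, that the gradients are well modeled as Gaussian and that mean, standard deviation, and count are the statistics sent; the actual method's statistics and reconstruction may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
# A worker's local gradient vector (synthetic data for the sketch).
grad = rng.normal(loc=0.01, scale=0.5, size=1_000_000)

# Instead of transmitting 1,000,000 floats, send a few summary statistics.
summary = {"mean": grad.mean(), "std": grad.std(), "n": grad.size}

# The receiver reconstructs a surrogate gradient by sampling from the
# described distribution; its statistics closely match the original's.
recon = rng.normal(summary["mean"], summary["std"], size=summary["n"])

# Three 8-byte numbers cross the wire instead of the full vector.
compression = grad.nbytes / (3 * 8)
print(f"compression ratio ~ {compression:,.0f}x")
```

The point of the sketch is the bandwidth arithmetic: the traffic per worker drops from megabytes per step to a handful of scalars, which is what relieves the parameter-server congestion described above.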

We have implemented such a system on mobile phones, combining many phones for training, and it can also perform inference.

When doing inference, we take a clustering approach: through row and column permutations, we gather the nonzero numbers together as much as possible, then assign them to phones for concentrated computation, reducing communication between phones and improving computing efficiency.

We once ran a test with a company, using thousands of CDN servers around the world, and built a style transfer application that completed the entire computation in a distributed fashion. The results were very good: you could essentially link your phone to the servers instantly and complete the whole training and inference pipeline.

Having said all this, there is still a problem: everything above requires very experienced, and expensive, engineers to design the corresponding neural networks. This is a very large part of the current cost of deploying neural networks. We can optimize the whole neural network through automated methods such as reinforcement learning, since the design process can be cast as a kind of optimization, but these traditional optimization processes are very expensive.

We originally wanted to do this through a graph representation: a deep neural network architecture is expressed as a directed acyclic graph with many cells, and different cells are stacked to form the whole architecture. What we search for is the topology within a cell, checking whether the resulting neural network design meets the requirements. This is a study that is quite sensitive to topology.

In addition, you will find that neural networks with similar topologies have similar accuracies: there is a correlation, and while the correlation coefficient is not 1, it is a fairly high number. Therefore, we can predict from the architecture alone whether it will meet our performance requirements, and this prediction can guide the search over the whole architecture space.

Here are some concrete results. We map discrete architectures or topologies into a continuous space and use the angle between vectors (their similarity) as a key predictor of performance. This lets us predict how an optimization will turn out and steadily approach the optimal result.
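A toy sketch of predicting a candidate architecture's accuracy from its similarity to already-evaluated ones: the embeddings, accuracies, and the cosine-weighted predictor below are all invented for illustration, and the actual method is more sophisticated.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical continuous embeddings of two evaluated architectures,
# paired with their measured accuracies.
evaluated = {
    "arch_a": (np.array([1.0, 0.0, 0.2]), 0.92),
    "arch_b": (np.array([0.0, 1.0, 0.1]), 0.85),
}

def predict_accuracy(candidate_vec):
    """Similarity-weighted average of the known accuracies."""
    sims = {name: cosine(candidate_vec, vec)
            for name, (vec, _) in evaluated.items()}
    total = sum(sims.values())
    return sum(sims[name] * acc
               for name, (_, acc) in evaluated.items()) / total

# A candidate whose embedding is close to arch_a should be predicted
# closer to arch_a's 0.92 than to arch_b's 0.85.
cand = np.array([0.9, 0.1, 0.2])
print(round(predict_accuracy(cand), 3))
```

Such a cheap predictor is what allows the architecture search to rank candidates without fully training each one.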

Obviously, this method is the opposite of how humans design; this is not how people build neural networks. People look for the smallest model that can meet the requirements and grow it from there. We have also made an attempt of this kind, called automated network depth discovery: design some rules so that you can start from the smallest network and grow upward, adding one layer at a time, or many layers, to see what kind of architecture finally meets the requirement. Of course, you have to design some specific rules and experiment.

These attempts are interesting. In the end, you can always optimize to some point on the design trade-off front, but there is no way to pin the optimization to a chosen point; you can only reach some point on the front and let it drift from there. We still don't understand enough to control the direction and scope of the optimization. So this work needs further study: we have only demonstrated feasibility, without investigating the completeness of the rules.

Finally, many factors must be considered in hardware-software co-design, including the co-design methodology itself, specific circuit and architecture design, and the optimization of the algorithms themselves for the hardware.

Our team has accumulated many years of work in this area: from studying how neural networks map onto different hardware starting in 2012, to later architecture design, distributed design, and automated design, we have made many attempts. You have roughly seen how to go from the simplest representation to, in the end, just pressing a button to complete the construction of a combined AI hardware-software system.

Thank you!
