Memory outlook

Author: Wang Shuyi

(This article is compiled from Semiconductor Engineering.)

Semiconductor Engineering sat down with Frank Ferro, group director of product management at Cadence; Steven Woo, distinguished inventor at Rambus; Jongsin Yun, memory technology specialist at Siemens EDA; Randy White, memory solutions program manager at Keysight; and Frank Schirrmeister, vice president of solutions and business development at Arteris, to discuss the impact of off-chip memory on power consumption and heat dissipation, as well as how to optimize performance. What follows are excerpts of that conversation.

SE: What role will CXL and UCIe play in the future of memory, especially given the cost of moving data?

White: The main goal of UCIe is interoperability, along with reducing cost and improving yield. So from the very beginning we get better overall metrics with UCIe, and that translates not only to memory but to other IP blocks as well. As for CXL, which is more applicable to AI and machine learning as many different architectures emerge, it will play a role in managing and minimizing cost. Total cost of ownership is always the primary metric for JEDEC, with power consumption and performance as secondary metrics. CXL is essentially optimized for disaggregated, heterogeneous computing architectures, reducing over-provisioning and designing around latency.

Schirrmeister: Take on-chip networks as an example. AXI, CHI, and OCP are all on-chip connectivity protocols. Once you go off the die or chip, PCIe and CXL are the interface protocols. CXL has a variety of usage models, including some notion of coherency between different components. On the Open Compute Project forums, when CXL is discussed it is all about memory-related usage models. UCIe is always one of the options for die-to-die connectivity. In a memory context, UCIe can be used in a chiplet environment where there is an initiator and a target that carries additional memory. UCIe and its latency play an important role in how the connectivity is done and how the architecture is built so data arrives in a timely manner. AI/ML architectures depend heavily on moving data in and out. We haven't solved the memory wall yet, so from a system perspective it's important to think architecturally about where the data is kept.

Woo: One of the hardest challenges is that datasets are getting bigger and bigger, so one of the problems CXL can help solve is adding more memory to the nodes themselves. The core counts of these processors keep increasing, and each core in turn wants a certain amount of its own memory capacity. On top of that, the datasets are getting larger, so each node needs more memory capacity. There are many usage models today, and we're seeing people pass data and do computation across multiple nodes, especially in AI applications where large models need to be trained on many different processors. Protocols such as CXL and UCIe give processors flexible ways to access data. Both technologies will give programmers the flexibility to implement and access data sharing across multiple nodes in whatever way makes the most sense for them, while also addressing the memory wall along with power and latency issues.

Ferro: There has been a lot of talk about CXL for memory pooling. At a more practical cost level, there is a burden that comes with the size of the servers and chassis in the data center, even though more memory can be integrated there. Leveraging existing infrastructure, and being able to continue to scale as we move into CXL 3.0, becomes important for avoiding stranded-memory scenarios where a processor cannot reach memory. CXL also adds another tier of memory, so you don't have to go all the way out to storage/SSD, which also minimizes latency. As for UCIe, with high-bandwidth memory and very expensive 2.5D structures, UCIe may be a way to help disaggregate those structures and reduce cost. For example, if a large processor, a GPU or CPU, wants memory as close to it as possible, such as high-bandwidth memory, it has to reserve a considerable area on a silicon interposer or some other interposer technology, which increases the cost of the overall system because the interposer has to accommodate the CPU, the DRAM, and any other components that might be needed. With chiplets, it is possible to put the memory on its own 2.5D module and put the processor on a cheaper substrate, connected via UCIe. That is one usage model that can reduce cost.

Yun: At IEDM, there was a lot of discussion about AI and different types of memory. The number of parameters AI has to process has been growing at a rapid pace, increasing by about 40 times in less than five years. As a result, AI needs to process huge amounts of data. However, DRAM performance and board-level communication have not improved nearly as much, only by 1.5 to 2 times every two years, which is far below what AI growth actually demands. This is one example of our attempts to improve communication between memory and the chip. There is a huge gap between the data that memory can supply and the data that AI compute demands, and that problem still needs to be solved.
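
As a rough illustration of the gap Yun describes, the short sketch below simply compounds the growth rates he quotes (about 40 times for AI in five years versus 1.5 to 2 times for DRAM bandwidth every two years). The script and its normalization are illustrative, not part of the discussion.

```python
# Back-of-the-envelope sketch of the compute-demand vs. memory-bandwidth gap
# described above. The growth rates are the ones quoted in the discussion;
# everything else is illustrative.

AI_GROWTH_5YR = 40          # ~40x growth in AI model/parameter demand over ~5 years
DRAM_GROWTH_2YR = 2.0       # optimistic case: 2x DRAM bandwidth improvement every 2 years

years = 5
ai_demand = AI_GROWTH_5YR                       # normalized to 1.0 at year 0
dram_bandwidth = DRAM_GROWTH_2YR ** (years / 2) # compounded over 5 years, roughly 5.7x

print(f"AI demand growth over {years} years:      {ai_demand:.1f}x")
print(f"DRAM bandwidth growth over {years} years: {dram_bandwidth:.1f}x")
print(f"Gap (demand / supply):                    {ai_demand / dram_bandwidth:.1f}x")
```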

SE: How will memory help us solve the power and heat dissipation problems?

White: Power consumption is an issue for memory. Something like 50% of data center cost goes to memory, whether that is I/O, refresh management, or cooling and maintenance. We're talking about volatile memory, specifically DRAM. As we've discussed, data volumes are getting bigger, workloads are getting bigger, and processing is getting faster, all of which means higher energy consumption. As scale continues to grow, many initiatives are aimed at supporting the bandwidth required by the ever-increasing number of cores, and power consumption increases accordingly. Along the way we have tweaked a few smaller things, including reducing voltage swing and improving the I/O power rails. We are also working to make memory refresh management more efficient and to use more bank groups, which also improves overall throughput. A few years ago a customer approached us, wanting JEDEC to make a major change in how memory is specified across temperature ranges. LPDDR has a wider range and different temperature classes, but for the most part we are talking about commodity DDR, because of its capacity advantage and because it is the most commonly used memory in data centers. The customer wanted to propose to JEDEC that if we could raise the operating temperature of the DRAM by five degrees, even knowing that the refresh rate would increase as the temperature went up, it would reduce electricity demand by the equivalent of three coal-fired power plants per year. So what is done at the device level translates into macro-scale changes globally, at the power-plant level. In addition, for quite some time there has been an over-provisioning problem in memory design at the system-architecture level. We introduced PMICs (power management ICs), so voltage regulation is done at the module level. We have built-in temperature sensors, so the system doesn't need to monitor the temperature inside the case. There is now module- and device-specific temperature and thermal management to make all of this more efficient.
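
White's point about refresh and temperature can be made concrete with a small sketch. It assumes DDR4-style derating, where the average refresh interval (tREFI) of 7.8 µs is halved above 85°C; actual thresholds and intervals vary by device and JEDEC generation.

```python
# Sketch: how the DRAM refresh command rate scales with temperature.
# Assumes DDR4-style derating (base tREFI of 7.8 us, halved above 85 C);
# real devices and newer JEDEC generations differ in the details.

def refresh_commands_per_second(temp_c: float) -> float:
    base_trefi_us = 7.8                     # average refresh interval, normal temperature range
    trefi_us = base_trefi_us / 2 if temp_c > 85 else base_trefi_us
    return 1e6 / trefi_us                   # REF commands per second per rank

for temp in (75, 85, 95):
    print(f"{temp} C: ~{refresh_commands_per_second(temp):,.0f} REF commands/s per rank")
```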

Schirrmeister: If DRAM were a person, it would be socially isolated, because nobody wants to talk to it. As important as it is, everyone wants to talk to it as little as possible because of the latency and power cost involved. For example, in an AI/ML architecture, everyone wants to avoid adding all of that cost, which is why everyone asks whether the data can be kept locally or moved around differently. Can I systematically arrange my architecture so that the data arrives at the right time for the computation? That's why memory is important: it holds all the data. And when you optimize for latency, you're also optimizing for power. From a system perspective, you really want to minimize accesses. That has very interesting implications for the NoC data-transport architecture, with people wanting to carry data along with them, keep it in various local caches, and design their architecture to minimize accesses to DRAM.
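
To see why architects push so hard to minimize DRAM accesses, the following sketch applies a simple traffic model to a matrix multiply, comparing a naive approach with one tiled so the working set stays in local on-chip memory. The matrix size, tile size, and element width are illustrative assumptions, not figures from the discussion.

```python
# Illustrative DRAM-traffic model for an N x N matrix multiply:
# naive (operands re-read from DRAM for every multiply-accumulate) vs. tiled
# (a T x T working set is loaded once and reused from on-chip memory).
# All numbers are placeholders; C-matrix traffic is ignored in both cases.

def dram_traffic_bytes(n, tile=None, elem_bytes=2):
    if tile is None:
        # Naive: 2 operand reads from DRAM per multiply-accumulate.
        return 2 * n * n * n * elem_bytes
    # Tiled: each A and B tile is loaded once per tile-pair it participates in.
    tiles = n // tile
    return 2 * (tiles ** 3) * (tile * tile) * elem_bytes

n, tile = 4096, 128
naive = dram_traffic_bytes(n)
tiled = dram_traffic_bytes(n, tile)
print(f"naive : {naive / 1e9:,.1f} GB moved from DRAM")
print(f"tiled : {tiled / 1e9:,.1f} GB moved from DRAM (~{naive / tiled:.0f}x less)")
```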

Ferro: When we look at different AI architectures, the first goal for most people is to keep as much data local as possible, or even avoid DRAM altogether. Some companies make that their value proposition: if you don't have to store data off-chip, you can get an order-of-magnitude improvement in power consumption and performance. We've already talked about the size of the data models, which are only going to get bigger and bulkier, so that most likely isn't practical yet. But the more you can do on the chip, the more power you save. Even the concept of HBM is about a wider, slower memory. If you look at previous generations of HBM, their data rates were around 3.2Gb/s. Now they have reached around 6Gb/s, which is still relatively slow for a very wide DRAM, and in this generation they have even lowered the I/O voltage to 0.4V to try to reduce I/O power. If you can run the DRAM slower, you can also save power. Now memory is being placed very close to the processor, so there is a larger thermal footprint in a smaller area. We may have improved in some areas, but we are more challenged in others.
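
Ferro's observation that lowering the I/O voltage saves power follows from the usual dynamic-power scaling, where switching energy per bit goes roughly as C·V². The 0.4V figure comes from the discussion above; the 1.1V comparison point and unit capacitance in the sketch are illustrative assumptions.

```python
# Sketch of why lowering I/O voltage matters: dynamic switching energy per bit
# scales roughly with C * V^2. The 0.4 V figure is quoted in the discussion;
# the 1.1 V comparison point and the unit capacitance are assumptions.

def relative_energy_per_bit(v_io: float, c: float = 1.0) -> float:
    return c * v_io ** 2        # E ~ C * V^2 per transition

v_old, v_new = 1.1, 0.4
ratio = relative_energy_per_bit(v_old) / relative_energy_per_bit(v_new)
print(f"Switching energy per bit at {v_old} V is ~{ratio:.1f}x that at {v_new} V")
```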

Schirrmeister: Building on Frank's point, IBM's NorthPole AI architecture is an interesting example. If you look at it from an energy-efficiency perspective, most of the memory is essentially on-chip, although that doesn't apply in every case. It's essentially an extreme case of going off-chip as little as possible and providing as much on-chip memory as possible. IBM's research shows that this can be done.

Woo: You have to be strategic when thinking about DRAM. You have to think carefully about the interplay between the SRAM at the top of the memory hierarchy and the disk at the bottom. You don't want to move large amounts of data to any level of the hierarchy if you can avoid it, and when you do need to move it, you want to make sure you use that data as much as possible to amortize the overhead. The memory industry has always been very good at addressing certain key requirements. If you look at the evolution of products like low-power DRAM and HBM, both were driven by standard memory failing to meet certain performance parameters, such as power consumption. Some of the paths forward people are talking about, especially with AI as a big driver, improve not only performance but also power efficiency. For example, stacking DRAM directly on top of the processor would help both performance and power. Going forward, the memory industry will evolve by focusing on architectural changes: not just incremental steps like the low-power roadmap, but other, larger changes.
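
Woo's point about amortizing data movement can be shown with a toy energy model: once a byte has been fetched from DRAM, the more operations that touch it, the lower the average energy per operation. The picojoule figures below are rough, illustrative orders of magnitude, not values from the discussion.

```python
# Toy model of amortizing off-chip data movement. Both energy constants are
# illustrative orders of magnitude only, not measured or quoted values.

DRAM_ACCESS_PJ_PER_BYTE = 100.0   # illustrative cost of fetching one byte from off-chip DRAM
OP_ENERGY_PJ = 1.0                # illustrative cost of one on-chip arithmetic operation

def energy_per_op_pj(bytes_moved: int, ops_on_that_data: int) -> float:
    movement = bytes_moved * DRAM_ACCESS_PJ_PER_BYTE
    compute = ops_on_that_data * OP_ENERGY_PJ
    return (movement + compute) / ops_on_that_data

for reuse in (1, 10, 100, 1000):
    print(f"{reuse:>4} ops per byte moved -> {energy_per_op_pj(1, reuse):7.2f} pJ/op on average")
```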

SE: In addition to what we've already discussed, is there any other way that memory can help with latency?

White: We're adding compute, which will address many of the needs of edge computing. In addition, a clear advantage of CXL is that we can now pass pointers to memory addresses rather than the data itself, which is more efficient and reduces overall latency.
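
The pointer-passing idea White mentions can be sketched conceptually as follows. This is not a real CXL API; it only contrasts copying a buffer between nodes with handing a consumer an offset into a shared, coherent memory pool.

```python
# Conceptual sketch of "passing pointers instead of data." Not a real CXL API;
# shared_pool stands in for a CXL-attached, coherently shared memory region.

shared_pool = bytearray(1 << 20)

def producer_writes(data: bytes, offset: int) -> int:
    shared_pool[offset:offset + len(data)] = data
    return offset                   # only the "pointer" (offset) is handed to the consumer

def consumer_reads(offset: int, length: int) -> bytes:
    return bytes(shared_pool[offset:offset + length])

# A copy-based hand-off would move len(data) bytes across the fabric;
# a pointer-based hand-off moves only the offset and length.
ptr = producer_writes(b"model weights shard 0", offset=4096)
print(consumer_reads(ptr, length=21))
```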

Schirrmeister: There's also the power-consumption side. CXL, CHI, PCIe: all of these protocols have to be integrated on and between chips, especially in chiplet environments. Imagine that in the background data moves across the chip via AXI or CHI, and now suddenly you have to start converting it to get from one chip to another. That has an impact from a power perspective. Everyone is talking about building an open chiplet ecosystem and mixing and matching between different players. To get there, you need to make sure you don't have to convert all the time. It reminds me of the days when there were five different video formats and three different audio formats, and everything needed converting. We all want to avoid that kind of situation, because it adds energy consumption and latency. From a NoC perspective, if you're trying to read data from memory and you have to insert a block somewhere, because you need to go through UCIe to another chip to reach the memory attached to it, that lengthens the path. That's why the architect's role is becoming more and more important. From a latency and low-power perspective, we all want to avoid conversions. If everyone spoke the same language, those conversions would just be gates added for nothing.
