
"Burning" $7 Trillion: OpenAI Makes Enemies of Nvidia and TSMC


Tencent Technology

2024-05-17 09:20, posted on the official account of the Tencent News Technology Channel (Hebei)


This issue is a feasibility analysis of Sam Altman's rumored $7 trillion fab plan, published exclusively on Tencent News. Please do not reprint without authorization.

By Leslie Wu, former TSMC fab-construction specialist (WeChat official account: 梓豪谈芯)

A friend asked: Leslie, you come from a professional fab-construction background. It was rumored that Sam Altman wanted to invest $7 trillion in chip manufacturing. Although he never clearly confirmed it, the major media all ran the story, so it can hardly be groundless. With $7 trillion, is it more cost-effective to build or to buy? And if building, what investments and factors would need to be considered?

Altman's "$7 trillion" rumor is indeed unprecedented, and the idea is an interesting one. Answering the question requires a good deal of knowledge about semiconductor fab construction, process technology, and operations. So let's try the math: if you had $7 trillion, how would you spend it "reasonably"?


Every 10,000 wafers of capacity requires $15 billion of fab investment

First of all, the GPU logic chips have to be produced, which means building a logic fab first. Let's take TSMC's state-of-the-art 2nm fab as an example and look at the capital investment per 10,000 wafers of monthly capacity.

By amount, fab investment breaks down roughly as follows: process equipment 77%, land and buildings 4%, clean room 5%, and supply systems for water, electricity, chemicals and the like 14%.

Lithography tools are among the largest equipment investments in a fab, normally about 20% of process equipment. At 2nm, EUV lithography is required (covering 25 layers), which pushes the cost share higher, to about 24%.

At present, TSMC's 2nm still uses low-numerical-aperture EUV, corresponding to ASML's latest NXE:3800E, whose throughput (WPH, wafers per hour) is roughly 190-200. The expected monthly capacity of a single tool is about 2,400 wafers (see the note under the table below for the calculation), which means every 10,000 wafers of monthly capacity requires 4 ASML NXE:3800E EUV machines. Besides the 25 layers handled by EUV, the remaining layers require three NXT:2100i immersion machines with a throughput of 295 wafers per hour, plus one KrF DUV lithography machine.


*Monthly capacity per tool = wafer output per hour × operating hours × % customer efficiency × % dose headwind (vs. 30 mJ) ÷ number of EUV layers × 30 days

According to figures provided by ASML, the NXE:3800E is priced at about $200 million, the NXT:2100i at about $75 million, and a KrF DUV tool at about $15 million. That means for a 2nm fab, the lithography investment (including maintenance and spare parts) per 10,000 wafers of capacity is expected to be about $1.3 billion.

Taking lithography as 24% of all process equipment, total process equipment investment comes to $5.4 billion; process equipment in turn accounts for 77% of total fab investment, so the total fab investment works out to roughly $7.1 billion per 10,000 wafers of capacity at the 2nm node.
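The tool-count and capex arithmetic above can be sketched in a few lines (all prices and percentages are the article's estimates, not vendor figures):

```python
# Lithography tools per 10,000 wafers/month of 2nm capacity (article's counts).
euv_tools, euv_price = 4, 200e6              # ASML NXE:3800E, ~$200M each
immersion_tools, immersion_price = 3, 75e6   # NXT:2100i, ~$75M each
krf_tools, krf_price = 1, 15e6               # KrF DUV, ~$15M

tools_only = (euv_tools * euv_price
              + immersion_tools * immersion_price
              + krf_tools * krf_price)       # ≈ $1.04B for the tools alone
litho_total = 1.3e9   # article's figure once maintenance and spares are added

equipment = litho_total / 0.24   # lithography ≈ 24% of process equipment
fab_total = equipment / 0.77     # equipment ≈ 77% of total fab investment

print(f"tools only : ${tools_only / 1e9:.2f}B")
print(f"equipment  : ${equipment / 1e9:.1f}B")   # ≈ $5.4B
print(f"fab total  : ${fab_total / 1e9:.1f}B")   # ≈ $7.0B; article rounds to $7.1B
```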


Total fab investment per 10,000 wafers of monthly capacity (unit: USD)

Once the GPU logic fab is settled, the next step is a DRAM fab for producing HBM.

Take the state-of-the-art 1-gamma process as an example. Although the number of EUV layers in DRAM has decreased, the equipment count per 10,000 wafers has actually increased rather than decreased, especially etching equipment. Overall, the investment for a 1-gamma DRAM fab is estimated at about 85% of a comparable logic fab, i.e., about $6 billion.

To add some context: in the DUV era, DRAM fab investment ran about 110%-120% of a logic fab at the same level. Only from the 7nm node, when logic fabs began using EUV lithography at scale, did logic investment per 10,000 wafers begin to exceed DRAM.

With the front-end GPU logic chips and DRAM memory chips solved, the back-end packaging problem remains, including CoWoS advanced packaging and HBM packaging.

At present, the most advanced AI chips use SoIC + CoWoS packaging, and HBM4 will adopt hybrid bonding, so the packaging investment per 10,000 wafers rises significantly, to $1 billion (including equipment and plant). In addition, the interposers used in advanced packaging require a supporting 65/45nm front-end fab using DUV, at $800 million per 10,000 wafers. In other words, the overall packaging investment per 10,000 wafers is expected to be around $1.8 billion.


Unit investment of different fabs per 10,000 wafers of capacity (unit: USD)

This money-spending exercise involves a lot of data, so here is a small summary. Per 10,000 wafers of capacity, the unit investment is: 2nm logic fab $7.1 billion, 1-gamma DRAM fab $6 billion, and packaging plants (advanced packaging plus interposer, $1 billion + $800 million), for a total of about $15 billion.
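As a quick check, the unit-investment summary adds up like this (figures as above):

```python
# Unit capex per 10,000 wafers/month, in $B (the article's figures).
unit_capex = {
    "2nm logic fab": 7.1,
    "1-gamma DRAM fab": 6.0,
    "advanced packaging (SoIC + CoWoS)": 1.0,
    "65/45nm interposer fab": 0.8,
}
total = sum(unit_capex.values())
print(f"total unit investment: ${total:.1f}B")  # ≈ $15B
```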


An annual output of 6 million GPUs costs $50 billion in hardware

However, GPU logic chips, HBM memory, and interposers are not consumed in a 1:1:1 ratio, so the total investment has to be scaled up from the unit investment by the appropriate coefficients, which can be roughly estimated from the number of CPUs, HBM stacks, and interposers each GPU requires.

Take the die size of NVIDIA's latest Blackwell-architecture GPU, the B200 (814 mm²), as an example: each wafer yields 80 candidate dies, and at roughly 65% yield on TSMC's best process, each wafer yields about 50 good dies.

Incidentally, because a GPU logic chip is a large die, enlarging the exposure field in lithography requires keeping the depth of field of the objective lens relatively large, which reduces resolution. This is an important reason defects increase and yield drops on large dies.

NVIDIA pairs the B200 with the Grace CPU, two GPUs per Grace CPU, so 50 GPUs need 25 CPUs. On the 3nm process, given the CPU's die size and yield, one wafer yields about 300 CPU dies, which means each GPU wafer needs to be matched with about 0.08 CPU wafers.

At present, investment per 10,000 wafers at the 3nm node is about 70% of the 2nm node, roughly $5 billion. That is, alongside the $7.1 billion invested to produce 10,000 GPU wafers, another $7.1 billion × 70% × 0.08, i.e., about $400 million, is needed for CPU wafer production.

Another highlight of AI chips is HBM. NVIDIA's H100 and H200 carry 6 stacks as standard, and the Blackwell-architecture B200 uses 8 HBM3e stacks. According to TSMC's latest roadmap, by 2026 one GPU may be paired with 12 HBM stacks, with specifications upgraded from 12-high HBM3e to 16-high HBM4/4e.

As mentioned above, a 2nm wafer yields 50 GPU logic chips, so under the B200 specification each wafer needs 400 HBM3e stacks. At present, a 1-gamma DRAM wafer yields about 1,200 DRAM dice; at 85% yield, that is about 1,000 good dice, which then have to be packaged into 12-high HBM3e stacks. With packaging yield at about 80%, one DRAM wafer produces 1,000 ÷ 12 × 80%, i.e., about 70 12-high HBM3e stacks.

In other words, besides 0.08 CPU wafers, each GPU wafer requires 5.7 DRAM wafers. In the future, as GPUs carry more HBM stacks and stack heights rise from 12 to 16 layers, the 1:5.7 GPU:DRAM wafer ratio will widen further.

Based on the interposer sizes of existing advanced packaging, one interposer wafer can package about 15 GPU logic chips, so each wafer of GPU logic chips requires about 3.3 wafers of advanced packaging.
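The wafer ratios derived above can be reproduced as follows (die counts and yields are the article's estimates; with these raw inputs the DRAM ratio comes out slightly above the article's rounded 5.7):

```python
good_gpus_per_wafer = 50          # 80 candidate dies x ~65% yield, per article

# CPU ratio: 2 GPUs pair with 1 Grace CPU; ~300 CPU dies per 3nm wafer.
cpu_ratio = (good_gpus_per_wafer / 2) / 300          # ≈ 0.08 CPU wafers

# HBM ratio: the B200 uses 8 HBM3e stacks per GPU.
hbm_needed = good_gpus_per_wafer * 8                 # 400 stacks per GPU wafer
stacks_per_dram_wafer = 1200 * 0.85 / 12 * 0.80      # ≈ 68; article rounds to 70
dram_ratio = hbm_needed / stacks_per_dram_wafer      # ≈ 5.9 (article: ~5.7)

# Interposer ratio: ~15 packaged GPUs per interposer wafer.
interposer_ratio = good_gpus_per_wafer / 15          # ≈ 3.3

print(f"CPU {cpu_ratio:.2f}, DRAM {dram_ratio:.1f}, interposer {interposer_ratio:.1f}")
```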


Total construction investment of different fabs per 10,000 wafers of capacity (unit: USD)

In a word: every 10,000 GPU wafers requires 800 CPU wafers, 57,000 DRAM wafers, 33,000 interposer wafers, 33,000 SoIC+CoWoS advanced packages, and 57,000 HBM packages. The corresponding investment, in units of $100 million, is 1×71 + 0.08×50 + 5.7×60 + 3.3×10 + 3.3×8 + 5.7×10 ≈ $47.6 billion.

For every 10,000 GPU wafers per month, all the supporting chip plants together cost $47.6 billion; adding other miscellaneous expenses, round it to an even $50 billion. Converted into GPUs, that is 500,000 per month, or 6 million a year.
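The conversion from wafer starts to GPU output is straightforward:

```python
wafer_starts_per_month = 10_000
good_gpus_per_wafer = 50          # per article: B200-class dies at ~65% yield

monthly_gpus = wafer_starts_per_month * good_gpus_per_wafer   # 500,000
annual_gpus = monthly_gpus * 12                               # 6,000,000
print(f"{monthly_gpus:,} GPUs/month, {annual_gpus:,} GPUs/year")
```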


$7 trillion can be burned through in eight and a half years

What does investing $50 billion to produce 6 million GPUs a year actually mean? We can estimate the world's total AI GPU volume from TSMC's CoWoS capacity and compare.

In 2024, TSMC's CoWoS capacity totals about 310,000 wafers, of which 95% goes to AI GPUs. Only a little over 10,000 wafers go to Xilinx FPGAs; the remaining nearly 300,000 are split among NVIDIA, AMD, and the self-developed ASICs of global Internet giants such as Google, AWS, Meta, and Microsoft.

In other words, TSMC's CoWoS capacity effectively represents the world's AI chip capacity. In 2024, 80% of GPUs still use only 2.5D CoWoS; NVIDIA's H100 works out to about 29 chips per wafer, while self-developed ASICs run higher, some over 40. At present only AMD's MI300 uses SoIC packaging, at about 15 per wafer.

To sum up, TSMC's roughly 300,000 CoWoS wafers this year correspond to about 10 million GPUs, which is the approximate global total of AI GPUs in 2024. As calculated above, a $50 billion investment yields 6 million GPUs a year, so producing the 10 million AI GPUs the world needs in 2024 would take about $83 billion in total. That level equals TSMC's capital expenditure over 2-3 years, and is also roughly the total investment of TSMC's Fab 20A, a 2nm fab with a monthly output of 120,000 wafers.
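A rough reproduction of that global estimate. The article gives no blended chips-per-wafer figure, so the 33 below is an assumption chosen to sit between the H100's ~29 and the 40+ of some ASICs:

```python
cowos_wafers_for_ai = 300_000     # of TSMC's ~310k-wafer 2024 CoWoS capacity
gpus_per_wafer = 33               # assumed blend: H100 ~29, ASICs 40+, MI300 ~15

global_ai_gpus = cowos_wafers_for_ai * gpus_per_wafer   # ≈ 10 million
capex_needed = 50e9 * global_ai_gpus / 6e6              # $50B buys 6M GPUs/year
print(f"≈{global_ai_gpus / 1e6:.0f}M GPUs, capex ≈ ${capex_needed / 1e9:.1f}B")
```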

An investment of $83 billion can produce all the AI chips the world needs in 2024, so there is still a long way to go to spend Altman's $7 trillion. After all, $83 billion only covers building the chip plants.

After the chip, DRAM, and packaging plants are built, server production plants have to be considered, and many factories on the scale of Foxconn Industrial Internet's would need to be built. But such server assembly plants are a drop in the bucket compared with chip plants. Building every plant in the AI server industry chain, covering servers, optical modules, liquid cooling, copper cabling, and all kinds of tooling, for an annual output of 10 million GPUs at 8 GPUs per server, i.e., 1.2 million servers, comes to roughly $17 billion in total downstream investment. Together with the $83 billion of upstream chip plants, $100 billion is the total factory-construction cost behind all global AI chip and server shipments in 2024.

Building upstream and downstream plants is only the beginning; the process also demands continuous R&D investment, including design and manufacturing R&D covering GPUs, CPUs, HBM, advanced packaging, and so on. Bundled together, the total R&D of NVIDIA, AMD, TSMC, and SK hynix is roughly $30 billion. Adding the R&D for server hardware, such as optical modules and liquid cooling, brings the R&D portion to about $50 billion.

For OpenAI, the push toward AGI also requires continued investment in model R&D, at a cost of at least $20 billion per year.

Chip R&D plus AI R&D thus totals at least $70 billion a year. To move faster, R&D spending would inevitably rise; to drive toward the ultimate goal of AGI, total R&D investment is estimated to grow to $100 billion per year.

The R&D costs above do not include training costs. Training consumes large amounts of water and electricity, and this infrastructure would also have to be built in-house.

At present, building 1 kW of nuclear capacity in Europe and the United States costs about $4,000, and a one-million-kilowatt nuclear unit generates about 860 million kilowatt-hours a year. According to IEA (International Energy Agency) estimates, global artificial intelligence will consume 134 billion kilowatt-hours of electricity in 2027, so about $600 billion would be needed to build roughly 155 one-million-kilowatt nuclear units.
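The power build-out arithmetic, taking the article's per-unit figures at face value:

```python
ai_demand_kwh = 134e9        # IEA projection for AI electricity use in 2027
kwh_per_unit_year = 860e6    # article's annual output per one-million-kW unit
cost_per_unit = 1_000_000 * 4_000   # $4,000/kW x 1,000,000 kW = $4B per unit

units_needed = ai_demand_kwh / kwh_per_unit_year     # ≈ 156; article says 155
build_cost = 155 * cost_per_unit                     # $620B, ~$600B rounded
print(f"{units_needed:.0f} units, build cost ${build_cost / 1e9:.0f}B")
```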

According to a study from the University of California, Riverside, AI will consume 6.6 billion cubic meters of clean fresh water in 2027, about half of the UK's water consumption, mainly across three water-hungry links: server cooling, power generation, and chip manufacturing. Building the corresponding water treatment plants would cost about $100 billion.

Compared with the investments above, labor costs in manufacturing are relatively small; the bulk is concentrated in chip design and model R&D.

A front-end fab needs about 1,000 people per 10,000 wafers, and the back end about 1,500. All front-end plants (including DRAM, interposers, etc.) for an annual capacity of 20 million GPUs need about 20,000 people at an average annual salary of $150,000; back-end packaging needs 30,000 people at about $70,000; plus 5,000 chip-manufacturing R&D staff at an average of $200,000. Chip manufacturing thus costs about $6 billion a year in labor.

For the labor cost of chip design and large language models, reckoned at 1.5 times the combined headcount of Nvidia, OpenAI, and Microsoft's server division, that is about 50,000 people at an average annual salary of $300,000, or $15 billion in total. All server hardware manufacturing plants employ about 150,000 people, and power and water facilities another 150,000, for 300,000 people at an average of $80,000 a year, or $24 billion in total.

Across all the links above, annual wages come to $6 billion + $15 billion + $24 billion, a total of $45 billion.
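The wage bill sums up as follows (headcounts and salaries from the article):

```python
# (headcount, average annual salary in USD)
workforce = [
    (20_000, 150_000),   # front-end fab staff (logic, DRAM, interposer)
    (30_000, 70_000),    # back-end packaging staff
    (5_000, 200_000),    # chip-manufacturing R&D staff
    (50_000, 300_000),   # chip design and large-model staff
    (300_000, 80_000),   # server hardware plus power/water facility staff
]
total_wages = sum(n * salary for n, salary in workforce)
print(f"total wages: ${total_wages / 1e9:.1f}B per year")   # ≈ $45B
```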

For material costs, GPUs and related chips plus all server hardware come to about $2,000 per unit; at 20 million units a year, that is $40 billion. Server operating expenses are then labor ($45 billion) + materials ($40 billion) + other miscellaneous expenses, rounded to $100 billion.


Estimated annual costs for OpenAI at 20 million GPUs per year (unit: USD)

If we spend this money for Altman, the ideal plan looks like this: $200 billion to build manufacturing plants for an annual output of 20 million GPUs (the world will use about 10 million in 2024) and all server hardware; $100 billion of R&D investment per year to push toward the ultimate goal of AGI; $100 billion in total labor costs for related design, R&D, and manufacturing; $700 billion for energy and water infrastructure; $200 billion set aside for cash payables and miscellaneous or uncounted expenses; and $400 billion as a reserve against omissions. In all, about $1.7 trillion covers the start-up capital for a full manufacturing base of 20 million AI GPUs.

For working capital, $100 billion a year must go into new chip and hardware capacity to keep pushing Moore's Law and transistor density; another $200 billion a year into new electricity and water supply; and about $100 billion a year into labor and materials. At the upper limit, operating expenses would run $700 billion per year.

Calculated this way, less than $2 trillion covers all the manufacturing, energy infrastructure, and operating expenses needed to double 2024's global AI chip capacity, and the ongoing additions are capped at about $700 billion per year, which the remaining budget could roughly sustain for another 7.5 years.
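Finally, the burn-rate arithmetic behind the headline, consistent with the article's roughly 7.5 years of operation (eight and a half years including the initial build-out):

```python
budget = 7.0e12          # the rumored "$7 trillion"
startup = 1.7e12         # plants, infrastructure, reserves (one-time)
annual_burn = 0.7e12     # capacity + power/water + labor/materials, per year

years_of_operation = (budget - startup) / annual_burn   # ≈ 7.6 years
print(f"{years_of_operation:.1f} years of operation after build-out")
```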

Even after furiously smashing $2 trillion into twice the world's AI GPU capacity, Altman could not monopolize global artificial intelligence. OpenAI's models lead the world, but they stand on the world's most advanced chip manufacturing, represented by TSMC, and on GPU chip design, represented by NVIDIA.

As "old forces", TSMC, Nvidia, and many chip design companies would likely back all of OpenAI's competitors, including every company in Silicon Valley and beyond working on models large and small and on AI applications: the old rival DeepMind and giants like Google, AWS, and Microsoft, as well as Stability and startups such as Anthropic, founded by former OpenAI members.

Even if OpenAI's chip design and manufacturing capabilities matched TSMC and NVIDIA, it would not necessarily hold an absolute advantage against all the world's models and algorithms, let alone after a $7 trillion chip venture has put it on the opposite side of Nvidia and TSMC.

Objectively speaking, ecosystem aside, GPU design companies are not irreplaceable. In GPU design, beyond Nvidia and AMD there are AI chip designers like Cerebras, whose wafer-scale chips far exceed the area of traditional GPUs. In chip manufacturing, however, TSMC currently stands alone.

Take 2024 as an example: TSMC's N3P process reaches a density of up to 284 million transistors per square millimeter, while second-place Intel can only offer Intel 4 at 180 million transistors per square millimeter.

Chip manufacturing cannot be conjured with money alone; an industry built on deep technical accumulation is precisely the kind that cash cannot simply bludgeon into existence. Building a fab and shipping chips takes at least three years. For OpenAI, even assuming the "$7 trillion" is real, it would first have to survive those three years in the face of the "old forces'" counterattack.
