Want to reproduce Google's 540-billion-parameter PaLM model? Renting the chips alone costs at least $10 million!

Reporting by XinZhiyuan

Editor: LRS

Recently, a researcher estimated that renting chips to train Google's PaLM model would cost tens of millions of dollars for the compute alone, not counting data, testing, and other overhead, and that renting GPUs works out cheaper than renting TPUs.

Google's recently released PaLM language model set new state-of-the-art results across a number of natural language processing tasks, and this 540-billion-parameter Transformer once again proved that "scale works miracles".

Paper address: https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf

Besides relying on the powerful Pathways system, the paper describes how PaLM was trained on 6,144 TPU v4 chips with a high-quality dataset of 780 billion tokens, a certain proportion of which is non-English multilingual corpus.

In a word: expensive.

If you really wanted to reproduce the training run, how much would it cost?

A researcher recently estimated the cost. The short answer: about $9 million to $17 million.

For comparison with other models: training BERT cost $12,000, GPT-2 cost $43,000, XLNet cost $61,000, and a single training run of the 11-billion-parameter Google T5 is estimated at $1.3 million.

Note that training cost is not set in stone: hardware improvements and better optimization techniques can bring it down. But even if training costs drop sharply, collecting and cleaning data at this scale remains far beyond what cash-strapped PhD students and small companies can afford.

A Moore's Law for training costs

The amount of computation required to train machine learning models has been soaring, and with it the amount of computing resources that must be procured.

Advances in compute, data, and algorithms are the three fundamental drivers of progress in modern machine learning. Of the three, compute is the easiest to quantify, so training cost is usually estimated from the amount of computation a model's training requires.

Researchers have found that before 2010, the compute used to train landmark machine learning models roughly tracked Moore's Law, doubling about every 20 months.

But with the advent of deep learning around 2010, the compute required for training began doubling roughly every six months.

By late 2015 a new trend had emerged: pre-trained large-scale models developed by companies, which raised training-compute requirements by another 10 to 100 times.

These three periods are known as the pre-deep-learning era, the deep-learning era, and the large-scale era.

Paper address: https://arxiv.org/pdf/2202.05924.pdf

Returning to PaLM: its 540 billion parameters put it at the very summit of this compute mountain.

Based on the training-compute data provided in the paper, the final training run of PaLM required 2.56e24 FLOPs.
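
As a sanity check, the widely used heuristic of roughly 6 FLOPs per parameter per token (our assumption here; the 2.56e24 figure itself comes from the paper) lands close to the reported number:

```python
# Sanity check with the common "~6 FLOPs per parameter per token" heuristic.
# The heuristic is our assumption; the 2.56e24 figure is from the PaLM paper.
params = 540e9   # 540 billion parameters
tokens = 780e9   # 780 billion training tokens

approx_flops = 6 * params * tokens
print(f"{approx_flops:.2e} FLOPs")   # ~2.53e+24, close to the reported 2.56e24
```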

If the training FLOPs of the 175-billion-parameter GPT-3 are taken as the base unit, PaLM costs roughly ten times as much as GPT-3.

The paper also mentions that PaLM trained for 1,200 hours on 6,144 TPU v4 chips and for 336 hours on 3,072 TPU v4 chips, figures that include some downtime and repeated steps.

Each TPU v4 chip has two cores, so training consumed 8,404,992 chip-hours, or 16,809,984 TPU v4 core-hours in total, equivalent to about 1,917.7 years on a single core.
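
The chip-hour arithmetic, as a quick check:

```python
# Chip-hours reported in the PaLM paper, converted to core-hours and core-years.
chip_hours = 6144 * 1200 + 3072 * 336   # two training phases
core_hours = chip_hours * 2             # each TPU v4 chip has two cores
core_years = core_hours / (365.25 * 24)

print(f"{chip_hours:,} chip-hours")     # 8,404,992
print(f"{core_hours:,} core-hours")     # 16,809,984
print(f"{core_years:,.1f} core-years")  # ~1,917.6
```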

Appendix B of the paper also addresses TPU utilization. The 540B model was trained with rematerialization, which allows a larger batch size and higher throughput. Not counting the cost of rematerialization, the FLOPs utilization excluding self-attention is 45.7%; counting rematerialization, PaLM's analytically computed hardware FLOPs utilization is 57.8%.

So there are two ways to estimate the training cost:

1. Start from the 2.56e24 FLOPs of training compute and estimate the cost per FLOP, either for rented TPU instances or for other cloud providers' hardware, such as NVIDIA A100 instances.

2. Start from the 8,404,992 TPU chip-hours used and multiply by the hourly rental price of a TPU chip.

However, there is currently no public price for TPU v4, so the TPU v3 rental price has to stand in for it.

Renting a 32-core TPU v3 costs $32 per hour, i.e. $1 per core-hour.

At 16,809,984 TPU v4 core-hours, that works out to about $17 million; for a customer of this size, Google offers a 37% discount with a 1-year commitment.
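
A minimal sketch of that method-two arithmetic, using the $1 per core-hour TPU v3 list price as a stand-in for TPU v4:

```python
# Method 2: core-hours times an assumed rental rate. TPU v4 pricing was not
# public, so the TPU v3 on-demand rate of $1 per core-hour stands in for it.
core_hours = 16_809_984
on_demand = core_hours * 1.00             # USD at $1/core-hour
committed = on_demand * (1 - 0.37)        # 37% discount for a 1-year commitment

print(f"on-demand:   ${on_demand / 1e6:.1f}M")   # ~$16.8M, i.e. about $17M
print(f"1-yr commit: ${committed / 1e6:.1f}M")   # ~$10.6M
```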

In reality it would cost more than $17 million, though, because TPU v3 certainly performs worse than TPU v4 and would therefore need more core-hours.

Then again, technological progress means buying more compute for less money, so TPU v4 should be priced at around $1 per core-hour too, making $17 million a reasonably accurate estimate.

Going by FLOPs instead: a TPU v3 chip delivers about 123 TFLOPS of bfloat16 performance, but that is the peak; what you actually get depends on utilization, and PaLM's 57.8% hardware FLOPs utilization is already a record.

Assuming PaLM trains on TPU v3 at 50% hardware utilization, $1 buys about 221 PFLOPs; combined with the total compute required, the final cost comes to about $11.6 million.
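
A sketch of that method-one arithmetic, under the article's assumption that $1 rents one hour of 123 TFLOPS (bfloat16 peak) TPU v3 capacity at 50% utilization:

```python
# Method 1: total FLOPs divided by effective FLOPs bought per dollar.
# Assumes, as the article does, that $1 rents one hour of 123 TFLOPS
# (bfloat16 peak) TPU v3 capacity, of which only 50% is actually achieved.
total_flops = 2.56e24
flops_per_dollar = 123e12 * 3600 * 0.50   # one rented hour at 50% utilization

print(f"{flops_per_dollar / 1e15:.0f} PFLOPs per dollar")     # ~221
print(f"${total_flops / flops_per_dollar / 1e6:.1f}M total")  # ~$11.6M
```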

How much would it cost to train on graphics cards?

LambdaLabs estimated two years ago that GPT-3 could be trained for as little as $4.6 million by renting NVIDIA V100 cloud instances.

Since PaLM requires ten times GPT-3's training compute, the equivalent V100 cost would be $46 million.

Today, the NVIDIA A100's Tensor performance is roughly ten times that of the V100; factoring that in at 50% utilization brings the figure down to about $9.2 million, cheaper than the TPU estimates.
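
A sketch of the GPU-route scaling, taking the article's 10x compute ratio and 10x A100-vs-V100 speedup as given:

```python
# GPU route: scale LambdaLabs' ~$4.6M V100 estimate for GPT-3. The 10x
# compute ratio and the 10x A100-vs-V100 Tensor speedup are the article's
# figures; 50% utilization halves the effective throughput.
gpt3_v100_cost = 4.6e6
palm_cost = gpt3_v100_cost * 10 / 10 / 0.50   # compute ratio / speedup / util

print(f"${palm_cost / 1e6:.1f}M")             # ~$9.2M
```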

The three methods thus yield three training-cost estimates: $17 million, $11.6 million, and $9.2 million.

Google itself does not actually need to spend that much, of course, since it does not have to rent hardware; these calculations assume a user paying Google Cloud for TPU rental.

And this is only the cost of a single training run, not counting other work such as engineering, research, and testing.

Resources:

https://blog.heim.xyz/palm-training-cost/