
To meet the era of super-large models, Meta wants to build "the world's fastest AI supercomputer"

Meta is not only the world's largest social networking company and the promoter of the "metaverse", the hottest technology concept of the moment; it is also one of the world's top companies in artificial intelligence (AI) research.

Behind the company's excellent AI research results there must be strong computing power. Yet Facebook has never publicly shown the outside world just how powerful its computing resources really are.

Today, Meta suddenly announced its latest advances in building AI supercomputers.

Judging from Meta's public disclosure, the AI RSC supercomputer it has built should already rank among the top four in the world.

This is striking enough on its own. After all, the other supercomputers that can compete with RSC in computing power are all run by state research institutions in China, the United States, and Japan; RSC is the only supercomputing system from a private institution among the top five.

And that's not all: this supercomputer is still getting faster and stronger at an alarming rate.

Meta predicts that by July this year, that is, within half a year, RSC's computing power will increase by 2.5 times. According to the trade publication HPCwire, RSC's performance on the Linpack benchmark is then expected to reach 220 PFlops.
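A quick sanity check using only the figures quoted above (Meta's predicted 2.5x increase and HPCwire's projected 220 PFlops) shows what they imply about RSC's performance today:

```python
# Back out RSC's implied current Linpack performance from the article's figures.
target_pflops = 220    # HPCwire's projected Linpack result after the upgrade
speedup = 2.5          # Meta's predicted increase within half a year

current_pflops = target_pflops / speedup
print(current_pflops)  # 88.0 -> roughly 88 PFlops today, by this estimate
```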

If all goes as planned, RSC will then truly become the "world's fastest AI supercomputer."


Inside AI RSC. Image source: Meta

| AI research and development enters the era of "supercomputing"

First of all, there is a question to be answered:

What kind of AI research needs such a powerful supercomputer?

An ordinary model can usually be trained with one or a few graphics cards in a regular computer or a typical data center. What Meta is studying are very large models: models with far more parameters than today's, with higher and more stringent performance requirements, and with much longer training times.

Take the identification of harmful content as an example: computer vision algorithms need to process larger and longer videos at higher sampling rates; speech recognition algorithms need to achieve higher accuracy against extremely noisy backgrounds; NLP models need to understand multiple languages, dialects, accents, and so on...

In the past, many algorithms have done well on benchmark datasets. But Meta is a company with billions of users across several continents, and the same model put into production must generalize as broadly as possible. Ordinary models are no longer enough; it is time to train large models.

Training large models requires enormous computing power; ask anyone who works on them and you'll get that answer. Past training tasks could be completed in a few weeks, but faced with the new large models, no one can afford to wait years...

"Today, many important efforts, including identifying harmful content, are in great demand for very large models," Meta wrote in its press release, "and high-performance computing systems are key components in training these very large models."

The AI RSC supercomputer Meta unveiled this time takes its full name from AI Research SuperCluster.

Although Meta publicly announced the system for the first time today, a predecessor of RSC was actually put into production inside Facebook as early as 2017. At the time, the Facebook team used 22,000 NVIDIA V100 Tensor Core GPUs to build its first single cluster, a system that could run about 35,000 training jobs per day.

HPCwire estimates that, by the Linpack benchmark, this V100-based predecessor should already deliver about 135 PFlops of floating-point performance. That would be enough for third place on the November 2021 Top500 list of the world's fastest supercomputers, potentially surpassing the "Sierra" supercomputer operated by the U.S. Department of Energy in Livermore, California.

For Meta, though, that's not enough. What they want is the world's largest, fastest, and strongest AI supercomputer.

The supercomputer must also meet production-grade data security standards; after all, models used in Meta's production systems may in the future be trained, or even run directly, on it.

Moreover, this supercomputer also needs to give its users, Meta's AI researchers, the same convenience as an ordinary training machine or graphics card, along with a smooth developer experience.


Kevin Lee, technical program manager for Meta AI RSC. Image source: Meta

In early 2020, the Facebook team judged that the company's then-current supercomputing cluster would struggle to keep up with the needs of future large-model training, and decided to "start over", building a new cluster with the most advanced GPU and data-transmission network technology.

This new supercomputer must be able to train very large neural network models with more than one trillion parameters on datasets in exabytes (more than 1 billion GB).

(For example, "Wudao", developed by the Chinese research institution BAAI, and the mixture-of-experts model Google trained last year with its Switch Transformer technology are both large models with trillions of parameters; by comparison, OpenAI's GPT-3 language model, previously famous in the industry for its surprising performance and versatility, has only about 175 billion parameters.)
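A rough, assumption-laden sketch shows why models at this scale outgrow a single graphics card. The numbers below are illustrative assumptions (half-precision weights at 2 bytes per parameter, the 80 GB A100 variant), and activations, gradients, and optimizer state are ignored, so real requirements are far larger:

```python
import math

# Assumed figures for a back-of-envelope estimate; not from the article.
BYTES_PER_PARAM = 2    # fp16 weight
A100_MEMORY_GB = 80    # the largest A100 memory variant

def weight_gb(params: float) -> float:
    """GB needed just to store the model weights."""
    return params * BYTES_PER_PARAM / 1e9

print(weight_gb(175e9))   # GPT-3 scale: 350.0 GB of weights
print(weight_gb(1e12))    # trillion-parameter scale: 2000.0 GB
# Minimum GPUs merely to *hold* a trillion parameters in memory:
print(math.ceil(weight_gb(1e12) / A100_MEMORY_GB))  # 25
```

Even under these generous assumptions, a trillion-parameter model cannot fit on any single accelerator, which is why training happens on clusters.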

The Meta team selected three of the most recognized companies in AI computing and data-center components: NVIDIA, Penguin Computing, and Pure Storage.

Specifically, Meta purchased 760 DGX A100 training systems directly from NVIDIA. These systems contain a total of 6,080 Ampere-architecture A100 Tensor Core GPUs, the top of the line for AI training, inference, and analytics both then and now. Inter-node communication runs over NVIDIA InfiniBand, with data-transfer speeds of up to 200 Gb/s.

In terms of storage, Meta purchased a total of 231 PB of flash array, module, and cache capacity from Pure Storage, while all rack construction, equipment installation, and subsequent management of the data center were handled by Penguin Computing, which has served the company since its Facebook days.

Meta officially named the new supercomputing cluster formed in this way AI RSC:


Shown in the figure are the parameter details of the first stage (P1) of the RSC. Image source: Meta

Compared with the previous cluster FAIR built with V100 graphics cards, the initial RSC delivered a 20x performance improvement for production-level computer vision algorithms, ran NVIDIA's collective communication library (NCCL) more than 9x faster, and trained large-scale natural language processing workflows 3x faster; the training time saved is measured in weeks.

It is worth mentioning that just as Meta was drawing up its RSC upgrade plans, the COVID-19 pandemic suddenly struck. Construction timelines for all the physical work became highly uncertain, and whether RSC could be successfully upgraded was marked with a huge question mark.

However, the company's business development and its need for AI research could not wait out the COVID-19 pandemic. The team responsible for upgrading and building RSC, together with its technical partners NVIDIA, Penguin Computing, and Pure Storage, had to complete a series of cumbersome and technically demanding tasks, including data-center fit-out and construction, equipment production and transportation, on-site installation, cabling, and commissioning, under enormous schedule pressure.

What's more remarkable, because stay-at-home orders were in effect across the United States at the time, many leaders of the RSC project team had to work remotely from home... Shubho Sengupta, a researcher on the team, said: "What makes me most proud is that we completed the (RSC upgrade) under fully remote working conditions. Given the complexity of the project, it's crazy to be able to do all of this without meeting the other team members."


For now, RSC is already one of the fastest AI supercomputers in the world.

But Meta is still not satisfied.

| Creating the world's fastest and most secure AI supercomputer

To meet Meta's growing demand for computing power in both production environments and AI research, RSC must continue to upgrade and expand capacity.

According to Meta's RSC Phase 2 (P2) plan, by July this year, that is, within half a year, the total number of A100 GPUs in the entire computing cluster will increase to a staggering 16,000...

The original RSC had 760 stand-alone DGX A100 machines, equivalent to 6,080 graphics cards; by this calculation, RSC will add another 9,920 graphics cards in P2, meaning Meta needs to purchase another 1,240 DGX A100 systems from NVIDIA...
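The expansion figures are easy to verify, since each DGX A100 system holds 8 GPUs:

```python
# Verify the Phase 2 (P2) expansion figures quoted in the article.
GPUS_PER_DGX_A100 = 8                 # each DGX A100 system holds 8 A100 GPUs

p1_systems = 760                      # systems in the original RSC
p1_gpus = p1_systems * GPUS_PER_DGX_A100
p2_total_gpus = 16_000                # Meta's stated P2 target

added_gpus = p2_total_gpus - p1_gpus
added_systems = added_gpus // GPUS_PER_DGX_A100

print(p1_gpus, added_gpus, added_systems)  # 6080 9920 1240
```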

Even NVIDIA said that Meta's plan will make RSC the largest customer deployment of NVIDIA DGX A100 systems to date, bar none.


Computing power has increased, and other supporting facilities, including storage and networking, must also keep up.

According to Meta's projections, RSC's total data storage will reach 1 exabyte upon completion of P2 – more than 1 billion GB.

Not only that, but the communication bandwidth between nodes across the entire cluster has been raised to an unprecedented, staggering 16 TB/s, with a one-to-one, oversubscription-free design (that is, each DGX A100 compute node has its own network interface, with no nodes sharing an interface and competing for bandwidth).

(Another point worth mentioning separately: according to the Meta team's estimates, a supercomputing cluster of DGX A100 nodes like RSC can support at most 16,000 GPUs without oversubscribing the network; beyond that cap the network becomes oversubscribed, meaning the marginal return on additional investment drops significantly.)


From a data security perspective, Meta did not forget to describe its data-handling practices in the press release to reassure the public.

"Whether it's detecting harmful content or creating new augmented reality experiences, building new AI models requires real-world data from our production systems," says Meta, which is why RSC was designed from the start with data privacy and security in mind. Only in this way can Meta's researchers securely train models on encrypted, anonymized real-world data.

1) RSC is not connected directly to the public Internet; it connects only to a Meta data center located near the RSC;

2) When Meta researchers import data into RSC's servers, the data first passes through a privacy review system that confirms it has been anonymized;

3) Before the data is actually used to train AI model algorithms, it is encrypted once more; the keys are generated and periodically discarded, so that even leftover stores of old training data cannot be accessed;

4) The data is decrypted only in the memory of the training system, so that even if an uninvited guest broke into RSC and physically accessed the servers, the data could not be recovered.
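As a purely illustrative sketch (not Meta's actual pipeline), the anonymization in step 2 could take the form of keyed pseudonymization, where the key lives only in memory and discarding it makes old pseudonyms unlinkable, mirroring the key rotation in step 3. The `pseudonymize` function and the field names are hypothetical:

```python
import hmac, hashlib, secrets

def pseudonymize(record: dict, key: bytes) -> dict:
    """Replace direct identifiers with keyed, irreversible pseudonyms."""
    out = dict(record)
    for field in ("user_id", "email"):  # hypothetical identifier fields
        if field in out:
            digest = hmac.new(key, str(out[field]).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]
    return out

# The key exists only in memory; once discarded, pseudonyms can no longer
# be linked back to the original identifiers.
key = secrets.token_bytes(32)
anon = pseudonymize({"user_id": 12345, "text": "hello"}, key)
print(anon["text"])     # non-identifying fields pass through unchanged
print(anon["user_id"])  # a 16-hex-character pseudonym, not 12345
```

The same key yields the same pseudonym (so records can still be joined during training), while a fresh key produces unlinkable pseudonyms.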

Probably for confidentiality purposes, Meta didn't even reveal the exact location of the RSC...

However, from what is known, there must be a Facebook/Meta data center near RSC. The figure below, taken from the RSC announcement video, shows AI RSC in the upper right and a Meta data center in the lower left, surrounded by a large number of tall trees.


Silicon Star can basically determine that the Meta data center in the figure above is located in Henrico County, Virginia, USA. The county is the largest concentration of data centers in the eastern United States and the U.S. landing point of multiple submarine cables connecting Europe, South America, Asia, and Africa. As for RSC's actual location, it should be the QTS Richmond data center.


On the right is the Meta Data Center, and on the left is QTS Richmond, the location of the Meta AI RSC

Finally, let's take a look at the cost...

Without considering the equally expensive storage and network infrastructure, let's start with the computing part:

The standard price of each DGX A100 is $199,000. Meta's bulk purchase surely comes with a discount, but even assuming none: the graphics-card purchase alone for this P2 expansion of RSC costs nearly $250 million...
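The arithmetic behind that figure, using the list price and system count quoted above:

```python
# Cost of the P2 graphics-card purchase at full list price, per the article.
DGX_A100_LIST_PRICE_USD = 199_000   # standard price per DGX A100 system
added_systems = 1_240               # extra systems needed to reach 16,000 GPUs

total = DGX_A100_LIST_PRICE_USD * added_systems
print(f"${total:,}")                # $246,760,000, i.e. nearly $250 million
```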

Of course, measured against Meta's market capitalization today, this expense is pocket change. If RSC really becomes the world's largest, strongest, and fastest AI supercomputer, it should greatly help the company's business, whether its current core products or its future metaverse products.

Meta puts it this way: "Ultimately, our work on RSC will pave the way for building the metaverse, the next major computing platform. By then, AI-driven applications and products will play an important role."
