
Compatible with PyTorch, with a 25x performance speedup: Chinese deep learning framework OneFlow goes into "overdrive"

Published by Machine Heart

Machine Heart Editorial Department

If you want your alchemy to fly, you need a handy furnace. Deep learning frameworks are the "alchemy furnace" AI engineers rely on every day, and "PyTorch or TensorFlow?" has become a perennial hot topic in alchemists' haunts such as Zhihu and Reddit.

There's a saying in the industry that PyTorch is for academia and TensorFlow is for industry. After all, PyTorch is a favorite among users: its API is friendly, and eager mode makes model building and debugging much easier. But its static graph compilation and deployment experience still leaves much to be desired. TensorFlow, on the other hand, is well equipped for static compilation and deployment, but its debugging experience is tear-inducing.

So the question arises: can you really not have both the fish and the bear's paw? Not necessarily. OneFlow, an open-source deep learning framework developed by a technology team in Beijing, has managed both.

Wait, OneFlow has always focused on distributed training and high performance. Can it really be as easy to use as PyTorch? Anyone who has heard of OneFlow is bound to ask.

Yes. Since its inception at the end of 2016, OneFlow has been built for large-scale distributed training, one hallmark of which is its static graph mechanism; dynamic graphs were not yet supported when it was open-sourced on GitHub in July 2020. However, the OneFlow team spent more than a year developing its own dynamic graph engine, and OneFlow 0.7 now supports exactly the same Eager experience as PyTorch, which means OneFlow supports both dynamic and static graphs. Not only that, but OneFlow's programming API is fully compatible with PyTorch's: for common deep learning models, changing a single line to import oneflow as torch is enough to run a model written for PyTorch on OneFlow.

Start with flowvision, OneFlow's vision model library (https://github.com/Oneflow-Inc/vision): it already supports classic SOTA computer vision models for image classification, segmentation, and detection, and you can switch freely between frameworks via import torch as flow or import oneflow as torch.


With OneFlow compatible with PyTorch, users can work with OneFlow exactly as they would with PyTorch, and once satisfied with the model, they can go on to use OneFlow to scale up to large-scale distributed training or deploy the model with static graphs. Sounds almost too good to be true, doesn't it?

In the following case study, a leading communications company quickly and easily migrated its PyTorch-based business model to OneFlow, carried out large-scale training/inference performance optimization, and deployed it to production, getting the service online on schedule in just a few days, with performance metrics that far exceeded expectations!

How exactly did they do it? Let's start with the background of the project.

Why OneFlow?

Driven by business needs, the communications company was about to launch a deep-learning-based image recognition application, and the project's business requirements had five characteristics:

Large amount of data: hundreds of millions of images in the database

Simple model: a fairly standard classification model

Limited hardware: more than 400 GPUs, which cannot be expanded in the short term

Hard targets for training/inference throughput

Time to go live is tight

The user had built the business model on PyTorch, the most popular deep learning framework on the market, and had run through the normal training process, but training/inference was very slow, far from the goal (a 20x gap from the required online QPS), and the whole team grew increasingly anxious as the delivery date approached.

The user tried various schemes (optimizing on top of the existing implementation) to no avail, so they investigated other deep learning frameworks such as TensorFlow and OneFlow, and found that OneFlow (https://github.com/OneFlow-Inc/oneflow) was the smoothest framework for accelerating PyTorch-style code.

Specifically, there are three main reasons why users choose to try OneFlow:

1. Among the many deep learning frameworks, OneFlow's API is the most compatible with PyTorch's, making it easy for engineers to migrate existing project code with minimal time and labor while keeping learning costs low.

2. OneFlow's dynamic-to-static conversion is very convenient: dynamic graph (Eager) mode code can be converted to static graph (nn.Graph) mode by changing just a few lines.

3. OneFlow has done a great deal of optimization at the framework level: nn.Graph offers concise yet rich performance optimization options, such as operator fusion and automatic mixed precision training.

So the user started migrating the existing code to OneFlow and, unexpectedly, had it up and running in less than half a day; the migration went remarkably smoothly.

With the support of the official OneFlow documentation (https://docs.oneflow.org/master/index.html) and the OneFlow R&D team, the user did the following:

Fully migrated the existing PyTorch project code to OneFlow

Converted the project code from Eager mode to Graph mode

Turned on various optimization options in OneFlow's Graph mode and trained the model

Deployed the model online with the Serving module

Migration and tuning process

1. One-click migration of the PyTorch model to OneFlow: import oneflow as torch is all it takes

OneFlow's latest release, v0.7.0, further improves compatibility with the PyTorch interface: for operators it already supports, OneFlow is consistent with PyTorch in both semantics and results. So the user tried migrating the model script to OneFlow. Since the backbone of the business model is ResNet-101, the user followed the official documentation (https://docs.oneflow.org/master/cookies/torch2flow.html) during migration and found that only the torch-related imports in the model files needed to be changed to import oneflow as torch to complete the migration of the model code.
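In practice the change is just the import lines (a minimal sketch; the modules shown are the usual PyTorch ones, and the rest of the code stays untouched):

```python
# Before: the original PyTorch imports
# import torch
# import torch.nn as nn
# import torch.nn.functional as F

# After: pull in OneFlow under the same names, so the nn.Module
# subclasses, optimizers, etc. in the rest of the file run unmodified.
import oneflow as torch
import oneflow.nn as nn
import oneflow.nn.functional as F
```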

After the model script is migrated, the correctness of the migration also has to be verified: is the accuracy aligned?

1) The user first verified inference accuracy, i.e., directly loaded the model trained with PyTorch and checked its inference accuracy. Because OneFlow's interface is aligned with PyTorch's, loading the PyTorch model is also very convenient and takes only a few lines of code:
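A hedged sketch of the idea (the checkpoint path and the resnet101() constructor are placeholders; PyTorch is used here only to read its own checkpoint):

```python
import oneflow as flow
import torch  # used here only to read the PyTorch checkpoint

# Hypothetical names: "resnet101_weights.pth" stands in for the real
# checkpoint path, and resnet101() for the migrated OneFlow model.
pt_state = torch.load("resnet101_weights.pth", map_location="cpu")

# Convert each PyTorch tensor to an OneFlow tensor via numpy.
flow_state = {k: flow.tensor(v.detach().cpu().numpy())
              for k, v in pt_state.items()}

model = resnet101()  # OneFlow model definition from the migrated code
model.load_state_dict(flow_state)
model.eval()
```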

2) After verifying the inference accuracy, the training process was verified. With the training hyperparameters aligned, the loss curve of the model trained with OneFlow matched PyTorch's convergence curve, and the accuracy on a small dataset was exactly the same.

2. Using OneFlow's nn.Graph to accelerate model training and inference

After verifying the algorithm's correctness, the next question was how to accelerate execution. Deploying the existing dynamic graph model directly, within the available machine resources and time constraints, would leave the original code about 20x short of the target, an impossible gap to close in the short term.

The user decided on a two-pronged approach: while doing acceleration optimizations based on PyTorch, they used OneFlow for acceleration in parallel. Ultimately, combining four kinds of techniques, namely dynamic-to-static conversion, algorithm logic reduction, increased parallelism, and static compilation optimization, the final single-machine execution achieved a speedup of more than 25x.

2.1 Dynamic to static

Switching from dynamic graph execution to static graph execution brought a performance gain of about 25%.

OneFlow has an open-source ResNet50 project (https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50) whose single-GPU execution efficiency is already very high, and the same optimization techniques carry over to ResNet101.

OneFlow's ResNet50 accelerates the model using the static graph class nn.Graph, which is similar to PyTorch's TorchScript, but OneFlow's optimization features are more comprehensive, and its runtime uses a distinctive Actor runtime for acceleration.

nn.Graph is an object-oriented static graph class that represents a complete static graph. For inference tasks, an nn.Graph can contain just the forward computation; for training tasks, it can also contain the backward computation and the model update.

nn.Graph's basic interface behaves much like nn.Module's: adding sub-Modules, customizing execution logic, calling it to perform a computation, saving the model, and so on. An nn.Module object added into an nn.Graph executes in static graph mode when the graph runs, so computation logic written for dynamic graphs can be reused directly by static graphs, enabling switching between dynamic and static execution. In particular, an Optimizer can also be added to a static graph, so that forward, backward, and model update can be combined in one complete static graph for joint optimization.

The following steps turn the dynamically executed ResNet101Module into a statically executed one. The usage is similar to nn.Module: only three basic steps are needed, namely declare, instantiate, and invoke (see the sketch after these steps).

1) Declare the static graph: this mainly consists of two parts. First, in the initialization function, add the nn.Module and Optimizer that are to be made static; then describe the graph computation in the build function.

2) Instantiate the static graph: just initialize it the way you would any ordinary Python class.

3) Call the static graph: similar to how an nn.Module is called. Note that the first call triggers compilation, so it takes longer than subsequent calls.
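Putting the three steps together, a minimal sketch might look like this (class and variable names are illustrative; a training graph would additionally register its Optimizer via add_optimizer, as covered in the nn.Graph tutorial linked later in this article):

```python
import oneflow as flow
import oneflow.nn as nn
import flowvision  # OneFlow's torchvision-like model zoo

# 1) Declare: wrap the nn.Module to be made static and describe the
#    computation in build(). A training graph would also call
#    self.add_optimizer(...) here.
class ResNet101Graph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

# 2) Instantiate like an ordinary Python class.
resnet101 = flowvision.models.resnet101().eval()
graph = ResNet101Graph(resnet101)

# 3) Call like an nn.Module; the first call triggers compilation.
x = flow.randn(16, 3, 224, 224)
y = graph(x)
```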

After putting the ResNet101 nn.Module instance into an nn.Graph for execution, the comparison showed a speedup of about 25%.

2.2 Algorithm-level optimization

While migrating the dynamic graph code to static graph code, the model had to be refactored into modules in order to decide which parts to make static. In the process it turned out that some computations in the task were left over from experiments and unnecessary at deployment time, so the algorithm logic was trimmed along the way:

Ordinary inference needs only the forward computation, not the backward one. In the user's particular model, however, deployment-time inference also requires the backward computation, though not the model update. To keep the backward pass, the user's code had mistakenly kept the parameter-update logic as well. Removing it allowed the gradient computation for the parameters to be skipped, bringing about a 75% speedup;

It was then found that the second forward pass in the original task (forward, backward, forward) was redundant at deployment time and could be cut, bringing another 33% speedup.

Overall, the algorithm level was accelerated by 2.33x. This showed that the algorithm logic itself had plenty of room for optimization, and that modular code makes such optimization points easier to find. Of course, this part of the improvement would apply to PyTorch as well.

2.3 Increase parallelism

This idea is also fairly direct: after the optimizations above, the user observed that GPU utilization was only 30%. At this point batch_size was 1 (some BN parameters are related to the batch size; the user initially worried that enlarging batch_size might affect the results, a worry that later proved unfounded, as both theoretical deduction and experiments confirmed that enlarging batch_size does not change the results) and everything ran in a single process, so increasing data parallelism was well worth trying. The user therefore experimented with a larger batch_size and multiple processes:

Increasing batch_size: with the default batch_size of 1, GPU utilization was 30%; raising it to 16 pushed utilization up to 90%, yielding about a 155% speedup;

Pipelining CPU and GPU: data preprocessing runs on the CPU while network computation runs on the GPU, with the two devices executing in relay. Using two processes, with a mutex around the data-loading part, implements a relatively simple two-stage CPU/GPU pipeline, bringing another 80% speedup.

Cumulatively, increasing parallelism gave a 4.6x speedup. Increasing parallelism to fully exploit multiple cores and devices delivered the most noticeable acceleration. As before, these optimizations were realized after migrating to OneFlow but could equally be done on PyTorch.

2.4 Static compilation optimizations

After the optimizations above, GPU utilization stabilized at around 90%, which ordinarily leaves little room for further optimization. However, OneFlow's nn.Graph offers some automated compilation optimization techniques worth trying.

For example, using automatic mixed precision for low-precision computation and operator fusion to reduce memory access overhead ultimately brought another 64% speedup, lifting the speed to about 1.6x the previous best.

The _config_graph function mentioned above configures these optimization options.
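A hedged reconstruction of what such a function might look like (the args flags are hypothetical; the config method names follow OneFlow v0.7's nn.Graph configuration API to the best of our knowledge):

```python
def _config_graph(graph, args):
    if args.fp16:
        # automatic mixed precision: FP32 -> FP16 where safe
        graph.config.enable_amp(True)
    if args.conv_try_run:
        # switch cuDNN from heuristic search to trial-run search
        graph.config.enable_cudnn_conv_heuristic_search_algo(False)
    if args.fuse_add_to_output:
        # fuse the elementwise add operator into its upstream operator
        graph.config.allow_fuse_add_to_output(True)
    # the pad + conv fusion described below is toggled analogously
```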

For ResNet101 with batch_size set to 16, relative to the nn.Graph baseline with no optimization options enabled:

Turning on mixed precision measured 36% faster.

Automatic mixed precision training automatically converts suitable operators in the network from FP32 single-precision to FP16 half-precision floating point. This not only reduces GPU memory consumption but also improves overall performance, and on GPUs with Tensor Cores it uses them to accelerate training further.

Turning on the convolution trial-run optimization measured 7% faster, for a cumulative speedup of 43%.

cuDNN's convolution interface supports several algorithms, e.g. the forward algorithms enumerated in cudnnConvolutionFwdAlgo_t (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnConvolutionFwdAlgo_t). Different input and filter sizes perform differently under different algorithms, so to choose the best one, a cuDNN convolution algorithm search interface must be called before invoking the convolution operator itself. cuDNN provides two search modes: heuristic search (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetConvolutionForwardAlgorithm_v7) and trial-run search (cudnnFindConvolutionForwardAlgorithm, https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnFindConvolutionForwardAlgorithm).

Heuristic search picks the best algorithm by table lookup: cuDNN pre-defines the best algorithm for different parameter configurations and matches against them on each search. Trial-run search instead runs the candidate algorithms several times on the actual tensors and returns the measured results. In both modes, the search returns metadata for the candidate algorithms along with their time cost.

Turning on pad and conv operator fusion measured 19% faster, for a cumulative speedup of 62%.

In CNN backbones, convolution + pad combinations are common. Since the convolution operator itself supports padding, automatically fusing the pad operator into the convolution operator saves the overhead of the standalone pad operator and improves overall network performance.

Turning on fusion of the add operator measured 2% faster, for a cumulative speedup of 64%.

The elementwise add operator is a common memory-bound operator in networks. Automatically fusing it into its upstream operator reduces bandwidth usage and improves performance: fusing the elementwise add into the preceding operator cuts the number of data reads and writes, improving that operator's performance by about 2/3.

In addition, nn.Graph conveniently supports TensorRT. This workload does not require model updates, so it is also a good fit for TensorRT acceleration. Starting from the nn.Graph baseline with no optimization options and batch_size 16, adding automatic mixed precision, NHWC layout, and the TensorRT backend yields a 48% speedup.

On this model, using only the TensorRT backend was slightly worse than using only OneFlow's static graph optimizations, probably because some of TensorRT's optimizations were already done inside nn.Graph, so they brought no additional gains. Still, it is convenient to experiment with: compile OneFlow with TensorRT support, and the switch can then be turned on under nn.Graph; the result is listed here for reference.

2.5 Summary of accelerated optimization

The main steps of the acceleration are recorded above: dynamic-to-static conversion gave about 1.25x, algorithm logic reduction about 2.33x, increased parallelism about 4.6x, and static compilation optimization about 1.6x, for a cumulative speedup of about 21x. Some small optimizations along the way were not fully recorded; the actual cumulative speedup reached more than 25x, exceeding the project's 20x deployment requirement.

Further usage of nn.Graph can be found in:

nn.Graph tutorial: https://docs.oneflow.org/en/master/basics/08_nn_graph.html

nn.Graph API documentation: https://oneflow.readthedocs.io/en/master/graph.html

3. Use OneFlow-Serving to easily deploy your trained model online

When the user finishes training and obtains the final model, the next step is deployment. Unlike training, which requires weight updates, weights are fixed at deployment time, so more aggressive speed optimizations can be applied, such as int8 quantization, broader kernel fusion, constant folding, and so on.

The user adopted the official Serving module released with OneFlow v0.7.0 (https://github.com/Oneflow-Inc/serving), an NVIDIA Triton backend that integrates OneFlow's built-in XRT module and provides an out-of-the-box user interface. A trained OneFlow model can be deployed quickly and efficiently as follows:

To use the model for inference after nn.Graph training is complete, you need to construct a ResNet101InferenceGraph containing only the forward computation:
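A minimal sketch (names are illustrative; resnet101_module stands for the trained OneFlow nn.Module from the previous sections):

```python
import oneflow as flow
import oneflow.nn as nn

class ResNet101InferenceGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        # forward only: no backward pass or optimizer in this graph
        return self.model(x)

inference_graph = ResNet101InferenceGraph(resnet101_module.eval())
```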

Run inference_graph once with a sample input to trigger its graph construction:
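For example (the input shape here is an assumption):

```python
# A dummy batch of the right shape triggers compilation of the graph.
unused_output = inference_graph(flow.zeros(1, 3, 224, 224))
```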

You can then call flow.save to save the graph structure and weights of inference_graph into the "model" folder for deployment:
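Concretely (a sketch; saving an nn.Graph with flow.save for serving is the export path introduced with v0.7):

```python
flow.save(inference_graph, "model")
```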

Then just run the following:
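A reconstructed startup command (the oneflowinc/oneflow-serving image name follows the oneflow-serving project; treat the exact tag and flags as assumptions):

```bash
docker run --rm --runtime=nvidia --network=host \
  -v $(pwd)/model:/models/resnet101/1 \
  oneflowinc/oneflow-serving:nightly
```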

This starts a Docker container serving the ResNet101 model. The -v flag here is important: it maps the model folder in the current directory to the /models/resnet101/1 directory inside the container. /models is the default directory Triton reads models from; Triton uses the first-level directory name ("resnet101") as the model name and the second-level directory name ("1") as the model version.

If the startup command is adjusted to enable TensorRT, along the lines of the sketch below,
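A hedged guess at the adjusted command, by analogy with the "--enable-openvino" option mentioned next (the exact TensorRT flag name is an assumption):

```bash
docker run --rm --runtime=nvidia --network=host \
  -v $(pwd)/model:/models/resnet101/1 \
  oneflowinc/oneflow-serving:nightly --enable-tensorrt=resnet101
```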

then the model will automatically use TensorRT for inference through OneFlow's built-in XRT module; OneFlow Serving likewise supports a similar "--enable-openvino" option.

After starting the Docker container, run the following command to check the service status:
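Triton exposes a standard HTTP health endpoint for this (part of Triton's KServe-compatible HTTP API):

```bash
curl -v localhost:8000/v2/health/ready
```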

If the return value is HTTP/1.1 200 OK, the service is working correctly.

You can then use Triton's C++ or Python SDK to implement the logic that sends a request to the server and fetches the result, for example this simplest of clients:
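A minimal Python client sketch using the tritonclient package (the tensor names INPUT_0/OUTPUT_0 and the random stand-in image are assumptions; the real names come from the deployed model's Triton config):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Stand-in preprocessed batch; a real client would load and normalize
# an actual image into the model's expected NCHW float32 layout.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

inputs = [httpclient.InferInput("INPUT_0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("OUTPUT_0")]

result = client.infer("resnet101", inputs, outputs=outputs)
print(result.as_numpy("OUTPUT_0").argmax())  # predicted class index
```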

Run it, and you can see that it successfully prints the inference results.

Final thoughts

In the case above, the user was too pressed for time to do thorough research and chose OneFlow with a try-it-and-see attitude. Fortunately, the task was ultimately completed within an extremely compressed project cycle.

Based on OneFlow v0.7.0, the user easily migrated previously developed PyTorch business model code to OneFlow with one click, converted it to OneFlow's static graph nn.Graph mode with simple changes, leveraged nn.Graph's rich, efficient, and concise optimization switches to quickly and dramatically improve training speed, and used a well-rounded peripheral toolchain such as OneFlow-Serving for convenient online deployment. It is worth mentioning that users can also use the OneFlow-ONNX tool to convert efficiently trained OneFlow models into ONNX format for import into other frameworks.

This article only describes an example of how OneFlow, through its compatibility with PyTorch, helped a user accelerate and deploy a model. OneFlow's original killer feature, large-scale distributed training, has not even made an appearance yet. In the future, we will take a closer look at how OneFlow helps users accustomed to PyTorch easily implement the large-scale pretrained Transformer models and the large-scale embedding models needed in search, recommendation, and advertising.

OneFlow Project Address: https://github.com/Oneflow-Inc/oneflow/

OneFlow User Documentation: https://docs.oneflow.org/master/index.html
