
Take Java performance to the next level with TornadoVM

Author | Juan Fumero

Translated by | Knowing the mountain

Planning | Ding Xiaoyun

At QCon Plus, Juan Fumero presented TornadoVM, a high-performance computing platform for the Java Virtual Machine (JVM). Java developers can use it to automatically run programs on GPUs, FPGAs, or multi-core CPUs.

Heterogeneous devices like GPUs are present in almost all computing systems today. For example, mobile devices come with a multi-core CPU and an integrated GPU; laptops typically have two GPUs, one integrated with the main CPU and one dedicated (usually for gaming). Even data centers are adding devices such as FPGAs. In short, heterogeneous devices are here to stay.

All of these devices help improve performance and run workloads more efficiently. Programmers of current and future computing systems need to execute programs on a wide variety of computing devices. However, most parallel programming frameworks are based on C and C++, and equivalents for high-level programming languages such as Java are almost non-existent. That is why we introduced TornadoVM.

Simply put, TornadoVM is a high-performance computing programming platform for Java and the JVM that can load Java code at runtime to run on heterogeneous hardware accelerators.

TornadoVM provides a Parallel Loop API and a Parallel Kernel API. In this article, we'll cover them separately, provide some performance test benchmarks, and share how TornadoVM translates Java code into machine code that can be executed on parallel hardware. Finally, we will introduce the application of TornadoVM in the industry, including some application scenarios.

1. Fast lanes for GPUs and FPGAs

How do high-level programming languages access heterogeneous hardware today? The following diagram shows examples of some hardware (CPU, GPU, FPGA) and high-level programming languages such as Java, R, or Python.

[Figure: high-level programming languages such as Java, R, and Python, and heterogeneous hardware (CPUs, GPUs, FPGAs)]

If we look at Java, we see that it runs inside a virtual machine; OpenJDK, GraalVM, and Corretto are all virtual machine (VM) implementations. Essentially, Java source code is compiled into Java bytecode, which the VM then executes. If parts of the application run frequently, the VM can optimize them by compiling the frequently executed methods into machine code, but only for the CPU.

If developers want to access heterogeneous devices, such as GPUs or FPGAs, they typically have to do so through Java Native Interface (JNI) libraries.

The programmer must import a library and call it via JNI. These libraries can be used to optimize an application for a specific GPU. However, if the application or the GPU changes, the application may need to be rebuilt or its optimization parameters retuned. The same applies to different FPGAs and even to other GPU models.

In other words, there is no complete JIT compiler and runtime that can treat heterogeneous devices the way the JVM treats the CPU: detecting frequently executed code and generating optimized machine code for it. TornadoVM was created to fill this gap.


TornadoVM can be used in conjunction with existing JDKs. It is a plugin to the JDK that allows programmers to run applications on heterogeneous hardware. Currently, TornadoVM can run on multi-core CPUs, GPUs, and FPGAs.

2. Hardware features and parallelization

The next question is: why support so many kinds of hardware? Let's consider the three hardware architectures currently supported: CPUs, GPUs, and FPGAs. Each architecture is optimized for different types of workloads.

[Figure: CPU, GPU, and FPGA architectures mapped to types of parallelism]

CPUs are optimized to reduce the latency of individual tasks, while GPUs are optimized to increase throughput. FPGAs sit in between: because the application is physically wired into the hardware, FPGAs can often achieve both low latency and high throughput.

We can map these architectures to the existing types of parallelism. As the diagram above shows, there are three main types: task parallelism, data parallelism, and pipeline parallelism.

Typically, CPUs are optimized for task parallelism, meaning each core can run a different, independent task. In contrast, GPUs are optimized for data parallelism, meaning the function being executed (the kernel) is the same but the input data differs. Finally, FPGAs are ideal for pipeline parallelism, in which the execution of different instructions overlaps across different internal stages.

Ideally, we need a high-level parallel programming framework that can express these different types of parallelism to maximize the performance of each device type. Now, let's look at how TornadoVM is built and how developers can use it to express different types of parallelism.

3. TornadoVM overview

TornadoVM is a plugin for the JDK (Java Development Kit) that Java developers can use to automatically run programs on heterogeneous hardware. The main features of TornadoVM are as follows:

It has an optimizing JIT (Just In Time) compiler that specializes the code for each architecture. This means that the code generated for a GPU is different from the code generated for a CPU or an FPGA, maximizing the performance on each architecture.

TornadoVM enables dynamic task migration between architectures and between devices. For example, it can run an application on a GPU for a period of time and then migrate it to another GPU, FPGA, or multi-core CPU as needed without restarting the application.

TornadoVM is completely hardware-agnostic: the same application source code runs unchanged on GPUs, CPUs, and FPGAs.

Finally, it works with multiple JDK distributions, it is open source (available on GitHub), and Docker images are provided for running on NVIDIA GPUs and Intel integrated GPUs.

4. TornadoVM system stack

Let's take a look at TornadoVM's system stack. At the top level, TornadoVM exposes an API. It does so because, although TornadoVM exploits parallelism, it does not auto-parallelize code; it needs a way for the programmer to indicate where parallelism is used in the application source code.

[Figure: the TornadoVM system stack]

TornadoVM provides a task-based programming API in which each task corresponds to an existing Java method. That is, TornadoVM compiles code at the method level, just like the JDK or JVM, but the generated code targets GPUs and FPGAs. Annotations inside the methods indicate how to parallelize them. In addition, several tasks can be grouped together and compiled in the same compilation unit. This compilation unit is called a TaskSchedule: a TaskSchedule has a name (for debugging and optimization purposes) and contains a set of tasks.
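As a rough, minimal sketch of the shape of this API (the class Example, the method saxpy, and the task and schedule names are illustrative, not taken from the original article), a TaskSchedule groups named tasks that each reference an ordinary Java method:

```java
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class Example {
    // An ordinary Java method that becomes a task; the @Parallel annotation
    // marks the loop as parallelizable.
    public static void saxpy(float alpha, float[] x, float[] y) {
        for (@Parallel int i = 0; i < y.length; i++) {
            y[i] = alpha * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = new float[1024];
        float[] y = new float[1024];

        new TaskSchedule("s0")                          // schedule name
                .task("t0", Example::saxpy, 2.0f, x, y) // task name + method reference + arguments
                .streamOut(y)                           // copy the result back from the device
                .execute();                             // compile and run on the selected device
    }
}
```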

The TornadoVM engine takes this bytecode-level representation and automatically generates code for the different architectures. It currently has three code-generation backends: one for OpenCL, one for CUDA PTX, and one for SPIR-V. Developers can choose which backend to use, or let TornadoVM pick one by default.

5. Example of a blur filter

Let's now look at an example of how TornadoVM can speed up Java applications: the Blur filter. We have a picture and want to apply a blur effect to this picture.

Before looking at the code, let's take a look at the performance of this application on heterogeneous hardware. The following diagram shows benchmark results for four different implementations. We use the serial Java implementation as the baseline; the y-axis shows the speedup relative to that baseline, so higher is better.

[Figure: blur filter speedups of four implementations relative to serial Java]

The two leftmost columns represent CPU-based executions. The first uses standard Java parallel streams and the second uses TornadoVM running on the multi-core CPU, yielding 11x and 17x speedups, respectively. TornadoVM gets better results because it generates OpenCL code for the CPU, and OpenCL is very good at vectorizing code to use the CPU's vector units. Running the application on the integrated GPU gives a 19x speedup over the serial Java implementation, and running it on an NVIDIA GPU (a 2060) gives up to a 340x speedup using TornadoVM's OpenCL backend. Compared with the Java parallel streams version, TornadoVM running on the NVIDIA GPU is about 30x faster.
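For reference, the Java parallel streams baseline mentioned above corresponds roughly to the following minimal sketch; the blurPixel helper and the variable names are hypothetical, not from the original article, and are assumed to be in scope:

```java
import java.util.stream.IntStream;

// Process the rows of one color channel in parallel on the common fork/join pool.
IntStream.range(0, numRows).parallel().forEach(r -> {
    for (int c = 0; c < numCols; c++) {
        // blurPixel is a hypothetical helper that computes the blurred value of one pixel.
        blurred[r * numCols + c] = blurPixel(channel, r, c, numRows, numCols, filter, filterWidth);
    }
});
```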

6. Implementation of the blur filter

A blur filter is a map operator: it applies a function (the blur effect) to each pixel of the input image. This pattern is ideal for parallelization because each pixel can be computed independently of the others.

The first thing we do is annotate the Java methods so that TornadoVM knows how to parallelize them.


Because each pixel can be computed in parallel, we add the @Parallel annotation to the two outermost loops. This tells TornadoVM that both loops are fully parallel. These annotations express a data-parallel pattern.
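The code listing appeared as an image in the original article; what follows is a minimal sketch of what such an annotated method could look like, assuming TornadoVM's @Parallel annotation (class, method, and variable names are illustrative):

```java
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class BlurFilter {
    // Applies a convolution filter to one color channel stored as a flat array.
    public static void compute(int[] channel, int[] blurred, int numRows, int numCols,
                               float[] filter, int filterWidth) {
        for (@Parallel int r = 0; r < numRows; r++) {
            for (@Parallel int c = 0; c < numCols; c++) {
                float result = 0.0f;
                for (int fr = -filterWidth / 2; fr <= filterWidth / 2; fr++) {
                    for (int fc = -filterWidth / 2; fc <= filterWidth / 2; fc++) {
                        // Clamp the coordinates at the image borders.
                        int row = Math.min(Math.max(r + fr, 0), numRows - 1);
                        int col = Math.min(Math.max(c + fc, 0), numCols - 1);
                        float pixel = channel[row * numCols + col];
                        float weight = filter[(fr + filterWidth / 2) * filterWidth + (fc + filterWidth / 2)];
                        result += pixel * weight;
                    }
                }
                blurred[r * numCols + c] = (int) result;
            }
        }
    }
}
```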

The second step is to define the tasks. Since the input is an RGB image, we can create one task per color channel (red, green, and blue) and blur each channel. We use a TaskSchedule object with three tasks.


In addition, we need to define which data will be transferred from the Java heap to the device (for example, the GPU). This is because GPUs and FPGAs typically do not share memory with the host, so we need a way to tell TornadoVM which memory regions should be copied in and out. This is done through the streamIn() and streamOut() functions.

Next, we define the set of tasks, one per color channel. Each task is identified by a name and refers, via a method reference, to the method to be accelerated; that method is then compiled into kernel code. A sketch of the complete schedule is shown below.
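This listing was also an image in the original article; here is a minimal sketch of what the complete TaskSchedule could look like, assuming the streamIn/streamOut/execute API described above (array and constant names are illustrative, and the arrays are assumed to be declared elsewhere):

```java
import uk.ac.manchester.tornado.api.TaskSchedule;

// red, green, blue hold the input channels; redBlur, greenBlur, blueBlur receive the results.
TaskSchedule parallelFilter = new TaskSchedule("blur")
        .streamIn(red, green, blue)                  // copy the input channels to the device
        .task("red",   BlurFilter::compute, red,   redBlur,   numRows, numCols, filter, FILTER_WIDTH)
        .task("green", BlurFilter::compute, green, greenBlur, numRows, numCols, filter, FILTER_WIDTH)
        .task("blue",  BlurFilter::compute, blue,  blueBlur,  numRows, numCols, filter, FILTER_WIDTH)
        .streamOut(redBlur, greenBlur, blueBlur);    // copy the blurred channels back to the host

// Compile the three tasks and run them in parallel on the target device.
parallelFilter.execute();
```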

Finally, we call the execute function to perform these tasks in parallel on the device. Now let's take a look at how TornadoVM compiles and executes code.

7. How TornadoVM launches Java kernels on parallel hardware

The original Java code is single-threaded, even after the @Parallel annotations have been added. TornadoVM only starts optimizing the code when the execute() function is called.

First, the code is compiled into an intermediate representation in order to optimize it (TornadoVM extends the Graal JIT compiler, and all optimizations occur at this level). TornadoVM then converts the optimized code into efficient PTX, OpenCL, or SPIR-V code.

At that point the code is executed, launching hundreds or even thousands of threads. How many threads TornadoVM launches depends on the application.


In this example, the blur filter has two parallel loops, each traversing one dimension of the image. Therefore, during runtime compilation, TornadoVM creates a thread grid with the same dimensions as the input image and maps one thread to each grid cell (that is, to each pixel). For example, for a 2000x2000-pixel image, TornadoVM launches 2000x2000 threads on the target device (for example, a GPU).

TornadoVM can also exploit pipeline parallelism, primarily on FPGAs. When we, or TornadoVM, select an FPGA as the target, it automatically inserts information into the generated code to pipeline instructions. This strategy can double performance compared with the data-parallel code alone.

8. Parallel Loop API and Parallel Kernel API

Now let's look at how compute kernels are expressed in TornadoVM. TornadoVM has two APIs: the Parallel Loop API we used in the blur filter example, and the Parallel Kernel API. TornadoVM's Parallel Loop API is annotation-based. When using this API, developers provide a serial implementation and then decide which loops to parallelize.

On the one hand, development is faster because developers only need to add annotations to existing Java serial code to achieve parallelism. The Parallel Loop API is suitable for non-expert users who don't need to know the computational details of the GPU or what kind of hardware they should use.

On the other hand, the Parallel Loop API supports only a limited number of patterns. With this API, developers can implement applications that follow typical map/reduce patterns (see the sketch below), but other parallel patterns, such as scans or complex stencils, are difficult to express. In addition, this API does not let developers control the hardware, because it is hardware-agnostic, yet sometimes developers do need that control. Finally, porting existing OpenCL and CUDA code to Java through this API can be difficult.
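As an illustration of the reduce side of that map/reduce support, here is a minimal sketch of a parallel reduction written with the Loop API, assuming TornadoVM's @Reduce annotation (class and method names are illustrative):

```java
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.annotations.Reduce;

public class ReductionExample {
    // Sums all elements of input into result[0]; the @Reduce annotation tells
    // TornadoVM to generate a device-specific parallel reduction for this variable.
    public static void sum(float[] input, @Reduce float[] result) {
        result[0] = 0.0f;
        for (@Parallel int i = 0; i < input.length; i++) {
            result[0] += input[i];
        }
    }
}
```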

To address these limitations, we've added the Parallel Kernel API.

9. Implementing the blur filter with the Parallel Kernel API


Let's go back to the previous example, the blur filter. We had two parallel loops that iterate over the two dimensions of the image and apply the filter. This can be rewritten using the Parallel Kernel API.

Instead of using two loops, we introduce the parallelism implicitly through a kernel context. The context is a TornadoVM object that gives the user access to the thread identifier in each dimension, to local/shared memory, to synchronization primitives, and so on.

In our example, the x and y coordinates of the filter come from the context's globalIdx and globalIdy fields, respectively, and are used to apply the filter as before. This style of programming is closer to the CUDA and OpenCL programming models.
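Since this listing was an image in the original article, here is a minimal sketch of how the kernel version could look, assuming TornadoVM's KernelContext class (names other than globalIdx and globalIdy are illustrative):

```java
import uk.ac.manchester.tornado.api.KernelContext;

public class BlurFilterKernel {
    // One invocation of this method processes a single pixel: the two loops are gone,
    // and the pixel coordinates come from the kernel context instead.
    public static void compute(KernelContext context, int[] channel, int[] blurred,
                               int numRows, int numCols, float[] filter, int filterWidth) {
        int r = context.globalIdx;   // thread id in the first dimension
        int c = context.globalIdy;   // thread id in the second dimension
        float result = 0.0f;
        for (int fr = -filterWidth / 2; fr <= filterWidth / 2; fr++) {
            for (int fc = -filterWidth / 2; fc <= filterWidth / 2; fc++) {
                int row = Math.min(Math.max(r + fr, 0), numRows - 1);
                int col = Math.min(Math.max(c + fc, 0), numCols - 1);
                result += channel[row * numCols + col]
                        * filter[(fr + filterWidth / 2) * filterWidth + (fc + filterWidth / 2)];
            }
        }
        blurred[r * numCols + c] = (int) result;
    }
}
```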


It is important to note that, with this API, TornadoVM cannot determine at runtime how many threads it needs to launch. The user must configure this through a worker grid.

In this example, we create a 2D worker grid with the dimensions of the image and associate it with the name of the function. When the user's code calls the execute() function, the grid is passed in as an argument and the corresponding filter is applied, as sketched below.
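A minimal sketch of that configuration for one channel, assuming TornadoVM's WorkerGrid2D and GridScheduler classes (schedule and task names are illustrative, and the arrays are assumed to be declared elsewhere):

```java
import uk.ac.manchester.tornado.api.GridScheduler;
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.WorkerGrid;
import uk.ac.manchester.tornado.api.WorkerGrid2D;

// One thread per pixel: the worker grid has the same dimensions as the image.
WorkerGrid workerGrid = new WorkerGrid2D(numRows, numCols);
// Associate the grid with the task, identified as "<schedule name>.<task name>".
GridScheduler gridScheduler = new GridScheduler("blur.red", workerGrid);

KernelContext context = new KernelContext();
TaskSchedule ts = new TaskSchedule("blur")
        .streamIn(red)
        .task("red", BlurFilterKernel::compute, context, red, redBlur,
              numRows, numCols, filter, FILTER_WIDTH)
        .streamOut(redBlur);

// The grid is passed to execute(), which launches numRows x numCols threads.
ts.execute(gridScheduler);
```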

10. Advantages of TornadoVM

But if the Parallel Kernel API is so close to the low-level programming models, why would developers use Java instead of OpenCL and PTX, or CUDA and PTX, especially if they already have such code?

Because TornadoVM offers other advantages, such as live task migration, automatic memory management, and transparent code optimization specialized for each architecture.

It can also run on FPGAs with fully transparent and integrated programming workflows. You can use your favorite IDEs, such as IntelliJ or Eclipse, to write code that runs on FPGAs.

It can also be deployed in the cloud, for example on Amazon AWS. By porting your code to Java and TornadoVM, you get all of these features essentially for free.

11. Performance

Now let's talk about performance. TornadoVM is not only used to apply filters to images; it is also used in fintech and financial simulations such as Monte Carlo and Black-Scholes, as well as in computer vision, physics simulation, signal processing, and other fields.

[Figure: speedups of different applications on different devices]

The chart above compares the execution of different applications on different devices. Again, serial Java execution is the baseline, and the bars show the speedup factor, so higher is better.

As we can see, very large speedups are possible. For example, signal processing and physics simulation workloads can be 4000 times faster than the serial Java execution. A detailed analysis of these results can be found in the academic publications (https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/14_PUBLICATIONS.md).

12. TornadoVM in the industry

[Figure: two industry use cases of TornadoVM]

Some companies in the industry are also experimenting with TornadoVM. The image above shows two scenarios where TornadoVM is being used.

One application comes from Neurocom, a company based in Luxembourg, which runs natural language processing algorithms. So far, they have achieved a 30x speedup by running their hierarchical clustering algorithms on GPUs.

Another use case comes from Spark Works, a company based in Ireland, which processes information coming from IoT devices. They run their post-processing workloads on a powerful GPU, a GP100, and compared with their Java implementation they can get up to a 460x speedup, which is quite impressive.

You can visit the TornadoVM website for a complete list of application scenarios.

13. Summary

Heterogeneous devices are now present in almost every computing system, and that is not going to change: they are here to stay.

As a result, programmers of current and future software systems need to handle a wide range of diverse devices, such as GPUs, FPGAs, or whatever hardware comes next, and TornadoVM lets them program these devices.

TornadoVM can be seen as a high-performance computing platform for Java and the JVM that works in conjunction with existing JDK distributions such as OpenJDK.

This article introduces TornadoVM and briefly explains how it works. In addition, this article demonstrates how developers can take full advantage of heterogeneous hardware through an example of image processing implemented in Java. We explain two APIs in TornadoVM for heterogeneous programming: the Parallel Loop API, which is suitable for ordinary developers, and the Parallel Kernel API, which is suitable for experts who already know CUDA and OpenCL and want to port existing code to TornadoVM.

About the Author:

Juan Fumero is a researcher at the University of Manchester. His research topics are heterogeneous virtual machines for high-level programming languages, GPGPU computing, and distributed computing. He currently works on the TornadoVM project, bringing automatic GPU and FPGA JIT compilation and execution to Java applications. He has also worked with Intel to bring oneAPI support to TornadoVM to optimize code for Intel compute architectures. Juan received his PhD from the University of Edinburgh, where he focused on accelerating Java, R, and Ruby on GPUs. He has also interned at Oracle Labs and CERN, implementing compilers and evaluating parallel techniques for multi-core systems.

https://www.infoq.com/articles/java-performance-tornadovm/
