
【TVM Learning Materials】TensorIR Quick Start

Author: HyperAI

This article is translated from the English document Blitz Course to TensorIR by Feng Siyuan. More TVM Chinese documentation can be found at → TVM Chinese Station https://hyper.ai

TensorIR is a domain-specific language for deep learning programs and serves two main purposes:

  • An implementation for transforming and optimizing programs on various hardware backends.
  • An abstraction for automatic _tensorized_ program optimization.
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import numpy as np           

IRModule

IRModule is the core data structure of TVM, which contains deep learning programs and is the basis for IR transformation and model building.

[Figure: the lifecycle of an IRModule]

The figure above shows the lifecycle of an IRModule, which can be created from TVMScript. The two main ways to transform an IRModule are TensorIR schedule primitive transforms and pass transforms. A series of such transformations can also be applied directly to an IRModule. Note that an IRModule can be printed back to TVMScript at any stage. After all transformations and optimizations are complete, the IRModule can be built into a runnable module for deployment on the target device.

Based on the design of TensorIR and IRModule, this enables a new way of programming:

  1. Write programs in TVMScript, a Python-AST-based syntax.
  2. Transform and optimize the programs with Python APIs.
  3. Interactively inspect and improve performance with the imperative transformation API.

Create IRModule

TVMScript is a round-trip syntax for TVM IR, so an IRModule can be created by writing TVMScript.

Unlike creating computational expressions from tensor expressions (i.e., manipulating operators with tensor expressions), TensorIR lets users program via TVMScript, a language embedded in the Python AST. This new approach makes it possible to write complex programs and then schedule and optimize them further.

Here is an example of vector addition:

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        # We exchange data through T.handle, which is similar to a memory pointer
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # Create Buffers from the handles
        A = T.match_buffer(a, (8,), dtype="float32")
        B = T.match_buffer(b, (8,), dtype="float32")
        for i in range(8):
            # A block is an abstraction for computation
            with T.block("B"):
                # Define a spatial (parallelizable) block iterator and bind its value to i
                vi = T.axis.spatial(8, i)
                B[vi] = A[vi] + 1.0

ir_module = MyModule
print(type(ir_module))
print(ir_module.script())           

Output result:

<class 'tvm.ir.module.IRModule'>
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        for i in T.serial(8):
            with T.block("B"):
                vi = T.axis.spatial(8, i)
                T.reads(A[vi])
                T.writes(B[vi])
                B[vi] = A[vi] + T.float32(1)           

In addition, we can write a simple operator with the tensor expression DSL and convert it to an IRModule.

from tvm import te

A = te.placeholder((8,), dtype="float32", name="A")
B = te.compute((8,), lambda *i: A(*i) + 1.0, name="B")
func = te.create_prim_func([A, B])
ir_module_from_te = IRModule({"main": func})
print(ir_module_from_te.script())           

Output result:

# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        for i0 in T.serial(8):
            with T.block("B"):
                i0_1 = T.axis.spatial(8, i0)
                T.reads(A[i0_1])
                T.writes(B[i0_1])
                B[i0_1] = A[i0_1] + T.float32(1)           

Build and run IRModule

IRModule can be built as a runnable module for a specific target backend.

mod = tvm.build(ir_module, target="llvm")  # module for the CPU backend
print(type(mod))           

Output result:

<class 'tvm.driver.build_module.OperatorModule'>           

Prepare the input and output arrays, and then run the module:

a = tvm.nd.array(np.arange(8).astype("float32"))
b = tvm.nd.array(np.zeros((8,)).astype("float32"))
mod(a, b)
print(a)
print(b)           

Output result:

[0. 1. 2. 3. 4. 5. 6. 7.]
[1. 2. 3. 4. 5. 6. 7. 8.]           
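The built module computes B[vi] = A[vi] + 1.0 element-wise over 8 values, leaving the input a untouched. A minimal NumPy-only sketch of the same semantics (an illustration added here, not TVM API):

```python
import numpy as np

# The TIR kernel computes B[vi] = A[vi] + 1.0 for vi in 0..7.
a_np = np.arange(8).astype("float32")
b_np = a_np + 1.0  # same element-wise semantics as the built module

print(a_np)  # [0. 1. 2. 3. 4. 5. 6. 7.]
print(b_np)  # [1. 2. 3. 4. 5. 6. 7. 8.]
```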

Convert IRModule

IRModule is the core data structure for program optimization and can be transformed by a Schedule. A Schedule contains several primitives to transform programs interactively; each primitive transforms the program in a specific way to optimize performance.

[Figure: a typical workflow for optimizing a tensor program]

The figure above is a typical workflow for optimizing a tensor program. First, create an initial IRModule with TVMScript or tensor expressions, and then create a schedule on this initial IRModule. Next, use a series of scheduling primitives to improve performance. Finally, we can lower it and build it into a runnable module.

Here, only a simple transformation is demonstrated. First, create a schedule on the input ir_module:

sch = tvm.tir.Schedule(ir_module)
print(type(sch))           

Output result:

<class 'tvm.tir.schedule.schedule.Schedule'>           

Split the loop into 3 nested loops and print the result:

# Get the block by its name
block_b = sch.get_block("B")
# Get the loops surrounding the block
(i,) = sch.get_loops(block_b)
# Split the loop into 3 nested loops
i_0, i_1, i_2 = sch.split(i, factors=[2, 2, 2])
print(sch.mod.script())           

Output result:

# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        for i_0, i_1, i_2 in T.grid(2, 2, 2):
            with T.block("B"):
                vi = T.axis.spatial(8, i_0 * 4 + i_1 * 2 + i_2)
                T.reads(A[vi])
                T.writes(B[vi])
                B[vi] = A[vi] + T.float32(1)           
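The index mapping in the output above can be checked in plain Python (a sketch added for illustration, independent of TVM): with factors [2, 2, 2], each element index decomposes as vi = i_0 * 4 + i_1 * 2 + i_2, and the three nested loops visit every index from 0 to 7 exactly once.

```python
# Reconstruct the flattened index vi = i_0 * 4 + i_1 * 2 + i_2
# produced by sch.split(i, factors=[2, 2, 2]).
visited = []
for i_0 in range(2):
    for i_1 in range(2):
        for i_2 in range(2):
            visited.append(i_0 * 4 + i_1 * 2 + i_2)

print(visited)  # [0, 1, 2, 3, 4, 5, 6, 7]
```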

You can also reorder the loops. For example, move loop i_2 outside of i_1:

sch.reorder(i_0, i_2, i_1)
print(sch.mod.script())           

Output result:

# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        for i_0, i_2, i_1 in T.grid(2, 2, 2):
            with T.block("B"):
                vi = T.axis.spatial(8, i_0 * 4 + i_1 * 2 + i_2)
                T.reads(A[vi])
                T.writes(B[vi])
                B[vi] = A[vi] + T.float32(1)           
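Reordering only changes the order in which indices are visited, not the set of indices, so the computed result is unchanged. A plain-Python sketch of the reordered loop nest above (illustration only):

```python
# After sch.reorder(i_0, i_2, i_1) the loop order is (i_0, i_2, i_1),
# but vi is still computed as i_0 * 4 + i_1 * 2 + i_2.
order = []
for i_0 in range(2):
    for i_2 in range(2):
        for i_1 in range(2):
            order.append(i_0 * 4 + i_1 * 2 + i_2)

print(order)          # [0, 2, 1, 3, 4, 6, 5, 7]
print(sorted(order))  # still covers every index 0..7 exactly once
```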

Convert to a GPU program

To deploy a model on a GPU, thread binding is required. Fortunately, this conversion can also be done incrementally with primitives.

sch.bind(i_0, "blockIdx.x")
sch.bind(i_2, "threadIdx.x")
print(sch.mod.script())           

Output result:

# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        for i_0 in T.thread_binding(2, thread="blockIdx.x"):
            for i_2 in T.thread_binding(2, thread="threadIdx.x"):
                for i_1 in T.serial(2):
                    with T.block("B"):
                        vi = T.axis.spatial(8, i_0 * 4 + i_1 * 2 + i_2)
                        T.reads(A[vi])
                        T.writes(B[vi])
                        B[vi] = A[vi] + T.float32(1)           
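To see what the binding means: each of the 2 blocks (blockIdx.x) launches 2 threads (threadIdx.x), and each thread serially runs the 2 iterations of i_1. A plain-Python sketch of which elements each GPU thread touches (an illustration added here, not TVM API):

```python
# Map (blockIdx.x, threadIdx.x) -> element indices, mirroring
# vi = i_0 * 4 + i_1 * 2 + i_2 with i_0 = blockIdx.x and i_2 = threadIdx.x.
work = {}
for block_x in range(2):       # blockIdx.x
    for thread_x in range(2):  # threadIdx.x
        work[(block_x, thread_x)] = [
            block_x * 4 + i_1 * 2 + thread_x for i_1 in range(2)
        ]

print(work)
# {(0, 0): [0, 2], (0, 1): [1, 3], (1, 0): [4, 6], (1, 1): [5, 7]}
```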

After binding the threads, build the IRModule with the CUDA backend:

ctx = tvm.cuda(0)
cuda_mod = tvm.build(sch.mod, target="cuda")
cuda_a = tvm.nd.array(np.arange(8).astype("float32"), ctx)
cuda_b = tvm.nd.array(np.zeros((8,)).astype("float32"), ctx)
cuda_mod(cuda_a, cuda_b)
print(cuda_a)
print(cuda_b)           

Output result:

[0. 1. 2. 3. 4. 5. 6. 7.]
[1. 2. 3. 4. 5. 6. 7. 8.]           

Download the Python source code: tensor_ir_blitz_course.py

Download Jupyter Notebook: tensor_ir_blitz_course.ipynb

That's all for this document. For more TVM Chinese documentation, visit → TVM Chinese Station https://hyper.ai