Version: v1.3.0

Performance tuning

For-loop decorators

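In Taichi kernels, for-loops in the outermost scope are automatically parallelized. Our compiler automatically tunes the parameters to make the best use of the target architecture. Nevertheless, for those chasing the last few percent of performance, we also provide some APIs to help fine-tune applications. For example, specifying a suitable block_dim could yield an almost 3x performance boost in examples/mpm3d.py.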

You can use ti.loop_config to set the loop directives for the next for loop. Available directives are:

  • parallelize: Sets the number of threads to use on CPU
  • block_dim: Sets the number of threads in a block on GPU
  • serialize: If you set serialize to True, the for-loop runs serially, and you can write break statements inside it (this only applies to range/ndrange for-loops). Equivalent to setting parallelize to 1.
import taichi as ti

ti.init()

@ti.kernel
def break_in_serial_for() -> ti.i32:
    a = 0
    ti.loop_config(serialize=True)
    for i in range(100):  # This loop runs serially
        a += i
        if i == 10:
            break
    return a

break_in_serial_for()  # returns 55

n = 128
val = ti.field(ti.i32, shape=n)

@ti.kernel
def fill():
    ti.loop_config(parallelize=8, block_dim=16)
    # If the kernel is run on the CPU backend, 8 threads will be used to run it
    # If the kernel is run on the CUDA backend, each block will have 16 threads.
    for i in range(n):
        val[i] = i

Background: Thread hierarchy of GPUs

To better understand how the mentioned for-loop is parallelized, we briefly introduce the thread hierarchy on modern GPU architectures.

From a fine-grained to a coarse-grained level, the computation units can be defined as: iteration < thread < block < grid.

  • iteration: An iteration is the body of a for-loop. Each iteration corresponds to a specific i value in the for-loop.
  • thread: Iterations are grouped into threads. A thread is the minimal unit that is parallelized. All iterations within a thread are executed in serial. We usually use 1 iteration per thread for maximizing parallel performance.
  • block: Threads are grouped into blocks. All threads within a block are executed in parallel. Threads within the same block can share their block local storage.
  • grid: Blocks are grouped into grids. A grid is the minimal unit that is launched from the host. All blocks within a grid are executed in parallel. In Taichi, each parallelized for-loop is a grid.

For more details, please see the CUDA C programming guide. Note that we employ the CUDA terminology here; other backends such as OpenGL and Metal follow a similar thread hierarchy.
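To make the hierarchy concrete, here is a minimal plain-Python sketch (it calls neither Taichi nor CUDA) of how loop iterations could map onto blocks and threads when each thread runs exactly one iteration. The helper names num_blocks and locate are hypothetical, and the real scheduling is decided by the backend.

```python
def num_blocks(n_iterations: int, block_dim: int) -> int:
    """Blocks needed to cover n_iterations at one iteration per thread."""
    return (n_iterations + block_dim - 1) // block_dim  # ceiling division


def locate(i: int, block_dim: int) -> tuple:
    """Return (block index, thread index within the block) for iteration i."""
    return i // block_dim, i % block_dim


# A loop of 8192 iterations with block_dim=128 forms a grid of 64 blocks.
print(num_blocks(8192, 128))  # 64
print(locate(300, 128))       # (2, 44)
```

Under this assumption, a smaller block_dim means more blocks with fewer threads each. Which setting is faster depends on the kernel and the hardware, so it is worth measuring.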

Example: Tuning the block-level parallelism of a for-loop

Programmers may prepend some decorator(s) to tweak the property of a for-loop, e.g.:

@ti.kernel
def func():
    for i in range(8192):  # no decorator, use default settings
        ...

    ti.loop_config(block_dim=128)  # change the property of the next for-loop:
    for i in range(8192):  # will be parallelized with block_dim=128
        ...

    for i in range(8192):  # no decorator, use default settings
        ...

Data layout

You may already be familiar with Fields in Taichi. Since Taichi decouples data structure from computation, developers have the flexibility to experiment with different data layouts. As in other programming languages, selecting an efficient layout can drastically improve performance. For more information on advanced data layouts in Taichi, please see the Fields (advanced) section.

Local storage optimizations

Taichi comes with a few optimizations that leverage the fast memory (e.g. CUDA shared memory, L1 cache) for performance optimization. The idea is straightforward: Wherever possible, Taichi substitutes the access to the global memory (slow) with that to the local one (fast), and writes the data in the local memory (e.g., CUDA shared memory) back to the global memory in the end. Such transformations preserve the semantics of the original program (will be explained later).

Thread-local storage (TLS)

TLS is mostly designed to optimize parallel reduction. When Taichi identifies a global reduction pattern in a @ti.kernel, it automatically applies the TLS optimizations during code generation, similar to those found in common GPU reduction implementations.

We will walk through an example using CUDA's terminology.

x = ti.field(ti.f32, shape=1000000)
s = ti.field(ti.f32, shape=())

@ti.kernel
def sum():
    for i in x:
        s[None] += x[i]

sum()

Internally, Taichi's parallel loop is implemented using Grid-Stride Loops. What this means is that each physical CUDA thread could handle more than one item in x. That is, the number of threads launched for sum can be fewer than the shape of x.
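As a rough illustration of the grid-stride pattern, the plain-Python sketch below lists which indices each thread would visit. The function name grid_stride_indices and the tiny sizes are hypothetical, and the code only models the index pattern, not how Taichi actually launches threads.

```python
def grid_stride_indices(thread_id: int, num_threads: int, n: int) -> list:
    """Indices visited by one thread in a grid-stride loop over n items."""
    # Start at the thread's own id, then jump by the total thread count.
    return list(range(thread_id, n, num_threads))


n, num_threads = 10, 4
for t in range(num_threads):
    print(t, grid_stride_indices(t, num_threads, n))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```

Every index in range(n) is visited exactly once, even though only 4 threads cover 10 items.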

One optimization enabled by this strategy is to substitute the global memory access with a thread-local one. Concretely, instead of directly and atomically adding x[i] into the destination s[None], which resides in the global memory, Taichi preallocates a thread-local buffer upon entering the thread, accumulates (non-atomically) the values of x into this buffer, then atomically adds the buffer's result back to s[None] before exiting the thread. Assuming each thread handles N items in x, the number of atomic adds is reduced to 1/N of the original count.

Additionally, the last atomic add to the global memory s[None] is optimized using CUDA's warp-level intrinsics, further reducing the number of required atomic adds.
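The effect on the number of atomic adds can be sketched in plain Python. The two functions below are hypothetical stand-ins that run the "threads" one after another and simply count how often the shared total would be updated. They do not model the warp-level step, and they are not Taichi's generated code.

```python
def reduce_without_tls(x, num_threads):
    """Every element is added straight into the shared total."""
    total, atomic_adds = 0.0, 0
    for t in range(num_threads):
        for i in range(t, len(x), num_threads):  # grid-stride loop
            total += x[i]       # stands in for one atomic add
            atomic_adds += 1
    return total, atomic_adds


def reduce_with_tls(x, num_threads):
    """Each thread accumulates locally, then updates the total once."""
    total, atomic_adds = 0.0, 0
    for t in range(num_threads):
        local = 0.0             # thread-local buffer
        for i in range(t, len(x), num_threads):
            local += x[i]       # plain, non-atomic add
        total += local          # stands in for the single atomic add
        atomic_adds += 1
    return total, atomic_adds


x = [1.0] * 1000
print(reduce_without_tls(x, 10))  # (1000.0, 1000)
print(reduce_with_tls(x, 10))     # (1000.0, 10)
```

With 10 threads each handling N = 100 items, the count of shared updates drops from 1000 to 10, which matches the 1/N reduction described above.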

Currently, Taichi supports TLS optimization for these reduction operators: add, sub, min and max on 0D scalar/vector/matrix ti.fields. It is not yet supported on ti.ndarrays. Here is a benchmark comparison when running a global max reduction on a 1-D Taichi field of 8M floats on an Nvidia GeForce RTX 3090 card:

  • TLS disabled: 5.2 x 1e3 us
  • TLS enabled: 5.7 x 1e1 us

In this benchmark, TLS yields a speedup of roughly 90x (5.2e3 us versus 5.7e1 us). The TLS-based sum reduction also achieves performance comparable to CUDA implementations; see the benchmark for details.

Block-local storage (BLS)

Context: For a sparse field whose last layer is a dense SNode (i.e., its layer hierarchy matches ti.root.(sparse SNode)+.dense), Taichi will assign one CUDA thread block to each dense container (or dense block). BLS optimization works specifically for such kinds of fields.

BLS aims to accelerate stencil computation patterns by leveraging the CUDA shared memory. This optimization starts with users annotating the set of fields they would like to cache via ti.block_local. Taichi then attempts to figure out, at compile time, the range in which these annotated fields are accessed with respect to the dense block. If it succeeds, Taichi generates code that first fetches all the accessed data in that range into a block-local buffer (CUDA's shared memory), then redirects all accesses to the corresponding slots into this buffer.

Here is an example illustrating the usage of BLS. a is a sparse field with a block size of 4x4.

a = ti.field(ti.f32)
b = ti.field(ti.f32)
# `a` has a block size of 4x4
ti.root.pointer(ti.ij, 32).dense(ti.ij, 4).place(a)

@ti.kernel
def foo():
    # Taichi will cache `a` into the CUDA shared memory
    ti.block_local(a)
    for i, j in a:
        print(a[i - 1, j], a[i, j + 2])

Each loop iteration accesses items at offsets [-1, 0] and [0, 2] relative to its coordinates, respectively. Therefore, for an entire block spanning from [M, N] (inclusive) to [M + 4, N + 4] (exclusive), the accessed range with respect to this block is [M - 1, M + 4) x [N, N + 6) (derived from [M + (-1), M + 4) x [N, N + 4 + 2)). Conceptually, a global coordinate (i, j) in this range corresponds to the local buffer index (i - (M - 1), j - N).

From a user's perspective, you do not need to worry about these underlying details. Taichi does all the inference and the global/block-local mapping automatically. That is, Taichi will preallocate a CUDA shared memory buffer of size 5x6, pre-load a's data into this buffer, and replace all the accesses to a (in the global memory) with the buffer in the loop body. While this simple example does not modify a, if a block-cached field does get written, Taichi would also generate code that writes the buffer back to the global memory.
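The range inference for this example can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic: the helper names accessed_range, buffer_shape and to_local are hypothetical, the block origin (8, 4) is an arbitrary choice for M and N, and Taichi's compile-time analysis is more involved than this.

```python
def accessed_range(block_origin, block_size, offsets):
    """Half-open (lo, hi) range per axis touched by one dense block."""
    ranges = []
    for axis, (origin, size) in enumerate(zip(block_origin, block_size)):
        lo = origin + min(off[axis] for off in offsets)
        hi = origin + size + max(off[axis] for off in offsets)
        ranges.append((lo, hi))
    return ranges


def buffer_shape(ranges):
    """Shape of the block-local buffer covering the accessed range."""
    return tuple(hi - lo for lo, hi in ranges)


def to_local(coord, ranges):
    """Map a global coordinate to an index into the block-local buffer."""
    return tuple(c - lo for c, (lo, _) in zip(coord, ranges))


# Block origin (M, N) = (8, 4), block size 4x4, offsets from the kernel above.
ranges = accessed_range((8, 4), (4, 4), [(-1, 0), (0, 2)])
print(ranges)                     # [(7, 12), (4, 10)]
print(buffer_shape(ranges))       # (5, 6)
print(to_local((7, 4), ranges))   # (0, 0)
```

The resulting shape (5, 6) agrees with the 5x6 shared memory buffer mentioned above.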

note

BLS does not come for free. Remember that BLS is designed for stencil computation, where there is a large number of overlapping accesses to the global memory. If this is not the case, the pre-loading/post-storing could actually hurt performance.

On top of that, recent generations of Nvidia's GPU cards have been closing the gap on the read-only access between the global memory and the shared memory. Currently, we found BLS to be more effective for caching the destinations of the atomic operations.

As a rule of thumb, run benchmarks to decide whether to enable BLS or not.

Offline cache

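A Taichi kernel is implicitly compiled the first time it is called. The compilation results are kept in an online in-memory cache to reduce the overhead of subsequent calls. As long as the kernel function is unchanged, it can be loaded and launched directly. However, the cache is no longer available once the program terminates. If you run the program again, Taichi has to recompile all kernel functions and rebuild the online in-memory cache. Because of this compilation overhead, the first launch of a Taichi function is always slow.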

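To address this problem, we introduced the offline cache feature, which dumps the compilation cache to disk for use in later runs. On repeated runs, the overhead of the first launch can be reduced drastically. Taichi now builds and maintains an offline cache by default, and provides several options in ti.init() for configuring the offline cache behavior.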

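  • offline_cache: bool: Enables or disables the offline cache. Default: True.
  • offline_cache_file_path: str: Directory holding the offline cache files. Default: 'C:\taichi_cache\ticache\' on Windows and '~/.cache/taichi/ticache/' on Unix-like systems. Directories are automatically populated.
  • offline_cache_max_size_of_files: int32: Maximum size of the cache files in bytes. Default: 100MB. A cleaning process is triggered when the size of the cache files exceeds this limit.
  • offline_cache_cleaning_policy: str: Policy for replacing stale files in the cache. Options: 'never', 'version', 'lru', 'fifo'. Default: 'lru'.
    • 'never': Never cleans any cache files, regardless of the offline_cache_max_size_of_files setting;
    • 'version': Discards only the old-version cache files associated with a kernel function;
    • 'lru': Discards the least recently used cache files;
    • 'fifo': Discards the cache files that were added earliest.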
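The difference between the 'lru' and 'fifo' policies can be illustrated with a small plain-Python sketch. It is not Taichi's implementation: the function files_to_evict, the record fields and the sizes are all hypothetical, the 'version' policy is not modeled, and how many files Taichi actually removes in one cleaning pass is not specified here.

```python
def files_to_evict(files, max_size, policy):
    """Pick files to discard until the total size fits within max_size.

    files: list of dicts with 'name', 'size', 'created' and 'last_used'.
    """
    if policy == "never":
        return []                          # keep everything, ignore the limit
    if policy == "lru":
        order = sorted(files, key=lambda f: f["last_used"])
    elif policy == "fifo":
        order = sorted(files, key=lambda f: f["created"])
    else:
        raise ValueError("policy not modeled in this sketch: " + policy)

    total = sum(f["size"] for f in files)
    evicted = []
    for f in order:
        if total <= max_size:
            break
        evicted.append(f["name"])
        total -= f["size"]
    return evicted


files = [
    {"name": "a", "size": 60, "created": 1, "last_used": 9},
    {"name": "b", "size": 30, "created": 2, "last_used": 3},
    {"name": "c", "size": 40, "created": 3, "last_used": 5},
]
print(files_to_evict(files, 100, "lru"))    # ['b']
print(files_to_evict(files, 100, "fifo"))   # ['a']
print(files_to_evict(files, 100, "never"))  # []
```

Here 'lru' drops b because it was used longest ago, while 'fifo' drops a because it was created first, even though a is the most recently used file.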

To verify the effect, pick an example, run it twice, and observe the launch overhead.

note

If your code behaves abnormally, disable offline cache by setting the environment variable TI_OFFLINE_CACHE=0 or offline_cache=False in the ti.init() method call and file an issue with us on Taichi's GitHub repo.