Version: v1.3.0

Profiler

Overview

To facilitate debugging and optimization, Taichi provides a set of profiling tools. They collect hardware and Taichi-related information to quantify program performance and help locate bottlenecks.

Currently, Taichi provides two profiling tools:

  • ScopedProfiler analyzes the performance of the Taichi JIT compiler (host).
  • KernelProfiler analyzes the performance of Taichi kernels (device). Its advanced mode, which works with the CUDA backend only, provides detailed low-level performance metrics, such as memory bandwidth consumption.

ScopedProfiler

ScopedProfiler tracks the time spent on host tasks, such as JIT compilation.

This profiler is enabled by default. To display results in a hierarchical format, call ti.profiler.print_scoped_profiler_info().

For example:

import taichi as ti

ti.init(arch=ti.cpu)
var = ti.field(ti.f32, shape=1)

@ti.kernel
def compute():
    var[0] = 1.0
    print("Setting var[0] =", var[0])

compute()
ti.profiler.print_scoped_profiler_info()
note

ScopedProfiler is a C++ class in Taichi.

KernelProfiler

KernelProfiler retrieves the kernel profiling records from the backend, counts them in the Python scope, and prints the results to the console. Note that kernel_profiler supports the CPU and CUDA backends only. If your program runs on a GPU, call ti.sync() before profiling.

  1. To enable this profiler, set kernel_profiler=True when calling ti.init().
  2. To display the profiling results, call ti.profiler.print_kernel_profiler_info(). There are two printing modes:
    • In 'count' mode (default), the profiling records under the same kernel name are counted as one profiling result.
    • In 'trace' mode, the profiler shows you a list of kernels launched on hardware during the profiling period. This mode provides detailed performance and runtime hardware metrics for each kernel.
  3. To clear the records in this profiler, call ti.profiler.clear_kernel_profiler_info().

For example:

import taichi as ti

ti.init(ti.cpu, kernel_profiler=True)
x = ti.field(ti.f32, shape=1024*1024)

@ti.kernel
def fill():
    for i in x:
        x[i] = i

for i in range(8):
    fill()
ti.profiler.print_kernel_profiler_info('trace')
ti.profiler.clear_kernel_profiler_info()  # Clears all records

for i in range(100):
    fill()
ti.profiler.print_kernel_profiler_info()  # The default mode: 'count'

The profiler outputs the following:

=========================================================================
X64 Profiler(trace)
=========================================================================
[ % | time ] Kernel name
[ 0.00% | 0.000 ms] jit_evaluator_0_kernel_0_serial
[ 60.11% | 2.668 ms] fill_c4_0_kernel_1_range_for
[ 6.06% | 0.269 ms] fill_c4_0_kernel_1_range_for
[ 5.73% | 0.254 ms] fill_c4_0_kernel_1_range_for
[ 5.68% | 0.252 ms] fill_c4_0_kernel_1_range_for
[ 5.61% | 0.249 ms] fill_c4_0_kernel_1_range_for
[ 5.63% | 0.250 ms] fill_c4_0_kernel_1_range_for
[ 5.61% | 0.249 ms] fill_c4_0_kernel_1_range_for
[ 5.59% | 0.248 ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 0.004 s number of records: 9
=========================================================================
=========================================================================
X64 Profiler(count)
=========================================================================
[ % total count | min avg max ] Kernel name
[100.00% 0.033 s 100x | 0.244 0.329 2.970 ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 0.033 s number of records: 1
=========================================================================
note
  • jit_evaluator_xxx can be ignored because it is automatically generated by the system.
  • Taichi recommends running performance profiling multiple times to observe the minimum or average execution time.
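
To make the relationship between the two printing modes concrete, the following is a minimal pure-Python sketch of how per-launch 'trace' records could be grouped by kernel name into 'count'-style statistics. It is an illustration only, not Taichi's actual implementation: the summarize helper and the dictionary layout are hypothetical, and the timings are copied from the 'trace' output above.

```python
from collections import defaultdict

# Per-launch records copied from the 'trace' output above: (kernel name, time in ms).
trace_records = [
    ("jit_evaluator_0_kernel_0_serial", 0.000),
    ("fill_c4_0_kernel_1_range_for", 2.668),
    ("fill_c4_0_kernel_1_range_for", 0.269),
    ("fill_c4_0_kernel_1_range_for", 0.254),
    ("fill_c4_0_kernel_1_range_for", 0.252),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.250),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.248),
]


def summarize(records):
    """Hypothetical helper: group launches by kernel name and compute statistics."""
    grouped = defaultdict(list)
    for name, ms in records:
        grouped[name].append(ms)
    return {
        name: {
            "count": len(times),
            "total": sum(times),
            "min": min(times),
            "avg": sum(times) / len(times),
            "max": max(times),
        }
        for name, times in grouped.items()
    }


summary = summarize(trace_records)
for name, s in summary.items():
    print(f"{s['count']:>4}x | {s['min']:.3f} {s['avg']:.3f} {s['max']:.3f} ms | {name}")
```

In this data the first launch of fill is roughly ten times slower than the rest, so the minimum is a more representative figure for steady-state kernel time than the average.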

Advanced mode

For the CUDA backend, KernelProfiler has an experimental GPU profiling toolkit based on NVIDIA CUPTI, which has low and deterministic profiling overhead and can capture more than 6,000 hardware metrics.

Prerequisites for using CUPTI:

  1. Install the CUDA Toolkit.
  2. Build Taichi from source with CUDA toolkit: TAICHI_CMAKE_ARGS="-DTI_WITH_CUDA_TOOLKIT:BOOL=ON" python3 setup.py develop --user
  3. Resolve the permission issue of the NVIDIA profiling module (the following steps require administrative privileges, for example via sudo):
    • Add options nvidia NVreg_RestrictProfilingToAdminUsers=0 to /etc/modprobe.d/nvidia-kernel-common.conf
    • Then reboot. This should resolve the permission issue (you may need to run update-initramfs -u before rebooting).
    • See also ERR_NVGPUCTRPERM.