The separation of schedule from compute logic is the core concept behind automatic code generation. It was first proposed by Jonathan Ragan-Kelley of MIT CSAIL in a paper published at SIGGRAPH in 2012; the follow-up paper at PLDI 2013 then gave a precise definition of a schedule:

  • 1. When and where should the value at each coordinate in each function be computed?
  • 2. Where should they be stored?
  • 3. How long are values cached and communicated across multiple consumers, and when are they independently recomputed by each?

The first question describes how the order in which values are computed affects performance; the second, how data placement in the memory hierarchy affects performance; and the last, how data should be exchanged among different threads in multi-threaded processing.

In fact, a schedule is simply a set of optimization choices. These choices never affect the result of the computation, but because they encode an understanding of the target architecture, they are critical to performance. A single choice, or a combination of several, is commonly referred to as a schedule.
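
To make this concrete, here is a minimal sketch (assuming an LLVM CPU build of TVM; the kernel and the split factor are arbitrary illustrations) showing that a schedule choice changes only the generated loops, never the numbers:

import numpy as np
import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
B = tvm.te.compute((n,), lambda i: A[i] * 2.0, name="B")

s0 = tvm.te.create_schedule(B.op)               # default schedule
s1 = tvm.te.create_schedule(B.op)
xo, xi = s1[B].split(B.op.axis[0], factor=32)   # one scheduling choice

f0 = tvm.build(s0, [A, B], target="llvm")
f1 = tvm.build(s1, [A, B], target="llvm")

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b0 = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
b1 = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
f0(a, b0)
f1(a, b1)
np.testing.assert_allclose(b0.numpy(), b1.numpy())  # same result either way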

Commonly used schedules include:

Schedules for the memory hierarchy

  • 1 cache_read(tensor, scope, readers)

    Stores data in an on-chip cache to reduce data-access latency.

cache_read reads a tensor into a cache at the specified storage scope. The point of this primitive is to make explicit use of the on-chip memory hierarchy of the target device. In this example (AA = s.cache_read(A, "shared", [B])), the data of A is first loaded into shared memory, and B is then computed from it. Here we need to introduce the concept of a stage: each op corresponds to one stage, so cache_read adds a new stage to the schedule.

import tvm

n = 1024
dtype = "float32"
A = tvm.te.placeholder((n, n), dtype = dtype, name="A")
k = tvm.te.reduce_axis((0,n), name = "k")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, B], simple_mode=True))
print("-------------")
AA = s.cache_read(A, "shared", [B])
print(tvm.lower(s, [A, B], simple_mode=True))

The output is as follows:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A[((i*1024) + k)])
}
}
} -------------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
allocate(A.shared: Pointer(shared float32), float32, [1048576]), storage_scope = shared {
for (ax0: int32, 0, 1024) {
for (ax1: int32, 0, 1024) {
let cse_var_1: int32 = ((ax0*1024) + ax1)
A.shared_1: Buffer(A.shared, float32, [1048576], [], scope="shared")[cse_var_1] = A[cse_var_1]
}
}
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A.shared_1[((i*1024) + k)])
}
}
}
}
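
To see the extra stage introduced above explicitly, one can list the schedule's stages before and after the cache_read call. A small sketch (the expected names are an assumption based on the lowered output above):

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
k = tvm.te.reduce_axis((0, n), name="k")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)

print([st.op.name for st in s.stages])  # expected: ['A', 'B']
AA = s.cache_read(A, "shared", [B])
print([st.op.name for st in s.stages])  # expected: ['A', 'A.shared', 'B']
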
  • 2 cache_write(tensor, scope)

    Writes results to an on-chip buffer first, and then out to off-chip memory. Of course, on-chip and off-chip are not absolute notions here; they can be understood as different levels of the memory hierarchy.

cache_write is the counterpart of cache_read: the computation result is first stored in shared memory and finally written back to global memory. In real scenarios, results are often accumulated in registers first and written back at the end.

import tvm

n = 1024
dtype = "float32"
A = tvm.te.placeholder((n, n), dtype = dtype, name="A")
k = tvm.te.reduce_axis((0,n), name = "k")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, B], simple_mode=True))
print("-------------")
AA = s.cache_write(B, "local")
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A[((i*1024) + k)])
}
}
} -------------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
allocate(B.local: Pointer(local float32), float32, [1024]), storage_scope = local {
for (i.c: int32, 0, 1024) {
B.local_1: Buffer(B.local, float32, [1024], [], scope="local")[i.c] = 0f32
for (k: int32, 0, 1024) {
B.local_1[i.c] = (B.local_1[i.c] + A[((i.c*1024) + k)])
}
}
for (i: int32, 0, 1024) {
B[i] = B.local_1[i]
}
}
}
  • 3 set_scope

    Assigns a storage location to data. Compared with cache_read and cache_write it gives more flexible control over where data lives, but the mechanism is essentially the same.

set_scope specifies the storage level at which a stage's result is kept, so the tensor can be placed at the most suitable location, for example in shared memory for data shared among threads. In fact, set_scope is a sub-operation of cache_read.

import tvm

n = 1024
dtype = "float32"
A = tvm.te.placeholder((n, n), dtype=dtype, name='A')
k = tvm.te.reduce_axis((0, n), name='k')
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name='B')
C = tvm.te.compute((n,), lambda i: B[i] + 10, name='C')
s = tvm.te.create_schedule(C.op)
print(tvm.lower(s, [A, C], simple_mode=True))
print("---------cutting line---------")
s[B].set_scope('shared')
print(tvm.lower(s, [A, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024], [])} {
allocate(B: Pointer(global float32), float32, [1024]), storage_scope = global {
for (i: int32, 0, 1024) {
B_1: Buffer(B, float32, [1024], [])[i] = 0f32
for (k: int32, 0, 1024) {
B_1[i] = (B_1[i] + A[((i*1024) + k)])
}
}
for (i_1: int32, 0, 1024) {
C[i_1] = (B_1[i_1] + 10f32)
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024], [])} {
allocate(B: Pointer(shared float32), float32, [1024]), storage_scope = shared {
for (i: int32, 0, 1024) {
B_1: Buffer(B, float32, [1024], [], scope="shared")[i] = 0f32
for (k: int32, 0, 1024) {
B_1[i] = (B_1[i] + A[((i*1024) + k)])
}
}
for (i_1: int32, 0, 1024) {
C[i_1] = (B_1[i_1] + 10f32)
}
}
}
  • 4 storage_align

    In the articles I have read, storage_align is an optimization targeting GPU shared memory, aimed at reducing access conflicts within a bank. On a GPU, shared memory is divided into banks that independent threads can access simultaneously. storage_align adjusts the data layout relative to the bank size to reduce bank conflicts. AI chips have a similar issue: only by minimizing bank conflicts can parallelism be maximized.

storage_align re-aligns the storage of a stage using factor as the unit and offset as the displacement, in order to avoid bank conflicts when accessing GPU shared memory; see reference [2] for background on bank conflicts.
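
To see the arithmetic: with the values used below (row length n = 1024 elements, factor = 100, offset = 8), each row's stride is padded to the smallest s >= 1024 satisfying s mod 100 = 8, which is s = 1108. This is exactly the stride that appears in the lowered code (A.shared_1[((ax0*1108) + ax1)]).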

import tvm

n = 1024
factor = 100
offset = 8
dtype = "float32"
A = tvm.te.placeholder((n, n), dtype=dtype, name="A")
k = tvm.te.reduce_axis((0, n), name="k")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
AA = s.cache_read(A, "shared", [B])
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------") s[AA].storage_align(AA.op.axis[0], factor, offset)
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
allocate(A.shared: Pointer(shared float32), float32, [1048576]), storage_scope = shared {
for (ax0: int32, 0, 1024) {
for (ax1: int32, 0, 1024) {
let cse_var_1: int32 = ((ax0*1024) + ax1)
A.shared_1: Buffer(A.shared, float32, [1048576], [], scope="shared")[cse_var_1] = A[cse_var_1]
}
}
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A.shared_1[((i*1024) + k)])
}
}
}
} ---------cutting line---------
compute(A.shared, body=[A[ax0, ax1]], axis=[iter_var(ax0, range(min=0, ext=1024)), iter_var(ax1, range(min=0, ext=1024))], reduce_axis=[], tag=, attrs={})
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
allocate(A.shared: Pointer(shared float32), float32, [1134592]), storage_scope = shared {
for (ax0: int32, 0, 1024) {
for (ax1: int32, 0, 1024) {
A.shared_1: Buffer(A.shared, float32, [1134592], [], scope="shared")[((ax0*1108) + ax1)] = A[((ax0*1024) + ax1)]
}
}
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A.shared_1[((i*1108) + k)])
}
}
}
}
  • 5 compute_at

    I don't know CUDA well, so I don't fully understand the code in the original article, but from its explanation, a computation with multiple nested loops (or multiple dimensions) can have a dimension collapsed through parallel computation.

compute_at attaches the current stage to the specified iteration axis of a target stage, adopting the same parallelization as the target stage and performing its computation inside it. compute_at is often used together with cache_read and cache_write.

import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
k = tvm.te.reduce_axis((0,n), name="k")
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=32)
BF = s.rfactor(B, ki)
tx = tvm.te.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
allocate(B.rf: Pointer(global float32), float32, [32]), storage_scope = global {
for (k.inner: int32, 0, 32) {
B.rf_1: Buffer(B.rf, float32, [32], [])[k.inner] = 0f32
for (k.outer: int32, 0, 32) {
B.rf_1[k.inner] = (B.rf_1[k.inner] + A[((k.outer*32) + k.inner)])
}
}
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
@tir.tvm_thread_allreduce(1u32, B.rf_1[threadIdx.x], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
B[0] = reduce_temp0_1[0]
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
allocate(B.rf: Pointer(local float32), float32, [1]), storage_scope = local;
allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
B.rf_1: Buffer(B.rf, float32, [1], [], scope="local", align=4)[0] = 0f32
for (k.outer: int32, 0, 32) {
B.rf_1[0] = (B.rf_1[0] + A[((k.outer*32) + threadIdx.x)])
}
attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
@tir.tvm_thread_allreduce(1u32, B.rf_1[0], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
B[0] = reduce_temp0_1[0]
}
}
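
A simpler CPU-only sketch of the same primitive may help: attaching B's computation inside C's loop (the tensors and the +1/*2 kernels are arbitrary illustrations):

import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
B = tvm.te.compute((n,), lambda i: A[i] + 1.0, name="B")
C = tvm.te.compute((n,), lambda i: B[i] * 2.0, name="C")
s = tvm.te.create_schedule(C.op)
s[B].compute_at(s[C], C.op.axis[0])  # compute B[i] inside C's i loop
print(tvm.lower(s, [A, C], simple_mode=True))
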
  • 6 compute_inline

    Converts an independent operation into an inline function, somewhat like pipelined computation on an FPGA. Inlining reduces the number of stages at the top level. A similar situation exists on FPGAs, where multiple instructions sharing the same iteration can be executed together.

compute_inline converts a standalone computation into an inline function that is invoked wherever its result is used, thereby removing one stage.

import tvm

n = 1024
k = 3
pad = 2
A = tvm.te.placeholder((n, n), name="A")
W = tvm.te.placeholder((k, k), name="W")
m = (n - k + 2 * pad) + 1
Apad = tvm.te.compute((n + 2 * pad, n + 2 * pad),
                      lambda yy, xx: tvm.te.if_then_else(
                          tvm.te.all(yy >= pad, yy < pad + n, xx >= pad, xx < pad + n),
                          A[yy - pad, xx - pad], tvm.te.const(0, "float32")),
                      name="Apad")
ry = tvm.te.reduce_axis((0, k), name="ry")
rx = tvm.te.reduce_axis((0, k), name="rx")
B = tvm.te.compute((m, m),
                   lambda yy, xx: tvm.te.sum(Apad[yy + ry, xx + rx] * W[ry, rx],
                                             axis=[ry, rx]),
                   name="B")
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, W, B], simple_mode=True))
print("---------cutting line---------")
s[Apad].compute_inline()
print(tvm.lower(s, [A, W, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, W_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
W: Buffer(W_2: Pointer(float32), float32, [9], []),
B: Buffer(B_2: Pointer(float32), float32, [1052676], [])}
buffer_map = {A_1: A, W_1: W, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), W_1: W_3: Buffer(W_2, float32, [3, 3], []), B_1: B_3: Buffer(B_2, float32, [1026, 1026], [])} {
allocate(Apad: Pointer(global float32), float32, [1056784]), storage_scope = global {
for (yy: int32, 0, 1028) {
for (xx: int32, 0, 1028) {
Apad_1: Buffer(Apad, float32, [1056784], [])[((yy*1028) + xx)] = @tir.if_then_else(((((2 <= yy) && (yy < 1026)) && (2 <= xx)) && (xx < 1026)), A[(((yy*1024) + xx) - 2050)], 0f32, dtype=float32)
}
}
for (yy_1: int32, 0, 1026) {
for (xx_1: int32, 0, 1026) {
B[((yy_1*1026) + xx_1)] = 0f32
for (ry: int32, 0, 3) {
for (rx: int32, 0, 3) {
let cse_var_1: int32 = ((yy_1*1026) + xx_1)
B[cse_var_1] = (B[cse_var_1] + (Apad_1[((((yy_1*1028) + (ry*1028)) + xx_1) + rx)]*W[((ry*3) + rx)]))
}
}
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, W_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
W: Buffer(W_2: Pointer(float32), float32, [9], []),
B: Buffer(B_2: Pointer(float32), float32, [1052676], [])}
buffer_map = {A_1: A, W_1: W, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), W_1: W_3: Buffer(W_2, float32, [3, 3], []), B_1: B_3: Buffer(B_2, float32, [1026, 1026], [])} {
for (yy: int32, 0, 1026) {
for (xx: int32, 0, 1026) {
B[((yy*1026) + xx)] = 0f32
for (ry: int32, 0, 3) {
for (rx: int32, 0, 3) {
let cse_var_3: int32 = (yy + ry)
let cse_var_2: int32 = (xx + rx)
let cse_var_1: int32 = ((yy*1026) + xx)
B[cse_var_1] = (B[cse_var_1] + (@tir.if_then_else(((((2 <= cse_var_3) && (cse_var_3 < 1026)) && (2 <= cse_var_2)) && (cse_var_2 < 1026)), A[(((((yy*1024) + (ry*1024)) + xx) + rx) - 2050)], 0f32, dtype=float32)*W[((ry*3) + rx)]))
}
}
}
}
}
  • 7 compute_root

    The inverse of compute_at.

compute_root is the inverse operation of compute_at. Since every stage is compute_root by default when no schedule is applied, this primitive effectively cancels a previous compute_at applied to the stage.

import tvm
n = 1024
A = tvm.te.placeholder((n,), name='A')
k = tvm.te.reduce_axis((0, n), 'k')
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name='B')
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=32)
BF = s.rfactor(B, ki)
tx = tvm.te.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[BF].compute_root()
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
allocate(B.rf: Pointer(local float32), float32, [1]), storage_scope = local;
allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
B.rf_1: Buffer(B.rf, float32, [1], [], scope="local", align=4)[0] = 0f32
for (k.outer: int32, 0, 32) {
B.rf_1[0] = (B.rf_1[0] + A[((k.outer*32) + threadIdx.x)])
}
attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
@tir.tvm_thread_allreduce(1u32, B.rf_1[0], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
B[0] = reduce_temp0_1[0]
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
allocate(B.rf: Pointer(global float32), float32, [32]), storage_scope = global {
for (k.inner: int32, 0, 32) {
B.rf_1: Buffer(B.rf, float32, [32], [])[k.inner] = 0f32
for (k.outer: int32, 0, 32) {
B.rf_1[k.inner] = (B.rf_1[k.inner] + A[((k.outer*32) + k.inner)])
}
}
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
@tir.tvm_thread_allreduce(1u32, B.rf_1[threadIdx.x], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
B[0] = reduce_temp0_1[0]
}
}
}

Common loop optimizations

  • 8 fuse

    Fuses multiple loop iters into a single iter.

fuse merges two iters, collapsing two loop levels into one. Its return value is itself an iter, so fusion can be applied repeatedly (see the sketch after the example output below).

import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
k = tvm.te.reduce_axis((0, n), name="k")
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=32)
print(tvm.lower(s,[A, B], simple_mode=True))
print("---------cutting line---------") s[B].fuse(ko, ki) print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k.outer: int32, 0, 32) {
for (k.inner: int32, 0, 32) {
B[0] = (B[0] + A[((k.outer*32) + k.inner)])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k.outer.k.inner.fused: int32, 0, 1024) {
B[0] = (B[0] + A[k.outer.k.inner.fused])
}
}
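
As noted above, the result of fuse is an ordinary iter and can be manipulated further. A minimal sketch (the 64 split factor is an arbitrary illustration):

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
B = tvm.te.compute((n, n), lambda i, j: A[i, j] + 1.0, name="B")
s = tvm.te.create_schedule(B.op)
fused = s[B].fuse(*s[B].op.axis)       # one loop of extent 1024*1024
xo, xi = s[B].split(fused, factor=64)  # the fused iter can be split again
print(tvm.lower(s, [A, B], simple_mode=True))
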
  • 9 split

    The inverse of fuse: splits one loop iteration into several.

split is the inverse of fuse: it separates an iter into an outer and an inner iteration with factor as the interval, adding a loop level and dividing the loop into smaller sub-tasks. In CUDA, for example, gridDim and blockDim can each be at most three-dimensional, so split can create new dimensions to bind onto the grid and block [3]. Besides factor, split also accepts an nparts argument, as sketched after the output below.

import tvm

n = 1024
A = tvm.te.placeholder((n,), name='A')
k = tvm.te.reduce_axis((0, n), name='k')
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name='B')
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
ko, ki = s[B].split(B.op.reduce_axis[0], factor=32)
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k: int32, 0, 1024) {
B[0] = (B[0] + A[k])
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k.outer: int32, 0, 32) {
for (k.inner: int32, 0, 32) {
B[0] = (B[0] + A[((k.outer*32) + k.inner)])
}
}
}
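
As mentioned above, split also accepts nparts, which fixes the number of outer iterations instead of the inner extent. A minimal sketch (nparts=4 is an arbitrary illustration):

import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
B = tvm.te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = tvm.te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], nparts=4)  # i.outer extent 4, i.inner extent 256
print(tvm.lower(s, [A, B], simple_mode=True))
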
  • 10 reorder

    Changes the iteration order of the loops.

reorder rearranges the nesting order of loop iters. Following the principle of locality, a good order maximizes reuse of data already in the cache and reduces repeated loads and stores. Note that which order is optimal is an interesting question in its own right; for matrix multiplication over M, N and K, for example, hoisting the reduction axis K outward is often said to make the best use of locality.

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
B = tvm.te.placeholder((n, n), name="B")
C = tvm.te.compute((n, n), lambda i, j: A[i, j] + B[i, j], name="C")
s = tvm.te.create_schedule(C.op)
xo, xi = s[C].split(s[C].op.axis[0], factor=32)
yo, yi = s[C].split(s[C].op.axis[1], factor=32)
print(tvm.lower(s, [A, B, C], simple_mode=True))
print("---------cutting line---------")
s[C].reorder(xo, yo, yi, xi)
print(tvm.lower(s, [A, B, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i.outer: int32, 0, 32) {
for (i.inner: int32, 0, 32) {
for (j.outer: int32, 0, 32) {
for (j.inner: int32, 0, 32) {
let cse_var_1: int32 = ((((i.outer*32768) + (i.inner*1024)) + (j.outer*32)) + j.inner)
C[cse_var_1] = (A[cse_var_1] + B[cse_var_1])
}
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i.outer: int32, 0, 32) {
for (j.outer: int32, 0, 32) {
for (j.inner: int32, 0, 32) {
for (i.inner: int32, 0, 32) {
let cse_var_1: int32 = ((((i.outer*32768) + (i.inner*1024)) + (j.outer*32)) + j.inner)
C[cse_var_1] = (A[cse_var_1] + B[cse_var_1])
}
}
}
}
}
  • 11 tile

    tile also splits loop iterations into smaller pieces; it is split plus reorder.

tile splits two dimensions of a stage by their respective factors and returns, in a fixed order, two outer and two inner iters, adding loop levels and forming smaller computation tasks. tile can indeed be implemented with split and reorder (see the sketch after the output below), and it is a key schedule for matrix multiplication and convolution.

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
B = tvm.te.placeholder((n, n), name="B")
K = tvm.te.reduce_axis((0, n), name="K")
C = tvm.te.compute((n, n), lambda i, j: tvm.te.sum(A[i, K] * B[K, j], axis=K), name="C")
s = tvm.te.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))
print("---------cutting line---------") xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], 32, 32) print(tvm.lower(s, [A, B, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i: int32, 0, 1024) {
for (j: int32, 0, 1024) {
C[((i*1024) + j)] = 0f32
for (K: int32, 0, 1024) {
let cse_var_2: int32 = (i*1024)
let cse_var_1: int32 = (cse_var_2 + j)
C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + K)]*B[((K*1024) + j)]))
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i.outer: int32, 0, 32) {
for (j.outer: int32, 0, 32) {
for (i.inner: int32, 0, 32) {
for (j.inner: int32, 0, 32) {
C[((((i.outer*32768) + (i.inner*1024)) + (j.outer*32)) + j.inner)] = 0f32
for (K: int32, 0, 1024) {
let cse_var_3: int32 = (j.outer*32)
let cse_var_2: int32 = ((i.outer*32768) + (i.inner*1024))
let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + j.inner)
C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + K)]*B[(((K*1024) + cse_var_3) + j.inner)]))
}
}
}
}
}
}
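
As noted above, the same loop nest can be produced without tile. A sketch using two splits plus a reorder on the same computation:

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
B = tvm.te.placeholder((n, n), name="B")
K = tvm.te.reduce_axis((0, n), name="K")
C = tvm.te.compute((n, n), lambda i, j: tvm.te.sum(A[i, K] * B[K, j], axis=K), name="C")
s = tvm.te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=32)
yo, yi = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(xo, yo, xi, yi)  # same nest as s[C].tile(C.op.axis[0], C.op.axis[1], 32, 32)
print(tvm.lower(s, [A, B, C], simple_mode=True))
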
  • 12 unroll

    Unrolls a loop to increase opportunities for concurrent execution.

unroll is a classic loop optimization: it reduces branch-prediction misses and, when the statements in the loop body have no data dependence on one another, increases the opportunity for concurrent execution and helps instruction-pipeline scheduling [4].

import tvm

n = 1024
A = tvm.te.placeholder((n, n), name="A")
B = tvm.te.placeholder((n, n), name="B")
C = tvm.te.compute((n, n), lambda i, j: A[i, j] + B[i, j], name="C")
s = tvm.te.create_schedule(C.op)
xo, xi = s[C].split(s[C].op.axis[0], factor=4)
print(tvm.lower(s, [A, B, C], simple_mode=True))
print("---------cutting line---------")
s[C].unroll(xi)
print(tvm.lower(s, [A, B, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i.outer: int32, 0, 256) {
for (i.inner: int32, 0, 4) {
for (j: int32, 0, 1024) {
let cse_var_1: int32 = (((i.outer*4096) + (i.inner*1024)) + j)
C[cse_var_1] = (A[cse_var_1] + B[cse_var_1])
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (i.outer: int32, 0, 256) {
for (j: int32, 0, 1024) {
let cse_var_1: int32 = ((i.outer*4096) + j)
C[cse_var_1] = (A[cse_var_1] + B[cse_var_1])
}
for (j_1: int32, 0, 1024) {
let cse_var_2: int32 = (((i.outer*4096) + j_1) + 1024)
C[cse_var_2] = (A[cse_var_2] + B[cse_var_2])
}
for (j_2: int32, 0, 1024) {
let cse_var_3: int32 = (((i.outer*4096) + j_2) + 2048)
C[cse_var_3] = (A[cse_var_3] + B[cse_var_3])
}
for (j_3: int32, 0, 1024) {
let cse_var_4: int32 = (((i.outer*4096) + j_3) + 3072)
C[cse_var_4] = (A[cse_var_4] + B[cse_var_4])
}
}
}

Multi-threading and parallelization

  • 13 vectorize

    Replaces a loop iteration with a ramp so a batch of data is computed by SIMD instructions, i.e. single instruction, multiple data. This is very common in AI accelerators, where every instruction operates on multiple data elements.

vectorize replaces the loop iteration along an iter with a ramp, so the data is computed in batches via SIMD instructions. The substitution happens only when the data size is a constant and the split iter is a power of two (i.e. matches the SIMD width); otherwise vectorize has no effect. It is a standard schedule for SIMD devices.

import tvm

M = 1024
N = 1024
A = tvm.te.placeholder((M, N), name='A')
B = tvm.te.placeholder((M, N), name='B')
C = tvm.te.compute(
    (M, N),
    lambda x, y: A[x, y] + B[x, y],
    name='C')
s = tvm.te.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], 32, 32)
print(tvm.lower(s, [A, B, C], simple_mode=True))
print("---------cutting line---------")
s[C].vectorize(yi)
print(tvm.lower(s, [A, B, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (x.outer: int32, 0, 32) {
for (y.outer: int32, 0, 32) {
for (x.inner: int32, 0, 32) {
for (y.inner: int32, 0, 32) {
let cse_var_1: int32 = ((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)
C[cse_var_1] = (A[cse_var_1] + B[cse_var_1])
}
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1048576], []),
C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {
for (x.outer: int32, 0, 32) {
for (y.outer: int32, 0, 32) {
for (x.inner: int32, 0, 32) {
let cse_var_1: int32 = (((x.outer*32768) + (x.inner*1024)) + (y.outer*32))
C[ramp(cse_var_1, 1, 32)] = (A[ramp(cse_var_1, 1, 32)] + B[ramp(cse_var_1, 1, 32)])
}
}
}
}
  • 14 bind

    An optimization used with CUDA: binds an iter to a thread index so the iterations run concurrently.

bind binds an iter to a block or thread index, assigning the loop's work to threads for parallel execution. This is the essential primitive for the CUDA backend.

import tvm

n = 1024
A = tvm.te.placeholder((n,), name="A")
k = tvm.te.reduce_axis((0, n), name="k")
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[i], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(s[B].op.reduce_axis[0], factor=32)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[B].bind(ko, tvm.te.thread_axis("blockIdx.x"))
s[B].bind(ki, tvm.te.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k.outer: int32, 0, 32) {
for (k.inner: int32, 0, 32) {
B[0] = (B[0] + A[0])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 32;
allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local;
attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
@tir.tvm_thread_allreduce(1u32, A[0], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], blockIdx.x, threadIdx.x, dtype=handle)
B[0] = reduce_temp0_1[0]
}
}
  • 15 parallel

    Enables parallel execution across multiple processing units.

parallel replaces the for loop of the specified iter with a parallel construct, enabling parallel execution on CPUs and other non-GPU devices. Note that in the example below the parallel loop is the reduction axis, so the concurrent updates of B[i] conflict with each other; in practice one would normally parallelize a data-parallel axis instead, as sketched after the output.

import tvm

m = 1024
n = 1024
A = tvm.te.placeholder((n, m), name="A")
l = tvm.te.reduce_axis((0, m), name="l")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, l], axis=l), name="B")
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[B].parallel(B.op.reduce_axis[0])
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (l: int32, 0, 1024) {
B[i] = (B[i] + A[((i*1024) + l)])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (l: int32, 0, 1024) "parallel" {
B[i] = (B[i] + A[((i*1024) + l)])
}
}
}
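
A minimal sketch of the safer variant mentioned above, parallelizing the data-parallel axis i so that each thread owns its own B[i]:

import tvm

m = 1024
n = 1024
A = tvm.te.placeholder((n, m), name="A")
l = tvm.te.reduce_axis((0, m), name="l")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, l], axis=l), name="B")
s = tvm.te.create_schedule(B.op)
s[B].parallel(B.op.axis[0])  # parallelize i, not the reduction axis
print(tvm.lower(s, [A, B], simple_mode=True))
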

Other schedules

  • 16 pragma

    Lets you add compiler directives in the code by hand, manually steering compiler optimization. This is how HLS enables hardware programming in C.

pragma adds compilation hints that the compiler must follow, implementing scheduling features such as unroll and vectorize. In fact, any new optimization rule can be seen as a kind of pragma, also called a directive [5].

import tvm

n = 1024
m = 1024
A = tvm.te.placeholder((n, m), name="A")
k = tvm.te.reduce_axis((0, n), name="k")
l = tvm.te.reduce_axis((0, m), name="l")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, l], axis=l), name="B")
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=4)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[B].pragma(ki, "unroll")
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (l.outer: int32, 0, 256) {
for (l.inner: int32, 0, 4) {
B[i] = (B[i] + A[(((i*1024) + (l.outer*4)) + l.inner)])
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (l.outer: int32, 0, 256) {
let cse_var_1: int32 = ((i*1024) + (l.outer*4))
{
B[i] = (B[i] + A[cse_var_1])
B[i] = (B[i] + A[(cse_var_1 + 1)])
B[i] = (B[i] + A[(cse_var_1 + 2)])
B[i] = (B[i] + A[(cse_var_1 + 3)])
}
}
}
}
  • 17 prefetch

    Overlaps computation with data loads or stores; a very common optimization on FPGAs.

prefetch exploits spatial locality to overlap the computation of one iteration with the memory access of the next, increasing the parallelism between memory access and computation and reducing total time. It is essentially software pipelining, not hardware prefetch.

import tvm

n = 1024
dtype = "float32"
k = tvm.te.reduce_axis((0, n), name="k")
A = tvm.te.placeholder((n, n), dtype=dtype, name="A")
B = tvm.te.compute((n,), lambda i: tvm.te.sum(A[i, k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[B].prefetch(A, s[B].op.reduce_axis[0], 1)
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
B[i] = (B[i] + A[((i*1024) + k)])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
B: Buffer(B_2: Pointer(float32), float32, [1024], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024], [])} {
for (i: int32, 0, 1024) {
B[i] = 0f32
for (k: int32, 0, 1024) {
for (prefetch.A.1: int32, 0, 1) {
for (prefetch.A.0: int32, 0, 1) {
@tir.prefetch(@tir.address_of(A[(((k*1024) + i) + 1024)], dtype=handle), 0, 3, 1, dtype=float32)
}
}
B[i] = (B[i] + A[((i*1024) + k)])
}
}
}
  • 18 tensorize

    Matches a tensor computation as a whole to a hardware compute core; for example, a convolution can be mapped to a matching unit on an FPGA.

tensorize treats a computation as a unit and compiles it into a tensor_intrin function. Many computations are so common that well-tuned built-in schedules already exist for them, and tensorize lets you invoke these built-in intrinsics directly; this is in fact the original meaning of intrinsic in computer science [6].

import tvm
from tvm import te

N, M, L = 1024, 512, 64
A = tvm.te.placeholder((N, L), name='A')
B = tvm.te.placeholder((M, L), name='B')
k = tvm.te.reduce_axis((0, L), name='k')
C = tvm.te.compute((N, M), lambda i, j: tvm.te.sum(A[i, k] * B[j, k], axis=k), name='C')
s = tvm.te.create_schedule(C.op)

def intrin_gemv(m, l):
    a = te.placeholder((l,), name="a")
    b = te.placeholder((m, l), name="b")
    k = te.reduce_axis((0, l), name="k")
    c = te.compute((m,), lambda i: te.sum(a[k] * b[i, k], axis=k), name="c")
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1, strides=[1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1, strides=[te.var("s1"), 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1, strides=[1])

    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        ib.emit(
            tvm.tir.call_extern(
                "int32",
                "gemv_update",
                cc.access_ptr("w"),
                aa.access_ptr("r"),
                bb.access_ptr("r"),
                m,
                l,
                bb.strides[0],
            )
        )
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})

factor = 16
x, y = C.op.axis
z, = C.op.reduce_axis
yo, yi = s[C].split(y, factor=factor)
s[C].reorder(x, yo, yi, z)
gemv = intrin_gemv(factor, L)
print(tvm.lower(s, [A, B, C], simple_mode=True))
print("---------cutting line---------")
s[C].tensorize(yi, gemv)
print(tvm.lower(s, [A, B, C], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [65536], []),
B: Buffer(B_2: Pointer(float32), float32, [32768], []),
C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
for (i: int32, 0, 1024) {
for (j.outer: int32, 0, 32) {
for (j.inner: int32, 0, 16) {
C[(((i*512) + (j.outer*16)) + j.inner)] = 0f32
for (k: int32, 0, 64) {
let cse_var_1: int32 = (((i*512) + (j.outer*16)) + j.inner)
C[cse_var_1] = (C[cse_var_1] + (A[((i*64) + k)]*B[(((j.outer*1024) + (j.inner*64)) + k)]))
}
}
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [65536], []),
B: Buffer(B_2: Pointer(float32), float32, [32768], []),
C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
for (i: int32, 0, 1024) {
for (j.outer: int32, 0, 32) {
@tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
}
}
}
  • 19 rfactor(tensor, axis, factor_axis=0)

rfactor factors the reduction of the tensor along the given axis, placing the factored axis at position factor_axis of the new intermediate tensor; the partial results are then combined in a second reduction step.

import tvm

n = 1024
k = tvm.te.reduce_axis((0, n), name="k")
A = tvm.te.placeholder((n,), name="A")
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name="B")
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(s[B].op.reduce_axis[0], 32)
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
BR = s.rfactor(B, ki)
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
B[0] = 0f32
for (k.outer: int32, 0, 32) {
for (k.inner: int32, 0, 32) {
B[0] = (B[0] + A[((k.outer*32) + k.inner)])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
allocate(B.rf: Pointer(global float32), float32, [32]), storage_scope = global {
for (k.inner: int32, 0, 32) {
B.rf_1: Buffer(B.rf, float32, [32], [])[k.inner] = 0f32
for (k.outer: int32, 0, 32) {
B.rf_1[k.inner] = (B.rf_1[k.inner] + A[((k.outer*32) + k.inner)])
}
}
B[0] = 0f32
for (k.inner.v: int32, 0, 32) {
B[0] = (B[0] + B.rf_1[k.inner.v])
}
}
}
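
rfactor is typically followed by binding the remaining reduction to threads, as in the compute_at example earlier. A minimal, hypothetical follow-up continuing from the code above:

# Hypothetical continuation of the rfactor example above.
tx = tvm.te.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)  # each thread reduces one B.rf slice
print(tvm.lower(s, [A, B], simple_mode=True))
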
  • 20 set_store_predicate

set_store_predicate sets a condition on the store, which is useful in multi-threaded scheduling to prevent conflicting writes.

import tvm

n = 1024
A = tvm.te.placeholder((n,), name='A')
k = tvm.te.reduce_axis((0, n), 'k')
B = tvm.te.compute((1,), lambda i: tvm.te.sum(A[k], axis=k), name='B')
s = tvm.te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=16)
BF = s.rfactor(B, ki)
tx = tvm.te.thread_axis("threadIdx.x")
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
print(tvm.lower(s, [A, B], simple_mode=True))
print("---------cutting line---------")
s[B].set_store_predicate(tx.var.equal(0))
print(tvm.lower(s, [A, B], simple_mode=True))

Output:

@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
allocate(B.rf: Pointer(global float32), float32, [1]), storage_scope = global {
B[0] = 0f32
for (k.inner.v: int32, 0, 16) {
B.rf_1: Buffer(B.rf, float32, [1], [], align=4)[0] = 0f32
for (k.outer: int32, 0, 64) {
B.rf_1[0] = (B.rf_1[0] + A[((k.outer*16) + k.inner.v)])
}
B[0] = (B[0] + B.rf_1[0])
}
}
} ---------cutting line---------
@main = primfn(A_1: handle, B_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1024], []),
B: Buffer(B_2: Pointer(float32), float32, [1], [])}
buffer_map = {A_1: A, B_1: B}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024], []), B_1: B_3: Buffer(B_2, float32, [1], [])} {
allocate(B.rf: Pointer(global float32), float32, [1]), storage_scope = global {
B[0] = 0f32
for (k.inner.v: int32, 0, 16) {
B.rf_1: Buffer(B.rf, float32, [1], [], align=4)[0] = 0f32
for (k.outer: int32, 0, 64) {
B.rf_1[0] = (B.rf_1[0] + A[((k.outer*16) + k.inner.v)])
}
if (threadIdx.x: int32 == 0) {
B[0] = (B[0] + B.rf_1[0])
}
}
}
}
  • 21 create_group(outputs, inputs, include_inputs=False)

create_group creates a group from all stages between inputs and outputs. A group is essentially a virtual stage: operating on it operates on all the stages it contains at once. In this example, compute_at attaches D and E of the group together to the specified axis.

import tvm

n = 1024
k = tvm.te.reduce_axis((0, n), name='k')
A = tvm.te.placeholder((n, n), name='A')
B = tvm.te.placeholder((n, n), name='B')
D = tvm.te.compute((n, n), lambda i, j: A[i, j] + B[i, j], name='D')
E = tvm.te.compute((n, n), lambda i, j: D[i, j] + B[i, j], name='E')
F = tvm.te.compute((n,), lambda i: tvm.te.sum(E[i, k], axis=k), name='F')
s = tvm.te.create_schedule(F.op)
print(tvm.lower(s, [A, B, E], simple_mode=True))
print("---------cutting line---------")
g = s.create_group(outputs=E, inputs=[A, B], include_inputs=True)
g.compute_at(s[F], F.op.reduce_axis[0])
print(tvm.lower(s, [A, B, E], simple_mode=True))


Sample code for these examples is available at https://github.com/StrongSpoon/tvm.schedule

References:

tvm schedule详细举例 (detailed examples of TVM schedules)
