How to Optimize GEMM on CPU (Part 2)

Array Packing

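Another important trick is array packing. It reorders the storage dimensions of an array so that an access pattern that is contiguous along a certain dimension becomes sequential once the array is flattened.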

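As the figure above shows, after blocking the computation we can observe the access pattern of B (after flattening): it is regular but discontinuous. We would like a transformation that makes it contiguous. Reordering a [16][16] array into a [16/4][16][4] array achieves this: when the corresponding values are fetched from the packed array, the access pattern of B is sequential.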
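The reordering can be checked with a small pure-Python sketch. It is only an illustration of the index arithmetic, not TVM code: the helper names (`offsets_unpacked`, `offsets_packed`, `is_sequential`) are hypothetical, and the toy sizes follow the [16][16] to [16/4][16][4] example in the text.

```python
# Toy sizes from the text: a 16x16 B, packed with block width bn = 4.
K, N, bn = 16, 16, 4

# Row-major B[k][n]; each element stores its own flat offset,
# so the values reveal which addresses are touched.
B = [[k * N + n for n in range(N)] for k in range(K)]

# packedB[x][k][z] = B[k][x * bn + z], i.e. shape [N/bn][K][bn].
packedB = [[[B[k][x * bn + z] for z in range(bn)] for k in range(K)]
           for x in range(N // bn)]


def offsets_unpacked(block):
    """Flat offsets into B touched while walking one column block."""
    return [k * N + block * bn + z for k in range(K) for z in range(bn)]


def offsets_packed(block):
    """Flat offsets into packedB touched for the same column block."""
    return [(block * K + k) * bn + z for k in range(K) for z in range(bn)]


def is_sequential(offsets):
    """True if every offset is exactly one past the previous offset."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


unpacked = offsets_unpacked(1)
packed = offsets_packed(1)
print(unpacked[:8])  # [4, 5, 6, 7, 20, 21, 22, 23]
print(packed[:8])    # [64, 65, 66, 67, 68, 69, 70, 71]
print(is_sequential(unpacked), is_sequential(packed))  # False True
```

In the original layout the walk jumps by a full row after every `bn` elements, while in the packed layout the same elements sit next to each other in memory.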

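# We have to re-write the algorithm slightly.

# Names such as A, B, k, bn, M, N, K, target, ctx, a, b and answer are carried over from the setup in Part 1.

# Integer division keeps the packed shape an int.

packedB = te.compute((N // bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB")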

C = te.compute(

(M, N),

lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),

name="C",

)

s = te.create_schedule(C.op)

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

(k,) = s[C].op.reduce_axis

ko, ki = s[C].split(k, factor=4)

s[C].reorder(xo, yo, ko, xi, ki, yi)

s[C].vectorize(yi)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt4: %f" % evaluator(a, b, c).mean)

Out:

Opt4: 0.105409

Here is the generated IR after array packing.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

attr = {"global_symbol": "main", "tir.noalias": True}

buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

buffer_map = {A_1: A, B_1: B, C_1: C} {

attr [packedB: Pointer(float32)] "storage_scope" = "global";

allocate(packedB, float32x32, [32768]) {

for (x: int32, 0, 32) "parallel" {

for (y: int32, 0, 1024) {

packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]

}

for (x.outer: int32, 0, 32) {

for (y.outer: int32, 0, 32) {

for (x.inner.init: int32, 0, 32) {

C_2[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)

}

for (k.outer: int32, 0, 256) {

for (x.inner: int32, 0, 32) {

for (k.inner: int32, 0, 4) {

C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + (k.inner*32)), 1, 32)]))

}

Write cache for blocks

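After blocking, the program writes the result to C block by block, so the access pattern is not sequential. We can therefore use a sequential cache array to hold each block's result and write it to C once all of that block's results are ready.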
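The idea can be sketched in plain Python before looking at the TVM schedule. This is a simplified illustration under stated assumptions, not what TVM generates: the sizes are toy values, and `matmul_write_cache` and `matmul_reference` are hypothetical helper names. The flat `cache` list plays the role of the write cache (`CC` in the schedule below).

```python
import random

M = N = K = 8  # toy sizes for illustration
bn = 4         # toy block size

random.seed(0)
A = [[random.random() for _ in range(K)] for _ in range(M)]
B = [[random.random() for _ in range(N)] for _ in range(K)]


def matmul_reference(A, B):
    """Plain triple loop, used only to check the blocked version."""
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def matmul_write_cache(A, B):
    """Blocked matmul that accumulates each block in a contiguous cache."""
    C = [[0.0] * N for _ in range(M)]
    for xo in range(0, M, bn):
        for yo in range(0, N, bn):
            # Contiguous bn*bn scratch buffer for the current block.
            cache = [0.0] * (bn * bn)
            for k in range(K):
                for xc in range(bn):
                    a = A[xo + xc][k]
                    for yc in range(bn):
                        cache[xc * bn + yc] += a * B[k][yo + yc]
            # Copy the finished block into C in a single pass.
            for xi in range(bn):
                for yi in range(bn):
                    C[xo + xi][yo + yi] = cache[xi * bn + yi]
    return C


C_ref = matmul_reference(A, B)
C_blk = matmul_write_cache(A, B)
max_err = max(abs(C_ref[i][j] - C_blk[i][j])
              for i in range(M) for j in range(N))
print(max_err < 1e-9)
```

All accumulation happens in the small sequential buffer; the strided writes to C occur only once per block, when the block is complete.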

s = te.create_schedule(C.op)

# Allocate write cache

CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo

s[CC].compute_at(s[C], yo)

# New inner axes

xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis

ko, ki = s[CC].split(k, factor=4)

s[CC].reorder(ko, xc, ki, yc)

s[CC].unroll(ki)

s[CC].vectorize(yc)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)

print("Opt5: %f" % evaluator(a, b, c).mean)

Out:

Opt5: 0.098048

Here is the generated IR after adding the write cache.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

attr = {"global_symbol": "main", "tir.noalias": True}

buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

buffer_map = {A_1: A, B_1: B, C_1: C} {

attr [packedB: Pointer(float32)] "storage_scope" = "global";

allocate(packedB, float32x32, [32768]);

attr [C.global: Pointer(float32)] "storage_scope" = "global";

allocate(C.global, float32, [1024]) {

for (x: int32, 0, 32) "parallel" {

for (y: int32, 0, 1024) {

packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]

}

for (x.outer: int32, 0, 32) {

for (y.outer: int32, 0, 32) {

for (x.c.init: int32, 0, 32) {

C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)

}

for (k.outer: int32, 0, 256) {

for (x.c: int32, 0, 32) {

C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[(((x.outer*32768) + (x.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((y.outer*32768) + (k.outer*128)), 1, 32)]))

C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 32), 1, 32)]))

C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 64), 1, 32)]))

C.global[ramp((x.c*32), 1, 32)] = ((float32x32*)C.global[ramp((x.c*32), 1, 32)] + (broadcast((float32*)A_2[((((x.outer*32768) + (x.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((y.outer*32768) + (k.outer*128)) + 96), 1, 32)]))

}

for (x.inner: int32, 0, 32) {

for (y.inner: int32, 0, 32) {

C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]

}

Parallel

Furthermore, we can also use multi-core processors to achieve thread-level parallelism.
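As a rough analogy for parallelizing the outermost block axis, the sketch below hands each block of rows to a separate worker thread. It only illustrates the work decomposition: the sizes and the `compute_row_block` helper are hypothetical, and CPython threads will not show the speedup that compiled code running on several cores does.

```python
from concurrent.futures import ThreadPoolExecutor

M = N = K = 8  # toy sizes for illustration
bn = 4         # rows per block, one task per block

A = [[float(i + k) for k in range(K)] for i in range(M)]
B = [[float(k - j) for j in range(N)] for k in range(K)]
C = [[0.0] * N for _ in range(M)]


def compute_row_block(xo):
    """Compute rows [xo, xo + bn) of C. Row ranges are disjoint, so
    workers never write to the same element."""
    for i in range(xo, xo + bn):
        for j in range(N):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))
    return xo


with ThreadPoolExecutor(max_workers=2) as pool:
    done = sorted(pool.map(compute_row_block, range(0, M, bn)))

print(done)  # [0, 4]
```

Because every task owns its own rows of C, no locking is needed. The same property is what makes the outer block axis a natural candidate for parallelization.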

s = te.create_schedule(C.op)

CC = s.cache_write(C, "global")

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

s[CC].compute_at(s[C], yo)

xc, yc = s[CC].op.axis

(k,) = s[CC].op.reduce_axis

ko, ki = s[CC].split(k, factor=4)

s[CC].reorder(ko, xc, ki, yc)

s[CC].unroll(ki)

s[CC].vectorize(yc)

# parallel

s[C].parallel(xo)

x, y, z = s[packedB].op.axis

s[packedB].vectorize(z)

s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name="mmult")

assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)

func(a, b, c)

tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=50)

opt6_time = evaluator(a, b, c).mean

print("Opt6: %f" % opt6_time)

Out:

Opt6: 0.032347

Here is the generated IR after parallelization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

Out:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()

attr = {"global_symbol": "main", "tir.noalias": True}

buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),

B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),

A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}

buffer_map = {A_1: A, B_1: B, C_1: C} {

attr [packedB: Pointer(float32)] "storage_scope" = "global";

allocate(packedB, float32x32, [32768]) {

for (x: int32, 0, 32) "parallel" {

for (y: int32, 0, 1024) {

packedB[ramp(((x*32768) + (y*32)), 1, 32)] = (float32x32*)B_2[ramp(((y*1024) + (x*32)), 1, 32)]

}

for (x.outer: int32, 0, 32) "parallel" {

attr [C.global: Pointer(float32)] "storage_scope" = "global";

allocate(C.global, float32, [1024]);

for (y.outer: int32, 0, 32) {

for (x.c.init: int32, 0, 32) {

C.global[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)

}

for (k.outer: int32, 0, 256) {

for (x.c: int32, 0, 32) {

}

for (x.inner: int32, 0, 32) {

for (y.inner: int32, 0, 32) {

C_2[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = (float32*)C.global[((x.inner*32) + y.inner)]

}

Summary

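After applying the simple optimizations above with only 18 lines of code, the generated code reaches about 60% of the performance of numpy backed by MKL. Note that the outputs on the web page reflect running times on a non-exclusive Docker container, so they are unreliable. You are strongly encouraged to run the tutorial yourself to observe the performance gain achieved by TVM.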
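To put the reported timings in perspective, they can be converted to throughput. The sketch below assumes the usual count of 2·M·N·K floating-point operations for a matrix multiplication and uses the 1024 x 1024 shapes shown in the IR together with the timings printed above. Since those timings come from a shared container, the results are only indicative.

```python
M = N = K = 1024       # matrix sizes, taken from the buffer shapes in the IR
flops = 2 * M * N * K  # one multiply and one add per inner-loop step

# Mean running times in seconds, as printed in the outputs above.
timings = {"Opt4": 0.105409, "Opt5": 0.098048, "Opt6": 0.032347}

gflops = {name: flops / seconds / 1e9 for name, seconds in timings.items()}
for name in timings:
    print("%s: %.2f GFLOP/s" % (name, gflops[name]))

# Relative gain from array packing alone to the final parallel schedule.
speedup = timings["Opt4"] / timings["Opt6"]
print("Opt4 -> Opt6 speedup: %.2fx" % speedup)
```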

https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py
