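Writing Tunable Templates and Using the Auto-tuner

This article introduces the auto-tuning module in TVM.

Auto-tuning has two steps. The first step is to define a search space. The second step is to run a search algorithm to explore this space. This tutorial shows how to perform both steps in TVM, using matrix multiplication to illustrate the whole workflow.

Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run there, wrap the body of the tutorial in an if __name__ == "__main__": block.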
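A minimal sketch of the wrapping described above. The function name main and its return value are placeholders invented for illustration. The guard most likely matters because the tuning machinery starts worker processes, and on platforms that spawn fresh interpreters the main module is imported again in each worker:

```python
def main():
    # The whole tutorial body (template definition, tuning, evaluation)
    # would be placed here instead of at module level.
    return "tuning finished"


if __name__ == "__main__":
    # Only the process that was started as a script runs the tutorial body;
    # a worker that merely imports this module skips it.
    print(main())
```

With this layout, importing the file has no side effects, while running it as a script executes the tutorial once.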

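Install dependencies

To use the autotvm package in TVM, some extra dependencies are needed. This step (installing xgboost) can be skipped, because this tutorial does not need XGBoost (if you use python2, change "3" to "2"):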

pip3 install --user psutil xgboost

To make TVM run faster during tuning, it is recommended to use cython as the FFI of TVM. In the root directory of TVM, execute (if you use python2, change "3" to "2"):

pip3 install --user cython

sudo make cython3

Now return to the Python code. Import the packages.

import logging
import sys

import numpy as np
import tvm
from tvm import te, testing

# the module is called `autotvm`
from tvm import autotvm

Step 1: Define the search space

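In this section, we rewrite a deterministic TVM schedule code into a tunable schedule template. You can regard the process of defining the search space as the parameterization of the existing schedule code.

To begin with, here is how blocked matrix multiplication is implemented in TVM.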

# Matmul V0: Constant tiling factor
def matmul_v0(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    yo, yi = s[C].split(y, 8)
    xo, xi = s[C].split(x, 8)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
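To make the effect of the two split calls and the reorder concrete, the following pure-Python sketch writes out the loop nest in the order (yo, xo, k, yi, xi). It is an illustration only, not the code TVM generates. The helper names matmul_naive and matmul_tiled are invented here, and the sketch assumes the tile sizes divide the matrix dimensions evenly.

```python
def matmul_naive(A, B):
    """Reference result: plain triple loop over i, j, k."""
    n, l, m = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(l)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile_y=8, tile_x=8):
    """Same computation with y and x split into (outer, inner) loops."""
    n, l, m = len(A), len(B), len(B[0])
    # Assumption for brevity: no remainder tiles.
    assert n % tile_y == 0 and m % tile_x == 0
    C = [[0.0] * m for _ in range(n)]
    for yo in range(n // tile_y):              # outer part of y
        for xo in range(m // tile_x):          # outer part of x
            for k in range(l):                 # reduction axis
                for yi in range(tile_y):       # inner part of y
                    for xi in range(tile_x):   # inner part of x
                        y = yo * tile_y + yi
                        x = xo * tile_x + xi
                        C[y][x] += A[y][k] * B[k][x]
    return C


# Small integer-valued inputs so both versions agree exactly.
A = [[float(i + k) for k in range(16)] for i in range(16)]
B = [[float(k - j) for j in range(16)] for k in range(16)]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

Changing tile_y and tile_x changes only the order in which memory is visited, not the result, which is why these values are natural candidates for tuning.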

Parametrize the schedule

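In the previous schedule code, the constant "8" is used as the tiling factor. However, it might not be the best one, because the best tiling factor depends on the real hardware environment and the input shape.

If you want the schedule code to be portable across a wider range of input shapes and target hardware, it is better to define a set of candidate values and pick the best one according to measurement results on the target hardware.

In autotvm, we can define a tunable parameter, or a "knob", for such a value.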

# Matmul V1: List candidate values
@autotvm.template("tutorial/matmul_v1")  # 1. use a decorator
def matmul_v1(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    # 2. get the config object
    cfg = autotvm.get_config()

    # 3. define search space
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

    # 4. schedule according to config
    yo, yi = s[C].split(y, cfg["tile_y"].val)
    xo, xi = s[C].split(x, cfg["tile_x"].val)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]

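Here we make four modifications to the previous schedule code and get a tunable "template". The modifications are explained one by one.

1. Use a decorator to mark this function as a simple template.

2. Get a config object: you can regard this cfg as an argument of the function, although it is obtained in a different way. With this argument, the function is no longer a deterministic schedule code. Instead, different configurations can be passed to it to get different schedules, so the function is a "template".

To make the template function more compact, two things are done in a single function: (1) define a search space and (2) schedule according to an entity in this space. To achieve this, cfg is either a ConfigSpace or a ConfigEntity object.

When it is a ConfigSpace, it collects all tunable knobs in this function and builds the search space. When it is a ConfigEntity, it ignores all space definition API calls (namely, the cfg.define_* calls such as cfg.define_knob). Instead, it stores deterministic values for all tunable knobs, and the schedule is built according to these values.

During auto-tuning, the template is first called with a ConfigSpace object to build the search space. It is then called with different ConfigEntity objects from the built space to get different schedules. Finally, the code generated by the different schedules is measured and the best one is picked.

3. Define two tunable knobs. The first one is tile_y with 5 possible values. The second one is tile_x with the same list of possible values. These two knobs are independent, so they span a search space of size 5x5=25.

4. Schedule according to the deterministic values in cfg.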
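The two roles of cfg can be imitated with a small pure-Python sketch. ToySpace and ToyEntity are hypothetical stand-ins invented for this illustration and are not the autotvm classes. For brevity, indexing the toy objects returns the knob value directly, whereas the real template reads cfg["tile_y"].val.

```python
import itertools


class ToyEntity:
    """Stand-in for ConfigEntity: holds one fixed value per knob."""

    def __init__(self, values):
        self.values = values

    def define_knob(self, name, candidates):
        pass  # space definition calls are ignored in this mode

    def __getitem__(self, name):
        return self.values[name]


class ToySpace:
    """Stand-in for ConfigSpace: records every knob the template defines."""

    def __init__(self):
        self.knobs = {}

    def define_knob(self, name, candidates):
        self.knobs[name] = list(candidates)

    def __getitem__(self, name):
        return self.knobs[name][0]  # any valid value works while collecting

    def __len__(self):
        size = 1
        for candidates in self.knobs.values():
            size *= len(candidates)
        return size

    def entities(self):
        names = list(self.knobs)
        for combo in itertools.product(*(self.knobs[n] for n in names)):
            yield ToyEntity(dict(zip(names, combo)))


def template(cfg):
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])
    # A real template would build a schedule; here we return the chosen tiles.
    return cfg["tile_y"], cfg["tile_x"]


space = ToySpace()
template(space)                                       # first call: build the space
schedules = {template(e) for e in space.entities()}   # later calls: one per entity
print(len(space), len(schedules))
```

Both counts are 25, matching the 5x5 space spanned by the two independent knobs.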

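Use better space definition API

In the previous template, all possible values of a knob are listed manually. This is the lowest-level API for defining the space. However, another set of APIs is also provided to make the space definition easier and smarter. It is recommended to use this higher-level API.

In the following example, ConfigSpace.define_split is used to define a split knob. It enumerates all possible ways to split an axis and constructs the space.

There are also ConfigSpace.define_reorder for reorder knobs and ConfigSpace.define_annotate for annotations such as unroll, vectorization, and thread binding. When the high-level API cannot meet your requirements, you can always fall back to the low-level API.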

@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    ##### define space begin #####
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    ##### define space end #####

    # schedule according to config
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]

Note

More Explanation on cfg.define_split

In this template, cfg.define_split("tile_y", y, num_outputs=2) enumerates all possible combinations that split axis y into two axes whose lengths are factors of the length of y. For example, if the length of y is 32 and we want to split it into two axes using factors of 32, then there are 6 possible values for the (length of outer axis, length of inner axis) pair, namely (32, 1), (16, 2), (8, 4), (4, 8), (2, 16) or (1, 32). These are exactly the 6 possible values of tile_y.

During scheduling, cfg["tile_y"] is a SplitEntity object. It stores the lengths of the outer and inner axes in cfg["tile_y"].size (a tuple with two elements). In this template, we apply it by using yo, yi = cfg["tile_y"].apply(s, C, y). This is equivalent to yo, yi = s[C].split(y, cfg["tile_y"].size[1]) or yo, yi = s[C].split(y, nparts=cfg["tile_y"].size[0]).

The advantage of using the cfg.apply API is that it makes multi-level splits (when num_outputs >= 3) easier.
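The enumeration described in the note can be reproduced with a few lines of plain Python. The function split_candidates is a hypothetical helper written for this illustration. It mirrors the "factors" behaviour described above, but it is not the autotvm implementation.

```python
def split_candidates(length, num_outputs=2):
    """Return every tuple of `num_outputs` positive integers whose product is `length`."""
    if num_outputs == 1:
        return [(length,)]
    results = []
    for factor in range(1, length + 1):
        if length % factor == 0:
            # Fix the outermost length, then split the remainder recursively.
            for rest in split_candidates(length // factor, num_outputs - 1):
                results.append((factor,) + rest)
    return results


pairs_32 = split_candidates(32)
print(pairs_32)
print(len(pairs_32), len(split_candidates(512)), len(split_candidates(32, 3)))
```

For a length of 32 this yields the six (outer, inner) pairs listed in the note, and for a length of 512 it yields ten. With three outputs the count for 32 grows to 21, which gives a sense of why applying multi-level splits by hand becomes tedious.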

Step 2: Search through the space

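In step 1, the search space was built by extending the old schedule code into a template. The next step is to pick a tuner and explore this space.

Auto-tuners in TVM

The job of a tuner can be described by the following pseudocode: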

ct = 0
while ct < max_number_of_trials:
    propose a batch of configs
    measure this batch of configs on real hardware and get results
    ct += batch_size
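The pseudocode can be turned into a small runnable sketch. Everything here is a toy: toy_cost is a made-up stand-in for a hardware measurement, and tune is a hypothetical helper rather than an autotvm function. The two proposal orders only imitate the idea behind a grid search tuner and a random tuner.

```python
import itertools
import random


def toy_cost(config):
    # Hypothetical stand-in for a hardware measurement (lower is better).
    tile_y, tile_x = config
    return 1.0 + abs(tile_y - 64) + abs(tile_x - 128)


def tune(space, order, max_number_of_trials, batch_size=4):
    """Propose configs in `order`, measure them in batches, keep the best."""
    best_config, best_cost = None, float("inf")
    ct = 0
    while ct < max_number_of_trials:
        # Propose a batch of configs, clipped so the budget is never exceeded.
        stop = min(ct + batch_size, max_number_of_trials, len(order))
        batch = [space[i] for i in order[ct:stop]]
        if not batch:
            break  # the whole space has been visited
        # "Measure" the batch and remember the best result so far.
        for config in batch:
            cost = toy_cost(config)
            if cost < best_cost:
                best_config, best_cost = config, cost
        ct += len(batch)
    return best_config, best_cost


factors = [2 ** p for p in range(10)]               # 1, 2, 4, ..., 512
space = list(itertools.product(factors, factors))   # 10 x 10 = 100 configs

grid_order = list(range(len(space)))                # grid-search style order
random_order = grid_order[:]
random.Random(0).shuffle(random_order)              # random-search style order

exhaustive = tune(space, grid_order, max_number_of_trials=len(space))
budgeted = tune(space, random_order, max_number_of_trials=10)
print(exhaustive, budgeted)
```

Visiting all 100 configs is guaranteed to find the toy optimum, while a budget of 10 trials only returns the best config among those sampled. That trade-off is what the choice of tuner and trial count controls.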

When proposing the next batch of configs, a tuner can take different strategies. autotvm provides four tuners with different strategies.

RandomTuner: Enumerates the space in a random order
GridSearchTuner: Enumerates the space in a grid search order
GATuner: Uses a genetic algorithm to search through the space
XGBTuner: Uses a model-based method. It trains an XGBoost model to predict the speed of lowered IR and picks the next batch according to the prediction.

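You can choose a tuner according to the size of the space, your time budget and other factors. For example, if the space is very small (fewer than 1000 configs), a grid search tuner or a random tuner is good enough. If the space is at the level of 10^9 (this is the space size of a conv2d operator on a CUDA GPU), XGBTuner can explore more efficiently and find better configs.

Begin tuning

Here we continue the matrix multiplication example. First, create a tuning task. You can also inspect the initialized search space. In this case, for a 512x512 square matrix multiplication, the space size is 10x10=100.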

N, L, M = 512, 512, 512
task = autotvm.task.create("tutorial/matmul", args=(N, L, M, "float32"), target="llvm")
print(task.config_space)

Out:

ConfigSpace (len=100, space_map=
   0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
   1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
)

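Then we need to define how to measure the generated code and pick a tuner. Since the space is small, a random tuner is just fine.

Only 10 trials are run in this tutorial for demonstration. In practice, you can run more trials according to your time budget. The tuning results are logged to a log file. This file can be used later to get the best config.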

# logging config (for printing tuning log to the screen)
logging.getLogger("autotvm").setLevel(logging.DEBUG)
logging.getLogger("autotvm").addHandler(logging.StreamHandler(sys.stdout))

# There are two steps for measuring a config: build and run.
# By default, we use all CPU cores to compile programs, then measure them sequentially.
# We measure 5 times and take the average to reduce variance.
measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

# Begin tuning with RandomTuner, log records to file `matmul.log`
# You can use alternatives like XGBTuner.
tuner = autotvm.tuner.RandomTuner(task)
tuner.tune(
    n_trial=10,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul.log")],
)

Out:

Get devices for measurement successfully!

No: 1 GFLOPS: 0.52/0.52 result: MeasureResult(costs=(0.5179643672,), error_no=0, all_cost=8.699557542800903, timestamp=1607225778.9184623) [('tile_y', [-1, 64]), ('tile_x', [-1, 1])],None,6

No: 2 GFLOPS: 2.05/2.05 result: MeasureResult(costs=(0.1307110214,), error_no=0, all_cost=2.452157735824585, timestamp=1607225781.4836178) [('tile_y', [-1, 512]), ('tile_x', [-1, 8])],None,39

No: 3 GFLOPS: 2.77/2.77 result: MeasureResult(costs=(0.0968108324,), error_no=0, all_cost=2.015434741973877, timestamp=1607225783.5040994) [('tile_y', [-1, 2]), ('tile_x', [-1, 8])],None,31

No: 4 GFLOPS: 7.71/7.71 result: MeasureResult(costs=(0.0348177938,), error_no=0, all_cost=0.9887301921844482, timestamp=1607225784.5313203) [('tile_y', [-1, 1]), ('tile_x', [-1, 32])],None,50

No: 5 GFLOPS: 13.46/13.46 result: MeasureResult(costs=(0.0199451586,), error_no=0, all_cost=0.7833263874053955, timestamp=1607225785.3334467) [('tile_y', [-1, 256]), ('tile_x', [-1, 64])],None,68

No: 6 GFLOPS: 11.91/13.46 result: MeasureResult(costs=(0.0225446656,), error_no=0, all_cost=0.7622959613800049, timestamp=1607225786.1802726) [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98

No: 7 GFLOPS: 0.92/13.46 result: MeasureResult(costs=(0.2913359364,), error_no=0, all_cost=5.074311971664429, timestamp=1607225791.3119547) [('tile_y', [-1, 128]), ('tile_x', [-1, 2])],None,17

No: 8 GFLOPS: 2.37/13.46 result: MeasureResult(costs=(0.1133100596,), error_no=0, all_cost=2.2167930603027344, timestamp=1607225793.595454) [('tile_y', [-1, 8]), ('tile_x', [-1, 4])],None,23

No: 9 GFLOPS: 11.52/13.46 result: MeasureResult(costs=(0.0233022846,), error_no=0, all_cost=0.7279143333435059, timestamp=1607225795.1428313) [('tile_y', [-1, 256]), ('tile_x', [-1, 32])],None,58

No: 10 GFLOPS: 14.67/14.67 result: MeasureResult(costs=(0.0182990712,), error_no=0, all_cost=0.7626948356628418, timestamp=1607225795.9127738) [('tile_y', [-1, 64]), ('tile_x', [-1, 128])],None,76
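The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes that the reported GFLOPS counts one multiply and one add per (i, j, k) triple, i.e. 2*N*L*M operations divided by the measured cost. It also assumes that the trailing integer on each line is a flat index into the 10x10 space with tile_y varying fastest. Both assumptions are inferred from the printed values rather than taken from the autotvm source, and gflops and flat_index are hypothetical helper names.

```python
N, L, M = 512, 512, 512


def gflops(cost_seconds, n=N, l=L, m=M):
    # Assumption: 2 floating point operations (multiply + add) per (i, j, k).
    return 2 * n * l * m / cost_seconds / 1e9


def flat_index(tile_y, tile_x, length=512):
    # Candidate inner factors in ascending order: 1, 2, 4, ..., 512.
    factors = [f for f in range(1, length + 1) if length % f == 0]
    return factors.index(tile_y) + len(factors) * factors.index(tile_x)


# Trial No. 10 from the log: cost 0.0182990712 s, tile_y=64, tile_x=128.
print(round(gflops(0.0182990712), 2))   # 14.67
print(flat_index(64, 128))              # 76
```

Under these assumptions the computed values agree with the log: trial No. 10 reports 14.67 GFLOPS and ends with index 76, and trial No. 1 (cost 0.5179643672 s, tiles 64 and 1) gives 0.52 GFLOPS and index 6.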

Finally, we apply the best record in the history from the log file and check its correctness. We can call the function matmul directly under the autotvm.apply_history_best context. When this function is called, it queries the dispatch context with its arguments and gets the best config for the same arguments.
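Conceptually, the lookup amounts to keeping the records per workload and returning the one with the lowest cost. The sketch below illustrates that idea with the ten measurements from the tuning log printed earlier. The dictionary layout and the helper best_config are assumptions made for illustration and do not reflect how autotvm stores or parses its log file.

```python
# Records keyed by workload: (template name, arguments).
# Each record is (tile_y, tile_x, mean cost in seconds), copied from the log.
history = {
    ("tutorial/matmul", (512, 512, 512, "float32")): [
        (64, 1, 0.5179643672),
        (512, 8, 0.1307110214),
        (2, 8, 0.0968108324),
        (1, 32, 0.0348177938),
        (256, 64, 0.0199451586),
        (256, 512, 0.0225446656),
        (128, 2, 0.2913359364),
        (8, 4, 0.1133100596),
        (256, 32, 0.0233022846),
        (64, 128, 0.0182990712),
    ],
}


def best_config(name, args):
    """Return the (tile_y, tile_x) with the lowest cost for this workload."""
    records = history.get((name, args))
    if not records:
        return None  # nothing was tuned for this workload
    tile_y, tile_x, _cost = min(records, key=lambda record: record[2])
    return tile_y, tile_x


print(best_config("tutorial/matmul", (512, 512, 512, "float32")))
```

For the logged workload this selects tile_y=64 and tile_x=128, the config of trial No. 10. A workload with different arguments has no record and therefore no tuned config in this toy model.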

# apply history best from log file
with autotvm.apply_history_best("matmul.log"):
    with tvm.target.Target("llvm"):
        s, arg_bufs = matmul(N, L, M, "float32")
        func = tvm.build(s, arg_bufs)

# check correctness
a_np = np.random.uniform(size=(N, L)).astype(np.float32)
b_np = np.random.uniform(size=(L, M)).astype(np.float32)
c_np = a_np.dot(b_np)

c_tvm = tvm.nd.empty(c_np.shape)
func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-2)

Original tutorial: https://tvm.apache.org/docs/tutorials/autotvm/tune_simple_template.html
