Auto-tuning a Convolutional Network for x86 CPU

This is a tutorial about how to tune a convolutional neural network for x86 CPU.

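Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in an if __name__ == "__main__": block.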
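The guard matters most likely because AutoTVM builds and measures candidates in worker processes, and on Windows and on macOS with Python 3.8 or later the default multiprocessing start method re-imports the main script in every worker. Unguarded module-level code would then run again in each worker. A minimal sketch of the wrapping pattern follows; main and the stage names are placeholders for illustration, not part of the tutorial.

```python
def main():
    """Hypothetical entry point holding what would otherwise be module-level code."""
    # Placeholder stages; in the tutorial these would be the real function calls.
    stages = ["extract tasks", "tune kernels", "tune graph", "compile", "evaluate"]
    for stage in stages:
        print(stage)
    return stages


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when a spawned
    # worker process re-imports it.
    main()
```

Function definitions and imports can stay at module level; only the statements that start the work need to move under the guard.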

import os
import numpy as np

import tvm
from tvm import relay, autotvm
from tvm.relay import testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner
import tvm.contrib.graph_runtime as runtime

Define network

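First we need to define the network in the Relay frontend API. We can load a pre-defined network from relay.testing, for example relay.testing.resnet. We can also load models from MXNet, ONNX and TensorFlow. This tutorial uses resnet-18 as the tuning example.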

def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)

    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model

        block = get_model("resnet18_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={input_name: input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
    else:
        raise ValueError("Unsupported network: " + name)

    return mod, params, input_shape, output_shape
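The resnet and vgg branches of get_network read the layer count from the part of the name after the hyphen, so "resnet-18" selects an 18-layer ResNet. The sketch below isolates that naming convention. parse_network_name is a hypothetical helper written for illustration and does not call TVM.

```python
def parse_network_name(name):
    """Split a name such as "resnet-18" into (family, num_layers).

    Hypothetical helper mirroring the name.split("-")[1] logic in get_network.
    """
    family, sep, depth = name.partition("-")
    if not sep:
        return family, None  # e.g. "mobilenet" carries no depth suffix
    return family, int(depth)


print(parse_network_name("resnet-18"))  # ('resnet', 18)
print(parse_network_name("vgg-16"))     # ('vgg', 16)
print(parse_network_name("mobilenet"))  # ('mobilenet', None)
```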

# Replace "llvm" with the correct target of your CPU.
# For example, for AWS EC2 c5 instance with Intel Xeon
# Platinum 8000 series, the target should be "llvm -mcpu=skylake-avx512".
# For AWS EC2 c4 instance with Intel Xeon E5-2666 v3, it should be
# "llvm -mcpu=core-avx2".
target = "llvm"

batch_size = 1
dtype = "float32"
model_name = "resnet-18"
log_file = "%s.log" % model_name
graph_opt_sch_file = "%s_graph_opt.log" % model_name

# Set the input name of the graph
# For ONNX models, it is typically "0".
input_name = "data"

# Set number of threads used for tuning based on the number of
# physical CPU cores on your machine.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
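One way to choose between the target strings mentioned in the comments above is to look at the CPU feature flags. The sketch below assumes a Linux machine that exposes flags in /proc/cpuinfo. The mapping from avx512f and avx2 to the two -mcpu values is an assumption for illustration, and pick_target and read_cpu_flags are hypothetical helpers, so check the result against your actual processor.

```python
def pick_target(cpu_flags):
    """Return an LLVM target string for a set of CPU feature flags.

    The flag-to-target mapping is an assumption for illustration; check the
    -mcpu value that matches your actual processor.
    """
    flags = set(cpu_flags)
    if "avx512f" in flags:
        return "llvm -mcpu=skylake-avx512"
    if "avx2" in flags:
        return "llvm -mcpu=core-avx2"
    return "llvm"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read feature flags on Linux; return an empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


print(pick_target(read_cpu_flags()))
```

On systems without /proc/cpuinfo the sketch falls back to the generic "llvm" target, which still works but gives up the wider vector instructions.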

Configure tensor tuning settings and create tasks

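To get better kernel execution performance on x86 CPU, we need to change the data layout of the convolution kernel from "NCHW" to "NCHWc". To deal with this situation, we define the conv2d_NCHWc operator in topi. We will tune this operator instead of plain conv2d.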
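The "NCHWc" layout splits the channel axis into an outer part and a small inner block that becomes the innermost dimension, so the channels in one block sit next to each other in memory. The NumPy sketch below illustrates the index mapping only. nchw_to_nchwc and the block size of 8 are assumptions for illustration, not the TVM implementation.

```python
import numpy as np


def nchw_to_nchwc(x, c_block):
    """Convert an NCHW array to NCHW[c_block]c by splitting the channel axis."""
    n, c, h, w = x.shape
    assert c % c_block == 0, "channel count must be divisible by the block size"
    # (N, C, H, W) -> (N, C//c, c, H, W) -> (N, C//c, H, W, c)
    return x.reshape(n, c // c_block, c_block, h, w).transpose(0, 1, 3, 4, 2)


x = np.arange(1 * 16 * 2 * 2, dtype="float32").reshape(1, 16, 2, 2)
y = nchw_to_nchwc(x, c_block=8)
print(y.shape)  # (1, 2, 2, 2, 8)
# Element (n, outer, h, w, inner) comes from channel outer * c_block + inner.
assert y[0, 1, 0, 1, 3] == x[0, 1 * 8 + 3, 0, 1]
```

The block size is one of the things tuning decides, which is why the layout cannot simply be fixed in advance.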

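We will use local mode for tuning configuration. RPC tracker mode can be set up in the same way as in the Auto-tuning a Convolutional Network for ARM CPU tutorial.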

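To perform a precise measurement, we should repeat the measurement several times and use the average of the results. In addition, we need to flush the cache for the weight tensors between repeated measurements. This brings the measured latency of one operator closer to its actual latency during end-to-end inference.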

tuning_option = {
    "log_filename": log_file,
    "tuner": "random",
    "early_stopping": None,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(
            number=1, repeat=10, min_repeat_ms=0, enable_cpu_cache_flush=True
        ),
    ),
}
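In the runner settings above, number is how many times the kernel runs inside one measurement and repeat is how many measurements are taken. The plain Python sketch below mimics only that counting scheme. measure is a hypothetical helper, and the cache flushing that enable_cpu_cache_flush requests is not reproduced here.

```python
import time


def measure(func, number=1, repeat=10):
    """Return `repeat` timings in seconds, each averaged over `number` calls."""
    func()  # warm-up run, not timed
    results = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        results.append((time.perf_counter() - start) / number)
    return results


timings = measure(lambda: sum(range(10000)), number=1, repeat=10)
print("mean %.6f s over %d measurements" % (sum(timings) / len(timings), len(timings)))
```

With number=1 every timed run is its own measurement, which fits the cache-flush setting: each run can then start from a cold cache.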

# You can skip the implementation of this function for this tutorial.
def tune_kernels(
    tasks, measure_option, tuner="gridsearch", early_stopping=None, log_filename="tuning.log"
):
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb" or tuner == "xgb-rank":
            tuner_obj = XGBTuner(task, loss_type="rank")
        elif tuner == "ga":
            tuner_obj = GATuner(task, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(task)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(task)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # do tuning: try every configuration in the task's search space
        n_trial = len(task.config_space)
        tuner_obj.tune(
            n_trial=n_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(n_trial, prefix=prefix),
                autotvm.callback.log_to_file(log_filename),
            ],
        )
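Because tune_kernels sets n_trial to the size of the configuration space, every configuration of each task gets measured unless early stopping cuts the search short. The toy loop below illustrates that idea with a random visiting order. toy_tune, the knob names and the cost function are all made up for illustration and are not AutoTVM's implementation.

```python
import itertools
import random


def toy_tune(knobs, cost_fn, n_trial, early_stopping=None, seed=0):
    """Randomly sample configs and return (best_config, best_cost, trials_run)."""
    space = list(itertools.product(*knobs.values()))
    random.Random(seed).shuffle(space)
    best_cfg, best_cost, since_best, trials = None, float("inf"), 0, 0
    for values in space[:n_trial]:
        cfg = dict(zip(knobs, values))
        cost = cost_fn(cfg)
        trials += 1
        if cost < best_cost:
            best_cfg, best_cost, since_best = cfg, cost, 0
        else:
            since_best += 1
        # Stop once too many trials in a row have failed to improve the best.
        if early_stopping is not None and since_best >= early_stopping:
            break
    return best_cfg, best_cost, trials


knobs = {"tile_ic": [1, 2, 4, 8], "tile_oc": [1, 2, 4, 8], "unroll": [False, True]}
# Made-up cost function standing in for a real hardware measurement.
cost = lambda cfg: abs(cfg["tile_ic"] - 4) + abs(cfg["tile_oc"] - 8) + (0 if cfg["unroll"] else 1)
space_size = 4 * 4 * 2
print(toy_tune(knobs, cost, n_trial=space_size))
```

Covering the whole space guarantees the best configuration is found, at the price of the long per-task times visible in the sample output.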

# Use graph tuner to achieve graph level optimal schedules
# Set use_DP=False if it takes too long to finish.
def tune_graph(graph, dshape, records, opt_sch_file, use_DP=True):
    target_op = [
        relay.op.get("nn.conv2d"),
    ]
    Tuner = DPTuner if use_DP else PBQPTuner
    executor = Tuner(graph, {input_name: dshape}, records, target_op, target)
    executor.benchmark_layout_transform(min_exec_num=2000)
    executor.run()
    executor.write_opt_sch2record_file(opt_sch_file)
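The graph tuner is needed because the fastest schedule for each convolution on its own may use a different layout from its neighbours, and converting between layouts costs time. That is why layout transforms are benchmarked before the search runs. The sketch below shows the trade-off with a small dynamic program over a linear chain of layers. best_layout_chain and all numbers are invented for illustration, and real networks are graphs rather than chains.

```python
def best_layout_chain(kernel_cost, transform_cost):
    """Pick one layout per layer on a linear chain to minimize total cost.

    kernel_cost[i][l]   : time of layer i when run in layout l
    transform_cost[a][b]: time to convert a tensor from layout a to layout b
    Returns (total_cost, chosen_layouts).
    """
    layouts = list(kernel_cost[0])
    # best[l] = (cost, path) of the cheapest plan ending with layout l
    best = {l: (kernel_cost[0][l], [l]) for l in layouts}
    for costs in kernel_cost[1:]:
        new_best = {}
        for cur in layouts:
            prev = min(layouts, key=lambda p: best[p][0] + transform_cost[p][cur])
            total = best[prev][0] + transform_cost[prev][cur] + costs[cur]
            new_best[cur] = (total, best[prev][1] + [cur])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Made-up numbers: layer 1 alone prefers NCHW16c, but switching costs 3 each way.
kernel_cost = [
    {"NCHW8c": 10, "NCHW16c": 12},
    {"NCHW8c": 9, "NCHW16c": 8},
    {"NCHW8c": 10, "NCHW16c": 13},
]
transform_cost = {
    "NCHW8c": {"NCHW8c": 0, "NCHW16c": 3},
    "NCHW16c": {"NCHW8c": 3, "NCHW16c": 0},
}
print(best_layout_chain(kernel_cost, transform_cost))
```

Picking the fastest kernel per layer would give 10 + 8 + 10 plus two transforms of 3, a total of 34, while staying in one layout throughout costs 29.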

Finally, we launch the tuning jobs and evaluate the end-to-end performance.

def tune_and_evaluate(tuning_opt):
    # extract workloads from relay program
    print("Extract tasks...")
    mod, params, data_shape, out_shape = get_network(model_name, batch_size)
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )

    # run tuning tasks
    tune_kernels(tasks, **tuning_opt)
    tune_graph(mod["main"], data_shape, log_file, graph_opt_sch_file)

    # compile kernels with graph-level best records
    with autotvm.apply_graph_best(graph_opt_sch_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)

        # upload parameters to device
        ctx = tvm.cpu()
        data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
        module = runtime.GraphModule(lib["default"](ctx))
        module.set_input(input_name, data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )


# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

# tune_and_evaluate(tuning_option)

Sample Output

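The tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. One sample output is listed below.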

Extract tasks...
Tuning...
[Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
[Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
[Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
[Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
[Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
[Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
[Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
[Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
[Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
[Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
[Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
[Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 3.16 ms (0.03 ms)
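To get a feel for the overall tuning cost, the per-task lines in the sample output can be added up. The sketch below parses the lines exactly as printed above. The regular expression is written against this sample only and is an assumption about the line format, which may differ in other TVM versions.

```python
import re

sample = """\
[Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
[Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
[Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
[Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
[Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
[Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
[Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
[Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
[Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
[Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
[Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
[Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
"""

# Groups: task index, task count, current GFLOPS, best GFLOPS,
# trials done, trials total, elapsed seconds.
PATTERN = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\]\s+Current/Best:\s+([\d.]+)/\s*([\d.]+) GFLOPS"
    r" \| Progress: \((\d+)/(\d+)\) \| ([\d.]+) s"
)


def summarize(text):
    """Return (total trials, total seconds) over all task lines in text."""
    trials, seconds = 0, 0.0
    for m in PATTERN.finditer(text):
        trials += int(m.group(5))
        seconds += float(m.group(7))
    return trials, seconds


trials, seconds = summarize(sample)
print("%d trials, %.2f hours" % (trials, seconds / 3600))  # 6852 trials, 9.06 hours
```

In this sample run the kernel tuning stage alone took about nine hours for the twelve tasks.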

https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_x86.html
