x86 CPU自动调度神经网络

对特定设备和工作负载进行自动调试对于获得最佳性能至关重要。这是有关如何使用自动调度器为x86 CPU调试整个神经网络的文档。

为了自动调试神经网络,将网络划分为小的子图,并对其进行独立调试。每个子图被视为一个搜索任务。任务调度程序可以对时间进行分片,并为这些任务动态分配时间资源。任务调度程序可以预测每个任务对端到端执行时间的影响,并优先调度可以最大程度地减少执行时间的任务。

对于每个子图,使用compute声明tvm/python/topi获取张量表达式形式的计算DAG。然后,使用自动调度器来构造此DAG的搜索空间,并搜索良好的调度(低级优化)。

与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何调度模板。换句话说,自动调度程序仅在tvm/python/topi中使用计算声明,而不使用现有的调度模板。

注意,本文无法在Windows或最新版本的macOS上运行。要使其运行,需要将本文的内容包装在一个块中。if __name__ == "__main__":

import numpy as np
 
import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_runtime

定义网络

首先,需要使用中继前端API定义网络。可以加载一些预定义的网络tvm.relay.testing。还可以从MXNet,ONNX,PyTorch和TensorFlow加载模型。

对于卷积神经网络,尽管自动调度程序可以在任何布局下正常工作,但使用NHWC布局通常可以实现最佳性能。还使用自动调度程序对NHWC布局实施了更多优化。因此,建议将模型转换为NHWC布局以使用自动调度程序。可以在TVM中使用ConvertLayout pass进行布局转换。

def get_network(name, batch_size, layout="NHWC", dtype="float32"):
    """Get the symbol definition and random weight of a network"""
 
    # auto-scheduler prefers NHWC layout
    if layout == "NHWC":
        image_shape = (224, 224, 3)
    elif layout == "NCHW":
        image_shape = (3, 224, 224)
    else:
        raise ValueError("Invalid layout: " + layout)
 
    input_shape = (batch_size,) + image_shape
    output_shape = (batch_size, 1000)
 
    if name.startswith("resnet-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name.startswith("resnet3d-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(
            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
        )
    elif name == "squeezenet_v1.1":
        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
        mod, params = relay.testing.squeezenet.get_workload(
            version="1.1",
            batch_size=batch_size,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model
 
        assert layout == "NCHW"
 
        block = get_model("resnet50_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
 
    return mod, params, input_shape, output_shape
 
 
# Define the neural network and compilation target.
# If the target machine supports avx512 instructions, replace the
# "llvm -mcpu=core-avx2" with "llvm -mcpu=skylake-avx512"
network = "resnet-50"
batch_size = 1
layout = "NHWC"
target = tvm.target.Target("llvm -mcpu=core-avx2")
dtype = "float32"
log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)

提取搜索任务

接下来,从网络中提取搜索任务及其权重。任务的权重是整个网络中任务子图的出现次数。通过使用权重,可以将网络的端到端延迟近似为sum(latency[t] * weight[t]),其中latency[t]是任务的延迟,weight[t]是任务的权重。任务调度程序只会优化此目标。

# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
 
for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)

出:

Extract tasks...
========== Task 0  (workload key: ["b32ed43fb351136894c322ee49097a1a"]) ==========
placeholder = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])
 
========== Task 1  (workload key: ["6129df1a3d5f6326c8393a8d17160199"]) ==========
placeholder = PLACEHOLDER [1, 2048]
placeholder = PLACEHOLDER [1000, 2048]
compute(z, y, x) += (placeholder[z, ((k*16) + x)]*placeholder[y, ((k*16) + x)])
compute(y, x) += compute[y, x, kk]
placeholder = PLACEHOLDER [1000]
T_add(ax0, ax1) = (compute[ax0, ax1] + placeholder[ax1])
 
========== Task 2  (workload key: ["36ee2798ed60bae3bcd1bb89a0285fe8"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 2048]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
 
========== Task 3  (workload key: ["dcf6fcf5f56fa614bf9aef0c82382caf"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 2048]
T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*placeholder[ax0, 0, 0, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 2048]
T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 4  (workload key: ["7e3f0cf5a6dd80d36dab1a3dad92674a"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 5  (workload key: ["e0a9eb3795b531085e0ebb772e7e800c"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 2048]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 2048, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 6  (workload key: ["03614e726dc588d11887eb0953a77e53"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 7  (workload key: ["7657f886f5e9d8b5f19a5fd2c5b90d8d"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 8  (workload key: ["7e09b626cf077cd419190fee02091dd6"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 9  (workload key: ["95bf49cc8cf7a351e974b2359702aac0"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 10  (workload key: ["e043f834cc7f19597227e09dc7f59503"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 11  (workload key: ["cd7c4a374fb2bbc0d075c8cae638ad14"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 12  (workload key: ["1dce2c5e4269b8a12dfc50cd4dd23ff1"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 13  (workload key: ["d3b36ce001dc24d693facfbdae1979b4"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 14  (workload key: ["0fb1dfcdb5b755e2dab290ed0129dcf2"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 15  (workload key: ["45acfc473c772458684f36a34549d8aa"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 16  (workload key: ["5e3ceb6e23ae8c351d5a1770d5fc6c7c"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 17  (workload key: ["a085717fb3dcb046e5c4c2c04d3dc541"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 18  (workload key: ["691feef049c8693bbe91bd5e7c9cdf34"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 19  (workload key: ["a9e632e5167afb60fbe29e7aeef1d152"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 64, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 20  (workload key: ["b51e06c1131d4cded40d1b215f722a4e"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 21  (workload key: ["8fcee68a4342c38248a827f1c6c69177"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 22  (workload key: ["8dd7d81db440763f622f03fdc99e6d46"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 23  (workload key: ["ba2026d923536b75e9b4faed89287d5f"]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 64]
pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), placeholder[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)
tensor(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + dh), ((ax2*2) + dw), ax3]
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 24  (workload key: ["a0eb8d6048282a4a0986cc2ccf14eaa2"]) ==========
placeholder = PLACEHOLDER [1, 224, 224, 3]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), placeholder[i0, (i1 - 3), (i2 - 3), i3], 0f)
placeholder = PLACEHOLDER [7, 7, 3, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 25  (workload key: ["45b4de07687dee43ee1cbde9f516b2bf"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 26  (workload key: ["b2010aa63c95dedf1f58f3fe8bc78634"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 27  (workload key: ["4d7e646d99bfa3cea8245bd7100369cb"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 28  (workload key: ["537c8642716948c33a6eaaabc86b159d"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

开始Tuning调试

现在,设置一些选项来优化和启动搜索任务

  • num_measure_trials是在调试期间可以使用的测量试验次数。可以将其设置为较小的数字(例如200)以进行快速演示。实际上,建议将其设置为800 * len(tasks),通常足以使搜索收敛。例如,resnet-50中有29个任务,可以将其设置为20000。可以根据时间预算调试此参数。
  • 此外,还用RecordToFile将测量记录转储到日志文件中,这些测量记录可用于最好地查询历史记录,恢复搜索以及以后进行更多分析。
  • 有关更多参数, 请参见auto_scheduler.TuningOptions, auto_scheduler.LocalRunner
def run_tuning():
    print("Begin tuning...")
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
 
    tuner.tune(tune_option)
 
 
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
 
# run_tuning()

注意

tuning调试期间说明打印的信息

在tuning调试期间,控制台上会打印很多信息。它们用于调试目的。最重要的信息是任务调度程序的输出。下表是示例输出。

----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.010 |           0.40 |     64 |
|    1 |        0.087 |          47.19 |     64 |
|    2 |        0.008 |          -0.00 |     64 |
|    3 |        0.177 |         582.07 |     64 |
|    4 |        0.268 |         862.37 |    256 |
|    5 |        0.166 |         621.13 |    128 |
|    6 |        0.170 |         605.10 |    128 |
|    7 |        0.128 |         403.20 |     64 |
|    8 |        0.189 |         545.71 |     64 |
|    9 |        0.231 |        1001.01 |    448 |
|   10 |        0.155 |         664.80 |    256 |
|   11 |        0.155 |         662.86 |    256 |
|   12 |        0.119 |         434.08 |     64 |
|   13 |        0.199 |         522.13 |     64 |
|   14 |        0.235 |         986.56 |    320 |
|   15 |        0.149 |         689.13 |    128 |
|   16 |        0.155 |         664.80 |    192 |
|   17 |        0.151 |         340.64 |     64 |
|   18 |        0.176 |         597.55 |    128 |
|   19 |        0.220 |        1054.37 |    192 |
|   20 |        0.150 |         686.01 |    128 |
|   21 |        0.159 |         650.88 |    128 |
|   22 |        0.073 |         358.19 |     64 |
|   23 |        0.031 |          70.63 |     64 |
|   24 |        0.251 |         947.73 |    128 |
|   25 |        0.157 |         652.47 |    128 |
|   26 |        0.215 |         954.84 |    128 |
|   27 |        0.237 |         868.92 |    128 |
|   28 |        0.266 |         774.06 |    128 |
-------------------------------------------------
Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15

下表列出了所有任务的延迟和(估计)速度。它还列出了所有任务的测量试验分配。最后一行显示这些任务的总加权延迟,这可以粗略估计网络的端到端执行时间。最后一行还显示测量试验的总数,自动调试所花费的总时间以及要调试的下一个任务的ID。

也将出现一些“ dmlc :: Error”错误,因为自动调度程序将尝试某些无效的调度。如果可以继续进行调试,则可以放心地忽略它们,因为这些错误与主要过程是隔离的。

注意

提前终止调试

可以通过强制终止此过程来提前终止调试。只要为日志文件中的每个任务获得至少一个有效的调度,就应该能够进行编译(下面的部分)。

编译和评估

自动调试后,可以使用发现的最佳时间表来编译网络。在自动调试过程中,所有测量记录都将转储到日志文件中,因此可以读取日志文件并加载最佳调度。

# Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
 
# Create graph runtime
ctx = tvm.context(str(target), 0)
module = graph_runtime.GraphModule(lib["default"](ctx))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)
 
# Evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))

出:

Compile...
Evaluate inference time cost...
Mean inference time (std dev): 30.72 ms (0.09 ms)

其他技巧

  • 在调试期间,自动调度器需要编译许多程序并从中提取功能。此部分占用大量CPU,因此建议使用具有多个内核的高性能CPU以加快搜索速度。
  • 可以 用python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json来提取大型日志文件,而仅保存最有用的记录。
  • 可以从上一个日志文件继续搜索。load_log_file在function中创建任务调度程序时,只需添加一个新参数run_tuning。也就是, tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
  • 如果有多个目标CPU,则可以将它们全部用于测量以并行化测量。检查本 以了解如何使用RPC跟踪器和RPC服务器。要在自动调度使用RPC跟踪,在TuningOptions中用auto_scheduler.RPCRunner更换runner 。

为x86 CPU自动调度神经网络的更多相关文章

  1. ARM CPU自动调度神经网络

    ARM CPU自动调度神经网络 对特定设备和工作负载进行自动调度,对于获得最佳性能至关重要.通过RPC使用自动调度器为ARM CPU调度整个神经网络. 为了自动调度神经网络,将网络划分为小的子图,进行 ...

  2. NVIDIA GPU自动调度神经网络

    NVIDIA GPU自动调度神经网络 对特定设备和工作负载进行自动调整对于获得最佳性能至关重要.这是有关如何使用自动调度器为NVIDIA GPU调整整个神经网络. 为了自动调整神经网络,将网络划分为小 ...

  3. NVIDIA GPU的神经网络自动调度

    NVIDIA GPU的神经网络自动调度 针对特定设备和工作负载的自动调整对于获得最佳性能至关重要.这是一个关于如何使用自动调度器为NVIDIA GPU调整整个神经网络的资料. 为了自动调整一个神经网络 ...

  4. x86 cpu卷积网络的自动调谐

    x86 cpu卷积网络的自动调谐 这是一个关于如何为x86cpu调整卷积神经网络的文档. 本文不会在Windows或最新版本的macOS上运行.要让它运行,需要将主体包装在 if __name__ = ...

  5. CPU的自动调度矩阵乘法

    CPU的自动调度矩阵乘法 这是一个有关如何对CPU使用自动调度程序的文档. 与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何模板.用户只需要编写计算声明,而无需任何调度 ...

  6. TVM自动调度器

    TVM自动调度器 随着模型大小,算子多样性和硬件异构性的不断增长,优化深度神经网络的执行速度非常困难.从计算的角度来看,深度神经网络只是张量计算的一层又一层.这些张量计算(例如matmul和conv2 ...

  7. ARM CPU与Intel x86 CPU性能比较

    Qualcomm ARM CPU与Intel x86 CPU性能比较 随着移动互联网时代的到来,Qualcomm(高通).Texas Instruments(德州仪器)等基于ARM架构的CPU受到越来 ...

  8. PANIC: Missing emulator engine program for 'x86' CPU.

    获取可用的Android模拟器 1:emulator -list-avds 获取可用的模拟器的名称(只用名称) 2:  android list -avd    获取可用的模拟器(信息详细) 获取iO ...

  9. 10 -- 深入使用Spring -- 5... 实现任务的自动调度

    10.5 实现任务的自动调度 10.5.1 使用Quartz 10.5.2 在Spring中使用Quartz

随机推荐

  1. Win64 驱动内核编程-16.WFP网络监控驱动(防火墙)

    WFP驱动监控网络 WFP 是微软推出来替代 TDI HOOK.NDIS HOOK 等拦截网络通信的方案,WFP 的框架非常庞大,在 RING3 和 RING0 各有一套类似的函数,令人兴奋的是,即使 ...

  2. [CTF]跳舞的小人

    [CTF]跳舞的小人 来自夏洛克福尔摩斯在<归来记>中侦探案件使用的一种加密方式. 对应的明文是 AT ELRIGES (住在埃尔里奇) COME ELSIE (来吧 埃尔茜) NEVER ...

  3. JDBC往数据库里插入数据

    首先还是一个工具类 插入数据

  4. 电脑进入bios和u盘启动快捷键

    参考:http://www.jb51.net/os/78638.html 一:联想系列 1:联想笔记本电脑 Thinkpad idea 520  :关机状态下,在左下角用回形针捅小孔,知道出现bios ...

  5. Smss.exe加载win32k.sys过程总结

    windows操作系统初始化 windows操作系统再初始化的过程中,当内核完全初始化而且各个组件也已经准备好后会加载一个个用户进程smss.exe(会话管理器),此进程会接着调用NtSetSyste ...

  6. ZwQuerySystemInformation枚举内核模块

    在内核中通过调用此函数来枚举windows系统中已经加载的内核模块. NTSTATUS ZwQuerySystemInformation ( SYSTEM_INFORMATION_CLASS Syst ...

  7. uboot1: 启动流程和移植框架

    目录 0 环境 1 移植框架 3 执行流程 3.0 链接地址 3.1 start.S, 入口 3.2 __main 3.3 board_init_f()和init_sequence_f[] 3.4 r ...

  8. UVA OJ 623 500!

    500!  In these days you can more and more often happen to see programs which perform some useful cal ...

  9. BUAA软件工程个人项目

    写在前面 项目 内容 所属课程 2020春季计算机学院软件工程(罗杰 任健) (北航) 作业要求 [个人项目作业](<https://edu.cnblogs.com/campus/buaa/BU ...

  10. [JavaScript之BOM与DOM]

    [JavaScript之BOM与DOM] BOM(Browser Object Model)是指浏览器对象模型,它使 JavaScript 有能力与浏览器进行"对话". DOM ( ...