Auto-scheduling a Neural Network for x86 CPU


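Auto-tuning for a specific device and workload is critical for getting the best performance. This tutorial shows how to tune a whole neural network for an x86 CPU with the auto-scheduler.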

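To auto-tune a neural network, we partition the network into small subgraphs and tune them independently. Each subgraph is treated as one search task. A task scheduler slices the time and dynamically allocates time resources to these tasks. It predicts the impact of each task on the end-to-end execution time and prioritizes the task that can reduce the execution time the most.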
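
The prioritization idea can be illustrated with a toy sketch. This is not the strategy TVM implements: the function name `pick_next_task` and the `expected_speedup` input are hypothetical, and the "predicted impact" is reduced to a simple product of weight, current latency, and an assumed fractional speedup.

```python
def pick_next_task(latencies, weights, expected_speedup):
    """Return the index of the task expected to cut the most weighted latency.

    latencies[t]        current best latency of task t
    weights[t]          how often the subgraph of task t appears in the network
    expected_speedup[t] assumed fraction of latency that more tuning could remove
    """
    if not (len(latencies) == len(weights) == len(expected_speedup)):
        raise ValueError("all inputs must have the same length")
    # Estimated end-to-end gain of spending the next time slice on each task.
    gains = [
        weight * latency * speedup
        for latency, weight, speedup in zip(latencies, weights, expected_speedup)
    ]
    return max(range(len(gains)), key=gains.__getitem__)


# Task 1 is fast on its own but appears four times, so it wins here.
next_task = pick_next_task([0.2, 0.1, 0.05], [1, 4, 2], [0.1, 0.1, 0.1])
```

A real scheduler would also have to estimate the speedup term from the tuning history, which this sketch takes as given.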

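For each subgraph, we use the compute declaration in tvm/python/topi to get the computational DAG in tensor expression form. We then use the auto-scheduler to construct a search space for this DAG and search for good schedules (low-level optimizations).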

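Unlike the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any schedule templates. In other words, the auto-scheduler only uses the compute declarations in tvm/python/topi and does not use the existing schedule templates.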

Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you need to wrap the body of the tutorial in an if __name__ == "__main__": block.
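
A minimal sketch of that wrapping is shown below. The tutorial does not say why the guard is needed. A likely reason, stated here as an assumption, is that tuning uses worker processes, and on platforms where Python starts workers with the spawn method the main module is imported again in every worker. The function name `main` and the step names are placeholders.

```python
def main():
    """Hypothetical entry point that holds the whole tutorial body."""
    # In a real script these steps would be the code from the sections
    # below; here they are only recorded by name.
    completed = []
    for step in ("define network", "extract tasks", "tune", "compile and evaluate"):
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is
    # imported (for example by a spawned worker process).
    print(main())
```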

import numpy as np

import tvm

from tvm import relay, auto_scheduler

import tvm.relay.testing

from tvm.contrib import graph_runtime

Define a Network

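First, we need to define the network with the Relay frontend API. We can load some pre-defined networks from tvm.relay.testing. We can also load models from MXNet, ONNX, PyTorch, and TensorFlow.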

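For convolutional neural networks, the auto-scheduler works correctly with any layout, but the best performance is typically achieved with the NHWC layout. More optimizations have also been implemented for the NHWC layout in the auto-scheduler. It is therefore recommended to convert your models to NHWC layout before using the auto-scheduler. You can use the ConvertLayout pass in TVM to do the layout conversion.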
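
The layout conversion mentioned above might look like the following sketch. The helper name `convert_to_nhwc` is hypothetical. It assumes a TVM version in which `relay.transform.ConvertLayout` accepts a dictionary that maps an operator name to its desired data and kernel layouts, and it converts only `nn.conv2d`.

```python
def convert_to_nhwc(mod):
    """Return a Relay module whose conv2d operators use the NHWC data layout."""
    # Imported inside the function so that the helper can be defined
    # even in an environment where TVM is not importable.
    import tvm
    from tvm import relay

    # "default" lets TVM pick the kernel layout that matches NHWC data.
    desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
    seq = tvm.transform.Sequential(
        [
            relay.transform.RemoveUnusedFunctions(),
            relay.transform.ConvertLayout(desired_layouts),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```

A model imported in NCHW layout from another framework could be passed through this helper before the search tasks are extracted.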

def get_network(name, batch_size, layout="NHWC", dtype="float32"):

    """Get the symbol definition and random weight of a network"""

    # auto-scheduler prefers NHWC layout

    if layout == "NHWC":

        image_shape = (224, 224, 3)

    elif layout == "NCHW":

        image_shape = (3, 224, 224)

    else:

        raise ValueError("Invalid layout: " + layout)

    input_shape = (batch_size,) + image_shape

    output_shape = (batch_size, 1000)

    if name.startswith("resnet-"):

        n_layer = int(name.split("-")[1])

        mod, params = relay.testing.resnet.get_workload(

            num_layers=n_layer,

            batch_size=batch_size,

            layout=layout,

            dtype=dtype,

            image_shape=image_shape,

        )

    elif name.startswith("resnet3d-"):

        n_layer = int(name.split("-")[1])

        mod, params = relay.testing.resnet.get_workload(

            num_layers=n_layer,

            batch_size=batch_size,

            layout=layout,

            dtype=dtype,

            image_shape=image_shape,

        )

    elif name == "mobilenet":

        mod, params = relay.testing.mobilenet.get_workload(

            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape

        )

    elif name == "squeezenet_v1.1":

        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"

        mod, params = relay.testing.squeezenet.get_workload(

            version="1.1",

            batch_size=batch_size,

            dtype=dtype,

            image_shape=image_shape,

        )

    elif name == "inception_v3":

        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)

        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)

    elif name == "mxnet":

        # an example for mxnet model

        from mxnet.gluon.model_zoo.vision import get_model

        assert layout == "NCHW"

        block = get_model("resnet50_v1", pretrained=True)

        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)

        net = mod["main"]

        net = relay.Function(

            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs

        )

        mod = tvm.IRModule.from_expr(net)

    return mod, params, input_shape, output_shape

# Define the neural network and compilation target.

# If the target machine supports avx512 instructions, replace the

# "llvm -mcpu=core-avx2" with "llvm -mcpu=skylake-avx512"

network = "resnet-50"

batch_size = 1

layout = "NHWC"

target = tvm.target.Target("llvm -mcpu=core-avx2")

dtype = "float32"

log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)

Extract Search Tasks

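Next, we extract the search tasks and their weights from the network. The weight of a task is the number of times the task's subgraph appears in the whole network. Using the weights, we can approximate the end-to-end latency of the network as sum(latency[t] * weight[t]), where latency[t] is the latency of task t and weight[t] is its weight. The task scheduler optimizes only this objective.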
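
The objective sum(latency[t] * weight[t]) can be written out as a small helper. This is a plain-Python sketch with a hypothetical function name and made-up numbers, not part of the auto_scheduler API.

```python
def estimate_end_to_end_latency(latencies, weights):
    """Approximate network latency as the weighted sum of per-task latencies."""
    if len(latencies) != len(weights):
        raise ValueError("latencies and weights must have the same length")
    return sum(latency * weight for latency, weight in zip(latencies, weights))


# Three hypothetical tasks: latencies in ms and how often each subgraph appears.
example_latencies_ms = [0.010, 0.087, 0.268]
example_weights = [1, 1, 3]
example_total_ms = estimate_end_to_end_latency(example_latencies_ms, example_weights)
```

A task that appears several times in the network contributes proportionally more to the estimate, which is why tuning it tends to pay off more.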

# Extract tasks from the network

print("Extract tasks...")

mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)

tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

for idx, task in enumerate(tasks):

    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))

    print(task.compute_dag)

Out:

Extract tasks...

========== Task 0  (workload key: ["b32ed43fb351136894c322ee49097a1a"]) ==========

placeholder = PLACEHOLDER [1, 1000]

T_softmax_maxelem(i0) max= placeholder[i0, k]

T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))

T_softmax_expsum(i0) += T_softmax_exp[i0, k]

T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])

========== Task 1  (workload key: ["6129df1a3d5f6326c8393a8d17160199"]) ==========

placeholder = PLACEHOLDER [1, 2048]

placeholder = PLACEHOLDER [1000, 2048]

compute(z, y, x) += (placeholder[z, ((k*16) + x)]*placeholder[y, ((k*16) + x)])

compute(y, x) += compute[y, x, kk]

placeholder = PLACEHOLDER [1000]

T_add(ax0, ax1) = (compute[ax0, ax1] + placeholder[ax1])

========== Task 2  (workload key: ["36ee2798ed60bae3bcd1bb89a0285fe8"]) ==========

placeholder = PLACEHOLDER [1, 7, 7, 2048]

tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]

tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))

========== Task 3  (workload key: ["dcf6fcf5f56fa614bf9aef0c82382caf"]) ==========

placeholder = PLACEHOLDER [1, 7, 7, 512]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 512, 2048]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 7, 7, 2048]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

placeholder = PLACEHOLDER [1, 1, 1, 2048]

T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*placeholder[ax0, 0, 0, ax3])

placeholder = PLACEHOLDER [1, 1, 1, 2048]

T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 4  (workload key: ["7e3f0cf5a6dd80d36dab1a3dad92674a"]) ==========

placeholder = PLACEHOLDER [1, 7, 7, 512]

PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)

placeholder = PLACEHOLDER [3, 3, 512, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 512]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 5  (workload key: ["e0a9eb3795b531085e0ebb772e7e800c"]) ==========

placeholder = PLACEHOLDER [1, 7, 7, 2048]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 2048, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 512]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 6  (workload key: ["03614e726dc588d11887eb0953a77e53"]) ==========

placeholder = PLACEHOLDER [1, 7, 7, 512]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 512, 2048]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 7, 7, 2048]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

========== Task 7  (workload key: ["7657f886f5e9d8b5f19a5fd2c5b90d8d"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 1024]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 1024, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 512]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 8  (workload key: ["7e09b626cf077cd419190fee02091dd6"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 256]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 256, 1024]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 14, 14, 1024]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

placeholder = PLACEHOLDER [1, 1, 1, 1024]

T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 9  (workload key: ["95bf49cc8cf7a351e974b2359702aac0"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 256]

PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)

placeholder = PLACEHOLDER [3, 3, 256, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 256]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 10  (workload key: ["e043f834cc7f19597227e09dc7f59503"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 1024]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 1024, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 256]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 11  (workload key: ["cd7c4a374fb2bbc0d075c8cae638ad14"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 256]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 256, 1024]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 14, 14, 1024]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

========== Task 12  (workload key: ["1dce2c5e4269b8a12dfc50cd4dd23ff1"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 512]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 512, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 256]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 13  (workload key: ["d3b36ce001dc24d693facfbdae1979b4"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 128]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 128, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 28, 28, 512]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

placeholder = PLACEHOLDER [1, 1, 1, 512]

T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 14  (workload key: ["0fb1dfcdb5b755e2dab290ed0129dcf2"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 128]

PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)

placeholder = PLACEHOLDER [3, 3, 128, 128]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 128]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 15  (workload key: ["45acfc473c772458684f36a34549d8aa"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 512]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 512, 128]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 128]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 16  (workload key: ["5e3ceb6e23ae8c351d5a1770d5fc6c7c"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 128]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 128, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 28, 28, 512]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

========== Task 17  (workload key: ["a085717fb3dcb046e5c4c2c04d3dc541"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 256]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 256, 128]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 128]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 18  (workload key: ["691feef049c8693bbe91bd5e7c9cdf34"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 64]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 64, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 56, 56, 256]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

placeholder = PLACEHOLDER [1, 1, 1, 256]

T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 19  (workload key: ["a9e632e5167afb60fbe29e7aeef1d152"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 64]

PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)

placeholder = PLACEHOLDER [3, 3, 64, 64]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 64]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 20  (workload key: ["b51e06c1131d4cded40d1b215f722a4e"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 256]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 256, 64]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 64]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 21  (workload key: ["8fcee68a4342c38248a827f1c6c69177"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 64]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 64, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 56, 56, 256]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

========== Task 22  (workload key: ["8dd7d81db440763f622f03fdc99e6d46"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 64]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 64, 64]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 64]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 23  (workload key: ["ba2026d923536b75e9b4faed89287d5f"]) ==========

placeholder = PLACEHOLDER [1, 112, 112, 64]

pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), placeholder[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)

tensor(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + dh), ((ax2*2) + dw), ax3]

placeholder = PLACEHOLDER [1, 1, 1, 64]

T_add(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 24  (workload key: ["a0eb8d6048282a4a0986cc2ccf14eaa2"]) ==========

placeholder = PLACEHOLDER [1, 224, 224, 3]

PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), placeholder[i0, (i1 - 3), (i2 - 3), i3], 0f)

placeholder = PLACEHOLDER [7, 7, 3, 64]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

placeholder = PLACEHOLDER [1, 1, 1, 64]

T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])

T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 25  (workload key: ["45b4de07687dee43ee1cbde9f516b2bf"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 64]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 64, 256]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

========== Task 26  (workload key: ["b2010aa63c95dedf1f58f3fe8bc78634"]) ==========

placeholder = PLACEHOLDER [1, 56, 56, 256]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 256, 512]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

========== Task 27  (workload key: ["4d7e646d99bfa3cea8245bd7100369cb"]) ==========

placeholder = PLACEHOLDER [1, 28, 28, 512]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 512, 1024]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

========== Task 28  (workload key: ["537c8642716948c33a6eaaabc86b159d"]) ==========

placeholder = PLACEHOLDER [1, 14, 14, 1024]

PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]

placeholder = PLACEHOLDER [1, 1, 1024, 2048]

Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

Begin Tuning

Now, we set some options for tuning and launch the search tasks.

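num_measure_trials is the number of measurement trials we can use during tuning. You can set it to a small number (e.g., 200) for a fast demonstrative run. In practice, we recommend setting it to around 800 * len(tasks), which is typically enough for the search to converge. For example, there are 29 tasks in resnet-50, so 800 * 29 = 23200 and a value of about 20000 is reasonable. You can adjust this parameter according to your time budget.
In addition, we use RecordToFile to dump the measurement records into a log file. The measurement records can be used to query the history best, resume the search, and do more analyses later.
See auto_scheduler.TuningOptions and auto_scheduler.LocalRunner for more parameters.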
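
The rule of thumb for num_measure_trials is simple arithmetic. The helper below is a hypothetical convenience, not a TVM function, and just makes the calculation explicit.

```python
def suggest_num_measure_trials(num_tasks, trials_per_task=800):
    """Return a trial budget of roughly `trials_per_task` trials per search task."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return trials_per_task * num_tasks


# resnet-50 yields 29 tasks in this tutorial.
resnet50_budget = suggest_num_measure_trials(29)
```

In the tuning script the argument would be len(tasks), and the result can be rounded to fit the available time budget.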

def run_tuning():

    print("Begin tuning...")

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)

    tune_option = auto_scheduler.TuningOptions(

        num_measure_trials=200,  # change this to 20000 to achieve the best performance

        runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),

        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],

    )

    tuner.tune(tune_option)

# We do not run the tuning in our webpage server since it takes too long.

# Uncomment the following line to run it by yourself.

# run_tuning()

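Note

Explaining the printed information during tuning

During tuning, a lot of information is printed on the console. It is mainly intended for debugging. The most important information is the output of the task scheduler. The following table is a sample output.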

----------------------------------------------------------------------

------------------------------  [ Task Scheduler ]

----------------------------------------------------------------------

|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |

-------------------------------------------------

|    0 |        0.010 |           0.40 |     64 |

|    1 |        0.087 |          47.19 |     64 |

|    2 |        0.008 |          -0.00 |     64 |

|    3 |        0.177 |         582.07 |     64 |

|    4 |        0.268 |         862.37 |    256 |

|    5 |        0.166 |         621.13 |    128 |

|    6 |        0.170 |         605.10 |    128 |

|    7 |        0.128 |         403.20 |     64 |

|    8 |        0.189 |         545.71 |     64 |

|    9 |        0.231 |        1001.01 |    448 |

|   10 |        0.155 |         664.80 |    256 |

|   11 |        0.155 |         662.86 |    256 |

|   12 |        0.119 |         434.08 |     64 |

|   13 |        0.199 |         522.13 |     64 |

|   14 |        0.235 |         986.56 |    320 |

|   15 |        0.149 |         689.13 |    128 |

|   16 |        0.155 |         664.80 |    192 |

|   17 |        0.151 |         340.64 |     64 |

|   18 |        0.176 |         597.55 |    128 |

|   19 |        0.220 |        1054.37 |    192 |

|   20 |        0.150 |         686.01 |    128 |

|   21 |        0.159 |         650.88 |    128 |

|   22 |        0.073 |         358.19 |     64 |

|   23 |        0.031 |          70.63 |     64 |

|   24 |        0.251 |         947.73 |    128 |

|   25 |        0.157 |         652.47 |    128 |

|   26 |        0.215 |         954.84 |    128 |

|   27 |        0.237 |         868.92 |    128 |

|   28 |        0.266 |         774.06 |    128 |

-------------------------------------------------

Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15

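The table above lists the latency and (estimated) speed of all tasks. It also lists the allocation of measurement trials across the tasks. The last line prints the total weighted latency of these tasks, which is a rough estimate of the end-to-end execution time of the network. The last line also prints the total number of measurement trials, the total time spent on auto-tuning, and the ID of the next task to tune.

There will also be some "dmlc::Error" errors, because the auto-scheduler tries some invalid schedules. You can safely ignore them as long as the tuning continues, because these errors are isolated from the main process.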
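
When tuning output is redirected to a file, the summary line of the task scheduler can be pulled out with a regular expression. The sketch below assumes the line has exactly the shape of the sample output above. The function name and the returned keys are hypothetical, and other TVM versions may print a different format.

```python
import re

# Matches lines such as:
# "Estimated total latency: 10.016 ms  Trials: 3992  Used time : 1131 s  Next ID: 15"
SUMMARY_PATTERN = re.compile(
    r"Estimated total latency:\s*(?P<latency_ms>[\d.]+)\s*ms\s+"
    r"Trials:\s*(?P<trials>\d+)\s+"
    r"Used time\s*:\s*(?P<used_time_s>\d+)\s*s\s+"
    r"Next ID:\s*(?P<next_id>-?\d+)"
)


def parse_scheduler_summary(line):
    """Return the numbers from a task scheduler summary line, or None if it does not match."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        return None
    return {
        "latency_ms": float(match.group("latency_ms")),
        "trials": int(match.group("trials")),
        "used_time_s": int(match.group("used_time_s")),
        "next_id": int(match.group("next_id")),
    }


sample = "Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15"
summary = parse_scheduler_summary(sample)
```

Tracking latency_ms against trials over time gives a quick view of whether the search is still improving.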

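Note

Terminating the tuning early

You can terminate the tuning early by forcibly killing the process. As long as the log file contains at least one valid schedule for each task, you should be able to do the compilation (see the section below).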
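
Before compiling from a partial log, it can be useful to check which tasks already have a valid record. The sketch below relies on an assumption about the log format that the tutorial does not spell out: each line is a JSON object whose "i" field starts with [workload_key, ...] and whose "r" field holds the error code at index 1, with 0 meaning success. The function names are hypothetical. Verify the format against your own log file before relying on this.

```python
import json


def tasks_with_valid_records(log_lines):
    """Collect workload keys that have at least one record without an error."""
    valid = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        workload_key = record["i"][0][0]  # assumed position of the workload key
        error_no = record["r"][1]         # assumed position of the error code
        if error_no == 0:
            valid.add(workload_key)
    return valid


def missing_tasks(log_lines, workload_keys):
    """Return the workload keys that have no valid record yet, in input order."""
    valid = tasks_with_valid_records(log_lines)
    return [key for key in workload_keys if key not in valid]


# Two synthetic records: the first succeeded, the second reports an error.
example_log = [
    '{"i": [["k1", "llvm"], []], "r": [[0.001], 0, 0.5, 0], "v": "v0.5"}',
    '{"i": [["k2", "llvm"], []], "r": [[1e10], 4, 0.5, 0], "v": "v0.5"}',
]
example_missing = missing_tasks(example_log, ["k1", "k2"])
```

With a real log, the lines would come from open(log_file) and the keys from task.workload_key for each extracted task.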

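Compile and Evaluate

After auto-tuning, we can compile the network with the best schedules found. All measurement records are dumped into the log file during auto-tuning, so we can read the log file and load the best schedules.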

# Compile with the history best

print("Compile...")

with auto_scheduler.ApplyHistoryBest(log_file):

    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):

        lib = relay.build(mod, target=target, params=params)

# Create graph runtime

ctx = tvm.context(str(target), 0)

module = graph_runtime.GraphModule(lib["default"](ctx))

data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))

module.set_input("data", data_tvm)

# Evaluate

print("Evaluate inference time cost...")

ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)

prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond

print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))

Out:

Compile...

Evaluate inference time cost...

Mean inference time (std dev): 30.72 ms (0.09 ms)

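Other Tips

During tuning, the auto-scheduler needs to compile many programs and extract features from them. This part is CPU-intensive, so a high-performance CPU with many cores is recommended for faster search.
You can use python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json to distill a large log file and keep only the most useful records.
You can resume a search from a previous log file. You just need to add a new argument load_log_file when creating the task scheduler in the function run_tuning, that is, tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
If you have multiple target CPUs, you can use all of them to parallelize the measurements. See the TVM documentation on the RPC Tracker and RPC Server to learn how to set them up. To use the RPC Tracker in the auto-scheduler, replace the runner in TuningOptions with auto_scheduler.RPCRunner.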
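
The last two tips can be combined in one helper, sketched below. The function name, its parameters, and the default tracker address are hypothetical. The sketch assumes that auto_scheduler.RPCRunner takes the device key, tracker host, and tracker port as its first arguments and accepts the same repeat and enable_cpu_cache_flush options as LocalRunner. Check the API reference of your TVM version.

```python
def build_tuner_and_options(
    tasks,
    task_weights,
    log_file,
    resume=False,
    device_key=None,
    tracker_host="127.0.0.1",
    tracker_port=9190,
    num_measure_trials=200,
):
    """Create a TaskScheduler and TuningOptions, optionally resuming and using RPC."""
    # Imported inside the function so that the helper can be defined
    # even in an environment where TVM is not importable.
    from tvm import auto_scheduler

    if resume:
        # Continue the search from the records already stored in log_file.
        tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
    else:
        tuner = auto_scheduler.TaskScheduler(tasks, task_weights)

    if device_key is None:
        # Measure on the local machine, as in run_tuning above.
        runner = auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True)
    else:
        # Measure on devices registered with an RPC tracker under device_key.
        runner = auto_scheduler.RPCRunner(
            device_key,
            host=tracker_host,
            port=tracker_port,
            repeat=10,
            enable_cpu_cache_flush=True,
        )

    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=num_measure_trials,
        runner=runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    return tuner, tune_option
```

Calling tuner.tune(tune_option) on the returned pair then starts or resumes the search in the same way as run_tuning.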