Ascend PyTorch Operator Adaptation Layer Development

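Adaptation Method

Find the NPU TBE operator whose function matches the PyTorch operator, compute the size of the output Tensor from the operator's semantics, construct the input/output/attr required by the TBE operator prototype, and pass them to ACL to execute the TBE operator.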

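Note:

The location of the TBE operator implementation source files depends on how the Toolkit development kit was installed:

If installed as the root user: /usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe/impl/
If installed as a non-root user: ~/.local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe/impl/

Developers can read these implementation source files to determine what an operator does.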

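Storage Path and Naming Format

Adaptation files for NPU TBE operators are stored in the pytorch/aten/src/ATen/native/npu directory. File names use UpperCamelCase in the format <operator name> + <KernelNpu>.cpp, for example AddKernelNpu.cpp.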
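
The naming rule above can be expressed as a small helper. This is only an illustrative sketch: adapter_file_name and adapter_file_path are hypothetical names that are not part of the Ascend code base, and the operator name is assumed to already be in UpperCamelCase.

```cpp
#include <string>

// Hypothetical helper (not part of the Ascend sources): builds the adapter
// file name for an operator whose name is already UpperCamelCase, e.g. "Add".
std::string adapter_file_name(const std::string& opName) {
    return opName + "KernelNpu.cpp";
}

// Hypothetical helper: prefixes the adapter directory named in the text.
std::string adapter_file_path(const std::string& opName) {
    return "pytorch/aten/src/ATen/native/npu/" + adapter_file_name(opName);
}
```

For the Add operator this yields AddKernelNpu.cpp, matching the example in the text.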

Adaptation Steps

Notice:

The adaptation code is written in C++.

Include the required header files.

#include "ATen/native/npu/utils/CalcuOpUtil.h"
#include "ATen/native/npu/utils/KernelNpuOutputSize.h"
#include "ATen/native/npu/utils/NpuUtils.h"

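Note:

"CalcuOpUtil.h" mainly contains functions related to the ACL interfaces.

"KernelNpuOutputSize.h" mainly contains the functions that infer operator output shapes.

"NpuUtils.h" mainly contains common utility functions.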

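Define the main adaptation functions for the Add operator.

Based on the dispatch definition of the add operator in native_functions.yaml, the adaptation should contain the following functions:

add_npu_input: constructs the NPUTensorDesc objects for the inputs
add_npu_output: constructs the NPUTensorDesc objects for the outputs
add_npu_attr: constructs the attr properties of the NPU TBE Add operator
add_out_npu: operator adaptation function (the NPU dispatch function in the yaml file; accepts a caller-provided output tensor); the other parameter may be a Tensor or a Scalar
add_npu: operator adaptation function (the NPU dispatch function in the yaml file); the other parameter may be a Tensor or a Scalar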

Implement the function add_npu_input.

Construct NPUTensorDesc objects from the inputs of the NPU adaptation function (add_npu_input).

// Implementation of add_npu_input when the inputs are "self": "Tensor" and "other": "Tensor"
SmallVector<NPUTensorDesc, N> add_npu_input(const Tensor& self, const Tensor& other) {
    bool isSelfWrapped = CalcuOpUtil::is_scalar_wrapped_to_tensor(self);
    bool isOtherWrapped = CalcuOpUtil::is_scalar_wrapped_to_tensor(other);
    auto inputs = CalcuOpUtil::create_npu_input_tensor_desc({self, other});

    // A wrapped scalar takes the dtype of the tensor operand, so that 't + 2'
    // works with any type of tensor, not just LongTensor (which is what
    // integers in Python represent).
    if (isSelfWrapped && (!isOtherWrapped)) {
        inputs[0].scalarType = other.scalar_type();
    } else if (isOtherWrapped && (!isSelfWrapped)) {
        inputs[1].scalarType = self.scalar_type();
    }

    return inputs;
}

// Implementation of add_npu_input when the inputs are "self": "Tensor" and "other": "Scalar"
SmallVector<NPUTensorDesc, N> add_npu_input(const Tensor& self, const Scalar& other) {
    // The scalar is passed as an attr, so only self becomes an input descriptor
    return CalcuOpUtil::create_npu_input_tensor_desc({self});
}
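
The dtype handling in the Tensor/Tensor overload can be studied in isolation. The sketch below mirrors only its branch logic with stand-in types; DType, OperandInfo, InputDesc and make_add_inputs are hypothetical names introduced for illustration and are not the real ATen or NPU types.

```cpp
#include <vector>

// Stand-in types for illustration only; not the real ATen/NPU classes.
enum class DType { Float, Half, Long };

struct OperandInfo {
    DType dtype;
    bool wrappedScalar;  // true when the tensor was created by wrapping a scalar
};

struct InputDesc {
    DType scalarType;
};

// Mirrors the branch logic of add_npu_input: when exactly one operand is a
// wrapped scalar, its descriptor adopts the dtype of the other operand.
std::vector<InputDesc> make_add_inputs(const OperandInfo& self, const OperandInfo& other) {
    std::vector<InputDesc> inputs = {InputDesc{self.dtype}, InputDesc{other.dtype}};
    if (self.wrappedScalar && !other.wrappedScalar) {
        inputs[0].scalarType = other.dtype;
    } else if (other.wrappedScalar && !self.wrappedScalar) {
        inputs[1].scalarType = self.dtype;
    }
    return inputs;
}
```

With a Half tensor and a wrapped Long scalar, both descriptors end up as Half. When both operands are wrapped, or neither is, the dtypes are left unchanged.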

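Implement the function add_npu_output.

Construct NPUTensorDesc objects from the output tensor passed to add_npu_output.

// Implementation of add_npu_output when the output is a "Tensor"
SmallVector<NPUTensorDesc, N> add_npu_output(const Tensor& result) {
    return CalcuOpUtil::create_npu_output_tensor_desc({result});
}

Note:

In general, operator outputs need no special handling; calling CalcuOpUtil::create_npu_output_tensor_desc directly is sufficient.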

Implement the function add_npu_attr.

Following the attr specification in the NPU TBE operator prototype, convert the parameters into the attr properties that the prototype requires.

// Implementation of add_npu_attr when the inputs are "other": "Tensor" and "alpha": "Scalar"
SmallVector<NPUAttrDesc, N> add_npu_attr(const Tensor& self, const Tensor& other, Scalar alpha) {
    float value = CalcuOpUtil::get_scalar_float_value(alpha);
    NPUAttrDesc npuAttrScalar = NPUAttrDesc("alpha", value);
    SmallVector<NPUAttrDesc, N> attrs = {npuAttrScalar};

    return attrs;
}

// Implementation of adds_npu_attr when the inputs are "other": "Scalar" and "alpha": "Scalar"
SmallVector<NPUAttrDesc, N> adds_npu_attr(const Tensor& self, const Scalar& other, const Scalar& alpha) {
    float otherValue = CalcuOpUtil::get_scalar_float_value(other);
    float alphaValue = CalcuOpUtil::get_scalar_float_value(alpha);
    // Fold other and alpha into the single "value" attr
    float value = otherValue * alphaValue;
    NPUAttrDesc npuAttrValue = NPUAttrDesc("value", value);
    SmallVector<NPUAttrDesc, N> attrs = {npuAttrValue};

    return attrs;
}
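
To see what the two attrs are expected to mean, the following CPU reference sketch may help. It assumes the PyTorch definition of add, result = self + alpha * other, and assumes that the scalar path adds the folded "value" attr to every element. add_reference and adds_reference are hypothetical names, and both functions work on equal-length flat buffers without broadcasting.

```cpp
#include <cstddef>
#include <vector>

// Reference for the Tensor/Tensor path: out[i] = self[i] + alpha * other[i].
// Assumes self and other have the same length (no broadcasting in this sketch).
std::vector<float> add_reference(const std::vector<float>& self,
                                 const std::vector<float>& other, float alpha) {
    std::vector<float> out(self.size());
    for (std::size_t i = 0; i < self.size(); ++i) {
        out[i] = self[i] + alpha * other[i];
    }
    return out;
}

// Reference for the Tensor/Scalar path: other * alpha is folded into one
// value first, as adds_npu_attr does, and then added to every element.
std::vector<float> adds_reference(const std::vector<float>& self, float other, float alpha) {
    const float value = other * alpha;
    std::vector<float> out(self.size());
    for (std::size_t i = 0; i < self.size(); ++i) {
        out[i] = self[i] + value;
    }
    return out;
}
```

Folding other * alpha on the host means the scalar path needs only one attr and no second input tensor.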

Implement the function add_out_npu.

Tensor& add_out_npu(Tensor& result, const Tensor& self, const Tensor& other, Scalar alpha) {
    if (other.dim() == 0 && !other.is_npu()) {
        // other is a 0-dim tensor that is not on the NPU: treat it as a scalar
        adds_out_npu(result, self, other.item(), alpha);
    } else if (self.dim() == 0 && !self.is_npu()) {
        // self is a 0-dim tensor that is not on the NPU: treat it as a scalar
        adds_out_npu(result, other, self.item(), alpha);
    } else {
        // Construct the input and output NPUTensorDesc objects
        auto inputs = add_npu_input(self, other);
        auto outputs = add_npu_output({result});

        // Construct the NPUAttrDesc attributes
        auto attrs = add_npu_attr(self, other, alpha);

        // Execute the NPU operator
        CalcuOpUtil::execute_npu_operate("Axpy", inputs, outputs, attrs);
    }

    return result;
}

Note:

The difference between add_out_npu and add_npu is that add_out_npu lets the caller explicitly specify the output tensor and writes the result into it.
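
The branch selection in add_out_npu can be sketched without any NPU dependency. AddOperand, AddPath and choose_add_path below are hypothetical names used only to illustrate the three cases; the real function calls adds_out_npu or executes the "Axpy" operator instead of returning an enum.

```cpp
// Stand-in for the two tensor properties the dispatch looks at.
struct AddOperand {
    int dim;     // number of dimensions; 0 for a scalar-like tensor
    bool onNpu;  // whether the tensor lives on the NPU
};

enum class AddPath { ScalarOther, ScalarSelf, TensorTensor };

// Mirrors the if / else-if / else chain of add_out_npu.
AddPath choose_add_path(const AddOperand& self, const AddOperand& other) {
    if (other.dim == 0 && !other.onNpu) {
        return AddPath::ScalarOther;   // adds_out_npu(result, self, other.item(), alpha)
    }
    if (self.dim == 0 && !self.onNpu) {
        return AddPath::ScalarSelf;    // adds_out_npu(result, other, self.item(), alpha)
    }
    return AddPath::TensorTensor;      // "Axpy" with an "alpha" attr
}
```

One point worth checking against the real adds_out_npu: in the ScalarSelf branch the operands are swapped. If the scalar path folds alpha into the scalar the way adds_npu_attr does, that branch computes other + alpha * self, which equals PyTorch's self + alpha * other only when alpha is 1.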

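Implement the function add_npu.
Define and implement the operator's shape inference function, which computes the output size from the input parameters.

Naming convention for shape inference functions:

"NPU adaptation function name" + "_" + "output" + "_" + "size", for example add_npu_output_size().

Note:

Shape inference functions are defined and implemented in pytorch/aten/src/ATen/native/npu/utils; the declarations and implementations are in KernelNpuOutputSize.h and KernelNpuOutputSize.cpp.
In KernelNpuOutputSize.h, functions are ordered by function name.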

// Shape inference function when the inputs are "self": "Tensor" and "other": "Tensor"
SmallVector<int64_t, SIZE> add_npu_output_size(const Tensor& self, const Tensor& other) {
    // The output takes the broadcast shape of the two inputs
    return broadcast_ops_npu_output_size(self, other);
}

// Shape inference function when the inputs are "self": "Tensor" and "other": "Scalar"
IntArrayRef add_npu_output_size(const Tensor& self, const Scalar& other) {
    // Adding a scalar does not change the shape
    return input_same_output_size(self);
}

Note:

broadcast_ops_npu_output_size works as follows: when the two arguments satisfy PyTorch's broadcasting rules, it computes the common size to which both are expanded.
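
The broadcasting rule itself can be written in a few lines. The sketch below is an independent illustration of PyTorch-style shape broadcasting on plain vectors, not the actual implementation of broadcast_ops_npu_output_size; broadcast_shape is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Computes the broadcast shape of two shapes. Dimensions are aligned from the
// trailing end; a missing leading dimension counts as 1. Two aligned sizes are
// compatible when they are equal or when one of them is 1.
std::vector<std::int64_t> broadcast_shape(const std::vector<std::int64_t>& a,
                                          const std::vector<std::int64_t>& b) {
    const std::size_t ndim = std::max(a.size(), b.size());
    std::vector<std::int64_t> out(ndim, 1);
    for (std::size_t i = 0; i < ndim; ++i) {
        const std::int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
        const std::int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
        if (da != db && da != 1 && db != 1) {
            throw std::invalid_argument("shapes are not broadcastable");
        }
        out[ndim - 1 - i] = (da == 1) ? db : da;
    }
    return out;
}
```

For example, shapes {3, 1} and {1, 4} broadcast to {3, 4}, and {2, 3} with {3} broadcasts to {2, 3}; {2, 3} with {4} is rejected.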

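Call the corresponding shape inference function to compute the output size.
Using the output size, call at::empty_with_format to create the output Tensor. The function accepts a format for the output Tensor; the default is NCHW.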

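Note:

The current format-selection rule is a hybrid of heavy-operator anchor diffusion and the continuity rule.

Heavy operators such as convolution and Matmul support only one specific format. The adaptation explicitly specifies the format they require, and that format then spreads to the surrounding operators.
The continuity rule applies to operators that are not sensitive to format: the operator's format is simply set to that of its first input tensor.
Convolution on the NPU supports only the NC1HWC0 format, so NC1HWC0 must be specified explicitly.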
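
The hybrid rule can be summarized as a small lookup with a fallback. This is a simplified sketch under stated assumptions: NpuFormat and choose_output_format are hypothetical names, the operator name "Conv2D" is an assumed spelling, and only the convolution entry is listed because it is the only requirement stated in the text.

```cpp
#include <map>
#include <string>

// Only the two formats mentioned in the text are modeled here.
enum class NpuFormat { NCHW, NC1HWC0 };

// Heavy operators act as anchors with a fixed required format; every other
// operator follows the format of its first input (the continuity rule).
NpuFormat choose_output_format(const std::string& opName, NpuFormat firstInputFormat) {
    // Assumed table; "Conv2D" is an illustrative operator name.
    static const std::map<std::string, NpuFormat> heavyOpFormats = {
        {"Conv2D", NpuFormat::NC1HWC0},
    };
    const auto it = heavyOpFormats.find(opName);
    return it != heavyOpFormats.end() ? it->second : firstInputFormat;
}
```

Because a format-insensitive operator such as Add copies the format of its first input, an NC1HWC0 tensor produced by a convolution keeps that format through the following Add, which is the diffusion effect described above.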

Pass the constructed output Tensor and the other parameters to add_out_npu to perform the computation.

// Implementation of add_npu when the inputs are "self": "Tensor" and "other": "Tensor"
Tensor add_npu(const Tensor& self, const Tensor& other, Scalar alpha) {
    // Select the input tensor whose options and format the output follows
    Tensor outputTensor = add_dest_output(self, other);
    // Call the corresponding shape inference function to compute the output size
    auto outputSize = add_npu_output_size(self, other);

    // Create the output Tensor from the output size with at::empty_with_format;
    // the format of the output Tensor can be specified and defaults to NCHW
    Tensor result = at::empty_with_format(outputSize, outputTensor.options(), CalcuOpUtil::get_tensor_npu_format(outputTensor));

    // Pass the output Tensor and the other parameters to add_out_npu
    add_out_npu(result, self, other, alpha);
    return result;
}

// Implementation of add_npu when the inputs are "self": "Tensor" and "other": "Scalar"
Tensor add_npu(const Tensor& self, Scalar other, Scalar alpha) {
    // Call the corresponding shape inference function to compute the output size
    auto outputSize = add_npu_output_size(self, other);

    // Create the output Tensor from the output size with at::empty_with_format;
    // the format of the output Tensor can be specified and defaults to NCHW
    Tensor result = at::empty_with_format(outputSize, self.options(), CalcuOpUtil::get_tensor_npu_format(self));

    // Pass the output Tensor and the other parameters to adds_out_npu
    adds_out_npu(result, self, other, alpha);
    return result;
}
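
The relationship between the two entry points (a functional variant that allocates the output and delegates to an out variant) can be shown with plain buffers. The sketch below is a CPU stand-in under stated assumptions: Buffer, add_out_sketch and add_sketch are hypothetical names, both inputs are assumed to have the same length, and allocation by size stands in for shape inference plus at::empty_with_format.

```cpp
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;

// Out variant: writes into a caller-provided buffer and returns a reference
// to it. Assumes result, self and other all have the same length.
Buffer& add_out_sketch(Buffer& result, const Buffer& self, const Buffer& other, float alpha) {
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = self[i] + alpha * other[i];
    }
    return result;
}

// Functional variant: determines the output size, allocates the output,
// then delegates the computation to the out variant.
Buffer add_sketch(const Buffer& self, const Buffer& other, float alpha) {
    Buffer result(self.size());  // stands in for shape inference + allocation
    add_out_sketch(result, self, other, alpha);
    return result;
}
```

Keeping the computation in the out variant means both entry points share one code path, and callers that already own an output buffer avoid an extra allocation.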