Custom functions inside OpenACC compute constructs

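▶ User-defined functions created with the routine directive, and how they behave when called at different levels of parallelism

● Code: define a function sqab that uses the built-in functions fabsf and sqrtf to compute the square root of the absolute value of every element of a matrix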

 #include <stdio.h>
 #include <stdlib.h>
 #include <math.h>
 #include <openacc.h>

 #define ROW 8
 #define COL 64

 /* Device routine that may use vector-level parallelism internally. */
 #pragma acc routine vector
 void sqab(float *a, const int m)
 {
 #pragma acc loop
     for (int idx = 0; idx < m; idx++)
         a[idx] = sqrtf(fabsf(a[idx]));   /* in place: square root of the absolute value */
 }

 int main()
 {
     float x[ROW][COL];
     int row, col;

     /* Fill the matrix so that x[row][col] = row * 10 + col, e.g. x[1][1] = 11. */
     for (row = 0; row < ROW; row++)
     {
         for (col = 0; col < COL; col++)
             x[row][col] = row * 10 + col;
     }
     printf("\nx[1][1] = %f\n", x[1][1]);

 #pragma acc parallel loop vector pcopy(x[0:ROW][0:COL]) // the level clause here (gang, worker, or vector) is varied in the runs below
     for (row = 0; row < ROW; row++)
         sqab(&x[row][0], COL);
     printf("\nx[1][1] = %f\n", x[1][1]);
     //getchar();
     return 0;
 }

● Output when the parallel loop directive in main has no parallelism-level clause (gang is used by default)

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop gang /* blockIdx.x */

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=      // 8 gangs at the blockIdx.x level, 1 worker, vector at the threadIdx.x level
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.
 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: []
             elapsed time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=
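The PGI message in the run output asks for a call to acc_shutdown(acc_device_nvidia) at the end of the application. A minimal sketch of that, assuming the call should simply be skipped when the file is built without OpenACC support, could look like the following; finish_device is a hypothetical helper name, and the _OPENACC macro is the standard one defined by OpenACC compilers:

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: release the accelerator before the program exits so
   that the profiler can flush its timing data. Returns 1 if a shutdown call
   was made, 0 when the file is compiled without OpenACC support. */
static int finish_device(void)
{
#ifdef _OPENACC
    acc_shutdown(acc_device_nvidia);  /* the call named in the PGI message */
    return 1;
#else
    return 0;
#endif
}
```

In the listing, finish_device() would go just before the return statement in main, after the second printf.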

● Output when the worker clause is added to the parallel loop directive in main

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop worker(4) /* threadIdx.y */
      , Loop is parallelizable

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=32x4    // 1 gang, 4 workers at the threadIdx.y level, using a 2-D (32x4) thread block
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.
 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: [32x4]
              device time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=

● Output when the vector clause is added to the parallel loop directive in main

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop seq
      , Loop is parallelizable

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=      // 1 gang, 1 worker, all parallelism piled onto the threadIdx.x level
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.
 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: []
             elapsed time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=

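● If the routine's parallelism level is the same as or higher than the level requested on the calling loop, the compiler turns the calling loop into seq (as in the vector run above). If a loop clause inside the routine asks for a higher level than the routine itself declares, the compiler issues a warning and ignores the inner clause: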

 #pragma acc routine vector
 void sqab(float *a, const int m)
 {
 #pragma acc loop worker   /* worker is above the declared vector level, so it is ignored */
     for (int idx = 0; idx < m; idx++)
         a[idx] = sqrtf(fabsf(a[idx]));
 }

● Compile output (the run output is the same as in the worker case above, so it is omitted)

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 PGC-W--acc loop worker clause ignored in acc routine vector procedure  (main.c: )
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
