OpenACC 异步计算
▶ 按照书上的例子,使用 async 导语实现主机与设备端的异步计算
● 代码,非异步的代码只要将其中的 async 以及第 29 行删除即可
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h> #define N 10240000
#define COUNT 200 // 多算几次,增加耗时 int main()
{
int *a = (int *)malloc(sizeof(int)*N);
int *b = (int *)malloc(sizeof(int)*N);
int *c = (int *)malloc(sizeof(int)*N); #pragma acc enter data create(a[0:N]) async // 在设备上赋值 a
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async
for (int j = ; j < N; j++)
a[j] = (i + j) * ;
} for (int i = ; i < COUNT; i++) // 在主机上赋值 b
{
for (int j = ; j < N; j++)
b[j] = (i + j) * ;
} #pragma acc update host(a[0:N]) async // 异步必须 update a,否则还没同步就参与 c 的运算
#pragma acc wait // 非异步时去掉该行 for (int i = ; i < N; i++)
c[i] = a[i] + b[i]; #pragma acc update device(a[0:N]) async // 没啥用,增加耗时
#pragma acc exit data delete(a[0:N]) printf("\nc[1] = %d\n", c[]);
free(a);
free(b);
free(c);
//getchar();
return ;
}
● 输出结果(是否异步,差异仅在行号、耗时上)
//+-----------------------------------------------------------------------------非异步
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= ... // 省略 launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time //------------------------------------------------------------------------------有异步
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating update self(a[:])
, Generating update device(a[:])
Generating exit data delete(a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block=
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= ... // 省略 launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= queue= num_gangs= num_workers= vector_length= grid= block= c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: update directive reached time
: data copyin transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● Nvvp 的结果,我是真没看出来有较大的差别,可能例子举得不够好

● 在一个设备上同时使用两个命令队列
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h> #define N 10240000
#define COUNT 200 int main()
{
int *a = (int *)malloc(sizeof(int)*N);
int *b = (int *)malloc(sizeof(int)*N);
int *c = (int *)malloc(sizeof(int)*N); #pragma acc enter data create(a[0:N]) async(1)
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async(1)
for (int j = ; j < N; j++)
a[j] = (i + j) * ;
} #pragma acc enter data create(b[0:N]) async(2)
for (int i = ; i < COUNT; i++)
{
#pragma acc parallel loop async(2)
for (int j = ; j < N; j++)
b[j] = (i + j) * ;
} #pragma acc enter data create(c[0:N]) async(2)
#pragma acc wait(1) async(2) #pragma acc parallel loop async(2)
for (int i = ; i < N; i++)
c[i] = a[i] + b[i]; #pragma acc update host(c[0:N]) async(2)
#pragma acc exit data delete(a[0:N], b[0:N], c[0:N]) printf("\nc[1] = %d\n", c[]);
free(a);
free(b);
free(c);
//getchar();
return ;
}
● 输出结果
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
main:
, Generating enter data create(a[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(a[:])
, Generating enter data create(b[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(b[:])
, Generating enter data create(c[:])
, Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
, Generating implicit copyout(c[:])
Generating implicit copyin(b[:],a[:])
, Generating update self(c[:])
Generating exit data delete(c[:],b[:],a[:]) D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe c[] =
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete. Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to to disable async() clauses
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us): ,
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached times
: kernel launched times
grid: [] block: []
elapsed time(us): total=, max= min= avg=
: data region reached times
: data region reached time
: compute region reached time
: kernel launched time
grid: [] block: []
device time(us): total= max= min= avg=
: data region reached times
: update directive reached time
: data copyout transfers:
device time(us): total=, max=, min= avg=,
: data region reached time
● Nvvp 中,可以看到两个命令队列交替执行

● 在 PGI 命令行中使用命令 pgaccelinfo 查看设备信息
D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgaccelinfo CUDA Driver Version: Device Number:
Device Name: GeForce GTX
Device Revision Number: 6.1
Global Memory Size:
Number of Multiprocessors:
Concurrent Copy and Execution: Yes
Total Constant Memory:
Total Shared Memory per Block:
Registers per Block:
Warp Size:
Maximum Threads per Block:
Maximum Block Dimensions: , ,
Maximum Grid Dimensions: x x
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: MHz
Memory Bus Width: bits
L2 Cache Size: bytes
Max Threads Per SMP:
Async Engines: // 有两个异步引擎,支持两个命令队列并行
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Compiler Option: -ta=tesla:cc60
OpenACC 异步计算的更多相关文章
- Task:取消异步计算限制操作 & 捕获任务中的异常
Why:ThreadPool没有内建机制标记当前线程在什么时候完成,也没有机制在操作完成时获得返回值,因而推出了Task,更精确的管理异步线程. How:通过构造方法的参数TaskCreationOp ...
- 13.FutureTask异步计算
FutureTask 1.可取消的异步计算,FutureTask实现了Future的基本方法,提供了start.cancel 操作,可以查询计算是否完成,并且可以获取计算 的结果.结果 ...
- 怎样给ExecutorService异步计算设置超时
ExecutorService接口使用submit方法会返回一个Future<V>对象.Future表示异步计算的结果.它提供了检查计算是否完毕的方法,以等待计算的完毕,并获取计算的结果. ...
- java异步计算Future的使用(转)
从jdk1.5开始我们可以利用Future来跟踪异步计算的结果.在此之前主线程要想获得工作线程(异步计算线程)的结果是比较麻烦的事情,需要我们进行特殊的程序结构设计,比较繁琐而且容易出错.有了Futu ...
- 使用QFuture类监控异步计算的结果
版权声明:本文为博主原创文章,未经博主允许不得转载. https://blog.csdn.net/Amnes1a/article/details/65630701在Qt中,为我们提供了好几种使用线程的 ...
- gearman(异步计算)学习
Gearman是什么? 它是分布式的程序调用框架,可完成跨语言的相互调 用,适合在后台运行工作任务.最初是2005年perl版本,2008年发布C/C++版本.目前大部分源码都是(Gearmand服务 ...
- OpenACC 绘制曼德勃罗集
▶ 书上第四章,用一系列步骤优化曼德勃罗集的计算过程. ● 代码 // constants.h ; ; ; ; const double xmin=-1.7; ; const double ymin= ...
- OpenACC Julia 图形
▶ 书上的代码,逐步优化绘制 Julia 图形的代码 ● 无并行优化(手动优化了变量等) #include <stdio.h> #include <stdlib.h> #inc ...
- 如何解救在异步Java代码中已检测的异常
Java语言通过已检测异常语法所提供的静态异常检测功能非常实用,通过它程序开发人员可以用很便捷的方式表达复杂的程序流程. 实际上,如果某个函数预期将返回某种类型的数据,通过已检测异常,很容易就可以扩展 ...
随机推荐
- Passing the Message 单调栈两次
What a sunny day! Let’s go picnic and have barbecue! Today, all kids in “Sun Flower” kindergarten ar ...
- win32 MSG 值
转自:https://autohotkey.com/docs/misc/SendMessageList.htm WM_NULL = 0x00 WM_CREATE = 0x01 WM_DESTROY = ...
- java泛型(一)、泛型的基本介绍和使用
现在开始深入学习java的泛型了,以前一直只是在集合中简单的使用泛型,根本就不明白泛型的原理和作用.泛型在java中,是一个十分重要的特性,所以要好好的研究下. 泛 型的定义:泛型是JDK 1.5的一 ...
- VisualSVN安装配置与使用
VisualSVN安装配置与使用 1. 所选服务器安装包:VisualSVN-Server-2.1.3.msi. 2. 客户端安装包:TortoiseSVN-1.6.2.16344-win32-s ...
- 树莓派的媒体播放软件omxplayer
树莓派中的CPU性能较差,而GPU较强大.omxplayer是专门针对树莓派的GPU的播放器.( made by Edgar (gimli) Hucek from the XBMC/Kodi proj ...
- tomcat源码阅读之生命周期(LifeCycle)
一.事件机制流程: 1. 当外部事件源发生事件(比如点击了按钮,数据发生改变更新等)时,事件源将事件封装成事件对象Event: 2. 将事件对象交由对应的事件派发器Dispatcher ...
- 【转】每天一个linux命令(26):用SecureCRT来上传和下载文件
原文网址:http://www.cnblogs.com/peida/archive/2012/11/28/2793181.html 用SSH管理linux服务器时经常需要远程与本地之间交互文件.而直接 ...
- DOMContentLoaded 和 Load 事件 区别(待补充)
javascript会阻塞dom的解析.当解析过程中遇到<script>标签的时候,便会停止解析过程,转而去处理脚本,如果脚本是内联的,浏览器会先去执行这段内联的脚本,如果是外链的,那么先 ...
- php preg_replace空格无法替换问题
一次坑爹的小bug.读取一段文字(编码utf-8),想替换掉空格,str_replace(" "..).preg_replace("/\s/"..)都不起作用. ...
- postman获取请求响应值
获取所有请求响应代码 var data = JSON.parse(responseBody); 把data的data值设置到token中 postman.setEnvironmentVari ...