OpenACC Data Management Directives
▶ Code and notes for the data-management part of Chapter 4 of the book
● Code exercising copy, copyin, copyout, and create
- #include <stdio.h>
- #include <openacc.h>
- int main()
- {
-     const int length = 1024;                         // value assumed; the original constant was lost
-     int a[length], b[length], c[length], d[length];
-     for (int i = 0; i < length; i++)
-         a[i] = b[i] = c[i] = i;                      // initial values assumed
-     {
- #pragma acc kernels create(d)
-         for (int i = 0; i < length; i++)
-         {
-             a[i]++;
-             c[i] = a[i] + b[i];
-             d[i] = 0;                                // d is device-side scratch only
-         }
-     }
-     for (int i = 0; i < 10; i++)                     // print the first few elements (count assumed)
-         printf("a[%d] = %d, c[%d] = %d\n", i, a[i], i, c[i]);
-     getchar();
-     return 0;
- }
● Output: the intermediate variable d is created explicitly, while a, b, and c are created implicitly, each with a different copy attribute (a version with the clauses written out explicitly is sketched after the compiler output)
- D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc -acc -Minfo main.c -o main_acc.exe
- main:
- , Generating create(d[:])
- Generating implicit copyout(c[:])
- Generating implicit copyin(b[:])
- Generating implicit copy(a[:])
- , Loop is parallelizable
- Accelerator kernel generated
- Generating Tesla code
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
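● For reference, the same kernel can spell the data clauses out explicitly instead of relying on the implicit choices reported above — a sketch reusing the names from the listing, with the bounds written out:
- #pragma acc kernels copy(a[0:length]) copyin(b[0:length]) copyout(c[0:length]) create(d[0:length])
-     for (int i = 0; i < length; i++)
-     {
-         a[i]++;              // read and written   -> copy
-         c[i] = a[i] + b[i];  // b is only read     -> copyin; c is only written -> copyout
-         d[i] = 0;            // device-only scratch -> create
-     }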
● Using copyout on its own inside kernels produces the warning: PGC-W-0996-The directive #pragma acc copyout is deprecated; use #pragma acc declare copyout instead (main.c: XX)
● enter data and exit data, used here with C++.
■ First, PGI on Windows does not support C++ compilation (there is only pgcc.exe, no pgc++*.exe), so the C++ examples have to be built on Linux.
■ The book's code has a problem here. The gist: OpenACC's copy is a shallow copy, so for data structures that hold pointers (such as a vector or a class), the objects the pointers refer to are not copied along with them. There are two ways around this: one is to de-structure the data, gathering the class's contents into plain arrays and copying those; the other is to use managed memory, so explicit copies are no longer an issue at all.【https://stackoverflow.com/questions/53860467/how-to-copy-on-gpu-a-vector-of-vector-pointer-memory-allocated-in-openacc】
■ The book's code uses neither workaround and fails with "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution", a fairly common error【https://stackoverflow.com/search?q=call+to+returned+error+700%3A+Illegal+address+during+kernel+execution】. A minimal sketch of the failure mode and of the manual deep copy follows below.
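● A minimal sketch (not the book's code; the struct, sizes, and names here are made up for illustration) of why a plain copyin of a pointer-holding object fails, and the usual two-step manual deep copy:
- #include <cstdio>
- #include <cstdlib>
- struct Box { int len; double *data; };
- int main()
- {
-     Box b;
-     b.len  = 1024;                                    // illustrative size
-     b.data = (double *)malloc(b.len * sizeof(double));
-     // A plain copyin(b) copies only the struct itself: on the device, b.data
-     // still holds a host address, and dereferencing it gives exactly the
-     // "Illegal address" (error 700) failure quoted above.
-     // Manual deep copy: copy the struct first, then the pointed-to block;
-     // the second enter data also attaches the device copy of b.data.
- #pragma acc enter data copyin(b)
- #pragma acc enter data create(b.data[0:b.len])
- #pragma acc parallel loop present(b)
-     for (int i = 0; i < b.len; i++)
-         b.data[i] = i;
- #pragma acc update self(b.data[0:b.len])              // bring the results back
-     printf("b.data[10] = %f\n", b.data[10]);
- #pragma acc exit data delete(b.data[0:b.len])
- #pragma acc exit data delete(b)
-     free(b.data);
-     return 0;
- }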
● De-structuring the data into plain arrays
- #include <iostream>
- #include <vector>
- #include <cstdint>
- using namespace std;
- int main()
- {
-     const int vectorCount = 1024, vectorLength = 20;
-     long sum = 0;
-     vector<int32_t> *vectorTable = new vector<int32_t>[vectorCount]; // 1024 vectors, 20 elements pushed into each
-     for (int i = 0; i < vectorCount; i++)
-     {
-         for (int j = 0; j < vectorLength; j++)
-             vectorTable[i].push_back(i);
-     }
-     int32_t **arrayTable = new int32_t *[vectorCount];   // array holding only the vectors' data pointers, mirroring vectorTable
-     int *vectorSize = new int[vectorCount];              // size of each vector
- #pragma acc enter data create(arrayTable[0:vectorCount][0:0]) // create arrayTable on the device; note the [0:0] second dimension (pointers only)
-     for (int i = 0; i < vectorCount; i++)
-     {
-         int sze = vectorTable[i].size();
-         vectorSize[i] = sze;
-         arrayTable[i] = vectorTable[i].data();           // point each arrayTable entry at that vector's data
- #pragma acc enter data copyin(arrayTable[i:1][:sze])     // copy each vector's data into the device
-     }
- #pragma acc enter data copyin(vectorSize[:vectorCount])  // the vector sizes go to the device too
- #pragma acc parallel loop gang vector reduction(+: sum) present(arrayTable, vectorSize) // reduction over all elements
-     for (int i = 0; i < vectorCount; i++)
-     {
-         for (int j = 0; j < vectorSize[i]; ++j)
-             sum += arrayTable[i][j];
-     }
-     cout << "Sum: " << sum << endl;
- #pragma acc exit data delete(vectorSize)
- #pragma acc exit data delete(arrayTable)
-     delete[] vectorSize;
-     delete[] arrayTable;
-     delete[] vectorTable;
-     return 0;
- }
● Output
- cuan@CUAN:~$ pgc++ main.cpp -o main.exe --c++ -ta=tesla -Minfo -acc
- main:
- , Generating enter data create(arrayTable[:][:])
- , Generating enter data copyin(arrayTable[i][:sze+],vectorSize[:])
- Generating implicit copy(sum)
- Generating present(vectorSize[:])
- Generating Tesla code
- , Generating reduction(+:sum)
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- , #pragma acc loop seq
- , Generating present(arrayTable[:][:])
- , Loop is parallelizable
- , Generating exit data delete(vectorSize[:],arrayTable[:][:])
- cuan@CUAN:~$ ./main.exe
- Sum: 10475520
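● If the kernel also modified the per-vector data, each row would have to be copied back into the vectors' storage through the same pointers before being used on the host — a sketch reusing the names from the listing above:
-     for (int i = 0; i < vectorCount; i++)
-     {
-         int32_t *row = arrayTable[i];    // host pointer into vectorTable[i]'s storage, already present on the device
-         int sze = vectorSize[i];
- #pragma acc update self(row[0:sze])      // copy row i straight back into the vector
-     }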
● Using managed memory
- #include <iostream>
- using namespace std;
- class ivector
- {
- public:
-     int len;
-     int *arr;
-     ivector(int length)
-     {
-         len = length;
-         arr = new int[len];
- #pragma acc enter data copyin(this)
- #pragma acc enter data create(arr[0:len])
- #pragma acc parallel loop present(arr[0:len])
-         for (int iend = len, i = 0; i < iend; i++) // the temporary iend keeps the compiler from assuming len might change inside the loop and refusing to parallelize
-             arr[i] = i;
-     }
-     ivector(const ivector &s)
-     {
-         len = s.len;
-         arr = new int[len];
- #pragma acc enter data copyin(this)
- #pragma acc enter data create(arr[0:len])
- #pragma acc parallel loop present(arr[0:len], s.arr[0:len]) // s is already on the device as well
-         for (int iend = len, i = 0; i < iend; i++)
-             arr[i] = s.arr[i];
-     }
-     ~ivector()
-     {
- #pragma acc exit data delete(arr)  // when the object is destroyed, release the device arr and then this, in that order
- #pragma acc exit data delete(this)
-         cout << "deconstruction!" << endl;
-         delete[] arr;
-         len = 0;
-     }
-     int &operator[](int i)
-     {
-         if (i < 0 || i >= this->len)
-             return arr[0];
-         return arr[i];
-     }
-     void add(int c)
-     {
- #pragma acc kernels loop present(arr[0:len]) // every operation that touches arr must state present
-         for (int iend = len, i = 0; i < iend; i++)
-             arr[i] += c;
-     }
-     void updateHost() // manually refresh the host-side data
-     {
- #pragma acc update host(arr[0:len])
-     }
- };
- int main()
- {
-     ivector s1(1024);                            // length assumed; the original value was lost
-     s1.add(1);                                   // increment assumed
-     s1.updateHost();
-     cout << "s1[1] = " << s1[1] << endl;
-     ivector s2(s1);
-     s2.updateHost();
-     cout << "s2[1] = " << s2[1] << endl;
-     return 0;
- }
● Output. Without -ta=tesla:managed it fails at run time【TODO: investigate】
- cuan@CUAN:~$ pgc++ main.cpp -o main.exe --c++ -ta=tesla:managed -Minfo -acc
- ivector::ivector(int):
- , Generating enter data copyin(this[:])
- Generating enter data create(arr[:len])
- Generating Tesla code
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- , Generating implicit copy(this[:])
- Generating present(arr[:len])
- ivector::ivector(const ivector&):
- , Generating enter data create(arr[:len])
- Generating enter data copyin(this[:])
- Generating present(arr[:len])
- Generating Tesla code
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- , Generating implicit copyin(s[:])
- Generating implicit copy(this[:])
- Generating present(s->arr[:len])
- ivector::~ivector():
- , Generating exit data delete(this[:],arr[:])
- ivector::add(int):
- , Generating Tesla code
- , Accelerator serial kernel generated
- Generating implicit copy(this[:])
- Generating present(arr[:len])
- , Loop is parallelizable
- Generating Tesla code
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- ivector::updateHost():
- , Generating update self(arr[:len])
- cuan@CUAN:~$ ./main.exe
- launch CUDA kernel file=/home/cuan/main.cpp function=_ZN7ivectorC1Ei line= device= threadid= num_gangs= num_workers= vector_length= grid= block=
- launch CUDA kernel file=/home/cuan/main.cpp function=_ZN7ivector3addEi line= device= threadid= num_gangs= num_workers= vector_length= grid= block=
- launch CUDA kernel file=/home/cuan/main.cpp function=_ZN7ivector3addEi line= device= threadid= num_gangs= num_workers= vector_length= grid= block=
- s1[] =
- launch CUDA kernel file=/home/cuan/main.cpp function=_ZN7ivectorC1ERKS_ line= device= threadid= num_gangs= num_workers= vector_length= grid= block=
- s2[] =
- deconstruction!
- deconstruction!
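● A direction worth trying for the managed-memory dependence noted above (untested here, so only a sketch): the copy constructor is the delicate spot, since the compiler has to shallow-copy s and then resolve s.arr on the device (see the implicit copyin(s[:]) in the compiler output). Taking local aliases of the pointers and the length before the compute region keeps every present() lookup on a plain local pointer:
- ivector(const ivector &s)
- {
-     len = s.len;
-     arr = new int[len];
- #pragma acc enter data copyin(this)
- #pragma acc enter data create(arr[0:len])
-     const int n = len;          // plain locals: no member access inside the kernel
-     int *dst = arr;             // device copy created just above
-     const int *src = s.arr;     // device copy created by s's constructor
- #pragma acc parallel loop present(dst[0:n], src[0:n])
-     for (int i = 0; i < n; i++)
-         dst[i] = src[i];
- }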
● This book describes a way to use OpenACC from C++【https://www.elsevier.com/books/parallel-programming-with-openacc/farber/978-0-12-410397-9】; the code is accList.double.cpp under【https://github.com/rmfarber/ParallelProgrammingWithOpenACC/tree/master/Chapter05】
- // accList.h
- #ifndef ACC_LIST_H
- #define ACC_LIST_H
- #include <cstdlib>
- #include <cassert>
- #ifdef _OPENACC
- #include <openacc.h>
- #endif
- template<typename T>
- class accList
- {
- public:
-     explicit accList() {}
-     explicit accList(size_t size) // the constructor copies the this pointer to the device, then allocates the memory
-     {
- #pragma acc enter data copyin(this)
-         allocate(size);
-     }
-     ~accList() // the destructor frees the memory, then deletes the this pointer
-     {
-         release();
- #pragma acc exit data delete(this)
-     }
- #pragma acc routine seq
-     T& operator[](size_t idx)
-     {
-         return _A[idx];
-     }
- #pragma acc routine seq
-     const T& operator[](size_t idx) const
-     {
-         return _A[idx];
-     }
-     size_t size() const
-     {
-         return _size;
-     }
-     accList& operator=(const accList& B)
-     {
-         allocate(B.size());
-         for (size_t j = 0; j < _size; ++j)
-         {
-             _A[j] = B[j];
-         }
-         accUpdateDevice();
-         return *this;
-     }
-     void insert(size_t idx, const T& val)
-     {
-         _A[idx] = val;
-     }
-     void insert(size_t idx, const T* val)
-     {
-         _A[idx] = *val;
-     }
-     void accUpdateSelf()
-     {
-         accUpdateSelfT(_A, 0);   // the int literal 0 steers overload resolution (see the note at the end of this post)
-     }
-     void accUpdateDevice()
-     {
-         accUpdateDeviceT(_A, 0);
-     }
- private:
-     T * _A{ nullptr };   // the only data members are the pointer and the length
-     size_t _size{ 0 };
-     void release()
-     {
-         if (_size > 0)
-         {
- #pragma acc exit data delete(_A[0:_size]) // free the device copy when releasing the memory
-             delete[] _A;
-             _A = nullptr;
-             _size = 0;
-         }
-     }
-     void allocate(size_t size)
-     {
-         if (_size != size) // when the requested size differs from the current one, allocate a fresh block
-         {
-             release();
-             _size = size;
- #pragma acc update device(_size)
-             if (_size > 0)
-             {
-                 _A = new T[_size];
- #ifdef _OPENACC // with OpenACC, check that _A is not already present on the device
-                 assert(!acc_is_present(&_A[0], sizeof(T)));
- #endif
- #pragma acc enter data create(_A[0:_size]) // allocate the new block on the device
-             }
-         }
-     }
-     template<typename U>
-     void accUpdateSelfT(U *p, long)
-     {
- #pragma acc update self(p[0:_size])
-     }
-     template<typename U>
-     auto accUpdateSelfT(U *p, int) -> decltype(p->accUpdateSelf())
-     {
-         for (size_t j = 0; j < _size; ++j)
-         {
-             p[j].accUpdateSelf();
-         }
-     }
-     template<typename U>
-     void accUpdateDeviceT(U *p, long)
-     {
- #pragma acc update device(p[0:_size])
-     }
-     template<typename U>
-     auto accUpdateDeviceT(U *p, int) -> decltype(p->accUpdateDevice())
-     {
-         for (size_t j = 0; j < _size; ++j)
-         {
-             p[j].accUpdateDevice();
-         }
-     }
- };
- #endif
- // main.cpp
- #include <iostream>
- #include <cstdlib>
- #include <cstdint>
- #include "accList.h"
- using namespace std;
- #ifndef N
- #define N 1024
- #endif
- int main()
- {
-     accList<double> A(N), B(N);
-     for (int i = 0; i < B.size(); ++i)
-         B[i] = 2.5;
-     B.accUpdateDevice(); // manually refresh the device copy
- #pragma acc parallel loop gang vector present(A,B)
-     for (int i = 0; i < A.size(); ++i)
-         A[i] = B[i] + i;
-     A.accUpdateSelf(); // manually refresh the host copy
-     for (int i = 0; i < 10; ++i)
-         cout << "A[" << i << "]: " << A[i] << endl;
-     cout << "......" << endl;
-     for (int i = N - 10; i < N; ++i)
-         cout << "A[" << i << "]: " << A[i] << endl;
-     return 0;
- }
● Output
- cuan@CUAN:~/acc$ pgc++ main.cpp -o main.exe -Minfo -acc
- main:
- , Generating present(B,A)
- Generating Tesla code
- , #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- accList<double>::accList(unsigned long):
- , include "accList.h"
- , Generating enter data copyin(this[:])
- accList<double>::~accList():
- , include "accList.h"
- , Generating exit data delete(this[:])
- accList<double>::operator [](unsigned long):
- , include "accList.h"
- , Generating acc routine seq
- Generating Tesla code
- accList<double>::size() const:
- , include "accList.h"
- , Generating implicit acc routine seq
- Generating acc routine seq
- Generating Tesla code
- accList<double>::release():
- , include "accList.h"
- , Generating exit data delete(_A[:_size])
- accList<double>::allocate(unsigned long):
- , include "accList.h"
- , Generating update device(_size)
- , Generating enter data create(_A[:_size])
- void accList<double>::accUpdateSelfT<double>(T1 *, long):
- , include "accList.h"
- , Generating update self(p[:_size])
- void accList<double>::accUpdateDeviceT<double>(T1 *, long):
- , include "accList.h"
- , Generating update device(p[:_size])
- cuan@CUAN:~/acc$ ./main.exe
- launch CUDA kernel file=/home/cuan/acc/main.cpp function=main line= device= threadid= num_gangs= num_workers= vector_length= grid= block=
- A[0]: 2.5
- A[1]: 3.5
- A[2]: 4.5
- A[3]: 5.5
- A[4]: 6.5
- A[5]: 7.5
- A[6]: 8.5
- A[7]: 9.5
- A[8]: 10.5
- A[9]: 11.5
- ......
- A[1014]: 1016.5
- A[1015]: 1017.5
- A[1016]: 1018.5
- A[1017]: 1019.5
- A[1018]: 1020.5
- A[1019]: 1021.5
- A[1020]: 1022.5
- A[1021]: 1023.5
- A[1022]: 1024.5
- A[1023]: 1025.5
- Accelerator Kernel Timing data
- /home/cuan/acc/main.cpp
- main NVIDIA devicenum=
- time(us):
- : compute region reached time
- : kernel launched time
- grid: [] block: []
- device time(us): total= max= min= avg=
- elapsed time(us): total= max= min= avg=
- : data region reached times
- /home/cuan/acc/main.cpp
- _ZN7accListIdEC1Em NVIDIA devicenum=
- time(us):
- : data region reached times
- : data copyin transfers:
- device time(us): total= max= min= avg=
- /home/cuan/acc/main.cpp
- _ZN7accListIdED1Ev NVIDIA devicenum=
- time(us):
- : data region reached times
- /home/cuan/acc/main.cpp
- _ZN7accListIdE7releaseEv NVIDIA devicenum=
- time(us):
- : data region reached times
- : data copyin transfers:
- device time(us): total= max= min= avg=
- /home/cuan/acc/main.cpp
- _ZN7accListIdE8allocateEm NVIDIA devicenum=
- time(us):
- : update directive reached times
- : data copyin transfers:
- device time(us): total= max= min= avg=
- : data region reached times
- : data copyin transfers:
- device time(us): total= max= min= avg=
- /home/cuan/acc/main.cpp
- _ZN7accListIdE14accUpdateSelfTIdEEvPT_l NVIDIA devicenum=
- time(us):
- : update directive reached time
- : data copyout transfers:
- device time(us): total= max= min= avg=
- /home/cuan/acc/main.cpp
- _ZN7accListIdE16accUpdateDeviceTIdEEvPT_l NVIDIA devicenum=
- time(us):
- : update directive reached time
- : data copyin transfers:
- device time(us): total= max= min= avg=
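● A side note on the accUpdateSelfT / accUpdateDeviceT pair in accList.h: the second argument exists only to steer overload resolution. Calling the helper with the int literal 0 prefers the int overload, but that overload drops out of overload resolution (decltype SFINAE) unless the element type itself has an accUpdateSelf()/accUpdateDevice() member, in which case the update recurses element by element; otherwise the long overload updates the block as flat data. A minimal, OpenACC-free sketch of the same trick (all names here — helper, deepUpdate, Plain, Nested — are made up for illustration):
- #include <iostream>
- struct Plain  { double x; };
- struct Nested { void deepUpdate() { std::cout << "deep update\n"; } };
- template<typename U>
- void helper(U *p, long)                              // fallback: treat the elements as flat data
- {
-     std::cout << "flat update\n";
- }
- template<typename U>
- auto helper(U *p, int) -> decltype(p->deepUpdate())  // only viable when U has deepUpdate()
- {
-     p->deepUpdate();
- }
- int main()
- {
-     Plain a; Nested b;
-     helper(&a, 0);   // int overload is ill-formed for Plain -> "flat update"
-     helper(&b, 0);   // int overload preferred for Nested    -> "deep update"
-     return 0;
- }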