Matrix multiplication with the cuBLAS matrix library functions
2014-08-10
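The cuBLAS function that performs matrix multiplication
The first thing to note is that cuBLAS stores matrices in column-major order, which differs from the row-major order used in C/C++. The commented code below explains how to deal with this.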
// SOME PRECAUTIONS:
// IF WE WANT TO CALCULATE THE ROW-MAJOR MATRIX PRODUCT C = A * B,
// WE JUST NEED TO CALL THE CUBLAS API IN REVERSE ORDER: cublasSgemm(B, A)!
// The reason is explained as follows:
//
// The CUBLAS library uses column-major storage, but C/C++ uses row-major storage.
// When a matrix pointer is passed to CUBLAS, the memory layout is interpreted as
// column-major instead of row-major, which is equivalent to an implicit transpose.
//
// In the case of row-major C/C++ matrices A and B and a simple matrix multiplication
// C = A * B, we can't use the input order cublasSgemm(A, B) because of the
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), i.e. row(A) != col(B), A(T) and B(T) cannot be
// multiplied. Moreover, even if they can be multiplied, the result C
// is a column-major CUBLAS matrix, which means C(T) in C/C++, so we would need
// extra transpose code to convert it to a row-major C/C++ matrix.
//
// To solve the problem, consider our desired result C, a row-major matrix.
// In CUBLAS format, it is actually C(T) (because of the implicit transpose).
// C = A * B, so C(T) = (A * B)(T) = B(T) * A(T). The CUBLAS matrices B(T) and A(T)
// happen to be the C/C++ matrices B and A (again because of the implicit transpose)!
// We don't need extra transpose code; we only need to swap the input order!
//
// CUBLAS provides high-performance matrix multiplication.
// See also:
// V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
// in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
// Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
//
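A small example. In C++ (row-major):
A:  0 3 5      B:  1 1 1
    0 0 4          1 1 1
    1 0 0          1 1 1
We want C = A*B.
The result in C++:
C:  8 8 8
    4 4 4
    1 1 1
In cuBLAS the same buffers are read as column-major, so each matrix appears transposed:
A:  0 0 1      B:  1 1 1
    3 0 0          1 1 1
    5 4 0          1 1 1
In cuBLAS we compute C2 = B*A.
cuBLAS stores the result C2 in column-major order. Written out as an ordinary matrix first, which is the more intuitive way to read it, C2 is:
C2: 8 4 1
    8 4 1
    8 4 1
cuBLAS actually writes C2 to memory column by column (8 8 8 4 4 4 1 1 1), so reading that memory back in row-major order in C++ gives:
C2: 8 8 8
    4 4 4
    1 1 1
The result of B*A in cuBLAS is therefore the same as the result of A*B in C++. When using cuBLAS, we only need to swap the positions of the arguments to get the result we want.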
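The example above can be reproduced on the CPU without a GPU. The following is a minimal sketch, not cuBLAS itself: `gemmColMajor` is a hypothetical stand-in for a column-major GEMM with no transposes, and `multiplyRowMajor` is a hypothetical wrapper that passes the row-major buffers in swapped order (B first, then A). It also covers non-square matrices, assuming A is hA x wA and B is wA x wB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), every matrix stored column-major.
void gemmColMajor(int m, int n, int k,
                  const float* A, int lda,
                  const float* B, int ldb,
                  float* C, int ldc) {
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[p * lda + row] * B[col * ldb + p];
            }
            C[col * ldc + row] = sum;
        }
    }
}

// Row-major C = A * B, where A is hA x wA and B is wA x wB.
// The buffers are handed to the column-major routine in swapped order,
// so the routine computes C(T) = B(T) * A(T) and no transpose copy is needed.
std::vector<float> multiplyRowMajor(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int hA, int wA, int wB) {
    assert(A.size() == static_cast<std::size_t>(hA) * wA);
    assert(B.size() == static_cast<std::size_t>(wA) * wB);
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.0f);
    gemmColMajor(wB, hA, wA,        // m, n, k as seen by the column-major routine
                 B.data(), wB,      // first operand: the buffer of B
                 A.data(), wA,      // second operand: the buffer of A
                 C.data(), wB);     // result buffer, read back as row-major C
    return C;
}
```

Calling `multiplyRowMajor` with the 3 x 3 matrices A and B from the example returns the buffer 8 8 8 4 4 4 1 1 1, which is C in row-major order.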
cublas<t>gemm()
cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc);
cublasStatus_t cublasDgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const double *alpha,
                           const double *A, int lda,
                           const double *B, int ldb,
                           const double *beta,
                           double *C, int ldc);
cublasStatus_t cublasCgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuComplex *alpha,
                           const cuComplex *A, int lda,
                           const cuComplex *B, int ldb,
                           const cuComplex *beta,
                           cuComplex *C, int ldc);
cublasStatus_t cublasZgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc);
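The parameters have the following meaning:
The function computes C = alpha * op(A) * op(B) + beta * C, where op(X) is X, its transpose, or its conjugate transpose, as selected by transa and transb (CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C).
op(A) is m x k, op(B) is k x n, and C is m x n, all stored in column-major order.
lda, ldb, and ldc are the leading dimensions of the arrays that hold A, B, and C.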
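To make the roles of alpha, beta, and the leading dimensions concrete, here is a small CPU sketch of the operation C = alpha * A * B + beta * C for the no-transpose case. `sgemmReference` is a hypothetical helper written for illustration and is not part of cuBLAS. It assumes column-major storage, and it skips reading C when beta is zero, which mirrors the documented cuBLAS behaviour that C need not hold valid input in that case.

```cpp
#include <cassert>
#include <vector>

// Hypothetical reference for C = alpha * A * B + beta * C (no transposes).
// A is m x k, B is k x n, C is m x n, all column-major.
// lda, ldb, ldc are the distances in elements between consecutive columns,
// so they may be larger than the number of rows when a matrix is a sub-block
// of a bigger array.
void sgemmReference(int m, int n, int k, float alpha,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float beta, float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[p * lda + row] * B[col * ldb + p];
            }
            float* out = &C[col * ldc + row];
            // With beta == 0 the old contents of C are ignored entirely.
            *out = (beta == 0.0f) ? alpha * sum : alpha * sum + beta * (*out);
        }
    }
}
```

With lda = 3 and m = k = 2, the routine reads only the top-left 2 x 2 block of a 3 x 3 column-major array, which is the usual reason a leading dimension differs from the row count.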


The following code uses cublasSgemm from cuBLAS to implement a simple matrix multiplication:
Header file: matrix.h
// SOME PRECAUTIONS:
// IF WE WANT TO CALCULATE THE ROW-MAJOR MATRIX PRODUCT C = A * B,
// WE JUST NEED TO CALL THE CUBLAS API IN REVERSE ORDER: cublasSgemm(B, A)!
// The reason is explained as follows:
//
// The CUBLAS library uses column-major storage, but C/C++ uses row-major storage.
// When a matrix pointer is passed to CUBLAS, the memory layout is interpreted as
// column-major instead of row-major, which is equivalent to an implicit transpose.
//
// In the case of row-major C/C++ matrices A and B and a simple matrix multiplication
// C = A * B, we can't use the input order cublasSgemm(A, B) because of the
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), i.e. row(A) != col(B), A(T) and B(T) cannot be
// multiplied. Moreover, even if they can be multiplied, the result C
// is a column-major CUBLAS matrix, which means C(T) in C/C++, so we would need
// extra transpose code to convert it to a row-major C/C++ matrix.
//
// To solve the problem, consider our desired result C, a row-major matrix.
// In CUBLAS format, it is actually C(T) (because of the implicit transpose).
// C = A * B, so C(T) = (A * B)(T) = B(T) * A(T). The CUBLAS matrices B(T) and A(T)
// happen to be the C/C++ matrices B and A (again because of the implicit transpose)!
// We don't need extra transpose code; we only need to swap the input order!
//
// CUBLAS provides high-performance matrix multiplication.
// See also:
// V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
// in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
// Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
//
#include <stdio.h>
#include <stdlib.h>

// CUDA runtime
#include <cuda_runtime.h>
#include <cublas_v2.h>

// libraries to link against (MSVC-specific pragmas)
#pragma comment (lib,"cudart")
#pragma comment (lib,"cublas")

// this macro makes it easy to convert the row-major data we are used to into column-major data
//#define IDX2C(i,j,leading) (((j)*(leading))+(i))

typedef struct _matrixSize // Optional Command-line multiplier for matrix sizes
{
    unsigned int uiWA, uiHA, uiWB, uiHB, uiWC, uiHC;
} sMatrixSize;

cudaError_t matrixMultiply(float *h_C, const float *h_A, const float *h_B, int devID, sMatrixSize &matrix_size);
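The IDX2C macro that is commented out in the header maps a (row, column) pair to its offset in a column-major array. As a hedged illustration of the explicit conversion that the swapped-argument trick avoids, the sketch below copies a row-major matrix into column-major order. `toColumnMajor` is a hypothetical helper and is not used by the code in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column-major array with leading dimension ld.
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

// Hypothetical helper: copy a rows x cols row-major matrix into a new
// column-major buffer. This is the extra pass over the data that is not
// needed when the operands are passed to gemm in swapped order.
std::vector<float> toColumnMajor(const std::vector<float>& rowMajor,
                                 int rows, int cols) {
    assert(rowMajor.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> colMajor(rowMajor.size());
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            colMajor[IDX2C(i, j, rows)] = rowMajor[i * cols + j];
        }
    }
    return colMajor;
}
```

For the 2 x 3 row-major matrix 1 2 3 / 4 5 6, the column-major buffer is 1 4 2 5 3 6.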
CPP file: matrix.cpp
#include "matrix.h"

cudaError_t matrixMultiply(float *h_C, const float *h_A, const float *h_B, int devID, sMatrixSize &matrix_size){
    float *dev_A = NULL;
    float *dev_B = NULL;
    float *dev_C = NULL;
    float *h_CUBLAS = NULL;
    cublasHandle_t handle = NULL;
    cudaEvent_t start = NULL;
    cudaEvent_t stop = NULL;
    cudaDeviceProp devicePro;
    cudaError_t cudaStatus;

    // element counts and byte sizes of the matrices; declared before the first
    // goto so that no jump to Error skips an initialization
    unsigned int size_A = matrix_size.uiWA * matrix_size.uiHA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    unsigned int size_B = matrix_size.uiWB * matrix_size.uiHB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    unsigned int size_C = matrix_size.uiWC * matrix_size.uiHC;
    unsigned int mem_size_C = sizeof(float) * size_C;

    cudaStatus = cudaGetDeviceProperties(&devicePro, devID);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaGetDeviceProperties returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    // allocate device memory for matrices dev_A, dev_B and dev_C
    //cudaMalloc dev_A
    cudaStatus = cudaMalloc((void**)&dev_A, mem_size_A);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_A returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }
    //cudaMalloc dev_B
    cudaStatus = cudaMalloc((void**)&dev_B, mem_size_B);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_B returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }
    //cudaMalloc dev_C
    cudaStatus = cudaMalloc((void**)&dev_C, mem_size_C);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_C returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    // allocate host memory for the result matrix h_CUBLAS
    h_CUBLAS = (float*)malloc(mem_size_C);
    if(h_CUBLAS == NULL && size_C > 0){
        fprintf(stderr, "malloc h_CUBLAS error, line(%d)\n", __LINE__);
        cudaStatus = cudaErrorMemoryAllocation;
        goto Error;
    }

    /*
    copy the host input vectors h_A, h_B in host memory
    to the device input vectors dev_A, dev_B in device memory
    */
    //cudaMemcpy h_A to dev_A
    cudaStatus = cudaMemcpy(dev_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMemcpy h_A to dev_A returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }
    //cudaMemcpy h_B to dev_B
    cudaStatus = cudaMemcpy(dev_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMemcpy h_B to dev_B returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    //CUBLAS version 2.0
    {
        cublasStatus_t ret;
        ret = cublasCreate(&handle);
        if(ret != CUBLAS_STATUS_SUCCESS){
            fprintf(stderr, "cublasCreate returned error code %d, line(%d)\n", ret, __LINE__);
            cudaStatus = cudaErrorUnknown; // report the cuBLAS failure to the caller
            goto Error;
        }
        cudaStatus = cudaEventCreate(&start);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to create start event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        cudaStatus = cudaEventCreate(&stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to create stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        //record start event
        cudaStatus = cudaEventRecord(start, NULL);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to record start event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // matrix multiply A*B: because matrices are column-major in cuBLAS, we swap the input
        // order to B*A. Row-major C = A*B is computed as column-major C(T) = B(T)*A(T),
        // so m = uiWB, n = uiHA, k = uiWA. The reason is explained in the file matrix.h
        float alpha = 1.0f;
        float beta = 0.0f;
        ret = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
                          &alpha, dev_B, matrix_size.uiWB, dev_A, matrix_size.uiWA,
                          &beta, dev_C, matrix_size.uiWB);
        if(ret != CUBLAS_STATUS_SUCCESS){
            fprintf(stderr, "cublasSgemm returned error code %d, line(%d)\n", ret, __LINE__);
            cudaStatus = cudaErrorUnknown; // report the cuBLAS failure to the caller
            goto Error;
        }
        printf("cublasSgemm done.\n");

        //record stop event
        cudaStatus = cudaEventRecord(stop, NULL);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to record stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        //wait for the stop event to complete
        cudaStatus = cudaEventSynchronize(stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to synchronize on the stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        // cudaEventElapsedTime reports milliseconds
        float msecTotal = 0.0f;
        cudaStatus = cudaEventElapsedTime(&msecTotal, start, stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to get time elapsed between events (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        printf("cublasSgemm time: %f ms\n", msecTotal);

        //copy result from device to host
        cudaStatus = cudaMemcpy(h_CUBLAS, dev_C, mem_size_C, cudaMemcpyDeviceToHost);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "cudaMemcpy dev_C to h_CUBLAS error code %d, line(%d)!\n", cudaStatus, __LINE__);
            goto Error;
        }
    }

    // print the result row by row (uiHC rows, uiWC columns)
    for(unsigned int i = 0; i < matrix_size.uiHC; i++){
        for(unsigned int j = 0; j < matrix_size.uiWC; j++){
            printf("%f ", h_CUBLAS[i*matrix_size.uiWC + j]);
        }
        printf("\n");
    }

    /*
    // convert a column-major result to row-major
    // (not needed with the swapped argument order used above)
    for(unsigned int i = 0; i < matrix_size.uiHC; i++){
        for(unsigned int j = 0; j < matrix_size.uiWC; j++){
            int at1 = IDX2C(i, j, matrix_size.uiHC); //element location in column-major order
            int at2 = i*matrix_size.uiWC + j;        //element location in row-major order
            h_C[at2] = h_CUBLAS[at1];
        }
    }
    */

    // the result is already row-major thanks to the swapped argument order,
    // so it is copied to the output h_C unchanged
    for(unsigned int i = 0; i < matrix_size.uiHC; i++){
        for(unsigned int j = 0; j < matrix_size.uiWC; j++){
            unsigned int at = i*matrix_size.uiWC + j;
            h_C[at] = h_CUBLAS[at];
        }
    }

Error:
    if(start != NULL) cudaEventDestroy(start);
    if(stop != NULL) cudaEventDestroy(stop);
    if(handle != NULL) cublasDestroy(handle);
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
    free(h_CUBLAS);
    dev_A = NULL;
    dev_B = NULL;
    dev_C = NULL;
    h_CUBLAS = NULL;
    return cudaStatus;
}

cudaError_t reduceEdge(){
    cudaError_t cudaStatus = cudaSuccess;
    return cudaStatus;
}
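One way to check the output of matrixMultiply is to compare it with a plain CPU product and to turn the measured time into a throughput figure. The sketch below is an assumption about how such a check could look, not part of the original code: `matrixMulCPU`, `resultsMatch`, and `gflops` are hypothetical helper names, the tolerance of 1.0e-5 is an arbitrary choice, and the timing is assumed to be in milliseconds as returned by cudaEventElapsedTime.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Straightforward row-major reference: C (hA x wB) = A (hA x wA) * B (wA x wB).
std::vector<float> matrixMulCPU(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int hA, int wA, int wB) {
    assert(A.size() == static_cast<std::size_t>(hA) * wA);
    assert(B.size() == static_cast<std::size_t>(wA) * wB);
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.0f);
    for (int i = 0; i < hA; ++i) {
        for (int j = 0; j < wB; ++j) {
            double sum = 0.0;  // accumulate in double to limit rounding error
            for (int p = 0; p < wA; ++p) {
                sum += static_cast<double>(A[i * wA + p]) * B[p * wB + j];
            }
            C[i * wB + j] = static_cast<float>(sum);
        }
    }
    return C;
}

// True when every element of result lies within a relative tolerance eps
// of the reference (absolute tolerance for values smaller than 1).
bool resultsMatch(const std::vector<float>& reference,
                  const std::vector<float>& result,
                  float eps = 1.0e-5f) {
    if (reference.size() != result.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        float diff = std::fabs(reference[i] - result[i]);
        float scale = std::max(1.0f, std::fabs(reference[i]));
        if (diff > eps * scale) return false;
    }
    return true;
}

// Throughput in GFLOP/s, counting 2 * hA * wA * wB floating-point operations.
double gflops(int hA, int wA, int wB, float msec) {
    double flops = 2.0 * hA * wA * wB;
    return flops / (static_cast<double>(msec) * 1.0e6);
}
```

In a host program, the vector returned by `matrixMulCPU` would be compared with the contents of h_C after matrixMultiply returns.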