CUDA Programming: Matrix Multiplication (1)

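This version uses the simplest approach: each thread computes a single output element, that is, the multiply-accumulate of one row with one column.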
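Written out with the row-major indexing used by the kernel (ty is the row, tx is the column), the per-thread work is the sum below. This is only a restatement of the kernel's loop, and it assumes square matrices, since the kernel uses width both as the inner dimension and as the row stride of B.

```latex
C[ty \cdot width + tx] \;=\; \sum_{i=0}^{width-1} A[ty \cdot width + i] \cdot B[i \cdot width + tx]
```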

The code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>          // fabs
#include <cuda_runtime.h>

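// Each thread computes one element of C = A * B (row-major storage).
// A and C are height x width; B is read as a width x width matrix,
// so the host code must use width == height.
__global__ void matrixMulKernel(float *C, float *A, float *B, int width, int height){
  int tx = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
  int ty = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
  // Threads that fall outside the matrix have nothing to do.
  if(tx >= width || ty >= height)
    return;
  float sum = 0.0f;
  for(int i = 0; i < width; ++i){
    sum += A[ty * width + i] * B[i * width + tx];
  }
  C[ty * width + tx] = sum;
}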

// Fill an array of `size` floats with the constant `val`.
void constantInit(float *data, int size, float val){
    for (int i = 0; i < size; ++i){
        data[i] = val;
    }
}

void matrixMul(){
  // Assumed size: the original literals were lost. The two values must be
  // equal, because the kernel reads B as a width x width matrix.
  unsigned int width = 1024;
  unsigned int height = 1024;
  unsigned int size = width * height * sizeof(float);
  float *h_A = (float*)malloc(size);
  float *h_B = (float*)malloc(size);
  float *h_C = (float*)malloc(size);
  // Initialize host memory
  const float valB = 0.01f;
  constantInit(h_A, width*height, 1.0f);
  constantInit(h_B, width*height, valB);

  // Allocate device memory
  float *d_A, *d_B, *d_C;
  cudaMalloc((void**)&d_A, size);
  cudaMalloc((void**)&d_B, size);
  cudaMalloc((void**)&d_C, size);
  // Copy host memory to device
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

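  // Configure the launch dimensions.
  // Assumed block shape (16 x 16 = 256 threads): the original literals were lost.
  dim3 block(16, 16);
  // Round up so the grid covers the whole matrix even when the size is not a
  // multiple of the block size; the kernel's bounds check discards the excess.
  dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
  // Execute the kernel
  matrixMulKernel<<<grid, block>>>(d_C, d_A, d_B, width, height);
  // Copy the result from device to host (this call also waits for the kernel)
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);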

  printf("Checking computed result for correctness: ");
  bool correct = true;
  // test relative error by the formula
  //     |<x, y>_cpu - <x,y>_gpu|/<|x|, |y|>  < eps
  // Assumed tolerance: the original digits were lost.
  double eps = 1.e-6; // machine zero
  // Every element of A is 1.0f and every element of B is valB,
  // so each element of C should equal width * valB.
  for (unsigned int i = 0; i < width*height; i++){
    double abs_err = fabs(h_C[i] - (width * valB));
    double dot_length = width;
    double abs_val = fabs(h_C[i]);
    double rel_err = abs_err/abs_val/dot_length;
    if (rel_err > eps)
    {
      printf("Error! Matrix[%05u]=%.8f, ref=%.8f error term is > %E\n", i, h_C[i], width * valB, eps);
      correct = false;
    }
  }
  printf("%s\n", correct ? "Result = PASS" : "Result = FAIL");

  // Free host and device memory
  free(h_A);
  free(h_B);
  free(h_C);
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
}

int main(){
  matrixMul();
  return 0;
}
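The check in the listing only works because both inputs are constant. For arbitrary inputs, one option is to compare the GPU result against a plain CPU loop. The host-only sketch below (ordinary C++ that also compiles inside a .cu file) shows that idea together with a round-up helper for grid sizing, since a grid sized by plain integer division covers the whole matrix only when the dimensions are multiples of the block size. The names ceilDiv, matrixMulCpu and maxAbsDiff are hypothetical helpers introduced here, not part of the original code, and the matrices are assumed to be square and row-major like in the kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of blocks of `blockSize` threads needed to
// cover `n` elements (integer division rounded up).
inline unsigned int ceilDiv(unsigned int n, unsigned int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical CPU reference: C = A * B for square, row-major
// width x width matrices, using the same indexing as the kernel.
std::vector<float> matrixMulCpu(const std::vector<float>& A,
                                const std::vector<float>& B,
                                unsigned int width) {
    std::vector<float> C(static_cast<std::size_t>(width) * width, 0.0f);
    for (unsigned int row = 0; row < width; ++row) {
        for (unsigned int col = 0; col < width; ++col) {
            float sum = 0.0f;
            for (unsigned int i = 0; i < width; ++i) {
                sum += A[row * width + i] * B[i * width + col];
            }
            C[row * width + col] = sum;
        }
    }
    return C;
}

// Hypothetical helper: largest element-wise absolute difference between
// two result arrays of the same length.
float maxAbsDiff(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        float d = std::fabs(x[i] - y[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

To use it, copy h_C into a std::vector<float>, compute matrixMulCpu on the host inputs, and compare maxAbsDiff against a tolerance chosen for the matrix size; the grid could likewise be built as dim3 grid(ceilDiv(width, block.x), ceilDiv(height, block.y)).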
