An Easy Introduction to CUDA C and C++
An Easy Introduction to CUDA C and C++
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers . These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at a CUDA C example.
A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
#include <stdio.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.
Host Code
The main function declares two pairs of arrays.
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
An Easy Introduction to CUDA C and C++的更多相关文章
- 计算机系列:CUDA 深入研究
Copyright © 1900-2016, NORYES, All Rights Reserved. http://www.cnblogs.com/noryes/ 欢迎转载,请保留此版权声明. -- ...
- Caliburn.Micro - Getting Started - Introduction
Caliburn.Micro Xaml made easy Introduction When my “Build Your Own MVVM Framework” talk was chosen f ...
- CUDA C++编程手册(总论)
CUDA C++编程手册(总论) CUDA C++ Programming Guide The programming guide to the CUDA model and interface. C ...
- 关于并行计算的Scan操作
simple and common parallel algorithm building block is the all-prefix-sums operation. In this chapte ...
- [信安Presentation]一种基于GPU并行计算的MD5密码解密方法
-------------------paper--------------------- 一种基于GPU并行计算的MD5密码解密方法 0.abstract1.md5算法概述2.md5安全性分析3.基 ...
- 自然语言处理NLP快速入门
自然语言处理NLP快速入门 https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA [导读]自然语言处理已经成为人工智能领域一个重要的分支,它研究能实现人与 ...
- deeplearning 源码收集
Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal) To ...
- Deep Learning Libraries by Language
Deep Learning Libraries by Language Tweet Python Theano is a python library for defining and ...
- Pytorch原生AMP支持使用方法(1.6版本)
AMP:Automatic mixed precision,自动混合精度,可以在神经网络推理过程中,针对不同的层,采用不同的数据精度进行计算,从而实现节省显存和加快速度的目的. 在Pytorch 1. ...
随机推荐
- FORTRAN和C语言数组循环顺序
对于数组 A(I,J,K) FORTRAN中的循环次序应该是 K J I C语言的循环次序应该是I J K
- moment.js时间格式化库
网址:momentjs.cn 主要用来操作时间的格式化.通过发送API请求获取到的数据中:例如[新闻]中的 发布时间,有的时候.请求到的时间,后台没处理过,那么只能前端来操作数据了. moment.j ...
- CSS text-indent 属性
text-indent 属性首行文本缩进,有点像padding-left的效果.
- Scribd每月共有超过两亿个访客、累积数亿篇以上的文件档案,Alexa全球排名200以内
目前已登上世界300大网站,每月共有超过两亿个访客.累积数亿篇以上的文件档案.透过Flash介面的阅读器-iPaper,使用者可以在网站内浏览各种文件,由于该网站是一个文件分享平台,所有的文件都是由使 ...
- CPUID
http://en.wikipedia.org/wiki/CPUID CPUID From Wikipedia, the free encyclopedia The CPUID opcode i ...
- 深入浅出HashMap
/** *@ author ViVi *@date 2014-6-11 */ Hashmap是一种非常常用的.应用广泛的数据类型,最近研究到相关的内容,就正好复习一下.希望通过仪器讨论.共同提高~ 1 ...
- 关于Http请求Cookie问题
在Http请求中,很多时候我们要设置Cookie和获取返回的Cookie,在这个问题上踩了一个很大的坑,主要是两个问题: 1.不能获取到重定向返回的Cookie: 2.两次请求返回的Cookie是相同 ...
- OpenCV2.4.8 + CUDA7.5 + VS2013 配置
配置过程主要参考:https://initialneil.wordpress.com/2014/09/25/opencv-2-4-9-cuda-6-5-visual-studio-2013/ 1.为什 ...
- 4.tensorflow——CNN
1.CNN结构:X-->CONV(relu)-->MAXPOOL-->CONV(relu)-->FC(relu)-->FC(softmax)-->Y 1.1 卷积层 ...
- Python笔记(三)_字典与集合
字典dict 映射类型,以键-值的方式存储,通过键来取相应的值 member={'one':1,'two':2,'three':3} 创建字典member=dict('苹果'='apple','桔子' ...