An Easy Introduction to CUDA C and C++
This post is the first in a series on CUDA C and C++, the C/C++ interface to the CUDA parallel computing platform. The series assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at a CUDA C example.
A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <math.h>    // fmaxf, fabsf

// Kernel: each thread updates one element of y.
__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;

  // Allocate host and device arrays.
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  // Initialize host data.
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy inputs from host to device.
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Copy the result back to the host.
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should equal 2.0*1.0 + 2.0 = 4.0.
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.
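Before turning to the host code, here is a minimal sketch of what the launch saxpy<<<(N+255)/256, 256>>> asks the GPU to do, written as ordinary sequential C++ so it can be run without a GPU. The helper names blocks_for and emulate_saxpy are hypothetical and exist only for this illustration; they are not part of CUDA. The two nested loops stand in for blockIdx.x and threadIdx.x, and threads_per_block plays the role of blockDim.x. On a real device these iterations would run as parallel threads, in no guaranteed order.

```cpp
#include <vector>

// Number of thread blocks needed to cover n elements, rounding up.
// Mirrors the (N+255)/256 expression used in the kernel launch.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Sequential stand-in for the saxpy kernel launch (hypothetical helper,
// not a CUDA API). Each inner-loop iteration does the work of one thread.
// Returns the total number of emulated threads.
int emulate_saxpy(int n, float a, const std::vector<float>& x,
                  std::vector<float>& y, int threads_per_block) {
    int num_blocks = blocks_for(n, threads_per_block);
    int launched = 0;
    for (int block = 0; block < num_blocks; ++block) {             // blockIdx.x
        for (int thread = 0; thread < threads_per_block; ++thread) { // threadIdx.x
            int i = block * threads_per_block + thread;  // global element index
            if (i < n) y[i] = a * x[i] + y[i];           // guard for the last block
            ++launched;
        }
    }
    return launched;
}
```

For example, with n = 1000 and 256 threads per block, blocks_for returns 4, so 1024 threads are emulated and the last 24 are skipped by the i < n guard. With N = 1<<20 the division is exact and gives 4096 blocks, but the guard keeps the kernel correct for sizes that are not a multiple of the block size.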
Host Code
The main function declares two pairs of arrays.
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));