An Easy Introduction to CUDA C and C++
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow you to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code running on the host can manage memory on both the host and the device, and it also launches kernels, which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
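To make these terms concrete, here is a minimal sketch of what a kernel and its launch look like (the kernel name hello and the launch sizes are made up for illustration; this is not the example discussed below). The __global__ qualifier marks a function that runs on the device but is callable from the host, and the triple angle brackets give the execution configuration: the number of thread blocks and the number of threads per block.
#include <stdio.h>

// Device code: each thread prints its global index.
__global__ void hello(void)
{
  printf("Hello from thread %d\n", blockIdx.x*blockDim.x + threadIdx.x);
}

int main(void)
{
  hello<<<2, 4>>>();        // launch 2 blocks of 4 threads each
  cudaDeviceSynchronize();  // wait for the device to finish before the program exits
  return 0;
}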
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at a CUDA C example.
A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
#include <stdio.h>
#include <math.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;

  // Allocate the host arrays with malloc and the device arrays with cudaMalloc
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  // Initialize the host data
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy the host arrays to the device
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Copy the result back and check it against the expected value 4.0f
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.
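Before looking at the rest of the host code, it is worth noting how each thread finds its element. Inside saxpy, the built-in variables blockIdx.x, blockDim.x, and threadIdx.x combine into a global index i = blockIdx.x*blockDim.x + threadIdx.x; with 256 threads per block as launched above, thread 5 of block 2 computes i = 2*256 + 5 = 517 and so updates element 517. The launch configuration (N+255)/256 rounds the number of blocks up, which for N = 1<<20 gives 4096 blocks. Here N happens to be an exact multiple of 256, so every thread has work, but the check if (i < n) keeps any leftover threads in the last block from reading or writing past the end of the arrays whenever n is not a multiple of the block size.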
Host Code
The main function declares two pairs of arrays: x and y for the host copies of the data, and d_x and d_y (the d_ prefix is a common convention for device pointers) for the corresponding device copies.
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
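The host arrays are allocated with standard malloc, while cudaMalloc allocates memory on the device for d_x and d_y. In real code it is worth checking the cudaError_t value that cudaMalloc (and most CUDA runtime calls) returns; the example above omits this for brevity. A minimal error-checking sketch, assuming it sits inside main right after these declarations:
// Minimal error-checking sketch (not part of the original example).
cudaError_t err = cudaMalloc(&d_x, N*sizeof(float));
if (err != cudaSuccess) {
  fprintf(stderr, "cudaMalloc d_x failed: %s\n", cudaGetErrorString(err));
  exit(EXIT_FAILURE);  // requires <stdlib.h>
}
The same pattern applies to cudaMemcpy; for kernel launches, which do not return an error code directly, cudaGetErrorString can be paired with cudaGetLastError.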