An Easy Introduction to CUDA C and C++
An Easy Introduction to CUDA C and C++
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers . These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at a CUDA C example.
A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
#include <stdio.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.
Host Code
The main function declares two pairs of arrays.
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
An Easy Introduction to CUDA C and C++的更多相关文章
- 计算机系列:CUDA 深入研究
Copyright © 1900-2016, NORYES, All Rights Reserved. http://www.cnblogs.com/noryes/ 欢迎转载,请保留此版权声明. -- ...
- Caliburn.Micro - Getting Started - Introduction
Caliburn.Micro Xaml made easy Introduction When my “Build Your Own MVVM Framework” talk was chosen f ...
- CUDA C++编程手册(总论)
CUDA C++编程手册(总论) CUDA C++ Programming Guide The programming guide to the CUDA model and interface. C ...
- 关于并行计算的Scan操作
simple and common parallel algorithm building block is the all-prefix-sums operation. In this chapte ...
- [信安Presentation]一种基于GPU并行计算的MD5密码解密方法
-------------------paper--------------------- 一种基于GPU并行计算的MD5密码解密方法 0.abstract1.md5算法概述2.md5安全性分析3.基 ...
- 自然语言处理NLP快速入门
自然语言处理NLP快速入门 https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA [导读]自然语言处理已经成为人工智能领域一个重要的分支,它研究能实现人与 ...
- deeplearning 源码收集
Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal) To ...
- Deep Learning Libraries by Language
Deep Learning Libraries by Language Tweet Python Theano is a python library for defining and ...
- Pytorch原生AMP支持使用方法(1.6版本)
AMP:Automatic mixed precision,自动混合精度,可以在神经网络推理过程中,针对不同的层,采用不同的数据精度进行计算,从而实现节省显存和加快速度的目的. 在Pytorch 1. ...
随机推荐
- controler--application配置
<?xml version="1.0" encoding="UTF-8"?><beans xmlns="http://www.spr ...
- 接口代码(requests库安装)
一. 首先用cd:Scripts路径名命令,进入到python--Scripts目录下:然后键入pip install requests 进行安装,有可能会要求你升级pip,键入python -m ...
- SSH小应用
1:Spring整合Hibernate <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE hi ...
- ckeditor实现WORD粘贴图片自动上传
自动导入Word图片,或者粘贴Word内容时自动上传所有的图片,并且最终保留Word样式,这应该是Web编辑器里面最基本的一个需求功能了.一般情况下我们将Word内容粘贴到Web编辑器(富文本编辑器) ...
- python实现计时器(装饰器)
1.写一个装饰器,查看函数执行的时间 import time # 装饰器run_time,@run_time加在谁头上,谁就是参数fundef run_time(fun): start_time = ...
- UE编辑器
引用ue的js 下载地址http://pan.baidu.com/s/1gdrQ35L <script type="text/javascript" src="__ ...
- ES6 Generator使用
// generator介绍: function* hello() { console.log("hello world") } hello();//没有执行 // 直接调用hel ...
- JS 時間戳轉日期格式
1.日期轉換為時間戳,(如果日期格式為時間戳,將其轉為日期類型,否則輸出傳入的數據) // 如果時間格式為時間戳,將其轉為日期 function timestampToDate(timestamp) ...
- 发布delphi程序(build with runtime package)要带哪些文件?
Delphi提供两种方式来编译你的程序:使用包或者是单独的exe 使用包,你可以使用如下方法设置: 项目选项(菜单project->options->Packages页), 在Runtim ...
- mongo 慢查询配置
我是分片部署,所以慢查询相关的配置是在启动片服务上. 执行查询命令,是在share的primary 上. 1. mongodb慢查询 配置 慢查询数据主要存储在 local库的system.pro ...