An Easy Introduction to CUDA C and C++
An Easy Introduction to CUDA C and C++
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers . These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
- Declare and allocate host and device memory.
- Initialize host data.
- Transfer data from the host to the device.
- Execute one or more kernels.
- Transfer results from the device to the host.
Keeping this sequence of operations in mind, let’s look at a CUDA C example.
A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
#include <stdio.h>
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.
Host Code
The main function declares two pairs of arrays.
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
An Easy Introduction to CUDA C and C++的更多相关文章
- 计算机系列:CUDA 深入研究
Copyright © 1900-2016, NORYES, All Rights Reserved. http://www.cnblogs.com/noryes/ 欢迎转载,请保留此版权声明. -- ...
- Caliburn.Micro - Getting Started - Introduction
Caliburn.Micro Xaml made easy Introduction When my “Build Your Own MVVM Framework” talk was chosen f ...
- CUDA C++编程手册(总论)
CUDA C++编程手册(总论) CUDA C++ Programming Guide The programming guide to the CUDA model and interface. C ...
- 关于并行计算的Scan操作
simple and common parallel algorithm building block is the all-prefix-sums operation. In this chapte ...
- [信安Presentation]一种基于GPU并行计算的MD5密码解密方法
-------------------paper--------------------- 一种基于GPU并行计算的MD5密码解密方法 0.abstract1.md5算法概述2.md5安全性分析3.基 ...
- 自然语言处理NLP快速入门
自然语言处理NLP快速入门 https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA [导读]自然语言处理已经成为人工智能领域一个重要的分支,它研究能实现人与 ...
- deeplearning 源码收集
Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal) To ...
- Deep Learning Libraries by Language
Deep Learning Libraries by Language Tweet Python Theano is a python library for defining and ...
- Pytorch原生AMP支持使用方法(1.6版本)
AMP:Automatic mixed precision,自动混合精度,可以在神经网络推理过程中,针对不同的层,采用不同的数据精度进行计算,从而实现节省显存和加快速度的目的. 在Pytorch 1. ...
随机推荐
- springboot上传excel到oss
参考:https://blog.csdn.net/qq_34864038/article/details/80239320 https://blog.csdn.net/qq_27319683/arti ...
- 【leetcode】609. Find Duplicate File in System
题目如下: Given a list of directory info including directory path, and all the files with contents in th ...
- CDH6.3.1安装hue 报错
x 一.查看日志server运行日志 /var/log/cloudera-scm-server/cloudera-scm-server.log 2019-12-11 17:28:34,201 INFO ...
- 云平台(cloud platforms)
云平台:允许开发者们或是将写好的程序放在“云”里运行,或是使用“云”里提供的服务,或二者皆是的服务 转向云计算(cloud computing),是业界将要面临的一个重大改变.各种云平台(cloud ...
- http 请求包含哪几个部分(请求行、请求头、请求体)
http协议报文 1.请求报文(请求行/请求头/请求数据/空行) 请求行 求方法字段.URL字段和HTTP协议版本 例如:GET ...
- CocoaPods中pod search报错的解决办法
pod search报错的原因: 每次使用pod search命令的使用会在该目录下生成一个search_index.json文件,会逐渐的增大,文件达到一定大小后,ruby解析里面的json字符就会 ...
- 【c#技术】一篇文章搞掂:Newtonsoft.Json Json.Net
一.介绍 Json.Net是一个.Net高性能框架. 特点和好处: 1.为.Net对象和JSON之间的转换提供灵活的Json序列化器: 2.为阅读和书写JSON提供LINQ to JSON: 3.高性 ...
- html中ul,ol和li的区别
ul是无序列表,全称是unordered list,先来个例子: ●张三 ●李四 ●王二 ●刘五 ol是有序列表 ,全称是ordered list,同样举个例子: 1.张 ...
- yii2和laravel比较
yii2和laravel比较 一.总结 一句话总结: 开发速度两者相当:laravel的artisan工具和yii的gii有异曲同工的效果,借助于artisan工具,可以快速创建控制器.模型和路由等. ...
- [MAC]配置Jenkins 开机自启动
如果是将jenkins.war放在tomcat中运行的, 则可以配置开机启动tomcat,脚本如下: XXX表示是你安装Tomcat所在目录 #启动tomcat cd XXX/Tomcat8/bin ...