An Easy Introduction to CUDA C and C++

This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers . These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.

CUDA Programming Model Basics

Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.

The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.

Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:

  1. Declare and allocate host and device memory.
  2. Initialize host data.
  3. Transfer data from the host to the device.
  4. Execute one or more kernels.
  5. Transfer results from the device to the host.

Keeping this sequence of operations in mind, let’s look at a CUDA C example.

A First CUDA C Program

In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:

#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
} int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float)); cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float)); for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
} cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); // Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y); cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError); cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}

The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.

Host Code

The main function declares two pairs of arrays.

  float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float)); cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));

An Easy Introduction to CUDA C and C++的更多相关文章

  1. 计算机系列:CUDA 深入研究

    Copyright © 1900-2016, NORYES, All Rights Reserved. http://www.cnblogs.com/noryes/ 欢迎转载,请保留此版权声明. -- ...

  2. Caliburn.Micro - Getting Started - Introduction

    Caliburn.Micro Xaml made easy Introduction When my “Build Your Own MVVM Framework” talk was chosen f ...

  3. CUDA C++编程手册(总论)

    CUDA C++编程手册(总论) CUDA C++ Programming Guide The programming guide to the CUDA model and interface. C ...

  4. 关于并行计算的Scan操作

    simple and common parallel algorithm building block is the all-prefix-sums operation. In this chapte ...

  5. [信安Presentation]一种基于GPU并行计算的MD5密码解密方法

    -------------------paper--------------------- 一种基于GPU并行计算的MD5密码解密方法 0.abstract1.md5算法概述2.md5安全性分析3.基 ...

  6. 自然语言处理NLP快速入门

    自然语言处理NLP快速入门 https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA [导读]自然语言处理已经成为人工智能领域一个重要的分支,它研究能实现人与 ...

  7. deeplearning 源码收集

    Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal) To ...

  8. Deep Learning Libraries by Language

    Deep Learning Libraries by Language Tweet         Python Theano is a python library for defining and ...

  9. Pytorch原生AMP支持使用方法(1.6版本)

    AMP:Automatic mixed precision,自动混合精度,可以在神经网络推理过程中,针对不同的层,采用不同的数据精度进行计算,从而实现节省显存和加快速度的目的. 在Pytorch 1. ...

随机推荐

  1. js中require()的用法----JS如何连接数据库执行sql语句或者建立数据库连接池

    var vue = require('vue'); 引入vue的意思,commonjs的写法.node都是用require来载入模块的,可以看看webpack+vue. require()可以调用模块 ...

  2. Mac OS 网络设置教程 wifi设置与宽带设置详解

    虽然所有设备连接无线网络的步骤都相差无几,但是Mac与windows系统还是不相同的,那么,苹果Mac怎么连接无线网络呢?针对此问题,本文就为大家介绍Mac网络的设置教程,有兴趣的朋友们可以了解下.如 ...

  3. Autodesk Maya 2019 for Mac(三维动画软件)最新功能介绍

    Autodesk Maya是美国Autodesk公司出品的世界顶级的三维动画软件,应用对象是专业的影视广告,角色动画,电影特技等.Maya功能完善,工作灵活,易学易用,制作效率极高,渲染真实感极强,是 ...

  4. python在windows中运行文件

    "d:Program Files\python35\python.exe" hello.txt

  5. 2017 ACM-ICPC乌鲁木齐网络赛 B. Out-out-control cars(计算几何 直线相交)

    题目描述 Two out-of-control cars crashed within about a half-hour Wednesday afternoon on Deer Park Avenu ...

  6. 边缘节点 如何判断CDN的预热任务是否执行完成刷新 路由追踪 近期最少使用算法

    阿里云内容分发网络(Content Delivery Network,简称CDN)是建立并覆盖在承载网之上,由分布在不同区域的边缘节点服务器群组成的分布式网络.阿里云CDN分担源站压力,避免网络拥塞, ...

  7. LINUX shell脚本相关

    调试脚本 测试脚本语法:bash -n file.sh 查看脚本每一步执行情况:bash -x file.sh   位置变量:$1,$2,... 特殊变量:           %?:最后一个命令的执 ...

  8. 信息安全-OAuth2.0:NuGetFromMicrosoft

    ylbtech-信息安全-OAuth2.0:NuGetFromMicrosoft 1.返回顶部 1. https://login.microsoftonline.com/common/oauth2/v ...

  9. 建站手册-浏览器信息:Mozilla 项目

    ylbtech-建站手册-浏览器信息:Mozilla 项目 1.返回顶部 1. http://www.w3school.com.cn/browsers/browsers_mozilla.asp 2. ...

  10. /proc/interrupts /proc/stat 查看中断信息

    /proc/interrupts列出当前所以系统注册的中断,记录中断号,中断发生次数,中断设备名称 如下图:从左至右:中断号   中断次数  中断设备名称 从上图可知中断号为19的arch_timer ...