An Easy Introduction to CUDA C and C++

This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers . These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.

CUDA Programming Model Basics

Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.

The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.

Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:

  1. Declare and allocate host and device memory.
  2. Initialize host data.
  3. Transfer data from the host to the device.
  4. Execute one or more kernels.
  5. Transfer results from the device to the host.

Keeping this sequence of operations in mind, let’s look at a CUDA C example.

A First CUDA C Program

In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:

#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
} int main(void)
{
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float)); cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float)); for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
} cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); // Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y); cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %f\n", maxError); cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}

The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.

Host Code

The main function declares two pairs of arrays.

  float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float)); cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));

An Easy Introduction to CUDA C and C++的更多相关文章

  1. 计算机系列:CUDA 深入研究

    Copyright © 1900-2016, NORYES, All Rights Reserved. http://www.cnblogs.com/noryes/ 欢迎转载,请保留此版权声明. -- ...

  2. Caliburn.Micro - Getting Started - Introduction

    Caliburn.Micro Xaml made easy Introduction When my “Build Your Own MVVM Framework” talk was chosen f ...

  3. CUDA C++编程手册(总论)

    CUDA C++编程手册(总论) CUDA C++ Programming Guide The programming guide to the CUDA model and interface. C ...

  4. 关于并行计算的Scan操作

    simple and common parallel algorithm building block is the all-prefix-sums operation. In this chapte ...

  5. [信安Presentation]一种基于GPU并行计算的MD5密码解密方法

    -------------------paper--------------------- 一种基于GPU并行计算的MD5密码解密方法 0.abstract1.md5算法概述2.md5安全性分析3.基 ...

  6. 自然语言处理NLP快速入门

    自然语言处理NLP快速入门 https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA [导读]自然语言处理已经成为人工智能领域一个重要的分支,它研究能实现人与 ...

  7. deeplearning 源码收集

    Theano – CPU/GPU symbolic expression compiler in python (from MILA lab at University of Montreal) To ...

  8. Deep Learning Libraries by Language

    Deep Learning Libraries by Language Tweet         Python Theano is a python library for defining and ...

  9. Pytorch原生AMP支持使用方法(1.6版本)

    AMP:Automatic mixed precision,自动混合精度,可以在神经网络推理过程中,针对不同的层,采用不同的数据精度进行计算,从而实现节省显存和加快速度的目的. 在Pytorch 1. ...

随机推荐

  1. 51nod 1831: 小C的游戏(Bash博弈 找规律)

    题目链接 此类博弈不需要考虑sg函数,只需要确定必胜态和必败态,解题思路一般为打败先打表找规律,而后找规律给出统一的公式.打表方式:给定初始条件(此题中为ok[0]=ok[1]=0),然后从低到高枚举 ...

  2. Spring框架之接口实现覆盖(接口功能扩展)

    在日常开发中,存在着这种一种场景,框架对接口A提供了一个种默认的实现AImpl,随着需求的变更,现今AImpl不能满足了功能需要,这时,我们该怎么办? 当然是修改AImpl的实现代码了,但是,如果它是 ...

  3. Hadoop编程调用HDFS(PYTHON)

    1.运行环境 开发工具:PyCharm Python 版本:3.5 Hadoop环境: Cloudera QuickStart 2.GITHUB地址 https://github.com/nbfujx ...

  4. 跨域共享cookie

    1. JSP中Cookie的读写 Cookie的本质是一个键值对,当浏览器访问web服务器的时候写入在客户端机器上,里面记录一些信息.Cookie还有一些附加信息,比如域名.有效时间.注释等等. 下面 ...

  5. [CSP-S模拟测试]:Revive(点分治)

    题目背景 $Sparkling\ ashes\ drift\ along\ your\ flames \\ And\ softly\ merge\ into\ the\ sky$ 题目传送门(内部题1 ...

  6. 20141211--C# 构造函数

    namespace fengzhuang { class Class2 { private string _Name; private string _Code; public string _Sex ...

  7. 移动H5优化指南

    转载于http://isux.tencent.com/h5-performance.html 移动H5前端性能优化指南 概述 秒完成或使用Loading4. 基于联通3G网络平均338KB/s(2.7 ...

  8. tensorflow|tf.train.slice_input_producer|tf.train.Coordinator|tf.train.start_queue_runners

    #### ''' tf.train.slice_input_producer :定义样本放入文件名队列的方式[迭代次数,是否乱序],但此时文件名队列还没有真正写入数据 slice_input_prod ...

  9. MySQL分表备份

    #!/bin/bash DUMP=/usr/bin/mysqldump MYSQL=/usr/bin/mysql IPADDR=127.0.0.1 PORT=3306 USER=abc PASSWD= ...

  10. 组件化框架设计之apt编译时期自动生成代码&动态类加载(二)

    阿里P7移动互联网架构师进阶视频(每日更新中)免费学习请点击:https://space.bilibili.com/474380680 本篇文章将继续从以下两个内容来介绍组件化框架设计: apt编译时 ...