A Brief Overview of the NVIDIA GPU Pascal Architecture

This article is excerpted from NVIDIA's official Pascal architecture whitepaper: https://www.nvidia.com/en-us/data-center/resources/pascal-architecture-whitepaper/

SM

  1. Compared with Maxwell, Pascal further improves power efficiency through an improved 16-nm FinFET manufacturing process and various architectural modifications.

Pascal further improves the already excellent power efficiency provided by the Maxwell architecture through both an improved 16-nm FinFET manufacturing process and various architectural modifications.

  1. Like Maxwell, Pascal uses a power-of-two number of CUDA cores per SM partition, and each warp scheduler issues to a dedicated set of CUDA cores equal to the warp width (32), which simplifies scheduling compared with Kepler.

Like Maxwell, Pascal employs a power-of-two number of CUDA Cores per partition. This simplifies scheduling compared to Kepler, since each of the SM's warp schedulers issues to a dedicated set of CUDA Cores equal to the warp width (32).

  1. Each warp scheduler can still dual-issue (for example, issuing a math operation to a CUDA core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA cores.

Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.

  1. Each GP104 SM provides four warp schedulers managing a total of 128 FP32 cores and 4 FP64 cores, whereas each GP100 SM provides two warp schedulers managing 64 FP32 cores and 32 FP64 cores. A GP104 provides up to 20 SMs, the similar GP102 design provides up to 30 SMs, and GP100 provides up to 60 smaller SMs.

Like Maxwell, each GP104 SM provides four warp schedulers managing a total of 128 single-precision (FP32) and four double-precision (FP64) cores. A GP104 processor provides up to 20 SMs, and the similar GP102 design provides up to 30 SMs. By contrast, GP100 provides smaller but more numerous SMs. Each GP100 provides up to 60 SMs. Each SM contains two warp schedulers managing a total of 64 FP32 and 32 FP64 cores.
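
The SM configurations above imply the following whole-chip core counts (a quick cross-check; the arithmetic uses only the figures quoted above):

```python
# Total FP32/FP64 core counts implied by the SM configurations above.
chips = {
    # name: (max SM count, FP32 cores per SM, FP64 cores per SM)
    "GP104": (20, 128, 4),
    "GP102": (30, 128, 4),
    "GP100": (60, 64, 32),
}

for name, (sms, fp32, fp64) in chips.items():
    print(f"{name}: {sms * fp32} FP32 cores, {sms * fp64} FP64 cores")
```

Note how GP100 reaches the same FP32 total as GP102 (3840) with half-width SMs, while providing a 1:2 FP64:FP32 ratio instead of GP104/GP102's 1:32.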

  1. As with Maxwell, Pascal supports up to 64 concurrent warps per SM.

The maximum number of concurrent warps per SM remains the same as in Maxwell and Kepler (i.e., 64).

  1. Pascal has 64K 32-bit registers per SM, with a maximum of 255 registers per thread. Each SM supports up to 32 resident thread blocks (compared with Kepler, Pascal should see an automatic occupancy improvement for kernels with thread blocks of 64 or fewer threads). Shared memory per SM is 64 KB on GP100 and 96 KB on GP104, but each thread block may use at most 48 KB of shared memory. NVIDIA recommends using at most 32 KB of shared memory per thread block, which allows at least two thread blocks to be resident per SM.

The register file size (64k 32-bit registers) is the same as that of Maxwell and Kepler GK110.

The maximum number of registers per thread is 255.

The maximum number of thread blocks per SM is 32. Compared to Kepler, Pascal should see an automatic occupancy improvement for kernels with thread blocks of 64 or fewer threads (shared memory and register file resource requirements permitting).

Shared memory capacity per SM is 64KB for GP100 and 96KB for GP104. The maximum shared memory per block remains limited at 48KB as with prior architectures. NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block.
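
A minimal sketch of why the 32 KB recommendation guarantees at least two resident blocks per SM. This considers only the shared-memory limit; register usage, warp counts, and other occupancy limits are ignored:

```python
# Thread blocks resident on one SM, limited only by shared memory
# (Pascal also caps residency at 32 blocks per SM).
def blocks_limited_by_smem(smem_per_sm_kb, smem_per_block_kb, max_blocks=32):
    if smem_per_block_kb == 0:
        return max_blocks
    return min(max_blocks, smem_per_sm_kb // smem_per_block_kb)

# With the recommended 32 KB per block, a 64 KB GP100 SM fits 2 blocks
# and a 96 KB GP104 SM fits 3; at the 48 KB maximum, GP100 fits only 1.
print(blocks_limited_by_smem(64, 32))  # 2
print(blocks_limited_by_smem(96, 32))  # 3
print(blocks_limited_by_smem(64, 48))  # 1
```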

  1. Pascal improves support for the FP16 (half) floating-point format. Peak FP16 throughput is attained by using a paired operation that performs two FP16 instructions per core simultaneously; to qualify for pairing, the operands must be stored in half2 format. GP100 provides FP16 throughput up to 2x that of FP32, whereas GP104's FP16 throughput is only 1/64 that of FP32; to compensate, GP104 adds INT8 support not present in GP100.

Pascal provides improved FP16 support for applications. Peak FP16 throughput is attained by using a paired operation to perform two FP16 instructions per core simultaneously. GP100, designed with training deep neural networks in mind, provides FP16 throughput up to 2x that of FP32 arithmetic. On GP104, FP16 throughput is lower, 1/64th that of FP32. However, compensating for reduced FP16 throughput, GP104 provides additional high-throughput INT8 support not available in GP100.

  1. GP104 provides specialized two-way and four-way integer dot-product instructions for INT8. The __dp4a intrinsic computes a dot product of four 8-bit integers with accumulation into a 32-bit integer; similarly, __dp2a computes a dot product of two 16-bit integers with two 8-bit integers, accumulated into a 32-bit integer.

GP104 provides specialized instructions for two-way and four-way integer dot products. The __dp4a intrinsic computes a dot product of four 8-bit integers with accumulation into a 32-bit integer. Similarly, __dp2a performs a two-element dot product between two 16-bit integers in one vector, and two 8-bit integers in another with accumulation into a 32-bit integer.
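
The semantics of these intrinsics can be modeled on the CPU as follows (an illustrative Python sketch, not NVIDIA's implementation; the function names merely mirror the intrinsics, and saturation/overflow behavior is ignored):

```python
# CPU reference model for the arithmetic of GP104's dot-product intrinsics.
def dp4a(a4, b4, c):
    """Dot product of two 4-vectors of 8-bit ints, accumulated into 32-bit c."""
    return c + sum(x * y for x, y in zip(a4, b4))

def dp2a(a2, b2, c):
    """Dot product of two 16-bit ints with two 8-bit ints, accumulated into c."""
    return c + sum(x * y for x, y in zip(a2, b2))

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 5 + 12 + 21 + 32 + 10 = 80
print(dp2a([100, 200], [3, 4], 0))           # 300 + 800 = 1100
```

One __dp4a thus replaces four multiplies and four adds with a single instruction, which is where GP104's high INT8 throughput for inference workloads comes from.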

  1. GP100 uses High Bandwidth Memory 2 (HBM2) for its DRAM. HBM2 memory dies are stacked on the same silicon package as the GPU die, which allows much wider interfaces and higher bandwidth than traditional GDDR technology. GP100 is linked to up to four HBM2 stacks, with two 512-bit memory controllers per stack. The effective memory-bus width is thus 4096 bits, a large increase over GM200's 384 bits, greatly boosting peak bandwidth. The GP100-based Tesla P100 therefore reaches a peak bandwidth of 732 GB/s at a modest 715 MHz memory clock.

GP100 uses High Bandwidth Memory 2 (HBM2) for its DRAM. HBM2 memories are stacked on a single silicon package along with the GPU die. This allows much wider interfaces at similar power compared to traditional GDDR technology. GP100 is linked to up to four stacks of HBM2 and uses two 512-bit memory controllers for each stack. The effective width of the memory bus is then 4096 bits, a significant increase over the 384 bits in GM200. This allows a tremendous boost in peak bandwidth even at reduced memory clocks. Thus, the GP100 equipped Tesla P100 has a peak bandwidth of 732 GB/s with a modest 715 MHz memory clock.
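
The quoted 732 GB/s follows directly from the bus width and clock (a quick arithmetic check, assuming double-data-rate transfers, i.e., two transfers per clock):

```python
# Peak-bandwidth check for the Tesla P100 figures quoted above.
bus_bits = 4096                 # effective memory-bus width
clock_hz = 715e6                # 715 MHz memory clock
transfers_per_clock = 2         # double data rate

bytes_per_sec = bus_bits / 8 * clock_hz * transfers_per_clock
print(f"{bytes_per_sec / 1e9:.0f} GB/s")  # 732 GB/s
```

For comparison, the same formula with GM200's 384-bit bus requires a roughly 10x higher transfer rate to match this figure, which is exactly the trade HBM2 makes: very wide and relatively slow.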

  1. GP100 provides ECC protection. On earlier architectures this typically cost 6.25% of DRAM capacity and about 20% of effective bandwidth, but HBM2 provides dedicated ECC resources, enabling overhead-free ECC.

Like Kepler GK210, the GP100 GPU's register files, shared memories, L1 and L2 caches, and DRAM are all protected by Single-Error Correct Double-Error Detect (SECDED) ECC code. When enabling ECC support on a Kepler GK210, the available DRAM would be reduced by 6.25% to allow for the storage of ECC bits. Fetching ECC bits for each memory transaction also reduced the effective bandwidth by approximately 20% compared to the same GPU with ECC disabled. HBM2 memories, on the other hand, provide dedicated ECC resources, allowing overhead-free ECC protection.
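
The ECC overheads quoted above can be made concrete with a small calculation (the 12 GB board capacity is a hypothetical example, not a figure from the text):

```python
# Back-of-envelope cost of inline ECC on Kepler GK210, per the figures above.
dram_gb = 12.0                  # hypothetical board capacity
ecc_capacity_overhead = 0.0625  # 6.25% of DRAM reserved for ECC bits
ecc_bandwidth_overhead = 0.20   # ~20% effective-bandwidth reduction

usable_gb = dram_gb * (1 - ecc_capacity_overhead)
print(f"Usable DRAM with ECC enabled: {usable_gb:.2f} GB")  # 11.25 GB

# On GP100, HBM2 stores ECC bits in dedicated resources,
# so both overheads drop to zero.
```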

  1. By default, GP100 caches global loads in the L1/texture cache. In contrast, GP104 (like Kepler and Maxwell) caches global loads only in L2, but it allows opting in to L1/texture caching of all global loads by passing -Xptxas -dlcm=ca to nvcc at compile time.

By default, GP100 caches global loads in the L1/Texture cache. In contrast, GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler. As with previous architectures, GP104 allows the developer to opt-in to caching all global loads in the unified L1/Texture cache by passing the -Xptxas -dlcm=ca flag to nvcc at compile time.

  1. Kepler serviced global loads at 128-byte granularity when L1 caching was enabled and at 32-byte granularity otherwise. On Pascal the access granularity is always 32 bytes.

Kepler serviced loads at a granularity of 128B when L1 caching of global loads was enabled and 32B otherwise. On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.
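
A small model of what the 32-byte granularity means for a warp's load pattern (illustrative only, not an exact hardware simulation; addresses and strides are hypothetical):

```python
# Count the 32-byte sectors touched when each of a warp's 32 threads loads
# one 4-byte word at a given stride (in words).
def sectors_touched(base_addr, stride_words, warp_size=32, word_bytes=4,
                    sector_bytes=32):
    addrs = [base_addr + i * stride_words * word_bytes for i in range(warp_size)]
    return len({a // sector_bytes for a in addrs})

print(sectors_touched(0, 1))  # unit stride: 128 B / 32 B = 4 sectors
print(sectors_touched(0, 8))  # stride 8: each thread in its own sector = 32
```

With 128-byte granularity (Kepler with L1 enabled), a single stray thread pulled in a full 128-byte line; with Pascal's fixed 32-byte sectors, scattered accesses waste at most 28 of every 32 bytes fetched.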

  1. Like Maxwell, Pascal provides native shared-memory atomic instructions for 32-bit integer arithmetic, along with 32-bit and 64-bit compare-and-swap (CAS). Pascal also adds FP64 atomic addition in global memory; atomicAdd() now supports 32-bit and 64-bit integer and floating-point operands.

Like Maxwell, Pascal provides native shared memory atomic operations for 32-bit integer arithmetic, along with native 32 or 64-bit compare-and-swap (CAS). Pascal also extends atomic addition in global memory to function on FP64 data. The atomicAdd() function in CUDA has thus been generalized to support 32 and 64-bit integer and floating-point types.

  1. On GP100, atomic operations may target memory on peer GPUs connected via NVLink, using the same API as atomics on global memory; GPUs connected via PCIe do not support this. When an atomic targets a migratable address backed by a remote memory space, the local processor page-faults so that the relevant page can be migrated to local memory.

For GP100 atomic operations may target the memories of peer GPUs connected through NVLink. Peer-to-peer atomics over NVLink use the same API as atomics targeting global memory. GPUs connected via PCIE do not support this feature. When an atomic targets a migratable address backed by a remote memory space, the local processor page-faults so that the kernel can migrate the appropriate memory page to local memory.

  1. Unlike Kepler, where L1 and shared memory shared the same on-chip storage, Maxwell and Pascal merge the L1 and texture caches and give each SM dedicated shared-memory storage: 64 KB per SM on GP100 and 96 KB per SM on GP104.

For Kepler, shared memory and the L1 cache shared the same on-chip storage. Maxwell and Pascal, by contrast, provide dedicated space to the shared memory of each SM, since the functionality of the L1 and texture caches have been merged. This increases the shared memory space available per SM as compared to Kepler: GP100 offers 64 KB shared memory per SM, and GP104 provides 96 KB per SM.

  1. Kepler offered an optional 8-byte shared-memory bank mode; Pascal follows Maxwell in returning to fixed 4-byte banks, so shared-memory accesses to 4-byte types such as int and fp32 achieve full bandwidth.

Kepler provided an optional 8-byte shared memory banking mode. Pascal follows Maxwell in returning to fixed four-byte banks.
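
A sketch of the fixed 4-byte banking and how conflicts arise (an illustrative model with 32 banks; same-word broadcasts are treated as conflict-free, as on real hardware):

```python
from collections import Counter

# Worst-case conflict degree for one warp's shared-memory access:
# 1 means conflict-free; N means an N-way conflict (serialized N times).
def max_bank_conflict_degree(byte_addrs, banks=32, bank_bytes=4):
    words = {a // bank_bytes for a in byte_addrs}  # same word -> broadcast
    hits = Counter(w % banks for w in words)
    return max(hits.values())

# 32 threads reading consecutive 4-byte words: one thread per bank.
print(max_bank_conflict_degree([4 * i for i in range(32)]))  # 1
# 32 threads reading consecutive 8-byte doubles: two words per bank.
print(max_bank_conflict_degree([8 * i for i in range(32)]))  # 2
# All 32 threads reading the same word: broadcast, no conflict.
print(max_bank_conflict_degree([0] * 32))                    # 1
```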

  1. GP100 adds Compute Preemption, which allows compute tasks running on the GPU to be interrupted at instruction-level granularity; the execution context (registers, shared memory, etc.) is saved to GPU DRAM so that another application can be swapped in and run.

Compute Preemption is a new feature specific to GP100. Compute Preemption allows compute tasks running on the GPU to be interrupted at instruction-level granularity. The execution context (registers, shared memory, etc.) is swapped to GPU DRAM so that another application can be swapped in and run.

  1. Pascal uses a 49-bit virtual address space, large enough to cover the full 48-bit virtual address space of modern CPUs plus the memories of all GPUs in the system, unifying memory under a single address space. Pascal also supports memory page faulting, which lets applications access managed memory from both host and device without explicit synchronization: instead of pre-synchronizing all managed allocations before each kernel launch, when a kernel touches a non-resident page it faults, and the page is migrated to GPU memory on demand or mapped into the GPU address space.

Pascal offers new hardware capabilities to extend Unified Memory (UM) support. An extended 49-bit virtual addressing space allows Pascal GPUs to address the full 48-bit virtual address space of modern CPUs as well as the memories of all GPUs in the system through a single virtual address space, not limited by the physical memory sizes of any one processor. Pascal GPUs also support memory page faulting. Page faulting allows applications to access the same managed memory allocations from both host and device without explicit synchronization. It also removes the need for the CUDA runtime to pre-synchronize all managed memory allocations before each kernel launch. Instead, when a kernel accesses a non-resident memory page, it faults, and the page can be migrated to the GPU memory on-demand, or mapped into the GPU address space for access over PCIe/NVLink interfaces.

  1. On supported operating systems, memory on both the GPU and the CPU can be accessed through the same pointer; in fact, the GPU can access the entire system virtual memory, so on such systems there is no need to explicitly allocate managed memory with cudaMallocManaged().

On supporting operating system platforms, any memory allocated with the default OS allocator (for example, malloc or new) can be accessed from both GPU and CPU code using the same pointer. In fact, all system virtual memory can be accessed from the GPU. On such systems, there is no need to explicitly allocate managed memory using cudaMallocManaged().

Configuration data for the Pascal-based Tesla P40
