The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture.

This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.

1. The Heart of Maxwell: More Efficient Multiprocessors

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.

Improved Instruction Scheduling

The number of CUDA Cores per SM has been reduced to a power of two, however with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means CUDA cores per GPU will be substantially higher versus comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.

As with SMX, each SMM has four warp schedulers, but unlike SMX, all core SMM functional units are assigned to a particular scheduler, with no shared units. The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM’s warp schedulers issue to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory
operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.

Increased Occupancy for Existing Code

In terms of CUDA compute capability, Maxwell’s SM is CC 5.0. SMM is
similar in many respects to the Kepler architecture’s SMX, with key
enhancements geared toward improving efficiency without requiring
significant increases in available parallelism per SM from the
application. The register file size and the maximum number of concurrent
warps in SMM are the same as in SMX (64k 32-bit registers and 64 warps,
respectively), as is the maximum number of registers per thread (255).
However the maximum number of active thread blocks per multiprocessor
has been doubled over SMX to 32, which should result in an automatic
occupancy improvement for kernels that use small thread blocks of 64 or
fewer threads (assuming available registers and shared memory are not
the occupancy limiter). Table 1 provides a comparison between key
characteristics of Maxwell GM107 and its predecessor Kepler GK107.

Reduced Arithmetic Instruction Latency

Another major improvement of SMM is that dependent arithmetic
instruction latencies have been significantly reduced. Because occupancy
(which translates to available warp-level parallelism) is the same or
better on SMM than on SMX, these reduced latencies improve utilization
and throughput.

Table 1. A Comparison of Maxwell GM107 to Kepler GK107
GPU GK107 (Kepler) GM107 (Maxwell)
CUDA Cores 384 640
Base Clock 1058 MHz 1020 MHz
GPU Boost Clock N/A 1085 MHz
GFLOP/s 812.5 1305.6
Compute Capability 3.0 5.0
Shared Memory / SM 16KB / 48 KB 64 KB
Register File Size / SM 256 KB 256 KB
Active Blocks / SM 16 32
Memory Clock 5000 MHz 5400 MHz
Memory Bandwidth 80 GB/s 86.4 GB/s
L2 Cache Size 256 KB 2048 KB
TDP 64W 60W
Transistors 1.3 Billion 1.87 Billion
Die Size 118 mm2 148 mm2
Manufactoring Process 28 nm 28 nm

2. Larger, Dedicated Shared Memory

A significant improvement in SMM is that it provides 64KB of
dedicated shared memory per SM—unlike Fermi and Kepler, which
partitioned the 64KB of memory between L1 cache and shared memory. The
per-thread-block limit remains 48KB on Maxwell, but the increase in
total available shared memory can lead to occupancy
improvements. Dedicated shared memory is made possible in Maxwell by
combining the functionality of the L1 and texture caches into a single
unit.

3. Fast Shared Memory Atomics

Maxwell provides native shared memory atomic operations for 32-bit
integers and native shared memory 32-bit and 64-bit compare-and-swap
(CAS), which can be used to implement other atomic functions. In
contrast, the Fermi and Kepler architectures implemented shared memory
atomics using a lock/update/unlock pattern that could be expensive in
the presence of high contention for updates to particular locations in
shared memory.

4. Support for Dynamic Parallelism

Kepler GK110 introduced a new architectural feature called Dynamic
Parallelism, which allows the GPU to create additional work for itself. A
programming model enhancement leveraging this feature was introduced in
CUDA 5.0 to enable threads running on GK110 to launch additional
kernels onto the same GPU.

SMM brings Dynamic Parallelism into the mainstream by supporting it
across the product line, even in lower-power chips such as GM107. This
will benefit developers, because it means that applications will no
longer need special-case algorithm implementations for high-end GPUs
that differ from those usable in more power constrained environments.

5. Learn More about Programming Maxwell

For more architecture details and guidance on optimizing your code for Maxwell, I encourage you to check out the Maxwell Tuning Guide and Maxwell Compatibility Guide, which are available now to CUDA Registered Developers.

5 Things You Should Know About the New Maxwell GPU Architecture的更多相关文章

  1. Angular2学习笔记(1)

    Angular2学习笔记(1) 1. 写在前面 之前基于Electron写过一个Markdown编辑器.就其功能而言,主要功能已经实现,一些小的不影响使用的功能由于时间关系还没有完成:但就代码而言,之 ...

  2. 动画requestAnimationFrame

    前言 在研究canvas的2D pixi.js库的时候,其动画的刷新都用requestAnimationFrame替代了setTimeout 或 setInterval 但是jQuery中还是采用了s ...

  3. 【AR实验室】OpenGL ES绘制相机(OpenGL ES 1.0版本)

    0x00 - 前言 之前做一些移动端的AR应用以及目前看到的一些AR应用,基本上都是这样一个套路:手机背景显示现实场景,然后在该背景上进行图形学绘制.至于图形学绘制时,相机外参的解算使用的是V-SLA ...

  4. iOS代码规范(OC和Swift)

    下面说下iOS的代码规范问题,如果大家觉得还不错,可以直接用到项目中,有不同意见 可以在下面讨论下. 相信很多人工作中最烦的就是代码不规范,命名不规范,曾经见过一个VC里有3个按钮被命名为button ...

  5. 梅须逊雪三分白,雪却输梅一段香——CSS动画与JavaScript动画

    CSS动画并不是绝对比JavaScript动画性能更优越,开源动画库Velocity.js等就展现了强劲的性能. 一.两者的主要区别 先开门见山的说说两者之间的区别. 1)CSS动画: 基于CSS的动 ...

  6. 阿里巴巴直播内容风险防控中的AI力量

    直播作为近来新兴的互动形态和今年阿里巴巴双十一的一大亮点,其内容风险监控是一个全新的课题,技术的挑战非常大,管控难点主要包括业界缺乏成熟方案和标准.主播行为.直播内容不可控.峰值期间数千路高并发处理. ...

  7. 虾扯蛋:Android View动画 Animation不完全解析

    本文结合一些周知的概念和源码片段,对View动画的工作原理进行挖掘和分析.以下不是对源码一丝不苟的分析过程,只是以搞清楚Animation的执行过程.如何被周期性调用为目标粗略分析下相关方法的执行细节 ...

  8. 【探索】无形验证码 —— PoW 算力验证

    先来思考一个问题:如何写一个能消耗对方时间的程序? 消耗时间还不简单,休眠一下就可以了: Sleep(1000) 这确实消耗了时间,但并没有消耗 CPU.如果对方开了变速齿轮,这瞬间就能完成. 不过要 ...

  9. 对抗密码破解 —— Web 前端慢 Hash

    (更新:https://www.cnblogs.com/index-html/p/frontend_kdf.html ) 0x00 前言 天下武功,唯快不破.但在密码学中则不同.算法越快,越容易破. ...

  10. H5单页面手势滑屏切换原理

    H5单页面手势滑屏切换是采用HTML5 触摸事件(Touch) 和 CSS3动画(Transform,Transition)来实现的,效果图如下所示,本文简单说一下其实现原理和主要思路. 1.实现原理 ...

随机推荐

  1. mongodb中的__v字段

    "__v"是"versionKey"的简写,当每一个文档由mongoose创建时就会自动添加,代表这该文档的版本,此属性可配置修改,默认为"__v&q ...

  2. 大话设计模式之PHP篇 - 策略模式

    什么是策略模式? 定义:策略模式定义了一系列的算法,并将每一个算法封装起来,而且使它们还可以相互替换.策略模式让算法独立于使用它的客户而独立变化. 组成:抽象策略角色: 策略类,通常由一个接口或者抽象 ...

  3. [转载]ORA-00313:无法打开日志组1(线程 1)的成员_ORA-00312:

    原文地址:1)的成员_ORA-00312:">ORA-00313:无法打开日志组1(线程 1)的成员_ORA-00312:作者:Sweet_薇薇毅 今天用系统清理工具把系统垃圾清理了一 ...

  4. 用简单的反射优化代码(动态web项目)

    在动态web项目中,没有使用框架时,只是简单的jsp访问servlet实现增删改查, 无论是哪个方法都需要经过Servlet中的doGet()方法或doPost()方法,我们可以在链接中附带参数进行区 ...

  5. Spark性能优化指南--基础篇

    前言 开发调优 调优概述 原则一:避免创建重复的RDD 原则二:尽可能复用同一个RDD 原则三:对多次使用的RDD进行持久化 原则四:尽量避免使用shuffle类算子 原则五:使用map-side预聚 ...

  6. Spring data jpa 实现简单动态查询的通用Specification方法

    本篇前提: SpringBoot中使用Spring Data Jpa 实现简单的动态查询的两种方法 这篇文章中的第二种方法 实现Specification 这块的方法 只适用于一个对象针对某一个固定字 ...

  7. "阿拉伯""伊斯兰""穆斯林"三个概念怎么分?

    伊斯兰.阿拉伯.穆斯林这三个概念到底有什么不同?要言君将用五分钟给您概述这三个概念,并厘清其边界,说明其交集,帮您迅速构建"阿拉伯.伊斯兰.穆斯林"知识结构概图.相信您得沉思一下费 ...

  8. python中正则表达式的一些问题

    今天听到一句话,觉得很在理——"当你遇到一个问题,想到用正则表达式解决时,就变成了两个问题" 这也从侧面说明了正则表达式比较难理解.下面我将用通俗易懂的方式总结一下,最近遇到的一些 ...

  9. nyojb 2357

    http://acm.nyist.me/JudgeOnline/problem.php?id=2357 2357: 插塔憋憋乐 时间限制: 1 Sec  内存限制: 128 MB提交: 50  解决: ...

  10. Python学习之路day3-字符编码与转码

    一.基础概念 字符与字节 字符是相对于人类而言的可识别的符号标识,是一种人类语言,如中文.英文.拉丁文甚至甲骨文.梵语等等.    字节是计算机内部识别可用的符号标识(0和1组成的二进制串,机器语言) ...