The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture.

This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.

1. The Heart of Maxwell: More Efficient Multiprocessors

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.

Improved Instruction Scheduling

The number of CUDA Cores per SM has been reduced to a power of two, however with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means CUDA cores per GPU will be substantially higher versus comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.

As with SMX, each SMM has four warp schedulers, but unlike SMX, all core SMM functional units are assigned to a particular scheduler, with no shared units. The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM’s warp schedulers issue to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory
operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.

Increased Occupancy for Existing Code

In terms of CUDA compute capability, Maxwell’s SM is CC 5.0. SMM is
similar in many respects to the Kepler architecture’s SMX, with key
enhancements geared toward improving efficiency without requiring
significant increases in available parallelism per SM from the
application. The register file size and the maximum number of concurrent
warps in SMM are the same as in SMX (64k 32-bit registers and 64 warps,
respectively), as is the maximum number of registers per thread (255).
However the maximum number of active thread blocks per multiprocessor
has been doubled over SMX to 32, which should result in an automatic
occupancy improvement for kernels that use small thread blocks of 64 or
fewer threads (assuming available registers and shared memory are not
the occupancy limiter). Table 1 provides a comparison between key
characteristics of Maxwell GM107 and its predecessor Kepler GK107.

Reduced Arithmetic Instruction Latency

Another major improvement of SMM is that dependent arithmetic
instruction latencies have been significantly reduced. Because occupancy
(which translates to available warp-level parallelism) is the same or
better on SMM than on SMX, these reduced latencies improve utilization
and throughput.

Table 1. A Comparison of Maxwell GM107 to Kepler GK107
GPU GK107 (Kepler) GM107 (Maxwell)
CUDA Cores 384 640
Base Clock 1058 MHz 1020 MHz
GPU Boost Clock N/A 1085 MHz
GFLOP/s 812.5 1305.6
Compute Capability 3.0 5.0
Shared Memory / SM 16KB / 48 KB 64 KB
Register File Size / SM 256 KB 256 KB
Active Blocks / SM 16 32
Memory Clock 5000 MHz 5400 MHz
Memory Bandwidth 80 GB/s 86.4 GB/s
L2 Cache Size 256 KB 2048 KB
TDP 64W 60W
Transistors 1.3 Billion 1.87 Billion
Die Size 118 mm2 148 mm2
Manufactoring Process 28 nm 28 nm

2. Larger, Dedicated Shared Memory

A significant improvement in SMM is that it provides 64KB of
dedicated shared memory per SM—unlike Fermi and Kepler, which
partitioned the 64KB of memory between L1 cache and shared memory. The
per-thread-block limit remains 48KB on Maxwell, but the increase in
total available shared memory can lead to occupancy
improvements. Dedicated shared memory is made possible in Maxwell by
combining the functionality of the L1 and texture caches into a single
unit.

3. Fast Shared Memory Atomics

Maxwell provides native shared memory atomic operations for 32-bit
integers and native shared memory 32-bit and 64-bit compare-and-swap
(CAS), which can be used to implement other atomic functions. In
contrast, the Fermi and Kepler architectures implemented shared memory
atomics using a lock/update/unlock pattern that could be expensive in
the presence of high contention for updates to particular locations in
shared memory.

4. Support for Dynamic Parallelism

Kepler GK110 introduced a new architectural feature called Dynamic
Parallelism, which allows the GPU to create additional work for itself. A
programming model enhancement leveraging this feature was introduced in
CUDA 5.0 to enable threads running on GK110 to launch additional
kernels onto the same GPU.

SMM brings Dynamic Parallelism into the mainstream by supporting it
across the product line, even in lower-power chips such as GM107. This
will benefit developers, because it means that applications will no
longer need special-case algorithm implementations for high-end GPUs
that differ from those usable in more power constrained environments.

5. Learn More about Programming Maxwell

For more architecture details and guidance on optimizing your code for Maxwell, I encourage you to check out the Maxwell Tuning Guide and Maxwell Compatibility Guide, which are available now to CUDA Registered Developers.

5 Things You Should Know About the New Maxwell GPU Architecture的更多相关文章

  1. Angular2学习笔记(1)

    Angular2学习笔记(1) 1. 写在前面 之前基于Electron写过一个Markdown编辑器.就其功能而言,主要功能已经实现,一些小的不影响使用的功能由于时间关系还没有完成:但就代码而言,之 ...

  2. 动画requestAnimationFrame

    前言 在研究canvas的2D pixi.js库的时候,其动画的刷新都用requestAnimationFrame替代了setTimeout 或 setInterval 但是jQuery中还是采用了s ...

  3. 【AR实验室】OpenGL ES绘制相机(OpenGL ES 1.0版本)

    0x00 - 前言 之前做一些移动端的AR应用以及目前看到的一些AR应用,基本上都是这样一个套路:手机背景显示现实场景,然后在该背景上进行图形学绘制.至于图形学绘制时,相机外参的解算使用的是V-SLA ...

  4. iOS代码规范(OC和Swift)

    下面说下iOS的代码规范问题,如果大家觉得还不错,可以直接用到项目中,有不同意见 可以在下面讨论下. 相信很多人工作中最烦的就是代码不规范,命名不规范,曾经见过一个VC里有3个按钮被命名为button ...

  5. 梅须逊雪三分白,雪却输梅一段香——CSS动画与JavaScript动画

    CSS动画并不是绝对比JavaScript动画性能更优越,开源动画库Velocity.js等就展现了强劲的性能. 一.两者的主要区别 先开门见山的说说两者之间的区别. 1)CSS动画: 基于CSS的动 ...

  6. 阿里巴巴直播内容风险防控中的AI力量

    直播作为近来新兴的互动形态和今年阿里巴巴双十一的一大亮点,其内容风险监控是一个全新的课题,技术的挑战非常大,管控难点主要包括业界缺乏成熟方案和标准.主播行为.直播内容不可控.峰值期间数千路高并发处理. ...

  7. 虾扯蛋:Android View动画 Animation不完全解析

    本文结合一些周知的概念和源码片段,对View动画的工作原理进行挖掘和分析.以下不是对源码一丝不苟的分析过程,只是以搞清楚Animation的执行过程.如何被周期性调用为目标粗略分析下相关方法的执行细节 ...

  8. 【探索】无形验证码 —— PoW 算力验证

    先来思考一个问题:如何写一个能消耗对方时间的程序? 消耗时间还不简单,休眠一下就可以了: Sleep(1000) 这确实消耗了时间,但并没有消耗 CPU.如果对方开了变速齿轮,这瞬间就能完成. 不过要 ...

  9. 对抗密码破解 —— Web 前端慢 Hash

    (更新:https://www.cnblogs.com/index-html/p/frontend_kdf.html ) 0x00 前言 天下武功,唯快不破.但在密码学中则不同.算法越快,越容易破. ...

  10. H5单页面手势滑屏切换原理

    H5单页面手势滑屏切换是采用HTML5 触摸事件(Touch) 和 CSS3动画(Transform,Transition)来实现的,效果图如下所示,本文简单说一下其实现原理和主要思路. 1.实现原理 ...

随机推荐

  1. Shell编程之Expect自动化交互程序

    一.Expect自动化交互程序 1.spawn命令 通过spawn执行一个命令或程序,之后所有的Expect操作都会在这个执行过的命令或程序进程中进行,包括自动交互功能. 语法: spawn [ 选项 ...

  2. Mysql 主从复制搭建

    Mysql 主重复制搭建 Linux版本:Linux Centos 6.4 32位 Mysql版本:Mysql-5.6.38-linux-glibc2.12-i686 Mysql安装:Mysql安装教 ...

  3. Openfire部署和配置说明

    一.程序部署 1.1 程序和脚本 将文件拷贝到对应目录下,文件包括:Openfire.tar和setup.sh脚本.Openfire.tar为可执行文件库.配置等的压缩包,setup.sh为解压和部署 ...

  4. Linux 配置 SSL 证书

    完整的 SSL 证书分为四个部分: CA 根证书 (root CA) 中级证书 (Intermediate Certificate) 域名证书 证书密钥 (仅由您持有) 以 COMODO Positi ...

  5. mysql服务性能优化 my.cnf my.ini配置说明详解(16G内存)

    sort_buffer_size,join_buffer_size,read_buffer_size参数对应的分配内存也是每个连接独享 这配置已经优化的不错了,如果你的mysql没有什么特殊情况的话, ...

  6. javax.mail.MessagingException: Could not connect to SMTP host: smtp.xdf.cn

    1.问题描述:关于使用Java Mail进行邮件发送,抛出Could not connect to SMTP host: xx@xxx.com, port: 25的异常可能: 当我们使用Java Ma ...

  7. RabbitMQ自动补偿机制(消费者)及幂等问题

    如果消费者 运行时候 报错了 package com.toov5.msg.SMS; import org.springframework.amqp.rabbit.annotation.RabbitHa ...

  8. ajax02-XMLHttpRequest 对象的使用

    XMLHttpRequest 是 AJAX 的基础,用于在后台与服务器交换数据.这意味着可以在不重新加载整个网页的情况下,对网页的某部分进行更新. XMLHttpRequest 对象 所有现代浏览器均 ...

  9. Spring data jpa 使用技巧记录

    软件152 尹以操 最近在用Springboot 以及Spring data jpa  ,使用jpa可以让我更方便的操作数据库,特开此帖记录使用jpa的一些小技巧. 一.使用spring data j ...

  10. UOJ283 直径拆除鸡

    本文版权归ljh2000和博客园共有,欢迎转载,但须保留此声明,并给出原文链接,谢谢合作. 本文作者:ljh2000 作者博客:http://www.cnblogs.com/ljh2000-jump/ ...