A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer
Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-related posts in the future, but this series is long enough already. We’ve been touring all the regular parts of the graphics pipeline, down to different levels of detail. Which leaves one major new feature introduced in DX11 out: Compute Shaders. So that’s gonna be my topic this time around.
For this series, the emphasis has been on overall dataflow at the architectural level, not shader execution (which is explained well elsewhere). For the stages so far, that meant focusing on the input piped into and output produced by each stage; the way the internals work was usually dictated by the shape of the data. Compute shaders are different – they’re running by themselves, not as part of the graphics pipeline, so the surface area of their interface is much smaller.
In fact, on the input side, there’s not really any buffers for input data at all. The only input Compute Shaders get, aside from API state such as the bound Constant Buffers and resources, is their thread index. There’s a tremendous potential for confusion here, so here’s the most important thing to keep in mind: a “thread” is the atomic unit of dispatch in the CS environment, and it’s a substantially different beast from the threads provided by the OS that you probably associate with the term. CS threads have their own identity and registers, but they don’t have their own Program Counter (Instruction Pointer) or stack, nor are they scheduled individually.
That explains the “thread” and “warp” levels. Above that is the “thread group” level, which deals with – who would’ve thought? – groups of threads. The size of a thread group is specified during shader compilation. In DX11, a thread group can contain anywhere between 1 and 1024 threads, and the thread group size is specified not as a single number but as a 3-tuple giving thread x, y, and z coordinates. This numbering scheme is mostly for the convenience of shader code that addresses 2D or 3D resources, though it also allows for traversal optimizations. At the macro level, CS execution is dispatched in multiples of thread groups; thread group IDs in D3D11 again use 3D group IDs, same as thread IDs, and for pretty much the same reasons.
The above description makes it sound like thread groups are a fairly arbitrary middle level in this hierarchy. However, there’s one important bit missing that makes thread groups very special indeed: Thread Group Shared Memory (TGSM). On DX11 level hardware, compute shaders have access to 32k of TGSM, which is basically a scratchpad for communication between threads in the same group. This is the primary (and fastest) way by which different CS threads can communicate.
Group Synchronization. A Group Synchronization Barrier forces all threads inside the current group to reach the barrier before any of them may consume past it. Once a Warp reaches such a barrier, it will be flagged as non-runnable, same as if it was waiting for a memory or texture access to complete. Once the last Warp reaches the barrier, the remaining Warps will be reactivated. This all happens at the Warp scheduling level; it adds additional scheduling constraints, which may cause stalls, but there’s no need for atomic memory transactions or anything like that; other than lost utilization at the micro level, this is a reasonably cheap operation.
Group Memory Barriers. Since all threads within a group run on the same shader unit, this basically amounts to a pipeline flush, to ensure that all pending shared memory operations are completed. There’s no need to synchronize with resources external to the current shader unit, which means it’s again reasonably cheap.
DX11 offers different types of barriers that combine several of the above components into one atomic unit; the semantics should be obvious.
We’ve now dealt with CS input and learned a bit about CS execution. But where do we put our output data? The answer has the unwieldy name “unordered access views”, or UAVs for short. An UAV seems somewhat similar to render targets in Pixel Shaders (and UAVs can in fact be used in addition to render targets in Pixel Shaders), but there’s some very important semantic differences:
In current CPUs, most of the magic for shared memory processing is handled by the memory hierarchy (i.e. caches). To write to a piece of memory, the active core must first assert exclusive ownership of the corresponding cache line. This is accomplished using what’s called a “cache coherency protocol”, usually MESI and descendants. The details are tangential to this article; what matters is that because writing to memory entails acquiring exclusive ownership, there’s never a risk of two cores simultaneously trying to write to the some location. In such a model, atomic operations can be implemented by holding exclusive ownership for the duration of the operation; if we had exclusive ownership for the whole time, there’s no chance that someone else was trying to write to the same location while we were performing the atomic operation. Again, the actual details of this get hairy pretty fast (especially as soon as things like paging, interrupts and exceptions get involved), but the 30000-feet-view will suffice for the purposes of this article.
One final remark is that, of course, outstanding atomic operations count as “device memory” accesses, same as memory/texture reads and UAV writes; shader units need to keep track of their outstanding atomic operations and make sure they’re finished when they hit device memory access barriers.
Unless I missed something, these two buffer types are the last CS-related features I haven’t talked about yet. And, well, from a hardware perspective, there’s not that much to talk about, really. Structured buffers are more of a hint to the driver-internal shader compiler than anything else; they give the driver some hint as to how they’re going to be used – namely, they consist of elements with a fixed stride that are likely going to be accessed together – but they still compile down to regular memory accesses in the end. The structured buffer part may bias the driver’s decision of their position and layout in memory, but it does not add any fundamentally new functionality to the model.
Append/consume buffers are similar; they could be implemented using the existing atomic instructions. In fact, they kind of are, except the append/consume pointers aren’t at an explicit location in the resource, they’re side-band data outside the resource that are accessed using special atomic instructions. (And similarly to structured buffers, the fact that their usage is declared as append/consume buffer allows the driver to pick their location in memory appropriately).
And… that’s it. No more previews for the next part, this series is done :), though that doesn’t mean I’m done with it. I have some restructuring and partial rewriting to do – these blog posts are raw and unproofed, and I intend to go over them and turn it into a single document. In the meantime, I’ll be writing about other stuff here. I’ll try to incorporate the feedback I got so far – if there’s any other questions, corrections or comments, now’s the time to tell me! I don’t want to nail down the ETA for the final cleaned-up version of this series, but I’ll try to get it down well before the end of the year. We’ll see. Until then, thanks for reading!
A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer的更多相关文章
- A trip through the Graphics Pipeline 2011_10_Geometry Shaders
		Welcome back. Last time, we dove into bottom end of the pixel pipeline. This time, we’ll switch ... 
- A trip through the Graphics Pipeline 2011_12     Tessellation
		Welcome back! This time, we’ll look into what is perhaps the “poster boy” feature introduced with th ... 
- A trip through the Graphics Pipeline 2011_08_Pixel processing – “fork phase”
		In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel sha ... 
- A trip through the Graphics Pipeline 2011_07_Z/Stencil processing, 3 different ways
		In this installment, I’ll be talking about the (early) Z pipeline and how it interacts with rasteriz ... 
- A trip through the Graphics Pipeline 2011_05
		After the last post about texture samplers, we’re now back in the 3D frontend. We’re done with verte ... 
- A trip through the Graphics Pipeline 2011_04
		Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ... 
- A trip through the Graphics Pipeline 2011_03
		At this point, we’ve sent draw calls down from our app all the way through various driver layers and ... 
- A trip through the Graphics Pipeline 2011_02
		Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ... 
- A trip through the Graphics Pipeline 2011_11    Stream Out
		Welcome back! This time, the focus is going to be on Stream-Out (SO). This is a facility for storing ... 
随机推荐
- 如何定义移动端字体Font-Family?
			1.对于IOS 手机系统,默认中文字体是Heiti SC.默认英文字体是Helvetica.默认数字字体是HelveticaNeue.无微软雅黑字体: 2.对于Android 手机系统,默认中文字体是 ... 
- dubbox编译
			dubbox编译要在命令行 切记切记 设置JAVA_HOME 设置maven路径 命令编译dubbox 设置M2_HOME环境变量 设置idea M2_HOME dubbox 服务端 http://w ... 
- jQuery-事件以及动画
			事件: 1.//方法1 $(window).load(function(){ }) window.onload=function(){ } //方法2 function one(){ alert(&q ... 
- 【转】浅谈 C++ 中的 new/delete 和 new[]/delete[]
			在 C++ 中,你也许经常使用 new 和 delete 来动态申请和释放内存,但你可曾想过以下问题呢? new 和 delete 是函数吗? new [] 和 delete [] 又是什么?什么时候 ... 
- 8VC Venture Cup 2016 - Elimination Round
			在家补补题 模拟 A - Robot Sequence #include <bits/stdc++.h> char str[202]; void move(int &x, in ... 
- Guacamole 介绍以及架构
			Guacamole的介绍以及架构 guacd Web应用程序 在Guacamole中与用户打交道的就是Web应用程序. 之前说过,Web应用程序自己不实现任何的远程桌面协议.Web应用程序依赖guac ... 
- uva748 - Exponentiation
			import java.io.*; import java.text.*; import java.util.*; import java.math.*; public class Exponenti ... 
- ccc 碰撞初步
			cc.Class({ extends: cc.Component, properties: { }, // use this for initialization onLoad: function ( ... 
- Extjs tree的相关方法及配置项
			Ext.tree.TreePanel 主要配置项: root:树的根节点. rootVisible:是否显示根节点,默认为true. ... 
- input属性控制弹出键盘类型
			/** * ios弹起数字键盘有三种方法 * 1. <input type="number"> 可以弹起带有小数点的键盘,可以键盘不干净,有其它各种字符,可切换 ... 
