A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer
Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-related posts in the future, but this series is long enough already. We’ve been touring all the regular parts of the graphics pipeline, down to different levels of detail. Which leaves one major new feature introduced in DX11 out: Compute Shaders. So that’s gonna be my topic this time around.
For this series, the emphasis has been on overall dataflow at the architectural level, not shader execution (which is explained well elsewhere). For the stages so far, that meant focusing on the input piped into and output produced by each stage; the way the internals work was usually dictated by the shape of the data. Compute shaders are different – they’re running by themselves, not as part of the graphics pipeline, so the surface area of their interface is much smaller.
In fact, on the input side, there’s not really any buffers for input data at all. The only input Compute Shaders get, aside from API state such as the bound Constant Buffers and resources, is their thread index. There’s a tremendous potential for confusion here, so here’s the most important thing to keep in mind: a “thread” is the atomic unit of dispatch in the CS environment, and it’s a substantially different beast from the threads provided by the OS that you probably associate with the term. CS threads have their own identity and registers, but they don’t have their own Program Counter (Instruction Pointer) or stack, nor are they scheduled individually.
That explains the “thread” and “warp” levels. Above that is the “thread group” level, which deals with – who would’ve thought? – groups of threads. The size of a thread group is specified during shader compilation. In DX11, a thread group can contain anywhere between 1 and 1024 threads, and the thread group size is specified not as a single number but as a 3-tuple giving thread x, y, and z coordinates. This numbering scheme is mostly for the convenience of shader code that addresses 2D or 3D resources, though it also allows for traversal optimizations. At the macro level, CS execution is dispatched in multiples of thread groups; thread group IDs in D3D11 again use 3D group IDs, same as thread IDs, and for pretty much the same reasons.
The above description makes it sound like thread groups are a fairly arbitrary middle level in this hierarchy. However, there’s one important bit missing that makes thread groups very special indeed: Thread Group Shared Memory (TGSM). On DX11 level hardware, compute shaders have access to 32k of TGSM, which is basically a scratchpad for communication between threads in the same group. This is the primary (and fastest) way by which different CS threads can communicate.
Group Synchronization. A Group Synchronization Barrier forces all threads inside the current group to reach the barrier before any of them may consume past it. Once a Warp reaches such a barrier, it will be flagged as non-runnable, same as if it was waiting for a memory or texture access to complete. Once the last Warp reaches the barrier, the remaining Warps will be reactivated. This all happens at the Warp scheduling level; it adds additional scheduling constraints, which may cause stalls, but there’s no need for atomic memory transactions or anything like that; other than lost utilization at the micro level, this is a reasonably cheap operation.
Group Memory Barriers. Since all threads within a group run on the same shader unit, this basically amounts to a pipeline flush, to ensure that all pending shared memory operations are completed. There’s no need to synchronize with resources external to the current shader unit, which means it’s again reasonably cheap.
DX11 offers different types of barriers that combine several of the above components into one atomic unit; the semantics should be obvious.
We’ve now dealt with CS input and learned a bit about CS execution. But where do we put our output data? The answer has the unwieldy name “unordered access views”, or UAVs for short. An UAV seems somewhat similar to render targets in Pixel Shaders (and UAVs can in fact be used in addition to render targets in Pixel Shaders), but there’s some very important semantic differences:
In current CPUs, most of the magic for shared memory processing is handled by the memory hierarchy (i.e. caches). To write to a piece of memory, the active core must first assert exclusive ownership of the corresponding cache line. This is accomplished using what’s called a “cache coherency protocol”, usually MESI and descendants. The details are tangential to this article; what matters is that because writing to memory entails acquiring exclusive ownership, there’s never a risk of two cores simultaneously trying to write to the some location. In such a model, atomic operations can be implemented by holding exclusive ownership for the duration of the operation; if we had exclusive ownership for the whole time, there’s no chance that someone else was trying to write to the same location while we were performing the atomic operation. Again, the actual details of this get hairy pretty fast (especially as soon as things like paging, interrupts and exceptions get involved), but the 30000-feet-view will suffice for the purposes of this article.
One final remark is that, of course, outstanding atomic operations count as “device memory” accesses, same as memory/texture reads and UAV writes; shader units need to keep track of their outstanding atomic operations and make sure they’re finished when they hit device memory access barriers.
Unless I missed something, these two buffer types are the last CS-related features I haven’t talked about yet. And, well, from a hardware perspective, there’s not that much to talk about, really. Structured buffers are more of a hint to the driver-internal shader compiler than anything else; they give the driver some hint as to how they’re going to be used – namely, they consist of elements with a fixed stride that are likely going to be accessed together – but they still compile down to regular memory accesses in the end. The structured buffer part may bias the driver’s decision of their position and layout in memory, but it does not add any fundamentally new functionality to the model.
Append/consume buffers are similar; they could be implemented using the existing atomic instructions. In fact, they kind of are, except the append/consume pointers aren’t at an explicit location in the resource, they’re side-band data outside the resource that are accessed using special atomic instructions. (And similarly to structured buffers, the fact that their usage is declared as append/consume buffer allows the driver to pick their location in memory appropriately).
And… that’s it. No more previews for the next part, this series is done :), though that doesn’t mean I’m done with it. I have some restructuring and partial rewriting to do – these blog posts are raw and unproofed, and I intend to go over them and turn it into a single document. In the meantime, I’ll be writing about other stuff here. I’ll try to incorporate the feedback I got so far – if there’s any other questions, corrections or comments, now’s the time to tell me! I don’t want to nail down the ETA for the final cleaned-up version of this series, but I’ll try to get it down well before the end of the year. We’ll see. Until then, thanks for reading!
A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer的更多相关文章
- A trip through the Graphics Pipeline 2011_10_Geometry Shaders
Welcome back. Last time, we dove into bottom end of the pixel pipeline. This time, we’ll switch ...
- A trip through the Graphics Pipeline 2011_12 Tessellation
Welcome back! This time, we’ll look into what is perhaps the “poster boy” feature introduced with th ...
- A trip through the Graphics Pipeline 2011_08_Pixel processing – “fork phase”
In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel sha ...
- A trip through the Graphics Pipeline 2011_07_Z/Stencil processing, 3 different ways
In this installment, I’ll be talking about the (early) Z pipeline and how it interacts with rasteriz ...
- A trip through the Graphics Pipeline 2011_05
After the last post about texture samplers, we’re now back in the 3D frontend. We’re done with verte ...
- A trip through the Graphics Pipeline 2011_04
Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ...
- A trip through the Graphics Pipeline 2011_03
At this point, we’ve sent draw calls down from our app all the way through various driver layers and ...
- A trip through the Graphics Pipeline 2011_02
Welcome back. Last part was about vertex shaders, with some coverage of GPU shader units in general. ...
- A trip through the Graphics Pipeline 2011_11 Stream Out
Welcome back! This time, the focus is going to be on Stream-Out (SO). This is a facility for storing ...
随机推荐
- MFC HTTP访问URL
unsigned short nPort; //用于保存目标HTTP服务端口 CString strServer, strObject; //strServer用于保存服务器地址,strObject用 ...
- django 数据库交互2
打开django shell python manage.py shell 输入命令 >>> from myapp import * >>> MySite.obje ...
- 基于webpack的前端工程化开发解决方案探索(一):动态生成HTML(转)
1.什么是工程化开发 软件工程的工程化开发概念由来已久,但对于前端开发来说,我们没有像VS或者eclipse这样量身打造的IDE,因为在大多数人眼中,前端代码无需编译,因此只要一个浏览器来运行调试就行 ...
- BZOJ3212 Pku3468 A Simple Problem with Integers 题解
题目大意: 一个数列,有两个操作:1.修改操作,将一段区间内的数加上c:2.查询操作,查询一段区间内的数的和. 思路: 线段树裸题,区间修改.区间查询,维护和以及加上的数,由于无序,不需要向下推标记, ...
- Android PhoneGap 利用 Activity 实现 CordovaInterface
1.修改main.xml <?xml version="1.0" encoding="utf-8"?> <LinearLayout xmlns ...
- linux 远程桌面的配置
root根用户登录在/root目录下#rpm -q vnc vnc-server查询已安装vnc的客户端和服务器端vnc-4.1.1-36vnc-server-4.1.1-36 #vi /etc/sy ...
- 【BZOJ】1987: Zju2672 Fibonacci Subsequence
题意 给出一个序列\(A\),求一个最长的满足fib性质的子序列,输出其长度及其元素(如果多种方案,输出位置最靠前的).(\(n \le 3000\)) 题解 容易想到dp,即\(d(i, j)\)表 ...
- 《你必须知道的.NET》书中对OCP(开放封闭)原则的阐述
开放封闭原则(OCP,Open Closed Principle)是面向对象原则的核心.由于软件设计本身所追求的墓边就是封装变化,降低耦合,而开放封闭原则就是对这一目标的直接体现.(你必须知道的.NE ...
- iOS上让按钮文本左对齐问题
一,问题分析 1.在做历史记录视图的时候,由于让键盘退出后才能触发表格的 didselect 那个代理方法,也就是得点两下才触发,而表格中的按钮点一下就可以立即响应. 2.于是我就有了用按钮事件代替 ...
- SpringMVC——接收请求参数和页面传参
Spring接收请求参数: 1.使用HttpServletRequest获取 @RequestMapping("/login.do") public String login(Ht ...