Welcome back! This time, the focus is going to be on Stream-Out (SO). This is a facility for storing the Output of the Geometry Shader stage to memory, instead of sending it down the rest of the pipeline. This can be used to e.g. cache skinned vertex data, or as a sort of poor man’s Compute Shader on D3D10-level hardware using the D3D10 API (note that with D3D11, you can just use CS 4.0, even on D3D10 hardware). And just like the GS Instancing I mentioned last time, some of this is very poorly described in the API docs, so I’ll have a few comments about API usage even though it’s technically out of the intended scope of this series.

Vertex Shader Stream-Out (i.e. SO with NULL GS)
This is one of the features that’s not properly explained in the D3D10 (or D3D11, for that matter) docs; in fact, it’s not mentioned there at all except for a small throwaway remark in “Getting Started with the Stream-Output Stage (Direct3D 10)”. You’re supposed to figure it out from the examples – which themselves don’t exactly go out of their way to make it clear what’s going on. That’s a pity – VS Stream-Out is easier than GS SO, and has some pretty useful applications by itself (e.g. caching skinned vertices).

So here’s how it’s done in D3D10 and 11: You simply pass Vertex Shader bytecode (instead of GS bytecode) to CreateGeometryShaderWithStreamOutput. Yes, the docs mention something about “Size of the compiled geometry shader” here – ignore it. What you get back is a Geometry Shader object that you can then pass to GSSetShader. This is, in effect, a NULL Geometry Shader – it doesn’t actually go through GS processing. It’s just some wrapper (more like duct tape really) to make it fit into the API model, where all rendering passes through the GS stage and SO comes right after GS – though as I’ve explained last time, actual HW tends to skip the GS stage completely when there’s no GS set.

So the shaded vertices get assembled into primitives as before, but instead of getting sent down the rest of the pipeline as already described, they get forwarded to Stream-Out, where they arrive – as always – in a buffer. What exactly happens with them then depends on the Stream-Out declaration (which is passed at creation time). In the Stream-Out declaration, the app gets to specify where it wants each output vector to end up in the Stream-Out targets (or SO targets for short). If the SO declaration “matches” the Vertex Shader Output Declaration (i.e. the same attributes in the same order), data from the input buffers can be streamed more or less unprocessed into memory. If it doesn’t match the declaration exactly – it might skip some attributes written by the shader, or write them in a different order – either way, there’s some extra reordering involved. This might involve a dedicated reordering unit (which basically implements a gather-type operation from the SO input buffers), or it might involve generating lots of small memory writes instead of large burst writes, or something similar. Either way, it’s extra effort and generally slower; the details of what exactly triggers a slow path depend on the hardware specifics, but really, it doesn’t matter that much. If you want optimal SO performance, just make sure the SO declaration and Output declarations agree.

Another point is that SO usually doesn’t have access to a very high-performance path to the memory subsystem. Unlike e.g. the ROPs, SO isn’t really (yet?) a full citizen in current GPU designs, so it often only has access to one memory channel or something of the sort. That’s something to keep in mind if you’re producing a lot of data via SO. This is compounded by SO outputs always being full floats, so there’s no way to conserve bandwidth by using one of the packed vertex data types.

Final remark on VS SO: As I mentioned earlier, SO operates on assembled primitives, not individual vertices. Note that Primitive Assembly discards adjacency information if it makes it that far down the pipeline, and since this happens before SO, vertices corresponding to adjacency info won’t make it into SO buffers either. SO working on primitives not individual vertices is relevant for use cases like instancing a single skinned mesh (in a single pose) several times. If you were to draw your triangle mesh as you usually would and then use SO on that, this results in a data explosion – you get 3 unpacked, unshared vertices per input primitive. This works, but isn’t exactly an efficient use of bandwidth, both on the SO and the later vertex input side. Instead, you should draw your triangle mesh as a (non-indexed) point list in the first pass, thereby shading each vertex exactly once. The SO buffer then ends up in 1:1 correspondence to your original vertex buffer, only with skinned instead of non-skinned vertices. You can then use that vertex buffer with your original primitive topology and index buffer.

Geometry Shader SO: Multiple streams
This basically works like SO with a NULL GS, except there’s a Geometry Shader involved, which adds some new capabilities (and complications). In the VS case, we just had one output stream (note that streams are a D3D11+ feature – they don’t exist on D3D10-level HW). That stream could be sent to SO or not, and it could also be sent to down the pipeline to viewport/clip/cull or not, but that’s it. But Geometry Shaders allow multiple streams, which makes output routing a bit more difficult.

Basically, every GS can write to (as of D3D11) up to 4 streams. Each stream may be sent on to SO targets – yes, plural: a single stream can write to multiple SO targets, but a single SO target can receive values from only one stream, i.e. this is a one-to-many relationship, not a fully general many-to-many one. The presence of streams has some implications for SO buffering – instead of a single input buffer like I described in the NULL GS case, we now may have multiple input buffers, one per stream. In addition to SO targets, up to one stream may be sent down the pipe – i.e. the regular rendering pipeline and SO may be used simultaneously.

As in the NULL GS case, SO works on primitives, not individual vertices – that is, the strips you output in the GS get expanded out to full lines or triangles before they get into SO.

Tracking output size
There’s another issue here: we don’t necessarily know how much output data is going to be produced from SO. For GS, this comes about because each GS invocation may produce a variable number of output primitives; but even in the simpler VS case, as soon as indexed primitives are involved, the app might slip some “primitive cut” indices in there that influence how many primitives actually get written. This is a problem if we then want to draw from that SO buffer later, because we don’t know how many vertices are actually in there! We do have an upper bound – the maximum capacity of the buffer as created – but that’s it. Now, this could be resolved using some kind of query mechanism, but once you think it through, that seems fairly backwards: at the point we’re using the SO buffer for drawing, we obviously do know how many primitives we actually wrote – the SO unit needs to keep track of its current output position, after all! If we employed some query mechanism, we would end up transporting that single 32-bit value back over the bus to the driver, which passes it on to the API, which passes it on to the app – which then immediately dispatches another draw, going through all the layers again in the opposite direction.

So that’s now how it’s solved. Instead, there’s DrawAuto. The idea is very simple – the GPU already knows how many valid vertices it actually wrote to the output buffer; the SO unit keeps track of that while it’s writing, and the final counter is also kept in memory (along with the buffer) since the app may render to a SO buffer in multiple passes. This counter is then used for DrawAuto, instead of having the app submit an explicit count itself – simplifying things considerably and avoiding the costly round-trip completely. Note that this query mechanism does exist – both for checking the number of vertices written and to determine whether an overflow occurred. But it’s not on the critical path for rendering from SO buffers, which makes things a lot simpler for driver developers.

And that’s it for SO, really. Not really a lot of HW info in this one, and not really a super-interesting topic from a pipeline perspective, which is why it took me so long to finish; sorry about that. Next up is Tessellation – this should be a lot quicker, since it’s a fun topic :)

A trip through the Graphics Pipeline 2011_11 Stream Out的更多相关文章

  1. A trip through the Graphics Pipeline 2011_10_Geometry Shaders

    Welcome back.     Last time, we dove into bottom end of the pixel pipeline. This time, we’ll switch ...

  2. A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer

    Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-relate ...

  3. A trip through the Graphics Pipeline 2011_12 Tessellation

    Welcome back! This time, we’ll look into what is perhaps the “poster boy” feature introduced with th ...

  4. A trip through the Graphics Pipeline 2011_08_Pixel processing – “fork phase”

    In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel sha ...

  5. A trip through the Graphics Pipeline 2011_03

    At this point, we’ve sent draw calls down from our app all the way through various driver layers and ...

  6. A trip through the Graphics Pipeline 2011_01

    It’s been awhile since I posted something here, and I figured I might use this spot to explain some ...

  7. A trip through the Graphics Pipeline 2011_09_Pixel processing – “join phase”

    Welcome back!    This post deals with the second half of pixel processing, the “join phase”. The pre ...

  8. A trip through the Graphics Pipeline 2011_07_Z/Stencil processing, 3 different ways

    In this installment, I’ll be talking about the (early) Z pipeline and how it interacts with rasteriz ...

  9. A trip through the Graphics Pipeline 2011_05

    After the last post about texture samplers, we’re now back in the 3D frontend. We’re done with verte ...

随机推荐

  1. BZOJ 2631 Tree ——Link-Cut Tree

    [题目分析] 又一道LCT的题目,LCT只能维护链上的信息. [代码] #include <cstdio> #include <cstring> #include <cs ...

  2. uva 12003 分块

    大白上的原题,我就练练手... #include <bits/stdc++.h> using namespace std; typedef long long ll; ; ; ll blo ...

  3. Android 自动化测试—robotium(七) 使用Junit_report测试报告

    使用Robotium进行测试的时候,要想可以导出明了的测试结果,可以使用junitreport来实现 junit-report下载地址:https://github.com/jsankey/andro ...

  4. ACM ICPC 2015 Moscow Subregional Russia, Moscow, Dolgoprudny, October, 18, 2015 I. Illegal or Not?

    I. Illegal or Not? time limit per test 1 second memory limit per test 512 megabytes input standard i ...

  5. apache htpasswd.exe创建密码

    一.使用apache htpasswd.exe创建密码文件,命令请看PHP推荐教程:apache htpasswd命令用法详解 apache htpasswd命令用法实例 1.如何利用htpasswd ...

  6. [深入浅出Windows 10]模拟实现微信的彩蛋动画

    9.7 模拟实现微信的彩蛋动画 大家在玩微信的时候有没有发现节日的时候发一些节日问候语句如“情人节快乐”,这时候会出现很多爱心形状从屏幕上面飘落下来,我们这小节就是要模拟实现这样的一种动画效果.可能微 ...

  7. 【HDU】4418 Time travel

    http://acm.hdu.edu.cn/showproblem.php?pid=4418 题意:一个0-n-1的坐标轴,给出起点X.终点Y,和初始方向D(0表示从左向右.1表示从右向左,-1表示起 ...

  8. 编码Q&A

    Q:什么是编码? A:由于计算机中所有数据都是以二进制存在,那么为了存储数字,字母,各种符号和文字,计算机必须用一套映射系统来对应.比如我在某台计算机上规定,用00010001这个二进制数表示字母a, ...

  9. Poj 3250 单调栈

    1.Poj 3250  Bad Hair Day 2.链接:http://poj.org/problem?id=3250 3.总结:单调栈 题意:n头牛,当i>j,j在i的右边并且i与j之间的所 ...

  10. Linux 获取设备树源文件(DTS)里描述的资源

    Linux 获取设备树源文件(DTS)里的资源 韩大卫@吉林师范大学 在linux使用platform_driver_register() 注册 platform_driver 时, 需要在 plat ...