Welcome back! This time, the focus is going to be on Stream-Out (SO). This is a facility for storing the Output of the Geometry Shader stage to memory, instead of sending it down the rest of the pipeline. This can be used to e.g. cache skinned vertex data, or as a sort of poor man’s Compute Shader on D3D10-level hardware using the D3D10 API (note that with D3D11, you can just use CS 4.0, even on D3D10 hardware). And just like the GS Instancing I mentioned last time, some of this is very poorly described in the API docs, so I’ll have a few comments about API usage even though it’s technically out of the intended scope of this series.

Vertex Shader Stream-Out (i.e. SO with NULL GS)
This is one of the features that’s not properly explained in the D3D10 (or D3D11, for that matter) docs; in fact, it’s not mentioned there at all except for a small throwaway remark in “Getting Started with the Stream-Output Stage (Direct3D 10)”. You’re supposed to figure it out from the examples – which themselves don’t exactly go out of their way to make it clear what’s going on. That’s a pity – VS Stream-Out is easier than GS SO, and has some pretty useful applications by itself (e.g. caching skinned vertices).

So here’s how it’s done in D3D10 and 11: You simply pass Vertex Shader bytecode (instead of GS bytecode) to CreateGeometryShaderWithStreamOutput. Yes, the docs mention something about “Size of the compiled geometry shader” here – ignore it. What you get back is a Geometry Shader object that you can then pass to GSSetShader. This is, in effect, a NULL Geometry Shader – it doesn’t actually go through GS processing. It’s just some wrapper (more like duct tape really) to make it fit into the API model, where all rendering passes through the GS stage and SO comes right after GS – though as I’ve explained last time, actual HW tends to skip the GS stage completely when there’s no GS set.

So the shaded vertices get assembled into primitives as before, but instead of getting sent down the rest of the pipeline as already described, they get forwarded to Stream-Out, where they arrive – as always – in a buffer. What exactly happens with them then depends on the Stream-Out declaration (which is passed at creation time). In the Stream-Out declaration, the app gets to specify where it wants each output vector to end up in the Stream-Out targets (or SO targets for short). If the SO declaration “matches” the Vertex Shader Output Declaration (i.e. the same attributes in the same order), data from the input buffers can be streamed more or less unprocessed into memory. If it doesn’t match the declaration exactly – it might skip some attributes written by the shader, or write them in a different order – either way, there’s some extra reordering involved. This might involve a dedicated reordering unit (which basically implements a gather-type operation from the SO input buffers), or it might involve generating lots of small memory writes instead of large burst writes, or something similar. Either way, it’s extra effort and generally slower; the details of what exactly triggers a slow path depend on the hardware specifics, but really, it doesn’t matter that much. If you want optimal SO performance, just make sure the SO declaration and Output declarations agree.

Another point is that SO usually doesn’t have access to a very high-performance path to the memory subsystem. Unlike e.g. the ROPs, SO isn’t really (yet?) a full citizen in current GPU designs, so it often only has access to one memory channel or something of the sort. That’s something to keep in mind if you’re producing a lot of data via SO. This is compounded by SO outputs always being full floats, so there’s no way to conserve bandwidth by using one of the packed vertex data types.

Final remark on VS SO: As I mentioned earlier, SO operates on assembled primitives, not individual vertices. Note that Primitive Assembly discards adjacency information if it makes it that far down the pipeline, and since this happens before SO, vertices corresponding to adjacency info won’t make it into SO buffers either. SO working on primitives not individual vertices is relevant for use cases like instancing a single skinned mesh (in a single pose) several times. If you were to draw your triangle mesh as you usually would and then use SO on that, this results in a data explosion – you get 3 unpacked, unshared vertices per input primitive. This works, but isn’t exactly an efficient use of bandwidth, both on the SO and the later vertex input side. Instead, you should draw your triangle mesh as a (non-indexed) point list in the first pass, thereby shading each vertex exactly once. The SO buffer then ends up in 1:1 correspondence to your original vertex buffer, only with skinned instead of non-skinned vertices. You can then use that vertex buffer with your original primitive topology and index buffer.

Geometry Shader SO: Multiple streams
This basically works like SO with a NULL GS, except there’s a Geometry Shader involved, which adds some new capabilities (and complications). In the VS case, we just had one output stream (note that streams are a D3D11+ feature – they don’t exist on D3D10-level HW). That stream could be sent to SO or not, and it could also be sent to down the pipeline to viewport/clip/cull or not, but that’s it. But Geometry Shaders allow multiple streams, which makes output routing a bit more difficult.

Basically, every GS can write to (as of D3D11) up to 4 streams. Each stream may be sent on to SO targets – yes, plural: a single stream can write to multiple SO targets, but a single SO target can receive values from only one stream, i.e. this is a one-to-many relationship, not a fully general many-to-many one. The presence of streams has some implications for SO buffering – instead of a single input buffer like I described in the NULL GS case, we now may have multiple input buffers, one per stream. In addition to SO targets, up to one stream may be sent down the pipe – i.e. the regular rendering pipeline and SO may be used simultaneously.

As in the NULL GS case, SO works on primitives, not individual vertices – that is, the strips you output in the GS get expanded out to full lines or triangles before they get into SO.

Tracking output size
There’s another issue here: we don’t necessarily know how much output data is going to be produced from SO. For GS, this comes about because each GS invocation may produce a variable number of output primitives; but even in the simpler VS case, as soon as indexed primitives are involved, the app might slip some “primitive cut” indices in there that influence how many primitives actually get written. This is a problem if we then want to draw from that SO buffer later, because we don’t know how many vertices are actually in there! We do have an upper bound – the maximum capacity of the buffer as created – but that’s it. Now, this could be resolved using some kind of query mechanism, but once you think it through, that seems fairly backwards: at the point we’re using the SO buffer for drawing, we obviously do know how many primitives we actually wrote – the SO unit needs to keep track of its current output position, after all! If we employed some query mechanism, we would end up transporting that single 32-bit value back over the bus to the driver, which passes it on to the API, which passes it on to the app – which then immediately dispatches another draw, going through all the layers again in the opposite direction.

So that’s now how it’s solved. Instead, there’s DrawAuto. The idea is very simple – the GPU already knows how many valid vertices it actually wrote to the output buffer; the SO unit keeps track of that while it’s writing, and the final counter is also kept in memory (along with the buffer) since the app may render to a SO buffer in multiple passes. This counter is then used for DrawAuto, instead of having the app submit an explicit count itself – simplifying things considerably and avoiding the costly round-trip completely. Note that this query mechanism does exist – both for checking the number of vertices written and to determine whether an overflow occurred. But it’s not on the critical path for rendering from SO buffers, which makes things a lot simpler for driver developers.

And that’s it for SO, really. Not really a lot of HW info in this one, and not really a super-interesting topic from a pipeline perspective, which is why it took me so long to finish; sorry about that. Next up is Tessellation – this should be a lot quicker, since it’s a fun topic :)

A trip through the Graphics Pipeline 2011_11 Stream Out的更多相关文章

  1. A trip through the Graphics Pipeline 2011_10_Geometry Shaders

    Welcome back.     Last time, we dove into bottom end of the pixel pipeline. This time, we’ll switch ...

  2. A trip through the Graphics Pipeline 2011_13 Compute Shaders, UAV, atomic, structured buffer

    Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-relate ...

  3. A trip through the Graphics Pipeline 2011_12 Tessellation

    Welcome back! This time, we’ll look into what is perhaps the “poster boy” feature introduced with th ...

  4. A trip through the Graphics Pipeline 2011_08_Pixel processing – “fork phase”

    In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel sha ...

  5. A trip through the Graphics Pipeline 2011_03

    At this point, we’ve sent draw calls down from our app all the way through various driver layers and ...

  6. A trip through the Graphics Pipeline 2011_01

    It’s been awhile since I posted something here, and I figured I might use this spot to explain some ...

  7. A trip through the Graphics Pipeline 2011_09_Pixel processing – “join phase”

    Welcome back!    This post deals with the second half of pixel processing, the “join phase”. The pre ...

  8. A trip through the Graphics Pipeline 2011_07_Z/Stencil processing, 3 different ways

    In this installment, I’ll be talking about the (early) Z pipeline and how it interacts with rasteriz ...

  9. A trip through the Graphics Pipeline 2011_05

    After the last post about texture samplers, we’re now back in the 3D frontend. We’re done with verte ...

随机推荐

  1. Swift3.0语言教程使用Unicode范式标准化获取字符串

    Swift3.0语言教程使用Unicode范式标准化获取字符串 Swift3.0语言教程使用Unicode范式标准化获取字符串,在NSString中可以使用4个属性去使用Unicode范式标准化获取字 ...

  2. hdu5438 Ponds dfs 2015changchun网络赛

    Ponds Time Limit: 1500/1000 MS (Java/Others)    Memory Limit: 131072/131072 K (Java/Others)Total Sub ...

  3. AngularJS 包含HTML文件

    类似于python tornado的include方法,同样是可以在一个html文件中加载另外一个html文件,这样可以不用重复的写一些几乎不改变的代码. 首先创建两个文件,然后代码如下: <! ...

  4. BestCoder Round #65

    博弈 1002 ZYB's Game 题意:中文 分析:假定两个人是绝顶聪明的,一定会采取最优的策略.所以如果选择X的左边的一个点,那么后手应该选择X的右边对称的点,如果没有则必输,否则必胜,然后再分 ...

  5. GridView中超链接设置

    <%# Eval("id") %>Bind方式    <%# Bind("id","~/info.aspx?id={0}" ...

  6. 转载:CSS3 Flexbox可视化指南

    0. 目录 目录 引言 正文 1 引入 2 基础 3 使用 4 弹性容器Flex container属性 41 flex-direction 42 flex-wrap 43 flex-flow 44 ...

  7. ACM: HDU 2563 统计问题-DFS+打表

    HDU 2563 统计问题 Time Limit:1000MS     Memory Limit:32768KB     64bit IO Format:%I64d & %I64u HDU 2 ...

  8. [Leetcode] Merge Intevals

    Question: Given a collection of intervals, merge all overlapping intervals. For example,Given [1,3], ...

  9. NOIP欢乐模拟赛 T2 解题报告

    小澳的坐标系 (coordinate.cpp/c/pas) [题目描述] 小澳者表也,数学者景也,表动则景随矣. 小澳不喜欢数学,可数学却待小澳如初恋,小澳睡觉的时候也不放过. 小澳的梦境中出现了一个 ...

  10. java判断文件是否存在

    1.判断远程服务器上文件 import java.net.HttpURLConnection; import java.net.URL; public boolean checkRemoteFile( ...