Hierarchical Z-Buffer Occlusion Culling
While I was at GDC I had the pleasure of attending the Rendering with Conviction talk by Stephen Hill. One of the topics was so cool that I thought it would be fun to try it out. The hierarchical z-buffer solution presented at GDC borrows heavily from this paper, Siggraph 2008 Advances in Real-Time Rendering (Section 3.3.3). I ran into a fair number of issues trying to get the AMD implementation working, though: a lot of the math is too simplistic and does not take into account perspective distortion or the proper width of the sphere in screen space, so you end up with false negatives.
You should read the papers to get a firm grasp of the algorithm, but here is my take on the process and some implementation notes of my own.
Hierarchical Z-Buffer Culling Steps
- Bake step – Have your artists prepare occlusion geometry for things in the world that make sense as occluders: buildings, walls, etc. They should all be super cheap to render (boxes, planes). I actually ran across this paper, Geometric Simplification For Efficient Occlusion Culling In Urban Scenes, that sounded like a neat way of automating the process.
- CPU – Take all the occlusion meshes and frustum cull them.
- GPU – Render the remaining occluders to a ‘depth buffer’. The depth buffer should not be full sized; in my code I’m using 512×256. There is a Frostbite paper that mentions using a 256×114-ish sized buffer for a similar occlusion culling solution. The ‘depth buffer’ should just be mip 0 in a full mip chain of render targets (not the actual depth buffer).
- GPU – Now downsample the RT containing the depth information, filling out the entire mip chain. You’ll do this by rendering a fullscreen effect with a pixel shader that takes the previous level of the mip chain and downsamples it into the next, preserving the highest depth value in each group of 4 pixels. For DX11, you can just constrain the shader resource view so that you can both read from and render to the same mip chain. For DX9 you’ll have to use StretchRect to copy from a second mip chain, since you can’t sample and render to the same mip chain in DX9. In my code I actually found a more optimized solution: ping-pong between 2 mip chains, one containing the even levels and the other the odd levels. With a single branch in your shader code you can avoid the overhead of the StretchRect and just sample from a different mip chain based on the even/odd mip level you need.
- CPU – Gather all the bounding spheres for everything in your level that could possibly be visible.
- GPU – In DX11, send the list of bounds to a compute shader. It computes the screen space width of each sphere, then uses that width to pick the mip level to sample from the HiZ map generated in step 4 (the downsampling pass), such that the sphere is no more than 2 texels wide at that level. Large objects in screen space will therefore sample from high (coarse) levels of the mip chain, since they only need a coarse view of the world, whereas small objects in screen space will sample from low (detailed) levels. In DX9 the process is basically the same. The only difference is that you’ll render a point list of vertices that are Float4 bounds (xyz = position, w = radius) instead of a Float3 position. You’ll also send down a stream of texcoords that represent the x/y pixel location where the result of the occlusion test for that bound should be encoded. Instead of a compute shader you’ll process the vertices using a vertex shader, using the pixel location provided in the texcoord stream to make sure the result of the test is written out to that point in a render target. In the pixel shader you’ll do the sampling to test whether the bound is visible, and output a color like white for culled, black for visible.
- CPU – Try to do some work on the CPU after the occluder rendering and culling process is kicked off. For me the entire process took about 0.74 ms of GPU time on a Radeon 5450, with 900 bounds. The overhead of generating the HiZ mip chain and dispatching the culling process is the real bottleneck though; there’s little difference between 900 bounds and 10,000 bounds.
- CPU – Read back the results. In DX11 you’re just reading back a buffer output by a compute shader. For DX9 you’ll have to copy the render target rendered to in step 6 (the culling pass), which contains the checker pattern of black and white pixels, and then iterate over the pixels on the CPU to know what is visible and what is hidden.
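To make the "full mip chain" in steps 3 and 4 concrete, here is a minimal CPU-side sketch in C++ that lists the dimensions of every level the downsampling pass would have to fill. The names `MipSize` and `BuildHiZMipChain` are hypothetical and not part of the sample; the halving rule (round down, never below 1) is the usual Direct3D mip convention.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper type: dimensions of one mip level.
struct MipSize {
    uint32_t width;
    uint32_t height;
};

// Returns the dimensions of every level in a full mip chain, mip 0 first.
// Each level halves the previous one (rounding down) and never drops below 1.
std::vector<MipSize> BuildHiZMipChain(uint32_t width, uint32_t height)
{
    std::vector<MipSize> chain;
    for (;;) {
        chain.push_back({width, height});
        if (width == 1 && height == 1) {
            break; // reached the last level of the chain
        }
        width  = std::max<uint32_t>(1, width / 2);
        height = std::max<uint32_t>(1, height / 2);
    }
    return chain;
}
```

For the 512×256 buffer mentioned above this gives 10 levels, from 512×256 down to 1×1, so the downsampling pass would run 9 times per frame.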
Hierarchical Z-Buffer Downsampling Code
The downsampling is pretty much what you would expect: you take the current pixel and sample the pixels to the right, bottom and bottom right. You take the furthest depth value and use it as the new depth in the downsampled pixel. Here’s an example of a before and after version; black is a closer depth, and the whiter a pixel is, the further away / higher the depth value.
Before Downsample
After Downsample
The downsampling HLSL code looks like this:
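// Gather the 2x2 block of texels from the previous mip level.
// nCoords is the location passed to Texture2D.Load (texel xy, mip level in z).
float4 vTexels;
vTexels.x = LastMip.Load( nCoords );
vTexels.y = LastMip.Load( nCoords, uint2(1,0) );
vTexels.z = LastMip.Load( nCoords, uint2(0,1) );
vTexels.w = LastMip.Load( nCoords, uint2(1,1) );
// Keep the furthest (largest) depth so the coarser level stays conservative.
float fMaxDepth = max( max( vTexels.x, vTexels.y ), max( vTexels.z, vTexels.w ) );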
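As a cross-check for the shader, the same 2×2 max reduction can be sketched on the CPU. This is only a reference under assumptions, not code from the sample: `DepthImage` and `DownsampleMax` are hypothetical names, the image is stored row-major with one float per texel, and reads past the border are clamped so levels that are 1 texel wide or tall still work.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side depth image: row-major, one float per texel.
struct DepthImage {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;

    // Clamped read, so the 2x2 footprint never leaves the image.
    float At(std::size_t x, std::size_t y) const
    {
        x = std::min(x, width - 1);
        y = std::min(y, height - 1);
        return texels[y * width + x];
    }
};

// Builds the next mip level: each output texel is the furthest (largest)
// depth of the 2x2 block it covers in the source level.
DepthImage DownsampleMax(const DepthImage& src)
{
    DepthImage dst;
    dst.width  = std::max<std::size_t>(1, src.width / 2);
    dst.height = std::max<std::size_t>(1, src.height / 2);
    dst.texels.resize(dst.width * dst.height);

    for (std::size_t y = 0; y < dst.height; ++y) {
        for (std::size_t x = 0; x < dst.width; ++x) {
            const std::size_t sx = x * 2;
            const std::size_t sy = y * 2;
            const float top    = std::max(src.At(sx, sy),     src.At(sx + 1, sy));
            const float bottom = std::max(src.At(sx, sy + 1), src.At(sx + 1, sy + 1));
            dst.texels[y * dst.width + x] = std::max(top, bottom);
        }
    }
    return dst;
}
```

Taking the maximum is what keeps the test conservative: a coarse texel never reports a nearer occluder depth than any of the texels it summarizes.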
Hierarchical Z-Buffer Culling Code
Here’s the heart of the algorithm, the culling. One note: [numthreads(1,1,1)] is terrible for performance with compute shaders. Anyone planning to use this should do a better job of their thread group and thread management than I did. This is the DX11 compute shader version; I decided to use it here since it’s clearer what the intentions are. You’ll find the DX9 code in the full sample at the bottom of the post.
cbuffer CB
{
    matrix View;
    matrix Projection;
    matrix ViewProjection;
    float4 FrustumPlanes[6]; // view-frustum planes in world space (normals face out)
    float2 ViewportSize;     // Viewport Width and Height in pixels
    float2 PADDING;
};

// Bounding sphere center (XYZ) and radius (W), world space
StructuredBuffer<float4> Buffer0 : register(t0);
// Is Visible 1 (Visible) 0 (Culled)
RWStructuredBuffer<uint> BufferOut : register(u0);

Texture2D<float> HizMap : register(t1);
SamplerState HizMapSampler : register(s0);
// Computes signed distance between a point and a plane
// vPlane: Contains plane coefficients (a,b,c,d) where: ax + by + cz + d = 0
// vPoint: Point to be tested against the plane.
float DistanceToPlane( float4 vPlane, float3 vPoint )
{
    return dot(float4(vPoint, 1), vPlane);
}

// Frustum culling on a sphere. Returns > 0 if visible, <= 0 otherwise.
// Positive distances are treated as inside here, so the test relies on
// DistanceToPlane being positive on the inner side of each plane.
float CullSphere( float4 vPlanes[6], float3 vCenter, float fRadius )
{
    float dist01 = min(DistanceToPlane(vPlanes[0], vCenter), DistanceToPlane(vPlanes[1], vCenter));
    float dist23 = min(DistanceToPlane(vPlanes[2], vCenter), DistanceToPlane(vPlanes[3], vCenter));
    float dist45 = min(DistanceToPlane(vPlanes[4], vCenter), DistanceToPlane(vPlanes[5], vCenter));
    return min(min(dist01, dist23), dist45) + fRadius;
}
[numthreads(1, 1, 1)]
void CSMain( uint3 GroupId : SV_GroupID,
             uint3 DispatchThreadId : SV_DispatchThreadID,
             uint GroupIndex : SV_GroupIndex)
{
    // Calculate the actual index this thread in this group will be reading from.
    int index = DispatchThreadId.x;

    // Bounding sphere center (XYZ) and radius (W), world space
    float4 Bounds = Buffer0[index];

    // Perform view-frustum test
    float fVisible = CullSphere(FrustumPlanes, Bounds.xyz, Bounds.w);

    if (fVisible > 0)
    {
        float3 viewEye = -View._m03_m13_m23;
        float CameraSphereDistance = distance( viewEye, Bounds.xyz );

        float3 viewEyeSphereDirection = viewEye - Bounds.xyz;

        float3 viewUp = View._m01_m11_m21;
        float3 viewDirection = View._m02_m12_m22;
        float3 viewRight = normalize(cross(viewEyeSphereDirection, viewUp));

        // Help handle perspective distortion.
        // http://article.gmane.org/gmane.games.devel.algorithms/21697/
        float fRadius = CameraSphereDistance * tan(asin(Bounds.w / CameraSphereDistance));

        // Compute the offsets for the points around the sphere
        float3 vUpRadius = viewUp * fRadius;
        float3 vRightRadius = viewRight * fRadius;

        // Generate the 4 corners of the sphere in world space.
        float4 vCorner0WS = float4( Bounds.xyz + vUpRadius - vRightRadius, 1 ); // Top-Left
        float4 vCorner1WS = float4( Bounds.xyz + vUpRadius + vRightRadius, 1 ); // Top-Right
        float4 vCorner2WS = float4( Bounds.xyz - vUpRadius - vRightRadius, 1 ); // Bottom-Left
        float4 vCorner3WS = float4( Bounds.xyz - vUpRadius + vRightRadius, 1 ); // Bottom-Right

        // Project the 4 corners of the sphere into clip space
        float4 vCorner0CS = mul(ViewProjection, vCorner0WS);
        float4 vCorner1CS = mul(ViewProjection, vCorner1WS);
        float4 vCorner2CS = mul(ViewProjection, vCorner2WS);
        float4 vCorner3CS = mul(ViewProjection, vCorner3WS);

        // Convert the corner points from clip space to normalized device coordinates
        float2 vCorner0NDC = vCorner0CS.xy / vCorner0CS.w;
        float2 vCorner1NDC = vCorner1CS.xy / vCorner1CS.w;
        float2 vCorner2NDC = vCorner2CS.xy / vCorner2CS.w;
        float2 vCorner3NDC = vCorner3CS.xy / vCorner3CS.w;

        // Remap from [-1, 1] NDC to [0, 1] texture coordinates (Y flipped)
        vCorner0NDC = float2( 0.5, -0.5 ) * vCorner0NDC + float2( 0.5, 0.5 );
        vCorner1NDC = float2( 0.5, -0.5 ) * vCorner1NDC + float2( 0.5, 0.5 );
        vCorner2NDC = float2( 0.5, -0.5 ) * vCorner2NDC + float2( 0.5, 0.5 );
        vCorner3NDC = float2( 0.5, -0.5 ) * vCorner3NDC + float2( 0.5, 0.5 );

        // In order to have the sphere covering at most 4 texels, we need to use
        // the entire width of the rectangle, instead of only the radius of the rectangle,
        // which was the original implementation in the ATI paper, it had some edge case
        // failures I observed from being overly conservative.
        float fSphereWidthNDC = distance( vCorner0NDC, vCorner1NDC );

        // Compute the center of the bounding sphere in view space
        float3 Cv = mul( View, float4( Bounds.xyz, 1 ) ).xyz;

        // compute nearest point to camera on sphere, and project it
        float3 Pv = Cv - normalize( Cv ) * Bounds.w;
        float4 ClosestSpherePoint = mul( Projection, float4( Pv, 1 ) );

        // Choose a MIP level in the HiZ map.
        // The original assumed viewport width > height, however I've changed it
        // to determine the greater of the two.
        //
        // This will result in a mip level where the object takes up at most
        // 2x2 texels such that the 4 sampled points have depths to compare
        // against.
        float W = fSphereWidthNDC * max(ViewportSize.x, ViewportSize.y);
        float fLOD = ceil(log2( W ));

        // fetch depth samples at the corners of the square to compare against
        float4 vSamples;
        vSamples.x = HizMap.SampleLevel( HizMapSampler, vCorner0NDC, fLOD );
        vSamples.y = HizMap.SampleLevel( HizMapSampler, vCorner1NDC, fLOD );
        vSamples.z = HizMap.SampleLevel( HizMapSampler, vCorner2NDC, fLOD );
        vSamples.w = HizMap.SampleLevel( HizMapSampler, vCorner3NDC, fLOD );

        float fMaxSampledDepth = max( max( vSamples.x, vSamples.y ), max( vSamples.z, vSamples.w ) );
        float fSphereDepth = (ClosestSpherePoint.z / ClosestSpherePoint.w);

        // cull sphere if the depth is greater than the largest of our HiZ map values
        BufferOut[index] = (fSphereDepth > fMaxSampledDepth) ? 0 : 1;
    }
    else
    {
        // The sphere is outside of the view frustum
        BufferOut[index] = 0;
    }
}
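Two pieces of arithmetic in the shader are easy to sanity-check on the CPU: the perspective-corrected radius and the mip level selection. The C++ sketch below mirrors those two expressions only. `PerspectiveRadius` and `SelectHiZMip` are hypothetical names, and the explicit clamp to `[0, maxMip]` is my assumption, standing in for whatever clamping the sampler does with an out-of-range LOD. Note that the radius formula only makes sense when the eye is outside the sphere (distance > radius); otherwise `asin` is out of range.

```cpp
#include <algorithm>
#include <cmath>

// Mirrors: CameraSphereDistance * tan(asin(Bounds.w / CameraSphereDistance)).
// Gives the radius of the sphere's silhouette measured in the plane through
// the sphere center, which is slightly larger than the true radius.
// Precondition: distance > radius (eye outside the sphere).
float PerspectiveRadius(float radius, float distance)
{
    return distance * std::tan(std::asin(radius / distance));
}

// Mirrors: ceil(log2(fSphereWidthNDC * max(ViewportSize.x, ViewportSize.y))).
// widthUV is the sphere width in [0, 1] texture-coordinate units.
// The clamp to [0, maxMip] is an assumption added for this sketch.
int SelectHiZMip(float widthUV, float viewportWidth, float viewportHeight, int maxMip)
{
    const float widthPixels = widthUV * std::max(viewportWidth, viewportHeight);
    if (widthPixels <= 1.0f) {
        return 0; // log2 would be <= 0, so the finest level is enough
    }
    const int lod = static_cast<int>(std::ceil(std::log2(widthPixels)));
    return std::min(lod, maxMip);
}
```

Because the chosen level satisfies widthPixels / 2^lod <= 1, the sphere is at most one texel wide at that level and so can straddle at most two texels per axis, which is why four corner samples are enough. This assumes mip 0 of the HiZ map has the dimensions given in ViewportSize.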
Sample
Here’s my sample implementation of the Hierarchical Z-Buffer Culling solution in DX11 and DX9. Some notes: during one of my iterations I disabled the code for rendering a visible representation of the occluders, which are just two triangles hardcoded in a vertex buffer to be rendered every frame. Also, DX9 doesn’t actually render anything based on the results; I was just using PIX to test my output of the cull render target and was more focused on getting it working in DX11. The controls are the arrow keys to move the camera around. Red boxes represent culled boxes, white boxes are the visible ones.
Notes
I haven’t quite figured out how to deal with shadows. I’ve sort of figured out how to cull the objects whose shadows you can’t possibly see, but not really. Stephen mentions using a tactic similar to the one presented in this paper, CC Shadow Volumes. I wasn’t able to figure it out in the hour I spent going over the paper and haven’t really found the time to revisit it.
Update 7/5/2010
I’ve added a new post on how to solve the problem of culling objects that cast shadows.
Update 6/26/2011
I’ve been doing some additional research into generating occluders. It doesn’t completely solve it, but it’s a start. Further work is needed.
Update 4/13/2012
I’ve started a project to automatically generate the occluders to be used with Hi-Z occlusion culling, Oxel!