转自http://www.gamasutra.com/view/feature/132084/sponsored_feature_common_.php?print=1

By Becky Heineman

[In this technical article, part of Microsoft's XNA-related Gamasutra microsite, XNA Developer Connection staffer and Interplay co-founder Becky Heineman gives tips on avoiding the 'Load-Hit-Store' performance-killer when making games.]

"90% of the time is spent in 10% of the code, so make that 10% the fastest code it can be."

One of the most common problems encountered in creating computer games is performance. Issues like disk access, GPU performance, CPU performance, race conditions, and memory bandwidth (or lack thereof) can cause stalls or delays that may turn a 30-frames-per-second game into a 9-frames-per-second game.

This article will describe one of the most common CPU performance killers, the Load-Hit-Store, and give tips and tricks on how to avoid it.

Load-Hit-Store

Ask any Xbox 360 performance engineer about Load-Hit-Store and they usually go into a tirade. The sequence of a memory read operation (The Load), the assignment of the value to a register (The Hit), and the actual writing of the value into a register (The Store) is usually hidden away in stages of the pipeline so these operations cause no stalls. However, if the memory location being read was one recently written to by a previous write operation, it can take as many at 40 cycles before the "Store" operation can complete.

Example:

stfs fr3,0(r3) ;Store the float
lwz r9,0(r3) ;Read it back into an integer register
oris r9,r9,0x8000 ;Force to negative

The first instruction writes a 32-bit floating-point value into memory, and the following instruction reads it back. What's interesting is that the load instruction isn't where the stall occurs; it's the "oris" instruction. That instruction can't complete until the "store" into r9 finishes, and it's waiting for the L1 cache to update.

What's going on? The first instruction stores the data and marks the L1 cache as "dirty". It takes about 40 cycles for the data to be written into the L1 cache and become available for the CPU to use. During this window of time, an instruction requests that data from the cache and then "hits" R9 for a "store". Since the last instruction can't execute until the store is complete, you've got a stall.

The Microsoft tool, PIX, can locate these issues. Since it's confusing to tag the "oris" instruction as the cause of the stall (which it is), PIX flags the load instruction that started the chain of events so the programmer has a better chance of fixing the issue.

我的理解:

CPU在将数据写回内存时,只是将cache标记为dirty,直到约40个时钟周期后才写入cache。如果在这个期间读该内存数据,发现其在cache(没有miss cache本应读取很快),但该cache line因为之前的store而变为dirty,必须等回写完成以后才可以读取,造成了延时。

PIX可以检测这种延时。

Three CPUs in One Thread

Think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data. The first is the integer unit with its 32-integer registers, which is considered the workhorse, handling a large percentage of the operations.

The second is the floating-point unit with its 32 floating-point registers, handling all of the simple mathematics. Finally, the third is the VMX unit with its 128 registers dealing with complex vector operations.

Why think of the units as three CPUs that share a common instruction stream? These units have no way of directly transferring data between one another internally. Due to the lack of an instruction to move the contents of an integer register to a floating-point register, the CPU must write the integer value to memory, and then load it into a floating-point register using a memory read instruction. That pattern of operation is by nature, a Load-Hit-Store.

Moving data from the integer unit to the floating-point unit is as simple as...

Example:

int iTime;
float fTime;
fTime = static_cast<float>(iTime);

This is extremely simple code and very common, but on the PowerPC, an instruction is generated to store the integer value to memory such that a floating-point instruction can be executed to load from memory into a floating-point register. A fix-up instruction follows that converts the integer representation into a floating-point representation, and the sequence is complete.

A common way to generate Load-Hit-Store is using member values or reference pointers as iterators in tight loops.

Example:

for (int i=0;i<100;++i)
{
m_iData++;
}

Seldom are compilers smart enough to figure out that the above loop resolves into m_iData+=100 and optimizes it into a single operation. Most will happily load m_iData at runtime, increment it, and store it back into memory referenced by the "this" pointer. The first pass of the loop will run at full speed, but once it loops back, the m_iData value will incur a Load-Hit-Store from the write operation of the previous pass through the loop.

Since registers invoke no penalty, if the code was rewritten to look like this:

int iData = m_iData;
for (int i=0;i<100;++i) {
iData++;
}
m_iData = iData;

Not only will the code run much faster since the operations are all in registers, you increase the chances the compiler will reduce this to iData+=100 and remove any chance of a Load-Hit-Store bottleneck.


References

A Load-Hit-Store can happen in code, even when it looks like it shouldn't.

void foo(int &count)
{
count = 0;
for (int i=0;i<100;++i) {
if (Test(i)) {
++count;
}
}
}

That code generates a Load-Hit-Store. How?

The variable "count" is memory bound. All writes to it, and in many cases reads, go through memory. Anytime a variable is memory bound and in a tight loop, it can cause Load-Hit-Stores. A way of fixing this is similar to the previous code example.

void foo(int &output)
{
int count = 0;
for (int i=0;i<100;++i) {
if (Test(i)) {
++count;
}
}
output = count; // Write the result
}

VectorLoad-Hit-Store

The previous examples demonstrated how easy it is to cause Load-Hit-Store stalls with floating-point and integer transactions. The VMX register sets suffer from the same problem. It's common that some math operations could be done more efficiently in a VMX operation, but what if it involves non-vector data?

On the Xbox 360, the VMX register intrinsic __vector4 is mapped onto a structure. Run-time accessing of the elements of the structure should be discouraged for the reason below.

XMVECTOR Radius = CalcBounds();
pOut->fRadius = Radius.x;

The second line creates a Load-Hit-Store because the VMX register is used as a structure. As a result, the compiler has to write the contents of the entire register to local memory; then the first element is read with a floating-point register, and only then is the value written into pOut->fRadius.

Here is a way to write the same code without incurring the hidden Load-Hit-Store:

XMVECTOR Radius = CalcBounds();
__stvewx(&pout->fRadius,__vspltw(Radius,0),0);

VMX has the ability to write any specific entry as a single float. The vspltw() operation will copy the requested entry into a temp vector register and the stvewx() operation will handle the writing the float. Using the compiler's feature of accessing the value isn't recommended.


Eliminate int->float conversions

If the data being converted is semi-static, like a frame time quantum or the width of a screen, the data can be duplicated and functions that read the integer version or the floating-point version can fetch it without penalty.

Example:

typedef struct ScreenSize_t {
int m_iWidth;
int m_iHeight;
float m_fWidth;
float m_fHeight;

inline Void SetHeight(int iWidth) {
m_iWidth = iWidth;
m_fWidth = static_cast(iWidth);
};
} ScreenSize_t;

So functions that need an integer value for processing load from the "i" form of the members, while functions that require a floating-point input fetch from the "f" form. These values could be updated by an inline function that updates both the integer and floating-point versions at the same time.

Another form of integer to floating point conversion is where an iterator is used and converted. Since floating point compares have their own set of issues, they too need to be minimized. In this example, the angle is generated with each loop by converting it from an integer to a floating point number, causing a Load-Hit-Store

VOIDDebugDraw::DrawRing( const XMFLOAT3 &Origin,
const XMFLOAT3 &MajorAxis, const XMFLOAT3 &MinorAxis, D3DCOLOR Color )
{
static const DWORD dwRingSegments = 32;
MeshVertexP verts[ dwRingSegments + 1 ];

XMVECTOR vOrigin = XMLoadFloat3( &Origin );
XMVECTOR vMajor = XMLoadFloat3( &MajorAxis );
XMVECTOR vMinor = XMLoadFloat3( &MinorAxis );

FLOAT fAngleDelta = XM_2PI / (float)dwRingSegments;
for( DWORD i = 0; i<dwRingSegments; i++)
{
FLOAT fAngle = (FLOAT)i * fAngleDelta;
XMVECTOR Pos;
Pos = XMVectorAdd( vOrigin, XMVectorScale( vMajor, cosf( fAngle ) ) );
Pos = XMVectorAdd( Pos, XMVectorScale( vMinor, sinf( fAngle ) ) );
XMStoreFloat3( (XMFLOAT3*)&verts[i], Pos );
}

verts[ dwRingSegments ] = verts[0];

SimpleShaders::SetDeclPos();
SimpleShaders::BeginShader_Transformed_ConstantColor
( g_matViewProjection, Color );
g_pd3dDevice->DrawPrimitiveUP( D3DPT_LINESTRIP, dwRingSegments, (const
VOID*)verts, sizeof( MeshVertexP ) );
SimpleShaders::EndShader();
}

With three lines changed, the Load-Hit-Store is removed and the functionality is intact.

VOID DebugDraw::DrawRing( const XMFLOAT3 &Origin,
const XMFLOAT3 &MajorAxis, const XMFLOAT3 &MinorAxis, D3DCOLOR Color )
{
static const DWORD dwRingSegments = 32;
MeshVertexP verts[ dwRingSegments + 1 ];

XMVECTOR vOrigin = XMLoadFloat3( &Origin );
XMVECTOR vMajor = XMLoadFloat3( &MajorAxis );
XMVECTOR vMinor = XMLoadFloat3( &MinorAxis );

FLOAT fAngleDelta = XM_2PI / (float)dwRingSegments;
FLOAT fi = 0.0f; // Added a copy of i as a float
for( DWORD i = 0; i<dwRingSegments; i++, fi+=1.0f ) // Inc fi
{
FLOAT fAngle = fi * fAngleDelta; // NO int to float conversion
XMVECTOR Pos;
Pos = XMVectorAdd( vOrigin, XMVectorScale( vMajor, cosf( fAngle ) ) );
Pos = XMVectorAdd( Pos, XMVectorScale( vMinor, sinf( fAngle ) ) );
XMStoreFloat3( (XMFLOAT3*)&verts[i], Pos );
}

verts[ dwRingSegments ] = verts[0];

SimpleShaders::SetDeclPos();
SimpleShaders::BeginShader_Transformed_ConstantColor
( g_matViewProjection, Color );
g_pd3dDevice->DrawPrimitiveUP( D3DPT_LINESTRIP, dwRingSegments, (const
VOID*)verts, sizeof( MeshVertexP ) );
SimpleShaders::EndShader();
}

Faster, 360, Code! Code!

It takes only a little discipline to write clean code, but it's also easy to create code that can inadvertently introduce performance bottlenecks. Using Microsoft tools like PIX will help you track down some of these, but the best way to avoid bottlenecks, is to be aware of how they can exist so that they aren't written into the code in the first place.

A good understanding of the underlying hardware is not crucial to modern game programming from a high level. However, with a solid foundation of how CPUs work as well as how they interact with the memory subsystems,

Sponsored Feature: Common Performance Issues in Game Programming的更多相关文章

  1. PLSQL_性能优化工具系列17_Best Practices: Proactive Data Collection for Performance Issues

    占位符 https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=2082062510193540&id=1366133. ...

  2. The maximum number of processes for the user account running is currently , which can cause performance issues. We recommend increasing this to at least 4096.

    [root@localhost ~]# vi /etc/security/limits.conf # /etc/security/limits.conf # #Each line describes ...

  3. [LintCode] 77. Longest common subsequences_ Medium tag: Dynamic Programming

    Given two strings, find the longest common subsequence (LCS). Example Example 1: Input: "ABCD&q ...

  4. Spark 调优(转)

    Spark 调优 返回原文英文原文:Tuning Spark Because of the in-memory nature of most Spark computations, Spark pro ...

  5. Master Note for Transportable Tablespaces (TTS) -- Common Questions and Issues (Doc ID 1166564.1)

    APPLIES TO: Oracle Database Cloud Exadata Service - Version N/A and laterOracle Database Cloud Servi ...

  6. [Algorithms] Using Dynamic Programming to Solve longest common subsequence problem

    Let's say we have two strings: str1 = 'ACDEB' str2 = 'AEBC' We need to find the longest common subse ...

  7. Thinking Clearly about Performance

    http://queue.acm.org/detail.cfm?id=1854041 The July/August issue of acmqueue is out now acmqueue is ...

  8. Chapter 6 — Improving ASP.NET Performance

    https://msdn.microsoft.com/en-us/library/ff647787.aspx Retired Content This content is outdated and ...

  9. Linear and Quadratic Programming Solver ( Arithmetic and Algebra) CGAL 4.13 -User Manual

    1 Which Programs can be Solved? This package lets you solve convex quadratic programs of the general ...

随机推荐

  1. 零碎记录Hadoop平台各组件使用

    >20161011 :数据导入研究    0.sqoop报warning,需要安装accumulo:    1.下载Microsoft sql server jdbc, 使用ie下载,将42版j ...

  2. GitHub之创建

    O(∩_∩)O~ 爱“搞事”的我又创了一个Github账号,和当初加入博客园的初衷一样,为了广泛交流和学习而已.很久之前我就发现了有很多人都在使用GitHub,然而当时看不懂英文(绝大部分都是英文), ...

  3. Game start

    今天开始有计划的码代码吧!!我可是以后要进微软或者google的男人.初步计划先学习编程之美吧,每天码一到题的解法,每天每天每天..然后是ACM竞赛基础,每天一节同上.最后..不对,冷静冷静,我已经没 ...

  4. webSphere集群部署主要步骤

    1.系统管理-节点,添加本机节点和另外一台机器的节点2.建立集群服务cluster,添加成员节点3.将应用部署到集群服务cluster4.将数据库源分别建立到节点作用域5.后续步骤参照安装手册 注意事 ...

  5. SSH 正向/反向代理小记

    上周因为玩耍Minecraft的原因,折腾了下ssh的正向.反向代理,不得不说,科技改变命运..了解了基础的用法之后,很多跨域的事情都可以通过代理解决,而且只需要ssh帐号权限即可. 那么就简单来介绍 ...

  6. wordpress使用video.js与七牛云存储实现无广告视频分享应用

    video.js是一款极受欢迎的基于HTML5的开源WEB视频播放器,其充分利用了HTML5的视频支持特性,可以实现全平台的无视频插件播放功能,对于现在流行的手机.PAD等移动智能终端有极佳的应用体验 ...

  7. if语句代码优化

    if($sum==7){ $sz+=135; }elseif($sum==5){ $sz+=80; }elseif($sum==6){ $sz+=97; }elseif($sum==4){ $sz+= ...

  8. MotionEvent中getX()和getRawX()的区别

    http://blog.csdn.net/ztp800201/article/details/17218067 public class Res extends Activity implements ...

  9. Mongodb学习教程汇总

    1.MongoDB权威指南 - 学习笔记 地址:http://www.cnblogs.com/refactor/category/394801.html 2.8天学通MongoDB 地址:http:/ ...

  10. dedecms 处理分页样式及去掉分页li

    最近装了个织梦dedecmsV5.7版本时,调用分页显示出现的结果出现好几行,怎么也不能在一排显示,找了很多资料,才了解到是由织梦模板中分页加了<Li>列表标签,解决有两种方法,下面将一一 ...