Game Engine Architecture 6

1、Data-Parallel Computations

  A GPU is a specialized coprocessor designed specifically to accelerate those computations that involve a high degree of data parallelism. It does so by combining SIMD parallelism (vectorized ALUs) with MIMD parallelism (by employing a form of preemptive multithreading). NVIDIA coined the term single instruction multiple thread (SIMT) to describe this SIMD/MIMD hybrid design, all GPUs employ the general principles of SIMT parallelism in their designs.

  In order for a computational task to be well-suited to execution on a GPU,  the computations performed on any one element of the dataset must be independent of the results of computations on other elements.

void DotArrays_ref(unsigned count,
float r[],
const float a[],
const float b[])
{
for (unsigned i = ; i < count; ++i)
{
// treat each block of four floats as a
// single four-element vector
const unsigned j = i * ;
r[i] = a[j+]*b[j+] // ax*bx
+ a[j+]*b[j+] // ay*by
+ a[j+]*b[j+] // az*bz
+ a[j+]*b[j+]; // aw*bw
}
}

  Upper code, the computation performed by each iteration of this loop is independent of the computations performed by the other iterations.

2、Compute Kernels

  GPGPU compute kernels are typically written in a special shading language which can be compiled into machine code that’s understandable by the GPU. Some examples of shading languages include DirectX’s HLSL (high-level shader language), OpenGL’s GLSL, NVIDIA’s Cg and CUDA C languages, and OpenCL.

  Some shading languages require you to move your kernel code into special source files, separate from your C++ application code. OpenCL and CUDA C, however, are extensions to the C++ language itself.

  OpenCL、CUDA C 扩展了 C++,使得可以在C++文件中编写 kernel函数。

  As a concrete example, here’s our DotKernel() function written in CUDA C:

__global__ void DotKernel_CUDA(int count,
float* r,
const float* a,
const float* b)
{
// CUDA provides a magic "thread index" to each
// invocation of the kernel -- this serves as
// our loop index i
size_t i = threadIdx.x; // make sure the index is valid
if (i < count)
{
// treat each block of four floats as a
// single four-element vector
const unsigned j = i * ;
r[i] = a[j+]*b[j+] // ax*bx
+ a[j+]*b[j+] // ay*by
+ a[j+]*b[j+] // az*bz
+ a[j+]*b[j+]; // aw*bw
}
}

  You’ll notice that the loop index i is taken from a variable called threadIdx within the kernel function itself. The thread index is a “magic” input provided by the compiler, in much the same way that the this pointer “magically” points to the current instance within a C++ class member function.

3、Executing a Kernel

  here’s how we’d kick off our compute kernel in CUDA C:  

void DotArrays_gpgpu2(unsigned count,
float r[],
const float a[],
const float b[])
{
// allocate "managed" buffers that are visible
// to both CPU and GPU
int *cr, *ca, *cb;
cudaMallocManaged(&cr, count * sizeof(float));
cudaMallocManaged(&ca, count * sizeof(float) * );
cudaMallocManaged(&cb, count * sizeof(float) * ); // transfer the data into GPU-visible memory
memcpy(ca, a, count * sizeof(float) * );
memcpy(cb, b, count * sizeof(float) * ); // run the kernel on the GPU
DotKernel_CUDA <<<, count>>> (cr, ca, cb, count); // wait for GPU to finish
cudaDeviceSynchronize(); // return results and clean up
memcpy(r, cr, count * sizeof(float)); cudaFree(cr);
cudaFree(ca);
cudaFree(cb);
}

  cudaDeviceSynchronize():// wait for gpu to finish

  cudaMallocManaged()://allocate "managed" buffers that are visible to both CPU and GPU

  cudaFree():// free memory from cudaMallocManaged

  

  The first argument G in the angled brackets tells the driver how many thread groups (known as thread blocks in NVIDIA terminology) to use when running this compute kernel. A compute job with a single thread group is constrained to run on a single compute unit on the GPU. A compute unit is esssentially a core within the GPU—a hardware component that is capable of executing an instruction stream. Passing a larger number for G allows the driver to divide the workload up across more than one compute unit.

4、GPU Threads and Thread Groups

  a thread group is comprised of some number of GPU threads a GPU is comprised of multiple compute units, each of which contains some number of SIMD units.

  GPU 由 compute unit 组成,每个 compute unit 包含一些 SIMD units。compute unit 可以想象成 cpu core,拥有一些vpu(SIMD)。

    

  For example, SIMDs may 16 lanes wide.

  The CU is not capable of speculative or out-of-order execution—it simply reads a stream of instructions and executes them one by one, using the SIMD units to apply each instruction to 16 elements of input data in parallel.

  To execute a compute kernel on a CU, the driver first subdivides the input data buffers into blocks consisting of 64 elements each. For each of these 64-element blocks of data, the compute kernel is invoked on one CU. Such an invocation is called a wavefront (also known as a warp in NVIDIA speak). When executing a wavefront, the CU fetches and decodes the instructions of the kernel one by one. Each instruction is applied to 16 data elements in lock step using a SIMD unit. Internally, the SIMD unit consists of a four-stage pipeline, so it takes four clock cycles to complete. So rather than allow the stages of this pipeline to sit idle for three clock cycles out of every four, the CU applies the same instruction three more times, to three more blocks of 16 data elements. This is why a wavefront consists of 64 data elements, even though the SIMDs in the CU are only 16 lanes wide.

  the term “GPU thread” actually refers to a single SIMD lane. You can think of a GPU thread, therefore, as a single iteration of the original nonvectorized loop that we started with, before we converted its body into a compute kernel. Alternatively, you can think of a GPU thread as a single invocation of the kernel function, operating on a single input datum and producing a single output datum.

  a GPU thread 可想象成迭代次数或 kernel 调用次数。

5、From SIMD to SIMT

  SIMT a form of preemptive multithreading to time-slice between wavefronts. any one instruction in the program might end up having to access memory, and that introduces a large stall while the SIMD unit waits for the memory controller to respond.

    

  

  To eliminate the cost of saving registers during context switches, each SIMD unit contains a very large register file. The number of physical registers in this register file is many times larger than the number of logical registers available to any one wavefront (typically on the order of ten times larger).

  This means that the contents of the logical registers for up to ten wavefronts can be maintained at all times in these physical registers. And that in turn implies that context switches can be performed between wavefronts without saving or restoring any registers whatsoever.

  每个 SIMD 的 register file 拥有十倍于 logic register 的 physical register,所以 SIMD 在切 context 时,根本不需要 saving or restoring registers。

6、Cylindrical coordinates

  This system employs a vertical “height” axis h, a radial axis r emanating out from the vertical, and a yaw angle theta (q). In cylindrical coordinates, a point P is represented by the triple of numbers (Ph, Pr, Pq).

    

7、Left-Handed versus Right-Handed Coordinate Systems

  It is easy to convert from left-handed to right-handed coordinates and vice versa. We simply flip the direction of any one axis, leaving the other two axes alone.

8、Dot Product and Projection

  • the dot product (a.k.a. scalar product or inner product), and
  • the cross product (a.k.a. vector product or outer product).

  The dot product is commutative (i.e., the order of the two vectors can be reversed) and distributive over addition:

  

  For any two arbitrary vectors a and b, game programmers often use the following tests, as shown in Figure 5.11:

  • Collinear. (a b) = |a| |b|= ab (i.e., the angle between them is exactly 0 degrees—this dot product equals +1 when a and b are unit vectors).

  • Collinear but opposite. (a b) = -ab (i.e., the angle between them is 180 degrees—this dot product equals +1 when a and b are unit vectors).

  • Perpendicular. (a b) = 0 (i.e., the angle between them is 90 degrees).

  • Same direction. (a b) > 0 (i.e., the angle between them is less than 90 degrees).

  • Opposite directions. (a b) < 0 (i.e., the angle between them is greater than 90 degrees).

    

  点到平面的高度:

    

    

9、Cross Product

    

    

  When converting from a right-handed system to a left-handed system or vice versa, the numerical representations of all the points and vectors stay the same, but one axis flips.

  左右手坐标系下,cross product 的数值相同,只是表现上会轴相反。

  Properties of the Cross Product:

        

  The Cartesian basis vectors are related by cross products as follows:

    

10、Pseudovectors and Exterior Algebra,伪矢量,外代数

  cross product doesn’t actually produce a vector—it produces a special kind of mathematical object known as a pseudovectorThe surface normal of a triangle (which is calculated using a cross product) is also a pseudovector

  exterior algebra or Grassman algebra, which describe how vectors and pseudovectors work and allow us to calculate areas of parallelograms (in 2D), volumes of parallelepipeds (in 3D), and so on in higher dimensions.

  Grassman algebra is to introduce a special kind of vector product known as the wedge product, denoted A^ B. A pairwise wedge product yields a pseudovector and is equivalent to a cross product, which also represents the signed area of the parallelogram formed by the two vectors. 

  Doing two wedge products in a row, A ^ B ^ C, is equivalent to the scalar triple product A (B xC) and produces another strange mathematical object known as a pseudoscalarwhich can be interpreted as the signed volume of the parallelepiped

formed by the three vectors.

11、Linear Interpolation of Points and Vectors

  

12、Matrices

  An affine matrix is a 4x4 transformation matrix that preserves parallelism of lines and relative distance ratios, but not necessarily absolute lengths and angles. An affine matrix is any combination of the following operations: rotation,

translation, scale and/or shear.

13、Matrix Multiplication

  

  Matrix multiplication is not commutative. In other words, the order in which matrix multiplication is done matters:

    

14、Representing Points and Vectors as Matrices

    

15、Matrix Inversion

  The inverse of a matrix A is another matrix (denoted A^-1) that undoes the effects of matrix A.

  all affine matrices (combinations of pure rotations, translations, scales and shears) do have inverses.

    

16、Transposition

  the inverse of an orthonormal (pure rotation) matrix is exactly equal to its transpose.

    

17、Homogeneous Coordinates

  

  When a point or vector is extended from three dimensions to four in this manner, we say that it has been written in homogeneous coordinates. A point in homogeneous coordinates always has w = 1.

18、Transforming Direction Vectors

  homogeneous (four-dimensional) coordinates can be converted into non-homogeneous (three-dimensional) coordinates by dividing the x, y and z components by the w component:

    

19、Atomic Transformation Matrices

  

20、Translation

  

  

  To invert a pure translation matrix, simply negate the vector t (i.e., negate tx, ty and tz).

21、Rotation

  

  

  

  

  上面三个 xyz 旋转,对角线都是 cos,cos,1,另一个对角线是 sin,-sin,旋转方向后面轴的 sin为正,旋转方向前面轴的sin为-。

22、Scale

      

   

  Scale 的逆矩阵不能通过 transpose 来求,因为 scale后,每一个轴向量不再是 unit vector,不满足 orthonormal 性质,所以无法通过 transpose 来求 逆矩阵。

  • When a uniform scale matrix Su and a rotation matrix R are concatenated, the order of multiplication is unimportant (i.e., SuR = RSu). This only works for uniform scale!

23、4x3 Matrices

  As such, game programmers often omit the fourth column to save memory. You’ll encounter 4 x 3 affine matrices frequently in game.

24、Coordinate Spaces

   We said above that a point is a vector whose tail is fixed to the origin of some coordinate system. This is another way of saying that a point (position vector) is always expressed relative to a set of coordinate axes.

25、Model Space

  • pitch is rotation about L or R,
  • yaw is rotation about U, and
  • roll is rotation about F.

26、Building a Change of Basis Matrix

  

  • iC is the unit basis vector along the child space x-axis, expressed in parent-space coordinates;

  • jC is the unit basis vector along the child space y-axis, in parent space;

  • kC is the unit basis vector along the child space z-axis, in parent space; and

  • tC is the translation of the child coordinate system relative to parent space.

27、Transforming Normal Vectors

  Special care must be taken when transforming a normal vector to ensure that both its length and perpendicularity properties are maintained.

  In general, if a point or (non-normal) vector can be rotated from space A to space B via the 3x3 matrixMA->B, then a normal vector n will be transformed from space A to space B via the inverse transpose of that matrix.

  if the matrix MA->B contains only uniform scale and no shear, then the angles between all surfaces and vectors in space B will be the same as they were in space A.

28、Quaternions

  a matrix is not always an ideal representation of a rotation, for a number of reasons:

  1) We need nine floating-point values to represent a rotation, which seems excessive considering that we only have three degrees of freedom— pitch, yaw and roll.

    费空间。

  2)Rotating a vector requires a vector-matrix multiplication, which involves three dot products, or a total of nine multiplications and six additions. We would like to find a rotational representation that is less expensive to calculate, if possible

    费CPU,计算量大。

  3)to find rotations that are some percentage of the way between two known rotations is difficult.

    不便插值。

  

29、Unit Quaternions as 3D Rotations

  So the unit quaternion q can be written as:

    

  a is a unit vector along the axis of rotation, and q is the angle of rotation. 

    

30、Quaternion Multiplication

  Given two quaternions p and q representing two rotations P and Q, respectively, the product pq represents the composite rotation (i.e., rotation Q followed by rotation P).

  四元数乘积 PQ 的含意为,先应用P,再应用Q。

  

31、Conjugate and Inverse.

  Conjugate:

    

  negate the vector part but leave the scalar part unchanged.

  Inverse quaternion:

      

32、Conjugate and Inverse of a Product

    

33、Rotating Vectors with Quaternions

  Therefore, the rotated vector v′ can be found as follows:

    

  when quaternion is unit, we can use conjugate:

    

34、Quaternion Concatenation

    

35、Quaternion-Matrix Equivalence

  If we let q = [qV qS] = [qVx qVy qVz qS] =x y z w], then we can find R as follows:

    
  

36、Rotational Linear Interpolation

    

    

  The resultant interpolated quaternion had to be renormalized.

37、Spherical Linear Interpolation

  LERP 的 beta 平滑插值,但并不代表 quaternion 平滑插值。实际上,beta 的平滑插值,会导致 quaternion 两头变化慢,中间变化快的问题。
  SLERP 可以解决上面的问题。

    

  

    

38、To SLERP or Not to SLERP

  Jonathan Blow wrote a great article positing that SLERP is too expensive, and LERP’s quality is not really that bad—therefore, he suggests, we should understand SLERP but avoid it in our game engines.

  On the other hand, some of my colleagues at Naughty Dog have found that a good SLERP implementation performs nearly as well as LERP. (For example, on the PS3’s SPUs, Naughty Dog’s Ice team’s implementation of SLERP takes 20 cycles per joint, while its LERP implementation takes 16.25 cycles per joint.)

39、Euler Angles

  Euler angles are prone to a condition known as gimbal lock.

  Another problem with Euler angles is that the order in which the rotations are performed around each axis matters. The order could be PYR, YPR, RYP and so on, and each ordering may produce a different composite rotation. So the rotation angles qY qP qR do not uniquely define a particular rotation—you need to know the rotation order to interpret these numbers properly

40、3 x 3 Matrices

  It does not suffer from gimbal lock.

  However, rotation matrices are not particularly intuitive.

41、Axis + Angle

  The benefits of the axis+angle representation are that it is reasonably intuitive and also compact.

42、Quaternions

  it permits rotations to be concatenated and applied directly to points and vectors via quaternion multiplication. 

43、

44、

Game Engine Architecture 6的更多相关文章

  1. Game Engine Architecture 11

    [Game Engine Architecture 11] 1.three most-common techniques used in modern game engines. 1)Cel Anim ...

  2. Game Engine Architecture 10

    [Game Engine Architecture 10] 1.Full-Screen Antialiasing (FSAA) also known as super-sampled antialia ...

  3. Game Engine Architecture 9

    [Game Engine Architecture 9] 1.Formatted Output with OutputDebugString() int VDebugPrintF(const char ...

  4. Game Engine Architecture 8

    [Game Engine Architecture 8] 1.Differences across Operating Systems • UNIX uses a forward slash (/) ...

  5. Game Engine Architecture 7

    [Game Engine Architecture 7] 1.SRT Transformations When a quaternion is combined with a translation ...

  6. Game Engine Architecture 5

    [Game Engine Architecture 5] 1.Memory Ordering Semantics These mysterious and vexing problems can on ...

  7. Game Engine Architecture 4

    [Game Engine Architecture 4] 1.a model of multiple semi-independent flows of control simply matches ...

  8. Game Engine Architecture 3

    [Game Engine Architecture 3] 1.Computing performance—typically measured in millions of instructions  ...

  9. Game Engine Architecture 2

    [Game Engine Architecture 2] 1.endian swap 函数 floating-point endian-swapping:将浮点指针reinterpert_cast 成 ...

随机推荐

  1. iis重写模块实现程序自动二级域名,微软提供的URL重写2.0版本适用IIS以上

    在iis7以后微软提供了url重写2.0版本,可以通过安装重写组件来实现.适用于iis7以上版本. 安装有两种方式可以选择,一是下载安装文件,二是通过“web平台安装程序”安装 1.下载安装文件 下载 ...

  2. 高度自适应的div

    需求:有一个高度自适应的div,里面有两个div,一个高度100px,希望另一个填满剩下的高度 1.用flex 来实现 思路:flex 垂直布局(column),第一个元素固定高度,第二个元素flex ...

  3. 去freessl.org申请免费ssl服务器证书

    去freessl.org申请免费ssl服务器证书 来源: 本文链接 来自osnosn的博客 写于: 2019-03-30. 想搞个自签名证书,可以参考这篇: 用openssl为WEB服务器生成证书(自 ...

  4. python基础知识13---函数对象、函数嵌套、名称空间与作用域、装饰器

    阅读目录 一 函数对象 二 函数嵌套 三 名称空间与作用域 四 闭包函数 五 装饰器 六 练习题 一 函数对象 1 函数是第一类对象,即函数可以当作数据传递 #1 可以被引用 #2 可以当作参数传递 ...

  5. goflyway简单使用

    前言 一个朋友最近新买的搬瓦工ip突然被强了,要等10周左右才能免费更换ip.而恰巧在网上看到了Goflyway 进阶教程:KCP 协议复活被墙IP 决定试一试.在vultr上临时搭建了测试环境,可能 ...

  6. js 数组 map方法

    https://www.cnblogs.com/xuan52rock/p/4460949.html https://jingyan.baidu.com/article/91f5db1b7453471c ...

  7. C/C++中指针和java的引用区别

    C++指针  要区分指针变量和指针变量所指对象. 指针变量先是一个变量,它有自己的地址和存储的内容,所以要想清楚是改变指针变量的值(即地址),还是改变指针变量所指对象的值. #include < ...

  8. Galaxy2D Game Engine 4.2

    Galaxy2D Game Engine 4.2 开发版  下载地址  D3DRender注意:1.下载后的压缩包请使用WinRar5.0打开.2.开发版包含了现在正在开发中的Galaxy2D游戏引擎 ...

  9. vue-cli · Failed to download repo vuejs-templates/webpack: connect ETIMEDOUT 13.229.188.59:443

    初始化vue项目 vue init webpack vue-demo 之后 一直报vue-cli · Failed to download repo vuejs-templates/webpack: ...

  10. debian apache2 多端口对应多文件 虚拟端口配置

    apache2单IP多端口创建虚拟站点如下: 1.转到配制目录虚拟站点配制目录 cd /etc/apache2/ ​ 2.配置新增多的端口 编辑上级目录中的端口配制文件sudo gedit ports ...