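These are my own solutions to some of the exercises. I was short on time, so they are a little rough and incomplete. Discussion and corrections are welcome.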

5.1 Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption?

Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.

Answer: No. Shared memory does not help in Exercise 3.1, because each thread reads each of its input elements exactly once and no element is read by more than one thread, so there is nothing to share between threads.
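To make the lack of reuse concrete, here is a minimal Python sketch. It is a CPU-side counting model, not the Exercise 3.1 kernel itself, and it assumes one thread per output element; the name `count_readers` is made up for illustration. It records which threads read each input element.

```python
from collections import defaultdict

def count_readers(rows, cols):
    """Model C[i][j] = A[i][j] + B[i][j] with one thread per output element.

    Returns a dict mapping each input element to the set of threads reading it.
    """
    readers = defaultdict(set)
    for i in range(rows):
        for j in range(cols):
            thread = (i, j)                   # thread (i, j) produces C[i][j]
            readers[("A", i, j)].add(thread)  # its only read of A
            readers[("B", i, j)].add(thread)  # its only read of B
    return readers

readers = count_readers(4, 4)
max_sharing = max(len(threads) for threads in readers.values())
print(max_sharing)  # 1
```

Every element has exactly one reader, so staging it in shared memory would add a copy without removing any global memory access.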

5.2 Draw the equivalent of Figure 5.6 for an 8*8 matrix multiplication with 2*2 tiling and 4*4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.

Answer:

1. An 8*8 matrix multiplication with 2*2 tiling (Block0,0)

| Phase | Thread | Load M | Load N | Compute |
|---|---|---|---|---|
| 1 | thread0,0 | M0,0 -> Mds0,0 | N0,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| 1 | thread0,1 | M0,1 -> Mds0,1 | N0,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| 1 | thread1,0 | M1,0 -> Mds1,0 | N1,0 -> Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| 1 | thread1,1 | M1,1 -> Mds1,1 | N1,1 -> Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| 2 | thread0,0 | M0,2 -> Mds0,0 | N2,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| 2 | thread0,1 | M0,3 -> Mds0,1 | N2,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| 2 | thread1,0 | M1,2 -> Mds1,0 | N3,0 -> Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| 2 | thread1,1 | M1,3 -> Mds1,1 | N3,1 -> Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| 3 | thread0,0 | M0,4 -> Mds0,0 | N4,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| 3 | thread0,1 | M0,5 -> Mds0,1 | N4,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| 3 | thread1,0 | M1,4 -> Mds1,0 | N5,0 -> Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| 3 | thread1,1 | M1,5 -> Mds1,1 | N5,1 -> Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| 4 | thread0,0 | M0,6 -> Mds0,0 | N6,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| 4 | thread0,1 | M0,7 -> Mds0,1 | N6,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| 4 | thread1,0 | M1,6 -> Mds1,0 | N7,0 -> Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| 4 | thread1,1 | M1,7 -> Mds1,1 | N7,1 -> Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |

2. An 8*8 matrix multiplication with 4*4 tiling (Block0,0)

| Phase | Thread | Load M | Load N | Compute |
|---|---|---|---|---|
| 1 | thread0,0 | M0,0 -> Mds0,0 | N0,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| 1 | thread0,1 | M0,1 -> Mds0,1 | N0,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| 1 | thread0,2 | M0,2 -> Mds0,2 | N0,2 -> Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| 1 | thread0,3 | M0,3 -> Mds0,3 | N0,3 -> Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |
| 2 | thread0,0 | M0,4 -> Mds0,0 | N4,0 -> Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| 2 | thread0,1 | M0,5 -> Mds0,1 | N4,1 -> Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| 2 | thread0,2 | M0,6 -> Mds0,2 | N4,2 -> Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| 2 | thread0,3 | M0,7 -> Mds0,3 | N4,3 -> Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |

Threads 1,x through 3,x follow the same pattern and are omitted.

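As the tables show, the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles. Without tiling, each thread reads 8 elements of M and 8 elements of N from global memory, 16 loads in total. With 2*2 tiles the computation takes 4 phases and each thread loads one M element and one N element per phase, 8 loads in total, or 1/2 of the original. With 4*4 tiles there are only 2 phases, so each thread issues 4 loads, or 1/4 of the original. A larger tile means each loaded element is shared by more threads and fewer phases are needed, so global memory traffic falls by a factor equal to the tile width.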
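The load counts can also be checked mechanically. The sketch below is a CPU emulation in Python, not the book's kernel: the function name `tiled_matmul_loads` and the test matrices are assumptions for illustration. It walks blocks, phases and threads in the same order as the tables and counts every read from the global matrices.

```python
def tiled_matmul_loads(M, N, tile):
    """Emulate the tiled kernel on the CPU and count global-memory reads.

    tile == 1 corresponds to the untiled case: every operand is fetched
    from global memory each time it is used.
    """
    n = len(M)
    assert n % tile == 0, "this sketch assumes the tile width divides n"
    P = [[0] * n for _ in range(n)]
    global_loads = 0
    for by in range(n // tile):              # block row
        for bx in range(n // tile):          # block column
            for ph in range(n // tile):      # phase
                # Each thread (ty, tx) loads one M element and one N element.
                Mds = [[M[by * tile + ty][ph * tile + tx] for tx in range(tile)]
                       for ty in range(tile)]
                Nds = [[N[ph * tile + ty][bx * tile + tx] for tx in range(tile)]
                       for ty in range(tile)]
                global_loads += 2 * tile * tile
                # All later reads in this phase hit the shared tiles.
                for ty in range(tile):
                    for tx in range(tile):
                        P[by * tile + ty][bx * tile + tx] += sum(
                            Mds[ty][k] * Nds[k][tx] for k in range(tile))
    return P, global_loads

n = 8
M = [[i + j for j in range(n)] for i in range(n)]
N = [[i * j + 1 for j in range(n)] for i in range(n)]
loads = {t: tiled_matmul_loads(M, N, t)[1] for t in (1, 2, 4)}
print(loads)  # {1: 1024, 2: 512, 4: 256}
```

Doubling the tile width halves the total number of global loads, which matches the 1/2 and 1/4 ratios for 2*2 and 4*4 tiles.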

5.3 What type of incorrect execution behavior can happen if one forgot to use __syncthreads() in the kernel of Figure 5.12?

Answer: The __syncthreads() barrier in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them moves forward. The __syncthreads() barrier in line 14 ensures that all threads have finished using the d_M and d_N elements in shared memory before any of them moves on to the next iteration and loads the elements of the next tiles. Without the first barrier, a thread could read Mds or Nds elements that other threads have not loaded yet. Without the second, a thread could start the next iteration early and overwrite tile elements that other threads are still using. In both cases threads compute with corrupted input values, and the result depends on thread timing.
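The effect of dropping the first barrier can be illustrated with a small deterministic model. This is a Python sketch, not CUDA: it runs the four threads of one 2*2 block sequentially, the function name `run_block` is made up, and shared memory is modelled as zero-initialised (an assumption; real shared memory starts with undefined contents). The second barrier guards against the mirror-image hazard across phases and is not modelled here.

```python
def run_block(M, N, use_barrier):
    """Model one 2x2 thread block computing P = M * N in a single phase."""
    t = 2
    Mds = [[0] * t for _ in range(t)]   # stand-in for shared memory (assumed zeroed)
    Nds = [[0] * t for _ in range(t)]
    P = [[0] * t for _ in range(t)]
    threads = [(ty, tx) for ty in range(t) for tx in range(t)]

    def load(ty, tx):
        Mds[ty][tx] = M[ty][tx]
        Nds[ty][tx] = N[ty][tx]

    def compute(ty, tx):
        P[ty][tx] = sum(Mds[ty][k] * Nds[k][tx] for k in range(t))

    if use_barrier:
        for ty, tx in threads:   # every thread finishes its load ...
            load(ty, tx)
        for ty, tx in threads:   # ... before any thread starts computing
            compute(ty, tx)
    else:
        for ty, tx in threads:   # one possible interleaving without a barrier
            load(ty, tx)
            compute(ty, tx)      # may read tile entries nobody has loaded yet
    return P

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
print(run_block(M, N, use_barrier=True))   # [[19, 22], [43, 50]]
print(run_block(M, N, use_barrier=False))  # [[5, 6], [15, 50]]
```

Only the last thread in this interleaving sees a fully loaded tile. On real hardware the ordering is not fixed, so the wrong values would vary from run to run.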

5.4 Assuming capacity was not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory. Explain your answer.

Answer: Capacity aside, the biggest difference between the two is that a register is private to a single thread, whereas shared memory is visible to all threads in a block.

Tiled matrix multiplication is therefore a good example: a value fetched by one thread is also needed by other threads in the same block, so placing it in shared memory lets them reuse it without another global memory access.

5.5 For our tiled matrix-matrix multiplication kernel, if we use a 32*32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?

a. 1/8 of the original usage

b. 1/16 of the original usage

c. 1/32 of the original usage

d. 1/64 of the original usage

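Answer: c. With a 32*32 tile, each element loaded into shared memory is reused by 32 threads, so global memory traffic for M and N drops to 1/32 of the original.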

5.6 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 512000

Answer: d

5.7 In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 51200

Answer: b
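The arithmetic behind the answers to 5.6 and 5.7 is short enough to write out. In this Python sketch (the helper name `variable_instances` is hypothetical), a local variable is private to each thread, while a shared memory variable has one copy per block.

```python
def variable_instances(num_blocks, threads_per_block):
    """Count the copies of a kernel variable created over one launch."""
    local_copies = num_blocks * threads_per_block  # one per thread
    shared_copies = num_blocks                     # one per thread block
    return local_copies, shared_copies

print(variable_instances(1000, 512))  # (512000, 1000)
```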

5.9 Consider performing a matrix multiplication of two input matrices with dimensions N*N. How many times is each element in the input matrices requested from global memory when:

a. There is no tiling?

b. Tiles of size T*T areused?

Answer: a. N

b. N/T
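A short derivation of both counts, assuming one thread per output element and that T divides N: an element in row i of the first input matrix contributes to all N outputs in row i of the product. Without tiling, each of those N threads fetches it from global memory itself. With T*T tiles, those N outputs belong to N/T thread blocks, and in each block exactly one thread loads the element into shared memory. The same argument applies to each element of the second input matrix and the outputs in its column.

$$
\text{requests}_{\text{untiled}} = N,
\qquad
\text{requests}_{\text{tiled}} = \frac{N}{T},
\qquad
\frac{\text{requests}_{\text{tiled}}}{\text{requests}_{\text{untiled}}} = \frac{1}{T}
$$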
