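Exercise Solutions for Chapter 5 of "Programming Massively Parallel Processors"

These are my own solutions to some of the exercises. I was short on time, so they are somewhat rough and incomplete; discussion and corrections are welcome.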

5.1 Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption?

Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.

Answer: No. Shared memory brings no benefit in Exercise 3.1, because each thread reads its input elements exactly once and no element is read by more than one thread, so there is nothing to share between threads.
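The claim can be checked with a small Python model rather than a CUDA kernel. This is only a sketch: it assumes one thread per output element, and the function name `addition_reads` is made up for illustration.

```python
from collections import Counter


def addition_reads(n):
    """Count how many threads read each input element of an n x n matrix addition.

    Assumption: thread (i, j) computes C[i][j] = A[i][j] + B[i][j].
    """
    reads = Counter()
    for i in range(n):
        for j in range(n):
            # Thread (i, j) touches exactly one element of A and one of B.
            reads[("A", i, j)] += 1
            reads[("B", i, j)] += 1
    return reads


reads = addition_reads(4)
# If the maximum is 1, no element is read by two threads and nothing can be reused.
print(max(reads.values()))
```

Because every count is 1, staging the data in shared memory would only add an extra copy.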

5.2 Draw the equivalent of Figure 5.6 for an 8*8 matrix multiplication with 2*2 tiling and 4*4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.

Answer:

1. An 8*8 matrix multiplication with 2*2 tiling

Block0,0, Phase 1 and Phase 2:

| Thread | Phase 1: load M | Phase 1: load N | Phase 1: compute | Phase 2: load M | Phase 2: load N | Phase 2: compute |
| --- | --- | --- | --- | --- | --- | --- |
| thread0,0 | M0,0 → Mds0,0 | N0,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 | M0,2 → Mds0,0 | N2,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,1 | M0,1 → Mds0,1 | N0,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 | M0,3 → Mds0,1 | N2,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread1,0 | M1,0 → Mds1,0 | N1,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 | M1,2 → Mds1,0 | N3,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,1 | M1,1 → Mds1,1 | N1,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 | M1,3 → Mds1,1 | N3,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |

Block0,0, Phase 3 and Phase 4:

| Thread | Phase 3: load M | Phase 3: load N | Phase 3: compute | Phase 4: load M | Phase 4: load N | Phase 4: compute |
| --- | --- | --- | --- | --- | --- | --- |
| thread0,0 | M0,4 → Mds0,0 | N4,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 | M0,6 → Mds0,0 | N6,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,1 | M0,5 → Mds0,1 | N4,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 | M0,7 → Mds0,1 | N6,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread1,0 | M1,4 → Mds1,0 | N5,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 | M1,6 → Mds1,0 | N7,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,1 | M1,5 → Mds1,1 | N5,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 | M1,7 → Mds1,1 | N7,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |

2. An 8*8 matrix multiplication with 4*4 tiling

Block0,0, Phase 1 and Phase 2:

| Thread | Phase 1: load M | Phase 1: load N | Phase 1: compute | Phase 2: load M | Phase 2: load N | Phase 2: compute |
| --- | --- | --- | --- | --- | --- | --- |
| thread0,0 | M0,0 → Mds0,0 | N0,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 | M0,4 → Mds0,0 | N4,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| thread0,1 | M0,1 → Mds0,1 | N0,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 | M0,5 → Mds0,1 | N4,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| thread0,2 | M0,2 → Mds0,2 | N0,2 → Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 | M0,6 → Mds0,2 | N4,2 → Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| thread0,3 | M0,3 → Mds0,3 | N0,3 → Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 | M0,7 → Mds0,3 | N4,3 → Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |

The rows for thread1,x through thread3,x follow the same pattern and are omitted.

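As the tables show, the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles. Without tiling, each thread loads 8 elements of M and 8 elements of N from global memory, 16 loads in total. With 2*2 tiles, each thread loads one M element and one N element in each of 4 phases, 8 loads in total (1/2 of the original). With 4*4 tiles there are only 2 phases, so each thread issues 4 loads (1/4 of the original). A larger tile means more threads in the block cooperate on each load and there are proportionally fewer phases, so global memory traffic drops by a factor equal to the tile width.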
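The counts can be cross-checked with a CPU-side Python model of the tiled algorithm. This is a sketch and not the kernel from the book: the name `tiled_matmul` and the test matrices are assumptions, and `tile == 1` stands in for the untiled kernel.

```python
def tiled_matmul(M, N, tile):
    """Multiply square matrices the way the tiled kernel does, counting global loads.

    tile == 1 models the untiled kernel: every operand comes from global memory.
    Assumes the tile width divides the matrix width.
    """
    n = len(M)
    P = [[0] * n for _ in range(n)]
    global_loads = 0
    for by in range(0, n, tile):            # block row
        for bx in range(0, n, tile):        # block column
            for ph in range(0, n, tile):    # one phase per tile along the shared dimension
                # Every thread (ty, tx) of the block loads one M and one N element.
                Mds = [[M[by + ty][ph + tx] for tx in range(tile)] for ty in range(tile)]
                Nds = [[N[ph + ty][bx + tx] for tx in range(tile)] for ty in range(tile)]
                global_loads += 2 * tile * tile
                # Each thread then accumulates its partial dot product from the tiles.
                for ty in range(tile):
                    for tx in range(tile):
                        for k in range(tile):
                            P[by + ty][bx + tx] += Mds[ty][k] * Nds[k][tx]
    return P, global_loads


M = [[i + j for j in range(8)] for i in range(8)]
N = [[i * j for j in range(8)] for i in range(8)]
for tile in (1, 2, 4):
    _, loads = tiled_matmul(M, N, tile)
    print(tile, loads, loads // 64)  # tile width, total loads, loads per thread
```

For the 8*8 case this reports 16, 8, and 4 loads per thread for tile widths 1, 2, and 4, matching the tables.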

5.3 What type of incorrect execution behavior can happen if one forgot to use __syncthreads() in the kernel of Figure 5.12?

Answer: The __syncthreads() barrier in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them moves forward. The __syncthreads() barrier in line 14 ensures that all threads have finished using the d_M and d_N elements in shared memory before any of them moves on to the next iteration and loads the elements of the next tiles. Without the first barrier, a thread could read Mds or Nds elements that other threads have not loaded yet. Without the second barrier, a thread could load elements of the next tile too early and overwrite values that other threads are still using. In both cases the threads compute with wrong input values and the result is incorrect.
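The effect of the missing barriers can be imitated on the CPU. In the Python sketch below each thread of block 0,0 is a generator and every `yield` stands for a `__syncthreads()` call. The no-barrier schedule, in which each thread runs to completion before the next one starts, is just one hypothetical interleaving chosen for illustration; real hardware schedules threads differently. The name `run_block` is made up.

```python
def run_block(M, N, tile, use_barriers):
    """Simulate block 0,0 of the tiled kernel; returns its tile x tile part of P."""
    n = len(M)
    Mds = [[0] * tile for _ in range(tile)]  # shared-memory tiles
    Nds = [[0] * tile for _ in range(tile)]
    P = [[0] * tile for _ in range(tile)]

    def thread(ty, tx):
        pvalue = 0
        for ph in range(n // tile):
            Mds[ty][tx] = M[ty][ph * tile + tx]
            Nds[ty][tx] = N[ph * tile + ty][tx]
            yield  # barrier 1: wait until the whole tile is loaded
            for k in range(tile):
                pvalue += Mds[ty][k] * Nds[k][tx]
            yield  # barrier 2: wait until everyone is done with the tile
        P[ty][tx] = pvalue

    threads = [thread(ty, tx) for ty in range(tile) for tx in range(tile)]
    if use_barriers:
        # Advance every thread to its next barrier before any thread continues.
        alive = threads
        while alive:
            still_running = []
            for t in alive:
                try:
                    next(t)
                    still_running.append(t)
                except StopIteration:
                    pass
            alive = still_running
    else:
        # Barriers ignored: each thread runs start to finish on its own.
        for t in threads:
            for _ in t:
                pass
    return P


ones = [[1] * 4 for _ in range(4)]
print(run_block(ones, ones, 2, use_barriers=True))
print(run_block(ones, ones, 2, use_barriers=False))
```

With barriers every entry is the full dot product. Without them, thread 0,0 multiplies by shared-memory slots that no other thread has filled yet, so its result is too small.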

5.4 Assuming capacity was not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory. Explain your answer.

Answer: Capacity aside, the biggest difference between the two is that a register is private to a single thread, whereas shared memory is visible to all threads in a block.

Tiled matrix multiplication is a good example: a value fetched from global memory by one thread is also needed by other threads in the same block, so keeping it in shared memory lets them reuse it without additional global loads.

5.5 For our tiled matrix-matrix multiplication kernel, if we use a 32*32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?

a. 1/8 of the original usage

b. 1/16 of the original usage

c. 1/32 of the original usage

d. 1/64 of the original usage

Answer: c
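The choice follows from counting global loads per thread. As a sketch, assume square matrices of width Width that is a multiple of the tile width TILE_WIDTH (both symbols are used here for illustration):

```latex
\text{loads}_{\text{untiled}} = 2\,\mathit{Width}, \qquad
\text{loads}_{\text{tiled}} = 2\,\frac{\mathit{Width}}{\mathit{TILE\_WIDTH}}, \qquad
\frac{\text{loads}_{\text{tiled}}}{\text{loads}_{\text{untiled}}}
  = \frac{1}{\mathit{TILE\_WIDTH}} = \frac{1}{32}.
```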

5.6 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 512000

Answer: d

5.7 In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 51200

Answer: b
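The answers to 5.6 and 5.7 come down to the scope of each kind of variable: a local (automatic) variable is private to a thread, while a shared memory variable exists once per thread block. A short Python check of the arithmetic, with the hypothetical helper name `variable_instances`:

```python
def variable_instances(num_blocks, threads_per_block):
    """Instances created during one kernel launch for a per-thread local variable
    and for a per-block shared memory variable."""
    per_thread = num_blocks * threads_per_block  # one private copy per thread
    per_block = num_blocks                       # one copy per thread block
    return per_thread, per_block


print(variable_instances(1000, 512))  # (512000, 1000)
```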

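5.9 Consider performing a matrix multiplication of two input matrices with dimensions N*N. How many times is each element in the input matrices requested from global memory when:

a. There is no tiling?

b. Tiles of size T*T are used?

Answer: a. N times. Each element of the first matrix is needed by the N threads that compute one row of the result, and each element of the second matrix by the N threads that compute one column.

b. N/T times. Each element is loaded once by every block that needs it, and N/T blocks need it.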
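Both counts can be confirmed by enumerating which thread loads which element. The Python sketch below does this for the first input matrix (the second is symmetric). It assumes T divides N, uses `tile == 1` to stand for the untiled kernel, and the name `request_counts` is made up.

```python
from collections import Counter


def request_counts(n, tile):
    """Count how often each element of the first input matrix is requested
    from global memory in an n x n multiplication with tile x tile tiles."""
    counts = Counter()
    for by in range(0, n, tile):              # block row
        for bx in range(0, n, tile):          # block column
            for ph in range(0, n, tile):      # phase
                for ty in range(tile):
                    for tx in range(tile):
                        # Thread (ty, tx) loads element (by + ty, ph + tx) in this phase.
                        counts[(by + ty, ph + tx)] += 1
    return counts


print(set(request_counts(8, 1).values()))  # no tiling: N = 8 requests per element
print(set(request_counts(8, 4).values()))  # T = 4: N / T = 2 requests per element
```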
