Vulkan asynchronous compute

https://www.youtube.com/watch?v=XOGIDMJThto

https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

https://gpuopen.com/concurrent-execution-asynchronous-queues/

Running work on several queues in parallel increases parallelism on the GPU.

Concurrency

Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each of those containing 4 Single-Instruction-Multiple-Data units (SIMD) and each SIMD executes blocks of 64 threads, which we call a “wavefront”.

Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU has 4 SIMDs.

Each SIMD executes blocks of 64 threads; one such block is a wavefront.

Pixel shader work runs on these units as well.
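
The figures above multiply out to a concrete upper bound on the work the GPU can keep scheduled. A minimal sketch of that arithmetic, using only the numbers quoted for the Fury X (the constant and function names are made up for illustration):

```python
# Figures quoted above for the Radeon Fury X.
COMPUTE_UNITS = 64
SIMDS_PER_CU = 4
THREADS_PER_WAVEFRONT = 64
MAX_WAVEFRONTS_PER_SIMD = 10  # scheduled per SIMD to hide memory latency


def max_wavefronts_in_flight() -> int:
    """Upper bound on wavefronts scheduled across the whole GPU."""
    return COMPUTE_UNITS * SIMDS_PER_CU * MAX_WAVEFRONTS_PER_SIMD


def max_threads_in_flight() -> int:
    """Upper bound on threads scheduled across the whole GPU."""
    return max_wavefronts_in_flight() * THREADS_PER_WAVEFRONT


print(max_wavefronts_in_flight())  # 2560
print(max_threads_in_flight())     # 163840
```

This is only a ceiling. A single shader often cannot reach it because of its register and LDS usage, which is where work from another queue can fill otherwise idle slots.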

Raising concurrency on the GPU reduces the time the GPU sits idle.

async compute

Copy Queue (DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
Compute queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
Direct Queue (DirectX 12) / Graphics Queue (Vulkan): this queue can do anything, so it is similar to the main device in legacy APIs

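These three queue types correspond to the three encoder types in Metal (blit, compute, render). They exist to increase the concurrency described above.

Queues are how this kind of low-level control over the GPU is exposed.

Vulkan limits the number of queues in each queue family; the limit can be queried.

DirectX 12 has no such limit on the number of queues.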
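
In Vulkan the limit is reported per queue family by `vkGetPhysicalDeviceQueueFamilyProperties`, which fills in `queueFlags` and `queueCount` for each family. The sketch below does not call the real API. It models that data in plain Python to show one plausible way of picking a family for async compute. The flag values mirror `VK_QUEUE_GRAPHICS_BIT`, `VK_QUEUE_COMPUTE_BIT` and `VK_QUEUE_TRANSFER_BIT`; the `QueueFamily` class, the helper name and the sample family list are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Values mirror VkQueueFlagBits in the Vulkan headers.
QUEUE_GRAPHICS_BIT = 0x1
QUEUE_COMPUTE_BIT = 0x2
QUEUE_TRANSFER_BIT = 0x4


@dataclass(frozen=True)
class QueueFamily:
    """Stand-in for the queueFlags/queueCount fields of VkQueueFamilyProperties."""
    queue_flags: int
    queue_count: int


def find_async_compute_family(families: Sequence[QueueFamily]) -> Optional[int]:
    """Return the index of a compute-capable family, preferring one without graphics."""
    fallback = None
    for index, family in enumerate(families):
        if family.queue_count == 0 or not family.queue_flags & QUEUE_COMPUTE_BIT:
            continue
        if not family.queue_flags & QUEUE_GRAPHICS_BIT:
            return index  # dedicated compute family
        if fallback is None:
            fallback = index  # compute is available, but shared with graphics
    return fallback


# Invented example layout; real devices report their own families.
example_families = [
    QueueFamily(QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT, 1),
    QueueFamily(QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT, 2),
    QueueFamily(QUEUE_TRANSFER_BIT, 1),
]
print(find_async_compute_family(example_families))  # 1
```

Preferring a family without the graphics bit is a design choice of this sketch: such a family is the likeliest to run alongside the graphics queue rather than share it.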

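Move more of the frame into compute shaders so that it can run as async compute.

See the diagrams in the links above; I have not worked through this part yet.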

Troubleshooting

If resources are located in system memory, accessing those from Graphics or Compute queues will have an impact on DMA queue performance and vice versa.
Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations. Note: bandwidth is the limiting factor here, so keep data on chip where possible.
Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU
Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance

For the reasons above, it is recommended to determine the bottleneck of each pass and place passes with complementary bottlenecks next to each other:

Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
Depth only rendering passes are usually good candidates to have some compute tasks run next to it
A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
Porting as much of the frame as possible to compute will result in more flexibility when experimenting with which tasks can be scheduled next to each other
Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of “for each light clear shadow map, render shadow, compute VSM” do “clear all shadow maps, render all shadow maps, compute VSM for all shadow maps”)
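
The last point, reordering per-light work into batched phases, can be illustrated with a toy model. The sketch below builds both submission orders and counts how often the stage changes from one step to the next, as a rough proxy for barrier points. Treating every stage change as one barrier is an assumption of this sketch, and all names in it are hypothetical.

```python
from typing import List, Tuple

STAGES = ("clear", "render", "vsm")
Step = Tuple[str, int]  # (stage name, light index)


def per_light_order(light_count: int) -> List[Step]:
    """For each light: clear shadow map, render shadow, compute VSM."""
    return [(stage, light) for light in range(light_count) for stage in STAGES]


def batched_order(light_count: int) -> List[Step]:
    """Clear all shadow maps, render all shadow maps, compute VSM for all."""
    return [(stage, light) for stage in STAGES for light in range(light_count)]


def stage_switches(order: List[Step]) -> int:
    """Count positions where the stage differs from the previous step."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev[0] != cur[0])


print(stage_switches(per_light_order(4)))  # 11
print(stage_switches(batched_order(4)))    # 2
```

Both orders do the same work; only the batched one leaves long runs of independent tasks of the same kind, which is what gives an async compute queue something to overlap with.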

Then add a switch that turns async compute on and off.
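
One way such a switch might look: a single flag decides whether compute passes go to the async compute queue or fall back to the graphics queue, which can run them too. That makes it easy to compare the two paths when profiling or debugging. The `QueueConfig` type and `queue_for_pass` helper are hypothetical and not tied to any particular engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueConfig:
    """Hypothetical per-device settings; queue names are plain labels here."""
    async_compute_enabled: bool
    has_dedicated_compute_queue: bool


def queue_for_pass(is_compute_pass: bool, config: QueueConfig) -> str:
    """Pick the queue a pass is submitted to."""
    use_async = (
        is_compute_pass
        and config.async_compute_enabled
        and config.has_dedicated_compute_queue
    )
    # The graphics queue accepts compute work too, so it is a safe fallback.
    return "compute" if use_async else "graphics"


enabled = QueueConfig(async_compute_enabled=True, has_dedicated_compute_queue=True)
disabled = QueueConfig(async_compute_enabled=False, has_dedicated_compute_queue=True)
print(queue_for_pass(True, enabled))   # compute
print(queue_for_pass(True, disabled))  # graphics
```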

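From what I can tell, Vulkan does not seem to have an equivalent of Metal 2's persistent threadgroup memory, which keeps data on tile while it is passed between compute and pixel shaders.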
