Cooperative Groups

Cooperative Groups

Cooperative Groups（协同组）是CUDA 9.0引入的一个新概念，主要用于跨线程块（block）的同步。为使用Cooperative Groups，我们需要包含头文件#include <cooperative_groups.h>，同时需要cooperative_groups命名空间。

简介

在CUDA 9.0之前，CUDA仅支持线程块内的同步，CUDA提供了2个原语操作：__syncthreads()函数用于同步同一线程块内的所有线程，以及__syncwarp(unsigned mask=0xffffffff)函数用于同步线程束内的线程。

附1：

由于__syncthreads()函数要求整个线程块内的所有线程都得到达该同步点方能继续执行，也就是说同一线程块的if条件必须都相同，否则程序将会被挂起或产生意想不到的结果。为避免此问题，CUDA提供了如下三个函数用于评估if条件的预测值：

int __syncthreads_count(int predicate);

该函数在__syncthreads()函数基础上增加了返回predicate值非0的线程的数目。

int __syncthreads_and(int predicate);

该函数在__syncthreads()函数基础上，当且仅当块内所有线程predicate值非0时返回一个非0值。

int __syncthreads_or(int predicate);

该函数在__syncthreads()函数基础上，当且仅当块内存在任意一个线程predicate值非0时返回一个非0值。

显然，线程块级的同步并不能满足开发者的需求，在某些时候，开发者需要跨线程块同步，针对此问题，CUDA 9.0推出了Cooperative Groups机制，用于线程块内和跨线程块的同步。该机制为开发者提供了自定义线程组的方式，并提供了相应的同步函数，同时还包括一个新的kernel启动API（cudaLaunchCooperativeKernel），该API保证了Cooperative Groups同步的安全性。

块内组

thread_block

Cooperative Groups引入了一个新的数据结构：thread_block，即线程块。thread_block可以通过this_thread_block()进行获取并初始化：

thread_block g = this_thread_block();

thread_block继承自更广义的线程组数据结构：thread_group 。thread_group 提供了如下函数：

void sync(); //同步组内的所有线程，这里g.sync()等价于__syncthreads()

unsigned size(); //获取组内的线程数目

unsigned thread_rank();  //获取线程的组内索引值（[0,size]）

bool is_valid();  //判断本组是否违背了任何APIconstraints（API限制）

thread_block则提供如下特定线程块函数：

dim3 group_index();  //网格grid内3维索引（block索引）

dim3 thread_index();  //块block内3维索引（线程索引）

注意以上所有操作组内所有线程都要确保执行到，否则行为未定义。

相比__syncthreads()函数，使用g.sync()的好处在于避免了隐式同步隐患：

__device__ int sum(int *x, int n) {

    // ...

    __syncthreads();

    return total;

}

__global__ void parallel_kernel(float *x){

    // ...

    // Entire thread block must call sum

    sum(x, n);

}

此时，当开发者调用他人编写的sum函数时，不一定能发现sum中存在着同步，但当我们显式传参时情况就不一样了：

__device__ int sum(const thread_group& g, int *x, int n)

{

    // ...

    g.sync()

    return total;

}

__global__ void parallel_kernel(float *x)

{

    // ...

    // Entire thread block must call sum

    sum(this_thread_block(), x, n);

}

tiled_partition

tiled_partition()函数用于将一个线程块分解为多个小的协同线程组（tiled subgroups），比如说：

thread_block wholeBlock = this_thread_block(); //获取线程块

以下函数将线程块分解为若干个大小为32的小线程组：

thread_group tile32 = tiled_partition(wholeBlock, 32);

甚至可以更深一步，将tile32分解为更小的若干个大小为4的小线程组：

thread_group tile4 = tiled_partition(tile32, 4);

注意：小线程组大小仅支持2的幂数且不大于32，也就是仅限于2,4,8,16,32

现在，我们通过如下操作就可以让线程0,4,8,12,...（相对于wholeBlock 的索引）打印"Hello World"：

if (tile4.thread_rank() == 0) printf("Hello World\n");

Thread Block Tiles

CUDA还提供了thread_block_tile<>模版使得小线程组大小在编译期就可以得到：

thread_block wholeBlock = this_thread_block(); //获取线程块

thread_block_tile<32> tile32 = tiled_partition<32>(wholeBlock);

thread_block_tile<4> tile4 = tiled_partition<4>(tile32);

Thread Block Tiles提供了如下成员函数用于协同同步：

.shfl()  //等价__shfl_sync

.shfl_down()  //等价__shfl_down_sync

.shfl_up()  //等价__shfl_up_sync

.shfl_xor()  //等价__shfl_xor_sync

.any()  //等价__any_sync

.all()  //等价__all_sync

.ballot()  //等价__ballot_sync

.match_any()  //等价__match_any_sync

.match_all() //等价__match_all_sync

注意相比通过tiled_partition()函数传参动态设置线程组大小，通过tiled_partition<>模版静态设置线程组大小使得开发者可以使用如上这些线程束同步函数，前者不能。

附2：

__shfl_sync系列指令（俗称洗牌指令）用于在线程束中获取指定线程的变量值，该操作会在mask（一般取0xffffffff，每个bit位代表每个线程id）指定的那些线程中同时执行（同一mask中的线程必须执行相同指令），每次移动4字节或8字节的数据，但若指定线程为非活跃线程，则结果未知。具体功能如下：
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
__shfl_sync指令返回索引为srcLane线程的var变量值，其中srcLane大小为[0,width)，类似的，width的值必须是2的幂数且不大于32。
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
__shfl_up_sync指令返回索引为当前线程索引减去delta的值的线程的var值，若减去后的值小于0则不做任何操作（保持不变）。
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
__shfl_down_sync指令返回索引为当前线程索引加上delta的值的线程的var值，若加后的值大于width则不做任何操作（保持不变）。
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
__shfl_xor_sync指令返回索引为当前线程索引按位异或laneMask后的值的线程的var值。注意若width值小于warpSize值，此时后面的线程可以访问前面的线程组的值（获取成功），但前面的线程不能访问后面线程组的值（保持不变）。

附3：

__any_sync系列指令（俗称投票指令）对线程束中的参与线程（同样由mask指定）比较预测值predicate是否非零，并向所有参与的活跃线程广播比较结果：
int __all_sync(unsigned mask, int predicate);
当线程束中所有参与线程的预测值predicate非零时返回一个非零值。
int __any_sync(unsigned mask, int predicate);
当线程束中存在任意一个参与线程的预测值predicate非零时返回一个非零值。
unsigned __ballot_sync(unsigned mask, int predicate);
若线程束中的第N个线程活跃且其预测值predicate非零时，设定返回值的第N个bit为1，否则为0。
unsigned __activemask();
返回线程束内活跃线程组成的掩码。若线程束中的第N个线程为活跃线程，则设定第N个bit为1，否则为0（注意已退出线程也是非活跃线程）。该指令不执行同步。

附4：

__match_any_sync系列指令对线程束的参与线程（同样由mask指定）比较value值，并向所有参与线程广播比较结果：
unsigned int __match_any_sync(unsigned mask, T value);
返回value值相同的那些线程组成的掩码。
unsigned int __match_all_sync(unsigned mask, T value, int *pred);
返回mask值若所有参与线程的value值都相同，否则返回0。此外前者的预测值pred还将被设定为true，否则为false。

Coalesced Groups

若同一线程束（warp）内的线程出现条件分化（通常由if语句导致），那么程序将序列化运行：既在执行某分支线程时停止其它分支线程的执行，直到所有分支执行完毕。我们称正执行的活跃线程为coalesced thread，线程束内所有活跃线程组成的线程组即为coalesced groups，其可以通过coalesced_threads函数获取：

coalesced_group active = coalesced_threads();

coalesced_group也是一类thread_group。

网格级同步

相比块内组，Cooperative Groups最强大的能力在于跨线程块同步，在CUDA 9.0之前，不同线程块仅能在kernel执行结束时同步，现在开发者可以通过grid_group 结构执行网格级同步：

grid_group grid = this_grid();

grid.sync();

注意不同于传统的<<<...>>>执行配置，网格级同步必须通过cudaLaunchCooperativeKernel API配置并启动kernel：

cudaError_t cudaLaunchCooperativeKernel(

  const T *func, //kernel函数指针

  dim3 gridDim,

  dim3 blockDim,

  void **args, //kernel参数数组

  size_t sharedMem = 0,

  cudaStream_t stream = 0

)

注意为保证所有协同线程块能安全的常驻GPU，gridDim和blockDim的值需要慎重考虑，开发者可以通过计算SM的最大活跃线程块数目来最大化并行率：

cudaOccupancyMaxActiveBlocksPerMultiprocessor(

    &numBlocksPerSm,

    my_kernel,

    numThreads,

    0

);

// initialize, then launch

cudaLaunchCooperativeKernel(

    (void*)my_kernel,

    deviceProp.multiProcessorCount*numBlocksPerSm,

    numThreads,

    args

);

Cooperative Launch目前不支持任务抢占和调度，若一次启动的block数超过了设备驻留的极限，则报错too many blocks in cooperative launch cudaLaunchCooperativeKernel，此时你需要检查一下启动block数、使用的共享内存大小、使用的寄存器大小。相关问题见(https://bbs.gpuworld.cn/index.php?topic=73127.0)

除特殊的启动函数外，网格同步还需要在编译时开启-rdc=true参数。

该功能仅支持计算能力6.0及以上的设备，在不确定GPU是否支持网格同步时，开发者可以通过如下方式查询：

int pi=0;

cuDevice dev;

cuDeviceGet(&dev,0) // get handle to device 0

cuDeviceGetAttribute(&pi, CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH, dev);

当pi值为1时表明设备0支持网格级同步。

多设备同步

类似网格级同步，多设备同步通过multi_grid_group 结构执行：

multi_grid_group multi_grid = this_multi_grid();

multi_grid.sync();

并通过cudaLaunchCooperativeKernelMultiDevice API配置并启动kernel：

cudaError_t cudaLaunchCooperativeKernelMultiDevice(

    CUDA_LAUNCH_PARAMS *launchParamsList,

    unsigned int numDevices,

    unsigned int flags = 0

);

其中CUDA_LAUNCH_PARAMS结构体定义如下：

typedef struct CUDA_LAUNCH_PARAMS_st {

    CUfunction function;

    unsigned int gridDimX;

    unsigned int gridDimY;

    unsigned int gridDimZ;

    unsigned int blockDimX;

    unsigned int blockDimY;

    unsigned int blockDimZ;

    unsigned int sharedMemBytes;

    CUstream hStream;

    void **kernelParams;

} CUDA_LAUNCH_PARAMS;

当开发者使用该API需要注意如下几点：

该API将确保一个launch操作是原子的，例如当API调用成功时，相应数目的线程块在所有指定设备上launch成功。

对于所有设备，该API调用的kernel函数必须是相同的。

同一设备上的launchParamsList参数必须是相同的。

所有设备的计算能力必须是相同的（major and minor versions）。

对于所有设备，配置的网格大小（gridDim）、块大小（blockDim）和每个网格的共享内存大小必须是相同的。

自定义的__device__,__constant__,__managed__全局变量在每个设备上都是独立实例化的，因此需要开发者对该类变量赋初值。

类似的，该功能仅支持计算能力6.0及以上设备，可以通过CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH 查询。