[CUDA] cub库的使用

文章目录

1. cub 简介
2. cub的基本组件
- Warp-wide
- Block-wide
- - 1. cub::BlockRadixSort的使用注意事项
- Device-wide primitives
- - Single-problem
  - Segmented-problem (batch)
- Utities
3. cub的使用实例
- 3.1 nonzeros的实现（使用DeviceSelect组件）
- 3.2 allocator的使用
- 3.3 reduce类操作
- 3.4 commutative sum
- 3.5 max/min类操作
- 3.6 scan 类操作
- 3.7 unique类操作
- 3.8 partition 操作
4. reference

1. cub 简介

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model
不同范畴的集合操作：包括warp-wide, block-wide, device-wide三种级别，cuda提供scalar和collective接口，scalar接口就是我们常写的一些kernel算子，但是当一组协作的操作需要一起执行的时候，可以考虑用collective接口打包这些操作，使得并行化更强。而collective接口有三种范畴/级别。
主要操作函数: scan, reduction, histogram, sort等；

2. cub的基本组件

Warp-wide

Block-wide

1. cub::BlockRadixSort的使用注意事项

注意block radix sort是对于一个block来说的，除非让一个block包含所有元素才会用一个block实现所有的元素的排序，否则会将整个元素按照blocksize*items_per_thread为一个segment进行排序；
另外需要注意BlockRadix之前最好用BlockLoad来加载数据，然后不要在前面加if(index >= size) return 来让其返回，因为需要补全元素到blocksize的倍数才行，否做出现illegal memory问题；用Load(ptr, thread_values, size - block_offset, default_value); 来补齐不足的数据。


__global__ void CubRadixSortKernel(const int* __restrict__ input, int row,int column, int sample_stride,uint64_t* __restrict__ hash) {const int index = blockIdx.x * blockDim.x + threadIdx.x;// if (index >= row) return; // 注意不要有这个return，要让其kernel跑满 blocksize * item_per_thread的大小// 这个例子是256*4=1024个对象。typedef cub::BlockRadixSort<int, 256, 4> BlockRadixSort;typedef cub::BlockLoad<int, 256, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;typedef cub::BlockStore<int, 256, 4, cub::BLOCK_STORE_TRANSPOSE> BlockStore;__shared__ union {typename BlockRadixSort::TempStorage sort;typename BlockLoad::TempStorage load;typename BlockStore::TempStorage store;} temp_storage;int block_offset = blockIdx.x * (256 * 4); // 注意这个offset是比较关键的一个pivot.int thread_keys[4];// 重要点： row - block_offset是个很重要的表示，如果不加这个，就会出现illegal memory, 因为// 后面BlockRadixSort在最后一个block不满blocksize的时候，会排序整个blocksize*items_per_thread// 这就导致访问越界； 所以这个row-block_offset防止访问数据超界并且数量不足就补足一个block操作的数量。// 这样就不会导致BlockRadixSort超界。BlockLoad(temp_storage.load).Load(input + block_offset, thread_keys, row - block_offset, 0);__syncthreads();BlockRadixSort(temp_storage.sort).Sort(thread_keys);__syncthreads();// StoreBlockStore(temp_storage.store).Store(reinterpret_cast<int*>(hash) + block_offset, thread_keys);
}

Device-wide primitives

Single-problem

Segmented-problem (batch)

Utities

3. cub的使用实例

可以直接去document 函数中去看使用用例，就能知道怎么用，然后根据自己的需求选择需要使用的函数。
TODO: 总结使用规律

3.1 nonzeros的实现（使用DeviceSelect组件）

1主要的思想是将tensor数据flatten一维，然后得到非零的index（非零的位置）和flags（只有0，1）的tensor；
2然后利用cub的cub::DeviceSelect::Flagged函数，根据flags将非零的index选择出来；得到(selected_num,) 大小的tensor
3然后再通过kernel遍历，将index扩展为横纵两个坐标值: i, j
4最后输出（selected_num， 2）的out tensor

总结来说，就是先准备cub函数需要的一些数据，然后利用cub的select/reduce/scan等single problem机制，进行高并行计算。

cub::DeviceSelect::Flagged的官方说明

/**
* out is the output of nonzero
* selected_num is the size(0) of nonzero
*/
void NonZero(const float* inp, int N, int* &out, int &selected_num)
{cub::CachingDeviceAllocator allocator(true);int *d_in = NULL;unsigned char * d_flags = NULL;allocator.DeviceAllocate((void**)&d_in, sizeof(int) * N);allocator.DeviceAllocate((void**)&d_flags, sizeof(unsigned char) * N);cudaMemset(d_in, 0, sizeof(int) * N);cudaMemset(d_flags, 0, sizeof(unsigned char) * N);// flatten, get index and flags; get_indices_flags(inp, N, d_in, d_flags);// cub selectedint *d_out = NULL;int *d_num_selected = NULL;allocator.DeviceAllocate((void**)&d_out, sizeof(int) * N);allocator.DeviceAllocate((void**)&d_num_selected, sizeof(int));// Allocate temporary storagevoid *d_temp_storage = NULL;size_t temp_storage_bytes = 0;cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected, N);allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes);cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected, N);int h_number_selected[1];cudaMemcpy(h_number_selected, d_num_selected, sizeof(int), cudaMemcpyDeviceToHost);selected_num = *h_num_selected;// [x,y,x,y.....] shape of out: (selected_num, 2)allocator.DeviceAllocate((void**)&out, sizeof(int) * selected_num * 2);set_indices(d_out, selected_num, col, out); // 将一维的index展开为两维的col,row 坐标
}

3.2 allocator的使用

TODO: 确定allocator的机制，以及怎么使用

// 有一些配置
cub::CachingDeviceAllocator allocator(true);
allocator.DeviceAllocate((void**)&out, sizeof(int) * selected_num * 2);

3.3 reduce类操作

reducebykey: 可以考虑用于不同维度累加
sum
https://zhuanlan.zhihu.com/p/416959273

3.4 commutative sum

3.5 max/min类操作

简单说明一下，可以如何用的思考

3.6 scan 类操作

Prefix scan指的是前缀扫描操作,当操作符为相加时,输出向量的各元素向前归约值
scan 的意思就是扫描，也就是逐项计算每一项，依次计算过程能实现一些规约操作，如min，max，mean，sum的计算

3.7 unique类操作

需要使用的函数：可以计数的，可以在unique中返回counts结果
在这里插入图片描述

3.8 partition 操作

也就是将指定的两种不同特性的数据分成前后两个部分，partition分段的意思。

在这里插入图片描述

[CUDA] cub库的使用

文章目录

1. cub 简介

2. cub的基本组件

Warp-wide

Block-wide

1. cub::BlockRadixSort的使用注意事项

Device-wide primitives

Single-problem

Segmented-problem (batch)

Utities

3. cub的使用实例

3.1 nonzeros的实现（使用DeviceSelect组件）

3.2 allocator的使用

3.3 reduce类操作

3.4 commutative sum

3.5 max/min类操作

3.6 scan 类操作

3.7 unique类操作

3.8 partition 操作

4. reference

相关资讯

热文排行

最新新闻

推荐新闻

热搜词

[CUDA] cub库的使用

文章目录

1. cub 简介

2. cub的基本组件

Warp-wide

Block-wide

1. cub::BlockRadixSort的使用注意事项

Device-wide primitives

Single-problem

Segmented-problem (batch)

Utities

3. cub的使用实例

3.1 nonzeros的实现 （使用DeviceSelect组件）

3.2 allocator的使用

3.3 reduce类操作

3.4 commutative sum

3.5 max/min类操作

3.6 scan 类操作

3.7 unique类操作

3.8 partition 操作

4. reference

相关资讯

热文排行

最新新闻

推荐新闻

热搜词

3.1 nonzeros的实现（使用DeviceSelect组件）