onnxruntime/onnxruntime/core/mickey
Chen Fu 6fb09055d4
Adding a sm80 q4 gemm kernel for small tiles (#20545)
### Description

Implements a q4 GEMM CUDA kernel for small tiles, targeting small
sequence_len or batch_size (<= 16).
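
For readers unfamiliar with the term, a q4 GEMM multiplies a floating-point activation matrix by a weight matrix stored as 4-bit quantized values. The following is a minimal CPU reference sketch of that idea, not the kernel added in this change. The `Q4Matrix` layout (column-major packing, low nibble first, one scale per block, fixed zero point of 8) and all names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration only: B is K x N, stored column by
// column, two unsigned 4-bit values per byte (low nibble first), with one
// float scale per block of `block_size` consecutive K elements in a column
// and a fixed zero point of 8. Requires k to be a multiple of block_size
// and block_size to be even.
struct Q4Matrix {
  std::size_t k = 0;
  std::size_t n = 0;
  std::size_t block_size = 32;
  std::vector<std::uint8_t> packed;  // n * k / 2 bytes
  std::vector<float> scales;         // n * (k / block_size) floats
};

// Recover the real value of element (row, col) of B.
inline float Dequantize(const Q4Matrix& b, std::size_t row, std::size_t col) {
  const std::size_t idx = col * b.k + row;
  const std::uint8_t byte = b.packed[idx / 2];
  const int q = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
  const float scale =
      b.scales[col * (b.k / b.block_size) + row / b.block_size];
  return scale * static_cast<float>(q - 8);
}

// Reference C = A * dequant(B), with A (m x k) and C (m x n) row-major.
inline std::vector<float> Q4GemmReference(const std::vector<float>& a,
                                          std::size_t m, const Q4Matrix& b) {
  std::vector<float> c(m * b.n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < b.n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < b.k; ++p) {
        acc += a[i * b.k + p] * Dequantize(b, p, j);
      }
      c[i * b.n + j] = acc;
    }
  }
  return c;
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small shapes; it says nothing about how the real kernel tiles or schedules the work.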

### Performance Test Results

| Problem Shape | New Kernel | | | Current Kernel | |
| ------------------: | ----------- | ------- |--| ------------- | ------- |
| **(M x N x K)** | **Latency (ms)** | **GFLOPS** | | **Latency (ms)** | **GFLOPS** |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | | 0.080849 | 239054 |
| | | | | | |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | | 0.231284 | 250696 |
| | | | | | |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | | 0.237244 | 217243 |
| | | | | | |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | | 0.205883 | 250334 |
| | | | | | |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | | 0.256388 | 209398 |
| | | | | | |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | | 0.770569 | 243852 |
| | | | | | |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | | 0.128425 | 104511 |
| | | | | | |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | | 0.860118 | 218464 |
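
The GFLOPS columns appear to follow the usual GEMM convention of 2 * M * N * K floating-point operations divided by latency. The sketch below shows that arithmetic; the convention is an assumption inferred from the numbers, and the function names are hypothetical.

```cpp
#include <cstdint>

// Throughput of an M x N x K GEMM in GFLOPS, counting one multiply and one
// add per inner-product term (2 * M * N * K floating-point operations).
// Assumption: this is the convention behind the GFLOPS columns in the table.
inline double GemmGflops(std::int64_t m, std::int64_t n, std::int64_t k,
                         double latency_ms) {
  const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                       static_cast<double>(k);
  const double seconds = latency_ms * 1e-3;
  return flops / seconds * 1e-9;
}

// Ratio of the current kernel's latency to the new kernel's latency;
// values above 1 mean the new kernel is faster.
inline double Speedup(double current_latency_ms, double new_latency_ms) {
  return current_latency_ms / new_latency_ms;
}
```

For the 16 x 3072 x 3072 row, `GemmGflops(16, 3072, 3072, 0.011263)` gives roughly 26,800 GFLOPS, in line with the reported 26813.7. For the 1 x 3072 x 3072 row, `Speedup(0.017231, 0.008124)` is about 2.1.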
2024-06-12 16:02:26 -07:00
blk_q4 Enable CUDA EP unit testing on Windows (#20039) 2024-03-27 13:32:36 -07:00
cutlass_ext/q4gemm Adding a sm80 q4 gemm kernel for small tiles (#20545) 2024-06-12 16:02:26 -07:00
gemm Adding a sm80 q4 gemm kernel for small tiles (#20545) 2024-06-12 16:02:26 -07:00
int_util.h Adding a sm80 q4 gemm kernel for small tiles (#20545) 2024-06-12 16:02:26 -07:00
README.md

About Mickey

Mickey is a playful name for a template library of high-performance CUDA code that is often shared by various AI operators. The intention is to keep it header-only, with no binary impact unless a template is instantiated where it is needed.
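
As a rough illustration of the header-only idea, the sketch below defines a function template entirely in a header. The namespace and function are hypothetical and are not part of this library; the point is only that the compiler generates code for the template solely for the types a translation unit actually uses.

```cpp
#include <cstddef>

namespace mickey_sketch {  // hypothetical namespace, not the library's own

// A function template defined entirely in a header. No object code exists
// for it until some caller instantiates it with a concrete element type.
template <typename T>
T DotProduct(const T* a, const T* b, std::size_t n) {
  T acc = T(0);
  for (std::size_t i = 0; i < n; ++i) {
    acc += a[i] * b[i];
  }
  return acc;
}

}  // namespace mickey_sketch
```

A source file that calls `mickey_sketch::DotProduct` on `float` pointers gets only the `float` instantiation compiled into it; a file that includes the header without calling the template adds nothing to the binary.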

Currently, CUDA code is scattered across multiple locations in the repo. Hopefully this can be the starting point for consolidating all CUDA code.