mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-06-30 03:37:44 +00:00
### Description

Implementation of a q4 gemm cuda kernel for small tiles and small sequence_len or batch_size (<=16).

### Performance Test Results

| Problem Shape (M x N x K) | New Kernel Latency (ms) | New Kernel GFLOPS | Current Kernel Latency (ms) | Current Kernel GFLOPS |
| ------------------: | ----------- | ------- | ------------- | ------- |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | 0.080849 | 239054 |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | 0.231284 | 250696 |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | 0.237244 | 217243 |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | 0.205883 | 250334 |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | 0.256388 | 209398 |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | 0.770569 | 243852 |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | 0.128425 | 104511 |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | 0.860118 | 218464 |
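
The sketch below shows how a latency figure can be turned into a GFLOPS figure. It assumes the GFLOPS columns count 2 * M * N * K operations per GEMM (one multiply and one add per inner-product term); the source does not state the formula, although it reproduces the 16 x 3072 x 3072 new-kernel entry closely. The helper name `gemm_gflops` is hypothetical and not part of the repository.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only (not from the repository).
// Converts a measured GEMM latency into GFLOPS, ASSUMING the operation
// count is 2 * M * N * K (one multiply plus one add per term).
inline double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                          double latency_ms) {
  assert(latency_ms > 0.0);  // a zero or negative latency is meaningless
  const double flops = 2.0 * static_cast<double>(m) *
                       static_cast<double>(n) * static_cast<double>(k);
  const double seconds = latency_ms * 1e-3;  // milliseconds -> seconds
  return flops / seconds / 1e9;              // FLOP/s -> GFLOPS
}
```

For example, `gemm_gflops(16, 3072, 3072, 0.011263)` evaluates to roughly 26,800 GFLOPS, in line with the reported value for that shape.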
| Name |
|---|
| blk_q4 |
| cutlass_ext/q4gemm |
| gemm |
| int_util.h |
| README.md |

### About Mickey
A playful name for a template library of high-performance CUDA code that is often shared by various AI operators. The intention is to keep the library header-only, so that it has no binary impact unless a template is instantiated where it is needed.
Currently, CUDA code is scattered across multiple locations in the repo. Hopefully this can be the starting point for consolidating all of it.
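
As a minimal sketch of the header-only idea, a templated utility generates no object code until some translation unit instantiates it. The namespace `mickey_sketch`, the helpers `div_up` and `blocks_for`, and the block size of 32 used in the usage note are all hypothetical, chosen for illustration; they are not taken from the library's actual headers.

```cpp
#include <type_traits>

namespace mickey_sketch {  // hypothetical namespace, illustration only

// Integer division rounded up. As a function template defined in a header,
// it contributes nothing to a binary until it is instantiated by a caller.
// Intended for non-negative a and positive b.
template <typename T>
constexpr T div_up(T a, T b) {
  static_assert(std::is_integral<T>::value, "div_up expects an integer type");
  return (a + b - 1) / b;
}

// Number of fixed-size blocks needed to cover k elements. BlockSize is a
// compile-time parameter, so each distinct value used by a caller produces
// its own instantiation, and unused values produce none.
template <int BlockSize>
constexpr int blocks_for(int k) {
  static_assert(BlockSize > 0, "BlockSize must be positive");
  return div_up(k, BlockSize);
}

}  // namespace mickey_sketch
```

A caller that includes this header and writes `mickey_sketch::blocks_for<32>(3072)` instantiates only that one specialization, which is the sense in which a header-only template library avoids binary impact where it is not used.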