mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-30 03:37:44 +00:00

Chen Fu 6fb09055d4 Adding a sm80 q4 gemm kernel for small tiles (#20545) 2024-06-12 16:02:26 -07:00

### Description

Implementation of a q4 gemm cuda kernel for small tiles and small sequence_len or batch_size (<=16)

### Performance Test Results

| Problem Shape (M x N x K) | New Kernel Latency (ms) | New Kernel GFLOPS | Current Kernel Latency (ms) | Current Kernel GFLOPS |
| ------------------: | ----------- | ------- | ------------- | ------- |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | 0.080849 | 239054 |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | 0.231284 | 250696 |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | 0.237244 | 217243 |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | 0.205883 | 250334 |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | 0.256388 | 209398 |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | 0.770569 | 243852 |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | 0.128425 | 104511 |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | 0.860118 | 218464 |
blk_q4	Enable CUDA EP unit testing on Windows (#20039 )	2024-03-27 13:32:36 -07:00
cutlass_ext/q4gemm	Adding a sm80 q4 gemm kernel for small tiles (#20545 )	2024-06-12 16:02:26 -07:00
gemm	Adding a sm80 q4 gemm kernel for small tiles (#20545 )	2024-06-12 16:02:26 -07:00
int_util.h	Adding a sm80 q4 gemm kernel for small tiles (#20545 )	2024-06-12 16:02:26 -07:00
README.md

About Mickey

Mickey is a playful name for a template library of high-performance CUDA code that is often shared by various AI operators. The intention is to make it header-only, with no binary impact unless a template is instantiated where it is needed.
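As a rough illustration of the header-only idea, the sketch below shows a small templated header in plain C++. The file name, the `mickey_sketch` namespace, `div_up`, and `TileShape` are hypothetical names chosen for this example and are not taken from the library. The point is only that templates defined in a header produce code in a translation unit when that unit instantiates them, so including the header alone adds nothing to the binary.

```cpp
// mickey_sketch.h: hypothetical header for illustration, not a file in the repo.
#ifndef MICKEY_SKETCH_H_
#define MICKEY_SKETCH_H_

#include <cstddef>

namespace mickey_sketch {  // hypothetical namespace

// Function template: code is generated only for the types a caller
// actually instantiates, e.g. div_up<int> or div_up<std::size_t>.
template <typename T>
constexpr T div_up(T a, T b) {
  return (a + b - 1) / b;
}

// Class template parameterized by compile-time tile sizes (assumed
// parameters for this sketch). Each distinct <kTileM, kTileN> pair is a
// separate instantiation, created only where it is named.
template <int kTileM, int kTileN>
struct TileShape {
  static_assert(kTileM > 0 && kTileN > 0, "tile dimensions must be positive");
  static constexpr int kM = kTileM;
  static constexpr int kN = kTileN;

  // Number of tiles needed to cover an m x n output.
  static constexpr std::size_t tile_count(std::size_t m, std::size_t n) {
    return div_up<std::size_t>(m, kM) * div_up<std::size_t>(n, kN);
  }
};

}  // namespace mickey_sketch

#endif  // MICKEY_SKETCH_H_
```

An operator that needs, say, a 16 x 128 tile would name `mickey_sketch::TileShape<16, 128>` in its own source file, and only that instantiation would be compiled into that file's object code.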

Currently, CUDA code is scattered across multiple locations in the repo. Hopefully this can be the starting point for consolidating all CUDA code.