onnxruntime/onnxruntime/core/mickey
Chen Fu 06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adds a CUDA kernel for block-wise 4-bit quantized float16 GEMM,
optimized specifically for NVIDIA Ampere GPUs (sm80).
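
To make "block-wise 4-bit quantized" concrete, here is a minimal host-side C++ sketch of dequantization. It is not the kernel from this change and does not reproduce the layout used in `blk_q4`. The packing order (two weights per byte, low nibble first), the fixed zero point of 8, the use of `float` in place of float16, and the name `dequantize_blockwise_q4` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed zero point: maps the nibble range [0, 15] onto [-8, 7].
constexpr int kZeroPoint = 8;

// Dequantize `n` 4-bit weights (hypothetical layout, illustration only).
//   packed:     two weights per byte, low nibble first (assumption)
//   scales:     one scale per block of `block_size` consecutive weights
//   block_size: number of weights that share a scale
std::vector<float> dequantize_blockwise_q4(const std::vector<std::uint8_t>& packed,
                                           const std::vector<float>& scales,
                                           std::size_t n,
                                           std::size_t block_size) {
  assert(block_size > 0);
  assert(packed.size() * 2 >= n);           // enough nibbles for n weights
  assert(scales.size() * block_size >= n);  // enough scales for n weights

  std::vector<float> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const std::uint8_t byte = packed[i / 2];
    // Even indices read the low nibble, odd indices the high nibble.
    const int nibble = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    // Every weight in a block is rescaled by that block's scale.
    out[i] = static_cast<float>(nibble - kZeroPoint) * scales[i / block_size];
  }
  return out;
}
```

A GEMM kernel over such weights would typically fuse this dequantization into the matrix multiply instead of materializing the full-precision weights first.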


### Motivation and Context
Improve quantized LLM inference performance on NVIDIA Ampere
GPUs.

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in the current build system, loading of the CUTLASS
dependency is guarded by:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, then compilation will fail.

Why is the CUTLASS dependency guarded at all? It is a header-only
library that adds nothing to the binary unless its templates are
instantiated. What is the downside of removing all the guards and
including CUTLASS unconditionally?
2024-03-05 09:37:45 -08:00
blk_q4
cutlass_ext/q4gemm
README.md

About Mickey

Mickey is a playful name for a template library of high-performance CUDA code that is often shared by various AI operators. The intention is to keep it header-only, so that it has no binary impact unless it is instantiated where it is needed.
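
As a rough illustration of the header-only idea, the sketch below uses a hypothetical `DotProduct` template that is not part of Mickey. The compiler emits code for it only in translation units that instantiate it with concrete arguments, and only for the instantiations that are actually used.

```cpp
#include <cstddef>

// Hypothetical header-only helper, for illustration only.
// Including this header adds nothing to a binary by itself: code is
// generated only for the (T, N) combinations that a translation unit
// actually instantiates.
template <typename T, std::size_t N>
struct DotProduct {
  static T run(const T (&a)[N], const T (&b)[N]) {
    T acc = T(0);
    for (std::size_t i = 0; i < N; ++i) {
      acc += a[i] * b[i];
    }
    return acc;
  }
};
```

For example, calling `DotProduct<float, 3>::run(a, b)` in one source file instantiates only that specialization. No other element type or length contributes any code.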

Currently, CUDA code is scattered across multiple locations in the repo. Hopefully this can be the starting point for consolidating all of it.