pytorch/aten
Mengwei Liu 6564d63e69 Use mv kernel for small M (#128632)
Previously we are using:
* mv kernel for M == 1
* mm kernel for 1 < M < 4
* llama.cpp inspired mm kernel for M >= 4

This PR consolidate it to only 2 kernels, use the same mv kernel for M <
12.

Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm

Mac M1 Max, input size M x 4128 x 4096

![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632
Approved by: https://github.com/malfet
2024-06-14 01:06:53 +00:00
..
conda
src Use mv kernel for small M (#128632) 2024-06-14 01:06:53 +00:00
tools
CMakeLists.txt