mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is what the results look like:

Explanation of columns:

**wrong_max_spdup**: In the worst case, how much better the best choice would have been

**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better the best choice is on average (geomean)

**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice

**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice

**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much better it is in the worst case

**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice

**default_better**: Number of times the default choice is better than the choice made by the heuristic

```
set    crit     max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy  5          0.01              2376     740    323     3439   1.855386         1.063236          11.352318          3.438279            1.022164              3116               2
test   entropy  5          0.01              563      183    71      817    1.622222         1.060897          10.084181          3.507741            1.017039              746                2
```

While the number of wrong predictions is high, in those cases the best choice is on average only around 6% better (geomean). What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.

|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|------------:|------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s | 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s | 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s | 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s | 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s | 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s | 3.46 |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful not to choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613

Approved by: https://github.com/eellison
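
As a rough illustration of the commit message above, the following sketch encodes the listed applicability conditions and the geomean speedup used in the `gman_spdup_default` column. It is not the AutoHeuristic implementation: `MixedMMShape`, `heuristic_applies`, and `geomean_speedup` are hypothetical names introduced here, and the assumption that speedup is computed as baseline time divided by chosen time is mine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MixedMMShape:
    # Hypothetical container for illustration; not a type from AutoHeuristic.
    m: int
    k: int
    n: int
    mat1_transposed: bool
    mat2_transposed: bool


def heuristic_applies(shape: MixedMMShape) -> bool:
    """Return True if the input satisfies the conditions listed in the commit message."""
    return (
        shape.m <= 128
        and shape.k >= 1024
        and shape.n >= 1024
        and shape.k % 256 == 0          # k must be a multiple of 256
        and not shape.mat1_transposed
        and shape.mat2_transposed
    )


def geomean_speedup(baseline_times, chosen_times):
    """Geometric mean of per-input speedups (assumed: baseline time / chosen time)."""
    ratios = [b / c for b, c in zip(baseline_times, chosen_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, `heuristic_applies(MixedMMShape(16, 4096, 4096, False, True))` holds, while the same shape with `k=4100` is rejected because `k` is not a multiple of 256. The geometric mean is used instead of the arithmetic mean so that a single very large speedup does not dominate the summary.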
| | | |
|---|---|---|
| .. | ||
| _autoheuristic | ||
| aoti | ||
| api | ||
| decompositions | ||
| dest | ||
| executorch | ||
| fuse | ||
| operator_versions | ||
| selective_build | ||
| shape_functions | ||
| static_runtime | ||
| __init__.py | ||
| BUCK.oss | ||
| BUILD.bazel | ||
| build.bzl | ||
| code_template.py | ||
| context.py | ||
| gen.py | ||
| gen_aoti_c_shim.py | ||
| gen_backend_stubs.py | ||
| gen_executorch.py | ||
| gen_functionalization_type.py | ||
| gen_lazy_tensor.py | ||
| gen_vmap_plumbing.py | ||
| local.py | ||
| model.py | ||
| native_function_generation.py | ||
| utils.py | ||
| yaml_utils.py | ||