pytorch/pyproject.toml
Alnis Murtovi 48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).
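For intuition, this is roughly the kind of fit involved (a minimal sketch using scikit-learn, not the actual AutoHeuristic code; the feature and label names are hypothetical, while the hyperparameters match the results table below):

```python
# Sketch: learn "which mm choice is fastest" as a decision tree.
# Assumes scikit-learn; column names are hypothetical, not the real
# AutoHeuristic schema.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mixed_mm_a100.csv")  # collected benchmark results
X = df[["m", "k", "n"]]                # input-shape features
y = df["best_choice"]                  # fastest config per input

# criterion/max_depth/min_samples_leaf correspond to the crit, max_depth
# and min_samples_leaf columns in the results below.
clf = DecisionTreeClassifier(
    criterion="entropy", max_depth=5, min_samples_leaf=0.01
).fit(X, y)
```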

This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```
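To make the speedup columns concrete, they are maxima and geometric means over per-input speedup ratios, e.g. (a sketch with made-up runtimes):

```python
import numpy as np

# Hypothetical per-input runtimes (ms): default choice vs. the choice
# predicted by the learned heuristic.
time_default = np.array([1.20, 0.95, 3.40])
time_heuristic = np.array([0.40, 0.90, 1.10])

speedup = time_default / time_heuristic
max_spdup_default = speedup.max()                    # best single case
gman_spdup_default = np.exp(np.log(speedup).mean())  # geometric mean
```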

While the number of wrong predictions is high, the best choice is on average only around 6% better than the predicted one. What matters is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
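The difference between the two versions looks like this (a self-contained sketch; only the `torch.matmul` expression is the actual replacement from this PR's description, the shapes and device are made up):

```python
import torch
import torch.nn.functional as F

inp = torch.randn(4, 4096, device="cuda", dtype=torch.bfloat16)
weight = torch.randint(-128, 127, (4096, 4096), device="cuda", dtype=torch.int8)

# Original gpt-fast code: F.linear casts first and transposes internally,
# so the transpose ends up between the cast and the matmul -> no match.
out = F.linear(inp, weight.to(dtype=inp.dtype))

# Replacement: transpose first, then cast, so the cast feeds the matmul
# directly and the mixed_mm pattern matches.
out = torch.matmul(inp, weight.t().to(dtype=inp.dtype))
```
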
| batch size | prompt length |     fallback |    heuristic | speedup |
|-----------:|--------------:|-------------:|-------------:|--------:|
|          1 |             7 |  75.31 tok/s | 148.83 tok/s |    1.97 |
|          1 |            11 |  75.99 tok/s | 148.15 tok/s |    1.94 |
|          4 |             7 | 103.48 tok/s | 472.00 tok/s |    4.56 |
|          4 |            11 | 103.56 tok/s | 371.36 tok/s |    3.58 |
|          8 |             7 | 201.92 tok/s | 813.44 tok/s |    4.02 |
|          8 |            11 | 201.76 tok/s | 699.36 tok/s |    3.46 |

Currently, the heuristic only applies to the following inputs (a guard sketch follows the list):
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful not to choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (the learned heuristic had a hard time making reliable choices when mat2 is not transposed)
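A hypothetical guard mirroring these conditions (the real checks live in the inductor code and may be shaped differently):

```python
def heuristic_applies(m: int, k: int, n: int,
                      mat1_transposed: bool, mat2_transposed: bool) -> bool:
    # Hypothetical; mirrors the bullet points above.
    return (
        m <= 128 and k >= 1024 and n >= 1024  # sizes where triton usually wins
        and k % 256 == 0                      # avoid block-size pathologies
        and not mat1_transposed
        and mat2_transposed
    )
```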

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00


[build-system]
requires = [
    "setuptools",
    "wheel",
    "astunparse",
    "numpy",
    "ninja",
    "pyyaml",
    "cmake",
    "typing-extensions",
    "requests",
]
# Use legacy backend to import local packages in setup.py
build-backend = "setuptools.build_meta:__legacy__"

[tool.black]
line-length = 88
target-version = ["py38", "py39", "py310", "py311"]

[tool.isort]
src_paths = ["caffe2", "torch", "torchgen", "functorch", "test"]
extra_standard_library = ["typing_extensions"]
skip_gitignore = true
skip_glob = ["third_party/*"]
atomic = true
profile = "black"
indent = 4
line_length = 88
lines_after_imports = 2
multi_line_output = 3
include_trailing_comma = true
combine_as_imports = true

[tool.usort.known]
first_party = ["caffe2", "torch", "torchgen", "functorch", "test"]
standard_library = ["typing_extensions"]

[tool.ruff]
target-version = "py38"
line-length = 120

[tool.ruff.lint]
# NOTE: Synchronize the ignores with .flake8
ignore = [
    # these ignores are from flake8-bugbear; please fix!
    "B007", "B008", "B017",
    "B018", # Useless expression
    "B023",
    "B028", # No explicit `stacklevel` keyword argument found
    "E402",
    "C408", # C408 ignored because we like the dict keyword argument syntax
    "E501", # E501 is not flexible enough, we're using B950 instead
    "E721",
    "E731", # Assign lambda expression
    "E741",
    "EXE001",
    "F405",
    "F841",
    # these ignores are from flake8-logging-format; please fix!
    "G101",
    # these ignores are from ruff NPY; please fix!
    "NPY002",
    # these ignores are from ruff PERF; please fix!
    "PERF203",
    "PERF401",
    "PERF403",
    # these ignores are from PYI; please fix!
    "PYI024",
    "PYI036",
    "PYI041",
    "PYI056",
    "SIM102", "SIM103", "SIM112", # flake8-simplify code styles
    "SIM113", # please fix
    "SIM105", # these ignores are from flake8-simplify; please fix or ignore with commented reason
    "SIM108", # SIM108 ignored because we prefer if-else-block instead of ternary expression
    "SIM110",
    "SIM114", # Combine `if` branches using logical `or` operator
    "SIM115",
    "SIM116", # Disable: use a dictionary instead of consecutive `if` statements
    "SIM117",
    "SIM118",
    "UP006", # keep-runtime-typing
    "UP007", # keep-runtime-typing
]
select = [
    "B",
    "B904", # Re-raised error without specifying the cause via the from keyword
    "C4",
    "G",
    "E",
    "EXE",
    "F",
    "SIM1",
    "SIM911",
    "W",
    # Not included in flake8
    "FURB",
    "LOG",
    "NPY",
    "PERF",
    "PGH004",
    "PIE794",
    "PIE800",
    "PIE804",
    "PIE807",
    "PIE810",
    "PLC0131", # type bivariance
    "PLC0132", # type param mismatch
    "PLC0205", # string as __slots__
    "PLC3002", # unnecessary-direct-lambda-call
    "PLE",
    "PLR0133", # constant comparison
    "PLR0206", # property with params
    "PLR1722", # use sys exit
    "PLR1736", # unnecessary list index
    "PLW0129", # assert on string literal
    "PLW0133", # useless exception statement
    "PLW0406", # import self
    "PLW0711", # binary op exception
    "PLW1509", # preexec_fn not safe with threads
    "PLW2101", # useless lock statement
    "PLW3301", # nested min max
    "PT006", # TODO: enable more PT rules
    "PT022",
    "PT023",
    "PT024",
    "PT025",
    "PT026",
    "PYI",
    "Q003", # avoidable escaped quote
    "Q004", # unnecessary escaped quote
    "RSE",
    "RUF008", # mutable dataclass default
    "RUF015", # access first ele in constant time
    "RUF016", # type error non-integer index
    "RUF017",
    "RUF018", # no assignment in assert
    "RUF019", # unnecessary-key-check
    "RUF024", # from keys mutable
    "RUF026", # default factory kwarg
    "TCH",
    "TRY002", # ban vanilla raise (todo fix NOQAs)
    "TRY302",
    "TRY401", # verbose-log-message
    "UP",
]

[tool.ruff.lint.per-file-ignores]
"__init__.py" = [
    "F401",
]
"functorch/notebooks/**" = [
    "F401",
]
"test/typing/reveal/**" = [
    "F821",
]
"test/torch_np/numpy_tests/**" = [
    "F821",
    "NPY201",
]
"test/dynamo/test_bytecode_utils.py" = [
    "F821",
]
"test/dynamo/test_debug_utils.py" = [
    "UP037",
]
"test/jit/**" = [
    "PLR0133", # tests require this for JIT
    "PYI",
    "RUF015",
    "UP", # We don't want to modify the jit tests as they test specific syntax
]
"test/test_jit.py" = [
    "PLR0133", # tests require this for JIT
    "PYI",
    "RUF015",
    "UP", # We don't want to modify the jit tests as they test specific syntax
]
"test/inductor/test_torchinductor.py" = [
    "UP037",
]
# autogenerated # TODO: figure out why file-level noqa is ignored
"torch/_inductor/fx_passes/serialized_patterns/**" = ["F401", "F501"]
"torch/_inductor/autoheuristic/artifacts/**" = ["F401", "F501"]
"torchgen/api/types/__init__.py" = [
    "F401",
    "F403",
]
"torchgen/executorch/api/types/__init__.py" = [
    "F401",
    "F403",
]
"torch/utils/collect_env.py" = [
    "UP", # collect_env.py needs to work with older versions of Python
]
"torch/_vendor/**" = [
    "UP", # No need to mess with _vendor
]