Mirror of https://github.com/saymrwulf/pytorch.git (synced 2026-05-15 21:00:47 +00:00)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that gave worse numerics: in eager, the epilogue is done entirely in fp32, with no downcast of the bmm accumulator in between.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
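The numerics issue described above can be sketched without PyTorch. The following is a minimal NumPy illustration (not the actual decomposition code): it compares an epilogue computed on a full-fp32 accumulator against one computed after the accumulator has been rounded through fp16, which is the intermediate variant the commit says was tried and rejected.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 64)).astype(np.float16)
b = rng.standard_normal((64, 8)).astype(np.float16)
bias = rng.standard_normal(8).astype(np.float16)

# Eager-like reference: matmul accumulates in fp32 and the epilogue
# (here just a bias add) runs on that fp32 accumulator directly.
ref = a.astype(np.float32) @ b.astype(np.float32) + bias.astype(np.float32)

# Rejected variant: downcast the accumulator to fp16 first, then
# upcast again for the epilogue. The fp16 round-trip loses precision.
acc_fp16 = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)
variant = acc_fp16.astype(np.float32) + bias.astype(np.float32)

err = np.abs(ref - variant).max()
print(err)  # nonzero: the accumulator downcast introduces error
```

The gap grows with the reduction dimension, since larger dot products accumulate values that fp16 cannot represent exactly, which is why keeping the whole epilogue chain in fp32 matches eager more closely.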
Files in this directory:

- __init__.py
- bmm.py
- conv.py
- flex_attention.py
- flex_decoding.py
- mm.py
- mm_common.py
- mm_plus_mm.py
- mm_scaled.py
- unpack_mixed_mm.py