pytorch/torch/_inductor/kernel
eellison cef6c3dcb0 Dont decompose aten.baddbmm in inductor (#137904)
Previously the decomposition would upcast the inputs to fp32, which led to a slowdown compared to eager, which runs the op in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that gave worse numerics: the eager kernel performs the epilogue entirely in fp32, without first downcasting the bmm accumulator. (A sketch of this trade-off follows the commit entry below.)

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-15 14:54:56 +00:00
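
A minimal, hypothetical sketch (not Inductor's lowering code) of the trade-off the commit describes, written against the public PyTorch API. It assumes a PyTorch build with fp16 bmm/baddbmm support on the current device; the printed error magnitudes are illustrative and will vary by hardware.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16

inp = torch.randn(8, 64, 48, device=device, dtype=dtype)
b1 = torch.randn(8, 64, 32, device=device, dtype=dtype)
b2 = torch.randn(8, 32, 48, device=device, dtype=dtype)

# Eager: a single fused aten.baddbmm kernel that runs in fp16, doing the
# epilogue in fp32 without downcasting the bmm accumulator first (per the
# commit description above).
fused = torch.baddbmm(inp, b1, b2)

# Old decomposition: upcast everything to fp32. Good numerics, but slower
# than the fp16 kernel eager dispatches to.
decomp_fp32 = (inp.float() + torch.bmm(b1.float(), b2.float())).to(dtype)

# Alternative that was tried: keep the bmm in fp16 and upcast only the
# epilogue. The fp16 bmm output has already downcast its accumulator, so
# this loses precision relative to the fused eager kernel.
decomp_mixed = (inp.float() + torch.bmm(b1, b2).float()).to(dtype)

# fp64 reference for comparing numerics.
ref = inp.double() + torch.bmm(b1.double(), b2.double())
for name, out in [("fused baddbmm", fused),
                  ("fp32 decomposition", decomp_fp32),
                  ("fp16 bmm + fp32 epilogue", decomp_mixed)]:
    print(name, (out.double() - ref).abs().max().item())
```

With the decomposition removed, Inductor lowers baddbmm to the fused kernel (first case), matching eager's speed and numerics instead of trading one for the other.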
__init__.py
bmm.py Dont decompose aten.baddbmm in inductor (#137904) 2024-10-15 14:54:56 +00:00
conv.py [inductor] Reduce block sizes when using Triton CPU backend (#136612) 2024-10-03 01:48:32 +00:00
flex_attention.py Port Inductor dataclasses to be kw_only (#137768) 2024-10-14 10:33:43 +00:00
flex_decoding.py [FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204) 2024-10-02 22:19:33 +00:00
mm.py Move _is_static_problem to mm_common (#137150) 2024-10-03 02:55:43 +00:00
mm_common.py Move _is_static_problem to mm_common (#137150) 2024-10-03 02:55:43 +00:00
mm_plus_mm.py
mm_scaled.py [AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008) 2024-10-04 14:02:46 +00:00
unpack_mixed_mm.py