pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
Aaron Gokaslan	bb2fb554a9	[BE]: Update CUTLASS submodule to 3.7.0 (#145172 ) * This has a couple of new features, but mostly has a lot of bugfixes for the prior releases * This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it. * Most of the remaining diff noise is copyright year updates on the CUTLASS submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172 Approved by: https://github.com/eqy, https://github.com/henrylhtsang	2025-01-29 21:48:01 +00:00
James Wu	d0aa1386b8	Disable AOTAutogradCache for triton version < 3.2 (#145937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937 Approved by: https://github.com/bdhirsh	2025-01-29 21:32:16 +00:00
PyTorch MergeBot	1185b81c51	Revert "[dynamo] Use polyfill to implement comparison operators (#144485 )" This reverts commit `d1f82de2bf`. Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))	2025-01-29 21:30:42 +00:00
Catherine Lee	953e80936e	[linter] Grep linter batches long command (#145950 ) If the command is too long, the linter fails with
```
Failed due to OSError: [Errno 7] Argument list too long: 'grep'
```
Fix this by batching the command so that each invocation is shorter. The limit of 750k was chosen because `getconf ARG_MAX` returns ~1M on my mac. My guess is that most people shouldn't hit this unless they run --all-files and the directory length is long. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950 Approved by: https://github.com/wdvr	2025-01-29 21:23:27 +00:00
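The batching idea in the commit above can be sketched as follows. This is a minimal illustration and not the linter's actual code: the helper name `batch_by_length` and its byte-counting rule are assumptions; only the 750k limit comes from the commit message.

```python
from typing import Iterable, Iterator, List

# Limit taken from the commit message; `getconf ARG_MAX` reported ~1M there.
MAX_CMD_BYTES = 750_000


def batch_by_length(
    files: Iterable[str], base_cmd_len: int = 0, limit: int = MAX_CMD_BYTES
) -> Iterator[List[str]]:
    """Yield lists of files whose combined argument length stays under `limit`.

    Hypothetical helper: each file is counted as its UTF-8 length plus one
    byte for the separator. A single over-long file is still yielded alone.
    """
    batch: List[str] = []
    size = base_cmd_len
    for name in files:
        cost = len(name.encode("utf-8")) + 1
        # Start a new batch when adding this file would exceed the limit.
        if batch and size + cost > limit:
            yield batch
            batch, size = [], base_cmd_len
        batch.append(name)
        size += cost
    if batch:
        yield batch
```

Each yielded batch would then be passed to its own `grep` invocation, so no single command line approaches `ARG_MAX`.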
Zain Rizvi	a6e3f294f1	Don't use mypy daemon in CI (#145961 ) This is an attempt to fix flaky mypy errors in CI that look like:
```
dmypy status --verbose
connection_name : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock
pid : 32233
error : timed out
Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill
```
"Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961 Approved by: https://github.com/malfet	2025-01-29 21:15:29 +00:00
bglass@quansight.com	40ccb7a86d	cpp_wrapper: Move #includes to per-device header files (#145932 ) Summary: This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts. Co-authored-by: Benjamin Glass <bglass@quansight.com> Differential Revision: D68656960 Pulled By: benjaminglass1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932 Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1 Co-authored-by: bglass@quansight.com <bglass@quansight.com>	2025-01-29 21:08:45 +00:00
sanchitintel	8bd7bf3269	[Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894 ) ### Summary `RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data. Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels. ### Caveat #### _Before_ No corresponding results in PyTorch profiler profiling data #### _After_

| Inductor config settings | What kernel name looks like in profiling data | Comments |
|---|---|---|
| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code | `graph_x_cpp_fused_y` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| Both `inductor_config.cpp.descriptive_names = "inductor_node"` & Inductor CPP Wrapper | `graph_x_cpp_fused_flex_attention_y` | Easy to interpret data |
| Neither of the two configs | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel	2025-01-29 20:54:46 +00:00
Danial Javady	bb4964013f	Add deterministic kernel for reflection2d (#136241 ) Adds feature for #98925 Tests pass for both the existing reflectionpad2d tests and the new one I added. Summary of the work: A simple conditional check for deterministic mode dispatches to a different kernel. This kernel does not use any atomic operations and produces deterministic results because, instead of going from output to input (a 1:1 relationship), it does the opposite: it goes from each input to all of its outputs, which is one-to-many. These operations are done in the same order on every execution, since the kernel simply traverses the data set with a grid-stride loop and uses simple linearized indexing into the input tensor. Each thread computes the 4 conditionals, which are then used to check whether the input has an output in each of the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, and bottom right. I did not focus on performance for this PR, as that would expand the scope heavily. If there are any performance questions, though, I can answer them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241 Approved by: https://github.com/eqy, https://github.com/albanD	2025-01-29 20:34:03 +00:00
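The input-to-outputs formulation described above can be illustrated in plain Python. This is only a sketch of the idea, not the CUDA kernel from the PR: all function names are hypothetical, tensors are nested lists, and the serial loops stand in for GPU threads. `backward_scatter` mimics the output-driven version, whose accumulation would need atomics on a GPU, while `backward_gather` visits each input element once and sums the interior copy plus up to 8 reflected copies.

```python
def source_index(o, n, pad_before):
    """Input index that reflection padding reads for output index `o`."""
    j = o - pad_before
    if j < 0:
        return -j
    if j >= n:
        return 2 * (n - 1) - j
    return j


def outputs_for_input(i, n, pad_before, pad_after):
    """All output indices along one axis that read input index `i`."""
    outs = [pad_before + i]                  # interior copy
    if 1 <= i <= pad_before:                 # reflected into the leading pad
        outs.append(pad_before - i)
    if n - 1 - pad_after <= i <= n - 2:      # reflected into the trailing pad
        outs.append(pad_before + 2 * (n - 1) - i)
    return outs


def backward_scatter(grad_out, H, W, top, bottom, left, right):
    """Output-driven backward: many outputs accumulate into one input."""
    grad_in = [[0] * W for _ in range(H)]
    for oh, row in enumerate(grad_out):
        for ow, g in enumerate(row):
            grad_in[source_index(oh, H, top)][source_index(ow, W, left)] += g
    return grad_in


def backward_gather(grad_out, H, W, top, bottom, left, right):
    """Input-driven backward: each input sums its outputs in a fixed order."""
    grad_in = [[0] * W for _ in range(H)]
    for h in range(H):
        rows = outputs_for_input(h, H, top, bottom)
        for w in range(W):
            cols = outputs_for_input(w, W, left, right)
            grad_in[h][w] = sum(grad_out[oh][ow] for oh in rows for ow in cols)
    return grad_in
```

Because every input element is written by exactly one loop iteration, the summation order is fixed, which is the property the commit relies on for determinism.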
Ankita George	2b8c28099a	[OSS] Add no dist as an argument to DCP top level apis (#145754 ) Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP Test Plan: existing tests pass Differential Revision: D68714246 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754 Approved by: https://github.com/daulet-askarov	2025-01-29 20:33:37 +00:00
chilli	2d5d022594	Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-29 20:27:39 +00:00
Nikita Shulga	6aed6c042e	[CD] Install ninja and setuptools from PyPI (#145871 ) As well as typing extensions, they are available from PyPI, no reason to install them from Anaconda Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871 Approved by: https://github.com/Skylion007	2025-01-29 19:47:16 +00:00
PyTorch MergeBot	b80482988f	Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870 )" This reverts commit `c26bb9ba5b`. Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
PyTorch MergeBot	b52e8d521e	Revert "[CD] Install ninja and setuptools from PyPI (#145871 )" This reverts commit `eea7d395e5`. Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
Jack Taylor	082fab0fc7	[64-bit] Int64 casting for UpSampleNearest3D (#144865 ) Fixes #144855 Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865 Approved by: https://github.com/eqy	2025-01-29 19:30:09 +00:00
angelayi	1c9014a135	[export] Add tlparse to draft-export (#145810 ) Dependent on https://github.com/ezyang/tlparse/pull/87/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810 Approved by: https://github.com/pianpwk	2025-01-29 19:26:00 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `9fd6722fc9`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
PyTorch MergeBot	e0525dbca9	Revert "inductor.config.descriptive_names = False is not actually supported (#145523 )" This reverts commit `edf266e9bb`. Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))	2025-01-29 18:27:44 +00:00
PyTorch MergeBot	284f217011	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit `97b3b73f3e`. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))	2025-01-29 18:24:29 +00:00
PyTorch MergeBot	0d6343347f	Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448 )" This reverts commit `a699034eec`. Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))	2025-01-29 18:07:12 +00:00
Avik Chaudhuri	1a613c3342	bump counters for unbacked binding names (#145882 ) Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning based on what the symbols in the original shape env before serialization were. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match. Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen). Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882 Approved by: https://github.com/pianpwk	2025-01-29 17:46:21 +00:00
rpsilva	4abff4b271	Introduce cache clearing APIs for the lazy graph executor (#144489 ) This PR introduces two new methods to the LazyGraphExecutor class:
- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash. Another motivation is that users could also clear some computation hashes for added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489 Approved by: https://github.com/wconstab	2025-01-29 17:38:01 +00:00
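To make the two operations concrete, here is a small Python model of an LRU computation cache with "clear everything" and "remove one hash" methods. It is an illustrative sketch only: the real methods live on the C++ `LazyGraphExecutor`, and the class `ComputationCache` and its method names below are hypothetical stand-ins.

```python
from collections import OrderedDict


class ComputationCache:
    """Toy LRU cache keyed by computation hash (hypothetical model)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        """Return the cached value, marking it most recently used."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def add(self, key, value):
        """Insert a value and evict least recently used entries if full."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)

    def clear(self):
        """Analogue of ClearComputationCache(): drop every entry."""
        self._entries.clear()

    def remove(self, key):
        """Analogue of RemoveFromComputationCache(hash): drop one entry."""
        if key in self._entries:
            del self._entries[key]
            return True
        return False
```

In a test, calling `clear()` between parameterized runs gives each run a cold cache, which is the use case the commit message describes.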
Animesh Jain	d1f82de2bf	[dynamo] Use polyfill to implement comparison operators (#144485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485 Approved by: https://github.com/jansel	2025-01-29 17:37:40 +00:00
saienduri	3e135993bd	Update mi300 labels to account for multiple clusters. (#145923 ) We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923 Approved by: https://github.com/jeffdaily	2025-01-29 16:56:43 +00:00
Animesh Jain	4499d60d56	[dynamo][builin-skipfiles-cleanup] Remove types (#145909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909 Approved by: https://github.com/zou3519 ghstack dependencies: #145856, #145875, #145878, #145892	2025-01-29 16:47:02 +00:00
Brian Hirsh	ed141d7d1a	dont assign a size to _assert_scalar in partitioner (#143877 ) Fixes https://github.com/pytorch/pytorch/issues/143876 Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877 Approved by: https://github.com/zou3519	2025-01-29 16:21:37 +00:00
Yu, Guangye	3b3aac0cde	Filter out iGPU if dGPU is found on XPU (#144378 ) # Motivation for https://github.com/pytorch/pytorch/issues/143914 On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context. Now I generalize the logic as below:
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378 Approved by: https://github.com/EikanWang, https://github.com/gujinghui	2025-01-29 15:53:16 +00:00
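The three-step selection rule above can be sketched in Python. This is a hedged model of the described logic, not the SYCL/C++ implementation: the `Device` and `Platform` classes, the `"level_zero"` backend string, and `select_devices` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Device:
    name: str
    integrated: bool  # True for an iGPU, False for a dGPU


@dataclass(frozen=True)
class Platform:
    backend: str  # hypothetical label, e.g. "level_zero"
    devices: List[Device]


def select_devices(platforms: List[Platform]) -> List[Device]:
    """Pick the devices to enumerate, preferring dGPUs over iGPUs."""
    l0_platforms = [p for p in platforms if p.backend == "level_zero"]
    # Step 1: first L0 platform with at least one dGPU; take all its dGPUs.
    for p in l0_platforms:
        dgpus = [d for d in p.devices if not d.integrated]
        if dgpus:
            return dgpus
    # Step 2: otherwise, first L0 platform with an iGPU; take all its iGPUs.
    for p in l0_platforms:
        igpus = [d for d in p.devices if d.integrated]
        if igpus:
            return igpus
    # Step 3: no GPU found.
    return []
```

Since all returned devices come from a single platform, they can share one context, which matches the motivation stated in the commit.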
Bert Maher	5e5da9bd9a	[triton] Update pin to tip of 3.2 release (#145867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867 Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte	2025-01-29 15:17:58 +00:00
Aidyn-A	81685d81eb	[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 ) This is a re-base PR to my previous one #141959. Description from the original PR: This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100. <details> <summary>The benchmark code used </summary>

```Python
import time
import torch
from torch.profiler import profile, ProfilerActivity


def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    # Tensor sizes from 2**24 to 2**29 elements
    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        # Warm-up iterations
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        # Timed iterations
        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            # Compare the CUDA result against a CPU reference
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)


def main():
    ops = [
        torch.relu,
        torch.sigmoid,
        torch.tanh,
        torch.nn.functional.gelu,
        torch.sin,
        torch.exp,
    ]

    dtypes = [
        torch.float16,
        torch.bfloat16,
        torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()


if __name__ == "__main__":
    main()
```

</details> <details> <summary> Results </summary> \| op \| dtype \| size \| time after \| time before \| % improvement \| \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| relu \| torch.float16 \| 33554432 \| 4.84E-05 \| 5.06E-05 \| 4.66296539127052 \| \| relu \| torch.float16 \| 
67108864 \| 9.22E-05 \| 9.64E-05 \| 4.56491432752297 \| \| relu \| torch.float16 \| 134217728 \| 0.000180343495837102 \| 0.000187981834945579 \| 4.23543919508829 \| \| relu \| torch.float16 \| 268435456 \| 0.000355071155354381 \| 0.000370856161074092 \| 4.44558942107169 \| \| relu \| torch.float16 \| 536870912 \| 0.000704489842367669 \| 0.000736006341564159 \| 4.47366268483987 \| \| relu \| torch.bfloat16 \| 16777216 \| 3.03E-05 \| 3.04E-05 \| 0.166504085842689 \| \| relu \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.45848238875716 \| \| relu \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.65E-05 \| 3.56122651631445 \| \| relu \| torch.bfloat16 \| 134217728 \| 0.000180805509444326 \| 0.000187998676362137 \| 3.97840029317567 \| \| relu \| torch.bfloat16 \| 268435456 \| 0.000356242332297067 \| 0.000371279485989362 \| 4.22104627356745 \| \| relu \| torch.bfloat16 \| 536870912 \| 0.000708114336399982 \| 0.000736773828975856 \| 4.04729732229083 \| \| relu \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.61E-05 \| 0.0442587268354941 \| \| relu \| torch.float32 \| 33554432 \| 9.33E-05 \| 9.30E-05 \| -0.259070913799022 \| \| relu \| torch.float32 \| 67108864 \| 0.000181321326332788 \| 0.000181289506144822 \| -0.0175490597877115 \| \| relu \| torch.float32 \| 134217728 \| 0.000356896334172537 \| 0.000356570177245885 \| -0.0913870206618981 \| \| relu \| torch.float32 \| 268435456 \| 0.000709421835684528 \| 0.000707465515006334 \| -0.275762681635911 \| \| relu \| torch.float32 \| 536870912 \| 0.00141372415237129 \| 0.00141036518228551 \| -0.237597276678471 \| \| sigmoid \| torch.float16 \| 16777216 \| 3.10E-05 \| 3.16E-05 \| 2.10012593866895 \| \| sigmoid \| torch.float16 \| 33554432 \| 4.91E-05 \| 5.23E-05 \| 6.37710600666122 \| \| sigmoid \| torch.float16 \| 67108864 \| 9.30E-05 \| 0.000100057009452333 \| 7.61866144555331 \| \| sigmoid \| torch.float16 \| 134217728 \| 0.000180928347011407 \| 0.000194982004662355 \| 7.76752669390248 \| \| sigmoid \| torch.float16 \| 
268435456 \| 0.000355658994521946 \| 0.00038468533117945 \| 8.16128288742412 \| \| sigmoid \| torch.float16 \| 536870912 \| 0.000705982849467546 \| 0.000764021339515845 \| 8.22094900634937 \| \| sigmoid \| torch.bfloat16 \| 16777216 \| 3.08E-05 \| 3.17E-05 \| 2.90965915673149 \| \| sigmoid \| torch.bfloat16 \| 33554432 \| 4.87E-05 \| 5.24E-05 \| 7.63503884668234 \| \| sigmoid \| torch.bfloat16 \| 67108864 \| 9.33E-05 \| 0.000100019678939134 \| 7.21238137428013 \| \| sigmoid \| torch.bfloat16 \| 134217728 \| 0.000180786165098349 \| 0.000194868014659733 \| 7.78922964250206 \| \| sigmoid \| torch.bfloat16 \| 268435456 \| 0.000355564659306159 \| 0.000384909333661199 \| 8.25297835063321 \| \| sigmoid \| torch.bfloat16 \| 536870912 \| 0.000705831005082776 \| 0.000764102345177283 \| 8.2557070566308 \| \| sigmoid \| torch.float32 \| 16777216 \| 4.93E-05 \| 5.65E-05 \| 14.5314136197766 \| \| sigmoid \| torch.float32 \| 33554432 \| 9.32E-05 \| 9.31E-05 \| -0.120169865610833 \| \| sigmoid \| torch.float32 \| 67108864 \| 0.000181328505277634 \| 0.000180455681402236 \| -0.481349512069855 \| \| sigmoid \| torch.float32 \| 134217728 \| 0.000357362829769651 \| 0.000356093340087682 \| -0.35523831137877 \| \| sigmoid \| torch.float32 \| 268435456 \| 0.000708921831877281 \| 0.000707052337626616 \| -0.263709504574663 \| \| sigmoid \| torch.float32 \| 536870912 \| 0.00141358317341656 \| 0.0014090768333214 \| -0.318788464654745 \| \| tanh \| torch.float16 \| 16777216 \| 3.03E-05 \| 3.03E-05 \| -0.0912564658661808 \| \| tanh \| torch.float16 \| 33554432 \| 4.90E-05 \| 5.07E-05 \| 3.46644442974484 \| \| tanh \| torch.float16 \| 67108864 \| 9.30E-05 \| 9.68E-05 \| 3.99871369815531 \| \| tanh \| torch.float16 \| 134217728 \| 0.00018052199933057 \| 0.000188717152923346 \| 4.53969799978138 \| \| tanh \| torch.float16 \| 268435456 \| 0.000355684508879979 \| 0.000373026006855071 \| 4.8755280430115 \| \| tanh \| torch.float16 \| 536870912 \| 0.000706660988119741 \| 0.000740105014604827 \| 
4.73268328765002 \| \| tanh \| torch.bfloat16 \| 16777216 \| 2.99E-05 \| 3.03E-05 \| 1.21049563135981 \| \| tanh \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.48836101041744 \| \| tanh \| torch.bfloat16 \| 67108864 \| 9.28E-05 \| 9.69E-05 \| 4.39944918036626 \| \| tanh \| torch.bfloat16 \| 134217728 \| 0.000180710999605556 \| 0.000189167990659674 \| 4.67984299382829 \| \| tanh \| torch.bfloat16 \| 268435456 \| 0.000356062994493792 \| 0.000372666652159144 \| 4.66312363882606 \| \| tanh \| torch.bfloat16 \| 536870912 \| 0.000707100164921333 \| 0.000740134331863374 \| 4.67178040408393 \| \| tanh \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.64E-05 \| 0.439595755746353 \| \| tanh \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.31E-05 \| 0.00287633090228212 \| \| tanh \| torch.float32 \| 67108864 \| 0.000181465332085888 \| 0.000180895323865116 \| -0.31411411437098 \| \| tanh \| torch.float32 \| 134217728 \| 0.000356963835656643 \| 0.000356073161431899 \| -0.249513854283251 \| \| tanh \| torch.float32 \| 268435456 \| 0.000709201170442005 \| 0.00070707315656667 \| -0.300057862849997 \| \| tanh \| torch.float32 \| 536870912 \| 0.00141367283261692 \| 0.00141030051357423 \| -0.238550176877922 \| \| gelu \| torch.float16 \| 16777216 \| 2.73E-05 \| 3.17E-05 \| 15.921079070745 \| \| gelu \| torch.float16 \| 33554432 \| 5.06E-05 \| 5.55E-05 \| 9.76345374333098 \| \| gelu \| torch.float16 \| 67108864 \| 9.65E-05 \| 0.000106600326641152 \| 10.4308039074712 \| \| gelu \| torch.float16 \| 134217728 \| 0.000187776672343413 \| 0.000208565829476962 \| 11.0712139447915 \| \| gelu \| torch.float16 \| 268435456 \| 0.000370216167842348 \| 0.000412251994324227 \| 11.3544005187205 \| \| gelu \| torch.float16 \| 536870912 \| 0.000737301345604161 \| 0.000819394170927505 \| 11.1342296895002 \| \| gelu \| torch.bfloat16 \| 16777216 \| 3.02E-05 \| 3.08E-05 \| 1.78405479367653 \| \| gelu \| torch.bfloat16 \| 33554432 \| 5.13E-05 \| 5.69E-05 \| 10.9929393318302 \| \| gelu \| 
torch.bfloat16 \| 67108864 \| 9.76E-05 \| 0.00010968199543034 \| 12.3420807512356 \| \| gelu \| torch.bfloat16 \| 134217728 \| 0.000189661824454864 \| 0.000214487663470209 \| 13.0895287371091 \| \| gelu \| torch.bfloat16 \| 268435456 \| 0.000374197009174774 \| 0.000423670164309442 \| 13.2211519391275 \| \| gelu \| torch.bfloat16 \| 536870912 \| 0.000743675006863972 \| 0.000842577001700799 \| 13.299088166737 \| \| gelu \| torch.float32 \| 16777216 \| 5.06E-05 \| 5.04E-05 \| -0.413385894716413 \| \| gelu \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.32E-05 \| 0.134157041722546 \| \| gelu \| torch.float32 \| 67108864 \| 0.000181480175039421 \| 0.000180836669945469 \| -0.354586992112075 \| \| gelu \| torch.float32 \| 134217728 \| 0.000356874331676712 \| 0.000356305002545317 \| -0.159532104402047 \| \| gelu \| torch.float32 \| 268435456 \| 0.000708909006789327 \| 0.000706991491218408 \| -0.270488250615287 \| \| gelu \| torch.float32 \| 536870912 \| 0.00141321367118508 \| 0.00140937082081412 \| -0.271922813181618 \| \| sin \| torch.float16 \| 16777216 \| 3.04E-05 \| 3.11E-05 \| 2.21834939018859 \| \| sin \| torch.float16 \| 33554432 \| 4.85E-05 \| 5.23E-05 \| 7.72165512511596 \| \| sin \| torch.float16 \| 67108864 \| 9.31E-05 \| 9.98E-05 \| 7.24947099480072 \| \| sin \| torch.float16 \| 134217728 \| 0.000180371008658161 \| 0.000194791161144773 \| 7.99471744039613 \| \| sin \| torch.float16 \| 268435456 \| 0.000355454161763191 \| 0.000384903668115536 \| 8.28503630574026 \| \| sin \| torch.float16 \| 536870912 \| 0.000705183832906187 \| 0.000764360166310022 \| 8.39161799270973 \| \| sin \| torch.bfloat16 \| 16777216 \| 3.11E-05 \| 3.10E-05 \| -0.257677954940036 \| \| sin \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.24E-05 \| 7.34808420323539 \| \| sin \| torch.bfloat16 \| 67108864 \| 9.26E-05 \| 0.000100248667877167 \| 8.22347488801205 \| \| sin \| torch.bfloat16 \| 134217728 \| 0.000180674154156198 \| 0.00019567032965521 \| 8.30012215584937 \| \| sin \| torch.bfloat16 
\| 268435456 \| 0.000355360486234228 \| 0.000386023331278314 \| 8.62865913118873 \| \| sin \| torch.bfloat16 \| 536870912 \| 0.00070483615854755 \| 0.000766805159704139 \| 8.79197248964745 \| \| sin \| torch.float32 \| 16777216 \| 5.67E-05 \| 5.64E-05 \| -0.441348534920039 \| \| sin \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.30E-05 \| -0.496458540364117 \| \| sin \| torch.float32 \| 67108864 \| 0.000181706990891447 \| 0.000180556671693921 \| -0.633062708199702 \| \| sin \| torch.float32 \| 134217728 \| 0.000356894995396336 \| 0.000356046327700218 \| -0.237791985616354 \| \| sin \| torch.float32 \| 268435456 \| 0.000708777321657787 \| 0.000707602652255446 \| -0.165731798471427 \| \| sin \| torch.float32 \| 536870912 \| 0.00141263716310884 \| 0.00140912582476934 \| -0.248566187496451 \| \| exp \| torch.float16 \| 16777216 \| 3.00E-05 \| 3.04E-05 \| 1.40099098901014 \| \| exp \| torch.float16 \| 33554432 \| 4.86E-05 \| 5.03E-05 \| 3.44611943643906 \| \| exp \| torch.float16 \| 67108864 \| 9.37E-05 \| 9.55E-05 \| 1.96412400380129 \| \| exp \| torch.float16 \| 134217728 \| 0.000180913504057874 \| 0.000187193179347863 \| 3.47109262113439 \| \| exp \| torch.float16 \| 268435456 \| 0.00035607748820136 \| 0.000369079003576189 \| 3.65131630210701 \| \| exp \| torch.float16 \| 536870912 \| 0.000707551507124056 \| 0.000732363162872692 \| 3.50669251620789 \| \| exp \| torch.bfloat16 \| 16777216 \| 2.98E-05 \| 3.04E-05 \| 1.74345594341654 \| \| exp \| torch.bfloat16 \| 33554432 \| 4.88E-05 \| 5.04E-05 \| 3.40217856534821 \| \| exp \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.62E-05 \| 3.29219958210226 \| \| exp \| torch.bfloat16 \| 134217728 \| 0.000180999826019009 \| 0.000187239318620414 \| 3.44723679499521 \| \| exp \| torch.bfloat16 \| 268435456 \| 0.000355944503098726 \| 0.000369370992605885 \| 3.77207384585864 \| \| exp \| torch.bfloat16 \| 536870912 \| 0.000707135167128096 \| 0.000733066000975668 \| 3.66702648277075 \| \| exp \| torch.float32 \| 16777216 \| 4.89E-05 
\| 5.63E-05 \| 15.1245314346532 \| \| exp \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.31E-05 \| -0.259945454477446 \| \| exp \| torch.float32 \| 67108864 \| 0.000181152504713585 \| 0.000180474346658836 \| -0.374357536939058 \| \| exp \| torch.float32 \| 134217728 \| 0.000356771342922002 \| 0.000355627329554409 \| -0.3206573034212 \| \| exp \| torch.float32 \| 268435456 \| 0.000708404501589636 \| 0.00070713268360123 \| -0.179532736671163 \| \| exp \| torch.float32 \| 536870912 \| 0.00141283582585553 \| 0.00140944866385932 \| -0.23974208002295 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-01-29 13:32:59 +00:00
Ting Lu	354fe48db9	Add magma cuda build 12.8 (#145765 ) https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765 Approved by: https://github.com/malfet	2025-01-29 08:43:38 +00:00
gasoonjia	501c5972f0	[pytorch] raise exception when calling dim order on sparse tensor (#145888 ) This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor. Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888 Approved by: https://github.com/Jack-Khuu	2025-01-29 06:15:44 +00:00
David Berard	2e8c080ab1	[inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583 ) Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`. This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583 Approved by: https://github.com/jansel	2025-01-29 05:46:05 +00:00
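A minimal sketch of the idea, assuming a small value type that carries the constexpr flag separately from the bare name. The `ArgName` below is a simplified stand-in written for this example, not a copy of the Inductor class, and `build_signature` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ArgName:
    """Kernel argument name, optionally marked as a constexpr (sketch)."""

    name: str
    is_constexpr: bool = False

    def full_name(self) -> str:
        # Form used in the kernel's `def` line.
        return f"{self.name} : tl.constexpr" if self.is_constexpr else self.name


def build_signature(args: List[ArgName], types: Dict[str, str]) -> Dict[str, str]:
    """Map bare argument names to type strings (hypothetical helper)."""
    return {
        a.name: "constexpr" if a.is_constexpr else types[a.name]
        for a in args
    }


args = [ArgName("in_ptr0"), ArgName("xnumel"), ArgName("XBLOCK", is_constexpr=True)]
signature = build_signature(args, {"in_ptr0": "*fp32", "xnumel": "i32"})
params = ", ".join(a.full_name() for a in args)
```

Keeping the annotation out of the dictionary key is what yields `"XBLOCK": "constexpr"` rather than `"XBLOCK : tl.constexpr": "constexpr"`, while `full_name()` still provides the annotated form for the generated function header.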
Animesh Jain	3f77002b96	[dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875, #145878	2025-01-29 05:30:06 +00:00
Animesh Jain	236793684d	[dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875	2025-01-29 05:30:06 +00:00
Animesh Jain	a479656cd2	[dynamo][builtin-skipfiles-removal] Remove logging (#145875 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875 Approved by: https://github.com/williamwen42 ghstack dependencies: #145856	2025-01-29 05:29:58 +00:00
Animesh Jain	64ee57847b	[dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856 ) [dynamo][builtin-skipfiles-cleanup] Remove more builtins Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856 Approved by: https://github.com/zou3519	2025-01-29 05:29:47 +00:00
Aaron Orenstein	7178b827d7	PEP585: Missed conversions (#145342 ) Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342 Approved by: https://github.com/bobrenjc93	2025-01-29 05:24:36 +00:00
bobrenjc93	8696e59ae2	add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821 ) Followup from https://github.com/pytorch/pytorch/issues/130290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821 Approved by: https://github.com/eellison, https://github.com/ezyang	2025-01-29 04:36:32 +00:00
Justin Chu	776bdb962c	[ONNX] Support subgraphs with 1+ outputs (#145860 ) Fixed a bug in _handle_output_node where additional output values were not added as graph outputs Fixes #145734 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860 Approved by: https://github.com/titaiwangms	2025-01-29 04:13:23 +00:00
cyy	fd515e4f59	Fix C++20 Wambiguous-reversed-operator warnings (#144126 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126 Approved by: https://github.com/albanD	2025-01-29 03:13:57 +00:00
Simon Mahns	90a6db4a9c	[be][pytorch] Fix backend in autocast (#145859 ) Summary: fixing backend typo (BAKCNEDS -> BACKENDS) Test Plan: ci Differential Revision: D68573324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859 Approved by: https://github.com/jvandebon	2025-01-29 03:13:08 +00:00
Mwiza Kunda	9be2e88d41	Fix lowering to inductor IR for triton CPU (#144389 ) Example failing test: `pytest -s test_torchinductor_opinfo.py -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU. Failure:
```shell
triton.compiler.errors.CompilationError: at 10:11:
def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 1.0
    tl.static_assert(tmp1.dtype == tl.float32)
    tmp2 = ops.polygamma(tmp1, tmp0)
           ^
NameError('ops is not defined')
```
This occurs because the registered triton fallbacks are not used during the lowering to inductor IR. Marked the problematic code in the excerpt below from `6bc17b0725/torch/_inductor/lowering.py (L572)`
```python
def make_pointwise(
    fn,
    override_return_dtype=None,
    override_device=None,
    override_fn_when_input_bool=None,
    override_fn_when_gpu_float64=None,
    allow_alpha=False,
    triton_fallback=None,
):
    def inner(inputs: TensorBox, alpha=None):
        if triton_fallback is not None and any(
            isinstance(inp, IRNode) and is_triton(inp) for inp in inputs  # <--- is_triton should return True when using triton CPU
        ):
            assert not allow_alpha  # not implemented
            return triton_fallback(inputs)

        inputs = promote_constants(inputs, override_return_dtype)
        if allow_alpha:
            if alpha is not None and alpha != 1:
                inputs = list(inputs)
```
Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389 Approved by: https://github.com/jansel	2025-01-29 03:10:53 +00:00
Colin Peppler	50f834f134	[export] allow bit shift builtin ops (#145802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802 Approved by: https://github.com/pianpwk	2025-01-29 03:05:48 +00:00
Ting Lu	f4ca98950e	Add CUDA 12.8 libtorch image (#145789 ) https://github.com/pytorch/pytorch/issues/145570 Builds the CUDA 12.8 libtorch Docker image and deprecates 12.1 in the meantime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-01-29 02:59:37 +00:00
Sam Larsen	9330b6d098	Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 ) Summary: Test Plan: Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829 Approved by: https://github.com/Chillee	2025-01-29 02:52:55 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares an NCCL memory allocator in torch C++ and exposes it through pybind, so that users can use it directly. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
Menglu Yu	29521256e1	[Customized Optimus][Inductor] Add split cat pattern in aten level (#145721 ) Summary: Thanks Microve for discovering that recGPT has some repeated similar kernels that might be optimized through optimus. After investigation, I designed a pattern in the aten level to remove such excessive kernels. trace: https://fburl.com/perfdoctor/82fauil7 tlparse: https://fburl.com/98q6tadx Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567 Network: Up: 341KiB Down: 359KiB (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d) Executing actions. Remaining 0/3 Command: test. Finished 2 local Time elapsed: 3:04.8s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # local run ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115 ``` https://www.internalfb.com/mlhub/pipeline/1630903954173593 # E2E ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18] ``` ### how to add the config Add the following patterns to the dynamo config ``` post_grad_fusion_options: { "normalization_aten_pass": {}, "split_cat_aten_pass": {}, } ``` {F1974700331} baseline: aps-recgpt_ranking_1115_pt2_5-8cb4905c7d {F1974700216} proposal: Differential Revision: D68695717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721 
Approved by: https://github.com/Yuzhen11	2025-01-29 01:59:06 +00:00
Natalia Gimelshein	331f49057d	Removes threadfence from topk kernel to improve AMD performance (#145536 ) Also marginally improves cuda perf Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536 Approved by: https://github.com/eqy	2025-01-29 01:29:15 +00:00
wz337	6f5c8fb128	[DTensor] Add pointwise ops strategy for `aten.minimum` (#145816 ) Need it for Shampoo optimizer. `9c5700ad5e/matrix_functions.py (L240-L242)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816 Approved by: https://github.com/XilunWu	2025-01-29 01:19:01 +00:00
Pian Pawakapan	15e37e4253	[export] don't always print GM in serdes logging (#145857 ) Summary: Didn't realize print_readable() would also print and not just return string Test Plan: . Differential Revision: D68781525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857 Approved by: https://github.com/angelayi, https://github.com/yiming0416	2025-01-29 01:03:02 +00:00
fan.mo	a24b25942a	Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 ) Fixes #140092 Here's what this PR does: Case 1: no `eps` is passed to the Python frontend: use the `eps` associated with `opmath_t` instead of the `eps` associated with `scalar_t` for intermediate computation. Case 2: `eps` is passed to the Python frontend: avoid downcasting `eps` to `scalar_t` and then upcasting it again implicitly in the `rqrst_input` computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848 Approved by: https://github.com/albanD	2025-01-29 01:01:44 +00:00
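
The two cases in the RMSNorm entry above (#142848) can be illustrated with a minimal pure-Python sketch. This is not the PyTorch kernel, which is C++. It assumes that the default `eps` is the machine epsilon of the chosen type and that `opmath_t` for FP16/BF16 inputs is float32. The helper names `round_to_fp16` and `rms_norm` are hypothetical and exist only for this illustration.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Machine epsilon: the gap between 1.0 and the next representable value.
FP16_EPS = 2.0 ** -10  # float16 has 10 explicit mantissa bits
BF16_EPS = 2.0 ** -7   # bfloat16 has 7 explicit mantissa bits
FP32_EPS = 2.0 ** -23  # float32 has 23 explicit mantissa bits (assumed opmath_t)


def rms_norm(xs, eps):
    """Reference RMSNorm without a weight: x / sqrt(mean(x^2) + eps).

    Arithmetic runs in Python doubles; only the choice of eps is under study.
    """
    mean_sq = sum(x * x for x in xs) / len(xs)
    scale = (mean_sq + eps) ** -0.5
    return [x * scale for x in xs]


# Case 1: no eps is passed, so a default machine epsilon is used.
# For small inputs, the low-precision epsilon dominates mean(x^2).
small = [1e-3] * 4
out_low_precision_eps = rms_norm(small, FP16_EPS)
out_fp32_eps = rms_norm(small, FP32_EPS)

# Case 2: the user passes eps; a round trip through float16 changes its value.
user_eps = 1e-6
eps_after_downcast = round_to_fp16(user_eps)

print(out_low_precision_eps[0], out_fp32_eps[0])
print(user_eps, eps_after_downcast)
```

With the low-precision epsilon, the small inputs are normalized to values far below 1, while the float32 epsilon leaves them close to 1. In the second case, `1e-6` falls in the float16 subnormal range, so the round trip returns a slightly different number, which is the kind of silent change the commit message describes avoiding.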


83778 commits