This fixes a bug that could occur with python decompositions.
When an operation is intercepted in PyTorch's C++ code, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In the normal case the exclusively owned tensor is released and its value is returned as a non-exclusively owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leaving Python with a reference to a tensor that is no longer alive (and causing PyTorch to fall over in debug mode).
Note this will incur a performance hit when handling errors.
Fixes #106790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives APIs. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.
Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))
AsyncCollectiveTensor shouldn't need to do a synchronization if you try to detach() it though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which only require metadata, not actual data). This PR tries to update `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.
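To make the idea concrete, here is a minimal, hypothetical sketch of a wrapper subclass that keeps a pending sync alive across view ops. All names (`LazyWaitTensor`, `wait_fn`) are made up for illustration; this is not the actual `AsyncCollectiveTensor` implementation, and the set of ops treated as views is deliberately tiny.
```
import torch
from torch.utils._pytree import tree_map

class LazyWaitTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, inner, wait_fn):
        # Wrapper subclass: metadata mirrors the inner (unsynchronized) tensor.
        return torch.Tensor._make_wrapper_subclass(
            cls, inner.size(), strides=inner.stride(), dtype=inner.dtype,
            device=inner.device, requires_grad=inner.requires_grad,
        )

    def __init__(self, inner, wait_fn):
        self._inner = inner      # result of the communication op, not yet synced
        self._wait_fn = wait_fn  # performs the actual synchronization

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Metadata-only ops don't need the data, so the sync can stay pending.
        is_view = func in (torch.ops.aten.view.default,
                           torch.ops.aten.detach.default,
                           torch.ops.aten.t.default)

        def unwrap(t):
            if isinstance(t, LazyWaitTensor):
                return t._inner if is_view else t._wait_fn(t._inner)
            return t

        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))

        if is_view:
            # Re-wrap view outputs so the pending wait survives the view op.
            wait_fn = args[0]._wait_fn
            return tree_map(
                lambda t: LazyWaitTensor(t, wait_fn) if isinstance(t, torch.Tensor) else t,
                out,
            )
        return out
```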
Added some light testing that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
Summary:
MM max autotune (and friends) crash when one of the inputs is zero-size.
E.g., running this code:
```
@torch.compile()
def fn(x, y):
    return torch.mm(x, y)
inps = [torch.rand([0, 30]), torch.rand([30, 40])]
inps = [x.to(device="cuda") for x in inps]
out = fn(*inps)
```
with this command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python test.py
```
raises this error (the top of the stack trace omitted for brevity):
```
...
File "/data/users/aakhundov/pytorch/torch/_inductor/kernel/mm.py", line 119, in tuned_mm
return autotune_select_algorithm("mm", choices, [mat1, mat2], layout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 960, in autotune_select_algorithm
return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 787, in __call__
timings = self.lookup(
^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/codecache.py", line 267, in lookup
timings[choice] = benchmark(choice)
^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 774, in autotune
raise ErrorFromChoice(msg, choice, benchmark_fn.debug_str())
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: ErrorFromChoice: Please run `ptxas /tmp/compile-ptx-src-bfb1c6` to confirm that this is a bug in `ptxas`
From choice TritonTemplateCaller(/tmp/torchinductor_aakhundov/z7/cz7n7nn6rdlaelu4pbaaurgmu74ikl6g76lkngwawrevlfxlc6re.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=2, num_warps=4)
inputs = [
torch.empty_strided((0, 30), (30, 1), dtype=torch.float32, device='cuda'),
torch.empty_strided((30, 40), (40, 1), dtype=torch.float32, device='cuda'),
]
out = torch.empty_strided((0, 40), (40, 1), dtype=torch.float32, device='cuda')
target: aten.mm.default
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda', torch.float32, size=[0, s0], stride=[s0, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg3_1', layout=FixedLayout('cuda', torch.float32, size=[s0, s1], stride=[s1, 1]))
))
```
This PR adds a check to skip Triton templates in the `mm`, `addmm`, `mm_plus_mm` autotuning when the product of the MM problem shape (`m * n * k`) is zero.
Additionally, early exit conditions have been added to the mm and mm_plus_mm Triton templates on `M * N * K == 0`, to prevent issues when autotuning is done on non-zero-size inputs with dynamic shapes and the compiled model later encounters zero-size inputs.
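As a rough illustration of the guard (the helper and choice names here are made up, not the actual inductor code):
```
import sympy

def mm_problem_is_zero_size(m, n, k) -> bool:
    # With dynamic shapes these may be sympy expressions; only treat the
    # problem as empty when it is statically known to be zero.
    return sympy.simplify(m * n * k) == 0

choices = ["aten_mm_fallback"]
if not mm_problem_is_zero_size(0, 30, 40):
    # Triton template choices are only added (and benchmarked) for non-empty problems.
    choices.append("triton_mm_template")
```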
Test Plan:
```
$ python test/inductor/test_max_autotune.py -v
...
----------------------------------------------------------------------
Ran 16 tests in 29.569s
OK
```
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106865
Approved by: https://github.com/jansel
Although the sun is setting for torchscript, it is not [officially deprecated](https://github.com/pytorch/pytorch/issues/103841#issuecomment-1605017153) since nothing currently fully replaces it. Thus, "downstream" libraries like TorchVision that started offering torchscript support still need to support it for BC.
torchscript has forced us to use workaround after workaround since forever. Although this makes the code harder to read and maintain, we made our peace with it. However, we are currently looking into more elaborate API designs that are severely hampered by our torchscript BC guarantees.
While looking for ways to enable our design while keeping a subset of it scriptable, we found the undocumented (and likely not intended as such) `__prepare_scriptable__` escape hatch:
0cf918947d/torch/jit/_script.py (L977)
One can define this method, and when `torch.jit.script` is called on the object, the object returned by the method is scripted rather than the original one. In TorchVision we are using exactly [this mechanism to enable BC](3966f9558b/torchvision/transforms/v2/_transform.py (L122-L136)) while allowing the object in eager mode to be a lot more flexible (`*args, **kwargs`, dynamic dispatch, ...).
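For illustration, a minimal sketch of how the existing hook behaves for `nn.Module`s (the module names are made up):
```
import torch
import torch.nn as nn

class ScriptFriendly(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class Flexible(nn.Module):
    # Eager-mode forward uses constructs torchscript can't handle
    def forward(self, *args, **kwargs):
        return args[0] + 1

    def __prepare_scriptable__(self):
        # torch.jit.script will script the returned object instead of `self`
        return ScriptFriendly()

scripted = torch.jit.script(Flexible())
print(scripted(torch.ones(2)))  # tensor([2., 2.])
```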
Unfortunately, this escape hatch is only available for `nn.Module`s:
0cf918947d/torch/jit/_script.py (L1279-L1283)
This was fine for the example above since we were subclassing from `nn.Module` anyway. However, we recently also hit a scenario [where this wasn't the case](https://github.com/pytorch/vision/pull/7747#issuecomment-1642045479).
Given the frozen state of JIT, would it be possible to give us a general escape hatch so that we can move forward with the design unconstrained while still keeping BC?
This PR implements just this by re-using the `__prepare_scriptable__` hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106229
Approved by: https://github.com/lezcano, https://github.com/ezyang
I found that for a tiled kernel for a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' to XBLOCK instead has the semantics of assigning 'a' to the inner loop.
For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms); a minimal repro of this pattern is sketched after the kernel dumps below. Here are the dumps of the kernels:
- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551
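A minimal version of that repro (shapes are illustrative; the elementwise add against a transposed operand is what produces the tiled pointwise kernel):
```
import torch

@torch.compile
def add_transposed(a, b):
    return a + b.t()

a = torch.rand(4096, 4096, device="cuda")
b = torch.rand(4096, 4096, device="cuda")
out = add_transposed(a, b)
```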
I tried this on DistillGPT2 and found perf is neutral, but that's because DistillGPT2 has a single tiled pointwise kernel in its backward graph. Will check the dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
Port fix from https://github.com/huggingface/safetensors/pull/318 into ONNX exporter until it is merged
* This adds support for loading safetensors within a FakeTensorMode, which results in creating `torch.empty((shape,), dtype=)`. This is done through a monkeypatch for the in-progress https://github.com/huggingface/safetensors/pull/318
* Adds a test for the HF bloom model (bigscience/bloom-560m)
* This PR also fixes existing fake tensor unit tests by moving the `torch.onnx.dynamo_export` call to be inside the `enable_fake_mode()` context. Although calling `torch.onnx.dynamo_export` outside the context works for several models, the right way of using fake mode is calling the exporter within the context manager, as sketched below.
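A hedged sketch of that usage pattern (the model here is a toy; a real workflow would load actual weights, e.g. from safetensors, into the fake-initialized model):
```
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

with torch.onnx.enable_fake_mode() as fake_context:
    model = TinyModel()          # parameters are created as fake tensors
    x = torch.randn(2, 16)
    onnx_program = torch.onnx.dynamo_export(model, x)
```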
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106930
Approved by: https://github.com/BowenBao
Summary: if `torch._inductor.config.use_mixed_mm` is set, we can convert
`torch.mm(a, b.to(some_dtype))` into a Triton kernel where the cast of `b`
is fused into the matmul rather than needing to materialize the casted `b`
tensor. If `use_mixed_mm` is set, this fused kernel is autotuned
against the default two-kernel fallback option. If `force_mixed_mm` is set,
the fused kernel is always used. This option is needed for weight-only quantization, where in
some cases we rely on the superior memory characteristics of the fused
kernel rather than on the perf numbers (when we can't afford to fill memory
with a tensor 4x the size of our quantized one).
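A hedged illustration of the pattern being fused (dtypes and shapes are illustrative; `use_mixed_mm` is the config flag described above):
```
import torch
import torch._inductor.config as inductor_config

inductor_config.use_mixed_mm = True

@torch.compile
def mixed_mm(a, b):
    # the cast of b is fused into the matmul instead of materializing an upcast copy of b
    return torch.mm(a, b.to(a.dtype))

a = torch.rand(64, 64, device="cuda", dtype=torch.float16)
b = torch.randint(-128, 127, (64, 64), device="cuda", dtype=torch.int8)
out = mixed_mm(a, b)
```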
Test Plan: python test/inductor/test_pattern_matcher.py -k "mixed_mm"
python test/inductor/test_torchinductor.py -k "mixed_mm"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106443
Approved by: https://github.com/jansel
Some notable changes:
1. `constrain_as_size` now allows the min value to be less than 2, since the compiler will unconditionally assume min >= 2 for compilation purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime-assert on the unbacked symint's value range, which would always be [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.
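A hedged sketch of what using the API looks like under the new behavior (assuming the `torch._constrain_as_size` entry point; the values are illustrative):
```
import torch

def make_buffer(n_tensor):
    n = n_tensor.item()                           # unbacked symint under compile/export
    torch._constrain_as_size(n, min=0, max=1024)  # min < 2 is now accepted
    return torch.zeros(n)

out = make_buffer(torch.tensor(5))
```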
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
When removing an inplace buffer, we just mark it as ```REMOVED```. If, after removing some inplace buffers, we then mark a buffer as an inplace buffer and use the length of ```self.inplace_buffers.values()``` to create the buffer name, we may define an inplace buffer name that already exists in ```self.inplace_buffers.values()```:
Before removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf7': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf9': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf12': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf13': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf25': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf20': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf26': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf31': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf32': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32'])}
```
After removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': 'REMOVED', 'buf7': 'REMOVED', 'buf9': 'REMOVED', 'buf12': 'REMOVED', 'buf13': 'REMOVED', 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': 'REMOVED', 'buf25': 'REMOVED', 'buf20': 'REMOVED', 'buf26': 'REMOVED', 'buf31': 'REMOVED', 'buf32': 'REMOVED', 'buf16': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38']), 'buf38': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38'])}
```
If we then mark some buffer as an inplace buffer, the new buffer name uses ```in_out_ptr{len(unique(self.inplace_buffers.values()))}```, which may produce ```in_out_ptr6``` even though this name already exists in ```self.inplace_buffers```.
After this PR, we change ```REMOVED``` to ```REMOVED{1, 2, 3, ...}```, which avoids defining a duplicate name. With this fix, ```pyhpc_equation_of_state``` from ```torchbench``` works for the CPU backend:
```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --inference --float32 -dcpu -n50 --inductor --freezing --no-skip --dashboard --only pyhpc_equation_of_state --cold_start_latency```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106852
Approved by: https://github.com/lezcano
On an SPR machine, the mkldnn bfloat16 convolution always returns a channels-last output, and we convert it to channels-first if the input and weight are channels-first. There is an issue with this conversion when the output is nc11 (4*512*1*1): we always mark it as a public-format ideep tensor, and even if we call ```to_dense``` before returning the output, the output's stride is still a channels-last stride (512, 1, 512, 512). This PR calls ```resize_``` to make sure the stride is the contiguous stride.
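A hedged repro sketch of the nc11 case (shapes chosen so the convolution output is (4, 512, 1, 1); whether the mkldnn bfloat16 path is actually taken depends on the machine and build):
```
import torch

conv = torch.nn.Conv2d(256, 512, kernel_size=3).to(torch.bfloat16)
x = torch.rand(4, 256, 3, 3).to(torch.bfloat16)  # channels-first (contiguous) input
out = conv(x)                                    # nc11 output: (4, 512, 1, 1)
print(out.stride())  # contiguous stride (512, 1, 1, 1) expected after the fix
```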
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106966
Approved by: https://github.com/mingfeima
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/
We have already been using this in core for the last few months as an external dependency. This PR pulls all of it into core.
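For context, a hedged example of the kind of NumPy code this layer lets `torch.compile` trace (the function and shapes are illustrative):
```
import numpy as np
import torch

@torch.compile
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=1)

out = numpy_fn(np.random.randn(8, 8), np.random.randn(8, 8))
```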
In the next commits, I do a number of things in this order
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.
Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one, but it stopped working after the merge and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.
All the tests in `tests/torch_np` take about 75s to run.
This was work by @ev-br, @rgommers, @honno, and me. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
Summary:
Introduce a GPU memory layout qualifier in `vTensor`, which will allow more efficient memory layouts when storing Tensors on the GPU.
The plan is for shaders to use the memory layout qualifier to convert between logical tensor coordinates and physical texel positions.
Test Plan:
As-is, this diff should be a no-op. Run standard tests to make sure everything works as expected.
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
```
Reviewed By: kimishpatel
Differential Revision: D48129905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106978
Approved by: https://github.com/liuk22
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.
Also add it to the list of core decompositions.
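Roughly, the decomposition amounts to the following (a hedged sketch, not the exact core code): `_unsafe_index` is `index` without the bounds checking, so after tracing it can simply lower to the ordinary index op.
```
import torch

aten = torch.ops.aten

def unsafe_index_decomp(x, indices):
    # drop the "unsafe" (bounds-check-free) op in favor of the regular index op
    return aten.index.Tensor(x, indices)

x = torch.arange(6).reshape(2, 3)
print(unsafe_index_decomp(x, [torch.tensor([1])]))  # same result as x[[1]]
```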
Test Plan: contbuild and OSS CI (similar to D40075277)
Differential Revision: D48163393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
Summary: Adding an enforce gives better error information than raising SIGFPE when division by zero happens. We'll get the actual BlobRef names as well as the error categories.
Test Plan:
Ran a local worker and client using DPP session with empty tensors and checked the error:
`../buck-out/v2/gen/fbcode/data_preproc/perf_test/client --sr2_event_base_pool_size=24`
`../buck-out/v2/gen/fbcode/data_preproc/perf_test/worker --dpp_session_id=5D49F56C98CC95BD97027BC0DDB38D8F`
```{dpp_internal_errorcategory : user_error,
ONCALL : MLDP_CONTROL,
CATEGORY : INPUT_ERROR,
errorsubsystemtags : [DPP_WORKER],
errorcause : USER_ERROR,
RETRYABILITY : 0}F0806 17:47:52.607200 2280375 SchedRuntimeEnv.cpp:385] facebook::data_preproc::NonRetryableGenericUser
Error: User preprocessing error c10::Error: [enforce fail at utility_ops.h:730] input.numel() > 0. 0 vs 0. tensor has t
o be nonempty (Error from operator:
input: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_
features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCOR
ELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Concat:0" input:
"preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_feature
s_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_E
NCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Mul_2" input: "preproc_d
ata_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processo
r_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/encoded_id_lengths" output: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST```
Differential Revision: D48104430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106882
Approved by: https://github.com/kit1980
Currently, multilayer reductions (aka split reductions) are only used with static
shapes, which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this change only requires that the shape has a hint value.
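A hedged example of the kind of workload affected: a large reduction compiled with dynamic shapes, which can now take the split-reduction path as long as the size has a hint value.
```
import torch

@torch.compile(dynamic=True)
def total(x):
    return x.sum()

out = total(torch.rand(2**24, device="cuda"))
```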
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106747
Approved by: https://github.com/lezcano
ghstack dependencies: #106626, #106870
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.
Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
When `reference_as_float` is true, reference gradients will not have the same
dtype as the actual computed gradients. This fixes the issue by downcasting
before doing the comparison.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106626
Approved by: https://github.com/lezcano
This fixes a pretty vicious bug relating to `SHARD_GRAD_OP`, mixed precision, EMA, and eval.
**Bug Explanation**
The model has a main module and an EMA module, where the main module is used for training and the EMA module is used for eval. The model has FSDP's fp16 mixed precision enabled. The flow consists of (1) training forward/backward/optimizer -> (2) EMA update (copy main module to EMA module) -> (3) eval forward in `torch.no_grad()`, and this repeats for many iterations.
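A minimal sketch of that flow (hedged: the module, data, and EMA decay are toy placeholders, and the two modules are wrapped separately here for simplicity; assumes a run launched with torchrun and one GPU per rank):
```
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP, MixedPrecision, ShardingStrategy,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
fp16 = MixedPrecision(param_dtype=torch.float16)
main = FSDP(torch.nn.Linear(16, 16).cuda(),
            sharding_strategy=ShardingStrategy.SHARD_GRAD_OP, mixed_precision=fp16)
ema = FSDP(torch.nn.Linear(16, 16).cuda(),
           sharding_strategy=ShardingStrategy.SHARD_GRAD_OP, mixed_precision=fp16)
optim = torch.optim.SGD(main.parameters(), lr=0.1)

for batch in [torch.randn(8, 16).cuda() for _ in range(3)]:
    # (1) training forward/backward/optimizer on the main module
    main(batch).sum().backward()
    optim.step()
    optim.zero_grad()

    # (2) EMA update; summon_full_params() forces full-precision unsharded params
    with FSDP.summon_full_params(main), FSDP.summon_full_params(ema), torch.no_grad():
        for p_ema, p in zip(ema.parameters(), main.parameters()):
            p_ema.lerp_(p, 0.001)

    # (3) eval forward on the EMA module
    with torch.no_grad():
        ema(batch)
```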
Consider the _second_ iteration.
- From the first iteration's eval forward, the EMA module has the fp16 unsharded parameters in memory (not freed due to `SHARD_GRAD_OP`).
- In this second iteration's step (2), we perform the EMA update under the `summon_full_params()` context, where FSDP specially forces full precision. This means that the EMA module now uses fp32 unsharded parameters, distinct from the fp16 unsharded parameters still in memory. The EMA update modifies those fp32 parameters, and upon exiting the context, FSDP correctly writes the modifications back to the fp32 sharded parameters.
- In the second iteration's step (3) (eval forward), FSDP checks whether it needs to run the unshard op (including all-gather) but sees it does not since the fp16 unsharded parameters are still in memory. Thus, FSDP uses those fp16 unsharded parameters directly without all-gather. However, these fp16 unsharded parameters are stale and do not include the EMA update!
- In other words, at this point, the fp32 sharded parameters are correct, the fp16 unsharded parameters are stale, and FSDP chooses _not_ to re-all-gather since the fp16 unsharded parameters are in memory.
**Fix Explanation**
This PR fixes this by freeing the fp16 unsharded parameters if they are still allocated when forcing full precision, i.e. using fp32 unsharded parameters in `summon_full_params()`. This ensures that any modifications written back to the fp32 sharded parameters will be persisted via the next all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106858
Approved by: https://github.com/kumpera
ghstack dependencies: #106857
Summary:
att
We don't actually need the gradient for conv2d, we just need it to run without error, so we delay the out_dtype gradient error
to the time when the user actually requests it.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106924
Approved by: https://github.com/zou3519, https://github.com/kimishpatel
Fixes `signed-unsigned comparison` warnings introduced by https://github.com/pytorch/pytorch/pull/106809 (previously by <s> https://github.com/pytorch/pytorch/pull/104054 </s> ) that changed the type of `num_indices` to unsigned.
Before the change, the warnings look as follows:
```
/tmp/tmpxft_00194ca7_00000000-6_IndexKernel.cudafe1.stub.c:31:580: required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:58:63: warning: comparison of integer expressions of different signedness: ‘const long unsigned int’ and ‘int’ [-Wsign-compare]
58 | AT_ASSERT(num_indices == iter.ntensors() - 2);
| ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:74:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const long unsigned int’ [-Wsign-compare]
74 | for (int i = 0; i < num_indices; i++) {
| ~~^~~~~~~~~~~~~
```
TODO: Turn those warnings into errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104423
Approved by: https://github.com/Skylion007
1. Add a Python meta registration to fix an issue with the forward pass. The problem was that previously, the C++ meta registration calls [numel()](7b14a14e27/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L329)) which fails (LMK if it's better to fix the C++ implementation to not do this check). A generic sketch of a Python meta registration follows this list.
2. Modify the backward to fix an issue there. The backward is not a custom op - it's a custom manual backward implementation. In particular, there are some situations that don't support double backward; the check for whether double backward is allowed requires a `.item()` call. To fix the meta/fake tensor case, this PR only sets the double backward error when `GradMode::is_enabled()` - which shouldn't be turned on in PT2.
3. Update skips.
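For reference, a hedged, generic illustration of what a Python meta registration looks like, using a toy custom op rather than the op touched by this PR:
```
import torch
from torch.library import Library

lib = Library("demo", "DEF")
lib.define("double(Tensor x) -> Tensor")

def double_cpu(x):
    return x * 2

def double_meta(x):
    # A meta kernel only computes output metadata (shape/dtype/device)
    # and never touches the data itself.
    return torch.empty_like(x)

lib.impl("double", double_cpu, "CPU")
lib.impl("double", double_meta, "Meta")

print(torch.ops.demo.double(torch.ones(3, device="meta")).shape)  # torch.Size([3])
```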
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106429
Approved by: https://github.com/zou3519
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.
The capturable implementation for foreach simply varies due to a different order of operations when updating params. I had also attempted to compare against fp64 but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb