Commit graph

62887 commits

Author SHA1 Message Date
David Watson
c9cdcb299a Remove ExclusivelyOwned from register_dispatch_key (#106791)
This fixes a bug that could occur with python decompositions.

When an operation is intercepted in the C++ code in PyTorch, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In the normal use case, the exclusively owned tensor is released and its value returned as a non-exclusively owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leaving Python with a reference to a tensor which is no longer alive (and causing PyTorch to fall over in debug mode).

Note this will be a performance hit when handling errors.
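The ownership hazard can be illustrated with a toy Python analogue of the C++ pattern (the class and function names below are invented for illustration; the real code is C++ in the dispatcher):

```python
class Tensor:
    """Stand-in for at::Tensor: just tracks whether it is alive."""
    def __init__(self):
        self.alive = True

    def destroy(self):
        self.alive = False


class ExclusivelyOwned:
    """Toy analogue of c10::ExclusivelyOwned: unless ownership is
    released, the wrapper destroys its payload at scope exit."""
    def __init__(self, payload):
        self._payload = payload

    def release(self):
        payload, self._payload = self._payload, None
        return payload

    def destroy_if_owned(self):
        # Stands in for the C++ destructor running at scope exit.
        if self._payload is not None:
            self._payload.destroy()


def op_with_python_decomposition(fail, captured):
    t = Tensor()
    owned = ExclusivelyOwned(t)
    captured.append(t)  # the Python decomposition shares ownership of t
    try:
        if fail:
            raise RuntimeError("decomposition failed")
        return owned.release()  # normal path: hand the value back out
    finally:
        owned.destroy_if_owned()  # on error, the wrapper kills the tensor


captured = []
ok = op_with_python_decomposition(False, captured)
assert ok.alive and captured[-1].alive  # success: tensor survives

try:
    op_with_python_decomposition(True, captured)
except RuntimeError:
    pass
assert not captured[-1].alive  # failure: Python holds a dead tensor
```

The last assertion is exactly the bug: after an error, the shared reference points at a destroyed tensor.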

Fixes #106790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
2023-08-11 21:04:33 +00:00
Edward Z. Yang
d97b18d769 Propose nkaretnikov as general PrimTorch/meta/decomp reviewer (#106970)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106970
Approved by: https://github.com/albanD
2023-08-11 20:50:31 +00:00
Yanbo Liang
fbfb9a1648 [Dynamo] Improve PT2 fbcode logging observability (#106932)
Summary:
https://docs.google.com/document/d/1D5K3_ELsda3tIUeSyNL_2yee-M3jVWbirqSQ5BDNvHQ/edit

This is the revamped version of D47908299.

For each frame, we will record a list of compilation metrics: e.g, backend_compile time, entire_frame_compile time, cache_size, co_filename, co_firstlineno, co_name, guards, graph input_count, graph node_count, graph op_count.

With the help of job info: mast_job_name, global_rank, we can satisfy the requirements from `Things I’ve used/wanted to use our logging to determine` in https://docs.google.com/document/d/1D5K3_ELsda3tIUeSyNL_2yee-M3jVWbirqSQ5BDNvHQ/edit (or add more metrics for this framework)

Test Plan:
```
buck2 test //caffe2/test:test_dynamo
```

Differential Revision: D48142400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106932
Approved by: https://github.com/anijain2305
2023-08-11 20:46:04 +00:00
Catherine Lee
1cfe292061 Mark test_lstm_packed as slow (#107048)
The test takes >30 minutes to run on some configurations and keeps getting unmarked as slow by the automatic slow test detection.
Examples:
https://ossci-raw-job-status.s3.amazonaws.com/log/15824750763
https://ossci-raw-job-status.s3.amazonaws.com/log/15802766247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107048
Approved by: https://github.com/huydhn
2023-08-11 20:35:14 +00:00
Nikita Shulga
22a20d0850 Add isFloat8Type predicate (#106977)
And make float8 dtypes part of `isReducedFloatingType()` predicate
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106977
Approved by: https://github.com/albanD
2023-08-11 20:13:57 +00:00
Wanchao Liang
5c48ff20b5 AsyncCollectiveTensor: dont sync on view ops (#105240)
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives APIs. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.

Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))

AsyncCollectiveTensor shouldn't need to synchronize when you detach() it, though. In fact, it should be fine to avoid synchronizing when you perform any view ops on it (which only require viewing metadata, not actual data). This PR updates `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.
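The intended behavior can be sketched with a plain-Python stand-in (this toy class and its op names are illustrative; the real `AsyncCollectiveTensor` is a `torch.Tensor` subclass dispatching through `__torch_dispatch__`):

```python
class ToyAsyncTensor:
    """Delays wait() until real data is needed; view ops pass through."""
    VIEW_OPS = {"detach", "view", "reshape", "transpose"}  # metadata-only

    def __init__(self, data, synced=False):
        self.data = data
        self.synced = synced

    def wait(self):
        # The real subclass blocks on the collective's work handle here.
        self.synced = True
        return self.data

    def apply(self, op_name):
        if op_name in self.VIEW_OPS:
            # View ops only touch metadata: stay unsynchronized.
            return ToyAsyncTensor(self.data, synced=self.synced)
        # Compute ops read real data: synchronize first.
        self.wait()
        return self.data


t = ToyAsyncTensor([1, 2, 3])
v = t.apply("detach")   # e.g. what DTensor does when wrapping
assert not v.synced     # no premature synchronization
_ = v.apply("add")      # first real data access...
assert v.synced         # ...triggers the wait()
```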

Added some light testing that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
2023-08-11 19:20:25 +00:00
Sam Larsen
e165938853 Implement decomposition for aten.rrelu_with_noise (#106812)
Test Plan:
* Primarily, added new test in test/test_decomp.py
* Updated existing tests, e.g., to NOT expect failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106812
Approved by: https://github.com/eellison
2023-08-11 19:18:29 +00:00
Adnan Akhundov
b1b3f61f2c Skip Triton templates in MM max autotune with zero-size inputs (#106865)
Summary:

MM max autotune (and friends) crash when one of the inputs is zero-size.

E.g., running this code:

```
@torch.compile()
def fn(x, y):
    return torch.mm(x, y)

inps = [torch.rand([0, 30]), torch.rand([30, 40])]
inps = [x.to(device="cuda") for x in inps]
out = fn(*inps)
```

with this command:

```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python test.py
```

raises this error (the top of the stack trace omitted for brevity):

```
...
  File "/data/users/aakhundov/pytorch/torch/_inductor/kernel/mm.py", line 119, in tuned_mm
    return autotune_select_algorithm("mm", choices, [mat1, mat2], layout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 960, in autotune_select_algorithm
    return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 787, in __call__
    timings = self.lookup(
              ^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/codecache.py", line 267, in lookup
    timings[choice] = benchmark(choice)
                      ^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 774, in autotune
    raise ErrorFromChoice(msg, choice, benchmark_fn.debug_str())
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: ErrorFromChoice: Please run `ptxas /tmp/compile-ptx-src-bfb1c6` to confirm that this is a bug in `ptxas`

From choice TritonTemplateCaller(/tmp/torchinductor_aakhundov/z7/cz7n7nn6rdlaelu4pbaaurgmu74ikl6g76lkngwawrevlfxlc6re.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=2, num_warps=4)
inputs = [
    torch.empty_strided((0, 30), (30, 1), dtype=torch.float32, device='cuda'),
    torch.empty_strided((30, 40), (40, 1), dtype=torch.float32, device='cuda'),
]
out = torch.empty_strided((0, 40), (40, 1), dtype=torch.float32, device='cuda')

  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda', torch.float32, size=[0, s0], stride=[s0, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg3_1', layout=FixedLayout('cuda', torch.float32, size=[s0, s1], stride=[s1, 1]))
  ))
```

This PR adds a check to skip Triton templates in the `mm`, `addmm`, `mm_plus_mm` autotuning when the product of the MM problem shape (`m * n * k`) is zero.

Additionally, early exit conditions have been added to the mm and mm_plus_mm Triton templates on `M * N * K == 0`, to prevent issues when autotuning was done on non-zero-size inputs with dynamic shapes, then zero-size inputs are encountered by the compiled model.
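The skip logic can be sketched as follows (the helper and choice names are illustrative, not Inductor's actual autotuning code):

```python
def mm_autotune_choices(m, n, k, triton_templates, aten_fallbacks):
    """Drop Triton template choices for zero-size MM problems: when
    m * n * k == 0 there is nothing to compute, and benchmarking the
    templates on such inputs can crash (sketch of the check above)."""
    if m * n * k == 0:
        return list(aten_fallbacks)
    return list(triton_templates) + list(aten_fallbacks)


# With a zero-size LHS, only the ATen fallback remains to autotune.
assert mm_autotune_choices(0, 40, 30, ["triton_cfg"], ["aten_mm"]) == ["aten_mm"]
# Non-degenerate shapes keep the Triton templates in the mix.
assert mm_autotune_choices(16, 40, 30, ["triton_cfg"], ["aten_mm"]) == ["triton_cfg", "aten_mm"]
```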

Test Plan:

```
$ python test/inductor/test_max_autotune.py -v

...

----------------------------------------------------------------------
Ran 16 tests in 29.569s

OK
```

Reviewers: @eellison

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106865
Approved by: https://github.com/jansel
2023-08-11 19:10:16 +00:00
Howard Huang
656412f0cb Add multiprocess option to dynamo benchmarks (#106394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106394
Approved by: https://github.com/XilunWu
2023-08-11 18:34:09 +00:00
ydwu4
3fe1dba068 Fix test_cond_functionalized_aot_func_check_functional (#106889)
Fixes a typo in a unit test.

Test Plan:
Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106889
Approved by: https://github.com/tugsbayasgalan
2023-08-11 18:31:12 +00:00
Philip Meier
a926be39d4 torch.jit.script escape hatch (#106229)
Although the sun is setting for torchscript, it is not [officially deprecated](https://github.com/pytorch/pytorch/issues/103841#issuecomment-1605017153) since nothing currently fully replaces it. Thus, "downstream" libraries like TorchVision that started offering torchscript support still need to keep supporting it for BC.

torchscript has forced us to use workaround after workaround since forever. Although this makes the code harder to read and maintain, we made our peace with it. However, we are currently looking into more elaborate API designs that are severely hampered by our torchscript BC guarantees.

Although likely not intended as such, while looking for ways to enable our design while keeping a subset of it scriptable, we found the undocumented `__prepare_scriptable__` escape hatch:

0cf918947d/torch/jit/_script.py (L977)

One can define this method and if you call `torch.jit.script` on the object, the returned object of the method will be scripted rather than the original object. In TorchVision we are using exactly [this mechanism to enable BC](3966f9558b/torchvision/transforms/v2/_transform.py (L122-L136)) while allowing the object in eager mode to be a lot more flexible (`*args, **kwargs`, dynamic dispatch, ...).

Unfortunately, this escape hatch is only available for `nn.Module`s

0cf918947d/torch/jit/_script.py (L1279-L1283)

This was fine for the example above since we were subclassing from `nn.Module` anyway. However, we recently also hit a case [where this wasn't the case](https://github.com/pytorch/vision/pull/7747#issuecomment-1642045479).

Given the frozen state on JIT, would it be possible to give us a general escape hatch so that we can move forward with the design unconstrained while still keeping BC?

This PR implements just this by re-using the `__prepare_scriptable__` hook.
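The dispatch being generalized here can be mimicked in a few lines of plain Python (the `script` helper below is a sketch, not the real `torch.jit.script`):

```python
def script(obj):
    # If the object defines __prepare_scriptable__, compile the object it
    # returns instead of the original one (the real logic lives in
    # torch/jit/_script.py and actually runs the TorchScript compiler).
    prepared = getattr(obj, "__prepare_scriptable__", lambda: obj)()
    return prepared


class FlexibleTransform:
    """Eager-mode object free to use *args/**kwargs, dynamic dispatch..."""

    def __prepare_scriptable__(self):
        # ...which hands torch.jit.script a restricted, scriptable twin.
        return "scriptable-equivalent"


assert script(FlexibleTransform()) == "scriptable-equivalent"
assert isinstance(script(42), int)  # objects without the hook pass through
```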
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106229
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-08-11 18:24:46 +00:00
PyTorch MergeBot
71be8f2223 Revert "Add initial support for FP8 ONNX export (#106379)"
This reverts commit 08704f96f0.

Reverted https://github.com/pytorch/pytorch/pull/106379 on behalf of https://github.com/kit1980 due to breaking multiple internal builds ([comment](https://github.com/pytorch/pytorch/pull/106379#issuecomment-1675192700))
2023-08-11 18:22:35 +00:00
Shunting Zhang
e18ca4028b [inductor] add one triton config for reduction (#106925)
This config found by coordinate descent tuning improves kernel https://gist.github.com/shunting314/189a8ef69f90db9d614a823385147a72 from
- 10.008ms    5.993GB    598.83GB/s
to
- 6.170ms    5.993GB    971.28GB/s .

It should only affect reduction with hint ReductionHint.DEFAULT or when max autotune is enabled.

(It's funny that before I upgraded my Triton version, the improvement was from 9.076ms -> 5.692ms.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106925
Approved by: https://github.com/jansel
2023-08-11 17:15:03 +00:00
Shunting Zhang
6696a75ea8 [inductor] make thread order consistent with loop order (#106827)
I found that for a tiled kernel for a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop. This order is picked by our loop-ordering algorithm. Mapping 'a' to XBLOCK effectively assigns 'a' to the inner loop instead.

For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here are the dumps of the kernels:

- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551

I tried this on DistillGPT2 and found perf is neutral, but that's because DistillGPT2 has a single tiled pointwise kernel in its backward graph. Will check the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
2023-08-11 17:05:21 +00:00
PyTorch MergeBot
745d29b0cc Revert "[export] Refactor constrain_as_value and constrain_as_size (#106591)"
This reverts commit 18989890bf.

Reverted https://github.com/pytorch/pytorch/pull/106591 on behalf of https://github.com/izaitsevfb due to Breaks inductor test on trunk ([comment](https://github.com/pytorch/pytorch/pull/106591#issuecomment-1675069091))
2023-08-11 16:37:47 +00:00
Thiago Crepaldi
0b05aef8d0 Add ONNX export support for huggingface's bigscience/bloom-560m model (#106930)
Port fix from https://github.com/huggingface/safetensors/pull/318 into ONNX exporter until it is merged

* This adds support for safetensors to be loaded within a FakeTensorMode, which results in creating `torch.empty((shape,), dtype=)`. This is done through a monkeypatch for the in-progress https://github.com/huggingface/safetensors/pull/318
* Adds a test for the HF bloom model (bigscience/bloom-560m)
* This PR also fixes existing fake tensor unit tests by moving the `torch.onnx.dynamo_export` to be inside the `enable_fake_mode()` context. Although calling `torch.onnx._dynamo_export` works for several models, the right way of using fake mode is calling the exporter within the context manager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106930
Approved by: https://github.com/BowenBao
2023-08-11 16:34:24 +00:00
Edward Z. Yang
9f26503bf0 SymInt'ify tile (#106933)
When auditing before, I was deceived by the argument name "dims". Actually, this argument says how many times to replicate each dim, so it can definitely be symbolic.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106933
Approved by: https://github.com/nkaretnikov, https://github.com/lezcano
2023-08-11 15:28:28 +00:00
Yukio Siraichi
a5d841ef01 asarray: take the default device into consideration. (#106779)
Fix: #106773

This PR makes it so `asarray` takes the default device into consideration when called with
a Python sequence as the data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106779
Approved by: https://github.com/rgommers, https://github.com/lezcano
2023-08-11 13:16:42 +00:00
Kurt Mohler
171341ee65 Support complex inputs in nan_to_num (#106944)
Fixes #105462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106944
Approved by: https://github.com/lezcano
2023-08-11 09:15:57 +00:00
summerdo
7db6eb7156 [test_nn] add custom device support for dropout tests, lazy_modules te… (#106609)
Add custom device support for dropout tests, lazy_modules tests, and multihead_attention tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106609
Approved by: https://github.com/mikaylagawarecki
2023-08-11 09:14:34 +00:00
HDCharles
03414081ff adding mixed_dtype_mm to torch._inductor (#106443)
Summary: if `torch._inductor.config.use_mixed_mm` is set, we can convert
`torch.mm(a, b.to(some_dtype))` into a Triton kernel where the cast of `b`
is fused into the matmul, rather than needing to materialize the casted `b`
tensor. When `use_mixed_mm` is set, this fused kernel is autotuned against
the default two-kernel fallback option. When `force_mixed_mm` is set, the
fused kernel is always used. The latter option is needed for weight-only
quantization, where in some cases we rely on the superior memory
characteristics of the fused kernel rather than its perf numbers (when we
can't afford to fill memory with a tensor 4x the size of our quantized one).

Test Plan: python test/inductor/test_pattern_matcher.py -k "mixed_mm"

python test/inductor/test_torchinductor.py -k "mixed_mm"

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106443
Approved by: https://github.com/jansel
2023-08-11 05:34:54 +00:00
Tugsbayasgalan Manlaibaatar
18989890bf [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime assert on the unbacked symint's value range, which would always be between [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.
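A pure-Python sketch of the resulting runtime check (the helper name below is made up for illustration; it is not the exported API):

```python
def assert_unbacked_range(val, user_min=None, user_max=None):
    """Runtime-assert the unbacked symint's range: [0, max] by default,
    honoring an explicit user-provided min (sketch of point 2 above)."""
    lo = 0 if user_min is None else user_min
    hi = float("inf") if user_max is None else user_max
    assert lo <= val <= hi, f"value {val} outside [{lo}, {hi}]"
    return val


assert_unbacked_range(1, user_max=10)              # 0 and 1 are now allowed
assert_unbacked_range(3, user_min=2, user_max=10)  # explicit min is honored
try:
    assert_unbacked_range(1, user_min=2, user_max=10)
    failed = False
except AssertionError:
    failed = True
assert failed
```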

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-11 05:29:22 +00:00
XiaobingSuper
df6aaf7bc2 inductor: fix compile error for inplace variable multi-defined (#106852)
When removing an inplace buffer, we just mark it as ```REMOVED```. After removing some inplace buffers, if we then mark a buffer as an inplace buffer, we use the number of unique values in ```self.inplace_buffers.values()``` to create the buffer name, so we may define an inplace buffer name that already exists in ```self.inplace_buffers.values()```:

before removing some inplace buffers, the ```self.inplace_buffers``` may be like:

```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf7': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf9': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf12': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf13': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf25': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf20': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf26': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf31': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf32': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32'])}
```
After removing some inplace buffers, the ```self.inplace_buffers``` may be like:

```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': 'REMOVED', 'buf7': 'REMOVED', 'buf9': 'REMOVED', 'buf12': 'REMOVED', 'buf13': 'REMOVED', 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': 'REMOVED', 'buf25': 'REMOVED', 'buf20': 'REMOVED', 'buf26': 'REMOVED', 'buf31': 'REMOVED', 'buf32': 'REMOVED', 'buf16': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38']), 'buf38': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38'])}
```
And then, when we mark some buffer as an inplace buffer, its name is generated as ```in_out_ptr{len(unique(self.inplace_buffers.values()))}```, so the buffer name may be ```in_out_ptr6``` even though this name already exists in ```self.inplace_buffers```.

After this PR, we change ```REMOVED``` to ```REMOVED{1, 2, 3...}```, which avoids defining a duplicate name. With this fix, ```pyhpc_equation_of_state``` from ```torchbench``` works for the CPU backend:

```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --inference --float32 -dcpu -n50 --inductor --freezing --no-skip --dashboard --only pyhpc_equation_of_state  --cold_start_latency```
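The naming collision can be reproduced with a small sketch, simplifying the `InplacedBuffer` values down to plain strings (illustrative only):

```python
def next_inner_name(inplace_buffers):
    # Names the next in/out pointer from the count of distinct values,
    # mirroring in_out_ptr{len(unique(self.inplace_buffers.values()))}.
    return f"in_out_ptr{len(set(inplace_buffers.values()))}"


# Before the fix: every removed buffer collapses into the single value
# "REMOVED", so the distinct-value count undercounts and can reuse a
# name that is still live.
buggy = {"buf0": "in_out_ptr0", "buf5": "REMOVED",
         "buf7": "REMOVED", "buf17": "in_out_ptr3"}
assert next_inner_name(buggy) == "in_out_ptr3"   # clashes with buf17!

# After the fix: unique markers REMOVED1, REMOVED2, ... keep the count
# monotonic, so freshly generated names never collide with live ones.
fixed = {"buf0": "in_out_ptr0", "buf5": "REMOVED1",
         "buf7": "REMOVED2", "buf17": "in_out_ptr3"}
assert next_inner_name(fixed) == "in_out_ptr4"
assert next_inner_name(fixed) not in fixed.values()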

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106852
Approved by: https://github.com/lezcano
2023-08-11 04:06:58 +00:00
Driss Guessous
7460adf7f3 Causing internal clang tidy to error (#106895)
Summary:
This was causing an error with clang tidy for internal version of PyTorch:
https://www.internalfb.com/diff/D47044755?dst_version_fbid=1190859734932683&transaction_fbid=1553883761684581

Test Plan: See Summary

Reviewed By: dmpolukhin

Differential Revision: D48202402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106895
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-08-11 03:54:05 +00:00
Michael Voznesensky
71a336ef75 [Dynamo x FSDP][1/x] Builder support for deque, appendleft (#106884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106884
Approved by: https://github.com/ezyang
2023-08-11 03:26:12 +00:00
XiaobingSuper
4df84c3b4d make sure mkldnn convolution given same stride as ref path for nc11 contiguous input (#106966)
On an SPR machine, the mkldnn bfloat16 convolution always returns a channels-last output, and we convert it to channels-first if the input and weight are channels-first. There is an issue with this conversion when the output is nc11 (e.g. 4*512*1*1): we always mark it as a public-format ideep tensor, and even though we call ```to_dense``` before returning the output, the output's stride is still a channels-last stride (512, 1, 512, 512). This PR calls ```resize_``` to make sure the stride is a contiguous stride.
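For the nc11 case above, the two stride layouts describe the same 4*512*1*1 memory but differ in metadata; a small helper makes the mismatch concrete (illustrative, not ideep's code):

```python
def contiguous_strides(sizes):
    """Row-major (NCHW-contiguous) strides for a shape."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return tuple(strides)


def channels_last_strides(sizes):
    """NHWC-style (channels-last) strides for a 4-d NCHW shape."""
    n, c, h, w = sizes
    return (c * h * w, 1, c * w, c)


shape = (4, 512, 1, 1)
# What the mkldnn path was returning, even after to_dense():
assert channels_last_strides(shape) == (512, 1, 512, 512)
# What resize_ restores, matching the channels-first reference path:
assert contiguous_strides(shape) == (512, 1, 1, 1)
```

Because H = W = 1, both layouts address the same bytes, which is why only the stride metadata needed fixing.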

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106966
Approved by: https://github.com/mingfeima
2023-08-11 00:59:48 +00:00
lezcano
a9dca53438 NumPy support in torch.compile (#106211)
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/

We have already been using this in core for the last few months as a external dependency. This PR pulls all these into core.

In the next commits, I do a number of things in this order
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (I think they were not testing anything) and are not passing now.

Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one but after the merge stopped working and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.

All the tests in `tests/torch_np` take about 75s to run.

This was a work by @ev-br, @rgommers @honno and I. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
2023-08-11 00:39:32 +00:00
Sam Larsen
8f774330af [inductor] Use shape env bounds in inductor bounds.py (#106175) (#106568)
Summary: If constrained ranges are available, use them in bounds.py before value range analysis (to enable Int64 -> Int32 optimization).

Test Plan: New unit test in test_torchinductor.py to mark a tensor as dynamic, then constrain with constrain_as_size (as outlined in https://github.com/pytorch/pytorch/issues/106175)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106568
Approved by: https://github.com/eellison, https://github.com/lezcano
2023-08-11 00:16:09 +00:00
Stephen Jia
62b3018024 [Vulkan] Introduce GPU Memory Layout qualifier (#106978)
Summary:
Introduce a GPU memory layout qualifier in `vTensor`, which will allow more efficient memory layouts when storing tensors on the GPU.

The plan is for shaders to use the memory layout qualifier to convert between logical tensor coordinates and physical texel positions.

Test Plan:
As-is, this diff should be a no-op. Run standard tests to make sure everything works as expected.

```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

```

Reviewed By: kimishpatel

Differential Revision: D48129905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106978
Approved by: https://github.com/liuk22
2023-08-10 23:45:54 +00:00
Stephen Jia
8c8477e55a Add _unsafe_index decomp (#106814)
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.

Also add it to the list of core decompositions.

Test Plan: contbuild and OSS CI (similar to D40075277)

Differential Revision: D48163393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
2023-08-10 23:23:37 +00:00
Jiaxu Zhu
152203d3c3 [pytorch][ao] Add torch.matmul in FloatFunctional/QFunctional (#106831)
Summary: As title

Test Plan: new unit tests

Differential Revision: D48172841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106831
Approved by: https://github.com/jerryzh168
2023-08-10 22:43:36 +00:00
Howard Cheng
dfb1b95919 [caffe2] Add enforce inside ScatterAssignOp (#106882)
Summary: Adding an enforce gives better error information than raising SIGFPE when division by zero happens. We'll get the actual BlobRef names as well as the error categories.

Test Plan:
Ran a local worker and client using DPP session with empty tensors and checked the error:

`../buck-out/v2/gen/fbcode/data_preproc/perf_test/client --sr2_event_base_pool_size=24`

`../buck-out/v2/gen/fbcode/data_preproc/perf_test/worker --dpp_session_id=5D49F56C98CC95BD97027BC0DDB38D8F`

```{dpp_internal_errorcategory : user_error,
ONCALL : MLDP_CONTROL,
CATEGORY : INPUT_ERROR,
errorsubsystemtags : [DPP_WORKER],
errorcause : USER_ERROR,
RETRYABILITY : 0}F0806 17:47:52.607200 2280375 SchedRuntimeEnv.cpp:385] facebook::data_preproc::NonRetryableGenericUser
Error: User preprocessing error c10::Error: [enforce fail at utility_ops.h:730] input.numel() > 0. 0 vs 0. tensor has t
o be nonempty (Error from operator:
input: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_
features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCOR
ELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Concat:0" input:
"preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_feature
s_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_E
NCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Mul_2" input: "preproc_d
ata_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processo
r_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/encoded_id_lengths" output: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST```

Differential Revision: D48104430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106882
Approved by: https://github.com/kit1980
2023-08-10 21:46:13 +00:00
atalman
aef27bdbe7 Reword release version: major->minor in README.md (#106980)
Correct wording to reflect reality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106980
Approved by: https://github.com/huydhn, https://github.com/albanD
2023-08-10 21:32:30 +00:00
Peter Bell
a62de2d5ec [inductor] Enable multilayer reductions with dynamic shapes (#106747)
Currently, multilayer reductions (aka split reductions) are only used with static
shapes, which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this only requires that the shape has a hint value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106747
Approved by: https://github.com/lezcano
ghstack dependencies: #106626, #106870
2023-08-10 21:07:25 +00:00
Peter Bell
fa65df3745 [inductor] Type triton size arguments in the kernel index_dtype (#106870)
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.

Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
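The failure mode and the fix can be sketched like so (`key_of` below mimics the value-dependent specialization described above; it is not Triton's actual `_key_of`):

```python
INT32_MAX = 2**31 - 1

def key_of(value):
    # Value-dependent typing: the same kernel argument specializes to
    # i32 or i64 depending on the number passed in.
    return "i32" if -(2**31) <= value <= INT32_MAX else "i64"


# Two calls of the same kernel can disagree on the argument's type, and
# indexing math done in i32 overflows once values exceed INT_MAX:
assert key_of(4096) == "i32"
assert key_of(INT32_MAX + 1) == "i64"


# The fix: size arguments are uniformly typed with the kernel's
# index_dtype, so all indexing calculations share one dtype.
def key_of_typed(value, index_dtype="i64"):
    return index_dtype


assert key_of_typed(4096) == key_of_typed(INT32_MAX + 1) == "i64"
```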

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
2023-08-10 21:07:25 +00:00
Peter Bell
3b2cb459fc [inductor] Fix reference_as_float gradcheck (#106626)
When `reference_as_float` is true, reference gradients will not have the same
dtype as the actual computed gradients. This fixes the issue by downcasting
before doing the comparison.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106626
Approved by: https://github.com/lezcano
2023-08-10 21:07:25 +00:00
Alexander Pivovarov
02abbb8109 Fix some typos, mostly "that that" (#106901)
Fix some typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901
Approved by: https://github.com/janeyx99
2023-08-10 19:46:53 +00:00
Andrew Gu
7b94d93431 [FSDP] Fix train -> EMA -> eval with mixed precision (#106858)
This fixes a pretty vicious bug relating to `SHARD_GRAD_OP`, mixed precision, EMA, and eval.

**Bug Explanation**
The model has a main module and an EMA module, where the main module is used for training and the EMA module is used for eval. The model has FSDP's fp16 mixed precision enabled. The flow consists of (1) training forward/backward/optimizer -> (2) EMA update (copy main module to EMA module) -> (3) eval forward in `torch.no_grad()`, where this repeats for many iterations.

Consider the _second_ iteration.
- From the first iteration's eval forward, the EMA module has the fp16 unsharded parameters in memory (not freed due to `SHARD_GRAD_OP`).
- In this second iteration's step (2), we perform the EMA update under the `summon_full_params()` context, where FSDP specially forces full precision.  This means that the EMA module now uses fp32 unsharded parameters, distinct from the fp16 unsharded parameters still in memory. The EMA update modifies those fp32 parameters, and upon exiting the context, FSDP correctly writes the modifications back to the fp32 sharded parameters.
- In the second iteration's step (3) (eval forward), FSDP checks whether it needs to run the unshard op (including all-gather) but sees it does not since the fp16 unsharded parameters are still in memory. Thus, FSDP uses those fp16 unsharded parameters directly without all-gather. However, these fp16 unsharded parameters are stale and do not include the EMA update!
- In other words, at this point, the fp32 sharded parameters are correct, the fp16 unsharded parameters are stale, and FSDP chooses _not_ to re-all-gather since the fp16 unsharded parameters are in memory.

**Fix Explanation**
This PR fixes this by freeing the fp16 unsharded parameters if they are still allocated when forcing full precision, i.e. using fp32 unsharded parameters in `summon_full_params()`. This ensures that any modifications written back to the fp32 sharded parameters will be persisted via the next all-gather.
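The staleness mechanism can be modeled with a stdlib-only toy (all names are hypothetical, not the FSDP API): fp32 "sharded" parameters plus a cached low-precision "unsharded" copy that survives across iterations, as under `SHARD_GRAD_OP`. Freeing the cached copy when the full-precision update runs is the fix.

```python
class ToyFSDPModule:
    def __init__(self, params):
        self.fp32_sharded = list(params)
        self._fp16_unsharded = None  # low-precision copy kept alive between steps

    @staticmethod
    def _to_low_prec(x):
        return round(x, 2)  # crude stand-in for fp32 -> fp16 casting

    def eval_forward(self):
        # Skip the "all-gather" if the low-precision copy is still in memory.
        if self._fp16_unsharded is None:
            self._fp16_unsharded = [self._to_low_prec(p) for p in self.fp32_sharded]
        return sum(self._fp16_unsharded)

    def ema_update(self, delta, free_low_prec):
        # summon_full_params() forces full precision: update fp32 shards directly.
        self.fp32_sharded = [p + delta for p in self.fp32_sharded]
        if free_low_prec:
            self._fp16_unsharded = None  # the fix: force a re-gather next forward

m = ToyFSDPModule([1.0, 2.0])
m.eval_forward()                      # caches the low-precision copy
m.ema_update(0.5, free_low_prec=False)
stale = m.eval_forward()              # bug: reuses the stale cached copy

m2 = ToyFSDPModule([1.0, 2.0])
m2.eval_forward()
m2.ema_update(0.5, free_low_prec=True)
fresh = m2.eval_forward()             # fix: re-gathers the updated params
assert (stale, fresh) == (3.0, 4.0)
```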
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106858
Approved by: https://github.com/kumpera
ghstack dependencies: #106857
2023-08-10 19:32:43 +00:00
Jerry Zhang
79449e6272 [quant][pt2e][fix] Remove the requirement of using no_grad for reference model that contains quantized conv2d (#106924)
Summary:
att

we don't actually need gradients for conv2d; we just need it to run without error, so we delay the out_dtype gradient error
until the user actually requests it
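The "delay the error" pattern can be sketched in plain Python (hypothetical names; not the actual out_dtype machinery): instead of raising while building the backward graph, store a closure that raises only when the gradient is actually requested, so the forward works without `no_grad`.

```python
class DelayedGradError:
    """Stand-in grad_fn that errors only when the gradient is requested."""
    def __init__(self, msg):
        self._msg = msg

    def __call__(self, *grad_outputs):
        raise NotImplementedError(self._msg)

def conv2d_reference(x):
    out = x * 2.0  # stand-in for the quantized conv2d arithmetic
    grad_fn = DelayedGradError("out_dtype does not support autograd")
    return out, grad_fn

out, grad_fn = conv2d_reference(3.0)   # forward runs fine without no_grad
assert out == 6.0
try:
    grad_fn(1.0)                       # the error surfaces only here
except NotImplementedError:
    pass
```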

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_conv2d

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106924
Approved by: https://github.com/zou3519, https://github.com/kimishpatel
2023-08-10 19:16:10 +00:00
alanhe151220037
1afbc985fe Make RNGStateTracker support cuda-like device (#106771)
Replace `CudaRNGStateTracker` with `RNGStateTracker` by rewriting some CUDA-bound code to use `device_handle`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106771
Approved by: https://github.com/wanchaol
2023-08-10 19:14:33 +00:00
Nikita Shulga
bb6b157458 Fix IndexKernel.cu build (#104423)
Fixes `signed-unsigned comparison` warnings introduced by https://github.com/pytorch/pytorch/pull/106809 (previously by  <s> https://github.com/pytorch/pytorch/pull/104054 </s> ), which changed the type of `num_indices` to unsigned.

Before the change, the warnings looked as follows:
```
/tmp/tmpxft_00194ca7_00000000-6_IndexKernel.cudafe1.stub.c:31:580:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:58:63: warning: comparison of integer expressions of different signedness: ‘const long unsigned int’ and ‘int’ [-Wsign-compare]
   58 |   AT_ASSERT(num_indices == iter.ntensors() - 2);
      |                                                               ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:74:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const long unsigned int’ [-Wsign-compare]
   74 |   for (int i = 0; i < num_indices; i++) {
      |                 ~~^~~~~~~~~~~~~
```
TODO: Turn those warnings into errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104423
Approved by: https://github.com/Skylion007
2023-08-10 18:32:47 +00:00
David Berard
393e9eed90 [inductor] modify index_reduce to pass opinfo tests (#106429)
1. Add a Python meta registration to fix an issue in the forward pass. The problem was that previously, the C++ meta registration called [numel()](7b14a14e27/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L329)), which fails (LMK if it's better to fix the C++ implementation to not do this check).
2. Fix an issue in the backward. The backward is not a custom op - it's a custom manual backward implementation. In particular, some situations don't support double backward, and the check for whether double backward is allowed requires a `.item()` call. To fix the meta/fake tensor case, this PR sets the double-backward error only if `GradMode::is_enabled()` - which shouldn't be turned on in PT2.
3. Update skips.
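The idea behind a Python meta registration can be sketched in stdlib Python (hypothetical names; the real registration goes through torch's dispatcher machinery): compute the output's metadata from the input's metadata alone, touching no data, so nothing like a `numel()` check on real storage is needed.

```python
def index_reduce_meta(self_shape, dim, index_len):
    # Shape-only reasoning: index_reduce writes into `self` along `dim`,
    # so the output has the same shape as the input; no data is touched.
    assert 0 <= dim < len(self_shape)
    assert index_len >= 0
    return tuple(self_shape)

# A fake/meta tensor of shape (4, 5, 6) reduced along dim 1 keeps its shape.
assert index_reduce_meta((4, 5, 6), 1, 3) == (4, 5, 6)
```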

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106429
Approved by: https://github.com/zou3519
2023-08-10 18:14:00 +00:00
Catherine Lee
a14d99bb6c Close non existent disable issues complete rollout (#106923)
Follow-up to https://github.com/pytorch/pytorch/pull/105096.
It seems fine; anecdotally, I have seen some issues closed and they haven't been reopened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106923
Approved by: https://github.com/huydhn
2023-08-10 16:48:14 +00:00
Jane Xu
c0f80c6696 [forward-fix] Fix multigpu varying tensor optim tests (#106887)
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.

The capturable implementation for foreach simply varies due to a different order of operations when updating params. I had also attempted to compare against fp64 but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb
2023-08-10 16:35:38 +00:00
Howard Huang
149e458846 [BE] RPC is missing RRef docs (#106902)
The current `RRef` class derives from `PyRRef`, which has all the method definitions and documentation, but none of that shows up in the rendered docs:

<img width="891" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/62897766-a660-4846-97bf-182e4aa45079">

Changing to `:inherited-members:` so Sphinx can pick up these methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106902
Approved by: https://github.com/svekars
2023-08-10 16:26:27 +00:00
Edward Z. Yang
89fd1b8717 Upgrade all inductor workflows to CUDA 12.1 / gcc9 (#106876)
gcc7 is too old to build fbgemm

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106876
Approved by: https://github.com/msaroufim, https://github.com/anijain2305
ghstack dependencies: #106900
2023-08-10 15:02:20 +00:00
Kurt Mohler
4d6a891baf Remove set_default_dtype from nn tests (#105775)
Part of #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105775
Approved by: https://github.com/ezyang
2023-08-10 14:56:13 +00:00
XiaobingSuper
22bc08da29 inductor: remove conv_bn folding from pre_grad pass (#106686)
The freezing pass already supports conv+bn folding, so we don't need to do it in the pre_grad pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106686
Approved by: https://github.com/eellison
2023-08-10 12:25:05 +00:00
vfdev-5
35a1913370 [inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition

Related to https://github.com/pytorch/pytorch/issues/104296

Fixes https://github.com/pytorch/pytorch/issues/105565

Perf results:
- speed-up on CUDA with bilinear and nearest modes

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git3ed904e) PR-afgg" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           7.467 (+-0.036)            |             11.905 (+-0.276)            |             13.391 (+-0.051)            |     1.125 (+-0.000)      |           7.343 (+-0.036)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           7.722 (+-0.168)            |             14.371 (+-0.035)            |             15.899 (+-0.038)            |     1.106 (+-0.000)      |           7.870 (+-0.043)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |           7.710 (+-0.051)            |             11.354 (+-0.053)            |             13.376 (+-0.045)            |     1.178 (+-0.000)      |           7.698 (+-0.061)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |           7.870 (+-0.050)            |             13.744 (+-0.237)            |             15.206 (+-0.102)            |     1.106 (+-0.000)      |           7.912 (+-0.039)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           4.738 (+-0.015)            |             4.508 (+-0.005)             |             6.566 (+-0.027)             |     1.456 (+-0.000)      |           4.630 (+-0.022)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           4.391 (+-0.010)            |             4.860 (+-0.390)             |             6.438 (+-0.047)             |     1.325 (+-0.000)      |           4.458 (+-0.010)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |           4.279 (+-0.008)            |             4.127 (+-0.010)             |             6.598 (+-0.709)             |     1.599 (+-0.000)      |           5.064 (+-0.025)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |           4.537 (+-0.010)            |             4.593 (+-0.006)             |             6.365 (+-0.104)             |     1.386 (+-0.000)      |           4.480 (+-0.011)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           26.411 (+-0.066)           |             62.275 (+-0.436)            |             64.486 (+-0.353)            |     1.035 (+-0.000)      |           26.210 (+-0.110)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           26.457 (+-0.096)           |             72.887 (+-0.247)            |             74.207 (+-0.337)            |     1.018 (+-0.000)      |           25.995 (+-0.120)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |           26.457 (+-0.086)           |             64.110 (+-0.233)            |             66.340 (+-0.406)            |     1.035 (+-0.000)      |           26.145 (+-0.085)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |           26.536 (+-0.094)           |             73.742 (+-0.483)            |             71.946 (+-0.460)            |     0.976 (+-0.000)      |           26.457 (+-0.166)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           91.971 (+-0.253)           |             90.570 (+-0.193)            |            137.206 (+-0.214)            |     1.515 (+-0.000)      |           84.280 (+-0.241)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           91.893 (+-0.361)           |             89.866 (+-0.170)            |            136.678 (+-0.471)            |     1.521 (+-0.000)      |           84.573 (+-0.214)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          116.967 (+-0.481)           |            110.468 (+-0.326)            |            223.770 (+-0.334)            |     2.026 (+-0.000)      |          108.098 (+-0.392)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          117.563 (+-0.546)           |            111.438 (+-0.212)            |            223.101 (+-0.350)            |     2.002 (+-0.000)      |          108.225 (+-0.395)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           80.706 (+-0.289)           |             70.525 (+-0.204)            |            143.697 (+-0.311)            |     2.038 (+-0.000)      |           74.485 (+-0.258)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           80.955 (+-0.208)           |             69.986 (+-0.250)            |            143.658 (+-0.244)            |     2.053 (+-0.000)      |           74.163 (+-0.238)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          117.576 (+-0.435)           |             71.179 (+-0.412)            |            178.515 (+-0.539)            |     2.508 (+-0.000)      |          108.394 (+-0.473)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          117.441 (+-0.205)           |             70.313 (+-0.170)            |            178.664 (+-0.555)            |     2.541 (+-0.000)      |          108.098 (+-0.416)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           92.962 (+-0.509)           |            1740.964 (+-0.597)           |            1785.401 (+-0.369)           |     1.026 (+-0.000)      |           92.638 (+-0.539)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           92.928 (+-0.493)           |            1401.146 (+-0.732)           |            1453.229 (+-0.628)           |     1.037 (+-0.000)      |           92.458 (+-0.428)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |          118.152 (+-0.442)           |            1740.644 (+-0.480)           |            1793.475 (+-0.458)           |     1.030 (+-0.000)      |          107.962 (+-0.548)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |          118.182 (+-0.425)           |            1400.621 (+-0.624)           |            1461.796 (+-0.630)           |     1.044 (+-0.000)      |          107.894 (+-0.994)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)
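What an affine_grid_generator decomposition computes can be sketched in stdlib Python for the 2D, `align_corners=True` case (a simplified illustration of the math only; the actual decomposition operates on batched tensors): each output coordinate is `theta @ [x, y, 1]` over a normalized base grid in `[-1, 1]`.

```python
def linspace(lo, hi, n):
    # Evenly spaced points including both endpoints (align_corners=True).
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def affine_grid_2d(theta, H, W):
    # theta: 2x3 affine matrix; output: H x W grid of (x, y) sample coords.
    ys, xs = linspace(-1.0, 1.0, H), linspace(-1.0, 1.0, W)
    grid = []
    for y in ys:
        row = []
        for x in xs:
            # [x', y'] = theta @ [x, y, 1]
            row.append((theta[0][0] * x + theta[0][1] * y + theta[0][2],
                        theta[1][0] * x + theta[1][1] * y + theta[1][2]))
        grid.append(row)
    return grid

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grid = affine_grid_2d(identity, 2, 2)
assert grid[0][0] == (-1.0, -1.0) and grid[1][1] == (1.0, 1.0)
```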

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano
2023-08-10 09:52:48 +00:00
shibo19
bb2fcc7659 unify TEST_CUDA (#106685)
As titled: unify `TEST_CUDA`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106685
Approved by: https://github.com/zou3519
2023-08-10 09:01:36 +00:00