Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56307
This should fix https://github.com/pytorch/pytorch/issues/56273. I tested these changes locally by making them directly on top of https://github.com/pytorch/pytorch/pull/56151, and running the xla tests (`xla/test/cpp/build/test_ptxla`).
**Current state:** For ops that have been ported to structured, if external backends like XLA have implemented the `out` op but not the `functional` version, they will call into our code-generated `CompositeExplicitAutograd` kernel, which calls the structured operator's `meta()` function and then redispatches to the external backend's `out` function (a rough sketch of that pattern follows below).
If XLA has registered its own kernel for the `functional` variant of the op, it will override our codegen'd composite kernel. XLA has logic to code-generate "CPU fallback" kernels for "required" ops, and it derives that information from `RegistrationDeclarations.yaml`. That info was technically incorrect up until this PR, since we were code-generating `inplace/functional` composite kernels for structured ops but not updating `RegistrationDeclarations.yaml` with that information.
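As a rough illustration, here is a minimal Python sketch of the meta-then-redispatch pattern. The real kernel is C++ emitted by the codegen; `composite_functional_add` is an illustrative name, not the generated one:
```python
import torch

def composite_functional_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # "meta" step: compute the output's shape/dtype without running any kernel
    out = torch.empty(
        torch.broadcast_shapes(x.shape, y.shape),
        dtype=torch.result_type(x, y),
        device=x.device,
    )
    # "redispatch" step: call the out= variant, which the external
    # backend (e.g. XLA) has registered
    return torch.add(x, y, out=out)
```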
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D27883950
Pulled By: bdhirsh
fbshipit-source-id: fe896b0d2bbd4369490dcdf7a87f227fd3d8b8b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56306
It turns out that TensorIteratorBase `meta()` calls don't work with XLA tensors, since the logic that builds up the `TensorIteratorBase` object also tries to grab/store the underlying tensors' data pointers. This doesn't work for XLA because XLA tensors don't have storage.
I think it's fine to just skip this bit of logic for tensors that don't have storage, since the data_ptr information isn't important to the meta call, and TensorIterator isn't actually used in the implementation of non-native kernels, i.e. XLA.
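For context, here is a runnable sketch of the storage relationship that XLA tensors lack (torch_xla isn't needed for the dense half):
```python
import torch

t = torch.ones(2, 3)
t.data_ptr()          # dense CPU tensor: a real address into its Storage
t.storage().size()    # 6 elements of backing storage
# An XLA tensor has no Storage behind it, so the pre-fix TensorIterator
# meta() path, which grabbed data_ptr() for every operand, failed there.
```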
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D27883949
Pulled By: bdhirsh
fbshipit-source-id: 7db4358b94b23c504a383f9673dc509c4020a708
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55145
Repeating the discussion from https://github.com/pytorch/pytorch/pull/54784#issuecomment-811792089
The error messages for mismatched values are directly adapted from the old `_compare_tensors_internal`:
50cb75edce/torch/testing/__init__.py (L104-L111)
A sample error message right now looks like this
```
With rtol=1.3e-06 and atol=1e-05, found 1 different element(s) out of 12 (8.3%). The greatest difference of 4.0 (5.0 vs. 9.0) occurred at index (2, 3)
```
Using the same data with `numpy.testing.assert_allclose` gives the following output:
```
Not equal to tolerance rtol=1.3e-06, atol=1e-05
Mismatched elements: 1 / 12 (8.33%)
Max absolute difference: 4.
Max relative difference: 0.44444445
x: array([[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.]], dtype=float32)
y: array([[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 9.]], dtype=float32)
```
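For reference, that numpy message can be reproduced with the data from the example above:
```python
import numpy as np

x = np.full((3, 4), 5.0, dtype=np.float32)
y = x.copy()
y[2, 3] = 9.0  # one mismatched element out of 12
np.testing.assert_allclose(x, y, rtol=1.3e-6, atol=1e-5)
```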
Pros:
- The info is presented in a list instead of a sentence. IMO this makes it more readable
- The maximum relative difference is reported, which is beneficial in case a comparison fails due to the `rtol`
Cons:
- The values of the inputs are reported (this can be disabled by passing `verbose=False`, but let's face it: most users will use the default setting). In case the inputs are large, the output gets truncated with `...`. Not only is it hard to visually find the mismatching values, they could also live within the truncated part, making the output completely useless.
- Even when you can visually find the offending values, it is hard to map them back to their indices in the inputs.
This implements a mix of both to get a short but expressive message:
```
Tensors are not close according to rtol=1.3e-6 and atol=1e-05:
Mismatched elements: 1 / 12 (8.3%)
Max. rel. diff.: 4.44e-1 at (2, 3)
Max. abs. diff.: 4.0 at (2, 3)
```
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D27877157
Pulled By: mruberry
fbshipit-source-id: 6898a995f116f127e3ae8ed0bcb1ada63eadc45a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324
Inlining is great if LLVM's CSE kicks in; but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.
So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```
Into:
```
for (...) {
  output_1[] = some_complicated_expr...
  output_2[] = some_complicated_expr...  // LLVM CSE should take care of this
}
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27841194
Pulled By: bertmaher
fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.
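As a sketch of why pinning a StorageImpl is fragile, the storage behind a tensor can be swapped out at any time (illustrative Python, not the C++ MemoryPlanner code):
```python
import torch

t = torch.empty(4)
old_ptr = t.storage().data_ptr()
t.set_(torch.empty(8).storage(), 0, (8,))  # rebinds t to a brand-new StorageImpl
assert t.storage().data_ptr() != old_ptr   # same TensorImpl, different storage
```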
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56167
VC++ does not recognize `or` as a valid operator (MSVC does not support the alternative operator tokens by default). This breaks the build under the `Debug` configuration.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D27866143
Pulled By: SplitInfinity
fbshipit-source-id: 490cee57b9762170ce02a6f73130772a3542e76d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56166
Support tensordot in the opset 12 symbolic functions, and add tests accordingly.
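A hedged sketch of the kind of export this enables (module and file name are illustrative):
```python
import torch

class TensorDot(torch.nn.Module):
    def forward(self, x, y):
        # contracts the last two dims of x with the first two of y
        return torch.tensordot(x, y, dims=2)

torch.onnx.export(
    TensorDot(),
    (torch.randn(3, 4, 5), torch.randn(4, 5, 6)),
    "tensordot.onnx",
    opset_version=12,
)
```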
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D27866140
Pulled By: SplitInfinity
fbshipit-source-id: 68e218cfbd630900fb92871fc7c0de3e7e8c8c3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56165
Add implementation for cases where:
- interleaving happens along a dim consisting of dynamic axes (a hedged sketch follows)
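A hedged sketch, assuming the op in question is `torch.repeat_interleave` (the summary doesn't name it) and that the interleaved dim is marked dynamic; opset/version details may vary:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.repeat_interleave(x, 2, dim=0)

torch.onnx.export(
    M(), (torch.randn(3, 4),), "repeat_interleave.onnx",
    opset_version=13,
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch"}},  # dim 0 is dynamic
)
```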
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D27866137
Pulled By: SplitInfinity
fbshipit-source-id: 7fef1b2c614f2e24a677b7ca0886bb37bd0ab479
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56163
* [ONNX] Improve index_put symbolic to handle singular Bool updates (#53690)
Adds support for cases where the update to the index_put node is a single Bool value, such as the case shown below
```
mask[indices] = True
```
Fixes #53507
* [ONNX] Support primitive type input/outputs and attributes (#53550)
Support primitive type attributes. Needed for Silero model.
* [ONNX] Fix if output shape mismatch error & Fix graph input directly used as output (#53219)
* Add support for hann_window operator.
* [ONNX] Replace decomposeLinear pre process pass with a symbolic (#53077)
* Add a test case for dtype is None.
* Resolve flake8 issue.
* Remove one unused test case.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D27866145
Pulled By: SplitInfinity
fbshipit-source-id: e0b43df9ecd1a95cd7ac297213aba453bbaf2913
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Negin Raoof <neginmr@utexas.edu>
Co-authored-by: Bowen Bao <bowbao@microsoft.com>
Co-authored-by: Ksenija Stanojevic <KsenijaS@users.noreply.github.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56539
It seems like a potential source of non-determinism when generating YAML files during the build stems from the fact that when we write out Python lists, they get written out in list order. This isn't a problem per se, but if you look at how these lists are generated, you'll see that they come from sets, which are inherently [not order preserving](https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set) in Python.
I can't guarantee that this removes all non-determinism, but it removes all the non-determinism that I know of so far. The surface area of codegen isn't sprawling, and the YAML file is generated by converting the object with `toDict()` and passing it to the YAML serializer, so this should cover it (I think). Dictionaries are serialized in key order by pyyaml, so that's not a problem.
This could be related to the elevated Android build times being seen [here](https://fb.workplace.com/groups/pytorch.edge.users/permalink/841622146708080/).
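A minimal sketch of the idea, with hypothetical field names:
```python
import yaml  # pyyaml, the serializer mentioned above

dispatch_keys = {"CPU", "CUDA", "XLA"}     # sets iterate in arbitrary order
doc = {"dispatch": sorted(dispatch_keys)}  # sorting pins the list order
print(yaml.dump(doc))                      # deterministic output every build
```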
ghstack-source-id: 126987721
Test Plan: Build + Sandcastle.
Reviewed By: JacobSzwejbka
Differential Revision: D27893058
fbshipit-source-id: 6d7bcb09f34c05b71fbb4a0673bac1c4c33f23d7
Summary:
This PR includes:
* Update to the loop-carried dependence check API to correctly ignore loop-independent dependences and handle all kinds of loop-carried dependences, like RAW, WAR and WAW (see the sketch after this list).
* Fix for the overlap API to look only for conflicting buffer accesses where at least one of them is a Store.
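The sketch referenced above, in Python for readability (the real analysis runs on NNC's IR, not Python):
```python
n = 8
a = [0] * n
b = list(range(n))

for i in range(1, n):
    a[i] = a[i - 1] + 1  # RAW (flow): iteration i reads what iteration i-1 wrote

for i in range(n - 1):
    b[i] = b[i + 1] * 2  # WAR (anti): iteration i reads what iteration i+1 overwrites

for i in range(n):
    a[0] = i             # WAW (output): every iteration writes the same location

for i in range(n):
    b[i] = b[i] + 1      # loop-independent: the read and write stay in iteration i
```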
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56354
Reviewed By: bertmaher
Differential Revision: D27856202
Pulled By: navahgar
fbshipit-source-id: 206e4ec771fe0f7f2ccf4b11b29e35df7b9b18bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56491
Move the prepack convolution to the op file to get rid of the selective compilation.
ghstack-source-id: 126960054
Test Plan: CI
Reviewed By: SS-JIA
Differential Revision: D27719539
fbshipit-source-id: 75fb3849858a31a915828a0f5f6f3d4066ff4c9b
Summary:
This hides the warnings from #35026 until we can fix them for real by migrating to custom classes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56543
Pulled By: driazati
Reviewed By: rohan-varma
Differential Revision: D27895085
fbshipit-source-id: a325a5d8cefb20a5033c1a059e49c03c08514f18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56308
But only for float tensors. Even on CUDA, int tensors just have weird
behavior with pow, and I bet FP is so much more common that it's just not worth
trying to fuse ints here.
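One concrete example of that weirdness in eager mode:
```python
import torch

torch.tensor([2]).pow(-1)
# RuntimeError: Integers to negative integer powers are not allowed.
```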
ghstack-source-id: 126769637
Test Plan: `pytest test_jit_fuser_te.py -k test_binary_pow`
Reviewed By: navahgar
Differential Revision: D27834694
fbshipit-source-id: 7274d72cf02ab95d63574b6c17995b8f34560810
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56004
Added reference pattern support for GELU, softmax, and bmm for int dtypes. For GELU and softmax, this consisted of adding reference patterns to the default node handler for int dtypes. Note that the GELU and softmax patterns are not registered by default, since they do not have a proper quantized kernel, which means they would either add unnecessary dequant and quant ops to the network, or they would simply error. This can be circumvented with custom qconfig usage, as in test_gelu_reference (see the sketch after this paragraph).
bmm was added within binary ops, along with some significant changes to how that code is structured. Theoretically the reference pattern used for bmm could be applied to other dtypes. This was not enabled because of issues relating to line 1323 in quantize.py. In essence, the prepare step does not know whether an op will use a reference pattern or not, so for ops that are supported with one dtype in reference form and another dtype normally, this has the potential to cause issues. This is difficult to get around without the is_reference flag being available in the prepare step, or the discussed changes around separating
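A heavily hedged sketch of the custom-qconfig route: the API shapes below are from the FX quantization workflow of this era, and the exact qconfig used in test_gelu_reference, as well as the availability of the `is_reference` flag, are assumptions:
```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

qconfig_dict = {"": get_default_qconfig("fbgemm")}  # a custom qconfig in the test
prepared = prepare_fx(M().eval(), qconfig_dict)
prepared(torch.randn(2, 4))                  # calibration pass
quantized = convert_fx(prepared, is_reference=True)  # flag name per the summary
```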
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_gelu_reference
python test/test_quantization.py TestQuantizeFxOps.test_gelu_normal
python test/test_quantization.py TestQuantizeFxOps.test_softmax_reference
python test/test_quantization.py TestQuantizeFxOps.test_softmax_normal
python test/test_quantization.py TestQuantizeFxOps.test_silu_reference
python test/test_quantization.py TestQuantizeFxOps.test_bmm_int_reference
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFuseFx
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxModels
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D27818340
fbshipit-source-id: de65be0797035463cd2d1b0e4677d1a87f69143c
Summary:
This pulls out shell scripts from an action and runs them locally as a first pass at https://github.com/pytorch/pytorch/issues/55847. A helper script extracts specific steps in some order and runs them:
```bash
$ time -p make lint -j 5 # run lint with 5 CPUs
python scripts/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'flake8-py3' \
--step 'Run flake8'
python scripts/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'mypy' \
--step 'Run mypy'
python scripts/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'quick-checks' \
--step 'Ensure no trailing spaces' \
--step 'Ensure no tabs' \
--step 'Ensure no non-breaking spaces' \
--step 'Ensure canonical include' \
--step 'Ensure no unqualified noqa' \
--step 'Ensure no direct cub include' \
--step 'Ensure correct trailing newlines'
python scripts/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'cmakelint' \
--step 'Run cmakelint'
quick-checks: Ensure no direct cub include
quick-checks: Ensure canonical include
quick-checks: Ensure no unqualified noqa
quick-checks: Ensure no non-breaking spaces
quick-checks: Ensure no tabs
quick-checks: Ensure correct trailing newlines
cmakelint: Run cmakelint
quick-checks: Ensure no trailing spaces
mypy: Run mypy
Success: no issues found in 1316 source files
Success: no issues found in 56 source files
flake8-py3: Run flake8
./test.py:1:1: F401 'torch' imported but unused
real 13.89
user 199.63
sys 6.08
```
Mypy/flake8 are by far the slowest, but that's mostly just because they're wasting a bunch of work linting the entire repo.
In followup, we could/should:
* Improve ergonomics (i.e. no output unless there are errors)
* Speed up lint by only linting files changed between origin and HEAD
* Add clang-tidy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56439
Reviewed By: samestep
Differential Revision: D27888027
Pulled By: driazati
fbshipit-source-id: d6f2a59a45e9d725566688bdac8e909210175996
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56346
Now that TensorPipe's API has `targetDevice`, use that instead of
manually writing the CUDA device index in `metadata`.
Test Plan: CI
Reviewed By: lw
Differential Revision: D27703235
fbshipit-source-id: c5b620e3b3ce619367412efdbe9fa3778f6b8869
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56425
As SPMD mode is gone, `_specify_ddp_gpu_num` becomes useless. It only checks whether the module is a GPU module, which is already checked by the callers of this function (in fairscale and some other codebases).
Additionally, remove the `enable_pytorch_sync_bn` wrapper, which only calls this function and does nothing else.
ghstack-source-id: 126885376
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27866440
fbshipit-source-id: d2fd5cf43eda25c0a2bd35f647848ec0dbd6ad0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56412
We are investigating some flaky profiling tests such as https://github.com/pytorch/pytorch/issues/56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if such a test is disabled, we lose signal around the send/recv collectives tests.
To mitigate this, separate the tests into ones that only test send/recv and ones that test it with profiling. This way, flakiness should not result in the send/recv-only tests being disabled.
ghstack-source-id: 126920867
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D27864845
fbshipit-source-id: 01f04a884482ec7741323218a7f8f4a8451eb4ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56434
If we hit multiple TORCH_WARN calls from different sources when running the
statement, it makes more sense to me to check that the regex matches any
one of the warning messages instead of all of them; a small illustration follows.
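A pure-Python illustration of the intended semantics (not the actual test-framework code):
```python
import re
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("deprecated API, use foo instead")  # matches the pattern
    warnings.warn("some unrelated warning")           # does not match

# Desired semantics: pass if the regex matches ANY caught warning...
assert any(re.search("use foo", str(w.message)) for w in caught)
# ...rather than requiring it to match ALL of them.
assert not all(re.search("use foo", str(w.message)) for w in caught)
```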
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D27871946
Pulled By: ailzhang
fbshipit-source-id: 5940a8e43e4cc91aef213ef01e48d506fd9a1132
Summary:
The lint was originally added in https://github.com/pytorch/pytorch/issues/54974, but at the time I didn't realize that these other Markdown files also each have a table of contents:
- `GLOSSARY.md`
- `torch/csrc/jit/OVERVIEW.md`
- `torch/csrc/jit/docs/serialization.md`
- `torch/fx/OVERVIEW.md`
This PR adds those files to the lint, and also changes the rule from using a fixed list of filenames to a `git grep` command that finds all Markdown files containing this magic comment:
```md
<!-- toc -->
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56487
Test Plan: The "Lint / toc" job in GitHub Actions.
Reviewed By: janeyx99
Differential Revision: D27884885
Pulled By: samestep
fbshipit-source-id: 5462437502b17fba93abf5098e21754bf566a4fe