Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68037
Right now mobile::Code doesn't outlive its enclosing Function, and all accesses to Code happen inside the interpreter loop, which doesn't outlive the module, so we don't need to use std::shared_ptr here. This should also save 1-2 KB of binary size, because shared_ptr seems to bloat on arm64 Android.
ghstack-source-id: 145818696
Test Plan: eyes.
Reviewed By: qihqi, tugsbayasgalan
Differential Revision: D32264616
fbshipit-source-id: d83f538d6604cf75fd7728a25127b4849ce7ab2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68858
When executing with ir_eval, check for out-of-bounds indices.
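To illustrate the kind of check described, here is a minimal Python sketch of a bounds-checked load from a flat buffer. It is not the ir_eval implementation (which is C++); `flat_index`, `checked_load`, the row-major layout, and the exception types are assumptions made for this example only.

```python
def flat_index(indices, dims):
    """Row-major flat offset; raises if any index is out of bounds."""
    if len(indices) != len(dims):
        raise ValueError("rank mismatch between indices and dims")
    offset = 0
    for idx, dim in zip(indices, dims):
        # Check each dimension separately: an out-of-range index can still
        # map to an in-range flat offset and silently read the wrong element.
        if not 0 <= idx < dim:
            raise IndexError(f"index {idx} out of bounds for dimension of size {dim}")
        offset = offset * dim + idx
    return offset


def checked_load(buf, indices, dims):
    """Load one element from a flat buffer with per-dimension bounds checks."""
    return buf[flat_index(indices, dims)]
```

Without the per-dimension check, index `(0, 3)` on a `2 x 3` buffer would resolve to flat offset 3, which is inside the buffer but belongs to a different row.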
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D32657881
Pulled By: davidberard98
fbshipit-source-id: 62dd0f85bb182b34e9c9f795ff761081290f6922
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR https://github.com/pytorch/pytorch/issues/68804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69923
Original commit changeset: fbaf2cc06ad4
Original Phabricator Diff: D32606547 (e61fc1c03b)
This is the same thing as the original diff but just using a normal std::mutex instead of std::shared_timed_mutex which is not available on OSX 10.11. The performance difference should be negligible and easy to change down the line if it does become a bottleneck.
Old failing build: https://github.com/pytorch/pytorch/runs/4495465412?check_suite_focus=true
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68783
Test Plan:
buck test //caffe2/test/cpp/monitor:monitor
will add ciflow tags to ensure mac builds are fine
Reviewed By: aivanou
Differential Revision: D33102715
fbshipit-source-id: 3816ff01c578d8e844d303d881a63cf5c3817bdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69421
I've hit a lot of build issues in D32671972, and I've come to realize that many of them boil down to header hygiene. `function.h` includes `profiler.h` *solely* to transitively include `record_function.h`, which winds up leaking the profiler symbols. Moreover, several files rely on transitive includes to get access to `getTime`. As long as I have to touch all the places that use `getTime`, I may as well also move them to the new namespace.
Test Plan: Unit tests and CI.
Reviewed By: aaronenyeshi, albanD
Differential Revision: D32865907
fbshipit-source-id: f87d6fd5afb784dca2146436e72c69e34623020e
Summary:
This adds a C++ event handler corresponding to the Python one mentioned in the RFC.
This changes the counters a bit to all be push driven instead of being polled. The two window types are "fixed count" and "interval". One is based off the number of logged events and the other is based off of time windows. There's currently no active ticker for interval so it needs a regular stream of events to ensure events are produced. A follow up diff can add support for things like HHWheel / simple ticker.
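The "interval" behaviour without an active ticker can be sketched as follows. This is a Python illustration, not the C++ handler added here; `IntervalSum`, its `handler` callback, and the choice to count the triggering value toward the new window are all assumptions. Time is passed in explicitly so the example is deterministic.

```python
class IntervalSum:
    """Push-driven sum over time windows; emits only when a new value arrives."""

    def __init__(self, interval, handler):
        self.interval = interval
        self.handler = handler      # called as handler(window_start, total)
        self.window_start = None
        self.total = 0

    def add(self, value, now):
        if self.window_start is None:
            self.window_start = now
        # There is no background ticker: an expired window is only noticed
        # (and emitted) when the next value is pushed.
        if now - self.window_start >= self.interval:
            self.handler(self.window_start, self.total)
            self.window_start = now
            self.total = 0
        self.total += value


events = []
stat = IntervalSum(interval=10, handler=lambda start, total: events.append((start, total)))
stat.add(1, now=0)
stat.add(2, now=5)
stat.add(3, now=12)  # window [0, 10) has expired, so it is emitted here
```

If no value were pushed after `now=5`, the first window would never be emitted, which is why a regular stream of events (or a ticker) is needed.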
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68783
Test Plan: buck test //caffe2/test/cpp/monitor:monitor
Reviewed By: kiukchung
Differential Revision: D32606547
fbshipit-source-id: a00d0364092d7d8a98e0b18e503c0ca8ede2bead
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095
This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585
Reviewed By: mrshenli
Differential Revision: D32958594
Pulled By: albanD
fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69394
Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;var++)
```
to the format
```
for(const auto var: irange(x_max))
```
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D32837991
fbshipit-source-id: fc7c4f76d2f32a17a0faf329294b3fe7cb81df32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67730
This PR implements the register function for upgraders so it can be used at the loading stage.
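The idea of registering upgraders and consulting them at load time can be sketched in Python as below. This is not the API added by this PR: `register_upgrader`, `resolve`, the registry layout, and the toy legacy `div` semantics (integer inputs truncate toward zero, otherwise true division) are assumptions for illustration.

```python
import operator

# Hypothetical registry: op name -> list of (max_old_version, upgrader).
_UPGRADERS = {}


def register_upgrader(op_name, max_version, fn):
    """Register fn as the replacement for op_name in models with version <= max_version."""
    entries = _UPGRADERS.setdefault(op_name, [])
    entries.append((max_version, fn))
    entries.sort(key=lambda entry: entry[0])


def resolve(op_name, model_version, current_impl):
    """Pick the implementation to use while loading a model of the given version."""
    for max_version, fn in _UPGRADERS.get(op_name, []):
        if model_version <= max_version:
            return fn
    return current_impl


def div_0_3(a, b):
    """Toy stand-in for legacy division, valid for versions 0 to 3."""
    if isinstance(a, float) or isinstance(b, float):
        return a / b
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient


register_upgrader("div", 3, div_0_3)
old_div = resolve("div", 3, operator.truediv)   # old model: gets the upgrader
new_div = resolve("div", 4, operator.truediv)   # current model: unchanged
```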
ghstack-source-id: 145170986
Test Plan:
```
buck test //caffe2/test/cpp/jit:jit
```
Reviewed By: iseeyuan
Differential Revision: D32092518
fbshipit-source-id: 779b51eb12b8cb162a93a55c1e66fe0becc4cb36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69277
LazyView is the main class for tracking aliases caused by view
ops. The corresponding IR classes for view ops are hand-written for now, and
we can switch to code-generating them in the future. Certain view ops
have a reverse IR class that performs in-place updates in the backward
direction on a chain of alias ops.
As part of future work, we will simplify the logic for LazyView once
the functionalization pass in core is ready to use.
Test Plan: Imported from OSS
Reviewed By: wconstab
Differential Revision: D32820014
Pulled By: desertfire
fbshipit-source-id: d9eb526cb23885f667e4815dc9dd291a7b7e4256
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705359
fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68678
Test Plan: I'll update the unit test before landing
Reviewed By: cccclai
Differential Revision: D32573603
fbshipit-source-id: 19271bcbb68b61d24d6943e61a943f4f75fddb5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67726
1. Check in one model with aten:div_tensor old op with unittest in both cpp and python. The following two lines are commented out and expected to work after using upgrader.
```
_helper(mobile_module_v2, div_tensor_0_3)
_helper(current_mobile_module, torch.div)
```
2. Update the commented code accordingly.
Currently there are 6 upgraders. The following old models with operators are added to cover these 6 upgraders:
```
// Tensor x Tensor
test_versioned_div_tensor_v3
// Tensor x Scalar
test_versioned_div_scalar_float_v3
test_versioned_div_scalar_reciprocal_int_v3
test_versioned_div_scalar_inplace_float_v3
// Scalar x Scalar
test_versioned_div_scalar_scalar_v3
// Tensor x Tensor with out kwarg
test_versioned_div_tensor_out_v3
// Tensor x Tensor inplace
test_versioned_div_tensor_inplace_v3
// Tensor x Scalar inplace
test_versioned_div_scalar_inplace_int_v3
```
Note:
In this PR, each model includes the following tests:
1. Model (with old op) load/run test will be in both cpp and python
2. Model (with old op) + upgrader test will be in python
Other tests considered adding:
1. per upgrader bytecode test
2. app level integration test
ghstack-source-id: 144422418
Test Plan: CI and the added unittest
Reviewed By: iseeyuan
Differential Revision: D32069653
fbshipit-source-id: 96d9567088a1f709bc7795f78beed7a308e71ca9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68410
First step toward not heap-allocating a string in RecordFunction::before() every time
ghstack-source-id: 144287654
Test Plan: CI
Reviewed By: chaekit
Differential Revision: D32453847
fbshipit-source-id: 080d95095fb568287b65fcc41a4ca6929b5f9a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68314
Add a convenience to lazy::Shape for counting the number of elements (by multiplying out the dimensions). This is a method on Tensor, and in switching other lazy tensor shape utils to use aten shape inference, we need numel counts.
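For reference, the element count is just the product of the dimension sizes. A minimal Python sketch (the free function `numel` here is illustrative, not the `lazy::Shape` method itself):

```python
import math


def numel(sizes):
    """Number of elements for a shape given as a sequence of dimension sizes."""
    # The empty product is 1, so a zero-dimensional (scalar) shape has one element.
    return math.prod(sizes)
```

Any dimension of size 0 makes the whole count 0, and a shape with no dimensions counts as a single element.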
Test Plan: add unit tests
Reviewed By: alanwaketan
Differential Revision: D32409138
fbshipit-source-id: 3ae725300f8826d38e45412f46501d5e5f776fb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68074
This is the first step of many PRs towards implementing the `torch.monitor` RFC https://github.com/pytorch/rfcs/pull/30
This defines the aggregation types, the `Stat` class and provides some simple collection of the stats.
This doesn't match the RFC exactly as it incorporates some of the comments on the RFC as well as a few changes for performance.
Changes:
* added window_size to the stats. If specified it will always compute the stat using the `window_size` number of values. If there aren't enough values within that window it reports the previous stats.
* This doesn't include the push metrics yet (will be coming).
After more discussion it looks like the best way to handle this is to support a hybrid where the metric can set how frequently it'll be logged. For fixed window_size metrics it'll be logged each time it hits the window size. This will allow performant counters as well as lower frequency push counters (window_size=1).
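The fixed `window_size` behaviour can be sketched in Python as follows. This is an illustration under assumptions, not the C++ `Stat` class: `FixedCountStat`, the `on_window` callback, and the chosen aggregations are hypothetical names for the example.

```python
import threading


class FixedCountStat:
    """Aggregates values in windows of exactly window_size entries."""

    def __init__(self, window_size, on_window=None):
        self.window_size = window_size
        self.on_window = on_window   # called once per completed window
        self._lock = threading.Lock()
        self._values = []
        self._last = {}              # stats of the last completed window

    def add(self, value):
        with self._lock:
            self._values.append(value)
            if len(self._values) < self.window_size:
                return
            vals, self._values = self._values, []
            self._last = {
                "sum": sum(vals),
                "count": len(vals),
                "min": min(vals),
                "max": max(vals),
            }
            snapshot = dict(self._last)
        # Invoke the callback outside the lock so a slow logger cannot block writers.
        if self.on_window is not None:
            self.on_window(snapshot)

    def get(self):
        """Returns the previous window's stats until the current window fills."""
        with self._lock:
            return dict(self._last)


logged = []
stat = FixedCountStat(window_size=3, on_window=logged.append)
stat.add(1)
stat.add(5)
partial = stat.get()   # window not full yet: nothing reported
stat.add(3)
full = stat.get()      # first completed window
stat.add(10)
stale = stat.get()     # still the previous window's stats
```

With `window_size=1` every call to `add` completes a window, which corresponds to the low-frequency push counter case mentioned above.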
Performance considerations:
* Updating the stats acquires a lock on that Stat object. This should be performant unless there are many threads writing to the same stat. The single-threaded case will typically use a futex, so it should be quite fast.
* Adding/removing/fetching all stats takes a global lock on the stat list. This shouldn't be an issue since these events happen infrequently.
* Fetching stats accesses one stat at a time instead of holding a global lock. This means the exported values are linearizable but not serializable across multiple stats, but I don't expect this to be an issue.
Next steps:
1. Add StatCollector interface for push style metrics
2. Add pybind interfaces to expose to Python
3. Add default metric providers
4. Integrate into Kineto trace view
Test Plan:
buck test //caffe2/test/cpp/monitor:monitor
CI
Reviewed By: kiukchung
Differential Revision: D32266032
fbshipit-source-id: dab8747b4712f5dba5644387817a3a0fda18b66a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310
Enable desync root cause analysis by recording the last footprint of collective calls. On timeout, we parse the store trace and figure out the root cause of the desync issue. This feature is built on top of async error handling.
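A rough Python sketch of the analysis step, assuming each rank records the sequence number, name, and size of the last collective it started. The record layout, the `find_desync` helper, and the report strings are hypothetical; the real implementation lives in C++ and reads from the store.

```python
def find_desync(last_calls):
    """Diagnose a desync from per-rank footprints.

    last_calls maps rank -> (seq, op_name, numel) for the last collective
    that rank started before the timeout.
    """
    if not last_calls:
        return "no desync detected"
    max_seq = max(seq for seq, _, _ in last_calls.values())
    # Ranks that never issued the newest collective are the likely culprits.
    lagging = sorted(rank for rank, (seq, _, _) in last_calls.items() if seq < max_seq)
    if lagging:
        return f"ranks {lagging} never reached collective #{max_seq}"
    # Every rank reached the same call: check that they agree on what it was.
    signatures = {(op, numel) for _, op, numel in last_calls.values()}
    if len(signatures) > 1:
        return f"mismatched collectives at #{max_seq}: {sorted(signatures)}"
    return "no desync detected"
```

This covers three shapes of failure: a rank that falls behind, ranks issuing different collectives at the same step, and ranks issuing the same collective with different sizes.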
Test Plan:
Standalone test
* Typical desync - P467288969
* Mismatched collectives - P467288916
* Mismatched broadcast size - P467288873
DDP benchmark
* DDP benchmark desync - P467433483, P467520195
No perf regression:
* w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
* w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs
Reviewed By: mingzhe09088
Differential Revision: D32348647
fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a
Summary:
1. Convert Function -> mobile::Function
2. Serialize mobile::Function
This also opens the opportunity to create a mobile::Module without saving/reloading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66494
Reviewed By: zhxchen17
Differential Revision: D32293022
Pulled By: qihqi
fbshipit-source-id: 29b43d47ff86071d5e2f9d6ca4dba4445711ce3d
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; we can now support dimension collapsing with non-coherent input tensors with different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we currently limit memory format to Contiguous and Channels last only;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
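As background for the memory format item above, the two supported formats differ only in how the logical N, C, H, W dimensions are ordered in memory. The Python sketch below computes the strides for each and recovers the physical dimension order; the helper names are illustrative and not part of nvfuser.

```python
def contiguous_strides(sizes):
    """Strides (in elements) for a densely packed row-major tensor."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return tuple(strides)


def channels_last_strides(n, c, h, w):
    """Strides for logical NCHW sizes stored in N, H, W, C order."""
    return (h * w * c, 1, w * c, c)


def physical_order(strides):
    """Logical dimension indices sorted from outermost to innermost in memory."""
    return tuple(sorted(range(len(strides)), key=lambda d: -strides[d]))
```

For sizes `(2, 3, 4, 5)`, a contiguous tensor has physical order `(0, 1, 2, 3)` while a channels-last tensor has `(0, 2, 3, 1)`, so inputs with different formats need their dimensions matched by this permutation before they can be collapsed together.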
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67000
See the [related issue](https://github.com/pytorch/pytorch/issues/66654) for context.
This new JIT optimization transforms patterns like this:
```
%inputs.1 : Tensor[] = prim::ListConstruct(%a, %b, %c)
%concat.1 : Tensor = aten::cat(%inputs.1, %dim)
%inputs.2 : Tensor[] = prim::ListConstruct(%x, %concat.1, %y)
%concat.2 : Tensor = aten::cat(%inputs.2, %dim)
```
into this:
```
%inputs.2 : Tensor[] = prim::ListConstruct(%x, %a, %b, %c, %y)
%concat.2 : Tensor = aten::cat(%inputs.2, %dim)
```
(it can do this for chains of `aten::cat` longer than 2 as well)
A few conditions have to hold:
1. The `dim`s have to match.
2. `inputs.1` and `inputs.2` cannot be mutated
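The rewrite can be sketched on a toy expression tree in Python. This is not the JIT pass itself: the `Cat` node and `flatten_cat` are hypothetical, and the sketch only models the matching-`dim` condition, not the mutation check.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Cat:
    """Toy stand-in for aten::cat over a list of named tensors or nested cats."""
    inputs: List[Union[str, "Cat"]]
    dim: int


def flatten_cat(node):
    """Splice nested Cat inputs into their parent when the dims match."""
    flat = []
    for inp in node.inputs:
        if isinstance(inp, Cat):
            inner = flatten_cat(inp)      # handle chains longer than two
            if inner.dim == node.dim:
                flat.extend(inner.inputs)
                continue
            inp = inner                   # different dim: keep as a nested op
        flat.append(inp)
    return Cat(flat, node.dim)


inner = Cat(["a", "b", "c"], dim=0)
outer = Cat(["x", inner, "y"], dim=0)
result = flatten_cat(outer)
```

Here `result` is a single `Cat` over `x, a, b, c, y`, mirroring the IR example above. A nested `Cat` along a different `dim` is left in place.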
Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOpt`
Reviewed By: d1jang
Differential Revision: D31819491
fbshipit-source-id: 9f1a501d52099eb1a630b5dd906df4c38c3817ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68201
Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.
Because c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases. Hash() should only read the bytes corresponding to the currently active type.
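The failure mode can be reproduced in Python with `ctypes`. The `Payload` union, `ToyScalar`, and the two key functions below are a toy model assumed for illustration, not the layout of c10::Scalar or the actual Hash() code.

```python
import ctypes


class Complex(ctypes.Structure):
    _fields_ = [("re", ctypes.c_double), ("im", ctypes.c_double)]


class Payload(ctypes.Union):
    # Members have different sizes: 8, 8, and 16 bytes.
    _fields_ = [("i", ctypes.c_int64), ("d", ctypes.c_double), ("z", Complex)]


ACTIVE_SIZE = {"i": 8, "d": 8, "z": 16}


class ToyScalar:
    def __init__(self, tag, value, payload=None):
        self.tag = tag
        # Optionally reuse existing storage to model stale bytes in the union.
        self.payload = payload if payload is not None else Payload()
        setattr(self.payload, tag, value)


def naive_key(s):
    """Buggy approach: covers every byte of the union, valid or not."""
    return bytes(s.payload)


def active_key(s):
    """Fixed approach: covers only the tag and the active member's bytes."""
    return (s.tag, bytes(s.payload)[:ACTIVE_SIZE[s.tag]])


clean = ToyScalar("i", 7)
reused = ToyScalar("z", Complex(1.0, 2.0)).payload
dirty = ToyScalar("i", 7, payload=reused)  # same value, stale upper 8 bytes
```

`clean` and `dirty` hold the same integer, yet `naive_key` differs between them because of the leftover imaginary part, while `active_key` is identical.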
Test Plan: Added new unit tests. Verified HashTest.Scalar failed with the original Hash() impl and then fixed.
Reviewed By: alanwaketan
Differential Revision: D32367564
fbshipit-source-id: ac30dd4f6dd0513954986d3d23c0c11ba802c37b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128
Reland of D31762735 (0cbfd466d2).
This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler.
I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls.
Test Plan:
rpc_pickler_test file:
buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx
rpc_pickler stress test:
buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results
Reviewed By: mrshenli
Differential Revision: D32316077
fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66100
A backend should not directly depend on ATen operators. The demo backend is changed accordingly for testing purposes.
Test Plan: Imported from OSS
Reviewed By: pavithranrao
Differential Revision: D31384614
Pulled By: iseeyuan
fbshipit-source-id: c97f0c4aa12feb1d124f1d7a852e9955a7a2ce42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67927
BackendData - represents 'tensor data' in opaque backend storage
LoweringContext - interface for performing backend-specific IR lowering
BackendImplInterface - interface for lazy tensors backends to implement
Reorgs backend-related files into lazy/backend subdir
Also includes a few small fixes that were made on lazy_tensor_staging but need to be back-ported to master.
Test Plan: used by lazy_tensor_staging branch
Reviewed By: desertfire
Differential Revision: D32142032
fbshipit-source-id: 828c717bcd0d511876e64ad209b50f7bfb10cec5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68027
This commit upstreams class BackendDevice to master, which is a
backend-specific representation of the actual hardware, for instance CPU,
GPU, or TPU.
This concept is important for backends like XLA, which need to tell the
actual hardware type from the c10::DeviceType::Lazy virtual device during
both IR construction and lowering.
Test Plan: ./build/bin/test_lazy --gtest_filter=BackendDeviceTest.*
Reviewed By: wconstab
Differential Revision: D32261838
Pulled By: alanwaketan
fbshipit-source-id: 579c3fc5f9da7847c887a383c6047e8ecb9cc5bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67929
1. Write a node-hash based unit test for Cache
2. Replace CHECK with TORCH_CHECK in IrUtil
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D32246134
Pulled By: desertfire
fbshipit-source-id: c464bc300126d47e9ad4af3b3e8484a389757dc0