Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50229
`fastmod -m 'cast(<((at|c10)::)?\w+Type>\(\)\s*)->' 'castRaw${1}->'` Presuming it builds, this is a safe change: the
result of `cast()` wasn't being saved anywhere, so we didn't need
it, so we can use a raw pointer instead of a new `shared_ptr`.
ghstack-source-id: 120769170
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837494
fbshipit-source-id: 46319100dc0dfc78f6d2b45148207f83481f2ada
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863
Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
Reviewed By: iseeyuan
Differential Revision: D25896212
fbshipit-source-id: 6d7e7fd5f3244a88bd44889024d81ad2e678ffa5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50649
**Summary**
Tutorials, documentation and comments are not consistent with the file
extension they use for JIT archives. This commit modifies certain
instances of `*.pth` in `torch.jit.save` calls with `*.pt`.
**Test Plan**
Continuous integration.
**Fixes**
This commit fixes#49660.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D25961628
Pulled By: SplitInfinity
fbshipit-source-id: a40c97954adc7c255569fcec1f389aa78f026d47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50194
**Summary**
`ClassType::repr_str()` prints out only the name of a `ClassType`, which
is not always enough to disambiguate it. In some situations, two
`ClassTypes` are compared and do not match despite having identical
names because they are in separate compilation units. In such cases, the
error message can seem nonsensical (e.g. `expected type T but found type
T`). This commit modifies `ClassType::repr_str()` so that it prints out
the address of the type's compilation unit to make these messages less
puzzling (e.g. `expected type T (0x239023) but found type T (0x230223)`).
**Test Plan**
This commit adds a unit test, `ClassTypeTest.IdenticalTypesDifferentCus`
that reproduces this situation.
**Fixes**
This commit fixes#46212.
Test Plan: Imported from OSS
Reviewed By: tugsbayasgalan
Differential Revision: D25933082
Pulled By: SplitInfinity
fbshipit-source-id: ec71b6728be816edd6a9c2b2d5075ead98d8bc88
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it, so we can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
=======
This PR addresses the following:
* Adds JIT support for CUDA Streams
* Adds JIT support for CUDA Events
* Adds JIT support for CUDA Stream context manager
Testing:
======
python test/test_jit.py -v TestCUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48020
Reviewed By: navahgar
Differential Revision: D25725749
Pulled By: nikithamalgifb
fbshipit-source-id: b0addeb49630f8f0c430ed7badeca43bb9d2535c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49385
Currently, the API to export operator lists accepts a `torch::jit::Module` object, and spits out an operator list. The operator list is practically used only for mobile. This is not ideal because the set of root operators may change by the time the model is subsequently optmized and exported for mobile.
What we need to to instead is glean the list of operators from the mobile model itself (`bytecode.pkl` specifically), and expose that instead.
Also updated the logic in `converter`.
### Before this change:
1. Get operator List from Torch Script Model
2. Convert to bytecode mobile model
### After this change:
1. Convert to bytecode mobile model
2. Use this converted mobile model to get the list of operators for each method on the model
ghstack-source-id: 118796752
Test Plan:
Added a unit test in `test_lite_interpreter.cpp` to ensure that all model referenced operators show up in the exported operator list. Also make `test_lite_interpreter.cpp` runnable from `xplat/caffe2/BUCK` since this is where the production code will be built from.
Verified that the list of operators produced before and after this change for an example model (segmentation) are the same.
{P147863234}
Also verified that the operator lists for BI-Xray model is different (we have been having problems with missing operators for this one): {P154903132}
Reviewed By: iseeyuan
Differential Revision: D24690094
fbshipit-source-id: 0426a6ef90456a811010cfe337c415882ae2deff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863
Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
Reviewed By: raziel, iseeyuan
Differential Revision: D25152559
fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808
Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629
Reviewed By: malfet
Differential Revision: D25563207
fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240
Test Plan: CI
Reviewed By: dhruvbird
Differential Revision: D25135415
fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620
In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
ghstack-source-id: 118568242
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D25132183
fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48607
This change builds on top of
https://github.com/pytorch/pytorch/pull/46865
further exposing the async interface to `torch::jit::Method`.
added unit test for new `run_async`
Test Plan: `buck test caffe2/test/cpp/jit/...`
Reviewed By: dzhulgakov
Differential Revision: D25219726
fbshipit-source-id: 89743c82a0baa1affe0254c1e2dbf873de8e5c76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274
The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491
Test Plan: CI
Reviewed By: dzhulgakov
Differential Revision: D25102077
fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322
Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.
Test Plan: sandcastle
Reviewed By: ZolotukhinM
Differential Revision: D25126202
fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796
`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
ghstack-source-id: 116979577
Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks
Reviewed By: dzhulgakov
Differential Revision: D24902978
fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45556
User defined classes can be used as constants. This is useful when freezing and removing the module from the graph.
Test Plan: waitforsadcastle
Reviewed By: eellison
Differential Revision: D23994974
fbshipit-source-id: 5b4a5c91158aa7f22df39d71f2658afce1d29317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47425
Extra files can be exported in lite interpreter model, but it could not be loaded. This PR is to add the capability to load extra files from lite interpreter model. Because extra_files is a default argument, it should not affect the existing usage of _load_for_mobile. It's a simple assembly or a generic unordered_map. No additional dependency should be introduced and the size overhead should be small (to be tested).
Test Plan: Imported from OSS
Reviewed By: kwanmacher
Differential Revision: D24770266
Pulled By: iseeyuan
fbshipit-source-id: 7e8bd301ce734dbbf36ae56c9decb045aeb801ce
Summary:
By default, TorchScript execution is single threaded and uses the caller's thread pool. For the use case of distributed inference, we hope there is a way to customize the behavior where the interpreter in torch script can be executed in other places. This diff allows an explicit taskLauncher for torchscript interpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865
Test Plan:
unit test is passed.
fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac
Fixes #{issue number}
Reviewed By: houseroad
Differential Revision: D24616102
Pulled By: garroud
fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47253
The function simply goes over all aten nodes in the graph and
concatenates their names, truncating the final name to a given length.
Differential Revision: D24698272
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: d6e50194ca5faf0cb61f25af83247b5e40f202e4
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46373
As noted in https://github.com/pytorch/pytorch/issues/46373, there needs to be a flag passed into the engine that indicates whether it was executed through the backward api or grad api. Tentatively named the flag `accumulate_grad` since functionally, backward api accumulates grad into .grad while grad api captures the grad and returns it.
Moving changes not necessary to the python api (cpp, torchscript) to a new PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46855
Reviewed By: ngimel
Differential Revision: D24649054
Pulled By: soulitzer
fbshipit-source-id: 6925d5a67d583eeb781fc7cfaec807c410e1fc65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46401
Broader context about selective/custom build available at https://fb.quip.com/2oEzAR5MKqbD and https://fb.workplace.com/groups/pytorch.mobile.team/permalink/735794523641956/
Basically, we want to be able to trace full operator names (with overload name). The current observer infra picks up the operator name from the schema, which doesn't seem to include the overload name. To ensure consistency with the existing uses and to accomodate the new use-case, this diff adds a new overload to accept an `OperatorHandle` object, and the code in `before()` eagerly resolves it to an `OperatorName` object (which can be cached in a member variable) as well as a string (view) operator-name which has the same semantics as before.
Why do we pass in an `OperatorHandle` but then resolve it to an `OperatorName`? This might come across as a strange design choice (and it is), but it is grounded in practicality.
It is not reasonable to cache an `OperatorHandle` object but caching an `OperatorName` object is reasonable since it holds all the data itself.
An initial version of this change was trying to test this change in the `xplat` repo, which didn't work. Thanks to ilia-cher for pointing out that the dispatcher observing mechanism is disabled under a compile time flag (macro) for xplat.
ghstack-source-id: 114360747
Test Plan:
`buck test fbcode/caffe2/fb/test:record_function_test` succeeds. Also replicated this test in OSS in the file `test_misc.cpp` where the rest of the `RecordFunction` subsystem is being tested.
Ran benchmark as reqiested by ilia-cher
{P146511280}
Reviewed By: ilia-cher
Differential Revision: D24315241
fbshipit-source-id: 239f3081e6aa2e26c3021a7dd61f328b723b03d9
Summary:
With this PR, users can optionally provide a "doc_string" to describe a class or its method. doc_string for TorchBind classes and methods are stored as `doc_string` properties in `Function` and `ScriptClass`. These `dos_string` properties are then exposed in Python layer via PyBind for doc generation.
Fixes https://github.com/pytorch/pytorch/issues/46047
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46576
Reviewed By: wanchaol
Differential Revision: D24440636
Pulled By: gmagogsfm
fbshipit-source-id: bfa9b270a6c2d8bc769a88fad6be939cc6310412
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with profiling executor;
2. dropped support for legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling record to allow profiling on user specified nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452
Reviewed By: zou3519, anjali411
Differential Revision: D24364642
Pulled By: ngimel
fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929
We were checking `and` when we should have been checking `or`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148804
Pulled By: eellison
fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
Summary:
The torchbind tests didn't work be cause somehow we missed the rename of caffe2_gpu to torch_... (hip for us) in https://github.com/pytorch/pytorch/issues/20774 (merged 2019-06-13, oops) and still tried to link against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45426
Reviewed By: VitalyFedyunin
Differential Revision: D24112439
Pulled By: walterddr
fbshipit-source-id: a66a574e63714728183399c543d2dafbd6c028f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664
Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.
To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
For example, if the following async function is ran on a server over RPC:
```
def slow_add(x, y):
time.sleep(1)
return torch.add(x, y)
rpc.functions.async_execution
def slow_async_add(to, x, y):
return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:
```
------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls Node ID
------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- --------------- rpc_async#slow_async_add(worker1 -> worker2) 0.00% 0.000us 0 1.012s
1.012s 1 1
aten::empty 7.02% 11.519us 7.02% 11.519us 11.519us 1 1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3) 0.00% 0.000us 0 1.006s
1.006s 1 2 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty 7.21% 11.843us 7.21% 11.843us
11.843us 1 2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add 71.94% 118.107us 85.77% 140.802us 140.802us 1 3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty 13.82% 22.695us 13.82% 22.695us
22.695us 1 3 ------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- ---------------
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470
Test Plan:
```
rvarm1@devbig978:fbcode (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```
Reviewed By: mrshenli
Differential Revision: D23638387
fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264
Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.
This PR completes the process of porting and removes unused files/macros.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23901392
Pulled By: suo
fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
Summary:
A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory form being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144
Moves prim ops from C10 back to JIT.
These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781
Test Plan:
buck test //caffe2/test/cpp/jit:jit
https://pxl.cl/1l22N
buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test
https://pxl.cl/1lBxD
Reviewed By: iseeyuan
Differential Revision: D23697598
fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653
This changes the profiler per a discussion with ilia-cher offline that enables `disableProfiler()` event consolidation logic to be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387 where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.
This is done by introducing 2 flags `cleanupTLSState` and `consolidate` which controls whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatiblity is ensured since both options are true by default.
Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620
Reviewed By: mrshenli
Differential Revision: D23638499
fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b