Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180
Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377
Test Plan: CIs
Reviewed By: seemethere
Differential Revision: D32358467
fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62185
This file can take 5 minutes on its own to compile, and is the single limiting
factor for the compile time of `libtorch_cpu` on a 32-core Threadripper.
Sharding it into 5 files that take around 1 minute each cuts a full minute off
the overall build time.
This also factors out the `.findSchemaOrThrow(...).typed` step so the code can
be shared between `call` and `redispatch`.
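As a rough sketch of the idea (not the actual codegen logic; names are illustrative), round-robin sharding of the generated wrapper functions into N translation units looks like:

```python
# Hypothetical sketch: distribute generated functions across N shard files
# so they can compile in parallel instead of in one huge translation unit.
def shard(functions, num_shards=5):
    shards = [[] for _ in range(num_shards)]
    for i, fn in enumerate(functions):
        shards[i % num_shards].append(fn)  # round-robin assignment
    return shards

fns = [f"op_{i}" for i in range(12)]
shards = shard(fns, 5)
assert [len(s) for s in shards] == [3, 3, 2, 2, 2]
assert shards[0] == ["op_0", "op_5", "op_10"]
```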
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D29962049
Pulled By: albanD
fbshipit-source-id: be5df05fbea09ada0d825855f1618c25a11abbd8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573
To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;
We are developing option 3. For open source, we used to support only
option 1, and currently we support both 1 and 2.
This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can try if they don't bother to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.
Since the workflow that automatically updates the file has been broken
for a while, it has started to confuse other PyTorch developers: people
are already editing it manually, and it may already be broken for some
models.
We reintroduced static dispatch recently, so we have decided to deprecate
this file now and automatically turn on static dispatch if users run
selective build without providing the static analysis graph.
The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.
Differential Revision: D28941020
Test Plan: Imported from OSS
Reviewed By: dhruvbird
Pulled By: ljk53
fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
Summary:
This PR greatly simplifies `mypy-strict.ini` by strictly typing everything in `.github` and `tools`, rather than picking and choosing only specific files in those two dirs. It also removes `warn_unused_ignores` from `mypy-strict.ini`, for reasons described in https://github.com/pytorch/pytorch/pull/56402#issuecomment-822743795: basically, that setting makes life more difficult depending on what libraries you have installed locally vs in CI (e.g. `ruamel`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59117
Test Plan:
```
flake8
mypy --config mypy-strict.ini
```
Reviewed By: malfet
Differential Revision: D28765386
Pulled By: samestep
fbshipit-source-id: 3e744e301c7a464f8a2a2428fcdbad534e231f2e
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted; replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```
Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.
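A toy sketch of why replay-based capture needs the in-place behavior (`FakeTensor` is a made-up stand-in for illustration, not a torch API):

```python
# A toy model of graph capture: "captured" stands in for a CUDA graph that
# holds a reference to the scale tensor's storage at capture time.
class FakeTensor:
    def __init__(self, v):
        self.v = v
    def mul_(self, f):       # in-place: mutates the same storage
        self.v *= f
        return self
    def mul(self, f):        # out-of-place: allocates new storage
        return FakeTensor(self.v * f)

scale = FakeTensor(2.0)
captured = [scale]           # the "graph" keeps a reference to this storage

# Out-of-place update: the graph still sees the stale scale.
new_scale = scale.mul(0.5)
assert captured[0].v == 2.0 and new_scale.v == 1.0

# In-place update on a long-lived tensor: the graph observes it.
scale.mul_(0.5)
assert captured[0].v == 1.0
```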
I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.
Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562
Reviewed By: zou3519
Differential Revision: D28046159
Pulled By: ngimel
fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50611
Removed the unused old-style code to prevent it from being used.
Added all autograd/gen_pyi sources to mypy-strict.ini config.
Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
.jenkins/pytorch/codegen-test.sh <baseline_output_dir>
.jenkins/pytorch/codegen-test.sh <test_output_dir>
Then run diff to compare the generated files:
diff -Naur <baseline_output_dir> <test_output_dir>
```
Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25929730
Pulled By: ljk53
fbshipit-source-id: 1fc94436fd4a6b9b368ee0736e99bfb3c01d38ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49220
Since all ops are c10-full, we can remove .impl_UNBOXED now.
This also removes the ability of KernelFunction or CppFunction to store unboxedOnly kernels.
ghstack-source-id: 119450489
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D25490225
fbshipit-source-id: 32de9d591e6a842fe18abc82541580647e9cfdad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308
The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.
The only change from the previous commit is that I updated the regex like so:
before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`
I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.
Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.
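A quick check of the two patterns with Python's `re`, on a hypothetical mangled symbol name:

```python
import re

before = re.compile(r'^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$')
after = re.compile(r'^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$')

# A namespace starting with an underscore, e.g. `_test`:
sym = "TORCH_LIBRARY_IMPL_init__test_CPU_0"
assert before.match(sym) is None           # the old pattern misses it
m = after.match(sym)
assert m.group(1) == "_test"               # new pattern captures the namespace
assert m.group(2) == "CPU"

# Plain namespaces still match the same way:
m = after.match("TORCH_LIBRARY_IMPL_init_aten_CPU_2")
assert (m.group(1), m.group(2)) == ("aten", "CPU")
```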
Fixing regex pattern to allow for underscores at the beginning of the
namespace
This reverts commit 3c936ecd3c.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25123295
Pulled By: bdhirsh
fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
Summary:
`__ROOT__` ops are only used in full-jit. To keep the build size compact, disable them in inference. Since FL is still on full-jit, keep them for training only.
It saves ~17 KB for fbios.
TODO: when FL is migrated to lite_trainer, remove `__ROOT__` to save size in training too.
Test Plan: CI
Reviewed By: dhruvbird
Differential Revision: D24686838
fbshipit-source-id: 15214cebb9d8defa3fdac3aa0d73884b352aa753
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46057
The code analyzer (which uses LLVM and runs on the OSS PyTorch git repo) already produces a YAML file that contains base operator names and the operators they depend on. Currently, this operator dependency graph is converted into a Python dictionary so it can be imported and used in BUCK. It is also mostly fed into other executables by serializing it to JSON, which the consumers piece back together by concatenating arguments. This seems unnecessary. Instead, this diff retains the original YAML file and makes all consumers consume that same YAML file.
ghstack-source-id: 114641582
Test Plan: Build Lite Predictor + sandcastle.
Reviewed By: iseeyuan
Differential Revision: D24186303
fbshipit-source-id: eecf41bf673d90b960c3efe7a1271249f0a4867f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45722
This diff does a bunch of things:
1. Introduces some abstractions as detailed in https://fb.quip.com/2oEzAR5MKqbD to help with selective build related codegen in multiple files.
2. Adds helper methods to combine operators, debug info, operator lists, etc...
3. Currently, the selective build machinery queries `op_registration_whitelist` directly at various places in the code. `op_registration_whitelist` is a list of allowed operator names (without overload names). We want to move to a world where overload names are also included, so that we can be more selective about which operators we keep. To that end, it makes sense to hide the checking logic behind a separate abstraction and have the build use that abstraction, instead of putting all of this selective-build-specific logic in the code generator itself. This change attempts to do just that.
4. Updates generate_code, unboxing-wrapper codegen, and autograd codegen to accept the operator selector paradigm as opposed to a selected operator list.
5. Update `tools/code_analyzer/gen_op_registration_allowlist.py` to expose providing an actual structured operator dependency graph in addition to a serialized string.
There are a bunch of structural changes as well:
1. `root_op_list.yaml` and `combined_op_list.yaml` are now actual YAML files (not a space separated list of operator names)
2. `generate_code.py` accepts only paths to operator list YAML files (both old style as well as new style) and not list of operator names on the command line as arguments
3. `gen.py` optionally also accepts a custom build related operators YAML path (this file has information about which operators to register in the generated library).
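A minimal Python sketch of the operator-selector abstraction described in (3); the names (`SelectiveBuilder`, `is_operator_selected`) are illustrative, not the actual tools/codegen API:

```python
# Illustrative operator selector: hides the allowlist-checking logic so
# callers no longer consult op_registration_whitelist directly.
class SelectiveBuilder:
    def __init__(self, selected_ops, include_all_operators=False):
        self.selected_ops = set(selected_ops)
        self.include_all_operators = include_all_operators

    def is_operator_selected(self, name):
        if self.include_all_operators:
            return True
        if name in self.selected_ops:          # new style: exact overload name
            return True
        base = name.split(".", 1)[0]           # "aten::add.Tensor" -> "aten::add"
        return base in self.selected_ops       # old style: base name only

sel = SelectiveBuilder(["aten::add", "aten::mul.Tensor"])
assert sel.is_operator_selected("aten::add.Tensor")    # base name allowlisted
assert sel.is_operator_selected("aten::mul.Tensor")    # exact overload allowlisted
assert not sel.is_operator_selected("aten::mul.Scalar")
```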
ghstack-source-id: 114578753
(Note: this ignores all push blocking failures!)
Test Plan:
`buck test caffe2/test:selective_build`
Generated YAML files after the change:
{P143981979}
{P143982025}
{P143982056}
Ensure that the generated files are same before and after the change:
```
[dhruvbird@devvm2490 /tmp/TypeDefault.cpp] find -name "*.cpp" | xargs md5sum
d72c3d125baa7b77e4c5581bbc7110d2 ./after_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f ./after_change/lite_predictor_lib_aten/TypeDefault.cpp
d72c3d125baa7b77e4c5581bbc7110d2 ./before_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f ./before_change/lite_predictor_lib_aten/TypeDefault.cpp
```
The `VariableType_N.cpp` files are generated the same both before and after the change:
```
[dhruvbird@devvm2490 /tmp/VariableType] find -name "*.cpp" | xargs -n 1 md5sum | sort
3be89f63fd098291f01935077a60b677 ./after/VariableType_2.cpp
3be89f63fd098291f01935077a60b677 ./before/VariableType_2.cpp
40a3e59d64e9dbe86024cf314f127fd6 ./after/VariableType_4.cpp
40a3e59d64e9dbe86024cf314f127fd6 ./before/VariableType_4.cpp
a4911699ceda3c3a430f08c64e8243fd ./after/VariableType_1.cpp
a4911699ceda3c3a430f08c64e8243fd ./before/VariableType_1.cpp
ca9aa611fcb2a573a8cba4e269468c99 ./after/VariableType_0.cpp
ca9aa611fcb2a573a8cba4e269468c99 ./before/VariableType_0.cpp
e18f639ed23d802dc4a31cdba40df570 ./after/VariableType_3.cpp
e18f639ed23d802dc4a31cdba40df570 ./before/VariableType_3.cpp
```
Reviewed By: ljk53
Differential Revision: D23837010
fbshipit-source-id: ad06b1756af5be25baa39fd801dfdf09bc565442
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148
Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413
Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh `
should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`
Reviewed By: iseeyuan
Differential Revision: D23503886
fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43570
Add the default op dependency graph to the source tree and use it if the user
runs a custom build in dynamic dispatch mode without providing the graph.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23326988
Pulled By: ljk53
fbshipit-source-id: 5fefe90ca08bb0ca20284e87b70fe1dba8c66084
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43155
Update the code_analyzer build.sh script to be able to take additional build flags in the mobile build/analysis
Test Plan:
Checkout associated PR or copy contents of build.sh into PyTorch repo (must be run from root of PyTorch repo)
To run with inclusion of autograd dependencies (note BUILD_MOBILE_AUTOGRAD is still an experimental build flag): `ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseopsfile MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`
Reviewed By: ljk53
Differential Revision: D23065754
fbshipit-source-id: d83a7ad62ad366a84725430ed020adf4d56687bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401
This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen. See the Note inside for more
details.
This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s and impl()s, including on a per-dispatch-key
basis. It expands dispatch key whitelisting to manually written
registrations, which previously were not whitelisted at all. (This
means we may now be dropping dispatch keys where we weren't previously!)
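A Python toy model of the idea; in the real C++ implementation the allowlist check happens at compile time (constexpr string matching), so unselected registrations compile away. The names here are made up for illustration:

```python
# Codegen-free selective registration, sketched in Python: the registration
# call itself checks an allowlist and becomes a no-op for unselected ops.
ALLOWLIST = {"aten::add", "aten::mul"}
REGISTRY = {}

def register(schema, kernel):
    name = schema.split("(", 1)[0]    # "aten::add(Tensor ...) -> ..." -> "aten::add"
    if name not in ALLOWLIST:         # in C++ this check is done at compile time,
        return                        # so unselected kernels are stripped entirely
    REGISTRY[name] = kernel

register("aten::add(Tensor self, Tensor other) -> Tensor", lambda a, b: a + b)
register("aten::sub(Tensor self, Tensor other) -> Tensor", lambda a, b: a - b)
assert set(REGISTRY) == {"aten::add"}   # aten::sub was dropped
assert REGISTRY["aten::add"](2, 3) == 5
```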
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D21905593
Pulled By: ezyang
fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42135
Tested the code analyzer with LLVM 9 & 10 and fixed a couple of issues:
- Renamed the local demangle(), which has been available as a public API since LLVM 9;
- Fixed falsely associated op registrations caused by the `phi` instruction;
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22795508
Pulled By: ljk53
fbshipit-source-id: 2d47af088acd3312a7ea5fd9361cdccd48940fe6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40276
- add a couple of new namespaces;
- handle the case where both the contextual namespace and the operator namespace
are set (BackendSelectRegister.cpp and #39401);
- improve error messages;
Test Plan: Imported from OSS
Differential Revision: D22135686
Pulled By: ljk53
fbshipit-source-id: 14d359c93573349b8fe1e05d7e44d875295a5f6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37797
This is slow (see comment in code).
Not fixing this yet, but at least adding a warning so people are aware and don't add new call sites.
ghstack-source-id: 103887226
Test Plan: waitforsandcastle
Differential Revision: D21390364
fbshipit-source-id: 7bff1c3b9756a16c9d9110f209c23bf557266dda
Summary:
- Add debug mode to include debug information.
- Move codegen comment to FB shell script (as it's only checked-in FB repo).
- Analyze lite-predictor instead of full-JIT, as the full-JIT BUCK target contains variable kernels and thus pulls in a lot more dependencies.
- Use pre-opt bitcode instead of pre-codegen bitcode: there is one special `callOp()` case in RNN.cpp where optimized bitcode has the opname string and API body inlined together: https://fburl.com/diffusion/8rz6u4rg; pre-optimization bitcode should give more stable results.
Test Plan: - Tested the bash script with stacked diff.
Reviewed By: iseeyuan
Differential Revision: D21298837
fbshipit-source-id: be33e2db5d8cb0f804460c503e52beb0dcb4857f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37404
Many aten operators are really util functions, e.g.
aten::is_nonzero, aten::is_floating_point, etc. These ops can be called
via overloaded C++ operators, so seemingly trivial and innocent code changes
can affect how these ops are used by other ops (and thus change the output
of the static analyzer).
Most of these util ops are rather small in terms of build size cost, so
for the purpose of optimizing binary size with custom build, whether we
include these ops or not does not make a significant difference. In fact,
for non-trivial models a set of these ops is almost always used.
This PR introduces an (optional) '__BASE__' ops section in the dependency graph.
We can maintain the list of frequently used small util ops for the internal
BUCK build. This way, the output dependency graph will only contain meaningful
edges with significant binary size impact, and it will be more stable against
trivial code changes (the graph is checked into the FB codebase).
Having a stable and sparse deps graph by factoring out frequently used base ops
is also a nice property that allows us to explore alternative custom build
solutions in case we find the static code analyzer hard to maintain.
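An illustrative sketch of how a consumer might combine the '__BASE__' section with the per-op edges; op names and graph shape here are made up:

```python
# A toy dependency graph with the optional '__BASE__' ops section:
# base ops are always registered, and the per-op edges are traversed
# transitively from the model's root ops.
dep_graph = {
    "__BASE__": ["aten::is_nonzero", "aten::is_floating_point"],
    "aten::conv2d": ["aten::empty", "aten::mm"],
    "aten::mm": [],
    "aten::empty": [],
}

def ops_to_register(graph, root_ops):
    result = set(graph.get("__BASE__", []))   # base ops always included
    stack = list(root_ops)
    while stack:
        op = stack.pop()
        if op in result:
            continue
        result.add(op)
        stack.extend(graph.get(op, []))       # follow dependency edges
    return result

assert ops_to_register(dep_graph, ["aten::conv2d"]) == {
    "aten::is_nonzero", "aten::is_floating_point",
    "aten::conv2d", "aten::empty", "aten::mm",
}
```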
Test Plan: Imported from OSS
Differential Revision: D21280835
Pulled By: ljk53
fbshipit-source-id: c4d0d1f07ca868c60f23118d877fc1eeead4c875
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37393
Simplify the code analyzer by removing some unused flags and moving the
different format printer logic to python script. It's easier to add other
post processing logic to adapt to different BUCK build configs.
Test Plan: Imported from OSS
Differential Revision: D21280836
Pulled By: ljk53
fbshipit-source-id: 0d66d5891d850f012c4ab4f39eabbd9aecc1caa9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36742
Now, you can define a custom class inside a TORCH_LIBRARY block.
It looks very similar to what you did before. Instead of
```
static auto m = torch::class_<Class>("Namespace", "Class").def("foo", foo);
```
you write
```
TORCH_LIBRARY(Namespace, m) {
  m.class_<Class>("Class")
      .def("foo", foo);
}
```
All the old usages still work, but at some point we should start
updating the tutorials when we're ready to go 100% live with the
new pybind11 style API.
The custom class API previously lived in the torch/ folder and in the torch
namespace, so for consistency, the new TORCH_LIBRARY also got
moved to torch/library.h. The definition of Library::class_ is at the
bottom of that header because I need all of the class_ constructors
available, but there is a circular dependency between the two headers.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D21089648
Test Plan: Imported from OSS
Pulled By: ezyang
fbshipit-source-id: 8d54329c125242605336c22fa1642aae6940b507
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36258
Previously we had a && chaining style API. There are some downsides to
this API:
- It's easy to forget the 'static' qualifier in front, leading to
subtle ODR bugs.
- It is not compatible with torchbind class_ definitions, as these
need multiple levels of chaining. So in practice people end
up having to define multiple static initializers, one per class.
- It's not like pybind11.
- There's no way to conveniently get the file and line number of
the registration, as there is no macro point in the API.
- The old API doesn't really encourage people to put all of their
definitions for a library in one place, and to give a custom
namespace for it. Similarly, the old API wasn't very DRY, because
you had to keep repeating the namespace/dispatch key you
were writing implementations for.
The new API is modeled exactly off of the PYBIND11_MODULE macro:
you write:
```
TORCH_LIBRARY(aten, m) {
  m.def("aten::add(Tensor self, Tensor other) -> Tensor");
  ...
}
```
in a non-chaining fashion, and under the hood the macro expands to
define a function, and define a static initializer that allocates
c10::Library (previously called c10::Module, but we renamed it
to avoid confusion with the existing NN module concept), passes
it to your function, and then retains it for the rest of the lifetime
of the program. Specification of the namespace is mandatory,
and in a later commit I plan to make it a hard error to TORCH_LIBRARY
the same library name twice.
If you are specifying an implementation for an existing operator
(e.g., you're the XLA backend, or even if you're just putting
registrations for implementations at the implementation site),
you should use TORCH_LIBRARY_IMPL, which instead takes a backend
argument (instead of namespace) and can be used to specify an
implementation for a backend. Unlike TORCH_LIBRARY, you can do
as many of these as you want for a backend.
This needs updates to the mobile code analyzer.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20929257
Pulled By: ezyang
fbshipit-source-id: ba04d78492e8c93ae7190165fb936f6872896ada
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36607
PR #36258 and subsequent PRs in the stack switch c10 registrations to
the new pybind11 style registration API. One notable difference from the old
c10 registration API is that the operator's namespace is no longer part of
the op schema string, e.g. "aten::" will be factored out from "aten::conv",
"aten::empty", etc. The namespace string is declared at the
beginning of registrations with the TORCH_LIBRARY / TORCH_LIBRARY_IMPL
macros.
A rather simple fix is to extract the namespace string from the name of the
enclosing function of the registrations, as the TORCH_LIBRARY macro will
always create an init function (per namespace) by appending the namespace
string to a common prefix.
Another side effect of the API change is that it adds some debug string
constants to the registration API, and because the namespace part is
factored out of the op name, there is no longer an effective way to
differentiate between real op names and debug strings. A simple
workaround is to only keep the first string constant encountered
while BFSing the LLVM IR - the real op name is directly passed into the
registration call while the debug string is indirectly passed via
CppFunction.
These new assumptions might be broken by future changes, but they are simple
to implement and unblock the API work.
Test Plan: Imported from OSS
Differential Revision: D21026008
Pulled By: ljk53
fbshipit-source-id: c8c171d23aaba6d6b7985d342e8797525126a713
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35941
The key step of mobile custom build is to find the ops used by a specific
model, with which it can produce a tailored build of optimal size.
However, ops can not only be called from a TorchScript model but can also
be called from C++ code directly, e.g. via torch::jit:: APIs. With
static dispatch, ops called this way are statically linked into the client
code. With dynamic dispatch, we need to obtain & keep these ops explicitly.
This PR improves the static code analyzer to dump ops that are called from
visible C++ symbols matching a specific regex. This provides a mechanism
to solve the custom build problem with dynamic dispatch.
It starts by dumping ops that are callable from functions in the torch::jit
namespace and including them in the custom build with dynamic dispatch. We can
extend it to analyze custom code, to refine the set of JIT APIs that
are relevant, etc. This is just a preliminary version; we need to
improve its usability for more general purposes.
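Conceptually, the analysis is a reachability computation over the call graph, seeded by symbols matching the regex. A toy Python version (graph, symbols, and op names are all made up):

```python
import re

# Toy call graph: symbol -> callees. Ops are the "aten::" nodes.
call_graph = {
    "torch::jit::load": ["caffe2::deserialize", "aten::empty"],
    "caffe2::deserialize": ["aten::to"],
    "my_app::main": ["torch::jit::load"],
    "unrelated::fn": ["aten::conv2d"],
}

def reachable_ops(graph, root_pattern):
    roots = [s for s in graph if re.match(root_pattern, s)]
    seen, stack, ops = set(), list(roots), set()
    while stack:
        sym = stack.pop()
        if sym in seen:
            continue
        seen.add(sym)
        for callee in graph.get(sym, []):
            if callee.startswith("aten::"):   # record reachable ops
                ops.add(callee)
            stack.append(callee)              # keep walking the graph
    return ops

# Only ops reachable from torch::jit:: symbols are dumped:
assert reachable_ops(call_graph, r"^torch::jit::") == {"aten::empty", "aten::to"}
```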
Test Plan: Imported from OSS
Differential Revision: D20835166
Pulled By: ljk53
fbshipit-source-id: a87cfb22b34f89545edd0674a5dfca6b7cff2b0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36223
Previously #35714
There are a lot of unboxed-only defs. We're committed to removing
them by the end of the half, but as I am about to do a lot of porting
to the new API, let's get them into a form where they're easy to
remove. This adds a new overload, impl_UNBOXED, that passes
the function pointer straight to CppFunction::makeUnboxedOnly.
I don't attempt to make the _UNBOXED API complete; in particular,
catchall declarations don't get this sugar (as there are very few
of them).
To get some coverage of _UNBOXED API for code analysis, I switched
one of our unboxed tests to be an impl rather than a def. This
shouldn't materially affect coverage.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20929259
Pulled By: ezyang
fbshipit-source-id: 72d2061b6c8a6afbcd392b47f53ade18de2f9184