Commit graph

1129 commits

Author SHA1 Message Date
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability trading additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
used to iterate the Graph to find some subgraphs for recompute, to
reduce some stashed activations whose lifetime across forward and
backward pass.

When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraph to recompute,
along with sizes that can save. An example looks like below.
If we want to enable some of them to recompute, we can define env
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With latest main branch, we can run batch size 16, and the maximum
batch size < 32. So 16 is usually chosen by data scientists. 65% of 40GB
memory is used during training. The SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed for saving memory peak, batch size 32 can be run. The
97% of 40GB A100 is used, the SamplesPerSec=562.041593991271 (**1.17X**
of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
2022-11-03 13:49:41 +08:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
PeixuanZuo
6740528b98 [ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525) 2022-11-01 13:05:55 +08:00
Baiju Meswani
c557a55816
Fix on-device training ExportModelForInferencing api (#13510) 2022-10-31 21:29:06 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.

For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For CPU kernel, using same shape and float type:
Before, FW takes 10us, BW takes 49us
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes 4us, BW takes 5us, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
Baiju Meswani
a46c599a40
Training API to export the eval model to an inference model (#13345) 2022-10-27 09:34:01 -07:00
Vincent Wang
805ec459a0
Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462) 2022-10-26 15:45:18 -07:00
Vincent Wang
b6a3562ffb
[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430)
Add env variable to control disabling custom autogard function support.
When using ORTModule, if the torch model has torch.nn.Function, if user
confirms that it can be exported to ONNX (for example, by inline
PythonOp) and the backward implementation is matched to the forward
impl, user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to
disable the custom autograd support so that it won't use ORT's PythonOp
to fallback to PyTorch. Exporting to ONNX sometimes can leverage some
graph optimizations in ORT so that perf is better.
2022-10-25 16:58:04 +08:00
cloudhan
2748f38362
Drop hip_add_library (#13406)
Switching to use CMake's builtin hip language support.
2022-10-25 12:57:48 +08:00
Adam Louly
bed169192d
Windows build fix for on device training training. (#13354)
### Description
This is a fix for on device training wheel build.

### Motivation and Context
when building linux wheel it treats PathString same as std::string, but
when trying to build the wheel on windows it fails because we needed to
cast the std::string to a PathString.

This error was found manually because there is no pipeline that uses the
--enable_training_on_device for windows.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-10-20 09:58:02 -07:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user enable/disbale tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
PeixuanZuo
4b2b588895
[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365)
### Description
<!-- Describe your changes. -->

Use SAS Token to fix error` failed to perform copy command due to error:
no SAS token or OAuth token is present and the resource is not public`

Generate SAS Token of target data, add it into Key vault, and use it as
Pipeline Variable.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-10-20 12:08:57 +08:00
Vincent Wang
67150baa8d
[ORTModule] ATen Support for aten::upsample_nearest (#13364)
ATen support for aten::upsample_nearest, which is required for
Huggingface's diffusers model training using ORTModule.
2022-10-20 08:30:04 +08:00
Vincent Wang
b6b3f41636
Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347)
The PR applies some fixes to Hierarchical ORTModule and ORTModule
PythonOp.

For Hierarchical ORTModule:
- Don't wrap module if the caller is to call other function instead of
forward() function
- Support single module instance is call multiple times with different
types of inputs
- Check if module can be warped from top to bottom instead of from
bottom to top

For ORTModule PythonOp:
- Add env variable control to allow using
torch.utils.checkpoint.CheckpointFunction
- Add env variable control to skip register some autograd functions so
that there is no conflict for some models.
2022-10-20 08:16:03 +08:00
Adam Louly
61ee5585b2
update the nightly build to use the latest ptca image. (#13309)
### Description
updating the ptca image used in the nightly pipeline

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-10-17 14:12:03 -07:00
Adam Louly
68eff69ab1
Add Utils for federated learning scenarios (#13014)
**Description**: utils for federated learning.

**Motivation and Context**
- This PR includes utils that will be used on federated learning
scenarios.
- Exposing python bindings to some utils, and added a util to calculate
the difference between two buffers.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-10-17 12:39:43 -07:00
Jeff Daily
65c67764ae
remove line "ADD model ${WORKSPACE_DIR}/model" in the amdgpu Dockerfile (#12914)
Follow-up to #12707. docker build is broken otherwise; model dir is
gone.
2022-10-14 13:17:28 -07:00
Wei-Sheng Chin
dc324b1d90
[LazyTensor] Make LORT Build Again with Latest PyTorch (#13303)
`python setup.py develop` doesn't install PyTorch as a normal package in
site-packages anymore, and the user must stay at PyTorch's root
directory to call `import torch`. This will break LORT tests because
LORT tests contains `import torch` and are called outside PyTorch root
directory. To make PyTorch a normal package again, this PR build PyTorch
with `python setup.py install`.
2022-10-13 13:56:17 -07:00
Vincent Wang
807b2f4dd5
[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296)
This PR is to add support of using env variable to set provider option
cudnn_conv_algo_search so that user can choose better conv algo search
method to run model. This is a quick fix to unblock the test of MoE
model. Will have another PR to design and implement the ORTModule config
so that we can config ORTModule using Python script or config file
instead of env variable.
2022-10-13 15:36:21 +08:00
Vincent Wang
6fb70a82df
[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305)
Update supported deepspeed highest version from 0.7.1 to 0.7.3 for
FP16_Optimizer. Also add version info to warning log.
2022-10-13 13:03:01 +08:00
Vincent Wang
afb5f76770
[ORTModule] ATen Support for torch.nn.GroupNorm (#13293)
Model [huggingface's diffusers
library](https://github.com/huggingface/diffusers) has
torch.nn.GroupNorm which will be exported to sub-graph containing ONNX's
InstanceNormalization, which is lack of gradient. The implementation of
ORT's InstanceNormalization will call cuDNN's BatchNorm for part of
computation, which is not efficient compared to PyTorch's
implementation. This PR is to use ATen fallback to support this torch
module, including its forward and backward.
2022-10-13 11:59:03 +08:00
PeixuanZuo
6895918b1c
[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297)
### Description
<!-- Describe your changes. -->

Unit test with ROCm5.3 slower than ROCm5.2.3. Revert to ROCm5.2.3.
We will update to ROCm5.3 when the issue resloved by AMD.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-12 10:47:33 -07:00
Vincent Wang
a2658f0784
[ORTModule] Fix Graph Builder for Eval Mode (#13255)
Current graph builder for ORTModule will apply the training's graph
optimizations for both training and eval mode. Take BatchNorm as
example, one of training's graph optimizations will replace
BatchNormalization Op to BatchNormInternal which is for training only.
This PR is to fix this, for eval mode, we will not apply the training's
graph optimizations. The inference's graph optimizations will be applied
when InferenceSession initialization.
2022-10-12 14:39:54 +08:00
Prathik Rao
93e0a15117
implement cos gradient as a function op (#13227)
### Description
Implemented gradient of cos as per the function below.

![image](https://user-images.githubusercontent.com/31260940/193900310-b62a3e77-06d5-45af-ad28-a1d41920bad0.png)

### Motivation and Context
Cos gradient required for [huggingface's diffusers
library](https://github.com/huggingface/diffusers)

### Testing
built ORT from source: `./build.sh --config RelWithDebInfo
--enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home
/usr/local/cuda --build_wheel --parallel --skip_tests`
tested CosGrad implementation: `cd build/Linux/RelWithDebInfo/ &&
./onnxruntime_test_all --gtest_filter=GradientCheckerTest.CosGrad`

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
2022-10-11 10:11:19 -07:00
Prathik Rao
05acd20a88
convert singrad to function op and remove cpu kernel (#13263)
### Description
Implemented gradient of sin as a function op.

### Motivation and Context
Sin gradient currently implemented as cpu op which could hurt
performance.

### Testing
built ORT from source: `./build.sh --config RelWithDebInfo
--enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home
/usr/local/cuda --build_wheel --parallel --skip_tests`
tested SinGrad implementation: `cd build/Linux/RelWithDebInfo/ &&
./onnxruntime_test_all --gtest_filter=GradientCheckerTest.SinGrad`

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-10-11 10:11:08 -07:00
Vincent Wang
b9e23bd086
[ORTModule] Fix Custom Op Registry for Torch 1.13+ (#13250)
This PR has two fixes:
- https://github.com/pytorch/pytorch/pull/85636 change the behavior of
register_custom_op_symbolic to only register the symbolic function at a
single version. For ORTModule we need to pass the op_set version when
calling it.
- Since torch_1.13 the signature of einsum is changed to have a new
argument, need to change our custom op symbolic registry code
accordingly.

Without the fixes, ORTModule will not work with the nightly torch, and
the new torch version will be released.
2022-10-11 15:20:51 +08:00
PeixuanZuo
4d25b9c8f0
[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257)
### Description
<!-- Describe your changes. -->

1. Update ROCm pipeline and MIGraphX pipeline to ROCm5.3
ROCm pipeline run ortmodule test one time and disable it :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04
2. Add `workspace: clean: all `.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-11 13:47:22 +08:00
Baiju Meswani
04ba8a7e6e
Introduce Training C++ Apis (#12994) 2022-10-06 20:13:37 -07:00
cloudhan
72076b1eb2
Update ROCm CI to use HIP LANGUAGE (#13214)
Update for ROCm CI before reland tunable GEMM #12853. This PR also update
composable kernel to use CMakes's HIP language support so that we can
mix C/C++ compiler with HIP compiler instead of locking to hip-clang
2022-10-05 16:15:16 +08:00
Ashwini Khade
4fc8f7139a
Bug Fix - C# API order incompatibile with C API (#13191)
### Description
Training C# bindings (ReleaseTrainingSession and ReleaseCheckpointState)
broke after an API order change in Training C API. This PR fixes this
issue.



### Motivation and Context
Bug Fix for Training C# bindings
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-04 09:29:20 -07:00
Ashwini Khade
c780c4a2b9
Fix two prefast warnings (#13211) 2022-10-03 20:00:57 -07:00
Tony Xia
962fee5fe5
Fix typo enviroment => environment (#13195) 2022-10-03 17:02:26 -07:00
Vincent Wang
6c63c1c9ee
Multiple Gather to Split Fusion (#13095)
For below code in some transformers models:
```
fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim)
return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :]
```

The exported graph will contains 3 Gather nodes, currently ORT's
GatherGrad CUDA implementation is slow. This pattern can be fused to use
one Split, so that we can launch less kernels for the compute, the perf
of Split/Concat (for grad) is also better than Gather/GatherGrad.

In a real example, one GatherGrad will take 15ms and there are 3 for
each layer in the graph, after the fusion, one Concat takes only 35us.
The total time of a step is improved from 1.5s to 0.4s.
2022-09-29 11:09:57 +08:00
Vincent Wang
94e34ace15
Bugfix for SimplifiedLayerNormalization (#12975)
This PR is to fix https://github.com/microsoft/onnxruntime/issues/12930
and https://github.com/microsoft/onnxruntime/issues/12579.

In detail:
- For CPU EP, since current impl of SimplifiedLayerNormalization doesn't
support input and scale having different data types, so if the sub-graph
contains Cast Op, the sub-graph will not fused, this guarantee that both
inputs and output data type will be same
- For CUDA EP, add (fp16, float) support to (T,V) type constraints all
combinations of fp16 and float can be supported in the impl

With the fix, the original model can be run with
SimplifiedLayerNormalization, which also helps to improve the perf.
2022-09-27 14:24:16 +08:00
Baiju Meswani
bcc93ab17c
Deprecate ORTTrainer (#13022) 2022-09-23 18:10:09 -07:00
ashari4
c4a7e88fc8
QuantizeBFP and DequantizeBFP (#12833)
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DeQuantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferrable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).

Context: 

The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4 less cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).

Requirements:

* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of an BFP variant varies across hardware
* bounding box can be over arbitrary dimensions; for example, for the
channel "C" dimension in a N x C x H x W tensor for convolution

Goals of this PR:

* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
2022-09-22 14:02:55 -07:00
Weixing Zhang
4113df0e21
use constexpr (#12953) 2022-09-20 14:34:33 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
2022-09-20 14:24:59 -07:00
Pranav Sharma
a8b0f57d1a
Fix eager mode pipeline to accommodate recent allocator change. (#13000) 2022-09-20 12:53:46 +08:00
cloudhan
0ddf4efbd9
Make PythonOp report dtype mismatch by name, instead of by using enum index (#13007) 2022-09-20 12:29:30 +08:00
Adam Louly
268bfe2a5d
python training api bindings (#12610)
**Description**: **Python API Bindings for on device training. **
**Motivation and Context**
- This PR contains api bindings so python users can perform a whole
training loop.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-09-16 09:38:24 -07:00
Vincent Wang
da07c83948
SoftmaxCrossEntropyLossInternalGrad and Sum Fusion (#12746)
* fuse scegrad and sum

* add yield output shapes to value_info

* resolve comments

* fix merge main
2022-09-14 14:45:51 +08:00
pengwa
b5327595f3
Fix [prefast:Warning]: C26814 (#12897)
fix C26814
2022-09-09 08:26:48 +08:00
Thiago Crepaldi
55c745eefd
Add support for ORTModule Torch cpp CUDA extension build within docker (#12868)
Currently, CUDA hardware is not available to be leveraged by build
during `docker build`. because of that, CUDA capable hardware would not
have CUDA support

This PR adds an env varf ONNXRUNTIME_FORCE_CUDA in which it allows CUDA
extensions to be compiled even when CUDA support is not detected.
2022-09-08 15:30:44 -04:00
guyang3532
4765e5c382
Using ORTModule to wrap a evaluation model should not change the mode (#12747)
Using ORTModule to wrap a evaluation model should not change the mode of model
2022-09-08 10:54:59 +08:00
RandySheriffH
d3b684cd9e
Drop nuphar (#11555)
* drop nuphar code and configs

* refactor test case

* format python

* remove nuphar from training test

* remove commented nuphar logics

* restore llvm setting

* drop nuphar ci

* fix compile err

* fix compile err

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-07 15:11:18 -07:00
Baiju Meswani
9e47eb68e0
Remove unused orttraining amd dockerfiles and scripts (#12707) 2022-09-02 18:43:21 -07:00
Baiju Meswani
295bd26980
Remove orttraining-distributed CI pipeline (#12738) 2022-09-02 14:34:26 -07:00
ashbhandare
27dde0b51f
Csharp bindings for on-device training APIs (#12404) 2022-09-02 13:13:48 -07:00