onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-05 04:17:53 +00:00

Author	SHA1	Message	Date
Kyushick Lee	cd24f0794a	Extend ort_backend.py for another ep (#14349 ) ### Description This PR extends OrtBackend so that an EP can be configured by name, and falls back to the existing mechanism that infers the EP from tensor affinity if no name is provided. ### Motivation and Context Currently OrtBackend needs `get_ort_device()` with the device tag inferred from torch.Tensor, but ort device is not yet supported for dort. The change allows running dort with a supported EP by configuring dort with the desired EP and letting dort (ort InferenceSession) take CPU-affined PyTorch Tensors as inputs, then inject data transfer nodes internally.	2023-01-20 07:30:00 -08:00
Ashwini Khade	cc7799835e	Enable a single build with optimized inference and on device training (#14241 ) ### Description Right now prepacking code is not compiled when training is enabled. Our partners want a single build of ort which can do both optimized inference + training on device. This PR enables prepacking code in a training build and controls whether it is enabled or not using already existing session option - kOrtSessionOptionsConfigDisablePrepacking For Inference scenarios - prepacking will be turned on by default and this behavior remains the same after this PR too. For training scenarios - prepacking will be disabled by default and if user explicitly enables it then an error will be thrown. ### Motivation and Context Enable both optimized inference as well as on device training in a single build. For on device training use flag --enable_training_apis.	2023-01-12 21:36:43 -08:00
Xavier Dupré	79dc39600f	Replace distutils by setuptools to import build_ext (#14108 ) ### Description Uses setuptools instead of distutils. ### Motivation and Context Fixes #14107.	2023-01-09 11:48:01 +01:00
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964 ) ### Description 1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers/non-edge devices. 2. Update ENABLE_TRAINING option: With this PR, when this option is enabled, training apis and torch interop are also enabled. 3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: - Removed user facing option - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop. Once this PR is merged, when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features). Training entry points include: 1. ORTModule 2. Training APIs Features include: 1. ATen Fallback 2. All Training OPs, including communication and collectives 3. Strided Tensor Support 4. Python Op (torch interop) 5. ONNXBlock (Front end tools for training artifacts prep when using training apis) ### Motivation and Context The intention is to simplify the options for building training-enabled builds. This is part of the larger work item to create a dedicated build for learning on the edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
Vincent Wang	0c3480e565	[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069 ) PyTorch removed upsample_nearest related backward functions with "vec" overload name since 1.13. The functions without overload name are available for all versions, though they are not that convenient to use. This PR changes the gradient builder code to use functions without overload name for ATen upsample_nearest nodes. This PR also fixes a bug in an ORTModule corner case introduced by the multi-stream PR. There is some code to execute the barrier step for a triggered downstream if the barrier is out of range, but this should be applied to a triggered downstream only. If it's a normal run whose start step is a barrier step that is out of range, we should not apply the logic. For example, for ORTModule, if the barrier is the 1st step of the whole CPU plan and the forward part is empty, then the forward normal run will run steps from start-0 to end-0 (actually nothing), and step-0 is the barrier, so we should not execute the barrier in such a case.	2022-12-27 10:18:30 +08:00
Adam Louly	e49f358686	expose lr scheduler python bindings for on device training. (#13882 ) ### Description Exposing LR Scheduler python bindings for on device training. Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-12-22 18:44:04 -08:00
pengwa	2f5bf75e51	Optimize computation orders (#13672 ) ### Optimize computation orders In `Roberta/Electra`, when `ClassificationHead` is used, there is a slicing operation on the features along the sequence_length dimension, and the loss calculation depends only on this sliced data. This is a slice at axis 1. Before slicing the shape is [batch, sequence_length, hidden]; after slicing it becomes [batch, hidden_stage]. We have the opportunity to move this slicing as early as possible by passing it through simple elementwise ops (like Add/Div), Layernorm/Softmax (if their reduce axis is after the slicing axis), or even MatMul's left operand (as long as the slice does not affect the last dims). Operators like Reshape/Transpose are special: they either have data specified (which must be updated after slicing) or have perm specified, which requires the input rank to remain unchanged. For those operators we keep the original rank and leave the sliced dim as 1; after the compute completes, we do a Squeeze.
```
import torch
from torch import nn


class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```
src\transformers\models\roberta\modeling_roberta.py src\transformers\models\electra\modeling_electra.py #### Benchmark A simple benchmark shows Roberta training latency dropped from 208ms to 199ms, a 4.5+% reduction. More comprehensive tests are on the way. 
### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-12-22 15:12:52 +08:00
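The reordering described in the commit above relies on slicing commuting with elementwise ops. The following is a minimal pure-Python sketch of that property only, not the ORT graph transformer; nested lists stand in for tensors, and the helper names `add_bias` and `take_first_token` are hypothetical.

```python
# Toy illustration: slicing [:, 0, :] before or after an elementwise Add gives
# the same result, but slicing first touches far fewer elements.
# This is NOT the ORT graph transformer; all helper names are hypothetical.

def add_bias(x, bias):
    """Elementwise Add over a [batch][seq][hidden] nested list."""
    return [[[v + b for v, b in zip(row, bias)] for row in seq] for seq in x]


def take_first_token(x):
    """Equivalent of features[:, 0, :], giving [batch][hidden]."""
    return [seq[0] for seq in x]


batch, seq_len, hidden = 2, 4, 3
features = [
    [[float(b * 100 + s * 10 + h) for h in range(hidden)] for s in range(seq_len)]
    for b in range(batch)
]
bias = [0.5, 1.5, 2.5]

# Original order: Add over the full tensor, then slice.
late = take_first_token(add_bias(features, bias))

# Reordered: slice first while keeping a sequence dim of size 1 (rank unchanged),
# run Add on the small tensor, then drop the size-1 dim (the "Squeeze").
sliced = [[row] for row in take_first_token(features)]
early = take_first_token(add_bias(sliced, bias))

print(late == early)
```

In this toy case the reordered path performs the Add on `batch * hidden` values instead of `batch * seq_len * hidden`, which is the saving the commit message is after.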
Baiju Meswani	1fd63487fd	ORTModule support for kwargs input that is a dict (#13910 )	2022-12-14 16:23:48 -08:00
Baiju Meswani	5a55fac402	Miscellaneous updates to training apis (#13929 )	2022-12-14 13:33:07 -08:00
Adam Louly	fb4707f76d	add cuda support to python bindings (#13700 ) ### Description Add cuda support to the on device training python bindings. ### Motivation and Context Now users can set the execution provider (cpu or cuda) when using python bindings for on device training apis.	2022-12-08 16:03:53 -08:00
Adam Louly	f453d2845e	adding get and set lr for optimizer (#13661 ) ### Description Exposing get and set Learning rate for optimizer ### Motivation and Context you can now set learning rate for optimizer.	2022-12-07 11:59:11 -08:00
Wei-Sheng Chin	7df8f84228	Improve DORT document (#13790 ) 1. Refine words based on PyTorch changes. 2. Make the need of inference mode clearer. A test is added.	2022-11-30 16:55:25 -08:00
Wei-Sheng Chin	639d285670	[DORT] Catch up with yesterday's PyTorch change (#13779 ) Fix recent CI failures.	2022-11-30 09:23:44 -08:00
Xavier Dupré	441b30b2d2	Move a function call outside a loop in ORTModule (#13771 ) ### Description The proposed change is useful for ORTModule when the output graph has multiple outputs. ### Motivation and Context performance Signed-off-by: xadupre <xadupre@microsoft.com>	2022-11-30 12:49:41 +01:00
guyang3532	ba9a585fcc	Fix the tensor save for backward release problem (#13679 ) Motivation: PythonOp is saving input for backward, it's risky since ONNX Runtime backend is not aware of this, the tensor buffer may be "released" by ORT, then potentially modified by other operators before backward function executes. Fix: This pr just clone all input of PythonOp before forward is invoked. This may be high overhead, it's just a workaround before a better fix.	2022-11-22 17:32:19 +08:00
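The hazard described in the commit above (a tensor saved for backward whose buffer is later reused) can be sketched with plain Python lists standing in for tensors. This is an illustration under that assumption, not ORT or PythonOp code; every name below is hypothetical.

```python
# Toy model of the hazard: a "saved for backward" reference aliases a buffer
# that the runtime later reuses. Plain lists stand in for tensors; all names
# are hypothetical and nothing here calls ONNX Runtime.

def forward_saving_reference(buf):
    saved = buf        # aliases the runtime-owned buffer
    return saved


def forward_saving_clone(buf):
    saved = list(buf)  # private copy, analogous to cloning the input first
    return saved


runtime_buffer = [1.0, 2.0, 3.0]
saved_ref = forward_saving_reference(runtime_buffer)
saved_clone = forward_saving_clone(runtime_buffer)

# The runtime considers the buffer free and another operator overwrites it
# before the backward function runs.
runtime_buffer[:] = [0.0, 0.0, 0.0]

print(saved_ref)    # reflects the overwrite
print(saved_clone)  # still holds the forward-time values
```

The copy costs extra memory and time on every forward call, which matches the commit's own note that cloning is a workaround with potentially high overhead.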
pengwa	947aab0ae0	Make HF converge with Lightning native amp (#13616 ) ### Fix training convergence issues #### Problem: Huggingface Transformers: 4.22.0 PyTorch Lightning: 1.6.3 PyTorch: v1.12.1, cuda 11.6 ORT: main branch, cuda 11.6 Model: RobertaForSequenceClassification @ models/roberta/modeling_roberta.py Mixed Precision training with `torch.autocast`: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)` Under this amp autocast context, forward + loss computation run. Here is a snippet of the loss computation.
```
if labels is not None:
    ...
    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            ...
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        ...
return SequenceClassifierOutput(
    loss=loss,
    logits=logits,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)
```
After the forward run the loss is 1.0850 in float16, which looks good. It is then scaled up here: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62)`; the scaler is 65536. We get a scaled loss of 71104 in float32 (a float16 loss multiplied by an fp32 scaler is promoted to fp32). Backward then starts with initial grads of 1, so the first backward step computes 1 (float32) * 65536 (float32) and generates a float16 gradient, which is `inf`. This is where the problem occurs. The backward pass feeds the `inf` into the crossentropygradient op, generating `nan`s, and all gradients become `nan` during back propagation. So when training with ORTModule we see that it almost always `overflow`s and the loss does not drop as much as with PyTorch. 
#### Analysis for the UT (when autocast enabled) The PyTorch trace graph looks like this:
```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
      %target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
      %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
  %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %12 : NoneType = prim::Constant()
  %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %17 : NoneType = prim::Constant()
  %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %19 : NoneType = prim::Constant()
  %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %21 : NoneType = prim::Constant()
  %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %data : Float(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
  return (%27)
```
The most important lines: %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %input : _Half_(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 _Float_(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%_input_, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 `aten::cross_entropy_loss` takes a Half input and returns a Float output. As stated in the doc https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32, `cross_entropy` in autocast mode runs in fp32, i.e. it converts its input to fp32 (if it is not already), does the compute and returns an fp32 result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal` takes the same type for input and output, and our code `31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)`, when exporting `aten::cross_entropy_loss`, assumed this and set the output to be fp16 as well. This is the reason we have the problem. #### Possible Fixes 1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types of input and output. 2. Check the input and output when exporting, and add the input cast explicitly if there is type promotion from input to output. This PR uses the 2nd approach. We can start on the 1st approach later when needed. TODO: revisit all other exporter functions, add the checks, etc. 
### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-11-22 15:08:30 +08:00
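The arithmetic in the commit above can be checked with the standard library alone. This is a minimal sketch, assuming IEEE half precision as implemented by `struct`'s `"e"` format; the helper `to_fp16` is hypothetical, and the final `inf * 0.0` line is only a generic illustration of how an `inf` turns into `nan`, not the actual cross-entropy gradient kernel.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value; overflow becomes +/-inf.

    Hypothetical helper: struct raises OverflowError for finite values that do
    not fit in fp16, which is mapped to infinity here as hardware casts do.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


scale = 65536.0                 # loss scaler quoted in the commit message

loss_fp16 = to_fp16(1.0850)     # forward loss as stored in float16
scaled_loss = loss_fp16 * scale # promoted to a wider float type, no overflow
print(loss_fp16, scaled_loss)   # 1.0849609375 71104.0

# Backward starts from grad 1 and multiplies by the scale; storing that
# result as float16 overflows, because the largest finite fp16 value is 65504.
grad_fp16 = to_fp16(1.0 * scale)
print(grad_fp16)                # inf

# Any later op that combines inf with a zero (or subtracts inf from inf)
# produces nan, which then propagates through the remaining gradients.
print(grad_fp16 * 0.0)          # nan
```

The scaled loss of 71104 matches the value quoted in the commit message, which supports reading the failure as a plain fp16 overflow of the initial scaled gradient.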
Wei-Sheng Chin	6160ba0692	Fix aten::_to_copy in DORT (#13682 ) `aten::_to_copy` is not exportable to ONNX, so in DORT it's replaced in `_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch commit, and this PR is a fix. Basically, we examine more key-word attributes passed to `aten::_to_copy` and if they lead to a type casting operator (i.e., mapped to ONNX's Cast), we replace that `aten::_to_copy` with `aten::to`. Unsupported attributes are removed (with a low risk of breaking FX graph's assumptions).	2022-11-18 09:31:18 -08:00
zhijiang	1977b7ed6a	Fix pythonop training_mode in evaluation mode (#13514 ) Customer reported this issue: they see many warnings when doing the evaluation using ORTModule. ![image](https://user-images.githubusercontent.com/10530022/199371757-5fed7d05-a951-4f1b-8f88-049c5ab89886.png) After investigation, we found that `training_mode` is exported with a wrong value in evaluation mode: its value should be 0, but we found it is 1. Fix: fix the pythonop training mode. If training_mode's type is torch._C._onnx.TrainingMode, then no matter whether it is EVAL or TRAINING, "if training_mode" will always be true.	2022-11-04 08:47:01 +08:00
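The truthiness pitfall behind the commit above can be reproduced with a plain Python `Enum`. This is a sketch with a stand-in class: `TrainingMode` below is a hypothetical substitute for `torch._C._onnx.TrainingMode`, and the two helper functions are illustrative, not the actual ORTModule exporter code.

```python
from enum import Enum


class TrainingMode(Enum):
    """Hypothetical stand-in for torch._C._onnx.TrainingMode."""
    EVAL = 0
    TRAINING = 1


def training_flag_buggy(mode) -> int:
    # Members of a plain Enum are always truthy, even when their value is 0,
    # so this returns 1 for EVAL as well.
    return 1 if mode else 0


def training_flag_fixed(mode) -> int:
    # Compare against the member explicitly instead of relying on truthiness.
    return 1 if mode == TrainingMode.TRAINING else 0


print(training_flag_buggy(TrainingMode.EVAL))  # 1 (wrong)
print(training_flag_fixed(TrainingMode.EVAL))  # 0
```

An explicit comparison also keeps working if the caller passes a plain `bool` or `int` by mistake, since those simply compare unequal to the enum member.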
pengwa	a3e7da60e7	Trade subgraph recompute for memory (#12852 ) Description: Subgraph-level recompute This PR adds an optional capability trading additional re-computation for better memory efficiency. Specifically, a pre-defined operator list is used to iterate the Graph to find subgraphs for recompute, to reduce stashed activations whose lifetime spans the forward and backward passes. When training with ORTModule, by default, the graph transformer will scan the execution graph to find all eligible subgraphs to recompute, along with the sizes that can be saved. An example is shown below. If we want to enable some of them for recompute, we can define the env variable this way: `export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```
[1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>: User config:
[1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>: =================================
[1,0]<stderr>: Subgraph: BitmaskDropout+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 1024 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: BiasGelu+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Add+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Sub+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:1024 x 1024 x Frequency:97
[1,0]<stderr>: PatternShape:3 x 1024 x Frequency:1
[1,0]<stderr>: PatternShape:8 x 64 x Frequency:24
[1,0]<stderr>: PatternShape:1024 x 4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x 1024 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: =================================
```
"Type config:" whether recompute is enabled by users. 0 - disable, 1 - enable. "Subgraph" means what kind of subgraph will be recomputed; in this case, it is a single node "Gelu", and it will be "Recompute". "Shape && Frequency" means, for this recompute, one tensor of size (batch size, 500) will be saved because it will be recomputed. Baseline On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100 GPUs. With the latest main branch, we can run batch size 16, and the maximum batch size is < 32, so 16 is usually chosen by data scientists. 65% of the 40GB memory is used during training. The SamplesPerSec=479.2543353561354. ![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png) With this PR Gelu is recomputed to reduce the memory peak, and batch size 32 can be run. 97% of the 40GB A100 memory is used, and the SamplesPerSec=562.041593991271 (1.17X of baseline). ![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)	2022-11-03 13:49:41 +08:00
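The `ORTMODULE_ENABLE_MEMORY_ALLEVIATION` value shown in the commit above is a comma-separated list of `pattern:type:field` entries. The sketch below only assembles such a string and places it in the process environment; the helper name `build_alleviation_config` is hypothetical, the meaning of the third field (`-1` in the example) is not explained in the commit message and is passed through untouched, and it is an assumption here that the variable has to be set before ORTModule builds its graph.

```python
import os


def build_alleviation_config(entries):
    """Join (subgraph_pattern, alleviation_type, trailing_field) triples.

    Hypothetical helper. The format mirrors the example in the commit message,
    e.g. "BiasGelu+:1:-1"; per the message, type 1 means enabled and 0 disabled.
    """
    return ",".join(
        f"{pattern}:{alleviation_type}:{trailing}"
        for pattern, alleviation_type, trailing in entries
    )


config = build_alleviation_config([
    ("BiasGelu+", 1, -1),
    ("Mul+Add+", 1, -1),
])

# Assumption: set this before ORTModule is constructed so it is picked up.
os.environ["ORTMODULE_ENABLE_MEMORY_ALLEVIATION"] = config
print(config)  # BiasGelu+:1:-1,Mul+Add+:1:-1
```

Building the string from the subgraph names printed in the MemoryAlleviation summary avoids hand-editing a long quoted `export` line when only a few patterns should be recomputed.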
Wei-Sheng Chin	b5904c40dd	Enable ORT in TorchDynamo (#13259 ) This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.	2022-11-01 11:19:29 -07:00
Baiju Meswani	c557a55816	Fix on-device training ExportModelForInferencing api (#13510 )	2022-10-31 21:29:06 -07:00
Baiju Meswani	a46c599a40	Training API to export the eval model to an inference model (#13345 )	2022-10-27 09:34:01 -07:00
Vincent Wang	805ec459a0	Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462 )	2022-10-26 15:45:18 -07:00
Vincent Wang	b6a3562ffb	[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430 ) Add env variable to control disabling custom autograd function support. When using ORTModule, if the torch model has a torch.autograd.Function and the user confirms that it can be exported to ONNX (for example, by inlining PythonOp) and that the backward implementation matches the forward implementation, the user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to disable the custom autograd support so that it won't use ORT's PythonOp to fall back to PyTorch. Exporting to ONNX can sometimes leverage graph optimizations in ORT so that perf is better.	2022-10-25 16:58:04 +08:00
Vincent Wang	67150baa8d	[ORTModule] ATen Support for aten::upsample_nearest (#13364 ) ATen support for aten::upsample_nearest, which is required for Huggingface's diffusers model training using ORTModule.	2022-10-20 08:30:04 +08:00
Vincent Wang	b6b3f41636	Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347 ) The PR applies some fixes to Hierarchical ORTModule and ORTModule PythonOp. For Hierarchical ORTModule: - Don't wrap a module if the caller calls a function other than forward() - Support a single module instance being called multiple times with different types of inputs - Check whether a module can be wrapped from top to bottom instead of from bottom to top For ORTModule PythonOp: - Add env variable control to allow using torch.utils.checkpoint.CheckpointFunction - Add env variable control to skip registering some autograd functions so that there is no conflict for some models.	2022-10-20 08:16:03 +08:00
Adam Louly	68eff69ab1	Add Utils for federated learning scenarios (#13014 ) Description: utils for federated learning. Motivation and Context - This PR includes utils that will be used on federated learning scenarios. - Exposing python bindings to some utils, and added a util to calculate the difference between two buffers. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-10-17 12:39:43 -07:00
Vincent Wang	807b2f4dd5	[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296 ) This PR is to add support of using env variable to set provider option cudnn_conv_algo_search so that user can choose better conv algo search method to run model. This is a quick fix to unblock the test of MoE model. Will have another PR to design and implement the ORTModule config so that we can config ORTModule using Python script or config file instead of env variable.	2022-10-13 15:36:21 +08:00
Vincent Wang	6fb70a82df	[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305 ) Update supported deepspeed highest version from 0.7.1 to 0.7.3 for FP16_Optimizer. Also add version info to warning log.	2022-10-13 13:03:01 +08:00
Vincent Wang	afb5f76770	[ORTModule] ATen Support for torch.nn.GroupNorm (#13293 ) Models from [huggingface's diffusers library](https://github.com/huggingface/diffusers) use torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's InstanceNormalization, which lacks a gradient. The implementation of ORT's InstanceNormalization calls cuDNN's BatchNorm for part of the computation, which is not efficient compared to PyTorch's implementation. This PR uses ATen fallback to support this torch module, including its forward and backward.	2022-10-13 11:59:03 +08:00
Vincent Wang	a2658f0784	[ORTModule] Fix Graph Builder for Eval Mode (#13255 ) The current graph builder for ORTModule applies the training graph optimizations in both training and eval mode. Taking BatchNorm as an example, one of the training graph optimizations replaces the BatchNormalization Op with BatchNormInternal, which is for training only. This PR fixes this: in eval mode, we do not apply the training graph optimizations. The inference graph optimizations are applied during InferenceSession initialization.	2022-10-12 14:39:54 +08:00
Vincent Wang	b9e23bd086	[ORTModule] Fix Custom Op Registry for Torch 1.13+ (#13250 ) This PR has two fixes: - https://github.com/pytorch/pytorch/pull/85636 change the behavior of register_custom_op_symbolic to only register the symbolic function at a single version. For ORTModule we need to pass the op_set version when calling it. - Since torch_1.13 the signature of einsum is changed to have a new argument, need to change our custom op symbolic registry code accordingly. Without the fixes, ORTModule will not work with the nightly torch, and the new torch version will be released.	2022-10-11 15:20:51 +08:00
Baiju Meswani	bcc93ab17c	Deprecate ORTTrainer (#13022 )	2022-09-23 18:10:09 -07:00
Adam Louly	268bfe2a5d	python training api bindings (#12610 ) Description: Python API Bindings for on device training. Motivation and Context - This PR contains api bindings so python users can perform a whole training loop. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-09-16 09:38:24 -07:00
Thiago Crepaldi	55c745eefd	Add support for ORTModule Torch cpp CUDA extension build within docker (#12868 ) Currently, CUDA hardware is not available to the build during `docker build`. Because of that, the resulting image would lack CUDA support even when run on CUDA-capable hardware. This PR adds an env variable, ONNXRUNTIME_FORCE_CUDA, which allows CUDA extensions to be compiled even when CUDA support is not detected.	2022-09-08 15:30:44 -04:00
guyang3532	4765e5c382	Using ORTModule to wrap an evaluation model should not change the mode (#12747 ) Using ORTModule to wrap an evaluation model should not change the mode of the model	2022-09-08 10:54:59 +08:00
Baiju Meswani	56bae3b196	Use InplaceClipGradNorm for offline processing for on-device training (#12603 )	2022-09-02 07:47:17 -07:00
Cheng	5dd9afe75a	python lint (#12825 )	2022-09-01 22:38:25 +08:00
Justin Chu	a48b115540	Remove reference to the deprecated variable in `torch.onnx.symbolic_helper` (#12452 ) Description: Remove reference to the deprecated variable in `torch.onnx.symbolic_helper` pytorch/pytorch#81953 - Removed unused imports - Changed BANNED_AUTOGRAD_FUNCTION_NAMES to a frozenset Motivation and Context The cast_pytorch_to_onnx variable is deprecated and removed in `torch.onnx.symbolic_helper`. Since there is still a need for converting scalar types to onnx type, I copied the mapping to `_CAST_PYTORCH_TO_ONNX` in the module.	2022-08-31 11:55:56 -07:00
Yulong Wang	1a402a3f25	replace 'master' branch ref to 'main' for onnx repo (#12678 )	2022-08-30 13:41:42 -07:00
pengwa	a0c25e5c2f	Fix segment fault for alltoall (#12701 ) * fix segment fault * formatting	2022-08-30 11:27:14 +08:00
Vincent Wang	53ecb9e635	Update Supporting DS Version to 0.7.1 for ORTModule (#12696 ) update ds version support for fp16_optimizer	2022-08-24 14:56:12 +08:00
Yulong Wang	c144acc534	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
Wei-Sheng Chin	dc486d146b	Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460 ) * Make ORT as Pytorch JIT backend LORT likely doesn't work with aten fallback so we only test LORT in its own CI. * Revert changes to enable external CUDA allocator. Will add it later. Revert "Revert changes to enable external CUDA allocator. Will add it later." This reverts commit d5487f2e193014c805505afae8fb577c53667658. Fix external allocator * Relax tolerance and remove commented code * Print more information in CI * Fix pointer * Address comments. 1. Reuse ORT-eager mode's environment. 2. Remove unused ctor. * Use Pytorch master branch as all PRs are merged Fix * Refine based on cpplint feedbacks * Revert changes to allow custom CUDA allocator in public APIs * Use torch.testing.assert_close * Use unittest framework * Switch docker repo * Rename .cpp to .cc * Address comments * Add comment * Use same pipeline file for eager and lort pipelines * Address comments * Add yaml comment * Fix cmake files * Address comments * Rename flags, remove printing code, remove dead comment	2022-08-22 09:40:40 -07:00
Vincent Wang	a078c8d99b	Update Supporting Deepspeed Version of ORTModule's FP16_Optimizer (#12668 )	2022-08-22 22:22:53 +08:00
pengwa	24eab921be	Enable PythonOp for --enable_training_torch_interop build (#12539 ) * enable PythonOp by default when --enable_training_torch_interop is enabled during build * clean up * fix * fix comment * fix * fix tests * fix fallback test * pylint format * refine based on comments	2022-08-12 00:49:30 +08:00
Adam Louly	2681648f5b	Load checkpoint in cpp (#12352 ) * Load checkpoint in cpp * removed unused imports * throw error on invalid name and change function name * inplace model assignment, change name and other comments resolved * name change on import * Added unit test, resolved comments * remove unused imports * resolved comments * refactoring to reduce memory allocation * resolved extra comments * changed files hierarchy and force added onnx model * solved order of function argument * used gtest macros on test cases Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-08-09 12:30:50 -07:00
pengwa	a2dc3e9eac	Improve the compilation speed when compiling for multiple architectures. (#12490 ) * improve the compilation speed when compiling for multiple architectures. * formatting * fix * use 0 by default * fix comments	2022-08-09 11:52:26 +08:00
Vincent Wang	e85e31ee80	Update ORTModule Default Opset Version to 15 (#12419 ) * update ortmodule opset to 15 * update torch version * fix ut * fix ut * rollback * rollback for orttrainer	2022-08-05 16:55:04 +08:00
Baiju Meswani	7f58bd7236	Perform graph transformations during offline tooling (#12422 )	2022-08-03 11:27:12 -07:00


335 commits