onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-09 00:30:53 +00:00

Author	SHA1	Message	Date
Ashwini Khade	ea7bbd667d	fix headers for training apis (#14350 ) ### Description Minor refactor PR for fixing header placement for training apis	2023-01-19 10:26:53 -08:00
Adam Louly	f0555eb437	Improved test cases by using parameters (#14246) ### Description Completing some missing parts of some test cases for python bindings ### Motivation and Context Some test cases, such as test_training_module_checkpoint and test_optimizer step, could not be completed before because we had no access to the parameters, which is needed to check that the parameters change after the optimizer step and that the parameters saved in a checkpoint remain the same. Now that we have access to the vector of parameters through the exposed get_contiguous_parameters() method, we can complete the tests.	2023-01-13 12:54:23 -08:00
Ashwini Khade	cc7799835e	Enable a single build with optimized inference and on device training (#14241 ) ### Description Right now prepacking code is not compiled when training is enabled. Our partners want a single build of ort which can do both optimized inference + training on device. This PR enables prepacking code in a training build and controls whether it is enabled or not using already existing session option - kOrtSessionOptionsConfigDisablePrepacking For Inference scenarios - prepacking will be turned on by default and this behavior remains the same after this PR too. For training scenarios - prepacking will be disabled by default and if user explicitly enables it then an error will be thrown. ### Motivation and Context Enable both optimized inference as well as on device training in a single build. For on device training use flag --enable_training_apis.	2023-01-12 21:36:43 -08:00
Vincent Wang	fb3c1221e4	Fix Prefast Warning (#14250 ) Fix two prefast:Warning related to constexpr.	2023-01-13 10:16:35 +08:00
Scott McKay	dd2df460b3	Split(18) (#14015) ### Description Opset 18 Split changes. Adds ability to specify num_outputs which also allows uneven splitting. https://github.com/onnx/onnx/releases/tag/v1.13.0 ### Motivation and Context Support ONNX opset 18.	2023-01-12 08:14:10 +10:00
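The uneven splitting that `num_outputs` allows can be sketched in a few lines. This is only an illustration of the opset 18 rule as commonly described (chunks of size ceil(dim / num_outputs), with a smaller last chunk when the dimension is not evenly divisible), not code from the PR; `split_sizes` is a hypothetical helper name, and the degenerate case where the last chunk would be empty is deliberately rejected rather than modelled.

```python
import math
from typing import List


def split_sizes(dim: int, num_outputs: int) -> List[int]:
    """Return the chunk sizes for splitting an axis of length `dim`.

    Assumption: every chunk has size ceil(dim / num_outputs) except the
    last one, which takes whatever remains.
    """
    if num_outputs <= 0:
        raise ValueError("num_outputs must be positive")
    chunk = math.ceil(dim / num_outputs)
    last = dim - chunk * (num_outputs - 1)
    if last <= 0:
        # Out of scope for this sketch: the last chunk would be empty.
        raise ValueError("dim is too small for this many outputs")
    return [chunk] * (num_outputs - 1) + [last]


print(split_sizes(6, 3))  # even split
print(split_sizes(7, 3))  # uneven split, last chunk is smaller
```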
pengwa	a4180d79c5	Multi-tensor SGDOptimizer (on device training) (#14083 ) Implement SGDOptimizerV2 taking sequence of weights and gradients as inputs. For CPU EP and CUDA EP only. Added tests.	2023-01-11 10:15:53 -08:00
Ashwini Khade	d92c663f28	Create dedicated build for training api (#14136 ) ### Description Enable creating dedicated build for on device training. With this PR we can build a lean binary for on device training using flag --enable_training_apis. This binary includes only the essentials like training ops, optimizers etc and NOT features like Aten fallback, strided tensors, gradient builders etc . This binary also removes all the deprecated components like training::TrainingSession and OrtTrainer etc ### Motivation and Context This enables our partners to create a lean binary for on device training.	2023-01-10 20:58:04 -08:00
Xavier Dupré	79dc39600f	Replace distutils by setuptools to import build_ext (#14108 ) ### Description Uses setuptools instead of distutils. ### Motivation and Context Fixes #14107.	2023-01-09 11:48:01 +01:00
Baiju Meswani	c6ff5bac9d	Update torch in eager mode CI pipeline (#14094 )	2023-01-06 11:46:44 -08:00
Adrian Lizarraga	68794d0ac1	Improve custom op library handle cleanup (#14099 ) ### Description - Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages the lifetime of dynamic library handles (i.e., calls `dlclose` or `FreeLibrary`). - Deprecates C API `OrtApi::RegisterCustomOpsLibrary`. - Adds C++ API wrapper for convenient registering of custom op libraries. - `PySessionOptions` is now an alias of `OrtSessionOptions` ### Motivation and Context The current API for registering custom op libraries loads dynamic libraries but requires users to handle the release of the corresponding library handles. Additionally, the user has to make sure to release the library handle _after_ the session has been destroyed (or the program segfaults). The new API automatically cleans up the library and allows the user to write more straightforward code.	2023-01-04 17:56:29 -08:00
Baiju Meswani	0ff61f7b97	Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055 )	2023-01-03 20:03:33 -08:00
cao lei	b29a1c7348	Address follow-up comments on multistream pr #13495 (#13992 ) ### Description This PR is to address follow-up comments for the multi-stream pr https://github.com/microsoft/onnxruntime/pull/13495 Changes including: - Make StreamAwareArena transparent to minimal build - Make DeviceStreamCollection transparent to minimal build - Replace ORT_MUST_USE_RESULT with [[nodiscard]] - Remove unnecessary shared_ptr ### Motivation and Context This PR is to address follow-up comments for the multi-stream pr https://github.com/microsoft/onnxruntime/pull/13495 Co-authored-by: Lei Cao <leca@microsoft.com>	2023-01-03 16:33:36 -08:00
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964) ### Description 1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers or non-edge devices. 2. Update ENABLE_TRAINING option: With this PR when this option is enabled, training apis and torch interop are also enabled. 3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: - Removed user facing option - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop. Once this PR is merged, when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features). Training entry points include: 1. ORTModule 2. Training APIs Features include: 1. ATen Fallback 2. All Training OPs, including communication and collectives 3. Strided Tensor Support 4. Python Op (torch interop) 5. ONNXBlock (Front end tools for training artifacts prep when using training apis) ### Motivation and Context Intention is to simplify the options for building training enabled builds. This is part of the larger work item to create a dedicated build for learning on the edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
Dmitri Smirnov	5d729839b5	Support loading widechar paths on windows (#14066 ) ### Description Make GetRuntimePath() and LoadDynamicLibrary() operate on platform specific paths ### Motivation and Context This addresses https://github.com/microsoft/onnxruntime/issues/14063	2022-12-30 16:30:11 -08:00
Vincent Wang	0c3480e565	[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069) PyTorch removed upsample_nearest related backward functions with "vec" overload name since 1.13. The functions without overload name are available for all versions, though they are not that convenient to use. This PR changes the gradient builder code to use functions without overload name for ATen upsample_nearest nodes. This PR also fixes a bug in an ORTModule corner case introduced by the multi-stream PR. There is some code that executes the barrier step for a triggered downstream if the barrier is out of range. But this should be applied to a triggered downstream only. If it's a normal run with the start step as a barrier step but out of range, we should not apply the logic. For example, for ORTModule, if the barrier is the 1st step of the whole CPU plan and the forward part is empty, then the forward normal run will run steps from start-0 to end-0 (actually nothing), and step-0 is the barrier, so we should not execute the barrier in such a case.	2022-12-27 10:18:30 +08:00
Adam Louly	e49f358686	expose lr scheduler python bindings for on device training. (#13882 ) ### Description Exposing LR Scheduler python bindings for on device training. Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-12-22 18:44:04 -08:00
fxmarty	4d2dc8bbbd	Replace all numpy.bool by python builtin bool (#14014 ) `numpy.bool` has been removed as from 1.24.0. It was before an alias for python's `bool`. Fixes https://github.com/huggingface/optimum/issues/610 ### Motivation and Context Numpy 1.24.0 breaks for example IO binding helpers.	2022-12-23 09:27:23 +10:00
Baiju Meswani	1b58331fb3	[QAT] Graph transformer to fuse QDQ pattern into FakeQuant (#13777 ) To perform QAT in onnxruntime, `FakeQuant` op was introduced in #13649. The onnxruntime quantization tool generates a post training static quantization onnx model with `QuantizeLinear`->`DequantizeLinear` nodes. To perform QAT, this pattern needs to be transformed to `FakeQuant`. This pull request introduces a graph transformer that looks for the `Q->DQ` pattern and fuses it to a `FakeQuant` node.	2022-12-22 09:44:39 -08:00
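The `Q->DQ` to `FakeQuant` fusion described above relies on a quantize step followed by a dequantize step being expressible as a single function. A minimal scalar sketch, assuming an int8 range and per-tensor scale and zero point; the function names and signatures are illustrative and are not the actual `FakeQuant` op definition.

```python
def quantize_linear(x: float, scale: float, zero_point: int,
                    qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to an integer grid and saturate to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize_linear(q: int, scale: float, zero_point: int) -> float:
    """Map the integer back to a float."""
    return (q - zero_point) * scale


def fake_quant(x: float, scale: float, zero_point: int,
               qmin: int = -128, qmax: int = 127) -> float:
    """Fused form: stays in float but lands on the same quantization grid."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


for value in (1.1, -0.3, 100.0):
    two_step = dequantize_linear(quantize_linear(value, 0.25, 0), 0.25, 0)
    assert fake_quant(value, 0.25, 0) == two_step
```

Keeping the fused form in floating point is what makes it usable inside a training graph, where the rounded value still has to flow into later float operators.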
pengwa	2f5bf75e51	Optimize computation orders (#13672) ### Optimize computation orders In `Roberta/Electra`, when `ClassificationHead` is used, there is a slicing operation on the features along the sequence_length dimension, and the loss calculation then depends only on this sliced data. This is a slice at axis 1. Before slicing the shape is [batch, sequence_length, hidden]; after slicing, it becomes [batch, hidden_stage]. We have opportunities to move this slicing as early as possible, by passing it through simple elementwise ops (like Add/Div), or Layernorm/Softmax (if their reduce axis is after the slicing axis), or even MatMul's left operand (as long as the last dims are not affected). Operators like Reshape/Transpose are special, since they either have data specified (which needs updating after slicing) or have perm specified, which requires the input rank to remain unchanged. So for those kinds of operators, we keep the original rank and just leave the sliced dim as 1; after the compute completes, we do a Squeeze.
```
class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```
src\transformers\models\roberta\modeling_roberta.py src\transformers\models\electra\modeling_electra.py #### Benchmark A simple benchmark shows Roberta training latency dropped from about 208ms to about 199ms, a 4.5+% reduction. More comprehensive tests are on the way.	2022-12-22 15:12:52 +08:00
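The reason the slice can be moved ahead of elementwise ops is that both orders produce the same values, while the early slice does the elementwise work on far fewer elements. A small pure-Python sketch of that equivalence, using nested lists shaped [batch][sequence][hidden]; the helper names are made up for the example and nothing here is taken from the transformer's implementation.

```python
def add_bias(token, bias):
    """Elementwise Add over the hidden dimension of one token."""
    return [v + b for v, b in zip(token, bias)]


def slice_late(features, bias):
    """Original order: Add over every token, then keep token 0."""
    biased = [[add_bias(token, bias) for token in sequence] for sequence in features]
    return [sequence[0] for sequence in biased]


def slice_early(features, bias):
    """Reordered: keep token 0 first, then Add on the sliced data only."""
    return [add_bias(sequence[0], bias) for sequence in features]


features = [
    [[1, 2], [3, 4], [5, 6]],      # batch item 0: 3 tokens, hidden size 2
    [[7, 8], [9, 10], [11, 12]],   # batch item 1
]
bias = [100, 200]

assert slice_late(features, bias) == slice_early(features, bias)
print(slice_early(features, bias))
```

In this toy case the late version performs the Add on 6 tokens and the early version on 2, which is the saving the reordering is after.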
PeixuanZuo	ab2dd8dfaf	[ROCm] Update ROCm and MigraphX CI to ROCm5.4 (#14011 ) Update ROCm and MigraphX CI to ROCm5.4 Run ortmodule_test with ROCm5.4 and all passed(https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d). Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:05 +08:00
Tang, Cheng	a81faee41e	Multi-stream execution support (#13495) Description: This PR includes the following work: 1. provide stream and related synchronization abstractions in onnxruntime. 2. enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel. 3. deprecate the parallel executor for cpu. 4. deprecate the Fence mechanism. 5. update the cuda / tensorrt EP to support the stream mechanism, and support running different requests in different cuda streams. Motivation and Context: Currently, the execution plan is just a linear list of those primitives, and ort will execute them step by step. For any given graph, ORT will serialize it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations: 1. it is difficult to enable inter-node parallelization; we have a half-baked parallel executor but it is very difficult to make it work with GPU. 2. The fence mechanism can work with the single gpu stream + cpu thread case, but when extended to multiple streams, it is difficult to manage the cross GPU stream synchronizations. 3. our cuda EP relies on the BFCArena to make the memory management work with the GPU async kernels, but the current BFCArena is not aware of the streams, so it doesn't behave correctly when run with multiple streams. This PR enhances our existing execution plan and executor to support multiple stream execution. We use a unified algorithm to manage both single stream and multiple stream scenarios. This PR mainly focuses on the infrastructure support for multiple stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR. 
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com>	2022-12-15 07:39:29 -08:00
Baiju Meswani	1fd63487fd	ORTModule support for kwargs input that is a dict (#13910 )	2022-12-14 16:23:48 -08:00
Baiju Meswani	5a55fac402	Miscellaneous updates to training apis (#13929 )	2022-12-14 13:33:07 -08:00
Baiju Meswani	8c249cc8f7	[QAT] FakeQuantGrad and gradient building for FakeQuant (#13825 )	2022-12-14 11:54:02 -08:00
Ashwini Khade	6090d8cd6e	Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888) ### Description Fix usage of enable_training_ops and reduce ifdef complexity for training builds. ### Motivation and Context This is the second refactoring PR towards creating a dedicated build for on device training. This PR aims to reduce some complexity. We can set ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we don't have to use if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE) everywhere in the code.	2022-12-14 08:32:46 -08:00
PeixuanZuo	80a046b36f	[ROCm] update amd CI huggingface model performance number (#13961 ) Fix CI test failure. Test distilbert-base model performance number on gcramdrr1-mi100-08x and update.	2022-12-14 16:30:25 +08:00
Ashwini Khade	a7bc927b4b	fix typos in training apis (#13908) ### Description This PR fixes some typos in the training apis. We need to add more tests and make sure they are all run on the CIs to capture such issues. These changes are out of scope of this PR. Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-12-09 16:01:11 -08:00
Adam Louly	fb4707f76d	add cuda support to python bindings (#13700 ) ### Description Add cuda support to the on device training python bindings. ### Motivation and Context Now users can set the execution provider (cpu or cuda) when using python bindings for on device training apis.	2022-12-08 16:03:53 -08:00
Adam Louly	f453d2845e	adding get and set lr for optimizer (#13661 ) ### Description Exposing get and set Learning rate for optimizer ### Motivation and Context you can now set learning rate for optimizer.	2022-12-07 11:59:11 -08:00
Ashwini Khade	983877c712	Decouple strided tensor support from ENABLE_TRAINING (#13829) ### Description Decouple strided tensor support from ENABLE_TRAINING ### Motivation and Context This is step 1 for creating a dedicated build for on device training. Intention is 1. We can set ENABLE_STRIDED_TENSORS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we don't have to use if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE) everywhere in the code. 2. This also paves the way to easily enable strided tensor support for inference in future (if required).	2022-12-07 09:22:21 -08:00
Wei-Sheng Chin	7df8f84228	Improve DORT document (#13790 ) 1. Refine words based on PyTorch changes. 2. Make the need of inference mode clearer. A test is added.	2022-11-30 16:55:25 -08:00
Wei-Sheng Chin	639d285670	[DORT] Catch up with yesterday's PyTorch change (#13779 ) Fix recent CI failures.	2022-11-30 09:23:44 -08:00
Xavier Dupré	441b30b2d2	Move a function call outside a loop in ORTModule (#13771 ) ### Description The proposed change is useful for ORTModule when the output graph has multiple outputs. ### Motivation and Context performance Signed-off-by: xadupre <xadupre@microsoft.com>	2022-11-30 12:49:41 +01:00
Baiju Meswani	2c29938846	[QAT] Introduce FakeQuant op (#13649 )	2022-11-29 08:43:37 -08:00
pengwa	7c53b6eee8	Skip the tests of saving tensor in backward (#13767 ) ### skip the tests of saving tensor in backward The test failed randomly; Let's skip it until the issue got fixed to unblock the CIs.	2022-11-29 13:02:26 +08:00
Vincent Wang	3c258c878c	[CUDA] Optimize Slice Kernel (#13641) The PR optimizes the Slice CUDA kernel in two ways: - Coalesce dimensions so there are fewer divmod operations during the kernel compute - Split data load and write for better memory throughput The table below shows some perf results (cycle counts from Nsight Compute) on a V100 using real cases from Huggingface's XLNet model:

| | Old | New |
| -- | -- | -- |
| [8,12,2048,1024], axis=2, start=1, end=2048 | 1838687 | 1539846 |
| [8,12,1024,2047], axis=3, start=0, end=1024 | 951383 | 722203 |
	2022-11-29 09:18:03 +08:00
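The dimension coalescing mentioned in this commit can be pictured as follows: when only one axis is sliced, all axes before it and all axes after it can each be merged into a single dimension, so a flat output index needs fewer divmod steps to be decomposed. The sketch below is a host-side illustration under that assumption; `coalesce_for_slice` is a hypothetical helper and the real CUDA kernel may coalesce differently.

```python
from math import prod


def coalesce_for_slice(shape, axis):
    """Merge the dims before and after the sliced axis.

    Returns (outer, sliced, inner); a dim of 1 can be ignored entirely.
    """
    outer = prod(shape[:axis])
    inner = prod(shape[axis + 1:])
    return (outer, shape[axis], inner)


def divmods_needed(shape):
    """Decomposing a flat index over N non-trivial dims takes N - 1 divmods."""
    non_trivial = [d for d in shape if d != 1]
    return max(len(non_trivial) - 1, 0)


# The two shapes from the perf table in the commit message.
for shape, axis in (([8, 12, 2048, 1024], 2), ([8, 12, 1024, 2047], 3)):
    merged = coalesce_for_slice(shape, axis)
    print(shape, "->", merged, divmods_needed(shape), "->", divmods_needed(merged))
```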
Changming Sun	87e6a26c5d	Enforce Prefast check in Windows CPU CI pipeline (#13735) Right now we fix the warnings in an ad-hoc way. We run static analysis in nightly builds, then create work items for the findings. Our CI build pipelines run the same scan but do not break the build. So, this PR fixes the remaining findings in the CPU EP (including the training part) and enforces the check. Later on we can continue to expand the scope. We still have some warnings left in the JNI part. I will try to address them later in the next month.	2022-11-23 09:25:02 -08:00
guyang3532	ba9a585fcc	Fix the tensor save for backward release problem (#13679) Motivation: PythonOp saves its inputs for backward. This is risky because the ONNX Runtime backend is not aware of it: the tensor buffer may be "released" by ORT and then potentially modified by other operators before the backward function executes. Fix: This PR simply clones all inputs of PythonOp before forward is invoked. This may have high overhead; it is just a workaround until there is a better fix.	2022-11-22 17:32:19 +08:00
pengwa	947aab0ae0	Make HF converge with Lightning native amp (#13616) ### Fix training convergence issues #### Problem: Huggingface Transformers: 4.22.0 PyTorch Lightning: 1.6.3 PyTorch: v1.12.1, cuda 11.6 ORT: main branch, cuda 11.6 Model: RobertaForSequenceClassification @ models/roberta/modeling_roberta.py Mixed Precision training with `torch.autocast`: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)` Under this amp autocast context, forward + loss computation run. Here is a snippet of the loss computation.
```
if labels is not None:
    ...
    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            ...
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        ...
return SequenceClassifierOutput(
    loss=loss,
    logits=logits,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)
```
After the forward run, the loss is 1.0850 in float16, which looks good. The loss is then scaled up here: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62)`; the scaler is 65536. This gives a scaled loss of 71104 in float32 (a float16 loss multiplied by an fp32 scaler gets promoted to fp32). Backward then starts with initial grads of 1, and the backward step computes 1 (float32) * 65536 (float32) but generates a float16 gradient, so we get an `inf`. This is where the problem occurs. The backward pass feeds the `inf` into the crossentropygradient op, which generates `nan`s, and then all gradients become `nan` during back propagation. So when training with ORTModule it almost always overflows, and the loss does not drop much compared with PyTorch.
#### Analysis for the UT (when autocast enabled) PyTorch trace graph looks like this:
```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
      %target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
      %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
  %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %12 : NoneType = prim::Constant()
  %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %17 : NoneType = prim::Constant()
  %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %19 : NoneType = prim::Constant()
  %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
  %21 : NoneType = prim::Constant()
  %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %data : Float(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
  %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
  return (%27)
```
The most important lines: %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %input : _Half_(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 _Float_(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%_input_, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 `aten::cross_entropy_loss` takes a Half input and returns a Float output. As said in the doc https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32, `cross_entropy` in autocast mode will run in fp32 mode, i.e. convert its input to fp32 (if it is not already), do the compute and return an fp32 result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal` takes the same type for input and output, and our code `31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)` assumed this when exporting `aten::cross_entropy_loss`, and set the output to be fp16 as well. This is the reason we have the problem. #### Possible Fixes 1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types of input and output. 2. Check the input and output when exporting, and add the input cast explicitly if there is type promotion from input to output. This PR uses the 2nd approach. We can start the 1st approach later when needed. TODO: revisit all other exporter functions, add the checks, etc.	2022-11-22 15:08:30 +08:00
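The overflow described in this commit message can be reproduced with plain arithmetic, since the largest finite float16 value is 65504 and the loss scaler is 65536. The sketch below uses only the standard library; `to_fp16` is a hypothetical helper that emulates a cast to float16 by round-tripping through `struct`'s half-precision format and mapping the overflow error to infinity, which is an assumption about how the cast behaves on the device.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Emulate casting to float16 (values too large become +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


scaler = 65536.0
loss_fp16 = to_fp16(1.0850)          # the loss as it is stored in float16

# float16 loss * fp32 scaler is promoted, so the scaled loss is still finite.
scaled_loss_fp32 = loss_fp16 * scaler
print(scaled_loss_fp32)

# The initial gradient 1 * 65536, emitted as float16, is not representable.
grad_fp16 = to_fp16(1.0 * scaler)
print(grad_fp16)

# Once an inf enters the gradient computation, nan follows easily.
print(grad_fp16 - grad_fp16)
```

Keeping the output of the loss in float32, as the chosen fix does by adding an explicit cast on the input side, avoids ever materializing the scaled gradient in float16 at that point.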
Changming Sun	a5c2047dd1	Fix the remaining Prefast warnings in CPU EP (#13707 ) ### Description Fix the remaining Prefast warnings in CPU EP.	2022-11-21 10:21:38 -08:00
Wei-Sheng Chin	6160ba0692	Fix aten::_to_copy in DORT (#13682) `aten::_to_copy` is not exportable to ONNX, so in DORT it is replaced in `_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch commit, and this PR is a fix. Basically, we examine more keyword attributes passed to `aten::_to_copy` and, if they lead to a type casting operator (i.e., mapped to ONNX's Cast), we replace that `aten::_to_copy` with `aten::to`. Unsupported attributes are removed (with a low risk of breaking the FX graph's assumptions).	2022-11-18 09:31:18 -08:00
Vincent Wang	07812a2fa6	Fix UT Failure on AMD for ORTModule's Conv Test (#13688 ) Currently provider option conv_algo_search is for CUDA only, so remove the checking for ROCm EP.	2022-11-18 17:52:22 +08:00
cloudhan	9e649d1ac4	Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601 ) This ports #13116 from ROCm EP to CUDA EP	2022-11-15 14:43:54 +08:00
Vincent Wang	2bda3fd341	Gather to Slice Fusion (#13599) This PR optimizes the execution of the code below, from Huggingface's XLNet model. ``` x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long)) ``` The code is exported to Range->Gather, which can be fused into a Slice op. The Slice kernel is much faster than Gather, especially for the backward run. The main reason is that for Gather the data in indices can be duplicated, so the backward needs a sum, whereas a Slice node cannot have such a case. Profiling with Huggingface's XLNet model: - Before the fusion: forward, ~753us ![image](https://user-images.githubusercontent.com/11661208/200758439-63f2f9b5-9610-4df8-98c8-a1ad4dc62f4e.png) backward, ~46101us ![image](https://user-images.githubusercontent.com/11661208/200758530-fe16a8ec-ea8f-4b79-b3ac-386b72ba1670.png) - After the fusion: forward, ~627us ![image](https://user-images.githubusercontent.com/11661208/200758654-ab9a6068-c45d-40f4-9c71-3862a56732f8.png) backward, ~677us ![image](https://user-images.githubusercontent.com/11661208/200758833-aab1b8e1-1b5d-4e55-88cf-03c2a1d9d42b.png)	2022-11-10 13:03:30 +08:00
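Why the fusion is valid, and why the backward pass benefits most, can be shown on a single 1-D row. This is a pure-Python sketch with made-up helper names, not the ORT kernels: a Gather whose indices are `range(klen)` selects the same elements as a slice, but a general Gather gradient has to accumulate because indices may repeat, while a Slice gradient is a plain copy.

```python
def gather(row, indices):
    return [row[i] for i in indices]


def gather_grad(grad_out, indices, size):
    """Gradient of Gather: must accumulate, since indices may repeat."""
    grad_in = [0.0] * size
    for g, i in zip(grad_out, indices):
        grad_in[i] += g
    return grad_in


def slice_grad(grad_out, start, end, size):
    """Gradient of Slice: each output maps to exactly one input, so copy."""
    grad_in = [0.0] * size
    grad_in[start:end] = grad_out
    return grad_in


row = [10.0, 11.0, 12.0, 13.0, 14.0]
klen = 3
indices = list(range(klen))          # what Range produces

# Forward: Range -> Gather equals Slice.
assert gather(row, indices) == row[:klen]

# Backward: same result here, but only Gather has to handle duplicates.
grad_out = [1.0, 1.0, 1.0]
assert gather_grad(grad_out, indices, len(row)) == slice_grad(grad_out, 0, klen, len(row))
print(gather_grad([1.0, 2.0], [0, 0], 3))   # duplicated index is summed
```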
Edward Chen	9e65f3bfdb	Replace deprecated Python dependency sklearn with scikit-learn. (#13585 )	2022-11-08 09:08:29 -08:00
pengwa	ab9ac2acc4	Add guidelines for ORTModule (#13553) ### Add guidelines for ORTModule As title. Feel free to let me know if I missed something.	2022-11-04 19:42:10 +08:00
zhijiang	1977b7ed6a	Fix pythonop training_mode in evaluation mode (#13514) A customer reported this issue: they see many warnings when doing the evaluation using ORTModule. ![image](https://user-images.githubusercontent.com/10530022/199371757-5fed7d05-a951-4f1b-8f88-049c5ab89886.png) After investigation, we found that `training_mode` is exported with a wrong value in evaluation mode: its value should be 0, but we found it is 1. Fix: fix the pythonop training mode. If training_mode's type is torch._C._onnx.TrainingMode, then no matter whether it is EVAL or TRAINING, "if training_mode" will always be true.	2022-11-04 08:47:01 +08:00
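The pitfall behind this fix is ordinary enum truthiness. The sketch below uses a stand-in `TrainingMode` built on Python's `enum.Enum` with illustrative values; it is not `torch._C._onnx.TrainingMode` itself, and the two helper functions are hypothetical, but plain enum members are always truthy in the same way the commit message describes.

```python
from enum import Enum


class TrainingMode(Enum):
    """Stand-in for the exporter's enum; member values are illustrative."""
    EVAL = 0
    TRAINING = 1


def training_flag_buggy(mode: TrainingMode) -> int:
    # Every plain Enum member is truthy, even one whose value is 0.
    return 1 if mode else 0


def training_flag_fixed(mode: TrainingMode) -> int:
    # Compare against the member explicitly instead of relying on truthiness.
    return 1 if mode == TrainingMode.TRAINING else 0


print(training_flag_buggy(TrainingMode.EVAL), training_flag_fixed(TrainingMode.EVAL))
```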
pengwa	a3e7da60e7	Trade subgraph recompute for memory (#12852) Description: Subgraph-level recompute This PR adds an optional capability that trades additional re-computation for better memory efficiency. Specifically, a pre-defined operator list is used while iterating the graph to find subgraphs to recompute, in order to reduce the stashed activations whose lifetimes span the forward and backward passes. When training with ORTModule, by default, the graph transformer will scan the execution graph to find all subgraphs eligible for recompute, along with the sizes that can be saved. An example looks like the one below. If we want to enable recompute for some of them, we can define the env variable this way: `export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```
[1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>: User config:
[1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>: =================================
[1,0]<stderr>: Subgraph: BitmaskDropout+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 1024 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: BiasGelu+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Add+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Sub+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:1024 x 1024 x Frequency:97
[1,0]<stderr>: PatternShape:3 x 1024 x Frequency:1
[1,0]<stderr>: PatternShape:8 x 64 x Frequency:24
[1,0]<stderr>: PatternShape:1024 x 4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x 1024 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: =================================
```
"Type config:" whether recompute is enabled by users. 0 - disable, 1 - enable. "Subgraph" means what kind of subgraph will be recomputed; in this case, it is a single node "Gelu", and it will be "Recompute". "Shape && Frequency" means that, for this recompute, one tensor of size (batch size, 500) will be saved because it will be recomputed. Baseline: On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100 GPUs. With the latest main branch, we can run batch size 16, and the maximum batch size is < 32, so 16 is usually chosen by data scientists. 65% of the 40GB memory is used during training. The SamplesPerSec=479.2543353561354. ![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png) With this PR: Gelu is recomputed to lower the memory peak, and batch size 32 can be run. 97% of the 40GB A100 memory is used, and the SamplesPerSec=562.041593991271 (1.17X of baseline). ![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)	2022-11-03 13:49:41 +08:00
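The value of `ORTMODULE_ENABLE_MEMORY_ALLEVIATION` shown in this commit message is a comma-separated list of entries, each a `+`-joined operator pattern followed by colon-separated integers. A small sketch of splitting such a string into its parts, assuming exactly that layout; `parse_alleviation_config` is a hypothetical helper, and only the first integer (0 to disable, 1 to enable) is explained in the message, so the second is passed through uninterpreted.

```python
from typing import List, Tuple


def parse_alleviation_config(value: str) -> List[Tuple[str, List[int]]]:
    """Split 'Pattern+:1:-1,Other+:1:-1' into (pattern, [ints]) pairs."""
    entries = []
    for item in value.split(","):
        if not item:
            continue  # tolerate a trailing comma
        pattern, *fields = item.split(":")
        entries.append((pattern, [int(f) for f in fields]))
    return entries


config = "BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,Mul+Add+:1:-1"
for pattern, fields in parse_alleviation_config(config):
    print(pattern, fields)
```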
Wei-Sheng Chin	b5904c40dd	Enable ORT in TorchDynamo (#13259 ) This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.	2022-11-01 11:19:29 -07:00
PeixuanZuo	6740528b98	[ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525 )	2022-11-01 13:05:55 +08:00


1176 commits