onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-17 21:10:43 +00:00

Author	SHA1	Message	Date
Tang, Cheng	a81faee41e	Multi-stream execution support (#13495) Description: This PR includes the following work: 1. Provide stream and related synchronization abstractions in onnxruntime. 2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel. 3. Deprecate the parallel executor for CPU. 4. Deprecate the Fence mechanism. 5. Update the CUDA / TensorRT EPs to support the stream mechanism, so that different requests can run in different CUDA streams. Motivation and Context - Why is this change required? Currently, the execution plan is just a linear list of those primitives, and ORT executes them step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations: 1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPU. 2. The fence mechanism works for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage the cross-GPU-stream synchronizations. 3. Our CUDA EP relies on the BFCArena to make memory management work with the GPU async kernels, but the current BFCArena is not aware of streams, so it doesn't behave correctly when run with multiple streams. This PR enhances our existing execution plan and executor to support multiple-stream execution. We use a unified algorithm to manage both single-stream and multiple-stream scenarios. This PR mainly focuses on the infrastructure support for multiple-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR.
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com>	2022-12-15 07:39:29 -08:00
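As a rough illustration of the stream idea described in this commit (not ONNX Runtime's actual planner or executor), the sketch below models two streams as Python threads. Each stream runs its own steps in order, and a cross-stream dependency is expressed as a notification that one stream signals and the other waits on. The names `plan`, `run_stream`, `notifications`, and the node labels `A`, `B`, `C` are all hypothetical.

```python
import threading

# Hypothetical plan: each stream is an ordered list of steps.
#   ("run", node)     executes a node
#   ("notify", name)  signals that a producer has finished
#   ("wait", name)    blocks until the named notification is signaled
plan = {
    "stream0": [("run", "A"), ("notify", "A_done"), ("run", "B")],
    "stream1": [("wait", "A_done"), ("run", "C")],
}

notifications = {"A_done": threading.Event()}
executed = []                    # global execution order, for inspection
executed_lock = threading.Lock()


def run_stream(steps):
    """Execute one stream's steps in order, honoring waits and notifies."""
    for kind, arg in steps:
        if kind == "run":
            with executed_lock:
                executed.append(arg)
        elif kind == "notify":
            notifications[arg].set()
        elif kind == "wait":
            notifications[arg].wait()


threads = [threading.Thread(target=run_stream, args=(steps,)) for steps in plan.values()]
for t in threads:
    t.start()
for t in threads:
    t.join()

# C depends on A across streams, so A always precedes C; B and C may interleave.
print(executed)
```

Within a stream the order is fixed by the list; only the explicit wait/notify pair constrains ordering across streams, which is the property a valid stream assignment has to preserve.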
Baiju Meswani	1fd63487fd	ORTModule support for kwargs input that is a dict (#13910 )	2022-12-14 16:23:48 -08:00
Baiju Meswani	5a55fac402	Miscellaneous updates to training apis (#13929 )	2022-12-14 13:33:07 -08:00
Baiju Meswani	8c249cc8f7	[QAT] FakeQuantGrad and gradient building for FakeQuant (#13825 )	2022-12-14 11:54:02 -08:00
Ashwini Khade	6090d8cd6e	Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888) ### Description Fix usage of enable_training_ops and reduce ifdef complexity for training builds. ### Motivation and Context This is the second refactoring PR towards creating a dedicated build for on-device training. This PR aims to reduce some complexity. We can set ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, so that we don't have to use if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE) everywhere in the code.	2022-12-14 08:32:46 -08:00
PeixuanZuo	80a046b36f	[ROCm] update amd CI huggingface model performance number (#13961 ) Fix CI test failure. Test distilbert-base model performance number on gcramdrr1-mi100-08x and update.	2022-12-14 16:30:25 +08:00
Ashwini Khade	a7bc927b4b	fix typos in training apis (#13908) ### Description This PR fixes some typos in the training apis. We need to add more tests and make sure they all run on the CIs to catch such issues; those changes are out of scope for this PR. Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-12-09 16:01:11 -08:00
Adam Louly	fb4707f76d	add cuda support to python bindings (#13700 ) ### Description Add cuda support to the on device training python bindings. ### Motivation and Context Now users can set the execution provider (cpu or cuda) when using python bindings for on device training apis.	2022-12-08 16:03:53 -08:00
Adam Louly	f453d2845e	adding get and set lr for optimizer (#13661 ) ### Description Exposing get and set Learning rate for optimizer ### Motivation and Context you can now set learning rate for optimizer.	2022-12-07 11:59:11 -08:00
Ashwini Khade	983877c712	Decouple strided tensor support from ENABLE_TRAINING (#13829) ### Description Decouple strided tensor support from ENABLE_TRAINING ### Motivation and Context This is step 1 for creating a dedicated build for on-device training. The intention is: 1. We can set ENABLE_STRIDED_TENSORS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, so that we don't have to use if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE) everywhere in the code. 2. This also paves the way to easily enable strided tensor support for inference in the future (if required).	2022-12-07 09:22:21 -08:00
Wei-Sheng Chin	7df8f84228	Improve DORT document (#13790 ) 1. Refine words based on PyTorch changes. 2. Make the need of inference mode clearer. A test is added.	2022-11-30 16:55:25 -08:00
Wei-Sheng Chin	639d285670	[DORT] Catch up with yesterday's PyTorch change (#13779 ) Fix recent CI failures.	2022-11-30 09:23:44 -08:00
Xavier Dupré	441b30b2d2	Move a function call outside a loop in ORTModule (#13771 ) ### Description The proposed change is useful for ORTModule when the output graph has multiple outputs. ### Motivation and Context performance Signed-off-by: xadupre <xadupre@microsoft.com>	2022-11-30 12:49:41 +01:00
Baiju Meswani	2c29938846	[QAT] Introduce FakeQuant op (#13649 )	2022-11-29 08:43:37 -08:00
pengwa	7c53b6eee8	Skip the tests of saving tensor in backward (#13767 ) ### skip the tests of saving tensor in backward The test failed randomly; Let's skip it until the issue got fixed to unblock the CIs.	2022-11-29 13:02:26 +08:00
Vincent Wang	3c258c878c	[CUDA] Optimize Slice Kernel (#13641) The PR optimizes the Slice CUDA kernel in two ways: - Coalesce dimensions so that fewer divmod operations are needed during kernel compute - Split data load and write for better memory throughput Below are some perf results (cycle counts from Nsight Compute) on V100 using real cases from Huggingface's XLNet model:
|  | Old | New |
| -- | -- | -- |
| [8,12,2048,1024], axis=2, start=1, end=2048 | 1838687 | 1539846 |
| [8,12,1024,2047], axis=3, start=0, end=1024 | 951383 | 722203 |	2022-11-29 09:18:03 +08:00
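The dimension-coalescing idea can be sketched on a flat Python list. This is only an illustration of the general technique under the assumption of a row-major layout and a single sliced axis; it is not the CUDA kernel from the PR, and the helper names `coalesce_for_slice` and `slice_flat` are made up for the example.

```python
from math import prod


def coalesce_for_slice(shape, axis):
    """Collapse all dims before and after `axis` into (outer, shape[axis], inner)."""
    outer = prod(shape[:axis])       # prod of an empty list is 1
    inner = prod(shape[axis + 1:])
    return outer, shape[axis], inner


def slice_flat(data, shape, axis, start, end):
    """Slice a row-major flat buffer along one axis using the coalesced 3-D view."""
    outer, mid, inner = coalesce_for_slice(shape, axis)
    out = []
    for o in range(outer):
        base = o * mid * inner
        # For a fixed outer index, the sliced range is one contiguous chunk.
        out.extend(data[base + start * inner: base + end * inner])
    return out


# The first case from the table: four dims collapse to three.
print(coalesce_for_slice([8, 12, 2048, 1024], axis=2))

# A small check against a shape that is easy to verify by hand.
small = list(range(24))              # shape [2, 3, 4]
print(slice_flat(small, [2, 3, 4], axis=1, start=1, end=3))
```

With fewer dimensions, an index computation needs fewer divide/modulo steps per element, which is the saving the commit message refers to.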
Changming Sun	87e6a26c5d	Enforce Prefast check in Windows CPU CI pipeline (#13735 ) Right now we fix the warnings in an ad-hoc way. We run static analysis in nightly builds, then create work items for the finding it found. Our CI build pipelines run the same scan but do not break the build. So, this PR will fix the remaining findings in the CPU EP(including the training part) and enforce the check. Later on we can continue to expand the scope. We still have some warnings left in the JNI part. I will try to address them later in the next month.	2022-11-23 09:25:02 -08:00
guyang3532	ba9a585fcc	Fix the tensor save for backward release problem (#13679) Motivation: PythonOp saves its inputs for backward. This is risky because the ONNX Runtime backend is not aware of it: the tensor buffer may be "released" by ORT and then potentially modified by other operators before the backward function executes. Fix: This PR simply clones all inputs of PythonOp before forward is invoked. This may have high overhead; it is just a workaround until a better fix is available.	2022-11-22 17:32:19 +08:00
pengwa	947aab0ae0	Make HF converge with Lightning native amp (#13616) ### Fix training convergence issues #### Problem: Huggingface Transformers: 4.22.0 PyTorch Lightning: 1.6.3 PyTorch: v1.12.1, cuda 11.6 ORT: main branch, cuda 11.6 Model: RobertaForSequenceClassification @ models/roberta/modeling_roberta.py Mixed Precision training with `torch.autocast`: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)` Under this amp autocast context, the forward pass and loss computation run. Here is a snippet of the loss computation. ``` if labels is not None: ... if self.config.problem_type == "regression": loss_fct = MSELoss() if self.num_labels == 1: ... elif self.config.problem_type == "single_label_classification": loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) elif self.config.problem_type == "multi_label_classification": ... return SequenceClassifierOutput( loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) ``` After the forward run, the loss is 1.0850 in float16, which looks good. It is then scaled up here: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62)`; the scaler is 65536. We then get a scaled loss of 71104 in float type (because a float16 loss multiplied by an fp32 scaler is promoted to fp32). Backward then starts with initial grads of 1, so the backward step computes 1 (float32) * 65536 (float32) and generates a float16 gradient, which gives `inf`. This is where the problem occurs. The backward feeds the `inf` into the crossentropygradient op, generating `nan`s, and then all gradients become `nan` during back propagation. So when training with ORTModule, it almost always reports `overflow`, and the loss does not drop much compared with PyTorch. 
#### Analysis for the UT (when autocast enabled) PyTorch trace graph looks like this: ``` graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0), %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)): %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %12 : NoneType = prim::Constant() %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %17 : NoneType = prim::Constant() %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %19 : NoneType = prim::Constant() %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %21 : NoneType = prim::Constant() %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %24 : float = 
prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %data : Float(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0 return (%27) ``` The most important lines are: %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %input : _Half_(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 _Float_(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%_input_, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 `aten::cross_entropy_loss` takes Half input and returns Float output. As the doc says (https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32), `cross_entropy` in autocast mode runs in fp32, i.e. it converts its input to fp32 (if it is not already), does the compute, and returns an fp32 result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal` takes the same type for input and output, and our code `31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)` assumed this when exporting `aten::cross_entropy_loss`, and set the output to fp16 as well. This is the reason we have the problem. #### Possible Fixes 1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types of input and output. 2. Check the input and output when exporting, and add the input cast explicitly if there is type promotion from input to output. This PR uses the 2nd approach. We can start on the 1st approach later when needed. TODO: revisit all other exporter functions, add the checks, etc. 
### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-11-22 15:08:30 +08:00
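The overflow described in this commit can be reproduced with plain Python, without PyTorch or ORT. The sketch below is an assumption-laden stand-in: it emulates a float16 cast with the `struct` module's half-precision format, uses Python doubles in place of fp32, and the helper name `to_fp16` is made up for the example. The values 1.0850 and 65536 come from the commit message.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack finite values beyond the float16 range,
        # whereas a hardware cast would produce infinity.
        return math.copysign(math.inf, x)


scale = 65536.0                       # loss scaler from the commit message

# Forward: the float16 loss times the wider-precision scaler stays finite.
loss_fp16 = to_fp16(1.0850)
scaled_loss = loss_fp16 * scale

# Backward: the seed gradient 1 * 65536 is cast to float16 and overflows.
grad_seed_fp16 = to_fp16(1.0 * scale)

# Any later inf - inf (or inf * 0) then yields nan.
downstream = grad_seed_fp16 - grad_seed_fp16

print(scaled_loss, grad_seed_fp16, downstream)
```

The largest finite float16 value is 65504, so a scale of 65536 cannot be represented once the gradient is forced to half precision, which is why keeping the loss output in fp32 avoids the problem.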
Changming Sun	a5c2047dd1	Fix the remaining Prefast warnings in CPU EP (#13707 ) ### Description Fix the remaining Prefast warnings in CPU EP.	2022-11-21 10:21:38 -08:00
Wei-Sheng Chin	6160ba0692	Fix aten::_to_copy in DORT (#13682) `aten::_to_copy` is not exportable to ONNX, so in DORT it is replaced in `_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch commit, and this PR is a fix. Basically, we examine more keyword attributes passed to `aten::_to_copy`, and if they lead to a type casting operator (i.e., mapped to ONNX's Cast), we replace that `aten::_to_copy` with `aten::to`. Unsupported attributes are removed (with a low risk of breaking the FX graph's assumptions).	2022-11-18 09:31:18 -08:00
Vincent Wang	07812a2fa6	Fix UT Failure on AMD for ORTModule's Conv Test (#13688 ) Currently provider option conv_algo_search is for CUDA only, so remove the checking for ROCm EP.	2022-11-18 17:52:22 +08:00
cloudhan	9e649d1ac4	Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601 ) This ports #13116 from ROCm EP to CUDA EP	2022-11-15 14:43:54 +08:00
Vincent Wang	2bda3fd341	Gather to Slice Fusion (#13599) This PR optimizes the following code from Huggingface's XLNet model. ``` x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long)) ``` The code is exported as Range->Gather, which can be fused into a single Slice op. The Slice kernel is much faster than Gather, especially in the backward run. The main reason is that Gather's indices can contain duplicates, so its backward needs a sum, whereas a Slice node cannot have such a case. Profiling with Huggingface's XLNet model: - Before the fusion: forward, ~753us ![image](https://user-images.githubusercontent.com/11661208/200758439-63f2f9b5-9610-4df8-98c8-a1ad4dc62f4e.png) backward, ~46101us ![image](https://user-images.githubusercontent.com/11661208/200758530-fe16a8ec-ea8f-4b79-b3ac-386b72ba1670.png) - After the fusion: forward, ~627us ![image](https://user-images.githubusercontent.com/11661208/200758654-ab9a6068-c45d-40f4-9c71-3862a56732f8.png) backward, ~677us ![image](https://user-images.githubusercontent.com/11661208/200758833-aab1b8e1-1b5d-4e55-88cf-03c2a1d9d42b.png)	2022-11-10 13:03:30 +08:00
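Why the fused form is cheaper in backward can be seen in a one-dimensional sketch using plain Python lists. This only illustrates the equivalence claimed in the commit message; it is not ORT's graph transformer or its kernels, and the function names are hypothetical.

```python
def gather(xs, indices):
    """Forward of a 1-D Gather: pick elements by index."""
    return [xs[i] for i in indices]


def gather_backward(grad_out, indices, n):
    """Backward of Gather: indices may repeat, so gradients must be accumulated."""
    grad_in = [0.0] * n
    for g, i in zip(grad_out, indices):
        grad_in[i] += g
    return grad_in


def slice_backward(grad_out, start, end, n):
    """Backward of Slice: each output maps to exactly one input, so a copy suffices."""
    grad_in = [0.0] * n
    grad_in[start:end] = grad_out
    return grad_in


xs = [10.0, 11.0, 12.0, 13.0, 14.0]
klen = 3
indices = list(range(klen))          # what Range produces for arange(klen)

# Gathering with consecutive indices 0..klen-1 is the same as slicing [0:klen].
print(gather(xs, indices) == xs[0:klen])

grad_out = [1.0, 2.0, 3.0]
print(gather_backward(grad_out, indices, len(xs)))
print(slice_backward(grad_out, 0, klen, len(xs)))
```

A general Gather backward cannot assume unique indices (for example `[0, 0]` must sum both gradients into position 0), so it pays for accumulation even when, as here, the indices happen to be a plain range.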
Edward Chen	9e65f3bfdb	Replace deprecated Python dependency sklearn with scikit-learn. (#13585 )	2022-11-08 09:08:29 -08:00
pengwa	ab9ac2acc4	Add guidelines for ORTModule (#13553) ### Add guidelines for ORTModule As title. Feel free to let me know if I missed something.	2022-11-04 19:42:10 +08:00
zhijiang	1977b7ed6a	Fix pythonop training_mode in evaluation mode (#13514) A customer reported this issue: they see many warnings when doing the evaluation using ORTModule. ![image](https://user-images.githubusercontent.com/10530022/199371757-5fed7d05-a951-4f1b-8f88-049c5ab89886.png) After investigation, we found that `training_mode` is exported with a wrong value in evaluation mode: its value should be 0, but we found it is 1. Fix: fix the PythonOp training mode. If training_mode's type is torch._C._onnx.TrainingMode, then no matter whether it is EVAL or TRAINING, "if training_mode" will always be true.	2022-11-04 08:47:01 +08:00
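The pitfall behind this fix can be shown with a standard-library enum. The class below is a hypothetical stand-in for `torch._C._onnx.TrainingMode`, with made-up member values; the commit message states that the real type is likewise always truthy, and a plain Python `Enum` behaves the same way, so it serves as an analogy rather than a reproduction.

```python
from enum import Enum


class TrainingMode(Enum):
    """Hypothetical stand-in for torch._C._onnx.TrainingMode."""
    EVAL = 0
    TRAINING = 1


def buggy_flag(mode):
    # Members of a plain Enum are always truthy, even when their value is 0.
    return 1 if mode else 0


def fixed_flag(mode):
    # Compare against the member explicitly instead of relying on truthiness.
    return 1 if mode == TrainingMode.TRAINING else 0


print(buggy_flag(TrainingMode.EVAL), fixed_flag(TrainingMode.EVAL))
```

An explicit comparison keeps the exported flag correct regardless of how the enum type defines its boolean value.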
pengwa	a3e7da60e7	Trade subgraph recompute for memory (#12852) Description: Subgraph-level recompute This PR adds an optional capability that trades additional re-computation for better memory efficiency. Specifically, a pre-defined operator list is used to iterate the graph and find subgraphs to recompute, reducing stashed activations whose lifetime spans the forward and backward passes. When training with ORTModule, by default, the graph transformer will scan the execution graph to find all eligible subgraphs to recompute, along with the sizes that can be saved. An example is shown below. If we want to enable some of them for recompute, we can define the env variable this way: `export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"` ``` [1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary] [1,0]<stderr>:MemoryAlleviation Summary: [1,0]<stderr>: User config: [1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1 [1,0]<stderr>: ================================= [1,0]<stderr>: Subgraph: BitmaskDropout+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 1024 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: BiasGelu+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: 
Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Add+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Sub+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:1024 x 1024 x Frequency:97 [1,0]<stderr>: PatternShape:3 x 1024 x Frequency:1 [1,0]<stderr>: PatternShape:8 x 64 x Frequency:24 [1,0]<stderr>: PatternShape:1024 x 4096 x Frequency:24 [1,0]<stderr>: PatternShape:4096 x Frequency:24 [1,0]<stderr>: PatternShape:4096 x 1024 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x 
Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: ================================= ``` "Type config:" indicates whether recompute is enabled by the user: 0 - disable, 1 - enable. "Subgraph" means what kind of subgraph will be recomputed; in this case, it is a single node "Gelu", and it will be "Recompute". "Shape && Frequency" means that, for this recompute, one tensor of size (batch size, 500) will be saved because it will be recomputed. Baseline: on a 1P model (DEBERTA V2), sequence length 256, training with 16 A100 GPUs. With the latest main branch, we can run batch size 16, and the maximum batch size is < 32, so 16 is usually chosen by data scientists. 65% of the 40GB memory is used during training. The SamplesPerSec=479.2543353561354. ![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png) With this PR: Gelu is recomputed to reduce peak memory, so batch size 32 can be run. 97% of the 40GB A100 memory is used, and the SamplesPerSec=562.041593991271 (1.17X of baseline). ![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)	2022-11-03 13:49:41 +08:00
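To make the shape of the `ORTMODULE_ENABLE_MEMORY_ALLEVIATION` value easier to read, the sketch below splits such a string into its comma-separated entries. It is a reading aid only, not ORT's parser: the function name `parse_alleviation_config` is hypothetical, the second field is interpreted as the enable flag based on the "0 - disable, 1 - enable" note above, and any further fields (such as the `-1` in the example) are kept uninterpreted because the commit message does not explain them.

```python
def parse_alleviation_config(value):
    """Split 'Pattern+:1:-1,Other+:0:-1' into one dict per subgraph entry."""
    entries = []
    for item in value.split(","):
        pattern, *fields = item.split(":")
        entries.append({
            "subgraph": pattern,                           # e.g. "BiasGelu+"
            "enabled": bool(fields) and fields[0] == "1",  # assumed: 1 = enable
            "extra": fields[1:],                           # left uninterpreted
        })
    return entries


config = "BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,Mul+Add+:0:-1"
for entry in parse_alleviation_config(config):
    print(entry)
```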
Wei-Sheng Chin	b5904c40dd	Enable ORT in TorchDynamo (#13259 ) This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.	2022-11-01 11:19:29 -07:00
PeixuanZuo	6740528b98	[ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525 )	2022-11-01 13:05:55 +08:00
Baiju Meswani	c557a55816	Fix on-device training ExportModelForInferencing api (#13510 )	2022-10-31 21:29:06 -07:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Vincent Wang	8b0669bf63	QuickGelu Fusion (#12417) Some models have QuickGelu(x)=x*sigmoid(1.702x), which takes 3 Ops for forward and 5 Ops for backward. This PR fuses these into a single Op named QuickGelu and its gradient QuickGeluGrad. For CUDA, tested on V100 using an input tensor with shape [64,128,2048] and float16 type: Before, FW takes 335us, BW takes 614us ![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png) After, FW takes 115us, BW takes 139us, which is much faster. ![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png) For the CPU kernel, using the same shape and float type: Before, FW takes ~10ms, BW takes ~49ms Mul: 3480[µs] Sigmoid: 1996[µs] Mul: 4789[µs] Mul: 4642[µs] Mul: 4195[µs] SigmoidGrad: 18328[µs] Mul: 2988[µs] Sum: 18576[µs] After, FW takes ~4ms, BW takes ~5ms, which is also much faster. QuickGelu: 3939[µs] QuickGeluGrad: 5089[µs] Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2022-10-28 18:12:07 +08:00
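The formula in this commit, QuickGelu(x) = x*sigmoid(1.702x), and its derivative can be written out in a few lines of scalar Python. This is a reference sketch of the math only, not the fused CPU or CUDA kernel; the function names and the `dy` argument convention are assumptions. The gradient follows from the product rule: d/dx [x·s(αx)] = s(αx) + αx·s(αx)·(1 − s(αx)).

```python
import math

ALPHA = 1.702  # constant from the commit message


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quick_gelu(x):
    """Forward: x * sigmoid(ALPHA * x)."""
    return x * sigmoid(ALPHA * x)


def quick_gelu_grad(dy, x):
    """Backward: upstream gradient dy times d/dx of quick_gelu at x."""
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))


# Compare the analytic gradient with a central finite difference.
x, eps = 0.5, 1e-6
numeric = (quick_gelu(x + eps) - quick_gelu(x - eps)) / (2 * eps)
print(quick_gelu(x), quick_gelu_grad(1.0, x), numeric)
```

Computing both expressions in one pass over the data is what lets a fused op replace the separate Mul, Sigmoid, and SigmoidGrad nodes listed in the profile.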
Baiju Meswani	a46c599a40	Training API to export the eval model to an inference model (#13345 )	2022-10-27 09:34:01 -07:00
Vincent Wang	805ec459a0	Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462 )	2022-10-26 15:45:18 -07:00
Vincent Wang	b6a3562ffb	[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430) Add an env variable to control disabling custom autograd function support. When using ORTModule, if the torch model has a torch.autograd.Function and the user confirms that it can be exported to ONNX (for example, by inlining the PythonOp) and that the backward implementation matches the forward implementation, the user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to disable the custom autograd support, so that ORT's PythonOp is not used to fall back to PyTorch. Exporting to ONNX can sometimes leverage graph optimizations in ORT so that perf is better.	2022-10-25 16:58:04 +08:00
cloudhan	2748f38362	Drop hip_add_library (#13406 ) Switching to use CMake's builtin hip language support.	2022-10-25 12:57:48 +08:00
Adam Louly	bed169192d	Windows build fix for on-device training. (#13354) ### Description This is a fix for the on-device training wheel build. ### Motivation and Context When building the Linux wheel, PathString is treated the same as std::string, but building the wheel on Windows fails because the std::string needs to be cast to a PathString. This error was found manually because there is no pipeline that uses --enable_training_on_device for Windows. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-10-20 09:58:02 -07:00
cloudhan	fc12abf6b1	Enable/Disable tunable GEMM by using tunable switch in provider options and env var (#13116) Related PRs #12853 This allows the user to enable/disable tunable GEMM on demand.	2022-10-19 22:35:08 -07:00
PeixuanZuo	4b2b588895	[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365) ### Description Use a SAS Token to fix the error `failed to perform copy command due to error: no SAS token or OAuth token is present and the resource is not public`. Generate a SAS Token for the target data, add it to Key Vault, and use it as a Pipeline Variable. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-10-20 12:08:57 +08:00
Vincent Wang	67150baa8d	[ORTModule] ATen Support for aten::upsample_nearest (#13364 ) ATen support for aten::upsample_nearest, which is required for Huggingface's diffusers model training using ORTModule.	2022-10-20 08:30:04 +08:00
Vincent Wang	b6b3f41636	Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347) The PR applies some fixes to Hierarchical ORTModule and ORTModule PythonOp. For Hierarchical ORTModule: - Don't wrap a module if the caller calls a function other than forward() - Support a single module instance being called multiple times with different types of inputs - Check whether a module can be wrapped from top to bottom instead of from bottom to top For ORTModule PythonOp: - Add env variable control to allow using torch.utils.checkpoint.CheckpointFunction - Add env variable control to skip registering some autograd functions so that there is no conflict for some models.	2022-10-20 08:16:03 +08:00
Adam Louly	61ee5585b2	update the nightly build to use the latest ptca image. (#13309 ) ### Description updating the ptca image used in the nightly pipeline Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-10-17 14:12:03 -07:00
Adam Louly	68eff69ab1	Add Utils for federated learning scenarios (#13014 ) Description: utils for federated learning. Motivation and Context - This PR includes utils that will be used on federated learning scenarios. - Exposing python bindings to some utils, and added a util to calculate the difference between two buffers. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-10-17 12:39:43 -07:00
Jeff Daily	65c67764ae	remove line "ADD model ${WORKSPACE_DIR}/model" in the amdgpu Dockerfile (#12914 ) Follow-up to #12707. docker build is broken otherwise; model dir is gone.	2022-10-14 13:17:28 -07:00
Wei-Sheng Chin	dc324b1d90	[LazyTensor] Make LORT Build Again with Latest PyTorch (#13303) `python setup.py develop` doesn't install PyTorch as a normal package in site-packages anymore, and the user must stay in PyTorch's root directory to call `import torch`. This breaks LORT tests because they contain `import torch` and are called from outside the PyTorch root directory. To make PyTorch a normal package again, this PR builds PyTorch with `python setup.py install`.	2022-10-13 13:56:17 -07:00
Vincent Wang	807b2f4dd5	[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296 ) This PR is to add support of using env variable to set provider option cudnn_conv_algo_search so that user can choose better conv algo search method to run model. This is a quick fix to unblock the test of MoE model. Will have another PR to design and implement the ORTModule config so that we can config ORTModule using Python script or config file instead of env variable.	2022-10-13 15:36:21 +08:00
Vincent Wang	6fb70a82df	[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305 ) Update supported deepspeed highest version from 0.7.1 to 0.7.3 for FP16_Optimizer. Also add version info to warning log.	2022-10-13 13:03:01 +08:00
Vincent Wang	afb5f76770	[ORTModule] ATen Support for torch.nn.GroupNorm (#13293) Models from [huggingface's diffusers library](https://github.com/huggingface/diffusers) use torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's InstanceNormalization, which lacks a gradient. ORT's InstanceNormalization implementation calls cuDNN's BatchNorm for part of the computation, which is not efficient compared to PyTorch's implementation. This PR uses the ATen fallback to support this torch module, including its forward and backward.	2022-10-13 11:59:03 +08:00
PeixuanZuo	6895918b1c	[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297) ### Description Unit tests with ROCm5.3 are slower than with ROCm5.2.3, so revert to ROCm5.2.3. We will update to ROCm5.3 when the issue is resolved by AMD.	2022-10-12 10:47:33 -07:00


1156 commits