onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-13 18:08:13 +00:00

Author	SHA1	Message	Date
zhijiang	16d7f55193	lora conv1d replacement (#16643 ) In LoRA code, conv1d is used to do the projection for qkv. The conv1d calculation is mathematically equivalent to matmul, and matmul is much faster than conv1d. The substitution made by the graph optimizer is: 1 conv1d >> 2 split + 1 squeeze + group_num matmul + 1 concat. With this optimizer, we see a 10%+ gain in one 1P model	2023-11-16 17:08:06 +08:00
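The equivalence claimed in the commit above can be checked with a small pure-Python sketch. It assumes a kernel size of 1 (which the "squeeze" step suggests but the commit message does not state) and uses hypothetical helper names; it is not the optimizer's actual code.

```python
def conv1d_k1(x, w, groups):
    """Reference grouped conv1d with kernel size 1 (assumed).

    x: [C_in][L] input, w: [C_out][C_in // groups] weight with the
    size-1 kernel axis already squeezed away.
    """
    c_in, length, c_out = len(x), len(x[0]), len(w)
    in_per_g, out_per_g = c_in // groups, c_out // groups
    y = [[0.0] * length for _ in range(c_out)]
    for o in range(c_out):
        g = o // out_per_g  # group this output channel belongs to
        for i in range(in_per_g):
            for t in range(length):
                y[o][t] += w[o][i] * x[g * in_per_g + i][t]
    return y


def matmul(a, b):
    """Plain [M][K] x [K][N] matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_matmul_concat(x, w, groups):
    """Split input and weight per group, matmul each pair, concat the results."""
    in_per_g, out_per_g = len(x) // groups, len(w) // groups
    y = []
    for g in range(groups):
        x_g = x[g * in_per_g:(g + 1) * in_per_g]    # split of the input
        w_g = w[g * out_per_g:(g + 1) * out_per_g]  # split of the weight
        y.extend(matmul(w_g, x_g))                  # concat along channels
    return y


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, -1.0]]
w = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [1.0, 1.0]]
conv_out = conv1d_k1(x, w, groups=2)
mm_out = split_matmul_concat(x, w, groups=2)
assert conv_out == mm_out
```

Both paths compute the same per-group products; the rewrite only changes which kernel performs them.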
guyang3532	751aa8d31a	fix axis of layernorm for UpstreamReshape (#18425 ) Similar to https://github.com/microsoft/onnxruntime/pull/17255, update the axis of LayerNormalization when a Reshape is moved upstream of it.	2023-11-16 16:29:00 +08:00
Vincent Wang	ed89ca573a	[ORTModule] Support User Config for Triton Codegen, Bugfix for Reduce-to-scalar (#18448 ) User can provide Triton codegen config JSON through env variable. Also fix some bugs related to reduction to scalar case.	2023-11-15 17:16:38 +08:00
Vincent Wang	4a82030339	[ORTModule] Symbolic Shape Support for Triton Codegen (#18317 ) Add symbolic shape support for Triton codegen for ORTModule.	2023-11-13 12:16:27 +08:00
guyang3532	4dc63692f8	Add FlattenAndUnpad Op (#17845 ) ### Description Add an op named `FlattenAndUnpad`. This op does two things: 1. Flattens the first two dims of the input tensor. 2. Gathers the valid values from the input tensor using an index tensor. ### Motivation and Context The grad op of `PadAndUnflatten` was `GatherGrad`, which performs poorly. `FlattenAndUnpad` is implemented to replace `GatherGrad` as the grad of `PadAndUnflatten`. With this op, we can also simplify the "Reshape + ShrunkenGather" pattern to `PadAndUnflatten` in the padding elimination optimizer, which will also improve performance.	2023-11-09 09:52:48 +08:00
Justin Chu	c250540722	Bump linter versions (#18341 ) Bump linter versions and run format.	2023-11-08 13:04:40 -08:00
Prathik Rao	34f77eaa24	bfloat16 support for quickgelugrad (#18336 ) ### Description <!-- Describe your changes. --> Registers BFloat16 datatype as valid input type for CUDA QuickGeluGrad Kernel. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-08 08:40:02 -08:00
pengwa	2151c79bf1	Tune ORTModule logging experience a bit (#18298 ) ### Tune logging experience a bit After the last update to the ORTModule logging experience, we found a few issues: 1. The `INFO` level outputs too many things, including PyTorch exporter verbose logs (tracing graphs) on every rank. At this level, we only want to - Output a little more information to users than the `WARNING` level, for example the memory recomputation recommendations or other not-fully-ready features. - Output a little more information for a quick diagnostic, collected on rank-0 only. 2. The ONNX Runtime logging filter during graph build and session init will sometimes hide issues (for example a segmentation fault), so there is no useful information at `WARNING`/`INFO` for users to report to us. This is not good! 3. Some of our devs like using `pdb` to debug Python code, but adding `import pdb; pdb.set_trace()` to the model's code might hang when they use `INFO` or `WARNING`, where the export happens and all output gets redirected due to log filtering. The only workaround is to switch to VERBOSE, which outputs far too many logs. The corresponding changes proposed here are: 1. For `INFO` logging, - We only log on rank-0. - We restrict the ORT backend logging level to WARNING in this case, because ORT backend code outputs way too many logs that should be under verbose, and we cannot guarantee we can get them cleaned up immediately once they are added. - We output the PyTorch exporter verbose log (including the tracing graph), which is useful for a quick diagnostic when an issue happens. 2. Remove all logging filtering on the ORT backend, so the segmentation fault details will not be hidden if it happens again. 3. Introduce a `DEVINFO` logging level, - Logs on all ranks - Sets the ORT backend logging level to INFO - PyTorch exporter logging filtering is turned OFF (to unblock pdb debugging). 4. Currently, using the Memory Optimizer requires DEVINFO (which will output the ORT backend INFO log). 
So the memory optimizer document is updated to reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will update the requirement back to INFO for showing memory optimization information. You can check https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations for a better view of the different log levels. This PR also extracts some changes from a bigger one, https://github.com/microsoft/onnxruntime/pull/17481, to reduce its review complexity. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>	2023-11-08 17:42:50 +08:00
Prathik Rao	83c0275354	add bfloat16 support for ConcatTraining and SplitTraining ops (#18280 ) ### Description <!-- Describe your changes. --> Updates input/output type constraints on training operators ConcatTraining and SplitTraining to include bfloat16 which was introduced in IR version 4. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-07 10:10:01 -08:00
pengwa	4f15b42728	Customize _get_tensor_rank for model export in stage3 (#18294 ) ### Customize _get_tensor_rank for model export in stage3 Weight/param sizes are all (0), so exporter logic that depends on the input shape will fail. This PR overrides the `_get_tensor_rank` function to retrieve the weight shape differently. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-07 16:37:11 +08:00
zhijiang	630c877b43	Zhijxu/improve ortmodule python perf a little bit (#13716 ) improve 2 python functions a little bit. according to a profiling result from a real user case, we find that 2 python function can be improved. the first is the result before improvement, the second is after improvement, we can see 8ms saved from the improvement. ![image](https://user-images.githubusercontent.com/43435212/202961725-b88d679e-993b-4910-a339-253f3ed5dcde.png) ![image](https://user-images.githubusercontent.com/43435212/202961732-6c6deebf-962f-4392-90d7-03705433e3ee.png)	2023-11-07 15:24:57 +08:00
pengwa	c8e1038eab	Optimize 4bit Qlora training (#18131 ) ### Optimize 4bit Qlora training Extend the existing `MatmulBnb4bit` to cover training scenarios. The PR includes the following changes: 1. Add special `torch.autograd.Function` export logic for `bitsandbytes.autograd._functions.MatMul4Bit`, which takes precedence over the common PythonOp exporter. 2. Add an optional `training_mode` attribute for op `MatmulBnb4bit`, which helps skip some inference-specific logic in the implementation. 3. Add an optional `transB` attribute, which defaults to 1; setting it to 0 is needed for backward usage. Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly 2.9% throughput gains. The reason is: `bitsandbytes.autograd._functions.MatMul4Bit` calls `ctx.save_for_backward`, which needs an additional copy in PythonOp; otherwise the tensor might be released by ORT while the backward op still references it. Removing the clones also reduces peak memory consumption, because `bitsandbytes.autograd._functions.MatMul4Bit` saves tensors that are not needed in the backward compute.	2023-11-02 09:46:11 -07:00
Vincent Wang	1c25fe5580	Fix PoliCheck (#18180 ) Fix PoliCheck by changing some words, which was from Triton flash attention's original code.	2023-10-31 13:53:11 +08:00
guyang3532	58f1d15d19	Replace Transpose with Reshape if they are equivalent (#18096 ) ### Description A Transpose is equivalent to a Reshape if only the size-1 dimensions change place while all other dimensions keep their relative order in the permuted tensor. Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1). This PR adds a graph transformer which replaces Transpose with Reshape when they are equivalent. Because Transpose needs a memory copy while Reshape does not, this replacement saves the memory copy overhead.	2023-10-27 23:50:18 +08:00
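The equivalence condition in the commit above can be sketched as a small check. This is an illustration with a hypothetical function name, not the graph transformer's actual code: a Transpose leaves the memory layout untouched exactly when the axes whose size is not 1 still appear in increasing order after the permutation.

```python
def transpose_is_reshape(shape, perm):
    """Return True if transposing `shape` by `perm` only moves size-1 axes."""
    # Source axes of the non-trivial dimensions, listed in output order.
    moved = [axis for axis in perm if shape[axis] != 1]
    # If they are still in increasing order, element order in memory is unchanged.
    return moved == sorted(moved)


def transposed_shape(shape, perm):
    """Shape produced by the Transpose, i.e. the target shape for the Reshape."""
    return tuple(shape[axis] for axis in perm)


# The example from the commit message: only the two size-1 axes move.
assert transpose_is_reshape((1, 1, 1024, 4096), (2, 0, 3, 1))
assert transposed_shape((1, 1, 1024, 4096), (2, 0, 3, 1)) == (1024, 1, 4096, 1)

# A genuine transpose of a 2x3 matrix reorders data, so it is not a Reshape.
assert not transpose_is_reshape((2, 3), (1, 0))
```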
Vincent Wang	b7408f7389	[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959 ) This PR supports efficient attention and flash attention in ORTModule, including: - Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. Set ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable. - Integrate Triton Flash attention, which requires triton==2.0.0.dev20221202 and an A100 or H100. Set ORTMODULE_USE_FLASH_ATTENTION=1 to enable. - A Python transformer tool to match sub-graphs by config and write transformers quickly. The current transformers support attention mask for both efficient attn and flash attn, and dropout for efficient attn only. To support more training scenarios (such as causal mask in GPT2), more transformers need to be added. The feature is guarded by system environment variables, so it won't affect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests are not added for now.	2023-10-27 10:29:27 +08:00
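A minimal sketch of opting in to one of the attention paths named in the commit above. The variable name comes from the commit message. Setting it before importing ORTModule is an assumption, modelled on the `ORTMODULE_ENABLE_ZERO_STAGE3` snippet in commit 6b7bce5ec9 further down this log.

```python
import os

# Opt in to the ATen efficient attention path. Per the commit message this
# needs PyTorch 2.2.0 dev or newer; the feature stays off when unset.
os.environ["ORTMODULE_USE_EFFICIENT_ATTENTION"] = "1"

# Assumption: the variable is read when ORTModule is set up, so it is set
# before the import. The lines below need an onnxruntime-training install
# and a model object, so they are left commented out.
# from onnxruntime.training.ortmodule import ORTModule
# model = ORTModule(model)
```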
pengwa	2c6b31c5aa	FP16 optimizer automatically detect DeepSpeed compatibility (#18084 ) ### FP16 optimizer automatically detect DeepSpeed compatibility Optimum/Transformers use the accelerate library to prepare models, so our FP16 optimizer wrapper has not worked for a long time: the optimizer's namespace is `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which underneath still calls into the DeepSpeed stage 1 and 2 optimizer. This PR includes the following changes: 1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the modifier registry, plus a check that its contained `optimizer` property MUST be a DeepSpeed stage 1 and 2 optimizer. (let's cover the Stage 3 optimizer later) 2. For DeepSpeed versions > 0.9.1, we store the source code in a version list. As long as the related function in DeepSpeed remains unchanged in a new release, we no longer need to manually upgrade the version check. If some day the source code does not match, a warning is raised asking users to add a new version of the source code to the list. With the above changes, our FP16 Optimizer works again in Optimum. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)	2023-10-25 15:11:02 +08:00
pengwa	444a0eda30	Avoid one time clone to save memory peak (#17934 ) ### Avoid one more time clone to save memory peak	2023-10-21 19:45:45 +08:00
Baiju Meswani	a43c57f59d	ResizeGrad CUDA/ROCM kernel implementation (#17772 )	2023-10-20 11:39:57 -07:00
Vincent Wang	fa0a79a921	Fix Triton Compile Error for Codegened Dropout Code (#17899 )	2023-10-12 20:57:14 +08:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on another PR https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it easier to review. With PR https://github.com/microsoft/onnxruntime/pull/17685, all PythonOp inputs/outputs are assumed by default not to be updated in place. If we find during the run that an inplace update happened (by comparing each output's data address with all the inputs' data addresses), we add a clone before setting it as a PythonOp/Grad output. In this case the results are correct, but implicit copy overheads are introduced. This PR allows users to define an output-to-input reuse map, to let ORT know how buffers are reused and avoid such unnecessary copies.	2023-10-10 21:36:45 -07:00
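The detect-then-clone behaviour described in the commit above can be sketched as follows. This is a simplified stand-in, not ORT's code: `bytearray` objects play the role of tensor buffers, object identity plays the role of the data-address comparison, and `resolve_outputs` and `declared_reuse` are hypothetical names.

```python
def resolve_outputs(inputs, outputs, declared_reuse=None):
    """Clone any output that aliases an input unless the reuse was declared.

    declared_reuse maps output index -> input index, standing in for a
    user-provided reuse map. Returns the outputs and the number of copies made.
    """
    declared_reuse = declared_reuse or {}
    resolved, copies = [], 0
    for out_idx, out in enumerate(outputs):
        # "Address check": does this output share a buffer with some input?
        detected = next((i for i, inp in enumerate(inputs) if inp is out), None)
        if detected is not None and declared_reuse.get(out_idx) != detected:
            out = bytearray(out)  # implicit clone keeps results correct
            copies += 1
        resolved.append(out)
    return resolved, copies


a, b = bytearray(b"abc"), bytearray(b"xyz")

# Undeclared inplace output: detected on the fly, so one clone is made.
outs, n_copies = resolve_outputs([a, b], [b])
assert n_copies == 1 and outs[0] is not b and outs[0] == b

# Declared reuse (output 0 reuses input 1): no copy is needed.
outs_declared, n_declared = resolve_outputs([a, b], [b], declared_reuse={0: 1})
assert n_declared == 0 and outs_declared[0] is b
```

Declaring the reuse up front is what removes the copy; the detection path only guarantees correctness.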
Abhishek Jindal	54b7503c30	create patch for allgather fn for deepspeed stage 3 (#17855 ) ### Description <!-- Describe your changes. --> Patch for All gather fn for Deepspeed Stage 3 changes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-11 11:15:06 +08:00
PeixuanZuo	2ef6ee674c	[ROCm] Update ROCm and MIGraphX CI to ROCm5.7 (#17834 ) - Update ROCm and MIGraphX CI to ROCm5.7 - Simplify the test exclude file. Some tests will output `registered execution providers ROCMExecutionProvider were unable to run the model.` if they cannot run. - Add the `enable_training` build argument for the MIGraphX pipeline.	2023-10-09 10:29:11 +08:00
pengwa	7201def4ec	Fix convergence for dolly+stage3 training (#17685 ) ### Fix convergence for dolly+stage3 training In [ZeROOffloadSubscriber](`216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)`), we defined some PythonOps that take an input and return it inplace, for example: `216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20)`. It is possible that when ORT runs such a PythonOp, once it completes, ORT releases the input OrtValue, which lets the data be erased or overwritten. But the OrtValue returned by the PythonOp still points to that address, so reading or writing through it may produce a wrong result or even undefined behavior. ``` /bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy. 
warnings.warn(f".rank-{get_rank()}: {message}") 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0} 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7 0%\|▍ \| 2/1000 [00:04<31:53, 1.92s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 
0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 
>> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s]Traceback (most recent call last): File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module> main() File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train return inner_training_loop( File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop self.deepspeed.step() File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step self._take_model_step(lr_kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step self.optimizer.step() File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step if self._overflow_check_and_loss_scale_update(): File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, *kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update self._update_scale(self.overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale self.loss_scaler.update_scale(has_overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale raise Exception( Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run. 
2%\|████ \| 17/1000 [00:06<06:07, 2.67it/s] [2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python Traceback (most recent call last): File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module> sys.exit(main()) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(args, **kwargs) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ examples/onnxruntime/training/language-modeling/run_clm.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-09-25_08:30:51 host : orttrainingdev10.internal.cloudapp.net rank : 0 (local_rank: 0) exitcode : 1 (pid: 1065120) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ (/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim ``` ## The Fix For those output that are reusing input, but ORT is not aware of, we detected on the fly (the first iteration, by checking the output tensor addresses with input tensor 
addresses), then do an implicit copy before setting it as PythonOp's output tensors. With this fix: (left: PyTorch, right: ORT) ![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)	2023-10-07 08:40:19 +08:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new error detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slow, we remove it from the linters	2023-10-05 21:07:33 -07:00
shaahji	5a623dca01	Python API to check whether collective ops are available or not (#17730 ) Python API to check whether collective ops are available or not ### Description <!-- Describe your changes. --> Adding an API to check whether collective ops are available or not. Since there is no independent MPI enabled build, this flag can be used on Python front for branching. Specifically, to conditionally enable tests. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Flag to be used in Python to check whether onnxruntime supports collective ops or not. Handy for conditionally enabling/disabling tests and for other branching decisions.	2023-09-29 14:11:05 -07:00
Vincent Wang	e6aa0fa174	Add Gelu Related Ops to Triton Codegen (#17713 ) Add Gelu/QuickGelu/GeluGrad/QuickGeluGrad support to Triton Codegen so that it can be fused with some other connected supported Ops. For example, in llama2, it can be fused with Mul so we will have extra 1-2% perf gain.	2023-09-27 19:57:39 +08:00
Scott McKay	33295ed883	Handle string initializers in constant folding (#17422 ) ### Description <!-- Describe your changes. --> * Allow either an allocator or a MemBuffer to be used when creating an OrtValue from a TensorProto * `Tensor<std::string>` requires an allocator to allocate/free the string values * Forcing the buffer to be allocated outside of the Tensor doesn't seem to provide any benefit in this usage as the Tensor class disables copy and assignment (so we wouldn't create 2 copies of the buffer via the Tensor class that externally managing the buffer would avoid) * New approach means we don't need to manage the buffers in the optimizer Info class as the Tensor dtor will do that * Update naming - MLValue was replaced by OrtValue a long time ago ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #17392	2023-09-27 21:15:58 +10:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released ### Motivation and Context Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Baiju Meswani	ccb73fd827	[On-Device Training] Expose Parameters through the Training API (#17364 )	2023-09-25 20:03:24 -07:00
pengwa	6b7bce5ec9	Model post process for zero stage3 training (#17187 ) ### Model post process for zero stage3 training This is the last change to make single GPU/Multiple GPUs run pass. Design details: https://microsoft.sharepoint.com/:p:/t/ONNX2/EfNfJ43necpIoPI6x5M2zvYBVbfjoPQmG4Boc_F7-tHm1w?e=ekQwA6&nav=eyJzSWQiOjMxNiwiY0lkIjoxMDE1Nzg3NDZ9 `PyTorch` runs with ZeROOffloadSubscriber: ``` model = prepare_model(...) from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3 configure_ort_compatible_zero_stage3() ``` `ORTModule` runs with ZeROOffloadSubscriber: ``` os.environ['ORTMODULE_ENABLE_ZERO_STAGE3'] = '1' from onnxruntime.training.ortmodule import ORTModule model = ORTModule(self.model) ``` It will be fairly easy to debug convergence issue if both ORT and PyTorch can run the same offload path. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-09-22 08:54:25 +08:00
Dmitri Smirnov	fdb132643d	Remove redundant Resolve() after each inlined function (#17556 ) ### Description Remove `Resolve()` on the entire graph as each function is resolved. We retain `Resolve()` after each inlining iteration. ### Motivation and Context Poor performance for inlining the model and session initialization. Original model before Resolve() removal FunctionTest.Profiling (65953 ms) After Resolve() Removal FunctionTest.Profiling (2911 ms) RelWithDebInfo pre-inlined model. Presumably because it runs Level1 optimizers Non-inlined model consists of functions and Level1 optimizers have no effect. FunctionTest.Profiling (9851 ms)	2023-09-15 12:13:37 -07:00
Changming Sun	bc84f52633	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470 ) ### Description Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint to newer versions per request of @mayeut. He created the following PRs to update the deps: https://github.com/microsoft/onnxruntime/pull/15432 https://github.com/microsoft/onnxruntime/pull/15434 https://github.com/microsoft/onnxruntime/pull/15435 https://github.com/microsoft/onnxruntime/pull/15436 https://github.com/microsoft/onnxruntime/pull/15437 However, our build system needs to fetch the dependencies from an internal mirror that only Microsoft employees have write access to. So I closed his PRs and created this one. This PR also updates abseil to a newer version. This is to prepare for upgrading re2.	2023-09-08 13:35:04 -07:00
Ashwini Khade	c5dbd5c919	Updates to training pipelines (#17292 )	2023-09-08 11:57:12 -07:00
Vincent Wang	deda5db231	[ORTModule] Add Manual Seed to Fix UT Failure (#17411 ) Add manual seed to fix ORTModule UT failure.	2023-09-06 11:24:55 +08:00
Baiju Meswani	8b98ecad70	Change RuntimeError to ImportError (#17380 ) The `onnxruntime-validation` for ORTModule checks for `ImportError`: `44101e8771/onnxruntime/python/onnxruntime_validation.py (L73-L75)` If any other kind of error is raised, it does not silently fail and will raise an exception. This causes a problem when ortmodule is explicitly not made available on win/mac packages since we currently raise a RuntimeError. Resolves issue: https://github.com/microsoft/onnxruntime-training-examples/issues/161	2023-09-01 09:56:40 +08:00
pengwa	58af36b49a	Fuse ScaledSum and its backward BatchScale (#16517 ) ### Fuse ScaledSum and its backward BatchScale For deberta models, there is a pattern a / scalar_0 + b / scalar_1 + c / scalar_2 We can fuse this into ScaledSum operator, taking 2(or 3) inputs, and 2(or 3) attributes scalar, generating one output. For the backward, the gradient of a, b and c will be computed with BatchScale. ### Benchmark on 8x32GV100 ```bash torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 400 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json --per_device_train_batch_size 10 ``` #### Main Branch ``` Total overhead: 127954ms where export takes 116489ms. epoch = 14.29 train_loss = 4.9803 train_runtime = 0:10:27.29 train_samples = 2223 train_samples_per_second = 51.013 train_steps_per_second = 0.638 throughput per GPU = 14.29* 2223/ (627.29 - 127.954) / 8 (gpu) = 7.952 samples/second ``` #### This PR ``` Total overhead: 128761ms where export takes 118510ms. *** train metrics *** epoch = 14.29 train_loss = 4.6144 train_runtime = 0:10:04.31 train_samples = 2223 train_samples_per_second = 52.953 train_steps_per_second = 0.662 throughput per GPU = 14.29*2223 / (604.31 - 128.761) / 8 = 8.350 samples/second ``` 5.x% performance gains.	2023-08-31 14:55:27 +08:00
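The per-GPU throughput figures in the benchmark above can be reproduced from the logged metrics. The sketch below uses a hypothetical helper name and follows the formula written in the commit message: epochs times samples, divided by the runtime minus the one-time overhead, divided by the GPU count.

```python
def throughput_per_gpu(epochs, train_samples, runtime_s, overhead_ms, num_gpus):
    """Samples per second per GPU, excluding the one-time export overhead."""
    effective_s = runtime_s - overhead_ms / 1000.0
    return epochs * train_samples / effective_s / num_gpus


# train_runtime 0:10:27.29 is 627.29 s; 0:10:04.31 is 604.31 s.
main_branch = throughput_per_gpu(14.29, 2223, 627.29, 127954, 8)
this_pr = throughput_per_gpu(14.29, 2223, 604.31, 128761, 8)
gain = this_pr / main_branch - 1.0

print(f"main: {main_branch:.3f}  PR: {this_pr:.3f}  gain: {gain:.1%}")
```

With the numbers logged above this gives about 7.952 and 8.350 samples/second, a gain of roughly 5.0%.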
Adam Louly	8224891236	add logits option to generate artifacts (#17276 ) ### Description Adding the ability to export logits as an output for train and eval graphs in generate_artifacts. It will remain optional.	2023-08-29 16:55:31 -07:00
kushalpatil07	7b92057376	EvalStep called with wrong inputs onnxruntime_training_cxx_inline.h (#17331 )	2023-08-29 14:14:35 -07:00
Baiju Meswani	38ea8c3931	Increase max error tolerance for ConvTransposeGrad test (#17315 )	2023-08-28 17:05:40 -07:00
guyang3532	401129d484	Add support for more ops for padding elimination (#17217 ) Add support for Gelu/ReduceMean/SimplifiedLayerNormalization for padding elimination	2023-08-25 18:02:15 +08:00
pengwa	d90afc697b	Introduce ZeROOffloadSubscriber for ORTModule (#17006 ) ### Introduce ZeROOffloadSubscriber for ORTModule As part of the work to integrate ORTModule with DeepSpeed stage3, this PR mainly focuses on moving the original PyTorch-based (hook-driven) param partition/offload implementation to an ORTModule-compatible implementation. Changes include: 1. Refactor `SubscriberBase`/`SubscriberManager` to support pre-forward/post-forward hooks. 2. Implement the new `ZeROOffloadSubscriber` by re-using the DeepSpeed hook functions as much as possible. All hook functions are defined in `DeepSpeedZeRoOffload._register_hooks_recursively` and `DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the closure is not complex: all hooks reference the owning `DeepSpeedZeRoOffload` instance. So we can create a new hook function with `FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance, then call the newly created function in the subscriber's `pre_forward_module_apply_impl` and `post_forward_module_apply_impl` interfaces. 3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to register the `ZeROOffloadSubscriber` for the model, so we don't need to change any code in the DeepSpeed repo (at least so far). 4. Fix the ATen embedding custom symbolic exporter function by tolerating a weight size of (0) (changed by DeepSpeed zero stage 3). UT will be added once stage3 is fully supported. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-08-25 00:15:22 +08:00
Baiju Meswani	fca81cc5d5	ConvTransposeGrad CUDA Kernel (#17201 )	2023-08-24 09:08:06 -07:00
Baiju Meswani	34d18ee076	Build gradient graph starting at the loss alone (#17240 )	2023-08-23 23:54:45 -07:00
Ashwini Khade	56102ecbdd	On-Device Training - Enable loading from buffer (#16417 )	2023-08-22 19:59:32 -07:00
Adam Louly	c0b6c6c94b	Add SGDOptimizer in the on-device training offline tooling (onnxblock) (#17085 ) ### Description Adding SGDOptimizer to on device training onnxblock	2023-08-18 10:50:39 -07:00
Ashwini Khade	68a670c7f8	Move some tests from CUDA only to CPU (#17189 ) ### Description Minor PR to move some CUDA-only on-device training tests to CPU as well. This is to make sure we have good coverage for CPU too.	2023-08-18 09:44:57 -07:00
Maximilian Müller	7b9d1f18c7	NVTX windows include and link fixes (#16831 ) ### Description On Windows, the NVTX headers are not duplicated into the normal CUDA include directory. On Linux they are: ``` (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h ``` Is the preference to go via those added defines, or should the include just be changed to `nvtx3/`? Also, no library linking is needed on Windows, and the library is not even present.	2023-08-16 11:53:58 -07:00
pengwa	abf9765d73	PythonOp Enhancement: Bool and Tuple[Bool] Constants, Materialize Grads, Empty Inputs, Save In Context (#16828 ) ### PythonOp Enhancement: Bool and Tuple[Bool] Constants, Materialize Grads, Empty Inputs, Save In Context 1. Support `bool` or `Tuple[bool]` constant types in inputs. 2. Support `ctx.set_materialize_grads(True|False)`. 3. The backward op can accept empty inputs (those that don't require grad). 4. Special handling for ORT tensors that are saved in context. Scenario: a tensor is generated by ORT, then it might be saved for backward by `ctx.save_for_backward(tensor)`. The `tensor`'s reference count is not increased in ORT's allocation plan, so ORT may release the tensor data before backward uses it. Currently: we copy every tensor before running autograd.Function.forward(), which might be a problem when there are many PythonOps (for example, ZeRO stage 3). Proposal: to avoid those unnecessary copies for tensors that are not saved in context, this change introduces a `_GlobalOpKernelInfoMap`. During the kernel's first run, we still copy all tensors generated by ORT and pass them to torch.autograd.Function to run; then we check which inputs need to be saved in context and record their input indices in `_GlobalOpKernelInfoMap`. In later iterations, we copy only what is needed.	2023-08-15 13:31:04 +08:00
pengwa	cd7b3f54da	Allow defining customized PythonOp shape inferer (#17093 ) ### Allow defining customized PythonOp shape inferer We convert `torch.autograd.Function` to PythonOp in MSDomain, and there are two places that do shape inferencing for it: 1. in SymbolicShapeInfer. 2. in the PythonOp op definition. For a common PythonOp, since we don't know the relationship between inputs and outputs, we only infer the rank from output ranks and generate symbolic dimensions for each dim. This introduces many meaningless symbolic dimensions, sometimes blocking our graph transformers from doing op fusion. This PR provides a way to define custom shape inferencing for the `torch.autograd.Function`s we define, to propagate the original dimensions across the PythonOp on a best-effort basis. The 2nd place is not covered yet; we could refine that later. Fixing the 1st one is enough for ORTModule training/evaluation.	2023-08-14 09:13:32 +08:00
Baiju Meswani	3e7f70bf88	LeakyRelu Gradient (#17039 )	2023-08-10 20:45:34 -07:00

1379 commits