onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-04 23:59:56 +00:00

Author	SHA1	Message	Date
Christian Veenhuis	59dfcfdce7	Fix typos in sources: operater, tranform, neccessary, trainig (#14907 ) ### Description While browsing the sources I found several typos here and there. I collected them to a single PR and fixed them. Namely these typos are: operater, tranform, neccessary, trainig. After fixing none of them was found anymore: $ git grep "operater" $ git grep "tranform" $ git grep "neccessary" $ git grep "trainig" $ ### Motivation and Context Since some of the typos are in example notebooks and markdown files, users can see them.	2023-03-13 22:45:04 -07:00
pengwa	44dda08b51	Renaming files (#15015 ) ### Renaming files for compute optimizer ### Motivation and Context A follow up for https://github.com/microsoft/onnxruntime/pull/14832	2023-03-13 17:07:59 +08:00
pengwa	448e989df8	Op slicing upstream refactor (#14832 ) ### Slice op upstream refactor A refactor work for https://github.com/microsoft/onnxruntime/pull/13672. ### Motivation and Context There is a similar optimization opportunity for other operator upstreaming, to reduce compute flops. So refactor the existing code base for making it easier to support other ops. The changes in this PR are mainly about renaming and moving. - Move common logic (from compute_optimizer.h/cc) into upstream_transformer_base.h/cc and shared_utils.h/cc. - For upstream common logic, they are moved into upstream_transformer_base.h/cc - For shared utilities, they are moved to shared_utils.h/cc. - After the move, compute_optimizer.h/cc mainly for upstreaming gather implementation (inheriting upstream_transformer_base.h/cc). Ideally it should be renamed, but for easier review this time, I keep its name.	2023-03-13 08:19:32 +08:00
Dmitri Smirnov	0d7855ea5a	Re-work global objects dependencies in pybind layer. (#14941 ) ### Description Re-work handling of static objects in pybind. Make sure we ref-count Environment from Sessions. The following has been done: - Make global objects function-static. This ensures that the objects are constructed on demand. The first object constructed is destructed last. This is platform independent. - Make ownership of global objects shared, as suggested by pybind, since they are not surfaced at the Python level and cannot be referred to by dependent Python objects. Verified that all Python objects are GCed before globals are destroyed. This takes care of the inference session's dependency on the environment and its default logger, and it is also platform independent. - Utilize the pybind atexit mechanism to clear execution providers and unload CUDA libraries (as suggested by https://github.com/microsoft/onnxruntime/pull/14903). Since this is registered for module exit, it takes place before any other globals are destroyed and clears shared object state or even unloads the libraries. This should also work in a platform-independent way. ### Motivation and Context - Global object destruction order is managed manually, and that becomes a source of trouble. We want to make it deterministic and platform independent. - Frequent hangs in the Python layer due to the destruction order of static objects. Some of the Python session objects are garbage collected after main exits, and they require the ORT environment to be alive (use after free).	2023-03-10 13:55:31 -08:00
Baiju Meswani	748758c135	Address issue with uninitialized variable (#14988 )	2023-03-10 09:24:04 -08:00
Adam Pocock	47f00b5d49	[Java] Initial on device training support (#14027 ) contributor: @Craigacp	2023-03-08 10:01:08 -08:00
Ashwini Khade	f14ab63c19	fix prefast warnings (#14931 ) ### Description Fixes prefast warnings Fixed [AB#11328](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11328) Fixed [AB#11329](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11329)	2023-03-08 09:49:15 -08:00
Kyushick Lee	c696392f0c	Support external output tensors for DORT (#14516 ) ### Description Support externally-managed output tensors (torch Tensors) for dort. Add `preallocate_output` option to OrtBackend to rely on externally-managed output tensors for dort. ### Motivation and Context DORT currently allocates and returns output ortvalues and converts them to torch Tensors. The dlpack-based conversion does not support torch Tensors for custom Aten backends, and it is not yet possible to transfer ownership from an ortvalue to an external handle (torch Tensor). To avoid this issue, the PR provides an option (`preallocate_output`) to allocate output tensors externally in pytorch, which creates a torch Tensor for an Aten backend, and lets dort take pointers from the torch Tensors to construct output ortvalues instead of allocating them inside InferenceSession.	2023-03-07 21:32:23 -08:00
pengwa	5d8ce817cb	Fix simplified layer norm fusion for training (#14866 ) ### Fix simplified layer norm fusion for training Co-authored with @prathikr. Fixes the bug identified by @prathikr: https://github.com/microsoft/onnxruntime/issues/14822. When running the T5 model with deepspeed enabled, simplified layer norm is not fused because the device check did not pass `b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568)`. During the pre-training optimization pass there is no device placement yet, so it is expected that the device check is not fulfilled. On the other hand, the device check is still needed so that simplified layer norm fusion behaves correctly for CPU runs. As a mitigation, added a flag to indicate whether the fusion is triggered by pre-training optimization or not. There is still a risk when we run ORTModule training with the CPU EP, but I feel the risk can be much reduced if we check that CUDA/ROCM is enabled for the build. ``` CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json ```	2023-03-07 13:59:20 -08:00
pengwa	f6c81d8aca	Introduce padding inspector in ORTModule (#14652 ) ### Introduce padding inspector in ORTModule In some Transformer-based LLM training recipes, high data sparsity is observed due to 1). token padding (to max sequence length), 2). labels contains many ignore_index for calculate loss. This PR introduces a switch to enable data sparsity inspection, which 1). in short term, can inform training users to use techniques like dynamic batching to amortize the issue. 2). in medium and longer term, also helps us (training team) to have better understanding what our training customers' models looks like from perspective of data sparsity (and potentially motivate us to improve with runtime). Here is an example of different data sparsity with same training model arch, same training input, but with different user models. Low Embed Density, High Label Density Case - Sentence Classification ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/text-classification/run_glue.py --model_name_or_path roberta-large-openai-detector --task_name mnli --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir --output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137 --fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused ` ``` >>>Valid token/label density (e.g. 
valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 60 \| EMBED \| input_ids \| 1 \| 35.21 % \| 1442 \| 4096 \| [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] \| \| 61 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 62 \| EMBED \| input_ids \| 1 \| 30.00 % \| 1229 \| 4096 \| [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] \| \| 63 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 64 \| EMBED \| input_ids \| 1 \| 26.73 % \| 1095 \| 4096 \| [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] \| \| 65 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 66 \| EMBED \| input_ids \| 1 \| 30.03 % \| 1230 \| 4096 \| [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] \| \| 67 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 68 \| EMBED \| input_ids \| 1 \| 31.62 % \| 1295 \| 4096 \| [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] \| \| 69 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| <<< ``` High Embed Density, Low Label Density Case - masked language model ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused ` ``` >>>Valid token/label density 
(e.g. valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 710 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 711 \| LABEL \| labels \| -100 \| 13.77 % \| 564 \| 4096 \| N/A \| \| 712 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 713 \| LABEL \| labels \| -100 \| 14.48 % \| 593 \| 4096 \| N/A \| \| 714 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 715 \| LABEL \| labels \| -100 \| 14.18 % \| 581 \| 4096 \| N/A \| \| 716 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 717 \| LABEL \| labels \| -100 \| 14.53 % \| 595 \| 4096 \| N/A \| \| 718 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 719 \| LABEL \| labels \| -100 \| 15.31 % \| 627 \| 4096 \| N/A \| <<< ``` #### Next Step Let's see how we leverage the data sparsity for improvement. Optimizations on the way around compute optimizer wave 2: > Loss compute flops reduction. > Flatten/Unflatten embedding tokens to save compute flops.	2023-03-03 18:36:08 +08:00
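The DENSITY column in the tables above is the share of entries that differ from the pad index (or `ignore_index` for labels): valid tokens divided by total tokens. For example, 1442 / 4096 gives the 35.21 % reported at step 60. A minimal sketch of that calculation in plain Python follows. The helper name `valid_density` and the toy batch are hypothetical and are not part of ORTModule.

```python
def valid_density(batch, pad_idx):
    """Return (valid, total, density_percent) for a batch of token-id rows.

    Hypothetical helper for illustration only; not ORTModule's implementation.
    """
    total = sum(len(row) for row in batch)
    # An entry counts as valid when it is not the padding / ignore index.
    valid = sum(1 for row in batch for tok in row if tok != pad_idx)
    return valid, total, 100.0 * valid / total


# Toy batch: two sequences padded to length 8 with pad_idx=1.
batch = [
    [0, 17, 42, 2, 1, 1, 1, 1],
    [0, 9, 2, 1, 1, 1, 1, 1],
]
valid, total, density = valid_density(batch, pad_idx=1)
print(f"{valid}/{total} valid tokens, density {density:.2f} %")
```

The same function covers the LABEL rows if it is called with `pad_idx=-100`.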
guyang3532	c49f250a14	Del ort_model._modules to forward its access to torch_model._modules (#14563 ) A missing '_modules' attribute in ORTModule causes load_state_dict for wrapped_ortmodule to fail. Reference: https://github.com/microsoft/onnxruntime/pull/7847	2023-03-03 10:12:37 +08:00
Dmitri Smirnov	8d87fdcfa1	Add GetVersionString API for C++, C# and Python (#14873 ) ### Description Added APIs. ### Motivation and Context Addresses https://github.com/microsoft/onnxruntime/issues/14584 Cc: @Craigacp	2023-03-02 17:11:07 -08:00
cao lei	d69823f764	Do not create Barrier and triggerDownstream steps if the corresponding nodes are split by yield Op in training scenario (#14570 ) ### Description Do not create Barrier and triggerDownstream steps during execution plan creation if the corresponding nodes are split by the yield Op in the training scenario. ### Motivation and Context In the training scenario, the forward and backward passes run two different partial graphs. If two nodes sit in different partial graphs and in two separate streams, triggerDownstream/barrier steps are still created between them. These behave quite differently from the inference case, because one of the steps will not be executed as it is outside the range being run. To make this work, there was a hacky way to trigger the barrier step explicitly for training. This PR adds a check and does not create Barrier and triggerDownstream steps if the corresponding nodes are split by the yield Op in the training scenario, so the hack is no longer needed.	2023-03-02 07:08:29 -08:00
pengwa	79aa0acdd0	SCELoss(SCELossGrad) support half(float) input float(half) output (#13972 ) ### Description A follow-up change for https://github.com/microsoft/onnxruntime/pull/13616. SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad support different types for input and output. Add SCELoss(SCELossGrad) support for half(float) input and float(half) output. ### Test Note #### Add tests for variant input and output types. To add such tests, the existing testing code for SCE loss and the SCELossInternal gradient had to be refactored. Originally: - FP32 input and output: the CPU kernels run as the baseline; CUDA/ROCM then runs with the same data, and CompareOpTester compares against the CPU results. - FP16 input and output: since the CPU has no half kernels, Cast_to_float->CPU kernel->cast_to_half runs as the baseline; CUDA/ROCM then runs with the same data using the half implementation, and CompareOpTester compares against the CPU results. Now we want to support runs with different input and output types. The proposed change is to always run the CPU kernels with float input and output as the baseline (because the CPU only has float kernel implementations); this is the very first step of every test. Then we run the CUDA/ROCM kernels using half_input_half_output, float_input_float_output, half_input_float_output and float_input_half_output, if a corresponding kernel is registered. Afterwards, the CUDA/ROCM results are compared with the CPU float baseline. One thing deserves a special note: CompareOpTester's result comparison can be looser than OpTester's. Roughly speaking, the former tolerates diff <= atol + rtol * expected_value, while the latter tolerates diff < atol && diff < rtol * expected_value. When the expected value is very small, as in many of our test cases, the former can pass while the latter fails. So the refactoring also moves the check outside of OpTester and explicitly checks the values the way CompareOpTester did (to match the previous behaviour).	2023-02-28 18:02:08 +08:00
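The difference between the two tolerance rules described in this commit is easiest to see on a very small expected value. The sketch below is plain Python, not the ORT test code. The function names are hypothetical, and taking the absolute value of the expected value is an assumption.

```python
def combined_check(actual, expected, atol, rtol):
    # Looser rule (as described for CompareOpTester): one combined bound.
    return abs(actual - expected) <= atol + rtol * abs(expected)


def separate_check(actual, expected, atol, rtol):
    # Stricter rule (as described for OpTester): both bounds must hold.
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


# A tiny expected value makes the relative bound (rtol * expected) tiny too.
expected, actual = 1e-6, 5.1e-5
atol, rtol = 1e-4, 1e-3

print("combined:", combined_check(actual, expected, atol, rtol))
print("separate:", separate_check(actual, expected, atol, rtol))
```

Here the difference of 5e-5 is within `atol` but far above `rtol * expected` (1e-9), so only the combined rule accepts it.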
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 ) Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP Opset 11 introduced the following sequence related operators: - SequenceAt - SequenceConstruct - SequenceEmpty - SequenceLength - SequenceErase - SequenceInsert - ConcatFromSequence With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, as the operators don't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors. Consequently, this change does the following: 1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution. 2) SequenceAt uses the DataTransferManager to copy tensors in a backend-agnostic way. 3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers it would have multiple container "owners", and this would not be possible. 4) Other code that accessed values from TensorSeq has associated changes to extract Tensors from OrtValues now. In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML, 1) The CPU implementations for the Sequence* ops were registered as DML implementations. 
Since the above changes also includes making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is. 2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new apis to interrogate for sequence shapes, and extract the needed tensors from TensorSeq. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
Vincent Wang	e9ec4c098b	[CUDA] Fix FP16 Precision for Sigmoid Op (#14727 ) The current Sigmoid CUDA kernel uses the target data type for all computation. For some small negative numbers, FP16 loses precision. For example, for input [-7.8477, 7.3320, -7.8008, 6.6016], the expected output is [3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01], but the current kernel generates [0.0000, 0.9990, 0.0000, 0.9990]. If a sub-graph contains Sigmoid, such as BinaryCrossEntropyWithLogits, it is likely to produce NaN as the compute result. The PR fixes this by using FP32 for the kernel's internal computation. Note that the fix does not cause a perf regression: CUDA's _Exp already casts between float and half, so the fix doesn't introduce an extra cast. We move the casts to the very beginning and end of the whole kernel so that the other parts of the computation are also in FP32 (instead of only Exp).	2023-02-22 09:16:22 +08:00
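The precision loss can be reproduced without a GPU by rounding every intermediate value to half precision. The sketch below uses only the Python standard library (`struct` format `e` is IEEE 754 binary16). The sigmoid formula, `1 / (1 + exp(-|x|))` with the complement taken for negative inputs, is an assumption about how such a kernel is typically written, not a copy of the ORT source. Python doubles stand in for FP32, and all function names are hypothetical.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sigmoid_all_half(x):
    # Every intermediate is rounded to FP16, mimicking a kernel that computes
    # entirely in the target type. The formula itself is an assumption.
    x = to_half(x)
    e = to_half(math.exp(-abs(x)))
    r = to_half(1.0 / to_half(1.0 + e))
    return r if x > 0 else to_half(1.0 - r)


def sigmoid_wide_internal(x):
    # Compute in higher precision and round to FP16 only once, at the end.
    x = to_half(x)
    return to_half(1.0 / (1.0 + math.exp(-x)))


for value in [-7.8477, 7.3320, -7.8008, 6.6016]:
    print(value, sigmoid_all_half(value), sigmoid_wide_internal(value))
```

For the negative inputs, `exp(-|x|)` is smaller than half the FP16 spacing near 1.0, so `1 + e` rounds back to exactly 1 and the all-half result collapses to 0. Rounding only once at the end keeps the small value.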
pengwa	fbf5d09a0c	Fix random failure of ortmodule_api.py::test_unused_parameters (#14729 ) ### Fix random failure of ortmodule_api.py::test_unused_parameters Fix FAILED orttraining_test_ortmodule_api.py::test_unused_parameters[model1-none_pt_params1] for orttraining-linux-gpu-ci-pipeline CI pipeline ``` =================================== FAILURES =================================== ________________ test_unused_parameters[model1-none_pt_params1] ________________ model = UnusedMiddleParameterNet( (fc1): Linear(in_features=784, out_features=500, bias=True) (relu): ReLU() (fc2): Linear(in_features=500, out_features=400, bias=True) (fc3): Linear(in_features=500, out_features=10, bias=True) ) none_pt_params = ['fc2.weight', 'fc2.bias'] @pytest.mark.parametrize( "model, none_pt_params", [ (UnusedBeginParameterNet(784, 500, 400, 10), ["fc1.weight", "fc1.bias"]), (UnusedMiddleParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]), (UnusedEndParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]), ], ) def test_unused_parameters(model, none_pt_params): device = "cuda" N, D_in, H1, H2, D_out = 64, 784, 500, 400, 10 model = model.to(device) ort_model = ORTModule(copy.deepcopy(model)) # Make sure model runs without any exception for _ in range(5): x = torch.randn(N, D_in, device=device) y = copy.deepcopy(x) out_pt = model(x) out_ort = ort_model(y) loss_pt = out_pt.sum() loss_pt.backward() loss_ort = out_ort.sum() loss_ort.backward() _test_helpers.assert_values_are_close(out_ort, out_pt) > _test_helpers.assert_gradients_match_and_reset_gradient(ort_model, model, none_pt_params=none_pt_params) orttraining_test_ortmodule_api.py:4050: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _test_helpers.py:216: in assert_gradients_match_and_reset_gradient assert_values_are_close(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ``` Initially the test runs very 
well. As we inserted more and more tests, the randomly generated data seen by ortmodule_api.py::test_unused_parameters changed, and it is now more likely to generate input data that produces a result breaking the existing rtol and atol. For example, for the value 0.1041 the difference is very minor: abs_diff is 2.2649765014648438e-06. > torch.allclose judges the values not equal because: abs_diff > 0.1041 * rtol + atol = 1.041e-1 * 1e-5 + 1e-6 = 2.041e-6. > Additionally, according to the math [here](`7b31bcda2e/orttraining/orttraining/test/python/_test_helpers.py (L230)`), the maximum atol is 1.2238311910550692e-06 > current atol (1e-6), and the maximum rtol is 1.2149855137977283e-05 > current rtol (1e-5). This PR loosens the atol to 1e-5 and the rtol to 1e-4.	2023-02-20 18:09:53 +08:00
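The arithmetic in this commit message can be checked directly. The sketch below applies the inequality documented for `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`, to the numbers quoted above, first with the old tolerances and then with the loosened ones. It is plain Python with a hypothetical helper name; it does not call torch.

```python
def tolerance_bound(expected, rtol, atol):
    # Right-hand side of the allclose inequality.
    return atol + rtol * abs(expected)


expected = 0.1041
abs_diff = 2.2649765014648438e-06  # difference quoted in the commit message

old_bound = tolerance_bound(expected, rtol=1e-5, atol=1e-6)
new_bound = tolerance_bound(expected, rtol=1e-4, atol=1e-5)

print(f"old bound {old_bound:.4g}: passes = {abs_diff <= old_bound}")
print(f"new bound {new_bound:.4g}: passes = {abs_diff <= new_bound}")
```

With the old tolerances the bound is about 2.041e-6, just below the observed difference; the loosened tolerances raise it to about 2.041e-5.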
Baiju Meswani	ae205a7924	QAT POC tutorial (#14577 )	2023-02-16 14:38:18 -08:00
Baiju Meswani	4e686a9a7d	Support building a QAT onnx model using onnxblock (#14551 )	2023-02-16 14:38:01 -08:00
Edward Chen	5605c3d454	Make some variables constexpr in orttraining/orttraining/training_ops/cuda/optimizer/lamb.cc. (#14698 )	2023-02-15 14:10:59 -08:00
cao lei	50fa151298	remove device_id parameter out of ExecutionProvider::GetAllocator() (#14580 ) ### Description Remove the parameter device_id out of ExecutionProvider::GetAllocator() function ### Motivation and Context The parameter device_id is not necessary. We can fully rely on the second parameter OrtMemType mem_type to determine the device_id when getting allocator from executionProvider.	2023-02-13 10:01:07 -08:00
Baiju Meswani	22de2798f2	Update typing hints to support python 3.8 for training apis (#14649 )	2023-02-13 09:52:05 -08:00
guyang3532	ba00f3a134	Fix problem of duplicated input names (#14163 ) Contributor: @guyang3532	2023-02-10 12:57:51 -08:00
Wei-Sheng Chin	875a7791bf	[DORT] Update import path (#14605 ) Follow up changes from https://github.com/pytorch/pytorch/pull/93409/files for fixing DORT CI failures.	2023-02-08 19:54:06 -08:00
Maximilian Müller	e9ab56fa64	Adding RunOptions synchronization behaviour to C/C++ API (#14088 ) ### Description This is exposing the already existent interface of asynchronous work of all CUDA base EP's (CUDA + TensorRT). ### Motivation and Context This is something requested in #12216. It will enable users to build an efficient data pipeline with ONNXRuntime and CUDA pre-/post-processing. PCI traffic to the CUDA device can be run during inference as soon as the postprocessing consumed the input buffer and it can be overwritten. To do this work has to be submitted async to the device. Please see below screenshots showing the illustration of this using NSight Systems. Async: <img width="1401" alt="image" src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png"> Synchronous: <img width="1302" alt="image" src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png"> Note the gap in between the 2 inference runs due to issuing PCI traffic in between and to the CPU overhead the active synchronization has. --------- Co-authored-by: Chi Lo <chi.lo@microsoft.com>	2023-02-07 19:59:28 -08:00
Tang, Cheng	8f34c8c8ed	Introduce collective ops to ort inference build (#14399 ) ### Description Introduce collective ops into the onnxruntime inference build, including 1) AllReduce and AllGather schema in contrib op, controlled by the USE_MPI flag 2) AllReduce and AllGather kernel in the cuda EP, controlled by the ORT_USE_NCCL flag ### Motivation and Context Enable the collective ops in the onnxruntime inference build so we have the ability to run distributed inference with multiple GPUs. The original ncclAllReduce ops in the training build require quite complex configuration, which is not suitable for the inference case, and they are already broken, so we introduce a new implementation. --------- Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-02-07 13:47:48 -08:00
Vincent Wang	3d7518762a	[ORTModule] ATen Support for upsample_bilinear (#14519 ) It's required by model MobileViT.	2023-02-04 15:20:18 +08:00
pengwa	62442c3d27	Enable multiple step run for adamw tests (on device training) (#14520 ) (cherry picked from commit 414b73a02123b672e496326664cd2dc3bd6c6d24) ### Rework for PR https://github.com/microsoft/onnxruntime/pull/14068: Enable multiple step run for adamw tests (on device training) ### Removed duplicated MACRO checks for training. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-02-02 18:40:30 +08:00
Abhishek Jindal	3d388a1aea	change deepspeed version in warning from 0.7.3 to 0.8.0 (#14527 ) ### Description change deepspeed version in warning from 0.7.3 to 0.8.0 ### Motivation and Context The version was updated for Deepspeed support in ORT from 0.7.3 to 0.8.0 but wasn't updated in the warnings message and this PR is to fix that.	2023-02-01 12:00:43 -08:00
Abhishek Jindal	6fa4555a06	Including support for Deepspeed 0.8.0 (#14506 ) ### Description Including support for Deepspeed 0.8.0. ### Motivation and Context Deepspeed 0.8.0 has a bug fix and MLflow integration.	2023-02-01 06:19:41 -08:00
Erick Muñoz	d1533c27eb	[oneDNN] Improved thread handling (#13618 ) * Added the OrtDnnlProviderOptions structure to expose configuration options to the user * The number of threads can be defined by the user with the -i flag on the perftest * Number of threads can also be configured via the OMP_NUM_THREADS environment variable * The number of threads defined in the OrtDnnlProviderOptions is prioritized over the environment variable ### Description Avoids thread oversubscription caused by OpenMP allocating the maximum number of threads possible for oneDNN EP. Added support for the OrtDnnlProviderOptions, this will allow for more EP customization capabilities, and allows for user defined number of threads. ### Motivation and Context - Improves performances and allows for user to fine tune the number of threads	2023-01-31 14:37:13 -08:00
Ashwini Khade	764202d740	fix prefast warning (#14446 ) ### Description Fixes a prefast warning: https://aiinfra.visualstudio.com/ONNX%20Runtime/_workitems/edit/11113 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-01-30 09:13:39 -08:00
Kyushick Lee	cd24f0794a	Extend ort_backend.py for another ep (#14349 ) ### Description This PR extends OrtBackend to allow configuring an EP by name, and falls back to the existing mechanism that infers the EP from tensor affinity if nothing is provided. ### Motivation and Context Currently OrtBackend needs `get_ort_device()` with the device tag inferred from torch.Tensor, but the ort device is not yet supported for dort. The change allows running dort with a supported EP, by configuring dort with the desired EP and letting dort (ort InferenceSession) take CPU-affined pytorch Tensors as inputs and then inject data transfer nodes internally.	2023-01-20 07:30:00 -08:00
Wei-Sheng Chin	432a9912a3	Fix LORT CI failure due to PyTorch change (#14367 ) As title. The fuser in LORT doesn't like "scalar". With a recent PyTorch change, a scalar is introduced somewhere it was not before. For now, a simple fix is to check that all inputs are tensors, or one of some specially allowed cases, before sending ops to ORT.	2023-01-19 16:02:40 -08:00
Ashwini Khade	ea7bbd667d	fix headers for training apis (#14350 ) ### Description Minor refactor PR for fixing header placement for training apis	2023-01-19 10:26:53 -08:00
Adam Louly	f0555eb437	Improved test cases by using parameters (#14246 ) ### Description Completing some missing parts of some test cases for the Python bindings. ### Motivation and Context Some test cases, like test_training_module_checkpoint and test_optimizer step, were not completed before because we had no access to the parameters to check whether they change after the optimizer step or whether the parameters saved in the checkpoint remain the same. Now that we have access to the vector of parameters through the exposed get_contiguous_parameters() method, we can complete the tests.	2023-01-13 12:54:23 -08:00
Ashwini Khade	cc7799835e	Enable a single build with optimized inference and on device training (#14241 ) ### Description Right now prepacking code is not compiled when training is enabled. Our partners want a single build of ort which can do both optimized inference + training on device. This PR enables prepacking code in a training build and controls whether it is enabled or not using already existing session option - kOrtSessionOptionsConfigDisablePrepacking For Inference scenarios - prepacking will be turned on by default and this behavior remains the same after this PR too. For training scenarios - prepacking will be disabled by default and if user explicitly enables it then an error will be thrown. ### Motivation and Context Enable both optimized inference as well as on device training in a single build. For on device training use flag --enable_training_apis.	2023-01-12 21:36:43 -08:00
Vincent Wang	fb3c1221e4	Fix Prefast Warning (#14250 ) Fix two prefast:Warning related to constexpr.	2023-01-13 10:16:35 +08:00
Scott McKay	dd2df460b3	Split(18) (#14015 ) ### Description <!-- Describe your changes. --> Opset 18 Split changes. Adds ability to specify num_outputs which also allows uneven splitting. https://github.com/onnx/onnx/releases/tag/v1.13.0 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Support ONNX opset 18.	2023-01-12 08:14:10 +10:00
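The uneven splitting enabled by `num_outputs` can be sketched as a size calculation. This follows one reading of the ONNX opset 18 Split description, in which every chunk gets the ceiling of `dim_size / num_outputs` and the last chunk takes whatever remains. The helper name is hypothetical, and the code is an illustration, not the onnxruntime kernel.

```python
import math


def split_sizes(dim_size, num_outputs):
    """Chunk sizes along the split axis when only num_outputs is given.

    Assumption: all chunks but the last have size ceil(dim_size / num_outputs),
    and the last chunk holds the remainder.
    """
    chunk = math.ceil(dim_size / num_outputs)
    sizes = [chunk] * (num_outputs - 1)
    sizes.append(dim_size - chunk * (num_outputs - 1))
    return sizes


print(split_sizes(6, 3))  # evenly splittable
print(split_sizes(7, 3))  # last chunk is smaller
```

The sketch does not validate its inputs, so combinations where the remainder would be zero or negative are out of scope here.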
pengwa	a4180d79c5	Multi-tensor SGDOptimizer (on device training) (#14083 ) Implement SGDOptimizerV2 taking sequence of weights and gradients as inputs. For CPU EP and CUDA EP only. Added tests.	2023-01-11 10:15:53 -08:00
Ashwini Khade	d92c663f28	Create dedicated build for training api (#14136 ) ### Description Enable creating dedicated build for on device training. With this PR we can build a lean binary for on device training using flag --enable_training_apis. This binary includes only the essentials like training ops, optimizers etc and NOT features like Aten fallback, strided tensors, gradient builders etc . This binary also removes all the deprecated components like training::TrainingSession and OrtTrainer etc ### Motivation and Context This enables our partners to create a lean binary for on device training.	2023-01-10 20:58:04 -08:00
Xavier Dupré	79dc39600f	Replace distutils by setuptools to import build_ext (#14108 ) ### Description Uses setuptools instead of distutils. ### Motivation and Context Fixes #14107.	2023-01-09 11:48:01 +01:00
Baiju Meswani	c6ff5bac9d	Update torch in eager mode CI pipeline (#14094 )	2023-01-06 11:46:44 -08:00
Adrian Lizarraga	68794d0ac1	Improve custom op library handle cleanup (#14099 ) ### Description - Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages the lifetime of dynamic library handles (i.e., calls `dlclose` or `FreeLibrary`). - Deprecates C API `OrtApi::RegisterCustomOpsLibrary`. - Adds C++ API wrapper for convenient registering of custom op libraries. - `PySessionOptions` is now an alias of `OrtSessionOptions` ### Motivation and Context The current API for registering custom op libraries loads dynamic libraries but requires users to handle the release of the corresponding library handles. Additionally, the user has to make sure to release the library handle _after_ the session has been destroyed (or the program segfaults). The new API automatically cleans up the library and allows the user to write more straightforward code.	2023-01-04 17:56:29 -08:00
Baiju Meswani	0ff61f7b97	Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055 )	2023-01-03 20:03:33 -08:00
cao lei	b29a1c7348	Address follow-up comments on multistream pr #13495 (#13992 ) ### Description This PR is to address follow-up comments for the multi-stream pr https://github.com/microsoft/onnxruntime/pull/13495 Changes including: - Make StreamAwareArena transparent to minimal build - Make DeviceStreamCollection transparent to minimal build - Replace ORT_MUST_USE_RESULT with [[nodiscard]] - Remove unnecessary shared_ptr ### Motivation and Context This PR is to address follow-up comments for the multi-stream pr https://github.com/microsoft/onnxruntime/pull/13495 Co-authored-by: Lei Cao <leca@microsoft.com>	2023-01-03 16:33:36 -08:00
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964 ) ### Description 1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers\non-edge devices. 2. Update ENABLE_TRAINING option: With this PR when this option is enabled, training apis and torch interop are also enabled. 3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: - Removed user facing option - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop. Once this PR is merged when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features). Training entry points include: 1. ORTModule 2. Training APIs Features include: 1. ATen Fallback 2. All Training OPs includes communication and collectives 3. Strided Tensor Support 4. Python Op (torch interop) 5. ONNXBlock (Front end tools for training artifacts prep when using training apis) ### Motivation and Context Intention is to simplify the options for building training enabled builds. This is part of the larger work item to create dedicated build for learning on the edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
Dmitri Smirnov	5d729839b5	Support loading widechar paths on windows (#14066 ) ### Description Make GetRuntimePath() and LoadDynamicLibrary() operate on platform specific paths ### Motivation and Context This addresses https://github.com/microsoft/onnxruntime/issues/14063	2022-12-30 16:30:11 -08:00
Vincent Wang	0c3480e565	[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069 ) PyTorch removed upsample_nearest related backward functions with "vec" overload name since 1.13. The functions without overload name are available for all versions, though they are not that convenient to use. This PR changes the gradient builder code to use functions without overload name for ATen upsample_nearest nodes. This PR also fixed a bug for ORTModule's corner case introduced by the multi-stream PR. There is some code to execute the barrier step for triggered downstream if the barrier is out of range. But this should be applied to triggered downstream only. If it's a normal run with start step as a barrier step but out of range, we should not apply the logic. For example, for ORTModule, if the barrier is the 1st step of whole CPU plan, and the forward part is empty, then the forward normal run will run step from start-0 to end-0 (actually nothing), and step-0 is the barrier, then we should not execute the barrier in such case.	2022-12-27 10:18:30 +08:00
Adam Louly	e49f358686	expose lr scheduler python bindings for on device training. (#13882 ) ### Description Exposing LR Scheduler python bindings for on device training. Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-12-22 18:44:04 -08:00

1210 commits