onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-01 23:30:35 +00:00

Author	SHA1	Message	Date
Yi Zhang	777c474f61	skip quantized model C# tests on GPU (#13782 ) ### Description Skip quantized model C# tests on GPU too. ### Motivation and Context It looks like the current test result isn't reasonable. https://github.com/onnx/models/issues/581 Once we update the image, the quantized model [test data will be generated with VNNI](`ba629906dd`) and the CI would break.	2022-12-01 12:33:20 +08:00
Wei-Sheng Chin	7df8f84228	Improve DORT document (#13790 ) 1. Refine words based on PyTorch changes. 2. Make the need of inference mode clearer. A test is added.	2022-11-30 16:55:25 -08:00
Yulong Wang	77c97b6f16	[js/rn] support load model from buffer on Android (#12676 ) Description: [js/React Native] Add android implementation for creating session from buffer. #12500 Co-authored-by: Rachel Guo <guorachel@microsoft.com>	2022-11-30 10:55:55 -08:00
Wei-Sheng Chin	639d285670	[DORT] Catch up with yesterday's PyTorch change (#13779 ) Fix recent CI failures.	2022-11-30 09:23:44 -08:00
Xavier Dupré	441b30b2d2	Move a function call outside a loop in ORTModule (#13771 ) ### Description The proposed change is useful for ORTModule when the output graph has multiple outputs. ### Motivation and Context performance Signed-off-by: xadupre <xadupre@microsoft.com>	2022-11-30 12:49:41 +01:00
Patrice Vignola	08ed09d20b	Add DML support to the transformers benchmark.py script (#13776 ) ### Description Add DML support to the transformers benchmark.py script ### Motivation and Context Before this change, running the `benchmark.py` script when the `onnxruntime-directml` package is installed resulted in an error because it expects a CUDA or ROCM framework.	2022-11-29 18:57:52 -08:00
Changming Sun	29ed8811e5	Move C/C++ deps' URLs to deps.txt (#13769 ) ### Description 1. Move C/C++ deps' URLs to deps.txt, and download the dependencies from Azure DevOps Artifacts instead of GitHub. 2. Add "EXCLUDE_FROM_ALL" keyword to the cmake external projects, so that we only build the parts we need and avoid installing the 3rd-party dependencies when people run `make install` in ORT's build directory. However, at this moment cmake itself doesn't have the feature. So I copied their code to cmake/external/helper_functions.cmake and modified it. This PR is split from #13523, to make that one smaller. ### Motivation and Context 1. Secure the supply chain 2. Make it possible to automatically detect if ORT has an old dependency that hasn't been updated for a long time.	2022-11-29 18:06:35 -08:00
Jeff Bloomfield	571dc5a1f1	Support external weights in DML execution provider (#13740 ) ### Description This enables support for external weights in the DML execution provider when its graph optimization logic is reached. ### Motivation and Context External weights are encountered after optimization is applied to transformer models.	2022-11-29 15:47:16 -08:00
stevenlix	ce0025d3f2	Fallback Pow op in layer norm to FP32 in TRT to avoid overflow (#13639 ) Accuracy loss is observed when transformer models such as BERT, DeBERTa, ViT are running in TRT FP16 mode. The cause is that overflow happens at Pow op in layer norm. This PR provides the option to force Pow to run in TRT FP32 precision if overflow occurs. Co-authored-by: Ubuntu <azureuser@orteplinuxdev.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>	2022-11-29 13:37:31 -08:00
Chi Lo	0327606d2d	Revert TRT EP Linux CI to run unit tests in container (#13766 ) Revert TRT EP Linux CI to old behavior that code build and unit tests are both executing in container. So that we don't have to update the VM image for native Ubuntu to include latest TRT libraries every time newer version of TRT is introduced.	2022-11-29 13:15:27 -08:00
Tianlei Wu	abe1642a0c	Update fusion for distilbert accuracy test on SQuAD (#13748 ) (1) Embed layer fusion to work with --use_mask_index. (2) Parse num_heads and hidden_size from a pattern of Concat shape node. (3) Fix a typo (CUDAExcecutionProvider=> CUDAExecutionProvider) in eval_squad.py (4) Update example comments in eval_squad.py to use optimized fp16 model. (5) Update tests in test_optimizer.py	2022-11-29 13:06:39 -08:00
FFrog	181628ced1	[CANN] add more operators (#13578 ) ### Description Add new operators and enhance existing ones. ### Motivation and Context The operators of the CANN EP are modified as follows. Enhanced operators: - Add - Sub - Mul - Div - Gemm - MatMul - AveragePool - GlobalAveragePool - MaxPool - GlobalMaxPool - Dropout New operators: - Abs - Neg - Floor - Ceil - Reciprocal - Sqrt - Log - Exp - Erf - Round - Sin - Cos - Cast - Reshape - Transpose The remaining operators will be supported in the next PRs.	2022-11-29 12:08:36 -08:00
Baiju Meswani	2c29938846	[QAT] Introduce FakeQuant op (#13649 )	2022-11-29 08:43:37 -08:00
sfatimar	49c3768985	Enabled ops for DeBERTa model (#13690 ) ### Description Enabled the GatherElements op to enable the DeBERTa model. ### Motivation and Context This change is required to enable the DeBERTa model, which is relevant to MSFT. Co-authored-by: mayavijx <mayax.vijayan@intel.com>	2022-11-28 22:39:32 -08:00
pengwa	7c53b6eee8	Skip the tests of saving tensor in backward (#13767 ) ### skip the tests of saving tensor in backward The test failed randomly; Let's skip it until the issue got fixed to unblock the CIs.	2022-11-29 13:02:26 +08:00
Vincent Wang	3c258c878c	[CUDA] Optimize Slice Kernel (#13641 ) The PR optimizes the Slice CUDA kernel in two ways: - Coalesce dimensions so that fewer divmod operations are needed during kernel compute - Split data load and write for better memory throughput Below are some perf results (cycle counts from Nsight Compute) on a V100 using real cases from Huggingface's XLNet model: \| Old \| New -- \| -- \| -- [8,12,2048,1024], axis=2, start=1, end=2048 \| 1838687\| 1539846 [8,12,1024,2047], axis=3, start=0, end=1024 \| 951383\| 722203	2022-11-29 09:18:03 +08:00
JiCheng	47780b7f3b	[XNNPACK] add more computation heavy ops (#13270 ) ### Description This is the first PR adding the remaining ops for the XNNPACK EP. It adds: - [x] ConvTranspose f32 qu8 qs8 - [x] ~~UnMaxpool f32 qu8 qs8~~ - [x] Resize f32 qu8 qs8 - [ ] GEMM see https://github.com/microsoft/onnxruntime/pull/13126 The remaining op support will be separated into another PR. ### Motivation and Context	2022-11-29 09:09:26 +08:00
Dmitri Smirnov	4fbe16e493	Ifdef cpuinfo code on platforms where we do not set affinity (#13486 ) ### Description Remove code that invokes the cpuinfo library on platforms where we do not set affinity. ### Motivation and Context The `cpuinfo` library increases binary size.	2022-11-28 13:44:16 -08:00
Guenther Schmuelling	2d523c507e	for wasm catch exceptions at top level api (#13644 ) fix for https://github.com/microsoft/onnxruntime/issues/13383, https://github.com/microsoft/onnxruntime/issues/13408 Currently ort-web doesn't catch exceptions because turning on exception catching increases the binary size by 3MB (~30%). But ort can throw (ie onnx errors or ORT_ENFORCE) and there is no useable error message. Turning on exception catching just for top level api released file will fix the error messages at minimal increase of binary size.	2022-11-28 10:24:34 -08:00
Faith Xu	b7c3862330	Update resource section in readme (#13724 ) ### Description - adds link to release plans page - adds link to youtube channel	2022-11-28 09:42:31 -08:00
Jicheng Tang	b4a4fa5aac	Fix compile error with protobuf RepeatedIterator (#13731 ) ### Description <!-- Describe your changes. --> There are some compile errors with google::protobuf::internal::RepeatedIterator. replace reinterpret_cast with &(iter), which iter is RepeatedIterator type. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> My protobuf version is: - libprotoc 3.21.5 - g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 when I use build command: ``` ./build.sh --use_cuda --cudnn_home /usr --cuda_home /usr/local/cuda --config Debug --build_shared_lib --parallel ``` There are some compile errors like this: - error 1 onnxruntime/test/util/test_utils.cc:186:105: error: no matching function for call to ‘make_span(google::protobuf::RepeatedField<long int>::const_iterator, google::protobuf::RepeatedField<long int>::const_iterator)’ 186 \| ind_span = gsl::make_span(indices_proto.int64_data().cbegin(), indices_proto.int64_data().cend()); - error 2 onnxruntime/test/onnx/tensorprotoutils.cc:101:56: error: invalid cast from type ‘google::protobuf::internal::RepeatedIterator<const long unsigned int>’ to type ‘const uint32_t’ {aka ‘const unsigned int’} 101 \| p_data++ = reinterpret_cast<const T>(data_iter);	2022-11-28 09:33:53 -08:00
Numfor Tiapo	aa1390e963	Fix Prefast Errors (#13675 ) Fixes all C28204, C6031, and C26814 prefast errors. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-11-28 09:16:22 -08:00
Ted Themistokleous	c6bea4f02f	Modify MIGraphX EP for Accuracy tests (#13455 ) Allows MIGraphX EP to run the following additional tests. Also adds support to get MIGraphX to run eval_squad.py Reference to the Rocm EP changes: https://github.com/microsoft/onnxruntime/pull/13306 Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com> Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2022-11-27 18:26:49 +08:00
Yufeng Li	4ca62b9ee8	fix build break in test/beam_search_topk.cc (#13739 )	2022-11-23 21:20:51 -08:00
Vincent Wang	47e7630378	[CUDA] Transpose3DImpl Supporting more Cases (#13611 ) CUDA's Transpose3DImpl transposes [batch, m, n] to [batch, n, m]. Currently it requires that both m and n be divisible by 32 or 16. Otherwise, the compute falls back to the general implementation, which is slow. This PR removes the limitation. Profiling on a V100 with the tensor sizes below gave the following cycle counts from Nsight Compute: \| Old \| New -- \| -- \| -- [3072,64,512] \| 760793 \| 727140 [3072,16,2048] \| 854303 \| 851146 [3072,2048,12] \| 986924 \| 737884 [3072,1024,24] \| 1212427 \| 495117 It shows that even though extra IF statements were added to the kernel implementation, there is nearly no impact on the cases the old version already handled (cases 1 and 2). Cases 3 and 4, which previously fell back to the general implementation, are much faster. The above data was collected using FP16 tensors; similar results were observed for float tensors. This PR enhances the perf of ORT training of Huggingface's XLNet model, which has [8,1024,1024,12].permute(0,3,1,2).	2022-11-24 09:40:48 +08:00
Yi Zhang	87d5703b14	skip TestCUDAProviderOptions in End2EndTest (#13737 ) ### Description <!-- Describe your changes. --> Skip the test with --filter in runtest.sh ### Motivation and Context Recently, the Zip-Nuget-Java-Nodejs Packaging Pipeline always failed in Nuget_Test_Linux_GPU. To unblock the packaging workflow, skip the test in Nuget_Test_Linux_GPU temporally. the exception message is below. ``` [xUnit.net 00:07:26.28] TestCUDAProviderOptions [FAIL] Failed TestCUDAProviderOptions [1 m 19 s] Error Message: Microsoft.ML.OnnxRuntime.OnnxRuntimeException : [ErrorCode:RuntimeException] Non-zero status code returned while running FusedConv node. Name:'' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Available memory of 11416064 is smaller than requested bytes of 134217728 Stack Trace: at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess(IntPtr nativeStatus) at Microsoft.ML.OnnxRuntime.InferenceSession.RunImpl(RunOptions options, IntPtr[] inputNames, IntPtr[] inputValues, IntPtr[] outputNames, DisposableList`1 cleanupList) at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames, RunOptions options) at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames) at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs) at Microsoft.ML.OnnxRuntime.Tests.CUDATest.TestCUDAProviderOptions() in /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.Tests.NetCoreApp/InferenceTest.netcore.cs:line 93 Failed! - Failed: 1, Passed: 0, Skipped: 0, Total: 1, Duration: < 1 ms - /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/bin/Debug/netcoreapp3.1/Microsoft.ML.OnnxRuntime.EndToEndTests.dll (netcoreapp3.1) Done executing task "Microsoft.TestPlatform.Build.Tasks.VSTestTask" -- FAILED. 
1>Done building target "VSTest" in project "Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" -- FAILED. 1>Done Building Project "/mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" (VSTest target(s)) -- FAILED. ```	2022-11-23 14:56:04 -08:00
Ye Wang	c1bda4c1cc	fix buffer overuse in addtofeed() (#13733 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-11-23 10:53:53 -08:00
Tianlei Wu	e306b44e98	Improve coverage of fused MHA in Attention (#13732 ) Previously, fused attention was applied to limited sequence lengths (64, 96, 128, 256, 384, 512). This will expand support all sequence lengths <= 384 for V100 and T4, or 512 for A100. Previously, fused attention only works for batch_size=1. After this change, fused MHA has no limit on batch_size. ## Accuracy Tests on SQuAD Using optimized fp16 onnx model of distilbert-base-cased-distilled-squad, we test the CUDA EP with IO Binding using eval_squad.py: disable_fused_attention \| batch_size \| sequence_length \| exact \| f1 \| samples_per_second \| latency_in_ms -- \| -- \| -- \| -- \| -- \| -- \| -- TRUE \| 1 \| 384 \| 79.6 \| 86.8 \| 283.5 \| 3.5 TRUE \| 2 \| 384 \| 79.6 \| 86.8 \| 308.3 \| 3.2 FALSE \| 1 \| 384 \| 79.6 \| 86.8 \| 313.2 \| 3.2 FALSE \| 2 \| 384 \| 79.6 \| 86.8 \| 340.9 \| 2.9 TRUE \| 1 \| 300 \| 79.3 \| 86.6 \| 278.5 \| 3.6 TRUE \| 2 \| 300 \| 79.4 \| 86.6 \| 301.8 \| 3.3 FALSE \| 1 \| 300 \| 79.4 \| 86.6 \| 305.8 \| 3.3 FALSE \| 2 \| 300 \| 79.4 \| 86.6 \| 335.9 \| 3.0 It shows that with/without fused attention could achieve same accuracy. Note that latency number here is just for reference (eval_squad.py has not been optimized for speed). We can see that it is about 10% faster with fused attention than without fused attention. 
version of package used: onnx 1.12.0, torch 1.13.0, transformers 4.24.0, optimum 1.5.0, datasets 2.7.0, evaluate 0.3.0 ## Performance Test of base-based-cased on T4 GPU ``` sudo nvidia-smi -rgc export ORT_DISABLE_FUSED_ATTENTION=0 python benchmark.py -m bert-base-cased -e onnxruntime -g -p fp16 -o by_script -i 3 -t 1000 -b 1 8 -s 8 16 32 64 80 96 120 128 --use_mask_index --overwrite ``` Disable_Fused_Attention \| b1_s8 \| b1_s16 \| b1_s32 \| b1_s64 \| b1_s80 \| b1_s96 \| b1_s120 \| b1_s128 \| b8_s8 \| b8_s16 \| b8_s32 \| b8_s64 \| b8_s80 \| b8_s96 \| b8_s120 \| b8_s128 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- FALSE \| 1.32 \| 1.28 \| 1.33 \| 1.51 \| 1.71 \| 1.79 \| 1.99 \| 2.04 \| 1.56 \| 1.99 \| 2.85 \| 4.88 \| 6.03 \| 7.03 \| 9.2 \| 9.34 TRUE \| 1.37 \| 1.34 \| 1.44 \| 1.68 \| 1.89 \| 1.99 \| 2.15 \| 2.21 \| 1.63 \| 2.31 \| 3.19 \| 5.48 \| 6.98 \| 8.14 \| 10.54 \| 10.66 Latency Reduction \| 3.6% \| 4.5% \| 7.6% \| 10.1% \| 9.5% \| 10.1% \| 7.4% \| 7.7% \| 4.3% \| 13.9% \| 10.7% \| 10.9% \| 13.6% \| 13.6% \| 12.7% \| 12.4% Perf gain is observed in all sequence lengths tested.	2022-11-23 10:19:04 -08:00
Changming Sun	87e6a26c5d	Enforce Prefast check in Windows CPU CI pipeline (#13735 ) Right now we fix the warnings in an ad-hoc way. We run static analysis in nightly builds, then create work items for its findings. Our CI build pipelines run the same scan but do not break the build. So, this PR will fix the remaining findings in the CPU EP (including the training part) and enforce the check. Later on we can continue to expand the scope. We still have some warnings left in the JNI part. I will try to address them in the next month.	2022-11-23 09:25:02 -08:00
Ted Themistokleous	9168e25738	Patch eval_squad.py script for Python < 3.8 and multiple Execution Providers (#13524 ) Need this for benchmarks to function correctly with older containers This fixes import errors when attempting to run eval_squad.py to evaluate bert distilled models Adds a change to the previously merged #12947 which fails when using Python version < 3.8 to run this script. Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2022-11-23 15:37:39 +08:00
PeixuanZuo	977da6635b	[ROCm] Remove tuning options on transformerOptions (#13689 ) ### Description <!-- Describe your changes. --> Remove tuning options on transformerOptions, use IsTunableOpEnabled from provider in the future. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-11-23 15:36:09 +08:00
Yufeng Li	c43ce64795	Beam search TopK improvement (#13594 ) ### Description TopK in BeamSearch retrieves the top 2*beam next tokens based on logit score, specifically computing the top [batch, 2*beam] tokens based on score [batch, beam, vocab_size]. ### Motivation and Context The current implementation uses batch as the grid, and each thread block computes the top 2*beam from [beam, vocab_size]. It is inefficient because: 1. batch size is usually small (<32) and cannot fully leverage the GPU's SMs; 2. vocab_size is usually more than 50k. It is inefficient to process 50k * beam scores in one thread block. This PR splits the topk computation into multiple stages: - for small beam size, split [batch, beam, vocab_size] to [batch, beam, parts_of_vocab, vocab_size_per_part] - 1st stage, each thread block computes the top 2*beam from vocab_size_per_part and gets [batch, beam, parts_of_vocab, 2*beam] - 2nd stage, each thread block computes the top 2*beam from the parts_of_vocab * 2*beam candidates and gets [batch, beam, 2*beam] - last stage, compute [batch, 2*beam] from [batch, beam, 2*beam] - for large beam size, 1st stage computes [batch, beam, 2*beam] from [batch, beam, vocab_size] and 2nd stage computes [batch, 2*beam] from [batch, beam, 2*beam]. With the change, performance improves a lot: it reduces ~100us from 2ms for batch:4, beam:4, vocab_size:~50k.	2022-11-22 21:24:27 -08:00
apsonawane	7857f59d2b	Use sequences to create initial feeds for decoder subgraph (#13719 ) Use sequences to create initial feeds for decoder subgraph instead of beam_next_tokens ### Description For TuLG models, the decoder export differs from that of the BART model. Passing beam_next_tokens to the decoder during ORT inference produced results that differed from PyTorch inference. This change uses sequences as inputs for the first iteration as well. ### Motivation and Context PyTorch and ORT inference results for TuLG models did not match. Treating the PyTorch result as correct, we modified ORT to match it.	2022-11-22 18:00:58 -08:00
Baiju Meswani	fb85b31fac	Remove protobuf pin from training requirements (#13695 )	2022-11-22 12:27:18 -08:00
Yulong Wang	2bebe6189a	set node schema when apply NHWC transformer (#13660 ) ### Description set node schema when apply NHWC transformer ### Motivation and Context The implementation in `IExecutionProvider::GetCapability()` checks node schema to determine the capability of the current EP. If NHWC graph transformer created a new channel last `Conv` node to replace the channel first `Conv` node, we need to assign the schema to the replaced node.	2022-11-22 12:26:52 -08:00
Patrice Vignola	ce460f9cdb	[DML EP] Return device removal reason when D3D12 device gets removed (#13727 ) ### Description Before this change, when the D3D12 device was getting removed, we were returning a generic device removed error, which can be harder to investigate. ### Motivation and Context It makes it easier to debug and investigate device removal failures.	2022-11-22 10:38:56 -08:00
Patrice Vignola	6c5333e1a7	[DML EP] Enable more DML tests (#13726 ) ### Description Enables more DML tests. ### Motivation and Context It increases test coverage that was missing for the DML EP	2022-11-22 10:35:16 -08:00
Adam Pocock	dd2c031d95	[java] Sparse tensor support (#10653 ) Description: Adds support for creating and receiving sparse tensors in the ORT Java API. CSRC and COO tensors as inputs are tested, but there is no op which accepts a block sparse tensor to test. COO tensors are tested as outputs, but there is no op which emits a CSRC or block sparse tensor to test. Motivation and Context - Why is this change required? What problem does it solve? Request to expose ORT sparse tensor support in Java. cc @yuslepukhin	2022-11-22 10:29:24 -08:00
Tianlei Wu	8b0e0f4927	Add RemovePadding and RestorePadding for BERT model (#13701 ) Add two operators, RemovePadding and RestorePadding, based on the idea of effective transformer (https://github.com/bytedance/effective_transformer) to improve large batch size inference for the BERT model.	2022-11-22 10:00:23 -08:00
guyang3532	ba9a585fcc	Fix the tensor save for backward release problem (#13679 ) Motivation: PythonOp saves its input for backward. This is risky because the ONNX Runtime backend is not aware of it: the tensor buffer may be "released" by ORT and then potentially modified by other operators before the backward function executes. Fix: This PR simply clones all inputs of PythonOp before forward is invoked. This may add high overhead; it is just a workaround until a better fix is available.	2022-11-22 17:32:19 +08:00
pengwa	947aab0ae0	Make HF converge with Lightning native amp (#13616 ) ### Fix training convergence issues #### Problem: Huggingface Transformers: 4.22.0 PyTorch Lightning: 1.6.3 PyTorch: v1.12.1, cuda 11.6 ORT: main branch, cuda 11.6 Model: RobertaForSequenceClassification @ models/roberta/modeling_roberta.py Mixed Precision training with `torch.autocast`: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)` Under this amp autocast context, forward + loss computation run. Here is a snippet of loss computation. ``` if labels is not None: ... if self.config.problem_type == "regression": loss_fct = MSELoss() if self.num_labels == 1: ... elif self.config.problem_type == "single_label_classification": loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) elif self.config.problem_type == "multi_label_classification": ... return SequenceClassifierOutput( loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) ``` After the forward run, the loss is 1.0850 in float16, which looks good. It is then scaled up here: `a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62)`, where the scaler is 65536. This gives a scaled loss of 71104 in float32 (multiplying the float16 loss by the fp32 scaler promotes the type to fp32). Backward then starts with initial grads of 1, and 1 (float32) * 65536 (float32) in the backward step generates a float16 gradient, which is `inf`. This is where the problem occurs. The backward pass feeds the `inf` into the crossentropygradient op, generating `nan`s, and all gradients become `nan` during back propagation. So training with ORTModule almost always reports `overflow`, and the loss does not drop as much as it does with PyTorch. 
#### Analysis for the UT (when autocast enabled) The PyTorch trace graph looks like this: ``` graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0), %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)): %9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %12 : NoneType = prim::Constant() %13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %17 : NoneType = prim::Constant() %18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %19 : NoneType = prim::Constant() %input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 %21 : NoneType = prim::Constant() %22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %24 : float = 
prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %data : Float(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 %27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0 return (%27) ``` The most important lines: %target : Long(16, strides=[1], requires_grad=0, device=cuda:0), %input : _Half_(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0 _Float_(requires_grad=0, device=cuda:0) = aten::cross_entropy_loss(%_input_, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0 `aten::cross_entropy_loss` takes Half input and returns Float output. As stated in the doc (https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32), `cross_entropy` in autocast mode runs in fp32, i.e. it converts its input to fp32 (if it is not already), does the compute, and returns an fp32 result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal` takes the same type for input and output, and our code `31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)` assumed this when exporting `aten::cross_entropy_loss` and set the output to fp16 as well. This is the cause of the problem. #### Possible Fixes 1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types of input and output. 2. Check the input and output when exporting, and add the input cast explicitly if there is type promotion from input to output. This PR uses the 2nd approach. We can start on the 1st approach later when needed. TODO: revisit all other exporter functions, add the checks, etc. 
### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-11-22 15:08:30 +08:00
Changming Sun	67e46a873a	Add '-DCMAKE_OSX_ARCHITECTURES=x86_64;arm64' when build protobuf from source on MacOS (#13720 ) ### Description Add '-DCMAKE_OSX_ARCHITECTURES=x86_64;arm64' when building protobuf from source on MacOS. Later on we will link the built library with the other parts of onnxruntime to generate libonnxruntime.dylib, and if the target CPU ARCH of libonnxruntime.dylib is not x86_64, it will fail. ### Motivation and Context To fix a packaging pipeline failure, which was introduced by #13694	2022-11-21 21:59:34 -08:00
PeixuanZuo	8f3c6ea0df	[ROCm] Add GemmFastGelu TunableOp (#13589 ) ### Description <!-- Describe your changes. --> 1. Update the rules for GemmFastGelu fusion, MatMul input x should >= two dimension, input weight should == two dimension. 2. Add GemmFastGelu fusion test. 3. Add GemmFastGelu TunableOp, only contains the original implementation(Gemm + FastGelu). ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-11-22 12:58:01 +08:00
PeixuanZuo	45a895cdc3	[ROCm] Fix static TunableOp (#13668 ) ### Description <!-- Describe your changes. --> 1. Re-add staticSelectionOp for FastGelu. 2. Call TunableOp when enable tuning. Call StaticSelectionOp when disable tuning. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-11-22 10:51:54 +08:00
Yulong Wang	f1b5e4f1c9	[js] [deps] upgrade @xmldom/xmldom@0.7.9 (#13705 ) ### Description upgrade @xmldom/xmldom@0.7.9 ### Motivation and Context ``` yarn audit yarn audit v1.22.19 ┌───────────────┬──────────────────────────────────────────────────────────────┐ │ critical │ xmldom allows multiple root nodes in a DOM │ ├───────────────┼──────────────────────────────────────────────────────────────┤ │ Package │ @xmldom/xmldom │ ├───────────────┼──────────────────────────────────────────────────────────────┤ │ Patched in │ >=0.7.7 │ ├───────────────┼──────────────────────────────────────────────────────────────┤ │ Dependency of │ @expo/config-plugins │ ├───────────────┼──────────────────────────────────────────────────────────────┤ │ Path │ @expo/config-plugins > @expo/plist > @xmldom/xmldom │ ├───────────────┼──────────────────────────────────────────────────────────────┤ │ More info │ https://www.npmjs.com/advisories/1084900 │ └───────────────┴──────────────────────────────────────────────────────────────┘ 1 vulnerabilities found - Packages audited: 952 Severity: 1 Critical Done in 3.51s. ```	2022-11-21 17:01:42 -08:00
Seungwon Jeong	307ad1413a	[js/web] support 'pytorch_half_pixel' mode for WebGL kernel 'Resize' (#11208 ) Description: 1. add pytorch_half_pixel interpolation mode in resize-packed.ts Changes: add the following case in createPackedResizeProgramInfo function: ``` case 'pytorch_half_pixel': getSourceFracIndex = ` vec4 getSourceFracIndex(ivec4 coords) { vec4 fcoords = vec4(coords); return vec4( ${outputWidth}.0 > 1.0 ? (fcoords.x + 0.5) / scaleWHWH.x - 0.5 : 0.0, ${outputHeight}.0 > 1.0 ? (fcoords.y + 0.5) / scaleWHWH.y - 0.5 : 0.0, ${outputWidth}.0 > 1.0 ? (fcoords.z + 0.5) / scaleWHWH.z - 0.5 : 0.0, ${outputHeight}.0 > 1.0 ? (fcoords.w + 0.5) / scaleWHWH.w - 0.5 : 0.0 ); } `; break; ``` 2. fix "unrecognized input '' for node: Resize_$num" error when inputs like [input_tensor, None, scale_factor] (roiInput not given) are fed into the resize layer. Changes: change in input handling logic in upsample.ts & node scanning logic in graph.ts Motivation and Context Before this fix, we aren't able to use webGL backend when the neural network contains pytorch resize layers. This fix adds 'pytorch_half_pixel' interpolation mode support and makes it possible to use webGL backend for more kind of computer vision networks. This commit solves: #10430 Co-authored-by: neo <neo@icode-lab.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2022-11-21 12:03:48 -08:00
shalvamist	3119381011	ORT Web build script (#12643 ) Description: Adding a few scripts to enable users to build ORT Web in a simpler way. Instructions: Under the ROOT\js folder you will have 2 scripts - 1. "Build_web.bat" - for Windows users 2. "Build_web.sh" - for Linux users The default build configuration is "Release". To change the build configuration, add the flag "--config <Desired configuration>" to the script call. For example: ``` build_web.bat --config Debug ``` Co-authored-by: shalvamist <shalva.mist@microsoft.com>	2022-11-21 11:08:39 -08:00
Changming Sun	a5c2047dd1	Fix the remaining Prefast warnings in CPU EP (#13707 ) ### Description Fix the remaining Prefast warnings in CPU EP.	2022-11-21 10:21:38 -08:00
cloudhan	8de5381e84	Add IsSupported support to Op functor (#13692 ) Sometimes it is a bit risky to call the Op directly to check whether the impl supports consuming the param. This gives the user a way to actually implement `IsSupported` for checking in a non-compact way.	2022-11-21 19:22:00 +08:00
shalvamist	4a2a857030	Bug Fix - WASM build break (#13699 ) ### Description Using the build flag "--cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1" with WASM results in a build break. Since we are comparing a const vs. non-const T type, the added cast resolves the issue.	2022-11-20 23:30:31 -08:00
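
The loss-scaling overflow described in commit 947aab0ae0 (#13616) can be reproduced with plain arithmetic. The sketch below is not ORT or PyTorch code: `to_fp16` is a hypothetical helper that emulates a float16 cast using the standard library, and the final `inf * 0.0` step is only an illustration of how an `inf` turns into `nan`, not the actual cross-entropy gradient computation.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values that are too large for fp16;
        # a float16 cast on the GPU would produce +/-inf instead.
        return math.copysign(math.inf, x)


scale = 65536.0                      # loss scaler quoted in the commit message
loss_fp16 = to_fp16(1.0850)          # forward loss as stored in float16
scaled_loss = loss_fp16 * scale      # promoted to a wider type, so still finite
initial_grad = to_fp16(1.0 * scale)  # the same product forced into float16
poisoned = initial_grad * 0.0        # illustrative only: inf * 0 is nan

print(loss_fp16, scaled_loss, initial_grad, poisoned)
```

The scaled loss comes out as 71104, matching the value in the commit message, while the float16 gradient overflows because 65536 is beyond the largest finite float16 value (65504).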


7863 commits
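
The staged TopK described in commit c43ce64795 (#13594) relies on the fact that the top k of a set equals the top k of the per-partition top-k results. The following is a minimal CPU sketch of that idea for a single batch entry, assuming k = 2*beam; the function names `top_k` and `staged_top_k` and the partition count are hypothetical, and nothing here reflects the actual CUDA kernel layout.

```python
import heapq
import random


def top_k(candidates, k):
    """Return the k largest (score, beam, token) tuples, best first."""
    return heapq.nlargest(k, candidates)


def staged_top_k(scores, k, num_parts):
    """scores[b][t] is the score of token t on beam b, for one batch entry."""
    vocab_size = len(scores[0])
    part_size = -(-vocab_size // num_parts)  # ceiling division
    per_beam = []
    for b, row in enumerate(scores):
        survivors = []
        for start in range(0, vocab_size, part_size):
            stop = min(start + part_size, vocab_size)
            part = [(row[t], b, t) for t in range(start, stop)]
            survivors.extend(top_k(part, k))   # 1st stage: top k per vocab part
        per_beam.extend(top_k(survivors, k))   # 2nd stage: top k per beam
    return top_k(per_beam, k)                  # last stage: top k across beams


random.seed(0)
beam, vocab_size = 4, 1000
scores = [[random.random() for _ in range(vocab_size)] for _ in range(beam)]
k = 2 * beam

# Reference: a single top-k over every (beam, token) pair.
direct = top_k([(s, b, t) for b, row in enumerate(scores) for t, s in enumerate(row)], k)
staged = staged_top_k(scores, k, num_parts=8)
print(staged == direct)
```

Each stage only ever sorts a small candidate list, which is what allows the work to be spread over many thread blocks in a GPU implementation.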