onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-13 18:08:13 +00:00

Author	SHA1	Message	Date
Edward Chen	c46c7ccba5	Update Gradle version (#14862 ) - Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable. Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects. - Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.	2023-03-08 12:22:06 -08:00
Changming Sun	d9436407b6	Use safe allocator for JNI code (#13999 ) ### Description Use a customized allocarray function to replace the original malloc calls to avoid integer overflow. ### Motivation and Context Fix Prefast warnings. Fixed [AB#8990](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8990) Fixed [AB#8991](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8991) Fixed [AB#9016](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/9016)	2023-03-08 11:40:55 -08:00
Adam Pocock	47f00b5d49	[Java] Initial on device training support (#14027 ) contributor: @Craigacp	2023-03-08 10:01:08 -08:00
Ashwini Khade	f14ab63c19	fix prefast warnings (#14931 ) ### Description Fixes prefast warnings Fixed [AB#11328](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11328) Fixed [AB#11329](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11329)	2023-03-08 09:49:15 -08:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
Kyushick Lee	c696392f0c	Support external output tensors for DORT (#14516 ) ### Description <!-- Describe your changes. --> Support externally-managed output tensors (torch Tensors) for dort. Add `preallocate_output` option to OrtBackend to rely on externally-managed output tensors for dort. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> DORT currently allocates and returns output ortvalues and convert them to torch Tensors. The conversion based on dlpack does not support torch Tensors for custom Aten backends, and it is not yet possible to transfer the ownership from ortvalue to external handle (torch Tensor). To avoid this issue, the PR change provides an option (`preallocate_output`) to allocate output tensors externally in pytorch, which creates torch Tensor for an Aten backend, and let dort take pointers from torch Tensors to construct output ortvalues instead of allocating them inside InferenceSession.	2023-03-07 21:32:23 -08:00
edgchen1	2ef25a2200	Update CODEOWNERS file.	2023-03-07 17:56:37 -08:00
edgchen1	5b3f79a11a	Add gradle wrapper validation workflow.	2023-03-07 17:56:37 -08:00
Ashwini Khade	f71ac9859e	Update acpt image in the training pipeline (#14855 ) ### Description Current pipeline refers to an old image which is causing test failures. Updating the image to the latest one. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? Fixes pipeline failure: https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=198 - If it fixes an open issue, please link to the issue here. -->	2023-03-07 14:10:32 -08:00
pengwa	5d8ce817cb	Fix simplified layer norm fusion for training (#14866 ) ### Fix simplified layer norm fusion for training Co-authored with @prathikr. Fixes a bug identified by @prathikr: https://github.com/microsoft/onnxruntime/issues/14822. When running a T5 model with DeepSpeed enabled, simplified layer norm is not fused because the device check does not pass `b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568)`. There is no device placement yet during the pre-training optimization pass, so the failing device check is expected. On the other hand, the device check is still needed for simplified layer norm fusion to work correctly for CPU runs. As a mitigation, this change adds a flag that indicates whether the fusion is triggered by pre-training optimization. There is still a risk when running ORTModule training with the CPU EP, but the risk is much reduced if we check that CUDA/ROCM is enabled for the build. ``` CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json ```	2023-03-07 13:59:20 -08:00
Patrice Vignola	65f1f840f6	[DML EP] Fix Attention regression caused by removing transposes (#14908 ) By removing the transposes and using strides instead, the metacommands are not able to be reached anymore since it's not using NCHW layout.	2023-03-07 11:17:28 -08:00
Xavier Dupré	6b604521a6	Fix tree implementation when left, right node have lower index (#14839 ) ### Description The previous implementation did not allow the left or right child of a node to have an index lower than the node itself. That condition prevented the tree from entering an infinite loop, but LightGBM does not follow that rule. This change leaves the algorithm as is and removes the test enforcing that condition. ### Motivation and Context It fixes a regression introduced by #14670.	2023-03-07 19:47:12 +01:00
Hitesh Shah	66101c02a2	Implement AllToAll collective op	2023-03-07 10:17:07 -08:00
Adam Pocock	150043f74f	Adds a Java accessor for GetVersionString (#14876 ) ### Description Java part of #14873.	2023-03-07 09:46:56 -08:00
Xavier Dupré	5930e7e22f	Introduce RemovableAttributes (#14868 ) ### Description TreeEnsemble* kernels fully copy all the parameters from the ONNX graph. Even if they are no longer needed or are unused (hitrates), they remain in memory. For big models (>= 200 trees, max_depth > 10), the model usually weighs more than 10 Mb. This change lets a kernel remove all unneeded attributes after they were used to create the session. Attributes are deleted after the model was possibly saved, at the end of session creation. The current design is up for debate: * it stores the list of removable attributes in class `onnxruntime::Node`, * the node is marked as `const` every time this implementation needs to register the name of a removable attribute or to remove them. The current implementation is just a POC as it needs to cast `onnxruntime::Node` into `const onnxruntime::Node`. Should we keep the list of removable attributes in `onnxruntime::Node`? ### Motivation and Context Motivation is mostly to reduce memory consumption. --------- Signed-off-by: xadupre <xadupre@microsoft.com>	2023-03-07 12:37:12 +01:00
JiCheng	be1416d032	prefast C26451 (#14933 ) Fixed [AB#13290](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13290)	2023-03-07 15:16:50 +08:00
Changming Sun	3e08a67dd6	Add Linux ARM64 CI pipeline (#14904 )	2023-03-06 21:47:10 -08:00
Adrian Lizarraga	d45b47945c	Linux QNN Pipeline: fix build error reporting (#14922 ) ### Description Split up the ORT build step in the Linux QNN CI Pipeline. ### Motivation and Context Build errors were not being immediately reported at the end of the build step. The build step currently concatenates multiple shell commands, and the return code for the last (mkdir) was being reported. This PR ensures that the return code of the `python build.py ...` command is reported for the build step.	2023-03-06 17:49:35 -08:00
Sheil Kumar	f88b97ede2	Cast to int32_t->size_t to avoid prefast overflow warning (#14902 ) Cast to int32_t->size_t to avoid prefast overflow warning	2023-03-05 06:21:46 -08:00
Tianlei Wu	6c8538f086	Fix prefast warning (#14895 ) Fix a prefast warning: `The const variable 'spatial_dim_start' can be computed at compile-time. Consider using constexpr (con.5).`	2023-03-03 12:54:28 -08:00
Changming Sun	c1155b70c5	Remove 37 and 50 from CUDA compute archs (#14874 ) ### Description To reduce the CUDA package's size a little. 37 is for Tesla K80. Azure's NC-series uses it, but in most cases CUDA can dynamically generate device code.	2023-03-03 12:24:21 -08:00
George Wu	289f7dbcdd	enable pybind for qnn ep (#14897 ) enable python bindings for QNN EP. tested on Windows Dev Kit 2023 (ARM64) with python 3.11 (ARM64) from https://www.python.org/ftp/python/3.11.1/python-3.11.1-arm64.exe	2023-03-03 07:26:53 -08:00
pengwa	f6c81d8aca	Introduce padding inspector in ORTModule (#14652 ) ### Introduce padding inspector in ORTModule In some Transformer-based LLM training recipes, high data sparsity is observed due to 1). token padding (to max sequence length), 2). labels that contain many ignore_index values, which are skipped when calculating the loss. This PR introduces a switch to enable data sparsity inspection, which 1). in the short term, can prompt training users to apply techniques like dynamic batching to amortize the issue. 2). in the medium and longer term, also helps us (the training team) better understand what our training customers' models look like from the perspective of data sparsity (and potentially motivates runtime improvements). Here is an example of different data sparsity with same training model arch, same training input, but with different user models. Low Embed Density, High Label Density Case - Sentence Classification ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/text-classification/run_glue.py --model_name_or_path roberta-large-openai-detector --task_name mnli --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir --output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137 --fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused ` ``` >>>Valid token/label density (e.g. 
valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 60 \| EMBED \| input_ids \| 1 \| 35.21 % \| 1442 \| 4096 \| [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] \| \| 61 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 62 \| EMBED \| input_ids \| 1 \| 30.00 % \| 1229 \| 4096 \| [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] \| \| 63 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 64 \| EMBED \| input_ids \| 1 \| 26.73 % \| 1095 \| 4096 \| [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] \| \| 65 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 66 \| EMBED \| input_ids \| 1 \| 30.03 % \| 1230 \| 4096 \| [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] \| \| 67 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 68 \| EMBED \| input_ids \| 1 \| 31.62 % \| 1295 \| 4096 \| [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] \| \| 69 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| <<< ``` High Embed Density, Low Label Density Case - masked language model ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused ` ``` >>>Valid token/label density 
(e.g. valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 710 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 711 \| LABEL \| labels \| -100 \| 13.77 % \| 564 \| 4096 \| N/A \| \| 712 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 713 \| LABEL \| labels \| -100 \| 14.48 % \| 593 \| 4096 \| N/A \| \| 714 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 715 \| LABEL \| labels \| -100 \| 14.18 % \| 581 \| 4096 \| N/A \| \| 716 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 717 \| LABEL \| labels \| -100 \| 14.53 % \| 595 \| 4096 \| N/A \| \| 718 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 719 \| LABEL \| labels \| -100 \| 15.31 % \| 627 \| 4096 \| N/A \| <<< ``` #### Next Step Let's see how we leverage the data sparsity for improvement. Optimizations on the way around compute optimizer wave 2: > Loss compute flops reduction. > Flatten/Unflatten embedding tokens to save compute flops.	2023-03-03 18:36:08 +08:00
Yi Zhang	8c454a76e0	Check Mac silicon package name (#14898 ) ### Description 1. add comments 2. check Mac silicon package name ### Motivation and Context There is no Mac silicon agent in ADO, so we cannot add a smoke test to verify that the wheel can be installed. But we can check whether the package name is correct, to avoid repeating the mistake made in the 1.14 release. Test run https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=283100&view=logs&j=fe710151-df7c-5aa4-0cea-cf5331faa499&t=3182cefe-2612-53c6-4445-e5b3e0c4ac57	2023-03-03 18:27:54 +08:00
cloudhan	a997bb46b6	Refactor rocm attention (#14688 ) Extract QKV projection and attention computation into pipelines (composed from gemms and kernel launch). This will allow us to introduce ck flash attention in next PR	2023-03-03 12:16:11 +08:00
Changming Sun	f3b6664384	Remove Python 3.7 from the python packaging pipeline (#14887 ) ### Description 1. Remove Python 3.7 from the python packaging pipeline. It is planned for the next release and approved by the PMs. Also we will add 3.11, but it will be addressed in another PR. 2. Stop generating python packages based on Ubuntu 18.04 which will reach EOL next month. We will either replace them with Ubuntu 20.04 or a CentOS 8 variant.	2023-03-02 19:44:49 -08:00
guyang3532	c49f250a14	Delete ort_model._modules to forward access to torch_model._modules (#14563 ) Missing '_modules' attribute in ORTModule will cause load_state_dict for wrapped_ortmodule to fail. Reference: https://github.com/microsoft/onnxruntime/pull/7847	2023-03-03 10:12:37 +08:00
Dmitri Smirnov	8d87fdcfa1	Add GetVersionString API for C++, C# and Python (#14873 ) ### Description Added APIs. ### Motivation and Context Addresses https://github.com/microsoft/onnxruntime/issues/14584 Cc: @Craigacp	2023-03-02 17:11:07 -08:00
Zachary Streeter	6e2ca15140	added miopenGetConvolutionSpatialDim if ROCm5.5 (#14772 ) The API should be `miopenGetConvolutionSpatialDim(cdesc, &spatial_dim)`, NOT `miopenGetConvolutionDescriptorSize(cdesc, &spatial_dim)`	2023-03-03 08:02:32 +08:00
Yuriy Chernyshov	0ebe8e34f8	Do not use ADL to invoke std algos (#14851 ) This is a follow up for #14716	2023-03-02 15:38:33 -08:00
Chun-Wei Chen	70a31e047a	Consume ONNX 1.13.1 in ONNX Runtime (#14812 ) ### Description <!-- Describe your changes. --> Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's ONNX submodule consistent with the latest released ONNX. Not sure whether this PR is really needed, but let me make it ready. Previous PR for testing ONNX 1.13.1rc2 : https://github.com/microsoft/onnxruntime/pull/14634. Fixed [AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174) .	2023-03-02 14:57:35 -08:00
G. Ramalingam	2facc5efe6	Fix function inliner bug re. outer-scope names (#14734 ) ### Description Fix the function inliner logic for renaming variables. Typically, a FunctionProto does not contain references to outer-scope names. However, special cases, such as the function-expansion of SequenceMap, can generate such FunctionProtos. Extend the renaming logic to ensure that references to outer-scope names are not renamed. ### Motivation and Context Fixes https://github.com/onnx/onnx/issues/4892 Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>	2023-03-02 12:08:37 -08:00
Chen Fu	603026fb84	Transpose for 16b tensors (#14877 ) ### Description Matrix transpose for 16b tensors (shorts, and half precision floats) ### Motivation and Context Need it for fp16 operations	2023-03-02 11:32:05 -08:00
Rachel Guo	7cd4b334a9	[CoreML EP] Add Flatten Op and LRN Op support (#14857 ) ### Description <!-- Describe your changes. --> As title. CoreML Spec for reference: https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#flattento2dlayerparams https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#lrnlayerparams ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fill CoreML Clipchamp usage gaps. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-03-02 09:43:15 -08:00
Hector Li	bf35ad2aa3	[Qnn EP] Call Qnn deviceCreate during backend setup and deviceFree during shutdown (#14875 ) ### Description Call Qnn deviceCreate during backend setup and call deviceFree during shutdown ### Motivation and Context Align with the formal Qnn setup and shutdown procedure.	2023-03-02 08:54:13 -08:00
cao lei	d69823f764	Do not create Barrier and triggerDownstream steps if the corresponding nodes are split by yield Op in training scenario (#14570 ) ### Description Do not create Barrier and triggerDownstream steps during execution plan creation if the corresponding nodes are split by the yield Op in the training scenario. ### Motivation and Context In the training scenario, the forward and backward passes run two different partial graphs. If two nodes sit in different partial graphs and in two separate streams, triggerDownstream/barrier steps are still created between them. These behave quite differently from the inference case, because one of the steps is never executed: it falls outside the range being run. Until now, a hack triggered the barrier step explicitly for training. This PR adds a check and skips creating Barrier and triggerDownstream steps when the corresponding nodes are split by the yield Op in the training scenario, so the hack is no longer needed.	2023-03-02 07:08:29 -08:00
Tianlei Wu	c66af46fc1	Doc for Stable Diffusion CUDA Optimizations (#14830 ) Add document for stable diffusion optimizations and benchmark.	2023-03-01 19:29:30 -08:00
Hector Li	c6074f3a4b	OnnxRuntime QNN EP (#14791 ) ### Description Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices ### Motivation and Context Enable Ort inference on QC hexagon NPU devices. --------- Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>	2023-03-01 13:48:20 -08:00
George Wu	6044312a43	fix TRT dockerfile documentation https://github.com/microsoft/onnxruntime/issues/14556 (#14600 ) address https://github.com/microsoft/onnxruntime/issues/14556	2023-03-01 07:02:42 -08:00
Scott McKay	b7fde84341	Changes to support standalone custom ops in a minimal build. (#14497 ) ### Description <!-- Describe your changes. --> Changes to support standalone custom ops in a minimal build. Also incorporates changes from #14492 (needed to test builds prior to that being checked in). We first need to save the schema info from the operators used by the standalone op invoker in the ORT format model. Add mechanism for that. Merge the kernel lookup logic so the same is used in full and minimal build. NOTE: the version matching is now consistent with all other kernel lookups, and the call to CreateOp MUST use the exact version for the operator. Previously matching wasn't as strict, but this can lead to the incorrect kernel being chosen. Add tests. NOTE: There is currently no way to detect the ops/types/opsets used inside these custom ops as they don't exist until we create kernels, which is after model loading completes (which is the point the ORT format model is saved). Due to that they have to be manually added to the configuration used to do the reduced ops build. That shouldn't be too hard for the custom op author to add given the custom op implementation is specifying the op, opset and type constraints (i.e. they have the info and it's just a case of capturing/formatting it correctly). ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable usage of the standalone op invoker by custom ops in a minimal build. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-03-01 11:22:54 +10:00
Chen Fu	acc2ac627f	Fp16 Activations (#14722 ) ### Description NEON fp16 SIMD implementation of Activation functions ### Motivation and Context Step 2 of fp16 SIMD support. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-28 17:20:40 -08:00
Yulong Wang	69c5edb11b	[wasm] upgrade emsdk from 3.1.19 to 3.1.32 (#14818 ) ### Description upgrade emsdk from 3.1.19 to 3.1.32 also add explicit config for stack size (1MB).	2023-02-28 11:06:09 -08:00
Nat Kershaw (MSFT)	95c777b745	Fix link to High Level Design (#11786 ) Address #11661	2023-02-28 11:05:54 -08:00
Yi Zhang	6320decf04	increase Test GPU Job's timeout to 8 hours (#14850 ) ### Description <!-- Describe your changes. --> ### Motivation and Context In practice, 6 hours is not enough to finish the job.	2023-02-28 18:52:03 +08:00
pengwa	79aa0acdd0	SCELoss(SCELossGrad) support half(float) input float(half) output (#13972 ) ### Description A follow-up change for https://github.com/microsoft/onnxruntime/pull/13616. SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad support different types for input and output. Add SCELoss(SCELossGrad) support for half(float) input with float(half) output. ### Test Note #### Add tests for variant input and output types. Adding such tests required refactoring the existing test code for SCE loss and SCELossInternal gradient. Originally, for FP32 input and output, the CPU kernels ran as the baseline, CUDA/ROCM then ran with the same data, and CompareOpTester compared the results against the CPU run. For FP16 input and output, the CPU (which has no half kernels) ran Cast_to_float->CPU kernel->cast_to_half as the baseline, CUDA/ROCM then ran with the same data using the half implementation, and CompareOpTester compared the results against the CPU run. Now we want to support running with different input and output types. The proposed change is to always run the CPU kernels with float input and output as the baseline (because the CPU only has float kernel implementations); this is the first step of every test. Then we run the CUDA/ROCM kernels using half_input_half_output, float_input_float_output, half_input_float_output, and float_input_half_output, wherever a corresponding kernel is registered. Afterwards, we compare the CUDA/ROCM results with the CPU float baseline. One point deserves a special note: CompareOpTester's result comparison can be looser than OpTester's. Roughly speaking, the former tolerates diff <= atol + rtol * expected_value, while the latter tolerates diff < atol && diff < rtol * expected_value. When the expected value is very small, as in many of our test cases, the former can pass while the latter fails. So the refactoring also moves the check outside of OpTester and explicitly checks the values the way CompareOpTester did (to match the previous behaviour).	2023-02-28 18:02:08 +08:00
Justin Stoecker	08699c8052	Address SDL warnings in recent STFT changes (#14847 ) ### Description Addresses two separate SDL warnings, neither of which points to a cause for concern: 1. `The expression '0<=_Param_(1)&&_Param_(1)<=3-1' is not true at this call. at D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\\DmlSTFT.h@443,33`. In other words, the tool thinks one of the calls to `barriers[barrierCount++]` will be an index-out-of-range issue, even though this is not currently possible. Switching a normal C array avoids this complaint. 2. `'_Param_(1)' could be '0': this does not adhere to the specification for the function 'CD3DX12_RESOURCE_BARRIER::UAV'`. The d3dx12 helper for UAV barriers has the wrong SAL annotation and doesn't allow a null resource (`_In_`), even though a null resource is legal and well defined. Updated the annotation to `_In_out_` and created a PR upstream. ### Motivation and Context Pacify SDL tasks in CI pipelines.	2023-02-27 21:01:34 -08:00
kailums	9bdd42115c	add build flag for rocblas tune and fix bug (#14797 ) ### Description 1. Add a build flag for the rocblas tuning feature. 2. Fix a build bug when rocblas tuning is enabled. ### Motivation and Context The rocblas tuning feature had no build flag to control it, only a macro. This change adds a build flag and fixes a code bug that appears when rocblas tuning is enabled.	2023-02-28 10:37:07 +08:00
Yulong Wang	2d079c6333	[js/web] disable multi-thread test on Node.js in E2E test (#14844 ) ### Description Disable the multi-thread test on Node.js in the E2E test. This test never worked on Node.js, but the CI does not pick up the error every time, so it became a flaky test case that sometimes causes a build break. Disable the test for now and re-enable it once it is fixed.	2023-02-27 16:01:51 -08:00
Yi Zhang	0be20dc0f6	Run GPU test job after all CPU test jobs succeed. (#14833 ) ### Description Make the GPU job depend on all CPU jobs. ### Motivation and Context GPU resources are very limited in the packaging pipeline, and the GPU test job is very time consuming. If even one CPU job fails, the workflow fails, so running the GPU job is pointless. To use GPU resources more efficiently, run the GPU job only after all CPU jobs succeed. ### Test pipeline https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=280905&view=results	2023-02-28 07:44:51 +08:00
Chen Fu	26abaeb284	Fix half precision gemm test accumulation error (#14842 ) ### Description Half precision gemm test requirement relaxation ### Motivation and Context Most CPUs do not support mixed-precision accumulation, only fused multiply-add. As a result, different striding on the K dimension may lead to rounding errors, and the accumulation of these rounding errors may be very significant. So setting an approximation ratio does NOT always work. What's more, a relaxed test condition may hide real implementation problems, so this is only a compromise fix. More rigorous tests require manual effort: 1. Change the K stride of the kernel under test to be 16. 2. Force the K stride of the fp16 kernel to 16. 3. Change the test oracle to be exact match. 4. Pass this test and then change it back :-(. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-27 13:23:14 -08:00
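
Note on commit 79aa0acdd0 (SCELoss mixed input/output types): the message contrasts two tolerance rules, a combined bound (diff <= atol + rtol * expected_value) and a stricter pair of separate bounds (diff < atol and diff < rtol * expected_value). The following is a minimal Python sketch of that difference only. The function names are hypothetical, the use of `abs(expected)` is an assumption, and this is not the CompareOpTester or OpTester code.

```python
def passes_combined(actual, expected, atol, rtol):
    """Combined bound: diff <= atol + rtol * |expected| (hypothetical helper)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def passes_separate(actual, expected, atol, rtol):
    """Separate bounds: diff < atol and diff < rtol * |expected| (hypothetical helper)."""
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


# A very small expected value: the relative term rtol * |expected| is tiny,
# so the separate-bounds rule rejects a difference the combined rule accepts.
expected, actual = 1e-6, 1.5e-6
print(passes_combined(actual, expected, atol=1e-5, rtol=1e-3))
print(passes_separate(actual, expected, atol=1e-5, rtol=1e-3))
```

With these illustrative values the difference is 5e-7, which is below atol but far above rtol * |expected| = 1e-9, so only the combined rule passes. This is the situation the commit message describes for tests with very small expected values.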

8265 commits
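
Note on commit f6c81d8aca (padding inspector): the DENSITY column in its sample output is the share of entries that differ from the padding index (for `input_ids`) or from the ignore index (for `labels`). The sketch below shows that arithmetic in plain Python under stated assumptions: `token_density` is a hypothetical helper operating on nested lists, not the ORTModule implementation, which works on tensors.

```python
def token_density(batch, ignore_value):
    """Return (valid, total, density_percent, valid_per_row) for a 2-D batch.

    `batch` is a list of equal-length rows of token or label ids;
    `ignore_value` is the pad index (e.g. 0 or 1) or the ignore index (e.g. -100).
    Hypothetical helper for illustration only.
    """
    valid_per_row = [sum(1 for v in row if v != ignore_value) for row in batch]
    valid = sum(valid_per_row)
    total = sum(len(row) for row in batch)
    density = 100.0 * valid / total if total else 0.0
    return valid, total, density, valid_per_row


# Two sequences padded to length 4 with pad index 1.
batch = [[5, 6, 1, 1],
         [7, 1, 1, 1]]
valid, total, density, per_row = token_density(batch, ignore_value=1)
print(f"{density:.2f} % | {valid} | {total} | {per_row}")

# The same ratio applied to the first row of the commit's sample output:
# 1442 valid tokens out of 4096.
print(f"{100.0 * 1442 / 4096:.2f} %")
```

The second print reproduces the 35.21 % figure from step 60 of the sentence classification run quoted in the commit message.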