onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-08 17:17:15 +00:00

Author	SHA1	Message	Date
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964 ) ### Description 1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers/non-edge devices. 2. Update ENABLE_TRAINING option: With this PR when this option is enabled, training apis and torch interop are also enabled. 3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: - Removed user facing option - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop. Once this PR is merged when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features). Training entry points include: 1. ORTModule 2. Training APIs Features include: 1. ATen Fallback 2. All Training OPs includes communication and collectives 3. Strided Tensor Support 4. Python Op (torch interop) 5. ONNXBlock (Front end tools for training artifacts prep when using training apis) ### Motivation and Context Intention is to simplify the options for building training enabled builds. This is part of the larger work item to create dedicated build for learning on the edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
Hariharan Seshadri	d43e0ec9ba	Misc transformer fixes (#14103 ) ### Description 1. SkipLayerNormalization has a new output (https://github.com/microsoft/onnxruntime/pull/13988) and the symbolic shape inference script needs corresponding updates 2. The greedy sampling op (https://github.com/microsoft/onnxruntime/pull/13426) shouldn't re-use the logits buffer as its corresponding kernel doesn't seem to support it yet. ### Motivation and Context Fix some transformer issues	2023-01-03 13:05:55 -08:00
Patrice Vignola	e7f9d40dde	[DML EP] Force instance norm inputs to be 4D to better target metacommands (#14020 ) ### Description Force instance norm inputs to be 4D to better target metacommands ### Motivation and Context This may improve performance on some hardware by allowing the driver to return valid layouts to DML when querying for metacommand support.	2023-01-03 12:47:10 -08:00
Patrice Vignola	589612106a	[DML EP] Force layer norm inputs to be 4D to better target metacommands (#14022 ) ### Description Force layer norm inputs to be 4D to better target metacommands ### Motivation and Context This may improve performance on some hardware by allowing the driver to return valid layouts to DML when querying for metacommand support.	2023-01-03 12:46:33 -08:00
RandySheriffH	587e891cae	CloudEP (#13855 ) Implement CloudEP for hybrid inferencing. The PR introduces zero new API; customers can configure session and run options to do inferencing with an Azure [triton endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint) A sample configuration in Python looks like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1')  # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-01-03 10:03:15 -08:00
Yi Zhang	52e3fe961d	add dnnl dependency in unittest.cmake (#14104 ) ### Description This is a follow-up to PR #14085. When multiple msbuild instances are running, the build throws the following exception: ``` 22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal "C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo if %errorlevel% neq 0 goto :cmEnd :cmEnd endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone :cmErrorLevel exit /b %1 :cmDone if %errorlevel% neq 0 goto :VCEnd :VCEnd" exited with code 1. ``` https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f The fix makes sure that the onnxruntime_test_all project depends on the dnnl project.	2023-01-03 11:24:06 +08:00
cloudhan	613920d6c5	Remove all usages of hipLaunchKernelGGL (#14089 ) All those macro syntaxes are a mistake, per https://github.com/ROCm-Developer-Tools/HIP-CPU/issues/8#issuecomment-756188453; they should have been corrected in the documentation but have not been. We moved away from hipThreadIdx_* in some previous commits; now we move away from hipLaunchKernelGGL.	2023-01-02 12:55:44 +08:00
Tianlei Wu	6a9dc6c993	[CUDA] Update fused MHA to support flash attention and causal mask (#13953 )

### Description
Update fused attention kernels to support flash attention and causal mask (GPT-2 initial decoder run). Note: Causal kernels are from FasterTransformer 5.2. Flash attention kernels that are not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model
Test like the following:
```
python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```
Original Flash Attention is from https://github.com/HazyResearch/flash-attention. RemovePadding and RestorePadding are added before/after the original flash attention but not in this PR, so the result is not an apples-to-apples comparison. It is added for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 | b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 | 84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56 | 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 | 91.10

* T4

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 | 351.38
Flash Attention (from TRT) | 5.73 | 21.49 | 45.49 | 89.15 | 174.37 | 352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b 1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512`

* A100

Note that flash attention is used as fused attention when sequence_length > 128.

batch_size | sequence_length | with Fused Attention | without Fused Attention | A100 Gain
-- | -- | -- | -- | --
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA

* T4

batch_size | sequence_length | with Fused Attention | with Unfused Attention | T4 Gain
-- | -- | -- | -- | --
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100

batch_size | sequence_length | with Fused Attention | with Unfused Attention | V100 Gain
-- | -- | -- | -- | --
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

	2022-12-31 10:33:54 -08:00
Dmitri Smirnov	5d729839b5	Support loading widechar paths on windows (#14066 ) ### Description Make GetRuntimePath() and LoadDynamicLibrary() operate on platform specific paths ### Motivation and Context This addresses https://github.com/microsoft/onnxruntime/issues/14063	2022-12-30 16:30:11 -08:00
Baiju Meswani	b85878953f	Fix nightly ort training ci pipeline (#14007 )	2022-12-30 12:28:57 -08:00
Tang, Cheng	9f52a8bc55	fix the cuda EP test failure (#14087 ) ### Description Fix a regression failure for cuda EP test ### Motivation and Context CudaEP test is a special test case under EP folder, not in test folder. When refactoring the code during the multi-stream work, we missed it. This PR fixes the test. Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-12-28 17:04:27 -08:00
Chen Fu	a79d88ed7f	Fix prefast scan bug 8995 (#14077 ) ### Description Prefast warning fix ### Motivation and Context https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/8995/ Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-12-28 10:38:36 -08:00
Vincent Wang	0c3480e565	[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069 ) PyTorch removed upsample_nearest related backward functions with "vec" overload name since 1.13. The functions without overload name are available for all versions, though they are not that convenient to use. This PR changes the gradient builder code to use functions without overload name for ATen upsample_nearest nodes. This PR also fixes a bug in an ORTModule corner case introduced by the multi-stream PR. There is some code to execute the barrier step for triggered downstream if the barrier is out of range. But this should be applied to triggered downstream only. If it's a normal run with start step as a barrier step but out of range, we should not apply the logic. For example, for ORTModule, if the barrier is the 1st step of whole CPU plan, and the forward part is empty, then the forward normal run will run step from start-0 to end-0 (actually nothing), and step-0 is the barrier, then we should not execute the barrier in such case.	2022-12-27 10:18:30 +08:00
PeixuanZuo	770fb9649b	[Fix] CPU memory type need device_id=0 to get allocator (#14050 ) ### Description Fix an error: `onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/allocation_planner.cc:819 onnxruntime::common::Status onnxruntime::PlannerImpl::ComputeValueLocation() allocator was false.` This error happens when we run huggingface models with DDP on multi-GPUs. In a thread with rank>0, it will attempt to obtain a CPU memory allocator with device_id>0, which causes the error. The workaround checks whether the node's output is on the CPU; if it is, we set device_id = 0. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-26 13:58:01 +08:00
Dmitri Smirnov	d762aa2a4c	Let Cmake decide where to place abseil (#14057 ) ### Description Remove Abseil module placement specifications ### Motivation and Context Allow CMake defaults to take effect, and allow possible redirection of all submodules for sharing between the local builds.	2022-12-23 12:08:13 -08:00
Adrian Lizarraga	3bbcc2799f	Support for custom op variadic inputs/outputs (#13946 ) ### Description Adds support for variadic inputs and outputs to custom operators. ### Motivation and Context Needed for custom ops that wrap external runtimes/models and maybe TensorRT plugins.	2022-12-23 11:41:15 -08:00
Tianlei Wu	8ac264b896	Deprecate one step beam search (#14046 ) ### Description Deprecate one step beam search since it lacks maintenance (some tests failed) and its performance is not optimal. For users who still need this feature, please use an older version (<=1.13.1) of onnxruntime to export the one step beam search model; the model can still run in the latest onnxruntime. It is recommended to use [convert_generation.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py) to generate a beam search onnx model for better performance.	2022-12-22 23:14:31 -08:00
Adam Louly	e49f358686	expose lr scheduler python bindings for on device training. (#13882 ) ### Description Exposing LR Scheduler python bindings for on device training. Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-12-22 18:44:04 -08:00
Ye Wang	b999022b03	disable some test input for generation model temporarily (#14045 ) ### Description Newly added test cases break the parity check. Disable them temporarily during investigation. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:24 -08:00
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description Sampling op for CPU and CUDA, supporting the huggingface case and the custom case. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
fxmarty	4d2dc8bbbd	Replace all numpy.bool by python builtin bool (#14014 ) `numpy.bool` has been removed as of NumPy 1.24.0. It was previously an alias for Python's `bool`. Fixes https://github.com/huggingface/optimum/issues/610 ### Motivation and Context NumPy 1.24.0 breaks, for example, the IO binding helpers.	2022-12-23 09:27:23 +10:00
Baiju Meswani	1b58331fb3	[QAT] Graph transformer to fuse QDQ pattern into FakeQuant (#13777 ) To perform QAT in onnxruntime, `FakeQuant` op was introduced in #13649. The onnxruntime quantization tool generates a post training static quantization onnx model with `QuantizeLinear`->`DequantizeLinear` nodes. To perform QAT, this pattern needs to be transformed to `FakeQuant`. This pull request introduces a graph transformer that looks for the `Q->DQ` pattern and fuses it to a `FakeQuant` node.	2022-12-22 09:44:39 -08:00
Tianlei Wu	944bff0ad6	Support two stages onnx GPT-2 conversion (#14025 ) ### Description Add support of ONNX conversion of GPT-2 for two stages: * Stage 1 is the initial stage that has empty past state. * Stage 2 has non-empty past state and sequence_length is 1. Add a parameter --stage to specify such stage. For stage 1, we will enable mask_index for Attention so that we can use fused attention in CUDA. Other changes: (1) use int32 inputs as default (otherwise, there is error in inference) (2) update gpt2_parity to include SkipLayerNormalization (see https://github.com/microsoft/onnxruntime/pull/13988) and EmbedLayerNormalization (3) get all environment variables that might impact GPT-2 latency in benchmark_gpt2 ### Motivation and Context To test fused attention for GPT-2 model for https://github.com/microsoft/onnxruntime/pull/13953.	2022-12-22 09:33:01 -08:00
PeixuanZuo	694ba033e9	[ROCm] update skip_layernorm test sample (#14051 ) ### Description A larger batch_size won't cover more implementations and may block CI, so remove batch_size 128. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 21:18:10 +08:00
pengwa	2f5bf75e51	Optimize computation orders (#13672 )

### Optimize computation orders
In `Roberta/Electra`, when `ClassificationHead` is used, there is a slicing operation on features along the sequence_length dimension, and the loss calculation depends only on this sliced data. This is a slice at axis 1. Before slicing, the shape is [batch, sequence_length, hidden]; after slicing, it becomes [batch, hidden_stage].

We had opportunities to move this slicing as early as possible, by passing it through simple elementwise ops (like Add/Div), through Layernorm/Softmax (if their reduce axis is after the slicing axis), or even through MatMul's left operand (only if it does not affect the last dims).

Operators like Reshape/Transpose are special, since they either have data specified (which we need to update after slicing) or have perm specified, which requires the input rank to remain unchanged. For those kinds of operators, we keep the original rank but leave the sliced dim as 1; after the computation completes, we do a Squeeze.

```
import torch
from torch import nn


class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```
src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py

#### Benchmark
A simple benchmark shows RoBERTa training latency dropped from 208 ms to 199 ms, a 4.5+% reduction. More comprehensive tests are on the way.
	2022-12-22 15:12:52 +08:00
Hariharan Seshadri	7ed8bd4f95	Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988 )	2022-12-21 23:04:44 -08:00
Joseph Groenenboom	baba312e30	Add provider selection for gpt2/convert_to_onnx.py (#13982 ) Allows the user to select from supported backends for gpt2/convert_to_onnx.py. Default behavior is preserved if no provider is selected. This allows the ROCm EP to be selected.	2022-12-22 11:41:09 +08:00
PeixuanZuo	a170e40fbb	[ROCm] Update Dockerfiles of ROCm and MIgraphX to ROCm5.4 (#14013 ) Update Dockerfiles of ROCm and MIGraphX to ROCm5.4 Update README.md Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:03:34 +08:00
PeixuanZuo	b5fd2a6a80	[ROCm] Add ROCm5.4 to python package pipeline (#14012 ) Add ROCm5.4 to python package pipeline. The download link of ROCm5.4 nightly build whl is https://download.onnxruntime.ai/onnxruntime_nightly_rocm54.html The download link of ROCm5.4 nightly build whl with profiling is https://download.onnxruntime.ai/onnxruntime_nightly_rocm54.profiling.html Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:40 +08:00
PeixuanZuo	ab2dd8dfaf	[ROCm] Update ROCm and MigraphX CI to ROCm5.4 (#14011 ) Update ROCm and MigraphX CI to ROCm5.4 Run ortmodule_test with ROCm5.4 and all passed(https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d). Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:05 +08:00
Edward Chen	df8ff34f25	Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. (#13983 ) ### Description Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. With the way these kernels are currently registered, the documentation shows support for opset 11+. This is not accurate. ### Motivation and Context Fix #13781	2022-12-21 19:01:00 -05:00
Numfor Tiapo	8943d623a4	DML EP Register operators for Opset 16 (#14034 ) This PR Registers the following operators for opset 16 to the DML EP: - LeakyRelu-16 - PRelu-16 - Where-16 - GreaterOrEqual-16 - LessOrEqual-16 Identity-16 was not added in this PR due to pipeline failures Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-12-21 09:05:12 -08:00
JiCheng	1a177a1713	Cover beta in all Conv paths. (#14008 )	2022-12-21 09:02:48 -08:00
pengwa	ccc4487553	fix CI onnxruntime_test_python_sparse_matmul.py (#14039 ) ### Description NumPy 1.24.0 removed `np.float`. ``` /opt/hostedtoolcache/Python/3.8.15/x64/bin/python onnxruntime_test_python_sparse_matmul.py EE. ====================================================================== ERROR: testRunContribSparseMatMul (__main__.TestSparseToDenseMatmul) Mutliple sparse COO tensor to dense ---------------------------------------------------------------------- Traceback (most recent call last): File "onnxruntime_test_python_sparse_matmul.py", line 407, in testRunContribSparseMatMul np.float, File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__ raise AttributeError("module {!r} has no attribute " AttributeError: module 'numpy' has no attribute 'float' ====================================================================== ERROR: testRunSparseOutputOnly (__main__.TestSparseToDenseMatmul) Try running models using the new run_with_ort_values ---------------------------------------------------------------------- Traceback (most recent call last): File "onnxruntime_test_python_sparse_matmul.py", line 39, in testRunSparseOutputOnly values = np.array([1.764052391052246, 0.40015721321105957, 0.978738009929657], np.float) File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__ raise AttributeError("module {!r} has no attribute " AttributeError: module 'numpy' has no attribute 'float' ```	2022-12-21 17:31:52 +08:00
JiCheng	7738be9b25	[prefast:Warning]: C26451 (#14036 )	2022-12-21 16:53:29 +08:00
Changming Sun	05137e6ec4	Use target name for flatbuffers (#13991 ) ### Description Use target name for flatbuffers. Add version range for flatbuffers. It is similar to #13870 ### Motivation and Context To fix a build error: ``` CMake Error at onnxruntime_graph.cmake:88 (add_dependencies): The dependency target "flatbuffers" of target "onnxruntime_graph" does not exist. Call Stack (most recent call first): CMakeLists.txt:1490 (include) ``` It happens when flatbuffers library is already installed. For example, on Ubuntu people may get it from apt-get. But, the one provided by Ubuntu 20.04 is not compatible with our code. The one in Ubuntu 22.04 works fine.	2022-12-20 11:44:02 -08:00
RandySheriffH	cd305a90d6	Stop creating static thread pool to fix random hang in onnx_test_runner (#14023 )	2022-12-19 19:48:14 -08:00
Yulong Wang	533fe37cbd	fix build break in transformer debug dump (#14009 ) ### Description Fix build break in transformer debug dump introduced in #13954.	2022-12-19 16:49:21 -08:00
Changming Sun	fc2a6db573	Update absl to the latest release (#13990 ) ### Description Update absl to a new version ### Motivation and Context The new version contains fixes that are needed for Nvidia GPU build. Once we update it to that version, we don't need to maintain our private patches for Nvidia GPU build.	2022-12-19 14:25:13 -08:00
Hariharan Seshadri	f1044e3b9a	CUDA GreedySearch ProcessLogits optimization (#13823 ) ### Description Explore the possible re-use of the logits buffer in `GreedySearch` for cases where sequence length == 1 (after the first decoding run, the sequence length is guaranteed to be 1). This re-use will ensure that we do not have to make copies of the logits before processing them. Currently, we make a copy of the logits even if the sequence length == 1, which is not necessary as we can directly re-use the logits buffer for the token generation step. A similar optimization exists in `BeamSearch`, but seems lacking in `GreedySearch`. Since the logits buffer may contain padded data, we need to adjust the pieces consuming the logits buffer directly to account for any padding. A more invasive change (needs changes in a few places) will be to adjust the interfaces of `ProcessLogits()` such that it takes a reference to the logits and not a const reference as (based on my understanding) this is the only place where the logits from the decoder subgraph will ever be used and giving the `ProcessLogits()` method license to mutate/process the underlying buffer of the logits OrtValue seems reasonable (instead of making a copy and then mutating/processing them). This will also remove the ugly `const_cast`(s) seen in this change.	2022-12-19 13:29:10 -08:00
Chen Fu	28e2b1790f	Moving MLAS threaded QGEMM packing buffer from stack to heap (#14002 ) ### Description MLAS QGEMM kernels need memory buffers for packing source tensors. This change moves these buffers from the stack to the heap. ### Motivation and Context MLAS QGEMM kernels have had packing buffers on the stack since the beginning of time. Emerging hardware demands larger and larger buffers, causing potential stack overflow problems down the road. This change moves these buffers from the stack to the heap. This change also introduces a thread initializer per kernel. For instance, in the new AMX instruction set (support coming), we need to initialize the tile registers per thread. This requirement can be easily satisfied by tapping into this change. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-12-19 09:39:19 -08:00
Zhang Lei	fba09faf5b	Implement reuse past and present tensor in Attention Ops. (#13791 ) Implement reuse of the kv_cache past and present tensors in Attention Ops. Unit test for the above feature. Utilize the reused kv_cache for past and present tensors in Greedy Search. Correctness test for it. Co-authored-by: Zhang Lei <phill.zhang@gmail.com>	2022-12-18 10:03:53 -08:00
cloudhan	2df046fc67	Fix deprecated-builtins (#14001 ) Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins] This is not a clean fix as in 13783, users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use self-built hipclang combining with ROCm 5.3.* or older.	2022-12-17 18:17:05 +08:00
Tianlei Wu	6fb54fc607	Add ms domain during saving onnx model in onnx_model.py (#13978 ) Add domain "com.microsoft" during saving model if needed.	2022-12-16 22:45:57 -08:00
Yulong Wang	cc0a6213e4	[js] update versions of a few build dependencies (#13977 ) ### Description Update versions of a few build dependencies for onnxruntime NPM packages. Update nodejs version to v16.x in linux CI; v12 is too outdated. See [nodejs release schedule](https://github.com/nodejs/release#release-schedule) ### Motivation and Context - upgrading to the latest webpack allows using the latest Node.js LTS version. The previous version of webpack does not work on Node.js v18; this is fixed in the latest version - upgrading to the latest typescript, ts-loader and other dev deps accelerates the build and bundling. - upgrading also helps to resolve security warnings about possibly vulnerable outdated versions	2022-12-16 17:26:54 -08:00
Chi Lo	ba89cae3bd	Update package pipelines to support TRT 8.5 (#13998 ) Update following package pipelines to support TRT 8.5 after https://github.com/microsoft/onnxruntime/pull/13867: - [Linux Multi GPU TensorRT CI Pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1016&_a=summary) - [Python packaging pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841&_a=summary) - [build-perf-test-binaries](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1130&_a=summary) - [Linux-GPU-EP-Perf](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841&_a=summary)	2022-12-16 15:01:50 -08:00
Tianlei Wu	848f80f7a9	Skip some attention op tests in A100 (#13980 ) Skip some attention_op tests on A100 because TF32 is enabled in GEMM, which causes some unit tests to fail on A100.	2022-12-16 10:23:41 -08:00
FFFrog	6705915af8	[CANN] Add the ability to run graph (#13728 ) ### Description Add the ability to run graph ### Motivation and Context A brief description is as follows: 1) If the whole graph is supported, it will be processed by the graph engine directly. 2) If the whole graph is not supported, it will be divided into sub-graphs and single operators; the sub-graphs will be run on the graph engine, and the single operators will fall back to the traditional mode.	2022-12-16 06:57:40 -08:00
Yi Zhang	aa9fbed3d4	Add compilation cache for Linux GPU (#13995 )	2022-12-16 16:38:12 +08:00
Scott McKay	be9ae28d9f	Add ability to set RunOptions config entries to C# API. (#13939 ) ### Description Add ability to set RunOptions config entries. Largely a cut-and-paste of the existing code for setting SessionOptions config entries. ### Motivation and Context #13936	2022-12-16 10:28:01 +10:00
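
Two commits in this log (#14014 and #14039) deal with the `np.bool` and `np.float` aliases that NumPy 1.24.0 removed. The following is a minimal sketch of the replacement pattern, assuming only that NumPy is installed; the variable names are illustrative and are not taken from the onnxruntime sources.

```python
import numpy as np

# np.bool and np.float were plain aliases for the Python builtins and were
# removed in NumPy 1.24.0. The builtins work on both old and new NumPy versions.
mask = np.array([1, 0, 1], dtype=bool)  # instead of dtype=np.bool
values = np.array(
    [1.764052391052246, 0.40015721321105957, 0.978738009929657],
    dtype=float,  # instead of np.float
)

# When a specific width matters, name the NumPy scalar type explicitly.
values32 = values.astype(np.float32)
```

Passing the builtin `float` yields a 64-bit float array, which is what the removed `np.float` alias produced as well, so the swap does not change dtypes.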


7906 commits