onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-03 03:58:54 +00:00

Author	SHA1	Message	Date
pengwa	bcebd3b1ca	Allow upstream for Slice on single axis (#16410 ) ### Allow upstream for Slice on single axis #### Benchmark on 8x32GB V100 + DeepSpeed On Bloom560M model, there is 1.5% throughput gains on the same max batch size 6. ``` torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 6 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json ``` ##### Main branch ``` Total overhead: 38957ms where export takes 35493ms. *** train metrics *** epoch = 4.08 train_loss = 2.6841 train_runtime = 0:03:10.67 train_samples = 2318 train_samples_per_second = 50.348 train_steps_per_second = 1.049 throughput per gpu=4.08 * 2318 / (190.67 - 38.957) / 8(gpu) = 7.792 samples/second ``` ##### This PR ``` Total overhead: 38649ms where export takes 34946ms. *** train metrics *** epoch = 4.08 train_loss = 2.6757 train_runtime = 0:03:08.08 train_samples = 2318 train_samples_per_second = 51.04 train_steps_per_second = 1.063 throughput per gpu=4.08 * 2318 / (188.08 - 38.649) / 8(gpu) = 7.911 samples/second ``` #### Benchmark on 4x16GB V100 + AutoCast On Bloom560M model, there is 1.8% throughput gains on the same batch size, 24% gains with corresponding maximum batch size. Also it allow ORT run bigger batch size (from 3 to 4) on following recipe. 
``` torchrun --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss ``` ##### Main branch ``` Total overhead: 4789ms where export takes 3798ms. *** train metrics *** epoch = 1.02 train_loss = 20.3338 train_runtime = 0:01:42.78 train_samples = 2343 train_samples_per_second = 23.349 train_steps_per_second = 1.946 throughput per gpu=1.02 * 2343 / (102.78 - 4.789) / 4(gpu) = 6.097 samples/second ``` ##### This PR ``` Total overhead: 4608ms where export takes 3555ms. *** train metrics *** epoch = 1.02 train_loss = 20.3364 train_runtime = 0:01:40.87 train_samples = 2343 train_samples_per_second = 23.792 throughput per gpu=1.02 * 2343 / (100.87 - 4.608) / 4(gpu) = 6.207 samples/second ``` With this PR, also can run batch size 4 (main branch fails), ``` Total overhead: 4743ms where export takes 3698ms. *** train metrics *** epoch = 1.36 train_loss = 20.2096 train_runtime = 0:01:50.42 train_samples = 2343 train_samples_per_second = 28.979 train_steps_per_second = 1.811 throughput per gpu= 1.36 * 2343 / (110 - 4.743) / 4(gpu) =7.57 sample/second ``` #### Benchmark on 8x32GB V100 + AutoCast On Bloom560M model, there is 0.9% throughput gains on the same batch size, 8.6% gains with corresponding maximum batch size. Also it allow ORT run bigger batch size (from 3 to 4) on following recipe. 
``` torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss ``` ##### Main branch ``` Total overhead: 55259ms where export takes 51140ms. *** train metrics *** epoch = 2.06 train_loss = 2.8788 train_runtime = 0:02:36.65 train_samples = 2318 train_samples_per_second = 30.64 train_steps_per_second = 1.277 throughput per gpu=2.06 * 2318 / (156.65 - 55.259) / 8(gpu) = 5.887 samples/second ``` ##### This PR ``` Total overhead: 55712ms where export takes 51418ms. *** train metrics *** epoch = 2.06 train_loss = 2.8696 train_runtime = 0:02:36.19 train_samples = 2318 train_samples_per_second = 30.731 train_steps_per_second = 1.28 throughput per gpu=2.06 * 2318/ (156.19 - 55.712) / 8(gpu) = 5.940 samples/second ``` With this PR, also can run batch size 4 (main branch fails), ``` Total overhead: 54238ms where export takes 49899ms. *** train metrics *** epoch = 2.74 train_loss = 2.7692 train_runtime = 0:02:58.47 train_samples = 2318 train_samples_per_second = 35.859 train_steps_per_second = 1.121 throughput per gpu= 2.74 * 2318 / (178.47 - 54.238) / 8(gpu) =6.391sample/second ```	2023-07-10 08:36:11 +08:00
Yulong Wang	67f4cd54fa	fix JSEP build break (#16636 ) ### Description fix JSEP build break. the build break was caused by enabling `-Wshorten-64-to-32` while merging the CI.	2023-07-09 08:53:11 -07:00
satyajandhyala	00e8f2a2a9	[Web/JS] Add ConvTranspose support (#16433 ) ### Description Add ConvTranspose support for WebGPU	2023-07-08 11:10:50 -07:00
Yulong Wang	5c6613875c	[js/web] [JSEP] allow passing data in kernel compute (#16621 ) ### Description allow passing data in OpKernel::Compute() from C++ to JS.	2023-07-07 14:27:30 -07:00
cao lei	329e8156d4	clean unused parameter in ORT_UNUSED_PARAMETER (#16538 ) ### Description clean unused parameter in ORT_UNUSED_PARAMETER ### Motivation and Context clean unused parameters in ORT_UNUSED_PARAMETER which are introduced from #15833	2023-07-07 13:20:36 -07:00
satyajandhyala	e55a20ece8	[Web/JS] Added Split operator support. (#16567 ) ### Description Added WebGPU/JSEP Split operator support.	2023-07-07 12:16:10 -07:00
Adrian Lizarraga	dd13252506	[QNN EP] Support 1D Conv/ConvTranspose (#16563 ) ### Description - Adds support for 1D (rank 3) convolutions to QNN EP - Implements 1D convolutions as 2D convolutions with height == 1. Reshape nodes are added at the inputs and outputs as necessary. - Adds more unit tests for Conv and ConvTranspose (2D and 1D). ### Motivation and Context Allow more models to run on QNN EP.	2023-07-07 10:37:49 -07:00
satyajandhyala	5933a183df	[Web/JS] Added missing L1Reduce and L2Reduce operator kernels. (#16580 ) ### Description Add missing L1Reduce and L2Reduce operator kernels.	2023-07-07 09:55:55 -07:00
cloudhan	01c5d05712	Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518 ) The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct. Avoiding the repeated ctor invocation substantially improves launch time and gives better performance during decoding. This gives a <7% e2e time reduction for whisper large.	2023-07-08 00:25:07 +08:00
Xavier Dupré	47a0289ee6	[CI] Removes type2 in process_registration and fix Windows GPU Reduced Ops CI Pipeline (#16530 ) ### Description Windows GPU Reduced Ops CI Pipeline is broken due to the introduction of a second template type in registered kernels. The python code checking the registration is broken due to that. This PR addresses this issue on the python side by keeping only one type equal to the concatenation of the two types.	2023-07-07 18:21:06 +02:00
Edward Chen	6be7b03e53	Enable `-Wshorten-64-to-32` warning if available. (#16524 ) - Fix some warnings from Xcode build (`-Wshorten-64-to-32`). - Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet. - Some clean up in build.py including setting CMake generator more consistently.	2023-07-07 08:11:44 -07:00
Edward Chen	e22b0836e7	[objc] Update docs and fix static analysis build (#16617 ) - Update some documentation comments. - Use onnxruntime_training.h as the umbrella header so training API docs are included in generated docs. - Fix static analysis build.	2023-07-07 07:58:54 -07:00
Vincent Wang	2a11f29eaa	[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608 ) The PR optimizes the BiasGelu/BiasGeluGrad CUDA kernels with 3 changes: - Use Erf instead of Normcdf for half compute - Change CUDA thread organization for the BiasGelu kernel instead of using the binary elementwise template - Add vectorized support Using BiasGelu(A[256, 128, 768] + B[768]) on V100 as an example, the perf numbers below are in us: Before change, FW: 246.37, BW: 292.77 Use Erf, FW: 152.86, BW: 238.98 All above changes, FW: 132.45, BW: 199.14 For Huggingface's bertweet-base model, with the changes, the step time (FW+BW) reduces from 324.71766 ms to 316.42552 ms, which is 1.026x faster. Erf is used for half data only; evaluation shows that for float on CUDA, Normcdf is faster. I didn't check the perf for BFloat16 or on AMD, so those are kept unchanged.	2023-07-07 08:28:38 +08:00
Scott McKay	697dd12f6e	Re-organize the transpose optimization and layout transformation files. (#16246 ) ### Description Split out the more basic changes from #15552 for easier review. Re-organize to clarify the structure - Separate out generic base functionality from ORT specific components - pass in handlers for internal ORT ops to Optimize - Split out layout transformation from transpose optimization - Separate out level 1 transpose optimizer - Cleanup some naming to try and clarify things like an optimizer vs. general optimization code Most of the changes are from this movement of code. Two implementation changes: - the extended handlers are queried first in GetHandler - allows the extended handlers to override the default behaviour for an ONNX operator - simplify the Optimize function to remove OptimizerMode. - `can_modify_node` is used instead of `mode` and `ignore_assigned_nodes` and a long description of the current usage is added. I don't _think_ that changes the current behavior and hopefully clarifies what happens and when, and makes the base transpose optimizer implementation more generic. ### Motivation and Context Create a cleaner separation to support adding EP specific logic next to cleanly handle where an EP has additional layout sensitive behaviour required (e.g. its Resize implementation only handles one layout).	2023-07-07 08:24:47 +10:00
Hariharan Seshadri	5ffd58c8e6	Fix Reduced Ops pipeline (#16612 )	2023-07-06 14:32:59 -07:00
Dmitri Smirnov	51c42ae64a	[C#] Allow users to quickly populate native string buffers with utf8 bytes (#16559 ) ### Description Introduce an API that allows users to gain access to a string tensor element buffer of a requested length in bytes so they can quickly load any utf8 data. ### Motivation and Context Useful for testing and otherwise.	2023-07-06 09:51:26 -07:00
Yulong Wang	d13f3153d7	[js/webgpu] enable op test for webgpu (#16542 ) ### Description This change enables the JSON-format operator tests for webgpu. Usage: ``` npm test -- op abs.jsonc -b=webgpu ```	2023-07-06 08:35:19 -07:00
Xavier Dupré	d906d48ae9	Support custom ops taking float 8 tensors as inputs and outputs (#16323 ) ### Description C API for custom ops does not support float 8 types. This PR changes that. ### Motivation and Context The list of operators supporting float 8 is very limited. It should be extended to custom ops to let developers add customized operators for these specific types.	2023-07-06 14:36:06 +02:00
zhangsibo1129	180292f426	Modify CANN EP to align with the EP API refactor and fix CANN CI (#16490 ) Modify CANN EP `CANNExecutionProvider::CreatePreferredAllocators`, `CANNExecutionProvider::CreateCannAllocator` to align with the EP API refactor and fix CANN CI for https://github.com/microsoft/onnxruntime/pull/15833#issuecomment-1601568295 in this [PR](https://github.com/microsoft/onnxruntime/pull/15833)	2023-07-05 19:24:32 -07:00
cloudhan	b84c63db2a	Fix ORT_RETURN condition mistake (#16520 ) In #16339, the `ORT_ENFORCE(cuda_device_arch_ >= 530` (throw) was changed to `ORT_RETURN_IF` (Status) but the condition is negated. This fixes the problem.	2023-07-06 09:13:13 +08:00
Yi Zhang	fed08e070a	Add compiler cache in linux wasm build (#16579 ) ### Description Add compiler cache in wasm build to accelerate web ci ### Motivation and Context It could reduce the pipeline duration by 30 minutes. web ci could be completed in 2 hours with cache. https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1053219&view=results	2023-07-06 06:58:48 +08:00
Vrajang Parikh	fd8ad9b950	Enable iOS packaging for training (#16525 ) ### Description Enable support for building iOS packages/CocoaPods with training API - Add `Training` Package variant and config files in current iOS packaging utilities to enable creation of training packages ### Motivation and Context This PR introduces new `Training` variant in `build_and_assemble_ios_pods.py` script which allows creating pods for iOS with training API enabled. The sample script to build training pods: ``` python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py --variant Training \ --build-settings-file tools/ci_build/github/apple/default_full_ios_training_framework_build_settings.json \ -b=-- path_to_protoc_exe=<path/to/protoc> ``` Note: build settings file should have `--enable_training` as a build parameter. Simply adding training packaging increases the duration of the Azure pipeline for packaging by 70 minutes. To address this issue, we need to parallelize pod creation. In order not to further strain the pipeline, the changes for training packaging will be added in another PR, which optimizes the packaging pipeline. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-07-05 13:27:59 -07:00
Adam Pocock	ba91457183	[java] Adding addExternalInitializers and addInitializer to OrtSession.SessionOptions (#16198 ) ### Description Adds support for adding external initializers or overriding initializers to a session options from Java. ### Motivation and Context We want to instantiate large models from Java without filesystem access. cc @yuslepukhin	2023-07-05 12:51:59 -07:00
Yulong Wang	661fd4b978	[js/rn] always use 'typescript' from /js/ folder (#16554 ) ### Description always use 'typescript' from /js/ folder. This allows all NPM packages to use the same typescript version. - remove 'typescript' from /js/react_native/package.json. use the one from /js/package.json - remove unused '@types/fs-extra'	2023-07-05 12:26:56 -07:00
satyajandhyala	a7c892106d	[Web/JS] Support WebGPU Concat operator (#16543 ) ### Description Add Concat operator	2023-07-05 11:59:45 -07:00
Chi Lo	d8792f8040	Fix TRT EP allocator memory leak (#16552 ) Fix memory leak issue which comes from TRT EP's allocator object not being released upon destruction. Following is the log from valgrind: ``` ==1911860== 100,272 (56 direct, 100,216 indirect) bytes in 1 blocks are definitely lost in loss record 1,751 of 1,832 ==1911860== at 0x483CFA3: operator new(unsigned long) (vg_replace_malloc.c:472) ==1911860== by 0x315DC2: std::_MakeUniq<onnxruntime::OrtAllocatorImplWrappingIAllocator>::__single_object std::make_unique<onnxruntime::OrtAllocatorImplWrappingIAllocator, std::shared_ptr<onnxruntime::IAllocator> >(std::shared_ptr<onnxruntime::IAllocator>&&) (unique_ptr.h:857) ==1911860== by 0x30EE7B: OrtApis::KernelContext_GetAllocator(OrtKernelContext const, OrtMemoryInfo const, OrtAllocator*) (custom_ops.cc:121) ==1911860== by 0x660D115: onnxruntime::TensorrtExecutionProvider::Compile(std::vector<onnxruntime::IExecutionProvider::FusedNodeAndGraph, std::allocator<onnxruntime::IExecutionProvider::FusedNodeAndGraph> > const&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void, OrtApi const, OrtKernelContext)#3}::operator()(void, OrtApi const, OrtKernelContext*) const (tensorrt_execution_provider.cc:2223) ``` This issue happens after this [EP allocator refactor](https://github.com/microsoft/onnxruntime/pull/15833)	2023-07-05 09:25:05 -07:00
Aditya Goel	9799d43c36	LabelEncoder kernel creation improvement (#16516 )	2023-07-05 07:09:20 -07:00
PeixuanZuo	e2526714e2	[ROCm] Move MIGraphX build step on CPU only machine (#16582 ) - Move MIGraphX build step on CPU only machine - Use ccache on build step - Not pass host uid into docker build process.	2023-07-05 13:55:28 +08:00
Wei-Sheng Chin	a0a5f57581	[DORT] Use new FX-to-ONNX exporter (#16450 ) The ONNX exporter in DORT has been moved to PyTorch as a formal feature. We therefore switch to consuming the exporter from PyTorch instead of maintaining two duplicates.	2023-07-04 13:13:04 -07:00
PeixuanZuo	d540c7da0f	[ROCm] Add ROCm5.6 to python package pipeline (#16572 ) Add ROCm5.6 to python package pipeline.	2023-07-04 18:18:12 +08:00
Adam Pocock	13cc6192e5	[java] Adding native library loader to SessionOptions and RunOptions static init (#16435 ) ### Description Unlike most ORT classes `SessionOptions` and `RunOptions` don't trigger native library loading of the JNI binding and ORT when the classes are initialized (after class loading). This was initially because I thought that loading an inner class would trigger the static initialization of the outer class, but this is not true. So if you create a `SessionOptions` instance before referencing `OrtEnvironment` then you won't trigger library loading and you'll get an error saying it couldn't link the native method that creates a `SessionOptions` object. Note this doesn't prevent users from creating a `SessionOptions` and modifying it before the `OrtEnvironment` is created, which can still cause issues. It would be a breaking API change to modify the `SessionOptions` constructor to take an environment, and it wouldn't mirror the way it works in the C API which requires this by convention rather than API design, but we can discuss making that modification later. ### Motivation and Context Reduces the occurrence of mysterious Java library loading errors. Helps with #16434.	2023-07-03 15:59:03 -07:00
Rachel Guo	2e2e6aeff6	[NNAPI EP] Add leakyRelu support (#16533 ) ### Description As title. ONNX definition: `2e9a6757ad/onnx/defs/math/defs.cc (L330)` ### Motivation and Context CoreML has support and some super resolution models use it. --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-07-03 09:30:15 -07:00
pengwa	ac100ebb64	Fix orttraining-ortmodule-distributed CI (#16569 ) ### Fix orttraining-ortmodule-distributed CI pydantic (https://pypi.org/project/pydantic/#history) released version 2.0 on July 1st, and DeepSpeed has a known issue with the newer version (https://github.com/microsoft/DeepSpeed/issues/3280). Fix this by adding a check similar to the one DeepSpeed added in https://github.com/microsoft/DeepSpeed/pull/3290	2023-07-03 13:18:59 +08:00
Sheil Kumar	f46956056d	Add WinML Experimental API to Register ORT CustomOps Libraries (#16535 ) Add WinML Experimental API to Register ORT CustomOps Libraries --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-06-30 22:17:35 -07:00
dependabot[bot]	80a9c40cba	Bump actions/checkout from 2 to 3 (#16405 )	2023-07-01 03:51:31 +00:00
dependabot[bot]	c8a94f1ef7	Bump actions/setup-dotnet from 2 to 3 (#16403 ) Bumps [actions/setup-dotnet](https://github.com/actions/setup-dotnet) from 2 to 3. Release notes (sourced from actions/setup-dotnet's releases, https://github.com/actions/setup-dotnet/releases): v3.0.0: This major release includes the following changes: - #219 New input `dotnet-quality` was added in #315: `- uses: actions/setup-dotnet@v3 with: dotnet-version: '6.0.x' dotnet-quality: 'preview' - run: dotnet build <my project>` More detail: https://github.com/actions/setup-dotnet#using-the-dotnet-quality-input - #241 The output variable `dotnet-version`, which contains the SDK version installed by the action, was added in #324: `- uses: actions/setup-dotnet@v3 id: cp310 with: dotnet-version: '3.1.422' - run: echo '${{ steps.cp310.outputs.dotnet-version }}' # outputs 3.1.422` More detail: https://github.com/actions/setup-dotnet/tree/main#dotnet-version - The `dotnet-version` syntax was updated and now allows specifying a prerelease version without the `include-prerelease` input. The `include-prerelease` input was cut out: `- uses: actions/setup-dotnet@v3 with: dotnet-version: '5.0.0-preview.6'` More detail: https://github.com/actions/setup-dotnet#supported-version-syntax - #251 The problem with out-of-support .NET version warnings was solved in #315. Breaking changes: - Installation paths for Windows and Ubuntu images were changed to match the location of pre-installed SDKs. More detail: https://github.com/actions/setup-dotnet/blob/main/docs/adrs/v3-setup-dotnet.md#breaking-changes Add support for Windows-arm: In scope of this release we add support for Windows-arm (https://redirect.github.com/actions/setup-dotnet/pull/320). Besides, we change getInput to getBooleanInput (https://redirect.github.com/actions/setup-dotnet/pull/250) for include-prerelease. Package updates, support for global json file in a subdirectory, installer scripts updates: This release includes the following PRs: - Adding support for the `global-json-file` input: #276 Example of usage: `- uses: actions/setup-dotnet@v2 with: global-json-file: csharp/global.json - run: dotnet build <my project> working-directory: csharp` ... (truncated) Commits: - 3447fd6 feat: Cache NuGet global-packages folder (#303) - 916351a Merge pull request #430 from akv-platform/remove-implicit-dependencies - 1ad2e31 Add missing dependency - e3f84b8 Install eslint-plugin-node - ba848a3 Update configuration files - aa983c5 Merge pull request #428 from akv-platform/add-latest-patch-syntax - b891376 Merge branch 'main' into add-latest-patch-syntax - b05a3f2 Fix review points, rebuild solution - 5fdecd2 Increase amount of retries for Dotnet installation scripts tests (#427) - 38b49fb Fix informational and debug messages - Additional commits viewable in compare view (https://github.com/actions/setup-dotnet/compare/v2...v3) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-01 09:16:28 +08:00
Dmitri Smirnov	f5b2d213eb	Fix nuget pipeline (#16553 ) ### Description Address test class visibility. ### Motivation and Context Fixes NuGetPackaging pipeline	2023-06-30 17:32:06 -07:00
petermcaughan	47f136e2d3	Speed Up Whisper Export (#16504 ) ### Description Add a greedy option to the initializer deduplication process in the Whisper export. Currently, to detect shared initializers, ORT compares every initializer against every other initializer (n^2). In the comparison operator, if the two initializers have different data types (e.g. raw_data and int_64), both initializers are converted to a numpy array and the cast result is compared. This cast happens in every comparison, so its cost is multiplied across the n^2 comparisons and dominates the runtime of finding shared initializers. This cast operation is the bottleneck for the current Whisper export script. The conversion to the numpy array is useful for detecting equal initializer values across nodes of different data types (e.g. recognizing a bias value of 0.0 is the same as a slice index of 0) but isn't triggered when comparing initializers of the same data type (e.g. weight value of 0.6 == weight value of 0.6). The latter case is where the majority of utility is for Whisper, and so by eliminating our path for comparing numpy arrays for initializers we save a lot of time for minimal cost. In other words, this PR adds an option to remove the ability to detect shared initializers of different types (e.g. Slice Index and MatMul Constant) while retaining the ability to deduplicate weights. ### Motivation and Context Current time to export Whisper-large is prohibitive. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-06-30 12:22:30 -07:00
Yulong Wang	708dec5d95	[js/webgpu] allow 0 sized tensor for tensor view (#16540 ) ### Description allow 0 sized tensor for tensor view	2023-06-30 12:05:04 -07:00
satyajandhyala	3be6eb53c8	[JS/Web] Fixed the output indexing in the shader code when the output is 1-dim. (#16508 ) ### Description Modified indexing into outputIndices in the shader code. When the output is 1-dim, outputIndices is not a vector and indexing results in an error. ### Motivation and Context Fix the problem in the Reduce Ops implementation in WebGPU.	2023-06-30 09:42:38 -07:00
Mike Guo	aeaa1d650f	make optimized_model_path be in temp folder instead of source model folder for transformer optimization (#16531 ) optimize_model generates a temporary model in the current model folder. Most of the time this is fine. However, it breaks when the function runs against an input model mounted from AzureML, because the mounted folder is read-only: optimize_model fails when creating the optimized model in the read-only folder. The workaround has been to copy the model to another temp folder before calling optimize_model, but copying is painful, especially when the model is huge. This PR exposes optimized_model_path at the optimize_model level so that the caller can decide where to save the temp model.	2023-06-30 20:19:51 +08:00
Scott McKay	2fd25de360	Use verbose logging in Android emulator in React Native CI (#16528 ) ### Description Set emulator logging to verbose to see if it helps with intermittent React Native CI failures when emulator crashes at startup	2023-06-30 11:51:20 +10:00
Baiju Meswani	5b3447beef	Allow disable exceptions to work with ort-extensions (#16536 )	2023-06-29 18:24:33 -07:00
JiCheng	0051497055	Fix Comments (#16513 ) ### Description Address comments in #14040 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-06-30 10:37:57 +10:00
Edward Chen	05c4566fe9	[objc] Fix possible leak of OrtValue in initializer. (#16487 ) Fix possible leak of OrtValue in initializer. There was a possible early return before ownership was transferred to the internal C++ Ort::Value.	2023-06-29 17:37:16 -07:00
pengwa	8fc3037ff4	Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363 ) ### Support SCELoss/SCELossGrad run with larger sized input #### Motivation and Context: Run bigger batch size for Bloom model. For the Bloom560M model, ORT has the potential to run a bigger batch size, from initially 6 to now 10. SCELoss/SCELossGrad's input size is Bsz * 1023 * 250680. When Bsz is bigger than 8, the total element count cannot be represented by int32_t, which those kernels use to pass the total element count. The silent overflow causes other, indirect exceptions, or wrong results without errors. #### Changes in this PR - For SCELossInternal/SCELossGradInternal CUDA kernels, use uint64_t if the total element count is bigger than int32::max() to pass the element count and element index for the ops mentioned above. - For SCELossInternal/SCELossGradInternal CPU kernels, - always use uint64_t to pass the element count. - update the Eigen functions involved in the two kernels' implementations to use `ptrdiff_t` to pass the element count instead of the original `int`. - Parallelize SCELossInternal/SCELossGradInternal CPU kernels; otherwise, it is super slow when handling so many elements. - Other changes needed: - Add `CompareOrtValueNumerals` to compare two OrtValues with different data types (float or float16), without the caller explicitly converting to the lower-precision data type. The comparison is also done in parallel, which reduces the comparison time for the large UT case from 22s to ~1.6s. - The check of `IsResultCloselyMatch` is buggy for nan/inf cases, so fix the bugs. - The cross entropy tests run the CPU baseline with float, then the result is compared with the float16 results of CUDA runs. But there is a precision issue when we check the results. 
Because the randomized input data is represented in float, the CPU uses it directly, but CUDA uses a float16 version of it, so there is a precision diff between the inputs; as the test data count increases, it makes the results fail even at 1e-2. The fix is: generate data in float16, convert to float for the CPU run, and directly use float16 for CUDA runs. When comparing the output, cast the CPU float back to float16, then compare with the CUDA outputs. - `RandomValueGenerator ` takes about 20 seconds for the large size, so `ParallelRandomValueGenerator ` is added to generate random input in parallel; it takes <2s to prepare the input data. #### Non-goals `SoftmaxCrossEntropyLoss` && `SoftmaxCrossEntropyLossGrad` is not covered in this PR	2023-06-30 08:36:06 +08:00
Dmitri Smirnov	322237f482	[C#] Implement OrtValue APIs (#16206 ) ### Description Expose the `OrtValue` class API as a first-class citizen. Make it similar to the C++ API. Enable safe direct native memory access. Make string tensor manipulation more efficient. Avoid intermediate structures such as `NamedOnnxValue`, `DisposableNamedOnnxValue`, etc. Provide more examples with `IOBinding`, although the `OrtValue` API potentially makes `IOBinding` redundant for most scenarios, since `OrtValue` can be created on top of any memory. Run all the pre-trained models now with the `OrtValue` API as well. Obsolete `OrtExternalMemory class`. Obsolete IOBinding API that takes `FixedBufferOnnxValue`. ### Motivation and Context Make the API efficient and uniform with C++. This aspires to address: https://github.com/microsoft/onnxruntime/issues/14918 https://github.com/microsoft/onnxruntime/issues/15381 Cc: @Craigacp	2023-06-29 08:59:23 -07:00
Edward Chen	9b2733de8e	[docs] Specify Objective-C max line length. (#16503 ) Update coding standards doc to specify Objective-C max line length of 120 to be consistent with C++.	2023-06-28 16:58:23 -07:00
cao lei	0c5f492493	remove AllocatorMgr class (#16509 ) ### Description Remove AllocatorManager class ### Motivation and Context After the refactor PR #15833 is in, AllocatorManager class is not referenced anymore.	2023-06-28 15:43:19 -07:00
Hariharan Seshadri	ff0894e540	Simplify gating check for CUDA Graph usage (#16491 )	2023-06-28 15:25:34 -07:00
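
The "throughput per gpu" figures quoted in the Slice upstream benchmarks (#16410) all follow the same arithmetic: epochs times training samples, divided by the runtime net of the one-time overhead, divided by the GPU count. The sketch below reproduces the 8x32GB V100 + DeepSpeed numbers from that commit message. The function name `per_gpu_throughput` and its parameter names are illustrative only and do not come from the repository.

```python
def per_gpu_throughput(epochs, train_samples, runtime_s, overhead_ms, num_gpus):
    """Samples/second per GPU, excluding one-time overhead such as export.

    Hypothetical helper: it mirrors the formula written out in the commit
    message, not any function in onnxruntime.
    """
    effective_s = runtime_s - overhead_ms / 1000.0  # overhead is logged in ms
    return epochs * train_samples / effective_s / num_gpus


# Values copied from the 8x32GB V100 + DeepSpeed logs in #16410.
# train_runtime 0:03:10.67 is 190.67 s; 0:03:08.08 is 188.08 s.
main_branch = per_gpu_throughput(4.08, 2318, 190.67, 38957, 8)
this_pr = per_gpu_throughput(4.08, 2318, 188.08, 38649, 8)
gain_percent = (this_pr / main_branch - 1.0) * 100.0

print(f"main: {main_branch:.3f}  PR: {this_pr:.3f}  gain: {gain_percent:.1f}%")
# main: 7.792  PR: 7.911  gain: 1.5%
```

Subtracting the overhead before dividing matters here because export alone accounts for roughly 35 of the 190 seconds in these short 200-step runs.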


9103 commits