* support optimizer opt for deepspeed 0.5.9
* resolve comments
* resolve comments
* FP16_Optimizer Support for more Deepspeed Versions (#12046)
* fp16_optimizer for more ds versions
* change ds version
* bugfix
* fix bug
* Fix unused function warning for decodeMIDR(). (#12069)
Changed from static function defined in header to function declared in header and defined in separate .cc file.
* pin protobuf version to be compatible with onnx (#12132)
Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev10.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
* RoiAlign CPU EP add warning for max mode with samples != 1 (#12136)
* RoiAlign add warning about incorrect max summation when sample size not 1
* include coreml_provider_factory.h in macos build instead of coreml_ex… (#12138)
include coreml_provider_factory.h in macos build instead of coreml_execution_provider.h
* List 3.10 as supported python version and remove 3.6 (#12141)
list 3.10 as supported python version and remove 3.6
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
* Use updated symbolic_helper.check_training_mode (#11900)
Co-authored-by: Jingyan Wang, Baiju Meswani
* Fix GH issue 12151 by using inverse perms for updating DQ axis attribute (#12158)
* Fix GH issue 12151.
Need to use inverse perms for updating that axis to what is used for transposing the input. This only applies if the DQ node is doing per-axis dequantization.
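As a rough illustration of why the inverse permutation is the right lookup, here is a standalone Python sketch (not the ORT C++ code; `invert_perm` and `transposed_shape` are hypothetical helper names). It assumes the usual Transpose convention where output dimension `i` is input dimension `perm[i]`. Under that convention the dimension that was at index `axis` ends up at index `inverse_perm[axis]`, while `perm[axis]` generally points at a different dimension.

```python
def invert_perm(perm):
    """Return the inverse permutation: inv[perm[i]] == i for every i."""
    inv = [0] * len(perm)
    for new_pos, old_axis in enumerate(perm):
        inv[old_axis] = new_pos
    return inv


def transposed_shape(shape, perm):
    """Shape after a Transpose where output dim i is input dim perm[i]."""
    return [shape[p] for p in perm]


if __name__ == "__main__":
    shape = [2, 3, 5, 7]
    perm = [0, 2, 3, 1]   # e.g. NCHW -> NHWC
    axis = 1              # per-axis dequantization along the dim of size 3

    inv = invert_perm(perm)
    new_shape = transposed_shape(shape, perm)

    # The quantized dimension is found through the inverse permutation.
    print("new axis via inverse perm:", inv[axis], "size", new_shape[inv[axis]])
    # Indexing with perm directly lands on a different dimension.
    print("axis via perm (wrong):", perm[axis], "size", new_shape[perm[axis]])
```

This only shows the index arithmetic; which node the attribute is rewritten on, and when, is decided by the optimizer code in the PR.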
* fixing positions for beam search gpt2 (#12156)
* fixing positions for beam search gpt2
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
* remove wrong placed libs (#12201)
* Add file mapping for windows platform. (#12183)
* Add file mapping for windows platform.
* Add unit test for file mapping for windows. Also add an error message for mis-aligned offset
* Add unit test for file mapping for windows. Also add an error message for mis-aligned offset
* Update data type to avoid warnings
* Use compatible data types to avoid warnings. Update CreateFileMapping2 condition for WinML compilation.
* Add type conversion to avoid warnings for X86 release build.
Co-authored-by: Ting Cao <ticao@microsoft.com>
* Fix bug where onnxruntime_USE_NCCL flag would default to ON (#12195)
Fix bug where onnxruntime_USE_NCCL flag would default to ON, causing ORT to not build properly. New functionality: flag is ON when training is enabled and NCCL is not disabled. Flag is OFF otherwise
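The new default described above can be written out as a small truth table. This is only a Python sketch of the stated rule; the real logic lives in the CMake build files, and `use_nccl`, `training_enabled` and `nccl_disabled` are hypothetical names chosen for the illustration.

```python
def use_nccl(training_enabled: bool, nccl_disabled: bool) -> bool:
    """ON only when training is enabled and NCCL is not disabled."""
    return training_enabled and not nccl_disabled


if __name__ == "__main__":
    for training_enabled in (False, True):
        for nccl_disabled in (False, True):
            state = "ON" if use_nccl(training_enabled, nccl_disabled) else "OFF"
            print(f"training={training_enabled} nccl_disabled={nccl_disabled} -> {state}")
```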
Co-authored-by: zhijxu <zhijxu@microsoft.com>
Co-authored-by: zhijxu <zhijxu>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev10.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
Co-authored-by: Carson Swope <carsonswope@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: jingyanwangms <47403504+jingyanwangms@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Viswanath Boga <44417868+viboga@users.noreply.github.com>
Co-authored-by: leqiao-1 <61653207+leqiao-1@users.noreply.github.com>
Co-authored-by: caoting-dotcom <71617901+caoting-dotcom@users.noreply.github.com>
Co-authored-by: Ting Cao <ticao@microsoft.com>
Co-authored-by: Sean Murray <59740888+seanmurr1@users.noreply.github.com>
* Update ONNX to 1.12 (#11924)
Follow-ups that need to happen after this and before the next ORT release:
* Support SequenceMap with https://github.com/microsoft/onnxruntime/pull/11731
* Support signal ops with https://github.com/microsoft/onnxruntime/pull/11778
Follow-ups that need to happen after this but don't necessarily need to happen before the release:
* Implement LayerNormalization kernel for opset version 17: https://github.com/microsoft/onnxruntime/issues/11916
Fixes #11640
* Dll version fix ovep4.1 (#11953)
* Setting default version values for ovep dlls as well
* Update backend_manager.cc
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: mohsin <mohsinx.mohammad@intel.com>
* Optimize t5 encoder in beam search (#11926)
* optimize t5 encoder
* update
* update
* update
* refactor expand impl
* cuda tests passed
* update
* alignment
* more alignments
* review comments
* Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)
Introduce Start/Stop threadpool spinning switch
Add a session config option to force spinning stop at the end of the Run()
* Restructure function inliner (#11731)
* Add nested function call tests
* Add overload for Specialize
* Pass symboltable to onnx shape inference
* Avoid renaming empty names
* Enable sequence_map tests which failed before this change
* Deprecate APIs returning raw ptrs and provide replacements (#11922)
Provide better documentation
* register signal ops for opset 17 (#11778)
* Register signal ops for op set 17
Note code is mostly being moved, not added. These ops were previously
only registered as Microsoft contrib ops and only built if
`BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx
standard op set in version 17.
Main components of this change:
* Move the kernels from the contrib_ops directory to the
core directory.
* Add function bodies for ms experimental ops. This will allow
old models that use the contrib ops to continue to function.
All the function bodies consist of a single op (the
new standard op), so performance overhead should be minimal.
Minor clean-up also in this change:
* De-duplicate get_scalar_value_from_tensor: put it in a new utils.h.
* Fix some bugs that caused compilation errors with the experimental
ops. Tested with `build.sh --ms_experimental`
* Fix some spelling errors and lint violations.
* Replace a couple of switch statements with `MLTypeCallDispatcher`.
* Use `InlineVector` instead of `std::vector`.
Unblocks https://github.com/microsoft/onnxruntime/issues/11640
* Include opset 15 in Conv+BatchNormalization fusion (#11960)
* Fix WinML tests that still target deprecated (deleted) experimental signal op definitions (#12006)
* fix winml tests
* remove legacy test
* switch idft -> dft+inverse attr
* upgrade opset 13->17 for signal ops tests
* [C# Tests] Add support for double tensor output in TestPreTrainedModels. (#12008)
Add support for double tensor output in TestPreTrainedModels.
* DML EP ResNet50 opset 15 fails in ONNX checker for FusedBatchNormalization lacking training_mode attribute (#12010)
FusedBatchNormalization include training_mode attribute
* Generalize native op creation (#11539)
* create op from ep
* read input count from context
* create holder to host nodes
* fix typo
* cast type before comparison
* throw error on API fail
* silence warning from minimal build
* switch to unique_ptr with deleter to host nodes
* fix typo
* fix build err for minimal
* fix build err for minimal
* add UT for conv
* enable test on CUDA
* add comment
* fix typo
* use gsl::span and string view for Node constructor
* Added two APIs - CopyKernelInfo and ReleaseKernelInfo
* pass gsl::span by value
* switch to span<NodeArg* const> to allow for reference to const containers
* fix typo
* fix reduced build err
* fix reduced build err
* refactoring node construction logic
* rename exceptions
* add input and output count as arguments for op creation
* refactor static member
* use ORT_CATCH instead of catch
* cancel try catch
* add static value name map
* format input definition and set err code
* fix comments
* fix typo
* [DML EP] Pad operator: Handle negative pad counts (#11974)
* Pad fallback to CPU
* Added queryPad in operatorRegistration.cpp
* Acknowledged PR comments
* Used any_of
* used none_of instead of any_of
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
* Add warning about future computation change for ConvTranspose with auto_pad (#11984)
* Add warning about future computation change for Convtranspose with auto_pad
* improve msg
* update TODO to make lint happy
* update more contents for warning and add if
* valid was not affected
* move it into kernel registration
* parse auto_pad myself
* try to use conv_transpose_attrs_.auto_pad directly
* update roialign cuda impl to onnx opset16 (#12036)
* roialign opset16
* fix
* fix
* Fix windows eager build break by pinning to torch version 1.11.0 (#12033)
Fix windows and linux eager build to torch 1.11.0.
* Skip Constant Folding for ops producing an optional type output (#11839)
* Disable sequence-type tests since C# infra doesn't support well (#12037)
* Extend lifetime of KernelDef when creating a standalone op (#12057)
place tmp kernel def as local variable to cover the lifetime of kernel creation
* Add targets files for new .net6 frameworks (#12016)
* Add net6 targets.
Remove maccatalyst as we don't have a native build targeting that.
* Set platform in macos targets
* Add targetFramework entries
* Move NativeLib.DllName definition and set using preprocessor values for simplicity. Couldn't get it to build with the preprocessor based setup when it was in a separate file.
Update the nuspec generation to set platform version for .net6 targets. TODO: Validate versions. I copied them from the managed nuget package the packaging pipeline generated prior to adding targets. Possibly we could/should lower some of the versions.
Hopefully the need to specify a version goes away when the release version of VS2022 supports .net6.
* Try android 31.1 as https://github.com/actions/virtual-environments/blob/main/images/win/Windows2022-Readme.md suggests that should be available on the CI machines
* Fix patch version mismatch
Add some extra debug info in case it helps
* Debug nuget location in CI
* Add workspace entry back in
* Add steps
* One more attempt with hardcoded nuget.exe path and original android31.0 version
* Better fix - found explicit nuget download and updated version there.
* flake8 fixes
* Fix black complaints.
* Exit Microsoft_ML_OnnxRuntime_CheckPrerequisites for net6 iOS.
* Removed outdated comment
* Fix DML custom operators which set descriptor heap to command list (#12059)
* Make C# runtest.sh automatically set latest opset (#12039)
* Update C# runtest.sh for opset 17
Should have been part of https://github.com/microsoft/onnxruntime/pull/11924
* get appropriate opset version from onnx doc
* use absolute rather than relative path
* fix typo in var name
* Disable DML command list reuse for Xbox (#12063)
disable cl reuse for xbox
* Add data type check in ConvAddRelu fusion (#12058)
* Add undocumented attribute to disable generation of Java bindings from the Android AAR. (#12075)
The generated bindings causes C# build errors that require workaround code. Disabling generation should avoid the need for any workarounds.
As the user has the C# ORT package with the C# to C bindings there's no need for binding generation that calls the ORT Java API (which is C# -> Java -> C).
* enable the extensions custom build for java and android (#11823)
* generate quantization parameter for outputs (#12089)
* DML EP Update to DML 1.9 (#12090)
* Update to DML 1.9
* Appease obnoxious Python formatting tool
* Fix orttraining-linux-ci-pipeline - Symbolic shape infer (#11965)
fix symbolic shape error due to upgraded numpy + legacy sympy
* check consumers of dq node before swap dq and transpose (#12099)
* check consumers of dq node before swap dq and transpose
* add unit test
Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: mohsin <mohsinx.mohammad@intel.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: G. Ramalingam <grama@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: sumitsays <sumitagarwal330@gmail.com>
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
Co-authored-by: Chun-Wei Chen <jacky82226@gmail.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Wil Brady <25513670+WilBrady@users.noreply.github.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: Justin Stoecker <justoeck@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
* Add .net6 support to the C# nuget package.
Currently requires jumping through a lot of hoops due to .net 6 only being supported in the preview release of VS 2022.
Build existing targets using msbuild.
Add .net6 targets and build using dotnet.
Create nuget package with combined targets.
A few misc automated changes from VS to spacing and adding a couple of properties.
* Try manually installing trt8.4 in multi-gpu pipeline
* Remove stmts that clean up cmake, ctest. Update tensorrt repository name passed to get_docker_image.py
* Update trt and cudnn home
* Don't install trtexec cli tool.
* Increase job timeout
* Revert timeout change and use trt placeholder builder build option
* Revert "Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888)"
This reverts commit d2cbae3a04.
* Revert prepacked_weights to avoid indirect inclusion in CUDA and TRT code that breaks the build.
* Add test for case where main const initializer in subgraph
* update test to use trt ep
* add initializer when converting from graph viewer to proto
* add comments
* add comments
* add comments
* only add initializer that is outer scope value
* make including outer scope value optional
* modify python format
* modify python format
* modify python format
* Remove test
* remove redundant argument
Minor wording update to warning message to clarify that the function style Compile API is deprecated now and will be removed soon.
Also updated some code comments.
The optimization consists of:
* Use int32_t instead of int64_t
* Use different code path for tf_crop_and_resize or other
coordinate_transformation_mode to avoid redundant conditions
* Loop-invariant code motion of offset, coefficient and extrapolation_value
check
* Use fixed point to avoid floating-point computation
Besides, it always transforms NCHW Resize to NHWC because it has higher perf in
the NHWC variant when the input X is 4D int8/uint8 tensor and the mode is
linear on ARM.
It improves DeepLab V3 with int8 quantization by 26%~27% on big core and 37% on
LITTLE core on AArch64. It also improves DeepLab V3 with uint8 quantization by
24%~25% on big core and 34% on LITTLE core on AArch64.
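To make the "fixed point instead of floating point" item concrete, here is a minimal Python sketch of a fixed-point linear interpolation between two 8-bit samples. It is not the kernel from this change: `FRAC_BITS = 10` is an assumed precision (the commit message does not state the one it uses), and `to_fixed`, `lerp_float` and `lerp_fixed` are hypothetical names. With 8-bit inputs and 10 fractional bits every intermediate value fits comfortably in an int32.

```python
FRAC_BITS = 10          # assumed number of fractional bits
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(t: float) -> int:
    """Quantize an interpolation weight in [0, 1] to fixed point."""
    return int(round(t * ONE))


def lerp_float(a: int, b: int, t: float) -> float:
    """Reference floating-point linear interpolation."""
    return a * (1.0 - t) + b * t


def lerp_fixed(a: int, b: int, t_fixed: int) -> int:
    """Integer-only interpolation; adding half of ONE rounds to nearest."""
    acc = a * (ONE - t_fixed) + b * t_fixed
    return (acc + (ONE >> 1)) >> FRAC_BITS


if __name__ == "__main__":
    for a, b, t in [(0, 255, 0.5), (10, 20, 0.25), (200, 50, 0.8)]:
        print(a, b, t, lerp_float(a, b, t), lerp_fixed(a, b, to_fixed(t)))
```

The weights can be converted once per output row or column (they are loop invariant), so the inner loop needs only integer multiplies, adds and a shift.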
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
* update trt 8.4ga
* trt 8.4 linux ci pipeline
* fix cmake
* placeholder_builder
* trt 8.4 windows pipeline
* gpu package pipeline
* trt 8.4.1.5 , packaging pipeline updates
* python packaging
* ctest timeout
* python packaging test
* bump timeout
* python format
* format
* revert
* newline
* enable trt python tests
* typo
* python format
* disable on windows
* Rework the EP factory creation setup so we're not cut-and-pasting function declarations in multiple places.
Convert append EP for SNPE to be generic, and also use for XNNPACK.
Add XNNPACK to C# API
* Don't need stub for MIGraphX as it's using provider bridge.
* Remove old 'create' functions that aren't applicable now that the EPs are built as separate libraries.
* Only use EPs that require the layout transform if the opset is supported by the layout transformer.
* Update wasm registration of xnnpack.
* initial gather support nnapi
* update
* minor update
* address pr comments
* add int32 indices test case for nnapi
* remove nnapi ep limitation for added UT
* add link for memcpy type punning usage
Prior to this every test shared the same tolerances. This meant
that if an ONNX test failed due to a small but acceptable difference in
output, the only alternative was to disable the test entirely.
In op set 17, the DFT operator is being added. Without this change, the
tests for that operator fail because the output is off by about 5e-5.
It's better to keep test coverage for this new op rather than disable
the test entirely.
Also prior to this change, the global tolerances were not shared between
C++, JavaScript, and Python tests. Now they are.
Also fix various minor issues raised by linters.
Unblocks https://github.com/microsoft/onnxruntime/issues/11640.
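The idea of per-test tolerances layered over shared global defaults can be sketched as follows. This is a hedged Python illustration only: the default values, the `test_dft` entry, and the names `DEFAULT_TOLERANCES`, `TOLERANCE_OVERRIDES`, `tolerances_for` and `outputs_match` are assumptions made for the example, not the file format or API introduced by the change.

```python
# Assumed global defaults shared by every test.
DEFAULT_TOLERANCES = {"atol": 1e-5, "rtol": 1e-3}

# Hypothetical per-test overrides; only the listed keys replace the defaults.
TOLERANCE_OVERRIDES = {
    "test_dft": {"atol": 1e-4},
}


def tolerances_for(test_name: str) -> dict:
    """Merge the global defaults with any override for this test."""
    merged = dict(DEFAULT_TOLERANCES)
    merged.update(TOLERANCE_OVERRIDES.get(test_name, {}))
    return merged


def outputs_match(test_name: str, expected, actual) -> bool:
    """Element-wise |expected - actual| <= atol + rtol * |expected|."""
    tol = tolerances_for(test_name)
    if len(expected) != len(actual):
        return False
    return all(
        abs(e - a) <= tol["atol"] + tol["rtol"] * abs(e)
        for e, a in zip(expected, actual)
    )


if __name__ == "__main__":
    # An error of about 5e-5 passes only where the override applies.
    print(outputs_match("test_dft", [0.0], [5e-5]))
    print(outputs_match("test_other", [0.0], [5e-5]))
```

Keeping the overrides in one data structure is what lets several test runners read the same values instead of each hardcoding its own.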
* Reserve the first core for the main thread
Currently in "auto affinity" mode the worker threads are affinized to cores 0..(N-1), leaving the very last core for the main thread. This patch reserves core #0 for the main thread, and affinizes the worker threads to cores 1..N.
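The two layouts described above can be compared with a small Python sketch. It assumes a machine with N+1 cores and N worker threads, as the description implies; `auto_affinity` and `reserve_first_core` are hypothetical names, and no real affinity call is made.

```python
def auto_affinity(num_cores: int, reserve_first_core: bool):
    """Return (main_thread_core, worker_cores) for num_cores - 1 workers."""
    num_workers = num_cores - 1
    first_worker = 1 if reserve_first_core else 0
    workers = list(range(first_worker, first_worker + num_workers))
    main_core = 0 if reserve_first_core else num_cores - 1
    return main_core, workers


if __name__ == "__main__":
    print("old layout:", auto_affinity(8, reserve_first_core=False))
    print("new layout:", auto_affinity(8, reserve_first_core=True))
```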
* Avoid unneeded spin_pause in thread pool's worker threads
Remove unneeded PAUSE instruction (0.1-0.2 usec latency) after a worker thread finds a task to execute.
* MLAS/x86: optimize QLinearConv on hybrid CPUs
Existing 4x task granularity for task partitioning on hybrid CPUs is
not sufficient to compensate for the difference in VNNI instruction
throughput between performance and efficiency cores. This patch...
* Increases granularity for QLinearConv by 2x, to have 2x more tasks
with 2x smaller output count
* Limits QLinearConv task count from above, to avoid output count per
task getting smaller than kernel's capability
* Removes hardcoded task count for QLinearConv as it limited scaling on
CPUs with 16+ cores
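The interaction between a higher granularity and an upper bound on the task count can be sketched like this. It is a hypothetical Python model, not the MLAS partitioning code: `partition_tasks`, `granularity` and `min_outputs_per_task` are names invented for the example, and the real kernel limits are not given in this message.

```python
import math


def partition_tasks(output_count: int, thread_count: int,
                    granularity: int, min_outputs_per_task: int):
    """Return (task_count, outputs_per_task) for one QLinearConv-like job.

    More tasks than threads lets faster cores pick up extra work, but the
    task count is capped so no task drops below min_outputs_per_task.
    """
    wanted = thread_count * granularity
    cap = max(1, output_count // min_outputs_per_task)
    task_count = max(1, min(wanted, cap))
    outputs_per_task = math.ceil(output_count / task_count)
    return task_count, outputs_per_task


if __name__ == "__main__":
    # Doubling the granularity doubles the tasks and halves the work per task.
    print(partition_tasks(1024, 8, granularity=4, min_outputs_per_task=1))
    print(partition_tasks(1024, 8, granularity=8, min_outputs_per_task=1))
    # The cap takes over once tasks would become too small for the kernel.
    print(partition_tasks(1024, 8, granularity=8, min_outputs_per_task=24))
```

Because the task count scales with the thread count rather than being a fixed constant, the same rule keeps producing enough tasks on machines with many cores.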
* Addressing comments
* combining x86 ARM branches in qlinearconv threaded job partition
* revert first core assignment
Co-authored-by: Saurabh <saurabh.tangri@intel.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>