This is a big PR. We are going to move it up to layer_dev, which is still an L3, so it is still safe to work there in an agile way.
We are moving this into the L3 so that Ryan can start integration testing.
We will pause for a full code review and integration test results before going into the L2.
>>>> raw comments from previous commits >>>
* LearningModelSession is cleaned up to use the adapter, and so are parts of binding.
* moved everything in the winmladapter
made it all nano-COM, using WRL to construct objects on the ORT side.
base interfaces for everything for winml to call
cleaned up a bunch of winml to use the base interfaces.
* more pieces
* GetData across the abi.
* renamed some namespaces
cleaned up OrtValue
cleaned up Tensor
cleaned up custom ops.
everything *but* learningmodel should be clean
* make sure it's building. winml.dll is still a monolith.
add onecoreuap_apiset.lib in order to avoid linking against kernel32.lib etc and violating our OS layering requirements.
We linked against onecoreuap_apiset.lib in VB, so we will continue doing this, but I am still unsure why we do not link against onecore instead, since that is where we ship. However, since Sheil is the owner of this code, we will wait to discuss it with him before changing anything.
* Guard unused parameter
Guard unused parameter for Linux Arm and other cases.
* Add ACL (Arm Compute Library) execution provider
Add a new execution provider targeting Arm architecture based on Arm Compute Library.
Validated on NXP i.MX8QM CPU with ResNet50, MobileNetv2 and VGG models.
All unit tests are passing.
Comparative performance improvements for the ResNet50v1 model, obtained with onnxruntime_perf_test:
           | A72 | 2xA72 | A53 | 4xA53
ACL vs CPU | 16% | 9%    | 21% | 13%
Usage documentation available in ACL-ExecutionProvider.
* Fix eigen unused parameter
Fix eigen unused parameter error for Arm cross-compilation.
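The unused-parameter guards mentioned above usually come down to explicitly discarding an argument on the platforms where it is not consumed. A minimal sketch, assuming a hypothetical macro `SKETCH_UNUSED_PARAMETER` and a hypothetical function `PickThreadCount` (neither is taken from this change, which does not say which parameters were guarded):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro for this sketch: casting to void marks the argument as
// intentionally unused, which keeps -Wunused-parameter (often promoted to an
// error by -Werror) quiet.
#define SKETCH_UNUSED_PARAMETER(x) (void)(x)

// Hypothetical function: returns the thread count to use. In this sketch the
// Arm branch ignores `hint`, so the parameter is guarded there.
inline std::size_t PickThreadCount(std::size_t requested, std::size_t hint) {
  assert(requested > 0);
#if defined(__arm__) || defined(__aarch64__)
  SKETCH_UNUSED_PARAMETER(hint);
  return requested;
#else
  return hint != 0 ? hint : requested;
#endif
}
```

The cast has no runtime effect; it only documents intent and silences the warning on the builds where the parameter would otherwise go unused.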
* Refine optimizers
* Address PR comments
* Changes from PR comments and discussion.
* Fixed signed/unsigned mismatch
* Address PR comments
* Address PR comments
* Fix linux build
* Fix issue with mkldnn logic.
* Turn off optimizers by default for operator unit tests.
* Handle edge case of graph with no nodes in partitioner so all execution providers don't need to.
* Comment out change to turn off optimizers for unit tests. Add details on what needs to be done to re-enable.
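The partitioner edge case above (a graph with no nodes) can be handled once, centrally, so that individual execution providers never see an empty graph. A minimal sketch of that idea, using hypothetical stand-in types `SketchGraph` and `SketchProvider` rather than the real onnxruntime classes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph: just a list of node names.
struct SketchGraph {
  std::vector<std::string> nodes;
};

// Hypothetical stand-in for an execution provider that can claim at most
// `max_nodes` nodes and records how often it was queried.
struct SketchProvider {
  explicit SketchProvider(std::size_t max) : max_nodes(max), queries(0) {}
  std::size_t Claim(std::size_t available) const {
    ++queries;
    return std::min(available, max_nodes);
  }
  std::size_t max_nodes;
  mutable std::size_t queries;
};

// Returns how many nodes were assigned to providers.
inline std::size_t PartitionSketch(const SketchGraph& graph,
                                   const std::vector<SketchProvider>& providers) {
  if (graph.nodes.empty()) {
    return 0;  // central guard: providers are never queried for an empty graph
  }
  std::size_t remaining = graph.nodes.size();
  for (const SketchProvider& provider : providers) {
    const std::size_t claimed = provider.Claim(remaining);
    assert(claimed <= remaining);
    remaining -= claimed;
    if (remaining == 0) break;
  }
  return graph.nodes.size() - remaining;
}
```

With the guard in the partitioner, a provider's capability query can assume at least one node is present.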
This change adds a new execution provider powered by [DirectML](https://aka.ms/DirectML).
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning on Windows. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers.
The DirectML execution provider is capable of greatly improving evaluation time of models using commodity GPU hardware, without sacrificing broad hardware support or requiring vendor-specific extensions to be installed.
**Note** that the DML EP code was moved verbatim from the existing WindowsAI project, which is why it doesn't yet conform to the onnxruntime coding style. This is something that can be fixed later; we would like to keep formatting/whitespace changes to a minimum for the time being to make it easier to port fixes from WindowsAI to ORT during this transition.
Summary of changes:
* Initial commit of DML EP files under onnxruntime/core/providers/dml
* Add cmake entries for building the DML EP and for pulling down the DirectML redist using nuget
* Add a submodule dependency on the Windows Implementation Library (WIL)
* Add docs under docs/execution_providers/DirectML-ExecutionProvider.md
* Add support for DML EP to provider tests and perf tests
* Add support for DML EP to fns_candy_style_transfer sample
* Add entries to the C ABI for instantiating the DML EP
ORT has only three CUDA streams: default, HtoD, and DtoH. Both HtoD and DtoH are non-blocking streams, so per-thread stream mode brings no benefit.
I also tested in a multi-threaded environment, and legacy mode is still faster than per-thread mode.
Below are timings for a 3-layer BERT model on a V100, in ms:
batch size 1:
concurrency | c=1 | c=2 | c=4
legacy | 0.54 | 1.17 | 2.68
per-thread | 0.66 | 1.37 | 2.86
batch size 4:
concurrency | c=1 | c=2 | c=4
legacy | 1.1 | 2.22 | 4.6
per-thread | 1.21 | 2.44 | 4.98
batch size 64:
concurrency | c=1 | c=2 | c=4
legacy | 8.09 | 16.13 | 32.37
per-thread | 8.18 | 16.26 | 32.45
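To make the tables above easier to compare, the slowdown of per-thread mode relative to legacy mode can be computed directly from the reported timings. A small sketch (the helper name `RelativeOverhead` is an assumption for illustration, not part of this change):

```cpp
#include <cassert>

// Relative slowdown of per-thread mode versus legacy mode, as a fraction:
// (per_thread - legacy) / legacy. Both arguments are timings in ms.
inline double RelativeOverhead(double legacy_ms, double per_thread_ms) {
  assert(legacy_ms > 0.0);
  return (per_thread_ms - legacy_ms) / legacy_ms;
}
```

Applied to the c=1 column, `RelativeOverhead(0.54, 0.66)` is about 0.22 (roughly 22% slower) at batch size 1, while `RelativeOverhead(8.09, 8.18)` is about 0.011 (roughly 1% slower) at batch size 64, so the gap narrows as the batch size grows.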
* Adjust ngraph cmake files to onnx 1.5.0
* Enable LSTM reverse direction mode in nGraph EP
* Enable full support for the Split op in nGraph EP
* Revert "Disable the unsigned input Shrink op tests for nGraph until the next update"
This reverts commit 257b42a55bdd98f804d4846868542b8e3aeb4b4e.
* Enable Gather and remove unused subgraph attribute
* Remove the unused param from AppendClusterToSubGraph
* Fix for the incorrect onnx opset version
* Use the r0.26 release branch before the tag is created
* Enable the quantizelinear and dequantizelinear for NGEP
* Use the v0.26.0-rc.2 tag in ngraph.cmake
* Add skip for modes other than default in Pad operator
* Reenable negative axis tests for ngraph
* Use temporary ngraph version
* Use branch name instead of SHA for temporary ngraph branch
* Use ngraph v0.26.0-rc.4
* Remove patch for missing symbol in MKLDNN
* Use MKLDNN 1.0 in ngraph
* Exclude the Pad op for opsets greater than 10
* Disable quantizelinear and dequantizelinear tests for ONNX 1.5.0
* Fix the onnx-headers related compilation errors
* ONNX libs linking fix
* Use a tag for ngraph and support more Pad modes
* Use the v0.26.0 release tag for nGraph
* Update ngraph to RC8 - bigobj flag for Windows builds
* Fix the MKLDNN constexpr error on Windows
* save status: add tiling layout; add avx512 skylake cpuid info
* unit tests and matmul integer model passed on skylake, need to verify model
* save commit before update master
* fix check
* address comments
Remove gsl submodule and replace with a local copy of gsl-lite
Refactor for onnxruntime::make_unique
gsl::span size and index are now size_t
Remove lambda auto argument type detection.
Remove constexpr from fail_fast in gsl because the Linux build does not accept it.
Comment out std::stream support because the macOS std lib is broken.
Move make_unique into include/core/common so it is accessible for server builds.
Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc due to x86 build.
Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized
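For readers unfamiliar with the make_unique refactor above: a project-local helper of this kind is typically needed when the toolchain cannot rely on C++14's `std::make_unique`. A minimal sketch of such a helper, placed in a hypothetical `sketch` namespace (this is not the actual `onnxruntime::make_unique` source, and it covers only the non-array case):

```cpp
#include <cassert>
#include <memory>
#include <utility>

namespace sketch {

// Constructs a T from the forwarded arguments and wraps it in a unique_ptr,
// so call sites never spell out a raw `new`.
template <typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
  return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}

// Small usage example: builds a pair on the heap and sums its members.
inline int DemoSum(int a, int b) {
  std::unique_ptr<std::pair<int, int>> p = make_unique<std::pair<int, int>>(a, b);
  assert(p != nullptr);
  return p->first + p->second;
}

}  // namespace sketch
```

Keeping the helper in a shared public include directory, as the note above describes, lets every build target use the same definition.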
* Fixed a bug of missing tvm in python wheel
* Put Nuphar Python scripts into wheel
* Add notebook tutorial
* Some improvements in symbolic shape inference for quantized models
* Added support for Hetero plugin
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Fixed spelling error in cmake for hetero plugin
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Added listener to print messages from the plugin
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Updated Documentation for VAD-F enablement
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Added VAD-F option for FPGA
* Disabled unit tests and backend tests because FPGA only accepts NCHW models
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Added comment for why tests need to be disabled on VAD-F
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>