* enabme telemetry
* enable telemetry
* set enable telemetry as default
* for debugging
* remove log and set disable telemetry as default back
* delete private file while testing
* resolve comment: mainly add license header, rename macro and update docs
* rewording in privacy.md
* [NupharEP] Enable parallel schedule
* Update TVM with the fix to TVM threadpool to use OpenMP if possible
* Add parallel schedule when trying to vectorize
With this change, BERT squad perf on a 4-core (8 HT) CPU goes from 187ms to 150ms
* Address CR, docs and cmake update
* Doc fix
* Fix mkl
* Fix TVM windows build when using mklml
- Added C-API test for loading custom op shared lib.
- Made some changes in C++ api header and C-api implementation to get it working.
- Added C# API and corresponding test for loading custom op shared library.
* onnxrt server with OVEP
* onnxrt server with OVEP
* Update Dockerfile.server.openvino
* onnxrt server OVEP fix reviews
* onnxrt server OVEP fix reviews
1. refactor the pipeline, remove some duplicated code
2. Move Windows_py_GPU_Wheels job to Win-GPU-CUDA10. We'll deprecated the "Win-GPU" pool
3. Delete cpu-nocontribops-esrp-pipeline.yml and cpu-nocontribops-pipeline.yml
4. In Linux nuget jobs, run "make install" before creating the package. So that extra RPAH info will be removed
* Guard unused parameter
Guard unused parameter for Linux Arm and other cases.
* Add ACL (Arm Compute Library) execution provider
Add a new execution provider targeting Arm architecture based on Arm Compute Library.
Validated on NXP i.MX8QM CPU with ResNet50, MobileNetv2 and VGG models.
All unit tests are passing.
Comparative performance improvements for ResNet50v1 model obtained with
onnxruntime_perf_test:
A72 2xA72 A53 4xA53
ACL vs CPU 16% 9% 21% 13%
Usage documentation available in ACL-ExecutionProvider.
* Fix eigen unused parameter
Fix eigen unused parameter error for Arm cross-compilation.
* Refine optimizers
* Address PR comments
* Changes from PR comments and discussion.
* Fixed signed/unsigned mismatch
* Address PR comments
* Address PR comments
* Fix linux build
* Fix issue with mkldnn logic.
* Turn off optimizers by default for operator unit tests.
* Handle edge case of graph with no nodes in partitioner so all execution providers don't need to.
* Comment out change to turn off optimizers for unit tests. Add details on what needs to be done to re-enable.
This change adds a new execution provider powered by [DirectML](https://aka.ms/DirectML).
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning on Windows. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers.
The DirectML execution provider is capable of greatly improving evaluation time of models using commodity GPU hardware, without sacrificing broad hardware support or requiring vendor-specific extensions to be installed.
**Note** that the DML EP code was moved verbatim from the existing WindowsAI project, which is why it doesn't yet conform to the onnxruntime coding style. This is something that can be fixed later; we would like to keep formatting/whitespace changes to a minimum for the time being to make it easier to port fixes from WindowsAI to ORT during this transition.
Summary of changes:
* Initial commit of DML EP files under onnxruntime/core/providers/dml
* Add cmake entries for building the DML EP and for pulling down the DirectML redist using nuget
* Add a submodule dependency on the Windows Implementation Library (WIL)
* Add docs under docs/execution_providers/DirectML-ExecutionProvider.md
* Add support for DML EP to provider tests and perf tests
* Add support for DML EP to fns_candy_style_transfer sample
* Add entries to the C ABI for instantiating the DML EP
In ORT, there is only 3 cuda stream: default, HtoD, DtoH. And both HtoD and DtoH are non-blocking stream. Thus, per-thread stream mode doesn't have any benefit.
I also tried in multiple thread env and the legacy mode is also better than per-thread model.
Below is the perf of a 3 layer bert on v100. Unit is ms:
batch size 1:
concurrency | c=1 | c=2 | c=4
legacy | 0.54 | 1.17 | 2.68
per-thread | 0.66 | 1.37 | 2.86
batch size 4:
concurrency | c=1 | c=2 | c=4
legacy | 1.1 | 2.22 | 4.6
per-thread | 1.21 | 2.44 | 4.98
batch size 64:
concurrency | c=1 | c=2 | c=4
legacy | 8.09 | 16.13 | 32.37
per-thread | 8.18 | 16.26 | 32.45
* Adjust ngraph cmake files to onnx 1.5.0
* Enable LSTM reverse direction mode in nGraph EP
* Enable full support for the Split op in nGraph EP
* Revert "Disable the unsigned input Shrink op tests for nGraph until the next update"
This reverts commit 257b42a55bdd98f804d4846868542b8e3aeb4b4e.
* Enable Gather and remove unused subgraph attribute
* Remove the unused param from AppendClusterToSubGraph
* Fix for the incorrect onnx opset version
* Use the r0.26 release branch before the tag is created
* Enable the quantizelinear and dequantizelinear for NGEP
* Use the v0.26.0-rc.2 tag in ngraph.cmake
* Add skip for modes others than default in Pad operator
* Reenable negative axis tests for ngraph
* Use temporary ngraph version
* Use branch name instead of SHA for temporary ngraph branch
* Use ngraph v0.26.0-rc.4
* Remove patch for missing symbol in MKLDNN
* Use MKLDNN 1.0 in ngraph
* Exclude the Pad op for opsets greater than 10
* Disable quantizelinear and dequantizelinear tests for ONNX 1.5.0
* Fix the onnx-headers related compilation errors
* ONNX libs linking fix
* Use a tag for ngraph and support more Pad modes
* Use the v0.26.0 release tag for nGraph
* Update ngraph to RC8 - bigobj flag for Windows builds
* Fix the MKLDNN constexpr error on Windows
* save status: add tiling layout; add avx512 skylake cpuid info
* unit tests and matmul integer model passed on skylake, need to verify model
* save commit before update master
* fix check
* address comments
Remove gsl subodule and replace with a local copy of gsl-lite
Refactor for onnxruntime::make_unique
gsl::span size and index are now size_t
Remove lambda auto argument type detection.
Remove constexpr from fail_fast in gsl due to Linux not being happy.
Comment out std::stream support due to MacOS std lib broken.
Move make_unique into include/core/common so it is accessible for server builds.
Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc
due to x86 build.
Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized
* Fixed a bug of missing tvm in python wheel
* Put Nuphar Python scripts into wheel
* Add note book tutorial
* Some improvements in symbolic shape inference for quantized models