**Description**: This PR adds Ascend CANN execution provider support.
**Motivation and Context**
- Why is this change required? What problem does it solve?
As described in the linked issue, CANN is the API layer for the Ascend
processor. Adding a CANN EP allows users to run ONNX models on Ascend
hardware via ONNX Runtime.
The changes in detail:
1. Added the CANN EP framework.
2. Added the basic operators needed to support the ResNet and VGG models.
3. Added C/C++ and Python API support.
- If it fixes an open issue, please link to the issue here.
https://github.com/microsoft/onnxruntime/issues/11477
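For illustration, a minimal sketch of running a model through the new EP from Python; the provider name follows this PR's registration, and the model file and input shape are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Requires a build with the CANN EP enabled and an Ascend device present.
# Ops the CANN EP doesn't cover fall back to the CPU provider.
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["CANNExecutionProvider", "CPUExecutionProvider"],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: x})
```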
Author:
lijiawei <lijiawei19@huawei.com>
wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: FFrog <ljw1101.vip@gmail.com>
**Description**: Python API bindings for on-device training.
**Motivation and Context**
- This PR contains the API bindings so Python users can run a complete
training loop.
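A sketch of the loop these bindings enable, assuming training artifacts (checkpoint, training/eval/optimizer models) were generated offline; file names are hypothetical, and the API surface shown follows onnxruntime.training.api, which may differ in detail from this PR:

```python
import numpy as np
from onnxruntime.training import api as ort_api

state = ort_api.CheckpointState.load_checkpoint("checkpoint")
module = ort_api.Module("training_model.onnx", state, "eval_model.onnx")
optimizer = ort_api.Optimizer("optimizer_model.onnx", module)

# Placeholder batch; real shapes and dtypes depend on the exported model.
dataset = [(np.random.rand(4, 784).astype(np.float32),
            np.random.randint(0, 10, size=4).astype(np.int64))]

module.train()
for batch, labels in dataset:
    loss = module(batch, labels)   # forward + backward
    optimizer.step()               # apply gradients
    module.lazy_reset_grad()       # clear gradients for the next step
```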
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
* Make ORT a PyTorch JIT backend
LORT likely doesn't work with the ATen fallback, so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator; will add it later.
* Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
* Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use the PyTorch master branch, as all required PRs are merged; fix fallout
* Refine based on cpplint feedback
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use the unittest framework (see the sketch after this list)
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use the same pipeline file for the eager and LORT pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
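The test-harness items above ("Use torch.testing.assert_close", "Use the unittest framework") amount to tests of roughly this shape; the clone of the eager result is illustrative, standing in for the ORT-backed output:

```python
import unittest
import torch

class TestLortNumerics(unittest.TestCase):
    def test_matmul_close(self):
        a, b = torch.randn(4, 4), torch.randn(4, 4)
        expected = a @ b
        actual = (a @ b).clone()  # stand-in for the ORT-backed result
        # Relaxed tolerances, per the "Relax tolerance" commit above.
        torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)

if __name__ == "__main__":
    unittest.main()
```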
* Rework the EP factory creation setup so we're not copy-pasting function declarations in multiple places.
Convert the SNPE append-EP entry point to be generic, and use it for XNNPACK as well (see the sketch after this list).
Add XNNPACK to C# API
* Don't need stub for MIGraphX as it's using provider bridge.
* Remove old 'create' functions that aren't applicable now that the EPs are built as separate libraries.
* Only use EPs that require the layout transform if the opset is supported by the layout transformer.
* Update wasm registration of xnnpack.
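From Python, the now-generic registration looks roughly like this; the option key shown is illustrative, not a confirmed XNNPACK option name:

```python
import onnxruntime as ort

# The same string-keyed options flow through the generic append-EP entry
# point that SNPE and XNNPACK now share.
session = ort.InferenceSession(
    "model.onnx",
    providers=[("XnnpackExecutionProvider", {"intra_op_num_threads": "2"})],
)
```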
* ATen op for inference (see the sketch after this list)
* fix build error
* move some code to training only
* remove domain from operator name
* move aten_op_executor ext out from ortmodule
* add pipeline
* add exec mode
* fix script
* fix ut script
* fix test pipeline
* failure test
* rollback
* bugfix
* resolve comments
* enable aten for python build only
* fix win build
* use target_compile_definitions
* support io binding
* turn off aten by default
* fix ut
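For context, a sketch of what the ATen fallback emits: a contrib node whose attribute names the ATen operator to run at execution time via aten_op_executor. The domain and attribute names here are assumptions based on the contrib-op design:

```python
from onnx import helper

# An "ATen" contrib node; the "operator" attribute selects the ATen op,
# so the op name itself no longer carries a domain prefix.
node = helper.make_node(
    "ATen",
    inputs=["x", "y"],
    outputs=["z"],
    domain="org.pytorch.aten",
    operator="add",
)
```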
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: zhijxu <zhijxu@microsoft.com>
* Tweaks to the model utils
* Add handling for a dim_value of -1 when replacing the entire input shape; this occurs in models exported from PaddlePaddle (see the sketch after this list)
* make pytorch helpers accessible in package
* make QDQ helpers accessible in package
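A hedged sketch of the -1 handling (helper name hypothetical): PaddlePaddle exports use a dim_value of -1 rather than a dim_param for dynamic dimensions, so both spellings are treated as replaceable when overwriting an input shape:

```python
import onnx

def replace_input_shape(model: onnx.ModelProto, input_name: str, new_shape):
    """Overwrite the whole shape of one graph input."""
    for inp in model.graph.input:
        if inp.name != input_name:
            continue
        dims = inp.type.tensor_type.shape.dim
        for dim, value in zip(dims, new_shape):
            # A dim_param or a dim_value of -1 (as exported by PaddlePaddle)
            # both denote a dynamic dimension.
            if dim.HasField("dim_param") or dim.dim_value == -1:
                dim.dim_value = value  # protobuf oneof: clears any dim_param
            elif dim.dim_value != value:
                raise ValueError(f"fixed dim {dim.dim_value} != {value}")
```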
Add runtime optimization support to ONNX -> ORT format conversion script.
Replace `--optimization_level`, `--use_nnapi`, and `--use_coreml` with a new `--optimization_style` option.
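For example, a hedged invocation of the updated script (module path and option values assumed from the script's help; "Runtime" keeps runtime optimizations replayable at session load, while "Fixed" bakes them into the ORT file):

```python
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "model.onnx", "--optimization_style", "Runtime"],
    check=True,
)
```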
* add support for bool type
* add TVM EP support for tests
* include TVM EP in the python test pool (see the sketch after this list)
* fix pylint
* moved technical imports to a separate file
* clean up post build actions & move _ld_preload.py extension to CMake level
* add files for include TVM EP into CI
* implement custom logger for TVM
* replace TVM logging with ONNX RT logging
* update link for TVM EP tutorial
* clean up TVM EP cmake
* add pybind auto enabling for TVM EP
* fix blank spaces
* code review fixes
* replace print with comment
* add list of EP without TVM EP
* enable onnx tests
* disable contrib ops and ml ops
* reuse Dockerfile.ubuntu
* Move install_tvm_test_dependencies.sh out of Docker context dir, update build definition.
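As referenced above, a sketch of how the Python tests can enable the TVM EP; the provider name and option keys are assumptions based on the TVM EP's provider options:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["TvmExecutionProvider"],
    provider_options=[{"target": "llvm", "opt_level": "3"}],
)
```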
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* clearing map for eager mode backends
* clearing map for eager mode backends manager
* making OrtBackendsManager an extern variable and trying to delete it
* cleaning backends manager when the python interpreter exits
* adding ifdef for eager mode code
* disabling warning for pybind state file
* disabling warning for python module file
* running clang-format and reducing redundancy
* remove new line
* moving declaration to a new header file
* adding the header file for eager mode for python module
* removing source files for eager mode
* add source file for python module in eager mode
* Update orttraining/orttraining/python/orttraining_python_module_eager.h
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
* squashed commit for standalone tvm execution provider
* critical fix for correct python build with stvm ep
* get tuning log file from ep options. It has priority over AUTOTVM_TUNING_LOG
* updates and fixes
* update parsing of stvm provider options
* add support of external data for onnx model
* add conditional dump of subgraphs
* remove unused code
* get input tensor shapes through provider options; for models with fixed inputs, get output shapes via the TVM API
* support the AutoTVM tuning log file inside ORT; the selector between Ansor and AutoTVM is the tuning_type provider option (see the sketch after this list)
* add fp16
* add conversion of the model layout to NHWC when needed; the necessary parameter was added to the STVM provider options
* fix license text in header. fix log format
* small fixes
* fix issues from flake8
* remove model proto construction from GetCapability
* reserve memory for vector of DLTensors
* add simple tutorial for STVM EP
* STVM docs
* jroesch/tvm -> apache/tvm
* remove dead code, unnecessary logs and comments
* fix in readme
* improve tutorial notebook
* tvm update
* update STVM_EP.md
* fix default value
* update STVM_EP.md
* some TODOs for the future development
* shorten long lines
* add hyperlink to STVM_EP.md
* fix Linux CI error
* fix error in csharp test
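As referenced above, a sketch of the STVM provider options described in this list; the provider name and key names are assumptions from the era of this PR (the EP was later renamed to TVM):

```python
import onnxruntime as ort

provider_options = [{
    "target": "llvm -mcpu=core-avx2",
    "input_names": "data",
    "input_shapes": "[1 3 224 224]",   # fixed shapes enable output-shape inference via TVM
    "tuning_type": "AutoTVM",          # or "Ansor"
    "tuning_file_path": "tuning.log",  # takes priority over AUTOTVM_TUNING_LOG
    "to_nhwc": "True",                 # optional NHWC layout conversion
}]
session = ort.InferenceSession(
    "model.onnx",
    providers=["StvmExecutionProvider"],
    provider_options=provider_options,
)
```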
Co-authored-by: Jared Roesch <jroesch@octoml.ai>
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
* explicitly link with libtorch instead of using a CMake var, to avoid introducing an MKL dependency
* use find_lib to get libtorch lib name
* temp fix
* add missing libraries
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* libonnxruntime_providers_rocm.so and libonnxruntime_providers_shared.so are not included in python package.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* optimize python overhead of _post_amp_backward
* overwrite apex amp's zero_grad with a faster implementation (see the sketch after this list)
* move unscale_fp16_grads_into_fp32_grads into C++ impl
* improve the efficiency further, reducing 3.5ms to 1.7ms for unilm.
* unilm 1.7ms to 338us: 1) optimize the python list <==> std::vector copy, 2) launch the kernels as soon as num_elem reaches the threshold. This helps reduce CUDA idle time.
* refine the logic a bit after validating
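The zero_grad override is, in spirit, the standard trick below (a sketch only; the real change lives in the apex amp wrapper, with the unscale path moved to C++): dropping gradient buffers is cheaper than zeroing them.

```python
def fast_zero_grad(optimizer):
    # Setting .grad to None avoids launching memset kernels and lets the
    # next backward pass allocate fresh buffers.
    for group in optimizer.param_groups:
        for param in group["params"]:
            param.grad = None
```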
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
* re-hipify all rocm EP sources
* fix all other files affected by re-hipify
* add cuda_provider_factory.h to amd_hipify.py
* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration
* Fix ReduceConsts template specialization introduced in #9101.
Fixes the error when building for ROCm 4.3.1:
error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)
* fix flake8 error in amd_hipify.py
* speed up hipify with concurrent.futures (see the sketch after this list)
* flake8 fix in amd_hipify.py
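As referenced above, the concurrent.futures speed-up amounts to converting files in parallel; hipify_file stands in for the per-file conversion in amd_hipify.py:

```python
import concurrent.futures

def hipify_all(files, hipify_file):
    # Each file's conversion is independent, so fan out across workers.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        return list(executor.map(hipify_file, files))
```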
* removing torch warnings that were causing errors and changing flags for Windows
* adding MKL library resolution and comments
* cleaning up the code
* fixing onnxruntime_python file for windows build
* fix the include order to avoid the python_d.lib issue on the Windows debug build
* changes for warnings, typos and other comments
* resolve merge conflict
* adding fix for mkl library error
* Revert "adding fix for mkl library error"
This reverts commit 73b87c73c2.
* fix for dll path for windows
* typo for dll path
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* Include pytorch_export_contrib_ops in inference builds
Rename / move it from tools/python/register_custom_ops_pytorch_exporter
to onnxruntime/python/tools/pytorch_export_contrib_ops.
Rationale for inclusion in inference builds:
This code is potentially useful for anyone using ORT, not just training.
Rationale for new name:
"Contrib op" is the nomenclature used within ORT to refer to the set of
ops that are not in the standard op set but are included by default with
ORT. This is more specific than "custom op", which is what the PyTorch
exporter uses to refer to any non-standard op.
Step 1 of addressing #8818. After this is merged I will update the docs; a usage sketch follows below.
* Enable test_pytorch_export_contrib_ops.py in CI
Fixes AB#1342330
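The usage sketch mentioned above (module path per the rename; register() is assumed to be the module's public entry point):

```python
import torch
from onnxruntime.tools import pytorch_export_contrib_ops

# Register symbolics for ORT contrib ops with the PyTorch exporter,
# then export as usual.
pytorch_export_contrib_ops.register()

model = torch.nn.Linear(4, 4)  # any torch.nn.Module
torch.onnx.export(model, torch.randn(1, 4), "model.onnx")
```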
* Use PROTOBUF_LIB instead of protobuf::libprotobuf
* Moved setdlopenflags to _pybind_state.py
* Copy the generated _pybind_state.py to required location for Windows.
* Expose symbols in onnx and protobuf namespaces in python when building with --enable_external_custom_op_schemas
* Add external onnx and protobuf files to wheel
* Added an example to demonstrate external custom ops use-case
* Added a Linux build pipeline to test external custom ops
* separate the training python module; share the execution provider instance
* fix build break
* fix cuda test crash; reorg the python module code base
* set correct env
* use provider customized hash func
* fix build break
* fix rocm break
* use const ref in argument
* rename the file
* move hash func to training module
* integrate eager mode source code; build with cmake and integrate the python test
* Adding the python path for importing libraries in the Eager mode
* fix clang break; check if training and python are enabled
* handling the linking of torch libraries across multiple platforms
* merge and fix the naming
* add build instruction
Co-authored-by: Abhishek Jindal <abjindal@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: ajindal1 <abjindal@microsoft.com>
* enable shared lib test on linux
* fix build break
* add onnx dependency
* add rpath
* skip the test for linux training
* set ONNX_ML definition
* install training python dependency
* update
* fix format; add eigen include folder
* fix format
* skip amd build
* enable shared provider on training
* fix comments in pr
Co-authored-by: Ubuntu <chenta@chenta-orttraining-cpu.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
Co-authored-by: Changming Sun <chasun@microsoft.com>
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement SparseDenseMatMul for CSR and COO and test it.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and regenerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
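A sketch of the Python surface this work adds; the names follow the onnxruntime Python bindings (OrtDevice, SparseTensor, OrtValue), but treat the exact signatures as assumptions:

```python
import numpy as np
import onnxruntime as ort

device = ort.OrtDevice.make("cpu", 0)

# A 2x2 dense tensor with two non-zeros, expressed as COO with 2-D indices.
values = np.array([1.0, 2.0], dtype=np.float32)
indices = np.array([[0, 1], [1, 0]], dtype=np.int64)
dense_shape = np.array([2, 2], dtype=np.int64)

sparse = ort.SparseTensor.sparse_coo_from_numpy(dense_shape, values, indices, device)
ort_value = ort.OrtValue.ort_value_from_sparse_tensor(sparse)
# ort_value can then be fed to InferenceSession.run_with_ort_values;
# sparse.to_cuda(ort.OrtDevice.make("cuda", 0)) moves it to GPU.
```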