* register custom symbolic for einsum
* bugfix for the case that needs a permute at the end
* refactor
* refactor equation parser
* support new case, use ReduceProd
* optimize perf and graph
* remove some Gather nodes
* add more unit tests, fix Gemm transpose fusion
When the pattern Sum(Gemm(A, B), C) exists, we can convert it to
Gemm(A, B, C), assuming that the output of the original Gemm is
not used elsewhere and that this change does not break broadcasting.
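
The equivalence behind this fusion can be illustrated with a minimal pure-Python sketch. It assumes alpha = beta = 1, no transposes, and a C that is a length-N row broadcast over the rows of the output; the helper names `matmul`, `gemm`, and `add_rowwise` are illustrative only and are not part of the optimizer.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_rowwise(y, c):
    """Stand-in for Sum(Y, C) where C is a length-N row broadcast over the rows of Y."""
    return [[v + bias for v, bias in zip(row, c)] for row in y]


def gemm(a, b, c=None):
    """Simplified Gemm (assumed alpha = beta = 1, no transposes): A @ B, plus C if given."""
    y = matmul(a, b)
    return y if c is None else add_rowwise(y, c)


a = [[1, 2, 3], [4, 5, 6]]      # shape (2, 3)
b = [[1, 0], [0, 1], [1, 1]]    # shape (3, 2)
c = [10, 20]                    # shape (2,), broadcast to (2, 2)

unfused = add_rowwise(gemm(a, b), c)  # Sum(Gemm(A, B), C)
fused = gemm(a, b, c)                 # Gemm(A, B, C)
print(unfused == fused)
```

The sketch covers only the case where C already broadcasts to the Gemm output shape. If C had more dimensions than the output, Sum would produce a larger result than Gemm could, which is the kind of broadcasting breakage the condition above rules out.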
* remove default python ep registration. raise an exception if providers are not explicitly set when there are available providers
* temporarily disable exception
* fix python tests
* explicitly set CUDAProvider for python iobinding tests
* explicitly set providers param for InferenceSession()
* onnxrt
* raise ValueError if providers are not explicitly set when creating InferenceSession
* add required providers param
* explicitly set providers
* typo
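
A rough sketch of the check these commits describe is shown below. `resolve_providers` is a hypothetical stand-in, not the onnxruntime implementation. The exact trigger (more than one available provider and none requested) and the rejection of unknown provider names are both assumptions. In user code the corresponding change is passing `providers=[...]` to `InferenceSession(...)`.

```python
def resolve_providers(requested, available):
    """Hypothetical helper: decide which execution providers a session should use.

    Assumption: an explicit list is required only when more than one
    provider is available; with a single provider there is no ambiguity.
    """
    if requested is None:
        if len(available) > 1:
            raise ValueError(
                "providers must be set explicitly; available: " + ", ".join(available)
            )
        return list(available)
    # Assumption: reject names that this build does not offer.
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError("unknown providers: " + ", ".join(unknown))
    return list(requested)


available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
chosen = resolve_providers(["CUDAExecutionProvider"], available)

try:
    resolve_providers(None, available)
    raised = False
except ValueError:
    raised = True
print(chosen, raised)
```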
Add support for saving graph runtime optimizations in an ORT format model. The idea is to allow some optimizations to be "replayed" at runtime in a minimal build. The replaying part will be in a future change.
* Add source for conv_grad
* Add sources for ROCm EP.
* Transliterate sources for conv_grad for ROCm EP.
* Add conv_grad to ROCm EP
Add conv_grad to ROCm execution
provider.
* Update ROCm EP ConvGrad
Update ConvGrad for the ROCm EP to match other EP
changes and fix a build issue.
* optimize python overhead of _post_amp_backward
* overwrite apex amp's zero_grad for faster implementation
* move unscale_fp16_grads_into_fp32_grads into C++ impl
* improve the efficiency further, reducing 3.5ms to 1.7ms for unilm.
* unilm 1.7ms to 338us: 1) optimize the Python list <==> std::vector copy, 2) launch the kernels as soon as num_elem reaches the threshold. This helps reduce CUDA idle time.
* refine the logic a bit after validating
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
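
The threshold-based launch mentioned above can be pictured with the following sketch, which groups tensors into batches and closes a batch as soon as its accumulated element count reaches a threshold. The function name `batch_by_threshold` and the flush policy are assumptions for illustration; the actual C++ implementation may differ.

```python
def batch_by_threshold(num_elems, threshold):
    """Group tensor indices into launch batches.

    A batch is closed (a kernel launch would be issued) as soon as the
    accumulated element count reaches `threshold`, instead of waiting
    until every tensor has been collected.
    """
    batches, current, count = [], [], 0
    for index, n in enumerate(num_elems):
        current.append(index)
        count += n
        if count >= threshold:
            batches.append(current)
            current, count = [], 0
    if current:  # leftover tensors below the threshold still need a launch
        batches.append(current)
    return batches


launches = batch_by_threshold([4, 4, 10, 1, 1], threshold=8)
print(launches)
```

Launching early in this way lets the device start work while the host is still walking the remaining tensors.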
* Raise an exception when a duplicated autograd.Function name is detected
* reorder a bit for a little bit better perf
* fix a bug in previous PR :(
* correct the error message a bit
* re-hipify all rocm EP sources
* fix all other files affected by re-hipify
* add cuda_provider_factory.h to amd_hipify.py
* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration
* Fix ReduceConsts template specialization introduced in #9101.
Fixes the error when building for ROCm 4.3.1:
error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)
* fix flake8 error in amd_hipify.py
* speed up hipify with concurrent.futures
* flake8 fix in amd_hipify.py
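
The concurrent.futures speed-up can be sketched as below, assuming each file can be converted independently. `hipify_text` is a toy stand-in that renames a few CUDA identifiers on in-memory strings; it does not reproduce what amd_hipify.py actually does to each file, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# A few real CUDA -> HIP renames, used only to make the toy conversion concrete.
RENAMES = (
    ("cudaMalloc", "hipMalloc"),
    ("cudaFree", "hipFree"),
    ("cudaStream_t", "hipStream_t"),
)


def hipify_text(source):
    """Toy per-file conversion: apply each rename to the source text."""
    for cuda_name, hip_name in RENAMES:
        source = source.replace(cuda_name, hip_name)
    return source


def hipify_all(sources, max_workers=4):
    """Convert every file in `sources` (name -> text) using a thread pool."""
    names = list(sources)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip() pairs them correctly.
        converted = pool.map(hipify_text, (sources[name] for name in names))
        return dict(zip(names, converted))


result = hipify_all({
    "a.cu": "cudaMalloc(&p, n); cudaFree(p);",
    "b.cu": "cudaStream_t stream;",
})
print(result)
```

Threads fit when the per-file work is dominated by I/O or by waiting on an external tool; for CPU-bound pure-Python conversion a `ProcessPoolExecutor` would be the usual alternative.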
* removing warnings which are causing errors from torch and changing flags for Windows
* adding MKL library resolution and comments
* cleaning up the code
* fixing onnxruntime_python file for windows build
* fix the include order to avoid the python_d.lib issue on win debug build
* changes for warnings, typos and other comments
* merge conflict
* adding fix for mkl library error
* Revert "adding fix for mkl library error"
This reverts commit 73b87c73c2.
* fix for dll path for windows
* typo for dll path
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* resolve the provider options before creating the training session in orttrainer
* Update orttraining/orttraining/python/orttraining_pybind_common.h
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
* support clearing the training ep instance pool
* fix status error
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>