* Add ONNX postpasses
* add flag + add bert test from onnx file
* address PR comments
* fix typo
* fix rebase
* address comments
* Fix test failures
* add a new pass for Expand for the new PyTorch version; add comments
* fix rebase
Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ORT on CUDA 11
1. Separate HOROVOD and MPI
2. Separate NCCL from HOROVOD in CMakeLists.txt
3. Remove dependency on external cub
4. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm30, sm50 are deprecated in CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* fix links to Engineering Design and API in CONTRIBUTING.md
* fix additional links in CONTRIBUTING.md
* correct the link to the public API in CONTRIBUTING.md
Co-authored-by: Emad El-Haraty <emad.elharaty@limebike.com>
Search/replace of the pattern "const auto foo = tensor.Shape()" to "const auto& foo = tensor.Shape()" to avoid unneeded copies at runtime and reduce code size (8KB drop for onnxruntime.dll). Remove some unnecessary header includes.
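A minimal sketch of why the `const auto&` form avoids a copy. The `Shape` and `Tensor` types and the `g_shape_copies` counter below are hypothetical stand-ins written for illustration, not the onnxruntime classes; the only assumption carried over from the change is that the accessor returns a const reference.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical counter: incremented every time a Shape is copy-constructed.
int g_shape_copies = 0;

// Hypothetical stand-in for a tensor shape (not the onnxruntime type).
struct Shape {
  std::vector<int64_t> dims;
  explicit Shape(std::vector<int64_t> d) : dims(std::move(d)) {}
  Shape(const Shape& other) : dims(other.dims) { ++g_shape_copies; }
};

// Hypothetical stand-in for a tensor whose accessor returns a const reference.
class Tensor {
 public:
  explicit Tensor(std::vector<int64_t> dims) : shape_(std::move(dims)) {}
  const Shape& GetShape() const { return shape_; }

 private:
  Shape shape_;
};

std::size_t RankByValue(const Tensor& t) {
  // `auto` drops the reference, so this copy-constructs a new Shape.
  const auto shape = t.GetShape();
  return shape.dims.size();
}

std::size_t RankByRef(const Tensor& t) {
  // `auto&` binds to the Shape owned by the tensor; nothing is copied.
  const auto& shape = t.GetShape();
  return shape.dims.size();
}
```

Calling `RankByValue` increments the counter once per call, while `RankByRef` leaves it unchanged.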
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build
Add more variants of MlasGemm that do a u8x8 GEMM with the output type as float. This fuses the common sequence of MatMulInteger + Cast + Mul(OutputScale) + optional Add(BiasVector).
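The fused computation can be sketched as a naive reference loop. This is not the MLAS API: the function name `GemmU8U8ToFloat`, its signature, the row-major layout, the use of `uint8_t` for both operands, and the zero-point parameters are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference for the fused sequence
//   MatMulInteger -> Cast(float) -> Mul(scale) -> optional Add(bias).
// a is M x K, b is K x N, both row-major; bias (if non-null) has N entries.
std::vector<float> GemmU8U8ToFloat(const std::vector<uint8_t>& a,
                                   const std::vector<uint8_t>& b,
                                   std::size_t M, std::size_t K, std::size_t N,
                                   uint8_t a_zero_point, uint8_t b_zero_point,
                                   float scale, const float* bias) {
  assert(a.size() == M * K);
  assert(b.size() == K * N);
  std::vector<float> c(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      // Integer accumulation, as MatMulInteger would produce.
      int32_t acc = 0;
      for (std::size_t k = 0; k < K; ++k) {
        acc += (static_cast<int32_t>(a[m * K + k]) - a_zero_point) *
               (static_cast<int32_t>(b[k * N + n]) - b_zero_point);
      }
      // Cast + Mul(OutputScale), done in the same pass as the accumulation.
      float value = static_cast<float>(acc) * scale;
      // Optional Add(BiasVector), broadcast across rows.
      if (bias != nullptr) {
        value += bias[n];
      }
      c[m * N + n] = value;
    }
  }
  return c;
}
```

Writing the float result directly means the intermediate int32 matrix is never materialized, which is the point of fusing the sequence.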
* Introduce DynamicQuantizeMatMul
It fuses DynamicQuantizeLinear, MatMul, and the following Cast and Mul. It takes float input and produces float output for a quantized matmul. An MLAS kernel implements this op.
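A rough reference sketch of the float-in, float-out behaviour described above. It assumes per-tensor uint8 quantization of the left operand following the ONNX DynamicQuantizeLinear formula, and a right operand that is already quantized with its own scale and zero point. The names `QuantizedU8`, `DynamicQuantize`, and `DynamicQuantizeMatMulRef` are hypothetical and are not the operator's actual kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for a dynamically quantized tensor.
struct QuantizedU8 {
  std::vector<uint8_t> data;
  float scale;
  uint8_t zero_point;
};

// Per-tensor uint8 quantization; the range is widened to include 0.
QuantizedU8 DynamicQuantize(const std::vector<float>& x) {
  float lo = 0.0f, hi = 0.0f;
  for (float v : x) {
    lo = std::min(lo, v);
    hi = std::max(hi, v);
  }
  QuantizedU8 q;
  q.scale = (hi - lo) / 255.0f;
  if (q.scale == 0.0f) q.scale = 1.0f;  // assumption: all-zero input, avoid dividing by zero
  const float zp = std::nearbyint(-lo / q.scale);  // rounds half to even by default
  q.zero_point = static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, zp)));
  q.data.reserve(x.size());
  for (float v : x) {
    const float r = std::nearbyint(v / q.scale) + q.zero_point;
    q.data.push_back(static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, r))));
  }
  return q;
}

// a is float M x K, b is pre-quantized uint8 K x N, both row-major.
std::vector<float> DynamicQuantizeMatMulRef(const std::vector<float>& a,
                                            const std::vector<uint8_t>& b,
                                            std::size_t M, std::size_t K, std::size_t N,
                                            float b_scale, uint8_t b_zero_point) {
  assert(a.size() == M * K);
  assert(b.size() == K * N);
  const QuantizedU8 qa = DynamicQuantize(a);      // DynamicQuantizeLinear step
  const float out_scale = qa.scale * b_scale;     // the fused multiplier
  std::vector<float> c(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      int32_t acc = 0;                            // integer matmul step
      for (std::size_t k = 0; k < K; ++k) {
        acc += (static_cast<int32_t>(qa.data[m * K + k]) - qa.zero_point) *
               (static_cast<int32_t>(b[k * N + n]) - b_zero_point);
      }
      c[m * N + n] = static_cast<float>(acc) * out_scale;  // cast + multiply
    }
  }
  return c;
}
```

The caller only ever sees float tensors on the activation side; quantization parameters for the left operand are derived at run time from its values.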
Modify the Gradle build so the artifactId has a _gpu suffix for GPU builds.
Pass USE_CUDA flag on CUDA build
Adjust publishing pipelines to extract POM from a correct path.
Co-Authored-By: @Craigacp
Disable the Nuphar large model test because it takes too long (40+ minutes), whereas the default CPU provider takes about 5 minutes. After this change we still keep many other Nuphar model tests, which I think should be enough.
1. Enlarge the read buffer size further so that our code can run even faster. TODO: apply similar changes to Python and some other language bindings.
2. Add coreml_VGG16_ImageNet to the test exclusion set of x86_32. It is not a new model but previously we didn't run the test against x86_32.
* try mac pipeline
* fix path separator
* copy prebuilds folder
* split esrp yaml for win/mac
* disable mac signing temporarily
* add linux
* fix indent
* add nodetool in linux
* add nodetool in win-ci-2019
* replace linux build by custom docker scripts
* use manylinux as node 12.16 is not working on centos6
* try ubuntu
* loosen timeout for test case - multiple runs calls