* Fix a bug and add code to profile memory
1. Compile Send/Recv again (currently broken because of
HOROVOD refactor).
2. Add code to print out initializer allocation size and
activation memory size.
* Address comments
* Split memory counts per locations
* Fix a metric
update PyTorch Bert SquAD notebooks to use onnxruntim-tools and update usage of intra_op_num_threads.
rename python files according to coding style
Fix change_input_to_int32.
update keras notebook to copy script from rel-1.3.0 branch (Will update them later)
* Add ONNX postpasses
* add flag + add bert test from onnx file
* address PR comments
* fix typo
* fix rebase
* address comments
* Fix test failures
* add new pass for expand for new pt version, add comments
* fix rebase
Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ORT on CUDA 11
1. Seperate HOROVOD and MPI
2. Seperate NCCL from HOROVOD in CMakeLists.txt
2. Remove dependency on external cub
3. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm30, sm50 are deprecated in CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* fix a links to Engineering Design and API in CONTRIBUTING.md
* fix additional links in CONTRIBUTING.md
* correct the link to the public API in CONTRIBUTING.md
Co-authored-by: Emad El-Haraty <emad.elharaty@limebike.com>
Search/replace of the pattern "const auto foo = tensor.Shape()" to "const auto& foo = tensor.Shape()" to avoid unneeded copies at runtime and reduce code size (8KB drop for onnxruntime.dll). Remove some unnecessary header includes.
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build
Add more variants of MlasGemm that do a u8x8 GEMM with the output type as float. This fuses the common sequence of MatMulInteger + Cast + Mul(OutputScale) + optional Add(BiasVector).
* Introduce DynamicQuantizeMatMul
It fuses DynamicQuantizeLinear, MatMul and following cast, multiplier. It gets float in and float out for quantized matmul. We have a MLAS kernel in implementation for this op.
Modify gradle build so artifactID has _gpu for GPU builds.
Pass USE_CUDA flag on CUDA build
Adjust publishing pipelines to extract POM from a correct path.
Co-Authored-By: @Craigacp
Disable nuphar large model test, because it takes too long(40+ minutes), while the default cpu provider takes about 5 minutes. After this change, we still keep a lot of other nuphar model tests, I think that should be enough.