* initial update from 11.1 to 11.4
* change 11.4.1 to 11.4.0
* adjusting to match nvidia/cuda image tags
* adjusting to match nvidia/cuda image tags centos7
* correction to 11.4.0
* correction to 11.4.0
* update to cuda 11.4
* change training back to 11.1
* change training back to 11.1
* point to correct nvcr.io/nvidia/cuda 11.4.1 image
* change centos8 to centos7
* correct cudnn path
* Update linux-gpu-ci-pipeline.yml for Azure Pipelines
* Update c-api-noopenmp-packaging-pipelines.yml
* need to resolve centos images but remove space and change to 11.4
* Update linux-gpu-ci-pipeline.yml
* add cudnn to docker image
* bump devtoolset to 10
* revert cuda 11.4 change to setup_env_trt
* orttraining back to 11.1
* use nvcr.io
* Fix previous change back to cuda 11.1
* update cudnn path
* use cudnn image (revert if failure)
* Fetching frontier tensors to frontend
* Move before session initialize call
* Fetch tensor and add to cache
* Rest of the changes for using cache
* Review comments
* Review changes
* Review comments
* switch to shared_ptr
* Fix bug after rebase
* FE docstring change
Create a new M8 loop processing A[8x8] B[8x8] per iteration.
Avoid saving registers on paths that are not needed.
Adjusted the M2 and M1 loops, using more registers to relax the loop-carried dependencies.
Nearly 7% improvement observed on Surface Pro X 2 with model ssd_mobilenet_v2_300
About 4.5% improvement on resnet50 on Surface Pro X 2.
Add IsSparseTensor
Add CreateSparseTensor
Add utilities and test fully sparse instantiation
Fully sparse blocksparse
Add test and docs for fully sparse tensor instantiation
Rework creation API
Use API
Non string API
Retrofit of existing String API
Add tests
Add documentation
Address build issues (Winml pending)
Add inference test
Bump binary size
Add ifdef DISABLE CONTRIB
* dnnl ep rework
rework DnnlTensor,DnnlNode,DnnlSubgraph to support arbitrary graph topology and tensor data types
rework GetCapability to claim nodes in graph greedily from node topological ordering and delay creation of DnnlSubgraph until Compile
rework compile to have DnnlSubgraphPrimitive as the object to handle primitive creation and execution
instead of thread local primitive pool which duplicates intermediate memory allocated by the EP across threads
DnnlSubgraphPrimitive provides helpers for functions common to each dnnl primitive builder and becomes the central place to store input, output, intermediate, and initializer memories, etc.
it provides functions to obtain input memories with automatic reordering/reshaping and moving between engines
it provides interfaces to add a primitive, set the output memory for a single node, etc.
add CONCURRENT_EXEC compile flag for dnnl library as without it, convolution primitive cannot be created and executed on different threads
enable unit tests to run on dnnl ep as well if built with dnnl ep
add dnnl ep support for MatMulInteger
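One possible reading of the greedy claiming step in the GetCapability rework above, as a minimal Python sketch rather than the EP's C++. The function name `claim_greedily`, the string node labels, and the choice to end a group at the first unsupported node are all assumptions for illustration; none of them is taken from the DNNL EP source.

```python
def claim_greedily(topo_nodes, is_supported):
    """Walk nodes once in topological order and group supported ones.

    Hypothetical illustration: each maximal run of supported nodes becomes
    one group; building the real subgraph object would be deferred to a
    later compile step.
    """
    groups, current = [], []
    for node in topo_nodes:
        if is_supported(node):
            current.append(node)      # claim the node greedily
        elif current:
            groups.append(current)    # an unsupported node ends the run
            current = []
    if current:
        groups.append(current)
    return groups


# Usage: "Foo" stands in for an op the EP cannot handle.
groups = claim_greedily(["Conv", "Relu", "Foo", "MatMul"],
                        lambda op: op != "Foo")
```

Here `groups` holds two claimed runs, one on each side of the unsupported node.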
* Add Relu to the DNNL refactor
Signed-off-by: George Nash <george.nash@intel.com>
* Add Convolution op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add Pooling ops to the DNNL rework
This adds the following ops:
- AveragePool
- GlobalAveragePool
- GlobalMaxPool
- MaxPool
Note: Pooling with dilation is not yet supported.
Note: GlobalLpPool, LpPool, MaxRoiPool, and MaxUnpool are not supported yet.
Signed-off-by: George Nash <george.nash@intel.com>
* Add Sum op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add ConvGrad op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add MaxPoolGrad and AveragePoolGrad ops to DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Added LRN operator to the refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added ReduceMean DNNL op to the refactor code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Softmax DNNL op for the refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added BatchNorm DNNL op inference-only for refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Binary Ops to DNNL rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added ReluGrad to DNNL Rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Update OneDNN tag to v2.3
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added support for memory up to dim size 12
This fixes the CI test cases that contain binary ops with input dim
size > 5
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Prevent claiming support for float16 and bfloat16 when only float is supported
The string.find call used was causing the code to claim support
for float16 and bfloat16 when we only supported float. We now explicitly
check for the data type, or for the data type with a 7-letter prefix,
i.e. prefixed with "tensor("
Signed-off-by: George Nash <george.nash@intel.com>
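The substring pitfall described in the entry above can be illustrated with a minimal sketch. It is written in Python rather than the EP's C++, and the names `SUPPORTED`, `claims_by_substring`, and `claims_by_exact_match` are hypothetical, not taken from the DNNL EP source.

```python
# Hypothetical illustration; not the DNNL EP's actual code.
SUPPORTED = {"float"}
PREFIX = "tensor("  # the 7-letter prefix mentioned above


def claims_by_substring(type_str):
    # Buggy: "float" is a substring of "tensor(float16)" and
    # "tensor(bfloat16)", so both are wrongly reported as supported.
    return any(type_str.find(t) != -1 for t in SUPPORTED)


def claims_by_exact_match(type_str):
    # Accept only the bare data type or exactly "tensor(<type>)".
    return any(type_str == t or type_str == f"{PREFIX}{t})"
               for t in SUPPORTED)
```

With these definitions, `claims_by_substring("tensor(float16)")` is true while `claims_by_exact_match("tensor(float16)")` is false, and both accept `"tensor(float)"`.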
* Disable uint8 mul and div, improve type conversion
Disable mul_uint8 and div_uint8 test cases as they use modulo for
overflow handling while onednn uses saturation
Improve type conversion by using an enum instead of string comparison, as well
as adding more types
Signed-off-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
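The uint8 overflow mismatch noted in the "Disable uint8 mul and div" entry above comes down to two different conventions for results that do not fit in 8 bits. A minimal sketch in plain Python follows; the function names are hypothetical and the code only illustrates the two conventions, not either library's implementation.

```python
UINT8_MAX = 255


def mul_uint8_modulo(a, b):
    # Wraparound: keep only the low 8 bits of the product.
    return (a * b) % (UINT8_MAX + 1)


def mul_uint8_saturate(a, b):
    # Saturation: clamp the product to the largest uint8 value.
    return min(a * b, UINT8_MAX)


# 16 * 20 = 320 does not fit in uint8, so the two conventions disagree.
wrapped = mul_uint8_modulo(16, 20)
clamped = mul_uint8_saturate(16, 20)
```

For in-range products the two functions agree, which is why only overflowing test inputs expose the difference.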
Merge the CPU and GPU nuget pipelines. The old GPU nuget pipeline will be used only for DML.
TODO: the resulting GPU package contains PDB files for some of the DLLs, but not all. This is because the PDB files were not copied when the CUDA EP was refactored into pluggable DLLs. They cannot be added now: the package is already 220MB, and adding the missing PDB files would push it over the limit, since nuget.org doesn't accept packages larger than 250MB.
* Optimize some LSTM gate computation. Remove unneeded string constructions.
* Change GCC optimization flags for compute-bound logic in rnn_helpers
* Better qgemm for M=1
* Some improvements on AVX512
* Add condition to limit GCC-related macros
* Correct QGemm assembly for M=1 AVX2 optimization to pass mlas_test.
* Fix rnn_helper build issue for wasm.
* Better asm code according to feedback.
* Remove customized vectorize and unroll option for GCC.
Use restrict on some functions to help GCC vectorize them correctly.
Rewrite clip_add_bias() to let GCC vectorize it correctly.
* Better restrict semantics for merge_lstm_gates_to_memory() by adding in_place().
Add MSVC __restrict to the clip_add_bias() method so that it vectorizes correctly.
* Force a CI restart, as it is stuck on the onnxruntime-python-checks-ci-pipeline, which cannot be restarted.
Bug #31652854 also repros on Qualcomm Adreno (down to the exact same pixel). This change disables this model test for Qualcomm, in addition to the existing disablement for Intel.
By default, do not enable subgraph quantization, to keep it consistent with existing behavior.
It should be OK to enable it in quantize_dynamic mode via extra_options.
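A minimal sketch of opting in, assuming `EnableSubgraph` is the `extra_options` key that controls this behavior (an assumption; the note above does not name the key). The helper `dynamic_quant_options` is hypothetical and the model paths are placeholders. The `quantize_dynamic` call is left commented out because it needs onnxruntime installed and a real model file.

```python
def dynamic_quant_options(enable_subgraph=False):
    # The default mirrors the note above: subgraph quantization stays off.
    # "EnableSubgraph" is assumed to be the relevant extra_options key.
    return {"EnableSubgraph": enable_subgraph}


# Placeholder paths; uncomment with onnxruntime installed:
# from onnxruntime.quantization import quantize_dynamic
# quantize_dynamic("model.onnx", "model.quant.onnx",
#                  extra_options=dynamic_quant_options(enable_subgraph=True))
```

Keeping the default at `False` means existing callers see no change unless they pass the option explicitly.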
* changes
* tile grad unsqueeze fix for opset 13
* clean up
* remove bool support for opset 2 to 12 for Pad as it is not supported.
* Copy OperatorKernels.md from artifacts of Windows CI build.