Use the more robust implementation from DML's Algorithms.h.
```
engine\lotus\onnxruntime\core\providers\dml\OperatorAuthorHelper\Common.h(27): warning C4756: overflow in constant arithmetic
```
- A few more DML operators now support the INT64 data type directly.
- Operators such as Padding and ElementWise_Clip now have new DML structures that support the int64 data type for scalar values.
Related work items: #33883294
Now that DML has int64 support directly, register the related operators for uint64/int64 (rather than the hack in the ORT DML EP with doubled strides).
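For context, the doubled-strides workaround can be pictured roughly as below. This is only an illustrative sketch, not the DML EP code: it assumes little-endian storage and values that fit in 32 bits, and the function name is hypothetical.

```python
import struct

def read_int64_as_int32_with_doubled_strides(values):
    """Hypothetical illustration of the emulation: view an int64 buffer as
    int32 slots and step through it with stride 2, so only the low 32 bits
    of each element are read."""
    raw = struct.pack(f"<{len(values)}q", *values)          # little-endian int64 buffer
    as_int32 = struct.unpack(f"<{2 * len(values)}i", raw)   # same bytes viewed as int32
    return list(as_int32[::2])                              # doubled stride: low halves only
```

Values outside the 32-bit range are silently truncated under this scheme, which is one reason direct int64 registration is preferable.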
## Remaining work
- Not implemented in DML: CumSum, Range, MaxPool/MaxUnpool, TopK, ReduceProd/Sum/SumSquare/L1
- Implemented in DML but need DML EP kernel work: Clip, Pad, Neg, Range, ConstantOfShape
```
te.exe OnnxConformanceTests.dll
Summary: Total=4454, Passed=4147, Failed=0, Blocked=0, Not Run=0, Skipped=307
```
Corresponding PR: https://microsoft.visualstudio.com/WindowsAI/_git/WindowsAI/pullrequest/6486426
Related work items: #28761231, #33883294
* enable shared lib test on linux
* fix build break
* add onnx dependency
* add rpath
* skip the test for linux training
* set ONNX_ML definition
* install training python dependency
* update
* fix format; add eigen include folder
* fix format
* skip amd build
* enable shared provider on training
* fix comments in pr
Co-authored-by: Ubuntu <chenta@chenta-orttraining-cpu.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
Co-authored-by: Changming Sun <chasun@microsoft.com>
* correct batchnorm replacement output order;
remove bn replacement in grad graph builder
* update op defs and kernel class
* implement batch norm internal and grad.
* change saved_var into saved_inv_std
* cuda test case: bn internal
* remove redundant include
* fix comment; add support and UT for 1d input.
* exclude batch_norm_internal in amd_hipify
* run BNInternal UT for CUDA only
* fix CI error
* fix comment errors
* fix error
* add comment for inconsistency with cudnnBN doc
* additional comments for cudnnBN inconsistency
QGemm takes in quantized A, B, C, and the quantization parameters of output Y, of which C and the quantization parameters of Y are optional. Its output can be quantized or full precision, depending on whether the quantization parameters of Y exist. If the quantization parameters of Y are provided, the output is requantized; otherwise it is full precision.
Compared with QLinearMatMul and MatMulInteger, QGemm supports the transpose, alpha and beta attributes.
The formula for quantized GEMM is:
Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which,
C_int32 is quantized with formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)
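
The formula above can be sketched in plain Python for the full-precision output case. This is a reference-style illustration only, not the kernel: the function name is hypothetical, transpose is omitted, C is assumed to be a per-column float bias that is quantized inside the function, and rounding uses Python's `round`.

```python
def qgemm_reference(a_q, b_q, scale_a, zp_a, scale_b, zp_b,
                    c=None, alpha=1.0, beta=1.0):
    """Hypothetical sketch of Y = alpha*scale_a*scale_b*((A-zp_a)(B-zp_b) + C_int32)."""
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    scale = alpha * scale_a * scale_b
    y = []
    for i in range(m):
        row = []
        for j in range(n):
            # Integer accumulation of the zero-point-corrected product.
            acc = sum((a_q[i][p] - zp_a) * (b_q[p][j] - zp_b) for p in range(k))
            if c is not None:
                # C_int32 = (beta * C) / (alpha * scale_a * scale_b); per-column bias assumed.
                acc += round(beta * c[j] / scale)
            row.append(scale * acc)
        y.append(row)
    return y
```

For example, `qgemm_reference([[1, 2]], [[3], [4]], 0.5, 0, 0.5, 0)` accumulates 11 in integer space and scales it by 0.25.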
*) use context buffer allocator, remove init cost of vector
*) using lookup table to dequantize large input
*) fall back to global average pool if it is
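
The lookup-table dequantization idea can be sketched as follows. This assumes uint8 input and per-tensor scale and zero point; the function names are hypothetical and do not come from the change itself.

```python
def build_dequant_table(scale, zero_point):
    """Precompute the real value for each of the 256 possible uint8 codes."""
    return [scale * (q - zero_point) for q in range(256)]

def dequantize_with_table(data, table):
    """Dequantize by table lookup instead of recomputing scale * (q - zp) per element."""
    return [table[q] for q in data]
```

The table costs 256 multiplications up front, so it only pays off when the input is large relative to the table.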
Adds a StridedCopy function that implements a copy from one strided tensor to another.
This parallelizes the Concat operator, and can also be used in the future to parallelize many other data movement operators (e.g. Transpose, Split, etc.).
This operation is also required for the proposed data layout extensions to ORT.
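
A minimal sketch of what a strided copy does, and how Concat can be expressed with it. This is not the ORT signature: the function name, the argument order, and the offset parameters are assumptions for illustration, and strides are counted in elements.

```python
from itertools import product

def strided_copy(dst, dst_strides, src, src_strides, shape,
                 dst_offset=0, src_offset=0):
    """Copy a logical tensor of `shape` between flat buffers, each with its own strides."""
    for index in product(*(range(d) for d in shape)):
        src_pos = src_offset + sum(i * s for i, s in zip(index, src_strides))
        dst_pos = dst_offset + sum(i * s for i, s in zip(index, dst_strides))
        dst[dst_pos] = src[src_pos]
    return dst

# Concat of two 2x2 inputs along axis 1: each input is copied into a
# disjoint column range of the 2x4 output, so the copies are independent.
a, b, out = [1, 2, 3, 4], [5, 6, 7, 8], [0] * 8
strided_copy(out, (4, 1), a, (2, 1), (2, 2))
strided_copy(out, (4, 1), b, (2, 1), (2, 2), dst_offset=2)
```

Because each input writes to a disjoint region of the output, the per-input copies (and chunks within them) can be handed to separate threads. Swapping the source strides gives a transpose with the same routine.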
* Do not copy the model_data when session is started by CreateSessionFromArray
* Add config option for disabling copy model bytes
* Add one additional test
* Address CR comments
* attention fusion kernel refactored
* consider the case of none in add_qk
* variable added to check for pre-pack weights
* added a comment to PrePack()
* Optimized prepack and try to free the weights
* making comment sound better
* fixing a bug with optimizer.py
* commented out changes to be done
* removed comments
* make the private fn() private
* fix build
* making clean up fn static
* backed out optimizer tool change, needs more looking into
* freeze/fastpath support
* more comments on _fast_path
* per comments
* minor fix
* IntFlag improve
* address comments
Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* atenop for inference
* assert if dtype mismatch
* atenop config in frontend
* fix orttrainer test
* gradient def not only for ATenOp
* bugfix
* fix gradient input shape and type issue
* fix after merge master
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement and test SparseDenseMatMul for CSR and COO.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and regenerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
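
For readers unfamiliar with the two formats, here is a small sketch of a COO to CSR conversion and a CSR matrix-vector product. It is pure Python for illustration and does not use the ORT SparseTensor API; the function names and the 2-D index layout (separate row and column lists) are assumptions.

```python
def coo_to_csr(num_rows, rows, cols, values):
    """Convert COO triplets to CSR arrays (data, column indices, row pointers)."""
    order = sorted(range(len(values)), key=lambda i: (rows[i], cols[i]))
    indptr = [0] * (num_rows + 1)
    for r in rows:
        indptr[r + 1] += 1            # count entries per row
    for r in range(num_rows):
        indptr[r + 1] += indptr[r]    # prefix sum gives row start offsets
    indices = [cols[i] for i in order]
    data = [values[i] for i in order]
    return data, indices, indptr

def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR matrix by a dense vector."""
    return [sum(data[p] * x[indices[p]] for p in range(indptr[r], indptr[r + 1]))
            for r in range(len(indptr) - 1)]
```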