### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"
The PR also adds new test cases for mlas and ort.
### Motivation and Context
This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
### Description
<!-- Describe your changes. -->
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We found that the current LLM model performance is far below our
expectations. Here is some performance data collected on Mistral-7B
model with Xeon-8480:
8 threads | prompt length=32 past_len=32 | prompt length=1 past_len=32
-- | -- | --
ORT-main | 1220ms | 263ms
Neural-speed | 564ms | 87ms
ORT-this PR|597ms|120ms
Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33ms(87ms vs. 120ms) latency gap between the two frameworks.
Through some statistics analysis, the summary latency of `MatMulNBits`
is 86.7ms
The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So
other OPs introduce an extra 30ms latency.
The performance of MatMulNBits in this PR meets our expectations.
### Remain Issues
1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses
TaskGranularityFactor to scale its number of threads. This is not
expected in our code design. It may slow down the hybrid CPU performance
by 30~40%.
2. Prepack uses a single thread which is very slow to init a session.
3. MatMulNBits with zero points will fall through to COMP_FP32 even
accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is
not optimized for now. It will be updated in the future. So, for an int4
model with zero points, whether the accuracy_level is 0 or 4 will be no
difference.
Previously building webnn ep with --disable_rtti will throw
unboundTypeError since unbound type names are illegal with RTTI disabled
in Embind API, we can fix it by adding a
-DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.
### Description
Update DML version to 1.13.1
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV
2023.3.
### Context
- The API is added to facilitate customers in using published official
Microsoft onnxruntime libraries with OVEP libraries.
- Add support for OpenVINO 2023.3 official release.
- Extend operator coverage
- GH fixes
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Description
1. Add two build jobs for enabling Address Sanitizer in CI. One for
Windows CPU, One for Linux CPU.
2. Set default compiler flags/linker flags in build.py for normal
Windows/Linux/MacOS build. This can help control compiler flags in a
more centralized way.
3. All Windows binaries in our official packages will be built with
"/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft
public symbol
server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols).
Limitations:
1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries.
Therefore once Address Sanitizer is enabled, before running tests we
need to manually set LD_LIBRARY_PATH properly otherwise
libonnxruntime.so may not be able to find custom ops and shared EPs.
4. On Linux we also need to set LD_PRELOAD before running some tests(if
the main executable, like python, is not built with address sanitizer.
On Windows we do not need to.
5. On Windows before running python tests we should manually copy
address sanitizer DLL to the onnxruntime/capi directory, because python
3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the
information provided by PATH env.
6. On Linux Address Sanitizer found a lot of memory leaks from our
python binding code. Therefore right now we cannot enable Address
Sanitizer when building ONNX Runtime with python binding.
7. Address Sanitizer itself uses a lot of memory address space and
delays memory deallocations, which is easy to cause OOM issues in 32-bit
applications. We cannot run all the tests in onnxruntime_test_all in
32-bit mode with Address Sanitizer due to this reason. However, we still
can run individual tests in such a way. We just cannot run all of them
in one process.
### Motivation and Context
To catch memory issues.
Fix error:
```
[ 48%] Built target onnxruntime_optimizer
In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```
This error is due to onnxruntime_optimizer missing dependencies on
hipify generated files.
### Description
This PR enables onnxruntime to build with the most recent release of Arm
Compute Library
### Motivation and Context
The latest version of Arm Compute Library that onnxruntime can build is
20.02 which is more than 3 years old.
### Description
onnxruntime may raise an error "type inference failed" but when a custom
operator sets IsHomogeneous to false in its schema. This change make
sure that TypeInferenceFunction and schema type constraints are aligned
to prevent that from happening.
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
a replacement of #18683. try to resolve#18689.
By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the
threadpool to initialize before the webassembly instance is available.
### Description
[DirectML EP] Add DML EP registration for Col2Im operator
### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Fix build when flash attention and memory efficient attention are
disabled
On a customer env with lower version of CUDA < 11.6. Both flash
attention and memory efficient attention is turned OFF according to
e8f33b54ba/cmake/CMakeLists.txt (L701).
So
e8f33b54ba/cmake/external/cutlass.cmake (L1)
condition check return false. No cutlass lib is built.
```
Turn off flash attention since CUDA compiler version < 11.6
```
While, the kernels in
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe
are depending on cutass for its build, so we get error like this:
```
[ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
fatal : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii
make[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module>
sys.exit(main())
File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main
build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets
run_subprocess(cmd_args, env=env)
File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess
return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run
completed_process = subprocess.run(
File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
```
### Motivation and Context
To summarize, there are two cases we will have build failure for Linux
CUDA build:
1. User use cuda version < 11.6
2. User disabled Flash attention and memory efficient attention
explictly with onnxruntime_USE_FLASH_ATTENTION and
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION
### Description
Disable mlas unit test in ARM64EC build because the program has some
link errors. We will fix the errors later.
This PR only impacts Windows ARM64EC build. It has no impact on the
existing build pipelines.
### Description
Update absl and googletest to their latest version to include some cmake
changes:
1. A googletest's cmake change that will allow using external absl and
re2.
2. Nullability enhancements that will allow our clang-based static
analysis detecting many kinds of null pointer errors.
### Motivation and Context
To fix a C4744 link warning in our Windows pipelines.
```
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\usage.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<int>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
```
### Description
<!-- Describe your changes. -->
1. Add a backward-compatible API for compiling model.
2. Run-time load vitisai-ep.dll
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
- Add support for OpenVINO 2023.2
- num_of_threads provider option is mapped to the CPU device property
inference_num_threads of the CPU plugin, so users can control the
#threads used for inference by the CPU
- Logging in Debug mode now includes the runtime properties set for
devices
- Fix issue in using external weights through OpenVINO
---------
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
- Adds graph fusions to preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormaliation (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.
Example use:
```python3
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model
# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"
# Quantize model ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
Update absl and gtest to fix an ARM64EC build error
### Motivation and Context
We need to get an important fix into ORT.
The fix is:
8028a87c96
Build onnxruntime.dll as arm64x
Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.
If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.
Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.
AB#22533
### Description
<!-- Describe your changes. -->
Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also
set `scale` or `zero_point`.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also
set `scale` or `zero_point`.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
data_reader,
activation_type=QuantType.QUInt16,
weight_type=QuantType.QUInt8)
quantize(input_model_path,
output_model_path,
qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
Reverts microsoft/onnxruntime#18413
there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
### Description
<!-- Describe your changes. -->
As title.
1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.
Replace https://github.com/microsoft/onnxruntime/pull/18334
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware
### Motivation and Context
Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
This kernel requires sophisticated quantized weight rearrangement to
speedup loading data to tensor-core. To facilitate the addition, this
change includes the following:
1. matrix_layout.h A new layout lib that facilitate iterating matrix
elements and tiles that balance memory safety and performance.
2. prepack_sm80.h Code for rearranging quantized weight, scales and
offsets (aka. prepacking)
3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the
memory safety and correctness of the prepacking code.
Currently the prepacking code runs on CPU with single threaded code. We
run this on CPU in order to minimize GPU memory fragmentation. On the
other hand, hopefully we get around to parallelize this part of the
code. Should be straight forward with the unit tests in place.
### Description
Truncate traling non-existing arguments.
Make sure we do not skip on the non-existing arguments in the middle,
because shape inferece relies on their proper position.
This also affects the argument position in the Edges that must be
properly rebuilt
each time If node branch is inlined.
Make sure that when we rename Defs in subgraphs, new renamed defs are
created in those subgraphs
instead of pointing to outer scope defs.
Add unit test.
### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
### Description
Motivation for this PR is code cleanup.
1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.
- Connect MatMulNBits contrib op to MLAS function.