### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.
Added a test to cover this configuration.
### Description
This PR does not add or remove any functionality. It refactors common
functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes
into a new `BaseQuantizer` class.
This change helps decouple the QDQ quantizer from other quantization
modes and makes it easier to determine if a change to one quantization
mode will impact another.
### Motivation and Context
An upcoming PR aims to add mixed-precision support to QDQ models (e.g.,
one part of the graph uses u8 activations and another uses u16
activations). This change makes the upcoming PR smaller and should
presumably make determining the impact on existing features more
straightforward.
### Description
<!-- Describe your changes. -->
run_count_before_capture_ is graph_id aware, fix the bug by adding a map
to retrieve the run_count_ for each graph_id.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
set-data_loaction-as-default-when-load-external-data
fix vitis ai ep can not get CutomOps by session_option register
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
VitisAI bug daily fixes
when use pass: fuse_qdq_GEMM or fuse_qdq_MATMUL, get error like : Error
Data of TensorProto ( tensor name: xxx) is stored externally and should
not have data field.raw_data
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.
### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime will choose the
appropriate `use_io_binding` value based on provider and use cases.
> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between
the host and device, or between numpy/torch tensors and ONNX Runtime
ORTValue. Defaults to
> `True` if the execution provider is CUDAExecutionProvider. For
[~onnxruntime.ORTModelForCausalLM], defaults to `True` on
CPUExecutionProvider,
> in all other cases defaults to `False`.
For Llama token benchmark, using iobinding yields almost 2x speedup,
even on CPU. This is because this particular model yields a large number
of outputs (>60). Without iobinding, a copy is performed for each output
from ortvalue to numpy array. This adds significant overhead to the
overall run time.
```
Evaluating Llama2 `model(inputs)` step with past_key_values
Before, w/o iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps
After, w/ iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
### Fix torch cpp extension build warnings
For the warnings shown as below:
```
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
[5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so
```
Fix by replacing eixsting `PyObject_GetAttrString` with
`PyObject_FastGetAttrString` which claims to be faster in its
implementation comment.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Fix and enable few ORTModule Unit Tests
Fix 'test_bert_inputs_with_dynamic_shape' and
'test_bert_result_with_layerwise_recompute' generate Nan loss in ORT
run.
The root cause is, the logic to generatic attention mask test data is
not correct, only 0 or 1 is allowed in the dataset, but we see lots of
other numbers. ( The reason we don't have this using old version of
transformers for example v4.4.2 or 4.16.2 is because they don't contains
such
d3cb28886a,
which increase the scaling to a bigger number, causing a overflow to
inf)
Another improvement during the investigation using convergence tools:
Don't dump the activations during model export phase, otherwise, the
dumped data might contains some PyTorch run's result making us confused
during comparing with stock PyTorch run results.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR adds below shape related fusions, which is helpful for some
transformer models:
- ShapeInputMerge is to merge all Shape nodes' input NodeArg to a single
one (the 1st one on topo order) if they have the same shape value. This
helps CSE fusion to merge more nodes.
- CSE fusion to support scalar tensor as attribute value. This is mainly
to support ConstantOfShape node.
### Description
<!-- Describe your changes. -->
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.
Update the partitioning utils to be QDQ aware. If there are node units
we process the logical nodes they represent instead of individual nodes.
This ensure we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix one of the issues in #19590
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
As a follow up of #19788, remove more remaining Windows ARM32 build
jobs.
### Motivation and Context
Our nuget packaging pipeline is failing because it could not find an
artifact for Win ARM32.
```
##[error]Artifact onnxruntime-training-win-arm was not found for build 421397.
```
Deprecation of Win ARM32 was announced by Windows team in January 2023.
We should follow it.
### Description
1. Replace some old file system calls to use C++17 std::filesystem APIs.
2. Remove tensorflow_C_PACKAGE_PATH cmake option, which was only used in
onnxruntime_perf_test and the code is out of maintain.
3. Excludes onnx_test_runner and onnxruntime_perf_test from iOS build
because C++17 filesystem library is not available there
### Description
DML Implementation for [com.microsoft.DynamicQuantizeMatMul
](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.DynamicQuantizeMatMul)
```
.\onnxruntime_test_all.exe --gtest_filter="*DynamicQuantizeMatMul.*"
Note: Google Test filter = *DynamicQuantizeMatMul.*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from DynamicQuantizeMatMul
[ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8
[ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 (635 ms)
[ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8
[ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 (514 ms)
[ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8
[ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 (512 ms)
[ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8
[ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 (505 ms)
[ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8
[ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 (526 ms)
[ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8
[ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 (504 ms)
[ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8
[ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 (512 ms)
[ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8
[ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 (512 ms)
[ RUN ] DynamicQuantizeMatMul.UInt8_test_with_empty_input
[ OK ] DynamicQuantizeMatMul.UInt8_test_with_empty_input (112 ms)
[ RUN ] DynamicQuantizeMatMul.B_PerColumn_ND
[ OK ] DynamicQuantizeMatMul.B_PerColumn_ND (348 ms)
[----------] 10 tests from DynamicQuantizeMatMul (4685 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (4686 ms total)
[ PASSED ] 10 tests.
memleakdbg:
----- No memory leaks detected -----
```
### Motivation and Context
- CalculateDynamicQuantizeMatMul to replace CPU EP run reference
- Added more FP32 testcases to isolate all input datatype combinations
---------
Co-authored-by: Xiang Zhang <xianz@microsoft.com>
Implement STFT Decomposition transformer.
Certain hardware does not support DXIL, and therefore existing operator
should be mapped to hardware supported functions.
Optimized convolution can be used to implement STFT.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
* Update name of existing dockerfiles and add support to test latest
TensorRT EA binary located in the image
* Add cuda 12.3/cuDNN 9/TensorRT 8.6 dockerfile
* Add detail to CI prompts and configs
Instruction to test latest TRT via BIN:
1. Select `BIN` in TensorRT Version
2. In Variables, update related tarCudaVersion, **clear**
tarCudnnVersion (not required in latest TRT tar binary) , and path to
binary.
### Description
* Add tag to distinguish if TRT `builtin` or `oss` parser is being used
* `oss` tag will be inserted with onnx-tensorrt commit id, to indicate
which version oss parser is
### Validate
DB entry before/after this PR
(during test, `builtin` or `oss_{commit_id}` tag was inserted in the
database entries):
### Motivation and Context
To distinguish perf results using builtin/oss parser in the database,
this parser tag is needed.
In future, results using different parsers will be listed in different
Perf Dashboard pages.
### Description
<!-- Describe your changes. -->
- Support multiple include/exclude values.
- e.g. can now run with `-i MacOS -i iOS` to run CIs for both Apple
platforms.
- Default to current branch if run from directory in repo.
- make lazier usage possible
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve tools.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
For Concat operation, the zero-size input tensor shape need to be
preserved and, unlike non-zero tensors, the dims are not constrained to
match other input tensors' dims.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add synthetic CoreML EP name to the list of providers so we test with
NeuralNetwork and MLProgram model types.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Automatically test new MLProgram support in CI
### Description
<!-- Describe your changes. -->
The argparser had incorrectly used `description` and `options` instead
of `help` and `choices`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes:
#19751
### Description
<!-- Describe your changes. -->
1. add a config key in run_options to control cuda graph in runtime.
2. enhance cuda graph class to support mutiple graph saving and
retrieving in one ORT session
3. provide model modification/inference example on Phi2
4. benchmark shows an average of 13% latency reduction in token
generation.
limitation: TRT ep and ROCM ep hasn't applied this feature. we can
revisit this in the future.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Check float/double/float16/bfloat16 tensors are close like
[numpy.isclose](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html).
```
absolute(a - b) <= (atol + rtol * absolute(b))
```
The default tolerance thresholds:
- float: atol=1e-5 and rtol=1e-4
- float16: atol=0.0025 and rtol=0.001
- bfloat16: atol=0.02 and rtol=0.01
### Motivation and Context
Current pipeline has frequent failure due to using only relative
tolerance in https://github.com/microsoft/onnxruntime/pull/19608:
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
1: C:\a\_work\1\s\onnxruntime\test\providers\checkers.cc(272): error:
The difference between cur_expected[i] and cur_actual[i] is
1.3113021850585938e-06, which exceeds *(params.relative_error) *
std::abs(cur_expected[i]), where
1: cur_expected[i] evaluates to -1.3113021850585938e-06,
1: cur_actual[i] evaluates to 0, and
1: *(params.relative_error) * std::abs(cur_expected[i]) evaluates to
2.6226043559063328e-08.
It is not reasonable to use relative tolerance for a small value very
close to 0. Combining relative tolerance with a positive absolute
tolerance could avoid such issue.
### Define recomputable op list with domain/opset
Originally, we just check the OpType and decide whether it is
recomputable.
In this PR, few improvements are made:
1. [Op type search] Domain + OpType are used to check whether the op is
supported to recompute.
2. [Opset search] Then, node.SinceVersion() will be searched in the
supported opsets.
3. During subgraph detection, If the node in that this opset is
supported, get the ignorable input indices, which means we don't
consider in the bottom-up search. This would save time for the subgraph
detection.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adding CUDA NHWC support for SpaceToDepth and DepthToSpace
- Add a new test which verifies that swizzling SpaceToDepth swizzling
for the H axis is correct.
- If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as
well.
### Motivation and Context
Adding more NHWC operations to avoid layout transformations when using
the CUDA EP for more efficiency.
### Description
Previously, GQA incorrectly enforced rotary cos and sin cache to be of
sequence length equal to present sequence length. Now it enforces that
it be greater than or equal to present sequence length since to match
Rotary Embedding Op it should be of max_sequence_length
### Motivation and Context
Fixes issue with fusing Rotary Embedding and GQA for certain models
which prefer this optimization.
### Description
1. Upgrade the version from 10.0.19041.0 to 10.0.22621.0. The old one
misses some macros that are needed by PyTorch's CPUINFO
2. Also update cmake.
### Motivation and Context
In PR #19655 I added CPUINFO to all Windows builds, but forgot to test
this pipeline.
### Adapt memory optimizer to fit PHI2
Few improvements and bug fixes:
1. Fix bug related to transformer layer detection.
2. Use default reversed typo order to create recompute node, to avoid
the leaf nodes are handled too late, then having lowest priority for
execution.
3. Add early stop when activation's element count is constant and total
element count < 1M. This can avoid overhead to search subgraphs.
Using export ORTMODULE_MEMORY_OPT_LEVEL=1 to enable layerwise recompute,
on given recipe, memory consumption dropped from ~22GB to ~13GB .
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.
Mainly
- unused parameters
- variables shadowing other ones
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.
This PR:
- add support for int as return type, will create a CPU scalar tensor
for it.
- add attributes to specify which arguments or returns are CPU tensors.
- adjust ATen efficient attn to match latest PyTorch native function.
- a Triton codegen bugfix by the way.
### Fix seed for recomputed Dropout
If Dropout node is recomputed in the backward, we should make sure its
execution is same as the run in the forward.
If we don't set seed attribute, then this cannot be guaranteed.
Add ` export ORTMODULE_MEMORY_OPT_LEVEL=2` to enabled per layer
recompute with compromised recomputable subgraphs.
When the DDS output is empty tensor (i.e. any of the dimension is 0),
TRT EP won't perform either cudaMemcpyAsync() nor cuda::Impl_Cast(), to
prevent accidentally overwriting other location that might belong to
other tensors.
This PR also refactors the code to only allocate single bytes for all
empty tensors.
#TODO: add unit tests to cover the DDS code paths or doing more testing
with concurrent,sequential, threaded faster-rcnn using onnx_test_runner
and verifying outputs
---------
Co-authored-by: Chi Lo <lochi@microsoft.com>
### Description
Implment IsInf-10,20 for CUDA.
Add FP16 types also on CPU.
### Motivation and Context
Certain models lag in performance due to IsInf not available on CUDA.
### Description
Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is
specially optimized for Nvidia Ampere GPUs.
### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs
### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in current build system, loading of CUTLASS dependency
is guarded with:
(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)
If both of these options are turned off, then compilation will fail.
Why CUTLASS dependency is guarded at all? It's a header file only
library that does not introduce any binary if not instantiated. What's
the downside of removing all the guards and just include CUTLASS
unconditionally?
### Description
- Fix incorrect running_mean / running_var in training mode due to
incorrect momentum and missing input mean/var. runnig_var could be
correct, but has a too high epsilon.
- Fix incorrect checks when using NHWC
- Pass NHWC flag to NormalizeDims to get correct new dimensions from
x_shape
- Register missing double operations to get parity between NHWC/NCHW
### Description
<!-- Describe your changes. -->
1. Support quantized GPTQ weight in huggingface like
[TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
2. Support Act_order for GPTQ
3. Support [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm to
quantize matmul weight and add quant script
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
remove the constraint - "group number should be less than 3";
add more condition to make sure the conv1d replacement only happens on
conv1d instead of conv2d/conv3d;
add more tests;
### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)
```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)
[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[ PASSED ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* `CalculateMatMulIntegerToFloat` to replace CPU EP run reference
* Added more FP32 testcases to isolate all input datatype combinations
* Added fixed input to `MatMulIntegerToFloat_FP16*` test cases as for
FP16 test cases.
* onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->