Commit graph

11997 commits

Author SHA1 Message Date
wejoncy
22ad629cf7
[bug fix] dequantize 4bit (#19793)
### Description



### Motivation and Context
2024-03-12 18:27:46 -07:00
Edward Chen
860eb762c2
[Apple framework] Fix minimal build with training enabled. (#19858)
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.

Added a test to cover this configuration.
2024-03-12 11:33:30 -07:00
Adrian Lizarraga
00c3cd497e
[QDQ Quantization] Refactor shared functionality into a base quantizer (#19817)
### Description
This PR does not add or remove any functionality. It refactors common
functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes
into a new `BaseQuantizer` class.

This change helps decouple the QDQ quantizer from other quantization
modes and makes it easier to determine if a change to one quantization
mode will impact another.

### Motivation and Context
An upcoming PR aims to add mixed-precision support to QDQ models (e.g.,
one part of the graph uses u8 activations and another uses u16
activations). This change makes the upcoming PR smaller and should
presumably make determining the impact on existing features more
straightforward.
2024-03-12 10:47:09 -07:00
Ye Wang
7f0520cdf9
bug fix to multi-cudagraph (#19856)
### Description

`run_count_before_capture_` must be tracked per `graph_id`. Fix the bug by adding a map
to retrieve the `run_count_` for each `graph_id`.
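
The idea of keeping one run count per graph id can be sketched as follows. This is only an illustration in Python, not the CUDA EP code: the class name, method names, and the threshold value are all hypothetical.

```python
from collections import defaultdict


class GraphRunCounter:
    """Hypothetical sketch: tracks regular runs per graph id before capture."""

    def __init__(self, min_runs_before_capture: int = 2):
        # Assumed threshold for illustration; the real value lives in the EP.
        self.min_runs_before_capture = min_runs_before_capture
        # graph_id -> number of regular runs seen so far
        self._run_counts = defaultdict(int)

    def record_run(self, graph_id: int) -> None:
        self._run_counts[graph_id] += 1

    def ready_to_capture(self, graph_id: int) -> bool:
        # Each graph id is judged by its own count, not a shared one.
        return self._run_counts[graph_id] >= self.min_runs_before_capture


counter = GraphRunCounter()
counter.record_run(0)
counter.record_run(0)
```

With a single shared counter, the two runs recorded for graph 0 would also make graph 1 look ready; with the map, graph 1 still starts from zero.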

### Motivation and Context
2024-03-12 10:33:37 -07:00
zz002
319159b7bd
[VitisAI]set-data_loaction-as-default-when-load-external-data (#19712)
### Description

Set data_location as the default when loading external data.
Fix the Vitis AI EP not being able to get custom ops registered through session options.

### Motivation and Context

Routine VitisAI bug fixes.
When using the pass fuse_qdq_GEMM or fuse_qdq_MATMUL, an error like the following occurs: Error
Data of TensorProto ( tensor name: xxx) is stored externally and should
not have data field.raw_data

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-03-12 10:27:14 -07:00
Bowen Bao
742595b885
Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.


### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime will choose the
appropriate `use_io_binding` value based on provider and use cases.

> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between
> the host and device, or between numpy/torch tensors and ONNX Runtime
> ORTValue. Defaults to `True` if the execution provider is
> CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM], defaults
> to `True` on CPUExecutionProvider, in all other cases defaults to `False`.

For the Llama token benchmark, using iobinding yields a roughly 1.7x speedup,
even on CPU. This is because this particular model yields a large number
of outputs (>60). Without iobinding, a copy is performed for each output
from OrtValue to numpy array. This adds significant overhead to the
overall run time.

```
Evaluating Llama2 `model(inputs)` step with past_key_values

Before, w/o iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps

After, w/ iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
2024-03-12 09:41:11 -07:00
Yi Zhang
d4fa4f0276
Remove FFmpeg to meet compliance (#19859)
2024-03-12 09:06:59 -07:00
pengwa
3fb8905393
Fix torch cpp extension build warnings (#19842)
### Fix torch cpp extension build warnings

For the warnings shown below:

```
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
  104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
[5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
  104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so

```

Fix by replacing the existing `PyObject_GetAttrString` with
`PyObject_FastGetAttrString`, which claims to be faster in its
implementation comment.

### Motivation and Context
2024-03-12 10:51:30 +08:00
pengwa
3e954da3e6
Fix and enable few ORTModule Unit Tests (#19847)
### Fix and enable few ORTModule Unit Tests

Fix `test_bert_inputs_with_dynamic_shape` and
`test_bert_result_with_layerwise_recompute`, which generated NaN loss in the ORT
run.

The root cause is that the logic generating the attention mask test data is
incorrect: only 0 or 1 is allowed in the dataset, but we see lots of
other numbers. (We did not hit this with older versions of
transformers, for example v4.4.2 or 4.16.2, because they do not contain
d3cb28886a,
which increases the scaling to a bigger number, causing an overflow to
inf.)

Another improvement made during the investigation using the convergence tools:
do not dump the activations during the model export phase. Otherwise, the
dumped data might contain results from a PyTorch run, which is confusing
when comparing against stock PyTorch run results.


### Motivation and Context
2024-03-12 10:49:19 +08:00
Vincent Wang
0c078dfc8b
Some Shape Related Fusions (#19832)
This PR adds the shape-related fusions below, which are helpful for some
transformer models:
- ShapeInputMerge merges the input NodeArgs of all Shape nodes into a single
one (the first one in topological order) if they have the same shape value. This
helps CSE fusion merge more nodes.
- CSE fusion now supports a scalar tensor as an attribute value. This is mainly
to support the ConstantOfShape node.
2024-03-12 10:29:27 +08:00
Scott McKay
978c40d853
Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723)
### Description
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.

Update the partitioning utils to be QDQ aware. If there are node units,
we process the logical nodes they represent instead of individual nodes.
This ensures we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.

### Motivation and Context
Fix one of the issues in #19590

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-12 10:55:49 +10:00
Hector Li
cba605e845
Fix Clip op builder for FP16 support (#19825)
### Description
Fix Clip op builder for FP16 support.

### Motivation and Context
Enables mobilenet v2 FP16 model inference on HTP
2024-03-11 16:39:41 -07:00
raoanag
89aa4697b1
[DML] QAttention (#19766)
### Description
DML Implementation for
[com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention)



### Motivation and Context

---------

Co-authored-by: Xiang Zhang <xianz@microsoft.com>
2024-03-11 10:44:34 -07:00
Changming Sun
5479124834
Remove remaining Windows ARM32 build jobs (#19840)
### Description
As a follow up of #19788, remove more remaining Windows ARM32 build
jobs.


### Motivation and Context
Our nuget packaging pipeline is failing because it could not find an
artifact for Win ARM32.
```
##[error]Artifact onnxruntime-training-win-arm was not found for build 421397.
```

Deprecation of Win ARM32 was announced by Windows team in January 2023.
We should follow it.
2024-03-11 11:25:11 +08:00
Changming Sun
efad5bbc5a
Replace some old file system calls with C++17 std::filesystem APIs. (#19196)
### Description
1. Replace some old file system calls with C++17 std::filesystem APIs.
2. Remove the tensorflow_C_PACKAGE_PATH cmake option, which was only used in
onnxruntime_perf_test and whose code is no longer maintained.
3. Exclude onnx_test_runner and onnxruntime_perf_test from the iOS build
because the C++17 filesystem library is not available there.
2024-03-09 09:17:36 -08:00
raoanag
fa73d7cbf9
[DML] DynamicQuantizeMatMul (#19763)
### Description
DML Implementation for [com.microsoft.DynamicQuantizeMatMul](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.DynamicQuantizeMatMul)

```
.\onnxruntime_test_all.exe --gtest_filter="*DynamicQuantizeMatMul.*"
Note: Google Test filter = *DynamicQuantizeMatMul.*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from DynamicQuantizeMatMul
[ RUN      ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8
[       OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 (635 ms)
[ RUN      ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8
[       OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 (514 ms)
[ RUN      ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8
[       OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 (512 ms)
[ RUN      ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8
[       OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 (505 ms)
[ RUN      ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8
[       OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 (526 ms)
[ RUN      ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8
[       OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 (504 ms)
[ RUN      ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8
[       OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 (512 ms)
[ RUN      ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8
[       OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 (512 ms)
[ RUN      ] DynamicQuantizeMatMul.UInt8_test_with_empty_input
[       OK ] DynamicQuantizeMatMul.UInt8_test_with_empty_input (112 ms)
[ RUN      ] DynamicQuantizeMatMul.B_PerColumn_ND
[       OK ] DynamicQuantizeMatMul.B_PerColumn_ND (348 ms)
[----------] 10 tests from DynamicQuantizeMatMul (4685 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (4686 ms total)
[  PASSED  ] 10 tests.
memleakdbg:
----- No memory leaks detected -----
```


### Motivation and Context
- CalculateDynamicQuantizeMatMul to replace CPU EP run reference
- Added more FP32 testcases to isolate all input datatype combinations

---------

Co-authored-by: Xiang Zhang <xianz@microsoft.com>
2024-03-08 15:35:10 -08:00
Sheil Kumar
7deee944c0
Implement STFT Decomposition transformer (#19725)
Implement STFT Decomposition transformer.

Certain hardware does not support DXIL, and therefore the existing operator
should be mapped to hardware-supported functions.
An optimized convolution can be used to implement STFT.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-03-08 15:02:58 -08:00
Yifan Li
069d2d6f54
[EP Perf] Update EP Perf dockerfiles with cuda12/cudnn9 (#19781)
### Description
* Update name of existing dockerfiles and add support to test latest
TensorRT EA binary located in the image
* Add cuda 12.3/cuDNN 9/TensorRT 8.6 dockerfile
* Add detail to CI prompts and configs

Instructions to test the latest TRT via BIN:
1. Select `BIN` in TensorRT Version
2. In Variables, update the related tarCudaVersion, **clear**
tarCudnnVersion (not required in the latest TRT tar binary), and update the path to
the binary.
2024-03-08 13:58:22 -08:00
Yifan Li
3170a48e60
[EP Perf] Add tag to indicate which TRT parser is using (#19784)
### Description
* Add a tag to distinguish whether the TRT `builtin` or `oss` parser is being used
* The `oss` tag will be inserted with the onnx-tensorrt commit id, to indicate
which version of the oss parser is used
### Validate
DB entry before/after this PR 
(during test, `builtin` or `oss_{commit_id}` tag was inserted in the
database entries):

### Motivation and Context
To distinguish perf results using builtin/oss parser in the database,
this parser tag is needed.
In future, results using different parsers will be listed in different
Perf Dashboard pages.
2024-03-08 10:24:36 -08:00
Scott McKay
01c376a0b9
Update script to run CIs for a branch. (#19797)
### Description
- Support multiple include/exclude values. 
- e.g. can now run with `-i MacOS -i iOS` to run CIs for both Apple
platforms.
- Default to current branch if run from directory in repo.
  - make lazier usage possible
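
Repeatable options such as `-i MacOS -i iOS` are commonly handled in Python with argparse's `append` action. The sketch below is an assumption about how such a filter could look, not the script's actual code: the `-e` flag, the `select_pipelines` helper, and the pipeline names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of repeatable CI filters")
    # action="append" collects every occurrence of the flag into a list.
    parser.add_argument("-i", "--include", action="append", default=[],
                        help="Only run pipelines whose name contains this value. Repeatable.")
    # Hypothetical short flag for the exclude filter.
    parser.add_argument("-e", "--exclude", action="append", default=[],
                        help="Skip pipelines whose name contains this value. Repeatable.")
    return parser


def select_pipelines(names, include, exclude):
    """Keep names matching any include value (or all if none) and no exclude value."""
    selected = [n for n in names
                if not include or any(i.lower() in n.lower() for i in include)]
    return [n for n in selected
            if not any(e.lower() in n.lower() for e in exclude)]


args = build_parser().parse_args(["-i", "MacOS", "-i", "iOS"])
pipelines = select_pipelines(["MacOS CI", "iOS CI", "Linux CI"], args.include, args.exclude)
```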

### Motivation and Context
Improve tools.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-08 17:52:47 +10:00
Satya Kumar Jandhyala
24b72d2613
[JS/WebGPU] Preserve zero size input tensor dims. (#19737)
### Description
For the Concat operation, the shape of a zero-size input tensor needs to be
preserved and, unlike non-zero-size tensors, its dims are not constrained to
match other input tensors' dims.



### Motivation and Context
2024-03-07 19:07:49 -08:00
Scott McKay
6c3bed6740
Run CoreML EP with NeuralNetwork and ML Program in CI unit tests (#19796)
### Description
Add synthetic CoreML EP name to the list of providers so we test with
NeuralNetwork and MLProgram model types.

### Motivation and Context
Automatically test new MLProgram support in CI
2024-03-08 12:50:13 +10:00
Dmitri Smirnov
2964352641
Implement IsNaN-9,13,20 for CUDA along with tests (#19807)
### Description


### Motivation and Context
Some models require IsNan CUDA along with training
2024-03-07 15:46:11 -08:00
Yi-Hong Lyu
33578cc76e
Remove memset for the case no any mask (#19823)
Improved OCR model speed by 1.034x end-to-end by eliminating an unnecessary
memset when no mask is present.
2024-03-07 13:54:16 -08:00
Jambay Kinley
3dfce2f1cd
Fix argparser in matmul_bnb4_quantizer (#19812)
### Description
The argparser had incorrectly used `description` and `options` instead
of `help` and `choices`.
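
For reference, argparse's `add_argument` takes `help` for the per-option text and `choices` to restrict the accepted values. A minimal sketch follows; the `--quant_type` flag, its values, and the help text are hypothetical and only stand in for the quantizer's real options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # `description` belongs to the parser itself, not to individual arguments.
    parser = argparse.ArgumentParser(description="Sketch of a 4-bit quantizer CLI")
    parser.add_argument(
        "--quant_type",      # hypothetical flag name
        type=int,
        default=1,
        choices=[0, 1],      # restricts accepted values; other values are rejected
        help="Quantization data type (hypothetical meaning: 0 or 1 selects the format)",
    )
    return parser


args = build_parser().parse_args(["--quant_type", "0"])
```

A value outside `choices` makes argparse print a usage error and exit, which is the validation that was lost with the misnamed keywords.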


### Motivation and Context
Fixes:
#19751
2024-03-07 11:31:34 -08:00
Ye Wang
72ce4de07d
cuda graph enhancement (#19636)
### Description

1. add a config key in run_options to control cuda graph at runtime.
2. enhance the cuda graph class to support saving and retrieving multiple
graphs in one ORT session
3. provide a model modification/inference example on Phi2
4. benchmark shows an average of 13% latency reduction in token
generation.



Limitation: the TRT EP and ROCm EP have not adopted this feature yet. We can
revisit this in the future.

### Motivation and Context
2024-03-07 10:15:18 -08:00
Tianlei Wu
bff4f8bf75
Update tolerance of provider tests to fix flaky tests (#19792)
### Description

Check float/double/float16/bfloat16 tensors are close like
[numpy.isclose](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html).
```
absolute(a - b) <= (atol + rtol * absolute(b))
```

The default tolerance thresholds:
- float: atol=1e-5 and rtol=1e-4
- float16: atol=0.0025 and rtol=0.001
- bfloat16: atol=0.02 and rtol=0.01

### Motivation and Context

The current pipeline fails frequently because only relative
tolerance is used in https://github.com/microsoft/onnxruntime/pull/19608:

[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
1: C:\a\_work\1\s\onnxruntime\test\providers\checkers.cc(272): error:
The difference between cur_expected[i] and cur_actual[i] is
1.3113021850585938e-06, which exceeds *(params.relative_error) *
std::abs(cur_expected[i]), where
1: cur_expected[i] evaluates to -1.3113021850585938e-06,
1: cur_actual[i] evaluates to 0, and
1: *(params.relative_error) * std::abs(cur_expected[i]) evaluates to
2.6226043559063328e-08.

It is not reasonable to use only a relative tolerance for a small value very
close to 0. Combining the relative tolerance with a positive absolute
tolerance avoids such issues.
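
The failing case above can be replayed in a few lines. This is a sketch of the comparison formula only, not the checker code; the function names are hypothetical, and the relative error of 0.02 is inferred from the two numbers in the log (2.62e-08 / 1.31e-06).

```python
def is_close(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """numpy.isclose-style check: |a - b| <= atol + rtol * |b|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def is_close_relative_only(actual: float, expected: float, rtol: float) -> bool:
    """Relative-only check: the allowed error shrinks to zero near zero."""
    return abs(actual - expected) <= rtol * abs(expected)


# Values taken from the failing MatMulIntegerToFloat log.
expected = -1.3113021850585938e-06
actual = 0.0

old_result = is_close_relative_only(actual, expected, rtol=0.02)
# Default float thresholds listed in the description: atol=1e-5, rtol=1e-4.
new_result = is_close(actual, expected, atol=1e-5, rtol=1e-4)
```

The relative-only check rejects the pair because the allowed error is proportional to a value of about 1e-6, while the combined check accepts it because the absolute term dominates near zero.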
2024-03-06 17:47:17 -08:00
pengwa
5c5d6e99ce
Define recomputable op list with domain/opset (#19722)
### Define recomputable op list with domain/opset

Originally, we only checked the OpType to decide whether an op is
recomputable.

This PR makes a few improvements:
1. [Op type search] Domain + OpType are used to check whether the op
supports recompute.
2. [Opset search] Then, node.SinceVersion() is looked up in the
supported opsets.
3. During subgraph detection, if the node's opset is
supported, get the ignorable input indices, i.e. the inputs we do not
consider in the bottom-up search. This saves time during subgraph
detection.
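
The two-level lookup described above can be sketched as a nested table. This is an illustration in Python rather than the actual C++ code, and every table entry (ops, opsets, ignorable indices) is a made-up example, not the real recomputable op list.

```python
from typing import Dict, Optional, Tuple

# (domain, op_type) -> {since_version: indices of inputs to ignore in the
# bottom-up search}. All entries are illustrative assumptions.
RECOMPUTABLE_OPS: Dict[Tuple[str, str], Dict[int, Tuple[int, ...]]] = {
    ("", "Reshape"): {5: (1,), 13: (1,), 14: (1,)},
    ("", "Dropout"): {13: (1, 2)},
    ("com.microsoft", "BiasGelu"): {1: ()},
}


def ignorable_inputs(domain: str, op_type: str,
                     since_version: int) -> Optional[Tuple[int, ...]]:
    """Return ignorable input indices, or None if the node is not recomputable."""
    opsets = RECOMPUTABLE_OPS.get((domain, op_type))  # step 1: domain + op type
    if opsets is None:
        return None
    return opsets.get(since_version)                  # step 2: opset lookup
```

Returning `None` for an unknown opset means a newer version of an op is treated as not recomputable until it is explicitly added to the table.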


### Motivation and Context
2024-03-07 09:12:12 +08:00
Wanming Lin
1ce5bfb0ec
[WebNN EP] Make sure optional input is provided (#19686)
Some optional inputs are represented as empty strings, so we should not only
check that the input size is correct, but also check that the optional input
is not empty.
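
The combined check can be sketched as follows. It is written in Python for brevity, while the EP itself is C++; the helper name is hypothetical, and the input list mimics an ONNX Pad node whose optional `constant_value` input is left empty.

```python
from typing import Sequence


def has_optional_input(input_names: Sequence[str], index: int) -> bool:
    """True only if the input slot exists and its name is not the empty string."""
    return index < len(input_names) and input_names[index] != ""


# Pad inputs in order: data, pads, constant_value (optional), axes (optional).
pad_inputs = ["data", "pads", "", "axes"]
```

Checking only `len(pad_inputs) > 2` would wrongly report that `constant_value` is provided here, because the empty slot still counts toward the input size.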

e.g. Pad node has empty optional input in sam-b-encoder.onnx model:
<img width="514" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/3271201/cc3b06fe-46b9-4ee7-aca5-157bdf112856">
2024-03-06 16:19:59 -08:00
Markus Tavenrath
f2dc725b33
Add SpaceToDepth and DepthToSpace CUDA NHWC Ops (#19646)
### Description
- Add CUDA NHWC support for SpaceToDepth and DepthToSpace.
- Add a new test which verifies that SpaceToDepth swizzling
for the H axis is correct.
- If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as
well.

### Motivation and Context
Adding more NHWC operations to avoid layout transformations when using
the CUDA EP for more efficiency.
2024-03-06 12:35:55 -08:00
aciddelgado
8bd1335d00
Fix GQA Rotary Embedding sequence length (#19801)
### Description
Previously, GQA incorrectly enforced that the rotary cos and sin caches have a
sequence length equal to the present sequence length. Now it enforces that
the length be greater than or equal to the present sequence length, since to match
the Rotary Embedding op the caches should be of max_sequence_length.
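
The relaxed condition amounts to replacing an equality check with a lower bound. A minimal sketch, with a hypothetical function name and error message rather than the operator's actual validation code:

```python
def validate_rotary_cache(cache_sequence_length: int,
                          present_sequence_length: int) -> None:
    """Accept any cos/sin cache that covers at least the present sequence length."""
    # The old behavior corresponds to `!=` here, which rejected caches sized
    # to max_sequence_length whenever that exceeded the present length.
    if cache_sequence_length < present_sequence_length:
        raise ValueError(
            f"rotary cache length {cache_sequence_length} is smaller than "
            f"present sequence length {present_sequence_length}"
        )


validate_rotary_cache(cache_sequence_length=2048, present_sequence_length=512)
```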



### Motivation and Context
Fixes issue with fusing Rotary Embedding and GQA for certain models
which prefer this optimization.
2024-03-06 12:34:33 -08:00
Hector Li
db8d0c8e06
reset dcvsEnable for different HTP performance mode (#19728)
reset dcvsEnable for different HTP performance mode
2024-03-06 11:21:19 -08:00
Changming Sun
f9a92e589a
Upgrade the Windows SDK version that is used in WindowsAI Nuget Packaging pipeline (#19786)
### Description
1. Upgrade the version from 10.0.19041.0 to 10.0.22621.0. The old one
lacks some macros that are needed by PyTorch's CPUINFO.
2. Also update cmake.


### Motivation and Context
In PR #19655 I added CPUINFO to all Windows builds, but forgot to test
this pipeline.
2024-03-06 09:10:35 -08:00
pengwa
d9bf85613d
Adapt memory optimizer to fit PHI2 (#19757)
### Adapt memory optimizer to fit PHI2

A few improvements and bug fixes:
1. Fix a bug related to transformer layer detection.
2. Use the default reversed topological order to create recompute nodes, so that
leaf nodes are not handled too late and thus given the lowest priority for
execution.
3. Add an early stop when the activation's element count is constant and the total
element count is < 1M. This avoids the overhead of searching subgraphs.


Using `export ORTMODULE_MEMORY_OPT_LEVEL=1` to enable layerwise recompute
on a given recipe, memory consumption dropped from ~22GB to ~13GB.
2024-03-06 21:54:16 +08:00
Ashwini Khade
e93a860819
Remove arm build for training (#19788)
We no longer support Win arm 32 so removing the associated build and
packaging job.
2024-03-05 21:54:48 -08:00
Scott McKay
db59cec82f
Don't reduce warning level for CUDA build on Windows (#19663)
### Description
Address warnings so all the ORT projects build with /W4 on Windows.

Mainly 
- unused parameters
- variables shadowing other ones

### Motivation and Context
#19588 started on this.
2024-03-06 15:03:55 +10:00
Yulong Wang
a788514027
[js/web] dump debug logs for karma for diagnose purpose (#19785)
### Description
Dump debug logs for karma for diagnostic purposes.

This is for debugging the CI issue of Chrome launch failures and is
considered temporary.
2024-03-05 18:27:26 -08:00
Vincent Wang
1bfc26685b
ATen Op Supports Int Return Type and CPU Tensor Arguments (#19773)
This PR:
- adds support for int as a return type; a CPU scalar tensor is created
for it.
- adds attributes to specify which arguments or returns are CPU tensors.
- adjusts ATen efficient attn to match the latest PyTorch native function.
- also includes a Triton codegen bug fix.
2024-03-06 10:11:46 +08:00
pengwa
d102569755
Fix seed for recomputed Dropout (#19715)
### Fix seed for recomputed Dropout

If a Dropout node is recomputed in the backward pass, we should make sure its
execution is the same as the run in the forward pass.
If we don't set the seed attribute, this cannot be guaranteed.

Add `export ORTMODULE_MEMORY_OPT_LEVEL=2` to enable per-layer
recompute with compromised recomputable subgraphs.
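
Why a fixed seed matters for recompute can be shown with a toy mask generator. This sketch uses Python's standard `random` module instead of the real Dropout kernel; the function name and the seed value are hypothetical.

```python
import random
from typing import List


def dropout_mask(num_elements: int, ratio: float, seed: int) -> List[bool]:
    """Toy keep-mask: True means the element is kept."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    return [rng.random() >= ratio for _ in range(num_elements)]


# The forward run and the recomputed run share the same seed attribute,
# so they drop exactly the same elements.
forward_mask = dropout_mask(8, ratio=0.5, seed=1234)
recomputed_mask = dropout_mask(8, ratio=0.5, seed=1234)
```

If the recomputed node drew from an unseeded generator instead, its mask would generally differ from the forward one and the backward pass would use inconsistent activations.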
2024-03-06 10:06:25 +08:00
Chi Lo
d9730c7f43
[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575)
When the DDS output is an empty tensor (i.e. any of its dimensions is 0),
the TRT EP performs neither cudaMemcpyAsync() nor cuda::Impl_Cast(), to
avoid accidentally overwriting other locations that might belong to
other tensors.

This PR also refactors the code to only allocate a single byte for
empty tensors.

#TODO: add unit tests to cover the DDS code paths or doing more testing
with concurrent,sequential, threaded faster-rcnn using onnx_test_runner
and verifying outputs

---------

Co-authored-by: Chi Lo <lochi@microsoft.com>
2024-03-05 14:39:36 -08:00
Dmitri Smirnov
1e78bcea60
Implement CUDA IsInf-10,20 (#19772)
### Description
Implement IsInf-10,20 for CUDA.
Also add FP16 types on CPU.

### Motivation and Context
Certain models lag in performance due to IsInf not being available on CUDA.
2024-03-05 13:33:01 -08:00
Chen Fu
06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is
specially optimized for Nvidia Ampere GPUs.


### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in the current build system, loading of the CUTLASS dependency
is guarded with:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, compilation will fail.

Why is the CUTLASS dependency guarded at all? It is a header-only
library that does not introduce any binary if not instantiated. What is
the downside of removing all the guards and just including CUTLASS
unconditionally?
2024-03-05 09:37:45 -08:00
Markus Tavenrath
bdf678df93
Fix CUDA BatchNorm bugs and add support for NHWC (#19742)
### Description
- Fix incorrect running_mean / running_var in training mode due to
incorrect momentum and missing input mean/var. running_var could be
correct, but has too high an epsilon.
- Fix incorrect checks when using NHWC
- Pass NHWC flag to NormalizeDims to get correct new dimensions from
x_shape
- Register missing double operations to get parity between NHWC/NCHW
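
The running-statistics update that the first fix concerns can be sketched for a single channel. This follows the formula in the ONNX BatchNormalization specification (training mode, population variance) and is not the CUDA implementation; `update_running_stats` is a hypothetical helper.

```python
def update_running_stats(batch, input_mean, input_var, momentum=0.9):
    """Blend the incoming running stats with the statistics of the current batch."""
    n = len(batch)
    current_mean = sum(batch) / n
    # Population variance (divide by n), as in the ONNX specification.
    current_var = sum((x - current_mean) ** 2 for x in batch) / n
    running_mean = input_mean * momentum + current_mean * (1 - momentum)
    running_var = input_var * momentum + current_var * (1 - momentum)
    return running_mean, running_var


running_mean, running_var = update_running_stats(
    [1.0, 2.0, 3.0, 4.0], input_mean=0.0, input_var=1.0, momentum=0.9
)
```

Both ingredients named in the fix appear here: the momentum weighting and the incoming `input_mean` / `input_var`. Dropping either one yields wrong running statistics.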
2024-03-05 08:09:42 -08:00
guyang3532
cd56ea4a74
enable embedding sparse optimization by default (#19714)
2024-03-05 13:15:30 +08:00
wejoncy
7e613ee821
[quant] supports act_order inputs in Matmulnbits and new quantization algorithm "hqq" (#19106)
### Description
1. Support quantized GPTQ weight in huggingface like
[TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
2. Support Act_order for GPTQ
3. Support [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm to
quantize matmul weight and add quant script
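
As background for item 2: with act_order, GPTQ checkpoints carry a per-row group index (commonly named `g_idx`) instead of assuming that row `r` belongs to group `r // group_size`. The sketch below illustrates that lookup for a single weight column in plain Python. It is not the MatMulNBits implementation, and the function name and argument layout are assumptions.

```python
def dequantize_column(codes, scales, zero_points, group_size, g_idx=None):
    """Dequantize one weight column; `g_idx` overrides the default row-to-group map."""
    out = []
    for row, q in enumerate(codes):
        # Without act_order, groups are contiguous runs of `group_size` rows.
        group = g_idx[row] if g_idx is not None else row // group_size
        out.append((q - zero_points[group]) * scales[group])
    return out


codes = [8, 9, 10, 11]
scales = [1.0, 2.0]
zero_points = [8, 8]

contiguous = dequantize_column(codes, scales, zero_points, group_size=2)
reordered = dequantize_column(codes, scales, zero_points, group_size=2,
                              g_idx=[1, 0, 1, 0])
```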



### Motivation and Context
2024-03-05 11:45:45 +08:00
zhijiang
2a5c9b86eb
Zhijxu/fix conv1d replacement (#19758)
Remove the constraint "group number should be less than 3";
add more conditions to make sure the conv1d replacement only happens on
conv1d, not conv2d/conv3d;
add more tests.
2024-03-05 10:11:19 +08:00
Dmitri Smirnov
0cdf36faeb
Expose SessionOptions.DisablePerSessionThreads (#19730)
### Description

### Motivation and Context
ML.NET needs to run multiple sessions on a single threadpool.
2024-03-04 13:46:51 -08:00
raoanag
27b1dc91ab
[DML] MatrixMultiplyIntegerToFloat (#19608)
### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)

```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[       OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[  PASSED  ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```


### Motivation and Context
* `CalculateMatMulIntegerToFloat` to replace CPU EP run reference
* Added more FP32 testcases to isolate all input datatype combinations 
* Added fixed input to `MatMulIntegerToFloat_FP16*` test cases as for
FP16 test cases.
* `onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now
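
For orientation, the computation such a reference has to perform can be sketched in plain Python. This is not the `CalculateMatMulIntegerToFloat` test helper itself. It assumes per-tensor scales and zero points and a per-column bias, and the function name is hypothetical.

```python
def matmul_integer_to_float(a, b, a_scale, b_scale,
                            a_zero_point=0, b_zero_point=0, bias=None):
    """Integer matmul on zero-point-shifted inputs, then scale to float and add bias."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Accumulate in integers, as an integer matmul would.
            acc = sum((a[i][k] - a_zero_point) * (b[k][j] - b_zero_point)
                      for k in range(inner))
            value = acc * a_scale * b_scale
            if bias is not None:
                value += bias[j]
            out_row.append(value)
        out.append(out_row)
    return out


result = matmul_integer_to_float(
    a=[[1, 2], [3, 4]], b=[[5, 6], [7, 8]],
    a_scale=0.5, b_scale=0.25,
    a_zero_point=1, b_zero_point=5,
    bias=[1.0, -1.0],
)
```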
2024-03-04 11:55:35 -08:00
inisis
2e13d5f0ab
fix split shape inference error for opset >= 13 (#19756)
### Description
Get the Split operator's split sizes according to the opset.

### Motivation and Context
For opset 13 and higher, the split sizes are passed as an input rather than as an attribute.
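
The opset-dependent lookup can be sketched as below. This is an illustration, not the onnxruntime shape-inference code; the node is modelled as a plain attribute dict plus an input list, and `get_split_sizes` is a hypothetical helper.

```python
def get_split_sizes(opset, attributes, inputs):
    """Return the explicit split sizes, or None when the split should be even."""
    if opset >= 13:
        # Split-13 and later: the sizes arrive as the optional second input.
        return inputs[1] if len(inputs) > 1 else None
    # Before opset 13: the sizes are stored in the `split` attribute.
    return attributes.get("split")


old_style = get_split_sizes(11, {"axis": 0, "split": [2, 3]}, ["X"])
new_style = get_split_sizes(13, {"axis": 0}, ["X", [2, 3]])
even_split = get_split_sizes(13, {"axis": 0}, ["X"])
```

Reading the attribute on an opset 13+ model finds nothing, which is the kind of mismatch that leads to a wrong inferred shape.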
2024-03-04 09:41:36 -08:00
ironman
9acaf534a6
Benchmark - Updating llama-2 requirement files (#19716)
### Description



### Motivation and Context
2024-03-04 07:29:58 -08:00