onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-26 22:35:43 +00:00

Author	SHA1	Message	Date
wejoncy	22ad629cf7	[bug fix] dequantize 4bit (#19793)	2024-03-12 18:27:46 -07:00
Edward Chen	860eb762c2	[Apple framework] Fix minimal build with training enabled. (#19858 ) Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled. - training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build. - tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build. Added a test to cover this configuration.	2024-03-12 11:33:30 -07:00
Adrian Lizarraga	00c3cd497e	[QDQ Quantization] Refactor shared functionality into a base quantizer (#19817 ) ### Description This PR does not add or remove any functionality. It refactors common functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes into a new `BaseQuantizer` class. This change helps decouple the QDQ quantizer from other quantization modes and makes it easier to determine if a change to one quantization mode will impact another. ### Motivation and Context An upcoming PR aims to add mixed-precision support to QDQ models (e.g., one part of the graph uses u8 activations and another uses u16 activations). This change makes the upcoming PR smaller and should presumably make determining the impact on existing features more straightforward.	2024-03-12 10:47:09 -07:00
Ye Wang	7f0520cdf9	bug fix to multi-cudagraph (#19856) ### Description Make run_count_before_capture_ graph_id aware: fix the bug by adding a map to retrieve the run_count_ for each graph_id.	2024-03-12 10:33:37 -07:00
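The per-graph bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CUDA EP's C++ code: the class name `GraphRunCounter`, its methods, and the threshold of two runs are all hypothetical.

```python
from collections import defaultdict


class GraphRunCounter:
    """Hypothetical bookkeeping: one run counter per graph_id."""

    def __init__(self, min_runs_before_capture: int = 2):
        # Assumed threshold, chosen for illustration only.
        self.min_runs_before_capture = min_runs_before_capture
        # graph_id -> number of regular runs seen so far
        self._run_counts = defaultdict(int)

    def record_run(self, graph_id: int) -> None:
        self._run_counts[graph_id] += 1

    def ready_to_capture(self, graph_id: int) -> bool:
        # Each graph is judged by its own count, not by a shared counter.
        return self._run_counts[graph_id] >= self.min_runs_before_capture


counter = GraphRunCounter()
counter.record_run(graph_id=0)
counter.record_run(graph_id=0)
counter.record_run(graph_id=1)
```

With a single shared counter, graph 1 would look ready after three total runs; keyed by graph_id, only graph 0 is.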
zz002	319159b7bd	[VitisAI]set-data_loaction-as-default-when-load-external-data (#19712) ### Description Set data_location as the default when loading external data. Fix the VitisAI EP being unable to get CustomOps registered through session options. ### Motivation and Context Daily VitisAI bug fixes. When using the fuse_qdq_GEMM or fuse_qdq_MATMUL pass, an error like the following occurs: Error Data of TensorProto ( tensor name: xxx) is stored externally and should not have data field.raw_data --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-03-12 10:27:14 -07:00
Bowen Bao	742595b885	Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853 ) ### Description Always set `use_io_binding=True` when using optimum.onnxruntime unless there is a special case. ### Motivation and Context By default, `ORTModel` under optimum.onnxruntime will choose the appropriate `use_io_binding` value based on provider and use cases. > use_io_binding (`Optional[bool]`, defaults to `None`): > Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to > `True` if the execution provider is CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM], defaults to `True` on CPUExecutionProvider, > in all other cases defaults to `False`. For Llama token benchmark, using iobinding yields almost 2x speedup, even on CPU. This is because this particular model yields a large number of outputs (>60). Without iobinding, a copy is performed for each output from ortvalue to numpy array. This adds significant overhead to the overall run time. ``` Evaluating Llama2 `model(inputs)` step with past_key_values Before, w/o iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.4518657898902893 s Throughput: 2.2130464894073856 tps After, w/ iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.2662619352340698 s Throughput: 3.7557001871893703 tps ```	2024-03-12 09:41:11 -07:00
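As a quick check of the headline figure, the speedup can be recomputed from the benchmark numbers quoted above. The variable names are illustrative; the four values are copied from the quoted output.

```python
# Numbers copied from the benchmark output quoted in the commit message.
latency_before_s = 0.4518657898902893
latency_after_s = 0.2662619352340698
throughput_before_tps = 2.2130464894073856
throughput_after_tps = 3.7557001871893703

# Speedup measured two ways; both should agree because throughput = 1 / latency here.
latency_speedup = latency_before_s / latency_after_s
throughput_speedup = throughput_after_tps / throughput_before_tps

print(f"latency speedup:    {latency_speedup:.3f}x")
print(f"throughput speedup: {throughput_speedup:.3f}x")
```

Both ratios land between 1.69 and 1.70, which matches the 1.69x in the title.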
Yi Zhang	d4fa4f0276	Remove FFmpeg to meet compliance (#19859 )	2024-03-12 09:06:59 -07:00
pengwa	3fb8905393	Fix torch cpp extension build warnings (#19842 ) ### Fix torch cpp extension build warnings For the warnings shown as below: ``` cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from 
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ [5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so 
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so ``` Fix by replacing the existing `PyObject_GetAttrString` with `PyObject_FastGetAttrString`, which claims to be faster in its implementation comment.	2024-03-12 10:51:30 +08:00
pengwa	3e954da3e6	Fix and enable few ORTModule Unit Tests (#19847) ### Fix and enable few ORTModule Unit Tests Fix 'test_bert_inputs_with_dynamic_shape' and 'test_bert_result_with_layerwise_recompute', which generate NaN loss in the ORT run. The root cause is that the logic to generate attention mask test data is incorrect: only 0 or 1 is allowed in the dataset, but we see lots of other numbers. (The reason we don't hit this with older versions of transformers, for example v4.4.2 or 4.16.2, is that they don't contain `d3cb28886a`, which increases the scaling to a bigger number, causing an overflow to inf.) Another improvement made during the investigation using convergence tools: don't dump the activations during the model export phase; otherwise, the dumped data might contain some PyTorch run results, which is confusing when comparing with stock PyTorch run results.	2024-03-12 10:49:19 +08:00
Vincent Wang	0c078dfc8b	Some Shape Related Fusions (#19832) This PR adds the shape-related fusions below, which are helpful for some transformer models: - ShapeInputMerge merges the input NodeArgs of all Shape nodes into a single one (the first one in topological order) if they have the same shape value. This helps CSE fusion merge more nodes. - CSE fusion now supports a scalar tensor as an attribute value. This is mainly to support the ConstantOfShape node.	2024-03-12 10:29:27 +08:00
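The ShapeInputMerge idea can be illustrated with a small, self-contained Python sketch. It assumes each Shape node is described by a hypothetical `(node_name, input_arg, known_shape)` tuple listed in topological order; the real transformer works on ORT graph objects, which are not modeled here.

```python
from typing import Dict, List, Tuple

ShapeValue = Tuple[object, ...]

# Hypothetical Shape nodes in topological order: (node name, input arg, known shape of that input).
shape_nodes: List[Tuple[str, str, ShapeValue]] = [
    ("shape_0", "x", ("batch", 128)),
    ("shape_1", "y", ("batch", 128)),
    ("shape_2", "z", ("batch", 64)),
]


def merge_shape_inputs(nodes: List[Tuple[str, str, ShapeValue]]) -> List[Tuple[str, str]]:
    """Rewire every Shape node to the first input arg that has the same shape value."""
    first_arg_for_shape: Dict[ShapeValue, str] = {}
    rewired = []
    for node_name, input_arg, shape in nodes:
        # setdefault keeps the first arg seen for this shape value.
        canonical_arg = first_arg_for_shape.setdefault(shape, input_arg)
        rewired.append((node_name, canonical_arg))
    return rewired


merged = merge_shape_inputs(shape_nodes)
```

After the rewiring, `shape_0` and `shape_1` read the same input, so a later common-subexpression pass can treat them as duplicates.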
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723) ### Description If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units, we process the logical nodes they represent instead of individual nodes. This ensures we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
Hector Li	cba605e845	Fix Clip op builder for FP16 support (#19825 ) ### Description Fix Clip op builder for FP16 support. ### Motivation and Context Enables mobilenet v2 FP16 model inference on HTP	2024-03-11 16:39:41 -07:00
raoanag	89aa4697b1	[DML] QAttention (#19766) ### Description DML Implementation for [com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention) --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-11 10:44:34 -07:00
Changming Sun	5479124834	Remove remaining Windows ARM32 build jobs (#19840 ) ### Description As a follow up of #19788, remove more remaining Windows ARM32 build jobs. ### Motivation and Context Our nuget packaging pipeline is failing because it could not find an artifact for Win ARM32. ``` ##[error]Artifact onnxruntime-training-win-arm was not found for build 421397. ``` Deprecation of Win ARM32 was announced by Windows team in January 2023. We should follow it.	2024-03-11 11:25:11 +08:00
Changming Sun	efad5bbc5a	Replace some old file system calls with C++17 std::filesystem APIs. (#19196) ### Description 1. Replace some old file system calls with C++17 std::filesystem APIs. 2. Remove the tensorflow_C_PACKAGE_PATH cmake option, which was only used in onnxruntime_perf_test and whose code is no longer maintained. 3. Exclude onnx_test_runner and onnxruntime_perf_test from the iOS build because the C++17 filesystem library is not available there.	2024-03-09 09:17:36 -08:00
raoanag	fa73d7cbf9	[DML] DynamicQuantizeMatMul (#19763 ) ### Description DML Implementation for [com.microsoft.DynamicQuantizeMatMul ](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.DynamicQuantizeMatMul) ``` .\onnxruntime_test_all.exe --gtest_filter="DynamicQuantizeMatMul." Note: Google Test filter = DynamicQuantizeMatMul. [==========] Running 10 tests from 1 test suite. [----------] Global test environment set-up. [----------] 10 tests from DynamicQuantizeMatMul [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 (635 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 (514 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 (505 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 (526 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 (504 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 (512 ms) [ RUN ] DynamicQuantizeMatMul.UInt8_test_with_empty_input [ OK ] DynamicQuantizeMatMul.UInt8_test_with_empty_input (112 ms) [ RUN ] DynamicQuantizeMatMul.B_PerColumn_ND [ OK ] DynamicQuantizeMatMul.B_PerColumn_ND (348 ms) [----------] 10 tests from DynamicQuantizeMatMul (4685 ms total) [----------] Global test environment tear-down [==========] 10 tests from 1 test suite ran. (4686 ms total) [ PASSED ] 10 tests. 
memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context - CalculateDynamicQuantizeMatMul to replace CPU EP run reference - Added more FP32 testcases to isolate all input datatype combinations --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-08 15:35:10 -08:00
Sheil Kumar	7deee944c0	Implement STFT Decomposition transformer (#19725) Implement STFT Decomposition transformer. Certain hardware does not support DXIL, so the existing operator should be mapped to hardware-supported functions. Optimized convolution can be used to implement STFT. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2024-03-08 15:02:58 -08:00
Yifan Li	069d2d6f54	[EP Perf] Update EP Perf dockerfiles with cuda12/cudnn9 (#19781 ) ### Description * Update name of existing dockerfiles and add support to test latest TensorRT EA binary located in the image * Add cuda 12.3/cuDNN 9/TensorRT 8.6 dockerfile * Add detail to CI prompts and configs Instruction to test latest TRT via BIN: 1. Select `BIN` in TensorRT Version 2. In Variables, update related tarCudaVersion, clear tarCudnnVersion (not required in latest TRT tar binary) , and path to binary.	2024-03-08 13:58:22 -08:00
Yifan Li	3170a48e60	[EP Perf] Add tag to indicate which TRT parser is using (#19784 ) ### Description * Add tag to distinguish if TRT `builtin` or `oss` parser is being used * `oss` tag will be inserted with onnx-tensorrt commit id, to indicate which version oss parser is ### Validate DB entry before/after this PR (during test, `builtin` or `oss_{commit_id}` tag was inserted in the database entries): ### Motivation and Context To distinguish perf results using builtin/oss parser in the database, this parser tag is needed. In future, results using different parsers will be listed in different Perf Dashboard pages.	2024-03-08 10:24:36 -08:00
Scott McKay	01c376a0b9	Update script to run CIs for a branch. (#19797) ### Description - Support multiple include/exclude values. - e.g. can now run with `-i MacOS -i iOS` to run CIs for both Apple platforms. - Default to current branch if run from directory in repo. - make lazier usage possible ### Motivation and Context Improve tools. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-08 17:52:47 +10:00
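One common way to accept a repeated flag such as `-i MacOS -i iOS` in Python is argparse's `append` action. The sketch below is an assumption about how such a CLI could look, not the script's actual code: only `-i` appears in the commit message, while the long option names, `-x`, and the help texts are hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical CI runner CLI sketch.")
# action="append" collects every occurrence of the flag into one list.
parser.add_argument("-i", "--include", action="append", default=[],
                    help="Run only pipelines whose name contains this value (repeatable).")
parser.add_argument("-x", "--exclude", action="append", default=[],
                    help="Skip pipelines whose name contains this value (repeatable).")

# Equivalent to invoking the script with: -i MacOS -i iOS
args = parser.parse_args(["-i", "MacOS", "-i", "iOS"])
print(args.include, args.exclude)
```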
Satya Kumar Jandhyala	24b72d2613	[JS/WebGPU] Preserve zero size input tensor dims. (#19737) ### Description For the Concat operation, the shape of a zero-size input tensor needs to be preserved and, unlike non-zero tensors, its dims are not constrained to match other input tensors' dims.	2024-03-07 19:07:49 -08:00
Scott McKay	6c3bed6740	Run CoreML EP with NeuralNetwork and ML Program in CI unit tests (#19796) ### Description Add synthetic CoreML EP name to the list of providers so we test with NeuralNetwork and MLProgram model types. ### Motivation and Context Automatically test new MLProgram support in CI	2024-03-08 12:50:13 +10:00
Dmitri Smirnov	2964352641	Implement IsNaN-9,13,20 for CUDA along with tests (#19807 ) ### Description ### Motivation and Context Some models require IsNan CUDA along with training	2024-03-07 15:46:11 -08:00
Yi-Hong Lyu	33578cc76e	Remove memset for the case no any mask (#19823 ) Improved OCR model speed by 1.034 end-to-end, by eliminating unnecessary memset when no mask is present.	2024-03-07 13:54:16 -08:00
Jambay Kinley	3dfce2f1cd	Fix argparser in `matmul_bnb4_quantizer` (#19812) ### Description The argparser had incorrectly used `description` and `options` instead of `help` and `choices`. ### Motivation and Context Fixes: #19751	2024-03-07 11:31:34 -08:00
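For reference, `argparse.ArgumentParser.add_argument` takes per-argument text through `help` and restricts accepted values through `choices`. The following minimal sketch shows the corrected keyword usage; the argument name `--quant_type`, its values, and the help text are hypothetical and not taken from the quantizer script.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical quantizer CLI sketch.")
parser.add_argument(
    "--quant_type",      # hypothetical argument name
    type=int,
    default=1,
    choices=[0, 1],      # `choices` restricts the accepted values
    help="Quantization data type (illustrative values only).",  # per-argument text goes in `help`
)

args = parser.parse_args(["--quant_type", "0"])
print(args.quant_type)
```

A value outside `choices` makes `parse_args` print a usage error and exit, which is the validation the fix restores.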
Ye Wang	72ce4de07d	cuda graph enhancement (#19636) ### Description 1. Add a config key in run_options to control CUDA graph at runtime. 2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT session. 3. Provide a model modification/inference example on Phi2. 4. Benchmark shows an average of 13% latency reduction in token generation. Limitation: the TRT EP and ROCm EP have not adopted this feature yet. We can revisit this in the future.	2024-03-07 10:15:18 -08:00
Tianlei Wu	bff4f8bf75	Update tolerance of provider tests to fix flaky tests (#19792 ) ### Description Check float/double/float16/bfloat16 tensors are close like [numpy.isclose](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html). ``` absolute(a - b) <= (atol + rtol * absolute(b)) ``` The default tolerance thresholds: - float: atol=1e-5 and rtol=1e-4 - float16: atol=0.0025 and rtol=0.001 - bfloat16: atol=0.02 and rtol=0.01 ### Motivation and Context Current pipeline has frequent failure due to using only relative tolerance in https://github.com/microsoft/onnxruntime/pull/19608: [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 1: C:\a\_work\1\s\onnxruntime\test\providers\checkers.cc(272): error: The difference between cur_expected[i] and cur_actual[i] is 1.3113021850585938e-06, which exceeds (params.relative_error) std::abs(cur_expected[i]), where 1: cur_expected[i] evaluates to -1.3113021850585938e-06, 1: cur_actual[i] evaluates to 0, and 1: (params.relative_error) std::abs(cur_expected[i]) evaluates to 2.6226043559063328e-08. It is not reasonable to use relative tolerance for a small value very close to 0. Combining relative tolerance with a positive absolute tolerance could avoid such issue.	2024-03-06 17:47:17 -08:00
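The failing case quoted above can be replayed with the combined tolerance formula. This is a plain-Python sketch, not the test utility's code: the helper name `is_close` is hypothetical, and the relative error of 0.02 is inferred from the two numbers in the log rather than stated in the message.

```python
def is_close(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """numpy.isclose-style check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values from the failing MatMulIntegerToFloat log line.
expected = -1.3113021850585938e-06
actual = 0.0

# Relative tolerance only (0.02 is inferred from the log): fails near zero.
relative_only = is_close(actual, expected, atol=0.0, rtol=0.02)

# New float defaults from the PR description: atol=1e-5, rtol=1e-4.
combined = is_close(actual, expected, atol=1e-5, rtol=1e-4)

print(relative_only, combined)
```

The absolute term dominates for values this close to zero, which is why adding it removes the flakiness.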
pengwa	5c5d6e99ce	Define recomputable op list with domain/opset (#19722) ### Define recomputable op list with domain/opset Originally, we just checked the OpType to decide whether an op is recomputable. In this PR, a few improvements are made: 1. [Op type search] Domain + OpType are used to check whether the op supports recompute. 2. [Opset search] Then, node.SinceVersion() is looked up in the supported opsets. 3. During subgraph detection, if the node's opset is supported, get the ignorable input indices, i.e. the inputs that are not considered in the bottom-up search. This saves time during subgraph detection.	2024-03-07 09:12:12 +08:00
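The two-level lookup described above (domain plus op type first, then opset) might be organized roughly as follows. This is a Python sketch of the idea only: the table contents, the name `RECOMPUTABLE_OPS`, and the helper `ignorable_inputs` are made up for illustration and do not reproduce the real C++ table.

```python
from typing import Dict, List, Optional, Tuple

# (domain, op_type) -> {since_version: ignorable input indices}
# Every entry here is a made-up example.
RECOMPUTABLE_OPS: Dict[Tuple[str, str], Dict[int, List[int]]] = {
    ("", "Reshape"): {13: [1], 14: [1]},
    ("", "Add"): {13: [], 14: []},
    ("com.microsoft", "BiasGelu"): {1: []},
}


def ignorable_inputs(domain: str, op_type: str, since_version: int) -> Optional[List[int]]:
    """Return ignorable input indices, or None if the op is not recomputable."""
    opsets = RECOMPUTABLE_OPS.get((domain, op_type))  # step 1: domain + op type
    if opsets is None:
        return None
    return opsets.get(since_version)                  # step 2: opset version


print(ignorable_inputs("", "Reshape", 13))
```

Returning `None` for an unknown domain, op type, or opset lets the bottom-up search stop at that node.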
Wanming Lin	1ce5bfb0ec	[WebNN EP] Make sure optional input is provided (#19686) Some optional inputs are presented as an empty string, so we should check not only that the input size is correct but also that the optional input is not empty. For example, the Pad node has an empty optional input in the sam-b-encoder.onnx model.	2024-03-06 16:19:59 -08:00
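The combined check can be sketched in Python as below. The helper name `has_optional_input` and the example input list are hypothetical; the WebNN EP itself implements this in C++ against ORT node objects.

```python
from typing import Sequence


def has_optional_input(input_names: Sequence[str], index: int) -> bool:
    """An optional input counts as provided only if it is in range AND non-empty."""
    return index < len(input_names) and input_names[index] != ""


# Hypothetical Pad node inputs: the third (optional) input is skipped via an empty name.
pad_inputs = ["data", "pads", "", "axes"]

print(has_optional_input(pad_inputs, 2))  # skipped optional input
print(has_optional_input(pad_inputs, 3))  # provided optional input
```

Checking only `len(pad_inputs) > 2` would wrongly treat the skipped input as present.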
Markus Tavenrath	f2dc725b33	Add SpaceToDepth and DepthToSpace CUDA NHWC Ops (#19646) ### Description - Adding CUDA NHWC support for SpaceToDepth and DepthToSpace - Add a new test which verifies that SpaceToDepth swizzling for the H axis is correct. - If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as well. ### Motivation and Context Adding more NHWC operations to avoid layout transformations when using the CUDA EP for more efficiency.	2024-03-06 12:35:55 -08:00
aciddelgado	8bd1335d00	Fix GQA Rotary Embedding sequence length (#19801) ### Description Previously, GQA incorrectly enforced that the rotary cos and sin caches have a sequence length equal to the present sequence length. Now it enforces that the cache sequence length is greater than or equal to the present sequence length, since the cache should be of max_sequence_length to match the Rotary Embedding op. ### Motivation and Context Fixes an issue with fusing Rotary Embedding and GQA for certain models which prefer this optimization.	2024-03-06 12:34:33 -08:00
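The relaxed validation amounts to replacing an equality test with a lower bound. A minimal Python sketch, assuming a hypothetical helper `check_rotary_cache` and plain integer lengths (the real check lives in the GQA operator's C++ input validation):

```python
def check_rotary_cache(cache_seq_len: int, present_seq_len: int) -> None:
    """Accept any cos/sin cache at least as long as the present sequence."""
    # Old behavior (too strict): cache_seq_len == present_seq_len
    if cache_seq_len < present_seq_len:
        raise ValueError(
            f"rotary cache length {cache_seq_len} is shorter than "
            f"present sequence length {present_seq_len}"
        )


# A cache sized to max_sequence_length (illustrative value) now passes.
check_rotary_cache(cache_seq_len=4096, present_seq_len=512)
```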
Hector Li	db8d0c8e06	reset dcvsEnable for different HTP performance mode (#19728 ) reset dcvsEnable for different HTP performance mode	2024-03-06 11:21:19 -08:00
Changming Sun	f9a92e589a	Upgrade the Windows SDK version that is used in WindowsAI Nuget Packaging pipeline (#19786 ) ### Description 1. Upgrade the version from 10.0.19041.0 to 10.0.22621.0. The old one misses some macros that are needed by PyTorch's CPUINFO 2. Also update cmake. ### Motivation and Context In PR #19655 I added CPUINFO to all Windows builds, but forgot to test this pipeline.	2024-03-06 09:10:35 -08:00
pengwa	d9bf85613d	Adapt memory optimizer to fit PHI2 (#19757) ### Adapt memory optimizer to fit PHI2 A few improvements and bug fixes: 1. Fix a bug related to transformer layer detection. 2. Use the default reversed topological order to create recompute nodes, to avoid leaf nodes being handled too late and therefore getting the lowest execution priority. 3. Add an early stop when an activation's element count is constant and the total element count is < 1M. This avoids the overhead of searching subgraphs. Using `export ORTMODULE_MEMORY_OPT_LEVEL=1` to enable layerwise recompute on the given recipe, memory consumption dropped from ~22GB to ~13GB.	2024-03-06 21:54:16 +08:00
Ashwini Khade	e93a860819	Remove arm build for training (#19788 ) We no longer support Win arm 32 so removing the associated build and packaging job.	2024-03-05 21:54:48 -08:00
Scott McKay	db59cec82f	Don't reduce warning level for CUDA build on Windows (#19663) ### Description Address warnings so all the ORT projects build with /W4 on Windows. Mainly - unused parameters - variables shadowing other ones ### Motivation and Context #19588 started on this.	2024-03-06 15:03:55 +10:00
Yulong Wang	a788514027	[js/web] dump debug logs for karma for diagnose purpose (#19785 ) ### Description dump debug logs for karma for diagnose purpose. This is for debugging the CI issue of Chrome launch failure and considered temporary.	2024-03-05 18:27:26 -08:00
Vincent Wang	1bfc26685b	ATen Op Supports Int Return Type and CPU Tensor Arguments (#19773 ) This PR: - add support for int as return type, will create a CPU scalar tensor for it. - add attributes to specify which arguments or returns are CPU tensors. - adjust ATen efficient attn to match latest PyTorch native function. - a Triton codegen bugfix by the way.	2024-03-06 10:11:46 +08:00
pengwa	d102569755	Fix seed for recomputed Dropout (#19715) ### Fix seed for recomputed Dropout If a Dropout node is recomputed in the backward pass, we should make sure its execution is the same as the run in the forward pass. If we don't set the seed attribute, this cannot be guaranteed. Add `export ORTMODULE_MEMORY_OPT_LEVEL=2` to enable per-layer recompute with compromised recomputable subgraphs.	2024-03-06 10:06:25 +08:00
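Why a fixed seed matters for recompute can be shown with a toy dropout mask in plain Python. The helper `dropout_mask` and the seed value are hypothetical; this uses Python's `random` module rather than ORT's Dropout kernel, so it only illustrates the reproducibility argument.

```python
import random
from typing import List


def dropout_mask(num_elements: int, ratio: float, seed: int) -> List[bool]:
    """Toy dropout mask: True means the element is kept."""
    rng = random.Random(seed)  # a private generator seeded explicitly
    return [rng.random() >= ratio for _ in range(num_elements)]


# Forward pass and recomputation in the backward pass share the same seed,
# so they drop exactly the same elements.
forward_mask = dropout_mask(8, ratio=0.5, seed=1234)
recomputed_mask = dropout_mask(8, ratio=0.5, seed=1234)

print(forward_mask == recomputed_mask)
```

Without a shared seed the recomputed mask would generally differ, and the gradients would be computed against activations the forward pass never produced.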
Chi Lo	d9730c7f43	[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575) When the DDS output is an empty tensor (i.e. any of its dimensions is 0), the TRT EP performs neither cudaMemcpyAsync() nor cuda::Impl_Cast(), to prevent accidentally overwriting other locations that might belong to other tensors. This PR also refactors the code to only allocate a single byte for all empty tensors. #TODO: add unit tests to cover the DDS code paths or do more testing with concurrent, sequential, threaded faster-rcnn using onnx_test_runner and verifying outputs --------- Co-authored-by: Chi Lo <lochi@microsoft.com>	2024-03-05 14:39:36 -08:00
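The "empty tensor" condition used above (any dimension equal to 0) is easy to state in code. A small Python sketch with hypothetical helper names, unrelated to the TRT EP's actual C++ implementation:

```python
from math import prod
from typing import Sequence


def is_empty_tensor(shape: Sequence[int]) -> bool:
    """A tensor is empty when any of its dimensions is 0."""
    return any(dim == 0 for dim in shape)


def element_count(shape: Sequence[int]) -> int:
    """Number of elements; 0 for empty tensors, 1 for a scalar (shape [])."""
    return prod(shape)


dds_output_shape = [2, 0, 3]  # illustrative data-dependent output shape
if is_empty_tensor(dds_output_shape):
    print("skip the device copy: nothing to transfer")
```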
Dmitri Smirnov	1e78bcea60	Implement CUDA IsInf-10,20 (#19772) ### Description Implement IsInf-10,20 for CUDA. Add FP16 types also on CPU. ### Motivation and Context Certain models lag in performance due to IsInf not being available on CUDA.	2024-03-05 13:33:01 -08:00
Chen Fu	06e684c9f2	Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619) ### Description Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is specially optimized for Nvidia Ampere GPUs. ### Motivation and Context Trying to improve quantized LLM inference performance on Nvidia Ampere GPUs ### Note: This is implemented by extending CUTLASS, so it has a hard dependency on CUTLASS. However, in the current build system, loading of the CUTLASS dependency is guarded with: (onnxruntime_USE_FLASH_ATTENTION OR onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION) If both of these options are turned off, then compilation will fail. Why is the CUTLASS dependency guarded at all? It's a header-only library that does not introduce any binary code if not instantiated. What's the downside of removing all the guards and just including CUTLASS unconditionally?	2024-03-05 09:37:45 -08:00
Markus Tavenrath	bdf678df93	Fix CUDA BatchNorm bugs and add support for NHWC (#19742 ) ### Description - Fix incorrect running_mean / running_var in training mode due to incorrect momentum and missing input mean/var. running_var could be correct, but has too high an epsilon. - Fix incorrect checks when using NHWC - Pass NHWC flag to NormalizeDims to get correct new dimensions from x_shape - Register missing double operations to get parity between NHWC/NCHW	2024-03-05 08:09:42 -08:00
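To show why both the momentum and the input mean/var matter for the running statistics, here is a small sketch of the training-mode update as defined by the ONNX `BatchNormalization` operator (running value = input value × momentum + current batch value × (1 − momentum), with the population variance of the batch). The function name and the single-channel, flat-list input are simplifications assumed for illustration; this is not the CUDA code from the commit.

```python
def update_running_stats(batch, input_mean, input_var, momentum=0.9):
    """Return (running_mean, running_var) for one channel of a batch.

    Hypothetical helper following the ONNX BatchNormalization training formula.
    """
    n = len(batch)
    current_mean = sum(batch) / n
    # Population variance (divide by n, not n - 1).
    current_var = sum((x - current_mean) ** 2 for x in batch) / n
    running_mean = input_mean * momentum + current_mean * (1.0 - momentum)
    running_var = input_var * momentum + current_var * (1.0 - momentum)
    return running_mean, running_var


running_mean, running_var = update_running_stats(
    [1.0, 2.0, 3.0, 4.0], input_mean=0.0, input_var=1.0, momentum=0.9
)
print(running_mean, running_var)
```

Dropping the `input_mean` / `input_var` terms or swapping the roles of `momentum` and `1 - momentum` gives visibly different running statistics, which is the class of error the commit describes.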
guyang3532	cd56ea4a74	enable embedding sparse optimization by default (#19714 )	2024-03-05 13:15:30 +08:00
wejoncy	7e613ee821	[quant] supports act_order inputs in MatMulNBits and new quantization algorithm "hqq" (#19106 ) ### Description 1. Support quantized GPTQ weights from Hugging Face, such as [TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ) 2. Support act_order for GPTQ 3. Support the [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm to quantize MatMul weights, and add a quantization script	2024-03-05 11:45:45 +08:00
zhijiang	2a5c9b86eb	Zhijxu/fix conv1d replacement (#19758 ) Remove the constraint "group number should be less than 3"; add more conditions to make sure the conv1d replacement only happens on conv1d and not on conv2d/conv3d; add more tests.	2024-03-05 10:11:19 +08:00
Dmitri Smirnov	0cdf36faeb	Expose SessionOptions.DisablePerSessionThreads (#19730 ) ### Description ### Motivation and Context ML.NET needs to run multiple sessions on a single threadpool.	2024-03-04 13:46:51 -08:00
raoanag	27b1dc91ab	[DML] MatrixMultiplyIntegerToFloat (#19608 ) ### Description DML Implementation for [com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat) ``` .\onnxruntime_test_all.exe --gtest_filter="MatMulIntegerToFloat." Note: Google Test filter = MatMulIntegerToFloat. [==========] Running 22 tests from 1 test suite. [----------] Global test environment set-up. [----------] 22 tests from MatMulIntegerToFloat [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 [ OK ] 
MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms) [ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint [ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms) [----------] 22 tests from MatMulIntegerToFloat (8679 ms total) [----------] Global test environment tear-down [==========] 22 tests from 1 test suite ran. (8680 ms total) [ PASSED ] 22 tests. memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> * `CalculateMatMulIntegerToFloat` to replace CPU EP run reference * Added more FP32 testcases to isolate all input datatype combinations * Added fixed input to `MatMulIntegerToFloat_FP16` test cases as for FP16 test cases. 
`onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of generating FP16 models, but we do not produce any for now	2024-03-04 11:55:35 -08:00
inisis	2e13d5f0ab	fix split shape inference error for opset >= 13 (#19756 ) ### Description Get the Split operator's split sections according to the opset. ### Motivation and Context For opset 13 and higher, the split sections are passed as an input rather than an attribute.	2024-03-04 09:41:36 -08:00
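The opset-dependent lookup described in this commit can be sketched as follows. The dictionary-based node representation, the `initializers` mapping, and the function name are all hypothetical stand-ins chosen to keep the example self-contained; they are not the data structures used in the actual fix.

```python
def get_split_sizes(node, opset, initializers):
    """Return the list of split sizes for a Split node, or None if unknown.

    node: dict with "inputs" (list of names) and "attributes" (dict).
    initializers: dict mapping constant tensor names to Python lists.
    Both are illustrative structures, not ONNX Runtime types.
    """
    if opset >= 13:
        # From opset 13 on, `split` is the optional second input.
        inputs = node["inputs"]
        if len(inputs) > 1 and inputs[1]:
            return initializers.get(inputs[1])  # None when not a known constant
        return None
    # Before opset 13, `split` is an attribute on the node.
    return node["attributes"].get("split")


old_node = {"inputs": ["x"], "attributes": {"axis": 0, "split": [2, 3]}}
new_node = {"inputs": ["x", "sizes"], "attributes": {"axis": 0}}
constants = {"sizes": [2, 3]}

print(get_split_sizes(old_node, 11, constants))
print(get_split_sizes(new_node, 13, constants))
```

Reading only the attribute for a newer model would silently miss the sizes, which is the kind of shape inference error the commit title refers to.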
ironman	9acaf534a6	Benchmark - Updating llama-2 requirement files (#19716 )	2024-03-04 07:29:58 -08:00

11997 commits