onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-09 17:28:58 +00:00

Author	SHA1	Message	Date
kunal-vaishnavi	4ac98d6d65	Update replacing MultiHeadAttention with GroupQueryAttention (#19882 ) ### Description This PR updates the replacement of MultiHeadAttention (MHA) with GroupQueryAttention (GQA). It is related to the changes in [this PR](https://github.com/microsoft/onnxruntime/pull/18906). ### Motivation and Context The updated replacement of MHA with GQA includes the following fusion changes. - Apply sliding window within GQA - Fuse the rotary embeddings within GQA - Fuse the 3 MatMuls into 1 packed MatMul if possible - Fuse the 3 Adds into 1 packed Add if possible	2024-03-13 14:10:52 -07:00
aciddelgado	8eb49c5f00	fix gqa rotary dim 1 (#19874 ) ### Description GQA Rotary Dimension 1 incorrectly assumed to be based on head size. ### Motivation and Context This change should enable us to run phi-2 with GQA and Rotary Embedding fused.	2024-03-13 14:09:54 -07:00
Yulong Wang	e771a763c3	[js/test] align web test runner flags with ort.env (#19790 ) ### Description the `npm test` flags are difficult to memorize, because they are different to the `ort.env` flags. This change makes those flags align with ort JS API. eg. `--wasm-enable-proxy` became `--wasm.proxy`. Old flags are marked as deprecated except `-x` (as a shortcut of `--wasm.numThreads`)	2024-03-13 12:00:36 -07:00
Yi Zhang	d5d9dbd51d	reuse T4 on Linux GPU (#19879 ) ### Description ### Motivation and Context Linux GPU test on A10 isn't very stable	2024-03-13 10:41:36 -07:00
Satya Kumar Jandhyala	ed250b88c3	[JS/WebGPU] Optimize MatMulNBits (#19852 ) ### Description Use vec<2> or vec<4>, operands in MatMulNBits ### Motivation and Context Improve performance	2024-03-13 10:33:14 -07:00
Hariharan Seshadri	ed306b4f97	Fix Android CI pipeline (#19877 )	2024-03-13 10:09:43 -07:00
Justin Chu	faea42af95	Bump ruff to 0.3.2 and black to 24 (#19878 ) ### Motivation and Context Routine updates	2024-03-13 10:00:32 -07:00
Yi Zhang	9e0a0f0f32	Check whether required tests are executed. (#19884 ) ### Description Check that the onnx node tests and model tests actually ran. ### Motivation and Context The onnx node test data and model data are mounted in one dir, and onnxruntime_test_all searches the dir and loads the data. If the dir does not exist or there is some change in onnxruntime_test_all, those tests may not be executed. For example, all onnx node test data is 32M. It is hard for us to notice such a regression, so I added a simple check to ensure those tests are executed. --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-13 09:59:57 -07:00
Yi Zhang	7313aa4efe	Remove --extra-index-url (#19885 ) ### Description <!-- Describe your changes. --> ### Motivation and Context --extra-index-url is not allowed by injected Secure Supply Chain Step in packaging pipelines. ``` > Starting Multifeed Python Security Analysis: ##[warning]tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml - Found "extra-index-url". (https://aka.ms/cfs/pypi) ``` And those 2 packages can be installed from PyPI as well now. Co-authored-by: Yi Zhang <your@email.com>	2024-03-13 09:45:22 -07:00
Hector Li	60ad6c6409	Enable float32 model with FP16 precision for QNN HTP backend (#19863 ) ### Description Enable float32 model with FP16 precision for QNN HTP backend	2024-03-13 08:35:21 -07:00
George Wu	6579f74af0	skip onnx node_tests for tensorrt ep (#19880 ) fix build break caused by image update. tensorrt isn't expected to pass all onnx node tests.	2024-03-12 23:35:05 -07:00
Yang Gu	53de2d8cb0	[js/webgpu] Enable GroupedConvVectorize path (#19791 ) Vectorize hit 2 failing cases in a CI bot with an NVIDIA GPU, but we couldn't repro with any of the GPUs at hand, including NVIDIA GPUs. This PR introduces GPUAdapterInfo and enables this opt on non-NVIDIA GPUs to make the bots happy. No obvious perf gain can be seen if we enable vectorize on NVIDIA. However, it shows a big perf improvement on Intel. On my Gen12 Intel GPU, mobilenetv2-12 perf was improved from 11.14ms to 7.1ms.	2024-03-12 22:25:07 -07:00
Yulong Wang	4538d31a8b	[js/webgpu] expose a few properties in WebGPU API (#19857 ) ### Description This change exposes a few properties in `ort.env.webgpu` to resolve the feature request mentioned in https://github.com/microsoft/onnxruntime/pull/14579#discussion_r1519612619. - Add `powerPreference` and `forceFallbackAdapter` in `ort.env.webgpu`, to allow users to set the value of the properties before the first inference session is created. - Add readonly property `adapter` in `ort.env.webgpu` to allow users to get the adapter instance. Now users can access `ort.env.webgpu.device` and `ort.env.webgpu.adapter`. @xenova @beaufortfrancois	2024-03-12 19:50:51 -07:00
wejoncy	22ad629cf7	[bug fix] dequantize 4bit (#19793 )	2024-03-12 18:27:46 -07:00
Edward Chen	860eb762c2	[Apple framework] Fix minimal build with training enabled. (#19858 ) Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled. - training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build. - tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build. Added a test to cover this configuration.	2024-03-12 11:33:30 -07:00
Adrian Lizarraga	00c3cd497e	[QDQ Quantization] Refactor shared functionality into a base quantizer (#19817 ) ### Description This PR does not add or remove any functionality. It refactors common functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes into a new `BaseQuantizer` class. This change helps decouple the QDQ quantizer from other quantization modes and makes it easier to determine if a change to one quantization mode will impact another. ### Motivation and Context An upcoming PR aims to add mixed-precision support to QDQ models (e.g., one part of the graph uses u8 activations and another uses u16 activations). This change makes the upcoming PR smaller and should presumably make determining the impact on existing features more straightforward.	2024-03-12 10:47:09 -07:00
Ye Wang	7f0520cdf9	bug fix to multi-cudagraph (#19856 ) ### Description <!-- Describe your changes. --> run_count_before_capture_ is graph_id aware, fix the bug by adding a map to retrieve the run_count_ for each graph_id. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-12 10:33:37 -07:00
zz002	319159b7bd	[VitisAI]set-data_loaction-as-default-when-load-external-data (#19712 ) ### Description Set data_location as the default when loading external data. Fix the VitisAI EP being unable to get CustomOps registered via session_option. ### Motivation and Context Daily VitisAI bug fixes. When using the pass fuse_qdq_GEMM or fuse_qdq_MATMUL, an error like the following occurs: Error Data of TensorProto ( tensor name: xxx) is stored externally and should not have data field.raw_data --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-03-12 10:27:14 -07:00
Bowen Bao	742595b885	Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853 ) ### Description Always set `use_io_binding=True` when using optimum.onnxruntime unless there is a special case. ### Motivation and Context By default, `ORTModel` under optimum.onnxruntime will choose the appropriate `use_io_binding` value based on provider and use cases. > use_io_binding (`Optional[bool]`, defaults to `None`): > Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to > `True` if the execution provider is CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM], defaults to `True` on CPUExecutionProvider, > in all other cases defaults to `False`. For Llama token benchmark, using iobinding yields almost 2x speedup, even on CPU. This is because this particular model yields a large number of outputs (>60). Without iobinding, a copy is performed for each output from ortvalue to numpy array. This adds significant overhead to the overall run time. ``` Evaluating Llama2 `model(inputs)` step with past_key_values Before, w/o iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.4518657898902893 s Throughput: 2.2130464894073856 tps After, w/ iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.2662619352340698 s Throughput: 3.7557001871893703 tps ```	2024-03-12 09:41:11 -07:00
Yi Zhang	d4fa4f0276	Remove FFmpeg to meet compliance (#19859 )	2024-03-12 09:06:59 -07:00
pengwa	3fb8905393	Fix torch cpp extension build warnings (#19842 ) ### Fix torch cpp extension build warnings For the warnings shown as below: ``` cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from 
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ [5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so 
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so ``` Fix by replacing existing `PyObject_GetAttrString` with `PyObject_FastGetAttrString`, which claims to be faster in its implementation comment.	2024-03-12 10:51:30 +08:00
pengwa	3e954da3e6	Fix and enable few ORTModule Unit Tests (#19847 ) ### Fix and enable few ORTModule Unit Tests Fix 'test_bert_inputs_with_dynamic_shape' and 'test_bert_result_with_layerwise_recompute', which generated NaN loss in the ORT run. The root cause is that the logic to generate attention mask test data is not correct: only 0 or 1 is allowed in the dataset, but we see lots of other numbers. (The reason we don't hit this with old versions of transformers, for example v4.4.2 or 4.16.2, is that they don't contain `d3cb28886a`, which increases the scaling to a bigger number, causing an overflow to inf.) Another improvement made during the investigation using convergence tools: don't dump the activations during the model export phase; otherwise, the dumped data might contain some PyTorch run results, which is confusing when comparing with stock PyTorch run results.	2024-03-12 10:49:19 +08:00
Vincent Wang	0c078dfc8b	Some Shape Related Fusions (#19832 ) This PR adds below shape related fusions, which is helpful for some transformer models: - ShapeInputMerge is to merge all Shape nodes' input NodeArg to a single one (the 1st one on topo order) if they have the same shape value. This helps CSE fusion to merge more nodes. - CSE fusion to support scalar tensor as attribute value. This is mainly to support ConstantOfShape node.	2024-03-12 10:29:27 +08:00
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723 ) ### Description If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units, we process the logical nodes they represent instead of individual nodes. This ensures we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
Hector Li	cba605e845	Fix Clip op builder for FP16 support (#19825 ) ### Description Fix Clip op builder for FP16 support. ### Motivation and Context Enables mobilenet v2 FP16 model inference on HTP	2024-03-11 16:39:41 -07:00
raoanag	89aa4697b1	[DML] QAttention (#19766 ) ### Description DML Implementation for [com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-11 10:44:34 -07:00
Changming Sun	5479124834	Remove remaining Windows ARM32 build jobs (#19840 ) ### Description As a follow up of #19788, remove more remaining Windows ARM32 build jobs. ### Motivation and Context Our nuget packaging pipeline is failing because it could not find an artifact for Win ARM32. ``` ##[error]Artifact onnxruntime-training-win-arm was not found for build 421397. ``` Deprecation of Win ARM32 was announced by Windows team in January 2023. We should follow it.	2024-03-11 11:25:11 +08:00
Changming Sun	efad5bbc5a	Replace some old file system calls with C++17 std::filesystem APIs. (#19196 ) ### Description 1. Replace some old file system calls with C++17 std::filesystem APIs. 2. Remove the tensorflow_C_PACKAGE_PATH cmake option, which was only used in onnxruntime_perf_test, where the related code is no longer maintained. 3. Exclude onnx_test_runner and onnxruntime_perf_test from the iOS build because the C++17 filesystem library is not available there	2024-03-09 09:17:36 -08:00
raoanag	fa73d7cbf9	[DML] DynamicQuantizeMatMul (#19763 ) ### Description DML Implementation for [com.microsoft.DynamicQuantizeMatMul ](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.DynamicQuantizeMatMul) ``` .\onnxruntime_test_all.exe --gtest_filter="DynamicQuantizeMatMul." Note: Google Test filter = DynamicQuantizeMatMul. [==========] Running 10 tests from 1 test suite. [----------] Global test environment set-up. [----------] 10 tests from DynamicQuantizeMatMul [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 (635 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 (514 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 (505 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 (526 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 (504 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 (512 ms) [ RUN ] DynamicQuantizeMatMul.UInt8_test_with_empty_input [ OK ] DynamicQuantizeMatMul.UInt8_test_with_empty_input (112 ms) [ RUN ] DynamicQuantizeMatMul.B_PerColumn_ND [ OK ] DynamicQuantizeMatMul.B_PerColumn_ND (348 ms) [----------] 10 tests from DynamicQuantizeMatMul (4685 ms total) [----------] Global test environment tear-down [==========] 10 tests from 1 test suite ran. (4686 ms total) [ PASSED ] 10 tests. 
memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context - CalculateDynamicQuantizeMatMul to replace CPU EP run reference - Added more FP32 testcases to isolate all input datatype combinations --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-08 15:35:10 -08:00
Sheil Kumar	7deee944c0	Implement STFT Decomposition transformer (#19725 ) Implement STFT Decomposition transformer. Certain hardware does not support DXIL, and therefore existing operator should be mapped to hardware supported functions. Optimized convolution can be used to implement STFT. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2024-03-08 15:02:58 -08:00
Yifan Li	069d2d6f54	[EP Perf] Update EP Perf dockerfiles with cuda12/cudnn9 (#19781 ) ### Description * Update name of existing dockerfiles and add support to test latest TensorRT EA binary located in the image * Add cuda 12.3/cuDNN 9/TensorRT 8.6 dockerfile * Add detail to CI prompts and configs Instruction to test latest TRT via BIN: 1. Select `BIN` in TensorRT Version 2. In Variables, update related tarCudaVersion, clear tarCudnnVersion (not required in latest TRT tar binary) , and path to binary.	2024-03-08 13:58:22 -08:00
Yifan Li	3170a48e60	[EP Perf] Add tag to indicate which TRT parser is using (#19784 ) ### Description * Add tag to distinguish if TRT `builtin` or `oss` parser is being used * `oss` tag will be inserted with onnx-tensorrt commit id, to indicate which version oss parser is ### Validate DB entry before/after this PR (during test, `builtin` or `oss_{commit_id}` tag was inserted in the database entries): ### Motivation and Context To distinguish perf results using builtin/oss parser in the database, this parser tag is needed. In future, results using different parsers will be listed in different Perf Dashboard pages.	2024-03-08 10:24:36 -08:00
Scott McKay	01c376a0b9	Update script to run CIs for a branch. (#19797 ) ### Description <!-- Describe your changes. --> - Support multiple include/exclude values. - e.g. can now run with `-i MacOS -i iOS` to run CIs for both Apple platforms. - Default to current branch if run from directory in repo. - make lazier usage possible ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve tools. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-08 17:52:47 +10:00
Satya Kumar Jandhyala	24b72d2613	[JS/WebGPU] Preserve zero size input tensor dims. (#19737 ) ### Description For the Concat operation, the zero-size input tensor shape needs to be preserved and, unlike non-zero tensors, its dims are not constrained to match other input tensors' dims.	2024-03-07 19:07:49 -08:00
Scott McKay	6c3bed6740	Run CoreML EP with NeuralNetwork and ML Program in CI unit tests (#19796 ) ### Description <!-- Describe your changes. --> Add synthetic CoreML EP name to the list of providers so we test with NeuralNetwork and MLProgram model types. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Automatically test new MLProgram support in CI	2024-03-08 12:50:13 +10:00
Dmitri Smirnov	2964352641	Implement IsNaN-9,13,20 for CUDA along with tests (#19807 ) ### Description ### Motivation and Context Some models require IsNan CUDA along with training	2024-03-07 15:46:11 -08:00
Yi-Hong Lyu	33578cc76e	Remove memset for the case no any mask (#19823 ) Improved OCR model speed by 1.034 end-to-end, by eliminating unnecessary memset when no mask is present.	2024-03-07 13:54:16 -08:00
Jambay Kinley	3dfce2f1cd	Fix argparser in `matmul_bnb4_quantizer` (#19812 ) ### Description <!-- Describe your changes. --> The argparser had incorrectly used `description` and `options` instead of `help` and `choices`. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fixes: #19751	2024-03-07 11:31:34 -08:00
Ye Wang	72ce4de07d	cuda graph enhancement (#19636 ) ### Description 1. add a config key in run_options to control cuda graph in runtime. 2. enhance cuda graph class to support multiple graph saving and retrieving in one ORT session 3. provide model modification/inference example on Phi2 4. benchmark shows an average of 13% latency reduction in token generation. limitation: TRT ep and ROCM ep haven't applied this feature. we can revisit this in the future.	2024-03-07 10:15:18 -08:00
Tianlei Wu	bff4f8bf75	Update tolerance of provider tests to fix flaky tests (#19792 ) ### Description Check float/double/float16/bfloat16 tensors are close like [numpy.isclose](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html). ``` absolute(a - b) <= (atol + rtol * absolute(b)) ``` The default tolerance thresholds: - float: atol=1e-5 and rtol=1e-4 - float16: atol=0.0025 and rtol=0.001 - bfloat16: atol=0.02 and rtol=0.01 ### Motivation and Context Current pipeline has frequent failure due to using only relative tolerance in https://github.com/microsoft/onnxruntime/pull/19608: [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 1: C:\a\_work\1\s\onnxruntime\test\providers\checkers.cc(272): error: The difference between cur_expected[i] and cur_actual[i] is 1.3113021850585938e-06, which exceeds (params.relative_error) std::abs(cur_expected[i]), where 1: cur_expected[i] evaluates to -1.3113021850585938e-06, 1: cur_actual[i] evaluates to 0, and 1: (params.relative_error) std::abs(cur_expected[i]) evaluates to 2.6226043559063328e-08. It is not reasonable to use relative tolerance for a small value very close to 0. Combining relative tolerance with a positive absolute tolerance could avoid such issue.	2024-03-06 17:47:17 -08:00
pengwa	5c5d6e99ce	Define recomputable op list with domain/opset (#19722 ) ### Define recomputable op list with domain/opset Originally, we just checked the OpType to decide whether an op is recomputable. In this PR, a few improvements are made: 1. [Op type search] Domain + OpType are used to check whether the op supports recompute. 2. [Opset search] Then, node.SinceVersion() is looked up in the supported opsets. 3. During subgraph detection, if the node's opset is supported, get the ignorable input indices, i.e., the inputs we don't consider in the bottom-up search. This saves time in subgraph detection.	2024-03-07 09:12:12 +08:00
Wanming Lin	1ce5bfb0ec	[WebNN EP] Make sure optional input is provided (#19686 ) Some optional inputs are presented as an empty string, so we should check not only that the input size is correct, but also that the optional input is not empty. e.g. the Pad node has an empty optional input in the sam-b-encoder.onnx model.	2024-03-06 16:19:59 -08:00
Markus Tavenrath	f2dc725b33	Add SpaceToDepth and DepthToSpace CUDA NHWC Ops (#19646 ) ### Description - Adding CUDA NHWC support for SpaceToDepth and DepthToSpace - Add a new test which verifies that SpaceToDepth swizzling for the H axis is correct. - If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as well. ### Motivation and Context Adding more NHWC operations to avoid layout transformations when using the CUDA EP for more efficiency.	2024-03-06 12:35:55 -08:00
aciddelgado	8bd1335d00	Fix GQA Rotary Embedding sequence length (#19801 ) ### Description Previously, GQA incorrectly enforced rotary cos and sin cache to be of sequence length equal to present sequence length. Now it enforces that it be greater than or equal to present sequence length since to match Rotary Embedding Op it should be of max_sequence_length ### Motivation and Context Fixes issue with fusing Rotary Embedding and GQA for certain models which prefer this optimization.	2024-03-06 12:34:33 -08:00
Hector Li	db8d0c8e06	reset dcvsEnable for different HTP performance mode (#19728 ) reset dcvsEnable for different HTP performance mode	2024-03-06 11:21:19 -08:00
Changming Sun	f9a92e589a	Upgrade the Windows SDK version that is used in WindowsAI Nuget Packaging pipeline (#19786 ) ### Description 1. Upgrade the version from 10.0.19041.0 to 10.0.22621.0. The old one misses some macros that are needed by PyTorch's CPUINFO 2. Also update cmake. ### Motivation and Context In PR #19655 I added CPUINFO to all Windows builds, but forgot to test this pipeline.	2024-03-06 09:10:35 -08:00
pengwa	d9bf85613d	Adapt memory optimizer to fit PHI2 (#19757 ) ### Adapt memory optimizer to fit PHI2 A few improvements and bug fixes: 1. Fix a bug related to transformer layer detection. 2. Use the default reversed topological order to create recompute nodes, so that leaf nodes are not handled too late and given the lowest priority for execution. 3. Add an early stop when the activation's element count is constant and the total element count is < 1M. This avoids the overhead of searching subgraphs. Using export ORTMODULE_MEMORY_OPT_LEVEL=1 to enable layerwise recompute, on the given recipe, memory consumption dropped from ~22GB to ~13GB.	2024-03-06 21:54:16 +08:00
Ashwini Khade	e93a860819	Remove arm build for training (#19788 ) We no longer support Win arm 32 so removing the associated build and packaging job.	2024-03-05 21:54:48 -08:00
Scott McKay	db59cec82f	Don't reduce warning level for CUDA build on Windows (#19663 ) ### Description <!-- Describe your changes. --> Address warnings so all the ORT projects build with /W4 on Windows. Mainly - unused parameters - variables shadowing other ones ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #19588 started on this.	2024-03-06 15:03:55 +10:00
Yulong Wang	a788514027	[js/web] dump debug logs for karma for diagnose purpose (#19785 ) ### Description dump debug logs for karma for diagnose purpose. This is for debugging the CI issue of Chrome launch failure and considered temporary.	2024-03-05 18:27:26 -08:00
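
The tolerance rule quoted in the #19792 entry above, `absolute(a - b) <= (atol + rtol * absolute(b))`, can be illustrated with a small standalone sketch. This is not the code from `checkers.cc`; the helper name `is_close` is hypothetical, `b` is assumed to be the expected value, and the `rtol=0.02` used for the "relative only" case is inferred from the numbers in the quoted failure log rather than stated there.

```python
import math


def is_close(actual: float, expected: float,
             atol: float = 1e-5, rtol: float = 1e-4) -> bool:
    """Hypothetical helper: True when |actual - expected| <= atol + rtol * |expected|.

    The defaults are the float thresholds listed in the commit message.
    """
    if math.isnan(actual) or math.isnan(expected):
        return False
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values taken from the failure quoted in the commit message.
expected = -1.3113021850585938e-06
actual = 0.0

# Relative tolerance only (atol=0): the allowed error shrinks toward zero
# together with the expected value, so a tiny absolute difference fails.
relative_only = is_close(actual, expected, atol=0.0, rtol=0.02)

# Combined tolerance with the float defaults: the absolute term covers
# values that are very close to zero.
combined = is_close(actual, expected)

print(relative_only, combined)
```

With a purely relative threshold the comparison is rejected even though the two values differ by about 1.3e-06, which matches the motivation given in the commit message for adding a positive absolute tolerance.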

10710 commits