onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Hector Li	d5c6a2cecf	Enable code in QNN UT to verify the fix for partition issue (#19939 ) ### Description Enable code in the QNN unit tests to verify the fix for the partition issue related to QDQ models. https://github.com/microsoft/onnxruntime/pull/19723	2024-03-15 17:02:01 -07:00
enximi	7b46b31558	fix: "UserWarning: Unsupported Windows version (11). ONNX Runtime sup… (#19845 ) fix: "UserWarning: Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only." ### Description Include Windows 11 in the version check. Now, you will not see the warning “Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only.” ### Motivation and Context Warning on Windows 11: Only supports systems above Windows 10, which is somewhat strange.	2024-03-15 12:41:44 -07:00
Yulong Wang	79e50aeef3	[js/web] rewrite backend resolve to allow multiple EPs (#19735 ) ### Description This PR rewrites the backend resolve logic to support specifying multiple EPs. #### Backend The first version of ONNX Runtime Web carried over some existing code from [ONNX.js](https://github.com/microsoft/onnxjs), which includes the "backend" concept. The original "backend" in ONNX.js is designed on the assumption that only one backend from the user's backend hint list will be used. For example, in ONNX.js, if the user specifies a backend hint as `['webgl', 'wasm']`, ONNX.js will first try to use the WebGL backend: if it loads successfully (the browser supports WebGL), then the "webgl" backend will be used and "wasm" will be ignored; otherwise, "webgl" will be ignored and ONNX.js will try to load the "wasm" backend. In short: only one backend will be used when initializing a session. #### Execution Provider Execution Provider, or EP, in ONNX Runtime is a different concept. One of the differences is that users are allowed to specify multiple EPs, and if one does not support a particular kernel, execution can fall back to another EP. This is a very common case when using a GPU EP in ONNX Runtime. #### Current Status: Backend vs. EP For the historical reasons mentioned above, the current status is quite confusing. There are real backends, which are distinct implementations in code; there are backend hints, which are the string names used to request a backend; and there are EPs in the ONNX Runtime sense. Currently there are only 2 backends in our code base: the "onnxjs backend" and the "wasm backend". The "onnxjs backend" currently only powers the backend hint "webgl", which goes into the old onnx.js code path. All other backend hints, including "wasm", "cpu" (alias to wasm), "webgpu" and "webnn", are powered by the "wasm backend". Because ORT Web treats "backend" as an internal concept and wants to align with ONNX Runtime, the names of backend hints are becoming EP names.
The following table shows today's status:
| Execution Provider Name (public) / Backend Hint (internal) | Backend | EP in ORT |
| -------- | ------- | ------- |
| "wasm"/"cpu" | WasmBackend | CPU EP |
| "webgl" | OnnxjsBackend | * technically not an EP |
| "webgpu" | WasmBackend | JSEP |
| "webnn" | WasmBackend | WebNN EP |
#### Problem While the API allows specifying multiple EPs, backend resolution allows only one backend. This causes issues when the user specifies multiple EP names in session options: the backend resolve behavior and the EP registration behavior are inconsistent. Specifically, in this issue: https://github.com/microsoft/onnxruntime/issues/15796#issuecomment-1925363908: EP list `['webgpu', 'wasm']` on a browser without WebGPU support resolves to the 'wasm' backend, but the full EP list is passed in session options, so JSEP is still enabled, causing the runtime error. #### Solution Since we still need the WebGL backend, we cannot totally remove the backend register/resolve system. In this PR I made the following changes:
- initialize every backend from the EP list, instead of doing so only for the first successful one.
- for the first resolved backend, keep only the EPs that use that exact backend, and remove all other EPs from session options.
- for every explicitly specified EP that is removed, show a warning message in the console.	2024-03-15 11:47:45 -07:00
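The three changes listed in the commit message above can be sketched as follows. This is a minimal Python illustration of the described rule, not the ORT Web implementation (which is written in TypeScript). The `EP_TO_BACKEND` mapping follows the status table in the commit message; `resolve_backend`, the `usable_eps` set, and the warning text are hypothetical names introduced only for this example.

```python
# EP name / backend hint -> backend, as listed in the commit's status table.
EP_TO_BACKEND = {
    "wasm": "WasmBackend",
    "cpu": "WasmBackend",
    "webgl": "OnnxjsBackend",
    "webgpu": "WasmBackend",
    "webnn": "WasmBackend",
}


def resolve_backend(ep_list, usable_eps):
    """Resolve one backend and filter the EP list (illustrative only).

    usable_eps is a hypothetical stand-in for "the backend behind this EP
    name initialized successfully in the current browser".
    """
    # Try every EP in the list, not only up to the first success.
    attempts = [(ep, EP_TO_BACKEND[ep] if ep in usable_eps else None)
                for ep in ep_list]

    # The first successfully initialized backend is the resolved one.
    backend = next((b for _, b in attempts if b is not None), None)
    if backend is None:
        raise RuntimeError("no available backend found")

    # Keep only EPs served by that backend; collect a warning for the rest.
    kept = [ep for ep, b in attempts if b == backend]
    warnings = [f"removing requested execution provider '{ep}'"
                for ep, b in attempts if b != backend]
    return backend, kept, warnings


# Browser without WebGPU support: 'webgpu' is dropped from the EP list.
print(resolve_backend(["webgpu", "wasm"], usable_eps={"wasm", "cpu", "webgl"}))
```

In this sketch the session options would receive only the filtered list, so an EP whose backend could not be initialized is no longer passed through.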
Yifan Li	0b2a75b274	[EP Perf] Add concurrency test (#19804 ) ### Description * Add concurrency test to EP Perf CI panel (impl. by onnx_test_runner) * Model: FasterRCNN-10 model within CI image * `-c` param configurable via CI panel when kicking off CI tasks * Auto-replicate test input/outputs according to `-c` param * By default, the model test will be executed in 100 iterations (~2min added to T4 CI task load overall) ### Motivation and Context To monitor potential concurrency issues of ORT-TRT	2024-03-15 07:41:21 -07:00
Hariharan Seshadri	42399dfd2b	Fix a potential race in the CUDA TopK kernel (#19917 ) ### Description If the `K` value is flowing through as a tensor, we are updating a mutable member of the `TopK` class and basing the compute off that - which is likely to cause data race issues with concurrent Run() calls and `K` value changes. ### Motivation and Context Fix potential race in CUDA TopK kernel	2024-03-14 18:13:47 -07:00
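The hazard described in the commit above (a per-call value cached on a shared kernel object) can be illustrated with a small sketch. This is Python for illustration only; the actual change is in the CUDA TopK kernel's C++ code, and the class names `TopKRacy` and `TopKSafe` are hypothetical.

```python
import heapq
import threading


class TopKRacy:
    """Anti-pattern: K is cached on the shared kernel object."""

    def __init__(self):
        self.k = 0  # mutable member shared by every concurrent call

    def compute(self, values, k):
        self.k = k  # another thread may overwrite this before the next line
        return heapq.nlargest(self.k, values)


class TopKSafe:
    """Fix pattern: K stays in a variable local to the call."""

    def compute(self, values, k):
        local_k = k  # private to this call, never stored on the object
        return heapq.nlargest(local_k, values)


kernel = TopKSafe()
results = {}


def run(k):
    # Each thread asks the same kernel object for a different K.
    results[k] = kernel.compute(list(range(100)), k)


threads = [threading.Thread(target=run, args=(k,)) for k in (1, 3, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `TopKSafe`, every call returns exactly `k` elements regardless of what other threads request; `TopKRacy` gives no such guarantee.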
Justin Chu	bcf47d3546	Update install_deps_lort.sh to fix onnxscript installation (#19922 ) Install onnxscript correctly with `pip install`. Dev dependencies are not required. ### Motivation and Context Fix build breaks.	2024-03-14 17:05:50 -07:00
Adam Louly	32558134a9	[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770 ) ### Description Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers ### Motivation and Context This PR includes changes that will make ort handle 2GB+ checkpoints. To do that we need to upgrade flatbuffers to 23.5.9 - https://github.com/google/flatbuffers/pull/7945 - Modified the commitHash and the hash for the new version - Removed the patch for rust generator's unused variable warning as it is no longer producing this - [Check it out here](`d121e09d89/src/idl_gen_rust.cpp`) - Updated the VerifyField calls with alignment values that were introduced in the new version. --------- Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>	2024-03-14 16:36:24 -07:00
Yi Zhang	87a9f77c56	Refactor Python Packaing Pipeline (Training Cuda 11.8) (#19910 ) ### Description 1. Use stages to organize the pipeline and split building from testing. 2. Move compilation to a CPU machine. 3. The test stage can leverage existing artifacts. 4. Check the wheel size; a warning is raised if the size is above 300M. 5. The docker image name did not change even when the argument changed, which caused the docker image to always be rebuilt. Updating the docker image name according to the argument saves docker build time. Pipeline duration reduced by 60% (2 hours -> 50 minutes). Compilation time reduced by 75% (1.5 hours -> 20 minutes). GPU time reduced by 87% (8 hours -> 1 hour). For debugging, the GPU time could be reduced by more than 95%, because we can choose to run only one test stage and skip building. ### Motivation and Context Make the pipeline efficient. Optimized https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=424177&view=results Current https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=422393&view=results	2024-03-15 06:47:41 +08:00
Changming Sun	8b766bd24e	Change nuget pipeline's "Windows_Packaging_combined_GPU" job to download TRT binaries in every build (#19919 ) ### Description Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build. Now all the other build jobs are already doing this. This is the only one left. Similar to #19909 ### Motivation and Context As a follow up of #19118	2024-03-14 15:07:56 -07:00
Tianlei Wu	a2ffc3740b	[Cuda] Demo multiple cuda graphs and user compute stream (#19883 ) Update stable diffusion demo to add options `--max-cuda-graphs` and `--user-compute-stream`. * Add python class GpuBindingManager to manage IO Binding based on input shape and max number of cuda graphs setting. The benefit is that one inference session could enable or disable cuda graph in different runs. * When `--user-compute-stream` is set, the demo will use a custom compute stream.	2024-03-14 13:48:37 -07:00
Edward Chen	0b90363acb	[MLAS][AArch64] SQ4BitGemm CompInt8 multi-block implementation (#19826 ) Update SQ4BitGemm CompInt8 implementation to process multiple blocks along a single column instead of processing single blocks from multiple columns.	2024-03-14 13:05:42 -07:00
Baiju Meswani	226f60f2f1	Add support for SGD optimizer in minimal build (#19901 )	2024-03-14 11:31:20 -07:00
Changming Sun	1fb6cbddee	Add a build patch for Windows ARM64EC (#19898 ) ### Description Add a patch for Windows ARM64EC ### Motivation and Context Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc	2024-03-14 08:50:42 -07:00
Changming Sun	ea4a5eea18	Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build (#19909 ) ### Description Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build. Now all the other build jobs are already doing this. This is the only one left. ### Motivation and Context As a follow up of #19118	2024-03-14 07:55:00 -07:00
cao lei	966fa74597	Add 2 C API for ort extension (#19808 ) ### Description Add 2 C APIs for ORT extension: - KernelInfo_GetAllocator - OrtCustomOp::GetMayInplace ### Motivation and Context Add 2 C APIs for the ORT extension project, which will leverage these 2 APIs for the GroupQueryAttention custom op.	2024-03-14 06:00:41 -07:00
pengwa	409b811325	Refine logging for execution plan print (#19777 ) ### Refine logging for execution plan print Printing NodeIndex only is not enough for us to debug the execution order. Keep the original behaviour for ORT_MINIMAL_BUILD builds in case of any CPU memory concerns. ### Motivation and Context	2024-03-14 16:31:32 +08:00
Scott McKay	0be0791fcc	Update MAUI model tester tool to .net8 (#19907 ) ### Description Update to .net8. Didn't want to build with the latest VS2022 using net6 (which was EOL last year). ### Motivation and Context	2024-03-14 15:19:19 +10:00
Jeff Daily	9443366009	[ROCm] fix build failure when nccl is enabled (#19900 ) Building onnxruntime ROCm EP with --enable_nccl --use_mpi fails due to inclusion of MOE source files but MOE is not supported. The error observed is `error: contrib_ops/rocm/moe/ft_moe/moe_kernel.h: No such file or directory` The fix is to exclude collective/sharded_moe.* files when nccl is requested.	2024-03-13 21:16:54 -07:00
Adrian Lizarraga	9c3242ab70	[QNN EP] Copy security catalog file for HtpV73Skel.so from QNN SDK (#19903 ) ### Description Copies the `QNN_HOME/lib/hexagon-v73/unsigned/libqnnhtpv73.cat` file from QNN SDK to the unittest build directory. This is necessary in order to be able to load the `libQnnHtpV73Skel.so` file on Windows for modern versions of QNN SDK. ### Motivation and Context A [digitally-signed catalog file](https://learn.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files) (.cat) can be used as a digital signature for an arbitrary collection of files.	2024-03-13 20:52:59 -07:00
cao lei	2c525a79b1	Add new API KernelContext_GetScratchBuffer (#19809 ) ### Description Add new API KernelContext_GetScratchBuffer to get a scratch buffer from the kernel context. ### Motivation and Context Add new API KernelContext_GetScratchBuffer to get a scratch buffer from the kernel context, which will be used in the ORT extension project for the GroupQueryAttention custom op.	2024-03-13 19:41:15 -07:00
Jake Mathern	18ad8587a6	[CP] Fix for xfgcheck and Fix WAI ARM64 build (#19634 ) (#19644 ) ### Description Fix WAI build by only conditionally copying linker flags ### Motivation and Context I broke the WAI build that contains ORT on ARM64	2024-03-13 17:54:06 -07:00
Markus Tavenrath	f42e6ad61e	Add support for LRN NHWC OPs (#19866 ) Support LRN NHWC in the CUDA EP. ### Motivation and Context Add support for all NHWC OPs to avoid NHWC/NCHW Layout transformation	2024-03-13 17:52:07 -07:00
raoanag	9f08f8d5b2	Set seed for DynamicQuantizeMatMul tests (#19896 ) Seed for DynamicQuantizeMatMul tests to avoid pipeline failures with marginal mismatches.	2024-03-13 17:49:55 -07:00
kunal-vaishnavi	4ac98d6d65	Update replacing MultiHeadAttention with GroupQueryAttention (#19882 ) ### Description This PR updates the replacement of MultiHeadAttention (MHA) with GroupQueryAttention (GQA). It is related to the changes in [this PR](https://github.com/microsoft/onnxruntime/pull/18906). ### Motivation and Context The updated replacement of MHA with GQA includes the following fusion changes. - Apply sliding window within GQA - Fuse the rotary embeddings within GQA - Fuse the 3 MatMuls into 1 packed MatMul if possible - Fuse the 3 Adds into 1 packed Add if possible	2024-03-13 14:10:52 -07:00
aciddelgado	8eb49c5f00	fix gqa rotary dim 1 (#19874 ) ### Description GQA Rotary Dimension 1 incorrectly assumed to be based on head size. ### Motivation and Context This change should enable us to run phi-2 with GQA and Rotary Embedding fused.	2024-03-13 14:09:54 -07:00
Yulong Wang	e771a763c3	[js/test] align web test runner flags with ort.env (#19790 ) ### Description The `npm test` flags are difficult to memorize because they differ from the `ort.env` flags. This change aligns those flags with the ort JS API, e.g. `--wasm-enable-proxy` became `--wasm.proxy`. Old flags are marked as deprecated except `-x` (as a shortcut for `--wasm.numThreads`).	2024-03-13 12:00:36 -07:00
Yi Zhang	d5d9dbd51d	reuse T4 on Linux GPU (#19879 ) ### Description ### Motivation and Context Linux GPU test on A10 isn't very stable	2024-03-13 10:41:36 -07:00
Satya Kumar Jandhyala	ed250b88c3	[JS/WebGPU] Optimize MatMulNBits (#19852 ) ### Description Use vec<2> or vec<4> operands in MatMulNBits ### Motivation and Context Improve performance	2024-03-13 10:33:14 -07:00
Hariharan Seshadri	ed306b4f97	Fix Android CI pipeline (#19877 )	2024-03-13 10:09:43 -07:00
Justin Chu	faea42af95	Bump ruff to 0.3.2 and black to 24 (#19878 ) ### Motivation and Context Routine updates	2024-03-13 10:00:32 -07:00
Yi Zhang	9e0a0f0f32	Check whether required tests are executed. (#19884 ) ### Description Check that the onnx node tests and model tests ran. ### Motivation and Context The onnx node test data and model data are mounted in one dir, and onnxruntime_test_all searches the dir and loads the data. If the dir does not exist or there is some change in onnxruntime_test_all, those tests may not be executed. For example, all onnx node test data is 32M. It is hard for us to notice such a regression, so I added a simple check to ensure those tests are executed. --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-13 09:59:57 -07:00
Yi Zhang	7313aa4efe	Remove --extra-index-url (#19885 ) ### Description ### Motivation and Context --extra-index-url is not allowed by the injected Secure Supply Chain step in packaging pipelines. ``` > Starting Multifeed Python Security Analysis: ##[warning]tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml - Found "extra-index-url". (https://aka.ms/cfs/pypi) ``` Those 2 packages can now be installed from PyPI as well. Co-authored-by: Yi Zhang <your@email.com>	2024-03-13 09:45:22 -07:00
Hector Li	60ad6c6409	Enable float32 model with FP16 precision for QNN HTP backend (#19863 ) ### Description Enable float32 model with FP16 precision for QNN HTP backend	2024-03-13 08:35:21 -07:00
George Wu	6579f74af0	skip onnx node_tests for tensorrt ep (#19880 ) fix build break caused by image update. tensorrt isn't expected to pass all onnx node tests.	2024-03-12 23:35:05 -07:00
Yang Gu	53de2d8cb0	[js/webgpu] Enable GroupedConvVectorize path (#19791 ) Vectorize hit 2 failed cases in a CI bot with an NVIDIA GPU, but we couldn't reproduce them with any of the GPUs at hand, including NVIDIA GPUs. This PR introduces GPUAdapterInfo and enables this optimization on non-NVIDIA GPUs to make the bots happy. No obvious perf gain can be seen if we enable vectorize on NVIDIA. However, it shows a big perf improvement on Intel. On my Gen12 Intel GPU, mobilenetv2-12 perf improved from 11.14ms to 7.1ms.	2024-03-12 22:25:07 -07:00
Yulong Wang	4538d31a8b	[js/webgpu] expose a few properties in WebGPU API (#19857 ) ### Description This change exposes a few properties in `ort.env.webgpu` to address the feature request mentioned in https://github.com/microsoft/onnxruntime/pull/14579#discussion_r1519612619. - Add `powerPreference` and `forceFallbackAdapter` in `ort.env.webgpu`, to allow users to set the value of the properties before the first inference session is created. - Add readonly property `adapter` in `ort.env.webgpu` to allow users to get the adapter instance. Now users can access `ort.env.webgpu.device` and `ort.env.webgpu.adapter`. @xenova @beaufortfrancois	2024-03-12 19:50:51 -07:00
wejoncy	22ad629cf7	[bug fix] dequantize 4bit (#19793 ) ### Description ### Motivation and Context	2024-03-12 18:27:46 -07:00
Edward Chen	860eb762c2	[Apple framework] Fix minimal build with training enabled. (#19858 ) Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled. - training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build. - tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build. Added a test to cover this configuration.	2024-03-12 11:33:30 -07:00
Adrian Lizarraga	00c3cd497e	[QDQ Quantization] Refactor shared functionality into a base quantizer (#19817 ) ### Description This PR does not add or remove any functionality. It refactors common functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes into a new `BaseQuantizer` class. This change helps decouple the QDQ quantizer from other quantization modes and makes it easier to determine if a change to one quantization mode will impact another. ### Motivation and Context An upcoming PR aims to add mixed-precision support to QDQ models (e.g., one part of the graph uses u8 activations and another uses u16 activations). This change makes the upcoming PR smaller and should presumably make determining the impact on existing features more straightforward.	2024-03-12 10:47:09 -07:00
Ye Wang	7f0520cdf9	bug fix to multi-cudagraph (#19856 ) ### Description run_count_before_capture_ is graph_id aware, fix the bug by adding a map to retrieve the run_count_ for each graph_id. ### Motivation and Context	2024-03-12 10:33:37 -07:00
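The idea of keeping one run counter per graph id, as the commit above describes, can be sketched like this. It is an illustrative Python model, not the CUDA EP's C++ code; the class name, method names, and the threshold of 2 warm-up runs are assumptions made for the example.

```python
from collections import defaultdict


class CudaGraphRunCounter:
    """Tracks warm-up runs separately for each graph id (hypothetical class)."""

    def __init__(self, min_runs_before_capture=2):
        # The threshold value is an assumption chosen for this sketch.
        self.min_runs_before_capture = min_runs_before_capture
        # One counter per graph id instead of a single shared counter.
        self.run_count_by_graph_id = defaultdict(int)

    def record_regular_run(self, graph_id):
        self.run_count_by_graph_id[graph_id] += 1

    def is_capture_allowed(self, graph_id):
        count = self.run_count_by_graph_id[graph_id]
        return count >= self.min_runs_before_capture


counter = CudaGraphRunCounter()
counter.record_regular_run(graph_id=0)
counter.record_regular_run(graph_id=0)
counter.record_regular_run(graph_id=1)
print(counter.is_capture_allowed(0), counter.is_capture_allowed(1))
```

With a single shared counter, the runs recorded for graph 0 would also count toward graph 1; the map keeps the warm-up state of each graph independent.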
zz002	319159b7bd	[VitisAI]set-data_loaction-as-default-when-load-external-data (#19712 ) ### Description Set data_location as default when loading external data. Fix the VitisAI EP not being able to get custom ops registered through session options. ### Motivation and Context Daily VitisAI bug fixes. When using the pass fuse_qdq_GEMM or fuse_qdq_MATMUL, an error like the following occurs: Error Data of TensorProto ( tensor name: xxx) is stored externally and should not have data field.raw_data --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-03-12 10:27:14 -07:00
Bowen Bao	742595b885	Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853 ) ### Description Always set `use_io_binding=True` when using optimum.onnxruntime unless there is a special case. ### Motivation and Context By default, `ORTModel` under optimum.onnxruntime will choose the appropriate `use_io_binding` value based on provider and use cases. > use_io_binding (`Optional[bool]`, defaults to `None`): > Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to > `True` if the execution provider is CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM], defaults to `True` on CPUExecutionProvider, > in all other cases defaults to `False`. For Llama token benchmark, using iobinding yields almost 2x speedup, even on CPU. This is because this particular model yields a large number of outputs (>60). Without iobinding, a copy is performed for each output from ortvalue to numpy array. This adds significant overhead to the overall run time. ``` Evaluating Llama2 `model(inputs)` step with past_key_values Before, w/o iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.4518657898902893 s Throughput: 2.2130464894073856 tps After, w/ iobinding on cpu Batch Size: 1 Sequence Length: 512 Latency: 0.2662619352340698 s Throughput: 3.7557001871893703 tps ```	2024-03-12 09:41:11 -07:00
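The "1.69x" in the title of the commit above can be checked against the benchmark figures it quotes. The short calculation below is a sketch that assumes batch size 1 and one token per `model(inputs)` step, so that throughput is the reciprocal of latency; the variable names are ours.

```python
# Figures copied from the benchmark output in the commit message.
latency_before_s = 0.4518657898902893  # w/o iobinding on CPU
latency_after_s = 0.2662619352340698   # w/ iobinding on CPU

# Assumption: batch size 1 and one token per step, so throughput in
# tokens per second is the reciprocal of the step latency.
throughput_before = 1.0 / latency_before_s
throughput_after = 1.0 / latency_after_s

speedup = throughput_after / throughput_before
print(f"{throughput_before:.4f} tps -> {throughput_after:.4f} tps, "
      f"speedup {speedup:.3f}x")
```

The ratio comes out just under 1.70, which is consistent with the title's 1.69x if the figure was truncated rather than rounded.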
Yi Zhang	d4fa4f0276	Remove FFmpeg to meet compliance (#19859 )	2024-03-12 09:06:59 -07:00
pengwa	3fb8905393	Fix torch cpp extension build warnings (#19842 ) ### Fix torch cpp extension build warnings For the warnings shown as below: ``` cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from 
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ [5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so 
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so ``` Fix by replacing the existing `PyObject_GetAttrString` with `PyObject_FastGetAttrString`, which claims to be faster in its implementation comment. ### Motivation and Context	2024-03-12 10:51:30 +08:00
pengwa	3e954da3e6	Fix and enable few ORTModule Unit Tests (#19847 ) ### Fix and enable few ORTModule Unit Tests Fix 'test_bert_inputs_with_dynamic_shape' and 'test_bert_result_with_layerwise_recompute', which generate NaN loss in the ORT run. The root cause is that the logic to generate attention mask test data is not correct: only 0 or 1 is allowed in the dataset, but we see lots of other numbers. (The reason we don't hit this with old versions of transformers, for example v4.4.2 or 4.16.2, is that they don't contain `d3cb28886a`, which increases the scaling to a bigger number, causing an overflow to inf.) Another improvement made during the investigation using convergence tools: don't dump the activations during the model export phase; otherwise, the dumped data might contain some PyTorch run results, which is confusing when comparing with stock PyTorch run results. ### Motivation and Context	2024-03-12 10:49:19 +08:00
Vincent Wang	0c078dfc8b	Some Shape Related Fusions (#19832 ) This PR adds the following shape-related fusions, which are helpful for some transformer models: - ShapeInputMerge merges all Shape nodes' input NodeArgs into a single one (the first one in topological order) if they have the same shape value. This helps CSE fusion merge more nodes. - CSE fusion now supports a scalar tensor as an attribute value. This is mainly to support the ConstantOfShape node.	2024-03-12 10:29:27 +08:00
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723 ) ### Description If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units we process the logical nodes they represent instead of individual nodes. This ensures we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
Hector Li	cba605e845	Fix Clip op builder for FP16 support (#19825 ) ### Description Fix Clip op builder for FP16 support. ### Motivation and Context Enables mobilenet v2 FP16 model inference on HTP	2024-03-11 16:39:41 -07:00
raoanag	89aa4697b1	[DML] QAttention (#19766 ) ### Description DML Implementation for [com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention) ### Motivation and Context --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-11 10:44:34 -07:00
Changming Sun	5479124834	Remove remaining Windows ARM32 build jobs (#19840 ) ### Description As a follow up of #19788, remove more remaining Windows ARM32 build jobs. ### Motivation and Context Our nuget packaging pipeline is failing because it could not find an artifact for Win ARM32. ``` ##[error]Artifact onnxruntime-training-win-arm was not found for build 421397. ``` Deprecation of Win ARM32 was announced by Windows team in January 2023. We should follow it.	2024-03-11 11:25:11 +08:00

10733 commits