onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-24 22:17:32 +00:00

Author	SHA1	Message	Date
kailums	40d4b2ec75	exclude split3inner kernel on rocm ep (#21238 ) ### Description There is an issue when using the split3inner kernel on rocm-6.0.3, so exclude this code from the ROCm EP.	2024-07-04 14:32:28 +08:00
Tianlei Wu	7d9b12a2e3	[CPU] SparseAttention op (#21110 ) Add SparseAttention cpu implementation. - [x] Refactoring GQAAttentionBase - [x] Add SparseAttention implementation - [x] Add test cases This is unfused version. Flash attention version will be added later.	2024-07-03 21:51:57 -07:00
Yi Zhang	30b6e82e7d	Make ROCm packaging stages to a single workflow (#21235 ) ### Description Move the current ROCm packaging stages into a single workflow. This reduces the chance that one failed stage prevents all nightly packages from being generated. ### Motivation and Context Our plan is to reduce the complexity of the current zip-nuget pipeline to improve the stability and performance of nightly package generation. The ROCm packaging stages have no dependencies on other packaging jobs and are the most time-consuming route. After this change, the duration of the most-used CPU/CUDA/Mobile packaging workflow can be reduced from roughly 3h20m to 2h30m.	2024-07-04 11:07:04 +08:00
cloudhan	f39ee14b46	Add GQA support for ROCm (#21032 )	2024-07-03 14:55:31 +08:00
pengwa	4932e04053	ORTModule GraphTransitionManager (#19007 ) ### Problem Currently, the codebase contains some logics pertaining to model re-export checks and graph_builder reinitialization checks. Ideally, these operations should function akin to a state machine. However, upon inspecting the implementation, it becomes apparent that certain states are checked or set in various scattered locations. This fragmentation makes it challenging to comprehend when a re-export or re-initialization will be triggered. For optimal clarity and maintainability, it is advisable to consolidate these states into a cohesive component, rather than dispersing them within the current graph execution manager. Furthermore, the process of model exports and post-export processing for stage 3 support or memory-efficient gradient management introduces considerable complexity. To enhance the codebase's structure, it would be beneficial to extract these intricate functionalities into a dedicated component, divorcing them from the current graph execution manager. As part of the effort to improve the codebase, it's essential to address inconsistencies in handling input/output flatten/unflatten operations. Currently, there are several functions performing these operations recursively, each with slightly different implementations. This inconsistency leads to varying support for input/output data types and structures in different parts of the code. To rectify this, the proposed pull request simplifies these operations into a set of primitive functions, ensuring uniformity. This not only streamlines the code but also facilitates the maintenance of consistency when introducing bug fixes or supporting new data types. One thing to mention here: input output handling is deeply bound to the graph transition mentioned above, so it is difficult to make this change separately. 
While acknowledging the complexity of this logic, it is reassuring that the codebase benefits from an extensive suite of unit tests that cover all possible branches. Despite the intricacies, ensuring the passage of all tests has been a time-intensive but necessary aspect of this development effort.
### Design
Introduce `GraphTransitionManager` and put all model export and post-export processing logic in it.
1. Re-export check
2. Do export
3. Re-post-export process check
4. Do post-export process
5. Return `PostExportProcessedModelInfo`, which contains all the information we need, to pass to ORT to build the gradient graph (currently we do the same for training or evaluating, but ideally we should not do it for evaluating; let's keep this behavior as it is now, and make the change later).
```
# Input names for the pre-gradient-build graph.
# This may be different with the one in ExportedGraph since we may modify the graph inputs as needed
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names

# A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] = onnx_graph_input_names_require_grad

# Create symbolic names for each dimension of the graph input (e.g. onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index: symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0: "input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] = onnx_graph_input_dynamic_axes_map

self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
self.onnx_graph_input_names_user_defined = (
    onnx_graph_input_names_user_defined  # The ONNX graph input names excluding the parameters, buffers.
)

# The ONNX graph input names excluding the parameters, buffers.
self.onnx_graph_input_names_require_grad_user_defined = onnx_graph_input_names_require_grad_user_defined

self._post_export_processed_model: onnx.ModelProto | None = post_export_processed_model

# A function to access the input data from the args and kwargs.
# If it is not None, the length is same as onnx_graph_input_names.
# For i-th input name, we can use the i-th function to get the input data from args and kwargs.
self.data_accessor: list[callable] | None = data_accessor

# Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None = module_forward_output_schema
```
The `GraphTransitionManager` instance is a property of `GraphExecutionManager` (e.g. `TrainingManager` or `InferenceManager`).
1. Use `self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs, kwargs)` to check whether the PyTorch module needs a re-export or re-post-export-process.
2. Use `self._graph_transition_manager._post_export_processed_model_info.construct_inputs` to construct the list of inputs used for ORT runs.
3. Use `self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)` to restore the outputs in the original PyTorch output structure.	2024-07-03 10:53:31 +08:00
Baiju Meswani	116398c1a4	onnxruntime shared lib inside python package (#21223 )	2024-07-02 15:37:50 -07:00
Tianlei Wu	7df97f1987	Add debugging helper to dump string, vector and thread id (#21224 ) ### Description Add some macros to help print data to the console for debugging purposes. Example usage:
```
int input_id;
vector<int> some_vector;
DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```
- To enable dumping the thread id, set environment variable `ORT_DUMP_THREAD_ID=0`.
- Users can disable dumping with environment variable `ORT_ENABLE_CPU_DUMP=0`.	2024-07-02 11:24:04 -07:00
Yifan Li	7be1d4aad3	[TensorRT EP] Update TRT10.0 deprecated api (#20989 ) ### Description Note:
* This PR would remove C4996 suppression in tensorrt_execution_provider.cc only (according to Nvidia, places with nvinfer.h included need C4996 suppression, when /Zc:__cplusplus is enabled in ORT win build)
* A follow-up PR will be raised to update deprecated TRT Plugin api usage.

Here are deprecated apis to be updated in this PR:

| deprecated api | Update |
| --- | --- |
| [kCUBLAS](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [kCUBLAS_LT](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [kCUDNN](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [reallocateOutput](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#acae6441d4029584cc1c6550917518691) | Superseded by [reallocateOutputAsync](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#aa40eeb891c1dfe4c1bbf1eabe8c705ab) with cudaStream_t argument |
| [createExecutionContextWithoutDeviceMemory](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#adc86bcc42b098204997396ef2b1093fb) | Superseded by [createExecutionContext()](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a35de29aa6134165a5b14a537e6d99e82) with parameter.<br />Check [ExecutionContextAllocationStrategy::kUSER_MANAGED](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac6251a050df629edfc0ce037fa366503) for more detail |

### Motivation and Context TRT deprecated api list: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/deprecated.html	2024-07-01 22:55:20 -07:00
Yi Zhang	beb2496748	Templatize publishing nuget package (#21199 ) ### Description This is a prerequisite step for reducing the complexity of the current zip-nuget pipeline. Some packaging tasks can be cut from the most complex nuget pipeline and published easily.	2024-07-02 09:24:19 +08:00
Scott McKay	8c2689877f	CoreML: Disable 1D ML Program matmul due to bug in coreml (#21186 ) ### Description Disable using CoreML ML Program for a matmul where one of the inputs is 1D as the CoreML implementation appears to be broken. See https://github.com/apple/coremltools/issues/2263 Add some debugging notes. ### Motivation and Context Fix failing test on macos-14.	2024-06-29 12:19:51 -07:00
Chen Feiyue	56b36a58ba	Initial PR for VSINPU execution provider (#20903 ) ### Description - This is an initial PR for the VSINPU execution provider. ### Motivation and Context - Adds support for VeriSilicon hardware. - TIM-VX (Tensor Interface Module) (https://github.com/VeriSilicon/TIM-VX) is an integrated software solution by VeriSilicon for our hardware (A311D/i.MX 8M Plus etc.) design. The VSINPU execution provider makes it easy to use VeriSilicon's hardware by connecting onnxruntime to the TIM-VX API.	2024-06-28 21:48:34 -07:00
Jian Chen	9007ede102	Update upstream packaging pipeline name to make it more meaningful. (#21154 ) ### Description Update upstream packaging pipeline name to make it more meaningful. ### Motivation and Context The upstream pipeline used to build only Nuget packages, but now it also builds Zip and Java. Changing the name makes it more meaningful.	2024-06-28 21:40:09 -07:00
Changming Sun	3a83f8b317	Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920 ) ### Description 1. Update the functions in tensorprotoutils.h to use std::filesystem::path instead of onnxruntime::Path. Eventually we can remove the whole onnxruntime::Path class, but to keep this PR small I am not doing that. 2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro def when TensorRT EP is enabled.	2024-06-28 20:03:57 -07:00
Jian Chen	0cbe7eec5e	Update nuget to Use Nuget 6.10.x (#21209 ) ### Description Update nuget to Use Nuget 6.10.x	2024-06-28 19:49:54 -07:00
mingyueliuh	7e93cd7f8b	[VitisAI] Align TensorProto_DataType with onnx1.16 (#21067 ) ### Description Sync the Vitis AI EP with the TensorProto data types supported by ONNX 1.16. Add an error message that is shown when graph resolve fails, for troubleshooting. ### Motivation and Context ONNX 1.15 & 1.16 add support for some new TensorProto data types, such as - FLOAT8E4M3FN - FLOAT8E4M3FNUZ - FLOAT8E5M2 - FLOAT8E5M2FNUZ - UINT4 - INT4 --------- Co-authored-by: liumingyue <mingyue@xilinx.com>	2024-06-28 17:19:20 -07:00
Preetha Veeramalai	6baaaf5165	OVEP options to disable CPU fallback at compile time (#21166 ) ### Description Provide user level options to control the fallback on CPU for models not supported on Intel's NPU hardware. ### Motivation and Context - Current workflow of OVEP allows safe fallback from OV NPU to OV CPU on compilation failures. Also supports MLAS CPU fallback in presence of unsupported custom ops. - The PR provides a build-time option to disable fallback from OV NPU to OV CPU. - The session Option "kOrtSessionOptionsDisableCPUEPFallback" disables OV CPU and MLAS CPU fallback. - Also has bug fix for proto creation. --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>	2024-06-28 08:31:02 -07:00
Hector Li	21ad004237	Add QNN UTs for QNN Pad Op with FP16 data on HTP backend (#21142 ) ### Description 1. Add QNN UTs for QNN Pad Op with FP16 data on HTP backend 2. Improve Pad op builder to handle invalid optional input 3. Add UT for ReduceSum for FP16 precision with 5D for issue reproduce	2024-06-27 22:09:13 -07:00
Yi Zhang	587e92c279	Add FP32 and INT4 test in Llama2 (#21187 )	2024-06-28 06:18:26 +08:00
Changming Sun	d1ab94c2b0	Add compatibility for NumPy 2.0 (#21085 ) ### Description As suggested by SciPy's doc, we will `Build against NumPy 2.0.0, then it will work for all NumPy versions with the same major version number (NumPy does maintain backwards ABI compatibility), and as far back as NumPy 1.19 series at the time of writing` I think it works because in [numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64) there is a macro NPY_FEATURE_VERSION. By default it is set to NPY_1_19_API_VERSION. And the NPY_FEATURE_VERSION macro controls ABI. This PR only upgrades the build-time dependency; when a user installs ONNX Runtime, they can still use numpy 1.x. ### Motivation and Context Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.	2024-06-27 13:50:53 -07:00
Wanming Lin	78316c8cbe	[WebNN EP] Remove useless variable unpacked_tensors_ (#21189 )	2024-06-27 11:56:56 -07:00
Guenther Schmuelling	9eb1c2a7a3	support for layernorm in webgpu pre opset-17 (#21121 ) handled the same way cpu does	2024-06-27 10:20:48 -07:00
Yi Zhang	8f738d8e9f	[Fix] Throw an exception when Llama2 parity_check fails (#21160 ) ### Motivation and Context The pipeline is green even when Llama2 parity_check fails. The PR should be merged after the below exception is solved. ''' 2024-06-25 03:49:43.621298481 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17} An error occurred while verifying parity: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17} Traceback (most recent call last): File "/workspace/onnxruntime/python/tools/transformers/models/llama/convert_to_onnx.py", line 1043, in main parity_check(parity_cmd) File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 298, in main verify_parity(args, location, use_auth_token, kv_cache_ortvalues, pytorch_model=llama, config=config) File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 137, in verify_parity ort_model.run_with_iobinding(io_binding) File "/home/onnxruntimedev/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding self._sess.run_with_iobinding(iobinding._iobinding, run_options) RuntimeError: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17} ''' The exception appears to be caused by #19832	2024-06-27 23:49:32 +08:00
Wanming Lin	b49788e68b	[WebNN EP] Fixed bug in Expand implementation (#21163 ) ONNX's Expand supports bidirectional broadcasting, while WebNN's expand op supports only unidirectional broadcasting. Thus we should calculate the output shape for the 'newShape' input of WebNN's expand op.	2024-06-27 08:09:13 -07:00
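The shape calculation described in the WebNN Expand fix can be illustrated with a minimal Python sketch. This is not the WebNN EP code; `broadcast_shape` is a hypothetical helper that applies the usual bidirectional (NumPy-style) broadcasting rules to an input shape and the requested shape, producing the shape one would pass as the expand target.

```python
def broadcast_shape(input_shape, target_shape):
    """Return the bidirectionally broadcast shape of two shapes.

    Hypothetical helper for illustration only; it follows the usual
    right-aligned broadcasting rules (a dimension of 1 stretches to match).
    """
    rank = max(len(input_shape), len(target_shape))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = [1] * (rank - len(input_shape)) + list(input_shape)
    b = [1] * (rank - len(target_shape)) + list(target_shape)
    out = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            out.append(dim_a)
        elif dim_a == 1:
            out.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {input_shape} with {target_shape}")
    return out


# The requested shape may be smaller than the input in some dimensions;
# the computed output shape is what a unidirectional expand would need.
print(broadcast_shape([3, 1], [2, 1, 6]))
print(broadcast_shape([3, 4], [1]))
```

Computing the output shape up front means the backend only ever sees a target that the input can be expanded to in one direction.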
kailums	a1bbfeb306	add split3inner (#19886 ) ### Description The split op uses pin_memory when splitting into different sizes, but pin_memory cannot be used with cudagraph. Add a new implementation for transformer scenarios only: it splits qkv_proj into q, k, v without using pin_memory.	2024-06-27 18:53:12 +08:00
PeixuanZuo	446aa986a1	[ROCm] Extend the Pipeline restriction time (#21158 ) ROCm EP builds are taking longer.	2024-06-27 15:36:04 +08:00
mindest	eecc11afc7	[ROCm] Disable ck_tile in Debug build (#21178 ) ### Description tmp fix: disable ck_tile for Debug build. ### Motivation and Context Release build works fine for ck_tile, while Debug build fails. <details> <summary> Typical error log to revisit </summary> ``` [880/1797] Building HIP object CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o FAILED: CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o /opt/rocm/llvm/bin/clang++ -DEIGEN_MPL2_ONLY -DENABLE_ROCM_PROFILING -DENABLE_STRIDED_TENSORS -DENABLE_TRAINING -DENABLE_TRAINING_APIS -DENABLE_TRAINING_CORE -DENABLE_TRAINING_OPS -DENABLE_TRAINING_TORCH_INTEROP -DMIOPEN_VERSION=30100 -DORT_ENABLE_STREAM -DROCM_VERSION=60100 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_ROCclr__=1 -D__bf16__ -D__fp16__ -D__fp32__ -I/build/Debug/_deps/utf8_range-src -I/ws/onnxruntime/include/onnxruntime -I/ws/onnxruntime/include/onnxruntime/core/session -I/ws/onnxruntime/orttraining/orttraining/training_api/include -I/build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha -I/build/Debug/_deps/composable_kernel-src/include -I/build/Debug/_deps/composable_kernel-build/include -I/build/Debug/_deps/composable_kernel-src/library/include -isystem /opt/rocm-6.1.0/include -g -O -std=gnu++17 --offload-arch=gfx90a -fPIC -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -MD -MT CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -MF CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o.d -o 
CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -x hip -c /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp In file included from /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp:5: In file included from /build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha/fmha_fwd.hpp:6: In file included from /build/Debug/_deps/composable_kernel-src/include/ck_tile/core.hpp:11: /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression 27 \| asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory"); \| ^ /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 
'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression fatal error: too many errors emitted, stopping now [-ferror-limit=] 20 errors generated when compiling for gfx90a. ... ``` </details>	2024-06-27 12:04:17 +08:00
Scott McKay	887a818aa7	Check for unit test log severity override earlier (#21177 ) ### Description Setting the log level after environment creation is too late in some cases. If the DML EP is enabled, it will create a composite sink with the original logger using the creation time log severity, as well as an additional ETW sink. Because it saves the current severity level for each sink inside the composite sink, you cannot get verbose log output to stdout even if you set that at the session level. I don't know enough about the setup that combines ETW with the original sink to say whether we should also be updating the severity of individual sinks in the combined sink, so this change is limited to making the unit tests behave in the expected manner when the default log severity is set in the background and not directly controlled. ### Motivation and Context Make it possible to get verbose output to stdout when the DML EP is enabled.	2024-06-27 12:51:13 +10:00
Vincent Wang	3c0b407709	Rollback 19832, Remove shape_input_merge Fusion (#21179 ) The PR caused Big Models pipeline failure for running Llama2. After the rollback, the pipeline is back to normal.	2024-06-26 10:00:45 -07:00
Scott McKay	337cc56d6f	Convert scalars to 1D to satisfy ML Program requirements. (#21159 ) ### Description Convert scalars to 1D to satisfy ML Program requirements. https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1418617&view=logs&j=f7cc61a9-cc70-56e7-b06c-4668ca17e426&t=16d281b5-1bfd-5309-f274-36d0dffd9cb1&l=27167 ### Motivation and Context Fixes test failure in #17361	2024-06-26 09:54:36 -07:00
mindest	e2abba18ea	Skip softmax BF16 test for ROCm (#21162 ) ### Description Skip softmax BF16 test for ROCm, because BFloat16 is unsupported by MIOpen, and `torch.cuda.is_available()` also returns `True` for ROCm.	2024-06-26 11:15:50 +08:00
Wanming Lin	41ad83fb00	[WebNN EP] Support rest Reduction ops for TFLite backend (#21135 ) - reduceLogSum, reduceLogSumExp and reduceSumSquare have been landed in https://chromium-review.googlesource.com/c/chromium/src/+/5575815 - reduceL1 and reduceL2 have been landed in https://chromium-review.googlesource.com/c/chromium/src/+/5606091	2024-06-25 18:30:55 -07:00
Wanming Lin	4743803944	[WebNN EP] Support more Normalization ops for TFLite backend (#21151 ) Following Normalization ops have been supported in Chromium for TFLite backend: - batchNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532745 - layerNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5573326 - instanceNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532750	2024-06-24 19:04:23 -07:00
Jian Chen	f81c0ec32a	Remove warning suppression from Java Packaging pipeline. (#21010 ) ### Description Remove warning suppression from Java Packaging pipeline. ### Motivation and Context We want the CI step not to produce warnings.	2024-06-24 16:46:21 -07:00
mindest	adaf0e8116	[Fix] USE_NCCL -> ORT_USE_NCCL (#21136 ) ### Description Correct the macro used when NCCL is enabled.	2024-06-24 11:33:17 -07:00
Wanming Lin	3a917e49fb	[WebNN EP] Support 4 more ops for TFLite backend (#21134 ) Recently WebNN TFLite backend supports gelu, expand, softsign, reciprocal.	2024-06-24 09:52:12 -07:00
aciddelgado	ebd0368bb0	Make Flash Attention work on Windows (#21015 ) ### Description Previously, Flash Attention only worked on Linux systems. This PR will make it work and enable it to be built and run on Windows. Limitations of Flash Attention in Windows: Requires CUDA 12. ### Motivation and Context This will significantly increase the performance of Windows-based LLM's with hardware sm>=80. To illustrate the improvement of Flash Attention over Memory Efficient Attention, here are some average benchmark numbers for the GQA operator, run with configurations based on several recent models (Llama, Mixtral, Phi-3). The benchmarks were obtained on RTX4090 GPU using the test script located at (onnxruntime/test/python/transformers/benchmark_gqa_windows.py). * Clarifying Note: These benchmarks are just for the GQA operator, not the entire model.

### Memory Efficient Attention Kernel Benchmarks:

| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|---|---|---|---|
| Llama3-8B (Average Prompt) | 8192 | 0.19790525 | 13105.63425 |
| Llama3-8B (Average Token) | 8192 | 0.207775538 | 12025.10172 |
| Llama3-70B (Average Prompt) | 8192 | 0.216049167 | 11563.31185 |
| Llama3-70B (Average Token) | 8192 | 0.209730731 | 12284.38149 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.371928785 | 7031.440056 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.2996659 | 7607.947159 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.183195867 | 15542.0852 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.198215688 | 12874.53494 |
| Phi-3-small-128k (Average Prompt) | 65536 | 2.9884929 | 2332.584142 |
| Phi-3-small-128k (Average Token) | 65536 | 0.845072406 | 2877.85822 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.324974429 | 8094.909517 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.263662567 | 8978.463687 |

### Flash Attention Kernel Benchmarks:

| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|---|---|---|---|
| Llama3-8B (Average Prompt) | 8192 | 0.163566292 | 16213.69057 |
| Llama3-8B (Average Token) | 8192 | 0.161643692 | 16196.14715 |
| Llama3-70B (Average Prompt) | 8192 | 0.160510375 | 17448.67753 |
| Llama3-70B (Average Token) | 8192 | 0.169427308 | 14702.62043 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.164121964 | 15618.51301 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.1715865 | 14524.32273 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.167527167 | 14576.725 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.175940594 | 15762.051 |
| Phi-3-small-128k (Average Prompt) | 65536 | 0.162719733 | 17824.494 |
| Phi-3-small-128k (Average Token) | 65536 | 0.14977525 | 16749.19858 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.156490786 | 17679.2513 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.165333833 | 14932.26079 |

Flash Attention is consistently faster for every configuration we benchmarked, with improvements in our trials ranging from ~20% to ~650%. In addition to these improvements in performance, Flash Attention has better memory usage. For example, Memory Efficient Attention cannot handle a max sequence length higher than 32,768, but Flash Attention can handle max sequence lengths at least as high as 131,072. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2024-06-24 09:43:49 -07:00
zhijiang	269d9b094f	Zhijxu/fix softmax cudnn bf16 (#21045 ) If seq > 2048, ORT falls back to the cuDNN version, which throws an exception when the dtype is bf16. This PR tries to fix it.	2024-06-24 16:07:39 +08:00
Yi Zhang	5b5ce0bfb0	Add UsePython Task in Nuget Publish workflow (#21144 ) ### Description Otherwise it would fail in `b95982e588/tools/ci_build/github/azure-pipelines/publish-nuget.yml (L78-L81)` ### Motivation and Context The Windows CPU image is migrated to managed image ### Verification Link https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1313	2024-06-24 13:36:13 +08:00
Dmitri Smirnov	b95982e588	Fix 2D detection bug (#21128 ) ### Description Should compare two leading dims for 1.f ### Motivation and Context Vulnerability scanner	2024-06-21 13:58:21 -07:00
Dwayne Robinson	ac21626725	DML EP EinSum make more generic to avoid EP fallback (#21114 ) ### Problem Newer models using more novel equations (e.g. `bhwc,hkc->bhwk` in Segment Anything's encoder or `bqc,bchw->bqhw`) cause fallback from DML to CPU, yielding performance issues. The EP had some pattern matching to map more common equations to existing DML operators, but the number of permutations was prohibitive and could not catch them all. ### Solution So, ditch the static mapping, and instead handle any 1-input or 2-input cases via remapped strides and a mini-graph of elementwise multiplication & sum reduction (as if DML had a `DML_OPERATOR_DOT_PRODUCT` that took `axes`). A subset of mappings still exist for performance (GEMM, pure reduction, transpose...), but they are identified generally rather than via a pattern table. Also... - Diagonals are supported now (e.g. iji->i). - Removes any remaining DML-specific EinSum `GTEST_SKIP` statements. - Handles any cases up to 8 unique labels (DML dimension limit is 8D). - \>= 3 inputs and arbitrary size inputs via ellipsis are not handled, but we have yet to come across a model.	2024-06-21 11:46:16 -07:00
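The idea in the EinSum change above, treating any 2-input equation as an elementwise multiplication followed by a sum reduction over the labels missing from the output, can be sketched in plain Python. This is an illustrative reference only, not the DML EP implementation: `einsum2`, `shape_of`, and `element` are hypothetical names, tensors are nested lists, and the result is returned as a dict keyed by output index rather than as a tensor.

```python
from itertools import product


def shape_of(tensor):
    """Shape of a non-empty nested-list tensor."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return shape


def element(tensor, index):
    """Read one element of a nested-list tensor."""
    for i in index:
        tensor = tensor[i]
    return tensor


def einsum2(equation, a, b):
    """2-input einsum as multiply + sum-reduce (illustrative sketch)."""
    inputs, out_labels = equation.split("->")
    a_labels, b_labels = inputs.split(",")
    sizes = {}
    for labels, tensor in ((a_labels, a), (b_labels, b)):
        for label, dim in zip(labels, shape_of(tensor)):
            if sizes.setdefault(label, dim) != dim:
                raise ValueError(f"size mismatch for label {label!r}")
    labels = sorted(sizes)
    out = {}
    # Visit every combination of label values; multiply the two elements
    # and accumulate into the output position. Labels absent from the
    # output are summed away, which is the reduction step.
    for combo in product(*(range(sizes[l]) for l in labels)):
        pos = dict(zip(labels, combo))
        term = element(a, [pos[l] for l in a_labels]) * element(b, [pos[l] for l in b_labels])
        key = tuple(pos[l] for l in out_labels)
        out[key] = out.get(key, 0) + term
    return out


# Matrix multiplication expressed as an einsum equation.
print(einsum2("ij,jk->ik", [[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because nothing here depends on a table of known equations, repeated labels (diagonals) and unusual label orders fall out of the same loop, which is the generality the commit message describes.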
Caroline Zhu	6236707c64	Enable >2GB models + allow model paths to be passed for generate_artifacts API (#20958 ) ### Description Alternative design from #20942 Allow users to pass in a model path for the generate_artifacts API. ### Motivation and Context - ONNX API calls such as the onnx checker + shape inference fail when given a model > 2GB, but work if a path to a model >2GB is passed in.	2024-06-21 09:55:26 -07:00
RuomeiMS	7cf9263ee7	Add changes for strided calibration (#20949 ) Context and motivation: When quantizing large transformer models, we ran into OOM issues as the number of calibration samples went up. To resolve this, this PR adds support for reading quantization data in chunks, calculating ranges for intermediate tensors per chunk, then accumulating the results into the final ranges.	2024-06-21 08:23:23 -07:00
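The chunked accumulation described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the ORT quantization code: the helper names (`chunked`, `chunk_ranges`, `merge_ranges`, `calibrate`) are hypothetical, min/max is assumed as the range statistic, and each sample is modelled as a plain dict from tensor name to a list of values standing in for intermediate tensor outputs.

```python
def chunked(samples, stride):
    """Yield successive slices of at most `stride` samples."""
    for start in range(0, len(samples), stride):
        yield samples[start:start + stride]


def merge_ranges(total, update):
    """Fold per-tensor (min, max) pairs from `update` into `total` in place."""
    for name, (lo, hi) in update.items():
        if name in total:
            lo = min(lo, total[name][0])
            hi = max(hi, total[name][1])
        total[name] = (lo, hi)
    return total


def chunk_ranges(chunk):
    """Compute the (min, max) range of every tensor seen in one chunk."""
    ranges = {}
    for sample in chunk:
        per_sample = {name: (min(v), max(v)) for name, v in sample.items()}
        merge_ranges(ranges, per_sample)
    return ranges


def calibrate(samples, stride):
    """Accumulate final ranges chunk by chunk.

    Only one chunk's worth of data is processed at a time, so peak memory
    is bounded by `stride` rather than by the total sample count.
    """
    total = {}
    for chunk in chunked(samples, stride):
        merge_ranges(total, chunk_ranges(chunk))
    return total


samples = [{"x": [1.0, -2.0]}, {"x": [0.5, 3.0]}, {"x": [-5.0, 0.0]}]
print(calibrate(samples, stride=2))  # {'x': (-5.0, 3.0)}
```

Because min and max are associative, merging per-chunk ranges gives the same result as a single pass over all samples; other statistics (for example histograms or percentiles) would need their own merge rule.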
Changming Sun	f5625b8858	Revert "[MIGraphX EP] enable compilation and execution on Windows (21084)" (#21132 ) ### Description This reverts commit `1d7bf56947` because it broke the AMD GPU CI pipeline. Sorry, when I reviewed the PR I forgot to run the AMD GPU CI pipeline. Will revert the PR first, then ask the author to fix the issue.	2024-06-21 01:01:07 -07:00
Yi Zhang	69d522f4e9	[Fix] use cmdline in Final Jar Testing Stage for new managed Windows Image (#21130 ) ### Description There is no bash command in the managed Windows image. Use a CmdLine step instead. ### Verified Link https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=491902&view=logs&j=f1f8e11e-a9fa-53e5-cd29-3ba2c1988550	2024-06-21 12:41:06 +08:00
Jake Mathern	b9eb1dc21e	Update protobuf_cmake.patch to allow extra disablements configurable by projects that build ORT (#20875 ) ### Description Update protobuf_cmake.patch to allow extra disablements. ORT repo already patches protobuf to not disable the warning 4996. ### Motivation and Context To meet SDL requirements, Microsoft repos have to fail build if there is warning 4996 Binskim also gives errors if warning 4996 is disabled. We can suppress the Binskim issues, but we need a way to disable the warnings for the minimal set of code that has them. Right now, WindowsAI disables 4996 for entirety of ORT, but it should only be disabled for protobuf.	2024-06-20 16:28:15 -07:00
Ted Themistokleous	1d7bf56947	[MIGraphX EP] enable compilation and execution on Windows (#36 ) (#21084 )	2024-06-20 16:21:11 -07:00
Changming Sun	efcaa835b1	Update generate_nuspec_for_native_nuget.py for training (#21112 ) ### Description Similar to #21096 , but this one is for ORT training nuget package.	2024-06-20 16:13:31 -07:00
Yi-Hong Lyu	00c713088d	Adopt QDQFinalCleanupTransformer for Q->DQs/DQ->Qs cases (#21018 )	2024-06-20 11:21:32 -07:00
Wanming Lin	0c80cd2157	[WebNN EP] Update Prelu restriction for CPU backend (#20878 )	2024-06-20 11:04:01 -07:00
ivberg	55f7f9d7a9	Fix Crash When Enabling and Disabling ETW with Old Callbacks (#21086 ) ### Description Under certain conditions, when ETW is enabled and disabled continuously, we got a crash report. This change allows ETW callbacks to be de-registered in the class destructor. Related to #20537 ### Motivation and Context Fixes crash ### Callstack We see it crash in [0x0] onnxruntime!<lambda_967a738fca8512372f170fcaf2d094d4>::operator()+0x34 0x12941ff570 0x7ffa994f0a04 [0x1] onnxruntime!std::_Func_class<void,_GUID const ,unsigned long,unsigned char,unsigned __int64,unsigned __int64,_EVENT_FILTER_DESCRIPTOR ,void *>::operator()+0x54 0x12941ff7b0 0x7ffa994f0d64 [0x2] onnxruntime!onnxruntime::logging::EtwRegistrationManager::InvokeCallbacks+0xcc 0x12941ff7b0 0x7ffa994f0d64 [0x3] onnxruntime!onnxruntime::logging::EtwRegistrationManager::ORT_TL_EtwEnableCallback+0x94 0x12941ff860 0x7ffa98d19628 and it seems to us that the `this` pointer captured in etwRegistrationManager.RegisterInternalCallback( [&etwRegistrationManager, this]( ... is no longer valid when the callback is called.	2024-06-20 06:45:45 -07:00


11997 commits