onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-25 02:50:42 +00:00

Author	SHA1	Message	Date
Dmitri Smirnov	d9de054eb5	Multi-Lora support (#22046 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-09-30 15:59:07 -07:00
Enrico Galli	52a8c1cae8	[WebNN EP] Enable IO Bindings with MLTensor (#21301 ) ### Description Enables using the MLTensor to pass data between models. ### Motivation and Context Using MLTensor instead of ArrayBuffers reduces the number of copies between the CPU and devices as well as the renderer and GPU process in Chromium.	2024-09-27 17:24:21 -07:00
Patrice Vignola	14ba2fb83c	[DML EP] Add intermediate tensor dumping for DML (#22246 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-09-27 12:39:45 -07:00
sfatimar	0309c5f02f	Ovep release lnl 1.2.1 (#22027 ) Error Codes are added to catch compilation error and signal recompile. Remote Tensors are added to ensure direct memory access for NPU inferencing. UMD Bypass cache enabled with 2024.4 will eliminate need to disk caching ### Motivation and Context The changes are needed to ensure backward compatibility UMD Bypass caching eliminates driver caching Remote Tensors lead to performance improvement with inferencing on NPU --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-11 14:55:40 -07:00
PARK DongHa	f633caa0b1	Create CMake option `onnxruntime_USE_VCPKG` (#21348 ) ### Changes 1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg port * Unit test may fail because this option leads to a mixture of unexpected external library versions. Especially ONNX, Protobuf, and Flatbuffers version can be different 2. Overhaul of `onnxruntime_external_deps.cmake` * Make `FetchContent_Declare` to try `find_package`. See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable`(or `onnxruntime_fetchcontent_makeavailable`) to closer lines. It was too hard to navigate the entire file to search related sections... * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` --> `onnx`) ```cmake # The script uses `find_package` with the changes. # In this case, use vcpkg to search dependencies # See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html include(external/onnxruntime_external_deps.cmake) ``` 3. Create CMakePresets.json and presets to [run vcpkg in manifest mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode) * Currently, it's NOT for training build * Main triplets are `x64-windows` and `x64-osx` ```pwsh Push-Location "cmake" cmake --preset "x64-windows-vcpkg" cmake --build --preset "x64-windows-vcpkg-debug" Pop-Location ``` ```bash pushd "cmake" cmake --preset "x64-osx-vcpkg" cmake --build --preset "x64-osx-vcpkg-debug" popd ``` 4. Updated tools/ci_build/build.py * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with [vcpkg.cmake toolchain script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake) * `--compile_no_warning_as_error` is recommended because library version differences will cause unexpected compiler warnings ```bash python ./tools/ci_build/build.py \ --compile_no_warning_as_error \ --use_vcpkg \ --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \ --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..." ``` 5. Created Job `Vcpkg` for Windows and macOS * Show how to setup and use vcpkg. Similar to the CMakePresets.json usage ### Motivation and Context * Help #7150 * Help https://github.com/microsoft/vcpkg/pull/36850 * https://github.com/luncliff/vcpkg-registry/pull/212 * https://github.com/microsoft/vcpkg/pull/39881 * https://github.com/luncliff/vcpkg-registry/pull/215 * https://github.com/luncliff/vcpkg-registry/pull/216 * https://github.com/luncliff/vcpkg-registry/pull/227 * https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake ### Future Works? More feature coverage with the vcpkg supported libraries * CUDA feature support * Training feature support	2024-09-10 16:39:27 -07:00
Arne H Juul	605a84ffc9	remove unused and confusing float16 constants (#21999 ) ### Description Remove unused and confusing special constants in MLFloat16 and BFloat16 types. ### Motivation and Context While looking at adding a specialization for std::numeric_limits for the 16-bit floating point types, I found that there are various special constants in those types that are confusing or just wrong. MLFLoat16::Epsilon is not an epsilon at all, but approximates "e". Looks like a copy-paste bug. BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`, nor even to the C# Float.Epsilon. Instead, it corresponds to `numeric_limits::min()` which was really confusing to me. The "MinValue" constants does correspond to the C# `Float.MinValue` constant, but this is C++ so it would be better renamed to "LowestValue" since it corresponds to `numeric_limits::lowest()`. As it was unused except for some unit tests I have replaced it with the equivalent `MaxValue.Negate()` here. There's also an unused `kSignaling_NaNBits` constant which is just wrong (has the same value as `kPositiveInfinityBits` instead of a NaN).	2024-09-05 22:00:48 -07:00
Yulong Wang	2e83541eba	fix one build warning in MSVC (#21983 ) ### Description Fix one MSVC warning member not initialized ``` Warning C26495 Variable 'onnxruntime::ITuningContext::allocators_' is uninitialized. Always initialize a member variable (type.6). C:\code\onnxruntime\onnxruntime\core\framework\tuning_context.h 22 ```	2024-09-04 17:51:14 -07:00
Yulong Wang	32af2ba68f	enhance string util functions (#21893 ) ### Description - make `MakeString` force inline - refactor ORT_FORCEINLINE macro - move to one place to avoid macro redefinition error - ~~add a `StringJoin` utility~~ ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-08-29 10:37:50 -07:00
Yulong Wang	d2a1b7a353	Introduce custom external data loader (#21634 ) ### Description This PR introduces support for custom external data loader. An EP can register a custom external data loader to override the default behavior, making it possible to upload initializers directly to GPU. ### Motivation and Context - In ONNX Runtime Web, WebAssembly uses 32-bit as pointer type (`sizeof(size_t)==4`), which means there is a 4GB hard limit on the maximum memory. As the ONNX models get larger, this becomes a blocker for supporting medium-sized language models. - ORT runs out of memory because the current code always loads data into CPU memory, including the .onnx file (protobuf) and external data file(s). However, if using GPU EP, the big data does not need to be kept on CPU because the only thing that ORT does is to load the data into memory, upload to GPU and then release them. - Some platforms has offered developers way to upload data directly to GPU. For example, webgpu allows uploading from any ArrayBuffer (it can be a side buffer, not count into the 4GB) to GPU directly. This helps to keep the CPU memory usage significantly. ### Design Class `ExternalDataLoader` and `ExternalDataLoaderManager` are introduced. They are similar to `DataTransfer` and `DataTransferManager`. `InferenceSession` owns the manager object, and `SessionState` keeps a reference to it. Added a new method `GetExternalDataLoader` in `IExecutionProvider`. An EP can override the method to register an instance of custom external data loader. The key function in a `ExternalDataLoader` class is method `LoadTensor`: ```c++ // the tensor is pre-created using the TensorProto info of the initializer and the MemoryInfo (from allocation plan). virtual common::Status LoadTensor(const Env& env, const std::filesystem::path& data_file_path, FileOffsetType data_offset, SafeInt<size_t> data_length, Tensor& tensor) const; ``` This function can be registered by EP, going through a few layers and eventually get into `DeserializeTensorProto()` in the finalizing stage of session initialization. In this step, initializer tensors are created. Behavior is changed to first look up for a registered external data loader that can handle the current memory info. If any instance is available, use the loader; otherwise respect the old code path.	2024-08-27 12:18:52 -07:00
Frank Dong	a0708a0d96	avoid redundant memory allocation for external initializers (#21682 ) ### Description avoid redundant memory allocation for external initializers, we will use mmap for external initializers later so no point to allocate memory in advance then release them later. ### Motivation and Context In current implementation, we will: 1. Allocate memory (with desired size of current initializer) for initializer first: [https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L131](https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmicrosoft%2Fonnxruntime%2Fblob%2Fmain%2Fonnxruntime%2Fcore%2Fframework%2Fsession_state_utils.cc%23L131&data=05%7C02%7Cfrdong%40microsoft.com%7C1e126797c95149aa217d08dcb781cc60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638587015340041125%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6fN57MUsergrCX%2BBS7jztWBRmc8nx19EVvn0lUJ2Gtk%3D&reserved=0) 2. For external initializer, we will point initializer to mmaped object in memory and release previously allocated tensor: [https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L89](https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmicrosoft%2Fonnxruntime%2Fblob%2Fmain%2Fonnxruntime%2Fcore%2Fframework%2Fsession_state_utils.cc%23L89&data=05%7C02%7Cfrdong%40microsoft.com%7C1e126797c95149aa217d08dcb781cc60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638587015340054491%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=yBtXLc%2Bhpx3IT1%2FX0664foqQ5X5O%2Fy5XNhj4Oed%2BAt4%3D&reserved=0) For large models, we are keep allocating and release memory for external initializers which seems unnecessary. For phi silica model, with this change we can reduce transient memory usage from 4,566MB to 2,724MB. Since these redundant memory is released quickly when we mmap external initializers so this change has no much impact on peak memory usage.	2024-08-13 23:13:49 -07:00
jingyanwangms	154084efaa	Security Fuzz Test Fixes (#21608 ) ### Description Fix address sanitizer and memory access Bug 1, 4, 5, 7, 8 found in security fuzz test ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-08-11 03:28:41 -07:00
Adrian Lizarraga	0e708de4fc	[QNN EP] Support Conv + Clip/Relu fusion (#21537 ) ### Description - Supports quantized Conv + Activation on the HTP backend: - Translates `DQs -> Conv -> Relu/Clip -> Q` into a single QNN Conv operator if the Relu (or Clip) are redundant. ### Motivation and Context Expands support for QDQ models created with tools that do not wrap Relu or Clip with QDQ nodes. This PR introduces the `IQnnNodeGroup` class. In the same way that a `NodeUnit` represents a collection of `Nodes`, a `IQnnNodeGroup` can represent one or more `NodeUnits` that are translated into a QNN operator. QNN EP parses the ONNX graph to create a list of `IQnnNodeGroup` objects, each representing a single `NodeUnit` or a fusion of multiple `NodeUnits`.	2024-08-02 11:02:22 -07:00
Jing Fang	e7aa11607f	Utilize ext data location to reduce qd matmul memory usage (#21451 ) ### Description When the graph is quantized to qdq format, the DQ + MatMul is transformed to MatMulNBits in the level 2 optimizer when the model is initialized in an inference session. In the transformation step, tensors are transposed and new tensor protos are created. Instead of using protobuf arena allocated memory, the PR sets the tensor proto to use external buffer, and point the external location to memory location which contains the tensor buffer allocated by CPU. Then, in the step that creates OrtValue using the tensor proto, the memory buffers in the tensor proto are directly assigned to the tensors which were originally allocated by Ort Arena. With these two steps, the peak memory usage of QDQ format model is the same as usage of QOperator model. Besides, the model initialization time is significantly reduced. Take [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for example: \|\| QOperator Model (MatMulNBits) \| QDQ Model (DQ + MatMul, original code) \| QDQ Model (this PR) \| \|---\|---\|---\|---\| \| peak memory consumption \| 2.8 GB \| ~4.8 GB \| 2.8 GB \| \| initialization time \| 3 sec \| 9 sec \| 5 sec \| ### Motivation and Context When the graph is quantized to qdq format, the DQ + MatMul is converted to MatMulNBits in the level 2 optimizer. Originally, the newly created tensor proto use memory allocated by protobuf arena. These memory usage cannot be fully released when the tensor protos are deleted. Then, in the tensor proto to OrtValue step, tensors are created using ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are created. The tensors in the ORT arena are not fully released as well. The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits transformation will result in almost 2x memory consumption in the model initialization.	2024-07-30 15:22:46 -07:00
Justin Chu	c203d89958	Update ruff and clang-format versions (#21479 ) ruff -> 0.5.4 clang-format -> 18	2024-07-24 11:50:11 -07:00
mindest	5b9369e93c	Fix typos according to reviewdog report. (#21335 ) ### Description Fix typos based on reviewdog report but with some exceptions/corrections.	2024-07-22 13:37:32 -07:00
Ranjit Ranjan	6c7562b097	Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133 ) ### Description Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. ### Motivation and Context changes in this PR contains: 1. Enablement code for building onnxruntime on AIX operating system. 2. while testing the build on AIX, we found issues related to big endian platform . More details about few of those issues can be found in [Big endian issue: Graph Transformation Attention Fusion tests are failing #12921](https://github.com/microsoft/onnxruntime/issues/12921) Below are list of files and the description about the change. 1. cmake/CMakeLists.txt [BUILDING on AIX issue] check for "IBMClang" is added for handling -Wno-unused-parameter 2. cmake/external/onnxruntime_external_deps.cmake [BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX 3. cmake/onnxruntime.cmake [BUILDING on AIX issue] o Blocking codes for AIX which generates generated_source.c and further requires some symbol files. o Putting NO AIX check for non-supported linker flags like --Xlinker o iconv linking 4. cmake/onnxruntime_framework.cmake [BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN' 5. cmake/onnxruntime_mlas.cmake [BUILDING on AIX issue]POWER10 releated macro/function definition . 6. cmake/onnxruntime_providers_cpu.cmake [BUILDING on AIX issue]Putting NO AIX check for non-supported linker flags like --Xlinker 7. cmake/onnxruntime_unittests.cmake [BUILDING on AIX issue] o Putting NO AIX check for non-supported linker flags like --Xlinker o Adding required libraries for AIX linker under applicatiion like onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc 8. cmake/patches/flatbuffers/flatbuffers.patch [BUILDING on AIX issue] Handling of TypeCode in include/flatbuffers/flatbuffers.h under AIX + clang 9. onnxruntime/contrib_ops/cpu/murmur_hash3.cc [Big endian issue] Byte-Conversion handlling in compute() and getblock() routines 10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc [Big endian issue] Handling of test failures . Byte swapping for quant_value. 11. onnxruntime/core/framework/tensorprotoutils.cc [Big endian issue] Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto . o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling ConvertRawDataInTensorProto() in big-endian system o ConvertRawDataInTensorProto : function used mainly on big-endian system for byte-swapping of tensor raw_data 12. onnxruntime/core/framework/tensorprotoutils.h [Big endian issue] Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto 13. onnxruntime/core/graph/graph.cc [Big endian issue] o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type o Call ConvertRawDataInTensorProto for SaveToOrtFormat 14. onnxruntime/core/mlas/lib/platform.cpp [BUILDING on AIX issue] POWER10 released enablement for AIX 15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp [BUILDING on AIX issue]Handling of __vector under AIX+clang 16. onnxruntime/core/mlas/lib/qgemm.h [BUILDING on AIX issue] Adding _AIX flag 17. onnxruntime/core/mlas/lib/qlmul.cpp [BUILDING on AIX issue] Handling of __vector under AIX+clang 18. onnxruntime/core/optimizer/attention_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 20. onnxruntime/core/optimizer/constant_folding.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 22. onnxruntime/core/optimizer/nchwc_transformer.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 26. onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 27. onnxruntime/core/optimizer/reshape_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 28. onnxruntime/core/optimizer/stft_decomposition.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 29. onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 30. onnxruntime/core/platform/path_lib.h [BUILDING on AIX issue] Moving to normal function call, instead of template 31. onnxruntime/core/platform/posix/env.cc [BUILDING on AIX issue]Blocking syscall.h in AIX 32. onnxruntime/core/session/inference_session.cc [Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN 33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc [Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer and ExternalWriteReadWithLoadInitializers 34. onnxruntime/test/framework/sparse_kernels_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 35. onnxruntime/test/framework/tensorutils_test.cc [Big endian issue] Helper method ConvertEndianessForVector and call this from required place. 36. onnxruntime/test/framework/test_tensor_loader.cc o. [BUILDING on AIX issue] Handling of getcwd for AIX o. [Big endian issue] Bytes Swapping in run_external_data_test 37. onnxruntime/test/onnx/main.cc [Big endian issue] including <thread> for AIX 38. onnxruntime/test/onnx/tensorprotoutils.cc [Big endian issue] Bytes swapping in UnpackTensorWithRawData 39. onnxruntime/test/optimizer/graph_transform_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 40. onnxruntime/test/optimizer/graph_transform_test_builder.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 41. onnxruntime/test/optimizer/graph_transform_test_builder.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 42. onnxruntime/test/optimizer/initializer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 44. onnxruntime/test/providers/base_tester.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 45. onnxruntime/test/providers/cpu/generator/random_test.cc [BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase --------- Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>	2024-07-17 12:37:06 -07:00
Changming Sun	2c53b4a534	Remove core/common/gsl.h (#20894 ) ### Description It might be easier if we just directly include the original gsl headers. "core/common/gsl.h" is an indirection that doesn't provide extra help.	2024-07-08 18:09:39 -07:00
Changming Sun	07c429191e	Delete path.h (#21211 ) ### Description Delete path.h and replace all occurrences of onnxruntime::Path with std::filesystem::path. Previously we couldn't use C++17's std::filesystem because it was not supported in iOS 12(which was released in 2018). Now we dropped the support for iOS 12. ### Motivation and Context To simplify code. For example, if an EP wants to use the Path class, now it can directly use it without going through a wrapper. And the standard implementation can handle various path types better. (We didn't take much consideration on UNC path, "/" as a path separator on Windows, etc).	2024-07-04 15:54:13 +08:00
Chen Feiyue	56b36a58ba	Initial PR for VSINPU execution provider (#20903 ) ### Description <!-- Describe your changes. --> -It is an initial PR for VSINPU execution provider ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> - For support VeriSilicon hardware - TIM-VX(Tensor Interface Module) (https://github.com/VeriSilicon/TIM-VX) is an integrated software solution by Verisilicon for our hardware(A311D/i.MX 8M Plus etc.) design, it is easy to use Verisilicon’s hardware by simply connecting onnxruntime with the TIM-VX API by this VSINPU execution provider.	2024-06-28 21:48:34 -07:00
Changming Sun	3a83f8b317	Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920 ) ### Description 1. Update the functions in tensorprotoutils.h to use std::filesystem::path instead of onnxruntime::Path. Eventually we can remove the whole onnxruntime::Path class, but to this PR small I am not doing that. 2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro def when TensorRT EP is enabled.	2024-06-28 20:03:57 -07:00
Preetha Veeramalai	6baaaf5165	OVEP options to disable CPU fallback at compile time (#21166 ) ### Description Provide user level options to control the fallback on CPU for models not supported on Intel's NPU hardware. ### Motivation and Context - Current workflow of OVEP allows safe fallback from OV NPU to OV CPU on compilation failures. Also supports MLAS CPU fallback in presence of unsupported custom ops. - The PR provides a build-time option to disable fallback from OV NPU to OV CPU. - The session Option "kOrtSessionOptionsDisableCPUEPFallback" disables OV CPU and MLAS CPU fallback. - Also has bug fix for proto creation. --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>	2024-06-28 08:31:02 -07:00
ivberg	55f7f9d7a9	Fix Crash When Enabling and Disabling ETW with Old Callbacks (#21086 ) ### Description Under certain conditions with enabling & disabling ETW continuously, we got a crash report. Allows ETW callbacks to be de-registered upon class destructor. Related to #20537 ### Motivation and Context Fixes crash ### Callstack We see it crash in [0x0] onnxruntime!<lambda_967a738fca8512372f170fcaf2d094d4>::operator()+0x34 0x12941ff570 0x7ffa994f0a04 [0x1] onnxruntime!std::_Func_class<void,_GUID const ,unsigned long,unsigned char,unsigned __int64,unsigned __int64,_EVENT_FILTER_DESCRIPTOR ,void *>::operator()+0x54 0x12941ff7b0 0x7ffa994f0d64 [0x2] onnxruntime!onnxruntime::logging::EtwRegistrationManager::InvokeCallbacks+0xcc 0x12941ff7b0 0x7ffa994f0d64 [0x3] onnxruntime!onnxruntime::logging::EtwRegistrationManager::ORT_TL_EtwEnableCallback+0x94 0x12941ff860 0x7ffa98d19628 and seems to us that the this pointer captured in etwRegistrationManager.RegisterInternalCallback( [&etwRegistrationManager, this]( ... is no longer valid when the callback is called.	2024-06-20 06:45:45 -07:00
Adrian Lizarraga	a6c18ae9df	[QNN EP] Add quantization axis checks for Conv/ConvTranspose/Q/DQ ops (#21016 ) ### Description Updates QNN EP to reject Conv/ConvTranspose/Q/DQ ops with unsupported quantization axis values. ### Motivation and Context Allows these unsupported operators to be handled by the CPU EP. Fixes errors like the following: > Node 'ConvTranspose' OpType:ConvTranspose with domain:com.ms.internal.nhwc was inserted using the NHWC format as requested by QNNExecutionProvider, but was not selected by that EP. This means the graph is now invalid as there will not be an EP able to run the node. This could be a bug in layout transformer, or in the GetCapability implementation of the EP. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-06-17 09:46:14 -07:00
Tianlei Wu	f25cf19375	Add helper functions to dump 4d tensors in CPU for debugging (#21043 ) Add some helper functions to dump 4D tensors to help debugging. Example to use it: (1) Change DUMP_TENSOR_LEVEL from 0 to 2 in contrib_ops/cpu/utils/debug_macros.h to enable dumping. Without enabling, the dumping code will not be built into ORT binary. (2) Add a few lines to dump tensors like ``` DUMP_CPU_TENSOR_INIT(); DUMP_CPU_TENSOR("tensor name", tensor_data, dim0, dim1, dim2, dim3); ``` Changes: - [x] Add functions to dump 4D int32/int64/float/half tensors in CPU - [x] Add functions to dump 4D int32/int64 tensors in CUDA - [x] Change namespace (remove .transformers from namespace, and move files to utils directory)	2024-06-14 17:32:27 -07:00
pengwa	87b14ac7e4	Release backward inputs per static graph ref count (#20804 ) ### Release backward inputs per static graph ref count For the output buffer marked as external output: 1. Remove the additional ref count we used for avoiding reusing buffer. Instead, when we find reuse input/output buffer, we will make sure the reused buffer not not generated by nodes that has external outputs. 2. Remove the ref count of pybind feed inputs, which exists all the time until the run_backward completed. Instead, passing a mutuble feeds, and we clean the feeds vector once that is copied into session states and not needed any more before run the graph sequencentially. #### Before the change: One of the backward inputs is 3.9GB, it lives until the backward ends. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/e71e2072-eaaa-4be3-a39f-0ca74b507265) #### With the change: The 3.9GB is released when the last node depending on that tensor completed. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/7b27d01f-c675-4faf-9a3e-f886b31b2afe) Be noted: the peak did not change though, we have more work to do to reduce on the peak. #### Others It is found there are few tests that were updated to use incorrect expected values in previous code refactoring `a81faee41e (diff-9e8fbae7d3dff24106cd17564949f320e943cb3048eae07813c7de144f140419L382)`. This PR tries to fix them back, and I think now all test cases are back to normal. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-06-14 14:33:01 +08:00
Baiju Meswani	fff68c3151	Avoid reusing buffer for node outputs with no consumers (#21019 )	2024-06-13 16:08:16 -07:00
Adrian Lizarraga	5ec7ac80c7	Fix compiler error when onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS is enabled (#20889 ) ### Description The recent [PR for int4 support](https://github.com/microsoft/onnxruntime/pull/20362) breaks builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled. This PR adds utility functions for debug printing of int4 tensor statistics and data. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-05-31 18:07:53 -07:00
Adrian Lizarraga	b02d5e6d76	[CPU EP] Int4 support for QuantizeLinear, DequantizeLinear, and Transpose (#20362 ) ### Description - 4-bit QuantizeLinear(21). Blocked quantization still missing (i.e., do not support the new `block_size` attribute) - 4-bit DequantizeLinear(21). Blocked dequantization still missing (i.e., do not support the new `block_size` attribute) - 4-bit Transpose(21). - Update quantization tool with int4 types. - Disable QDQ fusions for 4-bit types. See: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc - MLAS 4-bit quantization kernels for intel, neon, powerpc. ##### Notes To calculate a tensor's storage size, we normally get the number of elements from the shape (i.e., `tensor_shape.Size()`) and multiply by the size of a single element. This does not directly work for sub-byte elements like int4 as each element in a `Tensor<Int4x2>` stores two packed int4 elements in a byte. The `Tensor:: CalculateTensorStorageSize` should be called to perform the correct calculation for any tensor element type. ### Motivation and Context ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4 type to ORT and adds int4 implementations for the Quant, Dequant, and Transpose ops on CPU EP. We still need to add int4 support for many ops and execution providers. See the ONNX 1.16 release notes: https://github.com/onnx/onnx/releases.	2024-05-30 18:56:24 -07:00
Edward Chen	3270a002fa	Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. (#20379 ) Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. The ORT format model doesn't contain information about kMSInternalNHWCDomain since it is set during layout transformation. Fall back to known domains instead.	2024-04-22 18:34:01 -07:00
pengwa	a7787a0bad	Introduce memory efficient topological sort (#20258 ) ### Introduce memory efficient topo sort (for training) ~~and laze initialize Priority-Based and Memory-Efficient topo sort. Because in most cases, they are not needed, so we free the overheads of GraphViewer construction for most use cases.~~ ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-04-23 08:00:23 +08:00
Hector Li	5daeb5e0b0	enable model with external data be loaded from memory buffer (#19089 ) ### Description Background: User save large model with initializer data in external file. e.g: onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True, location="filename", size_threshold=1024). In that case, Ort loads the model, get the external initializer information (external file name, offset, length) and use the model path to find the external file, and locate to the tensor data via the offset and length. But it won't work if user load the model from memory, since Ort lost track of the model path. This PR adds API/session option to let user provide a table with external initializer file name as the key, the pointer to the loaded external file in memory and the buffer length as value. So that 1. user can load the model from memory buffer with external initializers in memory buffer too. 2. the initializers can be shared across sessions, for different EPs. 3. user can load the file in any way they want, e.g mmap. Internally, 1. at session creation time, Ort goes through the external initializers in the graph, gets the file name, offset, data length of the external initializers from Tensorproto . 2. With the file name, Ort get the file in memory buffer and buffer length from the table user provided. 4. Ort locates the tensor buffer from file in memory buffer (user provided) using the offset and data length (from Tensorproto ). 5. Ort creates the Tensor and replace the existing Tensor in the graph. ### Motivation and Context https://github.com/onnx/onnx/blob/main/docs/ExternalData.md For a model with external data, the Tensorproto may have initializer data in a separate file. The external file location is set via the file path relative to the model path. With the API to load model from memory buffer, it lost track of the model path. So it causes error if the model has external data. By adding a session option to set the external data buffer, Ort can find the external data correctly if model loaded from memory buffer.	2024-04-17 19:01:01 -07:00
Adrian Lizarraga	f644ff9fc0	[QNN EP] Support per-channel quantized weights (#20154 ) ### Description - Adds general support for per-channel quantized weights to QNN EP (HTP backend). - Add QNN EP unit tests for per-channel Conv - Update quantization tool to allow selecting which ops are quantized per-channel (and which axis) via tensor-level overrides. Currently, setting `per_channel=True` assumes all Convs, MatMuls, Gemms, InstanceNormalization, and LayerNormalization ops should be quantized per-channel using some assumed default axis. #### Creating QDQ per-channel Conv model example ```python from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model class DataReader(CalibrationDataReader): # TODO: See ONNX Runtime QNN docs for example of a data reader # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64 pass if __name__ == "__main__": input_model_path = "model.onnx" my_data_reader = DataReader(model_to_quantize) # Pre-process the original float32 model. preproc_model_path = "model.preproc.onnx" model_changed = qnn_preprocess_model(input_model_path, preproc_model_path) model_to_quantize = preproc_model_path if model_changed else input_model_path # RELEVANT TO THIS PR: # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0. # The presence of the 'axis' key indicates that this is a per-channel quantized weight. init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]} qnn_config = get_qnn_qdq_config(model_to_quantize, my_data_reader, init_overrides=init_overrides, activation_type=QuantType.QUInt16, # uint16 activations weight_type=QuantType.QUInt8) # uint8 weights by default quantize(model_to_quantize, "model.qdq.onnx", qnn_config) ``` float32 model: <img width="683" alt="image" src="https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39"> QDQ model (per-channel Conv weight): <img width="748" alt="image" src="https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7"> ### Motivation and Context Support more models, especially models with int4 quantized weights.	2024-04-16 08:45:35 -07:00
liqun Fu	cd7112f800	Integration with ONNX 1.16.0 (#19745 ) ### Description update with ONNX 1.16.0 branch according to https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md ONNX 1.16.0 release notes: https://github.com/onnx/onnx/releases/tag/v1.16.0 #### Updated ops for CPU EP: - DequantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block dequantization support - QuantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block quantization support - Cast(21) - Missing int4 and uint4 support - CastLike(21) - Missing int4 and uint4 support - ConstantOfShape(21) - Missing int4 and uint4 support - Identity(21) - Missing int4 and uint4 support - If(21) - Missing int4 and uint4 support - Loop(21) - Missing int4 and uint4 support - Reshape(21) - Missing int4 and uint4 support - Scan(21) - Missing int4 and uint4 support - Shape(21) - Missing int4 and uint4 support - Size(21) - Missing int4 and uint4 support - Flatten(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Pad(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Squeeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Transpose(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Unsqueeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support #### Unimplemented opset 21 features/ops - int4 and uint4 data type - QLinearMatMul(21) - GroupNormalization(21) - ai.onnx.ml.TreeEnsemble(5) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> ### Disabled tests #### ORT Training orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py - test_ort_custom_ops: Potential shape inference bug for custom ops #### Python quantization unit tests test/onnx/python/quantization (shape inference bug) - test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16 - test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16 - test_op_gemm.py: test_quantize_qop_gemm_s8s8 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3 - test_op_matmul.py: test_quantize_matmul_u8u8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy - test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile - test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution - test_op_relu.py: test_quantize_qop_relu_s8s8 #### ONNX tests - test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a maxpool output size bug and added this test. Enable this test when [ORT PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged. Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741). - test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op ai.onnx.ml.TreeEnsemble - test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same - test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same - test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same - test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4 yet - test_cast_INT4_to_INT8_cpu: same - test_cast_UINT4_to_FLOAT_cpu: same - test_cast_UINT4_to_UINT8_cpu: same - test_cast_INT4_to_FLOAT_cuda - test_cast_INT4_to_INT8_cuda - test_cast_UINT4_to_FLOAT_cuda - test_cast_UINT4_to_UINT8_cuda - test_constantofshape_float_ones_cuda: ConstantOfShape(21) not implemented for cuda - test_constantofshape_int_shape_zero_cuda: same - test_constantofshape_int_zeros_cuda: same - test_flatten_axis0_cuda: Flatten(21) not implemented for cuda - test_flatten_axis1_cuda: same - test_flatten_axis2_cuda: same - test_flatten_axis3_cuda: same - test_flatten_default_axis_cuda: same - test_flatten_negative_axis1_cuda: same - test_flatten_negative_axis2_cuda: same - test_flatten_negative_axis3_cuda: same - test_flatten_negative_axis4_cuda: same - test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not implemented in ORT yet - test_qlinearmatmul_2D_int8_float32_cpu: same - test_qlinearmatmul_2D_uint8_float16_cpu: same - test_qlinearmatmul_2D_uint8_float32_cpu: same - test_qlinearmatmul_3D_int8_float16_cpu: same - test_qlinearmatmul_3D_int8_float32_cpu: same - test_qlinearmatmul_3D_uint8_float16_cpu: same - test_qlinearmatmul_3D_uint8_float32_cpu: same - test_qlinearmatmul_2D_int8_float16_cuda: same - test_qlinearmatmul_2D_int8_float32_cuda: same - test_qlinearmatmul_2D_uint8_float16_cuda: same - test_qlinearmatmul_2D_uint8_float32_cuda: same - test_qlinearmatmul_3D_int8_float16_cuda: same - test_qlinearmatmul_3D_int8_float32_cuda: same - test_qlinearmatmul_3D_uint8_float16_cuda: same - test_qlinearmatmul_3D_uint8_float32_cuda: same - test_size_cuda: Size(21) not implemented for cuda - test_size_example_cuda: same - test_dequantizelinear_blocked: Missing implementation for block dequant for DequantizeLinear(21) - test_quantizelinear_blocked_asymmetric: Missing implementation for block quant for QuantizeLinear(21) - test_quantizelinear_blocked_symmetric: Missing implementation for block quant for QuantizeLinear(21) --------- Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: George Wu <jywu@microsoft.com> Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-04-12 09:46:49 -07:00
cao lei	2a184ac1a1	use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo (#20037 ) ### Description <!-- Describe your changes. --> use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to hook the inplace map of custom ops ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This PR is to use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to hook the inplace map of custom ops	2024-03-28 20:45:37 -07:00
cao lei	793a8882ed	Regarding copy inputs before inference, flush the stream which copies the input only if the input is consumed by the ops from different streams (#19970 ) ### Description <!-- Describe your changes. --> Regarding copy inputs before inference, flush the stream which copies the input only if the input is consumed by the ops from different streams ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is the improvement for the fix https://github.com/microsoft/onnxruntime/pull/17303	2024-03-26 13:57:25 -07:00
Adam Louly	32558134a9	[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770 ) ### Description Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers ### Motivation and Context This PR includes changes that will make ort handle 2GB+ checkpoints. To do that we need to upgrade flatbuffers to 23.5.9 - https://github.com/google/flatbuffers/pull/7945 - Modified the commitHash and the hash for the new version - Removed the patch for rust generator's unused variable warning as it is no longer producing this - [Check it out here](`d121e09d89/src/idl_gen_rust.cpp`) - Updated the VerifyField calls with alignment values that were introduced in the new version. --------- Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>	2024-03-14 16:36:24 -07:00
pengwa	409b811325	Refine logging for execution plan print (#19777 ) ### Refine logging for execution plan print Printing NodeIndex only is not enough for us to debug the execution order. keep original behaviour for ORT_MINIMAL_BUILD build in case of any CPU memory concerns. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-14 16:31:32 +08:00
Markus Tavenrath	f42e6ad61e	Add support for LRN NHWC OPs (#19866 ) Support LRN NHWC in the CUDA EP. ### Motivation and Context Add support for all NHWC OPs to avoid NHWC/NCHW Layout transformation	2024-03-13 17:52:07 -07:00
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723 ) ### Description <!-- Describe your changes. --> If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units we process the logical nodes they represent instead of individual nodes. This ensure we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
Vincent Wang	1bfc26685b	ATen Op Supports Int Return Type and CPU Tensor Arguments (#19773 ) This PR: - add support for int as return type, will create a CPU scalar tensor for it. - add attributes to specify which arguments or returns are CPU tensors. - adjust ATen efficient attn to match latest PyTorch native function. - a Triton codegen bugfix by the way.	2024-03-06 10:11:46 +08:00
Hector Li	4ab497603e	Enable user to set QNN HTP performance mode for every session run (#19521 ) ### Description Currently, the QNN HTP performance mode is set during session creation, there's no way to change it afterwards. There's requirement to set it high performance mode for high priority request and set it back to low performance mode later to save the power when the incoming request is idle for example. Now, still keeps the performance mode at the session level in QNN EP options which is used at the default one. Ort QNN EP will set it once if user set it. And there are setting (qnn.htp_perf_mode and qnn.htp_perf_mode_post_run) in run option to change the performance mode before and after session run. There's recommended scenario that user set the mode to high performance mode before the the inference sun so that user can get the result back ASAP. And set the mode to low performance mode after the inference to save the power.	2024-02-22 17:04:59 -08:00
cao lei	6b73ab3e3e	Introduce reused_buffer_index_per_stream in allocation planner which will be reset after computing the reuse buffer for each stream (#19515 ) ### Description <!-- Describe your changes. --> Introduce reused_buffer_index_per_stream in allocation planner which will be reset after computing the reuse buffer for each stream. So if a NodeArg is an input of several Ops across different streams and reuses other NodeArg, the reused NodeArg won't be involved when computing the second stream's reuse plan. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is to fix https://github.com/microsoft/onnxruntime/issues/19480, which is a crash for the scenario mentioned above. --------- Co-authored-by: Lei Cao <leca@microsoft.com>	2024-02-22 10:19:08 -08:00
zhijiang	8fadc6c913	Zhijxu/cleanup cached tensors when oom (#19306 ) in pytorch, when oom happens at bp, user could decrease the batch size and rerun it without restarting the process. while in ORT, the intermediate tensors are kept even OOM, so decrease batch size still fail. this is torch run, we can see after oom failure, torch will release tensor before next step ![image](https://github.com/microsoft/onnxruntime/assets/43435212/92b8a2e3-454b-448a-a223-17cb91d463c2) this is from ort, we can see ort not release its tensors after OOM failure. ![image](https://github.com/microsoft/onnxruntime/assets/43435212/bb6a3882-8e14-4f37-8079-e7f70fc2546b) ort with the PR, we can see memory is released, the 4GB memory is not own by ort, and will be released by torch at the end. ![image](https://github.com/microsoft/onnxruntime/assets/43435212/7f39d711-4e36-47d5-aecf-3805433a6d01)	2024-02-21 10:41:42 +08:00
Fangrui Song	d00adb7989	Align bins_space_ storage (#17552 ) Otherwise, `new (BinFromIndex(b)) Bin(this, bin_size);` in bfc_arena.cc would cause a -fsanitize=alignment (part of -fsanitize=undefined) failure like runtime error: constructor call on misaligned address 0xXXX for type 'Bin', which requires 8 byte alignment	2024-02-11 19:18:26 -08:00
ivberg	148f54c6ea	Add capturestate / rundown ETW support logging for session and provider options (#19397 ) ### Description Add capturestate / rundown ETW support logging for session and provider options. ### Motivation and Context Follow-up to #16259 and #18882 This is very useful when you have longer running ONNX sessions which will be the case for a lot of AI workloads. That means ETW tracing may start minutes or hours after a process & session has been established. When a trace is captured, you would want to know the state of ONNX at that time. The state for ONNX is session and config options so that they show up in the trace. Tested with xperf and ORT xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d xperf -capturestate ort 3a26b1ff-7484-7484-7484-15261f42614d <--- Run this after session has been up for some time xperf -stop ort -d .\ort.etl <- Trace will now also have rundown events Also these will show if you use WPR [CaptureStateOnSave ](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/capturestateonsave)	2024-02-08 11:28:05 -08:00
Hector Li	0fa88bc810	Multi-partition support for context binary cache feature (#18865 ) ### Description Multi-partition support for context binary cache feature 1. In QNNEP create the list of EPContext nodes if ep_context_enable is enabled, so that it can dump the model with multiple partitions 2. Extend context loading part to support multiple EPContext nodes ### Motivation and Context It only support single partition before this changes. There's graph partition limitation for context cache feature after this change.	2024-02-01 15:04:29 -08:00
cao lei	7d4dc66846	ExecutionProvider API refactor - make GenerateMetaDefId a standalone function, decouple it from EP (#18977 ) ### Description <!-- Describe your changes. --> Make EP's member function, GenerateMetaDefId, a standalone function which decouples from EP ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This change is for ExecutionProvider API refactoring, we will make a clean ExecutionProvider API first for later EPv2 work	2024-01-26 07:39:08 -08:00
Hector Li	6e17571f2f	Fix issue that the generated context cache model inputs/outputs order is not guaranteed (#19195 ) Fix issue that the generated context cache model inputs/outputs order is not guaranteed ### Description Currently, QNN EP generate the context cache model in Compile() method which only get access to the partitioned graph. And the inputs/outputs order for the partitioned graph is not guaranteed. And EP doesn't have the view of the input user model. Have to move the context cache model generation to a higher level in GraphPartitioner which has the view of the partitioned model. This is also a break down of PR for multi-partition support. https://github.com/microsoft/onnxruntime/pull/18865	2024-01-19 15:16:17 -08:00
pengwa	1150b1f81e	ORTModule memory improvement (#18924 ) ## Dependency https://github.com/microsoft/onnxruntime/pull/19007 ## ORTModule memory efficient gradient management Previously I have tried to solve the coarsed-grained gradient accumulation/update problem in ORTModule with https://github.com/microsoft/onnxruntime/pull/8979, while that resolution somehow is not fully validated with DDP or there is user hooks on the gradient accumulation on torch parameter. This PR is addressing the problem in the similar approach as PR 8979, e.g. trigger gradient accumulation once ORT computed the grad, but instead of use a AccumulateGrad op, this time with a ONNX operator PythonOp, internally it will call param.backward(grad), which will help handle all related hooks correctly. ## Design Check the details from https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ ## Convergence Validation: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe) differences are on mostly 0.000x, sometimes 0.00x, which may comes from the different order gradient apply happens before or after this change (on deepspeed zero stage 2) ## TODO Consolidate the logic with Stage3's similar logic.	2024-01-16 08:57:37 +08:00
Yulong Wang	07cfc56538	[js] enable external data loading for ort-web (#19087 ) ### Description enable external data loading for ort-web. ### Why The ORT external data design is highly depending on the file system, especially synchronous file I/O APIs. Those are not available in web platforms. We need to have extra code to make external data working on web. ### How Considering there is no file system in web, an implementation for web to support external data is to use pre-loaded data. Assume model file a.onnx includes initializers that linked to ./b.bin, we require users to pass a full data file list when creating the session. The user code will be look like: ```js const mySess = await ort.InferenceSession.create('./path/model/a.onnx', { // session options externalData: [ { // relative or absolute path/URL of the file, // or a pre-loaded Uint8Array containing the data of the external data file data: './path/data/b.bin', // the relative path of the external data. Should match initializers' "location" value defined in the model file path: './b.bin' }, // { } if multiple external data file ] }); ``` Currently, this feature only works with JSEP build enabled.	2024-01-12 19:24:24 -08:00

1 2 3 4 5 ...

816 commits