onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-01 03:45:06 +00:00

Author	SHA1	Message	Date
Yueqing Zhang	aedb49beb4	[VitisAI] change all support tensor type from ir 9 to ir 10 (#23204 ) ### Description Changed all support tensor type from ir 9 to ir 10. ### Motivation and Context - See issue https://github.com/microsoft/onnxruntime/issues/23205 Co-authored-by: Yueqing Zhang <yueqingz@amd.com>	2025-01-02 06:45:21 -08:00
Jean-Michaël Celerier	2116fd1999	Update onnxruntime_c_api.h to work with MinGW (#23169 ) The SAL2 macros are not always available there ### Description Make SAL2 macros only available on MSVC. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/1175	2024-12-31 11:05:10 -08:00
wejoncy	86870114eb	[CoreML] support coreml model cache (#23065 ) ### Description Refactor compute plan profiling. Support caching the CoreML model to speed up session initialization. This is only supported through a user-provided entry, and the user is responsible for managing the cache. With the cache, session initialization time can be reduced by 50% or more (model: before, after): yolo11.onnx: 0.6s, 0.1s; yolo11-fp16.onnx: 1.8s, 0.1s. --------- Co-authored-by: wejoncy <wejoncy@.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-12-31 09:29:41 +08:00
Dmitri Smirnov	00b262dbb4	Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069 ) ### Description Pre-packing is a feature that allows kernels to re-arrange weights data to gain performance at inference time. Currently, pre-packed blobs are shared when cross-session weight sharing is enabled, and only for those weights that are marked as shared by the user. Otherwise, data resides on the heap, and the kernels own the data, which may be duplicated. This change enables pre-packed data to be stored on disk alongside the external initializers. The pre-packed blobs are memory mapped and are loaded into either the X-session shared container or a new container that shares pre-packed blobs within the session. With the new approach, pre-packed blobs are always owned by the shared container using the existing pre-pack mechanism for sharing. When X-session sharing is enabled, the external container owns the data. A separate container owned by a root `SessionState` owns and shares the data when X-session sharing is not enabled. To facilitate this new approach, we introduce a new container that works in two modes. When an optimized model is being saved, and pre-packed weights saving is enabled, the new container will record pre-packed blobs and serialize them to disk using the existing `ToGraphProtoWithExternalInitializers` function. To externalize the pre-packed weights, we introduce a new session option `kOrtSessionOptionsSavePrePackedConstantInitializers`. Note that pre-packing should be enabled (default) for this to work. The `ToGraphProtoWithExternalInitializers` function is modified to recurse into subgraphs to make sure we properly account for local initializer names. In the second mode, the container would simply hold the pre-packed weights memory-mapped from disk and share them with the kernels. ### Motivation and Context Reduce memory usage by pre-packed initializers and externalize them.	2024-12-20 10:49:08 -08:00
Changming Sun	2ff66b80e0	Fix a deadlock bug in EigenNonBlockingThreadPool.h (#23098 ) ### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.	2024-12-16 09:05:12 -08:00
Hector Li	ebb968d34a	disable the EP context embed model by default in session option (#23070 ) change the default value for session option ep.context_embed_mode to 0 to avoid the model loading memory overhead	2024-12-11 17:26:29 -08:00
Scott McKay	708ee8556e	Reduce default logger usage (#23030 ) ### Description We have use cases where multiple sessions are created concurrently. Minimizing the usage of the default logger is important for these scenarios. Wire through the session logger to as many places as possible. The EP logger can also be used once the session is created (can't be used during EP construction/kernel registration but can be used in GetCapability and Compile). ### Motivation and Context Improve logging when there are concurrent sessions.	2024-12-10 12:54:14 +11:00
wejoncy	e12421be30	[CoreML] more performace flag (#22975 ) ### Description Refactor unsqueeze's implementation. Add more flags to boost performance. Add profile flag. --------- Co-authored-by: jicwen <jicwen@YiMacBook-Pro.local> Co-authored-by: wejoncy <wejoncy@.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-12-10 09:35:05 +08:00
Scott McKay	2f2c73bdde	Miscellaneous cleanups (#23048 ) ### Description - fix some missing end of version markers and since_version info - fix include to use onnx_protobuf.h which handles minimal builds - we should always prefer that header over directly using the onnx ones	2024-12-10 09:24:16 +11:00
Hector Li	401d16c671	Enable QNN HTP spill fill buffer setting to save RAM usage. (#22853 ) ### Description Enable QNN HTP spill fill buffer setting to save RAM usage. This feature is available after QNN 2.28. The QNN context binary needs to be re-generated. https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api Requirements: 1. Re-generate the Onnx model with QNN context binary by setting the EP option enable_htp_spill_fill_buffer = 1. 2. Works for a model with multiple context binaries. Need to manually merge 2 Onnx models with context binaries into 1 Onnx model. 3. Requires a Linux platform if generating the context binary offline, since the QnnSystem lib is not available for the Windows x86_64 platform. No extra steps are needed when running model inference. The generated EPContext node will have a max_size attribute with the maximum spill fill buffer size for the context binary <img width="353" alt="image" src="https://github.com/user-attachments/assets/a3bf48be-a8da-4381-8a1d-3f2558eea37d"> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-12-06 11:36:52 -08:00
wejoncy	c284a686f2	[CoreML] Create EP by AppendExecutionProvider (#22675 ) ### Description AppendExecutionProvider("CoreML", {{"MLComputeUnits","MLProgram"}}) --------- Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-11-27 09:26:31 +08:00
Chi Lo	56e4fda8a8	[TensorRT EP] Revert "Add new provider option to exclude nodes from running on TRT" (#22878 ) - Revert https://github.com/microsoft/onnxruntime/pull/22681 - But still implicitly exclude DDS ops for TRT 10. Will later provide better PR to add trt_op_types_to_exclude provider option.	2024-11-19 09:08:54 -08:00
Preetha Veeramalai	ac9c135b95	Ovep develop 1.21 (#22824 ) ### Description OVEP development changes for ORT 1.21 Release ### Motivation and Context - Has critical bug fixes - Support for concurrency execution of models is enabled - Support for OV 2024.5 - Memory optimizations for NPU platform --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: TejalKhade28 <tejal.khade@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>	2024-11-14 20:10:07 -08:00
Chi Lo	fa4cbcd36b	[TensorRT EP] Add new provider option to exclude nodes from running on TRT (#22681 ) Add new provider option `trt_op_types_to_exclude`: - User can provide op type list to be excluded from running on TRT - e.g. `trt_op_types_to_exclude="MaxPool"` There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) from TRT versions 10.0 to 10.7. TRT EP excludes DDS ops from running on TRT by default, user can override default value with empty string to include all ops.	2024-11-13 11:34:43 -08:00
Dmitri Smirnov	c5276ac448	Revert "enable serialize prepacked weights into data file (#22256 )" (#22788 ) This reverts commit `c5b6be045f`. ### Description Revert ### Motivation and Context This needs simpler and more robust approach	2024-11-11 09:59:05 -08:00
wejoncy	9daf7664fc	[CoreML] ML Program more ops (2/N) (#22480 ) - cast - argmax - gelu - cast - LayerNorm - GroupNorm - InstanceNorm --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-11-01 08:37:56 +08:00
Pranav Sharma	03ea5dc495	Distinguish between DML and the generic 'GPU' term. This is needed for packaging DML EP in the same ORT GPU pkg. (#22657 ) ### Description Distinguish between DML and the generic 'GPU' term. This is needed for packaging DML EP in the same ORT GPU pkg. ### Motivation and Context Customer requirement.	2024-10-30 11:58:34 -07:00
Dmitri Smirnov	e106131260	Enable Ort objects to be stored in a resizable std::vector (#22608 ) ### Description Allow some classes to be default constructed. The effect is the same as constructing it with nullptr. Make default ctor visible from the base classes. ### Motivation and Context Multiple customers complained that when storing Ort::Value in an instance of std::vector, the vector cannot be resized. We enable that by allowing it to be default constructed.	2024-10-29 09:59:59 -07:00
Frank Dong	c5b6be045f	enable serialize prepacked weights into data file (#22256 ) ### Description part of https://github.com/microsoft/onnxruntime/issues/21448 This change is intended to save CPU memory during model load for inference. Added session option save_prepacked_constant_initializers. With save_prepacked_constant_initializers turned on: 1. optimize model with inference session, prepacked external initializer will be saved into data file. 2. load optimized model and external data file with prepacked initializer, no prepack is needed 3. run inference with optimized model and data file Tested with model Phi-3-mini-instruct-onnx, with ORT 1.12.0: ![image](https://github.com/user-attachments/assets/3c0337be-f340-4bb7-8f9f-30f3552072ef) with this change: ![image](https://github.com/user-attachments/assets/23282990-2e1e-4a1f-92de-afa8ed7e6a43) Peak memory usage dropped from 5.438 GB to 2.726 GB. This change takes advantage of ORT loading external initializers with mmap on CPU. Prepack uses extra memory on the heap; omitting the prepack process saves this memory (roughly the same size as the external initializers). Next step: change all the kernels on CPU with the PrePack method implemented and test properly. Will do in next PR.	2024-10-24 22:24:48 -07:00
Changming Sun	88676e62b9	Remove nsync (#20413 ) ### Description 1. Remove the onnxruntime::OrtMutex class and replace it with ~absl::Mutex~ std::mutex. 2. After this change, most source files will not include <Windows.h> indirectly. ### Motivation and Context To reduce the number of deps we have, and address some Github issues that are related to build ONNX Runtime from source. In PR #3000 , I added a custom implementation of std::mutex . It was mainly because at that time std::mutex's default constructor was not trivial on Windows. If you had such a mutex as a global var, it could not be initialized at compile time. Then VC++ team fixed this issue. Therefore we don't need this custom implementation anymore. This PR also removes nsync. I ran several models tests on Linux. I didn't see any perf difference. This PR also reverts PR #21005 , which is no longer needed since conda has updated its msvc runtime DLL. This PR unblocks #22173 and resolves #22092 . We have a lot of open issues with nsync. This PR can resolve all of them.	2024-10-21 15:32:14 -07:00
Jeff Daily	5aabc53121	[ROCm] redo hipify of version controlled files (#22449 ) ### Description Updates the ROCm EP opsets to match the current CUDA EP opsets. Also enable the test CApiTest.basic_cuda_graph_with_annotation. Note that some changes are whitespace-only. These changes were made to improve the comparison of corresponding ROCm and CUDA EP source files when using a side by side diff tool. ### Motivation and Context The ROCm EP derives from the CUDA EP. Many source files are shared between the EPs and "hipified" during the ROCm EP build, however quite a few files within the ROCm EP are under source control after their initial hipification. Over time these ROCm EP files get stale relative to their CUDA EP counterparts. It becomes necessary to re-hipify these otherwise static files in order to pick up important changes such as opset differences.	2024-10-18 12:40:54 -07:00
Akshay Sonawane	e5c2e50849	bumps up version in main from 1.20 -> 1.21 (#22482 ) Bump up version in main from 1.20.0 to 1.21.0 since the release branch has been cut.	2024-10-17 12:32:35 -07:00
Adrian Lizarraga	84d48b6ad6	[QNN EP] Add provider option to offload graph I/O quantization/dequantization to the CPU EP (#22436 ) ### Description Adds QNN provider option `offload_graph_io_quantization` to offload graph input quantization and graph output dequantization to the CPU EP. Option is disabled by default to maintain current behavior. ### Motivation and Context Offloading the handling of I/O quantization to the CPU EP significantly improves inference latency for many models.	2024-10-16 15:00:53 -07:00
wejoncy	20a45dd67b	[CoreML ML Program] support acclerators selector (#22383 ) ### Description For now, CoreML only supports running mlmodels on CPU/ALL. However, sometimes CPU_GPU is much faster. This PR adds an option to select different hardware to boost performance. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-10-15 11:50:11 +08:00
Jeff Daily	8c21680ffc	[ROCm] prefer hip interfaces over roc during hipify (#22394 ) ### Description Change the hipify step to remove the -roc option to hipify-perl. This will prefer hipblas over rocblas. rocblas can still be called directly such as in TunableOp. ### Motivation and Context hip interfaces are preferred over roc for porting from cuda to hip. Calling roc interfaces is meant for ROCm-specific enhancements or extensions.	2024-10-14 20:34:03 -07:00
Vishnudas Thaniel S	35adba21c7	Ovep develop lnl 1.2 (#22424 ) ### Description - Support OV2024.4 - Refactor tensor initialization check for external weights - Support loading OV Config - OVEP: Tensor Caching fix, Fix accuracy issues - Refactor device memory implementation to make it more generic ### Motivation and Context The changes are required to fix accuracy issues, support loading of OV config, support OV2024.4 --------- Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>	2024-10-14 12:10:01 -07:00
George Wu	332173509d	fixups for doxygen. add c++ wrapper for setEpDynamicOptions (#22416 ) follow up to https://github.com/microsoft/onnxruntime/pull/22282 replaces https://github.com/microsoft/onnxruntime/pull/22388	2024-10-11 21:59:33 -07:00
Luis E. P.	1bc546af61	Add SetEpDynamicOptions and remove workload_type from run/session options (#22282 ) ### Description Add SetEpDynamicOptions and Remove workload_type from run/session options. ### Motivation and Context Added SetEpDynamicOptions as a dynamic way of changing EP settings even in the middle of a Run. Using workload_type run/session options to set Efficient/Default mode for workloads does not cover all the scenarios and can lead to priority inversions. Working on a new API to support setting Efficient/Default mode for workloads. --------- Co-authored-by: Luis E. Pena <luispena@microsoft.com>	2024-10-09 22:54:22 -07:00
Pranav Sharma	c415991c16	Revert "ThreadPool: Spend less time busy waiting. (#21545 )" (#22350 ) This reverts commit `4e15b229a0`. Reason: We are seeing an increase in the number of deadlocks after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. Moreover, this is causing an increase in the binary size. ### Description We are seeing an [increase in the number of deadlocks](https://github.com/microsoft/onnxruntime/pull/22315#issuecomment-2394821893) after this PR. We have a release coming up next week and do not have enough time to investigate the root cause, hence reverting this PR temporarily. ### Motivation and Context See above.	2024-10-08 17:50:26 -07:00
Yulong Wang	c5d28cac4d	Initial WebGPU EP checkin (#22318 ) ### Description This change introduces the WebGPU EP into ONNX Runtime. To make the PR as simple as possible, this PR excluded the following: - C API changes for WebGPU EP - actual implementation of WebGPU EP. Currently in this PR, WebGPU is a stub implementation that does not register any kernel. - Python IO Binding update - Node.js IO Binding update This PR now contains only 43 file changes (while the working branch contains 130+) and hopefully this makes it easier to review. There is going to be separated PRs for each mentioned above. Current working branch: #21904	2024-10-08 16:10:46 -07:00
Dmitri Smirnov	9f3676bc31	Address leftover comments for Lora support (#22322 ) ### Description Address comments ### Motivation and Context Re: https://github.com/microsoft/onnxruntime/pull/22046	2024-10-04 16:43:26 -07:00
goldsteinn	4e15b229a0	ThreadPool: Spend less time busy waiting. (#21545 ) The purpose of the patch is primarily to save power, but it also has nice perf benefits (mostly from allowing the system to better distribute power to cores doing meaningful work). Changes are twofold: 1) Decrease WorkerLoop spin count dramatically ~10^6 -> ~10^4. The reality is after ~10^4 spins, if there hasn't been any new work added it's unlikely any new work is imminent, so sleep to preserve power. This aligns more closely with upstream EigenV3. 2) Use exponential backoff for waiting on memory. This saves a bit more power, and importantly increases the time between iterations in WorkerLoop to help accommodate the dramatically lowered spin counts. Since the tuning for both the iteration counts / backoff counts is dramatically different for hybrid/non-hybrid systems, this patch templates the affected functions and dynamically chooses based on `CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of getting the change in, although it's likely we could incur less dynamic overhead if we added the template argument to the entirety of `ThreadPoolTempl`. Measured performance on an [Intel Meteor Lake CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html) across a range of models. Below are the results of 3 runs, with each metric being the value-before-patch / value-after-patch (so for something like inference time, lower is better). 
<div align="center"> <table> <tr> <th>Session creation time cost</th> <td>0.7179</td> </tr> <tr> <th>First inference time cost</th> <td>0.7156</td> </tr> <tr> <th>Total inference time cost</th> <td>1.0146</td> </tr> <tr> <th>Total inference requests</th> <td>0.8874</td> </tr> <tr> <th>Average inference time cost</th> <td>0.8800</td> </tr> <tr> <th>Total inference run time</th> <td>1.0146</td> </tr> <tr> <th>Number of inferences per second</th> <td>0.8955</td> </tr> <tr> <th>Avg CPU usage</th> <td>0.9462</td> </tr> <tr> <th>Peak working set size</th> <td>0.9922</td> </tr> <tr> <th>Runs</th> <td>1.1552</td> </tr> <tr> <th>Min Latency</th> <td>0.7283</td> </tr> <tr> <th>Max Latency</th> <td>0.9258</td> </tr> <tr> <th>P50 Latency</th> <td>0.9534</td> </tr> <tr> <th>P90 Latency</th> <td>0.9639</td> </tr> <tr> <th>P95 Latency</th> <td>0.9659</td> </tr> <tr> <th>P99 Latency</th> <td>0.9640</td> </tr> </table> </div> So the net result is a 1.16x improvement in throughput and between 1.08-1.37x improvement in latency.	2024-10-01 17:25:02 -07:00
Dmitri Smirnov	d9de054eb5	Multi-Lora support (#22046 )	2024-09-30 15:59:07 -07:00
Enrico Galli	52a8c1cae8	[WebNN EP] Enable IO Bindings with MLTensor (#21301 ) ### Description Enables using the MLTensor to pass data between models. ### Motivation and Context Using MLTensor instead of ArrayBuffers reduces the number of copies between the CPU and devices as well as the renderer and GPU process in Chromium.	2024-09-27 17:24:21 -07:00
Tianlei Wu	2deab75d39	Add numeric_limits for float8 types (#22228 ) Add std::numeric_limits for float8 data types to provide a consistent way to access limits of those types. Reference: * https://onnx.ai/onnx/technical/float8.html	2024-09-26 14:42:36 -07:00
Tianlei Wu	7880342e5e	Add numeric_limits for MLFloat16 and BFloat16 (#22197 ) ### Description * Add std::numeric_limits for MLFloat16 and BFloat16. * Update some comments in csharp ORTFloat16.shared.cs. * Add unit tests (including Clip) Note that the canonical NaN is not consistent in C++ and C#. C# uses negative quiet NaN as canonical NaN, while C++ uses positive quiet NaN. The choice of CSharp Float16.NaN is to be consistent with System.Half.NaN. FP16 data returned from CUDA might have 7FFF as NaN; FP16 data from CPU provider might have 0x7E00 as NaN. Anyway, there is no consistent canonical NaN in ORT right now. Because all these NaNs are aligned with the IEEE spec, there should not be an issue downstream. ### Motivation and Context std::numeric_limits is used in the codebase but not defined for MLFloat16 and BFloat16. It causes some bugs like https://github.com/microsoft/onnxruntime/issues/21957 introduced by https://github.com/microsoft/onnxruntime/pull/21493.	2024-09-25 17:10:05 -07:00
Hector Li	5fa4505d1b	Set enable_htp_fp16_precision default to true (#22186 ) ### Description Set enable_htp_fp16_precision default to true for HTP backend.	2024-09-24 09:37:53 -07:00
Edward Chen	209ff86d52	Get build working on Xcode 16 (#22168 )	2024-09-24 08:33:03 -07:00
Scott McKay	d4692835bf	Fix std::chrono/date conflict for mac builds with C++20 (#22138 ) ### Description Fix usage of c++ std::chrono::operator<< in mac builds for wider range of xcode/targets. ### Motivation and Context #21033	2024-09-20 11:18:24 -07:00
Michael Tyler	904b850b44	Update Arm Compute Library Execution Provider (#22032 ) ### Description This PR makes the following updates to the Arm Compute Library execution provider: - Target Arm Compute Library 24.07 - Add support for the following operators: - Conv (FP16) - NhwcConv - QLinearConv - MatMul - FusedMatMul - MatMulIntegerToFloat - Optimize memory usage and performance - Expose the enable_fast_math setting - Use the main runtime thread pool ### Motivation and Context These updates improve performance and memory usage, and enable use of a more recent version of Arm Compute Library. @microsoft-github-policy-service agree company="Arm Ltd" --------- Signed-off-by: Michael Tyler <michael.tyler@arm.com>	2024-09-12 20:51:59 -07:00
sfatimar	0309c5f02f	Ovep release lnl 1.2.1 (#22027 ) Error Codes are added to catch compilation error and signal recompile. Remote Tensors are added to ensure direct memory access for NPU inferencing. UMD Bypass cache enabled with 2024.4 will eliminate the need for disk caching. ### Motivation and Context The changes are needed to ensure backward compatibility. UMD Bypass caching eliminates driver caching. Remote Tensors lead to performance improvement with inferencing on NPU. --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-11 14:55:40 -07:00
Arne H Juul	493159b481	near-zero negative values must convert to 0 not NAN (#18473 ) for the Float8 types with unsigned zero, we must clear the sign bit when rounding to zero; otherwise we end up with 0x80 which is the encoding for NAN. ### Description Handle all zero and near-zero values the same way, rounding to positive zero. Note that I removed one "if" level but did not re-indent the code in this PR, to make it easier to see what the actual changes are. ### Motivation and Context For the two new 8-bit floating point types Float8E4M3FNUZ and Float8E5M2FNUZ, converting from a near-zero negative value would end up with the sign bit set only; this bit pattern is not negative zero but instead means NAN.	2024-09-06 11:41:48 -07:00
Arne H Juul	605a84ffc9	remove unused and confusing float16 constants (#21999 ) ### Description Remove unused and confusing special constants in MLFloat16 and BFloat16 types. ### Motivation and Context While looking at adding a specialization for std::numeric_limits for the 16-bit floating point types, I found that there are various special constants in those types that are confusing or just wrong. MLFloat16::Epsilon is not an epsilon at all, but approximates "e". Looks like a copy-paste bug. BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`, nor even to the C# Float.Epsilon. Instead, it corresponds to `numeric_limits::min()` which was really confusing to me. The "MinValue" constant does correspond to the C# `Float.MinValue` constant, but this is C++ so it would be better renamed to "LowestValue" since it corresponds to `numeric_limits::lowest()`. As it was unused except for some unit tests I have replaced it with the equivalent `MaxValue.Negate()` here. There's also an unused `kSignaling_NaNBits` constant which is just wrong (has the same value as `kPositiveInfinityBits` instead of a NaN).	2024-09-05 22:00:48 -07:00
Hector Li	190588bb64	Enable QNN weight sharing (#21077 ) ### Description Enable QNN weight sharing across graphs in single context Create tool to generate QNN context cache model with weight sharing enabled.	2024-09-04 11:20:33 -07:00
sfatimar	8dba8e3e24	Memory Optimization for Compilation in OVEP (#21872 ) Calling Split API Calls Read+Model in lieu of unified Compile Model call for export compile flow to ensure memory optimization. Freeing up model proto and serialized string and read model ov ir later to free up memory for the ahead pipeline. Optimization during EpCtxt flow: All the Graph related operations require all the Node Attributes to be set while dealing with model instances internally with them. In the existing implementation these attributes make a copy when constructing a Graph dynamically during runtime. Propose to use these attributes in place without creating a copy to avoid memory allocation / copy while calling these Graph related functions. Changes to ensure the bug fixes related to openvino version and epctxt file path. Moving Compiler version to C++20 for getting r-value mem optimizations benefit. ### Motivation and Context This change is required because memory consumption during the Compilation flow is too high. --------- Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-03 13:52:31 -07:00
Yulong Wang	257792225f	revert forceinline for MakeString (#21943 ) ### Description revert forceinline for MakeString. This change reverts https://github.com/microsoft/onnxruntime/pull/21893. The forceinline was introduced for performance considerations, however it turns out to have some notable binary size increase, which is a concern for some binary size sensitive platforms like Android. I made a few tests locally and found it is not related to whether or not have used the template struct `if_char_array_make_ptr_t` trick. So I have to revert this back.	2024-09-02 19:01:08 -07:00
Yulong Wang	32af2ba68f	enhance string util functions (#21893 ) ### Description - make `MakeString` force inline - refactor ORT_FORCEINLINE macro - move to one place to avoid macro redefinition error - ~~add a `StringJoin` utility~~	2024-08-29 10:37:50 -07:00
AlbertGuan9527	ef073fd8f4	Add session and run option workload_type for applications to set efficient mode. (#21781 ) ### Description This PR adds the session and run option workload_type. This option is the knob for applications to enable/disable the processor performance efficient mode. ### Motivation and Context The efficient mode is co-engineered with processor vendors to allow applications to voluntarily be serviced at a more energy efficient performance level. This functionality can be used by long-running, latency-insensitive applications to reduce energy consumption.	2024-08-28 08:17:01 -07:00
Yulong Wang	d2a1b7a353	Introduce custom external data loader (#21634 ) ### Description This PR introduces support for custom external data loader. An EP can register a custom external data loader to override the default behavior, making it possible to upload initializers directly to GPU. ### Motivation and Context - In ONNX Runtime Web, WebAssembly uses 32-bit as pointer type (`sizeof(size_t)==4`), which means there is a 4GB hard limit on the maximum memory. As the ONNX models get larger, this becomes a blocker for supporting medium-sized language models. - ORT runs out of memory because the current code always loads data into CPU memory, including the .onnx file (protobuf) and external data file(s). However, if using a GPU EP, the big data does not need to be kept on CPU because the only thing that ORT does is to load the data into memory, upload it to GPU and then release it. - Some platforms offer developers a way to upload data directly to GPU. For example, webgpu allows uploading from any ArrayBuffer (it can be a side buffer, not counted toward the 4GB) to GPU directly. This helps to keep the CPU memory usage significantly lower. ### Design Class `ExternalDataLoader` and `ExternalDataLoaderManager` are introduced. They are similar to `DataTransfer` and `DataTransferManager`. `InferenceSession` owns the manager object, and `SessionState` keeps a reference to it. Added a new method `GetExternalDataLoader` in `IExecutionProvider`. An EP can override the method to register an instance of custom external data loader. The key function in an `ExternalDataLoader` class is method `LoadTensor`: ```c++ // the tensor is pre-created using the TensorProto info of the initializer and the MemoryInfo (from allocation plan). 
virtual common::Status LoadTensor(const Env& env, const std::filesystem::path& data_file_path, FileOffsetType data_offset, SafeInt<size_t> data_length, Tensor& tensor) const; ``` This function can be registered by EP, going through a few layers and eventually get into `DeserializeTensorProto()` in the finalizing stage of session initialization. In this step, initializer tensors are created. Behavior is changed to first look up for a registered external data loader that can handle the current memory info. If any instance is available, use the loader; otherwise respect the old code path.	2024-08-27 12:18:52 -07:00
Edward Chen	5726318ec0	[CoreML EP] Fix ArgMaxOpBuilder::AddToModelBuilderImpl() nullptr Node access. (#21797 )	2024-08-23 10:19:53 -07:00

1022 commits