onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-06 04:28:32 +00:00

Author	SHA1	Message	Date
saymrwulf	c0cc6369dd	Implementing approach from a new paper read last night (onnxruntime)	2015-05-22 00:04:00 +02:00
saymrwulf	46b031c00b	Minor doc updates: linking to article on quantization (onnxruntime)	2015-05-04 08:05:00 +02:00
saymrwulf	10e0e4c77b	Quick fix, referencing a known issue from the official repo (onnxruntime)	2015-05-04 08:45:00 +02:00
saymrwulf	bd34bcc8f0	Trying out boneh-franklin approach for IBE (ref. 2003 paper) (onnxruntime)	2015-04-07 19:06:00 +02:00
saymrwulf	82ea9b7e4b	Minor doc updates: linking to article on quantization (onnxruntime)	2015-04-07 22:43:00 +02:00
saymrwulf	bab0005f8f	Implementing approach from a new paper read last night (onnxruntime)	2015-04-07 03:53:00 +02:00
saymrwulf	77f6d2451c	Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime)	2015-04-07 04:54:00 +02:00
saymrwulf	670cd61113	Implementing approach from a new paper read last night (onnxruntime)	2015-03-15 02:06:00 +01:00
saymrwulf	c27806ca45	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2015-03-15 07:58:00 +01:00
saymrwulf	352b192bd9	Refactor for clarity, might break a few tests though (onnxruntime)	2015-01-30 04:57:00 +01:00
saymrwulf	203d84fdf3	Minor doc updates: linking to article on quantization (onnxruntime)	2014-09-22 03:38:00 +02:00
saymrwulf	808d20cc47	Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime)	2014-09-22 02:04:00 +02:00
saymrwulf	08e2251961	Refactor for clarity, might break a few tests though (onnxruntime)	2014-09-22 04:33:00 +02:00
saymrwulf	fec9d1fb67	Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime)	2014-08-26 06:02:00 +02:00
saymrwulf	aa4e8e31dc	Quick fix, referencing a known issue from the official repo (onnxruntime)	2014-08-26 20:50:00 +02:00
saymrwulf	b98cacd5ed	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-08-26 17:07:00 +02:00
saymrwulf	1ed9494329	Refactor for clarity, might break a few tests though (onnxruntime)	2014-07-22 02:59:00 +02:00
saymrwulf	c5c4cdab9b	Late-night bugfix on financial RL environment (onnxruntime)	2014-07-22 22:21:00 +02:00
saymrwulf	d9111330b2	Late-night bugfix on financial RL environment (onnxruntime)	2014-07-22 21:53:00 +02:00
saymrwulf	16db96a5ee	Late-night bugfix on financial RL environment (onnxruntime)	2014-07-11 22:08:00 +02:00
saymrwulf	5c7e0eaadb	Refactor for clarity, might break a few tests though (onnxruntime)	2014-06-06 07:20:00 +02:00
saymrwulf	6583d7a9e0	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-06-06 21:06:00 +02:00
saymrwulf	07cff3567b	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-06-06 06:21:00 +02:00
saymrwulf	8eb35e2ef1	Trying out boneh-franklin approach for IBE (ref. 2003 paper) (onnxruntime)	2014-05-21 18:29:00 +02:00
saymrwulf	b01237e408	Quick fix, referencing a known issue from the official repo (onnxruntime)	2014-03-27 20:11:00 +01:00
saymrwulf	16e98e7844	Implementing approach from a new paper read last night (onnxruntime)	2014-03-27 20:33:00 +01:00
saymrwulf	7d74e4142f	Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime)	2014-03-27 02:30:00 +01:00
saymrwulf	bbf6c15c38	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-03-13 20:45:00 +01:00
saymrwulf	dd57ce7d25	Quick fix, referencing a known issue from the official repo (onnxruntime)	2014-03-13 05:30:00 +01:00
saymrwulf	3b6a120e7b	Refactor for clarity, might break a few tests though (onnxruntime)	2014-03-13 00:23:00 +01:00
saymrwulf	95e97d9aeb	Implementing approach from a new paper read last night (onnxruntime)	2014-03-13 03:47:00 +01:00
saymrwulf	f8bc8f7d01	Late-night bugfix on financial RL environment (onnxruntime)	2014-02-26 08:47:00 +01:00
saymrwulf	0d3fd2dae5	Implementing approach from a new paper read last night (onnxruntime)	2014-02-03 02:30:00 +01:00
saymrwulf	8870016009	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-01-15 07:10:00 +01:00
saymrwulf	ba8a97128a	Testing bigger LLM config, referencing 'Attention Is All You Need' (onnxruntime)	2014-01-03 05:27:00 +01:00
saymrwulf	0b529cce22	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-01-03 08:37:00 +01:00
saymrwulf	353bef6887	Experimenting with FPGA constraints (source: Trimberger 'Three Ages of FPGAs') (onnxruntime)	2014-01-03 21:38:00 +01:00
saymrwulf	77ccf0441d	Late-night bugfix on financial RL environment (onnxruntime)	2014-01-03 05:27:00 +01:00
Yifan Li	0274b7b82f	fix on trtCudaVersion (#23616 ) ### Description TensorRT 10.8 zip file has suffix of cuda-12.8, not 12.6	2025-02-08 14:20:00 -08:00
Yulong Wang	740e9ab9f8	update run CI script (#23621 ) ### Description Add `Win_TRT_Minimal_CUDA_Test_CI`.	2025-02-08 12:28:50 -08:00
shaoboyan091	5ef18328bf	[WebGPU] Support PIX Capture for WebGPU EP (#23192 ) PIX Capture tool requires 'present' to end a frame capture. ORT doesn't have rendering work so no 'present' happens. To avoid endless waiting for PIX capture tool, this PR added a blank surface and 'present' on it in each session run. The surface is created in WebGPU ep constructor and closed in WebGPU ep destructor.	2025-02-08 02:05:15 -08:00
Javier Martinez	01145511b1	Fix for C4267 warning (#23610 ) ### Description A recent [commit](`1fce51b3b2`) is causing an OVEP warning in [openvino_provider_factory.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/openvino/openvino_provider_factory.cc#L151). This PR fixes the warning. ### Motivation and Context Minor fix	2025-02-07 23:01:28 -08:00
Hector Li	002916acb0	Validate the context_file_path before EP compile graphs (#23611 ) Validate the context_file_path before the EP compiles graphs so that it fails fast. This avoids the possibility that the EP generates a new file (context binary file or blob file) that overwrites an existing file. Return an error if the path points to a folder.	2025-02-07 21:31:11 -08:00
Jie Chen	0887e3694a	[webgpu] Use pushErrorScope()/popErrorScope() once for an inference run (#23438 ) The CPU walltime of waiting for PopErrorScope is non-trivial, and also validation errors are not expected to happen in Release build.	2025-02-07 13:52:58 -08:00
microsoft-github-policy-service[bot]	65008cbb73	Auto-generated baselines by 1ES Pipeline Templates (#23603 )	2025-02-06 17:06:29 -08:00
Tianlei Wu	09e5724f3b	[CUDA] Fix beam search of num_beams > 32 (#23599 ) ### Description * Pass topk_scores to beam scorer in slow topk path. * Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk. * Add a test case for slow topk path. ### Motivation and Context This bug was introduced in https://github.com/microsoft/onnxruntime/pull/16272 Beam search uses fast cuda kernel when number of beams <= 32. When beam size is larger than that threshold, we use another code path (slower cuda kernel) to get topk. In such `slow topk path`, topk_scores shall be passed to beam scorer but it is not. This bug will cause incorrect result when num_beams > 32. It was not found previously since such large beam size is rarely used.	2025-02-06 16:50:31 -08:00
Sushanth Rajasankar	82840f635d	Implement Flash Attention 2 for webgpu EP (#23576 ) ### Description This change implements FlashAttention 2 for the webgpu EP for the MHA operator. Numbers from Alderlake device show a 2.2x speed up for prefill, which considering that Attention is 50% of prefill phase (other 50% being MatMul) implies 4x speed up for Attention with this implementation. This is in line with the expected perf gain of 2-4x with FlashAttention over regular attention. ``` Baseline PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000 Batch size: 1, prompt tokens: 1001, tokens to generate: 128 Prompt processing (time to first token): avg (us): 9.54997e+06 <<<<< avg (tokens/s): 104.817 p50 (us): 9.49218e+06 stddev (us): 251442 n: 5 * 1001 token(s) ------ With FlashAttention 2 PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000 Batch size: 1, prompt tokens: 1001, tokens to generate: 128 Prompt processing (time to first token): avg (us): 4.27937e+06 <<<<< avg (tokens/s): 233.913 p50 (us): 4.27687e+06 stddev (us): 5344.1 n: 5 * 1001 token(s) ``` ### Motivation and Context On integrated GPUs memory bandwidth is at a premium. Flash attention makes softmax computation (and therefore output attention vector computation) a running operation instead of maintaining full QKt attention scores in memory. As a result, we see significant improvements in prefill speed - 200% speed up measured here. This change uses techniques from co-operative matrix multiply to use registers from a subgroup for fast in-register matrix multiply. Without the co-operative matrix multiply technique ALD showed about 6.0s prefill time. Tested on ALD/TGL Intel integrated and Nvidia 4070. ### Future Work - Fine tuning and profiling optimizations. 
- Current implementation is for prefill only; a generation phase optimized FA2 implementation is possible, however attention is a tiny part of the generation phase.	2025-02-06 16:32:05 -08:00
Ankit Maheshkar	a6ea57b8f3	OpenVINO EP Weights Sharing Feature (#23553 ) ### Description These changes are done to ensure that weight sharing happens between two models using session context option ep_weight_sharing. Key changes introduced in this feature are: Creating a shared context between two models. Extracting external constant initializers and relabelling them back as inputs to the model to allow weight loading in the direct blob. Creating EP Context Nodes when Subgraph partitioning is happening. ### Motivation and Context This change was required to ensure that LLM with prefill and kvcache models can use the same share The change was also required to ensure EP Context nodes can be formed even when model is being subgraph partitioned. --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: TejalKhade28 <tejal.khade@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>	2025-02-06 14:57:38 -08:00
Tianlei Wu	2c2ff4aef9	[CUDA] Fix BeamSearchTest.DummyT5WithSequenceInputIds test failure in Windows (#23596 ) ### Description BeamSearchTest.DummyT5WithSequenceInputIds failed in Windows because early stopping was triggered. The cause is that state.early_stopping_ is interpreted as true in the cuda kernel at some point, even though printf still shows its value as false. The root cause is unknown. Updating the code to use early_stopping as a template parameter seems to work around the issue. Other changes: * Add some debug code (will not be built into binary unless DEBUG_GENERATION is defined) to assist debugging beam search scorer in CUDA. * Enable DummyT5WithSequenceInputIds test in CI. This test was not run in Windows CUDA CI pipeline previously. ### Motivation and Context Fix a unit test BeamSearchTest.DummyT5WithSequenceInputIds failure in Windows.	2025-02-06 13:15:09 -08:00
Joshua Lochner	d981b153d3	[webgpu/js] Optimize resize webgpu op & fix precision issues (#23591 ) ### Description This PR is a follow-up to https://github.com/microsoft/onnxruntime/pull/23488 and partially improves upon https://github.com/microsoft/onnxruntime/issues/23403. It does the following: - Prevents unnecessary cache shader recompilation for 'nearest' resize operation. - Fixes precision (offset-by-one) errors with asymmetric coordinate transform. When running the Kokoro TTS model, values for the `/decoder/decoder/generator/f0_upsamp/Resize_output_0` result in differences at the end bounds due to precision issues when dividing 21600 by 72 (should be 300, but seemingly results in 299.999, which causes issues when flooring) ### Motivation and Context I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU and found that the above node had a large difference. Thinking this was a major issue, I spent some time fixing it. Turns out, it only happens for a small number of values, leading to high maximum error, but most values are correct (as seen here). BEFORE: ``` [/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125 ``` AFTER: ``` [/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125 ``` So, although it has a very small impact on the final output (waveform), this bug could appear with other models in a more severe way. 
BEFORE: ``` [waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432 ``` AFTER: ``` [waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694 ```	2025-02-06 10:26:25 -08:00
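
Note on the last row above (d981b153d3): the offset-by-one it describes, where a quotient that should be 300 arrives as 299.999... and is then floored, can be reproduced in isolation. The following is a minimal Python sketch, not the PR's shader code and not necessarily the fix it applied. It does not reproduce the GPU division itself; it simulates the slightly-low result by taking the float32 value one step below 300.0. `prev_float32` is a hypothetical helper introduced only for this illustration, and the integer-division guard is one possible approach, assumed here for comparison.

```python
import math
import struct


def prev_float32(x: float) -> float:
    """Return the largest float32 strictly below a positive float32 value x.

    Hypothetical helper for illustration: it decrements the raw IEEE 754
    bit pattern by one, which is valid for positive, non-zero inputs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]


# The mathematically exact coordinate from the commit message.
exact = 21600 / 72

# Simulated result of an imprecise division: one float32 step below 300.0.
slightly_low = prev_float32(300.0)

# Flooring the slightly-low value lands on the wrong index.
index_from_exact = math.floor(exact)
index_from_low = math.floor(slightly_low)

# One possible guard (an assumption, not the PR's fix): keep the
# computation in integers when both operands are integers.
index_from_int = 21600 // 72

print(slightly_low, index_from_exact, index_from_low, index_from_int)
```

The gap between the exact and the simulated value is a single float32 step (2^-15 at this magnitude), yet the floored index moves by a whole element, which is consistent with the commit's observation that only a few values are affected but the maximum error is large.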

12346 commits