Commit graph

11091 commits

Author SHA1 Message Date
Jian Chen
0a10a3003a
component-governance fix round 4 (#20754)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-22 11:05:24 -07:00
Yulong Wang
e412bc1919
[doc] update file size table for ORT Web (#20755)
2024-05-22 11:04:57 -07:00
Xu Xing
f1fef19b6e
[js/webgpu] Support shared memory for transpose 2d (#19267)
For 1024x1024: 18.7 ms without shared memory, 13.2 ms with shared memory.
2024-05-22 08:15:44 -07:00
Yulong Wang
068bb3d5ee
[js/webgpu] add missing space in build script (#20752) 2024-05-21 16:24:34 -07:00
Chi Lo
df01e0d497
[TensorRT EP] Update ORT kernel output with TRT DDS int64 output for TRT 10 (#20738)
TRT 10 natively supports int64 tensors, so the code that binds the ORT
kernel output to the TRT DDS int64 output needs to be updated.
2024-05-21 09:03:48 -07:00
pengwa
8a98874e7e
Flash attention recompute (#20603)
### Flash attn recompute

1. Allow PythonOp(FlashAttn) to be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678

#### Better Memory Efficiency 

Customer models can run both PyTorch SDPA and Flash Attn; this PR makes
it possible for the Flash Attn path to work with ORTModule layerwise
recompute. Peak memory drops from 45.x GB to 32.x GB when comparing only
the layers (other pieces not included; BTW there are a few more
optimizations targeting other pieces coming later).

#### Better Perf

Using Flash Attn brings an additional 16% end-to-end time reduction,
with a closely aligned loss curve.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/bb63894a-f281-49bc-a8e6-ff818439be38)

#### Use JSON File to pass Recompute Plans

To overcome the maximum-length limitation of strings defined in session
options.

2024-05-21 13:38:19 +08:00
Adrian Lizarraga
8acf60f35c
Layout transform: Fix-up QDQ units and add constant folding (#20685)
### Description

#### Problem 1: Broken Transpose QDQ unit
Layout transform's specialized cost function aggressively pushes down
transposes with channel-first or channel-last perms. This can lead to a
situation where a channel-fist/last Transpose gets stuck after being
pushed through an Unsqueeze node that makes the Transpose's perm no
longer channel-first/last. At this point, the specialized cost function
defers to the default const function, which does not see a need to
continue pushing this transpose node. This breaks the QDQ node units for
both the Unsqueeze and the Transpose: DQ -> Unsqueeze -> Transpose -> Q.

<img width="266" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/82f8432d-ca27-451b-8c36-c8d87b806e30">


The transpose optimizer should insert a Q -> DQ pair between the
Unsqueeze and Transpose nodes to fix both QDQ node units: DQ ->
Unsqueeze -> Q[new] -> DQ[new] -> Transpose -> Q

<img width="198" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/5a584bdf-e5db-4622-b3bb-83c060e09261">


#### Problem 2: Inserted Squeeze/Transpose nodes should be constant
folded when possible.
The transpose optimizer inserts Squeeze (and Transpose) ops between an
initializer and a DQ to counteract the effect of Unsqueezing that
initializer if it is consumed by multiple nodes. This results in a graph
where the inserted nodes are not in valid node units:

Original graph where two Mul nodes share a common initializer input:
<img width="456" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/4b9155ae-e32f-41fc-9136-f953b73e92e7">

Resulting graph after transpose optimization without constant folding:
<img width="452" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/3c1bfef1-d45f-4d6e-aa19-1c2929eae3f5">

Here, the circled Transpose and Squeeze nodes operate on a quantized
integer type but are not in valid QDQ node units. The solution is to run
constant folding, which results in:
<img width="405" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/aebdb91f-f38f-4583-adec-33e46126365f">


### Motivation and Context
Improve the layout transformation to allow more models to run on EPs
that prefer the channel-last layout.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-05-20 20:19:06 -07:00
Jian Chen
372974e5d6
Using CPU pool to build Linux GPU C API Package (#20648)
2024-05-20 15:25:14 -07:00
Wanming Lin
87d49e3dda
[WebNN EP] Add WebNN operators doc to README.md (#20734) 2024-05-20 14:57:40 -07:00
Wanming Lin
0399d1b12d
[WebNN EP] Update chromium flag (#20732)
WebNN is currently enabled behind the "Enables WebNN API" flag.
2024-05-20 14:57:30 -07:00
Jian Chen
ddafbf2224
Component Governance fix round 3 (#20689)
2024-05-20 13:39:09 -07:00
Jian Chen
11df22b59b
Reenabling Nuget Cuda Packaging Pipeline (#20688)
2024-05-20 10:37:15 -07:00
Edward Chen
fefae0cd04
Add Mac CI GitHub Actions workflow (#20717)
Add a new GitHub Actions workflow, `.github/workflows/mac.yml`. It contains these jobs:
- ARM64 MacOS CI build.
- Objective-C static analysis build. This was moved over from another Azure DevOps pipeline to make it more visible.
2024-05-20 10:27:03 -07:00
Preetha Veeramalai
ebed2c3785
Unified OV compile_model API in OVEP (#20700)
### Description
Provide a unified API in OVEP that passes the ONNX graph proto from ORT
to OV for compilation.


### Motivation and Context
The earlier implementation used two different flows depending on whether
an ONNX model path was present or the model was loaded from memory.
The former directly passed the ONNX model path to OV when the graph is
fully supported by the EP, while the latter passed the ORT model proto
to OV.

This caused a difference in results when ORT optimizations are enabled.
This PR addresses the issue.
2024-05-20 10:20:28 -07:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR makes a number of optimizations to onnxruntime-web's module
export and deployment.

See each section below for more details.

#### Preview

> [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

  | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup> |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] Not tested; may not actually work.
* [2] Not working; this is a mistake in the build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

  | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ |  |  | ~~✔️~~ |  |  |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~ |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string (for the URL
prefix), nothing is changed. When using this flag as an object (for
per-file path overrides), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
  +     * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
  +     * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```
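As an illustrative sketch of the new object shape (the URLs below are made-up placeholders, not real deployment paths):

```javascript
// Hypothetical per-file override following the New_WasmFilePaths shape above.
// The URLs are illustrative placeholders.
const wasmPaths = {
  // Override path for the main .wasm file (should be an absolute path/URL).
  wasm: 'https://example.com/dist/ort-wasm-simd-threaded.wasm',
  // Override path for the main .mjs file (should be an absolute path/URL).
  mjs: 'https://example.com/dist/ort-wasm-simd-threaded.mjs',
};

// In application code this would be assigned before creating a session:
//   ort.env.wasm.wasmPaths = wasmPaths;
console.log(`${Object.keys(wasmPaths).length} overrides`);
```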

#### Bundler compatibility:

Config changes are needed for bundlers. See usage examples in
/js/web/test/e2e/ for Webpack, Parcel, and Rollup.

#### Deployment:

- If consuming from a CDN, there is no breaking change.
- If consuming from a local server, all `ort-*.wasm` and `ort-*.mjs`
files (6 files in total) in the dist folder need to be copied.
(Previously only the `ort-*.wasm` files needed to be copied.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- The script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using a Blob URL, which makes
onnxruntime-web not work in CSP environments and Node.js when using the
proxy or multi-threading feature.
- The generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- Running with a different Emscripten build always requires the build
step, making it difficult to swap artifacts in development/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support various ways to import, including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
    <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
      import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping a single JavaScript file (the all-in-one bundle) is no
longer a goal, because technically it conflicts with the other
requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
  - The current onnxruntime-web distribution uses a single JavaScript
file to include all code. While this has a few benefits, it also creates
the problems mentioned above. Since ESM is used more and more widely,
and browsers are making more restrictive security checks and
requirements, the old Blob-based solution is going to be replaced.
  - To achieve the requirements, specifically CSP environment support,
we have to offer a non-Blob-based solution. Therefore, we have to
distribute multiple files and drop the single-file solution.

- Do not run a parser/postprocess on the Emscripten-generated JavaScript.
  - Emscripten is evolving quickly, so we should only depend on what's
in its documentation instead of certain implementation details. (For
example, we currently patch its code to deal with a special variable
`_scriptDir`.)
  - Keeping the generated files as-is also helps to:
    - reduce the size of ort.min.js
    - make it easier to replace build artifacts in development/debug

- Drop support for non-SIMD and non-multi-thread builds. This helps to
reduce the number of artifacts in the distribution.
  - (Fixed-size) SIMD is supported in any mainstream JS environment.
  - Multi-threading as a WebAssembly feature is supported in any
mainstream JS environment. In some environments the feature is guarded
by a cross-origin policy, but it can still work as long as no worker is
created.

- Use ESM output for the Emscripten-generated JavaScript.
  - There are 2 ways to dynamically import classic (UMD) modules, and
neither of them is recommended:
    - dynamically creating a `<script>` tag. This changes the HTML
structure and has quite a lot of compatibility issues.
    - using `fetch()` and `eval()`. However, `eval` is strongly
discouraged because of its significant perf hit.
  - Importing ESM is super easy: just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js, this
is the better option.

- Add a Blob-based solution as a fallback for cross-origin workers.
  - There are still wide use cases of importing onnxruntime-web from a
CDN. In this usage, make it possible to create a worker by using
`fetch()` + `Blob` to create a same-origin Blob URL.

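A minimal sketch of this fallback idea, with illustrative names (not the actual onnxruntime-web implementation):

```javascript
// Wrap (possibly cross-origin) script text in a same-origin Blob URL so it
// can be passed to `new Worker(url)` or dynamic `import(url)`.
// In a browser, `scriptText` would come from `fetch(url).then(r => r.text())`.
function createSameOriginUrl(scriptText) {
  const blob = new Blob([scriptText], { type: 'text/javascript' });
  return URL.createObjectURL(blob);
}

const workerUrl = createSameOriginUrl('self.onmessage = () => {};');
console.log(workerUrl.startsWith('blob:'));
```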
</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
  | ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
  | ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
  | ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
  | `ort.training` | `onnxruntime-web/training` | The target including training APIs. |
  | `ort.wasm` | `onnxruntime-web/wasm` | The target including only the WebAssembly (CPU) EP. |
  | `ort.webgl` | `onnxruntime-web/webgl` | The target including only the WebGL EP. |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
  | [target].js | The entry point for the target. IIFE and CommonJS format. |
  | [target].mjs | The entry point for the target. ESM format. |
  | [target].min.js <br/> [target].min.js.map | The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. |
  | [target].min.mjs <br/> [target].min.mjs.map | The entry point for the target. Minimized with sourcemap. ESM format. |
  | [target].proxy.mjs | (if applicable) The proxy ESM module for the target. |
  | [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if applicable) The proxy ESM module for the target. Minimized with sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
                    + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
                    + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
                                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
                                        + import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
                                                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
                                                        + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                                                            |
                                                                            + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
                        + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
                                            + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
                        + new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
                                            + fetch('ort-wasm-simd-threaded.mjs')
                                                |
                                                + URL.createObjectURL(res.blob())
                                                |
                                                + import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
                                                                + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
                                                                + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                                                                    |
                                                                                    + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
kunal-vaishnavi
ca22a5a9d0
Add fusions for OpenAI CLIP (#20721)
### Description
This PR adds fusions for [OpenAI's CLIP
model](https://huggingface.co/openai/clip-vit-large-patch14-336). Here
is an example of how to run the ORT transformer optimizer for the linked
CLIP model.

```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision
encoder.
2024-05-18 08:27:16 -07:00
cloudhan
5d07291247
hipify int4 gemv (#20666)
Hipify MatMulNBits to accommodate the needs of the Phi3 ONNX release.
2024-05-18 16:59:03 +08:00
kunal-vaishnavi
72a3bde330
Add GQA on CPU in LLaMA scripts (#20720)
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that
are running on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
2024-05-17 23:23:57 -07:00
Dmitri Smirnov
bd7a0fb377
[C API Docs ] Address doxygen errors (#20714)
### Description
Make C API compliant with Doxygen expectations

### Motivation and Context
Doc workflow is failing.
2024-05-17 23:23:20 -07:00
Tianlei Wu
2e7de54565
[CUDA] Fix SparseAttention Kernel (#20716)
### Description

Currently, there is one bool flag to indicate whether the kernel is
loaded. However, there are v1 and v2 kernels, so the flag allows only
one version of the kernel to be loaded. We use the v1 kernel for prompt
and the v2 kernel for token generation, and the flag causes an issue
when we want both prompt and token generation.

This bug was found in an integration test. The unit tests only test one
kernel at a time, so the issue was not found before.

Another possible workaround without this fix is to set the environment
variable `ORT_DISABLE_SPARSE_ATTENTION_V1=1`.
2024-05-17 22:42:19 -07:00
guyang3532
d7f7c3b343
Fix bug when Embedding has >2 outputs (#20678) 2024-05-17 16:12:57 +08:00
Xu Xing
6b58fcc00b
[js/web] Refine conv attributes (#20684)
2024-05-16 18:00:57 -07:00
Edward Chen
e81c8676e3
MatMulNBits + Add fusion (#20587)
- Add MatMulNBits Bias input
- Add graph transformer to fuse MatMulNBits + Add
2024-05-16 11:00:59 -07:00
Tom McDonald
1e1b3f9689
Remove ref struct return usage (#20132)
### Description
Removes ref struct return usage on netstandard 2.0 builds.

### Motivation and Context
Unblocks .NET native compilation
2024-05-16 09:46:19 -07:00
Yifan Li
47a178b518
[EP Perf] Fix on EP Perf (#20683)
### Description
<!-- Describe your changes. -->
* Partially revert [previous
change](https://github.com/microsoft/onnxruntime/pull/19804), and
   * Redo concurrency_test_result parser outside of post.py
* Add support of syncing memtest result to db


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To fix the error when CI is running on two model groups.
- When running on two model groups, the [previous
change](https://github.com/microsoft/onnxruntime/pull/19804) wrongly
navigates two levels up in the directory after running one model group,
while one level is needed. After that, the script can't find another
model group.
- Running on one model group can't repro the issue
2024-05-15 21:38:52 -07:00
Wanming Lin
f5bfbd6d81
[WebNN EP] Remove activation fusion (#20635)
The WebNN spec has removed the activation option for conv and
batchNormalization, so we no longer need additional activation fusion in
the WebNN EP.

[edit by fdwr] Note this is handled in the browser now, which knows more
about the backend platform version and can more safely make decisions
about which fusions are possible (e.g. for the DirectML backend, whether
softmax and gelu can fuse successfully with their base operator).
2024-05-15 16:49:07 -07:00
Jian Chen
d1e66f0446
Increase NPM ComponentDetection.Timeout: 1200 (#20681)
2024-05-15 13:41:59 -07:00
Changming Sun
ee3f2f4ebf
Update docs/Model_Test.md (#11466)
Update the instructions of how to get test models.
2024-05-15 11:33:11 -07:00
Hans
2a17958b34
[js/rn] Fix some bugs (#20242)
### Description
<!-- Describe your changes. -->
- Fix `logSeverityLevel`
- Correct how RCTCxxBridge is obtained; the old method could get the
wrong bridge in some cases



---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-05-15 10:32:08 -07:00
Jian Chen
87ed1e3e3f
Component governance fix round 2 (#20679) 2024-05-14 17:15:15 -07:00
Edward Chen
113aa2992f
Update React Native CI (#20673)
- Move iOS package build to separate job so it can run in parallel with Android AAR build and be decoupled from the test stage. The test stage fails sometimes (not infrequently) and may need to be re-run.
- Update stop iOS simulator step so it doesn't fail if the start step doesn't run.
2024-05-14 14:10:56 -07:00
Jian Chen
83a871f890
Fix critical and High issues from Component Governance (#20611)
2024-05-14 09:17:23 -07:00
Hector Li
0e11d0c4f8
Enable Qnn nuget nightly (#20662)
### Description
Enable Qnn nuget nightly
2024-05-13 21:28:43 -07:00
Yi Zhang
c131ea89e1
Nuget Publish pipelines should be triggered by rel-* automatically too. (#20652)
### Description
Also set allowPackageConflicts = True:
`#allowPackageConflicts: false # boolean. Optional. Use when command =
push && nuGetFeedType = internal. Allow duplicates to be skipped.
Default: false.`

https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/reference/nuget-command-v2?view=azure-pipelines

If the publish partially fails, we don't need to rerun the whole
package-generation workflow.
2024-05-13 13:18:16 -07:00
Xu Xing
8c59cd4fce
[js/webgpu] Support GroupQueryAttention (#20237)
TODOs:
1. Handle H * params.kvNumHeads greater than work group size limit.
2. Support BNSH kv cache.
2024-05-13 09:43:37 -07:00
Edward Chen
90d49ccb9a
Allow path pattern to be specified in package_release_tasks.py. (#20650)
Do more in the Python helper script so the Bash code in the release definition can be simplified.
2024-05-13 09:16:04 -07:00
Adrian Lizarraga
643ed14720
Quant tool: make removal of Clip/Relu ops configurable (#20616)
### Description
Adds the extra option `QDQKeepRemovableActivations` to optionally
prevent automatic removal of Clip/Relu ops in QDQ models. The current
default behavior, which is to remove Clip/Relu, remains the same if the
new option is not enabled.

### Motivation and Context
Explicitly representing these Relu/Clip operators in the QDQ model is
necessary if optimizations or EP transformations will later remove
QuantizeLinear/DequantizeLinear operators from the model.
2024-05-10 17:23:24 -07:00
Yi-Hong Lyu
49d197a8e6
Enable ClipQuantFusion exclusively on CPU EP (#20627)
### Motivation and Context

The Intel NPU does not support 16-bit int quantized operators.
Consequently, the execution provider removes the
QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and
executes the operation as FP16 in the backend. However, if a Clip
operator was fused into a Q operator in the node unit, the removal of
Q/DQ operators results in inaccuracies because the effect of the
original Clip operators is lost.

Consider the following example:
- FP32 model: -> Op_FP32 -> Clip ->
- QDQ model: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
- After ClipQuantFusion: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Q') ->
- Intel Execution Provider strips Q/DQ: -> Op_FP16 ->

To solve this issue, we have enabled ClipQuantFusion exclusively on the
CPU execution provider.
2024-05-10 16:07:42 -07:00
Jian Chen
4fe565a62a
Java CUDA 12 support (#20583)
### Description

- This PR combines all CUDA 12 stages into the Zip-nuget-... pipeline.
- It also enables CUDA 12 support.



2024-05-10 14:16:22 -07:00
Tianlei Wu
85facd678b
[CUDA] Benchmark GQA on popular LLM models (#20646)
### Description
Update benchmark_gqa.py to test latency on popular models (like
Llama3-8b, Llama3-70b, Mixtral-8x22B-v0.1 and Phi-3 etc).

Note that this is the latency of just one GroupQueryAttention node, not
the whole model. For example, packed QKV might need more time in GQA but
is faster in the MatMul of the input projection; the overall effect is
not measured here.

Example output on A100-SXM4-80GB:
```
prompt-sm80-Llama3-8B-b1-h32_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.019073                 0.016264
1             32.0       0.017768                 0.017957
2             64.0       0.023304                 0.023192
3            128.0       0.032541                 0.031348
4            256.0       0.048329                 0.049484
5            512.0       0.095294                 0.095950
6           1024.0       0.228050                 0.228980
7           2048.0       0.663820                 0.663308
8           4096.0       2.243657                 2.242999
9           8192.0       8.197120                 8.186282

token-sm80-Llama3-8B-b1-h32_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.018516                 0.015398
1                  32.0       0.015687                 0.016079
2                  64.0       0.016115                 0.016053
3                 128.0       0.018727                 0.019413
4                 256.0       0.036373                 0.035962
5                 512.0       0.041701                 0.042203
6                1024.0       0.053730                 0.053750
7                2048.0       0.076382                 0.075707
8                4096.0       0.121876                 0.121802
9                8191.0       0.211292                 0.211254

prompt-sm80-Llama3-8B-b4-h32_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.024558                 0.022070
1             32.0       0.021276                 0.021406
2             64.0       0.044172                 0.027789
3            128.0       0.069100                 0.059071
4            256.0       0.146569                 0.106717
5            512.0       0.270472                 0.244461
6           1024.0       0.690024                 0.692501
7           2048.0       2.308546                 2.325453
8           4096.0       8.724295                 8.957337
9           8192.0      39.030785                41.381378

token-sm80-Llama3-8B-b4-h32_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.018893                 0.018611
1                  32.0       0.018124                 0.018190
2                  64.0       0.018115                 0.018156
3                 128.0       0.023291                 0.023733
4                 256.0       0.038357                 0.038351
5                 512.0       0.047117                 0.047792
6                1024.0       0.066272                 0.065409
7                2048.0       0.104196                 0.104527
8                4096.0       0.180557                 0.180424
9                8191.0       0.332545                 0.332714

prompt-sm80-Llama3-70B-b1-h64_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.040974                 0.015852
1             32.0       0.017839                 0.018615
2             64.0       0.023956                 0.022704
3            128.0       0.044622                 0.035229
4            256.0       0.080241                 0.075237
5            512.0       0.143457                 0.144322
6           1024.0       0.380473                 0.381731
7           2048.0       1.217328                 1.214505
8           4096.0       4.305315                 4.286324
9           8192.0      15.918250                15.933440

token-sm80-Llama3-70B-b1-h64_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.016148                 0.015612
1                  32.0       0.015616                 0.015616
2                  64.0       0.016082                 0.016070
3                 128.0       0.019470                 0.019130
4                 256.0       0.036617                 0.037296
5                 512.0       0.042087                 0.042176
6                1024.0       0.053704                 0.053587
7                2048.0       0.076918                 0.076365
8                4096.0       0.122534                 0.121984
9                8191.0       0.212961                 0.213330

prompt-sm80-Llama3-70B-b4-h64_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.031137                 0.026270
1             32.0       0.030938                 0.032009
2             64.0       0.040833                 0.059118
3            128.0       0.084899                 0.085482
4            256.0       0.163951                 0.166310
5            512.0       0.420436                 0.423721
6           1024.0       1.282019                 1.283482
7           2048.0       4.397661                 4.420121
8           4096.0      16.931839                17.456945
9           8192.0      77.896706                83.007484

token-sm80-Llama3-70B-b4-h64_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.026106                 0.026061
1                  32.0       0.025678                 0.025589
2                  64.0       0.025438                 0.025965
3                 128.0       0.033879                 0.033320
4                 256.0       0.058078                 0.057656
5                 512.0       0.078010                 0.078153
6                1024.0       0.106353                 0.098079
7                2048.0       0.160039                 0.159153
8                4096.0       0.282527                 0.283346
9                8191.0       0.546207                 0.542135

prompt-sm80-Mistral-7B-v0.1-b1-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0              16.0       0.015722       0.015655                 0.015666                 0.016150
1              32.0       0.018590       0.018562                 0.018136                 0.024617
2              64.0       0.022480       0.023085                 0.023184                 0.023160
3             128.0       0.029948       0.030581                 0.030839                 0.031464
4             256.0       0.048532       0.049099                 0.049424                 0.049408
5             512.0       0.095096       0.095665                 0.096174                 0.096175
6            1024.0       0.228606       0.228942                 0.228434                 0.229568
7            2048.0       0.660832       0.661943                 0.662170                 0.663979
8            4096.0       2.238001       2.243999                 2.242243                 2.241707
9            8192.0       8.173824       6.147072                 8.187648                 6.152822
10          16384.0      33.826305      14.486015                34.849792                14.938283
11          32768.0     176.702469      32.725330               184.309753                34.736130

token-sm80-Mistral-7B-v0.1-b1-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0                   16.0       0.015407       0.016042                 0.016030                 0.015429
1                   32.0       0.015525       0.016115                 0.016768                 0.016052
2                   64.0       0.015556       0.016079                 0.015383                 0.016008
3                  128.0       0.019302       0.018644                 0.018680                 0.019278
4                  256.0       0.036924       0.035900                 0.036753                 0.036786
5                  512.0       0.041482       0.041434                 0.041646                 0.042238
6                 1024.0       0.053587       0.052972                 0.052888                 0.052856
7                 2048.0       0.075749       0.075807                 0.076528                 0.075945
8                 4096.0       0.122053       0.122016                 0.122115                 0.122216
9                 8192.0       0.212069       0.121317                 0.211919                 0.121087
10               16384.0       0.394036       0.121202                 0.393661                 0.121483
11               32767.0       0.757216       0.124326                 0.757659                 0.124157

prompt-sm80-Mistral-7B-v0.1-b4-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0              16.0       0.018418       0.018911                 0.023387                 0.019256
1              32.0       0.021085       0.021132                 0.022143                 0.022251
2              64.0       0.026743       0.026770                 0.027942                 0.027714
3             128.0       0.057922       0.058483                 0.058800                 0.059402
4             256.0       0.105927       0.104876                 0.106695                 0.105996
5             512.0       0.242958       0.242543                 0.244599                 0.244774
6            1024.0       0.689321       0.689347                 0.691759                 0.692334
7            2048.0       2.308250       2.304410                 2.321587                 2.317875
8            4096.0       8.705210       8.713682                 8.927418                 8.903866
9            8192.0      39.630848      28.227926                41.604607                29.648554
10          16384.0     175.553543      61.422592               183.384064                64.560127
11          32768.0     772.296692     132.006912               813.537292               138.996735

token-sm80-Mistral-7B-v0.1-b4-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0                   16.0       0.018127       0.018691                 0.018661                 0.018681
1                   32.0       0.018183       0.018812                 0.018739                 0.018759
2                   64.0       0.018081       0.018116                 0.018136                 0.018153
3                  128.0       0.023257       0.023146                 0.023114                 0.023103
4                  256.0       0.038665       0.038102                 0.038120                 0.038759
5                  512.0       0.047181       0.047156                 0.047012                 0.046382
6                 1024.0       0.066047       0.066103                 0.066604                 0.066076
7                 2048.0       0.104427       0.103770                 0.103799                 0.103807
8                 4096.0       0.180951       0.180373                 0.180173                 0.180154
9                 8192.0       0.334018       0.180801                 0.333269                 0.180690
10               16384.0       0.638682       0.180965                 0.638543                 0.180202
11               32767.0       1.249536       0.184779                 1.249963                 0.184624

prompt-sm80-Mixtral-8x22B-v0.1-b1-h48_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.015699                 0.015563
1              32.0       0.017931                 0.017719
2              64.0       0.029975                 0.022875
3             128.0       0.031038                 0.055747
4             256.0       0.050191                 0.050845
5             512.0       0.125187                 0.122813
6            1024.0       0.304004                 0.301824
7            2048.0       0.936454                 0.931546
8            4096.0       3.264547                 3.255931
9            8192.0      12.062719                12.030080
10          16384.0      49.018368                48.970749
11          32768.0     261.211151               254.461945
12          65536.0    1221.138428              1197.559814

token-sm80-Mixtral-8x22B-v0.1-b1-h48_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.015980                 0.016024
1                   32.0       0.015440                 0.016165
2                   64.0       0.015987                 0.015979
3                  128.0       0.020837                 0.018715
4                  256.0       0.036240                 0.036747
5                  512.0       0.042477                 0.041813
6                 1024.0       0.052950                 0.052956
7                 2048.0       0.076084                 0.076691
8                 4096.0       0.122233                 0.121540
9                 8192.0       0.212469                 0.212433
10               16384.0       0.394937                 0.394996
11               32768.0       0.757285                 0.757257
12               65535.0       1.484867                 1.485015

prompt-sm80-Mixtral-8x22B-v0.1-b4-h48_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.024119                 0.018755
1              32.0       0.022214                 0.022267
2              64.0       0.028045                 0.027562
3             128.0       0.062894                 0.079766
4             256.0       0.135146                 0.134483
5             512.0       0.331323                 0.329094
6            1024.0       0.984576                 0.982221
7            2048.0       3.353564                 3.351021
8            4096.0      12.762113                12.778350
9            8192.0      58.599422                57.704449
10          16384.0     263.392242               258.709503
11          32768.0    1155.789795              1128.622070
12          65536.0    5014.187012              4874.590332

token-sm80-Mixtral-8x22B-v0.1-b4-h48_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018148                 0.018813
1                   32.0       0.018929                 0.018840
2                   64.0       0.018745                 0.018232
3                  128.0       0.023864                 0.023822
4                  256.0       0.038603                 0.038694
5                  512.0       0.048347                 0.047630
6                 1024.0       0.066957                 0.067392
7                 2048.0       0.105094                 0.105058
8                 4096.0       0.181941                 0.181808
9                 8192.0       0.334227                 0.334324
10               16384.0       0.640429                 0.640961
11               32768.0       1.267897                 1.269120
12               65535.0       2.534238                 2.504408

prompt-sm80-Phi-3-mini-128k-b1-h32_32x96-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.016112                 0.026949
1              32.0       0.016486                 0.017284
2              64.0       0.020910                 0.020994
3             128.0       0.029306                 0.029452
4             256.0       0.044604                 0.044642
5             512.0       0.090079                 0.086868
6            1024.0       0.208169                 0.208094
7            2048.0       0.604687                 0.607910
8            4096.0       2.029056                 2.046771
9            8192.0       7.792128                 7.906303
10          16384.0      34.271233                34.418175
11          32768.0     160.377853               159.980545
12          65536.0     733.443054               734.722046

token-sm80-Phi-3-mini-128k-b1-h32_32_d96-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.016339                 0.015718
1                   32.0       0.016572                 0.015964
2                   64.0       0.016182                 0.016192
3                  128.0       0.019373                 0.018621
4                  256.0       0.021856                 0.022463
5                  512.0       0.028943                 0.028888
6                 1024.0       0.041124                 0.041104
7                 2048.0       0.067668                 0.067542
8                 4096.0       0.117528                 0.117447
9                 8192.0       0.216241                 0.215492
10               16384.0       0.413434                 0.414047
11               32768.0       0.811085                 0.810612
12               65536.0       1.606189                 1.606458
13              131071.0       3.193037                 3.192491

prompt-sm80-Phi-3-mini-128k-b4-h32_32x96-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.019385                 0.019403
1              32.0       0.019801                 0.020006
2              64.0       0.025958                 0.025376
3             128.0       0.056445                 0.055909
4             256.0       0.103180                 0.102221
5             512.0       0.244224                 0.244360
6            1024.0       0.703066                 0.709327
7            2048.0       2.307456                 2.335001
8            4096.0       8.334522                 8.406760
9            8192.0      33.340416                33.758209
10          16384.0     144.141312               145.005569
11          32768.0     655.496216               655.656982
12          65536.0    2981.463135              2984.790039

token-sm80-Phi-3-mini-128k-b4-h32_32_d96-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018701                 0.018185
1                   32.0       0.020625                 0.019213
2                   64.0       0.019936                 0.019943
3                  128.0       0.023648                 0.023689
4                  256.0       0.030309                 0.030305
5                  512.0       0.043501                 0.043801
6                 1024.0       0.067314                 0.068014
7                 2048.0       0.108649                 0.108134
8                 4096.0       0.186053                 0.186848
9                 8192.0       0.339973                 0.339742
10               16384.0       0.643288                 0.644366
11               32768.0       1.261468                 1.261510
12               65536.0       2.502252                 2.501820
13              131071.0       4.990437                 4.989521

prompt-sm80-Phi-3-small-128k-b1-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.025280                 0.023331
1              32.0       0.023071                 0.025931
2              64.0       0.022883                 0.026258
3             128.0       0.030658                 0.031445
4             256.0       0.057659                 0.057073
5             512.0       0.095589                 0.106579
6            1024.0       0.228532                 0.229402
7            2048.0       0.662315                 0.663349
8            4096.0       2.242885                 2.248095
9            8192.0       8.194646                 8.180395
10          16384.0      33.926659                35.130882
11          32768.0     175.320068               184.967163
12          65536.0     810.447876               847.632385

token-sm80-Phi-3-small-128k-b1-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.015517                 0.016038
1                   32.0       0.016372                 0.015477
2                   64.0       0.015472                 0.016016
3                  128.0       0.019291                 0.018664
4                  256.0       0.036250                 0.035990
5                  512.0       0.041691                 0.042238
6                 1024.0       0.053730                 0.053126
7                 2048.0       0.075912                 0.076439
8                 4096.0       0.121336                 0.121334
9                 8192.0       0.213104                 0.212443
10               16384.0       0.394353                 0.394272
11               32768.0       0.756965                 0.757017
12               65536.0       1.484548                 1.485371
13              131071.0       2.939200                 2.939552

prompt-sm80-Phi-3-small-128k-b4-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.044326                 0.019298
1              32.0       0.021840                 0.021408
2              64.0       0.027492                 0.027802
3             128.0       0.058128                 0.059431
4             256.0       0.104300                 0.106019
5             512.0       0.242562                 0.244948
6            1024.0       0.689614                 0.692305
7            2048.0       2.297931                 2.312857
8            4096.0       8.654848                 8.843170
9            8192.0      38.770176                40.929279
10          16384.0     175.572998               183.692291
11          32768.0     780.126221               820.551697
12          65536.0    3357.564941              3488.527344

token-sm80-Phi-3-small-128k-b4-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018061                 0.017995
1                   32.0       0.018225                 0.018851
2                   64.0       0.018203                 0.018104
3                  128.0       0.023161                 0.023651
4                  256.0       0.038421                 0.037673
5                  512.0       0.047590                 0.046938
6                 1024.0       0.065639                 0.066055
7                 2048.0       0.103545                 0.103581
8                 4096.0       0.180461                 0.179998
9                 8192.0       0.332667                 0.332564
10               16384.0       0.638503                 0.639094
11               32768.0       1.249180                 1.249479
12               65536.0       2.469457                 2.471666
13              131071.0       4.915362                 4.914499

prompt-sm80-Phi-3-medium-128K-b1-h40_10x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.025759                 0.016318
1              32.0       0.018282                 0.018111
2              64.0       0.022642                 0.022978
3             128.0       0.030860                 0.037988
4             256.0       0.055703                 0.050318
5             512.0       0.113465                 0.113776
6            1024.0       0.267678                 0.268292
7            2048.0       0.795202                 0.797222
8            4096.0       2.737953                 2.740435
9            8192.0      10.101760                10.149092
10          16384.0      43.326466                43.990013
11          32768.0     230.886398               229.886978
12          65536.0    1067.412476              1052.922852

token-sm80-Phi-3-medium-128K-b1-h40_10_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.016122                 0.015582
1                   32.0       0.015594                 0.016262
2                   64.0       0.016099                 0.015512
3                  128.0       0.018708                 0.019510
4                  256.0       0.037582                 0.036341
5                  512.0       0.042411                 0.041894
6                 1024.0       0.053278                 0.053914
7                 2048.0       0.076553                 0.076636
8                 4096.0       0.121539                 0.121610
9                 8192.0       0.212083                 0.212377
10               16384.0       0.395086                 0.395280
11               32768.0       0.757879                 0.757888
12               65536.0       1.486093                 1.486915
13              131071.0       2.941728                 2.941408

prompt-sm80-Phi-3-medium-128K-b4-h40_10x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.019448                 0.018872
1              32.0       0.022290                 0.022380
2              64.0       0.027986                 0.027955
3             128.0       0.062699                 0.062175
4             256.0       0.124868                 0.125247
5             512.0       0.298873                 0.298169
6            1024.0       0.862584                 0.863467
7            2048.0       2.944640                 2.957824
8            4096.0      11.318656                11.390720
9            8192.0      52.606976                52.019199
10          16384.0     232.616959               230.360062
11          32768.0    1024.171997              1019.540466
12          65536.0    4377.362305              4354.510742

token-sm80-Phi-3-medium-128K-b4-h40_10_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018192                 0.018175
1                   32.0       0.018999                 0.018319
2                   64.0       0.018447                 0.018897
3                  128.0       0.023863                 0.023195
4                  256.0       0.037712                 0.038192
5                  512.0       0.048863                 0.048548
6                 1024.0       0.067244                 0.066473
7                 2048.0       0.105203                 0.105021
8                 4096.0       0.180712                 0.180429
9                 8192.0       0.334948                 0.334734
10               16384.0       0.640662                 0.639709
11               32768.0       1.252196                 1.251684
12               65536.0       2.474927                 2.474280
13              131071.0       4.930829                 4.959340
```
2024-05-10 14:14:15 -07:00
guyang3532
cfe830b248
Generalize label input sparsity check and refactor (#20636)
### Description
The InsertGatherBeforeSceLoss optimization is enabled when the density
of the label padding is less than 90%, so we need to check the label
padding density to decide whether to enable the optimization.

Before this PR, we checked the graph inputs and correlated one with the
SCE node by iterating the graph from the SCE node back to a graph input.
This is hard to generalize because there may be complicated patterns
between the graph input and the SCE node.

This PR instead checks the padding density on the direct input of the
SCE module, rather than on the graph input, during the first graph
execution when exporting the ONNX graph.
If the density is < 90%, a flag PythonOp is inserted after the SCE node as:
```
           SoftmaxCrossEntropy
		  |
            PythonOp (func_name: FlagAndPrintDensity)   (insert if density < 90%)
		  |
            Following graph
```

When InsertGatherBeforeSceLoss is invoked, it checks whether the flag
PythonOp (func_name: FlagAndPrintDensity) is present after the SCE node;
if so, it removes the flag and performs the padding elimination
optimization.

If the environment variable ORTMODULE_PRINT_INPUT_DENSITY is 1, the
PythonOp (func_name: FlagAndPrintDensity) prints the input density at
each step. In this case the PythonOp is not removed.
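As a rough illustration of the density check described above, here is a minimal sketch. It assumes labels use -100 as the padding/ignore value and that "density" means the fraction of valid (non-padding) label positions; both are assumptions for illustration, not taken from the PR.

```python
IGNORE_INDEX = -100  # assumed padding/ignore label value (hypothetical)

def label_density(labels):
    """Fraction of label positions that are valid (not padding)."""
    flat = [v for row in labels for v in row]
    return sum(v != IGNORE_INDEX for v in flat) / len(flat)

# Two sequences of length 4; only 3 of 8 positions carry real labels.
labels = [[5, 7, -100, -100],
          [3, -100, -100, -100]]
density = label_density(labels)  # 3 valid of 8 -> 0.375
enable_gather = density < 0.9    # sparse enough to insert the Gather
```

In the real flow this check runs on the direct SCE input at the first graph execution, rather than offline as here.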
2024-05-10 21:55:43 +08:00
vivianw-amd
e124cf8e76
set unload to false to prevent crash when linux lib load not successfully (#20626)
### Description
<!-- Describe your changes. -->
During VitisAI shared library load, set unload to false to prevent a
crash when the Linux library fails to load.


### Motivation and Context

In a Linux environment, when the library is not loaded successfully, the
process ends up crashing without giving any useful message. The fix
prevents the crash and emits a useful message when the shared library is
not loaded correctly.
2024-05-10 00:01:23 -07:00
Tianlei Wu
01dd991f97
Update SparseAttention op spec to make it more flexible (#20625)
### Description
Make the operator more flexible:
(1) Decouple the max sequence lengths of the rotary cache, kv cache, and
block mask; they are now allowed to have different values.
(2) Replace the dense block_mask with a CSR format (block_row_indices
and block_col_indices) to improve performance.
(3) Mark past_key and past_value as required inputs, since they are
needed to compute the shapes of present_key and present_value.

### Motivation and Context
(1) LongRoPE has short and long rotary caches, which have different
lengths.
(2) Most users do not have enough GPU memory to run the maximum sequence
length of 128K. This change allows users to test with a smaller kv cache
length without running out of memory.
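For illustration, converting a dense 0/1 block mask to CSR-style indices works roughly as below. This is a sketch only; the exact layout expected by SparseAttention's block_row_indices / block_col_indices is assumed here, not confirmed from the op spec.

```python
def block_mask_to_csr(block_mask):
    """Convert a dense 0/1 block mask to CSR row offsets + column indices.

    row_offsets[r]..row_offsets[r+1] delimits the active columns of row r.
    """
    row_offsets = [0]
    col_indices = []
    for row in block_mask:
        for c, v in enumerate(row):
            if v:
                col_indices.append(c)
        row_offsets.append(len(col_indices))
    return row_offsets, col_indices

# 3x3 causal (lower-triangular) block mask:
mask = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
rows, cols = block_mask_to_csr(mask)
# rows == [0, 1, 3, 6]; cols == [0, 0, 1, 0, 1, 2]
```

The CSR form stores only the active blocks, so a kernel can skip empty blocks without scanning the full dense mask.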
2024-05-09 22:15:21 -07:00
George Wu
a0c4bd4da7
[qnn ep] sign onnxruntime.dll/pyd for qnn packages (#20634)
sign only onnxruntime.dll and onnxruntime_pybind11_state.pyd in
packages.
2024-05-09 20:45:44 -07:00
pengwa
56f7035521
Improve perf for mem efficient grad mgmt (#20480)
### Improve perf for mem efficient grad mgmt

When the memory-efficient gradient management feature is enabled, the
weight-retrieval PythonOp for every layer is launched at the beginning
of the forward pass, which leaves the GPU stream idle for a few
milliseconds. The reason is that the ReversedDFS ordering cannot always
handle such input branching well, so we introduce a
distance-to-input-leaf concept into the ReversedDFS, which moves not
only the problematic PythonOp but also the Cast ops following the
weight retrieval to the places where they are needed.
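The distance-to-input-leaf idea can be sketched roughly as follows: compute each node's distance to the nearest graph input, then use it as a tie-breaker when ordering nodes. This is a simplified illustration under an assumed graph representation, not the actual ORT implementation.

```python
from collections import deque

def distance_to_inputs(graph_inputs, consumers):
    """BFS distance from each node to its nearest graph input.

    `consumers` maps a node name to the nodes consuming its output
    (a hypothetical graph representation, not ORT's internal one).
    """
    dist = {n: 0 for n in graph_inputs}
    queue = deque(graph_inputs)
    while queue:
        node = queue.popleft()
        for nxt in consumers.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# weight -> PythonOp(retrieve) -> Cast -> MatMul <- input
consumers = {
    "weight": ["PythonOp"],
    "PythonOp": ["Cast"],
    "Cast": ["MatMul"],
    "input": ["MatMul"],
}
dist = distance_to_inputs(["weight", "input"], consumers)
# dist["PythonOp"] == 1, dist["Cast"] == 2, dist["MatMul"] == 1
```

With such distances available, an ordering pass can defer the weight-retrieval branch until just before its consumer instead of launching it at the start of the forward pass.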

Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4627 samples),
61.04 samples/second.
This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second
(+0.8% gain).

Main branch:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/75c4131e-dade-49b0-aa8b-ee1c637ad9a8)


This PR:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/e590a536-3b80-4f51-b89f-f25a55ddd7e2)


2024-05-10 08:09:17 +08:00
Yi Zhang
5a18818e1d
Migrate training storage from SAS to managed identity (#20618)
### Description
orttrainingtestdatascus only stores mnist, whose size is just 64M, in
Azure File.
To meet security requirements and reduce maintenance cost, move the
test data to lotusscus and save it in Azure Blob.
2024-05-09 15:44:29 -07:00
Jon Campbell
768c79317c
Enable QNN HTP support for Node (#20576)
### Description
Add support for using Onnx Runtime with Node

### Motivation and Context
ONNX Runtime supports the QNN HTP, but does not support it for Node.js.
This adds baseline support for using ONNX Runtime with Node.

Note it does not update the node packages that are distributed
officially. This simply patches the onnxruntime.dll to allow 'qnn' to be
used as an execution provider.

Testing was done using the existing onnxruntime-node package. The
`onnxruntime.dll` and `onnxruntime_binding.node` were swapped into
`node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly
built version, then the various QNN dlls and .so files were placed next
to the onnxruntime.dll. Testing was performed on a variety of models and
applications, but the easiest test is to modify the [node quickstart
example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).
2024-05-09 13:11:07 -07:00
Jian Chen
d1cbb3e076
The time for nuget pkg should be consistent (#20522)
This pull request primarily involves changes to the build scripts in the
`tools/ci_build/github/azure-pipelines` directory. The changes add build
date and time information to the build process. This is achieved by
introducing two new parameters, `BuildDate` and `BuildTime`, and
incorporating them into the `msbuildArguments` in multiple locations.

Addition of new parameters:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R309-R310):
Added `BuildDate` and `BuildTime` parameters using the pipeline's start
time.

Incorporation of new parameters in `msbuildArguments`:

*
[`tools/ci_build/github/azure-pipelines/c-api-noopenmp-packaging-pipelines.yml`](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948):
Added `CurrentDate` and `CurrentTime` parameters to `msbuildArguments`
in multiple locations.
[[1]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948)
[[2]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1092-R1093)
[[3]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1114-R1115)
[[4]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1137-R1138)
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L446-R448):
Incorporated the `CurrentDate` and `CurrentTime` parameters into
`msbuildArguments`.
2024-05-09 11:35:45 -07:00
Tianlei Wu
69cfcba38a
[CUDA] Sparse Attention support 128k sequence length (#20614)
### Description
When the sequence length is 128K, block_mask has 2048 rows, which was
not supported by the previous kernel.
(1) Add a new kernel to handle more than 1024 rows, where each thread
handles two rows.
(2) Add a test for sequence length 128K.
2024-05-08 20:54:38 -07:00
Edward Chen
a0db2187ee
Update CocoaPods package release script. (#20608)
- Update method for uploading to Azure storage to use managed identity.
- Allow helper script tasks to be split across different calls.
- Rewrite helper script in Python.

Motivation:
Recently the Azure storage account configuration was changed and now the old way of uploading to it no longer works.
2024-05-08 16:17:26 -07:00