onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-24 22:17:32 +00:00

Author	SHA1	Message	Date
pengwa	56f7035521	Improve perf for mem efficient grad mgmt (#20480 ) ### Improve perf for mem efficient grad mgmt When the memory efficient gradient management feature is enabled, the weight retrieval PythonOp for every layer is launched at the beginning of the forward pass, which leaves the GPU stream idle for a few milliseconds. The reason is that the ReversedDFS ordering cannot ALWAYS handle such input branching well, so we introduce a distance-to-input_leaf concept when doing the ReversedDFS. It moves not only the problematic PythonOp to the place where it is needed, but also the Cast ops that follow the weight retrieval. Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4627 samples), 61.04 samples/second This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second (+0.8% gain) Main branch: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/75c4131e-dade-49b0-aa8b-ee1c637ad9a8) This PR: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/e590a536-3b80-4f51-b89f-f25a55ddd7e2)	2024-05-10 08:09:17 +08:00
Yi Zhang	5a18818e1d	Migrate training storage from SAS to managed identity (#20618 ) ### Description orttrainingtestdatascus stores only the MNIST test data, which is just 64M in Azure Files. To meet security requirements and reduce maintenance cost, move the test data to lotusscus and store it in Azure Blob storage.	2024-05-09 15:44:29 -07:00
Jon Campbell	768c79317c	Enable QNN HTP support for Node (#20576 ) ### Description Add support for using Onnx Runtime with Node ### Motivation and Context Onnx Runtime supports the QNN HTP, but does not support it for Node.js. This adds baseline support for the Onnx Runtime to be used with Node. Note it does not update the node packages that are distributed officially. This simply patches the onnxruntime.dll to allow 'qnn' to be used as an execution provider. Testing was done using the existing onnxruntime-node package. The `onnxruntime.dll` and `onnxruntime_binding.node` were swapped into `node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly built version, then the various QNN dlls and .so files were placed next to the onnxruntime.dll. Testing was performed on a variety of models and applications, but the easiest test is to modify the [node quickstart example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).	2024-05-09 13:11:07 -07:00
Jian Chen	d1cbb3e076	The time for nuget pkg should be consistent (#20522 ) This pull request primarily involves changes to the build scripts in the `tools/ci_build/github/azure-pipelines` directory. The changes add build date and time information to the build process. This is achieved by introducing two new parameters, `BuildDate` and `BuildTime`, and incorporating them into the `msbuildArguments` in multiple locations. Addition of new parameters: * [`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R309-R310): Added `BuildDate` and `BuildTime` parameters using the pipeline's start time. Incorporation of new parameters in `msbuildArguments`: * [`tools/ci_build/github/azure-pipelines/c-api-noopenmp-packaging-pipelines.yml`](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948): Added `CurrentDate` and `CurrentTime` parameters to `msbuildArguments` in multiple locations. [[1]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948) [[2]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1092-R1093) [[3]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1114-R1115) [[4]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1137-R1138) * [`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L446-R448): Incorporated the `CurrentDate` and `CurrentTime` parameters into `msbuildArguments`.### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-05-09 11:35:45 -07:00
Tianlei Wu	69cfcba38a	[CUDA] Sparse Attention support 128k sequence length (#20614 ) ### Description When sequence length is 128K, block_mask has 2048 rows, which is not supported by the previous kernel. (1) Add a new kernel to handle more than 1024 rows, where each thread handles two rows. (2) Add a test for sequence length 128K.	2024-05-08 20:54:38 -07:00
Edward Chen	a0db2187ee	Update CocoaPods package release script. (#20608 ) - Update method for uploading to Azure storage to use managed identity. - Allow helper script tasks to be split across different calls. - Rewrite helper script in Python. Motivation: Recently the Azure storage account configuration was changed and now the old way of uploading to it no longer works.	2024-05-08 16:17:26 -07:00
kunal-vaishnavi	274d162d93	Fix SparseAttention cos/sin cache dimension checks (#20609 ) ### Description This PR fixes the dimension checks for the cos/sin caches used in the rotary embeddings in the `SparseAttention` operator. ### Motivation and Context This PR ports over the same changes from [this PR](https://github.com/microsoft/onnxruntime/pull/20547) for `GroupQueryAttention`.	2024-05-08 16:07:02 -07:00
George Wu	58d7b12205	support --arm64ec for qnn ep build (#20607 ) link against binaries in arm64x-windows-msvc when building qnn ep with --arm64ec build option.	2024-05-08 11:09:15 -07:00
Dmitri Smirnov	08ecf30e0b	Implement numpy array over CPU OrtValues on return values (#20539 ) ### Description Create numpy arrays based on the native buffers of returned OrtValues. Hold on to the OrtValue until the numpy array is garbage collected. ### Motivation and Context This saves cpu on tensor copies and addresses customer concerns.	2024-05-08 10:56:36 -07:00
Yufeng Li	156d52163d	optimize gqa cpu (#20598 ) ### Description Optimize the GQA implementation on CPU. The main optimizations are: 1. compute attention on the real total sequence length instead of the maximum sequence length when past/present share the same buffer 2. remove the mask 3. remove the transpose after attention x value. This improves the phi3 model https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py with max sequence length 2k/4k from 10 tps to 20 tps.	2024-05-08 10:42:29 -07:00
Tianlei Wu	1f509215bc	Fix GroupQueryAttention benchmark script (#20291 ) ### Description Fix a few issues in GQA: (1) Memory efficient attention does not support bfloat16, so it needs to be disabled when bfloat16 is used. (2) When prompt length is 1, it is not classified as prompt. (3) Fix benchmark_gqa.py (4) Add a comment about seqlen_k to avoid confusion. ### Motivation and Context https://github.com/microsoft/onnxruntime/pull/20279	2024-05-08 09:48:46 -07:00
maggie1059	b6d9abf150	Revert compute queue default for DML (#20604 )	2024-05-08 09:48:29 -07:00
Changming Sun	08b637350a	Remove an extra space in azure_scale_set_vm_mount_test_data.sh (#20584 )	2024-05-08 09:46:50 -07:00
Guenther Schmuelling	55a6986d38	optimize skiplayernorm (#20551 ) SkipSimplifiedLayerNormalization used in phi3 comes down from 222usec to 14usec	2024-05-08 08:40:03 -07:00
Ted Themistokleous	737eb48f5c	MIGraphX EP: Add set_false_math to false by default (#20520 ) Patching in fast math disabled in the MIGraphX compile stage in the MIGraphX EP ### Description Allow the MIGraphX API to compile the program given to the EP with fast math turned off by default. ### Motivation and Context Fixes an accuracy issue we're seeing with GELU parity tests. Unless fast math is disabled, GELU uses a faster but less numerically stable version, which trades accuracy for speed. Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>	2024-05-08 15:48:21 +08:00
Scott McKay	8d09baf49f	Clarify when protobuf dependency builds protoc (#20542 ) ### Description Currently, figuring out whether the protobuf dependency is building protoc is a little obtuse and inconsistent: * in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF to indicate the protobuf dependency is not building protoc * e.g. macOS/iOS/visionOS builds * for a user provided protoc path we don't set protobuf_BUILD_PROTOC_BINARIES, and inside protobuf_function.cmake that determines if `protobuf::protoc` is added as a dependency or not * `0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)` To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE is set and valid. Remove the outdated script that built an external protoc binary which was used in later builds. The build setup will fetch a pre-built protoc so there's no need for this additional build. ### Motivation and Context Make it easier to figure out if protoc is coming from the protobuf dependency.	2024-05-08 08:30:11 +10:00
aciddelgado	4e27841bdb	fix gqa cpu nan bug (#20521 ) ### Description There was a bug with GQA on CPU where, in the token case with batch_size > 1 and past_present_share_buffer off, the output would occasionally contain NaNs. This PR fixes that. It also updates documentation and fixes posid gen for rotary in CUDA in the prompt case. ### Motivation and Context This PR solves the GQA CPU bug, updates the documentation, and makes seqlens_k irrelevant for the prompt case, which is useful to prevent user error.	2024-05-07 15:19:26 -07:00
moyo1997	aff04ba08a	Dev/mookerem/arm64x update (#20536 ) Made some changes to the arm64x.cmake script to: - handle edge case - Enable Projects that include onnxruntime as submodule and build it, to be able to build as x without causing onnxruntime build_as_x to fail.	2024-05-07 12:50:38 -07:00
Hector Li	d121a1f906	Enable int32 data support for Clip (#20590 ) Enable int32 data support for Clip fix issue: https://github.com/microsoft/onnxruntime/issues/20525	2024-05-07 11:35:29 -07:00
Tianlei Wu	d693aef39e	Fix Sparse Attention with Packed QKV inputs (#20591 ) ### Description (1) Fix UnpackQKV kernel (2) Update test_sparse_attention.py with packed QKV option	2024-05-07 10:50:01 -07:00
Patrice Vignola	478d3e0c62	Add simplified layernorm fusion for Gemma (#20572 ) Gemma has a `Mul` node right after the `Gather` and before the first layer norm.	2024-05-06 20:07:14 -07:00
Yufeng Li	05b4ad2e57	fix bug: input q/k/v should not be modified by operator (#20555 ) ### Description Operator should not modify input tensors because they are managed by framework and may be reused by other nodes.	2024-05-06 16:05:00 -07:00
Chi Lo	c86476a636	[TensorRT] adapt for TRT lib name change after TRT 10 GA (update) (#20550 ) https://github.com/microsoft/onnxruntime/pull/20445 The nvonnxparser still needs major version appending to it when building oss parser.	2024-05-06 15:00:13 -07:00
Ye Wang	ae6195b5a7	MoE Gemm perf tuning (#20541 ) ### Description This PR supports profiling and tuning MoE Gemm kernels in the very first run and stores the best configuration for reuse in the following runs. The Gemm id (the key to the config map, int64_t) is determined by num_rows, gemm_n and gemm_k for each type. The first 32 bits are total_rows, the next 16 bits are gemm_n, and the next 16 bits are gemm_k: `int64_t key = total_rows; key = key << 16 | gemm_n; key = key << 16 | gemm_k;` Results for Mixtral-fp16 on 2 A100 with tp=2, batch size = 1, seq_len = 1k: prompt latency 138ms before and 100ms after; token latency 16.4ms before and 13.9ms after.	2024-05-06 14:40:44 -07:00
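The key layout described in the MoE Gemm commit above can be sketched in Python as follows. This is only an illustration of the bit packing quoted in the message; the function names `pack_gemm_key` and `unpack_gemm_key` are hypothetical, and the range checks are an assumption added here (the commit message does not say how out-of-range values are handled).

```python
def pack_gemm_key(total_rows: int, gemm_n: int, gemm_k: int) -> int:
    """Pack (total_rows, gemm_n, gemm_k) into one integer key.

    Mirrors the quoted C++ snippet: total_rows lands in the upper 32 bits,
    gemm_n in the next 16 bits, gemm_k in the lowest 16 bits.
    """
    # Assumption: reject values that would spill into a neighbouring field.
    if not 0 <= total_rows < (1 << 32):
        raise ValueError("total_rows must fit in 32 bits")
    if not 0 <= gemm_n < (1 << 16) or not 0 <= gemm_k < (1 << 16):
        raise ValueError("gemm_n and gemm_k must each fit in 16 bits")
    key = total_rows
    key = (key << 16) | gemm_n
    key = (key << 16) | gemm_k
    return key


def unpack_gemm_key(key: int) -> tuple[int, int, int]:
    """Recover (total_rows, gemm_n, gemm_k) from a packed key."""
    return key >> 32, (key >> 16) & 0xFFFF, key & 0xFFFF


# Illustrative values only; they are not taken from the commit.
key = pack_gemm_key(1024, 14336, 4096)
print(hex(key), unpack_gemm_key(key))
```

Packing the three dimensions into a single integer lets the tuned configuration be looked up with one map access per Gemm shape.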
pengwa	addcc4c4b2	Fix missing node during mem efficient topo sort (#20497 ) ### Fix missing node during mem efficient topo sort Some nodes are not consumed by the backward path, and they also do not generate graph outputs. We missed those nodes, so this PR fixes that and adds related tests. A side note: we should remove those nodes that are not used for computing any graph outputs in a graph transformer. (TODO)	2024-05-06 17:25:23 +08:00
Adam Pocock	a36692066d	[java] CUDA & TensorRT options fix (#20549 ) ### Description I misunderstood how UpdateCUDAProviderOptions and UpdateTensorRTProviderOptions work in the C API, I had assumed that they updated the options struct, however they re-initialize the struct to the defaults then only apply the values in the update. I've rewritten the Java bindings for those classes so that they aggregate all the updates and apply them in one go. I also updated the C API documentation to note that these classes have this behaviour. I've not checked if any of the other providers with an options struct have this behaviour, we only expose CUDA and TensorRT's options in Java. There's a small unrelated update to add a private constructor to the Fp16Conversions classes to remove a documentation warning (they shouldn't be instantiated anyway as they are utility classes containing static methods). ### Motivation and Context Fixes #20544.	2024-05-05 00:16:55 -07:00
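The behaviour described in the Java bindings commit above (each update call resets the options struct to its defaults and then applies only the values passed in that call) can be modelled with a small Python sketch. Everything here is a hypothetical stand-in: `DEFAULTS`, `OptionsStruct`, `update_options` and `AggregatingOptions` are not ONNX Runtime APIs, and the option values are illustrative only.

```python
# Hypothetical default values, for illustration only.
DEFAULTS = {"device_id": "0", "gpu_mem_limit": "unlimited"}


class OptionsStruct:
    """Stand-in for a native provider options struct."""

    def __init__(self):
        self.values = dict(DEFAULTS)


def update_options(struct: OptionsStruct, updates: dict) -> None:
    """Model of the described update semantics: reset to defaults, then apply."""
    struct.values = dict(DEFAULTS)
    struct.values.update(updates)


class AggregatingOptions:
    """Model of the fix: collect every update and apply them in one call."""

    def __init__(self):
        self._pending = {}
        self.struct = OptionsStruct()

    def add(self, key: str, value: str) -> None:
        self._pending[key] = value
        # Re-apply the full set each time so no earlier value is lost.
        update_options(self.struct, self._pending)


# Naive use: the second call silently discards the first setting.
naive = OptionsStruct()
update_options(naive, {"device_id": "1"})
update_options(naive, {"gpu_mem_limit": "1024"})
print(naive.values)

# Aggregated use: both settings survive.
agg = AggregatingOptions()
agg.add("device_id", "1")
agg.add("gpu_mem_limit", "1024")
print(agg.struct.values)
```

The design choice in the sketch is to keep the full set of user-supplied options on the wrapper side, so the reset-then-apply semantics of the update call no longer matter.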
Tianlei Wu	baaef59696	Add sparse attention kernel for H100 (sm90) (#20553 ) ### Description Follow up of https://github.com/microsoft/onnxruntime/pull/20216 to add sparse attention kernel compiled by Triton for H100 (sm90). - [x] Refine sparse attention v1 kernel compilation (remove some combinations) - [x] compile kernels for v1 kernels - [x] compile kernels for H100 - [x] run performance tests ### Performance Test setting `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to corresponding GQA with local attention window size 1024, or GQA with dense causal. Note that ORT-GQA-Dense has more computation than ORT-SparseAtt, while ORT-GQA-Local has less computation (no vertical strides) than ORT-SparseAtt. They are added for reference. It is not a fair comparison, but it shows the benefit of sparsity vs dense. Example results in Azure Standard_ND96isr_H100_v5 VM with NVIDIA H100-80GB-HBM3 GPU (sm=90):
```
prompt-sm90-batch4-head32-d128-local16-vert8-torch.float16:
   sequence_length  TORCH-GQA  ORT-GQA-Dense  ORT-GQA-Local  ORT-SparseAtt
0             16.0   0.079877       0.006362       0.006403       0.042758
1             32.0   0.086920       0.016404       0.016686       0.044183
2             64.0   0.090727       0.020429       0.020409       0.045343
3            128.0   0.128148       0.032009       0.031984       0.051516
4            256.0   0.323933       0.074110       0.073920       0.068308
5            512.0   1.021856       0.162167       0.161951       0.109226
6           1024.0   3.596002       0.452629       0.452780       0.231653
7           2048.0  13.865088       1.499534       1.195749       0.515488
8           4096.0   0.000000       5.454785       2.669682       1.163233
9           8192.0   0.000000      22.068159       6.018604       2.772873

token-sm90-batch4-head32-d128-local16-vert8-torch.float16:
   past_sequence_length  TORCH-GQA  ORT-GQA-Dense  ORT-GQA-Local  ORT-SparseAtt
0                  16.0   0.104460       0.012652       0.012661       0.069549
1                  32.0   0.113866       0.012776       0.012765       0.069024
2                  64.0   0.124600       0.016791       0.012672       0.069397
3                 128.0   0.108658       0.017900       0.018294       0.074844
4                 256.0   0.115463       0.029409       0.029608       0.078911
5                 512.0   0.149824       0.033968       0.033701       0.092998
6                1024.0   0.234050       0.042930       0.042951       0.116920
7                2048.0   0.390695       0.061462       0.043008       0.121555
8                4096.0   0.000000       0.097505       0.042948       0.134757
9                8191.0   0.000000       0.165861       0.043542       0.158796
```
The following might help performance on short sequence lengths, but it needs an update to the operator spec: fall back to flash attention when total_sequence length < local_blocks * block_size	2024-05-04 19:53:32 -07:00
Hector Li	cb37b1b43b	Return ENGINE_ERROR for QNN NPU SSR issue (#20560 ) Return ENGINE_ERROR for QNN NPU SSR issue	2024-05-04 12:46:50 -07:00
Changming Sun	38412b68c6	Update setup.py: update TRT version (#20557 ) ### Description As a follow-up of #20506	2024-05-03 22:39:20 -07:00
Yufeng Li	d6280e26bd	check rotary_embedding with seq length (#20547 ) ### Description When past/present share the same buffer, the present sequence length differs from the total sequence length. The size of the cos/sin cache should be checked against the sequence length.	2024-05-03 09:43:53 -07:00
Hector Li	e540423179	[QNN EP] Conv ConvTranspose 3D support (#20507 ) ### Description Support Conv ConvTranspose 3D for QNN EP	2024-05-03 08:55:31 -07:00
Edward Chen	030a9611c2	Add `#pragma once` to matmul_scale_fusion.h. (#20538 )	2024-05-02 15:38:11 -07:00
Adrian Lizarraga	7211eab365	[QNN EP] Support HardSigmoid (#20508 ) ### Description - Adds support for float32/float16 HardSigmoid on HTP backend. Decomposes `HardSigmoid(X)` into `max(0, min(1, alpha * X + beta))`. - Fuses the sequence `X * HardSigmoid<alpha=1/6, beta=0.5>(X)` into a single `HardSwish(x)`. Only applies to non-quantized HardSigmoid/Mul. ### Motivation and Context QNN does not natively support HardSigmoid. These changes expand model support on QNN EP.	2024-05-02 15:36:54 -07:00
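The decomposition and fusion identity stated in the HardSigmoid commit above can be checked numerically with a short sketch. It uses the ONNX default `alpha=0.2, beta=0.5` for HardSigmoid; the reference `hard_swish` below is written in the common `x * relu6(x + 3) / 6` form, which is an assumption chosen here to make the comparison non-trivial, not code taken from the QNN EP.

```python
import math


def hard_sigmoid(x: float, alpha: float = 0.2, beta: float = 0.5) -> float:
    """HardSigmoid decomposed as in the commit: max(0, min(1, alpha * x + beta))."""
    return max(0.0, min(1.0, alpha * x + beta))


def hard_swish(x: float) -> float:
    """Reference HardSwish in the x * relu6(x + 3) / 6 form."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def mul_hard_sigmoid(x: float) -> float:
    """The unfused pattern: X * HardSigmoid<alpha=1/6, beta=0.5>(X)."""
    return x * hard_sigmoid(x, alpha=1.0 / 6.0, beta=0.5)


# The unfused pattern and HardSwish agree across the clamped and linear regions.
for x in (-5.0, -3.0, -1.5, 0.0, 1.0, 2.9, 3.0, 7.0):
    assert math.isclose(mul_hard_sigmoid(x), hard_swish(x), abs_tol=1e-12)
print("X * HardSigmoid<1/6, 0.5>(X) matches HardSwish(X) on the sampled points")
```

Because the two expressions agree for every input, replacing the Mul/HardSigmoid pair with a single HardSwish preserves the result while saving an operator.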
Hector Li	e6228575e4	Add tensor v2 support (#20530 ) ### Description Add tensor v2 support to unblock the inference with context binary generated from QNN v2.21	2024-05-02 13:49:04 -07:00
aamajumder	589aeb7036	[DML EP] Register DFT-20 (#20341 ) ### Description This PR registers DFT-20 to the DML EP.	2024-05-02 11:08:39 -07:00
Adrian Lizarraga	0dda8b0c44	[QNN EP] Update QNN SDK to 2.21 (#20534 ) ### Description - Updates QNN pipelines to use QNN SDK 2.21 - Downloads QNN SDK from Azure storage to avoid having to rebuild images when a new version is released. ### Motivation and Context Test with the latest QNN SDK.	2024-05-01 20:17:35 -07:00
Tianlei Wu	87076553b0	[CUDA] Add SparseAttention kernel for sm=75 (#20531 ) ### Description Follow up of #20216 to add kernel for sm=75 (GPU like T4, Geforce RTX 2080, GeForce GTX 1650 Ti, NVIDIA TITAN RTX, RTX 4000 etc) - [x] Add kernel for sm=75 - [x] Update dispatch code to use sm to call different kernel. - [x] Update compile script to use num_stages=2 instead of 3 for sm=75 - [x] Refactor test script and add tests for bfloat16. - [x] Fix performance test of token generation (previously we did not concatenate past_key) - [x] Fix debug build - [x] Run performance test and update numbers. For sm=70, the v1 kernel can be compiled but there is error in compiling v2 kernel. So it is skipped in this pull request. Performance Test on T4 GPU (using Standard_NC4as_T4_v3 Azure VM) with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to corresponding GQA with dense causal. Note that GQA with dense need more computation since no sparsity is used. The TORCH-GQA use naive implementation (using cuSPARSE Block-SpMM could be faster). 
```
prompt-sm75-batch4-head32-d128-local16-vert8-torch.float16:
   sequence_length   TORCH-GQA  ORT-GQA-Dense  ORT-SparseAtt
1             32.0    0.184173       2.994347       0.089064
2             64.0    0.303300       3.023986       0.107418
3            128.0    0.887795       3.073728       0.174213
4            256.0    2.797654       3.246899       0.357869
5            512.0   10.055048       3.814039       0.893903
6           1024.0   37.849937       5.818439       2.658720
7           2048.0  148.641785      13.638480       7.202690
8           4096.0         OOM      43.556847      17.680954
9           8192.0         OOM     161.628540      44.336670

token-sm75-batch4-head32-d128-local16-vert8-torch.float16:
   past_sequence_length  TORCH-GQA  ORT-GQA-Dense  ORT-SparseAtt
1                  32.0   0.110353       2.996305       0.137509
2                  64.0   0.145088       3.006860       0.165424
3                 128.0   0.219500       3.036448       0.192001
4                 256.0   0.347496       3.071341       0.249125
5                 512.0   0.595842       3.135225       0.398726
6                1024.0   1.081216       3.261110       0.612744
7                2048.0   2.060307       3.515578       0.685670
8                4096.0        OOM       4.022986       0.819707
9                8191.0        OOM       5.024528       1.072912
```
### Motivation and Context To run inference for Phi-3-small on a T4 GPU	2024-05-01 19:52:13 -07:00
Scott McKay	f9febc4f35	Remove usage of 'required reason' iOS API from protobuf (#20529 ) ### Description Using certain APIs is about to require a [privacy manifest](https://developer.apple.com/documentation/bundleresources/privacy_manifest_files/describing_use_of_required_reason_api) to be added to a package. Our version of protobuf uses `mach_absolute_time`. Patch as per https://github.com/protocolbuffers/protobuf/pull/15662/ to remove usage. ### Motivation and Context Usage of API will require a privacy manifest for an iOS app to be accepted as of 5/1/2024 #20519	2024-05-02 08:21:08 +10:00
Yifan Li	29417762f7	[TensorRT EP] support TensorRT 10-GA (#20506 ) ### Description This branch is based on rel-1.18.0 and supports TensorRT 10-GA.	2024-05-01 11:10:53 -07:00
Shubham Bhokare	7a7344dcc2	Update openai-whisper version in requirements.txt (#20473 ) ### Description Update openai-whisper version in requirements.txt	2024-04-30 22:25:41 -07:00
Hector Li	755aaea9a6	Qnn nuget update (#20527 ) ### Description Update Qnn nuget package to include Qnn libs and license file	2024-04-30 22:12:53 -07:00
Yi Zhang	91baeb8495	Reduce downloads to NodeJS to mitigate random connection exception. (#20518 ) ### Description There was connection exception in docker build in package pipeline ``` 48.26 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail 456.0 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2) ``` https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=453140&view=logs&j=f9f5b320-fa10-56c4-debe-61ea69c74793&t=1656e225-defa-5b12-8935-2a0a93e76a67&s=3c85d903-a183-5028-775e-d63999fcc9ae In fact, docker image shouldn't be rebuilt this time. Checked the code, The docker image tag in Linux_C_API_Packaging_GPU_x64 of onnxruntimecuda${{ variables.CUDA_VERSION_MAJOR }}build was same as the image tag of Linux-gpu-ci-pipeline, but their docker files are different. So changing the Linux GPU pipeline's image tag to avoid packaging pipeline docker image overridden unexpectedly.	2024-05-01 09:04:56 +08:00
vividsnow	5c3a1bc3b8	update onnxruntime_c_api.h (#20360 ) ### Description removing excess trailing semicolon from specific macro ### Motivation and Context I am preparing automatic generation of onnxruntime bindings for perl, and the parser (ucpp) has broken due to the "double semicolon" error in the subsequent lines where the macro is applied.	2024-04-30 16:47:24 -07:00
Edward Chen	a7fc0e8370	Only define CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. (#20509 ) Only define CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. It can cause unused variable warnings in some compilations.	2024-04-30 16:10:13 -07:00
Yi-Hong Lyu	33e883fbc4	Fix the doxygen error (#20515 ) Fix onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637: error: argument 'session' of command @param is not found in the argument list of ``` OrtApi::AddExternalInitializersFromFilesInMemory( OrtSessionOptions options, const char const external_initializer_file_names, char const external_initializer_file_buffer_array, const size_t external_initializer_file_lengths, size_t num_external_initializer_files) ```	2024-04-30 11:45:03 -07:00
Tianlei Wu	9f0fae29e8	[CUDA] Add SparseAttention operator for Phi-3-small (#20216 ) ### Description Add CUDA implementation for block sparse attention for Phi-3-small. Block sparse attention was proposed in [Sparse Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different sparse layout. In Phi-3-small, the sparse layout is static, and works with unidirectional (causal) attention. Compared to dense attention, the benefit of block sparse is to speed up both training and inference. It could save memory thus support longer context length. - [x] Add operator spec and shape inference - [x] Symbolic shape inference - [x] Refactor GroupQueryAttention to expose common kernels for kv cache concatenation, q/k/v transpose etc. - [x] Add cuda kernel to convert block mask to CSR format - [x] Add cuda kernel to generate position ids - [x] Add compile script and template files to convert triton kernel to cubin and dispatcher. - [x] Add triton kernel v1 for prompt - [x] Add triton kernel v2 for token generation and support padding - [x] Update IO Binding Helper to allow buffer sharing. - [x] Test relevance - [x] Test performance ### Performance Test in A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to corresponding GQA with local attention windows size 1024, or GQA with dense causal. 
Average latency in milliseconds (for fused attention kernel used in prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in token generation):

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still has room to improve. #### Limitations Only right-side padding and unidirectional attention are supported. The following are not supported in the first version: (1) Packed mode like PackedMultiHeadAttention, where padding has been removed from the input. (2) Paged attention. (3) Bidirectional attention. (4) GPU compute capability other than 8.0, 8.6 and 8.9. (5) Left-side padding. Some of these limitations will be removed in the future (maybe in a new operator).	2024-04-30 09:06:29 -07:00
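The SparseAttention commit above mentions a CUDA kernel that converts the block mask to CSR format. The conversion itself can be sketched on the CPU in plain Python. The layout generator below is an assumption for illustration only (causal, a band of `local_blocks` recent blocks, plus every `vert_stride`-th column); the commit message does not define the exact layout, and the function names are hypothetical.

```python
def make_block_mask(num_blocks: int, local_blocks: int, vert_stride: int) -> list[list[int]]:
    """Build an illustrative causal block mask (assumed layout, not the operator's spec)."""
    mask = []
    for i in range(num_blocks):
        row = []
        for j in range(num_blocks):
            causal = j <= i
            local = (i - j) < local_blocks          # recent blocks near the diagonal
            vertical = (j + 1) % vert_stride == 0   # strided "vertical" columns
            row.append(1 if causal and (local or vertical) else 0)
        mask.append(row)
    return mask


def block_mask_to_csr(mask: list[list[int]]) -> tuple[list[int], list[int]]:
    """Convert a dense 0/1 block mask to CSR (row pointers, column indices)."""
    row_ptr = [0]
    col_idx = []
    for row in mask:
        for j, active in enumerate(row):
            if active:
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i's column indices end.
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


mask = make_block_mask(num_blocks=6, local_blocks=2, vert_stride=3)
row_ptr, col_idx = block_mask_to_csr(mask)
print(row_ptr)
print(col_idx)
```

With CSR, a kernel working on query block `i` only needs to visit the key blocks listed in `col_idx[row_ptr[i]:row_ptr[i + 1]]`, which is where the savings over dense attention come from.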
Yi-Hong Lyu	b2481e3602	Bump up version in main from 1.18.0 to 1.19.0 (#20489 ) Bump up version in main from 1.18.0 to 1.19.0 since the release branch has been cut. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-04-29 20:21:41 -07:00
Yulong Wang	b1085b51ca	[js/web] update README (#20492 ) ### Description Update README.md in /js/web/ - update compatibility table - update links to onnxruntime.ai	2024-04-29 17:56:23 -07:00
Chi Lo	a1558fe117	[TensorRT EP] Make TRT EP use priority-based topo sort (#20512 ) This PR is needed for https://github.com/microsoft/onnxruntime/pull/20411 to make sure TRT EP use priority-based topo sort for consistency across TRT EP.	2024-04-29 16:00:43 -07:00
Rachel Guo	8c31f27dd1	Catalyst nuget package .NET changes only (#20424 ) ### Description https://github.com/microsoft/onnxruntime/pull/20418 Add back Catalyst changes only for now. Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-04-29 15:39:48 -07:00


11997 commits