onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-09 17:28:58 +00:00

Author	SHA1	Message	Date
Changming Sun	2d23b4e117	Update min macos version (#18251 )	2023-11-10 11:08:17 -08:00
Bart Verhagen	87744e55fa	fix reference to Microsoft.GSL::GSL in CMake build scripts when enabling cuda (#17843 ) ### Description Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the GSL package that is found on the system is used. However, when cuda is enabled, it is downloaded and patched. Most CMake scripts rely on the first case and forget about the second. This patch makes the second case behave like the first case. ### Motivation and Context This is an issue that occurs 'in the wild'. For example, I had to patch this to be able to enable the CUDA provider for the onnxruntime conan package (see https://github.com/conan-io/conan-center-index/pull/20392).	2023-11-10 10:46:45 -08:00
Xu Xing	dd1bb760eb	[js/webgpu] Fix scalar uniform (#18318 )	2023-11-10 10:12:22 -08:00
sophies927	d955885791	Update stale.yml to fix start-date bug (#18376 )	2023-11-09 16:04:31 -08:00
RandySheriffH	59262dfc63	Add cuda context headers to zip (#18330 ) Expose cuda context headers for cuda custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-09 14:53:58 -08:00
dependabot[bot]	1ff894898a	Bump actions/stale from 4.1.1 to 8.0.0 (#18149 )	2023-11-09 11:31:04 -08:00
Xu Xing	829d802337	[js/webgpu] Support uniform for softmax (#18345 )	2023-11-09 11:19:23 -08:00
Adrian Lizarraga	f237b0b1f8	[QNN EP/Quantization] Add MinimumRealRange extra option to quantization script (#18278 ) ### Description Adds the extra option `MinimumRealRange` to the quantization script: ```python3 """ MinimumRealRange= float\|None : Default is None. If set to a floating-point value, the calculation of the quantization parameters (i.e., scale and zero point) will enforce a minimum range between rmin and rmax. If (rmax - rmin) is less than the specified minimum range, rmax will be set to rmin + QuantMinRealRange. This is necessary for EPs like QNN that require a minimum floating-point range when determining quantization parameters. """ ``` ### Motivation and Context QNN requires a minimum floating-point range of 0.0001. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-11-09 10:55:09 -08:00
Guenther Schmuelling	25fbc2b0ab	fix fused relu activation (#18303 )	2023-11-09 08:18:21 -08:00
David Justice	2c22b49876	Fix rust compile issues and add GH action to run build validations and tests (#18346 ) ### Description This PR gets the onnxruntime Rust bindings to a foundation where they can be extended and validated as the onnxruntime progresses. Specifically, the PR does the following. - fixes some of the existing compilation issues due to missing some enums output tensor data types. - introduces a `just vendor` task that will vendor the source code from the onnxruntime to enable a common base directory within the crate directory rather than using a relative parent path. This enables `crate package` to be able to archive the onnxruntime native code, which will enable consumers of the onnxruntime-sys crate to be able to compile on their target. - introduces a GH action to lint the Rust code (rustfmt, clippy), build the library, validate through tests, and validate crate can package correctly. TODOs: - [x] This PR is based on #18200 and will need to be rebased once that PR is merged. ### Motivation and Context This is the first step to getting new onnxruntime Rust crates published through this project, which will unblock community Rust projects which would like to take a dependency on onnxruntime Rust. Follow up work to enable publication of onnxruntime Rust crates: - change name of the crates to be published (onnxruntime-rs and onnxruntime-sys are already taken and we'll need new names) - update authors / license to reflect contributions from previous maintainer(s) and new maintainers - introduce a crate publish GH action or ADO pipeline --------- Signed-off-by: David Justice <david@devigned.com>	2023-11-09 04:26:02 -08:00
Ted Themistokleous	8d50313816	[Migraphx EP] Static int8 QDQ support (#17931 ) ### Description Adding static int8 quantization support for MIGraphX Execution Provider - Allows for parsing in calibration tables generated by Onnxruntime or TensorRT's toolsets - Add proper environment variables into the MIGraphX EP - Update python API to include updating execution provider flags -> was missing on python side - Hook into MIGraphX's int8 quantization and optimization of models ### Motivation and Context Required so that we can get onnxruntime to pass in models while leveraging the existing tooling for int8 static QDQ quantization. First step in a series of PRs which will add further static quantization on the operator level as MIGraphX releases further support. These changes drew heavily from the TensorRT EP and should allow for similar functionality for GPU based (versus CPU) quantization of models before an inference is performed. --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com> Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>	2023-11-09 17:46:49 +08:00
Hector Li	55c19d6ab5	[QNN EP] Enable option to set QNN context priority (#18315 ) Enable option qnn_context_priority to set QNN context priority, options: "low", "normal", "normal_high", "high". ### Description Enable option qnn_context_priority to set QNN context priority, options: "low", "normal", "normal_high", "high". This feature guarantees the model inference with higher priority. Tested with onnxruntime_perf_test tool using same model. 1. Run the model on the NPU with single instance, the latency is 300ms. 2. Run the same model on NPU with 2 instance at same time. Case 1: both with same priority (high ) -- latency is 600ms Case 2: 1 with low priority -- latency is 30,000ms 1 with high priority -- latency is 300ms Case 3: 1 with normal priority -- latency is 15,000ms 1 with high priority -- latency is 300ms	2023-11-08 20:56:36 -08:00
Prathik Rao	7a3da4526f	add bfloat16 support for CUDA Neg kernel (#18306 ) ### Description Registers BFloat16 datatype as valid input type for CUDA Neg Kernel. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-08 18:32:12 -08:00
guyang3532	4dc63692f8	Add FlattenAndUnpad Op (#17845 ) ### Description Add an op named `FlattenAndUnpad`. This op performs two steps: 1. Flatten the first two dims of the input tensor. 2. Gather the valid values from the input tensor using an index tensor. ### Motivation and Context The grad op of `PadAndUnflatten` was `GatherGrad`, which performs poorly. This PR implements `FlattenAndUnpad` to replace `GatherGrad` as the grad of `PadAndUnflatten`. With this op, we can also simplify the "Reshape + ShrunkenGather" pattern to `PadAndUnflatten` in the padding elimination optimizer, which will also improve performance.	2023-11-09 09:52:48 +08:00
Scott McKay	885bf3561d	Add tool to fix lines > 120 chars. (#18293 ) ### Description Helper to run clang-format on lines that are > 120 chars. We disable clang-format enforcing 120 chars by default because its formatting can negatively impact readability. If a developer has not manually kept a line within the 120 char limit this tool will fix it. It will leave all other lines alone to honor the formatting the developer chose. ### Motivation and Context Help developers fix lint errors. The preferred approach is to use a vertical ruler/guideline in your editor when actually writing the code.	2023-11-09 10:12:57 +10:00
Justin Chu	c250540722	Bump linter versions (#18341 ) Bump linter versions and run format.	2023-11-08 13:04:40 -08:00
Changming Sun	812532592e	Add a build validation for Linux ARM64 cross-compile (#18200 ) ### Description 1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch issues listed in #18195 . 2. Revert eigen's commit id back to what we had before. ### Motivation and Context To catch cross-compile issues. Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639	2023-11-08 13:03:18 -08:00
sophies927	68fab24c22	Update stale.yml (#18304 ) Exempt all issues w/ assignees from stale bot, increase days before issue close, + add start date to address issue w/ GH API rate limiting	2023-11-08 11:56:35 -08:00
Dmitri Smirnov	a37e6a503b	Update Abseil raw_flat_hash visualization (#18329 ) ### Description Fix the broken pieces due to the latest Abseil update. ### Motivation and Context Make the debugging bearable.	2023-11-08 11:19:45 -08:00
Adrian Lizarraga	a0eeeafa80	[QNN EP] Session option for graph optimization (#18262 ) ### Description Adds the QNN session option `htp_graph_finalization_optimization_mode` to enable QNN graph optimizations at the expense of longer preparation time. ### Motivation and Context Allow enabling QNN graph optimizations per app/model.	2023-11-08 10:06:15 -08:00
kunal-vaishnavi	c8def0cc51	Add LLaMA GQA ragged batching (#18337 ) This PR updates replacing MHA with GQA and updates the LLaMA scripts for the modified GQA op. It is related to the changes in [this PR](https://github.com/microsoft/onnxruntime/pull/18283). ### Motivation and Context This PR allows us to run LLaMA with the GQA op end-to-end using ragged batching (i.e. batched inputs of different lengths).	2023-11-08 09:36:28 -08:00
Prathik Rao	34f77eaa24	bfloat16 support for quickgelugrad (#18336 ) ### Description Registers BFloat16 datatype as valid input type for CUDA QuickGeluGrad Kernel. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-08 08:40:02 -08:00
pengwa	2151c79bf1	Tune ORTModule logging experience a bit (#18298 ) ### Tune logging experience a bit After the last update to the ORTModule logging experience, we found a few issues: 1. The `INFO` level outputs too much, including PyTorch exporter verbose logs (tracing graphs) on every rank. At this level, we only want to - Output a little more information to users than the `WARNING` level, for example the memory recomputation recommendations or other not-fully-ready features. - Output a little more information for a quick diagnostic, collected on rank-0 only. 2. The ONNX Runtime logging filter during graph build and session init sometimes hides issues (for example a segmentation fault), leaving no useful information in `WARNING`/`INFO` for users to report to us. This is not good! 3. Some of our devs like using `pdb` to debug Python code, but adding `import pdb; pdb.set_trace()` to a model's code might hang when they use `INFO` or `WARNING`, where the export happens and all output gets redirected due to log filtering. The only workaround is to switch to VERBOSE, which outputs far too many logs. The corresponding changes proposed here are: 1. For `INFO` logging, - We only log on rank-0. - We restrict the ORT backend logging level to WARNING in this case, because ORT backend code outputs way too many logs that should be under verbose, and we cannot guarantee we can get them cleaned up immediately once they are added. - We output the PyTorch exporter verbose log (including the tracing graph), which is useful for a quick diagnostic when an issue happens. 2. Remove all logging filtering on the ORT backend, so that segmentation fault details will not be hidden if one happens again. 3. Introduce a `DEVINFO` logging level, which - Logs on all ranks - Sets the ORT backend logging level to INFO - Turns all PyTorch exporter logging filtering OFF (to unblock pdb debugging). 4. Currently, using the Memory Optimizer requires DEVINFO (which will output the ORT backend INFO log). 
So the memory optimizer document is updated to reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will change the requirement back to INFO for showing memory optimization information. You can check https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations for a better view of the different log levels. This PR also extracts some changes from a larger one, https://github.com/microsoft/onnxruntime/pull/17481, to reduce its complexity for review. --------- Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>	2023-11-08 17:42:50 +08:00
Tianlei Wu	8044e5f603	SDXL: Update demo with dynamic shape serving with CUDA EP (#18340 ) Update the SDXL demo with dynamic shape serving with CUDA EP.	2023-11-08 00:42:55 -08:00
aciddelgado	3dece27f51	GQA Flash Attention with Attention Mask (#18283 ) ### Description GQA now only works with Flash Attention with Attention Mask input, allowing for batched input. Note: This PR Disables Memory Efficient Attention, only allowing Flash Attention kernel to be used. ### Motivation and Context Allows GQA to work with batched input. --------- Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>	2023-11-07 17:47:51 -08:00
Yulong Wang	10df847baf	[js] fix linter out-of-memory issue (#18307 ) ### Description fix linter out-of-memory issue by ignoring file pattern 'test/data/'.	2023-11-07 17:12:22 -08:00
Yulong Wang	d117a8010f	fix typo (node)->(browser) in linux-wasm-ci.yml (#18309 ) ### Description fix display name `'Build and test (node) (simd + threads)'` to `'Build and test (browser) (simd + threads)'`	2023-11-07 17:07:40 -08:00
Dmitri Smirnov	096307c64b	Do not run AOT function inlining when the model does not define any local functions (#18302 ) ### Description Check if the model defines any local functions. If not, skip AOT inlining, including any schema based functions. The latter would be inlined during partitioning. ### Motivation and Context This prevents calls to GetCapability() on EPs and enhances compatibility. --------- Co-authored-by: Pranav Sharma <prs@microsoft.com>	2023-11-07 13:46:42 -08:00
Jiajia Qin	606356d0b1	[js/webgpu] Simplify the Resize shader when noScale is true (#18321 ) ### Description For Resize, when `noScale` is true, the shader can become very simple and no longer depends on `attributes.mode`, so we should remove those parts of the shader code for simplification. This PR can also fix #18311 since `noScale` is true for every Resize in that model. However, #18311 also exposes a bug in the Resize implementation for `linear` mode. It seems that the current implementation always treats the input as either a 2d or 4d tensor, but the actual input is a 3d tensor, which is why the shader compilation fails. We may need to fix it in a separate PR.	2023-11-07 12:54:20 -08:00
liqun Fu	6127dd1d2d	implement gridsample 20 (#17744 )	2023-11-07 10:42:41 -08:00
Prathik Rao	83c0275354	add bfloat16 support for ConcatTraining and SplitTraining ops (#18280 ) ### Description Updates input/output type constraints on training operators ConcatTraining and SplitTraining to include bfloat16 which was introduced in IR version 4. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-07 10:10:01 -08:00
satyajandhyala	a16d528399	[JS/Web] Added Uniforms support to binary ops. (#18260 ) ### Description Added Uniform support to binary ops ### Motivation and Context To improve performance	2023-11-07 08:41:52 -08:00
Patrice Vignola	800ae7742c	[DML EP] Add RotaryEmbedding (#18158 ) This is a graph implementation of RotaryEmbedding since there's no time to add it to DML before 1.16.2, but it eventually should move into DirectML since we're bandwidth-bound.	2023-11-07 08:26:11 -08:00
Yi Zhang	9868a71373	[Fix] Stages to Run couldn't be selected (#18310 ) ### Description Add the pool definition in 2 stages even if the pool is a Microsoft-hosted pool. ### Motivation and Context Recently, in the NuGet pipeline, clicking Stages to Run ![image](https://github.com/microsoft/onnxruntime/assets/16190118/45af295e-fa75-402a-a7de-803c6a2ab7cd) always pops up ``` Encountered error(s) while parsing pipeline YAML: Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz. Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz. ```	2023-11-07 17:52:47 +08:00
pengwa	4f15b42728	Customize _get_tensor_rank for model export in stage3 (#18294 ) ### Customize _get_tensor_rank for model export in stage3 Weight/Params sizes are all (0), so exporter logic depending on input shape will fail. This PR overrides the `_get_tensor_rank` function by retrieving the shape for weights differently.	2023-11-07 16:37:11 +08:00
zhijiang	630c877b43	Zhijxu/improve ortmodule python perf a little bit (#13716 ) Improve 2 Python functions a little bit. According to a profiling result from a real user case, 2 Python functions can be improved. The first image is the result before the improvement and the second is after it; the improvement saves 8ms. ![image](https://user-images.githubusercontent.com/43435212/202961725-b88d679e-993b-4910-a339-253f3ed5dcde.png) ![image](https://user-images.githubusercontent.com/43435212/202961732-6c6deebf-962f-4392-90d7-03705433e3ee.png)	2023-11-07 15:24:57 +08:00
Tianlei Wu	00c2bf39bd	SkipGroupNorm fusion and SDXL Pipeline Update (#18273 ) Update a few optimizations for Stable Diffusion XL: (1) Add SkipGroupNorm fusion (2) Remove GroupNorm fusion limits. Previously, we only fused GroupNorm when channels was one of `320, 640, 960, 1280, 1920, 2560, 128, 256, 512`, so some GroupNorm in the refiner was not fused. (3) Tune SkipLayerNormalization to use vectorized kernel for hidden size 320, 640 and 1280. Pipeline Improvements: (4) Enable cuda graph for unetxl. (5) Change optimization to generate optimized fp32 model with ORT, then convert to fp16. Otherwise, fp16 model might be invalid. (6) Add option to enable-vae-slicing. Bug fixes: (a) Fix vae decode in SD demo. (b) Fix UniPC add_noise missing a parameter. (c) EulerA exception in SDXL demo. Disable it for now. (d) Batch size > 4 has error in VAE without slicing. Force to enable vae slicing when batch size > 4. #### Performance Test on A100-SXM4-80GB Description about the experiment in results: Baseline: removed GroupNorm fusion limits; CUDA graph is enabled in Clip and VAE, but not in Clip2 and UNet. UNetCG: Enable Cuda Graph on UNet SLN: Tune SkipLayerNormalization SGN: Add SkipGroupNorm fusion The latency (ms) of generating an image of size 1024x1024 with 30 steps base model and 9 steps of refiner model: \| Baseline \| UNetCG\| UNetCG+SLN \| UNetCG+SLN+SGN -- \| -- \| -- \| -- \| -- Base Clip \| 3.74 \| 3.70 \| 3.88 \| 3.81 Base Unet x30 \| 2567.73 \| 2510.69 \| 2505.09 \| 2499.99 Refiner Clip \| 7.59 \| 7.42 \| 7.41 \| 7.58 Refiner Unet x 9 \| 814.43 \| 803.03 \| 802.20 \| 799.06 Refiner VAE Decoder \| 84.62 \| 85.18 \| 85.24 \| 87.43 E2E \| 3480.56 \| 3412.05 \| 3405.77 \| 3400.23 We can see that enabling cuda graph brought the major gain (around 68ms). SLN Tuning has about 7ms gain. SkipGroupNorm fusion has 5ms gain. SkipGroupNorm fusion won't reduce latency much, but it also has the benefit of reducing memory usage, so it is recommended to enable it. 
### Motivation and Context Additional optimizations upon previous work in https://github.com/microsoft/onnxruntime/pull/17536.	2023-11-06 22:02:33 -08:00
Patrice Vignola	276918d93b	Allow SkipLayerNorm and LayerNorm in rotary attention fusion (#18288 ) Although SimplifiedLayerNorm is faster than LayerNorm, DML doesn't have an optimized implementation for the former yet and LayerNorm ends up being faster.	2023-11-06 22:01:17 -08:00
Wei-Sheng Chin	fb6737e893	Distributed Squeeze and Distributed Unsqueeze (#18269 ) Implement DistributedSqueeze & DistributedUnsqueeze for llama 2.	2023-11-06 20:11:35 -08:00
Hector Li	ad34c67a44	[QNN EP] Enable Expand op (#18234 ) ### Description Enable Expand Op. There is no direct mapping from the ONNX Expand op to QNN, so ElementWiseMultiply is used to do the data broadcast: create the 2nd input with value 1.0 and use the shape data from the Expand op.	2023-11-06 16:28:11 -08:00
Xavier Dupré	3b63d85c25	Fix unit test when TVM EP is enabled (#18189 ) ### Description TestInlinedLocalFunctionNotRemoved checks that local functions are not removed but TVM EP optimizes the whole graph after it is inlined.	2023-11-06 19:32:26 +01:00
Changming Sun	398ef677ba	Update protobuf python package's version (#18203 ) 1. Now we use a released version of ONNX, so we can directly download a prebuilt package from pypi.org. We do not need to build one from source. 2. Update protobuf python package's version to match the C/C++ version we are using. 3. Update tensorboard python python because the current one is incompatible with the newer protobuf version.	2023-11-06 09:22:54 -08:00
Yi Zhang	b7b8b5b2ce	Fix Eigen-3.4.0 URL and hash (#18290 ) ### Description Add CI changes for #18287. Install onnx explicitly to pass windows GPU+dml stage. ### Motivation and Context 'eigen-3.4' was referring to a branch, not to a tag. There is now an Eigen 3.4.1 on that branch, and thus the hash has changed. See https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416	2023-11-06 09:19:51 -08:00
BoarQing	d652b1fe48	[VitisAI] fix tensor has multi data type (#18188 ) ### Description When taking a tensor's data as raw, clear data of other types within the tensor. ### Motivation and Context One model's graph transformation caused a node with multiple data types. This change would make the model valid.	2023-11-06 07:16:17 -08:00
Chi Lo	dfafcb58aa	[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274 ) https://github.com/microsoft/onnxruntime/pull/17468 The above PR didn't fully fix the issue for some environments. This PR fixes this.	2023-11-03 20:04:10 -07:00
kunal-vaishnavi	08eaa1c55d	Remove internal enforce for IO binding inputs (#18266 ) ### Description This PR removes an internal `ORT_ENFORCE` when binding `torch.tensor` inputs using IO binding for end-to-end scripts. ### Motivation and Context In merged exports of PyTorch models to ONNX, each past key and past value in the past KV cache has an input shape of `(batch_size, num_heads, past_sequence_length, head_size)`. In the first pass through the model to process the prompt, `past_sequence_length = 0`. Therefore, each of these inputs is of shape `(batch_size, num_heads, 0, head_size)`. In subsequent passes, `past_sequence_length > 0`. When binding a `torch.tensor` of shape `(batch_size, num_heads, 0, head_size)` with `io_binding.bind_input`, the tensor's `data_ptr()` must be passed. For a `torch.tensor` of this shape, its `data_ptr()` returns 0. Because it returns 0, the existing `ORT_ENFORCE` is therefore false and an error is raised. By removing the internal `ORT_ENFORCE`, no error is raised and the model runs successfully. LLaMA-2 Example: Input Name \| Input Size \| Device \| Device ID \| Torch Dtype \| data_ptr() ------------- \| ----------- \| ------- \| ----------- \| ------------- \| ----------- input_ids \| torch.Size([1, 11]) \| cuda \| 7 \| torch.int64 \| 140639561842688 attention_mask \| torch.Size([1, 11]) \| cuda \| 7 \| torch.int64 \| 140639561843200 position_ids \| torch.Size([1, 11]) \| cuda \| 7 \| torch.int64 \| 140639561844224 past_key_values.0.key \| torch.Size([1, 32, 0, 128]) \| cuda \| 7 \| torch.float32 \| 0 past_key_values.0.value \| torch.Size([1, 32, 0, 128]) \| cuda \| 7 \| torch.float32 \| 0 ... \| ... \| ... \| ... \| ... \| ...	2023-11-03 16:12:32 -07:00
Chi Lo	84bdf04b25	[TensorRT EP] Fix bug for shape tensor input (#18253 ) When the model has a "shape tensor" as one of its inputs and the user provides explicit profile shapes for it, TRT EP doesn't correctly set the "shape tensor" input. Also, there is a bug in applying explicit profile shapes to the shape tensor input. Note: A model with a shape tensor input seems to be a rare case. In most cases, the inputs are all execution tensors.	2023-11-03 16:07:50 -07:00
Chen Fu	26b396418d	Block-wise 4b quantization matmul operator change (#18172 ) ### Description Replace block-wise 4b quantization implementation ### Motivation and Context In https://github.com/microsoft/onnxruntime/pull/18101 we have an augmented block-wise 4b quantization interface and implementation. Here we use this new implementation in onnxruntime contrib ops --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-03 15:29:43 -07:00
Edward Chen	2ec1f94bfd	Make MlasTestFixture::mlas_tester an inline variable. (#18263 ) Make MlasTestFixture::mlas_tester an inline variable. With this change we no longer need to define `MlasTestFixture<T>::mlas_tester` outside of the class definition.	2023-11-03 10:50:21 -07:00
Changming Sun	4c4d79a612	Change a bitwise logical xor to logical wise (#18246 ) ### Description Change a bitwise logical xor to logical-wise ### Motivation and Context For Boolean values we should not use bitwise operations.	2023-11-03 10:42:51 -07:00
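
The `MinimumRealRange` rule quoted in commit f237b0b1f8 (if `rmax - rmin` is smaller than the minimum range, set `rmax` to `rmin` plus that minimum) can be illustrated with a minimal sketch. This is not the code from the quantization script: the function name `quant_params`, the uint8 defaults, the asymmetric scale/zero-point formula, and the step that forces the range to include 0.0 are all assumptions made for illustration.

```python
def quant_params(rmin, rmax, qmin=0, qmax=255, min_real_range=None):
    """Return (scale, zero_point) for asymmetric quantization of [rmin, rmax].

    Hypothetical helper for illustration; not an onnxruntime API.
    """
    # Assumption: the real range is widened to contain 0.0 so that zero
    # maps exactly onto an integer value.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)

    # Rule from the commit message: enforce a minimum real range.
    if min_real_range is not None and (rmax - rmin) < min_real_range:
        rmax = rmin + min_real_range

    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        # Degenerate all-zero range with no minimum enforced
        # (fallback chosen for this sketch only).
        return 1.0, qmin

    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))  # clamp to the integer range
    return scale, zero_point


# An all-zero tensor gets a non-degenerate scale once the minimum is enforced.
scale, zero_point = quant_params(0.0, 0.0, min_real_range=0.0001)
```

With the QNN minimum of 0.0001 mentioned in the commit, the all-zero case above yields a scale of 0.0001 / 255 rather than falling back to a placeholder scale, which is the situation the option is described as addressing.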

9947 commits