onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-10 17:37:14 +00:00

Author	SHA1	Message	Date
kunal-vaishnavi	a0ebd5fee5	Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking (#20149 ) ### Description This PR adds flash attention v2 and support for INT4 CUDA benchmarking in PyTorch. ### Motivation and Context The [flash attention v2](https://github.com/Dao-AILab/flash-attention) algorithm helps improve model performance in PyTorch. Support for INT4 CUDA in PyTorch is done through the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.	2024-03-29 23:09:37 -07:00
mo-ja	00244ea143	fix quantization errors of ConvTranspose with per_channel=True (#19996 ) ### Description <!-- Describe your changes. --> - update axis value for per_channel quantization of QDQConv - we should use `axis=1` for ConvTranspose operator. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> - this PR fixes https://github.com/microsoft/onnxruntime/issues/19694, which I have opened	2024-03-29 21:36:15 -07:00
Ye Wang	f3a864217f	Fix MoE tensor parallelism tests (#20147 ) ### Description <!-- Describe your changes. --> Previously the expert weights are in row-major. But with the updated cutlass extension introduced by https://github.com/microsoft/onnxruntime/pull/20108, weights are stored in col-major that aligns with Pytorch implementation. This change fixes the way the tensors are sliced across shards. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-29 16:10:09 -07:00
Jeff Bloomfield	2f31560430	Enable generic feature level devices in DML EP (#20114 ) ### Description Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX headers through the DirectX-Headers repo. Note that this includes an update to cgamanifest.json for onnx-tensorrt which is triggered during re-generation due to a prior changes to deps.txt. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-29 14:37:30 -07:00
cao lei	604b284261	add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp (#20145 ) ### Description <!-- Describe your changes. --> Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp	2024-03-29 13:49:56 -07:00
inisis	8396845806	fix shape inference bug (#19848 ) ### Description for nodes like add, their input should be merged dynamically ### Motivation and Context when doing shape inference, for nodes like Add, currently when doing _onnx_infer_single_node, their inputs are generated from last node's output, but they should be merged.	2024-03-29 13:06:27 -07:00
Adrian Lizarraga	b1a5eb255e	[Quant] Fix accuracy_level config option for MatMul 4bits quantizer (#20146 ) ### Description Fixes code that extracts the accuracy level when creating a MatMulNBits node in the `DefaultWeightOnlyQuantizer` class. ### Motivation and Context Error from line 443: `AttributeError: 'DefaultWeightOnlyQuantizer' object has no attribute 'accuracy_level'`. The solution is to access `self.config.accuracy_level` instead of `self.accuracy_level`. Relevant commit: https://github.com/microsoft/onnxruntime/pull/19106	2024-03-29 11:54:55 -07:00
Ye Wang	17919717b5	add QMoE (#20108 ) ### Description <!-- Describe your changes. --> 1. Introduce latest cutlass extension from TRTLLM that gives us cutlass upgrade(to 3.4) opportunity from MoE side. 2. Fix Windows build issue 3. Add Int4 MoE op and ut ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-29 10:24:19 -07:00
pengwa	2092bebc78	Fix transformer layer detection for recompute (#20106 ) ### Fix transformer layer detection for recompute The original logic failed to detect the layer boundary node in the Mistral model. This PR simplifies the search by using stronger pattern matching, so that it is flexible enough to cover different transformer variants. It also adds a UT, and adds a warning when the user enables layerwise recompute but no layer boundary nodes are found.	2024-03-29 17:44:38 +08:00
cao lei	2a184ac1a1	use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo (#20037 ) ### Description <!-- Describe your changes. --> use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to hook the inplace map of custom ops ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This PR is to use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to hook the inplace map of custom ops	2024-03-28 20:45:37 -07:00
Adam Pocock	2f82400b13	[java] Java 21 build support (#19876 ) ### Description Bump spotless and the Gradle wrapper to 6.25.0 and 8.6 respectively to allow compiling ORT on Java 21. The build still targets Java 8. I'm not sure if there will be CI changes necessary to use this PR, specifically for the Gradle version as I don't know if that is cached somewhere earlier in the CI build process. The new Gradle version adds a warning that using `--source` and `--target` to select the Java language version is obsolete which is annoying, we can fix it if we decide to only allow building on newer versions of Java, while still supporting running on Java 8. ### Motivation and Context Java 21 is the latest LTS release of Java and ORT should be able to build on it.	2024-03-28 15:51:22 -07:00
Yi Zhang	f7b52d2e3e	[Fix] Only copy java files when build_java is True (#20121 ) ### Description ### Motivation and Context Fix error in Nuget-CUDA-Packaging-Pipeline	2024-03-28 14:06:28 -07:00
Pranav Sharma	3ed0c81b30	Expose Reserve() in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified. (#19904 ) ### Description Expose Reserve() in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified. Update: this change has been verified by Bing Ads and brings a significant benefit in terms of memory utilization: 30GB less memory and also better CPU utilization. ### Motivation and Context https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah	2024-03-28 12:28:37 -07:00
Yi Zhang	2a38168f0b	increase cl mpcount since Compilation is moved on CPU machine (#20116 ) ### Description The CPU machine has 16 cores, so we can increase the parallel count. Compared with 2 runs. 1. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=432328&view=results 2. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=432331&view=results The compilation took about 25 minutes if the parallel count is 15, while it took 41 minutes if the parallel count is 3 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: Yi Zhang <your@email.com>	2024-03-28 13:30:33 +08:00
Yi Zhang	c5d7310f1b	Remove TSA upload in testing stage (#20115 ) --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-28 13:15:03 +08:00
Yi Zhang	8f069f81c4	Split more windows GPU workflow into 2 stages, building and testing, to make them more stable (#20080 ) ### Description Refactor win-ci.yml to solve the random hang issue in more GPU workflows, and move the building of nuget-zip packages and python cuda12 packages to a CPU machine. --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-28 12:55:44 +08:00
wejoncy	16af7adc70	[llm exporter]auto infer output shape (#20071 )	2024-03-28 09:52:10 +08:00
pengwa	55f63a48ca	Keep original name during fusion (#20097 ) ### Keep original name during fusion This could be helpful to know where the fused node coming from, I feel this is very useful when debugging the execution order issues between different transformer layers. For example: - A node named `/_original_module/model/layers.1/self_attn/MatMul/MatmulTransposeFusion//MatMulScaleFusion/` goes through two fusion paths in the 1st transformer layer - e.g. `MatmulTransposeFusion` and `MatMulScaleFusion`. - `/_original_module/model/layers.2/post_attention_layernorm/Mul_1/SimplifiedLayerNormFusion/` node is a fused node by `SimplifiedLayerNormFusion`. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-28 08:40:34 +08:00
Ye Wang	a9d9b083e4	Fix py package pipeline (#20065 ) ### Description <!-- Describe your changes. --> ### Motivation and Context Fixes #20068	2024-03-27 15:59:35 -07:00
Dmitri Smirnov	b95fd4e644	Enable CUDA EP unit testing on Windows (#20039 ) ### Description Address build issues and source code discrepancies. Fix cuda_test_provider gtest argument stack corruption. ### Motivation and Context The `OpTester` class that is widely used for kernel testing is not suitable for testing internal classes of EPs that are built as shared objects. Currently, CUDA EP tests run only on Linux. We want to enable testing and development on Windows, and create a usable pattern for testing the internals of other EPs. Alternatives considered: Abstracting EP unit tests into a separate test executable such as `onnxruntime_test_all`. This alternative was rejected as it would require many more changes to the established patterns, and could interfere with CUDA functionality through more complex source code maintenance.	2024-03-27 13:32:36 -07:00
Yi Zhang	ab2eaedfaa	Install ONNX by building source code in Windows DML stage (#20079 ) ### Description In #20073, I pinned the onnx version to unblock the whole PR CI. In fact, we can use the onnx that is installed by building from source, whose version is controlled by deps.txt. For historical reasons, the DML stage installed onnx from PyPI. Now onnx can be installed the same way as in the other stages. Also add an option to skip installing onnx in win-ci-prebuild-step.	2024-03-27 12:29:34 -07:00
Yi Zhang	4df9d16f98	[Fix] TSAUpload task must be in building stage (#20098 ) ### Description In #20085, TSAUpload was in testing stage so main branch failed.	2024-03-27 12:20:57 -07:00
Xiaoyu	c8676ffbff	Add ModelProto support for quantize api (#20018 ) ### Description Add ModelProto support for `quantize` api ### Motivation and Context Currently, the `quantize` API only accepts a model path as the input model. However, for large models, saving and loading from disk can be time-consuming. By adding `ModelProto` as an input option to the `quantize` API, significant time can be saved.	2024-03-27 10:40:08 -07:00
Yulong Wang	47903e701a	fix condition in web CI YAML (#20095 ) ### Description fix condition in web CI YAML	2024-03-27 10:35:43 -07:00
Nanashi	ca465dc087	[js] Make error friendly when isOrtFormat is undefined (#19958 ) ### Description Make the error friendlier when isOrtFormat is undefined (`onnxruntime.InferenceSession.create` is called with ArrayBuffer or Uint8Array). ### Motivation and Context I was trying to run my onnx model in the WebGL EP, but it gave me the error "Cannot read properties of null (reading 'irVersion')". Using a debugger, I found that the actual error was `int64 is not supported`, but that error was invisible to me. So I made it show both errors when isOrtFormat is undefined. [d62d942](`d62d9425ba`)	2024-03-27 02:07:00 -07:00
guyang3532	4aa84003ca	support Pow/Div/Sqrt in PaddingElimination (#20083 )	2024-03-27 16:10:07 +08:00
Yulong Wang	28907d8c59	[js/web] workaround NPM test fetch failure (#20020 ) ### Description Sometimes `npm test` fails with the error "TypeError: Failed to fetch". I checked the callback entry of the localhost server started by karma. When "Failed to fetch" happens, no request is reflected on the server side. The root cause is still not identified. However, as this issue only happens occasionally, when the browser has just been launched by the karma runner, retrying works around it most of the time.	2024-03-26 21:35:49 -07:00
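The retry workaround described in the entry above (#20020) can be sketched as follows. This is a minimal Python illustration of the retry idea only, not the JavaScript change made in that commit; `retry` and `FlakyFetch` are hypothetical names introduced here, and the attempt count and delay are arbitrary assumptions.

```python
import time


def retry(action, attempts=3, delay_s=0.0, retry_on=(Exception,)):
    """Call `action` until it succeeds or `attempts` tries have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last failure
            time.sleep(delay_s)


class FlakyFetch:
    """Hypothetical stand-in for a request that fails on its first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("Failed to fetch")
        return "ok"
```

Because the final failure is re-raised, a persistent problem still fails the run instead of being silently swallowed; only transient failures are absorbed.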
Chi Lo	3dcda13e62	[TensorRT EP] Fix concurrency issue for TRT custom op list (#20093 ) The `CreateTensorRTCustomOpDomainList()` is not thread-safe due to its static variables, `created_custom_op_list` and `custom_op_domain`. This PR makes sure synchronization using mutex. see issue: https://github.com/microsoft/onnxruntime/issues/20089	2024-03-26 21:20:14 -07:00
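The race described in the entry above (#20093) comes from lazily initialized static state being reached by several threads at once. The following is a minimal sketch of the mutex pattern that commit describes, written in Python rather than the C++ of the actual fix. `get_custom_op_domains`, `_build_domains`, and `build_count` are hypothetical names, not ONNX Runtime APIs, and the list contents are placeholders.

```python
import threading

# Hypothetical module-level cache standing in for the static variables
# (`created_custom_op_list`, `custom_op_domain`) named in the commit message.
_lock = threading.Lock()
_domains = None
_build_calls = 0


def _build_domains():
    """Placeholder for the expensive one-time construction."""
    global _build_calls
    _build_calls += 1
    return ["example.domain"]  # illustrative content only


def build_count():
    """How many times the construction actually ran."""
    return _build_calls


def get_custom_op_domains():
    """Return the cached list, building it at most once across all threads."""
    global _domains
    with _lock:  # the check and the write happen under the same mutex
        if _domains is None:
            _domains = _build_domains()
        return _domains


def call_concurrently(n_threads=8):
    """Invoke the getter from several threads and collect what each one saw."""
    results = []
    results_lock = threading.Lock()

    def worker():
        value = get_custom_op_domains()
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Holding the lock around both the "is it created yet?" check and the creation itself is what prevents two threads from each concluding that they must build the list.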
Yi Zhang	0561b9576e	Fix and Refactor Python Packaging Pipeline (#20085 ) ### Description Make Windows GPU Packaging stage in Python Packaging pipeline run on CPU machine as well ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> ### Test Link https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430961&view=results	2024-03-27 12:17:22 +08:00
zhijiang	b14d3f1d52	Zhijxu/fix softmax fp16 (#20059 ) With fp16 input, softmax returns NaN in some cases. The reason is that, for the float16 dtype, std::numeric_limits<float16>::infinity() returns 0 instead of inf.	2024-03-27 11:37:10 +08:00
Xiaoyu	512c803550	fix import in transformer optimizer python script (#20091 ) ### Description Fix import. ### Motivation and Context Fix error.	2024-03-26 20:16:09 -07:00
Yulong Wang	473434c73f	[js/webgpu] perform uniform consistency check (#20019 ) ### Description This PR changes the WebGPU backend to validate program uniforms. It compares the uniform data returned by the program info's `getRunData()` callback with the list of uniform variables maintained by `ShaderHelper`. It also fixes a few bugs found by this check.	2024-03-26 17:14:43 -07:00
cao lei	793a8882ed	Regarding copy inputs before inference, flush the stream which copies the input only if the input is consumed by the ops from different streams (#19970 ) ### Description <!-- Describe your changes. --> Regarding copy inputs before inference, flush the stream which copies the input only if the input is consumed by the ops from different streams ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is the improvement for the fix https://github.com/microsoft/onnxruntime/pull/17303	2024-03-26 13:57:25 -07:00
Yulong Wang	050085a7fb	[js/web] remove "browser" field in package.json (#20021 ) ### Description Field "browser" is deprecated in favor of "exports". Removes the unused field. Some bundlers may read from "browser" and generate errors. Removing this field should let bundlers look up "exports" instead. Fixes #19915	2024-03-26 13:57:11 -07:00
Yulong Wang	0313dd1f65	Update Web CI to use data dir under Agent.TempDirectory (#20074 ) ### Description Update Web CI to use data dir under Agent.TempDirectory This change fixes the random failure caused by unstable access to karma temp directory (which is under AppData\Local\Temp) on CI pipeline	2024-03-26 13:16:59 -07:00
Baiju Meswani	40efbd6c37	Fix training and macos ci pipelines (#20034 )	2024-03-26 12:20:11 -07:00
zesongw	ea3082edc6	[WebNN EP] Support Split before opset13 (#19988 ) ### Description Support Split before opset13, where the `split` is an attribute. ### Motivation and Context Support more models which use the earlier opset.	2024-03-26 11:59:41 -07:00
pengwa	dfa891a2d8	Fix memory stats printing (#20061 ) ### Fix memory stats printing Memory stats printing fails when a module is in eval mode at the time it is wrapped by ORTModule. At that point, the runtime inspector for the training manager should have its training flag set to true, but it gets false (because the existing logic takes the boolean from module.training). A runtime inspector, as part of the training manager or inference manager, should know explicitly whether it is serving training, so we cannot depend on the state of module.training during ORTModule initialization.	2024-03-26 21:25:59 +08:00
Yi Zhang	0906c57c9e	Pin Onnx Version (#20073 ) ### Description 1. change in build.py is to fix DML exception (https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=10&_a=summary) 2. change in requirements.txt is to fix exception in python packaging pipeline. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430433&view=results ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-26 17:59:46 +08:00
pengwa	1a0ba3f69f	Fix softmax export (#20057 ) ### Description Why do we need to define softmax export logic here? For the usage `nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)` in the model, `76a33a1092/src/transformers/models/mistral/modeling_mistral.py (L302)`: if dtype is specified, the input tensor is cast to dtype before the operation is performed. This is useful for preventing data type overflows. The existing ONNX exporter, however, does the cast after the operation, which is not correct (`cf06189a2d/torch/onnx/symbolic_opset13.py (L27)`). This override can serve as a workaround until PyTorch fixes the issue in a coming release. (TODO: pengwa - add PyTorch versions when the issue is fixed). @thiagocrepaldi We may need a fix in PyTorch repo as well.	2024-03-26 13:09:20 +08:00
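To see why the position of the cast matters in the entry above (#20057), here is a small standard-library sketch. It is a deliberately naive model and not the exporter code: Python floats (double precision) stand in for float32, `to_fp16` emulates float16 rounding with `struct`, and the "cast after" variant assumes every intermediate, including the accumulator, is kept in float16. All function names are hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the largest finite float16 (65504).
        return math.copysign(math.inf, x)


def softmax_cast_before(logits_fp16):
    """Upcast first, then do all arithmetic in the higher precision."""
    m = max(logits_fp16)
    exps = [math.exp(x - m) for x in logits_fp16]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cast_after(logits_fp16):
    """Round every intermediate to float16; the caller upcasts the result."""
    m = max(logits_fp16)
    exps = [to_fp16(math.exp(to_fp16(x - m))) for x in logits_fp16]
    total = 0.0
    for e in exps:
        total = to_fp16(total + e)  # float16 accumulator
    return [to_fp16(e / total) for e in exps]


logits = [0.0] * 4096  # uniform scores over 4096 positions
print(sum(softmax_cast_before(logits)))  # 1.0
print(sum(softmax_cast_after(logits)))   # 2.0
```

In the float16 variant the accumulator stops growing at 2048, because 2049 is not representable in half precision and rounds back to 2048. Each probability therefore comes out as 1/2048 instead of 1/4096, and the "probabilities" sum to 2.0. Real kernels often accumulate in higher precision, so the effect in practice depends on the implementation.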
Adrian Lizarraga	7d976cf720	[QNN QDQ Quant] Utils to generate mixed-precision quant overrides (#20028 ) ### Description - Adds a utility to the QNN quantization scripts that "fixes" an initial set of tensor quantization overrides for mixed-precision QDQ models. Follow-up to https://github.com/microsoft/onnxruntime/pull/19925 - Moves existing overrides for QNN compatibility (matmul, layernorm, sigmoid, tanh) to separate functions. PR adds missing unit tests for these. - Adds `weight_symmetric=None` parameter to the `get_qnn_qdq_config()` function to enable user specification (instead of always using default behavior). - If weight_symmetric is set to `None`, it will be set to `weight_symmetric = weight_type in (QUInt8, QUInt16)`. - Otherwise, the user's value is used. #### Example Float model: ``` input_0 --> Op1 --> Op3 --> Op5 --> Op6 --> output_0 ^ \| input_1 --> Op2 -+-> Op4 ----+ \| +-> Op7 --> output_1 \| +-> Op8 --> output_2 ``` If we'd like to quantize this model to uint8 precision, but would like to make sure tensor "Op4_out" is quantized to 16-bit, then we would specify the following initial tensor quantization overrides: ```python # Op4_out could be an inaccurate tensor that should be upgraded to 16bit initial_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]} ``` These initial overrides may not create a valid model because Op4 and Op5 may require both the input and output to be the same type (e.g., uint16). 
This helper fixes the overrides so that input/output data types are valid: ```python qnn_config = get_qnn_qdq_config( float_model_path, data_reader, activation_type=QuantType.QUInt8, weight_type=QuantType.QUInt8, init_overrides=initial_overrides, # These initial overrides will be "fixed" ) ``` The above snippet generates the following "fixed" overrides (get via `qnn_config.extra_options["TensorQuantOverrides"]`): ```python { "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}], "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}], "Op4_out": [{"quant_type": QUInt16}], "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}] } ``` How to interpret the fixed overrides: - Op2's output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, but Op7 and Op8 consume the original u8 type. - Op3's output is converted from u8 to u16. Op5 consumes the converted u16 type. - Op4's output is just u16 (not converted). All consumers of Op4_out get the u16 type. - Op5's output is converted from u16 to u8. Op6 consumes the u8 type. ### Motivation and Context Generating mixed-precision quantization overrides is currently a manual process. This PR adds a utility that helps generate valid overrides.	2024-03-25 14:41:14 -07:00
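The interpretation rules listed in the entry above (#20028) can be expressed as a small lookup. This is a sketch for reading the overrides dictionary, not part of the ONNX Runtime API. `type_seen_by` is a hypothetical helper, quantization types are written as plain strings so the snippet runs without onnxruntime installed, and it assumes every `convert` entry lists its `recv_nodes` explicitly, as in the example.

```python
# The "fixed" overrides from the commit message, with types as plain strings.
fixed_overrides = {
    "Op2_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": "QUInt16"}],
    "Op5_out": [{"quant_type": "QUInt16", "convert": {"quant_type": "QUInt8", "recv_nodes": {"Op6"}}}],
}


def type_seen_by(overrides, tensor_name, consumer, default_type="QUInt8"):
    """Hypothetical helper: which quant type does `consumer` read for `tensor_name`?"""
    entries = overrides.get(tensor_name)
    if not entries:
        return default_type  # no override: the model-wide activation type applies
    entry = entries[0]  # per-tensor overrides hold a single dictionary
    convert = entry.get("convert")
    # Assumption: `recv_nodes` is always present when `convert` is.
    if convert is not None and consumer in convert["recv_nodes"]:
        return convert["quant_type"]  # consumer sits behind the DQ -> Q conversion
    return entry["quant_type"]  # consumer reads the producer's original type


for tensor, node in [("Op2_out", "Op4"), ("Op2_out", "Op7"), ("Op5_out", "Op6")]:
    print(tensor, "->", node, ":", type_seen_by(fixed_overrides, tensor, node))
```

For example, `type_seen_by(fixed_overrides, "Op2_out", "Op4")` yields `"QUInt16"`, while the same tensor read by `"Op7"` yields `"QUInt8"`, matching the first bullet of the interpretation list.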
Vincent Wang	d30c81d270	Add Symbolic Shape Hint to Triton Codegen Config (#20056 ) Add symbolic shape hint to Triton codegen config so that we can avoid unnecessary recompilation when input shapes keep changing. The screenshot below shows that, with proper configuration, we can speed up training considerably by reducing unnecessary recompilation. ![image](https://github.com/microsoft/onnxruntime/assets/11661208/699944d2-81cd-4c22-84e7-73a4fa0d2a28)	2024-03-25 15:05:02 +08:00
aciddelgado	4a196d1594	Packed QKV and Rotary Embedding Support for sm<80 GQA (#20012 ) ### Description Add support for packed qkv input and rotary embedding with sm<80 using memory efficient attention kernel. ### Motivation and Context Allows lower-end gpus to run gqa with packed qkv input and rotary embedding.	2024-03-23 14:30:35 -07:00
Hector Li	f977be0663	Fix issue that failed to load Conv node with external initializer (#20042 ) ### Description Fix issue that failed to load Conv node with external initializer. Root cause: the model path is not provided while loading the weight and bias tensors for Conv.	2024-03-23 13:43:20 -07:00
Satya Kumar Jandhyala	5b64d7c32b	[JS/WebGPU] Use non-matmul implementation for ConvTranspose in channel-first case. (#20022 ) ### Description Avoid using vec4 Matmul implementation for ConvTranspose with channel-last ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-03-23 11:19:14 -07:00
Adrian Lizarraga	cdc5d72ba9	[QDQ Quant] Support mixed-precision integer quantization via overrides (#19925 ) ### Description Adds support for specifying mixed precision QDQ models via tensor quantization overrides. ### Motivation and Context This PR implements an approach for supporting "mixed precision" models. The following figure demonstrates an example mixed precision model as defined in this PR. ![image](https://github.com/microsoft/onnxruntime/assets/19691973/40ae3bf9-b21a-4ba5-a1cd-41c1e08c21e7) A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence. The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics. #### Current support By default, the ORT quantizer supports specifying default activation and weight quantization data types for the entire model. A recent PR added support for specifying basic quantization overrides at the tensor level via the `extra_options["TensorQuantOverrides"]` configuration: ``` TensorQuantOverrides = dictionary : Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains a dictionary for each channel in the tensor. Each dictionary contains optional overrides with the following keys and values. 'quant_type' = QuantType : The tensor's quantization data type. 'scale' = Float : The scale value to use. Must also specify `zero_point` if set. 'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set. 
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`. 'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`. 'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. 'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. ``` The tensor-level overrides are currently used to override the quantization type for weights/initializers or to set specific scale/zero-point values for a tensor (e.g., QNN requires Sigmoid to use a specific scale/zero-point at its output). However, these overrides are not typically used to override activation quantization types due in large part to operator data type constraints. Consider, for example, that all inputs and outputs to an Add operator must be of the same data type. Consequently, using tensor-level overrides to promote the Add’s output to 16-bits would force the inputs to also be overridden to 16-bit. In turn, this would have a cascading effect on potentially the entire graph. The solution implemented by this PR is to allow the specification of tensor boundaries where the activation quantization data type changes. #### The approach The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type. ![image](https://github.com/microsoft/onnxruntime/assets/19691973/5998c301-ae20-4ac9-8a43-37f335cfcf8b) Note the following observations: - Op2’s output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type. - Op3’s output is converted from u8 to u16. Op5 consumes the converted u16 type. - Op4’s output is just u16 (not converted). - Op5’s output is converted from u16 to u8. Op6 consumes the u8 type. 
The approach implemented by this PR uses the tensor-level quantization overrides to specify a tensor’s quantization type at both the producer and consumer ends. The following shows the overrides necessary to create this mixed precision QDQ model. ```python3 overrides = { "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}], "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}], "Op4_out": [{"quant_type": QUInt16}], "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}] } ```	2024-03-23 11:05:08 -07:00
Changming Sun	3b4b99b90b	Fix a bug in WASM's GEMM (#20023 ) ### Description Fix a bug in WASM's GEMM. The bug was found when running "ConvAddActivationFusionTests.ConvGemmDirect" unit test in a wasm build with address sanitizer enabled. When CountK=25, CountN=1, lda=25, ldc=1, the function I am modifying triggered a read out of bound error. The bug fix was provided by @fs-eire.	2024-03-23 08:53:50 -07:00
Xiaoyu	71551dacd5	Add ModelProto support for transformers optimize_model (#19990 ) ### Description Add `ModelProto` support as an input to transformers `optimize_model` API. ### Motivation and Context Currently, the `optimize_model` API only accepts a model path as the input model. However, for large models, saving and loading from disk can be time-consuming. By adding `ModelProto` as an input option to the `optimize_model` API, significant time can be saved.	2024-03-22 18:40:58 -07:00
Dmitri Smirnov	3076b56947	Make MS Debug engine SymInitialize() called as needed. (#20036 ) ### Description Initialize Symbol engine as needed with no duplicate calls. ### Motivation and Context Currently the abseil library may call SymInitialize more than once when shared libraries are involved. However, it can be called only once per process. Our debug_alloc may also call it when enabled. This change lets initialization proceed only when needed, with no duplicate effort.	2024-03-22 16:17:47 -07:00
kunal-vaishnavi	f9cddd2cf5	Remove early stopping from LLaMA end-to-end benchmarking (#20033 ) ### Description This PR removes early stopping from the end-to-end LLaMA-2 benchmark script. ### Motivation and Context This allows models to always generate the requested number of new tokens.	2024-03-22 14:44:34 -07:00

10825 commits