onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-26 22:35:43 +00:00

Author	SHA1	Message	Date
Yi Zhang	0561b9576e	Fix and Refactor Python Packaging Pipeline (#20085) ### Description Make the Windows GPU Packaging stage in the Python Packaging pipeline run on a CPU machine as well. ### Test Link https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430961&view=results	2024-03-27 12:17:22 +08:00
zhijiang	b14d3f1d52	Zhijxu/fix softmax fp16 (#20059) With fp16 input, softmax returns NaN in some cases. The reason is that for the float16 dtype, std::numeric_limits<float16>::infinity() returns 0 instead of inf.	2024-03-27 11:37:10 +08:00
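The NaN can be reproduced outside ORT with a small NumPy sketch. It assumes, as an illustration rather than the actual kernel code, a numerically stable softmax that seeds its running maximum with negative infinity; the parameter `neg_inf_sentinel` is hypothetical and stands in for `-std::numeric_limits<float16>::infinity()`. When that value degrades to 0, a sufficiently negative input never raises the running maximum above 0, every `exp` underflows to 0 in half precision, and the normalization divides 0 by 0.

```python
import numpy as np

def softmax_fp16(x, neg_inf_sentinel):
    """Numerically stable softmax over a float16 vector (illustrative sketch).

    neg_inf_sentinel is a hypothetical stand-in for the starting value of the
    running maximum; it should be -inf.
    """
    x = np.asarray(x, dtype=np.float16)
    running_max = np.float16(neg_inf_sentinel)
    for v in x:
        running_max = max(running_max, v)
    # Subtracting the max keeps the largest exponent at exp(0) = 1 when the max is right.
    e = np.exp(x - running_max)  # stays float16, so tiny values underflow to 0
    return e / e.sum()

x = np.array([-20.0, -30.0], dtype=np.float16)

with np.errstate(invalid="ignore"):
    buggy = softmax_fp16(x, 0.0)     # sentinel degraded to 0: computes 0/0
correct = softmax_fp16(x, -np.inf)   # proper -inf sentinel

print(buggy)  # [nan nan]
print(correct)
```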
Xiaoyu	512c803550	fix import in transformer optimizer python script (#20091) ### Description Fix import. ### Motivation and Context Fix error.	2024-03-26 20:16:09 -07:00
Yulong Wang	473434c73f	[js/webgpu] perform uniform consistency check (#20019) ### Description This PR changes the WebGPU backend to validate program uniforms. It compares the uniform data returned by the program info's `getRunData()` callback with the list of uniform variables maintained by `ShaderHelper`. It also fixes a few bugs found by this check.	2024-03-26 17:14:43 -07:00
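A consistency check of this kind can be sketched in Python. The record types and the positional pairing of declared and supplied uniforms are assumptions made for illustration; the real check lives in the TypeScript WebGPU backend and compares against `ShaderHelper`'s list, whose exact shape is not shown here.

```python
from dataclasses import dataclass

@dataclass
class DeclaredUniform:  # hypothetical: what the shader helper declared
    name: str
    type: str           # e.g. "u32", "f32"
    length: int         # number of elements

@dataclass
class RunUniform:       # hypothetical: what the run-data callback supplied
    type: str
    data: list

def check_uniforms(declared, supplied):
    """Return human-readable mismatches; an empty list means consistent."""
    problems = []
    if len(declared) != len(supplied):
        problems.append(f"count mismatch: declared {len(declared)}, supplied {len(supplied)}")
    # Assumption: uniforms are matched by position.
    for decl, run in zip(declared, supplied):
        if decl.type != run.type:
            problems.append(f"{decl.name}: type {run.type} != declared {decl.type}")
        if decl.length != len(run.data):
            problems.append(f"{decl.name}: length {len(run.data)} != declared {decl.length}")
    return problems

declared = [DeclaredUniform("output_size", "u32", 1), DeclaredUniform("scale", "f32", 1)]
supplied = [RunUniform("u32", [128]), RunUniform("u32", [2])]
print(check_uniforms(declared, supplied))
```

Reporting every mismatch instead of stopping at the first one makes a single failing run point at all inconsistent uniforms at once.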
cao lei	793a8882ed	Regarding copy inputs before inference, flush the stream which copies the input only if the input is consumed by the ops from different streams (#19970) ### Description When copying inputs before inference, flush the stream that copies an input only if that input is consumed by ops on different streams. ### Motivation and Context This is an improvement on the fix in https://github.com/microsoft/onnxruntime/pull/17303	2024-03-26 13:57:25 -07:00
Yulong Wang	050085a7fb	[js/web] remove "browser" field in package.json (#20021) ### Description The "browser" field is deprecated in favor of "exports". This removes the unused field. Some bundlers may read from "browser" and generate errors; removing the field should let bundlers look up "exports" instead. Fixes #19915	2024-03-26 13:57:11 -07:00
Yulong Wang	0313dd1f65	Update Web CI to use data dir under Agent.TempDirectory (#20074) ### Description Update Web CI to use a data dir under Agent.TempDirectory. This change fixes random failures on the CI pipeline caused by unstable access to the karma temp directory (which is under AppData\Local\Temp).	2024-03-26 13:16:59 -07:00
Baiju Meswani	40efbd6c37	Fix training and macos ci pipelines (#20034)	2024-03-26 12:20:11 -07:00
zesongw	ea3082edc6	[WebNN EP] Support Split before opset13 (#19988) ### Description Support Split before opset 13, where `split` is an attribute. ### Motivation and Context Support more models that use earlier opsets.	2024-03-26 11:59:41 -07:00
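In ONNX, `Split` carries its optional `split` sizes as an attribute before opset 13 and as an optional second input from opset 13 on, so a backend that accepts both must read them from different places. The sketch below normalizes the two forms; the dict-based node and the `split_sizes` helper are hypothetical, not the ONNX protobuf API or the WebNN EP code, and it ignores the `num_outputs` attribute added in opset 18.

```python
def split_sizes(node, opset, initializers):
    """Return explicit split sizes for a Split node, or None for an equal split.

    node: hypothetical dict {"inputs": [...], "attrs": {...}}
    initializers: mapping from constant tensor name to a list of ints
    """
    if opset < 13:
        # Before opset 13, 'split' is an optional attribute.
        return node["attrs"].get("split")
    # From opset 13 on, 'split' is the optional second input (assumed constant here).
    if len(node["inputs"]) > 1 and node["inputs"][1]:
        return initializers[node["inputs"][1]]
    return None

old_style = {"inputs": ["X"], "attrs": {"axis": 0, "split": [2, 3]}}
new_style = {"inputs": ["X", "split_sizes"], "attrs": {"axis": 0}}
print(split_sizes(old_style, 11, {}))
print(split_sizes(new_style, 13, {"split_sizes": [2, 3]}))
```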
pengwa	dfa891a2d8	Fix memory stats printing (#20061) ### Fix memory stats printing Memory stats printing fails when the module is in eval mode while being wrapped by ORTModule. At that point, the runtime inspector for the training manager should have its training flag set to true, but it got false, because the existing logic reads the boolean from module.training. A runtime inspector, as part of a training manager or an inference manager, should know explicitly whether it is serving training, so it cannot depend on the state of module.training during ORTModule initialization.	2024-03-26 21:25:59 +08:00
Yi Zhang	0906c57c9e	Pin Onnx Version (#20073) ### Description 1. The change in build.py fixes a DML exception (https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=10&_a=summary). 2. The change in requirements.txt fixes an exception in the Python packaging pipeline: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430433&view=results --------- Co-authored-by: Yi Zhang <your@email.com>	2024-03-26 17:59:46 +08:00
pengwa	1a0ba3f69f	Fix softmax export (#20057) ### Description Why do we need to define softmax export logic here? Consider the usage `nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)` in the model (`76a33a1092/src/transformers/models/mistral/modeling_mistral.py (L302)`). If dtype is specified, the input tensor is cast to dtype before the operation is performed, which is useful for preventing data type overflows. The existing ONNX exporter instead does the cast after the operation, which is not correct (`cf06189a2d/torch/onnx/symbolic_opset13.py (L27)`). This override can serve as a workaround until PyTorch fixes the issue in an upcoming release. (TODO: pengwa - add PyTorch versions when the issue is fixed). @thiagocrepaldi We may need a fix in the PyTorch repo as well.	2024-03-26 13:09:20 +08:00
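Why the position of the cast matters can be illustrated with NumPy standing in for both PyTorch and the exported graph. The sketch compares casting a float16 input to float32 before the softmax (the documented `dtype=` behavior) with computing the softmax in float16 and casting afterwards. It demonstrates only the precision gap, not overflow, and the helper names are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax, computed in the dtype of x.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x16 = rng.standard_normal(1000).astype(np.float16)

# PyTorch semantics for softmax(..., dtype=torch.float32): cast first, then compute.
cast_before = softmax(x16.astype(np.float32))
# The order attributed to the existing exporter: compute in float16, then cast.
cast_after = softmax(x16).astype(np.float32)

# High-precision reference on the same (float16-rounded) inputs.
reference = softmax(x16.astype(np.float64))
err_before = np.abs(cast_before - reference).max()
err_after = np.abs(cast_after - reference).max()
print(err_before, err_after)
```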
Adrian Lizarraga	7d976cf720	[QNN QDQ Quant] Utils to generate mixed-precision quant overrides (#20028) ### Description - Adds a utility to the QNN quantization scripts that "fixes" an initial set of tensor quantization overrides for mixed-precision QDQ models. Follow-up to https://github.com/microsoft/onnxruntime/pull/19925 - Moves existing overrides for QNN compatibility (matmul, layernorm, sigmoid, tanh) to separate functions. This PR adds missing unit tests for these. - Adds a `weight_symmetric=None` parameter to the `get_qnn_qdq_config()` function to enable user specification (instead of always using the default behavior). - If weight_symmetric is set to `None`, it will be set to `weight_symmetric = weight_type in (QUInt8, QUInt16)`. - Otherwise, the user's value is used. #### Example Float model:
```
input_0 --> Op1 --> Op3 --> Op5 --> Op6 --> output_0
                             ^
                             |
input_1 --> Op2 -+-> Op4 ----+
                 |
                 +-> Op7 --> output_1
                 |
                 +-> Op8 --> output_2
```
If we'd like to quantize this model to uint8 precision, but would like to make sure tensor "Op4_out" is quantized to 16-bit, then we would specify the following initial tensor quantization overrides:
```python
# Op4_out could be an inaccurate tensor that should be upgraded to 16-bit
initial_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]}
```
These initial overrides may not create a valid model because Op4 and Op5 may require both the input and output to be the same type (e.g., uint16). 
This helper fixes the overrides so that input/output data types are valid:
```python
qnn_config = get_qnn_qdq_config(
    float_model_path,
    data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    init_overrides=initial_overrides,  # These initial overrides will be "fixed"
)
```
The above snippet generates the following "fixed" overrides (obtained via `qnn_config.extra_options["TensorQuantOverrides"]`):
```python
{
    "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QUInt16}],
    "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}],
}
```
How to interpret the fixed overrides: - Op2's output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, but Op7 and Op8 consume the original u8 type. - Op3's output is converted from u8 to u16. Op5 consumes the converted u16 type. - Op4's output is just u16 (not converted). All consumers of Op4_out get the u16 type. - Op5's output is converted from u16 to u8. Op6 consumes the u8 type. ### Motivation and Context Generating mixed-precision quantization overrides is currently a manual process. This PR adds a utility that helps generate valid overrides.	2024-03-25 14:41:14 -07:00
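The interpretation rules above can be captured in a small lookup. This is a hypothetical helper, not part of the quantization scripts: it uses strings in place of `QuantType` members, handles only per-tensor overrides, assumes every `convert` entry lists its `recv_nodes` (as in the example), and falls back to the example's default activation type, QUInt8, for tensors without overrides.

```python
# Quant types as plain strings; the real scripts use the QuantType enum.
fixed_overrides = {
    "Op2_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": "QUInt16"}],
    "Op5_out": [{"quant_type": "QUInt16", "convert": {"quant_type": "QUInt8", "recv_nodes": {"Op6"}}}],
}

def consumed_quant_type(overrides, tensor, consumer, default="QUInt8"):
    """Quant type that `consumer` sees for `tensor` (hypothetical helper)."""
    if tensor not in overrides:
        return default
    entry = overrides[tensor][0]  # per-tensor override: a single dictionary
    convert = entry.get("convert")
    if convert is not None and consumer in convert["recv_nodes"]:
        return convert["quant_type"]  # this consumer reads the converted type
    return entry.get("quant_type", default)  # everyone else reads the original type

print(consumed_quant_type(fixed_overrides, "Op2_out", "Op4"))
print(consumed_quant_type(fixed_overrides, "Op2_out", "Op7"))
```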
Vincent Wang	d30c81d270	Add Symbolic Shape Hint to Triton Codegen Config (#20056) Add a symbolic shape hint to the Triton codegen config so that we can avoid unnecessary recompilation when input shapes keep changing. The screenshot below shows that, with proper configuration, training can be sped up considerably by reducing unnecessary recompilation. ![image](https://github.com/microsoft/onnxruntime/assets/11661208/699944d2-81cd-4c22-84e7-73a4fa0d2a28)	2024-03-25 15:05:02 +08:00
aciddelgado	4a196d1594	Packed QKV and Rotary Embedding Support for sm<80 GQA (#20012) ### Description Add support for packed QKV input and rotary embedding with sm<80 using the memory efficient attention kernel. ### Motivation and Context Allows lower-end GPUs to run GQA with packed QKV input and rotary embedding.	2024-03-23 14:30:35 -07:00
Hector Li	f977be0663	Fix issue that failed to load Conv node with external initializer (#20042) ### Description Fix an issue where loading a Conv node with an external initializer failed. Root cause: the model path was not provided while loading the weight and bias tensors for Conv.	2024-03-23 13:43:20 -07:00
Satya Kumar Jandhyala	5b64d7c32b	[JS/WebGPU] Use non-matmul implementation for ConvTranspose in channel-first case. (#20022) ### Description Avoid using vec4 Matmul implementation for ConvTranspose with channel-last	2024-03-23 11:19:14 -07:00
Adrian Lizarraga	cdc5d72ba9	[QDQ Quant] Support mixed-precision integer quantization via overrides (#19925) ### Description Adds support for specifying mixed precision QDQ models via tensor quantization overrides. ### Motivation and Context This PR implements an approach for supporting "mixed precision" models. The following figure demonstrates an example mixed precision model as defined in this PR. ![image](https://github.com/microsoft/onnxruntime/assets/19691973/40ae3bf9-b21a-4ba5-a1cd-41c1e08c21e7) A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence. The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics. #### Current support By default, the ORT quantizer supports specifying default activation and weight quantization data types for the entire model. A recent PR added support for specifying basic quantization overrides at the tensor level via the `extra_options["TensorQuantOverrides"]` configuration:
```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
        'quant_type' = QuantType : The tensor's quantization data type.
        'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
        'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set.
        'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`.
        'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`.
        'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`.
        'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`.
```
The tensor-level overrides are currently used to override the quantization type for weights/initializers or to set specific scale/zero-point values for a tensor (e.g., QNN requires Sigmoid to use a specific scale/zero-point at its output). However, these overrides are not typically used to override activation quantization types due in large part to operator data type constraints. Consider, for example, that all inputs and outputs to an Add operator must be of the same data type. Consequently, using tensor-level overrides to promote the Add’s output to 16 bits would force the inputs to also be overridden to 16-bit. In turn, this would have a cascading effect on potentially the entire graph. The solution implemented by this PR is to allow the specification of tensor boundaries where the activation quantization data type changes. #### The approach The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type. ![image](https://github.com/microsoft/onnxruntime/assets/19691973/5998c301-ae20-4ac9-8a43-37f335cfcf8b) Note the following observations: - Op2’s output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type. - Op3’s output is converted from u8 to u16. Op5 consumes the converted u16 type. - Op4’s output is just u16 (not converted). - Op5’s output is converted from u16 to u8. Op6 consumes the u8 type. 
The approach implemented by this PR uses the tensor-level quantization overrides to specify a tensor’s quantization type at both the producer and consumer ends. The following shows the overrides necessary to create this mixed precision QDQ model.
```python3
overrides = {
    "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QUInt16}],
    "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}],
}
```	2024-03-23 11:05:08 -07:00
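The `TensorQuantOverrides` rules quoted above (`scale` and `zero_point` go together, and `symmetric`, `reduce_range`, `rmax`, and `rmin` are invalid alongside them) can be checked before calling the quantizer. The sketch below is a hypothetical pre-flight validator written for illustration; it is not the quantizer's own validation and it does not inspect `convert` entries.

```python
def check_tensor_override(entry):
    """Return rule violations for one override dictionary (hypothetical helper)."""
    errors = []
    has_scale = "scale" in entry
    has_zp = "zero_point" in entry
    if has_scale != has_zp:
        errors.append("'scale' and 'zero_point' must be specified together")
    if has_scale or has_zp:
        # These keys only make sense when scale/zero-point are computed, not fixed.
        for key in ("symmetric", "reduce_range", "rmax", "rmin"):
            if key in entry:
                errors.append(f"'{key}' is invalid when 'scale'/'zero_point' is set")
    return errors

def check_overrides(overrides):
    """Map tensor name -> violations, for each tensor with at least one violation."""
    report = {}
    for tensor, entries in overrides.items():
        # One dictionary per tensor (per-tensor) or per channel (per-channel).
        problems = [msg for entry in entries for msg in check_tensor_override(entry)]
        if problems:
            report[tensor] = problems
    return report

print(check_overrides({"t": [{"scale": 0.5}]}))
```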
Changming Sun	3b4b99b90b	Fix a bug in WASM's GEMM (#20023) ### Description Fix a bug in WASM's GEMM. The bug was found when running the "ConvAddActivationFusionTests.ConvGemmDirect" unit test in a wasm build with address sanitizer enabled. When CountK=25, CountN=1, lda=25, ldc=1, the function I am modifying triggered an out-of-bounds read. The bug fix was provided by @fs-eire.	2024-03-23 08:53:50 -07:00
Xiaoyu	71551dacd5	Add ModelProto support for transformers optimize_model (#19990) ### Description Add `ModelProto` support as an input to the transformers `optimize_model` API. ### Motivation and Context Currently, the `optimize_model` API only accepts a model path as the input model. However, for large models, saving and loading from disk can be time-consuming. By adding `ModelProto` as an input option to the `optimize_model` API, significant time can be saved.	2024-03-22 18:40:58 -07:00
Dmitri Smirnov	3076b56947	Make MS Debug engine SymInitialize() called as needed. (#20036) ### Description Initialize the symbol engine as needed, with no duplicate calls. ### Motivation and Context Currently, the Abseil library may call SymInitialize more than once when shared libraries are involved. However, it can be called only once per process. Our debug_alloc may also call it when enabled. This change lets initialization proceed only when needed, with no duplicate effort.	2024-03-22 16:17:47 -07:00
kunal-vaishnavi	f9cddd2cf5	Remove early stopping from LLaMA end-to-end benchmarking (#20033) ### Description This PR removes early stopping from the end-to-end LLaMA-2 benchmark script. ### Motivation and Context This allows models to always generate the requested number of new tokens.	2024-03-22 14:44:34 -07:00
Abhishek Jindal	7e84ba0ea3	remove const cast for DLManagedTensor (#20015) ### Description Remove the const_cast, as it might lead to undefined behavior. Specifying DLManagedTensor as const doesn't seem to be necessary, and I have tested this by running torch_ort.configure. I'm not sure what other tests need to be done. Background can be found in this [PR](https://github.com/microsoft/onnxruntime/pull/19982)	2024-03-22 10:39:19 -07:00
Baiju Meswani	2bc29244b4	Support model with multiple SCE loss nodes (#20016)	2024-03-22 10:28:44 -07:00
kunal-vaishnavi	6238e9c0af	Add LLaMA end-to-end benchmarking (#19985) ### Description This PR adds a benchmarking script to measure end-to-end performance and save the results in a CSV file. ### Motivation and Context With this PR, end-to-end performance can be easily measured for many large language models such as LLaMA-2. The performance numbers for LLaMA-2 are located [here](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama).	2024-03-21 18:59:05 -07:00
sfatimar	eab35c20fc	Ort openvino npu 1.17 master (#19966) ### Description Adds NPU to the list of supported devices. Adds support for OV 2024.0. NuGet packages no longer package the OpenVINO DLL. Bug fixes in the Python API. Reverted Dockerfiles that are not being maintained. ### Motivation and Context The NPU device has been introduced by Intel in its latest client systems, and the OpenVINO 2024.0 release is out. --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Ubuntu <ubuntu@ubuntu-118727.iind.intel.com> Co-authored-by: hmamidix <hemax.sowjanya.mamidi@intel.com> Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>	2024-03-21 18:44:00 -07:00
Yi Zhang	cd6d3aea45	Refactor Python CUDA packaging pipeline to fix random hangs in building (#19989) ### Description 1. Move building to a CPU machine. 2. Optimize the pipeline. 3. Since there isn't an official ONNX package for Python 3.12, the Python 3.12 test stage uses the packages built from ONNX source in the build stage. ### Motivation and Context 1. Resolve the random hang in compilation. 2. Save a lot of GPU resources. ---------	2024-03-22 09:16:00 +08:00
Changming Sun	dafbef3a21	CMake: support reading dependency zip files from a local mirror (#20005) ### Description To test this feature, run
```bat
python cmake\deps_update_and_upload.py --root-path mirror
```
Then run build.py as usual. The zip files will be cached locally, to avoid downloading them again and again.	2024-03-21 17:58:59 -07:00
TP Boudreau	983fd8393a	Recognize NaN operands in Min and Max ops (#19984) ### Description Update the Min and Max CUDA math operations on float/double types to propagate NaNs: if either operand is NaN, the result should be NaN. TODO: float16/bfloat16 need a similar change. ### Motivation Currently, results differ between the CPU and CUDA implementations of the floating-point Min and Max operators: the CPU operators correctly return NaN results if either operand is NaN. This PR updates the CUDA implementations to conform with this correct behavior. See the issue and comments raised [here](https://github.com/onnx/onnx/issues/6003). ### Context Same behavior in numpy, torch and Java:
```
>>> numpy.min([numpy.NAN, 1])
nan
>>> numpy.max([numpy.NAN, 1])
nan
>>> torch.min(torch.tensor([1, float('nan')]))
tensor(nan)
>>> torch.max(torch.tensor([1, float('nan')]))
tensor(nan)
```
The C language functions [fmin](https://en.cppreference.com/w/c/numeric/math/fmin) and [fmax](https://en.cppreference.com/w/c/numeric/math/fmax) behave differently:
```
fmax(NaN,1) = 1
fmin(NaN,1) = 1
```
https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/minNum_maxNum_Removal_Demotion_v3.pdf ![image](https://github.com/microsoft/onnxruntime/assets/30328909/62446cf1-f252-4ddc-8118-5ce605252331) https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2273.pdf	2024-03-21 16:08:18 -07:00
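The NaN-propagating semantics can be written out explicitly and contrasted with NumPy's two families of element-wise functions: `numpy.minimum`/`numpy.maximum` propagate NaN, while `numpy.fmin`/`numpy.fmax` follow the C `fmin`/`fmax` behavior of ignoring it. The two helper functions are illustrative, not ORT code.

```python
import math
import numpy as np

def nan_propagating_min(a, b):
    # A NaN in either operand yields NaN, like the CPU Min operator described above.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

def nan_propagating_max(a, b):
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)

nan = float("nan")
print(nan_propagating_min(nan, 1.0), nan_propagating_max(1.0, nan))  # nan nan
print(np.minimum(nan, 1.0), np.maximum(1.0, nan))                    # nan nan
print(np.fmin(nan, 1.0), np.fmax(1.0, nan))                          # 1.0 1.0
```

The explicit check is needed because Python's built-in `min` and `max` are order-dependent with NaN: `min(float('nan'), 1.0)` returns nan, but `min(1.0, float('nan'))` returns 1.0.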
Yi Zhang	30a0d80925	Fix exception in Publish unit test results step (#20007) ### Description Test result files are all in the RelWithDebInfo\RelWithDebInfo directory, so it's not necessary to stat the _deps directory. ### Motivation and Context Recently, this exception has occurred many times in the zip-nuget pipeline. `##[error]Error: Failed find: EPERM: operation not permitted, stat 'D:\a\_work\1\b\RelWithDebInfo\_deps\flatbuffers-src\java\src\test\java\DictionaryLookup'` https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=426981&view=logs&j=75fc0348-fe99-522b-3acb-90fd80ac5271&t=5d4ebcc1-bcde-574d-6f4e-8abd0f04ae4b	2024-03-22 06:53:59 +08:00
Tianlei Wu	06fe4f3113	Increase MNIST test tolerance (#20000) ### Description Found multiple occurrences of failures: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1321061&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=56a04c0b-9e7f-5c69-cb7b-c2a7b1a7392a&l=17537 https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1329701&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=4f6ef737-111d-50d1-a46b-5f86d9a970bc&s=3618b4c0-1011-591a-85b8-671e72e2cff1 1: [ RUN ] ModelTests/ModelTest.Run/ cuda__models_zoo_opset7_MNIST_model 1: D:\a\_work\1\s\onnxruntime\test\providers\cpu\model_tests.cc(358): error: Expected equality of these values: 1: COMPARE_RESULT::SUCCESS 1: Which is: 4-byte object <00-00 00-00> 1: ret.first 1: Which is: 4-byte object <01-00 00-00> 1: expected -2.33638 (c0158735), got -2.30239 (c0135a47), diff: 0.0339923, tol=0.0243638 idx=9	2024-03-20 23:40:27 -07:00
Prathik Rao	0b958bb421	add random seed to layernorm tests (#19998) Adds a random seed to layernorm tests to prevent random failures. ### Motivation and Context Fixes https://github.com/microsoft/onnxruntime/issues/19983	2024-03-20 21:00:25 -07:00
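Seeding the generator makes the random inputs, and hence the expected outputs, identical on every run. The NumPy reference below is a sketch, not the test's code; the `eps` value, the shapes, and the helper names are assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last axis, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def make_case(seed, shape=(4, 16)):
    # A fixed seed reproduces the same inputs (and expected outputs) every run.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape).astype(np.float32)
    gamma = rng.standard_normal(shape[-1]).astype(np.float32)
    beta = rng.standard_normal(shape[-1]).astype(np.float32)
    return x, gamma, beta

run1 = layer_norm(*make_case(seed=42))
run2 = layer_norm(*make_case(seed=42))
```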
Yi Zhang	175f149b30	Remove downloading deps in CUDA package test stage (#19993) ### Motivation and Context Downloading deps is not needed in the test stage; remove it to reduce random download errors.	2024-03-21 10:01:03 +08:00
Justin Chu	0335ea9f1e	Use Java 11 to build project in the codeql pipeline (#19999) Codeql uses Java 8 by default, which is too old for the project. Related: https://learn.microsoft.com/en-us/java/openjdk/reasons-to-move-to-java-11 https://github.com/actions/setup-java	2024-03-20 17:53:48 -07:00
Yufeng Li	15219e2e71	turn on neural_speed by default (#19627) ### Description The crash caused by neural_speed turns out to be an extreme corner case. Turn it on by default.	2024-03-20 12:49:58 -07:00
Rachel Guo	6b305f95e0	Support xcframework for mac catalyst builds. (#19534) ### Motivation and Context MAUI on macOS uses mac-catalyst, which requires a different native binary. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-03-20 10:55:19 -07:00
Adam Pocock	19ff4a6d6c	String Tensor SplitToSequence fix (#19942)	2024-03-20 10:52:00 -07:00
Markus Tavenrath	0af5eacc8b	Fix broken Pooling CUDA NHWC Ops and ensure NCHW / NHWC parity. (#19889) ### Description Fixed all CUDA NHWC Pooling operations which were broken and enabled the NHWC CUDA pooling tests. Disabled all pooling tests which are not supported by the CUDA EP. ### Motivation and Context Ensure parity between CUDA NHWC / NCHW and work towards 100% tests enabled for the CUDA EP / CUDA NHWC EP. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2024-03-20 09:57:29 -07:00
Yi Zhang	8adbc09314	[Fix] Error Python Packaging Pipeline (Training CPU) (#19992) ### Description fix the error caused by https://github.com/microsoft/onnxruntime/pull/19973	2024-03-20 09:02:50 -07:00
zesongw	7e18cb4c35	[WebNN EP] Support MatMul 1D (#19862) ### Description Support MatMul 1D inputs by combining Reshape and ReduceMean. ### Motivation and Context ONNX MatMul supports 1D inputs, but they were disabled in `IsOpSupportedImpl`.	2024-03-20 08:32:57 -07:00
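ONNX MatMul follows `numpy.matmul` semantics: a 1-D left operand is promoted by prepending a 1 to its shape, a 1-D right operand by appending a 1, and the added dimensions are removed from the result. The sketch below expresses that promotion with reshapes around a 2-D matmul. The commit also mentions ReduceMean, whose role is not described, so it is not modeled here; the helper name is hypothetical.

```python
import numpy as np

def matmul_1d_via_reshape(a, b):
    """MatMul with possible 1-D operands, built from reshapes plus N-D matmul."""
    a2 = a.reshape(1, -1) if a.ndim == 1 else a  # prepend a 1 to a 1-D left operand
    b2 = b.reshape(-1, 1) if b.ndim == 1 else b  # append a 1 to a 1-D right operand
    out = a2 @ b2
    # Remove the dimensions that were added during promotion.
    if a.ndim == 1:
        out = out.reshape(out.shape[:-2] + out.shape[-1:])
    if b.ndim == 1:
        out = out.reshape(out.shape[:-1])
    return out

rng = np.random.default_rng(0)
v = rng.standard_normal(3)
m = rng.standard_normal((3, 4))
print(matmul_1d_via_reshape(v, m).shape)  # (4,)
```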
Ye Wang	6ff31e06d5	[MoE] Add TP and Mixtral MoE (#19945) ### Description 1. Support Tensor Parallelism in ShardedMoE. 2. Make necessary code changes to support Mixtral MoE. 3. Fix a bug related to using IOBinding in the test script. 4. Fix the input size limitation.	2024-03-19 21:28:15 -07:00
mindest	3dfe4a5e6d	[ROCm] Remove MPI dependency and collectives to use NCCL (#19830) ### Description * Remove MPI dependency to use NCCL AllReduce, etc. * Exclude unsupported collectives in hipify	2024-03-19 17:35:18 -07:00
Abhishek Jindal	6fe02068af	Add const cast for DLManagedTensor (#19982) ### Description Add a const cast for DLManagedTensor, as PyTorch has changed its [code](https://github.com/pytorch/pytorch/pull/121102), which creates an incompatibility. ### Motivation and Context Fixes the following error when configuring ORT-training with nightly PyTorch:
```
aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
   60 |     at::Tensor tensor = at::fromDLPack(dlpack);
      |                                        ^~~~~~
      |                                        |
      |                                        const DLManagedTensor*
```	2024-03-19 17:00:44 -07:00
Guenther Schmuelling	c45cff60cf	[js/webgpu] fix maxpool / fp16 (#19981)	2024-03-19 16:15:49 -07:00
Tianlei Wu	597e828aae	Adjust test tolerance (#19947) ### Description Improve the precision of tests. Changes include: (1) Update checkers.cc to use a consistent default tolerance. (2) Allow different default tolerances for different providers at runtime (previously, the threshold of a test was decided at compile time). (3) Explicitly set absolute and relative error tolerances for tests that failed to pass the new default thresholds. #### Default Thresholds Change Note that the testing formula is `abs(expected - value) < absolute + relative * expected`. Default test thresholds when neither absolute nor relative tolerance is set:

type | provider | absolute (before) | absolute (after) | relative (before) | relative (after)
-- | -- | -- | -- | -- | --
double | CPU | 0.001 | 0.00001 | 0 | 0.00001
double | CUDA | 0.005 | 0.00001 | 0 | 0.00001
double | TRT | 0.005 | 0.00001 | 0 | 0.00001
double | ROCM | 0.005 | 0.00001 | 0 | 0.00001
double | DML | 0.005 | 0.00001 | 0 | 0.00001
float | CPU | 0.0001 | 0.00001 | 0 | 0.0001
float | CUDA | 0.005 | 0.00001 | 0 | 0.0001
float | TRT | 0.005 | 0.00001 | 0 | 0.0001
float | ROCM | 0.005 | 0.00001 | 0 | 0.0001
float | DML | 0.005 | 0.00001 | 0 | 0.0001
float | Training* | 0.005 | 0.001 | 0 | 0.0001
half | CPU | 0.001 | 0.0025 | 0 | 0.001
half | CUDA | 0.005 | 0.0025 | 0 | 0.001
half | TRT | 0.005 | 0.0025 | 0 | 0.001
half | ROCM | 0.005 | 0.0025 | 0 | 0.001
half | DML | 0.02 | 0.005 | 0 | 0.001
half | Training* | 0.005 | 0.005 | 0 | 0.001
bfloat16 | CPU | 0.0001 | 0.02 | 0 | 0.01
bfloat16 | CUDA | 0.0001 | 0.02 | 0.05 | 0.01
bfloat16 | TRT | 0.0001 | 0.02 | 0.05 | 0.01
bfloat16 | ROCM | 0.0001 | 0.02 | 0.05 | 0.01
bfloat16 | DML | 0.0001 | 0.02 | 0.05 | 0.01
bfloat16 | Training* | 0.0001 | 0.02 | 0.05 | 0.01

*Training means that the build flag ENABLE_TRAINING_CORE is defined. 
The provider can be any one. #### Threshold for provider Previously, the threshold might change according to build flags:
```
#if defined(USE_CUDA) || defined(USE_ROCM) || defined(USE_DML)
constexpr float threshold = 0.005f;
#else
constexpr float threshold = 0.0001f;
#endif
```
For a CPU-only build, the threshold is 0.0001. For a CUDA build, the threshold for the CPU provider (some tests in a CUDA build actually run with the CPU provider) changes to 0.005. After this change, the threshold depends only on the data type and the provider used in the test; it does not change with build flags for non-training builds. Default thresholds for training may differ from inference (refer to the table above). There are a few factors: training has gradient outputs; TF32 is not disabled in training; some training tests have iterations, so errors may accumulate. Setting different thresholds based on these factors could be a future task.	2024-03-19 15:50:13 -07:00
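The default-threshold comparison can be sketched as a table lookup followed by the combined check. Only a few rows of the new defaults are copied, and `DEFAULT_TOLERANCES` and `within_tolerance` are hypothetical names, not the `checkers.cc` API. The sketch scales the relative term by `abs(expected)`; that is an assumption, since the formula quoted above writes `relative * expected`, which would give a negative allowance for negative expected values.

```python
# A few of the new ("after") defaults from the table: (absolute, relative).
DEFAULT_TOLERANCES = {
    ("double", "CPU"): (1e-5, 1e-5),
    ("float", "CPU"): (1e-5, 1e-4),
    ("float", "CUDA"): (1e-5, 1e-4),
    ("half", "CUDA"): (0.0025, 0.001),
    ("half", "DML"): (0.005, 0.001),
    ("bfloat16", "CPU"): (0.02, 0.01),
}

def within_tolerance(expected, actual, dtype, provider):
    """Hypothetical check: |expected - actual| < atol + rtol * |expected|."""
    atol, rtol = DEFAULT_TOLERANCES[(dtype, provider)]
    # Assumption: the relative term uses abs(expected) so negative values work.
    return abs(expected - actual) < atol + rtol * abs(expected)

print(within_tolerance(1.0, 1.00005, "float", "CPU"))
print(within_tolerance(1.0, 1.0002, "float", "CPU"))
```

With float on CPU, a value of 1.0 therefore tolerates a total error of 0.00001 + 0.0001 × 1.0 = 0.00011.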
Hariharan Seshadri	cd6ec50b50	Switch a portion of CI/packaging jobs to MacOS12 (#19908)	2024-03-19 14:54:58 -07:00
Adrian Lizarraga	18a7f34ba0	[NhwcTransformerTests] Fix linker error due to explicit template instantiation of ModelBuilder methods (#19980) Currently, the nhwc_transformer_test.cc compilation unit defines explicit FP16 versions of `ModelTestBuilder::MakeInput<MLFloat16>` and `ModelTestBuilder::MakeInitializer<MLFloat16>` outside of the ModelTestBuilder class's header file. These explicit template instantiations cause linker errors due to duplicate definitions when other compilation units also instantiate these functions. Additionally, the versions defined in nhwc_transformer_test.cc do not really conform to the expected behavior in the original ModelTestBuilder class, which is to make random input/initializer values. Instead, the versions in nhwc_transformer_test.cc create a range of values. The solution is to edit nhwc_transformer_test.cc to use stand-alone static functions that do not change the ModelTestBuilder class. Note: This linker error cannot currently be replicated in our CIs because it requires a QNN-HTP-enabled Windows ARM64 environment with `MLAS_F16VEC_INTRINSICS_SUPPORTED` defined. I can replicate it in a local build. The linker error/conflict happens with this new FP16 QNN test: `d4c8bc359e/onnxruntime/test/providers/qnn/clip_op_test.cc (L186)`	2024-03-19 13:48:04 -07:00
Yulong Wang	01c7aaf6aa	[js/webgpu] allow setting env.webgpu.adapter (#19940) ### Description Allow users to set `env.webgpu.adapter` before creating the first inference session. Feature request: https://github.com/microsoft/onnxruntime/pull/19857#issuecomment-1999984753 @xenova	2024-03-19 12:55:00 -07:00
Tianlei Wu	8293aa1564	Exclude TRT provider in tests crashed in A100 (#19972) The TensorRT EP hits a segmentation fault on A100 in some tests. Exclude the TRT EP from those tests on A100 to unblock development. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/19530	2024-03-19 11:36:42 -07:00
Yi Zhang	d4c8bc359e	Fix Training CPU docker image name to avoid unnecessary rebuilding (#19973) ### Description The docker image name was fixed, but the docker argument differed between jobs, which triggered rebuilding the docker image almost every time.	2024-03-19 09:33:24 -07:00


11997 commits