onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-03 03:58:54 +00:00

Author	SHA1	Message	Date
Jian Chen	bfa5eb4591	Adding a new pipeline for publishing cuda 12 nuget packages (#18713 )	2023-12-11 13:07:05 -08:00
Ashwini Khade	16df8377d3	Update transformers package to fix the security issue (#18730 ) ### Description Updating transformers package in test pipeline to fix a security vulnerability. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-11 09:15:23 -08:00
Patrice Vignola	8d641229e6	Fix GQA shape inference (#18723 ) The shape inference is always returning before getting the chance to infer the key/value outputs.	2023-12-10 21:36:19 -08:00
cloudhan	de32baeeef	[ROCm] Add GemmFloat8 (#18488 )	2023-12-11 11:37:29 +08:00
Xavier Dupré	d41dd77241	Extend API page on the python documentation (#18762 )	2023-12-09 15:33:57 -08:00
Abhishek Jindal	2f93d97fd0	Add cuda visible devices for Mistral benchmark (#18764 ) ### Description <!-- Describe your changes. --> Add cuda visible devices for Mistral benchmark as it is not working for Torch compile and throwing an error. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Error: File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/_inductor/triton_heuristics.py", line 556, in run return launcher( File "<string>", line 8, in launcher RuntimeError: Triton Error [CUDA]: invalid device context	2023-12-08 23:12:48 -08:00
Changming Sun	c7799d7058	Build fixes for Windows ARM32 desktop build (#18752 ) ### Description Fix a link error: ``` onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "private: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "public: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\yilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in function "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function _tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved externals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] ```	2023-12-08 12:45:06 -08:00
pengwa	44b5843740	Fix gemm_float8 build failure on CUDA 11.3-11.7 (#18760 ) ### Fix gemm_float8 build failure on CUDA 11.3 ~ 11.7 User env: CUDA 11.3, build option include "--disable_types float8" ``` /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/math/gemm_float8.cu(256): error: identifier "CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET" is undefined /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/math/gemm_float8.cu(264): error: enum "cublasLtMatmulDescAttributes_t" has no member "CUBLASLT_MATMUL_DESC_FAST_ACCUM" /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/math/gemm_float8.cu(268): error: identifier "CUBLASLT_MATMUL_DESC_A_SCALE_POINTER" is undefined /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/math/gemm_float8.cu(271): error: identifier "CUBLASLT_MATMUL_DESC_B_SCALE_POINTER" is undefined /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/math/gemm_float8.cu(274): error: identifier "CUBLASLT_MATMUL_DESC_D_SCALE_POINTER" is undefined 5 errors detected in the compilation of "/tmp/onnxruntime/onnxruntime/contrib_ops/cu ``` Here is a versions (major version) diff on the requested attributes: ``` cuda 11.5.1 no CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET cuda 11.6 https://docs.nvidia.com/cuda/archive/11.6.0/pdf/CUBLAS_Library.pdf has CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET cuda 11.7 no CUBLASLT_MATMUL_DESC_FAST_ACCUM no CUBLASLT_MATMUL_DESC_A_SCALE_POINTER no CUBLASLT_MATMUL_DESC_B_SCALE_POINTER no CUBLASLT_MATMUL_DESC_D_SCALE_POINTER cuda 11.8 https://docs.nvidia.com/cuda/archive/11.8.0/pdf/CUBLAS_Library.pdf has CUBLASLT_MATMUL_DESC_FAST_ACCUM has CUBLASLT_MATMUL_DESC_A_SCALE_POINTER has CUBLASLT_MATMUL_DESC_A_SCALE_POINTER has CUBLASLT_MATMUL_DESC_B_SCALE_POINTER has CUBLASLT_MATMUL_DESC_D_SCALE_POINTER ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-08 21:01:34 +08:00
Wanming Lin	e8f33b54ba	[WebNN EP] Don't convert all inputs except the 0th input for Resize (#18687 ) Currently all the inputs of a Resize node are converted to NHWC if the preferred layout is NHWC, and ORT calls `IsOpSupportedImpl` twice: the first time the inputs are NCHW, and the second time the inputs have been converted to NHWC. This makes validating the scales input complicated and makes it difficult to identify the height and width values.	2023-12-07 18:18:35 -08:00
Edward Chen	7ed48a299a	Objective-C API updates (#18738 ) - Add ORTSession and ORTTrainingSession strong references to ORTEnv. - Make ORTTrainingSession session options parameter optional.	2023-12-07 16:47:46 -08:00
Changming Sun	bf33919afb	Update absl and gtest to fix an ARM64EC build error (#18735 ) ### Description Update absl and gtest to fix an ARM64EC build error ### Motivation and Context We need to get an important fix into ORT. The fix is: `8028a87c96`	2023-12-07 15:55:17 -08:00
Rachel Guo	305db31301	fix build aar error in Zip-Nuget-Java-Nodejs Packaging pipeline (#18745 ) ### Description <!-- Describe your changes. --> [Pipeline failure info](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=387310&view=logs&j=0aae05c9-1dc0-5099-eb4a-4cbb949c7458&t=71450a55-3e84-511c-7394-a06145376912&l=1044) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix packaging pipeline brought by pr. Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-12-07 14:48:55 -08:00
Yulong Wang	efbef5f611	[js/webgpu] allow to specify callback for profiling data (#18732 ) ### Description This PR is a replacement of #17820. It allows specifying a callback for profiling data. Previous: ```js ort.env.webgpu.profilingMode = 'default'; // enable profiling // profiling data will output to console. ``` Now: ```js ort.env.webgpu.profiling = { mode: 'default', // enable profiling ondata: (data) => { // .. process the profiling data } }; // for each kernel, "ondata" will be called once. only output to console if ondata is not specified. ```	2023-12-07 14:10:28 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937 ) ### Description Hello, we (@lixing-star) are developers on the Loongson team. We added 128-bit (lsx) and 256-bit (lasx) vector optimization code for the LoongArch architecture. [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environments1 ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LoongArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
Yi Zhang	a045be335b	use EO pool for windows web_cpu stage (#18737 ) ### Description reuse EO pool in NPM pipeline. ### Motivation and Context build_web_debug failed in onnxruntime-Win-CPU-2022 but it works in EO pool. Reuse EO pool to make the pipeline work now. When I'm free, I'll try upgrading the chrome in the custom image.	2023-12-07 10:10:00 -08:00
Hector Li	e469de65f5	Re-enable Sign op int64 test for QNN CPU test (#18734 ) ### Description Re-enable Sign op int64 test for QNN CPU test	2023-12-07 08:42:25 -08:00
Wanming Lin	3d8af6eb65	[WebNN EP] Skip split initializer (#18729 )	2023-12-07 08:09:49 -08:00
Tianlei Wu	49470f06e8	Add benchmark script for control net (#18717 ) Add script to benchmark PyTorch and StableFast for control net. Add an option --max-batch-size in demo for benchmark purpose.	2023-12-06 21:54:51 -08:00
Dmitri Smirnov	e603e78627	Enforce If condition size == 1 (#18733 ) ### Description <!-- Describe your changes. --> ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/18549	2023-12-06 21:04:18 -08:00
moyo1997	9479ba525b	Build onnxruntime.dll as arm64x (#18633 ) Build onnxruntime.dll as arm64x. Added a .cmake file to generate a link repro of the onnxruntime.dll during the arm64 build. This provides a directory containing all the arm64 objs, the def file, and the libs to link against when building the arm64x onnxruntime.dll during the arm64ec build, which is done by passing the /machine:arm64x flag to the linker along with the arm64 artifacts. If other DLLs need to be built as arm64x, setting the ARM64X_TARGETS variable in the toplevel cmakelists.txt to include those targets is all that is needed. Added build_arm64x.bat as a wrapper for the multiple (arm64, then arm64ec) cmake calls needed to build as arm64x. AB#22533	2023-12-06 16:49:00 -08:00
Rachel Guo	7762f3f7c5	[NNAPI EP] Add NNAPI Split (#18702 ) ### Description <!-- Describe your changes. --> As title. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> yolo-v8 model missing operator support. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-12-06 15:11:15 -08:00
Wanming Lin	c4b8120c5b	Rename op elementwiseIf to where (#18657 ) WebNN latest spec uses `where`.	2023-12-06 14:56:26 -08:00
Hector Li	9768a727e1	[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722 ) Fix a bug that can't create context binary if the model has inputs/outputs with different data type ### Description Update EPContext op schema to unblock nodes with different data type among inputs & outputs	2023-12-06 13:07:09 -08:00
Adrian Lizarraga	559bd52252	[QNN EP] Update QNN SDK to version 2.17.0 (#18684 ) ### Description - Update QNN CI Pipelines to use QNN SDK version 2.17.0 - Print warning if unit test requires adjusted tolerance to pass - Temporarily disable unloading QnnCpu.dll for windows x64 due to crash when calling FreeLibrary - Enable fixed HTP tests - QnnHTPBackendTests.LayerNorm1D_LastAxis_DynamicScale - QnnHTPBackendTests.GlobalMaxPool_LargeInput2_u8 - QnnHTPBackendTests.ReduceSumS8Opset13_Rank5 - QnnHTPBackendTests.ReduceSumU8Opset13_Rank5_LastAxis - QnnHTPBackendTests.WhereLargeDataBroadcastU8 - QnnHTPBackendTests.WhereLargeDataBroadcastTransformedU8 - Enabled fixed CPU tests - QnnCPUBackendTests.Resize_DownSample_Linear_AlignCorners_scales - Increased tolerance for HTP tests that are less accurate on QNN SDK 2.17.0 - QnnHTPBackendTests.AveragePool_CountIncludePad_HTP_u8 - QnnHTPBackendTests.AveragePool_AutopadSameUpper_HTP_u8 - QnnHTPBackendTests.AveragePool_AutopadSameLower_HTP_u8 - QnnHTPBackendTests.ConvU8U8S32_bias_dynamic_input - QnnHTPBackendTests.ConvU8U8S32_bias_initializer - QnnHTPBackendTests.ConvU8U8S32_large_input1_padding_bias_initializer - QnnHTPBackendTests.LRNSize3 - QnnHTPBackendTests.LRNSize5 - QnnHTPBackendTests.MaxPool_Large_Input_HTP_u8 - QnnHTPBackendTests.MaxPool_LargeInput_1Pads - QnnHTPBackendTests.Resize_DownSample_Linear_HalfPixel - QnnHTPBackendTests.ResizeU8_2xLinearPytorchHalfPixel - QnnHTPBackendTests.ResizeU8_2xLinearHalfPixel - QnnHTPBackendTests.ResizeU8_2xLinearAlignCorners - QnnHTPBackendTests.ResizeU8_2xLinearAsymmetric - Disabled ONNX model tests - averagepool_2d_ceil: Accuracy issues only on Windows x64 QnnCpu.dll - Disabled QDQ model tests (onnx_test_runner) - facedetection_op8_qdq: Accuracy issues - Disabled CPU EP tests (these use QnnCpu.dll) - ActivationOpTest.Relu: QNN SDK 2.17 Relu treats inf as FLT_MAX - GemmOpTypedTests/0.TestGemmBroadcast: Inaccuracy when weight is initializer and bias is not - MathOpTest.MatMulFloatType 
"test padding and broadcast B > A": Inaccuracy (only linux) - Fix Gemm translation bugs in QNN EP: - Do not skip processing of inputs that need to be transposed. ### Motivation and Context - Allow testing with newest QNN SDK version - Take advantage of improvements to enable new models.	2023-12-06 11:05:41 -08:00
Ye Wang	c012e41f93	MoE with Expert Slicing (#18565 ) ### Description <!-- Describe your changes. --> Registered Sharded MoE op under contrib_op/cuda/collective with expert slicing. The broadcast process happens just before adding second bias(if has) and permutation undoing. Tensor slicing is planned but not included in this PR. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-05 16:56:38 -08:00
petermcaughan	871c52977a	Mistral Optimization & Benchmarking Support (#18225 ) ### Description As a prerequisite for this model running correctly, two PRs need to be merged: - GQA Sliding Window Attention: https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local - MHA Fusion: https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b This PR adds optimization, quantization, and benchmarking support for Mistral. The README included describes steps to export, optimize, and benchmark Mistral models, but won't function correctly without the two above branches being merged first. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com> Co-authored-by: Abhishek Jindal <abjindal@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2023-12-05 15:39:17 -08:00
Jian Chen	c9e558cd36	Adding common python test requirements.txt (#18698 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-05 14:09:43 -08:00
pengwa	4bfa84487c	Skip module clone for preparing large model export (#18663 ) ### Skip module clone for preparing large model export For LLAMA2 13B, when running with LoRA and DeepSpeed stage 2 on 8 GPUs, the run failed while preparing the outputs that will be used for torch.onnx.export. The reason is that we deep copy all the params, including both the large frozen weights and the small set of LoRA trainable weights. This PR first checks whether the GPU memory is enough for a cloned module and, if not, skips the copy. The module is copied to guard against the forward pass changing the weights, which should be rare. For now, Not-Able-To-Run is worse than Runnable-with-A-little-bit-different-initial-weight, especially for large models.	2023-12-05 12:41:17 -08:00
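The memory check described in #18663 can be sketched in a few lines. This is only an illustration of the decision and not the code from the PR: `should_clone_module` is a hypothetical name, and the inputs are plain numbers so the sketch runs without PyTorch. In a PyTorch setting the per-parameter sizes would typically come from `p.numel()` and `p.element_size()`, and the free memory from `torch.cuda.mem_get_info()`.

```python
from typing import Iterable, Tuple


def should_clone_module(param_sizes: Iterable[Tuple[int, int]], free_bytes: int) -> bool:
    """Decide whether a deep copy of the module fits in the available GPU memory.

    Hypothetical helper for illustration only.
    param_sizes yields (element_count, bytes_per_element) for every parameter,
    frozen or trainable, because a deep copy duplicates all of them.
    """
    needed_bytes = sum(count * width for count, width in param_sizes)
    # Clone only when the copy fits; otherwise the caller skips the copy.
    return needed_bytes <= free_bytes


# Example: two fp16 tensors (2 bytes per element) against 1 GiB of free memory.
params = [(400_000_000, 2), (4_000_000, 2)]
print(should_clone_module(params, free_bytes=1 << 30))
```

A real implementation would probably also leave headroom for activations rather than compare against the exact free byte count; that margin is not specified in the commit message.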
Guenther Schmuelling	9aa7284351	fix lint error (#18708 )	2023-12-05 10:37:03 -08:00
cao lei	07aabcc314	Set cuda device before create cuda stream for IOBinding case (#18583 ) ### Description Set the cuda device before creating the cuda stream for the IOBinding case. ### Motivation and Context This fixes issue #18432, in which inference fails in the IOBinding case when there are multiple cuda devices. The reason is that the cuda device is not set properly before the cuda stream is created.	2023-12-05 10:02:21 -08:00
satyajandhyala	70816001cc	[JS/Web] Added Uniforms in GatherElements. (#18670 ) ### Description Use Uniforms in GatherElements and clean up. ### Motivation and Context Improve performance	2023-12-05 09:19:53 -08:00
Xu Xing	f949e0580b	[js/webgpu] Support uniforms for pool (#18656 )	2023-12-05 07:54:30 -08:00
satyajandhyala	10c547516d	[JS/Web] Added CumSum operator to JSEP (#18637 ) ### Description Added CumSum operator ### Motivation and Context Reduce CPU <->GPU data movement.	2023-12-05 07:51:53 -08:00
rui-ren	c14fae9461	add SAVE_TEST_GRAPH macro (#18696 ) ### Description Add a macro `SAVE_TEST_GRAPH` in `graph_transform_test_builder.cc`. ### Motivation and Context This will help us debug the graph and the unit tests. Co-authored-by: ruiren <ruiren@microsoft.com>	2023-12-05 07:46:08 -08:00
zhijiang	2b3050bb0c	Zhijxu/fix toposort (#18705 ) In training, shape/size ops need to be executed as soon as they are ready to run, so that memory can be saved where possible. The toposort logic was enhanced earlier but did not take care of the "shape->size" pattern, which caused the size op that follows to be missing from the toposort result.	2023-12-05 17:36:00 +08:00
Adrian Lizarraga	e066fca777	[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465 ) ### Description #### 1. Adds `TensorQuantOverrides` extra option Allows specifying a dictionary of tensor-level quantization overrides: ``` TensorQuantOverrides = dictionary : Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains a dictionary for each channel in the tensor. Each dictionary contains optional overrides with the following keys and values. 'quant_type' = QuantType : The tensor's quantization data type. 'scale' = Float : The scale value to use. Must also specify `zero_point` if set. 'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set. 'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`. 'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`. 'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. 'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. ``` - All of the options are optional. - Some combinations are invalid. - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale` are also specified. 
Example for per-tensor quantization overrides: ```Python3 extra_options = { "TensorQuantOverrides": { "SIG_OUT": [{"scale": 1.0, "zero_point": 127}], "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], }, } ``` Example for per-channel quantization overrides (Conv weight and bias): ```Python3 extra_options = { "TensorQuantOverrides": { "WGT": [ { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.0, "rmax": 2.5, "reduce_range": True, }, { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.2, "rmax": 2.55, "reduce_range": False, }, ], "BIAS": [ {"zero_point": 0, "scale": 0.000621}, {"zero_point": 0, "scale": 0.23}, ], }, } ``` #### 2. Adds utilities to get the default QDQ configs for QNN EP Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method that inspects the model and returns suitable quantization configurations. Example usage: ```python3 from quantization import quantize, QuantType from quantization.execution_providers.qnn import get_qnn_qdq_config qnn_config = get_qnn_qdq_config(input_model_path, data_reader, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8) quantize(input_model_path, output_model_path, qnn_config) ``` ### Motivation and Context Make it possible to create more QDQ models that run on QNN EP. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-04 17:54:58 -08:00
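The validity rules listed for `TensorQuantOverrides` in #18465 can be checked mechanically. The following is a minimal sketch under stated assumptions: `check_tensor_quant_overrides` is a hypothetical helper, not a function from onnxruntime's quantization package, and it encodes only the rules quoted in the commit message (`scale` and `zero_point` go together; `symmetric`, `reduce_range`, `rmax` and `rmin` are invalid alongside them).

```python
from typing import Any, Dict, List

# Keys listed in the commit message for each override dictionary.
ALLOWED_KEYS = {"quant_type", "scale", "zero_point", "symmetric", "reduce_range", "rmax", "rmin"}
# Keys the message marks as invalid when `scale` or `zero_point` is also set.
CONFLICTS_WITH_SCALE_ZP = {"symmetric", "reduce_range", "rmax", "rmin"}


def check_tensor_quant_overrides(overrides: Dict[str, Any]) -> List[str]:
    """Hypothetical checker: return a list of problems, empty if no rule is violated."""
    problems: List[str] = []
    for tensor_name, entries in overrides.items():
        if not isinstance(entries, list):
            problems.append(f"{tensor_name}: value must be a list of dictionaries")
            continue
        # One entry for per-tensor quantization, one entry per channel otherwise.
        for index, entry in enumerate(entries):
            where = f"{tensor_name}[{index}]"
            unknown = set(entry) - ALLOWED_KEYS
            if unknown:
                problems.append(f"{where}: unknown keys {sorted(unknown)}")
            has_scale = "scale" in entry
            has_zero_point = "zero_point" in entry
            if has_scale != has_zero_point:
                problems.append(f"{where}: 'scale' and 'zero_point' must be set together")
            if has_scale or has_zero_point:
                clash = CONFLICTS_WITH_SCALE_ZP & set(entry)
                if clash:
                    problems.append(f"{where}: {sorted(clash)} invalid with 'scale'/'zero_point'")
    return problems


# Usage: the second tensor breaks two of the rules above.
overrides = {
    "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
    "WGT": [{"scale": 0.5, "symmetric": True}],
}
for problem in check_tensor_quant_overrides(overrides):
    print(problem)
```

Returning a list of messages rather than raising on the first problem is a design choice for this sketch; how the quantization tool itself reports invalid combinations is not stated in the commit message.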
Tianlei Wu	01b5c78917	Add SD-Turbo and refine diffusion demo (#18694 ) [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast generative text-to-image model that is distilled from [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It is targeted for 512x512 resolution. 1. Support sd-turbo model. 1. Refine ControlNet in demo + Cache the ControlNet model so that it is downloaded only once. + Do not download default images in script. Instead update document to use wget to download example image. + Fix an issue of control image processing that causes shape mismatch in inference. 1. Refine arguments: + Change argument --disable-refiner to --enable-refiner since refiner is not used in most cases + Rename --refiner-steps to --refiner_denoising_steps + Add abbreviations for most used arguments. + Add logic to set default arguments for different models. 1. Refine torch model cache: + Share cached torch model among different engines to save disk space. + Only download fp16 model (previously, ORT_CUDA downloads fp32 model). 1. Do not use vae slicing when image size is small. 1. For LCM scheduler, allow guidance scale 1.0~2.0. 2. Allow sdxl-turbo to use refiner ### Performance Test Results Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on A100-SXM4-80GB. Batch \| Steps \| TRT 8.6 static \| ORT_TRT static \| ORT_CUDA static \| TRT 8.6 dynamic \| ORT_TRT dynamic \| ORT_CUDA dynamic -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 1 \| 1 \| 32.07 \| 30.55 \| 32.89 \| 36.41 \| 38.30 \| 34.83 4 \| 1 \| 125.36 \| 97.40 \| 97.49 \| 118.24 \| 114.95 \| 99.10 1 \| 4 \| 62.29 \| 60.24 \| 62.50 \| 72.49 \| 77.82 \| 67.66 4 \| 4 \| 203.51 \| 173.11 \| 168.32 \| 217.14 \| 215.71 \| 172.53 * Dynamic engine is built for batch size 1 to 8, image size 512x512 to 768x768, optimized for batch size 1 and 512x512	2023-12-04 16:03:47 -08:00
Edward Chen	d514a960ee	Remove "Python Checks" pipeline status from readme as that pipeline no longer exists. (#18697 )	2023-12-04 13:38:36 -08:00
Caroline Zhu	c02a386145	[js/web/training] Implemented runEvalStep & runOptimizerStep (#18259 ) ### Description * implemented runEvalStep and runOptimizerStep * added hasEvalModel and hasOptimizerModel boolean fields in TrainingSession representation * added evalInputNames and evalOutputNames fields to TrainingSessionHandler & TrainingSession * removed the inputNamesEncoded and outputNamesEncoded fields from TrainingSessionHandler -- since none of the training methods require the input names and output names as parameters, there's no need to store them. ### Motivation and Context * part of the work for implementing web bindings for training * previous PR: #18250 --------- Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-12-04 13:37:14 -08:00
Jiajia Qin	5353adcde3	[js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658 ) ### Description With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1, 1, 16, 16] becomes 0.59ms from 6.64ms.	2023-12-04 13:18:37 -08:00
trajep	a5b2291e0f	[Transformer Optimization]Return model directly for unknown model type (#18642 ) This pull request improves the handling of unsupported model types in the optimization process.	2023-12-04 12:26:50 -08:00
Deoksang Kim	2f8b86b939	Fix typo in the TensorShape (#17813 ) The function name in the log should be SizeToDimension	2023-12-01 16:48:55 -08:00
Jiajia Qin	92ee664f64	[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661 ) ### Description Currently, for non-uniform variables, we still use the `array<u32, N>` type instead of `array<vec4<u32>, N1>`. So we can't always treat all variables with rank > 4 as uniforms to index. This PR fixes the errors below: ``` error(s) generated while compiling the shader: :5:44 error: index 4 out of bounds [0..1] return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]); ^ FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32 FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size() ```	2023-12-01 15:35:35 -08:00
Changming Sun	eaaf27015e	Remove EnvSetupScript parameter from win-ci.yml (#18662 ) ### Description To make the code more consistent. Now some TRT pipelines download TRT binaries on-the-fly, while other TRT pipelines use a preinstalled version. This PR make them the same.	2023-12-01 15:30:16 -08:00
Rachel Guo	9c45fe4957	Fix macos xcframework test stage codesign info (#18649 ) ### Description Remove the development id and set codesign to not required in the test macos target. ### Motivation and Context Fix a failure that happened in the iOS_Full_xcframwork stage of the Zip-Nuget-Java-NodeJS packaging pipeline. --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-12-01 14:47:46 -08:00
Edward Chen	a353805631	Fix Windows TVM CI workflow (#18667 ) Fix issue with installing LLVM dependency.	2023-12-01 13:49:45 -08:00
Edward Chen	b22f49ff35	Fix unit tests failures in build with contrib ops disabled (#18659 ) Fix unit tests failures in build with contrib ops disabled. - QDQTransformerTests.QDQPropagation_GH11605_Opset12_19 - TransposeOptimizerTests.QnnTransposeNonConstBroadcastInput	2023-12-01 09:41:25 -08:00
Bowen Bao	fcea2cb7f1	[Dort] Run type promotion pass to resolve dtype discrepancy (#18516 ) Fixes CI failures mentioned in #18507 But we should not keep two separate dort impls in both pytorch and onnxruntime. They are out of sync.	2023-12-01 09:36:18 -08:00
snadampal	05a9c95764	[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847 ) ### Description Add ACL as the DNNL runtime option for aarch64 platforms. Update makefile and the python wheel build script. ### Motivation and Context This is to enable the optimized ACL gemm kernels for dnnl execution provider on aarch64 platform.	2023-12-01 09:16:44 -08:00
Jian Chen	d69842226b	Update the template files to correct stage to fix the python cuda 12 packaging pipeline (#18651 )	2023-12-01 07:57:46 -08:00


10136 commits