onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-31 23:27:43 +00:00

Author	SHA1	Message	Date
Changming Sun	c7799d7058	Build fixes for Windows ARM32 desktop build (#18752 ) ### Description Fix a link error: ``` onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "private: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "public: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\yilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in function "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function _tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved externals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] ```	2023-12-08 12:45:06 -08:00
Changming Sun	bf33919afb	Update absl and gtest to fix an ARM64EC build error (#18735 ) ### Description Update absl and gtest to fix an ARM64EC build error ### Motivation and Context We need to get an important fix into ORT. The fix is: `8028a87c96`	2023-12-07 15:55:17 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937 ) ### Description Hello, we (@lixing-star) are developers on the Loongson team. We added 128-bit (LSX) and 256-bit (LASX) vector optimization code for the LoongArch architecture. [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environment ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LoongArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
moyo1997	9479ba525b	Build onnxruntime.dll as arm64x (#18633 ) Added a .cmake file to generate a link repro of onnxruntime.dll during the arm64 build. This gives us a directory containing all the arm64 objs, the def file and the libs to link against when building the arm64x onnxruntime.dll during the arm64ec build, which is done by passing the /machine:arm64x flag to the linker along with the arm64 artifacts. If other DLLs need to be built as arm64x, adding those targets to the ARM64X_TARGETS variable in the top-level CMakeLists.txt is all that is needed. Added build_arm64x.bat as a wrapper for the multiple cmake calls (arm64, then arm64ec) needed to build as arm64x. AB#22533	2023-12-06 16:49:00 -08:00
Ye Wang	c012e41f93	MoE with Expert Slicing (#18565 ) ### Description Registered Sharded MoE op under contrib_op/cuda/collective with expert slicing. The broadcast happens just before adding the second bias (if present) and undoing the permutation. Tensor slicing is planned but not included in this PR.	2023-12-05 16:56:38 -08:00
Adrian Lizarraga	e066fca777	[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465 ) ### Description #### 1. Adds `TensorQuantOverrides` extra option Allows specifying a dictionary of tensor-level quantization overrides: ``` TensorQuantOverrides = dictionary : Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains a dictionary for each channel in the tensor. Each dictionary contains optional overrides with the following keys and values. 'quant_type' = QuantType : The tensor's quantization data type. 'scale' = Float : The scale value to use. Must also specify `zero_point` if set. 'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set. 'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`. 'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`. 'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. 'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. ``` - All of the options are optional. - Some combinations are invalid. - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale` are also specified.
Example for per-tensor quantization overrides: ```Python3 extra_options = { "TensorQuantOverrides": { "SIG_OUT": [{"scale": 1.0, "zero_point": 127}], "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], }, } ``` Example for per-channel quantization overrides (Conv weight and bias): ```Python3 extra_options = { "TensorQuantOverrides": { "WGT": [ { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.0, "rmax": 2.5, "reduce_range": True, }, { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.2, "rmax": 2.55, "reduce_range": False, }, ], "BIAS": [ {"zero_point": 0, "scale": 0.000621}, {"zero_point": 0, "scale": 0.23}, ], }, } ``` #### 2. Adds utilities to get the default QDQ configs for QNN EP Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method that inspects the model and returns suitable quantization configurations. Example usage: ```python3 from quantization import quantize, QuantType from quantization.execution_providers.qnn import get_qnn_qdq_config qnn_config = get_qnn_qdq_config(input_model_path, data_reader, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8) quantize(input_model_path, output_model_path, qnn_config) ``` ### Motivation and Context Make it possible to create more QDQ models that run on QNN EP. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-04 17:54:58 -08:00
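The combination rules stated in the `TensorQuantOverrides` description can be checked mechanically before a quantization run. The following is a minimal sketch of such a check, written only from the rules quoted in this commit message. The function names `check_override` and `check_tensor_quant_overrides` are hypothetical and are not part of the onnxruntime quantization API.

```python
# Hypothetical validator: names and messages are illustrative,
# not part of onnxruntime.quantization.
DERIVED_KEYS = ("symmetric", "reduce_range", "rmax", "rmin")


def check_override(override):
    """Return a list of problems found in one per-tensor or per-channel override."""
    problems = []
    has_scale = "scale" in override
    has_zero_point = "zero_point" in override
    # 'scale' and 'zero_point' must be given together.
    if has_scale != has_zero_point:
        problems.append("'scale' and 'zero_point' must be specified together")
    # Keys that steer the computed scale/zero-point conflict with explicit values.
    if has_scale or has_zero_point:
        for key in DERIVED_KEYS:
            if key in override:
                problems.append(f"'{key}' is invalid when 'scale' or 'zero_point' is set")
    return problems


def check_tensor_quant_overrides(overrides):
    """Map each tensor name to the problems found across its override dictionaries."""
    report = {}
    for tensor_name, entries in overrides.items():
        found = [p for entry in entries for p in check_override(entry)]
        if found:
            report[tensor_name] = found
    return report


overrides = {
    "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],          # valid
    "WGT": [{"scale": 0.5}],                                  # missing zero_point
    "BIAS": [{"scale": 0.1, "zero_point": 0, "rmax": 2.5}],   # rmax conflicts
}
print(check_tensor_quant_overrides(overrides))
```

Only the tensors with at least one problem appear in the report, so an empty dictionary means the overrides satisfy the rules listed above.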
snadampal	05a9c95764	[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847 ) ### Description Add ACL as the DNNL runtime option for aarch64 platforms. Update the makefile and the Python wheel build script. ### Motivation and Context This enables the optimized ACL GEMM kernels for the DNNL execution provider on aarch64 platforms.	2023-12-01 09:16:44 -08:00
George Wu	5c67a00d8e	Revert "remove full protobuf requirement for tensorrt ep" (#18626 ) Reverts microsoft/onnxruntime#18413. There's a timing issue here: we eventually want to get this change merged in, but we need to update OSS onnx-tensorrt first.	2023-11-29 22:27:51 -08:00
Edward Chen	14a343441d	Fix Objective-C static analysis build (#18606 ) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings.	2023-11-28 17:14:20 -08:00
Rachel Guo	288b80d363	Add MacOS build to ORT C Pod (#18550 ) ### Description As title. 1. Add the macOS build as an optionally enabled arch for the pod, with changes to the existing build_ios_framework/assemble_c_pod scripts. 2. Enable the macOS build arch in the iOS packaging pipeline (currently for variants other than Mobile) and check that the output artifacts are correct. 3. Write a macOS test target scheme in the test app and integrate it into the iOS packaging CI testing pipeline. Currently the changes only apply to the onnxruntime-c pod, as the original request was from ORT SPM, which consumes only the onnxruntime-c pod as the binary target. TODO: could look into adding the macOS platform to the objc pod as well. ### Motivation and Context Enable macOS platform support in CocoaPods, and potentially produce a binary target for enabling the macOS platform in SPM as well. Replaces https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-28 10:11:53 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for the block q4 x fp16 GEMM CUDA kernel, for SM80 hardware. ### Motivation and Context Preparing for the addition of a Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated rearrangement of the quantized weights to speed up loading data into the tensor cores. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitates iterating over matrix elements and tiles while balancing memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weights, scales and offsets (aka prepacking). 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on the CPU, single-threaded. We run this on the CPU in order to minimize GPU memory fragmentation. Hopefully we will get around to parallelizing this part of the code, which should be straightforward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Sheil Kumar	0b7048e7d6	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384 ) --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-28 09:26:48 -08:00
cloudhan	6f3c1f9dc9	[ROCm] Update ck for GemmFloat8 (#18487 )	2023-11-23 12:06:19 +08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, but the ORT backend emits so many INFO-level logs that users have a hard time finding that information. So we want a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can run the analysis after an ORT session is created, where the allocation plan is done, so the analysis considers buffer reuse as well. This can avoid suggesting some recomputation subgraphs that reuse other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactorings: 1. Print the log in the ORTModule Python script, along with the ORTModule feature-enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. Analyze memory optimization opportunities with ORT memory planning taken into account. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show the stashed activations that reuse other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies so that each has its own implementation. This makes introducing new strategies (for example compression and decompression) easier.
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0*inputs_input_ids_dim0*inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0*inputs_input_ids_dim0*inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
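The relationship between the symbolic savings, the first-batch dimension values from Note 2, and the `Saving(B)` column in the log can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it reads the symbolic expressions as ordinary products and powers (an assumption about the flattened log text), and the helper name `saving_bytes` is hypothetical. The final lines show how several plan configs are joined with commas, as Note 1 describes.

```python
import os

# First-batch symbolic dimension values reported in Note 2 of the log.
dim0 = 1      # inputs_input_ids_dim0
dim1 = 1024   # inputs_input_ids_dim1

# Config string -> symbolic saving in bytes, transcribed from three plans.
# Reading the expressions as products and powers is an assumption.
plans = {
    "Reshape+Where+BiasSoftmax+:1:-1": lambda d0, d1: 640.0 * d0 * d1 ** 2,
    "Cast+:1:-1": lambda d0, d1: d0 * d1 * (384.0 * d1 - 64.0),
    "BiasGelu+:1:-1": lambda d0, d1: d0 * (122880.0 * d1 - 20480.0),
}


def saving_bytes(config):
    """Evaluate the symbolic saving of one plan for the first-batch dims."""
    return int(plans[config](dim0, dim1))


for config in plans:
    print(f"{config:35s} {saving_bytes(config):>13,d} bytes")

# Note 1: enable several plans at once by joining their configs with commas.
enabled = ["Reshape+Where+BiasSoftmax+:1:-1", "Cast+:1:-1"]
os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = ",".join(enabled)
```

Setting the variable from Python only affects the current process, so it would need to happen before ORTModule is initialized; the log itself suggests `export` in the shell.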
Dmitri Smirnov	cc542024ce	Create edges with arg positions correctly accounting for non-existing args (#18462 ) ### Description Truncate trailing non-existing arguments. Make sure we do not skip the non-existing arguments in the middle, because shape inference relies on their proper position. This also affects the argument position in the Edges, which must be properly rebuilt each time an If node branch is inlined. Make sure that when we rename Defs in subgraphs, new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow-up to https://github.com/microsoft/onnxruntime/pull/18105 Currently, non-existing arguments in non-trailing positions are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
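The positional rule described in this commit can be illustrated in a few lines. This is a minimal sketch, not the ORT graph code: it assumes missing optional arguments are represented by empty strings (the ONNX convention for node inputs), and the helper names `trim_trailing_missing` and `edge_positions` are hypothetical.

```python
def trim_trailing_missing(arg_names):
    """Drop missing (empty-string) arguments from the end of the list only."""
    end = len(arg_names)
    while end > 0 and arg_names[end - 1] == "":
        end -= 1
    return arg_names[:end]


def edge_positions(arg_names):
    """Return (slot index, name) pairs for the arguments that exist.

    Missing arguments in the middle keep their slot, so the slot index of
    every later argument is unchanged.
    """
    trimmed = trim_trailing_missing(arg_names)
    return [(index, name) for index, name in enumerate(trimmed) if name]


inputs = ["X", "", "scale", "", ""]
print(trim_trailing_missing(inputs))  # the two trailing gaps are removed
print(edge_positions(inputs))         # "scale" stays in slot 2, not slot 1
```

Compacting the list instead (dropping every empty string) would move `scale` into slot 1, which is the kind of incorrect edge position the commit message describes.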
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
Ashwini Khade	02333293de	Removed all the deprecated python training code and related tests and utils (#18333 ) ### Description Motivation for this PR is code cleanup. 1. Remove all deprecated python code related to orttrainer, old checkpoint, related tests and utils 2. Cleanup orttraining_pybind_state.cc to remove all deprecated bindings.	2023-11-17 18:19:21 -08:00
George Wu	d73073d491	remove full protobuf requirement for tensorrt ep (#18413 ) tensorrt can work with protobuf lite.	2023-11-16 20:44:27 -08:00
Yulong Wang	6f9f653ada	[wasm] increase test max memory from 2G to 4G (#18459 ) ### Description increase max memory from 2G to 4G for onnxruntime_test_all in WebAssembly build.	2023-11-15 17:51:04 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
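For readers unfamiliar with this kind of kernel, the M=1 case (a single row of float A times a 4-bit quantized B) can be written as a plain reference computation. The sketch below is not the MLAS implementation or its data layout. It assumes unsigned 4-bit values packed two per byte with the low nibble first, and one scale and zero point per block of consecutive K entries in each column; all names are hypothetical.

```python
def unpack_int4(packed, count):
    """Unpack `count` unsigned 4-bit values from bytes, low nibble first."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)
        values.append(byte >> 4)
    return values[:count]


def gemv_q4(a, packed_cols, scales, zero_points, block_size):
    """Compute c = a @ B for a float row `a` (length K) and int4 matrix B (K x N).

    packed_cols[n] holds column n of B as packed nibbles. scales[n][b] and
    zero_points[n][b] apply to block b (block_size consecutive k values)
    of column n. This layout is an assumption made for illustration.
    """
    k = len(a)
    out = []
    for n, packed in enumerate(packed_cols):
        q = unpack_int4(packed, k)
        acc = 0.0
        for i in range(k):
            block = i // block_size
            # Dequantize on the fly: (q - zero_point) * scale.
            acc += a[i] * (q[i] - zero_points[n][block]) * scales[n][block]
        out.append(acc)
    return out


# One column holding the 4-bit values [8, 9, 10, 11], packed into two bytes.
column = bytes([0x98, 0xBA])
print(gemv_q4([1.0, 2.0, 3.0, 4.0], [column], [[0.5]], [[8]], block_size=4))
```

An optimized kernel would vectorize the unpacking and accumulate per block rather than per element, but the arithmetic it has to reproduce is the same.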
Ye Wang	f9af94009b	onboard MoE (#18279 ) ### Description 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come with another PR. Limitation: __CUDA_ARCH__ >= 700	2023-11-14 16:48:51 -08:00
PeixuanZuo	a62a500ae1	[ROCm] Update CK version (#17628 ) update ck version	2023-11-13 15:43:38 -08:00
Scott McKay	8d298f6f78	Fix xnnpack compile error on arm32 (#18291 ) ### Description Use a different march flag to work around what appears to be a clang issue. See https://github.com/tensorflow/tensorflow/issues/59970 for links to various relevant pieces of info/discussions.	2023-11-12 08:59:20 +10:00
Scott McKay	64c91d790b	Fix ability to use patch on Windows CI machines (#18356 ) ### Description Add 32-bit patch binary and infra to fall back to it. The Azure DevOps Windows CIs are missing patch.exe from their git install for some reason, so the default `find_package(Patch)` fails, as that is where it expects to find it. Remove Eigen patch. The underlying issue was fixed in source 3 years ago by `c6c84ed961`, and the patch command is invalid (args are for git apply, not patch). ### Motivation and Context Make usage of patch consistent across all CIs. Fix https://github.com/microsoft/onnxruntime/issues/15248	2023-11-11 07:32:14 +10:00
Bart Verhagen	87744e55fa	fix reference to Microsoft.GSL::GSL in CMake build scripts when enabling cuda (#17843 ) ### Description Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the GSL package that is found on the system is used. However, when cuda is enabled, it is downloaded and patched. Most CMake scripts rely on the first case and forget about the second. This patch makes the second case behave like the first case. ### Motivation and Context This is an issue that occurs 'in the wild'. For example, I had to patch this to be able to enable the CUDA provider for the onnxruntime conan package (see https://github.com/conan-io/conan-center-index/pull/20392).	2023-11-10 10:46:45 -08:00
Changming Sun	812532592e	Add a build validation for Linux ARM64 cross-compile (#18200 ) ### Description 1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch issues listed in #18195 . 2. Revert eigen's commit id back to what we had before. ### Motivation and Context To catch cross-compile issues. Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639	2023-11-08 13:03:18 -08:00
Dmitri Smirnov	a37e6a503b	Update Abseil raw_flat_hash visualization (#18329 ) ### Description Fix the broken pieces due to the latest Abseil update. ### Motivation and Context Make the debugging bearable.	2023-11-08 11:19:45 -08:00
Wei-Sheng Chin	fb6737e893	Distributed Squeeze and Distributed Unsqueeze (#18269 ) Implement DistributedSqueeze & DistributedUnsqueeze for llama 2.	2023-11-06 20:11:35 -08:00
Yi Zhang	b7b8b5b2ce	Fix Eigen-3.4.0 URL and hash (#18290 ) ### Description Add CI changes for #18287. Install onnx explicitly to pass the Windows GPU+dml stage. ### Motivation and Context 'eigen-3.4' was referring to a branch, not to a tag. There is now an Eigen 3.4.1 on that branch, and thus the hash has changed. See https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416	2023-11-06 09:19:51 -08:00
Chi Lo	dfafcb58aa	[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274 ) https://github.com/microsoft/onnxruntime/pull/17468 The above PR didn't fully fix the issue for some environments. This PR fixes this.	2023-11-03 20:04:10 -07:00
Scott McKay	4f2096be38	Update XNNPACK to latest version (#18038 ) ### Description Update XNNPACK to latest version - adds fp16 kernels and various other improvements - requires pthreadpool update as well Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API - 'setup' is split into 'reshape' and 'setup' - some ops use a workspace buffer - copied workspace allocation from XNNPACK unit test code - some suffixes changed Added wrapper for XNNPACK caches to base XNNPACK EP kernel - simplifies usage - XNNPACK split out the code and weights caches, but the code cache isn't currently usable via the public API - we could use the internal types if we think it's required for performance reasons. non-trivial though as we'd need to propagate ifdef values from the XNNPACK build up to the ORT build. - using XNNPACK internals would also mean we would not be able to support using a pre-built XNNPACK package - not an issue currently Fixed opset registration for internal NHWC domain - was not being tied to the ONNX version, so nodes inserted by layout transformation had the incorrect opset - a number of other places needed updating once this issue was fixed Remove support for NCHW Resize from XNNPACK EP so it's NHWC only - we only supported NCHW for fp32, - doing so adds complexity in multiple places (XNNPACK EP kernel implementation, layout transformation and transpose optimization) - unclear if that complexity provides any benefit. can add back if required by production scenario ### Motivation and Context We're looking at enabling fp16 support for CoreML and NNAPI. If we do that we need a good fallback story if the CPU EP will be used. The XNNPACK fp16 kernels will hopefully provide that. NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels.
That can be done as required in separate EPs and should be relatively simple to do.	2023-11-03 09:04:28 -07:00
Scott McKay	016b75260b	Pre-link when creating static library for apple framework (#18241 ) ### Description Pre-link with `ld -r` to apply symbol visibility when the static library is created to replicate Xcode's Single Object Pre-link. Current builds set the visibility flags but that doesn't get applied until the static library is linked into something else, which can be too late. Pre-linking fixes this. The pre-link uses the .o files from the ORT static libraries and the .a files from external libraries. This combination limits the symbols included from the .a files to things required by the ORT .o files. In order to minimize changes elsewhere in the build we extract the .o files from the ORT static libraries using `ar -x`. Re-ordered the pieces used to build the Apple framework to make it a little more readable. Fixed a couple of misc issues with missing symbols from the minimal build that show up when pre-linking is applied. ### Motivation and Context Will hopefully address #17722	2023-11-03 23:38:29 +10:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
Wei-Sheng Chin	9e8ad39847	Distributed Reduction (#18206 ) This PR implements distributed reduction for llama 2. This version doesn't consider any cases requiring re-sharding because we haven't seen any use cases. Intuitive examples: - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1] Algorithm: When the reduced axes are not sharded, each device can call reduction directly. The output sharding spec will be identical to the input sharding spec. We currently throw when input and output sharding specs are different. Review guideline: - Check 97b8d2f for the new op's schema and how the new op is registered. - Read tests in 2450f93 to get familiar with the behavior of these ops. - Check the implementation details in 753d9af.	2023-11-01 08:49:33 -07:00
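The algorithm in the Distributed Reduction message above (reduce locally when the reduced axes are not sharded, throw otherwise) can be illustrated with a small NumPy simulation. This is only a sketch under stated assumptions: the helper names `shard` and `distributed_reduce_sum` are hypothetical, devices are simulated as list entries, and a sum stands in for the generic `Reduce`. It is not the ONNX Runtime implementation.

```python
import numpy as np


def shard(tensor, axis, num_devices):
    """Split a logical tensor evenly along `axis`, one piece per simulated device."""
    return np.split(tensor, num_devices, axis=axis)


def distributed_reduce_sum(local_shards, reduce_axes, sharded_axis):
    """Reduce each local shard independently (hypothetical helper).

    Mirrors the rule in the commit message: if a reduced axis is sharded,
    the output spec would differ from the input spec, so we raise.
    """
    if sharded_axis in reduce_axes:
        raise NotImplementedError("reducing a sharded axis would need re-sharding")
    return [s.sum(axis=tuple(reduce_axes), keepdims=True) for s in local_shards]


# [2,4,6] tensor with spec RRS[0] on device_mesh=[0,1]: only axis 2 is sharded.
full = np.arange(2 * 4 * 6, dtype=np.float32).reshape(2, 4, 6)
shards = shard(full, axis=2, num_devices=2)

# Reduce(axes=[0]) runs locally on each device; no communication is needed.
local_out = distributed_reduce_sum(shards, reduce_axes=[0], sharded_axis=2)

# Gathering along the sharded axis reproduces the unsharded result.
gathered = np.concatenate(local_out, axis=2)
expected = full.sum(axis=0, keepdims=True)
```

Because the sharded axis is untouched, concatenating the local outputs along it gives the same tensor as reducing the logical input, which is why the output sharding spec can stay identical to the input spec.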
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
liqun Fu	20f2dd8b6b	use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177 )	2023-10-31 14:58:21 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2]) -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8]) -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same as the input sharding spec. - Expanding always happens on an axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half of the tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` each with target shape [4, 2], these two local tensors (shape: [4, 2]) are exactly the same as the ones described by the output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). These changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds the real implementation.	2023-10-28 00:44:02 -07:00
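The three-step algorithm in the Distributed Expand message above can be sketched with NumPy on the first example (shape [8, 1], sharded along axis 0 over two devices). This is a minimal illustration, not the actual kernel: `sharded_shape` and `distributed_expand` are hypothetical names, devices are simulated as list entries, and `np.broadcast_to` is used as a simplified stand-in for ONNX `Expand`.

```python
import numpy as np


def sharded_shape(logical_shape, sharded_axis, num_devices):
    """Per-device shape of a tensor sharded evenly along `sharded_axis`."""
    shape = list(logical_shape)
    if shape[sharded_axis] % num_devices != 0:
        raise ValueError("sharded axis must divide evenly across devices")
    shape[sharded_axis] //= num_devices
    return tuple(shape)


def distributed_expand(local_input, logical_target_shape, sharded_axis, num_devices):
    """Broadcast one device's shard to its sharded output shape (hypothetical helper)."""
    local_target = sharded_shape(logical_target_shape, sharded_axis, num_devices)
    # Purely local computation: no data from other devices is required.
    return np.broadcast_to(local_input, local_target)


# Logical input [8, 1] with spec S[0]R on device_mesh=[0,1].
full = np.arange(8, dtype=np.float32).reshape(8, 1)
shards = np.split(full, 2, axis=0)  # each simulated device holds a [4, 1] piece

# Expand to logical target [8, 2]; each device produces a [4, 2] piece.
local_out = [distributed_expand(s, (8, 2), sharded_axis=0, num_devices=2) for s in shards]

gathered = np.concatenate(local_out, axis=0)
expected = np.broadcast_to(full, (8, 2))
```

Concatenating the per-device outputs along the sharded axis matches expanding the logical tensor directly, which is consistent with the observation that the output spec equals the input spec.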
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Wei-Sheng Chin	9c32310673	Distributed Reshape Implementation (#18068 ) This DistributedReshape aims at supporting all sharding patterns encountered in llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below. - All inputs and outputs are replicated, so it's computed like a normal Reshape. - Two-axis fusion (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`. - Two-axis decomposition (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]`. Review guideline: - Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR #18025. - First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape. - Second, check the new APIs in reshape.h/reshape.cc to expose CUDA Reshape kernel to DistributedReshape. - For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.	2023-10-26 22:33:42 -07:00
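The two-axis fusion and decomposition categories named in the Distributed Reshape message above can be illustrated for one simple case, where the leading `batch` axis is sharded. The sketch below is an assumption-laden simulation in NumPy (the sizes, the choice of sharded axis, and the list-of-shards device model are all illustrative); the actual op handles more sharding patterns than this.

```python
import numpy as np

batch, seq, hidden = 4, 3, 2
num_devices = 2

# Logical [batch, seq, hidden] tensor, sharded along batch (axis 0).
full = np.arange(batch * seq * hidden, dtype=np.float32).reshape(batch, seq, hidden)
shards = np.split(full, num_devices, axis=0)  # each piece: [batch/2, seq, hidden]

# Two-axis fusion: every device reshapes its own piece to [(batch/2) x seq, hidden].
fused_local = [s.reshape(-1, hidden) for s in shards]
fused_gathered = np.concatenate(fused_local, axis=0)

# Two-axis decomposition: the reverse, again purely local on each device.
decomposed_local = [f.reshape(-1, seq, hidden) for f in fused_local]
decomposed_gathered = np.concatenate(decomposed_local, axis=0)
```

In this case the local reshapes agree with reshaping the logical tensor because the sharded axis is the outermost one in row-major order, so each device's piece is a contiguous block of the fused axis.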
Vincent Wang	b7408f7389	[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959 ) This PR is to support efficient attention and flash attention in ORTModule, including: - Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable. - Integrate Triton Flash attention, which requires triton==2.0.0.dev20221202. Need A100 or H100. ORTMODULE_USE_FLASH_ATTENTION=1 to enable. - A Python transformer tool to match sub-graphs by config and write transformers quickly. Current transformers support attention mask for both efficient attn and flash attn, and dropout for efficient attn only. To support more training scenarios (such as causal mask in GPT2), more transformers need to be added. The feature is guarded by system environment variables, so it won't affect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests are not added for now.	2023-10-27 10:29:27 +08:00
Chi Lo	455a9ce614	[TensorRT EP] Use latest onnx-tensorrt parser (#18067 ) Use latest onnx-tensorrt to fix compile error. Please see the issue https://github.com/microsoft/onnxruntime/issues/18029	2023-10-26 13:55:12 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
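The "naive implementation" described in the MatMulBnb4 message above, `MatMul(A, Dequantize(B))`, can be sketched with NumPy. Everything below is an illustrative assumption rather than the contrib op itself: the 16-entry `CODEBOOK` is a uniform placeholder and not the real FP4 or NF4 value table, the helper names are hypothetical, scaling uses a per-block absolute maximum, and codes are stored one per byte instead of being packed two per byte.

```python
import numpy as np

# Placeholder 16-level codebook in [-1, 1]; NOT the actual FP4/NF4 tables.
CODEBOOK = np.linspace(-1.0, 1.0, 16, dtype=np.float32)


def quantize_blockwise(weights, block_size):
    """Map each weight to the index of its nearest codebook entry, block by block."""
    blocks = weights.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0  # avoid dividing by zero for all-zero blocks
    normalized = blocks / absmax
    codes = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1).astype(np.uint8)
    return codes, absmax


def dequantize_blockwise(codes, absmax, shape):
    """Look up codebook values and rescale each block by its absmax."""
    return (CODEBOOK[codes] * absmax).reshape(shape)


def matmul_4bit_naive(a, codes, absmax, b_shape):
    """Naive path: dequantize the whole weight matrix, then run a regular MatMul."""
    return a @ dequantize_blockwise(codes, absmax, b_shape)


rng = np.random.default_rng(0)
a = rng.standard_normal((2, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)

codes, absmax = quantize_blockwise(b, block_size=4)
out = matmul_4bit_naive(a, codes, absmax, b.shape)
b_restored = dequantize_blockwise(codes, absmax, b.shape)
```

The naive path materializes the full dequantized weight matrix before multiplying, which is simple but memory-hungry; that is presumably why the same PR also lists a special GemV implementation.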
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux. ### Motivation and Context More context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160 ) ### Description This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This covers (i) symmetric quantization (zero point is zero), (ii) asymmetric quantization (zero point is non-zero), (iii) per-channel as well as per-tensor quantization, (iv) signed weights (U8S8 Gemm), (v) unsigned weights (U8U8 Gemm), and (vi) signed activations and weights (S8S8 Gemm) scenarios. I've enabled the ummla/smmla kernels based on a cpuinfo check for `I8MM` support. MMLA QGEMM kernels are enabled for all the devices that support I8MM instructions. ### Motivation and Context This is to improve INT8 quantized MatMul performance on the aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed up to 1.33x performance improvement compared to the optimized UDOT qgemm kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests, and made sure all are passing ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
liqun Fu	020824ed50	Update ONNX to 1.15.0rc1 (#17914 )	2023-10-20 15:08:25 -07:00
Hariharan Seshadri	9356986730	Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972 ) ### Description This PR: (1) Fixes AMD builds after #17200 broke them (Need to remember to run AMD builds while trying to merge external CUDA PRs next time) (2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time spent in building a few more files and running a few more tests will not be much. Test Linux GPU CI run : https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770 ### Motivation and Context Keep the NHWC CUDA ops tested (https://github.com/microsoft/onnxruntime/pull/17200) and guard against regressions	2023-10-17 09:23:52 -07:00
Maximilian Müller	7c17e33c07	Make CUDA a NHWC EP (#17200 ) ### Description CUDA inference speed heavily relies on Tensor Cores. To have tensor cores achieve the optimal throughput they require the data layout to be NHWC rather than NCHW. ### Motivation and Context Especially for convolutional networks this is very important. I will illustrate this using a very simple network:
```
import torch
import torch.nn as nn


class Net1(nn.Module):
    def __init__(self):
        super(Net1, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])

    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"
    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx", input_names=input_names, output_names=output_names)
```
I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges. Current master launches the kernels below: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b) If I add the introduced `-l` flag we see the kernels below: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008) Notice the missing NCHW<>NHWC kernels per operation. The layout optimizer introduced a transpose op as first and last op of the whole network. The `op_generic_tensor_kernel` shows the bias used which should also be optimized out next. Measured across some very basic models:

| CUDA EP | NCHW [ms] | NHWC [ms] | Speedup |
|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53 ## Outlook Next the mission will be to first write a templated unit test to check for correctness of NHWC vs NCHW ops. After that we have to transition more ops to measure perf improvements on a broader range of models. Currently this is not easily possible as we do not support all ops in the NHWC domain. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2023-10-16 10:16:37 -07:00
Chi Lo	8abaa7b753	[TensorRT EP] Fix cmake install (#17923 ) We removed tensorrt_provider_factory.h in the [PR](https://github.com/microsoft/onnxruntime/pull/17617). Need to remove the copy of this file when cmake install.	2023-10-16 09:16:24 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4bits. It adds: - a schema for contrib op MatMulNBits which can support 1-7 bits quantization on weight. - a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4bits MatMulNBits and related benchmark tool - a tool to quantize a model with 4bits. Next: - add general and more efficient kernels for 4bits MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
Jeff Daily	07317316cc	CUDA EP vs ROCM EP hipify audit (#17776 ) Migrate most CUDA EP improvements and changes to ROCM EP. The process involves using hipify against all CUDA EP files (i.e. do not exclude any files from onnxruntime_rocm_hipify.cmake), then comparing them with vimdiff against the ROCM EP files that are under source control and pulling in most changes. These changes include functional as well as formatting changes and make comparing CUDA EP and ROCM EP easier, though they make the PR diff somewhat less obvious due to formatting changes. - hipify audit of onnxruntime/core/providers/rocm, enable ops - Loop - Scan - hipify audit of onnxruntime/contrib_ops/rocm - fix contrib ops search implementation - enable more contrib ops - Affine - ComplexMul - ConvTransposeWithDynamicPads - Crop - DynamicSlice - FFT [Rfft, Irfft] - GreedySearch - ImageScaler - ParametricSoftplus - ScaledTanh - ThresholdRelu --------- Co-authored-by: cloudhan <cloudhan@outlook.com>	2023-10-13 10:13:53 +08:00

1533 commits