### Description
Update absl and gtest to fix an ARM64EC build error
### Motivation and Context
We need to get an important fix into ORT.
The fix is:
8028a87c96
Build onnxruntime.dll as arm64x
Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.
If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.
Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.
AB#22533
### Description
<!-- Describe your changes. -->
Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also
set `scale` or `zero_point`.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also
set `scale` or `zero_point`.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
data_reader,
activation_type=QuantType.QUInt16,
weight_type=QuantType.QUInt8)
quantize(input_model_path,
output_model_path,
qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
Reverts microsoft/onnxruntime#18413
there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
### Description
<!-- Describe your changes. -->
As title.
1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.
Replace https://github.com/microsoft/onnxruntime/pull/18334
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware
### Motivation and Context
Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
This kernel requires sophisticated quantized weight rearrangement to
speedup loading data to tensor-core. To facilitate the addition, this
change includes the following:
1. matrix_layout.h A new layout lib that facilitate iterating matrix
elements and tiles that balance memory safety and performance.
2. prepack_sm80.h Code for rearranging quantized weight, scales and
offsets (aka. prepacking)
3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the
memory safety and correctness of the prepacking code.
Currently the prepacking code runs on CPU with single threaded code. We
run this on CPU in order to minimize GPU memory fragmentation. On the
other hand, hopefully we get around to parallelize this part of the
code. Should be straight forward with the unit tests in place.
### Description
Truncate traling non-existing arguments.
Make sure we do not skip on the non-existing arguments in the middle,
because shape inferece relies on their proper position.
This also affects the argument position in the Edges that must be
properly rebuilt
each time If node branch is inlined.
Make sure that when we rename Defs in subgraphs, new renamed defs are
created in those subgraphs
instead of pointing to outer scope defs.
Add unit test.
### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
### Description
Motivation for this PR is code cleanup.
1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.
- Connect MatMulNBits contrib op to MLAS function.
### Description
<!-- Describe your changes. -->
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR
limitation: __CUDA_ARCH__ >= 700
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Use different march flag to workaround what appears to be a clang issue.
See https://github.com/tensorflow/tensorflow/issues/59970 for links to
various relevant pieces of info/discussions.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add 32-bit patch binary and infra to fallback to it. The Azure devops
Windows CIs are missing patch.exe from their git install for some reason
so the default `find_package(Patch)` fails as that is where it expects
to find it.
Remove Eigen patch. Underlying issue was fixed in source 3 years ago by
c6c84ed961
and the patch command is invalid (args are for git apply not patch).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make usage of patch consistent across all CIs
Fix https://github.com/microsoft/onnxruntime/issues/15248
### Description
Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the
GSL package that is found on the system is used. However, when cuda is
enabled, it is downloaded and patched. Most CMake scripts rely on the
first case and forget about the second. This patch makes the second case
behave like the first case.
### Motivation and Context
This is an issue that occurs 'in the wild'. For example, I had to patch
this to be able to enable the CUDA provider for the onnxruntime conan
package (see https://github.com/conan-io/conan-center-index/pull/20392).
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195 .
2. Revert eigen's commit id back to what we had before.
### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
### Description
<!-- Describe your changes. -->
Fix the broken pieces due to the latest Abseil update.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
Make the debugging bearable.
### Description
Add CI changes for #18287
Install onnx explicitly to pass windows GPU+dml stage.
### Motivation and Context
'eigen-3.4' was refering to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well
Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
- copied workspace allocation from XNNPACK unit test code
- some suffixes changed
Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
- not an issue currently
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed
Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.
NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
### Description
<!-- Describe your changes. -->
Pre-link with `ld -r` to apply symbol visibility when the static library
is created to replicate XCode's Single Object Pre-link.
Current builds set the visibility flags but that doesn't get applied
until the static library is linked into something else, which can be too
late. Pre-linking fixes this.
The pre-link uses the .o files from the ORT static libraries and the .a
files from external libraries. This combination limits the symbols
included from the .a files to things required by the ORT .o files.
In order to minimize changes elsewhere in the build we extract the .o
files from the ORT static libraries using `ar -x`.
Re-ordered the pieces use to build the Apple framework to make it a
little more readable.
Fixed a couple of misc issues with missing symbols from the minimal
build that show up when pre-linking is applied.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Will hopefully address #17722
Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.
### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
This PR implements distributed reduciton for llama 2. This version
doesn't consider any cases requring re-sharding because we haven't seen
any use cases.
Intutive examples:
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
-> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and
device_mesh=[0,1]
Algorithm:
When the reduced axes are not sharded, each device can call reduction
directly. The output sharding spec will be identical to input sharding
spec. We currently throw when input and output sharding specs are
different.
Review guideline:
- Check 97b8d2f for new op's schema and how new op is registered.
- Read tests in 2450f93 to get faimilar with the behavior of these ops.
- Check the implementation details in 753d9af.
### Description
Integration to OpenVINO 2023.1
### Motivation and Context
- Alignment with latest OpenVINO Version.
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.
---------
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
This PR implements DistributedExpand for llama 2.
Representative Examples of DistributedExpand:
- [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[8, 2] -> output tensor
(shape=[8, 2], spec=S[0]R, device_mesh=[0,1])`
- [sharding expanded axis is invalid since it must have dim=1 and axis
with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[2, 8] -> output tensor
(shape=[2, 8], spec=S[0]R, device_mesh=[0,1])`
From those examples, we observe a few important behaviors.
- The output sharding spec is always the same to the input sharding
spec.
- Expanding always happen on axis with dimension=1. Otherwise, it will
violate the broadcasting rule.
- No communication is needed since all computation can happen locally.
Let's consider the first example again. If you put the first half tensor
(shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on
device 1, then `Expand` it with target shape [4, 2] , these two local
tensors (shape: [4, 2]) are exactly the same as the one described by
output sharding spec.
Algorithm:
- Compute logical (i.e., unsharded) shapes of input and output.
- Compute sharded output shape from logical output.
- Call Expand to broadcast local input to sharded output shape.
How to review?
- Start with [changes in
onnxruntime_test_distributed.py](ea33392f37).
Those tests are good examples for using this op.
- [Read
expand.h/expand.cc](e4c49987f5).
Theose changes are for exposing functionalities in Expand to
DistributedExpand.
- Read distributed_expand.h/distributed_expand.cc. It follows the
algorithm described above. The commit
68ac301bba
first sketches the definition of DistributedExpand. The next commit
0eb9330c3b
adds real implementation.
### Description
Add support for Gemm with float 8 as a contrib op.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
This DistributedReshape aims at supporting all sharding patterns
encountered in llama 2. All patterns found are tested in
`TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR
implements algorithms to compute the categories below.
- All inputs and outputs are replica, so it's computed like a normal
Reshape.
- Two-axis fusion (if any of the inputs and outputs are sharded). This
category convers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`.
- Two-axis decomposition (if any of the inputs and outputs are sharded).
This category convers, e.g., `[batch x seq, hidden] -> [batch, seq,
hidden]`.
Review guideline:
- Ignore the changes in sharding_spec.h and sharding_spec.cc since they
come from another PR #18025.
- First, read onnxruntime_test_distributed.py to get familiar with the
input/output of DistributedReshape.
- Second, check the new APIs in reshape.h/reshape.cc to expose CUDA
Reshape kernel to DistributedReshape.
- For DistributedReshape, check its `ComputeInternal` for the 3
categories mentioned above.
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A python transformer tool to match sub-graph by config and write
transformer quickly.
Current transformers supports attention mask for both efficient attn and
flash attn, and dropout for efficient attn only. To support more
training scenarios (such as causal mask in GPT2), more transformers need
to be added.
The feature is guarded by system environment variables, it won't effect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests is not added for now.
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weight.
This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for MatMulBnb4 and related benchmark
tool.
- tool to quantize model to FP4 or NF4.
### Description
<!-- Describe your changes. -->
The mmla kernels require additional ISA flags
and are currently supported only on Linux
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
more context is in https://github.com/microsoft/onnxruntime/pull/15270
cc: @skottmckay , @chenfucn , @snnn
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and
(vi) Signed activations and weights (S8S8 Gemm) scenarios
I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing
```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync
```
### Description
This PR:
(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)
(2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent in building a few more files and running a few more tests will not
be much.
Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770
### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
### Description
CUDA inference speed heavily relies on Tensor Cores. To have tensor
cores achieve the optimal throughput they require the data layout to be
NHWC rather than NCHW.
### Motivation and Context
Especially for convolutional networks this is very important. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn
class Net1(nn.Module):
def __init__(self):
super(Net1, self).__init__()
# 1 input image channel, 6 output channels, 5x5 square convolution
# kernel
self.m = nn.ModuleList([
nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
])
def forward(self, x):
for module in self.m:
x = module(x)
return x
if __name__ == "__main__":
dtype = torch.half
device = "cuda"
dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
model = Net1().to(dtype=dtype, device=device)
input_names = ["input1"]
output_names = ["output1"]
torch.onnx.export(model, dummy_input, "test.onnx",
input_names=input_names, output_names=output_names)
```
I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges.
Current master launches below kernels:

If I add the introduced `-l` flag we see below kernels:

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.
Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |
|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |
Average speedup: 1.53
## Outlook
Next the mission will be to first write a templated unit test to check
for correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we can do not support all ops
in the NHWC domain.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4bits. It:
- add schema for contrib op MatMulNBits which can support 1-7 bits
quantization on weight.
- a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4bits MatMulNBits and related
benchmark tool
- tool to quantization model with 4bits.
Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU
Migrate most CUDA EP improvements and changes to ROCM EP. The process
involves using hipify against all CUDA EP files (i.e. do not exclude any
files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them
against the ROCM EP files that are under source control and pull in most
changes. These changes include functional as well as formatting and
makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff
somewhat less obvious due to formatting changes.
- hipify audit of onnxruntime/core/providers/rocm, enable ops
- Loop
- Scan
- hipify audit of onnxruntime/contrib_ops/rocm
- fix contrib ops search implementation
- enable more contrib ops
- Affine
- ComplexMul
- ConvTransposeWithDynamicPads
- Crop
- DynamicSlice
- FFT [Rfft, Irfft]
- GreedySearch
- ImageScaler
- ParametricSoftplus
- ScaledTanh
- ThresholdRelu
---------
Co-authored-by: cloudhan <cloudhan@outlook.com>