Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.
### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
### Description
Added FusedConv and FusedConvTranspose
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
This PR implements distributed reduciton for llama 2. This version
doesn't consider any cases requring re-sharding because we haven't seen
any use cases.
Intutive examples:
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
-> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and
device_mesh=[0,1]
Algorithm:
When the reduced axes are not sharded, each device can call reduction
directly. The output sharding spec will be identical to input sharding
spec. We currently throw when input and output sharding specs are
different.
Review guideline:
- Check 97b8d2f for new op's schema and how new op is registered.
- Read tests in 2450f93 to get faimilar with the behavior of these ops.
- Check the implementation details in 753d9af.
### Description
Integration to OpenVINO 2023.1
### Motivation and Context
- Alignment with latest OpenVINO Version.
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.
---------
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
### Description
There is a gap between onnx’s definition of batch normalization and
QNN’s.
According to the formula:
onnx: `(X - input_mean) / sqrt(input_var + epsilon) * scale + B`
QNN: `X * weight + bias`
We can then deduce that:
`weight = scale / sqrt(var + epsilon)`
`bias = B – (mean * scale / sqrt(var + epsilon))`
We must calculate the weight and bias, and their quantization parameters
for QNN in QNN EP.
Therefore, `scale`, `B`, `input_mean`, and `input_var` must be static
(`initializer`).
Implementation:
Firstly, dequantize `scale`, `B`, `input_mean`, and `input_var` to
floating point.
Second, calculate `weight` and `bias`, and their quantization
parameters.
Finally, quantize `weight` and `bias`, and add them into `TensorWrapper`
### Motivation and Context
Fix QnnHTPBackendTests.BatchNorm1D and QnnHTPBackendTests.BatchNorm2D
failures
### Description
Implement Split KV optimization for FlashAttention in MHA and Attention
operators.
### Motivation and Context
Can help further accelerate these ops.
### Description
This PR reduces the memory usage when exporting and benchmarking LLaMA.
### Motivation and Context
- Exporting: The PyTorch model is deleted from memory after a successful
export instead of deleting it from memory after exporting + converting
the ONNX model to the desired precision.
- Benchmarking: In the ONNX model with GroupQueryAttention, the KV cache
inputs use the same GPU memory for both the prompt and token generation
benchmarks.
### Description
<!-- Describe your changes. -->
Add the mobile CIs to the list so we check external PRs don't break
those.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Recent external PR was found to break iOS CI after checkin
The `RemoveDuplicateCastTransformer` fairly naively removed Cast nodes
from the graph without considering precision loss when using the same
`TypeGroup`. For instance, F64 -> F32 -> F64 would be optimised out of
the graph.
I also noticed that signedness was not accounted for, which is not
covered by any existing issue but is a problem. For example doing int ->
unsigned int -> int produces very different values for negative inputs
and so should not be optimised out
One could argue that we shouldn't be performing such cast elimination at
all (at least not in this transformer). The original scope might be well
restricted to only eliminating unnecessary casts from the
`InsertCastTransformer` and no others.
### Motivation and Context
This should fix https://github.com/microsoft/onnxruntime/issues/17565,
ttps://github.com/microsoft/onnxruntime/issues/9915 and
https://github.com/microsoft/onnxruntime/issues/8787.
* Add a new operator SkipGroupNorm to support skip and bias inputs.
* Update GroupNorm kernel to support number of channels used in SD XLrefiner.
* Add epsilon in kernel
* Add parity and performance test script
* Remove many limitations including max batch size, max number of groups, c % cPerBlock ==0 etc.
### Motivation and Context
Update GroupNorm to support SD XL Refiner and beyond.
### Description
Update batch file to set PATH for Cuda with TRT
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
update rocm package exclude libs.
- change librocblas.so.0 to librocblas.so.3 which is used on ROCm5.6 and
ROCm5.7
- add librocfft.so.0, libhipfft.so.0, libhiprtc.so.5 and sort the list.
### Description
This PR enables `softmax` outputs max supported components instead of
scalar for each thread.
Softmax with input[0]: [12,4096,4096] becomes 47.86 ms from 55.11 ms
### Description
<!-- Describe your changes. -->
Optimize SkipLayerNorm for large dimension (>=2048) by handling 8
elements in one thread. It avoid the re-writing and re-loading sum of
input, skip and bias to main memory. It reduces the latency of dimension
4096 with small batch size from ~18us to ~3.8us on A100.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
[QNN EP] Disable early termination in GetCapability if there are
multiple partition and context binary enabled
### Description
QNN EP context binary cache feature only support single partition for
now. We have early termination in GetCapability.
After the PR https://github.com/microsoft/onnxruntime/pull/17764.
There's no Level 1 optimization any more for the 1st GetCapability.
Graph transformer EnsureUniqueDQForNodeUnit is not applied. So if
there's initializer -> DQ -> shared by multiple node unit. The node is
not identified as node unit group. QNN EP report many not supported
nodes because of this in the 1st GetCapability call.
The 2nd GetCapability still works normally.
Disable the early termination in GetCapability, delay the decision to
Compile.
### Description
This PR tries to fix a part of the NPM package consuming problems for
onnxruntime-web (ES module) as described in #10913:
- reduce the package size to fit the 150MB restriction in jsdelivr, by
removing dev build targets for uncommon exports
- add default export to support `import ort from 'onnxruntime-web';`
(currently only support `import * as ort from 'onnxruntime-web';`
This reverts commit 99b8dcaae2.
### Description
<!-- Describe your changes. -->
### Motivation and Context
Restore the dml stage in windows GPU pipeline.
Agent issue is solved by adding Feature.DisableGpuDriver in pool
properties.
Arm Compute Library(ACL)backend requires explicit memory format tag
iniatilization to decide wether the tensor can be computed with the ACL
kernels. Hence, the src, weights and dst memroy descriptor format is set
based on the tensor dimensions instead of using the format::any tag.
### Description
<!-- Describe your changes. -->
Arm Compute Library(ACL)backend requires explicit memory format tag
iniatilization to decide wether the tensor can be computed with the ACL
kernels. Hence, the src, weights and dst memroy descriptor format is set
based on the tensor dimensions instead of using the format::any tag.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The change enables ACL kernels for DNNL matmul ops on aarch64 platform.
This PR implements DistributedExpand for llama 2.
Representative Examples of DistributedExpand:
- [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[8, 2] -> output tensor
(shape=[8, 2], spec=S[0]R, device_mesh=[0,1])`
- [sharding expanded axis is invalid since it must have dim=1 and axis
with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[2, 8] -> output tensor
(shape=[2, 8], spec=S[0]R, device_mesh=[0,1])`
From those examples, we observe a few important behaviors.
- The output sharding spec is always the same to the input sharding
spec.
- Expanding always happen on axis with dimension=1. Otherwise, it will
violate the broadcasting rule.
- No communication is needed since all computation can happen locally.
Let's consider the first example again. If you put the first half tensor
(shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on
device 1, then `Expand` it with target shape [4, 2] , these two local
tensors (shape: [4, 2]) are exactly the same as the one described by
output sharding spec.
Algorithm:
- Compute logical (i.e., unsharded) shapes of input and output.
- Compute sharded output shape from logical output.
- Call Expand to broadcast local input to sharded output shape.
How to review?
- Start with [changes in
onnxruntime_test_distributed.py](ea33392f37).
Those tests are good examples for using this op.
- [Read
expand.h/expand.cc](e4c49987f5).
Theose changes are for exposing functionalities in Expand to
DistributedExpand.
- Read distributed_expand.h/distributed_expand.cc. It follows the
algorithm described above. The commit
68ac301bba
first sketches the definition of DistributedExpand. The next commit
0eb9330c3b
adds real implementation.
Set min version supporting custom_op::ComputeV2 to 16, since the feature
has been released since ort 1.16.
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
WebNN can now handle Conv with filter as input .
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Support more models with WebNN.
### Description
Previously used GitHub stale app is now deprecated, so I deleted that
file and added a new GitHub Actions workflow to automatically apply the
stale label to inactive issues.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
I am adding a new `trt_timing_cache_path` option. Internally it is
handled as `global_cache_path_` and will be set via a fall through
approach:
1. no path provided => workdir
2. `trt_engine_cache_path` provided but no `trt_timing_cache_path` =>
`trt_engine_cache_path`
3. `trt_timing_cache_path` provided => `trt_timing_cache_path` (if not
provided `trt_engine_cache_path` will still be workdir)
### Motivation and Context
A TRT timing cache can be reused across multiple models as it only holds
kernel timings and it is common that network "patterns" are reused. This
can accelerate build times a lot.
---------
Co-authored-by: Carson M <carson@pyke.io>
### Description
Transpose is equivalent to a Reshape if:
empty dimensions can change place, not empty dimensions must be in
the same order in the permuted tenosr.
Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1).
This pr adds a graph transformer which replaces Transpose with Reshape
if they are equivalent.
Because Transpose need memory copy while Reshape needn't, this
replacement can save overhead for memory copy.
### Description
Add support for Gemm with float 8 as a contrib op.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Version 41.0.0 currently used has vulnerabilities.
### Motivation and Context
See [Vulnerable OpenSSL included in cryptography
wheels](https://github.com/advisories/GHSA-v8gr-m533-ghj9)
This DistributedReshape aims at supporting all sharding patterns
encountered in llama 2. All patterns found are tested in
`TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR
implements algorithms to compute the categories below.
- All inputs and outputs are replica, so it's computed like a normal
Reshape.
- Two-axis fusion (if any of the inputs and outputs are sharded). This
category convers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`.
- Two-axis decomposition (if any of the inputs and outputs are sharded).
This category convers, e.g., `[batch x seq, hidden] -> [batch, seq,
hidden]`.
Review guideline:
- Ignore the changes in sharding_spec.h and sharding_spec.cc since they
come from another PR #18025.
- First, read onnxruntime_test_distributed.py to get familiar with the
input/output of DistributedReshape.
- Second, check the new APIs in reshape.h/reshape.cc to expose CUDA
Reshape kernel to DistributedReshape.
- For DistributedReshape, check its `ComputeInternal` for the 3
categories mentioned above.
### Description
This PR adds a few updates to scripts in the LLaMA folder:
- Fixes the precision re-naming in the LLaMA export
- Adds a "prerequisites" section in the README
- Adds IO binding synchronizations during benchmarking for other EPs
### Motivation and Context
- With precision re-naming, the LLaMA parity check does not produce
errors when creating the FP32 CPU model
- The "prerequisites" section shows that there are specific package
versions needed
- This allows for benchmarking with other EPs besides CPU and CUDA
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A python transformer tool to match sub-graph by config and write
transformer quickly.
Current transformers supports attention mask for both efficient attn and
flash attn, and dropout for efficient attn only. To support more
training scenarios (such as causal mask in GPT2), more transformers need
to be added.
The feature is guarded by system environment variables, it won't effect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests is not added for now.
### Description
Opset 18 apply the "axes as input" change from ReduceSum to all the
other reduce ops. Our cuda kernel actually support it, but we didn't
enable it for opset18. This PR update the reduce ops' kernel
registration to enable the "axes as input" behavior for opset18.
As part of the fix, I also simplify the reduce op kernel registration
part. ORT doesn't require the kernel definition need to be exactly the
same as onnx op definition. For our case, which we share the same kernel
for all the reduce ops (from version 1 to version 18), we don't need to
maintain different version of kernel definitions. we can simplify it by
just using a single kernel definition for multiple versions. Although
for some cases, we might register more types for legacy versions, but it
is harmless. Framework is using schema to validate the graph, not kernel
definition.
---------
Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Timestamp-query has a broader support than timestamp-query-in-passes on
all the platforms, including macOS.
Note that to enable timestamp-query, you still need to add switch
"--enable-dawn-features=allow_unsafe_apis" to Chrome. By default, the
lowest 16 bits are masked with 0 (at a granularity about 0.1ms) for
privacy. To get the highest precision, you need to add another switch
"--enable-webgpu-developer-features".
There were some warning in https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770
e.g.
```
[ RUN ] CudaNhwcTypedTest/1.AveragePoolNhwcPad @ /home/administrator/onnxruntime/onnxruntime/test/providers/cuda/nhwc/pool_test.cc:84
[W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'Y' source:{1,16,66,66} target:{1,16,67,67}. Falling back to lenient merge.
```
These warnings where not specific to NHWC or NCHW but were just a miscalculation of output shape in some tests.
### Description
Motivation for this PR is reducing CI test time by removing unnecessary
tests from the pipelines.
Following changes are for reducing test time in pipelines:
- Skip CPU model tests in GPU builds. Training CIs run these tests as a
sanity check. There is no direct training code being tested in these
pipelines, furthermore, CPU tests are being run in CPU pipelines so no
need to run them again in GPU builds and block the GPU VM. This change
reduces testing time by 20-25 mins in all training GPU pipelines.
- Delete debug package building pipeline for linux training packages.
This was required by compiler team at some point but there have been 0
downloads of these packages.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Since Reshape may change device mesh from, e.g., [0, 1] to [0, 1, 0, 1],
we can't assume same device mesh per op. At each operator, we replace a
single operator-level device mesh
- `device_mesh_shapes`
- `device_mesh_elements`
with per-tensor device meshes
- `input_device_mesh_shapes` (input_device_mesh_shapes[i] is the device
mesh's shape for the i-th input, e.g., "[3]" for 1-D mesh with 3
devices)
- `input_device_mesh_elements` (input_device_mesh_elements[i] is the
flattened device mesh elements for the i-th input; e.g., "[0, 1, 2]" if
you have 3 devices in that mesh)
- `output_device_mesh_shapes`
- `output_device_mesh_elements`
Check out the change in onnxruntime_test_distributed.py for examples.
It's also heavily used in #18068's `onnxruntime_test_distributed.py`
change.
### Description
* Adds TrainingSession.create() functionality following the web bindings
for training design doc
* Added 2 new training APIs to wasm/api.h:
* OrtTrainingGetInputOutputName
* OrtTrainingGetInputOutputCount
* Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a
method that references it
### Motivation and Context
* Adding web bindings for training
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allows for training package to be built + adds training backend
to web package **[MUST BE MERGED IN BEFORE THIS ONE]**
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>