Commit graph

11122 commits

Author SHA1 Message Date
Wanming Lin
9ea9f9e46a
[WebNN EP] Add data type constraint (#20779)
WebNN spec has added data type constraint for every op, and its CPU
backend (currently is TFLite) has additional constraint. Add
corresponding constraint to each op in WebNN EP.

Note: Temporarily disable fp16 for CPU backend as which is planned to be
ready in Chromium next month.
2024-05-29 10:19:51 -07:00
Vincent Wang
e77f238dc6
Update Torch Version to Fix ATen CPU Pipeline Failure (#20845)
Update Torch Version to Fix ATen CPU Pipeline Failure.
2024-05-29 16:04:18 +08:00
Adrian Lizarraga
3044aa8743
[Quant tool] Extend support for QDQ type conversion at graph output (#20841)
### Description
Allows mixed-precision overrides that adds a QDQ quantization type
conversion sequence at a graph output that **is not** consumed by other
nodes. This is not a common use-case but should handle it instead of
raising an error.

#### Example
Original model

![image](https://github.com/microsoft/onnxruntime/assets/19691973/4c9c3bb0-4ca1-4213-9259-9d0506ed22f2)

mixed-precision overrides:
```python
        mixed_prec_overrides = {
            "input_0": [{"quant_type": QuantType.QUInt16}],
            "op_0_out": [
                {
                    "quant_type": QuantType.QUInt16,
                    "convert": {"quant_type": QuantType.QUInt8},
                }
            ],
        }
        quantize_static(
            float_model_path,
            qdq_model_path,
            data_reader,
            quant_format=QuantFormat.QDQ,
            activation_type=QuantType.QUInt8,
            op_types_to_quantize=[node.op_type for node in float_model.graph.node],
            extra_options={
                "TensorQuantOverrides": mixed_prec_overrides,
            },
        )
```

QDQ model:

![image](https://github.com/microsoft/onnxruntime/assets/19691973/804fc89b-4a00-43bc-a4ff-21edd6f27e98)

### Motivation and Context
This scenario is arising for certain quantization configurations. Should
handle it gracefully.
2024-05-28 21:27:54 -07:00
Yifan Li
d44be41e1c
[TensorRT EP] Support engine hardware compatibility (#20669)
### Description
<!-- Describe your changes. -->
- Introduce option `trt_engine_hw_compatible` to support engine hardware
compatibility for Ampere+ GPUs
- This enables `nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS` flag
when generating engines
- This option has been validated on sm80/86 GPUs, as engine can be
reused across different ampere+ arch:
- Client side need to enable this option as well to leverage existing
sm80+ engines
- If this option is enabled by users which TRT<8.6 or sm<80, there will
be a warning showing this option not supported

Engine naming:
| When | `trt_engine_hw_compat=false` | `trt_engine_hw_compat=true` |
| -------------- |
------------------------------------------------------------ |
------------------------------------------------------------ |
| A100 (sm80) |
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80**.engine
|
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine
|
| RTX3080 (sm86) |
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**86**.engine
|
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine
|


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reference:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#hardware-compat

---------

Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
2024-05-28 18:12:56 -07:00
Edward Chen
535e9d7114
Update package_release_tasks.py (#20835)
1. Move azcopy environment variables out of script and into an Azure DevOps variable group. Move towards consolidating the managed identity client ID definition in one place.
2. Disable azcopy overwrite. We don't want to accidentally change the files for a released package.
2024-05-28 17:50:25 -07:00
Ye Wang
362a623905
fix a build error with cuda 12.5 (#20770)
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/20765
2024-05-28 10:46:24 -07:00
Adrian Lizarraga
e78b18a2fb
Increase ComponentDetection timeout for React Native CI (#20800)
### Description
Runs of the React Native CI are timing out during ComponentDetection
after 8 minutes. This increases the timeout value.



### Motivation and Context
Runs of the React Native CI are timing out during ComponentDetection.
2024-05-28 08:36:38 -07:00
Jian Chen
b1b8cb05dc
Adding java build and packaging stage to cuda-packaging-pipeline.yml (#20812)
### Description
Adding java build/packaging stage to `cuda-packaging-pipeline.yml`



### Motivation and Context
This way we can enable publishing the Java Cuda 12 along with Nuget CUDA
12
2024-05-27 07:59:19 -07:00
Chi Lo
454fcdde00
[TensorRT EP] Weightless API integration (#20412)
This PR includes the weight-stripped engine feature (thanks @moraxu for
the #20214) which is the major feature for TRT 10 integration.

Two TRT EP options are added:

- `trt_weight_stripped_engine_enable`: Enable weight-stripped engine
build and refit.
- `trt_onnx_model_folder_path`: In the quick load case using embedded
engine model / EPContext mode, the original onnx filename is in the
node's attribute, and this option specifies the directory of that onnx
file if needed.

Normal weight-stripped engine workflow:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f314865-cbda-4979-a7ac-b31c7a553b56)
Weight-stripped engine and quick load workflow:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f31db51-a7a8-495b-ba25-54c7f904cbad)

see the doc [here
](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches)for
more information about EPContext model.

---------

Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
2024-05-26 12:24:17 -07:00
Changming Sun
439ed92b96
Remove TVM EP's pipeline (#20813)
### Description
Temporarily remove TVM EP's pipeline until someone helps us upgrade TVM
to a newer version which is compatible with the latest ONNX.

### Motivation and Context
The ONNX version that TVM EP uses has a known security vulnerability. We
cannot continue using it in our hosted build environment. This change is temporary
2024-05-25 20:42:41 -07:00
Adrian Lizarraga
5bae32eb34
Extend DoubleQDQPairsRemover to handle sequences that end in duplicate DQ nodes (#20759)
### Description
Extend the DoubleQDQPairsRemover optimizer to also handle sequences that
end in duplicate DQ nodes.

For example, the following sequence:
```
 Q1 --> DQ1 --> Q2 --+--> DQ2
                     |
                     +--> DQ2'
```
Is now simplified to:
```
 Q1 ---+--> DQ2
       |
       +--> DQ2'
```


### Motivation and Context
The EnsureUniqueDQNodeUnits pass may add duplicate DQ nodes to ensure
valid QDQ node units. The DoubleQDQPairsRemover should still be able to
remove unnecessary QDQ ops if the target sequence ends in duplicate DQ
nodes.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-05-24 18:30:15 -07:00
Chi Lo
a7bc49a565
[TensorRT EP] Use latest commit of onnx-tensorrt parser (#20758)
The 10 GA branch updated with several issues fixed.
https://github.com/onnx/onnx-tensorrt/commits/10.0-GA/
2024-05-24 16:44:16 -07:00
Suryaprakash Shanmugam
1765da17e4
QDQ transformations in the OpenVINO EP for the NPU device (#20622)
We introduce rulesets that eliminate QDQ nodes of unsupported types and
for unsupported quantised operators for the NPU device. This leads to
improved performance and accuracy on critical client AI models.

Here's a summary of the changes:

- Introduces the provider option `enable_qdq_optimizer` which when set
to `True` enables stripping of QDQ nodes on the NPU device for models
with `QuantizeLinear` and `DequantizeLinear` layers in them.
`enable_qdq_optimizer` defaults to `False`.
- Always strip out int16/uint16 QDQ layers as these types are not
supported by the NPU compiler.
- Only supported ops `Conv`, `MatMul`, and `Add` retain QDQ layers
around them, specifically identified for optimal inference performance.
OpenVINO EP achieves this by iterating through NodeUnits in the QDQ
model, and reconstructing the graph only with the required layers.
- Added provider APIs to manipulate node units from EP code by
@adrianlizarraga
- Added capability rule for the Pad operator when it takes DQ layers as
input
- Fixes from static code analysis tool

---------

Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
2024-05-24 16:25:05 -07:00
Adam Louly
ed8275883a
[Training] Add bf16 support to GatherElementsGrad. (#20796)
### Description
Adding bf16 support to GatherElementsGrad.

---------

Co-authored-by: Adam Louly <adamlouly@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
2024-05-24 15:55:14 -07:00
Suryaprakash Shanmugam
76e1a06986
Fix ordering of value info in GraphProto creation (#20691)
### Description
Graph member value_info_ (unordered_set) is ordered before its values
are added to the graph proto.


### Motivation and Context

- Without this ordering, the model proto used by the OpenVINO EP is not
deterministic and varies across runs.
- Since the model proto varies, it affects caching attempts by OpenVINO.

Q: If creating a vector to have ordered elements is costly, should we
make value_info_ a std::set that is sorted according to NodeArg names?


Related PR about ordering initializers:
https://github.com/microsoft/onnxruntime/pull/14631
2024-05-24 10:49:32 -07:00
Peishen Yan
cfe68e489e
[WebNN EP] Support Trilu op (#20730)
Adds support for Trilu via WebNN Triangular op
2024-05-24 10:46:54 -07:00
Guenther Schmuelling
33a68d221f
add missing file for pr20791 (#20811)
this file should have been in pr20791 to allow fp16 in the tile
implementation
2024-05-24 09:59:13 -07:00
Jian Chen
10c425a4d5
Fix Onnx >= to == (#20798)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-24 09:16:23 -07:00
Jian Chen
fe24006425
Fix Nuget Cuda pipeline package pipeline (#20741)
### Description
<!-- Describe your changes. -->

This PR adding protoc.exe to make the Nuget Cuda Pipleine, which also
allowing it to get build Java for various CUDA version

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-24 09:15:57 -07:00
Satya Kumar Jandhyala
bab5037eab
Eliminate explicit Concat operations in Attention (#20556)
### Description
Remove explicitly concatinating pastKey with Key and pastValue with
Value.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-24 09:07:57 -07:00
Changming Sun
535a030b1e
Remove manylinux build scripts from python packaging pipeline (#20786)
### Description
Use a common set of prebuilt manylinux base images to build the
packages, to avoid building the manylinux part again and again. The base
images can be used in GenAI and other projects too.
This PR also updates the GCC version for inference python CUDA11/CUDA12
builds from 8 to 11. Later on I will update all other CUDA pipelines to
use GCC 11, to avoid the issue described in
https://github.com/onnx/onnx/issues/6047 and
https://github.com/microsoft/onnxruntime-genai/issues/257 .

### Motivation and Context
To extract the common part as a reusable build infra among different
ONNX Runtime projects.
2024-05-24 08:18:22 -07:00
Jian Chen
884acd4598
Fix Nuget-Cuda pubish pipeline (#20794)
### Description
Previous all feed are set to nightly, the offcial released feed-id is
not set


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-23 18:27:46 -07:00
Guenther Schmuelling
0cf7caaff2
[js/webgpu] enable fp16 for tile (#20791) 2024-05-23 16:59:39 -07:00
Edward Chen
d1af19db9d
Add some CPU feature detection for Apple platforms. (#20769)
Add CPUIDInfo::ArmAppleInit() to detect CPU features on Apple platforms. This initial implementation is not comprehensive.
2024-05-23 15:59:46 -07:00
Changming Sun
b522df0ae4
Update RE2 to the latest (#20775)
Update RE2 to the latest.

To keep the components up to date.
2024-05-23 14:30:15 -07:00
Yulong Wang
0996d6e19e
[tools] update pipeline list for run_CIs_for_external_pr.py (#20776)
### Description
add required pipeline "Linux Android Emulator QNN CI Pipeline"
2024-05-23 10:38:42 -07:00
Yi Zhang
fa8670fe5b
Add a test image for stable diffusion (#20780) 2024-05-23 08:50:23 -07:00
Wanming Lin
2c39d0c502
[WebNN EP] Disable ConvTranspose for WebNN CPU (#20762)
WebNN CPU backend implementation has been migrated from XNNPack to
TFLite, currently TFLite has not supported WebNN's convTranspose2d yet,
just disable it for now.
2024-05-22 20:59:37 -07:00
Adam Louly
529feb01f4
Add BF16 for Scale Op. (#20753)
Adding Bfloat16 to scale op

---------

Co-authored-by: Adam Louly <adamlouly@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
2024-05-22 17:01:17 -07:00
Edward Chen
a39f8862fd
SQNBitGemm - move workspace size calculation functions to hardware-specific implementations (#20757)
The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.
2024-05-22 15:12:17 -07:00
Jian Chen
d4fe4b5b51
Replace ubuntu-latest with onnxruntime-Ubuntu2204-AMD-CPU (#20736)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-22 13:36:02 -07:00
Jian Chen
0a10a3003a
component-governance fix round 4 (#20754)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-22 11:05:24 -07:00
Yulong Wang
e412bc1919
[doc] update file size table for ORT Web (#20755)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-22 11:04:57 -07:00
Xu Xing
f1fef19b6e
[js/webgpu] Support shared memory for transpose 2d (#19267)
For 1024x1024, without shared memoey, 18.7ms. With shared memory 13.2ms.
2024-05-22 08:15:44 -07:00
Yulong Wang
068bb3d5ee
[js/webgpu] add missing space in build script (#20752) 2024-05-21 16:24:34 -07:00
Chi Lo
df01e0d497
[TensorRT EP] Update ORT kernel output with TRT DDS int64 output for TRT 10 (#20738)
TRT 10 now natively supports int64 tensor, so needs to updating the code
where binding the ORT kernel output with DDS int64 output.
2024-05-21 09:03:48 -07:00
pengwa
8a98874e7e
Flash attention recompute (#20603)
### Flash attn recompute

1. Allow PythonOp(FlashAttn) can be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678

#### Better Memory Efficiency 

Customer model can run both PyTorch SPDA and Flash Attn, this PR make it
possible to let the Flash Attn path work with ORTModule layerwise
recompute. The peak drop from 45.xGB to 32.xGB if we only compare the
layers (not including other pieces, BTW there are few more optimization
targeting other pieces as well later).

#### Better Perf

Using Flash ATTN bring additionally 16% end to end time reduction, with
highly aligned loss curve.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/bb63894a-f281-49bc-a8e6-ff818439be38)

#### Use JSON File to pass Recompute Plans

To overcome the limitation of max length of the strings defined in
session options.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-21 13:38:19 +08:00
Adrian Lizarraga
8acf60f35c
Layout transform: Fix-up QDQ units and add constant folding (#20685)
### Description

#### Problem 1: Broken Transpose QDQ unit
Layout transform's specialized cost function aggressively pushes down
transposes with channel-first or channel-last perms. This can lead to a
situation where a channel-fist/last Transpose gets stuck after being
pushed through an Unsqueeze node that makes the Transpose's perm no
longer channel-first/last. At this point, the specialized cost function
defers to the default const function, which does not see a need to
continue pushing this transpose node. This breaks the QDQ node units for
both the Unsqueeze and the Transpose: DQ -> Unsqueeze -> Transpose -> Q.

<img width="266" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/82f8432d-ca27-451b-8c36-c8d87b806e30">


The transpose optimizer should insert a Q -> DQ pair between the
Unsqueeze and Transpose nodes to fix both QDQ node units: DQ ->
Unsqueeze -> Q[new] -> DQ[new] -> Transpose -> Q

<img width="198" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/5a584bdf-e5db-4622-b3bb-83c060e09261">


#### Problem 2: Inserted Squeeze/Transpose nodes should be constant
folded when possible.
The transpose optimizer inserts Squeeze (and Transpose) ops between an
initializer and a DQ to counteract the effect of Unsqueezing that
initializer if it is consumed by multiple nodes. This results in a graph
where the inserted nodes are not in valid node units:

Original graph where two Mul nodes share a common initializer input:
<img width="456" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/4b9155ae-e32f-41fc-9136-f953b73e92e7">

Resulting graph after transpose optimization without constant folding:
<img width="452" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/3c1bfef1-d45f-4d6e-aa19-1c2929eae3f5">

Here, the circled Transpose and Squeeze nodes operate on a quantized
integer type but are not in valid QDQ node units. The solution is to run
constant folding, which results in:
<img width="405" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/aebdb91f-f38f-4583-adec-33e46126365f">


### Motivation and Context
Improve the layout transformation to allow more models to run on EPs
that prefer the channel-last layout.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-05-20 20:19:06 -07:00
Jian Chen
372974e5d6
Using CPU pool to build Linux GPU C API Package (#20648)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-20 15:25:14 -07:00
Wanming Lin
87d49e3dda
[WebNN EP] Add WebNN operators doc to README.md (#20734) 2024-05-20 14:57:40 -07:00
Wanming Lin
0399d1b12d
[WebNN EP] Update chromium flag (#20732)
WebNN is currently enabled behind "Enables WebNN API" flag.
2024-05-20 14:57:30 -07:00
Jian Chen
ddafbf2224
Component Governance fix round 3 (#20689)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-20 13:39:09 -07:00
Jian Chen
11df22b59b
Reenabling Nuget Cuda Packaging Pipeline (#20688)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-20 10:37:15 -07:00
Edward Chen
fefae0cd04
Add Mac CI GitHub Actions workflow (#20717)
Add a new GitHub Actions workflow, `.github/workflows/mac.yml`. It contains these jobs:
- ARM64 MacOS CI build.
- Objective-C static analysis build. This was moved over from another Azure DevOps pipeline to make it more visible.
2024-05-20 10:27:03 -07:00
Preetha Veeramalai
ebed2c3785
Unified OV compile_model API in OVEP (#20700)
### Description
Have a unified API in OVEP that pass the ONNX graph proto from ORT to OV
for compilation


### Motivation and Context
The earlier implementation used two different flows when onnx model path
is present vs model laoded from memory.
The former directly passed the onnx model path to OV when the graph is
fully supported by EP. While the latter pass the ORT model proto to OV.

This cause a difference in results when ORT optimizations are enabled.
This PR address this issue.
2024-05-20 10:20:28 -07:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR make numbers of optimizations to onnxruntime-web's module export
and deployment.

See each section below for more details.

#### Preview

>
[onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
| `ort.training` | `onnxruntime-web/training` |  |  | ✔️ |
✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup>
|  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] didn't test. may not actually work.
* [2] not working. this is a mistake in build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
| `ort.all` |
~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ |
✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
| ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~~~ | ~~~~
| ~~✔️~~ | ~~~~ | ~~~~ |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~  |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string ( for the URL
prefix ), nothing is changed. When using this flag as an object ( for
per-file path override ), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
+ * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
+ * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```

#### Bundler compatibility:

Config changes are need for bundlers. See usage example in
/js/web/test/e2e/ for Webpack, parcel and rollup.

#### Deployment:

- if consuming from a CDN, there is no breaking change.
- if consuming from a local server, need to copy all `ort-*.wasm` and
`ort-*.mjs` files (totally 6 files) in the dist folder. (previously only
need to copy `ort-*.wasm` files.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- Script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using Blob URL, which makes
onnxruntime-web not working in CSP environment and Node.js, when using
proxy or multi-threading feature.
- Generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- When running with a different Emscripten build, always need the build
step. Making it difficult to swap artifacts in deveopment/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support variances of ways to import. Including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
<script
src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
import * as ort from
"https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping single JavaScript file (the all-in-one bundle) is no
longer a goal. This is because technically there is a conflict with the
other requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
- The current onnxruntime-web distribution uses a single JavaScript file
to include all code. While there are a few benefits, it also creates
problems as mentioned above. Since ESM is being used more and more
widely, and browsers are making more restricted security checks and
requirement, the old Blob based solution is going to be replaced.
- To achieve the requirement, specifically, the CSP environment support,
we have to offer a non Blob based solution. Therefore, we have to
distribute multiple files and drop the single file solution.

- Do not run parser/postprocess on Emscripten generated JavaScript.
- Emscripten is evolving quickly so we should only depends on what's in
its documentation instead of a certain implementation details. (for
example, currently we patch on its code to deal with a special variable
`_scriptDir`)
  - Keep the generated files as-is also helps to:
    - reduce the size of ort.min.js
- make it easier to replace build artifacts when in development/debug

- Drop support for non-SIMD and non-MultiThread. This helps to reduce
the number of artifacts in distribution.
  - (fixed-sized) SIMD is supported in any mainstream JS environment.
- Multi-thread as WebAssembly feature is supported in any mainstream JS
environment. In some environment the feature is guarded with cross
origin policy, but it can still work if not trying to create any worker.

- Use ESM output for Emscripten generated JavaScript.
- There are 2 ways to dynamically import classic (umd) modules and
neither of them are recommended:
- dynamically creating a <script> tag. This changes the HTML structure
and have quite a lot of compatibility issue
- use `fetch()` and `eval()`. However `eval` is strongly suggested to be
avoid because there is a great perf hit.
- importing ESM is super easy - just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js this
is the better option.

- Add Blob based solution as a fallback for cross-origin workers.
- There are still wide use case of importing onnxruntime-web from CDN.
In this usage, make it able create worker by using `fetch()`+`Blob` to
create a same-origin Blob URL.

</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm |
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-training-wasm-simd-threaded.mjs <br/>
ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/>
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm
| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep`
<br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
| `ort.training` | `onnxruntime-web/training` | The target including
training APIs |
| `ort.wasm` | `onnxruntime-web/wasm` | The target including only
WebAssembly (CPU) EP |
| `ort.webgl` | `onnxruntime-web/webgl` | The target including only
WebGL EP |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
| [target].js | The entry point for the target. IIFE and CommonJS
format. |
  | [target].mjs | The entry point for the target. ESM format. |
| [target].min.js <br/> [target].min.js.map | The entry point for the
target. Minimized with sourcemap. IIFE and CommonJS format. |
| [target].min.mjs <br/> [target].min.mjs.map | The entry point for the
target. Minimized with sourcemap. ESM format. |
| [target].proxy.mjs | (if appliable) The proxy ESM module for the
target. |
| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if
appliable) The proxy ESM module for the target. Minimized with
sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
+ import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
+ new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
+ fetch('ort-wasm-simd-threaded.mjs')
                                                |
+ URL.createObjectURL(res.blob())
                                                |
+ import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
kunal-vaishnavi
ca22a5a9d0
Add fusions for OpenAI CLIP (#20721)
### Description
This PR adds fusions for [OpenAI's CLIP
model](https://huggingface.co/openai/clip-vit-large-patch14-336). Here
is an example of how to run the ORT transformer optimizer for the linked
CLIP model.

```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision
encoder.
2024-05-18 08:27:16 -07:00
cloudhan
5d07291247
hipify int4 gemv (#20666)
Hipify MatMulNBits to accommodate the need of Phi3 onnx release.
2024-05-18 16:59:03 +08:00
kunal-vaishnavi
72a3bde330
Add GQA on CPU in LLaMA scripts (#20720)
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that
are running on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
2024-05-17 23:23:57 -07:00
Dmitri Smirnov
bd7a0fb377
[C API Docs ] Address doxygen errors (#20714)
### Description
Make C API compliant with Doxygen expectations

### Motivation and Context
Doc workflow is failing.
2024-05-17 23:23:20 -07:00