Commit graph

9613 commits

Author SHA1 Message Date
Dmitri Smirnov
fdb132643d
Remove redundant Resolve() after each inlined function (#17556)
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.

### Motivation and Context
Poor performance for inlining the model and session initialization.

Original model before Resolve() removal
FunctionTest.Profiling (**65953 ms**)
After Resolve() Removal
FunctionTest.Profiling (**2911 ms**)

RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers
Non-inlined model consists of functions and Level1 optimizers have no
effect.
FunctionTest.Profiling (**9851 ms**)
2023-09-15 12:13:37 -07:00
Tianlei Wu
adb0be45d3
Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559)
To avoid a huge cu file and make code more readable:
 - Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
 - Move ConcatPastToPresent to attention_concat.cu
 - Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
QKV format.
2023-09-15 10:57:29 -07:00
Tianlei Wu
af80542e65
Update optimize_pipeline for SDXL (#17536)
- [x] Optimize SDXL models exported by optimum.
- [x] Enable it to run locally instead of using module.
- [x] Detect external data file in original model, and save with same
format by default.
- [x]  Add tests

### Example
```
pip install optimum transformers diffusers onnx onnxruntime-gpu>=1.16
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl ./sd_xl_base_onnx
python -m  onnxruntime.transformers.models.stable_diffusion.optimize_pipeline -i ./sd_xl_base_onnx -o ./sd_xl_base_fp16 --float16
```

### Known issues
(1) VAE decoder cannot be converted to float16. Otherwise, there will be
black image in output.
(2) To use the float16 models, need a minor change in optimum to convert
the inputs for VAE decoder from float16 to float32 since we keep VAE
decoder as float32. The change is to append a line like the following
after [this
line](afd2b5a366/optimum/pipelines/diffusers/pipeline_stable_diffusion_xl.py (L483))
```
latents = latents.astype(np.float32)
```
2023-09-15 10:17:20 -07:00
Yi Zhang
377f959c69
Run Final_Jar_Testing_Linux_GPU in docker (#17533)
### Description
1. Create a package test image based on [RedHat
UBI](https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image)
2. Install TensorRT 8.6.1.6 in RedHat. (Ref.
https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#maclearn-net-repo-install-rpm)
3. Run Final_Jar_Testing_Linux_GPU in docker (base image:
nvidia/cuda:11.8.0-cudnn8-devel-ubi8)

### Motivation and Context

[AB#18470](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18470)

### Verification

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=354004&view=logs&j=8939b564-1402-57b5-92dc-510eba75e069&t=8939b564-1402-57b5-92dc-510eba75e069
2023-09-15 08:35:55 -07:00
zesongw
a5302fec93
[WebNN EP] Fix bug for PRelu on CPU backend. (#17543)
### Description
WebNN CPU backend expects slope of PRelu to be a static value. For now,
we will not support it.


### Motivation and Context
Fallback this case to pass the CI.
2023-09-15 08:29:48 -07:00
Changming Sun
4d931edd78
Update tensorrt_dependencies in setup.py (#17562)
### Description
The files should not have the minor version number. The names were added
in #17365 by mistake.

### Motivation and Context
We did not successfully exclude them out.
2023-09-15 08:20:47 -07:00
Yulong Wang
94f2ed6bbd
run_CIs_for_external_pr.py: update required pipelines (#17557)
### Description
Add required pipeline "Windows x64 QNN CI Pipeline" to script
"run_CIs_for_external_pr.py"
2023-09-14 21:15:10 -07:00
Yulong Wang
9aafbe3feb
[js/web] revise TensorView (#17473)
### Description

This change:
- removes the unused `Tensor` types declared in
/js/web/lib/wasm/jsep/tensor.ts
- removes duplicated util functions in  /js/web/lib/wasm/jsep/tensor.ts
- renames /js/web/lib/wasm/jsep/**tensor.ts** to
/js/web/lib/wasm/jsep/**tensor-view.ts** and update corresponding
references. It was kind of confusing that we have multiple `Tensor`
types defined in different places also we have multiple `tensor.ts`
source files.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
https://github.com/microsoft/onnxruntime/pull/17465
https://github.com/microsoft/onnxruntime/pull/17469
https://github.com/microsoft/onnxruntime/pull/17470
https://github.com/microsoft/onnxruntime/pull/17472
https://github.com/microsoft/onnxruntime/pull/17473 (this one)
2023-09-14 21:14:44 -07:00
Nat Kershaw (MSFT)
a2fba28f6c
Remove extraneous javascript includes (#17558) 2023-09-14 20:43:24 -07:00
Tianlei Wu
3a1e48dd5a
update BERT notebook with ORT 1.16 (#17524)
- Update BERT notebook with onnxruntime-gpu 1.16
- Add example of packing mode
- Run results in RTX 4090 GPU
2023-09-14 18:15:29 -07:00
Kaz Nishimura
5ed5f13920
[DML EP] Add missing member initializer DmlGraphNodeCreateInfo::nodeCount (#17505)
### Description
<!-- Describe your changes. -->

This adds a missing member initialization.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

It caused an access violation in
`Dml::GraphDescBuilder::BuildGraphDesc`.
2023-09-14 17:47:45 -07:00
Jiajia Qin
41d2ff622c
[js/webgpu] Optimize InstanceNormalization (#17491)
### Description
<!-- Describe your changes. -->
In previous implementation, there are two loops to iterate H * W
elements to calculate the `mean` and `squaredNorm` value in one thread,
meanwhile it outputs H * W elements in one thread. That results it's
very very slow when H * W is a large value. And usually, H * W does be a
large value in a model. For example, in the `candy-8` model, the shapes
of [H, W] are [224,224], [112,112], [56,56] for `InstanceNormalization`
op. And in my ADL, `[1,224,224,32]` consumes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```

In this PR, it uses workgroup memory to optimize the original algorithm.
The advantage is that it can parallelly utilize the 64 (workgroupSize)
threads in one workgroup to calculate `mean` and `squaredNorm` value.
Meanwhile, it only outputs `H * W / workgroupSize` outputs for one
thread, which greatly reduces the overhead for one thread. With this
optimization, `[1,224,224,32]` becomes 3 ms and the main overhead is the
extra two `transpose`. The `createInstanceNormProgramInfo` only needs
`0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
This PR currently only applies the new algorithm to NCHW format. For
NHWC format, one way is to transpose the input so that it can use the
new algorithm. But the disadvantage is that 2 extra transpose are added.
@dakenf also gives another way to optimize NHWC. Details see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar with transpose +
optimized NCHW. But on different GPUs, one is a little better than
another or vice versa. So I prefer this PR only does the NCHW part.
@dakenf can submit his optimization on NHWC.
2023-09-14 17:03:18 -07:00
Hector Li
46fe08226f
[QNN EP] Enable Pad op support for QNN EP (#17508)
### Description
Enable Pad op support for QNN EP to support more models
2023-09-14 14:22:45 -07:00
xhcao
198d468849
[WebGPU/JS] Added Pad operator support (#16928)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-14 13:14:11 -07:00
Rachel Guo
e11849e716
Configure StringNormalizer default_locale for _APPLE_ system (#17339)
### Description
<!-- Describe your changes. -->

As title.

iOS language code uses different syntax for specifying language
code/region code:
https://developer.apple.com/documentation/xcode/choosing-localization-regions-and-scripts

current `default_locale` is not working for iOS.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Issue:
https://github.com/microsoft/onnxruntime/issues/17017

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-14 10:58:25 -07:00
Yulong Wang
7af2f68ef3
[js/web] add a test flag to customize chromium flags (#17545)
### Description
add a test flag to customize chromium flags.

Usage:
npm test -- \<other flags> --chromium-flags=<...>
2023-09-14 10:05:31 -07:00
Changming Sun
5af6279440
Fix Android build (#17540)
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-pick the XNNPack to our patch file.
2023-09-14 07:36:01 -07:00
Hans
ad369a1fad
[js/rn] Support create boolean tensor (#17052)
### Description
<!-- Describe your changes. -->

For some use case need to create boolean tensor.

I've tested on [this
project](https://github.com/hans00/react-native-transformers-example)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Add handle `ONNX_TENSOR_ELEMENT_DATA_TYPE_BOOL`

And it required #15556 (It seems not include in latest release
(v1.15.1))
2023-09-14 15:02:27 +10:00
cao lei
32f5658abb
remove gsl to make status.h independent from gsl (#17402)
### Description
<!-- Describe your changes. -->
Make status.h independent from gsl.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In the coming new feature external EP API (see the prototype
https://github.com/microsoft/onnxruntime/pull/16718), we need to expose
stream in the public header, however, stream is dependent on status.h
which is dependent on gsl. We are seeking a way to decouple stream from
gsl.

From Changming's comment offline, prefast is disabled so all
GSL_SUPPRESS are not taking any effect now. He will handle the warnings
when enable prefast in the future
2023-09-13 21:47:43 -07:00
Arthur Islamov
03b56f7a73
[js/webgpu] FP16 extension registration (#17493)
### Description
First small change to support FP16

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-09-13 13:11:17 -07:00
Patrice Vignola
7edff1c2bf
[DML EP] Add subgraph fusion support (#17504)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-13 13:02:58 -07:00
dependabot[bot]
4e37c5d1f0
Bump actions/checkout from 3 to 4 (#17487) 2023-09-13 09:22:21 -07:00
Yulong Wang
a2e75114cc
[js/web] add sessionOptions.freeDimensionOverrides (#17488)
### Description
Allows to specify fixed size for dynamic input of a model. resolves
#16707

Pending test
2023-09-13 09:17:34 -07:00
Changming Sun
5d3786206b
Fix ROCM's nightly build (#17518)
### Description
PR 15470 updated some C/C++ dependencies. The change caused ROCM EP's
nightly build to fail. see issue
https://github.com/ROCm-Developer-Tools/HIP/issues/2082 for a
background. So, the root cause is HIP compiler has a special requirement
that HIP's include dirs must be used before the operating system's
include folder: /usr/include. HIP adds "-isystem" in front of
"/usr/include". gcc or clang will search the folders added with "-I"
first, then the "-isystem" folder. It works fine as long as we do not
add "-I/usr/include" to the compile commands for *.cu files. It would be wrong if
we already have installed an open source library to /usr and want to use the
prebuilt library from there instead of the current build dir. 


### Motivation and Context
2023-09-13 08:50:14 -07:00
Patrice Vignola
54a092c427
[DML EP] Complete python IO binding implementation (#17344)
@fdwr This is the part 2 of the pybind work that was started earlier.
This adds the following features to the python IO binding
implementation:

- Use a bucketized allocator in order to reduce the number of resource
allocations
- Implement the following functions: `ortvalue_from_numpy`,
`update_inplace`, `ortvalue_from_shape_and_type` and `numpy`
- Modify the `onnxruntime_test_python_iobinding` tests to also run on
DML

---------

Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
2023-09-13 07:26:35 -07:00
Yi Zhang
c0a4fe777f
Move Linux python test into docker (#17479)
### Description
supplement of #17417



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-13 15:21:28 +08:00
Yulong Wang
cdf3e9dba9
[js] update prepack script to use exact version (#17484)
### Description
update prepack script to use exact version.

the prepack script for onnxruntime-node, onnxruntime-web and
onnxruntime-react-native is used to update their referencing version of
dependency "onnxruntime-common".

Previously "~" (tilde symbol) is used. This may cause NPM choose an
older version (if the old version matches the version requirement and
was previously installed already so hit the cache). see also
https://semver.npmjs.com/. [This
build](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134671&view=results)
is caused by this issue.
2023-09-13 00:07:16 -07:00
xhcao
ec94b07f0a
[JS/WebGPU] support Concat.int32 operator (#17003)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-13 00:05:00 -07:00
Yulong Wang
41584b2827
[js/web] ensure ORT initialization to run only once (#17529)
### Description
ensure ORT initialization to run only once
2023-09-12 23:52:08 -07:00
Changming Sun
24a3c740c0
Revert "[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715)" (#17523)
This reverts commit bb136f86c8, then
re-implement it in a different way.
I reverted the original change, then added a version constraint to the
find_package args.

If you still found it picks up wrong gtest version after this change,
you may disable `find_package` by setting
'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest
gtest version is 1.14.0. If at a later time Google releases a new
version of gtest and that one is incompatible with the ONNX Runtime
source code you get today and your dev environment already installed the
new version and you do not want to create a new clean build environment
that is without the package, you can add `--cmake_extra_defines
FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve
the problem.
2023-09-12 22:39:31 -07:00
rui-ren
b52127d22d
update acpt image for the training ci nightly (#17521)
### Description
<!-- Describe your changes. -->

The name of nightly ACPT image has been updated to
`ptebic.azurecr.io/internal/aifx/acpt/nightly-ubuntu-cuda-torch-dev`

As the previous image alias had `cu118`, `torch210dev` or `py38`, any
version update will break the training nightly pipeline



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Using constant image alias to avoid pipeline failure.
2023-09-12 22:32:20 -07:00
Nat Kershaw (MSFT)
bbcf4b45dc
Upgrade doxygen to 1.9.8 (#17525) 2023-09-12 20:44:27 -07:00
Changming Sun
9b755dce9f
Delete all Prefast tasks (#17522)
### Description
Delete all Prefast tasks because the new VS 17.7 version crashes every
time when we run the task on our CI build servers. However, we cannot
reproduce it locally. And this problem blocks us installing security
patches to our CI build machines.

Will use [CodeQL](https://codeql.github.com/) instead. 

### Motivation and Context
Address some security alerts.
2023-09-12 17:40:49 -07:00
Yulong Wang
f923eec28b
[js/web] release session after use in npm test (#17470)
### Description
release session after use in npm test.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
#17465
#17469
#17470 (this one)
2023-09-12 16:59:13 -07:00
Tianlei Wu
49511b5483
Improve performance of prune_graph in onnx_model.py (#17502)
During optimization of SDXL UNet, the prune_graph takes up to 5 minutes.
The cause is to find a node in all nodes is time-consuming. This
optimization will reduce the latency of prune_graph to 2 seconds.

New algorithm will use a hash table (key is first node output, value is
node) to speed up.
2023-09-12 11:38:36 -07:00
Edward Chen
cf672c5887
Use name of temporary provisioning profile. (#17459)
The old provisioning profile no longer works. Switched to a temporary one that we can use before a new one is available. The temporary one has a different name.
2023-09-12 10:56:35 -07:00
Chi Lo
aa5e36456a
[TRT EP] Fix multithreading bug of getting the corrupted trt engine instance (#17507)
Revert to the old TRT EP behavior of securing the whole compute_function
by lock_guard.

Current TRT EP which only puts lock_guard around a critical section
(obvious wrong) inside compute_function.
The issue can happen where one thread is updating the engine in
compute_function whereas another thread still accesses the
stale/corrupted engine instance in compute_function, for example, the
code outside the critical section, `int total_bindings =
trt_engine->getNbBindings()`.

So, make the whole compute_function the critical section should be okay.
2023-09-12 07:37:45 -07:00
Aditya Goel
db558ef9b4
TreeEnsemble speed up (#17449)
### Description
This PR proposes a change that should speed up inference for the
TreeEnsemble* kernels. Previously, when traversing a decision tree, the
`TreeNodeElement` pointer would be incremented or decremented to the
appropriate child node - I assume this was because the
`truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` member
were overloaded for two purposes.

In this PR, we now assign the true branch pointer. We also initialise
`nodes_` in a pre-order traversal which means that the false branch's
position can be resolved statically and does not need to be stored.

I observe the following speed ups. The benchmarks used are derived from
those in https://github.com/siboehm/lleaves/tree/master/benchmarks and
the baseline is the main branch.

NYC Dataset
--------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |

|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 |
| 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 |
| 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 |

Airline Dataset
---------------

| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |

|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 |
| 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 |
| 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 |

mtpl2 Dataset
--------------

| Number of threads | Baseline | Pointer assignment | Pre-ordered
initialisation | Pointer assignment % improvement | Pre-ordered
initialisation % improvement |

|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 |
| 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 |
| 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 |

These were run using an M2 Pro with 16GB of RAM. All times are in
milliseconds and averages over 10 runs with a batch size of 100,000.

### Motivation and Context
Performance improvements.
2023-09-12 10:26:25 +02:00
Arthur Islamov
65249f42e4
[js/web] FP16 Gemm, Softmax & Transpose (#17494)
### Description
First three OPs to support fp16. Will add more once this gets merged
since others depend on changes in js_data_types
2023-09-11 21:09:37 -07:00
Adrian Lizarraga
f20e475e67
[QNN EP] Update QNN SDK to version 2.14.1 (#17467)
### Description
Updates the version of QNN SDK used by CI Pipelines. Enables some tests
fixed by 2.14.1, but still need to look into Resize in a separate PR.

### Motivation and Context
Test latest version of QNN SDK.
2023-09-11 21:07:50 -07:00
Patrice Vignola
8ad9ab1b9a
Fix CPU constant folding not reverting the node to its previous EP (#17399)
A recent change was made in
5a83a67f32
to make `ep_type` a reference instead of having it be a copy, presumably
to avoid assigning strings (so `auto& ep_type =
node->GetExecutionProviderType()` instead of `auto ep_type =
node->GetExecutionProviderType()`). The problem with this change is that
calling `node->SetExecutionProviderType(kCpuExecutionProvider)` will
change the value of the reference itself, which means that it's
impossible to revert the node to its previous EP.

This change fixes this bug and adds an optimization over the previous
approach by only assigning a string when we know that we are dealing
with a non-CPU node.
2023-09-11 17:38:37 -07:00
satyajandhyala
bf6d6961cc
[JS/Web] Added Einsum operator support. (#17401)
### Description
Added Einsum operator support to JSEP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-11 15:57:15 -07:00
Yulong Wang
850baced33
[web] a few updates to web pipeline (#17485)
### Description

Update the Web CI pipelines:

- remove parameter 'WebTemplate': Since we start to support webgpu, the
linux-web-ci.yml is no longer working and it is already out-of-date.
remove this file and parameter so that we always use win-web-ci.yml

- change flag `RunWebGpuTests` into 2 flags, for release and debug.
Currently for CI we only run webgpu tests on release build. But we want
to have the capability to run webgpu tests on debug build as well.


After this PR is merged, next step is to enable both Debug and Release
webgpu tests in PostMerge pipeline.
2023-09-11 11:43:42 -07:00
Kaz Nishimura
24f0893d3c
Enable optimize_by_onnxruntime call for float32 unet model (#17483)
This makes it possible to call `optimize_by_onnxruntime` for float32 unet if `--use_external_data_format` is also used.

### Motivation and Context
When using `optimize_pipeline.py` without `--float16`, `optimize_by_onnxruntime` was not called for unet.
2023-09-10 16:50:50 -07:00
Chi Lo
b827ab0efc
[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468)
If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil
include [oss onnx-tensorrt
parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82)
and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is
needed.

if not, you will encounter following [ build
error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda):

CMake Error: The following variables are used in this project, but they
are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the
CMake files:
    /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR

Note: Not quite sure why in the past when CI still tested with oss
parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined
somewhere back then.
2023-09-08 20:34:57 -07:00
Yulong Wang
89da5a0108
[js/webgpu] exclude WebGPU reduce_log_sum_exp_* float64 test cases (#17472)
### Description

as explained in the comments, tests "test_reduce_log_sum_exp_*" on
opset17/opset18 are excluded because they use float64.

They are passing now because they fallback to CPU. WebGPU does not
support f64.


This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
https://github.com/microsoft/onnxruntime/pull/17465
https://github.com/microsoft/onnxruntime/pull/17469
https://github.com/microsoft/onnxruntime/pull/17470
https://github.com/microsoft/onnxruntime/pull/17472 (this one)
2023-09-08 17:03:04 -07:00
Yulong Wang
550293d9ad
OrtMemoryInfo: support new name "WebGPU_Buffer" (#17469)
### Description
Add new name "WebGPU_Buffer" to OrtMemoryInfo.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
#17465
#17469 (this one)
2023-09-08 16:37:35 -07:00
Caroline Zhu
dcc93909b4
Add training WASM generation to Web CI pipeline (#17319)
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)

Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.

Bundled into this PR are minor formatting fixes + naming fixes.

### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
2023-09-08 15:49:47 -07:00
Tianlei Wu
29a818caa0
Attention fusion for stable diffusion clip model (#17445)
Add attention fusion for stable diffusion clip model to improve performance of SD or SDXL
2023-09-08 14:17:14 -07:00
Yulong Wang
4d753b74a5
[js/common] prepare work for supporting webgpu IO binding implementation (#17465)
### Description
This PR contains a few changes in /js/common/ to support a coming PR for
a full implementation of webgpu IO binding.

- allows pass-through if value is already a Tensor instance in return
value of `handler.run()` called by `InferenceSession.run()`
(inference-session-impl.ts). Specifically, onnxruntime-node and
onnxruntime-react-native uses native bindings to generate a Tensor-like
object so we need to create a real Tensor instance here; for
onnxruntime-web the return value is already a Tensor instance.

- adds new types for GPU buffer supported types: `'float32'|'int32'` ->
`'float32'|'float16'|'int32'|'int64'|'uint32'|'bool'`

- exposes types `GpuBufferDataTypes` together with `CpuPinnedDataTypes`
and `TextureDataTypes` as exported
2023-09-08 13:49:24 -07:00