Commit graph

9641 commits

Author SHA1 Message Date
Arthur Islamov
498b60d8a4
[js/web] fp16 Pool & Reduce (#17512)
### Description
Two more ops to support fp16
2023-09-21 14:52:13 -07:00
Abhishek Jindal
d56fc7ebf5
Layer norm fusion deepspeed stage3 changes (#17614)
### Description
Layer norm fusion changes required for DeepSpeed Stage 3, including a
test case.


### Motivation and Context
This helps fuse layer norm for DeepSpeed Stage 3. The added test case
verifies that the fusion works properly in this scenario.
2023-09-21 14:16:41 -07:00
George Nash
f299016cbe
Fix crash on Windows server 2016 on Intel Gen4 Xeon processors (#17611)
This adds an additional check before enabling MlasGemmU8S8DispatchAmx
for GEMM operations. After checking the CPUID for AMX-TILE and AMX-INT8,
an additional check is performed on the value of the XCR0 register.

The value in the XCR0 register is set by the OS and indicates support
for various CPU features. In this case the bits indicating XTILECFG and
XTILEDATA support are checked.
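
A minimal sketch of the check described above, written in Python for illustration only (the actual change is in native MLAS code). The bit positions follow the x86 XCR0 layout, where bit 17 is XTILECFG and bit 18 is XTILEDATA. The function name `amx_usable` and its parameters are hypothetical.

```python
# XCR0 state-component bits for AMX (x86 architecture definition).
XCR0_XTILECFG = 1 << 17
XCR0_XTILEDATA = 1 << 18
AMX_STATE_MASK = XCR0_XTILECFG | XCR0_XTILEDATA


def amx_usable(cpuid_amx_tile: bool, cpuid_amx_int8: bool, xcr0: int) -> bool:
    """Return True only if the CPU reports AMX and the OS has enabled its state.

    Hypothetical helper: the caller is assumed to have read the CPUID
    feature flags and the XCR0 register value already.
    """
    if not (cpuid_amx_tile and cpuid_amx_int8):
        return False
    # Both tile-state bits must be set by the OS.
    return (xcr0 & AMX_STATE_MASK) == AMX_STATE_MASK
```

With this shape, a CPU that advertises AMX but runs under an OS that leaves the XTILECFG/XTILEDATA bits clear is treated as not supporting the AMX dispatch.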

### Motivation and Context
Fix for a crash reported directly by a customer when running an older
Windows Server OS on newer Gen4 Xeon processors.

Signed-off-by: Nash <george.nash@intel.com>
2023-09-21 09:25:41 -07:00
PeixuanZuo
5b9cd91a9c
[ROCm] fix CI (#17648)
Fix CI; follow-up to #17621.
2023-09-21 07:37:50 -07:00
Changming Sun
57dfd15d7b
Remove dnf update from docker build scripts (#17551)
### Description
1. Remove 'dnf update' from docker build scripts, because it upgrades TRT
packages from CUDA 11.x to CUDA 12.x.
To reproduce it, you can run the following commands in a CentOS CUDA
11.x docker image such as nvidia/cuda:11.8.0-cudnn8-devel-ubi8.
```
export v=8.6.1.6-1.cuda11.8
dnf  install -y libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v}        libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v}  libnvinfer-headers-plugin-devel-${v} 
dnf update -y
```
The last command will generate the following outputs:
```
========================================================================================================================
 Package                                     Architecture       Version                          Repository        Size
========================================================================================================================
Upgrading:
 libnvinfer-devel                            x86_64             8.6.1.6-1.cuda12.0               cuda             542 M
 libnvinfer-headers-devel                    x86_64             8.6.1.6-1.cuda12.0               cuda             118 k
 libnvinfer-headers-plugin-devel             x86_64             8.6.1.6-1.cuda12.0               cuda              14 k
 libnvinfer-plugin-devel                     x86_64             8.6.1.6-1.cuda12.0               cuda              13 M
 libnvinfer-plugin8                          x86_64             8.6.1.6-1.cuda12.0               cuda              13 M
 libnvinfer-vc-plugin-devel                  x86_64             8.6.1.6-1.cuda12.0               cuda             107 k
 libnvinfer-vc-plugin8                       x86_64             8.6.1.6-1.cuda12.0               cuda             251 k
 libnvinfer8                                 x86_64             8.6.1.6-1.cuda12.0               cuda             543 M
 libnvonnxparsers-devel                      x86_64             8.6.1.6-1.cuda12.0               cuda             467 k
 libnvonnxparsers8                           x86_64             8.6.1.6-1.cuda12.0               cuda             757 k
 libnvparsers-devel                          x86_64             8.6.1.6-1.cuda12.0               cuda             2.0 M
 libnvparsers8                               x86_64             8.6.1.6-1.cuda12.0               cuda             854 k
Installing dependencies:
 cuda-toolkit-12-0-config-common             noarch             12.0.146-1                       cuda             7.7 k
 cuda-toolkit-12-config-common               noarch             12.2.140-1                       cuda             7.9 k
 libcublas-12-0                              x86_64             12.0.2.224-1                     cuda             361 M
 libcublas-devel-12-0                        x86_64             12.0.2.224-1                     cuda             397 M

Transaction Summary
========================================================================================================================

```
As you can see from the output,  they are CUDA 12 packages. 

The problem can also be solved by locking the packages' versions with
the "dnf versionlock" command right after installing the CUDA/TRT
packages. However, going forward, to get better reproducibility, I
suggest manually fixing dnf package versions in the installation scripts
like we do for TRT now.

```bash
v="8.6.1.6-1.cuda11.8" &&\
    yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo &&\
    yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v}\
        libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v}  libnvinfer-headers-plugin-devel-${v}
```
When we need to upgrade a package due to a security alert or for some
other reason, we manually change the version string instead of relying
on "dnf update". Though this approach takes more effort, it can make our
pipelines more stable.

2. Move python test to docker
### Motivation and Context
Right now the nightly GPU package mixes CUDA 11.x and CUDA 12.x, and the
resulting package is not usable at all (it crashes every time).
2023-09-21 07:33:29 -07:00
Pranav Sharma
038c76378f
Include onnxruntime_float16.h in the package. (#17637)
### Description
Include onnxruntime_float16.h in the package.

### Motivation and Context
This was missed in the recently released 1.16 pkgs (except Nuget).
2023-09-21 00:08:10 -07:00
Changming Sun
4f3f4366d5
Fix API 16's marker (#17640) 2023-09-20 19:51:50 -07:00
PeixuanZuo
1f991f27f1
[ROCm] add manylinux build test for ROCm CI (#17621)
The manylinux build is used for nightly package generation, and it's
hard to catch issues in time when related files change. This PR adds
the manylinux build to CI.
2023-09-21 10:45:16 +08:00
Changming Sun
dd561f2015
Upgrade sympy (#17639)
AB#17015
2023-09-20 18:44:23 -07:00
Adrian Lizarraga
c55da45e20
[QNN EP] Add more op unit tests (fix Clip, TopK, Tile) (#17457)
### Description
Adds more operator unit tests (all op types should now have at least 1
unit test):
- [x] Reshape
- [x] Flatten
- [x] Squeeze
- [x] Unsqueeze
- [x] Gemm
- [x] Clip
  - Enable QDQ Clip on HTP backend (when not optimized away by L1
    ClipQuantFusion optimizer)
  - Add support for 16-bit QDQ Clip to ClipQuantFusion optimizer
- [x] Split
- [x] Topk
  - Enable QDQ TopK on HTP backend
- [x] Tile
  - Enable QDQ Tile on HTP backend



### Motivation and Context
Increase QNN operator support and test coverage.
2023-09-20 14:31:01 -07:00
Hariharan Seshadri
c65e892089
[CUDA] Fix performance bug in DecoderMaskedMultiheadAttention for BeamSearch (#17613) 2023-09-20 10:35:15 -07:00
Vincent Wang
e6301eee6a
Bump Up Version to 1.17.0 (#17587)
Bump up version to 1.17.0 as the 1.16.0 release branch has been branched
out.
2023-09-20 11:02:58 +08:00
Numfor Tiapo
f297d4dfb9
Remove onnxruntime extensions from list of gitmodules (#17615)
The extensions submodule was removed in [this
PR](https://github.com/microsoft/onnxruntime/pull/17097) but not deleted
from the list of git modules. This causes breaks in code ingesting ORT
that references the git modules for an accurate list of submodules.

This change removes the extensions from the list of git modules to
resolve this issue.
2023-09-19 17:12:14 -07:00
Yulong Wang
d522cc7cc4
Update npm-packaging-pipeline.yml to always use artifacts from main branch (#17604)
### Description
Update npm-packaging-pipeline.yml to always use artifacts from main
branch
2023-09-19 14:42:08 -07:00
Hariharan Seshadri
460f17fbb8
[JS/WebGPU] Support If on WebGPU (#17478) 2023-09-19 12:20:18 -07:00
Bowen Bao
152e61da37
Avoid get_logger overriding root logger level (#17569)
### Description
Instead, set level to DEBUG for the logger returned.



### Motivation and Context

Otherwise, this function call overrides the root logger's level setting,
which affects the logging of other Python packages.
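
A minimal sketch of the pattern described above, using only the standard `logging` module. The function below is a hypothetical stand-in, not the actual `get_logger` from the repository: it sets the level on the named logger it returns and leaves the root logger untouched.

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Return a named logger set to DEBUG without touching the root logger.

    Hypothetical stand-in for illustration; the real function may do more
    (for example, attach handlers).
    """
    logger = logging.getLogger(name)
    # Set the level on this logger only, not via logging.basicConfig or
    # logging.getLogger().setLevel, which would affect every package.
    logger.setLevel(logging.DEBUG)
    return logger
```

Because the level is set per logger, other packages that rely on the root logger's configured level keep their own behavior.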
2023-09-19 10:42:27 -07:00
Tianlei Wu
730fab3050
Refactor Attention cuda kernel (#17578)
* Break QkvToContext into small functions. Each fused and unfused kernel
will have a separate function.
* Move DecoderAttention kernel to a separate file
* Move KV cache related kernel to attention_kv_cache.cu

### Motivation and Context
To make the code easier to maintain.
2023-09-19 09:49:21 -07:00
Wei-Sheng Chin
068300d97e
Pin beartype version (#17599)
PyTorch doesn't like the latest beartype:
https://github.com/pytorch/pytorch/pull/109510
2023-09-18 19:31:04 -07:00
Justin Chu
d350ab31d7
Remove reference to internals in torch.onnx in test (#17550)
- https://github.com/microsoft/onnxruntime/issues/11901
2023-09-18 18:40:09 -07:00
Jambay Kinley
f969e7f8d8
Provide kwargs to remove_shared_initializers (#17539)
### Description
Fixes a bug in `get_shared_initializers` where `signature_cache1,
signature_cache2` are passed as positional arguments to
`remove_shared_initializers` but their positions don't match the
function signature. So `signature_cache1` is passed to `min_elements`
and causes comparison error at line 907.

Pass the arguments as kwargs so that it doesn't rely on their positions.
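
The kind of mismatch described above can be illustrated with a small sketch. The stub below is hypothetical: its name and parameter order are assumptions chosen to reproduce the described symptom, and it only reports how its arguments were bound.

```python
def remove_shared_initializers_stub(
    graph1,
    graph2,
    shared_prefix="shared_",
    min_elements=1024,
    signature_cache1=None,
    signature_cache2=None,
):
    """Hypothetical stub: returns the bindings instead of editing graphs."""
    return {
        "min_elements": min_elements,
        "signature_cache1": signature_cache1,
        "signature_cache2": signature_cache2,
    }


cache1, cache2 = {}, {}

# Positional call: cache1 silently lands in min_elements.
positional = remove_shared_initializers_stub("g1", "g2", "shared_", cache1, cache2)

# Keyword call: each cache reaches the intended parameter.
keyword = remove_shared_initializers_stub(
    "g1", "g2", "shared_", signature_cache1=cache1, signature_cache2=cache2
)
```

In the positional call, `min_elements` ends up holding a dict, so any later numeric comparison against it would fail; the keyword call does not depend on parameter order.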


### Motivation and Context
Fixes the bug described above.
2023-09-18 16:41:11 -07:00
Yi Zhang
7116e66c4b
Improve Win QNNEP pipeline (#17586)
### Description
1. use standard win build template
2. enable compiler cache

### Motivation and Context
Make win build task easy to maintain and accelerate the pipeline.
2023-09-19 07:36:17 +08:00
Arthur Islamov
0f406ca1d3
[js/web] FP16 binary and unary ops (#17515)
### Description
Binary and unary ops with fp16 support
2023-09-18 15:43:32 -07:00
Adrian Lizarraga
dea425e7c1
[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
  - Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add
16-bit support to QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape,
Transpose, Squeeze, and Unsqueeze node groups.

Sample python code to generate QDQ model with 16-bit activations and
8-bit weights:
```python
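from onnxruntime.quantization import QuantType, quantize_static

# input_model_path, output_model_path, data_reader, and args are assumed
# to be defined by the calling script.
quantize_static(
    input_model_path,
    output_model_path,
    data_reader,
    quant_format=args.quant_format,
    per_channel=args.per_channel,
    activation_type=QuantType.QUInt16,  # 16-bit activations
    weight_type=QuantType.QUInt8,  # 8-bit weights
    extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
)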
``` 

Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If the 16bit types are used without enabling
`UseQDQContribOps`, the QDQ ops domains are overridden to
'com.microsoft', and a warning is printed to stdout.
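
For reference, the arithmetic behind 16-bit linear quantization can be sketched in plain Python. This is not the MLAS kernel or the contrib op implementation; it assumes the standard QuantizeLinear formula (round to nearest even, add the zero point, saturate to the uint16 range), and the function names are hypothetical.

```python
UINT16_MIN, UINT16_MAX = 0, 65535


def quantize_linear_u16(values, scale, zero_point):
    """Quantize floats to uint16: saturate(round(x / scale) + zero_point)."""
    # Python's round() rounds halves to the nearest even integer.
    return [
        min(UINT16_MAX, max(UINT16_MIN, round(v / scale) + zero_point))
        for v in values
    ]


def dequantize_linear(quantized, scale, zero_point):
    """Map quantized integers back to floats: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in quantized]
```

Compared with 8-bit activations, the same formula with a 0..65535 range allows a much smaller scale for the same value range.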

### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation

Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover 
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding

QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops

Quantization tool:
- [x] Test creation of 16-bit QDQ model
### Motivation and Context
Support mixed precision (8bit weights, 16bit activations) models.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-18 09:43:34 -07:00
PeixuanZuo
af14ae8050
[ROCm] Update whisper benchmark script (#17391)
- update whisper benchmark for ROCm EP.
2023-09-18 13:34:39 +08:00
simonjub
c969237321
[TRT EP] Fix ProviderOptions functions (#17567)
### Description
When trying to use the TRT EP option trt_extra_plugin_lib_paths I
noticed that my custom op library was not being loaded by the EP. After
some digging I found that code was missing to update this option when
UpdateTensorRTProviderOptions() is used to set it.

At the same time I noticed that char arrays were allocated in that
function and wondered where they are de-allocated. When I found it was
done in ReleaseTensorRTProviderOptions(), I noticed that a few
de-allocations were missing.

### Motivation and Context
This PR fixes the problems described above.
2023-09-17 12:19:32 -07:00
Yifan Li
705f8a3718
[TensorRT EP] Fallback to CUDA EP if it's explicitly assigned (#17535)
### Description
* TensorRT EP can fall back to CUDA EP if it's explicitly assigned
* MIGraphX can fall back to ROCM if it's explicitly assigned

Test cases:
| When user specifies providers= | self._fallback_providers= |
| --- | --- |
| ["TensorrtExecutionProvider", "CUDAExecutionProvider"] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider", ("CUDAExecutionProvider", cuda_options)] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider"] | ["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options)] | ["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options), ("CUDAExecutionProvider", cuda_options)] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider", "CPUExecutionProvider"] | ["CPUExecutionProvider"] |
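
The test cases in the table can be expressed as a small sketch. This is not the actual session code: the function name `fallback_providers` is hypothetical, and the behavior when neither TensorRT nor MIGraphX is requested (CPU only) is an assumption, since the table does not cover it.

```python
# Primary EP -> the EP it may fall back to when that EP is explicitly listed.
_FALLBACK_PAIRS = {
    "TensorrtExecutionProvider": "CUDAExecutionProvider",
    "MIGraphXExecutionProvider": "ROCMExecutionProvider",
}


def _provider_name(provider):
    """Providers may be given as "Name" or as ("Name", options)."""
    return provider if isinstance(provider, str) else provider[0]


def fallback_providers(providers):
    """Return the fallback list for a user-specified provider list."""
    names = [_provider_name(p) for p in providers]
    for primary, secondary in _FALLBACK_PAIRS.items():
        # Fall back to the secondary EP only if the user listed it explicitly.
        if primary in names and secondary in names:
            return [secondary, "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

Each row of the table corresponds to one call, for example `fallback_providers(["TensorrtExecutionProvider"])` for the third row.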





### Motivation and Context
Address the comments in https://github.com/microsoft/onnxruntime/issues/17394
and apply the same logic to [MIGraphX, ROCM]
2023-09-15 15:16:11 -07:00
Yulong Wang
efd416b71f
[js/web] update test to explicitly fail for webnn without proxy (#17554)
### Description

Update test to explicitly fail for webnn without proxy.

I am making this change because testing webnn together with another
backend silently enables the proxy. I want the test runner to reset
fewer flags implicitly. If the proxy is not enabled, the webnn test
should fail.

@Honry please let me know if other places (e.g. CI scripts) should
change as well.
2023-09-15 14:40:22 -07:00
Yulong Wang
155887593d
[js/web] update npm test to load test cases only for required backends (#17555)
### Description
update npm test to load test cases for required backends.

No need to load test case list for the backends that we don't test.
2023-09-15 13:55:25 -07:00
Dmitri Smirnov
fdb132643d
Remove redundant Resolve() after each inlined function (#17556)
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.

### Motivation and Context
Poor performance for inlining the model and session initialization.

Original model before Resolve() removal
FunctionTest.Profiling (**65953 ms**)
After Resolve() Removal
FunctionTest.Profiling (**2911 ms**)

RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers
Non-inlined model consists of functions and Level1 optimizers have no
effect.
FunctionTest.Profiling (**9851 ms**)
2023-09-15 12:13:37 -07:00
Tianlei Wu
adb0be45d3
Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559)
To avoid a huge cu file and make code more readable:
- Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
- Move ConcatPastToPresent to attention_concat.cu
- Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
  QKV format.
2023-09-15 10:57:29 -07:00
Tianlei Wu
af80542e65
Update optimize_pipeline for SDXL (#17536)
- [x] Optimize SDXL models exported by optimum.
- [x] Enable it to run locally instead of using module.
- [x] Detect external data file in original model, and save with same
format by default.
- [x]  Add tests

### Example
```
pip install optimum transformers diffusers onnx "onnxruntime-gpu>=1.16"
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl ./sd_xl_base_onnx
python -m  onnxruntime.transformers.models.stable_diffusion.optimize_pipeline -i ./sd_xl_base_onnx -o ./sd_xl_base_fp16 --float16
```

### Known issues
(1) The VAE decoder cannot be converted to float16. Otherwise, the
output will be a black image.
(2) Using the float16 models requires a minor change in optimum to
convert the inputs of the VAE decoder from float16 to float32, since we
keep the VAE decoder as float32. The change is to append a line like the following
after [this
line](afd2b5a366/optimum/pipelines/diffusers/pipeline_stable_diffusion_xl.py (L483))
```
latents = latents.astype(np.float32)
```
2023-09-15 10:17:20 -07:00
Yi Zhang
377f959c69
Run Final_Jar_Testing_Linux_GPU in docker (#17533)
### Description
1. Create a package test image based on [RedHat
UBI](https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image)
2. Install TensorRT 8.6.1.6 in RedHat. (Ref.
https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#maclearn-net-repo-install-rpm)
3. Run Final_Jar_Testing_Linux_GPU in docker (base image:
nvidia/cuda:11.8.0-cudnn8-devel-ubi8)

### Motivation and Context

[AB#18470](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18470)

### Verification

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=354004&view=logs&j=8939b564-1402-57b5-92dc-510eba75e069&t=8939b564-1402-57b5-92dc-510eba75e069
2023-09-15 08:35:55 -07:00
zesongw
a5302fec93
[WebNN EP] Fix bug for PRelu on CPU backend. (#17543)
### Description
The WebNN CPU backend expects the slope of PRelu to be a static value.
For now, other cases are not supported.


### Motivation and Context
Fall back in this case so that the CI passes.
2023-09-15 08:29:48 -07:00
Changming Sun
4d931edd78
Update tensorrt_dependencies in setup.py (#17562)
### Description
The file names should not include the minor version number. The names
were added in #17365 by mistake.

### Motivation and Context
The files were not successfully excluded.
2023-09-15 08:20:47 -07:00
Yulong Wang
94f2ed6bbd
run_CIs_for_external_pr.py: update required pipelines (#17557)
### Description
Add required pipeline "Windows x64 QNN CI Pipeline" to script
"run_CIs_for_external_pr.py"
2023-09-14 21:15:10 -07:00
Yulong Wang
9aafbe3feb
[js/web] revise TensorView (#17473)
### Description

This change:
- removes the unused `Tensor` types declared in
/js/web/lib/wasm/jsep/tensor.ts
- removes duplicated util functions in  /js/web/lib/wasm/jsep/tensor.ts
- renames /js/web/lib/wasm/jsep/**tensor.ts** to
/js/web/lib/wasm/jsep/**tensor-view.ts** and updates the corresponding
references. It was confusing to have multiple `Tensor` types defined in
different places as well as multiple `tensor.ts` source files.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
https://github.com/microsoft/onnxruntime/pull/17465
https://github.com/microsoft/onnxruntime/pull/17469
https://github.com/microsoft/onnxruntime/pull/17470
https://github.com/microsoft/onnxruntime/pull/17472
https://github.com/microsoft/onnxruntime/pull/17473 (this one)
2023-09-14 21:14:44 -07:00
Nat Kershaw (MSFT)
a2fba28f6c
Remove extraneous javascript includes (#17558) 2023-09-14 20:43:24 -07:00
Tianlei Wu
3a1e48dd5a
update BERT notebook with ORT 1.16 (#17524)
- Update BERT notebook with onnxruntime-gpu 1.16
- Add example of packing mode
- Run results in RTX 4090 GPU
2023-09-14 18:15:29 -07:00
Kaz Nishimura
5ed5f13920
[DML EP] Add missing member initializer DmlGraphNodeCreateInfo::nodeCount (#17505)
### Description

This adds a missing member initialization.

### Motivation and Context

It caused an access violation in
`Dml::GraphDescBuilder::BuildGraphDesc`.
2023-09-14 17:47:45 -07:00
Jiajia Qin
41d2ff622c
[js/webgpu] Optimize InstanceNormalization (#17491)
### Description
In the previous implementation, a single thread ran two loops over the
H * W elements to calculate the `mean` and `squaredNorm` values, and
also wrote H * W output elements. That makes it very slow when H * W is
large, which is usually the case in a model. For example, in the
`candy-8` model, the shapes of [H, W] are [224,224], [112,112], [56,56]
for the `InstanceNormalization` op. On my ADL, `[1,224,224,32]` takes
17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```

This PR uses workgroup memory to optimize the original algorithm. The
advantage is that the 64 (workgroupSize) threads in one workgroup
compute the `mean` and `squaredNorm` values in parallel. Meanwhile, each
thread only writes `H * W / workgroupSize` outputs, which greatly
reduces the per-thread overhead. With this optimization,
`[1,224,224,32]` drops to 3 ms, and the main overhead is the two extra
`transpose` operations. `createInstanceNormProgramInfo` itself only
needs `0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
This PR currently applies the new algorithm only to the NCHW format. For
the NHWC format, one option is to transpose the input so that it can use
the new algorithm, at the cost of 2 extra transposes. @dakenf also
suggests another way to optimize NHWC. For details, see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. Its perf is similar to transpose +
optimized NCHW, but which one is slightly better varies across GPUs.
So I prefer that this PR only does the NCHW part. @dakenf can submit
his optimization for NHWC.
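
The idea of splitting the reduction across a workgroup can be simulated in plain Python. This is only a sketch, not the WGSL shader from the PR: the strided partial sums, the tree reduction, and the reading of `squaredNorm` as the sum of squared deviations from the mean are assumptions, and the function names are hypothetical.

```python
WORKGROUP_SIZE = 64  # must be a power of two for the tree reduction below


def workgroup_sum(values, workgroup_size=WORKGROUP_SIZE):
    """Sum `values` the way a workgroup would: partial sums, then a tree."""
    # Each simulated thread t accumulates elements t, t + size, t + 2*size, ...
    partial = [sum(values[t::workgroup_size]) for t in range(workgroup_size)]
    # Tree reduction over the shared partial sums.
    stride = workgroup_size // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]


def mean_and_squared_norm(channel):
    """Return (mean, sum of squared deviations) for one H * W channel."""
    mean = workgroup_sum(channel) / len(channel)
    squared_norm = workgroup_sum([(v - mean) ** 2 for v in channel])
    return mean, squared_norm
```

In this simulation each thread touches only `H * W / workgroupSize` elements per pass, which mirrors the reduced per-thread work described above.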
2023-09-14 17:03:18 -07:00
Hector Li
46fe08226f
[QNN EP] Enable Pad op support for QNN EP (#17508)
### Description
Enable Pad op support for QNN EP to support more models
2023-09-14 14:22:45 -07:00
xhcao
198d468849
[WebGPU/JS] Added Pad operator support (#16928)
### Description



### Motivation and Context
2023-09-14 13:14:11 -07:00
Rachel Guo
e11849e716
Configure StringNormalizer default_locale for _APPLE_ system (#17339)
### Description

As title.

iOS uses a different syntax for specifying the language code/region
code:
https://developer.apple.com/documentation/xcode/choosing-localization-regions-and-scripts

The current `default_locale` does not work on iOS.


### Motivation and Context

Issue:
https://github.com/microsoft/onnxruntime/issues/17017

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-14 10:58:25 -07:00
Yulong Wang
7af2f68ef3
[js/web] add a test flag to customize chromium flags (#17545)
### Description
Add a test flag to customize chromium flags.

Usage:
npm test -- \<other flags> --chromium-flags=<...>
2023-09-14 10:05:31 -07:00
Changming Sun
5af6279440
Fix Android build (#17540)
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-picked the XNNPack change into our patch file.
2023-09-14 07:36:01 -07:00
Hans
ad369a1fad
[js/rn] Support create boolean tensor (#17052)
### Description

Some use cases need to create a boolean tensor.

I've tested on [this
project](https://github.com/hans00/react-native-transformers-example)

### Motivation and Context

Add handling for `ONNX_TENSOR_ELEMENT_DATA_TYPE_BOOL`.

This requires #15556 (which does not seem to be included in the latest
release, v1.15.1).
2023-09-14 15:02:27 +10:00
cao lei
32f5658abb
remove gsl to make status.h independent from gsl (#17402)
### Description
Make status.h independent from gsl.


### Motivation and Context
For the upcoming external EP API feature (see the prototype
https://github.com/microsoft/onnxruntime/pull/16718), we need to expose
stream in the public header. However, stream depends on status.h, which
depends on gsl. We are looking for a way to decouple stream from gsl.

Per Changming's offline comment, prefast is disabled, so the
GSL_SUPPRESS annotations have no effect right now. He will handle the
warnings when prefast is enabled in the future.
2023-09-13 21:47:43 -07:00
Arthur Islamov
03b56f7a73
[js/webgpu] FP16 extension registration (#17493)
### Description
First small change to support FP16

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-09-13 13:11:17 -07:00
Patrice Vignola
7edff1c2bf
[DML EP] Add subgraph fusion support (#17504)
### Description



### Motivation and Context
2023-09-13 13:02:58 -07:00
dependabot[bot]
4e37c5d1f0
Bump actions/checkout from 3 to 4 (#17487) 2023-09-13 09:22:21 -07:00