Commit graph

10461 commits

Author SHA1 Message Date
Changming Sun
04afe77305
Update ThirdPartyNotices.txt: Add Intel neural-speed (#19332)
Add Intel neural-speed to ThirdPartyNotices.txt because it will be
shipped in the default build in most of our packages.
2024-01-30 12:40:30 -08:00
kunal-vaishnavi
febec1c586
Update Whisper export with beam search (#19322)
### Description
This PR updates the Whisper export with beam search by adding the
following.

- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the
Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` to allow
existing attention fusions to continue working
- Re-uses the cache directory when loading the PyTorch model to reduce
memory used on disk
- Adds `--disable_auto_mixed_precision` to the example FP16 export
command

### Motivation and Context
- [This PR](https://github.com/microsoft/onnxruntime/pull/19112) added
the `is_unidirectional` parameter to `CheckInputs`, but it was not
provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](https://github.com/microsoft/onnxruntime/pull/19200)
explains the reasoning behind why `eager` is used to load the
`WhisperAttention` class.
- By re-using the cache directory for loading the PyTorch model, only
one copy of the PyTorch model is saved on disk instead of two copies.
- By providing this flag, there will be fewer Cast nodes in the Whisper
with beam search model to switch between FP16 and FP32 precision.
2024-01-30 11:59:15 -08:00
ivberg
3454f86e70
Windows - Only set thread affinity on Server with auto affinity (#19318)
### Description
Only set thread affinity on Server with auto affinity. Auto affinity =
when the API user does not specify thread settings or affinity themselves.

### Motivation and Context
On client, it is best to let the OS scheduler handle thread placement. On big
(P-Core) / little (E-Core) CPU designs, affinity overrides Win32 Quality of
Service (QoS) and leads to high power usage. Specifically, for background
workloads whose process is tagged QoS Utility (Background), this affinity
setting overrides the OS scheduler, which only wants to schedule on the
E-Cores. The P-Cores therefore wake up and use more energy than intended on
client, and users get less battery life.

Foreground AI workloads would be tagged QoS High and would run the ORT
threads on all cores.
2024-01-30 10:53:10 -08:00
liqun Fu
b84cb247e3
io_binding to handle optional input of sequence type_proto (#19273) 2024-01-30 10:25:14 -08:00
Wei-Sheng Chin
ffc3431a66
Update ScatterElements to Support Opset 13, 15, 18 (#19198)
`ScatterElements` in opset 18 has been around for a while. However, the
highest opset supporting `ScatterElements` in ORT is 13. This PR
implements this op in the CUDA EP by replacing `assignment` in the current
CUDA kernel with `atomic reduction` (e.g., atomic add, atomic max). A
series of fundamental atomic functions (e.g., atomic max for int8_t and
half) are implemented in `common.cuh`; the implementation is general
enough to cover old CUDA and new CUDA versions.

- The core changes are in `cuda/atomic/common.cuh` with very detailed
documentation including `bit-wise operation's visualization`. They are
also copied to `rocm/atomic/common.cuh` to support AMD GPU.
- `/cuda/tensor/gather_elements_impl.cu` contains small changes to call
the new atomic functions to support new `reduction` behavior in new
`ScatterElements`.
- New `ScatterElements` are defined in `rocm_execution_provider.cc` and
`cuda_execution_provider.cc`.
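
For reference, the `reduction` behavior that the atomic functions support can be sketched on the CPU for the 1-D case. This is a sequential pure-Python illustration, not the CUDA kernel; the function name `scatter_elements_1d` is made up for this sketch. On a GPU, duplicate indices are handled by different threads, which is why plain assignment has to become an atomic reduction.

```python
from typing import Callable, Dict, List

# Reduction modes of ScatterElements as of opset 18.
_REDUCERS: Dict[str, Callable[[float, float], float]] = {
    "none": lambda old, new: new,  # plain assignment
    "add": lambda old, new: old + new,
    "mul": lambda old, new: old * new,
    "max": max,
    "min": min,
}


def scatter_elements_1d(data: List[float], indices: List[int],
                        updates: List[float], reduction: str = "none") -> List[float]:
    """1-D reference: out[indices[i]] = f(out[indices[i]], updates[i])."""
    reduce_fn = _REDUCERS[reduction]
    out = list(data)  # the input tensor is not modified
    for idx, upd in zip(indices, updates):
        out[idx] = reduce_fn(out[idx], upd)
    return out
```

With `reduction="add"`, two updates aimed at the same index both contribute to the result, which is the case the atomic add covers in the kernel.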
2024-01-30 09:18:50 -08:00
Rachel Guo
3e17ca3dab
Fix iOS artifacts issue in Microsoft.ML.OnnxRuntime Nuget Package (#19311)
### Description

Only include the iOS architectures' framework in the artifacts that go into
the NuGet package.


### Motivation and Context

Related issue:
https://github.com/microsoft/onnxruntime/issues/19295#issuecomment-1914143256

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-01-30 08:44:20 -08:00
Changming Sun
a92802f940
Disable a few tests for wasm build (#19316) 2024-01-30 08:16:57 -08:00
Vincent Wang
9f68a27c7a
[ORTModule] Handle Cast on Constant Number on Triton Code-gen (#19321)
When using scaled_dot_product_attention with the float16 type, the exported
graph contains Sqrt(float16(constant)), which cannot be constant-folded in ORT
because the Sqrt CPU kernel doesn't support float16. This causes the Triton
code-gen to generate code like:

result = 128.0.to(tl.float32)

This code cannot be compiled because .to() cannot be applied to a
constant.

This PR handles this case so that a Cast is not applied to a constant
number.
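
A minimal sketch of the idea, assuming the code generator knows whether an operand is a constant literal. The helper `emit_cast` and its parameters are hypothetical and are not the actual ORTModule code-gen API.

```python
def emit_cast(operand: str, is_constant: bool, triton_dtype: str = "tl.float32") -> str:
    """Return the source text for casting `operand` to `triton_dtype`.

    Hypothetical helper: a numeric literal has no .to() method in the
    generated code, so a constant operand is emitted unchanged.
    """
    if is_constant:
        return operand
    return f"{operand}.to({triton_dtype})"


print("result = " + emit_cast("128.0", is_constant=True))
print("result = " + emit_cast("tmp0", is_constant=False))
```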
2024-01-30 17:04:01 +08:00
Xu Xing
624b4e2063
[js/webgpu] Remove enableShapesUniforms (#19279) 2024-01-29 17:49:06 -08:00
Chi Lo
00d048121b
[TensorRT EP] Fix InferenceSession::Run() not thread-safe issue (#19301)
InferenceSession::Run() is guaranteed to be thread-safe, meaning that
multiple threads can call this function concurrently. TRT EP therefore needs
to handle concurrency carefully; otherwise, the following concurrency issue
might happen:

- To perform inference concurrently in multiple streams, it is suggested to
use one TRT execution context per stream.
In the design of TRT EP (which does not use a per-thread context
implementation), if multiple threads call InferenceSession::Run()
concurrently, the TRT execution context instance is shared by all the
threads and each thread acquires a different stream from ORT.
So TRT EP ends up with one TRT execution context using multiple
streams, which is not suggested.
But since the whole compute_func() is protected by the lock, enforcing
cudaStreamSynchronize() there guarantees one TRT execution context per
stream.

Therefore, TRT EP needs to call cudaStreamSynchronize() in
compute_func(), i.e. wait until the stream has completed all
operations, to prevent the concurrency issue.

GitHub issue: https://github.com/microsoft/onnxruntime/issues/19275
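
The locking argument above can be illustrated with a toy model. This is only a sketch with made-up names (`FakeExecutionContext`, `compute_func`); it uses Python threads and a set in place of CUDA streams, and `synchronize` merely plays the role of cudaStreamSynchronize().

```python
import threading


class FakeExecutionContext:
    """Stand-in for the shared execution context; tracks streams with pending work."""

    def __init__(self) -> None:
        self.busy_streams = set()
        self.max_concurrent_streams = 0

    def enqueue(self, stream_id: int) -> None:
        self.busy_streams.add(stream_id)
        self.max_concurrent_streams = max(self.max_concurrent_streams,
                                          len(self.busy_streams))

    def synchronize(self, stream_id: int) -> None:
        # Plays the role of cudaStreamSynchronize(): the stream is idle afterwards.
        self.busy_streams.discard(stream_id)


context = FakeExecutionContext()
lock = threading.Lock()


def compute_func(stream_id: int) -> None:
    # The whole function is serialized and the stream is drained before the
    # lock is released, so the context never has work pending on two streams.
    with lock:
        context.enqueue(stream_id)
        context.synchronize(stream_id)


threads = [threading.Thread(target=compute_func, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(context.max_concurrent_streams)
```

If the `synchronize` call were moved outside the lock, a second thread could enqueue on its own stream while the first stream still had pending work, which is the situation the fix avoids.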
2024-01-29 17:36:27 -08:00
Baiju Meswani
465540d29b
Update training api python documentation (#19287) 2024-01-29 14:14:15 -08:00
Changming Sun
e91d91ae4f
Fix a build issue: /MP was not enabled correctly (#19190)
### Description

In PR #19073 I misunderstood the value of "--parallel". Instead of
testing whether args.parallel is None, I should test the value returned
by the number_of_parallel_jobs function.

If build.py was invoked without --parallel, then args.parallel equals
1 because that is the default value. In this case we should not add "/MP".
However, the current code adds it, because `if args.parallel` is
evaluated as `if 1`, which is True.
If build.py was invoked with --parallel but without a number, then
args.parallel equals 0 because the number is unspecified. In this case we
should add "/MP". However, the current code does not add it, because `if
args.parallel` is evaluated as `if 0`, which is False.
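
The truthiness pitfall can be reproduced with a small argparse sketch. The argument declaration below is an assumption chosen to match the values described (default 1, bare flag 0), and `number_of_parallel_jobs` here is a simplified stand-in that returns a fixed job count of 4 for the automatic case.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed declaration: a bare "--parallel" stores const=0, omitting it stores default=1.
parser.add_argument("--parallel", nargs="?", const=0, default=1, type=int)


def number_of_parallel_jobs(parallel: int) -> int:
    """Simplified stand-in: 0 means 'choose the job count automatically' (4 here)."""
    return 4 if parallel == 0 else parallel


def wants_mp_buggy(argv) -> bool:
    args = parser.parse_args(argv)
    return bool(args.parallel)  # truthiness test: wrong for both 1 and 0


def wants_mp_fixed(argv) -> bool:
    args = parser.parse_args(argv)
    return number_of_parallel_jobs(args.parallel) > 1


for argv in ([], ["--parallel"], ["--parallel", "8"]):
    print(argv, wants_mp_buggy(argv), wants_mp_fixed(argv))
```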

This also adds a new build flag: use_binskim_compliant_compile_flags, which is intended to be only used in ONNX Runtime team's build pipelines for compliance reasons. 

### Motivation and Context
2024-01-29 12:45:38 -08:00
Changming Sun
4ee222413f
Update OneBranch.Nuget-WindowsAI-Pipeline.Official.yml for Azure Pipelines (#19293)
To fix a pipeline issue.
2024-01-29 12:00:42 -08:00
Guenther Schmuelling
9e69606360
fix f16 for attention, enable slice and flatten for more types (#19262) 2024-01-29 10:13:46 -08:00
Yi Zhang
e96a038f01
Add VP test in Stable diffusion pipeline (#19300)
### Description
1. Add visual parity test based on openai clip model
2. Add trigger rules

### Motivation and Context
1. check generated image is expected
2. reduce unnecessary triggers
2024-01-29 09:33:58 -08:00
PeixuanZuo
82c1cb416b
[CUDA] Refactor GroupNorm and add common vectorize implementation (#19158)
Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-29 09:15:10 +08:00
Adrian Lizarraga
6d7ac9c93a
Support general session config entries in perf test tool (#19289)
### Description
Adds the ability to specify general session configuration entries via
the `-C` command-line option.
Example: `-C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"`

Some session config entries can already be set via dedicated
command-line options. If the user uses multiple command-line options to
set the same session config entry, we'll print a warning. Note that the
dedicated command-line options will take precedence.
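
The `key|value` format shown in the example can be parsed as sketched below. This is an illustration of the format only, not the perf test tool's own parser (which is not written in Python); the function name and the error handling are assumptions.

```python
from typing import Dict


def parse_session_config(option_value: str) -> Dict[str, str]:
    """Parse 'key|value key|value' into a dict (illustrative only)."""
    entries: Dict[str, str] = {}
    for token in option_value.split():
        key, sep, value = token.partition("|")
        if not sep or not key:
            raise ValueError(f"expected key|value, got {token!r}")
        entries[key] = value
    return entries


print(parse_session_config("session.disable_cpu_ep_fallback|1 ep.context_enable|1"))
```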

### Motivation and Context
Allows setting session configurations when testing EPs. QNN EP, for
example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*`
options.
2024-01-26 19:51:48 -08:00
Tianlei Wu
d7ff81dfb7
[CUDA] support user_compute_stream in python API (#19229)
### Description
Passing a user CUDA stream to avoid synchronization is an important feature
for the Python API. Here we allow the user to pass a CUDA stream
for the CUDA provider. Note that the TRT and ROCm providers need a similar
change, which is not included in this pull request.

Note that `has_user_compute_stream` is set automatically based on
whether a CUDA stream is passed, so setting
`has_user_compute_stream` through the Python API has no effect.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/19094
2024-01-26 10:34:43 -08:00
cao lei
7d4dc66846
ExecutionProvider API refactor - make GenerateMetaDefId a standalone function, decouple it from EP (#18977)
### Description
Make the EP member function GenerateMetaDefId a standalone function that is
decoupled from the EP.


### Motivation and Context
This change is part of the ExecutionProvider API refactoring. We will first
make the ExecutionProvider API clean for the later EPv2 work.
2024-01-26 07:39:08 -08:00
Baiju Meswani
fc44f96ad5
Add support for a collection of OrtValue as inputs and outputs to C# TrainingSession (#19048) 2024-01-25 21:55:36 -08:00
Tianlei Wu
358650d441
Fix BigModel stable diffusion pipeline (#19277)
### Description
Fix two issues:
(1) Only single quotes can be used inside `bash -c "..."`. The current
pipeline job stopped at `python3 demo_txt2img.py astronaut` and skipped the
following commands. This change removes the remaining commands to
get the same effect (otherwise, the pipeline runtime might be 2 hours
instead of 15 minutes).
(2) Fix a typo in `Stable`.
2024-01-25 17:19:04 -08:00
Xu Xing
a3f0e2422b
[js/webgpu] Support f16 uniform (#19098)
### Description



### Motivation and Context
2024-01-25 16:58:22 -08:00
Tianlei Wu
8b4517218b
Remove USE_CUTLASS flag (#19271)
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.

Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build. 

Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support the latest GPUs.

It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Motivation and Context
2024-01-25 16:57:58 -08:00
Xu Xing
656ca66186
[js/webgpu] Support uniforms for conv, conv transpose, conv grouped (#18753) 2024-01-25 15:37:05 -08:00
Chi Lo
a2867b911e
[TensorRT EP] Fix mem leak for TRT plugins custom ops (#19248)
TRT EP's GetTensorRTCustomOpDomainList() creates a vector of
OrtCustomOpDomain objects and releases ownership of those objects.
However, those objects are never freed.
At the session level, we need to make TRT EP remember which OrtCustomOpDomain
objects it created and release them at EP destruction time.
2024-01-25 11:51:39 -08:00
Tianlei Wu
2b285cd78a
[CUDA] Add functions to dump bfloat16 tensors (#19266)
### Description
GroupQueryAttention added BFloat16 support in
https://github.com/microsoft/onnxruntime/pull/19095, and there is a build
error when dumping is enabled. This change supports printing bfloat16 tensors
to the console.
2024-01-25 09:30:15 -08:00
Jiajie Hu
5b06505073
[js/webgpu] Fix Tanh explosion (#19201)
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```

### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
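
The difference between the two forms can be checked on the CPU with a short sketch. The function names are made up for this illustration, and `naive_tanh` substitutes IEEE infinity for Python's `OverflowError` to mimic the overflow behavior described above; the actual fix lives in the WebGPU shader code.

```python
import math


def naive_tanh(x: float) -> float:
    """(e^x - e^-x) / (e^x + e^-x), with overflow mapped to infinity."""
    def exp_or_inf(v: float) -> float:
        try:
            return math.exp(v)
        except OverflowError:
            return math.inf

    ep, en = exp_or_inf(x), exp_or_inf(-x)
    return (ep - en) / (ep + en)


def stable_tanh(x: float) -> float:
    """Piecewise form from the description: only e^(-2|x|) <= 1 is evaluated."""
    if x == 0:
        return 0.0
    t = math.exp(-2.0 * abs(x))
    return math.copysign((1.0 - t) / (1.0 + t), x)


print(naive_tanh(1000.0), stable_tanh(1000.0), stable_tanh(-1000.0))
```

Because the exponent passed to `exp` is never positive in the piecewise form, the intermediate value stays in [0, 1] and the quotient cannot become infinity over infinity.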
2024-01-25 08:25:35 -08:00
PeixuanZuo
1c92e56dc0
[Cuda] Refactor GroupNorm (#19146)
Split the GroupNorm implementation into multiple files so that the ROCm EP can
reuse the CUDA code.

Related PR: https://github.com/microsoft/onnxruntime/pull/19158

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-25 22:28:47 +08:00
Vincent Wang
2b87dd373a
[ORTModule] Remove Mod from Hash to Avoid Conflict for Triton Code-gen (#19256)
Remove mod (10**8) from hash to avoid conflict for Triton code-gen.
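
As a rough illustration of why reducing a hash modulo 10**8 invites conflicts, the birthday-bound approximation below estimates the chance of at least one collision. The key count of 10,000 and the 64-bit comparison space are assumptions picked for the example; the commit does not state either figure.

```python
import math


def collision_probability(num_keys: int, num_buckets: int) -> float:
    """Birthday-bound approximation: P(collision) ~= 1 - exp(-n(n-1) / (2m))."""
    return 1.0 - math.exp(-num_keys * (num_keys - 1) / (2.0 * num_buckets))


# 10**8 buckets (the removed modulus) versus an assumed 64-bit hash space.
print(collision_probability(10_000, 10**8))
print(collision_probability(10_000, 2**64))
```

Under these assumptions the first figure is roughly 0.39, while the second is negligible.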
2024-01-25 10:16:41 +08:00
Dmitri Smirnov
7dd1f4b8e2
Pad-18 Cuda implementation (#19211)
### Description
Implement Pad-18 for Cuda.

### Motivation and Context
Latest models converted by Dynamo fall back on CPU for Pad with
performance degradation.

This contributes to
https://github.com/microsoft/onnx-rewriter/issues/126
2024-01-24 18:12:04 -08:00
Phoebe Chen
4477f57ee3
Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)
### Description  
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.

### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.

### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```

### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239

---------

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-24 16:27:05 -08:00
Wanming Lin
0c2f0ba90d
[WebNN EP] Support conv1d by reshaping with prepended 1's (#18857)
WebNN only supports 4-D inputs for conv2d and convTranspose2d. This PR
supports 3-D inputs (i.e. conv1d) by prepending a size-1 dimension and
adding several reshape operations.
2024-01-24 15:53:10 -08:00
Wanming Lin
7252c6e747
[WebNN EP] Support WebNN async API with Asyncify (#19145) 2024-01-24 15:37:35 -08:00
Yufeng Li
c456f19dba
remove old quantization tool file (#19247)
### Description
remove old python files


### Motivation and Context
We have a new op MatMulNBits and this one is deprecated.
2024-01-24 15:20:36 -08:00
Yang Gu
591f90c0b9
[js/webgpu] Fix issue of timestamp query (#19258)
When WebGPU profiling mode is enabled between session.create and
session.run, the current implementation has a problem creating the querySet
(and also the queryResolveBuffer) if the commandEncoder is shared with input
uploads. This PR fixes this by moving the querySet creation to the place
where queryType is set.
2024-01-24 14:49:37 -08:00
Changming Sun
bc54ad3f03
Update abseil to a release tag and register neural_speed (#19255)
### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Currently we are using a non-released version of abseil. Using a tag is better.
2024-01-24 14:37:39 -08:00
Changming Sun
a28abeb241
Change "#ifdef WIN32" to "#ifdef _WIN32" (#19254)
### Description
`_WIN32` is a standard macro listed at
https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170
. But `WIN32` is not.
2024-01-24 14:35:44 -08:00
satyajandhyala
a33b5bd1fa
[JS/WebGPU] Added Uniforms to SkipLayerNorm. (#18788)
### Description
Added Uniforms to SkipLayerNorm



### Motivation and Context
Improve performance

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-01-25 01:12:21 +05:30
Sheil Kumar
a39ac4a979
[DirectML] Register Pad19 (#19175)
### Description
Register Pad19 in DirectML

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-01-24 10:06:31 -08:00
Yi Zhang
d7aebf9ea8
Move Nuget Test from T4 to A10 to reduce release duration (#19253)
### Description



### Motivation and Context
Running the release process is very painful and tedious because some GPU jobs
have to wait for a long time.

![image](https://github.com/microsoft/onnxruntime/assets/16190118/1c5c981e-68d4-4678-9758-443fbf362802)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/ba0d79ba-1554-4c7a-93dd-6ea8144c9295)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/36cab833-71c1-4ff5-bca5-f4caa9aee0c9)
On the one hand, we could move some T4 machines away from the PR process,
since some jobs are not using T4 any more; on the other hand, we can continue
to change some jobs' agents from T4 to A10 too.

In the future, T4 will mainly be used for scenarios that need large GPU
memory, multiple GPU cards, or some special cases.


Test runs:

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=401786&view=logs&j=8048494c-e6eb-5e47-5e87-ff0aa863325d

cc @YUNQIUGUO @snnn
2024-01-24 14:15:07 +08:00
Chi Lo
c10be1848c
[TensorRT EP] Avoid calling unavailable function with cpu python package (#19251)
C.register_tensorrt_plugins_as_custom_ops() is only available in gpu
python package.
Add condition to avoid calling it in cpu python package.
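
One way to express such a guard is to look the function up before calling it. This is a sketch only: `cpu_binding` and `gpu_binding` are hypothetical stand-ins for the native module in each package, the stand-in function takes no arguments, and the condition used in the actual change may differ.

```python
import types

# Hypothetical stand-ins for the native binding module of each package flavor.
cpu_binding = types.SimpleNamespace()
gpu_binding = types.SimpleNamespace(
    register_tensorrt_plugins_as_custom_ops=lambda: "registered"
)


def register_plugins_if_available(binding):
    """Call the TensorRT-only function when the binding exposes it; otherwise do nothing."""
    register = getattr(binding, "register_tensorrt_plugins_as_custom_ops", None)
    if register is None:
        return None
    return register()


print(register_plugins_if_available(cpu_binding))
print(register_plugins_if_available(gpu_binding))
```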
2024-01-23 21:30:22 -08:00
Ye Wang
6a424ccf8c
Fix AMD pipeline test failures (#19250)
### Description

Fix amd test failure

### Motivation and Context
2024-01-23 19:33:49 -08:00
aciddelgado
cbb29d80ff
GQA Rotary and Packed QKV with Flash (#18906)
### Description
These changes add rotary embedding and packed qkv input to gqa. As of
now, the changes are only supported with Flash-Attention (SM >= 80) but
should soon be supported with Memory Efficient Attention as well.



### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV should also provide some perf
gain in the context of certain models, like Llama2, that would benefit
from running ops on the fused QKV matrix, rather than the separate Q, K,
and V.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2024-01-23 16:34:26 -08:00
Wei-Sheng Chin
532f8c642c
Fix a backend test by using local backend (#19230)
The decomposition pass (e.g., converting torch.add to aten.add) in DORT
no longer exists. Therefore, we have to use `use_aot_autograd=True` to
enable Dynamo's built-in operator decomposition. I think we need to add
the decomposition pass back to DORT or remove `use_aot_autograd` (remove
because it will always be `true`).
2024-01-23 14:57:30 -08:00
petermcaughan
f53068446e
Add Temperature to WhisperBeamSearch input (#19188)
### Description
Add `temperature` as an input to WhisperBeamSearch op and initialize
correctly in parameter setup.


### Motivation and Context
Currently, temperature is included as an attribute to the BeamSearch op,
which doesn't let the model act dynamically in a single inference
session. By including this variable as an input, the temperature value
can be altered in any inference call (important for 1P teams)

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
2024-01-23 13:44:34 -08:00
Yi Zhang
54871a2773
Replace T4 to A10 in Linux GPU workflow (#19205)
### Description
1. Update the Linux GPU machine from T4 to A10, sm=8.6
2. Update the tolerance

### Motivation and Context
1. Free up more T4 machines and test with a higher compute capability.
2. ORT enables TF32 in GEMM for A10/A100. TF32 causes precision loss
and fails this test
```
2024-01-19T13:27:18.8302842Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z   ret.first
2024-01-19T13:27:25.8445514Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z 
2024-01-19T13:27:25.8555736Z [  FAILED  ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z   ret.first
2024-01-19T13:27:29.3176322Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ

```
3. Some other tests, like SSD, throw other exceptions, so skip them
```
2024-01-22T09:07:40.8446910Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-22T09:07:51.5587571Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358: Failure
2024-01-22T09:07:51.5588512Z Expected equality of these values:
2024-01-22T09:07:51.5588870Z   COMPARE_RESULT::SUCCESS
2024-01-22T09:07:51.5589467Z     Which is: 4-byte object <00-00 00-00>
2024-01-22T09:07:51.5589953Z   ret.first
2024-01-22T09:07:51.5590462Z     Which is: 4-byte object <01-00 00-00>
2024-01-22T09:07:51.5590841Z expected 1, got 63
```
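
The TF32 precision loss mentioned in item 2 of the motivation can be approximated on the CPU. TF32 keeps 10 of float32's 23 mantissa bits; the sketch below emulates that by truncation, which is an assumption (hardware rounding may differ), and the function name is made up for this example.

```python
import struct


def truncate_to_tf32(value: float) -> float:
    """Keep the top 10 mantissa bits of the float32 encoding of `value`."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Clear the low 13 mantissa bits (23 - 10 = 13).
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


print(truncate_to_tf32(0.1))
print(truncate_to_tf32(1.0 + 2 ** -12))
```

A relative error of up to about 2**-10 per input is then amplified across the many multiply-accumulate steps in a GEMM, which is consistent with having to relax the test tolerance.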
2024-01-23 10:49:24 -08:00
Heflin Stephen Raj
0ea48fc73e
Modified the condition to load the optimiser model (#18891) 2024-01-23 10:10:54 -08:00
Xu Xing
61610ff986
[js/webgpu] Add FusedConv clip test case (#18900)
Bug: https://github.com/microsoft/onnxruntime/issues/18899
2024-01-23 08:25:05 -08:00
Tianlei Wu
6ca7c1a933
unet fusion for stable diffusion webui (#19227)
### Description
Update unet fusion for [stable diffusion webui
extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update fusion pattern to support fp16 unet model.
(2) Add progress bar
(3) Use a cached map to speed up dtype or shape lookup in shape
inference result.

### Motivation and Context
2024-01-22 20:42:30 -08:00
Jeff Daily
b2aec41a83
[ROCm] enable hipGraph (#18382)
This ports the cudaGraph support from the CUDA EP to the ROCM EP's
hipGraph.
2024-01-23 11:17:04 +08:00