Commit graph

9803 commits

Author SHA1 Message Date
Yi Zhang
99b8dcaae2
Disable dml stage in windows GPU pipeline temporarily. (#18034)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-10-20 08:41:40 -07:00
Hector Li
35ecce4549
[QNN EP] Reduce overhead of QNN context binary loading (#17965)
### Description
Reduce overhead of QNN context binary loading by avoiding memory copy

### Motivation and Context
Reduce the session initialization time and memory usage while load from
QNN context binary
2023-10-18 15:30:35 -07:00
Jian Chen
cbb0e0f83c
Create a new Dockerfile for cuda 12 and trt 8.6.1.6-1.cuda12.0 (#18000) 2023-10-18 14:46:02 -07:00
aciddelgado
a2c6283274
Fix Packed MultiHead Attention (#17996)
### Description
Initialize previously unitialized parameters that were causing Op to
crash.



### Motivation and Context
Solves Cuda Memory Misalignment / Illegal Memory Access error when
FlashAttention was used in Packed Multi-Head Attention.
2023-10-18 10:52:14 -07:00
Arthur Islamov
22947109f2
[js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630)
### Description
This PR includes fixes for Norm operations to support FP16 and also some
optimizations to use vec2/vec4 if possible
2023-10-18 10:47:41 -07:00
dependabot[bot]
f9694c5b97
Bump @babel/traverse from 7.18.5 to 7.23.2 in /js/react_native/e2e (#17963) 2023-10-18 05:25:27 +00:00
Tianlei Wu
59ae3fdfdc
[CUDA] StableDiffusion XL demo with CUDA EP (#17997)
Add CUDA EP to the StableDiffusion XL Demo including:
(1) Add fp16 VAE support for CUDA EP.
(2) Configuration for each model separately (For example, some models
can run with CUDA graph but some models cannot).

Some remaining works will boost performance further later:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, some part of graph
is partitioned to CPU, which blocks CUDA graph.
(2) Update GroupNorm CUDA kernel for refiner. Currently, the cuda kernel
only supports limited number of channels in refiner so we shall see some
gain there if we remove the limitation.

Some extra works that are nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier free guidance (The idea is from
https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/).


#### Performance on A100-SXM4-80GB

Example commands to test an engine built with static shape or dynamic
shape:
```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```
Engine built with dynamic shape could support different batch size (1 to
4 for TRT; 1 to 16 for CUDA) and image size (256x256 to 1024x1024).
Engine built with static shape could only support fixed batch size (1)
and image size (1024x1024).

The latency (ms) of generating an image of size 1024x1024 (sorted by
total latency):

 Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057

\* VAE decoder is not used in Base since the output from base is latent,
which is consumed by refiner to output image.

We can see that ORT_CUDA is faster on dynamic shape, while slower in
static shape (The cause is Clip2 and UNet cannot run with CUDA Graph
right now, and we will address the issue later).

### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/17536
2023-10-17 21:30:04 -07:00
Patrice Vignola
61f1a16265
Fix MHA shape inference (#18009)
The previous shape inference never had the chance to infer the past_key
and past_value outputs because we were returning early.
2023-10-17 21:19:57 -07:00
Patrice Vignola
65575389f4
[DML EP] Enable more MHA masks (#17882)
Those masks are used for MHA in LLaMA.
2023-10-17 17:31:51 -07:00
Guenther Schmuelling
2ef7abf1ff
support for fp16 in GetClipMinMax() (#17967)
add support for fp16 in GetClipMinMax()
2023-10-17 16:43:30 -07:00
Adam Pocock
3456831413
[java] Make the backing byte buffer in an OrtValue accessible (#16578)
### Description
Adds a method to access the backing direct byte buffer from a Java
`OnnxTensor` object, assuming it is backed by a direct byte buffer
(tensors created by ORT's run call or ones created in Java from
multidimensional arrays are not). Also adds a method to check if the
backing byte buffer was copied from the user's buffer supplied on
creation (this could be tested via a pointer comparison from the output
of `getBufferRef` and the user's input buffer, so I'm not sure if it's
necessary).

### Motivation and Context
This is the first part of changes necessary to support output pinning in
Java OrtSession.run/OrtTrainingSession.run calls. I split it out from
the rest of the work as it's useful by itself (e.g. to allow users to
keep a single input tensor and rewrite it each time with new inputs
rather than allocate a fresh one) and the other change will be much more
involved so splitting it makes it easier to review.

cc @yuslepukhin
2023-10-17 10:03:49 -07:00
Changming Sun
57c8736596
Move a nodejs test to a different machine pool (#17970)
### Description
This is a temp fix for the failing "Zip-Nuget-Java-Nodejs Packaging
Pipeline". The pipeline is failing because I removed NodeJS from the
build machine pool's image, to reduce the number of dependencies we need
to maintain in VMs.
So this PR will temporarily move the test to a different machine pool to
get the test passed. Then I will move the test to docker. Docker images
are relatively easier to update and maintain. Now we almost run all
Linux test in docker, except for this one. Moving it to docker is needed
for enabling GPU support in nodejs, because all our Linux VMs do not
have CUDA.


### Motivation and Context
2023-10-17 09:30:14 -07:00
Hariharan Seshadri
9356986730
Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972)
### Description
This PR:

(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)

(2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent in building a few more files and running a few more tests will not
be much.

Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770

### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
2023-10-17 09:23:52 -07:00
Scott McKay
6832b688a2
Fix missing attribute on C# DOrtGetResizedStringTensorElementBuffer delegate (#17901)
### Description
<!-- Describe your changes. -->
Fix missing attribute. Causes build error on release xamarin iOS build.

Fix some long lines as well.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#16463 - once the dummy extensions nuget package is used this problem
shows up.
2023-10-17 17:48:36 +10:00
Tianlei Wu
83547d3067
[CUDA] Fix SkipLayerNorm vectorized kernel out-of-bounds read (#17943)
Fix a bug in https://github.com/microsoft/onnxruntime/pull/11803:
When hidden size is not exactly same as next size (for example ld=320 in
stable diffusion) current vectorized kernel might read out-of-bounds,
and might cause CUDA failure.

Also resolved another issue: for the first and last size, current macro
will cause some dead code (some branch will never run). Here we change
it to avoid those branches in boundary sizes.

Performance tests with stable diffusion shows that the performance is
on-par before/after this fix.
2023-10-16 13:58:37 -07:00
dependabot[bot]
cf974f0905
Bump @babel/traverse from 7.18.2 to 7.23.2 in /js/react_native (#17962) 2023-10-16 18:24:01 +00:00
Maximilian Müller
7c17e33c07
Make CUDA a NHWC EP (#17200)
### Description

CUDA inference speed heavily relies on Tensor Cores. To have tensor
cores achieve the optimal throughput they require the data layout to be
NHWC rather than NCHW.

### Motivation and Context


Especially for convolutional networks this is very important. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn

class Net1(nn.Module):

    def __init__(self):
        super(Net1, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])
    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"

    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)
```

I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges.
Current master launches below kernels: 

![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b)

If I add the introduced `-l` flag we see below kernels:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008)

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.

Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |

|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
|                         |  -e cuda -t 5 -q |   -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |
Average speedup: 1.53

## Outlook

Next the mission will be to first write a templated unit test to check
for correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we can do not support all ops
in the NHWC domain.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-10-16 10:16:37 -07:00
Chi Lo
8abaa7b753
[TensorRT EP] Fix cmake install (#17923)
We removed tensorrt_provider_factory.h in the
[PR](https://github.com/microsoft/onnxruntime/pull/17617).
Need to remove the copy of this file when cmake install.
2023-10-16 09:16:24 -07:00
Yulong Wang
ad817d0efa
[js/web] optimize tsc for web: split out "npm prepare" (#17955)
### Description
optimize tsc for web: split out "npm prepare"
2023-10-16 09:04:54 -07:00
Yulong Wang
f7341e8103
enable training for win-wasm-ci.yml (#17954)
### Description
**Fixes NPM Packaging pipeline.**

Training was enabled for linux-wasm-ci.yml but not enabled for
win-wasm-ci.yml.

the web CI uses linux-wasm-ci.yml
NPM packaging pipeline uses win-wasm-ci.yml
2023-10-16 16:07:20 +08:00
Yi Zhang
209b6dbd97
add onnxruntime_float16.h into header validation list. (#17935)
### Description
supplement of #17637

### Motivation and Context
2023-10-16 08:52:16 +08:00
Scott McKay
ae211999dd
Attempt to make the usage of the Android emulator in CIs more robust (#17903)
### Description
<!-- Describe your changes. -->
Android emulator usage updates:
- Change approach to detecting boot has completed
- use `-delay-adb` and a simple command (`ls`) with `wait-for-device` as
the first step
    - this ensures enough startup has occurred for adb to be responsive
- use secondary loop on the python side to check for sys.boot_completed
to be set
- doing the check on the python side provides more feedback and seems to
work well
- make the 'stop' logic more precise by using psutil
- add internal timeout of 20 mins for emulator startup
  - waiting for the CI jobs overall timeout is way too long
- value is hardcoded for now (most CIs startup in under 10 mins) but
could be made configurable if needed

CI updates:
- add template for using the Android emulator
  - update CIs to use template
- reorder React Native CI
- minimize the time the Android emulator or iOS simulator is running by
moving some build steps around
  - don't run both at the same time
- unnecessary and potentially adds significant memory pressure to the
machine
- fix QNN Android emulator CI as much as possible
- now everything works apart from running onnx_test_runner with the QNN
EP

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix inconsistent detection of the emulator boot completing.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-10-15 08:42:36 +10:00
Zhipeng Han
a55b2688b6
Add save_attribute option to quantize_static (#17945)
### Description
<!-- Describe your changes. -->
The model with big Constants tensors size: Estimate size of the RWKV
model: ONNX graph (8MB), initializer tensors(200MB), constants (~5.7GB).
The `onnx.save_model` will got error due to the Constants is not output
in external data. Only the initializer tensors are output as external
data. In this change, expose parameter to support the constants in
external data. Model owner can customize the output behavior and still
keep the default behavior.

Quantize the model and output it to local, got issue due to output size
exceed 2GB even set `use_external_data_format=True`. The
`use_external_data_format` flag only outputs initializer tensors to
external data.
Use the falg `convert_attribute` flag to output all tensors to external
data.
```
def convert_model_to_external_data(
    model: ModelProto,
    all_tensors_to_one_file: bool = True,
    location: Optional[str] = None,
    size_threshold: int = 1024,
    include_attribute: bool = False,
) -> None:
    tensors = _get_initializer_tensors(model)
    if include_attribute:
        tensors = _get_all_tensors(model)
        ...
```
The `onnx.external_data_helper.convert_model_to_external_data` support
output the attribute to external with flag `include_attribute=True`.
However, this parameter is hide by the
`onnxruntime\quantization\onnx_model.py` and the constants(`5.7GB)
within the model will got protobuf 2GB limitation issue with default
parameters.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix https://github.com/microsoft/onnxruntime/issues/17944

---------

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2023-10-14 06:29:08 -07:00
Yufeng Li
11af34440a
Add MatMul 4bits support on GPU (#17890)
### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4bits. It:

- add schema for contrib op MatMulNBits which can support 1-7 bits
quantization on weight.
- a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4bits MatMulNBits and related
benchmark tool
- tool to quantization model with 4bits. 

Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU
2023-10-13 16:55:30 -07:00
Adrian Lizarraga
3b69d9b886
[QNN EP] Fix index-out-of-bounds bug in Slice builder when initializer is shared (#17905)
### Description
There's an index-out-of-bounds bug that is triggered when a Slice
operator shares an initializer with another operator that is processed
first. In this case, QNN EP fails to properly initialize a `raw_starts`
(or `raw_ends`) vector, which is later indexed by a call to
`SliceOp::PrepareForComputeHelper()`.


### Motivation and Context
Fix bug that blocks https://github.com/microsoft/onnxruntime/pull/17764
2023-10-13 13:46:29 -07:00
Wanming Lin
9c65d5558c
[WebNN EP] Fixed wasm heap overflow for big model (#17902)
When processing initialized tensors, WebNN did unnecessary tensor
unpacking as which is already stored as raw byte data. This would cause
WASM heap overflow when running big model.

Fixed this issue by pointing directly to the tensor raw data.

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
2023-10-13 13:40:31 -07:00
RandySheriffH
c6c3555d0e
Custom op shape inference API (#17737)
Add c/cxx API to allow custom ops do shape  inference.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-10-13 12:57:42 -07:00
Zhang Lei
762703e037
Support output cross qk, dtw and more for whisper model (#17500)
Support cross qk in beam search for whisper model and related features
Make whisper exporting tools support cross qk and some related features,
* extra_decoding_ids
* no_speech_prob

Implement DTW kernel, unfold tensor kernel with unit test Several fix
related with multiple session running parallel, like:

* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
2023-10-13 11:47:15 -07:00
Tianlei Wu
c695de91ee
Update eval_squad to use API of latest optimum (#17918)
Update eval_squad with latest optimum. 

Tested with:
* optimum 1.13.1
* transformers 4.31.0
* onnxruntime-gpu 1.16.0
* onnx 1.14.1
* datasets 2.14.5
* evaluate 0.4.0
* torch version 2.2.0.dev20230920+cu121

Example output in A100:

{'exact': 86.66035950804162, 'f1': 92.99622739711005, 'total': 10570,
'HasAns_exact': 86.66035950804162, 'HasAns_f1': 92.99622739711005,
'HasAns_total': 10570, 'best_exact': 86.66035950804162,
'best_exact_thresh': 0.9998456239700317, 'best_f1': 92.9962273971104,
'best_f1_thresh': 0.9998456239700317, 'total_time_in_seconds':
84.74025378189981, 'samples_per_second': 124.73410838731417,
'latency_in_seconds': 0.008017053337928081, 'provider':
'CUDAExecutionProvider', 'disable_fused_attention': False,
'pretrained_model_name':
'bert-large-uncased-whole-word-masking-finetuned-squad', 'onnx_path':
'./bert-large-uncased-whole-word-masking-finetuned-squad/optimized_model.onnx',
'batch_size': 1, 'sequence_length': 384, 'use_io_binding': True}
2023-10-13 10:39:15 -07:00
Yufeng Li
7551dd039f
Fix build break with cuda 12.2 (#17922)
### Description
<!-- Describe your changes. -->
nvcc 12.2 crashes while building
onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_*
for SM<8.0. nvcc 18.8 works though. It should be a bug in nvcc 12.2.

This PR excludes building flashattention for arch < 800.
2023-10-13 10:21:06 -07:00
Chi Lo
28c1944561
[TensorRT EP] Add back creation cudnn and cublas using external cuda stream (#17912)
Add back creation cudnn and cublas using external cuda stream.
2023-10-13 10:18:28 -07:00
Sheil Kumar
635d3faa3b
Remove half of the compiled shaders for gridsample with unused types (#17928)
Remove half of the compiled shaders for gridsample with unused types

Shader compilations for Bool and Double datatypes are not needed for GPU
evaluation and are causing binary bloat.
Removing them.

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-13 09:29:37 -07:00
Tianlei Wu
67d7eb3ac5
[CUDA] Fix SkipLayerNorm strict mode when skip has broadcast (#17896)
In SLN strict mode, current code (#16510) does not handle skip broadcast
nicely . There are two issues:
(1) skip related parameters is not passed to cuda kernel in strict mode
(2) Strict mode kernel also has bug in handling skip broadcasting (like
cuWelfordMuSigma2 does not handle skip broadcasting).

Here we remove the support of skip broadcasting in strict mode, and
operator will return error message that strict mode only support same
shape of input and skip.

Other changes:
* skip_size is misleading when there is no broadcasting. Change to
correct value.
* Refactor the code to be more efficient: (1) no need to check whether
there is broadcasting in kernel. (2) remove one local buffer (load input
to sum_v directly to save a local buffer copy).
* compute input + bias + skip instead of input + skip + bias. The order
is followed common pattern in transformers model (Here assume graph
fusion will distinguish input and skip correctly, need double check
fusion code later).
* update unit test so that strict mode is triggered in each test case
(unless skip broadcasting) to have higher test coverage.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

SLN strict mode does not support skip broadcast but current code will
silently run (kernel might fail)
2023-10-13 07:51:37 -07:00
Adrian Lizarraga
4e2a03d5fa
[QNN EP] Fix topological node unit traversal during validation (#17913)
### Description
We need to ensure that tensors are first created and validated by their
producers. If we don't, then builders that need to modify their outputs
may not be able to do so if consumers are processed first (due to
caching of tensors). For example, the Tanh builder may need to override
its output quant param for 16-bit QDQ. I've encountered a scenario
(while working on a partner model) where the override was not being
correctly applied due to the graph traversal order.

I tried to fix this bug in a previous
[PR](https://github.com/microsoft/onnxruntime/pull/17877#discussion_r1353676802),
but my fix was incorrect.
2023-10-12 23:37:01 -07:00
Adrian Lizarraga
dad70ad4e8
[QNN EP] Handle rank 3 InstanceNormalization with N != 1 (#17897)
### Description
The QNN HTP backend does not support rank 3 InstanceNorm if the batch
size is not 1. To work around this limitation, QNN EP can wrap a rank 4
QNN InstanceNorm op with Reshapes (with the H dim set to 1).

### Motivation and Context
Enable support for more models.
2023-10-12 21:52:09 -07:00
PeixuanZuo
0c5b1598d3
[ROCm] Add ROCm Debug wheels to private ADO Feeds (#17887)
Add ROCm Debug wheels to private ADO Feeds
2023-10-13 10:28:10 +08:00
Jeff Daily
07317316cc
CUDA EP vs ROCM EP hipify audit (#17776)
Migrate most CUDA EP improvements and changes to ROCM EP. The process
involves using hipify against all CUDA EP files (i.e. do not exclude any
files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them
against the ROCM EP files that are under source control and pull in most
changes. These changes include functional as well as formatting and
makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff
somewhat less obvious due to formatting changes.

- hipify audit of onnxruntime/core/providers/rocm, enable ops
  - Loop
  - Scan
- hipify audit of onnxruntime/contrib_ops/rocm
- fix contrib ops search implementation
- enable more contrib ops
  - Affine
  - ComplexMul
  - ConvTransposeWithDynamicPads
  - Crop
  - DynamicSlice
  - FFT [Rfft, Irfft]
  - GreedySearch
  - ImageScaler
  - ParametricSoftplus
  - ScaledTanh
  - ThresholdRelu

---------

Co-authored-by: cloudhan <cloudhan@outlook.com>
2023-10-13 10:13:53 +08:00
Scott McKay
ba7f20ac57
Fix illegal opcode error from mlas (#17885)
### Description
<!-- Describe your changes. -->
Use cpuinfo value when checking to dot product is available. Reading the
ID_AA64ISAR0_EL1 register is unsafe.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17647 
#17541 
#17851
2023-10-13 08:27:15 +10:00
Tang, Cheng
ca8cab29cd
distributed slice (#17761)
### Description
Support DistributedSlice kernel in Cuda EP.

mainly support following cases:
1. input data is sharded or replica for all axes (including slice axes)
2. slice axes is sharded across different devices.

starts / ends / steps sharded across different devices are not supported
yet.

---------

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2023-10-12 14:28:00 -07:00
Changming Sun
3f3ece4a39
Update NDK to 26.0.10792818 (#17852)
### Description
Update NDK to 26.0.10792818 which is included in every macOS build
machine so that we do not need to download a different version every
time in every build.

### Motivation and Context
Downloading NDK on-the-fly is a main contributor of Android related
build failures.
2023-10-12 14:08:43 -07:00
zesongw
163218d6d7
[WebNN EP] Update Op Softmax for readability (#17826)
### Description
<!-- Describe your changes. -->

Improve readability by fixing misplaced comments and utilizing
std::rotate.


### Motivation and Context

Resolve some comments in
https://github.com/microsoft/onnxruntime/pull/17714
2023-10-12 12:02:50 -07:00
Caroline Zhu
c373a808a2
Add "glue" between training WASM artifacts and training web (#17474)
### Description
* follows the packaging approach according to the design document
    * adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS`
    * modifies `package.json` to include training submodule
* modifies build script to handle, validate, and minimize training WASM
artifacts
* adds the binding for the new backend with training enabled & the new
training artifacts
    * adds training backend
    * edits `index.ts` to use training backend depending on `BUILD_DEFS`
    * edits `wasm-factory.ts` to use the training artifacts if necessary

### Motivation and Context
* we are in the process of adding web bindings to enable training. 
* Adding the "glue" to allow onnxruntime-web to use the training WASM
artifacts is required for this work.
* Since BUILD_DEFS is defined and used at build time, I thought that it
made sense to bundle the changes to building in the same PR.
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 must be merged in before this one

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-12 11:16:56 -07:00
Changming Sun
809c8905fe
Update TestCase.cc: exclude a test for DML (#17909)
### Description
Update TestCase.cc: exclude a test for DML

### Motivation and Context
The test is failing due to GPU driver update.
2023-10-12 10:08:07 -07:00
Sheil Kumar
2efab54f9c
Fix missing registration for new dml device selection API (#17898)
Fix missing registration for new dml device selection API

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-12 09:41:45 -07:00
Vincent Wang
fa0a79a921
Fix Triton Compile Error for Codegened Dropout Code (#17899) 2023-10-12 20:57:14 +08:00
Yi Zhang
9d07ca3621
Move compliance check before publishing pipeline artifact (#17857)
### Description
<!-- Describe your changes. -->


### Motivation and Context
Compliance check would fail randomly but the stage couldn't be rerun if
the pipeline artifacts are already published.
There's the error like `Artifact xxxx already exists`.
We had to restart the whole pipeline if there's a random error in
compliance check.
2023-10-12 15:48:04 +08:00
Maximilian Müller
74a8acf405
Set default value for NVCC threads (#17866)
Without doing this CMake gives a miscellaneous error on windows when
checking if NVCC is functional. It will be missing a number after
`--threads`.

Currently it is only possible to configure through the python build scripts and not CMake
only configure - which is what I am usually doing through CLion.
2023-10-11 22:46:40 -07:00
Tianlei Wu
e2cd6748fc
Fix GroupNorm fusion: skip if num of channels not supported (#17869)
Right now, GroupNorm only support limited number of channels (320, 640,
960, 1280, 1920, 2560, 128, 256, 512). Skip the fusion if number of
channels are not supported.

### Motivation and Context
SD XL refiner model uses number of channels 384, 768, 1152, 2304 and
3072 in GroupNorm.
2023-10-11 22:45:22 -07:00
Yulong Wang
25bbd8d4eb
[js/web] allow gpu IO binding tests to fail temporarily (#17892)
### Description
allow gpu IO binding tests to fail temporarily.

when the root cause is still in investigation, use `continueOnError:
true` to allow the test to fail without blocking PRs.
2023-10-11 21:21:21 -07:00
Changming Sun
138ccecd22
Change how "NPM packaging pipeline" downloads packages from another pipeline (#17838)
### Description
"NPM packaging pipeline" needs to download an artifact from
"Zip-Nuget-Java-Nodejs Packaging Pipeline".
It has been a long-time issue that they two pipelines often use
different commit ids.
This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a
resource, so that "NPM packaging pipeline" will always fetch from the
pipeline run that triggers this NPM pipeline.
Their official document says:
"When you define a resource trigger, if its pipeline resource is from
the same repo as the current pipeline, triggering follows the same
branch and commit on which the event is raised."
2023-10-11 21:07:27 -07:00