Commit graph

10124 commits

Author SHA1 Message Date
Yulong Wang
efbef5f611
[js/webgpu] allow to specify callback for profiling data (#18732)
### Description

**This PR is a replacement of #17820.**

allow to specify callback for profiling data

*Previous*:
```js
ort.env.webgpu.profilingMode = 'default';  // enable profiling

// profiling data will output to console.
```

*Now*:
```js
ort.env.webgpu.profiling = {
  mode: 'default';  // enable profiling
  ondata: (data) => {
    // .. process the profiling data
  }
};

//for each kernel, "ondata" will be called once. only output to console if ondata is not specified.
```
2023-12-07 14:10:28 -08:00
junchao-loongson
4abec9749e
[mlas] add loongarch lsx and lasx optimize code (#17937)
### Description
Hello we(@lixing-star) are the developers of loongson team.

We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch
architecture


[100% tests passed, 0 tests failed out of
7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true)

### Development Environments1
```
CPU: 
    Loongson-3C5000L
uname -a:  
    Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux

```
### LonngArch Documents
- [LoongArch Reference Manual - Volume 1: Basic Architecture: This
manual describes the basic part of the LoongArch
architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html)
- [LoongArch ELF psABI: This manual describes the LoongArch ELF
psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
-
[more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)
2023-12-07 11:15:59 -08:00
Yi Zhang
a045be335b
use EO pool for windows web_cpu stage (#18737)
### Description
reuse EO pool in NPM pipeline.


### Motivation and Context
build_web_debug failed in onnxruntime-Win-CPU-2022 but it works in EO
pool.
Reuse EO pool to make the pipeline work now.
When I'm free, I'll try upgrading the chrome in the custom image.
2023-12-07 10:10:00 -08:00
Hector Li
e469de65f5
Re-enable Sign op int64 test for QNN CPU test (#18734)
### Description
Re-enable Sign op int64 test for QNN CPU test
2023-12-07 08:42:25 -08:00
Wanming Lin
3d8af6eb65
[WebNN EP] Skip split initializer (#18729) 2023-12-07 08:09:49 -08:00
Tianlei Wu
49470f06e8
Add benchmark script for control net (#18717)
Add script to benchmark PyTorch and StableFast for control net.
Add an option --max-batch-size in demo for benchmark purpose.
2023-12-06 21:54:51 -08:00
Dmitri Smirnov
e603e78627
Enforce If condition size == 1 (#18733)
### Description
<!-- Describe your changes. -->

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/18549
2023-12-06 21:04:18 -08:00
moyo1997
9479ba525b
Build onnxruntime.dll as arm64x (#18633)
Build onnxruntime.dll as arm64x

Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.

If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.

Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.

AB#22533
2023-12-06 16:49:00 -08:00
Rachel Guo
7762f3f7c5
[NNAPI EP] Add NNAPI Split (#18702)
### Description
<!-- Describe your changes. -->

As title.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

yolo-v8 model missing operator support.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-12-06 15:11:15 -08:00
Wanming Lin
c4b8120c5b
Rename op elementwiseIf to where (#18657)
WebNN latest spec uses `where`.
2023-12-06 14:56:26 -08:00
Hector Li
9768a727e1
[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722)
Fix a bug that can't create context binary if the model has inputs/outputs with different data type

### Description
Update EPContext op schema to unblock nodes with different data type among inputs & outputs
2023-12-06 13:07:09 -08:00
Adrian Lizarraga
559bd52252
[QNN EP] Update QNN SDK to version 2.17.0 (#18684)
### Description
- Update QNN CI Pipelines to use QNN SDK version 2.17.0
- **Print warning if unit test requires adjusted tolerance to pass**
- **Temporarily disable unloading QnnCpu.dll for windows x64 due to
crash when calling FreeLibrary**
- Enable fixed HTP tests
  - QnnHTPBackendTests.LayerNorm1D_LastAxis_DynamicScale
  - QnnHTPBackendTests.GlobalMaxPool_LargeInput2_u8
  - QnnHTPBackendTests.ReduceSumS8Opset13_Rank5
  - QnnHTPBackendTests.ReduceSumU8Opset13_Rank5_LastAxis
  - QnnHTPBackendTests.WhereLargeDataBroadcastU8
  - QnnHTPBackendTests.WhereLargeDataBroadcastTransformedU8
- Enabled fixed CPU tests
  - QnnCPUBackendTests.Resize_DownSample_Linear_AlignCorners_scales
- Increased tolerance for HTP tests that are less accurate on QNN SDK
2.17.0
  - QnnHTPBackendTests.AveragePool_CountIncludePad_HTP_u8
  - QnnHTPBackendTests.AveragePool_AutopadSameUpper_HTP_u8
  - QnnHTPBackendTests.AveragePool_AutopadSameLower_HTP_u8
  - QnnHTPBackendTests.ConvU8U8S32_bias_dynamic_input
  - QnnHTPBackendTests.ConvU8U8S32_bias_initializer
  - QnnHTPBackendTests.ConvU8U8S32_large_input1_padding_bias_initializer
  - QnnHTPBackendTests.LRNSize3
  - QnnHTPBackendTests.LRNSize5
  - QnnHTPBackendTests.MaxPool_Large_Input_HTP_u8
  - QnnHTPBackendTests.MaxPool_LargeInput_1Pads
  - QnnHTPBackendTests.Resize_DownSample_Linear_HalfPixel
  - QnnHTPBackendTests.ResizeU8_2xLinearPytorchHalfPixel
  - QnnHTPBackendTests.ResizeU8_2xLinearHalfPixel
  - QnnHTPBackendTests.ResizeU8_2xLinearAlignCorners
  - QnnHTPBackendTests.ResizeU8_2xLinearAsymmetric
- Disabled ONNX model tests
- averagepool_2d_ceil: Accuracy issues **only on Windows x64
QnnCpu.dll**
- Disabled QDQ model tests (onnx_test_runner)
  - facedetection_op8_qdq: Accuracy issues
- Disabled CPU EP tests (these use QnnCpu.dll)
  - ActivationOpTest.Relu: QNN SDK 2.17 Relu treats inf as FLT_MAX
- GemmOpTypedTests/0.TestGemmBroadcast: Inaccuracy when weight is
initializer and bias is not
- MathOpTest.MatMulFloatType "test padding and broadcast B > A":
Inaccuracy (**only linux**)
- Fix Gemm translation bugs in QNN EP:
  - Do not skip processing of inputs that need to be transposed.

### Motivation and Context
- Allow testing with newest QNN SDK version
- Take advantage of improvements to enable new models.
2023-12-06 11:05:41 -08:00
Ye Wang
c012e41f93
MoE with Expert Slicing (#18565)
### Description
<!-- Describe your changes. -->

Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-05 16:56:38 -08:00
petermcaughan
871c52977a
Mistral Optimization & Benchmarking Support (#18225)
### Description
As a prerequisite for this model running correctly, two PRs need to be
merged:

- GQA Sliding Window Attention:
https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local
- MHA Fusion:
https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b

This PR adds optimization, quantization, and benchmarking support for
Mistral. The README included describes steps to export, optimize, and
benchmark Mistral models, but won't function correctly without the two
above branches being merged first.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2023-12-05 15:39:17 -08:00
Jian Chen
c9e558cd36
Adding common python test requirements.txt (#18698)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-05 14:09:43 -08:00
pengwa
4bfa84487c
Skip module clone for preparing large model export (#18663)
### Skip module clone for preparing large model export

For LLAMA2 13B, when running with Lora, DeepSpeed stage2 on 8 GPUs . It
failed during preparing outputs which will be used for
torch.onnx.export. The reason, we deep copy all the params including
both big sizes of frozen weights, + a little bit of Lora trainable
weight.

This PR will firstly check whether the GPU memmory is enough for a
cloned module, if not, skip the copy.

Copying the module is to guarantee the fw path run may change the
weight, while this case should be rare. But for now, Not-Able-To-Run is
worse than Runnable-with-A-little-bit-different-initial-weight,
especially for large models.
2023-12-05 12:41:17 -08:00
Guenther Schmuelling
9aa7284351
fix lint error (#18708) 2023-12-05 10:37:03 -08:00
cao lei
07aabcc314
Set cuda device before create cuda stream for IOBinding case (#18583)
### Description
<!-- Describe your changes. -->
Set cuda device before create cuda stream for IOBinding case


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to fix the issue #18432 , which the inference will fail for
IOBinding case when there are multiple cuda devices. The reason is that
the cuda device is not set properly before the cuda stream is created
2023-12-05 10:02:21 -08:00
satyajandhyala
70816001cc
[JS/Web] AddedUniforms in GatherElements. (#18670)
### Description
Use Uniforms in GatherElements and clean-up



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
2023-12-05 09:19:53 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656) 2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <->GPU data movement.
2023-12-05 07:51:53 -08:00
rui-ren
c14fae9461
add SAVE_TEST_GRAPH macro (#18696)
### Description
<!-- Describe your changes. -->

Add a macro `SAVE_TEST_GRAPH ` in `graph_transform_test_builder.cc`. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

This will help us debug the graph and Unitest.

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-12-05 07:46:08 -08:00
zhijiang
2b3050bb0c
Zhijxu/fix toposort (#18705)
in training, shape/size need to be executed immediately when it's ok to
be executed and thus to save memory if possible;

the toposort logic is enhanced before, while didn't take of the
"shape->size" pattern, which make the following size op will not show up
in toposort result.
2023-12-05 17:36:00 +08:00
Adrian Lizarraga
e066fca777
[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465)
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
          'quant_type' = QuantType : The tensor's quantization data type.
          'scale' =  Float         : The scale value to use. Must also specify `zero_point` if set.
          'zero_point' = Int       : The zero-point value to use. Must also specify `scale` is set.
          'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if also
                                     set `scale` or `zero_point`.
          'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if also
                                     set `scale` or `zero_point`.
          'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
          'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
```

- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.

Example for per-tensor quantization overrides:
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
        "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
        "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
    },
}
```

Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "WGT": [
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.0,
                "rmax": 2.5,
                "reduce_range": True,
            },
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.2,
                "rmax": 2.55,
                "reduce_range": False,
            },
        ],
        "BIAS": [
            {"zero_point": 0, "scale": 0.000621},
            {"zero_point": 0, "scale": 0.23},
        ],
    },
}
```

#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.

Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config

qnn_config = get_qnn_qdq_config(input_model_path,
                                data_reader,
                                activation_type=QuantType.QUInt16,
                                weight_type=QuantType.QUInt8)
                                
quantize(input_model_path,
         output_model_path,
         qnn_config)
```

### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-12-04 17:54:58 -08:00
Tianlei Wu
01b5c78917
Add SD-Turbo and refine diffusion demo (#18694)
[SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast
generative text-to-image model that distilled from [Stable Diffusion
2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It is
targeted for 512x512 resolution.

1. Support sd-turbo model.
1. Refiner ControlNet in demo
    +  Cache the ControlNet model so that it is downloaded only once.
+ Do not download default images in script. Instead update document to
use wget to download example image.
+ Fix an issue of control image processing that causes shape mismatch in
inference.
1. Refine arguments:
+ Change argument --disable-refiner to --enable-refiner since refiner is
not used in most cases
   + Rename --refiner-steps to --refiner_denoising_steps
   + Add abbreviations for most used arguments.
   + Add logic to set default arguments for different models.
1. Refine torch model cache:
+ Share cached torch model among different engines to save disk space.
+ Only download fp16 model (previously, ORT_CUDA downloads fp32 model).
1. Do not use vae slicing when image size is small.
1. For LCM scheduler, allow guidance scale 1.0~2.0.
2. Allow sdxl-turbo to use refiner

###  Performance Test Results

Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on
A100-SXM4-80GB.

Batch | Steps | TRT 8.6 static | ORT_TRT static | ORT_CUDA static | TRT
8.6 dynamic | ORT_TRT dynamic | ORT_CUDA dynamic
-- | -- | -- | -- | -- | -- | -- | --
1 | 1 | 32.07 | 30.55 | 32.89 | 36.41 | 38.30 | 34.83
4 | 1 | 125.36 | 97.40 | 97.49 | 118.24 | 114.95 | 99.10
1 | 4 | 62.29 | 60.24 | 62.50 | 72.49 | 77.82 | 67.66
4 | 4 | 203.51 | 173.11 | 168.32 | 217.14 | 215.71 | 172.53

* Dynamic engine is built for batch size 1 to 8, image size 512x512 to
768x768, optimized for batch size 1 and 512x512
2023-12-04 16:03:47 -08:00
Edward Chen
d514a960ee
Remove "Python Checks" pipeline status from readme as that pipeline no longer exists. (#18697) 2023-12-04 13:38:36 -08:00
Caroline Zhu
c02a386145
[js/web/training] Implemented runEvalStep & runOptimizerStep (#18259)
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.

### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-04 13:37:14 -08:00
Jiajia Qin
5353adcde3
[js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658)
### Description
With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1,
1, 16, 16] becomes 0.59ms from 6.64ms.
2023-12-04 13:18:37 -08:00
trajep
a5b2291e0f
[Transformer Optimization]Return model directly for unknown model type (#18642)
This pull request is used to improves the handling of unsupported model
types in the optimization process.
2023-12-04 12:26:50 -08:00
Deoksang Kim
2f8b86b939
Fix typo in the TensorShape (#17813)
The function name in the log should be SizeToDimension
2023-12-01 16:48:55 -08:00
Jiajia Qin
92ee664f64
[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661)
### Description
Currently, for non-uniform variables, we still use `array<u32, N>` type
instead of array<vec4<u32>, N1>`. So we can't always treat all variables
with rank > 4 as uniforms to index.

This PR fixes below errors:
```
error(s) generated while compiling the shader:
:5:44 error: index 4 out of bounds [0..1]
             return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]);
                                           ^
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()
2023-12-01 15:35:35 -08:00
Changming Sun
eaaf27015e
Remove EnvSetupScript parameter from win-ci.yml (#18662)
### Description
To make the code more consistent. Now some TRT pipelines download TRT
binaries on-the-fly, while other TRT pipelines use a preinstalled
version. This PR make them the same.
2023-12-01 15:30:16 -08:00
Rachel Guo
9c45fe4957
Fix macos xcframework test stage codesign info (#18649)
### Description
<!-- Describe your changes. -->

Remove developement id and force codesign not required in the test macos
target.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix failure happened in iOS_Full_xcframwork stage in
Zip-Nuget-Java-NodeJS packaging pipeline.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-12-01 14:47:46 -08:00
Edward Chen
a353805631
Fix Windows TVM CI workflow (#18667)
Fix issue with installing LLVM dependency.
2023-12-01 13:49:45 -08:00
Edward Chen
b22f49ff35
Fix unit tests failures in build with contrib ops disabled (#18659)
Fix unit tests failures in build with contrib ops disabled.
- QDQTransformerTests.QDQPropagation_GH11605_Opset12_19
- TransposeOptimizerTests.QnnTransposeNonConstBroadcastInput
2023-12-01 09:41:25 -08:00
Bowen Bao
fcea2cb7f1
[Dort] Run type promotion pass to resolve dtype discrepancy (#18516)
Fixes CI failures mentioned in #18507

But we should not keep two separate dort impls in both pytorch and
onnxruntime. They are out of sync.
2023-12-01 09:36:18 -08:00
snadampal
05a9c95764
[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847)
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.

### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
2023-12-01 09:16:44 -08:00
Jian Chen
d69842226b
Update the template files to correct stage to fix the python cuda 12 packaging pipeline (#18651) 2023-12-01 07:57:46 -08:00
guyang3532
182c525416
Support MatMulBnb4 in PaddingElimination (#18646)
Also support Cast pattern between input and embedding node for sparsity
inspecting
2023-12-01 19:27:50 +08:00
Hector Li
ccfea55942
[QNN EP] Enable QNN HTP VTCM size setting (#18653)
### Description
[QNN EP] Enable QNN HTP VTCM size setting
2023-11-30 21:09:13 -08:00
Tianlei Wu
9c9e6adeb2
Add SDXL Turbo to demo (#18627)
* Add SDXL Turbo to the demo.
* Change default scheduler to EulerA for XL or Turbo since DDIM does not
work well with small steps.

Example to run the model in demo (See README for instructions):
```
python3 demo_txt2img_xl.py --version xl-turbo --height 512 --width 512 --denoising-steps 1 --scheduler UniPC "little cute gremlin sitting on a bed, cinematic"
```
2023-11-30 18:19:31 -08:00
Wanming Lin
c7732a78d7
[WebNN EP] Fixed bug in op checking (#18638) 2023-11-30 17:47:56 -08:00
Xu Xing
73d9b03509
[js/webgpu] Add multidimensional(>4) uniform support (#18546)
This change removes the check of enableShapesUniforms. When all uses of
this are removed, enableShapesUniforms can be removed too.
2023-11-30 17:10:33 -08:00
Wanming Lin
73a2eb82eb
Fixed bug in Flatten's axis (#18645)
Flatten's axis is in the range [-r, r] rather than [-r, r-1].
2023-11-30 16:19:22 -08:00
Jiajia Qin
6781b6cf3d
[js/webgpu] add bool type for Expand/Gather (#18615)
### Description
In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
it uses expand with bool type running on cpu ep.




| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
CPUExecutionProvider |

After this change, it will run on jsep.
| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
JsExecutionProvider |
2023-11-30 15:47:08 -08:00
Yi Zhang
efee9abdb7
Reduce downloads in Nuget-Java pipeline to reduce connection exception (#18635)
### Description
1. Add a new stage to download java tools from https://oss.sonatype.org
and publish them to pipeline artifact
2. Remove downloads in other jobs, they get the java tools from pipeline
artifact
3. consolidate final_java_testing stages.


### Motivation and Context
Reduce downloads to reduce the connection error like below.

```
--2023-11-28 07:16:31--  https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar
Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23
Connecting to oss.sonatype.org (oss.sonatype.org)|3.227.40.198|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2023-11-28 07:16:32 ERROR 502: Bad Gateway.
```
2023-12-01 07:44:44 +08:00
zesongw
4025bd8ebd
[WebNN EP] Fix bug of padding in Op ConvTranspose (#18577)
Get the dimensions of H and W according to the layout.
2023-11-30 12:59:36 -08:00
Jiajia Qin
b1e749e3be
[js/webgpu] Add program name into webgpuProfiling info (#18640)
### Description
Currently, we only print the kernelName, which is hard to distinguish
which shader we actually used. For example, GroupedConv/Conv2DMatMul
both belong to Conv kernel. It's not intuitive for profiling.
2023-11-30 12:57:29 -08:00
Dmitri Smirnov
c5ea1547c6
Eliminate intermediate string conversion buffer. (#18608)
### Description
  Make use of unsafe string constructor that is able to convert native
  UTF-8 string straight into the string instance buffer.

### Motivation and Context
Reduce garbage,
2023-11-30 10:50:24 -08:00
Yulong Wang
e7f64f4510
[js/web] fix ESLint by excluding generated .js from tsconfig.json (#18634)
### Description
ESLint will went into error sometimes.

The root cause is because some large generated JavaScript file in the
tsconfig's include path will cause TypeScript parser fail in a line of
`string.match()` with a regex on a huge string (~8MB), causing the
following error:
```
RangeError: Maximum call stack size exceeded
```

The solution is to remove the large files from the tsconfig's include
path. Previously I excluded the `web/dist/` folder and this PR excludes
`web/test/ort.test[.min].js`.
2023-11-30 09:50:47 -08:00