### Description
**This PR is a replacement of #17820.**
allow to specify callback for profiling data
*Previous*:
```js
ort.env.webgpu.profilingMode = 'default'; // enable profiling
// profiling data will output to console.
```
*Now*:
```js
ort.env.webgpu.profiling = {
mode: 'default'; // enable profiling
ondata: (data) => {
// .. process the profiling data
}
};
//for each kernel, "ondata" will be called once. only output to console if ondata is not specified.
```
### Description
reuse EO pool in NPM pipeline.
### Motivation and Context
build_web_debug failed in onnxruntime-Win-CPU-2022 but it works in EO
pool.
Reuse EO pool to make the pipeline work now.
When I'm free, I'll try upgrading the chrome in the custom image.
Build onnxruntime.dll as arm64x
Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.
If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.
Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.
AB#22533
### Description
<!-- Describe your changes. -->
As title.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
yolo-v8 model missing operator support.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Fix a bug that can't create context binary if the model has inputs/outputs with different data type
### Description
Update EPContext op schema to unblock nodes with different data type among inputs & outputs
### Description
- Update QNN CI Pipelines to use QNN SDK version 2.17.0
- **Print warning if unit test requires adjusted tolerance to pass**
- **Temporarily disable unloading QnnCpu.dll for windows x64 due to
crash when calling FreeLibrary**
- Enable fixed HTP tests
- QnnHTPBackendTests.LayerNorm1D_LastAxis_DynamicScale
- QnnHTPBackendTests.GlobalMaxPool_LargeInput2_u8
- QnnHTPBackendTests.ReduceSumS8Opset13_Rank5
- QnnHTPBackendTests.ReduceSumU8Opset13_Rank5_LastAxis
- QnnHTPBackendTests.WhereLargeDataBroadcastU8
- QnnHTPBackendTests.WhereLargeDataBroadcastTransformedU8
- Enabled fixed CPU tests
- QnnCPUBackendTests.Resize_DownSample_Linear_AlignCorners_scales
- Increased tolerance for HTP tests that are less accurate on QNN SDK
2.17.0
- QnnHTPBackendTests.AveragePool_CountIncludePad_HTP_u8
- QnnHTPBackendTests.AveragePool_AutopadSameUpper_HTP_u8
- QnnHTPBackendTests.AveragePool_AutopadSameLower_HTP_u8
- QnnHTPBackendTests.ConvU8U8S32_bias_dynamic_input
- QnnHTPBackendTests.ConvU8U8S32_bias_initializer
- QnnHTPBackendTests.ConvU8U8S32_large_input1_padding_bias_initializer
- QnnHTPBackendTests.LRNSize3
- QnnHTPBackendTests.LRNSize5
- QnnHTPBackendTests.MaxPool_Large_Input_HTP_u8
- QnnHTPBackendTests.MaxPool_LargeInput_1Pads
- QnnHTPBackendTests.Resize_DownSample_Linear_HalfPixel
- QnnHTPBackendTests.ResizeU8_2xLinearPytorchHalfPixel
- QnnHTPBackendTests.ResizeU8_2xLinearHalfPixel
- QnnHTPBackendTests.ResizeU8_2xLinearAlignCorners
- QnnHTPBackendTests.ResizeU8_2xLinearAsymmetric
- Disabled ONNX model tests
- averagepool_2d_ceil: Accuracy issues **only on Windows x64
QnnCpu.dll**
- Disabled QDQ model tests (onnx_test_runner)
- facedetection_op8_qdq: Accuracy issues
- Disabled CPU EP tests (these use QnnCpu.dll)
- ActivationOpTest.Relu: QNN SDK 2.17 Relu treats inf as FLT_MAX
- GemmOpTypedTests/0.TestGemmBroadcast: Inaccuracy when weight is
initializer and bias is not
- MathOpTest.MatMulFloatType "test padding and broadcast B > A":
Inaccuracy (**only linux**)
- Fix Gemm translation bugs in QNN EP:
- Do not skip processing of inputs that need to be transposed.
### Motivation and Context
- Allow testing with newest QNN SDK version
- Take advantage of improvements to enable new models.
### Description
<!-- Describe your changes. -->
Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
As a prerequisite for this model running correctly, two PRs need to be
merged:
- GQA Sliding Window Attention:
https://github.com/microsoft/onnxruntime/tree/aciddelgado/gqa_local
- MHA Fusion:
https://github.com/frankdongms/onnxruntime/tree/frdong/llama_70b
This PR adds optimization, quantization, and benchmarking support for
Mistral. The README included describes steps to export, optimize, and
benchmark Mistral models, but won't function correctly without the two
above branches being merged first.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Skip module clone for preparing large model export
For LLAMA2 13B, when running with Lora, DeepSpeed stage2 on 8 GPUs . It
failed during preparing outputs which will be used for
torch.onnx.export. The reason, we deep copy all the params including
both big sizes of frozen weights, + a little bit of Lora trainable
weight.
This PR will firstly check whether the GPU memmory is enough for a
cloned module, if not, skip the copy.
Copying the module is to guarantee the fw path run may change the
weight, while this case should be rare. But for now, Not-Able-To-Run is
worse than Runnable-with-A-little-bit-different-initial-weight,
especially for large models.
### Description
<!-- Describe your changes. -->
Set cuda device before create cuda stream for IOBinding case
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to fix the issue #18432 , which the inference will fail for
IOBinding case when there are multiple cuda devices. The reason is that
the cuda device is not set properly before the cuda stream is created
### Description
Use Uniforms in GatherElements and clean-up
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
### Description
<!-- Describe your changes. -->
Add a macro `SAVE_TEST_GRAPH ` in `graph_transform_test_builder.cc`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This will help us debug the graph and Unitest.
Co-authored-by: ruiren <ruiren@microsoft.com>
in training, shape/size need to be executed immediately when it's ok to
be executed and thus to save memory if possible;
the toposort logic is enhanced before, while didn't take of the
"shape->size" pattern, which make the following size op will not show up
in toposort result.
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also
set `scale` or `zero_point`.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also
set `scale` or `zero_point`.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
data_reader,
activation_type=QuantType.QUInt16,
weight_type=QuantType.QUInt8)
quantize(input_model_path,
output_model_path,
qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
[SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast
generative text-to-image model that distilled from [Stable Diffusion
2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It is
targeted for 512x512 resolution.
1. Support sd-turbo model.
1. Refiner ControlNet in demo
+ Cache the ControlNet model so that it is downloaded only once.
+ Do not download default images in script. Instead update document to
use wget to download example image.
+ Fix an issue of control image processing that causes shape mismatch in
inference.
1. Refine arguments:
+ Change argument --disable-refiner to --enable-refiner since refiner is
not used in most cases
+ Rename --refiner-steps to --refiner_denoising_steps
+ Add abbreviations for most used arguments.
+ Add logic to set default arguments for different models.
1. Refine torch model cache:
+ Share cached torch model among different engines to save disk space.
+ Only download fp16 model (previously, ORT_CUDA downloads fp32 model).
1. Do not use vae slicing when image size is small.
1. For LCM scheduler, allow guidance scale 1.0~2.0.
2. Allow sdxl-turbo to use refiner
### Performance Test Results
Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on
A100-SXM4-80GB.
Batch | Steps | TRT 8.6 static | ORT_TRT static | ORT_CUDA static | TRT
8.6 dynamic | ORT_TRT dynamic | ORT_CUDA dynamic
-- | -- | -- | -- | -- | -- | -- | --
1 | 1 | 32.07 | 30.55 | 32.89 | 36.41 | 38.30 | 34.83
4 | 1 | 125.36 | 97.40 | 97.49 | 118.24 | 114.95 | 99.10
1 | 4 | 62.29 | 60.24 | 62.50 | 72.49 | 77.82 | 67.66
4 | 4 | 203.51 | 173.11 | 168.32 | 217.14 | 215.71 | 172.53
* Dynamic engine is built for batch size 1 to 8, image size 512x512 to
768x768, optimized for batch size 1 and 512x512
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.
### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250
---------
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
To make the code more consistent. Now some TRT pipelines download TRT
binaries on-the-fly, while other TRT pipelines use a preinstalled
version. This PR make them the same.
### Description
<!-- Describe your changes. -->
Remove developement id and force codesign not required in the test macos
target.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix failure happened in iOS_Full_xcframwork stage in
Zip-Nuget-Java-NodeJS packaging pipeline.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Fix unit tests failures in build with contrib ops disabled.
- QDQTransformerTests.QDQPropagation_GH11605_Opset12_19
- TransposeOptimizerTests.QnnTransposeNonConstBroadcastInput
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
* Add SDXL Turbo to the demo.
* Change default scheduler to EulerA for XL or Turbo since DDIM does not
work well with small steps.
Example to run the model in demo (See README for instructions):
```
python3 demo_txt2img_xl.py --version xl-turbo --height 512 --width 512 --denoising-steps 1 --scheduler UniPC "little cute gremlin sitting on a bed, cinematic"
```
### Description
1. Add a new stage to download java tools from https://oss.sonatype.org
and publish them to pipeline artifact
2. Remove downloads in other jobs, they get the java tools from pipeline
artifact
3. consolidate final_java_testing stages.
### Motivation and Context
Reduce downloads to reduce the connection error like below.
```
--2023-11-28 07:16:31-- https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar
Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23
Connecting to oss.sonatype.org (oss.sonatype.org)|3.227.40.198|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2023-11-28 07:16:32 ERROR 502: Bad Gateway.
```
### Description
Currently, we only print the kernelName, which is hard to distinguish
which shader we actually used. For example, GroupedConv/Conv2DMatMul
both belong to Conv kernel. It's not intuitive for profiling.
### Description
Make use of unsafe string constructor that is able to convert native
UTF-8 string straight into the string instance buffer.
### Motivation and Context
Reduce garbage,
### Description
ESLint will went into error sometimes.
The root cause is because some large generated JavaScript file in the
tsconfig's include path will cause TypeScript parser fail in a line of
`string.match()` with a regex on a huge string (~8MB), causing the
following error:
```
RangeError: Maximum call stack size exceeded
```
The solution is to remove the large files from the tsconfig's include
path. Previously I excluded the `web/dist/` folder and this PR excludes
`web/test/ort.test[.min].js`.