Commit graph

9508 commits

Author SHA1 Message Date
Changming Sun
71da0824f3
Upgrade binskim and fix an error in nuget packaging pipeline (#17340)
### Description
Upgrade binskim and fix an error in nuget packaging pipeline.
2023-08-30 07:52:06 -07:00
Adrian Lizarraga
21ae86e405
[QNN EP] Fix test zero-point calculation and flaky MatMul test (#17338)
### Description
- Fix incorrect zero-point calculation in unit tests. Affects int8(signed) QDQ models.
- Replace flaky MatMul test that occasionally fails on main branch with a version that uses explicit inputs.

### Motivation and Context
Fix bug and improve test accuracy and stability.
2023-08-29 23:16:57 -07:00
Jian Chen
922629aad8
Upgrade Centos7 to Alamlinux8 (#16907)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Get the latest gcc 12 by default

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
2023-08-29 21:05:36 -07:00
Tianlei Wu
c961f67b5e
Handle dtype attribute in float16 conversion script (#17321)
Some operators have dtype attribute (search `dtype` in
https://github.com/onnx/onnx/blob/main/docs/Operators.md).
This change make sure dtype attribute is handled correctly in float16
conversion.
2023-08-29 18:41:56 -07:00
Adam Louly
8224891236
add logits option to generate artifacts (#17276)
### Description

Adding the ability to export logits as an output for train and eval
graphs in generate_artifacts
it will remain optional..
2023-08-29 16:55:31 -07:00
cloudhan
f3682eee3b
Fix log color, otherwise, the immediate line followed by the colored log will be tainted (#17329) 2023-08-30 07:46:04 +08:00
Ryan Hill
c438360c1e
Noticed a simple simplification in beam_search_topk (#17275)
### Description
There was an Init() method that does exactly like the lines I replaced,
so I switched to it.

### Motivation and Context
Simpler with no drawbacks.
2023-08-29 15:17:33 -07:00
Yi Zhang
d4a61ac71f
Pr trggiers generated by code (#17247)
### Description
1. Refactor the trigger rules generation.
2. Skip all doc changes in PR pipelines.


### Motivation and Context 
Make all trigger rules generated by running set-trigger-rules.py to
reduce inconsistences.
It's easily to make mistakes to copy&paste manually. 

For example: these 2 excludes are different, Why?

4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-ci-pipeline.yml (L16-L18)


4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml (L27-L29)


### Note
All changes in workflow yamls are generated by code.
Please review the **skip-js.yml, skip-docs.yml and
set-trigger-rules.py**.

@fs-eire, please double check the 
filter rules in skip-js.yml
and the skipped workflows

7023c2edff/tools/ci_build/set-trigger-rules.py (L14-L41)
2023-08-30 05:57:03 +08:00
AtanasDimitrovQC
fd0917b27b
Propagate noop_with_empty_axes in reduce operators. (#16845) 2023-08-29 14:15:03 -07:00
kushalpatil07
7b92057376
EvalStep called with wrong inputs onnxruntime_training_cxx_inline.h (#17331) 2023-08-29 14:14:35 -07:00
Yulong Wang
e5ca3f3dcb
[js/api] introducing IO binding for tensor (#16452)
[//]: # (## Work In Progress. Feedbacks are welcome!)

### Description
This PR adds a few properties, methods and factories to Tensor type to
support IO-binding feature. This will allow user to create tensor from
GPU/CPU bound data without a force transferring of data between CPU and
GPU.

This change is a way to resolve #15312

### Change Summary
1. Add properties to `Tensor` type:
a. `location`: indicating where the data is sitting. valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sit side to `data`, a readonly property of `WebGLTexture`
type. available only when `location === 'texture'`
c. `gpuBuffer`: sit side to `data`, a readonly property of `GPUBuffer`
type. available only when `location === 'gpu-buffer'`

2. Add methods to `Tensor` type (usually dealing with inference
outputs):
- async function `getData()` allows user to download data from GPU to
CPU manually.
- function `dispose()` allows user to release GPU resources manually.

3. Add factories for creating `Tensor` instances:
    a. `fromTexture()` to create a WebGL texture bound tensor data
    b. `fromGpuBuffer()` to create a WebGPUBuffer bound tensor data
    c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer

### Examples:

create tensors from texture and pass to inference session as inputs
```js
// when create session, specify we prefer 'image_output:0' to be stored on GPU as texture
const session = await InferenceSession.create('./my_model.onnx', {
  executionProviders: [ 'webgl' ],
  preferredOutputLocation: { 'image_output:0': 'texture' }
});

...

const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```
2023-08-29 12:58:26 -07:00
Chen Fu
8827363fd2
Bugfixes: dangling pointers and python property typo (#17285)
### Description
Bug fixes


### Motivation and Context
Fixing one dangling pointer, and one python property name typo
2023-08-29 12:50:15 -07:00
Jiajia Qin
fffefb1c22
[js/webgpu] Optimize matmul (#16969)
### Description
Changes in this PR:
1) use the optimized version `makeMatMulPacked[Vec4]Source` to support
matmul.
2) enable the conv2dByMatMul path.
3) support broadcast
4) use IndicesHelper.

MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when
enabling profilingMode on my ADL.
2023-08-29 12:40:57 -07:00
Patrice Vignola
4880f1da46
Fix attention fusion for UNet onnx model export when using LoRA weights (#17249)
### Description
Tested with stable diffusion unet models exported by both pytorch 2.1.0
(nightly) and pytorch 1.13.1, with and without LoRA weights.



### Motivation and Context
LoRA weights modifiy the unet model by adding matmul and scale
operations to every q/k/v/out tensors, which breaks the current MHA
pattern recognition.
2023-08-29 11:59:30 -07:00
Hector Li
761c4333b5
[QNN EP] GridSample op support (#17317)
### Description
QNN EP GridSample op support
2023-08-29 11:41:59 -07:00
Hector Li
742b192a34
[QNN EP] Enable GlobalMaxPool op (#17304)
### Description
[QNN EP] Enable GlobalMaxPool op
2023-08-29 11:25:34 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
2023-08-29 10:28:26 -07:00
Yi Zhang
0e9e9b2a67
Fix one exception in post merge (#17327)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-29 19:24:50 +08:00
Baiju Meswani
5d2c57363f
Sign CUDA Kernel (#17293) 2023-08-28 21:03:58 -07:00
Baiju Meswani
38ea8c3931
Increase max error tolerance for ConvTransposeGrad test (#17315) 2023-08-28 17:05:40 -07:00
Tianlei Wu
ee9d046112
Fix model serialization with external data in current directory (#17311)
When original model has external data in current directory, saving the
optimized model will raise File not found exception during looking for
external data file under root directory "/". This fix will look under
current directory for this case.

I manually tested an extra case and it is working: Original model with
external data in root directory ("/"), and save optimized to current
directory.

BTW, there is another bug found: when
"session.optimized_model_external_initializers_min_size_in_bytes" is set
a large value, some tensor is still pointed to the original external
data file. Add a TODO in unit test for this bug. Possible solution: load
external data into memory before saving model.
2023-08-28 16:06:04 -07:00
Caroline
228db24317
Add training API functions to WASM API (#16521)
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.


### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.

---------

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-08-28 11:05:02 -07:00
Hariharan Seshadri
cbd97515cd
[JS/WebGPU] Support GatherElements kernel (#17243)
### Description
As title


### Motivation and Context
Improve WebGPU kernel coverage
2023-08-28 09:55:25 -07:00
mindest
53169f59e5
[ROCm] Sort candidate solutions in rocBLAS/hipBLASLt for deterministic offline tuning (#17297)
### Description

Sort the candidates in rocBLAS/hipBLASLt to make sure that they are
properly ordered and can be correctly fetched by saved indices in
offline tuning cases.
2023-08-28 16:34:21 +08:00
cloudhan
bf8b1681f9
Build nuget pkg for ROCm (#16791)
Add nuget pkg building and publishing for ROCm EP

---------
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
2023-08-28 13:35:08 +08:00
Yulong Wang
bb1871332f
[js/webgpu] add kernel Not and Equal (#17306)
### Description
This PR adds kernel implementation for operator "Not" and "Equal". Also
removed download cache in gpu data manager.

**Why removing download cache**
The following test case failed. ("Or" is on CPU, "Greater" and "Equal"
are on JSEP)

![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2)
after debugging, I found that both "Equal" and "Greater" are using the
same output GPU Data ID. This is because when ORT executes the graph, it
first run "Equal", allowing its shader to write into GPU Data ID 2; then
a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP);
at this point, ORT thinks GPU Data ID=2 is free to use; so it reuse it
as output for "Greater". This means there is no allocation for output of
"Greater" kernel, and both kernel writes to GPU Data ID=2.

For gpu data manager, there will be 2 downloads from the same GPU
buffer. Previously I think this is a waste of resource so I cached the
data. But now it shoes that we need to perform 2 downloads because the
GPU data is already different. The download data cache should be
removed.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-27 19:50:17 -07:00
simonjub
4eedd3bb46
[TRT EP] Fix logic to reach cache encryption code. (#17111)
### Description
This is a followup to PR #15519 that is closed in favor of this one.


### Motivation and Context
The current implementation of TRT cache has no code execution path
possible so that an encrypted TRT engine cache could be created when
flags engine_cache_enable and engine_decryption_enable are true. This
was originally raised in issue #12551.
2023-08-26 20:09:03 -07:00
Scott McKay
ca0159b45d
Various test infra updates from testing Azure ops with MAUI test app (#17262)
### Description
<!-- Describe your changes. -->
- fix issue with handling string input
- set minSdkVersion
  - otherwise defaults to 19 which we don't support and the build breaks
- comment out the debug logging hook
  - enabling it breaks the Android native logging
  - can be enabled if you need to debug C# code
- update test data tools to allow creating input data for raw file
contents (e.g. audio) and from strings (e.g. auth token value)
- fix some warnings

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve test setup
2023-08-27 09:35:00 +10:00
Yulong Wang
ddcd46174e
[js/webgpu] fix jsepOnRunEnd (#17300)
### Description
fix jsepOnRunEnd: jsepOnRunEnd() need to be run after runPromise is
resolved.
2023-08-26 00:30:28 -07:00
Yifan Li
808215366d
Fix Multi GPU TensorRT tests (#17269)
### Description
* Integrate `trt_multi_gpu` test stage in ORT post merge CI (Win-2xA10
vm)
* Deprecate Linux MultiGPU TRT CI (This vm will be deprecated soon)
* Add multi gpu support to existing C# test cases
* Deprecate unfunctional flag `--enable_multi_device_tests`

### Motivation and Context
* Two contexts of replacing Linux MultiGPU TRT CI:
* Flag `--enable_multi_device_tests` is not functional, which cannot
detect issues like #17036
* The Linux-2xM60 VM of this CI pool is about to be deprecated 9/6/23.
Need to enable this test in other dualGPU vm pool.
2023-08-25 20:30:45 -07:00
Arthur Islamov
c262879214
Added DML and CUDA provider support in onnxruntime-node (#16050)
### Description
I've added changes to support CUDA and DML (only on Windows, on other
platforms it will throw an error)



### Motivation and Context
It fixes this feature request
https://github.com/microsoft/onnxruntime/issues/14127 which is tracked
here https://github.com/microsoft/onnxruntime/issues/14529

I was working on StableDiffusion implementation for node.js and it is
very slow on CPU, so GPU support is essential.

Here is a working demo with a patched and precompiled version
https://github.com/dakenf/stable-diffusion-nodejs

---------
2023-08-25 16:57:06 -07:00
Jiajia Qin
873ef8b8f0
[js/webgpu] add label for some webgpu APIs (#17291)
### Description
<!-- Describe your changes. -->
With the label, it's more easier to identify which op causes the error.

Without the label, the error message is like below: 
```
Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32'
    return W[i2o_W(indices)];
    ^^^^^^

 - While validating [ShaderModuleDescriptor]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).
```
With the label, the error message is like below:
```
Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32'
    return W[i2o_W(indices)];
    ^^^^^^

 - While validating [ShaderModuleDescriptor "ConvTranspose2D"]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "ConvTranspose2D"]).
```
### Motivation and Context
This change is mainly for debugging. With this change, we can easily
know that `ConvTranspose2D`'s shader has problem from above message.
2023-08-25 12:12:56 -07:00
xhcao
5e8d94cec8
[js/webgpu] support Greater and Less operators (#17296)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 12:11:25 -07:00
Adrian Lizarraga
5a83a67f32
Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)
### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for Quantize/Dequantize contrib
ops
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests,
TransposeOptimizerTests

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] CliQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ
models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is
the first step in achieving this goal: we need to make QDQ contrib ops
work with our optimizations/transformations.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-08-25 09:57:51 -07:00
Yulong Wang
79c4ed9a45
[js/webgpu] support error pop and kernel name (#17260)
### Description
This PR contains changes to support error pop and kernel name.

- Add a function `JsepGetNodeName` to allow reading kernel name from JS
to C++
- When in debug mode ( `env.debug = true;` ) or in profiling mode (
`env.webgpu.profilingMode = 'default';` ), kernel name will be read from
ORT; otherwise use the kernel pointer ( a number ) as kernel name to
save calls from JS to C++.
- When in debug mode, WebGPU validation errors will be recorded and if
any error occurs, `inferenceSession.run()` will fail (Promise get
rejected). Behavior when not in debug mode is not changed. This is
because recording errors are not zero-overhead, and GPU validation
errors should occur consistently in and not in debug mode.
- Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to:
   - allow implementation of the features mentioned above.
   - pass session ID to backend.
2023-08-25 08:08:15 -07:00
satyajandhyala
da180b20fa
[JS/Web] Fix ConvTranspose shader code compilation errors. (#17232)
### Description
Fix JSEP ConvTranspose shader code errors.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 06:25:54 -07:00
guyang3532
401129d484
Add support for more ops for padding elimination (#17217)
Add support for Gelu/ReduceMean/SimplifiedLayerNormalization for padding
elimination
2023-08-25 18:02:15 +08:00
mindest
735cc8e6c8
[ROCm] enable If op for ROCm EP. (#17279)
### Description
Enable If op for ROCm EP.
2023-08-25 17:49:49 +08:00
Yi Zhang
9cd33e07b4
Readd Tests in Window GPU Reduced Ops workflow (#17294)
### Description
Add single test step in Window GPU Reduced Ops workflow


### Motivation and Context
The old workflow's building and testing were running in one command.
In PR #17263, the test step was removed by mistake.
So, readd it.
How to consolidate the test step is in consideration.
2023-08-25 15:56:59 +08:00
Yi Zhang
4a0f8f6672
Skip one flaky Test (#17290)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

It's skipped in the PR
```
2023-08-25T02:37:48.7772670Z 1: [ RUN      ] ModelTests/ModelTest.Run/cuda__models_opset9_Candy_candy
2023-08-25T02:37:48.7824755Z 1: D:\a\_work\1\s\onnxruntime\test\providers\cpu\model_tests.cc(91): Skipped
2023-08-25T02:37:48.7825343Z 1: Skipping single test It's in broken_tests
```
2023-08-25 14:48:41 +08:00
Changming Sun
3e934030f4
nodejs: Release Ort Env before main function returns (#17288)
### Description
Release OrtEnv before main function returns. Before this change, OrtEnv
is deleted when C/C++ runtime destructs all global variables in ONNX
Runtime's core framework.
The callstack is like this:
```
  * frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7
    frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2
    frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17
    frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1
    frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2
    frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17
    frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261
    frame #7: 0x00007ffff7857630 libc.so.6`exit + 32
    frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135
    frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128
    frame #10: 0x0000000000abbdee node`_start + 46
```
After this change, OrtEnv will be deleted before the main function
returns and nodejs is still alive.
2023-08-24 23:07:02 -07:00
mindest
93ae17d1bb
[ROCm] Add hipBLASLt workspace support (#17096)
### Description
* hipBLASLt extra workspace for split-k
* type update (due to extra support for fp8 in hipBLASLt)
* minor changes
2023-08-25 13:08:57 +08:00
pengwa
7c98f45928
Fix layernorm and softmax axis after upstream (#17255)
### Fix layernorm and softmax axis after upstream

For Gather (the slicing is a scalar), the output rank is small than its
inputs.

When we upstream this kind of Gather before softmax or layernorm, we
should also update the axis attribute.
Otherwise, the axis might be out-of-date and incorrect for the updated
rank.

```
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception
    raise exception
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 280, in forward
    self._build_graph(graph_transformer_config)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 158, in wrapper
    result = func(graph_execution_manager, *args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 273, in wrapper
    result = func(graph_execution_manager, *args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 361, in _build_graph
    super()._build_graph(graph_transformer_config)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 184, in _build_graph
    self._graph_builder.build(config)
RuntimeError: /onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:823 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Node (Softmax_2904) Op (Softmax) [ShapeInferenceError] 'axis' must be in [-3 , 2]. Its actual value is: 3
```
2023-08-25 12:26:22 +08:00
Faith Xu
86238fb507
[Docs] Auto generate JS API (#17271)
### Description
Adds new workflow to generate js docs with latest changes so the API
page can stay up to date

[Test page of latest js
docs](https://faxu.github.io/onnxruntime/docs/api/js/modules/InferenceSession.html)
2023-08-24 17:35:37 -07:00
Yi Zhang
756eda2cc4
Windows CI build steps template (#17263)
### Description
1. New windows ci build steps template.
2. Remove useless variables.

### Motivation and Context
1. Make it easier to apply build cache to all windows CIs.
2. Other team's devs only need to take care of build options


###Comparision
Before: 

9f21f694cf/tools/ci_build/github/azure-pipelines/win-gpu-tensorrt-ci-pipeline.yml (L19-L82)

After:
b4c1f2261b/tools/ci_build/github/azure-pipelines/win-gpu-tensorrt-ci-pipeline.yml (L35-L54)
2023-08-25 05:58:49 +08:00
Hector Li
680fac64ed
[QNN EP] Support non-quantized Op on HTP (#17194)
### Description
[QNN EP] Support non-quantized Op on HTP

1. Remove the limitation in GetCapability that always require QDQ node
unit group to partition the node on NPU backend. So that we can support
non-quantized Slice op with int32 data input on HTP.
2. Enable Where QDQ node unit
3. Separate out the flag is_npu_backend & is_quantized_node to make it
clear
4. Separate output QuantizeLinear, DequantizeLinear to QdqOpBuilder to
better identify quantized/un-quantized input/output tensor
5. Separate out a TransposeOpBuilder to make it simple for Transpose
node processing. Especially for Single Transpose node in QDQ model, for
case like Q->Tranpose->DQ, Transpose is not QDQ node unit group, it's
single node. But we should treat it as quantized node. Output should has
same data type and quantization parameter with input. Another case is to
support non-quantized data for Transpose in QDQ model.
6. Remove is_npu_backend flag from OpBuilder interface. Set the backend
type in QnnBackendManager, QnnMOdel & QnnModelWrapper, so that
OpBuilders can always get it from QnnModelWrapper.
7. Add unit tests for quantized/non-quantized Transpose (int32, float32)
on HTP backend
2023-08-24 14:57:16 -07:00
pengwa
18d5cfdb85
Fix build - redefinition of default argument for ‘long unsigned int Extent’ (#17281)
### Fix build - redefinition of default argument for ‘long unsigned int
Extent’

One of the training customer env, building ORT, there is such a build
error. The GCC version are

```
aiscuser@node-0:/tmp/onnxruntime$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0


aiscuser@node-0:/tmp/onnxruntime$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0


```

But on our dev node using same GCC/G++, we don't have build issue., not
sure what's the difference but giving an explict type when creating
`gsl::span` fixed the problem.

```
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’
  394 | class span
      |       ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here
   46 | template <class ElementType, std::size_t Extent = dynamic_extent>
      |                                                   ^~~~~~~~~~~~~~~
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete
   82 | [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) {
      |                                                                                             ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void*, size_t)’:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed:
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte*, size_t&)’
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’
  740 | span(Type (&)[Extent]) -> span<Type, Extent>;
      | ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note:   template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note:   mismatched types ‘Type [Extent]’ and ‘const std::byte*’
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’
  743 | span(std::array<Type, Size>&) -> span<Type, Size>;
      | ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note:   template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note:   mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte*’
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
```



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 00:40:40 +08:00
pengwa
d90afc697b
Introduce ZeROOffloadSubscriber for ORTModule (#17006)
### Introduce ZeROOffloadSubscriber for ORTModule

As part of the work: integrate ORTModule with DeepSpeed stage3, this PR
mainly focus on moving original PyTorch-based (leveraging hooks) param
partition/offload implementation to ORTModule compatible implementation.

Changes include:
1. Refactor `SubscriberBase`/`SubcriberManager` to support
pre-forward/post_forward hooks.
2. Implement new `ZeROOffloadSubscriber` by re-using DeepSpeed hook
function as much as possible. Since all hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the good thing is,
the closure is not complex, all hooks are referencing the owning
`DeepSpeedZeRoOffload` instance, so we can create new hook function with
`FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance,
then call the new created function in subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces.
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, then we don't need
change any code on the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function by
tolerating weights size be (0) (changed by DeepSpeed zero stage 3).

UT will be added once stage3 is fully supported. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 00:15:22 +08:00
Baiju Meswani
fca81cc5d5
ConvTransposeGrad CUDA Kernel (#17201) 2023-08-24 09:08:06 -07:00
Jian Chen
33415b9da4
Removing 10.14 suffix from osx nuget package (#17277)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-24 08:51:54 -07:00