Commit graph

11997 commits

Author SHA1 Message Date
Yi Zhang
435e19953e
Fix llama.covert_onnx to make it runnable in CI (#19372)
### Description
1. Make parity_check use a local model to avoid needing an HF token.
2. `del` of the model didn't work because it tried to delete an object defined
outside the function scope.
     As a result, it caused out-of-memory errors on A10.
3. In fact, 16G of GPU memory (one T4) is enough, but the conversion
process is always killed on T4 while it works on A10/24G.
     Standard_NC4as_T4_v3 has 28G CPU memory.
     Standard_NV36ads_A10_v5 has 440G memory.
     It looks like the model conversion needs a very large amount of memory.
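
The scoping problem in point 2 can be illustrated with a minimal sketch. The names `Model` and `release_in_callee` are hypothetical and are not taken from convert_to_onnx.py; the sketch only shows that `del` on a function parameter unbinds the callee's local name, while the reference held by the caller keeps the object (and its memory) alive.

```python
import gc
import weakref


class Model:
    """Hypothetical stand-in for a large model object."""


def release_in_callee(model):
    # Only unbinds the local name `model`; the caller still holds a reference.
    del model
    gc.collect()


def is_alive(ref):
    return ref() is not None


model = Model()
probe = weakref.ref(model)  # lets us observe whether the object was freed

release_in_callee(model)
alive_after_callee_del = is_alive(probe)  # still referenced by the owning scope

del model  # drop the reference in the scope that owns it
gc.collect()
alive_after_owner_del = is_alive(probe)  # no references left
```

In this sketch the object survives the `del` inside the function and is only reclaimed once the owning scope drops its reference.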

### Motivation and Context
Previously, I ran into some issues in convert_to_onnx.py, so I used the
ONNX model in https://github.com/microsoft/Llama-2-Onnx for testing.
Those issues can now be fixed, so I use the ONNX model generated by this
repo, and the CI can cover the model conversion.
2024-02-05 07:26:24 +08:00
PeixuanZuo
0cba56e0a0
[ROCm] Fix CI pipeline by fixing pytest version (#19407)
Pin pytest to version 7.4.4; higher versions cause the following error:

`from onnxruntime.capi import onnxruntime_validation 
ModuleNotFoundError: No module named 'onnxruntime.capi'`
2024-02-04 16:37:36 +08:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
Setup usage of coremltools via dependencies instead of copying files. 
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
  - mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related changes

### Motivation and Context
Improve setup.
2024-02-03 09:42:21 +10:00
Tianlei Wu
18c3acb198
update import in convert_generation.py (#19385)
Fix for https://github.com/microsoft/onnxruntime/issues/19376
- Use absolute import instead of relative import for now. 
- Fix some typos
2024-02-02 10:16:37 -08:00
Jiajia Qin
ccbe264a39
[js/webgpu] Add LeakyRelu activation for fusedConv (#19369)
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
values work with `float32` uniform attributes.

For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max))` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_max` are `f32`, not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max)))`.

This problem was introduced when we made activation attributes
uniforms instead of constants.

BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
2024-02-02 09:06:38 -08:00
Yulong Wang
50806a7dd5
[js/web] support external data in npm test (#19377)
### Description
Support external data in npm test.

This allows the test runner to detect whether external data is available
in the test folder and, if it is, load it as external data
automatically.

This feature does not parse every model to figure out whether the model
has external data. The following comments in the code explain how it
determines whether to parse the model file.

```js
      // for performance consideration, we do not parse every model. when we think it's likely to have external
      // data, we will parse it. We think it's "likely" when one of the following conditions is met:
      // 1. any file in the same folder has the similar file name as the model file
      //    (e.g., model file is "model_abc.onnx", and there is a file "model_abc.pb" or "model_abc.onnx.data")
      // 2. the file size is larger than 1GB
```
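
The heuristic described in those comments can be sketched as follows. This is an illustrative Python version, not the test runner's actual JavaScript code: the function name `likely_has_external_data` is hypothetical, "similar file name" is assumed to mean "starts with the model file's stem", and 1GB is assumed to mean 2**30 bytes.

```python
import os
import tempfile

ONE_GB = 1 << 30  # assumed interpretation of the "1GB" threshold


def likely_has_external_data(model_path, size_threshold=ONE_GB):
    """Return True when the model file is worth parsing for external data."""
    folder = os.path.dirname(os.path.abspath(model_path))
    model_name = os.path.basename(model_path)
    stem = os.path.splitext(model_name)[0]

    # Condition 1: another file in the same folder has a similar name,
    # e.g. "model_abc.pb" or "model_abc.onnx.data" next to "model_abc.onnx".
    for name in os.listdir(folder):
        if name != model_name and name.startswith(stem):
            return True

    # Condition 2: the model file itself is large.
    return os.path.getsize(model_path) > size_threshold


with tempfile.TemporaryDirectory() as folder:
    model = os.path.join(folder, "model_abc.onnx")
    open(model, "wb").close()
    alone = likely_has_external_data(model)  # tiny file, no sibling
    open(os.path.join(folder, "model_abc.onnx.data"), "wb").close()
    with_sibling = likely_has_external_data(model)  # sibling data file present
```

Both checks only touch file metadata, which is why they are cheap compared with parsing the model.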
2024-02-02 09:05:57 -08:00
Jiajia Qin
efc17e79de
[js/webgpu] Fix the undefined push error (#19366)
### Description
This PR fixes the following error when WebGPU profiling is enabled:
```
TypeError: Cannot read properties of undefined (reading 'push')
```
2024-02-02 02:04:06 -08:00
PeixuanZuo
9139bdda02
[ROCm] CK implementation support causal mask (#18943)
Use `MaskingSpecialization::MaskOutUpperTriangle` to support causal masks
in the CK implementation.
2024-02-02 16:34:51 +08:00
Adrian Lizarraga
a2eb967008
Fix Split index bugs uncovered by QNN SDK 2.19 (#19381)
### Description
- When converting ONNX split sizes to QNN split indices, do not include
the split at index 0. QNN 2.19 assumes index 0 is implicit and throws a
validation error if provided.
- Fix bug when using an ONNX Split operator with a `num_outputs`
attribute that does not evenly divide into `shape[axis]`. The ONNX spec
states that the last chunk should be smaller, but QNN EP made the last
chunk larger.
- Fix bug when using an ONNX Split operator with a `split` input. QNN EP
was incorrectly passing the split sizes as split indices without
conversion.
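
The size-to-index conversion described above can be sketched as follows. This is an illustrative Python version, not the QNN EP's C++ code. The helper names are hypothetical, the chunk size for `num_outputs` is assumed to be `ceil(shape[axis] / num_outputs)`, and the input is assumed to leave the last chunk with a positive size.

```python
import math
from itertools import accumulate


def onnx_split_sizes(dim, num_outputs):
    """Chunk sizes for Split with `num_outputs`: equal chunks, last one smaller if needed."""
    chunk = math.ceil(dim / num_outputs)
    sizes = [chunk] * (num_outputs - 1)
    sizes.append(dim - chunk * (num_outputs - 1))
    return sizes


def split_sizes_to_indices(sizes):
    """Convert split sizes to split-point indices, leaving the implicit index 0 out."""
    # Running totals mark the end of each chunk; the last total equals the
    # axis length and is not a split point, so it is dropped.
    return list(accumulate(sizes))[:-1]


sizes = onnx_split_sizes(7, 3)           # [3, 3, 1]
indices = split_sizes_to_indices(sizes)  # [3, 6]
```

For an axis of length 7 split into 3 outputs, the sizes are 3, 3, 1 and the split points are 3 and 6; index 0 never appears in the result.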

### Motivation and Context
QNN SDK 2.19 updated validation criteria for Split operators. QNN EP was
previously passing a split index that should have been implicit.

Also discovered bugs when using the `num_outputs` attribute and the `split`
input.
2024-02-02 00:22:16 -08:00
petermcaughan
09d5c1b56f
Fix DEBUG_GENERATION build (#19383)
### Description
Currently, ORT will fail a build when the flag DEBUG_GENERATION is set
to 1 (used to debug BeamSearch and GreedySearch) in
[console_dumper.h](3b63d85c25/onnxruntime/contrib_ops/cpu/utils/console_dumper.h (L12))
with the following error:


`onnxruntime/onnxruntime/contrib_ops/cpu/transformers/logits_processor.h:270:15:
error: ‘DumpScores’ was not declared in this scope`

This is because it is defined in `logits_processor.cc`, and a debugging
artifact was passed in an earlier PR where this function is called from
`logits_processor.h` before it is defined
[[link](3a2ab1963a/onnxruntime/contrib_ops/cpu/transformers/logits_processor.h (L270))].
Builds with the flag have been broken since that PR was merged.

This PR moves DumpScores() definition from `logits_processor.cc` to
`logits_processor.h` so that all debug statements can be used correctly
in `logits_processor.cc` and `logits_processor.h` and build succeeds
with this debug flag.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2024-02-01 23:53:32 -08:00
zz002
b71be3c1e3
[VitisAI] Resolving compilation errors when using USE_VITISAI (#19368)
### Description
Resolving compilation errors when using USE_VITISAI


### Motivation and Context
There are compilation errors when USE_VITISAI is enabled.
This is in addition to #19058.

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-02-01 23:00:50 -08:00
Xu Xing
3a2ab1963a
[js/webgpu] Refactor createTensorShapeVariables (#18883)
2024-02-01 17:59:00 -08:00
Changming Sun
13ad922e7f
Improve MatMulNBits test (#19378)
### Description
The test creates millions of threads. This change is to avoid that by
using an existing thread pool.


### Motivation and Context
2024-02-01 16:18:14 -08:00
ironman
8a2646ce60
Metrics - llama-2 - Add package name and version to engine of onnxruntime (#19325)
### Description



### Motivation and Context
2024-02-01 15:52:20 -08:00
He Li
1bdd7d9499
Update oneDNN to v3.0.1 in order to support gcc 13 (#19344)
### Description

Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug
hindering gcc 13.

### Motivation and Context


Referring to
[oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548).

- When building with `--use_dnnl` using gcc 13.x, it will fail due to
this upstream issue.
- This is fixed in `v3.0.1`
[tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this
commit](1d7971ce48).
2024-02-01 15:39:03 -08:00
jingyanwangms
319481898c
Give a triton library missing warning instead of silently turn off (#19276)
### Description
When USE_ORTMODULE_TRITON is set to 1 but the triton library is missing,
the triton functionality is silently turned off. This adds a warning.
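
A minimal sketch of this behavior, assuming the flag is read from the environment and the library is probed by module name. The function `triton_enabled` and the warning text are hypothetical, not the ORTModule implementation.

```python
import importlib.util
import os
import warnings


def triton_enabled(env=None, module_name="triton"):
    """Return True only if Triton is requested and importable; warn if requested but missing."""
    env = os.environ if env is None else env
    if env.get("USE_ORTMODULE_TRITON", "0") != "1":
        return False
    if importlib.util.find_spec(module_name) is None:
        # Requested but unavailable: say so instead of disabling silently.
        warnings.warn(
            f"USE_ORTMODULE_TRITON is set to 1 but the '{module_name}' library "
            "was not found; Triton support is disabled."
        )
        return False
    return True
```

Passing `env` and `module_name` explicitly keeps the check easy to exercise without changing the real environment.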
2024-02-01 15:25:33 -08:00
Hector Li
0fa88bc810
Multi-partition support for context binary cache feature (#18865)
### Description
Multi-partition support for context binary cache feature
1. In QNNEP create the list of EPContext nodes if ep_context_enable is enabled, so that it can dump the model with multiple partitions
2. Extend context loading part to support multiple EPContext nodes

### Motivation and Context
Before this change, only a single partition was supported. There's graph partition limitation for context cache feature after this change.
2024-02-01 15:04:29 -08:00
Linnea May
eb0ce86db8
[DML] Resize 18 & 19 (#19071)
### Description
Register resize-18 and -19, which will be lit up automatically when dml
feature level bumps up to 6300.

It's worth noting that DML has a different implementation for antialias
than does ORT CPU. DML does iterative downsampling whenever the scale
factor is less than 0.5. This is equivalent to performing resize with a
variable-sized input window (also equivalent to mip mapping). ORT takes
a different approach, using the same convolution approach as PIL. The
two implementations approach each other in certain cases (with
iota-generated data) but they usually aren't perfectly equivalent.

### Motivation and Context

---------

Co-authored-by: Linnea May <linneamay@microsoft.com>
2024-02-01 10:26:37 -08:00
Yueqing Zhang
1d6f13fb92
[VitisAI] Refactor the VAIEP to use MSFT's standalone API (#19058)
### Description
Refactor the VAIEP to use MSFT's standalone API


### Motivation and Context
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
2024-01-31 21:08:26 -08:00
Scott McKay
68b6064be6
Fix reporting of unused initializers in subgraphs (#19341)
### Description
Increment num_resolves_ inside the graph resolve finalization function
so the subgraphs have the same value.

This prevents incorrect output regarding removing unused initializers.

### Motivation and Context
#19141
2024-02-01 08:02:12 +10:00
Yi-Hong Lyu
55b60d8fe0
Turn off Neural Speed to avoid slowdowns (#19265)
Disable Neural Speed to prevent the operation following MatMulNBits from
significantly slowing down.
2024-01-31 13:40:25 -08:00
Adrian Lizarraga
ca8d4459d4
Add contrib Q/DQ ops to symbolic shape inference tool (#19340)
### Description
Adds type/shape inferencing support for MSFT domain QuantizeLinear and
DequantizeLinear operators to symbolic_shape_infer.py


### Motivation and Context
Need a way to infer the types and shapes of Q/DQ ops in models that use
the MSFT domain versions (e.g., int16 quantization).
2024-01-31 10:38:01 -08:00
Phoebe Chen
2b361c04d6
Fix Flatbuffer build issue. (#19296)
### Description

Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux.
This commit addresses the flatbuffer build issue with the following
changes:
1. Remove the Werror flag in the flatbuffer patch.
2. Add a compilation option to suppress the 'stringop-overflow' error in
the Flatbuffers within the xnnpack provider.

### Motivation and Context
https://github.com/google/flatbuffers/issues/8119
https://github.com/microsoft/onnxruntime/pull/19239

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-31 10:12:43 -08:00
zesongw
d87f73ab44
[WebNN EP] Use GetVecUint32FromVecInt64 to simplify the code (#19324)
- Use the function `GetVecUint32FromVecInt64` in helper.h to replace
`transform`.
- Change some `int32_t` to `uint32_t`.
- Remove a useless `temp`.
2024-01-31 00:20:07 -08:00
Baiju Meswani
3262e8df2f
Introduce a Nominal Checkpoint for On-Device Training (#19232)
2024-01-30 22:11:25 -08:00
petermcaughan
4562c910fe
Whisper Crash Fix (#19345)
### Description
There is a current bug in the BeamSearch implementation of T5, GPT, and
Whisper due to an interaction between two PRs merged in the past 7
months.

First PR/code change is the addition of BeamSearchScorer GPU
implementation. This PR accelerates some operations by executing them in
the GPU and not the CPU. The approach for this code change didn't
utilize a cudaStream when copying one particular variable from GPU to
CPU (see nullptr value here:
[[link](b65d3d0a53/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_t5.h (L213))]).

The second PR/code change was the alteration to utilize a cudaStream to
initialize various memory buffers in BeamSearch (see `stream` included
as the last argument in these allocations
[[link](d1431e1b78/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_base.h (L25))]).

During the in-between period of these two PRs, I believe neither
allocation utilized a stream and were thus synchronized. Once the latter
PR was merged, the copy became desynchronized with the initialization
due to different streams.

The fix for this is to reintroduce the same stream into the copy
operation added in the first PR.



### Motivation and Context
This does not happen reliably on every hardware with every script due to
the race condition nature, but the bug completely breaks ORT execution
with a BeamSearch model.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2024-01-30 21:53:18 -08:00
Yulong Wang
dd1f6ccc45
[js/webgpu] resolve codescan alert (#19343)
### Description
resolve codescan alert:
https://github.com/microsoft/onnxruntime/security/code-scanning/17687
2024-01-30 21:06:21 -08:00
Xu Xing
d73131cf0f
[js/webgpu] Use DataType as uniform cpu type (#19281)
This avoids converting the data type to a string with tensorDataTypeEnumToString.
2024-01-30 21:05:08 -08:00
Jiajia Qin
85cef0af8c
[js/webgpu] Support capture and replay for jsep (#18989)
### Description
This PR expands the graph capture capability to JS EP, which is similar
to #16081. But for JS EP, we don't use CUDA Graph; instead, we
record all GPU commands and replay them, which removes most of the CPU
overhead and avoids the situation where the GPU waits for the CPU.

mobilenetv2-12 improves from 6ms to 3.7ms on NV 3090 and from 4.58ms to
3.38ms on Intel A770.

All limitations are similar to those of CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models wherein all ops in the
model can be partitioned to the JS EP or CPU EP with no memory copy
between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IO binding is required.

The usage is shown below.
Method 1: Specify output buffers explicitly.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer/outputBuffer
... ...

const feeds = {
  'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

const fetches = {
  'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};

let results = await session.run(feeds, fetches);  // The first run will begin to capture the graph.

// update inputBuffer content
... ...
results = await session.run(feeds, fetches);  // The 2nd run and after will directly call replay to execute the graph.

... ...
session.release();
```
Method 2: Don't specify output buffers explicitly. Internally, when
graph capture is enabled, it will set the location of all outputs to
'gpu-buffer'.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer
... ...

const feeds = {
  'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

let results = await session.run(feeds);  // The first run will begin to capture the graph.

// update inputBuffer content
... ...
results = await session.run(feeds);  // The 2nd run and after will directly call replay to execute the graph.

... ...
session.release();
```
2024-01-30 18:28:03 -08:00
Scott McKay
6dd0079d13
Exclude more code from custom_ops.cc when not required in minimal build (#19142)
### Description
- Split out the code that implements the OrtKernelContext API (used by
compiled nodes and custom ops) and the code that implements the custom
ops API.
- Exclude based on minimal build settings using helpers
  - the main change is to simply wrap the implementation into a lambda so
it can be easily enabled/disabled
  - actual implementation of all functions is unchanged
- Re-organize so the related implementations are together
  - most diffs are from this, but without the reorg it would be much
harder to know which helper to use
- General cleanup of lines that were too long.

### Motivation and Context
Saves ~10KB in a minimal build.

Build command used for comparison
```
./build --android --android_api=29 --android_sdk="d:\Android" --android_abi=arm64-v8a --parallel --android_ndk_path="D:\Android\ndk\26.0.10792818\" --build_shared_lib --cmake_generator Ninja --skip_tests --minimal_build --disable_rtti --disable_ml_ops --disable_exceptions --cmake_extra_defines=onnxruntime_BUILD_UNIT_TESTS=OFF --include_ops_by_config .\no_ops.config --config MinSizeRel
```

Main: 1,218,480 bytes
With changes: 1,208,320 bytes
2024-01-31 12:25:34 +10:00
Wanming Lin
1e936bfd63
[WebNN] Ignore empty optional input tensor (#19235)
An empty optional input tensor is indicated by an empty name. Such inputs
are allowed, and we should just ignore them.
2024-01-30 18:09:16 -08:00
Yi Zhang
e74f141338
Save stablediffusion and open-clip in pipeline cache (#19314)
### Description
1. Save the model to the pipeline cache
2. Lower the similarity bar to 97
3. Publish the generated image so that we can check it when the test fails


### Motivation and Context
Reduce model downloads
2024-01-31 09:39:27 +08:00
Adrian Lizarraga
0c38e96bb5
[Quant tool] Ensure MSFT opset for Q/DQ models (#19335)
### Description
Updates qdq quantization to ensure the final model has the
`com.microsoft` opset import if the model uses Q/DQ ops with the
`com.microsoft` domain (e.g., for int16 quantization)


### Motivation and Context
Need to ensure the MSFT domain is correctly set for all relevant cases.
Otherwise, shape inferencing tools will raise an exception.
2024-01-30 17:19:08 -08:00
Jiajia Qin
90883a366a
[js/webgpu] Add hardSigmoid activation for fusedConv (#19233)
### Description
Add hardSigmoid activation for fusedConv. It will be used by
mobilenetv3-small-100 model.
2024-01-30 16:28:53 -08:00
Changming Sun
8dad9d92f4
Move einsum's test data to constexpr variables (#19320)
### Description
emscripten's C++ compiler has difficulty compiling einsum_test.cc
because the file has too many local variables. So I moved them to
constexpr variables.
2024-01-30 15:59:37 -08:00
Edward Chen
c379a89bcb
[MLAS AArch64] SQNBitGemm optimization (#19272)
1. Add support for packing 4-bit values 32 at a time for CompInt8. 32 4-bit values can fit into a single 128-bit NEON register. For CompInt8, this enables a more efficient path for block sizes greater than or equal to 32. CompFp32 seems to do better with handling 16 elements at a time, so this 32-value packing is not used there.
Pack differently based on compute type. Adjust APIs to handle this.

2. Introduce template argument for whether to handle zero-point. This results in less code for the no zero-point (symmetric) case. However, there is a binary size increase due to the additional template instantiations.
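
The arithmetic behind item 1 is that 32 values × 4 bits = 128 bits, i.e. 16 bytes, which is exactly one 128-bit NEON register. A minimal sketch of such packing follows. The nibble order (low nibble first) and the helper names are assumptions for illustration; the actual MLAS layout is not described here.

```python
def pack_4bit(values):
    """Pack 4-bit values two per byte, low nibble first (layout is an assumption)."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_4bit(packed):
    """Inverse of pack_4bit."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


block = [i % 16 for i in range(32)]  # one block of 32 4-bit values
packed = pack_4bit(block)
bits = len(packed) * 8  # size of the packed block in bits
```

With 32 input values, `packed` holds 16 bytes, so `bits` is 128.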
2024-01-30 14:29:12 -08:00
Changming Sun
04afe77305
Update ThirdPartyNotices.txt: Add Intel neural-speed (#19332)
Add Intel neural-speed to ThirdPartyNotices.txt because it will be
shipped in the default build in most of our packages.
2024-01-30 12:40:30 -08:00
kunal-vaishnavi
febec1c586
Update Whisper export with beam search (#19322)
### Description
This PR updates the Whisper export with beam search by adding the
following.

- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the
Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` to allow
existing attention fusions to continue working
- Re-uses the cache directory when loading the PyTorch model to reduce
memory used on disk
- Adds `--disable_auto_mixed_precision` to the example FP16 export
command

### Motivation and Context
- [This PR](https://github.com/microsoft/onnxruntime/pull/19112) added
the `is_unidirectional` parameter to `CheckInputs`, but it was not
provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](https://github.com/microsoft/onnxruntime/pull/19200)
explains the reasoning behind why `eager` is used to load the
`WhisperAttention` class.
- By re-using the cache directory for loading the PyTorch model, only
one copy of the PyTorch model is saved on disk instead of two copies.
- By providing this flag, there will be fewer Cast nodes in the Whisper
with beam search model to switch between FP16 and FP32 precision.
2024-01-30 11:59:15 -08:00
ivberg
3454f86e70
Windows - Only set thread affinity on Server with auto affinity (#19318)
### Description
Only set thread affinity on Server with auto affinity. Auto affinity =
when the API user does not specify thread settings or affinity themselves.

### Motivation and Context
On client, it is best to let the OS scheduler handle this. On big (P-Core) / little
(E-Core) CPU designs, affinity overrides win32 Quality of Service (QoS)
and has high power usage. Specifically, for background workloads whose
process is tagged QoS Utility (Background), this affinity setting
overrides the OS scheduler, which only wants to schedule on the E-Cores.
Waking up the P-Cores thus uses more energy than intended on client, and
users get less battery life.

Foreground AI workloads would be tagged QoS High and would run the ORT
threads on all cores.
2024-01-30 10:53:10 -08:00
liqun Fu
b84cb247e3
io_binding to handle optional input of sequence type_proto (#19273)
2024-01-30 10:25:14 -08:00
Wei-Sheng Chin
ffc3431a66
Update ScatterElements to Support Opset 13, 15, 18 (#19198)
`ScatterElements` in opset 18 has been around for a while. However, the
highest opset supporting `ScatterElements` in ORT is 13. This PR
implements this op in CUDA EP by replacing `assignment` in the current
CUDA kernel with `atomic reduction` (e.g., atomic add, atomic max). A
series of fundamental atomic functions (e.g., atomic max for int8_t and
half) are implemented in `common.cuh`; the implementation is general
enough to cover old CUDA and new CUDA versions.

- The core changes are in `cuda/atomic/common.cuh` with very detailed
documentation including `bit-wise operation's visualization`. They are
also copied to `rocm/atomic/common.cuh` to support AMD GPU.
- `/cuda/tensor/gather_elements_impl.cu` contains small changes to call
the new atomic functions to support new `reduction` behavior in new
`ScatterElements`.
- New `ScatterElements` are defined in `rocm_execution_provider.cc` and
`cuda_execution_provider.cc`.
2024-01-30 09:18:50 -08:00
Rachel Guo
3e17ca3dab
Fix iOS artifacts issue in Microsoft.ML.OnnxRuntime Nuget Package (#19311)
### Description

Updates to only include ios archs framework in artifacts included in
Nuget Package.


### Motivation and Context

Related issue:
https://github.com/microsoft/onnxruntime/issues/19295#issuecomment-1914143256

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-01-30 08:44:20 -08:00
Changming Sun
a92802f940
Disable a few tests for wasm build (#19316)
2024-01-30 08:16:57 -08:00
Vincent Wang
9f68a27c7a
[ORTModule] Handle Cast on Constant Number on Triton Code-gen (#19321)
When using scaled_dot_product_attention on float16 type, the exported
graph has Sqrt(float16(constant)), which cannot be constant-folded in ORT
because the Sqrt CPU kernel doesn't support float16. This causes Triton
code-gen to generate code like:

result = 128.0.to(tl.float32)

This code cannot be compiled because .to() cannot be applied to a
constant.

This PR handles this case so that the Cast is not applied to a
constant number.
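
The idea can be sketched with a tiny code generator. `emit_cast` is a hypothetical helper written for illustration, not the ORTModule Triton code-gen: it emits the cast for named tensor expressions and leaves numeric constants untouched.

```python
import numbers


def emit_cast(operand, triton_dtype):
    """Return source text for casting `operand`, skipping the cast for numeric constants."""
    if isinstance(operand, numbers.Number):
        # A literal such as 128.0 has no .to() method in the generated code,
        # so emit it unchanged instead of "128.0.to(tl.float32)".
        return repr(operand)
    return f"{operand}.to({triton_dtype})"


constant_expr = emit_cast(128.0, "tl.float32")  # "128.0"
tensor_expr = emit_cast("x", "tl.float32")      # "x.to(tl.float32)"
```

Only the tensor operand receives a `.to(...)` call, so the generated line no longer contains the invalid `128.0.to(tl.float32)`.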
2024-01-30 17:04:01 +08:00
Xu Xing
624b4e2063
[js/webgpu] Remove enableShapesUniforms (#19279)
2024-01-29 17:49:06 -08:00
Chi Lo
00d048121b
[TensorRT EP] Fix InferenceSession::Run() not thread-safe issue (#19301)
Given that InferenceSession::Run() is guaranteed to be thread-safe,
meaning multiple threads can call this function concurrently,
TRT EP needs to handle concurrency carefully here. Otherwise, the
following concurrency issue might happen:

- To perform inference concurrently in multiple streams, it is suggested
to use one trt execution context per stream.
In the design of TRT EP (which does not apply a per-thread context
implementation), if multiple threads are calling InferenceSession::Run()
concurrently, the trt execution context instance is shared by all the
threads and each thread acquires a different stream from ORT.
So TRT EP will end up having one trt execution context using multiple
streams, which is not suggested.
But since the whole compute_func() is protected by the lock, if
cudaStreamSynchronize() is enforced here, one trt execution context per
stream is guaranteed.

Therefore, TRT EP needs to call cudaStreamSynchronize() in
compute_func(), which waits until the stream has completed all
operations, to prevent this concurrency issue.

GitHub issue: https://github.com/microsoft/onnxruntime/issues/19275
2024-01-29 17:36:27 -08:00
Baiju Meswani
465540d29b
Update training api python documentation (#19287)
2024-01-29 14:14:15 -08:00
Changming Sun
e91d91ae4f
Fix a build issue: /MP was not enabled correctly (#19190)
### Description

In PR #19073 I misunderstood the value of "--parallel". Instead of
testing whether args.parallel is None or not, I should test the returned
value of the number_of_parallel_jobs function.

If build.py was invoked without --parallel, then args.parallel equals
1, because that is the default value. Then we should not add "/MP".
However, the current code adds it, because `if args.parallel` is
evaluated as `if 1`, which is True.
If build.py was invoked with --parallel without an additional number, then
args.parallel equals 0, because the number is unspecified. Then we should add
"/MP". However, the current code does not add it, because `if
args.parallel` is evaluated as `if 0`, which is False.
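
The two cases can be reproduced with a small argparse sketch. The parser setup (default 1, value 0 for a bare `--parallel`) follows the behavior described above. The body of `number_of_parallel_jobs` and the `!= 1` comparison in `should_add_mp` are assumptions for illustration, not the actual build.py code.

```python
import argparse
import os


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Absent flag -> 1 (default); bare "--parallel" -> 0 (const); "--parallel N" -> N.
    parser.add_argument("--parallel", nargs="?", const=0, default=1, type=int)
    return parser.parse_args(argv)


def number_of_parallel_jobs(args):
    # 0 means no explicit count was given: fall back to the CPU count.
    return (os.cpu_count() or 1) if args.parallel == 0 else args.parallel


def should_add_mp(args):
    # The buggy check was `if args.parallel`, which is True for the default 1
    # and False for a bare --parallel (0). Decide from the resolved job count.
    return number_of_parallel_jobs(args) != 1


no_flag = parse_args([])
bare_flag = parse_args(["--parallel"])
```

Here `bool(no_flag.parallel)` is True and `bool(bare_flag.parallel)` is False, which is the inverted behavior the fix addresses.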

This also adds a new build flag: use_binskim_compliant_compile_flags, which is intended to be only used in ONNX Runtime team's build pipelines for compliance reasons. 

### Motivation and Context
2024-01-29 12:45:38 -08:00
Changming Sun
4ee222413f
Update OneBranch.Nuget-WindowsAI-Pipeline.Official.yml for Azure Pipelines (#19293)
To fix a pipeline issue.
2024-01-29 12:00:42 -08:00
Guenther Schmuelling
9e69606360
fix f16 for attention, enable slice and flatten for more types (#19262)
2024-01-29 10:13:46 -08:00