Commit graph

892 commits

Author SHA1 Message Date
Scott McKay
df740d7d15
Throw if unique_ptr or array allocation fails due to SafeInt overflow (#18941)
### Description
If we fail to calculate the buffer size (due to overflow) we currently
return a nullptr. This is inconsistent as an actual memory allocation
failure throws. An overflow would typically be due to bad input so an
exception makes more sense given that.

Change to throw so code using MakeUniquePtr* and AllocArray* doesn't
need to check for nullptr.
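
A rough Python analogue of the behaviour change, as a sketch only. The real code is C++ using SafeInt; `SIZE_MAX`, `checked_byte_size` and `alloc_array` are hypothetical names used for illustration. The point is that an overflowing size calculation raises, the same way a failed allocation would, so callers never see a null-like result.

```python
SIZE_MAX = 2**64 - 1  # assumption: a 64-bit size_t


def checked_byte_size(count: int, elem_size: int) -> int:
    """Return count * elem_size, raising on overflow instead of returning a sentinel."""
    if count < 0 or elem_size < 0:
        raise ValueError(f"negative size: count={count}, elem_size={elem_size}")
    total = count * elem_size
    if total > SIZE_MAX:
        # Include the inputs in the message to help track down the bad caller.
        raise OverflowError(
            f"allocation size overflow: count={count}, elem_size={elem_size}"
        )
    return total


def alloc_array(count: int, elem_size: int) -> bytearray:
    """Allocate a zeroed buffer; callers do not need a None check."""
    return bytearray(checked_byte_size(count, elem_size))
```

With this shape, bad input surfaces as an exception at the allocation site rather than as a later use of an invalid buffer.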

Add some extra info to the log message to help debugging.

### Motivation and Context
Should help with #18905 by avoiding the invalid attempted usage of a
nullptr from the allocation. Extra info _might_ help with figuring out
where the overflow is coming from which is the real issue.
2024-01-03 07:57:51 +10:00
Hector Li
8931854528
Move some QNN EP provider options to session options (#18877)
Move QNN EP provider options to session options

### Description
Session options are needed to support multi-partition for the context cache feature. To smooth the transition, move the provider options to session options first.

This is the first step for PR:
https://github.com/microsoft/onnxruntime/pull/18865
2023-12-20 00:13:38 -08:00
pengwa
ccf3b2054b
Allow layer-wise recompute (#18566)
### Allow layer-wise recompute 

Previously, users/developers had to specify the subgraphs to recompute.
We now introduce a more user-friendly way to enable recompute for all
detected stashed activation recomputation subgraphs. This sacrifices
getting the best configs but makes it easier to support users when
they switch from PyTorch per-layer gradient checkpointing to
ORTModule.

`ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage. By
default it is 0, i.e. `USER_SPECIFIED`: all subgraphs defined in
`ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed. So this is compatible
with existing recompute usage in ORTModule integrated models.

Using `ORTMODULE_MEMORY_OPT_LEVEL=1`, we will enable all recompute plans
detected, so those configs in `ORTMODULE_MEMORY_OPT_CONFIG` will not be
respected any more.
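
A minimal sketch of how the two variables might be set from Python. `memory_opt_env` is a hypothetical helper, and it assumes the variables are read from the process environment when the model is wrapped with ORTModule; only levels 0 and 1 are described in this PR.

```python
import os


def memory_opt_env(level: int = 0, config: str = "") -> dict:
    """Build the environment variables for the recompute level described above."""
    if level not in (0, 1):
        raise ValueError("only levels 0 and 1 are described in this PR")
    env = {"ORTMODULE_MEMORY_OPT_LEVEL": str(level)}
    # At level 1 all detected plans are enabled, so a user config is not respected.
    if level == 0 and config:
        env["ORTMODULE_MEMORY_OPT_CONFIG"] = config
    return env


# Layer-wise recompute: enable every detected plan.
os.environ.update(memory_opt_env(level=1))

# Wrapping the model afterwards (requires an onnxruntime-training install):
# from onnxruntime.training.ortmodule import ORTModule
# model = ORTModule(model)
```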


Add unit tests using 3-layer BLOOM models.



https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md
2023-12-12 08:44:05 +08:00
Hector Li
ccfea55942
[QNN EP] Enable QNN HTP VTCM size setting (#18653)
### Description
[QNN EP] Enable QNN HTP VTCM size setting
2023-11-30 21:09:13 -08:00
Edward Chen
14a343441d
Fix Objective-C static analysis build (#18606)
- Patch abseil to fix a compile error about not finding `cxxabi.h`.
- Fix some static analysis warnings.
2023-11-28 17:14:20 -08:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently the memory optimizer runs graph transformations and prints
recompute opportunities at INFO level, but the ORT backend emits so many
INFO level logs that users have a hard time finding this information. So
we are looking for a Python binding API to retrieve the memory
optimization opportunities instead of depending on the MemoryOptimizer's
default logging.
Then we can print ORTModule feature statistics using this information.
Also, with such an API, we can run the analysis after an ORT session is
created, where the allocation plan is done, so the analysis will consider
buffer reuse as well. This can avoid reporting some recomputation
subgraphs that are reusing other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull request makes the following refactorings:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract the optimization strategies so that each has its own
implementation. This will make introducing new strategies (for example,
compression and decompression) easier.

New logging matrix (INFO level). At WARNING level, the details will NOT
be shown.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```
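
As a quick check of how the `Saving(B)` column relates to `Saving Symbolic(Bytes)`, the sketch below evaluates the Plan 1 and Plan 2 expressions with the dim values from Note 2. The `plans` dict and variable names are illustrative only; the expressions and values are copied from the log above.

```python
# Dim values from "Note 2" in the log above.
dim0 = 1      # inputs_input_ids_dim0
dim1 = 1024   # inputs_input_ids_dim1

# Symbolic saving expressions copied from the Plan 1 and Plan 2 rows.
plans = {
    "Reshape+Where+BiasSoftmax+:1:-1": lambda d0, d1: 640.0 * d0 * d1**2,
    "Cast+:1:-1": lambda d0, d1: d0 * d1 * (384.0 * d1 - 64.0),
}

# Evaluate each expression to get the saving in bytes for the first batch.
savings = {config: int(expr(dim0, dim1)) for config, expr in plans.items()}
for config, saved in savings.items():
    print(f"{config:<35} {saved:>15,}")

# Note 1: several plans are enabled by joining their configs with commas.
config_value = ",".join(plans)
print(f"export ORTMODULE_MEMORY_OPT_CONFIG={config_value}")
```

The evaluated values are 671,088,640 and 402,587,648 bytes, matching the `Saving(B)` column for the two enabled plans.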

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
2023-11-23 11:39:00 +08:00
Dmitri Smirnov
81a763a9eb
Make TensorShapeVector to use InlinedVector<Int64_t> to reduce on template instantiations (#18519)
### Description
Use InlinedVector<int64_t> instead of InlinedVector<int64_t, 5> to
reduce the number of template instantiations.

### Motivation and Context
The reported size reduction is small, just a few Ks. Just trying it out.
2023-11-21 14:13:50 -08:00
Sheil Kumar
2a01622536
Hide NPU Adapter selection behind macro (#18515)
Hide NPU Adapter selection behind macro

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-11-21 08:47:56 -08:00
RandySheriffH
53917a3353
Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478)
Move data member in LiteOpFunc to its parent to avoid possible mem
leaks.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-11-18 15:00:54 -08:00
Edward Chen
0a4d76d98b
MLAS AArch64 quantized int4 Gemm kernel (#18031)
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.

- Connect MatMulNBits contrib op to MLAS function.
2023-11-15 09:31:54 -08:00
Dmitri Smirnov
f19c673595
If Branch Constant Folding (#18105)
### Description

When an `If` condition proves to be a constant value, inline the
corresponding subgraph, enabling more constant folding and
optimization.
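
The idea can be sketched on a toy graph. This is not ORT's graph API: nodes are plain dicts, `fold_constant_ifs` is a hypothetical helper, and the input/output renaming that real inlining needs is omitted.

```python
def fold_constant_ifs(nodes, constants):
    """Replace each `If` whose condition is a known constant with the chosen branch.

    nodes: list of dicts with an "op" key; `If` nodes also carry "cond"
           (a value name), "then_branch" and "else_branch" (lists of nodes).
    constants: dict mapping value names to known boolean values.
    """
    folded = []
    for node in nodes:
        if node["op"] == "If" and node["cond"] in constants:
            taken = "then_branch" if constants[node["cond"]] else "else_branch"
            # Recurse so nested constant `If` nodes inside the branch collapse too.
            folded.extend(fold_constant_ifs(node[taken], constants))
        else:
            # Non-constant `If` nodes and all other ops are kept unchanged.
            folded.append(node)
    return folded


graph = [
    {"op": "Add"},
    {
        "op": "If",
        "cond": "use_fast_path",
        "then_branch": [
            {"op": "Mul"},
            {"op": "If", "cond": "training",
             "then_branch": [{"op": "Dropout"}], "else_branch": []},
        ],
        "else_branch": [{"op": "Div"}],
    },
    {"op": "Relu"},
]

flat = fold_constant_ifs(graph, {"use_fast_path": True, "training": False})
print([n["op"] for n in flat])
```

Once the branches are inlined into the parent graph, the ordinary constant folding pass can see through what used to be subgraph boundaries.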

### Motivation and Context
Newly converted models feature lots of nested `If` nodes that can be
inlined and collapsed.

In particular, for the sample models we are gaining on TorchScript
exported models.
For `HF Mobile Bert Dynamo`, runtime went down from 0.069 -> 0.046. In
total, AOT inlining + `If` constant folding
yields an improvement of about 50%, 0.102 -> 0.046, bringing us very
close to TorchScript exported models.

`HF Bart Dynamo` further improves 0.668 -> 0.45. AOT + `If` constant
folding improves 0.98 -> 0.45.

Earlier, the HF Mobile Bert model was **161MB+**; it is now **98MB**.
The HF Bart Dynamo pre-optimized model was about **1.2GB**; it is now
**710MB**.


![image](https://github.com/microsoft/onnxruntime/assets/11303988/1491a247-d371-4e66-85a3-2aeb702e8ca0)
2023-11-13 17:33:30 -08:00
RandySheriffH
646f77a94b
Align context virtuals (#18396)
Deprecate ROCM context virtual function, to align with CUDA.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-11-11 12:41:37 +10:00
RandySheriffH
59262dfc63
Add cuda context headers to zip (#18330)
Expose cuda context headers for cuda custom ops.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-11-09 14:53:58 -08:00
Ted Themistokleous
8d50313816
[Migraphx EP] Static int8 QDQ support (#17931)
### Description
Adding static int8 quantization support for MIGraphX Execution Provider

- Allow parsing calibration tables generated by ONNX Runtime's or
TensorRT's toolsets
- Add proper environment variables to the MIGraphX EP
- Update the Python API to include updating execution provider flags,
which was missing on the Python side
- Hook into MIGraphX's int8 quantization and optimization of models

### Motivation and Context

Required so that we can get onnxruntime to pass in models while
leveraging the existing tooling for int8 static QDQ quantization.

First step in a series of PRs which will add further static quantization
on the operator level as MIGraphX releases further support.

These changes draw heavily from the TensorRT EP and should allow for
similar functionality for GPU based (versus CPU) quantization of models
before inference is performed.

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2023-11-09 17:46:49 +08:00
Hector Li
55c19d6ab5
[QNN EP] Enable option to set QNN context priority (#18315)
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".

### Description
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".

This feature lets a model run inference with higher priority. Tested
with the onnxruntime_perf_test tool using the same model.
1. Run the model on the NPU with a single instance: the latency is 300ms.
2. Run the same model on the NPU with 2 instances at the same time.
   Case 1:
   both with the same priority (high) -- latency is 600ms
   Case 2:
   1 with low priority -- latency is 30,000ms
   1 with high priority -- latency is 300ms
   Case 3:
   1 with normal priority -- latency is 15,000ms
   1 with high priority -- latency is 300ms
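
A hedged sketch of how the option might be passed from Python. `qnn_provider_options` is a hypothetical helper; the `backend_path` value `QnnHtp.dll` and the model file name are assumptions for illustration, and the session creation is left commented out because it needs a QNN-enabled onnxruntime build.

```python
VALID_PRIORITIES = ("low", "normal", "normal_high", "high")


def qnn_provider_options(priority: str = "normal",
                         backend_path: str = "QnnHtp.dll") -> dict:
    """Build a QNN EP provider options dict with a validated context priority."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(
            f"qnn_context_priority must be one of {VALID_PRIORITIES}, got {priority!r}"
        )
    return {"backend_path": backend_path, "qnn_context_priority": priority}


options = qnn_provider_options("high")

# Requires an onnxruntime build with the QNN EP:
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=["QNNExecutionProvider"],
#     provider_options=[options],
# )
```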
2023-11-08 20:56:36 -08:00
Justin Chu
c250540722
Bump linter versions (#18341)
Bump linter versions and run format.
2023-11-08 13:04:40 -08:00
Adrian Lizarraga
a0eeeafa80
[QNN EP] Session option for graph optimization (#18262)
### Description
Adds the QNN session option `htp_graph_finalization_optimization_mode`
to enable QNN graph optimizations at the expense of longer preparation
time.

### Motivation and Context
Allow enabling QNN graph optimizations per app/model.
2023-11-08 10:06:15 -08:00
Preetha Veeramalai
d87216bcb1
Openvino ep ort 23.1 (#17911)
### Description
Integration to OpenVINO 2023.1


### Motivation and Context

- Alignment with the latest OpenVINO version.
- Change the device name from VPUX to NPU and remove it from the
supported list until official public support is available.

---------

Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2023-11-01 08:39:39 -07:00
RandySheriffH
2b95e74fa1
Versioning for custom op (#18088)
Allow custom ops to have versions.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-10-31 16:50:27 -07:00
Maximilian Müller
2eeafc37bc
Enable global TRT timing cache (#17865)
I am adding a new `trt_timing_cache_path` option. Internally it is
handled as `global_cache_path_` and will be set via a fall through
approach:
1. no path provided => workdir
2. `trt_engine_cache_path` provided but no `trt_timing_cache_path` =>
`trt_engine_cache_path`
3. `trt_timing_cache_path` provided => `trt_timing_cache_path` (if not
provided `trt_engine_cache_path` will still be workdir)
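
The fall-through rules above can be sketched as follows. `resolve_cache_paths` is a hypothetical helper written for illustration, not the EP's actual implementation.

```python
import os


def resolve_cache_paths(engine_cache_path=None, timing_cache_path=None, workdir=None):
    """Return (engine_cache_dir, timing_cache_dir) following the three rules above."""
    workdir = workdir or os.getcwd()
    # The engine cache falls back to the working directory when not provided.
    engine_dir = engine_cache_path or workdir
    # The timing cache prefers its own path, then the engine cache path, then workdir.
    timing_dir = timing_cache_path or engine_cache_path or workdir
    return engine_dir, timing_dir


print(resolve_cache_paths(workdir="/work"))
print(resolve_cache_paths(engine_cache_path="/engines", workdir="/work"))
print(resolve_cache_paths(timing_cache_path="/timing", workdir="/work"))
```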

### Motivation and Context

A TRT timing cache can be reused across multiple models as it only holds
kernel timings and it is common that network "patterns" are reused. This
can accelerate build times a lot.

---------

Co-authored-by: Carson M <carson@pyke.io>
2023-10-27 09:23:19 -07:00
Patrice Vignola
538e97cbda
[DML EP] Add dynamic graph compilation (#17876)
Historically, DML was only able to fuse partitions when all sizes are
known in advance or when we were overriding them at session creation
time. But in practice, it should be possible to compile partitions at
compute time if the caller knows that the dimensions won't be changed
for every inference (e.g. resizing a webcam window, or padding the input
to powers of 2). This graph will be cached and reused until the sizes
change.
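
The caching behaviour described here can be sketched like this. `DynamicGraphCache` and `compile_fn` are hypothetical names; the real EP compiles DML graphs in C++, so this only illustrates the "recompile when the sizes change" policy.

```python
class DynamicGraphCache:
    """Keep the last compiled graph and recompile only when input shapes change."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # callable: tuple of shapes -> compiled graph
        self._shapes = None
        self._graph = None
        self.compile_count = 0

    def get(self, input_shapes):
        key = tuple(tuple(shape) for shape in input_shapes)
        if key != self._shapes:
            # Sizes changed (or first call): compile at compute time and cache.
            self._graph = self._compile_fn(key)
            self._shapes = key
            self.compile_count += 1
        return self._graph


cache = DynamicGraphCache(lambda shapes: f"graph{shapes}")
cache.get([(1, 3, 224, 224)])
cache.get([(1, 3, 224, 224)])   # same sizes: reuses the cached graph
cache.get([(1, 3, 256, 256)])   # sizes changed: triggers a recompile
print(cache.compile_count)
```

This only pays off when shapes are stable across many inferences, which is why the feature is opt-in.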

This is an opt-in option gated under the `enable_dynamic_graph_fusion`
option, which means that it will only be enabled when the caller
requests it since they have more context on how their model will be
called between inferences.

This PR also adds the option to disable metacommands from the python
API, which is an option for the C API but was lacking for python.
2023-10-25 19:56:16 -07:00
liqun Fu
efa0cc2562
implement isinf20 and isnan20 (#17874)
2023-10-24 10:58:54 -07:00
Dmitri Smirnov
2c50b75a26
Functions Ahead Of Time inlining (#17764)
### Description
Inline functions in an EP aware fashion. 

The result of this PR is that models that have been inlined by the
ONNX inliner and optimized, and models that have been AOT inlined, appear
to be visually identical.

For tests I used two models. The only difference is the resulting size
because ONNX inliner removes local function definitions and AOT does
not. Difference in sizes for `HF Mobile` model was 2.5 MB, and for `HF
Bart` it was ~500K. It seems that the resulting model size affects the
load time more than the actual optimizations.

In general, the inlined models grow in size very fast and can easily
exceed the 2 GB limit.

Q. Should we make AOT optional?

`If` constant folding and the removal of local inlined models will be
coming in other PRs.

Some stats:

![image](https://github.com/microsoft/onnxruntime/assets/11303988/fcb4c815-2e06-4574-8d96-5a0a727d1ecf)
2023-10-23 17:42:20 -07:00
RandySheriffH
009cd4ea2e
Allow cuda custom ops allocate deferred cpu mem (#17893)
Expose a new allocator from cuda stream.
The allocator manages deferred CPU memory, which only gets recycled before
stream destruction.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-10-20 16:12:21 -07:00
Maximilian Müller
7c17e33c07
Make CUDA a NHWC EP (#17200)
### Description

CUDA inference speed heavily relies on Tensor Cores. To have tensor
cores achieve the optimal throughput they require the data layout to be
NHWC rather than NCHW.

### Motivation and Context


Especially for convolutional networks this is very important. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn

class Net1(nn.Module):

    def __init__(self):
        super(Net1, self).__init__()
        # 8 input channels, 32 output channels, 5x5 square convolution kernel
        # for the first layer; the remaining layers use 3x3 kernels
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])
    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"

    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)
```

I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using nsys and NVTX ranges.
The current master launches the kernels below:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b)

With the newly introduced `-l` flag, we see the kernels below:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008)

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.
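To make the layout change concrete, the sketch below works on shapes only. The permutations are the standard NCHW/NHWC axis orders; the count of per-op conversions is an assumption for illustration (each of the five convolutions wrapped in its own pair), not a measurement from the profiles above.

```python
# Axis permutations between the two layouts (N=batch, C=channels, H, W).
NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)


def permute(shape, perm):
    """Return the shape obtained by reordering axes according to perm."""
    return tuple(shape[axis] for axis in perm)


nchw_shape = (8, 8, 512, 512)                   # dummy_input shape from the script above
nhwc_shape = permute(nchw_shape, NCHW_TO_NHWC)  # (8, 512, 512, 8)

# The two permutations are inverses, so one transpose at the start and one at
# the end of the network restores the original layout.
assert permute(nhwc_shape, NHWC_TO_NCHW) == nchw_shape

# Assumption for illustration: without the layout optimizer, each of the five
# convolutions is wrapped in its own pair of layout conversions.
num_convs = 5
transposes_per_op = 2 * num_convs  # 10 conversions
transposes_whole_graph = 2         # one at the input, one at the output
print(nhwc_shape, transposes_per_op, transposes_whole_graph)
```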

Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |
|:------------------------|----------------:|-------------------:|--------:|
|                         | -e cuda -t 5 -q | -e cuda -t 5 -q -l |         |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53
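
The speedup column and the average can be reproduced from the timings in the table. This is plain arithmetic on the reported numbers, with speedup taken as NCHW time divided by NHWC time; the variable names are only illustrative.

```python
# (NCHW ms, NHWC ms) per model, copied from the table above.
timings_ms = {
    "resnet101-v2-7_bs8_fp16": (18.33, 13.07),
    "resnet101-v2-7_bs8": (21.8, 12.06),
    "test": (102.07, 73.62),
}

# Speedup is the ratio of the NCHW latency to the NHWC latency.
speedups = {name: nchw / nhwc for name, (nchw, nhwc) in timings_ms.items()}
average = sum(speedups.values()) / len(speedups)

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
print(f"average: {average:.2f}x")
```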

## Outlook

Next, the plan is to first write a templated unit test to check
for correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we do not support all ops
in the NHWC domain.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-10-16 10:16:37 -07:00
RandySheriffH
c6c3555d0e
Custom op shape inference API (#17737)
Add C/C++ API to allow custom ops to do shape inference.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-10-13 12:57:42 -07:00
Zhang Lei
762703e037
Support output cross qk, dtw and more for whisper model (#17500)
Support cross qk in beam search for the whisper model, plus related features.
Make the whisper export tools support cross qk and some related features:
* extra_decoding_ids
* no_speech_prob

Implement DTW kernel and unfold tensor kernel, with unit tests. Several fixes
related to multiple sessions running in parallel, like:

* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
2023-10-13 11:47:15 -07:00
Numfor Tiapo
b8f373b0ae
Add API for NPU Device Selection in the DML EP (#17612)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-11 14:53:00 -07:00
Hector Li
385fab5bae
[QNN EP] Qnn cache improvement (#17757)
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a QNN context binary file with metadata as a header, we
dump an ONNX format file with metadata inside an ONNX node.

### Motivation and Context
Reduce the memory overhead and initialization time overhead.
2023-10-06 15:56:33 -07:00
Chi Lo
569876fb16
[TensorRT EP] Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field (#17617)
Two major modifications of this PR:

1. Refactor OrtTensorRTProviderOptions initialization and make it easy
to add new field.
2. Make the Python API capable of using TensorRT plugins by adding a new
Python binding API, `register_tensorrt_plugins_as_custom_ops`. (It needs
to register the EP's custom op domain before model load. The C++ API is
slightly different: calling
SessionOptionsAppendExecutionProvider_TensorRT_XX appends the custom op
domain to the session options, and ORT can later register the custom op
domain from the session options before model loading.)
2023-10-06 14:12:20 -07:00
Adrian Lizarraga
8e6019af2e
[QNN EP] Enable QNN Saver for debugging issues (#17747)
### Description
- Enables option to use the QNN Saver backend for dumping QNN API calls
to file.
- Adds logic to read environment variable
`ORT_UNIT_TEST_ENABLE_QNN_SAVER` from QNN EP unit tests. If enabled,
unit tests will use the QNN Saver backend and dump files to
`./saver_output/`.


### Motivation and Context
QNN Saver makes it easier to debug issues when unit tests fail. The
output files generated by QNN Saver can be used to replay the exact QNN
API calls that lead to a specific error condition.

QNN Saver dumps QNN API calls (and weights) to disk.
- saver_output/saver_output.c: C file containing all QNN API calls.
- saver_output/params.bin: binary file containing all
input/output/parameter tensor data provided during tensor creation, op
config validation, and graph execution.

Enabling the QNN Saver backend has 2 note-worthy effects:
  1. All QNN API calls will succeed.
  2. Inference output returns dummy data.
 
Because the output files from QNN Saver are always overwritten, it is
recommended to run individual unit tests via the `--gtest_filter`
command-line option.

Example (linux):
```shell
$ ORT_UNIT_TEST_ENABLE_QNN_SAVER=1 ./onnxruntime_test_all --gtest_filter=QnnHTPBackendTests.Resize_DownSample_Linear_AlignCorners
```
2023-10-03 16:24:33 -07:00
Pranav Sharma
668c70ee11
Add support for specifying a custom logging function per session. (#17727)
### Description
Add support for specifying a custom logging function per session.
Bindings for other languages will be added after this PR is merged.

### Motivation and Context
Users want a way to override the logging provided by the environment.
2023-09-29 19:46:55 -07:00
Scott McKay
33295ed883
Handle string initializers in constant folding (#17422)
### Description
* Allow either an allocator or a MemBuffer to be used when creating an
OrtValue from a TensorProto
* `Tensor<std::string>` requires an allocator to allocate/free the
string values
* Forcing the buffer to be allocated outside of the Tensor doesn't seem
to provide any benefit in this usage as the Tensor class disables copy
and assignment (so we wouldn't create 2 copies of the buffer via the
Tensor class that externally managing the buffer would avoid)
* New approach means we don't need to manage the buffers in the
optimizer Info class as the Tensor dtor will do that
* Update naming - MLValue was replaced by OrtValue a long time ago

### Motivation and Context
#17392
2023-09-27 21:15:58 +10:00
RandySheriffH
37dcefb5b7
Patch lite custom op API (#17605)
A few enhancements:
- Support compute returning status;
- Support variadic;

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-09-26 14:02:18 -07:00
Vincent Wang
e6301eee6a
Bump Up Version to 1.17.0 (#17587)
Bump up version to 1.17.0 as the 1.16.0 release branch had been branched
out.
2023-09-20 11:02:58 +08:00
Dmitri Smirnov
fdb132643d
Remove redundant Resolve() after each inlined function (#17556)
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.

### Motivation and Context
Poor performance for inlining the model and session initialization.

Original model before Resolve() removal
FunctionTest.Profiling (**65953 ms**)
After Resolve() Removal
FunctionTest.Profiling (**2911 ms**)

RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers
Non-inlined model consists of functions and Level1 optimizers have no
effect.
FunctionTest.Profiling (**9851 ms**)
2023-09-15 12:13:37 -07:00
cao lei
32f5658abb
remove gsl to make status.h independent from gsl (#17402)
### Description
Make status.h independent from gsl.


### Motivation and Context
In the coming new feature external EP API (see the prototype
https://github.com/microsoft/onnxruntime/pull/16718), we need to expose
stream in the public header, however, stream is dependent on status.h
which is dependent on gsl. We are seeking a way to decouple stream from
gsl.

From Changming's comment offline, prefast is disabled so the
GSL_SUPPRESS annotations currently have no effect. He will handle the
warnings when prefast is enabled in the future.
2023-09-13 21:47:43 -07:00
Yulong Wang
550293d9ad
OrtMemoryInfo: support new name "WebGPU_Buffer" (#17469)
### Description
Add new name "WebGPU_Buffer" to OrtMemoryInfo.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
#17465
#17469 (this one)
2023-09-08 16:37:35 -07:00
Xavier Dupré
024f1dd72b
Fix float 8 rounding on CPU (#16940)
### Description
Fix float 8 rounding issues discovered in issue #16938 (only CPU
provider).
2023-09-07 20:48:25 +02:00
RandySheriffH
6c39641ea2
Fix a memleak in RunAsync python (#17326)
Release ort value outputs that are created and released from
ort::run(...).

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-30 12:54:17 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
flatbuffers@v23.5.9 broke the forward declaration of
FlatBufferBuilder. Compiling onnxruntime fails with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and replaces them with includes.
2023-08-29 10:28:26 -07:00
pengwa
18d5cfdb85
Fix build - redefinition of default argument for ‘long unsigned int Extent’ (#17281)
### Fix build - redefinition of default argument for ‘long unsigned int
Extent’

In one training customer's environment, building ORT fails with the build
error below. The GCC versions are:

```
aiscuser@node-0:/tmp/onnxruntime$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0


aiscuser@node-0:/tmp/onnxruntime$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0


```

On our dev node, using the same GCC/G++, we don't see the build issue. We are
not sure what the difference is, but giving an explicit type when creating
`gsl::span` fixed the problem.

```
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’
  394 | class span
      |       ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here
   46 | template <class ElementType, std::size_t Extent = dynamic_extent>
      |                                                   ^~~~~~~~~~~~~~~
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete
   82 | [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) {
      |                                                                                             ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void*, size_t)’:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed:
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte*, size_t&)’
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’
  740 | span(Type (&)[Extent]) -> span<Type, Extent>;
      | ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note:   template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note:   mismatched types ‘Type [Extent]’ and ‘const std::byte*’
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’
  743 | span(std::array<Type, Size>&) -> span<Type, Size>;
      | ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note:   template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note:   mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte*’
   83 |   return gsl::span(reinterpret_cast<const std::byte*>(data), length);
      |                                                                    ^
```



### Motivation and Context
2023-08-25 00:40:40 +08:00
Scott McKay
b3cb775cf9
Two fixes involving minimal builds (#17000)
### Description
- The allocation planner was breaking if the graph had no nodes.
- In this particular model, a branch of an If node returned an outer
scope value directly.

- If the model used non-tensor types and sparse tensors are disabled, the
call to IsSparseTensor caused an exception which prematurely terminated
the code.
- It's perfectly fine to check if a value is a sparse tensor when
support for them is disabled. We just can't do anything with that
OrtValue, which is what the current ifdefs after the call to
IsSparseTensor handle.




### Motivation and Context
Fix model execution failure for partner with model that uses sequences
in a minimal build with sparse tensors disabled.
2023-08-23 16:01:22 +10:00
Edward Chen
ae62d752d6
Prevent GSL_SUPPRESS arguments from being modified by clang-format (#17242)
Prevent `GSL_SUPPRESS` arguments from being modified by clang-format and update existing usages.

clang-format was changing something like `GSL_SUPPRESS(r.11)` to `GSL_SUPPRESS(r .11)`.

For some compilers (e.g., clang), the `gsl::suppress` attribute takes a quoted string argument. We don't want to insert spaces there.
2023-08-22 18:26:53 -07:00
Edward Chen
d6cd41cfc1
[CoreML EP] Add Shape, Gather, and Slice ops (#17153)
Add CoreML EP shape related ops:
- Shape
- Gather
- Slice

Add support for int64/int32 inputs in CoreML EP.
2023-08-18 22:34:34 -07:00
Dmitri Smirnov
5c54b64a63
Create NodeArgs for all Constant nodes and initializers for functions being inlined (#17089)
### Description
When functions are inlined and constant nodes are being converted to
initializers, we need to create NodeArg for them.
The same applies to the inlined function subgraph, but we choose to give
priority to non-constant nodes and then fill the gaps with constants and
initializers.

### Motivation and Context
This addresses issue
https://github.com/microsoft/onnxruntime/issues/16813 for
`eca_halonext26ts_mod.onnx` model where it fails to remove unused
initializer because `NodeArg` was not created for it.
2023-08-17 14:22:28 -07:00
Changming Sun
5249b7ab7c
Re-implement stacktrace (#17173)
### Description
Re-implement stacktrace. The new implementation doesn't directly use
the Windows API, and so avoids problems with
initializing/uninitializing the dbghelp library.

### Motivation and Context
2023-08-16 16:07:49 -07:00
RandySheriffH
3dd2c1b4d7
EP context for custom op (#16454)
Implement infrastructure to allow EP resources to be surfaced to custom ops.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-16 13:03:40 -07:00
Yulong Wang
9cd4e5af68
[wasm] upgrade emsdk to 3.1.44 (#17069)
### Description
This change upgrades emsdk to 3.1.44.

Because the backend is upgraded to LLVM 16, we need to fix a lot of build
failures caused by "-Wshorten-64-to-32".

Most of the build failures come from the generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignores shorten-64-to-32 warnings.
2023-08-10 16:08:36 -07:00
Chi Lo
7361c283c7
Add API for updating CUDA EP provider option user compute stream (#17037)
Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update
CUDA EP provider options whose data type is a pointer that can't be
represented by a string.

Note: Please see some comments on the similar
[PR](https://github.com/microsoft/onnxruntime/pull/16965) for the TRT EP.
2023-08-09 09:24:19 -07:00