This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding informaiton. I consider it as the
fundamentaion of ONNX distribtued inference.
- Building blocks for distribtued kernels implementation especially
stateless implementation for communication ops.
- Implementation of DistributedMatMul and its tests.
Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.
Example of specifying sharding information
```python
@onnxscript.script()
def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
# Run MatMul by sharding x along column axis and w along row axis on
# 2 GPUs.
return MICROSOFT_OPSET.DistributedMatMul(
tensor_x,
tensor_w,
device_mesh_shape=[2],
device_mesh_elements=[0, 1],
input_shard_specs=["RS[0]", "S[0]R"],
output_shard_specs=["RR"],
)
onnx_model = matmul_rs_sr_rr.to_model_proto(
input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
output_types=[FLOAT[2, 2]],
)
```
In this example, the device mesh can be visualized as 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.
C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
public:
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// Device mesh is a tensor of device indices.
// A tensor can then be partitioned along specific mesh axes.
//
// Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
// Let's consider some examples.
// 1. 1D device mesh [0, 1, 2, 3]. In this case,
// device_mesh_shape is [4] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor along its axis 1, the
// corresponding sharding spec is a string "RS[0]".
// 2. 2D device mesh [[0, 1], [2, 3]]. In this case,
// device_mesh_shape is [2, 2] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor's
// rows along mesh axis 1 and
// columns along mesh axis 0, the
// corresponding sharding spec is a string "S[1]S[0]".
// If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
// GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding
// proccess.
// - Start with a 2-D device mesh [[0, 1], [2, 3]] and
// a 2-D tensor [[5, 6], [7, 8]]
// - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
// - Split GPU mesh along axis 1 and tensor along
// axis 0 for "S[1]" in "S[1]S[0]"
// - GPU: [[0], [2]], Tensor: [[5, 6]]
// GPU: [[1], [3]], Tensor: [[7, 8]]
// - Split GPU mesh along axis 0 and tensor along
// axis 1 for "S[0]" in "S[1]S[0]"
// - GPU: [[0]], Tensor: [[5]]
// - GPU: [[2]], Tensor: [[6]]
// - GPU: [[1]], Tensor: [[7]]
// - GPU: [[3]], Tensor: [[8]]
// Actual shape of device mesh represented by `device_mesh_elements`.
std::vector<int64_t> device_mesh_shape;
// Flattened device mesh.
std::vector<int64_t> device_mesh_elements;
};
class AxisPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// This class is the in-memory representation of
// 1. if a tensor is sharded or not (aka replica), and
// 2. which tensor axis is shard by which device mesh axis.
// Let's consider sharding 2-D tensor along column axis on
// device mesh [0, 1] as an example.
// The required sharding spec RS[0] can be represented by
// - AxisPartitionSpec(Condition::Replica, -1)
// - AxisPartitionSpec(Condition::Shard, 0)
public:
// Status of a tensor axis.
// A tensor axis can be either sharded or replicated
// along a device mesh axis.
enum class Condition { Replica,
Shard };
// This field tells if a tensor axis is sharded or not.
Condition cond;
// If a tensor axis is sharded, this field tells which device
// mesh axis to distribute the shards along.
// If a tensor axis is not sharded, this field is ignored.
int device_mesh_axis;
// A helper to construct a replica spec for a tensor axis.
static AxisPartitionSpec CreateReplica() {
return AxisPartitionSpec(Condition::Replica, -1);
}
// A helper to construct a sharding spec for a tensor axis.
// This tensor axis is sharded along `device_mesh_axis` in device mesh.
static AxisPartitionSpec CreateShard(int device_mesh_axis) {
return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
}
};
class TensorPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// TensorPartitionSpec holds a collection of AxisPartitionSpec and an
// associated DeviceMesh. It is responsible for determining how a tensor
// should be partitioned across a device mesh.
//
// Example 1: RS[0]
// In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
// - The first object is a Replica, denoting that the first axis of the tensor is
// not sharded but is instead replicated.
// - The second object is a Shard along the 0-th axis of the device mesh. It denotes
// that the second axis of the tensor is sharded along the first axis of the
// device mesh.
//
// Example 2: S[0]RR
// In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
// - The first object is a Shard along the 0-th axis of the device mesh, indicating
// that the first axis of the tensor is sharded along the first axis of the
// device mesh.
// - The second and third objects are Replicas, indicating that the second and third
// axes of the tensor are not sharded but are instead replicated.
public:
// axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
// axis_specs[0] is for row axis and axis_specs[1] is for
// column axis. axis_specs[i].device_mesh_axis = j means that
// tensor axis i is sharded along device mesh axis j.
std::vector<AxisPartitionSpec> axis_specs;
// device_mesh: DeviceMesh for sharding the associated tensor.
// Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
DeviceMesh device_mesh;
};
```
### Description
<!-- Describe your changes. -->
Adds the option to set max_intermediate_outputs for quantization with
the MinMaxCalibrater via. extra_options following the structure of
existing flags.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
When running quantization with the MinMaxCalibrater with larger
datasets, one quickly runs out of memory since it tries to load the full
dataset. Since merging and clearing of the intermediate_outputs is
already implemented within the Calibrater this simply adds an optional
flag to make use of these functions during quantization.
Add CUDA EP to the demo of stable diffusion.
### A100 Performance
Test | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency
(ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms)
-- | -- | -- | -- | -- | -- | --
SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 |
N/A
SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, Optimized for batch
size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222
SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch
size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971
SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch
size 1 and image size 512x512 | 4 |9091 | OOM | 6785 | 6700
We can see that ORT_CUDA is the most robust one for handling dynamic
input shape. PyTorch could be a good choice if you run large batch size.
The above result is from one A100-SXM4-80GB GPU (in
Standard_ND96amsr_A100_v4 Azure VM) with 50 steps to generate 512x512 or
768x768 images using StableDiffusion 1.5. Onnxruntime-gpu is built from
source, and the following packages or libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13
For static input shape, the engine is built with static batch size and
static image shape, and cuda graph is enabled.
For dynamic input shape, the engine is built to support dynamic batch
size and dynamic image shape, and cuda graph is disabled. The TensorRT
engine is built for batch size 1~4, image size 256x256 ~ 1024x1024, and
the optimized image size is 512x512.
The script to test static and dynamic input shape are like the
following:
```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
python demo_txt2img.py --engine $e "$prompt"
python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```
Performance of PyTorch is from commands like the following:
```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```
The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.
Accelerate StableDiffusion XL with TensorRT EP. It is modified from
TensorRT demo diffusion, and we updated the design to make the pipeline
works with different backend engines.
The following result is from A100 80GB with 30 steps of Base, or 30
steps Base & 30 Steps Refiner to generate 1024x1024 images. The engine
is built with static input shape, and cuda graph is enabled.
| Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | Diff
-- | -- | -- | -- | --
Base | 1 | 2714 | 2679 | -1.3%
Base & Refiner | 1 | 3593 | 3530 | -1.8%
The test environment: onnxruntime-gpu is built from source, and the following packages or
libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13
### Description
- Enables option to use the QNN Saver backend for dumping QNN API calls
to file.
- Adds logic to read environment variable
`ORT_UNIT_TEST_ENABLE_QNN_SAVER` from QNN EP unit tests. If enabled,
unit tests will use the QNN Saver backend and dump files to
`./saver_output/`.
### Motivation and Context
QNN Saver makes it easier to debug issues when unit tests fail. The
output files generated by QNN Saver can be used to replay the exact QNN
API calls that lead to a specific error condition.
QNN Saver dumps QNN API calls (and weights) to disk.
- saver_output/saver_output.c: C file containing all QNN API calls.
- saver_output/params.bin: binary file containing all
input/output/parameter tensor data provided during tensor creation, op
config validation, and graph execution.
Enabling the QNN Saver backend has 2 note-worthy effects:
1. All QNN API calls will succeed.
2. Inference output returns dummy data.
Because the output files from QNN Saver are always overwritten, it is
recommended to run individual unit tests via the `--gtest_filter`
command-line option.
Example (linux):
```shell
$ ORT_UNIT_TEST_ENABLE_QNN_SAVER=1 ./onnxruntime_test_all --gtest_filter=QnnHTPBackendTests.Resize_DownSample_Linear_AlignCorners
```
Supported type: float. int32_t, uint32_t, bool.
Case where_broadcast.jsonc is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Two contrib kernels that supposed to speed-up StableDiffusion according
to this doc
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md
However, there is no noticable effect in speed or memory consumption. So
i guess the only way to make it faster is to implement
MultiHeadAttention but i'm not capable of doing that right now. So i'll
focus on existing PRs and finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (i suspect Conv or
ConvTranspose), as SD was not generating images correctly on webgpu
since i started working on it. I hoped someone else would fix that by
the time i finish with kernels/optimizations 😅
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Add DML EP to the acceptable provider list in the optimizer.
### Motivation and Context
With DML EP, graph optimization was not performed in onnxruntime.
### Description
Allow WebGPU backend to specify `preferredLayout`. Default is NHWC.
```js
const options = {executionProviders: [{name:'webgpu', preferredLayout: 'NCHW'}]};
sess1 = await ort.InferenceSession.create('./mobilenetv2-12.onnx', options);
```
### Motivation and Context
- implement @qjia7's requirement for an easier way to do performance
comparison between NCHW vs NHWC.
- It's possible that NCHW does better on some models and NHWC on others.
So offer user the capability to switch.
### Description
<!-- Describe your changes. -->
WebNN only supports 2-D input tensor along axis 1. For now, we use
Reshape and Transpose wraparound to get the compatible input.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable more models to run on WebNN.
### Description
<!-- Describe your changes. -->
Update E2E test to also check InferenceSession.create with bytes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add tests to validate #17739
The patch also introduces the method which copies
data from GPU to CPU synchronously.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Another three ops for fp16
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Add support for specifying a custom logging function per session.
Bindings for other languages will be added after this PR is merged.
### Motivation and Context
Users want a way to override the logging provided by the environment.
### Description
Following the design document:
* Added CreateTrainingSessionHandler to the Backend interface
* All existing Backend implementations throw an error for the new method
createTrainingSessionHandler
* Created TrainingSession namespace, interface, and
TrainingSessionFactory interface
* Created TrainingSessionImpl class implementation
As methods are implemented, the TrainingSession interface will be added
to or modified.
### Motivation and Context
Adding the public-facing interfaces to the onnxruntime-common package is
one of the first steps to support ORT training for web bindings.
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com>
### Description
<!-- Describe your changes. -->
Use `.buffer` of Uint8Array to get ArrayBuffer.
TODO: Add E2E React Native test case to cover JS level testing to avoid
future breakage.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17732
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Python API to check whether collective ops are available or not
### Description
<!-- Describe your changes. -->
Adding an API to check whether collective ops are available or not.
Since there is no independent MPI enabled build, this flag can be used
on Python front for branching. Specifically, to conditionally enable
tests.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Flag to be used in Python to check whether onnxruntime supports
collective ops or not. Handy for conditionally enabling/disabling tests
and for other branching decisions.
### Description
Google test can be built either with absl/re2 or not. This PR enables
the build option so that google test framework can print out a nice
stacktrace when something went wrong. It helps locate test errors in CI
build pipelines.
Also, Google test will remove the build option and make it always ON. So
sooner or later we must make this change.
<del>
**This PR is based on a few prerequisites PRs. They are listed as
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484
Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.
</del>
### Description
This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created from GPU as model
input/output so that a model inferencing can be done without unnecessary
data copy between CPU and GPU for model input/output.
### Examples
An E2E demo/example is being worked on.
Following is some simple demo with code snippet.
Let's first check today how we do:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });
// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};
// STEP.3 - run model
const myResults = await mySession.run(feeds);
// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array
```
#### for inputs (GPU tensor):
Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```
### for outputs (pre-allocated GPU tensor)
you can also do that for output, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};
// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```
### for outputs (specify location)
if you do not know the output shape, you can specify the output location
when creating the session:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: "gpu-buffer"
});
```
if the model has multiple outputs, you can specify them seperately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: {
"output_image:0": "gpu-buffer"
}
});
```
now you don't need to prepare the `fetches` object and onnxruntime-web
will prepare output data on the location that specified.
#### read data
when you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer
// get the CPU data asynchronizely
const cpuData = await myOutputTensor.getData();
// get the CPU data asynchronizely and release the underlying GPU resources
const cpuData = await myOutputTensor.getData(true);
// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```
#### resource management
JavaScript has GC so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffer that used in tensors
- Underlying ORT native resources
To simplify, most of the unmanaged resources and handled inside ORT web.
But there are a few resources that need users to manage:
- All external GPU resources, including GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, user need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) will invalidate the whole wasm memory and is not
recoverable. An exception is thrown in this situation.
### Description
Add ConvTranspose implementation using MatMul to increase perf.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
- Treat Resize as layout sensitive by default
- whilst the ONNX spec does not specify a layout, EPs tend to implement
only one
- add second usage in L2 of TransposeOptimizer to plugin the ability to
push a Transpose through a Resize assigned to the CPU EP
- Allow EP specific logic for changes the ops considered to be layout
sensitive to be plugged in
- expected usage is for #17200
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Finish simplifying/clarifying transpose optimization and layout
transformation that was proposed in #15552. This PR along with #17618
should complete the changes.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
The condition check is not correct
```
if (is_unidirectional_ && enable_fused_causal_attention_) { // GPT
}
else { // BERT
}
```
Change it to
```
if (is_unidirectional_) { // GPT
}
else { // BERT
}
```
Another walkaround is to enable fused causal attention by adding an
environment variable `ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1` before
running stable diffusion.
### Motivation and Context
Without the fix, optimized CLIP model of stable diffusion will encounter
error in running Attention node:
2023-09-24 16:15:31.206037898 [E:onnxruntime:,
sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned
while running Attention node. Name:'Attention_0' Status Message:
/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu:207
bool
onnxruntime::contrib::cuda::FusedMHARunnerFP16v2::mhaImpl::is_flash_attention(int)
const interface->mHasCausalMask == false was false.
Note that the bug has been there for a long time. It is just surfaced
since we recently added a fusion for CLIP, which will trigger the error.
We will add a comprehensive test for causal attention later to avoid
such corner cases.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add Gelu/QuickGelu/GeluGrad/QuickGeluGrad support to Triton Codegen so
that it can be fused with some other connected supported Ops. For
example, in llama2, it can be fused with Mul so we will have extra 1-2%
perf gain.
### Description
<!-- Describe your changes. -->
Add ability for transpose optimizer to look past a DQ node if it has a
constant initializer as input. This allows UnsqueezeInput/TransposeInput
to modify the initializer in-place in the same way it would for a
non-QDQ format model.
Shared initializers are also handled, and any additional
Squeeze/Transpose added to the other usages of the initializer should
cancel out when we push the same Transpose though them.
The in-place modification means we don't need to run QDQ fixup and
constant folding after layout transformation. This means we do not need
to enable those optimizers in a minimal build to get an optimal model
post-layout transformation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Ensure layout transformation produces optimal model in full and minimal
builds.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
* Allow either an allocator or a MemBuffer to be used when creating an
OrtValue from an TensorProto
* `Tensor<std::string>` requires an allocator to allocate/free the
string values
* Forcing the buffer to be allocated outside of the Tensor doesn't seem
to provide any benefit in this usage as the Tensor class disables copy
and assignment (so we wouldn't create 2 copies of the buffer via the
Tensor class that externally managing the would buffer avoid)
* New approach means we don't need to manage the buffers in the
optimizer Info class as the Tensor dtor will do that
* Update naming - MLValue was replaced by OrtValue a long time ago
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17392
### Description
<!-- Describe your changes. -->
Fix for this issue which raise the error of FileNotAccessd in windows
when the context of TemporaryDirectory finished.
https://github.com/microsoft/onnxruntime/issues/17627
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/17627
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released
### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Updated a couple of old links in the technical documentation that where
pointing to files present prior to the migration to
https://onnxruntime.ai/docs.
### Description
<!-- Describe your changes. -->
For now, WebNN Softmax only support 2D (or implicitly coerce to 2D)
inputs and the last axis.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fallback some cases to pass the CI.
### Description
Adds prepacked weights container to model subgraphs.
### Motivation and Context
Allows for initializer sharing when the initializers are located in
subgraphs. I encountered this bug when attempting to share weights
between T5 BeamSearch models where the shareable initializers are
located in the encoder and decoder subgraphs and it failed to reduce
memory usage.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->