This also set the Path variable for the downloaded libraries.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR
limitation: __CUDA_ARCH__ >= 700
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
add CI steps to log info for test failure investigating.
Currently Web CI is marked as 'optional'. This change adds some script
to dump debug info for investigating the random test failure
### Description
Make all build_wasm tasks (NPM packaging and post merge)run on Linux.
Enable web gpu test in npm package pipeline too.
### Motivation and Context
Even on Windows, build_wasm is running in cygwin.
So, it could save a lot of time to run it on Linux.
### Description
<!-- Describe your changes. -->
Set DML package name correctly so the build doesn't try and include mobile targets.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix packaging pipeline.
### Description
<!-- Describe your changes. -->
Fix bad delegates.
Add script to detect mismatch, and run in CI and when creating nuget
package.
Ignore whitespace when looking at the diff to the .cs file as
clang-format ran.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#18363
### Description
Only one of "--cuda_version" and "--cuda_home" is needed. If they were
both specified, the first one will take precedence. Since we download
cuda SDKs on-the-fly now, the machines will not need to have a
preinstalled CUDA SDK therefore will not have VS-CUDA integration
extension. Therefore the "--cuda_version" flag will not work. This PR
deletes such usages.
Related PR: #15915
### Description
This PR fixes the TypeScript type check.
Previously, when I use esbuild to replace webpack (#17745), typescript
typecheck was disabled. This causes a few TypeScript type error checked
in into the code base. This PR fixes the followings:
- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true as default. in web override it to
false because multiple code need to be updated ( a following-up PR will
do this )
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195 .
2. Revert eigen's commit id back to what we had before.
### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
### Description
Add the pool definition in 2 stages even the pool is Microsoft-Hosted
Pool.
### Motivation and Context
Recently, in Nuget pipeline, when we click the Stages to Run

It always pops up
```
Encountered error(s) while parsing pipeline YAML:
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
```
1. Now we use a released version of ONNX, so we can directly download a
prebuilt package from pypi.org. We do not need to build one from source.
2. Update protobuf python package's version to match the C/C++ version
we are using.
3. Update tensorboard python python because the current one is
incompatible with the newer protobuf version.
### Description
Add CI changes for #18287
Install onnx explicitly to pass windows GPU+dml stage.
### Motivation and Context
'eigen-3.4' was refering to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
### Description
Update the C# nuget build infrastructure to make building a test nuget
package more user friendly and to simplify
- Remove usage of dotnet and msbuild in CIs
- was temporary requirement until .net 6 MAUI was added to the released
Visual Studio
- remove SelectedTargets property and its usage
- Add property for excluding mobile targets
- generally we exclude based on the nuget package name
- can now specify `/p:IncludeMobileTargets=false` on the command line to
force exclusion
- support building test package using build.py `--build_nuget` better
- limit inclusion of xamarin targets as building with them requires a
lot more infrastructure
- use msbuild directly if xamarin targets are included. use dotnet
otherwise.
- remove quoting of property values as it doesn't appear to be necessary
and breaks when msbuild is being used
- add infrastructure to be able to pack the nuget package on linux with
`dotnet pack`
- `nuget pack` is not user friendly as-per comments in changes
- requires stub csproj to provide the nuspec path
- Remove netstandard1.0 targets from nuspec
- we removed support from the actual bindings previously
- Remove usage of nuget-staging directory when creating nuget package on
linux
- the nuspec file element has a fully qualified path for a source file
so there is no obvious benefit to copying to a staging directory prior
to packing
### Motivation and Context
Address issues with 1P users trying to create test nuget packages
locally.
Long overdue cleanup of CI complexity.
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well
Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
- copied workspace allocation from XNNPACK unit test code
- some suffixes changed
Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
- not an issue currently
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed
Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.
NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
### Description
Retry 3 times at most if the web test fails.
### Motivation and Context
Web GPU tests are not stable.
From this link, we could find these ort-web tests are all in top 10
failing tasks.
https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=161&contextType=build.
Generally, it could pass by manually rerunning it.
So, enable it to rerun automatically.
These test steps duration isn't long. So, it won't take too long to
retry.
### Description
Disable ccache for DML. This change is similar to #18104. Now the DML
build job is having the same timeout issue. I don't know why. But
disabling ccache probably would help.
### Description
Update batch file to set PATH for Cuda with TRT
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This reverts commit 99b8dcaae2.
### Description
<!-- Describe your changes. -->
### Motivation and Context
Restore the dml stage in windows GPU pipeline.
Agent issue is solved by adding Feature.DisableGpuDriver in pool
properties.
### Description
Version 41.0.0 currently used has vulnerabilities.
### Motivation and Context
See [Vulnerable OpenSSL included in cryptography
wheels](https://github.com/advisories/GHSA-v8gr-m533-ghj9)
### Description
Motivation for this PR is reducing CI test time by removing unnecessary
tests from the pipelines.
Following changes are for reducing test time in pipelines:
- Skip CPU model tests in GPU builds. Training CIs run these tests as a
sanity check. There is no direct training code being tested in these
pipelines, furthermore, CPU tests are being run in CPU pipelines so no
need to run them again in GPU builds and block the GPU VM. This change
reduces testing time by 20-25 mins in all training GPU pipelines.
- Delete debug package building pipeline for linux training packages.
This was required by compiler team at some point but there have been 0
downloads of these packages.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This is a temp fix for the failing "Zip-Nuget-Java-Nodejs Packaging
Pipeline". The pipeline is failing because I removed NodeJS from the
build machine pool's image, to reduce the number of dependencies we need
to maintain in VMs.
So this PR will temporarily move the test to a different machine pool to
get the test passed. Then I will move the test to docker. Docker images
are relatively easier to update and maintain. Now we almost run all
Linux test in docker, except for this one. Moving it to docker is needed
for enabling GPU support in nodejs, because all our Linux VMs do not
have CUDA.
### Motivation and Context
### Description
This PR:
(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)
(2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent in building a few more files and running a few more tests will not
be much.
Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770
### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
### Description
**Fixes NPM Packaging pipeline.**
Training was enabled for linux-wasm-ci.yml but not enabled for
win-wasm-ci.yml.
the web CI uses linux-wasm-ci.yml
NPM packaging pipeline uses win-wasm-ci.yml
### Description
<!-- Describe your changes. -->
Android emulator usage updates:
- Change approach to detecting boot has completed
- use `-delay-adb` and a simple command (`ls`) with `wait-for-device` as
the first step
- this ensures enough startup has occurred for adb to be responsive
- use secondary loop on the python side to check for sys.boot_completed
to be set
- doing the check on the python side provides more feedback and seems to
work well
- make the 'stop' logic more precise by using psutil
- add internal timeout of 20 mins for emulator startup
- waiting for the CI jobs overall timeout is way too long
- value is hardcoded for now (most CIs startup in under 10 mins) but
could be made configurable if needed
CI updates:
- add template for using the Android emulator
- update CIs to use template
- reorder React Native CI
- minimize the time the Android emulator or iOS simulator is running by
moving some build steps around
- don't run both at the same time
- unnecessary and potentially adds significant memory pressure to the
machine
- fix QNN Android emulator CI as much as possible
- now everything works apart from running onnx_test_runner with the QNN
EP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix inconsistent detection of the emulator boot completing.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Update NDK to 26.0.10792818 which is included in every macOS build
machine so that we do not need to download a different version every
time in every build.
### Motivation and Context
Downloading NDK on-the-fly is a main contributor of Android related
build failures.
### Description
<!-- Describe your changes. -->
### Motivation and Context
Compliance check would fail randomly but the stage couldn't be rerun if
the pipeline artifacts are already published.
There's the error like `Artifact xxxx already exists`.
We had to restart the whole pipeline if there's a random error in
compliance check.
### Description
allow gpu IO binding tests to fail temporarily.
when the root cause is still in investigation, use `continueOnError:
true` to allow the test to fail without blocking PRs.
### Description
"NPM packaging pipeline" needs to download an artifact from
"Zip-Nuget-Java-Nodejs Packaging Pipeline".
It has been a long-time issue that they two pipelines often use
different commit ids.
This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a
resource, so that "NPM packaging pipeline" will always fetch from the
pipeline run that triggers this NPM pipeline.
Their official document says:
"When you define a resource trigger, if its pipeline resource is from
the same repo as the current pipeline, triggering follows the same
branch and commit on which the event is raised."
### Description
<!-- Describe your changes. -->
Include CoreML EP in python package.
I've added to the base package as CoreML comes from the OS so there are
no additional libraries to distribute.
Updated the CPU-based provider list to add the AzureEP, which is also
included in the base package, to fix some test failures. Without this
the infrastructure thinks a device copy implementation is required
between AzureEP and CoreML nodes, which is not the case as the AzureEP
is CPU based.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#16989
- Update ROCm and MIGraphX CI to ROCm5.7
- Simplify test exculde file. Some tests will output `registered
execution providers ROCMExecutionProvider were unable to run the model.`
if they cannot run.
- Add `enable_training` build argument for MIGraphX pipeline.
Python package pipeline fails due to "tokenizers" compilation. Since
"tokenizers" is a dep of "transformers", we update its version and hope
a new solution had been there.
```
error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
--> tokenizers-lib/src/models/bpe/trainer.rs:517:47
```
- we will publish the onnxruntime-training-rocm package on ADO feeds.
The onnxruntime-training package will solely be for cuda.
- Add new pipeline for onnxruntime-training-rocm ADO feeds
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only
package with latest rocm version is publish to ADO.
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a Qnn context binary file with metadata as header, we
dump a Onnx format file with metadata inside Onnx node.
### Motivation and Context
reduce the memory overhead and initialization time overhead
Two major modifications of this PR:
1. Refactor OrtTensorRTProviderOptions initialization and make it easy
to add new field.
2. Make Python API capable of using TensorRT plugins by adding new
Python binding api `register_tensorrt_plugins_as_custom_ops`. (It needs
to register ep's custom op domain before model load. For C++ API, it's
slightly different, when calling
SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends cutom op
domain to session option. Later ORT can register custom op domain from
session option before model loading)
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.
### Motivation and Context
Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slow, we remove it from the linters
### Description
<!-- Describe your changes. -->
Move the swift files to ORT SPM repo now:
https://github.com/microsoft/onnxruntime-swift-package-manager
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding informaiton. I consider it as the
fundamentaion of ONNX distribtued inference.
- Building blocks for distribtued kernels implementation especially
stateless implementation for communication ops.
- Implementation of DistributedMatMul and its tests.
Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.
Example of specifying sharding information
```python
@onnxscript.script()
def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
# Run MatMul by sharding x along column axis and w along row axis on
# 2 GPUs.
return MICROSOFT_OPSET.DistributedMatMul(
tensor_x,
tensor_w,
device_mesh_shape=[2],
device_mesh_elements=[0, 1],
input_shard_specs=["RS[0]", "S[0]R"],
output_shard_specs=["RR"],
)
onnx_model = matmul_rs_sr_rr.to_model_proto(
input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
output_types=[FLOAT[2, 2]],
)
```
In this example, the device mesh can be visualized as 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.
C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
public:
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// Device mesh is a tensor of device indices.
// A tensor can then be partitioned along specific mesh axes.
//
// Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
// Let's consider some examples.
// 1. 1D device mesh [0, 1, 2, 3]. In this case,
// device_mesh_shape is [4] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor along its axis 1, the
// corresponding sharding spec is a string "RS[0]".
// 2. 2D device mesh [[0, 1], [2, 3]]. In this case,
// device_mesh_shape is [2, 2] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor's
// rows along mesh axis 1 and
// columns along mesh axis 0, the
// corresponding sharding spec is a string "S[1]S[0]".
// If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
// GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding
// proccess.
// - Start with a 2-D device mesh [[0, 1], [2, 3]] and
// a 2-D tensor [[5, 6], [7, 8]]
// - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
// - Split GPU mesh along axis 1 and tensor along
// axis 0 for "S[1]" in "S[1]S[0]"
// - GPU: [[0], [2]], Tensor: [[5, 6]]
// GPU: [[1], [3]], Tensor: [[7, 8]]
// - Split GPU mesh along axis 0 and tensor along
// axis 1 for "S[0]" in "S[1]S[0]"
// - GPU: [[0]], Tensor: [[5]]
// - GPU: [[2]], Tensor: [[6]]
// - GPU: [[1]], Tensor: [[7]]
// - GPU: [[3]], Tensor: [[8]]
// Actual shape of device mesh represented by `device_mesh_elements`.
std::vector<int64_t> device_mesh_shape;
// Flattened device mesh.
std::vector<int64_t> device_mesh_elements;
};
class AxisPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// This class is the in-memory representation of
// 1. if a tensor is sharded or not (aka replica), and
// 2. which tensor axis is shard by which device mesh axis.
// Let's consider sharding 2-D tensor along column axis on
// device mesh [0, 1] as an example.
// The required sharding spec RS[0] can be represented by
// - AxisPartitionSpec(Condition::Replica, -1)
// - AxisPartitionSpec(Condition::Shard, 0)
public:
// Status of a tensor axis.
// A tensor axis can be either sharded or replicated
// along a device mesh axis.
enum class Condition { Replica,
Shard };
// This field tells if a tensor axis is sharded or not.
Condition cond;
// If a tensor axis is sharded, this field tells which device
// mesh axis to distribute the shards along.
// If a tensor axis is not sharded, this field is ignored.
int device_mesh_axis;
// A helper to construct a replica spec for a tensor axis.
static AxisPartitionSpec CreateReplica() {
return AxisPartitionSpec(Condition::Replica, -1);
}
// A helper to construct a sharding spec for a tensor axis.
// This tensor axis is sharded along `device_mesh_axis` in device mesh.
static AxisPartitionSpec CreateShard(int device_mesh_axis) {
return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
}
};
class TensorPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// TensorPartitionSpec holds a collection of AxisPartitionSpec and an
// associated DeviceMesh. It is responsible for determining how a tensor
// should be partitioned across a device mesh.
//
// Example 1: RS[0]
// In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
// - The first object is a Replica, denoting that the first axis of the tensor is
// not sharded but is instead replicated.
// - The second object is a Shard along the 0-th axis of the device mesh. It denotes
// that the second axis of the tensor is sharded along the first axis of the
// device mesh.
//
// Example 2: S[0]RR
// In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
// - The first object is a Shard along the 0-th axis of the device mesh, indicating
// that the first axis of the tensor is sharded along the first axis of the
// device mesh.
// - The second and third objects are Replicas, indicating that the second and third
// axes of the tensor are not sharded but are instead replicated.
public:
// axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
// axis_specs[0] is for row axis and axis_specs[1] is for
// column axis. axis_specs[i].device_mesh_axis = j means that
// tensor axis i is sharded along device mesh axis j.
std::vector<AxisPartitionSpec> axis_specs;
// device_mesh: DeviceMesh for sharding the associated tensor.
// Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
DeviceMesh device_mesh;
};
```
<del>
**This PR is based on a few prerequisites PRs. They are listed as
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484
Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.
</del>
### Description
This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created from GPU as model
input/output so that a model inferencing can be done without unnecessary
data copy between CPU and GPU for model input/output.
### Examples
An E2E demo/example is being worked on.
Following is some simple demo with code snippet.
Let's first check today how we do:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });
// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};
// STEP.3 - run model
const myResults = await mySession.run(feeds);
// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array
```
#### for inputs (GPU tensor):
Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```
### for outputs (pre-allocated GPU tensor)
you can also do that for output, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};
// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```
### for outputs (specify location)
if you do not know the output shape, you can specify the output location
when creating the session:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: "gpu-buffer"
});
```
if the model has multiple outputs, you can specify them seperately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: {
"output_image:0": "gpu-buffer"
}
});
```
now you don't need to prepare the `fetches` object and onnxruntime-web
will prepare output data on the location that specified.
#### read data
when you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer
// get the CPU data asynchronizely
const cpuData = await myOutputTensor.getData();
// get the CPU data asynchronizely and release the underlying GPU resources
const cpuData = await myOutputTensor.getData(true);
// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```
#### resource management
JavaScript has GC so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffer that used in tensors
- Underlying ORT native resources
To simplify, most of the unmanaged resources and handled inside ORT web.
But there are a few resources that need users to manage:
- All external GPU resources, including GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, user need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) will invalidate the whole wasm memory and is not
recoverable. An exception is thrown in this situation.
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released
### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
1. Upgrade nodejs from 16.x to 18.x for Windows pipelines
2. Avoid using Azure DevOps "NodeTool" on Linux. The tool installs
nodejs from internet or local disk cache. But we already moved all Linux
tests to docker. So we do not need the installer anymore.
3. Remove some other unused code.
### Description
1. Remove 'dnf update' from docker build scripts, because it upgrades TRT
packages from CUDA 11.x to CUDA 12.x.
To reproduce it, you can run the following commands in a CentOS CUDA
11.x docker image such as nvidia/cuda:11.8.0-cudnn8-devel-ubi8.
```
export v=8.6.1.6-1.cuda11.8
dnf install -y libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v} libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
dnf update -y
```
The last command will generate the following outputs:
```
========================================================================================================================
Package Architecture Version Repository Size
========================================================================================================================
Upgrading:
libnvinfer-devel x86_64 8.6.1.6-1.cuda12.0 cuda 542 M
libnvinfer-headers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 118 k
libnvinfer-headers-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 14 k
libnvinfer-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 13 M
libnvinfer-plugin8 x86_64 8.6.1.6-1.cuda12.0 cuda 13 M
libnvinfer-vc-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 107 k
libnvinfer-vc-plugin8 x86_64 8.6.1.6-1.cuda12.0 cuda 251 k
libnvinfer8 x86_64 8.6.1.6-1.cuda12.0 cuda 543 M
libnvonnxparsers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 467 k
libnvonnxparsers8 x86_64 8.6.1.6-1.cuda12.0 cuda 757 k
libnvparsers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 2.0 M
libnvparsers8 x86_64 8.6.1.6-1.cuda12.0 cuda 854 k
Installing dependencies:
cuda-toolkit-12-0-config-common noarch 12.0.146-1 cuda 7.7 k
cuda-toolkit-12-config-common noarch 12.2.140-1 cuda 7.9 k
libcublas-12-0 x86_64 12.0.2.224-1 cuda 361 M
libcublas-devel-12-0 x86_64 12.0.2.224-1 cuda 397 M
Transaction Summary
========================================================================================================================
```
As you can see from the output, they are CUDA 12 packages.
The problem can also be solved by lock the packages' versions by using
"dnf versionlock" command right after installing the CUDA/TRT packages.
However, going forward, to get the better reproducibility, I suggest
manually fix dnf package versions in the installation scripts like we do
for TRT now.
```bash
v="8.6.1.6-1.cuda11.8" &&\
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo &&\
yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v}\
libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
```
When we have a need to upgrade a package due to security alert or some
other reasons, we manually change the version string instead of relying
on "dnf update". Though this approach increases efforts, it can make our
pipeines more stable.
2. Move python test to docker
### Motivation and Context
Right now the nightly gpu package mixes using CUDA 11.x and CUDA 12.x
and the result package is totally not usable(crashes every time)
### Description
Include onnxruntime_float16.h in the package.
### Motivation and Context
This was missed in the recently released 1.16 pkgs (except Nuget).
manylinux build is used for nightly packaging generation and it's hard
to capture issue in time when related files change. This PR add
manylinux build in CI.
### Description
1. use standard win build template
2. enable compiler cache
### Motivation and Context
Make win build task easy to maintain and accelerate the pipeline.
### Description
PR 15470 updated some C/C++ dependencies. The change caused ROCM EP's
nightly build to fail. see issue
https://github.com/ROCm-Developer-Tools/HIP/issues/2082 for a
background. So, the root cause is HIP compiler has a special requirement
that HIP's include dirs must be used before the operating system's
include folder: /usr/include. HIP adds "-isystem" in front of
"/usr/include". gcc or clang will search the folders added with "-I"
first, then the "-isystem" folder. It works fine as long as we do not
add "-I/usr/include" to the compile commands for *.cu files. It would be wrong if
we already have installed an open source library to /usr and want to use the
prebuilt library from there instead of the current build dir.
### Motivation and Context
### Description
supplement of #17417
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
The name of nightly ACPT image has been updated to
`ptebic.azurecr.io/internal/aifx/acpt/nightly-ubuntu-cuda-torch-dev`
As the previous image alias had `cu118`, `torch210dev` or `py38`, any
version update will break the training nightly pipeline
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Using constant image alias to avoid pipeline failure.
### Description
Delete all Prefast tasks because the new VS 17.7 version crashes every
time when we run the task on our CI build servers. However, we cannot
reproduce it locally. And this problem blocks us installing security
patches to our CI build machines.
Will use [CodeQL](https://codeql.github.com/) instead.
### Motivation and Context
Address some security alerts.
The old provisioning profile no longer works. Switched to a temporary one that we can use before a new one is available. The temporary one has a different name.
### Description
Updates the version of QNN SDK used by CI Pipelines. Enables some tests
fixed by 2.14.1, but still need to look into Resize in a separate PR.
### Motivation and Context
Test latest version of QNN SDK.
### Description
Update the Web CI pipelines:
- remove parameter 'WebTemplate': Since we start to support webgpu, the
linux-web-ci.yml is no longer working and it is already out-of-date.
remove this file and parameter so that we always use win-web-ci.yml
- change flag `RunWebGpuTests` into 2 flags, for release and debug.
Currently for CI we only run webgpu tests on release build. But we want
to have the capability to run webgpu tests on debug build as well.
After this PR is merged, next step is to enable both Debug and Release
webgpu tests in PostMerge pipeline.
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)
Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.
Bundled into this PR are minor formatting fixes + naming fixes.
### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
### Description
The yaml file changes made in #16050 do not really work. Currently the
pipeline is failing with error:
```
Error: Not found SourceFolder: C:\a\_work\5\b\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib
```
So, I will revert the yaml changes first to bring the pipeline back.
Some people are waiting for our nightly packages.
Test run:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=351104&view=results
### Motivation and Context
### Description
install dotnet 6.0 in the docker image.
move C# build and test into docker.
### Motivation and Context
### Note
The Unit tests and Symbolic shape infer's migration will be in another
PR.
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.
2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
### Description
<!-- Describe your changes. -->
As title.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Now we have multiple data types that we want to disable for minimal
build and to reduce binary size. may be worth adding an argument in the
build script for specifying that.
Also for fp16 type stuff, it may be too restrict to disable that for all
minimal build.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
Add the compiler cache in linux GPU tensorRT CI.
Save about 30 minutes in the GPU machine. (52 minutes -> 24 minutes)
PS.
There're only white-space differences in the dockerfile.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Get the latest gcc 12 by default
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
* Integrate `trt_multi_gpu` test stage in ORT post merge CI (Win-2xA10
vm)
* Deprecate Linux MultiGPU TRT CI (This vm will be deprecated soon)
* Add multi gpu support to existing C# test cases
* Deprecate unfunctional flag `--enable_multi_device_tests`
### Motivation and Context
* Two contexts of replacing Linux MultiGPU TRT CI:
* Flag `--enable_multi_device_tests` is not functional, which cannot
detect issues like #17036
* The Linux-2xM60 VM of this CI pool is about to be deprecated 9/6/23.
Need to enable this test in other dualGPU vm pool.
### Description
Add single test step in Window GPU Reduced Ops workflow
### Motivation and Context
The old workflow's building and testing were running in one command.
In PR #17263, the test step was removed by mistake.
So, readd it.
How to consolidate the test step is in consideration.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Unify some pre-build common steps.
### Motivation and Context
In the long run, other devs should only focus on build option and test
commands.
It would reduce mistakes and maintenance cost to use common template
steps.
There will be more PRs to achieve the goal.
### Description
1. Add a CUDA 12.x pipeline
2. Improve install_third_party_deps.ps1: avoid using Start-process.
Directly call the command instead.
### Motivation and Context
Since our official packages and all CI pipelines still use CUDA 11.x, we need extra pipelines to validate our source code level compatibility with CUDA 12.x. BTW for sure the prebuilt binaries in our release page are not compatible with CUDA 12.x. Do not report bugs for that.
AB#15152
### Description
1. Fix python packaging test pipeline. There was an error in
tools/ci_build/github/linux/run_python_tests.sh that it installed a
released version of onnxruntime python package from pypi.org to run the
test. Supposedly it should pick one from the current build.
2. Refactor the pipeline to allow choosing cmake build type from the web
UI when manually trigger a build. Now this feature is for Linux only.
Because I don't want to change too much when we are about to cut a
release branch. After that I will expand it to all platforms. This
feature is useful for debugging pipeline issues, also, we may consider
having a nightly pipeline to run all tests in Debug mode which may catch
extra bugs because in debug mode we can enforce range check.
Test run:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=342674&view=results
### Motivation and Context
Currently the pipeline has a crash error.
AB#18580
### Description
Updates NuGet packaging pipelines to use the correct license name.
### Motivation and Context
The license name changed. See https://github.com/microsoft/onnxruntime/pull/17170
The QNN_Windows_Nuget and Zip-Nuget-* pipelines will not run without this update.
### Description
Move DML build job's Prefast task to a CPU machine pool which has larger
memory. The current one runs out of memory in every run.
### Motivation and Context
To fix the broken python packaging pipeline.
- 'js/web'
- 'js/node'
- 'onnxruntime/core/providers/js'
is updated
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Adds continuous integration and pull-requestion validation triggers
directly to the yaml file for the Windows x64 QNN CI Pipeline.
### Motivation and Context
There have been various unit tests failures that break the
QNN_Windows_Nuget pipeline, which builds QNN EP for Windows x64. This PR
ensures that QNN EP is built and tested on a Windows x64 image for every
pull request.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The onnxruntime-CI-nightly-ort-pipeline encounters occasional failures
due to synchronization discrepancies between the ACPT nightly image and
the repository. We are addressing this by executing tests using the
commit ID associated with the ort build within the ACPT image.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
1. Clean up cmake files. Remove some unused code
2. Remove the "Semmle" task from
tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is
deprecated and replaced by CodeQL.
### Description
- Disables Resize tests that use nearest mode on QNN CPU.
- Fixes indentation problems on yaml for win x64 qnn pipeline.
### Motivation and Context
The QNN windows Nuget pipeline does not run due to failing unit tests on
Windows x64. These tests should not be enabled until we determine the
rounding behavior of QNN's ResizeNearestNeighbor operator.
ROCm python package pipeline failed because this
PR(https://github.com/microsoft/onnxruntime/pull/16325) changed onnx
version to a commit and we need to build onnx from source. Low protobuf
version will cause build errors.
This PR remove `cmake ` and `protobuf ` from Dockerfile, these two will
install by `install_os_deps.sh`.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The correct name should be onnxruntimecpubuildpython
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Some pipelines are failing. It is because PR #16325 set ONNX version to
`rel-1.14.1` . It is a branch name, not a commit or tag name. It means
whenever the branch got a new commit, we will auto pick it and use it.
### Description
This change upgrade emsdk to 3.1.44.
Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".
most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
### Description
enable webgpu in browser unit test.
The CI pipeline uses Edge v113+ which enables WebGPU.
===
**UPDATE on 08/07/2023:**
- add flags to Edge browser launch commandline so that Edge on CI agents
can initialize WebGPU correctly.
- ONLY enable webgpu on web release build. Other pipelines are using
flag `-b=wasm,webgl,xnnpack` to specify the other 3 backends explicitly.
- disable "Resize" related test failures. Once they are fixed the tests
can be re-enabled.
---------
Co-authored-by: Satya Jandhyala <satya.k.jandhyala@gmail.com>
Add script to get iOS simulator device info so we don't need to use hardcoded specifiers which may or may not refer to a valid simulator device.
Add use-xcode-version step to a packaging pipeline so it uses a consistent version of Xcode.
### Description
1. Add valgrind to existing ep_perf CI MemTest and parse ORT-TRT memLeak
details
1. General Valgrind logs and logs related to ORT-TRT will be parsed in
[CI
artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
1. Logic:
1. Run valgrind with `onnxruntime-perf-test -e tensorrt` and export log
to `valgrind.log`
2. Identify if any `definitely lost` memleak happened
1. For log paragraphs which show `definitely lost`, parse if they have
keyword `TensorrtExecutionProvider`.
2. If so, extract these details to `ort_trt_memleak_detail.log`, and
return `build failure` to EP Perf CI
3. Fix existing addressSanitizer and sync the squeezenet testcase with
latest update from
[ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
1. Updates in short: Upgrade main.cpp to be using
OrtTensorRTProviderOptionsV2
4. Reorder the 7-min-MemTest to be ahead of 9-hr-model-tests, and enable
MemTest by default
### Description
This change allows Web CI to do some check as the first step, so that if
there are errors it won't launch the task to build web assembly, which
is heavy.
Checks includes:
- "npm ci" in /js, /js/common and /js/web. this implicitly include:
- typescript compiler in /js
- typescript compiler in /js/common
- webpack build in /js/common
- typescript compiler in /js/web
- ESLint on typescripts
- clang-format formatter (.js, .ts, .cc, .h, .mm)
- Prettier formatter (.json, .jsonc, .md)
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
1. rename OrtValue.FillStringTensorElement to StringTensorSetElementAt .
To the API user I think we're conceptually setting the string at an
offset in the tensor with is roughly equivalent to `List<string> list
... list[index] = "value"`.
2. While working on new inference examples, I noticed that I am still
inclined to use `DenseTensor` for N-D indexing. Added `GetStrides()` and
`GetIndex()` from strides for long dims, so the user can obtain strides
and translate N-D indices into a flat index to operate directly on the
native `OrtValue` buffers. Expose these functions to the user.
3. Make sure we generate docs for C# public static functions.
### Description
There are currently multiple failures that blocking the CI pipelines so
this PR has all of the fixes in order to make sure it passes the CI.
Otherwise a single fix will still fail the CI.
includes:
#16960#16958
Please help to make sure this PR get merged once CI passed.
@snnn @carzh @guschmue
Fixed:
[AB#18118](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18118)
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
- enable unit test for js/common in CI
- add debug config in js/.vscode/launch.json
- enable source map for js/common/test for debugging purposes; add
source map files to ignore list
- ignore js/common/test folder for npm packaging
### Description
1. As a follow-up of #16761, this PR allows build ORT on iOS/Android
without the need to explicitly specify a protoc path. #16761 is for
WASM. This one is for iOS/Android
2. Update the MacOS/Linux build scripts that build/install protobuf from
source. Make them be more flexible. Add the support for
RedHatEnterprise(ubi), which will needed for upgrading the base image
from centos:7 to ubi:8.
3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile :
the docker file's base image has preinstalled protobuf in /usr/local, we
should uninstall them to avoid conflicts.
### Description
The `%AGENT_TEMPDIRECTORY%\v11.8` is created in azcopy step.
So, the set env step should be after the azcopy step.
### Motivation and Context
Correct the previous logic
Unify the step since multiple jobs are using it.
### Description
<!-- Describe your changes. -->
Split stages for CPU and CPU+NNAPI builds as CodeQL is enabled at the
stage level.
We run it for CPU+NNAPI as that covers all the Android code.
We don't want to run it for both as duplicate issues would be created
for a problem in code included in both builds.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Remove VS 2019 code.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->