Commit graph

1688 commits

Author SHA1 Message Date
Jian Chen
05526b354b
Adding new yaml file for downloading cuda, and trt from azure blob (#18443)
This also sets the PATH variable for the downloaded libraries.

2023-11-14 19:47:39 -08:00
Ye Wang
f9af94009b
onboard MoE (#18279)
### Description
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR

Limitation: requires `__CUDA_ARCH__ >= 700`.


2023-11-14 16:48:51 -08:00
Changming Sun
27d068569a
Remove Node.js tool installer task from web ci pipeline (#18434)
EMSDK already includes Node.js. We will use that one to be more
consistent (the CI build pipeline will be less dependent on the VM
image).
2023-11-14 13:16:01 -08:00
Yulong Wang
d22b1af5da
[js/web] add CI steps to log info for test failure investigating (#18418)
### Description
Add CI steps to log info for investigating test failures.

Currently Web CI is marked as 'optional'. This change adds scripts
to dump debug info for investigating the random test failures.
2023-11-14 11:40:58 -08:00
Changming Sun
a09099f2dd
Remove XNNPack from web pipelines (#18419)
### Description
Remove XNNPack from web pipelines for now
2023-11-13 22:43:53 -08:00
Yi Zhang
0b16185223
build wasm with linux (#18106)
### Description
Make all build_wasm tasks (NPM packaging and post merge) run on Linux.
Enable the WebGPU tests in the npm package pipeline too.


### Motivation and Context
Even on Windows, build_wasm runs under Cygwin,
so running it on Linux could save a lot of time.
2023-11-14 14:42:11 +08:00
Scott McKay
897c1c1f05
Set DML package name correctly in CI (#18405)
### Description
Set DML package name correctly so the build doesn't try to include mobile targets.

### Motivation and Context
Fix packaging pipeline.
2023-11-14 14:01:59 +10:00
Scott McKay
8ff41aea09
Fix 4 more bad delegates missing the attribute that cause iOS AOT errors at runtime (#18390)
### Description
Fix bad delegates.
Add script to detect mismatch, and run in CI and when creating nuget
package.

Ignore whitespace when looking at the diff to the .cs file as
clang-format ran.

### Motivation and Context
#18363
2023-11-14 14:00:21 +10:00
PeixuanZuo
37d8bed53d
[ROCm] add migraphx into onnxruntime-training-rocm package (#18339) 2023-11-14 11:54:22 +08:00
PeixuanZuo
a62a500ae1
[ROCm] Update CK version (#17628)
update ck version
2023-11-13 15:43:38 -08:00
Changming Sun
c3b5479056
Remove extra CUDA version flag (#18397)
### Description
Only one of "--cuda_version" and "--cuda_home" is needed. If both are
specified, the first one takes precedence. Since we now download
CUDA SDKs on the fly, the machines no longer need a preinstalled CUDA
SDK and therefore will not have the VS-CUDA integration extension, so
the "--cuda_version" flag will not work. This PR deletes such usages.

Related PR: #15915
2023-11-13 15:11:42 -08:00
Yulong Wang
6b0c97b43f
[js/web] fix typescript type check (#18343)
### Description

This PR fixes the TypeScript type check.

Previously, when I used esbuild to replace webpack (#17745), the TypeScript
type check was disabled. This caused a few TypeScript type errors to be
checked in to the code base. This PR fixes the following:

- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true by default. The web package overrides it
to false because multiple files need to be updated (a follow-up PR will
do this)
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
2023-11-10 16:03:38 -08:00
Changming Sun
2d23b4e117
Update min macos version (#18251) 2023-11-10 11:08:17 -08:00
RandySheriffH
59262dfc63
Add cuda context headers to zip (#18330)
Expose cuda context headers for cuda custom ops.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-11-09 14:53:58 -08:00
Changming Sun
812532592e
Add a build validation for Linux ARM64 cross-compile (#18200)
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195 .
2. Revert eigen's commit id back to what we had before. 


### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
2023-11-08 13:03:18 -08:00
Yulong Wang
d117a8010f
fix typo (node)->(browser) in linux-wasm-ci.yml (#18309)
### Description
fix display name `'Build and test (node) (simd + threads)'` to `'Build
and test (browser) (simd + threads)'`
2023-11-07 17:07:40 -08:00
Yi Zhang
9868a71373
[Fix] Stages to Run couldn't be selected (#18310)
### Description
Add the pool definition to 2 stages even though the pool is a
Microsoft-hosted pool.



### Motivation and Context
Recently, in the Nuget pipeline, when we click Stages to Run

![image](https://github.com/microsoft/onnxruntime/assets/16190118/45af295e-fa75-402a-a7de-803c6a2ab7cd)
it always pops up the following error:
```
Encountered error(s) while parsing pipeline YAML:
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
```
2023-11-07 17:52:47 +08:00
Changming Sun
398ef677ba
Update protobuf python package's version (#18203)
1. Now we use a released version of ONNX, so we can directly download a
prebuilt package from pypi.org. We do not need to build one from source.
2. Update protobuf python package's version to match the C/C++ version
we are using.
3. Update the tensorboard Python package because the current one is
incompatible with the newer protobuf version.
2023-11-06 09:22:54 -08:00
Yi Zhang
b7b8b5b2ce
Fix Eigen-3.4.0 URL and hash (#18290)
### Description
Add CI changes for #18287

Install onnx explicitly to pass windows GPU+dml stage.


### Motivation and Context
'eigen-3.4' was referring to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
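
The failure mode described above can be sketched as follows. This is a minimal Python illustration, assuming a dependency is pinned as a download URL plus an expected SHA-1 of the archive; the function name and the byte strings are hypothetical stand-ins, not the repository's actual download logic.

```python
import hashlib


def archive_matches_pin(archive_bytes: bytes, expected_sha1: str) -> bool:
    """Return True if the downloaded archive hashes to the pinned SHA-1."""
    actual = hashlib.sha1(archive_bytes).hexdigest()
    return actual == expected_sha1.lower()


# Hypothetical archive contents standing in for two snapshots of a branch.
snapshot_at_pin_time = b"sources as of the moment the hash was recorded"
snapshot_after_new_commit = b"sources after a new commit landed on the branch"

# The hash recorded when the dependency was first pinned.
pinned_sha1 = hashlib.sha1(snapshot_at_pin_time).hexdigest()

# A tag keeps serving the same bytes, so the check keeps passing.
print(archive_matches_pin(snapshot_at_pin_time, pinned_sha1))
# A branch archive changes whenever the branch moves, so the check fails.
print(archive_matches_pin(snapshot_after_new_commit, pinned_sha1))
```

Pointing the URL at a tag rather than a branch keeps the archive bytes, and therefore the pinned hash, stable.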
2023-11-06 09:19:51 -08:00
Scott McKay
c352e9b1f9
Rework/cleanup the C# build infrastructure for nuget packages. (#18127)
### Description
Update the C# nuget build infrastructure to make building a test nuget
package more user friendly and to simplify
- Remove usage of dotnet and msbuild in CIs
  - was temporary requirement until .net 6 MAUI was added to the released
    Visual Studio
  - remove SelectedTargets property and its usage
- Add property for excluding mobile targets
  - generally we exclude based on the nuget package name
  - can now specify `/p:IncludeMobileTargets=false` on the command line to
    force exclusion
- support building test package using build.py `--build_nuget` better
  - limit inclusion of xamarin targets as building with them requires a
    lot more infrastructure
  - use msbuild directly if xamarin targets are included. use dotnet
    otherwise.
  - remove quoting of property values as it doesn't appear to be necessary
    and breaks when msbuild is being used
  - add infrastructure to be able to pack the nuget package on linux with
    `dotnet pack`
    - `nuget pack` is not user friendly as-per comments in changes
    - requires stub csproj to provide the nuspec path
- Remove netstandard1.0 targets from nuspec
  - we removed support from the actual bindings previously
- Remove usage of nuget-staging directory when creating nuget package on
  linux
  - the nuspec file element has a fully qualified path for a source file
    so there is no obvious benefit to copying to a staging directory prior
    to packing

### Motivation and Context
Address issues with 1P users trying to create test nuget packages
locally.
Long overdue cleanup of CI complexity.
2023-11-03 09:05:17 -07:00
Scott McKay
4f2096be38
Update XNNPACK to latest version (#18038)
### Description
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well

Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
  - copied workspace allocation from XNNPACK unit test code
- some suffixes changed

Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
  isn't currently usable via the public API
  - we could use the internal types if we think it's required for
    performance reasons. non-trivial though as we'd need to propagate ifdef
    values from the XNNPACK build up to the ORT build.
  - using XNNPACK internals would also mean we would not be able to
    support using a pre-built XNNPACK package
    - not an issue currently
  
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed

Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario

### Motivation and Context
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.

NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
2023-11-03 09:04:28 -07:00
Yi Zhang
9f5a6856fe
Rerun the flaky ort-web tests automatically (#18187)
### Description
Retry 3 times at most if the web test fails.


### Motivation and Context
Web GPU tests are not stable.

According to the link below, these ort-web tests are all among the top 10
failing tasks.

https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=161&contextType=build.

Generally, the tests pass when rerun manually,
so enable automatic reruns.

These test steps do not take long, so retrying will not add much time.
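
The retry policy described above can be sketched generically. This is a minimal Python illustration, not the pipeline's actual configuration: `run_with_retries` and its parameters are hypothetical names, and it assumes "retry 3 times at most" means one initial attempt plus up to three retries.

```python
from typing import Callable


def run_with_retries(step: Callable[[], bool], max_retries: int = 3) -> int:
    """Run `step` until it succeeds, retrying at most `max_retries` times.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails.
    """
    attempts_allowed = 1 + max_retries  # the first run plus the retries
    for attempt in range(1, attempts_allowed + 1):
        if step():
            return attempt
    raise RuntimeError(f"step failed after {attempts_allowed} attempts")


# A stand-in for a flaky test step: fails twice, then passes.
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes)))
```

A bounded retry count keeps a genuinely broken test from looping forever while still absorbing intermittent failures.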
2023-11-03 16:34:56 +08:00
Changming Sun
d8d79521ca
Disable ccache for DML (#18230)
### Description
Disable ccache for DML. This change is similar to #18104. Now the DML
build job is having the same timeout issue. I don't know why. But
disabling ccache probably would help.
2023-11-02 16:00:55 -07:00
liqun Fu
20f2dd8b6b
use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177) 2023-10-31 14:58:21 -07:00
Jian Chen
29e40987e3
Update batch file to set PATH for Cuda with TRT (#18182)
### Description

Update batch file to set PATH for Cuda with TRT

2023-10-31 10:22:40 -07:00
Jian Chen
8a574b874c
Update setup_env_cuda.bat (#18176)
2023-10-30 21:28:02 -07:00
Yi Zhang
436056dcd7
Revert "Disable dml stage in windows GPU pipeline temporarily. (#18034)" (#18150)
This reverts commit 99b8dcaae2.




### Motivation and Context
Restore the DML stage in the Windows GPU pipeline.
The agent issue was solved by adding Feature.DisableGpuDriver to the pool
properties.
2023-10-30 15:41:07 +08:00
Xavier Dupré
c10b83eb68
Update python cryptography version to 41.0.4 (#18056)
### Description

Version 41.0.0 currently used has vulnerabilities.

### Motivation and Context

See [Vulnerable OpenSSL included in cryptography
wheels](https://github.com/advisories/GHSA-v8gr-m533-ghj9)
2023-10-27 12:06:38 +02:00
Jian Chen
7c18c60bc2
Change cuda image for tensorRT to the one with cudnn8 (#18102)
2023-10-26 16:28:57 -07:00
Ashwini Khade
f2e19a8ccf
Updates to training pipelines to reduce CI time (#18116)
### Description
Motivation for this PR is reducing CI test time by removing unnecessary
tests from the pipelines.

Following changes are for reducing test time in pipelines:

- Skip CPU model tests in GPU builds. Training CIs run these tests as a
sanity check. No training code is directly tested in these
pipelines. Furthermore, CPU tests already run in the CPU pipelines, so
there is no need to run them again in GPU builds and block the GPU VM.
This change reduces testing time by 20-25 mins in all training GPU pipelines.

- Delete debug package building pipeline for linux training packages.
This was required by compiler team at some point but there have been 0
downloads of these packages.



2023-10-26 14:58:57 -07:00
Chi Lo
455a9ce614
[TensorRT EP] Use latest onnx-tensorrt parser (#18067)
Use latest onnx-tensorrt to fix compile error.

Please see the issue
https://github.com/microsoft/onnxruntime/issues/18029
2023-10-26 13:55:12 -07:00
Jian Chen
b023de0bfc
Redo #18044 Install CUDA 12.2 on Windows (#18093) 2023-10-26 10:12:46 -07:00
Changming Sun
0f72739b6d
Disable ccache for WinML build (#18104)
### Description
This seems to resolve the timeout issue.


2023-10-26 19:03:01 +08:00
Jian Chen
76e275baf4
Merge Cuda docker files into a single one (#18020)
2023-10-24 15:17:36 -07:00
Changming Sun
6ec45f2ba5
Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 (#18069)
### Description
Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019
machines to a single one to ease management.
2023-10-24 13:04:08 -07:00
Changming Sun
abb329179a
Update win-wasm-ci.yml: increase the timeout value (#18023) 2023-10-24 10:50:12 -07:00
Jian Chen
e63ccd3cbb
Install CUDA 12.2 on Windows (#18044)
2023-10-24 10:47:23 -07:00
liqun Fu
020824ed50
Update ONNX to 1.15.0rc1 (#17914) 2023-10-20 15:08:25 -07:00
Yi Zhang
99b8dcaae2
Disable dml stage in windows GPU pipeline temporarily. (#18034)
2023-10-20 08:41:40 -07:00
Jian Chen
cbb0e0f83c
Create a new Dockerfile for cuda 12 and trt 8.6.1.6-1.cuda12.0 (#18000) 2023-10-18 14:46:02 -07:00
Changming Sun
57c8736596
Move a nodejs test to a different machine pool (#17970)
### Description
This is a temporary fix for the failing "Zip-Nuget-Java-Nodejs Packaging
Pipeline". The pipeline is failing because I removed NodeJS from the
build machine pool's image to reduce the number of dependencies we need
to maintain in VMs.
This PR temporarily moves the test to a different machine pool so that
the test passes. I will then move the test to Docker, since Docker images
are easier to update and maintain. We now run almost all Linux tests in
Docker, except for this one. Moving it to Docker is also needed
for enabling GPU support in nodejs, because none of our Linux VMs
have CUDA.


2023-10-17 09:30:14 -07:00
Hariharan Seshadri
9356986730
Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972)
### Description
This PR:

(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)

(2) Turns on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent building a few more files and running a few more tests should be
small.

Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770

### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
2023-10-17 09:23:52 -07:00
Yulong Wang
f7341e8103
enable training for win-wasm-ci.yml (#17954)
### Description
**Fixes NPM Packaging pipeline.**

Training was enabled for linux-wasm-ci.yml but not enabled for
win-wasm-ci.yml.

the web CI uses linux-wasm-ci.yml
NPM packaging pipeline uses win-wasm-ci.yml
2023-10-16 16:07:20 +08:00
Scott McKay
ae211999dd
Attempt to make the usage of the Android emulator in CIs more robust (#17903)
### Description
Android emulator usage updates:
- Change approach to detecting boot has completed
  - use `-delay-adb` and a simple command (`ls`) with `wait-for-device` as
    the first step
    - this ensures enough startup has occurred for adb to be responsive
  - use secondary loop on the python side to check for sys.boot_completed
    to be set
    - doing the check on the python side provides more feedback and seems to
      work well
- make the 'stop' logic more precise by using psutil
- add internal timeout of 20 mins for emulator startup
  - waiting for the CI jobs overall timeout is way too long
  - value is hardcoded for now (most CIs startup in under 10 mins) but
    could be made configurable if needed
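
The secondary polling loop with an internal timeout, as described in the list above, might look roughly like this. It is a minimal sketch and not the code from this change: `adb_output` and `wait_for_boot` are hypothetical names, the 5-second poll interval is an assumption, and the clock and sleep functions are injectable only so the loop can be exercised without a real emulator.

```python
import subprocess
import time
from typing import Callable, Sequence


def adb_output(args: Sequence[str]) -> str:
    """Run `adb <args>` and return its stdout (empty string on failure)."""
    result = subprocess.run(
        ["adb", *args], capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else ""


def wait_for_boot(
    run_adb: Callable[[Sequence[str]], str] = adb_output,
    timeout_s: float = 20 * 60,  # internal timeout of 20 minutes
    poll_s: float = 5.0,  # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll sys.boot_completed until it reads "1" or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        value = run_adb(["shell", "getprop", "sys.boot_completed"]).strip()
        if value == "1":
            return True
        sleep(poll_s)
    return False
```

In a real run this loop would follow the first step, where the emulator is started with `-delay-adb` and `adb wait-for-device shell ls` confirms that adb is responsive.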

CI updates:
- add template for using the Android emulator
  - update CIs to use template
- reorder React Native CI
  - minimize the time the Android emulator or iOS simulator is running by
    moving some build steps around
  - don't run both at the same time
    - unnecessary and potentially adds significant memory pressure to the
      machine
- fix QNN Android emulator CI as much as possible
  - now everything works apart from running onnx_test_runner with the QNN
    EP

### Motivation and Context
Fix inconsistent detection of the emulator boot completing.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-10-15 08:42:36 +10:00
PeixuanZuo
0c5b1598d3
[ROCm] Add ROCm Debug wheels to private ADO Feeds (#17887)
Add ROCm Debug wheels to private ADO Feeds
2023-10-13 10:28:10 +08:00
Changming Sun
3f3ece4a39
Update NDK to 26.0.10792818 (#17852)
### Description
Update NDK to 26.0.10792818 which is included in every macOS build
machine so that we do not need to download a different version every
time in every build.

### Motivation and Context
Downloading the NDK on the fly is a major contributor to Android-related
build failures.
2023-10-12 14:08:43 -07:00
Yi Zhang
9d07ca3621
Move compliance check before publishing pipeline artifact (#17857)


### Motivation and Context
The compliance check can fail randomly, but the stage cannot be rerun if
the pipeline artifacts are already published.
It fails with an error like `Artifact xxxx already exists`.
We had to restart the whole pipeline whenever the compliance check hit a
random error.
2023-10-12 15:48:04 +08:00
Yulong Wang
25bbd8d4eb
[js/web] allow gpu IO binding tests to fail temporarily (#17892)
### Description
Allow GPU IO binding tests to fail temporarily.

While the root cause is still under investigation, use `continueOnError:
true` to allow the test to fail without blocking PRs.
2023-10-11 21:21:21 -07:00
Changming Sun
138ccecd22
Change how "NPM packaging pipeline" downloads packages from another pipeline (#17838)
### Description
"NPM packaging pipeline" needs to download an artifact from
"Zip-Nuget-Java-Nodejs Packaging Pipeline".
It has been a long-standing issue that the two pipelines often use
different commit ids.
This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a
resource, so that "NPM packaging pipeline" will always fetch from the
pipeline run that triggers this NPM pipeline.
The official Azure Pipelines documentation says:
"When you define a resource trigger, if its pipeline resource is from
the same repo as the current pipeline, triggering follows the same
branch and commit on which the event is raised."
2023-10-11 21:07:27 -07:00
Scott McKay
046939b0c1
Include CoreML in mac os python packages (#17844)
### Description
Include CoreML EP in python package.

I've added to the base package as CoreML comes from the OS so there are
no additional libraries to distribute.

Updated the CPU-based provider list to add the AzureEP, which is also
included in the base package, to fix some test failures. Without this
the infrastructure thinks a device copy implementation is required
between AzureEP and CoreML nodes, which is not the case as the AzureEP
is CPU based.

### Motivation and Context
#16989
2023-10-10 11:44:32 +10:00