Commit graph

1810 commits

Author SHA1 Message Date
Changming Sun
ed550b5fe5
Change webgpu CI pipeline to use a preinstalled chrome (#19729)
### Description
Change the WebGPU CI pipeline to use a preinstalled Chrome, which should improve
stability. Currently the Chrome fetched by puppeteer often fails
to start.
2024-02-29 20:36:29 -08:00
Changming Sun
250779474d
Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698)
### Description
The original one reports "out of disk space", which needs to be
investigated.
2024-02-28 19:36:26 -08:00
Changming Sun
a93c31e3c9
Update dml-vs-2022.yml (#19687)
### Description
Fix a build error in the "Zip-Nuget-Java-Nodejs Packaging Pipeline" caused by
files being deleted too early.
2024-02-28 12:03:17 -08:00
Changming Sun
7a147fc6f7
Remove a bash task from webgpu CI pipeline (#19682)
### Description
It is a "Bash" task that requires running bash on Windows. Most Windows
operating systems do not have Bash installed. Given this task is only for
debugging purposes, we can remove it for now.


### Motivation and Context
I am making this change because I am regenerating the VM image in a
different manner, and the new image does not contain bash. Once this PR
is in, I can switch the images.
2024-02-28 18:20:53 +08:00
Yi Zhang
f95c0773a1
Add shared memory flag in docker (#19672)
### Description



### Motivation and Context
Ref:
https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem

Co-authored-by: Your Name <your@email.com>
2024-02-28 10:40:40 +08:00
Scott McKay
1c468a03b9
Improve Nuget-CUDA-Packaging-Pipeline (#19668)
### Description
* Publish the artifacts as late as possible
* once published the artifacts are immutable, and any retry will fail if
they exist
  * if any step fails after publishing the stage cannot be retried
* use powershell to cleanup
  * DeleteFiles is taking >30 mins and causing the stage to timeout
  * powershell took < 1s
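The cleanup swap above can be sketched outside the pipeline. A minimal, hypothetical Python stand-in (the real step uses PowerShell's `Remove-Item`; the directory layout here is made up):

```python
import shutil
import tempfile
from pathlib import Path

# Build a small throwaway tree standing in for the packaging output
# (hypothetical layout; not the pipeline's real artifact directory).
root = Path(tempfile.mkdtemp())
for i in range(100):
    sub = root / f"dir{i}"
    sub.mkdir()
    (sub / "artifact.bin").write_bytes(b"\0" * 1024)

# One in-process recursive delete removes the whole tree in a single
# pass, analogous to replacing the per-file DeleteFiles task with a
# single PowerShell Remove-Item -Recurse.
shutil.rmtree(root)
assert not root.exists()
```

The point is the shape of the fix, not the language: a per-file task pays task overhead on every file, while one recursive delete is a single traversal.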

### Motivation and Context
Make pipeline more robust
2024-02-27 09:27:43 -08:00
Scott McKay
580ee20dfc
Tweak Windows build parallelization settings (#19664)
### Description
Use UseMultiToolTask and limit the number of cl.exe instances running. 

MultiToolTask info:
https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/

Info on why limiting CL_MPCount can help:
https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows

The current CIs have 4 cores (both physical and logical). Hardcoded the
GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work
fine. Can adjust if needed to base it on the actual number of cores or
to use build.py to build.
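As a sketch of the "base it on the actual number of cores" alternative mentioned above (a hypothetical helper, not what win-ci.yml does — the yaml simply hardcodes 2):

```python
import os

def cl_mp_count(total_cores=None):
    """Cap the number of concurrent cl.exe instances (hypothetical helper).

    The CI change hardcodes CL_MPCount=2 for the 4-core pool; this
    sketches deriving the cap from the core count instead.
    """
    cores = total_cores or os.cpu_count() or 1
    # Leave headroom: MSBuild project-level parallelism multiplies with
    # cl.exe's file-level parallelism, so halving avoids oversubscription.
    return max(1, cores // 2)

msbuild_args = [
    "/p:UseMultiToolTask=true",
    f"/p:CL_MPCount={cl_mp_count(4)}",
]
```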

Caveat: I've run about 16 builds and haven't seen a slow build yet, but
as the root cause of the slow builds isn't really known this isn't
guaranteed to be a fix.

### Motivation and Context
Try and prevent super slow GPU builds by reducing number of tasks
potentially running in parallel.
2024-02-27 08:56:16 -08:00
Yi Zhang
3b46ab6439
Re-add testing removed by mistake. (#19647) 2024-02-27 08:46:29 -08:00
Rachel Guo
5bb58a10e7
Enable the most verbose logging level in detox E2E React Native CI (#19659)
### Description

The RN CI has intermittent failures with an "app seems to idle" error.
Enable the most verbose logging level (and we can add steps to dump
device.log from the detox folder/artifacts if necessary) to at least get
more information.

### Motivation and Context

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-02-26 20:00:14 -08:00
Scott McKay
8bd943be39
Retry flaky XCode iOS UI tests if we get a known error (#19639)
### Description
Xcode UI tests seem to be flaky:
https://github.com/orgs/community/discussions/68807
Add a couple of retries if we get a "Timed out while loading
Accessibility." error, which is transient.
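The retry policy can be sketched as follows (a hypothetical Python stand-in; the actual change lives in the pipeline yaml around the Xcode UI test step):

```python
import subprocess

TRANSIENT_MARKER = "Timed out while loading Accessibility."

def run_with_retries(cmd, max_attempts=3):
    """Re-run a flaky test command only for the known transient error.

    Hypothetical sketch: retry when the known marker appears, but let
    any other failure surface immediately so real bugs aren't masked.
    """
    proc = None
    for _ in range(max_attempts):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc
        if TRANSIENT_MARKER not in proc.stdout + proc.stderr:
            break  # a real failure: don't mask it with retries
    return proc
```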


### Motivation and Context
2024-02-27 09:31:32 +10:00
Yi Zhang
0fcc6fb760
Add Whisper model in CI (#19604)
### Description
Add Whisper conversion and E2E tests to the Big Models pipeline



### Motivation and Context

---------

Co-authored-by: Your Name <your@email.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2024-02-25 14:04:22 +08:00
Yi Zhang
c980149c85
Add log for random exception in Linux GPU Test Stage. (#19569)
### Description
1. Check GPU status in docker.
2. Use stages so the test stage can leverage existing build
artifacts.


### Motivation and Context
To investigate the root cause of the random exception
`CUDA failure 100: no CUDA-capable device is detected`
2024-02-24 13:00:53 -08:00
Scott McKay
45e20bf781
Use build.py to build in py-win-gpu.yml so parallelization parameters are set (#19578)
### Description
build.py sets a few parallelization parameters when building. Using
msbuild directly lacks those.


7a5860e490/tools/ci_build/build.py (L1665-L1669)

Changed to use build.py. If there's a concern with that we _could_ set
the parameters in the yaml, but that will be uglier due to duplicating
logic in multiple places.


### Motivation and Context
2024-02-21 10:38:37 +08:00
PeixuanZuo
f3e3b531fe
Update build directory clean up stage for python package pipeline (#19553)
Fix to make the clean-up stage take effect.

If the `SourceFolder` is empty, the task deletes files from the root
folder of the repository as though
[$(Build.SourcesDirectory)](https://learn.microsoft.com/en-us/azure/devops/pipelines/build/variables)
was specified.
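A minimal sketch of the pitfall and a guard against it (hypothetical helper names; the real fix is simply to populate `SourceFolder` in the task):

```python
from pathlib import Path

def cleanup_target(source_folder, repo_root):
    """Resolve the folder a cleanup task should delete from.

    Hypothetical guard illustrating the DeleteFiles pitfall: an empty
    SourceFolder silently defaults to $(Build.SourcesDirectory), i.e.
    the repository root, so refuse it explicitly.
    """
    if not str(source_folder).strip():
        raise ValueError("SourceFolder is empty; refusing to default to the repo root")
    return Path(repo_root) / str(source_folder)

# An explicit folder resolves normally:
assert cleanup_target("build/Release", "/work/repo") == Path("/work/repo/build/Release")
```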
2024-02-20 10:31:39 +08:00
Adrian Lizarraga
4874a41008
[QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546)
### Description
Updates the default QNN SDK version to 2.19.2.240210.

### Motivation and Context
Build and test the latest version of QNN SDK in our pipelines.
2024-02-16 16:59:43 -08:00
Tianlei Wu
1dce5e1732
Disable TF32 in Linux_Test stage of Linux GPU CI Pipeline (#19541)
### Description
Some test thresholds that previously worked on T4 GPUs no longer work.
The reason is that the current pipeline uses A10 GPUs, where TF32 is
enabled by default.

Disable TF32 in the Linux GPU CI Pipeline's tests to avoid such random
test failures.
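For reference, the CUDA-level switch for this is the `NVIDIA_TF32_OVERRIDE` environment variable; a minimal sketch of the plumbing (the pipeline may disable TF32 through a different mechanism):

```python
import os

# NVIDIA_TF32_OVERRIDE=0 tells the CUDA math libraries (cuBLAS/cuDNN)
# to fall back to true FP32 even on Ampere GPUs such as the A10. It
# must be set before any CUDA context is created, e.g. before the test
# process launches. This only shows the environment plumbing.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"
```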

### Motivation and Context
The Linux Test stage fails randomly in these tests:

```
ProviderOptionsTest > testCUDAOptions() FAILED
    org.opentest4j.AssertionFailedError: array contents differ at index [446], expected: <0.0419757> but was: <0.041948937>
        at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
        at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
        at app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
        at app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
        at app//ai.onnxruntime.providers.ProviderOptionsTest.runProvider(ProviderOptionsTest.java:99)
        at app//ai.onnxruntime.providers.ProviderOptionsTest.testCUDAOptions(ProviderOptionsTest.java:43)

    org.opentest4j.AssertionFailedError: array contents differ at index [6], expected: <0.0225981> but was: <0.022587791>
        at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
        at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
        at app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
        at app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
        at app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
        at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:676)
        at app//ai.onnxruntime.InferenceTest.testCUDA(InferenceTest.java:615)
```
2024-02-16 14:41:11 -08:00
rui-ren
d63c664ca0
fix rocm ci pipeline (#19525)
### Description

The ROCm CI pipeline fails with the following error:
```
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /home/onnxruntimedev/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
    main()
  File "/stage/huggingface-transformers/examples/pytorch/language-modeling/run_mlm.py", line 242, in main
    datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir)
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/load.py", line 856, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/builder.py", line 639, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/onnxruntimedev/.cache/huggingface/modules/datasets_modules/datasets/wikitext/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/wikitext.py", line 138, in _split_generators
    data_file = dl_manager.download_and_extract(self.config.data_url)
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 289, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 197, in download
    downloaded_path_or_paths = map_nested(
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 195, in map_nested
    return function(data_struct)
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 220, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
    output_path = get_from_cache(
  File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 634, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip

```


### Motivation and Context
Update the `datasets` package to the latest version, `2.17.0`.
2024-02-15 00:02:08 -08:00
Prathik Rao
3b03b2e046
Upgrade default ORTModule opset from 15 to 17 (#19315)
### Description

This PR upgrades ORTModule's default opset from 15 to 17. Opset 17 is
the final opset supported by torchscript exporter
(https://github.com/pytorch/pytorch/pull/107829)

### Motivation and Context

Engineering excellence contribution for ORT Training DRI.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-02-14 11:19:33 -08:00
Yifan Li
5c7e6b2e2a
[EP Perf] Add CI option to enable TRT-OSS parser (#19448)
### Description
* Introduce a CI option to enable the TRT-OSS parser during EP perf
testing:

![image](https://github.com/microsoft/onnxruntime/assets/109183385/a9ba6393-6b94-4b8f-8ca4-ba7bc7954504)

If this option is enabled, the open-sourced onnx-tensorrt parser listed under
[cmake/deps.txt](https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt#L39-L40)
will be used.


### To verify this option and check the difference during ORT image
build:
If this option is enabled:
<img width="649" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/109183385/3b778583-451e-4617-ba8c-c064442e60fd">

If this option is not enabled (by default):
<img width="683" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/109183385/cd8383ba-eff4-4536-94ab-a1424bb858ab">

* Update the default cmake/TRT versions to the latest.

### Motivation and Context
Make it easier to test the OSS parser and find potential gaps between
the TensorRT builtin and OSS parsers.

Scheduled runs with the OSS parser will be set up after this PR is merged.
2024-02-12 23:04:08 -08:00
Adrian Lizarraga
4dfba53bfb
[QNN EP] Build x64 python wheel for QNN EP (#19499)
### Description
Adds a job to the python packaging pipeline that builds x64 python
wheels for QNN EP.



### Motivation and Context
Necessary to create a cached QNN model on Windows x64, which is done by
creating a properly configured onnxruntime session with QNN EP.
2024-02-12 20:54:04 -08:00
Baiju Meswani
c831031ad5
Remove cuda gencode 90 to reduce onnxruntime-training package size (#19486) 2024-02-12 09:24:36 -08:00
Justin Chu
3d2ddf96e3
Bump ruff linter to 0.2.1 (#19471)
### Motivation and Context

Include new lint rules
2024-02-08 16:08:27 -08:00
Jian Chen
75f06319d6
Change binet to bin (#19424)
### Description
This pull request includes a small change to the
`Dockerfile.manylinux2_28_cuda` file in the
`tools/ci_build/github/linux/docker` directory. The change corrects the
`PREPEND_PATH` argument from `/usr/local/cuda/binet` to
`/usr/local/cuda/bin`, ensuring the correct path to CUDA binaries is
set.
2024-02-07 09:51:02 -08:00
Edward Chen
df5c6718bd
Remove iOS simulator max runtime version limit. (#19396) 2024-02-06 14:54:06 -08:00
Yulong Wang
a4cfdc1c28
update comments for nodejs binding artifact preparation. (#19425)
### Description
Documentation update as a follow-up to #19274.
2024-02-05 22:58:35 -08:00
Jian Chen
06a84c8a0d
Enable DML on Windows and CUDA on Linux for Node.js binding (#19274)
This pull request includes modifications to the `c-api-cpu.yml` Azure
Pipelines configuration file. The changes mainly revolve around the
Node.js packaging stage and the handling of Node.js artifacts. The most
significant changes include renaming the Node.js packaging stage, adding
a new dependency to the stage, changing artifact names, adding a new
script to list Node.js artifacts, and updating the source folder for
copying NuGet binaries.

Changes in Node.js packaging:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L503-R508):
Renamed the Node.js packaging stage from `Nodejs_Packaging_CPU` to
`Nodejs_Packaging` and added `Windows_CI_GPU_DML_Dev` as a new
dependency to the stage.

Changes in handling of Node.js artifacts:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L568-R569):
Changed the artifact name from `drop-onnxruntime-nodejs-win-x64` to
`drop-onnxruntime-nodejs-win-x64-dml` in the task to download pipeline
artifacts for Windows x64.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R595-R598):
Added a new script to list Node.js artifacts from the directory
`$(Build.BinariesDirectory)/nodejs-artifacts/win32/x64/`.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L635-R640):
Updated the source folder from
`$(Build.BinariesDirectory)\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib`
to `$(Build.BinariesDirectory)\nodejs-artifacts\win32\x64` in the task
to copy NuGet binaries to the directory
`$(Build.SourcesDirectory)\js\node\bin\napi-v3\win32\x64`.

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-02-05 14:33:58 -08:00
Yi Zhang
435e19953e
Fix llama.convert_onnx to make it runnable in CI (#19372)
### Description
1. Make parity_check use a local model to avoid needing an HF token.
2. `del` on the model didn't work because it tried to delete an object defined
outside the function scope, so it caused out-of-memory on the A10.
3. In fact, 16 GB of GPU memory (one T4) should be enough, but the conversion
process always gets killed on T4 while it works on the A10 (24 GB).
     Standard_NC4as_T4_v3 has 28 GB of CPU memory;
     Standard_NV36ads_A10_v5 has 440 GB.
     It looks like the model conversion needs a very large amount of memory.

### Motivation and Context
Previously, I came across some issues in convert_to_onnx.py, so I used the
ONNX model from https://github.com/microsoft/Llama-2-Onnx for testing.
Now that these issues are fixed, I use the ONNX model generated by this
repo so the CI can cover the model conversion.
2024-02-05 07:26:24 +08:00
PeixuanZuo
0cba56e0a0
[ROCm] Fix CI pipeline by fixing pytest version (#19407)
Pin the pytest version to 7.4.4; a higher version causes this error:

```
from onnxruntime.capi import onnxruntime_validation
ModuleNotFoundError: No module named 'onnxruntime.capi'
```
2024-02-04 16:37:36 +08:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
Set up usage of coremltools via dependencies instead of copying files.
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1 
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename the onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT-related changes

### Motivation and Context
Improve setup.
2024-02-03 09:42:21 +10:00
Yi Zhang
e74f141338
Save stablediffusion and open-clip in pipeline cache (#19314)
### Description
1. Save the model to the pipeline cache.
2. Lower the similarity bar to 97.
3. Publish the generated image so we can check it once the test fails.


### Motivation and Context
Reduce model downloads
2024-01-31 09:39:27 +08:00
Rachel Guo
3e17ca3dab
Fix iOS artifacts issue in Microsoft.ML.OnnxRuntime Nuget Package (#19311)
### Description

Update to only include the iOS archs framework in the artifacts included
in the NuGet package.


### Motivation and Context

Related issue:
https://github.com/microsoft/onnxruntime/issues/19295#issuecomment-1914143256

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-01-30 08:44:20 -08:00
Changming Sun
e91d91ae4f
Fix a build issue: /MP was not enabled correctly (#19190)
### Description

In PR #19073 I misunderstood the value of "--parallel". Instead of
testing whether args.parallel is None, I should test the return
value of the number_of_parallel_jobs function.

If build.py was invoked without --parallel, then args.parallel equals
1, because that is the default value. Then we should not add "/MP".
However, the current code adds it, because `if args.parallel` is
evaluated as `if 1`, which is True.
If build.py was invoked with a bare --parallel (no number), then
args.parallel equals 0, because the count is unspecified. Then we should add
"/MP". However, the current code does not add it, because `if
args.parallel` is evaluated as `if 0`, which is False.

This also adds a new build flag: use_binskim_compliant_compile_flags, which is intended to be only used in ONNX Runtime team's build pipelines for compliance reasons. 
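The inverted truthiness check can be reproduced with a small argparse sketch (a simplified stand-in that mirrors the semantics described above: default 1 when the flag is omitted, 0 when given bare):

```python
import argparse

# Simplified stand-in for build.py's option semantics:
#   omitted         -> parallel == 1 (the default)
#   bare --parallel -> parallel == 0 ("use all cores", count unspecified)
parser = argparse.ArgumentParser()
parser.add_argument("--parallel", nargs="?", const=0, default=1, type=int)

serial = parser.parse_args([])                 # no --parallel given
all_cores = parser.parse_args(["--parallel"])  # bare flag

# The buggy truthiness check is exactly inverted:
assert bool(serial.parallel) is True      # would wrongly add /MP
assert bool(all_cores.parallel) is False  # would wrongly skip /MP

# A correct check compares against the "not requested" sentinel instead:
def wants_parallel_build(args):
    return args.parallel != 1

assert not wants_parallel_build(serial)
assert wants_parallel_build(all_cores)
```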

### Motivation and Context
2024-01-29 12:45:38 -08:00
Yi Zhang
e96a038f01
Add VP test in Stable diffusion pipeline (#19300)
### Description
1. Add a visual parity test based on the OpenAI CLIP model.
2. Add trigger rules.

### Motivation and Context
1. Check that the generated image is as expected.
2. Reduce unnecessary triggers.
2024-01-29 09:33:58 -08:00
Tianlei Wu
358650d441
Fix BigModel stable diffusion pipeline (#19277)
### Description
Fix two issues:
(1) We can only use single quotes inside `bash -c "..."`. The current
pipeline job stopped at `python3 demo_txt2img.py astronaut` and skipped the
following commands. In this change, we remove the remaining commands to
get the same effect (otherwise, the pipeline runtime might be 2 hours
instead of 15 minutes).
(2) Fix a typo of "Stable".
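The quoting constraint in (1) can be reproduced directly (hypothetical echo command, not the pipeline's actual script):

```python
import subprocess

# The yaml step effectively runs:  bash -c "<commands>"
# A double quote inside <commands> closes the outer string early, so
# everything after the next space falls out of the -c argument and
# becomes extra positional arguments to bash.
bad = subprocess.run(
    'bash -c "echo "astronaut riding horse""',
    shell=True, capture_output=True, text=True,
)
# Only "astronaut" survives inside the command; "riding" and "horse"
# became bash's $0 and $1 and are never echoed.
assert bad.stdout.strip() == "astronaut"

# Single quotes inside the double-quoted -c argument pass through intact:
good = subprocess.run(
    'bash -c "echo \'astronaut riding horse\'"',
    shell=True, capture_output=True, text=True,
)
assert good.stdout.strip() == "astronaut riding horse"
```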
2024-01-25 17:19:04 -08:00
Changming Sun
bc54ad3f03
Update abseil to a release tag and register neural_speed (#19255)
### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Currently we are using a non-released version of abseil. Using a tag is better.
2024-01-24 14:37:39 -08:00
Yi Zhang
d7aebf9ea8
Move Nuget Test from T4 to A10 to reduce release duration (#19253)
### Description



### Motivation and Context
Running the release process is very painful and boring because some GPU jobs
have to wait for a long time.

![image](https://github.com/microsoft/onnxruntime/assets/16190118/1c5c981e-68d4-4678-9758-443fbf362802)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/ba0d79ba-1554-4c7a-93dd-6ea8144c9295)

![image](https://github.com/microsoft/onnxruntime/assets/16190118/36cab833-71c1-4ff5-bca5-f4caa9aee0c9)
On the one hand, we could remove some T4 machines from the PR process since
some jobs are not using T4 any more; on the other hand, we can continue to
change some jobs' agents from T4 to A10 too.

In the future, T4 will mainly be used for scenarios where big GPU
memory or multiple GPU cards are needed, or some other special cases.


Test runs:

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=401786&view=logs&j=8048494c-e6eb-5e47-5e87-ff0aa863325d

cc @YUNQIUGUO @snnn
2024-01-24 14:15:07 +08:00
Yi Zhang
54871a2773
Replace T4 to A10 in Linux GPU workflow (#19205)
### Description
1. Update the Linux GPU machine from T4 to A10 (sm=8.6).
2. Update the tolerance.

### Motivation and Context
1. Free more T4 machines and test with higher compute capability.
2. ORT enables TF32 in GEMM for A10/A100. TF32 causes precision loss
and fails this test:
```
2024-01-19T13:27:18.8302842Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z   ret.first
2024-01-19T13:27:25.8445514Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z 
2024-01-19T13:27:25.8555736Z [  FAILED  ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z   ret.first
2024-01-19T13:27:29.3176322Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ

```
3. Some other tests like SSD throw other exceptions, so skip them:
```
2024-01-22T09:07:40.8446910Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-22T09:07:51.5587571Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358: Failure
2024-01-22T09:07:51.5588512Z Expected equality of these values:
2024-01-22T09:07:51.5588870Z   COMPARE_RESULT::SUCCESS
2024-01-22T09:07:51.5589467Z     Which is: 4-byte object <00-00 00-00>
2024-01-22T09:07:51.5589953Z   ret.first
2024-01-22T09:07:51.5590462Z     Which is: 4-byte object <01-00 00-00>
2024-01-22T09:07:51.5590841Z expected 1, got 63
```
2024-01-23 10:49:24 -08:00
Adrian Lizarraga
37d14d7896
[QNN EP] Create Windows ARM64 nightly python package (#19128)
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows
ARM64.
Must build onnxruntime-qnn with python 3.11 and numpy 1.25.

**Note: pipeline run may take up to 3 hrs**

### Motivation and Context
Make it possible to get a nightly python package with the latest updates
to QNN EP.
Issue #19161
2024-01-22 18:14:41 -08:00
Yifan Li
e283cdb218
Fix Fuzz Testing CI (#19228)
### Description
Add BuildArch

To verify:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=400952&view=logs&j=5b022bb4-70a7-5401-8766-a8a7802c7150&t=291e85c7-5547-590b-50de-4e01fcd4eba3&l=14

### Motivation and Context
2024-01-22 15:44:57 -08:00
Yi Zhang
780acda7b4
Add Big models pipeline (#19222)
### Description
Two models are added in CI.
The Stable Diffusion model stage is based on
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

The Llama2 FP16 stage is based on https://github.com/microsoft/Llama-2-Onnx.
12 GB of GPU memory is not enough, so I chose T4 to run it.

### Motivation and Context
Add a regular E2E test for big models.
It will be triggered in the main build; that is, it runs after a PR is
merged.

More models will be added later.

### Test Runs

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1275191&view=results
2024-01-22 14:02:56 -08:00
Edward Chen
c8ce83967e
Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209) 2024-01-19 15:30:09 -08:00
Adrian Lizarraga
28a16c223c
[QNN EP] Update QNN pipelines to use QNN SDK 2.18 by default (#19129)
### Description
Update QNN pipelines to use QNN SDK 2.18 by default



### Motivation and Context
Test with the latest version of QNN SDK by default.
2024-01-18 14:59:23 -08:00
Yi Zhang
dc1fed7268
[Fix] Dual CUDA version isn't supported as expected in Linux GPU pipeline (#19192)
### Description


### Motivation and Context
Dual CUDA version support doesn't work as expected.

CUDA 12 link:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1272235&view=logs&j=f2f63060-d9d6-52d0-adee-b97db5a9ab91
2024-01-18 13:26:26 -08:00
Guenther Schmuelling
dd2177c5d7
enable webnn in ci build (#19163)
### Description



### Motivation and Context
2024-01-18 13:11:47 -08:00
Jian Chen
9da3e36138
Fix buildJava from Zip-Nuget-Java-Nodejs Packaging Pipeline (#19187)
### Description



### Motivation and Context
2024-01-17 17:20:42 -08:00
Changming Sun
81d363045b
Upgrade Ubuntu machine pool from 20.04 to 22.04 (#19117)
### Description
Upgrade Ubuntu machine pool from 20.04 to 22.04
2024-01-16 17:25:18 -08:00
Changming Sun
e2e488d6f8
Revert "iOS packaging pipeline stability" (#19135)
Reverts microsoft/onnxruntime#19097 because it broke the Android CI
pipeline.
2024-01-16 09:18:35 -08:00
Jian Chen
c92f72ebeb
Merge Linux Nuget GPU pipeline with zip-nuget (#19120)
### Description



### Motivation and Context
2024-01-16 08:59:03 -08:00
pengwa
1150b1f81e
ORTModule memory improvement (#18924)
## Dependency

https://github.com/microsoft/onnxruntime/pull/19007

## ORTModule memory efficient gradient management

Previously, I tried to solve the coarse-grained gradient
accumulation/update problem in ORTModule with
https://github.com/microsoft/onnxruntime/pull/8979, but that
resolution was not fully validated with DDP, or when there are user
hooks on the gradient accumulation of torch parameters.

This PR addresses the problem with a similar approach to PR 8979,
i.e. triggering gradient accumulation once ORT has computed the grad,
but instead of using an AccumulateGrad op, this time it uses an ONNX
PythonOp operator that internally calls param.backward(grad), which helps
handle all related hooks correctly.


## Design

Check the details from


https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ

## Convergence Validation:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe)

Differences are mostly 0.000x, sometimes 0.00x, which may come from
the different order in which gradients are applied before vs. after this change
(on DeepSpeed ZeRO stage 2).


## TODO

Consolidate the logic with Stage3's similar logic.
2024-01-16 08:57:37 +08:00
Yi Zhang
922a2f00e3
Extend timeout in Nuget-CUDA-Packaging-Pipeline (#19138)
### Description



### Motivation and Context
The Linux_GPU_x64 job in the pipeline has been canceled due to timeout since
01/12.
2024-01-15 14:37:22 +08:00