onnxruntime/tools/ci_build/github/azure-pipelines
pengwa 1150b1f81e
ORTModule memory improvement (#18924)
## Dependency

https://github.com/microsoft/onnxruntime/pull/19007

## ORTModule memory efficient gradient management

I previously tried to solve the coarse-grained gradient
accumulation/update problem in ORTModule with
https://github.com/microsoft/onnxruntime/pull/8979, but that
solution was not fully validated with DDP or with user hooks
registered on the gradient accumulation of torch parameters.

This PR addresses the problem with an approach similar to PR 8979,
i.e. it triggers gradient accumulation as soon as ORT has computed the
grad. Instead of using an AccumulateGrad op, this time it uses an ONNX
operator, PythonOp, which internally calls param.backward(grad) so that
all related hooks are handled correctly.
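
As a minimal sketch of the idea (not the PythonOp implementation from this PR), the snippet below feeds an externally computed gradient into a leaf parameter with `param.backward(grad)`. The names `external_grad` and `hook_calls` are hypothetical, and the plain tensor stands in for a gradient produced by ORT. Because the gradient goes through PyTorch's own accumulation path, hooks registered on the parameter are expected to see it.

```python
import torch

# A leaf parameter, as a torch module would own it.
param = torch.nn.Parameter(torch.zeros(3))

# Hypothetical user hook: record every gradient that reaches the parameter.
hook_calls = []
param.register_hook(lambda g: hook_calls.append(g.clone()))

# Assumption: stand-in for a gradient computed outside PyTorch autograd (e.g. by ORT).
external_grad = torch.tensor([1.0, 2.0, 3.0])

# Route the gradient through autograd's accumulation for this leaf.
# Calling it twice accumulates into param.grad, as in gradient accumulation.
param.backward(external_grad)
param.backward(external_grad)

print(param.grad)       # accumulated gradient
print(len(hook_calls))  # number of times the hook observed a gradient
```

Going through `backward` on the parameter, rather than writing to `param.grad` directly, is what lets wrappers that rely on gradient hooks keep working.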


## Design

See the design details at:


https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ

## Convergence Validation:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe)

Differences are mostly on the order of 0.000x, sometimes 0.00x, which
may come from the different order in which gradients are applied before
and after this change (on DeepSpeed ZeRO stage 2).


## TODO

Consolidate the logic with Stage3's similar logic.
2024-01-16 08:57:37 +08:00
..
nodejs/templates
nuget/templates Fix Nuget CUDA Packaging pipeline (#19054) 2024-01-11 11:59:21 -08:00
stages Extend timeout in Nuget-CUDA-Packaging-Pipeline (#19138) 2024-01-15 14:37:22 +08:00
templates Disabling python3.12 on training python packaging pipleines (#19123) 2024-01-14 14:51:00 -08:00
triggers
android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml [QNN EP] Update QNN SDK to version 2.17.0 (#18684) 2023-12-06 11:05:41 -08:00
android-x86_64-crosscompile-ci-pipeline.yml
binary-size-checks-pipeline.yml
build-perf-test-binaries-pipeline.yml Remove enable_mac_silicon settings (#19108) 2024-01-12 11:01:39 -08:00
c-api-noopenmp-packaging-pipelines.yml Always download cuda and trt libraries from Azure blob (#19118) 2024-01-14 11:37:26 -08:00
clean-build-docker-image-cache-pipeline.yml
cuda-packaging-pipeline.yml Fix cuda-packaging-pipeline.yml (#19115) 2024-01-12 19:09:25 -08:00
linux-ci-pipeline.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
linux-cpu-aten-pipeline.yml
linux-cpu-eager-pipeline.yml
linux-cpu-minimal-build-ci-pipeline.yml Set NDK version in Linux CPU Minimal Build E2E CI Pipeline (#18810) 2023-12-14 08:08:41 -08:00
linux-dnnl-ci-pipeline.yml
linux-gpu-ci-pipeline.yml Create a new Nuget Package pipeline for CUDA 12 (#18135) 2023-11-28 09:03:46 -08:00
linux-gpu-tensorrt-ci-pipeline.yml Create a new Nuget Package pipeline for CUDA 12 (#18135) 2023-11-28 09:03:46 -08:00
linux-gpu-tensorrt-daily-perf-pipeline.yml [EP Perf] Fix missing Azure cli & use onnx zoo model inside image (#18917) 2024-01-01 17:14:39 -08:00
linux-migraphx-ci-pipeline.yml [ROCm] Update CI/Packaging pipeline to ROCm6.0 (#18985) 2024-01-03 17:25:15 +08:00
linux-multi-gpu-tensorrt-ci-pipeline.yml
linux-openvino-ci-pipeline.yml
linux-qnn-ci-pipeline.yml [QNN EP] Support multithreaded inference of a single session (#18981) 2024-01-04 13:32:48 -08:00
mac-ci-pipeline.yml
mac-coreml-ci-pipeline.yml Update min macos version (#18251) 2023-11-10 11:08:17 -08:00
mac-ios-ci-pipeline.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
mac-ios-packaging-pipeline.yml iOS packaging pipeline stability (#19097) 2024-01-13 19:27:44 -08:00
mac-objc-static-analysis-ci-pipeline.yml Update absl and gtest to fix an ARM64EC build error (#18735) 2023-12-07 15:55:17 -08:00
mac-react-native-ci-pipeline.yml
npm-packaging-pipeline.yml use EO pool for windows web_cpu stage (#18737) 2023-12-07 10:10:00 -08:00
nuget-cuda-publishing-pipeline.yml Update Nuget publishing jobs (#18851) 2023-12-19 16:54:46 -08:00
orttraining-linux-ci-pipeline.yml
orttraining-linux-gpu-ci-pipeline.yml
orttraining-linux-gpu-ortmodule-distributed-test-ci-pipeline.yml
orttraining-linux-nightly-ortmodule-test-pipeline.yml ORTModule memory improvement (#18924) 2024-01-16 08:57:37 +08:00
orttraining-mac-ci-pipeline.yml
orttraining-pai-ci-pipeline.yml [ROCm] Update CI/Packaging pipeline to ROCm6.0 (#18985) 2024-01-03 17:25:15 +08:00
orttraining-py-packaging-pipeline-cpu.yml Remove enable_mac_silicon settings (#19108) 2024-01-12 11:01:39 -08:00
orttraining-py-packaging-pipeline-cuda.yml
orttraining-py-packaging-pipeline-cuda12.yml Training packaging pipeline for cuda12 (#18524) 2023-11-21 13:19:21 -08:00
orttraining-py-packaging-pipeline-rocm.yml [ROCm] Update CI/Packaging pipeline to ROCm6.0 (#18985) 2024-01-03 17:25:15 +08:00
post-merge-jobs.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
publish-nuget.yml Update Nuget publishing jobs (#18851) 2023-12-19 16:54:46 -08:00
py-cuda-package-test-pipeline.yml Adding new pipeline for python cuda testing (#18718) 2023-12-18 18:13:03 -08:00
py-cuda-packaging-pipeline.yml Update the template files to correct stage to fix the python cuda 12 packaging pipeline (#18651) 2023-12-01 07:57:46 -08:00
py-cuda-publishing-pipeline.yml Adding a new pipeline for publishing to Python Cuda 12 packages. (#18712) 2023-12-11 14:17:46 -08:00
py-package-build-pipeline.yml
py-package-test-pipeline.yml Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 (#18614) 2023-11-29 10:32:42 -08:00
py-packaging-pipeline.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
qnn-ep-nuget-packaging-pipeline.yml Add --parallel to QNN EP NuGet pipeline build command (#19126) 2024-01-13 02:38:40 -08:00
web-ci-pipeline.yml update to emsdk-3.1.51 (#18844) 2024-01-12 16:04:33 -08:00
win-ci-fuzz-testing.yml [Fix] exception in Fuzz Test pipeline (#18984) 2024-01-03 14:53:31 +08:00
win-ci-pipeline.yml Disable ccache in Windows CPU CI pipeline (#19131) 2024-01-13 18:40:43 -08:00
win-gpu-ci-pipeline.yml Move Windows GPU training job to A10 (#19041) 2024-01-08 09:19:58 -08:00
win-gpu-reduce-op-ci-pipeline.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
win-gpu-tensorrt-ci-pipeline.yml Enable Address Sanitizer in CI (#19073) 2024-01-12 07:24:40 -08:00
win-qnn-arm64-ci-pipeline.yml [QNN EP] Support multithreaded inference of a single session (#18981) 2024-01-04 13:32:48 -08:00
win-qnn-ci-pipeline.yml [QNN EP] Support multithreaded inference of a single session (#18981) 2024-01-04 13:32:48 -08:00