From 93bf7c4d52bc96fb3ffbba39175b5ce7db3e01ae Mon Sep 17 00:00:00 2001
From: baijumeswani
Date: Mon, 4 Jan 2021 10:09:39 -0800
Subject: [PATCH] Documentation for distributed CI tests pipeline (#6140)

---
 ...ow_to_add_distributed_ci_pipeline_tests.md | 54 +++++++++++++++++++
 ...linux-gpu-distributed-test-ci-pipeline.yml |  3 +-
 2 files changed, 56 insertions(+), 1 deletion(-)
 create mode 100644 orttraining/orttraining/test/python/how_to_add_distributed_ci_pipeline_tests.md

diff --git a/orttraining/orttraining/test/python/how_to_add_distributed_ci_pipeline_tests.md b/orttraining/orttraining/test/python/how_to_add_distributed_ci_pipeline_tests.md
new file mode 100644
index 0000000000..cfafe7f696
--- /dev/null
+++ b/orttraining/orttraining/test/python/how_to_add_distributed_ci_pipeline_tests.md
@@ -0,0 +1,54 @@
+## Getting Started
+
+This guide explains how the distributed CI pipeline works and how to add tests to it.
+
+### The Pipeline
+
+The distributed CI pipeline is intended for running tests that require a distributed environment (for example, tests that need to be run with ```mpirun```).
+The pipeline ```yml``` file is defined in [```tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml```](https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml).
+The pipeline runs on every pull request commit under the [```orttraining-distributed```](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=140&_a=summary) check.
+The flow of events in the pipeline is:
+
+1. Clone the git repository and check out the branch that the CI needs to run against (the pull request).
+2. Build the Docker container, installing all dependencies that are needed for the distributed tests (for example, ```open-mpi```).
+3. Run all tests defined in the file [```orttraining/orttraining/test/python/orttraining_distributed_tests.py```](https://github.com/microsoft/onnxruntime/blob/master/orttraining/orttraining/test/python/orttraining_distributed_tests.py) through the script [```orttraining/orttraining/test/python/launch_test.py```](https://github.com/microsoft/onnxruntime/blob/master/orttraining/orttraining/test/python/launch_test.py).
+4. Report the status of the tests.
+
+## Running Locally
+
+To run the entire set of distributed tests locally, run the following command from the build directory:
+```sh
+python orttraining_distributed_tests.py
+```
+
+> **Note**: This set of tests can only be run on a machine with multiple GPUs; the tests will terminate if fewer than 2 GPUs are available.
+
+## Adding Tests to the Pipeline
+
+Follow the steps below to add new distributed tests that will run in this pipeline.
+
+1. Create a new Python file that can be called as a script. Let's call this ```dummy_distributed_test.py``` as an example.
+2. Make sure ```dummy_distributed_test.py``` can be executed using either ```python dummy_distributed_test.py``` or ```mpirun -n <num_gpus> -x NCCL_DEBUG=INFO python dummy_distributed_test.py```. A real example of such a test file is [```orttraining/orttraining/test/python/orttraining_test_checkpoint.py```](https://github.com/microsoft/onnxruntime/blob/master/orttraining/orttraining/test/python/orttraining_test_checkpoint.py).
+3. Create a new function in ```orttraining/orttraining/test/python/orttraining_distributed_tests.py```:
+    ```python
+    def run_dummy_distributed_tests(cwd, log):
+        log.debug('Running: Dummy distributed tests')
+
+        command = [sys.executable, 'dummy_distributed_test.py']
+
+        run_subprocess(command, cwd=cwd, log=log).check_returncode()
+    ```
+    Refer to ```run_checkpoint_tests()``` for an example.
+4. Add a call to ```run_dummy_distributed_tests()``` in the ```main()``` function in ```orttraining/orttraining/test/python/orttraining_distributed_tests.py```:
+    ```python
+    run_dummy_distributed_tests(cwd, log)
+    ```
+    Refer to ```run_checkpoint_tests()``` for an example.
+5. Run the distributed test suite on a local machine and ensure there are no failures.
+    ```sh
+    python orttraining_distributed_tests.py
+    ```
+
+> **Note**: If the test requires multiple ```run_subprocess()``` calls, restructure the test file(s) such that they have a single entry point. Refer to ```orttraining/orttraining/test/python/orttraining_test_checkpoint.py``` for an example.
+
+Once the steps above pass locally, submit a pull request; the tests should then be executed in the [distributed CI pipeline](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=140&_a=summary). Search for ```'Running: Dummy distributed tests'``` in the pipeline logs to confirm that the newly added tests actually ran in the pipeline.
diff --git a/tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml b/tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml
index 3737ad65a1..743306bc9a 100644
--- a/tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml
+++ b/tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-distributed-test-ci-pipeline.yml
@@ -25,7 +25,8 @@ jobs:
         -m
       DisplayName: 'Build'
 
-  # all distributed tests
+  # Entry point for all distributed CI tests.
+  # Refer to orttraining/orttraining/test/python/how_to_add_distributed_ci_pipeline_tests.md for guidelines on how to add new tests to this pipeline.
   - script: |
       docker run \
         --gpus all \