mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-06-16 01:33:39 +00:00
### Model post process for zero stage3 training

This is the last change needed for single-GPU and multi-GPU runs to pass.

Design details: https://microsoft.sharepoint.com/:p:/t/ONNX2/EfNfJ43necpIoPI6x5M2zvYBVbfjoPQmG4Boc_F7-tHm1w?e=ekQwA6&nav=eyJzSWQiOjMxNiwiY0lkIjoxMDE1Nzg3NDZ9

`PyTorch` runs with ZeROOffloadSubscriber:

```
model = prepare_model(...)
from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
configure_ort_compatible_zero_stage3()
```

`ORTModule` runs with ZeROOffloadSubscriber:

```
import os
os.environ['ORTMODULE_ENABLE_ZERO_STAGE3'] = '1'
from onnxruntime.training.ortmodule import ORTModule
model = ORTModule(self.model)
```

Debugging convergence issues becomes much easier when both ORT and PyTorch can run the same offload path.

| Name | | |
|---|---|---|
| experimental | ||
| torch_cpp_extensions | ||
| __init__.py | ||
| _custom_autograd_function.py | ||
| _custom_autograd_function_exporter.py | ||
| _custom_autograd_function_runner.py | ||
| _custom_gradient_registry.py | ||
| _custom_op_symbolic_registry.py | ||
| _execution_agent.py | ||
| _fallback.py | ||
| _fallback_exceptions.py | ||
| _gradient_accumulation_manager.py | ||
| _graph_execution_interface.py | ||
| _graph_execution_manager.py | ||
| _graph_execution_manager_factory.py | ||
| _inference_manager.py | ||
| _io.py | ||
| _logger.py | ||
| _onnx_models.py | ||
| _runtime_inspector.py | ||
| _torch_module_factory.py | ||
| _torch_module_interface.py | ||
| _torch_module_ort.py | ||
| _torch_module_pytorch.py | ||
| _training_manager.py | ||
| _utils.py | ||
| _zero_stage3_compatibility.py | ||
| graph_transformer_registry.py | ||
| options.py | ||
| ortmodule.py | ||