onnxruntime/onnxruntime/test/testdata/ort_ckpt
ashbhandare 58f53966d3
Add Distributed Checkpointing support (#3639)
* Change naming of moments to Moment_x_<weight_name>

* Add checkpointing code and zero checkpoint aggregation

* Correct aggregation for LAMB, cleanup

* Add simple checkpointing test

* Add test for zero checkpoint aggregation

* Fix tests

* fix test

* Review changes

* Fix test after review comment fix

* Fix API, test

* Fix test after API change

* Decouple save load from ORTTrainer

* Add flag to not break checkpointing with ORTModel

Co-authored-by: aishwarya bhandare <aibhanda@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-04-29 14:52:21 -07:00
bert_toy_lamb.ZeRO.0.3.ort.pt Add Distributed Checkpointing support (#3639) 2020-04-29 14:52:21 -07:00
bert_toy_lamb.ZeRO.1.3.ort.pt Add Distributed Checkpointing support (#3639) 2020-04-29 14:52:21 -07:00
bert_toy_lamb.ZeRO.2.3.ort.pt Add Distributed Checkpointing support (#3639) 2020-04-29 14:52:21 -07:00
bert_toy_lamb.ZeRO.3.3.ort.pt Add Distributed Checkpointing support (#3639) 2020-04-29 14:52:21 -07:00