* fixes for ort_trainer.py to resume from checkpoint
* define self.state_dict_ during init
* add explanatory comment
* add unit test for restore from checkpoint
* fix file not found
Co-authored-by: suffian khan <sukha@microsoft.com>
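To illustrate why defining `self.state_dict_` during init matters for resuming from a checkpoint, here is a minimal sketch. `TinyTrainer` and its methods are hypothetical stand-ins, not the ORTTrainer API; the assumption is that a loaded checkpoint is held until the first training step applies it.

```python
class TinyTrainer:
    """Hypothetical stand-in for a trainer; not the ORTTrainer API."""

    def __init__(self):
        # Defined during init so load_state_dict() can be called before the
        # first training step without hitting a missing attribute.
        self.state_dict_ = None
        self.weights = None

    def load_state_dict(self, state_dict):
        # Only remember the checkpoint here; it is applied on the first step.
        self.state_dict_ = dict(state_dict)

    def train_step(self):
        if self.weights is None:
            # First step: start from the checkpoint if one was loaded,
            # otherwise from a fresh (toy) initial value.
            if self.state_dict_ is not None:
                self.weights = dict(self.state_dict_)
            else:
                self.weights = {"w": 0.0}
        self.weights["w"] += 1.0  # toy "update"
        return self.weights["w"]

    def state_dict(self):
        return dict(self.weights) if self.weights is not None else {}
```

A unit test for restore would train a few steps, save `state_dict()`, load it into a fresh trainer, and check that training continues from the saved value.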
1. Centralize the warp size definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Remove Useless Cast during Transformer.
* Resolve comments.
* Check if graph can remove the node.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
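The Cast-removal change above can be sketched on a toy graph. This is a simplified illustration, not the transformer in the repository: the node dictionaries, `value_types`, and the "not a graph output" removability check are all assumptions made for the example, and nodes are assumed to be in topological order.

```python
def remove_useless_casts(nodes, value_types, graph_outputs):
    """Drop Cast nodes whose input already has the target type.

    nodes: list of dicts with 'op', 'inputs', 'outputs' (and 'to' for Cast).
    value_types: maps each value name to its element type.
    graph_outputs: set of value names that are outputs of the graph.
    """
    kept = []
    rename = {}  # output of a removed Cast -> the Cast's input
    for node in nodes:
        # Rewire consumers of previously removed Casts.
        inputs = [rename.get(name, name) for name in node["inputs"]]
        node = {**node, "inputs": inputs}
        is_noop_cast = (
            node["op"] == "Cast" and value_types.get(inputs[0]) == node["to"]
        )
        # Hypothetical removability check: keep nodes that produce a graph output.
        can_remove = is_noop_cast and node["outputs"][0] not in graph_outputs
        if can_remove:
            rename[node["outputs"][0]] = inputs[0]
        else:
            kept.append(node)
    return kept
```

In this sketch a no-op Cast that feeds the graph output is left in place, which mirrors the idea of checking whether the graph can remove the node before doing so.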
* Remove parameters such as --gpu_only and --sequence_length. Update bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it.
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* Update TopK implementation.
- add faster heap
- special case k=1
- update selector for when to use heap and when to use nth_element based on performance testing
- parallelize if enough work to do
- reduce templatized code
- add some extra unit tests.
Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:
```
batches = [10, 25, 50]
batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
```
For larger batches (e.g. 50x2048) the speedup is over 20x.
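The strategy selection described above can be sketched as follows. This is an illustrative Python version, not the C++ implementation: `heap_threshold` is a hypothetical cutoff (the real selector was tuned by performance testing), and a full sort stands in for the `nth_element` path.

```python
import heapq


def top_k(values, k, heap_threshold=4):
    """Return the k largest values in descending order.

    heap_threshold is a hypothetical cutoff: the heap is used when
    k * heap_threshold <= len(values).
    """
    n = len(values)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)
    if k == 1:
        # Special case: a single linear scan is enough.
        return [max(values)]
    if k * heap_threshold <= n:
        # Small k relative to n: keep a min-heap of the k largest seen so far.
        heap = list(values[:k])
        heapq.heapify(heap)
        for v in values[k:]:
            if v > heap[0]:
                heapq.heapreplace(heap, v)
        return sorted(heap, reverse=True)
    # Large k: selection-based path (a full sort stands in for nth_element).
    return sorted(values, reverse=True)[:k]
```

Both paths return the same result; only the cost differs, which is why the choice between them is driven by measurements rather than correctness.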
Threadpool related changes.
Don't create ORT threadpool if openmp is enabled (except for inter op threadpool).
Created a new static function ThreadPool::NumThreads to account for openmp settings and null threadpool ptr.
Log a warning when using SetIntraOpNumThreads when openmp is enabled.
Added a document for ORT devs.
Fix LSTM to use the new threadpool abstractions.
Rename GetNumCpuCores to GetThreadAffinityMasks and move it to the Env class.
Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
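A rough sketch of the idea behind a static `NumThreads` helper, written in Python for illustration only. The exact rules of the real `ThreadPool::NumThreads` are not given here; the parameters `openmp_enabled` and `omp_max_threads` and the fallback of 1 for a null pool are assumptions.

```python
class ThreadPool:
    """Toy thread pool holding only a thread count (hypothetical)."""

    def __init__(self, num_threads):
        self._num_threads = num_threads

    @staticmethod
    def num_threads(tp, openmp_enabled=False, omp_max_threads=1):
        # Assumption: with OpenMP enabled there is no ORT intra-op pool,
        # so the OpenMP setting decides the degree of parallelism.
        if openmp_enabled:
            return omp_max_threads
        # Assumption: a null pool means the caller runs single-threaded.
        if tp is None:
            return 1
        return tp._num_threads
```

Making the helper static lets callers ask for the thread count without first checking whether a pool exists, for example `ThreadPool.num_threads(None)`.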
* add frontend MNIST test
* use torch nightly with torchvision
* remove incorrect comment per reviewer's comment
* experiment with torchvision import failure
* experiment with install_deps.sh
* more experiments with install_deps.sh
* experiment with install_deps.sh using --upgrade
* Experiment with install_deps.sh.
* Experiment with install_ubuntu.sh.
* Use Ubuntu 18.04 and Python 3.6 for CI.
* Update cmake version for CI.
* Install MPI on Ubuntu 18.04 for CI.
* Increase tolerance for MNIST test.
* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.
* Clean-up.
* Update ort_trainer.py from ort_training.
* Get default Ubuntu Python ver back to 3.5.
* Add underscore to opset_version parameter name in ORTTrainer constructor.
* Move loss/model wrap before the call for sample output.
* Update expected values for MNIST test.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
* Generalize reshape fusion
* Allow arbitrary number of Concat arguments
* Apply fusion even when an output of an internal node is used elsewhere
* Fix a bug when an internal node's output is the subgraph output
* Simplify code
* Add Attention fusion for GPT2
* Support distilgpt2 in benchmark_gpt2.py
* Add options to disable Attention/SkipLayerNormalization/EmbedLayerNormalization/BiasGelu fusions
* Add logging at the beginning of each fusion
* Update notebooks: Add Gpt2OnnxModel.py to list of script files.
* Add test for gpt2 model optimization
* Add optional parameters (--input_ids --segment_ids --input_mask) for graph inputs
* Fuse BiasGelu
* Handle model that does not have segment_ids input.
* Allow fusing embed layer without mask