* Remove parameters such as --gpu_only and --sequence_length. Update the BERT GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it.
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* Update TopK implementation.
- add faster heap
- special case k=1
- update selector for when to use heap and when to use nth_element based on performance testing
- parallelize if enough work to do
- reduce templatized code
- add some extra unit tests.
Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:
```
batches = [10, 25, 50]
batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
```
For larger batches (e.g. 50x2048) the speedup is over 20x.
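As a rough illustration of the selection strategy listed above (special-casing k=1, using a heap for small k, and a different path for large k), here is a minimal Python sketch. The function name `top_k` and the `heap_threshold` cutoff are hypothetical; the actual selector in this change was tuned from performance testing and is not reproduced here. Python has no standard-library equivalent of `nth_element`, so a full sort stands in for that path.

```python
import heapq


def top_k(values, k, heap_threshold=0.25):
    """Return the k largest values in descending order.

    heap_threshold is an illustrative cutoff (assumption), not the
    tuned selector from this change.
    """
    n = len(values)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)
    if k == 1:
        # Special case: a single linear scan, no heap or sort needed.
        return [max(values)]
    if k <= n * heap_threshold:
        # Small k relative to n: bounded heap, O(n log k).
        return heapq.nlargest(k, values)
    # Large k: stand-in for the nth_element path (a full sort is used
    # here because the Python stdlib has no partial-selection routine).
    return sorted(values, reverse=True)[:k]
```

All three paths return the same result for the same input; only the cost differs, which is why the choice between them can be driven purely by measurements.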
Threadpool-related changes:
Don't create ORT threadpool if openmp is enabled (except for inter op threadpool).
Created a new static function ThreadPool::NumThreads to account for openmp settings and null threadpool ptr.
Log a warning when using SetIntraOpNumThreads when openmp is enabled.
Added a document for ORT devs.
Fix LSTM to use the new threadpool abstractions.
Rename GetNumCpuCores to GetThreadAffinityMasks and move it to the Env class.
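The idea behind a single helper that accounts for both OpenMP settings and a null threadpool pointer can be sketched as follows. This is a Python illustration of the decision logic only, not the C++ implementation: `FakeThreadPool`, `num_threads`, and the `openmp_enabled` / `openmp_max_threads` parameters are hypothetical names, and returning 1 for a missing pool is an assumption (work runs on the calling thread).

```python
from typing import Optional


class FakeThreadPool:
    """Hypothetical stand-in for a threadpool object."""

    def __init__(self, size: int) -> None:
        self.size = size


def num_threads(pool: Optional[FakeThreadPool],
                openmp_enabled: bool,
                openmp_max_threads: int = 1) -> int:
    """Return the degree of parallelism a caller may assume."""
    if openmp_enabled:
        # With OpenMP, the pool is not created, so the OpenMP setting
        # decides the thread count (assumption for this sketch).
        return openmp_max_threads
    if pool is None:
        # No pool: assume work runs on the calling thread only.
        return 1
    return pool.size
```

Routing every caller through one helper like this means call sites do not need their own null checks or build-specific branches.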
Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
* Add frontend MNIST test
* Use torch nightly with torchvision
* Remove incorrect comment per reviewer feedback
* Experiment with torchvision import failure
* Experiment with install_deps.sh
* More experiments with install_deps.sh
* Experiment with install_deps.sh using --upgrade
* Experiment with install_deps.sh.
* Experiment with install_ubuntu.sh.
* Use Ubuntu 18.04 and Python 3.6 for CI.
* Update cmake version for CI.
* Install MPI on Ubuntu 18.04 for CI.
* Increase tolerance for MNIST test.
* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.
* Clean-up.
* Update ort_trainer.py from ort_training.
* Get default Ubuntu Python ver back to 3.5.
* Add underscore to opset_version parameter name in ORTTrainer constructor.
* Move loss/model wrap before the call for sample output.
* Update expected values for MNIST test.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
* Generalize reshape fusion
* Allow arbitrary number of Concat arguments
* Apply fusion even when an output of an internal node is used elsewhere
* Fix a bug when an internal node's output is the subgraph output
* Simplify code