* temporarily disable LSTM_Seq_lens_unpacked for dml test
* temporarily disable LSTM_Seq_lens_unpacked
Co-authored-by: Ethan Tao <ettao@microsoft.com>
1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve a build warning.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* pipeline transformer
* clean up
* address feedback
* add record/wait for first stage and updated split script
* address feedback
* make recv/send signal as initializer
* merge
* address feedback
* unify input and initializer
* address feedback and bug fix
* minor fix
* windows build
* fix
* Expand elimination and Expand gradient.
* Resolve comments.
* Fix test break.
* Check if graph can remove the node.
* Resolve comment.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* fixes for ort_trainer.py to resume from checkpoint
* define self.state_dict_ during init
* add comment of explanation
* add unit test for restore from checkpoint
* fix file not found
Co-authored-by: suffian khan <sukha@microsoft.com>
1. Centralize the warp size definition in common.cuh.
2. Rename it to GPU_WARP_SIZE, which can be extended to AMD GPUs later.
3. Centralize warp shuffle functions.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* checkin
* fix MSVC build error
* test changes
* split pivot output into multiple tensors
* add horizon tensor
* Support multiple types for non-pivot tensor
* limit horizon tensor type to int32_t as max_horizon type
* work around some conversion warnings for local machine
* support variadic shape for non-pivot input
* dropping all rows is an exception
* fix a bug
* fix the way the horizon tensor is generated
* more tests added
* add TypeConstraint() in ONNX_OPERATOR_KERNEL_EX
* update Featurizerslibrary
* Remove Useless Cast during Transformer.
* Resolve comments.
* Check if graph can remove the node.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Remove parameters like --gpu_only --sequence_length. Update bert GPU notebook accordingly.
* Remove input_int32 and float16 parameters from constructors of BertOnnxModel class and other classes derived from it.
* Update gpt2 benchmark. Add comments in gpt2 notebook to indicate work in progress. Clear notebook output before official 1.3.0 release is ready.
* Update TopK implementation.
- add faster heap
- special case k=1
- update selector for when to use heap and when to use nth_element based on performance testing
- parallelize if enough work to do
- reduce templatized code
- add some extra unit tests.
Perf tested vs. master. Average speedup is 3.75x using this combination of input sizes:
```
batches = [10, 25, 50]
batch_size = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
k = [1, 2, 4, 6, 8, 16, 24, 32, 48, 64, 128]
```
For larger batches (e.g. 50x2048) the speedup is over 20x.
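The selection strategy described in the TopK notes above can be sketched roughly as follows. This is an illustrative sketch only, not the ONNX Runtime kernel: the function names (`TopKOne`, `TopKHeap`, `TopKNthElement`, `TopK`) and the `kHeapThreshold` value are hypothetical, and the real selector was tuned from performance testing rather than a fixed constant.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper names; not taken from the ONNX Runtime sources.

// k == 1: a single linear scan is enough.
inline std::vector<float> TopKOne(const std::vector<float>& v) {
  return {*std::max_element(v.begin(), v.end())};
}

// Small k: keep a min-heap holding the k largest values seen so far.
inline std::vector<float> TopKHeap(const std::vector<float>& v, std::size_t k) {
  std::priority_queue<float, std::vector<float>, std::greater<float>> heap;
  for (float x : v) {
    if (heap.size() < k) {
      heap.push(x);
    } else if (x > heap.top()) {
      heap.pop();
      heap.push(x);
    }
  }
  std::vector<float> out(heap.size());
  // The smallest retained value pops first, so fill the output from the back.
  for (std::size_t i = out.size(); i-- > 0;) {
    out[i] = heap.top();
    heap.pop();
  }
  return out;
}

// Larger k: partition with nth_element, then sort only the first k values.
inline std::vector<float> TopKNthElement(std::vector<float> v, std::size_t k) {
  std::nth_element(v.begin(), v.begin() + (k - 1), v.end(), std::greater<float>());
  v.resize(k);
  std::sort(v.begin(), v.end(), std::greater<float>());
  return v;
}

// Returns the k largest values in descending order.
inline std::vector<float> TopK(const std::vector<float>& v, std::size_t k) {
  if (v.empty() || k == 0) return {};
  k = std::min(k, v.size());
  if (k == 1) return TopKOne(v);
  constexpr std::size_t kHeapThreshold = 8;  // assumed placeholder, not the tuned value
  return k <= kHeapThreshold ? TopKHeap(v, k) : TopKNthElement(v, k);
}
```

The sketch leaves out the parallelization across batches and works on values only; a production kernel would also track indices.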
Threadpool related changes.
Don't create ORT threadpool if openmp is enabled (except for inter op threadpool).
Created a new static function ThreadPool::NumThreads to account for openmp settings and null threadpool ptr.
Log a warning if SetIntraOpNumThreads is called while openmp is enabled.
Added a document for ORT devs.
Fix LSTM to use the new threadpool abstractions.
Rename GetNumCpuCores to GetThreadAffinityMasks and move it to the Env class.
Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
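The idea behind a static ThreadPool::NumThreads that tolerates a null pointer might look something like the sketch below. The class here is a hypothetical stand-in, not the ONNX Runtime ThreadPool. Two behaviours are assumptions made for illustration: returning omp_get_max_threads() when built with OpenMP, and returning 1 for a null pool.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for a thread pool; not the ONNX Runtime class.
class ThreadPool {
 public:
  explicit ThreadPool(int num_threads) : num_threads_(num_threads) {}

  // Static so that callers can pass a possibly-null pointer without checking first.
  static int NumThreads(const ThreadPool* tp) {
#ifdef _OPENMP
    // Assumption: with OpenMP enabled, parallelism comes from the OpenMP runtime,
    // so the pool pointer (which may be null) is ignored.
    (void)tp;
    return omp_get_max_threads();
#else
    // Assumption: no pool means the caller runs the work inline on one thread.
    return tp != nullptr ? tp->num_threads_ : 1;
#endif
  }

 private:
  int num_threads_;
};
```

Callers can then size their work partitions with ThreadPool::NumThreads(tp) regardless of whether a pool was created.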