* beam search refactoring checkin
* add factory class and deduplicate code
* one step beam search works on gpu
Co-authored-by: Xiaoyu Liu <xiaoyu@xiaoyu-VM.z4vh1dzj5eoevgybsksdpz2izh.jx.internal.cloudapp.net>
* Add gradient registration and tests for Min/Max
* Add helper function for min/max grad test
* limit Min/Max Grad to accept at most two inputs; modify test case accordingly
* resolve merge error
* Fixes needed to PropagateCast transformation.
* Added number of passes to the logs.
* Added logging support to OrtModuleGraphBuilder.
* Added new testcases.
* Added NodeArgToConsumerMap
* working on re-organizing js code for ortweb
* remove dup files
* move folder
* fix common references
* fix common es5
* add webpack to common
* split interface/impl
* use cjs for node
* add npmignore for common
* update sourcemap config for common
* update node
* adjust folder/path in CI and build
* update folder
* nit: readme
* add bundle for dev
* correct nodejs paths
* enable ORT_API_MANUAL_INIT
* set name for umd library
* correct name for commonjs export
* add priority into registerBackend()
* fix npm ci pwd
* update eslintrc
* revise code
* revert package-lock lockfileVersion 2->1
* update prebuild
* resolve comments
* update document
* revise eslint config
* update eslint for typescript rules
* revert changes made by mistake in backend.ts
* add env
* resolve comments
Parallelize MinMax, Quantize and batched quantize GEMM
A performance problem was identified in the quantized T5 decoder model, and the DynamicMatMul operator was found to be the culprit. This operator spends its time computing the MinMax of a tensor, quantizing a tensor, and performing a batched qgemm. All of these can be parallelized.
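As an illustration only, the MinMax and quantize steps could be parallelized along the following lines. This is a minimal pure-Python sketch, not the code from this change: the names `chunks`, `parallel_min_max`, `quantize_params`, and `parallel_quantize`, the `CHUNK` size, and the asymmetric uint8 scheme are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # hypothetical chunk size; a real kernel would use a much larger one


def chunks(n, size=CHUNK):
    """Split range(n) into fixed-size [start, end) pieces, independent of thread count."""
    return [(s, min(s + size, n)) for s in range(0, n, size)]


def parallel_min_max(data, pool):
    """Reduce each chunk in its own task, then combine the partial results serially."""
    partial = list(pool.map(
        lambda c: (min(data[c[0]:c[1]]), max(data[c[0]:c[1]])),
        chunks(len(data))))
    return min(p[0] for p in partial), max(p[1] for p in partial)


def quantize_params(lo, hi, qmin=0, qmax=255):
    """Assumed asymmetric uint8 scheme; the range is widened to include 0.0."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def parallel_quantize(data, scale, zero_point, pool, qmin=0, qmax=255):
    """Quantize each chunk in its own task and concatenate the results in order."""
    def work(c):
        return [max(qmin, min(qmax, int(round(x / scale)) + zero_point))
                for x in data[c[0]:c[1]]]

    out = []
    for part in pool.map(work, chunks(len(data))):
        out.extend(part)
    return out
```

Both steps split the tensor into chunks whose boundaries depend only on its length, so the result does not change with the number of worker threads.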
Currently, GEMM is parallelized. In batched GEMM, however, we call GEMM sequentially multiple times, so parallel sections are started and ended repeatedly, which can be slow. We therefore made the following changes:
Parallel task partitioning no longer depends on the degree of parallelism, only on the shapes of the matrices.
In a single GEMM, perform a 2D partition of the multiplication, along panel lines, to reduce repeated packing.
For batched GEMM, all parallel tasks are executed in a single parallel section, reducing the cost of starting threads and waiting for them to finish.
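The task layout described above can be sketched roughly as follows. This is a simplified pure-Python illustration under stated assumptions, not the MLAS implementation: `gemm_tasks`, `batched_gemm`, and the `TILE_M`/`TILE_N` sizes are hypothetical, the thread pool stands in for a parallel section, and packing is not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor

TILE_M, TILE_N = 2, 2  # hypothetical tile sizes; fixed, so tiling depends only on shapes


def gemm_tasks(batch, m, n):
    """Enumerate 2D output tiles (batch index, row range, column range) for the whole batch."""
    return [(b, r, min(r + TILE_M, m), c, min(c + TILE_N, n))
            for b in range(batch)
            for r in range(0, m, TILE_M)
            for c in range(0, n, TILE_N)]


def batched_gemm(a, b, max_workers=4):
    """a: batch x M x K nested lists, b: batch x K x N nested lists."""
    batch, m, k, n = len(a), len(a[0]), len(b[0]), len(b[0][0])
    out = [[[0] * n for _ in range(m)] for _ in range(batch)]

    def run_tile(task):
        # Each task owns a disjoint block of the output, so no locking is needed.
        bi, r0, r1, c0, c1 = task
        for i in range(r0, r1):
            for j in range(c0, c1):
                out[bi][i][j] = sum(a[bi][i][p] * b[bi][p][j] for p in range(k))

    # One pool (one "parallel section") covers the tiles of every GEMM in the batch.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run_tile, gemm_tasks(batch, m, n)))
    return out
```

Because the task list is built from the matrix shapes alone, changing `max_workers` changes only how the tiles are scheduled, not how the work is divided.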
* IsInf ReduceSum transform
* Revert unnecessary changes, add isinf_only and isnan_only attr
* add tests, review comments
* Disable test for non-cuda
* Move IsAllFinite from training to contrib op
* review comments
* Review comment, formatting
* Enable test for ROCm EP
* Add DropoutGrad function body
* Fix documentation and add test cases
* Fix template specialization
* Check expansion for float16 and bfloat16
* Refactor mlas unittest.
* Fix build issue on Linux (non-MSVC).
* Fix unused-variable CI issue that seems to affect old gnuc.
* Move to unittest folder one level down, and make some other wording changes.
* Fix typo that caused some tests to be wrong.
* Correct the registered test_case count where some were missing.