* Pass the CUDA stream to Thrust functions so they do not use the default stream.
In commit 299ace0, ORT was changed to not use the CUDA default stream.
* update amd_hipify.py
* remove unnecessary stream sync
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* add async dispatch
* minor renamings
* build py38
* restore yml
* fix synchronization issue between dispatch thread and main thread
* fix comments
* refactor SummonWorker and rename to RunInParallelInternal
Currently, a high-dimensional MatMul calls multiple GEMMs sequentially. This change executes these GEMMs in parallel, removing the barriers between adjacent GEMM operations.
Performance was tested with the BERT and T5 models. BERT shows no noticeable performance difference, because the heavy lifting is done by the attention operator, which this PR does not change. T5 shows no regression at a low thread count (4), and the improvement becomes more pronounced at higher thread counts (8-16): T5 shows a 10% speedup with 16 threads. Profiling shows that the most expensive MatMul operators in T5 achieve around a 20% speedup with 16 threads.
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* first attempt rocm training wheel
* modifications needed to python packaging pipeline for Rocm 4.1
* changes to not conflict with cuda
missed stage1 changes
remove package push
add option r to getopt
try again without python install
try again without python install
try again without python install
split pipelines and add back push to remote storage
try on cuda gpu pool
try again
try again
try running without az subscription set
try again on original pipeline
change pool
passing AMD Rocm whl on AMD-GPU pool
split rocm pipeline from cuda pipeline
remove comments
* try adding Rocm tests as well
* try with tests in place
* fix trailing ws
* add training data
* try again as root for tests
* use python3
* typo
* try to map video, render group into container
* try again
* try again
* try to avoid yum error code
* make UID 1001
* try without yum downgrade
* define rocm_version=None
* remove CUDA related comments for Rocm Dockerfile
* Don't pin nightly torch, torchvision, and torchtext versions, as they expire (for now, nightly is required for Rocm 4.1)
* missed requirements-rocm.txt from last commit
* fix whitespace
* Made the python script generating the testcases modular.
* Modified RemoveBackToBackCasts function to remove cast even if the parent node has other consumers.
* Modified InsertCastNodes to update the graph consistently for other functions to work.
* Moved ConcatNames function to the top.
* PropagateBackward/SearchUpstream and PropagateFP16CastsFromOutputsToInputs insert FP32 casts if the level is greater than 1, in order to propagate FP16 casts backwards.
* Added new testcases for level two setting.
* initial dynamic load example
* support load EP in the provider options
* support dynamic load EP in orttrainer
* split the provider interface; fix comments in pr
* remove experiment code
* add test
* remove useless file
* add test model file; fix linux break
* fix linux build and missing file
* fix python build
* fix python build
* fix python binding
* fix python test
* fix runtime path for posix env
* exclude the shared library from minimal build
* fix comments in pr
* separate the provider shared lib loading
* excluded from minimal / macos / ios build
* skip copy the provider shared lib for minimal build and mac os
* fix macos build
* exclude the test for macos build
* exclude from android build
* exclude from web assembly build
* enable the invalid ep test
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* beam search refactoring checkin
* add factory class and deduplicate code
* one step beam search works on gpu
Co-authored-by: Xiaoyu Liu <xiaoyu@xiaoyu-VM.z4vh1dzj5eoevgybsksdpz2izh.jx.internal.cloudapp.net>
* Add gradient registration and tests for Min/Max
* Add helper function for min/max grad test
* limit Min/Max Grad to accept at most two inputs; modify test case accordingly
* resolve merge error
* Fixes needed to PropagateCast transformation.
* Added number of passes to the logs.
* Added logging support to OrtModuleGraphBuilder.
* Added new testcases.
* Added NodeArgToConsumerMap
* working on re-organizing js code for ortweb
* remove dup files
* move folder
* fix common references
* fix common es5
* add webpack to common
* split interface/impl
* use cjs for node
* add npmignore for common
* update sourcemap config for common
* update node
* adjust folder/path in CI and build
* update folder
* nit: readme
* add bundle for dev
* correct nodejs paths
* enable ORT_API_MANUAL_INIT
* set name for umd library
* correct name for commonjs export
* add priority into registerBackend()
* fix npm ci pwd
* update eslintrc
* revise code
* revert package-lock lockfileVersion 2->1
* update prebuild
* resolve comments
* update document
* revise eslint config
* update eslint for typescript rules
* revert changes by mistake in backend.ts
* add env
* resolve comments
Parallelize MinMax, Quantize and batched quantize GEMM
A performance problem was identified in the quantized T5 decoder model, with the DynamicMatMul operator as the culprit. This operator spends its time computing the MinMax of a tensor, quantizing a tensor, and performing a batched QGEMM. All of these can be parallelized.
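To illustrate how the MinMax and quantize steps split into independent chunks, here is a minimal Python sketch. It is not the code from this change: the helper names (`chunks`, `parallel_min_max`, `parallel_quantize`) and the chunk size are hypothetical, and it assumes uint8 asymmetric quantization in the style of the ONNX DynamicQuantizeLinear operator. Python threads only show the structure here; they will not reproduce the native speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(n, chunk_size):
    """Yield (start, end) index ranges that cover range(n)."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def parallel_min_max(data, pool, chunk_size=4):
    """Reduce each chunk independently, then combine the partial results."""
    def reduce_chunk(r):
        part = data[r[0]:r[1]]
        return min(part), max(part)

    partials = list(pool.map(reduce_chunk, chunks(len(data), chunk_size)))
    return min(p[0] for p in partials), max(p[1] for p in partials)


def parallel_quantize(data, pool, chunk_size=4):
    """Quantize floats to uint8 values; returns (values, scale, zero_point)."""
    lo, hi = parallel_min_max(data, pool, chunk_size)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must contain zero
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for all-zero input
    zero_point = max(0, min(255, round(-lo / scale)))

    def quantize_chunk(r):
        # Each chunk only reads shared state, so chunks are independent.
        return [max(0, min(255, round(x / scale) + zero_point))
                for x in data[r[0]:r[1]]]

    out = []
    for part in pool.map(quantize_chunk, chunks(len(data), chunk_size)):
        out.extend(part)  # pool.map preserves chunk order
    return out, scale, zero_point
```

The same thread pool is reused for both passes, so the only synchronization point is the combination of the partial min/max values.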
Currently, GEMM is parallelized. However, a batched GEMM calls GEMM sequentially multiple times. This repeatedly starts and ends parallel sections, which can be slow. So we made the following changes:
Parallel task partitioning no longer depends on the degree of parallelism, only on the shape of the matrices.
In a single GEMM, perform a 2D partition of the multiplication along panel lines to reduce repeated packing.
For batched GEMM, all parallel tasks are executed in a single parallel section, reducing the cost of starting threads and waiting for them to finish.
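The three changes can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the kernel code: `TILE_M`, `TILE_N`, `make_tasks`, and `batched_matmul` are hypothetical names, the tile sizes are arbitrary (a real kernel would align them with its packing panels), and matrices are plain nested lists.

```python
from concurrent.futures import ThreadPoolExecutor

TILE_M = 2  # hypothetical tile height
TILE_N = 2  # hypothetical tile width


def make_tasks(batch, m, n):
    """Enumerate (batch index, row range, column range) tiles.

    The task list depends only on the problem shape, never on the
    number of worker threads.
    """
    return [
        (bi, (i, min(i + TILE_M, m)), (j, min(j + TILE_N, n)))
        for bi in range(batch)
        for i in range(0, m, TILE_M)
        for j in range(0, n, TILE_N)
    ]


def batched_matmul(a, b, num_threads):
    """Compute c[i] = a[i] @ b[i] for every batch entry in one parallel section."""
    batch, m, k, n = len(a), len(a[0]), len(b[0]), len(b[0][0])
    c = [[[0.0] * n for _ in range(m)] for _ in range(batch)]

    def run_tile(task):
        bi, (r0, r1), (c0, c1) = task
        # Tiles write to disjoint output cells, so no locking is needed.
        for i in range(r0, r1):
            for j in range(c0, c1):
                c[bi][i][j] = sum(a[bi][i][p] * b[bi][p][j] for p in range(k))

    # One pool for the whole batch: threads start once and are joined once,
    # instead of once per GEMM.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(run_tile, make_tasks(batch, m, n)))
    return c
```

Because the tiles come from the shape alone, changing `num_threads` only changes how the same tasks are scheduled, not which tasks exist.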