The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/
ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …
Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.
Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Move fbs include from header to cc
* add initial cmake for flatbuffers
* Move most flatbuffers util to ort_flatbuffers
* move code around
* fix
* move test/perf runner to use flatbuffer directly instead of model
* minor update
* Fix build break
* Clean up includes and foward decl
* Fix traning CI build breaks
* Addressed PR comment, replaced some include with forward decls
* Remove ORT_MUST_USE_RESULT temporarily
* ORT on CUDA 11
1. Seperate HOROVOD and MPI
2. Seperate NCCL from HOROVOD in CMakeLists.txt
2. Remove dependency on external cub
3. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm30, sm50 are deprecated in CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
- Add support for ENABLE_LANGUAGE_INTEROP_OPS in training build which is enabled for nightly builds
- Fix passing of environment variables to `sudo docker run` in build definitions
- Fix setup.py package naming logic
* dashboard integration - first phase
* change a field
* perf scripts
* addressing PR comments
* address comments and fix build
* minor
* make GetConfigFromData() const
* more update for comments
* addressing comments
* more on addressing comments
* minor
* fix build
* add condition check
* more on comments
* retrun status
* remove batch size
* on comments
* rename pkg path
* rename pkg path
* additional commentss
Co-authored-by: Ethan Tao <ettao@microsoft.com>