16 commits

Author SHA1 Message Date
Edward Chen
9e65f3bfdb
Replace deprecated Python dependency sklearn with scikit-learn. (#13585) 2022-11-08 09:08:29 -08:00
Jeff Daily
65c67764ae
remove line "ADD model ${WORKSPACE_DIR}/model" in the amdgpu Dockerfile (#12914)
Follow-up to #12707. docker build is broken otherwise; model dir is gone.
2022-10-14 13:17:28 -07:00
Baiju Meswani
9e47eb68e0
Remove unused orttraining amd dockerfiles and scripts (#12707) 2022-09-02 18:43:21 -07:00
Justin Chu
fdce4fa6af
Format all python files under onnxruntime with black and isort (#11324)
Description: Format all python files under onnxruntime with black and isort.

After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.

#11315, #11316
2022-04-26 09:35:16 -07:00
Weixing Zhang
840212e115
Enable OneHot kernel for ROCm EP and add Dockerfile for ROCm 4.3.1 (#9656)
* enable OneHot for ROCm EP

* add dockerfile for ROCm 4.3.1

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-12-07 12:47:00 -08:00
Jeff Daily
d02de9c1bc
[ROCm] dockerfile updates (#7955)
* do not remove onnxruntime build directory in Dockerfile.rocm4.1.pytorch

* restore ONNX Runtime Training Examples to rocm 4.2 dockerfile
2021-06-10 23:50:19 -07:00
Weixing Zhang
dce76c15e7
add dockerfile for ROCm 4.2 (#7749)
* add dockerfile for ROCm 4.2
* using rocm/pytorch:rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
2021-06-08 08:02:27 -07:00
Peng
c2435d24ec
Clean up ROCm4.1 Dockerfile build directory (#7732)
* Clean up ROCm4.1 Dockerfile build directory

* remove the UCX and OMPI build directories after installation
2021-05-20 10:04:49 -07:00
Weixing Zhang
59b57d8322
HSA_NO_SCRATCH_RECLAIM and RCCL_ALLTOALL_KERNEL_DISABLE are not needed for ROCm 4.1 (#7224)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-02 18:19:11 -07:00
Jeff Daily
65ce5f07b3
add Dockerfile.rocm4.1.pytorch (#7152) 2021-03-26 21:40:10 -07:00
Suffian Khan
5cb8934459
update Dockerfile for workaround for issue in RCCL for rocm4.0 (#7108) 2021-03-23 13:36:04 -07:00
Jesse Benson
c562952750
Dockerfile to build onnxruntime with ROCm 4.0 2020-12-22 10:21:12 -08:00
Weixing Zhang
2705115732
add dockerfile for ROCm3.10 and update BUILD.md for ROCm EP (#5821)
* add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile

This works around an issue in the AMD compiler, which generates poor GPU ISA when a kernel parameter is a structure passed by value.

* update BUILD.md

* add dockerfile for rocm3.10
2020-12-08 23:14:56 -08:00
Weixing Zhang
fc614ad050
revert the code change which was based on b4869926
Change b4869926, which removed the per-thread allocator, causes a segfault in distributed training.

In addition, add dockerfile for ROCm3.9
2020-11-15 00:24:32 -08:00
Weixing Zhang
fff85a6a35
Add GPU kernels for ROCm EP (#5655)
* Add kernels for AMD GPU.

This PR is mostly about GPU kernels for the ROCm EP. Because the GPU programming languages (CUDA and HIP) and their math library calls are similar, one principle in the ROCm EP design is to share CUDA kernels with ROCm as much as possible. The script amd_hipify.py was therefore created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. For reasons such as performance issues and syntax differences, some converted kernels still need manual intervention. These kernels are checked in to the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels so that they are as portable as possible between the CUDA EP and the ROCm EP.

Please refer to "HIP Porting Guide" for details.

* Like LAMB, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2, because the current AMD GPU compiler has a performance issue with kernel parameters that are structures passed by value.

* Use hipMemsetAsync and add checks on HIP calls.

* move the generated files to build folder.
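
The automatic CUDA-to-HIP conversion described in this commit message can be pictured as a source-to-source rename pass. The following is only a minimal sketch under stated assumptions, not the actual amd_hipify.py (its contents are not shown in this log): the function name `hipify_source` and the small mapping table are hypothetical, chosen for illustration. The CUDA and HIP identifiers in the table are real counterparts of one another.

```python
import re

# Hypothetical, deliberately tiny mapping for illustration only.
# The rules used by the real amd_hipify.py are not shown in this log.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMemsetAsync": "hipMemsetAsync",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaStream_t": "hipStream_t",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match whole identifiers only: a name must not be preceded or followed
# by another identifier character. Longer names are listed first.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in _NAMES)
    + r")(?![A-Za-z0-9_])"
)


def hipify_source(cuda_source: str) -> str:
    """Return a copy of cuda_source with known CUDA names replaced by HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], cuda_source)


if __name__ == "__main__":
    example = (
        "#include <cuda_runtime.h>\n"
        "cudaError_t e = cudaMemsetAsync(p, 0, n, stream);\n"
    )
    print(hipify_source(example))
```

A plain rename pass like this covers only the cases where the two APIs line up one to one, which is consistent with the note above that some converted kernels still need manual changes.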

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2020-11-06 16:11:06 -08:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented on top of ROCm, the AMD GPU software stack. For details about ROCm, see https://rocmdocs.amd.com/en/latest/

The ROCm EP is built on the following components:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: MIOpen
5. Collective communication library: RCCL
6. cub: hipCUB
7. …

Current status:
BERT-L and GPT2 training can be run on AMD GPUs with data parallelism.

Next:
1. Make more GPU code shareable between the ROCm EP and the CUDA EP, since the HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The ROCm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: one for the ROCm kernels and one for everything else.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00