Commit graph

81 commits

Author SHA1 Message Date
kailums
dd28f09ce2
fix issue when build with hipblasLt on rocm6.1 (#22553)
### Description
<!-- Describe your changes. -->

hipblasLt library is released with rocm6.x, and current onnxruntime's
code need some modifications to match new hipblasLt API.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-28 13:57:08 +08:00
Jeff Daily
5aabc53121
[ROCm] redo hipify of version controlled files (#22449)
### Description
Updates the ROCm EP opsets to match the current CUDA EP opsets. Also
enable the test CApiTest.basic_cuda_graph_with_annotation.

Note that some changes are whitespace-only. These changes were made to
improve the comparison of corresponding ROCm and CUDA EP source files
when using a side by side diff tool.

### Motivation and Context
The ROCm EP derives from the CUDA EP. Many source files are shared
between the EPs and "hipified" during the ROCm EP build, however quite a
few files within the ROCm EP are under source control after their
initial hipification. Over time these ROCm EP files get stale relative
to their CUDA EP counterparts. It becomes necessary to re-hipify these
otherwise static files in order to pick up important changes such as
opset differences.
2024-10-18 12:40:54 -07:00
Jeff Daily
8c21680ffc
[ROCm] prefer hip interfaces over roc during hipify (#22394)
### Description
Change the hipify step to remove the -roc option to hipify-perl. This
will prefer hipblas over rocblas. rocblas can still be called directly
such as in TunableOp.

### Motivation and Context
hip interfaces are preferred over roc for porting from cuda to hip.
Calling roc interfaces is meant for ROCm-specific enhancements or
extensions.
2024-10-14 20:34:03 -07:00
PeixuanZuo
6226c5f62f
[ROCm] Add SkipGroupNorm for ROCm EP (#19303)
Add SkipGroupNorm for ROCm EP.

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-02-21 11:08:48 +08:00
Tianlei Wu
c4b49fb7bf
[CUDA] remove CUBLAS_TENSOR_OP_MATH mode (#19431)
This pull request replaces `CUBLAS_TENSOR_OP_MATH` with
`CUBLAS_DEFAULT_MATH`. The changes affect several files, including test
cases and a Python script for AMD hipify process.

### Motivation and Context

CUBLAS_TENSOR_OP_MATH mode is deprecated:
https://docs.nvidia.com/cuda/cublas/index.html#cublasmath-t

On CUDA versions prior to 11, users are required to set the math mode to
CUBLAS_TENSOR_OP_MATH manually to be able to use tensor cores for FP16.
On CUDA 11 and CUDA 12, this is no longer required. Since latest ORT
only supports CUDA >= 11 so it is safe to remove CUBLAS_TENSOR_OP_MATH
from our code base.
2024-02-06 12:48:39 -08:00
Ted Themistokleous
7b2aefa856
undo hipify of __half to rocblas_half (#18573)
Fixes build issue seen with newer ROCm releases

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-11-24 18:04:23 +08:00
Jeff Daily
07317316cc
CUDA EP vs ROCM EP hipify audit (#17776)
Migrate most CUDA EP improvements and changes to ROCM EP. The process
involves using hipify against all CUDA EP files (i.e. do not exclude any
files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them
against the ROCM EP files that are under source control and pull in most
changes. These changes include functional as well as formatting and
makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff
somewhat less obvious due to formatting changes.

- hipify audit of onnxruntime/core/providers/rocm, enable ops
  - Loop
  - Scan
- hipify audit of onnxruntime/contrib_ops/rocm
- fix contrib ops search implementation
- enable more contrib ops
  - Affine
  - ComplexMul
  - ConvTransposeWithDynamicPads
  - Crop
  - DynamicSlice
  - FFT [Rfft, Irfft]
  - GreedySearch
  - ImageScaler
  - ParametricSoftplus
  - ScaledTanh
  - ThresholdRelu

---------

Co-authored-by: cloudhan <cloudhan@outlook.com>
2023-10-13 10:13:53 +08:00
Justin Chu
d834ec895a
Adopt linrtunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and MacOs.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are applied `# noqa` for now and
should be fixed in follow up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
ytaous
d632f9a3fa
[ROCm] Enable Sampling Op UT on AMD (#14581)
Making basic porting effort to run Sampling UT on ROCm ep, based on the
commits:

https://github.com/microsoft/onnxruntime/pull/13426
https://github.com/microsoft/onnxruntime/pull/14218

1. enabling EmbedLayerNorm op
2. enabling Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraph

This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-06 20:52:06 -08:00
Jeff Daily
fe052e603b
ROCm header path updates (#14170)
ROCm reorganized header file locations. Use the new locations to avoid
warnings.
2023-01-16 10:28:13 +08:00
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR including following works:
1. provide stream and related synchronization abstractions in
onnxruntime.
2. enhance onnxruntime's execution planner / executor / memory arena to
support execute multiple streams in parallel.
3. deprecate the parallel executor for cpu.
4. deprecate the Fence mechanism. 
5. update the cuda / tensorrt EP to support the stream mechanism,
support running different request in different cuda stream.

**Motivation and Context**
- Why is this change required? 
currently, the execution plan is just a linear list of those primitives,
ort will execute them step by step. For any given graph, ORT will
serialize it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. it is difficult to enable inter-node parallelization, we have a
half-baked parallel executor but it is very difficult to make it work
with GPU.
2. The fence mechanism can work with single gpu stream + cpu thread
case, but when extend to multiple stream, it is difficult to manage the
cross GPU stream synchronizations.
3. our cuda EP rely on the BFCArena to make the memory management work
with the GPU async kernels, but current BFCArena is not aware of the
streams, so it doesn't behavior correctly when run with multiple
streams.

This PR enhance our existing execution plan and executor to support
multiple stream execution. we use an unified algorithm to mange both
single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream
execution, that is said, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be in the future PR.

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
cloudhan
928c9fc348
Hipify during build instead of before cmake config (#13333)
### Description

Currently, hipify happens before cmake is configured and then cmake glob
the directories. This get rids of thoes customized python threading
logic and opt for build system itself to generate the files.

This also supersede the half baked branch
[sukha/hipify-with-cmake](https://github.com/microsoft/onnxruntime/tree/sukha/hipify-with-cmake)
2022-10-20 22:46:22 -07:00
ytaous
2cc4e7e5c2
[Build] Fix broken AMD CI (#13082)
Introduced by https://github.com/microsoft/onnxruntime/pull/12949
- add missing lines in excluded list

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2022-09-24 00:21:25 -07:00
cloudhan
e9d91cac55
Fix hipify not running if the pwd is not the root of onnxruntime repo (#12941) 2022-09-21 14:27:01 +08:00
cloudhan
14365b67a0
Fix hipify due to CUDA EP tensorrt_fused_multihead_attention optimization (#12990)
Recent change in CUDA EP #12814 makes hipify extremely slow and breaks the building. This PR fixes it by c

The onnxruntime/contrib_ops/rocm/bert/attention.h is checkout-ed from the version before #12814 and manually hipify-ed.
Slightly extend amd_hipify.py to allow wildcard file match and exclude all `tensorrt_fused_multihead_attention/*` files from hipify
2022-09-19 15:29:23 +08:00
Tianlei Wu
95c4fc6877
[CUDA] Add TensorRT fused attention fp16 v2 kernels (#12814)
* Add TensorRT fused attention fp16 kernels
* drop sm 72;  seq 512 for sm75; and head_size 32 kernels
* Add env variable ORT_DISABLE_FUSED_ATTENTION
* exclude files in hipify
* update AttentionPastState_dynamic test threshold
* fix --use_mask_index in benchmark
2022-09-13 15:16:12 -07:00
Hariharan Seshadri
ad69aac491
Introduce ordered quantization ops for the CUDA EP [1/n] (#12582)
Initial core small set for the ordered quantization ops for cuda EP.
2022-09-07 11:58:15 -07:00
Vincent Wang
262a597e2a
[CUDA] BiasSoftmax and Dropout Fusion (#12667)
* bias softmax dropout fusion

* fix rocm build

* move some files
2022-09-01 10:01:44 +08:00
mwootton
817dc94345
Add first pass of rocm kernel profiler (#10911)
* Add first pass of rocm kernel profiler

* Clean up rocm_profiler. Format args. Demangle kernel names.
Add Api EventRecords

* Remove debug output

* Temporarily disable profiling unit test 'api record check' for cupti

* Fix compile error for non-gpu builds

* Use common file for demangle and pid/tid.  Namespace ThreadUtil.  Fix gpu buffer clearing.

* Merge demangle into profiler_common

* Merge demangle into profiler_common part 2

* Style cleanup

* Resolve linking issues via ProviderHost interface

* Demangle cuda kernel names

* Clean up comments

* Fix formatting

* Fix anal retentive formatting
2022-08-26 19:38:03 -07:00
Xinya Zhang
eb827bd3e5
[ROCm] NGramRepeatBlock, LongformerAttention and DecoderAttention Ops (#11971)
* [ROCm] enable NGramRepeatBlock Op

* [ROCm] Enable testing ROCm in NGramRepeatBlockTest.NGramSize_3

Also link onnxruntime_test_all with amdhip64 when USE_ROCM=1

* [ROCm] add LongformerAttention Op

* [ROCm] Enable LongformerAttentionTest

* [ROCm] Add DecoderAttention Op

* Enable DecoderAttention Test for ROCm.

* [ROCM] Updates according to reviews
2022-08-11 19:32:08 -07:00
Vincent Wang
37995a7245
[CUDA] BiasSoftmax Supporting New Pattern (#12361) 2022-08-05 06:59:24 +08:00
Xinya Zhang
77cab7a3a5
[ROCm] Add AveragePool, GlobalAveragePool, MaxPool, GlobalMaxPool Ops (#11968)
* [ROCm] disable expected failure tests PoolTest.MaxPool_10_DilationPadding_?d

* [ROCm] Add AveragePool, GlobalAveragePool, MaxPool, GlobalMaxPool Ops

* (To squash after review) Replace rocm/nn/pool.cc with amd_hipify.py changes

* [ROCM] Replace miCompat with Helper functions

* (to squash) fix the compiling error of SetPoolingNdDescriptorHelper
2022-08-03 14:36:36 -07:00
Xinya Zhang
01f3a197d7
[ROCm] InstanceNormalization, BatchNormalization and LRN Ops (#11972)
* [ROCm] Add InstanceNormalization Op

* Enable InstanceNormBatch1_fp16 and InstanceNormBatch2_fp16 for ROCm

* [ROCm] Add BatchNormalization for fp32 and fp16

* Enable BatchNormTest for ROCm

* [ROCm] Add LRN Op

* [ROCM] replace miCompat functions with Helper functions
2022-08-02 23:14:26 -07:00
ytaous
e4bd41fb3b
[ROCm] Enable Einsum for inferencing perf (#12360)
* enable einsum

* address comments

* comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2022-07-28 20:26:25 -07:00
Ye Wang
89ac61f4d4
support gpt2 model with greedy search (#12068)
* greedy search gpt2 cpu checkin

* add cuda support

* add test

* provider

* update

* fix some bugs

* refactor impl class

* refactor test

* remove unused func

* refactor parameters class

* simplify padding

* fix lint warnings

* python format

* Revert "python format"

This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf.

* python format

* fix pipelines

* fix pipeline

* move bufferallocater to generate_impl_base

* review comments(alignment, filename/namespace change)

* rebase2

* python reformat

* reformat

* fix rocm build

* review comment

* review comments

* review comments

* fix a bug

* rebase test files

* python format

* format import order

* review comments

* fix build
2022-07-22 15:45:16 -07:00
Xinya Zhang
03dfcb0e87
[ROCm] Enable int8 for MatMulInteger Op (#11776) 2022-07-21 11:20:48 -07:00
mindest
add631410a
[ROCm] Re-enable ReduceL1, L2 and related tests (#12209)
Re-enable ReduceL1,L2 and related tests
2022-07-20 13:13:02 +08:00
zhangyaobit
a9b9c7f69f
Add autotuning support to FastGelu (#12093)
* Add autotuning for FastGelu (Draft).

* Clean up.

* delete unused header file

* Fix lint errors.

* Add missing template parameter.

* Improvements.

* Fix type.

* Fix namespace issue.
2022-07-06 23:17:48 -07:00
Hubert Lu
dbcf54aa41
Add hipified SkipLayerNorm code for ROCmEP (#12107)
* First attempt for half2 vectorized memory access in SkipLayerNorm

* Add some functions for debugging

* Clean up the code

* Clean up the code

* Generalize the vectorized kernels with aligned_vector and remove cudaDeviceProp

* Add a unit test for a larger input size

* Fix some Lint C++ warnings

* Use ILP = 4 for the vectorized kernels

* Rewrite the vectorized kernel and templatize ComputeSkipLayerNorm

* Use conditional operator for input_v

* Refactor LaunchSkipLayerNormKernel and replace the original SkipLayerNormKernelSmall with the vectorized kernel

* Clean some comments and rename the layernorm function

* Use ComputeSkipLayerNorm to replace LaunchSkipLayerNormKernel

* Resolve a Lint C++ warning

* Fix SkipLayerNormBatch1_Float16_vec output data

* Add hipified code of bert SkipLayerNorm for ROCmEP

* Resolve some Lint C++ warnings

* Resolve some Lint C++ warnings

* Resolve some Lint C++ warnings

* Resolve Python formatting issue
2022-07-06 22:13:11 -07:00
Hubert Lu
f4ba199bad
Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491)
* Using vectorized loads (float2) for fp16 to improve performance

* Fix a few warnings from cpplint

* Fix a few warnings from cpplint

* Use __float2half2_rn and fix some cpplint warnings

* Move some computaions to LaunchFastGeluKernel

* Fix some Lint C++ warning

* Using vectorized loads (float4) for fp16 to improve performance

* Switch   whether to optimize FastGelu with float4 vectorization

* Switch to float4 memory access based on input_length in FastGelu

* Comment how to set the threshold of float2 and float4 vectorized kernels

* Add FastGelu fp16 unit tests for bias_length = 2 and 8

* Make vectorized kernels generic with aligned_vector

* Unify the vectorized kernels with/without bias

* Refactor the code to suppress cpplint warnings

* Solve formatting issues

* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel

* Move fast_gelu_impl.h to rocm/bert

* Fix some Lint C++ warnings and code alignment
2022-06-24 12:46:17 -07:00
Justin Chu
fdce4fa6af
Format all python files under onnxruntime with black and isort (#11324)
Description: Format all python files under onnxruntime with black and isort.

After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.

#11315, #11316
2022-04-26 09:35:16 -07:00
Weixing Zhang
0aaf3a676a
Update reduce norm1/norm2 and layernorm kernels with ROCm 4.3.1 (#9399)
* update layernorm to reflect the fix in ROCm 4.3.1

* fix UT

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-04-07 22:54:12 -07:00
Tianlei Wu
36c3271546
BeamSearch op cuda (#10556)
Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph
2022-02-25 13:08:55 -08:00
zhangyaobit
fd16085cea
Zhanyao/attention (#10545)
* Enable Attention op for ROCM EP.

As a note, potential hipify improvements: (1) handle math
contants (attention_softmax.h), (2) correctly generate transpose
options for the GEMM helpers, consider counterpart/dummy API for
CublasMathModeSetter (attention_impl.cu, attention_impl.cu). After
these improvements, we don't need to manually keep copies of the
above mentioned files any more.

* Clean up debugging code.
2022-02-17 09:02:45 -08:00
Jeff Daily
e7efcc93fe
[ROCm] update hipify-perl location (#10102)
* [ROCm] update hipify-perl location

Depending on the ROCm version installed, hipify-perl might not always
live in the hard-coded path of /opt/rocm/bin. Use python 3.3's
shutil.which to locate the script.

* provide alternative locations for hipify-perl if not in PATH

* implement hipify-perl search as a function

This avoids running the logic during module import since all builds
import the amd_hipify module.

* fix flake8 errors
2022-01-06 17:21:02 -08:00
Ye Wang
6856619b18
Decoder Attention CUDA Op (#9792)
* add kernel interface

* register kernel

* add self/cross qkv projection without cache

* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)

* refactor ConcatPastToPresent

* DecoderQkvToContext interface

* q,k,v buffer and cache as output

* qk, pv and transctx

* fix compiler error on linux machine

* key_padding_mask

* add test_parity file. However not runnable

* add partial unittest

* made partial attributes to inputs

* --gen_doc

* change kernel interface, add more tests

* morre parity tests

* fix test

* fix typo

* transpose optimizer has bug. remove it temporarily

* add input shape checks

* add type/shape inference

* fix cache shape check

* fix rocm build failure

* fix rocm build error

* review comments

* review comments
2021-11-19 19:25:36 -08:00
pengwa
6e09fc5152
Implement block wise softmax for reduction dimention > 1024 cases. (#9696)
* implement block wise softmax for reduction dimention > 1024 cases.

* fix builds

* fix

* fix amd build

* fix amd build

* fix win-gpu build

* add tests

* remove cudnn path/add python tests
2021-11-14 11:47:58 +08:00
Jeff Daily
ca7116ca3e
CUDA EP's ResizeImpl now uses functors, hipify for ROCm EP (#9466)
Support for device function pointers is not yet available for ROCm.
Instead, the device function pointers were converted to device functors.
Case statements, lambdas, and macros are used for dispatch; as a result,
all combinations of kernels are compiled with inlined functors. The
basis of this approach can be found in PyTorch.

Lastly, hipify and register Resize and Upsample for ROCm EP.
2021-10-21 15:02:41 -07:00
Jeff Daily
66ceb6926d
rehipify ROCm EP files under orttraining (#9443)
* rehipify rocm ep files under orttraining committed to source control

* fix flake8 error
2021-10-21 13:36:21 -07:00
Jeff Daily
89a22fb641
Add TopK to ROCm EP (#9391)
* Add TopK to ROCm EP

* flake8 fix
2021-10-20 10:39:44 -07:00
Jeff Daily
f8acc6d0e8
Add NonMaxSuppression and RoiAlign to ROCm EP (#9394) 2021-10-20 10:38:45 -07:00
Jeff Daily
c33391329a
Add QuantizeLinear and DequantizeLinear to ROCm EP (#9401) 2021-10-20 10:37:58 -07:00
Jeff Daily
52c53e396d
hipify tensor/gather_nd_impl.cu (#9392) 2021-10-19 14:15:49 -07:00
Jeff Daily
a2ba923ac7
hipify fast_divmod.h (#9400) 2021-10-19 12:34:46 -07:00
Jeff Daily
a8e2e8d76a
hipify tensor/transpose.cc and tensor/transpose.h (#9397) 2021-10-19 12:27:36 -07:00
Jeff Daily
c8789d3047
[ROCm] static re-hipify of CUDA EP to ROCm EP, now a shared provider (#8877)
* re-hipify all rocm EP sources

* fix all other files affected by re-hipify

* add cuda_provider_factory.h to amd_hipify.py

* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration

* Fix ReduceConsts template specialization introduced in #9101.

Fixes the error when building for ROCm 4.3.1:

error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)

* fix flake8 error in amd_hipify.py

* speed up hipify with concurrent.futures

* flake8 fix in amd_hipify.py
2021-10-14 15:15:51 -07:00
Suffian Khan
47888392ab
Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1

* fix rocm 4.3.1 docker image reference

* fix CUDA_VERSION to ROCM_VERSION

* fix ReduceConsts conflict def

* add ifdef to miopen_common.h as well

* trailing ws
2021-09-19 23:36:03 -07:00
mindest
a71dab691d
Implement BatchNormInternal for cuda (#8172)
* correct batchnorm replacement output order;

remove bn replacement in grad graph builder

* update op defs and kernel class

* implement batch norm internal and grad.

* change saved_var into saved_inv_std

* cuda test case: bn internal

* remove redundant include

* fix comment; add support and UT for 1d input.

* exclude batch_norm_internal in amd_hipify

* run BNInternal UT for CUDA only

* fix CI error

* fix comment errors

* fix error

* add comment for inconsistency with cudnnBN doc

* additional comments for cudnnBN inconsistency
2021-07-28 16:04:49 +08:00
Hariharan Seshadri
5369821ad6
Support SpaceDepth ops in the CUDA and ROCM EPs (#7960) 2021-07-09 01:00:22 -07:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

In the commit 299ace0, ORT has been changed to not use cuda default stream.

* update amd_hipify.py

* remove un-necessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00