mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-07-04 04:07:22 +00:00
In ORT there are only three CUDA streams: default, HtoD, and DtoH. Both HtoD and DtoH are non-blocking streams, so per-thread stream mode doesn't bring any benefit. I also tried a multi-threaded environment, and legacy mode is still faster than per-thread mode there. Below is the perf of a 3-layer BERT on a V100 (latency in ms; c = concurrency):

| Batch size | Mode | c=1 | c=2 | c=4 |
|---|---|---|---|---|
| 1 | legacy | 0.54 | 1.17 | 2.68 |
| 1 | per-thread | 0.66 | 1.37 | 2.86 |
| 4 | legacy | 1.1 | 2.22 | 4.6 |
| 4 | per-thread | 1.21 | 2.44 | 4.98 |
| 64 | legacy | 8.09 | 16.13 | 32.37 |
| 64 | per-thread | 8.18 | 16.26 | 32.45 |

| | | |
|---|---|---|
| external | ||
| onnx | ||
| patches | ||
| CMakeLists.txt | ||
| ConfigureVisualStudioCodeAnalysis.props | ||
| EnableVisualStudioCodeAnalysis.props | ||
| get_boost.cmake | ||
| onnxruntime.cmake | ||
| onnxruntime_automl_featurizers.cmake | ||
| onnxruntime_codegen.cmake | ||
| onnxruntime_common.cmake | ||
| onnxruntime_config.h.in | ||
| onnxruntime_csharp.cmake | ||
| onnxruntime_dependencies.dot | ||
| onnxruntime_framework.cmake | ||
| onnxruntime_graph.cmake | ||
| onnxruntime_language_interop_ops.cmake | ||
| onnxruntime_mlas.cmake | ||
| onnxruntime_nuphar_extern.cmake | ||
| onnxruntime_optimizer.cmake | ||
| onnxruntime_providers.cmake | ||
| onnxruntime_pyop.cmake | ||
| onnxruntime_python.cmake | ||
| onnxruntime_server.cmake | ||
| onnxruntime_session.cmake | ||
| onnxruntime_unittests.cmake | ||
| onnxruntime_util.cmake | ||
| protobuf_function.cmake | ||
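
The latency figures quoted in the commit message above can be turned into a relative overhead to make the comparison easier to read. The following is a minimal sketch, not part of the repository: the numbers are copied from the commit message, while the dictionary layout and the helper name `overhead_percent` are assumptions made for illustration.

```python
# Latencies in ms copied from the commit message (3-layer BERT on a V100).
# Keys are batch sizes; each list holds the results for concurrency 1, 2 and 4.
legacy = {1: [0.54, 1.17, 2.68], 4: [1.1, 2.22, 4.6], 64: [8.09, 16.13, 32.37]}
per_thread = {1: [0.66, 1.37, 2.86], 4: [1.21, 2.44, 4.98], 64: [8.18, 16.26, 32.45]}


def overhead_percent(batch_size):
    """Hypothetical helper: how much slower per-thread mode is than legacy, in percent."""
    return [
        round((p / l - 1.0) * 100.0, 1)
        for l, p in zip(legacy[batch_size], per_thread[batch_size])
    ]


for batch_size in legacy:
    print(batch_size, overhead_percent(batch_size))
```

With these figures, per-thread mode is slower in every configuration; the gap is about 22% at batch size 1 with a single thread and falls below 1% at batch size 64 with two or more threads.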