onnxruntime/onnxruntime/core/framework
Tang, Cheng a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EPs to support the stream mechanism and
support running different requests in different CUDA streams.

**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of primitives, and
ORT executes them step by step. For any given graph, ORT serializes
it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization. We have a
half-baked parallel executor, but it is very difficult to make it work
with GPU.
2. The fence mechanism works for the single GPU stream + CPU thread
case, but when extended to multiple streams, it is difficult to manage
the synchronization across GPU streams.
3. Our CUDA EP relies on the BFCArena to make memory management work
with the GPU async kernels, but the current BFCArena is not aware of
streams, so it doesn't behave correctly when run with multiple
streams.

This PR enhances our existing execution plan and executor to support
multiple-stream execution. We use a unified algorithm to manage both
single-stream and multiple-stream scenarios.
This PR mainly focuses on the infrastructure support for multiple-stream
execution; that is, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be addressed in a future PR.
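
As a rough illustration of what "execute a valid stream assignment correctly" means, the toy model below treats each stream as an ordered list of steps and uses notifications to order work across streams. It is a minimal single-threaded sketch, not the onnxruntime implementation: `Step`, `StepKind`, `StreamPlan`, `Execute` and `RunExample` are hypothetical names invented for this example, and a real executor would run the streams concurrently rather than round-robin.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical step kinds for this sketch (not the ORT types).
enum class StepKind { kRun, kActivate, kWait };

struct Step {
  StepKind kind;
  std::string node;          // kernel name, used only by kRun
  std::size_t notification;  // notification index, used by kActivate / kWait
};

// One logical stream: steps inside a stream always execute in order.
using StreamPlan = std::vector<Step>;

// Sweeps over the streams round-robin until a full sweep makes no progress.
// A stream stalls at a kWait step until the matching notification has been
// activated by some other stream. Returns the order in which kernels ran;
// if a wait is never satisfied, the remaining steps are simply not executed.
std::vector<std::string> Execute(const std::vector<StreamPlan>& streams,
                                 std::size_t num_notifications) {
  std::vector<bool> activated(num_notifications, false);
  std::vector<std::size_t> next(streams.size(), 0);  // next step per stream
  std::vector<std::string> order;
  bool progressed = true;
  while (progressed) {
    progressed = false;
    for (std::size_t s = 0; s < streams.size(); ++s) {
      while (next[s] < streams[s].size()) {
        const Step& step = streams[s][next[s]];
        if (step.kind == StepKind::kWait && !activated[step.notification]) {
          break;  // blocked: try the other streams first
        }
        if (step.kind == StepKind::kActivate) {
          activated[step.notification] = true;
        }
        if (step.kind == StepKind::kRun) {
          order.push_back(step.node);
        }
        ++next[s];
        progressed = true;
      }
    }
  }
  return order;
}

// Kernel "B" consumes the output of kernel "A", but lives on another stream.
// The consumer stream is listed first on purpose, so the wait is exercised.
std::vector<std::string> RunExample() {
  StreamPlan consumer = {{StepKind::kWait, "", 0}, {StepKind::kRun, "B", 0}};
  StreamPlan producer = {{StepKind::kRun, "A", 0}, {StepKind::kActivate, "", 0}};
  return Execute({consumer, producer}, 1);
}
```

In this sketch `RunExample()` runs "A" before "B" even though the consumer stream is visited first, because the wait step holds the consumer back until the producer activates the notification.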

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
allocation_planner.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
allocation_planner.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
allocator.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
allocator_stats.h
allocatormgr.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
allocatormgr.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
arena_extend_strategy.h
bfc_arena.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
bfc_arena.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
callback.cc
callback.h
compute_capability.h
config_options.cc Enabling thread pool to be numa-aware (#13778) 2022-12-12 10:33:55 -08:00
config_options.h Pass SessionOptions to XnnpackProviderFactoryCreator. (#13318) 2022-12-10 14:23:46 +08:00
copy.cc
copy.h Add support for other data types to Split CPU kernel. (#13900) 2022-12-12 09:29:15 -08:00
customregistry.cc
data_transfer.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
data_transfer.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
data_transfer_manager.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
data_transfer_manager.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
data_transfer_utils.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
data_types.cc
debug_node_inputs_outputs_utils.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
debug_node_inputs_outputs_utils.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
device_stream_collection.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
device_stream_collection.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
element_type_lists.h
empty.cc
endian_utils.cc Remove CUDA 10.2 support (#12541) 2022-08-10 22:46:41 -07:00
endian_utils.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
error_code.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
error_code_helper.h
ex_lib_loader.cc
ex_lib_loader.h
execution_frame.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
execution_frame.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
execution_plan_base.h Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
execution_provider.cc Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791) 2022-09-20 14:24:59 -07:00
execution_providers.h Revert reverse setup of allocators + create/register allocator in CPU EP only when needed. (#12954) 2022-09-15 17:54:32 -07:00
execution_steps.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
execution_steps.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
fallback_cpu_capability.cc Bugfix for GetCpuPreferredNodes (#13590) 2022-12-14 17:54:55 +08:00
fallback_cpu_capability.h [DML EP] Revert DML's cpu fallback logic (#13605) 2022-11-10 00:56:23 -08:00
feeds_fetches_manager.cc Eliminate memory allocations per recent profiling (#12225) 2022-07-25 14:14:38 -07:00
feeds_fetches_manager.h Eliminate memory allocations per recent profiling (#12225) 2022-07-25 14:14:38 -07:00
func_kernel.cc
func_kernel.h
fuse_nodes_funcs.cc
fuse_nodes_funcs.h
graph_partitioner.cc [DML EP] Don't fuse a capability outside the compile call (#13468) 2022-10-26 15:21:33 -07:00
graph_partitioner.h Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791) 2022-09-20 14:24:59 -07:00
iexecutor.h Eliminate memory allocations per recent profiling (#12225) 2022-07-25 14:14:38 -07:00
kernel_def_builder.cc Decouple strided tensor support from ENABLE_TRAINING (#13829) 2022-12-07 09:22:21 -08:00
kernel_lookup.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
kernel_registry.cc Consolidate enabled/default kernel def type constraints (#13034) 2022-09-27 14:04:15 -07:00
kernel_registry_manager.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
kernel_registry_manager.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
kernel_type_str_resolver.cc Binary size reduction in KernelTypeStrResolver and GraphPartitioner (#13172) 2022-09-30 13:50:39 -07:00
kernel_type_str_resolver.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
kernel_type_str_resolver_utils.cc Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791) 2022-09-20 14:24:59 -07:00
kernel_type_str_resolver_utils.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
math.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
mem_buffer.h
mem_pattern.h Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
mem_pattern_planner.h Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
memcpy.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
memcpy.h
memory_info.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
memory_info.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
mldata_type_utils.cc
mldata_type_utils.h
murmurhash3.cc Add big endian support to murmurhash3 (#12549) 2022-08-11 18:39:39 +10:00
murmurhash3.h
node_index_info.cc
node_index_info.h Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
onnxruntime_map_type_info.cc
onnxruntime_map_type_info.h
onnxruntime_sequence_type_info.cc
onnxruntime_sequence_type_info.h
onnxruntime_typeinfo.cc
onnxruntime_typeinfo.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
op_kernel.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
op_kernel_context_internal.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
op_kernel_info.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
op_kernel_type_control_utils.h
op_node_proto_helper.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
ort_stl_allocator.h
ort_value_name_idx_map.h Build VS 2022 no Abseil adjustment (#12780) 2022-08-31 11:47:43 -07:00
ort_value_pattern_planner.cc Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
ort_value_pattern_planner.h Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
ort_value_tensor_slicer.cc Eliminate memory allocations per recent profiling (#12225) 2022-07-25 14:14:38 -07:00
ort_value_tensor_slicer.h
partial_graph_execution_state.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
partial_graph_execution_state.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
prepacked_weights.cc
prepacked_weights.h
prepacked_weights_container.cc
prepacked_weights_container.h Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888) 2022-06-17 17:07:21 +08:00
print_tensor_utils.h
program_region.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
random_generator.cc
random_generator.h
random_seed.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
random_seed.h
run_options.cc
sequential_execution_plan.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
sequential_executor.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
sequential_executor.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
session_options.cc
session_options.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
session_state.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
session_state.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
session_state_utils.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
session_state_utils.h Incrementally free initializers while saving to OrtValue instances (#12485) 2022-08-09 10:59:10 +10:00
simple_tensor_allocator.cc Incrementally free initializers while saving to OrtValue instances (#12485) 2022-08-09 10:59:10 +10:00
simple_tensor_allocator.h Incrementally free initializers while saving to OrtValue instances (#12485) 2022-08-09 10:59:10 +10:00
sparse_tensor.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
sparse_utils.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
sparse_utils.h
stream_execution_context.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
stream_execution_context.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
tensor.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
tensor_allocator.cc Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
tensor_allocator.h Incrementally free initializers while saving to OrtValue instances (#12485) 2022-08-09 10:59:10 +10:00
tensor_allocator_with_mem_pattern.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
tensor_external_data_info.cc
tensor_external_data_info.h Support direct usage of ORT format model flatbuffer for initializers (#12465) 2022-08-12 18:31:43 +10:00
tensor_shape.cc
tensor_type_and_shape.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
tensor_type_and_shape.h
tensorprotoutils.cc Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
tensorprotoutils.h Avoid duplicate symbol error between ONNX and ORT for ostream operator<< with TensorShapeProto (#12651) 2022-08-22 17:20:52 +10:00
TensorSeq.h
transpose_helper.cc
transpose_helper.h Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
tunable.h Include algorithm selection exposed by ROCBLAS extensions API in GEMM autotuning (#13831) 2022-12-08 14:21:17 -08:00
utils.cc Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
utils.h Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00