Commit graph

27 commits

Author SHA1 Message Date
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR including following works:
1. provide stream and related synchronization abstractions in
onnxruntime.
2. enhance onnxruntime's execution planner / executor / memory arena to
support execute multiple streams in parallel.
3. deprecate the parallel executor for cpu.
4. deprecate the Fence mechanism. 
5. update the cuda / tensorrt EP to support the stream mechanism,
support running different request in different cuda stream.

**Motivation and Context**
- Why is this change required? 
currently, the execution plan is just a linear list of those primitives,
ort will execute them step by step. For any given graph, ORT will
serialize it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. it is difficult to enable inter-node parallelization, we have a
half-baked parallel executor but it is very difficult to make it work
with GPU.
2. The fence mechanism can work with single gpu stream + cpu thread
case, but when extend to multiple stream, it is difficult to manage the
cross GPU stream synchronizations.
3. our cuda EP rely on the BFCArena to make the memory management work
with the GPU async kernels, but current BFCArena is not aware of the
streams, so it doesn't behavior correctly when run with multiple
streams.

This PR enhance our existing execution plan and executor to support
multiple stream execution. we use an unified algorithm to mange both
single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream
execution, that is said, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be in the future PR.

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
Changming Sun
4e9e01cb3c
Fix SDL warnings in CPU EP (#9975) 2021-12-19 20:54:29 -08:00
Changming Sun
44c701192b
Revert a bad change in bfc_arena.cc (#10057) 2021-12-15 23:38:45 -08:00
Changming Sun
9d9ebd3b85
Fix some static analysis warnings in the core framework (#10033) 2021-12-14 14:41:42 -08:00
Hariharan Seshadri
3360024a0b
Support plugging in custom user-defined allocators for sharing between sessions (#8059) 2021-07-22 10:17:35 -07:00
Hariharan Seshadri
43e2ee37f2
Some cosmetic changes (#7741) 2021-05-18 00:02:07 -07:00
Hariharan Seshadri
4b691a5c0d
Add ability for memory arenas to "shrink" periodically (#7284) 2021-05-08 07:53:21 -07:00
Hariharan Seshadri
7b11283af0
Add ability to allocate initialized tensor memory from non-arena memory (#7267) 2021-04-20 20:27:48 -07:00
Scott McKay
7250562271
Fix edge case in BFCArena where allocation failures could lead to an infinite loop. (#6145)
#4656
2020-12-17 07:52:31 +10:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.  

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Ryan Hill
3207de276c
Remove IDeviceAllocator class as it doesn't extend IAllocator in any way. (#5067) 2020-09-10 00:46:35 -07:00
gwang-msft
ea5732319e
Add option ORT_NO_EXCEPTIONS to disable most exception/throw in /onnxruntime/ (#4894)
* init no exception changes

* initial test

* disable exceptions

* more throw handling

* minor update

* fix linux build break

* fix windows/nuphar build break

* address cr comments, move #ifdef to ORT_CATCH

* address cr comments, move #ifdef to ORT_CATCH

* handle return statement in ORT_CATCH

* linux build break fix

* addressed cr comments, remove ort_catch_end

* addressed cr comments, remove ort_catch_end

* move mlas to a separated ifdef flag

* merge master, move some new code in master to no_exc

Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
2020-08-28 23:03:51 -07:00
Pranav Sharma
29dcfb24ab
Allow multiple sessions to share an allocator, optimize constant folding memory usage, expose arena configs. (#4813)
* Add support for sharing allocators

* Incremental update

* Address some PR comments, add unit tests, add documentation.

* Address PR comments, add tests and some documentation.

* Fix build and test issues

* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
2020-08-22 10:03:17 -07:00
Scott McKay
175983c082
Move memory info into IAllocator (#2850)
- Update IAllocator setup to move the OrtMemoryInfo to the base class instead of requiring derived classes to have that as a member and override a virtual method to return it.
  - Cleanup CreateAllocator setup to take an argument as to whether to wrap the device allocator in an arena allocator. The choice to do that isn't a property of the underlying device allocator.
  - Minor cleanups in the various EPs to adjust to the change to IAllocator and CreateAllocator, and to use the create_arena flag consistently when available.
2020-06-22 11:18:52 +10:00
Scott McKay
9790e19424
Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062)
* Fixes from investigating issue running BERT-Squad model with larger batch sizes. When the batch size gets large enough the initial run will be successful (no memory pattern in use) but the second will fail to allocate the memory pattern block.

The cause of this failure is that we still have the smaller blocks from the first run allocated, as BFCArena has no logic to free those. This essentially results in 2x the memory being required to run the model.

There was inconsistency in BFCArena::Extend which on one path threw an exception if it couldn't do the allocation, and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory.

Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure to when we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper.

Finally update AllocateMLValueTensorSelfOwnBufferHelper to expect a location may not have a block. A log message will be provided when the block allocation fails so it's not necessary to have more on each individual allocation that would have used the block. Falls through to default behavior of doing a normal allocation.
2020-06-05 18:54:01 +10:00
ytaous
a08f16471a
Address comments around bfc arena (#3460)
* rename setting

* todo comments

* fix build

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-04-08 19:35:32 -07:00
ytaous
f73008483a
safeint for region bytes in bfc arena and code clean up (#3447)
* PR comments

* remove build issue workaround

* SafeInt for region bytes

* fix build

* fix build

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-04-08 13:54:42 -07:00
Edward Chen
e542cfd0e0 Introduce training changes. 2020-03-11 14:39:03 -07:00
Scott McKay
a92e924ab2
Revert "Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks (#2835)" (#2904)
This reverts commit 724ff0753b.
2020-01-24 14:02:30 +10:00
Changming Sun
201b089a36
Fix some warnings on Windows (#2560)
1. Enable warning "4503" # Decorated name length exceeded.
2. Enable warning "4146" # unary minus operator applied to unsigned type.
3. Enable float64 support for the Softmax operator
4. Enable compliance checks for Windows x86 32bits build
5. Use TryBatchParallelFor to replace some fallback code in mlas pooling.cc
6. Fix Android CI pipeline.
2020-01-22 15:59:11 -08:00
Scott McKay
fc51473b09
Update BFCArena logic to use backoff if cudaMalloc fails. Makes behaviour equivalent to when a CPU allocation fails. Add unit test. (#2748)
Clear error when throwing an exception for a failed CUDA call so that there is only one error mechanism being used at a time.
Minor improvements to logging to aid debugging of BFCArena behaviour.
2020-01-22 14:21:21 +10:00
Scott McKay
724ff0753b
Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks (#2835)
* Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks.
2020-01-17 07:41:48 +10:00
Scott McKay
b43254282f
Handle bad alloc exception in bfc arena (#1846)
* Handle std::bad_alloc when growing arena.
Allow more than one attempt at reducing the buffer if allocation fails. More memory may have become available so never trying to backpedal more than once means we potentially fail when a large enough buffer could have been allocated.
2019-09-19 17:56:39 +10:00
Ke Zhang
3bf0e364e2
Move CopyTensor out of IExecutionProvider interface. (#1268)
* add ortdevice class

* add data transfer manager for copying tensors.

* update

* add data trasnfer for gpu

* fix constexpr build break.

* update

* remove unnecessary header files.

* remove unnecessary header files.

* add dependency

* add dependency

* add dependency

* add dependency

* fix linux build break.

* update

* fix build break

* fix build break

* fix build break

* update

* update

* update c api.

* update to not use OrtCreateAllocatorInfo

* change to all eps .

* fix linux build break

* remove useless codes.

* update

* move datatransfermanager in session state

* update

* fix cuda build break.

* fix comments

* fix windows GPU build.

* fix comments

* fix build break

* fix comments

* fix test failure

* update

* fix comments

* fix onnx runtime server.

* update

* fix test failure.

* fix comments

* fix comment
2019-07-11 14:49:20 -07:00
Changming Sun
c87929e949 Use nsync for implementing condition variable 2019-01-21 22:59:42 -08:00
Ryan Hill
11b369a864
Abbreviate ONNXRuntime as Ort in all of our public APIs (#175)
Applies to all public headers and macros, plus many internal ones. There are still some internal things with OnnxRuntime in the name, but this fixes all public functions & macros.
2018-12-14 14:54:23 -08:00
Pranav Sharma
89618e8f1e Initial bootstrap commit. 2018-11-19 16:48:22 -08:00