onnxruntime/include/onnxruntime/core/framework
Scott McKay 9790e19424
Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062)
* Fixes from investigating an issue running the BERT-Squad model with larger batch sizes. When the batch size gets large enough, the initial run succeeds (no memory pattern in use) but the second run fails to allocate the memory pattern block.

The cause of this failure is that the smaller blocks from the first run are still allocated, as BFCArena has no logic to free them. This essentially results in 2x the memory being required to run the model.

There was an inconsistency in BFCArena::Extend, which on one path threw an exception if it couldn't do the allocation, and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer, so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory.

Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code, if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure until we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper.

Finally, update AllocateMLValueTensorSelfOwnBufferHelper to expect that a location may not have a block. A log message is provided when the block allocation fails, so it's not necessary to log again on each individual allocation that would have used the block. In that case the helper falls through to the default behavior of doing a normal allocation.
2020-06-05 18:54:01 +10:00
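
The three changes described in the commit message above can be sketched in isolation. This is a minimal illustration, not the onnxruntime code: `ToyArena` and `ToyFrame` are hypothetical stand-ins for BFCArena and ExecutionFrame, the fixed byte budget is an assumption used to force a failure, and `std::bad_alloc` is an assumed exception type (the message only says the arena now always throws).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <map>
#include <new>

// Hypothetical arena with a fixed byte budget (not the real BFCArena).
class ToyArena {
 public:
  explicit ToyArena(std::size_t capacity) : capacity_(capacity) {}
  ToyArena(const ToyArena&) = delete;
  ToyArena& operator=(const ToyArena&) = delete;
  ~ToyArena() {
    for (auto& entry : sizes_) std::free(entry.first);
  }

  // Consistent behavior: throw on every failure path, never return nullptr.
  void* Alloc(std::size_t size) {
    assert(size > 0);
    if (size > capacity_ - used_) throw std::bad_alloc();
    void* p = std::malloc(size);
    if (p == nullptr) throw std::bad_alloc();
    used_ += size;
    sizes_[p] = size;
    return p;
  }

  void Free(void* p) {
    auto it = sizes_.find(p);
    if (it == sizes_.end()) return;
    used_ -= it->second;
    sizes_.erase(it);
    std::free(p);
  }

  std::size_t used() const { return used_; }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::map<void*, std::size_t> sizes_;
};

// Hypothetical frame that tries to reserve one large block per location.
class ToyFrame {
 public:
  explicit ToyFrame(ToyArena& arena) : arena_(arena) {}

  // Allocate first, insert second: a failed allocation leaves no entry behind.
  bool TryReservePatternBlock(int location, std::size_t size) {
    try {
      void* block = arena_.Alloc(size);
      blocks_.emplace(location, block);
      return true;
    } catch (const std::bad_alloc&) {
      // Log once here; individual tensor allocations stay silent.
      std::cerr << "pattern block of " << size << " bytes failed for location "
                << location << "\n";
      return false;
    }
  }

  bool HasBlock(int location) const { return blocks_.count(location) != 0; }

  // Use the block if the location has one, else do a normal allocation.
  void* AllocateBuffer(int location, std::size_t offset, std::size_t size,
                       bool& from_block) {
    auto it = blocks_.find(location);
    from_block = (it != blocks_.end());
    if (from_block) return static_cast<char*>(it->second) + offset;
    return arena_.Alloc(size);
  }

 private:
  ToyArena& arena_;
  std::map<int, void*> blocks_;
};
```

With a 1024-byte `ToyArena` that already holds 600 bytes from an earlier run, `TryReservePatternBlock(0, 800)` returns false and records nothing, and a later `AllocateBuffer(0, 0, 100, from_block)` still succeeds through the normal allocation path with `from_block` set to false.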
alloc_kind.h
allocator.h Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062) 2020-06-05 18:54:01 +10:00
customregistry.h
data_types.h address PR comments (#3312) 2020-03-25 19:35:12 -07:00
data_types_internal.h Fix some warnings on Windows (#2560) 2020-01-22 15:59:11 -08:00
endian.h Edgchen1/endian utils (#2181) 2019-10-21 22:28:35 -07:00
execution_provider.h Ryanunderhill/mkldnn dll (#3314) 2020-05-06 00:57:09 -07:00
fence.h
framework_common.h
func_api.h Ryanunderhill/mkldnn dll (#3314) 2020-05-06 00:57:09 -07:00
kernel_def_builder.h Introduce training changes. 2020-03-11 14:39:03 -07:00
kernel_registry.h Parallel all the activations ops (#3722) 2020-05-05 01:18:17 -07:00
ml_value.h Introduce container type runtime checks and other improvements (#2522) 2019-12-04 16:04:17 -08:00
op_kernel.h Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master 2020-04-21 03:31:32 +00:00
op_kernel_info.h Replace GSL with GSL-LITE submodule and fix up refs (#1920) 2019-10-01 12:43:29 -07:00
op_node_proto_helper.h Replace GSL with GSL-LITE submodule and fix up refs (#1920) 2019-10-01 12:43:29 -07:00
run_options.h Fix evaluation issues (#3538) 2020-04-28 21:03:37 -07:00
sparse_tensor.h Replace GSL with GSL-LITE submodule and fix up refs (#1920) 2019-10-01 12:43:29 -07:00
tensor.h View Op - new unit tests and add support for tensor memcpy by offset/size (#3439) 2020-04-07 13:07:11 -07:00
tensor_shape.h Filter out info from non-const initializers during shape inferencing (#1806) 2019-09-26 13:44:33 +10:00