onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-31 23:27:43 +00:00

Author	SHA1	Message	Date
Sherlock	eb5c1f0fcc	Unify activation and initializer alignment value (#6109 ) * Unify activation and initializer alignment value * Fix VerifyInputTensorsAllocatedContiguously	2020-12-14 13:13:41 -08:00
M. Zeeshan Siddiqui	9b010963b7	Turn off peak memory logging and fix memory pattern generation bug. (#5676 ) * Turn off peak memory log lines and fix memory pattern generation bug. * Turn off peak memory log lines and fix memory pattern generation bug. Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-03 08:44:15 -08:00
M. Zeeshan Siddiqui	f2168cef29	Misc. cleanup. (#5659 ) Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-02 07:05:28 -08:00
M. Zeeshan Siddiqui	9af0d48524	Memory planner and pattern generation enhancements. (#4443 ) * static allocation. * changes. * contiguous dynamic allocation. * contiguous dynamic allocation. * fix bugs. * fix bug. * build errors. * PR feedback. * PR feedback. * Update Graph builder for nccl_allreduce, mps. * misc. * fix windows build break. * changes. * fine-grained memory-time scheduling. * merge. * fix misc stuff. * fix windows build. * fix windows build. * fix merge bug. * merge conflicts. * revert onnx-tensorrt submodule commit. * fix submodule commit. * misc. * merge conflicts. * Revert "merge conflicts." This reverts commit `319a071a6e`. * merge conflict. * merge conflict. * merge conflicts. * fixes. * PR feedback. * build break. * build break. * Add asserts. * Add asserts. * asserts. * asserts. * asserts. * asserts. * asserts. * fixes. * fixes. Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-01 23:05:46 -08:00
Dmitri Smirnov	3433576fd3	Support for Sparse Initializers (#5540 ) Introduce sparse_initializers support. Convert them to dense on model load and prune graph_proto_ so they don't consume space. Convert back to sparse on ORT Format model save. Implement serializing sparse initializers to OrtFormat. Fix Model::ToProto() to return original sparse initializers Set a flag that graph_sync is needed when loading a simple ORT Format model. otherwise nothing is resolved. Add ORT Format history to README.md ifdef MINIMAL build for DenseToSparseTensorInitializer Allow duplicate initializers to support existing models. Issue a warning instead of aborting. * Revert "Remove SparseTensor support from minimal build. (#5114)" This reverts commit `59ee8ffb17`. Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>	2020-10-27 10:32:06 -07:00
Scott McKay	59ee8ffb17	Remove SparseTensor support from minimal build. (#5114 ) * Remove SparseTensor support from minimal build. Currently the only valid usage of a SparseTensor is as an attribute of a Constant node. That would have been lifted to a dense tensor initializer when loading the onnx model, so would not exist when saving the ORT format model. Due to that there can be no SparseTensors in an ORT format model. Co-authored-by: gwang <wanggy@outlook.com>	2020-09-11 17:56:54 +10:00
gwang-msft	ea5732319e	Add option ORT_NO_EXCEPTIONS to disable most exception/throw in /onnxruntime/ (#4894 ) * init no exception changes * initial test * disable exceptions * more throw handling * minor update * fix linux build break * fix windows/nuphar build break * address cr comments, move #ifdef to ORT_CATCH * address cr comments, move #ifdef to ORT_CATCH * handle return statement in ORT_CATCH * linux build break fix * addressed cr comments, remove ort_catch_end * addressed cr comments, remove ort_catch_end * move mlas to a separated ifdef flag * merge master, move some new code in master to no_exc Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>	2020-08-28 23:03:51 -07:00
Wei-Sheng Chin	7905c57f43	Revert "Remove code which is not thread-safe. (#4454 )" (#4712 ) * Revert "Remove code which is not thread-safe. (#4454)" This reverts commit `5222b2c6c0`. * Resolve race condition * More thread-safe changes * Remove unused lock Polish comments	2020-08-06 18:42:05 -07:00
Sherlock	eb0f57f0e4	Localized Recompute for Gelu and AttentionDropout (#4402 ) * Gelu Activation Recompute Draft * Prototype for localized recompute * Introduce localized_recompute rewriter * Command line args for enabling recompute * Add logger to Gradient Graph Builder * use const when possible	2020-08-04 21:48:15 -07:00
Wei-Sheng Chin	e9d20e9dba	Revise Send and Recv (#4547 ) * Add ability to retrieve inferred shapes when executing a kernel. This ability helps Recv to know its output shapes without doing actual communication. Of course, if the output shapes cannot be inferred, Recv still needs to do communication to get shapes from Send. * Avoid communicating shape information when it can be inferred statically * Replace unordered_map with thread-safe wrapper. We don't want to have race conditions and undefined behavior when using parallel executor. * Remove cout * Add missing file * Address comments * Check dim_value. -1 means missing * lock properly * Address comments (remove thread-safe map) * Remove poc header * Replace Stream with DeferredReleaseCPUPtr	2020-07-30 23:02:45 -07:00
Wei-Sheng Chin	5222b2c6c0	Remove code which is not thread-safe. (#4454 ) Because of async access to the memory logger when using parallel executor, ORT sometimes crashes.	2020-07-08 14:27:56 -07:00
Scott McKay	274e6b4153	Cleanup SessionState. Move allocator lookup to SessionState. (#4194 ) * Move allocators to SessionState so they're decoupled from ExecutionProviders - when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that information to be stored - add device based lookup - simplifies logic for copying feeds/fetches across devices Cleanup SessionState and SessionStateInitializer - provide more things to SessionState at construction time so we don't construct an instance and immediately after call a bunch of setters - simplify SessionStateInitializer - reduced down to FinalizeSessionState method	2020-06-28 14:55:42 +10:00
Derek Murray	a541d28fb4	Lazily get allocator when allocating an MLValue (#4276 ) According to profiling in #4267, getting the allocator can account for a large fraction of overhead when accessing a kernel output, due to STL container operations. The allocator isn't used when (i) we're not creating a fence, and (ii) we have a memory pattern and a pre-allocated buffer, so we can avoid this overhead.	2020-06-19 15:55:43 -07:00
Wei-Sheng Chin	189fb60ef9	Fix a bug and add code to profile memory (#4241 ) * Fix a bug and add code to profile memory 1. Compile Send/Recv again (currently broken because of HOROVOD refactor). 2. Add code to print out initializer allocation size and activation memory size. * Address comments * Split memory counts per locations * Fix a metric	2020-06-16 10:17:27 -07:00
Scott McKay	9790e19424	Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062 ) * Fixes from investigating issue running BERT-Squad model with larger batch sizes. When the batch size gets large enough the initial run will be successful (no memory pattern in use) but the second will fail to allocate the memory pattern block. The cause of this failure is that we still have the smaller blocks from the first run allocated, as BFCArena has no logic to free those. This essentially results in 2x the memory being required to run the model. There was inconsistency in BFCArena::Extend which on one path threw an exception if it couldn't do the allocation, and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory. Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure to when we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper. Finally update AllocateMLValueTensorSelfOwnBufferHelper to expect a location may not have a block. A log message will be provided when the block allocation fails so it's not necessary to have more on each individual allocation that would have used the block. Falls through to default behavior of doing a normal allocation.	2020-06-05 18:54:01 +10:00
Scott McKay	2fed37c8eb	Fix bug in handling of an initializer that provides a graph output. (#3912 ) * Outputs from model execution should always be returned in a newly allocated buffer or a pre-allocated buffer provided in fetches. When an initializer is providing a graph output (e.g. constant folding may result in this) we were returning an OrtValue that pointed to the initializer and not a separately allocated buffer with a copy. This was wrong as: - value wasn't returned in a pre-allocated fetch so whilst the value returned was correct, it was returned in the wrong place - user could alter the data in the initializer via the returned value * Add unit test with and without pre-allocated fetch. * Add some extra info around why we're handling this special case.	2020-05-12 20:42:58 +10:00
ytaous	66c7579c93	address PR comments (#3312 ) * address PR comments * PR comments * PR comments * disable logging * typo Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-03-25 19:35:12 -07:00
Edward Chen	e542cfd0e0	Introduce training changes.	2020-03-11 14:39:03 -07:00
Yufeng Li	64feee1b52	Logging in framework.cc should use the session logger (#3059 )	2020-02-21 17:11:14 -08:00
Scott McKay	a92e924ab2	Revert "Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks (#2835 )" (#2904 ) This reverts commit `724ff0753b`.	2020-01-24 14:02:30 +10:00
Changming Sun	201b089a36	Fix some warnings on Windows (#2560 ) 1. Enable warning "4503" # Decorated name length exceeded. 2. Enable warning "4146" # unary minus operator applied to unsigned type. 3. Enable float64 support for the Softmax operator 4. Enable compliance checks for Windows x86 32bits build 5. Use TryBatchParallelFor to replace some fallback code in mlas pooling.cc 6. Fix Android CI pipeline.	2020-01-22 15:59:11 -08:00
Scott McKay	724ff0753b	Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks (#2835 ) * Use IArenaAllocator::Reserve for initializers and mem pattern planner blocks.	2020-01-17 07:41:48 +10:00
Dmitri Smirnov	d34fb62012	Introduce container type runtime checks and other improvements (#2522 ) Rework TensorSeq in a manner consistent with Tensor and SparseTensor in terms of type system setup. Reduce templating. Introduce helpers to ensure the same data type. Make OrtValue __dtor not virtual. Introduce ContainerChecker	2019-12-04 16:04:17 -08:00
Dmitri Smirnov	25b3c51661	Introduce PrimitiveType into a Type System along with an integer constant (#2307 ) Improve perf by avoiding GetType<T>() calls. Introduce MLTypeCallDispatcher to switch on Input Type. Add Tensor IsType<T>() fast method.	2019-11-08 17:47:06 -08:00
Scott McKay	ffb94fd170	Fix bug with delayed allocation of If and Scan outputs. (#2024 ) * Fix bug with delayed allocation of If and Scan outputs. If the subgraph is producing output on a non-CPU device the delayed allocation was incorrectly providing a CPU allocated tensor. Check for the required location, and update 'fetches' instead if a device copy is needed. The utils::ExecuteGraph logic will handle the device copy in this case.	2019-10-11 19:49:21 +10:00
Dmitri Smirnov	627f853a44	Downgrade compiler to CentOS 4.8.5 (#1985 ) Make onnxruntime CPU build and run on CentOS GCC 4.8.5	2019-10-03 15:40:46 -07:00
Dmitri Smirnov	d1b1cdc5c4	Replace GSL with GSL-LITE submodule and fix up refs (#1920 ) Remove gsl submodule and replace with a local copy of gsl-lite Refactor for onnxruntime::make_unique gsl::span size and index are now size_t Remove lambda auto argument type detection. Remove constexpr from fail_fast in gsl due to Linux not being happy. Comment out std::stream support due to MacOS std lib broken. Move make_unique into include/core/common so it is accessible for server builds. Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc due to x86 build. Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized	2019-10-01 12:43:29 -07:00
Pranav Sharma	52fe574fed	Rename OrtAllocatorInfo to OrtMemoryInfo to make it more obvious. (#1758 ) * Mention OrtCreateSessionFromArray in C API doc * Rename OrtAllocatorInfo to OrtMemoryInfo to avoid confusion	2019-09-05 14:20:37 -07:00
Scott McKay	8a559d75ae	Minor perf improvements. (#1580 ) * Minor perf improvements. - Cache the vector sizes in IExecutionFrame and NodeIndexInfo to avoid calls to size(). - 2 instructions instead of 10 - Remove an unnecessary check in IExecutionFrame - add a check to the ctor so we guarantee it's unnecessary - Reserve memory for the vectors in BroadcastIterator - saves reallocs if more than one value is added - but rare with the mlperf models for multiple values to be added so benefit is limited. - slight tweak to the Broadcaster ctor code to make it more readable	2019-08-13 09:05:48 +10:00
Scott McKay	6e430c0526	A few performance improvements coming out of ssd_mobilenet and ssd_resnet34 analysis (#1578 ) * A few performance improvements: - Make the iteration in NonZero more efficient by using a raw pointer and simplifying the increment logic - add another unit test to check the new logic works with 3 dimensional tensor - gains about 2% for ssd_mobilenet - Avoid floating point operations on each iteration on Concat - about 0.5% for ssd_mobilenet and ssd_resnet34 - Put common case first in ExecutionFrame::AllocateAsPerAllocationPlan to avoid unnecessary call to IsSparseTensor - about 0.05% for ssd_mobilenet - Minor tweak to put some ctors in the TensorShape header so they can be inlined more easily	2019-08-08 07:20:00 +10:00
Scott McKay	387d4c72bb	Strip invalid dim_param and dim_value values out. Allow re-use in event of shape mismatch if buffer is large enough (#1439 ) * Remove invalid dim_param and dim_value values when creating a NodeArg. * Allow re-use of a large enough buffer if there's a shape mismatch. * Update handling in python to treat unset dimension the same as a dim_param (equivalent to None). * Fix GetTensorShapeFromTensorShapeProto to handle neither dim_param nor dim_value being set.	2019-07-23 14:55:54 +10:00
Yulong Wang	887930e6c2	inference overheads optimizations (#1392 ) This change makes some optimizations in various places. This change consists of a part of PR #1240 (removed the problematic part) and some other trivial fixes. 1. reduce unnecessary copies when constructing a vector or objects that contain a vector as a member. use std::move when applicable. 2. use std::vector<std::reference_wrapper<const TensorShape>> instead of std::vector<TensorShape>, when it is only for constant reference usage. 3. calculate key BEFORE (instead of AFTER) acquiring the lock in SessionState::GetMemoryPatternGroup other trivial fixes (code should be straightforward and self-explanatory).	2019-07-18 19:40:48 -07:00
Scott McKay	ac6a4afb0f	Add validation of shape when re-using a buffer in ExecutionFrame (#1356 ) * Check for empty string as dim_param in allocation planner. * Validate shape is compatible at runtime when re-using Tensor.	2019-07-09 14:59:07 +10:00
Scott McKay	065e9dc1ba	Block size mismatches are expected if sequence length varies or there are NonZero ops. Reduce log severity of message due to that. (#1211 )	2019-06-13 19:13:31 +10:00
G. Ramalingam	b23ab6a06e	Implementation of sparse tensor (#1121 ) * Initial implementation of sparse tensor * minor cleanup * minor cleanup (remove empty line) * simplify template usage in test-case * address linux build error * fix constructor order to address compiler warning * Address PR comments * handle allocation in optimizer execution frame * address compiler warning message and PR feedback comment * address gcc unused warning for protobuf code * address PR comment	2019-06-06 11:50:38 -07:00
Changming Sun	c18de6817b	Rename MLValue to OrtValue (#1154 )	2019-06-03 17:29:55 -07:00
Scott McKay	c7d1c007d5	Fix accidental copy where a reference was fine. (#1090 ) Clarify a couple of other uses of 'auto'	2019-05-23 16:47:30 -07:00
ybrnathan	7421755198	Optimize ExecutionFrame to avoid mem re-allocation. (#1085 )	2019-05-23 10:59:36 -07:00
Changming Sun	2663b9c443	Remove unnecessary casts from OrtValue to MLValue(#1051 )	2019-05-17 07:52:59 -07:00
Changming Sun	99556b111d	Make MemPatternPlanner on/off switchable in model weight loading (#989 )	2019-05-16 14:39:09 -07:00
ybrnathan	b0a37477db	Fix memory corruption issue when CPU->CUDA memcpy is involved (#879 )	2019-04-22 20:21:14 -07:00
Yufeng Li	0bf12e9dbf	Add option to enable/disable memory pattern back (#872 ) Memory pattern doesn't work for parallel executor by design. Enabling Memory Pattern for parallel executor logs a warning and makes the perf bad. Add option to enable/disable memory pattern back.	2019-04-22 13:49:41 -07:00
Scott McKay	971058fc38	Avoid copy of pre-existing value to subgraph output (#637 ) * Add AllocKind::kShare to allow copying the MLValue for a pre-existing value to a graph output when an Identity node is involved. Ideally we can make this handling for an Identity node more general purpose, however the current logic to free an MLValue during execution doesn't take into account a re-use point also needing a free. Due to that, limit the scope and start with a somewhat ugly hardcoded approach. Migrate some changes from PR497 The existing Loop unit tests exercise the new code. Also manually stepped through the problematic model to verify the unnecessary copy was avoided. * Fix build error * Fix missing switch case in debug output of allocation plan * Limit optimization to Loop	2019-03-19 06:55:59 +10:00
Changming Sun	cf41f76d79	Fix some warnings (#551 )	2019-03-06 11:46:59 -08:00
Changming Sun	8e0fff7b8d	Support large model(>2GB) (#520 ) 1. Support the new external data extension in ONNX 1.4 onnx/onnx#678 2. Enable onnxruntime_perf_test in Mac Build 3. move path_lib.h from onnx_test_runner source dir to onnxruntime_framework 4. Enable memory planner for string tensors 5. Make memory planner always enabled, to simplify model loading logic 6. Delete some duplicated code between onnxruntime_perf_test and onnx_test_runner 7. Delete win_getopt_mb lib. 8. Remove the dependency on Pathcch lib, which is only available on Windows 8 and newer.	2019-03-05 21:27:12 -08:00
Weixing Zhang	8a59287c46	Create OptimizerExecutionFrame for graph optimization (#526 ) * Create OptimizerExecutionFrame for optimizer With this change, optimizer can easily invoke CPU kernels for graph optimization.	2019-03-04 10:56:41 -08:00
Scott McKay	6c7099a18e	Break dependency on SessionState for ExecutionFrame and OpKernelContext so optimizers can execute a node with a minimal setup (#498 ) * Break dependency on SessionState for ExecutionFrame and OpKernelContext so optimizers can execute a node with a minimal setup. - Create IExecutionFrame - split out core logic and interface from extended logic used in full Graph execution (that uses allocation plan and memory pattern planner) - Update NodeIndexInfo to allow construction from a subset of nodes - split out logic from GraphNodes into a re-usable template so it can be used with a vector of const Node* as well as a vector of unique_ptr<Node> - Remove SessionState from OpKernelContext - Misc cleanups - move AllocPlanPerValue out of SequentialExecutionPlan as it's used in a more generic manner that isn't specific to a sequential execution plan NOTE: I manually tested the new paths, especially NodeIndexInfo. There will shortly be optimizers added that use the new infrastructure so they'll get test coverage as part of those changes. * Fix linux build issue. Handle graph with no nodes in NodeIndexInfo.	2019-02-27 15:46:50 -08:00
Scott McKay	fc7185f060	Various optimizations to reduce the setup and device copying cost outside of the call to ExecuteGraph. (#470 ) * Various optimizations to reduce the setup and execution cost. Cache information about the feeds and fetches, and any device copies required to execute the graph so we minimize checking for later calls to ExecuteGraph using the same input/output. - enable use of caching in Loop and Scan - make use of caching optional for InferenceSession::Run - handle calls to Run with different feeds and fetches to support scenarios where there may be a truncated sequence in some calls Take the feed names and MLValue instances as vectors so the order is deterministic. Add unit tests Update onnxruntime_perf_test to enable caching. * Couple of tweaks. Fix shared library unit test failure. Attempt to workaround MacOS build failure due to VC++ bug around including reaching scope values in a lambda automatically. * Rework order of init in Run so we get nice error messages about invalid feed/output names. * Refine logic around copying MLValue using execution provider so common code can be used. Simplify the logic due to this change. Split the paths for executing with/without cached info so we can be more const correct with how FeedsFetchesManager is passed in. This makes it clearer when a shared instance can be used due to it being const. Cache the FeedsFetchesManager instances in the control flow nodes. They can be re-used across calls to Compute. * Removed unused local variable to fix some builds. * Fix build issue by cleaning up some more unused params. * Check names when using cache entry from SessionState. Add unit test.	2019-02-20 12:12:17 +10:00
Scott McKay	efb72540be	Separate out constant node index information from ExecutionFrame (#410 ) * Separate out the NodeArg index information from ExecutionFrame so it is only calculated once. * Skip copy to/from device if only CPU execution provider is registered. Cleanups. * Address PR comments. Clean up a few areas. * Fix Linux build error	2019-02-01 10:55:49 +10:00
Scott McKay	b194b7df0d	Add the ability to use a custom allocator for fetches to avoid unnecessary copies in control flow operators. (#377 ) * Add the ability to use a custom allocator for fetches. Allows control flow nodes to forward the allocation to the control flow op and avoid an unnecessary copy when the subgraph output has a symbolic dimension. Update Scan and If to use custom allocators when applicable. * Remove unnecessary forward declaration * Fix Mac build warnings	2019-01-29 19:48:10 +10:00

56 commits