Commit graph

151 commits

Author SHA1 Message Date
Sherlock
eb5c1f0fcc
Unify activation and initializer alignment value (#6109)
* Unify activation and initializer alignment value

* Fix VerifyInputTensorsAllocatedContiguously
2020-12-14 13:13:41 -08:00
Sherlock
a53f4dd379
Introduce VariadicAlias, remove hardcoded alias limits (#6106)
* Introduce VariadicAlias, remove hardcoded alias limits

* Include optional-lite in winml build

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-11 10:47:08 -08:00
RandySheriffH
404982ded5
Enable varied input type for custom op (#6066)
* allow custom op taking varied types

* refactor test case

* add test model

* refactor test case

* enable copy elision

* update test case

* fix issue in ToString function
2020-12-09 15:10:42 -08:00
Hariharan Seshadri
d46dbeafd3
Expose knobs to create and share (CPU) allocators across sessions in C# and Python (#5634) 2020-11-21 14:12:33 -08:00
Scott McKay
00412a76e9
Exclude some training specific code from the minimal build. Cleanup some related aspects of allocation planner. (#5861)
* Exclude some training specific code around the allocation planning and initializer handling from the minimal build.
Simplify the code around tracking start/end usage of a value.
2020-11-20 20:25:46 +10:00
Scott McKay
7b76b57fc8
Support EPs that compile nodes in a minimal build. (#5776)
* Support EPs that compile nodes in a minimal build. This enables NNAPI being used.
2020-11-17 13:52:22 +10:00
Hariharan Seshadri
b92fc66ea1
Support opset-13 specs of controlflow ops (Loop, If) (#5665) 2020-11-11 23:44:14 -08:00
edgchen1
2acdc3cd82
Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729) 2020-11-09 11:37:06 -08:00
M. Zeeshan Siddiqui
9af0d48524
Memory planner and pattern generation enhancements. (#4443)
* static allocation.

* chanegs.

* contigious dynamic allocation.

* contigious dynamic allocation.

* fix bugs.

* fix bug.

* build errors.

* PR feedback.

* PR feedback.

* Update Graph builder for nccl_allreduce, mps.

* misc.

* fix windows build break.

* changes.

* fine-grained memory-time scheduling.

* merge.

* fix misc stuff.

* fix windows build.

* fix windows build.

* fix merge bug.

* merge conflicts.

* revert onnx-tensorrt submodule commit.

* fix submodule commit.

* misc.

* merge conflicts.

* Revert "merge conflicts."

This reverts commit 319a071a6e.

* merge conflict.

* merge conflict.

* merge conflicts.

* fixes.

* PR feedback.

* build break.

* build break.

* Add asserts.

* Add asserts.

* asserts.

* asserts.

* asserts.

* asserts.

* asserts.

* fixes.

* fixes.

Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-01 23:05:46 -08:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.  

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Dmitri Smirnov
742ffb860c
Allow Kernels refer to some attribute data directly in the protobuf (#5624)
* Introduce OpKernelInfo GetAttrAsSpan() for floats and ints attribute proto arrays
  and GetAttrsStringRefs() to return a vector of string references.
  These new APIs allow kernels not copy attribute arrays especially if they are large
  and save on memory.
  but refer directly to data that is in AttributeProto.
  Modify TfIdfVectorizer to take advantage of the new API.

Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-29 16:12:54 -07:00
Ryan Hill
e90b6f06d1
Factor out IAllocator so that it can be shared with shared providers (#5567)
* Factor out IAllocator so shared providers can use it directly.
2020-10-27 17:28:17 -07:00
Dmitri Smirnov
3433576fd3
Support for Sparse Initializers (#5540)
Introduce sparse_initializers support.
  Convert them to dense on model load and prune graph_proto_
  so they don't consume space. Convert back to sparse on ORT Format model save.
  Implement serializing sparse initializers to OrtFormat.
  Fix Model::ToProto() to return original sparse initializers
  Set a flag that graph_sync is needed when loading a simple ORT Format model.
  otherwise nothing is resolved.
  Add ORT Format history to README.md
  ifdef MINIMAL build for DenseToSparseTensorInitializer
  Allow duplicate initializers to support existing models.
  Issue a warning instead of aborting.

* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.



Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-27 10:32:06 -07:00
Ryan Hill
82c7a9756e
Fix shared provider unload crash (#5553) 2020-10-21 13:01:21 -07:00
Changming Sun
280cdf31f5
Revert "Fix shared provider unload crash (#5523)" (#5547)
This reverts commit 610676293e. Because Linux DNNL pipeline is failing.
2020-10-20 08:01:28 -07:00
Ryan Hill
610676293e
Fix shared provider unload crash (#5523)
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
2020-10-19 18:08:38 -07:00
Scott McKay
a92ccbe1bc
Various armv7 related fixes (#5394)
* - Link with libatomic if needed
 - Install pip differently so it doesn't clash with the system pip which may involve a wrapper script
 - Remove ability to specify offset when Tensor allocates the data. The data prior to offset isn't accessible by anything.
 - Fix use of offset in TensorOpTest to work on armv7 where it must be aligned to the type it points to.
 - Fix ActivationOpNoInfTest.Softsign to allow for armv7 behavior
 - Fix ReductionOpTest.ReduceMean_*keepdims to allow for armv7 floating point inaccuracy

* Address PR comments
2020-10-09 22:34:32 +10:00
Pranav Sharma
974b9bfc09
Allow sharing of initializers between sessions. (#5092)
* Allow sharing of initializers between sessions.

* Allow sharing of initializers between sessions (2).

* Add test for C#

* Add test for C#; address PR comments

* Address PR comments
Moved AddInitializer logic to internal session options
Added tests for owned buffer
Clarified documentation
Fix bug where memory info and not device was getting compared

* Fix test

* Fix training build

* Add ver 5 end marker and ver 6 starter, add scenario and usage examples.
2020-09-21 14:09:37 -07:00
Scott McKay
59ee8ffb17
Remove SparseTensor support from minimal build. (#5114)
* Remove SparseTensor support from minimal build.

Currently the only valid usage of a SparseTensor is as an attribute of a Constant node. That would have been lifted to a dense tensor initializer when loading the onnx model, so would not exist when saving the ORT format model. Due to that there can be no SparseTensors in an ORT format model.

Co-authored-by: gwang <wanggy@outlook.com>
2020-09-11 17:56:54 +10:00
Ryan Hill
3207de276c
Remove IDeviceAllocator class as it doesn't extend IAllocator in any way. (#5067) 2020-09-10 00:46:35 -07:00
Vincent Wang
84de14a833
Register OpSet13 CUDA Kernels for BERT/UniLMv2 (#4856)
* opset13 cuda kernels for BERT.

* add opset13 SoftmaxCrossEntropyLoss.

* opset13 size.

* fix argmax/min for ut.

* fix ut failure for argmax/min.

* OrtMemTypeCPUInput

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-05 08:09:52 +08:00
Scott McKay
b5c2932ae8
Last major set of ORT format model changes (#5056)
* Add minimal build option to build.py
Group some of the build settings so binary size reduction options are all together
Make some cmake variable naming more consistent
Replace usage of std::hash with murmurhash3 for kernel. std::hash is implementation dependent so can't be used.
Add initial doco and ONNX to ORT model conversion script
Misc cleanups of minimal build breaks.
2020-09-05 07:59:01 +10:00
Ryan Hill
d792af776d
Remove Cuda dependency from TensorRT shared provider (#5014) 2020-09-04 11:35:02 -07:00
gwang-msft
7ca8388dc9
[ORT Mobile] file format schema and file I/O code (#4973)
* ort mobile file format schema and [de]serializing code
2020-09-01 11:51:31 +10:00
gwang-msft
ea5732319e
Add option ORT_NO_EXCEPTIONS to disable most exception/throw in /onnxruntime/ (#4894)
* init no exception changes

* initial test

* disable exceptions

* more throw handling

* minor update

* fix linux build break

* fix windows/nuphar build break

* address cr comments, move #ifdef to ORT_CATCH

* address cr comments, move #ifdef to ORT_CATCH

* handle return statement in ORT_CATCH

* linux build break fix

* addressed cr comments, remove ort_catch_end

* addressed cr comments, remove ort_catch_end

* move mlas to a separated ifdef flag

* merge master, move some new code in master to no_exc

Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
2020-08-28 23:03:51 -07:00
Scott McKay
08eb15068c
Exclude the Map types from the build if ML ops are disabled. (#4908)
* Exclude the Map types from the build if ML ops are disabled. They're the only ops that use Map.
2020-08-27 17:48:12 +10:00
Scott McKay
728e886bba
Add kernel def hash logic for minimal build (#4891)
* Add hash based lookup of kernels
2020-08-23 14:39:07 +10:00
Scott McKay
db7669b225
Reduce ONNX dependency in minimal build (#4890)
* Next round of changes.

Remove inclusion of ONNX schema header
Exclude custom registry related things
Move IsConstantInitializer from graph_utils to Graph as it's needed in a minimal build and graph_utils is excluded.
2020-08-23 07:02:13 +10:00
Pranav Sharma
29dcfb24ab
Allow multiple sessions to share an allocator, optimize constant folding memory usage, expose arena configs. (#4813)
* Add support for sharing allocators

* Incremental update

* Address some PR comments, add unit tests, add documentation.

* Address PR comments, add tests and some documentation.

* Fix build and test issues

* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
2020-08-22 10:03:17 -07:00
RandySheriffH
3fa73a5b6a
ReduceBinarySize (#4747)
* cancel night build on pyop

* add rewriter to rewrite cpu provider

* skip BuildKernelCreateInfo<void>

* refactor variable name and comment

* include ops from csv file

* process multiple eps

* add default function to cuda provider

* rename function and add license header

* fix import

* add doc

* fix typo

* deal with empty kernel entry in cuda

* rename the rewriter file

* add comment into provider file

* add comment and rename function

* log warnings

* refactor extracting logic

* add entry for script to run solo

* add better example

* avoid onnx importing

* fix flake8 alerts

* minor fixes to better comments and doc

* add entries for all domains

* add void entry into contrib providers

* format cuda_contrib_kernels.cc

* format cpu_contrib_kernels.cc

* add all providers

* add default entry to all providers

* include op_kernel header

* cancelling change in providers beyond cpu/cuda

* rename file and switch file format to domain;opset;op1,op2...

* update doc

* restore non-regular ending grammar in cuda_contrib_kernels.cc

* add ort_root as input argument of script

* enable test in ci

* update doc

* update doc

* revert change on linux gnu ci

* switch to set to host ops

* simplify trimming logic

* add domain map to track current model

* allow ort_root to take relative path
2020-08-21 19:50:13 -07:00
Vincent Wang
5eaac31faa
support opset13 on transformers. (#4837)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-08-19 11:13:37 +08:00
Hariharan Seshadri
ea3b4e1f8d
Fix bug in DispatchOnTensorType macro (#4808) 2020-08-17 01:16:01 -07:00
Scott McKay
8fb743f767
Refactor Cast to reduce binary size. (#4765)
* Refactor Cast to reduce binary size.
82.5 -> 60.8KB on Windows

* Address PR comments.
Fix build issue.
2020-08-13 20:43:22 +10:00
Ryan Hill
ac725b53f6
Convert TensorRT provider into a shared library (#4721)
Lots of changes to shared library interfaces, new lighter weight design.
2020-08-10 21:17:16 -07:00
Yufeng Li
b22091dc91
Add the framework to support prepack (#4413)
* add support of prepack
* add support for QAttention and DynamicQuantizeMatMul
* add an use_prepacking option
* add use_prepacking in c_sharp api
2020-08-07 09:39:19 -07:00
Wei-Sheng Chin
e9d20e9dba
Revise Send and Recv (#4547)
* Add ability to retrieve inferred shapes when executing a kernel.
This ability helps Recv to know its output shapes without doing
actual cummunication. Of course, if the output shapes cannot be
inferred, Recv still needs to do communication to get shapes from
Send.

* Avoid communicating shape information when it can be inferred statically

* Replace unordered_map with thread-safe wrapper.
We don't want to have racing condition and undefined behavior
when using parallel executor.y

* Remove cout

* Add missing file

* Address comments

* Check dim_value. -1 means missing

* lock properly

* Address comments (remove thread-safe map)

* Remove poc header

* Replace Stream with DeferredReleaseCPUPtr
2020-07-30 23:02:45 -07:00
Chi Lo
affdeb53c2
Add Python API for specifying device options. (#4205)
* Add python API for specifying CUDA device id

* Modification for providing session based python api for specifying
device id

* When include header file pybind11/stl.h, conversion between c++
containers and Python list, vector and dict data structure are
automatically enabled.

https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#

Therefore, refactor the code for better leverage this advantage.

* Make struct CudaDeviceOptions as default cuda device options

* Implement sess.set_providers(list_of_providers, list_of_provider_option_dicts)

But still stay consistent with existing sess.set_providers(list_of_provider)

* Add cuda provider option default setting

* Add support for setting cuda cuda_mem_limit and arena_extend_strategy.
Also resolved the merge conflict on session.py

* Use python ctypes to call cuda library to help python unittest

* Refine the code with reviewer's suggestions

* Add the capability of getting execution provider's configuration

- Once we introduced the capability to set execution provider's
configuration, it makes sense to add capability of getting ep's configuration.

* Modify the code with reviewer's suggestions.

* Using stoull() and stoul() depends on 32/64-bits architecture.

* Rewrite the testcases for testing setting CUDA device id

Note: We need to make sure every ORT process be run on one CUDA device
at a time.

* Make sure old session object is destroyed by python gc before new
session object is being created

* Move testcases to original onnxruntime_test_python.py

* Fix bugs to pass CI build

* Make it pass CI build (cont.)

* Make it pass CI build (cont.)
2020-07-21 07:28:13 -07:00
Tracy Sharpe
08235e1662
add Output() overloads (#4546) 2020-07-19 15:21:12 -07:00
Tiago Koji Castro Shibata
2189c77e5b
static_typename (#4520)
* Use static_typename

* Disable RTTI outside of Release

* Fix unused var

* Add test types

* PR feedback
2020-07-16 16:31:02 -07:00
edgchen1
6c7da5e9d3
Optimize CUDA Sum op kernel and refactor CUDA elementwise variadic input op kernels (#4418)
For the special case where all variadic inputs of a kernel are the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation (which was previously used for this special case) handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
2020-07-10 10:20:23 -07:00
Tixxx
b156ae4448
Support training_mode flag in eval (#4324)
* add training_mode feed for evaluation to support opset12
2020-07-08 10:38:54 -07:00
Scott McKay
274e6b4153
Cleanup SessionState. Move allocator lookup to SessionState. (#4194)
* Move allocators to SessionState so they're decoupled from ExecutionProviders
  - when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that infromation to be stored
  - add device based lookup
    - simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
  - provide more things to SessionState at construction time so we don't construct and instance and immediately after call a bunch of setters
  - simplify SessionStateInitializer
    - reduced down to FinalizeSessionState method
2020-06-28 14:55:42 +10:00
Scott McKay
175983c082
Move memory info into IAllocator (#2850)
- Update IAllocator setup to move the OrtMemoryInfo to the base class instead of requiring derived classes to have that as a member and override a virtual method to return it.
  - Cleanup CreateAllocator setup to take an argument as to whether to wrap the device allocator in an arena allocator. The choice to do that isn't a property of the underlying device allocator.
  - Minor cleanups in the various EPs to adjust to the change to IAllocator and CreateAllocator, and to use the create_arena flag consistently when available.
2020-06-22 11:18:52 +10:00
Scott McKay
9790e19424
Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062)
* Fixes from investigating issue running BERT-Squad model with larger batch sizes. When the batch size gets large enough the initial run will be successful (no memory pattern in use) but the second will fail to allocate the memory pattern block.

The cause of this failure is that we still have the smaller blocks from the first run allocated, as BFCArena has no logic to free those. This essentially results in 2x the memory being required to run the model.

There was inconsistency in BFCArena::Extend which on one path threw an exception if it couldn't do the allocation, and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory.

Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure to when we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper.

Finally update AllocateMLValueTensorSelfOwnBufferHelper to expect a location may not have a block. A log message will be provided when the block allocation fails so it's not necessary to have more on each individual allocation that would have used the block. Falls through to default behavior of doing a normal allocation.
2020-06-05 18:54:01 +10:00
Paul Fultz II
7759136610
Add amd migraphx execution provider to onnx runtime (#2929)
* Add amd migraphx execution provider to onnx runtime

* rename MiGraphX to MIGraphX

* remove unnecessary changes in migraphx_execution_provider.cc

* add migraphx EP to tests

* add input requests of the batchnorm operator

* add to support an onnx operator PRelu

* update migrapx dockerfile and removed one unused line

* sync submodules with mater branch

* fixed a small bug

* fix various bugs to run msft real models correctly

* some code cleanup

* fix python file format

* fixed a code style issue

* add default provider for migraphx execution provider

Co-authored-by: Shucai Xiao <Shucai.Xiao@amd.com>
2020-05-27 04:24:59 +08:00
Ryan Hill
d5ec353e58
Ryanunderhill/mkldnn dll (#3314)
First version of allowing providers to work as DLLs, only implemented for DNNL so far.

More improvements to come next!
2020-05-06 00:57:09 -07:00
Changming Sun
bd78364411
Parallel all the activations ops (#3722)
1. Parallel all the activations ops.
2. Parallel the performance critical path of the LRN op, which makes the ONNX model zoo googlenet model runs 60% faster(latency reduced from 21ms to 13ms).
3. Make the Gemm-Activation fusion support with all the activations ops. Before this change, it only supports LeakyRelu/Relu/Sigmoid/Tanh.
4. Delete onnxruntime/test/framework/op_kernel_test.cc because the file is almost empty.
5. Remove the loggings in KernelRegistry::TryFindKernel, return Status with error message instead.
2020-05-05 01:18:17 -07:00
Tixxx
0638565fe0
Fix evaluation issues (#3538)
* allow switching between eval and training modes dynamically

Co-authored-by: Tixxx <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
2020-04-28 21:03:37 -07:00
Jeff Bloomfield
f1c19f8495 merge master 2020-04-25 19:04:58 -07:00
Edward Chen
daa14b64e3 Merge remote-tracking branch 'origin/master' into edgchen1/merge_from_master 2020-04-21 03:31:32 +00:00