Commit graph

319 commits

Hariharan Seshadri
ea3b4e1f8d
Fix bug in DispatchOnTensorType macro (#4808) 2020-08-17 01:16:01 -07:00
Bogdan Bugaev
8ba6b6a21e
Support usage of C API with C++ standards older than C++11 (#4257)
* Use throw() in C API if noexcept is not supported
2020-08-15 11:39:28 -07:00
Maxim Kalinin
ec36c793e8
Eliminate redundant subexpressions (#3047)
* Eliminate redundant subexpressions

Apply local value numbering to merge graph nodes that will always
evaluate to the same value.

* Rename cpp->cc

* Handle optional arguments

* Add test models

* Add more tests with optional arguments

* Fix processing of subgraphs

Also, be resilient to possible mixture of optional and variadic
parameters

* Fix random operators

* Address PR comments

* Minor changes and a test

* Move CSE before constant folding

* Random* operators are always non-deterministic

Even when seed is provided.

* Fix a CSE test

* Reuse the list of non-deterministic operators with constant folding pass

* Address PR comments

* Fix formatting

* Address PR comment

* Minor cleanup / comments

* Fix build failure in Linux

* Reuse existing optimizer/utils file.

Also, check for graph outputs when removing a node.

* Add a test

* Fix compiler warnings

* Fix build in older compilers

* More compatibility with old STL versions
2020-08-14 01:13:05 -07:00
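The local value numbering described in the commit above can be illustrated with a small standalone sketch (hypothetical code, not the actual optimizer): nodes with the same op and the same input value numbers get the same number, so the later duplicate can be redirected to the first computation. As the commit notes, nondeterministic ops such as Random* must be excluded from merging.

```python
def value_number_nodes(nodes):
    """Assign value numbers to (op, inputs) expressions; duplicates share a number.

    `nodes` is a list of (output_name, op, input_names) tuples in topological
    order. Returns a map from a redundant node's output name to the canonical
    output name it can be redirected to. Nondeterministic ops (e.g. Random*)
    should never be passed in, mirroring the CSE pass's exclusion list.
    """
    numbering = {}   # (op, tuple of input value numbers) -> value number
    value_of = {}    # tensor name -> value number
    canonical = {}   # value number -> canonical output name
    replacement = {}
    next_vn = 0
    for out, op, inputs in nodes:
        # Inputs not produced by any node (graph inputs) get fresh numbers.
        for name in inputs:
            if name not in value_of:
                value_of[name] = next_vn
                canonical[next_vn] = name
                next_vn += 1
        key = (op, tuple(value_of[i] for i in inputs))
        if key in numbering:              # redundant subexpression: reuse it
            vn = numbering[key]
            replacement[out] = canonical[vn]
        else:
            numbering[key] = next_vn
            canonical[next_vn] = out
            vn = next_vn
            next_vn += 1
        value_of[out] = vn
    return replacement
```

For example, two `Add(x, y)` nodes receive the same value number, so the second maps to the first and can be removed (subject to the graph-output check the commit mentions).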
Scott McKay
8fb743f767
Refactor Cast to reduce binary size. (#4765)
* Refactor Cast to reduce binary size.
82.5 -> 60.8KB on Windows

* Address PR comments.
Fix build issue.
2020-08-13 20:43:22 +10:00
Tim Harris
9cec98ec1b
Honor allow_spinning at barrier at end of parallel sections (#4767)
This commit means that when the thread pool is configured to spin, then we spin at the barrier at the end of parallel sections in the main thread, in addition to having workers spin waiting for work. 

The change updates Barrier.h to take an additional boolean to select spin/block, and passes this in based on the thread pool configuration. 

It adds an additional test case for barriers, although no problems were identified by the test case.
2020-08-13 09:40:40 +01:00
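The spin-vs-block choice the commit adds to Barrier.h can be sketched in a few lines. This is a hypothetical illustration of the idea, not the actual C++ Barrier: the waiter either busy-waits on a flag (cheap wake-up, burns CPU) or blocks on an OS primitive, selected by a boolean mirroring `allow_spinning`.

```python
import threading

class SpinOrBlockBarrier:
    """Wait until `count` notifications arrive, either spinning or blocking.

    A simplified sketch of a barrier that takes an extra boolean to select
    spin vs. block, as the commit describes for Barrier.h.
    """
    def __init__(self, count, allow_spinning):
        self._remaining = count
        self._lock = threading.Lock()
        self._done = threading.Event()
        self._spin = allow_spinning

    def notify(self):
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:
                self._done.set()

    def wait(self):
        if self._spin:
            while not self._done.is_set():  # busy-wait: low wake-up latency
                pass
        else:
            self._done.wait()               # OS-level block: yields the CPU
```

The real implementation also spins only for a bounded time before falling back to blocking; this sketch omits that refinement.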
Josh Bradley
b7254551f0
Add new api function At() (#4457)
* add modern standards to function arguments
* add first version of At for better tensor element access
2020-08-11 18:34:03 -07:00
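An `At()`-style accessor boils down to row-major offset arithmetic over the tensor's shape. The sketch below shows that arithmetic in standalone form (a hypothetical model of the behavior, not the actual C++ implementation):

```python
def at(data, shape, indices):
    """Row-major element access: `data` is a flat list, `shape` the tensor dims.

    Sketches the offset computation an At()-style accessor performs, with
    bounds checking on each index.
    """
    if len(indices) != len(shape):
        raise IndexError("rank mismatch")
    offset = 0
    for idx, dim in zip(indices, shape):
        if not 0 <= idx < dim:
            raise IndexError("index out of range")
        offset = offset * dim + idx   # accumulate row-major offset
    return data[offset]
```

For a 2x3 tensor stored flat, `at(data, [2, 3], [1, 2])` reads element `data[5]`.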
Ryan Hill
ac725b53f6
Convert TensorRT provider into a shared library (#4721)
Lots of changes to shared library interfaces, new lighter weight design.
2020-08-10 21:17:16 -07:00
Dmitri Smirnov
3530ce541c
Expose IOBinding features via C/C++/C# language bindings. (#4646)
Expose I/O Binding in C/C++/C#
  Expose OrtAllocator, OrtMemoryAllocation, OrtMemoryInfo and OrtIoBinding
2020-08-10 13:33:49 -07:00
Yufeng Li
b22091dc91
Add the framework to support prepack (#4413)
* add support of prepack
* add support for QAttention and DynamicQuantizeMatMul
* add a use_prepacking option
* add use_prepacking in c_sharp api
2020-08-07 09:39:19 -07:00
Sherlock
eb0f57f0e4
Localized Recompute for Gelu and AttentionDropout (#4402)
* Gelu Activation Recompute Draft

* Prototype for localized recompute

* Introduce localized_recompute rewriter

* Command line args for enabling recompute

* Add logger to Gradient Graph Builder

* use const when possible
2020-08-04 21:48:15 -07:00
edgchen1
9d7284fc3b
Enable MatMul + Scale fusion (#4669)
Update TransposeMatMul to support scaling of the matrix product by a constant scalar value (analogous to the GEMM alpha parameter). Rename TransposeMatMul to TransposeScaleMatMul.
Fuse MatMul with surrounding Mul/Div with constant scalar into TransposeScaleMatMul.
2020-08-04 16:27:22 -07:00
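The fusion pattern can be sketched as a simple match over a node list: a Mul whose inputs are a constant scalar and the output of a MatMul folds the scalar into an alpha attribute on a fused node. This is an illustrative simplification; the real pass also handles Div, a Mul preceding the MatMul, transpose attributes, and checks that the MatMul has no other consumers.

```python
def fuse_matmul_scale(nodes, constants):
    """Fold a Mul-by-constant-scalar following a MatMul into one fused node.

    `nodes`: topologically ordered list of dicts with op/inputs/output.
    `constants`: map from tensor name to scalar value.
    """
    out = []
    produced_by = {}
    for n in nodes:
        if n["op"] == "Mul":
            a, b = n["inputs"]
            const_in = b if b in constants else a if a in constants else None
            other = a if const_in == b else b
            prod = produced_by.get(other)
            if const_in is not None and prod is not None and prod["op"] == "MatMul":
                out.remove(prod)  # assumes the MatMul has no other consumers
                fused = {"op": "TransposeScaleMatMul", "inputs": prod["inputs"],
                         "output": n["output"], "alpha": constants[const_in]}
                out.append(fused)
                produced_by[n["output"]] = fused
                continue
        out.append(n)
        produced_by[n["output"]] = n
    return out
```

A `MatMul(A, B)` followed by `Mul(m, 0.5)` collapses to a single node with `alpha == 0.5`, analogous to the GEMM alpha parameter the commit mentions.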
Tim Harris
4bd9e8d05c
Stress-test and fix thread pool when work queues are full (#4690)
While investigating an unrelated issue, I noticed that the thread pool may drop tasks when a burst of 1024+ tasks is submitted by a thread from inside the pool. Today, in general, we execute work synchronously in this case. However, there is a bug where work submitted by a thread already inside the pool will be discarded instead of executed. Currently the only scenario where I can see this occurring is when the parallel executor is used with a model in which such a large number of nodes become eligible to run all at once. This PR fixes the underlying issue and adds a test case for burst-submission of work.
2020-08-04 10:19:49 +01:00
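The intended behavior the fix restores is easy to state in a sketch: when a bounded work queue is full, the submitting thread must run the task synchronously rather than drop it. A hypothetical illustration using a stdlib queue:

```python
import queue

def submit(task_queue, task):
    """Submit work to a bounded queue; if it is full, run the task inline.

    Sketches the fallback the commit describes: when a burst of submissions
    overflows the queue, work degrades to synchronous execution in the
    submitting thread and is never silently discarded.
    """
    try:
        task_queue.put_nowait(task)
        return "queued"
    except queue.Full:
        task()                      # run synchronously, never drop
        return "ran inline"
```

The bug was that this fallback path discarded the task when the submitter was itself a pool thread; the sketch shows the correct invariant (every submitted task eventually runs).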
Wei-Sheng Chin
e9d20e9dba
Revise Send and Recv (#4547)
* Add ability to retrieve inferred shapes when executing a kernel.
This ability helps Recv to know its output shapes without doing
actual communication. Of course, if the output shapes cannot be
inferred, Recv still needs to do communication to get shapes from
Send.

* Avoid communicating shape information when it can be inferred statically

* Replace unordered_map with thread-safe wrapper.
We don't want race conditions and undefined behavior
when using the parallel executor.

* Remove cout

* Add missing file

* Address comments

* Check dim_value. -1 means missing

* lock properly

* Address comments (remove thread-safe map)

* Remove poc header

* Replace Stream with DeferredReleaseCPUPtr
2020-07-30 23:02:45 -07:00
Xiang Zhang
d73e01e5b9
remove ENABLE_TELEMETRY macro (#4633) 2020-07-27 20:06:11 -07:00
Alisha Sonawalla
1e67fff93c
Add GetStringTensorElement, GetStringTensorElementLength and FillStringTensorElement API (#4374)
Add new string tensor APIs and unit tests
2020-07-24 21:35:46 -07:00
Chi Lo
affdeb53c2
Add Python API for specifying device options. (#4205)
* Add python API for specifying CUDA device id

* Modification for providing session based python api for specifying
device id

* When the header file pybind11/stl.h is included, conversion between C++
containers and Python list, vector and dict data structures is
automatically enabled.

https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#

Therefore, refactor the code to better leverage this advantage.

* Make struct CudaDeviceOptions as default cuda device options

* Implement sess.set_providers(list_of_providers, list_of_provider_option_dicts)

But still stay consistent with existing sess.set_providers(list_of_provider)

* Add cuda provider option default setting

* Add support for setting the CUDA provider options cuda_mem_limit and arena_extend_strategy.
Also resolved the merge conflict on session.py

* Use python ctypes to call cuda library to help python unittest

* Refine the code with reviewer's suggestions

* Add the capability of getting execution provider's configuration

- Once we introduced the capability to set an execution provider's
configuration, it makes sense to add the capability of getting the EP's configuration.

* Modify the code with reviewer's suggestions.

* Use stoull() or stoul() depending on 32/64-bit architecture.

* Rewrite the testcases for testing setting CUDA device id

Note: We need to make sure every ORT process runs on one CUDA device
at a time.

* Make sure the old session object is destroyed by the Python GC before a new
session object is created

* Move testcases to original onnxruntime_test_python.py

* Fix bugs to pass CI build

* Make it pass CI build (cont.)

* Make it pass CI build (cont.)
2020-07-21 07:28:13 -07:00
Tracy Sharpe
08235e1662
add Output() overloads (#4546) 2020-07-19 15:21:12 -07:00
Yulong Wang
0229a6a929
[C++ API] add SessionOptions::SetLogSeverityLevel() (#4545) 2020-07-17 21:14:41 -07:00
Yulong Wang
fdc5c308c4
introduce macro ORT_API_MANUAL_INIT in C++ API (#4536)
* introduce macro ORT_API_MANUAL_INIT in C++ API

* resolve comments
2020-07-17 13:23:30 -07:00
Tiago Koji Castro Shibata
2189c77e5b
static_typename (#4520)
* Use static_typename

* Disable RTTI outside of Release

* Fix unused var

* Add test types

* PR feedback
2020-07-16 16:31:02 -07:00
Xueyun Zhu
7d96960ec8
support pipeline partition with shared initializer (#4321)
* support bert partition with shared initializer

* address feedback

* address feedback

* address feedback

* add more test

* remove bert-tiny model

* address feedback

* address function comment

* move CreateNodeArg to graph_utils

* rename function name

* rename function name

* fix windows build

* fix windows type conversion warning

* add function comment
2020-07-14 17:21:40 -07:00
Tim Harris
a95ae164f7
Create N-1 threads in intra-op pool, given main thread now active (#4493)
Create N-1 threads in a thread pool when configured with intra-op parallelism of N. This ensures we have N active threads, given that the main thread also runs work. To avoid ambiguity on the value returned, rename ThreadPool::NumThreads method to ThreadPool::DegreeOfParallelism, and make corresponding updates in MLAS and operators.
2020-07-14 09:48:50 +01:00
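The N-1 scheme above can be sketched with a stdlib executor (an illustration of the idea, not the ORT thread pool): for a degree of parallelism N, only N-1 pool threads are created, and the calling thread executes one share of the work itself.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n_tasks, degree_of_parallelism, work):
    """Run work(i) for i in range(n_tasks) with the caller as one worker.

    With degree of parallelism N, only N-1 pool threads are spawned; the
    main thread runs the first share, so N threads are active in total.
    """
    shares = [list(range(i, n_tasks, degree_of_parallelism))
              for i in range(degree_of_parallelism)]
    with ThreadPoolExecutor(max_workers=max(1, degree_of_parallelism - 1)) as ex:
        futures = [ex.submit(lambda s=s: [work(i) for i in s])
                   for s in shares[1:]]
        for i in shares[0]:          # the main thread works too
            work(i)
        for f in futures:
            f.result()
```

This is also why the accessor is renamed to `DegreeOfParallelism`: the interesting number is active threads (N), not threads owned by the pool (N-1).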
edgchen1
6c7da5e9d3
Optimize CUDA Sum op kernel and refactor CUDA elementwise variadic input op kernels (#4418)
For the special case where all variadic inputs of a kernel are the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation (which was previously used for this special case) handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
2020-07-10 10:20:23 -07:00
Josh Bradley
ca5af9d622
Add modern C++ standards for Ort::Value (#4367)
* add modern standards to function arguments

* code cleanup

* fix code formatting

* add element access convenience function

* change template type name to match rest of code

* remove new At() convenience function

* add better documentation message
2020-07-09 00:35:41 -07:00
Tixxx
b156ae4448
Support training_mode flag in eval (#4324)
* add training_mode feed for evaluation to support opset12
2020-07-08 10:38:54 -07:00
Ashwini Khade
ef602835b0
update getfunctionbody (#4396) 2020-07-02 09:00:37 -07:00
liqunfu
5dcb9b4858
Liqun/backprop deterministic graph (#4315)
Make the gradient graph deterministic;
add session option use_deterministic_compute.
2020-07-01 12:39:10 -07:00
Ashwini Khade
0404763f23
Update function body initialization for ONNX functions (#4332)
* Update function body initialization

* minor fix

* changes per review comments

* minor fix

* format fix

* add function initialization in mixed precision transformer

* more updates

* more fixes
2020-06-30 14:30:59 -07:00
Scott McKay
274e6b4153
Cleanup SessionState. Move allocator lookup to SessionState. (#4194)
* Move allocators to SessionState so they're decoupled from ExecutionProviders
  - when looking up an allocator it's based on OrtMemoryInfo, not the EP, so SessionState is a more natural place for that information to be stored
  - add device based lookup
    - simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
  - provide more things to SessionState at construction time so we don't construct an instance and immediately afterwards call a bunch of setters
  - simplify SessionStateInitializer
    - reduced down to FinalizeSessionState method
2020-06-28 14:55:42 +10:00
Josh Bradley
990b43ddf2
Add modern C++ standards to the C++ API (#4217)
As a zero-cost wrapper around the C API, the C++ API is still pretty low-level and requires programmers to use C-style idioms to interact with ONNX.
2020-06-25 22:28:00 -07:00
Tim Harris
3fc68cb150
Remove non-trivially-destructible thread-local from thread pool state, blocking ARM64 builds (#4336)
- Move thread hint vectors from thread-local struct

- Add static_assert that the per-thread state in the thread pool is trivially-destructible

- Rename "thread_data" to "worker_data" (only allocated for workers in the pool, not threads calling into the pool)
2020-06-25 19:04:31 +01:00
Prabhat
151ef1c8a5
Add C++ wrapper for GetAvailableProviders() C API (#4313) 2020-06-25 13:11:55 +05:30
Tim Harris
9e3b5c62fb
Use OpenMP-like synchronization patterns in Eigen thread pool (#4236)
Updates the thread pool implementation to make work distribution over the Eigen thread pool more closely resemble techniques used in OpenMP. In particular:

(1) A thread entering a parallel loop works on the iterations itself, rather than requiring a thread switch to/from a thread in the pool, if called from outside the thread pool.

(2) To support this, work items pushed to the thread pool run a loop to claim iterations from a shared counter via atomic-fetch-and-add, as opposed to having work items themselves represent individual batches of iterations. This means that any thread working on the loop can execute any batch of iterations, including having the main thread run through all of the batches itself if the loop turns out to be short-running.

(3) As with OpenMP active scheduling, the worker loop spins waiting for work prior to blocking. This avoids OS blocking / wake-up paths in workloads with series of short-running parallel sections.
2020-06-22 10:04:53 +01:00
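Point (2) above, claiming iteration blocks from a shared counter, can be sketched as follows. This is a hypothetical model of the scheme, not the Eigen-based implementation; `itertools.count.__next__` stands in for the atomic fetch-and-add:

```python
import itertools
import threading

def run_loop(total, block, n_threads, body):
    """Claim blocks of iterations from a shared counter, fetch-and-add style.

    Every participating thread, including the caller, repeatedly grabs the
    next block and runs it; a short-running loop may therefore be executed
    entirely by the main thread before any worker wakes up.
    """
    counter = itertools.count(0, block)   # next() is atomic in CPython
    def worker():
        while True:
            start = next(counter)         # claim the next block of iterations
            if start >= total:
                return
            for i in range(start, min(start + block, total)):
                body(i)
    threads = [threading.Thread(target=worker) for _ in range(n_threads - 1)]
    for t in threads:
        t.start()
    worker()                              # the calling thread participates
    for t in threads:
        t.join()
```

Because blocks are claimed dynamically rather than pre-assigned, any thread can execute any batch, which is exactly the property that lets the main thread finish a short loop alone.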
Prabhat
57fabfba7a
Added GetAvailableProviders() to C API (#4247)
* Added GetAvailableProviders to C API

* Fix API version and Windows build error

* Changed function name

* Changed ORT_API_VERSION to 4

* Moved all_providers array to constants.h

* Move check for providers to constants.h

* Changed name of array to avoid warning

* Address review comment

* Added unit test
2020-06-22 10:10:25 +08:00
Scott McKay
175983c082
Move memory info into IAllocator (#2850)
- Update IAllocator setup to move the OrtMemoryInfo to the base class instead of requiring derived classes to have that as a member and override a virtual method to return it.
  - Cleanup CreateAllocator setup to take an argument as to whether to wrap the device allocator in an arena allocator. The choice to do that isn't a property of the underlying device allocator.
  - Minor cleanups in the various EPs to adjust to the change to IAllocator and CreateAllocator, and to use the create_arena flag consistently when available.
2020-06-22 11:18:52 +10:00
Wei-Sheng Chin
de9da123cf
Enable static memory planning for pipeline. (#4204)
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
   the symbolic shapes can be resolved.

* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
   This is done by removing all but one gradient tensor from the last
   RecordEvent in the backward pass.

* Address a comment

* Fix Windows build
2020-06-12 21:43:50 -07:00
Xueyun Zhu
65a682354b
enable pipeline to run with mixed precision (#4113)
* enable pipeline to run with mixed precision

* address feedback

* address feedback

* test log

* pipe information if test fails

* ci failure
2020-06-10 22:16:24 -07:00
suffiank
7f5339505e
Discover trainable parameters using reverse DFS from loss node (#4116)
Discover trainable parameters using reverse DFS from loss node, omitting recursion along untrainable inputs.

Co-authored-by: suffian khan <sukha@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: suffian khan <sukha@microsoft.com>
2020-06-08 14:16:10 -07:00
Scott McKay
9790e19424
Handle mem pattern allocation failure better. Make BFCArena behavior more consistent (#4062)
* Fixes from investigating issue running BERT-Squad model with larger batch sizes. When the batch size gets large enough the initial run will be successful (no memory pattern in use) but the second will fail to allocate the memory pattern block.

The cause of this failure is that we still have the smaller blocks from the first run allocated, as BFCArena has no logic to free those. This essentially results in 2x the memory being required to run the model.

There was inconsistency in BFCArena::Extend which on one path threw an exception if it couldn't do the allocation, and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory.

Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure to when we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper.

Finally update AllocateMLValueTensorSelfOwnBufferHelper to expect a location may not have a block. A log message will be provided when the block allocation fails so it's not necessary to have more on each individual allocation that would have used the block. Falls through to default behavior of doing a normal allocation.
2020-06-05 18:54:01 +10:00
Andrews548
62b44527e5
Add ArmNN Execution Provider (#3714)
* Add ArmNN Execution Provider

Add a new execution provider targeting Arm architecture based on ArmNN.
Validated on NXP i.MX8QM CPU with ResNet50, MobileNetv2 and VGG models.

reviewed-by: mike.caraman@nxp.com

* Minor fixes

- renamed onnxruntime_ARMNN_RELU_USECPU to onnxruntime_ARMNN_RELU_USE_CPU
- fixed acl typo

* remove extra includes. added exception for ArmNN in test

* fix indentation

* Separated the activation implementation from the cpu and fixed the blockage from the endif

Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
2020-06-03 22:57:51 +05:30
Xueyun Zhu
633008b5ef
Add pipeline online partition logic for pipeline (#3996)
* online partition

* fix when multiple consumer nodes are in cut info

* fix windows build

* address feedback

* adding test

* feedback

* address feedback

* add parser for cut edge

* windows build
2020-05-26 17:44:09 -07:00
Paul Fultz II
7759136610
Add amd migraphx execution provider to onnx runtime (#2929)
* Add amd migraphx execution provider to onnx runtime

* rename MiGraphX to MIGraphX

* remove unnecessary changes in migraphx_execution_provider.cc

* add migraphx EP to tests

* add input requests of the batchnorm operator

* add to support an onnx operator PRelu

* update migrapx dockerfile and removed one unused line

* sync submodules with master branch

* fixed a small bug

* fix various bugs to run msft real models correctly

* some code cleanup

* fix python file format

* fixed a code style issue

* add default provider for migraphx execution provider

Co-authored-by: Shucai Xiao <Shucai.Xiao@amd.com>
2020-05-27 04:24:59 +08:00
edelaye
64b5f7edf6
Initial release of Vitis-AI Execution Provider (#3771)
* Initial release of Vitis-AI Execution Provider

* Add documentation, fix for onnxruntime::Model changes and use stringstream instead of file dump for model passing

* - Add Vitis-AI docker file
- Add online quantization flow Vitis-AI execution provider
- Fix remarks

* - Add fatal error build message for Vitis-AI cmake build on Windows
- Fix pep8 issue in build.py
- Add Vitis-AI execution provider example in docs

Co-authored-by: Elliott Delaye <elliott@xilinx.com>
Co-authored-by: Jorn Tuyls <jornt@xilinx.com>
Co-authored-by: Jorn Tuyls <jtuyls@users.noreply.github.com>
2020-05-19 05:32:32 -07:00
Vincent Wang
3c24841569
Fold Shape Node During Constant Folding (#3748)
* Fold Shape node in constant folding.

* bugfix

* Fix test failure.

* Bugfix for C++ frontend.

* Bugfix for C++ frontend.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-05-09 20:15:03 +08:00
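Folding a Shape node is only legal when every dimension of its input is statically known. A minimal sketch of the idea (hypothetical structure, not the actual pass):

```python
def fold_shape_nodes(nodes, known_shapes):
    """Replace Shape nodes whose input shape is fully known with constants.

    `known_shapes` maps tensor names to inferred dims; a dim of None stands
    for a symbolic dimension and blocks folding.
    """
    folded = []
    for n in nodes:
        if n["op"] == "Shape":
            shape = known_shapes.get(n["inputs"][0])
            if shape is not None and all(d is not None for d in shape):
                folded.append({"op": "Constant", "inputs": [],
                               "output": n["output"], "value": list(shape)})
                continue
        folded.append(n)   # symbolic or unknown shape: keep the node
    return folded
```

A `Shape` of a tensor inferred as (2, 3) becomes a `Constant` with value [2, 3]; a tensor with a symbolic batch dimension is left alone.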
Sheil Kumar
cf6a1c1715
Fix Windows Inbox build failing on 1) building raw api tests and 2) referencing _winml namespace in onnxruntime.dll (#3872)
* add build inbox flag

* remove raw tests and wstring for utf filenames

* enable raw tests

* use ToWideString

* create new utf8 helper

* update string helper to utf8

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2020-05-08 15:59:16 -07:00
Ryan Hill
d5ec353e58
Ryanunderhill/mkldnn dll (#3314)
First version of allowing providers to work as DLLs, only implemented for DNNL so far.

More improvements to come next!
2020-05-06 00:57:09 -07:00
airockchip
edaf8a542c
Initial PR for RKNPU execution provider (#3609)
* Initial RKNPU execution provider

    * Init

    * Support Ops:
        Conv, Relu, Clip, LeakyRelu,
        MaxPool, AveragePool, GlobalAveragePool,
        Concat, Softmax, BatchNormalization, Gemm,
        Add, Mul, Sub,
        Reshape, Squeeze, Unsqueeze,
        Flatten, Transpose,
        QLinearConv, DequantizeLinear

    * Add rknpu unittest

    * Update BUILD.md and Add RKNPU-ExecutionProvider.md

* misc code update

* fix CLIP accuracy issue.

* fix "Error: Duplicate definition of name".

* move rknpu_ddk out of onnxruntime submodule.

* remove temporary code.

* add rknpu namespace.

* update misc of node_attr_helper

* add const & comment for onnx_converter

* add const & comment for shaper

* unify variable name

Co-authored-by: dkm <dkm@rock-chips.com>
Co-authored-by: George Wu <jywu@microsoft.com>
2020-05-05 20:36:47 -07:00
Changming Sun
bd78364411
Parallel all the activations ops (#3722)
1. Parallelize all the activation ops.
2. Parallelize the performance-critical path of the LRN op, which makes the ONNX Model Zoo GoogLeNet model run 60% faster (latency reduced from 21 ms to 13 ms).
3. Make the Gemm-Activation fusion support all the activation ops. Before this change, it only supported LeakyRelu/Relu/Sigmoid/Tanh.
4. Delete onnxruntime/test/framework/op_kernel_test.cc because the file is almost empty.
5. Remove the loggings in KernelRegistry::TryFindKernel, return Status with error message instead.
2020-05-05 01:18:17 -07:00
Scott McKay
15eca74d15
Make ThreadPool::PartitionWork a bit more user friendly. Update a few places to use PartitionWork. (#3795) 2020-05-02 17:09:55 +10:00
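A work-partitioning helper of this kind computes near-equal contiguous chunks, spreading any remainder over the first batches. The sketch below shows the arithmetic under assumed semantics; it is not the actual ThreadPool::PartitionWork signature.

```python
def partition_work(batch_index, num_batches, total_work):
    """Return the [start, end) slice of total_work assigned to one batch.

    Chunks differ in size by at most one unit: the first
    (total_work % num_batches) batches each get one extra item.
    """
    base = total_work // num_batches
    remainder = total_work % num_batches
    extra = min(batch_index, remainder)        # earlier batches got one more
    start = batch_index * base + extra
    size = base + (1 if batch_index < remainder else 0)
    return start, start + size
```

Partitioning 10 items over 3 batches yields (0, 4), (4, 7), (7, 10): contiguous, covering the whole range, sizes differing by at most one.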
Changming Sun
edd5855fb7 Remove eigen device from thread pool 2020-05-01 02:21:57 -07:00