Commit graph

410 commits

Author SHA1 Message Date
Edward Chen
d761571afc
Deprecate Python global configuration functions [Part 2] (#6171)
Update Python API to allow more flexibility for setting providers and provider options.

The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.

Update some EP-specific option parsing to fail on unknown options.

Other cleanup.
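The accepted provider forms described above can be sketched with a small normalization helper (illustrative only; the names and logic here are not the actual ORT implementation):

```python
def normalize_providers(providers):
    """Accept provider names or (name, options dict) tuples, as the updated
    providers argument does, and normalize everything to (name, options)."""
    normalized = []
    for p in providers:
        if isinstance(p, str):
            normalized.append((p, {}))      # bare name -> empty options
        else:
            name, options = p               # (name, options dict) tuple
            normalized.append((name, dict(options)))
    return normalized
```

Passing the result in default priority order (as returned by the fixed get_available_providers()) preserves the default EP priority.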
2021-01-07 10:10:55 -08:00
Hariharan Seshadri
d42399e1b0
Allow querying a GraphProto's doc_string as part of ModelMetadata (#6248) 2021-01-05 22:18:03 -08:00
Edward Chen
ce6161cf67
Add MakeStringLite which uses current locale, update some MakeString call sites to use it instead. (#6252)
* Add MakeStringLite which uses current locale, update macros to use that to generate messages.

* Convert calls to MakeStringLite().
2021-01-04 19:27:24 -08:00
Hector Li
ffb4b62826
Fix allocator issue for TensorRT IOBinding (#6240)
* Fix issue: https://github.com/microsoft/onnxruntime/issues/6094

Root cause: we didn't expose the OrtMemoryInfo for TRT, which causes issues when the user wants to use IOBinding with TensorRT.

Short-term fix: add the OrtMemoryInfo for TRT. Long term, the allocators for CUDA and TRT should be unified.
2020-12-31 20:15:43 -08:00
Scott McKay
2da8060f34
Helper for compiling EP to generate deterministic unique ids for use in MetaDef names (#6156)
* Create a helper for generating unique ids that can be used by an EP that creates compiled nodes and needs ids to be deterministic for a model when used in multiple sessions.

Added to IExecutionProvider as this can potentially be used by all compiling EPs and is more robust than a simplistic counter (although EP implementer is free to choose either approach).

* Restructure the helper so it can be called across the EP bridge.
Add ability to call id generation helper from EP bridge
  - convert DNNL EP to use helper to validate
Address issue where a new Model may be loaded into the same address as a previous one.
  - hash the bytes in the Graph instance (1728 bytes currently) to use as the key to the full hash for the model
Add lock around id generation to ensure no issues if multiple sessions partition graphs at exactly the same time.
  - Extremely unlikely but would be hard to debug and the locking cost is not an issue as it's only incurred during graph partitioning and not execution.
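A minimal sketch of the idea above, assuming a hash of the graph bytes keys a per-model counter (the class and method names here are hypothetical, not the ORT helper's):

```python
import hashlib
from collections import defaultdict

class ModelMetadefIdGenerator:
    """Sketch of deterministic MetaDef id generation: a content hash identifies
    the model (robust to a new Model loading at a reused address), and a
    per-model counter yields ids that are stable across sessions."""
    def __init__(self):
        self._model_ids = {}                # content hash -> model id
        self._counters = defaultdict(int)   # model id -> next metadef id

    def generate_id(self, graph_bytes: bytes):
        key = hashlib.sha256(graph_bytes).hexdigest()
        model_id = self._model_ids.setdefault(key, len(self._model_ids))
        metadef_id = self._counters[model_id]
        self._counters[model_id] += 1
        return model_id, metadef_id
```

A real implementation would also guard `generate_id` with a lock, per the commit's note about concurrent graph partitioning.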
2020-12-21 12:17:58 +10:00
Pranav Sharma
efa1b0d864
Minor fix to satisfy c++14 (#6162) 2020-12-17 13:53:24 -08:00
Ryan Hill
ac62cf8058
Unify IExecutionProvider and IExecutionProviderFactory interfaces (#6108)
* Remove Provider_IExecutionProvider and make the internal IExecutionProvider usable by shared providers
* Change Provider_IExecutionProviderFactory to be the core version.
2020-12-15 16:45:53 -08:00
Edward Chen
64709b1335
Deprecate Python global configuration functions [Part 1] (#5923)
Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.
2020-12-15 11:32:43 -08:00
Sherlock
eb5c1f0fcc
Unify activation and initializer alignment value (#6109)
* Unify activation and initializer alignment value

* Fix VerifyInputTensorsAllocatedContiguously
2020-12-14 13:13:41 -08:00
Sherlock
a53f4dd379
Introduce VariadicAlias, remove hardcoded alias limits (#6106)
* Introduce VariadicAlias, remove hardcoded alias limits

* Include optional-lite in winml build

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-11 10:47:08 -08:00
RandySheriffH
404982ded5
Enable varied input type for custom op (#6066)
* allow custom op taking varied types

* refactor test case

* add test model

* refactor test case

* enable copy elision

* update test case

* fix issue in ToString function
2020-12-09 15:10:42 -08:00
Pranav Sharma
2c5ba9ab00
Bump up API version for 1.6 release (#6076) 2020-12-08 01:24:29 -08:00
Moshe David
06ad516a5d
w (#5947)
Co-authored-by: modav <modav@microsoft.com>
2020-11-30 10:35:44 +10:00
Hariharan Seshadri
d46dbeafd3
Expose knobs to create and share (CPU) allocators across sessions in C# and Python (#5634) 2020-11-21 14:12:33 -08:00
Guoyu Wang
cc6e8fb7cc
Filter initializers for GraphViewer with IndexedSubGraph (#5884)
* fix filtered subgraph initializer issue

* minor fix

* Include implicit inputs of nodes to see if they are initializers

* Add test case

* minor update

* Address PR comments

* Fix some code errors
2020-11-20 18:36:58 -08:00
Ryan Hill
ba739a8000
Convert OpenVINO into a shared provider (#5778)
Same as Dnnl and TensorRT before it, now with more methods and more cleanup.
2020-11-20 17:39:57 -08:00
Scott McKay
00412a76e9
Exclude some training specific code from the minimal build. Cleanup some related aspects of allocation planner. (#5861)
* Exclude some training specific code around the allocation planning and initializer handling from the minimal build.
Simplify the code around tracking start/end usage of a value.
2020-11-20 20:25:46 +10:00
S. Manohar Karlapalem
ff58f621fa
Remove nGraph Execution Provider (#5858)
* Remove nGraph Execution Provider

Pursuant to nGraph deprecation notice: https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md#deprecation-notice

**Deprecation Notice**

| | |
| --- | --- |
| Deprecation Begins | June 1, 2020 |
| Removal Date | December 1, 2020 |

Starting with the OpenVINO™ toolkit 2020.2 release, all of the features
previously available through nGraph have been merged into the OpenVINO™
toolkit. As a result, all the features previously available through
ONNX RT Execution Provider for nGraph have been merged with ONNX RT
Execution Provider for OpenVINO™ toolkit.

Therefore, ONNX RT Execution Provider for **nGraph** will be deprecated
starting June 1, 2020 and will be completely removed on December 1,
2020. Users are recommended to migrate to the ONNX RT Execution Provider
for OpenVINO™ toolkit as the unified solution for all AI inferencing on
Intel® hardware.

* Remove nGraph Licence info from ThirdPartyNotices.txt

* Use simple Test.Run() for tests without EP exclusions

To be consistent with rest of test code.

* Remove nGraph EP functions from Java code
2020-11-19 16:47:55 -08:00
Guoyu Wang
261462be0d
Change NNAPI runtime options to use uint32_t (#5863)
* Change nnapi options unsigned long -> uint32_t

* Move options from long to int in java code
2020-11-19 13:38:49 -08:00
Pranav Sharma
c2a993e745
Add documentation for OrtArenaCfg for CreateAndRegisterAllocator API. (#5831)
* Add documentation for OrtArenaCfg for CreateAndRegisterAllocator API.

* Address PR comments

* More comments
2020-11-18 10:21:20 -08:00
Scott McKay
7b76b57fc8
Support EPs that compile nodes in a minimal build. (#5776)
* Support EPs that compile nodes in a minimal build. This enables NNAPI being used.
2020-11-17 13:52:22 +10:00
Guoyu Wang
dc0f7b8f82
Remove onnxruntime_session_options_config_keys.h from c_api (#5772)
* Remove session config keys header from c_api

* remove copy session config header in release package

* Keep the session option config header in the package
2020-11-12 09:12:13 -08:00
Hariharan Seshadri
b92fc66ea1
Support opset-13 specs of controlflow ops (Loop, If) (#5665) 2020-11-11 23:44:14 -08:00
Tim Harris
48b14b52b8
Remove Env::Task wrapper around std::function (#5753)
This is a small perf / clean-up change. It removes the Env::Task abstraction which wraps a single std::function field, and adds at least one virtual method call overhead when creating a Task and when executing it. The POSIX and Windows implementations are now identical.
2020-11-10 20:22:07 +00:00
Tim Harris
5e44d25c5a
Support multi-loop parallel sections, use multi-loop sections in GRU (#5602)
This PR updates the ThreadPool API to support multi-loop parallel sections. As with the OpenMP "parallel" construct, this allows per-loop work to be amortized over a series of loops. For ORT, it also promotes locality between successive loops in the sense that iteration X of one loop will tend to run on the same worker thread as iteration X of preceding loops.

The change was developed while optimizing the implementation of a model that performed better with OpenMP. Profiling indicated that OpenMP was providing lower loop entry/exit costs and that, via OpenMP's static scheduling, it was leading to a lower L2 miss rate in the series of parallel loops used in GRU.

The main changes are:

- Addition of ThreadPool::ParallelSection and underlying support in the modified Eigen thread pool.

- In EigenNonBlockingThreadPool.h, refactoring the RunInParallel method to support two variants: one that takes an existing parallel section object created by the caller, and another (used by default) that creates its own parallel section.

- Simplify ThreadPool::LoopCounter (used by worker threads to claim loop iterations), basing it on an ID supplied by the underlying Eigen thread pool for affinity in a series of loops.

- Fix a possible perf issue where a loop with iterations scheduled in batches would have more threads than batches available.

- Use of parallel sections in the GRU operator.

- Additional test cases in threadpool_test.h.

- Additional comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.
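A rough Python analogue of a parallel section, assuming `concurrent.futures` stands in for the Eigen thread pool: workers are created once and a series of loops runs inside the section, amortizing per-loop startup cost the way the PR describes (this is illustrative, not the ORT API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_loops_in_section(loops, num_workers=4):
    """Run several parallel loops inside one 'section'.

    loops: list of (fn, n) pairs; each loop applies fn to 0..n-1.
    The executor (our stand-in for a ParallelSection) is entered once,
    rather than paying thread-pool entry/exit costs per loop.
    """
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:  # section begins
        for fn, n in loops:
            results.append(list(pool.map(fn, range(n))))       # one parallel loop
    return results                                             # section ends
```

The locality benefit in ORT (iteration X of each loop tending to land on the same worker thread) depends on the thread pool's scheduling and is not modeled here.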
2020-11-10 12:24:57 +00:00
edgchen1
2acdc3cd82
Move GetUseDeterministicCompute() to OpKernelContext to avoid need to downcast to OpKernelContextInternal. (#5729) 2020-11-09 11:37:06 -08:00
Dmitri Smirnov
2bf5046d4e
Add tag types for Ort::Float16_t and Ort::BFloat16_t structs (#5716)
Add tag types for Ort::Float16_t and Ort::BFloat16_t structs
  that contain uint16_t values for float16 and bfloat16.
  These will serve as type-dispatching types for the C++ API.
  They are of uint16_t size, and arrays of these types can be used
  to create Tensors of the corresponding types.
  Make documentation Doxygen compliant.
2020-11-06 16:41:26 -08:00
Scott McKay
2127a229d7
The IndexedSubGraph is used to create the Function body, but after that is invalid as the nodes it referred to have been removed from the main Graph. As such there's no need to store it in the FunctionImpl instance. (#5669) 2020-11-05 17:21:56 +10:00
Guoyu Wang
a2b551ff08
Add runtime options for NNAPI EP (#5576)
* Add options for nnapi ep

* Add nnapi flags test

* add comments

* Add flag comments

* Make the flags bitset const

* Fix build break

* Add stub changes to java and c# api

* Fix java related build break

* Fix java build break

* Switch to bit flags instead of bitset
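The bit-flag style this commit switches to can be sketched as follows (the flag names mirror the NNAPI EP's real flags but should be treated as illustrative):

```python
# Bit-flag runtime options: each option is a distinct power of two, so a
# single integer carries the whole configuration (unlike a bitset object).
NNAPI_FLAG_USE_NONE = 0x000
NNAPI_FLAG_USE_FP16 = 0x001
NNAPI_FLAG_USE_NCHW = 0x002

def has_flag(flags: int, flag: int) -> bool:
    """True if `flag` is set in the combined `flags` value."""
    return (flags & flag) != 0

# Flags combine with bitwise OR.
flags = NNAPI_FLAG_USE_FP16 | NNAPI_FLAG_USE_NCHW
```

A plain integer of bit flags is also trivial to pass through C, Java, and C# API stubs, which is one reason compiled EP options tend to use this form.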
2020-11-04 10:08:43 -08:00
edgchen1
07bd4ef470
Upgrade optional implementation to https://github.com/martinmoene/optional-lite. (#5563) 2020-11-03 15:27:47 -08:00
Scott McKay
c9f44276da
Add ability to filter GraphViewer using IndexedSubGraph. (#5614)
* Add ability to filter GraphViewer using IndexedSubGraph. This is to support compiling execution providers in a minimal build.
2020-11-04 07:08:18 +10:00
Wenbing Li
5b44982971
Change the OrtCustomOp invocation as a constant. (#5506)
* Change the OrtCustomOp invocation as a constant.

* fix build on macos

* fix build
2020-11-02 10:38:07 -08:00
M. Zeeshan Siddiqui
9af0d48524
Memory planner and pattern generation enhancements. (#4443)
* Static allocation.

* Contiguous dynamic allocation.

* Fine-grained memory-time scheduling.

* Update graph builder for nccl_allreduce, mps.

* Revert "merge conflicts."

This reverts commit 319a071a6e.

* Revert onnx-tensorrt submodule commit; fix submodule commit.

* Add asserts.

* Fix bugs, build errors (including Windows builds), and merge conflicts.

* Address PR feedback.

Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-11-01 23:05:46 -08:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be run on AMD GPUs with data parallelism.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The ROCm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: one with the ROCm kernels, the other with everything else.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Dmitri Smirnov
742ffb860c
Allow Kernels refer to some attribute data directly in the protobuf (#5624)
* Introduce OpKernelInfo GetAttrAsSpan() for float and int attribute proto arrays,
  and GetAttrsStringRefs() to return a vector of string references.
  These new APIs let kernels avoid copying attribute arrays, saving memory
  (especially for large arrays) by referring directly to the data in the AttributeProto.
  Modify TfIdfVectorizer to take advantage of the new API.

Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-29 16:12:54 -07:00
Sergii Dymchenko
2e1fa3ccb7
Fix GeluRecompute for 2 inputs case. (#5573)
* Add test for FastGelu + GeluRecompute.

* Fix GeluRecompute for 2 inputs case.

* Fix test for BiasGelu + GeluRecompute.

* Copy all inputs to Gelu, not just 2.

* Move GeluRecompute test to training-specific file.
2020-10-29 00:07:13 -07:00
Tim Harris
5e8952ef89
ThreadPool clean up : mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590)
Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.

Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
2020-10-28 09:49:18 +00:00
Ryan Hill
e90b6f06d1
Factor out IAllocator so that it can be shared with shared providers (#5567)
* Factor out IAllocator so shared providers can use it directly.
2020-10-27 17:28:17 -07:00
Dmitri Smirnov
3433576fd3
Support for Sparse Initializers (#5540)
Introduce sparse_initializers support.
  Convert them to dense on model load and prune graph_proto_
  so they don't consume space. Convert back to sparse on ORT format model save.
  Implement serializing sparse initializers to ORT format.
  Fix Model::ToProto() to return original sparse initializers.
  Set a flag that graph_sync is needed when loading a simple ORT format model;
  otherwise nothing is resolved.
  Add ORT format history to README.md.
  ifdef MINIMAL build for DenseToSparseTensorInitializer.
  Allow duplicate initializers to support existing models;
  issue a warning instead of aborting.

* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.



Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-27 10:32:06 -07:00
Yufeng Li
30cdc74bc0
Enable prepacking in subgraph (#5433)
Prepacking in subgraphs is not currently supported. We see more and more models with subgraphs containing MatMul, MatMulInteger, and other ops. Prepacking can speed up those models significantly.
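A toy sketch of what prepacking buys, assuming a hypothetical kernel that transforms its constant weight once at construction instead of on every Run call (real kernels pack into cache- and SIMD-friendly layouts, not a simple transpose):

```python
class MatMulKernel:
    """Sketch of weight prepacking: the constant weight is transformed once
    at session-initialization time, so each run() skips the per-call work."""
    def __init__(self, weight_rows):
        # "Pack" = transpose to column-major here, purely for illustration.
        self.packed = [list(col) for col in zip(*weight_rows)]

    def run(self, x):
        # y[j] = sum_i x[i] * W[i][j], read from the packed columns.
        return [sum(xi * wj for xi, wj in zip(x, col)) for col in self.packed]
```

Enabling this inside subgraphs (Loop/If/Scan bodies) means the ops there get the same one-time transform instead of repeating it per iteration.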
2020-10-26 22:22:31 -07:00
Du Li
860cb22260
Bug fix for C API (#5520)
* remove if_def from C api

* Fix CI issues.

* revert change for symbols.txt
2020-10-24 13:37:58 -07:00
Ryan Hill
82c7a9756e
Fix shared provider unload crash (#5553) 2020-10-21 13:01:21 -07:00
Changming Sun
280cdf31f5
Revert "Fix shared provider unload crash (#5523)" (#5547)
This reverts commit 610676293e, because the Linux DNNL pipeline is failing.
2020-10-20 08:01:28 -07:00
Ryan Hill
610676293e
Fix shared provider unload crash (#5523)
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
2020-10-19 18:08:38 -07:00
Sunghoon
645d978589
Sunghcho/denormals (#5391)
* Add session option and global thread pool option to set denormal as zero.

* Revert unnecessary changes.

* Add cpuinfo submodule

* Add more comments

* Remove cpuinfo submodule dependency and check only SSE3 support for ftz and daz inspired by Tensorflow

* Preserve API order in C api

* Clean up and utilize SSE3 detection logic from existing cpuid_info.h

* Keep the same order with header file

* Fix build issue with Linux pipeline, which has an old g++ compiler

* Fix broken build on Linux and remove a duplicated unit test

* Remove reformatting at eigen thread pool

* Remove flatbuffers which is not intentionally added

* Revert "Remove flatbuffers which is not intentionally added"

This reverts commit 9f509a9aaaa3c7832d88854c82fd26b234770b7f.

* Remove flatbuffers which is not intentionally added

* Resolve comments
  - Put details on APIs
  - Add a log for ftz/daz initialization
  - Add clang
  - Fix typo

* Remove unnecessary header include

* Resolve comments
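For context on what this option controls, a small sketch of what a denormal (subnormal) value is; note that plain Python cannot toggle the hardware FTZ/DAZ flags the session option sets, so `is_subnormal` below is purely illustrative:

```python
import sys

# Subnormal ("denormal") doubles sit between 0 and the smallest normal value.
smallest_normal = sys.float_info.min      # ~2.2250738585072014e-308
subnormal = smallest_normal / 2.0         # only representable as a subnormal

def is_subnormal(x: float) -> bool:
    """True for nonzero values smaller in magnitude than the smallest normal."""
    return 0.0 < abs(x) < sys.float_info.min
```

With FTZ/DAZ enabled, the hardware treats such values as 0.0, trading a little precision for a large speedup on workloads whose activations decay toward zero.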
2020-10-15 12:47:42 -07:00
Chun-Wei Chen
2b6b3a2ee6
Add GetProfilingStartTimeNs() to Python/C# APIs (#5280)
* add Python API for getProfilingStartTime

* debug for using Python API

* add in C# api

* use uint instead of uint64_t to prevent warning

* typo for GetProfilingStartTimeNs

* remove const

* Update onnxruntime/python/session.py

Co-authored-by: Pranav Sharma <emailpranav@gmail.com>

* remove unnecessary return

* Add Python unit test

* Add C# unit test and refactor Python test

* use ulong in C# for uint64_t in C++

* remove time.monotonic_ns

* syntax: remove public for inner function

* correct the API's order

* getprofilingstarttime after run

* Correct the right order in NativeMethod.cs

* update order

* nit: remove spaces

* Update csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs

Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>

* use the updated function

* add comment about the precision

* add more comments

* add session.py back

* fix flake8

* remove session.py

* Add comments in C, C#, Python APIs about precision

Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
2020-10-14 05:32:43 -07:00
Xiang Zhang
b12824fa7a
add telemetry event for nodejs binding (#5463) 2020-10-12 22:53:01 -07:00
KeDengMS
c444b9d76a
Add CUDA option to run copy in default stream (#5445)
* Add CUDA option to run copy in default stream

This change fixes #4829. Thanks @maherzog for providing the repro!

The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.

The BFC arena is an arena allocator on top of cudaMalloc/Free that
reduces the cost of syncing CPU and GPU on alloc/free. This means that
when the CPU allocs/frees memory, the GPU might not have finished its
previous work on that memory, so the CPU and GPU can run asynchronously.

This is OK if there's only one stream, where the execution order on the
CPU and GPU is consistent. For example, if we have two kernels A and B,
and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB will not
race as long as they run in the same GPU compute stream.

However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the GPU could execute copyA after computeB if copy and compute happen
in different GPU streams.

This change makes copies run in the default compute stream, while adding
an option to fall back to the previous behavior if there's a perf hit.
This is a short-term fix until the BFC arena supports multiple streams.

Users may use the following options to revert to the previous behavior:
C API:
  struct OrtCUDAProviderOptions cudaProviderOpt;
  cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
  CUDAExecutionProviderInfo cudaEPInfo;
  cudaEPInfo.do_copy_in_default_stream = false;
C# API:
  pending...
Python:
  import onnxruntime
  onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)

* Confirmed the test fails in CI when doing the copy in a separate stream

Revert the test to get CI to pass for now

* Fix Windows test

* Address CR
2020-10-12 22:12:05 -07:00
Scott McKay
a92ccbe1bc
Various armv7 related fixes (#5394)
* - Link with libatomic if needed
 - Install pip differently so it doesn't clash with the system pip which may involve a wrapper script
 - Remove ability to specify offset when Tensor allocates the data. The data prior to offset isn't accessible by anything.
 - Fix use of offset in TensorOpTest to work on armv7 where it must be aligned to the type it points to.
 - Fix ActivationOpNoInfTest.Softsign to allow for armv7 behavior
 - Fix ReductionOpTest.ReduceMean_*keepdims to allow for armv7 floating point inaccuracy

* Address PR comments
2020-10-09 22:34:32 +10:00
Du Li
323c4dfe02
Adding an option for cudnn conv algorithms. (#5159)
* adding cudnn conv algorithm selection options.

* export the api

* adding the perf test option.

* Accommodating PR comments.

* Move OrtSessionOptionsAppendExecutionProvider_CUDA to onnxruntime_c_api.h

* Accommodating PR comments.
2020-10-05 16:53:52 -07:00