7863 commits

Author SHA1 Message Date
Maajid khan
d98062da0c
[OpenVINO-EP] Hetero support (#5627)
* Implement Hetero in UEP
* Added security checks to take valid Hetero combinations
  as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Passing right device type for vadm_backend

Added simple fix to pick the right device type
when using vadm_backend with Hetero as well.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed batching logic for 2020.4 and above

* Fixed flake8 PEP8 errors

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor Fixes Added
*Added security checks for device_type passed
in for Hetero build during run time
*code cleanup

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor changes Added
*Fixed batch_size bug in vadm_backend
*code cleanup
*Documentation updated for Hetero

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
2020-10-30 22:35:08 -07:00
Changming Sun
d9293f38e6 Revert "Custom Op on GPU (#5620)"
This reverts commit 2c63196600.
2020-10-30 21:23:51 -07:00
Changming Sun
7948a4b0bc Revert "add header (#5648)"
This reverts commit d7f3baed18.
2020-10-30 21:23:51 -07:00
KeDengMS
32bf6390ad
Some fixes to symbolic shape inference (#5642)
* Some fixes to symbolic shape inference

1. Topological sort before iteration in graph
2. Fix a case in slice: start=100000, end=-100000, step=-1, dim=2
3. Fix Nuphar Gemm test's random seed
4. Slice opset 1 axes is optional
2020-10-30 19:28:47 -07:00
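The slice case listed in commit 32bf6390ad (start=100000, end=-100000, step=-1, dim=2) can be checked with plain Python slicing, which clamps out-of-range bounds in the NumPy-like way that ONNX Slice is documented to follow. This is only a sketch of the expected arithmetic, not the code from the commit; `sliced_dim` is a hypothetical helper name.

```python
def sliced_dim(dim, start, end, step):
    """Return the length of a dimension of size `dim` after slicing it."""
    # slice.indices() clamps out-of-range bounds to what is valid for `dim`.
    lo, hi, st = slice(start, end, step).indices(dim)
    return len(range(lo, hi, st))


# Huge positive start and huge negative end with step -1 select the whole
# dimension in reverse, so the output dimension keeps its full size.
print(sliced_dim(2, 100000, -100000, -1))
```

A shape-inference pass that forgets to clamp either bound for a negative step would report an empty or oversized dimension for this input.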
Hariharan Seshadri
7a80a4b526
Support more C# APIs (#5608) 2020-10-30 19:19:50 -07:00
Zhang Lei
17bce6f07e
Implement Im2colNd NHWC and related qlinearconv logic for u8s8. (#5612)
Implement Im2colNd NHWC and related qlinearconv logic for u8s8, and training.
2020-10-30 15:28:30 -07:00
RandySheriffH
d7f3baed18
add header (#5648)
Co-authored-by: RandySheriffH <rashuai@microsoft.com>
2020-10-30 14:26:10 -07:00
Changming Sun
3e71e8bd7e
Revert "[CUDA EP] remove per-thread allocator (#5415)" (#5647)
This reverts commit b4869926d3 because it broke our multiple GPU test pipeline.
2020-10-30 13:58:33 -07:00
RandySheriffH
2c63196600
Custom Op on GPU (#5620)
* add case for cpu custom op on gpu

* format doc

* restrict GPU custom op on Linux GPU CI only

* separate cu file to an independent project

* fix typo

Co-authored-by: RandySheriffH <rashuai@microsoft.com>
2020-10-30 12:25:44 -07:00
S. Manohar Karlapalem
aa38893afb
[OpenVINO-EP] Add Dockerfile with C# API bindings (#5633)
* Update Dockerfile README with C# info

* Add OpenVINO EP dockerfile with C# APIs
2020-10-30 11:27:15 -07:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be run on AMD GPU with data parallelism.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: one for the rocm kernels, the other for everything else.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Dmitri Smirnov
742ffb860c
Allow Kernels refer to some attribute data directly in the protobuf (#5624)
* Introduce OpKernelInfo GetAttrAsSpan() for floats and ints attribute proto arrays
  and GetAttrsStringRefs() to return a vector of string references.
  These new APIs let kernels refer directly to data that is in AttributeProto
  instead of copying attribute arrays, which saves memory, especially when the arrays are large.
  Modify TfIdfVectorizer to take advantage of the new API.

Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-29 16:12:54 -07:00
Vincent Wang
1fa1c51544
bug fix for name of gradient constant (#5626)
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-10-30 07:08:19 +08:00
KeDengMS
b4869926d3
[CUDA EP] remove per-thread allocator (#5415)
Now that we are using legacy default stream, which is shared among all inference threads,
there is no need to have per-thread allocator.

In the past, a race could happen when two threads ran concurrently on GPU:
thread1: allocA->copyA->computeA->freeA
thread2: allocB->copyB->computeB->freeB

Note that freeA/B only means the buffer is ready to be allocated on CPU, while the corresponding
operation on GPU is not finished yet. It is possible for thread1/2 to use the same buffer when the
alloc/free pairs are not interleaved (note that alloc/free is thread-safe).

If the GPU commands run in separate per-thread default streams, there's a chance that copyA/computeA
are interleaved with copyB/computeB, even when the order in CPU execution is not interleaved. This
would cause incorrect results if computeB uses copyA's results.

By using one legacy default stream, CPU execution order would match the GPU execution order, so
if A and B use the same buffer from alloc, the corresponding copy/compute won't be interleaved. If
the copy/compute is indeed interleaved, then allocA and allocB would return different buffers, thus
there is no race either.
2020-10-29 11:33:05 -07:00
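The race described in commit b4869926d3 can be illustrated with a toy model in plain Python. This is not CUDA code and not taken from the commit: it assumes a single buffer that the allocator hands to thread1 and then, after freeA, to thread2, and it treats a "stream" as nothing more than the order in which copy/compute commands reach the GPU. The name `execute` and both schedules are made up for illustration.

```python
def execute(gpu_order):
    """Run (op, owner) GPU commands in order against one reused buffer."""
    buffer = None
    results = {}
    for op, owner in gpu_order:
        if op == "copy":
            buffer = owner            # copyX fills the buffer with X's input
        elif op == "compute":
            results[owner] = buffer   # computeX reads whatever is in the buffer
    return results


# One shared stream: GPU order equals CPU order (thread1 first, then thread2).
shared_stream = [("copy", "A"), ("compute", "A"), ("copy", "B"), ("compute", "B")]

# Per-thread streams: each thread's own order is kept, but the two streams
# may interleave on the GPU even though the CPU calls did not.
per_thread_streams = [("copy", "B"), ("copy", "A"), ("compute", "A"), ("compute", "B")]

print(execute(shared_stream))
print(execute(per_thread_streams))
```

In the second schedule computeB reads the data written by copyA, which is the incorrect result the commit message warns about.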
Sergii Dymchenko
2e1fa3ccb7
Fix GeluRecompute for 2 inputs case. (#5573)
* Add test for FastGelu + GeluRecompute.

* Fix GeluRecompute for 2 inputs case.

* Fix test for BiasGelu + GeluRecompute.

* Copy all inputs to Gelu, not just 2.

* Move GeluRecompute test to training-specific file.
2020-10-29 00:07:13 -07:00
Dwayne Robinson
b85e7a19ea
isalnum is not defined - include cctype (#5623) 2020-10-28 23:31:34 -07:00
Changming Sun
e6956be40c
Publish no-openmp python packages to test pypi (#5610)
Publish no-openmp python packages to test pypi
2020-10-28 19:49:53 -07:00
Tracy Sharpe
b68e98e0b0
optimize QLinearConv depthwise convolutions (#5605) 2020-10-28 16:42:53 -07:00
Jeff Bloomfield
1d87831c6e Merged PR 5344477: Disable GPU timeouts in DML EP command queue creation
GPU timeouts have already been disabled in command queues created by Winml, but not the ones created by the DML EP within the ORT API
2020-10-28 23:34:19 +00:00
liqunfu
5129b4d5bc
batch size tests (#5508)
* batch size tests

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-28 15:55:40 -07:00
Rohith_Kvsp
50582abe93
Fix IS_ANDROID Issue (#5599)
Fixed static IS_ANDROID detection
  The final static IS_ANDROID was causing the error "Unsupport arch:aarch64", so IS_ANDROID was removed and replaced with isAndroid().
2020-10-28 14:42:33 -07:00
Ryan Lai
bbfd914d72
Skip new model test additions (#5611) 2020-10-28 13:27:49 -07:00
Juliana Franco
27c6d1eeb2
move variable declaration to avoid unused variable error (#5603)
Co-authored-by: Juliana <jufranc@microsoft.com>
2020-10-28 09:23:58 -07:00
George Wu
0dbf3e8893
enable arena for arm64 (#5613) 2020-10-28 08:40:43 -07:00
Tim Harris
5e8952ef89
ThreadPool clean up : mm_pause in loops, correctly spin-then-wait, and adopt static methods consistently in the API (#5590)
Description: This change makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses mm_pause intrinsics in spin loops, helping avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution to start out spinning as intended, rather than to start out trying to steal. (3) It updates the ThreadPool class's API to be consistent in the use of static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.

Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
2020-10-28 09:49:18 +00:00
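The spin-then-wait idea named in commit 5e8952ef89 can be sketched in Python as follows. This is an illustration of the general pattern only, under the assumption that a worker polls a flag for a bounded number of iterations before blocking; it does not reproduce the ThreadPool code, and `spin_then_wait`, `spins` and `timeout` are hypothetical names.

```python
import threading


def spin_then_wait(ready, spins=1000, timeout=1.0):
    """Poll `ready` briefly, then block on it. Returns how the wait ended."""
    for _ in range(spins):
        if ready.is_set():
            return "spin"  # work arrived while still spinning
        # A native implementation would issue a pause instruction here
        # so the spin loop does not hog pipeline resources.
    # Nothing arrived during the spin phase: block instead of burning CPU.
    return "wait" if ready.wait(timeout) else "timeout"


already_ready = threading.Event()
already_ready.set()
print(spin_then_wait(already_ready))
```

Starting with the cheap spin phase keeps latency low when work arrives quickly, while the blocking fallback avoids wasting a core when it does not.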
liqunfu
92662659ba
Liqun/remove number matching (#5606)
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-27 21:27:37 -07:00
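As an illustration of what "relaxed comparison" in commit 92662659ba can mean, the sketch below compares floats with a tolerance instead of exact equality. The helper name `assert_close` and the tolerance values are assumptions for this example, not the ones used in the frontend tests.

```python
import math


def assert_close(actual, expected, rel_tol=1e-5, abs_tol=1e-8):
    """Fail only if the two values differ by more than the tolerance."""
    if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
        raise AssertionError(f"{actual} is not close to {expected}")


loss = 0.1 + 0.2
print(loss == 0.3)       # exact matching is brittle for floating point
assert_close(loss, 0.3)  # the relaxed comparison passes
```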
Ryan Hill
e90b6f06d1
Factor out IAllocator so that it can be shared with shared providers (#5567)
* Factor out IAllocator so shared providers can use it directly.
2020-10-27 17:28:17 -07:00
Suffian Khan
e5b0d192f4
pin transformers dependence to sentencepiece==0.1.92 due to ci fail (#5607) 2020-10-27 16:21:40 -07:00
Maajid khan
ddf83d1ace
Maajid/multi threading 2 (#5568)
* Enabled multi-threading for OpenVino EP

->Enabled support for concurrent_session_runs

*Run UEP using concurrent_session_runs > 1
*Enabled support for ORT_PARALLEL ExecutionMode

->Documentation Added for Enabling MultiThreading

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor Fixes added
*Configure the value of nireq during Runtime
*Documentation typos rectified and details
added for Multi_Threaded Inference

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Some checks added for this fix
*Added checks to invalidate wrong nireq value
and assigned it to default value of 8
*Added new config options for enable_vpu_fast_compile
which were changed w.r.t OpenVINO_2021.1 Release

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
2020-10-27 14:48:12 -07:00
Weixing Zhang
b851973f22
pipeline_worker_pool_.JoinAll() should be called in pipeline code path (#5604)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-10-27 11:57:46 -07:00
ytaous
6f824c25e5
Dropout op elimination - enable for ORT training (#5588)
* dropout elimination

* per comments

* fix build

* fix build

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-27 11:51:23 -07:00
Dmitri Smirnov
3433576fd3
Support for Sparse Initializers (#5540)
Introduce sparse_initializers support.
  Convert them to dense on model load and prune graph_proto_
  so they don't consume space. Convert back to sparse on ORT Format model save.
  Implement serializing sparse initializers to OrtFormat.
  Fix Model::ToProto() to return original sparse initializers
  Set a flag that graph_sync is needed when loading a simple ORT Format model.
  otherwise nothing is resolved.
  Add ORT Format history to README.md
  ifdef MINIMAL build for DenseToSparseTensorInitializer
  Allow duplicate initializers to support existing models.
  Issue a warning instead of aborting.

* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.



Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
2020-10-27 10:32:06 -07:00
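The dense/sparse round trip described in commit 3433576fd3 can be sketched as below. This is a simplified model that assumes a one-dimensional tensor stored as non-zero values plus flat indices; the function names are hypothetical and the real code works on ONNX protobuf messages rather than Python lists.

```python
def sparse_to_dense(values, flat_indices, size):
    """Expand (values, indices) into a dense list of length `size`."""
    dense = [0.0] * size
    for value, index in zip(values, flat_indices):
        dense[index] = value
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries and remember where they were."""
    flat_indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in flat_indices]
    return values, flat_indices


dense = sparse_to_dense([1.5, -2.0], [1, 4], size=6)
print(dense)
print(dense_to_sparse(dense))
```

Converting to dense on load lets existing kernels consume the initializer unchanged, while converting back on save keeps the stored model small.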
Yufeng Li
30cdc74bc0
Enable prepacking in subgraph (#5433)
Prepacking in subgraphs is not supported currently. We see more and more models with subgraphs that contain MatMul, MatMulInteger and other ops. Prepacking can speed up those models significantly.
2020-10-26 22:22:31 -07:00
Changming Sun
564da960ce Fix nuphar docker file build break 2020-10-26 20:08:07 -07:00
Hariharan Seshadri
6c310858e3
Support opset-13 Resize kernels (#5575) 2020-10-26 17:26:06 -07:00
Ramakrishnan Sivakumar
5bcb5f5a3d
MLAS: Add support for AVXVNNI (#5592)
Adds Gemm kernels with AVXVNNI support for Int8 acceleration
2020-10-26 16:27:48 -07:00
Jeff Bloomfield
e380fd3c6b Merged PR 5334334: Fix asserts and failure in GraphKernelHelper.cpp
This extends a workaround needed to match node inputs with Tensors to the EP code handling constant input upload.

This was causing issues in a couple of models, including EfficientDet, although that model still fails due to this bug:
https://microsoft.visualstudio.com/OS/_workitems/edit/29970551

Related work items: #29706035
2020-10-26 22:16:22 +00:00
Sherlock
694a4d6413
Add more loggings for GradientBuilder (#5556)
* Add more loggings for GradientBuilder

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-26 15:15:52 -07:00
edgchen1
68fe722691
GatherGrad optimization (#5524)
The existing implementation of the GatherGrad CUDA kernel does not do work in a very parallel manner for certain inputs which can lead to poor performance.

The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.

Previously, each sum was computed by a single thread. If there is an instance of a summation of a large number of values, it can significantly impact the overall kernel execution time.

The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead.

The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
2020-10-26 12:53:53 -07:00
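The difference between the two strategies described in commit 68fe722691 can be modeled in plain Python. This is a sequential sketch of the arithmetic only, with scalar gradients instead of rows; it is not the CUDA kernel, and the function names and the `chunk` size are assumptions made for illustration.

```python
from collections import defaultdict


def gather_grad_direct(indices, grads, num_rows):
    """One running sum per output row (one 'thread' per sum)."""
    out = [0.0] * num_rows
    for index, grad in zip(indices, grads):
        out[index] += grad
    return out


def gather_grad_partial(indices, grads, num_rows, chunk=2):
    """Split each row's sum into partial sums, then accumulate them."""
    groups = defaultdict(list)
    for index, grad in zip(indices, grads):
        groups[index].append(grad)
    # Each partial sum is independent work that could run in parallel.
    partials = [
        (index, sum(values[i:i + chunk]))
        for index, values in groups.items()
        for i in range(0, len(values), chunk)
    ]
    out = [0.0] * num_rows
    for index, partial in partials:
        out[index] += partial
    return out


indices = [0, 2, 2, 2, 2, 2, 1]
grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(gather_grad_direct(indices, grads, 3))
print(gather_grad_partial(indices, grads, 3))
```

Row 2 receives five values here, so the direct version does all of that work in one sum while the partial version spreads it over three smaller ones.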
Sergii Dymchenko
8224718f8f
Enable CommonSubexpressionElimination in training. (#5504)
* Add test for CommonSubexpressionElimination in training.

* Enable CommonSubexpressionElimination in training.

* Add CommonSubexpressionEliminationApplyOnce for training.
2020-10-26 11:25:15 -07:00
Hariharan Seshadri
44773c60e3
Add a CUDA based IOBinding test (#5572) 2020-10-26 10:57:36 -07:00
Xavier Dupré
f4cee22b9b
Handle -inf in ReduceSumLogExp, fix regression introduced in PR #5370 (#5583)
* Handle -inf in ReduceSumLogExp operator
* Update reduction_ops_test.cc
* Remove a case which has a different behaviour CPU/GPU
2020-10-26 09:58:02 +01:00
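Why -inf needs special handling in a log-sum-exp reduction (commit f4cee22b9b) can be shown with a short Python sketch. It assumes the common max-subtraction formulation; the commit does not state how the operator is implemented, so both function names and the guard shown here are illustrative only.

```python
import math


def naive_log_sum_exp(values):
    """Max-subtraction trick without a guard for infinite maxima."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sum_exp(values):
    """Same reduction, but return early when the maximum is infinite."""
    m = max(values)
    if math.isinf(m):
        # -inf - (-inf) would be NaN, so short-circuit instead.
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


all_minus_inf = [float("-inf"), float("-inf")]
print(naive_log_sum_exp(all_minus_inf))
print(log_sum_exp(all_minus_inf))
```

When every input is -inf the unguarded version produces NaN, whereas the mathematically expected result is -inf.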
Tracy Sharpe
502f67ba58
MLAS: implement u8x8 GEMM for aarch32 (#5580) 2020-10-25 23:05:12 -07:00
Andrew McDowell
b2da700e4d
Allow Upper case letters in RHS of einsum equations. (#5569)
Co-authored-by: Andrew McDowell <andrew@neva-labs.com>
2020-10-25 18:11:12 -07:00
Ye Wang
51af108af5
Support older version of slice in reshape fusion (#5574)
* support older version of slice in reshape fusion

* fix

* review partial comments

* add test

* add gen file
2020-10-24 14:48:18 -07:00
Du Li
860cb22260
Bug fix for C API (#5520)
* remove if_def from C api

* Fix CI issues.

* revert change for symbols.txt
2020-10-24 13:37:58 -07:00
Pranav Sharma
3f3b202e36
Optimize GatherElements further, add threshold for parallelizing Scaler. (#5579)
* Optimize GatherElements more.

* Optimize GatherElements further, add threshold for parallelizing Scaler.

* Add basic tests to exercise the parallel path
2020-10-24 12:38:31 -07:00
ISS Build Account
49ec73e939 Merge remote-tracking branch 'upstream/master' into DmlDev 2020-10-23 12:34:04 +00:00
Guoyu Wang
3f06286154
Add Flatten support for NNAPI (#5545)
* Add flatten support for NNAPI, correct some typos in NNAPI code files

* Address review comments

* Update CanSkipReshape

* Add test to verify NNAPI is actually running for a supported model

* Add reshape/flatten test for NNAPI

* Add one extra verbose log for skipping reshape

* Fix Android CI failure

* Correct test file name to fix Android CI failure
2020-10-22 18:15:53 -07:00
ytaous
7da5949279
NVTX label change (#5562)
* label change

* more info on label

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-22 10:34:20 -07:00