Commit graph

303 commits

Author SHA1 Message Date
Zhang Lei
9992f0f812
Implement QLinear GlobalAveragePool with sse2/neon. (#5838)
Add QLinear Global Average Pool for quantization for ARM and SSE2.

Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
2020-11-23 19:23:58 -08:00
sfatimar
916410151c
Fix for hetero multi python binding with new shared library (#5895)
Co-authored-by: sfatimar <sahar.fatima@intel/com>
2020-11-23 15:41:10 -08:00
Ye Wang
3d5b48a894
remove use_cdn when loading pretrained model (#5900) 2020-11-23 14:26:55 -08:00
Hariharan Seshadri
d46dbeafd3
Expose knobs to create and share (CPU) allocators across sessions in C# and Python (#5634) 2020-11-21 14:12:33 -08:00
Ryan Hill
ba739a8000
Convert OpenVINO into a shared provider (#5778)
Same as Dnnl and TensorRT before it, now with more methods and more cleanup.
2020-11-20 17:39:57 -08:00
Olivia Jain
3738ca7e10
Improve perf testing (#5760)
* build off a specific commit and archive wheel file

* rename to fp32, prefix results w/ commit, add CPU col

* rename 99th to 90 percentile

* get symbolic_shape from master each time

* add install archive wheel, parallel build

* shortening hash
2020-11-20 16:03:09 -08:00
Scott McKay
f0142da59c
Add NNAPI to providers that can be used via the python bindings. (#5867)
Update ORT model conversion script
  - add args for specifying optimization level and whether to use NNAPI
  - add logic to create a list of required ops and ORT format model that can be used with NNAPI
2020-11-21 09:18:35 +10:00
Takeshi Watanabe
a622533ecc
Support profile_file_prefix in python binding (#5864) 2020-11-20 14:28:50 -08:00
S. Manohar Karlapalem
ff58f621fa
Remove nGraph Execution Provider (#5858)
* Remove nGraph Execution Provider

Pursuant to nGraph deprecation notice: https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md#deprecation-notice

**Deprecation Notice**

| | |
| --- | --- |
| Deprecation Begins	| June 1, 2020 |
| Removal Date |	December 1, 2020 |

Starting with the OpenVINO™ toolkit 2020.2 release, all of the features
previously available through nGraph have been merged into the OpenVINO™
toolkit. As a result, all the features previously available through
ONNX RT Execution Provider for nGraph have been merged with ONNX RT
Execution Provider for OpenVINO™ toolkit.

Therefore, ONNX RT Execution Provider for **nGraph** will be deprecated
starting June 1, 2020 and will be completely removed on December 1,
2020. Users are recommended to migrate to the ONNX RT Execution Provider
for OpenVINO™ toolkit as the unified solution for all AI inferencing on
Intel® hardware.

* Remove nGraph Licence info from ThirdPartyNotices.txt

* Use simple Test.Run() for tests without EP exclusions

To be consistent with rest of test code.

* Remove nGraph EP functions from Java code
2020-11-19 16:47:55 -08:00
Hariharan Seshadri
62508ef0e4
Revert "Remove MKLML build config (#5559)" (#5855) 2020-11-19 10:53:08 -08:00
Yufeng Li
6f86c4dbe3
Quantize LSTM (#5595)
Quantize LSTM:
1. dynamically quantizes MatMul inside the LSTM. It doesn't quantize activation function.
2. support per-channel on the input weight and recurrent weight.
2020-11-18 11:21:49 -08:00
Peichen Xie
e8c0f5d0ff
Update the quantization script to support GEMM (transB==1) (#5432)
* Modify onnx_quantizer.py

* Fix topology order issues

* Handle more cases
2020-11-17 21:24:48 -08:00
Scott McKay
7b76b57fc8
Support EPs that compile nodes in a minimal build. (#5776)
* Support EPs that compile nodes in a minimal build. This enables NNAPI being used.
2020-11-17 13:52:22 +10:00
Maajid khan
a84a058f9e
[OpenVINO-EP] Enabling Multi Device support (#5740)
* Enabling Multi Device support for UEP

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor fix added
*Added a simple fix to determine OpenVINO
version for Arm build as well

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
2020-11-11 15:16:30 -08:00
Chi Lo
92292de135
Tensorrt perf tool (#5436)
* Add YAML file for pipeline

* Modify typo

* Add working directory

* Modify and test

* Modfiy and test

* Modify and test

* Modify and test

* Modify

* Modify

* Modify

* Modify

* Make sure to copy all the result files

* Add clearn up

* Modify

* Modify agent pool name

* Upload only specific artifacts

* Modify

* Integrated CI Pipeline for running TRT perf as well as added the “large amount of models” into perf model target

* Fix bug

* Fix bug

* Add reading the information regarding previously known failing models
and then skip testing them during benchmark/validation

* Modify the script file for CI

* Replace print with logger.info

* Fix bug

* Fix bug

* Refine the code

* Modify the script so that it can capture script segmentation fault while
running ORT

* Fix bug

* fix bug

* fix bug

* Add debug info

* fix bug

* Refine perf code

* Refine the code

* fix bug

* Code refactoring

* change many-models path

* remove metadata after validation/benchmark are done

* Update README.md

* Fix bug so that metadata doesn't hold stale value

* Remove hardcode and update README

* Add arguments to the script to make it run correctly

* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines

* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines

* Fix bug so that metadata doesn't hold stale value

* Fix small bug of finding test dataset directory for FP16 test data, as
well as modification of some output information

* use -i random for perf test of TRT changes

Co-authored-by: Olivia Jain <oljain@microsoft.com>
2020-11-06 12:27:42 -08:00
Ye Wang
95e6da7957
Revert saving optimized model as external data (#5690)
* revert and add support for saving external data

* review comments

* update
2020-11-06 11:54:19 -08:00
Zhang Lei
77b1eea9cf
Add option to allow quantize_input() use input_qtype for initializers. (#5721) 2020-11-06 09:33:24 -08:00
Yufeng Li
5c4543e194
Calibrate float tensor only (#5704) 2020-11-04 23:55:48 -08:00
Ye Wang
a028ca41ec
Optimize flaubert (#5651)
* optimize flaubert

* fix an issue and format

* revert non-relevent change

* review comments
2020-11-03 09:51:42 -08:00
Wei-Sheng Chin
8856c2595b
Sync the two IDs in OrtMemoryInfo when calling ctor (#5663)
* Sync the two IDs in OrtMemoryInfo when calling ctor

* Also fix the same problem for output
2020-11-02 23:22:47 -08:00
Tianlei Wu
2c02530603
Bert Model Profiling Tool (#5654)
* Add profiler tool for BERT models
2020-11-02 13:47:37 -08:00
Derek Murray
ff538b8d3a
Minor fixes in BERT Inference notebook (#5637)
Add missing commas to the code example.
2020-11-02 09:49:23 -08:00
Maajid khan
d98062da0c
[OpenVINO-EP] Hetero support (#5627)
* Implement Hetero in UEP
* Added security checks to take valid Hetero combinations
  as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Passing right device type for vadm_baackend

Added simple fix to pick the right device type
when using vadm_backend with Hetero as well.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed batching logic for 2020.4 and above

* Fixed flake8 PEP8 errors

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor Fixes Added
*Added security checks for device_type passed
in for Hetero build during run time
*code cleanup

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor changes Added
*Fixed batch_size bug in vadm_backend
*code cleanup
*Documentation updated for Hetero

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
2020-10-30 22:35:08 -07:00
KeDengMS
32bf6390ad
Some fixes to symbolic shape inference (#5642)
* Some fixes to symbolic shape inference

1. Topological sort before iteration in graph
2. Fix a case in slice: start=100000, end=-100000, step=-1, dim=2
3. Fix Nuphar Gemm test's random seed
4. Slice opset 1 axes is optional
2020-10-30 19:28:47 -07:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.  

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00
Maajid khan
ddf83d1ace
Maajid/multi threading 2 (#5568)
* Enabled multi-threading for OpenVino EP

->Enabled support for concurrent_session_runs

*Run UEP using concurrent_session_runs > 1
*Enabled support for ORT_PARALLEL ExecutionMode

->Documentation Added for Enabling MultiThreading

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Minor Fixes added
*Configure the value of nireq during Runtime
*Documentation typos rectified and details
added for Multi_Threaded Inference

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Some checks added for this fix
*Added checks to invalidate wrong nireq value
and assigned it to default value of 8
*Added new config options for enable_vpu_fast_compile
which were changed w.r.t OpenVINO_2021.1 Release

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
2020-10-27 14:48:12 -07:00
Tianlei Wu
1f304fbee7
Attention with past and no unidirectional mask (#5557)
* Update fusions to support shared node, and mask of all ones
2020-10-21 20:12:02 -07:00
Changming Sun
5802fe1699
Remove MKLML build config (#5559)
Remove MKLML build config
2020-10-21 13:11:25 -07:00
Hariharan Seshadri
4291c57322
[C# and Python APIs] Expose knobs to enable/disable platform telemetry collection (#5481) 2020-10-21 10:32:13 -07:00
Yufeng Li
6c2162e97a
Fix quantization of Conv1D with bias (#5491)
* Fix reshape for Conv with bias
2020-10-20 15:27:26 -07:00
KeDengMS
e1a54c4090
Symbolic shape inference: fix a bug in shape merge (#5519)
* Symbolic shape inference: fix a bug in shape merge

OpType Where:
input0: ['mt_src_tokens_batch', 1, 1, 'mt_src_tokens_len']
input1: []
input2: ['mt_prev_output_tokens_batch', 12, 'mt_prev_output_tokens_len', 'floor(mt_src_tokens_batch*mt_src_tokens_len/mt_prev_output_tokens_batch)'] 1
output: [None, 12, 'mt_prev_output_tokens_len', None]

* Undo unintended TRT change
2020-10-16 17:54:57 -07:00
Chun-Wei Chen
2b6b3a2ee6
Add GetProfilingStartTimeNs() to Python/C# APIs (#5280)
* add Python API for getProfilingStartTime

* debug for using Python API

* add in C# api

* use uint intead of uint64_t to prevent warning

* typo for GetProfilingStartTimeNs

* remove const

* Update onnxruntime/python/session.py

Co-authored-by: Pranav Sharma <emailpranav@gmail.com>

* remove unnecessary return

* Add Python unit test

* Add C# unit test and refactor Python test

* use ulong in C# for uint64_t in C++

* remove time.monotonic_ns

* syntax: remove public for inner function

* correct the API's order

* getprofilingstarttime after run

* Correct the right order in NativeMethod.cs

* update order

* nit: remove spaces

* Update csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs

Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>

* use the updated function

* add comment about the precision

* add more comments

* add session.py back

* fix flake8

* remove session.py

* Add comments in C, C#, Python APIs about precision

Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
2020-10-14 05:32:43 -07:00
Ye Wang
67315d8ae0
Optimize openai-gpt/albert model and add fusion test (#5466)
* optimize openai-gpt

* add huggingface model fusion test

* move albert's attention fusion here

* add test for albert fusion
2020-10-13 19:24:14 -07:00
KeDengMS
c444b9d76a
Add CUDA option to run copy in default stream (#5445)
* Add CUDA option to run copy in default stream

This change fixes #4829. Thanks @maherzog for providing the repro!

The bug is caused by memory reuse in BFC arena, where copy and
compute stream in CUDA has a racing condition.

BFC arena is an arena allocator on top of cudaMalloc/Free to
reduce the cost in syncing CPU and GPU when alloc/free. It means
when CPU alloc/free the memory, GPU might not finished previous
work on the memory, so that CPU and GPU could run asynchronously.

This is OK if there's only one stream, where the execution order
in CPU and GPU are consistent. For example, if we have two kernels
A and B, CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B could shares the same memory since computeA and computeB
will not have racing as long as they run in the same GPU compute
stream.

However, if CPU runs allocA->CopyA->freeA->allocB->computeB->freeB,
the order of execution in GPU could have copyA happen after computeB,
if copy and compute happens in different GPU streams.

This change makes copy to run in default compute stream, while adding
an option to fall back to previous behavior if there's perf hit. This
is a short term fix before BFC arena could support multiple streams.

User may use following options to revert to previous behavior:
C API:
  struct OrtCUDAProviderOptions cudaProviderOpt;
  cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
  CUDAExecutionProviderInfo cudaEPInfo;
  cudaEPInfo.do_copy_in_default_stream = false;
C# API:
  pending...
Python:
  import onnxruntime
  onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)

* Confirmed the test failes in CI when doing copy in separate stream

Revert the test to get CI pass now

* Fix Windows test

* Address CR
2020-10-12 22:12:05 -07:00
Hariharan Seshadri
b9f90e297e
Support sharing of initializers between session via the Python API (#5407) 2020-10-09 20:26:28 -07:00
Ye Wang
90f976d060
Some improvements on transformers tool (#5383)
* modify tensoflow benchmark gpu setting

* add export from tf choice in script

* fix typo

* match more embedlayernorm pattern

* format
2020-10-08 19:35:17 -07:00
Tianlei Wu
15696b8fce
bump version to 1.5.2 (#5420) 2020-10-08 16:30:13 -07:00
Yufeng Li
b04cf2d229
Update ORT to 1.5.1 in Bert Quantization Notebook (#5396)
* Update ORT to 1.5.1 in Bert Quantization Notebook
2020-10-08 09:55:01 -07:00
Tianlei Wu
8ee2b08325
Allow benchmark different threads (#5390) 2020-10-07 11:13:01 -07:00
Tianlei Wu
094384781e
Add --use_external_data_format in convert_to_onnx.py (#5393) 2020-10-07 09:42:02 -07:00
Hariharan Seshadri
6f54113a1b
Support OrtValue binding in Python to enable interesting IOBinding scenarios in Python (#5248) 2020-10-06 21:14:41 -07:00
Du Li
323c4dfe02
Adding an option for cudnn conv algorithms. (#5159)
* adding cudnn conv algorithm selection options.

* adding cudnn conv algorithm selection options.

* export the api

* adding the perf test option.

* accomodating pr comments.

* Move OrtSessionOptionsAppendExecutionProvider_CUDA to onnxruntime_c_api.h

* Accomodating PR comments.
2020-10-05 16:53:52 -07:00
Vlad Burlik
c20fcf26eb
Onnx GPU runtime fails to fallback to CPU when GPU is not available/busy (#5304)
* ONNX GPU runtime fails to fallback to CPU when GPU is not available OR busy
https://github.com/microsoft/onnxruntime/issues/5299

* comments

* Init _fallback_providers before C.InferenceSession

* As per review: Fallback providers order supersedes user's providers order, IF they are included into providers list.

* Code convention fix

* pep8
2020-10-02 22:45:14 -07:00
Tianlei Wu
f5e4c0ea04
Fix benchmark_gpt2 model verification (#5343) 2020-10-02 13:53:02 -07:00
Sherlock
e71668f92c
Expose recompute configs to the frontend (#5318)
* Expose recompute configs to the frontend

* Add frontend test

* Ensure recompute graph transformer is only applied once

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-10-02 09:49:47 -07:00
Tianlei Wu
e33de20861
Update gpt2 notebook for int8 quantization (#5346)
* Update gpt2 notebook for ORT 1.5
* add sections for int8 quantization including QAT note
2020-10-02 09:41:52 -07:00
Yufeng Li
e8b9aa1f29 fix quantization of EmbeddingLayerNorm (#5321) 2020-10-01 20:08:43 -07:00
KeDengMS
7495dc167a
Symbolic shape inference: fix a bug in auto_merge when broadcasting (#5349)
The bug happens when merging following shapes:

input0: [1, 1, 'Min(1024, input1_dynamic_axes_3)', 'Min(1024, input1_dynamic_axes_3)']
input1: ['input1_dynamic_axes_1*input1_dynamic_axes_2', 12, 'input1_dynamic_axes_3', 'input1_dynamic_axes_3']
input2: []

The fix is to avoid broadcasting merge on input2
2020-10-01 15:24:00 -07:00
Ye Wang
caed6c264c
Add tf2pytorch wrapper in transformers tool (#5316)
* init checkin

* format

* refactor

* review comments
2020-10-01 13:58:58 -07:00
Ye Wang
1a12f510fc
Support T5 benchmarking in transformers tool (#5133)
* init checkin

* review comments

* modify according to transformers release
2020-09-29 22:58:28 -07:00