Commit graph

3056 commits

Author SHA1 Message Date
Scott McKay
dc50aa42d5
Refactor session state finalization and kernel lookup usage (#4763)
* Refactor SessionState to support coming de/serialization changes
  - move more parts into SessionState to simplify usage
  - do the kernel lookup once instead of multiple times from different places
  - rename finalize_session_state.* to session_state_utils.* as the finalization logic is now inside SessionState

* Fix some build issues

* Move subgraph session state creation into SessionState. It's not needed by GraphPartitioner any more so we can delay the creation until later. Fixes issue where EP may have removed the subgraph during partitioning when taking a control flow node, and SessionState thought the subgraph was still valid.

* Address PR comments

* Clarify a comment
2020-08-20 12:19:38 +10:00
liqunfu
d7233c7c97
Fix training for models with dict input (#4842)
This PR also includes:
	* Remove defaults from named tuples to support Python 3.6
	* Allow models that take dicts as input
	* Adapt BERT fine-tuning example to run on the new frontend
	* Match numbers for BERT fine-tuning model

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-19 18:36:36 -07:00
Thiago Crepaldi
7cc88ef7ed
Port legacy checkpoint API into new front-end (#4855)
* Port legacy checkpoint API into new front-end

This PR also fixes:
	* Warnings on ORTTrainer for improper tensor copies
	* Inaccurate LRScheduler tests using wrong LR
	* Stale DeepSpeed documentation
	* Minor code refactoring for Toy BERT tests
	* Move experimental state_dict() and load_state_dict() into checkpoint ns
2020-08-19 14:27:28 -07:00
Alexandros Koumparoulis
75ad7be336
add caching support for dynamic input models (#4702) 2020-08-19 11:41:40 -07:00
gwang-msft
fff0b41fcb
Nuget build break fix (#4854)
* rename new header file to fix build break

* update code to use the new header file name
2020-08-19 13:51:33 +10:00
Vincent Wang
5eaac31faa
support opset13 on transformers. (#4837)
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-08-19 11:13:37 +08:00
Scott McKay
61a5502af0
Fix some incorrect operator registrations. (#4838) 2020-08-19 09:23:33 +10:00
Changming Sun
1ba07ccfaf Codesign validator fixes 2020-08-18 16:20:15 -07:00
Yufeng Li
0575881949
Update quantization notebook to pytorch 1.6 (#4834) 2020-08-18 14:20:46 -07:00
gwang-msft
dee7596724
Add a generic collection of session configurations to the SessionOptions (#4718)
* adding generic configurations for session options

* fix a build break on linux

* fix training ci build break

* fix training ci build break

* addressed CR comments

* fix training ci build break

* move config_key from enum to string

* add c# api

* add python api

* fix build break

* move prepacking from 2 new api entries to session options configs

* fix training ci build break

* add python test, update some comments, move const key definition to avoid build break

* addressed comments

* move definitions of keys to common.h

* move api to version 5

* remove accidental change in build.py

* remove pragma to avoid build break

* addressed CR comments

* fix the python build break, and move location of config keys definition

* small typo changes
2020-08-18 13:40:40 -07:00
Nat Kershaw (MSFT)
81ff168833
Update stale.yml with current labels and mark stale items as "stale" (#4831) 2020-08-18 13:25:57 -07:00
ytaous
2605af9a0b
Fix for mainz model (#4744)
* fix for mainz model

* fix build

* on comments

* revert the extra check

* on comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-08-18 11:47:19 -07:00
Thiago Crepaldi
f3b0c93a45
Fix issue preventing loss scaler from running (#4833)
`LossScaler.update()` was not being properly called due to the incorrect TrainStepInfo.all_finite assignment.

In addition to this fix, _ORTTrainerModelDesc.is_finite was renamed to _ORTTrainerModelDesc.all_finite for consistency with TrainStepInfo.
2020-08-18 10:03:02 -07:00
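
To illustrate why the `all_finite` flag mentioned in #4833 matters, here is a minimal sketch of a dynamic loss scaler driven by that flag. The names `StepInfo` and `DynamicLossScalerSketch`, the default values, and the doubling/halving policy are assumptions made for illustration; this is not the onnxruntime implementation.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical stand-in for per-step training info; only all_finite is used."""
    all_finite: bool = True


class DynamicLossScalerSketch:
    """Illustrative dynamic loss scaler (assumed policy, not onnxruntime's)."""

    def __init__(self, loss_scale=2.0 ** 16, up_scale_window=3,
                 min_loss_scale=1.0, max_loss_scale=2.0 ** 24):
        self.loss_scale = loss_scale
        self.up_scale_window = up_scale_window
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self._stable_steps = 0  # consecutive steps with finite gradients

    def update(self, step_info):
        if step_info.all_finite:
            # Gradients were finite: grow the scale after a stable window.
            self._stable_steps += 1
            if self._stable_steps >= self.up_scale_window:
                self.loss_scale = min(self.max_loss_scale, self.loss_scale * 2.0)
                self._stable_steps = 0
        else:
            # Overflow (inf/NaN) detected: shrink the scale and restart the window.
            self.loss_scale = max(self.min_loss_scale, self.loss_scale / 2.0)
            self._stable_steps = 0
        return self.loss_scale
```

In this sketch, a step info object whose `all_finite` is never set to `False` would keep the scale growing even after overflows, which is the kind of failure an incorrect assignment can cause.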
Hariharan Seshadri
a3c95374c3
Support asymmetric paddings in CUDA Conv kernel (#4627) 2020-08-18 02:09:30 -07:00
Hariharan Seshadri
c878ecbbe0
Sahar/csharp support openvino (refined) (#4835)
* Sahar/csharp support openvino (#4703)

* Temp changes and include openvino to ensure nuget package is created with linux till we configure azure ci pipeline

* string id change

* native nuget indentation changes

* documentation changes

* Update Openvino_execution_provider.md

Documentation includes openvino execution provider

* Update OpenVino-ExecutionProvider.md

update details to build csharp api for openvino execution provider.

* vadm backend revert

* Update Openvino-Execution-Provider.md

updated for review comments

* Update OpenVino-Execution-Provider.md

* Update OpenVINO-ExecutionProvider.md

* nuget package custom support for openvino
change in native nuget spec python script for including linux runtime

* change to make path to boolean flag

* removed the tab

* Update OpenVINO-ExecutionProvider.md

updated for review comments

* changes to include pep8 warnings
modification to documentation

Co-authored-by: saharfraza <sfatima.3001@gmail.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>

* Changes to include csharp support for openvino

* Fix flake error

* Fix

Co-authored-by: sfatimar <64512376+sfatimar@users.noreply.github.com>
Co-authored-by: saharfraza <sfatima.3001@gmail.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
2020-08-17 21:52:17 -07:00
Rayan-Krishnan
24d9f4e0c3
Add More Extensive ONNX BERT Tests (#4827)
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2020-08-17 19:54:22 -07:00
Changming Sun
e98697ec28
Fix nuget cpu package pipeline (#4832) 2020-08-17 17:08:48 -07:00
jingyanwangms
d3af669980
Auto upgrade base image dependencies (#4797)
* use unattended-upgrade

* PR comment

* add comment

Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-17 17:05:01 -07:00
Thiago Crepaldi
f933910ea3
Update LambConfig defaults to match backend (#4826) 2020-08-17 16:58:14 -07:00
RandySheriffH
6a360bad6b
ReplaceStrncpy (#4823)
* replace strncpy with strlcpy

* keep strncpy to linux

* cancel reseting of string ending for strlcpy

Co-authored-by: Randy <Randy@randysmac.attlocal.net>
Co-authored-by: RandySheriffH <rashuai@microsoft.com>
2020-08-17 16:44:44 -07:00
edgchen1
32a5f3d5b6
Check status of element-wise op prepare functions. (#4830)
* Check status of BinaryElementwise::Prepare().

* Add additional status checks for BinaryElementwise::Prepare() and UnaryElementwise::Prepare().

* Add status checks for BinaryElementwisePreparation::BinaryElementwiseBroadcastPrepareHelper().
2020-08-17 16:11:30 -07:00
Thiago Crepaldi
ef20efe015
Register cerberus license into ThirdPartyNotices.txt (#4828)
Governance Compliance component shows cerberus is ok:
https://dev.azure.com/onnxruntime/onnxruntime/_componentGovernance/112016/53457366

As this is installed by pip, I am assuming we don't need to update
cgmanifest.json file too.
2020-08-17 15:03:54 -07:00
Ksenija Stanojevic
ea37a4d89b
Add Trilu custom op (#4537)
Co-authored-by: neginraoof <neginmr@utexas.edu>
2020-08-17 14:42:26 -07:00
Tianlei Wu
1ce2982f65
Update GPT-2 notebook using IO Binding example (#4799) 2020-08-17 10:43:36 -07:00
Changming Sun
360e2ae11b
Update eigen to the latest to support C++20 (#4817) 2020-08-17 10:19:48 -07:00
George Wu
94a6f50af6 Revert "Sahar/csharp support openvino (#4703)"
This reverts commit 0a0ac70eec.
2020-08-17 10:05:21 -07:00
Thiago Crepaldi
42408aa3ed
Add new PyTorch front-end (#4815)
* Add ORTTrainerOptions class for the new pytorch frontend (#4382)

Add ORTTrainerOptions class and some placeholders

* Add _ORTTrainerModelDesc to perform validation for model description (#4416)

* Add Loss Scaler classes to the new frontend (#4306)

* Add TrainStepInfo used on the new frontend API (#4256)

* Add Optimizer classes to the new frontend (#4280)

* Add LRScheduler implementation (#4357)

* Add basic ORTTrainer API (#4435)

This PR presents the public API for ORTTrainer for the short term
development.

It also validates and saves input parameters, which will be used in the
next stages, such as building ONNX model, post processing the model and
configuring the training session

* Add opset_version into ORTTrainerOptions and change type of ORTTrainer.loss_fn (#4592)

* Update ModelDescription and minor fix on ORTTrainer ctor (#4605)

* Update ModelDescription and minor fix on ORTTrainer/ORTTrainerOptions

This PR keeps the public API intact, but changes how the model description is stored on the backend.

Currently, users create a dict with two lists of tuples.
One list is called 'inputs', and each of its tuples has the format tuple(name, shape).
The second list is called 'outputs', and each of its tuples can be either tuple(name, shape) or tuple(name, shape, is_loss).

With this PR, when this dict is passed in to ORTTrainer, it is fully validated as usual.
However, tuples are internally replaced by namedtuples, and all output tuples have the
tuple(name, shape, is_loss) format instead of is_loss being optionally present.

In addition to that normalization of the internal representation (which eases coding),
two internal methods were created to replace a namedtuple(name, shape) with namedtuple(name, shape, dtype)
or namedtuple(name, shape, is_loss, dtype), depending on whether the tuple is an input or an output.

This is necessary because ORTTrainer finds out the data type of each input/output during model export to ONNX.

Finally, a minor fix was made to ORTTrainer: it could initialize ORTTrainerOptions incorrectly when options=None.

* Rename input name for test

* Add ONNX Model Export to New Frontend (#4612)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Create training session + minor improvements (#4668)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Save ONNX model in file (#4671)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add eval step (#4674)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add train_step (#4677)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add LR Scheduler (#4694)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Add deterministic compute tests (#4716)


Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Add legacy vs experimental ORTTrainer accuracy comparison (#4727)

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>

* Add Mixed precision/LossScaler + several fixes (#4739)

In addition to the mixed precision/loss scaler code, this PR includes:

* Fix CUDA training
* Add optimization_step into TrainStepInfo class
* Refactor LRScheduler to use optimization_step instead of step
* Update several default values in ORTTrainerOptions
* Add initial Gradient Accumulation support (untested)
* Fix ONNX model post processing
* Refactor unit tests

* Add ONNX BERT example + minor fixes (#4757)

* Fix training issue when passing ONNX file into ORTTrainer

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add Dynamic Shape support (#4758)

* Update DeepSpeed Zero Stage option to a separate option group (#4772)

* Add support to fetches (#4777)

* Add Gradient Accumulation Steps support (#4793)

* Fix Dynamic Axes feature and add unit test (#4795)

* Add frozen weights test (#4807)

* Move new pytorch front-end to 'experimental' namespace (#4814)

* Fix build

Co-authored-by: Rayan-Krishnan <rayankrishnan@live.com>
Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-08-17 09:45:25 -07:00
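
The model description normalization described under #4605 in the commit above can be sketched as follows. This is an illustration only: the namedtuple type names and the three helper functions are hypothetical and are not taken from the onnxruntime code base. The namedtuples deliberately avoid default values, which keeps the sketch compatible with Python 3.6.

```python
from collections import namedtuple

# Hypothetical internal representations (names are assumptions).
InputDesc = namedtuple("InputDesc", ["name", "shape"])
OutputDesc = namedtuple("OutputDesc", ["name", "shape", "is_loss"])
InputDescTyped = namedtuple("InputDescTyped", ["name", "shape", "dtype"])
OutputDescTyped = namedtuple("OutputDescTyped", ["name", "shape", "is_loss", "dtype"])


def normalize_model_desc(model_desc):
    """Turn user tuples into namedtuples; every output gets an explicit is_loss."""
    inputs = [InputDesc(*item) for item in model_desc["inputs"]]
    outputs = []
    for item in model_desc["outputs"]:
        if len(item) == 2:
            name, shape = item
            is_loss = False  # is_loss omitted by the user: treat as not a loss
        elif len(item) == 3:
            name, shape, is_loss = item
        else:
            raise ValueError("output must be (name, shape) or (name, shape, is_loss)")
        outputs.append(OutputDesc(name, shape, bool(is_loss)))
    return {"inputs": inputs, "outputs": outputs}


def add_input_dtype(desc, dtype):
    """Replace (name, shape) with (name, shape, dtype) once the type is known."""
    return InputDescTyped(desc.name, desc.shape, dtype)


def add_output_dtype(desc, dtype):
    """Replace (name, shape, is_loss) with (name, shape, is_loss, dtype)."""
    return OutputDescTyped(desc.name, desc.shape, desc.is_loss, dtype)
```

A caller would normalize once at construction time and attach dtypes later, for example after the model has been exported and the element types are known.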
Changming Sun
5eec4f66ed
Refactor manylinux docker image and the related pipelines (#4751)
1. Publish the image to ACR, instead of building it every time for every PR
2. Make USE_MKLML and USE_OPENMP able to co-exist. Currently both of them are enabled in our Linux CI build, but in fact only one of them takes effect.
3. Split nuphar and DNNL into separate pipelines.
4. Fix two warnings in onnxruntime/core/optimizer/matmul_scale_fusion.cc and onnxruntime/test/tvm/tvm_basic_test.cc.
5. Update the manylinux2010_x86_64 image to the latest.
2020-08-17 09:40:31 -07:00
Hariharan Seshadri
ea3b4e1f8d
Fix bug in DispatchOnTensorType macro (#4808) 2020-08-17 01:16:01 -07:00
Ori Levari
5899c1197a
add telemetry for named dimension overrides (#4794)
Co-authored-by: Ori Levari <orlevari@microsoft.com>
2020-08-16 17:09:55 -07:00
sfatimar
0a0ac70eec
Sahar/csharp support openvino (#4703)
* Temp changes and include openvino to ensure nuget package is created with linux till we configure azure ci pipeline

* string id change

* native nuget indentation changes

* documentation changes

* Update Openvino_execution_provider.md

Documentation includes openvino execution provider

* Update OpenVino-ExecutionProvider.md

update details to build csharp api for openvino execution provider.

* vadm backend revert

* Update Openvino-Execution-Provider.md

updated for review comments

* Update OpenVino-Execution-Provider.md

* Update OpenVINO-ExecutionProvider.md

* nuget package custom support for openvino
change in native nuget spec python script for including linux runtime

* change to make path to boolean flag

* removed the tab

* Update OpenVINO-ExecutionProvider.md

updated for review comments

* changes to include pep8 warnings
modification to documentation

Co-authored-by: saharfraza <sfatima.3001@gmail.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
2020-08-16 17:07:26 -07:00
Tang, Cheng
1b1a6a4ca9
Bump onnx to get bfloat16 in ops, and some update in ort to support bfloat16 (#4791)
* bump onnx to support bfloat16

* sign test code

* fix ut failures

* add bfloat type in gradient schema

* add bfloat16 to gathernd

* add bfloat16 into grad op defs

* temp disable gpu fusing transformers

* bfloat16 support fix

* more fix to bfloat

* bug fix

* add bfloat16 to transpose matmul

* fix sce loss

* fix cast opset13 and other missing part of bfloat16

* Revert "temp disable gpu fusing transformers"

This reverts commit b627bc9019.

* add SCEloss back

* fix build break

* fix gpu failure due to missing kernel in opset13

* add tile opset 13 kernel

* Revert "fix gpu failure due to missing kernel in opset13"

This reverts commit 661d63d0599029757f240d29afd64b197b76b880.

* fix comments in pr

* fix cuda break due to opset13

* fix missing msdomain

* add nll loss tests into android build's broken list; disable bfloat16 cast tests due to the wrong type saved in onnx test data, will fix it in onnx first

Co-authored-by: Cheng Tang <chenta@microsoft.com>
2020-08-16 17:05:40 -07:00
Bogdan Bugaev
8ba6b6a21e
Support usage of C API with C++ standards older than C++11 (#4257)
* Use throw() in C API if noexcept is not supported
2020-08-15 11:39:28 -07:00
George Wu
8d2e22558d
unattended-upgrades (#4804) 2020-08-14 18:12:27 -07:00
ashbhandare
5a8962d327
Make grad name unique (#4788)
* Make grad name unique

* Modify for review comment
2020-08-14 15:17:17 -07:00
Weixing Zhang
afa89566d7
Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training (#4731)
* use cublas extension API for fp16

* Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training

To avoid accuracy loss, the accumulation needs to be done in FP32 for training.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-08-14 02:12:14 -07:00
Maxim Kalinin
ec36c793e8
Eliminate redundant subexpressions (#3047)
* Eliminate redundant subexpressions

Apply local value numbering to merge graph nodes that will always
evaluate to the same value.

* Rename cpp->cc

* Handle optional arguments

* Add test models

* Add more tests with optional arguments

* Fix processing of subgraphs

Also, be resilient to possible mixture of optional and variadic
parameters

* Fix random operators

* Address PR comments

* Minor changes and a test

* Move CSE before constant folding

* Random* operators are always non-deterministic

Even when seed is provided.

* Fix a CSE test

* Reuse the list of non-deterministic operators with constant folding pass

* Address PR comments

* Fix formatting

* Address PR comment

* Minor cleanup / comments

* Fix build failure in Linux

* Reuse existing optimizer/utils file.

Also, check for graph outputs when removing a node.

* Add a test

* Fix compiler warnings

* Fix build in older compilers

* More compatibility with old STL versions
2020-08-14 01:13:05 -07:00
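
The idea behind #3047, merging nodes that always evaluate to the same value via local value numbering, can be sketched on a toy graph. The node tuple layout, the function name, and the short list of non-deterministic operators below are assumptions for illustration; the actual optimizer pass works on onnxruntime's graph classes and is not reproduced here. Attribute values are assumed to be hashable.

```python
# Illustrative subset of operators that must never be merged.
NON_DETERMINISTIC_OPS = {"RandomNormal", "RandomUniform", "Multinomial"}


def eliminate_common_subexpressions(nodes, graph_outputs=()):
    """Merge equivalent nodes.

    nodes: list of (output_name, op_type, input_names, attrs) in topological order.
    Returns (kept_nodes, replacement) where replacement maps a removed output
    name to the surviving equivalent output name.
    """
    replacement = {}  # removed output -> equivalent surviving output
    seen = {}         # value-number key -> output name of first occurrence
    kept = []
    for out, op, inputs, attrs in nodes:
        # Rewrite inputs so that uses of removed nodes point at the survivor.
        inputs = tuple(replacement.get(name, name) for name in inputs)
        key = (op, inputs, tuple(sorted(attrs.items())))
        deterministic = op not in NON_DETERMINISTIC_OPS
        # Simplification: a node that produces a graph output is always kept.
        if deterministic and out not in graph_outputs and key in seen:
            replacement[out] = seen[key]
            continue
        if deterministic:
            seen.setdefault(key, out)
        kept.append((out, op, inputs, attrs))
    return kept, replacement
```

Because inputs are rewritten before the key is built, merging one pair of nodes can expose further duplicates downstream in the same pass. Random operators are kept apart even when their seed attributes match, in line with the commit note that they are always treated as non-deterministic.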
Marcus Turewicz
ce65275edf
C# samples: Faster R-CNN (#4733)
* C# sample: Faster R-CNN

* Add link to new sample in samples README

* Remove duplicate image
2020-08-13 17:05:01 -07:00
Sergii Dymchenko
de2685261b
Install AzureML support and commonly used packages in the training image. (#4790) 2020-08-13 16:48:48 -07:00
stevenlix
7acef875bb
Fix bugs in TensorRT (#4780)
* fix bugs

* Move -Wno-deprecated-declarations to target compile flag
2020-08-13 16:09:27 -07:00
Yulong Wang
aa993e95c9
enable build flag '--use_openmp' on MacOS (#4774)
* enable build flag '--use_openmp' on MacOS

* cmake 3.16.1 to enable find_package(OpenMP) on mac
2020-08-13 15:56:42 -07:00
George Wu
f12e9de111
build fixes for https://github.com/microsoft/onnxruntime/pull/4721 (#4784)
* test

* test

* add missing CUDA header include

* debug

* fix

* fix python package for dnnl and tensorrt.

* fix

* fix windows build.

* revert

* target_link_directories for tensorrt shared lib.
2020-08-14 06:24:44 +08:00
James Yuzawa
aca34352a5
Java API: Documentation cleanup (#4395)
* update java API docs

* fix link

* rearrange

* update platforms, use table

* use javadoc.io

* craigacp tested it in java 14

* update link

* fix broken link

* fix testdata link
2020-08-13 12:06:42 -07:00
Sheil Kumar
722602f32d
replace namespace reference with alias (#4786)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2020-08-13 11:14:55 -07:00
ashbhandare
5e7a6e78e3
Changes for BART dynamic shapes in reduction (#4730)
* Modify to hit row reduction over cudnn

* kernel overflow fix

* Cleanup

* fix for mainz/zcode model

* revert

* Review comments

* Review comments
2020-08-13 11:14:01 -07:00
edgchen1
74b3b8448c
Fix MatmulTransposeFusion::ApplyImpl() setting of modified flag (#4775)
Update MatmulTransposeFusion::ApplyImpl() to set modified flag whenever a fusion is performed.
2020-08-13 09:51:52 -07:00
Scott McKay
8fb743f767
Refactor Cast to reduce binary size. (#4765)
* Refactor Cast to reduce binary size.
82.5 -> 60.8KB on Windows

* Address PR comments.
Fix build issue.
2020-08-13 20:43:22 +10:00
Tim Harris
9cec98ec1b
Honor allow_spinning at barrier at end of parallel sections (#4767)
This commit means that when the thread pool is configured to spin, then we spin at the barrier at the end of parallel sections in the main thread, in addition to having workers spin waiting for work. 

The change updates Barrier.h to take an additional boolean to select spin/block, and passes this in based on the thread pool configuration. 

It adds an additional test case for barriers, although no problems were identified by the test case.
2020-08-13 09:40:40 +01:00
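
The spin-or-block choice at the barrier described in #4767 can be sketched as below. This is a Python illustration of the concept only; the class `BarrierSketch`, its `spin` flag, and `run_parallel_section` are hypothetical names and do not mirror the C++ code in Barrier.h.

```python
import threading


class BarrierSketch:
    """Barrier whose waiter either busy-waits (spin) or blocks, chosen by a flag."""

    def __init__(self, count, spin):
        self._remaining = count
        self._spin = spin
        self._lock = threading.Lock()
        self._done = threading.Event()
        if count == 0:
            self._done.set()

    def notify(self):
        """Called by each worker when its share of the work is finished."""
        with self._lock:
            self._remaining -= 1
            finished = self._remaining == 0
        if finished:
            self._done.set()

    def wait(self):
        """Called by the main thread at the end of the parallel section."""
        if self._spin:
            while not self._done.is_set():
                pass  # busy-wait: lower wake-up latency, burns CPU
        else:
            self._done.wait()  # block: frees the CPU, slower to wake up


def run_parallel_section(num_workers, allow_spinning):
    """Run a toy parallel section and wait at the barrier using the given mode."""
    results = []
    barrier = BarrierSketch(num_workers, spin=allow_spinning)

    def work(i):
        results.append(i * i)
        barrier.notify()

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    barrier.wait()
    for t in threads:
        t.join()
    return sorted(results)
```

Both modes produce the same results; the flag only changes how the waiting thread spends its time until the last worker calls `notify()`.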
Faith Xu
61b2a663a3
Update Python version support (#4778) 2020-08-12 23:48:23 -07:00
Changming Sun
cddddc4d55
Add missing header file to MNIST.cpp (#4773)
Resolve #4766
2020-08-12 21:46:11 -07:00