Commit graph

3268 commits

Author SHA1 Message Date
Ashwini Khade
9ba2cfb71b
fix py packaging pipeline (#5038)
* add test skip logic when opset > allowed opset

* fix attribute error

* plus fix
2020-09-03 09:32:10 -07:00
Bowen Bao
22ba266bd6
Add flag to _internal_use to control export of contrib ops in ort trainer (#4968) 2020-09-03 09:11:47 -07:00
Scott McKay
28445c88f9
Changes to enable saving and loading an ORT format model (#4995)
* Changes to enable saving and loading an ORT format model via the public APIs.
Cleanup session.py to try and make slightly more understandable. More refactoring is needed here.
Couple of bug fixes

* Fix bug in handling NodeArg serialization for optional inputs which has a name and no type info.

* Address PR comments
  - tweak SessionOptions config to avoid double lookup
  - merge duplicated functionality in python binding around registering an EP with optional options

Fix a couple of build issues.

* Update C API to be consistent with python API
  - only load model in InferenceSession ctor if required
  - support loading ORT model in minimal build

* Fix nodejs test.
We get an invalid path error from LoadInterOp first now

* Another attempt at fixing nodejs test.
Error message depends on whether ENABLE_LANGUAGE_INTEROP_OPS is defined. Make the output consistent.

The interop implementation looks suspicious given it appears to be internal code that is going via the public api. TBD if that should be fixed.

* Fix couple of build issues.

* Disable test temporarily so PR can be checked in.
Will fix in separate PR that adds final pieces for minimal build as the test is required there.

* Give up on nodejs test and make the match simpler.
Fix init call in TrainingSession python to not pass through sess. It wasn't being used in Session anyway, so passing it through just adds confusion.

* Fix call to Session.__init__ in TrainingSession.
Session now initializes Session._sess to None to make it clearer where the 'ownership' of that member is, and that needs to happen before TrainingSession sets it.
2020-09-03 09:10:48 -07:00
Tim Harris
bbb9d92a5f
Remove SchedulingParams variants of ThreadPool::TryParallelFor (#5050) 2020-09-03 09:04:31 -07:00
gwang-msft
fde7a2c848
Temporarily switch SafeInt to a fork for an option to disable exceptions (#5041)
* Removed submodule

* Add safeint fork
2020-09-02 23:21:39 -07:00
Ryan Hill
e0d1cf19a6
Fix allocator bug (#5042) 2020-09-02 21:21:18 -07:00
Weixing Zhang
3268717615
Enable TF32 for training on A100 (#4914)
* enable TF32 for training on A100

it can be disabled by env: NVIDIA_TF32_OVERRIDE = 0
2020-09-02 19:21:54 -07:00
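A minimal sketch of how the opt-out mentioned in the commit above might be applied when launching a training job from Python. The `NVIDIA_TF32_OVERRIDE` variable comes from the commit message; the helper name `training_env` and the idea of passing the environment to a child process are assumptions made for illustration only.

```python
import os


def training_env(disable_tf32: bool = False) -> dict:
    """Return a copy of the current environment for a child training process.

    Hypothetical helper: when disable_tf32 is True, the copy carries
    NVIDIA_TF32_OVERRIDE=0, the switch named in the commit message.
    """
    env = dict(os.environ)  # copy, so the parent process is left untouched
    if disable_tf32:
        env["NVIDIA_TF32_OVERRIDE"] = "0"
    return env
```

The returned dictionary could then be handed to something like `subprocess.run(cmd, env=training_env(disable_tf32=True))`, which keeps the setting scoped to that one run.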
Hariharan Seshadri
a9db287bd7
Return windows error code for library loading and unloading failure (#5036) 2020-09-02 18:07:36 -07:00
Ye Wang
b4e9e98cee
Add more huggingface models in benchmark tools (#4986)
* checkin more huggingface models

* review comments

* review comments
2020-09-02 16:41:58 -07:00
Sherlock
a935731bd3
Neg Gradient (#5022)
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-09-02 15:54:17 -07:00
Dudeldu
4a0f6595eb
Enable metadata and signature changes in graph transformers (#4783)
After applying all the graph transformations, the metadata and signature could have changed
(e.g. new outputs were added, or the outputs/inputs were renamed). Therefore the local
copies of metadata and signature that InferenceSession maintains for faster lookup have to be updated.
For this, `SaveModelMetadata`, which now has to be idempotent, should be called after resolving the transformed graph.
2020-09-02 15:46:36 -07:00
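The idempotency requirement described in the commit above can be illustrated with a toy model. This is only a sketch: `ToySession`, its attributes, and `apply_transformers` are hypothetical names, not the ONNX Runtime implementation. The point is that the cached names are rebuilt from the graph on every call rather than appended to, so calling the refresh again after a transformation is safe.

```python
class ToySession:
    """Toy stand-in for a session that caches graph input/output names."""

    def __init__(self, graph_inputs, graph_outputs):
        self.graph_inputs = list(graph_inputs)
        self.graph_outputs = list(graph_outputs)
        self.input_names = []
        self.output_names = []
        self.save_model_metadata()

    def save_model_metadata(self):
        # Rebuild the cached copies from scratch; repeated calls give the
        # same result, which is what makes this refresh idempotent.
        self.input_names = list(self.graph_inputs)
        self.output_names = list(self.graph_outputs)

    def apply_transformers(self, rename_outputs):
        # Hypothetical transformation step that renames some graph outputs.
        self.graph_outputs = [rename_outputs.get(n, n) for n in self.graph_outputs]
        # Refresh the cached copies after the graph has changed.
        self.save_model_metadata()
```

Had the refresh appended to the cached lists instead of replacing them, a second call would leave stale or duplicated names behind.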
Hariharan Seshadri
4fd4b74149
Change session option values if they don't work with EPs being registered for the session (#4991) 2020-09-02 15:13:23 -07:00
Nat Kershaw (MSFT)
8a03b6e5c7
Render Operator documentation as compliant markdown (#3658) 2020-09-02 15:07:50 -07:00
Dmitri Smirnov
e1901a7e10
Improve performance of CUDA implementations for GatherElements and Greater, Equal and Less (#4989)
Make GatherElements kernel process 16 items each.
  unroll the constant loop. Quit loops early for zero dividend.
  Optimize Binary CompareFunction and remove Impl_Cast invocation.
2020-09-02 10:17:39 -07:00
Changming Sun
d5d5e37e76
Build system enhancements (#5012)
1. Add a docker file for CUDA11
2. Support setting CUDA_ARCHITECTURES from command line.
2020-09-02 10:13:26 -07:00
Thiago Crepaldi
aabed34d5c
Fix checkpoint API and improve loss scaler handling (#4950)
This PR also includes:
	* More LossScaler tests
	* Minor LossScaler improvement
	* Check model after extra post processing
	* Improve basic training tests to include all optimizers
	* Set rtol=1e-7 tolerance for Legacy vs Experimental frontend API tests
	* Increase number of training tests for Legacy vs Experimental tests
	* Minor refactoring on existing tests
	* Fix Checkpoint API for Gradient Accumulation / fp16 scenarios
2020-09-02 09:38:02 -07:00
Thiago Crepaldi
eebc2cccce
Fix fetches when eval_step's input is a subset of train_step's input (#4966)
This PR also includes an MNIST sample using the new frontend.
2020-09-02 08:57:44 -07:00
Vincent Wang
a6e219deff
Pass Model Path to TensorProtoToMLValue from Constant Folding for External Inputs (#5000)
* Don't constant fold external inputs.

* pass model_path to TensorProtoToMLValue

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2020-09-02 21:54:40 +08:00
gwang-msft
5651c23271
Fix for Android ORT android initOsArch exception (#5006) 2020-09-02 00:48:06 -07:00
xkszltl
44b3accb74
Missing header for std::once_flag and std::call_once. (#5010) 2020-09-02 00:46:59 -07:00
Changming Sun
9902b57090
Fix a warning in global_thread_pools/test_inference.cc (#4987)
* Fix a warning in global_thread_pools/test_inference.cc
2020-09-01 20:45:22 -07:00
Thiago Crepaldi
f38f2d5b54
Port #4920 into the new pytorch frontend (#4965) 2020-09-01 19:00:49 -07:00
Hariharan Seshadri
d30dd41c0e
Remove public default ctor in PyInferenceSession and replace it with a protected ctor (#4990) 2020-09-01 17:10:36 -07:00
Ryan Lai
c6a3620ba8
Remove evaluate telemetry due to redundancy (#4996)
* Remove evaluate start / stop from telemetry

* Remove eval telemetry

* remove check for evaluate time delay

* add comment

* remove const

Co-authored-by: Ryan Lai <ryalai96@gamil.com>
2020-09-01 17:02:00 -07:00
Tianlei Wu
a47cae031f
Use raw attention mask in BERT related fusions (#4889)
* Use raw attention mask in fusion
* update python scripts to use raw attention mask by default
2020-09-01 13:22:20 -07:00
liqunfu
d79af260bb
Liqun/new api orttraining test transformers (#4982)
* matching transformer model test with Lamb
* increase epochs
* use atol 1e-6 to pass full precision test
2020-09-01 13:11:06 -07:00
gwang-msft
64237d999c
Add Cmake config for onnxruntime_NO_EXCEPTIONS (#4975)
* additional noexception setting, added compile options

* more no exception changes

* addressed PR comments

* Fix build issue when MSVC static library is used.

* Clarify comment

* add fatal message for onnxruntime_NO_EXCEPTIONS enabled without onnxruntime_MINIMAL_BUILD

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2020-09-01 10:17:50 -07:00
Pranav Sharma
ad1701dfb1
Rename DeviceAllocatorRegistrationInfo to a more generic name; Use OrtArenaCfg for arena members; Remove unused OrtMemType; Simplify CreateAllocator interface. (#4970)
* Rename DeviceAllocatorRegistrationInfo to a more generic name; Remove OrtMemType; Simplify CreateAllocator interface.

* - fix builds
- fixed mixed aggregation + constructor calls (which were coded before this PR)
- changed default value of max_mem in API header
- added some validation of values for arena_extend_strategy

* fix tensorrt and cuda tests
2020-09-01 09:25:32 -07:00
Yufeng Li
ffc2b25a3a
Quantization tool improvement (#4933)
Improve quantization tools:
1. Support QAT
2. Make the quantization tool register operators.
3. Make the API clearer to use

Co-authored-by: t-yguo <t-yguo@microsoft.com>
2020-09-01 09:07:46 -07:00
Zhang Lei
464bbd27a9
Zhalei/optimize nms (#4875)
* double the speed of non_max_suppression for cpu.

* handle edge case in test case.
2020-08-31 23:33:54 -07:00
Zhang Lei
cf1b74396a
Fix build break for microbench. (#4960) 2020-08-31 23:29:07 -07:00
RandySheriffH
14b51d6502
CiPipeline@ReducedOpsBuild (#4917)
* cancel night build on pyop

* setup ci pipeline for build of reduced ops

* add back c# test

* remove debugging print

* add testing model

* add more arg in pipeline script

* disable pipeline trigger temporarily

* fix yaml format

* fix yaml format

* fix pipeline error

* rid c# test

* add ops for test cases

* add Conv from domain com.microsoft.nchwc

* remove --reduce_ops

* fix typo

* remove --build_java

* add test case for excluded op

* update doc with --skip_test

* formatting code, renaming files and simplify yaml

* remove debug build from yaml

* remove surplus ops from included_ops.txt

* add MinSizeRel build to yaml

* rename test cases and models

* exclude ir test from minimum build

* restrict ir test to be only applied to reduced ops build
2020-08-31 21:21:18 -07:00
gwang-msft
7ca8388dc9
[ORT Mobile] file format schema and file I/O code (#4973)
* ort mobile file format schema and [de]serializing code
2020-09-01 11:51:31 +10:00
George Wu
bca9ccb1b3
add install sec updates (#4957) 2020-08-31 18:13:02 -07:00
Xueyun Zhu
1e1f5a9c79
support data parallel + pipeline parallel (#4648)
* enable data + pipeline parallel

* distributed group calculation

* fix typo

* fix test and minor changes
2020-08-31 17:32:03 -07:00
Thiago Crepaldi
9817b8c8a7
Fix state_dict/checkpoint issue introduced by #4639 (#4984)
https://github.com/microsoft/onnxruntime/pull/4639 changed the default
behavior by removing optimizer state from state_dict/checkpoint APIs.
The reason for the previous change was to allow models trained on ORT to
be used for inference on PyTorch, which is an important feature.

Due to the aforementioned change, when resuming training from a checkpoint,
the optimizer would start with random weights, leading to poor performance.
This behavior would also cause reproducibility issues, as the optimizer
wouldn't be able to resume from its previous state.

This PR adds a boolean flag to the state_dict/save_checkpoint APIs that,
when True (default), saves both model and optimizer state.
When False, only the model state is kept.
2020-08-31 17:00:14 -07:00
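The flag behavior described in the commit above can be sketched with a toy trainer. The class `ToyTrainer`, the parameter name `include_optimizer_state`, and the state keys are assumptions chosen for illustration; the commit message does not give the actual flag name or signature.

```python
class ToyTrainer:
    """Toy stand-in for a trainer that holds model and optimizer state."""

    def __init__(self):
        self.model_state = {"weight": [0.1, 0.2]}
        self.optimizer_state = {"Moment_1_weight": [0.0, 0.0], "Step": 10}

    def state_dict(self, include_optimizer_state=True):
        # Start from a copy of the model state so callers cannot mutate ours.
        state = dict(self.model_state)
        if include_optimizer_state:
            # Default path: keep optimizer state so training can resume.
            state.update(self.optimizer_state)
        return state
```

With the default, a resumed run gets the optimizer state back; passing `False` yields a model-only dictionary of the kind suited to inference elsewhere.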
Ashwini Khade
8679a7244e
Enable rejecting models based on onnx opset (#4912)
* enable rejecting models based on onnx opset

* enable unreleased opsets in linux and mac CI

* test fixes and more updates

* enable unreleased opsets in CI builds

* enable released opsets in linux cis

* try fix windows ci yml

* yml fixes

* update yml

* yml updates post master merge

* review comments

* bug fix
2020-08-31 13:35:36 -07:00
Sherlock
50c610e70a
Stop Gradient at Shape op (#4983) 2020-08-31 13:13:17 -07:00
Faith Xu
7af052fd62
Add CI status badges for Training builds (#4951)
* Add CI status badges for Training builds

* Fix links
2020-08-31 12:10:38 -07:00
M. Zeeshan Siddiqui
6d9d252bc3
Disable NegativeLogLikelihoodLoss_LargeSizeTensor test (#4979)
Disabling this test until its intermittent failure is root-caused. This is a function and does not have a dedicated op of its own. To the best of my knowledge it is not used in any known model, so disabling the test for the sanity of CI until the investigation is over is probably reasonable.
2020-08-31 11:02:07 -07:00
edgchen1
b41e5e88fb
Add more node debug dump functionality. (#4921)
Add ability to dump node inputs/outputs to files, filter nodes, configure behavior with environment variables.
2020-08-31 10:17:23 -07:00
Sherlock
98f7fdd7da
Handle MatmulGradient with 2D Weight at B (#4977) 2020-08-30 22:56:33 -07:00
Changming Sun
bac41969be
update (#4948) 2020-08-29 19:05:07 -07:00
Hariharan Seshadri
64d52ae47d
Support creating sessions using DML EP via C# (#4955) 2020-08-29 15:18:50 -07:00
Hariharan Seshadri
7080e485a3
Handle upper-cased subscript labels in Einsum (#4964) 2020-08-29 15:18:21 -07:00
Dwayne Robinson
f4b057b098
Fix DML License in nuget package (#4969) 2020-08-29 00:02:01 -07:00
gwang-msft
ea5732319e
Add option ORT_NO_EXCEPTIONS to disable most exception/throw in /onnxruntime/ (#4894)
* init no exception changes

* initial test

* disable exceptions

* more throw handling

* minor update

* fix linux build break

* fix windows/nuphar build break

* address cr comments, move #ifdef to ORT_CATCH

* address cr comments, move #ifdef to ORT_CATCH

* handle return statement in ORT_CATCH

* linux build break fix

* addressed cr comments, remove ort_catch_end

* addressed cr comments, remove ort_catch_end

* move mlas to a separated ifdef flag

* merge master, move some new code in master to no_exc

Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
2020-08-28 23:03:51 -07:00
Brian Martin
655ffd5d5b
make (de)tensorization events measure level events (#4958)
* make tensorizer events measures

* throttle the events and add a new one SoftwareBitmapToGPUTensorTelemetryEvent

* factor out timing code into a class

* typo

* typo

* move eventimer class into its own header file

* add throttling to detensorization and remove variable timing

* make detensorization events measures as well

* add ConvertGPUTensorToSoftwareBitmapTelemetryEvent event

* de-duplicate event names

* fix comment

* PR feedback
2020-08-28 16:49:32 -07:00
Thiago Crepaldi
cd0f2fb48c
Add code owners for pytorch frontend (#4963) 2020-08-28 15:57:52 -07:00
Hariharan Seshadri
7045910d10
Support RegisterCustomOpsLibrary via the Python API (#4764) 2020-08-28 13:24:29 -07:00