Commit graph

11997 commits

Author SHA1 Message Date
Scott McKay
6dc25a60f8
Make the reduction ops more consistent in checking if no transpose is required and skipping the copy of the input data if that is the case. Significantly better performance when this is done (2x faster for model calling ReduceSumSquare with input of {2048,10}). (#3265) 2020-03-20 06:55:38 +10:00
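As a rough illustration of the check described in 6dc25a60f8 (this is not the onnxruntime source): for a row-major tensor, a reduction can avoid a transpose when the reduced axes form one contiguous block at the start or the end of the shape, because the data can then be viewed as a 2-D matrix and reduced along rows or columns in place. The function name `can_skip_transpose` and the exact condition are assumptions made for this sketch.

```python
def can_skip_transpose(rank, axes):
    """Return True if a row-major tensor can be reduced without a transposed copy.

    Assumption for this sketch: the kernel can reduce in place whenever the
    reduced axes are a contiguous block of leading or trailing dimensions.
    """
    if not axes:
        # An empty axis list is treated as "reduce over everything".
        return True
    normalized = set()
    for axis in axes:
        if axis < -rank or axis >= rank:
            raise ValueError(f"axis {axis} out of range for rank {rank}")
        normalized.add(axis + rank if axis < 0 else axis)
    ordered = sorted(normalized)
    count = len(ordered)
    leading = ordered == list(range(count))
    trailing = ordered == list(range(rank - count, rank))
    return leading or trailing


# The {2048, 10} input from the commit message has rank 2, so reducing
# either single axis qualifies under this assumption.
print(can_skip_transpose(2, [1]))
print(can_skip_transpose(3, [1]))
```

Under this assumption, only reductions over "interior" axes, such as axis 1 of a rank-3 tensor, would still pay for the copy.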
ytaous
ca7985fd9f
Address PR comments (#3256)
* comments

* fix path

* fix path

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-03-19 10:40:00 -07:00
Changming Sun
8f00147c14 Fix a few warnings 2020-03-19 09:22:28 -07:00
Tiago Koji Castro Shibata
3bdb0b620a
Fix WCOS/Win32 linking bugs (#3126)
* Fix WCOS/Win32 linking bugs

* Remove unused NODEFAULTLIB flags

* Avoid plain target_link_libraries signature

* Avoid plain target_link_libraries signature

* Fix library list escaping

* Use library list instead of string

* Remove duplicate link to windowsapp.lib

* Remove Win32 build workarounds

* Specify CMake policies before initializing language

* Expose Win32 header definitions during build

* Force set API family

* Enable Win32 APIs in featurizer

* Use MT dynamic CRT

* Expose Win32 specific functions

* Disable app container globally

* Disable default wide functions in featurizers

* Add featurizers to test include path

* Workaround https://gitlab.kitware.com/cmake/cmake/issues/19428

* Revert pipeline debugging hacks

* Skip /FI in CUDA sources

* Default to Win32 builds

* Enable WCOS when using WinML

* Use generator expression to apply CMAKE_MSVC_RUNTIME_LIBRARY to C++ only
2020-03-19 08:52:40 -07:00
edgchen1
61e8a24340
Address PR comments (#3255)
* Added comment for ntfw_remove().

* Rewrite WindowsEnv::DeleteFolder(), some other clean up.
2020-03-18 17:57:57 -07:00
edgchen1
d82f72e65c
Add ort_training build status file. (#3257) 2020-03-18 17:39:57 -07:00
Pranav Sharma
435f014d71
Add support for sessions to share a global threadpool. (#3177)
* Add support for sessions to share a global threadpool.

* Fix build issues

* Add tests, fix build issues.

* Added some documentation

* Fix centos issue when threadpools become nullptr due to 1 core.

* Fix mac and x86 build issues

* Address some PR comments

* Disabled test for android, added few more tests and addressed more PR comments.

* const_cast
2020-03-18 15:42:46 -07:00
Sherlock
03d14bae2b
Register ONNX Training Ops (#3252) 2020-03-18 12:36:57 -07:00
edgchen1
e03b8a1e2f
Move path_lib from onnxruntime/core/framework to onnxruntime/core/platform. (#3253)
Moved path_lib.h/cc from onnxruntime/core/framework to onnxruntime/core/platform and from the onnxruntime_framework to the onnxruntime_common libraries.
2020-03-18 11:53:46 -07:00
Xiang Zhang
61621d4053
Add extra fields to ORT telemetry (#3234)
* Add extra fields to ORT telemetry

* fix linux build failure caused by using HRESULT

* little refactor
2020-03-18 09:37:35 -07:00
Xavier Dupré
bd348ec6ca
Add unit test to cover TreeEnsembleClassifier applied to binary classification and 2 classes (#3230)
* Add unit test to cover TreeEnsembleClassifier for binary classification
2020-03-18 11:32:58 +01:00
jaka.katrasnik
88c65f8add Fixes GTest deprecation warnings 2020-03-17 16:38:55 -07:00
edgchen1
c5576d70a6
Fix build issues (#3214)
* Fixed issues with Python and inference-only build.

* Handle ImportError for training imports.

* fix windows build

* fix compile error

* fix centos build

* fix windows build

* fix compile error

* Use SafeInt for allocation calculation, fix typo.

Co-authored-by: Ethan Tao <ettao@microsoft.com>
2020-03-17 16:10:23 -07:00
Tianlei Wu
0700d13ece
Add Bert Optimization Notebooks (#3204)
* Add notebooks for GPU and CPU inference of PyTorch BERT SQuAD model
* update bert_optimization.py: Do not add duplicated logger handler
* Add machineinfo.py to show machine configuration for notebook.
* Update bert performance test tool:
(1) Set OpenMP environment variable before importing onnxruntime.
(2) Use sub-process for each test
(3) Allow test multiple batch_size
(4) Add latency percentile
(5) Add warmup
2020-03-17 11:56:36 -07:00
Dwayne Robinson
f6211217d8 Fixes 2020-03-16 21:26:36 -07:00
Faith Xu
8bc4e3195d
Updates to roadmap (#3155)
* Updates to roadmap

* remove redundant directML

* Add JS to future investments
2020-03-16 18:19:07 -07:00
Ori Levari
e63f817eb6
avoid IDXGIFactory 6 where possible to enable WinML GPU Path downlevel to RS3 (#3180) 2020-03-16 15:25:32 -07:00
Xiang Zhang
682dde2b3b
add dml_ep_lock (#3200)
* add dml_ep_lock

* Move Winml process-wide lock back to individual sessions
2020-03-16 14:32:12 -07:00
Sherlock
4b2c8e884e
Update License Header (#3212) 2020-03-16 10:24:31 -07:00
Xavier Dupré
6319357a99
Reduce number of allocations in TreeEnsemble (#3217)
* reduce number of allocations in TreeEnsemble

* Fix probabilities for binary case.

* fix outbound access

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
2020-03-16 12:22:15 +01:00
Changming Sun
0fceb33288
Fix onnxruntime server docker file build failure (#3219)
1. Fix onnxruntime server docker file build failure. Tested with the notebook in ONNX tutorial, it works well.
2. Delete the docker files for the other EPs, because currently they don't work and I don't have enough time to update them.
2020-03-15 14:46:46 -07:00
Jesse Benson
3a7539e071 Update bert-base convergence values 2020-03-13 23:03:34 -07:00
Jesse Benson
dc11b82956 Tweak the dropout calculation. 2020-03-13 23:03:34 -07:00
Tracy Sharpe
88c20eaef1
MLAS: rename AVX512BW->AVX512Core (#3216)
Cleanup change: remap functions and files with Avx512BW to Avx512Core.
2020-03-13 22:45:51 -07:00
Dwayne Robinson
551d28be9a Update. 2020-03-13 19:06:00 -07:00
Dwayne Robinson
d489288e3c Add kernels, including stubs. 2020-03-13 18:56:22 -07:00
Dmitri Smirnov
2a6e5ce978
Speedup and reduce binary size for TfIdfVectorizer (#3197)
Speed up TfIdf.
  Build Trie like structure to quickly exclude dead-ends. 
  Use ParallelFor() for each of the rows processing.
  Make it non-template, batch it.
  Check for short tail within the inner loop.
2020-03-13 17:00:59 -07:00
Tracy Sharpe
fe0b2b2abd
QLinearConv speed up (#3196)
For x86/x64 builds, change the QLinearConv op to use MLAS for the u8u8=s32 GEMM, then requantize the intermediate buffer to u8.
2020-03-13 16:54:55 -07:00
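The requantization step mentioned in fe0b2b2abd can be sketched as follows. This is a minimal illustration based on the ONNX QLinearConv definition (scale the 32-bit accumulator by `x_scale * w_scale / y_scale`, round, add the output zero point, saturate to u8), not the MLAS implementation. The function name `requantize_to_u8` is hypothetical, and per-tensor (scalar) scales are assumed.

```python
def requantize_to_u8(acc_s32, x_scale, w_scale, y_scale, y_zero_point):
    """Map 32-bit integer accumulators to unsigned 8-bit outputs.

    Assumes per-tensor scales. Python's round() rounds halves to even,
    which may differ from the rounding mode of a particular kernel.
    """
    multiplier = x_scale * w_scale / y_scale
    result = []
    for acc in acc_s32:
        quantized = round(acc * multiplier) + y_zero_point
        # Saturate to the u8 range [0, 255].
        result.append(max(0, min(255, quantized)))
    return result


# With these scales the multiplier is exactly 1.0.
print(requantize_to_u8([0, 100, -100, 100000], 0.5, 0.5, 0.25, 128))
# prints [128, 228, 28, 255]
```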
Changming Sun
0a1257e467
Adjust the grouping logic in ThreadPool::TryBatchParallelFor (#3207)
1. No more plus 1.
2. Use MlasPartitionWork function to calculate the work index range.
2020-03-13 12:49:17 -07:00
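One way to compute a per-batch work index range without the extra "plus one" batch is to split the total evenly and hand the remainder to the first few batches. The sketch below illustrates that idea only. It is not the MlasPartitionWork source, and the name `partition_work` and its signature are assumptions.

```python
def partition_work(batch_index, batch_count, total_work):
    """Return (start, count) for one batch of an even partition.

    The first (total_work % batch_count) batches each take one extra item,
    so batch sizes differ by at most one and no extra batch is needed.
    """
    per_batch, extra = divmod(total_work, batch_count)
    if batch_index < extra:
        start = (per_batch + 1) * batch_index
        count = per_batch + 1
    else:
        start = per_batch * batch_index + extra
        count = per_batch
    return start, count


# 10 items over 3 batches
print([partition_work(i, 3, 10) for i in range(3)])
# prints [(0, 4), (4, 3), (7, 3)]
```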
Yulong Wang
5bc0d8be5c
Fix TopK Cuda implementation (#3176)
Fixes a bug in TopK cuda implementation when input size is between GridDim::maxThreadsPerBlock and GridDim::maxThreadsPerBlock * 2. In this case, the BitonicTopK will generate all-zero outputs.
2020-03-13 11:46:17 -07:00
Ori Levari
93569bf0f4
fix regex to populate dll version information correctly 2020-03-13 11:35:49 -07:00
Yufeng Li
c69194ec4c
fix the missing return in _get_quantize_input_nodes and format code with yapf (#3199)
* fix the missing return for function _get_quantize_input_nodes

* format quantization code with yapf
2020-03-13 09:28:41 -07:00
Xavier Dupré
d99554bea1
Improves implementation of tree ensemble regressor and classifier (4 to 5 times faster) (#2692)
* Improves implementation of tree ensemble regressor (4 to 5 times faster)
* Use ORT_THROW
2020-03-13 14:10:37 +01:00
Scott McKay
e9d5ed270f
Normalizer performance improvements (#3201)
* Simplify Normalizer as the spec only requires support for 2D input.

Tried using eigen (LpNorm<1>(), and norm()) on each row but that was much slower.

* Remove unused variable
2020-03-13 22:15:44 +10:00
Scott McKay
890cb78b20
Use Eigen::logistic instead of manually computing values. (#3186)
* Use MlasComputeLogistic instead of manually computing values.
* Update test script to allow the tolerance to be specified when checking float output from logreg_iris.onnx.
2020-03-13 20:27:25 +10:00
Hariharan Seshadri
b8575dda7b
Avoid some heap allocations in the InferenceSession and Model classes (#3103)
* Avoid some heap allocations in the InferenceSession and Model classes
2020-03-12 18:38:10 -07:00
Edward Chen
24793f5fc7 Revert change from RelWithDebInfo to Release in OnnxRuntime.CSharp.sln. 2020-03-12 16:51:45 -07:00
Zeeshan Siddiqui
2cad08bd60 Merged PR 5688: Upgrade ONNX submodule to the latest from github ONNX master.
We want to implement SoftmaxCrossentropy and NegativeLogLikelihoodLoss forward training ops for opset-12, but that requires the ONNX submodule to point to the latest commit so that it has the latest ONNX spec.

- Reverse integrate changes from *.in.proto files in github ONNX repo.
- Regenerate csharp/test/Microsoft.ML.OnnxRuntime.Tests/OnnxMl.cs
- Disable ONNX tests that don't have op implementation for the latest opset.
2020-03-12 16:51:45 -07:00
Ethan Tao
2f1e997e5b Merged PR 5686: fix P100/fp16 issues
1. misaligned address in atomic_add()
2. GatherNDGradKernel to use atomic_add
3. enable/add UTs for GatherNDGrad and reduction_ops using half
- __CUDA_ARCH__ won't take effect on .cc code, leverage HasCudaEnvironment() instead
4. verified convergence graph and perf test
- p100 is much slower than v100 on fp16
- fp16/128 need to reduce batch size from 66 to 64 to avoid OOM issue
5. verify convergence test on Dev3/v100

TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)
2020-03-12 16:51:45 -07:00
Ke Deng
75025461e2 Initial implementation of graph cut and pipeline
This is a draft of graph cut and wait/record to demonstrate cut and Wait/Record design. You may find sub models and profiling json under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"
2020-03-12 16:51:45 -07:00
Changming Sun
a02638eb46
Adjust the threading logic in ThreadPool::ParallelFor (#3178)
1. Do not reuse the main thread.
2. Do not add one when MLAS calculates the number of tasks to schedule. (I was the one who put the plus one there.)

This is the second try of #1839

It's known that this change has negative performance impact on some of the models.
2020-03-12 11:33:33 -07:00
Scott McKay
f49912c42a
Performance improvement to Transpose when moving single axis. (#3173)
* Avoid use of vectors for tracking reader/writer offsets as it adds too much overhead if there are a lot of readers or writers.

Tracy found improvements in resnet34-ssd1200 and BERT Squad with this approach.
2020-03-12 14:49:02 +10:00
edgchen1
fa4dd51e3b
Add back orttraining-linux-gpu-inference-only-ci-pipeline.yml. (#3182) 2020-03-11 18:03:58 -07:00
Edward Chen
3af5a2a2cf Change Tensor::[Set]ByteOffset() to use ptrdiff_t. 2020-03-11 22:07:24 +00:00
Edward Chen
80dd62a240 Enable CI for training. 2020-03-11 14:41:32 -07:00
Edward Chen
e542cfd0e0 Introduce training changes. 2020-03-11 14:39:03 -07:00
Paul McDaniel
6791ed0217
Documentation updates for 1.2 for WinML (#3149)
* api governance draft

* Update CONTRIBUTING.md

updated for ABI proposals

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Incomplete, a draft iteration of 2 more changes - api docs and high level design

* pushing to see how the picture size works on screen.

* added 2 charts on api choice and distribution choice

* details on contract checking

* lint cleanup and links

* PR feedback.

* fixed markdown and lists

* more markdown and lists

* fixed broken links

* PR feedback

* commas

* PR comments from nick

* PR feedback

* fixed build section

Co-authored-by: Nick Geisler <36938193+ngeisler11@users.noreply.github.com>
2020-03-11 14:19:30 -07:00
Hariharan Seshadri
a912415bac
Support custom ops targeting the CUDA EP (#3165)
* Initial commit

* Minor nit

* Comment

* Fix build

* Fix build
2020-03-11 00:49:01 -07:00
Hariharan Seshadri
3464801c3e
Explicitly specify NugetPackage parameter while validating nuget in some release pipelines (#3139) 2020-03-10 15:14:09 -07:00
Yufeng Li
3de1fc096d
Move zero point inputs of MatmulInteger to CPU memory (#3159) 2020-03-10 13:56:23 -07:00