Update ReformatSourcePython.bat to use YAPF to format Python code, and add the onnxruntime\test directory to the set of formatted directories.
Add onnxruntime\.style.yapf for configuration. The style is based on the Google style, except that the maximum column width is 120.
Format python scripts using ReformatSourcePython.bat.
* Add notebook for bert squad model exported by PyTorch 1.4
* update bert performance test tool:
(1) set OpenMP environment variable before importing onnxruntime.
(2) launch new process for each test.
* Add notebook
Reduce combinations in perf test
* update readme
* fix quote
* Allow testing multiple batch_size values
* Add latency percentile
* Add warm up run
Reset logger for notebook
* refine default settings to test for cpu/gpu
* Add script to dump machine info
* Add notebooks for PyTorch SQuAD model GPU and CPU inference
* Update machineinfo.py: add license header; format by yapf
* Do not reset log handler. Skip adding a handler if one already exists.
* Add comments about GPU result diff.
Filter rows of batch set to keep only one setting.
* update according to review feedback
* Download script from master branch
* Add notebook for bert model exported by keras2onnx
* format columns in result table
* re-run and update notebook
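The "skip adding a handler if one already exists" change above can be sketched roughly as follows. This is a minimal illustration using only the standard `logging` module; the logger name `bert_perf_sketch` and the helper `get_logger` are hypothetical and are not taken from the actual scripts.

```python
import logging
import sys


def get_logger(name="bert_perf_sketch"):
    """Return a named logger, attaching a stream handler only once.

    The name and format string are illustrative assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Re-running a notebook cell calls this again; without the check below,
    # every call would add another handler and each message would be
    # printed multiple times.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger


first = get_logger()
second = get_logger()  # same logger object, no second handler added
handler_count = len(second.handlers)
```

Checking `logger.handlers` instead of resetting the handlers leaves any handler installed by the notebook environment untouched.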
* Fix WCOS/Win32 linking bugs
* Remove unused NODEFAULTLIB flags
* Avoid plain target_link_libraries signature
* Avoid plain target_link_libraries signature
* Fix library list escaping
* Use library list instead of string
* Remove duplicate link to windowsapp.lib
* Remove Win32 build workarounds
* Specify CMake policies before initializing language
* Expose Win32 header definitions during build
* Force set API family
* Enable Win32 APIs in featurizer
* Use MT dynamic CRT
* Expose Win32 specific functions
* Disable app container globally
* Disable default wide functions in featurizers
* Add featurizers to test include path
* Workaround https://gitlab.kitware.com/cmake/cmake/issues/19428
* Revert pipeline debugging hacks
* Skip /FI in CUDA sources
* Default to Win32 builds
* Enable WCOS when using WinML
* Use generator expression to apply CMAKE_MSVC_RUNTIME_LIBRARY to C++ only
* Add support for sessions to share a global threadpool.
* Fix build issues
* Add tests, fix build issues.
* Added some documentation
* Fix centos issue when threadpools become nullptr due to 1 core.
* Fix mac and x86 build issues
* Address some PR comments
* Disabled test for android, added few more tests and addressed more PR comments.
* const_cast
Moved path_lib.h/cc from onnxruntime/core/framework to onnxruntime/core/platform and from the onnxruntime_framework to the onnxruntime_common libraries.
* Add notebooks for GPU and CPU inference of PyTorch BERT SQuAD model
* update bert_optimization.py: Do not add duplicated logger handler
* Add machineinfo.py to show machine configuration for notebook.
* Update bert performance test tool:
(1) Set OpenMP environment variable before importing onnxruntime.
(2) Use sub-process for each test
(3) Allow testing multiple batch_size values
(4) Add latency percentile
(5) Add warmup
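A rough sketch of items (1), (4) and (5) above, under stated assumptions: `OMP_NUM_THREADS` is used as the example OpenMP variable, `fake_inference` stands in for a real session run, and the helper names `percentile` and `measure` are hypothetical. The sub-process launch in item (2) is not shown.

```python
import os

# Assumption: OMP_NUM_THREADS is the variable being tuned. Per the notes
# above, it is set before the runtime would be imported.
os.environ["OMP_NUM_THREADS"] = "1"

import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure(run_once, warmup_runs=2, test_runs=20):
    """Time run_once() after discarding a few warm-up calls."""
    for _ in range(warmup_runs):
        run_once()  # warm-up results are not recorded
    latencies = []
    for _ in range(test_runs):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "average": sum(latencies) / len(latencies),
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p99": percentile(latencies, 99),
    }


def fake_inference():
    """Stand-in workload; a real test would call the inference session."""
    return sum(i * i for i in range(1000))


stats = measure(fake_inference)
```

Reporting percentiles alongside the average makes occasional slow runs visible, which a mean alone can hide.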
1. Fix onnxruntime server docker file build failure. Tested with the notebook in the ONNX tutorial; it works well.
2. Delete the docker files for the other EPs, because currently they don't work and I don't have enough time to update them.
Speed up TfIdf.
Build a trie-like structure to quickly exclude dead ends.
Use ParallelFor() to process the rows in parallel.
Make it non-template, batch it.
Check for short tail within the inner loop.
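The trie idea above can be illustrated with a small sketch. It is written in Python for brevity and is not the operator's actual C++ code; the names `build_trie` and `count_ngrams` are hypothetical, and skip-grams, weights and batching are left out.

```python
END = object()  # sentinel key marking "an n-gram ends at this node"


def build_trie(ngrams):
    """Build a nested-dict trie; each terminal node stores its n-gram id."""
    root = {}
    for ngram_id, ngram in enumerate(ngrams):
        node = root
        for token in ngram:
            node = node.setdefault(token, {})
        node[END] = ngram_id
    return root


def count_ngrams(tokens, trie, num_ngrams):
    """Count occurrences of every pooled n-gram in a token sequence."""
    counts = [0] * num_ngrams
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            node = node.get(token)
            if node is None:
                break  # dead end: no pooled n-gram continues this way
            if END in node:
                counts[node[END]] += 1
    return counts


pool = [(1,), (1, 2), (2, 3, 4)]
trie = build_trie(pool)
counts = count_ngrams([1, 2, 3, 4, 1], trie, len(pool))
```

Because the walk stops at the first token with no matching child, positions that cannot start any pooled n-gram cost a single lookup.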
Fixes a bug in TopK cuda implementation when input size is between GridDim::maxThreadsPerBlock and GridDim::maxThreadsPerBlock * 2. In this case, the BitonicTopK will generate all-zero outputs.
* Simplify Normalizer as the spec only requires support for 2D input.
Tried using Eigen (lpNorm<1>() and norm()) on each row, but that was much slower.
* Remove unused variable
* Use MlasComputeLogistic instead of manually computing values.
* Update test script to allow the tolerance to be specified when checking float output from logreg_iris.onnx.
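For the Normalizer simplification above, row-wise normalization of a 2D input can be sketched as follows. This assumes the three norm kinds `MAX`, `L1` and `L2`; taking the absolute maximum for `MAX` and leaving all-zero rows unchanged are assumptions of this sketch, not behavior confirmed by the change. The helper name `normalize_rows` is hypothetical.

```python
import math


def normalize_rows(rows, norm="L2"):
    """Normalize each row of a 2D list independently."""
    result = []
    for row in rows:
        if norm == "MAX":
            scale = max(abs(v) for v in row)  # assumption: absolute max
        elif norm == "L1":
            scale = sum(abs(v) for v in row)
        elif norm == "L2":
            scale = math.sqrt(sum(v * v for v in row))
        else:
            raise ValueError("unknown norm: " + norm)
        # Assumption: a row with zero scale is passed through unchanged.
        result.append([v / scale if scale != 0 else v for v in row])
    return result


l2 = normalize_rows([[3.0, 4.0]], "L2")
l1 = normalize_rows([[1.0, -3.0]], "L1")
mx = normalize_rows([[2.0, -4.0]], "MAX")
```

Restricting the input to 2D means one plain loop per row is enough, with no general axis handling.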
1. Do not reuse the main thread.
2. Do not add one when MLAS calculates the number of tasks to schedule. (I was the one who put the plus one there.)
This is the second try of #1839
It's known that this change has a negative performance impact on some of the models.
* Avoid use of vectors for tracking reader/writer offsets as it adds too much overhead if there are a lot of readers or writers.
Tracy found improvements in resnet34-ssd1200 and BERT Squad with this approach.
* API governance draft
* Update CONTRIBUTING.md
updated for ABI proposals
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* Incomplete, a draft iteration of 2 more changes - API docs and high-level design
* pushing to see how the picture size works on screen.
* added 2 charts on api choice and distribution choice
* details on contract checking
* lint cleanup and links
* PR feedback.
* fixed markdown and lists
* more markdown and lists
* fixed broken links
* PR feedback
* commas
* PR comments from Nick
* PR feedback
* fixed build section
Co-authored-by: Nick Geisler <36938193+ngeisler11@users.noreply.github.com>
Discussed with Faith: because the data size is very small and changes are gradual, there is no need to delete the old data. We want to keep all the history.
* update GeluFusion to support pattern from PyTorch 1.4;
* Fix a bug where the check for an edge between mul2 and root was missing.
* update script to fuse gelu from PyTorch 1.4
* Add test for python optimizer
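As background for the Gelu fusion items above, the sketch below shows why differently ordered subgraphs can map to one fused operator: the erf-based Gelu formula gives the same value regardless of the order in which the multiplications are applied. The exact subgraph shapes emitted by each exporter are not described here, so the two orderings are illustrative assumptions only, and both function names are hypothetical.

```python
import math


def gelu_reference(x):
    """Erf-based Gelu: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_reordered(x):
    """Same formula with the multiplications applied in another order."""
    return x * ((1.0 + math.erf(x / math.sqrt(2.0))) * 0.5)


samples = [-3.0, -1.0, 0.0, 0.5, 2.0]
pairs = [(gelu_reference(x), gelu_reordered(x)) for x in samples]
```

A fusion pass therefore has to recognize each ordering as a separate pattern even though the orderings are mathematically equivalent.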
This change fixes #3129. When running onnxruntime as a DLL on Windows, CUDA does some internal cleanup when the process exits. After this, any call to CUDA causes a crash. Delay-loading makes the thread_local destructor run after the CUDA cleanup, hence the crash.