1. Let mlas use the session thread pool
2. Remove the onnxruntime_USE_MLAS cmake option
3. Remove the win32 thread pool code inside mlas
mlas will:
1. use the ort thread pool if one is passed in
2. use openmp if the threadpool parameter is nullptr and openmp is enabled
3. run single threaded if the threadpool parameter is nullptr and openmp is disabled.
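The fallback order above can be sketched roughly as follows. This is an illustration only, not the mlas code: `ThreadPool`, `Backend` and `ExecuteThreaded` are hypothetical names, and the stand-in pool runs its work inline to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the session thread pool (not the real ORT class).
struct ThreadPool {
  void ParallelFor(std::ptrdiff_t n, const std::function<void(std::ptrdiff_t)>& fn) {
    // A real pool would distribute iterations to worker threads; run inline here.
    for (std::ptrdiff_t i = 0; i < n; ++i) fn(i);
  }
};

// Reports which path was taken, purely for illustration.
enum class Backend { SessionThreadPool, OpenMP, SingleThreaded };

Backend ExecuteThreaded(std::ptrdiff_t n,
                        const std::function<void(std::ptrdiff_t)>& fn,
                        ThreadPool* pool) {
  if (pool != nullptr) {
    // 1. A thread pool was passed in: use it.
    pool->ParallelFor(n, fn);
    return Backend::SessionThreadPool;
  }
#ifdef _OPENMP
  // 2. No pool, but the build has OpenMP enabled.
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) fn(i);
  return Backend::OpenMP;
#else
  // 3. No pool and no OpenMP: plain single-threaded loop.
  for (std::ptrdiff_t i = 0; i < n; ++i) fn(i);
  return Backend::SingleThreaded;
#endif
}
```

Each iteration should only touch its own output slot, since paths 1 and 2 may run iterations concurrently.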
Added Sample Featurizer and Infrastructure
Make featurizers and unit tests compile and run with GTest.
Create definitions for the first featurizer kernel.
Add new operator domain.
Create datetime_transformer kernel and build.
Move OPAQUE type definitions for featurizer kernels out to a separate cc.
Register them with the type system.
Provide unit tests for new AutoML DateTimeTransformer kernel.
Make necessary adjustments to the test infrastructure to make it run
with new types.
- Added python script for generating markdown doc from the registered opkernels.
- Made some conditional changes in the pybind to expose necessary python API
- Added some missing type-constraints in the op kernel registrations
* Mention OrtCreateSessionFromArray in C API doc
* review changes
* use enum for graph optimization level
* Use explicit values for enums
* updates...
* Add friendly enum for graph optimization levels in C, C# and Python APIs.
* Fix linux build
* Fix build breakage due to master merge
* PR comments
* Minor perf improvements.
  - Cache the vector sizes in IExecutionFrame and NodeIndexInfo to avoid calls to size().
    - 2 instructions instead of 10
  - Remove an unnecessary check in IExecutionFrame
    - add a check to the ctor so we guarantee it's unnecessary
  - Reserve memory for the vectors in BroadcastIterator
    - saves reallocs if more than one value is added
    - multiple values are rarely added with the mlperf models, so the benefit is limited
  - Slight tweak to the Broadcaster ctor code to make it more readable
* Fix perf test executable due to removal of certain C APIs
* fix linux build
* Avoid duplication
* Fix mem leak
* Update nGraph to 0.21 and adjust the EP
* Share the graph initializers between custom ops
* Update nGraph to 0.22 and exclude Gather entirely
* Enable building on Windows with nGraph v0.21.1-rc.0
* Disable the unsigned input Shrink op tests for nGraph until the next update
* Line-shortening code refactor
* Fix for the master branch merge artifact
* MKLDNN patches adjustment for Windows
* Exclude MatMulInteger for non-const zero points
* Exclude ConvInteger for non-const zero points
* Enable full Cast op support
* Use the v0.22.1 tag
* Skip ConvTranspose_InvalidKernelShape test for nGraph provider
* Create sub-graph ModelProto from fused_node
* Implement new LabelEncoder in opset 2 in ML domain
* Fix compilation error
* Fix tests
* Include ONNX's fix
* Formatting and addressing a comment
* Address a minor comment
* Most nodes do not need a fence check. We only need to do FenceCheck for CPU<->GPU mem sync nodes.
Currently we pay the fence check cost for every single node and every single input and output.
This change performs the fence check only when necessary.
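A rough sketch of the idea, under the assumption that each node can be flagged ahead of time as a CPU<->GPU mem sync node. `NodeStep`, `is_mem_sync_node` and `ExecuteSteps` are hypothetical names, not the executor's actual types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node record; not the actual ORT execution plan types.
struct NodeStep {
  bool is_mem_sync_node;  // assumed flag: true only for CPU<->GPU copy nodes
};

// Walks the steps and returns how many fence checks were performed.
int ExecuteSteps(const std::vector<NodeStep>& steps) {
  int fence_checks = 0;
  for (const NodeStep& step : steps) {
    if (step.is_mem_sync_node) {
      // Stand-in for the real fence check over the node's inputs and outputs.
      ++fence_checks;
    }
    // The kernel's compute call would run here for every node.
  }
  return fence_checks;
}
```

With one sync node among four, only one fence check runs instead of four.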
* remove memory copy between CUDA and TRT
* add info to RegisterExecutionProvider input
* use new IDeviceAllocator for trt allocator
* remove SetDefaultInputsMemoryType from TRT EP
* remove onnx-tensorrt 5.0
* add submodule onnx-tensorrt branch 5.1
* remove redundancy
* Update transformer_memcpy.cc
* Update tensorrt_execution_provider.cc
* switch to TensorRT 5.1.5.0
* update python binding
* disable failed test case on TensorRT
* Update activation_op_test.cc
* upgrade to TensorRT container 19.06
* update according to feedback
* add comments
* remove tensorrt allocator and use cuda(gpu) allocator
* update onnx-tensorrt submodule
* change ci build cuda directory name
Fix race condition in RNN/LSTM/GRU.
Description:
The filter_desc and rnn_desc can also be modified in compute, which may run on multiple threads. This causes a race condition.
Fix:
create temporary cudnn descriptors
cache cudnn_dropout_desc_, which won't change
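The shape of the fix, sketched without cuDNN so it compiles anywhere. `Descriptor`, `DropoutDescriptor` and `RnnKernel` are hypothetical stand-ins; the point is only that per-call descriptors live on the stack while the one that never changes stays cached as a member.

```cpp
#include <cassert>

// Hypothetical stand-ins for cuDNN descriptors (the real ones are opaque handles).
struct Descriptor {
  int seq_length = 0;
};
struct DropoutDescriptor {
  float ratio = 0.0f;
};

class RnnKernel {
 public:
  explicit RnnKernel(float dropout_ratio) {
    // Set once at construction and only read afterwards, so caching is safe.
    dropout_desc_.ratio = dropout_ratio;
  }

  // const: Compute does not mutate any shared state, so concurrent calls
  // cannot race on descriptors.
  int Compute(int seq_length) const {
    Descriptor rnn_desc;  // temporary, private to this call
    rnn_desc.seq_length = seq_length;
    // Stand-in for the real RNN forward call that would use both descriptors.
    return rnn_desc.seq_length * 2;
  }

  float DropoutRatio() const { return dropout_desc_.ratio; }

 private:
  DropoutDescriptor dropout_desc_;  // cached, never changes after the ctor
};
```

Had `rnn_desc` been a member written inside `Compute`, two threads with different sequence lengths could overwrite each other's settings.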
* A few performance improvements:
  - Make the iteration in NonZero more efficient by using a raw pointer and simplifying the increment logic
    - add another unit test to check the new logic works with a 3-dimensional tensor
    - gains about 2% for ssd_mobilenet
  - Avoid floating point operations on each iteration in Concat
    - about 0.5% for ssd_mobilenet and ssd_resnet34
  - Put the common case first in ExecutionFrame::AllocateAsPerAllocationPlan to avoid an unnecessary call to IsSparseTensor
    - about 0.05% for ssd_mobilenet
  - Minor tweak to put some ctors in the TensorShape header so they can be inlined more easily
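For the NonZero item, a minimal sketch of raw-pointer iteration with a simple odometer-style coordinate increment. This is an assumption about the general approach, not the kernel's actual code; `NonZeroCoords` is a hypothetical helper and it returns one coordinate vector per non-zero element rather than the operator's real output layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the coordinates of every non-zero element, in row-major order.
std::vector<std::vector<int64_t>> NonZeroCoords(const float* data,
                                                const std::vector<int64_t>& shape) {
  std::size_t total = 1;
  for (int64_t dim : shape) total *= static_cast<std::size_t>(dim);

  std::vector<int64_t> coord(shape.size(), 0);
  std::vector<std::vector<int64_t>> result;

  const float* end = data + total;
  for (const float* p = data; p != end; ++p) {
    if (*p != 0.0f) result.push_back(coord);

    // Advance the coordinate like an odometer: bump the last axis and
    // carry into earlier axes when one wraps around.
    for (std::size_t axis = shape.size(); axis-- > 0;) {
      if (++coord[axis] < shape[axis]) break;
      coord[axis] = 0;
    }
  }
  return result;
}
```

The increment uses only integer compares and adds, with no division or modulo per element.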
* If there is an outer scope value that matches a subgraph input, don't create an implicit input from the outer scope value.
Minor unrelated change for issue noticed while debugging: Use unordered_set for implicit inputs so we don't add them multiple times.
* Add unit test based on onnx issue.