* change c++14 to c++11
* add ld lib path for centos
* enable csharp tests on macos
* fix C API test on MacOS + fix manylinux dotnet install
* fix manylinux dotnet install
* fix lib link
Rework TensorSeq in a manner consistent with Tensor and SparseTensor
in terms of type system setup.
Reduce templating. Introduce helpers to ensure the same
data type.
Make the OrtValue destructor non-virtual.
Introduce ContainerChecker.
* enable telemetry
* set enable telemetry as default
* for debugging
* remove log and set telemetry back to disabled by default
* delete private file while testing
* resolve comment: mainly add license header, rename macro and update docs
* rewording in privacy.md
* add centos tests to linux cpu ci pipeline
* Disable failing test
* use centos6 instead of centos7
* change back to centos7
* add dotnet runtime dependency
* fix dotnet runtime dependencies
* install dotnet sdk instead of runtimes
* add more dotnet dependencies
* temporarily skip failing test
* fix lib path
* reenable failing test
Add support of GPT2 model optimization:
* Match subgraph of Gelu Approximation (using Tanh).
* Fuse LayerNormalization if SkipLayerNormalization is not ready.
* Output model even if embedding layer is not fused.
* Improve Reshape Fusion to improve coverage.
* Refine constant input checking, and output fused op counter.
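For reference, the tanh-based Gelu approximation that the first bullet matches is commonly written as 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). The sketch below compares it with the erf-based definition; the function names are illustrative and are not taken from the optimization script.

```python
import math


def gelu_exact(x: float) -> float:
    """Gelu defined through the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of Gelu (illustrative helper)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_abs_diff(points) -> float:
    """Largest absolute gap between the two forms over the given points."""
    return max(abs(gelu_exact(x) - gelu_tanh(x)) for x in points)
```

In an exported graph the approximation typically shows up as a chain of Mul, Add, Pow and Tanh nodes, which is presumably the subgraph pattern being matched.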
Update script according to latest op improvements:
* Fusion of Add Bias and Gelu.
* Fuse SkipLayerNormalization and Add Bias.
Other:
* Add ReduceSum for mask as intermediate step.
* Refactor verbose setting.
* Constant folding bug fix/improvements
- Handle constant folding for a node that is assigned to a non-CPU EP
- Check for errors in optimizer execution frame setup
- Improve CUDA partitioning to look for initializers in parent graphs
- Add unit test
Fixes #2474
* [NupharEP] Add parallel schedule to JIT function name
Update Nuphar docker to use Python 3.6 and ubuntu 18.04
* Update notebook
* Avoid JIT cache file name conflict
* [NupharEP] Enable parallel schedule
* Update TVM with the fix to TVM threadpool to use OpenMP if possible
* Add parallel schedule when trying to vectorize
With this change, BERT squad perf on a 4-core (8 HT) CPU goes from 187ms to 150ms
* Address CR, docs and cmake update
* Doc fix
* Fix mkl
* Fix TVM windows build when using mklml
- Add --skip_tests option to build.py based on github feedback
- Add debug output at end of run_subprocess so it's clearer when the output is from a different process running
- Add check for scipy as it's required by gen_test_models.py for the onnx tests
- Use log.warning instead of warnings.warn for consistency. We use the logger almost everywhere and somewhat randomly used warnings.warn in two places.
- Add check for 'wheel' dependency not being found in setup.py and handle more gracefully
- Fix invalid input name in Keras tests
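A minimal sketch of how a --skip_tests flag could gate the test step, and of routing a warning through the logger rather than warnings.warn. The parser layout, the logger name and planned_steps are hypothetical and do not reproduce the actual build.py code.

```python
import argparse
import logging

log = logging.getLogger("build")  # hypothetical logger name


def parse_args(argv):
    """Parse the illustrative command line."""
    parser = argparse.ArgumentParser(description="Illustrative build driver")
    parser.add_argument("--skip_tests", action="store_true",
                        help="Build but do not run the tests.")
    return parser.parse_args(argv)


def planned_steps(args):
    """Return the build phases that would run for the parsed arguments."""
    steps = ["configure", "build"]
    if args.skip_tests:
        # Report through the logger so the message matches other build output.
        log.warning("Tests skipped because --skip_tests was given.")
    else:
        steps.append("test")
    return steps
```

With this shape, planned_steps(parse_args(["--skip_tests"])) leaves out the test phase, while an empty argument list keeps it.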
* Cuda Clip() for op set 11.
* Take the min_val and max_value inputs directly from CPU memory.
* Remove the useless "#pragma once" from the original .cu file.
* merge duplicate logic into one class.
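As background, opset 11 of Clip takes min and max as optional inputs rather than attributes, which is why the kernel has to read them at run time. A reference sketch of those semantics on a flat list follows; the function and parameter names are illustrative and this is not the CUDA kernel.

```python
def clip(values, min_val=None, max_val=None):
    """Clamp each element to [min_val, max_val]; a missing bound means unbounded."""
    lo = float("-inf") if min_val is None else min_val
    hi = float("inf") if max_val is None else max_val
    return [min(max(v, lo), hi) for v in values]
```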
Enable conv/conv_transpose and existing pooling for opset 11 in cuda execution provider.
These are the CUDA ops affected by the dilations/strides spec changes in opset 11.
* Optimize CPU Transpose for one axis moving either inwards or outwards. We have optimizations for NCHW <-> NHWC in CUDA but not CPU. This provides a more generic optimization to the CPU implementation.
Tested performance in both directions with element sizes of 8, 16, 32 and 64 bits; moved-axis sizes of 3, 16 and 32; and 100x100, 300x300 and 1000x1000 elements to move.
Across all tests the average improvement, even with the overhead of Python, was 2.5x. No cases were slower. Some were 6x faster.
Binary size increase in RelWithDebInfo build is ~5K.
NOTE: See PR comments for details of performance comparison with Eigen. Eigen is slightly faster but increases binary size by 55K just for support of rank 4 input. Binary size would be further increased to support different ranks.
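One way to see why a single-axis move is cheaper than a general transpose: the tensor can be viewed as [prefix, rows, cols, suffix], and the move becomes a 2D transpose of rows x cols in which each element is a contiguous block of suffix values. The sketch below illustrates that idea on a flat row-major list; it is an assumption about the approach, not the actual CPU kernel, and move_single_axis is a hypothetical name.

```python
from math import prod


def move_single_axis(data, shape, src, dst):
    """Move axis `src` to position `dst` in a flat row-major list."""
    lo, hi = min(src, dst), max(src, dst)
    prefix = prod(shape[:lo])       # dims before the affected range
    suffix = prod(shape[hi + 1:])   # dims after it stay contiguous
    if dst < src:
        # Outwards: input is [prefix, block, axis, suffix].
        rows, cols = prod(shape[dst:src]), shape[src]
    else:
        # Inwards: input is [prefix, axis, block, suffix].
        rows, cols = shape[src], prod(shape[src + 1:dst + 1])
    out = [None] * len(data)
    for p in range(prefix):
        base = p * rows * cols * suffix
        for r in range(rows):
            for c in range(cols):
                s = base + (r * cols + c) * suffix
                d = base + (c * rows + r) * suffix
                out[d:d + suffix] = data[s:s + suffix]  # copy one contiguous block
    new_shape = list(shape)
    new_shape.insert(dst, new_shape.pop(src))
    return out, new_shape
```

For example, NHWC to NCHW is move_single_axis(data, [n, h, w, c], 3, 1), and the reverse direction is move_single_axis(data, [n, c, h, w], 1, 3).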
Add Attention Fusion Transformer to fuse the multi-head self-attention subgraph into one node, optimizing BERT model inference performance.
It supports BERT models exported from PyTorch. It fuses about 20 nodes into one Attention node, and could significantly improve the inference speed of BERT models.
Support a symbolic first dimension (batch size) in the input shape.
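To illustrate the kind of computation a multi-head self-attention subgraph performs, here is a plain-Python sketch for a single sequence. It omits biases and the attention mask, and the function names and weight layout are assumptions for illustration; the fused Attention node's exact inputs and attributes are not described here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, wq, wk, wv, num_heads):
    """x: [seq, hidden]; wq, wk, wv: [hidden, hidden]. Returns [seq, hidden]."""
    hidden = len(x[0])
    head = hidden // num_heads
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    out = [[] for _ in x]
    for h in range(num_heads):
        cols = slice(h * head, (h + 1) * head)
        qh = [r[cols] for r in q]
        kh = [r[cols] for r in k]
        vh = [r[cols] for r in v]
        # Scaled dot-product scores between every pair of positions.
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(head) for kj in kh]
                  for qi in qh]
        probs = [softmax(r) for r in scores]
        ctx = matmul(probs, vh)
        for i, r in enumerate(ctx):
            out[i].extend(r)  # concatenate the heads back along hidden
    return out
```

Each of the projections, the reshape into heads, the scaling, the softmax and the final concatenation is typically a separate node in an exported graph, which gives a sense of how roughly 20 nodes can collapse into one.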