1. Add LTCG back. It was set to default OFF in my previous PR to speed up Windows build. It is only needed in release pipelines.
2. Remove --use_featurizers from all the packaging pipelines
3. Make sure all the packages are built with OpenMP
Use CUDA 10.1 for Linux build
(Windows change is already in)
Please note: cuBLAS 10.2.1.243 is for CUDA SDK 10.1.243, not CUDA 10.2.x. CUDA 10.2.89 needs cuBLAS 10.2.2.89. The two versions match on the last component of the version number.
libcublas10-10.1.0.105 won't work!
The cuda docker image by viswamy is already using 10.1, no need to change.
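The version note above can be sketched as a small check. This is only the rule of thumb stated in the note (the last version component must agree), not an official compatibility test, and the function name `cublas_matches_cuda` is hypothetical.

```python
def cublas_matches_cuda(cublas_version, cuda_version):
    """Rule of thumb from the note above: the cuBLAS package and the
    CUDA SDK belong together when the last dotted component agrees."""
    return cublas_version.split(".")[-1] == cuda_version.split(".")[-1]


# Pairs taken from the note above.
print(cublas_matches_cuda("10.2.1.243", "10.1.243"))  # cuBLAS for CUDA SDK 10.1.243
print(cublas_matches_cuda("10.2.2.89", "10.2.89"))    # cuBLAS for CUDA 10.2.89
print(cublas_matches_cuda("10.1.0.105", "10.1.243"))  # the libcublas10 build that won't work
```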
* GPT2 Gelu Fusion & Test
* change header path
* Refine code & add missing test onnx file
* Fix builds & refine float/double/fp16 compare.
* Fix builds
* Add Bias Check and UTs
* Fix build and uts
* Fuse with second formula & test
* minor change
* disable FastGelu to see whether the builds can pass
* Verify where is wrong
* disable for debugging
* Revert "disable for debugging"
This reverts commit 535c0817fb36fb95a75773a7f00c8b969dd5362c.
* Revert "Verify where is wrong"
This reverts commit ffc43ec1d136636ba2cee30df49f563a75e84676.
* disable the transformer for inference currently
* Enable FastGeluFusion and fix segmentation fault when running bertsquad10.onnx test
* Add more unit tests covering Gelu subgraphs that use graph input/output
(cherry picked from commit 0739ab985240c6d9acdb8f0afd40c5fb316166af)
* Move Bias Fusion into BiasGelu.cc
Co-authored-by: Changming Sun <chasun@microsoft.com>
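For readers unfamiliar with the Gelu variants mentioned in the commits above, here is a minimal sketch. The commit messages do not say which formulas are fused, so this assumes the commonly used exact erf form and the tanh approximation usually associated with FastGelu; the function names are illustrative only.

```python
import math


def gelu_erf(x):
    # Exact form: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # Tanh approximation (assumed here to be the FastGelu form):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# The two forms agree only approximately, so a comparison needs a tolerance.
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, gelu_erf(x), gelu_tanh(x))
```

Because the two forms differ slightly, tests that compare them need a tolerance, and a looser one for lower-precision types such as fp16.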
Add support to fuse ReorderOutput+Transpose(NHWC). Converting from NCHWc to NHWC tensors is a trivial copy of data and avoids the cost of a transpose node.
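A minimal sketch of why the NCHWc to NHWC conversion is a plain copy, assuming a blocked layout of shape [N, ceil(C/block), H, W, block] stored as a flat list. The helper names and the `block` parameter are illustrative and are not the optimizer's actual code.

```python
def nchw_to_nchwc(src, n, c, h, w, block):
    """Pack a flat NCHW buffer into a blocked NCHWc buffer (zero padded)."""
    blocks = (c + block - 1) // block
    dst = [0.0] * (n * blocks * h * w * block)
    for ni in range(n):
        for ch in range(c):
            cb, ci = divmod(ch, block)
            for hi in range(h):
                for wi in range(w):
                    s = ((ni * c + ch) * h + hi) * w + wi
                    d = (((ni * blocks + cb) * h + hi) * w + wi) * block + ci
                    dst[d] = src[s]
    return dst


def nchwc_to_nhwc(src, n, c, h, w, block):
    """Copy each run of `block` channels straight into its NHWC position."""
    blocks = (c + block - 1) // block
    dst = [0.0] * (n * h * w * c)
    for ni in range(n):
        for cb in range(blocks):
            # The last block may be padded; copy only the real channels.
            width = min(block, c - cb * block)
            for hi in range(h):
                for wi in range(w):
                    s = (((ni * blocks + cb) * h + hi) * w + wi) * block
                    d = ((ni * h + hi) * w + wi) * c + cb * block
                    dst[d:d + width] = src[s:s + width]
    return dst


nchw = list(range(12))  # shape [1, 3, 2, 2]
blocked = nchw_to_nchwc(nchw, 1, 3, 2, 2, block=2)
print(nchwc_to_nhwc(blocked, 1, 3, 2, 2, block=2))
```

In the blocked layout the channels of one pixel are already contiguous within a block, so each inner step is a contiguous copy and no separate transpose pass is needed.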
This fixes a customer reported issue where the NCHWc optimizer was dropping graph outputs when an edge was used as both a graph output and an input to another NCHWc node.
* Optimization for Bert and DistilBert model exported by keras2onnx
* Add model_type parameter for models from different export tools (pytorch, tf2onnx, keras2onnx).
* Split LayerNormalization and SkipLayerNormalization fusions
Optimize the implementation of Math::Im2col that is currently used for ConvInteger/QLinearConv. Also, avoid Im2col for pointwise convolutions in ConvInteger.
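A minimal sketch of why Im2col can be skipped for pointwise convolutions, assuming stride 1, no padding, and a flat CHW input. This `im2col` is an illustrative pure-Python version, not the Math::Im2col implementation.

```python
def im2col(data, channels, height, width, kh, kw):
    """Unfold a flat CHW buffer into rows of kernel-position samples.

    Returns channels * kh * kw rows, each holding out_h * out_w values
    (stride 1, no padding).
    """
    out_h = height - kh + 1
    out_w = width - kw + 1
    rows = []
    for c in range(channels):
        for ky in range(kh):
            for kx in range(kw):
                row = []
                for oy in range(out_h):
                    for ox in range(out_w):
                        row.append(data[(c * height + oy + ky) * width + ox + kx])
                rows.append(row)
    return rows


data = list(range(2 * 3 * 3))  # 2 channels, 3x3 each

# With a 1x1 kernel the unfolded matrix is just the input, one row per channel.
print(im2col(data, 2, 3, 3, 1, 1) == [data[0:9], data[9:18]])
```

For a 1x1 kernel the output matrix holds the same values in the same order as the input, so the convolution can run its matrix multiply directly on the input buffer.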
* merge training kernels to master
* revert two files
* Avoid unnecessary copies of ModelProto
* Comment nit
* Nit
* Comment refactoring
* Fix build break
* Fix a few more instances where copies take place
* update onnx-tensorrt submodule to trt7 branch
* add fp16 option for TRT7
* switch to master branch of onnx tensorrt
* update submodule
* update to TensorRT7.0.0.11
* update to onnx-tensorrt for TensorRT7.0
* switch to private branch due to issues in master branch
* remove trt_onnxify
* disable warnings c4804 for TensorRT parser
* disable warnings c4702 for TensorRT parser
* add back sanity check of shape tensor input in the parser
* disable some warnings for TensorRT7
* change fp16 threshold for TensorRT
* update onnx-tensorrt parser
* fix cycle issue in faster-rcnn and add cycle detection in GetCapability
* Update TensorRT container to v20.01
* Update TensorRT image name
* Update linux-multi-gpu-tensorrt-ci-pipeline.yml
* Update linux-gpu-tensorrt-ci-pipeline.yml
* disable rnn tests for TensorRT
* disable some unit tests for TensorRT
* update onnx-tensorrt submodule
* update build scripts for TensorRT
* format the code
* Update TensorRT-ExecutionProvider.md
* Update BUILD.md
* Update tensorrt_execution_provider.h
* Update tensorrt_execution_provider.cc
* Update win-gpu-tensorrt-ci-pipeline.yml
* use GetEnvironmentVar function to get environment variables and switch to Win-GPU-2019 agent pool for Windows CI build
* change tensorrt path
* fix win ci build issue
* update code based on the reviews
* fix build issue
* roll back to cuda10.0
* add RemoveCycleTest for TensorRT
* fix windows ci build issues
* fix ci build issues
* fix file permission
* fix out of range issue for max_workspace_size_env
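The cycle detection mentioned in the list above ("add cycle detection in GetCapability") can be illustrated with a generic depth-first search. This is a textbook sketch over a hypothetical adjacency-dict graph; the commit messages do not show how the execution provider actually implements the check.

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> list of successors)
    contains a cycle, using three-state DFS marking."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for succ in graph.get(node, ()):
            succ_state = state.get(succ, UNSEEN)
            if succ_state == IN_PROGRESS:
                return True  # back edge: succ is on the current DFS path
            if succ_state == UNSEEN and visit(succ):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in graph)


# Hypothetical example: fusing nodes "a" and "c" into one node while "b"
# stays outside would create fused -> b -> fused.
print(has_cycle({"a": ["b"], "b": ["c"], "c": []}))
print(has_cycle({"fused": ["b"], "b": ["fused"]}))
```

A check of this kind lets a partitioner reject or split a candidate subgraph whose fusion would make the outer graph cyclic.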
Provide alternative std::mutex implementation on Windows. OrtMutex is no longer an alias of std::mutex.
We do it because:
1. The new implementation is faster and much simpler.
2. Static constructors are considered harmful. We should avoid them as much as possible.
* Enable ARM64 release builds
* Add ARM release
* Skip C# dll signing in ARM
* Copy ARM binaries to Nuget
* Restore nuget packages before ARM packaging
* wip
* Use host protoc at C# build
* Set ProtocDirectory on cross-compiled builds
* wip
* Fix typo