* Using vectorized loads (float2) for fp16 to improve performance
* Fix a few warnings from cpplint
* Fix a few warnings from cpplint
* Use __float2half2_rn and fix some cpplint warnings
* Move some computaions to LaunchFastGeluKernel
* Fix some Lint C++ warning
* Using vectorized loads (float4) for fp16 to improve performance
* Switch whether to optimize FastGelu with float4 vectorization
* Switch to float4 memory access based on input_length in FastGelu
* Comment how to set the threshold of float2 and float4 vectorized kernels
* Add FastGelu fp16 unit tests for bias_length = 2 and 8
* Make vectorized kernels generic with aligned_vector
* Unify the vectorized kernels with/without bias
* Refactor the code to suppress cpplint warnings
* Solve formatting issues
* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel
* Move fast_gelu_impl.h to rocm/bert
* Fix some Lint C++ warnings and code alignment
* Add .net6 support to the C# nuget package.
Currently requires jumping through a lot of hoops due to .net 6 only being supported in the preview release of VS 2022.
Build existing targets using msbuild.
Add .net6 targets and build using dotnet.
Create nuget package with combined targets.
A few misc automated changes from VS to spacing and adding a couple of properties.
* Try manually installing trt8.4 in multi-gpu pipeline
* Remove stmts that clean up cmake, ctest. Update tensorrt repository name passed to get_docker_image.py
* Update trt and cudnn home
* Don't install trtexec cli tool.
* Increase job timeout
* Revert timeout change and use trt placeholder builder build option
* update trt 8.4ga
* trt 8.4 linux ci pipeline
* fix cmake
* placeholder_builder
* trt 8.4 windows pipeline
* gpu package pipeline
* trt 8.4.1.5 , packaging pipeline updates
* python packaging
* ctest timeout
* python packaging test
* bump timeout
* python format
* format
* revert
* newline
* enable trt python tests
* typo
* python format
* disable on windows
* Rework the EP factory creation setup so we're not cut-and-pasting function declarations in multiple places.
Convert append EP for SNPE to be generic, and also use for XNNPACK.
Add XNNPACK to C# API
* Don't need stub for MIGraphX as it's using provider bridge.
* Remove old 'create' functions that aren't applicable now that the EPs are built as separate libraries.
* Only use EPs that require the layout transform if the opset is supported by the layout transformer.
* Update wasm registration of xnnpack.
* aten op for inference
* fix build error
* more some code to training only
* remove domain from operator name
* move aten_op_executor ext out from ortmodule
* add pipeline
* add exec mode
* fix script
* fix ut script
* fix test pipeline
* failure test
* rollback
* bugfix
* resolve comments
* enable aten for python build only
* fix win build
* use target_compile_definitions
* support io binding
* turn off aten by default
* fix ut
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: zhijxu <zhijxu@microsoft.com>
* update TVM
* get alignment constant from TVM
* update TVM_VM_SetInputs to upstream with TVM API
* fix CI issue: update TVM EP dependencies
* add sudo
* revert changes needed to install missing package
* add package for TVM EP CI
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for andorid package
* add AdditionalDependencies for amr64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according review comments
* update according review comments
* Implement XNNPACK support via an EP.
* Layout transform uses the GraphPartitioner infrastructure.
* Node fusion is supported.
* Conv and MaxPool implementations were ported from Changming's PR.
* Added optional mutex in InferenceSession::Run as we only want to allow sequential calls if xnnpack is enabled
* [UPDATE] update amd ci pipeline 2 rocm5.1.1
* [FIX] json format error
* [ERROR] disable unit tests
* [FIX] ucx error
* [FIX] cmake version
* [FIX] units test