* First attempt for half2 vectorized memory access in SkipLayerNorm
* Add some functions for debugging
* Clean up the code
* Clean up the code
* Generalize the vectorized kernels with aligned_vector and remove cudaDeviceProp
* Add a unit test for a larger input size
* Fix some Lint C++ warnings
* Use ILP = 4 for the vectorized kernels
* Rewrite the vectorized kernel and templatize ComputeSkipLayerNorm
* Use conditional operator for input_v
* Refactor LaunchSkipLayerNormKernel and replace the original SkipLayerNormKernelSmall with the vectorized kernel
* Clean some comments and rename the layernorm function
* Use ComputeSkipLayerNorm to replace LaunchSkipLayerNormKernel
* Resolve a Lint C++ warning
* Fix SkipLayerNormBatch1_Float16_vec output data
* Add hipified code of bert SkipLayerNorm for ROCmEP
* Resolve some Lint C++ warnings
* Resolve some Lint C++ warnings
* Resolve some Lint C++ warnings
* Resolve Python formatting issue
The generated bindings cause C# build errors that require workaround code. Disabling generation should avoid the need for any workarounds.
As the user has the C# ORT package with the C# to C bindings, there's no need for binding generation that calls the ORT Java API (which would be C# -> Java -> C).
* Add op tuning functionality and example for vector add.
* Add namespace.
* Various improvements.
* use unique pointer
* fix lint errors
* Check return error.
* Add utility methods for resize_output
* Eager mode: implement abs.out
This is an initial hand-written implementation of an out= operator to
demonstrate how to structure out= methods using resize_out helper
methods.
This is meant to be used as a reference when we update the code
generator to generate implementations for out= operations.
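A minimal sketch of the out= structure described above, written in plain Python rather than the eager-mode C++. The `Tensor` class and the `resize_output` and `abs_out` signatures here are hypothetical stand-ins for illustration, not the actual ORT or PyTorch helpers:

```python
class Tensor:
    """Toy 1-D tensor; a hypothetical stand-in, not an ORT or PyTorch type."""

    def __init__(self, data=None):
        self.data = list(data or [])


def resize_output(out, size):
    """Hypothetical helper: make `out` hold exactly `size` elements.

    Returns True if the buffer had to be replaced, False otherwise.
    """
    if len(out.data) == size:
        return False
    out.data = [0.0] * size
    return True


def abs_out(self_tensor, out):
    """out= variant of abs: resize `out` first, then write results into it."""
    resize_output(out, len(self_tensor.data))
    for i, value in enumerate(self_tensor.data):
        out.data[i] = abs(value)
    return out
```

The point of the pattern is that the caller owns `out`, and the operator only resizes it when the existing size does not match.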
* Update C# runtest.sh for opset 17
Should have been part of https://github.com/microsoft/onnxruntime/pull/11924
* get appropriate opset version from onnx doc
* use absolute rather than relative path
* fix typo in var name
* Add net6 targets.
Remove maccatalyst as we don't have a native build targeting that.
* Set platform in macos targets
* Add targetFramework entries
* Move NativeLib.DllName definition and set it using preprocessor values for simplicity. Couldn't get it to build with the preprocessor-based setup when it was in a separate file.
Update the nuspec generation to set platform version for .net6 targets. TODO: Validate versions. I copied them from the managed nuget package the packaging pipeline generated prior to adding targets. Possibly we could/should lower some of the versions.
Hopefully the need to specify a version goes away when the release version of VS2022 supports .net6.
* Try android 31.1 as https://github.com/actions/virtual-environments/blob/main/images/win/Windows2022-Readme.md suggests that should be available on the CI machines
* Fix patch version mismatch
Add some extra debug info in case it helps
* Debug nuget location in CI
* Add workspace entry back in
* Add steps
* One more attempt with hardcoded nuget.exe path and original android31.0 version
* Better fix - found explicit nuget download and updated version there.
* flake8 fixes
* Fix black complaints.
* Exit Microsoft_ML_OnnxRuntime_CheckPrerequisites for net6 iOS.
* Removed outdated comment
Add support for PyTorch `resize_` operation. The PyTorch API method is documented
here:
https://pytorch.org/docs/stable/generated/torch.Tensor.resize_.html
Implementation notes:
There are some implementation details that might deviate from
expectations:
- As onnxruntime::tensor does not support a resize operation, this
functionality is supported on the TensorImpl by swapping out the
backing tensor if the size changes.
- In the ORT model the shape of the TensorImpl is defined by the
backing onnxruntime::tensor, so a TensorImpl cannot have a
different shape / size than the backing onnxruntime::tensor. This
means that when resizing to a smaller TensorImpl, where other
implementations might keep the same backing storage, ORT will
allocate a new onnxruntime::tensor and copy over as many of the
existing elements as fit. Functionally, you will end up with the same
output, but the underlying buffer will be re-allocated.
A future change could be to allow ORTTensorImpl to have a different
size / shape than the onnxruntime::tensor backing it, and then we
could improve this behavior.
The canonical CPU / CUDA implementations in PyTorch repository:
CPU: aten/src/ATen/native/Resize.cpp
CUDA: aten/src/ATen/native/cuda/Resize.cpp
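The swap-the-backing-tensor behavior in the implementation notes can be sketched roughly as follows. This is plain Python for illustration only; `BackingTensor` and `TensorImplSketch` are hypothetical names, and zero-filling new elements is an assumption of the sketch (PyTorch leaves them uninitialized):

```python
class BackingTensor:
    """Fixed-size buffer standing in for onnxruntime::tensor (hypothetical)."""

    def __init__(self, size):
        self.data = [0.0] * size


class TensorImplSketch:
    """Wrapper whose size is always defined by its backing buffer."""

    def __init__(self, values):
        self.backing = BackingTensor(len(values))
        self.backing.data[:] = values

    def resize_(self, new_size):
        old = self.backing
        if new_size == len(old.data):
            return self  # same size: keep the existing buffer
        # The backing buffer cannot change size, so allocate a new one
        # and copy over as many existing elements as fit.
        replacement = BackingTensor(new_size)
        keep = min(new_size, len(old.data))
        replacement.data[:keep] = old.data[:keep]
        self.backing = replacement
        return self
```

Shrinking therefore yields the same visible elements as an in-place shrink would, but always through a newly allocated buffer.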
* Add FastGelu to kernel explorer for profiling.
* fix python lint errors
* Fix one more python lint error
* Delete white space (python lint)
* Various improvements.
* Update README.md
* refactor header files
This patch optimizes Resize when the input X is a 4D int8/uint8 tensor
and the mode is linear by:
* Transforming NCHW Resize to NHWC variant
* Using the NHWC Resize kernel without floating-point computation
It improves DeepLab V3 with uint8 quantization by 19% on X64. It also improves
Resize of DeepLab V3 with int8 quantization by 15%~18% on X64.
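To illustrate what linear interpolation "without floating-point computation" can look like, here is a rough one-dimensional sketch using fixed-point weights. It is not the actual kernel: the function names, the 10-bit weight precision, and the align-corners style coordinate mapping are all assumptions made for the example.

```python
def lerp_fixed(a, b, weight_q, shift=10):
    """Integer-only linear interpolation between a and b.

    weight_q is the weight of b scaled by 2**shift; the added half unit
    rounds to nearest before the final right shift.
    """
    one = 1 << shift
    return (a * (one - weight_q) + b * weight_q + (one >> 1)) >> shift


def resize_row_linear_u8(row, out_len, shift=10):
    """Resize one row of uint8 samples using only integer arithmetic."""
    in_len = len(row)
    if in_len == 1 or out_len == 1:
        return [row[0]] * out_len
    mask = (1 << shift) - 1
    out = []
    for i in range(out_len):
        # Source position in fixed point: i * (in_len - 1) / (out_len - 1).
        pos_q = ((i * (in_len - 1)) << shift) // (out_len - 1)
        i0 = pos_q >> shift               # integer part: left neighbor
        i1 = min(i0 + 1, in_len - 1)      # right neighbor, clamped
        out.append(lerp_fixed(row[i0], row[i1], pos_q & mask, shift))
    return out
```

For example, `resize_row_linear_u8([0, 255], 3)` gives `[0, 128, 255]`. In an NHWC layout the same weights apply to every channel of a pixel, which is one reason an NHWC kernel suits this kind of computation.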
* Add warning about future computation change for ConvTranspose with auto_pad
* improve msg
* update TODO to make lint happy
* update more contents for warning and add if
* valid was not affected
* move it into kernel registration
* parse auto_pad myself
* try to use conv_transpose_attrs_.auto_pad directly
* implement infrastructure for the handshake mechanism; SHA-256 selected as the first hash algorithm
* check hash during compile in TVMso EP
* add IPP-CRYPTO to external dependencies for TVM EP
* made checkHash method constant
* removed the public implementation of the SHA-256 algorithm so as not to cause a license conflict
* implemented SHA-256 calculation using ipp-crypto library
* fix dependency for ipp-crypto
* add provider options for hash check
* update documentation for added provider options
* add hash check condition
* fix docs
* fix lint
* fix ORT_THROW
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
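As a rough illustration of the hash check idea, the sketch below computes a SHA-256 digest of a file and compares it with an expected value. It uses Python's hashlib rather than ipp-crypto, and the function names and the choice to hash a whole file are assumptions for the example, not the TVM EP's actual interface:

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large compiled artifacts are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_hash(path, expected_hex):
    """Hypothetical check: True if the file's digest matches `expected_hex`."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.lower())
```

A caller would typically refuse to load the artifact when `check_hash` returns False.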
(1) add --run_shape_inference to make shape inference optional
(2) add --vocab_mask to make the input optional
(3) add --overwrite in gpt2 convert_to_onnx to allow overwriting an existing raw onnx model exported from PyTorch
(4) save gpt2 model tensors to one external data file by default
(5) group convert_beam_search arguments to multiple groups
(6) make --decoder_onnx optional for gpt2 model
(7) replace print by logger
(8) update shape inference function to support external data.
(9) when saving external data, show warning if onnx version < 1.12
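The argument changes listed above could be organized along these lines with argparse groups. The flag names come from the list; the group titles, flag types, defaults, and help strings are assumptions for the sketch and may differ from the real script:

```python
import argparse


def build_parser():
    """Build a parser with arguments split into named groups."""
    parser = argparse.ArgumentParser(description="Beam search conversion (sketch)")

    model_group = parser.add_argument_group("Model options")
    model_group.add_argument("--decoder_onnx", required=False, default="",
                             help="Optional path to the decoder ONNX model.")
    model_group.add_argument("--overwrite", action="store_true",
                             help="Overwrite an existing raw ONNX model.")

    search_group = parser.add_argument_group("Beam search options")
    search_group.add_argument("--vocab_mask", action="store_true",
                              help="Add the optional vocab_mask input.")
    search_group.add_argument("--run_shape_inference", action="store_true",
                              help="Run shape inference on the result.")
    return parser
```

Grouping only changes how `--help` is laid out; parsed values still land on one flat namespace.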
* Pad fallback to CPU
* Added queryPad in operatorRegistration.cpp
* Acknowledged PR comments
* Used any_of
* used none_of instead of any_of
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
* create op from ep
* read input count from context
* create holder to host nodes
* fix typo
* cast type before comparison
* throw error on API fail
* silence warning from minimal build
* switch to unique_ptr with deleter to host nodes
* fix typo
* fix build err for minimal
* fix build err for minimal
* add UT for conv
* enable test on CUDA
* add comment
* fix typo
* use gsl::span and string view for Node constructor
* Added two APIs - CopyKernelInfo and ReleaseKernelInfo
* pass gsl::span by value
* switch to span<NodeArg* const> to allow for reference to const containers
* fix typo
* fix reduced build err
* fix reduced build err
* refactoring node construction logic
* rename exceptions
* add input and output count as arguments for op creation
* refactor static member
* use ORT_CATCH instead of catch
* cancel try catch
* add static value name map
* format input definition and set err code
* fix comments
* fix typo
Improve performance of BiasGelu on OneDNN execution provider
This modifies how BiasGelu is handled by the OneDNN execution provider
by executing the gelu_erf primitive as a postop of the binary_add primitive.
Also fixes extra data copies made when running on GPU.
Signed-off-by: George Nash <george.nash@intel.com>
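For readers unfamiliar with the operation, the sketch below shows in plain Python what applying gelu_erf as a post-op of the add means numerically: the activation is applied to each sum as it is produced instead of in a second pass over an intermediate buffer. The function names are hypothetical and this does not use the oneDNN API:

```python
import math


def gelu_erf(x):
    """Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu_unfused(x, bias):
    """Two passes: materialize x + bias, then apply GELU."""
    added = [xi + bi for xi, bi in zip(x, bias)]  # intermediate buffer
    return [gelu_erf(v) for v in added]


def bias_gelu_fused(x, bias):
    """One pass: GELU applied directly to each sum (post-op style)."""
    return [gelu_erf(xi + bi) for xi, bi in zip(x, bias)]
```

Both variants return the same values; the fused form simply avoids writing and re-reading the intermediate sum.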