Add Xamarin support to the ORT nuget packages.
- Update C# code to support Xamarin builds for iOS and Android
- refactor some things to split out common code
- include iOS and Android ORT native shared library in native nuget package
* add p50 in test
* support opset-13 of softmax
* update operators.md
* resolve comments
* fix lint and format
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
* POWER: Add Dgemm kernel for POWER processor
This patch adds new dgemm kernel specific to POWER processor.
* POWER: Restrict new functions to VSX in header
* Remove warning check in header
* POWER: Dgemm Adjust indentation
Fixing indentation based on review comments.
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* Using cost model's thread count rather than max number of threads when
parallelizing tasks.
* According to perf test results, decrease parallelism across channels.
* Parallelizing across channels for qavg_pool shows no benefit on several models, so remove it.
* Revert "Using cost model's thread count rather than max number of threads when"
This reverts commit 5fa47cd5b5ddbaa4e5ef97ccbc53200324379544.
* optimize python overhead of _post_amp_backward
* overwrite apex amp's zero_grad for faster implementation
* move unscale_fp16_grads_into_fp32_grads into C++ impl
* improve the efficiency further, reducing 3.5ms to 1.7ms for unilm.
* unilm 1.7ms to 338us: 1) optimize python list <==> std::vector copy, 2) launch the kernels as soon as num_elem reaches the threshold. This helps reduce the CUDA idle time.
* refine the logic a bit after validating
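
A rough sketch of the "launch as soon as num_elem reaches the threshold" idea, in plain C++ rather than the actual CUDA code. `LaunchInChunks`, its parameters, and the `launch` callback are hypothetical names for illustration only and are not taken from the implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: walks a list of per-tensor element counts and calls
// `launch` as soon as the pending element count reaches `threshold`, instead
// of collecting every tensor first and launching once at the end.
// Returns the number of launches performed.
inline std::size_t LaunchInChunks(
    const std::vector<std::size_t>& tensor_sizes,
    std::size_t threshold,
    const std::function<void(const std::vector<std::size_t>&)>& launch) {
  assert(threshold > 0);
  std::vector<std::size_t> pending;  // sizes of tensors waiting for a launch
  std::size_t pending_elems = 0;     // total elements across `pending`
  std::size_t launches = 0;

  for (std::size_t n : tensor_sizes) {
    pending.push_back(n);
    pending_elems += n;
    if (pending_elems >= threshold) {
      launch(pending);  // in a real implementation this would enqueue a kernel
      ++launches;
      pending.clear();
      pending_elems = 0;
    }
  }
  if (!pending.empty()) {  // flush whatever is left below the threshold
    launch(pending);
    ++launches;
  }
  return launches;
}
```

Launching early lets the device start working while the host is still walking the remaining tensors, which is presumably where the reduced idle time comes from.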
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
"core/graph/function.h" appears twice:
- `include/onnxruntime/core/graph/function.h`
- `onnxruntime/core/graph/function.h` --> This one is redundant and not used anywhere
Support for device function pointers is not yet available for ROCm.
Instead, the device function pointers were converted to device functors.
Case statements, lambdas, and macros are used for dispatch; as a result,
all combinations of kernels are compiled with inlined functors. The
basis of this approach can be found in PyTorch.
Lastly, hipify and register Resize and Upsample for ROCm EP.
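
A minimal host-side sketch of the functor dispatch pattern described above, written in plain C++ rather than HIP. The enum, functor names, and `DispatchMapCoordinates` are hypothetical and only illustrate the shape of the approach; the two formulas follow the style of the ONNX Resize coordinate transformation modes:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical runtime selector for the per-element function.
enum class CoordMode { kHalfPixel, kAsymmetric };

// Functors replace function pointers: each one is a distinct type, so the
// call inside the templated "kernel" can be resolved and inlined at compile time.
struct HalfPixel {
  float operator()(float x_resized, float scale) const {
    return (x_resized + 0.5f) / scale - 0.5f;
  }
};

struct Asymmetric {
  float operator()(float x_resized, float scale) const {
    return x_resized / scale;
  }
};

// Stand-in for a device kernel, templated on the functor type.
// One instantiation is compiled per functor.
template <typename TransformFn>
void MapCoordinates(const TransformFn& fn, float scale, std::vector<float>& out) {
  assert(scale > 0.0f);
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = fn(static_cast<float>(i), scale);
  }
}

// The case statement maps the runtime mode to a compile-time functor type.
inline void DispatchMapCoordinates(CoordMode mode, float scale, std::vector<float>& out) {
  switch (mode) {
    case CoordMode::kHalfPixel:
      MapCoordinates(HalfPixel{}, scale, out);
      break;
    case CoordMode::kAsymmetric:
      MapCoordinates(Asymmetric{}, scale, out);
      break;
  }
}
```

The trade-off is the one noted above: every mode gets its own instantiation, so compile time and binary size grow with the number of combinations.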
The dnnl_binary ops need the memory format to match the format expected by
Onnxruntime. If the memory formats of the inputs do not match each other,
there will be an error in the calculated results.
Additionally, since the code manually pads the tensor dimensions for broadcasting,
the inputs are expected to be in Onnxruntime's format.
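
To illustrate the padding step, here is a small sketch assuming the usual convention of prepending 1s to the lower-rank shape. `PadDimsForBroadcast` is a hypothetical name and not the function used in the dnnl execution provider:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: left-pads `dims` with 1s until it has `rank` entries,
// e.g. {3, 4} with rank 4 becomes {1, 1, 3, 4}. The padded dims only describe
// the data correctly if the memory is laid out in plain dimension order,
// which is why the inputs need to be in the expected format first.
inline std::vector<std::int64_t> PadDimsForBroadcast(
    const std::vector<std::int64_t>& dims, std::size_t rank) {
  assert(dims.size() <= rank);
  std::vector<std::int64_t> padded(rank - dims.size(), 1);  // leading 1s
  padded.insert(padded.end(), dims.begin(), dims.end());    // original dims
  return padded;
}
```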
Since detecting and reordering the memory to Ort format matches what was previously
done for the Reshape op, the code was moved from dnnl_reshape to
dnnl_subgraph_primitive under the name GetMemoryInOrtFormat.
One small additional change was made to the capability code log so that it also prints the
percentage of nodes run by the dnnl execution provider.
Signed-off-by: George Nash <george.nash@intel.com>