onnxruntime/include
edgchen1 6c7da5e9d3
Optimize CUDA Sum op kernel and refactor CUDA elementwise variadic input op kernels (#4418)
For the special case where all variadic inputs of a kernel are the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation (which was previously used for this special case) handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
2020-07-10 10:20:23 -07:00
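
The two strategies described in the commit message can be illustrated with a small CPU-side sketch. This is not the code from the commit: the function names `SumPairwise`, `SumSinglePass`, and `Sum`, and the threshold `kMaxSinglePassInputs`, are hypothetical, and the real kernels run on the GPU and handle broadcasting in the general path. Under those assumptions, the sketch only shows why the same-shape case can be done in one pass over the output instead of one pass per input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;  // flat buffer; shape handling omitted in this sketch

// Hypothetical limit standing in for "few enough" inputs; the real value is not given here.
constexpr std::size_t kMaxSinglePassInputs = 8;

// General path (sketch): fold the inputs pairwise. Each iteration of the outer
// loop stands in for one binary kernel launch that reads and rewrites the output.
// Broadcasting, which the real general path supports, is left out.
Tensor SumPairwise(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  Tensor out = inputs[0];
  for (std::size_t k = 1; k < inputs.size(); ++k) {
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] += inputs[k][i];
    }
  }
  return out;
}

// Special case (sketch): all inputs have the same shape, so each output element
// is computed once from every input. This stands in for a single kernel launch.
Tensor SumSinglePass(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  const std::size_t n = inputs[0].size();
  Tensor out(n);
  for (std::size_t i = 0; i < n; ++i) {
    float acc = inputs[0][i];
    for (std::size_t k = 1; k < inputs.size(); ++k) {
      acc += inputs[k][i];
    }
    out[i] = acc;
  }
  return out;
}

// Dispatch (sketch): take the single-pass path only when no broadcasting is
// needed and the input count is within the hypothetical limit.
Tensor Sum(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  bool same_shape = true;
  for (const Tensor& t : inputs) {
    if (t.size() != inputs[0].size()) same_shape = false;
  }
  if (same_shape && inputs.size() <= kMaxSinglePassInputs) {
    return SumSinglePass(inputs);
  }
  return SumPairwise(inputs);  // the real general path would also broadcast
}
```

In this sketch both paths add the inputs in the same left-to-right order, so they produce identical results; the difference is only in how many times the output buffer is traversed.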
onnxruntime/core Optimize CUDA Sum op kernel and refactor CUDA elementwise variadic input op kernels (#4418) 2020-07-10 10:20:23 -07:00