* move table names to one location
* remove session metadata
* reload trt inputs
* fix posting names
* Update linux-gpu-tensorrt-daily-perf-pipeline.yml for Azure Pipelines
* remove comments
* Split up anubis job and perf run
* add trt environ variables
* No embedded links
* expand model tests name
* skip cpu/cuda for trt when running onnxruntime_test_all
* only run trt ep for c++ unit test
* Update CMAKE_CUDA_ARCHITECTURES for T4
* Use new t4 agent pool
* Update YAML for run T4 on Windows
* revert code
* Update CMAKE_CUDA_ARCHITECTURES
* fix wrong value
* Remove cpu/cuda directly in model tests
* add only CMAKE_CUDA_ARCHITECTURES=75
* remove expanding model test name to see difference
* revert code
* Add fallback execution provider for unit test
* Add fallback execution provider for unit test (cont)
* add conditional to add fackback cuda ep
* Reduction op takes much longer time for TRT 8.2, so we test smaller range of inputs
* use M60
* revert code
* revert code
* add comments
* Modify code and add comment
* modify comment
* update comment
* add comment
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.
Perf number with single thread:
Nokia G10 (baseline / new) in ms Pixel 4 (baseline/new) in ms
mobilenet_edgetpu 220 / 213 18.5 / 17.6
cartoongan 8537 / 8521 967 / 928
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* If the tensor is of gray8 format, we should call the gray8 shader
* other check (which resolves to unknown in this case) is incorrectly being compared to constant and not DXGI_FORMAT
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Add abseil cgmanifest declaration. Update coding standards for InlinedContainers
Adjust coding guidelines. Add default N calculation for InlinedVector<T, N> for general use.
Rename T from InlinedShapeVectorT. Fix Eager build
Add LLVM Copyright with modified derived code notice.
* add clear test name for model tests
* handle remove character
* modify for test
* Modify for correct test name
* Remove test code
* add comments
* make it only on Linux
* change function name
* Convert from wchar_t to char
* Export numpy_T as onnx transpose
* further fixes, test
Co-authored-by: Aishwarya Bhandare <aibhanda@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
* add qdqgroup as input for NodeUnit
* minor update
* hookup nnapi_ep
* minor update
* update compiler setting
* Add a simple UT
* Pipeline change to add build minimal extended with NNAPI for Android
* move GetAllNodeUnits to node_unit.h, add UT for NodeUnits, minor updates
* minor updates
* address CR comments
Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
* improved usability of TVM EP
* moved technical import under a condition related to TVM EP only
* Revert "moved technical import under a condition related to TVM EP only"
* add conditional _ld_preload.py file extension for TVM EP
* improve readability of inserted code
Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain.
TODO, use Symmetric Quantized GEMM in other operators!
TODO address activation buffer overread in custom allocators and tensors supplied by users.
DOT kernel perf test:
Pixel 5a:
Cartoongan 513.539 ms 471.786 ms
Efficient 57.5169 ms 56.4174 ms
Edgetpu 14.6673 ms 13.5959 ms
NEON kernel perf test
Pixel 3a
Cartoongan 1423.53 ms 1069.92 ms
Efficient 114.086 ms 107.968 ms
Edgetpu 39.2632 ms 36.9839 ms
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Add abseil and inlined containers typedefs
Introduce TensorShapeVector for shape building.
Use gsl::span<const T> to make interfaces accept different types of vector like args.
Introduce InineShapeVectorT for shape capacity typed instantiations
Refactor cuda slice along with provider shared interfaces
Refactor Concat, Conv, Pad
Build with Conv Einsum and ConvTranspose refactored.
Remove TesnorShape::GetDimsAsVector()
Refactor SliceIterator and SliceIteratorBase
Refactor broadcast
Refactor Pads for twice as long
Remove memory planner intermediate shapes vector
Refactor orttraining
Fix passing TenshroShapeVector to tests
Remove abseil copy and submodule, use FetchContent_Declare/Fetch
Path with separate command
Make RocmAsyncBuffer accept anything convertible to span. Adjust Linux GPU pipeline.