* cleaned up the additional header in C-api
* ensure test failure surfaces in the build pipeline
* sanitized runtest.bat
* cleanup unneeded headers
* formatting and typos
* support non-tensor types
* support non-tensor types.
* support non-tensor types.
* fix compilation issues
* fix compilation issues
* Build without mkldnn for release packages. We'll default to MLAS.
* Remove tvm as well
* Add openmp
* Add check for Linux version supporting glibc 2.23 or higher
* Refactor the libc check to SessionOptions
* removed whitespace
* Update SessionOptions.cs
* Various optimizations to reduce the setup and execution cost.
Cache information about the feeds and fetches, and any device copies required to execute the graph so we minimize checking for later calls to ExecuteGraph using the same input/output.
- enable use of caching in Loop and Scan
- make use of caching optional for InferenceSession::Run
- handle calls to Run with different feeds and fetches to support scenarios where there may be a truncated sequence in some calls
Take the feed names and MLValue instances as vectors so the order is deterministic.
Add unit tests
Update onnxruntime_perf_test to enable caching.
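
The caching idea above can be sketched roughly as follows. This is a minimal illustration, not the actual FeedsFetchesManager code: the `FeedsFetchesCache` class, the `resolve_indices` helper, and the `misses` counter are all hypothetical names introduced here. It assumes the cache key is the ordered tuple of feed names plus the ordered tuple of fetch names, so a call with different feeds or fetches simply creates a second entry.

```python
class FeedsFetchesCache:
    """Hypothetical cache keyed by the ordered feed and fetch names."""

    def __init__(self):
        self._entries = {}
        self.misses = 0  # how many times the expensive lookup actually ran

    def get_or_create(self, feed_names, fetch_names, build_info):
        # Names are taken as ordered sequences so the key is deterministic.
        key = (tuple(feed_names), tuple(fetch_names))
        info = self._entries.get(key)
        if info is None:
            self.misses += 1
            info = build_info(feed_names, fetch_names)
            self._entries[key] = info
        return info


def resolve_indices(feed_names, fetch_names):
    """Stand-in for the per-call checks (name lookup, device copy planning)."""
    return {
        "feed_index": {name: i for i, name in enumerate(feed_names)},
        "fetch_index": {name: i for i, name in enumerate(fetch_names)},
    }


cache = FeedsFetchesCache()
first = cache.get_or_create(["X", "H"], ["Y"], resolve_indices)
second = cache.get_or_create(["X", "H"], ["Y"], resolve_indices)  # cache hit
other = cache.get_or_create(["X"], ["Y"], resolve_indices)        # new entry
```

In this sketch a repeated call with the same names reuses the stored entry, while a call with a different set of feeds gets its own entry and does not disturb the first.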
* Couple of tweaks.
Fix shared library unit test failure.
Attempt to work around the MacOS build failure caused by a VC++ bug around automatically capturing enclosing-scope values in a lambda.
* Rework order of init in Run so we get nice error messages about invalid feed/output names.
* Refine logic around copying MLValue using execution provider so common code can be used. Simplify the logic due to this change.
Split the paths for executing with/without cached info so we can be more const correct with how FeedsFetchesManager is passed in. This makes it clearer when a shared instance can be used due to it being const.
Cache the FeedsFetchesManager instances in the control flow nodes. They can be re-used across calls to Compute.
* Removed unused local variable to fix some builds.
* Fix build issue by cleaning up some more unused params.
* Check names when using cache entry from SessionState. Add unit test.
* Fix an issue for Reduce Max, Min. Per the cuDNN documentation, only the Max/Min ops require the indices output; an error is reported if indices are requested for the other reduction ops.
* support non-tensor types
* support non-tensor types.
* support non-tensor types.
* fix compilation issues
* fix compilation issues
* fix compilation issues
* add test cases
* test cases
* add test cases
* try to fix string test case
* working now
* use allocator (broken)
* string test broken after using allocator
* full working example
* Fix PR comments
* Initial check-in of Native Capi tests
* Minor update
* Updated with OrtCreateCpuAllocatorInfo working after including cpu_provider_factory.h
* Minor edit
* Minor update
* random generator to continue generating random numbers
* update with reviewer's comments
* update with reviewer's comments, remove an unnecessary change
* random generator to continue generating random numbers, update with reviewer's comments
* Optimize Pad performance by flattening the innermost axis that has no padding. This significantly reduces the total number of memcpy calls, since memcpy usually only happens for the innermost axis.
For example, a shape of [1,224,224,3] with padding [0,3,3,0,0,3,3,0] can be flattened to [1,224,672] with padding [0,3,9,0,3,9].
With this fix, Pad performance improves by more than 7x for the example above.
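
The flattening step can be illustrated with a short sketch. It is not the kernel code from this change: the function name `flatten_innermost_unpadded_axes` is hypothetical. It assumes the ONNX pads layout (all begin values followed by all end values) and non-negative pads, and it keeps merging while the last axis is unpadded, which may be more general than the actual change.

```python
def flatten_innermost_unpadded_axes(shape, pads):
    """Merge trailing unpadded axes into the axis before them.

    `pads` is assumed to use the ONNX layout: begin values for every axis,
    followed by end values for every axis.
    """
    rank = len(shape)
    dims = list(shape)
    begins, ends = list(pads[:rank]), list(pads[rank:])

    # While the last axis has no padding, fold it into its neighbour.
    # The neighbour's extent and pads are scaled by the folded axis size.
    while len(dims) > 1 and begins[-1] == 0 and ends[-1] == 0:
        inner = dims.pop()
        begins.pop()
        ends.pop()
        dims[-1] *= inner
        begins[-1] *= inner
        ends[-1] *= inner

    return dims, begins + ends


new_shape, new_pads = flatten_innermost_unpadded_axes(
    [1, 224, 224, 3], [0, 3, 3, 0, 0, 3, 3, 0]
)
print(new_shape, new_pads)  # [1, 224, 672] [0, 3, 9, 0, 3, 9]
```

After flattening, each contiguous copy along the innermost axis covers 672 elements instead of 3, which is where the reduction in memcpy calls comes from.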
* Fix typo in comments of pad performance optimization
* Pass dims as const reference instead of value.
* Fix Linux GPU warning
* Move dim check to Init.
* Adding initial props file updates to support native projects
* remove unnecessary header files
* removed double backslashes
* only include c api header, drop cxx api
* Remove copying of test models
* Update cast kernel to support to/from string
* Update namespace
* Add support for literal numeric case
* Update to support -INF test
* Update kernel registration for cast
* Update ONNX to 1.4.1
* Update registry api
* Resolve some comments
* Update cast kernel implementation
* Resolve comments
* Fixed test data in onnx
* Update cast kernel implementation
* Resolve PR comments
* Update cast_op.cc
* Update onnx commits info
* Update comments
* Move build dependencies such as setuptools, wheel, and numpy into the docker image so they are not reinstalled on every docker build
* revert the changes in install_deps.sh
* Enable USE_MKLML_FOR_BLAS
* add mklml include directory for onnxruntime_provider and onnxruntime_provider_cuda
* add mklml_include_dir to include_directories
* try removing the --version-script
* remove --no-undefined flag
* remove the -rpath linker flag
* remove the -rpath linker flag, including the -Wl
* remove the --whole-archive flags
* added -all_load -noall_load flags in place of --whole-archive and --no-whole-archive
* correct the spelling of all-load
* set the MacOS specific cmake configs with if(APPLE) condition
* added --build_shared_lib to mac CI
* Correct the Consts::Zero & Consts::One for half type
* 1. Fix CreateConstantOnes for the float16 type
2. Add CUDA kernel code in BatchNorm for the float16 type; there is an issue running cudnnBatchNormalizationForwardInference with the float16 type
3. Add float16 test cases for the Gemm & BatchNorm CUDA kernels only
* Fix build
* fix Linux build
* fix build
* Update the fix for BatchNorm to still use the cuDNN API cudnnBatchNormalizationForwardInference. The root cause is that, for the half type, alpha, beta, scale, B, mean, and var should be of float type.
* fix build
* enable 2 fp16 models for GPU test
* enable fp16 test for MaxPool
* Need to adjust per_sample_tolerance configuration in the model test