* Initial check-in of native C API tests
* Minor update
* Updated with OrtCreateCpuAllocatorInfo working after including cpu_provider_factory.h
* Minor edit
* Minor update
* Make the random generator continue generating random numbers
* update with reviewer's comments
* update with reviewer's comments, remove an unnecessary change
* Make the random generator continue generating random numbers, update with reviewer's comments
* Optimize Pad performance by flattening the innermost axis that has no padding. This significantly reduces the total number of memcpy calls, since memcpy usually happens only for the innermost axis.
For example, a shape of [1,224,224,3] with padding [0,3,3,0,0,3,3,0] can be flattened to [1,224,672] with padding [0,3,9,0,3,9].
With this fix, Pad performance improves by more than 7x for the above example.
* Fix typo in comments of pad performance optimization
* Pass dims as const reference instead of value.
* Fix Linux GPU warning
* Move dim check to Init.
* Adding initial props file updates to support native projects
* remove unnecessary header files
* removed double backslashes
* only include c api header, drop cxx api
* Remove copying of test models
* Update cast kernel to support to/from string
* Update namespace
* Add support for literal numeric case
* Update to support -INF test
* Update kernel registration for cast
* Update ONNX to 1.4.1
* Update registry API
* Resolve some comments
* Update cast kernel implementation
* Resolve comments
* Fixed test data in onnx
* Update cast kernel implementation
* Resolve PR comments
* Update cast_op.cc
* Update onnx commits info
* Update comments
* Move build dependencies such as setuptools, wheel, and numpy into the Docker image so they are not reinstalled on every Docker build
* revert the changes in install_deps.sh
* Enable USE_MKLML_FOR_BLAS
* add mklml include directory for onnxruntime_provider and onnxruntime_provider_cuda
* add mklml_include_dir to include_directories
* try removing the --version-script
* remove --no-undefined flag
* remove the -rpath linker flag
* remove the -rpath linker flag, including the -Wl
* remove the --whole-archive flags
* added -all_load -noall_load flags in place of --whole-archive and --no-whole-archive
* Correct the spelling of all-load
* set the MacOS specific cmake configs with if(APPLE) condition
* added --build_shared_lib to mac CI
* Correct the Consts::Zero & Consts::One for half type
* 1. Fix CreateConstantOnes for the float16 type
2. Add CUDA kernel code in BatchNorm for the float16 type, since there is an issue running cudnnBatchNormalizationForwardInference with the float16 type
3. Add float16 test cases for the Gemm and BatchNorm CUDA kernels only
* Fix build
* fix Linux build
* fix build
* Update the fix for BatchNorm to still use the cuDNN API cudnnBatchNormalizationForwardInference. The root cause is that, for the half type, alpha, beta, scale, B, mean, and var should use the float type.
* fix build
* enable 2 fp16 models for GPU test
* enable fp16 test for MaxPool
* Need to adjust per_sample_tolerance configuration in the model test
* Create a project for graph optimizer.
Move optimizer related code to the folder optimizer.
* Fix build failures.
* rebase and fix build failures.
* fix build failure.
* fix build failure with cuda path.
* fix python build failure.
* Move two transformers (memcpy and insert_cast) from framework to optimizer.
* rebase.
* SessionState should not depend on optimizer.
* Copy input tensors
* Check that default CPU execution provider is registered successfully
* Insert Memcpy only when an input is connected to both provider and non-provider nodes.
* Add Recurse method to GraphTransformer.
Move GraphTransformer::Apply to ApplyImpl and make private.
Add non-virtual GraphTransformer::Apply method to handle calling Graph::Resolve in a more consistent manner.
Create MemcpyTransformer GraphTransformer to handle memcpy operations on subgraphs in a more standard way.
* Checkpoint
* Make the subgraph insert less verbose
* Add graph nesting level to transformer ApplyImpl
Tweak cast transformer to recurse nicely and avoid unnecessary Resolve calls by splitting out the duplicate removal into a separate transformer.
Decouple memcpy transformer from ExecutionProviders and minimise what's in the header.
* Recurse into subgraphs inside GraphPartitioner
* Update a couple of new transformers
* Check Recurse return value.
* Cleanup some memory management in inference session by moving some things into SessionState
* Add deleted flag to rewrite rules so we stop processing nodes that are removed.
Remove some (most likely) unnecessary Resolve calls. As we always call Resolve for a graph modified by a transformer there's generally no need for the transformer to do it.
* Minor cleanups.
* Add some extra usage information to the comments in GraphTransformer.
* Address PR comments
* Type and shape inference for the QuantizeLinear and DequantizeLinear ops
* removing redundant type checking for some inputs and outputs
* remove unnecessary type check from type inference
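
The shape arithmetic behind the Pad optimization entry above can be illustrated with a minimal Python sketch. This is not the actual kernel code: the function name `flatten_inner_unpadded_axes` is hypothetical, the pads are assumed to be laid out as all begin values followed by all end values (as in the example in that entry), and merging more than one trailing unpadded axis is a generalization assumed here rather than something the log states.

```python
def flatten_inner_unpadded_axes(shape, pads):
    """Merge trailing axes that have no padding into the axis before them.

    Hypothetical helper for illustration only. `pads` is assumed to be
    [begin_0, ..., begin_{n-1}, end_0, ..., end_{n-1}].
    """
    rank = len(shape)
    if len(pads) != 2 * rank:
        raise ValueError("pads must contain 2 * rank entries")

    dims = list(shape)
    begins = list(pads[:rank])
    ends = list(pads[rank:])

    # While the innermost axis is unpadded, fold it into its neighbour.
    # The neighbour's size and pad amounts are scaled by the folded size,
    # so each padded "row" becomes one longer contiguous run.
    while len(dims) > 1 and begins[-1] == 0 and ends[-1] == 0:
        inner = dims.pop()
        begins.pop()
        ends.pop()
        dims[-1] *= inner
        begins[-1] *= inner
        ends[-1] *= inner

    return dims, begins + ends


new_shape, new_pads = flatten_inner_unpadded_axes(
    [1, 224, 224, 3], [0, 3, 3, 0, 0, 3, 3, 0]
)
print(new_shape, new_pads)  # [1, 224, 672] [0, 3, 9, 0, 3, 9]
```

With the flattened shape, each copy along the innermost axis moves 672 contiguous elements instead of 3, which is consistent with the reduction in memcpy calls described in the entry.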