* Initial draft
* updates per review
* fix link
* plus one more link fix
* small changes to the optimizer documentation
* some more changes
* done
* update C_API with doc link
* Introduce execution mode for clarity and extensibility; Change Python APIs accordingly; Replace DisableSequentialExecution API with EnableParallelExecution for clarity.
* Fix cuda build
* Modify the test slightly
* Make C and C# APIs consistent with Python.
Description: Refine threading control options and move inter op thread pool to session state.
Added thread_utils.h/cc to centralize the decision around the thread pool size under various conditions.
Motivation and Context
Currently the thread pool size of the parallel executor is hardcoded to 32 for some reason. This PR makes the options to configure the thread pool sizes clearer.
* Mention OrtCreateSessionFromArray in C API doc
* Update perf tool documentation to reflect the new graph optimization enums. Relax constraint for enable_all.
* Update one more doc
* Update onnx test runner documentation
* Add default in the docs