VitisAI EP Context Model (#20926)
# Why so many commits
- Runtime debugging, which was necessary
- Three different approaches to the EP context model, which resulted in testing back and forth
- Windows compatibility issues, since this development was done on Linux for convenience

# "Open" (?) questions
- Full offloading to a specific EP
- Dumping EP context models by EPs vs [by
ONNXRT](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L725))
- [Node name to pick
nodes](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L654))

# VitisAI EP made three variant implementations, each with its own pros and cons (and of course they can be combined)
## Serialize and cache the list of compute capabilities and the original ONNX model itself
## In `ComputeCapability()`, serialize and cache the backend compilation cache and the related necessary cache info, such as the cache dir and cache key
## In `Compile()`, serialize and cache the backend compilation cache and the related necessary cache info, such as the cache dir and cache key

# EP context model creation
- Precondition
  - The session option `kOrtSessionOptionEpContextEnable` (aka "ep.context_enable") is enabled.
- Approach 1
  - Steps
    1. EP creates an ONNX model whose main graph has EP context nodes (i.e., the node type is "EPContext"); see the sketch after this list.
    2. EP implements/overrides the `IExecutionProvider::GetEpContextNodes()` method.
    3. ONNXRT core creates an EP context model and saves/dumps it.
       - `CreateEpContextModel()` in the file "graph_partitioner.cc"
       - In `get_ep_context_node()`, `Node::Name()` is used to check whether a node is an EP context node. This limits EP context model creation to happening in `IExecutionProvider::Compile()`.
       - The workaround is (1) not implementing `IExecutionProvider::GetEpContextNodes()` and (2) having the EP dump the EP context model itself.
    4. Optionally, EP can also dump the EP context model it created by itself.
  - Examples
    - `QNNExecutionProvider`
    - `VitisAIExecutionProvider`
- Approach 2
  - Steps
    1. EP creates an ONNX model whose main graph has EP context nodes (i.e., the node type is "EPContext").
    2. EP does NOT implement `IExecutionProvider::GetEpContextNodes()` at all.
    3. EP dumps the EP context model it created.
  - Examples
    - `TensorrtExecutionProvider`
      - UPDATE: TRT EP is switching to leveraging `IExecutionProvider::GetEpContextNodes()`
    - `OpenVINOExecutionProvider` (?)
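
For reference, here is a minimal Python sketch of what a single-node EP context model could look like. The "EPContext" op lives in the `com.microsoft` domain; the attribute names (`embed_mode`, `ep_cache_context`, `source`) follow the convention used by existing EPs, while the I/O names and shapes are made-up assumptions for illustration.

```python
import onnx
from onnx import TensorProto, helper

# Build a one-node main graph whose single node has type "EPContext".
ctx_node = helper.make_node(
    "EPContext",
    inputs=["input"],
    outputs=["output"],
    name="VitisAI_EPContext_0",
    domain="com.microsoft",              # domain of EPContext nodes
    embed_mode=1,                        # 1: cache blob embedded in the node
    ep_cache_context="<backend compilation cache bytes or a cache file path>",
    source="VitisAIExecutionProvider",   # which EP produced the cache
)
graph = helper.make_graph(
    [ctx_node],
    "ep_context_main_graph",
    [helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 3, 224, 224])],
    [helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 1000])],
)
ep_ctx_model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(ep_ctx_model, "original_model_ctx.onnx")
```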

# What to cache in EP context nodes
- Non-compilation-based EPs
  - Examples
    - `VitisAIExecutionProvider`
  - Characteristics
    - The heavy-lifting work happens in `IExecutionProvider::GetCapability()`.
  - Preconditions
    - `IExecutionProvider::GetCapability()` is only called once by ONNXRT.
  - Cache content
    - Serialization of a list of `ComputeCapability` (see the sketch after this list)
      - Not EP-specific
      - Serialized using `onnx::FunctionProto`
    - EP-specific cache
- Compilation-based EPs
  - Examples
    - `QNNExecutionProvider`
    - `TensorrtExecutionProvider`
    - `MIGraphXExecutionProvider`
    - `OpenVINOExecutionProvider`
  - Cache content
    - EP-specific cache
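
As a rough illustration of the non-EP-specific part of the cache, the sketch below expresses one claimed node subset (one `ComputeCapability`) as an `onnx.FunctionProto` and round-trips it through serialization. The domain, function name, node, and opset are illustrative assumptions, not the actual VitisAI cache layout.

```python
import onnx
from onnx import helper

# One claimed node subset, expressed as a FunctionProto.
capability_fn = helper.make_function(
    domain="vitisai.cache",
    fname="compute_capability_0",
    inputs=["x"],
    outputs=["y"],
    nodes=[helper.make_node("Relu", ["x"], ["y"], name="claimed_node_0")],
    opset_imports=[helper.make_opsetid("", 17)],
)

# The serialized bytes are what would be stored in the EP context cache ...
blob = capability_fn.SerializeToString()

# ... and parsed back when the cached model is loaded.
restored = onnx.FunctionProto()
restored.ParseFromString(blob)
assert restored.name == "compute_capability_0"
```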

# Requirements
- Offline / AOT compilation of ONNX models with EP context cache
- Compile somewhere, run everywhere
- Pseudo code with brief explanation
  ```
  GenerateCache(original_onnx_file, cache_onnx_file)
    model_buffer = load(original_onnx_file)    --> Load the original ONNX model file
    model_buffer = decrypt(model_buffer)
    session_options = { kOrtSessionOptionEpContextEnable: true,
                        kOrtSessionOptionEpContextFilePath: temp_file }    --> Set the necessary configs
    Ort::CreateSessionFromArray(model_buffer, session_options)    --> The new ONNX model with EP context is created and dumped into the user-specified file "temp_file"
    temp_buffer = encrypt(temp_file)
    write(temp_buffer, cache_onnx_file)    --> Write the encrypted content of "temp_file" into the "cache_onnx_file" file

  InitializeInferenceSession(cache_onnx_file)
    model_buffer = load(cache_onnx_file)    --> Load the ONNX model with EP context from the file generated in the previous step
    model_buffer = decrypt(model_buffer)
    session_options = { }
    Ort::CreateSessionFromArray(model_buffer, session_options)    --> Create and initialize a session with the EP context model
  ```
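  The second half of the pseudo code maps to Python roughly as follows; this is a minimal sketch, assuming a user-supplied `decrypt()` helper (hypothetical, not part of ONNXRT).
  ```python
  import onnxruntime as onnxrt

  def decrypt(buf: bytes) -> bytes:
      # Placeholder: real code would decrypt the buffer here.
      return buf

  def initialize_inference_session(cache_onnx_file: str) -> onnxrt.InferenceSession:
      with open(cache_onnx_file, "rb") as f:
          model_buffer = decrypt(f.read())
      sess_opts = onnxrt.SessionOptions()
      # Passing the serialized model as bytes is the Python counterpart of
      # Ort::CreateSessionFromArray.
      return onnxrt.InferenceSession(model_buffer, sess_opts,
                                     providers=["VitisAIExecutionProvider"])
  ```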
- Python code with comments
  - EP context model creation
    ```python
    import onnxruntime as onnxrt

    # Session options for creating an ONNX model with EP context cache.
    sess_opts = onnxrt.SessionOptions()

    # Verbose.
    sess_opts.log_severity_level = 0

    # This is REQUIRED.
    sess_opts.add_session_config_entry("ep.context_enable", "1")
    # This is OPTIONAL.
    # Either an absolute path (preferred for now) or a relative path (WIP) is okay.
    # sess_opts.add_session_config_entry("ep.context_file_path", "/some/path/to/original_model_ctx.onnx")
    # This is OPTIONAL.
    sess_opts.add_session_config_entry("ep.context_embed_mode", "1")

    orig_model_location = "/some/path/to/original_model.onnx"
    sess = onnxrt.InferenceSession(orig_model_location, sess_opts,
                                   providers=["VitisAIExecutionProvider"], provider_options=[])
    ```
  - Inference run with an EP context model
    ```python
    import onnxruntime as onnxrt

    # Session options for running inference with an EP context model.
    sess_opts = onnxrt.SessionOptions()

    # Default EP context model path.
    # ep_ctx_model_location = "/some/path/to/original_model.onnx_ctx.onnx"
    # User-configured EP context model path.
    ep_ctx_model_location = "/some/path/to/original_model_ctx.onnx"
    sess = onnxrt.InferenceSession(ep_ctx_model_location, sess_opts,
                                   providers=["VitisAIExecutionProvider"], provider_options=[])

    # Populate with the model's actual input tensors before running.
    model_inputs = {}
    run_opts = onnxrt.RunOptions()
    # Verbose.
    run_opts.log_severity_level = 1
    sess.run(None, model_inputs, run_opts)
    ```
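    Since `model_inputs` above is left empty, below is a small hedged helper for building dummy feeds from the session's declared inputs (an assumption-laden sketch: all inputs are treated as float32, and dynamic dimensions are set to 1).
    ```python
    import numpy as np

    def make_dummy_inputs(sess):
        # Replace symbolic/dynamic dimensions with 1 and fill with random floats.
        feeds = {}
        for inp in sess.get_inputs():
            shape = [d if isinstance(d, int) else 1 for d in inp.shape]
            feeds[inp.name] = np.random.rand(*shape).astype(np.float32)
        return feeds

    # model_inputs = make_dummy_inputs(sess)
    ```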

---------

Co-authored-by: Glen Cao <glen@Glens-MacBook-Air.local>