onnxruntime/docs/python/notebooks/onnxruntime-nuphar-tutorial.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ONNX Runtime: Tutorial for Nuphar execution provider\n",
    "**Accelerating model inference via compiler, using Docker Images for ONNX Runtime with Nuphar**\n",
    "\n",
    "This example shows how to accelerate model inference using Nuphar, an execution provider that leverages just-in-time compilation to generate optimized executables.\n",
    "\n",
    "For more background about Nuphar, please check [Nuphar-ExecutionProvider.md](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/Nuphar-ExecutionProvider.md) and its [build instructions](https://github.com/microsoft/onnxruntime/blob/master/BUILD.md#nuphar).\n",
    "\n",
    "#### Tutorial Roadmap:\n",
    "0. Prerequistes\n",
    "1. Create and run inference on a simple ONNX model, and understand how ***compilation*** works in Nuphar.\n",
    "2. Create and run inference on a model using ***LSTM***, run symbolic shape inference, edit LSTM ops to Scan, and check Nuphar speedup.\n",
    "3. ***Quantize*** the LSTM model and check speedup in Nuphar (CPU with AVX2 support is required).\n",
    "4. Working on a real model: ***Bidirectional Attention Flow ([BiDAF](https://arxiv.org/pdf/1611.01603))*** from onnx model zoo.\n",
    "5. ***Ahead-Of-Time (AOT) compilation*** to save just-in-time compilation cost on model load.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Prerequistes\n",
    "Please make sure you have installed following Python packages. Besides, C++ compiler/linker is required for ahead-of-time compilation. Please make sure you have g++ if running on Linux, or Visual Studio 2017 on Windows.\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cpufeature\n",
    "import numpy as np\n",
    "import onnx\n",
    "from onnx import helper, numpy_helper\n",
    "import os\n",
    "from timeit import default_timer as timer\n",
    "import shutil\n",
    "import subprocess\n",
    "import sys\n",
    "import tarfile\n",
    "import urllib.request\n",
    "def is_windows():\n",
    "  return sys.platform.startswith('win')\n",
    "if is_windows():\n",
    "  assert shutil.which('cl.exe'), 'Please make sure MSVC compiler and liner are in PATH.'\n",
    "else:\n",
    "  assert shutil.which('g++'), 'Please make sure g++ is installed.'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And Nuphar package in onnxruntime is required too. Please make sure you are using Nuphar enabled build."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import onnxruntime\n",
    "from onnxruntime.nuphar.model_editor import convert_to_scan_model\n",
    "from onnxruntime.nuphar.model_quantizer import convert_matmul_model\n",
    "from onnxruntime.nuphar.rnn_benchmark import generate_model\n",
    "from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Create and run inference on a simple ONNX model\n",
    "Let's start with a simple model: Y = ((X + X) * X + X) * X + X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = onnx.ModelProto()\n",
    "opset = model.opset_import.add()\n",
    "opset.domain == 'onnx'\n",
    "opset.version = 7 # ONNX opset 7 is required for LSTM op later\n",
    "\n",
    "graph = model.graph\n",
    "X = 'input'\n",
    "Y = 'output'\n",
    "\n",
    "# declare graph input/ouput with shape [seq, batch, 1024]\n",
    "dim = 1024\n",
    "model.graph.input.add().CopyFrom(helper.make_tensor_value_info(X, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "model.graph.output.add().CopyFrom(helper.make_tensor_value_info(Y, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "\n",
    "# create nodes: Y = ((X + X) * X + X) * X + X\n",
    "num_nodes = 5\n",
    "for i in range(num_nodes):\n",
    "  n = helper.make_node('Mul' if i % 2 else 'Add',\n",
    "                       [X, X if i == 0 else 'out_'+str(i-1)],\n",
    "                       ['out_'+str(i) if i < num_nodes - 1 else Y],\n",
    "                       'node'+str(i))\n",
    "  model.graph.node.add().CopyFrom(n)\n",
    "\n",
    "# save the model\n",
    "simple_model_name = 'simple.onnx'\n",
    "onnx.save(model, simple_model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use nuphar execution provider to run the inference for the model that we created above, and use settings string to check the generated code.\n",
    "\n",
    "Because of the redirection of output, we dump the lowered code from a subprocess to a log file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "code_to_run = '''\n",
    "import onnxruntime\n",
    "s = 'codegen_dump_lower:verbose'\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(s)\n",
    "sess = onnxruntime.InferenceSession('simple.onnx')\n",
    "'''\n",
    "\n",
    "log_file = 'simple_lower.log' \n",
    "with open(log_file, \"w\") as f:\n",
    "  subprocess.run([sys.executable, '-c', code_to_run], stdout=f, stderr=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lowered log is similar to C source code, but the whole file is lengthy to show here. Let's just check the last few lines that are most important:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['produce node4 {\\n',\n",
       " '  for (ax0, 0, seq) {\\n',\n",
       " '    for (ax1, 0, batch) {\\n',\n",
       " '      for (ax2.outer, 0, 64) {\\n',\n",
       " '        node4[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] = (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)])))))\\n',\n",
       " '      }\\n',\n",
       " '    }\\n',\n",
       " '  }\\n',\n",
       " '}\\n',\n",
       " '\\n']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(log_file) as f:\n",
    "    log_lines = f.readlines()\n",
    "\n",
    "log_lines[-10:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compiled code showed that the nodes of Add/Mul were fused into a single function, and vectorization was applied in the loop. The fusion was automatically done by the compiler in the Nuphar execution provider, and did not require any manual model editing.\n",
    "\n",
    "Next, let's run inference on the model and compare the accuracy and performance with numpy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'onnxruntime: 0.315 seconds, numpy: 0.728 seconds'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "seq = 128\n",
    "batch = 16\n",
    "input_data = np.random.rand(seq, batch, dim).astype(np.float32)\n",
    "sess = onnxruntime.InferenceSession(simple_model_name)\n",
    "feed = {X:input_data}\n",
    "output = sess.run([], feed)\n",
    "np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "assert np.allclose(output[0], np_output)\n",
    "\n",
    "repeats = 100\n",
    "start_ort = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess.run([], feed)\n",
    "end_ort = timer()\n",
    "start_np = timer()\n",
    "for i in range(repeats):\n",
    "    np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "end_np = timer()\n",
    "'onnxruntime: {0:.3f} seconds, numpy: {1:.3f} seconds'.format(end_ort - start_ort, end_np - start_np)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 2. Create and run inference on a model using LSTM\n",
    "Now, let's take one step further to work on a 4-layer LSTM model, created from onnxruntime.nuphar.rnn_benchmark module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "lstm_model = 'LSTMx4.onnx'\n",
    "input_dim = 256\n",
    "hidden_dim = 1024\n",
    "generate_model('lstm', input_dim, hidden_dim, bidirectional=False, layers=4, model_name=lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**IMPORTANT**: Nuphar generates code before knowing shapes of input data, unlike other execution providers that do runtime shape inference. Thus, shape inference information is critical for compiler optimizations in Nuphar. To do that, we run symbolic shape inference on the model. Symbolic shape inference is based on the ONNX shape inference, and enhanced by sympy to better handle Shape/ConstantOfShape/etc. ops using symbolic computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "SymbolicShapeInference.infer_shapes(input_model=lstm_model, output_model=lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's check baseline performance on the generated model, using CPU execution provider."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_baseline = onnxruntime.InferenceSession(lstm_model)\n",
    "sess_baseline.set_providers(['CPUExecutionProvider']) # default provider in this container is Nuphar, this overrides to CPU EP\n",
    "seq = 128\n",
    "input_data = np.random.rand(seq, 1, input_dim).astype(np.float32)\n",
    "feed = {sess_baseline.get_inputs()[0].name:input_data}\n",
    "output = sess_baseline.run([], feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run RNN models in Nuphar execution provider efficiently, LSTM/GRU/RNN ops need to be converted to Scan ops. This is because Scan is more flexible, and supports quantized RNNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "scan_model = 'Scan_LSTMx4.onnx'\n",
    "convert_to_scan_model(lstm_model, scan_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After conversion, let's compare performance and accuracy with baseline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'nuphar: 2.899 seconds, baseline: 2.911 seconds'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sess_nuphar = onnxruntime.InferenceSession(scan_model)\n",
    "output_nuphar = sess_nuphar.run([], feed)\n",
    "assert np.allclose(output[0], output_nuphar[0])\n",
    "\n",
    "repeats = 10\n",
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_nuphar.run([], feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "'nuphar: {0:.3f} seconds, baseline: {1:.3f} seconds'.format(end_nuphar - start_nuphar, end_baseline - start_baseline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Quantize the LSTM model\n",
    "Let's get more speed-ups from Nuphar by quantizing the floating point GEMM/GEMV in LSTM model to int8 GEMM/GEMV.\n",
    "\n",
    "**NOTE:** For inference speed of quantizated model, a CPU with AVX2 instructions is preferred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cpufeature.CPUFeature['AVX2'] or 'No AVX2, quantization model might be slow'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use onnxruntime.nuphar.model_quantizer to quantize floating point GEMM/GEMVs. Assuming GEMM/GEMV takes form of input * weights, weights are statically quantized per-column, and inputs are dynamically quantized per-row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantized_model = 'Scan_LSTMx4_int8.onnx'\n",
    "convert_matmul_model(scan_model, quantized_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now run the quantized model, and check accuracy. Please note that quantization may cause accuracy loss, so we relax the comparison threshold a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_quantized = onnxruntime.InferenceSession(quantized_model)\n",
    "output_quantized = sess_quantized.run([], feed)\n",
    "assert np.allclose(output[0], output_quantized[0], rtol=1e-3, atol=1e-3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now check quantized model performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'quantized: 0.768 seconds, non-quantized: 2.899 seconds'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_quantized = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_quantized.run([], feed)\n",
    "end_quantized = timer()\n",
    "\n",
    "'quantized: {0:.3f} seconds, non-quantized: {1:.3f} seconds'.format(end_quantized - start_quantized, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Working on a real model: Bidirectional Attention Flow (BiDAF)\n",
    "BiDAF is a machine comprehension model that uses LSTMs. The inputs to this model are paragraphs of contexts and queries, and the outputs are start/end indices of words in the contexts that answers the queries.\n",
    "\n",
    "First let's download the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download BiDAF model\n",
    "cwd = os.getcwd()\n",
    "bidaf_url = 'https://onnxzoo.blob.core.windows.net/models/opset_9/bidaf/bidaf.tar.gz'\n",
    "bidaf_local = os.path.join(cwd, 'bidaf.tar.gz')\n",
    "if not os.path.exists(bidaf_local):\n",
    "  urllib.request.urlretrieve(bidaf_url, bidaf_local)\n",
    "with tarfile.open(bidaf_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check the performance of the CPU provider:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "bidaf = os.path.join(cwd, 'bidaf', 'bidaf.onnx')\n",
    "sess_baseline = onnxruntime.InferenceSession(bidaf)\n",
    "sess_baseline.set_providers(['CPUExecutionProvider'])\n",
    "# load test data\n",
    "test_data_dir = os.path.join(cwd, 'bidaf', 'test_data_set_3')\n",
    "tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]\n",
    "feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}\n",
    "output_baseline = sess_baseline.run([], feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The context in this test data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"with 4:51 left in regulation , carolina got the ball on their own 24 - yard line with a chance to mount a game - winning drive , and soon faced 3rd - and - 9 . on the next play , miller stripped the ball away from newton , and after several players dove for it , it took a long bounce backwards and was recovered by ward , who returned it five yards to the panthers 4 - yard line . although several players dove into the pile to attempt to recover it , newton did not and his lack of aggression later earned him heavy criticism . meanwhile , denver  ' s offense was kept out of the end zone for three plays , but a holding penalty on cornerback josh norman gave the broncos a new set of downs . then anderson scored on a 2 - yard touchdown run and manning completed a pass to bennie fowler for a 2 - point conversion , giving denver a 24 – 10 lead with 3:08 left and essentially putting the game away . carolina had two more drives , but failed to get a first down on each one .\""
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['context_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'who recovered the strip ball ?'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['query_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ward'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['context_word'][output_baseline[0][0]:output_baseline[1][0]+1].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now put all steps together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# editing\n",
    "bidaf_converted = 'bidaf_mod.onnx'\n",
    "SymbolicShapeInference.infer_shapes(bidaf, bidaf_converted)\n",
    "convert_to_scan_model(bidaf_converted, bidaf_converted)\n",
    "# When quantizing, there's an only_for_scan option to quantize only the GEMV inside Scan ops.\n",
    "# This is useful when the input dims of LSTM being much bigger than hidden dims.\n",
    "# BiDAF has several LSTMs with input dim being 800/1400/etc, while hidden dim is 100.\n",
    "# So unlike the LSTMx4 model above, we use only_for_scan here\n",
    "convert_matmul_model(bidaf_converted, bidaf_converted, only_for_scan=True)\n",
    "\n",
    "# inference and verify accuracy\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "output = sess.run([], feed)\n",
    "assert all([np.allclose(o, ob) for o, ob in zip(output, output_baseline)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check performance after all these steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'nuphar: 0.128 seconds, baseline: 0.177 seconds'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess.run([], feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "'nuphar: {0:.3f} seconds, baseline: {1:.3f} seconds'.format(end_nuphar - start_nuphar, end_baseline - start_baseline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The benefit of quantization in BiDAF is not as great as in the LSTM sample above, because BiDAF has relatively small hidden dimensions, which limited the gain from optimization inside Scan ops. However, this model still benefits from fusion/vectorization/etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Ahead-Of-Time (AOT) compilation\n",
    "Nuphar runs Just-in-time (JIT) compilation when loading models. The compilation may lead to slow cold start. We can use create_shared script to build dll from JIT code and accelerate model loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'JIT took 3.163 seconds'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_jit = timer()\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "end_jit = timer()\n",
    "'JIT took {0:.3f} seconds'.format(end_jit - start_jit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a folder for JIT cache\n",
    "cache_dir = os.path.join(cwd, 'bidaf_cache')\n",
    "# remove any stale cache files\n",
    "if os.path.exists(cache_dir):\n",
    "  shutil.rmtree(cache_dir)\n",
    "os.makedirs(cache_dir, exist_ok=True)\n",
    "# use settings to enable JIT cache\n",
    "settings = 'nuphar_cache_path:{}'.format(cache_dir)\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(settings)\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now object files of JIT code is stored in cache_dir, let's link them into dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['jit.so']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cache_versioned_dir = os.path.join(cache_dir, os.listdir(cache_dir)[0])\n",
    "# use onnxruntime.nuphar.create_shared module to create dll\n",
    "onnxruntime_dir = os.path.split(os.path.abspath(onnxruntime.__file__))[0]\n",
    "subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', cache_versioned_dir], check=True)\n",
    "os.listdir(cache_versioned_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the model loading speed-up with AOT dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'AOT took 0.464 seconds'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_aot = timer()\n",
    "# NOTE: Nuphar settings string is not sticky. It needs to be reset before creating InferenceSession\n",
    "settings = 'nuphar_cache_path:{}'.format(cache_dir)\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(settings)\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "end_aot = timer()\n",
    "'AOT took {0:.3f} seconds'.format(end_aot - start_aot)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "kedeng"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "msauthor": "ke.deng"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}