onnxruntime/docs/python/notebooks/onnxruntime-nuphar-tutorial.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ONNX Runtime: Tutorial for Nuphar execution provider\n",
    "**Accelerating model inference via compiler, using Docker Images for ONNX Runtime with Nuphar**\n",
    "\n",
    "This example shows how to accelerate model inference using Nuphar, an execution provider that leverages just-in-time compilation to generate optimized executables.\n",
    "\n",
    "For more background about Nuphar, please check [Nuphar-ExecutionProvider.md](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/Nuphar-ExecutionProvider.md) and its [build instructions](https://github.com/microsoft/onnxruntime/blob/master/BUILD.md#nuphar).\n",
    "\n",
    "#### Tutorial Roadmap:\n",
    "1. Prerequistes\n",
    "2. Create and run inference on a simple ONNX model, and understand how ***compilation*** works in Nuphar.\n",
    "3. Create and run inference on a model using ***LSTM***, run symbolic shape inference, edit LSTM ops to Scan, and check Nuphar speedup.\n",
    "4. ***Quantize*** the LSTM model and check speedup in Nuphar (CPU with AVX2 support is required).\n",
    "5. Working on real models from onnx model zoo: ***BERT squad*** and ***Bidirectional Attention Flow ([BiDAF](https://arxiv.org/pdf/1611.01603))***.\n",
    "6. ***Ahead-Of-Time (AOT) compilation*** to save just-in-time compilation cost on model load.\n",
    "7. Performance tuning for single thread inference.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Prerequistes\n",
    "Please make sure you have installed following Python packages. Besides, C++ compiler/linker is required for ahead-of-time compilation. Please make sure you have g++ if running on Linux, or Visual Studio 2017 on Windows.\n",
    "\n",
    "For simplicity, you may use [Nuphar docker image](https://github.com/microsoft/onnxruntime/blob/master/dockerfiles/README.md) from Microsoft Container Registry.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cpufeature\n",
    "import numpy as np\n",
    "import onnx\n",
    "from onnx import helper, numpy_helper\n",
    "import os\n",
    "from timeit import default_timer as timer\n",
    "import shutil\n",
    "import subprocess\n",
    "import sys\n",
    "import tarfile\n",
    "import urllib.request\n",
    "\n",
    "def is_windows():\n",
    "  return sys.platform.startswith('win')\n",
    "\n",
    "if is_windows():\n",
    "  assert shutil.which('cl.exe'), 'Please make sure MSVC compiler and liner are in PATH.'\n",
    "else:\n",
    "  assert shutil.which('g++'), 'Please make sure g++ is installed.'\n",
    "\n",
    "def print_speedup(name, delta_baseline, delta):\n",
    "    print(\"{} speed-up {:.2f}%\".format(name, 100*(delta_baseline/delta - 1)))\n",
    "    print(\"    Baseline: {:.3f} s, Current: {:.3f} s\".format(delta_baseline, delta))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And Nuphar package in onnxruntime is required too. Please make sure you are using Nuphar enabled build."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import onnxruntime\n",
    "from onnxruntime.nuphar.model_editor import convert_to_scan_model\n",
    "from onnxruntime.nuphar.model_quantizer import convert_matmul_model\n",
    "from onnxruntime.nuphar.rnn_benchmark import generate_model\n",
    "from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create and run inference on a simple ONNX model\n",
    "Let's start with a simple model: Y = ((X + X) * X + X) * X + X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = onnx.ModelProto()\n",
    "opset = model.opset_import.add()\n",
    "opset.domain == 'onnx'\n",
    "opset.version = 7 # ONNX opset 7 is required for LSTM op later\n",
    "\n",
    "graph = model.graph\n",
    "X = 'input'\n",
    "Y = 'output'\n",
    "\n",
    "# declare graph input/ouput with shape [seq, batch, 1024]\n",
    "dim = 1024\n",
    "model.graph.input.add().CopyFrom(helper.make_tensor_value_info(X, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "model.graph.output.add().CopyFrom(helper.make_tensor_value_info(Y, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "\n",
    "# create nodes: Y = ((X + X) * X + X) * X + X\n",
    "num_nodes = 5\n",
    "for i in range(num_nodes):\n",
    "  n = helper.make_node('Mul' if i % 2 else 'Add',\n",
    "                       [X, X if i == 0 else 'out_'+str(i-1)],\n",
    "                       ['out_'+str(i) if i < num_nodes - 1 else Y],\n",
    "                       'node'+str(i))\n",
    "  model.graph.node.add().CopyFrom(n)\n",
    "\n",
    "# save the model\n",
    "simple_model_name = 'simple.onnx'\n",
    "onnx.save(model, simple_model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use nuphar execution provider to run the inference for the model that we created above, and use settings string to check the generated code.\n",
    "\n",
    "Because of the redirection of output, we dump the lowered code from a subprocess to a log file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "code_to_run = '''\n",
    "import onnxruntime\n",
    "s = 'codegen_dump_lower:verbose'\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(s)\n",
    "sess = onnxruntime.InferenceSession('simple.onnx')\n",
    "'''\n",
    "\n",
    "log_file = 'simple_lower.log' \n",
    "with open(log_file, \"w\") as f:\n",
    "  subprocess.run([sys.executable, '-c', code_to_run], stdout=f, stderr=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lowered log is similar to C source code, but the whole file is lengthy to show here. Let's just check the last few lines that are most important:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['    for (ax2.outer, 0, 64) {\\n',\n",
       " '      if ((0 <= (ax0.ax1.fused/batch))) {\\n',\n",
       " '        if (((ax0.ax1.fused/batch) < seq)) {\\n',\n",
       " '          node4[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] = (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)])))))\\n',\n",
       " '        }\\n',\n",
       " '      }\\n',\n",
       " '    }\\n',\n",
       " '  }\\n',\n",
       " '}\\n',\n",
       " '\\n']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(log_file) as f:\n",
    "    log_lines = f.readlines()\n",
    "\n",
    "log_lines[-10:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compiled code showed that the nodes of Add/Mul were fused into a single function, and vectorization was applied in the loop. The fusion was automatically done by the compiler in the Nuphar execution provider, and did not require any manual model editing.\n",
    "\n",
    "Next, let's run inference on the model and compare the accuracy and performance with numpy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fusion speed-up 434.50%\n",
      "    Baseline: 0.716 s, Current: 0.134 s\n"
     ]
    }
   ],
   "source": [
    "seq = 128\n",
    "batch = 16\n",
    "input_data = np.random.rand(seq, batch, dim).astype(np.float32)\n",
    "sess = onnxruntime.InferenceSession(simple_model_name)\n",
    "feed = {X:input_data}\n",
    "output = sess.run([], feed)\n",
    "np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "assert np.allclose(output[0], np_output)\n",
    "\n",
    "repeats = 100\n",
    "start_ort = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess.run([], feed)\n",
    "end_ort = timer()\n",
    "start_np = timer()\n",
    "for i in range(repeats):\n",
    "    np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "end_np = timer()\n",
    "print_speedup('Fusion', end_np - start_np, end_ort - start_ort)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 3. Create and run inference on a model using LSTM\n",
    "Now, let's take one step further to work on a 4-layer LSTM model, created from onnxruntime.nuphar.rnn_benchmark module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "lstm_model = 'LSTMx4.onnx'\n",
    "input_dim = 256\n",
    "hidden_dim = 1024\n",
    "generate_model('lstm', input_dim, hidden_dim, bidirectional=False, layers=4, model_name=lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**IMPORTANT**: Nuphar generates code before knowing shapes of input data, unlike other execution providers that do runtime shape inference. Thus, shape inference information is critical for compiler optimizations in Nuphar. To do that, we run symbolic shape inference on the model. Symbolic shape inference is based on the ONNX shape inference, and enhanced by sympy to better handle Shape/ConstantOfShape/etc. ops using symbolic computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "SymbolicShapeInference.infer_shapes(input_model=lstm_model, output_model=lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's check baseline performance on the generated model, using CPU execution provider."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_baseline = onnxruntime.InferenceSession(lstm_model)\n",
    "sess_baseline.set_providers(['CPUExecutionProvider']) # default provider in this container is Nuphar, this overrides to CPU EP\n",
    "seq = 128\n",
    "input_data = np.random.rand(seq, 1, input_dim).astype(np.float32)\n",
    "feed = {sess_baseline.get_inputs()[0].name:input_data}\n",
    "output = sess_baseline.run([], feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run RNN models in Nuphar execution provider efficiently, LSTM/GRU/RNN ops need to be converted to Scan ops. This is because Scan is more flexible, and supports quantized RNNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "scan_model = 'Scan_LSTMx4.onnx'\n",
    "convert_to_scan_model(lstm_model, scan_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After conversion, let's compare performance and accuracy with baseline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar Scan speed-up 7.68%\n",
      "    Baseline: 3.037 s, Current: 2.821 s\n"
     ]
    }
   ],
   "source": [
    "sess_nuphar = onnxruntime.InferenceSession(scan_model)\n",
    "output_nuphar = sess_nuphar.run([], feed)\n",
    "assert np.allclose(output[0], output_nuphar[0])\n",
    "\n",
    "repeats = 10\n",
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_nuphar.run([], feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar Scan', end_baseline - start_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Quantize the LSTM model\n",
    "Let's get more speed-ups from Nuphar by quantizing the floating point GEMM/GEMV in LSTM model to int8 GEMM/GEMV.\n",
    "\n",
    "**NOTE:** For inference speed of quantizated model, a CPU with AVX2 instructions is preferred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cpufeature.CPUFeature['AVX2'] or 'No AVX2, quantization model might be slow'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use onnxruntime.nuphar.model_quantizer to quantize floating point GEMM/GEMVs. Assuming GEMM/GEMV takes form of input * weights, weights are statically quantized per-column, and inputs are dynamically quantized per-row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantized_model = 'Scan_LSTMx4_int8.onnx'\n",
    "convert_matmul_model(scan_model, quantized_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now run the quantized model, and check accuracy. Please note that quantization may cause accuracy loss, so we relax the comparison threshold a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_quantized = onnxruntime.InferenceSession(quantized_model)\n",
    "output_quantized = sess_quantized.run([], feed)\n",
    "assert np.allclose(output[0], output_quantized[0], rtol=1e-3, atol=1e-3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now check quantized model performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Quantization speed-up 278.52%\n",
      "    Baseline: 2.821 s, Current: 0.745 s\n"
     ]
    }
   ],
   "source": [
    "start_quantized = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_quantized.run([], feed)\n",
    "end_quantized = timer()\n",
    "\n",
    "print_speedup('Quantization', end_nuphar - start_nuphar, end_quantized - start_quantized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Working on real models\n",
    "\n",
    "### BERT Squad\n",
    "\n",
    "BERT (Bidirectional Encoder Representations from Transformers) applies Transformers to language modelling. With Nuphar, we may fuse and compile the model to accelerate inference on CPU.\n",
    "\n",
    "#### Download model and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download BERT squad model\n",
    "cwd = os.getcwd()\n",
    "model_url = 'https://onnxzoo.blob.core.windows.net/models/opset_10/bert_squad/download_sample_10.tar.gz'\n",
    "model_local = os.path.join(cwd, 'download_sample_10.tar.gz')\n",
    "if not os.path.exists(model_local):\n",
    "  urllib.request.urlretrieve(model_url, model_local)\n",
    "with tarfile.open(model_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run symbolic shape inference\n",
    "Note that this model has computations like `min(100000, seq_len)` which could be simplified to `seq_len` if we know `seq_len` is not going to be too big. We can do this by setting int_max. Besides, auto_merge is used to make sure the all nodes in the entire model could have shape inferenced by merging symbolic dims when broadcasting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_dir = os.path.join(cwd, 'download_sample_10')\n",
    "model = os.path.join(model_dir, 'bertsquad10.onnx')\n",
    "model_with_shape_inference = os.path.join(model_dir, 'bertsquad10_shaped.onnx')\n",
    "\n",
    "# run symbolic shape inference\n",
    "SymbolicShapeInference.infer_shapes(model, model_with_shape_inference, auto_merge=True, int_max=100000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run inference on original model, using CPU execution provider, with maximum optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_options = onnxruntime.SessionOptions()\n",
    "sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL\n",
    "sess_baseline = onnxruntime.InferenceSession(model, sess_options)\n",
    "sess_baseline.set_providers(['CPUExecutionProvider'])\n",
    "\n",
    "# load test data\n",
    "test_data_dir = os.path.join(model_dir, 'test_data_set_1')\n",
    "tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]\n",
    "feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}\n",
    "output_baseline = sess_baseline.run([], feed)\n",
    "\n",
    "repeats = 20\n",
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end_baseline = timer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run inference on the model with symbolic shape inference, using Nuphar execution provider\n",
    "First let's check accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = onnxruntime.InferenceSession(model_with_shape_inference)\n",
    "output = sess.run([], feed)\n",
    "assert all([np.allclose(o, ob, atol=1e-4) for o, ob in zip(output, output_baseline)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then check speed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar BERT squad speed-up 65.18%\n",
      "    Baseline: 5.023 s, Current: 3.041 s\n"
     ]
    }
   ],
   "source": [
    "start_nuphar = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess.run([], feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar BERT squad', end_baseline - start_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BiDAF with quantization\n",
    "\n",
    "BiDAF is a machine comprehension model that uses LSTMs. The inputs to this model are paragraphs of contexts and queries, and the outputs are start/end indices of words in the contexts that answers the queries.\n",
    "\n",
    "First let's download the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download BiDAF model\n",
    "cwd = os.getcwd()\n",
    "bidaf_url = 'https://onnxzoo.blob.core.windows.net/models/opset_9/bidaf/bidaf.tar.gz'\n",
    "bidaf_local = os.path.join(cwd, 'bidaf.tar.gz')\n",
    "if not os.path.exists(bidaf_local):\n",
    "  urllib.request.urlretrieve(bidaf_url, bidaf_local)\n",
    "with tarfile.open(bidaf_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check the performance of the CPU provider:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "bidaf = os.path.join(cwd, 'bidaf', 'bidaf.onnx')\n",
    "sess_baseline = onnxruntime.InferenceSession(bidaf)\n",
    "sess_baseline.set_providers(['CPUExecutionProvider'])\n",
    "# load test data\n",
    "test_data_dir = os.path.join(cwd, 'bidaf', 'test_data_set_3')\n",
    "tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]\n",
    "feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}\n",
    "output_baseline = sess_baseline.run([], feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The context in this test data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"with 4:51 left in regulation , carolina got the ball on their own 24 - yard line with a chance to mount a game - winning drive , and soon faced 3rd - and - 9 . on the next play , miller stripped the ball away from newton , and after several players dove for it , it took a long bounce backwards and was recovered by ward , who returned it five yards to the panthers 4 - yard line . although several players dove into the pile to attempt to recover it , newton did not and his lack of aggression later earned him heavy criticism . meanwhile , denver  ' s offense was kept out of the end zone for three plays , but a holding penalty on cornerback josh norman gave the broncos a new set of downs . then anderson scored on a 2 - yard touchdown run and manning completed a pass to bennie fowler for a 2 - point conversion , giving denver a 24 – 10 lead with 3:08 left and essentially putting the game away . carolina had two more drives , but failed to get a first down on each one .\""
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['context_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'who recovered the strip ball ?'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['query_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ward'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(feed['context_word'][output_baseline[0][0]:output_baseline[1][0]+1].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now put all steps together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# editing\n",
    "bidaf_converted = 'bidaf_mod.onnx'\n",
    "SymbolicShapeInference.infer_shapes(bidaf, bidaf_converted)\n",
    "convert_to_scan_model(bidaf_converted, bidaf_converted)\n",
    "# When quantizing, there's an only_for_scan option to quantize only the GEMV inside Scan ops.\n",
    "# This is useful when the input dims of LSTM being much bigger than hidden dims.\n",
    "# BiDAF has several LSTMs with input dim being 800/1400/etc, while hidden dim is 100.\n",
    "# So unlike the LSTMx4 model above, we use only_for_scan here\n",
    "convert_matmul_model(bidaf_converted, bidaf_converted, only_for_scan=True)\n",
    "\n",
    "# inference and verify accuracy\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "output = sess.run([], feed)\n",
    "assert all([np.allclose(o, ob) for o, ob in zip(output, output_baseline)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check performance after all these steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar quantized BiDAF speed-up 45.63%\n",
      "    Baseline: 0.305 s, Current: 0.209 s\n"
     ]
    }
   ],
   "source": [
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess.run([], feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar quantized BiDAF', end_baseline - start_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The benefit of quantization in BiDAF is not as great as in the LSTM sample above, because BiDAF has relatively small hidden dimensions, which limited the gain from optimization inside Scan ops. However, this model still benefits from fusion/vectorization/etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Ahead-Of-Time (AOT) compilation\n",
    "Nuphar runs Just-in-time (JIT) compilation when loading models. The compilation may lead to slow cold start. We can use create_shared script to build dll from JIT code and accelerate model loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'JIT took 4.655 seconds'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_jit = timer()\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "end_jit = timer()\n",
    "'JIT took {:.3f} seconds'.format(end_jit - start_jit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a folder for JIT cache\n",
    "cache_dir = os.path.join(cwd, 'bidaf_cache')\n",
    "# remove any stale cache files\n",
    "if os.path.exists(cache_dir):\n",
    "  shutil.rmtree(cache_dir)\n",
    "os.makedirs(cache_dir, exist_ok=True)\n",
    "# use settings to enable JIT cache\n",
    "settings = 'nuphar_cache_path:{}'.format(cache_dir)\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(settings)\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now object files of JIT code is stored in cache_dir, let's link them into dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['jit.so']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cache_versioned_dir = os.path.join(cache_dir, os.listdir(cache_dir)[0])\n",
    "# use onnxruntime.nuphar.create_shared module to create dll\n",
    "onnxruntime_dir = os.path.split(os.path.abspath(onnxruntime.__file__))[0]\n",
    "subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', cache_versioned_dir], check=True)\n",
    "os.listdir(cache_versioned_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the model loading speed-up with AOT dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AOT speed-up 967.73%\n",
      "    Baseline: 4.655 s, Current: 0.436 s\n"
     ]
    }
   ],
   "source": [
    "start_aot = timer()\n",
    "# NOTE: Nuphar settings string is not sticky. It needs to be reset before creating InferenceSession\n",
    "settings = 'nuphar_cache_path:{}'.format(cache_dir)\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(settings)\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "end_aot = timer()\n",
    "print_speedup('AOT', end_jit - start_jit, end_aot - start_aot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Performance tuning for single thread inference.\n",
    "By default, Nuphar enables parallel schedule for lower inference latency with multiple threads, when building with MKLML or OpenMP. For some models, user may want to run single-thread inference for better throughput with multiple concurrent inference threads, and turning off parallel schedule may make inference a bit faster in single thread."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single thread perf w/o parallel schedule speed-up 2.83%\n",
      "    Baseline: 0.315 s, Current: 0.307 s\n"
     ]
    }
   ],
   "source": [
    "# set OMP_NUM_THREADS to 1 for single thread inference\n",
    "# this would mak\n",
    "os.environ['OMP_NUM_THREADS'] = '1'\n",
    "\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "start_baseline = timer()\n",
    "for i in range(repeats):\n",
    "    output_baseline = sess_baseline.run([], feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "# use NUPHAR_PARALLEL_MIN_WORKLOADS=0 to turn off parallel schedule, using settings string\n",
    "# it can be set from environment variable too: os.environ['NUPHAR_PARALLEL_MIN_WORKLOADS'] = '0'\n",
    "settings = 'nuphar_parallel_min_workloads:0'\n",
    "onnxruntime.capi._pybind_state.set_nuphar_settings(settings)\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "\n",
    "start = timer()\n",
    "for i in range(repeats):\n",
    "    output = sess_baseline.run([], feed)\n",
    "end = timer()\n",
    "print_speedup('Single thread perf w/o parallel schedule', end_baseline - start_baseline, end - start)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "kedeng"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "msauthor": "ke.deng"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}