onnxruntime/docs/python/inference/notebooks/onnxruntime-nuphar-tutorial.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ONNX Runtime: Tutorial for Nuphar execution provider\n",
    "**Accelerating model inference via compiler, using Docker Images for ONNX Runtime with Nuphar**\n",
    "\n",
    "This example shows how to accelerate model inference using Nuphar, an execution provider that leverages just-in-time compilation to generate optimized executables.\n",
    "\n",
    "For more background about Nuphar, please check [Nuphar-ExecutionProvider.md](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/Nuphar-ExecutionProvider.md) and its [build instructions](https://www.onnxruntime.ai/docs/how-to/build.html#nuphar).\n",
    "\n",
    "#### Tutorial Roadmap:\n",
    "1. Prerequistes\n",
    "2. Create and run inference on a simple ONNX model, and understand how ***compilation*** works in Nuphar.\n",
    "3. Create and run inference on a model using ***LSTM***, run symbolic shape inference, edit LSTM ops to Scan, and check Nuphar speedup.\n",
    "4. ***Quantize*** the LSTM model and check speedup in Nuphar (CPU with AVX2 support is required).\n",
    "5. Working on real models from onnx model zoo: ***BERT squad***, ***GPT-2*** and ***Bidirectional Attention Flow ([BiDAF](https://arxiv.org/pdf/1611.01603))***.\n",
    "6. ***Ahead-Of-Time (AOT) compilation*** to save just-in-time compilation cost on model load.\n",
    "7. Performance tuning for single thread inference.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Prerequistes\n",
    "Please make sure you have installed following Python packages. Besides, C++ compiler/linker is required for ahead-of-time compilation. Please make sure you have g++ if running on Linux, or Visual Studio 2017 on Windows.\n",
    "\n",
    "For simplicity, you may use [Nuphar docker image](https://github.com/microsoft/onnxruntime/blob/master/dockerfiles/README.md) from Microsoft Container Registry.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cpufeature\n",
    "import hashlib\n",
    "import numpy as np\n",
    "import onnx\n",
    "from onnx import helper, numpy_helper\n",
    "import os\n",
    "from timeit import default_timer as timer\n",
    "import shutil\n",
    "import subprocess\n",
    "import sys\n",
    "import tarfile\n",
    "import urllib.request\n",
    "\n",
    "def is_windows():\n",
    "  return sys.platform.startswith('win')\n",
    "\n",
    "if is_windows():\n",
    "  assert shutil.which('cl.exe'), 'Please make sure MSVC compiler and liner are in PATH.'\n",
    "else:\n",
    "  assert shutil.which('g++'), 'Please make sure g++ is installed.'\n",
    "\n",
    "def print_speedup(name, delta_baseline, delta):\n",
    "    print(\"{} speed-up {:.2f}%\".format(name, 100*(delta_baseline/delta - 1)))\n",
    "    print(\"    Baseline: {:.3f} s, Current: {:.3f} s\".format(delta_baseline, delta))\n",
    "\n",
    "def create_cache_dir(cache_dir):\n",
    "    # remove any stale cache files\n",
    "    if os.path.exists(cache_dir):\n",
    "        shutil.rmtree(cache_dir)\n",
    "    os.makedirs(cache_dir, exist_ok=True)\n",
    "\n",
    "def md5(file_name):\n",
    "    hash_md5 = hashlib.md5()\n",
    "    with open(file_name, \"rb\") as f:\n",
    "        for chunk in iter(lambda: f.read(4096), b\"\"):\n",
    "            hash_md5.update(chunk)\n",
    "    return hash_md5.hexdigest()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And Nuphar package in onnxruntime is required too. Please make sure you are using Nuphar enabled build."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import onnxruntime\n",
    "from onnxruntime.nuphar.model_editor import convert_to_scan_model\n",
    "from onnxruntime.nuphar.model_quantizer import convert_matmul_model\n",
    "from onnxruntime.nuphar.rnn_benchmark import generate_model, perf_test\n",
    "from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create and run inference on a simple ONNX model\n",
    "Let's start with a simple model: Y = ((X + X) * X + X) * X + X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = onnx.ModelProto()\n",
    "opset = model.opset_import.add()\n",
    "opset.domain == 'onnx'\n",
    "opset.version = 7 # ONNX opset 7 is required for LSTM op later\n",
    "model.ir_version = onnx.IR_VERSION\n",
    "\n",
    "graph = model.graph\n",
    "X = 'input'\n",
    "Y = 'output'\n",
    "\n",
    "# declare graph input/ouput with shape [seq, batch, 1024]\n",
    "dim = 1024\n",
    "model.graph.input.add().CopyFrom(helper.make_tensor_value_info(X, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "model.graph.output.add().CopyFrom(helper.make_tensor_value_info(Y, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))\n",
    "\n",
    "# create nodes: Y = ((X + X) * X + X) * X + X\n",
    "num_nodes = 5\n",
    "for i in range(num_nodes):\n",
    "  n = helper.make_node('Mul' if i % 2 else 'Add',\n",
    "                       [X, X if i == 0 else 'out_'+str(i-1)],\n",
    "                       ['out_'+str(i) if i < num_nodes - 1 else Y],\n",
    "                       'node'+str(i))\n",
    "  model.graph.node.add().CopyFrom(n)\n",
    "\n",
    "# save the model\n",
    "simple_model_name = 'simple.onnx'\n",
    "onnx.save(model, simple_model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use nuphar execution provider to run the inference for the model that we created above, and use settings string to check the generated code.\n",
    "\n",
    "Because of the redirection of output, we dump the lowered code from a subprocess to a log file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "code_to_run = '''\n",
    "import onnxruntime\n",
    "s = 'codegen_dump_lower:verbose'\n",
    "providers = [('NupharExecutionProvider', {'nuphar_settings': s}), 'CPUExecutionProvider']\n",
    "sess = onnxruntime.InferenceSession('simple.onnx', providers=providers)\n",
    "'''\n",
    "\n",
    "log_file = 'simple_lower.log' \n",
    "with open(log_file, \"w\") as f:\n",
    "  subprocess.run([sys.executable, '-c', code_to_run], stdout=f, stderr=f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lowered log is similar to C source code, but the whole file is lengthy to show here. Let's just check the last few lines that are most important:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['    for (ax2.outer, 0, 64) {\\n',\n",
       " '      if ((0 <= (ax0.ax1.fused/batch))) {\\n',\n",
       " '        if (((ax0.ax1.fused/batch) < seq)) {\\n',\n",
       " '          node4[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] = (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)])))))\\n',\n",
       " '        }\\n',\n",
       " '      }\\n',\n",
       " '    }\\n',\n",
       " '  }\\n',\n",
       " '}\\n',\n",
       " '\\n']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(log_file) as f:\n",
    "    log_lines = f.readlines()\n",
    "\n",
    "log_lines[-10:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compiled code showed that the nodes of Add/Mul were fused into a single function, and vectorization was applied in the loop. The fusion was automatically done by the compiler in the Nuphar execution provider, and did not require any manual model editing.\n",
    "\n",
    "Next, let's run inference on the model and compare the accuracy and performance with numpy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fusion speed-up 445.44%\n",
      "    Baseline: 0.732 s, Current: 0.134 s\n"
     ]
    }
   ],
   "source": [
    "seq = 128\n",
    "batch = 16\n",
    "input_data = np.random.rand(seq, batch, dim).astype(np.float32)\n",
    "sess = onnxruntime.InferenceSession(simple_model_name)\n",
    "simple_feed = {X:input_data}\n",
    "simple_output = sess.run([], simple_feed)\n",
    "np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "assert np.allclose(simple_output[0], np_output)\n",
    "\n",
    "simple_repeats = 100\n",
    "start_ort = timer()\n",
    "for i in range(simple_repeats):\n",
    "    sess.run([], simple_feed)\n",
    "end_ort = timer()\n",
    "start_np = timer()\n",
    "for i in range(simple_repeats):\n",
    "    np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data\n",
    "end_np = timer()\n",
    "print_speedup('Fusion', end_np - start_np, end_ort - start_ort)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 3. Create and run inference on a model using LSTM\n",
    "Now, let's take one step further to work on a 4-layer LSTM model, created from onnxruntime.nuphar.rnn_benchmark module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "lstm_model = 'LSTMx4.onnx'\n",
    "input_dim = 256\n",
    "hidden_dim = 1024\n",
    "generate_model('lstm', input_dim, hidden_dim, bidirectional=False, layers=4, model_name=lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**IMPORTANT**: Nuphar generates code before knowing shapes of input data, unlike other execution providers that do runtime shape inference. Thus, shape inference information is critical for compiler optimizations in Nuphar. To do that, we run symbolic shape inference on the model. Symbolic shape inference is based on the ONNX shape inference, and enhanced by sympy to better handle Shape/ConstantOfShape/etc. ops using symbolic computation.\n",
    "\n",
    "**IMPORTANT**: When running multi-threaded inference, Nuphar currently uses TVM's parallel schedule with has its own thread pool that's compatible with OpenMP and MKLML. The TVM thread pool has not been integrated with ONNX runtime thread pool, so intra_op_num_threads won't control it. Please make sure the build is with OpenMP or MKLML, and use OMP_NUM_THREADS to control thread pool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "onnx.save(SymbolicShapeInference.infer_shapes(onnx.load(lstm_model)), lstm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's check baseline performance on the generated model, using CPU execution provider."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_baseline = onnxruntime.InferenceSession(lstm_model, providers=['CPUExecutionProvider'])\n",
    "seq = 128\n",
    "input_data = np.random.rand(seq, 1, input_dim).astype(np.float32)\n",
    "lstm_feed = {sess_baseline.get_inputs()[0].name:input_data}\n",
    "lstm_output = sess_baseline.run([], lstm_feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run RNN models in Nuphar execution provider efficiently, LSTM/GRU/RNN ops need to be converted to Scan ops. This is because Scan is more flexible, and supports quantized RNNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "lstm_scan_model = 'Scan_LSTMx4.onnx'\n",
    "convert_to_scan_model(lstm_model, lstm_scan_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After conversion, let's compare performance and accuracy with baseline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar Scan speed-up 9.22%\n",
      "    Baseline: 3.070 s, Current: 2.811 s\n"
     ]
    }
   ],
   "source": [
    "sess_nuphar = onnxruntime.InferenceSession(lstm_scan_model)\n",
    "output_nuphar = sess_nuphar.run([], lstm_feed)\n",
    "assert np.allclose(lstm_output[0], output_nuphar[0])\n",
    "\n",
    "lstm_repeats = 10\n",
    "start_lstm_baseline = timer()\n",
    "for i in range(lstm_repeats):\n",
    "    sess_baseline.run([], lstm_feed)\n",
    "end_lstm_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(lstm_repeats):\n",
    "    sess_nuphar.run([], lstm_feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar Scan', end_lstm_baseline - start_lstm_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Quantize the LSTM model\n",
    "Let's get more speed-ups from Nuphar by quantizing the floating point GEMM/GEMV in LSTM model to int8 GEMM/GEMV.\n",
    "\n",
    "**NOTE:** For inference speed of quantizated model, a CPU with AVX2 instructions is preferred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cpufeature.CPUFeature['AVX2'] or 'No AVX2, quantization model might be slow'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use onnxruntime.nuphar.model_quantizer to quantize floating point GEMM/GEMVs. Assuming GEMM/GEMV takes form of input * weights, weights are statically quantized per-column, and inputs are dynamically quantized per-row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "lstm_quantized_model = 'Scan_LSTMx4_int8.onnx'\n",
    "convert_matmul_model(lstm_scan_model, lstm_quantized_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now run the quantized model, and check accuracy. Please note that quantization may cause accuracy loss, so we relax the comparison threshold a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_quantized = onnxruntime.InferenceSession(lstm_quantized_model)\n",
    "output_quantized = sess_quantized.run([], lstm_feed)\n",
    "assert np.allclose(lstm_output[0], output_quantized[0], rtol=1e-3, atol=1e-3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now check quantized model performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Quantization speed-up 722.57%\n",
      "    Baseline: 2.811 s, Current: 0.342 s\n"
     ]
    }
   ],
   "source": [
    "start_quantized = timer()\n",
    "for i in range(lstm_repeats):\n",
    "    sess_quantized.run([], lstm_feed)\n",
    "end_quantized = timer()\n",
    "\n",
    "print_speedup('Quantization', end_nuphar - start_nuphar, end_quantized - start_quantized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check RNN quantization performance, please use rnn_benchmark.perf_test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perf_rnn (with default threads) lstm_i80_h512_bi_l6_no_batch.onnx: run for 348 iterations, top 5 avg 28.368 ms\n",
      "perf_scan (with 4 threads) lstm_i80_h512_bi_l6_no_batch_scan.onnx: run for 373 iterations, top 5 avg 25.819 ms\n",
      "perf_int8 (with 4 threads) lstm_i80_h512_bi_l6_no_batch_int8.onnx: run for 778 iterations, top 5 avg 12.534 ms\n",
      "Nuphar Quantization speed up speed-up 126.33%\n",
      "    Baseline: 0.028 s, Current: 0.013 s\n"
     ]
    }
   ],
   "source": [
    "rnn_type = 'lstm'                         # could be 'lstm', 'gru' or 'rnn'\n",
    "num_threads = cpufeature.CPUFeature['num_physical_cores'] # no hyper thread\n",
    "input_dim = 80                            # size of input dimension\n",
    "hidden_dim = 512                          # size of hidden dimension in cell\n",
    "bidirectional = True                      # specify RNN being bidirectional\n",
    "layers = 6                                # number of stacked RNN layers\n",
    "seq_len = 40                              # length of sequence\n",
    "batch_size = 1                            # size of batch\n",
    "original_ms, scan_ms, int8_ms = perf_test(rnn_type, num_threads, input_dim, hidden_dim, bidirectional, layers, seq_len, batch_size)\n",
    "print_speedup('Nuphar Quantization speed up', original_ms / 1000, int8_ms / 1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Working on real models\n",
    "\n",
    "### 5.1 BERT Squad\n",
    "\n",
    "BERT (Bidirectional Encoder Representations from Transformers) applies Transformers to language modelling. With Nuphar, we may fuse and compile the model to accelerate inference on CPU.\n",
    "\n",
    "#### Download model and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download BERT squad model\n",
    "cwd = os.getcwd()\n",
    "bert_model_url = 'https://onnxzoo.blob.core.windows.net/models/opset_10/bert_squad/download_sample_10.tar.gz'\n",
    "bert_model_local = os.path.join(cwd, 'download_sample_10.tar.gz')\n",
    "if not os.path.exists(bert_model_local):\n",
    "  urllib.request.urlretrieve(bert_model_url, bert_model_local)\n",
    "with tarfile.open(bert_model_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run symbolic shape inference\n",
    "Note that this model has computations like `min(100000, seq_len)` which could be simplified to `seq_len` if we know `seq_len` is not going to be too big. We can do this by setting int_max. Besides, auto_merge is used to make sure the all nodes in the entire model could have shape inferenced by merging symbolic dims when broadcasting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "bert_model_dir = os.path.join(cwd, 'download_sample_10')\n",
    "bert_model = os.path.join(bert_model_dir, 'bertsquad10.onnx')\n",
    "bert_model_with_shape_inference = os.path.join(bert_model_dir, 'bertsquad10_shaped.onnx')\n",
    "\n",
    "# run symbolic shape inference\n",
    "onnx.save(SymbolicShapeInference.infer_shapes(onnx.load(bert_model), auto_merge=True, int_max=100000), bert_model_with_shape_inference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run inference on original model, using CPU execution provider, with maximum optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess_options = onnxruntime.SessionOptions()\n",
    "sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL\n",
    "sess_baseline = onnxruntime.InferenceSession(bert_model, sess_options=sess_options, providers=['CPUExecutionProvider'])\n",
    "\n",
    "# load test data\n",
    "test_data_dir = os.path.join(bert_model_dir, 'test_data_set_1')\n",
    "tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]\n",
    "bert_feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}\n",
    "bert_output_baseline = sess_baseline.run([], bert_feed)\n",
    "\n",
    "bert_repeats = 20\n",
    "start_bert_baseline = timer()\n",
    "for i in range(bert_repeats):\n",
    "    sess_baseline.run([], bert_feed)\n",
    "end_bert_baseline = timer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run inference on the model with symbolic shape inference, using Nuphar execution provider\n",
    "First let's check accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = onnxruntime.InferenceSession(bert_model_with_shape_inference)\n",
    "output = sess.run([], bert_feed)\n",
    "assert all([np.allclose(o, ob, atol=1e-4) for o, ob in zip(output, bert_output_baseline)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then check speed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar BERT squad speed-up 66.70%\n",
      "    Baseline: 5.093 s, Current: 3.055 s\n"
     ]
    }
   ],
   "source": [
    "start_nuphar = timer()\n",
    "for i in range(bert_repeats):\n",
    "    sess.run([], bert_feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar BERT squad', end_bert_baseline - start_bert_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 GPT-2 with fixed batch size\n",
    "GPT-2 is a language model using Generative Pre-Trained Transformer for text generation. With Nuphar, we may fuse and compile the model to accelerate inference on CPU.\n",
    "\n",
    "#### Download model and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download GPT-2 model\n",
    "cwd = os.getcwd()\n",
    "gpt2_model_url = 'https://onnxzoo.blob.core.windows.net/models/opset_10/GPT2/GPT-2.tar.gz'\n",
    "gpt2_model_local = os.path.join(cwd, 'GPT-2.tar.gz')\n",
    "if not os.path.exists(gpt2_model_local):\n",
    "  urllib.request.urlretrieve(gpt2_model_url, gpt2_model_local)\n",
    "with tarfile.open(gpt2_model_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Change batch dimension to fixed value, and run symbolic shape inference\n",
    "The GPT-2 model from model zoo has a symbolic batch dimension. By replacing it with a fixed value, compiler would be able to generate better code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt2_model_dir = os.path.join(cwd, 'GPT2')\n",
    "gpt2_model = os.path.join(gpt2_model_dir, 'model.onnx')\n",
    "\n",
    "# edit batch dimension from symbolic to int value for better codegen\n",
    "mp = onnx.load(gpt2_model)\n",
    "mp.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 1\n",
    "onnx.save(mp, gpt2_model)\n",
    "\n",
    "gpt2_model_with_shape_inference = os.path.join(gpt2_model_dir, 'model_shaped.onnx')\n",
    "\n",
    "# run symbolic shape inference\n",
    "onnx.save(SymbolicShapeInference.infer_shapes(onnx.load(gpt2_model), auto_merge=True), gpt2_model_with_shape_inference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run inference and compare accuracy/performance to CPU provider"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar GPT-2 speed-up 16.15%\n",
      "    Baseline: 2.671 s, Current: 2.299 s\n"
     ]
    }
   ],
   "source": [
    "sess_options = onnxruntime.SessionOptions()\n",
    "sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL\n",
    "sess_baseline = onnxruntime.InferenceSession(gpt2_model, sess_options=sess_options, providers=['CPUExecutionProvider'])\n",
    "\n",
    "# load test data, note the tensor proto name in data does not match model, so override it in feed\n",
    "input_name = [i.name for i in sess_baseline.get_inputs()][0] # This model only has one input\n",
    "test_data_dir = os.path.join(gpt2_model_dir, 'test_data_set_0')\n",
    "tp = onnx.load_tensor(os.path.join(test_data_dir, 'input_0.pb'))\n",
    "gpt2_feed = {input_name:numpy_helper.to_array(tp).reshape(1,-1)} # the test data missed batch dimension\n",
    "gpt2_output_baseline = sess_baseline.run([], gpt2_feed)\n",
    "\n",
    "gpt2_repeats = 100\n",
    "start_gpt2_baseline = timer()\n",
    "for i in range(gpt2_repeats):\n",
    "    sess_baseline.run([], gpt2_feed)\n",
    "end_gpt2_baseline = timer()\n",
    "\n",
    "sess = onnxruntime.InferenceSession(gpt2_model_with_shape_inference)\n",
    "output = sess.run([], gpt2_feed)\n",
    "assert all([np.allclose(o, ob, atol=1e-4) for o, ob in zip(output, gpt2_output_baseline)])\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(gpt2_repeats):\n",
    "    output = sess.run([], gpt2_feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar GPT-2', end_gpt2_baseline - start_gpt2_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 BiDAF with quantization\n",
    "\n",
    "BiDAF is a machine comprehension model that uses LSTMs. The inputs to this model are paragraphs of contexts and queries, and the outputs are start/end indices of words in the contexts that answers the queries.\n",
    "\n",
    "First let's download the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download BiDAF model\n",
    "cwd = os.getcwd()\n",
    "bidaf_url = 'https://onnxzoo.blob.core.windows.net/models/opset_9/bidaf/bidaf.tar.gz'\n",
    "bidaf_local = os.path.join(cwd, 'bidaf.tar.gz')\n",
    "if not os.path.exists(bidaf_local):\n",
    "  urllib.request.urlretrieve(bidaf_url, bidaf_local)\n",
    "with tarfile.open(bidaf_local, 'r') as f:\n",
    "  f.extractall(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check the performance of the CPU provider:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "bidaf_dir = os.path.join(cwd, 'bidaf')\n",
    "bidaf = os.path.join(bidaf_dir, 'bidaf.onnx')\n",
    "sess_baseline = onnxruntime.InferenceSession(bidaf, providers=['CPUExecutionProvider'])\n",
    "# load test data\n",
    "test_data_dir = os.path.join(cwd, 'bidaf', 'test_data_set_3')\n",
    "tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]\n",
    "bidaf_feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}\n",
    "bidaf_output_baseline = sess_baseline.run([], bidaf_feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The context in this test data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"with 4:51 left in regulation , carolina got the ball on their own 24 - yard line with a chance to mount a game - winning drive , and soon faced 3rd - and - 9 . on the next play , miller stripped the ball away from newton , and after several players dove for it , it took a long bounce backwards and was recovered by ward , who returned it five yards to the panthers 4 - yard line . although several players dove into the pile to attempt to recover it , newton did not and his lack of aggression later earned him heavy criticism . meanwhile , denver  ' s offense was kept out of the end zone for three plays , but a holding penalty on cornerback josh norman gave the broncos a new set of downs . then anderson scored on a 2 - yard touchdown run and manning completed a pass to bennie fowler for a 2 - point conversion , giving denver a 24 – 10 lead with 3:08 left and essentially putting the game away . carolina had two more drives , but failed to get a first down on each one .\""
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(bidaf_feed['context_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'who recovered the strip ball ?'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(bidaf_feed['query_word'].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ward'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(list(bidaf_feed['context_word'][bidaf_output_baseline[0][0]:bidaf_output_baseline[1][0]+1].reshape(-1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now put all steps together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# editing\n",
    "bidaf_converted = 'bidaf_mod.onnx'\n",
    "onnx.save(SymbolicShapeInference.infer_shapes(onnx.load(bidaf)), bidaf_converted)\n",
    "convert_to_scan_model(bidaf_converted, bidaf_converted)\n",
    "# When quantizing, there's an only_for_scan option to quantize only the GEMV inside Scan ops.\n",
    "# This is useful when the input dims of LSTM being much bigger than hidden dims.\n",
    "# BiDAF has several LSTMs with input dim being 800/1400/etc, while hidden dim is 100.\n",
    "# So unlike the LSTMx4 model above, we use only_for_scan here\n",
    "convert_matmul_model(bidaf_converted, bidaf_converted, only_for_scan=True)\n",
    "\n",
    "# inference and verify accuracy\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "output = sess.run([], bidaf_feed)\n",
    "assert all([np.allclose(o, ob) for o, ob in zip(output, bidaf_output_baseline)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check performance after all these steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nuphar quantized BiDAF speed-up 43.79%\n",
      "    Baseline: 1.517 s, Current: 1.055 s\n"
     ]
    }
   ],
   "source": [
    "bidaf_repeats = 100\n",
    "start_bidaf_baseline = timer()\n",
    "for i in range(bidaf_repeats):\n",
    "    sess_baseline.run([], bidaf_feed)\n",
    "end_bidaf_baseline = timer()\n",
    "\n",
    "start_nuphar = timer()\n",
    "for i in range(bidaf_repeats):\n",
    "    sess.run([], bidaf_feed)\n",
    "end_nuphar = timer()\n",
    "\n",
    "print_speedup('Nuphar quantized BiDAF', end_bidaf_baseline - start_bidaf_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The benefit of quantization in BiDAF is not as great as in the LSTM sample above, because BiDAF has relatively small hidden dimensions, which limited the gain from optimization inside Scan ops. However, this model still benefits from fusion/vectorization/etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Ahead-Of-Time (AOT) compilation\n",
    "Nuphar runs Just-in-time (JIT) compilation when loading models. The compilation may lead to slow cold start. We can use create_shared script to build dll from JIT code and accelerate model loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'JIT took 5.749 seconds'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start_jit = timer()\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "end_jit = timer()\n",
    "'JIT took {:.3f} seconds'.format(end_jit - start_jit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# use settings to enable JIT cache\n",
    "bidaf_cache_dir = os.path.join(bidaf_dir, 'cache')\n",
    "create_cache_dir(bidaf_cache_dir)\n",
    "settings = 'nuphar_cache_path:{}'.format(bidaf_cache_dir)\n",
    "providers = [('NupharExecutionProvider', {'nuphar_settings': settings}), 'CPUExecutionProvider']\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted, providers=providers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now object files of JIT code is stored in cache_dir, let's link them into dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['jit.so']"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bidaf_cache_versioned_dir = os.path.join(bidaf_cache_dir, os.listdir(bidaf_cache_dir)[0])\n",
    "# use onnxruntime.nuphar.create_shared module to create dll\n",
    "subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', bidaf_cache_versioned_dir], check=True)\n",
    "os.listdir(bidaf_cache_versioned_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the model loading speed-up with AOT dll:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AOT speed-up 694.06%\n",
      "    Baseline: 5.749 s, Current: 0.724 s\n"
     ]
    }
   ],
   "source": [
    "start_aot = timer()\n",
    "settings = 'nuphar_cache_path:{}'.format(bidaf_cache_dir)\n",
    "providers = [('NupharExecutionProvider', {'nuphar_settings': settings}), 'CPUExecutionProvider']\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted, providers=providers)\n",
    "end_aot = timer()\n",
    "print_speedup('AOT', end_jit - start_jit, end_aot - start_aot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moreover, Nuphar AOT also supports:\n",
    "* Generate JIT cache with AVX/AVX2/AVX-512 and build a AOT dll including support for all these CPUs, which makes deployment easier when targeting different CPUs in one package.\n",
    "* Bake model checksum into AOT dll to validate model with given AOT dll."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scan_LSTMx4_int8.onnx in avx2 speed-up 790.69%\n",
      "    Baseline: 3.070 s, Current: 0.345 s\n",
      "Scan_LSTMx4_int8.onnx in avx speed-up 361.68%\n",
      "    Baseline: 3.070 s, Current: 0.665 s\n"
     ]
    }
   ],
   "source": [
    "# create object files for different CPUs\n",
    "cache_dir = os.path.join(os.getcwd(), 'lstm_cache')\n",
    "model_name = lstm_quantized_model\n",
    "model_checksum = md5(model_name)\n",
    "repeats = lstm_repeats\n",
    "feed = lstm_feed\n",
    "time_baseline = end_lstm_baseline - start_lstm_baseline\n",
    "multi_isa_so = 'avx_avx2_avx512.so'\n",
    "\n",
    "create_cache_dir(cache_dir)\n",
    "settings = 'nuphar_cache_path:{}'.format(cache_dir)\n",
    "for isa in ['avx512', 'avx2', 'avx']:\n",
    "    settings_with_isa = settings + ', nuphar_codegen_target:' + isa\n",
    "    providers = [('NupharExecutionProvider', {'nuphar_settings': settings_with_isa}), 'CPUExecutionProvider']\n",
    "    sess = onnxruntime.InferenceSession(model_name, providers=providers)\n",
    "    cache_versioned_dir = os.path.join(cache_dir, os.listdir(cache_dir)[0])\n",
    "\n",
    "# link object files to AOT dll\n",
    "subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', cache_versioned_dir, '--input_model', model_name, '--output_name', multi_isa_so], check=True)\n",
    "\n",
    "# now load the model with AOT dll\n",
    "# NOTE: when nuphar_codegen_target is not set, it defaults to current CPU ISA\n",
    "settings = 'nuphar_cache_path:{}, nuphar_cache_so_name:{}, nuphar_cache_model_checksum:{}, nuphar_cache_force_no_jit:on'.format(cache_dir, multi_isa_so, model_checksum)\n",
    "providers = [('NupharExecutionProvider', {'nuphar_settings': settings}), 'CPUExecutionProvider']\n",
    "sess = onnxruntime.InferenceSession(model_name, providers=providers)\n",
    "\n",
    "# force to a different ISA which is a subset of current CPU\n",
    "# NOTE: if an incompatible ISA is used, exception on invalid instructions would be thrown\n",
    "for valid_isa in ['avx2', 'avx']:\n",
    "    settings_with_isa = 'nuphar_cache_path:{}, nuphar_cache_so_name:{}, nuphar_cache_model_checksum:{}, nuphar_codegen_target:{}, nuphar_cache_force_no_jit:on'.format(cache_dir, multi_isa_so, model_checksum, valid_isa)\n",
    "    providers = [('NupharExecutionProvider', {'nuphar_settings': settings_with_isa}), 'CPUExecutionProvider']\n",
    "    sess = onnxruntime.InferenceSession(model_name, providers=providers)\n",
    "\n",
    "    start_nuphar = timer()\n",
    "    for i in range(repeats):\n",
    "        sess.run([], feed)\n",
    "    end_nuphar = timer()\n",
    "\n",
    "    print_speedup('{} in {}'.format(model_name, valid_isa), time_baseline, end_nuphar - start_nuphar)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Performance tuning for single thread inference.\n",
    "By default, Nuphar enables parallel schedule for lower inference latency with multiple threads, when building with MKLML or OpenMP. For some models, user may want to run single-thread inference for better throughput with multiple concurrent inference threads, and turning off parallel schedule may make inference a bit faster in single thread."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single thread perf w/o parallel schedule speed-up 1.49%\n",
      "    Baseline: 1.540 s, Current: 1.518 s\n"
     ]
    }
   ],
   "source": [
    "# set OMP_NUM_THREADS to 1 for single thread inference\n",
    "# this would mak\n",
    "os.environ['OMP_NUM_THREADS'] = '1'\n",
    "\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted)\n",
    "start_baseline = timer()\n",
    "for i in range(bidaf_repeats):\n",
    "    sess_baseline.run([], bidaf_feed)\n",
    "end_baseline = timer()\n",
    "\n",
    "# use NUPHAR_PARALLEL_MIN_WORKLOADS=0 to turn off parallel schedule, using settings string\n",
    "# it can be set from environment variable too: os.environ['NUPHAR_PARALLEL_MIN_WORKLOADS'] = '0'\n",
    "settings = 'nuphar_parallel_min_workloads:0'\n",
    "providers = [('NupharExecutionProvider', {'nuphar_settings': settings}), 'CPUExecutionProvider']\n",
    "sess = onnxruntime.InferenceSession(bidaf_converted, providers=providers)\n",
    "\n",
    "start = timer()\n",
    "for i in range(bidaf_repeats):\n",
    "    sess_baseline.run([], bidaf_feed)\n",
    "end = timer()\n",
    "print_speedup('Single thread perf w/o parallel schedule', end_baseline - start_baseline, end - start)\n",
    "\n",
    "del os.environ['OMP_NUM_THREADS']"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "kedeng"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "msauthor": "ke.deng"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}