update Tensorflow_Tf2onnx_Bert-Squad_OnnxRuntime_CPU.ipynb (#8898)

* init checkin * update * update * Update Tensorflow_Tf2onnx_Bert-Squad_OnnxRuntime_CPU.ipynb * Update Tensorflow_Tf2onnx_Bert-Squad_OnnxRuntime_CPU.ipynb * use prettrained model * re-run * re-run
2026-06-06 00:03:22 +00:00 · 2021-09-02 09:59:40 -07:00 · 2021-09-02 09:59:40 -07:00 · 7647caa520
commit 7647caa520
parent 4570d85f20
1 changed files with 353 additions and 303 deletions
--- a/onnxruntime/python/tools/transformers/notebooks/Tensorflow_Tf2onnx_Bert-Squad_OnnxRuntime_CPU.ipynb
+++ b/onnxruntime/python/tools/transformers/notebooks/Tensorflow_Tf2onnx_Bert-Squad_OnnxRuntime_CPU.ipynb
@ -2,29 +2,28 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "# Inference TensorFlow Bert Model with ONNX Runtime on CPU"
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
-    "In this tutorial, you'll be introduced to how to load a Bert model using TensorFlow, convert it to ONNX using Keras2onnx, and inference it for high performance using ONNX Runtime. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
-   ]
+    "In this tutorial, you'll be introduced to how to load a Bert model using TensorFlow, convert it to ONNX using tf2onnx, and inference it for high performance using ONNX Runtime. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 0. Prerequisites ##\n",
    "First we need a python environment before running this notebook.\n",
@ -32,7 +31,7 @@
    "You can install [AnaConda](https://www.anaconda.com/distribution/) and [Git](https://git-scm.com/downloads) and open an AnaConda console when it is done. Then you can run the following commands to create a conda environment named cpu_env:\n",
    "\n",
    "```console\n",
-    "conda create -n cpu_env python=3.6\n",
+    "conda create -n cpu_env python=3.8\n",
    "conda activate cpu_env\n",
    "conda install -c anaconda ipykernel\n",
    "conda install -c conda-forge ipywidgets\n",
@ -41,40 +40,51 @@
    "\n",
    "Finally, launch Jupyter Notebook and you can choose cpu_env as kernel to run this notebook.\n",
    "\n",
-    "Let's install [Tensorflow](https://www.tensorflow.org/install), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), Keras2Onnx and other packages like the following:"
-   ]
+    "Let's install [Tensorflow](https://www.tensorflow.org/install), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), [tf2onnx](https://github.com/onnx/tensorflow-onnx) and other packages like the following:"
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
   "source": [
    "import sys\n",
    " \n",
-    "!{sys.executable} -m pip install --quiet --upgrade tensorflow==2.3.0\n",
-    "!{sys.executable} -m pip install --quiet --upgrade onnxruntime==1.4.0\n",
-    "!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools==1.4.0\n",
-    "!{sys.executable} -m pip install --quiet --upgrade keras2onnx==1.7.0\n",
-    "!{sys.executable} -m pip install --quiet transformers==3.0.2\n",
+    "!{sys.executable} -m pip install --quiet --upgrade tensorflow==2.6.0\n",
+    "!{sys.executable} -m pip install -i https://test.pypi.org/simple/ ort-nightly\n",
+    "!{sys.executable} -m pip install --quiet --upgrade tf2onnx==1.9.2\n",
+    "!{sys.executable} -m pip install --quiet transformers==4.9.2\n",
    "!{sys.executable} -m pip install --quiet onnxconverter_common\n",
-    "!{sys.executable} -m pip install --quiet wget pandas"
-   ]
+    "!{sys.executable} -m pip install --quiet psutil wget pandas"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "Looking in indexes: https://test.pypi.org/simple/\n",
+      "Requirement already satisfied: ort-nightly in /bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages (1.8.2.dev20210901001)\n",
+      "Requirement already satisfied: protobuf in /bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages (from ort-nightly) (3.17.3)\n",
+      "Requirement already satisfied: numpy>=1.16.6 in /bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages (from ort-nightly) (1.19.5)\n",
+      "Requirement already satisfied: flatbuffers in /bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages (from ort-nightly) (1.12)\n",
+      "Requirement already satisfied: six>=1.9 in /bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages (from protobuf->ort-nightly) (1.15.0)\n"
+     ]
+    }
+   ],
+   "metadata": {
+    "scrolled": true
+   }
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "Let's define some constants:"
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
   "source": [
    "# Whether allow overwrite existing script or model.\n",
    "enable_overwrite = False\n",
@ -84,13 +94,13 @@
    "\n",
    "# Max sequence length for the export model\n",
    "max_sequence_length = 512"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
   "source": [
    "import os\n",
    "cache_dir = './cache_models'\n",
@ -98,56 +108,42 @@
    "for directory in [cache_dir, output_dir]:\n",
    "    if not os.path.exists(directory):\n",
    "        os.makedirs(directory)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
-   "metadata": {},
-   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "tf.config.set_visible_devices([], 'GPU') # Disable GPU for fair comparison"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 1. Load Pretrained Bert model ##"
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "Start to load fine-tuned model. This step take a few minutes to download the model for the first time."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForQuestionAnswering: ['nsp___cls', 'mlm___cls']\n",
-      "- This IS expected if you are initializing TFBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).\n",
-      "- This IS NOT expected if you are initializing TFBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
-      "Some weights of TFBertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs']\n",
-      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
-     ]
-    }
-   ],
   "source": [
    "from transformers import (TFBertForQuestionAnswering, BertTokenizer)\n",
    "\n",
-    "#model_name_or_path = 'bert-large-uncased-whole-word-masking-finetuned-squad'\n",
-    "model_name_or_path = \"bert-base-cased\"\n",
+    "model_name_or_path = 'bert-large-uncased-whole-word-masking-finetuned-squad'\n",
+    "#model_name_or_path = \"bert-base-cased\"\n",
    "is_fine_tuned = (model_name_or_path == 'bert-large-uncased-whole-word-masking-finetuned-squad')\n",
    "\n",
    "# Load model and tokenizer\n",
@ -155,22 +151,35 @@
    "model = TFBertForQuestionAnswering.from_pretrained(model_name_or_path, cache_dir=cache_dir)\n",
    "# Needed this to export onnx model with multiple inputs with TF 2.2\n",
    "model._saved_model_inputs_spec = None"
-   ]
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stderr",
+     "text": [
+      "All model checkpoint layers were used when initializing TFBertForQuestionAnswering.\n",
+      "\n",
+      "All the layers of TFBertForQuestionAnswering were initialized from the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad.\n",
+      "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.\n"
+     ]
+    }
+   ],
+   "metadata": {
+    "scrolled": true
+   }
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 2. TensorFlow Inference\n",
    "\n",
    "Use one example to run inference using TensorFlow as baseline."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
-   "metadata": {},
-   "outputs": [],
   "source": [
    "import numpy\n",
    "\n",
@ -178,117 +187,110 @@
    "# Pad to max length is needed. Otherwise, position embedding might be truncated by constant folding.\n",
    "inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors='tf',\n",
    "                               max_length=max_sequence_length, pad_to_max_length=True, truncation=True)\n",
-    "start_scores, end_scores = model(inputs)\n",
+    "output = model(inputs)\n",
+    "start_scores, end_scores = output.start_logits, output.end_logits\n",
    "\n",
    "num_tokens = len(inputs[\"input_ids\"][0])\n",
    "if is_fine_tuned:\n",
    "    all_tokens = tokenizer.convert_ids_to_tokens(inputs[\"input_ids\"][0])\n",
    "    print(\"The answer is:\", ' '.join(all_tokens[numpy.argmax(start_scores) : numpy.argmax(end_scores)+1]))"
-   ]
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stderr",
+     "text": [
+      "/bert_ort/wy/anaconda3/envs/cpu_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:2184: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "The answer is: a performance - focused inference engine for on ##nx models\n"
+     ]
+    }
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Tensorflow Inference time for sequence length 512 = 1133.13 ms\n"
-     ]
-    }
-   ],
   "source": [
    "import time\n",
    "start = time.time()\n",
    "for _ in range(total_runs):\n",
-    "    start_scores, end_scores = model(inputs)\n",
+    "    outputs = model(inputs)\n",
    "end = time.time()\n",
    "print(\"Tensorflow Inference time for sequence length {} = {} ms\".format(num_tokens, format((end - start) * 1000 / total_runs, '.2f')))"
-   ]
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "Tensorflow Inference time for sequence length 512 = 1360.38 ms\n"
+     ]
+    }
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
-    "## 3. Export model to ONNX using Keras2onnx\n",
-    "\n",
-    "Now we use Keras2onnx to export the model to ONNX format. It takes about 3 minutes for bert-base, or 18 minutes for bert-large model.",
-    "\n",
+    "## 3. Export model to ONNX using tf2onnx\r\n",
+    "\r\n",
+    "Now we use tf2onnx to export the model to ONNX format.\r\n",
    "Note that we could also convert tensorflow checkpoints to pytorch(supported by huggingface team, ref:https://huggingface.co/transformers/converting_tensorflow_models.html) and then convert to onnx using torch.onnx.export()."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
-   "metadata": {},
-   "outputs": [],
   "source": [
-    "import keras2onnx\n",
-    "from keras2onnx.proto import keras\n",
+    "import tf2onnx\n",
+    "tf2onnx.logging.set_level(tf2onnx.logging.ERROR)\n",
    "\n",
-    "output_model_path =  os.path.join(output_dir, 'keras_{}.onnx'.format(model_name_or_path))\n",
+    "output_model_path =  os.path.join(output_dir, 'tf2onnx_{}.onnx'.format(model_name_or_path))\n",
+    "opset_version = 13\n",
+    "use_external_data_format = False\n",
+    "\n",
+    "specs = []\n",
+    "for name, value in inputs.items():\n",
+    "    dims = [None] * len(value.shape)\n",
+    "    specs.append(tf.TensorSpec(tuple(dims), value.dtype, name=name))\n",
    "\n",
    "if enable_overwrite or not os.path.exists(output_model_path):\n",
    "    start = time.time()\n",
-    "    onnx_model = keras2onnx.convert_keras(model, model.name)\n",
-    "    keras2onnx.save_model(onnx_model, output_model_path)\n",
-    "    print(\"Keras2onnx run time = {} s\".format(format(time.time() - start, '.2f')))"
-   ]
+    "    _, _ = tf2onnx.convert.from_keras(model,\n",
+    "                                      input_signature=tuple(specs),\n",
+    "                                      opset=opset_version,\n",
+    "                                      large_model=use_external_data_format,\n",
+    "                                      output_path=output_model_path)\n",
+    "    print(\"tf2onnx run time = {} s\".format(format(time.time() - start, '.2f')))"
+   ],
+   "outputs": [],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 4. Inference the Exported Model with ONNX Runtime"
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
-    "### OpenMP Environment Variable\n",
-    "\n",
-    "OpenMP environment variable is important for CPU inference of Bert models. After running this notebook, you can find the best setting from [Performance Test Tool](#Performance-Test-Tool) result for your machine.\n",
-    "\n",
-    "Setting environment variables shall be done before importing onnxruntime. Otherwise, they might not take effect."
-   ]
+    "Now we are ready to inference the model with ONNX Runtime."
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import os\n",
-    "import psutil\n",
-    "\n",
-    "# ATTENTION: these environment variables must be set before importing onnxruntime.\n",
-    "os.environ[\"OMP_NUM_THREADS\"] = str(psutil.cpu_count(logical=True))\n",
-    "os.environ[\"OMP_WAIT_POLICY\"] = 'ACTIVE'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we are ready to inference the model with ONNX Runtime. Here we can see that OnnxRuntime has better performance than TensorFlow for this example even without optimization."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "ONNX Runtime cpu inference time for sequence length 512 (model not optimized): 654.49 ms\n"
-     ]
-    }
-   ],
   "source": [
    "import psutil\n",
    "import onnxruntime\n",
@ -296,9 +298,8 @@
    "\n",
    "sess_options = onnxruntime.SessionOptions()\n",
    "\n",
-    "# intra_op_num_threads=1 can be used to enable OpenMP in OnnxRuntime 1.2.0.\n",
-    "# For OnnxRuntime 1.3.0 or later, this does not have effect unless you are using onnxruntime-gpu package.\n",
-    "# sess_options.intra_op_num_threads=1\n",
+    "# Set the intra_op_num_threads\n",
+    "sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)\n",
    "\n",
    "# Providers is optional. Only needed when you use onnxruntime-gpu for CPU inference.\n",
    "session = onnxruntime.InferenceSession(output_model_path, sess_options, providers=['CPUExecutionProvider'])\n",
@ -316,75 +317,99 @@
    "end = time.time()\n",
    "print(\"ONNX Runtime cpu inference time for sequence length {} (model not optimized): {} ms\".format(num_tokens, format((end - start) * 1000 / total_runs, '.2f')))\n",
    "del session"
-   ]
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "ONNX Runtime cpu inference time for sequence length 512 (model not optimized): 1185.30 ms\n"
+     ]
+    }
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 10,
   "source": [
    "# Some weights of TFBertForQuestionAnswering might not be initialized without fine-tuning.\n",
    "if is_fine_tuned:\n",
    "    print(\"***** Verifying correctness (TensorFlow and ONNX Runtime) *****\")\n",
-    "    print('start_scores are close:', numpy.allclose(results[0], start_scores.cpu(), rtol=1e-05, atol=1e-04))\n",
-    "    print('end_scores are close:', numpy.allclose(results[1], end_scores.cpu(), rtol=1e-05, atol=1e-04))"
-   ]
+    "    print('start_scores are close:', numpy.allclose(results[1], start_scores.cpu(), rtol=1e-05, atol=1e-04))\n",
+    "    print('end_scores are close:', numpy.allclose(results[0], end_scores.cpu(), rtol=1e-05, atol=1e-04))"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "***** Verifying correctness (TensorFlow and ONNX Runtime) *****\n",
+      "WARNING:tensorflow:From <ipython-input-10-1948b56cd554>:4: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+      "Instructions for updating:\n",
+      "Use tf.identity instead.\n",
+      "start_scores are close: True\n",
+      "end_scores are close: True\n"
+     ]
+    }
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 5. Model Optimization\n",
    "\n",
    "[ONNX Runtime BERT Model Optimization Tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) is a set of tools for optimizing and testing BERT models. Let's try some of them on the exported models."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "### BERT Optimization Script\n",
    "\n",
-    "The script **optimizer.py** can help optimize BERT model exported by PyTorch, tf2onnx or keras2onnx. Since our model is exported by keras2onnx, we shall use **--model_type bert_keras** parameter.\n",
+    "The script **optimizer.py** can help optimize BERT model exported by PyTorch, tf2onnx or keras2onnx. Since our model is exported by tf2onnx, we shall use **--model_type bert_tf** parameter.\n",
    "\n",
    "It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph."
-   ]
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "source": [
+    "!{sys.executable} -m pip install --quiet coloredlogs sympy \n",
+    "\n",
+    "optimized_model_path =  os.path.join(output_dir, 'tf2onnx_{}_opt_cpu.onnx'.format(model_name_or_path))\n",
+    "\n",
+    "from onnxruntime.transformers import optimizer\n",
+    "optimized_model = optimizer.optimize_model(output_model_path, model_type='bert_tf', num_heads=12, hidden_size=768)\n",
+    "optimized_model.use_dynamic_axes()\n",
+    "optimized_model.save_model_to_file(optimized_model_path)"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "failed in shape inference <class 'AttributeError'>\n",
+      "failed in shape inference <class 'AttributeError'>\n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We run the optimized model using same inputs. The inference latency might be reduced after optimization. The output result is the same as the one before optimization."
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "optimized_model_path =  os.path.join(output_dir, 'keras_{}_opt_cpu.onnx'.format(model_name_or_path))\n",
-    "\n",
-    "from onnxruntime_tools import optimizer\n",
-    "optimized_model = optimizer.optimize_model(output_model_path, model_type='bert_keras', num_heads=12, hidden_size=768)\n",
-    "optimized_model.use_dynamic_axes()\n",
-    "optimized_model.save_model_to_file(optimized_model_path)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We run the optimized model using same inputs. The inference latency is reduced after optimization. The output result is the same as the one before optimization."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "ONNX Runtime cpu inference time on optimized model: 328.48 ms\n"
-     ]
-    }
-   ],
   "source": [
    "session = onnxruntime.InferenceSession(optimized_model_path, sess_options)\n",
    "# use one run to warm up a session\n",
@ -397,16 +422,30 @@
    "end = time.time()\n",
    "print(\"ONNX Runtime cpu inference time on optimized model: {} ms\".format(format((end - start) * 1000 / total_runs, '.2f')))\n",
    "del session"
-   ]
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "ONNX Runtime cpu inference time on optimized model: 1076.50 ms\n"
+     ]
+    }
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
+   "execution_count": 13,
+   "source": [
+    "print(\"***** Verifying correctness (before and after optimization) *****\")\n",
+    "print('start_scores are close:', numpy.allclose(opt_results[0], results[0], rtol=1e-05, atol=1e-04))\n",
+    "print('end_scores are close:', numpy.allclose(opt_results[1], results[1], rtol=1e-05, atol=1e-04))"
+   ],
   "outputs": [
    {
-     "name": "stdout",
     "output_type": "stream",
+     "name": "stdout",
     "text": [
      "***** Verifying correctness (before and after optimization) *****\n",
      "start_scores are close: True\n",
@ -414,98 +453,113 @@
     ]
    }
   ],
-   "source": [
-    "print(\"***** Verifying correctness (before and after optimization) *****\")\n",
-    "print('start_scores are close:', numpy.allclose(opt_results[0], results[0], rtol=1e-05, atol=1e-04))\n",
-    "print('end_scores are close:', numpy.allclose(opt_results[1], results[1], rtol=1e-05, atol=1e-04))"
-   ]
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "### Model Results Comparison Tool\n",
    "\n",
    "If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare results from both the original and optimized models. If outputs are all close, it is safe to use the optimized model.\n",
    "\n",
    "Example of comparing the models before and after optimization:"
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
+   "execution_count": 14,
+   "source": [
+    "# The baseline model is exported using max sequence length, and no dynamic axes\n",
+    "!{sys.executable} -m onnxruntime.transformers.compare_bert_results --baseline_model $output_model_path --optimized_model $optimized_model_path --batch_size 1 --sequence_length $max_sequence_length --samples 10"
+   ],
   "outputs": [
    {
-     "name": "stdout",
     "output_type": "stream",
+     "name": "stdout",
     "text": [
      "100% passed for 10 random inputs given thresholds (rtol=0.001, atol=0.0001).\n",
-      "maximum absolute difference=1.6242265701293945e-06\n",
-      "maximum relative difference=0.009154098108410835\n"
+      "maximum absolute difference=0\n",
+      "maximum relative difference=0\n"
     ]
    }
   ],
-   "source": [
-    "# The baseline model is exported using max sequence length, and no dynamic axes\n",
-    "!{sys.executable} -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $output_model_path --optimized_model $optimized_model_path --batch_size 1 --sequence_length $max_sequence_length --samples 10"
-   ]
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "### Performance Test Tool\n",
    "\n",
    "This tool measures performance of BERT model inference using OnnxRuntime Python API.\n",
    "\n",
    "The following command will create 100 samples of batch_size 1 and sequence length 128 to run inference, then calculate performance numbers like average latency and throughput etc."
-   ]
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "source": [
+    "THREAD_SETTING = '-n {}'.format(psutil.cpu_count(logical=True))\n",
+    "\n",
+    "!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_model_path --batch_size 1 --sequence_length 128 --samples 100 --test_times 1 $THREAD_SETTING"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "test setting TestSetting(batch_size=1, sequence_length=128, test_cases=100, test_times=1, use_gpu=False, intra_op_num_threads=24, seed=3, verbose=False)\n",
+      "Generating 100 samples for batch_size=1 sequence_length=128\n",
+      "Running test: model=tf2onnx_bert-large-uncased-whole-word-masking-finetuned-squad_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=24,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False\n",
+      "Average latency = 363.76 ms, Throughput = 2.75 QPS\n",
+      "Test summary is saved to onnx_models/perf_results_CPU_B1_S128_20210901-191653.txt\n"
+     ]
+    }
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Let's load the summary file and take a look. In this machine, the best result is achieved by OpenMP. The best setting might be difference using different hardware or model."
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 16,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Running test: model=keras_bert-base-cased_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=100,test_times=1,contiguous=None,use_gpu=False,warmup=True\n",
-      "Average latency = 97.93 ms, Throughput = 10.21 QPS\n",
-      "test setting TestSetting(batch_size=1, sequence_length=128, test_cases=100, test_times=1, contiguous=None, use_gpu=False, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\n",
-      "Generating 100 samples for batch_size=1 sequence_length=128\n",
-      "Test summary is saved to onnx_models\\perf_results_CPU_B1_S128_20200728-165907.txt\n"
-     ]
-    }
-   ],
   "source": [
-    "THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\n",
+    "import glob     \n",
+    "import pandas\n",
    "\n",
-    "!{sys.executable} -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_model_path --batch_size 1 --sequence_length 128 --samples 100 --test_times 1 --inclusive $THREAD_SETTING"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's load the summary file and take a look. In this machine, the best result is achieved by OpenMP. The best setting might be difference using different hardware or model."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 17,
-   "metadata": {},
+    "latest_result_file = max(glob.glob(os.path.join(output_dir, \"perf_results_*.txt\")), key=os.path.getmtime)\n",
+    "result_data = pandas.read_table(latest_result_file)\n",
+    "print(latest_result_file)\n",
+    "\n",
+    "result_data.drop(['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu'], axis=1, inplace=True)\n",
+    "result_data.drop(['Latency_P50', 'Latency_P75', 'Latency_P90', 'Latency_P95'], axis=1, inplace=True)\n",
+    "cols = result_data.columns.tolist()\n",
+    "cols = cols[-4:] + cols[:-4]\n",
+    "result_data = result_data[cols]\n",
+    "result_data"
+   ],
   "outputs": [
    {
-     "name": "stdout",
     "output_type": "stream",
+     "name": "stdout",
     "text": [
-      "./onnx_models\\perf_results_CPU_B1_S128_20200728-165907.txt\n"
+      "./onnx_models/perf_results_CPU_B1_S128_20210901-191653.txt\n"
     ]
    },
    {
+     "output_type": "execute_result",
     "data": {
+      "text/plain": [
+       "   Latency(ms)  Latency_P99  Throughput(QPS)  intra_op_num_threads\n",
+       "0       363.76       550.04             2.75                    24"
+      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
@ -525,62 +579,33 @@
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
-       "      <th>intra_op_num_threads</th>\n",
-       "      <th>OMP_NUM_THREADS</th>\n",
-       "      <th>OMP_WAIT_POLICY</th>\n",
-       "      <th>contiguous</th>\n",
       "      <th>Latency(ms)</th>\n",
       "      <th>Latency_P99</th>\n",
       "      <th>Throughput(QPS)</th>\n",
+       "      <th>intra_op_num_threads</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
-       "      <td>1</td>\n",
-       "      <td>12</td>\n",
-       "      <td>ACTIVE</td>\n",
-       "      <td>None</td>\n",
-       "      <td>97.93</td>\n",
-       "      <td>158.16</td>\n",
-       "      <td>10.21</td>\n",
+       "      <td>363.76</td>\n",
+       "      <td>550.04</td>\n",
+       "      <td>2.75</td>\n",
+       "      <td>24</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
-      ],
-      "text/plain": [
-       "   intra_op_num_threads OMP_NUM_THREADS OMP_WAIT_POLICY contiguous  \\\n",
-       "0                     1              12          ACTIVE       None   \n",
-       "\n",
-       "   Latency(ms)  Latency_P99  Throughput(QPS)  \n",
-       "0        97.93       158.16            10.21  "
      ]
     },
-     "execution_count": 17,
     "metadata": {},
-     "output_type": "execute_result"
+     "execution_count": 16
    }
   ],
-   "source": [
-    "import glob     \n",
-    "import pandas\n",
-    "\n",
-    "latest_result_file = max(glob.glob(os.path.join(output_dir, \"perf_results_*.txt\")), key=os.path.getmtime)\n",
-    "result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\n",
-    "print(latest_result_file)\n",
-    "\n",
-    "result_data.drop(['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu', 'warmup'], axis=1, inplace=True)\n",
-    "result_data.drop(['Latency_P50', 'Latency_P75', 'Latency_P90', 'Latency_P95'], axis=1, inplace=True)\n",
-    "cols = result_data.columns.tolist()\n",
-    "cols = cols[-4:] + cols[:-4]\n",
-    "result_data = result_data[cols]\n",
-    "result_data"
-   ]
+   "metadata": {}
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
   "source": [
    "## 6. Additional Info\n",
    "\n",
@ -592,77 +617,99 @@
    "\n",
    "Here is the machine configuration that generated the above results. The machine has GPU but not used in CPU inference.\n",
    "You might get slower or faster result based on your hardware."
-   ]
+   ],
+   "metadata": {}
  },
  {
   "cell_type": "code",
-   "execution_count": 18,
-   "metadata": {},
+   "execution_count": 17,
+   "source": [
+    "!{sys.executable} -m pip install --quiet py-cpuinfo py3nvml\n",
+    "!{sys.executable} -m onnxruntime.transformers.machine_info --silent"
+   ],
   "outputs": [
    {
-     "name": "stdout",
     "output_type": "stream",
+     "name": "stdout",
     "text": [
      "{\n",
      "  \"gpu\": {\n",
-      "    \"driver_version\": \"442.23\",\n",
+      "    \"driver_version\": \"455.45.01\",\n",
      "    \"devices\": [\n",
      "      {\n",
-      "        \"memory_total\": 8589934592,\n",
-      "        \"memory_available\": 8480882688,\n",
-      "        \"name\": \"GeForce GTX 1070\"\n",
+      "        \"memory_total\": 16945512448,\n",
+      "        \"memory_available\": 16941318144,\n",
+      "        \"name\": \"Tesla V100-PCIE-16GB\"\n",
+      "      },\n",
+      "      {\n",
+      "        \"memory_total\": 16945512448,\n",
+      "        \"memory_available\": 16941318144,\n",
+      "        \"name\": \"Tesla V100-PCIE-16GB\"\n",
+      "      },\n",
+      "      {\n",
+      "        \"memory_total\": 16945512448,\n",
+      "        \"memory_available\": 16941318144,\n",
+      "        \"name\": \"Tesla V100-PCIE-16GB\"\n",
+      "      },\n",
+      "      {\n",
+      "        \"memory_total\": 16945512448,\n",
+      "        \"memory_available\": 5449973760,\n",
+      "        \"name\": \"Tesla V100-PCIE-16GB\"\n",
      "      }\n",
      "    ]\n",
      "  },\n",
      "  \"cpu\": {\n",
-      "    \"brand\": \"Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz\",\n",
-      "    \"cores\": 6,\n",
-      "    \"logical_cores\": 12,\n",
-      "    \"hz\": \"3.1920 GHz\",\n",
-      "    \"l2_cache\": \"1536 KB\",\n",
-      "    \"l3_cache\": \"12288 KB\",\n",
-      "    \"processor\": \"Intel64 Family 6 Model 158 Stepping 10, GenuineIntel\"\n",
+      "    \"brand\": \"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\",\n",
+      "    \"cores\": 24,\n",
+      "    \"logical_cores\": 24,\n",
+      "    \"hz\": \"2593997000,0\",\n",
+      "    \"l2_cache\": 262144,\n",
+      "    \"flags\": \"3dnowprefetch,abm,adx,aes,apic,avx,avx2,bmi1,bmi2,clflush,cmov,constant_tsc,cpuid,cx16,cx8,de,erms,f16c,fma,fpu,fsgsbase,fxsr,hle,ht,hypervisor,invpcid,invpcid_single,lahf_lm,lm,mca,mce,md_clear,mmx,movbe,msr,mtrr,nopl,nx,osxsave,pae,pat,pcid,pclmulqdq,pdpe1gb,pge,pni,popcnt,pse,pse36,pti,rdrand,rdrnd,rdseed,rdtscp,rep_good,rtm,sep,smap,smep,ss,sse,sse2,sse4_1,sse4_2,ssse3,syscall,tsc,vme,xsave,xsaveopt,xtopology\",\n",
+      "    \"processor\": \"x86_64\"\n",
      "  },\n",
      "  \"memory\": {\n",
-      "    \"total\": 16971259904,\n",
-      "    \"available\": 3480842240\n",
+      "    \"total\": 473403277312,\n",
+      "    \"available\": 413617876992\n",
+      "  },\n",
+      "  \"os\": \"Linux-4.15.0-1103-azure-x86_64-with-glibc2.17\",\n",
+      "  \"python\": \"3.8.11.final.0 (64 bit)\",\n",
+      "  \"packages\": {\n",
+      "    \"transformers\": \"4.9.2\",\n",
+      "    \"torch\": \"1.9.0\",\n",
+      "    \"tensorflow\": \"2.6.0\",\n",
+      "    \"sympy\": \"1.8\",\n",
+      "    \"protobuf\": \"3.17.3\",\n",
+      "    \"ort-nightly\": \"1.8.2.dev20210901001\",\n",
+      "    \"onnxconverter-common\": \"1.8.1\",\n",
+      "    \"onnx\": \"1.10.1\",\n",
+      "    \"numpy\": \"1.19.5\",\n",
+      "    \"flatbuffers\": \"1.12\"\n",
      "  },\n",
-      "  \"python\": \"3.6.10.final.0 (64 bit)\",\n",
-      "  \"os\": \"Windows-10-10.0.18362-SP0\",\n",
      "  \"onnxruntime\": {\n",
-      "    \"version\": \"1.4.0\",\n",
+      "    \"version\": \"1.8.2\",\n",
      "    \"support_gpu\": false\n",
      "  },\n",
      "  \"pytorch\": {\n",
-      "    \"version\": \"1.5.0+cpu\",\n",
-      "    \"support_gpu\": false\n",
+      "    \"version\": \"1.9.0+cu102\",\n",
+      "    \"support_gpu\": true,\n",
+      "    \"cuda\": \"10.2\"\n",
      "  },\n",
      "  \"tensorflow\": {\n",
-      "    \"version\": \"2.3.0\",\n",
-      "    \"git_version\": \"v2.3.0-rc2-23-gb36436b087\",\n",
+      "    \"version\": \"2.6.0\",\n",
+      "    \"git_version\": \"v2.6.0-rc2-32-g919f693420e\",\n",
      "    \"support_gpu\": true\n",
      "  }\n",
      "}\n"
     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "2020-07-28 16:59:18.638897: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll\n"
-     ]
    }
   ],
-   "source": [
-    "!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent"
-   ]
+   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "cpu_env",
-   "language": "python",
-   "name": "cpu_env"
+   "name": "python3",
+   "display_name": "Python 3.8.11 64-bit ('cpu_env': conda)"
  },
  "language_info": {
   "codemirror_mode": {
@ -674,9 +721,12 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.6.10"
+   "version": "3.8.11"
+  },
+  "interpreter": {
+   "hash": "074a0f2de953c1a97ea00d0646e5f1fda38562dc7d977159c4c016157763bc64"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
+}