From 950b863e22ecfe8a4a36cf133e0eee8591fe0f8c Mon Sep 17 00:00:00 2001
From: Klein Hu
Date: Fri, 19 Jul 2019 17:15:01 -0700
Subject: [PATCH] Update ONNX Runtime Server documents for build and usage. (#1444)

---
 BUILD.md                          |   1 +
 docs/ONNX_Runtime_Server_Usage.md | 103 +++++++++---------------------
 2 files changed, 30 insertions(+), 74 deletions(-)

diff --git a/BUILD.md b/BUILD.md
index ed7764aee6..54499df318 100644
--- a/BUILD.md
+++ b/BUILD.md
@@ -62,6 +62,7 @@ The complete list of build options can be found by running `./build.sh (or ./bui
 1. ONNX Runtime server (and only the server) requires you to have Go installed to build, due to building BoringSSL. See https://golang.org/doc/install for installation instructions.
 2. In the ONNX Runtime root folder, run `./build.sh --config RelWithDebInfo --build_server --use_openmp --parallel`
+3. ONNX Runtime Server supports sending log to [rsyslog](https://www.rsyslog.com/) daemon. To enable it, please build with an additional parameter: `--cmake_extra_defines onnxruntime_USE_SYSLOG=1`. The build command will look like this: `./build.sh --config RelWithDebInfo --build_server --use_openmp --parallel --cmake_extra_defines onnxruntime_USE_SYSLOG=1`

 ## Build/Test Flavors for CI

diff --git a/docs/ONNX_Runtime_Server_Usage.md b/docs/ONNX_Runtime_Server_Usage.md
index 9d6a25a368..aa6a7c1b84 100644
--- a/docs/ONNX_Runtime_Server_Usage.md
+++ b/docs/ONNX_Runtime_Server_Usage.md
@@ -1,11 +1,14 @@

Note: ONNX Runtime Server is still in beta state. It's currently not ready for production environments.

-# How to Use ONNX Runtime Server REST API for Prediction
+# How to Use ONNX Runtime Server for Prediction

-ONNX Runtime Server provides a REST API for prediction. The goal of the project is to make it easy to "host" any ONNX model as a RESTful service. The CLI command to start the service is shown below:
+ONNX Runtime Server provides an easy way to start an inferencing server for prediction with both HTTP and GRPC endpoints. The CLI command to start the server is shown below:

 ```
 $ ./onnxruntime_server
+Version:
+Commit ID:
+
 the option '--model_path' is required but missing
 Allowed options:
   -h [ --help ] Shows a help message and exits
@@ -15,21 +18,22 @@ Allowed options:
   --address arg (=0.0.0.0) The base HTTP address
   --http_port arg (=8001) HTTP port to listen to requests
   --num_http_threads arg (=<# of your cpu cores>) Number of http threads
-
-
+  --grpc_port arg (=50051) GRPC port to listen to requests
 ```

-Note: The only mandatory argument for the program here is `model_path`
+**Note**: The only mandatory argument for the program here is `model_path`

 ## Start the Server

-To host an ONNX model as a REST API server, run:
+To host an ONNX model as an inferencing server, simply run:

 ```
 ./onnxruntime_server --model_path ///
 ```

-The prediction URL is in this format:
+## HTTP Endpoint
+
+The prediction URL for HTTP endpoint is in this format:

 ```
 http://:/v1/models//versions/:predict
@@ -37,16 +41,20 @@ http://:/v1/models//versions/ 0. In the future, model_names and versions will be verified.

-## Request and Response Payload
+### Request and Response Payload

-An HTTP request can be a Protobuf message in two formats: binary or JSON. The HTTP request header field `Content-Type` tells the server how to handle the request and thus it is mandatory for all requests. Requests missing `Content-Type` will be rejected as `400 Bad Request`.
+The request and response need to be a protobuf message. The Protobuf definition can be found [here](https://github.com/Microsoft/onnxruntime/blob/master/onnxruntime/server/protobuf/predict.proto).
+
+A protobuf message could have two formats: binary and JSON. Usually the binary payload has better latency, in the meanwhile the JSON format is easy for human readability.
+
+The HTTP request header field `Content-Type` tells the server how to handle the request and thus it is mandatory for all requests. Requests missing `Content-Type` will be rejected as `400 Bad Request`.

 * For `"Content-Type: application/json"`, the payload will be deserialized as JSON string in UTF-8 format
 * For `"Content-Type: application/vnd.google.protobuf"`, `"Content-Type: application/x-protobuf"` or `"Content-Type: application/octet-stream"`, the payload will be consumed as protobuf message directly.

-The Protobuf definition can be found [here](https://github.com/Microsoft/onnxruntime/blob/master/onnxruntime/server/protobuf/predict.proto).
+Clients can control the response type by setting the request with an `Accept` header field and the server will serialize in your desired format. The choices currently available are the same as the `Content-Type` header field. If this field is not set in the request, the server will use the same type as your request.

-## Inferencing
+### Inferencing

 To send a request to the server, you can use any tool which supports making HTTP requests. Here is an example using `curl`:

@@ -60,15 +68,17 @@ or
 curl -X POST --data-binary "@predict_request_0.pb" -H "Content-Type: application/octet-stream" -H "Foo: 1234" http://127.0.0.1:8001/v1/models/mymodel/versions/3:predict
 ```

-Clients can control the response type by setting the request with an `Accept` header field and the server will serialize in your desired format. The choices currently available are the same as the `Content-Type` header field.
-
-## Interactive tutorial notebook
+### Interactive tutorial notebook

 A simple Jupyter notebook demonstrating the usage of ONNX Runtime server to host an ONNX model and perform inferencing can be found [here](https://github.com/onnx/tutorials/blob/master/tutorials/OnnxRuntimeServerSSDModel.ipynb).

+## GRPC Endpoint
+
+If you prefer using the GRPC endpoint, the protobuf could be found [here](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/server/protobuf/prediction_service.proto). You could generate your client and make a GRPC call to it. To learn more about how to generate the client code and call to the server, please refer to [the tutorials of GRPC](https://grpc.io/docs/tutorials/).
+
 ## Advanced Topics

-### Number of HTTP Threads
+### Number of Worker Threads

 You can change this to optimize server utilization. The default is the number of CPU cores on the host machine.

@@ -79,66 +89,11 @@ For easy tracking of requests, we provide the following header fields:
 * `x-ms-request-id`: will be in the response header, no matter the request result. It will be a GUID/uuid with dash, e.g. `72b68108-18a4-493c-ac75-d0abd82f0a11`. If the request headers contain this field, the value will be ignored.
 * `x-ms-client-request-id`: a field for clients to tracking their requests. The content will persist in the response headers.

-Here is an example of a client sending a request:
+### rsyslog Support

-#### Client Side
+If you prefer using an ONNX Runtime Server with [rsyslog](https://www.rsyslog.com/) support([build instruction](https://github.com/microsoft/onnxruntime/blob/master/BUILD.md#build-onnx-runtime-server-on-linux)), you should be able to see the log in `/var/log/syslog` after the ONNX Runtime Server runs. For detail about how to use rsyslog, please reference [here](https://www.rsyslog.com/category/guides-for-rsyslog/).

-```
-$ curl -v -X POST --data-binary "@predict_request_0.pb" -H "Content-Type: application/octet-stream" -H "Foo: 1234" -H "x-ms-client-request-id: my-request-001" -H "Accept: application/json" http://127.0.0.1:8001/v1/models/mymodel/versions/3:predict
-Note: Unnecessary use of -X or --request, POST is already inferred.
-* Trying 127.0.0.1...
-* Connected to 127.0.0.1 (127.0.0.1) port 8001 (#0)
-> POST /v1/models/mymodel/versions/3:predict HTTP/1.1
-> Host: 127.0.0.1:8001
-> User-Agent: curl/7.47.0
-> Content-Type: application/octet-stream
-> x-ms-client-request-id: my-request-001
-> Accept: application/json
-> Content-Length: 3179
-> Expect: 100-continue
->
-* Done waiting for 100-continue
-* We are completely uploaded and fine
-< HTTP/1.1 200 OK
-< Content-Type: application/json
-< x-ms-request-id: 72b68108-18a4-493c-ac75-d0abd82f0a11
-< x-ms-client-request-id: my-request-001
-< Content-Length: 159
-<
-* Connection #0 to host 127.0.0.1 left intact
-{"outputs":{"Sample_Output_Name":{"dims":["1","10"],"dataType":1,"rawData":"6OpzRFquGsSFdM1FyAEnRFtRZcRa9NDEUBj0xI4ydsJIS0LE//CzxA==","dataLocation":"DEFAULT"}}}%
-```
+## Report Issues

-#### Server Side
+If you see any issues or want to ask questions about the server, please feel free to do so in this repo with the version and commit id from the command line.

-And here is what the output on the server side looks like with logging level of verbose:
-
-```
-2019-04-04 23:48:26.395200744 [V:onnxruntime:72b68108-18a4-493c-ac75-d0abd82f0a11, predict_request_handler.cc:40 Predict] Name: mymodel Version: 3 Action: predict
-2019-04-04 23:48:26.395289437 [V:onnxruntime:72b68108-18a4-493c-ac75-d0abd82f0a11, predict_request_handler.cc:46 Predict] x-ms-client-request-id: [my-request-001]
-2019-04-04 23:48:26.395540707 [I:onnxruntime:InferenceSession, inference_session.cc:736 Run] Running with tag: 72b68108-18a4-493c-ac75-d0abd82f0a11
-2019-04-04 23:48:26.395596858 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, inference_session.cc:976 CreateLoggerForRun] Created logger for run with id of 72b68108-18a4-493c-ac75-d0abd82f0a11
-2019-04-04 23:48:26.395731391 [I:onnxruntime:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:42 Execute] Begin execution
-2019-04-04 23:48:26.395763319 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:45 Execute] Size of execution plan vector: 12
-2019-04-04 23:48:26.396228981 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Convolution28
-2019-04-04 23:48:26.396580161 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Plus30
-2019-04-04 23:48:26.396623732 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 10
-2019-04-04 23:48:26.396878822 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: ReLU32
-2019-04-04 23:48:26.397091882 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Pooling66
-2019-04-04 23:48:26.397126243 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 11
-2019-04-04 23:48:26.397772701 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Convolution110
-2019-04-04 23:48:26.397818174 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 13
-2019-04-04 23:48:26.398060592 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Plus112
-2019-04-04 23:48:26.398095300 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 14
-2019-04-04 23:48:26.398257563 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: ReLU114
-2019-04-04 23:48:26.398426740 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Pooling160
-2019-04-04 23:48:26.398466031 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 15
-2019-04-04 23:48:26.398542823 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Times212_reshape0
-2019-04-04 23:48:26.398599687 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Times212_reshape1
-2019-04-04 23:48:26.398692631 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Times212
-2019-04-04 23:48:26.398731471 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 17
-2019-04-04 23:48:26.398832735 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:156 Execute] Releasing node ML values after computing kernel: Plus214
-2019-04-04 23:48:26.398873229 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:197 ReleaseNodeMLValues] Releasing mlvalue with index: 19
-2019-04-04 23:48:26.398922929 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:160 Execute] Fetching output.
-2019-04-04 23:48:26.398956560 [V:VLOG1:72b68108-18a4-493c-ac75-d0abd82f0a11, sequential_executor.cc:163 Execute] Done with execution.
-```
\ No newline at end of file
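
Reviewer note: as a rough illustration of the HTTP endpoint this patch documents, the Python sketch below builds (but does not send) a prediction request using only the standard library. The helper name `build_predict_request` and the constant `ALLOWED_TYPES` are hypothetical and not part of ONNX Runtime Server; the URL shape is copied from the `curl` example in the usage document, and the header names (`Content-Type`, `Accept`, `x-ms-client-request-id`) are the ones the document describes.

```python
import urllib.request

# Content types the usage document lists as accepted by the server.
ALLOWED_TYPES = {
    "application/json",
    "application/vnd.google.protobuf",
    "application/x-protobuf",
    "application/octet-stream",
}


def build_predict_request(host, port, model_name, version, payload,
                          content_type="application/octet-stream",
                          accept=None, client_request_id=None):
    """Build a POST request for the :predict URL (hypothetical helper).

    The request is only constructed here, never sent.
    """
    if content_type not in ALLOWED_TYPES:
        # The document says a missing Content-Type is rejected with 400,
        # so an unlisted type is refused before any request is built.
        raise ValueError(f"unsupported Content-Type: {content_type}")

    # URL shape taken from the curl example in the usage document.
    url = (f"http://{host}:{port}/v1/models/{model_name}"
           f"/versions/{version}:predict")

    headers = {"Content-Type": content_type}
    if accept is not None:
        # Optional: the document says the server falls back to the
        # request's own type when Accept is not set.
        headers["Accept"] = accept
    if client_request_id is not None:
        # Per the document, this value persists in the response headers.
        headers["x-ms-client-request-id"] = client_request_id

    return urllib.request.Request(url, data=payload, headers=headers,
                                  method="POST")


if __name__ == "__main__":
    req = build_predict_request("127.0.0.1", 8001, "mymodel", 3, b"",
                                accept="application/json",
                                client_request_id="my-request-001")
    print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing the returned object to `urllib.request.urlopen` while a server is listening; that step is left out so the sketch runs without one.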