Open Inference Protocol (V2 Inference Protocol)
The Open Inference Protocol, also known as KServe V2 Protocol, provides a standardized interface for model inference across different machine learning frameworks and serving systems. For an inference server to be compliant with this protocol, the server must implement the health, metadata, and inference V2 APIs.
Optional features are explicitly noted and are not required for compliance. A compliant inference server may choose to implement the HTTP/REST API and/or the gRPC API.
Overview
KServe's V2 protocol addresses several limitations of the V1 protocol, providing better performance and improved compatibility across different model frameworks and servers. The protocol supports both HTTP/REST and gRPC interfaces, offering flexibility in implementation.
Important Notes
- The V2 protocol does not currently support the explain endpoint that is available in the V1 protocol
- V2 adds standardized server readiness, liveness, and metadata endpoints
- V2 supports version-specific model endpoints
- All strings in all contexts are case-sensitive
- The V2 protocol supports an extension mechanism as a required part of the API
API Endpoints
HTTP/REST API
API | Verb | Path | Request Payload | Response Payload |
---|---|---|---|---|
Inference | POST | v2/models/<model_name>[/versions/<model_version>]/infer | Inference Request JSON Object | Inference Response JSON Object |
Model Metadata | GET | v2/models/<model_name>[/versions/<model_version>] | | Model Metadata Response JSON Object |
Server Readiness | GET | v2/health/ready | | Server Ready Response JSON Object |
Server Liveness | GET | v2/health/live | | Server Live Response JSON Object |
Server Metadata | GET | v2 | | Server Metadata Response JSON Object |
Model Readiness | GET | v2/models/<model_name>[/versions/<model_version>]/ready | | Model Ready Response JSON Object |
Note: Path contents in [] are optional.
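As a quick sketch of how these paths compose, the Python snippet below builds the endpoint URLs; the host, port, model name, and version are illustrative assumptions, not part of the protocol:

base = "http://localhost:8000"   # assumed server address
model, version = "mymodel", "1"  # assumed model name and version

server_metadata_url = f"{base}/v2"
server_ready_url = f"{base}/v2/health/ready"
server_live_url = f"{base}/v2/health/live"
model_metadata_url = f"{base}/v2/models/{model}"                        # version segment omitted
model_ready_url = f"{base}/v2/models/{model}/versions/{version}/ready"  # version segment included
infer_url = f"{base}/v2/models/{model}/versions/{version}/infer"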
gRPC API
API | rpc Endpoint | Request Message | Response Message |
---|---|---|---|
Inference | ModelInfer | ModelInferRequest | ModelInferResponse |
Model Ready | ModelReady | ModelReadyRequest | ModelReadyResponse |
Model Metadata | ModelMetadata | ModelMetadataRequest | ModelMetadataResponse |
Server Ready | ServerReady | ServerReadyRequest | ServerReadyResponse |
Server Live | ServerLive | ServerLiveRequest | ServerLiveResponse |
Server Metadata | ServerMetadata | ServerMetadataRequest | ServerMetadataResponse |
API Definitions
Health/Readiness/Liveness Probes
The Model Readiness probe answers the question "Did the model download and is it able to serve requests?" and responds with the available model name(s).
The Server Readiness/Liveness probes answer the question "Is my service and its infrastructure running, healthy, and able to receive and process requests?"
API | Definition |
---|---|
Inference | The /infer endpoint performs inference on a model. The response is the prediction result. |
Model Metadata | The "model metadata" API is a per-model endpoint that returns details about the model passed in the path. |
Server Ready | The "server ready" health API indicates if all the models are ready for inferencing. The "server ready" health API can be used directly to implement the Kubernetes readinessProbe |
Server Live | The "server live" health API indicates if the inference server is able to receive and respond to metadata and inference requests. The "server live" API can be used directly to implement the Kubernetes livenessProbe. |
Server Metadata | The "server metadata" API returns details describing the server. |
Model Ready | The "model ready" health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. |
Payload Contents
HTTP/REST API Payloads
The HTTP/REST API uses JSON for requests and responses. All JSON schemas use the standard JSON types $number, $string, $boolean, $object, and $array. Fields marked #optional indicate optional JSON fields.
Model Ready Response JSON Object
A successful model ready request is indicated by a 200 HTTP status code. The model ready response object is returned in the HTTP body.
{
  "name": "model_name",
  "ready": true
}

- name: The name of the model.
- ready: Boolean indicating whether the model is ready for inferencing.
Server Ready Response JSON Object
A successful server ready request is indicated by a 200 HTTP status code. The server ready response object is returned in the HTTP body.
{
  "ready": true
}

- ready: Boolean indicating whether the server is ready for inferencing.
Server Live Response JSON Object
A successful server live request is indicated by a 200 HTTP status code. The server live response object is returned in the HTTP body.
{
  "live": true
}

- live: Boolean indicating whether the server is live and able to receive and process requests.
Server Metadata Response JSON Object
A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object is returned in the HTTP body.
{
  "name": "inference_server_name",
  "version": "inference_server_version",
  "extensions": ["extension_name", "..."]
}

- name: A descriptive name for the server.
- version: The server version.
- extensions: The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions.
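Since extensions are server-specific, a client can inspect the server metadata before relying on any optional feature. A minimal sketch, assuming the requests library and a server at localhost:8000; the extension name checked for is purely illustrative:

import requests

resp = requests.get("http://localhost:8000/v2", timeout=2)  # assumed server address
resp.raise_for_status()
meta = resp.json()
print(f"server: {meta['name']} {meta['version']}")

# Only rely on optional behavior the server actually advertises;
# "binary_tensor_data" is an illustrative extension name, not a standard one.
if "binary_tensor_data" in meta.get("extensions", []):
    print("binary tensor data extension is available")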
Server Metadata Response JSON Error Object
A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the server metadata error response object.
{
  "error": "error message"
}

- error: The descriptive message for the error.
Inference Request JSON Object
An inference request is made with an HTTP POST to an inference endpoint. In the request, the HTTP body contains the Inference Request JSON Object.
{
  "id": "string", // optional
  "parameters": {}, // optional
  "inputs": [
    {
      "name": "string",
      "shape": [1, 2, 3],
      "datatype": "FP32",
      "parameters": {}, // optional
      "data": [1.0, 2.0, 3.0]
    }
  ],
  "outputs": [ // optional
    {
      "name": "string",
      "parameters": {} // optional
    }
  ]
}

- id: An optional identifier for this request. If specified, this identifier must be returned in the response.
- parameters: Optional parameters for the inference request, expressed as key/value pairs.
- inputs: The input tensors. Each input is described using the Request Input schema.
- outputs: Optional specifications of the required output tensors. If not specified, all outputs produced by the model are returned using default settings.
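A minimal client-side sketch that builds such a request and posts it with the requests library; the server address, model name, and tensor values are assumptions for illustration:

import requests

url = "http://localhost:8000/v2/models/mymodel/infer"  # assumed server address and model name
payload = {
    "id": "42",
    "inputs": [
        {
            "name": "input0",
            "shape": [2, 2],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],  # flattened, row-major order
        }
    ],
    # "outputs" is omitted, so all model outputs are returned with default settings.
}

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())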
Request Input
Each input in the inputs array is described using the following schema:

{
  "name": "string",
  "shape": [1, 2, 3],
  "datatype": "FP32",
  "parameters": {}, // optional
  "data": [1.0, 2.0, 3.0]
}

- name: The name of the input tensor.
- shape: The shape of the input tensor, as an array of integers.
- datatype: The data type of the input tensor elements, as defined in Tensor Data Types.
- parameters: Optional parameters for this input tensor, expressed as key/value pairs.
- data: The input tensor data as a JSON array. The array must contain the tensor elements in row-major order.
Request Output
Each output in the outputs array is described using the following schema:

{
  "name": "string",
  "parameters": {} // optional
}

- name: The name of the output tensor.
- parameters: Optional parameters for this output tensor, expressed as key/value pairs.
Inference Response JSON Object
In the corresponding response to an inference request, the HTTP body contains the Inference Response JSON Object or an Inference Response JSON Error Object.
{
  "model_name": "string",
  "model_version": "string", // optional
  "id": "string",
  "parameters": {}, // optional
  "outputs": [
    {
      "name": "string",
      "shape": [1, 2, 3],
      "datatype": "FP32",
      "parameters": {}, // optional
      "data": [4.0, 5.0, 6.0]
    }
  ]
}

- model_name: The name of the model that produced this response.
- model_version: Optional. The version of the model that produced this response.
- id: The identifier from the request, if specified.
- parameters: Optional response parameters expressed as key/value pairs.
- outputs: The output tensors, each described using the format specified in the response output schema.
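Because the data field is always flattened, clients typically restore the declared shape after parsing the response. A small sketch, assuming NumPy (which the protocol itself does not require) and an already-parsed response body:

import numpy as np

body = {  # example response body following the schema above
    "model_name": "mymodel",
    "id": "42",
    "outputs": [
        {"name": "output0", "shape": [3, 2], "datatype": "FP32",
         "data": [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]},
    ],
}

for out in body["outputs"]:
    # "data" is flattened in row-major order; reshape to the declared shape.
    tensor = np.asarray(out["data"]).reshape(out["shape"])
    print(out["name"], out["datatype"], tensor.shape)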
Inference Response JSON Error Object
A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the Inference Response JSON Error Object:
{
  "error": "error message"
}

- error: A descriptive message for the error.
Model Metadata Response JSON Object
A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response, the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object.
A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object is returned in the HTTP body for every successful model metadata request.
{
  "name": "string",
  "versions": ["v1", "v2"], // optional
  "platform": "string",
  "inputs": [
    {
      "name": "string",
      "datatype": "FP32",
      "shape": [1, -1, 3]
    }
  ],
  "outputs": [
    {
      "name": "string",
      "datatype": "FP32",
      "shape": [1, 3]
    }
  ]
}

- name: The name of the model.
- versions: The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don't support explicitly requested versions.
- platform: The framework/backend for the model. See Platforms.
- inputs: The inputs required by the model, each described by a tensor metadata object.
- outputs: The outputs produced by the model, each described by a tensor metadata object.
Each model input and output tensor's metadata is described with a tensor metadata object:
{
  "name": "string",
  "datatype": "FP32",
  "shape": [1, -1, 3]
}

- name: The name of the tensor.
- datatype: The data type of the tensor elements, as defined in Tensor Data Types.
- shape: The shape of the tensor. Variable-size dimensions are specified as -1.
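To show how the metadata is typically consumed, here is a brief sketch that fetches a model's metadata and prints its input signature, treating -1 dimensions as variable; the server address and model name are assumptions:

import requests

resp = requests.get("http://localhost:8000/v2/models/mymodel", timeout=2)  # assumed address/model
resp.raise_for_status()
meta = resp.json()

print("platform:", meta["platform"], "versions:", meta.get("versions"))
for tensor in meta["inputs"]:
    # A dimension of -1 means the size is variable (for example, a dynamic batch dimension).
    dims = ["any" if d == -1 else str(d) for d in tensor["shape"]]
    print(f"input {tensor['name']}: {tensor['datatype']} [{', '.join(dims)}]")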
Model Metadata Response JSON Error Object
A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the Model Metadata Response JSON Error Object:
{
  "error": "error message"
}

- error: The descriptive message for the error.
Parameters
In the V2 protocol, parameters is an optional object containing zero or more parameters expressed as key/value pairs. These parameters can be provided in requests or returned in responses to control various aspects of inference and server behavior.
The protocol itself doesn't define any specific parameters, but individual inference servers may define and document the parameters they support. For example, a server might define parameters to control timeout, batching behavior, or other server-specific configurations.
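For illustration only, here is how server-specific parameters might appear in a request body built in Python; the parameter names (priority, scale) are hypothetical and must be replaced by whatever the target server documents:

payload = {
    "parameters": {"priority": 1},         # hypothetical, server-specific request parameter
    "inputs": [
        {
            "name": "input0",
            "shape": [1, 3],
            "datatype": "FP32",
            "parameters": {"scale": 0.5},  # hypothetical, server-specific tensor parameter
            "data": [0.1, 0.2, 0.3],
        }
    ],
}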
gRPC API
The gRPC API provides the same functionality as the HTTP/REST API but uses Protocol Buffers for more efficient serialization and communication. A compliant inference server may choose to implement the gRPC API in addition to or instead of the HTTP/REST API.
The V2 protocol's gRPC API is defined in the open_inference_grpc.proto file. This file defines the service and message formats for all API endpoints.
gRPC Service Definition
The gRPC API defines a GRPCInferenceService service with the following RPCs:
service GRPCInferenceService
{
  // Health
  rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
  rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
  rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}

  // Metadata
  rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
  rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}

  // Inference
  rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
}
gRPC Inference Payload Example
For the ModelInfer RPC, the request and response messages are defined as follows (simplified view):
message ModelInferRequest {
  string model_name = 1;
  string model_version = 2; // Optional
  string id = 3; // Optional
  map<string, InferParameter> parameters = 4; // Optional
  repeated ModelInferRequest.InferInputTensor inputs = 5;
  repeated ModelInferRequest.InferRequestedOutputTensor outputs = 6; // Optional
}

message ModelInferResponse {
  string model_name = 1;
  string model_version = 2; // Optional
  string id = 3; // Optional
  map<string, InferParameter> parameters = 4; // Optional
  repeated ModelInferResponse.InferOutputTensor outputs = 5;
}
The gRPC API provides the same functionality as the HTTP/REST API but with the performance advantages of gRPC's binary serialization and streaming capabilities.
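A sketch of a Python gRPC client follows. It assumes stubs have been generated from open_inference_grpc.proto (for example with grpcio-tools), that the generated module names match the imports below, and that FP32 input data is carried in the fp32_contents field of the tensor contents message as in the KServe proto; adjust these details to your generated code and server address.

import grpc

# Modules assumed to be generated from open_inference_grpc.proto, e.g.:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. open_inference_grpc.proto
# The module names below depend on the proto file name and are assumptions.
import open_inference_grpc_pb2 as pb
import open_inference_grpc_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("localhost:9000")  # assumed gRPC address
stub = pb_grpc.GRPCInferenceServiceStub(channel)

# Liveness check
print("live:", stub.ServerLive(pb.ServerLiveRequest()).live)

# Build a ModelInferRequest with one FP32 input tensor.
request = pb.ModelInferRequest(model_name="mymodel", id="42")  # assumed model name
tensor = request.inputs.add()
tensor.name = "input0"
tensor.datatype = "FP32"
tensor.shape.extend([2, 2])
tensor.contents.fp32_contents.extend([1.0, 2.0, 3.0, 4.0])  # flattened, row-major

response = stub.ModelInfer(request)
print(response.model_name, [out.name for out in response.outputs])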
Tensor Data Types
The protocol supports various tensor data types for both input and output:
Data Type | Size (bytes) |
---|---|
BOOL | 1 |
UINT8 | 1 |
UINT16 | 2 |
UINT32 | 4 |
UINT64 | 8 |
INT8 | 1 |
INT16 | 2 |
INT32 | 4 |
INT64 | 8 |
FP16 | 2 |
FP32 | 4 |
FP64 | 8 |
BYTES | Variable (max 2^32) |
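Client code often needs to map these protocol datatype strings to in-memory dtypes. The mapping below is a client-side convention (NumPy is an assumption, not part of the protocol); BYTES is omitted because its elements are variable-length:

import numpy as np

# Client-side convenience mapping from protocol datatype strings to NumPy dtypes.
DTYPE_MAP = {
    "BOOL": np.bool_,
    "UINT8": np.uint8, "UINT16": np.uint16, "UINT32": np.uint32, "UINT64": np.uint64,
    "INT8": np.int8, "INT16": np.int16, "INT32": np.int32, "INT64": np.int64,
    "FP16": np.float16, "FP32": np.float32, "FP64": np.float64,
}

print(DTYPE_MAP["FP32"])  # <class 'numpy.float32'>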
Tensor Data Representation
Tensor data in all representations must be:
- Flattened to a one-dimensional, row-major order of the tensor elements
- Without any stride or padding between elements
- In "linear" order
For JSON formats, data can be provided in a nested array matching the tensor shape or as a flattened array.
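As an illustration of the flattening rule, the sketch below turns a 2x3 NumPy array into the data field of a request input; the tensor name is an assumption:

import numpy as np

batch = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]], dtype=np.float32)

# Row-major ("C" order) flattening, with no stride or padding between elements.
flat = batch.flatten(order="C").tolist()

request_input = {
    "name": "input0",            # assumed tensor name
    "shape": list(batch.shape),  # [2, 3]
    "datatype": "FP32",
    "data": flat,                # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
}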
Platforms
The protocol supports various ML platforms, identified using the format <project>_<format>:

- tensorrt_plan: A TensorRT model encoded as a serialized engine or "plan"
- tensorflow_graphdef: A TensorFlow model encoded as a GraphDef
- tensorflow_savedmodel: A TensorFlow model encoded as a SavedModel
- onnx_onnxv1: An ONNX model encoded for ONNX Runtime
- pytorch_torchscript: A PyTorch model encoded as TorchScript
- mxnet_mxnet: An MXNet model
- caffe2_netdef: A Caffe2 model encoded as a NetDef
Examples
Here are examples of the various API calls in the V2 protocol.
HTTP/REST Examples
Inference Example
Request:
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "id": "42",
  "inputs": [
    {
      "name": "input0",
      "shape": [2, 2],
      "datatype": "UINT32",
      "data": [1, 2, 3, 4]
    },
    {
      "name": "input1",
      "shape": [3],
      "datatype": "BOOL",
      "data": [true, false, true]
    }
  ],
  "outputs": [
    {
      "name": "output0"
    }
  ]
}
Response:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>
{
  "id": "42",
  "outputs": [
    {
      "name": "output0",
      "shape": [3, 2],
      "datatype": "FP32",
      "data": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
    }
  ]
}
Model Metadata Example
Request:
GET /v2/models/mymodel HTTP/1.1
Host: localhost:8000
Response:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <zz>
{
  "name": "mymodel",
  "versions": ["1", "2"],
  "platform": "pytorch_torchscript",
  "inputs": [
    {
      "name": "input0",
      "datatype": "UINT32",
      "shape": [2, 2]
    },
    {
      "name": "input1",
      "datatype": "BOOL",
      "shape": [3]
    }
  ],
  "outputs": [
    {
      "name": "output0",
      "datatype": "FP32",
      "shape": [3, 2]
    }
  ]
}
Server Ready Example
Request:
GET /v2/health/ready HTTP/1.1
Host: localhost:8000
Response:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <aa>
{
  "ready": true
}
Benefits of V2 Protocol
- Standardized Interfaces: Consistent API across different ML frameworks
- Health and Readiness: Built-in health checking and readiness probes
- Metadata Support: Rich model and server metadata
- Performance: Optimized for high-throughput inference
- Dual Interface: Support for both RESTful and gRPC APIs
- Binary Data: Optional binary data representation for better performance
- Version Support: Explicit model version management
Extensions
The V2 Protocol supports extensions for additional functionality:
- Binary Tensor Data Extension - For high-performance data transfer
- Other extensions may be proposed separately
Next Steps
- Learn about the Binary Tensor Data Extension
- Explore the V1 Protocol if you need the explain functionality
- Check the KServe website for more information about compatible serving runtimes