# Deploying ONNX Models with KServe
This guide demonstrates how to deploy an ONNX model using KServe's InferenceService
and run inference on the deployed model.
## Prerequisites
Before you begin, make sure you have:
- A Kubernetes cluster with KServe installed. If not, follow the KServe installation guide.
- The `kubectl` CLI configured to communicate with your cluster.
## Create the InferenceService
Create an `InferenceService` resource to deploy your ONNX model:
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "style-sample"
spec:
  predictor:
    model:
      protocolVersion: v2
      modelFormat:
        name: onnx
      storageUri: "gs://kfserving-examples/models/onnx"
      resources:
        requests:
          cpu: "100m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```
For the default KServe installation, when using the new schema, you must specify `protocolVersion: v2` for ONNX models; otherwise you will get a "no runtime found" error.
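If you do hit a "no runtime found" error, it is worth checking which serving runtimes are installed and which model formats and protocol versions they support. A quick check, assuming the default installation that ships cluster-wide runtimes:

```bash
# List the ClusterServingRuntimes bundled with the default KServe install
kubectl get clusterservingruntimes
```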
Save this configuration to a file named `onnx.yaml` and apply it:
```bash
kubectl apply -f onnx.yaml
```

Expected output:

```
inferenceservice.serving.kserve.io/style-sample configured
```
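Before sending requests, wait for the InferenceService to report ready. One way to check, assuming the resource lives in your current namespace:

```bash
# Block until the InferenceService reaches the Ready condition, then show its URL
kubectl wait --for=condition=Ready inferenceservice/style-sample --timeout=300s
kubectl get inferenceservice style-sample
```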
## Run a Sample Inference
### 1. Set Up Environment Variables
First, determine the ingress IP and ports, then set the `INGRESS_HOST` and `INGRESS_PORT` environment variables.
```bash
export ISVC_NAME=style-sample
export SERVICE_HOSTNAME=$(kubectl get inferenceservice ${ISVC_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```
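The `localhost`/`8080` values above assume you are port-forwarding the ingress gateway to your machine rather than reaching it through an external load balancer. With the default Istio-based installation, that port-forward looks roughly like this (the service name and namespace may differ in your cluster):

```bash
# Forward the Istio ingress gateway to localhost:8080 (adjust to your ingress setup)
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
```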
### 2. Verify the Service is Healthy
Check that your service is ready to receive requests:
```bash
curl -v -H "Host:${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/health/ready
```
### 3. Install Dependencies
If you haven't already, install the required Python packages:
```bash
pip install numpy pillow protobuf requests
```
### 4. Run Inference Using Python
Below is the code to perform inference with the deployed ONNX model. This code is adapted from the original notebook example to work directly in the documentation. We'll use a sample image for the style transfer example.
#### Load and Preprocess the Image
```python
from PIL import Image
import numpy as np
import requests
import json
import os

# Load & resize image
image = Image.open("image.jpg")
image = image.resize((224, 224), Image.LANCZOS)

# Preprocess image data
norm_img_data = np.array(image).astype('float32')
norm_img_data = np.transpose(norm_img_data, [2, 0, 1])
norm_img_data = np.expand_dims(norm_img_data, axis=0)
```
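The snippet above assumes `image.jpg` is an RGB image. If your input might be grayscale or contain an alpha channel, converting it before the transpose avoids a shape mismatch:

```python
# Ensure exactly 3 channels so the transpose yields shape (3, 224, 224)
image = image.convert("RGB")
```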
#### Create the Request Message
```python
# Create request message to be sent to the predictor
message_data = {}
inputs = {}
message_data["inputs"] = []
inputs["name"] = "input1"
inputs["shape"] = norm_img_data.shape
inputs["datatype"] = "FP32"  # ONNX model expects float32
inputs["data"] = norm_img_data.tolist()
message_data["inputs"].append(inputs)
```
#### Call the Predictor and Process the Response
```python
# Define variables (should match what you set earlier)
service_hostname = os.environ.get("SERVICE_HOSTNAME")
model_name = "style"  # The model name used by the ONNX sample
ingress_ip = os.environ.get("INGRESS_HOST", "localhost")
ingress_port = os.environ.get("INGRESS_PORT", "8080")

# Call predictor
predictor_url = f"http://{ingress_ip}:{ingress_port}/v2/models/{model_name}/infer"
request_headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'Host': service_hostname
}
response = requests.post(
    predictor_url,
    headers=request_headers,
    data=json.dumps(message_data)
)
print(f"Response status code: {response.status_code}")

# Process response
if response.status_code == 200:
    response_message = json.loads(response.text)
    output1 = np.array(response_message["outputs"][0]['data'], dtype=np.float32)
    output1 = output1.reshape(3, 224, 224)

    # Postprocess
    result = np.clip(output1, 0, 255)
    result = result.transpose(1, 2, 0).astype("uint8")
    img = Image.fromarray(result)

    # Display or save the image
    img.save("output_image.jpg")
    print("Stylized image saved as 'output_image.jpg'")
else:
    print(f"Error: {response.text}")
```
## Complete Jupyter Notebook Example
If you prefer to run this as a Jupyter notebook, we've provided a complete example in the mosaic-onnx.ipynb file. The notebook includes:
- Setting up environment variables
- Loading and preprocessing the image
- Creating the request message
- Sending the inference request
- Processing the response
- Visualizing the stylized result
To run the notebook, first install the required dependencies from the requirements.txt file:
```bash
pip install -r requirements.txt
```
Then launch Jupyter and open the notebook:
```bash
jupyter notebook mosaic-onnx.ipynb
```
## Uploading Your Own Model
### Store your trained model on cloud storage in a Model Repository
Once the model is exported as an ONNX model file, the next step is to upload it to a GCS bucket. Triton supports loading multiple models, so it expects a model repository that follows a required layout in the bucket:
```
<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
```
The sample model used in this example is already uploaded and available for use. However, if you would like to use your own ONNX model, follow these steps:
- Export your model in ONNX format.
- Upload your model as `model.onnx` to a cloud storage service (S3, GCS, or Azure Blob); see the sketch after this list.
- Update the `storageUri` in the InferenceService YAML to point to your model location.
- Apply the updated YAML configuration.
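As a sketch of the upload and `storageUri` steps above, assuming a hypothetical GCS bucket named `your-bucket` and the repository layout described earlier:

```bash
# Hypothetical bucket and paths -- substitute your own storage location
gsutil cp model.onnx gs://your-bucket/models/onnx/style/1/model.onnx

# Then point the InferenceService at the repository root, e.g.:
#   storageUri: "gs://your-bucket/models/onnx"
```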
Make sure your model is compatible with ONNX Runtime and follows the expected input/output formats.