Deploying TensorFlow Models with KServe
This guide walks you through deploying a TensorFlow model with KServe's InferenceService. You'll learn how to serve the model over both HTTP/REST and gRPC endpoints and how to roll out model updates with a canary strategy.
Prerequisites
Before you begin, make sure you have:
- A Kubernetes cluster with KServe installed.
- The kubectl CLI configured to communicate with your cluster.
- Basic knowledge of Kubernetes concepts and TensorFlow SavedModels.
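To confirm KServe is actually installed before you start, a quick sanity check (assuming KServe was installed into its default kserve namespace) is to look for the InferenceService CRD and the controller pods:
kubectl get crd inferenceservices.serving.kserve.io
kubectl get pods -n kserve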
Creating the InferenceService with V1 REST Endpoints
Create an InferenceService resource that specifies the tensorflow model format and a storageUri pointing to your saved TensorFlow model:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
Apply the YAML configuration to create the InferenceService:
kubectl apply -f tensorflow.yaml
inferenceservice.serving.kserve.io/flower-sample created
Wait for the InferenceService to reach the Ready state:
kubectl get isvc flower-sample
You should see output similar to:
NAME            URL                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
flower-sample   http://flower-sample.default.example.com   True           100                              flower-sample-predictor-default-n9zs6   7m15s
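If you prefer to block until the service becomes ready rather than re-running kubectl get, kubectl wait can watch the Ready condition (a convenience sketch; adjust the timeout to your environment):
kubectl wait --for=condition=Ready inferenceservice/flower-sample --timeout=300s
If the service never becomes ready, kubectl describe isvc flower-sample usually surfaces the underlying events.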
Running a Prediction
To test your deployed model:
- First, determine the ingress IP and port for your cluster.
- Set the INGRESS_HOST and INGRESS_PORT environment variables accordingly (one common setup is sketched below).
- Send a prediction request with sample input using the curl command that follows.
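How you determine these values depends on your ingress. As one common sketch, with an Istio ingress gateway exposed through a LoadBalancer service in the istio-system namespace:
# Assumes the Istio ingress gateway service; adjust the namespace and service name to your setup.
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')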
MODEL_NAME=flower-sample
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/flower-sample:predict HTTP/1.1
> Host: flower-sample.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 16201
> Content-Type: application/json
>
* upload completely sent off: 16201 out of 16201 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 222
< content-type: application/json
< date: Sun, 31 Jan 2021 01:01:50 GMT
< x-envoy-upstream-service-time: 280
< server: istio-envoy
<
{
  "predictions": [
    {
      "scores": [0.999114931, 9.20987877e-05, 0.000136786213, 0.000337257545, 0.000300532585, 1.84813616e-05],
      "prediction": 0,
      "key": " 1"
    }
  ]
}
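The sample input.json carries a base64-encoded flower image under image_bytes.b64 plus a key field. If you want to build such a payload from your own image, a minimal sketch (flower.jpg is a hypothetical file name):
# GNU coreutils base64; on macOS use `base64 -i flower.jpg` instead of -w0
IMAGE_B64=$(base64 -w0 flower.jpg)
cat <<EOF > input.json
{"instances": [{"image_bytes": {"b64": "${IMAGE_B64}"}, "key": "1"}]}
EOF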
Creating the InferenceService with gRPC Endpoints
KServe also supports gRPC for inference requests. To create an InferenceService that exposes a gRPC endpoint, use the configuration that matches your deployment mode.
For Serverless mode:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-grpc"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      ports:
        - containerPort: 9000
          name: h2c # Knative expects the gRPC port name to be 'h2c'
          protocol: TCP

For Raw Deployment mode:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-grpc"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      ports:
        - containerPort: 9000
          name: grpc-port # KServe requires the port name to be in the format <protocol>[-<suffix>]
          protocol: TCP
Apply the YAML configuration to create the gRPC InferenceService:
kubectl apply -f grpc.yaml
inferenceservice.serving.kserve.io/flower-grpc created
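As with the REST example, confirm that the gRPC InferenceService reports Ready before sending requests:
kubectl get isvc flower-grpc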
Running a gRPC Prediction
To run predictions using the gRPC endpoint, you'll need to:
- Set up a Python virtual environment with TensorFlow Serving API:
# The prediction script is written in TensorFlow 1.x
pip install "tensorflow-serving-api>=1.14.0,<2.0.0"
- Create a Python script named grpc_client.py to handle gRPC requests. Below is an example client for making gRPC requests to your deployed TensorFlow model:
import argparse
import json
import base64

import grpc
from tensorflow.contrib.util import make_tensor_proto
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc


def predict(host, port, hostname, model, signature_name, input_path):
    # If hostname not set, we assume the host is a valid knative dns.
    if hostname:
        host_option = (('grpc.ssl_target_name_override', hostname,),)
    else:
        host_option = None
    channel = grpc.insecure_channel(target='{host}:{port}'.format(host=host, port=port), options=host_option)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    with open(input_path) as json_file:
        data = json.load(json_file)
        image = data['instances'][0]['image_bytes']['b64']
        key = data['instances'][0]['key']

    # Call classification model to make prediction
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model
    request.model_spec.signature_name = signature_name
    image = base64.b64decode(image)
    request.inputs['image_bytes'].CopyFrom(
        make_tensor_proto(image, shape=[1]))
    request.inputs['key'].CopyFrom(make_tensor_proto(key, shape=[1]))
    result = stub.Predict(request, 10.0)
    print(result)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', help='Ingress Host Name', default='localhost', type=str)
    parser.add_argument('--port', help='Ingress Port', default=80, type=int)
    parser.add_argument('--model', help='TensorFlow Model Name', type=str)
    parser.add_argument('--signature_name', help='Signature name of saved TensorFlow model',
                        default='serving_default', type=str)
    parser.add_argument('--hostname', help='Service Host Name', default='', type=str)
    parser.add_argument('--input_path', help='Prediction data input path', default='./input.json', type=str)

    args = parser.parse_args()
    predict(args.host, args.port, args.hostname, args.model, args.signature_name, args.input_path)
- Run the gRPC prediction script with the sample input:
MODEL_NAME=flower-grpc
INPUT_PATH=./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
python grpc_client.py --host $INGRESS_HOST --port $INGRESS_PORT --model $MODEL_NAME --hostname $SERVICE_HOSTNAME --input_path $INPUT_PATH
outputs {
  key: "key"
  value {
    dtype: DT_STRING
    tensor_shape {
      dim {
        size: 1
      }
    }
    string_val: " 1"
  }
}
outputs {
  key: "prediction"
  value {
    dtype: DT_INT64
    tensor_shape {
      dim {
        size: 1
      }
    }
    int64_val: 0
  }
}
outputs {
  key: "scores"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
      dim {
        size: 6
      }
    }
    float_val: 0.9991149306297302
    float_val: 9.209887502947822e-05
    float_val: 0.00013678647519554943
    float_val: 0.0003372581850271672
    float_val: 0.0003005331673193723
    float_val: 1.848137799242977e-05
  }
}
model_spec {
  name: "flower-grpc"
  version {
    value: 1
  }
  signature_name: "serving_default"
}