Deploy TensorFlow Model with InferenceService¶
Create the HTTP InferenceService¶
Create an InferenceService YAML that specifies the framework tensorflow and a storageUri pointing to a saved TensorFlow model, and name it tensorflow.yaml.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-sample"
spec:
predictor:
tensorflow:
storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-sample"
spec:
predictor:
model:
modelFormat:
name: tensorflow
storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
Apply tensorflow.yaml to create the InferenceService; by default it exposes an HTTP/REST endpoint.
kubectl apply -f tensorflow.yaml
Expected Output
$ inferenceservice.serving.kserve.io/flower-sample created
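Optionally, you can watch the predictor pods come up. A sketch, assuming the serving.kserve.io/inferenceservice label that KServe sets on predictor pods:

kubectl get pods -l serving.kserve.io/inferenceservice=flower-sample -w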
Wait for the InferenceService to be in the Ready state:
kubectl get isvc flower-sample
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
flower-sample http://flower-sample.default.example.com True 100 flower-sample-predictor-default-n9zs6 7m15s
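If you prefer to block until the service is ready rather than poll, kubectl wait can watch the Ready condition that the InferenceService reports (a sketch; adjust the timeout for your environment):

kubectl wait --for=condition=Ready inferenceservice/flower-sample --timeout=300s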
Run a prediction¶
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT. The inference request input file can be downloaded here.
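For example, if your cluster exposes KServe through the Istio ingress gateway in the istio-system namespace (an assumption; adjust the service name and namespace for your setup), you can look the values up like this:

INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')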
MODEL_NAME=flower-sample
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH
Expected Output
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/flower-sample:predict HTTP/1.1
> Host: flower-sample.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 16201
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 16201 out of 16201 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 222
< content-type: application/json
< date: Sun, 31 Jan 2021 01:01:50 GMT
< x-envoy-upstream-service-time: 280
< server: istio-envoy
<
{
    "predictions": [
        {
            "scores": [0.999114931, 9.20987877e-05, 0.000136786213, 0.000337257545, 0.000300532585, 1.84813616e-05],
            "prediction": 0,
            "key": " 1"
        }
    ]
}
Canary Rollout¶
Canary rollout is a great way to control the risk of rolling out a new model by first moving a small percent of the traffic to it and then gradually increase the percentage.
To run a canary rollout, you can apply the canary.yaml
with the canaryTrafficPercent
field specified.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-sample"
spec:
predictor:
canaryTrafficPercent: 20
tensorflow:
storageUri: "gs://kfserving-examples/models/tensorflow/flowers-2"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-sample"
spec:
predictor:
canaryTrafficPercent: 20
model:
modelFormat:
name: tensorflow
storageUri: "gs://kfserving-examples/models/tensorflow/flowers-2"
Apply canary.yaml to roll out the canary revision of the InferenceService.
kubectl apply -f canary.yaml
To verify that the traffic split percentage is applied correctly, you can run the following command:
kubectl get isvc flower-sample
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
flower-sample http://flower-sample.default.example.com True 80 20 flower-sample-predictor-default-n9zs6 flower-sample-predictor-default-2kwtr 7m15s
As you can see, the traffic is split between the last rolled-out revision and the latest ready revision. KServe automatically tracks the last rolled-out (stable) revision for you, so you no longer need to maintain both a default and a canary spec on the InferenceService as in v1alpha2.
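Once you are confident in the new model, you can promote the canary by removing the canaryTrafficPercent field and re-applying the spec, which routes 100% of the traffic to the latest revision. A sketch based on the new-schema spec above (the file name promote.yaml is arbitrary):

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers-2"

kubectl apply -f promote.yaml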
Create the gRPC InferenceService¶
Create an InferenceService that exposes the gRPC port; by default it listens on port 9000.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-grpc"
spec:
predictor:
tensorflow:
storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
ports:
- containerPort: 9000
name: h2c
protocol: TCP
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flower-grpc"
spec:
predictor:
model:
modelFormat:
name: tensorflow
storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
ports:
- containerPort: 9000
name: h2c
protocol: TCP
Apply grpc.yaml to create the gRPC InferenceService.
kubectl apply -f grpc.yaml
Expected Output
$ inferenceservice.serving.kserve.io/flower-grpc created
Run a prediction¶
We use a Python gRPC client for the prediction, so you need to create a Python virtual environment and install the tensorflow-serving-api package.
# The prediction script is written in TensorFlow 1.x
pip install 'tensorflow-serving-api>=1.14.0,<2.0.0'
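The steps below reference a grpc_client.py script. The sketch that follows shows what such a client might look like; it is not the exact script from the repository. It assumes the flowers SavedModel takes string inputs named image_bytes and key under the serving_default signature, and that input.json stores a base64-encoded image at instances[0].image_bytes.b64; adjust these names if your model or input file differs.

# grpc_client.py - minimal sketch of a TF Serving gRPC client for KServe.
# Assumptions: inputs 'image_bytes' and 'key', signature 'serving_default',
# input.json shaped like {"instances": [{"image_bytes": {"b64": "..."}, "key": "1"}]}.
import argparse
import base64
import json

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', required=True, help='Ingress IP (INGRESS_HOST)')
    parser.add_argument('--port', required=True, type=int, help='Ingress port (INGRESS_PORT)')
    parser.add_argument('--model', required=True, help='InferenceService / model name')
    parser.add_argument('--hostname', required=True, help='InferenceService hostname used for routing')
    parser.add_argument('--input_path', required=True, help='Path to input.json')
    args = parser.parse_args()

    # Read the request payload and decode the base64-encoded image bytes.
    with open(args.input_path) as f:
        instance = json.load(f)['instances'][0]
    image_bytes = base64.b64decode(instance['image_bytes']['b64'])
    key = instance.get('key', '1')

    # Connect to the ingress gateway and set the :authority pseudo-header to the
    # InferenceService hostname so the gateway can route the gRPC call.
    channel = grpc.insecure_channel(
        '{}:{}'.format(args.host, args.port),
        options=[('grpc.default_authority', args.hostname)])
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build the PredictRequest with the two string tensors the model expects.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = args.model
    request.model_spec.signature_name = 'serving_default'
    request.inputs['image_bytes'].CopyFrom(tf.make_tensor_proto([image_bytes], shape=[1]))
    request.inputs['key'].CopyFrom(tf.make_tensor_proto([key], shape=[1]))

    result = stub.Predict(request, timeout=10.0)
    print(result)


if __name__ == '__main__':
    main()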
Run the gRPC prediction script.
MODEL_NAME=flower-grpc
INPUT_PATH=./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
python grpc_client.py --host $INGRESS_HOST --port $INGRESS_PORT --model $MODEL_NAME --hostname $SERVICE_HOSTNAME --input_path $INPUT_PATH
Expected Output
outputs {
  key: "key"
  value {
    dtype: DT_STRING
    tensor_shape {
      dim {
        size: 1
      }
    }
    string_val: " 1"
  }
}
outputs {
  key: "prediction"
  value {
    dtype: DT_INT64
    tensor_shape {
      dim {
        size: 1
      }
    }
    int64_val: 0
  }
}
outputs {
  key: "scores"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
      dim {
        size: 6
      }
    }
    float_val: 0.9991149306297302
    float_val: 9.209887502947822e-05
    float_val: 0.00013678647519554943
    float_val: 0.0003372581850271672
    float_val: 0.0003005331673193723
    float_val: 1.848137799242977e-05
  }
}
model_spec {
  name: "flower-grpc"
  version {
    value: 1
  }
  signature_name: "serving_default"
}