Model Serving Runtimes

KServe provides a simple Kubernetes CRD for deploying one or more trained models onto model serving runtimes such as TFServing, TorchServe, and Triton Inference Server. In addition, KFServer is the Python model serving runtime implemented in KServe itself and supports the v1 prediction protocol, while MLServer implements the v2 prediction protocol over both REST and gRPC. These runtimes provide out-of-the-box model serving, but you can also build your own model server for more complex use cases. KServe provides basic API primitives that let you easily build a custom model serving runtime, and you can also use other tools such as BentoML to build your custom model serving image.
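As a minimal sketch of the custom route, an InferenceService can point its predictor at your own serving image instead of one of the built-in runtimes. The service name and image reference below are placeholders chosen for illustration, and the image is assumed to start a server that speaks one of the prediction protocols.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "custom-model"  # hypothetical service name
spec:
  predictor:
    containers:
      - name: "kserve-container"
        # Hypothetical image, built with the KServe primitives or a tool such as BentoML
        image: "example.com/your-org/custom-model:v1"
```

Because the predictor is just a container spec, the same serverless features apply to it as to the built-in runtimes.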

After models are deployed with InferenceService, you get all of the following serverless features provided by KServe:

Scale to and from Zero
Request-based autoscaling on CPU/GPU
Revision Management
Optimized Container
Batching
Request/Response logging
Traffic management
Security with AuthN/AuthZ
Distributed Tracing
Out-of-the-box metrics
Ingress/Egress control
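Several of these features are configured directly on the InferenceService. As a rough sketch, scaling to zero can be requested by setting the minimum replica count to 0 on the predictor. The service name, bucket path, and replica limits below are illustrative assumptions, not recommended values.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"  # hypothetical service name
spec:
  predictor:
    minReplicas: 0  # allow the predictor to scale down to zero when idle
    maxReplicas: 3  # illustrative upper bound for autoscaling
    tensorflow:
      # Placeholder path; point this at your own exported SavedModel
      storageUri: "gs://your-bucket/models/tensorflow/flowers"
```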

| Model Serving Runtime | Exported Model | Prediction Protocol | gRPC | Versions |
| --- | --- | --- | --- | --- |
| Triton Inference Server | TensorFlow, TorchScript, ONNX | v2 | Yes | Compatibility Matrix |
| TFServing | TensorFlow SavedModel | v1 | Yes | TFServing Versions |
| TorchServe | Eager Model/TorchScript | v1 | Yes | 0.4.1 |
| SKLearn MLServer | Pickled Model | v2 | Yes | 0.23.1 |
| XGBoost MLServer | Saved Model | v2 | Yes | 1.1.1 |
| SKLearn KFServer | Pickled Model | v1 | -- | 0.20.3 |
| XGBoost KFServer | Saved Model | v1 | -- | 0.82 |
| PMML KFServer | PMML | v1 | -- | PMML4.4.1 |
| LightGBM KFServer | Saved LightGBM Model | v1 | -- | 2.3.1 |
| Custom KFServer | -- | v1 | -- | -- |
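For frameworks that appear twice in the table, such as SKLearn and XGBoost, the runtime is chosen by the prediction protocol. A minimal sketch, assuming the protocolVersion field on the predictor selects the v2 (MLServer) runtime and that omitting it falls back to the v1 KFServer runtime. The service name and bucket path are placeholders.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-v2"  # hypothetical service name
spec:
  predictor:
    sklearn:
      protocolVersion: "v2"  # request the v2 prediction protocol (MLServer)
      # Placeholder path; point this at your own pickled model
      storageUri: "gs://your-bucket/models/sklearn/iris"
```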

Note

The model serving runtime version can be overridden with the runtimeVersion field in the InferenceService YAML, and we highly recommend setting this field for production services.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3