Model Serving Runtimes

KServe provides a simple Kubernetes CRD for deploying single or multiple trained models onto model serving runtimes such as TFServing, TorchServe, and Triton Inference Server. For Hugging Face models, KServe provides the Hugging Face Server, which hosts transformer-based models using the Open Inference Protocol and the OpenAI protocol. In addition, ModelServer is the Python model serving runtime implemented in KServe itself; it supports the v1 prediction protocol and the Open Inference Protocol (v2). These model serving runtimes provide out-of-the-box model serving, but you can also build your own model server for more complex use cases. KServe provides basic API primitives that let you easily build a custom model serving runtime, or you can use other tools such as BentoML to build your custom model serving image.
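
As a minimal sketch of what such a deployment can look like, the manifest below asks KServe to serve a scikit-learn model through the InferenceService CRD. The service name is an illustrative choice, and the storage URI is an assumption: replace it with the location of a pickled model you can actually read.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris              # illustrative name, not prescribed by KServe
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn             # matched against the formats each installed runtime declares
      # Assumed model location; point this at your own bucket or PVC.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Only the model format is named here, so KServe is left to pick a compatible serving runtime from those installed in the cluster.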

After models are deployed with InferenceService, you get all the following serverless features provided by KServe.

Scale to and from Zero
Request based Autoscaling on CPU/GPU
Revision Management
Optimized Container
Batching
Request/Response logging
Traffic management
Security with AuthN/AuthZ
Distributed Tracing
Out-of-the-box metrics
Ingress/Egress control
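
Several of the features listed above are switched on through fields of the predictor spec. The manifest below is a sketch only: all numeric values are illustrative, the logger URL is a hypothetical sink service, the storage URI is an assumed model location, and scale to zero assumes KServe is installed in its serverless (Knative-based) mode.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris              # illustrative name
spec:
  predictor:
    minReplicas: 0                # permit scaling down to zero when idle
    maxReplicas: 3                # illustrative upper bound
    scaleMetric: concurrency      # autoscale on in-flight requests per replica
    scaleTarget: 5                # illustrative target value for that metric
    canaryTrafficPercent: 10      # on an update, route 10% of traffic to the new revision
    batcher:
      maxBatchSize: 32            # illustrative batch size
      maxLatency: 500             # illustrative bound on how long a batch waits to fill
    logger:
      mode: all                   # log both requests and responses
      url: http://message-dumper.default/   # hypothetical log sink
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # assumed location
```

In practice you would enable only the fields you need; each one is optional.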

The table below identifies each of the model serving runtimes supported by KServe. The HTTP and gRPC columns indicate the prediction protocol version that the serving runtime supports. The KServe prediction protocol is noted as either "v1" or "v2". Some serving runtimes also support their own prediction protocol; these are noted with an *.

The default serving runtime version column gives the source and version of the serving runtime: MLServer, KServe, or the runtime's own project. These versions can also be found in the runtime kustomization YAML. All KServe native model serving runtimes use the current KServe release version (v0.14.1, as listed in the table).

The supported framework version column lists the major version of the model framework that is supported. These versions can also be found in the respective runtime YAML under the supportedModelFormats field. For model frameworks using the KServe serving runtime, the specific default version can be found in kserve/python: in a given serving runtime directory, the pyproject.toml file contains the exact model framework version used. For example, in kserve/python/lgbserver the pyproject.toml file pins the model framework with lightgbm ~= 3.3.2.
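
To show where the supportedModelFormats field sits, here is a trimmed sketch of a runtime definition. It is not copied from the KServe repository: the format version mirrors the LightGBM major version in the runtime table, the image reference is illustrative, and the container arguments are omitted, so check the runtime YAML installed in your cluster for the real values.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-lgbserver
spec:
  supportedModelFormats:
    - name: lightgbm
      version: "4"                # major framework version (illustrative, taken from the table)
      autoSelect: true            # allow automatic selection for matching model formats
  protocolVersions:
    - v1
    - v2
  containers:
    - name: kserve-container
      image: kserve/lgbserver:v0.14.1   # illustrative image reference
      # container args omitted in this sketch
```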

Model Serving Runtime	Exported model	HTTP	gRPC	Default Serving Runtime Version	Supported Framework (Major) Version(s)	Examples
Custom ModelServer	--	v1, v2	v2	--	--	Custom Model
LightGBM ModelServer	Saved LightGBM Model	v1, v2	v2	v0.14.1 (KServe)	4	LightGBM Iris
MLflow ModelServer	Saved MLflow Model	v2	v2	v1.5.0 (MLServer)	2	MLflow wine-classifier
PMML ModelServer	PMML	v1, v2	v2	v0.14.1 (KServe)	3, 4 (PMML4.4.1)	SKLearn PMML
SKLearn ModelServer	Pickled Model	v1, v2	v2	v0.14.1 (KServe)	1.5	SKLearn Iris
TFServing	TensorFlow SavedModel	v1	*tensorflow	2.6.2 (TFServing Versions)	2	TensorFlow flower
TorchServe	Eager Model/TorchScript	v1, v2, *torchserve	*torchserve	0.9.0 (TorchServe)	2	TorchServe mnist
Triton Inference Server	TensorFlow, TorchScript, ONNX	v2	v2	23.05-py3 (Triton)	8 (TensorRT), 1, 2 (TensorFlow), 2 (PyTorch), 2 (Triton) Compatibility Matrix	TorchScript cifar
XGBoost ModelServer	Saved Model	v1, v2	v2	v0.14.1 (KServe)	2	XGBoost Iris
HuggingFace ModelServer	Saved Model / Huggingface Hub Model_Id	v1, v2, OpenAI	--	v0.14.1 (KServe)	4 (Transformers)	--
HuggingFace VLLM ModelServer	Saved Model / Huggingface Hub Model_Id	v2, OpenAI	--	v0.14.1 (KServe)	0 (vLLM)	--

*tensorflow - TensorFlow Serving implements its own prediction protocol in addition to KServe's. See: TensorFlow Serving Prediction API documentation

*torchserve - TorchServe implements its own prediction protocol in addition to KServe's. See: TorchServe gRPC API documentation
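
For runtimes that list both v1 and v2 in the table, the protocol can be requested per service with the protocolVersion field. The sketch below assumes a scikit-learn model; the service name and storage URI are illustrative assumptions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-v2-iris           # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2         # request the Open Inference Protocol instead of v1
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # assumed location
```

With v2 selected, clients call the Open Inference Protocol endpoints (for example /v2/models/<name>/infer over HTTP) rather than the v1 :predict endpoint.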

Note

The model serving runtime version can be overridden with the runtimeVersion field in the InferenceService YAML. We highly recommend setting this field for production services.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3