Model Serving Frameworks Overview

KServe provides a simple Kubernetes CRD to enable deploying single or multiple trained models onto various model serving runtimes. This page provides an overview of the supported frameworks and their capabilities.

Introduction

KServe supports multiple model serving runtimes out of the box; the Supported Frameworks section below lists them along with the protocols and framework versions each one supports. For more complex use cases, you can build custom model servers using KServe's API primitives or tools like BentoML.
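
As a starting point, the sketch below shows a minimal InferenceService for a scikit-learn model; the service name and storageUri are illustrative placeholders.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: "sklearn"       # selects the scikit-learn serving runtime
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"  # example model location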

Key Features

When you deploy models with InferenceService, you automatically get these serverless features:

Scalability

  • Scale to and from Zero - Automatic scaling based on traffic (see the sketch after this list)
  • Request-based Autoscaling - Support for both CPU and GPU scaling
  • Optimized Containers - Performance-optimized runtime containers
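
These scaling behaviors are configured on the predictor spec. A minimal sketch, assuming the serverless deployment mode; the replica bounds, metric, and target below are illustrative values:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-autoscale"    # illustrative name
spec:
  predictor:
    minReplicas: 0          # allow scale-to-zero when there is no traffic
    maxReplicas: 5          # upper bound for request-based autoscaling
    scaleMetric: concurrency
    scaleTarget: 10         # target concurrent requests per replica
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"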

Management

  • Revision Management - Track and manage different model versions
  • Traffic Management - Advanced routing and canary deployments (see the sketch after this list)
  • Batching - Automatic request batching for improved throughput
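
Canary rollouts, for example, are driven by a single field on the predictor spec. The sketch below (model path and percentage are illustrative) routes 10% of traffic to the latest revision while the rest stays on the previous one:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 10     # send 10% of traffic to the newest revision
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/2.0/model"   # updated model version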

Observability

  • Request/Response Logging - Comprehensive logging capabilities (see the sketch after this list)
  • Distributed Tracing - End-to-end request tracing
  • Out-of-the-box Metrics - Built-in monitoring and metrics
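
Request/response logging, for instance, is turned on with a logger block on the predictor; the sink URL below is a placeholder for your own logging or eventing endpoint:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-logging"      # illustrative name
spec:
  predictor:
    logger:
      mode: all                              # log both requests and responses
      url: http://message-dumper.default/    # placeholder event sink
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"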

Security

  • Authentication/Authorization - Secure access controls
  • Ingress/Egress Control - Network traffic management

Supported Frameworks

The following tables show model serving runtimes supported by KServe, split into predictive and generative inference capabilities:

Protocol Support
  • HTTP/gRPC columns indicate the prediction protocol version (v1 or v2)
  • Asterisk (*) indicates custom prediction protocols in addition to KServe's standard protocols
  • Default Runtime Version shows the source and version of the serving runtime

| Framework | Exported Model Format | HTTP | gRPC | Default Runtime Version | Supported Framework (Major) Version(s) | Examples |
| --- | --- | --- | --- | --- | --- | --- |
| HuggingFace ModelServer | Saved Model, Huggingface Hub Model_Id | OpenAI | -- | v0.15 (KServe) | 4 (Transformers) | GitHub Examples |
| HuggingFace VLLM ModelServer | Saved Model, Huggingface Hub Model_Id | OpenAI | -- | v0.15 (KServe) | 0 (VLLM) | GitHub Examples |
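
As an illustration of the Hugging Face runtimes above, a model can be pulled directly from the Hugging Face Hub by model ID and served over the OpenAI-compatible endpoints; the service name, model ID, and GPU request below are placeholders:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "huggingface-llm"                 # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm                # name used in the endpoint paths
        - --model_id=meta-llama/Llama-3.1-8B-Instruct   # placeholder Hub model ID
      resources:
        limits:
          nvidia.com/gpu: "1"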

Version Information

The framework versions and runtime configurations can be found in several locations in the KServe source repository.

For example, the LightGBM server version can be found in the pyproject.toml file, which specifies lightgbm ~= 3.3.2.

Runtime Version Configuration

Production Recommendation

For production services, we highly recommend explicitly setting the runtimeVersion field in your InferenceService specification to ensure consistent deployments and avoid unexpected version changes.

You can override the default model serving runtime version using the runtimeVersion field:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    model:
      modelFormat:
        name: "pytorch"
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 23.08-py3

Next Steps