Model Serving Frameworks Overview

KServe provides a simple Kubernetes CRD to enable deploying single or multiple trained models onto various model serving runtimes. This page provides an overview of the supported frameworks and their capabilities.

Introduction

KServe supports multiple model serving runtimes out of the box; the Supported Frameworks section below lists them along with the protocols and framework versions each one supports. For more complex use cases, you can build custom model servers using KServe's API primitives or tools like BentoML.
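
As a starting point, the sketch below shows a minimal InferenceService for a scikit-learn model; the service name and storageUri are illustrative placeholders.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: "sklearn"       # selects the scikit-learn serving runtime
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"  # example model location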

Key Features

When you deploy models with InferenceService, you automatically get these serverless features:

Scalability

  • Scale to and from Zero - Automatic scaling based on traffic (see the sketch after this list)
  • Request-based Autoscaling - Support for both CPU and GPU scaling
  • Optimized Containers - Performance-optimized runtime containers
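
These scaling behaviors are configured on the predictor spec. A minimal sketch, assuming the serverless deployment mode; the replica bounds, metric, and target below are illustrative values:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-autoscale"    # illustrative name
spec:
  predictor:
    minReplicas: 0          # allow scale-to-zero when there is no traffic
    maxReplicas: 5          # upper bound for request-based autoscaling
    scaleMetric: concurrency
    scaleTarget: 10         # target concurrent requests per replica
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"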

Management

  • Revision Management - Track and manage different model versions
  • Traffic Management - Advanced routing and canary deployments (see the sketch after this list)
  • Batching - Automatic request batching for improved throughput
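
Canary rollouts, for example, are driven by a single field on the predictor spec. The sketch below (model path and percentage are illustrative) routes 10% of traffic to the latest revision while the rest stays on the previous one:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 10     # send 10% of traffic to the newest revision
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/2.0/model"   # updated model version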

Observability

  • Request/Response Logging - Comprehensive logging capabilities (see the sketch after this list)
  • Distributed Tracing - End-to-end request tracing
  • Out-of-the-box Metrics - Built-in monitoring and metrics
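
Request/response logging, for instance, is turned on with a logger block on the predictor; the sink URL below is a placeholder for your own logging or eventing endpoint:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-logging"      # illustrative name
spec:
  predictor:
    logger:
      mode: all                              # log both requests and responses
      url: http://message-dumper.default/    # placeholder event sink
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"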

Security

  • Authentication/Authorization - Secure access controls
  • Ingress/Egress Control - Network traffic management

Supported Frameworks

The following tables show model serving runtimes supported by KServe, split into predictive and generative inference capabilities:

Protocol Support
  • HTTP/gRPC columns indicate the prediction protocol version (v1 or v2)
  • Asterisk (*) indicates custom prediction protocols in addition to KServe's standard protocols
  • Default Runtime Version shows the source and version of the serving runtime

| Framework | Exported Model Format | HTTP | gRPC | Default Runtime Version | Supported Framework (Major) Version(s) | Examples |
| --- | --- | --- | --- | --- | --- | --- |
| HuggingFace ModelServer | Saved Model, Huggingface Hub Model_Id | OpenAI | -- | v0.15 (KServe) | 4 (Transformers) | GitHub Examples |
| HuggingFace VLLM ModelServer | Saved Model, Huggingface Hub Model_Id | OpenAI | -- | v0.15 (KServe) | 0 (VLLM) | GitHub Examples |
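
As an illustration of the Hugging Face runtimes above, a model can be pulled directly from the Hugging Face Hub by model ID and served over the OpenAI-compatible endpoints; the service name, model ID, and GPU request below are placeholders:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "huggingface-llm"                 # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm                # name used in the endpoint paths
        - --model_id=meta-llama/Llama-3.1-8B-Instruct   # placeholder Hub model ID
      resources:
        limits:
          nvidia.com/gpu: "1"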

Version Information

The framework versions and runtime configurations can be found in several locations in the KServe source repository.

For example, the LightGBM server version can be found in the pyproject.toml file, which specifies lightgbm ~= 3.3.2.

Runtime Version Configuration

Production Recommendation

For production services, we highly recommend explicitly setting the runtimeVersion field in your InferenceService specification to ensure consistent deployments and avoid unexpected version changes.

You can override the default model serving runtime version using the runtimeVersion field:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    model:
      modelFormat:
        name: "pytorch"
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 23.08-py3

Next Steps