Why KServe?
Single platform that unifies Generative and Predictive AI inference on Kubernetes. Simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.
🤖 Generative AI
🧠 LLM-Optimized
OpenAI-compatible inference protocol for seamless integration with large language models
🚅 GPU Acceleration
High-performance serving with GPU support and optimized memory management for large models
💾 Model Caching
Intelligent model caching to reduce loading times and improve response latency for frequently used models
🗂️ KV Cache Offloading
Advanced memory management with KV cache offloading to CPU/disk for handling longer sequences efficiently
📈 Autoscaling
Request-based autoscaling capabilities optimized for generative workload patterns (see the sketch after this list)
🔧 Hugging Face Ready
Native support for Hugging Face models with streamlined deployment workflows
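To make the autoscaling card above concrete, here is a minimal sketch of request-based autoscaling on a generative predictor. It uses the v1beta1 minReplicas, maxReplicas, scaleMetric, and scaleTarget fields; the service name, model, and target values are illustrative assumptions, not a prescribed configuration.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llm-autoscale"            # hypothetical name
spec:
  predictor:
    minReplicas: 1                  # keep one replica warm for latency
    maxReplicas: 4                  # cap GPU spend under bursty load
    scaleMetric: concurrency        # scale on in-flight requests
    scaleTarget: 4                  # target concurrent requests per replica
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"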
📊 Predictive AI
🧮 Multi-Framework
Support for TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more
🔀 Intelligent Routing
Seamless request routing between predictor, transformer, and explainer components with automatic traffic management
🔄 Advanced Deployments
Canary rollouts, inference pipelines, and ensembles with InferenceGraph
⚡ Autoscaling
Request-based autoscaling with scale-to-zero for predictive workloads (see the sketch after this list)
🔍 Model Explainability
Built-in support for model explanations and feature attribution to understand prediction reasoning
📊 Advanced Monitoring
Enables payload logging, outlier detection, adversarial detection, and drift detection, with integrations for AI Fairness 360 (AIF360) and the Adversarial Robustness Toolbox (ART)
💰 Cost Efficient
Scale-to-zero on expensive resources when not in use, reducing infrastructure costs
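The canary rollout and scale-to-zero cards above combine naturally on a single predictor. The sketch below assumes the v1beta1 canaryTrafficPercent and minReplicas fields on a scikit-learn predictor; the service name and storage URI are placeholders.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"              # hypothetical name
spec:
  predictor:
    minReplicas: 0                   # scale to zero when idle
    canaryTrafficPercent: 10         # route 10% of traffic to the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"  # placeholder model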
Simple and Powerful API
KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning models. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features to your ML deployments.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llm-service"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
      storageUri: "hf://meta-llama/Llama-3.1-8B-Instruct"
How KServe Works
KServe builds on a Kubernetes custom resource definition to serve ML models from arbitrary frameworks, hiding the complexity of autoscaling, networking, health checking, and server configuration behind a few cooperating components:

Control Plane
Manages the lifecycle of ML models, providing model revision tracking, canary rollouts, and A/B testing
Data Plane
Standardized inference protocol for model servers with request/response APIs, supporting both predictive and generative models
InferenceService
Core Kubernetes custom resource that simplifies ML model deployment with automatic scaling, networking, and health checks
Inference Graph
Enables advanced deployments with pipelines for pre/post processing, ensembles, and multi-model workflows
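As an illustration of the Inference Graph component, here is a minimal sketch of a two-step sequential pipeline. It assumes the v1alpha1 InferenceGraph API and two already-deployed InferenceServices named preprocess and classifier (placeholder names).
apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "model-chain"
spec:
  nodes:
    root:
      routerType: Sequence           # run the steps one after another
      steps:
        - serviceName: preprocess    # first InferenceService in the chain
          name: preprocess
        - serviceName: classifier    # second step
          name: classifier
          data: $response            # pass the previous step's response as input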
Quick Start
Get started with KServe in minutes. Follow these simple steps to deploy your first model.
Install KServe
Install KServe and its built-in serving runtimes on your Kubernetes cluster (cert-manager must already be installed; the Hugging Face example below requires a recent KServe release):
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
Create an InferenceService
Deploy a pre-trained model with a simple YAML configuration:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"   # extended resources such as GPUs must set limits equal to requests
Send Inference Requests
Send an OpenAI-compatible chat completion request to the deployed model:
curl -v -H "Host: qwen-llm.default.example.com" \
  -H "Content-Type: application/json" \
  http://localhost:8080/openai/v1/chat/completions -d @./prompt.json
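The request body in prompt.json follows the OpenAI chat completions schema. A minimal example is sketched below; the model value is an assumption and must match the name the Hugging Face runtime serves the model under in your deployment.
{
  "model": "qwen-llm",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is KServe?"}
  ],
  "max_tokens": 128,
  "stream": false
}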
Trusted by Industry Leaders
KServe is used in production by organizations across various industries, providing reliable model inference at scale.
Ready to Transform Your ML Deployment?
Simplify your journey from model development to production with KServe's standardized inference platform for both predictive and generative AI models