Why KServe?
Single platform that unifies Generative and Predictive AI inference on Kubernetes. Simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.
🤖 Generative AI
🧠 LLM-Optimized
OpenAI-compatible inference protocol for seamless integration with large language models
🚅 GPU Acceleration
High-performance serving with GPU support and optimized memory management for large models
💾 Model Caching
Intelligent model caching to reduce loading times and improve response latency for frequently used models
🗂️ KV Cache Offloading
Advanced memory management with KV cache offloading to CPU/disk for handling longer sequences efficiently
📈 Autoscaling
Request-based autoscaling capabilities optimized for generative workload patterns (see the sketch after this list)
🔧 Hugging Face Ready
Native support for Hugging Face models with streamlined deployment workflows
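To make the autoscaling card above concrete, here is a minimal sketch of request-based autoscaling on a generative predictor. It uses the v1beta1 minReplicas, maxReplicas, scaleMetric, and scaleTarget fields; the service name, model, and target values are illustrative assumptions, not a prescribed configuration.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llm-autoscale"            # hypothetical name
spec:
  predictor:
    minReplicas: 1                  # keep one replica warm for latency
    maxReplicas: 4                  # cap GPU spend under bursty load
    scaleMetric: concurrency        # scale on in-flight requests
    scaleTarget: 4                  # target concurrent requests per replica
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"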
📊 Predictive AI
🧮 Multi-Framework
Support for TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more
🔀 Intelligent Routing
Seamless request routing between predictor, transformer, and explainer components with automatic traffic management
🔄 Advanced Deployments
Canary rollouts, inference pipelines, and ensembles with InferenceGraph
⚡ Autoscaling
Request-based autoscaling with scale-to-zero for predictive workloads (see the sketch after this list)
🔍 Model Explainability
Built-in support for model explanations and feature attribution to understand prediction reasoning
📊 Advanced Monitoring
Enables payload logging, outlier detection, adversarial detection, and drift detection, with integrations for AI Fairness 360 (AIF360) and the Adversarial Robustness Toolbox (ART)
💰 Cost Efficient
Scale-to-zero on expensive resources when not in use, reducing infrastructure costs
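The canary rollout and scale-to-zero cards above combine naturally on a single predictor. The sketch below assumes the v1beta1 canaryTrafficPercent and minReplicas fields on a scikit-learn predictor; the service name and storage URI are placeholders.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"              # hypothetical name
spec:
  predictor:
    minReplicas: 0                   # scale to zero when idle
    canaryTrafficPercent: 10         # route 10% of traffic to the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"  # placeholder model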
Simple and Powerful API
KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning models. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features to your ML deployments.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llm-service"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
      storageUri: "hf://meta-llama/Llama-3.1-8B-Instruct"
How KServe Works
KServe builds on a Kubernetes custom resource definition to serve ML models from arbitrary frameworks, hiding the complexity of autoscaling, networking, health checking, and server configuration behind a few cooperating components:

Control Plane
Manages the lifecycle of ML models, providing model revision tracking, canary rollouts, and A/B testing
Data Plane
Standardized inference protocol for model servers with request/response APIs, supporting both predictive and generative models
InferenceService
Core Kubernetes custom resource that simplifies ML model deployment with automatic scaling, networking, and health checks
Inference Graph
Enables advanced deployments with pipelines for pre/post processing, ensembles, and multi-model workflows
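As an illustration of the Inference Graph component, here is a minimal sketch of a two-step sequential pipeline. It assumes the v1alpha1 InferenceGraph API and two already-deployed InferenceServices named preprocess and classifier (placeholder names).
apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "model-chain"
spec:
  nodes:
    root:
      routerType: Sequence           # run the steps one after another
      steps:
        - serviceName: preprocess    # first InferenceService in the chain
          name: preprocess
        - serviceName: classifier    # second step
          name: classifier
          data: $response            # pass the previous step's response as input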
Quick Start
Get started with KServe in minutes. Follow these simple steps to deploy your first model.
Install KServe
Install KServe and its built-in serving runtimes on your Kubernetes cluster (cert-manager must already be installed; the Hugging Face example below requires a recent KServe release):
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
Create an InferenceService
Deploy a pre-trained model with a simple YAML configuration:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"   # extended resources such as GPUs must set limits equal to requests
Send Inference Requests
Send an OpenAI-compatible chat completion request to the deployed model:
curl -v -H "Host: qwen-llm.default.example.com" \
  -H "Content-Type: application/json" \
  http://localhost:8080/openai/v1/chat/completions -d @./prompt.json
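The request body in prompt.json follows the OpenAI chat completions schema. A minimal example is sketched below; the model value is an assumption and must match the name the Hugging Face runtime serves the model under in your deployment.
{
  "model": "qwen-llm",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is KServe?"}
  ],
  "max_tokens": 128,
  "stream": false
}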
Trusted by Industry Leaders
KServe is used in production by organizations across various industries, providing reliable model inference at scale.
Ready to Transform Your ML Deployment?
Simplify your journey from model development to production with KServe's standardized inference platform for both predictive and generative AI models