Welcome to KServe

Deploy and scale AI models effortlessly - from cutting-edge Large Language Models to traditional ML models - with enterprise-grade reliability across any cloud or on-premises environment.

Why KServe?

KServe eliminates the complexity of productionizing AI models. Whether you're a data scientist wanting to deploy your latest LLM experiment, a DevOps engineer building scalable ML infrastructure, or a decision maker evaluating AI platforms, KServe provides a unified solution that works across clouds and scales with your needs.

🚀 From Experiment to Production in Minutes - Deploy GenAI services and ML models with simple YAML configurations, no complex infrastructure setup required.

☁️ Cloud-Agnostic by Design - Run anywhere: AWS, Azure, GCP, on-premises, or hybrid environments with consistent behavior.

📈 Enterprise-Scale Ready - Automatically handle traffic spikes, scale to zero when idle, and manage hundreds of models efficiently.

What Makes KServe Different?

GenAI-First Platform

Deploy Large Language Models with OpenAI-compatible APIs out of the box. Chat completions, streaming responses, embeddings - all just work with your existing tools and SDKs.
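
For example, a minimal InferenceService manifest along these lines (the service name, model id, and GPU sizing are illustrative) deploys a Hugging Face LLM behind those APIs:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-qwen                        # illustrative service name
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface                       # Hugging Face LLM serving runtime
          args:
            - --model_name=qwen
            - --model_id=Qwen/Qwen2.5-0.5B-Instruct # any Hugging Face model id
          resources:
            limits:
              nvidia.com/gpu: "1"                   # size to the model you deploy

Once the service is ready it exposes OpenAI-style routes such as /openai/v1/chat/completions, so existing OpenAI SDKs and tools can target it simply by changing the base URL.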

Universal ML Support

Beyond GenAI, serve any ML framework: TensorFlow, PyTorch, Scikit-Learn, XGBoost, and more. One platform for all your AI workloads.

Zero Infrastructure Overhead

Focus on your models, not infrastructure. KServe handles load balancing, autoscaling, canary deployments, and monitoring automatically.

Production-Ready Security

Enterprise authentication, network policies, and compliance features built-in. Deploy with confidence in regulated environments.

Key Benefits

Generative Inference Benefits

  • LLM Multi-framework Support - Deploy LLMs from Hugging Face, vLLM, and custom generative models
  • OpenAI-Compatible APIs - Chat completion, completion, streaming, and embedding endpoints
  • LocalModelCache for LLMs - Cache large models locally to reduce startup time from 15-20 minutes to ~1 minute
  • KV Cache Offloading - Optimized memory management for long conversations and large contexts
  • Multi-node Inference - Distributed LLM serving
  • Envoy AI Gateway Integration - Enterprise-grade API management and routing for AI workloads
  • Metric-based Autoscaling - Scale based on token throughput, queue depth, and GPU utilization
  • Advanced Generative Deployments - Canary rollouts and A/B testing for LLM experiments
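
As a sketch of how a canary rollout looks in practice (model ids illustrative), updating an existing InferenceService with canaryTrafficPercent splits traffic between the current revision and the new one:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-qwen
    spec:
      predictor:
        canaryTrafficPercent: 10                    # route 10% of traffic to the new revision
        model:
          modelFormat:
            name: huggingface
          args:
            - --model_name=qwen
            - --model_id=Qwen/Qwen2.5-1.5B-Instruct # updated model under test (illustrative)

Removing the canaryTrafficPercent field (or raising it to 100) promotes the new revision once you are satisfied with it.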

Predictive Inference Benefits

  • Multi-framework Model Serving - Deploy models from TensorFlow, PyTorch, Scikit-Learn, XGBoost, and more
  • InferenceGraph for Model Ensembles - Chain and ensemble multiple models together for complex workflows (see the sketch after this list)
  • Batch Prediction Support - Efficient processing of large datasets with batch inference capabilities
  • Preprocessing & Postprocessing - Built-in data transformation pipelines and feature engineering
  • Real-time Scoring - Low-latency prediction serving for real-time applications
  • Production ML Monitoring - Comprehensive observability, drift detection, and explainability
  • Standard Inference Protocols - Support for the Open Inference Protocol (V1/V2) across frameworks
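
The InferenceGraph CRD mentioned above chains InferenceServices declaratively; a minimal two-step sequence might look like this (the graph and service names are hypothetical):

    apiVersion: serving.kserve.io/v1alpha1
    kind: InferenceGraph
    metadata:
      name: model-chainer                  # hypothetical graph name
    spec:
      nodes:
        root:
          routerType: Sequence             # run steps in order; Splitter, Ensemble, and Switch routers also exist
          steps:
            - serviceName: preprocess-svc  # hypothetical InferenceService
              name: preprocess
            - serviceName: classifier-svc  # hypothetical InferenceService
              name: classify
              data: $request               # also forward the original request to this step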

Universal Benefits (Both Inference Types)

  • Serverless Inference Workloads - Automatic scaling, including scale-to-zero, on both CPU and GPU (see the sketch after this list)
  • High Scalability & Density - Intelligent routing and density packing using ModelMesh
  • Enterprise-Ready Operations - Production monitoring, logging, and observability out of the box
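
As a sketch, these are the scaling knobs you would set on an InferenceService predictor (values illustrative); with the Knative-based serverless deployment mode, minReplicas: 0 enables scale-to-zero:

    # Fragment of an InferenceService spec - scaling settings live on the predictor
    spec:
      predictor:
        minReplicas: 0             # allow scale-to-zero when idle (serverless mode)
        maxReplicas: 5             # cap replicas under load
        scaleMetric: concurrency   # autoscale on in-flight requests per replica
        scaleTarget: 10            # target value for the chosen metric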

How It Works

Simple Deployment Model: Describe your model requirements in a simple configuration file, and KServe handles the rest - from container orchestration to load balancing.
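
A minimal example of such a configuration file, using the widely referenced scikit-learn iris model (the storage URI points at a public example bucket):

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn          # declare the model format; KServe picks a matching runtime
          storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # public example model

Applying this with kubectl apply -f is all it takes; KServe provisions the runtime container, networking, and autoscaling around it.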

Built on Kubernetes: KServe extends Kubernetes with custom resources specifically designed for AI/ML workloads, providing cloud-native scalability and reliability while abstracting away the complexity.

Pluggable Architecture: Support for multiple serving runtimes means you can use the best inference engine for your specific model type - whether that's vLLM for LLMs, TorchServe for PyTorch models, or custom containers for specialized needs.
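
Pinning a specific runtime is a one-line change: the runtime field on the model spec selects a ServingRuntime or ClusterServingRuntime by name (this sketch assumes the bundled Triton runtime, kserve-tritonserver, is installed; the storage URI is hypothetical):

    # Fragment of an InferenceService spec - explicit runtime selection
    spec:
      predictor:
        model:
          runtime: kserve-tritonserver   # pin a runtime instead of relying on auto-selection
          modelFormat:
            name: onnx
          storageUri: gs://example-bucket/models/my-onnx-model   # hypothetical model location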

Architecture Overview

KServe consists of two main components:

Control Plane

  • InferenceService CRD - Manages model serving lifecycle
  • InferenceGraph CRD - Orchestrates model ensembles and chaining workflows
  • Serving Runtime - Pluggable model runtime implementations
  • ClusterServingRuntime CRD - Defines cluster-wide model runtimes (see the sketch after this list)
  • LocalModelCache CRD - Cache large models locally on nodes for faster startup and scaling
  • Model Storage - Support for various storage systems (Hugging Face Hub, S3, GCS, Azure, PVC, etc.)
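
A hedged sketch of a ClusterServingRuntime, which declares which model formats a runtime image can serve (the runtime name, image, and arguments are hypothetical):

    apiVersion: serving.kserve.io/v1alpha1
    kind: ClusterServingRuntime
    metadata:
      name: my-custom-runtime                         # hypothetical runtime name
    spec:
      supportedModelFormats:
        - name: sklearn
          version: "1"
          autoSelect: true                            # let KServe pick this runtime for matching models
      containers:
        - name: kserve-container
          image: example.com/my-model-server:latest   # hypothetical serving image
          args:
            - --model_dir=/mnt/models                 # hypothetical flag; KServe mounts models at /mnt/models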

Data Plane

  • Predictor - Serves model predictions
  • Transformer - Pre/post processing logic (see the sketch after this list)
  • Explainer - Model explanations and interpretability
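
For example, a transformer container can be placed in front of the predictor within the same InferenceService to handle pre/post processing (the transformer image and model location are hypothetical):

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: image-classifier                              # hypothetical service name
    spec:
      transformer:
        containers:
          - name: kserve-container
            image: example.com/image-transformer:latest   # hypothetical pre/post-processing image
      predictor:
        model:
          modelFormat:
            name: pytorch
          storageUri: gs://example-bucket/models/classifier   # hypothetical model location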

Quick Start

Ready to deploy your first model? Choose your path:

🚀 Get Started with KServe

Deploy your first GenAI service with a Qwen LLM in minutes

🏗️ Installation Guide

Set up KServe on your Kubernetes cluster

📚 Model Serving Guide

Learn about different serving patterns and frameworks

Supported Model Frameworks

Predictive Inference

  • Scikit-Learn - Python-based ML models
  • XGBoost - Gradient boosting framework
  • TensorFlow - Deep learning models
  • PyTorch - PyTorch models via TorchServe or Triton
  • ONNX - Open Neural Network Exchange models
  • TensorRT - TensorRT-optimized models
  • Hugging Face - Transformers and NLP models
  • MLflow - MLflow packaged models
  • Custom Runtimes - Bring your own serving logic

Generative Inference

  • Large Language Models (LLMs) - Text generation via vLLM
  • Hugging Face Transformers - Text2Text generation
  • OpenAI-Compatible APIs - Chat completions, embeddings, and more

Multi-Framework Support

  • NVIDIA Triton - High-performance inference server
  • AMD - Optimized inference on AMD hardware

What's Next?

Community & Support

Learning Path

  1. Start Here: Deploy your first GenAI model or Deploy your first predictive model
  2. Learn the Basics: Model Serving Guide
  3. API Reference: Complete API documentation
  4. Join the Community: See who's using KServe

KServe is a CNCF incubating project and part of the Kubeflow ecosystem.