KServe

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

Why KServe?

Single platform that unifies Generative and Predictive AI inference on Kubernetes. Simple enough for quick deployments, yet powerful enough to handle enterprise-scale AI workloads with advanced features.

🤖 Generative AI

🧠 LLM-Optimized

OpenAI-compatible inference protocol for seamless integration with large language models (see the example request below)

🚅 GPU Acceleration

High-performance serving with GPU support and optimized memory management for large models

💾 Model Caching

Intelligent model caching to reduce loading times and improve response latency for frequently used models

🗂️ KV Cache Offloading

Advanced memory management with KV cache offloading to CPU/disk for handling longer sequences efficiently

📈 Autoscaling

Request-based autoscaling capabilities optimized for generative workload patterns

🔧 Hugging Face Ready

Native support for Hugging Face models with streamlined deployment workflows
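
The OpenAI-compatible endpoint and Hugging Face support above can be exercised with any OpenAI client or plain HTTP. A minimal sketch, assuming an InferenceService named llm-service in the default namespace, a gateway reachable at localhost:8080, and a model registered under the service name (host, URL, and prompt are illustrative):

# Chat completion request against the OpenAI-compatible endpoint
curl -H "Host: llm-service.default.example.com" \
  -H "Content-Type: application/json" \
  http://localhost:8080/openai/v1/chat/completions \
  -d '{
    "model": "llm-service",
    "messages": [{"role": "user", "content": "What is KServe?"}],
    "max_tokens": 100
  }'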

📊 Predictive AI

🧮 Multi-Framework

Support for TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more

🔀 Intelligent Routing

Seamless request routing between predictor, transformer, and explainer components with automatic traffic management

🔄 Advanced Deployments

Canary rollouts, inference pipelines, and ensembles with InferenceGraph (see the example manifest below)

⚡ Auto-scaling

Request-based autoscaling with scale-to-zero for predictive workloads

🔍 Model Explainability

Built-in support for model explanations and feature attribution to understand prediction reasoning

📊 Advanced Monitoring

Enables payload logging, outlier detection, adversarial detection, and drift detection, with integrations for AI Fairness 360 and the Adversarial Robustness Toolbox (ART)

💰 Cost Efficient

Scale-to-zero on expensive resources when not in use, reducing infrastructure costs
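
To make the canary rollout, autoscaling, and scale-to-zero features above concrete, here is a minimal sketch of a predictor configured with all three. The model name and storage URI are placeholders, scale-to-zero applies in the Knative serverless deployment mode, and canaryTrafficPercent takes effect when a new revision is rolled out:

kubectl apply -f - <<'EOF'
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    minReplicas: 0              # scale to zero when idle
    maxReplicas: 5
    scaleMetric: concurrency    # autoscale on in-flight requests
    scaleTarget: 10             # target concurrent requests per replica
    canaryTrafficPercent: 10    # route 10% of traffic to the latest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF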

Simple and Powerful API

KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning models. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features to your ML deployments.

Standard K8s API across ML frameworks
Pre/post processing and explainability
OpenAI specification support for LLMs
Canary rollouts and A/B testing
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "llm-service"
spec:
predictor:
model:
modelFormat:
name: huggingface
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
storageUri: "hf://meta-llama/Llama-3.1-8B-Instruct"

How KServe Works

KServe provides a Kubernetes custom resource definition for serving ML models on arbitrary frameworks, encapsulating the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features to your ML deployments.

KServe Architecture

Control Plane

Manages the lifecycle of ML models, providing model revision tracking, canary rollouts, and A/B testing

Data Plane

Standardized inference protocol for model servers with request/response APIs, supporting both predictive and generative models

InferenceService

Core Kubernetes custom resource that simplifies ML model deployment with automatic scaling, networking, and health checks

Inference Graph

Enables advanced deployments with pipelines for pre/post processing, ensembles, and multi-model workflows
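
For example, a simple two-step pipeline that runs a preprocessing service before a classifier can be declared as an InferenceGraph. A minimal sketch, assuming two existing InferenceServices named preprocess and classifier:

kubectl apply -f - <<'EOF'
apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "example-pipeline"
spec:
  nodes:
    root:
      routerType: Sequence         # execute the steps one after another
      steps:
        - serviceName: preprocess  # existing InferenceService
        - serviceName: classifier
          data: $response          # pass the previous step's output as input
EOF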

Quick Start

Get started with KServe in minutes. Follow these simple steps to deploy your first model.

1. Install KServe

Install KServe on your Kubernetes cluster (prerequisites such as cert-manager must already be installed):

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.0/kserve.yaml
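
Once the manifest is applied, the KServe controller should come up in the kserve namespace (the namespace name assumes the default install):

kubectl get pods -n kserve   # wait until the controller pods report Running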

2. Create an InferenceService

Deploy a pre-trained model with a simple YAML configuration:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "qwen-llm"
spec:
predictor:
model:
modelFormat:
name: huggingface
storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
resources:
requests:
cpu: "1"
memory: 4Gi
nvidia.com/gpu: "1"

3. Send Inference Requests

Send a chat completion request to the deployed model:

curl -v -H "Host: qwen-llm.default.example.com" \
http://localhost:8080/openai/v1/chat/completions -d @./prompt.json
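
The request body follows the OpenAI chat completions format. A minimal prompt.json could look like the sketch below, where the model name matches the InferenceService name and the prompt is illustrative:

cat <<'EOF' > prompt.json
{
  "model": "qwen-llm",
  "messages": [
    {"role": "user", "content": "Write a haiku about Kubernetes."}
  ],
  "max_tokens": 100,
  "stream": false
}
EOF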

Trusted by Industry Leaders

KServe is used in production by organizations across various industries, providing reliable model inference at scale.

Bloomberg · IBM · Red Hat · NVIDIA · AMD · Kubeflow · Cloudera · Canonical · Cisco · Gojek · Inspur · Max Kelsen · Prosus · Wikimedia Foundation · Naver Corporation · Zillow · Striveworks · Cars24 · Upstage · Intuit · Alauda

Ready to Transform Your ML Deployment?

Simplify your journey from model development to production with KServe's standardized inference platform for both predictive and generative AI models