Control Plane
The KServe control plane is responsible for managing the lifecycle of inference services, coordinating with Kubernetes APIs, handling resource orchestration, and providing autoscaling capabilities. It serves as the brain of the KServe platform, ensuring that model serving workloads are properly deployed, monitored, and scaled according to demand.
Overview
The control plane operates independently from the data plane, allowing for clean separation of concerns and improved system reliability. It handles all management operations while the data plane focuses purely on inference execution. The control plane supports two distinct deployment modes, each with different architectural patterns and capabilities.
Architecture Diagrams
The control plane architecture is illustrated for the following configurations:
- Raw Deployment Mode with Gateway API (Recommended)
- Raw Deployment Mode with Kubernetes Ingress (Limited Functionality)
- Serverless Mode with Knative
- LLM Model Caching Architecture
Core Components
KServe Controller Manager
The KServe Controller Manager is the central component of the control plane, responsible for managing the complete lifecycle of InferenceService resources.
Primary Responsibilities:
- InferenceService Reconciliation: Continuously monitors and reconciles InferenceService custom resources
- Resource Orchestration: Creates and manages underlying Kubernetes resources (Deployments, Services, networking)
- Networking Management: Configures Gateway API resources (recommended) or Kubernetes Ingress
- Autoscaling Configuration: Sets up HPA and optionally KEDA for custom metrics scaling
- Model Server Management: Manages model server containers with appropriate configurations
- Request/Response Handling: Coordinates logging, batching, and model pulling operations
Architecture Integration:
- Operates using the Kubernetes controller pattern with watch-reconcile loops
- Integrates with admission webhooks for validation and mutation
- Coordinates with both Raw Deployment and Serverless deployment modes
- Manages model agent sidecar containers that provide request/response logging, batching, and model pulling
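For illustration, the declarative contract that drives this reconciliation loop is simply an InferenceService spec; in Raw Deployment mode the controller derives the predictor's Deployment, Service, and routing resources from a manifest like the minimal sketch below (the model URI matches the examples later on this page):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris            # child resources are named after this service
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Any change to this resource is picked up by the watch-reconcile loop, which drives the cluster back toward the declared state.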
LocalModel Controller (LLM Model Caching)
The LocalModel Controller is a specialized component designed specifically for efficient large language model (LLM) caching and management.
Key Features:
- Node-Level Model Caching: Creates PersistentVolumeClaims on individual nodes for model storage
- Distributed Model Management: Coordinates model downloading and caching across cluster nodes
- Storage Optimization: Minimizes model download times and storage redundancy
- Cache Lifecycle Management: Handles model cache creation, updates, and cleanup
- Integration with InferenceService: Seamlessly works with InferenceServices that have localModel labels
Specialized Architecture:
- Works in conjunction with LocalModelNode Controller for node-specific operations
- Manages download jobs that fetch models from external storage (S3, GCS, etc.)
- Optimizes for LLM use cases where models are large and benefit from local caching
- Supports multi-node deployments with distributed caching strategies
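As a sketch of how model caching is declared (the resource shape follows recent KServe releases, so exact field names may differ by version, and the model URI, size, and node group name are illustrative), a LocalModelCache resource asks the controllers to pre-download a model onto a node group so that InferenceServices referencing the same storage URI can mount the local copy:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: llama-3-8b-cache                            # illustrative name
spec:
  sourceModelUri: hf://meta-llama/Meta-Llama-3-8B   # model to download and cache
  modelSize: 16Gi                                   # space reserved on each node's local volume
  nodeGroups:
    - gpu-workers                                   # LocalModelNodeGroup hosting the cache
```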
Additional Controllers
InferenceGraph Controller:
- Manages complex inference workflows and model chaining
- Handles multi-model inference pipelines
- Coordinates request routing between multiple models
- Supports conditional logic and branching in inference workflows
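For example, a Sequence router chains two existing InferenceServices so they are called in order (the service names are placeholders):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: model-chainer                # illustrative graph name
spec:
  nodes:
    root:
      routerType: Sequence           # execute steps one after another
      steps:
        - serviceName: preprocessor  # first InferenceService in the chain
          name: preprocess
        - serviceName: classifier    # second InferenceService in the chain
          name: classify
          data: $request             # also forward the original request payload to this step
```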
TrainedModel Controller:
- Manages TrainedModel custom resources for model versioning
- Handles model storage and retrieval operations
- Supports model lineage and metadata management
- Integrates with model registries and storage backends
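A short sketch of the resource itself (the parent service name and memory figure are illustrative): a TrainedModel registers one model against an existing multi-model InferenceService, and the controller handles loading it from storage:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: TrainedModel
metadata:
  name: iris-model-v2                     # illustrative model name
spec:
  inferenceService: sklearn-multimodel    # parent InferenceService that will serve this model
  model:
    framework: sklearn
    storageUri: gs://kfserving-examples/models/sklearn/1.0/model
    memory: 256Mi                         # memory reserved for this model on the server
```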
Networking Components
Gateway API Integration (Recommended)
KServe strongly recommends using Gateway API for advanced networking capabilities:
Advantages over Kubernetes Ingress:
- Advanced Traffic Management: Support for traffic splitting, canary deployments, and blue-green deployments
- Protocol Support: Native support for HTTP/2, gRPC, and other protocols
- Enhanced Security: Better TLS termination and certificate management
- Vendor Agnostic: Standardized API that works across different ingress controllers
- Future-Proof: Active development and expanding feature set
Components:
- HTTPRoute Resources: Fine-grained routing control with header-based routing, path matching, and traffic splitting
- Gateway Resources: Centralized gateway configuration with multiple listeners and protocols
- Backend References: Direct integration with Kubernetes Services and advanced load balancing
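To make the traffic-splitting capability concrete, the following standard Gateway API HTTPRoute weights requests between two predictor Services (the Gateway, hostname, and Service names are placeholders, not resources KServe creates with these exact names):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: sklearn-iris-split
spec:
  parentRefs:
    - name: kserve-ingress-gateway            # Gateway that accepts inference traffic
      namespace: kserve
  hostnames:
    - sklearn-iris.example.com
  rules:
    - backendRefs:
        - name: sklearn-iris-predictor        # stable predictor Service
          port: 80
          weight: 90                          # 90% of traffic
        - name: sklearn-iris-predictor-canary # canary predictor Service
          port: 80
          weight: 10                          # 10% of traffic
```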
Kubernetes Ingress (Limited Support)
While supported for backward compatibility, Kubernetes Ingress has significant limitations:
Limitations:
- ❌ No Advanced Traffic Splitting: Cannot perform percentage-based traffic routing
- ❌ Limited Protocol Support: Primarily HTTP/HTTPS, with inconsistent gRPC support and limited support for KServe's explain endpoints
- ❌ Basic Routing Only: Limited to simple host- and path-based rules, with no header-based routing or advanced request matching
- ❌ Vendor Lock-in: Implementation varies significantly between ingress controllers
Use Cases:
- Legacy environments that cannot adopt Gateway API
- Simple use cases with basic HTTP routing requirements
- Transitional setups migrating to Gateway API
Autoscaling Components
Horizontal Pod Autoscaler (HPA)
- Metrics-Based Scaling: Scales based on CPU, memory, and custom metrics
- Integration: Native Kubernetes integration with all deployment modes
- Configuration: Configurable scaling policies and thresholds
- Stability: Battle-tested and reliable scaling mechanism
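In Raw Deployment mode the HPA is generated from scaling fields on the predictor spec; a minimal sketch (the replica bounds and CPU target are illustrative values):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: cpu          # metric backing the generated HPA
    scaleTarget: 60           # target average CPU utilization in percent
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```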
KEDA Integration (Optional)
- Advanced Metrics: Scales based on external metrics (queues, databases, custom metrics)
- Event-Driven Scaling: Supports scaling based on events and triggers
- Multi-Cloud Support: Works with various cloud provider metrics
- Custom Resources: Uses ScaledObject and ScaledJob custom resources
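As a hedged sketch of event-driven scaling (the Deployment name, Prometheus address, and query are assumptions for illustration, not values KServe generates), a KEDA ScaledObject targets the predictor Deployment and scales it on an external metric:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sklearn-iris-scaler
spec:
  scaleTargetRef:
    name: sklearn-iris-predictor            # predictor Deployment (name is illustrative)
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090                    # assumed Prometheus endpoint
        query: sum(rate(http_requests_total{service="sklearn-iris"}[2m]))   # assumed metric
        threshold: "50"                     # target value of the query per replica
```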
Scale-to-Zero (Serverless Mode Only)
- Knative Autoscaler (KPA): Provides scale-to-zero capabilities
- Activator Component: Handles request queuing and pod activation
- Cold Start Optimization: Minimizes latency when scaling from zero
Admission Webhooks
Validation Webhooks
- Schema Validation: Ensures InferenceService specifications conform to expected schemas
- Business Logic Validation: Enforces custom validation rules and constraints
- Resource Limits: Validates resource requests and limits
- Security Policies: Enforces security policies and compliance requirements
Mutation Webhooks
- Default Injection: Automatically injects default values for optional fields
- Configuration Enhancement: Adds required sidecars and configuration
- Security Context Injection: Applies security contexts and policies
- Resource Optimization: Optimizes resource requests based on model requirements
Core Responsibilities
Resource Management
- InferenceService Creation: Translates high-level InferenceService specifications into underlying Kubernetes resources
- Configuration Management: Handles model configuration, runtime settings, and deployment parameters
- Resource Allocation: Manages CPU, memory, and GPU resource assignments
- Storage Management: Coordinates model storage access and volume mounts
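For example, compute and accelerator assignments are declared directly on the predictor, and the control plane translates them into the pod spec; a sketch for a GPU-backed LLM predictor (the model URI and quantities are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: hf://meta-llama/Meta-Llama-3-8B   # illustrative model location
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"        # GPU allocation handled by the control plane
        limits:
          memory: 16Gi
          nvidia.com/gpu: "1"
```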
Lifecycle Management
- Deployment Orchestration: Coordinates the rollout of new model versions
- Health Monitoring: Tracks service health and readiness
- Rollback Capabilities: Manages rollback to previous stable versions
- Cleanup Operations: Handles resource cleanup during service deletion
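One concrete rollout control, in deployments that support traffic splitting, is the canaryTrafficPercent field: the controller keeps the previous revision serving the remaining traffic, so promotion or rollback is a one-field change (a sketch; the percentage and updated model path are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 10          # route 10% of traffic to the new revision
    sklearn:
      storageUri: gs://example-bucket/models/sklearn/v2   # illustrative updated model version
```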
Integration & Coordination
- Kubernetes API Integration: Seamlessly integrates with native Kubernetes resources
- Networking Configuration: Sets up ingress, services, and network policies
- Security Enforcement: Applies RBAC, security contexts, and admission policies
- Monitoring Integration: Coordinates with observability tools and metrics collection
Deployment Modes
KServe supports two distinct deployment modes, each with different architectural patterns, capabilities, and use cases. The choice between modes significantly impacts the networking, autoscaling, and operational characteristics of your inference services.
Raw Deployment Mode (Recommended for LLMs)
Raw Deployment mode uses standard Kubernetes resources and is recommended for most production environments.
Architecture Characteristics:
- Standard Kubernetes Deployments: Uses Deployment, Service, and networking resources
- Gateway API Networking: Leverages Gateway API for advanced traffic management (recommended)
- Kubernetes Ingress Fallback: Supports Kubernetes Ingress with limited functionality
- HPA + Optional KEDA: Uses HPA for basic scaling, with KEDA for advanced metrics
- Predictable Behavior: Standard Kubernetes behavior with well-understood operational patterns
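The deployment mode is selected per service with the serving.kserve.io/deploymentMode annotation (a cluster-wide default can also be configured); a minimal sketch:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # plain Deployments instead of Knative Services
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```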
Advantages:
- ✅ Production Stability: Battle-tested Kubernetes primitives with predictable behavior
- ✅ Operational Familiarity: Standard Kubernetes tooling and operational practices
- ✅ Resource Control: Fine-grained control over pod lifecycle and resource management
- ✅ Debugging Simplicity: Standard Kubernetes debugging tools and practices
- ✅ Multi-Tenancy: Better isolation and resource management for multi-tenant environments
- ✅ Networking Flexibility: Support for both Gateway API and Ingress networking
- ✅ Persistent Connections: Better support for long-running connections and streaming
Best Use Cases:
- Production environments requiring high availability
- Enterprise deployments with existing Kubernetes expertise
- Multi-tenant environments with strict isolation requirements
- Applications requiring persistent connections or streaming
- Environments with complex networking requirements
Networking Options:
Gateway API (Recommended):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    serving.knative.dev/enablePassthrough: "true"
    networking.kserve.io/gateway-api: "true"
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```
Kubernetes Ingress (Limited):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    serving.knative.dev/enablePassthrough: "true"
    networking.kserve.io/ingress-class: "nginx"
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```
Serverless Mode (Knative-Based)
Serverless mode leverages Knative Serving for event-driven, scale-to-zero capabilities.
Architecture Characteristics:
- Knative Service Resources: Uses Knative Service and Revision resources
- Knative Gateway: Routes traffic through an Istio or Kourier gateway (Istio configures routes via VirtualServices)
- Scale-to-Zero: Automatically scales down to zero replicas when there is no traffic
- Queue Proxy: Each pod includes a Knative queue proxy for metrics and request handling
- Revision Management: Automatic versioning and traffic splitting capabilities
Advantages:
- ✅ Scale-to-Zero: Automatic scaling to zero replicas saves costs
- ✅ Event-Driven: Scales based on incoming requests automatically
- ✅ Revision Management: Built-in versioning and rollback capabilities
- ✅ Traffic Splitting: Native support for canary deployments and A/B testing
- ✅ Resource Efficiency: Optimal resource utilization for variable workloads
- ✅ Cold Start Optimization: Knative optimizations for minimizing cold start latency
Limitations:
- ⚠️ Cold Start Latency: Initial requests may experience higher latency
- ⚠️ Complexity: Additional Knative components increase operational complexity
- ⚠️ Resource Overhead: Queue proxy containers add resource overhead
- ⚠️ Networking Dependencies: Requires Istio or Kourier for networking
- ⚠️ Debugging Complexity: More complex request flow through multiple components
Best Use Cases:
- Cost-sensitive environments benefiting from scale-to-zero
- Variable or unpredictable traffic patterns
- Burst traffic scenarios where rapid scaling is required
Configuration Example:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
    minReplicas: 0    # Enable scale-to-zero
    maxReplicas: 10
    scaleTarget: 70   # Target concurrency per pod
```
Mode Comparison
Feature | Raw Deployment | Serverless |
---|---|---|
Networking | Gateway API (recommended) / Ingress | Knative Gateway (Istio/Kourier) |
Autoscaling | HPA + KEDA | Knative Autoscaler (KPA) |
Scale-to-Zero | ❌ Not supported | ✅ Native support |
Cold Start | ✅ No cold start | ⚠️ Cold start latency |
Traffic Splitting | ✅ With Gateway API | ✅ Native support |
Operational Complexity | ✅ Standard Kubernetes | ⚠️ Additional Knative complexity |
Resource Overhead | ✅ Minimal overhead | ⚠️ Queue proxy overhead |
Production Readiness | ✅ Highly recommended | ⚠️ Depends on requirements |
Choosing the Right Mode
Choose Raw Deployment when:
- Running production workloads requiring high availability
- Need predictable performance without cold starts
- Have existing Kubernetes operational expertise
- Require advanced networking with Gateway API
- Need fine-grained control over pod lifecycle
Choose Serverless when:
- Cost optimization through scale-to-zero is important
- Traffic patterns are variable or unpredictable
- Development/testing environments with limited resources
- Need advanced traffic management features
- Experimenting with inference services
Next Steps
- Explore the Data Plane to understand inference execution.
- Learn about Resources to understand KServe custom resources.