# Understanding LLMInferenceService

## What is LLMInferenceService?
LLMInferenceService is a Kubernetes Custom Resource Definition (CRD) introduced in KServe as part of its strategic shift toward a GenAI-first architecture. Built on the foundation of llm-d, a production-ready framework for scalable LLM serving, LLMInferenceService delivers enterprise-grade capabilities for deploying and managing Large Language Model inference workloads on Kubernetes.
The llm-d project provides a proven architecture for high-performance LLM serving, combining vLLM's inference engine with Kubernetes orchestration and intelligent routing capabilities. Features like KV-cache aware scheduling, disaggregated prefill-decode serving, and distributed inference enable both optimal performance and cost efficiency. By integrating llm-d's architecture through a native Kubernetes CRD, KServe makes these advanced patterns accessible and easy to deploy, allowing users to achieve faster time-to-value while maintaining production-grade reliability.
## Why a Separate CRD?

While KServe has traditionally used the InferenceService CRD for serving machine learning models (and InferenceService can still serve LLMs), the project now adopts a dual-track strategy:

- InferenceService: Optimized for Predictive AI workloads (traditional ML models like scikit-learn, TensorFlow, PyTorch)
- LLMInferenceService: Purpose-built for Generative AI workloads (Large Language Models)

This separation allows KServe to provide specialized features for LLM serving, such as distributed inference, prefill-decode separation, advanced routing, and multi-node orchestration, without adding complexity to the traditional InferenceService API.
## Evolution: Dual-Track Strategy
Strategic Separation:
- InferenceService: Remains the standard for Predictive AI (classification, regression, recommendations)
- LLMInferenceService: Dedicated to Generative AI with specialized optimizations
- Can you use InferenceService for LLMs? Yes, but only for basic single-node deployments. Advanced features like prefill-decode separation, multi-node orchestration, and intelligent scheduling are not available.
## Key Features

### 🎯 Composable Configuration
Mix and match LLMInferenceServiceConfig resources for flexible deployment patterns (see the sketch after this list):
- Model configurations
- Workload templates
- Router settings
- Scheduler policies
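As a minimal sketch of how composition can look, the example below defines a reusable LLMInferenceServiceConfig and references it from an LLMInferenceService, overriding only the model. The config name, the model, and the baseRefs field used to reference the config are illustrative assumptions; see the Configuration Guide for the exact schema.

```yaml
# Reusable base config holding workload defaults (hypothetical name).
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceServiceConfig
metadata:
  name: gpu-workload-defaults
spec:
  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
---
# Service composing the base config and overriding only the model.
# baseRefs is assumed to be the reference field; verify it in the
# Configuration Guide.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen-2-5-7b
spec:
  baseRefs:
    - name: gpu-workload-defaults
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: Qwen/Qwen2.5-7B-Instruct
```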
### 🚀 Multiple Deployment Patterns
- Single-Node: Simple deployments for small models (less than 7B parameters)
- Multi-Node: Distributed inference with LeaderWorkerSet for medium-large models
- Prefill-Decode: Disaggregated serving for cost optimization (sketched after this list)
- DP+EP: Data and Expert Parallelism for MoE models (Mixtral, DeepSeek-R1)
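For illustration, a disaggregated deployment adds a prefill workload next to the decode template via spec.prefill (the workload fields are described in the architecture section below). The replica counts, resources, and the nested layout under prefill are placeholder assumptions, not a tuned configuration.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b-pd
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  # Decode workload: token generation, scaled for throughput
  replicas: 2
  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
  # Prefill workload: prompt processing, scaled independently
  # (assumed to mirror the top-level replicas/template layout)
  prefill:
    replicas: 1
    template:
      containers:
        - name: main
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway: {}
    route: {}
    scheduler: {}  # routes requests between prefill and decode pools
```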
### 🌐 Advanced Routing
- Gateway API: Standard Kubernetes ingress
- Intelligent Scheduling: KV cache-aware, load-balanced routing
- Prefill-Decode Separation: Automatic routing to optimal pools
### ⚡ Distributed Inference
- Tensor Parallelism (TP): Split model layers across GPUs (see the sketch after this list)
- Data Parallelism (DP): Replicate models for throughput
- Expert Parallelism (EP): Distribute MoE experts across nodes
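As a rough sketch of how tensor parallelism maps onto the spec, the example below gives the container four GPUs and passes vLLM's standard --tensor-parallel-size flag. Whether you set this via container args or declaratively via spec.parallelism (described later) is something to confirm in the Configuration Guide, and the model and GPU counts are placeholders.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-70b-tp4
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-70B-Instruct
    name: meta-llama/Llama-3.1-70B-Instruct
  replicas: 1
  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        args:
          # Shard every layer across 4 GPUs on the same node
          - --tensor-parallel-size
          - "4"
        resources:
          limits:
            nvidia.com/gpu: "4"
```

For models that do not fit on a single node, adding spec.worker extends the same idea across nodes with a LeaderWorkerSet, as covered in the multi-node pattern above.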
### 🔧 Production-Ready
- RBAC and authentication
- Model storage integration (HuggingFace, S3, PVC)
- KV cache transfer via RDMA
- Monitoring and metrics
## When to Use LLMInferenceService
| Scenario | Use LLMInferenceService? | Why? |
|---|---|---|
| Serving LLMs (7B-405B params) | ✅ Yes | Optimized for LLM workloads |
| Multi-GPU inference | ✅ Yes | Built-in parallelism support |
| High throughput requirements | ✅ Yes | Prefill-decode separation, intelligent routing |
| Traditional ML models | ❌ No | Use InferenceService instead |
| Small models (less than 7B params) | 🟡 Optional | Either works, but LLMInferenceService offers more features |
## High-Level Architecture Overview

### Core Components at a Glance

#### 1. Model Specification (spec.model)
Defines the LLM model source, name, and characteristics (see the example below):
- Model URI (HuggingFace, S3, PVC)
- Model name for API requests
- Scheduling criticality
- LoRA adapters (optional)
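A hedged sketch of the model block follows; the uri and name mirror the quick example later on this page, while the criticality field name and its value are assumptions to verify against the Configuration Guide.

```yaml
spec:
  model:
    # Source: hf:// (HuggingFace Hub), s3://, or pvc:// style URIs
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    # Model name clients use in API requests
    name: meta-llama/Llama-3.1-8B-Instruct
    # Scheduling criticality (field name and value are assumptions)
    criticality: Critical
```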
Learn more: Configuration Guide
#### 2. Workload Specification

Defines compute resources and deployment patterns:

- spec.template: Single-node or decode workload
- spec.worker: Multi-node workers (triggers LeaderWorkerSet)
- spec.prefill: Prefill-only workload (disaggregated)
Learn more: Configuration Guide
#### 3. Router Specification (spec.router)
Defines traffic routing and load balancing (sketched below):
- Gateway: Entry point for external traffic
- HTTPRoute: Path-based routing rules
- Scheduler: Intelligent endpoint selection (EPP)
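The sketch below reuses an existing Gateway while keeping the HTTPRoute and scheduler managed by KServe; the gateway refs layout and the resource names are assumptions to check against the Architecture Guide.

```yaml
spec:
  router:
    # Attach to an existing Gateway (hypothetical name/namespace;
    # the refs layout is an assumption)
    gateway:
      refs:
        - name: shared-inference-gateway
          namespace: gateway-system
    route: {}      # managed HTTPRoute
    scheduler: {}  # managed scheduler (EPP)
```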
Learn more: Architecture Guide
#### 4. Parallelism Specification (spec.parallelism)
Defines distributed inference strategies (sketched below):
- Tensor Parallelism (TP)
- Data Parallelism (DP)
- Expert Parallelism (EP)
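A sketch of what the parallelism block can express is shown below; the field names (tensor, data, expert) simply mirror the strategies listed above and are assumptions, so consult the Configuration Guide for the exact schema.

```yaml
spec:
  parallelism:
    # Field names are assumptions mirroring the strategies above
    tensor: 8     # shard each layer across 8 GPUs
    data: 2       # run 2 data-parallel replicas for throughput
    expert: true  # distribute MoE experts across the data-parallel ranks
```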
Learn more: Configuration Guide
## Quick Example
Here's a minimal LLMInferenceService for serving Llama-3.1-8B:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  replicas: 3
  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi
  router:
    gateway: {}    # Managed gateway
    route: {}      # Managed HTTPRoute
    scheduler: {}  # Managed scheduler
```
What this creates:
- 3 replica pods with 1 GPU each
- Gateway API ingress
- Intelligent scheduler for load balancing
- Storage initializer for model download
## Documentation Map

This overview provides a high-level introduction to LLMInferenceService. For detailed information, explore the following guides:

### 📚 Core Concepts
- Configuration Guide: Detailed spec reference and configuration patterns
- Architecture Guide: System architecture and component interactions
- Dependencies: Required infrastructure components
### 🔧 Advanced Topics
- Scheduler Configuration: Prefix cache routing, load-aware scheduling
- Multi-Node Deployment: LeaderWorkerSet, RDMA networking
- Security: Authentication, RBAC, network policies
## Summary
LLMInferenceService provides a comprehensive, Kubernetes-native approach to LLM serving with:
- ✅ Composable Configuration: Mix and match configs for flexible deployment
- ✅ Multiple Workload Patterns: Single-node, multi-node, prefill-decode separation
- ✅ Advanced Routing: Gateway API + intelligent scheduler
- ✅ Distributed Inference: Tensor, data, and expert parallelism
- ✅ Production-Ready: Monitoring, RBAC, storage integration, KV cache transfer
This architecture enables organizations to deploy and scale LLM inference workloads efficiently on Kubernetes, with the flexibility to optimize for different model sizes, hardware configurations, and performance requirements.