
LLMInferenceService Architecture Deep Dive

This guide provides an in-depth look at the LLMInferenceService architecture, component interactions, and advanced patterns for production deployments.

Prerequisites: Familiarity with core concepts and configuration is recommended.


System Architecture Overview


Gateway Architecture

What is a Gateway?

A Gateway is the entry point for external traffic into the Kubernetes cluster. It's a Kubernetes Gateway API resource that:

  • Defines listeners (HTTP, HTTPS, ports)
  • Configures TLS termination
  • Is managed by a Gateway provider (e.g., Envoy Gateway, Istio)
  • Can be cluster-scoped or namespace-scoped

Managed vs Referenced Gateway

Managed Gateway (Default)

spec:
  router:
    gateway: {} # KServe creates Gateway automatically

What KServe creates

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llama-3-8b-kserve-gateway
  namespace: default
spec:
  gatewayClassName: eg # Default: Envoy Gateway
  listeners:
    - name: http
      port: 80
      protocol: HTTP

Referenced Gateway (Existing)

spec:
  router:
    gateway:
      refs:
        - name: my-custom-gateway
          namespace: istio-system

Use case: Shared gateway across multiple services.


HTTPRoute Architecture

What is an HTTPRoute?

An HTTPRoute defines path-based routing rules that connect Gateways to backend services (InferencePools or Services).

Managed HTTPRoute (Default)

spec:
  router:
    route: {} # KServe creates HTTPRoute automatically

What KServe creates

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-8b-kserve-route
  namespace: default
spec:
  parentRefs:
    - name: llama-3-8b-kserve-gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-8b-inference-pool

Routing Flow


Scheduler Architecture

Overview

The Scheduler (also called the Endpoint Picker Pod, or EPP) provides intelligent request routing based on:

  • Prefix cache: Routes to pods with matching KV cache blocks
  • Load: Balances requests across available endpoints
  • Prefill-Decode separation: Routes to appropriate pool

Scoring Mechanism

The scheduler tracks KV cache blocks via ZMQ events from vLLM pods:

  • BlockStored: Cache block created (includes block hash, tokens, storage location)
  • BlockRemoved: Cache block evicted from memory

These events populate an index mapping {ModelName, BlockHash} → {PodID, DeviceTier}, allowing the scheduler to track which pods hold which cache blocks.
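
A minimal sketch of such an index, with hypothetical class and field names (illustrative only, not the EPP's actual implementation):

from collections import defaultdict

# Hypothetical index keyed as described above:
# (model name, block hash) -> set of (pod ID, device tier) entries.
class KVBlockIndex:
    def __init__(self):
        self._index = defaultdict(set)

    def on_block_stored(self, model, block_hash, pod_id, tier):
        # BlockStored: remember which pod (and memory tier) now holds the block.
        self._index[(model, block_hash)].add((pod_id, tier))

    def on_block_removed(self, model, block_hash, pod_id, tier):
        # BlockRemoved: the block was evicted from that pod's cache.
        self._index[(model, block_hash)].discard((pod_id, tier))

    def pods_with_block(self, model, block_hash):
        return self._index[(model, block_hash)]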

For each incoming request, the scheduler calculates a weighted score across all endpoints using pluggable scorers:

Scorer              | Weight         | Purpose
Prefix cache scorer | 2.0            | Prioritizes pods with matching KV cache blocks
Load-aware scorer   | 1.0            | Balances requests across endpoints
Queue scorer        | (configurable) | Routes based on queue depth

The request is routed to the highest-scoring pod, optimizing for both cache hit rate and load distribution.
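
As a rough illustration of the weighted-score idea (the Endpoint class and its fields are invented for this sketch; the real scorers are pluggable components inside the EPP):

from dataclasses import dataclass

# Hypothetical per-endpoint stats; field names are illustrative only.
@dataclass
class Endpoint:
    pod_id: str
    prefix_cache_hit_ratio: float  # fraction of the prompt's blocks already cached on this pod
    load: float                    # normalized current load in [0, 1]

# Weights mirror the table above: prefix cache 2.0, load-aware 1.0.
def score(ep: Endpoint) -> float:
    return 2.0 * ep.prefix_cache_hit_ratio + 1.0 * (1.0 - ep.load)

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the highest-scoring pod.
    return max(endpoints, key=score)

pods = [
    Endpoint("pod-1", prefix_cache_hit_ratio=0.1, load=0.2),
    Endpoint("pod-2", prefix_cache_hit_ratio=0.8, load=0.5),
]
print(pick_endpoint(pods).pod_id)  # pod-2: the cache hit outweighs its higher load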


Scheduler vs No Scheduler

Feature           | No Scheduler                    | With Scheduler
Routing           | Kubernetes Service (kube-proxy) | InferencePool (EPP)
Load Balancing    | Round-robin                     | Intelligent (load-aware, cache-aware)
Prefix Cache      | ❌                              | ✅ Routes to pods with matching KV cache
Prefill-Decode    | ❌                              | ✅ Automatic pool selection
Resource Overhead | Minimal (no extra pods)         | Low (1 scheduler pod)
Use Case          | Simple/Dev                      | Production

Request Flow Analysis

Standard Request Flow

1. Client sends request
2. Gateway receives (port 80/443)
3. HTTPRoute matches path (/v1/completions)
4. Routes to InferencePool
5. Gateway queries EPP Service: "Which endpoint should I use?"
6. EPP evaluates:
   - Prefix cache match (weight: 2.0)
   - Current load (weight: 1.0)
7. EPP returns selected endpoint: "Use Pod 2 (10.0.1.42:8000)"
8. Gateway forwards to Pod 2
9. Pod 2 processes inference
10. Response flows back to client
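
For reference, a request that triggers this flow might look like the following (the gateway address is a placeholder for your deployment's external address, and the model name depends on how the model is served; the /v1/completions path matches step 3 above):

import requests

# Replace with your Gateway's external address (LoadBalancer IP or hostname).
GATEWAY_URL = "http://<gateway-address>/v1/completions"

resp = requests.post(
    GATEWAY_URL,
    json={
        "model": "llama-3-8b",  # placeholder: use the model name your deployment serves
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())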

Prefill-Decode Request Flow

1. Client sends NEW request (no KV cache)
2. Gateway → HTTPRoute → InferencePool
3. EPP calculates scores to decide which pod should take the new request → routes to the Decode Pool
4. Routing container (in the selected Decode Pod) forwards the request to a Prefill Pod, which processes the prompt and generates the KV cache
5. KV cache is transferred to the Decode Pod via RDMA if needed
6. Response includes KV transfer metadata
7. Client sends CONTINUATION request (with KV cache ID)
8. EPP calculates scores (detects the KV cache match on the Decode Pod, skips Prefill) → routes directly to the Decode Pod
9. Decode Pod uses the transferred KV cache
10. Token-by-token generation

Network Flow

KV Cache Communication

LLM serving requires two types of KV cache communication:

1. KV Cache Event Tracking (ZMQ)

Purpose: Real-time monitoring of KV cache blocks for intelligent routing

  • Protocol: ZMQ (ZeroMQ) over TCP/IP
  • Usage: vLLM publishes events when cache blocks are created/evicted
  • Consumer: Scheduler (EPP) tracks which pods have which cache blocks (see Scoring Mechanism)

Configuration:

spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv-events-config '{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://scheduler:5557\",\"topic\":\"kv@${POD_IP}@model\"}'"

2. KV Cache Data Transfer (NixlConnector)

Purpose: Actual KV cache block transfer for prefill-decode separation

  • Protocol: NixlConnector (RDMA-based, RoCE network)
  • Usage: Transfer KV cache blocks from prefill pods to decode pods
  • Use Case: Disaggregated prefill-decode architecture

Configuration:

spec:
  template:
    containers:
      - name: main
        env:
          - name: KSERVE_INFER_ROCE
            value: "true"
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv_transfer_config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

Advanced Patterns

Pattern: Multi-Node Prefill-Decode

Combining P/D separation with LeaderWorkerSet:

spec:
  # Decode workload (multi-node)
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"
  worker:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"

  # Prefill workload (multi-node)
  prefill:
    parallelism:
      tensor: 4
      data: 16
      dataLocal: 8
    template:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"
    worker:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"

Result:

  • Prefill: 2 LWS replicas (data 16 / dataLocal 8), each with 8 GPUs
  • Decode: 2 LWS replicas (data 8 / dataLocal 4), each with 4 GPUs
  • Total: 24 GPUs
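
A quick sanity check of these numbers, mirroring the document's own arithmetic rather than KServe's actual scheduling logic:

# Back-of-the-envelope check of the totals listed above.
prefill_replicas = 16 // 8  # data / dataLocal -> 2
decode_replicas = 8 // 4    # data / dataLocal -> 2

total_gpus = prefill_replicas * 8 + decode_replicas * 4  # 2*8 + 2*4
print(prefill_replicas, decode_replicas, total_gpus)     # 2 2 24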

Next Steps