
LLMInferenceService Architecture Deep Dive

This guide provides an in-depth look at the LLMInferenceService architecture, component interactions, and advanced patterns for production deployments.

Prerequisites: Familiarity with core concepts and configuration is recommended.


System Architecture Overview


Gateway Architecture

What is a Gateway?

A Gateway is the entry point for external traffic into the Kubernetes cluster. It's a Kubernetes Gateway API resource that:

  • Defines listeners (HTTP, HTTPS, ports)
  • Configures TLS termination
  • Is managed by a Gateway provider (e.g., Envoy Gateway, Istio)
  • Can be cluster-scoped or namespace-scoped

Managed vs Referenced Gateway

Managed Gateway (Default)

spec:
  router:
    gateway: {} # KServe creates Gateway automatically

What KServe creates

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llama-3-8b-kserve-gateway
  namespace: default
spec:
  gatewayClassName: eg # Default: Envoy Gateway
  listeners:
    - name: http
      port: 80
      protocol: HTTP

Referenced Gateway (Existing)

spec:
  router:
    gateway:
      refs:
        - name: my-custom-gateway
          namespace: istio-system

Use case: Shared gateway across multiple services.


HTTPRoute Architecture

What is an HTTPRoute?

An HTTPRoute defines path-based routing rules that connect Gateways to backend services (InferencePools or Services).

Managed HTTPRoute (Default)

spec:
  router:
    route: {} # KServe creates HTTPRoute automatically

What KServe creates

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-8b-kserve-route
  namespace: default
spec:
  parentRefs:
    - name: llama-3-8b-kserve-gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-8b-inference-pool

Routing Flow


Scheduler Architecture

Overview

The Scheduler (also called the Endpoint Picker Pod, or EPP) provides intelligent request routing based on:

  • Prefix cache: Routes to pods with matching KV cache blocks
  • Load: Balances requests across available endpoints
  • Prefill-Decode separation: Routes to appropriate pool

Scoring Mechanism

The scheduler tracks KV cache blocks via ZMQ events from vLLM pods:

  • BlockStored: Cache block created (includes block hash, tokens, storage location)
  • BlockRemoved: Cache block evicted from memory

These events populate an index mapping {ModelName, BlockHash} → {PodID, DeviceTier}, allowing the scheduler to track which pods hold which cache blocks.
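
A minimal sketch of such an index, with hypothetical class and field names (illustrative only, not the EPP's actual implementation):

from collections import defaultdict

# Hypothetical index keyed as described above:
# (model name, block hash) -> set of (pod ID, device tier) entries.
class KVBlockIndex:
    def __init__(self):
        self._index = defaultdict(set)

    def on_block_stored(self, model, block_hash, pod_id, tier):
        # BlockStored: remember which pod (and memory tier) now holds the block.
        self._index[(model, block_hash)].add((pod_id, tier))

    def on_block_removed(self, model, block_hash, pod_id, tier):
        # BlockRemoved: the block was evicted from that pod's cache.
        self._index[(model, block_hash)].discard((pod_id, tier))

    def pods_with_block(self, model, block_hash):
        return self._index[(model, block_hash)]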

For each incoming request, the scheduler calculates a weighted score across all endpoints using pluggable scorers:

Scorer              | Weight         | Purpose
Prefix cache scorer | 2.0            | Prioritizes pods with matching KV cache blocks
Load-aware scorer   | 1.0            | Balances requests across endpoints
Queue scorer        | (configurable) | Routes based on queue depth

The request is routed to the highest-scoring pod, optimizing for both cache hit rate and load distribution.
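
As a rough illustration of the weighted-score idea (the Endpoint class and its fields are invented for this sketch; the real scorers are pluggable components inside the EPP):

from dataclasses import dataclass

# Hypothetical per-endpoint stats; field names are illustrative only.
@dataclass
class Endpoint:
    pod_id: str
    prefix_cache_hit_ratio: float  # fraction of the prompt's blocks already cached on this pod
    load: float                    # normalized current load in [0, 1]

# Weights mirror the table above: prefix cache 2.0, load-aware 1.0.
def score(ep: Endpoint) -> float:
    return 2.0 * ep.prefix_cache_hit_ratio + 1.0 * (1.0 - ep.load)

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the highest-scoring pod.
    return max(endpoints, key=score)

pods = [
    Endpoint("pod-1", prefix_cache_hit_ratio=0.1, load=0.2),
    Endpoint("pod-2", prefix_cache_hit_ratio=0.8, load=0.5),
]
print(pick_endpoint(pods).pod_id)  # pod-2: the cache hit outweighs its higher load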


Scheduler vs No Scheduler

Feature           | No Scheduler                    | With Scheduler
Routing           | Kubernetes Service (kube-proxy) | InferencePool (EPP)
Load Balancing    | Round-robin                     | Intelligent (load-aware, cache-aware)
Prefix Cache      | ❌                              | ✅ Routes to pods with matching KV cache
Prefill-Decode    | ❌                              | ✅ Automatic pool selection
Resource Overhead | Minimal (no extra pods)         | Low (1 scheduler pod)
Use Case          | Simple/Dev                      | Production

Request Flow Analysis

Standard Request Flow

1. Client sends request
2. Gateway receives (port 80/443)
3. HTTPRoute matches path (/v1/completions)
4. Routes to InferencePool
5. Gateway queries EPP Service: "Which endpoint should I use?"
6. EPP evaluates:
   - Prefix cache match (weight: 2.0)
   - Current load (weight: 1.0)
7. EPP returns selected endpoint: "Use Pod 2 (10.0.1.42:8000)"
8. Gateway forwards to Pod 2
9. Pod 2 processes inference
10. Response flows back to client
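
For reference, a request that triggers this flow might look like the following (the gateway address is a placeholder for your deployment's external address, and the model name depends on how the model is served; the /v1/completions path matches step 3 above):

import requests

# Replace with your Gateway's external address (LoadBalancer IP or hostname).
GATEWAY_URL = "http://<gateway-address>/v1/completions"

resp = requests.post(
    GATEWAY_URL,
    json={
        "model": "llama-3-8b",  # placeholder: use the model name your deployment serves
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())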

Prefill-Decode Request Flow

1. Client sends NEW request (no KV cache)
2. Gateway → HTTPRoute → InferencePool
3. EPP calculates scores to decide which pod should take the new request → routes to the Decode Pool
4. Routing container (in the selected Decode Pod) forwards the request to a Prefill Pod, which processes the prompt and generates the KV cache
5. KV cache is transferred to the Decode Pod via RDMA if needed
6. Response includes KV transfer metadata
7. Client sends CONTINUATION request (with KV cache ID)
8. EPP calculates scores (detects the KV cache match on the Decode Pod, skips Prefill) → routes directly to the Decode Pod
9. Decode Pod uses the transferred KV cache
10. Token-by-token generation

Network Flow

KV Cache Communication

LLM serving requires two types of KV cache communication:

1. KV Cache Event Tracking (ZMQ)

Purpose: Real-time monitoring of KV cache blocks for intelligent routing

  • Protocol: ZMQ (ZeroMQ) over TCP/IP
  • Usage: vLLM publishes events when cache blocks are created/evicted
  • Consumer: Scheduler (EPP) tracks which pods have which cache blocks (see Scoring Mechanism)

Configuration:

spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv-events-config '{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://scheduler:5557\",\"topic\":\"kv@${POD_IP}@model\"}'"

2. KV Cache Data Transfer (NixlConnector)

Purpose: Actual KV cache block transfer for prefill-decode separation

  • Protocol: NixlConnector (RDMA-based, RoCE network)
  • Usage: Transfer KV cache blocks from prefill pods to decode pods
  • Use Case: Disaggregated prefill-decode architecture

Configuration:

spec:
  template:
    containers:
      - name: main
        env:
          - name: KSERVE_INFER_ROCE
            value: "true"
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv_transfer_config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

Advanced Patterns

Pattern: Multi-Node Prefill-Decode

Combining P/D separation with LeaderWorkerSet:

spec:
  # Decode workload (multi-node)
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"
  worker:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"

  # Prefill workload (multi-node)
  prefill:
    parallelism:
      tensor: 4
      data: 16
      dataLocal: 8
    template:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"
    worker:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"

Result:

  • Prefill: 2 LWS replicas (data 16 / dataLocal 8), each with 8 GPUs
  • Decode: 2 LWS replicas (data 8 / dataLocal 4), each with 4 GPUs
  • Total: 24 GPUs
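
A quick sanity check of these numbers, mirroring the document's own arithmetic rather than KServe's actual scheduling logic:

# Back-of-the-envelope check of the totals listed above.
prefill_replicas = 16 // 8  # data / dataLocal -> 2
decode_replicas = 8 // 4    # data / dataLocal -> 2

total_gpus = prefill_replicas * 8 + decode_replicas * 4  # 2*8 + 2*4
print(prefill_replicas, decode_replicas, total_gpus)     # 2 2 24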

Next Steps