Data Plane

The KServe data plane is responsible for executing inference requests with high performance and low latency. It handles the actual model serving workloads, including prediction, generation, transformation, and explanation tasks. The data plane is designed to be independent from the control plane, focusing purely on inference execution.

Overview

The data plane consists of runtime components that serve models and process inference requests. It supports multiple protocols, frameworks, and deployment patterns while maintaining high throughput and low latency characteristics essential for production model serving.

KServe's data plane introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iteration and consistency across InferenceServices, and supports both ease-of-use and high-performance use cases.

For traditional ML models, KServe offers the V1 and Open Inference Protocol (V2) inference protocols, while for generative AI workloads, it supports the OpenAI-compatible API specification which has become the de facto standard for large language model (LLM) interactions. This enables seamless integration with popular LLM clients and tools while providing streaming capabilities essential for generative AI applications.
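
For reference, the two predictive protocols differ mainly in their endpoint paths and payload shapes. The sketch below sends the same prediction through each protocol over plain HTTP; the host name, model name, and tensor shape are illustrative placeholders, not values from this documentation.

import requests

BASE = "http://sklearn-iris.example.com"  # hypothetical InferenceService host

# V1 protocol: "instances"-based payload, predict verb in the path
v1 = requests.post(
    f"{BASE}/v1/models/sklearn-iris:predict",
    json={"instances": [[6.8, 2.8, 4.8, 1.4]]},
)
print(v1.json())  # e.g. {"predictions": [1]}

# Open Inference Protocol (V2): named tensors with shape and datatype
v2 = requests.post(
    f"{BASE}/v2/models/sklearn-iris/infer",
    json={
        "inputs": [
            {
                "name": "input-0",
                "shape": [1, 4],
                "datatype": "FP32",
                "data": [6.8, 2.8, 4.8, 1.4],
            }
        ]
    },
)
print(v2.json())  # outputs are returned in the same named-tensor structure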

By implementing these protocols, both inference clients and servers increase their utility and portability by operating seamlessly on platforms that have standardized around these APIs. KServe's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving and TorchServe.

KServe also supports advanced patterns such as model ensembling and A/B testing by composing multiple InferenceServices together, enabling complex inference workflows behind a simple, consistent API.

Architecture

note

This documentation covers both Generative Inference (for large language models and generative AI) and Predictive Inference (for traditional ML models). The sections below describe the generative inference architecture and protocols.

Generative Inference Architecture

The Generative Inference architecture in KServe is specifically designed to handle large language models (LLMs) and other generative AI workloads that require specialized handling for streaming responses, token-by-token generation, and efficient memory management.

[Diagram: Generative Inference architecture]

Key Components

Envoy AI Gateway

The top-level gateway that provides:

  • Unified API: Consistent interface for different model types
  • Usage Tracking: Monitoring and metering of model usage
  • Intelligent Routing: Smart request distribution based on model requirements
  • Rate Limiting: Controls to prevent system overload
  • LLM Observability: Specialized monitoring for LLM performance

Gateway API Inference Extension

Provides enhanced routing capabilities:

  • Endpoint Picker: Implements intelligent routing based on multiple factors
  • QoS Management: Quality of service guarantees
  • Load-Aware Routing: Distributes requests based on current system load
  • Cache-Aware Routing: Utilizes caching for improved performance

Control Plane Components

The architecture also includes the following control plane components:

  • KServe Controller: Manages the lifecycle of inference services
  • KEDA LLM Autoscaler: Provides autoscaling capabilities specific to LLM workloads
  • Model Cache Controller: Manages the model caching system
  • GPU Scheduler/DRA: Handles GPU resource allocation and scheduling

Inference Service Deployment

Standard deployment for simpler generative models:

  • vLLM Containers: Multiple optimized inference containers for LLMs
  • Storage Containers: Handle model download, storage, and retrieval from the cache
  • Model Cache / OCI: Caches models for faster access and reduced latency across nodes. Also supports OCI image formats for model storage.

Distributed Inference

For larger models requiring distributed processing:

  • vLLM Head Deployment: Coordinates inference across worker nodes
  • vLLM Worker Deployment: Distributed model computation
  • vLLM Prefill Deployment: Handles initial context processing
  • vLLM Decoding Deployment: Manages token generation process

Infrastructure Components

  • Heterogeneous GPU Farm: Supports different GPU types (H100, H200, A100, MIG)
  • Distributed KV Cache: Shared memory system for model key-value pairs
  • Model Registry / Huggingface Hub / GCS / Azure / S3: Multiple model storage options including cloud storage integrations

Traffic Flow

There are two main request flows for generative inference: standard inference for simpler models and distributed inference for larger models. Both paths support streaming responses, which is crucial for LLM applications to provide low-latency user experiences.

The following sequence diagram illustrates how requests flow through the KServe generative inference data plane components.

[Sequence diagram: request flow through the generative inference data plane components]

Data Plane Protocols

KServe supports popular API protocols for generative models, primarily based on the OpenAI API specification, which has become a de facto standard for large language model (LLM) inference.

OpenAI Compatible APIs

The OpenAI-compatible API endpoints are designed to provide a familiar interface for LLM applications:

| API             | Method | Endpoint             | Description                                                |
|-----------------|--------|----------------------|------------------------------------------------------------|
| Chat Completion | POST   | /v1/chat/completions | Generate conversational responses from a given chat history |
| Completion      | POST   | /v1/completions      | Generate text completions for a given prompt               |
| Embeddings      | POST   | /v1/embeddings       | Generate vector embeddings for input text                  |
| Score*          | POST   | /v1/score            | Get the score for a given text                             |

*Note: The Score API is not part of the OpenAI specification; it is an additional endpoint exposed by KServe's generative inference runtimes.
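
Because these endpoints follow the OpenAI specification, existing OpenAI client libraries can be pointed at a KServe-hosted model by overriding the base URL. The sketch below uses the openai Python package; the base URL, model name, and API key are placeholders for illustration, not values defined by KServe.

from openai import OpenAI

# Point a standard OpenAI client at an OpenAI-compatible KServe endpoint.
# The host and model names below are hypothetical.
client = OpenAI(
    base_url="http://llm.example.com/openai/v1",
    api_key="not-needed",  # many self-hosted endpoints ignore the key
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # whichever model the InferenceService serves
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, can you help me with KServe?"},
    ],
    temperature=0.7,
    max_tokens=150,
)
print(response.choices[0].message.content)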

Streaming Support

Generative inference protocols support streaming responses, which is crucial for LLM applications:

  • Server-Sent Events (SSE): Used for token-by-token streaming
  • Chunk-Based Responses: Partial responses sent as they're generated
  • Early Termination: Ability to cancel ongoing generations (see the client sketch after this list)
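
Below is a minimal sketch of consuming a streamed chat completion with the openai Python client; the endpoint and model names are assumptions. Each SSE chunk carries an incremental delta of the response, and stopping iteration cancels the generation client-side.

from openai import OpenAI

client = OpenAI(base_url="http://llm.example.com/openai/v1", api_key="not-needed")

# stream=True asks the server to return Server-Sent Events (SSE) chunks.
stream = client.chat.completions.create(
    model="llama-3-8b-instruct",  # hypothetical model name
    messages=[{"role": "user", "content": "Explain KServe in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    # Breaking out of this loop (or closing the stream) terminates the
    # generation early on the client side.
print()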

Chat Completion API Example

// Request
POST /v1/chat/completions
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, can you help me with KServe?"}
  ],
  "temperature": 0.7,
  "max_tokens": 150,
  "stream": false
}

// Response (non-streaming)
{
  "id": "cmpl-123abc",
  "object": "chat.completion",
  "created": 1694268762,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes, I'd be happy to help you with KServe! KServe is a serverless framework for serving machine learning models on Kubernetes..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 33,
    "total_tokens": 51
  }
}

Embedding API Example

// Request
POST /v1/embeddings
{
  "model": "text-embedding-ada-002",
  "input": "The food was delicious and the service was excellent."
}

// Response
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023064255, -0.009327292, ...],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Key Features of Generative Inference Protocol

  • Parameter Control: Fine-tuning of generation parameters (temperature, top-p, etc.)
  • Streaming: Token-by-token response streaming
  • Function Calling: Structured outputs and function invocation (see the sketch after this list)
  • System Instructions: Ability to set behavior constraints
  • Token Management: Tracking token usage for quota/billing
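
As a brief illustration of parameter control and function calling through the OpenAI-compatible API, the sketch below passes a tool definition with the request; whether tool calls are produced depends on the serving runtime and model, and the endpoint, model, and function names are hypothetical.

from openai import OpenAI

client = OpenAI(base_url="http://llm.example.com/openai/v1", api_key="not-needed")

# Describe a callable function; models/runtimes that support tool calling
# may return a structured tool_call instead of free-form text.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical client-side function
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    temperature=0.2,  # parameter control: lower temperature for more deterministic output
    max_tokens=100,
)

choice = response.choices[0]
if choice.message.tool_calls:
    call = choice.message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(choice.message.content)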

Next Steps

For Generative Inference

  • Learn about vLLM integration for efficient LLM serving
  • Understand model sharding and distribution for large language models
  • Explore streaming APIs for token-by-token generation
  • Implement caching strategies for improved LLM performance

General Resources

  • Explore Control Plane to understand service management
  • Learn about Resources to understand KServe custom resources