Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d
Enterprises today seek to integrate generative AI (GenAI) capabilities into their applications. However, scaling large AI models introduces complexity: serving high-volume traffic to large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.
Platform engineering leaders require more than just model deployment capabilities. They need a robust, Kubernetes-native infrastructure that supports:
- Efficient GPU utilization
- Intelligent request routing
- Distributed inference patterns
- Cost-aware autoscaling
- Production-grade governance
This article demonstrates how two open-source solutions, KServe and llm-d, can be combined to address these challenges.
We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with a deeper focus on KServe's LLMInferenceService, available since KServe v0.16.
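To ground the discussion, the sketch below shows the general shape of an LLMInferenceService custom resource. This is an illustrative example, not the authoritative schema: the resource kind comes from KServe, but the specific field names and values here (the model URI, replica count, and router stanza) are assumptions chosen for readability; consult the KServe documentation for the exact spec of your KServe version.

```yaml
# Illustrative sketch of a KServe LLMInferenceService manifest.
# Field names below are assumptions for exposition; verify against
# the KServe v0.16 API reference before use.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-example          # hypothetical name
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct   # example model source
  replicas: 2                  # baseline replica count before autoscaling
  router: {}                   # enable llm-d-style intelligent request routing
```

Conceptually, a single resource of this kind lets a platform team declare the model, its scaling baseline, and its routing behavior together, rather than wiring those concerns up separately.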

