


Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d

8 min read
Yuan Tang
Project Lead, KServe; Senior Principal Software Engineer, Red Hat
Ran Pollak
Manager, AI Catalyst at Red Hat

Enterprises today seek to integrate generative AI (GenAI) capabilities into their applications. However, scaling large AI models introduces complexity: serving high-volume traffic to large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.

Platform engineering leaders require more than just model deployment capabilities. They need a robust, Kubernetes-native infrastructure that supports:

  • Efficient GPU utilization
  • Intelligent request routing
  • Distributed inference patterns
  • Cost-aware autoscaling
  • Production-grade governance

This article demonstrates how two open-source solutions, KServe and llm-d, can be combined to address these challenges.

We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with a deeper focus on KServe's LLMInferenceService, available since KServe v0.16.
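As a first taste of what is covered later, a minimal LLMInferenceService manifest might look like the sketch below. This is an illustrative assumption, not the authoritative schema: the apiVersion, field names, and model URI shown here are placeholders, so consult the KServe v0.16 documentation for the exact spec.

```yaml
# Illustrative sketch only: apiVersion and field names are assumptions,
# not the definitive LLMInferenceService schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-demo                # hypothetical resource name
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct  # example model reference
  replicas: 2                     # scale out for throughput
  router: {}                      # opt into llm-d-style request routing with defaults
```

Once applied with `kubectl apply -f`, such a resource would let the platform handle GPU scheduling, routing, and autoscaling concerns discussed above, rather than leaving them to each application team.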