Announcing KServe v0.18 - Multi-Node Inference, OpenAI Responses API, and LLM-D v0.6

April 29, 2026 · 11 min read

Approver, KServe; Senior Software Engineer, Red Hat


We are excited to announce the release of KServe v0.18. This release brings multi-node inference support without Ray, LeaderWorkerSet (LWS)-based autoscaling for multi-node workloads, OpenAI Responses API routing, namespace-scoped ModelCache, vLLM upgrade to v0.19.0, llm-d v0.6 integration, enhanced security hardening, and GKE Gateway compatibility improvements.

🖥️ Multi-Node Inference Without Ray

KServe v0.18 adds the groundwork for multi-node InferenceService deployments without requiring Ray (#5366).

Ray is a distributed computing framework commonly used to coordinate multi-node inference workloads. While powerful, it adds operational overhead — a Ray head node to manage, separate scaling concerns, and another failure domain.

Previously, multi-node distributed inference required deploying a Ray cluster for execution coordination. With recent vLLM updates supporting the mp (multiprocessing) distributed executor backend, KServe can now orchestrate multi-node inference directly — eliminating the need for a Ray head node and reducing operational complexity.

A new multinode/executor-backend annotation on the ServingRuntime spec switches between ray (default) and mp mode. In mp mode, the number of nodes is derived from pipeline parallelism (PP) and GPUs per node from tensor parallelism (TP), rather than being managed externally by Ray. The controller creates a worker headless service for pod DNS discovery and injects the necessary environment variables (PIPELINE_PARALLEL_SIZE, TENSOR_PARALLEL_SIZE, WORKER_SVC) into worker containers.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: Standard
    serving.kserve.io/autoscalerClass: none
  name: llama3-multinode-mp
spec:
  predictor:
    model:
      runtime: kserve-vllm-multinode  # vLLM-based runtime with mp backend
      modelFormat:
        name: huggingface
      storageUri: pvc://llama-3-70b-pvc/hf/70b_instruction_tuned
    workerSpec:
      pipelineParallelSize: 2  # Number of nodes
      tensorParallelSize: 4    # GPUs per node
      containers:
        - name: worker-container
          resources:
            limits:
              nvidia.com/gpu: "4"
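
The sizing rule described above can be made concrete with a small Python sketch. It only mirrors the stated mapping (nodes from pipeline parallelism, GPUs per node from tensor parallelism) and is not the controller's actual code. The helper names and the `WORKER_SVC` value are hypothetical; only the three environment variable names come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiNodeLayout:
    nodes: int          # derived from pipeline parallelism (PP)
    gpus_per_node: int  # derived from tensor parallelism (TP)

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node


def layout_from_parallelism(pipeline_parallel_size: int,
                            tensor_parallel_size: int) -> MultiNodeLayout:
    """Map PP/TP sizes to a node layout, as described for mp mode."""
    if pipeline_parallel_size < 1 or tensor_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return MultiNodeLayout(nodes=pipeline_parallel_size,
                           gpus_per_node=tensor_parallel_size)


def worker_env(layout: MultiNodeLayout, worker_svc: str) -> dict:
    """Build the environment variables named in this post.

    The worker_svc value is a hypothetical placeholder, not a real
    service name generated by KServe.
    """
    return {
        "PIPELINE_PARALLEL_SIZE": str(layout.nodes),
        "TENSOR_PARALLEL_SIZE": str(layout.gpus_per_node),
        "WORKER_SVC": worker_svc,
    }


# Values taken from the InferenceService example above.
layout = layout_from_parallelism(pipeline_parallel_size=2,
                                 tensor_parallel_size=4)
env = worker_env(layout, worker_svc="example-worker-svc")
print(layout.nodes, layout.gpus_per_node, layout.total_gpus)
```

With the values from the example manifest, the layout is 2 nodes with 4 GPUs each, 8 GPUs in total.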

🚀 LLMInferenceService Enhancements

⚡ LeaderWorkerSet (LWS)-Based Autoscaling for Multi-Node Workloads

LeaderWorkerSet (LWS) is a Kubernetes workload API for distributed multi-node applications that treats a group of pods spanning multiple nodes as a single deployable unit. LLMInferenceService now supports LeaderWorkerSet as an autoscaling target for multi-node workloads (#5356). Previously, validation rules blocked combining spec.worker (multi-node) with spec.scaling (autoscaling) because the Workload Variant Autoscaler (WVA) only supported Deployment as a scaleTargetRef. As of WVA v0.6.0, LeaderWorkerSet is also a supported scaling target.

This enables scaling of distributed inference deployments where each replica spans multiple nodes — scaling entire multi-node replica groups up and down as a unit rather than individual pods.

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-70b-autoscaled
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-70B-Instruct
    name: meta-llama--Llama-3.1-70B-Instruct
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  scaling:
    minReplicas: 1
    maxReplicas: 4
    wva:
      variantCost: "15.0"
      keda:
        pollingInterval: 30
        cooldownPeriod: 300
        idleReplicaCount: 0
  template:
    containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "4"
  worker:
    containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "4"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}

🔄 LLM-D v0.6 Upgrade

llm-d is a Kubernetes-native distributed inference scheduler built for disaggregated prefill/decode workloads.

The llm-d dependency has been upgraded to v0.6 (#5121, #5346), bringing improvements to the inference scheduler and endpoint picker.

The Workload Variant Autoscaler (WVA), KServe's scaling component that manages traffic-weighted variant routing alongside replica scaling, has also been updated to v0.6.0 (#5333, #5344) with enhanced scaling behavior and LeaderWorkerSet support.

🌐 OpenAI Responses API Support

KServe now routes traffic to the OpenAI Responses API endpoint (/v1/responses) via HTTPRoute (#5291). This enables LLMInferenceService deployments to serve the latest OpenAI-compatible API surface, including the Responses API that combines chat completions with tool use and structured outputs in a single endpoint.
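
As a rough illustration of what a client call to the routed endpoint might look like, the following Python sketch builds a POST request for /v1/responses using only the standard library. The base URL is a hypothetical placeholder, the model name is borrowed from the LLMInferenceService example earlier in this post, and the payload uses the `model` and `input` fields of the OpenAI Responses API. The request is constructed but not sent.

```python
import json
import urllib.request


def build_responses_request(base_url: str, model: str,
                            prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/responses endpoint."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical gateway address; replace with the URL from the service status.
request = build_responses_request(
    base_url="http://llama3.example.com",
    model="meta-llama--Llama-3.1-70B-Instruct",
    prompt="Say hello in one sentence.",
)
print(request.get_method(), request.full_url)

# To actually send it against a running deployment:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```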

🔗 SectionName Support for Gateway Refs

In the Kubernetes Gateway API, a single Gateway can expose multiple listeners on different ports or protocols. SectionName is the field that lets a route target one specific listener rather than all of them.

LLMInferenceService now supports SectionName in Gateway references (#5410), allowing HTTPRoutes to target specific listeners on a shared Gateway. This is particularly useful in multi-tenant environments that share one Gateway across several listeners.
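
The listener-selection behavior can be sketched in a few lines of Python. This is a simplified model of the Gateway API rule described above (no section name means all listeners, a section name means the one listener with that name), not KServe or Gateway controller code. The listener names and ports are made up for illustration.

```python
from typing import Optional


def attached_listeners(listeners: list,
                       section_name: Optional[str] = None) -> list:
    """Return the names of the listeners a route reference would target."""
    if section_name is None:
        # Without a section name, the route targets every listener.
        return [listener["name"] for listener in listeners]
    # With a section name, only the matching listener is targeted.
    return [listener["name"] for listener in listeners
            if listener["name"] == section_name]


# Hypothetical shared Gateway with two listeners.
gateway_listeners = [
    {"name": "team-a-https", "port": 443, "protocol": "HTTPS"},
    {"name": "team-b-http", "port": 8080, "protocol": "HTTP"},
]

print(attached_listeners(gateway_listeners))
print(attached_listeners(gateway_listeners, section_name="team-a-https"))
```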

🔑 ImagePullSecrets Propagation

LLMInferenceService now propagates imagePullSecrets from the default ServiceAccount to workload pods (#5324), simplifying deployments in private registry environments where pull credentials are managed centrally through the namespace's default ServiceAccount.
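
Conceptually, the propagation amounts to copying the ServiceAccount's secret references into the pod spec. The Python sketch below shows that idea on plain dictionaries shaped like the Kubernetes objects. The function name, the de-duplication by name, and the sample secret names are assumptions for illustration and may not match what the controller does.

```python
def propagate_image_pull_secrets(service_account: dict, pod_spec: dict) -> dict:
    """Return a copy of pod_spec that also lists the ServiceAccount's pull secrets."""
    merged = list(pod_spec.get("imagePullSecrets", []))
    seen = {ref["name"] for ref in merged}
    for ref in service_account.get("imagePullSecrets", []):
        # Skip secrets the pod already references (an assumption of this sketch).
        if ref["name"] not in seen:
            merged.append({"name": ref["name"]})
            seen.add(ref["name"])
    return {**pod_spec, "imagePullSecrets": merged}


# Hypothetical default ServiceAccount and pod spec.
default_sa = {"imagePullSecrets": [{"name": "registry-creds"}, {"name": "mirror-creds"}]}
pod_spec = {"containers": [{"name": "vllm"}],
            "imagePullSecrets": [{"name": "registry-creds"}]}

result = propagate_image_pull_secrets(default_sa, pod_spec)
print([ref["name"] for ref in result["imagePullSecrets"]])
```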

🛡️ Pod Security Standards (PSS) Restricted Profile Enforcement

Kubernetes Pod Security Standards (PSS) define three policy levels for pod configurations: privileged, baseline, and restricted. Restricted is the strictest: pods must run as non-root, drop Linux capabilities, and disallow privilege escalation.

The LLMInferenceService default template now enforces the Pod Security Standards (PSS) restricted profile (#5302), ensuring that workload pods run with minimal privileges by default. This aligns LLMInferenceService with Kubernetes security best practices for production clusters.
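
To give a feel for what the restricted profile demands of a container, here is a minimal Python sketch that checks a securityContext dictionary against a subset of the restricted rules. It is a simplification for illustration: it covers only four container-level checks, is not the Kubernetes admission logic, and is not taken from the KServe template.

```python
def restricted_violations(security_context: dict) -> list:
    """List violations of a subset of the PSS restricted profile."""
    violations = []
    if security_context.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    if security_context.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    dropped = security_context.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        violations.append("capabilities.drop must include ALL")
    seccomp_type = security_context.get("seccompProfile", {}).get("type")
    if seccomp_type not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


# A context that satisfies the four checks above.
compliant = {
    "runAsNonRoot": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}

print(restricted_violations(compliant))
print(restricted_violations({"runAsNonRoot": True}))
```

Custom templates that override the defaults can be checked the same way before deploying to a namespace that enforces the restricted level.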

📦 Additional LLMInferenceService Improvements

Storage migration support for LLMInferenceService APIs with deferred migration until webhooks are serving (#5149, #5286)
InferencePool readiness evaluation now checks both prefill and decode pools (#5202)
InferencePool auto-migration to v1 with Gateway detection (#5041, #5316)
Scheduler TLS certificate reload for seamless cert rotation (#5260)
Platform hooks for networking and manager configuration (#5240)
Build hooks for distribution-specific logic (#5217)
LLMInferenceServiceConfig updates with TLS flag support (#5249)
Scaling validation improvements for unsupported configurations (#5212)
MinReplicas default and MaxReplicas type fix (#5237)
Status URL always set with discovered gateway address (#5339)
Graceful resource deletion handling in controllers (#5252)
Extra headers support in wait_for_model_response (#5215)
Disabled Uvicorn access logging in vLLM configs (#5154)
TokenProcessorConfig and blockSize migration (#5305)

📦 Namespace-Scoped ModelCache

The LocalModel controller now supports namespace-scoped ModelCache via a new LocalModelNamespaceCache CRD (#4887). While the existing cluster-scoped LocalModelCache makes cached models available globally, the new LocalModelNamespaceCache restricts model caching to a specific namespace — enabling multi-tenant isolation where different teams manage their own model caches independently without affecting the cluster-wide state.

The LocalModelNamespaceCache shares the same spec structure as LocalModelCache but adds an optional storage field and serviceAccountName for credential management, eliminating the need to create a separate ClusterStorageContainer for authenticated model downloads:

apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNamespaceCache
metadata:
  name: meta-llama3-8b-instruct
  namespace: team-a  # Scoped to this namespace
spec:
  sourceModelUri: "hf://meta-llama/meta-llama-3-8b-instruct"
  modelSize: 10Gi
  nodeGroups:
    - workers
  serviceAccountName: hf-model-sa  # Optional: SA with attached secrets for credential lookup

Namespace-scoped download jobs now run in the designated jobNamespace (#5262), and the LocalModelCache CRD also gained the optional storage and serviceAccountName fields — mirroring the InferenceService storage configuration for a consistent experience:

apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: "hf://meta-llama/meta-llama-3-8b-instruct"
  modelSize: 10Gi
  nodeGroups:
    - workers
  serviceAccountName: hf-model-sa
  storage:
    key: hf-secret
    parameters:
      endpoint: "https://huggingface.co"

🔧 InferenceService and Platform Improvements

vLLM Upgrade to v0.19.0

KServe v0.18 ships with vLLM v0.19.0 (#5367), upgraded through v0.17.1 (#5338). This brings significant performance improvements and new features from the vLLM project. The upstream-built vLLM CPU image is now used directly, replacing the custom-built image (#5264).

GKE Gateway Compatibility

A new DisableHTTPRouteTimeout configuration flag (#5313) has been added for GKE Gateway, which does not support HTTPRoute timeouts. With the flag set, KServe can be deployed on GKE without timeout-related errors in the Gateway controller.

HuggingFace Token Classification

The HuggingFace serving runtime now supports offset_mapping output for token classification tasks (#5244), providing character-level position information for each classified token — useful for NER and other span-extraction applications.
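
The following Python sketch shows why character offsets matter for NER: given per-token labels and their (start, end) offsets, entity text can be sliced straight out of the original string. The sample text, labels, and offsets are hand-written for illustration, and the exact response shape returned by the runtime is not shown in this post, so treat the input format as an assumption.

```python
def extract_spans(text: str, offsets: list, labels: list) -> list:
    """Merge BIO-tagged tokens into (entity_type, surface_text) pairs."""
    spans = []
    current = None  # [entity_type, start_char, end_char]
    for (start, end), label in zip(offsets, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-")
            and (current is None or current[0] != label[2:])
        )
        if starts_new:
            if current is not None:
                spans.append(current)
            current = [label[2:], start, end]
        elif label.startswith("I-"):
            current[2] = end  # extend the open entity to this token
        else:
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [(entity, text[s:e]) for entity, s, e in spans]


text = "Ada Lovelace lived in London"
offsets = [(0, 3), (4, 12), (13, 18), (19, 21), (22, 28)]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]

print(extract_spans(text, offsets, labels))
# [('PER', 'Ada Lovelace'), ('LOC', 'London')]
```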

CloudEvents Logging

Inference loggers now include occurrence and record timestamps in CloudEvents (#5139), improving observability and enabling accurate event ordering in distributed logging pipelines.

Additional Enhancements

PYTHONPATH environment variable blocking in ISVC and ServingRuntime webhooks for security hardening (#5340)
Agent readiness probe fix to mirror user-defined httpGet path (#5345)
Intermittent S3 403 Forbidden errors fixed during model download (#5393)
Raw deployment scheme fix — Status.Address scheme hardcoded to HTTP (#5363)
SecurityContext preservation — prevents overwriting existing SecurityContext during configuration (#5167)
Multi-download flag recovery in default ClusterStorageContainer (#5368)
MLServer Seldon image bumped to v1.7.1 (#4573)
Helm chart version overrides support for dependencies (#5268)

Infrastructure Updates

Python formatter migrated from Black to Ruff (#5000)
MD5 replaced with SHA-256 in E2E test fixtures (#5271)
GitHub Actions pinned to SHA via pinact for supply chain security (#5320)
Merge queue support added for CI workflows (#5253)
Optimized Go and Python Dockerfiles for faster CI builds (#5273, #5274)
Standalone kustomize pinned and decoupled from kubectl (#5322)
Kustomize bumped from v5.5.0 to v5.8.0 (#5218)
OpenTelemetry SDK bumped to v1.40.0 (#5199)
Release automation with Copilot CLI support (#5419)

🔒 Security Fixes

Multiple security vulnerabilities have been addressed:

CVE-2026-32597 — PyJWT critical header validation bypass (#5283)
CVE-2026-33186 — gRPC authorization bypass (#5342)
CVE-2026-30922 — pyasn1 Denial of Service vulnerability (#5404)
MD5 to SHA-256 migration in E2E fixtures to eliminate weak hash usage (#5271)
GitHub Actions SHA pinning to prevent supply chain attacks (#5320)

🔍 Release Notes

For the complete list of all 130 merged pull requests, bug fixes, and known issues, visit the GitHub release pages.

🙏 Acknowledgments

We extend our gratitude to all 27 contributors who made this release possible, including 10 first-time contributors. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.

New Contributors: @fyuan1316, @seBasKov, @Gregory-Pereira, @SebastienSyd, @Yuuuuuu0319, @sophieliu15, @RishabhSaini, @alokdangre, @reyshazni, @iranzo — welcome and thank you!
Core Contributors: The KServe maintainers and regular contributors
Community: Everyone who reported issues, provided feedback, and tested features

🤝 Join the Community

We invite you to explore the new features in KServe v0.18 and contribute to the ongoing development of the project:

Visit our Website or GitHub
Join the Slack (#kserve)
Attend our community meeting by subscribing to the KServe calendar.
View our community GitHub repository to learn how to contribute. We are excited to work with you to make KServe better and promote its adoption!

Happy serving!

The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!
