
Announcing: KServe v0.14

We are excited to announce KServe v0.14. In this release we are introducing a new Python client designed for KServe and a new model cache feature, we are promoting OCI storage for models to a stable feature, and we are adding support for deploying models directly from Hugging Face.

Below is a summary of the key changes.

Introducing Inference client for Python

The KServe Python SDK now includes both REST and gRPC inference clients. The new inference clients are delivered as alpha features.

In line with the features documented in issue #3270, both clients have the following characteristics:

  • The clients are asynchronous
  • Support for HTTP/2 (via the httpx library)
  • Support for the Open Inference Protocol v1 and v2
  • Support for sending and receiving tensor data in binary format for HTTP/REST requests; see the binary tensor data extension docs.

As usual, version 0.14.0 of the KServe Python SDK is published to PyPI and can be installed via pip install.
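
To illustrate, here is a minimal sketch of using the new asynchronous REST client. The import path and the class and argument names (InferenceRESTClient, RESTConfig, InferRequest, InferInput) follow the v0.14 SDK, but the exact signatures should be checked against the SDK reference; the endpoint URL and model name are placeholders.

import asyncio

from kserve import InferInput, InferRequest
# The client classes live in the SDK's inference client module; adjust the
# import if your SDK version exposes them differently.
from kserve.inference_client import InferenceRESTClient, RESTConfig


async def main():
    # Configure a client speaking the Open Inference Protocol v2.
    config = RESTConfig(protocol="v2", verbose=True)
    client = InferenceRESTClient(config=config)

    # Build a request carrying a single FP32 tensor of shape [1, 2].
    infer_input = InferInput(name="input-0", shape=[1, 2], datatype="FP32",
                             data=[[1.0, 2.0]])
    request = InferRequest(model_name="my-model", infer_inputs=[infer_input])

    # Placeholder endpoint; point this at your InferenceService URL.
    response = await client.infer("http://localhost:8080", request,
                                  model_name="my-model")
    print(response)


asyncio.run(main())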

Support for OCI storage for models (modelcars) becomes stable

In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. It allows users to store models in containers in OCI format and to use OCI-compatible registries for publishing the models.
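
As a usage sketch, an InferenceService can reference such a model through an oci:// storage URI; the registry, repository, and tag below are placeholders:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-from-oci
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: oci://registry.example.com/models/sklearn-iris:v1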

This feature was implemented by configuring the OCI model container as a sidecar in the InferenceService pod, which motivated naming the feature modelcars. The model files are made available to the model server by configuring process namespace sharing in the pod.

One small but important unresolved detail motivated the experimental status: since the modelcar is one of the pod's main containers, there was no certainty that it would start quickly. The model server would be unstable if it started before the modelcar, and since the model image was not prefetched, this was considered a likely condition.

This instability has been mitigated by configuring the OCI model as an init container in addition to the sidecar. Running it as an init container ensures that the model image is fetched before the main containers are started, and this prefetching allows the modelcar sidecar to start quickly. The fix is available since KServe version 0.14, where modelcars are now a stable feature.

Future plans

Modelcars are one implementation option for supporting OCI images for model storage. Other alternatives are discussed in issue #4083.

Using volume mounts based on OCI artifacts would be the optimal implementation, but this has only recently become possible, since Kubernetes 1.31 supports it as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.

Introducing Model Cache

With models increasing in size, which is especially true for LLMs, pulling a model from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, its capabilities are not flexible, since cache management is delegated to the cluster.

The Model Cache was proposed as another alternative for enhancing KServe usability with big models, and is released in KServe v0.14 as an alpha feature. In this release, local node storage is used for storing models, and the LocalModelCache custom resource provides control over which models to store in the cache. The local model cache state can always be rebuilt from the models stored on persistent storage, like a model registry or S3. Read the design document for the details.
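
For illustration, a LocalModelCache resource might look like the following sketch; the API version and field names (sourceModelUri, modelSize, nodeGroups) follow the design document and tutorial, and may change while the feature is alpha:

apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  # The persistent source the cache is filled from and can be rebuilt against.
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  # Storage to reserve on each node for the cached model.
  modelSize: 10Gi
  # Node groups whose local storage should hold a copy of the model.
  nodeGroups:
    - workers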


By caching the models, you get the following benefits:

  • Minimized time for LLM pods to start serving requests.

  • Shared storage for pods scheduled on the same GPU node.

  • Efficient scaling of your AI workload without worrying about slow model server container startup.

The model cache is currently disabled by default. To enable it, you need to modify the localmodel.enabled field in the inferenceservice-config ConfigMap.
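
A sketch of the relevant ConfigMap entry is shown below; the key name and JSON layout are assumptions based on the documentation, so verify them against the inferenceservice-config ConfigMap shipped with your installation before applying:

apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  # Only the model cache entry is shown; keep the other keys of the
  # ConfigMap unchanged.
  localModel: |
    {
      "enabled": true
    }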

You can follow the local model cache tutorial to cache LLMs on the local NVMe drives of your GPU nodes, and deploy LLMs with an InferenceService that loads models from the local cache to accelerate container startup.

Support for Hugging Face hub in storage initializer

The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. For this, the new scheme hf:// is now supported in the storageUri field of InferenceServices, as the following partial YAML shows:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct

Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService.
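
For example, the environment-variable approach could look like the sketch below; the Secret name hf-secret and its HF_TOKEN key are illustrative and should match the Secret you create for your Hugging Face token:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      env:
        # Token sourced from a user-created Secret (illustrative names).
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN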

Read the documentation for more details.

Hugging Face vLLM backend changes

  • Update vLLM backend to 0.6.1 #3948
  • Support trust_remote_code flag for vLLM #3729
  • Support text embedding task in the Hugging Face server #3743
  • Add health endpoint for vLLM backend #3850
  • Added hostIPC field to ServingRuntime CRD, for supporting more than one GPU in Serverless mode #3791
  • Support shared memory volume for vLLM backend #3910

Other Changes

This release also includes several enhancements and changes:

What's New?

  • New flag for automounting the service account token #3979
  • TLS support for inference loggers #3837
  • Allow PVC storage to be mounted in ReadWrite mode via an annotation #3687
  • Support HTTP Headers passing for KServe python custom runtimes #3669

What's Changed?

  • Ray is now an optional dependency #3834
  • Support for Python 3.12 is added, while support for Python 3.8 is removed #3645

For complete details on the new features and updates, visit our official release notes.

Join the community

Thanks to all the contributors who have made commits to the 0.14 release!

The KServe Project
