
Announcing: KServe v0.11

We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, and added Open Inference Protocol support and improved dependency management to the Python SDK. For ModelMesh, we added PVC, HPA, and payload logging support to ensure feature parity with KServe.

Here is a summary of the key changes:

KServe Core Inference Enhancements

  • Support for path based routing as an alternative to host based routing; the URL of the InferenceService can look like http://<ingress_domain>/serving/<namespace>/<isvc_name>. Please refer to the doc for how to enable path based routing.
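
    A minimal sketch of what enabling this could look like, assuming the ingress settings live in the inferenceservice-config ConfigMap and a pathTemplate field along these lines (values are illustrative; the doc is authoritative):

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: inferenceservice-config
        namespace: kserve
      data:
        # illustrative values only; see the path based routing doc for the full ingress settings
        ingress: |-
          {
            "ingressGateway": "knative-serving/knative-ingress-gateway",
            "ingressDomain": "example.com",
            "pathTemplate": "/serving/{{ .Namespace }}/{{ .Name }}"
          }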

  • Introduced a priority field for the ServingRuntime custom resource to handle the case where multiple serving runtimes support the same model format; see the serving runtime doc for more details.
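
    For illustration, a sketch of a runtime that advertises a priority for its supported model format (the runtime name and image tag are placeholders):

      apiVersion: serving.kserve.io/v1alpha1
      kind: ClusterServingRuntime
      metadata:
        name: example-sklearnserver   # placeholder name
      spec:
        supportedModelFormats:
          - name: sklearn
            version: "1"
            autoSelect: true
            priority: 2   # when several runtimes match, the one with the higher priority is selected
        containers:
          - name: kserve-container
            image: kserve/sklearnserver:latest   # placeholder image tag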

  • Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration:

      apiVersion: "serving.kserve.io/v1alpha1"
      kind: ClusterStorageContainer
      metadata:
        name: default
      spec:
        container:
          name: storage-initializer
          image: kserve/model-registry:latest
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
            limits:
              memory: 1Gi
              cpu: "1"
        supportedUriFormats:
          - prefix: model-registry://
    

  • Inference Graph enhancements improving the API spec to support pod affinity and resource requirement fields. A Dependency field with options Soft and Hard has been introduced to handle error responses from the inference steps and decide whether to short-circuit the request in case of errors; see the following example of a hard dependency on the node steps:

  apiVersion: serving.kserve.io/v1alpha1
  kind: InferenceGraph
  metadata:
    name: graph-with-switch-node
  spec:
    nodes:
      root:
        routerType: Sequence
        steps:
          - name: "rootStep1"
            nodeName: node1
            dependency: Hard
          - name: "rootStep2"
            serviceName: {{ success_200_isvc_id }}
      node1:
        routerType: Switch
        steps:
          - name: "node1Step1"
            serviceName: {{ error_404_isvc_id }}
            condition: "[@this].#(decision_picker==ERROR)"
            dependency: Hard

For more details, please refer to the issue.

  • Improved the InferenceService debugging experience by adding the aggregated RoutesReady status and the LastDeploymentReady condition to the InferenceService status to differentiate the endpoint and deployment status. This applies to serverless mode; for more details refer to the API docs.
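
    As an illustration only, the relevant slice of an InferenceService status could look roughly like the sketch below; the exact condition names and semantics are defined in the API docs:

      status:
        conditions:
          - type: RoutesReady           # aggregated routing status (illustrative)
            status: "True"
          - type: LastDeploymentReady   # latest deployment status (illustrative)
            status: "True"
          - type: Ready
            status: "True"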

Enhanced Python SDK Dependency Management

  • KServe has adopted Poetry to manage Python dependencies. You can now install the KServe SDK with locked dependencies using poetry install. While pip install still works, we highly recommend using Poetry to ensure predictable dependency management.

  • The KServe SDK has also been slimmed down by making the cloud storage dependency optional. If you require the storage dependency for custom serving runtimes, you can still install it with pip install kserve[storage].

KServe Python Runtimes Improvements

  • KServe Python runtimes, including sklearnserver, lgbserver, and xgbserver, now support the open inference protocol for both REST and gRPC.
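
    For example, an InferenceService can opt in to the open inference protocol through the protocolVersion field; the name and storageUri below are illustrative:

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sklearn-iris-v2   # illustrative name
      spec:
        predictor:
          model:
            modelFormat:
              name: sklearn
            protocolVersion: v2   # serve with the open inference protocol over REST and gRPC
            storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # illustrative model location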

  • Logging improvements, including Uvicorn access logging and a default KServe logger.

  • The postprocess handler has been aligned with the open inference protocol, simplifying the handling of the underlying transport protocol.

LLM Runtimes

TorchServe LLM Runtime

KServe now integrates with TorchServe 0.8, offering support for LLMs that may not fit onto a single GPU. Hugging Face Accelerate and DeepSpeed are available options for splitting the model into multiple partitions across multiple GPUs. See the detailed example for how to serve an LLM on KServe with the TorchServe runtime.
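
As a hedged sketch (the name, storage location, and resource numbers are placeholders; the linked example covers the actual model packaging and TorchServe configuration), an LLM InferenceService on the TorchServe runtime could look roughly like:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama2-torchserve   # placeholder name
    spec:
      predictor:
        model:
          modelFormat:
            name: pytorch   # handled by the TorchServe runtime
          storageUri: pvc://model-store-pvc/llama2   # placeholder location of the packaged model archive
          resources:
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "4"   # multiple GPUs so Accelerate/DeepSpeed can partition the model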

vLLM Runtime

Serving LLMs can be surprisingly slow even on high-end GPUs. vLLM is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers. It supports continuous batching for increased throughput and GPU utilization, and paged attention to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (the KV cache) are kept in GPU memory to generate the next tokens.

The example shows how to deploy vLLM on KServe; we expect further integration in KServe 0.12 with the proposed generate endpoint for the open inference protocol.
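
As a rough sketch only, vLLM can run as a custom predictor container; the image is a placeholder for one with vLLM installed, and the model and port arguments are illustrative (the linked example is the reference):

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: vllm-llama2   # placeholder name
    spec:
      predictor:
        containers:
          - name: kserve-container
            image: <your-vllm-image>   # placeholder: a container image with vLLM installed
            command: ["python3", "-m", "vllm.entrypoints.api_server"]
            args: ["--model", "meta-llama/Llama-2-7b-hf", "--port", "8080"]
            resources:
              limits:
                nvidia.com/gpu: "1"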

ModelMesh Updates

Storing Models on Kubernetes Persistent Volumes (PVC)

ModelMesh now allows model files to be mounted directly onto serving runtime pods using Kubernetes Persistent Volumes. Depending on the selected storage solution, this approach can significantly reduce latency when deploying new predictors and potentially remove the need for additional cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.
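
A minimal sketch of what this can look like from the InferenceService side, assuming a PVC named my-models-pvc and PVC support enabled as described in the ModelMesh docs (names and paths are placeholders):

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-pvc-example   # placeholder name
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: pvc://my-models-pvc/sklearn/model.joblib   # placeholder PVC name and path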

Horizontal Pod Autoscaling (HPA)

Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas; instead, a HorizontalPodAutoscaler automatically scales the serving runtime deployment to the number of Pods that best matches the demand.
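
As an assumption-heavy sketch (the annotation keys and values below are assumptions on our part; check the ModelMesh autoscaling docs for the exact configuration), autoscaling is opted into per serving runtime rather than by setting replicas:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: example-runtime   # placeholder name
      annotations:
        serving.kserve.io/autoscalerClass: hpa                # assumed annotation key
        serving.kserve.io/targetUtilizationPercentage: "75"   # assumed annotation key
    spec:
      multiModel: true
      supportedModelFormats:
        - name: sklearn
          version: "1"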

Model Metrics, Metrics Dashboard, Payload Event Logging

ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, for example by allocating more resources or increasing the number of replicas to improve responsiveness or avoid frequent cache misses.

A new Grafana dashboard was added to display a comprehensive set of Prometheus metrics, such as model loading and unloading rates, internal queuing delays, capacity and usage, and cache state, to monitor the general health of the ModelMesh Serving deployment.

The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.

What's Changed? ⚠

  • To allow longer InferenceService names given the DNS max length limits reported in the issue, the default suffix in the InferenceService component (predictor/transformer/explainer) name has been removed for newly created InferenceServices. This affects clients that use the component URL directly instead of the top level InferenceService URL.

  • Status.address.url is now consistent between serverless and raw deployment modes: the URL path portion is dropped in serverless mode.

  • Raw bytes are now accepted in the v1 protocol; if the content-type header is specified, it must be set to application/json for the JSON payload to be recognized and decoded.

    curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json
    

For a complete change list, please read the release notes for KServe v0.11 and ModelMesh v0.11.

Join the community

Thanks to all the contributors who made commits to the 0.11 release!

The KServe Working Group
