Announcing KServe v0.10.0

· 7 min read
Dan Sun
Co-Founder, KServe

Published on February 5, 2023

We are excited to announce the KServe 0.10 release. In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes, and increased support coverage for the Open (aka v2) inference protocol for both standard and ModelMesh InferenceServices.

๐ŸŒ KServe Networking Optionsโ€‹

Istio is now optional for both Serverless and RawDeployment modes. Please see the alternative networking guide for how to enable other ingress options supported by Knative with Serverless mode. For Istio users, if you want to turn on full service mesh mode to secure InferenceServices with mutual TLS and enable traffic policies, please read the service mesh setup guideline.

📊 KServe Telemetry for Serving Runtimes

We have instrumented additional latency metrics in the KServe Python ServingRuntimes for the preprocess, predict, and postprocess handlers. In Serverless mode we have extended the Knative queue-proxy to aggregate the metrics exposed by both queue-proxy and the kserve-container of each ServingRuntime. Please read the Prometheus metrics setup guideline for how to enable metrics scraping and aggregation.

🚀 Open (v2) Inference Protocol Support Coverage

Adoption of the KServe v2 inference protocol has been growing: the AMD Inference ServingRuntime supports FPGAs with it, and OpenVINO now provides a KServe-compatible REST and gRPC API. In light of this, we have proposed in the issue to rename it to the KServe Open Inference Protocol.

In KServe 0.10, we have added Open (v2) inference protocol support for KServe custom runtimes. You can now enable v2 REST/gRPC for both custom transformers and predictors with images built using the KServe Python SDK API. gRPC enables a high-performance inference data plane: it is built on top of HTTP/2 and uses binary data transport, which is more efficient to send over the wire than REST. Please see the detailed example below for the transformer and predictor.

from kserve import Model, InferInput, InferOutput, InferRequest, InferResponse
from kserve.utils.utils import generate_uuid

import io
from typing import Dict

import numpy as np
import torch
from PIL import Image
from torchvision import transforms


def image_transform(byte_array):
    # Decode the raw image bytes and normalize them into a numpy tensor.
    image_processing = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    image = Image.open(io.BytesIO(byte_array))
    return image_processing(image).numpy()


class CustomModel(Model):
    def predict(self, request: InferRequest, headers: Dict[str, str]) -> InferResponse:
        input_tensors = [image_transform(instance) for instance in request.inputs[0].data]
        input_tensors = np.asarray(input_tensors)
        # self.model is assumed to be loaded in load(), e.g. a TorchScript or torch.nn module.
        output = self.model(torch.from_numpy(input_tensors))
        output = torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        response_id = generate_uuid()
        infer_output = InferOutput(name="output-0", shape=list(values.shape),
                                   datatype="FP32", data=result)
        return InferResponse(model_name=self.name, infer_outputs=[infer_output],
                             response_id=response_id)


class CustomTransformer(Model):
    def preprocess(self, request: InferRequest, headers: Dict[str, str]) -> InferRequest:
        input_tensors = [image_transform(instance) for instance in request.inputs[0].data]
        input_tensors = np.asarray(input_tensors)
        infer_inputs = [InferInput(name="INPUT__0", datatype="FP32",
                                   shape=list(input_tensors.shape), data=input_tensors)]
        # self.model_name is assumed to be set, e.g. to the downstream predictor's model name.
        return InferRequest(model_name=self.model_name, infer_inputs=infer_inputs)

You can use the same Python API types InferRequest and InferResponse for both the REST and gRPC protocols. KServe handles the underlying decoding and encoding according to the protocol.
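
As a minimal sketch of how the custom predictor above could be served, the snippet below starts a KServe model server; the model name "custom-model" and the load() behavior are illustrative assumptions, not part of the release itself.

from kserve import ModelServer

if __name__ == "__main__":
    # Assumes CustomModel implements load() to set self.model and mark itself ready.
    model = CustomModel("custom-model")
    model.load()
    # Starts the inference server; depending on SDK configuration, gRPC is served
    # alongside REST on a separate port.
    ModelServer().start([model])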

โš ๏ธ Warning: A new headers argument is added to the custom handlers to pass http/gRPC headers or other metadata. You can also use this as context dict to pass data between handlers. If you have existing custom transformer or predictor, the headers argument is now required to add to the preprocess, predict and postprocess handlers.

Please check the following matrix for supported ModelFormats and ServingRuntimes.

| Model Format | v1            | Open (v2) REST/gRPC |
| ------------ | ------------- | ------------------- |
| Tensorflow   | ✅ TFServing   | ✅ Triton            |
| PyTorch      | ✅ TorchServe  | ✅ TorchServe        |
| TorchScript  | ✅ TorchServe  | ✅ Triton            |
| ONNX         | ❌             | ✅ Triton            |
| Scikit-learn | ✅ KServe      | ✅ MLServer          |
| XGBoost      | ✅ KServe      | ✅ MLServer          |
| LightGBM     | ✅ KServe      | ✅ MLServer          |
| MLFlow       | ❌             | ✅ MLServer          |
| Custom       | ✅ KServe      | ✅ KServe            |

๐Ÿ—๏ธ Multi-Arch Image Supportโ€‹

The KServe control plane images kserve-controller, kserve/agent, and kserve/router are now supported on multiple architectures: ppc64le, arm64, amd64, and s390x.

๐Ÿ” KServe Storage Credentials Supportโ€‹

  • Currently, AWS users need to create a secret with long-term/static IAM credentials for downloading models stored in S3. The security best practice is to use an IAM Role for Service Accounts (IRSA), which enables automatic credential rotation and fine-grained access control; see how to set up IRSA.
  • Added support for Azure Blobs with managed identity.

📊 ModelMesh Updates

ModelMesh has continued to integrate itself as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing use of cluster-scoped ServingRuntimes, originally introduced in KServe 0.8.

Additionally, ModelMesh introduced support for TorchServe, enabling users to serve arbitrary PyTorch models (e.g. eager mode) in the context of distributed multi-model serving.

Other limitations have been addressed as well, such as support for BYTES/string type tensors when using the REST inference API.
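
For reference, an Open (v2) REST request carrying a BYTES/string tensor looks roughly like the sketch below; the endpoint URL, model name, and tensor name are hypothetical and only illustrate the payload shape.

import requests

# Hypothetical endpoint and model name; only the payload structure matters here.
url = "http://localhost:8008/v2/models/my-text-model/infer"
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [2],
            "datatype": "BYTES",  # string tensors are now accepted over the REST API
            "data": ["first document", "second document"],
        }
    ]
}
resp = requests.post(url, json=payload)
print(resp.json())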

๐Ÿ” Release Notesโ€‹

For complete release notes including all changes, bug fixes, and known issues, visit the GitHub release pages for KServe v0.10 and ModelMesh v0.10.

๐Ÿ™ Acknowledgmentsโ€‹

We want to thank all the contributors who made this release possible:

Individual Contributors:

Core Contributors: The KServe maintainers and working group members

Community: Everyone who reported issues, provided feedback, and tested features

๐Ÿค Join the Communityโ€‹


The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!