Model Serving Runtimes
KServe provides a simple Kubernetes CRD for deploying one or more trained models onto model serving runtimes such as TFServing, TorchServe, and Triton Inference Server. In addition, ModelServer is the Python model serving runtime implemented in KServe itself and supports the v1 prediction protocol, while MLServer implements the v2 prediction protocol over both REST and gRPC. These runtimes provide out-of-the-box model serving, but you can also build your own model server for more complex use cases. KServe provides basic API primitives that make it easy to build a custom model serving runtime, and you can also use other tools such as BentoML to build your custom model serving image.
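As a rough illustration of the custom-image path, the manifest below sketches an InferenceService whose predictor runs a user-supplied container instead of one of the built-in runtimes. The service name and image reference are hypothetical placeholders, and the sketch assumes the image already contains a server that speaks the prediction protocol your clients expect.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "custom-model"            # hypothetical service name
spec:
  predictor:
    containers:
      # Custom predictor container; replaces a built-in runtime such as triton or sklearn.
      - name: kserve-container
        image: "registry.example.com/your-org/custom-model:v1"   # hypothetical image
```

Because the container is opaque to KServe, choices such as protocol version and model loading are left to the image itself.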
After a model is deployed with an InferenceService, you get all of the following serverless features provided by KServe:
- Scale to and from Zero
- Request-based autoscaling on CPU/GPU
- Revision Management
- Optimized Container
- Batching
- Request/Response logging
- Traffic management
- Security with AuthN/AuthZ
- Distributed Tracing
- Out-of-the-box metrics
- Ingress/Egress control
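Several of these features are configured directly on the predictor spec. The manifest below is a minimal sketch, not a recommended production configuration: the service name, storage location, and logger URL are hypothetical, the numeric values are arbitrary, and scale to zero assumes a serverless (Knative-based) installation.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-example"                      # hypothetical service name
spec:
  predictor:
    minReplicas: 0                             # allow scale to zero when idle
    maxReplicas: 3                             # upper bound for autoscaling
    canaryTrafficPercent: 10                   # traffic management: share routed to the latest revision
    batcher:
      maxBatchSize: 32                         # batching: max requests grouped per batch
      maxLatency: 500                          # max wait before a batch is sent (assumed milliseconds)
    logger:
      mode: all                                # log both requests and responses
      url: "http://message-dumper.default/"    # hypothetical log sink
    sklearn:
      storageUri: "gs://your-bucket/path/to/model"   # hypothetical model location
```

Fields that are left out fall back to the defaults of the installation, so a manifest only needs to name the features it wants to tune.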
Model Serving Runtime | Exported model | Prediction Protocol | HTTP | gRPC | Versions | Examples |
---|---|---|---|---|---|---|
Triton Inference Server | TensorFlow, TorchScript, ONNX | v2 | Yes | Yes | Compatibility Matrix | Torchscript cifar |
TFServing | TensorFlow SavedModel | v1 | Yes | Yes | TFServing Versions | TensorFlow flower |
TorchServe | Eager Model/TorchScript | v1/v2 REST | Yes | Yes | 0.6.0 | TorchServe mnist |
SKLearn MLServer | Pickled Model | v2 | Yes | Yes | 1.0.1 | SKLearn Iris V2 |
XGBoost MLServer | Saved Model | v2 | Yes | Yes | 1.5.0 | XGBoost Iris V2 |
SKLearn ModelServer | Pickled Model | v1 | Yes | -- | 1.0.1 | SKLearn Iris |
XGBoost ModelServer | Saved Model | v1 | Yes | -- | 1.5.0 | XGBoost Iris |
PMML ModelServer | PMML | v1 | Yes | -- | PMML4.4.1 | SKLearn PMML |
LightGBM ModelServer | Saved LightGBM Model | v1 | Yes | -- | 3.2.0 | LightGBM Iris |
Custom ModelServer | -- | v1 | Yes | -- | -- | Custom Model |
Note
The model serving runtime version can be overridden with the runtimeVersion field in the InferenceService YAML, and we highly recommend setting this field for production services.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3