vLLM Runtime
Official vLLM support is available through the Hugging Face Serving Runtime.
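
As a minimal sketch, an InferenceService using the Hugging Face runtime (which selects the vLLM backend by default for supported generative models) might look like the manifest below. The service name, `--model_id`, and resource values are illustrative placeholders, not required values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llm        # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # Hugging Face Serving Runtime
      args:
        # placeholder model; substitute any vLLM-supported Hugging Face model
        - --model_id=meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"  # example sizing; adjust to the model
        requests:
          nvidia.com/gpu: "1"
```

If a model is not supported by vLLM, the runtime falls back to the standard Hugging Face backend; passing `--backend=huggingface` selects that backend explicitly.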