Version: Next

Text Generation with Large Language Models

This guide shows how to deploy LLMs for text generation using KServe in GPU and CPU environments.

Prerequisites

Before getting started, ensure you have:

A Kubernetes cluster with KServe installed.
For GPU: NVIDIA GPU resources available on your nodes.
For CPU: nodes with enough CPU and memory. CPUs with AVX-512 instruction support are recommended.
A Hugging Face access token to download gated models like Llama3.

Create a Hugging Face Token Secret

Both backends require a Hugging Face token to download the Llama3 model:

hf-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  HF_TOKEN: <your-hugging-face-token>

Save this to a file (e.g., hf-secret.yaml) and apply it:

kubectl apply -f hf-secret.yaml

Create a StorageContainer

To enable KServe to access Hugging Face models, you need to create a ClusterStorageContainer that uses the Hugging Face token secret. This allows KServe to authenticate and download models from the Hugging Face Hub.

To learn more about storage containers, refer to the Storage Containers documentation.

huggingface-storage.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: hf-hub
spec:
  container:
    name: storage-initializer
    image: 'kserve/storage-initializer:latest'
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: false
    resources:
      requests:
        memory: 2Gi
        cpu: '1'
      limits:
        memory: 4Gi
        cpu: '1'
  supportedUriFormats:
    - prefix: 'hf://'

Deploy Text Generation Model

Understanding Backend Options

KServe supports two inference backends for serving LLMs:

vLLM Backend (default): This is the recommended backend for serving LLMs, providing optimized performance and lower latency. It supports advanced features like model parallelism and efficient memory management.
Hugging Face Backend: This backend uses the standard Hugging Face library. It is suitable for simpler use cases but may not perform as well as vLLM for larger models or high-concurrency scenarios.

Please refer to the overview of KServe's generative inference capabilities for more details on these backends.

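The three deployment options are covered in turn below: the vLLM backend on GPU (recommended), the Hugging Face backend, and the vLLM backend on CPU.

vLLM Backend (Recommended)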

The vLLM backend is enabled by default in KServe's Hugging Face serving runtime for optimal performance.

llama3-vllm.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"

Save this to a file (e.g., llama3-vllm.yaml) and apply it:

kubectl apply -f llama3-vllm.yaml

Hugging Face Backend

If you prefer to use the standard Hugging Face inference pipeline, specify the backend explicitly:

llama3-hf.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --backend=huggingface
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"

Save this to a file (e.g., llama3-hf.yaml) and apply it:

kubectl apply -f llama3-hf.yaml

vLLM Backend (CPU)

For clusters without GPUs, KServe can serve models using the vLLM CPU backend. Use a smaller model that fits in memory, such as Qwen2-0.5B-Instruct.

CPU Performance

CPU inference is slower than GPU. Expect 5 to 20 tokens per second depending on model size and hardware. Use smaller models (under 3B parameters) for acceptable latency.
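
To put the 5 to 20 tokens per second range in perspective, the short calculation below estimates how long a 100-token response would take at each end of that range. It is a back-of-the-envelope sketch that assumes a steady decode rate and ignores prompt processing time; the function name is illustrative.

```python
def generation_seconds(max_tokens: int, tokens_per_second: float) -> float:
    """Time to generate max_tokens at a steady decode rate (prompt processing ignored)."""
    return max_tokens / tokens_per_second


for rate in (5, 20):
    print(f"{rate} tok/s -> {generation_seconds(100, rate):.0f} s for 100 tokens")
# 5 tok/s -> 20 s for 100 tokens
# 20 tok/s -> 5 s for 100 tokens
```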

qwen-cpu.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-cpu
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=qwen2
        - --dtype=bfloat16
        - --max-model-len=2048
      storageUri: hf://Qwen/Qwen2-0.5B-Instruct
      env:
        - name: VLLM_CPU_KVCACHE_SPACE
          value: "4"
        - name: VLLM_CPU_OMP_THREADS_BIND
          value: "auto"
        - name: VLLM_ENABLE_V1_MULTIPROCESSING
          value: "0"
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "8"
          memory: 16Gi

Note

KServe auto-selects the CPU container image when no nvidia.com/gpu resource is requested. You do not need to specify the image manually.

Key configuration points:

Parameter	Value	Why
`--dtype`	`bfloat16`	Stable on CPU. Do not use `float16`, which can cause numerical instability on CPU.
`--max-model-len`	`2048`	Limits the context window to reduce memory usage on CPU. Increase it if your workload needs more.
`VLLM_CPU_KVCACHE_SPACE`	`4`	Allocates 4 GiB for the KV cache. Increase it for longer contexts or larger models.
`VLLM_CPU_OMP_THREADS_BIND`	`auto`	Binds OpenMP threads to CPU cores, which prevents thread migration and NUMA contention.
`VLLM_ENABLE_V1_MULTIPROCESSING`	`0`	Runs in single-process mode. Recommended on CPU to avoid IPC overhead.

Save this to a file (e.g., qwen-cpu.yaml) and apply it:

kubectl apply -f qwen-cpu.yaml
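
As a rough sanity check on the memory settings in the CPU configuration above, the sketch below adds the approximate weight size to the KV cache allocation. It assumes about 0.5 billion parameters at 2 bytes each (bfloat16) and deliberately ignores activations and runtime overhead, so treat the result as a lower bound rather than a sizing rule.

```python
GIB = 1024 ** 3


def estimated_memory_gib(num_params: float, bytes_per_param: int, kv_cache_gib: float) -> float:
    """Weights plus KV cache, in GiB. Activations and runtime overhead are not included."""
    weights_gib = num_params * bytes_per_param / GIB
    return weights_gib + kv_cache_gib


# Assumed inputs: ~0.5e9 parameters, bfloat16 (2 bytes), VLLM_CPU_KVCACHE_SPACE=4
estimate = estimated_memory_gib(num_params=0.5e9, bytes_per_param=2, kv_cache_gib=4)
print(f"{estimate:.2f} GiB")  # 4.93 GiB
```

Under these assumptions the estimate stays below the 8Gi memory request, leaving some headroom for everything the sketch leaves out.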

Verifying Deployment

Check that your InferenceService is ready (use qwen-cpu instead of llama3-8b if you deployed the CPU example):

kubectl get inferenceservice llama3-8b

Expected Output

You should see output similar to:

NAME        URL                                    READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
llama3-8b   http://llama3-8b.default.example.com   True           100                              llama3-8b-predictor-default-xxxx   5m

Wait until the READY column shows True before proceeding.
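
If you want to script the readiness check instead of reading the table by eye, one possible approach is to inspect the `Ready` condition in the resource's JSON status. The sketch below assumes `kubectl` is on your PATH and that the status carries a `conditions` list with a `Ready` entry; the function names and the trimmed sample object are illustrative, not taken from a real cluster.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """True if the InferenceService status has a Ready condition set to "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


def fetch_isvc(name: str, namespace: str = "default") -> dict:
    """Fetch the InferenceService as JSON via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "inferenceservice", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Offline demonstration with a trimmed, made-up status object:
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(is_ready(sample))  # True
```

Against a live cluster, `is_ready(fetch_isvc("llama3-8b"))` could be polled in a loop with a sleep between attempts.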

Making Inference Requests

Both backends support the OpenAI-compatible API endpoints for inference. Set up your environment variables first:

# Replace with your actual model name and InferenceService
MODEL_NAME=llama3
SERVICE_NAME=llama3-8b
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${SERVICE_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)

Determine your ingress information as per KServe documentation and set INGRESS_HOST and INGRESS_PORT accordingly.
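
The variables above combine into a request URL that targets the ingress gateway plus a `Host` header that routes to the InferenceService. A minimal Python sketch of the same derivation follows; the IP address `203.0.113.10` and port `80` are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def service_hostname(status_url: str) -> str:
    """Host part of the InferenceService URL (what `cut -d "/" -f 3` extracts)."""
    return urlsplit(status_url).netloc


def build_request(ingress_host: str, ingress_port: int, status_url: str, path: str):
    """Return (url, headers) for a request routed through the ingress gateway."""
    url = f"http://{ingress_host}:{ingress_port}{path}"
    headers = {"Host": service_hostname(status_url), "Content-Type": "application/json"}
    return url, headers


# Placeholder ingress address; substitute your own INGRESS_HOST and INGRESS_PORT.
url, headers = build_request(
    "203.0.113.10", 80, "http://llama3-8b.default.example.com", "/openai/v1/completions"
)
print(url)              # http://203.0.113.10:80/openai/v1/completions
print(headers["Host"])  # llama3-8b.default.example.com
```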

Using the Completions API

For simple text completion:

curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-H "Host: ${SERVICE_HOSTNAME}" \
-H "Content-Type: application/json" \
-d '{
  "model": "'"${MODEL_NAME}"'", 
  "prompt": "Write a poem about colors", 
  "max_tokens": 100,
  "stream": false
}'

Expected Output

You should receive a response similar to:

{
  "id": "cmpl-625a9240f25e463487a9b6c53cbed080",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": " and how they make you feel\nColors, oh colors, so vibrant and bright\nA world of emotions, a kaleidoscope in sight\nRed"
    }
  ],
  "created": 1718620153,
  "model": "llama3",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 6,
    "total_tokens": 36
  }
}
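
For readers who prefer Python over curl, the following sketch sends the same kind of request with the standard library and pulls out the generated text. The helper names are hypothetical, and `post_json` needs a reachable service, so the demonstration at the bottom runs on a trimmed copy of the response shown above instead. A `finish_reason` of `length` indicates the output was cut off at `max_tokens`.

```python
import json
import urllib.request


def build_completion_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body matching the curl example above."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": False}


def post_json(url: str, host_header: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response (needs a reachable service)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Host": host_header, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_completion(response: dict) -> tuple:
    """Return (generated text, whether generation stopped at the max_tokens limit)."""
    choice = response["choices"][0]
    return choice["text"], choice["finish_reason"] == "length"


# Offline demonstration on a trimmed copy of the sample response:
sample = {
    "choices": [{"finish_reason": "length", "index": 0, "text": " and how they make you feel"}],
    "usage": {"completion_tokens": 30, "prompt_tokens": 6, "total_tokens": 36},
}
text, truncated = summarize_completion(sample)
print(repr(text), truncated)
```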

Using the Chat Completions API

For chat-based interactions with system and user messages:

curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-H "Host: ${SERVICE_HOSTNAME}" \
-H "Content-Type: application/json" \
-d '{
  "model": "'"${MODEL_NAME}"'",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that speaks like Shakespeare."},
    {"role": "user", "content": "Write a poem about colors"}
  ],
  "max_tokens": 100,
  "stream": false
}'

Expected Output

You should receive a response similar to:

{
   "id": "cmpl-9aad539128294069bf1e406a5cba03d3",
   "choices": [
     {
       "finish_reason": "length",
       "index": 0,
       "message": {
         "content": "  O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
         "tool_calls": null,
         "role": "assistant",
         "function_call": null
       },
       "logprobs": null
     }
   ],
   "created": 1718638005,
   "model": "llama3",
   "system_fingerprint": null,
   "object": "chat.completion",
   "usage": {
     "completion_tokens": 30,
     "prompt_tokens": 37,
     "total_tokens": 67
   }
}
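
The chat endpoint is stateless, so a follow-up question has to resend the earlier messages together with the assistant's reply. The sketch below shows one way to carry the conversation forward; the helper names, the shortened sample response, and the follow-up prompt are made up for illustration.

```python
def extract_reply(response: dict) -> dict:
    """Pull the assistant message out of a chat completion response."""
    message = response["choices"][0]["message"]
    return {"role": message["role"], "content": message["content"]}


def next_turn(messages: list, response: dict, user_text: str) -> list:
    """Return a new message list with the assistant reply and the next user message appended."""
    return messages + [extract_reply(response), {"role": "user", "content": user_text}]


history = [
    {"role": "system", "content": "You are a helpful assistant that speaks like Shakespeare."},
    {"role": "user", "content": "Write a poem about colors"},
]
# Shortened stand-in for the response shown above:
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "O, fair and vibrant colors", "tool_calls": None}}
    ]
}
history = next_turn(history, sample_response, "Now make it shorter")
print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The resulting `history` list can be sent as the `messages` field of the next request.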

Streaming Responses

Both APIs support streaming for real-time token generation. Simply set "stream": true in your request:

curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-H "Host: ${SERVICE_HOSTNAME}" \
-H "Content-Type: application/json" \
-d '{
  "model": "'"${MODEL_NAME}"'",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that speaks like Shakespeare."},
    {"role": "user", "content": "Write a poem about colors"}
  ],
  "max_tokens": 100,
  "stream": true
}'

Expected Output

You will receive a continuous stream of tokens as they are generated, similar to:

data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" ","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" O","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":",","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":"skie","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":",","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" what","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: [DONE]
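
Each streamed line carries a JSON chunk after the `data:` prefix, and the stream ends with `data: [DONE]`. A client has to strip the prefix, parse the chunk, and join the `delta.content` pieces. The sketch below does this for an iterable of text lines; the function name is hypothetical and the sample lines are shortened versions of the output above.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the content fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            yield content


# Shortened sample lines modeled on the stream shown above:
sample_lines = [
    'data: {"choices":[{"delta":{"content":" O","role":"assistant"},"index":0}]}',
    "",
    'data: {"choices":[{"delta":{"content":","},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" what"},"index":0}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample_lines)))
```

With a live connection, the same function could be fed decoded lines read from the HTTP response as they arrive.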

Troubleshooting

Common issues and solutions:

Init:OOMKilled: The storage initializer ran out of memory. Increase the memory limits in the ClusterStorageContainer.
OOM errors: Increase the memory allocation in the InferenceService specification.
Pending Deployment: Check that your cluster has enough GPU (or CPU/memory) resources available.
Model not found: Verify your Hugging Face token and model ID.
Illegal instruction (CPU): Your CPU does not support the required instruction set. vLLM CPU images work best with AVX-512. Check with lscpu | grep avx512.
Slow startup (CPU): CPU model loading takes longer than GPU. A 0.5B model may take 1 to 2 minutes to become ready. A 3B model may take 5 minutes or more.
Numerical errors (CPU): Make sure you set --dtype=bfloat16. Using float16 on CPU can produce incorrect results.
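
The AVX-512 check from the list above can also be done programmatically, for example inside a node diagnostic script. The sketch below scans text in the format of Linux `/proc/cpuinfo` for flags beginning with `avx512`; the function name and the sample text are illustrative, and other operating systems or architectures expose CPU features differently.

```python
def avx512_flags(cpuinfo_text: str) -> set:
    """Return the set of avx512* flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return flags


# Made-up sample in the /proc/cpuinfo format:
sample_cpuinfo = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx2 avx512f avx512bw\n"
print(sorted(avx512_flags(sample_cpuinfo)))  # ['avx512bw', 'avx512f']
```

On a Linux node, passing the contents of `/proc/cpuinfo` to `avx512_flags` and getting an empty set back suggests the CPU lacks AVX-512 support.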

Next Steps

Once you've successfully deployed your text generation model, consider:

Advanced serving options like multi-node inference for large models
Exploring other inference tasks such as text-to-text generation and embedding
Optimizing performance with features like model caching and KV cache offloading
Auto-scaling your inference services based on traffic patterns using KServe's auto-scaling capabilities
Token-based rate limiting with AI Gateway to control usage of served models

For more information on KServe's capabilities for generative AI, see the generative inference overview.
