
Deploy Your First GenAI Service

In this tutorial, you will deploy a Large Language Model (LLM) using KServe's InferenceService to create a powerful generative AI service. We'll use the Qwen model, a state-of-the-art language model developed by Alibaba, capable of understanding and generating human-like text across multiple languages.

You will learn how to deploy the model and interact with it using OpenAI-compatible APIs, making it easy to integrate with existing applications and tools that support the OpenAI standard.

Since your LLM is deployed as an InferenceService rather than a basic Kubernetes deployment, you automatically get enterprise-grade features like autoscaling, load balancing, canary deployments, and GPU acceleration out of the box πŸš€.

Prerequisites

Before you begin, ensure you have followed the KServe Quickstart Guide to set up KServe in your Kubernetes cluster. This guide assumes you have a working KServe installation and a Kubernetes cluster ready for deployment.
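If you want to confirm the installation before moving on, you can check that the InferenceService CRD is registered and that the KServe controller is running. The controller namespace below is typically kserve for a quickstart install, but it may differ in your environment:

kubectl get crd inferenceservices.serving.kserve.io
kubectl get pods -n kserve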

tip

KServe recommends Raw Deployment for Generative AI use cases.
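If your installation defaults to the Knative-based Serverless mode, you can opt an individual InferenceService into raw deployment with the serving.kserve.io/deploymentMode annotation. A minimal sketch of the metadata to merge into the manifest in step 2 (set it before creating the service):

metadata:
  name: "qwen-llm"
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment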

1. Create a namespace

First, create a namespace to use for deploying KServe resources:

kubectl create namespace kserve-test

2. Create an InferenceService

Create an InferenceService to deploy the Qwen LLM model. This model will be served using KServe's Hugging Face runtime with vLLM backend for optimized performance.
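Before applying the manifest, you can optionally check that the Hugging Face serving runtime is available in your cluster; in a default install it ships as a ClusterServingRuntime, though the exact runtime name can vary by KServe version:

kubectl get clusterservingruntimes | grep -i huggingface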

warning

Do not deploy InferenceServices in control plane namespaces (i.e., namespaces with the control-plane label). The webhook is configured to skip these namespaces to avoid privilege escalation. Deploying an InferenceService in one of these namespaces means the storage initializer will not be injected into the pod, causing the pod to fail with the error No such file or directory: '/mnt/models'.

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=qwen
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        limits:
          cpu: "2"
          memory: 6Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
EOF
Using a Hugging Face Token

If you need to authenticate with Hugging Face, first create a secret:

kubectl create secret generic hf-secret \
--from-literal=HF_TOKEN=your_hf_token_here \
-n kserve-test

Then create a ClusterStorageContainer resource that references the secret:

apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
name: hf-hub
spec:
container:
name: storage-initializer
image: kserve/storage-initializer:latest
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN
optional: false
resources:
requests:
memory: 2Gi
cpu: "1"
limits:
memory: 4Gi
cpu: "1"
supportedUriFormats:
- prefix: hf://
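Save the manifest to a file and apply it. The filename below is only a placeholder; ClusterStorageContainer is cluster-scoped, so no namespace flag is needed:

# hf-hub-storage.yaml is a placeholder filename for the manifest above
kubectl apply -f hf-hub-storage.yaml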

3. Check InferenceService status

kubectl get inferenceservices qwen-llm -n kserve-test
Expected Output
NAME       URL                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
qwen-llm   http://qwen-llm.kserve-test.example.com    True           100                              qwen-llm-predictor-default-47q2g   7d23h

If the URL shows the default example.com domain, consult your cluster administrator about configuring DNS or setting up a custom domain.
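On first deployment it can take a few minutes for the predictor to pull the runtime image and download the model, so the service may not be READY immediately. You can watch the predictor pod and tail its logs; the label selector and container name below are the ones KServe typically applies, so adjust them if your version differs:

kubectl get pods -n kserve-test -l serving.kserve.io/inferenceservice=qwen-llm -w
kubectl logs -n kserve-test -l serving.kserve.io/inferenceservice=qwen-llm -c kserve-container --tail=50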

4. Determine the ingress IP and ports

Execute the following command to determine whether your Kubernetes cluster is running in an environment that supports external load balancers:

kubectl get svc istio-ingressgateway -n istio-system
Expected Output
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)   AGE
istio-ingressgateway   LoadBalancer   172.21.109.129   130.211.10.121   ...       17h

If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
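If the EXTERNAL-IP value is <pending> or empty (common on local clusters such as kind or Minikube), one option is to port-forward the ingress gateway and point the ingress variables at localhost instead:

kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 &
export INGRESS_HOST=localhost
export INGRESS_PORT=8080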

5. Perform inference

Create a JSON file named chat-input.json with the following content to send a chat completion request to the Qwen model:

cat <<EOF > "./chat-input.json"
{
  "model": "qwen",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that provides clear and concise answers."
    },
    {
      "role": "user",
      "content": "Write a short poem about artificial intelligence and machine learning."
    }
  ],
  "max_tokens": 150,
  "temperature": 0.7,
  "stream": false
}
EOF

Depending on your setup, use one of the following commands to curl the InferenceService. If the service hostname resolves from your machine (that is, DNS or a custom domain is configured), call it directly:

curl -v -H "Content-Type: application/json" http://qwen-llm.kserve-test.example.com/openai/v1/chat/completions -d @./chat-input.json
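If the hostname does not resolve from your machine, you can target the ingress address from step 4 instead and pass the service hostname in a Host header. SERVICE_HOSTNAME below is derived from the InferenceService status URL:

export SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -d @./chat-input.json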

You should see a response similar to the following, which contains the generated text from the Qwen model:

{
  "id": "cmpl-generated-id",
  "object": "chat.completion",
  "created": 1703123456,
  "model": "qwen",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a poem about artificial intelligence and machine learning:\n\nSilicon minds awakening bright,\nThrough data streams and neural flight,\nPatterns learned from endless code,\nAI walks the digital road.\n\nMachine learning, wise and true,\nFinds the answers we pursue,\nIn the dance of ones and zeros,\nTechnology becomes our heroes."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 67,
    "total_tokens": 112
  }
}
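Because the endpoint follows the OpenAI chat completions schema, you can also request a streamed response by setting "stream": true, in which case the reply is returned incrementally as server-sent events. A quick sketch, using the same hostname assumptions as above:

curl -H "Content-Type: application/json" \
  http://qwen-llm.kserve-test.example.com/openai/v1/chat/completions \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Say hello in one short sentence."}], "stream": true}'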

6. Clean up

To clean up the resources created in this tutorial, delete the InferenceService and the namespace:

kubectl delete inferenceservice qwen-llm -n kserve-test
kubectl delete namespace kserve-test
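If you also created the optional Hugging Face resources, deleting the namespace removes the hf-secret, but the ClusterStorageContainer is cluster-scoped and must be deleted separately:

kubectl delete clusterstoragecontainer hf-hub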

7. Next Steps

Now that you have successfully deployed a generative AI service using KServe, you can explore more advanced features such as:

  • πŸ“– KServe Concepts - Learn about the core concepts of KServe.
  • πŸ“– Supported Tasks - Discover the various tasks that KServe can handle.
  • πŸ“– Autoscaling - Automatically scale your service based on traffic and resource usage or metrics.
  • πŸ“– KV Cache Offloading - Learn how to offload key-value caches to external storage for improved performance and reduced latency.
  • πŸ“– Model Caching - Learn how to cache models for faster startup times.
  • πŸ“– Token Rate Limiting - Rate limit users based on token usage.