# Text-to-Text Generation with T5 Model
Text-to-text generation is a versatile NLP task where both input and output are text, enabling applications like translation, summarization, and question answering. This guide demonstrates how to deploy Google's T5 model using KServe's flexible inference runtimes.
## Prerequisites
Before getting started, ensure you have:
- A Kubernetes cluster with KServe installed.
- GPU resources available for model inference (this example uses NVIDIA GPUs).
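A quick way to confirm that GPUs are advertised to the Kubernetes scheduler (this check assumes NVIDIA's device plugin is installed):

```bash
# List allocatable NVIDIA GPUs per node; <none> means no GPUs are advertised
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```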
## Create a Hugging Face Secret (Optional)
If you plan to use private models from Hugging Face, you need to create a Kubernetes secret containing your Hugging Face API token. This step is optional for public models.
```bash
kubectl create secret generic hf-secret \
    --from-literal=HF_TOKEN=<your_huggingface_token>
```
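You can confirm the secret exists before referencing it elsewhere:

```bash
# Lists the secret's keys and sizes without revealing the token value
kubectl describe secret hf-secret
```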
## Create a StorageContainer (Optional)
For models that require authentication, you might need to create a `ClusterStorageContainer`. While the model in this example is public, for private models you would need to configure access:
```yaml
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
  name: hf-hub
spec:
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: false
    resources:
      requests:
        memory: 2Gi
        cpu: "1"
      limits:
        memory: 4Gi
        cpu: "1"
  supportedUriFormats:
    - prefix: hf://
```
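Save the manifest and apply it; the filename below is arbitrary:

```bash
# Apply the ClusterStorageContainer (filename is illustrative)
kubectl apply -f hf-hub-storage-container.yaml
```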
To learn more about storage containers, refer to the Storage Containers documentation.
## Deploy T5 Model

### Understanding Backend Options
KServe supports two inference backends for serving text-to-text models:
- vLLM Backend (default): This is the recommended backend for most language models, providing optimized performance with techniques like paged attention and continuous batching.
- Hugging Face Backend: This backend uses the standard Hugging Face inference API. It serves as a fallback for models not supported by vLLM, like the T5 model in this example.
At the time this document was written, the T5 model is not supported by the vLLM engine, so the runtime will automatically use the Hugging Face backend to serve the model.
Please refer to the overview of KServe's generative inference capabilities for more details on these backends.
### Deploy T5 with Hugging Face Backend
Since T5 is not currently supported by vLLM, we'll use the Hugging Face backend explicitly:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
```
Apply the YAML:
```bash
kubectl apply -f huggingface-t5.yaml
```
## Verifying Deployment
Check that your InferenceService is ready:
```bash
kubectl get inferenceservices huggingface-t5
```

Expected output:

```
NAME             URL                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                      AGE
huggingface-t5   http://huggingface-t5.default.example.com   True           100                              huggingface-t5-predictor-default-47q2g   7d23h
```
Wait until the `READY` column shows `True` before proceeding.
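Alternatively, you can block until the service reports ready; the timeout value here is only an example:

```bash
# Wait for the Ready condition on the InferenceService
kubectl wait --for=condition=Ready inferenceservice/huggingface-t5 --timeout=600s
```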
## Making Inference Requests
The T5 model supports OpenAI-compatible API endpoints for inference. Set up your environment variables first:
```bash
# Set up service hostname for requests
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```
Determine your ingress information as per the KServe documentation and set `INGRESS_HOST` and `INGRESS_PORT` accordingly.
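As a sketch, assuming a standard Istio ingress gateway in the `istio-system` namespace, the values can be derived like this:

```bash
# Assumes the istio-ingressgateway Service exposes a LoadBalancer IP
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```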
### Using the Completions API
For text-to-text translation:
```bash
curl -H "content-type:application/json" \
  -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream": false, "max_tokens": 30}'
```
Expected output:

```json
{
  "id": "de53f527-9cb9-47a5-9673-43d180b704f2",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Das Haus ist wunderbar."
    }
  ],
  "created": 1717998661,
  "model": "t5",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 11,
    "total_tokens": 18
  }
}
```
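The same endpoint handles T5's other task prefixes; for example, a summarization-style request (the input text below is illustrative):

```bash
curl -H "content-type:application/json" \
  -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -d '{"model": "t5", "prompt": "summarize: KServe provides a Kubernetes custom resource for serving machine learning models. It encapsulates the complexity of autoscaling, networking, and health checking.", "stream": false, "max_tokens": 30}'
```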
### Streaming Responses
The API also supports streaming for real-time token generation. Simply set `"stream": true` in your request:
```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream": true, "max_tokens": 30}'
```

Expected output:

```
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Haus "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"ist "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"wunderbar.</s>"}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: [DONE]
```
## Troubleshooting
Common issues and solutions:
- Init:OOMKilled: The storage initializer exceeded its memory limits. Try increasing the memory limits in the `ClusterStorageContainer`.
- OOM errors: Increase the memory allocation in the `InferenceService` specification.
- Pending deployment: Ensure your cluster has enough GPU resources available.
- Model not found: Double-check your Hugging Face token and model ID.
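For any of the issues above, the resource events and container logs are usually the fastest diagnostic signal; a minimal sketch (the `kserve-container` name assumes the default KServe predictor layout):

```bash
# Inspect conditions and events on the InferenceService
kubectl describe inferenceservice huggingface-t5

# Find the predictor pods and tail the model server logs
kubectl get pods -l serving.kserve.io/inferenceservice=huggingface-t5
kubectl logs -l serving.kserve.io/inferenceservice=huggingface-t5 -c kserve-container --tail=100
```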
## Next Steps
Once you've successfully deployed your text generation model, consider:
- Advanced serving options like multi-node inference for large models
- Exploring other inference tasks such as reranking and embedding
- Optimizing performance with features like model caching and KV cache offloading
- Auto-scaling your inference services based on traffic patterns
- Token-based rate limiting with an AI Gateway to control usage when serving models
For more information on KServe's capabilities for generative AI, see the generative inference overview.