Deploy the T5 model for the Text2Text Generation task with the Hugging Face LLM Serving Runtime¶
In this example, we demonstrate how to deploy a T5 model from Hugging Face for the Text2Text Generation task by deploying an InferenceService with the Hugging Face serving runtime.
Serve the Hugging Face LLM model using the Hugging Face backend¶
By default, the KServe Hugging Face runtime uses vLLM to serve LLMs, providing faster time-to-first-token (TTFT) and higher token-generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention and continuous batching, along with optimized CUDA kernels. If a model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
You can also pass the --backend=huggingface argument to perform inference using the Hugging Face API. The KServe Hugging Face backend runtime likewise supports the OpenAI /v1/completions and /v1/chat/completions endpoints for inference.
Note
At the time this document was written, the t5 model was not supported by the vLLM engine, so the runtime automatically falls back to the Hugging Face backend to serve the model.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
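To confirm which backend the runtime actually selected, you can inspect the predictor pod logs. A minimal sketch, assuming the standard KServe pod label and container name:

# assumes the default KServe pod label and container name
kubectl logs -l serving.kserve.io/inferenceservice=huggingface-t5 -c kserve-container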
Check the InferenceService status¶
kubectl get inferenceservices huggingface-t5
Expected Output
NAME             URL                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                       AGE
huggingface-t5   http://huggingface-t5.default.example.com   True           100                              huggingface-t5-predictor-default-47q2g   7d23h
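Optionally, you can block until the service reports ready by waiting on its Ready condition, for example:

# waits up to five minutes for the InferenceService to become Ready
kubectl wait --for=condition=Ready inferenceservice/huggingface-t5 --timeout=300s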
Perform Model Inference¶
The first step is to determine the ingress
IP and ports and set INGRESS_HOST and INGRESS_PORT.
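For example, with an Istio ingress gateway you can set them as follows (a minimal sketch; adjust the namespace and service name to match your ingress setup):

# assumes an Istio ingress gateway in the istio-system namespace
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

Then set SERVICE_HOSTNAME from the InferenceService URL: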
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Sample OpenAI Completions request:¶
curl -H "content-type:application/json" \
-H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'
Expected Output
{
  "id": "de53f527-9cb9-47a5-9673-43d180b704f2",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Das Haus ist wunderbar."
    }
  ],
  "created": 1717998661,
  "model": "t5",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 11,
    "total_tokens": 18
  }
}
Sample OpenAI Completions streaming request:¶
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":true, "max_tokens": 30 }'
Expected Output
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Haus "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"ist "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"wunderbar.</s>"}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: [DONE]
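As noted above, the backend runtime also exposes the OpenAI /v1/chat/completions endpoint. A hedged sketch of an equivalent chat-style request follows; whether a text2text model such as t5 returns useful output here depends on how the backend maps chat messages onto the model's prompt format:

# hypothetical chat-style request; output quality depends on how the backend
# maps chat messages onto t5's text2text prompt format
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-d '{"model": "t5", "messages": [{"role": "user", "content": "translate English to German: The house is wonderful."}], "stream": false, "max_tokens": 30 }'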