Deploy the T5 model for the Text2Text Generation task with the Hugging Face LLM Serving Runtime¶
In this example, we demonstrate how to deploy a T5 model from Hugging Face for the Text2Text Generation task by deploying an InferenceService with the Hugging Face serving runtime.
Serve the Hugging Face LLM model using the Hugging Face backend¶
By default, the KServe Hugging Face runtime uses vLLM to serve LLMs, providing faster time-to-first-token (TTFT) and higher token-generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention and continuous batching, along with optimized CUDA kernels. If a model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
You can also pass the --backend=huggingface argument to perform inference using the Hugging Face API. The KServe Hugging Face backend runtime likewise supports the OpenAI /v1/completions and /v1/chat/completions endpoints for inference.
Note
At the time this document was written, the t5 model was not supported by the vLLM engine, so the runtime automatically falls back to the Hugging Face backend to serve the model.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
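To confirm which backend the runtime actually selected, you can inspect the predictor pod logs. A minimal sketch, assuming the standard KServe pod label and container name:

# assumes the default KServe pod label and container name
kubectl logs -l serving.kserve.io/inferenceservice=huggingface-t5 -c kserve-container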
Check the InferenceService status¶
kubectl get inferenceservices huggingface-t5
Expected Output
NAME             URL                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                       AGE
huggingface-t5   http://huggingface-t5.default.example.com   True           100                              huggingface-t5-predictor-default-47q2g   7d23h
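Optionally, you can block until the service reports ready by waiting on its Ready condition, for example:

# waits up to five minutes for the InferenceService to become Ready
kubectl wait --for=condition=Ready inferenceservice/huggingface-t5 --timeout=300s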
Perform Model Inference¶
The first step is to determine the ingress
IP and ports and set INGRESS_HOST and INGRESS_PORT.
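For example, with an Istio ingress gateway you can set them as follows (a minimal sketch; adjust the namespace and service name to match your ingress setup):

# assumes an Istio ingress gateway in the istio-system namespace
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

Then set SERVICE_HOSTNAME from the InferenceService URL: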
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Sample OpenAI Completions request:¶
curl -H "content-type:application/json" \
-H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'
Expected Output
{
  "id": "de53f527-9cb9-47a5-9673-43d180b704f2",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Das Haus ist wunderbar."
    }
  ],
  "created": 1717998661,
  "model": "t5",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 11,
    "total_tokens": 18
  }
}
Sample OpenAI Completions streaming request:¶
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":true, "max_tokens": 30 }'
Expected Output
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Haus "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"ist "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"wunderbar.</s>"}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}
data: [DONE]
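As noted above, the backend runtime also exposes the OpenAI /v1/chat/completions endpoint. A hedged sketch of an equivalent chat-style request follows; whether a text2text model such as t5 returns useful output here depends on how the backend maps chat messages onto the model's prompt format:

# hypothetical chat-style request; output quality depends on how the backend
# maps chat messages onto t5's text2text prompt format
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-d '{"model": "t5", "messages": [{"role": "user", "content": "translate English to German: The house is wonderful."}], "stream": false, "max_tokens": 30 }'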