Hugging Face Token Classification with KServe
This guide demonstrates how to deploy a BERT model for token classification tasks using KServe's Hugging Face serving runtime. Token classification models assign labels to individual tokens within a text sequence, making them useful for tasks like named entity recognition (NER), part-of-speech tagging, and chunking.
Prerequisites
Before you begin, make sure you have:
- A Kubernetes cluster with KServe installed
- The kubectl CLI configured to communicate with your cluster
- Basic knowledge of Kubernetes concepts and Hugging Face models
- GPU resources (optional but recommended for better performance)
Deploying the BERT Model for Token Classification
In this example, we'll deploy a BERT model for Named Entity Recognition (NER) using the Hugging Face Serving runtime. We'll demonstrate deployment using both V1 and V2 protocols.
Create a Hugging Face Secret (Optional)
If you plan to use private models from Hugging Face, you need to create a Kubernetes secret containing your Hugging Face API token. This step is optional for public models.
kubectl create secret generic hf-secret \
--from-literal=HF_TOKEN=<your_huggingface_token>
Create a StorageContainer (Optional)
For models that require authentication, you might need to create a ClusterStorageContainer. While the model in this example is public, for private models you would need to configure access:
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
  name: hf-hub
spec:
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: false
    resources:
      requests:
        memory: 2Gi
        cpu: "1"
      limits:
        memory: 4Gi
        cpu: "1"
  supportedUriFormats:
    - prefix: hf://
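If you do configure this, save the manifest to a file (for example hf-cluster-storage-container.yaml, a filename chosen here for illustration) and apply it:

kubectl apply -f hf-cluster-storage-container.yaml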
To learn more about storage containers, refer to the Storage Containers documentation.
Deploy with the V1 Protocol
Create an InferenceService resource to deploy the BERT token classification model:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=bert
        - --disable_lower_case
      storageUri: "hf://dslim/bert-base-NER"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
Save this configuration to a file named huggingface-bert-v1.yaml and apply it:
kubectl apply -f huggingface-bert-v1.yaml
Check the InferenceService Status
Verify that the InferenceService is deployed and ready:
kubectl get inferenceservices huggingface-bert
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
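You can also block until the service reports a Ready condition; for example, with a five-minute timeout (one option, since kubectl wait works against any resource exposing that condition):

kubectl wait --for=condition=Ready inferenceservice/huggingface-bert --timeout=300s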
Perform Model Inference with V1 Protocol
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
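If your cluster routes traffic through an Istio ingress gateway in the istio-system namespace (a common KServe setup; adjust the service name and namespace for your environment), one way to set them is:

INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')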
Set up the environment variables:
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Send a prediction request:
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["My name is Wolfgang and I live in Berlin", "My name is Lisa and I live in Paris"]}'
{"predictions":[[[0,0,0,0,3,0,0,0,0,7,0]],[[0,0,0,0,3,0,0,0,0,7,0]]]}
Deploy with the Open Inference Protocol (V2)
For V2 protocol deployment, we need to set the protocolVersion field to v2:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=bert
        - --disable_lower_case
      storageUri: "hf://dslim/bert-base-NER"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
Save this configuration to a file named huggingface-bert-v2.yaml and apply it:
kubectl apply -f huggingface-bert-v2.yaml
Check the InferenceService Status
Verify that the InferenceService is deployed and ready:
kubectl get inferenceservices huggingface-bert
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
Perform Model Inference with V2 Protocol
As with the V1 deployment, first determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT (see the V1 section above).
Set up the environment variables:
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Send a prediction request:
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["My name is Wolfgang and I live in Berlin", "My name is Lisa and I live in Paris"]}]}'
{
  "model_name": "bert",
  "model_version": null,
  "id": "3117e54b-8a6a-4072-9d87-6d7bdfe05eed",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2,1,11],
      "datatype": "INT64",
      "parameters": null,
      "data": [0,0,0,0,3,0,0,0,0,7,0,0,0,0,0,3,0,0,0,0,7,0]
    }
  ]
}
Understanding the Output
In the response, the numbers represent entity tags for each token in the input text. For the BERT NER model used in this example:
- 0: "O" (Outside of a named entity)
- 3: "B-PER" (Beginning of a person entity)
- 7: "B-LOC" (Beginning of a location entity)
So in the example response:
- "Wolfgang" and "Lisa" are identified as person entities (B-PER)
- "Berlin" and "Paris" are identified as location entities (B-LOC)
Customizing the Model
You can customize the deployment by:
- Using a different pre-trained model: Change the storageUri (or the --model_id argument, if you load the model that way) to a different Hugging Face model ID, as shown in the example after this list
- Adding model configuration options: Add additional arguments to customize tokenization or model behavior
- Adjusting resource requests: Modify the CPU, memory, or GPU resources based on your requirements
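For example, a hypothetical variation of the predictor spec above that loads the larger public dslim/bert-large-NER checkpoint (only the changed fields are shown) might look like this:

predictor:
  model:
    modelFormat:
      name: huggingface
    args:
      - --model_name=bert-large
      - --disable_lower_case
    storageUri: "hf://dslim/bert-large-NER"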
Advanced Configuration
Return Token Probabilities
The Hugging Face runtime supports returning probabilities for each token prediction by adding the --return_probabilities argument to your deployment:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert-with-probs
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=bert
        - --disable_lower_case
        - --return_probabilities
      storageUri: "hf://dslim/bert-base-NER"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
Save this configuration to a file named huggingface-bert-with-probs.yaml and apply it:
kubectl apply -f huggingface-bert-with-probs.yaml
When using the --return_probabilities flag, the response will include probability scores for each entity tag prediction, providing confidence levels for the model's predictions. This is particularly useful for:
- Understanding the model's confidence in its predictions
- Setting thresholds for accepting predictions based on confidence (see the sketch after this list)
- Analyzing ambiguous cases where multiple entity types might be plausible
- Fine-tuning model performance based on confidence distributions
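As a minimal illustration of the thresholding use case (a sketch, assuming the string-encoded probability dictionaries shown below), you could accept a token prediction only when its top probability clears a cutoff:

import ast

def accept_prediction(prob_dict_str, threshold=0.9):
    # Parse a string-encoded probability dictionary, e.g. "{0: 0.982, 1: 0.003, ...}"
    probs = ast.literal_eval(prob_dict_str)
    # Pick the tag with the highest probability
    tag, confidence = max(probs.items(), key=lambda kv: kv[1])
    # Reject low-confidence predictions
    return (tag, confidence) if confidence >= threshold else (None, confidence)

print(accept_prediction("{0: 0.982, 1: 0.003, 2: 0.001, 3: 0.002, 4: 0.001, 5: 0.002, 6: 0.001, 7: 0.002, 8: 0.001}"))
# (0, 0.982) -> tag 0 ("O") accepted with high confidence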
The output format will include probability values for each possible entity tag, allowing you to see the confidence distribution across all possible classifications:
{
  "model_name": "bert",
  "model_version": null,
  "id": "a53fe3cf-a328-4d46-a92c-fd4a0e5c5301",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2,1,11],
      "datatype": "BYTES",
      "parameters": null,
      "data": [
        "{0: 0.982, 1: 0.003, 2: 0.001, 3: 0.002, 4: 0.001, 5: 0.002, 6: 0.001, 7: 0.002, 8: 0.001}",
        "{0: 0.978, 1: 0.003, 2: 0.001, 3: 0.010, 4: 0.001, 5: 0.002, 6: 0.001, 7: 0.005, 8: 0.003}",
        "{0: 0.983, 1: 0.003, 2: 0.001, 3: 0.006, 4: 0.003, 5: 0.001, 6: 0.001, 7: 0.002, 8: 0.003}",
        "{0: 0.977, 1: 0.004, 2: 0.002, 3: 0.008, 4: 0.003, 5: 0.001, 6: 0.001, 7: 0.003, 8: 0.004}",
        "{0: 0.002, 1: 0.001, 2: 0.001, 3: 0.985, 4: 0.002, 5: 0.002, 6: 0.001, 7: 0.004, 8: 0.001}",
        /* ... additional token probabilities for the first sentence ... */
        "{0: 0.936, 1: 0.078, 2: 0.002, 3: 0.102, 4: 0.002, 5: 0.001, 6: 0.001, 7: 0.073, 8: 0.002}",
        /* ... probabilities for tokens in the second sentence ... */
        "{0: 0.003, 1: 0.002, 2: 0.001, 3: 0.002, 4: 0.001, 5: 0.004, 6: 0.001, 7: 0.985, 8: 0.001}",
        "{0: 0.943, 1: 0.015, 2: 0.003, 3: 0.006, 4: 0.002, 5: 0.002, 6: 0.003, 7: 0.008, 8: 0.003}"
      ]
    }
  ]
}
Processing Labels with Python
For a more user-friendly interpretation of the model outputs, you can process the labels using Python. The example below assumes the InferenceService was deployed with the --return_probabilities flag, so each entry in the output data is a string-encoded dictionary of tag probabilities:

import requests
import json
import ast

# Set up variables
ingress_host = "your-ingress-host"
ingress_port = "your-ingress-port"
service_hostname = "your-service-hostname"
model_name = "bert"

# Input text
texts = ["My name is Wolfgang and I live in Berlin", "My name is Lisa and I live in Paris"]

# Define the request payload (V2 protocol)
request_data = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [len(texts)],
            "datatype": "BYTES",
            "data": texts
        }
    ]
}

# Send the request
url = f"http://{ingress_host}:{ingress_port}/v2/models/{model_name}/infer"
headers = {
    "Host": service_hostname,
    "Content-Type": "application/json"
}
response = requests.post(url, headers=headers, data=json.dumps(request_data))

# Process the response
result = response.json()
probability_data = result["outputs"][0]["data"]

# NER tags for the dslim/bert-base-NER model
tag_map = {
    0: "O",       # Outside of a named entity
    1: "B-MISC",  # Beginning of a miscellaneous entity
    2: "I-MISC",  # Inside of a miscellaneous entity
    3: "B-PER",   # Beginning of a person entity
    4: "I-PER",   # Inside of a person entity
    5: "B-ORG",   # Beginning of an organization entity
    6: "I-ORG",   # Inside of an organization entity
    7: "B-LOC",   # Beginning of a location entity
    8: "I-LOC"    # Inside of a location entity
}

# Process probability data
token_predictions = []
token_probabilities = []

for prob_dict_str in probability_data:
    # Convert string representation to dictionary
    prob_dict = ast.literal_eval(prob_dict_str)
    # Find tag with highest probability
    max_tag = max(prob_dict.items(), key=lambda x: x[1])[0]
    token_predictions.append(max_tag)
    # Store all probabilities
    token_probabilities.append(prob_dict)

# Group by sentence (assuming we know the token counts per sentence)
# For this example, let's say the first sentence has 11 tokens
first_sentence_tokens = token_predictions[:11]
second_sentence_tokens = token_predictions[11:]

# Map numeric labels to text labels
first_sentence_entities = [tag_map.get(tag, f"Unknown-{tag}") for tag in first_sentence_tokens]
second_sentence_entities = [tag_map.get(tag, f"Unknown-{tag}") for tag in second_sentence_tokens]

# Display results with confidence scores
print("First sentence entities:")
for i, entity in enumerate(first_sentence_entities):
    if entity != "O":  # Only show named entities
        confidence = token_probabilities[i][int(token_predictions[i])]
        print(f"Entity: {entity}, Confidence: {confidence:.4f}")

print("\nSecond sentence entities:")
for i, entity in enumerate(second_sentence_entities):
    if entity != "O":  # Only show named entities
        confidence = token_probabilities[i + 11][int(token_predictions[i + 11])]
        print(f"Entity: {entity}, Confidence: {confidence:.4f}")

# Show the full tag sequence for each input text
tagged_outputs = [first_sentence_entities, second_sentence_entities]
for i, (text, tags) in enumerate(zip(texts, tagged_outputs)):
    print(f"Text {i+1}: {text}")
    print(f"Tags {i+1}: {tags}")
    print()
This script will produce a more readable output showing the named entity tags for each token in the input text.
Processing Probabilities with Python
If you're using the --return_probabilities flag, you can also process and analyze the probability values to understand the model's confidence in its predictions:
import requests
import json
import numpy as np
import ast

# Set up variables
ingress_host = "your-ingress-host"
ingress_port = "your-ingress-port"
service_hostname = "your-service-hostname"
model_name = "bert"

# Input text
texts = ["My name is Wolfgang and I live in Berlin"]

# Define the request payload (V2 protocol)
request_data = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [len(texts)],
            "datatype": "BYTES",
            "data": texts
        }
    ]
}

# Send the request
url = f"http://{ingress_host}:{ingress_port}/v2/models/{model_name}/infer"
headers = {
    "Host": service_hostname,
    "Content-Type": "application/json"
}
response = requests.post(url, headers=headers, data=json.dumps(request_data))

# Process the response
result = response.json()

# Get the per-token probabilities from the output
# (with --return_probabilities each entry is a string-encoded dictionary)
probability_data = result["outputs"][0]["data"]

# NER tags for the dslim/bert-base-NER model
tag_map = {
    0: "O",       # Outside of a named entity
    1: "B-MISC",  # Beginning of a miscellaneous entity
    2: "I-MISC",  # Inside of a miscellaneous entity
    3: "B-PER",   # Beginning of a person entity
    4: "I-PER",   # Inside of a person entity
    5: "B-ORG",   # Beginning of an organization entity
    6: "I-ORG",   # Inside of an organization entity
    7: "B-LOC",   # Beginning of a location entity
    8: "I-LOC"    # Inside of a location entity
}

# Parse the probability dictionaries and pick the most likely tag per token
token_predictions = []
token_probabilities = []

for probs_str in probability_data:
    # Convert the string representation to a dictionary keyed by tag index
    probs_dict = ast.literal_eval(probs_str)
    # Order the probabilities by tag index
    probs = np.array([probs_dict[i] for i in range(len(probs_dict))])
    # Get the most likely tag
    pred_idx = int(np.argmax(probs))
    token_predictions.append(pred_idx)
    token_probabilities.append(probs)

# Process results for each token
tokens = texts[0].split()  # Simple tokenization (not accurate for BERT)

print(f"Text: {texts[0]}")
print("Token\tPredicted Tag\tConfidence\tAll Probabilities")
print("-----\t------------\t----------\t----------------")

# Note: This is a simplified example - actual BERT tokenization would be different
for token, pred, probs in zip(tokens, token_predictions[:len(tokens)], token_probabilities[:len(tokens)]):
    tag = tag_map.get(pred, f"Unknown-{pred}")
    confidence = probs[pred]
    # Format probabilities as percentages
    probs_formatted = ", ".join([f"{tag_map[i]}:{p*100:.1f}%" for i, p in enumerate(probs)])
    print(f"{token}\t{tag}\t{confidence*100:.1f}%\t{probs_formatted}")

print()
This script demonstrates how to:
- Extract both prediction labels and probability values
- Reshape probability values for each token
- Calculate confidence scores for each prediction
- Display all possible tag probabilities for each token
By analyzing these probability distributions, you can identify cases where the model is uncertain (e.g., when multiple tags have similar probabilities) or set confidence thresholds for using predictions in production systems.
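As a quick sketch of one way to flag such uncertain tokens (an illustration, not part of the runtime), you can compare the two highest probabilities for a token and treat a small gap as ambiguity:

import ast

def is_uncertain(prob_dict_str, min_margin=0.2):
    # Sort the tag probabilities from highest to lowest
    probs = sorted(ast.literal_eval(prob_dict_str).values(), reverse=True)
    # A small gap between the top two tags suggests an ambiguous token
    return (probs[0] - probs[1]) < min_margin

print(is_uncertain("{0: 0.936, 1: 0.078, 2: 0.002, 3: 0.102, 4: 0.002, 5: 0.001, 6: 0.001, 7: 0.073, 8: 0.002}"))
# False here: the top tag (0.936) clearly dominates the runner-up (0.102)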
Understanding Probabilities vs. Raw Logits
When working with token classification models, you have two options for accessing the model's confidence values:
- Probabilities (--return_probabilities): Normalized values between 0 and 1 that sum to 1 across all possible tags for each token. These are easier to interpret as they directly represent the model's confidence in each tag.
- Raw Logits (--return_raw_logits): Unnormalized output values directly from the model before softmax is applied. These values can range from negative to positive infinity and are harder to interpret directly, but preserve more information for specialized processing.
When to Use Probabilities vs. Raw Logits
| Feature | Probabilities | Raw Logits |
| --- | --- | --- |
| Interpretability | ✅ Directly represents confidence (0-1) | ❌ Requires conversion to be meaningful |
| Preservation of information | ❌ Information loss due to softmax | ✅ Preserves full model output |
| Typical use case | Production systems, thresholding, user-facing applications | Research, model fine-tuning, custom processing pipelines |
| Processing requirement | Ready to use | Needs softmax conversion |
Converting Between Logits and Probabilities
If you receive raw logits but need probabilities, you can convert them using the softmax function:
import numpy as np
from scipy.special import softmax

# Convert logits to probabilities
def logits_to_probs(logits):
    return softmax(np.array(logits))

# Example
raw_logits = [7.069, -0.455, -1.075, -0.858, -1.136, -0.965, -1.593, -0.979, -1.188]
probabilities = logits_to_probs(raw_logits)
print([f"{p:.4f}" for p in probabilities])
# Output (approximately): ['0.9975', '0.0005', '0.0003', '0.0004', '0.0003', '0.0003', '0.0002', '0.0003', '0.0003']
Practical Applications for Raw Logits
Raw logits can be particularly useful for:
- Temperature scaling: Adjust the "confidence" of predictions by applying different temperature parameters to logits before softmax (see the sketch after this list)
- Ensemble methods: Combine logits from multiple models before applying softmax
- Calibration: Apply post-hoc calibration techniques that work directly on logits
- Custom decision thresholds: Implement specialized decision boundaries that work better with unnormalized values
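For instance, here is a minimal sketch of the temperature-scaling idea from the first bullet (an illustration only, reusing the example logits from above): dividing the logits by a temperature T > 1 flattens the resulting distribution, while T < 1 sharpens it.

import numpy as np
from scipy.special import softmax

def scaled_probs(logits, temperature=1.0):
    # Divide logits by the temperature before normalizing
    return softmax(np.array(logits) / temperature)

raw_logits = [7.069, -0.455, -1.075, -0.858, -1.136, -0.965, -1.593, -0.979, -1.188]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(scaled_probs(raw_logits, t), 4))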
Accessing Raw Logits with --return_raw_logits
For advanced use cases requiring the raw model outputs, you can disable the post-processing step entirely using the --return_raw_logits flag. This returns the raw logits directly from the model:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert-with-logits
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=bert
        - --disable_lower_case
        - --return_raw_logits
      storageUri: "hf://dslim/bert-base-NER"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
With --return_raw_logits, the response includes raw logit values before any softmax normalization:
{
  "model_name": "bert",
  "model_version": null,
  "id": "a53fe3cf-a328-4d46-a92c-fd4a0e5c5301",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2,1,11],
      "datatype": "BYTES",
      "parameters": null,
      "data": [
        "{0: 7.06871223449707, 1: -0.45536208152770996, 2: -1.0750019550323486, 3: -0.8583133816719055, 4: -1.1360667943954468, 5: -0.9653294682502747, 6: -1.593642234802246, 7: -0.9799260497093201, 8: -1.1887028217315674}",
        /* ... more token logits ... */
      ]
    }
  ]
}
Processing Raw Logits with Python
Here's an example of how to process raw logits and convert them to probabilities:
import requests
import json
import numpy as np
import ast
from scipy.special import softmax

# Set up variables and send request (same as previous examples)
# ...

# Process the raw logits response
result = response.json()
logits_data = result["outputs"][0]["data"]

# Parse and convert logits
tokens = texts[0].split()  # Simplified tokenization
processed_tokens = []

for token, logits_str in zip(tokens, logits_data[:len(tokens)]):
    # Parse the logits dictionary
    logits_dict = ast.literal_eval(logits_str)
    # Extract logits in correct order
    logits = [logits_dict[i] for i in sorted(logits_dict.keys())]
    # Convert to probabilities
    probs = softmax(logits)
    # Get predicted tag and its probability
    pred_idx = np.argmax(logits)
    tag = tag_map[pred_idx]
    confidence = probs[pred_idx]
    processed_tokens.append({
        "token": token,
        "tag": tag,
        "confidence": confidence,
        "logits": logits,
        "probabilities": probs
    })

# Print results with both logits and probabilities
for item in processed_tokens:
    if item["tag"] != "O":  # Only show named entities
        print(f"Token: {item['token']}")
        print(f"  Tag: {item['tag']} (confidence: {item['confidence']:.4f})")
        print(f"  Logits: {[f'{l:.2f}' for l in item['logits'][:3]]}")
        print(f"  Probs: {[f'{p:.4f}' for p in item['probabilities'][:3]]}")
        print()
When to Use --return_raw_logits
The --return_raw_logits flag is particularly useful for:
- Advanced model analysis: Examining raw model outputs to understand prediction behavior
- Custom processing pipelines: Implementing your own post-processing logic instead of using the default
- Model calibration: Analyzing and adjusting the model's confidence calibration
- Academic research: Preserving all numerical information from the model for research purposes
Troubleshooting
If you encounter issues during deployment or inference, consider the following:
- Init:OOMKilled: This indicates that the storage initializer exceeded the memory limits. You can try increasing the memory limits in the ClusterStorageContainer (see the example below).
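For example, one way to adjust the limit (assuming you created the hf-hub ClusterStorageContainer shown earlier) is to edit it and raise spec.container.resources.limits.memory, for instance from 4Gi to 8Gi:

kubectl edit clusterstoragecontainer hf-hub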
Next Steps
- Learn about other NLP tasks supported by Hugging Face and KServe