
Text Embeddings with Sentence Transformers

Text embeddings are numerical representations of text that capture semantic meaning in vector form. These embeddings are essential for various machine learning applications including semantic search, clustering, recommendation systems, and similarity analysis. This guide demonstrates how to deploy a Sentence Transformer model for generating text embeddings using KServe's flexible inference runtimes.

Understanding Embedding Models

Embeddings transform text into high-dimensional vector spaces where semantically similar texts are positioned closer together. This mathematical representation enables machines to understand relationships between words and sentences based on their meaning rather than just lexical matching.
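To make this concrete, here is a minimal sketch (not part of the deployment itself, using NumPy and made-up three-dimensional vectors rather than real model output) of how cosine similarity measures the closeness of two embeddings, the metric most similarity-based applications use:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional vectors standing in for real embeddings
# (all-MiniLM-L6-v2 actually produces 384-dimensional vectors).
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.85, 0.15, 0.05])
invoice = np.array([0.0, 0.2, 0.95])

print(cosine_similarity(cat, kitten))   # high score: related concepts
print(cosine_similarity(cat, invoice))  # low score: unrelated concepts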

KServe supports different embedding models through its serving runtimes:

  1. Sentence Transformers: These models are specifically trained to generate meaningful sentence embeddings and are widely used for semantic similarity tasks.
  2. General-purpose LLMs: Some large language models can also generate embeddings as one of their capabilities.

Prerequisites

Before getting started, ensure you have:

  • A Kubernetes cluster with KServe installed.
  • GPU resources available for model inference (recommended for better performance).
  • Basic familiarity with vector embeddings concepts.

Create a Hugging Face Secret (Optional)

If you plan to use private models from Hugging Face, you need to create a Kubernetes secret containing your Hugging Face API token. This step is optional for public models.

kubectl create secret generic hf-secret \
  --from-literal=HF_TOKEN=<your_huggingface_token>

Create a StorageContainer (Optional)

For models that require authentication, you might need to create a ClusterStorageContainer. While the model in this example is public, for private models you would need to configure access:

huggingface-storage.yaml
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
name: hf-hub
spec:
container:
name: storage-initializer
image: kserve/storage-initializer:latest
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN
optional: false
resources:
requests:
memory: 2Gi
cpu: "1"
limits:
memory: 4Gi
cpu: "1"
supportedUriFormats:
- prefix: hf://
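Apply the manifest like any other Kubernetes resource:

kubectl apply -f huggingface-storage.yaml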

To learn more about storage containers, refer to the Storage Containers documentation.

Deploy Embedding Model

Understanding Backend Options

KServe supports two inference backends for serving LLMs and embedding models:

  1. vLLM Backend (default): This is the recommended backend for serving LLMs, providing optimized performance and lower latency. It supports advanced features like model parallelism and efficient memory management.
  2. Hugging Face Backend: This backend uses the standard Hugging Face library. It is suitable for simpler use cases but may not perform as well as vLLM for larger models or high concurrency scenarios.

Please refer to the overview of KServe's generative inference capabilities for more details on these backends.

Choose the backend that fits your embedding model deployment; this guide deploys with the vLLM backend, and a Hugging Face backend variant is sketched after the vLLM manifest:

note

Note that the backends use different values for the --task argument. The vLLM backend uses embed, while the Hugging Face backend uses text_embedding. Ensure you use the correct one based on your deployment choice.

For embedding models, the vLLM backend provides high-performance embedding generation with optimized CUDA kernels:

embedding-vllm.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: embedding-model
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=sentence-transformer
        - --task=embed
      storageUri: "hf://sentence-transformers/all-MiniLM-L6-v2"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 1Gi
          nvidia.com/gpu: "1"

Apply the YAML:

kubectl apply -f embedding-vllm.yaml
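If you would rather use the Hugging Face backend, the manifest is nearly identical. The sketch below is an untested variant: it assumes the runtime's --backend=huggingface argument selects the backend and, per the note above, switches the task to text_embedding; the file name embedding-hf.yaml is just a placeholder, and both flags should be verified against your KServe version.

embedding-hf.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: embedding-model
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=sentence-transformer
        - --task=text_embedding
        - --backend=huggingface
      storageUri: "hf://sentence-transformers/all-MiniLM-L6-v2"
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 1Gi
          nvidia.com/gpu: "1"

Apply it the same way with kubectl apply -f embedding-hf.yaml.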

Verifying Deployment

Check that your InferenceService is ready:

kubectl get inferenceservices embedding-model
Expected Output
NAME              URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                        AGE
embedding-model   http://embedding-model.default.example.com   True           100                              embedding-model-predictor-default-xjh8p   3m

Wait until the READY column shows True before proceeding.
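If you prefer to block until the service reports ready instead of polling, kubectl wait can watch the Ready condition on the InferenceService (the timeout below is an arbitrary example):

kubectl wait --for=condition=Ready inferenceservice/embedding-model --timeout=600s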

Making Inference Requests

The deployed model exposes an OpenAI-compatible embeddings endpoint for generating embeddings. The request format varies slightly based on the backend you chose. Set up your environment variables first:

# Set up service hostname for requests
SERVICE_NAME="embedding-model"
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${SERVICE_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)

Determine your ingress information as per KServe documentation and set INGRESS_HOST and INGRESS_PORT accordingly.
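As an example, on clusters that expose KServe through the default Istio ingress gateway, these values can usually be read from the istio-ingressgateway service; adjust the namespace, service name, and port name to match your ingress setup:

INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')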

Generating Text Embeddings

You can generate embeddings for a single sentence or multiple sentences in a single request:

curl -H "Content-Type: application/json" \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/embeddings \
  -d '{
    "model": "sentence-transformer",
    "input": "This is an example sentence for embedding generation."
  }'
Expected Output

The response will contain the generated embedding vector:

{
  "id": "embd-0d91c708-eb5b-4d60-ac08-c9a227ec09c4",
  "object": "list",
  "created": 1750840715,
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.007899402640759945,
        0.008340050466358662,
        0.035796716809272766,
        ... // 384-dimensional vector continues
        0.09077603369951248,
        0.03257409855723381,
        -0.02999882400035858
      ],
      "index": 0
    }
  ],
  "model": "sentence-transformer",
  "usage": {
    "prompt_tokens": 13,
    "total_tokens": 13,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
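Because the endpoint follows the OpenAI embeddings API, the input field also accepts a list of strings, so several sentences can be embedded in one request; the response then contains one embedding object per input, each with its own index:

curl -H "Content-Type: application/json" \
  -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/embeddings \
  -d '{
    "model": "sentence-transformer",
    "input": [
      "KServe serves machine learning models on Kubernetes.",
      "Embeddings map text into a shared vector space."
    ]
  }'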

Use Cases for Embeddings

Embeddings generated with this model can be used for:

  1. Semantic Search: Index documents and perform similarity-based searches
  2. Text Clustering: Group similar documents together
  3. Recommendation Systems: Find content similar to user preferences
  4. Data Analysis: Visualize text relationships through dimensionality reduction techniques
  5. Downstream ML Tasks: Use embeddings as features for classification or regression models

You can store these embeddings in vector databases like Pinecone, Weaviate, or Milvus for efficient similarity searches at scale.
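As a sketch of the semantic search use case above, the snippet below embeds a few documents and a query through the deployed endpoint and ranks the documents by cosine similarity. It assumes the SERVICE_HOSTNAME, INGRESS_HOST, and INGRESS_PORT values from earlier are exported as environment variables and that the requests and numpy packages are installed:

import os
import numpy as np
import requests

# Endpoint details come from the environment variables set earlier in this guide.
URL = f"http://{os.environ['INGRESS_HOST']}:{os.environ['INGRESS_PORT']}/openai/v1/embeddings"
HEADERS = {"Host": os.environ["SERVICE_HOSTNAME"], "Content-Type": "application/json"}

def embed(texts):
    # Return one embedding vector per input string, in input order.
    payload = {"model": "sentence-transformer", "input": texts}
    resp = requests.post(URL, json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    data = sorted(resp.json()["data"], key=lambda item: item["index"])
    return np.array([item["embedding"] for item in data])

documents = [
    "KServe deploys machine learning models on Kubernetes.",
    "The recipe calls for two cups of flour and one egg.",
    "Sentence embeddings enable semantic similarity search.",
]

doc_vectors = embed(documents)
query_vector = embed(["How do I serve a model on Kubernetes?"])[0]

# Cosine similarity between the query and every document.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)

for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")

The highest-scoring document should be the one closest in meaning to the query.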

Troubleshooting

Common issues and solutions (a few debugging commands for investigating them follow this list):

  • Init:OOMKilled: This indicates that the storage initializer exceeded the memory limits. You can try increasing the memory limits in the ClusterStorageContainer.
  • OOM errors: Increase the memory allocation in the InferenceService specification
  • Pending Deployment: Ensure your cluster has enough GPU resources available
  • Model not found: Double-check your model ID and ensure it's publicly available
  • Incorrect vector dimensions: Verify that your application expects vectors of the same dimension that the model produces
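The following kubectl commands cover most of these cases; replace the pod name with the one from your deployment, and note that the label selector shown is the one KServe typically applies to predictor pods:

# List predictor pods and inspect status, events, and container exit reasons
kubectl get pods -l serving.kserve.io/inferenceservice=embedding-model
kubectl describe pod <embedding-model-predictor-pod-name>

# Logs from the model download step and from the model server itself
kubectl logs <embedding-model-predictor-pod-name> -c storage-initializer
kubectl logs <embedding-model-predictor-pod-name> -c kserve-container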

Next Steps

Once you've successfully deployed your embedding model, consider wiring it into a retrieval or semantic search pipeline and experimenting with the vector databases mentioned above.

For more information on KServe's capabilities for generative AI, see the generative inference overview.