Hugging Face LLM Serving Runtime

The Hugging Face serving runtime can serve Hugging Face models out of the box. The preprocess and post-process handlers are implemented for the supported predictive inference tasks, for example text classification, token classification, and fill-mask.

Hugging Face Predictor

KServe supports deploying Hugging Face models using a specialized predictor that integrates with the Hugging Face transformers library. This allows you to deploy pre-trained models directly from the Hugging Face Hub or your own custom models.
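For example, an InferenceService that pulls a model from the Hugging Face Hub can be created with the KServe Python SDK. This is a minimal sketch, assuming the kserve and kubernetes Python packages are installed and the cluster has the Hugging Face runtime available; the service name, namespace, and model id are placeholders.

```python
from kubernetes import client as k8s
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
)

# Placeholder names; adjust to your cluster and model.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=k8s.V1ObjectMeta(name="huggingface-bert", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                # Selects the Hugging Face serving runtime.
                model_format=V1beta1ModelFormat(name="huggingface"),
                # Runtime arguments (see "Hugging Face Runtime Arguments" below).
                args=[
                    "--model_name=bert",
                    "--model_id=bert-base-uncased",
                    "--task=fill_mask",
                ],
            )
        )
    ),
)

KServeClient().create(isvc)
```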

Supported NLP Tasks

The following NLP tasks are supported out of the box:

  • Text Classification: Classify text into predefined categories
  • Token Classification: Classify individual tokens in text (e.g., Named Entity Recognition)
  • Fill-Mask: Predict masked tokens in text

API Endpoints

Predictive inference tasks like token classification, sequence classification, and fill mask are served using KServe's Open Inference Protocol or V1 API.
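For instance, a deployed fill-mask model can be queried over HTTP with either protocol. This is a minimal sketch, assuming a model served under the name bert; the base URL and input text are placeholders.

```python
import requests

# Placeholders; point these at your InferenceService endpoint.
base_url = "http://huggingface-bert.default.example.com"
model_name = "bert"
text = "The capital of France is [MASK]."

# V1 API: POST /v1/models/<name>:predict with an "instances" list.
v1 = requests.post(
    f"{base_url}/v1/models/{model_name}:predict",
    json={"instances": [text]},
)
print(v1.json())

# Open Inference Protocol (V2): POST /v2/models/<name>/infer with typed tensors.
v2 = requests.post(
    f"{base_url}/v2/models/{model_name}/infer",
    json={
        "inputs": [
            {"name": "input-0", "shape": [1], "datatype": "BYTES", "data": [text]}
        ]
    },
)
print(v2.json())
```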

tip

The Hugging Face runtime supports OpenAI Endpoints as well for generative tasks. For more information, see the Hugging Face Generative Inference documentation.

Environment Variables

The Hugging Face runtime image has the following environment variables set by default:

  • SAFETENSORS_FAST_GPU: Enabled by default; improves model loading performance.
  • HF_HUB_DISABLE_TELEMETRY: Enabled by default; disables telemetry data collection.

Examples

Examples showing how to deploy and perform inference with the Hugging Face runtime are available for each of the supported predictive inference tasks: token classification, text classification, and fill-mask.

Configuration Options

Hugging Face Runtime Arguments

Below is an explanation of the command line arguments supported by the Hugging Face runtime. vLLM backend engine arguments can also be specified on the command line and will be parsed by the Hugging Face runtime. A sketch showing several of these arguments in use follows the list.

  • --model_name: The name of the model used on the endpoint path.
  • --model_dir: The local path where the model is downloaded to. If model_id is provided, this argument will be ignored.
  • --model_id: Hugging Face model id.
  • --model_revision: Hugging Face model revision.
  • --tokenizer_revision: Hugging Face tokenizer revision.
  • --dtype: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'. Defaults to 'float16' for GPU and 'float32' for CPU systems. 'auto' uses 'float16' if a GPU is available and 'float32' otherwise, to ensure consistency between the vLLM and HuggingFace backends. Encoder models default to 'float32'. 'float' is shorthand for 'float32' and 'half' for 'float16'; the rest are as named.
  • --task: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'. If not provided, model server will try to infer the task from model architecture.
  • --backend: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'.
  • --max_model_len: Max number of tokens the model can process/tokenize. If not provided, the model's maximum position embeddings value is used.
  • --disable_lower_case: Disable lower case for the tokenizer.
  • --disable_special_tokens: Do not encode sequences with the model's special tokens.
  • --trust_remote_code: Allow loading of models and tokenizers with custom code.
  • --tensor_input_names: The tensor input names passed to the model when using the Triton Inference Server backend.
  • --return_token_type_ids: Return token type ids.
  • --return_probabilities: Return probabilities of predicted indexes. This is only applicable for tasks 'sequence_classification', 'token_classification' and 'fill_mask'.
  • --return_raw_logits: Return raw logits without post-processing. Supported only for classification tasks such as token classification, text classification, and fill-mask.
  • --disable_log_requests: Disable logging of requests.
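The same flags can be exercised outside the cluster when trying the runtime locally. The sketch below wraps the runtime's module entry point in a Python script and is an assumption-laden example: it presumes the huggingfaceserver package from the KServe repository is installed in the current environment, and the model id and flag values are placeholders.

```python
import subprocess

# Start the Hugging Face runtime locally with a few of the arguments above.
# Assumes the `huggingfaceserver` package (shipped with KServe) is installed
# and that the placeholder model id is reachable on the Hugging Face Hub.
subprocess.run(
    [
        "python", "-m", "huggingfaceserver",
        "--model_name=bert",
        "--model_id=bert-base-uncased",
        "--task=fill_mask",
        "--dtype=float32",
        "--backend=huggingface",
    ],
    check=True,
)
```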

Next Steps

Explore the task-specific examples and deployment options to learn more.