Serve Large Language Model with Huggingface Accelerate

This documentation explains how KServe supports large language model serving via TorchServe. Here, "large language model" refers to a model that cannot fit into a single GPU and must be sharded into multiple partitions across multiple GPUs.

Huggingface Accelerate can load sharded checkpoints, and the maximum RAM usage is the size of the largest shard. By setting device_map to "auto", Accelerate automatically determines where to put each layer of the model depending on the available resources.
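These Accelerate settings map directly onto keyword arguments of transformers' from_pretrained, which is what the TorchServe handler passes them to. A minimal sketch of that mapping (build_accelerate_kwargs is a hypothetical helper, and the commented-out model load assumes transformers and accelerate are installed):

```python
def build_accelerate_kwargs(setup_config):
    # Pick out the keys from setup_config.json that from_pretrained
    # understands; anything missing is simply left at its default.
    keys = (
        "revision",
        "max_memory",
        "low_cpu_mem_usage",
        "device_map",
        "offload_folder",
        "offload_state_dict",
    )
    return {k: setup_config[k] for k in keys if k in setup_config}

# The same settings used later in setup_config.json.
setup_config = {
    "revision": "main",
    "max_memory": {0: "10GB", "cpu": "10GB"},  # GPU index keys are ints in Python
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": True,
}

kwargs = build_accelerate_kwargs(setup_config)

# With transformers and accelerate installed, the sharded checkpoint could
# then be loaded like this (not executed here, since it downloads the
# multi-gigabyte bloom-7b1 weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1", **kwargs)

print(kwargs["device_map"])  # -> auto
```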

Package the model

  1. Download the model bigscience/bloom-7b1 from Huggingface Hub by running the Download_model.py script:

    python Download_model.py --model_name bigscience/bloom-7b1

  2. Compress the downloaded model:

    zip -r model.zip model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/

  3. Package the model. Create the setup_config.json file with accelerate settings:

       - Enable low_cpu_mem_usage to use accelerate.
       - Recommended max_memory in setup_config.json is the max size of a shard.

    {
        "revision": "main",
        "max_memory": {
            "0": "10GB",
            "cpu": "10GB"
        },
        "low_cpu_mem_usage": true,
        "device_map": "auto",
        "offload_folder": "offload",
        "offload_state_dict": true
    }

    Then create the model archive with torch-model-archiver:

    torch-model-archiver --model-name bloom7b1 --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json

  4. Upload to your cloud storage, or you can use the uploaded bloom model from the KServe GCS bucket.
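Before archiving, it can be worth confirming that setup_config.json is valid JSON, since a malformed file only fails later inside the handler. A minimal sketch using just the standard library (the settings are the ones recommended above):

```python
import json

# The Accelerate settings from the packaging step above.
setup_config = {
    "revision": "main",
    "max_memory": {"0": "10GB", "cpu": "10GB"},
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": True,
}

# Write the file that torch-model-archiver will pick up via --extra-files.
with open("setup_config.json", "w") as f:
    json.dump(setup_config, f, indent=4)

# Re-read it to confirm the file on disk parses cleanly.
with open("setup_config.json") as f:
    loaded = json.load(f)

assert loaded == setup_config
```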

Serve the large language model with InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
        requests:
          cpu: "2"
          memory: 32Gi
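Assuming the spec above is saved as bloom7b1.yaml (a filename chosen here for illustration), it can be applied and its readiness checked with kubectl:

```shell
# Create the InferenceService in the current namespace.
kubectl apply -f bloom7b1.yaml

# Watch until the service reports READY=True; the first startup can take
# several minutes while the sharded checkpoint is downloaded and loaded.
kubectl get inferenceservice bloom7b1
```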

Run the Inference

Now, assuming that your ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}, you can run inference directly; otherwise, follow the KServe instructions to find out your ingress IP and port.

SERVICE_HOSTNAME=$(kubectl get inferenceservice bloom7b1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
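The curl command below sends the contents of text.json as the request body. A payload following the KServe v1 inference protocol might look like this (the prompt here is inferred from the sample response below):

    {
      "instances": [
        {
          "data": "My dog is cute"
        }
      ]
    }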

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @./text.json \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom7b1:predict

{"predictions":["My dog is cute.\nNice.\n- Hey, Mom.\n- Yeah?\nWhat color's your dog?\n- It's gray.\n- Gray?\nYeah.\nIt looks gray to me.\n- Where'd you get it?\n- Well, Dad says it's kind of...\n- Gray?\n- Gray.\nYou got a gray dog?\n- It's gray.\n- Gray.\nIs your dog gray?\nAre you sure?\nNo.\nYou sure"]}