
Serve Large Language Model with Huggingface Accelerate

This documentation explains how KServe supports large language model serving via TorchServe. Here, large language models are models that cannot fit into a single GPU and therefore need to be sharded across multiple partitions over multiple GPUs.

Huggingface Accelerate can load sharded checkpoints, and the maximum RAM usage is the size of the largest shard. By setting device_map to "auto", Accelerate automatically determines where to place each layer of the model depending on the available resources.
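
For illustration, here is a minimal sketch (not part of the packaged handler) of how Accelerate places a sharded checkpoint when device_map is set to "auto". It assumes the transformers and accelerate packages are installed and reuses the memory caps from the setup_config.json shown below:

import torch
from transformers import AutoModelForCausalLM

# Accelerate shards the checkpoint across the available devices;
# device_map="auto" lets it decide where each layer goes.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    device_map="auto",
    max_memory={0: "10GB", "cpu": "10GB"},  # optional per-device caps
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
print(model.hf_device_map)  # layer-to-device placement chosen by Accelerate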

Package the model

  1. Download the model bigscience/bloom-7b1 from Huggingface Hub by running

    python Download_model.py --model_name bigscience/bloom-7b1
    

  2. Compress the model

    zip -r model.zip model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
    

  3. Package the model. Create the setup_config.json file with the accelerate settings below, then archive it together with the model using torch-model-archiver (a sketch of how a handler can apply these settings follows this list):

     * Enable low_cpu_mem_usage to use accelerate.
     * The recommended max_memory in setup_config.json is the maximum size of a shard.
    {
        "revision": "main",
        "max_memory": {
            "0": "10GB",
            "cpu": "10GB"
        },
        "low_cpu_mem_usage": true,
        "device_map": "auto",
        "offload_folder": "offload",
        "offload_state_dict": true,
        "torch_dtype":"float16",
        "max_length":"80"
    }
    
    torch-model-archiver --model-name bloom7b1 --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json
  4. Upload the model archive to your cloud storage, or use the uploaded bloom model from the KServe GCS bucket.
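
The custom handler referenced by torch-model-archiver is expected to forward the settings from setup_config.json to Accelerate when it loads the model. The following is only a rough sketch of that flow, not the actual custom_handler.py; the snapshot path and option names are taken from the steps above:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only; the real custom_handler.py bundled with the archive may differ.
with open("setup_config.json") as f:
    cfg = json.load(f)

model_path = "model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=cfg["revision"],
    device_map=cfg["device_map"],
    # numeric keys in max_memory are GPU indices, so convert them to int
    max_memory={int(k) if k.isdigit() else k: v for k, v in cfg["max_memory"].items()},
    low_cpu_mem_usage=cfg["low_cpu_mem_usage"],
    offload_folder=cfg["offload_folder"],
    offload_state_dict=cfg["offload_state_dict"],
    torch_dtype=getattr(torch, cfg["torch_dtype"]),  # "float16" -> torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate with the configured max_length.
inputs = tokenizer("My dog is cute.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=int(cfg["max_length"]))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))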

Serve the large language model with InferenceService
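
Create the InferenceService with the following yaml, which requests two GPUs so the sharded model can be spread across them: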

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
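
Apply the yaml to create the InferenceService and wait for it to become ready before sending requests.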

Run the Inference

Now, assuming that your ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT} (you can follow this instruction to find out your ingress IP and port), determine the service hostname and send an inference request:

SERVICE_HOSTNAME=$(kubectl get inferenceservice bloom7b1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["My dog is cute."]}' \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom7b1:predict

Expected Output

{"predictions":["My dog is cute.\nNice.\n- Hey, Mom.\n- Yeah?\nWhat color's your dog?\n- It's gray.\n- Gray?\nYeah.\nIt looks gray to me.\n- Where'd you get it?\n- Well, Dad says it's kind of...\n- Gray?\n- Gray.\nYou got a gray dog?\n- It's gray.\n- Gray.\nIs your dog gray?\nAre you sure?\nNo.\nYou sure"]}