# Serve Large Language Model with Huggingface Accelerate
This documentation explains how KServe supports large language model serving via TorchServe. Here, a large language model refers to a model that cannot fit into a single GPU and needs to be sharded into multiple partitions across multiple GPUs.
Huggingface Accelerate can load sharded checkpoints, so the maximum RAM usage is bounded by the size of the largest shard. By setting `device_map` to `"auto"`, Accelerate automatically determines where to place each layer of the model depending on the available resources.
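For reference, the same sharded loading can be reproduced directly with `transformers`. The sketch below mirrors the settings used in the `setup_config.json` later in this guide; the `max_memory` values are per-device caps and an assumption you should adjust to your own hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets Accelerate decide where each layer goes,
# max_memory caps per-device usage, and layers that do not fit
# anywhere are offloaded to the "offload" folder on disk.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    revision="main",
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    max_memory={0: "10GB", "cpu": "10GB"},  # assumption: one 10GB GPU plus CPU RAM
    offload_folder="offload",
    offload_state_dict=True,
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
```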
## Package the model
- Download the model `bigscience/bloom-7b1` from Huggingface Hub by running

    ```bash
    python Download_model.py --model_name bigscience/bloom-7b1
    ```

- Compress the model

    ```bash
    zip -r model.zip model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
    ```

- Package the model. Create the `setup_config.json` file with the accelerate settings (a minimal handler sketch that consumes these settings follows this list):

    - Enable `low_cpu_mem_usage` to use accelerate.
    - The recommended `max_memory` in `setup_config.json` is the maximum size of a shard.

    ```json
    {
        "revision": "main",
        "max_memory": {
            "0": "10GB",
            "cpu": "10GB"
        },
        "low_cpu_mem_usage": true,
        "device_map": "auto",
        "offload_folder": "offload",
        "offload_state_dict": true,
        "torch_dtype": "float16",
        "max_length": "80"
    }
    ```

    Then archive the model together with the handler and the config:

    ```bash
    torch-model-archiver --model-name bloom7b1 --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json
    ```

- Upload to your cloud storage, or use the uploaded bloom model from the KServe GCS bucket.
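The `custom_handler.py` passed to `torch-model-archiver` is not shown in this guide. The following is a minimal sketch of what such a TorchServe handler could look like, reading `setup_config.json` and forwarding the accelerate settings to `from_pretrained`; the class name `LLMHandler`, the extracted model path, and the request parsing are assumptions, not the actual handler from the example.

```python
import json
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler


class LLMHandler(BaseHandler):
    """Sketch of a TorchServe handler that loads a sharded checkpoint with Accelerate."""

    def initialize(self, ctx):
        model_dir = ctx.system_properties.get("model_dir")

        # setup_config.json was packaged alongside the model via --extra-files
        with open(os.path.join(model_dir, "setup_config.json")) as f:
            self.setup_config = json.load(f)

        # GPU indices arrive as strings ("0") in JSON; accelerate expects ints
        max_memory = {
            int(k) if k.isnumeric() else k: v
            for k, v in self.setup_config["max_memory"].items()
        }

        # Assumption: model.zip was extracted under model_dir, so the weights
        # live in the snapshot directory zipped in the packaging step
        model_path = os.path.join(
            model_dir,
            "model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2",
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            revision=self.setup_config["revision"],
            device_map=self.setup_config["device_map"],
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            max_memory=max_memory,
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=torch.float16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.initialized = True

    def handle(self, data, ctx):
        # TorchServe delivers the request body under "data" or "body"
        text = data[0].get("data") or data[0].get("body")
        if isinstance(text, (bytes, bytearray)):
            text = text.decode("utf-8")
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(
            **inputs, max_length=int(self.setup_config["max_length"])
        )
        return [self.tokenizer.decode(outputs[0], skip_special_tokens=True)]
```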
 
## Serve the large language model with InferenceService
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
```
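Save the manifest above (here assumed to be named `bloom7b1.yaml`), apply it, and wait for the InferenceService to become ready:

```bash
kubectl apply -f bloom7b1.yaml

# wait until READY shows True and a URL is assigned
kubectl get inferenceservice bloom7b1
```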
## Run the Inference
Now, assuming that your ingress can be accessed at `${INGRESS_HOST}:${INGRESS_PORT}`, send a prediction request; you can follow this instruction to find out your ingress IP and port.
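The request body `./text.json` used below is not shown in this guide; a minimal example, assuming the KServe V1 inference protocol and the prompt implied by the sample output, could look like this:

```json
{
  "instances": [
    {
      "data": "My dog is cute."
    }
  ]
}
```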
            
```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice bloom7b1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @./text.json \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom7b1:predict
```

Expected output:

```json
{"predictions":["My dog is cute.\nNice.\n- Hey, Mom.\n- Yeah?\nWhat color's your dog?\n- It's gray.\n- Gray?\nYeah.\nIt looks gray to me.\n- Where'd you get it?\n- Well, Dad says it's kind of...\n- Gray?\n- Gray.\nYou got a gray dog?\n- It's gray.\n- Gray.\nIs your dog gray?\nAre you sure?\nNo.\nYou sure"]}
```