Serve Large Language Model with Huggingface Accelerate¶
This documentation explains how KServe supports large language model serving via TorchServe.
Large language models here refer to models that cannot fit into a single GPU and need to be sharded onto multiple partitions across multiple GPUs.
Huggingface Accelerate can load sharded checkpoints, and the maximum RAM usage will be the size of the largest shard. By setting `device_map` to `"auto"`, Accelerate automatically determines where to put each layer of the model depending on the available resources.
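
To make the loading behavior concrete, here is a minimal sketch of loading the model through the transformers API with these Accelerate options. The memory limits and dtype mirror the setup_config.json used below; the exact loading code inside the TorchServe handler may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets Accelerate place each layer on the available GPUs
# (and offload to CPU/disk if needed); low_cpu_mem_usage keeps peak RAM close
# to the size of the largest checkpoint shard.
model_name = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    max_memory={0: "10GB", "cpu": "10GB"},
    offload_folder="offload",
    torch_dtype=torch.float16,
)

inputs = tokenizer("My dog is cute.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```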
Package the model¶
- Download the model bigscience/bloom-7b1 from Huggingface Hub by running:

    ```bash
    python Download_model.py --model_name bigscience/bloom-7b1
    ```
- Compress the downloaded model:

    ```bash
    zip -r model.zip model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
    ```
- Package the model. Create the setup_config.json file with the accelerate settings:

    - Enable low_cpu_mem_usage to use accelerate
    - The recommended max_memory in setup_config.json is the max size of a shard

    ```json
    {
        "revision": "main",
        "max_memory": {
            "0": "10GB",
            "cpu": "10GB"
        },
        "low_cpu_mem_usage": true,
        "device_map": "auto",
        "offload_folder": "offload",
        "offload_state_dict": true,
        "torch_dtype": "float16",
        "max_length": "80"
    }
    ```

    Then archive the model and the config with torch-model-archiver (a sketch of what custom_handler.py might look like is shown after this list):

    ```bash
    torch-model-archiver --model-name bloom7b1 --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json
    ```
- Upload the archive to your cloud storage, or use the bloom model already uploaded to the KServe GCS bucket.
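
For reference, a TorchServe custom handler for this setup typically reads setup_config.json in initialize() and loads the model through transformers with the Accelerate options above. The following is only a minimal sketch under those assumptions: the class name, the unzip step, and the snapshot path are illustrative, and the actual custom_handler.py shipped with the example may differ.

```python
import json
import os
import zipfile

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler


class BloomAccelerateHandler(BaseHandler):
    """Hypothetical handler sketch; the real custom_handler.py may differ."""

    def initialize(self, ctx):
        model_dir = ctx.system_properties.get("model_dir")
        with open(os.path.join(model_dir, "setup_config.json")) as f:
            self.setup_config = json.load(f)

        # model.zip was packaged via --extra-files; unpack it and point at the
        # snapshot directory that was zipped in the compression step above.
        with zipfile.ZipFile(os.path.join(model_dir, "model.zip")) as zf:
            zf.extractall(model_dir)
        model_path = os.path.join(
            model_dir,
            "model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2",
        )

        # GPU keys in setup_config.json are strings ("0"); accelerate expects ints.
        max_memory = {
            int(k) if k.isdigit() else k: v
            for k, v in self.setup_config["max_memory"].items()
        }

        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            revision=self.setup_config["revision"],
            device_map=self.setup_config["device_map"],
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            max_memory=max_memory,
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=getattr(torch, self.setup_config["torch_dtype"]),
        )
        self.initialized = True

    def preprocess(self, requests):
        texts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            texts.append(data)
        return self.tokenizer(texts, return_tensors="pt", padding=True).to(self.model.device)

    def inference(self, inputs):
        return self.model.generate(**inputs, max_length=int(self.setup_config["max_length"]))

    def postprocess(self, outputs):
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```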
Serve the large language model with InferenceService¶
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
```
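
Assuming the manifest above is saved as bloom7b1.yaml (the filename is illustrative), apply it to create the InferenceService:

```bash
kubectl apply -f bloom7b1.yaml
```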
Run the Inference¶
Now, assuming that your ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}, you can send an inference request to the model. You can follow this instruction to find out your ingress IP and port.
```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice bloom7b1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["My dog is cute."]}' \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom7b1:predict
```
Expected Output:

```json
{"predictions":["My dog is cute.\nNice.\n- Hey, Mom.\n- Yeah?\nWhat color's your dog?\n- It's gray.\n- Gray?\nYeah.\nIt looks gray to me.\n- Where'd you get it?\n- Well, Dad says it's kind of...\n- Gray?\n- Gray.\nYou got a gray dog?\n- It's gray.\n- Gray.\nIs your dog gray?\nAre you sure?\nNo.\nYou sure"]}
```