# Deploy InferenceService with a Model from the Hugging Face (HF) Hub

You can deploy models from the Hugging Face Hub by setting the `storageUri` field of the InferenceService YAML in the following format:

`hf://${REPO}/${MODEL}:${HASH}` (hash optional)

For example, `hf://facebook/opt-125m`.
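To make the `hf://${REPO}/${MODEL}:${HASH}` layout concrete, here is a small shell sketch (an illustration only, not part of KServe) that splits such a URI into its parts with parameter expansion:

```shell
# Split an hf:// storage URI into repo, model, and optional revision hash
uri="hf://facebook/opt-125m"
path="${uri#hf://}"       # facebook/opt-125m
repo="${path%%/*}"        # facebook
model="${path#*/}"        # opt-125m (may still carry a :hash suffix)
revision="${model#*:}"    # the hash if present, otherwise unchanged
model="${model%%:*}"      # opt-125m
echo "$repo/$model"
```

When no `:${HASH}` is given, KServe resolves the default revision of the repository.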
## Public Hugging Face Models

If no credential is provided, an anonymous client is used to download the model from the HF repository.
## Private Hugging Face Models

KServe supports authenticating with `HF_TOKEN` when downloading the model. Create a Kubernetes Secret to store the HF token; note that values under `data` must be base64 encoded:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
type: Opaque
data:
  HF_TOKEN: aGZfVk5Vd1JVAUdCa0l4WmZMTHVrc2VHeW9VVm9udU5pBHUVT==
```
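Since the Secret's `data` field takes base64-encoded values, you can encode your token on the command line (`hf_xxxxxxxx` below is a placeholder, not a real token):

```shell
# Base64-encode a placeholder HF token for the Secret's data field;
# substitute your real token before creating the Secret
echo -n "hf_xxxxxxxx" | base64
```

Alternatively, `kubectl create secret generic storage-config --from-literal=HF_TOKEN=<token>` performs the encoding for you.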
## Deploy InferenceService with Models from the HF Hub
### Option 1: Use a Service Account with a Secret Reference

Create a Kubernetes ServiceAccount that references the HF token Secret by name, and specify the `serviceAccountName` in the InferenceService spec.
```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hfserviceacc
secrets:
- name: storage-config
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    serviceAccountName: hfserviceacc # Option 1 for authenticating with HF_TOKEN
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_dir=/mnt/models
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
```
### Option 2: Use an Environment Variable with a Secret Reference

Create the Kubernetes Secret holding the HF token as shown above, then reference it through an environment variable in the InferenceService spec.
```bash
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_dir=/mnt/models
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
      env:
        - name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: storage-config
              key: HF_TOKEN
              optional: false
EOF
```
## Check the InferenceService Status

```bash
kubectl get inferenceservices huggingface-llama3
```

Expected output:

```
NAME                 URL                                             READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                          AGE
huggingface-llama3   http://huggingface-llama3.default.example.com   True           100                              huggingface-llama3-predictor-default-47q2g   7d23h
```
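When calling the service through an ingress gateway, the hostname portion of the `URL` column is commonly passed as the HTTP `Host` header. A small sketch extracting it from the example URL above (in a live cluster you would read the URL with `kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}'`):

```shell
# Derive the Host header value from the InferenceService URL
# (the URL here is taken from the expected output above)
SERVICE_URL="http://huggingface-llama3.default.example.com"
SERVICE_HOSTNAME=$(echo "$SERVICE_URL" | cut -d "/" -f 3)
echo "$SERVICE_HOSTNAME"
```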