Deploy the LLaMA model with vLLM Runtime

Serving LLMs can be surprisingly slow even on high-end GPUs. vLLM is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers. It supports continuous batching for increased throughput and GPU utilization, and paged attention to address the memory bottleneck of autoregressive decoding, where all the attention key and value tensors (the KV cache) are kept in GPU memory to generate the next tokens.
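To make the KV cache memory bottleneck concrete, the following is a rough back-of-the-envelope sketch. It assumes the LLaMA 7B shape (32 layers, hidden size 4096), fp16 values, and a 2048-token sequence; the function name is hypothetical and the numbers ignore any allocator overhead.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed LLaMA 7B shape: 32 layers, hidden size 4096, fp16 (2 bytes per value).
per_token = kv_cache_bytes_per_token(num_layers=32, hidden_size=4096)

# Assumed sequence length of 2048 tokens for a single request.
per_sequence = per_token * 2048

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_sequence / 2**30:.2f} GiB per 2048-token sequence")
```

Under these assumptions a single long sequence takes about 1 GiB of cache, which is why managing that memory in pages instead of one contiguous reservation per request matters for batching many requests on one GPU.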

You can deploy the LLaMA model with the prebuilt vLLM inference server container image using the InferenceService YAML API spec. Integration of vLLM with the Open Inference Protocol and the KServe observability stack is in progress.

The LLaMA model can be downloaded from Hugging Face and uploaded to your cloud storage.

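kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    containers:
      - args:
          - --port
          - "8080"
          - --model
          - /mnt/models
        command:
          - python3
          - -m
          - vllm.entrypoints.api_server
        env:
          - name: STORAGE_URI
            value: gs://kfserving-examples/llm/huggingface/llama
        image: kserve/vllmserver:latest
        name: kserve-container
        resources:
          limits:
            cpu: "4"
            memory: 50Gi
            # Assumed GPU limit; vLLM needs a GPU, adjust to your cluster.
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 50Gi
            # Assumed GPU request; must match the limit above.
            nvidia.com/gpu: "1"
EOF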


The vLLM runtime is still experimental. Expect API changes and further integration in the next KServe release.

If you saved the InferenceService spec to a file such as ./vllm.yaml, apply it with:

kubectl apply -f ./vllm.yaml

Benchmarking vLLM Runtime

Download the benchmark dataset, ShareGPT_V3_unfiltered_cleaned_split.json, into your working directory.

The tokenizer can be found in the downloaded LLaMA model directory.

The steps below assume that your ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}. If these are not set, follow the instructions for determining your ingress IP and port.
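Before benchmarking, a single request can confirm that the server is reachable through the ingress. The sketch below is an assumption-laden example, not part of the KServe tooling: it targets the /generate endpoint of vllm.entrypoints.api_server, the helper names are hypothetical, and the optional Host header (for example the hostname from the InferenceService status URL) is only needed if your ingress routes by hostname.

```python
import json
import urllib.request


def build_generate_request(host: str, port: str, prompt: str, service_hostname: str = "") -> urllib.request.Request:
    """Build a POST request for the vLLM demo API server's /generate endpoint."""
    # Fields other than "prompt" are passed through as sampling parameters.
    payload = {"prompt": prompt, "max_tokens": 32, "temperature": 0.0}
    headers = {"Content-Type": "application/json"}
    if service_hostname:
        # Assumption: host-based ingress routing needs the service hostname.
        headers["Host"] = service_hostname
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_request(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())


# Dry run with placeholder values: shows the URL without touching the network.
request = build_generate_request("localhost", "8080", "San Francisco is a")
print(request.full_url)
```

Once the service is up, build the request with the values of INGRESS_HOST and INGRESS_PORT and pass it to send_request to see the generated text.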

Run the benchmarking script to send inference requests to the exposed URL.

python benchmark_serving.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5

Expected Output

   Total time: 216.81 s
   Throughput: 4.61 requests/s
   Average latency: 7.96 s
   Average latency per token: 0.02 s
   Average latency per output token: 0.04 s
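The reported figures can be cross-checked against each other. The sketch below parses the sample output above and derives two quantities: the number of completed requests implied by total time and throughput, and a rough estimate of average in-flight requests using Little's law (throughput times average latency), which assumes the run was close to steady state. The function name is hypothetical.

```python
import re

SAMPLE_OUTPUT = """\
Total time: 216.81 s
Throughput: 4.61 requests/s
Average latency: 7.96 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
"""


def parse_benchmark_output(text: str) -> dict:
    """Map each 'Label: value unit' line to {label: value}."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*([0-9.]+)", line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


metrics = parse_benchmark_output(SAMPLE_OUTPUT)

# Requests completed over the run, as implied by the rounded figures.
implied_requests = metrics["Total time"] * metrics["Throughput"]

# Little's law estimate of concurrent requests (steady-state assumption).
in_flight = metrics["Throughput"] * metrics["Average latency"]

print(f"Implied completed requests: {implied_requests:.1f}")
print(f"Estimated average in-flight requests: {in_flight:.1f}")
```

The in-flight estimate gives a sense of how many requests the server was batching together at the given request rate.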