Integrating KServe LLM Deployments with LLM SDKs
KServe-deployed LLMs integrate with popular LLM application frameworks through standardized, OpenAI-compatible interfaces. This guide demonstrates how to connect your deployed models with widely used SDKs to build AI applications.
Deploy a KServe LLM Inference Service
First, you need a deployed LLM inference service. Follow our Text Generation with Llama3 guide to deploy a model. After completing the deployment, you'll have a model endpoint ready for integration.
Getting Your Model Endpoint
Once your model is deployed, you need to obtain the service hostname for API calls:
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
For the Llama3 example, the model name is llama3. You'll need both the service hostname and model name for SDK integration.
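Before wiring up an SDK, you can sanity-check the endpoint with a direct request. A minimal smoke test, assuming plain HTTP access to the service and no extra authentication layer in front of it:
curl -s "http://${SERVICE_HOSTNAME}/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}'
A JSON response containing a choices array confirms the OpenAI-compatible route is live.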
Integration with OpenAI SDK
The OpenAI SDK is widely used for working with LLMs. KServe's OpenAI-compatible endpoints make it easy to connect your deployed models with applications built using this SDK.
Installation
Install the OpenAI Python client:
pip3 install openai
Usage Example
Create a Python script (sample_openai.py) to interact with your KServe LLM:
from openai import OpenAI

# Use the service hostname captured earlier; the base URL needs a scheme.
deployment_url = "http://<SERVICE_HOSTNAME>"

client = OpenAI(
    base_url=f"{deployment_url}/openai/v1",
    api_key="empty",
)

# Typical (non-streaming) chat completion response
print("Typical chat completion response:")
response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    max_tokens=256,
)
reply = response.choices[0].message
print(f"Extracted reply: \n{reply.content}\n")

# Streaming chat completion response
print("Streaming chat completion response:")
stream = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "user", "content": "Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ..."}
    ],
    temperature=0,
    max_tokens=300,
    stream=True,  # this time, we set stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Running the Example
Execute the script to see both regular and streaming responses:
python3 sample_openai.py
Typical chat completion response:
Extracted reply:
Two.
Streaming chat completion response:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
Key Points
- Replace <SERVICE_HOSTNAME> with your actual service hostname
- The endpoint path /openai/v1 routes requests through KServe's OpenAI-compatible interface
- The api_key="empty" parameter is required by the client, but authentication can be configured separately (see the sketch after this list)
- The model parameter should match the model name from your InferenceService
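If you do place an authentication layer in front of the service (for example, a gateway that checks bearer tokens), the client can attach credentials on every request. A minimal sketch; the header name and token value are assumptions that depend on your gateway, not part of KServe's API:
from openai import OpenAI

client = OpenAI(
    base_url="http://<SERVICE_HOSTNAME>/openai/v1",
    api_key="empty",
    # Hypothetical bearer token expected by your gateway, not by KServe itself.
    default_headers={"Authorization": "Bearer <YOUR_TOKEN>"},
)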
Integration with LangChain Framework
LangChain is a popular framework for developing applications powered by language models. It provides components for working with LLMs and building more complex AI applications.
Installation
Install the LangChain OpenAI integration package:
pip3 install langchain-openai
Usage Example
Create a Python script (sample_langchain.py) to interact with your KServe LLM through LangChain:
from langchain_openai import ChatOpenAI

# Use the service hostname captured earlier; the base URL needs a scheme.
deployment_url = "http://<SERVICE_HOSTNAME>"

llm = ChatOpenAI(
    model="llama3",
    base_url=f"{deployment_url}/openai/v1",
    api_key="empty",
    temperature=0,
    max_tokens=256,
)

# Typical (non-streaming) chat completion response
print("Typical chat completion response:")
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
reply = llm.invoke(messages)
print(f"Extracted reply: \n{reply.content}\n")

# Streaming chat completion response
print("Streaming chat completion response:")
for chunk in llm.stream("Write me a 1 verse song about goldfish on the moon"):
    print(chunk.content, end="", flush=True)
Running the Example
Execute the script to see both regular and streaming responses:
python3 sample_langchain.py
Typical chat completion response:
Extracted reply:
Je adore le programmation.
Streaming chat completion response:
Here is a 1-verse song about goldfish on the moon:
"In the lunar lake, where the craters shine
A school of goldfish swim, in a celestial shrine
Their scales glimmer bright, like stars in the night
As they dart and play, in the moon's gentle light"
Key Points
- LangChain provides higher-level abstractions for working with LLMs
- You can create chains, agents, and more complex workflows using your KServe-deployed models (see the sketch below)
- The integration follows the same pattern as the OpenAI SDK, utilizing the OpenAI-compatible endpoints
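As an example of those higher-level abstractions, the same ChatOpenAI client can be composed into a LangChain Expression Language (LCEL) chain. A minimal sketch; the prompt wording is illustrative:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama3",
    base_url="http://<SERVICE_HOSTNAME>/openai/v1",
    api_key="empty",
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates English to {language}."),
    ("human", "{text}"),
])

# Compose prompt -> model -> string parser into a single runnable chain.
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"language": "French", "text": "I love programming."}))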
Additional SDK Options
KServe's OpenAI-compatible endpoints allow integration with many other frameworks and SDKs:
LlamaIndex
LlamaIndex is a data framework for LLM applications that helps with data connection and retrieval augmented generation (RAG).
pip3 install llama-index-llms-openai
from llama_index.llms.openai import OpenAI

# Define the hostname before using it in the base URL.
service_hostname = "<SERVICE_HOSTNAME>"

llm = OpenAI(
    model="llama3",
    api_base=f"http://{service_hostname}/openai/v1",
    api_key="empty",
)

response = llm.complete("What is the capital of France?")
print(response)
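Streaming works through the same client. A minimal sketch reusing the llm object above; note that if your installed client validates model names against OpenAI's catalog, the OpenAILike class from the llama-index-llms-openai-like package accepts arbitrary names:
# Each chunk carries the newly generated text in its .delta field.
for chunk in llm.stream_complete("Write one sentence about goldfish."):
    print(chunk.delta, end="", flush=True)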
Direct API Calls
For languages without specific SDKs, you can use standard HTTP clients:
cURL:
curl -X POST "http://${SERVICE_HOSTNAME}/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0.7
  }'
JavaScript (Fetch):
const serviceHostname = '<SERVICE_HOSTNAME>'; // your service hostname from earlier

const response = await fetch(`http://${serviceHostname}/openai/v1/chat/completions`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello, how are you?' }],
    temperature: 0.7,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
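The same route supports streaming over plain HTTP: setting "stream": true in the request body returns server-sent events, which is what the SDK streaming examples above consume under the hood. For example, with curl's -N flag to disable output buffering:
curl -N -X POST "http://${SERVICE_HOSTNAME}/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Count to 10."}],
    "stream": true
  }'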
Best Practices
When integrating with KServe-deployed LLMs:
- Error Handling: Implement robust error handling for network issues, timeouts, and API errors (a retry sketch follows this list).
- Caching: Consider caching responses for frequently asked questions to reduce latency and costs.
- Monitoring: Track usage metrics, latency, and error rates to optimize your application.
- Fallback Mechanisms: Implement fallback options if primary model responses are slow or unavailable.
- Token Management: Be mindful of token limits when designing prompts and handling responses.
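As a starting point for the error-handling item, here is a minimal retry wrapper around the OpenAI client. The retry count, backoff, and timeout values are illustrative assumptions, not tuned recommendations:
import time

from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(base_url="http://<SERVICE_HOSTNAME>/openai/v1", api_key="empty")

def chat_with_retries(messages, retries=3, backoff=2.0):
    """Call the model, retrying on connection errors and 5xx responses."""
    for attempt in range(retries):
        try:
            # with_options sets a per-request timeout in seconds.
            return client.with_options(timeout=30.0).chat.completions.create(
                model="llama3", messages=messages, max_tokens=256
            )
        except APIConnectionError:
            pass  # network blip: fall through and retry
        except APIStatusError as e:
            if e.status_code < 500:
                raise  # 4xx errors are not retryable
        time.sleep(backoff * (attempt + 1))  # linear backoff between attempts
    raise RuntimeError("Model endpoint unavailable after retries")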
Next Steps
After integrating your LLM with an SDK, consider exploring:
- Advanced serving options like multi-node inference for large models
- Other inference tasks such as text-to-text generation and embeddings
- Optimizing performance with features like model caching and KV cache offloading
- Auto-scaling your inference services based on traffic patterns using KServe's auto-scaling capabilities
By connecting your KServe-deployed models with these popular SDKs, you can quickly build sophisticated AI applications while maintaining control over your model infrastructure.