Triton TensorFlow Models

Triton Inference Server provides a flexible, robust inference solution supporting multiple frameworks, including TensorFlow. This guide demonstrates deploying a TensorFlow-based BERT (Bidirectional Encoder Representations from Transformers) model on KServe using Triton Inference Server.

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

What You'll Learn

This guide demonstrates:

  • Inference on Question Answering (QA) tasks with BERT Base/Large models
  • Using fine-tuned NVIDIA BERT models
  • Deploying a Transformer for preprocessing with the BERT tokenizer
  • Deploying a BERT model on Triton Inference Server
  • Making inference requests using KServe's V2 protocol

Prerequisites

Before you begin, make sure you have:

  • A Kubernetes cluster with KServe installed.
  • Network access to your cluster's Istio Ingress gateway.
  • Basic knowledge of TensorFlow, BERT models, and Triton Inference Server.
  • The kubectl CLI configured to access your cluster.

Setup (Serverless Mode Only)

  1. Skip tag resolution for nvcr.io, which requires authentication to resolve the Triton Inference Server image digest:
kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving
  2. Increase the progress deadline, since pulling the Triton image and large BERT models may take longer than the default timeout of 120s (this setting requires Knative 0.15.0+):
kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving
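
To confirm both settings were applied (an optional sanity check), you can inspect the ConfigMap directly:

kubectl get cm config-deployment -n knative-serving -o jsonpath='{.data.registriesSkippingTagResolving}{"\n"}{.data.progressDeadline}{"\n"}'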

Creating a Custom Transformer for BERT Tokenizer

Implementing Pre/Post-Processing Logic

To effectively use the BERT model with Triton, we need to create a custom transformer that handles:

  • Preprocessing: Converting paragraphs and questions to BERT input format using the BERT tokenizer
  • Prediction: Calling Triton Inference Server using the Python REST API
  • Postprocessing: Converting raw predictions to answers with probabilities

Here's an example transformer implementation:

from typing import Dict, Union

import numpy as np
import kserve
from kserve import InferResponse
import tritonclient.http as httpclient

# Helper modules shipped with the example (bert_tokenizer_v2 directory)
import data_processing
import tokenization


class BertTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.short_paragraph_text = "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."

        self.tokenizer = tokenization.FullTokenizer(vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
        self.model_name = "bert_tf_v2_large_fp16_128_v2"
        self.triton_client = None

    def preprocess(self, inputs: Dict) -> Dict:
        # Tokenize the paragraph and the incoming question into BERT features
        self.doc_tokens = data_processing.convert_doc_tokens(self.short_paragraph_text)
        self.features = data_processing.convert_examples_to_features(
            self.doc_tokens, inputs["instances"][0], self.tokenizer, 128, 128, 64)
        return self.features

    def predict(self, features: Dict) -> Union[Dict, InferResponse]:
        if not self.triton_client:
            self.triton_client = httpclient.InferenceServerClient(
                url=self.predictor_host, verbose=True)

        unique_ids = np.zeros([1, 1], dtype=np.int32)
        segment_ids = features["segment_ids"].reshape(1, 128)
        input_ids = features["input_ids"].reshape(1, 128)
        input_mask = features["input_mask"].reshape(1, 128)

        # Build the V2 inference request for Triton
        inputs = []
        inputs.append(httpclient.InferInput('unique_ids', [1, 1], "INT32"))
        inputs.append(httpclient.InferInput('segment_ids', [1, 128], "INT32"))
        inputs.append(httpclient.InferInput('input_ids', [1, 128], "INT32"))
        inputs.append(httpclient.InferInput('input_mask', [1, 128], "INT32"))
        inputs[0].set_data_from_numpy(unique_ids)
        inputs[1].set_data_from_numpy(segment_ids)
        inputs[2].set_data_from_numpy(input_ids)
        inputs[3].set_data_from_numpy(input_mask)

        outputs = []
        outputs.append(httpclient.InferRequestedOutput('start_logits', binary_data=False))
        outputs.append(httpclient.InferRequestedOutput('end_logits', binary_data=False))
        result = self.triton_client.infer(self.model_name, inputs, outputs=outputs)
        return result.get_response()

    def postprocess(self, result: Dict) -> Dict:
        end_logits = result['outputs'][0]['data']
        start_logits = result['outputs'][1]['data']
        n_best_size = 20

        # The maximum length of an answer that can be generated. This is needed
        # because the start and end predictions are not conditioned on one another
        max_answer_length = 30

        (prediction, nbest_json, scores_diff_json) = \
            data_processing.get_predictions(self.doc_tokens, self.features, start_logits,
                                            end_logits, n_best_size, max_answer_length)
        return {"predictions": prediction, "prob": nbest_json[0]['probability'] * 100.0}

Building the Transformer Docker Image

Build the KServe Transformer image with the above code:

cd bert_tokenizer_v2
docker build -t $USER/bert_transformer-v2:latest . --rm

Alternatively, you can use the pre-built image kfserving/bert-transformer-v2:latest.
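
If you built your own image, push it to a registry that your cluster can pull from before referencing it in the InferenceService (this assumes you are logged in to that registry):

docker push $USER/bert_transformer-v2:latest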

Creating the InferenceService

Add the custom KServe Transformer image and Triton Predictor to the InferenceService specification:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    model:
      modelFormat:
        name: triton
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/triton/bert"
      runtimeVersion: 20.10-py3
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
Apply the InferenceService YAML:

kubectl apply -f bert_v1beta1.yaml
Expected Output
inferenceservice.serving.kserve.io/bert-v2 created

Checking the InferenceService Status

Verify that your InferenceService is running properly:

kubectl get inferenceservice bert-v2
Expected Output

Output should look like:

NAME      URL                                           READY   AGE
bert-v2   http://bert-v2.default.35.229.120.99.xip.io   True    71s

Confirm that both the transformer and predictor components are in a Ready state:

kubectl get revision -l serving.kserve.io/inferenceservice=bert-v2
Expected Output

Output should look like:

NAME                                CONFIG NAME                   K8S SERVICE NAME                    GENERATION   READY   REASON
bert-v2-predictor-default-plhgs     bert-v2-predictor-default     bert-v2-predictor-default-plhgs     1            True
bert-v2-transformer-default-sd6nc   bert-v2-transformer-default   bert-v2-transformer-default-sd6nc   1            True
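
If either component is not Ready, the pods usually show why (for example, image pull or resource issues). The same label selector used above also applies to the pods; <pod-name> below is a placeholder:

kubectl get pods -l serving.kserve.io/inferenceservice=bert-v2
kubectl describe pod <pod-name>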

Running a Prediction

First, determine the ingress IP and port of your cluster's Istio Ingress gateway and set the INGRESS_HOST and INGRESS_PORT environment variables.
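
For example, if your cluster exposes the Istio Ingress gateway through a LoadBalancer service (adjust the namespace and service name if your installation differs):

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')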

Send a question request with the following input. The transformer expects a JSON body with an "instances" list and preprocesses it into the tensor format expected by the Triton Inference Server:

{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}

You can find the sample input file here: input.json
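
If you prefer to create the file locally, the request body above can be saved as input.json like this:

cat <<'EOF' > input.json
{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}
EOF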

Execute the following command to make a prediction request:

MODEL_NAME=bert-v2
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservices bert-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
{"predictions": "John F. Kennedy", "prob": 77.91848979818604}

Model Repository Structure for TensorFlow Models

For TensorFlow models, Triton expects a specific model repository structure. Your model repository should follow this layout:

<model-repository-path>/
  <model-name>/
    config.pbtxt
    1/
      model.savedmodel/
        saved_model.pb
        variables/
          variables.data-00000-of-00001
          variables.index
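
As a rough sketch, assembling and uploading such a repository might look like the following (the local paths, bucket name, and gsutil tooling are assumptions; adapt them to your environment):

mkdir -p model_repository/bert_tf_v2_large_fp16_128_v2/1/model.savedmodel
cp config.pbtxt model_repository/bert_tf_v2_large_fp16_128_v2/
cp -r /path/to/export/saved_model.pb /path/to/export/variables model_repository/bert_tf_v2_large_fp16_128_v2/1/model.savedmodel/
gsutil -m cp -r model_repository gs://<your-bucket>/models/triton/bert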

The config.pbtxt file defines the model configuration, including inputs, outputs, and execution options. Here's an example configuration for a TensorFlow model:

name: "bert_tf_v2_large_fp16_128_v2"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [
{
name: "unique_ids"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "segment_ids"
data_type: TYPE_INT32
dims: [ 128 ]
},
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ 128 ]
},
{
name: "input_mask"
data_type: TYPE_INT32
dims: [ 128 ]
}
]
output [
{
name: "end_logits"
data_type: TYPE_FP32
dims: [ 128 ]
},
{
name: "start_logits"
data_type: TYPE_FP32
dims: [ 128 ]
}
]

instance_group [
{
count: 1
kind: KIND_GPU
}
]
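
Before deploying to KServe, you can sanity-check the repository layout and configuration by running the same Triton image locally (this assumes Docker with GPU support and the 20.10-py3 tag used above):

docker run --rm --gpus all -p 8000:8000 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:20.10-py3 tritonserver --model-repository=/models
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready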

Performance Optimization

For optimal performance with TensorFlow models on Triton, consider the following options; a sketch of how to measure their effect follows the list:

  1. Dynamic Batching: Enable dynamic batching to improve throughput when serving multiple requests:
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
  2. Instance Groups: Configure multiple model instances across GPUs or CPUs:
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
  3. Mixed Precision: Use FP16 precision for faster inference on supported GPUs:
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "auto_mixed_precision"
    } ]
  }
}
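
To gauge the effect of these options, Triton's perf_analyzer tool (shipped in the Triton SDK container) can drive concurrent load against the model; the host below is a placeholder and the model name matches the configuration shown earlier:

perf_analyzer -m bert_tf_v2_large_fp16_128_v2 -u <triton-host>:8000 --concurrency-range 1:4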

Conclusion

This guide demonstrated how to deploy a TensorFlow BERT model using Triton Inference Server on KServe. The approach shown can be extended to other TensorFlow models. By combining the flexibility of Triton Inference Server with KServe's serving capabilities, you can deploy sophisticated TensorFlow models with custom pre- and post-processing in Kubernetes environments.