# Binary Tensor Data Extension
The Binary Tensor Data Extension allows clients to send and receive tensor data in a binary format in the body of an HTTP/REST request. This extension is particularly useful for FP16 data, since the Open Inference Protocol has no specific JSON data type for 16-bit floats, and for large tensors in high-throughput scenarios.
## Overview
Tensor data represented as binary data is organized in little-endian byte order, row major, without stride or padding between elements. All tensor data types are representable as binary data in the native size of the data type. For BOOL type element true is a single byte with value 1 and false is a single byte with value 0. For BYTES type an element is represented by a 4-byte unsigned integer giving the length followed by the actual bytes. The binary data for a tensor is delivered in the HTTP body after the JSON object (see Examples).
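The byte-level rules above can be illustrated with a minimal sketch (the helper names here are illustrative only, not part of any KServe API):

```python
import struct

import numpy as np

def encode_bool(values):
    # BOOL: one byte per element, 1 for true and 0 for false
    return bytes(1 if v else 0 for v in values)

def encode_bytes_element(b):
    # BYTES: a 4-byte little-endian unsigned length, then the raw bytes
    return struct.pack("<I", len(b)) + b

def encode_fixed(arr, dtype):
    # Fixed-size types: native element size, little-endian, row major, no padding
    return np.ascontiguousarray(arr, dtype=dtype).tobytes()

encode_bool([True, False, True])                     # b'\x01\x00\x01'
encode_bytes_element(b"hi")                          # b'\x02\x00\x00\x00hi'
len(encode_fixed([[1.0, 2.0], [3.0, 4.0]], "<f2"))   # 8 (4 FP16 elements x 2 bytes)
```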
The binary tensor data extension uses parameters to indicate that an input or output tensor is communicated as binary data.
The `binary_data_size` parameter is used in `$request_input` and `$response_output` to indicate that the input or output tensor is communicated as binary data:
- "binary_data_size" : int64 parameter indicating the size of the tensor binary data, in bytes.
The `binary_data` parameter is used in `$request_output` to indicate that the output should be returned from the KServe runtime as binary data.
- "binary_data" : bool parameter that is true if the output should be returned as binary data and false (or not given) if the tensor should be returned as JSON.
The `binary_data_output` parameter is used in `$inference_request` to indicate that all outputs should be returned from the KServe runtime as binary data, unless overridden by `binary_data` on a specific output.
- "binary_data_output" : bool parameter that is true if all outputs should be returned as binary data and false (or not given) if the outputs should be returned as JSON. If "binary_data" is specified on an output it overrides this setting.
When one or more tensors are communicated as binary data, the HTTP body of the request or response contains the JSON inference request or response object followed by the binary tensor data, in the same order as the tensors are specified in the JSON.
- If any binary data is present in the request or response, the `Inference-Header-Content-Length` header must be provided to give the length of the JSON object, while `Content-Length` continues to give the full body length (as HTTP requires).
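The body layout and the two length headers can be sketched with a small helper (`build_infer_body` is a hypothetical function for illustration; the KServe client handles this internally):

```python
import json

def build_infer_body(request, binary_tensors):
    """Assemble the HTTP body: the JSON object first, then the binary
    tensors in the same order as they appear in the JSON."""
    json_part = json.dumps(request).encode("utf-8")
    body = json_part + b"".join(binary_tensors)
    headers = {
        "Content-Type": "application/octet-stream",
        # Length of the JSON object only
        "Inference-Header-Content-Length": str(len(json_part)),
        # Full body length, as HTTP requires
        "Content-Length": str(len(body)),
    }
    return body, headers

request = {"model_name": "mymodel", "inputs": [
    {"name": "input0", "shape": [3], "datatype": "BOOL",
     "parameters": {"binary_data_size": 3}}]}
body, headers = build_infer_body(request, [b"\x01\x00\x01"])
```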
## Examples

### Sending and Receiving Binary Data
For the following request, the input tensors `input0` and `input2` are sent as binary data while `input1` is sent as non-binary data. Note that `input0` and `input2` carry a `binary_data_size` parameter giving the size of their binary data in bytes.

The output tensor `output0` must be returned as binary data, as requested by setting the `binary_data` parameter to true. Also note that the size of the JSON part is provided in the `Inference-Header-Content-Length` header and the total size of the body is reflected in the `Content-Length` header.
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/octet-stream
Inference-Header-Content-Length: <xx> # JSON length
Content-Length: <xx+11> # JSON length + binary data length (in this case 8 + 3 = 11)
{
  "model_name" : "mymodel",
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "FP16",
      "parameters" : {
        "binary_data_size" : 8
      }
    },
    {
      "name" : "input1",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [[1, 2], [3, 4]]
    },
    {
      "name" : "input2",
      "shape" : [ 3 ],
      "datatype" : "BOOL",
      "parameters" : {
        "binary_data_size" : 3
      }
    }
  ],
  "outputs" : [
    {
      "name" : "output0",
      "parameters" : {
        "binary_data" : true
      }
    },
    {
      "name" : "output1"
    }
  ]
}
<8 bytes of data for input0 tensor>
<3 bytes of data for input2 tensor>
Assuming the model returns a [ 3, 2 ] tensor of data type FP16 and a [ 2, 2 ] tensor of data type FP32, the following response would be returned.
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Inference-Header-Content-Length: <yy> # JSON length
Content-Length: <yy+12> # JSON length + binary data length (in this case 12)
{
  "outputs" : [
    {
      "name" : "output0",
      "shape" : [ 3, 2 ],
      "datatype" : "FP16",
      "parameters" : {
        "binary_data_size" : 12
      }
    },
    {
      "name" : "output1",
      "shape" : [ 2, 2 ],
      "datatype" : "FP32",
      "data" : [[1.203, 5.403], [3.434, 34.234]]
    }
  ]
}
<12 bytes of data for output0 tensor>
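Decoding such a response is the mirror image: split the body at `Inference-Header-Content-Length`, parse the JSON, then slice the trailing binary region in output order using each `binary_data_size`. This is a sketch of what the KServe client does internally (`parse_infer_response` is a hypothetical helper):

```python
import json

import numpy as np

def parse_infer_response(body, header_len):
    """Split the body into the JSON object and per-output tensors."""
    response = json.loads(body[:header_len])
    offset = header_len
    tensors = {}
    for out in response["outputs"]:
        size = out.get("parameters", {}).get("binary_data_size")
        if size is not None:
            # Binary output: consume `size` bytes at the current offset
            raw = body[offset:offset + size]
            offset += size
            dtype = {"FP16": "<f2", "FP32": "<f4"}[out["datatype"]]
            tensors[out["name"]] = np.frombuffer(raw, dtype=dtype).reshape(out["shape"])
        else:
            # Non-binary output: the data is inline in the JSON
            tensors[out["name"]] = np.array(out["data"])
    return response, tensors
```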
import numpy as np

from kserve import InferenceRESTClient, InferRequest, InferInput
from kserve.protocol.infer_type import RequestedOutput
from kserve.inference_client import RESTConfig

fp16_data = np.array([[1.1, 2.22], [3.345, 4.34343]], dtype=np.float16)
uint32_data = np.array([[1, 2], [3, 4]], dtype=np.uint32)
bool_data = np.array([True, False, True], dtype=bool)

# Create the input tensors, marking which ones are sent as binary data
input_0 = InferInput(name="input_0", datatype="FP16", shape=[2, 2])
input_0.set_data_from_numpy(fp16_data, binary_data=True)
input_1 = InferInput(name="input_1", datatype="UINT32", shape=[2, 2])
input_1.set_data_from_numpy(uint32_data, binary_data=False)
input_2 = InferInput(name="input_2", datatype="BOOL", shape=[3])
input_2.set_data_from_numpy(bool_data, binary_data=True)

# Create the requested outputs
output_0 = RequestedOutput(name="output_0", binary_data=True)
output_1 = RequestedOutput(name="output_1", binary_data=False)

# Create the inference request
infer_request = InferRequest(
    model_name="mymodel",
    request_id="2ja0ls9j1309",
    infer_inputs=[input_0, input_1, input_2],
    requested_outputs=[output_0, output_1],
)

# Create the REST client
config = RESTConfig(verbose=True, protocol="v2")
rest_client = InferenceRESTClient(config=config)

# Send the request
infer_response = await rest_client.infer(
    "http://localhost:8000",
    model_name="TestModel",
    data=infer_request,
    headers={"Host": "test-server.com"},
    timeout=2,
)

# Read the binary data from the response
output_0 = infer_response.outputs[0]
fp16_output = output_0.as_numpy()

# Read the non-binary data from the response
output_1 = infer_response.outputs[1]
fp32_output = output_1.data  # This returns the data as a list
fp32_output_arr = output_1.as_numpy()
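Note that `binary_data_size` always equals the tensor's raw byte count, which for numpy arrays like the ones above is simply `nbytes`:

```python
import numpy as np

fp16_data = np.array([[1.1, 2.22], [3.345, 4.34343]], dtype=np.float16)
bool_data = np.array([True, False, True], dtype=bool)

# FP16 is 2 bytes per element: 2 x 2 elements -> 8 bytes of binary data
print(fp16_data.nbytes)  # 8
# BOOL is 1 byte per element: 3 elements -> 3 bytes
print(bool_data.nbytes)  # 3
```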
### Requesting All The Outputs To Be In Binary Format
For the following request, `binary_data_output` is set to true to receive all the outputs as binary data. Note that `binary_data_output` is set in the `$inference_request` parameters field, not in the `$inference_input` parameters field. It can be overridden for a specific output by setting the `binary_data` parameter to false in that `$request_output`.
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx> # JSON length (no binary data in this request)
{
  "model_name": "my_model",
  "inputs": [
    {
      "name": "input_tensor",
      "datatype": "FP32",
      "shape": [1, 2],
      "data": [[32.045, 399.043]]
    }
  ],
  "parameters": {
    "binary_data_output": true
  }
}
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Inference-Header-Content-Length: <yy> # JSON length
Content-Length: <yy+28> # JSON length + binary data length (in this case 12 + 16 = 28)
{
  "outputs" : [
    {
      "name" : "output_tensor0",
      "shape" : [ 3, 2 ],
      "datatype" : "FP16",
      "parameters" : {
        "binary_data_size" : 12
      }
    },
    {
      "name" : "output_tensor1",
      "shape" : [ 2, 2 ],
      "datatype" : "FP32",
      "parameters" : {
        "binary_data_size" : 16
      }
    }
  ]
}
<12 bytes of data for output_tensor0 tensor>
<16 bytes of data for output_tensor1 tensor>
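The two length headers of such a response can be cross-checked by summing the JSON length and the per-output `binary_data_size` values: a [ 3, 2 ] FP16 tensor is 12 bytes and a [ 2, 2 ] FP32 tensor is 16 bytes, so the binary region adds 28 bytes (a quick arithmetic sketch with an illustrative response object):

```python
import json

# Illustrative response JSON for a [3, 2] FP16 and a [2, 2] FP32 output
response_json = {"outputs": [
    {"name": "output_tensor0", "shape": [3, 2], "datatype": "FP16",
     "parameters": {"binary_data_size": 12}},
    {"name": "output_tensor1", "shape": [2, 2], "datatype": "FP32",
     "parameters": {"binary_data_size": 16}},
]}

json_len = len(json.dumps(response_json).encode("utf-8"))  # the <yy> value
binary_len = sum(o["parameters"]["binary_data_size"] for o in response_json["outputs"])
content_length = json_len + binary_len  # <yy> + 28
```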
import numpy as np

from kserve import InferenceRESTClient, InferRequest, InferInput
from kserve.inference_client import RESTConfig

fp32_data = np.array([[32.045, 399.043]], dtype=np.float32)

# Create the input tensor
input_0 = InferInput(name="input_0", datatype="FP32", shape=[1, 2])
input_0.set_data_from_numpy(fp32_data, binary_data=False)

# Create the inference request with binary_data_output set to True
infer_request = InferRequest(
    model_name="mymodel",
    request_id="2ja0ls9j1309",
    infer_inputs=[input_0],
    parameters={"binary_data_output": True},
)

# Create the REST client
config = RESTConfig(verbose=True, protocol="v2")
rest_client = InferenceRESTClient(config=config)

# Send the request
infer_response = await rest_client.infer(
    "http://localhost:8000",
    model_name="TestModel",
    data=infer_request,
    headers={"Host": "test-server.com"},
    timeout=2,
)

# All outputs are returned as binary data
output_0 = infer_response.outputs[0]
fp16_output = output_0.as_numpy()
output_1 = infer_response.outputs[1]
fp32_output_arr = output_1.as_numpy()