Inference using Triton and TensorRT
Triton
This post walks through running inference with Triton Inference Server. We will first serve a model written in PyTorch, and then compile a model with TensorRT and serve that as well. The code was run on a g4dn.2xlarge EC2 instance (8 vCPUs, 32 GB RAM, single T4 GPU).
Make a directory structure like this
root@server:/work$ tree -L 3
.
└── models
    ├── pt
    │   └── 1
    └── trt
        └── 1
First, we will pull the image for Triton
docker pull nvcr.io/nvidia/tritonserver:22.10-py3
We can start the server as follows (HTTP on port 8000, gRPC on 8001, metrics on 8002)
docker run --rm -it \
-v $(pwd):/work \
--name triton \
--gpus all \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
--runtime=nvidia nvcr.io/nvidia/tritonserver:22.10-py3 \
tritonserver \
--model-repository=/work/models \
--exit-on-error=false \
--repository-poll-secs=20 \
--model-control-mode="poll"
For inference, we need a few libraries.
pip install nvidia-pyindex
pip install tritonclient[all]
To check the status of the server, wait about 20 seconds and execute the following. Proceed only if you get a 200 response code
curl -v localhost:8000/v2/health/ready
Copy your scripted or traced PyTorch model to the models/pt/1 directory. If you still need to create one, see the sketch below.
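Here is a minimal sketch of producing such a model. The tiny network is only a hypothetical stand-in for your trained CIFAR-10 classifier, and the paths assume you are working from the repository root; model.pt is the filename Triton's PyTorch backend looks for by default.
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained CIFAR-10 classifier -- replace with your own network.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

# Trace with a dummy CIFAR-10-sized input and save it where Triton expects it
# (models/<model_name>/<version>/model.pt).
example_input = torch.randn(1, 3, 32, 32)
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "models/pt/1/model.pt")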
Create a configuration file for this model
configuration = """
name: "pt"
platform: "pytorch_libtorch"
max_batch_size : 0
input [
{
name: "input__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 32, 32 ]
reshape { shape: [ 1, 3, 32, 32 ] }
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ 10 ]
reshape { shape: [ 10 ] }
}
]
parameters: {
key: "INFERENCE_MODE"
value: {
string_value: "true"
}
}
"""
with open('/home/ubuntu/work/models/pt/config.pbtxt', 'w') as file:
    file.write(configuration)
Wait about 20 seconds and execute the following. Proceed only if you get a 200 response code
curl -v localhost:8000/v2/models/pt
Make an inference request
import tritonclient.http as tritonhttpclient
VERBOSE = False
model_label = 'input__0'
input_shape = ( 3, 32, 32)
input_dtype = 'FP32'
output_name = 'output__0'
model_name = 'pt'
url = 'localhost:8000'
model_version = '1'
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)
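Before sending any data, it is worth sanity-checking that both the server and the model are ready. A short sketch reusing the client created above:
# Sanity checks using the client created above
assert triton_client.is_server_live()
assert triton_client.is_server_ready()
assert triton_client.is_model_ready(model_name, model_version)

# Inspect what the server reports about the model's inputs and outputs
print(model_metadata)
print(model_config)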
Create a function to take an image as input and preprocess it.
import numpy as np
from torchvision import transforms
from PIL import Image
# preprocessing function
def img_preprocess(img_path="../GTC/img/cat.jpg"):
    img = Image.open(img_path)
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2471, 0.2435, 0.2616]),
    ])
    return preprocess(img).numpy()
transformed_img = img_preprocess()
Make the request
input0 = tritonhttpclient.InferInput(model_label, transformed_img.shape, datatype="FP32")
input0.set_data_from_numpy(transformed_img, binary_data=False)
output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False, class_count=10)
response = triton_client.infer(model_name, model_version=model_version, inputs=[input0], outputs=[output])
output_label = response.as_numpy(output_name)
You will see that the response contains the logits as strings in a logit:class_index format. To convert this into a dictionary keyed by class name, we take the following steps
with open("../GTC/cifar10_classes.txt", "r") as f:
categories = [s.strip() for s in f.readlines()]
results = {}
for r in output_label:
cat = int(r.split(":")[1])
conf = float(r.split(":")[0])
results[categories[cat]] = conf
logits = list(results.values())
import torch
import torch.nn.functional as F
logits = torch.tensor(logits)
preds = (F.softmax(logits, dim=-1) * 100).numpy()
for c, k in enumerate(results):
    results[k] = preds[c]
results
Here is a result obtained from inference
TensorRT
First, let’s pull the PyTorch container from Nvidia that includes TensorRT.
docker pull nvcr.io/nvidia/pytorch:22.10-py3
Convert the TorchScript model to a TensorRT model via the following code
import torch
import torch_tensorrt

# load the TorchScript model
model = torch.jit.load("cifar10-script.pt")

# Compile with Torch-TensorRT
trt_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input((1, 3, 32, 32))],
    enabled_precisions={torch.half}  # run with FP16
)

# Save the compiled model
torch.jit.save(trt_model, "model1.pt")
Note that we are using half precision, so the results might not be exactly the same as with the original FP32 model.
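The compiled model also has to be placed in the model repository before Triton can serve it. A minimal sketch, assuming the repository from the first section is reachable at /work/models and keeping the default model.pt filename expected by the PyTorch backend:
import shutil

# Assumption: the Triton model repository is mounted at /work/models.
# The PyTorch backend looks for a file named model.pt inside the version directory.
shutil.copy("model1.pt", "/work/models/trt/1/model.pt")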
Create the configuration file
configuration = """
name: "trt"
platform: "pytorch_libtorch"
max_batch_size : 0
input [
{
name: "input__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 32, 32 ]
reshape { shape: [ 1, 3, 32, 32 ] }
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ 10 ]
reshape { shape: [ 10 ] }
}
]
"""
with open('/home/ubuntu/work/models/trt/config.pbtxt', 'w') as file:
    file.write(configuration)
Wait about 30 seconds and make sure you get a 200 response code from the server
curl -v localhost:8000/v2/models/trt
Use the same preprocessing function as above and make the inference request shown below (don't forget to change the model name via model_name = 'trt'!).
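The request reuses the client, the preprocessed image, and the variable names from the Triton section; only the model name changes.
# Same request as before, now routed to the TensorRT-compiled model
model_name = 'trt'

input0 = tritonhttpclient.InferInput(model_label, transformed_img.shape, datatype="FP32")
input0.set_data_from_numpy(transformed_img, binary_data=False)
output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False, class_count=10)

response = triton_client.infer(model_name, model_version=model_version, inputs=[input0], outputs=[output])
output_label = response.as_numpy(output_name)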
You should get a result similar to the one above.
For the complete code, visit the repo.