Detection Unit

To create a detector unit, the developer has to write an nvinfer@detector declaration as shown in the following listing:

- element: nvinfer@detector
  name: Primary_Detector
  model:
    format: caffe
    model_file: resnet10.caffemodel
    batch_size: 1
    precision: int8
    int8_calib_file: cal_trt.bin
    label_file: labels.txt
    input:
      scale_factor: 0.0039215697906911373
    output:
      num_detected_classes: 4
      layer_names: [conv2d_bbox, conv2d_cov/Sigmoid]

Full documentation for the detector unit is located on the unit page: NvInferDetector.

You must configure the model parameters, input, and output. The parameters may also be specified in the model.config_file file according to the Nvidia DeepStream specification for a detector (example). YAML-based parameters override those configured in model.config_file.
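For instance, a hypothetical configuration where most parameters come from a DeepStream config file while the precision is overridden in YAML could look like this (the file name is illustrative):

- element: nvinfer@detector
  name: Primary_Detector
  model:
    format: caffe
    # DeepStream-style config holding the model, labels, etc. (illustrative name)
    config_file: primary_detector_config.txt
    # overrides the network-mode/precision configured in config_file
    precision: fp16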

Batch Size

The batch size is directly connected with inference performance. When handling multiple streams, consider setting batch_size equal to the maximum expected number of streams. For the Jetson family, Nvidia often uses a batch size of 32 in its benchmarks.

Batching is a sophisticated topic: there is no silver bullet for selecting a batch size, especially when inference happens at the second and later steps; experiment with your values to achieve the best possible performance and GPU utilization.
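For example, a module expected to serve up to four simultaneous streams might set the following (the value is illustrative):

- element: nvinfer@detector
  name: Primary_Detector
  model:
    # matches the maximum expected number of simultaneous streams
    batch_size: 4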

Unit Input

The input block describes how the unit preprocesses the data and selects the objects on which it operates. All the input parameters can be found in the specification for NvInferModelInput.

Object Selection

Input defines several parameters which are important to understand. Here we would like to highlight the object parameter, which specifies the label of the objects selected for inference. If the object parameter is not set, it is assumed to be equal to frame: the default object whose ROI covers the whole frame (without padding added).

The primary detectors usually work with the default object: frame, while secondary detectors specify labels to select objects of interest.
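For example, a secondary detector that should run only on objects labeled person by a unit named Primary_Detector could declare the following (the element name and model file are illustrative):

- element: nvinfer@detector
  name: Secondary_Face_Detector
  model:
    format: caffe
    model_file: face_detector.caffemodel
    input:
      # run inference only on person objects produced by the primary detector
      object: Primary_Detector.person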

There are other parameters in the input section which select objects, like object_min_width, object_min_height, and others. Please keep in mind that the selected objects are passed to the model and those not selected are ignored, but the latter still persist in the metadata, so downstream elements can consume them later.

Imagine a situation where you have two models: the first works well with medium-size objects but is very heavyweight, while the second works perfectly with large objects and is lightweight. To utilize resources efficiently, you may place two inference elements with properly defined input blocks:

- element: nvinfer@detector
  name: Secondary_Detector_Small
  model:
    format: caffe
    model_file: model1
    input:
      object_min_height: 30
      object_max_height: 100
    ...

- element: nvinfer@detector
  name: Secondary_Detector_Large
  model:
    format: caffe
    model_file: model2
    input:
      object_min_height: 101

Custom Preprocessing

Input also allows custom preprocessing with preprocess_object_meta and preprocess_object_tensor. To use the former, you have to implement the BasePreprocessObjectMeta interface; the latter is an advanced topic which is not covered here. Keep in mind that all modifications made during preprocessing are restored after the unit is done with its processing.

Example of preprocess_object_meta:

input:
  object: object_detector.something
  preprocess_object_meta:
    module: something_detector.input_preproc
    class_name: CropTopPreprocessObjectMeta

And the corresponding implementation:

"""Model input metadata preprocessing: cropping variations."""

from savant_rs.primitives.geometry import BBox

from savant.base.input_preproc import BasePreprocessObjectMeta
from savant.meta.object import ObjectMeta


class CropTopPreprocessObjectMeta(BasePreprocessObjectMeta):
    """Make bbox height no lesser that bbox width."""

    def __call__(self, object_meta: ObjectMeta) -> BBox:
        bbox = object_meta.bbox
        max_dim = max(bbox.width, bbox.height)
        return BBox(
            bbox.xc,
            bbox.top + max_dim / 2,
            bbox.width,
            max_dim,
        )

Unit Output

The output section describes how the unit processes metadata before passing them to the following unit. The parameters of output may be found in the specification for NvInferObjectModelOutput.

Layer Names

The parameter layer_names defines the names of the model output layers returned for postprocessing.

Converter

Output defines an important parameter, converter, which is essentially a function that builds bounding boxes from the raw output tensor. For detection models prepared for Nvidia DeepStream, the converter parameter is not required; however, if the model's output cannot be parsed automatically, you must provide an implementation of BaseObjectModelOutputConverter to produce boxes for detected objects.

Example:

converter:
  module: savant.converter.yolo_x
  class_name: TensorToBBoxConverter
  kwargs:
    decode: true

The converter implementation can be found in the class TensorToBBoxConverter.

Note

The converter can access data with CPU-based tools (NumPy) or on the GPU with CuPy. Typically, NumPy-based processing is the default choice; however, if you have a lot of data to process, you may consider using CuPy to speed things up. To use CuPy, set the tensor_format attribute to TensorFormat.CuPy in the converter class.

For example:

class TensorToBBoxConverter(BaseObjectModelOutputConverter):
    tensor_format: TensorFormat = TensorFormat.CuPy

This allows the converter to use CuPy to process the data. For an example of a CuPy-based converter, take a look at the sample on GitHub: https://github.com/insight-platform/Savant/tree/develop/samples/yolov8_seg

Normally, if the tensor output is small, it is better to use NumPy. If the tensor output is large (common in tasks like segmentation, instance segmentation, etc.), it is worth considering CuPy to speed up the processing.

An example of the converter for YOLOv4 is listed below. The YOLOv4 model has two output layers: the first contains the box coordinates, the second contains the per-class confidences (from which class_id and confidence are derived).

Note

When writing a converter, you must return objects relative to the ROI of the parent object: the framework will recalculate their absolute coordinates.

from typing import Tuple

import numpy as np

from savant.base.converter import BaseObjectModelOutputConverter
from savant.base.model import ObjectModel


class TensorToBBoxConverter(BaseObjectModelOutputConverter):
    """`YOLOv4 <https://github.com/Tianxiaomo/pytorch-YOLOv4>`_ output to bbox
    converter."""

    def __call__(
        self,
        *output_layers: np.ndarray,
        model: ObjectModel,
        roi: Tuple[float, float, float, float],
    ) -> np.ndarray:
        """Converts detector output layer tensor to bbox tensor.

        :param output_layers: Output layer tensor
        :param model: Model definition, required parameters: input tensor shape,
            maintain_aspect_ratio
        :param roi: [left, top, width, height] of the rectangle
            on which the model infers
        :return: BBox tensor (class_id, confidence, xc, yc, width, height, [angle])
            offset by roi upper left and scaled by roi width and height
        """
        boxes, confs = output_layers
        roi_left, roi_top, roi_width, roi_height = roi

        # [num, 1, 4] -> [num, 4]
        bboxes = np.squeeze(boxes)

        # left, top, right, bottom => xc, yc, width, height
        bboxes[:, 2] -= bboxes[:, 0]
        bboxes[:, 3] -= bboxes[:, 1]
        bboxes[:, 0] += bboxes[:, 2] / 2
        bboxes[:, 1] += bboxes[:, 3] / 2

        # scale
        if model.input.maintain_aspect_ratio:
            bboxes *= max(roi_width, roi_height)
        else:
            bboxes[:, [0, 2]] *= roi_width
            bboxes[:, [1, 3]] *= roi_height
        # correct xc, yc
        bboxes[:, 0] += roi_left
        bboxes[:, 1] += roi_top

        # [num, num_classes] --> [num]
        confidences = np.max(confs, axis=-1)
        class_ids = np.argmax(confs, axis=-1)

        return np.concatenate(
            (
                class_ids.reshape(-1, 1).astype(np.float32),
                confidences.reshape(-1, 1),
                bboxes,
            ),
            axis=1,
        )

Object Filtering

Within output you may also select only necessary objects by specifying their IDs and labels:

output:
  layer_names: [output_bbox/BiasAdd, output_cov/Sigmoid]
  num_detected_classes: 3
  objects:
    - class_id: 0
      label: person
      selector:
        kwargs:
          min_width: 32
          min_height: 32
    - class_id: 2
      label: face
      selector:
        kwargs:
          confidence_threshold: 0.1

Classes not listed in objects are permanently excluded from the subsequent steps of the pipeline. The selector block also allows defining a filter to eliminate unnecessary objects.

If the unit name is Primary_Detector, then to address the selected objects in the following units, use the Primary_Detector.person and Primary_Detector.face labels.

The default selector implementation runs NMS and allows selecting objects by specifying min_width, min_height, and confidence_threshold. To create a custom selector you have to implement BaseSelector. You may want to take a look at BBoxSelector to get an idea of how to write it.
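As an illustration, below is a minimal sketch of a custom selector. It assumes that BaseSelector is importable from savant.base.selector and that, like BBoxSelector, the selector receives and returns a NumPy bbox tensor with one row per object (class_id, confidence, xc, yc, width, height); the class name and its max_aspect_ratio parameter are hypothetical.

import numpy as np

from savant.base.selector import BaseSelector


class AspectRatioSelector(BaseSelector):
    """Keeps only objects whose aspect ratio is close to square (illustrative)."""

    def __init__(self, max_aspect_ratio: float = 1.5, **kwargs):
        # forward any remaining parameters to the base class
        super().__init__(**kwargs)
        self.max_aspect_ratio = max_aspect_ratio

    def __call__(self, bbox_tensor: np.ndarray) -> np.ndarray:
        # width / height for every detected object
        aspect = bbox_tensor[:, 4] / bbox_tensor[:, 5]
        keep = (aspect <= self.max_aspect_ratio) & (aspect >= 1.0 / self.max_aspect_ratio)
        return bbox_tensor[keep]

Such a class could then be referenced from an objects item's selector block via module and class_name, with max_aspect_ratio passed in kwargs.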

Converter vs Selector

The converter and the selector are two different concepts. The converter parses the raw model output into object metadata (bounding boxes with class IDs and confidences), while the selector filters those objects to keep only the ones of interest. Their functionality can overlap, but they serve different purposes.

As a rule of thumb, use the converter when the model output must be transformed into a format the framework understands, and use the selector when you need to filter the resulting objects (by confidence, size, NMS, etc.).

Note

Despite their different roles, a custom implementation of the converter can contain features normally implemented in the selector. It depends on the developer’s preferences and the performance considerations.

Performance Considerations

The goal of the converter is to transform the model output into a format usable by the selector and to drop what is not required, e.g., by running NMS and excluding classes early. This is important to avoid creating unnecessary metadata. However, when every class requires additional filtering, these operations can be carried out by selectors individually for each class: every class can have its own selector with its own parameters.

Default Selector

The default selector implementation is BBoxSelector. It is a simple implementation of the selector that runs NMS and excludes objects based on their confidence and size. When the module and class_name parameters are not specified in the objects.<class_id>.selector block, the default selector is used.

The default selector parameters are (all parameters are optional):

  • confidence_threshold: confidence threshold (default: 0.5);

  • nms_iou_threshold: NMS IoU threshold (default: 0.5);

  • top_k: top k objects to keep (default: 0, means all objects are kept);

  • min_width: minimal object width (default: 0, means no width filtering);

  • min_height: minimal object height (default: 0, means no height filtering);

  • max_width: maximum object width (default: 0, means no width filtering);

  • max_height: maximum object height (default: 0, means no height filtering).

output:
  layer_names: [output_bbox/BiasAdd, output_cov/Sigmoid]
  num_detected_classes: 3
  objects:
    - class_id: 0
      label: person
    - class_id: 2
      label: face

The code above demonstrates the use of the default selector with the default parameters.