Complex Model Unit

The complex model unit is used to run inference with complex models that perform detection and determine object attributes in a single pass, i.e.,

complex model = detection model + attribute model

This element combines both the detection and attribute model units. Below is an example of defining such a unit for inferring a model that detects faces and simultaneously finds facial keypoints.

- element: nvinfer@complex_model
  name: face_detector
  model:
    format: onnx
    onnx-file: retinaface_resnet50.onnx
    batch-size: 16
    precision: fp16
    input:
      object: person_detector.person
      shape: [3, 192, 192]
      offsets: [104.0, 117.0, 123.0]
    output:
      layer_names: ['bboxes', 'scores', 'landmarks']
      converter:
        module: customer_analysis.retinaface_converter
        class_name: RetinafaceConverter
      objects:
        - class_id: 0
          label: face
          selector:
            module: savant.selector.detector
            class_name: BBoxSelector
            kwargs:
              confidence_threshold: 0.991
              nms_iou_threshold: 0.4
              min_height: 70
              min_width: 90
      attributes:
        - name: landmarks

We do not describe the parameters of the input section here, as they are similar to those described in the Detection Unit. The output section is of particular interest: it specifies both the objects section (described in the Detection Unit) and the attributes section (described in the Attribute Model Unit).

The converter must be implemented as a subclass of BaseComplexModelOutputConverter. The converter for this example is provided below.

from typing import Any, List, Optional, Tuple

import numpy as np

from savant.base.converter import BaseComplexModelOutputConverter
from savant.base.model import ComplexModel


class RetinafaceConverter(BaseComplexModelOutputConverter):
    def __call__(
        self,
        *output_layers: np.ndarray,
        model: ComplexModel,
        roi: Tuple[float, float, float, float],
    ) -> Tuple[np.ndarray, List[List[Tuple[str, Any, Optional[float]]]]]:
        """Converts raw model output tensors to Savant format.

        :param output_layers: Model output layer tensors
        :param model: Complex model, required parameters: input tensor shape,
            maintain_aspect_ratio flag
        :param roi: Rectangle (region of interest) on which the model infers
        :return: BBox tensor (class_id, confidence, xc, yc, width, height, [angle])
            offset by roi upper left and scaled by roi width and height,
            and list of attribute values with confidences
        """

        bboxes, scores, landmarks = detector_decoder(
            roi,
            *output_layers,  # bboxes  # scores # landmarks
        )

        bbox_tensor = np.concatenate(
            (
                np.zeros((len(bboxes), 1)),
                scores.reshape(-1, 1),
                bboxes,
            ),
            axis=1,
        )

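        # One attribute list per detected face; each attribute is a
        # (name, value, confidence) tuple. No confidence is available
        # for the landmarks, so None is passed.
        attr_name = model.output.attributes[0].name
        attrs = [[(attr_name, x.tolist(), None)] for x in landmarks]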
        return bbox_tensor, attrs

The model used in the example has three outputs. Two relate to detection, and the third contains the facial keypoint coordinates for each detected face. The converter uses the first two outputs, named bboxes and scores, to build the boxes, and passes on the keypoints from the third output, named landmarks, as an attribute of each detected object. Note that the number of boxes and the length of the attribute list must match: the list must contain exactly one entry (a list of attributes) per box.
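The assembly step of the converter can be tried in isolation. The following is a minimal sketch that assumes the decoder has already produced boxes in (xc, yc, width, height) form. The helper name build_outputs and all numeric values are hypothetical and chosen only for illustration; they are not part of the framework.

```python
import numpy as np


def build_outputs(bboxes, scores, landmarks, attr_name="landmarks"):
    """Assemble the converter's return values from decoded outputs.

    build_outputs is a hypothetical helper, not a framework function.
    """
    bbox_tensor = np.concatenate(
        (
            np.zeros((len(bboxes), 1)),  # class_id column: 0 for every box
            scores.reshape(-1, 1),       # confidence column
            bboxes,                      # xc, yc, width, height
        ),
        axis=1,
    )
    # One attribute list per box: (name, value, confidence)
    attrs = [[(attr_name, x.tolist(), None)] for x in landmarks]
    if len(attrs) != len(bbox_tensor):
        raise ValueError("number of boxes and attribute lists must match")
    return bbox_tensor, attrs


# Made-up decoder output for two faces, five (x, y) keypoints each
bboxes = np.array([[50.0, 60.0, 100.0, 120.0], [200.0, 80.0, 90.0, 110.0]])
scores = np.array([0.998, 0.995])
landmarks = np.arange(20, dtype=float).reshape(2, 10)

bbox_tensor, attrs = build_outputs(bboxes, scores, landmarks)
print(bbox_tensor.shape)  # (2, 6)
print(len(attrs))         # 2
```

Each row of bbox_tensor holds class_id, confidence and the four box values, and attrs holds one single-element list per row, which is the pairing the converter relies on.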

The detector_decoder is a separate function written specifically to process the outputs of the RetinaFace model. It is not provided here, as it is not needed to understand the principles of writing converters.
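To give an idea of the coordinate handling that a function like detector_decoder has to perform, here is a simplified sketch of a single step: mapping boxes from ROI-relative coordinates to the (xc, yc, width, height) layout expected in the bbox tensor. It is not the actual decoder. The function name to_savant_boxes is hypothetical, and the sketch assumes that the model boxes are (left, top, right, bottom) normalized to [0, 1] and that roi is laid out as (left, top, width, height); check both assumptions against your model and framework version.

```python
import numpy as np


def to_savant_boxes(norm_ltrb, roi):
    """Map ROI-relative corner boxes to frame-space center boxes.

    Hypothetical helper for illustration only.
    :param norm_ltrb: (N, 4) boxes as (left, top, right, bottom) in [0, 1]
    :param roi: assumed layout (left, top, width, height) in frame pixels
    :return: (N, 4) boxes as (xc, yc, width, height) in frame pixels
    """
    roi_left, roi_top, roi_width, roi_height = roi
    boxes = np.asarray(norm_ltrb, dtype=float).reshape(-1, 4)
    # Scale by the ROI size, then offset by the ROI upper-left corner
    left = boxes[:, 0] * roi_width + roi_left
    top = boxes[:, 1] * roi_height + roi_top
    right = boxes[:, 2] * roi_width + roi_left
    bottom = boxes[:, 3] * roi_height + roi_top
    width = right - left
    height = bottom - top
    return np.stack((left + width / 2, top + height / 2, width, height), axis=1)


converted = to_savant_boxes([[0.25, 0.25, 0.75, 0.5]], roi=(100, 50, 200, 400))
print(converted)  # one box: xc=200, yc=200, width=100, height=100
```

A real decoder does considerably more than this (for example, turning raw network outputs into box coordinates in the first place), so treat the sketch only as an illustration of the offset-and-scale convention mentioned in the converter docstring.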