Multi-Object Tracking with DeepSORT

This example shows how to integrate appearance features from a re-identification (Re-ID) deep neural network with a multi-object tracker to improve the performance of camera-based object tracking. The implementation closely follows the Deep Simple Online and Realtime Tracking (DeepSORT) multi-object tracking algorithm [1]. This example uses the Sensor Fusion and Tracking Toolbox™ and the Computer Vision Toolbox™.

Introduction

The objectives of multi-object tracking are to estimate the number of objects in a scene, to accurately estimate their positions, and to establish and maintain unique identities for all objects. You often achieve this through a tracking-by-detection approach that consists of two consecutive tasks. First, you obtain the detections of objects in each frame. Second, you perform track association and management across frames.

This example builds upon the SORT algorithm, introduced in the Implement Simple Online and Realtime Tracking (Sensor Fusion and Tracking Toolbox) example. The data association and track management of SORT are efficient and simple to implement, but they are ineffective when tracking objects over occlusions in single-view camera scenes.

The increasingly popular Re-ID networks provide appearance features, sometimes called appearance embeddings, for each object detection. Appearance features are a representation of the visual appearance of an object. They offer an additional measure of the similarity (or distance) between a detection and a track. The integration of appearance information into the data association is a powerful technique to handle tracking over longer occlusions and therefore reduces the number of switches in track identities.

Pre-Trained Person Re-Identification Network

Download the pre-trained re-identification network from the internet. Refer to the Reidentify People Throughout a Video Sequence Using ReID Network example to learn about this network and how to train it. You use this pre-trained network to evaluate appearance features for each detection.

helperDownloadReIDResNet();
Downloading Pretrained Person ReID Network (~198 MB)

Load the Re-ID network.

load("personReIDResNet_v2.mat","net");

To obtain the appearance feature vector of a detection, you extract the bounding box coordinates and convert them to image frame indices. You can then crop out the bounding box of the detection and use the extractReidentificationFeatures method of the reidentificationNetwork object to obtain the appearance features. The associationExampleData MAT-file contains a detection object and a frame. The following code illustrates the use of the extractReidentificationFeatures method.

load("associationExampleData.mat","newDetection","frame");

% Crop frame to measurement bounding box
bbox = newDetection.Measurement;
croppedPerson = imcrop(frame, bbox);
imshow(croppedPerson);

% Extract appearance features of the cropped pedestrian.
appearanceVect = extractReidentificationFeatures(net,croppedPerson)
appearanceVect = 2048×1 single column vector

   -0.4880
    0.3705
   -0.4299
   -0.0240
    0.6064
    0.4683
    0.0888
    0.4270
   -0.0068
   -0.0947
      ⋮

Use the supporting function runReIDNet to iterate over a set of detections and perform the steps above.
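
For example, assuming detectionsInFrame is a hypothetical array of objectDetection objects for one video frame, a call of the following form appends the appearance features to each detection.

% Append appearance features to every detection in the current frame
detectionsInFrame = runReIDNet(net, frame, detectionsInFrame);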

Assignment Distances

In this section, you learn about the three types of distances that the DeepSORT assignment strategy relies on.

Consider the frame and detection from the previous section, depicted in the image below. In the current frame, an object detector returns a detection (Det: 1, in yellow), which must be associated with the existing tracks maintained by the multi-object tracker. The tracker hypothesizes that an object with TrackID 1 exists in the current frame, and its estimated bounding box is shown in orange. The track shown in the image is also saved in the associationExampleData MAT-file.

Each distance type may return values in a different range, but larger values always indicate that the detection and the track are less likely to correspond to the same object.

load("associationExampleData.mat","predictedTrack");

Bounding Box Intersection Over Union

This is the distance metric used in SORT. It formulates a distance between a track and a detection based on the overlap ratio of the two bounding boxes.

$distance_{IoU} = 1 - \frac{\text{Area of Intersection}}{\text{Area of Union}}$

The output, distanceIoU, is a scalar between 0 and 1. Evaluate the intersection-over-union distance using the helperDeepSORT.distanceIoU function.

helperDeepSORT.distanceIoU(predictedTrack, newDetection)
ans = 0.5688
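
As a cross-check, you can reproduce this value without the helper, assuming the track state follows the [x vx y vy w vw h vh] ordering used later in this example, and using the bboxOverlapRatio function from Computer Vision Toolbox.

% Manual IoU distance cross-check (sketch, not the helper implementation)
trackBBox = predictedTrack.State([1 3 5 7])';   % predicted [x y w h] bounding box
detBBox = newDetection.Measurement;             % detected [x y w h] bounding box
1 - bboxOverlapRatio(trackBBox, detBBox)        % should be close to the value above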

Mahalanobis Distance

Another common approach to evaluate the distance between detections and tracks is the Mahalanobis distance, a statistical distance between probability density functions. It accounts for the uncertainty in the current bounding box location estimate and the uncertainty in the measurement. The distance is given by the following equation

$distance_{Mahalanobis} = (z - Hx)^T S^{-1} (z - Hx)$

z is the bounding box measurement of the detection and x is the track state. H is the Jacobian of the measurement function, which can also be interpreted as the projection from the 8-dimensional state space to the 4-dimensional measurement space in this example. In other words, Hx is the predicted measurement. S is the innovation covariance matrix with the following definition.

$S = HPH^T + R$

where R is the measurement noise covariance.

Evaluate the Mahalanobis distance between the predicted track and the detection.

predictedMeasurement = predictedTrack.State([1 3 5 7])' % Same as Hx
predictedMeasurement = 1×4

  962.4930  353.9284   54.4362  174.6672

innovation = newDetection.Measurement-predictedMeasurement % z - Hx
innovation = 1×4

   16.9370   -0.4384   15.2838    0.7628

S = predictedTrack.StateCovariance([1 3 5 7],[1 3 5 7]) + newDetection.MeasurementNoise % Same as HPH' + R
S = 4×4

   49.1729         0         0         0
         0   49.1729         0         0
         0         0   49.1729         0
         0         0         0   49.1729
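
As a sanity check, you can reproduce the distance directly from the innovation and innovation covariance computed above. This sketch uses the squared form given in the equation, without a square root, and the row-vector orientation of the innovation.

% Manual Mahalanobis distance cross-check
dMahalanobis = innovation / S * innovation'   % equivalent to (z-Hx)'*inv(S)*(z-Hx)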

Use the helperDeepSORT.distanceMahalanobis function to calculate the distance.

helperDeepSORT.distanceMahalanobis(predictedTrack,newDetection)
ans = 10.5999

The output of distanceMahalanobis is a positive scalar. Unlike the other two distances, it is not bounded.

Appearance Cosine Distance

This distance metric evaluates the distance between a detection and the predicted track in the appearance feature space.

In DeepSORT [1], each track keeps a history of appearance feature vectors from previous detection assignments. Inspect the Appearance field of the saved track, under the ObjectAttributes property. In this example, appearance vectors have 2048 elements. The history of the following predicted track contains 3 vectors.

appearanceHistory = predictedTrack.ObjectAttributes.Appearance
appearanceHistory = 2048×3 single matrix

   -0.2481   -0.5268   -0.7212
    0.5355    1.1441    2.0087
   -0.5731   -0.8569   -1.7000
   -0.1705    0.1594    0.0325
    0.6062    1.3976    1.8887
    0.4375    0.4383    1.0141
   -0.2393    0.0501    0.2047
    0.1737   -0.0448   -0.3690
   -0.3931   -0.7453   -1.9172
   -0.1576   -0.1666    0.0391
      ⋮

The distance between two appearance vectors is derived directly from their scalar product.

$d = 1 - \frac{\langle appearance_1, appearance_2 \rangle}{\lVert appearance_1 \rVert \, \lVert appearance_2 \rVert}$

With this formula, you can calculate the distance between the appearance vector of a detection and the track history as follows.

detectionAppearance = newDetection.ObjectAttributes.Appearance;
1- (detectionAppearance./vecnorm(detectionAppearance))' *(appearanceHistory./vecnorm(appearanceHistory))
ans = 1×3 single row vector

    0.1729    0.1154    0.1303

Define the appearance cosine distance between a track and a detection as the minimum distance across the history of the track appearance vectors, also called a gallery. Use the helperDeepSORT.distanceCosine function to calculate it.

helperDeepSORT.distanceCosine(predictedTrack, newDetection)
ans = single
    0.1154

The appearance cosine distance returns a scalar between 0 and 2.

In this example, you use the three distance metrics to formulate the overall assignment problem in terms of cost minimization. You calculate the distances for all possible pairs of detections and tracks to form cost matrices.
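
As an illustration, the following sketch assembles the three cost matrices for hypothetical arrays tracks and dets (objectTrack and objectDetection arrays, respectively) by looping over all pairs with the helper distance functions shown above.

% Sketch: build cost matrices for all track-detection pairs (hypothetical inputs)
numTracks = numel(tracks);
numDets = numel(dets);
iouCost = zeros(numTracks, numDets);
mahalCost = zeros(numTracks, numDets);
cosineCost = zeros(numTracks, numDets);
for t = 1:numTracks
    for d = 1:numDets
        iouCost(t,d) = helperDeepSORT.distanceIoU(tracks(t), dets(d));
        mahalCost(t,d) = helperDeepSORT.distanceMahalanobis(tracks(t), dets(d));
        cosineCost(t,d) = helperDeepSORT.distanceCosine(tracks(t), dets(d));
    end
end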

Matching Cascade

The original idea behind DeepSORT is to combine the Mahalanobis distance and the appearance feature cosine distance to assign a set of new detections to the set of current tracks. The combination is done using a weight parameter λ that has a value between 0 and 1.

$Cost = \lambda \, MahalanobisCost + (1 - \lambda) \, CosineCost$

Both the Mahalanobis and the appearance cosine cost matrices are subject to gating thresholds. Thresholding is done by setting the cost matrix elements that are larger than their respective thresholds to Inf.
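
The following lines sketch the combination and the gating, assuming the cost matrices mahalCost and cosineCost from the previous sketch and the threshold values used in the tracker configuration later in this example.

% Sketch: combine the Mahalanobis and appearance costs and apply gating
lambda = 0.02;                      % appearance weight
mahalGate = 10;                     % MahalanobisAssignmentThreshold
appearanceGate = 0.4;               % AppearanceAssignmentThreshold
combinedCost = lambda*mahalCost + (1-lambda)*cosineCost;
combinedCost(mahalCost > mahalGate | cosineCost > appearanceGate) = Inf;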

Due to the growth of the state covariance of unassigned tracks, the Mahalanobis distance tends to favor tracks that have not been updated in the last few frames over tracks with a smaller prediction error. DeepSORT handles this effect by splitting the tracks into groups according to the last frame in which they were assigned. The algorithm first assigns the tracks that were updated in the previous frame, matching them to the new detections with linear assignment. Any remaining detections are then considered for assignment with the next track group. Once all track groups have had a chance to be assigned, the remaining unassigned tracks with an unassigned age of 1 and the remaining unassigned detections are matched by linear assignment based on their IoU cost matrix. The flowchart below describes the matching cascade.
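
The sketch below outlines the cascade in simplified form. It is not the helperDeepSORT implementation: it assumes the gated combinedCost matrix from the previous sketch, a hypothetical vector trackAge holding the number of frames since each track was last assigned, and it uses the assignDetectionsToTracks function for the linear assignment step.

% Simplified matching cascade sketch (assumptions noted above)
unassignedDets = 1:size(combinedCost, 2);
allAssignments = zeros(0, 2);                       % [trackIndex detectionIndex]
for age = 1:max(trackAge)
    group = find(trackAge == age);                  % tracks last assigned "age" frames ago
    if isempty(group) || isempty(unassignedDets)
        continue
    end
    subCost = combinedCost(group, unassignedDets);
    subCost(~isfinite(subCost)) = 1e6;              % replace Inf gating with a large cost
    pairs = assignDetectionsToTracks(subCost, 1e3); % linear assignment within this group
    for k = 1:size(pairs, 1)
        allAssignments(end+1, :) = [group(pairs(k,1)), unassignedDets(pairs(k,2))]; %#ok<AGROW>
    end
    unassignedDets(pairs(:, 2)) = [];               % remove newly assigned detections
end
% Final stage (not shown): assign the remaining unassigned tracks of unassigned
% age 1 to the remaining detections using the IoU cost matrix and the
% IOUAssignmentThreshold gate.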

The helperDeepSORT class implements the assignment routine. You can modify the code and try your own assignment instead.

Build DeepSORT Tracker

In this section, you construct the DeepSORT tracker. The remaining components are the estimation filters, the feature update, and the track initialization and deletion routine. The diagram below gives a summary of all the components involved in tracking-by-detection with DeepSORT.

Matching Cascade

The following properties configure the matching cascade assignment of DeepSORT described in the previous section.

  • AppearanceWeight

  • MahalanobisAssignmentThreshold

  • AppearanceAssignmentThreshold

  • IOUAssignmentThreshold

Set IOUAssignmentThreshold to 0.95 to allow assignment of detections to new tentative tracks with as little as 5% bounding box overlap. In this video, the low frame-rate, the closeness of the camera to the scene, and the small number of people in the scene lead to small overlaps between detections of the same object in consecutive frames. You can set the threshold to a lower value for videos with a higher frame-rate or more crowded scenes.

Next, set the MahalanobisAssignmentThreshold and AppearanceAssignmentThreshold properties. The Mahalanobis distance follows a chi-square distribution. Therefore, draw the threshold from the inverse chi-square distribution for a confidence interval of about 95%. For a 4-dimensional measurement space, the value is 9.4877. Manual tuning leads to an appearance threshold of 0.4.
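
You can reproduce this gating value with the chi2inv function, which requires the Statistics and Machine Learning Toolbox.

% 95% gate for a 4-dimensional measurement space
chi2inv(0.95, 4)    % returns approximately 9.4877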

In [1], setting the AppearanceWeight λ to 0 gives better results. In this scene, the combination of the Mahalanobis threshold and the appearance threshold resolves most assignment ambiguities. Therefore, you can choose any value between 0 and 1. For more crowded scenes, consider including some Mahalanobis distance by using a non-zero appearance weight, as noted in [2].

Estimation Filters

As in SORT, the bounding boxes are estimated with a linear Kalman Filter using a constant velocity motion model. You use the initvisionbboxkf filter initialization function. Set the following properties of the helperDeepSORT object to configure the filter. Refer to the initvisionbboxkf (Sensor Fusion and Tracking Toolbox) documentation for more details.

  • FrameRate

  • FrameSize

  • NoiseIntensity

The video has a frame-rate of 1 Hz and a frame size of [1288 964] pixels. The Kalman filter noise intensity is a tuning parameter. In this example, the value 0.001 leads to satisfactory results.

Track Initialization and Deletion

A new track is confirmed if it has been assigned for 2 consecutive frames. An existing track is deleted if it is missed for more than TLost frames. In this example, you set TLost = 5. This is long enough to account for all the occlusions in the video, which has a low frame-rate (1 Hz). For videos with a higher frame-rate, increase this value accordingly. The following properties, inherited from the trackerGNN System object, specify the confirmation and deletion logic.

  • ConfirmationThreshold

  • DeletionThreshold

Set ConfirmationThreshold to [2 2] and DeletionThreshold to [TLost TLost] according to the above.

Appearance Feature Update

At each update, DeepSORT stores the appearance feature vector of each assigned detection in the corresponding track. Configure the tracker using the following properties.

  • AppearanceUpdate

  • MaxNumAppearanceFrames

  • AppearanceMomentum

In the original algorithm, DeepSORT stores a gallery of appearance vectors from past frames. Set the AppearanceUpdate property to "Gallery" and set the MaxNumAppearanceFrames property to choose the depth of the gallery. You first use a value of 50 frames. Consider increasing this value for high frame-rate videos.

There exist variants of DeepSORT that use a different update mechanism [2,3]. You can also set AppearanceUpdate to "EMA" to use an exponential moving average update. In this configuration, each track stores only a single appearance vector and updates it with the assigned detection's appearance using the equation:

$TrackAppearance_{k+1} = \alpha \, TrackAppearance_k + (1 - \alpha) \, DetectionAppearance$

where α is a real number between 0 and 1 called the momentum term.

In this configuration, the MaxNumAppearanceFrames property is not used. Similarly, in the previous gallery configuration, the AppearanceMomentum property is not used.
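
A minimal sketch of the EMA update, assuming trackAppearance and detectionAppearance are hypothetical 2048-by-1 appearance vectors and alpha is the AppearanceMomentum value used later in this example.

% Exponential moving average appearance update (sketch)
alpha = 0.9;    % momentum term (AppearanceMomentum)
trackAppearance = alpha*trackAppearance + (1-alpha)*detectionAppearance;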

A second variant combines the exponential moving average with the gallery method. In this method, each track stores a gallery of EMA appearance vectors. Set AppearanceUpdate to "EMA Gallery" to use this option. Both MaxNumAppearanceFrames and AppearanceMomentum are applicable properties in this configuration.

The gallery method captures long-term appearance changes for tracks because it stores appearances from previous frames and does not favor the appearance from the latest frames over older ones. For the same reason, the gallery method is not robust to association errors: once an erroneous appearance feature is stored in the gallery, it corrupts the distance evaluation in later frames, because the cosine distance takes the minimum across the gallery. The exponential moving average update is more robust to erroneous associations since the error is averaged out, at the expense of not capturing long-term appearance changes. The exponential moving average gallery offers a compromise between the two methods.

Configure DeepSORT Tracker

With all the previous considerations, create a DeepSORT tracker.

lambda = 0.02;
Tlost = 5;

tracker = helperDeepSORT(ConfirmationThreshold = [2 2],...
    DeletionThreshold = [Tlost Tlost],...
    AppearanceUpdate = "Gallery",...
    MaxNumAppearanceFrames = 50,...
    MahalanobisAssignmentThreshold = 10,...
    AppearanceAssignmentThreshold = 0.4,...
    IOUAssignmentThreshold = 0.95,...
    AppearanceWeight = lambda,...
    FrameSize = [1288 964], ...
    FrameRate = 1,...
    NoiseIntensity = 1e-3*ones(1,4));

Evaluate DeepSORT

In this section, you exercise the tracker on the pedestrian tracking video and evaluate its performance using tracking metrics.

Pedestrian Tracking Dataset

Download the pedestrian tracking video file.

helperDownloadPedestrianTrackingVideo();

The PedestrianTrackingYOLODetections MAT-file contains detections generated from a YOLO v4 object detector that uses a CSP-DarkNet-53 network trained on the COCO dataset. See the yolov4ObjectDetector object for more details. The PedestrianTrackingGroundTruth MAT-file contains the ground truth for this video. Refer to the Import Camera-Based Datasets in MOT Challenge Format for Object Tracking (Sensor Fusion and Tracking Toolbox) example to learn how to import the ground truth and detection data into appropriate Sensor Fusion and Tracking Toolbox™ formats.

datasetname="PedestrianTracking";
load(datasetname+"GroundTruth.mat","truths");
load(datasetname+"YOLODetections.mat","detections");

Set the measurement noise covariance matrix using a standard deviation of 5 pixels for the corner coordinates and the width and height of the bounding box. The measurement noise covariance depends on the statistics of the detector. Modify this value accordingly if you use a different object detector.

R = diag([25, 25, 25, 25]);
for i=1:numel(detections)
    for j=1:numel(detections{i})
        detections{i}(j).MeasurementNoise = R;
    end
end

Run the Tracker

Next, exercise the complete tracking workflow on the Pedestrian Tracking video. To use the tracker, call it with an array of objectDetection objects as the input, as if it were a function. The tracker returns confirmed tracks, tentative tracks, all tracks, and an analysis info structure, similar to the trackerGNN object.

Filter out the YOLO detections with a confidence score lower than 0.5. Delete tracks if their bounding box is entirely out of the camera frame. This prevents the tracker from maintaining tracks that are outside of the camera field of view for more than 5 frames.

% Display
reader = VideoReader("PedestrianTrackingVideo.avi");

% Initialize track log
deepSORTTrackLog = objectTrack.empty;

% Set minimum detection score
detectionScoreThreshold = 0.5;

% Choose appearance update method
tracker.AppearanceUpdate = "Gallery";
tracker.AppearanceMomentum = 0.9;

% Choose cost appearance weight
tracker.AppearanceWeight = 0;

% Toggle on/off visualization
player = vision.DeployableVideoPlayer;

reset(tracker);      
for i=1:reader.NumFrames

    % Advance reader
    frame = readFrame(reader);

    % Parse detections set to retrieve detections on the ith frame
    curFrameDetections = detections{i};
    attributes = arrayfun(@(x) x.ObjectAttributes, curFrameDetections);
    scores = arrayfun(@(x) x.Score, attributes);
    highScoreDetections = curFrameDetections(scores > detectionScoreThreshold);

    % Run Re-ID Network on detections
    highScoreDetections = runReIDNet(net, frame, highScoreDetections);
    [tracks, tenttracks, ~, info] = tracker(highScoreDetections);
    
    deleteOutOfFrameTracks(tracker, tracks);

    frameWithTracks = helperAnnotateDeepSORTTrack(tracks, frame);
    step(player, frameWithTracks);

    % Log tracks for evaluation
    deepSORTTrackLog = [deepSORTTrackLog ; tracks]; %#ok<AGROW>
end

From the results, the person tracked with ID = 3 is occluded multiple times and makes abrupt changes of direction. This makes the person difficult to track with motion information alone, that is, with the Mahalanobis distance or the bounding box overlap. The use of appearance features allows the tracker to maintain a unique track identifier for this person over the entire sequence and for the rest of the video. This is not achieved with the simpler SORT algorithm or when configuring DeepSORT to use only the Mahalanobis distance. You can verify this by setting the AppearanceWeight parameter to 1 and relaxing the appearance gate by setting AppearanceAssignmentThreshold to 2.
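
For example, you can apply the following configuration and rerun the tracking loop to observe the identity switches.

% Mahalanobis-only association: the cosine cost no longer contributes
tracker.AppearanceWeight = 1;
% Relax the appearance gate to its maximum possible distance
tracker.AppearanceAssignmentThreshold = 2;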

Tracking Metrics

The CLEAR multi-object tracking metrics provide a standard set of metrics to evaluate the quality of a tracking algorithm. These metrics are popular for video-based tracking applications. Use the trackCLEARMetrics (Sensor Fusion and Tracking Toolbox) object to evaluate the CLEAR metrics for the DeepSORT results.

The CLEAR metrics require a similarity method to match track and true object pairs in each frame. In this example, you use the IoU2d similarity method and set the SimilarityThreshold property to 0.01. This means that a track can only be considered a true positive match with a truth object if their bounding boxes overlap by at least 1%. The metric results can vary depending on the choice of this threshold.

tcm = trackCLEARMetrics(SimilarityMethod ="IoU2d", SimilarityThreshold = 0.01);

The first step is to convert the objectTrack format to the trackCLEARMetrics input format specific to the IoU2d similarity method. Convert the track log.

deepSORTTrackedObjects = repmat(struct("Time",0,"TrackID",1,"BoundingBox", [0 0 0 0]),size(deepSORTTrackLog));
for i=1:numel(deepSORTTrackedObjects)
    deepSORTTrackedObjects(i).Time = deepSORTTrackLog(i).UpdateTime;
    deepSORTTrackedObjects(i).TrackID = deepSORTTrackLog(i).TrackID;
    deepSORTTrackedObjects(i).BoundingBox(:) = getTrackPositions(deepSORTTrackLog(i), [1 0 0 0 0 0 0 0; 0 0 1 0 0 0 0 0 ; 0 0 0 0 1 0 0 0; 0 0 0 0 0 0 1 0])';
end

To evaluate the results on the Pedestrian class only, keep only the ground truth elements with ClassID equal to 1 and filter out the other classes.

truths = truths([truths.ClassID]==1);

Use the evaluate object function to obtain the metrics as a table.

deepSORTresults = evaluate(tcm, deepSORTTrackedObjects, truths);
disp(deepSORTresults)
    MOTA (%)    MOTP (%)    Mostly Tracked (%)    Partially Tracked (%)    Mostly Lost (%)    False Positive    False Negative    Recall (%)    Precision (%)    False Track Rate    ID Switches    Fragmentations
    ________    ________    __________________    _____________________    _______________    ______________    ______________    __________    _____________    ________________    ___________    ______________

     89.037      92.064           84.615                 15.385                   0                 25                41            93.189         95.734            0.14793              0               3       

The CLEAR MOT metrics corroborate the quality of DeepSORT in keeping track identities over time, with no ID switches and very few fragmentations. This is the main benefit of using DeepSORT over SORT. Meanwhile, keeping tracks alive over occlusions means that predicted (coasted) track locations are compared against true positions, which increases the number of false positives and false negatives when the overlap between the coasted tracks and the true bounding boxes is less than the metric threshold. This is reflected in the MOTA score of DeepSORT.

Refer to the trackCLEARMetrics (Sensor Fusion and Tracking Toolbox) page for additional information about all the CLEAR metrics quantities.

Note that the matching cascade is the original DeepSORT mechanism for handling the spread of the state covariance during occlusions. The Mahalanobis distance can be modified to be more robust to this effect, and a single-step assignment can lead to identical or even better performance, as shown in [2].

Conclusion

In this example, you learned how to implement the DeepSORT object tracking algorithm. This is an example of attribute fusion, using deep appearance features for the assignment. The appearance attribute is updated using a simple memory buffer. You also learned how to integrate a re-identification deep learning network into the tracking-by-detection framework to improve the performance of camera-based tracking in the presence of occlusions.

Supporting Functions

function detections = runReIDNet(net, frame, detections)

if isempty(detections)
    detections = objectDetection.empty;
else
    for j =1:numel(detections)

        % Crop frame
        bbox = detections(j).Measurement;
        croppedPerson = imcrop(frame,bbox);

        % Extract appearance features of the cropped pedestrian.
        appearanceVect = extractReidentificationFeatures(net,croppedPerson);
        detections(j).ObjectAttributes.Appearance = appearanceVect;
    end
end
end

deleteOutOfFrameTracks deletes tracks if their bounding box is entirely out of the video frame.

function deleteOutOfFrameTracks(tracker, confirmedTracks)
% Get bounding boxes
allboxes = helperDeepSORT.getTrackRectangles(confirmedTracks);
allboxes = max(allboxes, realmin);
alloverlaps = bboxOverlapRatio(allboxes,[1,1,1288,964]);
isOutOfFrame = ~alloverlaps;
allTrackIDs = [confirmedTracks.TrackID];
trackToDelete = allTrackIDs(isOutOfFrame);
for i=1:numel(trackToDelete)
    tracker.deleteTrack(trackToDelete(i));
end
end

References

[1] Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple Online and Realtime Tracking with a Deep Association Metric." In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645-3649. 2017.

[2] Du, Yunhao, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. "StrongSORT: Make DeepSORT Great Again." IEEE Transactions on Multimedia (2023).

[3] Du, Yunhao, Junfeng Wan, Yanyun Zhao, Binyu Zhang, Zhihang Tong, and Junhao Dong. "GIAOTracker: A Comprehensive Framework for MCMOT with Global Information and Optimizing Strategies in VisDrone 2021." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2809-2819. 2021.