
Multi-Object Tracking and Human Pose Estimation

Since R2024a

This example shows how to detect multiple people, track them, and estimate their body poses in a video by using pretrained deep learning networks and a global nearest-neighbor (GNN) assignment tracking approach.

People tracking and body pose estimation find applications in areas such as sports analysis, security surveillance, and human-computer interaction. Visual tracking and pose estimation involve these three primary steps:

1. Detection: Detect people in each video frame using a pretrained peopleDetector object.

2. Tracking: Track the detected people across video frames using the trackerGNN object and its functions. Because the motion of people over the short time between frames is approximately constant velocity, the tracker in this example uses a constant-velocity linear Kalman filter.

3. Keypoint Detection: Identify keypoints on the detected people using a pretrained HRNet object keypoint detector, and then estimate their body poses by connecting the detected keypoints.

Create Object Detector

Create a pretrained peopleDetector object to detect people in each video frame. This example uses the "small-network" version of the pretrained peopleDetector deep learning network.

peopleDet = peopleDetector("small-network");

Create Object Keypoint Detector

Create a pretrained HRNet object keypoint detector to detect keypoints in a human body. For this example, use the default HRNet deep learning network, which is trained on the COCO keypoint detection data set. The default network uses the HRNet-W32 network as the base network. In an HRNet-W32 network, the last three stages of the high-resolution subnetworks have 32 convolved feature maps. For more information about HRNet architecture, see Getting Started with HRNet (Computer Vision Toolbox).

keyPtDetector = hrnetObjectKeypointDetector;
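If your application benefits from a higher-capacity model, you can specify the pretrained network by name instead of accepting the default. This is an optional sketch; the model names "human-full-body-w32" (the default) and "human-full-body-w48" are assumed here to be the names that hrnetObjectKeypointDetector accepts.

% Optional: select the HRNet-W48 variant instead of the default W32 model.
% The model name below is an assumption; check the detector documentation.
% keyPtDetector = hrnetObjectKeypointDetector("human-full-body-w48");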

Specify the threshold value for detecting valid keypoints. The Threshold property of the object keypoint detector specifies the confidence level for determining whether a keypoint is valid. If an object you want to detect in the video is occluded, use a low threshold value. For this example, set the threshold value for detecting keypoints to 0.3.

keyPtDetector.Threshold = 0.3;

Read Video Data

Read video data into the MATLAB® workspace by using the VideoReader object. This example uses a video that captures individuals walking in an atrium. The atrium contains plants that partially obstruct people in some frames of the video.

reader = VideoReader("atrium.mp4");
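Optionally, display the basic properties of the video. The frame rate and frame size are used later in this example to initialize the tracking filter.

% Display the number of frames, frame size, and frame rate of the video.
fprintf("Video: %d frames, %d-by-%d pixels, %.2f fps\n", ...
    reader.NumFrames,reader.Width,reader.Height,reader.FrameRate);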

Initialize Multi-Object Tracker

Create a trackerGNN System object™ and set its properties. Because the video in this example was captured using a single camera sensor, set the MaxNumSensors property of the tracker to 1. Set the maximum number of tracks for the GNN tracker to 10. You can increase or decrease this value depending on the complexity of the tracking scenario; if you expect more targets, increase the maximum number of tracks.

tracker = trackerGNN(MaxNumSensors=1,MaxNumTracks=10);

By default, the GNN tracker confirms or deletes a track based on its history, which is the number of recent tracker updates for which the tracker assigns a detection to a track. The ConfirmationThreshold and DeletionThreshold properties specify the threshold values for the tracker to confirm or delete a track, respectively. In this example, the threshold values have been set empirically.

Specify the ConfirmationThreshold property of the tracker as a 1-by-2 vector of the form [M N]. A track is confirmed if the tracker assigns a detection to it in at least M of the last N updates.

tracker.ConfirmationThreshold = [2 5];

Specify the DeletionThreshold property of the tracker as a 1-by-2 vector of the form [P R]. If the tracker has not assigned a confirmed track to any detection P times in the last R tracker updates, then it deletes the track.

Increase the AssignmentThreshold property of the tracker to reduce the number of detections that remain unassigned to tracks.

tracker.DeletionThreshold = [23 23];
tracker.AssignmentThreshold = 30*[5 inf];

Create and initialize a constant-velocity linear Kalman filter using the initvisionbboxkf function. This filter tracks bounding boxes of detections in each video frame. Specify the tracking frame rate and tracking frame size using the input video frame rate and frame size, respectively.

frameRate = reader.FrameRate;
frameSize = [reader.Width reader.Height];
tracker.FilterInitializationFcn = @(detection)initvisionbboxkf(detection,FrameRate=frameRate,FrameSize=frameSize);
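To see what the initialization function produces, you can call it with a sample detection. This is for illustration only; the bounding box values below are hypothetical.

% Illustration only: initialize a filter from a hypothetical detection with
% an [x y width height] bounding box measurement, and inspect the result.
sampleDetection = objectDetection(0,[100 100 50 120]);
sampleFilter = tracker.FilterInitializationFcn(sampleDetection)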

Detect and Track People, and Estimate Body Poses

Follow these steps to detect and track people in the input video and estimate and visualize their body poses.

Step 1: Detect people

Use the helperDetectObjects supporting function to detect people in the input video using the pretrained peopleDetector object. The function returns the bounding boxes of detections for the person class. If a frame contains no people, the function returns an empty array. The function has three tunable inputs: skipFrame, detectionThreshold, and minDetectionSize. You can modify these inputs to balance detection speed against the number of detections.

  • To increase processing speed, specify a larger value for skipFrame to increase the number of frames to bypass during object detection. However, setting this value too high might result in the loss of object tracks.

  • To control the number of detections, change the value of detectionThreshold. The function removes detections that have scores less than this threshold value. To reduce false positives, increase this value.

  • The minDetectionSize argument defines the size of the smallest region containing the object, as a vector of the form [height width], in pixels.

skipFrame = 2;
detectionThreshold = 0.5;
minDetectionSize = [5 5];
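Before running the full processing loop, you can optionally verify the detector on a single frame. This sketch is illustrative; the variable names are not used elsewhere in the example.

% Optional check: run the people detector on the first frame and view the
% raw detections with their scores.
testFrame = read(reader,1);
[testBboxes,testScores] = detect(peopleDet,testFrame,Threshold=detectionThreshold);
figure
imshow(insertObjectAnnotation(testFrame,"rectangle",testBboxes,testScores))
% Reset the read position so the processing loop starts at the first frame.
reader.CurrentTime = 0;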

Step 2: Track People

Use the helperTrackBoundingBoxes supporting function to track the bounding boxes of the detected objects using the trackerGNN System object. The helperTrackBoundingBoxes function takes the bounding box detections from the previous step as input, and outputs the updated track positions for the bounding boxes. If a track is not assigned to a detection in the current frame, the function marks the track as predicted in the annotation. Each track keeps count of the number of consecutive frames for which it remains unassigned. If the count exceeds the DeletionThreshold specified in the Initialize Multi-Object Tracker section, the function assumes that the object has left the field of view and deletes the track.

Step 3: Detect Keypoints and Estimate Body Pose

Use the helperDetectKeypointsUsingHRNet supporting function to detect the keypoints of the tracked people within a frame using the pretrained HRNet object keypoint detector. The function takes the tracked bounding boxes as input, and outputs 17 keypoints, along with their validity, for each detection. The detected keypoints denote the positions of specific body parts of the detected people in each video frame. If a bounding box is empty, the corresponding keypoints and their validities are also empty. The keypoint connections specified by the pretrained detector show the body poses of the detected humans.
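The 17 keypoints follow the COCO convention: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. You can inspect the keypoint names and skeleton connections through the detector properties; the KeypointClasses property name is an assumption here, while KeypointConnections is used later in this example.

% List the keypoint classes (assumed property name) and the keypoint pairs
% that form the pose skeleton.
disp(keyPtDetector.KeypointClasses)
disp(keyPtDetector.KeypointConnections)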

Step 4: Visualize Results

Create a video player to play the video file using the vision.VideoPlayer (Computer Vision Toolbox) object. Set the size and position of the video player window, in pixels.

player = vision.VideoPlayer(Position=[20 400 700 400]);

Use the helperDisplayResults supporting function to display the estimated poses and their associated tracks across each video frame. The helper function uses the insertObjectKeypoints (Computer Vision Toolbox) function to display the keypoints and the keypoint connections. The visualization displays the detected tracks using yellow bounding boxes annotated over the video frame.

frameCount = 0;
numFrames = reader.NumFrames;
% Track multiple people and estimate their body poses throughout the input video.
while hasFrame(reader)
    frame = readFrame(reader);
    frameCount = frameCount + 1;

    % Step 1: Detect people and predict bounding box.
    bboxes = helperDetectObjects(peopleDet,frame,detectionThreshold,frameCount,skipFrame,minDetectionSize);
    
    % Step 2: Track people across video frames.
    [trackBboxes,labels] = helperTrackBoundingBoxes(tracker,reader.FrameRate,frameCount,bboxes);
 
    % Step 3: Detect keypoints of tracked people.
    [keypoints,validity] = helperDetectKeypointsUsingHRNet(frame,keyPtDetector,trackBboxes);

    % Step 4: Display tracked people and their body pose.
    frame = helperDisplayResults(frame,keyPtDetector.KeypointConnections,keypoints,validity,trackBboxes,labels,frameCount);

    % Display video
    player(frame);
end
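After the loop finishes, you can optionally release the video player to free its system resources.

% Optional: release resources used by the video player.
release(player);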

Supporting Functions

The helperDetectObjects supporting function detects people in the input video using the pretrained peopleDetector object.

function box = helperDetectObjects(peopleDet, frame, detectionThreshold, frameCount, skipFrame, minDetectionSize)
% Run detection only on every skipFrame-th frame to reduce computation.
box = [];
if mod(frameCount, skipFrame) == 0
    box = detect(peopleDet, frame, Threshold=detectionThreshold, ...
        MinSize=minDetectionSize);
end
end

The helperTrackBoundingBoxes supporting function computes tracks for the detected people across video frames.

function [boxes,labels] = helperTrackBoundingBoxes(tracker,frameRate,frameCount,boxes)

% Convert bounding boxes into the objectDetection format.
thisFrameBboxes = boxes;
numMeasurementsInFrame = size(thisFrameBboxes,1);
detectionsInFrame = cell(numMeasurementsInFrame,1);
for detCount = 1:numMeasurementsInFrame
    detectionsInFrame{detCount} = objectDetection( ...
        frameCount/frameRate, ...                % Convert frame count to time in seconds
        thisFrameBboxes(detCount,:), ...         % Use bounding box as measurement in pixels
        MeasurementNoise=diag([25 25 25 25]) ... % Bounding box measurement noise in pixels
        );
end

% Update the tracker.
if isLocked(tracker) || ~isempty(detectionsInFrame)
    tracks = tracker(detectionsInFrame,frameCount/frameRate);
else
    tracks = objectTrack.empty;
end
% Select the [x y width height] components from the 8-element constant-velocity state.
positionSelector = [1 0 0 0 0 0 0 0; 0 0 1 0 0 0 0 0; 0 0 0 0 1 0 0 0; 0 0 0 0 0 0 1 0];
boxes = getTrackPositions(tracks,positionSelector);
ids = [tracks.TrackID];
isCoasted = [tracks.IsCoasted];

% Customize labels to show when a track was not corrected with any measurement and is coasted
labels = arrayfun(@(a)num2str(a),ids,"uni",0);
isPredicted = cell(size(labels));
isPredicted(isCoasted) = {'predicted'};
labels = strcat(labels,isPredicted);
end

The helperDetectKeypointsUsingHRNet supporting function detects keypoints of the tracked people within a video frame using the pretrained HRNet object keypoint detector.

function [keypoints,validity] = helperDetectKeypointsUsingHRNet(frame,keyPtDet,boxes)
% Detect keypoints only when bounding boxes exist and all box values are positive.
if ~isempty(boxes) && ~any(boxes<=0,"all")
    [keypoints,~,validity] = detect(keyPtDet,frame,boxes);
else
    keypoints = [];
    validity = [];
end
end

The helperDisplayResults supporting function draws a bounding box, pose skeleton, and label ID for each track in the video frame.

function frame = helperDisplayResults(frame,keypointConnections,keypoints,validity,boxes,labels,frameCount)

% Draw keypoints and their connections.
if ~isempty(validity)
    frame = insertObjectKeypoints(frame,keypoints,"KeypointVisibility",validity, ...
        Connections=keypointConnections,ConnectionColor="green", ...
        KeypointColor="yellow");
    % Draw the bounding boxes in the frame.
    frame = insertObjectAnnotation(frame,"rectangle",boxes,labels, ...
        TextBoxOpacity=0.5);
    % Add the frame count in the top-left corner.
    frame = insertText(frame,[0 0],"Frame: "+int2str(frameCount), ...
        BoxColor="black",TextColor="yellow",BoxOpacity=1);
end
end
