In this tutorial, you will learn how to use YOLOv8 to detect objects and how to use DeepSORT to track these objects in a video.
At the time of writing this article (April 2023), YOLOv8 is the latest version of the YOLO object detection algorithm. It is a state-of-the-art object detector developed by Ultralytics.
Ultralytics made YOLOv8 a lot easier to work with. With just a few lines of code, you can easily detect objects in an image or a video.
Object tracking differs from object detection in that it follows the same object across frames. So basically, object tracking is object detection over time.
With object detection, the focus is on identifying and localizing objects within each frame of a video or image, without having any information about the objects over time. On the other hand, with object tracking, the goal is to not only detect and localize objects but also to keep track of their movements across multiple frames of the video.
Object tracking is particularly useful in scenarios where we need to monitor the trajectory of an object or group of objects over time. For example, tracking can be used for intrusion detection, traffic monitoring, and more.
To achieve object tracking, we need to build a model that can assign unique IDs to objects and maintain their identity across frames, even when they are partially or completely occluded by other objects or undergo significant changes in appearance.
This is where DeepSORT comes in - it provides a framework for assigning unique IDs to objects and tracking their movements over time.
In order to use YOLOv8 and DeepSORT, we need to install the Ultralytics and DeepSORT Python packages.
There are some issues with the original DeepSORT implementation, so I forked the repository and made some adjustments to make it work with the latest version of TensorFlow. My fork also lets us get the class names of the detected objects (which was not possible with the original implementation).
To install the Ultralytics and my forked version of the DeepSORT packages, run the following commands in your terminal (make sure you are inside the project directory):
pip install ultralytics
git clone git@github.com:python-dontrepeatyourself/deep_sort.git
Let's now review our project structure for this tutorial.
Here is how I structured this project:
$ tree --filelimit 8
.
├── 1.mp4
├── config
│   ├── coco.names
│   └── mars-small128.pb
├── deep_sort [9 entries exceeds filelimit, not opening dir]
├── helper.py
├── object_detection_tracking.py
├── output.mp4
├── ultralytics [16 entries exceeds filelimit, not opening dir]
└── yolov8n.pt
Here is a brief description of the files and folders:
- 1.mp4: the input video we will run detection and tracking on.
- config/coco.names: the names of the COCO classes the YOLOv8 model was trained on.
- config/mars-small128.pb: the pre-trained model DeepSORT uses to extract features from the bounding boxes.
- deep_sort: the forked DeepSORT package we cloned earlier.
- helper.py: a small module with the create_video_writer function used to create the output video writer.
- object_detection_tracking.py: the main script we will write in this tutorial.
- output.mp4: the output video with the detections and tracks drawn on it.
- ultralytics: the Ultralytics package containing the YOLOv8 code.
- yolov8n.pt: the default YOLOv8 nano weights (downloaded automatically the first time the code runs).
Things can get a little bit confusing when it comes to object detection and tracking. So to make things easier, we will first see how to detect objects in a video using YOLOv8 and then we will see how we can integrate the DeepSORT tracker with our YOLOv8 object detector to track those detected objects.
Let's start with object detection.
Creating an object detector with YOLOv8 is very easy. All we need to do is import the YOLO class from the Ultralytics package and apply it to an image or a video.
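For example, here is a minimal snippet that runs the pretrained nano model on a single image (the file name image.jpg is just a placeholder for this example):
from ultralytics import YOLO

# load the pretrained YOLOv8 nano weights and run them on a single image
# ("image.jpg" is a placeholder path for this example)
model = YOLO("yolov8n.pt")
results = model("image.jpg")
print(results[0].boxes)  # bounding boxes of the detected objects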
Let's first create a new Python file called object_detection_tracking.py and import the necessary packages:
import numpy as np
import datetime
import cv2
from ultralytics import YOLO
from helper import create_video_writer
conf_threshold = 0.5
# Initialize the video capture and the video writer objects
video_cap = cv2.VideoCapture("1.mp4")
writer = create_video_writer(video_cap, "output.mp4")
# Initialize the YOLOv8 model using the default weights
model = YOLO("yolov8n.pt")
We will use the conf_threshold variable to set the confidence threshold for object detection. This means that we will only keep detections with a confidence score greater than this threshold.
The create_video_writer function is a helper function that we will use to create a video writer object. We will use this object to write the output video.
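The helper.py file isn't listed in this tutorial, but a minimal version of create_video_writer could look something like the sketch below (the mp4v FourCC code is an assumption; any codec supported by your OpenCV build will work):
# helper.py -- a minimal sketch; the version used in this tutorial may differ
import cv2


def create_video_writer(video_cap, output_filename):
    # grab the width, height, and fps of the frames in the video stream
    frame_width = int(video_cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(video_cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(video_cap.get(cv2.CAP_PROP_FPS))

    # initialize the FourCC code and the video writer object
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_filename, fourcc, fps,
                             (frame_width, frame_height))
    return writer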
The model variable is a YOLOv8 object that we will use to detect objects in the video.
Here we are using the default YOLOv8 weights that are provided by Ultralytics (yolov8n.pt), which are trained on the COCO dataset. The weights will be downloaded automatically when you first run the code.
You can also use your own custom weights, but you will need to train the model on your own dataset first.
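Loading custom weights works the same way; you just point YOLO at your trained weights file (the path below is only a placeholder):
# load your own trained weights instead of the default COCO weights
# ("path/to/best.pt" is a placeholder)
model = YOLO("path/to/best.pt")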
Let's now loop over the frames of the video and detect objects in each frame:
# loop over the frames
while True:
    # start time to compute the fps
    start = datetime.datetime.now()

    ret, frame = video_cap.read()

    # if there is no frame, we have reached the end of the video
    if not ret:
        print("End of the video file...")
        break

    ############################################################
    ### Detect the objects in the frame using the YOLO model ###
    ############################################################

    # run the YOLO model on the frame
    results = model(frame)
Inside the loop, we first read the next frame from the video capture object.
The ret variable is a boolean that indicates whether the frame was successfully read. If the frame was successfully read, the frame variable will contain the frame. Otherwise, the frame variable will be None.
So if the ret variable is False, we have reached the end of the video and we break out of the loop.
Then we run the YOLOv8 model on the frame. This returns a list of ultralytics.yolo.engine.results.Results objects (one per input image); each Results object has the following attributes:
boxes: ultralytics.yolo.engine.results.Boxes object
keypoints: None
keys: ['boxes']
masks: None
names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}
orig_img: array([[[ 99, 145, 138],
[103, 149, 142],
[107, 153, 146],
...,
[132, 150, 138],
[132, 150, 138],
[125, 143, 131]],
...,
[[111, 164, 156],
[105, 158, 150],
[105, 158, 150],
...,
[133, 138, 144],
[133, 138, 144],
[133, 138, 144]]], dtype=uint8)
orig_shape: (720, 1280)
path: 'image0.jpg'
probs: None
speed: {'preprocess': 0.5915164947509766, 'inference': 34.77835655212402, 'postprocess': 0.5271434783935547}
The boxes attribute is an ultralytics.yolo.engine.results.Boxes object that contains the bounding boxes of the detected objects and some other information:
print(results[0].boxes)
# output:
boxes: tensor([[7.8548e+02, 5.1154e-01, 1.0214e+03, 6.2262e+02, 9.2543e-01, 0.0000e+00],
[5.0879e+02, 2.5563e+02, 6.3798e+02, 6.2519e+02, 8.5625e-01, 0.0000e+00],
[3.0231e+02, 3.6799e+02, 7.0716e+02, 6.3381e+02, 5.6319e-01, 1.3000e+01],
[3.0361e+02, 3.6963e+02, 5.5384e+02, 6.3172e+02, 3.0199e-01, 1.3000e+01]])
cls: tensor([ 0., 0., 13., 13.])
conf: tensor([0.9254, 0.8562, 0.5632, 0.3020])
data: tensor([[7.8548e+02, 5.1154e-01, 1.0214e+03, 6.2262e+02, 9.2543e-01, 0.0000e+00],
[5.0879e+02, 2.5563e+02, 6.3798e+02, 6.2519e+02, 8.5625e-01, 0.0000e+00],
[3.0231e+02, 3.6799e+02, 7.0716e+02, 6.3381e+02, 5.6319e-01, 1.3000e+01],
[3.0361e+02, 3.6963e+02, 5.5384e+02, 6.3172e+02, 3.0199e-01, 1.3000e+01]])
id: None
is_track: False
orig_shape: tensor([ 720, 1280])
shape: torch.Size([4, 6])
xywh: tensor([[903.4377, 311.5681, 235.9163, 622.1130],
[573.3878, 440.4119, 129.1873, 369.5559],
[504.7360, 500.8981, 404.8489, 265.8228],
[428.7267, 500.6769, 250.2260, 262.0896]])
xywhn: tensor([[0.7058, 0.4327, 0.1843, 0.8640],
[0.4480, 0.6117, 0.1009, 0.5133],
[0.3943, 0.6957, 0.3163, 0.3692],
[0.3349, 0.6954, 0.1955, 0.3640]])
xyxy: tensor([[7.8548e+02, 5.1154e-01, 1.0214e+03, 6.2262e+02],
[5.0879e+02, 2.5563e+02, 6.3798e+02, 6.2519e+02],
[3.0231e+02, 3.6799e+02, 7.0716e+02, 6.3381e+02],
[3.0361e+02, 3.6963e+02, 5.5384e+02, 6.3172e+02]])
xyxyn: tensor([[6.1366e-01, 7.1047e-04, 7.9797e-01, 8.6476e-01],
[3.9750e-01, 3.5505e-01, 4.9842e-01, 8.6832e-01],
[2.3618e-01, 5.1109e-01, 5.5247e-01, 8.8029e-01],
[2.3720e-01, 5.1338e-01, 4.3269e-01, 8.7739e-01]])
As you can see, the boxes attribute contains quite a lot of information, and most of it is easy to understand.
The one that we are interested in is the data attribute. It contains the bounding boxes in the format [x1, y1, x2, y2, confidence, class_id].
x1, y1, x2, y2 are the coordinates of the top-left and bottom-right corners of the bounding box, confidence is the detection confidence, and class_id is the ID of the class the bounding box belongs to.
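You can also read the same information from the individual attributes shown in the output above:
# the same information is available through the individual attributes
boxes = results[0].boxes
print(boxes.xyxy)  # bounding boxes in [x1, y1, x2, y2] format
print(boxes.conf)  # confidence score of each detection
print(boxes.cls)   # class id of each detection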
Let's see how we can use this information to draw the bounding boxes on the image.
    # loop over the results
    for result in results:
        # initialize the list of bounding boxes, confidences, and class IDs
        bboxes = []
        confidences = []
        class_ids = []

        # loop over the detections
        for data in result.boxes.data.tolist():
            x1, y1, x2, y2, confidence, class_id = data
            x = int(x1)
            y = int(y1)
            w = int(x2) - int(x1)
            h = int(y2) - int(y1)
            class_id = int(class_id)

            # filter out weak predictions by ensuring the confidence is
            # greater than the minimum confidence
            if confidence > conf_threshold:
                bboxes.append([x, y, w, h])
                confidences.append(confidence)
                class_ids.append(class_id)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
Here, we are using result.boxes.data.tolist() to get the detections in the format [x1, y1, x2, y2, confidence, class_id].
Next, we check if the confidence of the bounding box is greater than the conf_threshold. If it is, we add the bounding box, confidence, and class_id to their respective lists.
Finally, we draw the bounding boxes on the image using cv2.rectangle.
Let's finish our code by writing the FPS on the frame and displaying it.
    ############################################################
    ### Some post-processing to display the results ###
    ############################################################

    # end time to compute the fps
    end = datetime.datetime.now()
    # calculate the frames per second and draw it on the frame
    fps = f"FPS: {1 / (end - start).total_seconds():.2f}"
    cv2.putText(frame, fps, (50, 50),
                cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 255), 8)

    cv2.imshow("Output", frame)
    # write the frame to disk
    writer.write(frame)
    if cv2.waitKey(1) == ord("q"):
        break

# release the video capture, video writer, and close all windows
video_cap.release()
writer.release()
cv2.destroyAllWindows()
The video below shows the output of the code.
As you can see, the code is working fine, and it's easy to use YOLOv8 for object detection.
Let's now move on to the interesting part of the tutorial: tracking the objects detected by YOLOv8.
We first need to import some classes and functions from the deep_sort package.
# ...
from deep_sort.deep_sort.tracker import Tracker
from deep_sort.deep_sort import nn_matching
from deep_sort.deep_sort.detection import Detection
from deep_sort.tools import generate_detections as gdet

# define some parameters
conf_threshold = 0.5
max_cosine_distance = 0.4
nn_budget = None

# ...

# Initialize the deep sort tracker
model_filename = "config/mars-small128.pb"
encoder = gdet.create_box_encoder(model_filename, batch_size=1)
metric = nn_matching.NearestNeighborDistanceMetric(
    "cosine", max_cosine_distance, nn_budget)
tracker = Tracker(metric)

# load the COCO class labels the YOLO model was trained on
classes_path = "config/coco.names"
with open(classes_path, "r") as f:
    class_names = f.read().strip().split("\n")

# create a list of random colors to represent each class
np.random.seed(42)  # to get the same colors
colors = np.random.randint(0, 255, size=(len(class_names), 3))  # (80, 3)
We first need to load the mars-small128.pb model. This model is used to extract the features of the bounding boxes.
Next, we create a NearestNeighborDistanceMetric object. This object is used to compute the distance between the features extracted by the mars-small128.pb model.
Finally, we create a Tracker object. This object is used to track the objects detected by YOLOv8.
We also need to load the coco.names file. This file contains the names of the classes the YOLOv8 model was trained on. We will use this file to get the name of the class the bounding box belongs to.
We also create a list of random colors to represent each class. We will draw the bounding boxes of each class with a different color. This will help us to distinguish between the different classes.
Let's continue our code; we will see how we can use the Tracker object to track the objects detected by YOLOv8.
# loop over the frames
while True:
    # ...

    ############################################################
    ### Detect the objects in the frame using the YOLO model ###
    ############################################################

    results = model(frame)

    for result in results:
        # ...
        for data in result.boxes.data.tolist():
            # ...
            if confidence > conf_threshold:
                bboxes.append([x, y, w, h])
                confidences.append(confidence)
                class_ids.append(class_id)
                # cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2) # comment this line

    ############################################################
    ### Track the objects in the frame using DeepSort ###
    ############################################################

    # get the names of the detected objects
    names = [class_names[class_id] for class_id in class_ids]

    # get the features of the detected objects
    features = encoder(frame, bboxes)

    # convert the detections to deep sort format
    dets = []
    for bbox, conf, class_name, feature in zip(bboxes, confidences, names, features):
        dets.append(Detection(bbox, conf, class_name, feature))

    # run the tracker on the detections
    tracker.predict()
    tracker.update(dets)
Let's understand the code above. After detecting the objects in the frame using the YOLOv8 model, we filter the detections to keep only those with a confidence greater than conf_threshold and we add the bounding box, confidence, and class_id of each detection to their respective lists.
Next, we get the names of the detected objects using the class_names and the class_ids lists.
After that, we get the features of the detected objects using the encoder object.
Finally, we convert the detections to the deep_sort format and we run the tracker on the detections.
Now we can loop over the tracked objects and draw the bounding boxes on the frame.
    # loop over the tracked objects
    for track in tracker.tracks:
        if not track.is_confirmed() or track.time_since_update > 1:
            continue

        # get the bounding box of the object, the name
        # of the object, and the track id
        bbox = track.to_tlbr()
        track_id = track.track_id
        class_name = track.get_class()

        # convert the bounding box to integers
        x1, y1, x2, y2 = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])

        # get the color associated with the class name
        class_id = class_names.index(class_name)
        color = colors[class_id]
        B, G, R = int(color[0]), int(color[1]), int(color[2])

        # draw the bounding box of the object, the name
        # of the predicted object, and the track id
        text = str(track_id) + " - " + class_name
        cv2.rectangle(frame, (x1, y1), (x2, y2), (B, G, R), 2)
        cv2.rectangle(frame, (x1 - 1, y1 - 20),
                      (x1 + len(text) * 12, y1), (B, G, R), -1)
        cv2.putText(frame, text, (x1 + 5, y1 - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
After updating the tracker, we loop over the tracked objects and check whether each track is confirmed and was updated recently (time_since_update > 1 means the track missed the last frame). If not, we skip it.
Otherwise, we get the bounding box of the object, its name, and the track id.
We also get the color associated with the class name and use it to draw the bounding box of the object.
Finally, we can reuse the same code as in the object detection part to draw the fps, save and show the frame.
    ############################################################
    ### Some post-processing to display the results ###
    ############################################################

    # end time to compute the fps
    end = datetime.datetime.now()
    # calculate the frames per second and draw it on the frame
    fps = f"FPS: {1 / (end - start).total_seconds():.2f}"
    cv2.putText(frame, fps, (50, 50),
                cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 255), 8)

    cv2.imshow("Output", frame)
    # write the frame to disk
    writer.write(frame)
    if cv2.waitKey(1) == ord("q"):
        break

# release the video capture, video writer, and close all windows
video_cap.release()
writer.release()
cv2.destroyAllWindows()
The video below is the same video used in the previous section for object detection.
You can see that the tracker is able to re-identify some of the objects after losing the bounding box for a few frames, but for some of them it fails to re-identify them and assigns them a new ID.
For example, the bicycle at the bottom right is assigned the ID 39 and after losing the bounding box for a few frames, the tracker assigns it a new ID (42), and then after losing the bounding box again, it assigns it another new ID (48).
You can increase the max_age parameter of the Tracker object to increase the number of frames the tracker will keep a track alive without a matching detection.
For example, in the video below, I increased the max_age parameter from the default value of 30 to 60.
tracker = Tracker(metric, max_age=60)
You can see that this time the bicycle is first assigned the ID 38 and after losing the bounding box for a few frames, the tracker is able to re-identify it with the same ID (38).
In this hands-on tutorial, you learned how to use the DeepSORT algorithm and the YOLOv8 model to detect and track objects in a video.
You learned how to assign a unique ID to each object and how to re-identify the object after losing the bounding box for a few frames.
As I said before, combining object detection with object tracking allows us to monitor the movement of objects in a video, for example, to count the number of objects.
This opens up a lot of possibilities for applications such as people counting, traffic monitoring, sports analysis, and more.
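As a rough sketch (not part of the tutorial code), counting people could be as simple as collecting the unique IDs of the confirmed tracks inside the main loop:
# a minimal sketch, assuming the tracker from this tutorial:
# count objects by collecting the unique IDs of confirmed "person" tracks
counted_ids = set()
for track in tracker.tracks:
    if track.is_confirmed() and track.get_class() == "person":
        counted_ids.add(track.track_id)
print(f"People seen so far: {len(counted_ids)}")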
I will try to show how to build some real-world applications using object detection and object tracking in the future, Insha'Allah.
I hope you enjoyed this tutorial and that you learned something new.
As always, if you have any questions or suggestions, please leave a comment below.
To access the source code for this tutorial, please subscribe to my newsletter using the form on the landing page. Once subscribed, you will receive an email with the link to the source code.