Link to code: https://github.com/nrsyed/pytorch-yolov3
- Part 1 (Background)
- Part 2 (Initializing the network)
- Part 3 (Inference)
- Part 4 (Real-time multithreaded detection)
- Part 5 (Command-line interface)
In this post, we extend the inference code discussed in the last post and demonstrate how to incorporate it into a multithreaded pipeline to perform detection and display the results in real time on a webcam stream or video. The functions and classes we’ll use to do so are also found in inference.py.
Real-time detection in a webcam stream
Let’s look at the definition of the function detect_in_cam(), which, as the name implies, we’ll use to detect objects in a webcam stream.
The inputs to the function on lines 436-437 are defined in the subsequent docstring. Several of them are either self-explanatory or were discussed in the previous post (as inputs to the inference() function). One of the remaining arguments is a list of class names corresponding to the class indices (class_idx) from the last post. The optional frames argument is used if the processed frames (with bounding boxes drawn over the original webcam stream, etc.) are needed by the calling code, for example, to write the result to a video file (this is what was used to create the YouTube demonstration video). frames should be a list provided by the caller. Each frame will be appended to the list as the webcam stream is processed. Because Python passes a reference to the list object rather than a copy, and the function mutates that list in place, the appended frames are also visible to the calling code.
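To illustrate why a caller-provided list works for collecting frames, here is a minimal sketch. The function collect_frames() and its string "frames" are hypothetical stand-ins and are not taken from inference.py; the only point is that appending to the list inside the function is visible to the caller afterward.

```python
def collect_frames(source, frames=None):
    """Process each item in `source`; append the result to `frames` if given."""
    for raw in source:
        # Stand-in for running inference and drawing boxes on a real frame.
        processed = raw.upper()
        if frames is not None:
            # Mutates the caller's list in place; no return value is needed.
            frames.append(processed)


saved = []
collect_frames(["a", "b", "c"], frames=saved)
print(saved)  # ['A', 'B', 'C']
```

The caller could then iterate over the saved list to write each frame to a video file once processing has finished.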
On lines 458-459, we instantiate the VideoGetter and VideoShower classes, which are defined earlier in inference.py. We’ll get to the definitions of these classes shortly. For now, all you need to know is that they enable us to asynchronously get (hence VideoGetter) frames from the webcam and show (hence VideoShower) the processed frames with detection bounding boxes in separate threads. Utilizing multithreading for these I/O-bound tasks significantly speeds up our pipeline. Lines 462-463 initialize the variables we’ll use to compute the running average FPS of the pipeline—this will allow us to measure just how “real time” it is. For this, we use a deque that keeps track of the processing time for the last 30 frames.
The rest of the function comprises a while loop that grabs the current frame from the webcam on line 473, runs inference on the frame (lines 474-477), draws the resulting bounding boxes and class names (or class indices if class names were not provided) on the frame (lines 478-480), draws the current processing FPS on the frame if desired (lines 482-487), and finally displays the frame on line 489. On lines 490-491, the resulting frame is appended to the frames list if it was provided. Finally, on line 493, the framerate for the current loop iteration, which is the inverse of the processing time for the frame, is appended to the deque.
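The running-average FPS bookkeeping described above can be sketched with a bounded deque. The FpsMeter class below is a hypothetical helper written for illustration, not the code from inference.py; it assumes only that each loop iteration records the inverse of its processing time and that the oldest entries fall off once the window is full.

```python
from collections import deque


class FpsMeter:
    """Hypothetical helper: running average of per-frame framerates."""

    def __init__(self, window=30):
        # A deque with maxlen discards the oldest entry automatically.
        self.rates = deque(maxlen=window)

    def update(self, elapsed_seconds):
        """Record the framerate (1 / processing time) for one iteration."""
        if elapsed_seconds > 0:
            self.rates.append(1.0 / elapsed_seconds)

    @property
    def average(self):
        """Mean of the stored framerates, or 0.0 before any update."""
        return sum(self.rates) / len(self.rates) if self.rates else 0.0


meter = FpsMeter(window=3)
for elapsed in (0.5, 0.25, 0.25, 0.125):
    meter.update(elapsed)

# The window now holds the last three rates: 4, 4 and 8 FPS.
print(list(meter.rates))
print(meter.average)
```

In a real loop, the elapsed time would come from timestamps taken at the start and end of each iteration, for example with time.perf_counter().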
Multithreading to get and display frames
One of the more popular posts I’ve written is on multithreading with OpenCV to improve performance. As the comments on that post point out, there were a couple of issues with that implementation and my method of evaluation. These issues have been corrected in this updated implementation (which will probably be evaluated in more detail in a future post). First, here’s the code for the VideoGetter class used to read frames from the webcam.
The __init__() method is fairly straightforward. It instantiates an OpenCV VideoCapture object and initializes its return values (stored as instance variables) by calling its read() method. We also initialize the boolean instance variable self.stopped, which can be used to determine if the thread is still running. The thread itself is started in the start() method on line 22; note that we’re making use of the threading module, which is part of Python’s standard library. The thread runs the VideoGetter.get() method defined on lines 26-37. All it does is continuously read the next frame from the VideoCapture object and store it in self.frame such that a frame is always available. Note that, as the get() method’s docstring indicates, frames are not stored in a queue, which means that the previous frame is always overwritten. This was an intentional design choice to ensure that we’re always reading the most recent frame, even if previous frames were missed due to the processing taking too much time.
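The "overwrite instead of queue" idea can be sketched without a webcam. The class below is a simplified, hypothetical stand-in for the pattern and is not the VideoGetter code itself: LatestValueGetter, make_counter_source() and the read_fn parameter are names invented for this example. It assumes a read function that returns an (ok, value) pair, loosely mirroring the shape of VideoCapture.read().

```python
import threading


class LatestValueGetter:
    """Hypothetical stand-in for a frame grabber; keeps only the newest value."""

    def __init__(self, read_fn):
        self.read_fn = read_fn
        # Prime the first value, like calling read() once before the thread starts.
        self.ok, self.value = read_fn()
        self.stopped = False
        self._thread = threading.Thread(target=self.get, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def get(self):
        """Keep reading; each new value replaces the previous one."""
        while not self.stopped:
            ok, value = self.read_fn()
            if not ok:
                self.stopped = True  # source exhausted
            else:
                self.value = value  # overwrite; nothing is queued

    def join(self, timeout=None):
        self._thread.join(timeout)


def make_counter_source(n):
    """Fake source yielding (True, 0) ... (True, n - 1), then (False, None)."""
    counter = iter(range(n))

    def read():
        value = next(counter, None)
        return value is not None, value

    return read


getter = LatestValueGetter(make_counter_source(50)).start()
getter.join(timeout=5)
print(getter.value)  # 49
```

A consumer that reads getter.value while the thread is running sees whatever value is newest at that moment, so slow consumers skip intermediate values rather than falling behind.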
Below is the code for the VideoShower class (that’s shower as in “show-er” and not as in “rain shower”):
The class’s __init__() method takes two optional arguments: frame (a frame with which to initialize the class) and win_name (the name of the OpenCV window in which the frame will be displayed). OpenCV relies on the window name to determine which window to update. Like the VideoGetter class, it also has a start() method to start the thread used to show frames. The show() method on lines 61-74 is where most of the magic happens in a while loop that continuously displays the current frame using OpenCV’s imshow() function.
As the method’s docstring indicates, the self.frame attribute is set to None after the call to imshow(). This is because the call to imshow() is relatively computationally expensive, and by only running it when the calling code has manually set self.frame to an actual frame, we can realize a performance increase. Also note that the loop monitors keypresses and stops when the “q” key is pressed.
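The "display once, then reset to None" pattern can be sketched without OpenCV. This is a simplified, hypothetical illustration and not the VideoShower code: LatestValueShower and the show_fn parameter are invented for the example, with show_fn standing in for the imshow() call, and keypress handling is omitted.

```python
import threading
import time


class LatestValueShower:
    """Hypothetical stand-in for a display thread; shows each value once."""

    def __init__(self, show_fn, value=None):
        self.show_fn = show_fn  # stand-in for the expensive display call
        self.value = value
        self.stopped = False
        self._thread = threading.Thread(target=self.show, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def show(self):
        while not self.stopped:
            value = self.value
            if value is not None:
                self.show_fn(value)
                # Reset so the same value is not displayed again.
                self.value = None
            else:
                time.sleep(0.001)  # idle briefly until a new value arrives

    def stop(self):
        self.stopped = True
        self._thread.join()


shown = []
shower = LatestValueShower(shown.append).start()
for item in ("frame-1", "frame-2", "frame-3"):
    shower.value = item
    # For this demo only: wait until the value has been consumed.
    while shower.value is not None:
        time.sleep(0.001)
shower.stop()
print(shown)  # ['frame-1', 'frame-2', 'frame-3']
```

In a real-time pipeline the producer would not wait for the value to be consumed; it would simply assign the newest frame, so a frame that arrives before the previous one is shown replaces it.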
Detection in a video file
The detect_in_video() function found in inference.py is used to run detection on the frames of a video file. Functionally speaking, the detect_in_cam() and detect_in_video() functions do the same thing (instantiate a VideoCapture object, read frames from it, and run inference on those frames). The differences are mainly stylistic in nature. The detect_in_video() function does not utilize multithreading or the VideoGetter and VideoShower classes described above, and it makes the video showing/displaying functionality optional. This is based on the assumption that, with a video, we want to process every single frame; as we learned above, the VideoGetter class might result in frames being skipped if the processing time for a frame exceeds the time required to read the next frame from the OpenCV VideoCapture object. It’s also based on the assumption that, with a video file, there are no real-time constraints—in other words, I’ve assumed that the goal is to process all frames regardless of how long it takes.
The code for the function is below.
In practice, there’s nothing stopping you from providing a webcam id for the detect_in_video() function’s filepath argument or, conversely, a path to a video file for the cam_id argument of the detect_in_cam() function. In fact, doing so might be an informative exercise.
This post covered the functions used to run inference on a video file or webcam stream for real time detection. In the next post, we’ll examine the command line interface and the top-level code that actually calls these functions.