Link to code: https://github.com/nrsyed/pytorch-yolov3
- Part 1 (Background)
- Part 2 (Initializing the network)
- Part 3 (Inference)
- Part 4 (Real-time multithreaded detection)
- Part 5 (Command-line interface)
The last several posts covered the theory behind YOLOv3, as well as the code responsible for performing inference on an image and real-time detection in a webcam stream. In this post, we’ll tackle the command line interface that ties it all together and which was used to produce the video from the first post (included again below for good measure).
The command-line interface (CLI) is defined in the main() function of __main__.py. Its arguments are described in more detail below.
Input source arguments
The input source can be 1) a webcam or video stream, 2) an image or images, or 3) a video file. One and only one of these three options must be provided.
--cam, -C [cam_id]: This option performs detection on a webcam or video stream. If no value is supplied (i.e., just -C), it uses video capture device id 0; if only one webcam is plugged in, or you’re using a laptop with a built-in webcam, its device id is usually 0. A numeric device id or a path to a video stream (e.g., the path to an RTSP stream) can optionally be supplied as the argument.
--image, -I <path>: This option has one required argument: a path to an image file or to a directory of image files.
--video, -V <path>: This option has one required argument: a path to a video file.
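The "one and only one" rule for the input source maps naturally onto an argparse mutually exclusive group. The following is a minimal sketch of that idea, not a copy of the repository's main(): the function names build_input_parser and parse_cam_id are hypothetical, and storing the bare -C default as the string "0" is an assumption made for illustration.

```python
import argparse


def build_input_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three input-source options."""
    parser = argparse.ArgumentParser()
    # required=True plus mutual exclusion enforces "one and only one" source.
    source = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes the value optional: a bare -C falls back to const ("0").
    source.add_argument("--cam", "-C", nargs="?", const="0", metavar="cam_id")
    source.add_argument("--image", "-I", metavar="path")
    source.add_argument("--video", "-V", metavar="path")
    return parser


def parse_cam_id(value: str):
    """Return an int device id for numeric strings, else the stream path."""
    return int(value) if value.isdigit() else value


if __name__ == "__main__":
    args = build_input_parser().parse_args(["-C"])
    print(parse_cam_id(args.cam))
```

Passing two sources at once (or none) makes argparse exit with a usage error, so the rest of the program can assume exactly one of args.cam, args.image, and args.video is set.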
There are several model-related arguments, some of which are required and others optional.
--config, -c <path>(required): The path to the Darknet config file in which the network architecture is defined, e.g.,
--device, -d <device>(optional): The PyTorch device on which to run the network (default “cuda”). This should be a string that can be parsed by torch.device. This can be, for example, “cpu” to run on the CPU or “cuda” to run on the primary CUDA device/GPU. The CUDA device/GPU can be specified using “cuda:0”, “cuda:1”, “cuda:2”, etc. to run on GPU 0 or GPU 1 or GPU 2, respectively, if you have multiple GPUs. Note that the GPU device id as seen by the OS (which can be determined from nvidia-smi) may not be the same as the device id as seen by PyTorch. For example, I have two GPUs in my system: an RTX 2080 Ti and a GTX Titan X. nvidia-smi reports the following information about these GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:2D:00.0  On |                  N/A |
| 15%   55C    P0    63W / 250W |   1041MiB / 11011MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:2E:00.0 Off |                  N/A |
| 22%   31C    P8    14W / 250W |      1MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Here, we can see that the RTX 2080 Ti is GPU 0 and the GTX Titan X is GPU 1. However, the GPU ids as seen by PyTorch may not necessarily match. To be sure you’re running on the correct GPU, you can supply the --verbose (-v) option, which will print the name of the GPU. If you have multiple identical GPUs, though, you may be better off using a tool like nvtop to check GPU usage.
--iou-thresh, -i <iou>(optional): The IOU (intersection over union) threshold is a value in the range (0, 1] that sets how much two bounding boxes must overlap for the detections to be considered duplicates. This value is ultimately passed to the non-max suppression function. By default, it’s set to 0.3.
--class-names, -n <path>(optional): The path to the text file of newline-separated class names, e.g., -n models/coco.names. The contents of this file are simply used to display the correct name above each bounding box. If omitted, the class index (an integer from 0 to num_classes - 1) is displayed instead.
--prob-thresh, -p <prob>(optional): The detection probability threshold, a float in the range [0, 1]. Detections below the given probability will be ignored. By default, this is set to 0.05 to filter out low-confidence predictions.
--weights, -w <path>(required): The path to the model weights.
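To make the interaction between --prob-thresh and --iou-thresh concrete, here is a minimal, pure-Python sketch of probability filtering followed by greedy non-max suppression. It is an illustration under stated assumptions rather than the repository's implementation: the names iou and filter_detections are hypothetical, boxes are assumed to be (x1, y1, x2, y2) corner coordinates, classes are ignored, and a box is suppressed when its IOU with an already-kept box is strictly greater than the threshold.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    boxes: Sequence[Box],
    probs: Sequence[float],
    prob_thresh: float = 0.05,
    iou_thresh: float = 0.3,
) -> List[int]:
    """Return indices of detections kept after thresholding and greedy NMS."""
    # Step 1: ignore detections below the probability threshold.
    candidates = [i for i, p in enumerate(probs) if p >= prob_thresh]
    # Step 2: visit the remaining detections from most to least confident.
    candidates.sort(key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept
```

Lowering iou_thresh suppresses more aggressively (boxes with less overlap count as duplicates), while raising prob_thresh discards more detections before suppression even starts.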
Lastly, there are several arguments that control what information or files are output by the detection pipeline.
--output, -o <path>(optional): Path to output .mp4 video file.
--show-fps(optional): Show the current framerate (frames per second) on the displayed video.
--verbose, -v(optional): Print diagnostic information to the terminal while running the detection pipeline.
Setting up the network and detector
We use argparse to process these command line arguments, then initialize the network and set some basic parameters starting on line 106 of the current version of __main__.py.
On line 107, we check if CUDA is available (if it was specified). If it isn’t, we tell the user and utilize the CPU instead. If CUDA was specified but is not available, this can either mean CUDA isn’t installed on the system, or that PyTorch was compiled without CUDA, or that the PyTorch version doesn’t support the version of CUDA installed on the system. On lines 115-117, we instantiate the Darknet class and load the network weights (see previous posts), then set the network to eval mode—this is a PyTorch torch.nn.Module method that should be called for inference (as opposed to training). On lines 119-120, we also call torch.nn.Module.cuda() if the model is to be loaded on the GPU and not the CPU. Lines 122-127 print information on the device being used, e.g., GPU, if the verbose option was supplied. Finally, we load the list of class names on lines 129-132 if a path to a file was given.
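The CUDA fallback described above boils down to a small decision. The sketch below isolates it so that it runs without PyTorch installed; resolve_device is a hypothetical helper name, and the cuda_available argument is assumed to be the result of torch.cuda.is_available().

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device string to use, falling back to the CPU if needed.

    cuda_available is meant to be the value of torch.cuda.is_available();
    it is passed in so this sketch has no PyTorch dependency.
    """
    if requested.startswith("cuda") and not cuda_available:
        # Tell the user why the requested device is not being used.
        print("CUDA not available; falling back to CPU")
        return "cpu"
    return requested


if __name__ == "__main__":
    print(resolve_device("cuda", cuda_available=False))
```

With PyTorch available, the returned string could then be handed to torch.device, after which the model would be moved to that device and put in eval mode for inference.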
On lines 134-143, we set the value for the input source.
Lines 145-172 handle the case where the input source is an image or a directory of images. The images are loaded into memory and each image is processed sequentially on lines 158-165. In the future, I may add the option to run all the images in a single batch—this can provide a performance boost over running the images one by one, but also utilizes more GPU memory (or CPU memory if not using CUDA). Finally, lines 167-172 draw bounding boxes on each image and display them one at a time.
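Since --image accepts either a single file or a directory, the first step is to turn that argument into a list of image paths. A minimal sketch of one way to do so follows; the helper name collect_image_paths and the set of accepted extensions are assumptions for illustration, not taken from the repository.

```python
import os
from typing import List

# Assumed set of extensions; the real code may accept a different set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def collect_image_paths(path: str) -> List[str]:
    """Return a sorted list of image paths for a file or directory argument."""
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    # A single file path is passed through unchanged.
    return [path]
```

Each returned path can then be loaded and run through the network one at a time, as described above.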
Both the webcam and video input sources are handled in the same top-level else block encompassed by lines 173-210. The actual detection functionality for these cases was discussed in the previous post and is found in the inference module. On lines 174-176, we initialize a list called frames if the resulting video is to be written to an mp4 output file. Each processed frame will be appended to frames by inference.detect_in_cam() or inference.detect_in_video(). After the function has completed, we’ll write the frames to a video using the OpenCV VideoWriter class.
The VideoWriter class allows us to write one frame to a video file at a time, which raises the question: why have I chosen to store the frames in a list and then write them all at the end instead of writing them as they’re processed? When I tried the latter approach, I found that per-frame processing time increased dramatically, severely reducing performance, even when I put the video writing functionality into a separate thread (in addition to the video reading and showing threads). For whatever reason, I couldn’t leverage multithreading to concurrently read, display, and write frames, which leads me to believe there’s something going on under the hood that will require a deeper dive for me to understand. For now, I’ve opted to stick with the list because, even though it consumes an ever-increasing amount of RAM as it grows, this memory usage is relatively small for short recordings. Recording for a longer duration (on the order of hours) would require a more robust solution.
Webcam input is handled on lines 178-197. The detection function is wrapped in a try/except/finally block so that the output video is written even if the detect_in_cam() function encounters an error and exits unexpectedly. The video file input is handled on lines 198-210. In both cases, writing an output video file is handled by the write_mp4() function, which is defined earlier in the same file.
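The try/except/finally pattern for the webcam case can be sketched as follows. This is a simplified illustration, not the code from __main__.py: run_and_record is a hypothetical wrapper, and detect and write stand in for the detection function and write_mp4(), whose real signatures are not reproduced here.

```python
from typing import Callable, List, Optional


def run_and_record(
    detect: Callable[[Optional[List]], None],
    write: Callable[[List, str], None],
    output_path: Optional[str],
) -> None:
    """Run detection and write any collected frames, even after an error."""
    # Only collect frames if an output file was requested.
    frames: Optional[List] = [] if output_path else None
    try:
        detect(frames)  # appends each processed frame to frames, if not None
    except Exception as exc:
        print(f"Detection stopped early: {exc}")
    finally:
        # Runs whether detect() returned normally or raised.
        if output_path and frames:
            write(frames, output_path)
```

Because the write happens in the finally clause, frames captured before a crash are still saved to the output file.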
This function is fairly straightforward. Line 24 gets the output video dimensions from the first frame in frames. The OpenCV VideoWriter object is instantiated by providing it the path to the output video, a FourCC code corresponding to the mp4 encoding, the output video framerate, and the output video dimensions.
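For reference, a function along these lines might look like the sketch below. It is a reconstruction from the description above rather than the repository's write_mp4(): the names frame_size and write_mp4_sketch, the argument order, and the "mp4v" FourCC code are all assumptions. Frames are assumed to be NumPy arrays as returned by OpenCV, whose shape is (height, width, channels), whereas VideoWriter expects (width, height).

```python
from typing import Sequence, Tuple


def frame_size(shape: Sequence[int]) -> Tuple[int, int]:
    """Convert an image shape (height, width[, channels]) to (width, height)."""
    height, width = shape[:2]
    return int(width), int(height)


def write_mp4_sketch(frames, fps: float, filepath: str) -> None:
    """Write a list of equally sized frames to an mp4 file (hypothetical)."""
    import cv2  # imported lazily so frame_size() works without OpenCV

    # Output dimensions come from the first frame.
    size = frame_size(frames[0].shape)
    # "mp4v" is an assumed FourCC; the original may use a different code.
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(filepath, fourcc, fps, size)
    try:
        for frame in frames:
            writer.write(frame)
    finally:
        # Release the writer so the file is finalized even on error.
        writer.release()
```

Swapping width and height is an easy mistake to make here, and in my experience it tends to produce an unplayable file rather than an obvious error.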
With this discussion of the command line interface, this series of posts comes to an end. We’ve delved into the fundamentals of YOLOv3, the PyTorch implementation of the network, utilizing the network for inference, and the multithreaded approach to real-time detection. This approach can generalize to any detection backend, not just YOLOv3, and I’m sure I’ll revisit this pipeline again in the future.