This is the fifth post in a series on implementing an SVM object detection pipeline for video with OpenCV-Python.
- Part 1: SVMs, HOG features, and feature extraction
- Part 2: Sliding window technique and heatmaps
- Part 3: Feature descriptor code and OpenCV vs scikit-image HOG functions
- Part 4: Training the SVM classifier
- Part 5: Implementing the sliding window search
- Part 6: Heatmaps and object identification
The previous posts covered the foundations of SVMs, HOG features, and the sliding window plus heatmap approach to finding objects in an image, then discussed the parts of the pipeline that extract features from sample images and train an SVM classifier to differentiate between vehicles and non-vehicles. In this post, we’ll discuss the portions of the code that implement the sliding window search, which determines which parts of an image contain an object. The Python source code and project are available on GitHub.
As in the previous posts, let’s start with a video of the final result to see what we’re working towards:
Sliding window search
Although we’ve trained our SVM classifier on nice 64×64 images, a frame from an actual dashcam video will not be a 64×64 image that neatly contains or doesn’t contain an entire vehicle. Instead, vehicles may be present in different parts of the image at different scales. One way to tackle this is with a sliding window search, in which a window of fixed size is slid across and down the image. This is repeated at several scales by progressively downscaling the image; the resulting stack of downscaled images is called an “image pyramid.”
The results I obtained with the aforementioned approach were inconsistent, so I tried something different. My alternative approach was to 1) scan only the portion of the image below the horizon and above the dash, since there should be no vehicles outside this region, and 2) use a window of variable size that’s relatively small near the horizon and larger toward the bottom of the image, the rationale being that vehicles near the horizon are farther away and will appear smaller whereas vehicles near the bottom of the image are closer and will appear larger. A sample demonstration of this approach can be seen in the following video:
The colors don’t signify anything; they’re simply meant to help differentiate adjacent windows from one another. I wrote a small utility called uniquecolors.py to generate an arbitrary number of unique colors, which I used to create the above clip.
The actual sliding window search is handled by a function defined in the file slidingwindow.py. The function has the following signature:
Note that the function doesn’t actually operate on an image. Instead, we pass it image_size, a tuple containing the width and height of an image, from which it determines the coordinates of all the windows to search (based on the remaining input parameters). This allows us to call the function once, store the list of window coordinates to be searched, then refer to this list for every frame.
As demonstrated in the video, the window traverses the image from top to bottom and left to right. init_size sets the initial size of the window. x_overlap sets the overlap between adjacent windows while moving left to right, as a fraction of the current window width; in other words, if the current window width were 100 pixels, x_overlap=0.5 would cause the window to step 50 pixels to the right for each left-to-right step. y_step determines the amount by which the top of the window slides down for each step in the top-to-bottom direction, as a fraction of total image height. If the image were 600 pixels tall, y_step=0.05 would cause the window to step 30 pixels toward the bottom at each vertical step.
x_range and y_range set the portion of the image to search as a fraction of the image width and height, respectively. For example, x_range=(0, 0.5) would cause only the left half of the image to be searched. Similarly, y_range=(0.67, 1.0) would cause only the bottom third of the image to be searched. Finally, scale sets the ratio by which to increase the size of the window with each vertical step toward the bottom of the image. If scale > 1, the window gets larger with each step in the y direction. If scale < 1, the window gets smaller with each step in the y direction. If scale = 1, the window size remains fixed.
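To make the arithmetic concrete, here is a small sketch that reproduces the numbers from the examples above. The 1280-pixel image width, the 64-pixel starting window, and scale=1.5 are illustrative assumptions, and computing the horizontal step as width × (1 − x_overlap) is one interpretation that agrees with the 100-pixel example, not code taken from the project.

```python
# Horizontal step: a 100-pixel-wide window with 50% overlap.
window_width = 100
x_overlap = 0.5
x_step_px = int(round(window_width * (1 - x_overlap)))  # 50 pixels

# Vertical step: 5% of a 600-pixel-tall image.
image_height = 600
y_step = 0.05
y_step_px = int(round(image_height * y_step))  # 30 pixels

# Search region: x_range=(0, 0.5) on an (assumed) 1280-pixel-wide image
# restricts the search to the left half.
image_width = 1280
x_range = (0, 0.5)
x_bounds = (int(x_range[0] * image_width), int(x_range[1] * image_width))

# Window growth: with scale=1.5, a 64-pixel window grows geometrically
# with each vertical step.
scale = 1.5
sizes = [int(64 * scale ** i) for i in range(4)]  # [64, 96, 144, 216]
```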
The actual implementation of these parameters is fairly short:
Observe that each window is added to the list in the form of a tuple containing the x and y coordinates of the window’s upper left corner and the x and y coordinates of the window’s lower right corner, in that order.
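For readers who want to experiment without the project source at hand, the following is a minimal sketch of a window generator that behaves as described above. The function name sliding_window, the default values, and the rounding choices are all assumptions; the actual function in slidingwindow.py may differ in its details.

```python
def sliding_window(image_size, init_size=(64, 64), x_overlap=0.5, y_step=0.05,
                   x_range=(0.0, 1.0), y_range=(0.0, 1.0), scale=1.5):
    """Return a list of (x1, y1, x2, y2) window coordinates.

    Hypothetical sketch: name, defaults, and rounding are assumptions.
    """
    img_w, img_h = image_size
    # Convert the fractional search ranges to pixel bounds.
    x_min, x_max = int(x_range[0] * img_w), int(x_range[1] * img_w)
    y_min, y_max = int(y_range[0] * img_h), int(y_range[1] * img_h)
    # The vertical step is a fraction of the total image height.
    y_step_px = max(1, int(round(y_step * img_h)))

    win_w, win_h = float(init_size[0]), float(init_size[1])
    windows = []
    y = y_min
    while y + int(win_h) <= y_max:
        w, h = int(win_w), int(win_h)
        if w < 1 or h < 1:
            break  # window has shrunk to nothing (possible when scale < 1)
        # The horizontal step is a fraction of the current window width.
        x_step_px = max(1, int(w * (1 - x_overlap)))
        x = x_min
        while x + w <= x_max:
            # Upper-left corner first, then lower-right corner.
            windows.append((x, y, x + w, y + h))
            x += x_step_px
        # Move down one vertical step and rescale the window.
        y += y_step_px
        win_w *= scale
        win_h *= scale
    return windows


# Compute the window list once, then reuse it for every frame.
windows = sliding_window((1280, 720), init_size=(64, 64), x_overlap=0.5,
                         y_step=0.05, y_range=(0.55, 0.9), scale=1.3)
```

Because the function depends only on the image size, the resulting list can be cached for the whole video rather than recomputed per frame.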
Building the detector
Having established our sliding window function, we can move on to making use of it in the object detector, for which I’ve defined a Detector class in a separate file. The __init__() method of this class, found on lines 19-30, sets the sliding window parameters discussed in the previous section. Next, the classifier dictionary produced by train.py is loaded via the loadClassifier() method on lines 32-75.
From the dict, we extract the scikit-learn LinearSVC on line 50 (which actually classifies feature vectors as containing or not containing the object on which it was trained), the scikit-learn StandardScaler on line 51 (which scales feature vectors before they’re fed to the classifier), and the color space and color channels for which we wish to extract features. Note that we use an OpenCV color space conversion constant, which was determined by the processFiles() function in train.py. Since OpenCV uses the BGR color space by default and no color conversion is required if the user has chosen to use this color space, I’ve assigned -1 as a default value, which doesn’t correspond to an actual OpenCV color conversion constant.
Then we re-instantiate a Descriptor, which will produce the feature vector for each window, using the original descriptor parameters on lines 59-73. Originally, I’d packaged the Descriptor object into the dictionary, but found that, if the dictionary was pickled (saved to file), attempting to load the pickle file produced errors if the pickled dictionary included a Descriptor object.
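One way to sidestep this kind of pickling problem is to store only the plain parameters needed to rebuild the object, then re-instantiate it after loading. The sketch below uses a stand-in Descriptor class with made-up parameter names (hog_bins, cell_size) and made-up dictionary keys; the project's actual Descriptor and classifier dictionary will differ.

```python
import pickle


class Descriptor:
    """Stand-in for the real feature descriptor (hypothetical parameters)."""

    def __init__(self, hog_bins=9, cell_size=(8, 8)):
        self.hog_bins = hog_bins
        self.cell_size = cell_size


# Store plain, picklable parameters rather than the Descriptor instance.
classifier_data = {
    "descriptor_params": {"hog_bins": 9, "cell_size": (8, 8)},
    "channels": [0, 1, 2],
}
blob = pickle.dumps(classifier_data)

# Later: load the dictionary and rebuild the descriptor from its parameters.
loaded = pickle.loads(blob)
descriptor = Descriptor(**loaded["descriptor_params"])
```

Since the pickled dictionary contains only built-in types, loading it does not depend on the Descriptor class being importable under the same module path.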
The last helper method we define for the Detector class is found on lines 77-99. The signature for the function is simply:
It takes an image in the form of a 3D numpy array, converts it to the appropriate color space (lines 84-85) and keeps only the desired channels (lines 87-90):
We ensure that the array is three-dimensional even if it contains only a single channel, adding a third dimension via np.newaxis on line 90 if necessary.
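The channel selection and the np.newaxis fix can be illustrated in isolation. This is a sketch under assumptions: the helper name select_channels and the frame dimensions are hypothetical, and the color space conversion step is omitted.

```python
import numpy as np


def select_channels(image, channels):
    """Keep only the requested channels, always returning a 3D array."""
    selected = image[:, :, channels]
    # Indexing with a single integer drops the channel axis; restore it so
    # downstream code can always assume a (height, width, channels) shape.
    if selected.ndim == 2:
        selected = selected[:, :, np.newaxis]
    return selected


frame = np.zeros((720, 1280, 3), dtype=np.uint8)
two_channels = select_channels(frame, [0, 2])  # shape (720, 1280, 2)
one_channel = select_channels(frame, 1)        # shape (720, 1280, 1)
```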
With the image preprocessed, the next step is to obtain the feature vector for each window from the sliding window search (lines 92-94); we store these vectors in a list.
Then we scale the feature vectors and run them through the SVM classifier, which returns an array of 0s and 1s, where each element corresponds to a window, 0 signifies that the window does not contain the object, and 1 signifies that it does. Lastly, on line 99, we use a list comprehension to return only the window coordinates of windows predicted to contain an object.
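The final filtering step can be sketched on its own. The window coordinates and predictions below are made-up values standing in for the sliding window list and the classifier output, and the variable names are assumptions.

```python
# Hypothetical window coordinates as (x1, y1, x2, y2) tuples.
windows = [(0, 0, 64, 64), (32, 0, 96, 64), (64, 0, 128, 64)]

# Stand-in for the classifier output: one 0/1 label per window.
predictions = [0, 1, 1]

# Keep only the windows predicted to contain the object.
detected = [window for window, label in zip(windows, predictions) if label == 1]
```

Pairing the two sequences with zip() relies on the predictions being in the same order as the windows, which holds as long as the feature vectors were built by iterating over the window list in order.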
We’re now ready to actually apply the sliding window search and classification to a video, which will be the topic of the next post.