This post is the fourth in a series on developing a HOG-based SVM with OpenCV-Python for detecting objects in a video.
- Part 1: SVMs, HOG features, and feature extraction
- Part 2: Sliding window technique and heatmaps
- Part 3: Feature descriptor code and OpenCV vs scikit-image HOG functions
- Part 4: Training the SVM classifier
- Part 5: Implementing the sliding window search
- Part 6: Heatmaps and object identification
With this post, we’ll continue to look at the code, which is available on Github. Once again, this is the final result after all is said and done:
Specifically, this post will focus on extracting features from the training images to train the SVM classifier.
Extracting feature vectors
First, we’ll extract features from the sample images and split the sample data into training, cross-validation, and test sets. The training set is what we’ll use to initially train the classifier, and will consist of most of our sample data. The cross-validation set set will be used to evaluate the trained classifier. Based on the accuracy with which the classifier evaluates images in the cross-validation set, we can tune the parameters of the classifier or, as I’ve done in my implementation, retrain the classifier with samples from the cross-validation set that it misclassified. Finally, to get a true measure of the classifier’s accuracy, we evaluate its performance on the test set. The reason for reserving a small amount of data for use in a test set is that it’s data the classifier has not seen, been trained on, or been tuned for, unlike the cross-validation set. Consequently, it gives us an unbiased measure of the classifier’s accuracy, ideally.
Before getting started, we need sample images to work with. I used the Udacity dataset (scroll down to find links for vehicle and non-vehicle data), which contains about 8800 images of vehicles, taken from a variety of angles, and about 9000 images of non-vehicles—things like the road and objects that might be found around the road, like street signs, guardrails, etc. The Udacity dataset actually contains images taken from the GTI and KITTI datasets, in case you’re looking for more images.
The functions we need can both be found in
train.py in the Github project. To extract features from our sample images,
we’ll use the first function, named processFiles()
, whose input
signature looks like this:
|
|
The first two arguments, pos_dir
and neg_dir
, specify the paths to the
directories containing the positive and negative sample images, respectively.
The recurse
argument specifies whether or not to look in sub-directories
within those directories for images. color_space
, as the name implies, sets
the color space to which images will be converted before feature extraction; the
supported color spaces can be found later in the file on lines 71-97. Note
that OpenCV’s default behavior for reading and displaying images is to use
the BGR color space. The channels
argument is a list specifying the color
channel indices to use. For example, color_space="bgr", channels=[0, 2]
would
indicate that we wish to use the B and R channels of the BGR representation of
the image. color_space="hls", channels=[1, 2]
would indicate that we wish to
use the L and S channels of the HLS representation of the image. Beyond that,
the remaining arguments will be supplied to the
Descriptor object, which actually extracts features from an image after
it’s been converted to the desired color space and channels. See the
previous post for more information on the Descriptor class.
We make use of some calls to Python’s built-in os
library to build lists
of positive and negative file names from the positive and negative sample
directories, respectively:
|
|
After building the file lists and parsing input arguments specifying the color
space and desired channels, we iterate through the lists, load each file,
convert it to the desired color space if not BGR, keep only the desired
channels, get the feature vector by feeding the image to our Descriptor
object, and append it to the appropriate list.
|
|
Scaling features
It is very important that the features be scaled, i.e., normalized. Different types of features are not necessarily measured along the same scale—for example, how do you compare a HOG bin value from HOG features to a color histogram bin value between 0 and 255? Without scaling or normalizing our features, features with large values can overshadow features with smaller values. I’ve utilized scikit-learn’s sklearn.preprocessing.StandardScaler, which, for each column in our list of feature vectors (i.e., each feature), removes the mean so that all values are centered around zero and scales the variance to one, so different features can be directly compared:
|
|
Notice how we fit the scaler to the complete set of feature vectors, then
actually scale the features with StandardScaler.transform()
. After training
the classifier on these scaled features, we must remember to apply the same
scaler to any feature vectors we wish to classify later.
Training, cross-validation, and test sets
Next, we shuffle and split our data into a training set, cross-validation set, and test set. The training set is the data on which the classifier will be initially trained. The cross-validation set will be used to evaluate the trained classifier. We can use the classifier’s performance on the cross-validation set to retrain it or tune its parameters as we see fit, until we’re satisfied with its performance. Finally, the classifier is evaluated on data it hasn’t seen before, i.e., the test set, to quantify its ultimate accuracy.
70/20/10 is a common split, i.e., putting 70% of the original data in the training set, 20% in the cross-validation set, and 10% in the test set. I ended up going with 75/20/5 based on the fact that the classifier’s performance on the test set seemed to have little correlation with its performance on actual video.
Training the classifier
To train the classifier, we supply it with the training data and a list of “labels” to tell it which class each training example belongs to:
|
|
In this case, the classes can simply be represented by numbers, where “1”
represents a positive example (corresponding to an image that contains the
object) and “0” represents a negative example (corresponding to an image that
doesn’t contain the object). Here, I’ve used scikit-learn’s
sklearn.svm.LinearSVC, which is initialized with the desired
hyperparameters on lines 283-284. The actual training occurs on line 285
via the LinearSVC.fit()
method, which we supply with the training data and
corresponding labels.
Next, we evaluate its performance on the cross-validation set positive and negative samples:
|
|
A perfect classifier would return all 1s for the positive cross-validation samples and all 0s for the negative cross-validation samples. However, there will inevitably be misclassified samples. To improve the performance of our classifier, we specifically add the misclassified samples from the cross-validation set to the original training set, then retrain the classifier as follows:
|
|
The classifier is now completely trained! If we wanted to, we could have continued to tune the classifier and its hyperparameters based on its performance on the cross-validation set. Scikit-learn offers a class to facilitate this, sklearn.model_selection.GridSearchCV. In this case, I didn’t do any tuning beyond retraining with the incorrectly classified samples from the cross-validation set, since I found the SVM to attain high accuracy (often >99%) without additional tuning.
To determine the final accuracy, we simply run the trained classifier on the test set:
|
|
In the next post, we’ll see how to utilize the trained classifier to find vehicles in a video.