Dog Breed Classifier using Deep Learning
This article covers different methods for classifying images of dogs by breed using convolutional neural networks (CNNs). The goal is to achieve the highest accuracy possible.
Overview
This project builds convolutional neural networks (CNNs) to classify dog breeds. CNNs are widely used in image classification because they learn an internal representation of a two-dimensional image. This lets the model learn position- and scale-invariant structures in the data, which is essential when working with images. We built one CNN from scratch and also used well-known existing CNNs, VGG16 and ResNet-50.
Problem Statement
This project uses image classification to predict the correct dog breed from an image. In addition, if an image of a human is passed, the model returns the dog breed that most resembles the human face.
- When an image is passed, it is first checked for validity, that is, whether it shows a dog or a human. This is done using the human detector and the dog detector. If neither detector finds anything, an error asks the user to provide a valid image.
- If the human detector finds a human in the image, the model predicts the dog breed that best matches the human face.
- If the dog detector finds a dog in the image, the corresponding dog breed is predicted.
In both cases, the dog breed classifier makes the breed prediction.
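The decision flow described above can be sketched as a small dispatch function. This is only an illustration: the function name `classify_image`, the order in which the detectors are checked, and the use of `ValueError` for the error case are assumptions, and the detectors and the breed classifier are passed in as plain callables.

```python
def classify_image(img_path, dog_detector, face_detector, predict_breed):
    """Route an image through the detectors, then predict a breed.

    All three callables take an image path. The names and the order of the
    checks are assumptions made for this sketch.
    """
    if dog_detector(img_path):
        # A dog was found: report its predicted breed.
        return ("dog", predict_breed(img_path))
    if face_detector(img_path):
        # A human face was found: report the most resembling breed.
        return ("human", predict_breed(img_path))
    # Neither detector fired: ask for a valid image.
    raise ValueError("No dog or human detected; please provide a valid image.")


# Usage with stand-in detectors (hypothetical stubs, not the real models).
result = classify_image(
    "sample.jpg",
    dog_detector=lambda path: True,
    face_detector=lambda path: False,
    predict_breed=lambda path: "Barbet",
)
print(result)
```

Passing the detectors as arguments keeps the routing logic independent of OpenCV and the neural networks, so it can be checked without loading any model.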
Metrics
To evaluate the models' performance, we use accuracy as the metric. As the bar graphs in the Data Visualization section show, the images are spread fairly evenly across breeds: every breed has between 26 and 77 images (both included). Since the target variable is reasonably balanced, we opt for accuracy as our evaluation metric.
We define accuracy as follows:
accuracy = 100*np.sum(np.array(dog_breed_predictions)==np.argmax(test_targets, axis=1))/len(dog_breed_predictions)
Here, dog_breed_predictions stores the index of the predicted dog breed for each image in the test set, and test_targets holds the one-hot encoded true labels.
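To make the formula concrete, here is a small worked example with four images and three classes. The toy arrays and the helper name `accuracy_percent` are made up for illustration; only the formula itself comes from the article.

```python
import numpy as np


def accuracy_percent(predicted_indices, one_hot_targets):
    """Percentage of predictions that match the one-hot encoded targets."""
    predicted = np.asarray(predicted_indices)
    true = np.argmax(one_hot_targets, axis=1)  # one-hot rows -> class indices
    return 100.0 * np.sum(predicted == true) / len(predicted)


# Toy data: the true classes are 0, 1, 2, 1.
test_targets = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])
dog_breed_predictions = [0, 1, 1, 1]  # the third prediction is wrong

print(accuracy_percent(dog_breed_predictions, test_targets))  # 75.0
```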
Analysis
Data Exploration
This data set was provided by Udacity as part of its Nanodegree course. It consists of dog images, labelled by breed, and human face images. In the data set:
- There are 133 total dog categories.
- There are 8351 dog images.
- There are also 13233 human images.
We divide the 8351 dog images into three parts: a training set of 6680 images, a validation set of 835 images, and a test set of the remaining 836 images.
The number of images per breed varies between 26 and 77, with a mean of around 49 images. The bar graphs below show this distribution.
Data Visualization
As mentioned before, there are 133 dog categories in the data set. The bar graph below shows the frequency of each dog breed in the data set.
fig. Dog breed-frequency exploration graph
Here, the dog breeds are on the x-axis, and the number of images for each breed is on the y-axis.
The following bar graph shows how the dog breeds are spread across the frequencies.
Here, the x-axis shows the number of dog breeds, and the y-axis shows their frequencies. For example, we can see that five dog breeds have 77 images in the data set.
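The numbers behind both graphs can be derived from the list of breed labels alone. The sketch below uses a tiny made-up label list and hypothetical helper names; it is not the plotting code used for the figures.

```python
from collections import Counter


def breed_frequencies(labels):
    """Number of images per breed (data for the first graph)."""
    return Counter(labels)


def frequency_spread(per_breed):
    """Number of breeds sharing each image count (data for the second graph)."""
    return Counter(per_breed.values())


# Toy labels, one entry per image (illustrative only).
labels = ["Barbet"] * 3 + ["Beagle"] * 3 + ["Boxer"] * 2

per_breed = breed_frequencies(labels)
spread = frequency_spread(per_breed)

print(dict(per_breed))
print(dict(spread))
print(min(per_breed.values()), max(per_breed.values()))
```

In the toy data, two breeds have three images each, which is the same kind of statement as "five dog breeds have 77 images" in the real data set.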
Human Detector
We use OpenCV's implementation of Haar feature-based cascade classifiers to detect human faces in images. OpenCV provides many pre-trained face detectors, stored as XML files on GitHub. We used one of these.
The detector returns True if a face is present in the image.
Dog detector
To detect dogs in the images, we use a pre-trained ResNet-50 model with weights trained on ImageNet, a vast and popular data set used for image classification and other vision tasks. Given an image, this pre-trained ResNet-50 model returns a prediction (drawn from the categories available in ImageNet) for the object contained in the picture.
We also use the ImageNet class dictionary to identify the predicted object category. In this dictionary, the dog categories correspond to the keys 151 through 268 (both included). So, if the predicted class index lies in this range, the detector returns True.
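The range check at the heart of the dog detector is simple enough to show on its own. In this sketch, `predict_label` is a hypothetical stand-in for the call that runs ResNet-50 and returns the top ImageNet class index; the real model is not loaded here.

```python
# ImageNet class indices 151..268 (inclusive) are dog categories.
DOG_CLASS_MIN = 151
DOG_CLASS_MAX = 268


def is_dog_class(imagenet_index):
    """Return True if an ImageNet class index falls in the dog range."""
    return DOG_CLASS_MIN <= int(imagenet_index) <= DOG_CLASS_MAX


def dog_detector(img_path, predict_label):
    """Return True if the predicted ImageNet class for the image is a dog.

    predict_label is a hypothetical callable: image path -> class index.
    """
    return is_dog_class(predict_label(img_path))


# Usage with a stand-in predictor that always returns class index 200.
print(dog_detector("sample.jpg", predict_label=lambda path: 200))  # True
```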
Data Preprocessing
We first have to preprocess the data. As described above, the 8351 dog images are divided into a training set of 6680 images, a validation set of 835 images, and a test set of the remaining 836 images.
Before we pass the data to the neural network, we need to convert it to a suitable input size. We resized the images to the input shape (224, 224, 3): the width and the height of the resulting image are 224, and 3 is the number of colour channels.
We also divide the pixel values by 255 to bring them into the range 0 to 1.
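The scaling step can be sketched with NumPy alone. The example assumes the image has already been loaded and resized to 224 x 224 (the resizing itself is not shown), and that the network expects a leading batch dimension; the function name `to_model_input` is made up for illustration.

```python
import numpy as np


def to_model_input(image_array):
    """Turn a resized uint8 image into a scaled batch of one.

    image_array: array of shape (224, 224, 3) with values in 0..255.
    Returns a float32 array of shape (1, 224, 224, 3) with values in 0..1.
    """
    if image_array.shape != (224, 224, 3):
        raise ValueError(f"Expected shape (224, 224, 3), got {image_array.shape}")
    batch = np.expand_dims(image_array.astype("float32"), axis=0)  # add batch axis
    return batch / 255.0  # rescale pixel values to [0, 1]


# Usage with a random stand-in image.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
tensor = to_model_input(fake_image)
print(tensor.shape, tensor.dtype)
```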
Modelling the CNN (Implementation)
1. From Scratch (without Transfer learning)
First, we created a CNN by stacking the layers ourselves and passed the preprocessed data to the network. Because every layer has to be trained, training takes a long time.
Following is the summary of the CNN we used:
Here, we can see that the number of filters increases from layer to layer while the spatial (2D) dimensions decrease, which helps the network detect more refined features.
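The interplay between filter count and spatial size can be traced with a little arithmetic. The layer configuration below (three convolution blocks with 16, 32 and 64 filters, 2 x 2 kernels without padding, each followed by 2 x 2 max pooling) is a hypothetical example, not necessarily the architecture used in this project.

```python
def conv_output(size, kernel):
    """Output width/height of an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool):
    """Output width/height of non-overlapping max pooling."""
    return size // pool


def trace_shapes(input_size, filter_counts, kernel=2, pool=2):
    """List (layer, height, width, channels) for a stack of conv + pool blocks."""
    shapes = []
    size = input_size
    for filters in filter_counts:
        size = conv_output(size, kernel)
        shapes.append(("conv", size, size, filters))
        size = pool_output(size, pool)
        shapes.append(("pool", size, size, filters))
    return shapes


# Hypothetical configuration: 224 x 224 input, filters growing 16 -> 32 -> 64.
for layer in trace_shapes(224, [16, 32, 64]):
    print(layer)
```

With these assumed settings, the width and height shrink from 224 to 27 while the number of channels grows from 3 to 64.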
Performance
The model does not perform well. We used accuracy as the evaluation metric and trained the model for multiple epochs, but the overall accuracy on the test set is only around 1%, barely above the roughly 0.75% expected from random guessing among 133 breeds.
2. VGG16 based model (Using Transfer learning)
Since the model built from scratch did not work the way we hoped, we need to explore other ways to improve accuracy. Its training time was also very high, which made it unsuitable for a large number of epochs. To reduce the training time, we train the CNN using transfer learning.
Transfer learning means reusing parts of a pre-trained model in a new model that performs a similar task. Instead of training the whole network, we retrain only the last layers on top of a pre-trained, well-performing neural network. The internal layers do not have to be trained, which saves a significant amount of time.
The model uses the pre-trained VGG16 model as a fixed feature extractor: the output of the last convolutional layer of VGG16 is fed as input to our model. We add only a global average pooling layer and a fully connected layer, where the latter contains one node for each dog category and uses a softmax activation.
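What this small head computes can be illustrated with a NumPy forward pass. This is not the training code: the weights are random, and the bottleneck feature shape of (7, 7, 512) per image is an assumption about the VGG16 output for 224 x 224 inputs.

```python
import numpy as np


def global_average_pool(features):
    """Average each channel over height and width: (n, h, w, c) -> (n, c)."""
    return features.mean(axis=(1, 2))


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classifier_head(features, weights, bias):
    """Global average pooling followed by a dense softmax layer."""
    pooled = global_average_pool(features)
    return softmax(pooled @ weights + bias)


rng = np.random.default_rng(0)
features = rng.random((2, 7, 7, 512))              # assumed bottleneck shape, 2 images
weights = rng.normal(scale=0.01, size=(512, 133))  # random stand-in weights
bias = np.zeros(133)                               # one output node per breed

probs = classifier_head(features, weights, bias)
predicted_breed_index = probs.argmax(axis=1)
print(probs.shape, predicted_breed_index.shape)
```

Under the assumed 512 channels, this head has 512 * 133 + 133 = 68229 trainable parameters, which is tiny compared with the frozen convolutional base and explains the much shorter training time.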
We downloaded the VGG16 bottleneck features, which are the pre-computed outputs of the pre-trained VGG16 convolutional layers for our images. Following is the summary of the VGG16-based model:
Performance
The model performs significantly better than our previous model. It reaches an accuracy of around 42.5%, a massive improvement over the previous model's accuracy of only 1%. The training time was also shorter, even with more epochs.
Refinement
3. ResNet-50 based model
Although the accuracy improved significantly with VGG16, it is still low and can be improved further. To do this, we used another very popular neural network: ResNet-50.
As with VGG16, we downloaded the bottleneck features of ResNet-50. ResNet-50 is used as a fixed feature extractor: the output of its last convolutional layer is fed as input to our model. Following is the summary of the ResNet-50-based model:
Performance
The model performs well, achieving an accuracy of 83.13% after 20 epochs. This is expected, as ResNet-50 is a more advanced model than VGG16: it is considerably deeper and uses residual (skip) connections that make such deep networks easier to train.
Results
Model Evaluation and Validation
The accuracy improved as we moved to more complex and deeper networks. The first network, which we created from scratch, reached an accuracy of only around 1%, which is far too low.
The VGG16-based model outperformed it significantly with an accuracy of 42.5%, and finally the ResNet-50-based model achieved an accuracy above 83%.
Since the ResNet-50-based model achieves the highest accuracy of the three, we used it to test our own manual inputs.
While testing with manual inputs, we observed that the model sometimes fails to predict the correct breed (for a Barbet, it predicted Portuguese water dog). Otherwise, it returns the correct breed for dogs, the expected output for human faces, and an error when neither a dog nor a human is detected.
Justification
As discussed earlier, each step moves to a more complex model. VGG16 has far more layers than the model we created from scratch, so its features are more refined and better suited for prediction. Similarly, ResNet-50 is much deeper than the other two, so, as expected, it gives the best result of the three.
Conclusion
Reflection
This article discussed different techniques for mapping dog and human images to dog breeds. First, we created human face and dog detectors. Both worked well: they achieved 100% detection on the sample we tested them against.
After that, we worked with three different CNNs. The first was built from scratch; it took too long to train and performed poorly. Then we used transfer learning with VGG16. The training time was significantly shorter this time, and accuracy improved to 42.5%. Lastly, we used transfer learning with ResNet-50 and got the best results.
Improvements
In the end, we achieved a good accuracy of 83.13% with ResNet-50. Although it performs well, a few things could still improve the performance further:
- Increase the size of the data set with more dog pictures.
- Increase the number of categories (breeds) to cover more dogs and predict them accurately.
- Improve the quality of the labels.
- Tune the hyperparameters.
Code
The full source code and the data sets for the project can be found in the GitHub repository:
Acknowledgement
This project originated in the Udacity Data Scientist Nanodegree course, which also provided part of the code.