OBJECT DETECTION — YOU ONLY LOOK ONCE (YOLO V3)
TABLE OF CONTENTS :-
INTRODUCTION
PROBLEM/OBJECTIVE
PROBLEM MAPPING
PERFORMANCE METRICS
DEEP LEARNING MODELS
YOLO (YOU ONLY LOOK ONCE)
YOLO V3 NETWORK DYNAMICS
BOUNDING BOX REPRESENTATION
PER-CLASS SIGMOIDS
LOSS-FUNCTION
MULTI-SCALE PREDICTIONS
FEATURE PYRAMID NETWORKS
FILTERING OF BOXES
NON-MAX SUPPRESSION
WORKING EXPLAINED (IMAGENET CLASSIFICATION AND OBJECT DETECTION)
PERFORMANCE
READINGS AND REFERENCES
INTRODUCTION
The task of object detection involves detecting objects of a given category/class in a picture by drawing a bounding box around each object, with the box colour or label indicating which category the object belongs to.
Let’s take the example of the following images:-
Image 1:
Image 2:
Image 3:
Image 4:
The above four images show the detection of objects of various categories such as dogs, vehicles, human beings and kittens, with a box around each object suggesting that there is an object of a particular category within that box.
PROBLEM/OBJECTIVE
The problem is to detect the objects of a particular class/category present in an image, even when multiple objects are present in the same image.
PROBLEM MAPPING
The problem of object detection can be mapped to two sub-problems: object localization and object classification. Object localization finds the position in the image where an object is present (with a probability), while object classification assigns that object to a particular category by giving a probability estimate of the object belonging to that class.
PERFORMANCE METRICS
Please refer here for the metric to judge the performance of the object detection algorithm.
DEEP LEARNING MODELS
The deep learning models take an image (or stills from a video) as input and output bounding boxes around objects of various classes in the image. One very important dataset is the COCO dataset, which has 80 classes of objects. The data can be found at COCO — Common Objects in Context (cocodataset.org). This dataset is widely used for object detection and contains hundreds of thousands of labelled images covering these 80 classes.
Deep learning models are judged on two parameters, especially in the case of object detection: speed and mAP. There is always a trade-off between speed and mAP.
Speed
The speed of an object detection algorithm refers to how quickly it processes an image. It is measured in FPS, which stands for frames per second, i.e. how many frames or images the algorithm can process in a second. This parameter is particularly important in real-time applications such as face detection.
mAP (Mean Average Precision)
The mean average precision of an object detection algorithm is particularly important in applications such as medical diagnosis and Optical Character Recognition (OCR). This metric is the mean of the average precision achieved for each category/class. The Average Precision for a class is the area under the Precision-Recall curve for that class.
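As a rough illustration (not the exact COCO evaluation protocol), the snippet below sketches how AP can be computed as the area under a precision-recall curve; the precision and recall arrays are assumed to come from sweeping the detection-score threshold for one class.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the precision-recall curve for one class.

    `precisions` and `recalls` are hypothetical arrays obtained by
    sweeping the detection-score threshold for a single class.
    """
    # Sort the points by recall and pad the curve at recall 0 and 1.
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order], [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions)[order], [0.0]))
    # Make precision monotonically decreasing (standard interpolation).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Integrate precision over the recall steps.
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP is simply the mean of the per-class APs, e.g.:
# ap_per_class = [average_precision(p_c, r_c) for p_c, r_c in per_class_curves]
# mAP = sum(ap_per_class) / len(ap_per_class)
```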
YOLO (You Only Look Once)
This is a family of algorithms for object detection in which the network only looks at the image once to detect multiple objects. Hence the name YOLO, You Only Look Once. The algorithm uses a single convolutional neural network to extract features from the image and simultaneously predict multiple bounding boxes from those extracted bottleneck features, along with the probability that a box contains an object and the probability of each category of object being present in that box.
Pros and Cons
Pros:-
Firstly, YOLO is very fast, as the algorithm reframes object detection as a single regression problem, going from image pixels to bounding box coordinates and class probabilities in one pass. This is in contrast to two-stage algorithms such as R-CNN, Fast R-CNN and Faster R-CNN, which propose regions in the first stage to generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections and rescore boxes based on other objects in the scene.
Secondly, YOLO reasons globally about the image while making predictions. Unlike sliding-window and region-proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.
Thirdly, YOLO learns generalizable representations of objects. When trained on natural images and tested on other domains such as artwork, YOLO outperforms other detection methods like DPM and R-CNN by a wide margin.
Cons:-
YOLO still lags behind state-of-the-art detection systems in accuracy, which is the price it pays for being extremely fast. While it quickly identifies objects in images, it struggles to precisely localize some objects, especially small ones covering less than about 5% of the image.
YOLO struggles to detect objects that appear in groups, such as birds flying in flocks or plates arranged in a stack.
YOLO struggles to generalize to objects in new or unusual aspect ratios or configurations.
There are other algorithms such as the Single Shot Detector (SSD), DPM (Deformable Part Models) and RetinaNet which are also used for detecting objects. YOLO algorithms are extremely fast because they localize and classify objects in one single pass. In terms of performance measured as mAP at 0.5 IOU, YOLOv3 is on par with RetinaNet but is about 4x faster at prediction.
How does YOLO work?
YOLO is a new approach to object detection. In this approach, object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. In this approach, a single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Other approaches to object detection work differently: DPM (Deformable Parts Model) uses a sliding-window approach in which the classifier is run at evenly spaced locations over the entire image, while R-CNN (Region-based Convolutional Neural Network) generates potential bounding boxes in the image and then runs a classifier on the proposed boxes. YOLO, in contrast, is reframed as a regression problem straight from image pixels to bounding box coordinates and class probabilities. It is refreshingly simple: a single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes.
YOLO V3 NETWORK DYNAMICS
This version of YOLO (Version 3, abbreviated as V3) has some minor updates to the earlier versions of YOLO. Here are a few of those :-
- This network is slightly bigger than the previous one, but it is more accurate while still being fast enough.
- A 320 x 320 YOLO V3 runs in 22 ms at 28.2 mAP.
- This network achieves 57.9 AP50 (mAP at 0.5 IOU) in 51 ms on a Titan X GPU, compared to 57.5 AP50 in 198 ms by RetinaNet: essentially the same accuracy, achieved almost 4x faster.
BOUNDING BOX REPRESENTATION
The YOLO V3 network learns a set of offsets tx, ty, tw and th instead of the actual bounding box coordinates bx, by, bw and bh.
- The actual coordinates bx, by, bw and bh are derived from two sets of values: the grid-cell and anchor box values, denoted by cx, cy, pw and ph, and the offsets learnt from the training data, denoted by tx, ty, tw and th.
- cx and cy denote the offset of the grid cell from the top-left corner of the image, and pw and ph are the width and height of the anchor box.
- The actual bounding box coordinates are then given by bx = sigmoid(tx) + cx and by = sigmoid(ty) + cy, i.e. a small shift between 0 and 1 inside the cell added to the cell offsets, and by bw = pw * e^tw and bh = ph * e^th, i.e. the anchor width and height scaled by the exponential of the learnt tw and th.
The anchor dimensions pw and ph are fixed and are obtained by running K-Means clustering on the widths and heights of the ground-truth bounding boxes (the diagram below shows clustering with K = 5, as introduced in YOLO V2; YOLO V3 itself clusters 9 anchors and assigns 3 of them to each of its three prediction scales), while cx and cy come from the position of the grid cell. The diagram below should make this clearer.
The above diagram shows that most of the training bounding boxes cluster around the 5 anchor shapes shown in the right-hand box.
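To make the transform concrete, here is a minimal NumPy sketch of decoding the learnt offsets into an actual box; the cell offsets and anchor sizes used in the example are illustrative, not the actual trained anchors.

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Convert the learnt offsets (tx, ty, tw, th) into an actual box.

    (cx, cy) is the top-left offset of the grid cell containing the box
    centre and (pw, ph) are the anchor (prior) width and height, all in
    grid units. This is a sketch of the equations above, not the exact
    implementation of any particular repository.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    bx = sigmoid(tx) + cx          # centre x, shifted inside the cell
    by = sigmoid(ty) + cy          # centre y, shifted inside the cell
    bw = pw * np.exp(tw)           # width  = anchor width  * e^tw
    bh = ph * np.exp(th)           # height = anchor height * e^th
    return bx, by, bw, bh

# Example: a box predicted in cell (6, 6) of the 13 x 13 grid,
# with an illustrative anchor of 3.6 x 5.6 grid units.
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=6, pw=3.6, ph=5.6))
```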
PER-CLASS SIGMOIDS
To achieve the object detection task using bounding boxes, the following steps are followed :-
The Darknet-53 network, pretrained on the ImageNet data, is used for object classification.
The task of object detection is achieved using the coordinates tx, ty, tw and th learnt by the network, together with the probability of the box containing any object and the probability of each of the 80 categories or classes of objects being present in that box.
Now, for representing the probabilities of each of the 80 classes or categories, 80 independent sigmoids are used (one per class) instead of a single softmax, mainly because of the presence of overlapping labels, for example Woman and Person: an object labelled Woman is also a Person. A softmax would force the network to categorize the object as either one label or the other, never both, whereas independent sigmoids allow multi-label predictions.
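The difference is easy to see numerically; the sketch below uses hypothetical raw scores for a handful of classes.

```python
import numpy as np

# Hypothetical raw scores (logits) for 5 classes, for one box.
logits = np.array([2.1, 1.9, -3.0, -1.0, 0.2])

# Softmax: the probabilities compete and must sum to 1, so two
# overlapping labels cannot both be high at the same time.
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent sigmoids: each class gets its own probability in [0, 1],
# so several labels can be "on" simultaneously (multi-label output).
sigmoids = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(2))   # [0.5  0.41 0.   0.02 0.07] -> sums to 1
print(sigmoids.round(2))  # [0.89 0.87 0.05 0.27 0.55] -> independent
```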
LOSS FUNCTION
The loss function is comprised of the following components :-
- The classification loss.
- The localization loss which is nothing but the error between the ground truth box and the predicted bounding box.
- The confidence loss (the loss on the objectness score of the predicted box)
Classification loss:- This is the log loss on the class predictions for those boxes that contain an object. Mathematically, it can be interpreted as follows:-
Localization loss:- This is the squared error between the predicted and ground-truth bounding box coordinates for those boxes that contain an object. Specifically, it is the squared loss on the centre coordinates and on the square roots of the width and height (the square root damps the penalty for large boxes). Mathematically, it can be interpreted as :-
λcoord is the weight given to the squared loss.
Confidence loss :- This is the log loss on the objectness score of a box, summed both over the boxes containing an object and over the boxes not containing any object, with the no-object part weighted by a λnoobj parameter.
For those boxes wherein there is an object, mathematically, it can be expressed as :-
For those boxes, where there is no object, mathematically it is expressed as:-
NOTE :- To balance the squared loss on the bounding box coordinates against the log loss for boxes containing no objects (the coordinates can grow arbitrarily large while the probabilities range from 0 to 1), λcoord and λnoobj are fixed at 5 and 0.5 respectively.
The full loss is given by :-
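Combining the three components described above, one common way of writing the full loss (the exact form differs slightly between YOLO versions) is:

```latex
\begin{aligned}
\mathcal{L} =\;& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
\Big[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2
    + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \Big] \\
&- \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \log \hat{C}_i
 \;-\; \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \log \big(1 - \hat{C}_i\big) \\
&- \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \mathrm{classes}}
  \Big[ p_i(c) \log \hat{p}_i(c) + \big(1 - p_i(c)\big) \log \big(1 - \hat{p}_i(c)\big) \Big]
\end{aligned}
```

Here S is the grid size, B the number of boxes per cell, the indicator 1_ij^obj is 1 if box j of cell i is responsible for an object (and 1_ij^noobj the opposite), Ĉ_i is the predicted objectness score and p̂_i(c) the predicted probability of class c.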
Please refer to the YOLO papers linked in the references for the exact form of the loss function.
MULTI-SCALE PREDICTIONS
This concept of multi-scale predictions helps in detecting smaller objects. Let's build an intuition for it.
- Bounding box detections are made from the feature map taken at the 13 x 13 scale by predicting boxes centred at each of its cells; each of these 13 x 13 grid cells corresponds to a 32 x 32 pixel patch of the input image.
- If a small object lies entirely within one of these 32 x 32 patches of the input image, it becomes very hard for the 13 x 13 feature map to detect it by drawing a bounding box centred at any cell, since all the information about the object is captured in a single cell.
To overcome this limitation of not being able to detect smaller objects, bounding boxes are predicted from feature maps at the 13 x 13, 26 x 26 and 52 x 52 scales, so that the chances of predicting a bounding box around a small object increase.
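To make the scales concrete, here is a tiny sketch of the arithmetic, assuming the commonly used 416 x 416 input size:

```python
# For a 416 x 416 input, the three prediction scales correspond to the
# network strides 32, 16 and 8.
input_size = 416
for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride:>2}: {grid} x {grid} grid, "
          f"each cell covers a {stride} x {stride} pixel patch of the input")
```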
FEATURE PYRAMID NETWORKS
The concept of Feature Pyramid Networks is borrowed from the structure of pyramids.
At every layer of a Convolutional Neural Network where we apply a stride of 2, the resulting feature map is half the size of the feature map in the previous layer.
If we keep doing this for consecutive layers, the resulting stack of feature maps looks very much like a pyramid. The structure below illustrates this.
In the above diagram the concept is demonstrated, and the multi-scale prediction objective is achieved by taking the feature maps produced at different layers of the Darknet-53 architecture. The lateral connections shown in the diagram merge the upsampled information from a deeper, coarser layer with the feature map of an earlier layer at the same resolution.
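The sketch below shows this upsample-and-merge step in Keras; the layer sizes are illustrative, and the real YOLO V3 head has more convolutions around it.

```python
# A minimal Keras sketch of the upsample + lateral-merge used for
# multi-scale prediction (shapes are illustrative).
import tensorflow as tf
from tensorflow.keras import layers

coarse = layers.Input(shape=(13, 13, 1024))   # deep, low-resolution map
lateral = layers.Input(shape=(26, 26, 512))   # earlier, higher-resolution map

x = layers.Conv2D(256, 1, padding="same")(coarse)  # 1x1 conv shrinks channels
x = layers.UpSampling2D(2)(x)                      # 13x13 -> 26x26
merged = layers.Concatenate()([x, lateral])        # lateral connection

model = tf.keras.Model([coarse, lateral], merged)
model.summary()  # merged output shape: (None, 26, 26, 768)
```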
FILTERING OF BOXES
This essentially corresponds to filtering out those predicted bounding boxes for which the maximum, over classes, of the product of the objectness score and the conditional class probability is less than 0.5, as those boxes are not useful.
This is essential because of the enormous number of bounding boxes present in the output of the object detector when 3 bounding boxes are predicted per cell. With 3 bounding boxes per cell across the three scales of the multi-scale prediction, the total number of boxes output is 10,647. The figure below clarifies this.
The number of bounding boxes in total with 3 bounding boxes per cell is :-
n = 52 x 52 x 3 + 26 x 26 x 3 + 13 x 13 x 3
= 13 x 13 x 3 x (16 + 4 + 1)
= 507 x 21
= 10647 bounding boxes.
The filtering operation removes a large fraction of these boxes, leaving a reasonable number as output. This is also practical, since the presence of 10,647 objects in a single image is highly unlikely.
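A minimal NumPy sketch of this score-based filtering is given below; the shapes assume the predictions of all three scales have already been flattened into one list, and the random inputs are only there to show the shapes.

```python
import numpy as np

def filter_boxes(boxes, objectness, class_probs, threshold=0.5):
    """Keep only the boxes whose best combined class score exceeds the threshold.

    boxes:       (N, 4) box coordinates
    objectness:  (N,)   objectness scores
    class_probs: (N, 80) per-class probabilities
    """
    scores = objectness[:, None] * class_probs   # (N, 80) combined scores
    best_class = scores.argmax(axis=1)           # most likely class per box
    best_score = scores.max(axis=1)              # its combined score
    keep = best_score > threshold
    return boxes[keep], best_class[keep], best_score[keep]

# 3 boxes per cell across the three scales gives 10,647 candidates in total.
N = 52 * 52 * 3 + 26 * 26 * 3 + 13 * 13 * 3
boxes, cls, score = filter_boxes(np.random.rand(N, 4),
                                 np.random.rand(N),
                                 np.random.rand(N, 80))
print(N, "boxes in,", len(boxes), "boxes kept")
```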
NON-MAX SUPPRESSION
This operation is useful for outputting only one box per object when the same object is also covered by bounding boxes centred in the surrounding cells. It works by repeatedly keeping the box with the highest score and suppressing all remaining boxes whose IOU with that kept box exceeds a chosen threshold.
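Below is a minimal NumPy sketch of greedy non-max suppression (in practice it is applied per class); the IOU helper assumes boxes in (x1, y1, x2, y2) format.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]            # indices by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]   # discard heavily overlapping boxes
    return keep
```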
WORKING EXPLAINED (IMAGENET CLASSIFICATION AND OBJECT DETECTION)
The object detection is done in two stages:-
- Object Classification
- Object Detection (Generation of Bounding Boxes for detecting objects along with the label of the object).
OBJECT CLASSIFICATION :-
Object Classification is done using the pretrained weights of the Darknet-53 architecture on the Imagenet dataset. This network is used as the backbone network for Object Detection. Let’s have a look in brief at the convolution operation and the network architecture of Darknet-53.
Convolution Operation :-
In the above image, the three layers of Red, Green and Blue represent the channels of the input image, and each channel is a grid of pixels of various intensities, denoted by the squares.
The intensities are numbers between 0 and 255. To capture information such as edges in the image, we use filters which are initialized randomly; through backpropagation they attain the desired weights.
In the above image, there are two such filters, each capturing different information from the same RGB image. In practice, the number of filters, and hence the number of output channels of a convolutional layer, is a hyperparameter chosen while designing the network.
The next layer, shown in white, where the information from the input RGB image is captured, is the output of the convolution operation.
Please read more about convolution here.
Darknet — 53 Architecture
Each block consists of a 1x1 convolution followed by a 3x3 convolution that doubles the number of filters, together with a residual (skip) connection: the input to the block is added to the output of the two convolutions before being passed on to the next layer.
Each convolutional layer consists of a convolution, Batch Normalization and a Leaky ReLU activation, which behaves like a ReLU except that for negative inputs the activation is not zero but the input multiplied by a small slope (0.1 in the Darknet implementation).
Between the blocks there is a 3 x 3 convolution with stride 2, which doubles the number of filters of the preceding layer while halving the spatial dimensions.
The convolutional blocks are followed by a Global Average Pooling layer, then a fully connected layer of 1000 neurons (for recognizing the 1000 ImageNet categories) and a softmax on top to output the most probable class.
The strided convolutions halve the spatial dimensions five times, so across the whole architecture the input resolution is reduced by a factor of 32. Hence, the feature extractor trained on classification data outputs a feature map whose height and width are 1/32 of the input dimensions, with 1024 filters.
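The building blocks described above can be sketched in Keras as follows; the layer sizes and the 0.1 Leaky ReLU slope follow the common Darknet convention, and the snippet builds only the first few layers rather than the full 53-layer network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_leaky(x, filters, kernel_size, strides=1):
    """Convolution + Batch Normalization + Leaky ReLU, as described above."""
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def residual_block(x, filters):
    """1x1 conv (halve channels) -> 3x3 conv (restore) -> skip connection."""
    shortcut = x
    x = conv_bn_leaky(x, filters // 2, 1)
    x = conv_bn_leaky(x, filters, 3)
    return layers.Add()([shortcut, x])

inputs = layers.Input(shape=(416, 416, 3))
x = conv_bn_leaky(inputs, 32, 3)
x = conv_bn_leaky(x, 64, 3, strides=2)   # stride-2 conv halves 416 -> 208
x = residual_block(x, 64)
tf.keras.Model(inputs, x).summary()
```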
We can have a look at the comparison of various backbone feature extractors as below:-
Now, the above network classifies ImageNet images and gives its results after the softmax layer. The output just above the average pooling layer is 13 x 13 x 1024 for an input image of 416 x 416 x 3. Similarly, it is 10 x 10 x 1024 for a 320 x 320 x 3 input and 19 x 19 x 1024 for a 608 x 608 x 3 input.
OBJECT DETECTION :-
Object detection is achieved by taking the feature maps from the 13 x 13 x 1024, 26 x 26 x 512 and 52 x 52 x 256 outputs of the above network, assuming an input image of 416 x 416 x 3.
After that, these feature maps are passed through a series of convolutions, and lateral information from the deeper layers is merged in: for instance, the 52 x 52 x 256 feature map receives upsampled information from the 26 x 26 x 512 map, and the 26 x 26 x 512 feature map receives it from the 13 x 13 x 1024 map.
The below diagram would be helpful.
Big Object Detection :-
For detecting big objects, following steps are followed:-
- The output of the last convolution block of the Darknet-53 architecture, which is 13 x 13 x 1024, is passed through 3 alternating pairs of 1x1 convolutions with 512 filters and 3x3 convolutions with 1024 filters.
- The output of the above operation is passed through 255 1x1 filters, since we need a tensor of 13 x 13 x 255 for detecting 3 bounding boxes per cell: 3 x (4 box coordinates + 1 objectness score + 80 class probabilities) = 255.
Medium Object Detection :-
For detecting medium objects, following steps are followed:-
- The output of step 1 of the large-object branch is passed through 256 1x1 filters and then upsampled to get a tensor of size 26 x 26 x 256.
- The output of the above step is concatenated with the output of the second-to-last convolution block of Darknet-53 (26 x 26 x 512), after which it is passed through 3 alternating pairs of 1x1 convolutions with 256 filters and 3x3 convolutions with 512 filters.
- The output of the above step is passed through 255 1x1 filters to produce the 26 x 26 x 255 tensor for medium-object detection, in which the bounding boxes are centred at each of the 26 x 26 grid cells.
Small Object Detection:-
For detecting small objects, following steps are followed:-
- The output of the second step of the medium-object branch is passed through 128 1x1 filters, then upsampled and concatenated with the output of the third-to-last convolution block of Darknet-53 (52 x 52 x 256).
- The output of the above step is passed through 3 alternating pairs of 1x1 convolutions with 128 filters and 3x3 convolutions with 256 filters.
- The above output is passed through 255 1x1 filters to produce the 52 x 52 x 255 tensor for small-object detection, in which the bounding boxes are centred at each of the 52 x 52 grid cells.
The tensor encoding a single bounding box prediction (4 coordinates, 1 objectness score and 80 class probabilities) is shown in the above picture.
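As a sanity check of these numbers, the sketch below reshapes a hypothetical 13 x 13 x 255 head output into its per-box components.

```python
import numpy as np

# Hypothetical raw output of the large-object head for one image:
# 13 x 13 cells, each with 3 boxes of (4 coords + 1 objectness + 80 classes).
raw = np.random.randn(13, 13, 255)

# Reshape so the last axis separates the 85 values of each of the 3 boxes.
pred = raw.reshape(13, 13, 3, 85)
box_offsets  = pred[..., 0:4]   # tx, ty, tw, th
objectness   = pred[..., 4]     # objectness score (before the sigmoid)
class_scores = pred[..., 5:]    # 80 per-class scores (before the sigmoids)

print(pred.shape, box_offsets.shape, objectness.shape, class_scores.shape)
# (13, 13, 3, 85) (13, 13, 3, 4) (13, 13, 3) (13, 13, 3, 80)
```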
Training of the Network :-
The network is trained for detection by backpropagating the detection loss through the Darknet-53 weights that were initialized from the classification pretraining, so the backbone weights are also updated (fine-tuned) for the detection task.
Performance
Let’s have a look at the comparison of YOLO V3 with other algorithms
As you can see, YOLO V3-416 is extremely fast while giving very comparable performance in terms of mAP at 0.5 IOU. So it outperforms other state-of-the-art object detection systems in terms of speed.
READINGS AND REFERENCES:-
Research Paper for YOLO V1, YOLO V2 and YOLO V3
YOLO V1 :- https://arxiv.org/abs/1506.02640
YOLO V2 :- https://arxiv.org/abs/1612.08242
YOLO V3 :- https://arxiv.org/abs/1804.02767
Code for YOLO V3:-
Machine Learning Mastery:-
https://machinelearningmastery.com/how-to-perform-object-detection-with-yolov3-in-keras/
Keras Implementation:-
GitHub — qqwweee/keras-yolo3: A Keras implementation of YOLOv3 (Tensorflow backend)
C Implementation:-
Github Repo:-
References:-
Python Lessons (pylessons.com)
https://westcentralus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fa