# Difference between revisions of "overfeat: integrated recognition, localization and detection using convolutional networks"

<center>
[[File:Im_2.PNG | frame | center |Figure 1. This image illustrates the higher difficulty of the detection dataset, which can contain many small objects while the classification and localization images typically contain a single large object. ]]
</center>

<center>
[[File:Im_1.PNG | frame | center |Figure 2. An illustration of the architecture of this CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000. ]]
</center>

<center>
[[File:Im_3.PNG | frame | center |Figure 3. 1D illustration (to scale) of output map computation for classification. (a): 20 pixel unpooled layer 5 feature map. (b): max pooling over non-overlapping 3 pixel groups, using offsets of Δ = {0, 1, 2} pixels (red, green, blue respectively). (c): The resulting 6 pixel pooled maps, for different Δ. (d): 5 pixel classifier (layers 6,7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each Δ. (e): reshaped into 6 pixel by C output maps. ]]
</center>

## Revision as of 03:56, 23 October 2015

# Introduction

Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) have been applied for many years. ConvNets have advanced the state of the art on large datasets such as 1000-category ImageNet
<ref name=DeJ>
Deng, Jia, *et al.* ["ImageNet: A Large-Scale Hierarchical Image Database."](http://www.image-net.org/papers/imagenet_cvpr09.pdf) in CVPR (2009).
</ref>.

This research shows that training a single convolutional network to simultaneously classify, localize, and detect objects in images can boost the accuracy of all three tasks. The paper proposes an integrated approach to object detection, recognition, and localization with a single ConvNet, and introduces a novel method for localization and detection that accumulates predicted bounding boxes. The authors suggest that, by combining many localization predictions, detection can be performed without training on background samples, which avoids the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy.

# Vision Tasks

This research explores three computer vision tasks in increasing order of difficulty:

(i) classification, (ii) localization, and (iii) detection.

In the classification task, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer, because images can also contain multiple unlabeled objects. The localization task likewise allows five guesses per image, but each guess must also return a bounding box for the predicted object. To count as correct, the predicted box must match the ground truth by at least 50% (using the PASCAL criterion of intersection over union) and be labeled with the correct class.
Images from the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013) are used for this research. The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision measure. Figure 1 illustrates the higher difficulty of the detection task.
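
The 50% overlap criterion described above can be illustrated with a minimal Python sketch. The box layout `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `is_correct_localization` are assumptions made for this illustration; they are not taken from the paper or the challenge toolkit.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_localization(pred_box: Box, pred_label: str,
                            true_box: Box, true_label: str,
                            threshold: float = 0.5) -> bool:
    """A guess counts only if the label matches and the overlap reaches the threshold."""
    return pred_label == true_label and iou(pred_box, true_box) >= threshold
```

For example, a predicted box `(0, 0, 10, 10)` against a ground-truth box `(0, 0, 10, 8)` overlaps by 80 / 100 = 0.8, so the guess passes the 50% test provided the class label also matches.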

# Classification

During the *training* phase, this model uses the same fixed input size approach proposed by Krizhevsky *et al.*
<ref name=KrA>
Krizhevsky, Alex, *et al.* ["ImageNet Classification with Deep Convolutional Neural Networks."](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf) in NIPS (2012).
</ref>.
As depicted in Figure 2, this network contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
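
As a rough illustration of this objective, the sketch below computes the average log-probability of the correct label from raw class scores using only the Python standard library. The function names `log_softmax` and `mean_log_likelihood` are hypothetical, and the code shows the quantity being maximized rather than the training code used for this network.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log of the softmax distribution, computed stably by subtracting the max."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mean_log_likelihood(batch_logits: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Average over training cases of the log-probability of the correct label."""
    total = sum(log_softmax(logits)[label]
                for logits, label in zip(batch_logits, labels))
    return total / len(labels)
```

Maximizing `mean_log_likelihood` is the same as minimizing the usual cross-entropy loss, which is simply its negative.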

Each image is downsampled so that the smallest dimension is 256 pixels. Then five random crops (and their horizontal flips) of size 221x221 pixels are extracted and presented to the network in mini-batches of size 128. The weights in the network are initialized randomly and then updated by stochastic gradient descent. Overfitting can be reduced by using “DropOut”
<ref name=HiG>
Hinton, Geoffrey, *et al* "Improving neural networks by preventing co-adaptation of feature detectors." arXiv:1207.0580, (2012).
</ref>
to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. DropOut is employed on the fully connected layers (6th and 7th) in the classifier. During the *training* phase, multiple GPUs are used to speed up computation.
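
The random omission of hidden units can be sketched as follows. This is a minimal illustration, not the implementation used for this network: it assumes the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time, whereas Hinton *et al.* describe the equivalent alternative of halving the outgoing weights at test time. The function name `dropout` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def dropout(activations: List[float], p_drop: float = 0.5,
            training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each unit with probability p_drop during training (inverted dropout)."""
    if not training:
        # At test time every unit is kept and no rescaling is needed.
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p_drop
    # Surviving units are scaled by 1 / keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With `p_drop = 0.5`, each call on a layer's activations returns a different random half of the units (on average), each doubled, which is what prevents a unit from relying on specific other units being present.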

During the *test* phase, the entire image is explored by densely running the network at each location and at multiple scales. This approach yields significantly more views for voting, which increases robustness while remaining efficient.
For resolution augmentation, 6 scales of input are used, which result in unpooled layer 5 maps of varying resolution. These are then pooled and presented to the classifier using the following procedure:

(a). For a single image, at a given scale, we start with the unpooled layer 5 feature maps.

(b). Each of the unpooled maps undergoes a 3x3 max pooling operation (non-overlapping regions), repeated 3x3 times for (Δx, Δy) pixel offsets of {0, 1, 2}.

(c). This produces a set of pooled feature maps, replicated (3x3) times for different (Δx, Δy) combinations.

(d). The classifier (layers 6,7,8) has a fixed input size of 5x5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding window fashion to the pooled maps, yielding C-dimensional output maps (for a given (Δx, Δy) combination).

(e). The output maps for different (Δx, Δy) combinations are reshaped into a single 3D output map (two spatial dimensions x C classes).

These operations can be viewed as shifting the classifier’s viewing window by 1 pixel through pooling layers without subsampling and using skip-kernels in the following layer (where values in the neighborhood are non-adjacent).
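
Steps (a) through (e) can be traced in one dimension, mirroring the sizes given in the Figure 3 caption (a 20 pixel map, pooling over 3 pixels, a 5 pixel classifier window). The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the real classifier (layers 6 to 8) is replaced by a stand-in that sums its window and returns a single number instead of a C-dimensional vector.

```python
from typing import Callable, List, Sequence


def max_pool_1d(x: Sequence[float], size: int, offset: int) -> List[float]:
    """Non-overlapping max pooling that starts at `offset`; incomplete windows are dropped."""
    return [max(x[i:i + size]) for i in range(offset, len(x) - size + 1, size)]


def slide(x: Sequence[float], window: int,
          fn: Callable[[Sequence[float]], float]) -> List[float]:
    """Apply `fn` to every contiguous window of length `window` (stride 1)."""
    return [fn(x[i:i + window]) for i in range(len(x) - window + 1)]


def fine_stride_outputs(features: Sequence[float], pool: int = 3, window: int = 5,
                        classifier: Callable[[Sequence[float]], float] = sum) -> List[float]:
    """Steps (b) to (e) in 1D: pool at every offset, classify, then interleave."""
    # (b), (c), (d): one pooled map and one output map per pixel offset.
    per_offset = [slide(max_pool_1d(features, pool, d), window, classifier)
                  for d in range(pool)]
    # (e): output k of offset d corresponds to input position pool * k + d,
    # so the per-offset maps are interleaved into a single finer-grained map.
    n = min(len(m) for m in per_offset)
    return [per_offset[d][k] for k in range(n) for d in range(pool)]
```

Called on a 20 element input, each of the three offsets yields a 6 element pooled map and a 2 element output map, and interleaving gives 6 outputs, matching the sizes in Figure 3.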

The procedure above is repeated for the horizontally flipped version of each image. The final classification is produced by:

(I) Taking the spatial max for each class, at each scale and flip.

(II) Averaging the resulting C-dimensional vectors from different scales and flips.

(III) Taking the top-1 or top-5 elements (depending on the evaluation criterion) from the mean class vector.
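
The three aggregation steps can be sketched in plain Python as follows. The nested-list layout `[row][col][class]` for an output map and the names `spatial_max` and `classify` are assumptions made for this illustration.

```python
from typing import List

# Assumed layout for this sketch: output_map[row][col][class] holds one score.
OutputMap = List[List[List[float]]]


def spatial_max(output_map: OutputMap) -> List[float]:
    """Step (I): for each class, the maximum score over all spatial locations."""
    num_classes = len(output_map[0][0])
    return [max(cell[c] for row in output_map for cell in row)
            for c in range(num_classes)]


def classify(output_maps: List[OutputMap], k: int = 5) -> List[int]:
    """Steps (II) and (III): average over scales and flips, then take the top k classes."""
    per_view = [spatial_max(m) for m in output_maps]
    num_classes = len(per_view[0])
    mean = [sum(v[c] for v in per_view) / len(per_view) for c in range(num_classes)]
    return sorted(range(num_classes), key=lambda c: mean[c], reverse=True)[:k]
```

Here `output_maps` holds one map per (scale, flip) combination. The maps may differ in spatial size, since the spatial max reduces each of them to a single C-dimensional vector before averaging.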

The approach described above, with 6 scales, achieves a top-5 error rate of 13.6%. As might be expected, using fewer scales hurts performance: the single-scale model is worse, with 16.97% top-5 error. The fine stride technique illustrated in Figure 3 brings a relatively small improvement in the single-scale method, but is also important for the multi-scale gains shown here.

## Localization

For localization, the classification-trained network is modified: the classifier layers are replaced by a regression network, which is then trained to predict object bounding boxes at each spatial location and scale. The regression predictions are then combined, along with the classification results at each location.

## Detection

# References

<references />