The evolution of YOLO: YOLOv2

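Compared with the best object detection systems of its time, YOLOv1 had a number of shortcomings. Compared with Fast R-CNN it made more localization errors, and compared with region-proposal-based detectors its recall was relatively low. The goal of YOLOv2 was therefore to improve recall and localization accuracy as much as possible while maintaining classification accuracy.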

The figure above lists the methods YOLO tried.

As it shows, the large gains in detection accuracy come mainly from the hi-res classifier and from dimension priors combined with location prediction.

Batch Normalization

Batch normalization (BN) improves mAP by about 2%. It also makes it possible to remove dropout without causing overfitting.

High Resolution Classifier

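YOLO can be seen as two parts. One is the feature extractor, which is the part of the classification network before the fully connected layers. The other is the part where YOLO makes its predictions.

YOLO training has two stages. First, we train a classifier network such as VGG16. Then we replace the fully connected layers with convolutional layers and retrain the network end to end for object detection. YOLOv1 trains the classifier on 224*224 images and then uses 448*448 images for detection. YOLOv2, after training the classification network on 224*224 images, fine-tunes it on 448*448 images for 10 epochs. This lets the convolution filters adapt to higher-resolution input before the detection network is trained on 448*448 images. This step raises mAP by almost 4%.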

Convolutional With Anchor Boxes

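YOLOv1 uses fully connected layers to predict box coordinates. This makes the gradients unstable early in training, because a size guess that works for one object may be useless for another. In the real world, however, object sizes are not random, so we fix a set of anchor boxes (also called priors) in advance, for example by clustering, and predict box coordinates relative to them.

Adopting anchors lowers mAP from 69.5 to 69.2, but raises recall from 81% to 88%.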

Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.

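This part of the paper is a little confusing. The anchor boxes discussed here are boxes the author picked by hand, not boxes obtained from k-means clustering. Using k-means-clustered boxes as the anchor boxes, together with direct location prediction, raises mAP by almost 5%. This corresponds to the dimension priors entry in the figure at the start of this article. For clustering of the prior boxes, see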
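As a rough illustration of the dimension-cluster idea, here is a minimal sketch of k-means over box sizes using d = 1 - IoU as the distance, which is the metric the paper describes. The function names, the mean-based centroid update, the seed and the toy boxes are all assumptions made for this example, not the author's code.

```python
import random

def iou_wh(box, centroid):
    """IoU of two (w, h) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs; highest IoU means smallest 1 - IoU distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumed init: k random boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                # assumed update rule: mean width and mean height
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

# Toy data: two small boxes and two large boxes, in grid-cell units.
toy_boxes = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.5, 4.8)]
anchors = kmeans_anchors(toy_boxes, k=2)
print(anchors)
```

Euclidean distance would penalize large boxes more than small ones, which is why an IoU-based distance is used here.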

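Removing the fully connected layers

Class prediction moves from the cell level to the box level.

In YOLOv1 each cell predicts 2 boxes and a single set of class probabilities. YOLOv2 uses 5 anchor boxes and predicts (1+4+20) values for each anchor box, so each cell predicts

5*(1+4+20)=125 values.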
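The arithmetic can be checked with a small sketch. The helper names are made up for this example, and the defaults assume the 20 PASCAL VOC classes.

```python
def yolov1_values_per_cell(boxes_per_cell=2, num_classes=20):
    # Each box: 4 coordinates + 1 confidence; class probabilities are shared by the cell.
    return boxes_per_cell * (4 + 1) + num_classes

def yolov2_values_per_cell(num_anchors=5, num_classes=20):
    # Each anchor: 1 objectness + 4 coordinates + its own class scores.
    return num_anchors * (1 + 4 + num_classes)

print(yolov1_values_per_cell())  # 30
print(yolov2_values_per_cell())  # 125
```

On a 13*13 grid the YOLOv2 output therefore has 13*13*125 values.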

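The input image size is changed from 448 to 416, and one pooling layer is removed

The final feature map is then 13*13.

The author argues that objects, especially large ones, tend to sit at the center of the image, so an odd-sized feature map is preferable: a single well-defined cell then predicts the object, instead of 4 neighboring cells.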

We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image so it's good to have a single location right at the center to predict these objects instead of four locations that are all nearby.

Removing one pooling layer makes the final output 13*13 (instead of 7*7).
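A short sketch of why 416 is chosen over 448, assuming the total downsampling factor of 32 mentioned later in this article. The function names are hypothetical.

```python
def grid_size(input_size, stride=32):
    """Side length of the final feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride

def has_single_center_cell(input_size, stride=32):
    # An odd grid has exactly one cell at the center of the image.
    return grid_size(input_size, stride) % 2 == 1

for size in (448, 416):
    print(size, grid_size(size), has_single_center_cell(size))
```

With a stride of 32, a 448 input gives an even 14*14 grid, while a 416 input gives the odd 13*13 grid with a single center cell.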

Direct location prediction

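How do we compute the predicted box coordinates?

Following the paper: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw * e^tw, bh = ph * e^th, where (cx, cy) is the offset of the cell from the top-left corner of the image and (pw, ph) are the width and height of the prior box.

The σ(tx) function constrains the predicted value to the range 0 to 1. This guarantees that the predicted box center stays inside the current cell, which also makes the network more stable.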
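The decoding step can be sketched in a few lines. This follows the paper's parametrization (sigmoid offsets for the center, exponential scaling of the prior for the size); the function and argument names are illustrative, and all quantities are assumed to be in grid-cell units.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs into a box (center x, center y, width, height).

    (cx, cy): top-left corner of the responsible cell.
    (pw, ph): width and height of the anchor (prior) box.
    """
    bx = sigmoid(tx) + cx      # center x is kept inside the cell
    by = sigmoid(ty) + cy      # center y is kept inside the cell
    bw = pw * math.exp(tw)     # width rescales the prior
    bh = ph * math.exp(th)     # height rescales the prior
    return bx, by, bw, bh

# All-zero outputs put the center in the middle of cell (6, 6) and keep the prior size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.0, ph=2.0))
```

Whatever the values of tx and ty, the sigmoid keeps the center between cx and cx + 1 (and cy and cy + 1), which is the constraint described above.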

Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.

Fine-Grained Features

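As the convolutions proceed, we end up with a 13*13 feature map. Predicting from this feature map works well for large objects, but less well for small ones. Faster R-CNN and SSD predict locations on feature maps produced by different layers, so that feature maps of different resolutions handle objects of different sizes. YOLO takes a different approach: it concatenates the feature maps of two layers into one, which it calls passthrough, and makes predictions on the result, as in the figure below: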
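The passthrough idea can be sketched on toy data: each 2*2 spatial block of the higher-resolution map is stacked into the channel dimension so it matches the coarse map's resolution, and the two maps are then concatenated along channels. In the paper this turns a 26*26*512 map into 13*13*2048. The pure-Python version below uses nested lists in height, width, channel order; the helper names and the channel ordering inside each block are assumptions and may differ from Darknet's own layer.

```python
def space_to_depth(fmap, block=2):
    """Stack each block x block neighborhood of an H x W x C map into channels."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

def concat_channels(a, b):
    """Concatenate two maps of equal height and width along the channel axis."""
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy maps: a 4 x 4 x 1 fine map and a 2 x 2 x 1 coarse map.
fine = [[[r * 10 + c] for c in range(4)] for r in range(4)]
coarse = [[[100], [101]], [[102], [103]]]
merged = concat_channels(space_to_depth(fine), coarse)
print(merged[0][0])  # four fine values followed by one coarse value
```

No information from the fine map is discarded; it is only rearranged, so the detector sees fine-grained features at the coarse resolution.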

Multi-Scale Training

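With the fully connected layers removed, the model can accept input of any size. To make YOLOv2 more robust, every 10 batches during training we randomly change the input size. Because the model downsamples by a factor of 32, the input size is chosen from 320, 352, ..., 608.

Those are the changes YOLOv2 makes to improve accuracy. Now let's look at what YOLOv2 does for faster inference.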
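A minimal sketch of this schedule, assuming a hypothetical `pick_input_size` helper that a training loop would call once per batch. The seed and loop length are arbitrary.

```python
import random

# Multiples of 32 from 320 to 608 inclusive.
SIZES = list(range(320, 609, 32))

def pick_input_size(batch_idx, current_size, rng, interval=10):
    """Draw a new input size every `interval` batches, otherwise keep the current one."""
    if batch_idx % interval == 0:
        return rng.choice(SIZES)
    return current_size

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 416
history = []
for batch_idx in range(30):
    size = pick_input_size(batch_idx, size, rng)
    history.append(size)
print(SIZES)
```

Each size maps to a grid between 10*10 (for 320) and 19*19 (for 608), so the same weights are trained to predict at several resolutions.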

GoogLeNet

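Most detection networks use VGG-16 as the feature extractor. For a 224*224 image, a single forward pass through VGG-16 takes 30.69 billion floating-point operations. YOLO uses a custom network based on GoogLeNet that needs only 8.52 billion operations per forward pass. The price is a slight drop in accuracy.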

Darknet-19

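The author simplifies the feature-extraction network further, as in the figure above. Note that the figure has two parts: the last three layers (conv, avgpool, softmax) do the classification, and the n layers before them do the feature extraction.

This classification network is first trained for classification on ImageNet to learn the parameters of the feature extractor, first at 224*224 and then fine-tuned at 448*448. After that, the feature-extraction part is kept, the last few layers are replaced, and the network is trained for detection, as in the figure below.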

Reference: https://arxiv.org/abs/1612.08242

Source: http://www.bubuko.com/infodetail-3194213.html
