YOLOv3 code analysis and training on your own dataset

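We now need to bring in a detection model for our own requirements: detect people only, and allow the image to lie on its side, so that the person fills more of the frame and is clearer. We do not need to handle people who occupy only a small part of the image. Below is the result of a yolov3-tiny model trained for this requirement.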

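The YOLOv3 model has been explained to death online, but I always find that without reading the code and knowing the concrete implementation, the explanations never quite become clear. So here I go through the darknet implementation to answer my own questions and to keep some notes.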

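First open yolov3.cfg and look at the network; netron can display it graphically. The network is built mainly from three kinds of layers: convolutional layers, shortcut (residual connection) and route (channel concatenation). A stride-2 convolution first downsamples the image once; then, after one shortcut (residual connection), another stride-2 convolution downsamples it a second time. After that, convolutions and residual blocks keep alternating until the network starts to branch. Two branches split off, and each time a branch splits off the main trunk is downsampled once more. Together with the main trunk there are three branches in total, that is, three yolo layers. The main trunk is downsampled five times, the first branch three times and the second branch four times.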

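This also answered an old question of mine. Ever since ResNet, networks have let non-adjacent layers talk to each other, whether by addition or by concatenation, and I wondered how a branching structure could be described in a flat, list-like file. It turns out to be simple: describe one branch at a time, use route/shortcut to record which layers the branch connects to, and keep describing downward.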
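To make the flat-file idea concrete, here is a minimal sketch of how a layer reference in a route or shortcut entry can be resolved. It assumes the convention used in the darknet cfg files, where a negative value counts back from the current layer and a non-negative value is an absolute layer index. The function name resolve_layer_ref is my own, not a darknet function.

```c
/* Resolve a layer reference from a route/shortcut entry.
 * Assumption: negative values are relative to the current layer,
 * non-negative values are absolute layer indices. */
int resolve_layer_ref(int current_index, int ref)
{
    return ref < 0 ? current_index + ref : ref;
}
```

For example, a hypothetical route at layer 86 with "layers = -1, 61" would read the outputs of layers 85 and 61 and concatenate them along the channel axis.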

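Back to the network. The three branches can be represented by this figure, which I found online (I do not know whose it is, but it expresses much of what needs saying). There is one problem, probably because of a version update: the config I have uses a 608*608 input, so the figure needs some adjusting, for example 13*13 becomes 19*19. The main trunk is the one with the 32x (2^5) stride, while 16 and 8 are the second and first branch. The main trunk is computed first; after that, the output of an earlier convolutional layer is upsampled and routed together with the second branch, and the first branch follows the same logic.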
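The grid sizes mentioned above follow directly from the number of stride-2 downsamplings. A small sketch of that arithmetic (the helper name grid_size is hypothetical):

```c
/* Side length of the feature map after `downsamples` stride-2 steps.
 * Assumes the input size is a multiple of 2^downsamples. */
int grid_size(int input_size, int downsamples)
{
    return input_size >> downsamples;
}
```

With a 608 input this gives 19, 38 and 76 for five, four and three downsamplings; with a 416 input the main trunk gives the 13*13 grid shown in the figure.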

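This yolo layer is the one on the main trunk. As analyzed above, the feature map at this layer is only 19*19. The anchors listed are the K=9 cluster boxes obtained by running k-means over the dataset, and mask=6,7,8 points to the last three, the large boxes, so the main trunk mainly detects large objects. From the earlier analysis we know that the medium-box branch incorporates the feature maps of the large-box branch, and the small-box branch incorporates those of both the medium and large branches. This is probably why yolov3 recognizes small objects better than v2, while shortcut and route make it possible to deepen the network to more than a hundred layers.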
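As a sketch of how a mask selects its anchors, the snippet below looks up the width/height pair for each mask entry. The anchor values are the ones in the stock yolov3.cfg as I remember them (only the last three are quoted in this article), and the names anchor_t, YOLOV3_ANCHORS and anchor_for_mask are made up for illustration.

```c
typedef struct { int w, h; } anchor_t;

/* Anchors as width,height pairs in pixels (assumed stock yolov3.cfg values). */
static const int YOLOV3_ANCHORS[18] = {
    10, 13,   16, 30,    33, 23,
    30, 61,   62, 45,    59, 119,
    116, 90,  156, 198,  373, 326
};

/* Return the anchor selected by the i-th entry of a yolo layer's mask. */
anchor_t anchor_for_mask(const int *anchors, const int *mask, int i)
{
    int a = mask[i];
    anchor_t r = { anchors[2 * a], anchors[2 * a + 1] };
    return r;
}
```

With mask = 6,7,8 this yields 116x90, 156x198 and 373x326, the three large boxes handled by the 19*19 layer.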

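Each layer type in darknet has three main functions: make_xxx_layer, forward_xxx_layer and backward_xxx_layer. make_xxx_layer initializes the parameters and derives others from them; for a convolutional layer, for example, the size of the output feature map is computed from the input feature map size and the kernel parameters, and host or GPU memory is allocated. forward_xxx_layer computes the prediction from the current layer's parameters. backward_xxx_layer takes the delta from the layer above (if that layer is a yolo layer, delta means expected output - predicted output) together with the input, computes the gradient, and updates the delta needed by the layer below. There is also update_xxx_layer, which applies the gradient computed by backward_xxx_layer to the parameters.

Compared with convolutional layers, the detection layers (yolo, region, detection, softmax) have no weight parameters; their main job is to compute delta (expected output - predicted output). Backpropagation starts from this yolo delta: the yolo layer hands its delta to the convolutional layer, which combines it with the previous layer's output to get its own gradient and combines it with its own weights to get a new delta. The next convolutional layer then uses that delta in the same way, and the loop continues until all parameters are updated.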
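The forward/backward/update cycle is easier to see on a toy layer. The following is a minimal sketch for a single linear neuron, not darknet's code; the function names are hypothetical. It uses the same sign convention as above (delta = expected - out), so the update adds the accumulated gradient instead of subtracting it.

```c
/* Forward: out = sum_i w[i] * in[i]. */
float forward_neuron(const float *w, const float *in, int n)
{
    float out = 0.0f;
    for (int i = 0; i < n; ++i) out += w[i] * in[i];
    return out;
}

/* Backward: accumulate the weight gradient from delta and the input,
 * and pass a new delta down to the previous layer (if it wants one). */
void backward_neuron(const float *w, const float *in, float delta,
                     float *w_updates, float *prev_delta, int n)
{
    for (int i = 0; i < n; ++i) {
        w_updates[i] += delta * in[i];
        if (prev_delta) prev_delta[i] += w[i] * delta;
    }
}

/* Update: apply the accumulated gradient, then clear it.
 * delta is (expected - out), so adding moves out toward the target. */
void update_neuron(float *w, float *w_updates, float learning_rate, int n)
{
    for (int i = 0; i < n; ++i) {
        w[i] += learning_rate * w_updates[i];
        w_updates[i] = 0.0f;
    }
}
```

Starting from zero weights with input (1, 2) and target 1, one step with learning rate 0.1 moves the output from 0 to 0.5.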

Now let's analyze the yolo layer from the code, starting with the implementation of make_yolo_layer.

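// In yolov3, n=3 and total=9 (the masks form three groups, [0,1,2 / 3,4,5 / 6,7,8]; n is the number of entries in one group, so n*3=total)
// At 608*608 the three yolo layers are 19*19 (mask 6,7,8), 38*38 (mask 3,4,5) and 76*76 (mask 0,1,2); each layer only checks the anchors at its own mask indices
layer make_yolo_layer(int batch, int w, int h, int n, int total, int *mask, int classes)
{
    // Assume we are on the main trunk, which is downsampled 5 times: 608/(2^5)=19, so w=h=19 in this branch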
    int i;
    layer l = { 0 };
    l.type = YOLO;
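    // Number of anchor boxes this layer detects (this branch uses anchors[6,7,8], the large boxes)
    l.n = n;
    // As above, yolov3 has three anchors each for large, medium and small, 3*3=9 in total
    // yolov3-tiny has three anchors each for large and small, 3*2=6 in total
    l.total = total;
    // Typically 32 for training and 1 for prediction
    l.batch = batch;
    // Main trunk: 608/(2^5)=19
    l.h = h;
    l.w = w;
    // On the main trunk each feature map has 19*19 elements; c is the number of feature maps, n covers anchors[6,7,8]
    // 4 is the box coordinates, 1 is the confidence (P(object) * IOU), classes is the number of classes predicted
    l.c = n * (classes + 4 + 1);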
    l.c = n * (classes + 4 + 1);
    l.out_w = l.w;
    l.out_h = l.h;
    l.out_c = l.c;
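    // Total number of classes to detect
    l.classes = classes;
    // Cost (describes the error over the whole batch of data)
    l.cost = calloc(1, sizeof(float));
    // Holds the anchors (width,height pairs)
    l.biases = calloc(total * 2, sizeof(float));
    // The n indices into the anchors above that this layer uses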
    if (mask) l.mask = mask;
    else {
        l.mask = calloc(n, sizeof(int));
        for (i = 0; i <n; ++i) {
            l.mask[i] = i;
        }
    }
    l.bias_updates = calloc(n * 2, sizeof(float));
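    // Number of elements in all output feature maps of this layer for a batch of 1; each element is one float
    l.outputs = h * w*n*(classes + 4 + 1);
    // Number of elements in all input feature maps of this layer for a batch of 1; each element is one float
    l.inputs = l.outputs;
    // Label (ground truth) data; 90 is the maximum number of labels per image.
    // In each label the first 4 floats are the box (x, y, w, h) and the last float is the class
    l.truths = 90 * (4 + 1);
    // Per-element error: expected output - predicted output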
    l.delta = calloc(batch*l.outputs, sizeof(float));
    l.output = calloc(batch*l.outputs, sizeof(float));
    for (i = 0; i < total * 2; ++i) {
        l.biases[i] = .5;
    }
    l.forward = forward_yolo_layer;
    l.backward = backward_yolo_layer;
#ifdef GPU
    l.forward_gpu = forward_yolo_layer_gpu;
    l.backward_gpu = backward_yolo_layer_gpu;
    l.output_gpu = cuda_make_array(l.output, batch*l.outputs);
    l.delta_gpu = cuda_make_array(l.delta, batch*l.outputs);
#endif
    fprintf(stderr, "yolo\n");
    srand(0);
    return l;
}
make_yolo_layer

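Take the main trunk as the example. Each feature map has 19*19 elements and c is the number of feature maps; combined with the figure above there are 3*(80+4+1)=255 of them. Put simply, they correspond to the three cluster boxes 116,90, 156,198 and 373,326. The first 85 maps hold the results for the 116,90 box: 4 for the box coordinates, 1 for the confidence and 80 for the class probabilities, and the same layout repeats three times. Then there is the label (ground truth) data: 90 is the maximum number of labels per image, and in each label the first 4 floats are the box (x, y, w, h) and the last float is the class. Only once these two layouts are clear can we follow how delta is computed in forward_yolo_layer. That code with my comments is rather long, so I will not paste it here and will just explain it.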
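The output layout described above can be written as an index formula. This sketch mirrors what I understand darknet's entry_index helper to compute for a batch of 1, but with explicit parameters; the name yolo_entry_index and the parameter list are my own.

```c
/* Offset of one value in a yolo layer's output for a single image.
 *   anchor: which of the layer's n boxes (0..n-1)
 *   cell:   grid position, row * w + col (0..w*h-1)
 *   entry:  0..3 = box x,y,w,h; 4 = confidence; 5.. = class scores
 * Each (anchor, entry) pair owns one w*h feature map. */
int yolo_entry_index(int w, int h, int classes, int anchor, int cell, int entry)
{
    return anchor * w * h * (4 + classes + 1) + entry * w * h + cell;
}
```

For the 19*19 layer with 80 classes, the second anchor starts at 85*361 = 30685, and the very last value sits at index 255*361 - 1 = 92054.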

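There are two main parts. One reminder first: here delta means expected output - out (other frameworks may differ; the caffe yolo implementation I looked at uses out - expected output).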

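Part one walks over the confidence of every prediction: every element (19*19) for each of the layer's boxes (three per image here). More precisely, in the 19*19 branch each element has three large-box predictions. For each predicted box it finds the best iou against all ground truth boxes. If that iou is greater than the configured ignore_thresh, the confidence delta is set to 0; otherwise it is 0-out (0 meaning no object, which is the output we expect).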
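Part one can be sketched as an IoU function plus a rule for the confidence delta. This is a simplified illustration under the description above, not the darknet source; box_t, box_iou_sketch and noobj_delta are hypothetical names, and boxes are center x, y plus width and height in normalized units.

```c
typedef struct { float x, y, w, h; } box_t;  /* center x, center y, width, height */

/* Length of the overlap of two 1-D intervals given by center and width. */
static float overlap_1d(float c1, float w1, float c2, float w2)
{
    float l1 = c1 - w1 / 2, l2 = c2 - w2 / 2;
    float r1 = c1 + w1 / 2, r2 = c2 + w2 / 2;
    float left = l1 > l2 ? l1 : l2;
    float right = r1 < r2 ? r1 : r2;
    return right - left;
}

/* Intersection over union of two boxes; 0 when they do not overlap. */
float box_iou_sketch(box_t a, box_t b)
{
    float w = overlap_1d(a.x, a.w, b.x, b.w);
    float h = overlap_1d(a.y, a.h, b.y, b.h);
    if (w <= 0 || h <= 0) return 0.0f;
    float inter = w * h;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

/* Confidence delta for a prediction that is not matched to a label:
 * ignored when it already overlaps some truth well, else pushed toward 0. */
float noobj_delta(float confidence, float best_iou, float ignore_thresh)
{
    return best_iou > ignore_thresh ? 0.0f : 0.0f - confidence;
}
```

Two 0.2x0.2 boxes whose centers are 0.1 apart horizontally overlap with an iou of about 1/3, which is below a typical ignore_thresh, so such a prediction would still be penalized as background.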

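Part two goes through all the ground truth labels. It first checks whether the label best matches one of the large boxes; if not, this yolo layer ignores it. If so, it then determines which of 6, 7, 8 it is. Suppose it is 8: the position of the ground truth box determines the position in the feature map (within 19*19), and anchor 8 corresponds to the last 85 of the 255 feature maps (channels 170 to 254). The ground truth box is then compared with the box predicted there to get the box delta, followed by the confidence delta (so the element at a correctly matched box position has its confidence loss computed twice), and then the class delta. Each delta is simple: if the target is true, delta=1-out; if false, delta=0-out; in short, expected output - out. Finally, the network cost is the mean over the yolo layers of each layer's sum of squared deltas.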
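The first step of part two, deciding whether a label belongs to this layer, can be sketched as follows. The idea is to compare only widths and heights (both boxes placed at the same corner), pick the anchor with the best iou among all anchors, and then look it up in the layer's mask. The function names are hypothetical and the sizes are in network-input pixels.

```c
/* IoU of two boxes that share a corner, so only width and height matter. */
static float wh_iou(float w1, float h1, float w2, float h2)
{
    float iw = w1 < w2 ? w1 : w2;
    float ih = h1 < h2 ? h1 : h2;
    float inter = iw * ih;
    return inter / (w1 * h1 + w2 * h2 - inter);
}

/* Index (0..total-1) of the anchor whose shape best matches the truth box. */
int best_anchor(const float *anchors, int total, float truth_w, float truth_h)
{
    int best = 0;
    float best_iou = 0.0f;
    for (int a = 0; a < total; ++a) {
        float iou = wh_iou(truth_w, truth_h, anchors[2 * a], anchors[2 * a + 1]);
        if (iou > best_iou) { best_iou = iou; best = a; }
    }
    return best;
}

/* Position of `anchor` inside the layer's mask, or -1 if this layer
 * is not responsible for it. */
int index_in_mask(const int *mask, int n, int anchor)
{
    for (int i = 0; i < n; ++i) if (mask[i] == anchor) return i;
    return -1;
}
```

With the nine anchors assumed from the stock yolov3.cfg, a 364.8x304 pixel person box matches anchor 8, which is entry 2 of mask 6,7,8, while an 18x24 pixel box matches anchor 1 and is skipped by the 19*19 layer.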

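Next are the values yolo prints during training (this figure comes from training yolov3-tiny, so there are only two yolo layers, 16 and 23). In line with the above, layer 16 detects large boxes and layer 23 detects small ones.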

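As you can see, count is the number of boxes in the current layer that were matched to a ground truth label, and every other value except no obj is an average over this count. Judging from the code, these printed values do not mean much by themselves, so it is normal for a yolo layer to print nan: count is 0, which just means that all images in the current batch happened to contain only large boxes or only small boxes. Raising the batch size lowers the chance of nan, but a larger batch may blow up GPU memory; on my 2070 the default of 64 images did not fit and I had to change it to 32. avg iou is the mean intersection over union of the matched boxes in this layer; class is the mean probability of the correct class for the matched boxes; obj is confidence = P(object) * IOU, a score of whether the predicted box contains an object and how good its IOU is; .5R/.75R is the fraction of matched boxes whose iou exceeds 0.5/0.75.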
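Where the nan comes from is easy to show in isolation. A minimal sketch, assuming the printed statistics are simply sums divided by count (average_stat is a hypothetical helper):

```c
/* Average of a per-layer statistic over the matched boxes.
 * With count == 0 and sum == 0 this is 0/0, which is NaN for IEEE floats. */
float average_stat(float sum, int count)
{
    return sum / count;
}
```

A layer that matched three boxes with a summed iou of 1.5 prints 0.5; a layer that matched none prints nan, which is harmless because the deltas themselves are unaffected.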

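With the layers of the darknet framework understood, back to our requirement: bring in a detection model that detects people only and allows the image to lie on its side, so that the person fills more of the frame and is clearer, with no need to handle people who occupy only a small part of the image. To be clear, I use yolov3-tiny: we may need to run detection on every frame and it must not take too many resources, hence the reduced model.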

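First, filter the dataset down to images that meet the conditions. I originally planned to use the API that ships with COCO, but that turned out to be more trouble than it was worth. All the data is already there and the logic is not complicated, so I simply wrote it myself in WinForms.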

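/// <summary>
/// Annotations are first filtered by funcFilterLabel; every annotation that passes must then satisfy discardFilterLabel
/// </summary>
/// <param name="instData">Parsed COCO instances data</param>
/// <param name="funcFilterLabel">An annotation is used if it meets this condition</param>
/// <param name="discardFilterLabel">Condition that all selected annotations must satisfy</param>
/// <returns>The images that pass, with their YOLO-format boxes</returns>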
public List<ImageLabel> CreateYoloLabel(instances instData,
    Func<annotationOD, image, bool> funcFilterLabel,
    Func<annotationOD, image, bool> discardFilterLabel)
{
    List<ImageLabel> labels = new List<ImageLabel>();
    //foreach (var image in instData.images)
    Parallel.ForEach(instData.images, (image image) =>
     {
         var anns = instData.annotations.FindAll(p => p.image_id == image.id && funcFilterLabel(p, image));
         bool bReserved = anns.TrueForAll((annotationOD ao) => discardFilterLabel(ao, image));
         if (anns.Count> 0 && bReserved)
         {
             ImageLabel iml = new ImageLabel();
             iml.imageId = image.id;
             iml.name = image.file_name;
             float dw = 1.0f / image.width;
             float dh = 1.0f / image.height;
             foreach (var ann in anns)
             {
                 BoxIndex boxIndex = new BoxIndex();
                 boxIndex.box.xcenter = (ann.bbox[0] + ann.bbox[2] / 2.0f) * dw;
                 boxIndex.box.ycenter = (ann.bbox[1] + ann.bbox[3] / 2.0f) * dh;
                 boxIndex.box.width = ann.bbox[2] * dw;
                 boxIndex.box.height = ann.bbox[3] * dh;
                 // Map the COCO category id to a training class index (a negative result is skipped below)
                 boxIndex.catId = findCategoryId(instData.categories, ann.category_id);
                 if (boxIndex.catId>= 0)
                     iml.boxs.Add(boxIndex);
             }
             if (iml.boxs.Count> 0)
             {
                 lock (labels)
                 {
                     labels.Add(iml);
                 }
             }
         }
     });
    return labels;
}
public async void BuildYoloData(DataPath dataPath, string txtListName)
{
    instances instance = new instances();
    if (!File.Exists(dataPath.AnnotationPath))
    {
        setText(dataPath.AnnotationPath + " does not exist.");
        return;
    }
    setText("Reading file:" + Environment.NewLine + dataPath.AnnotationPath);
    // Task.Run moves the blocking work off the UI thread (Task.FromResult would run it synchronously)
    var jsonTex = await Task.Run(() => File.ReadAllText(dataPath.AnnotationPath));
    setText("Parsing file:" + Environment.NewLine + dataPath.AnnotationPath);
    instance = await Task.Run(() => JsonConvert.DeserializeObject<instances>(jsonTex));
    setText("Scanning " + instance.images.Count + " images for people");
    List<ImageLabel> labels = await Task.Run(() => COCODataManager.Instance.CreateYoloLabel(
        instance,
        (annotationOD at, image image) =>
         {
             // Keep only person annotations (COCO category_id 1)
             return at.category_id == 1;
         },
        (annotationOD at, image image) =>
         {
             // Every person box must cover more than one tenth of the image area
             return (at.bbox[2] / image.width) * (at.bbox[3] / image.height) > 0.1f;
         }));
    setText("Writing label files:" + Environment.NewLine + dataPath.LabelPath);
    if (!Directory.Exists(dataPath.LabelPath))
    {
        Directory.CreateDirectory(dataPath.LabelPath);
    }
    await Task.Run(() =>
    {
        Parallel.ForEach(labels, (ImageLabel imageLabel) =>
        {
            string fileName = Path.Combine(dataPath.LabelPath,
                Path.GetFileNameWithoutExtension(imageLabel.name) + ".txt");
            // fileName already contains LabelPath, so open it directly
            using (var file = new StreamWriter(fileName, false))
            {
                foreach (var label in imageLabel.boxs)
                {
                    // One object per line: "<class> <xcenter> <ycenter> <width> <height>"
                    file.WriteLine(label.catId + " " + label.box.xcenter + " " + label.box.ycenter +
                        " " + label.box.width + " " + label.box.height);
                }
            }
        });
        string path = Path.Combine(Directory.GetParent(dataPath.LabelPath).FullName,
            txtListName + ".txt");
        using (var file = new StreamWriter(path, false))
        {
            foreach (var label in labels)
            {
                string lpath = Path.Combine(dataPath.DestImagePath, label.name);
                file.WriteLine(lpath);
            }
        }
    });
    setText("Copying the required images to the target directory: " + dataPath.DestImagePath);
    await Task.Run(() =>
    {
        Parallel.ForEach(labels, (ImageLabel imageLabel) =>
        {
            string spath = Path.Combine(dataPath.SourceImagePath, imageLabel.name);
            string dpsth = Path.Combine(dataPath.DestImagePath, imageLabel.name);
            if (File.Exists(spath))
                File.Copy(spath, dpsth, true);
        });
    });
    setText("All done");
}
CreateYoloLabel

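Only one point needs attention. We record box labels for people only, but we keep only images in which every one of those labels exceeds the given area. If a selected image also contained small people that were given no box label, the yolo layer would have no positive loss term for them and would treat them as background, which interferes with training. We have no need to detect small people here anyway, and yolov3-tiny does not have many layers, so generalizing the requirement would give very low accuracy.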
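For readers who do not use C#, the two calculations at the heart of the filter are shown here as a small C sketch: converting a COCO bbox (top-left x, y, width, height in pixels) to the normalized center format darknet reads, and the area-ratio test. The type and function names are hypothetical.

```c
typedef struct { float xcenter, ycenter, width, height; } yolo_box_t;

/* COCO bbox = {x_min, y_min, width, height} in pixels.
 * YOLO label = center and size, each divided by the image size. */
yolo_box_t coco_to_yolo(const float bbox[4], int img_w, int img_h)
{
    yolo_box_t b;
    b.xcenter = (bbox[0] + bbox[2] / 2.0f) / img_w;
    b.ycenter = (bbox[1] + bbox[3] / 2.0f) / img_h;
    b.width   = bbox[2] / img_w;
    b.height  = bbox[3] / img_h;
    return b;
}

/* Fraction of the image area covered by the box. */
float bbox_area_ratio(const float bbox[4], int img_w, int img_h)
{
    return (bbox[2] / img_w) * (bbox[3] / img_h);
}
```

A 200x300 box at (100, 50) in a 640x480 image becomes center (0.3125, 0.4167) with size (0.3125, 0.625); it covers about 19.5% of the image, so it passes the one-tenth threshold.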

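darknet itself has no support for images lying on their side, so we need to modify the relevant logic. It is simple: rotate an image into the four orientations and modify the corresponding truth boxes in the same way. darknet's data loading lives mainly in data.c; find the loading routine we use, load_data_detection, and make the following changes.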

data load_data_detection(int n, char **paths, int m, int w, int h, int boxes, int classes,
    float jitter, float hue, float saturation, float exposure)
{
....................
        random_distort_image(sized, hue, saturation, exposure);
        int flip = rand() % 2;
        if (flip)
            flip_image(sized);
        int vflip = rand() % 2;
        if (vflip)
            vflip_image(sized);
        int trans = rand() % 2;
        if (trans)
            transpose_image(sized);
        d.X.vals[i] = sized.data;
        fill_truth_detection(random_paths[i], boxes, d.y.vals[i], classes, flip,
            -dx / w, -dy / h, nw / w, nh / h, vflip, trans);
        free_image(orig);
....................
}
void correct_boxes(box_label *boxes, int n, float dx, float dy, float sx,
    float sy, int flip, int vflip, int trans)
{
    int i;
    for (i = 0; i < n; ++i) {
        if (boxes[i].x == 0 && boxes[i].y == 0) {
            boxes[i].x = 999999;
            boxes[i].y = 999999;
            boxes[i].w = 999999;
            boxes[i].h = 999999;
            continue;
        }
        boxes[i].left = boxes[i].left  * sx - dx;
        boxes[i].right = boxes[i].right * sx - dx;
        boxes[i].top = boxes[i].top   * sy - dy;
        boxes[i].bottom = boxes[i].bottom* sy - dy;
        if (flip)
        {
            float swap = boxes[i].left;
            boxes[i].left = 1. - boxes[i].right;
            boxes[i].right = 1. - swap;
        }
        if (vflip)
        {
            float swap = boxes[i].top;
            boxes[i].top = 1. - boxes[i].bottom;
            boxes[i].bottom = 1. - swap;
        }
        boxes[i].left = constrain(0, 1, boxes[i].left);
        boxes[i].right = constrain(0, 1, boxes[i].right);
        boxes[i].top = constrain(0, 1, boxes[i].top);
        boxes[i].bottom = constrain(0, 1, boxes[i].bottom);
        boxes[i].x = (boxes[i].left + boxes[i].right) / 2;
        boxes[i].y = (boxes[i].top + boxes[i].bottom) / 2;
        boxes[i].w = (boxes[i].right - boxes[i].left);
        boxes[i].h = (boxes[i].bottom - boxes[i].top);
        boxes[i].w = constrain(0, 1, boxes[i].w);
        boxes[i].h = constrain(0, 1, boxes[i].h);
        if (trans)
        {
            float temp = boxes[i].x;
            boxes[i].x = boxes[i].y;
            boxes[i].y = temp;
            temp = boxes[i].w;
            boxes[i].w = boxes[i].h;
            boxes[i].h = temp;
        }
    }
}
load_data_detection
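The box side of these changes can be checked separately from the image code. Below is a minimal sketch of the three transforms applied to a normalized center-format box; it mirrors the idea of the modified correct_boxes but is not that function, and the names are hypothetical.

```c
typedef struct { float x, y, w, h; } nbox_t;  /* normalized center x,y and size */

/* Horizontal flip: mirror the center around the vertical axis. */
nbox_t hflip_box(nbox_t b) { b.x = 1.0f - b.x; return b; }

/* Vertical flip: mirror the center around the horizontal axis. */
nbox_t vflip_box(nbox_t b) { b.y = 1.0f - b.y; return b; }

/* Transpose: swap the axes, so x<->y and w<->h. */
nbox_t transpose_box(nbox_t b)
{
    nbox_t t = { b.y, b.x, b.h, b.w };
    return t;
}
```

Transposing and then flipping horizontally is a 90 degree clockwise rotation, which is why the three random switches together cover all four orientations. Note that transposing swaps width and height, so this only lines up with the image when the network input is square.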

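Next, take the yolov3-tiny.cfg file. First set burn_in to 1: we have no pretrained weights, so training starts at the base learning rate from the very beginning. In both yolo layers set classes to 1. Remember from earlier that a yolo layer only interprets the output of the convolutional layer above it, so that convolutional layer needs filters=3*(4+1+1)=18. After the first training run I found that the loss had all but stopped improving by 50,000 iterations. My analysis is that the anchors were the cause: once images can lie on their side, the anchors should include the matching sideways shapes too. So I picked two box shapes typical of large-area people and added the sideways version of each. The proper approach would be to rerun k-means with K=2 over the boxes of the filtered images to get one large and one small box, then add their sideways versions, giving two large and two small boxes with num=4 and two indices in each layer's mask; or to use K=4 and let k-means find the shapes directly. I will improve this step when I find the time. For now I use anchors = 100,100, 119,59, 59,119, 200,200, 326,373, 373,326. With these adjustments training has now reached 210,000 iterations and is still converging.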
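The filters value follows from the yolo layer's channel count, n*(classes+4+1), seen in make_yolo_layer. A one-line sketch (the helper name is hypothetical):

```c
/* Number of filters the convolutional layer feeding a yolo layer must have:
 * one group of (4 box values + 1 confidence + classes) per anchor in the mask. */
int yolo_conv_filters(int anchors_in_mask, int classes)
{
    return anchors_in_mask * (classes + 4 + 1);
}
```

This gives 18 for one class with three anchors per layer and 255 for the 80-class COCO config. If the masks were later reduced to two anchors per layer, as in the K=2 idea above, the value would become 12.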

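For training and validation, adapt darknet's own train_yolo/validate_yolo. You can print whatever information you want, and validation can be combined with opencv to display whatever visual comparisons you need.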

Source: https://www.cnblogs.com/zhouxin/p/11099030.html
