『TensorFlow』SSD Source Code Study, Part 6: Label Preparation

Forked project repository: SSD

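I. Generating the Input Labels

After preprocessing, the image, class labels, and ground-truth boxes are still in a fairly raw form and cannot be fed directly to the loss function as labels. (The SSD forward network only needs the image; the processing here exists to serve the loss computation.) For a single image (a 3-D CHW tensor), the loss function needs labels in the following format: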

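gclasses: the ground-truth class matched to each anchor box. A list with one element per SSD feature layer (length f); each element is a Tensor of shape (rows of centers in that layer) × (columns of centers) × (anchor boxes per center).

gscores: the IoU between each anchor box and its matched ground-truth box; gclasses records the class of that same ground-truth box. A list of length f; each element is a Tensor of shape (rows of centers in that layer) × (columns of centers) × (anchor boxes per center).

glocalisations: the position correction of each anchor box relative to its matched ground-truth box; because there are 4 coordinates, there is one extra dimension. A list of length f; each element is a Tensor of shape (rows of centers in that layer) × (columns of centers) × (anchor boxes per center) × 4.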
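To make the shapes concrete, here is a minimal sketch that lists the per-layer label shapes. The layer sizes are an assumption taken from the vgg_300 tensor shapes printed later in this post; the variable names are hypothetical and not part of the project.

```python
# Assumed (rows, cols, anchors per center) for each SSD300 feature layer,
# copied from the vgg_300 shapes printed later in this post.
layers = [(38, 38, 4), (19, 19, 6), (10, 10, 6), (5, 5, 6), (3, 3, 4), (1, 1, 4)]

# gclasses and gscores share the same per-layer shape (m, m, k).
gclasses_shapes = [(m, n, k) for m, n, k in layers]
gscores_shapes = list(gclasses_shapes)

# glocalisations carries 4 encoded coordinates per anchor box: (m, m, k, 4).
glocalisations_shapes = [(m, n, k, 4) for m, n, k in layers]

# Total number of anchor boxes that receive a label for one image.
total_anchors = sum(m * n * k for m, n, k in layers)

for cls_shape, loc_shape in zip(gclasses_shapes, glocalisations_shapes):
    print(cls_shape, loc_shape)
print("anchor boxes per image:", total_anchors)
```

Each of the three label lists has one entry per feature layer, so under these assumed sizes every image ends up with three lists of six Tensors.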

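To compute these labels, the function is called as follows (train_ssd_network.py):

            # f tensors of (m,m,k), f tensors of (m,m,k,4) in x/y/w/h order, f tensors of (m,m,k); f is the number of SSD feature layers

            # class ids 0-20, encoded box targets in the form the loss expects, IoU values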

            gclasses, glocalisations, gscores = \

                ssd_net.bboxes_encode(glabels, gbboxes, ssd_anchors)

The input variables are all outputs of functions covered in earlier parts (train_ssd_network.py):

ssd_anchors = ssd_net.anchors(ssd_shape)  # call the class method to create the anchor boxes

# Pre-processing image, labels and bboxes.

# 'CHW' (n,) (n, 4)

image, glabels, gbboxes = \

        image_preprocessing_fn(image, glabels, gbboxes,

                               out_shape=ssd_shape,  # (300,300)

                               data_format=DATA_FORMAT)  # 'NCHW'

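Now let's look at how the underlying function is implemented. Processing is organized by SSD feature layer: three lists are created first, then the three Tensors are computed for each feature layer, and finally each Tensor is appended to its list (ssd_common.py):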

def tf_ssd_bboxes_encode(labels,

                         bboxes,

                         anchors,

                         num_classes,

                         no_annotation_label,

                         ignore_threshold=0.5,

                         prior_scaling=(0.1, 0.1, 0.2, 0.2),

                         dtype=tf.float32,

                         scope='ssd_bboxes_encode'):

    with tf.name_scope(scope):

        target_labels = []

        target_localizations = []

        target_scores = []

        # anchors_layer: (y, x, h, w)

        for i, anchors_layer in enumerate(anchors):

            with tf.name_scope('bboxes_encode_block_%i' % i):

                # (m,m,k), (m,m,k,4) in x/y/w/h order, (m,m,k)

                t_labels, t_loc, t_scores = \

                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,

                                               num_classes, no_annotation_label,

                                               ignore_threshold,

                                               prior_scaling, dtype)

                target_labels.append(t_labels)

                target_localizations.append(t_loc)

                target_scores.append(t_scores)

        return target_labels, target_localizations, target_scores

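The per-layer processing is the key part (ssd_common.py). Here we can really appreciate how convenient it is that all box coordinates are normalized: anchor boxes from any layer can be compared directly with the ground-truth boxes, since all of them are relative positions in the range 0 to 1:

# For readability: m is the number of rows/columns of centers in this layer, k is the number of boxes per center, n is the number of objects in the image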

def tf_ssd_bboxes_encode_layer(labels,         # (n,)

                               bboxes,         # (n, 4)

                               anchors_layer,  # y(m, m, 1), x(m, m, 1), h(k,), w(k,)

                               num_classes,

                               no_annotation_label,

                               ignore_threshold=0.5,

                               prior_scaling=(0.1, 0.1, 0.2, 0.2),

                               dtype=tf.float32):

    """Encode groundtruth labels and bounding boxes using SSD anchors from

    one layer.

    Arguments:

      labels: 1D Tensor(int64) containing groundtruth labels;

      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;

      anchors_layer: Numpy array with layer anchors;

      matching_threshold: Threshold for positive match with groundtruth bboxes;

      prior_scaling: Scaling of encoded coordinates.

    Return:

      (target_labels, target_localizations, target_scores): Target Tensors.

    """

    # Anchors coordinates and volume.

    yref, xref, href, wref = anchors_layer  # y(m, m, 1), x(m, m, 1), h(k,), w(k,)

    ymin = yref - href / 2.  # (m, m, k)

    xmin = xref - wref / 2.

    ymax = yref + href / 2.

    xmax = xref + wref / 2.

    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor box area, (m, m, k)

    # Initialize tensors...

    # Each Tensor below has shape (m, m, k): the grid of centers extended by the k boxes per center

    shape = (yref.shape[0], yref.shape[1], href.size)  # (m, m, k)

    feat_labels = tf.zeros(shape, dtype=tf.int64)  # (m, m, k)

    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)

    feat_xmin = tf.zeros(shape, dtype=dtype)

    feat_ymax = tf.ones(shape, dtype=dtype)

    feat_xmax = tf.ones(shape, dtype=dtype)

    def jaccard_with_anchors(bbox):

        """Compute jaccard score between a box and the anchors.

        """

        int_ymin = tf.maximum(ymin, bbox[0])  # (m, m, k)

        int_xmin = tf.maximum(xmin, bbox[1])

        int_ymax = tf.minimum(ymax, bbox[2])

        int_xmax = tf.minimum(xmax, bbox[3])

        h = tf.maximum(int_ymax - int_ymin, 0.)

        w = tf.maximum(int_xmax - int_xmin, 0.)

        # Volumes.

        # Relate the anchor boxes to the ground-truth bbox

        inter_vol = h * w  # intersection area

        union_vol = vol_anchors - inter_vol \

            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # union area

        jaccard = tf.div(inter_vol, union_vol)  # intersection / union, i.e. IoU

        return jaccard  # (m, m, k)

    def condition(i, feat_labels, feat_scores,

                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):

        """Condition: check label index.

        """

        r = tf.less(i, tf.shape(labels))

        return r[0]  # tf.shape(labels) is a 1-D tensor, so r is 1-D too; take its only element

    def body(i, feat_labels, feat_scores,

             feat_ymin, feat_xmin, feat_ymax, feat_xmax):

        """Body: update feature labels, scores and bboxes.

        Follow the original SSD paper for that purpose:

          - assign values when jaccard > 0.5;

          - only update if beat the score of other bboxes.

        """

        # Jaccard score.

        label = labels[i]  # label of the i-th object in the current image

        bbox = bboxes[i]   # ground-truth bbox of the i-th object in the current image

        jaccard = jaccard_with_anchors(bbox)  # IoU between this object's bbox and this layer's anchor boxes, (m, m, k)

        # Mask: check threshold + scores + no annotations + num_classes.

        mask = tf.greater(jaccard, feat_scores)  # boolean mask, True where the IoU beats the best score so far, (m, m, k)

        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))

        mask = tf.logical_and(mask, feat_scores > -0.5)

        mask = tf.logical_and(mask, label < num_classes)  # presumably a guard against labels outside the valid class range; normally label < num_classes always holds

        imask = tf.cast(mask, tf.int64)  # integer mask

        fmask = tf.cast(mask, dtype)     # floating-point mask

        # Update values using mask.

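        # Keep in feat_labels the label of the best-scoring object at each position, and that score in feat_scores

        # (m, m, k) × current class scalar + (1 - (m, m, k)) × (m, m, k)

        # Update the labels: imask is True only where the current object beats the earlier best score; other positions are unchanged

        feat_labels = imask * label + (1 - imask) * feat_labels

        # Update the scores: where mask is True use this object's IoU, otherwise keep the old value

        feat_scores = tf.where(mask, jaccard, feat_scores)

        # The four tensors below store the ground-truth box coordinates of the matched label

        # (m, m, k) × current box coordinate scalar + (1 - (m, m, k)) × (m, m, k)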

        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin

        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin

        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax

        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        return [i+1, feat_labels, feat_scores,

                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.

    # Loop over every object in the current image

    i = 0

    (i,

     feat_labels, feat_scores,

     feat_ymin, feat_xmin,

     feat_ymax, feat_xmax) = tf.while_loop(condition, body,

                                           [i,

                                            feat_labels, feat_scores,

                                            feat_ymin, feat_xmin,

                                            feat_ymax, feat_xmax])

    # Transform to center / size.

    # Here y, x, h, w are the attributes of the ground-truth box matched at each position

    feat_cy = (feat_ymax + feat_ymin) / 2.

    feat_cx = (feat_xmax + feat_xmin) / 2.

    feat_h = feat_ymax - feat_ymin

    feat_w = feat_xmax - feat_xmin

    # Encode features.

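    # prior_scaling: [0.1, 0.1, 0.2, 0.2]; dividing by these rescales the targets (they appear to match the prior-box variances of the original Caffe SSD)

    # ((m, m, k) - (m, m, 1)) / (k,) * 10

    # Offset of the ground-truth box center from the anchor center, measured in units of the anchor's h and w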

    feat_cy = (feat_cy - yref) / href / prior_scaling[0]

    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]

    # log((m, m, k) / (k,)) * 5

    # Log of ground-truth height/width divided by anchor height/width

    feat_h = tf.log(feat_h / href) / prior_scaling[2]

    feat_w = tf.log(feat_w / wref) / prior_scaling[3]

    # Use SSD ordering: x / y / w / h instead of ours.(m, m, k, 4)

    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)  # stacking on axis=-1 adds a trailing dimension of size 4

    return feat_labels, feat_localizations, feat_scores
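The broadcasting and the "keep the best IoU so far" update in tf_ssd_bboxes_encode_layer can be mimicked in plain NumPy. The following is only a sketch under simplifying assumptions: the helper names, the tiny 2×2 grid, and the anchor sizes are all hypothetical, and the `feat_scores > -0.5` and `label < num_classes` checks from the original are left out.

```python
import numpy as np

def layer_anchor_corners(m, sizes):
    """Corners of k anchor boxes at each cell of an m x m grid, normalized to [0, 1]."""
    centers = (np.arange(m, dtype=np.float32) + 0.5) / m
    yref, xref = np.meshgrid(centers, centers, indexing="ij")
    yref, xref = yref[..., None], xref[..., None]              # (m, m, 1)
    href = np.array([h for h, _ in sizes], dtype=np.float32)   # (k,)
    wref = np.array([w for _, w in sizes], dtype=np.float32)   # (k,)
    # (m, m, 1) combined with (k,) broadcasts to (m, m, k).
    return (yref - href / 2, xref - wref / 2, yref + href / 2, xref + wref / 2)

def jaccard(anchors, bbox):
    """IoU between one (ymin, xmin, ymax, xmax) box and every anchor, shape (m, m, k)."""
    ymin, xmin, ymax, xmax = anchors
    h = np.maximum(np.minimum(ymax, bbox[2]) - np.maximum(ymin, bbox[0]), 0.0)
    w = np.maximum(np.minimum(xmax, bbox[3]) - np.maximum(xmin, bbox[1]), 0.0)
    inter = h * w
    union = (ymax - ymin) * (xmax - xmin) - inter \
        + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return inter / union

def match(anchors, labels, bboxes):
    """For each anchor keep the label and IoU of the best-overlapping object."""
    shape = anchors[0].shape
    feat_labels = np.zeros(shape, dtype=np.int64)
    feat_scores = np.zeros(shape, dtype=np.float32)
    for label, bbox in zip(labels, bboxes):
        iou = jaccard(anchors, bbox)
        mask = iou > feat_scores          # True where this object beats the best so far
        feat_labels = np.where(mask, label, feat_labels)
        feat_scores = np.where(mask, iou, feat_scores)
    return feat_labels, feat_scores

# Hypothetical toy layer: 2 x 2 centers, two anchor shapes (h, w) per center.
anchors = layer_anchor_corners(2, sizes=[(0.5, 0.5), (0.25, 0.5)])
labels = [7, 12]
bboxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]  # (ymin, xmin, ymax, xmax)
feat_labels, feat_scores = match(anchors, labels, bboxes)
print(feat_labels[..., 0])
print(feat_scores[..., 0])
```

Anchors that overlap no object keep label 0 and score 0, which is how background positions fall out of the same loop.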

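As the final lines of tf_ssd_bboxes_encode_layer show, feat_localizations records the position correction. It does not store the raw difference between anchor box and ground-truth box; it stores the values in the form the loss function expects. The code does not explain why prior_scaling is applied; the values appear to match the prior-box variances of the original Caffe SSD and simply rescale the targets. Intuitively this does no harm to the loss: the encoded target is still zero exactly when the anchor box coincides with the ground-truth box, so the optimum is unchanged.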
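To see what the encoding does to a single box, here is a small NumPy sketch of the same formulas applied to one anchor, together with an inverse. The `decode` helper is a hypothetical inverse written for this illustration, not code shown in this post, and the anchor and box values are made up.

```python
import numpy as np

PRIOR_SCALING = (0.1, 0.1, 0.2, 0.2)

def encode(box, anchor, prior_scaling=PRIOR_SCALING):
    """box: (ymin, xmin, ymax, xmax); anchor: (cy, cx, h, w); all normalized to [0, 1]."""
    ymin, xmin, ymax, xmax = box
    yref, xref, href, wref = anchor
    cy, cx = (ymin + ymax) / 2.0, (xmin + xmax) / 2.0
    h, w = ymax - ymin, xmax - xmin
    # Center offset in units of the anchor size, then divided by the scaling factor.
    t_cy = (cy - yref) / href / prior_scaling[0]
    t_cx = (cx - xref) / wref / prior_scaling[1]
    # Log of the size ratio, then divided by the scaling factor.
    t_h = np.log(h / href) / prior_scaling[2]
    t_w = np.log(w / wref) / prior_scaling[3]
    return np.array([t_cx, t_cy, t_w, t_h])  # x / y / w / h order, as in the post

def decode(target, anchor, prior_scaling=PRIOR_SCALING):
    """Hypothetical inverse of encode(); returns (ymin, xmin, ymax, xmax)."""
    t_cx, t_cy, t_w, t_h = target
    yref, xref, href, wref = anchor
    cy = t_cy * prior_scaling[0] * href + yref
    cx = t_cx * prior_scaling[1] * wref + xref
    h = href * np.exp(t_h * prior_scaling[2])
    w = wref * np.exp(t_w * prior_scaling[3])
    return np.array([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])

anchor = (0.5, 0.5, 0.2, 0.2)        # made-up anchor: center (0.5, 0.5), size 0.2 x 0.2
box = (0.42, 0.45, 0.66, 0.65)       # made-up ground-truth box
target = encode(box, anchor)
recovered = decode(target, anchor)
perfect = encode((0.4, 0.4, 0.6, 0.6), anchor)  # box that coincides with the anchor

print("target:", target)
print("recovered box:", recovered)
print("target when box == anchor:", perfect)
```

Dividing by 0.1 and 0.2 multiplies the raw offsets by 10 and 5, so small geometric differences become targets of a larger magnitude. A box that coincides with its anchor still encodes to (approximately) zero.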

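II. Assembling Batches

Building the batch queue

So far all of our data is per image, and it needs to be assembled into Tensors with a batch dimension. There is one small complication: the data is mostly lists of Tensors, so adding the batch dimension takes a little trick (tf_utils.py):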

def reshape_list(l, shape=None):

    """Reshape list of (list): 1D to 2D or the other way around.

    Args:

      l: List or List of list.

      shape: 1D or 2D shape.

    Return

      Reshaped list.

    """

    r = []

    if shape is None:

        # Flatten everything.

        for a in l:

            if isinstance(a, (list, tuple)):

                r = r + list(a)

            else:

                r.append(a)

    else:

        # Reshape to list of list.

        i = 0

        for s in shape:

            if s == 1:

                r.append(l[i])

            else:

                r.append(l[i:i+s])

            i += s

    return r
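A quick way to see what reshape_list does is to run it on plain Python lists. The sketch below restates the helper so that it runs on its own; the string placeholders stand in for Tensors and are purely illustrative.

```python
def reshape_list(l, shape=None):
    """Restatement of the helper above so this demo is self-contained."""
    r = []
    if shape is None:
        # Flatten one level of nesting.
        for a in l:
            if isinstance(a, (list, tuple)):
                r = r + list(a)
            else:
                r.append(a)
    else:
        # Regroup a flat list according to the given sub-list lengths.
        i = 0
        for s in shape:
            if s == 1:
                r.append(l[i])
            else:
                r.append(l[i:i + s])
            i += s
    return r

# Placeholders for one image and three label lists over 3 feature layers.
nested = ["image", ["c0", "c1", "c2"], ["l0", "l1", "l2"], ["s0", "s1", "s2"]]

flat = reshape_list(nested)              # nested -> flat
batch_shape = [1] + [3] * 3              # lengths of the original sub-lists
restored = reshape_list(flat, batch_shape)  # flat -> nested

print(flat)
print(restored)
```

Note that a length of 1 in `shape` yields the bare element rather than a one-element list, so a sub-list that really had length 1 would not come back as a list.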

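This function converts between the two forms list1: [Tensor11, [Tensor21, Tensor22, ...], [Tensor31, Tensor32, ...], ...] and list2: [Tensor1, Tensor2, ...]. All it needs is the length of each sub-list in list1, with a lone Tensor counted as 1 (train_ssd_network.py):

            batch_shape = [1] + [len(ssd_anchors)] * 3  # (1, f, f, f), where f is the number of feature layers

            # Training batches and queue.

            r = tf.train.batch(  # image, classes per anchor, encoded box targets, scores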

                tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),

                batch_size=FLAGS.batch_size,  # 32

                num_threads=FLAGS.num_preprocessing_threads,

                capacity=5 * FLAGS.batch_size)

            b_image, b_gclasses, b_glocalisations, b_gscores = \

                tf_utils.reshape_list(r, batch_shape)

            # Intermediate queueing: unique batch computation pipeline for all

            # GPUs running the training.

            batch_queue = slim.prefetch_queue.prefetch_queue(

                tf_utils.reshape_list([b_image, b_gclasses, b_glocalisations, b_gscores]),

                capacity=2 * deploy_config.num_clones)

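Because tf.train.batch accepts input of the form [Tensor1, Tensor2, ...], the function above is first used to flatten the input. tf.train.batch then turns the per-image label data into batch-sized label data, and the result is converted back to the nested format. (In effect, list1 is flattened into list2, every Tensor in it gains a leading batch dimension, and the result is reshaped back into list1's format.) Finally, a queue is created from the batched Tensors. This round trip is not really necessary, though: the version below also runs without error and skips the back-and-forth reshaping:

            batch_shape = [1] + [len(ssd_anchors)] * 3 # (1, f, f, f), where f is the number of feature layers

            r = tf.train.batch(  # image, classes per anchor, encoded box targets, scores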

                tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),

                batch_size=FLAGS.batch_size,  # 32

                num_threads=FLAGS.num_preprocessing_threads,

                capacity=5 * FLAGS.batch_size)

            # Intermediate queueing: unique batch computation pipeline for all

            # GPUs running the training.

            batch_queue = slim.prefetch_queue.prefetch_queue(

                r,                                # <----- the input format does not actually need to be adjusted

                capacity=2 * deploy_config.num_clones)
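Conceptually, batching gives every Tensor in the flat list one extra leading dimension of size batch_size. The NumPy sketch below imitates only that effect: it is not how tf.train.batch is implemented (the real op is queue-based), and `fake_sample`, `stack_batch`, and all shapes are hypothetical.

```python
import numpy as np

def fake_sample(seed):
    """Hypothetical per-image sample as a flat list: [image, classes_0, classes_1]."""
    rng = np.random.default_rng(seed)
    image = rng.random((3, 8, 8)).astype(np.float32)    # CHW image
    classes_0 = rng.integers(0, 21, size=(4, 4, 4))     # (m, m, k) labels, layer 0
    classes_1 = rng.integers(0, 21, size=(2, 2, 6))     # (m, m, k) labels, layer 1
    return [image, classes_0, classes_1]

def stack_batch(samples):
    """Stack the i-th array of every sample along a new leading axis."""
    return [np.stack(group, axis=0) for group in zip(*samples)]

batch_size = 5
batch = stack_batch([fake_sample(s) for s in range(batch_size)])
shapes = [t.shape for t in batch]
print(shapes)
```

The order of the flat list is preserved, which is why a single reshape_list call with batch_shape after dequeuing is enough to recover the nested structure.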

Dequeuing a batch

            # Dequeue batch.

            b_image, b_gclasses, b_glocalisations, b_gscores = \

                tf_utils.reshape_list(batch_queue.dequeue(), batch_shape)  # restore the nested list

After dequeuing, we only need to restore the list structure. The data now looks like this (using vgg_300 as an example):

<tf.Tensor 'batch:0' shape=(32, 3, 300, 300) dtype=float32>

[<tf.Tensor 'batch:1' shape=(32, 38, 38, 4) dtype=int64>,
 <tf.Tensor 'batch:2' shape=(32, 19, 19, 6) dtype=int64>,
 <tf.Tensor 'batch:3' shape=(32, 10, 10, 6) dtype=int64>,
 <tf.Tensor 'batch:4' shape=(32, 5, 5, 6) dtype=int64>,
 <tf.Tensor 'batch:5' shape=(32, 3, 3, 4) dtype=int64>,
 <tf.Tensor 'batch:6' shape=(32, 1, 1, 4) dtype=int64>]

[<tf.Tensor 'batch:7' shape=(32, 38, 38, 4, 4) dtype=float32>,
 <tf.Tensor 'batch:8' shape=(32, 19, 19, 6, 4) dtype=float32>,
 <tf.Tensor 'batch:9' shape=(32, 10, 10, 6, 4) dtype=float32>,
 <tf.Tensor 'batch:10' shape=(32, 5, 5, 6, 4) dtype=float32>,
 <tf.Tensor 'batch:11' shape=(32, 3, 3, 4, 4) dtype=float32>,
 <tf.Tensor 'batch:12' shape=(32, 1, 1, 4, 4) dtype=float32>]

[<tf.Tensor 'batch:13' shape=(32, 38, 38, 4) dtype=float32>,
 <tf.Tensor 'batch:14' shape=(32, 19, 19, 6) dtype=float32>,
 <tf.Tensor 'batch:15' shape=(32, 10, 10, 6) dtype=float32>,
 <tf.Tensor 'batch:16' shape=(32, 5, 5, 6) dtype=float32>,
 <tf.Tensor 'batch:17' shape=(32, 3, 3, 4) dtype=float32>,
 <tf.Tensor 'batch:18' shape=(32, 1, 1, 4) dtype=float32>]

The data now meets the input requirements of both the network and the loss function, so we can simply run them:

            # Construct SSD network.

            # This instance method returns the arg scope built by the previously defined ssd_arg_scope function (two of its parameters can be set)

            arg_scope = ssd_net.arg_scope(weight_decay=FLAGS.weight_decay,

                                          data_format=DATA_FORMAT)

            with slim.arg_scope(arg_scope):

                # predictions: (BS, H, W, 4, 21)

                # localisations: (BS, H, W, 4, 4)

                # logits: (BS, H, W, 4, 21)

                predictions, localisations, logits, end_points = \

                    ssd_net.net(b_image, is_training=True)

            # Add loss function.

            ssd_net.losses(logits, localisations,

                           b_gclasses, b_glocalisations, b_gscores,

                           match_threshold=FLAGS.match_threshold,  # .5

                           negative_ratio=FLAGS.negative_ratio,  # 3

                           alpha=FLAGS.loss_alpha,  # 1

                           label_smoothing=FLAGS.label_smoothing)  # .0

The forward-pass function returns the relevant graph nodes, while the loss function adds its values to the loss collection.