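Semantic Segmentation丨PSPNet Source Code Analysis: Testing Phase

Introduction

This post follows the previous one, "Semantic Segmentation丨PSPNet Source Code Analysis: Network Training", and covers the testing phase of semantic segmentation.

Once a model is trained, the strategy used to test it also matters a great deal.

Testing is generally either single-scale or multi-scale, and multi-scale results are usually higher than single-scale ones. Other details also affect the result, such as whether the whole image is fed to the network or a sliding window feeds one part of the image at a time. The sections below explain these choices with reference to the code.

Full code: https://github.com/speedinghzl/pytorch-segmentation-toolbox/blob/master/evaluate.py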

evaluate.py

main

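Below is the first half of the main test function. args.whole indicates whether to use multi-scale testing.

If args.whole is false, testing is single-scale: predict_sliding is called and the image is predicted with a sliding window.

If args.whole is true, testing is multi-scale: predict_multiscale is called with [0.75, 1.0, 1.25, 1.5, 1.75, 2.0] as the scale factors, and each scaled image is predicted whole.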

def main():
    """Create the model and start the evaluation process."""
    args = get_arguments()  # parse the command-line arguments
    # gpu0 = args.gpu
    os.environ["CUDA_VISIBLE_DEVICES"]=args.gpu
    h, w = map(int, args.input_size.split(','))  # h = 769, w = 769
    if args.whole:
        input_size = (1024, 2048)
    else:
        input_size = (h, w)  # (769, 769)
    model = Res_Deeplab(num_classes=args.num_classes)  # build the model
    saved_state_dict = torch.load(args.restore_from)  # read the saved weights
    model.load_state_dict(saved_state_dict)  # load the weights into the model
    model.eval()  # evaluation mode
    model.cuda()
    testloader = data.DataLoader(CSDataSet(args.data_dir, args.data_list, crop_size=(1024, 2048), mean=IMG_MEAN, scale=False, mirror=False),
                                    batch_size=1, shuffle=False, pin_memory=True)
    data_list = []
    confusion_matrix = np.zeros((args.num_classes,args.num_classes))  # confusion matrix, shape (19, 19)
    palette = get_palette(256)  # color palette
    interp = nn.Upsample(size=(1024, 2048), mode='bilinear', align_corners=True)  # upsampling layer
    if not os.path.exists('outputs'):
        os.makedirs('outputs')
    for index, batch in enumerate(testloader):
        if index % 100 == 0:
            print('%d processed'%(index))
        image, label, size, name = batch
        # image.shape (1, 3, 1024, 2048), label.shape (1, 1024, 2048), size = [[1024, 2048, 3]]
        size = size[0].numpy()  # size = [1024, 2048, 3]
        with torch.no_grad():  # no gradients are needed at test time
            if args.whole:  # whole-image testing: call the multi-scale method; output.shape (1024, 2048, 19)
                output = predict_multiscale(model, image, input_size, [0.75, 1.0, 1.25, 1.5, 1.75, 2.0], args.num_classes, True, args.recurrence)
            else:  # otherwise use the sliding-window method
                output = predict_sliding(model, image.numpy(), input_size, args.num_classes, True, args.recurrence)

Next we look at the implementations of predict_sliding for the single-scale case, and of predict_multiscale and predict_whole for the multi-scale case.

predict_sliding

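This method uses a fixed-size window that crops one part of the image at a time and feeds it to the network. The window then slides; consecutive windows overlap by 1/3 (the last window in each row and column is shifted back to stay inside the image, so it overlaps more), and the predictions in the overlapping regions are summed. Dividing the summed prediction by the number of windows covering each pixel gives the per-pixel average.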

# image.shape (1, 3, 1024, 2048), tile_size = (769, 769), classes = 19, flip = True, recur = 1
def predict_sliding(net, image, tile_size, classes, flip_evaluation, recurrence):
    interp = nn.Upsample(size=tile_size, mode='bilinear', align_corners=True)
    image_size = image.shape  # (1, 3, 1024, 2048)
    overlap = 1/3  # consecutive windows overlap by 1/3
    stride = ceil(tile_size[0] * (1 - overlap))  # stride: ceil(769 * (1 - 1/3)) = 513
    tile_rows = int(ceil((image_size[2] - tile_size[0]) / stride) + 1)  # number of tile rows: ceil((1024 - 769) / 513) + 1 = 2
    tile_cols = int(ceil((image_size[3] - tile_size[1]) / stride) + 1)  # number of tile columns: ceil((2048 - 769) / 513) + 1 = 4
    print("Need %i x %i prediction tiles @ stride %i px" % (tile_cols, tile_rows, stride))
    full_probs = np.zeros((image_size[2], image_size[3], classes))  # accumulated predictions, shape (1024, 2048, 19)
    count_predictions = np.zeros((image_size[2], image_size[3], classes))  # per-pixel window count, shape (1024, 2048, 19)
    tile_counter = 0  # number of tiles processed so far
    for row in range(tile_rows):  # row = 0, 1
        for col in range(tile_cols):  # col = 0, 1, 2, 3
            x1 = int(col * stride)  # start position: x1 = 0 * 513 = 0
            y1 = int(row * stride)  #                 y1 = 0 * 513 = 0
            x2 = min(x1 + tile_size[1], image_size[3])  # end position: x2 = min(0 + 769, 2048)
            y2 = min(y1 + tile_size[0], image_size[2])  #               y2 = min(0 + 769, 1024)
            x1 = max(int(x2 - tile_size[1]), 0)  # re-align the start position: x1 = max(769 - 769, 0)
            y1 = max(int(y2 - tile_size[0]), 0)  #                              y1 = max(769 - 769, 0)
            img = image[:, :, y1:y2, x1:x2]  # image under the window: image[:, :, 0:769, 0:769]
            padded_img = pad_image(img, tile_size)  # pad so that the crop is exactly 769 x 769
            # plt.imshow(padded_img)
            # plt.show()
            tile_counter += 1  # one more tile
            print("Predicting tile %i" % tile_counter)
            # Feed the crop to the network. main() already calls this function inside
            # torch.no_grad(), so the deprecated Variable(..., volatile=True) wrapper is not needed.
            padded_prediction = net(torch.from_numpy(padded_img).cuda())  # [x, x_dsn]
            if isinstance(padded_prediction, list):
                padded_prediction = padded_prediction[0]  # x.shape (1, 19, 97, 97)
            padded_prediction = interp(padded_prediction).cpu().data[0].numpy().transpose(1,2,0)  # upsample to shape (769, 769, 19)
            prediction = padded_prediction[0:img.shape[2], 0:img.shape[3], :]  # crop back to the unpadded area, shape (769, 769, 19)
            count_predictions[y1:y2, x1:x2] += 1  # add 1 to the count inside the window
            full_probs[y1:y2, x1:x2] += prediction  # accumulate the prediction inside the window
    # average the predictions in the overlapping regions
    full_probs /= count_predictions  # accumulated predictions / window count = average
    # visualize normalization Weights
    # plt.imshow(np.mean(count_predictions, axis=2))
    # plt.show()
    return full_probs  # averaged prediction for the whole image, shape (1024, 2048, 19)
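To make the window arithmetic concrete, here is a minimal sketch that reproduces only the coordinate computation of predict_sliding, without any network. The helper name tile_windows is hypothetical and is not part of the repository.

```python
from math import ceil

def tile_windows(image_hw, tile_hw, overlap=1/3):
    """Return the (y1, y2, x1, x2) window of every tile, in row-major order.

    Hypothetical helper: it mirrors the coordinate logic of predict_sliding only.
    """
    img_h, img_w = image_hw
    tile_h, tile_w = tile_hw
    # As in predict_sliding, a single stride derived from the tile height
    # is used for both axes.
    stride = ceil(tile_h * (1 - overlap))
    rows = int(ceil((img_h - tile_h) / stride) + 1)
    cols = int(ceil((img_w - tile_w) / stride) + 1)
    windows = []
    for row in range(rows):
        for col in range(cols):
            x1, y1 = col * stride, row * stride
            x2 = min(x1 + tile_w, img_w)
            y2 = min(y1 + tile_h, img_h)
            # Shift the window back so that it stays inside the image.
            x1 = max(x2 - tile_w, 0)
            y1 = max(y2 - tile_h, 0)
            windows.append((y1, y2, x1, x2))
    return windows

windows = tile_windows((1024, 2048), (769, 769))
for window in windows:
    print(window)
```

For a 1024x2048 image and a 769x769 tile this gives 2 x 4 = 8 windows. The last column starts at x1 = 1279 rather than 3 * 513 = 1539, and the second row starts at y1 = 255 rather than 513, because those windows are shifted back to fit inside the image.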

predict_multiscale

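This function calls predict_whole at each of the given scales. If flipping is enabled, the image is also flipped horizontally and fed to the network, the network output is flipped back, and it is averaged with the unflipped output. The per-scale results are then averaged over all scales.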

# image.shape (1, 3, 1024, 2048), tile_size = (1024, 2048) when args.whole is set,
# scales = [0.75, 1.0, 1.25, 1.5, 1.75, 2.0], classes = 19, flip = True, recur = 1
def predict_multiscale(net, image, tile_size, scales, classes, flip_evaluation, recurrence):
    """
    Predict an image by looking at it with different scales.
        We choose the "predict_whole_img" for the image with less than the original input size,
        for the input of larger size, we would choose the cropping method to ensure that GPU memory is enough.
    """
    image = image.data
    N_, C_, H_, W_ = image.shape  # 1, 3, 1024, 2048
    full_probs = np.zeros((H_, W_, classes))  # shape (1024, 2048, 19)
    for scale in scales:  # [0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
        scale = float(scale)  # e.g. 0.75
        print("Predicting image scaled by %f" % scale)
        # rescale the image by the current factor
        scale_image = ndimage.zoom(image, (1.0, 1.0, scale, scale), order=1, prefilter=False)  # shape (1, 3, 768, 1536) for scale 0.75
        scaled_probs = predict_whole(net, scale_image, tile_size, recurrence)  # predict the whole rescaled image
        if flip_evaluation == True:  # if flipping is enabled
            flip_scaled_probs = predict_whole(net, scale_image[:,:,:,::-1].copy(), tile_size, recurrence)  # predict the horizontally flipped image
            scaled_probs = 0.5 * (scaled_probs + flip_scaled_probs[:,::-1,:])  # flip the result back and average the two predictions
        full_probs += scaled_probs  # accumulate over scales, shape (1024, 2048, 19)
    full_probs /= len(scales)  # average over scales
    return full_probs  # shape (1024, 2048, 19)
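The flip-and-average step can be sketched in isolation. In the example below, predict_with_flip and toy_predict are hypothetical names; toy_predict is only a stand-in for the network and simply turns the input channels into "class scores" so that the index handling can be checked on a small array.

```python
import numpy as np

def predict_with_flip(predict_fn, image):
    """Average predict_fn's output on an image and on its horizontal mirror.

    image has shape (N, C, H, W); predict_fn returns an (H, W, classes) array.
    """
    probs = predict_fn(image)
    flipped_probs = predict_fn(image[:, :, :, ::-1].copy())  # mirror along W
    # The prediction is laid out as (H, W, classes), so W is axis 1 here.
    return 0.5 * (probs + flipped_probs[:, ::-1, :])

def toy_predict(image):
    # Hypothetical stand-in for the network: channels become class scores.
    return image[0].transpose(1, 2, 0).astype(np.float64)

image = np.arange(2 * 3 * 4, dtype=np.float64).reshape(1, 2, 3, 4)
averaged = predict_with_flip(toy_predict, image)
print(averaged.shape)
```

Because toy_predict treats every pixel independently, flipping the input and flipping the output back cancel out, so averaged equals toy_predict(image). A real network is not exactly symmetric under flipping, which is why averaging the two predictions can change the result.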

predict_whole

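With whole-image prediction, the image size may differ from the crop size used as network input (cropsize). The height and width of the network output may therefore be unequal, so the output has to be upsampled (stretched) to the specified size.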

# image.shape (1, 3, 1024, 2048) at scale 1.0, tile_size = (1024, 2048)
def predict_whole(net, image, tile_size, recurrence):
    image = torch.from_numpy(image)
    interp = nn.Upsample(size=tile_size, mode='bilinear', align_corners=True)  # upsampling layer
    prediction = net(image.cuda())  # [x, x_dsn]
    if isinstance(prediction, list):
        prediction = prediction[0]  # x.shape (1, 19, 97, 193); unlike the sliding-window method, the output h and w are not equal
    prediction = interp(prediction).cpu().data[0].numpy().transpose(1,2,0)  # interpolate to shape (1024, 2048, 19)
    return prediction

main

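The steps above produce output. Taking the argmax along the channel dimension gives the predicted result seg_pred (the code applies argmax directly, without a softmax), which can be colored with the putpalette function to get a color segmentation map.

More importantly, we need to compute the segmentation metric mIoU. This is done with a confusion matrix: the valid regions of seg_gt and seg_pred are extracted, flattened into 1-D vectors, and passed to get_confusion_matrix.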

        seg_pred = np.asarray(np.argmax(output, axis=2), dtype=np.uint8)  # argmax over the class axis, shape (1024, 2048)
        output_im = PILImage.fromarray(seg_pred)  # convert the array to an image
        output_im.putpalette(palette)  # color the image
        output_im.save('outputs/'+name[0]+'.png')  # save it
        # np.int was removed in NumPy 1.24; the built-in int is used instead
        seg_gt = np.asarray(label[0].numpy()[:size[0],:size[1]], dtype=int)  # extract the label, shape (1024, 2048)
        ignore_index = seg_gt != 255  # mask of the valid region, i.e. where the label is not 255
        seg_gt = seg_gt[ignore_index]  # extract the valid region as a 1-D vector
        seg_pred = seg_pred[ignore_index]  # same for the prediction; positions correspond one to one
        # show_all(gt, output)
        confusion_matrix += get_confusion_matrix(seg_gt, seg_pred, args.num_classes)  # add this image's counts to the confusion matrix
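The masking step can be tried on a toy label map. The sketch below uses made-up 2x3 arrays; the constant name IGNORE_LABEL is an assumption introduced here for readability, and only the value 255 comes from the code above.

```python
import numpy as np

IGNORE_LABEL = 255  # assumed name; the value 255 is the one used in the code above

seg_gt = np.array([[0, 1, 255],
                   [2, 255, 1]])
seg_pred = np.array([[0, 2, 1],
                     [2, 0, 1]])

valid = seg_gt != IGNORE_LABEL  # boolean mask with the same shape as the label map
gt_flat = seg_gt[valid]         # 1-D vector of the valid ground-truth labels
pred_flat = seg_pred[valid]     # 1-D vector of the predictions at the same pixels
print(gt_flat, pred_flat)
```

Boolean indexing flattens in row-major order, so the two 1-D vectors stay aligned pixel by pixel, and the two pixels labelled 255 are dropped from both.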

The prediction is colored by assigning three RGB values to each pixel of the 1024x2048x1 label map, which gives a 1024x2048x3 image.

def get_palette(num_cls):
    """ Returns the color map for visualizing the segmentation mask.
    Args:
        num_cls: Number of classes
    Returns:
        The color map
    """
    n = num_cls
    palette = [0] * (n * 3)
    for j in range(0, n):
        lab = j
        palette[j * 3 + 0] = 0
        palette[j * 3 + 1] = 0
        palette[j * 3 + 2] = 0
        i = 0
        while lab:
            palette[j * 3 + 0] |= (((lab >> 0) & 1) << (7 - i))
            palette[j * 3 + 1] |= (((lab >> 1) & 1) << (7 - i))
            palette[j * 3 + 2] |= (((lab >> 2) & 1) << (7 - i))
            i += 1
            lab >>= 3
    return palette
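To see what colors this bit manipulation produces, the per-class computation can be pulled out into a small function. This is a sketch for illustration; class_color is a hypothetical name, and it performs the same bit operations as the loop body of get_palette for a single label.

```python
def class_color(label):
    """Return the (R, G, B) color assigned to one label index."""
    r = g = b = 0
    i = 0
    while label:
        # The three lowest bits of the label go to R, G and B respectively,
        # starting at the most significant bit of each color channel.
        r |= ((label >> 0) & 1) << (7 - i)
        g |= ((label >> 1) & 1) << (7 - i)
        b |= ((label >> 2) & 1) << (7 - i)
        i += 1
        label >>= 3
    return (r, g, b)

for label in range(5):
    print(label, class_color(label))
```

Labels 0 to 4 map to (0, 0, 0), (128, 0, 0), (0, 128, 0), (128, 128, 0) and (0, 0, 128). Label 8 needs a second round of the loop and maps to (64, 0, 0), so neighbouring label indices get clearly different colors.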

get_confusion_matrix

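We initialize the confusion matrix confusion_matrix with shape 19x19. The entry in row i, column j is the number of pixels that belong to class i but were predicted as class j; the diagonal entries are the correctly classified pixels.

We therefore use gt_label and pred_label to determine each pixel's position in the confusion matrix.

We build a vector index = (gt_label * class_num + pred_label), which stores the 2-D position in a 1-D vector in row-major order.

For example, with gt_label[0]=1 and pred_label[0]=3 we get index[0]=1*19+3=22. index[0]=22 means that pixel 0 belongs to class 1 but was misclassified as class 3, so the count at confusion_matrix[1][3] is incremented by one.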

# gt_label and pred_label are both 1-D vectors
def get_confusion_matrix(gt_label, pred_label, class_num):
    """
    Calculate the confusion matrix by given label and pred
    :param gt_label: the ground truth label
    :param pred_label: the pred label
    :param class_num: the number of class
    :return: the confusion matrix
    """
    index = (gt_label * class_num + pred_label).astype('int32')  # store the 2-D position in a 1-D vector, row-major
    label_count = np.bincount(index)  # count each case, e.g. how many class-1 pixels were predicted as class 2
    confusion_matrix = np.zeros((class_num, class_num))  # initialize the confusion matrix, shape (19, 19)
    for i_label in range(class_num):  # 0, 1, 2, ..., 18
        for i_pred_label in range(class_num):  # 0, 1, 2, ..., 18
            cur_index = i_label * class_num + i_pred_label  # 0*19+0, 0*19+1, ..., 18*19+18; one value per (gt, pred) case
            if cur_index < len(label_count):
                confusion_matrix[i_label, i_pred_label] = label_count[cur_index]  # store the count of this case
    return confusion_matrix
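The row-major index trick also allows a version without the double loop. The sketch below is an alternative formulation, not the repository's code: confusion_matrix_bincount is a hypothetical name, and it assumes every label in both vectors is smaller than class_num (which holds once the 255 pixels have been masked out).

```python
import numpy as np

def confusion_matrix_bincount(gt_label, pred_label, class_num):
    """Build the confusion matrix from two 1-D label vectors with one bincount."""
    # Cast first so that narrow integer types (e.g. uint8 predictions) cannot overflow.
    index = gt_label.astype(np.int64) * class_num + pred_label.astype(np.int64)
    # minlength pads the counts so that they always fill class_num * class_num cells.
    counts = np.bincount(index, minlength=class_num * class_num)
    # Row-major reshape puts the ground-truth class on rows, the prediction on columns.
    return counts.reshape(class_num, class_num)

gt = np.array([0, 1, 2, 1])
pred = np.array([0, 2, 2, 1])
cm = confusion_matrix_bincount(gt, pred, 3)
print(cm)
```

In this toy case one class-1 pixel is predicted as class 2, so cm[1, 2] is 1, and the three correctly classified pixels land on the diagonal.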

main

The semantic segmentation metric mIoU is computed as follows.

\[MIoU=\frac {1}{k+1}\sum^k_{i=0}\frac{p_{ii}}{\sum^k_{j=0}p_{ij}+\sum^k_{j=0}p_{ji}-p_{ii}}
\]

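The IoU of each class is computed, and the results are averaged over classes. Take i=1 as an example: \(p_{11}\) is the number of true positives, i.e. pixels that belong to class 1 and are predicted as class 1. \(\sum^k_{j=0}p_{1j}\) is the number of all pixels that belong to class 1, whatever class they are predicted as (note that this includes \(p_{11}\)). \(\sum^k_{j=0}p_{j1}\) is the number of all pixels predicted as class 1, whatever class they belong to (this also includes \(p_{11}\)). \(p_{11}\) is therefore counted twice in the denominator, so one \(p_{11}\) is subtracted.

By the definition of the confusion matrix, the diagonal entries are \(p_{ii}\), the sum of row i is \(\sum^k_{j=0} p_{ij}\), and the sum of column i is \(\sum^k_{j=0} p_{ji}\). Computing mIoU from the confusion matrix is therefore straightforward, as the code shows.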

    pos = confusion_matrix.sum(1)  # row sums of the confusion matrix
    res = confusion_matrix.sum(0)  # column sums of the confusion matrix
    tp = np.diag(confusion_matrix)  # diagonal entries, i.e. the correct predictions
    IU_array = (tp / np.maximum(1.0, pos + res - tp))  # per-class IoU = intersection / union, shape (19,)
    mean_IU = IU_array.mean()  # average over classes
    # getConfusionMatrixPlot(confusion_matrix)
    print({'meanIU':mean_IU, 'IU_array':IU_array})
    with open('result.txt', 'w') as f:
        f.write(json.dumps({'meanIU':mean_IU, 'IU_array':IU_array.tolist()}))
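As a worked example, the same computation can be run on a small 3-class confusion matrix. The matrix values are made up for illustration, and iou_from_confusion is a hypothetical wrapper around the lines above.

```python
import numpy as np

def iou_from_confusion(confusion_matrix):
    """Return (per-class IoU, mean IoU) for a square confusion matrix."""
    pos = confusion_matrix.sum(1)   # row sums: pixels that belong to each class
    res = confusion_matrix.sum(0)   # column sums: pixels predicted as each class
    tp = np.diag(confusion_matrix)  # correctly classified pixels per class
    # The maximum guards against division by zero for classes that never appear.
    iou = tp / np.maximum(1.0, pos + res - tp)
    return iou, iou.mean()

cm = np.array([[1, 0, 0],
               [0, 1, 1],
               [0, 0, 1]], dtype=np.float64)
iou, mean_iou = iou_from_confusion(cm)
print(iou, mean_iou)
```

Here the row sums are [1, 2, 1], the column sums are [1, 1, 2] and the diagonal is [1, 1, 1], so the unions are [1, 2, 2], the per-class IoUs are [1, 0.5, 0.5], and the mean IoU is 2/3.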

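Summary

Generally speaking, a sliding window gives better results than whole-image prediction, but this project does not combine multi-scale testing with the sliding window; combining the two may bring a further improvement. In addition, some networks have intermediate layers that constrain the input size (for example, height and width must be equal), in which case whole-image prediction does not work. It is therefore advisable to use the sliding-window method for both single-scale and multi-scale testing.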