A Collection of Problems Encountered When Training Faster R-CNN on a Custom Dataset under Win10 (Repost)

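Note: Training under Win10 is full of pitfalls. Thanks to the predecessors and experts online, even though some of them lead you straight into another pit.

Recently I used Faster R-CNN to train on my own data, a remote-sensing image dataset called RSOD. The annotation format is close to pascal_voc, but because the annotations were made by a student team they contain some errors, which set many traps for the training later on. Below are the problems I ran into with my own dataset. One thing to really keep in mind: until you are sure the program runs end to end, set the number of iterations to a small value (for example 100) to save debugging time.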

Error index:

1 ./tools/train_faster_rcnn_alt_opt.py is not found

2 assert (boxes[:, 2] >= boxes[:, 0]).all()

3 'module' object has no attribute 'text_format'

4 TypeError: slice indices must be integers or None or have an __index__ method

5 TypeError: 'numpy.float64' object cannot be interpreted as an index

6 error == cudaSuccess (2 vs. 0) out of memory

7 loss_bbox = nan, result: Mean AP = 0.000

8 AttributeError: 'NoneType' object has no attribute 'astype'

Error 1: Running sudo ./train_faster_rcnn_alt_opt.sh 0 ZF pascal_voc fails with: ./tools/train_faster_rcnn_alt_opt.py is not found

Solution: the sh file was executed from the wrong location. Go back to the py-faster-rcnn directory and run sudo ./experiments/scripts/train_faster_rcnn_alt_opt.sh 0 ZF pascal_voc

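Error 2: When append_flipped_images is called: assert (boxes[:, 2] >= boxes[:, 0]).all()

According to what I found online, this is mainly caused by annotation errors in your own dataset. With a custom dataset an x coordinate may be 0, whereas pascal_voc annotations count from 1. The Faster R-CNN code therefore converts to 0-based form by subtracting 1 from Xmin, Xmax, Ymin and Ymax, and this can wrap around: the coordinates are held in an unsigned 16-bit array, so x = 0 becomes 65535 after the subtraction. Worse still, some coordinates may be annotated as negative or lie outside the image. The main fixes are:

(1) Modify lib/datasets/imdb.py and insert the following after boxes[:, 2] = widths[i] - oldx1 - 1: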

    for b in range(len(boxes)):
        if boxes[b][2] < boxes[b][0]:
            boxes[b][0] = 0

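This approach only treats the symptom, and it assumes that only boxes[b][0] can overflow, whereas I later found that boxes[b][2] can overflow too. Not recommended.

(2) Modify the _load_pascal_annotation function in lib/datasets/pascal_voc.py, which reads annotation files in pascal_voc format, and remove every -1 in the lines below. (pascal_voc annotations are 1-based, so subtracting 1 converts them to 0-based; if our annotations are already 0-based, subtracting 1 again may wrap around, so it has to go.) If the only problem is 0-based annotation, with no negative or out-of-image coordinates, this should be enough.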

    x1 = float(bbox.find('xmin').text)  # - 1
    y1 = float(bbox.find('ymin').text)  # - 1
    x2 = float(bbox.find('xmax').text)  # - 1
    y2 = float(bbox.find('ymax').text)  # - 1
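The wrap-around behind this error can be reproduced in a few lines. The sketch below assumes the box coordinates are held in an unsigned 16-bit NumPy array, which is what the value 65535 points to. The helper names to_zero_based and flip_x are made up for illustration and are not part of py-faster-rcnn, and the coordinates are invented.

```python
import numpy as np

def to_zero_based(coords):
    """Subtract 1 from coordinates stored as unsigned 16-bit integers."""
    arr = np.asarray(coords, dtype=np.uint16)
    return arr - np.uint16(1)  # 0 wraps around to 65535 instead of becoming -1

def flip_x(x1, x2, width):
    """Horizontally flip boxes, using signed integers so the damage is visible."""
    x1 = np.asarray(x1, dtype=np.int64)
    x2 = np.asarray(x2, dtype=np.int64)
    return width - x2 - 1, width - x1 - 1

# Invented annotations: the first box was labelled with xmin = 0.
xmin = to_zero_based([0, 1, 120])
xmax = to_zero_based([50, 60, 200])
print(xmin.tolist())                    # [65535, 0, 119]

new_x1, new_x2 = flip_x(xmin, xmax, width=1044)
print((new_x2 >= new_x1).tolist())      # [False, True, True]
```

Only the box that started at xmin = 0 violates x2 >= x1 after the flip, which is the condition the assert checks.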

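(3) Annotation rectangles out of bounds

I did both of the steps above, but the stage "Stage 1 RPN, init from ImageNet model" still failed. So the problem was probably not just x = 0; the annotations themselves might be wrong, for example a groundtruth box with x1 < 0 or x2 > imageWidth. I decided to first find out which image was at fault. In lib/datasets/imdb.py, before the line

    assert (boxes[:, 2] >= boxes[:, 0]).all()

add:

    print(self.image_index[i])

This prints the name of the image being processed. After a run, the last image name printed before the error is the offending one; check that image's entry in Annotation for out-of-bounds rectangles. Sure enough, one object's x1 had been annotated as -2.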

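After correcting this annotation, just when I thought I was finally done, the error came back. Gritting my teeth, I told myself "I am patient". This time it appeared in the stage "Stage 1 Fast R-CNN using RPN proposals, init from ImageNet model", which means append_flipped_images was now processing the proposals produced by the RPN rather than the groundtruth from the annotation files. That made no sense: if the groundtruth was fine, how could the proposals overflow? Conclusion: I had not deleted the cache! Delete all the files in py-faster-rcnn/data/cache and in py-faster-rcnn/data/VOCdevkit2007/annotations_cache. A blog post put me on this track. Before that I spent quite some time hunting for annotation errors. If you only want the fix there is no need to read on, but the reasoning is worth recording as a way of analysing the problem: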

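First I wanted to know which proposal was at fault, starting again with which image. In lib/datasets/imdb.py, before the line

    assert (boxes[:, 2] >= boxes[:, 0]).all()

add:

    print("num_image:%d" % (i))

Then run again. This prints the index of each image in the training set (the image name is not needed this time). Find the last index printed before the assertion fails; in my case it was 320. The next step is to check whether all the proposals on that image look normal, so, again before the assert, insert: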

    if i == 320:
        print(self.image_index[i])
        for z in xrange(len(boxes)):
            print('x2:%d  x1:%d' % (boxes[z][2], boxes[z][0]))
            if boxes[z][2] < boxes[z][0]:
                print("here is the bad point!!!")

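Running again and reading the log, "here is the bad point!!!" appeared after a line with x2 = -64491 and x1 = 1011. My image width is 1044 and 1044 - 65535 = -64491, so it was actually x2 that had gone out of range. Since boxes[:, 2] = widths[i] - oldx1 - 1, the pre-flip oldx1 must have been 65534, an overflowed value. Why would a proposal produced by the RPN overflow? Normally RPN proposals never exceed the image bounds, unless the annotated groundtruth itself does! And if the groundtruth had been wrong, the stage "Stage 1 RPN, init from ImageNet model" would already have failed. So it had to be the cache.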
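Because a stale cache was the real culprit, it can help to clear both cache directories automatically after every annotation fix. The following is a minimal sketch, not part of py-faster-rcnn: clear_cache_dirs is a hypothetical helper, and the two paths are the ones named above, assumed here to be relative to the directory that contains py-faster-rcnn.

```python
import os

def clear_cache_dirs(cache_dirs):
    """Delete every regular file directly inside the given directories.

    Returns the list of removed paths. Directories that do not exist are skipped.
    """
    removed = []
    for cache_dir in cache_dirs:
        if not os.path.isdir(cache_dir):
            continue
        for name in sorted(os.listdir(cache_dir)):
            path = os.path.join(cache_dir, name)
            if os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return removed

# Cache locations mentioned in the text (assumed relative to the current directory).
CACHE_DIRS = [
    os.path.join("py-faster-rcnn", "data", "cache"),
    os.path.join("py-faster-rcnn", "data", "VOCdevkit2007", "annotations_cache"),
]

if __name__ == "__main__":
    for path in clear_cache_dirs(CACHE_DIRS):
        print("removed", path)
```

Running it before each training attempt ensures the roidb is rebuilt from the corrected annotation files.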

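Error 3: pb2.text_format(...) fails with 'module' object has no attribute 'text_format'.

Solution: add import google.protobuf.text_format in ./lib/fast_rcnn/train.py. Some people online suggest rolling protobuf back to 2.5.0, but that causes a new problem when building caffe, "cannot import name symbol database", and you then have to fetch the missing file from GitHub, so I do not recommend it.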

Error 4: When execution reaches lib/rpn/proposal_target_layer.py: TypeError: slice indices must be integers or None or have an __index__ method

Solution: the cause is that numpy no longer supports float indices from version 1.12.0 on. Many people online say to downgrade numpy to 1.11.0, but when I did that I got a new error: ImportError: numpy.core.multiarray failed to import. The right fix is not to downgrade numpy (if you already did, just upgrade to the latest version) and to modify lib/rpn/proposal_target_layer.py in two places. (PS: I lost a lot of time here.)

After line 126, add:

    start = int(start)
    end = int(end)

After line 166, add:

    fg_rois_per_this_image = int(fg_rois_per_this_image)
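The behaviour is easy to reproduce outside the project. In the sketch below the slice bounds are derived from a float class label, which is one plausible way start and end end up as floats; slice_with_float_bounds is a hypothetical helper that applies the same int() cast as the fix above.

```python
import numpy as np

def slice_with_float_bounds(values, start, end):
    """Slice after casting float bounds to int, mirroring the fix above."""
    start = int(start)
    end = int(end)
    return values[start:end]

values = np.arange(10)
cls = 1.0                          # a class label that ended up as a float
start, end = 4 * cls, 4 * cls + 4  # both bounds are now floats

try:
    values[start:end]              # rejected by NumPy 1.12 and later
except TypeError as exc:
    print("TypeError:", exc)

print(slice_with_float_bounds(values, start, end).tolist())  # [4, 5, 6, 7]
```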

Error 5: In the _sample_rois function of py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py: TypeError: 'numpy.float64' object cannot be interpreted as an index

Solution: this is really the same problem as Error 4; both come from the numpy version. Again, I do not support the downgrade suggested by many answers online; modifying the project code is the safer route. Modify minibatch.py as follows.

Line 26:

    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)

change to:

    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int)

Line 173:

    cls = clss[ind]

change to:

    cls = int(clss[ind])

Three more places need .astype(np.int) appended. (np.int is just an alias of the built-in int and was removed in NumPy 1.24; on newer versions write .astype(int) instead.) They are:

    # lib/datasets/ds_utils.py line 12:
    hashes = np.round(boxes * scale).dot(v)
    # lib/fast_rcnn/test.py line 129:
    hashes = np.round(blobs['rois'] * cfg.DEDUP_BOXES).dot(v)
    # lib/rpn/proposal_target_layer.py line 60:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
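The root of the problem is that np.round returns a float even when the value is a whole number. A minimal sketch, using made-up values 0.25 and 128 in place of cfg.TRAIN.FG_FRACTION and rois_per_image, and a hypothetical helper foreground_quota:

```python
import numpy as np

def foreground_quota(fg_fraction, rois_per_image):
    """Number of foreground RoIs per image as a plain int, usable as a size or index."""
    return int(np.round(fg_fraction * rois_per_image))

raw = np.round(0.25 * 128)
print(type(raw).__name__)             # float64: rounding does not change the type

try:
    np.zeros(raw)                     # a float64 is not accepted as an array size
except TypeError as exc:
    print("TypeError:", exc)

quota = foreground_quota(0.25, 128)
print(quota, np.zeros(quota).shape)   # 32 (32,)
```

For whole arrays, such as the hashes computed above, .astype(int) plays the same role as int() does for a single value.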

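Error 6: error == cudaSuccess (2 vs. 0) out of memory

The GPU is out of memory. There are two possible causes: (1) the batch size is too large; (2) other processes are taking up too much of the GPU.

Solution: first look at GPU usage. watch -n 1 nvidia-smi shows it in real time, so you can start the training program and watch how usage changes. If other programs are clearly taking a lot of GPU memory, stop them with kill -9 PID. If our own training program is using too much, consider reducing the batch size.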
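The same check can be scripted before launching a run. The sketch below relies on nvidia-smi's CSV query mode; parse_gpu_memory and query_gpu_memory are hypothetical helpers, and the sample text is invented in the expected "index, used, total" format so the parser can be tried on a machine without a GPU.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = [int(field) for field in line.split(",")]
        rows.append({"gpu": index, "used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return rows

def query_gpu_memory():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver installed)."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_memory(output)

# Invented sample in the expected format, not real measurements.
sample = "0, 7421, 8192\n1, 12, 8192\n"
for row in parse_gpu_memory(sample):
    print(row)
```

If the free memory on the chosen GPU is already low before training starts, another process is the likely cause; otherwise the batch size is the first thing to reduce.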

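Error 7: lib/fast_rcnn/bbox_transform.py emits RuntimeWarning: invalid value encountered in log at targets_dw = np.log(gt_widths / ex_widths), then loss_bbox = nan, and the final Mean AP = 0.000

Many people online say to lower the learning rate, but that treats the symptom rather than the cause: it merely postpones the error, and a learning rate that is too low carries its own substantial risk of getting stuck in a local optimum.

After analysis and debugging, I found that this, too, was caused by out-of-bounds annotations in my own dataset!!! There are six ways to go out of bounds: x1 < 0; x2 > width; x2 < x1; y1 < 0; y2 > height; y2 < y1. Unfortunately the original author wrote the code for pascal_voc data and never considered that annotations might be wrong. The released code only has assert (boxes[:, 2] >= boxes[:, 0]).all() in append_flipped_images, i.e. it only asserts x2 >= x1 after the horizontal flip. A failure there may indicate a wrong x annotation (see Error 2 above). Wrong y annotations, however, are not checked at all.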
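How a single bad box turns into nan can be shown with a small sketch. It assumes box sizes are computed with the inclusive "+ 1" convention and that the regression targets are log ratios of sizes, as the warning line suggests; box_sizes and log_size_targets are hypothetical helpers and the boxes are invented.

```python
import numpy as np

def box_sizes(boxes):
    """Width and height of [x1, y1, x2, y2] boxes (inclusive +1 convention, assumed)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    return widths, heights

def log_size_targets(ex_boxes, gt_boxes):
    """Log-ratio targets for width and height; warnings are silenced for the demo."""
    ex_w, ex_h = box_sizes(ex_boxes)
    gt_w, gt_h = box_sizes(gt_boxes)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.log(gt_w / ex_w), np.log(gt_h / ex_h)

ex = [[10, 10, 59, 59],    # well-formed box
      [10, 80, 59, 20]]    # y2 < y1: the height becomes negative
gt = [[12, 12, 61, 61],
      [12, 12, 61, 61]]

dw, dh = log_size_targets(ex, gt)
print(np.isnan(dh).tolist())   # [False, True]
```

One nan target is enough to make the bounding-box loss nan, which is why lowering the learning rate cannot cure the problem.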

Analysis: I started with the file that raised the warning, lib/fast_rcnn/bbox_transform.py, and its function bbox_transform. Before the line

    targets_dw = np.log(gt_widths / ex_widths)

add:

    print(gt_widths)
    print(ex_widths)
    print(gt_heights)
    print(ex_heights)
    assert (gt_widths > 0).all()
    assert (gt_heights > 0).all()
    assert (ex_widths > 0).all()
    assert (ex_heights > 0).all()

Running again, the AssertionError was raised at assert (ex_heights > 0).all(), meaning some anchor had a height that was not positive. Height corresponds to the y direction of the annotations, so I suspected wrong y annotations. As with Error 2, I went back to lib/datasets/imdb.py and added checks on y to append_flipped_images. Here is the code:

    # The original code has no function for getting image heights, so add one.
    def _get_heights(self):
        return [PIL.Image.open(self.image_path_at(i)).size[1]
                for i in xrange(self.num_images)]

    def append_flipped_images(self):
        num_images = self.num_images
        widths = self._get_widths()
        heights = self._get_heights()  # added: image heights
        for i in xrange(num_images):
            boxes = self.roidb[i]['boxes'].copy()
            oldx1 = boxes[:, 0].copy()
            oldx2 = boxes[:, 2].copy()
            print(self.image_index[i])  # print the image name
            assert (boxes[:, 1] <= boxes[:, 3]).all()  # ymin <= ymax
            assert (boxes[:, 1] >= 0).all()  # ymin >= 0 (0-based)
            assert (boxes[:, 3] < heights[i]).all()  # ymax < height (0-based)
            assert (oldx2 < widths[i]).all()  # xmax < width (0-based)
            assert (oldx1 >= 0).all()  # xmin >= 0 (0-based)
            assert (oldx2 >= oldx1).all()  # xmax >= xmin
            boxes[:, 0] = widths[i] - oldx2 - 1
            boxes[:, 2] = widths[i] - oldx1 - 1
            # print("num_image:%d" % (i))
            assert (boxes[:, 2] >= boxes[:, 0]).all()
            entry = {'boxes': boxes,
                     'gt_overlaps': self.roidb[i]['gt_overlaps'],
                     'gt_classes': self.roidb[i]['gt_classes'],
                     'flipped': True}
            self.roidb.append(entry)
        self._image_index = self._image_index * 2

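Run again. Wherever a y annotation is wrong, an AssertionError is raised. Look up the last image name printed in the log, find the wrong mark in the corresponding Annotation file, fix it, and do not forget to delete the py-faster-rcnn/data/cache cache. Then run again, fix the annotation of the next image that triggers an AssertionError, delete the cache again, and repeat until every annotation error has been found. Then you are done: mAP is no longer 0.000!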
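The fix-and-rerun cycle can be shortened by scanning the annotation files up front for all six kinds of out-of-bounds box. The sketch below assumes the usual pascal_voc XML layout (size/width, size/height, object/name, object/bndbox); find_bad_boxes is a hypothetical helper, and the sample annotation, including its image size and class name, is invented.

```python
import xml.etree.ElementTree as ET

def find_bad_boxes(xml_text, one_based=True):
    """Return (object name, reason) pairs for boxes that break the image bounds.

    one_based=True treats valid coordinates as 1..width and 1..height;
    one_based=False treats them as 0..width-1 and 0..height-1.
    """
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    low = 1 if one_based else 0
    max_x = width if one_based else width - 1
    max_y = height if one_based else height - 1

    problems = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bbox = obj.find("bndbox")
        x1, y1, x2, y2 = (float(bbox.find(tag).text)
                          for tag in ("xmin", "ymin", "xmax", "ymax"))
        checks = [
            (x1 < low, "xmin below range"),
            (y1 < low, "ymin below range"),
            (x2 > max_x, "xmax beyond width"),
            (y2 > max_y, "ymax beyond height"),
            (x2 < x1, "xmax < xmin"),
            (y2 < y1, "ymax < ymin"),
        ]
        problems.extend((name, reason) for failed, reason in checks if failed)
    return problems

# Invented annotation with two faulty boxes.
sample = """<annotation>
  <size><width>1044</width><height>915</height></size>
  <object><name>aircraft</name>
    <bndbox><xmin>-2</xmin><ymin>30</ymin><xmax>80</xmax><ymax>95</ymax></bndbox>
  </object>
  <object><name>aircraft</name>
    <bndbox><xmin>100</xmin><ymin>400</ymin><xmax>180</xmax><ymax>350</ymax></bndbox>
  </object>
</annotation>"""
print(find_bad_boxes(sample))
# [('aircraft', 'xmin below range'), ('aircraft', 'ymax < ymin')]
```

Applied to every file in the annotation folder, a check like this lists all faulty boxes in one pass, so the cache only has to be cleared once after the corrections.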

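Error 8: Training finally succeeded with mAP = 0.66, so it was time to test. The procedure is well covered in a blog post. Running demo.py fails at im_orig = im.astype(np.float32, copy=True) with AttributeError: 'NoneType' object has no attribute 'astype'.

Solution: carefully check the paths and file names, and go through everything path-related in demo.py.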
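The 'NoneType' in this message usually means the image was never loaded. Assuming the image is read with cv2.imread, that function returns None instead of raising when it cannot read the file, so the failure only surfaces later at im.astype. A small guard makes the cause visible immediately. In the sketch below read_image_checked is a hypothetical helper, reader stands for cv2.imread or any similar function, and the path is made up.

```python
import os

def read_image_checked(path, reader):
    """Read an image with `reader`, failing loudly instead of returning None."""
    if not os.path.isfile(path):
        raise FileNotFoundError("image file not found: %s" % path)
    image = reader(path)
    if image is None:
        raise ValueError("file exists but could not be decoded: %s" % path)
    return image

def failing_reader(path):
    """Stand-in for a reader that, like cv2.imread, returns None on failure."""
    return None

try:
    read_image_checked(os.path.join("data", "demo", "no_such_image.jpg"),
                       failing_reader)
except FileNotFoundError as exc:
    print(exc)
```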

That's all.
