(Original) Python code for downloading PubFig

Please credit the source when reposting:

http://www.cnblogs.com/darkknightzh/p/5715305.html

PubFig database website:

http://www.cs.columbia.edu/CAVE/databases/pubfig/

For copyright reasons, the database does not provide the images themselves, only links to them, and some of those links are now dead.

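Notes: 1. Some URLs are blocked by the Great Firewall, so it is best to enable a proxy.

2. dev_urls.txt and eval_urls.txt can both be downloaded from the official website.

3. I am new to Python, so the code is not pretty and still has problems.

Problem 1: some files no longer exist. This cannot be avoided.

Problem 2: connecting to some URLs sometimes takes a long time and then raises an exception with a message like the following: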

HTTPConnectionPool(host='www.stardepot.ca', port=): Max retries exceeded with url: /img/Miley_Cyrus_27.jpg (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x02AAC3B0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

I do not yet know how to fix this.
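
One possible mitigation for Problem 2, offered as a sketch rather than a verified fix: check each URL with an explicit timeout and turn network failures into a result instead of an exception. The version below uses only the Python 3 standard library. The names `url_is_reachable` and `opener` are hypothetical and do not appear in the script. A socket timeout generally does not limit the DNS lookup itself, so a `getaddrinfo failed` error may still take a while to surface.

```python
import socket
import urllib.error
import urllib.request


def url_is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, reason) for a HEAD request; network failures do not raise.

    `opener` is a hypothetical hook so the function can be tried offline.
    """
    request = urllib.request.Request(url, method='HEAD')
    try:
        response = opener(request, timeout=timeout)
    except urllib.error.HTTPError as e:    # server answered with an error status
        return False, 'HTTP %d' % e.code
    except urllib.error.URLError as e:     # DNS failure, refused connection, ...
        return False, 'unreachable: %s' % e.reason
    except socket.timeout:                 # no answer within `timeout` seconds
        return False, 'timed out'
    status = response.getcode()
    response.close()
    return status == 200, 'HTTP %d' % status
```

Calling `url_is_reachable(url)` in place of `exists(url)` would let the loop log the reason and move on to the next image.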

 __author__ = 'XXX'

 import os

 import numpy as np

 import urllib

 import re  # regular expression library

 import requests

 import time

 def findAllStrLoc(inStr, findStr):

     loc = []

     start = 0

     while True:

         curLoc = inStr.find(findStr, start)

         if curLoc == -1:     # if search string not found, find() returns -1

             break           # search is complete, break out of the while loop

         start = curLoc + 1    # move to next possible start position

         loc.append(curLoc)

     return loc

 def loadData(dataPath, startLine):

     datas = []

     f = open(dataPath, 'r')   # with open(dataPath, 'r') as f:

     for line in f.readlines()[startLine:]:

         # data = line.strip().split()

         loc = findAllStrLoc(line, '\t')

         data = []

         data.append(line[0:(loc[0])])        # person   # the end index of the sub str is excluded

         data.append(line[loc[0]+1:loc[1]])   # imagenum

         data.append(line[loc[1]+1:loc[2]])   # url

         rect = line[loc[2]+1:loc[3]]         # rect

         rectLoc = re.findall(r'\d+', rect)

         for ind in range(len(rectLoc)):

             data.append(rectLoc[ind])

         data.append(line[loc[3]+1:len(line)-1])  # md5sum

         datas.append(data)

     f.close()

     return np.array(datas)      # datas

 def createimgfolder(imgFolder):

     if not os.path.isdir(imgFolder):

         os.makedirs(imgFolder)

 def getImgNameFromURL(url):

     loc = findAllStrLoc(url, '/')

     imgName = url[loc[len(loc)-1]+1:]

     txtName = imgName.split('.')[0] + '.txt'

     return (imgName, txtName)

 def exists(path):

     r = requests.head(path, timeout=10)  # timeout so an unresponsive host cannot stall the run

     return r.status_code == requests.codes.ok

 def main():

     print('loading data')

     imgInfo = loadData('D:/dev_urls.txt', 2)

     print('finish loading data\n')

     databaseFolder = 'D:/pubFig'

     createimgfolder(databaseFolder)

     for i in range(9526, len(imgInfo)):

         curtime = time.strftime('%y%m%d-%H%M%S',time.localtime())

         imgFolder = databaseFolder + '/' + imgInfo[i][0]

         createimgfolder(imgFolder)

         url = imgInfo[i][2]

         (imgName, txtName) = getImgNameFromURL(url)

         try:

             if exists(url):

                 page = urllib.urlopen(url)  # Python 2 API; on Python 3, import urllib.request and call urllib.request.urlopen(url)

                 img = page.read()

                 page.close()

                 imgPath = imgFolder + '/' + imgName

                 f = open(imgPath, "wb")

                 f.write(img)

                 f.close()

                 txtPath = imgFolder + '/' + txtName

                 f = open(txtPath, "w")

                 for j in range(4):

                     f.write(imgInfo[i][j+3] + ' ')

                 f.close()

                 print('%s:%d/%d %s finish'%(curtime, i+1, len(imgInfo), url))

             else:

                 print('%s:%d/%d %s does not exist'%(curtime, i+1, len(imgInfo), url))

         except (Exception) as e:

             print('%s:%d/%d %s exception %s'%(curtime, i+1, len(imgInfo), url, e))

     print('finish')

 if __name__ == '__main__':

     main()
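
The script parses the md5sum column of each record but never uses it. As a sketch, assuming that column is the hex MD5 digest of the image file, the downloaded bytes could be checked before being written to disk. `parse_url_line` and `md5_matches` are hypothetical helper names. The field order (person, imagenum, url, rect, md5sum) follows `loadData` above.

```python
import hashlib
import re


def parse_url_line(line):
    """Split one tab-separated record into a dict, using loadData's field order."""
    person, imagenum, url, rect, md5sum = line.rstrip('\r\n').split('\t')
    return {
        'person': person,
        'imagenum': imagenum,
        'url': url,
        'rect': [int(v) for v in re.findall(r'\d+', rect)],  # the four face-rectangle numbers
        'md5sum': md5sum,
    }


def md5_matches(data, expected_md5):
    """Return True if the MD5 of `data` (bytes) equals the expected hex digest."""
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()
```

In `main()`, a check such as `md5_matches(img, record['md5sum'])` before `f.write(img)` would skip images whose content has changed since the list was compiled.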
