Python Crawler (1): Scraping Images from Xitek (色影无忌)

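I enjoy photography and often browse the award-winning works on the Xitek (色影无忌) forum, so I wrote a small script to download those award-winning images. I have tested it myself and it works.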

The script automatically downloads all of the award-winning images.

Full code (Python 2):

# -*- coding: utf-8 -*-
# Python 2 script: urllib2, StringIO and the print statement do not exist in Python 3.
__author__ = 'rocchen'

import urllib2, sys, StringIO, gzip, time, random, re, urllib, os

from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')

class Xitek():
    def __init__(self):
        self.url = "http://photo.xitek.com/"
        # Send a browser-like User-Agent with every request.
        user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        self.headers = {"User-Agent": user_agent}
        self.last_page = self.__get_last_page()

    def __get_last_page(self):
        # The first <a class="blast"> link is expected to end with the last page number.
        html = self.__getContentAuto(self.url)
        bs = BeautifulSoup(html, "html.parser")
        page = bs.find_all('a', class_="blast")
        last_page = page[0]['href'].split('/')[-1]
        return int(last_page)

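    def __getContentAuto(self, url):
        # Fetch url and return the body, decompressing it when the server compressed it.
        req = urllib2.Request(url, headers=self.headers)
        resp = urllib2.urlopen(req)
        # time.sleep(2 * random.random())  # optional random pause between requests
        content = resp.read()
        info = resp.info().get("Content-Encoding")
        if info is None:
            return content
        # Any Content-Encoding value is assumed to mean gzip here.
        t = StringIO.StringIO(content)
        gziper = gzip.GzipFile(fileobj=t)
        return gziper.read()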

    def __download(self, url):
        # Download every photo linked from one listing page.
        p = re.compile(r'href="(/photoid/\d+)"')
        html = self.__getContentAuto(url)
        content = p.findall(html)
        for i in content:
            print i
            # self.url ends with '/' and i starts with '/', so drop one of them.
            photoid = self.__getContentAuto(self.url.rstrip('/') + i)
            bs = BeautifulSoup(photoid, "html.parser")
            img = bs.find('img', class_="mimg")
            if img is None:
                # No full-size image found on this page; skip it.
                continue
            final_link = img['src']
            print final_link
            # Use the page title as the file name, replacing characters
            # that are not allowed in Windows file names.
            title = bs.title.string.strip()
            filename = re.sub(r'[\\/:*?"<>|]', '-', title)
            filename = filename + '.jpg'
            urllib.urlretrieve(final_link, filename)

    def getPhoto(self):
        # Walk every listing page under style/0, from page 0 to the last page.
        start = 0
        photo_url = "http://photo.xitek.com/style/0/p/"
        for i in range(start, self.last_page + 1):
            url = photo_url + str(i)
            print url
            # time.sleep(1)  # optional pause between pages
            self.__download(url)

def main():
    # Save the images into a "content" sub-folder of the current directory.
    sub_folder = os.path.join(os.getcwd(), "content")
    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)
    os.chdir(sub_folder)
    obj = Xitek()
    obj.getPhoto()

if __name__ == "__main__":
    main()
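
The script above targets Python 2. For readers on Python 3, the decompression step in `__getContentAuto` could be sketched roughly as follows. This is only a minimal sketch, not the author's code: `decode_body` is a hypothetical helper name, and unlike the original it gunzips only when the header explicitly says `gzip`.

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Return the response body as bytes.

    Hypothetical helper (not in the original script): `raw` is the bytes
    read from the response, `content_encoding` is the value of the
    Content-Encoding header, or None when the header is absent.
    """
    if content_encoding and content_encoding.lower() == "gzip":
        # Wrap the compressed bytes in a file-like object and decompress.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    # No header, or an encoding this sketch does not handle: return as-is.
    return raw


# Usage: round-trip a small page through gzip.
page = b"<html>xitek</html>"
restored = decode_body(gzip.compress(page), "gzip")
```

With `urllib.request`, the two arguments would typically come from `resp.read()` and `resp.headers.get("Content-Encoding")`.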

For a detailed walkthrough, see:

http://www.30daydo.com/article/56
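
As a side note on the string-handling steps inside `__download` (finding the `/photoid/...` links, joining them onto the site URL, and turning the page title into a file name), here is a small Python 3 sketch. The function names `find_photo_links`, `photo_page_url` and `safe_filename` are made up for illustration; the regular expression and base URL are taken from the script.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://photo.xitek.com/"
# Same link pattern as the script uses on each listing page.
PHOTO_LINK = re.compile(r'href="(/photoid/\d+)"')
# Characters that Windows does not allow in file names.
FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def find_photo_links(html):
    """Return the site-relative /photoid/<number> links found in html."""
    return PHOTO_LINK.findall(html)


def photo_page_url(path):
    """Join a site-relative link onto the base URL without doubling the slash."""
    return urljoin(BASE_URL, path)


def safe_filename(title, ext=".jpg"):
    """Replace forbidden characters in a page title and append an extension."""
    return FORBIDDEN.sub("-", title.strip()) + ext


# Usage on a made-up listing fragment (hypothetical HTML, for illustration only).
sample = '<a href="/photoid/42">photo</a> <a href="/other/1">other</a>'
links = [photo_page_url(p) for p in find_photo_links(sample)]
```

Using `urljoin` avoids having to reason about whether the base URL ends with a slash, and the raw-string character class includes the backslash explicitly.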
