The Python Road: A Web Crawler Example

urlController.py

import bsController
from urllib import request

class SpiderMain(object):
    def __init__(self):
        self.header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
        self.bsManage = bsController.bsManage()

    def getUrl(self,rootUrl):
        # Gallery pages are numbered 1.html, 2.html, ... under rootUrl
        for i in range(1,500):
            url = rootUrl+'%s' %i+'.html'
            req = request.Request(url)
            for h in self.header:
                req.add_header(h, self.header[h])
            try:
                # A Request object has no close(); close the response instead
                with request.urlopen(req) as resp:
                    html = resp.read()
                # print(html)
                self.bsManage.getPageUrl(html,i)
            except request.URLError as e:
                if hasattr(e, 'code'):
                    print('Error code:',e.code)
                elif hasattr(e, 'reason'):
                    print('Reason:',e.reason)

if __name__=='__main__':
    rootUrl = 'http://www.meitulu.com/item/'
    obj_root = SpiderMain()
    obj_root.getUrl(rootUrl)
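The header loop in `getUrl` can also be folded into the `Request` constructor. The following is only a minimal sketch: `HEADERS` and `build_request` are hypothetical names that do not appear in the original files, and the header dict is trimmed for brevity. It builds the request for gallery page `i` without touching the network.

```python
from urllib import request

# Trimmed copy of the headers used above. HEADERS and build_request are
# illustrative names, not part of the original files.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
    'Accept-Language': 'en-US,en;q=0.8',
}

def build_request(root_url, i, headers=HEADERS):
    """Return a Request for gallery page i, e.g. <root_url>1.html."""
    url = '%s%d.html' % (root_url, i)
    # Passing headers= stores each pair the same way add_header would.
    return request.Request(url, headers=headers)

req = build_request('http://www.meitulu.com/item/', 1)
print(req.full_url)
```

The returned object can be passed to `request.urlopen` exactly like the `req` built in `getUrl`.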

bsController.py

from bs4 import BeautifulSoup
from urllib import request
import os

class bsManage:
    def __init__(self):
        self.pageUrl = 'http://www.meitulu.com/item/'
        self.header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}

    # html: the HTML of the fetched page
    # i: the gallery number, as in i_x.html
    def getPageUrl(self,html,i):
        soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
        # Get the link to the last page (the second-to-last <a> in the pager)
        lastUrl = soup.find_all('div', {'id': 'pages'})[0].find_all('a')[-2]['href']
        # print(html)
        # print(lastUrl)
        # Get the last page number: the text between the final '_' and '.html'
        lastPage = int(lastUrl[:-5].rsplit('_', 1)[-1])
        # Create the image folders (img/ and img/<i>/)
        path = 'img/%s' %i
        os.makedirs(path, exist_ok=True)
        # Crawl the first page here, because its URL format differs from the later pages
        # Collect the <img> tags of the wanted pictures
        links = soup.find_all('img',class_='content_img')
        for link in links:
            name = str(link['src'])[-21:]
            data = request.urlopen(link['src']).read()
            with open('img/%s/' %i + name,'wb') as img:
                img.write(data)
        # print('%d finished' %i)
        # str = self.pageUrl + '%s' %i + '.html'
        # print(str)
        # Each gallery has lastPage sub-pages
        for j in range(2,lastPage+1):
            # Build the URL of the next sub-page
            url = self.pageUrl + '%s_%s' %(i,j) + '.html'
            self.saveImgWithUrl(url,i)
        print('%d finished' %i)

    def saveImgWithUrl(self,url,i):
        req = request.Request(url)
        for h in self.header:
            req.add_header(h, self.header[h])
        try:
            html = request.urlopen(req).read()
            soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
            # Collect the <img> tags of the wanted pictures
            links = soup.find_all('img', class_='content_img')
            for link in links:
                name = str(link['src'])[-21:]
                data = request.urlopen(link['src']).read()
                # The with block closes the file even if the write fails
                with open('img/%s/' % i + name, 'wb') as img:
                    img.write(data)
        except request.URLError as e:
            if hasattr(e, 'code'):
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print('Reason:', e.reason)
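The URL handling in `bsController.py` (reading the last page number from the pager link, building sub-page URLs, and naming image files) can be tried in isolation without any network access. The sketch below is one possible way to do it with the standard library only. The helper names are hypothetical, it assumes pager links end in `<i>_<n>.html` as the URLs built by the script do, and falling back to page 1 when no such suffix is found is an assumption of this sketch. Note that `image_name` takes the last path component of the URL, so the resulting file names differ from those produced by the `[-21:]` slice above.

```python
import posixpath
import re
from urllib.parse import urlsplit

def last_page_from_href(href):
    """Return n from a pager link ending in '<i>_<n>.html', or 1 if absent."""
    match = re.search(r'_(\d+)\.html$', href)
    return int(match.group(1)) if match else 1

def sub_page_url(page_url, i, j):
    """Page 1 is '<i>.html'; later pages are '<i>_<j>.html'."""
    suffix = '%d.html' % i if j == 1 else '%d_%d.html' % (i, j)
    return page_url + suffix

def image_name(src):
    """Use the last path component of the image URL as the file name."""
    return posixpath.basename(urlsplit(src).path)

# The image URL below is a made-up example, not taken from the site.
print(last_page_from_href('http://www.meitulu.com/item/12_7.html'))
print(sub_page_url('http://www.meitulu.com/item/', 12, 3))
print(image_name('http://example.com/images/12/3.jpg?v=1'))
```

Matching on the `_<n>.html` suffix avoids depending on the length of the site prefix or on the number of digits in `i`.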
