Python Crawler 004: Scraping Posts from the First Page of Qiushibaike's "Fresh" Section with urllib2 and Regular Expressions

Procedural approach

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
import sys
import urllib2

# Encoding used for the saved text; named fs_encoding so the built-in type() is not shadowed
fs_encoding = sys.getfilesystemencoding()

if __name__ == '__main__':
    # 1. Request one page and fetch its HTML source
    url = 'http://www.qiushibaike.com/textnew/'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        req = urllib2.Request(url=url, headers=headers)
        res = urllib2.urlopen(req)
        html = res.read().decode('UTF-8').encode(fs_encoding)
    except urllib2.URLError as e:
        # HTTPError is a subclass of URLError, so this single handler covers both
        print e
        sys.exit(1)

    # 2. Extract the wanted data from the HTML source: post id and post content
    regex_content = re.compile(
        r'<div class="article block untagged mb15" id=(.*?)>(?:.*?)<div class="content">(.*?)</div>',
        re.S)
    items = regex_content.findall(html)

    # 3. Save the scraped data to files, one file per post
    path = 'qiubai'
    if not os.path.exists(path):
        os.makedirs(path)
    for item in items:
        # The id attribute is captured with its surrounding quotes; drop them
        file_name = item[0].strip('\'"')
        # Remove <span> tags with a regex: str.lstrip('<span>') strips a set of characters, not a prefix
        content = re.sub(r'</?span>', '', item[1]).strip().replace('\n', '').replace('<br/>', '\n')
        file_path = os.path.join(path, file_name + '.txt')
        # The with block closes the file, so no explicit close() is needed
        with open(file_path, 'w') as fp:
            fp.write(content)
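
The extraction in step 2 can be checked offline, without requesting the site. The sketch below runs the same pattern against a hand-written HTML fragment. The fragment, the ids in it, and the helper names `extract_posts` and `clean_content` are made up for illustration, and the real page markup may differ. It is written for Python 3.

```python
import re

# Same pattern as in the script above
POST_PATTERN = re.compile(
    r'<div class="article block untagged mb15" id=(.*?)>(?:.*?)<div class="content">(.*?)</div>',
    re.S)

# Hypothetical sample markup, not copied from the real site
SAMPLE_HTML = """
<div class="article block untagged mb15" id='qiushi_tag_1001'>
<div class="author">someone</div>
<div class="content">
<span>first line<br/>second line</span>
</div>
</div>
<div class="article block untagged mb15" id='qiushi_tag_1002'>
<div class="content">
<span>span of text</span>
</div>
</div>
"""


def clean_content(raw):
    """Drop <span> tags and source newlines, then turn <br/> into real newlines."""
    text = re.sub(r'</?span>', '', raw).strip()
    return text.replace('\n', '').replace('<br/>', '\n')


def extract_posts(html):
    """Return a list of (post id, cleaned text) tuples."""
    return [(post_id.strip('\'"'), clean_content(raw))
            for post_id, raw in POST_PATTERN.findall(html)]


posts = extract_posts(SAMPLE_HTML)
for post_id, text in posts:
    print('%s: %r' % (post_id, text))
```

The second sample post is chosen because its text begins with the letters of "span". `str.lstrip('<span>')` removes any leading run of the characters `<`, `s`, `p`, `a`, `n`, `>` rather than the literal prefix, so it would also eat the word "span"; removing the tags with `re.sub` avoids that.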

Object-oriented approach

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
import sys
import urllib2

# Encoding used for the saved text; named fs_encoding so the built-in type() is not shadowed
fs_encoding = sys.getfilesystemencoding()


class Spider:
    def __init__(self):
        # %s is replaced with the page number
        self.url = 'http://www.qiushibaike.com/textnew/page/%s/?s=4979315'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'

    # Fetch the HTML source of one page
    def get_page(self, page_index):
        headers = {'User-Agent': self.user_agent}
        try:
            req = urllib2.Request(url=self.url % str(page_index), headers=headers)
            res = urllib2.urlopen(req)
            return res.read().decode('UTF-8').encode(fs_encoding)
        except urllib2.URLError as e:
            # HTTPError is a subclass of URLError, so this single handler covers both
            print e
            sys.exit(1)

    # Parse the HTML source into (post id, post content) tuples
    def analysis(self, html):
        regex_content = re.compile(
            r'<div class="article block untagged mb15" id=(.*?)>(?:.*?)<div class="content">(.*?)</div>',
            re.S)
        return regex_content.findall(html)

    # Save the scraped data to files, one file per post
    def save(self, items, path):
        if not os.path.exists(path):
            os.makedirs(path)
        for item in items:
            # The id attribute is captured with its surrounding quotes; drop them
            file_name = item[0].strip('\'"')
            # Remove <span> tags with a regex: str.lstrip('<span>') strips a set of characters, not a prefix
            content = re.sub(r'</?span>', '', item[1]).strip().replace('\n', '').replace('<br/>', '\n')
            file_path = os.path.join(path, file_name + '.txt')
            # The with block closes the file, so no explicit close() is needed
            with open(file_path, 'w') as fp:
                fp.write(content)

    # Entry point: scrape pages 1 and 2
    def run(self):
        print u'Start scraping...'
        for i in range(1, 3):
            content = self.get_page(i)
            items = self.analysis(content)
            self.save(items, 'qiubai')
        print u'Scraping finished.'


if __name__ == '__main__':
    sp = Spider()
    sp.run()
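
`urllib2` exists only in Python 2; in Python 3 its functionality lives in `urllib.request` and `urllib.error`. The following is a minimal sketch of how the page-fetching step might look on Python 3, assuming the page is still served as UTF-8. The names `build_request` and `fetch_page`, the `timeout` parameter, and the choice to return `None` on failure instead of exiting are illustrative assumptions, not part of the original script. Only the request is built here, so the snippet runs without network access.

```python
import urllib.error
import urllib.request

URL_TEMPLATE = 'http://www.qiushibaike.com/textnew/page/%s/?s=4979315'
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36')


def build_request(page_index):
    """Build the request for one page, sending a browser-like User-Agent."""
    return urllib.request.Request(
        url=URL_TEMPLATE % page_index,
        headers={'User-Agent': USER_AGENT})


def fetch_page(page_index, timeout=10):
    """Return the decoded HTML of one page, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page_index), timeout=timeout) as res:
            return res.read().decode('utf-8')
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so both are handled here
        print(e)
        return None


req = build_request(2)
print(req.full_url)
```

Returning `None` lets the caller decide whether to skip a failed page or stop, which is convenient when looping over several pages.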
