Python: A Multithreaded Crawler Example for Qiushibaike

import json
import queue
import threading

import requests
from lxml import etree

# Fetches HTML pages
class GetHtml(threading.Thread):
    def __init__(self, page_queue):
        threading.Thread.__init__(self)
        self.page_queue = page_queue

    def run(self):
        self.do_get_html()

    def do_get_html(self):
        headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
        while True:
            # get_nowait() avoids the race between empty() and get():
            # another thread may take the last page in between and leave this one blocked
            try:
                page = self.page_queue.get_nowait()
            except queue.Empty:
                break
            url = "https://www.qiushibaike.com/8hr/page/%s/" % page
            retries = 5
            while retries > 0:
                try:
                    # The 10-second timeout is an added safeguard so a stalled request cannot hang the thread
                    _response = requests.get(url, headers=headers, timeout=10)
                    html = _response.content
                    # Save to the queue of pages waiting to be parsed
                    data_queue.put(html)
                    break
                except requests.exceptions.RequestException as e:
                    print(e)
                retries -= 1
            if retries <= 0:
                print("request failed, url: " + url)

# Parses the fetched HTML
class ParseHtml(threading.Thread):
    # Shared by all parser threads; guards the counter and the output file
    lock = threading.Lock()

    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        self.do_parse_data()

    def do_parse_data(self):
        global total
        while True:
            # get_nowait() avoids blocking forever if another thread empties the queue first
            try:
                html = data_queue.get_nowait()
            except queue.Empty:
                break
            try:
                text = etree.HTML(html)
                list_node = text.xpath("//li[contains(@id, 'qiushi_tag_')]")
                for node in list_node:
                    username = node.xpath(".//a[@class='recmd-user']/img/@alt")[0]
                    user_img = node.xpath(".//a[@class='recmd-user']/img/@src")[0]
                    zan_num = node.xpath(".//div[@class='recmd-num']/span[position()=1]/text()")[0]
                    ping_num = node.xpath(".//div[@class='recmd-num']/span[position()=4]/text()")
                    content = node.xpath(".//a[@class='recmd-content']/text()")
                    ping_num = ping_num[0] if ping_num else 0
                    content = content[0] if content else ""
                    result = {
                        "username": username,
                        "imgUrl": user_img,
                        "vote": zan_num,
                        "comments": ping_num,
                        "content": content
                    }
                    line = (json.dumps(result, ensure_ascii=False) + "\n").encode("utf-8")
                    # Update the counter and write the line atomically with respect to other threads
                    with ParseHtml.lock:
                        total += 1
                        f.write(line)
            except (IndexError, etree.LxmlError) as e:
                # IndexError: a required field is missing on a node
                # LxmlError: the page could not be parsed
                print(e)

def main():
    # Put the page numbers to fetch into the queue
    for i in range(1, 21):
        page_queue.put(i)
    # Start the fetch threads
    get_html_thread = []
    for _ in range(100):
        get_html = GetHtml(page_queue)
        get_html.start()
        get_html_thread.append(get_html)
    # Wait for all fetch threads to finish
    for thread in get_html_thread:
        thread.join()
    # Start the parse threads
    parse_html_thread = []
    for _ in range(100):
        parse_html = ParseHtml()
        parse_html.start()
        parse_html_thread.append(parse_html)
    # Wait for all parse threads to finish
    for thread in parse_html_thread:
        thread.join()
    # Close the file
    f.close()
    print("Finished collecting data, %s records in total" % total)

if __name__ == '__main__':
    data_queue = queue.Queue()
    page_queue = queue.Queue()
    f = open("./qunaerwang.json", "wb")
    total = 0
    main()
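
The listing above runs its two stages one after the other: every fetch thread finishes before any parse thread starts. Below is a minimal sketch of the same queue-based design with both stages running at the same time. It makes no network requests. `fake_fetch` and `fake_parse` are hypothetical stand-ins for the HTTP request and the XPath extraction, and the `None` sentinel used to stop the consumers is an assumption of this sketch, not part of the original code.

```python
import queue
import threading

SENTINEL = None  # hypothetical stop marker, not part of the original listing


def fake_fetch(page):
    # Stand-in for requests.get(...); returns a fake HTML payload.
    return "<html>page %d</html>" % page


def fake_parse(html):
    # Stand-in for the lxml/XPath extraction step.
    return {"content": html}


def producer(page_queue, data_queue):
    # Take page numbers without blocking; queue.Empty means no work is left.
    while True:
        try:
            page = page_queue.get_nowait()
        except queue.Empty:
            return
        data_queue.put(fake_fetch(page))


def consumer(data_queue, results, lock):
    # Block until an item arrives; stop when the sentinel is received.
    while True:
        html = data_queue.get()
        if html is SENTINEL:
            return
        record = fake_parse(html)
        with lock:  # protect the shared results list
            results.append(record)


def run_pipeline(pages, n_producers=4, n_consumers=2):
    page_queue, data_queue = queue.Queue(), queue.Queue()
    for page in pages:
        page_queue.put(page)
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, data_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(data_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    # All producers are done, so send one sentinel per consumer.
    for _ in consumers:
        data_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run_pipeline(range(1, 21))))
```

Because the sentinels are queued only after all producers have joined, every fetched page sits ahead of them in the queue, so no consumer exits while work remains.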

Data:

{"username": "夲少姓〖劉〗", "imgUrl": "//pic.qiushibaike.com/system/avtnew/1187/11878716/thumb/20190520091055.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "马中赤兔人中啥了？"}

{"username": "一枕清霜゛", "imgUrl": "//pic.qiushibaike.com/system/avtnew/3371/33712263/thumb/20190511210156.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "一个段子手，一个神回复"}

{"username": "窝里斗窝里", "imgUrl": "//pic.qiushibaike.com/system/avtnew/1427/14275616/thumb/20181228173532.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "鹰科猛禽走路的姿势看上去总是屌屌的！！"}

{"username": "2丫头还是个宝宝", "imgUrl": "//pic.qiushibaike.com/system/avtnew/2219/22190863/thumb/20190131225946.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "都说孩子玩沙子有助于孩子的智力发育，所以家里买了一车沙子放院子给逗逗玩。逗逗拿了一个铲子和一个望远镜玩具，当着我的面把望远镜在埋沙子里。拉着我的手:妈妈，我在沙"}

{"username": "★像风一样一样★", "imgUrl": "//pic.qiushibaike.com/system/avtnew/2716/27163432/thumb/20180306191622.JPEG?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "去朋友家看到的，特殊的插排，一个插排才多少钱啊？"}

{"username": "无语滴滴", "imgUrl": "//pic.qiushibaike.com/system/avtnew/3782/37821797/thumb/20190430173233.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "朋友和他女友吵架闹分手，我们都去劝。他女友抹抹眼泪看着窗外说了一句话：“要不是还有几个快递在路上，我真想死了算了。”"}

{"username": "愚人愚之不如愚己", "imgUrl": "//pic.qiushibaike.com/system/avtnew/1927/19270659/thumb/20160618154530.jpg?imageView2/1/w/50/h/50", "vote": "", "comments": "", "content": "老司机你发了多大的誓？"}

The remaining data is omitted...

Ideas for next steps:

　　1. Save the data to a database

　　2. Save the data to Redis first, then sync it to the database
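
For the first idea, here is a minimal sketch using the standard-library `sqlite3` module. The table name `jokes`, the helper `save_records`, and the choice of SQLite are all assumptions for illustration; the post does not say which database it has in mind. The columns mirror the keys of the JSON records shown above.

```python
import json
import sqlite3


def save_records(conn, json_lines):
    """Insert JSON-lines records into a hypothetical `jokes` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jokes ("
        "username TEXT, imgUrl TEXT, vote TEXT, comments TEXT, content TEXT)"
    )
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append((
            record.get("username", ""),
            record.get("imgUrl", ""),
            str(record.get("vote", "")),
            str(record.get("comments", "")),
            record.get("content", ""),
        ))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO jokes VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for demonstration
    sample = ['{"username": "demo", "imgUrl": "", "vote": "", "comments": "", "content": "hello"}']
    print(save_records(conn, sample))
    conn.close()
```

With the crawler's output file, the same helper could be fed an open text-mode file handle, since iterating over a file yields one line at a time.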
