Multithreaded Qiushibaike Example

For the requirements, see the previous single-process Qiushibaike example: http://www.cnblogs.com/miqi1992/p/8081929.html

Queue (the queue object)

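Queue is part of the Python standard library and can be used directly with import Queue (this is the Python 2 name; Python 3 renamed the module to queue). A queue is the most common way to exchange data between threads.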

Thoughts on multithreading in Python

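Locking is an important step whenever threads share a resource, because Python's built-in list, dict, and similar types are not thread safe: compound operations on them, such as check-then-act or read-modify-write, need an explicit lock. Queue is thread safe, so whenever it fits the use case, a queue is the recommended choice.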
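As a minimal sketch of this point (not part of the original example): several threads updating a shared counter need a Lock around each read-modify-write, while handing items over through a Queue needs no explicit lock. The names add_with_lock, add_to_queue, and run_threads, as well as the thread and iteration counts, are hypothetical. The import fallback is only there so the sketch runs on both Python 2 and Python 3.

```python
import threading

try:
    from Queue import Queue  # Python 2 module name, as used in this article
except ImportError:
    from queue import Queue  # Python 3 renamed the module

counter = 0
counter_lock = threading.Lock()
results = Queue()


def add_with_lock(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1


def add_to_queue(n):
    """Put n items on the shared queue; Queue does its own locking internally."""
    for i in range(n):
        results.put(i)


def run_threads(target, n_threads, n):
    """Run target(n) in n_threads threads and wait for all of them (hypothetical helper)."""
    threads = [threading.Thread(target=target, args=(n,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run_threads(add_with_lock, 4, 1000)
run_threads(add_to_queue, 4, 1000)
print("counter = %d, queued items = %d" % (counter, results.qsize()))
```

Both numbers should come out as 4000, since every update is either guarded by the lock or goes through the queue.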

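Initialization: class Queue.Queue(maxsize) creates a FIFO (first in, first out) queue.
Commonly used methods:
- Queue.qsize() returns the approximate size of the queue
- Queue.empty() returns True if the queue is empty, otherwise False
- Queue.full() returns True if the queue is full, otherwise False
- Queue.full() is judged against maxsize
- Queue.get([block[, timeout]]) removes and returns an item from the queue; timeout is how long to wait
Create a queue object
- import Queue
- myqueue = Queue.Queue(maxsize=10)
Put a value into the queue
- myqueue.put(10)
Take a value out of the queue
- myqueue.get()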
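The methods listed above can be tried out in a few lines. This is a small sketch rather than part of the original example: the alias queue_module, the maxsize of 2, and the 0.1 second timeout are arbitrary choices for illustration, and the import fallback lets it run on Python 2 or Python 3.

```python
try:
    import Queue as queue_module  # Python 2 module name
except ImportError:
    import queue as queue_module  # Python 3 module name

myqueue = queue_module.Queue(maxsize=2)  # FIFO queue holding at most 2 items

myqueue.put("a")
myqueue.put("b")
size_when_full = myqueue.qsize()
is_full = myqueue.full()

# A non-blocking put on a full queue raises Full instead of waiting.
try:
    myqueue.put("c", False)
    put_rejected = False
except queue_module.Full:
    put_rejected = True

first = myqueue.get()   # "a": first in, first out
second = myqueue.get()  # "b"
is_empty = myqueue.empty()

# A blocking get with a timeout raises Empty once the wait expires.
try:
    myqueue.get(True, 0.1)
    get_timed_out = False
except queue_module.Empty:
    get_timed_out = True

print("first=%s second=%s empty=%s" % (first, second, is_empty))
```

Note that full() only ever returns True when maxsize is positive; a queue created without maxsize is unbounded.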

Multithreading diagram

# -*- coding: utf-8 -*-
# Written for Python 2, where the queue module is named Queue.

import json
import threading
import time

import requests
from lxml import etree
from Queue import Queue, Empty


class Thread_crawl(threading.Thread):
    """Crawler thread: downloads pages and puts their HTML on data_queue."""

    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print("Starting: " + self.threadID)
        self.qiushi_spider()
        print("Exiting: " + self.threadID)

    def qiushi_spider(self):
        while True:
            try:
                # Non-blocking get: checking empty() and then calling a blocking
                # get() could hang if another thread takes the last page in between.
                page = self.q.get(False)
            except Empty:
                break
            print("qiushi_spider=%s page=%s" % (self.threadID, page))
            url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + "/"
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                'Accept-Language': 'zh-CN,zh;q=0.8'
            }
            # Give up after several failed attempts to avoid looping forever.
            for attempt in range(4):
                try:
                    content = requests.get(url, headers=headers)
                    data_queue.put(content.text)
                    break
                except Exception as e:
                    print("qiushi_spider %s" % e)
            else:
                # The loop ended without break, so every attempt failed.
                print("timeout " + url)


class Thread_Parser(threading.Thread):
    """Parser thread: takes page HTML off the queue and writes the parsed items."""

    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print("Starting " + self.threadID)
        while not exitFlag_Parser:
            try:
                # get() removes and returns the item at the head of the queue.
                # With block=True (the default), an empty queue suspends the
                # calling thread until an item is available; with block=False,
                # an empty queue raises Empty immediately. Here the call blocks
                # for at most one second, so the loop can re-check the exit flag
                # without spinning.
                item = self.queue.get(True, 1)
            except Empty:
                continue
            self.parse_data(item)
            self.queue.task_done()
            print("Thread_Parser=%s total=%s" % (self.threadID, total))
        print("Exiting " + self.threadID)

    def parse_data(self, item):
        """
        Parse one page.

        :param item: HTML content of the page
        :return: None
        """
        global total
        try:
            html = etree.HTML(item)
            sites = html.xpath('//div[contains(@id,"qiushi_tag")]')
            for site in sites:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        # Number of votes
                        vote = site.xpath('.//i')[0].text
                        # Number of comments
                        comments = site.xpath('.//i')[1].text
                    except IndexError:
                        # Vote or comment count is missing; leave it as None.
                        pass
                    record = {
                        'imageUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments
                    }
                    # The output file is shared by all parser threads.
                    with self.lock:
                        self.f.write(json.dumps(record, ensure_ascii=False).encode('utf-8') + '\n')
                except Exception as e:
                    print("site in result %s" % e)
        except Exception as e:
            print("parse_data %s" % e)
        with self.lock:
            total += 1


data_queue = Queue()
exitFlag_Parser = False
lock = threading.Lock()
total = 0


def main():
    global exitFlag_Parser
    output = open('qiushibaike.json', 'a')

    # Fill the page queue with page numbers 1 to 10.
    pageQueue = Queue(10)
    for page in range(1, 11):
        pageQueue.put(page)

    # Start the crawler threads.
    crawlthreads = []
    crawllist = ["crawl-1", "crawl-2", "crawl-3"]
    for threadID in crawllist:
        thread = Thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # Start the parser threads.
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    for threadID in parserList:
        thread = Thread_Parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # Wait until the page queue has been drained, sleeping instead of spinning.
    while not pageQueue.empty():
        time.sleep(0.1)

    # Wait for all crawler threads to finish.
    for t in crawlthreads:
        t.join()

    # Wait until task_done() has been called for every downloaded page.
    data_queue.join()

    # Tell the parser threads to exit.
    exitFlag_Parser = True
    for t in parserthreads:
        t.join()
    print('Exiting Main Thread')
    with lock:
        output.close()


if __name__ == '__main__':
    main()
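The wait-then-signal shutdown used in main() can be tried without any network access. The sketch below is a stand-in under stated assumptions, not the article's code: fake_parse, worker, and the 0.1 second timeout are hypothetical, and a threading.Event takes the place of the global exitFlag_Parser. It shows how task_done() and join() let the main thread know that every queued item has been handled before it asks the workers to stop.

```python
import threading

try:
    from Queue import Queue, Empty  # Python 2 module name
except ImportError:
    from queue import Queue, Empty  # Python 3 module name

page_queue = Queue()
parsed = []
parsed_lock = threading.Lock()
stop_event = threading.Event()


def fake_parse(page):
    """Hypothetical stand-in for downloading and parsing one page."""
    return "page-%d" % page


def worker():
    while not stop_event.is_set():
        try:
            page = page_queue.get(True, 0.1)  # block for at most 0.1 s
        except Empty:
            continue  # nothing to do yet; re-check the stop flag
        try:
            with parsed_lock:
                parsed.append(fake_parse(page))
        finally:
            page_queue.task_done()  # always mark the item as handled


for page in range(1, 11):
    page_queue.put(page)

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()

page_queue.join()  # returns once task_done() has been called for every item
stop_event.set()   # only now tell the idle workers to exit
for t in workers:
    t.join()

print("parsed %d pages" % len(parsed))
```

Because join() waits on task_done() rather than on empty(), an item that has been taken off the queue but is still being processed is not mistaken for finished work.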
