Crawler: 古诗文网 (queues, multithreading, locks, regex, XPath)

import csv
import re
import threading
from queue import Empty, Queue

import requests
from lxml import etree


class Producer(threading.Thread):
    """Fetches list pages and puts (title, content, href) tuples on poem_queue."""

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'}

    def __init__(self, page_queue, poem_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.poem_queue = poem_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and a blocking get():
                # another thread could take the last URL between the two calls.
                url = self.page_queue.get_nowait()
            except Empty:
                break
            try:
                self.parse_html(url)
            except requests.RequestException as exc:
                # One failed page should not kill the whole thread.
                print('failed to fetch %s: %s' % (url, exc))

    def parse_html(self, url):
        response = requests.get(url, headers=self.headers, timeout=10)
        response.raise_for_status()
        html_element = etree.HTML(response.text)
        titles = html_element.xpath('//div[@class="cont"]//b/text()')
        contents = html_element.xpath('//div[@class="contson"]')
        hrefs = html_element.xpath('//div[@class="cont"]/p[1]/a/@href')
        # zip() stops at the shortest list, so a missing title or link
        # cannot raise IndexError.
        for title, content, href in zip(titles, contents, hrefs):
            content = etree.tostring(content, encoding='utf-8').decode('utf-8')
            # Strip tags, newlines and the full-width indentation spaces.
            content = re.sub(r'<.*?>|\n', '', content)
            content = content.replace('\u3000\u3000', '')
            self.poem_queue.put((title, content.strip(), href))


class Consumer(threading.Thread):
    """Takes poems off poem_queue and writes them to the shared CSV writer."""

    def __init__(self, poem_queue, writer, gLock, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.writer = writer
        self.poem_queue = poem_queue
        self.lock = gLock

    def run(self):
        while True:
            try:
                title, content, href = self.poem_queue.get(timeout=20)
            except Empty:
                # Nothing arrived for 20 seconds: assume the producers are done.
                break
            # The writer is shared between threads, so serialize the writes.
            with self.lock:
                self.writer.writerow((title, content, href))


def main():
    page_queue = Queue(100)
    poem_queue = Queue(500)
    gLock = threading.Lock()

    # 'w' rather than 'a', so the header row is not repeated on every run.
    with open('poem.csv', 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(('title', 'content', 'href'))

        for x in range(1, 100):
            url = 'https://www.gushiwen.org/shiwen/default.aspx?page=%d&type=0&id=0' % x
            page_queue.put(url)

        threads = [Producer(page_queue, poem_queue) for _ in range(5)]
        threads += [Consumer(poem_queue, writer, gLock) for _ in range(5)]
        for t in threads:
            t.start()
        # Wait for every thread so the file is closed only after the last row.
        for t in threads:
            t.join()


if __name__ == '__main__':
    main()
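
The consumers above decide they are finished when the queue stays empty for 20 seconds. That is simple, but it can stop early if a page is slow, and it always adds a wait at the end. A common alternative is to send a sentinel value once the producers are done. The sketch below shows the idea on its own, with no network access. `SENTINEL`, `consume` and `run_pipeline` are hypothetical names chosen for illustration and are not part of the crawler.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker meaning "no more items"


def consume(item_queue, results, lock):
    """Drain item_queue until the sentinel arrives, appending items to results."""
    while True:
        item = item_queue.get()
        if item is SENTINEL:
            break
        # results is shared between threads, so guard it with the lock.
        with lock:
            results.append(item)


def run_pipeline(items, n_consumers=3):
    """Push items through a bounded queue to n_consumers threads."""
    item_queue = Queue(maxsize=10)
    results = []
    lock = threading.Lock()
    consumers = [
        threading.Thread(target=consume, args=(item_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for t in consumers:
        t.start()
    for item in items:
        item_queue.put(item)
    # One sentinel per consumer, so every thread wakes up and exits.
    for _ in consumers:
        item_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return results


if __name__ == '__main__':
    poems = [('title %d' % i, 'content', '/href/%d' % i) for i in range(25)]
    print(len(run_pipeline(poems)))
```

In the crawler this would mean joining the Producer threads first, then putting one sentinel per Consumer on `poem_queue`. The order of the collected items is not guaranteed, since several threads append to the list.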

Run results
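
One way to check the output is to read the CSV back and look at the first few rows. This is a minimal sketch that assumes the file has the same `title`, `content`, `href` header the script writes. `load_poems` and `preview` are hypothetical helper names, not part of the crawler.

```python
import csv


def load_poems(path):
    """Return the rows of a title/content/href CSV as a list of dicts."""
    with open(path, newline='', encoding='utf-8') as fp:
        return list(csv.DictReader(fp))


def preview(rows, limit=3, width=30):
    """Format the first few rows as short one-line summaries."""
    return [
        '%s | %s | %s' % (row['title'], row['content'][:width], row['href'])
        for row in rows[:limit]
    ]


if __name__ == '__main__':
    for line in preview(load_poems('poem.csv')):
        print(line)
```

Because the consumers write from several threads, the row order in the file will generally not match the page order on the site.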
