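Preface

With multithreading turned on, the crawler feels like it has the wind at its back.

Results

It crawled a little over 90,000 text records in a fairly short time; a few thousand records go by within moments.

Key point

When multiple threads modify the same global variable, the update must be protected by a lock.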

          global htmlId
          # Acquire the lock for thread synchronization; it is released when the block exits
          with threadLock:
               htmlId=htmlId+1
               self.htmlId=htmlId
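To see why the lock matters, here is a small self-contained sketch. It is not part of the crawler, and the names `counter`, `counter_lock`, `bump` and `run_demo` are made up for illustration. Several threads each increment a shared counter; because every read-modify-write happens while the lock is held, no update should be lost.

```python
import threading

counter = 0                      # shared global, stands in for htmlId
counter_lock = threading.Lock()  # hypothetical name, plays the role of threadLock


def bump(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        with counter_lock:       # acquired on entry, released on exit (even on error)
            counter = counter + 1


def run_demo(num_threads=8, times=10000):
    """Reset the counter, run the worker threads, and return the final value."""
    global counter
    counter = 0
    workers = [threading.Thread(target=bump, args=(times,))
               for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter


if __name__ == "__main__":
    print(run_demo())
```

With the lock in place the final value should equal `num_threads * times`. Without it, two threads can read the same old value and one of the increments is overwritten.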

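Improvements

If a page fails to download for whatever reason, the crawler does not record why, so it is hard to go back and re-crawl the ones that slipped through. Also, because of the site's anti-crawler measures, the site stopped responding after roughly every 5,000 pages crawled; a wait should be inserted at that point.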
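Both improvements can be sketched in a few lines. This is only an illustration under assumptions: `fetch_with_retry`, `failed_ids`, `failed_lock` and the retry and wait values are hypothetical, not something the crawler below already has. The fetch function is passed in as a parameter so the sketch can be tried without touching the network.

```python
import threading
import time

failed_ids = []                  # hypothetical: ids that could not be fetched
failed_lock = threading.Lock()   # protects failed_ids across threads


def fetch_with_retry(page_id, fetch, retries=3, wait=5.0):
    """Call fetch(page_id); on failure sleep `wait` seconds and try again.

    Returns the fetched text, or None after `retries` failed attempts,
    in which case page_id is recorded in failed_ids for a later re-crawl.
    """
    for attempt in range(retries):
        try:
            return fetch(page_id)
        except Exception:
            # The site may be throttling us: pause before the next attempt.
            if attempt < retries - 1:
                time.sleep(wait)
    with failed_lock:
        failed_ids.append(page_id)
    return None
```

In the crawler, `fetch` could be a small function wrapping `requests.get(...)`, and the contents of `failed_ids` could be written to a file once all threads have finished, so the missed pages can be crawled again later.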

Code

# coding:UTF-8
# https://******.com/article/91510.html
from bs4 import BeautifulSoup
import requests
import threading
import time

# Shared counter: id of the most recently claimed article page
htmlId=70570

class Spider(threading.Thread):
     def __init__(self, threadID, name):
          threading.Thread.__init__(self)
          self.threadID = threadID
          self.name = name
          '''
          logging.basicConfig(level=logging.DEBUG,  # log level
                    filename='new.log',
                    filemode='a',  # 'w' rewrites the log on every run; 'a' appends (the default)
                    format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s'
                    # log format
                    )
          '''
          self.targetURL='https://******.com/article/'
          self.header={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
          # Read the shared counter while holding the lock
          global htmlId
          with threadLock:
               self.htmlId=htmlId

     def nextURL(self):
          global htmlId
          # Claim the next page id while holding the lock (thread synchronization)
          with threadLock:
               htmlId=htmlId+1
               self.htmlId=htmlId
          url=self.targetURL+str(self.htmlId)+'.html'
          return url

     def getHtml(self,url):
          try:
               html=requests.get(url,headers=self.header,timeout=10).text
               return html
          except requests.RequestException:
               #logging.warning('warning: id= '+str(self.htmlId)+' can\'t connect!')
               # Return an empty page so that parserHtml falls back to the id
               return ''

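     def parserHtml(self,html):
          try:
               # Name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed
               bf = BeautifulSoup(html, 'html.parser')
               title=bf.find_all('title')[0].text
               content=bf.find_all('div', class_ = 'img-center')[0].text
               return title,content
          except Exception:
               #logging.warning('warning: id= '+str(self.htmlId)+' can\'t parse!')
               # Fall back to the id for both title and content when parsing fails
               return str(self.htmlId),str(self.htmlId)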

     def saveContent(self,title,content):
          try:
               # The ./fiction/ directory must already exist
               with open("./fiction/"+title+'.html','a',encoding='utf-8') as f:
                    content=title+'\r\n'+content
                    f.write(content)
               #logging.info('success: id= '+str(self.htmlId)+' '+title)
          except Exception:
               #logging.warning('warning: id= '+str(self.htmlId)+' failed to save content!')
               self.relax()

     def finishCondition(self):
          # Stop once the last article id has been reached
          return self.htmlId>=91510

     def relax(self):
          time.sleep(5)

     def run(self):
          print("Starting thread: " + self.name)
          while not self.finishCondition():
               url=self.nextURL()
               print(url)
               html=self.getHtml(url)
               title,content=self.parserHtml(html)
               self.saveContent(title,content)

if __name__ == "__main__":
     threadLock = threading.Lock()
     threads = []
     num=51
     # Create the threads and add them to the thread list
     for i in range(1, num+1):
          threads.append(Spider(i, "Thread-"+str(i)))
     # Start the threads
     for t in threads:
          t.start()
     # Wait for all threads to finish
     for t in threads:
          t.join()
     print("Exiting main thread")
     print(htmlId)
     '''
     target = 'https://******.com/article/91510.html'
     req = requests.get(url = target)
     html = req.text
     bf = BeautifulSoup(html, 'html.parser')
     texts = bf.find_all('div', class_ = 'img-center')
     print(texts[0].text)
     '''

