Systematic Study of Python Web Crawlers (5)

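Multithreaded crawlers. In an earlier round of network programming I used multithreaded sockets to let a single server handle connections from several clients. The same idea applies to crawlers: a crawler spends most of its time waiting for network responses, so issuing requests from several threads can greatly improve its efficiency.

A multithreaded Python crawler has three main parts: creating the threads, defining the thread class, and calling the work function inside the thread.

Creating the threads: threads are usually created in a for loop. thread.start() starts a thread, and thread.join() blocks the caller until that thread has finished.

Example code: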

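thread_list=[]
for i in range(1,6):
    #task_list is an assumed name for a user-defined sequence holding one task per thread
    thread=MyThread("thread"+str(i),task_list[i-1])
    thread.start()
    thread_list.append(thread)
#wait for every thread to finish
for thread in thread_list:
    thread.join()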
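
To see what start() and join() do without touching the network, here is a minimal sketch. The function fake_download and its sleep-based delay are stand-ins made up for this illustration; a real crawler would issue a request instead.

```python
import threading
import time

def fake_download(name, delay, results):
    """Hypothetical stand-in for a network request: sleep, then record the name."""
    time.sleep(delay)
    results.append(name)  # list.append is thread-safe in CPython

def run_all(n_threads=5, delay=0.2):
    results = []
    threads = []
    start = time.time()
    for i in range(1, n_threads + 1):
        t = threading.Thread(target=fake_download,
                             args=("thread" + str(i), delay, results))
        t.start()          # begins running fake_download in a new thread
        threads.append(t)
    for t in threads:
        t.join()           # blocks until this thread has finished
    return results, time.time() - start

if __name__ == "__main__":
    names, elapsed = run_all()
    print(sorted(names), "finished in about %.2f s" % elapsed)
```

Because the five simulated downloads wait at the same time, the total time should be close to one delay rather than five.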

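Defining the thread: the thread class is defined through inheritance. It usually contains two methods. One is the __init__ initializer, which is called automatically when an instance is created; the other is run, which is called automatically when thread.start() is executed. Example code: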

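import threading

class MyThread(threading.Thread):
    def __init__(self,name,link_s):
        threading.Thread.__init__(self)
        self.name=name
        self.links=link_s #store the links so that run() can use them
    def run(self):
        print('%s is in Process:'%self.name)
        #spider is the crawl function, called here with this thread's links
        spider(self.name,self.links)
        print('%s is out Process'%self.name)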
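
The following self-contained sketch shows the same subclass pattern end to end. The spider function here is a hypothetical placeholder that only builds strings, and the result attribute is an addition for the demonstration, not part of the code above.

```python
import threading

def spider(thread_name, links):
    """Hypothetical placeholder for the crawl function."""
    return ["%s fetched %s" % (thread_name, link) for link in links]

class MyThread(threading.Thread):
    def __init__(self, name, links):
        super().__init__(name=name)  # let Thread store the name
        self.links = links           # the work assigned to this thread
        self.result = []             # filled in by run()

    def run(self):
        # start() arranges for run() to execute in the new thread
        self.result = spider(self.name, self.links)

if __name__ == "__main__":
    t = MyThread("thread1", ["page1", "page2"])
    t.start()
    t.join()
    print(t.result)
```

Calling t.run() directly would execute the method in the calling thread; only t.start() creates a new one.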

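The work function is called inside run. The key to a multithreaded crawler is tying the threads and the crawl function together, which means dividing the work among the crawlers, that is, deciding what each one should crawl.

First I wrote a small script that writes the URLs of pages 1-300 of the Beike (ke.com) Nanjing rental listings to a.txt: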

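url="https://nj.zu.ke.com/zufang/pg"
#open the file once; 'w' starts from an empty file so reruns do not duplicate links
with open('a.txt','w') as f:
    for i in range(1,301): #pages 1 through 300
        turl=url+str(i)+'\n'
        print(turl)
        f.write(turl)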

Next, the main program reads these links into a list:

import re

link_list=[]
with open('a.txt',"r") as f:
    file_list=f.readlines()
    for i in file_list:
        i=re.sub('\n','',i) #strip the trailing newline
        link_list.append(i)

After that, link_list[i] can be used to give each crawler a different task.
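
One simple way to hand out tasks by index is to split the list up front, so that each worker gets its own slice. This is only a sketch of that idea; split_links is a hypothetical helper, and the URLs are built from the same base address used earlier.

```python
def split_links(link_list, n_workers):
    """Give worker k every n_workers-th link, starting at index k."""
    return [link_list[k::n_workers] for k in range(n_workers)]

if __name__ == "__main__":
    base = "https://nj.zu.ke.com/zufang/pg"
    links = [base + str(i) for i in range(1, 11)]
    for k, chunk in enumerate(split_links(links, 3)):
        print("worker", k, "gets", len(chunk), "links")
```

A fixed split is easy to reason about, but a slow worker cannot pass its leftover pages to an idle one. Letting the threads claim pages one at a time from a shared position avoids that.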

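import threading

max_page=len(link_list) #max_page is the total number of pages
page=0 #page is the index of the next page to crawl
num=0 #num counts the listings found so far
lock=threading.Lock() #protects page and num, which every thread shares

def spider(threadName, link_range):
    #link_range is not used here; pages are claimed from the shared link_list
    global page
    global num
    while True:
        #claim the next page under the lock so no two threads get the same one
        with lock:
            if page>=max_page:
                break
            i = page
            page+=1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                with lock:
                    num += 1
                    count = num
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info ="page:"+str(i)+"num:" + str(count) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)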
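
Since page is shared by every thread, reading it and incrementing it has to happen as a single step, otherwise two threads can claim the same page. The pattern in isolation, with no network access, might look like this. PageCounter and worker are hypothetical names used only for this sketch.

```python
import threading

class PageCounter:
    """Hypothetical helper: hands out each page index exactly once."""
    def __init__(self, total):
        self.total = total
        self.next_page = 0
        self.lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed index, or None when all are taken."""
        with self.lock:  # read-and-increment as one atomic step
            if self.next_page >= self.total:
                return None
            i = self.next_page
            self.next_page += 1
            return i

def worker(counter, claimed):
    while True:
        i = counter.claim()
        if i is None:
            break
        claimed.append(i)  # a real crawler would fetch link_list[i] here

def run(total=300, n_threads=5):
    counter = PageCounter(total)
    claimed = []
    threads = [threading.Thread(target=worker, args=(counter, claimed))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed

if __name__ == "__main__":
    pages = run()
    print(len(pages), "pages claimed,", len(set(pages)), "distinct")
```

The bound check sits inside the lock as well, so a thread can never claim an index past the end of the list.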

With this in place, the threads fetch the data concurrently. The complete code:

#coding=utf-8
import re
import requests
import threading
import time
from bs4 import BeautifulSoup

#read the links that were written to a.txt earlier
link_list=[]
with open('a.txt',"r") as f:
    file_list=f.readlines()
    for i in file_list:
        i=re.sub('\n','',i)
        link_list.append(i)

max_page=len(link_list) #max_page is the total number of pages
page=0 #page is the index of the next page to crawl
num=0 #num counts the listings found so far
lock=threading.Lock() #protects page and num, which every thread shares
print(max_page)

class MyThread(threading.Thread):
    def __init__(self,name):
        threading.Thread.__init__(self)
        self.name=name
    def run(self):
        print('%s is in Process:'%self.name)
        spider(self.name)
        print('%s is out Process'%self.name)

def spider(threadName):
    global page
    global num
    while True:
        #claim the next page under the lock so no two threads get the same one
        with lock:
            if page>=max_page:
                break
            i = page
            page+=1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                with lock:
                    num += 1
                    count = num
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info ="page:"+str(i)+"num:" + str(count) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)

start = time.time()
thread_list=[]
for i in range(1,6):
    thread=MyThread("thread"+str(i))
    thread.start()
    thread_list.append(thread)
#wait for every thread to finish before stopping the clock
for thread in thread_list:
    thread.join()
end=time.time()
print("All using time:",end-start)

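A multithreaded crawler can also be combined with a queue to build a "full-speed" crawler, in which every thread keeps taking work until none is left, which is somewhat faster. The complete code: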

#coding:utf-8
import threading
import time
import re
import requests
import queue as Queue

link_list=[]
with open('a.txt','r') as f:
   file_list=f.readlines()
   for each in file_list:
      each=re.sub('\n','',each)
      link_list.append(each)

class MyThread(threading.Thread):
   def __init__(self,name,q):
      threading.Thread.__init__(self)
      self.name=name
      self.q=q
   def run(self):
      print("%s is start "%self.name)
      crawel(self.name,self.q)
      print("%s is end "%self.name)

def crawel(threadname,q):
   while not q.empty():
      try:
         #another thread may take the last URL between empty() and get(),
         #so a get() that times out simply ends this worker
         temp_url=q.get(timeout=1)
      except Queue.Empty:
         break
      try:
         r=requests.get(temp_url,timeout=20)
         print(threadname,r.status_code,temp_url)
      except Exception as e:
         print("Error",e)

if __name__=='__main__':
   start=time.time()
   thread_list=[]
   thread_Name=['Thread-1','Thread-2','Thread-3','Thread-4','Thread-5']
   #maxsize is 1000; put() would block if more links than that were queued before any worker starts
   workQueue=Queue.Queue(1000)
   #fill the queue
   for url in link_list:
      workQueue.put(url)
   #create the threads
   for tname in thread_Name:
      thread=MyThread(tname,workQueue)
      thread.start()
      thread_list.append(thread)
   for t in thread_list:
      t.join()
   end=time.time()
   print("All using time:",end-start)
   print("Exiting Main Thread")

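Crawling with a queue requires the queue module. On top of what we know about threads, we also need a few queue basics. The code above relies on three queue operations: creating and filling the queue, passing the queue to the threads, and consuming the queue until it is empty:

1. Creating and filling the queue:

workQueue=Queue.Queue(1000)
#fill the queue
for url in link_list:
   workQueue.put(url)

2. Passing the queue to each thread:

thread=MyThread(tname,workQueue)

3. Consuming the queue until it is empty:

def crawel(threadname,q):
   while not q.empty():
      pass #take one URL from q and fetch it here

The idea behind the queue is first in, first out: once every item has been taken out, the work is done.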
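
The first-in, first-out behavior and the "take until empty" loop can be tried out without any network access. In this sketch, drain and run are hypothetical names, and get_nowait() is used in place of a blocking get() so that a worker stops as soon as the queue is empty.

```python
import queue
import threading

def drain(name, q, seen):
    """Take items until the queue is empty; get_nowait never blocks."""
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:  # raised when nothing is left
            break
        seen.append((name, item))

def run(items, n_threads=3):
    q = queue.Queue()
    for item in items:
        q.put(item)
    seen = []
    threads = [threading.Thread(target=drain,
                                args=("Thread-%d" % (k + 1), q, seen))
               for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    # FIFO: a single consumer gets items back in insertion order
    q = queue.Queue()
    for url in ["pg1", "pg2", "pg3"]:
        q.put(url)
    print([q.get() for _ in range(3)])  # ['pg1', 'pg2', 'pg3']
    print(len(run(["pg%d" % i for i in range(1, 21)])), "items processed")
```

Catching queue.Empty covers the case where another thread takes the last item between a check and the get, which a separate empty() test cannot rule out.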

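Multiprocess crawlers: in general there are two ways to build one, multiprocessing with Process, and Pool + Queue.

multiprocessing is used in much the same way as threading; only a few parts of the code need to be replaced, namely how the process is defined and initialized, and how the process is ended.

1. Defining and initializing the process:

from multiprocessing import Process

class Myprocess(Process):
    def __init__(self):
        Process.__init__(self)

2. Ending child processes together with the parent: with this flag set, the child process is terminated automatically when the parent process exits. It has to be set before p.start() is called.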

p.daemon=True
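
A minimal runnable sketch of a Process subclass with the daemon flag set. The class name MyProcess, the out queue, and the message text are assumptions made for this example; no crawling is done.

```python
from multiprocessing import Process, Queue

class MyProcess(Process):
    def __init__(self, name, out):
        super().__init__(name=name)
        self.out = out  # multiprocessing.Queue used to send results back

    def run(self):
        # runs in the child process after start()
        self.out.put(self.name + " done")

if __name__ == "__main__":
    out = Queue()
    procs = []
    for i in range(1, 4):
        p = MyProcess("process" + str(i), out)
        p.daemon = True  # must be set before start()
        p.start()
        procs.append(p)
    messages = sorted(out.get() for _ in procs)  # read results before join()
    for p in procs:
        p.join()
    print(messages)
```

The `if __name__ == "__main__":` guard matters here: on platforms that start child processes by re-importing the main module, unguarded process creation would run again in every child.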

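The other approach combines Manager with Pool:

from multiprocessing import Manager, Pool

#crawler(queue, index) is the worker function; its definition is not shown here
manager=Manager()
workQueue=manager.Queue(1000)
for url in link_list:
    workQueue.put(url)
pool=Pool(processes=3)
#four tasks are submitted to three worker processes
for i in range(1,5):
    pool.apply_async(crawler,args=(workQueue,i))
pool.close() #no more tasks will be submitted
pool.join() #wait for the workers to finish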
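
Since the crawler function is not shown, here is one possible self-contained version of the Manager and Pool combination. The body of crawler, the run_pool helper, and the returned tuples are assumptions for illustration; the worker only records URLs instead of requesting them.

```python
from multiprocessing import Manager, Pool
import queue

def crawler(q, index):
    """Hypothetical worker: drains URLs from q and returns what it handled."""
    handled = []
    while True:
        try:
            url = q.get_nowait()
        except queue.Empty:
            break
        handled.append(url)  # a real crawler would call requests.get(url) here
    return index, handled

def run_pool(link_list, n_procs=3):
    with Manager() as manager:
        work_queue = manager.Queue()  # a proxy that can be passed to pool workers
        for url in link_list:
            work_queue.put(url)
        with Pool(processes=n_procs) as pool:
            pending = [pool.apply_async(crawler, args=(work_queue, i))
                       for i in range(1, n_procs + 1)]
            # get() waits for each task and re-raises any worker exception
            return [p.get() for p in pending]

if __name__ == "__main__":
    links = ["https://nj.zu.ke.com/zufang/pg" + str(i) for i in range(1, 11)]
    for index, handled in run_pool(links):
        print("process", index, "handled", len(handled), "links")
```

Calling get() on each result is worth the extra line: apply_async on its own discards exceptions raised in a worker, so a failing crawler would otherwise go unnoticed.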
