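I. Crawlers

1. Overview

A web crawler is a program that fetches web pages automatically; search engines are the best-known users of crawlers.

2. Types of crawlers

(1) General-purpose crawlers. The typical example is a search engine: it collects data indiscriminately, stores it, extracts keywords, builds an index, and offers users a search interface.

General crawl flow:

Initialize a batch of URLs and put them into the to-crawl queue.

Take URLs from the queue, resolve each host to an IP address through DNS, download the HTML page from that site, save it to a local server, and move the URL to the crawled queue.

Analyze the page content, find the URL links of interest, and repeat from the second step until the crawl ends.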
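
The flow above can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: the PAGES dictionary and the fetch function are hypothetical stand-ins for the DNS lookup and HTTP download, and links are extracted with a simple regular expression rather than a real HTML parser.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> HTML, standing in for real downloads.
PAGES = {
    'http://a.example/': '<a href="http://a.example/1">1</a> <a href="http://a.example/2">2</a>',
    'http://a.example/1': '<a href="http://a.example/2">2</a>',
    'http://a.example/2': '<a href="http://a.example/">home</a>',
}


def fetch(url):
    """Stand-in for the DNS resolution and HTML download step."""
    return PAGES.get(url, '')


def crawl(seeds, fetch_page):
    waiting = deque(seeds)   # to-crawl queue
    crawled = []             # crawled queue, in visit order
    seen = set(seeds)        # every URL ever queued, to avoid repeats
    while waiting:
        url = waiting.popleft()
        html = fetch_page(url)
        crawled.append(url)
        # find the links on the page and queue the ones not seen before
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                waiting.append(link)
    return crawled


order = crawl(['http://a.example/'], fetch)
print(order)
```

Keeping a seen set next to the two queues is what stops the loop from revisiting pages that link back to each other.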

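How a search engine discovers the URL of a new website:

The new site submits itself to the search engine.

Other sites link to it from their pages (external links).

The search engine cooperates with DNS providers to obtain newly listed sites.

(2) Focused crawlers

Programs written to collect data from a specific field or category; they are topic-oriented.

3. The robots protocol

A site publishes a robots.txt file that tells crawlers what may and may not be crawled.

To help search engines index a site's content more efficiently, the protocol also lets the site point to files such as a sitemap.

The paths this file forbids are often exactly the content we might be interested in, so the file ends up revealing those addresses.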
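
As a rough illustration of how a crawler can honor robots.txt, the standard library's urllib.robotparser can evaluate rules. The rules, the crawler name 'my-crawler', and the example.com URLs below are made up for the sketch; a real crawler would load the file from the site with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse the rules without any network access

allowed_home = rp.can_fetch('my-crawler', 'http://example.com/index.html')
blocked_admin = rp.can_fetch('my-crawler', 'http://example.com/admin/login')
print(allowed_home, blocked_admin)
```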

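4. HTTP request and response handling

A crawler accesses web pages over HTTP. Access through a browser is usually a human action, so the problem is how to make a program's requests look like a person's.

The urllib package

from urllib.request import urlopen

# With no data argument, urlopen sends a GET request
response = urlopen('http://www.bing.com')
print(response.closed)        # False: the response is still open

with response:
    print(response.status)    # HTTP status code
    print(response._method)   # request method (private attribute, for inspection only)
    print(response.read())    # response body as bytes
    print(response.closed)
    print(response.info())    # response headers
print(response.closed)        # True: closed on leaving the with block

The example above covers basic use of the urllib package: opening a URL and querying the response.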

Solving the User-Agent problem:

from urllib.request import urlopen, Request

url = 'http://www.bing.com'
ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

# Wrap the URL in a Request so that headers can be attached
req = Request(url, headers={'User-agent': ua})
response = urlopen(req, timeout=10)
# print(req)
print(response.closed)

with response:
    print(response.status)
    print(response._method)
    # print(response.read())
    # print(response.closed)
    # print(response.info())
    print(response.geturl())   # final URL after any redirects
print(req.get_header('User-agent'))
print(response.closed)

The User-Agent string can be copied from the Chrome browser, for example from the request headers shown in the Network panel of the developer tools.

5. parse

from urllib import parse

d = {
    'id': 1,
    'name': 'tom',
    'url': 'http://www.magedu.com'
}

url = 'http://www.magedu.com'
u = parse.urlencode(d)    # URL-encode the dict into a query string
print(u)

print(parse.unquote(u))   # decode the percent-escapes
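
To make the encoding step concrete, here is a small round trip using only the standard library. The variable names params, query, and decoded are chosen for this sketch; parse_qs is used for decoding because it rebuilds the key/value structure, which unquote alone does not.

```python
from urllib import parse

params = {'id': 1, 'name': 'tom', 'url': 'http://www.magedu.com'}

# urlencode percent-escapes reserved characters such as ':' and '/'
query = parse.urlencode(params)
print(query)

# parse_qs reverses the encoding; every value comes back as a list of strings
decoded = parse.parse_qs(query)
print(decoded)
```

Note that the integer 1 comes back as the string '1': a query string carries no type information.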

6. Request methods

from urllib import parse
from urllib.request import urlopen, Request
import simplejson

base_url = 'http://cn.bing.com/search'

d = {
    'q': '马哥教育'
}
# d = {
#     'id': 1,
#     'name': 'tom',
#     'url': 'http://www.magedu.com'
# }

# url = 'http://www.magedu.com'
u = parse.urlencode(d)    # URL-encode the query

# url = '{}?{}'.format(base_url, u)
# print(url)
# print(parse.unquote(url))   # decode

url = 'http://httpbin.org/post'

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

data = parse.urlencode({'name': '张三,@=/&*', 'age': '6'})

req = Request(url, headers={
    'User-agent': ua
})

# res = urlopen(req)

# Passing data turns the request into a POST; the body must be bytes
with urlopen(req, data=data.encode()) as res:
    text = res.read()
    d = simplejson.loads(text)   # httpbin echoes the request back as JSON
    print(d)
    # with open('c:/assets/bing.html', 'wb+') as f:
    #     f.write(res.read())
    #     f.flush()

7. Crawling Douban

from urllib.request import Request, urlopen
from urllib import parse
import simplejson

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

# JSON interface behind the movie listing page
jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

req = Request('{}?{}'.format(jurl, parse.urlencode(d)), headers={
    'User-agent': ua
})

with urlopen(req) as res:
    sub = simplejson.loads(res.read())
    print(len(sub))
    print(sub)

8. Handling HTTPS and CA certificate problems

Ignore certificate verification with the ssl module.

from urllib.request import Request, urlopen
from urllib import parse
import ssl

# request = Request('http://www.12306.cn/mormhweb')
request = Request('http://www.baidu.com')
request.add_header(
    'User-agent',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
)

# Private helper that builds a context which skips certificate verification
context = ssl._create_unverified_context()

with urlopen(request, context=context) as res:
    print(res._method)
    print(res.read())

9. urllib3

pip install urllib3

import urllib3

url = 'http://movie.douban.com'

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

with urllib3.PoolManager() as http:   # connection pool manager
    # the first argument selects the request method
    response = http.request('GET', url, headers={'User-agent': ua})
    print(1, response)
    print(2, type(response))
    print(3, response.status, response.reason)
    print(4, response.headers)
    print(5, response.data)

The same pool manager can request the Douban JSON interface with an encoded query string:

import urllib3
from urllib.parse import urlencode

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}

with urllib3.PoolManager() as http:
    response = http.request('GET', '{}?{}'.format(jurl, urlencode(d)), headers={'User-agent': ua})
    print(response)
    print(response.status)
    print(response.data)

10. The requests library

requests is built on urllib3.

pip install requests

from urllib.parse import urlencode
import requests

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

jurl = 'https://movie.douban.com/j/search_subjects'

d = {
    'type': 'movie',
    'tag': '热门',
    'page_limit': 10,
    'page_start': 10
}
url = '{}?{}'.format(jurl, urlencode(d))

response = requests.request('GET', url, headers={'User-agent': ua})

with response:
    print(response.text)
    print(response.status_code)
    print(response.url)
    print(response.headers)
    print(response.request)

Using a session:

A Session manages request headers and related information such as cookies automatically across requests.

import requests

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

urls = ['https://www.baidu.com/s?wd=magedu', 'https://www.baidu.com/s?wd=magedu']

session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-agent': ua})
        with response:
            print(1, response.text)
            print(2, response.status_code)
            print(3, response.url)
            print(4, response.headers)
            print(5, response.request.headers)   # headers actually sent
            print('--------')
            print(response.cookies)
            print('--------------')
            print(response.cookies)

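11. Points to note

Some sites check cookies at login: you must send back the cookie received on the first visit, and after a successful login the site returns a new one; without it, later operations fail. Sometimes sending only a few cookie values is enough.

Anti-crawling measure: the server checks whether the user's previous request was to its own site.

In the browser's Network panel, the Referer header shows which page of the site was visited previously.

files: the content of uploaded files.

auth: some devices, such as routers, put the encoded user name and password in a request header.

cert: the client certificate.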
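
A minimal sketch of answering the Referer check described above: the request carries a Referer header naming the page that would have been visited previously. The build_request helper and the example.com URLs are hypothetical; only the standard library is used, and nothing is sent over the network.

```python
from urllib.request import Request


def build_request(url, referer, ua):
    """Build a request that claims to come from the given previous page."""
    return Request(url, headers={'User-Agent': ua, 'Referer': referer})


req = build_request(
    'https://example.com/list?page=2',    # page we want to fetch
    'https://example.com/list?page=1',    # page we claim to come from
    'Mozilla/5.0',
)
print(req.get_header('Referer'))
```

With requests, the same header can be passed in the headers dictionary of get() or post().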

Basic requests features:

import requests

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 '
      'Core/1.63.5514.400 QQBrowser/10.1.1660.400')
url = 'https://dig.chouti.com/login'

# Placeholder credentials; substitute your own account
data = {
    'phone': '<phone>',
    'password': '<password>',
    'oneMonth': '1'
}

r1_urls = 'https://dig.chouti.com'
# First visit: obtain the initial cookies
r1 = requests.get(url=r1_urls, headers={'User-Agent': ua})
# print(r1.text)
r1_cookie = r1.cookies.get_dict()
print('r1', r1.cookies)

# Log in, sending back the cookies from the first visit
response = requests.post(url, data, headers={'User-Agent': ua}, cookies=r1_cookie)

print(response.text)
print(response.cookies.get_dict())

# Vote, carrying only the 'gpsd' cookie value
r3 = requests.post(url='https://dig.chouti.com/link/vote?linksId=21718341',
                   cookies={'gpsd': r1_cookie.get('gpsd')}, headers={'User-Agent': ua})

print(r3.text)

II. HTML parsing

The libraries above give us the HTML content.

1. XPath

Site: http://www.qutoric.com/xmlquire/

XPath walks the document along paths to locate the content you need.
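
To show path traversal without extra dependencies, the sketch below uses xml.etree.ElementTree from the standard library, which supports only a subset of XPath (lxml supports the full language). The DOC markup is invented for the example and is deliberately well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
DOC = """<html><body>
<div class="billboard-bd"><table>
<tr><td>1</td><td><a href="/m/1">Movie A</a></td></tr>
<tr><td>2</td><td><a href="/m/2">Movie B</a></td></tr>
</table></div>
<div class="other"><table><tr><td><a href="/x">Other</a></td></tr></table></div>
</body></html>"""

root = ET.fromstring(DOC)

# Descend to the div with the wanted class, then to every link inside its rows
links = root.findall(".//div[@class='billboard-bd']//tr//a")
titles = [a.text for a in links]
hrefs = [a.get('href') for a in links]
print(titles, hrefs)
```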

2. The lxml library

A library for parsing HTML.

https://lxml.de/

Installation:

pip install lxml

Crawling the Douban top 10

import requests
from lxml import etree

ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

urls = ['https://movie.douban.com/']

session = requests.Session()
with session:
    for url in urls:
        response = session.get(url, headers={'User-agent': ua})
        with response:
            content = response.text

        html = etree.HTML(content)
        # every table row inside the billboard block
        title = html.xpath("//div[@class='billboard-bd']//tr")
        for t in title:
            txt = t.xpath('.//text()')   # all text nodes under the row
            print(''.join(map(lambda x: x.strip(), txt)))
            # print(t)

3. beautifulsoup4

4. NavigableString

The document tree is traversed depth-first.

soup.find_all()

soup.find_all(id='header')

5. CSS selectors

soup.select() accepts a CSS selector. Regular expressions can be passed to find_all().

pip install jsonpath

A multithreaded crawler built on queues, requests, and BeautifulSoup:

from concurrent.futures import ThreadPoolExecutor
import threading
import time
from queue import Queue
import logging
import requests
from bs4 import BeautifulSoup

event = threading.Event()   # set to tell every worker to stop
url = 'https://news.enblogs.com'
path = '/n/page/'
ua = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

urls = Queue()    # URLs waiting to be crawled
htmls = Queue()   # downloaded pages waiting to be parsed
outps = Queue()   # parsed results waiting to be saved


def create_urls(start, stop, step=1):
    for i in range(start, stop + 1, step):
        url1 = '{}{}{}/'.format(url, path, i)
        urls.put(url1)


def crawler():
    while not event.is_set():
        try:
            url1 = urls.get(True, 1)   # block for at most one second
            # download the page URL taken from the queue
            response = requests.get(url1, headers={'User-agent': ua})
            with response:
                html = response.text
                htmls.put(html)
        except Exception as e:
            print(1, e)


def parse():
    while not event.is_set():
        try:
            html = htmls.get(True, 1)
            soup = BeautifulSoup(html, 'lxml')
            news = soup.select('h2.news_entry a')

            for n in news:
                txt = n.text
                url1 = url + n.attrs.get('href')
                outps.put((txt, url1))
        except Exception as e:
            print(e)


def save(path):
    with open(path, 'a+', encoding='utf-8') as f:
        while not event.is_set():
            try:
                title, url1 = outps.get(True, 1)
                f.write('{}{}\n'.format(title, url1))
                f.flush()
            except Exception as e:
                print(e)


executor = ThreadPoolExecutor(max_workers=10)
executor.submit(create_urls, 1, 10)
executor.submit(parse)
executor.submit(save, 'c:/new.txt')

for i in range(7):
    executor.submit(crawler)

# Type q to stop all workers and exit
while True:
    cmd = input('>>>')
    if cmd.strip() == 'q':
        event.set()
        executor.shutdown()
        print('close')
        time.sleep(1)
        break

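III. Handling dynamic pages

Many sites use Ajax and SPA techniques. Part of the content is loaded asynchronously to improve the user experience.

1. PhantomJS, a headless browser

http://phantomjs.org/

XMLHttpRequest is what the page uses to open connections to the back-end server.

2. selenium

(1) An automated testing tool. It can take screenshots directly and imitate browser behavior.

# Written for Selenium 3: PhantomJS support and the find_element_by_*
# helpers were removed in Selenium 4 releases.
from selenium import webdriver
import datetime
import time
import random

driver = webdriver.PhantomJS('c:/assets/phantomjs-2.1.1-windows/bin/phantomjs.exe')

driver.set_window_size(1024, 1024)
url = 'https://cn.bing.com/search?q=%E9%A9%AC%E5%93%A5%E6%95%99%E8%82%B2'
driver.get(url)


def savedic():
    # save a screenshot named with a timestamp and a random suffix
    try:
        base_dir = 'C:/assets/'
        filename = '{}{:%Y%m%d%H%M%S}{}.png'.format(base_dir, datetime.datetime.now(), random.randint(1, 100))
        driver.save_screenshot(filename)
    except Exception as e:
        print(1, e)

# time.sleep(6)
# print('-------')
# savedic()

MAXRETRIES = 5
while MAXRETRIES:
    try:
        ele = driver.find_element_by_id('b_results')
        print(ele)
        print('===========')
        savedic()
        break
    except Exception as e:
        print(e)
        print(type(e))
    time.sleep(1)
    MAXRETRIES -= 1

The element is looked up with retries because the data is loaded asynchronously.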

(2) For drop-down boxes, use Select.

3. Simulating keyboard input

To imitate a browser login, first find the ids of the login input boxes, then call send_keys.

The logged-in page is returned afterwards.

# Written for Selenium 3 (PhantomJS driver, find_element_by_* helpers)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import random
import datetime

driver = webdriver.PhantomJS('c:/assets/phantomjs-2.1.1-windows/bin/phantomjs.exe')

driver.set_window_size(1024, 1024)

url = 'https://www.oschina.net/home/login?goto_page=https%3A%2F%2Fwww.oschina.net%2F'


def savedic():
    # save a screenshot named with a timestamp and a random suffix
    try:
        base_dir = 'C:/assets/'
        filename = '{}{:%Y%m%d%H%M%S}{}.png'.format(base_dir, datetime.datetime.now(), random.randint(1, 100))
        driver.save_screenshot(filename)
    except Exception as e:
        print(1, e)


driver.get(url)
print(driver.current_url, 111111111111)
savedic()

email = driver.find_element_by_id('userMail')
passwed = driver.find_element_by_id('userPassword')

# Placeholder credentials; substitute your own account
email.send_keys('<email>')
passwed.send_keys('<password>')
savedic()
passwed.send_keys(Keys.ENTER)   # submit the form by pressing Enter

time.sleep(2)
print(driver.current_url, 2222222222)
userinfo = driver.find_element_by_class_name('user-info')
print(userinfo.text)
time.sleep(2)
cookie = driver.get_cookies()
print(cookie)
savedic()

4. Waiting for the page

(1) time.sleep

Data loaded by JavaScript needs some time to arrive.

Put the thread to sleep.

Set a maximum number of attempts.
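
The sleep-and-retry idea can be wrapped in a small helper. This is a generic sketch, not Selenium's API: retry, flaky, and the parameter names are made up here, and flaky only simulates an element that becomes available on the third attempt.

```python
import time


def retry(func, retries=5, delay=0.01, exceptions=(Exception,)):
    """Call func until it succeeds or the attempts run out."""
    last_error = None
    for _ in range(retries):
        try:
            return func()
        except exceptions as e:
            last_error = e
            time.sleep(delay)   # give the page time to load before retrying
    raise last_error


calls = {'n': 0}


def flaky():
    # Simulated lookup: fails twice, then "finds" the element.
    calls['n'] += 1
    if calls['n'] < 3:
        raise LookupError('element not ready')
    return 'b_results'


result = retry(flaky)
print(result, calls['n'])
```

With Selenium, func would be a lambda wrapping the element lookup, and exceptions would be narrowed to the lookup error you expect.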

(2) Waits built into selenium

Explicit wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # wait up to 10 seconds for the element to appear in the DOM
    email = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'userMail'))
    )
    savedic()
finally:
    driver.quit()

Implicit wait

driver.implicitly_wait(10)

Summary:

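IV. The Scrapy framework

1. Installation

pip install scrapy may fail. In that case, download the .whl file whose name starts with "tw" (presumably Twisted) and install it with pip.

2. Usage

scrapy startproject scrapyapp   creates a project

scrapy genspider donz_spider dnoz.org   run inside the project: creates a new module under spiders and adds the site to crawl to the URL list.

scrapy genspider -t basic dbbook douban.com   generated from the basic template, which has little content.

scrapy genspider -t crawl book douban.com   generated from the crawl template, which has more content.

-t is followed by the template, then the spider name and the site.

scrapy crawl donz_spider   runs the spider; if it fails at run time, pip install pypiwin32

from scrapy.http.response.html import HtmlResponse

The response object passed to the spider is an HtmlResponse.

In items.py, define a class for the information to crawl, for example the title.

In the files under spiders, write the crawler's XPath expressions, the queue of URLs to crawl, and the matching of the content to extract.

middlewares.py holds the middlewares.

pipelines.py holds the item-processing functions.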

V. The scrapy-redis component

1. Using scrapy-redis

pip install scrapy_redis

Configuration needed to use redis as the queue:

settings.py

BOT_NAME = 'scrapyapp'

SPIDER_MODULES = ['scrapyapp.spiders']
NEWSPIDER_MODULE = 'scrapyapp.spiders'

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
COOKIES_ENABLED = False

# Let scrapy-redis handle scheduling and duplicate filtering
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {
    'scrapyapp.pipelines.ScrapyappPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 543,   # stores items in redis
}
# redis connection settings
REDIS_HOST = '192.168.118.130'
REDIS_PORT = 6379

# LOG_LEVEL = 'DEBUG'

The spider file under spiders:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from ..items import MovieItem


class MoviecommentSpider(RedisCrawlSpider):
    name = 'moviecomment'
    allowed_domains = ['douban.com']
    # start_urls = ['http://douban.com/']
    redis_key = 'moviecomment1:start_urls'   # redis key that holds the start URLs

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        comment = '//div[@class="comment-item"]//span[@class="short"]/text()'
        reviews = response.xpath(comment).extract()
        for review in reviews:
            item = MovieItem()
            item['comment'] = review.strip()
            yield item

items.py

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    comment = scrapy.Field()

In redis, create the key named by redis_key = 'moviecomment1:start_urls' in the spider file and set its value to the initial URL.

When the crawl finishes, redis holds the corresponding values.

You can add –ra after redis-cli.

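2. Analysis

(1) jieba word segmentation

pip install jieba

(2) Stop words

Data cleaning: wash out dirty data by detecting and removing invalid or irrelevant records, for example null values, illegal values, and duplicates.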
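
A rough sketch of the cleaning step: empty tokens and stop words are dropped before counting. The token list and stop-word set are made-up English stand-ins for jieba output and a real stop-word file, and count_words is a name chosen for this example.

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens, skipping blanks and stop words."""
    cleaned = (t.strip() for t in tokens)
    return Counter(t for t in cleaned if t and t not in stopwords)


# Hypothetical tokens, as a word segmenter might return them
tokens = ['the', 'movie', 'is', 'good', ' ', 'the', 'music', 'is', 'good', '']
stopwords = {'the', 'is'}

words = count_words(tokens, stopwords)
print(words.most_common())
```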

(3) Word cloud

pip install wordcloud

from redis import Redis
import json
import jieba

redis = Redis()
stopwords = set()
# set the first argument to the path of your stop-word file
with open('', encoding='gbk') as f:
    for line in f:
        print(line.rstrip('\r\n').encode())
        stopwords.add(line.rstrip('\r\n'))
print(len(stopwords))
print(stopwords)
items = redis.lrange('dbreview:items', 0, -1)   # every stored item
print(type(items))

words = {}
for item in items:
    val = json.loads(item)['review']
    for word in jieba.cut(val):
        # skip stop words so that they do not dominate the counts
        if word not in stopwords:
            words[word] = words.get(word, 0) + 1
print(len(words))
print(sorted(words.items(), key=lambda x: x[1], reverse=True))

Test code for word segmentation

VI. A Scrapy project

1. Review

2. Crawling a technical website

praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/text()").extract()
fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()
# match_re = re.match(".*(\d+).*", fav_nums)

When the class attribute holds several values, use contains() to select the element.
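
One thing worth checking in the commented-out pattern above: with a greedy leading .*, the group (\d+) captures only the last digit of the number. The sketch below shows the difference and one possible alternative; first_int and the sample string are invented for the example, and the helper expects a single string rather than the list that extract() returns.

```python
import re


def first_int(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else default


sample = ' 128 bookmarks'

# The greedy ".*" swallows "12" and leaves only the final digit for the group
greedy = re.match(r'.*(\d+).*', sample).group(1)
# A non-greedy prefix captures the whole number
lazy = re.match(r'.*?(\d+).*', sample).group(1)

print(greedy, lazy, first_int(sample), first_int(' bookmark'))
```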

from scrapy.http import Request   # passes a discovered URL on to the next stage
from urllib import parse

# Extract the next-page link and hand it to Scrapy for download
next_url = response.xpath('//div[@class="navigation margin-20"]/a[4]/@href').extract_first()
if next_url:
    yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
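
The reason for passing the link through urljoin is that href values are often relative. A short standard-library sketch follows; the blog.example.com URLs are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed
base = 'http://blog.example.com/all-posts/page/2/'

absolute_path = urljoin(base, '/all-posts/page/3/')      # path from the site root
relative_path = urljoin(base, '../3/')                   # path relative to the page
full_url = urljoin(base, 'http://other.example.com/x')   # already absolute: kept as is

print(absolute_path, relative_path, full_url)
```

urljoin expects a single string, so the href has to be one value, not a list of matches.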

(1) Image handling and storage:

pip install pillow

import os

# item field that holds the image URLs
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

(2) Writing to a local file:

import codecs
import json


class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line; keep non-ASCII characters readable
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        # called by Scrapy when the spider closes
        self.file.close()

(3) Export with Scrapy's built-in JsonItemExporter; exporters for CSV and other formats also exist.

from scrapy.exporters import JsonItemExporter


class JsonItemExporterPipeline(object):
    '''Export items with Scrapy's JsonItemExporter.'''

    def __init__(self):
        self.file = open('articleexport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

(4) Inserting into a database

import MySQLdb


class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('192.168.118.131', 'wang', 'wang', 'scrapy_jobbole', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
        insert into jobbole_article(title, url, create_date, fav_nums)
        values (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item['title'], item['url'], item['create_date'], item['fav_nums']))
        self.conn.commit()
        return item   # pass the item on to later pipelines

(5) Asynchronous insertion with Twisted's adbapi connection pool

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True
        )

        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        '''Run the insert asynchronously in the connection pool.'''
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)
        return item   # pass the item on to later pipelines

    def handle_error(self, failure):
        '''Handle an exception raised by the insert.'''
        print(failure)

    def do_insert(self, cursor, item):
        '''Perform the actual insert.'''
        insert_sql = """
        insert into jobbole_article(title, url, create_date, fav_nums)
        values (%s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item['title'], item['url'], item['create_date'], item['fav_nums']))

(6) Integrating Django models into Scrapy

scrapy-djangoitem

(7) Replacing a large number of xpath and css calls with ItemLoader

# Load the item through an item loader
item_loader = ArticleItemLoader(item=ArticleItem(), response=response)
# item_loader.add_css()
item_loader.add_xpath('title', '//div[@class="entry-header"]/h1/text()')

Processors can be chosen per field in the item's Field definitions:

class ArticleItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(add_jobbole)
    )
    create_date = scrapy.Field(
        input_processor=MapCompose(add_time)
    )

Custom output:

class ArticleItemLoader(ItemLoader):
    # custom item loader: keep only the first extracted value
    default_output_processor = TakeFirst()

The number after each pipeline in ITEM_PIPELINES is its priority; pipelines with lower values run first.
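
To illustrate the ordering only, the sketch below sorts a pipeline dictionary by its numbers, lowest first. It does not use Scrapy itself; run_order is a made-up helper, and the HypotheticalCleanupPipeline entry is invented for the example.

```python
ITEM_PIPELINES = {
    'scrapyapp.pipelines.ScrapyappPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 543,
    'scrapyapp.pipelines.HypotheticalCleanupPipeline': 100,   # hypothetical entry
}


def run_order(pipelines):
    """Return pipeline paths in the order they would process an item."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
print(order)
```

This is why a pipeline that cleans items gets a smaller number than one that stores them.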

VII. Anti-crawler strategies

1. Modifying the settings and middlewares files

In settings.py, define a USER_AGENT_LIST list (Scrapy only loads settings whose names are upper case).

In middlewares.py:

import random


class RandomUserAgentMiddlware(object):
    '''Randomly rotate the User-Agent.'''

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.user_agent_list = crawler.settings.get("USER_AGENT_LIST", [])

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.user_agent_list))

2. A library for rotating the User-Agent

pip install fake-useragent

from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    '''Randomly rotate the User-Agent.'''

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        # self.user_agent_list = crawler.settings.get("USER_AGENT_LIST", [])
        self.ua = UserAgent()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.ua.random)

The browser type can also be made a configuration item:

class RandomUserAgentMiddlware(object):
    '''Randomly rotate the User-Agent.'''

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        # self.user_agent_list = crawler.settings.get("USER_AGENT_LIST", [])
        self.ua = UserAgent()
        # configuration item in settings.py; defaults to "random"
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            # look up the attribute named by ua_type, e.g. ua.random
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())

Each request picks a User-Agent at random.
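
Stripped of the Scrapy plumbing, the selection step amounts to random.choice over a list. The sketch below is framework-free and assumes a hypothetical RandomUserAgent class and a made-up USER_AGENT_LIST; headers is a plain dictionary standing in for the request headers.

```python
import random

# Hypothetical list; in a project this would live in the settings file.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64) Chrome/67.0.3396.99',
    'Mozilla/5.0 (X11; Linux x86_64) Firefox/60.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) Safari/604.1',
]


class RandomUserAgent:
    def __init__(self, agents, rng=None):
        if not agents:
            raise ValueError('the user-agent list must not be empty')
        self.agents = list(agents)
        self.rng = rng or random.Random()

    def apply(self, headers):
        # setdefault keeps a User-Agent that the caller has already set
        headers.setdefault('User-Agent', self.rng.choice(self.agents))
        return headers


picker = RandomUserAgent(USER_AGENT_LIST)
fresh = picker.apply({})
kept = picker.apply({'User-Agent': 'custom-agent'})
print(fresh['User-Agent'], kept['User-Agent'])
```

Because setdefault never overwrites, a header set earlier in the request's life wins over the random one.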

3. Proxy IPs

Plain IP proxy:

request.meta['proxy'] = "http://61.135.217.7:80"   # proxy address

(1) Set a plain proxy IP directly.

(2) First crawl proxy IPs from a proxy-listing site and store them in a database, then read them from the database and use them in a middleware.

import requests
from scrapy.selector import Selector
import MySQLdb
import threading
from fake_useragent import UserAgent

conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='centos', db='test', charset='utf8')
cour = conn.cursor()

ua = UserAgent()


def crawl_ips():
    headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'),
    }
    for i in range(3):
        resp = requests.get('http://www.xicidaili.com/wt/{0}'.format(i), headers=headers)

        selector = Selector(text=resp.text)
        all_trs = selector.css('#ip_list tr')
        ip_list = []
        for tr in all_trs:
            speed_strs = tr.css(".bar::attr(title)").extract()
            all_texts = tr.css('td::text').extract()
            # rows without a speed or cell text (such as the header row) are skipped
            if not speed_strs or not all_texts:
                continue
            ip = all_texts[0]
            port = all_texts[1]
            proxy_type = all_texts[5]
            ip_list.append((ip, port, proxy_type, speed_strs[0].split('秒')[0]))

        for ip_info in ip_list:
            # parameterized query: the driver escapes the values
            cour.execute(
                "insert xici_ip_list(ip, port, speed, proxy_type) VALUES(%s, %s, %s, %s)",
                (ip_info[0], ip_info[1], ip_info[3], ip_info[2])
            )
        conn.commit()
        print('database write finished')


# crawl_ips()


class GetIP(object):
    def delete_ip(self, ip):
        cour.execute("delete from xici_ip_list where ip=%s", (ip,))
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # check whether the proxy works by fetching a page through it
        http_url = 'http://www.baidu.com'
        proxy_url = 'http://{}:{}'.format(ip, port)
        try:
            proxy_dict = {
                'http': proxy_url
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception as e:
            print('invalid ip and port')
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print('effective ip')
                return True
            else:
                print('invalid ip and port')
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # fetch one random ip from the database
        sql = """
        SELECT ip, port FROM xici_ip_list
        ORDER BY RAND()
        LIMIT 1
        """
        result = cour.execute(sql)
        for ip_info in cour.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_ip = self.judge_ip(ip, port)
            if judge_ip:
                return "http://{0}:{1}".format(ip, port)
            else:
                # the bad proxy has been deleted; try another one
                return self.get_random_ip()


# t = threading.Thread(target=crawl_ips)
# t.start()

get_ip = GetIP()

get_ip.get_random_ip()

class RandomProxyMiddleware(object):
    # set the proxy dynamically for every request
    def process_request(self, request, spider):
        get_ip = GetIP()
        request.meta['proxy'] = get_ip.get_random_ip()   # proxy address
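
When storing scraped values such as proxy addresses, passing them as query parameters instead of formatting them into the SQL string avoids quoting problems and SQL injection. The sketch below shows the idea with the standard library's sqlite3 and an in-memory database as a stand-in for MySQL; sqlite3 uses ? placeholders where MySQLdb uses %s, and the second row is a deliberately hostile made-up value.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
conn.execute('CREATE TABLE xici_ip_list (ip TEXT, port TEXT, speed REAL, proxy_type TEXT)')

rows = [
    ('61.135.217.7', '80', 0.5, 'HTTP'),
    # A hostile value: as a parameter it is stored as plain text, not executed
    ("1.2.3.4'; DROP TABLE xici_ip_list; --", '8080', 1.2, 'HTTP'),
]

# The driver binds each tuple to the ? placeholders
conn.executemany(
    'INSERT INTO xici_ip_list (ip, port, speed, proxy_type) VALUES (?, ?, ?, ?)',
    rows,
)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM xici_ip_list').fetchone()[0]
stored = conn.execute('SELECT ip FROM xici_ip_list WHERE port = ?', ('8080',)).fetchone()[0]
print(count, stored)
```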

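(3) As a plugin: scrapy-proxies

https://github.com/aivarsk/scrapy-proxies/blob/master/scrapy_proxies

(4) scrapy-crawlera

A paid service.

(5) Tor, the onion network

https://github.com/aivarsk/scrapy-proxies/blob/master/scrapy_proxies

Stable version.

VIII. CAPTCHA recognition

1. CAPTCHA recognition methods

Recognition in code with tesseract-ocr

Online CAPTCHA-solving services

http://www.yundama.com/

Manual solving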
