Scrapy: from installation to scraping images from jandan.net

Download page for prebuilt Windows wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs/
pip install wheel
pip install lxml
pip install pyopenssl
pip install Twisted
pip install pywin32
pip install scrapy

Create the project:
scrapy startproject jandan
cd jandan
cd jandan

items.py: defines the fields of the scraped data
pipelines.py: the item pipeline file

jandan.net uses anti-scraping measures, so a few settings need adjusting.

The settings file (settings.py)

ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOAD_DELAY = 2  # delay between downloads, in seconds

DOWNLOAD_TIMEOUT = 15  # download timeout, in seconds

COOKIES_ENABLED = False  # disable cookies

DOWNLOADER_MIDDLEWARES = {
    # random User-Agent header
    'jandan.middlewares.RandomUserAgent': 100,
    # random proxy IP
    'jandan.middlewares.RandomProxy': 200,
}

# User-Agent strings to choose from
USER_AGENTS = [
    # Maxthon
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    # Firefox
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    # Chrome
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
]

# Proxy list
PROXIES = [
    # proxies without a password
    {"ip_port":"119.177.90.103:9999","user_passwd":""},
    {"ip_port":"101.132.122.230:3128","user_passwd":""},
    # proxy with a password
    # {"ip_port":"123.139.56.238:9999","user_passwd":"root:admin"}
]

# Item pipelines: uncomment this block in the generated settings.py
ITEM_PIPELINES = {
    'jandan.pipelines.JandanPipeline': 300,
}

IMAGES_STORE = "images"
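The values above can also be handed to a middleware through Scrapy's `from_crawler` hook instead of being imported from the settings module. The following is only a sketch of that pattern: the class name `SettingsUserAgent` and the stub crawler and request objects are hypothetical, and the stubs exist only so the example runs without a Scrapy project.

```python
import random
from types import SimpleNamespace


class SettingsUserAgent:
    """Hypothetical middleware that takes USER_AGENTS from the crawler settings."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler, when it is defined, to build the middleware.
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy keep processing the request


class _StubSettings(dict):
    """Stand-in for Scrapy's settings object (assumption: only getlist is needed)."""

    def getlist(self, name, default=None):
        return list(self.get(name, default or []))


# Exercise the middleware with stand-in objects instead of a real crawl.
stub_crawler = SimpleNamespace(settings=_StubSettings(USER_AGENTS=["UA-1"]))
middleware = SettingsUserAgent.from_crawler(stub_crawler)
stub_request = SimpleNamespace(headers={})
middleware.process_request(stub_request, spider=None)
```

Reading from `crawler.settings` means overrides passed on the command line with `-s` are picked up as well, which a direct import of the settings module would miss.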

The middlewares file (middlewares.py)

import base64
import random

from jandan.settings import PROXIES, USER_AGENTS


class RandomUserAgent(object):
    """Attach a randomly chosen User-Agent to each request."""

    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)


class RandomProxy(object):
    """Route each request through a randomly chosen proxy."""

    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta["proxy"] = "http://" + proxy["ip_port"]
        # Proxies without credentials use an empty string in PROXIES,
        # so test for truthiness rather than `is None`.
        if proxy["user_passwd"]:
            # b64encode takes and returns bytes; in Python 3 str is Unicode,
            # so encode before the call and decode the result.
            base64_userpasswd = base64.b64encode(proxy["user_passwd"].encode())
            request.headers["Proxy-Authorization"] = "Basic " + base64_userpasswd.decode()
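HTTP Basic credentials are Base64-encoded (`base64.b64encode`), not Base16. As a minimal sketch, the header value for a `user:password` string can be built and checked on its own; the helper name `basic_proxy_auth` is hypothetical, and the credentials are the sample ones from the proxy list.

```python
import base64


def basic_proxy_auth(user_passwd):
    """Return the Proxy-Authorization header value for a 'user:password' string."""
    # b64encode works on bytes, so encode first and decode the result to str.
    token = base64.b64encode(user_passwd.encode("utf-8")).decode("ascii")
    return "Basic " + token


header = basic_proxy_auth("root:admin")
print(header)

# Decoding the token again should give back the original credentials.
round_trip = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(round_trip)
```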

The items file (items.py)

import scrapy


class JandanItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

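Create a CrawlSpider-based spider:
scrapy genspider -t crawl dj jandan.net
This creates dj.py under the spiders directory. The page content is rendered by JavaScript, so the spider loads each page in a browser through selenium and then uses BeautifulSoup to locate elements in the rendered HTML.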

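# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver

from jandan.items import JandanItem


class DjSpider(CrawlSpider):
    # Must match the name used with `scrapy genspider` and `scrapy crawl`.
    name = 'dj'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/pic/page-1#comments/']

    rules = (
        # Follow every paginated picture page and parse it.
        Rule(LinkExtractor(allow=r'pic/page-\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # The page is rendered by JavaScript, so load it in a headless browser.
        # Note: PhantomJS support was removed in Selenium 4, so this needs an older Selenium.
        driver = webdriver.PhantomJS()
        try:
            driver.get(response.url)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
        finally:
            driver.quit()  # always release the browser process

        for row in soup.find_all('div', {'class': 'row'}):
            name = row.find("strong")
            link = row.find('a', {'class': 'view_img_link'})
            # Skip rows without a poster name or an image link instead of aborting the page.
            if name is None or link is None or not link.get("href"):
                continue
            # Create a new item per row so earlier items are not overwritten.
            item = JandanItem()
            item["name"] = name.get_text().strip()
            # The href may be protocol-relative ("//host/path"), so force an http scheme.
            item["url"] = "http://" + link.get("href").split("//")[-1]
            yield item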
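The spider builds the image URL by splitting the href on `//` and prefixing `http://`. An alternative sketch, not the author's method, is to resolve the href against the page URL with the standard library's `urljoin`, which handles protocol-relative, absolute, and relative links alike. The helper name `absolute_image_url` and the host `img.example.com` are hypothetical. Unlike the spider code, this keeps an `https` scheme when the href already has one.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, href):
    """Resolve an href (possibly protocol-relative) against the page URL."""
    href = (href or "").strip()
    if not href:
        return None  # nothing to download for this row
    return urljoin(page_url, href)


page = "http://jandan.net/pic/page-1"
# A protocol-relative link inherits the scheme of the page.
print(absolute_image_url(page, "//img.example.com/large/a.jpg"))
# An absolute link is returned unchanged.
print(absolute_image_url(page, "https://img.example.com/large/b.gif"))
# A missing link yields None, so the caller can skip the row.
print(absolute_image_url(page, ""))
```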

pipelines.py

import json
import os

import requests


class JandanPipeline(object):
    # Option 1: save the items to a JSON file
    # def __init__(self):
    #     self.filename = open("jandan.json", "wb")
    #     self.num = 0
    #
    # def process_item(self, item, spider):
    #     text = json.dumps(dict(item), ensure_ascii=False) + "\n"
    #     self.filename.write(text.encode("utf-8"))
    #     self.num += 1
    #     return item
    #
    # def close_spider(self, spider):
    #     self.filename.close()
    #     print("Total resources: " + str(self.num))

    # Option 2: download the images to local disk
    def process_item(self, item, spider):
        if 'url' in item:
            # scrapy.conf is deprecated and absent from recent Scrapy releases;
            # read the settings from the spider instead.
            dir_path = spider.settings["IMAGES_STORE"]
            os.makedirs(dir_path, exist_ok=True)
            su = "." + item["url"].split(".")[-1]
            # Files are named after the poster, so only the first image per poster is saved.
            new_path = os.path.join(dir_path, item["name"] + su)
            if not os.path.exists(new_path):
                # Send the request first so a failed download leaves no empty file.
                response = requests.get(item["url"], stream=True, timeout=15)
                with open(new_path, 'wb') as handle:
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
        # Always return the item so later pipelines still receive it.
        return item
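The pipeline names each file after the poster, so two images from the same poster map to the same path and the second one is skipped. One possible way around this, sketched here under assumptions rather than taken from the original project, is to add a short hash of the image URL to the file name. The helper `image_path`, the sample poster name, and the host `img.example.com` are hypothetical.

```python
import hashlib
import os
from urllib.parse import urlparse


def image_path(store_dir, name, url):
    """Build a per-image file path from the poster name and the image URL."""
    # Take the extension from the URL path only, ignoring any query string.
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    # Keep only characters that are safe in file names.
    safe_name = "".join(c for c in name if c.isalnum() or c in "-_") or "image"
    # A short hash of the URL keeps two images by the same poster apart.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return os.path.join(store_dir, f"{safe_name}_{digest}{ext}")


first = image_path("images", "some user", "http://img.example.com/large/a.jpg")
second = image_path("images", "some user", "http://img.example.com/large/b.gif?x=1")
print(first)
print(second)
```

Inside `process_item`, the returned value could replace `new_path`; the existing `os.path.exists` check would then skip only images whose URL was already downloaded.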

Start the spider:
scrapy crawl dj

To send a request and inspect the response interactively, use the Scrapy shell:
scrapy shell "https://hr.tencent.com/position.php?&start=0"

Here is my GitHub repository; the projects there are updated regularly.

https://github.com/bjptw/workspace
