44. Scraping second-hand housing listings from Lianjia with Scrapy, part 2

Collecting the full set of second-hand housing data:

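The site lists 27,650 second-hand homes in total, but a single query returns at most 100 pages of results, so broad queries come back truncated. To see every listing, the requests have to be split up by the filter parameters in the URL.
Here I only covered the filter parameters roughly and did not refine the combinations that still have problems. The run issued 21,096 requests in total and ended up with 19,794 records.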
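The effect of the 100-page cap can be sketched with a little arithmetic. This is a minimal illustration, assuming 30 listings per page (the page size the spider in this post uses); `pages_to_fetch` is a hypothetical helper name, not part of Scrapy or the site.

```python
import math

PAGE_SIZE = 30   # listings per result page (assumed, matches the spider's arithmetic)
MAX_PAGES = 100  # the site returns at most 100 pages per query


def pages_to_fetch(total):
    """Return (pages, truncated) for a query that matches `total` listings."""
    pages = math.ceil(total / PAGE_SIZE)
    return min(pages, MAX_PAGES), pages > MAX_PAGES


# The unfiltered query is far over the cap, so it must be split by filters.
print(pages_to_fetch(27650))
# A query with at most 3000 matches fits within the 100 pages.
print(pages_to_fetch(3000))
```

Under these assumptions, any query matching more than 100 x 30 = 3,000 listings loses data and needs a narrower filter.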

Here are the stats Scrapy printed when the run finished:

{'downloader/exception_count': 199,
'downloader/exception_type_count/twisted.internet.error.NoRouteError': 192,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 7,
'downloader/request_bytes': 9878800,
'downloader/request_count': 21096,
'downloader/request_method_count/GET': 21096,
'downloader/response_bytes': 677177525,
'downloader/response_count': 20897,
'downloader/response_status_count/200': 20832,
'downloader/response_status_count/301': 49,
'downloader/response_status_count/302': 11,
'downloader/response_status_count/404': 5,
'dupefilter/filtered': 53,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 12, 8, 49, 42, 371235),
'httperror/response_ignored_count': 5,
'httperror/response_ignored_status_count/404': 5,
'log_count/DEBUG': 21098,
'log_count/ERROR': 298,
'log_count/INFO': 61,
'request_depth_max': 3,
'response_received_count': 20837,
'retry/count': 199,
'retry/reason_count/twisted.internet.error.NoRouteError': 192,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 7,
'scheduler/dequeued': 21096,
'scheduler/dequeued/memory': 21096,
'scheduler/enqueued': 21096,
'scheduler/enqueued/memory': 21096,
'spider_exceptions/TypeError': 298,
'start_time': datetime.datetime(2018, 11, 12, 7, 59, 52, 608383)}
2018-11-12 16:49:42 [scrapy.core.engine] INFO: Spider closed (finished)
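A few derived numbers make the dump easier to read. The sketch below copies a handful of values from the stats above and computes the run time and failure rate. Scrapy records `start_time` and `finish_time` in UTC, which would explain why `finish_time` reads 08:49:42 while the log line shows 16:49:42 (assuming the machine ran in a UTC+8 time zone).

```python
from datetime import datetime

# Values copied from the stats dump above
stats = {
    'downloader/request_count': 21096,
    'downloader/response_count': 20897,
    'downloader/exception_count': 199,
    'spider_exceptions/TypeError': 298,
    'start_time': datetime(2018, 11, 12, 7, 59, 52, 608383),
    'finish_time': datetime(2018, 11, 12, 8, 49, 42, 371235),
}

elapsed = stats['finish_time'] - stats['start_time']
seconds = elapsed.total_seconds()
requests = stats['downloader/request_count']
failed = stats['downloader/exception_count']
no_response = requests - stats['downloader/response_count']

print('elapsed:', elapsed)
print('requests per second: {:.2f}'.format(requests / seconds))
print('requests without a response:', no_response)
print('download failure rate: {:.2%}'.format(failed / requests))
```

The requests that never received a response equal `downloader/exception_count`, and the 298 `TypeError` entries match `log_count/ERROR`, so every logged error in this run came from the spider callbacks rather than from the downloader.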

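The collected data is shown in the figure:

num = 296910 / 15 = 19794 records (296,910 lines in the output file; each record takes 15 lines: 14 fields plus one separator line)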
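Dividing the line count by 15 only works if no field contains a line break. A slightly more robust check, sketched here, counts the separator lines instead. It assumes the file layout produced by the spider in this post (each record followed by a line of 50 `=` characters); `count_records` is a hypothetical helper, and the demo writes two fake records to a temporary file rather than reading the real output.

```python
import os
import tempfile

SEPARATOR = '=' * 50  # the line the spider writes after every record


def count_records(path):
    """Count records by counting separator lines in the output file."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for line in f if line.rstrip('\n') == SEPARATOR)


# Demo: write two fake records in the same layout and count them.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'text2')
    with open(sample, 'w', encoding='utf-8') as f:
        for n in range(2):
            f.write('\n'.join('field{}'.format(i) for i in range(14)))
            f.write('\n' + SEPARATOR + '\n')
    demo_count = count_records(sample)

print(demo_count)
```

For the real run, `count_records('text2')` should agree with the 19,794 figure if the line arithmetic above holds.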

2. lianjia.py

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['gz.lianjia.com']
    start_urls = ['https://gz.lianjia.com/ershoufang/pg1/']

    def parse(self, response):
        # Split the crawl into 7 x 7 filter combinations (p1-p7, a1-a7),
        # because a single query returns at most 100 pages.
        for i in range(1, 8):
            for j in range(1, 8):
                url = 'https://gz.lianjia.com/ershoufang/p{}a{}pg1'.format(i, j)
                yield scrapy.Request(url=url, callback=self.parse_detail)

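    def parse_detail(self, response):
        # Number of listings that match the current filter combination
        counts = response.xpath("//h2[@class='total fl']/span/text()").extract_first()
        if counts is None:
            return
        # 30 listings per page; round up so the last partial page is included.
        # (The original skipped every query whose count was an exact multiple of 30.)
        p_num = (int(counts.strip()) + 29) // 30
        # Build the URL of each result page for this filter combination
        base_url = response.url.split('pg')[0]
        for k in range(1, p_num + 1):
            link_url = base_url + 'pg{}/'.format(k)
            yield scrapy.Request(url=link_url, callback=self.parse_detail2)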

    def parse_detail2(self, response):
        # Collect the detail-page URL of every listing on the current result page
        link_urls = response.xpath("//div[@class='info clear']/div[@class='title']/a/@href").extract()
        for link_url in link_urls:
            yield scrapy.Request(url=link_url, callback=self.parse_detail3)

    def parse_detail3(self, response):
        # extract_first(default='') returns '' instead of None when a node is missing.
        # Concatenating a label with None raises TypeError, which is the likely
        # source of the 298 spider_exceptions/TypeError entries in the stats.
        title = response.xpath("//div[@class='title']/h1[@class='main']/text()").extract_first(default='')
        print('Title: ' + title)
        dist = response.xpath("//div[@class='areaName']/span[@class='info']/a/text()").extract_first(default='')
        print('District: ' + dist)
        # The "basic attributes" block; each <li> holds one field
        contents = response.xpath("//div[@class='introContent']/div[@class='base']")
        house_type = contents.xpath("./div[@class='content']/ul/li[1]/text()").extract_first(default='')
        print('Layout: ' + house_type)
        floor = contents.xpath("./div[@class='content']/ul/li[2]/text()").extract_first(default='')
        print('Floor: ' + floor)
        built_area = contents.xpath("./div[@class='content']/ul/li[3]/text()").extract_first(default='')
        print('Built area: ' + built_area)
        family_structure = contents.xpath("./div[@class='content']/ul/li[4]/text()").extract_first(default='')
        print('Layout structure: ' + family_structure)
        inner_area = contents.xpath("./div[@class='content']/ul/li[5]/text()").extract_first(default='')
        print('Interior area: ' + inner_area)
        architectural_type = contents.xpath("./div[@class='content']/ul/li[6]/text()").extract_first(default='')
        print('Building type: ' + architectural_type)
        house_orientation = contents.xpath("./div[@class='content']/ul/li[7]/text()").extract_first(default='')
        print('Orientation: ' + house_orientation)
        building_structure = contents.xpath("./div[@class='content']/ul/li[8]/text()").extract_first(default='')
        print('Building structure: ' + building_structure)
        decoration_condition = contents.xpath("./div[@class='content']/ul/li[9]/text()").extract_first(default='')
        print('Decoration: ' + decoration_condition)
        proportion = contents.xpath("./div[@class='content']/ul/li[10]/text()").extract_first(default='')
        print('Elevator-to-unit ratio: ' + proportion)
        elevator = contents.xpath("./div[@class='content']/ul/li[11]/text()").extract_first(default='')
        print('Elevator: ' + elevator)
        age_limit = contents.xpath("./div[@class='content']/ul/li[12]/text()").extract_first(default='')
        print('Property term: ' + age_limit)

        # try:
        #     house_label = response.xpath("//div[@class='content']/a/text()").extract_first()
        # except:
        #     house_label = ''
        # print('Listing tags: ' + house_label)

        # Append the record: 14 field lines followed by one separator line
        with open('text2', 'a', encoding='utf-8') as f:
            f.write('\n'.join(
                [title, dist, house_type, floor, built_area, family_structure, inner_area,
                 architectural_type, house_orientation, building_structure,
                 decoration_condition, proportion, elevator, age_limit]))
            f.write('\n' + '=' * 50 + '\n')
        print('-' * 100)

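3. To refine the crawl further, add more filter parameters to the request URLs. Narrower queries return more precise result pages and avoid filter combinations that match more than 3,000 listings (100 pages x 30 per page); any combination that still exceeds that can be subdivided again.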
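One way to picture that subdivision is a small recursive planner: keep a filter string if its result count fits under the cap, otherwise append the next filter dimension and try again. This is only a sketch under stated assumptions. `plan_queries` and `count_for` are hypothetical names, the codes `l1` and `l2` stand in for whatever extra filter the site offers, and whether filter codes can simply be concatenated in this order is an assumption that has not been checked against the site. The counts in the demo are made up.

```python
MAX_RESULTS = 3000  # 100 pages x 30 listings per page


def plan_queries(prefix, count_for, extra_filters):
    """Return filter strings whose result counts fit under the cap.

    prefix        -- current filter string, e.g. 'p1a1'
    count_for     -- callable returning the listing count for a filter string
    extra_filters -- list of lists of further filter codes to try, in order
    """
    if count_for(prefix) <= MAX_RESULTS or not extra_filters:
        # Small enough, or nothing left to split on: keep as is.
        return [prefix]
    head, rest = extra_filters[0], extra_filters[1:]
    plan = []
    for code in head:
        plan.extend(plan_queries(prefix + code, count_for, rest))
    return plan


# Demo with made-up counts; in a spider the count would come from the page.
fake_counts = {'p1a1': 1200, 'p2a2': 4500, 'p2a2l1': 2000, 'p2a2l2': 2500}
small = plan_queries('p1a1', fake_counts.get, [['l1', 'l2']])
split = plan_queries('p2a2', fake_counts.get, [['l1', 'l2']])
print(small)
print(split)
```

In a real spider the count is only known after fetching the first page of each combination, so the same decision would live in the callback that reads the total (here `parse_detail`) rather than in a standalone function.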
