Scraping Douban group data with the Python Scrapy framework

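I have been looking for an apartment lately in the Douban group 上海租房 (Shanghai rentals). Searching the group turned out to be difficult, so I decided to scrape the listings with a crawler, and to get more familiar with Python along the way.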

The Scrapy getting-started tutorial I followed is here: http://www.cnblogs.com/txw1958/archive/2012/07/16/scrapy-tutorial.html

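The project is much the same as the one in the tutorial. The main technical difficulty is character encoding, which the tutorial does not cover in detail. The code is posted below for reference.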

E:\tutorial>tree /f
Folder PATH listing for volume 文档
Volume serial number is -BBB3
E:.
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  items.pyc
    │  pipelines.py
    │  pipelines.pyc
    │  settings.py
    │  settings.pyc
    │  __init__.py
    │  __init__.pyc
    │
    └─spiders
            douban_spider.py
            douban_spider.pyc
            __init__.py
            __init__.pyc

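items.py: a good article introducing Scrapy items is here: http://blog.csdn.net/iloveyin/article/details/41309609

from scrapy.item import Item, Field

class DoubanItem(Item):
    # One record per topic link found in the group's discussion list.
    title = Field()  # topic title, taken from the link's title attribute
    link = Field()   # topic URL, taken from the link's href attribute
    #resp = Field()
    #dateT = Field()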

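pipelines.py: define your own pipeline here. The Chinese character encoding problem can be solved at this point.

# -*- coding: utf-8 -*-
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs

class TutorialPipeline(object):
    def __init__(self):
        # Output file; everything written to it is encoded as GBK.
        self.file = codecs.open('items.json', 'wb', encoding='gbk')

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters as \uXXXX by default.
        line = json.dumps(dict(item)) + '\n'
        print line  # Python 2 print statement
        # Turn the \uXXXX escapes back into Unicode characters before writing.
        self.file.write(line.decode("unicode_escape"))
        return item

    def close_spider(self, spider):
        # Close the output file when the crawl finishes.
        self.file.close()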
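
For readers on Python 3, where str has no decode method, the same effect can probably be had more directly. The following is a minimal sketch, not the pipeline above: it assumes Python 3, and the helper name to_json_line and the sample record are made up for illustration. It shows that json.dumps escapes non-ASCII text as \uXXXX by default, and that passing ensure_ascii=False keeps the Chinese characters as they are.

```python
import json


def to_json_line(item):
    """Serialize a dict as one line of JSON, keeping non-ASCII text readable.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    return json.dumps(item, ensure_ascii=False) + "\n"


# Made-up sample record in the shape the spider produces
# (extract() returns lists, so both fields are lists).
sample = {"title": ["上海租房"], "link": ["http://example.com/topic/1"]}

escaped = json.dumps(sample)     # default: non-ASCII becomes \uXXXX escapes
readable = to_json_line(sample)  # Chinese characters are written as-is

print(escaped)
print(readable, end="")
```

Unlike decoding with unicode_escape, this leaves other JSON escapes such as \" untouched, so each line stays valid JSON. Writing the result to a file opened with encoding='utf-8' also avoids the errors GBK raises for characters it cannot represent.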

Add the corresponding ITEM_PIPELINES setting to settings.py (the ITEM_PIPELINES block is the newly added part):

# -*- coding: utf-8 -*-
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

Next is the spider, douban_spider.py:

# BaseSpider and HtmlXPathSelector are the older Scrapy names;
# newer releases use scrapy.Spider and Selector instead.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from tutorial.items import DoubanItem

class DoubanSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["douban.com"]
    # The discussion list is paged 25 topics at a time via the start parameter.
    start_urls = [
        "http://www.douban.com/group/shanghaizufang/discussion?start=0",
        "http://www.douban.com/group/shanghaizufang/discussion?start=25",
        "http://www.douban.com/group/shanghaizufang/discussion?start=50",
        "http://www.douban.com/group/shanghaizufang/discussion?start=75",
        "http://www.douban.com/group/shanghaizufang/discussion?start=100",
        "http://www.douban.com/group/shanghaizufang/discussion?start=125",
        "http://www.douban.com/group/shanghaizufang/discussion?start=150",
        "http://www.douban.com/group/shanghaizufang/discussion?start=175",
        "http://www.douban.com/group/shanghaizufang/discussion?start=200"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Every table cell on the page is visited.
        sites = hxs.xpath('//tr/td')
        items = []
        for site in sites:
            item = DoubanItem()
            # extract() returns a list; it is empty when the cell has no matching <a>.
            item['title'] = site.xpath('a/@title').extract()
            item['link'] = site.xpath('a/@href').extract()
            # item['resp'] = site.xpath('text()').extract()
            # item['dateT'] = site.xpath('text()').extract()
            items.append(item)
        return items
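
The nine start_urls differ only in the start parameter, which advances by 25 per page. As a sketch, the list could be generated instead of typed out. The names BASE_URL and page_urls, their parameters, and the default of 9 pages are illustrative assumptions and not part of the original spider.

```python
# URL pattern taken from the start_urls above; {} is filled with the offset.
BASE_URL = "http://www.douban.com/group/shanghaizufang/discussion?start={}"


def page_urls(pages=9, page_size=25):
    """Return the discussion-list URLs for the first `pages` pages."""
    return [BASE_URL.format(n * page_size) for n in range(pages)]


urls = page_urls()
print(urls[0])
print(urls[-1])
```

With the defaults this yields the same nine URLs as the literal list, and changing pages is enough to crawl further back.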

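Export the data as JSON with the command below. Note that the pipeline above already writes its own items.json, so it may be safer to give the -o export a different file name.

scrapy crawl douban -o items.json -t json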

This website converts JSON to CSV:

https://json-csv.com/
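
The conversion can also be done locally. Below is a rough sketch assuming Python 3 and input with one JSON object per line, as the pipeline above writes it (the -o export is a single JSON array and would need json.load instead). The function name json_lines_to_csv and the sample data are hypothetical. Records with an empty title are skipped, since table cells without a topic link produce empty items.

```python
import csv
import io
import json


def json_lines_to_csv(lines, out):
    """Write title/link rows to `out` from an iterable of JSON lines.

    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.writer(out)
    writer.writerow(["title", "link"])
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        titles = record.get("title") or []
        links = record.get("link") or []
        if not titles:
            continue  # cell without a topic link
        # Both fields are lists; keep the first value of each.
        writer.writerow([titles[0], links[0] if links else ""])
        count += 1
    return count


# Made-up sample input: one real-looking record and one empty record.
sample_lines = [
    '{"title": ["两室一厅 近地铁"], "link": ["http://example.com/topic/1"]}',
    '{"title": [], "link": []}',
]
buffer = io.StringIO()
rows_written = json_lines_to_csv(sample_lines, buffer)
print(buffer.getvalue())
```

For real files, open the CSV with newline="" as the csv module expects. Writing it with encoding="utf-8-sig" usually helps Excel detect the encoding.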

The result: in this form the listings are easy to search and filter.
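
As an illustration of the kind of filtering this makes possible, here is a small sketch in Python 3. The function name filter_titles, the keywords and the sample rows are all made up.

```python
def filter_titles(rows, keywords):
    """Return the rows whose title contains every keyword."""
    return [row for row in rows if all(k in row["title"] for k in keywords)]


# Made-up rows in the shape of the CSV columns (title, link).
rows = [
    {"title": "两室一厅 近地铁 徐汇", "link": "http://example.com/topic/1"},
    {"title": "单间出租 浦东", "link": "http://example.com/topic/2"},
]

matches = filter_titles(rows, ["地铁", "徐汇"])
for row in matches:
    print(row["title"], row["link"])
```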
