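Scraping Douban Group Data with the Python Scrapy Framework

I have been looking for an apartment lately, mostly in the Douban group 上海租房 (Shanghai rentals), and found its search hard to use. So I decided to scrape the data with a crawler and get more familiar with Python along the way.

A Scrapy getting-started tutorial is available here: http://www.cnblogs.com/txw1958/archive/2012/07/16/scrapy-tutorial.html

My project is almost the same as the one in the tutorial. The main technical difficulty is character encoding, which the tutorial does not cover in detail. I am posting my code below for reference.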

E:\tutorial>tree /f
Folder PATH listing for volume 文档
Volume serial number is -BBB3
E:.
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  items.pyc
    │  pipelines.py
    │  pipelines.pyc
    │  settings.py
    │  settings.pyc
    │  __init__.py
    │  __init__.pyc
    │
    └─spiders
            douban_spider.py
            douban_spider.pyc
            __init__.py
            __init__.pyc

items.py: there is a good article introducing Items here (http://blog.csdn.net/iloveyin/article/details/41309609)

from scrapy.item import Item, Field

class DoubanItem(Item):
    # Fields scraped for each post in the group listing
    title = Field()
    link = Field()
    #resp = Field()
    #dateT = Field()

pipelines.py: define your own pipeline here. The Chinese character encoding problem can be solved at this point.

# -*- coding: utf-8 -*-
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs

class TutorialPipeline(object):
    def __init__(self):
        # Output file, encoded as GBK
        self.file = codecs.open('items.json', 'wb', encoding='gbk')

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters as \uXXXX sequences
        line = json.dumps(dict(item)) + '\n'
        print line  # Python 2 print statement
        # Turn the \uXXXX escapes back into unicode before writing
        self.file.write(line.decode("unicode_escape"))
        return item
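The transcoding step can also be handled by telling `json.dumps` not to escape non-ASCII characters in the first place. The following is only a minimal sketch, assuming Python 3 and UTF-8 output rather than the GBK file used above. The class name `Utf8JsonLinesPipeline` and the default file name `items_utf8.jl` are hypothetical and not part of the original project.

```python
import json


class Utf8JsonLinesPipeline:
    """Hypothetical pipeline: writes one JSON object per line, UTF-8 encoded."""

    def __init__(self, path="items_utf8.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this hook when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this hook when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Because the text is never escaped, no `unicode_escape` decoding is needed, and quotes or backslashes inside a title stay valid JSON.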

Add the corresponding ITEM_PIPELINES setting to settings.py (the ITEM_PIPELINES block is the newly added part):

# -*- coding: utf-8 -*-
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

Next is the spider, douban_spider.py:

# Note: BaseSpider and HtmlXPathSelector come from older Scrapy releases;
# newer releases use scrapy.Spider and response.xpath() instead.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from tutorial.items import DoubanItem

class DoubanSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["douban.com"]
    # Listing pages of the group, 25 posts per page
    start_urls = [
        "http://www.douban.com/group/shanghaizufang/discussion?start=0",
        "http://www.douban.com/group/shanghaizufang/discussion?start=25",
        "http://www.douban.com/group/shanghaizufang/discussion?start=50",
        "http://www.douban.com/group/shanghaizufang/discussion?start=75",
        "http://www.douban.com/group/shanghaizufang/discussion?start=100",
        "http://www.douban.com/group/shanghaizufang/discussion?start=125",
        "http://www.douban.com/group/shanghaizufang/discussion?start=150",
        "http://www.douban.com/group/shanghaizufang/discussion?start=175",
        "http://www.douban.com/group/shanghaizufang/discussion?start=200"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Every table cell; cells without a direct <a> child yield empty lists
        sites = hxs.xpath('//tr/td')
        items = []
        for site in sites:
            item = DoubanItem()
            item['title'] = site.xpath('a/@title').extract()
            item['link'] = site.xpath('a/@href').extract()
            # item['resp'] = site.xpath('text()').extract()
            # item['dateT'] = site.xpath('text()').extract()
            items.append(item)
        return items
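The nine `start_urls` above differ only in the `start` offset, which grows in steps of 25. As a sketch, the list could be generated instead of typed out. The helper name `build_start_urls` and the constant `BASE_URL` are hypothetical and not part of the original spider.

```python
BASE_URL = "http://www.douban.com/group/shanghaizufang/discussion?start={}"


def build_start_urls(pages=9, page_size=25):
    """Return listing-page URLs for offsets 0, 25, 50, ... (hypothetical helper)."""
    last_offset = pages * page_size
    return [BASE_URL.format(offset) for offset in range(0, last_offset, page_size)]


start_urls = build_start_urls()
print(start_urls[0])
print(start_urls[-1])
```

Inside the spider class, `start_urls = build_start_urls()` would then stand in for the literal list, and crawling more pages only requires changing `pages`.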

Export the data as JSON:

scrapy crawl douban -o items.json -t json

This website converts JSON to CSV:

https://json-csv.com/
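The conversion can also be done locally with the standard library. This is a minimal sketch, assuming Python 3, and assuming the export is a JSON array of objects whose `title` and `link` values are lists (which is what `extract()` returns). The function name `json_to_csv` and any file names passed to it are hypothetical.

```python
import csv
import json


def json_to_csv(json_path, csv_path, fields=("title", "link")):
    """Convert a JSON array of items to CSV; list values are joined with spaces."""
    with open(json_path, encoding="utf-8") as src:
        items = json.load(src)
    # utf-8-sig writes a BOM, which helps spreadsheet tools recognise the encoding
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)
        for item in items:
            row = []
            for name in fields:
                value = item.get(name, "")
                if isinstance(value, list):
                    value = " ".join(value)
                row.append(value)
            writer.writerow(row)
    return len(items)
```

A call such as `json_to_csv("items.json", "items.csv")` returns the number of items written, which gives a quick check that the export was read completely.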

Result: in this form the data is easy to search and filter.
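Searching and filtering can also be scripted once the data is local. Here is a small sketch, assuming items shaped like the spider's output (lists under `title` and `link`). The function name `filter_by_keyword` and the sample data are made up for illustration.

```python
def filter_by_keyword(items, keyword):
    """Return (title, link) pairs whose title contains the keyword."""
    matches = []
    for item in items:
        titles = item.get("title") or []
        links = item.get("link") or []
        # Skip items from table cells that had no link
        if titles and links and keyword in titles[0]:
            matches.append((titles[0], links[0]))
    return matches


# Made-up sample data in the same shape as the scraped items
sample = [
    {"title": ["徐汇区 一室户 整租"], "link": ["http://example.com/1"]},
    {"title": ["浦东 合租 次卧"], "link": ["http://example.com/2"]},
    {"title": [], "link": []},
]
print(filter_by_keyword(sample, "整租"))
```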
