Python 2.7: Scraping the Douban Books Top 250 with XPath (2017-01-29)

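On the second day of the Lunar New Year, after taking care of a few things at home, I ended up discussing with someone how to scrape the Douban Books Top 250.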

1. Build the list of page URLs (start=0, 25, ..., 225; 25 books per page): urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]

2. Modules: requests fetches the page source, lxml parses the HTML, and XPath expressions extract the fields.

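3. Extract the information: title, URL, publication date, price, rating, number of ratings, and the one-line introduction.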

4. The logic could be wrapped in functions; here it is left as a plain script with no function calls.
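As a quick check of step 1, the sketch below builds the URL list on its own and prints its size and endpoints. The comprehension has to sit outside the format() call: passing a generator into format() would produce a single URL containing the generator's repr instead of ten page URLs. The name BASE_URL is introduced here only for illustration.

```python
# Build one URL per results page: start=0, 25, ..., 225 (25 books per page).
# BASE_URL is an illustrative name, not part of the original script.
BASE_URL = 'https://book.douban.com/top250?start={}'

# The loop wraps the whole format() call, so each offset yields its own URL.
urls = [BASE_URL.format(offset) for offset in range(0, 226, 25)]

print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

Ten pages of 25 entries cover all 250 books, which is why the range stops at 226 (the last offset used is 225).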

Python code:

# coding: utf-8
import sys

# Python 2 only: make UTF-8 the default encoding so str() on Chinese text does not raise
reload(sys)
sys.setdefaultencoding('utf-8')

from lxml import etree
import requests

# One URL per page: start=0, 25, ..., 225
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]

for url in urls:
    html = requests.get(url).content
    selector = etree.HTML(html)
    # Each book sits in its own <tr class="item"> row
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        book_name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]
        # Publication line: fields separated by '/', ending with the date and the price
        published_infos = str(info.xpath('td/p/text()')[0])
        splitlistinfos = published_infos.split('/')
        published_date = str(splitlistinfos[-2])
        price = str(splitlistinfos[-1])
        rate = info.xpath('td/div/span[2]/text()')[0]
        # Strip the parentheses and whitespace around the rating count, then re-append the '人评价' ("ratings") suffix
        comment_nums = info.xpath('td/div/span[3]/text()')[0].strip('(').strip().strip(')').strip().strip('人评价').strip() + '人评价'
        # Some books have no one-line introduction, so fall back to an empty string
        introduceinfo = info.xpath('td/p/span/text()')
        print book_name, book_url, published_date, price, rate, comment_nums, introduceinfo[0] if len(introduceinfo) > 0 else ''
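Step 4 notes that the logic could be wrapped in functions. One possible split, sketched here, moves the string cleanup into two small pure functions so it can be checked without any network access. The function names and the sample strings are hypothetical: the samples only imitate the shape of the values the XPath queries return, and the digit search with a regular expression is swapped in for the chain of strip() calls used in the script.

```python
# -*- coding: utf-8 -*-
import re


def split_publication_info(text):
    """Return (published_date, price) from a slash-separated info line.

    Assumes, as the script above does, that the last two fields are the
    publication date and the price; other layouts are not handled.
    """
    fields = [field.strip() for field in text.split('/')]
    if len(fields) < 2:
        return '', ''
    return fields[-2], fields[-1]


def extract_comment_count(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Hypothetical sample strings shaped like the scraped values.
sample_info = u'作者 / 译者 / 出版社 / 2006-5 / 29.00元'
sample_comments = u'(\n    123456人评价\n  )'

published_date, price = split_publication_info(sample_info)
comment_count = extract_comment_count(sample_comments)
print(published_date)
print(price)
print(comment_count)
```

Returning the count as an integer rather than a string with the suffix re-appended is a design choice of this sketch; it makes sorting or filtering by popularity straightforward, and the suffix can be added back at print time.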
