Python crawler: scraping the Douban Movie Top 250

Scraping the Douban Movie Top 250
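from urllib.request import Request, urlopen  # urllib.request is the standard-library HTTP client
from lxml import etree  # lxml parses the HTML and evaluates XPath expressions
import pickle  # serializes the scraped list to disk
import time

arr = []  # one entry per page; each entry is a list of (title, news, grade, comment, link) tuples
url = "https://movie.douban.com/top250?start="  # Douban Top 250 URL; start is the offset of the first movie
urls = [url + str(i) for i in range(0, 250, 25)]  # 25 movies per page, 250 in total, so 10 requests
# Assumption: Douban is reported to reject urllib's default User-Agent, so send a browser-like one
headers = {"User-Agent": "Mozilla/5.0"}

def aa(link):  # scrape one page and append its movies to arr
    time.sleep(1)  # wait one second between requests
    print("Fetching: %s" % link)  # progress message so the crawl can be followed live
    with urlopen(Request(link, headers=headers)) as html:  # open the page
        text = html.read().decode("utf-8")  # read and decode the response body
    doc = etree.HTML(text)  # parse the HTML with lxml
    page = []
    # Extract title, details (news), rating (grade), featured quote (comment) and link
    # from each movie's own node. Page-wide lists combined with zip drift out of step,
    # because p/text() yields several text nodes per movie and some movies have no quote.
    for info in doc.xpath("//ol[@class='grid_view']/li/div[@class='item']/div[@class='info']"):
        title = info.xpath("div[@class='hd']/a/span[1]/text()")[0]
        # Assumption: the first <p> under div.bd is the details paragraph; collapse its whitespace
        news = " ".join(" ".join(info.xpath("div[@class='bd']/p[1]/text()")).split())
        grade = info.xpath("div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()")[0]
        comment = "".join(info.xpath("div[@class='bd']/p[@class='quote']/span[@class='inq']/text()"))
        href = info.xpath("div[@class='hd']/a/@href")[0]
        page.append((title, news, grade, comment, href))
    arr.append(page)  # store this page's movies

for link in urls:  # iterate over the ten page URLs
    aa(link)
with open("豆瓣电影.txt", 'wb') as f:  # binary write mode: despite the .txt name this is a pickle file
    pickle.dump(arr, f)  # serialize the scraped data
with open("豆瓣电影.txt", 'rb') as f:
    obj = pickle.load(f)  # load it back to check the round trip
for item in obj:
    print(item)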
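
When scraping list pages like this one, pairing page-wide query results with zip only stays aligned if every movie yields exactly one value per field. The sketch below illustrates the difference offline. It is not the Douban markup: the HTML snippet is a hypothetical, simplified sample, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without network access or extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second movie has no quote.
SAMPLE = """
<ol class="grid_view">
  <li><div class="info">
    <span class="title">Movie A</span>
    <span class="rating_num">9.7</span>
    <span class="inq">Quote for A</span>
  </div></li>
  <li><div class="info">
    <span class="title">Movie B</span>
    <span class="rating_num">9.6</span>
  </div></li>
  <li><div class="info">
    <span class="title">Movie C</span>
    <span class="rating_num">9.5</span>
    <span class="inq">Quote for C</span>
  </div></li>
</ol>
"""

root = ET.fromstring(SAMPLE)

# Page-wide queries: three titles but only two quotes, so zip shifts the columns.
titles = [e.text for e in root.findall(".//span[@class='title']")]
quotes = [e.text for e in root.findall(".//span[@class='inq']")]
misaligned = list(zip(titles, quotes))

# Per-item queries: each movie is read from its own <li>, missing fields become "".
aligned = []
for li in root.findall("li"):
    title = li.findtext(".//span[@class='title']")
    quote = li.findtext(".//span[@class='inq']", default="")
    aligned.append((title, quote))

print(misaligned)
print(aligned)
```

In the page-wide version Movie B silently receives Movie C's quote and Movie C is dropped, while the per-item version keeps all three movies and leaves the missing quote empty.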
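import pickle
import xlwt  # third-party package that writes legacy .xls workbooks

wb = xlwt.Workbook()  # create the workbook object
ws = wb.add_sheet("豆瓣电影")  # add a worksheet named "豆瓣电影" (Douban Movies)
with open("豆瓣电影.txt", 'rb') as f:
    arr = pickle.load(f)  # list of pages, each a list of 5-tuples
index = 0  # zero-based row index
for arr2 in arr:  # one list per scraped page
    for title, news, grade, comment, links in arr2:
        ws.write(index, 0, index + 1)  # column 0: rank, starting at 1
        ws.write(index, 1, title)      # column 1: title
        ws.write(index, 2, news)       # column 2: details
        ws.write(index, 3, grade)      # column 3: rating
        ws.write(index, 4, comment)    # column 4: featured quote
        ws.write(index, 5, links)      # column 5: movie URL
        index += 1

wb.save("豆瓣电影.xls")  # write the workbook to disk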
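
If xlwt is not available, the same nested loop can target a CSV file instead. The following is a minimal sketch under that assumption, not part of the original script: it swaps xlwt for the standard-library csv module, the write_rows helper and the sample data are hypothetical, and it writes to an in-memory buffer so it runs without the pickle file.

```python
import csv
import io

def write_rows(pages, out):
    """Write every movie tuple as one CSV row, prefixed with a 1-based rank."""
    writer = csv.writer(out)
    writer.writerow(["rank", "title", "news", "grade", "comment", "link"])
    rank = 0
    for page in pages:  # one list per scraped page
        for title, news, grade, comment, link in page:
            rank += 1
            writer.writerow([rank, title, news, grade, comment, link])
    return rank  # number of movies written

# Hypothetical sample data in the same shape as the pickled list.
sample_pages = [
    [("Movie A", "1994 / US", "9.7", "Quote A", "https://example.com/a")],
    [("Movie B", "1993 / HK", "9.6", "", "https://example.com/b")],
]
buffer = io.StringIO()
count = write_rows(sample_pages, buffer)
lines = buffer.getvalue().splitlines()
print(count)
print(lines[1])
```

To write a real file, pass a handle opened with open("movies.csv", "w", newline="", encoding="utf-8-sig") in place of the buffer. The file name is only an example; newline="" is what the csv module expects, and the utf-8-sig encoding helps Excel recognize the Chinese text.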
