Writing a Simple Web Crawler in Python

Preparation:

1. Fetching a web page: retrieve the page content for a given URL

import urllib.request

# Open the URL and print the raw response body (a bytes object)
response = urllib.request.urlopen("https://www.cnblogs.com")
print(response.read())

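The urlopen function

urlopen(url, data=None[, timeout])

url is the address to request, data is the payload to send with the request (it must be bytes; when it is supplied, the HTTP request is sent as a POST), and timeout is the timeout in seconds.

It returns a response object.

The response object's read method returns the content of the page as bytes.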
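As a minimal sketch of the points above, the helper below wraps urlopen with a timeout and decodes the bytes returned by read into a string. The name fetch, the 10-second default, and the UTF-8 default are assumptions made for illustration, not part of the original post.

```python
import urllib.request


def fetch(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, giving up after timeout seconds."""
    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes
    return raw.decode(encoding)    # str

# Usage (needs network access):
#     html = fetch("https://www.cnblogs.com")
#     print(html[:200])
```

Decoding explicitly avoids printing a long b'...' literal, and the timeout keeps the crawler from hanging on a server that never answers.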

POST requests

import urllib.parse
import urllib.request

values = {"username": "XXX", "password": "XXX"}
# urlencode builds the form body; urlopen requires it as bytes
data = urllib.parse.urlencode(values).encode('utf-8')
url = "https://passport.cnblogs.com/user/signin?ReturnUrl=https://home.cnblogs.com/&AspxAutoDetectCookieSupport=1"
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen(url, data)
print(response.read())
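To see what is actually sent without touching the network, the request can be built as a urllib.request.Request object and inspected first. This is only a sketch: build_post_request and the https://example.com/login address are hypothetical names chosen for the example.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    # urlencode produces a str such as "a=1&b=2"; the request body must be bytes.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


req = build_post_request(
    "https://example.com/login",            # hypothetical URL
    {"username": "XXX", "password": "XXX"},
)
print(req.get_method())  # POST
print(req.data)          # b'username=XXX&password=XXX'
```

Because the request carries a body, get_method reports POST; the same object can later be passed to urllib.request.urlopen to send it.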

GET requests

import urllib.parse
import urllib.request

values = {"itemCount": 30}
# For GET the parameters go into the query string, so they stay a str (no bytes encoding needed)
data = urllib.parse.urlencode(values)
url = "https://news.cnblogs.com/CommentAjax/GetSideComments"
response = urllib.request.urlopen(url + '?' + data)
print(response.read())
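The query-string step can be tried on its own. The sketch below only builds the URL and sends nothing; build_get_url is a hypothetical helper name, and the keyword parameter is made up to show how spaces are escaped.

```python
import urllib.parse


def build_get_url(base_url, params):
    """Append params to base_url as a URL-encoded query string."""
    query = urllib.parse.urlencode(params)
    return base_url + "?" + query if query else base_url


url = build_get_url(
    "https://news.cnblogs.com/CommentAjax/GetSideComments",
    {"itemCount": 30},
)
print(url)  # https://news.cnblogs.com/CommentAjax/GetSideComments?itemCount=30

# Values are escaped automatically; a space becomes "+".
print(urllib.parse.urlencode({"keyword": "python crawler"}))  # keyword=python+crawler
```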

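2. Regular expressions: the re module

Python ships with the re module, which provides regular expression support.

# Compile a pattern string into a pattern object
re.compile(pattern[, flags])

# Matching functions
re.match(pattern, string[, flags]) # check whether the beginning of the string matches the regular expression
re.search(pattern, string[, flags]) # scan the string for the first location where the regular expression matches
re.split(pattern, string[, maxsplit]) # split the string at each match of the regular expression
re.findall(pattern, string[, flags]) # find all substrings matched by the RE and return them as a list
re.finditer(pattern, string[, flags]) # find all substrings matched by the RE and return them as an iterator of match objects
re.sub(pattern, repl, string[, count]) # replace every substring matched by the RE with repl
re.subn(pattern, repl, string[, count]) # same as sub, but returns (new_string, number_of_substitutions)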
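A short run-through of these functions on one sample string may make the differences clearer, in particular match versus search. The sample text is made up for the example.

```python
import re

text = "page 12 of 340"

# match only succeeds when the pattern matches at the very start of the string.
at_start = re.match(r"\d+", text)
print(at_start)                          # None, because the string starts with "page"

# search scans the whole string and returns the first match it finds.
first = re.search(r"\d+", text)
print(first.group())                     # 12

print(re.findall(r"\d+", text))          # ['12', '340']
print(re.split(r"\s+", text))            # ['page', '12', 'of', '340']
print(re.sub(r"\d+", "N", text))         # page N of N
print(re.subn(r"\d+", "N", text))        # ('page N of N', 2)

# A compiled pattern offers the same operations as methods.
number = re.compile(r"\d+")
positions = [m.start() for m in number.finditer(text)]
print(positions)                         # [5, 11]
```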

3. Beautiful Soup is a library for extracting data from web pages; it is imported from the bs4 package.
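A minimal sketch of Beautiful Soup at work, assuming the bs4 package is installed. The HTML snippet is invented for the example (its class names mirror the ones used on cnblogs) and the example.com link is a placeholder.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not real site output.
html = """
<div class="post_item_body">
  <h3><a class="titlelnk" href="https://example.com/p/1.html">First post</a></h3>
  <p class="post_item_summary">A short summary.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None if there is none).
link = soup.find("a", class_="titlelnk")
title = link.string
href = link["href"]
summary = soup.find("p", class_="post_item_summary").get_text(strip=True)

print(title)    # First post
print(href)     # https://example.com/p/1.html
print(summary)  # A short summary.
```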

4. MongoDB

The MongoEngine library is used to access it.

Example:

Crawl the first 20 pages of cnblogs (博客园) and save the data to MongoDB.

1. Fetching the data from cnblogs

request.py

import urllib.parse
import urllib.request

def getHtml(url, values):
    # Append the parameters as a query string and decode the response as UTF-8
    data = urllib.parse.urlencode(values)
    response_result = urllib.request.urlopen(url + '?' + data).read()
    html = response_result.decode('utf-8')
    return html

def requestCnblogs(num):
    # Request page number num of the cnblogs home page post list
    print('requesting page:', num)
    url = 'https://www.cnblogs.com/mvc/AggSite/PostList.aspx'
    values = {
        'CategoryId': 808,
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'PostList',
        'PageIndex': num,
        'ParentCategoryId': 0,
        'TotalPostCount': 4000
    }
    result = getHtml(url, values)
    return result

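Note:

Open the second page of the site, press F12 to open the browser developer tools, and look for the request to https://www.cnblogs.com/mvc/AggSite/PostList.aspx; the URL and parameters above are taken from that request.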

2. Parsing the fetched data

deal.py

from bs4 import BeautifulSoup
import request
import re

def blogParser(index):
    cnblogs = request.requestCnblogs(index)
    soup = BeautifulSoup(cnblogs, 'html.parser')
    all_div = soup.find_all('div', attrs={'class': 'post_item_body'}, limit=20)
    blogs = []
    # Loop over the divs and extract the details of each post
    for item in all_div:
        blog = analyzeBlog(item)
        blogs.append(blog)
    return blogs

def analyzeBlog(item):
    result = {}
    a_title = find_all(item, 'a', 'titlelnk')
    # find_all returns a list (never None), so test for emptiness
    if a_title:
        result["title"] = a_title[0].string
        result["link"] = a_title[0]['href']
    p_summary = find_all(item, 'p', 'post_item_summary')
    if p_summary:
        result["summary"] = p_summary[0].text
    footers = find_all(item, 'div', 'post_item_foot')
    footer = footers[0]
    result["author"] = footer.a.string
    # Avoid naming variables str or time, which would shadow built-in and standard library names
    footer_text = footer.text
    # The footer contains "发布于 <date> <time> " ("发布于" means "published on")
    times = re.findall(r"发布于 .+? .+? ", footer_text)
    result["create_time"] = times[0].replace('发布于 ', '').strip()
    return result

def find_all(item, attr, c):
    # Return at most one descendant tag named attr with CSS class c
    return item.find_all(attr, attrs={'class': c}, limit=1)

Note:

Inspect the HTML structure of the page to find the tags and class names used above.
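The regular expression that pulls the publication time out of the footer can be tested in isolation. The sketch below assumes a footer text of the form "author 发布于 date time ..."; the sample string and the helper name extract_create_time are made up for illustration.

```python
import re


def extract_create_time(footer_text):
    """Return the "date time" part after "发布于 ", or None if it is absent."""
    # Each lazy .+? stops at the next space, so the two groups capture
    # the date and the time that follow "发布于 " ("published on").
    matches = re.findall(r"发布于 .+? .+? ", footer_text)
    if not matches:
        return None
    return matches[0].replace("发布于 ", "").strip()


sample = "SomeAuthor 发布于 2017-06-01 10:30 评论(0)阅读(12)"  # assumed footer format
print(extract_create_time(sample))           # 2017-06-01 10:30
print(extract_create_time("no date here"))   # None
```

Returning None for a footer without a date avoids the IndexError that indexing an empty findall result would raise.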

3. Saving the processed data to MongoDB

db.py

from mongoengine import *

# Connect to the "test" database on the local MongoDB server
connect('test', host='localhost', port=27017)

class Blogs(Document):
    title = StringField(required=True, max_length=200)
    link = StringField(required=True)
    author = StringField(required=True)
    summary = StringField(required=True)
    create_time = StringField(required=True)

def savetomongo(contents):
    # Save each parsed blog dict as a Blogs document
    for content in contents:
        blog = Blogs(
            title=content['title'],
            link=content['link'],
            author=content['author'],
            summary=content['summary'],
            create_time=content['create_time']
        )
        blog.save()
    return "ok"

def haveBlogs():
    # Count the stored documents on the server instead of loading them all
    return Blogs.objects.count()

4. Running the crawl

test.py

import db
import deal

print("start.......")
# Crawl pages 1 through 20, saving each page as soon as it is parsed
for i in range(1, 21):
    contents = deal.blogParser(i)
    db.savetomongo(contents)
    print('page', i, ' OK.')
counts = db.haveBlogs()
print("have ", counts, " blogs")
print("end.......")

Note:

The Python version used here is 3.6.1.

The saved data can be viewed in a MongoDB GUI tool.
