python3 + BeautifulSoup 4.6: Scraping Novels from a Website (Part 3): Page Analysis and BeautifulSoup Parsing

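What this chapter covers: crawling every novel on the site and storing it locally.

Target site: www.cuiweijuxs.com

Analyzing the pages shows four steps: from the home page, enter a category and open its paginated list; open every link on each list page; open each work's page; open each chapter's content.

The implementation steps are therefore as follows: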

1. Enter the category page, www.cuiweijuxs.com/jingpinxiaoshuo/

Find the largest page number

<a href="http://www.cuiweijuxs.com/jingpinxiaoshuo/5_122.html" class="last">122</a>

Open each page in a loop

href="http://www.cuiweijuxs.com/jingpinxiaoshuo/5_?.html"

2. Find all links on the current page and open each one in a loop. The elements below can be used to locate them

div id="newscontent"
 div class="l"
  <span class="s2">
   <a href="http://www.cuiweijuxs.com/4_4521/" target="_blank">Title</a>

3. Open a work's link and find its chapter list. The elements below can be used to locate it

<div id="list">
<dd>
<a href="/4_4508/528170.html">Chapter 1</a>
</dd>
</div>

4. Open a chapter link and read its content

<div id="content">
Content
</div>
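
The four locators above can be tried offline before any request is sent. The following is a minimal sketch, assuming a hand-written HTML fragment (SAMPLE_HTML is made up for illustration and modeled on the snippets above, it is not the real page source) and Python's built-in 'html.parser' backend:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the snippets above; not the real page source.
SAMPLE_HTML = """
<a href="http://www.cuiweijuxs.com/jingpinxiaoshuo/5_122.html" class="last">122</a>
<div id="newscontent"><div class="l">
  <span class="s2"><a href="http://www.cuiweijuxs.com/4_4521/" target="_blank">Title</a></span>
</div></div>
<div id="list"><dd><a href="/4_4508/528170.html">Chapter 1</a></dd></div>
<div id="content">Content</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: the text of <a class="last"> is the largest page number.
last_page = int(soup.find("a", "last").string)

# Step 2: work links sit under div#newscontent > div.l > span.s2 > a.
container = soup.find("div", {"id": "newscontent"}).find("div", {"class": "l"})
work_links = [a.get("href")
              for span in container.find_all("span", {"class": "s2"})
              for a in span.find_all("a")]

# Step 3: chapter links sit under div#list.
chapter_links = [a.get("href") for a in soup.find("div", {"id": "list"}).find_all("a")]

# Step 4: the chapter text is inside div#content.
chapter_text = soup.find("div", {"id": "content"}).get_text()

print(last_page, work_links, chapter_links, chapter_text)
```

Checking the selectors against a small fixed fragment like this makes it easier to tell a wrong locator apart from a network or encoding problem.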

Step 1: create the class, initialize its parameters, and wrap fetching a BeautifulSoup-parsed page in a method

# -*- coding: UTF-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import os

'''
Scrape web pages with BeautifulSoup
'''

class Capture():

    def __init__(self):
        self.index_page_url = 'http://www.cuiweijuxs.com/'
        self.one_page_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        self.two_page_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/5_?.html"
        # Root output folder ('小说' means 'novels')
        self.folder_path = '小说/'
        self.head = {}
        # Set the User-Agent header
        self.head[
            'User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'

    # Fetch a page and return it parsed by BeautifulSoup
    def getSoup(self, query_url):
        req = request.Request(query_url, headers=self.head)
        webpage = request.urlopen(req)
        html = webpage.read()
        # soup = BeautifulSoup(html, 'html.parser')
        # The 'html5lib' parser is a separate package (pip install html5lib)
        soup = BeautifulSoup(html, 'html5lib')
        return soup
    # end getSoup

Step 2: enter the category page, find the largest page number, and open each page in a loop

    # Read the update list
    def readPageOne(self):
        soup = self.getSoup(self.one_page_url)
        # <a class="last"> holds the largest page number
        last = soup.find("a", "last")
        itemSize = int(last.string)
        page_url = str(self.two_page_url)
        for item in range(itemSize):
            print(item)
            # range() starts at 0, page numbers start at 1
            new_page_url = page_url.replace("?", str(item + 1))
            self.readPageTwo(new_page_url)
    # end readPageOne

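getSoup returns the parsed HTML page; find then locates the a tag whose class is "last", whose text is the largest page number.

The loop walks the pages starting from page 1 (range() starts at 0, so the code uses item + 1).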
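
The pagination loop can also be expressed as a small pure function, which is easy to check without touching the network. This is only a sketch: build_page_urls is a hypothetical helper name, not part of the original class, and the value 3 stands in for the real page count read from the "last" link.

```python
def build_page_urls(template, last_page):
    """Return the list-page URLs for pages 1..last_page (inclusive)."""
    # range(1, last_page + 1) yields 1, 2, ..., last_page, so no "+ 1" is needed later.
    return [template.replace("?", str(page)) for page in range(1, last_page + 1)]


# 3 is a stand-in; the article reads the real value from <a class="last">.
urls = build_page_urls("http://www.cuiweijuxs.com/jingpinxiaoshuo/5_?.html", 3)
for url in urls:
    print(url)
```

Starting the range at 1 keeps the page number and the loop variable identical, which avoids off-by-one mistakes when the loop body grows.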

Step 3: read the links on a single list page

    # Read the links on one list page
    def readPageTwo(self, page_url):
        soup = self.getSoup(page_url)
        con_div = soup.find('div', {'id': 'newscontent'}).find('div', {'class': 'l'})
        # Collect the link from every span.s2; indexing with [0] would keep only the first work
        a_list = [a for span in con_div.find_all('span', {'class': 's2'})
                  for a in span.find_all('a')]
        print(a_list)
        for a_href in a_list:
            href = a_href.get('href')
            # The link text is used as the folder name
            folder_name = a_href.get_text()
            print('a_href', href, '---folder_name', folder_name)
            path = self.folder_path + folder_name
            self.createFolder(path)
            self.readPageThree(href, path)
        # end for
    # end readPageTwo

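Find the div whose id is newscontent, then the div with class "l" under it, then every span with class "s2" and the a tag under each span. Loop over these a tags and open each one,

using the link text ( a_href.get_text() ) as the folder name.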
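
Because the folder name comes straight from the link text, a title containing characters such as / or ? could produce an invalid path. The following is one possible guard, offered as a sketch only: safe_name is a hypothetical helper that the original code does not have, and the set of replaced characters is an assumption based on common Windows and POSIX restrictions.

```python
import re


def safe_name(title, fallback="untitled"):
    """Replace characters that are commonly invalid in file or folder names."""
    # Backslash, slash, colon, asterisk, question mark, quote, angle brackets,
    # pipe and control whitespace become underscores.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Leading/trailing spaces and dots are awkward on Windows, so trim them.
    cleaned = cleaned.strip(" .")
    return cleaned or fallback


print(safe_name("A/B: C?"))
print(safe_name("   "))
```

In the class above, such a helper could wrap a_href.get_text() before the path is built; whether that is needed depends on the titles the site actually uses.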

Step 4: open the work's page, loop over the chapter links, and build each file name

    # Open the work's page
    def readPageThree(self, page_url, path):
        soup = self.getSoup(page_url)
        print('readPageThree--', page_url)
        a_list = soup.find('div', {'id': 'list'}).find_all('a')
        idx = 0
        for a_href in a_list:
            idx = idx + 1
            # href starts with '/', so drop the root URL's trailing '/' to avoid '//'
            href = self.index_page_url.rstrip('/') + a_href.get('href')
            # File name: <folder>/<index>_<chapter title>.txt
            txt_name = path + '/' + str(idx) + '_' + a_href.get_text() + '.txt'
            print('a_href', href, '---path', txt_name)
            isExists = os.path.exists(txt_name)
            if isExists:
                print(txt_name, 'already exists')
            else:
                self.readPageFour(href, txt_name)
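
Chapter links in the list are root-relative (for example /4_4508/528170.html), so they have to be combined with the site root before they can be requested. As an alternative to string concatenation, the standard library's urllib.parse.urljoin handles this case; the snippet below is a small illustration and not part of the original code.

```python
from urllib.parse import urljoin

base = "http://www.cuiweijuxs.com/"

# A root-relative href is resolved against the site root without a doubled slash.
chapter_url = urljoin(base, "/4_4508/528170.html")

# An href that is already absolute is returned unchanged.
work_url = urljoin(base, "http://www.cuiweijuxs.com/4_4521/")

# Plain concatenation of the same two strings produces "//" after the host.
naive_url = base + "/4_4508/528170.html"

print(chapter_url)
print(work_url)
print(naive_url)
```

urljoin also copes with links that are already absolute, so the same call works for both the work links and the chapter links.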

Step 5: open the chapter link, read everything under the div with id=content, and write it to a file

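    # Read one chapter's content and write it to a file
    def readPageFour(self, page_url, path):
        soup = self.getSoup(page_url)
        con_div = soup.find('div', {'id': 'content'})
        # get_text() drops tags, so pass '\n' as separator to keep the <br/> line breaks;
        # &nbsp; has already been decoded to '\xa0' at this point
        content = con_div.get_text('\n').replace('\xa0', ' ')
        self.writeTxt(path, content)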
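
BeautifulSoup's get_text() returns text only, so by the time it returns, <br/> tags are gone and &nbsp; entities have already been decoded to the character U+00A0. The following offline sketch (the sample string is made up for illustration, and 'html.parser' is used so no extra package is needed) compares get_text() with and without a separator:

```python
from bs4 import BeautifulSoup

# Hypothetical chapter fragment; not taken from the real site.
sample = '<div id="content">&nbsp;&nbsp;First line<br/>Second line</div>'
con_div = BeautifulSoup(sample, "html.parser").find("div", {"id": "content"})

# Without a separator the two text nodes are simply concatenated.
plain = con_div.get_text()

# With "\n" as separator each text node lands on its own line;
# the non-breaking spaces are then mapped to ordinary spaces.
with_breaks = con_div.get_text("\n").replace("\xa0", " ")

print(repr(plain))
print(repr(with_breaks))
```

Searching the result of get_text() for the literal strings '<br/>' or '&nbsp;' therefore finds nothing, which is why the separator argument and '\xa0' are used here instead.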

The complete code is as follows:

# -*- coding: UTF-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import os

'''
Scrape web pages with BeautifulSoup
'''

class Capture():

    def __init__(self):
        self.index_page_url = 'http://www.cuiweijuxs.com/'
        self.one_page_url = 'http://www.cuiweijuxs.com/jingpinxiaoshuo/'
        self.two_page_url = "http://www.cuiweijuxs.com/jingpinxiaoshuo/5_?.html"
        # Root output folder ('小说' means 'novels')
        self.folder_path = '小说/'
        self.head = {}
        # Set the User-Agent header
        self.head[
            'User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'

    # Fetch a page and return it parsed by BeautifulSoup
    def getSoup(self, query_url):
        req = request.Request(query_url, headers=self.head)
        webpage = request.urlopen(req)
        html = webpage.read()
        # soup = BeautifulSoup(html, 'html.parser')
        # The 'html5lib' parser is a separate package (pip install html5lib)
        soup = BeautifulSoup(html, 'html5lib')
        return soup
    # end getSoup

    # Read the update list
    def readPageOne(self):
        soup = self.getSoup(self.one_page_url)
        # <a class="last"> holds the largest page number
        last = soup.find("a", "last")
        itemSize = int(last.string)
        page_url = str(self.two_page_url)
        for item in range(itemSize):
            print(item)
            # range() starts at 0, page numbers start at 1
            new_page_url = page_url.replace("?", str(item + 1))
            self.readPageTwo(new_page_url)
    # end readPageOne

    # Read the links on one list page
    def readPageTwo(self, page_url):
        soup = self.getSoup(page_url)
        con_div = soup.find('div', {'id': 'newscontent'}).find('div', {'class': 'l'})
        # Collect the link from every span.s2; indexing with [0] would keep only the first work
        a_list = [a for span in con_div.find_all('span', {'class': 's2'})
                  for a in span.find_all('a')]
        print(a_list)
        for a_href in a_list:
            href = a_href.get('href')
            # The link text is used as the folder name
            folder_name = a_href.get_text()
            print('a_href', href, '---folder_name', folder_name)
            path = self.folder_path + folder_name
            self.createFolder(path)
            self.readPageThree(href, path)
        # end for
    # end readPageTwo

    # Open the work's page
    def readPageThree(self, page_url, path):
        soup = self.getSoup(page_url)
        print('readPageThree--', page_url)
        a_list = soup.find('div', {'id': 'list'}).find_all('a')
        idx = 0
        for a_href in a_list:
            idx = idx + 1
            # href starts with '/', so drop the root URL's trailing '/' to avoid '//'
            href = self.index_page_url.rstrip('/') + a_href.get('href')
            # File name: <folder>/<index>_<chapter title>.txt
            txt_name = path + '/' + str(idx) + '_' + a_href.get_text() + '.txt'
            print('a_href', href, '---path', txt_name)
            isExists = os.path.exists(txt_name)
            if isExists:
                print(txt_name, 'already exists')
            else:
                self.readPageFour(href, txt_name)

    # Read one chapter's content and write it to a file
    def readPageFour(self, page_url, path):
        soup = self.getSoup(page_url)
        con_div = soup.find('div', {'id': 'content'})
        # get_text() drops tags, so pass '\n' as separator to keep the <br/> line breaks;
        # &nbsp; has already been decoded to '\xa0' at this point
        content = con_div.get_text('\n').replace('\xa0', ' ')
        self.writeTxt(path, content)

    # Unused helper: read one chapter's content and return it without writing
    def readPageHtml(self, page_url, path):
        soup = self.getSoup(page_url)
        con_div = soup.find('div', {'id': 'content'})
        content = con_div.get_text('\n').replace('\xa0', ' ')
        return content

    def createFolder(self, path):
        path = path.strip()
        # Strip a trailing backslash
        path = path.rstrip("\\")
        isExists = os.path.exists(path)
        # Create the folder if it does not exist
        if not isExists:
            os.makedirs(path)
            print(path + ' create')
        else:
            print(path + ' directory already exists')
    # end createFolder

    def writeTxt(self, file_name, content):
        isExists = os.path.exists(file_name)
        if isExists:
            print(file_name, 'already exists')
        else:
            # The with block closes the file even if write() raises
            with open(file_name, 'w', encoding='utf-8') as file_object:
                file_object.write(content)

    def run(self):
        try:
            self.readPageOne()
        # Exception rather than BaseException, so Ctrl+C still stops the crawl
        except Exception as error:
            print('error--', error)

Capture().run()
