Python Web Scraper: 《神雕侠侣》 (The Return of the Condor Heroes)

Python 3.5

This script scrapes 《神雕侠侣》 from http://www.kanunu8.com/wuxia/201102/1610.html

I am a wuxia fan, so I enjoy scraping wuxia novels.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re

from docx import Document
from selenium import webdriver

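class House():
    def __init__(self):
        # Request headers; defined here but not used by the Selenium driver below
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}
        # Index page that lists every chapter of the novel
        self.baseUrl = 'http://www.kanunu8.com/wuxia/201102/1610.html'
        # Directory containing this script; output is written beneath it
        self.basePath = os.path.dirname(os.path.abspath(__file__))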

    def makedir(self, name):
        path = os.path.join(self.basePath, name)
        if not os.path.exists(path):
            os.makedirs(path)
            print('Directory has been created.')
        else:
            print('The directory already exists.')
        # Switch into that directory so the output file is saved there
        os.chdir(path)

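    def connect(self, url):
        # Note: PhantomJS support was removed in Selenium 4,
        # so this code assumes Selenium 3.x or earlier.
        try:
            driver = webdriver.PhantomJS()
            driver.get(url)
            return driver
        except Exception as e:
            # Return None so callers can tell that the page failed to load
            print('Failed to load page:', url, e)
            return None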

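    # Collect the link to every chapter listed on the index page
    def getBookLinkList(self, url):
        bookLinkList = []
        driver = self.connect(url)
        if driver is None:
            return bookLinkList
        # Chapter links end in a five-digit file name followed by .html
        pattern = re.compile(r".+/[0-9]{5}\.html$")
        try:
            # Find every anchor element on the page
            bookLinks = driver.find_elements_by_xpath("//a")
            for link in bookLinks:
                href = link.get_attribute('href')
                # Anchors without an href return None; skip them
                if href and pattern.match(href):
                    bookLinkList.append(href)
        except Exception as e:
            print('Error while collecting chapter links:', e)
        finally:
            # Always shut down the browser process
            driver.quit()
        return bookLinkList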

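    # Scrape the details (title and text) of a single chapter page
    def getBookDetail(self, url):
        # Default to empty strings so the return below never raises NameError
        title, content = '', ''
        driver = self.connect(url)
        if driver is None:
            return title, content
        try:
            # Locate the chapter title and the chapter text
            title = driver.find_element_by_xpath('//h2').text
            content = driver.find_element_by_xpath('//p').text
            print(title)
        except Exception as e:
            print('Error while reading chapter:', url, e)
        finally:
            # Always shut down the browser process
            driver.quit()
        return title, content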

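    def getData(self):
        doc = Document()
        self.makedir('StoryFiles')
        bookLinkList = self.getBookLinkList(self.baseUrl)
        for linkUrl in bookLinkList:
            title, content = self.getBookDetail(linkUrl)
            # Skip chapters that could not be scraped
            if not title and not content:
                continue
            # Write the chapter title as a heading, followed by its text
            doc.add_heading(title, level=1)
            doc.add_paragraph(content)
        doc.save('神雕侠侣.docx')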

if __name__ == '__main__':
    house = House()
    house.getData()
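
The chapter-link filter in getBookLinkList can be tried on its own, without a browser. The sketch below is only an illustration: the helper name filter_chapter_links and the sample hrefs are assumptions made up for this example (only the index-page URL comes from the script above), and the regular expression is the same one the script uses.

```python
import re

# Same pattern as in getBookLinkList: a URL whose file name is five digits + ".html"
CHAPTER_PATTERN = re.compile(r".+/[0-9]{5}\.html$")


def filter_chapter_links(hrefs):
    """Hypothetical helper: keep only the hrefs that look like chapter pages."""
    # href may be None for anchors that have no href attribute
    return [h for h in hrefs if h and CHAPTER_PATTERN.match(h)]


# Sample hrefs; apart from the index page these are made up for illustration
sample = [
    "http://www.kanunu8.com/wuxia/201102/1610.html",        # index page: four digits
    "http://www.kanunu8.com/wuxia/201102/1610/12345.html",  # hypothetical chapter URL
    None,                                                   # anchor without an href
    "http://www.kanunu8.com/",
]

chapters = filter_chapter_links(sample)
print(chapters)
```

Compiling the pattern once, outside the loop, avoids rebuilding it for every link, and checking for None first means no exception handler is needed around the match.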
