Python 3 crawler: scraping Qidian novels with requests

import os
import time
from urllib import parse

import requests
from lxml import etree

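def get_page_html(url):
    '''Send a GET request to url; return the response, or None on failure.'''
    try:
        # The request itself must be inside the try block: timeouts and
        # connection errors are raised here, not by the status check.
        response = session.get(url, headers=headers, timeout=timeout)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response
    return None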

def get_next_url(response):
    '''Return the absolute URL of the next chapter, or None if there is none.'''
    if response is None:
        return None
    selector = etree.HTML(response.text)
    hrefs = selector.xpath("//a[@id='j_chapterNext']/@href")
    if not hrefs:
        return None
    # Resolve the href against the URL of the current page.
    return parse.urljoin(response.url, hrefs[0])

def xs_content(response):
    '''Return (chapter title, list of content lines), or None if parsing fails.'''
    if response is None:
        return None
    selector = etree.HTML(response.text)
    titles = selector.xpath("//h3[@class='j_chapterName']/text()")
    if not titles:
        return None
    content_lines = selector.xpath(
        "//div[contains(@class,'read-content') and contains(@class,'j_readContent')]//p/text()")
    return titles[0].strip(), content_lines

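def write_to_txt(info_tuple: tuple):
    '''Write one chapter to BASE_PATH/<title>.txt, skipping files that already exist.'''
    if not info_tuple:
        return
    os.makedirs(BASE_PATH, exist_ok=True)
    # Check for the real file name, including the ".txt" extension.
    path = os.path.join(BASE_PATH, info_tuple[0] + ".txt")
    if not os.path.exists(path):
        with open(path, "wt", encoding="utf-8") as f:
            for line in info_tuple[1]:
                f.write(line + "\n")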

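def run(url):
    '''Crawl chapter by chapter, starting from url, until there is no next chapter.'''
    # A loop instead of recursion avoids hitting Python's recursion limit on long novels.
    while url:
        html = get_page_html(url)
        info_tuple = xs_content(html)
        if not info_tuple:
            break
        print("Writing %s" % info_tuple[0])
        write_to_txt(info_tuple)  # the last chapter is written too, even without a next link
        url = get_next_url(html)
        if url:
            print("Next page: %s" % url)
            time.sleep(sleep_time)  # Delay between requests to reduce load on the server.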

if __name__ == '__main__':
    session = requests.Session()
    sleep_time = 5  # seconds to wait between requests
    timeout = 5  # request timeout in seconds
    BASE_PATH = r"D:\图片\LSZJ"  # directory where the chapter files are stored
    # URL of the first chapter of the novel to crawl (here, chapter 1 of 斗破苍穹)
    url = "https://read.qidian.com/chapter/8iw8dkb_ZTxrZK4x-CuJuw2/fWJwrOiObhn4p8iEw--PPw2"
    headers = {
        "Referer": "https://read.qidian.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    }
    print('Starting the crawler')
    run(url)
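
A note on the parse.urljoin call in get_next_url: the href of the "next chapter" link is not necessarily a full URL, so it is resolved against the URL of the current page. The sketch below shows how urljoin treats three common href forms. The page URL and href values are made up for illustration (book-id, chapter-1 and chapter-2 are hypothetical placeholders), and which form the site actually uses is an assumption, not something verified here.

```python
from urllib import parse

# Hypothetical page URL, for illustration only.
page_url = "https://read.qidian.com/chapter/book-id/chapter-1"


def resolve(href):
    """Resolve an href found on page_url into an absolute URL."""
    return parse.urljoin(page_url, href)


# Protocol-relative href: the scheme is taken from the page URL.
protocol_relative = resolve("//read.qidian.com/chapter/book-id/chapter-2")
# Root-relative href: scheme and host are taken from the page URL.
root_relative = resolve("/chapter/book-id/chapter-2")
# Sibling href: replaces the last path segment of the page URL.
sibling = resolve("chapter-2")

print(protocol_relative)
print(root_relative)
print(sibling)
```

All three calls yield the same absolute URL, which is why the crawler can pass whatever the href attribute contains straight to urljoin without inspecting its form first.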
