Implementing a Crawler in Python 3 with bs4 and SQLAlchemy

This article comes from the NetEase Cloud community.

Author: Wang Bei

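Even elementary school students are learning Python these days, so a professional programmer certainly can't fall behind. I therefore spent the weekend at home picking up Python 3. Its basic syntax is fairly simple, and development feels more agile than in Java. I won't cover the basics of Python 3 here; instead, this post walks through the implementation logic of my small crawler program.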

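Module diagram, from top to bottom (figure not reproduced):

At a glance, the whole thing comes down to five steps and involves four Python 3 modules: requests, bs4, re, and sqlalchemy.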

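(1) requests

requests is a powerful HTTP client library with a rich API. For example, to send a GET request:

with requests.get(url, params={}, headers={}) as rsp:
    rsp.text  # the response body as text

To send a POST request with a JSON body:

with requests.post(url, json={}, headers={}) as rsp:
    rsp.text  # the response body as text

And so on.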

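A note on why with ... as is used here: with first calls the __enter__() method, and its return value is bound to the name after as (rsp in the requests examples). When the with block finishes, __exit__() is called; for a requests response, __exit__() closes the response, which is why the code never calls close explicitly. The program later in this post includes an example that shows what with ... as does.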
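To make the __enter__/__exit__ order concrete, here is a minimal sketch using only the standard library. The class ManagedResource and its events list are hypothetical names made up for this illustration; they are not part of requests.

```python
class ManagedResource:
    """Hypothetical resource that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this return value is what `as` binds to

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Runs when the with block ends, even if the block raised.
        self.events.append("exit")
        return False  # do not swallow exceptions


with ManagedResource() as res:
    res.events.append("body")

print(res.events)
```

The recorded order is enter, then the block body, then exit, which is the same mechanism that lets a requests response be closed without an explicit close call.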

requests has many more powerful features; see: https://www.cnblogs.com/lilinwei340/p/6417689.html.

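(2) bs4 (BeautifulSoup)

Anyone who has used Java will know jsoup, which parses HTML into a collection of tags. bs4 works along the same lines, and its API is largely similar. For example, to extract the link texts 新闻 (News), 地图 (Maps), 视频 (Video), and 贴吧 (Tieba) from the HTML shown below the code, all we need is:

soup = BeautifulSoup(html, 'html.parser')
atags = soup.find('div', {'id': 'u1'}).findChildren('a', {'class': 'mnav'})
values = []
for atag in atags:
    values.append(atag.text)

The code above does what we need. Another way to parse HTML in Python is XPath in the Scrapy framework; I'll cover that when I write about Scrapy.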

<html>

<head>

    <meta http-equiv=content-type content=text/html;charset=utf-8>

    <meta http-equiv=X-UA-Compatible content=IE=Edge>

    <meta content=always name=referrer>

    <link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>

    <title>百度一下，你就知道</title></head>

<body link=#0000cc>

<div id=wrapper>

    <div id=head>

        <div >            <div >                <div >                    <div id=lg><img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129></div>

                    <form id=form name=f action=//www.baidu.com/s >                        <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden

                                                                                                          name=rsv_bp

                                                                                                          value=1>

                        <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span

                                ><input id=kw name=wd >                                                           autocomplete=off autofocus></span><span

                                ><input type=submit id=su value=百度一下 ></span></form>

                </div>

            </div>

            <div id=u1><a href=http://news.baidu.com name=tj_trnews >                                                                                         name=tj_trhao123 >                <a href=http://map.baidu.com name=tj_trmap >                                                                                >                        href=http://tieba.baidu.com name=tj_trtieba >                <noscript><a

                        href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1

                        name=tj_login >                <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=' + encodeURIComponent(window.location.href + (window.location.search === "" ? "?" : "&") + "bdorz_come=1") + '" name="tj_login" >登录</a>');</script>

                <a href=//www.baidu.com/more/ name=tj_briicon >        </div>

    </div>

    <div id=ftCon>

        <div id=ftConw><p id=lh><a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a></p>

            <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a>  <a

                    href=http://jianyi.baidu.com/ >                    src=//www.baidu.com/img/gs.gif></p></div>

    </div>

</div>

</body>

</html>
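Since the page source above is long and hard to read, here is a small self-contained sketch of the same extraction. The HTML fragment is a simplified stand-in written for this illustration (an assumption, not the original page source), and find_all is used as the current name for findChildren.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the navigation bar; NOT the original page source.
html = """
<div id="u1">
  <a href="http://news.baidu.com" class="mnav">新闻</a>
  <a href="http://map.baidu.com" class="mnav">地图</a>
  <a href="http://tieba.baidu.com" class="mnav">贴吧</a>
  <a href="//www.baidu.com/more/" class="bri">more</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
nav = soup.find("div", {"id": "u1"})          # the container div
atags = nav.find_all("a", {"class": "mnav"})  # only links with class mnav
values = [atag.get_text(strip=True) for atag in atags]
print(values)
```

The last link is skipped because its class is not mnav, which is why filtering on the class attribute is enough to pick out the navigation entries.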

(3) re

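The re regular-expression module is powerful. It offers APIs such as match, search, and sub (sub is the one that performs replacement), each with its own strengths; see: http://www.runoob.com/python3/python3-reg-expressions.html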
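As a rough illustration of how these functions differ, the sketch below runs them on one sample string. The string and the variable names are made up for this example.

```python
import re

line = "banner 2017-11-05 id=70"  # hypothetical sample text

# match only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"\d{4}", line)

# search scans the whole string and returns the first match.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", line)
year, month, day = (int(g) for g in m.groups())

# sub replaces every match with the given replacement.
masked = re.sub(r"\d", "#", line)

# findall returns all captured groups as a list of strings.
ids = re.findall(r"id=(\d+)", line)

print(at_start, (year, month, day), masked, ids)
```

Note that match returns None here because the string does not start with digits, while search finds the date in the middle of the string.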

(4) sqlalchemy

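SQLAlchemy is a database ORM framework for Python. I tried it and found it pleasant to use; it is somewhat similar to Hibernate in Java, but more flexible.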
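For readers who have not used an ORM, here is a minimal sketch of the idea: a class is mapped to a table, and objects are saved and queried through a session. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite database so that it is self-contained, whereas the program in this post uses MySQL. The Banner class and the banner_demo table are hypothetical names for this sketch only.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Banner(Base):
    """Hypothetical mapped class used only in this sketch."""
    __tablename__ = "banner_demo"
    id = Column(Integer, primary_key=True)
    title = Column(String(255))


# In-memory SQLite keeps the sketch self-contained (the article uses MySQL).
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # create the table from the mapping
Session = sessionmaker(bind=engine)

session = Session()
session.add_all([Banner(title="first"), Banner(title="second")])
session.commit()

titles = [b.title for b in session.query(Banner).order_by(Banner.id)]
session.close()
print(titles)
```

The same pattern (declarative base, mapped class, sessionmaker bound to an engine) is what the datasource and DAO modules in this post are built on.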

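Enough talk; time to post the crawler script. Below is the directory structure. As a professional programmer I can't just write a mess, so some architecture is in order, haha.

------youku_any  # package name

--------------datasource.py  # manages the data source session

--------------youkubannerdao.py  # DAO layer for the Youku banner data scraped by the program

--------------youkuservice.py  # the business logic, needless to say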

One more thing: creating the table. No further explanation needed:

CREATE TABLE `youku_banner` (
  `id` bigint(22) NOT NULL AUTO_INCREMENT,
  `type` int(2) NOT NULL,  # Youku banner type: 1 = TV, 2 = movie, 3 = variety show
  `year` int(4) NOT NULL,
  `month` int(2) NOT NULL,
  `date` int(2) NOT NULL,
  `hour` int(2) NOT NULL,
  `minute` int(2) NOT NULL,
  `img` varchar(255) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_uniq` (`year`,`month`,`date`,`hour`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=83 DEFAULT CHARSET=utf8mb4

Next comes the implementation:

datasource.py

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

dburl = 'mysql+pymysql://root:123@localhost/youku?charset=utf8'
# pool_size is set to 100; pooled connections are recycled after 3600 s
ds = create_engine(dburl, pool_size=100, pool_recycle=3600)

Session = sessionmaker(bind=ds)
# session = Session()

# Session manager class
class SessionManager():
    def __init__(self):
        self.session = Session()

    def __enter__(self):
        return self.session

    # The author leaves the session to the connection pool and does not close it explicitly.
    # Note: calling self.session.close() here would return the connection to the pool sooner.
    def __exit__(self, exc_type, exc_val, exc_tb):
        # session.close()
        print('not close')

youkubannerdao.py

from sqlalchemy import Sequence, Column, Integer, BigInteger, String, TIMESTAMP, text
from sqlalchemy.ext.declarative import declarative_base
from youku_any.datasource import SessionManager

Base = declarative_base()

# Inherit from the base class Base
class YoukuBanner(Base):
    # Table name
    __tablename__ = 'youku_banner'

    # Column mappings
    id = Column(BigInteger, Sequence('id'), primary_key=True)
    type = Column(Integer)
    year = Column(Integer)
    month = Column(Integer)
    date = Column(Integer)
    hour = Column(Integer)
    minute = Column(Integer)
    img = Column(String(255))
    title = Column(String(255))
    url = Column(String(255))
    createTime = Column('create_time', TIMESTAMP)

    def add(self):
        # with ... as first runs SessionManager.__enter__(); __exit__() runs when the block ends
        with SessionManager() as session:
            try:
                session.add(self)
                session.commit()
            except Exception:
                session.rollback()

    def addBatch(self, values):
        with SessionManager() as session:
            try:
                session.add_all(values)
                session.commit()
            except Exception:
                session.rollback()

    def select(self, param):
        # Returns a lazy query; rows are fetched when the caller iterates over it
        with SessionManager() as session:
            return session.query(YoukuBanner).select_from(YoukuBanner).filter(param)

    def remove(self, parma):
        with SessionManager() as session:
            try:
                session.query(YoukuBanner).filter(parma).delete(synchronize_session='fetch')
                session.commit()
            except Exception:
                session.rollback()

    def update(self, param, values):
        with SessionManager() as session:
            try:
                session.query(YoukuBanner).filter(param).update(values, synchronize_session='fetch')
                session.commit()
            except Exception:
                session.rollback()

youkuservice.py

import requests
import json
import re
import datetime
from bs4 import BeautifulSoup
from sqlalchemy import text
from youku_any.youkubannerdao import YoukuBanner


def getsoup(url):
    with requests.get(url, params=None, headers=None) as req:
        # Default to utf-8 so that encode is always defined
        encode = 'utf-8'
        if req.encoding != 'utf-8':
            encodings = requests.utils.get_encodings_from_content(req.text)
            if encodings:
                encode = encodings[0]
            else:
                encode = req.apparent_encoding
        # Re-encode the page as utf-8 before parsing
        encode_content = req.content.decode(encode).encode('utf-8')
        soup = BeautifulSoup(encode_content, 'html.parser')
        return soup


def getbanner(soup):
    # The banner data sits in the second inline script inside this div
    bannerDivP = soup.find('div', {'id': 'm_86804', 'name': 'm_pos'})
    bannerScript = bannerDivP.findChildren('script', {'type': 'text/javascript'})[1].text
    # Pull the JSON array literal out of the script text
    m = re.search(r'\[.*\]', bannerScript)
    banners = json.loads(m.group())
    for banner in banners:
        now = datetime.datetime.now()
        youkubanner = YoukuBanner(type=1, year=now.year, month=now.month, date=now.day, hour=now.hour,
                                  minute=now.minute,
                                  img=banner['img'], title=banner['title'], url=banner['url'])
        youkubanner.add()


soup = getsoup('http://tv.youku.com/')
getbanner(soup)

youkuBanner = YoukuBanner()
youkuBanner.remove(parma=text('id=67 or id=71'))
# '呼啸山庄' means 'Wuthering Heights'
youkuBanner.update(param=text('id=70'), values={'title': YoukuBanner.title + '呼啸山庄'})
for i in range(0, 10000):
    youkuBanner.update(param=text('id=70'), values={'title': YoukuBanner.title + '呼啸山庄'})
    bannerList = youkuBanner.select(param=text('id > 66 and id < 77 order by id asc limit 0,7'))
    print("lines--------%d" % i)
    # time.sleep(10)
    for banner in bannerList:
        print(banner.id, banner.minute, banner.img, banner.title)
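The banner extraction in getbanner comes down to two steps: find the JSON array embedded in a script with a regular expression, then parse it with json.loads. Here is a minimal sketch of just that step. The script text is a made-up stand-in (the real page's markup is not shown in this post and may differ), and the field values are hypothetical.

```python
import json
import re

# Hypothetical inline script content; NOT taken from the real page.
script_text = (
    'var banners = [{"img": "a.jpg", "title": "Show A", "url": "/a"}, '
    '{"img": "b.jpg", "title": "Show B", "url": "/b"}];'
)

# Greedy match from the first "[" to the last "]" captures the whole array.
match = re.search(r"\[.*\]", script_text, re.S)
banners = json.loads(match.group()) if match else []

titles = [banner["title"] for banner in banners]
print(titles)
```

Checking match before calling group() avoids an AttributeError when the page layout changes and the array can no longer be found.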

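With that, a simple crawler script is done. I'm fairly pleased with what two weekend days produced, but this is only the tip of the Python iceberg; there is plenty more waiting to be explored.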
