BeautifulSoup爬虫基础知识

安装beautiful soup模块

　　Windows:

　　　　pip install beautifulsoup4

　　Linux:

　　　　apt-get install python-bs4

BS4解析器比较

BS官方推荐使用lxml作为解析器，因为其速度快，也比较稳定。那么lxml解析器是怎么安装的呢？

Windows下安装lxml方法：

　　1、pip安装

　　　　pip install lxml

　　　　安装出错，原因是需要Visual c++,在windows下通过pip安装lmxl总会出现问题，如果你非要使用pip去安装的话，就把依赖一一解决了再pip.

　　2、手工安装

　　　　1、先在http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml下载符合自己系统版本的lmxl，如lxml‑3.6.4‑cp27‑cp27m‑win_amd64.whl

　　　　2、安装wheel模块，pip install wheel

　　　　3、安装lxml模块，pip install lxml‑3.6.4‑cp27‑cp27m‑win_amd64.whl

Linux下安装lxml方法：

　　apt-get install python-lxml

BS4解析器的使用

<!DOCTYPE html>

<html lang="en">

<head>

    <meta charset="UTF-8">

    <title>武汉旅游景点</title>

</head>

<body>

    <div id="content">

        <div class="title">

            <h3>武汉景点</h3>

        </div>

        <ul class="table">

            <li>景点<a>门票价格</a></li>

        </ul>

        <ul class="content">

            <li nu="">东湖<a class="price"></a></li>

            <li nu="">磨山<a class="price"></a></li>

            <li nu="">欢乐谷<a class="price"></a></li>

            <li nu="">海昌极地海洋世界<a class="price"></a></li>

            <li nu="" src="http://mm.howkuai.com/wp-content/uploads/2017a/03/06/limg.jpg">玛雅水上乐园<a class="price"></a></li>

        </ul>

    </div>

</body>

</html>

#!/usr/bin/env python

# _*_ coding:utf- _*_

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("scenery.html"),"lxml")

print soup.prettify()

简单的使用

字符集的问题　　

　　当一个文件或网页导入BeautifulSoup之后，它会自动地很快猜测出文件或网页的常用字符编码，如果不能自动猜测出来的话可以用exclude_encoding和from_encoding来处理。

　　　　排除某种编码

　　　　soup = BeautifulSoup(open("scenery.html"),exclude_encodings=["iso-8859-7","gb2312"])

　　　　使用某种编码
　　　　soup = BeautifulSoup(open("scenery.html"),from_encoding="big5")

BS解析的原理

　　bs4将网页节点解析成了一个个Tag，然后根据标签名称、标签属性名称、标签属性值及顺序等将数据过滤出来。

　　1、根据标签名称查找标签

　　　　soup.TagName

　　　　soup.find(TagName)

　　　　soup.find_all(TagName)

#!/usr/bin/env python

# _*_ coding:utf-8 _*_

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("scenery.html"),"lxml")

# 解析出第一个a标签

print soup.a

print soup.find("a")

# 解析出所有a标签

print soup.find_all("a")

结果：

<a>门票价格</a>

<a>门票价格</a>

[<a>\u95e8\u7968\u4ef7\u683c</a>, <a class="price">60</a>, <a class="price">60</a>, <a class="price">108</a>, <a class="price">150</a>, <a class="price">150</a>]

　2、标签名称相同时，外加属性值解析数据

　　特殊写法：仅适用于查找class的内容，可以理解为专为class而设

　　　　soup.find(TagName,[attrsName])

　　　　soup.find_all(TagName,[attrsName])

　　万能写法,还可用于解析自定义属性：

　　　　soup.find(TagName,attrs={AttrName:AttrValue})

　　　　soup.find_all(TagName,attrs={AttrName:AttrValue})

#!/usr/bin/env python

# _*_ coding:utf-8 _*_

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("scenery.html"),"lxml")

# 解析出第一个属性值为price的a标签

print soup.find("a","price")

print soup.find("a",attrs={"class":"price"})

# 解析出所有属性值为price的a标签

print soup.find_all("a","price")

结果： <a class="price">60</a> <a class="price">60</a> [<a class="price">60</a>, <a class="price">60</a>, <a class="price">108</a>, <a class="price">150</a>, <a class="price">150</a>] <li nu="1">东湖<a class="price">60</a></li>

解析标签值

# 解析属性值

print soup.find("li",attrs={"nu":"1"}).get("nu")

# 解析文本

print soup.find("li",attrs={"nu":"1"}).a.get_text()

结果：

1

60

显示属性的值

# 解析出属性nu=1的li标签

nu5 = soup.find("li",attrs={"nu":"5"})

# 解析nu=5的li标签的src属性值

print nu5.attrs['src']

根据文本找标签

r = re.compile("texttest")

soup.find("a",text=r).parent

查找内容为texttest的a标签的父标签

到此为止可以用BeautifulSoup做些简单的爬虫了。

用BeautifulSoup写一个简单的处理百度贴吧的例子，爬取百度贴吧中权利的游戏的贴子。

 #!/usr/bin/env python

 # _*_ coding:utf- _*_

 import urllib2

 from bs4 import BeautifulSoup

 import itemWrite

 class Item(object):

     title = None

     firstAuthor = None

     firstTime = None

     reNum = None

     content = None

     lastAuthor = None

     lastTime = None

 class GetTiebaInfo(object):

     def __init__(self,url):

         self.url = url

         self.pageSum =

         self.urls = self.getUrls(self.pageSum)

         self.items = self.spider(self.urls)

         self.itemWrite("test.txt",self.items)

     def getUrls(self,pageSum):

         urls = []

         pns = [str(i*) for i in range(pageSum)]

         ul = self.url.split("=")

         for pn in pns:

             ul[-] = pn

             tmp = "=".join(ul)

             urls.append(tmp)

         return urls

     def getResponseContent(self,url):

         try:

             response = urllib2.urlopen(url.encode("utf8"))

             return response.read()

         except:

             print "url open faild"

             return None

     def spider(self,urls):

         items = []

         for url in urls:

             htmlContent = self.getResponseContent(url)

             soup = BeautifulSoup(htmlContent,'lxml')

             tagsli = soup.find_all("li",attrs={"class":" j_thread_list clearfix"})

             for tag in tagsli:

                 item = Item()

                 item.title = tag.find("a",attrs={"class":"j_th_tit "}).get_text().strip()

                 try:

                     item.firstAuthor = tag.find("span","frs-author-name-wrap").a.get_text().strip()

                 except:

                     item.firstAuthor = 'zzz'

                 item.firstTime = tag.find("span","pull-right is_show_create_time").get_text().strip()

                 item.reNum = tag.find("span",attrs={"title":u"回复"}).get_text().strip()

                 item.content = tag.find("div",attrs={"class":"threadlist_abs threadlist_abs_onlyline "}).get_text().strip()

                 item.lastAuthor = tag.find("span",attrs={"class":"tb_icon_author_rely j_replyer"}).a.get_text().strip()

                 item.lastTime = tag.find("span",attrs={"title":u"最后回复时间"}).get_text().strip()

                 items.append(item)

         return items

     def itemWrite(self,filename,items):

         itemWrite.writeTotxt(filename,items)

 if __name__ == '__main__':

     url = u'http://tieba.baidu.com/f?kw=权利的游戏&ie=utf-8&pn=0'

     Get = GetTiebaInfo(url)

完整代码

 #!/usr/bin/env python

 # _*_ coding:utf- _*_

 # 写到文本文件

 def writeTotxt(fileName,items):

     with open(fileName,'w') as fp:

         for item in items:

             fp.write("title:%s\t author:%s\t firstTime:%s\n content:%s\n reNum:%s\t lastAuthor:%s\t lastTime:%s\n\n"

                      %(item.title.encode("utf8"),item.firstAuthor.encode("utf8"),item.firstTime.encode("utf8"),item.content.encode("utf8"),item.reNum.encode("utf8"),item.lastAuthor.encode("utf8"),item.lastTime.encode("utf8")))

 # 写到Excel文件

 # 写到DB

itemWrite

itemWrite中我只写了一个将数据写入文本的函数，还有写入excel和db的函数没有完善，因为都很简单，不想写了，有个意思就行了。

BeautifulSoup爬虫基础知识的更多相关文章

python 爬虫基础知识一
网络爬虫(又被称为网页蜘蛛,网络机器人,在FOAF社区中间,更经常的称为网页追逐者),是一种按照一定的规则,自动的抓取万维网信息的程序或者脚本. 网络爬虫必备知识点 1. Python基础知识2. P ...
Python爬虫基础知识入门一
一.什么是爬虫,爬虫能做什么爬虫,即网络爬虫,大家可以理解为在网络上爬行的一直蜘蛛,互联网就比作一张大网,而爬虫便是在这张网上爬来爬去的蜘蛛咯,如果它遇到资源,那么它就会抓取下来.比如它在抓取一个网 ...
Python 爬虫基础知识
requests Python标准库中提供了:urllib.urllib2.httplib等模块以供Http请求,但是,它的 API 太渣了.它是为另一个时代.另一个互联网所创建的.它需要巨量的工作, ...
网络爬虫基础知识（Python实现）
浏览器的请求 url=请求协议(http/https)+网站域名+资源路径+参数 http:超文本传输协议(以明文的形式进行传输),传输效率高,但不安全. https:由http+ssl(安全套接子层 ...
Python归纳 | 爬虫基础知识
1. urllib模块库 Urllib是python内置的HTTP请求库,urllib标准库一共包含以下子包: urllib.error 由urllib.request引发的异常类 urllib.pa ...
【VB6】使用VB6创建和访问Dom树【爬虫基础知识】
使用VB6创建和访问Dom树关键字:VB,DOM,HTML,爬虫,IHTMLDocument 我们知道,在VB中一般大家会用WebBrowser来获取和操作dom对象. 但是,有这样一种情形,却让我 ...
Scrapy爬虫学习笔记 - 爬虫基础知识
一.正则表达式二.深度和广度优先三.爬虫去重策略
python 爬虫基础知识(继续补充)
学了这么久爬虫,今天整理一下相关知识点,还会继续更新 HTTP和HTTPS HTTP协议(HyperText Transfer Protocol,超文本传输协议):是一种发布和接收 HTML页面的方法 ...
自学Python四爬虫基础知识储备
首先,推荐两个关于python爬虫不错的博客:Python爬虫入门教程专栏和 Python爬虫学习系列教程 .写的都非常不错,我学习到了很多东西!在此,我就我看到的学到的进行总结一下! 爬虫就是 ...

随机推荐

CAJ Viewer安装流程以及CAJ或Pdf转换为Word格式
不多说,直接上干货! pdf转word格式,最简单的就是,实用工具 Adobe Acrobat DC 首先声明的是,将CAJ或者Pdf转换成Word文档,包括里面的文字.图片以及格式,根本不需 ...
根据屏幕尺寸计算rem
!(function (doc, win) { var docEle = doc.documentElement, evt = "onorientationchange" in w ...
面试题-----ICMP协议简介
ICMP协议简介 l ICMP网际控制报文协议,通过它可以知道故障的具体原因和位置. l 由于IP不是为可靠传输服务设计的,ICMP的目的主要是用于在TCP/IP网络中发送出错和控制消息. l ...
Nginx反向代理实现会话（session）保持的两种方式（转）
http://blog.csdn.net/gaoqiao1988/article/details/53390352 一.ip_hash: ip_hash使用源地址哈希算法,将同一客户端的请求总是发往同 ...
layer相关使用
父子页面传参数转自:https://blog.csdn.net/babyxue/article/details/76854106 1.父页面打开子页面并向子页面传参数 function setCho ...
Struts2配置文件struts.xml的编辑自动提示代码功能
第一步:复制struts.xml头部地址第二步:Window --->Preferences 第三步:XML--->XML Catalog--->Add 第四步:在Key中粘贴复制 ...
c语言----<项目>_小游戏<2048>
2048 小游戏主要是针对逻辑思维的一个训练. 主要学习方面:1.随机数产生的概率.2.行与列在进行移动的时候几种情况.3.MessageBox的使用 #include <iostream&g ...
Golang 并发Groutine详解
概述 1.并行和并发并行(parallel):指在同一时刻,有多条指令在多个处理器上同时执行. 并发(concurrency):指在同一时刻只能有一条指令执行,但多个进程指令被快速的轮换执行,使得在 ...
[转]How to log queries using Entity Framework 7?
本文转自:https://stackoverflow.com/questions/26747837/how-to-log-queries-using-entity-framework-7
Node.js Cookie管理
Cookie 管理我们可以使用中间件向 Node.js 服务器发送 cookie 信息,以下代码输出了客户端发送的 cookie 信息: var express=require('express') ...

BeautifulSoup爬虫基础知识

BeautifulSoup爬虫基础知识的更多相关文章

随机推荐

热门专题