A First Look at Web Scraping, Day 3: BeautifulSoup

Extracting information

1. Using the attributes and methods of Tag objects

#!/usr/bin/python

# -*- coding: utf-8 -*-

from urllib.request import urlopen

from bs4 import BeautifulSoup

import re

html = urlopen('https://www.cnblogs.com/pcat/p/5398997.html')

soup = BeautifulSoup(html.read().decode('utf-8'), 'html.parser')  # decode as UTF-8 first to avoid garbled characters

# soup.<tag> (for example soup.a) returns only the first matching tag

print(soup)

print(soup.a)

print(soup.a.name)

print(soup.a.attrs)

print(soup.a.string)

print(soup.html.get_text())  # all text inside <html> as a single string, with the original whitespace and line breaks
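The attribute access above can be tried offline. The following is a minimal sketch, assuming a small inline HTML string (the variable doc and its contents are made up for illustration) in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, so the example runs without network access.
doc = ('<html><head><title>Demo</title></head><body>'
       '<p>Intro <a class="menu" href="/home">Home</a> '
       '<a href="/about">About</a></p></body></html>')

soup = BeautifulSoup(doc, 'html.parser')

first_link = soup.a              # first <a> tag in document order
print(first_link.name)           # a
print(first_link.attrs)          # {'class': ['menu'], 'href': '/home'}
print(first_link.string)         # Home
print(soup.html.get_text())      # DemoIntro Home About
```

Note that class comes back as a list, because an HTML class attribute may hold several values.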

2. Using the find_all() method of the tag tree

aS = soup.find_all('a')

for i in aS:

    print(i)

    #print(i.name)

    #print(i.attrs)

    #print(i.string)

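# find_all() accepts filters (name, attrs, string, recursive; text is the older name for string), and several filters can be combined

hrefs = soup.find_all(href=re.compile('pcat$'))  # links whose href ends with "pcat"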

for i in hrefs:

    print(i)

# class is a reserved word in Python, so use class_ when searching by CSS class name

a = soup.find_all(class_ = 'postDesc')

print(a)

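# Note: the string (formerly text) argument matches a tag's text content rather than its attributes. A list such as ["a", "b"] matches any of the listed values.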
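The filters described above can be compared side by side. This is a small sketch, assuming a made-up HTML fragment (the example.com URLs and the variable names are illustrative only):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; the URLs are placeholders.
doc = '''
<div class="postDesc">posted by <a href="https://example.com/pcat">pcat</a></div>
<p><a href="https://example.com/other">other</a> <b>bold</b></p>
'''
soup = BeautifulSoup(doc, 'html.parser')

links = soup.find_all('a')                                # filter by tag name
ends_with_pcat = soup.find_all(href=re.compile('pcat$'))  # filter by attribute with a regex
by_class = soup.find_all(class_='postDesc')               # filter by CSS class
by_text = soup.find_all(string='other')                   # returns the matching strings, not tags
several = soup.find_all(['a', 'b'])                       # a list matches any of its items
combined = soup.find_all('a', href=re.compile('example'), string='pcat')  # all filters must match

print(len(links), len(several), combined)
```

When string is combined with a tag name, find_all() returns the tags whose .string matches, rather than the strings themselves.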

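3. Using the find() method of the tag tree

# find() returns a single tag (or None if nothing matches); find_all() returns a list of all matches

#find

e1 = soup.find('head').find('title')  # look for the title tag inside the head tag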

print(e1)
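A short sketch of the difference between find() and find_all(), assuming a made-up inline document (doc and the variable names are illustrative). It also shows the None check that chained find() calls need when a tag may be missing:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document.
doc = ('<html><head><title>Demo page</title></head>'
       '<body><a href="/a">A</a><a href="/b">B</a></body></html>')
soup = BeautifulSoup(doc, 'html.parser')

title = soup.find('head').find('title')  # chained lookup: <title> inside <head>
first = soup.find('a')                   # a single Tag
all_links = soup.find_all('a')           # a list of Tags
missing = soup.find('table')             # no match, so None

if missing is None:
    print('no <table> found')
print(title.string, first['href'], len(all_links))
```

Because find() can return None, a chain such as soup.find('head').find('title') raises AttributeError when the first lookup fails, so checking the intermediate result is safer on unknown pages.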

4. Using CSS selectors

# Tag name

soup.select('p')  # all tags named p

soup.select('p a')  # all a tags that are descendants of a p tag, at any depth

soup.select('p > a')  # all a tags that are direct children of a p tag, one level down only

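# Class name

soup.select('.blogStats')  # all tags with the class blogStats

soup.select('.blogStats span')  # all span tags that are descendants of a tag with the class blogStats

soup.select('a.menu')  # all a tags with the class menu

e1 = soup.select('a.menu')  # same selector, this time keeping the result to read each href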

for i in e1:

    print(i['href'])

# id

soup.select('#stats_post_count')  # the tag whose id is stats_post_count

soup.select('#navList #blog_nav_sitehome')  # the tag with id blog_nav_sitehome that is a descendant of the tag with id navList

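# Attributes

soup.select('a[href]')  # all a tags that have an href attribute

soup.select('a[href="https://www.cnblogs.com/pcat/"]')  # all a tags whose href is exactly https://www.cnblogs.com/pcat/

soup.select('a[href^="http"]')  # all a tags whose href starts with http

soup.select('a[href$="pcat"]')  # all a tags whose href ends with pcat

soup.select('a[href*="cnblogs"]')  # all a tags whose href contains cnblogs

# Tag names, class names, ids and attribute selectors can be combined freely, using a space (descendant) or '>' (direct child) between them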
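The selectors above can be checked against a small fragment. This is a sketch under the assumption that the page has roughly this shape; the markup below is invented for illustration and only reuses the class and id names from the selectors above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class/id names are taken from the selectors above.
doc = '''
<div id="navList"><a id="blog_nav_sitehome" class="menu" href="https://www.cnblogs.com/">home</a>
<a class="menu" href="https://www.cnblogs.com/pcat">blog</a></div>
<div class="blogStats"><span id="stats_post_count">posts - 1</span></div>
<p><a href="/local">local</a> <i><a href="/nested">nested</a></i></p>
'''
soup = BeautifulSoup(doc, 'html.parser')

descendants = soup.select('p a')          # both links under <p>, at any depth
children = soup.select('p > a')           # only the link directly under <p>
menu = [a['href'] for a in soup.select('a.menu')]
spans = soup.select('.blogStats span')
by_id = soup.select('#navList #blog_nav_sitehome')
starts = soup.select('a[href^="http"]')   # href starts with http
ends = soup.select('a[href$="pcat"]')     # href ends with pcat
contains = soup.select('a[href*="cnblogs"]')

print(len(descendants), len(children), menu)
```

select() always returns a list, even for an id selector; select_one() returns the first match or None.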

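Traversal

1. Downward traversal

<tag>.contents	Returns all direct children of the Tag as a list
<tag>.children	Iterates over all direct children of the Tag
<tag>.descendants	Iterates over all descendants of the Tag
<tag>.strings	Iterates over all text strings (not attribute values) inside the Tag and its descendants
<tag>.stripped_strings	Same as .strings, but with surrounding whitespace removed and whitespace-only strings skipped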

#contents

e1=soup.ul.contents

print(type(e1))

print(len(e1))

#children

e1=soup.ul.children

for i in e1:

    print(i)

#descendants

e1=soup.ul.descendants

for i in e1:

    print(i)

#strings

e1=soup.ul.strings

for i in e1:

    print(i)

#stripped_strings

e1=soup.ul.stripped_strings

for i in e1:

    print(i)
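To see how the five downward-traversal attributes differ, here is a minimal sketch on a tiny made-up list (doc is illustrative, not taken from the page above):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one whitespace-only string between the <li> tags.
doc = '<ul><li>one</li> <li><b>two</b></li></ul>'
soup = BeautifulSoup(doc, 'html.parser')
ul = soup.ul

contents = ul.contents                # a list: [<li>one</li>, ' ', <li><b>two</b></li>]
children = list(ul.children)          # same nodes, produced by an iterator
descendants = list(ul.descendants)    # every node below <ul>, depth first
strings = list(ul.strings)            # ['one', ' ', 'two']
stripped = list(ul.stripped_strings)  # ['one', 'two']

print(type(contents), len(contents), len(descendants))
```

The whitespace between tags counts as a child node, which is why len(contents) is 3 here rather than 2.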

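2. Upward traversal

parent	Returns the tag's direct parent node (a single node, not a list)
parents	Iterates over all of the tag's ancestors, from the nearest outward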
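A short sketch of upward traversal, assuming a small nested document invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical nested document.
doc = '<html><body><div><p><a href="/x">x</a></p></div></body></html>'
soup = BeautifulSoup(doc, 'html.parser')
link = soup.a

parent_name = link.parent.name               # 'p'
ancestors = [t.name for t in link.parents]   # nearest ancestor first

print(parent_name)
print(ancestors)  # ['p', 'div', 'body', 'html', '[document]']
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'; its own parent is None.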

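3. Sideways traversal

next_sibling	Returns the Tag's next sibling node in document order
previous_sibling	Returns the Tag's previous sibling node in document order
next_siblings	Iterates over all siblings that follow the Tag, in document order
previous_siblings	Iterates over all siblings that precede the Tag, nearest first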
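The sibling attributes can be sketched as follows, assuming a made-up list written without whitespace between the tags so that every sibling is a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical list with no whitespace between the <li> tags.
doc = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(doc, 'html.parser')
first, second, last = soup.find_all('li')

after_second = second.next_sibling.string               # 'c'
before_second = second.previous_sibling.string          # 'a'
following = [li.string for li in first.next_siblings]   # ['b', 'c']
preceding = [li.string for li in last.previous_siblings]  # ['b', 'a']

print(after_second, before_second, following, preceding)
print(first.previous_sibling)  # None: the first child has no previous sibling
```

On real pages the tags are usually separated by line breaks, so a sibling is often a whitespace string rather than a tag; check the node type before using tag attributes on it.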
