Python 3 Web Scraping 03 (find_all usage and more)

# Contents of read1.html
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p></body></html>

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import os
import re
from bs4 import BeautifulSoup, NavigableString

# Locate read1.html next to this script.
curpath = os.path.dirname(os.path.realpath(__file__))
htmlpath = os.path.join(curpath, 'read1.html')

# requests.get() only accepts URLs, so read the local file directly.
with open(htmlpath, "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, features="html.parser")

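# Print every text string with surrounding whitespace stripped.
# (The loop variable is not called str, to avoid shadowing the built-in.)
for text in soup.stripped_strings:
    print(repr(text))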

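links = soup.find_all(class_="sister")

# find_all() returns a ResultSet (a list of tags); navigation attributes
# such as .parents and .next_sibling belong to individual tags.
first_link = links[0]
for parent in first_link.parents:
    print(parent.name)

print(first_link.next_sibling)

for link in links:
    print(link.next_element)
# After the loop, link refers to the last matched tag.
print(link.next_sibling)

print(link.previous_element)
print(link.previous_sibling)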

def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

def not_tillie(href):
    return href and not re.compile("tillie").search(href)

# Despite its name, this filter rejects ids containing "link2" (Lacie's link).
def not_tillie1(id_value):
    return id_value and not re.compile("link2").search(id_value)

# Parse soup.html with the lxml parser (requires the lxml package).
with open("soup.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, features="lxml")

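# find_all usage: each call below overwrites tags, so only the last result is printed.
tags = soup.find_all(re.compile('^b'))   # tag names starting with "b", e.g. body and b
tags = soup.find_all('b')                # exact tag name
tags = soup.find_all(['a', 'b'])         # any tag name in the list
tags = soup.find_all(has_class_no_id)    # filter function applied to each tag
tags = soup.find_all(True)               # every tag in the document
tags = soup.find_all(href=not_lacie)     # filter function applied to the href value
for tag in tags:
    print(tag.name)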

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

tags=soup.find_all(id=not_tillie1)
for tag in tags:
    print(tag)

tags=soup.find_all(attrs={"id":"link3"})
for tag in tags:
    print(tag)

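# recursive=False limits the search to direct children of the soup object.
tags = soup.find_all(recursive=False)

# CSS selectors via select(); only the last result is printed below.
tags = soup.select("body a")                     # <a> anywhere inside <body>
tags = soup.select("p > a")                      # <a> that is a direct child of <p>
tags = soup.select("p > #link1")                 # direct child of <p> with id link1
tags = soup.select("html head title")            # nested descendants
tags = soup.select(".sister")                    # by class
tags = soup.select("[class~=sister]")            # class list contains "sister"
tags = soup.select("#link1 + .sister")           # sibling immediately after #link1
tags = soup.select("#link1")                     # by id
tags = soup.select("a#link1")                    # tag name plus id
tags = soup.select("a[href]")                    # attribute exists
tags = soup.select('a[href^="http://example"]')  # attribute starts with
tags = soup.select('a[href$="tillie"]')          # attribute ends with
tags = soup.select('a[href*=".com/el"]')         # attribute contains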
for tag in tags:
    print(tag)

# Parse once: a second BeautifulSoup(file, ...) call on the same handle
# would read an exhausted file and yield an empty document.
with open("soup.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, features="html.parser")
print(soup.prettify())
print(type(soup))
print(type(soup.title))
print(type(soup.title.string))
print(type(soup.b.string))

print(soup.head.name)
print(soup.title.name)
print(soup.a.name)
print(soup.name)

tag=soup.a
print(tag["href"])
print(tag.string)
print(tag["class"])
print(tag.attrs)

print(soup.title.string)
print(soup.title.name)
print(soup.p.attrs)
print(soup.a.attrs)
print(soup.a["class"])
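
The filters and selectors above can be tried without any files on disk. The following is a minimal sketch, assuming the markup of read1.html is inlined as a string; the name HTML_DOC and the result variables are introduced here for illustration only, and not_lacie is written with a plain substring test instead of the regular expression used above.

```python
from bs4 import BeautifulSoup

# Assumption: the same markup as read1.html, inlined so the example is self-contained.
HTML_DOC = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(HTML_DOC, "html.parser")


def has_class_no_id(tag):
    # Tag filter: find_all() calls this with each Tag object.
    return tag.has_attr("class") and not tag.has_attr("id")


def not_lacie(href):
    # Attribute filter: find_all() passes the attribute value,
    # or None when the tag has no href at all.
    return href is not None and "lacie" not in href


# The three <p> tags carry a class but no id.
no_id_names = [tag.name for tag in soup.find_all(has_class_no_id)]

# Links whose href does not mention "lacie".
not_lacie_ids = [tag["id"] for tag in soup.find_all(href=not_lacie)]

# The .sister element immediately following #link1.
sibling_ids = [tag["id"] for tag in soup.select("#link1 + .sister")]

# Links whose href ends with "tillie".
tillie_ids = [tag["id"] for tag in soup.select('a[href$="tillie"]')]

print(no_id_names)    # ['p', 'p', 'p']
print(not_lacie_ids)  # ['link1', 'link3']
print(sibling_ids)    # ['link2']
print(tillie_ids)     # ['link3']
```

Collecting tag names or ids into lists, rather than reassigning a single tags variable, keeps each result available for comparison.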
