Python 3 web scraping (find_all usage and more)

# Contents of read1.html

# <html><head><title>The Dormouse's story</title></head>

# <body>

# <p class="title"><b>The Dormouse's story</b></p>

#

# <p class="story">Once upon a time there were three little sisters; and their names were

# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

# and they lived at the bottom of a well.</p>

#

# <p class="story">...</p></body></html>

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import os

import re

import requests

from bs4 import NavigableString

from bs4 import BeautifulSoup

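curpath=os.path.dirname(os.path.realpath(__file__))
htmlpath=os.path.join(curpath,'read1.html')
# requests.get() only fetches URLs, so read the local file directly instead
with open(htmlpath,"r",encoding="utf-8") as html_file:
    soup=BeautifulSoup(html_file,features="html.parser")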

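# Print every text node with surrounding whitespace stripped
# (the loop variable is named text so it does not shadow the built-in str)
for text in soup.stripped_strings:
    print(repr(text))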

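links=soup.find_all(class_="sister")
# find_all returns a ResultSet (a list of tags), which has no .parents or
# .next_sibling of its own, so navigate from a single tag instead
first_link=links[0]
for parent in first_link.parents:
    print(parent.name)
print(first_link.next_sibling)
for link in links:
    print(link.next_element)
# After the loop, link still refers to the last matched tag
print(link.next_sibling)
print(link.previous_element)
print(link.previous_sibling)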
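The navigation attributes used above can be tried without any file on disk. The following is a minimal sketch that assumes the read1.html markup is pasted into a string; the names `html_doc`, `first_link` and `parent_names` are illustrative and do not appear in the original script.

```python
from bs4 import BeautifulSoup

# Assumption: the same markup as read1.html, held in a string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Navigate from one tag, not from the list returned by find_all
first_link = soup.find("a", class_="sister")

# .parents walks upwards until it reaches the BeautifulSoup object itself
parent_names = [parent.name for parent in first_link.parents]
print(parent_names)  # ['p', 'body', 'html', '[document]']

# .next_element is the next node in parse order: the text inside the tag
print(repr(first_link.next_element))  # 'Elsie'

# .next_sibling is the next node on the same level: the text after the tag
print(repr(first_link.next_sibling))  # ',\n'

# .previous_sibling is the text node that precedes the tag inside the same <p>
print(repr(first_link.previous_sibling))
```

Note that siblings are often text nodes (commas, line breaks) rather than tags, which is why `next_sibling` of the first link is not the second link.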

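def has_class_no_id(tag):
    # Tag filter: true for tags that have a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

def not_lacie(href):
    # Attribute filter: true for href values that do not contain "lacie"
    return href and not re.compile("lacie").search(href)

def not_tillie(href):
    # Attribute filter: true for href values that do not contain "tillie"
    return href and not re.compile("tillie").search(href)

def not_tillie1(id_value):
    # Despite its name, this rejects id values containing "link2" (the Lacie link)
    return id_value and not re.compile("link2").search(id_value)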

# Parse soup.html with the lxml parser (requires the lxml package to be installed)
with open("soup.html","r",encoding="utf-8") as file:
    soup=BeautifulSoup(file,features="lxml")

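# find_all usage
# Each call below overwrites tags, so only the result of the last call is printed

# Regular expression: tags whose name starts with "b" (body, b)
tags=soup.find_all(re.compile('^b'))
# String: all <b> tags
tags=soup.find_all('b')
# List: all <a> and <b> tags
tags=soup.find_all(['a','b'])
# Function: tags that have a class attribute but no id
tags=soup.find_all(has_class_no_id)
# True: every tag in the document
tags=soup.find_all(True)
# Keyword argument with a function: tags whose href does not contain "lacie"
tags=soup.find_all(href=not_lacie)
for tag in tags:
    print(tag.name)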

def surrounded_by_strings(tag):

    return (isinstance(tag.next_element, NavigableString)

            and isinstance(tag.previous_element, NavigableString))

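# Filter on the id attribute with a function: tags whose id does not contain "link2"
tags=soup.find_all(id=not_tillie1)
for tag in tags:
    print(tag)
# Filter on an exact attribute value through the attrs dictionary
tags=soup.find_all(attrs={"id":"link3"})
for tag in tags:
    print(tag)
# recursive=False restricts the search to the direct children of soup
tags=soup.find_all(recursive=False)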
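Because the script above reassigns `tags` repeatedly, it is hard to see what each filter returns. The sketch below runs several of the same filters against the read1.html markup held in a string and keeps each result in its own variable. The variable names (`by_regex`, `by_list` and so on) and the lambda are illustrative assumptions, not part of the original script.

```python
import re
from bs4 import BeautifulSoup

# Assumption: the same markup as read1.html, held in a string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

def has_class_no_id(tag):
    # Tag filter: class attribute present, id attribute absent
    return tag.has_attr("class") and not tag.has_attr("id")

# Regular expression on the tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
print(by_regex)  # ['body', 'b']

# List of tag names, returned in document order
by_list = [t.name for t in soup.find_all(["a", "b"])]
print(by_list)  # ['b', 'a', 'a', 'a']

# Function that receives each tag
no_id = [t.name for t in soup.find_all(has_class_no_id)]
print(no_id)  # ['p', 'p', 'p']

# Function that receives the attribute value (None when the attribute is missing)
not_lacie_ids = [t["id"] for t in soup.find_all(href=lambda h: h and "lacie" not in h)]
print(not_lacie_ids)  # ['link1', 'link3']

# Exact attribute match through attrs
by_attrs = [t.get_text() for t in soup.find_all(attrs={"id": "link3"})]
print(by_attrs)  # ['Tillie']

# recursive=False only looks at the direct children of the object searched
top_level = [t.name for t in soup.find_all(recursive=False)]
print(top_level)  # ['html']
```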

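# CSS selectors with select()
# Each call below overwrites tags, so only the result of the last call is printed

# Descendant: <a> tags anywhere inside <body>
tags=soup.select("body a")
# Child: <a> tags that are direct children of a <p>
tags=soup.select("p > a")
# Child with id: the element with id link1 that is a direct child of a <p>
tags=soup.select("p > #link1")
# Nested descendants: <title> inside <head> inside <html>
tags=soup.select("html head title")
# Class: elements with the class sister
tags=soup.select(".sister")
# Attribute word match: class attribute contains the word sister
tags=soup.select("[class~=sister]")
# Adjacent sibling: the .sister element immediately following #link1
tags=soup.select("#link1 + .sister")
# id, with and without the tag name
tags=soup.select("#link1")
tags=soup.select("a#link1")
# Attribute present
tags=soup.select("a[href]")
# Attribute value starts with, ends with, contains
tags=soup.select('a[href^="http://example"]')
tags=soup.select('a[href$="tillie"]')
tags=soup.select('a[href*=".com/el"]')
for tag in tags:
    print(tag)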
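To make the effect of the selectors above visible, here is a small sketch that evaluates a few of them against the read1.html markup held in a string and collects the matching ids. The variable names are illustrative assumptions; `select()` relies on the soupsieve package, which is normally installed together with Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Assumption: the same markup as read1.html, held in a string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

def ids(selector):
    # Helper (hypothetical): ids of all elements matched by a CSS selector
    return [tag["id"] for tag in soup.select(selector)]

children = ids("p > a")
print(children)  # ['link1', 'link2', 'link3']

adjacent = ids("#link1 + .sister")
print(adjacent)  # ['link2']

starts_with = ids('a[href^="http://example"]')
print(starts_with)  # ['link1', 'link2', 'link3']

ends_with = ids('a[href$="tillie"]')
print(ends_with)  # ['link3']

contains = ids('a[href*=".com/el"]')
print(contains)  # ['link1']

# select() always returns a list, even for a single match
titles = [tag.get_text() for tag in soup.select("html head title")]
print(titles)  # ["The Dormouse's story"]
```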

# Parse the file once; building a second soup from the same, already consumed
# file handle would produce an empty document
with open("soup.html","r",encoding="utf-8") as file:
    soup=BeautifulSoup(file,features="html.parser")

print(soup.prettify())

print(type(soup))

print(type(soup.title))

print(type(soup.title.string))

print(type(soup.b.string))

print(soup.head.name)

print(soup.title.name)

print(soup.a.name)

print(soup.name)

tag=soup.a

print(tag["href"])

print(tag.string)

print(tag["class"])

print(tag.attrs)

print(soup.title.string)

print(soup.title.name)

print(soup.p.attrs)

print(soup.a.attrs)

print(soup.a["class"])
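The print calls above show the four main object types and how attributes are read. As a rough sketch of what they return for the read1.html markup (again held in a string, with illustrative variable names that are not part of the original script):

```python
from bs4 import BeautifulSoup

# Assumption: the same markup as read1.html, held in a string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The three object types that the script prints with type()
type_names = [type(obj).__name__ for obj in (soup, soup.title, soup.title.string)]
print(type_names)  # ['BeautifulSoup', 'Tag', 'NavigableString']

# The soup object itself reports a special name
print(soup.name)  # [document]

# soup.a is the first <a> tag; attributes are read like dictionary keys
tag = soup.a
href = tag["href"]
print(href)  # http://example.com/elsie

# class is a multi-valued attribute, so it comes back as a list
classes = tag["class"]
print(classes)  # ['sister']

# .attrs is the full attribute dictionary of the tag
first_p_attrs = soup.p.attrs
print(first_p_attrs)  # {'class': ['title']}

# tag["missing"] would raise KeyError; .get() returns None instead
missing = tag.get("missing")
print(missing)  # None
```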
