python3用BeautifulSoup用re.compile来匹配需要抓取的href地址

# -*- coding:utf-8 -*-

#python 2.7

#XiaoDeng

#http://tieba.baidu.com/p/2460150866

#标签操作

from bs4 import BeautifulSoup

import urllib.request

import re

#如果是网址，可以用这个办法来读取网页

#html_doc = "http://tieba.baidu.com/p/2460150866"

#req = urllib.request.Request(html_doc)

#webpage = urllib.request.urlopen(req)

#html = webpage.read()

html="""

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="xiaodeng"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

<a href="http://example.com/lacie" class="sister" id="xiaodeng">Lacie</a>

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html, 'html.parser')   #文档对象

#re.compile来匹配需要抓取的href地址

for k in  soup.find_all(href=re.compile("lacie")):

    print(k)

for k in  soup.find_all(string=re.compile("Lacie")):

    print(k)

python3用BeautifulSoup用re.compile来匹配需要抓取的href地址的更多相关文章

python3.4学习笔记(十七) 网络爬虫使用Beautifulsoup4抓取内容
python3.4学习笔记(十七) 网络爬虫使用Beautifulsoup4抓取内容 Beautiful Soup 是用Python写的一个HTML/XML的解析器,它可以很好的处理不规范标记并生成剖 ...
(python3爬虫实战-第一篇）利用requests+正则抓取猫眼电影热映口碑榜
今天是个值得纪念了日子,我终于在博客园上发表自己的第一篇博文了.作为一名刚刚开始学习python网络爬虫的爱好者,后期本人会定期发布自己学习过程中的经验与心得,希望各位技术大佬批评指正.以下是我自己做 ...
python3.4学习笔记(十四) 网络爬虫实例代码，抓取新浪爱彩双色球开奖数据实例
python3.4学习笔记(十四) 网络爬虫实例代码,抓取新浪爱彩双色球开奖数据实例新浪爱彩双色球开奖数据URL:http://zst.aicai.com/ssq/openInfo/ 最终输出结果格 ...
Python3网络爬虫（1）：利用urllib进行简单的网页抓取
1.开发环境 pycharm2017.3.3 python3.5 2.网络爬虫的定义网络爬虫,也叫网络蜘蛛(web spider),如果把互联网比喻成一个蜘蛛网,spider就是一只在网上爬来爬去的 ...
Python3中BeautifulSoup的使用方法
BeautifulSoup的使用我们学习了正则表达式的相关用法,但是一旦正则写的有问题,可能得到的就不是我们想要的结果了,而且对于一个网页来说,都有一定的特殊的结构和层级关系,而且很多标签都有id或 ...
python3 - 通过BeautifulSoup 4抓取百度百科人物相关链接
导入需要的模块需要安装BeautifulSoup from urllib.request import urlopen, HTTPError, URLError from bs4 import Be ...
Python3中正则模块re.compile、re.match及re.search函数用法详解
Python3中正则模块re.compile.re.match及re.search函数用法 re模块 re.compile.re.match. re.search 正则匹配的时候,第一个字符是 r,表 ...
python3 调用 beautifulSoup 进行简单的网页处理
python3 调用 beautifulSoup 进行简单的网页处理 from bs4 import BeautifulSoup file = open('index.html','r',encodi ...
python3用BeautifulSoup抓取id='xiaodeng',且正则包含‘elsie’的标签
# -*- coding:utf-8 -*- #python 2.7 #XiaoDeng #http://tieba.baidu.com/p/2460150866 #使用多个指定名字的参数可以同时过滤 ...

随机推荐

Type in Chakra
Type in Chakra Javascript是一个无类型的语言. 我们要讨论的类型是指Chakra内置的一些数据结构,这些结构维护了Object的信息. Type在一类Object中共享数据,使 ...
hdu 1879 有的边已存在（MST）
Sample Input31 2 1 0 //u v w 是否已建 1 3 2 02 3 4 031 2 1 01 3 2 02 3 4 131 2 1 01 3 2 12 3 4 10 Sample ...
java：冒泡排序、选择排序、插入排序实现
整数排序给一组整数,按照升序排序,使用选择排序,冒泡排序,插入排序或者任何 O(n2) 的排序算法. 样例样例 1: 输入: [3, 2, 1, 4, 5] 输出: [1, 2, 3, 4, 5] ...
Mac上c语言连接mysql遇到的问题
参照<Beginning Linux Programming>上的例程写了一个连接mysql的c语言小程序connect1.c.但是按照书上的编译命令无法编译.然后经过查阅资料解决了问题. ...
MySQL+Toad for Mysql安装，配置及导入中文数据解决乱码等问题
1.下载MySQL5.7版本,安装官网上的windows安装版,下载地址为:https://dev.mysql.com/downloads/windows/installer/5.7.html 安装选 ...
从源码看Spring Boot 2.0.1
Spring Boot 命名配置很少,却可以做到和其他配置复杂的框架相同的功能工作,从源码来看是怎么做到的. 我这里使用的Spring Boot版本是 2.0.1.RELEASE Spring Boo ...
saxon 处理xslt
下载saxon : https://sourceforge.net/projects/saxon/?source=typ_redirect 下载后拿到: saxon9he.jar 运行CMD: C:\ ...
vue 之加载 iframe 的处理
vue中加载 iframe 会出现跨域问题.以及iframe的高度自适应问题,以下是本人的解决办法: getGoodsContentHtml---- 你的iframe页面的地址, 如不同域的情况下 ...
Android入门笔记
Android项目的目录结构(Eclipse版) src:项目源代码文件夹 R.java:存放项目中所有资源文件的资源id,永远不要修改 Android.jar:Android的jar包,导入此包方可 ...
Unity3D引擎中特殊的文件夹
Editor Editor文件夹可以在根目录下,可以在子目录里,只要名是Editor就可以./xxx/xxx/Editor 和 /Editor 是一样的,多少个叫Editor的文件夹都可以.Edit ...

python3用BeautifulSoup用re.compile来匹配需要抓取的href地址

python3用BeautifulSoup用re.compile来匹配需要抓取的href地址的更多相关文章

随机推荐

热门专题