23 - Python: using BeautifulSoup to extract all data from a tags

1. Getting child tags:

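# Assumes `soup` is an existing BeautifulSoup object and `re` has been imported
thr_msgs = soup.find_all('div', class_=re.compile('msg'))
for i in thr_msgs:
    print(i)
    # select() returns a list; this selector matches the first <em> among its siblings
    first = i.select('em:nth-of-type(1)')
    print(first)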

Output:

<div class='\"msg\"'><em>佛山</em><em>1-3年</em><em>大专</em></div>

[<em>佛山</em>]

<div class='\"msg\"'><em>南京</em><em>3-5年</em><em>本科</em></div>

[<em>南京</em>]

<div class='\"msg\"'><em>南阳</em><em>1-3年</em><em>大专</em></div>

[<em>南阳</em>]

<div class='\"msg\"'><em>深圳</em><em>1年以内</em><em>本科</em></div>

[<em>深圳</em>]
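
The loop above can also be written as a self-contained sketch. The sample markup here is a hypothetical stand-in modeled on the output shown above, not the original page. `select_one()` returns the first matching tag itself rather than a one-element list, and `find_all(..., recursive=False)` limits the search to direct children of the div.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the printed output above
html = """
<div class="msg"><em>佛山</em><em>1-3年</em><em>大专</em></div>
<div class="msg"><em>南京</em><em>3-5年</em><em>本科</em></div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []          # text of every <em> child, one list per div
first_cities = []  # text of the first <em> in each div
for div in soup.find_all('div', class_=re.compile('msg')):
    # recursive=False restricts the search to direct children of the div
    ems = div.find_all('em', recursive=False)
    rows.append([em.get_text(strip=True) for em in ems])

    # select_one() returns a single Tag (or None) instead of a list
    first = div.select_one('em:nth-of-type(1)')
    if first is not None:
        first_cities.append(first.get_text(strip=True))

print(rows)
print(first_cities)
```

Collecting the text with `get_text()` gives plain strings, which is usually more convenient than keeping the `<em>` tags themselves.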

2. Getting the content of a tag:

Original article: https://blog.csdn.net/suibianshen2012/article/details/62040460?utm_source=copy

# -*- coding:utf-8 -*-
# Python 3 (urllib.request is not available in Python 2.7)
# XiaoDeng
# http://tieba.baidu.com/p/2460150866
# Tag operations
from bs4 import BeautifulSoup
import urllib.request
import re

# If the source is a URL, the page can be read like this:
#html_doc = "http://tieba.baidu.com/p/2460150866"
#req = urllib.request.Request(html_doc)
#webpage = urllib.request.urlopen(req)
#html = webpage.read()

html="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="xiaodeng"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<a href="http://example.com/lacie" class="sister" id="xiaodeng">Lacie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')  # document object

# soup.a finds only the first <a> tag
#print(soup.a)  # <a class="sister" href="http://example.com/elsie" id="xiaodeng"><!-- Elsie --></a>

for k in soup.find_all('a'):
    print(k)
    print(k['class'])  # the class attribute of the <a> tag (returned as a list)
    print(k['id'])     # the id of the <a> tag
    print(k['href'])   # the href of the <a> tag
    print(k.string)    # the string inside the <a> tag

# tag.get('class') gives the same result
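
As a minimal sketch of the `tag.get()` variant mentioned above: `k['href']` raises `KeyError` when the attribute is missing, while `k.get('href')` returns `None`. The markup below is a hypothetical variation of the example document (the third link deliberately has no `href`), and the `links` list and its dictionary keys are names chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last <a> has no href on purpose
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.find_all('a'):
    links.append({
        'href': a.get('href'),           # None if the attribute is missing
        'id': a.get('id'),
        'class': a.get('class', []),     # class is multi-valued, so it is a list
        'text': a.get_text(strip=True),  # visible text only; HTML comments are skipped
    })

for link in links:
    print(link)
```

Note that `a.string` on the first link would return the comment content, whereas `get_text()` leaves comments out, so the two are not interchangeable when a tag contains a comment.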

