Some uses of BeautifulSoup

Auto-completing markup:

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

print(soup.prettify())#if the HTML is incomplete, the parser completes it automatically

print(soup.title.string)
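To see the auto-completion without fetching a page, the sketch below parses a deliberately broken fragment. The fragment and the use of Python's built-in html.parser are assumptions made for illustration; the examples in this post use lxml, which typically also wraps such a fragment in html and body elements.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <b> nor <p> is closed.
broken_html = "<p>Hello<b>world"

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree always has matching close tags, so serialising it
# yields well-formed markup.
repaired = str(soup)
print(repaired)

# prettify() returns the same tree with one node per line and indentation.
pretty = soup.prettify()
print(pretty)
```

Comparing `repaired` with `broken_html` shows exactly which closing tags the parser added.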

Finding a tag

#basic usage

soup.title#<title>xxxxxxx</title>

soup.title.string#xxxxxx

Getting the tag name

#basic usage

soup.title#<title>xxxxxxx</title>

soup.title.name#title

Getting attributes

#basic usage

soup.a#<a>xxxxxxx</a>

soup.a['name']#value of the <a> tag's name attribute (raises KeyError if the attribute is missing)

Getting the content

soup.title.string#xxxxxx

Nested selection

print(soup.head.title.string)
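The tag, name, attribute, content and nested lookups above can be tried together on a small inline document. The HTML below is made up for illustration, and html.parser is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; not taken from the page used in this post.
html = """
<html><head><title>Demo page</title></head>
<body><a name="top" href="/index" class="nav main">Home</a></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title                # the first <title> tag in the document
tag_name = soup.title.name            # the tag's name
title_text = soup.title.string        # the single string inside the tag
nested_text = soup.head.title.string  # the same tag, reached through <head>

link_name = soup.a["name"]            # attribute lookup, dictionary style
link_classes = soup.a["class"]        # class is multi-valued, so this is a list
missing = soup.a.get("id")            # .get() returns None instead of raising

print(title_tag, tag_name, title_text, nested_text)
print(link_name, link_classes, missing)
```

Using `.get()` is the safer choice when an attribute may be absent, since `soup.a["id"]` would raise `KeyError` here.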

Child nodes

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

print(soup.div.contents)

Or

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.div.children

print(a)

for i,j in enumerate(a):

    print(i,j)
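The difference between `.contents` and `.children` is easier to see on a tiny inline fragment. The markup below is an assumption made for illustration: `.contents` is a list, `.children` is an iterator over the same nodes, and whitespace between tags counts as a child text node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between the tags.
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
div = soup.div

contents = div.contents              # a plain list of direct children
child_names = [child.name for child in div.children]  # iterator, consumed once

print(contents)
print(child_names)

# Newlines between tags become text nodes of their own.
spaced = BeautifulSoup("<div>\n<p>one</p>\n</div>", "html.parser")
spaced_count = len(spaced.div.contents)
print(spaced_count)
```

On real pages the extra text nodes are common, so filtering on `child.name` (which is `None` for text) is often needed.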

Descendant nodes

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.div.descendants

print(a)

for i,j in enumerate(a):

    print(i,j)
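Unlike `.children`, `.descendants` walks the whole subtree in document order, text nodes included. The sketch below uses a made-up fragment and separates tags from text with `isinstance`; it is an illustration, not part of the original example.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment with two levels of nesting.
html = "<div><p>one <b>bold</b></p><span>two</span></div>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.div.children)        # direct children only
descendants = list(soup.div.descendants)  # every node below <div>

# Tags and text nodes are different classes, so they can be told apart.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
texts = [str(node) for node in descendants if isinstance(node, NavigableString)]

print(len(children), len(descendants))
print(tag_names)
print(texts)
```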

Getting the parent node

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.div.parent

print(a)

Getting ancestor nodes

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.div.parents

for i,j in enumerate(a):

    print(i,j)
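As a quick check of how `.parent` and `.parents` differ, the sketch below starts from an inner tag of a made-up document. `.parent` is a single node, while `.parents` yields every ancestor up to the `BeautifulSoup` object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

# Hypothetical document, parsed with the built-in parser.
html = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
parent_name = p.parent.name                           # the immediate parent
ancestor_names = [ancestor.name for ancestor in p.parents]  # nearest first

print(parent_name)
print(ancestor_names)
```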

Getting sibling nodes

a=soup.div.next_siblings#siblings after this node (an iterator)

Preceding sibling nodes

a=soup.div.previous_siblings#siblings before this node (an iterator)
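A small sketch of both sibling iterators, using a made-up list: `next_siblings` moves forward from the node, and `previous_siblings` moves backward, so the nearest preceding sibling comes first.

```python
from bs4 import BeautifulSoup

# Hypothetical list; the id "mid" marks the starting node.
html = '<ul><li>a</li><li>b</li><li id="mid">c</li><li>d</li><li>e</li></ul>'
soup = BeautifulSoup(html, "html.parser")

mid = soup.find(id="mid")

# Both attributes are iterators, so they are collected into lists here.
after = [li.get_text() for li in mid.next_siblings]
before = [li.get_text() for li in mid.previous_siblings]

print(after)
print(before)
```

If the markup contained whitespace between the `<li>` tags, the iterators would also yield those text nodes.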

Standard selectors
find_all(name,attrs,recursive,string,limit,**kwargs)

name

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

print(soup.find_all('ul'))#search by tag name

import requests

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

for ul in soup.find_all('ul'):

    for li in ul.find_all('li'):

        print(li)
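The effect of the `name`, `recursive` and `limit` arguments can be checked offline with a nested list. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: one <ul> inside a <li> of another.
html = '<ul id="outer"><li>one</li><li>two<ul><li>nested</li></ul></li></ul>'
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")                   # every <li>, at any depth
direct = soup.ul.find_all("li", recursive=False)  # direct children of the outer <ul>
first_only = soup.find_all("li", limit=1)         # stop after the first match
several = soup.find_all(["ul", "li"])             # a list matches any of the names

print(len(all_items), len(direct), len(first_only), len(several))
```

`find_all` always returns a list-like result, even with `limit=1`; `find` returns the first match or `None`.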

attrs

import requests

import re

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.find_all(attrs={'class':'lazy'})#<a class='lazy'>xxxxx</a>

for index,i in enumerate(a):

    result=re.findall(r'[a-zA-Z]+://[^\s]*png',str(i))#find a PNG URL in the tag's markup

    if not result:#skip tags that contain no PNG URL

        continue

    url=result[0]

    res = requests.get(url)

    with open('%d.png'%index,'wb')as f:

        f.write(res.content)

import requests

import re

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

a=soup.find_all(class_='lazy')#match by class; class is a Python keyword, hence class_
a=soup.find_all(id='lazy')#match by id
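The attribute filters can be compared on a small inline gallery. The markup and the example.com URLs are placeholders, not the page used above. Note that a class filter matches a tag if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URLs are placeholders.
html = """
<div id="gallery">
  <img class="lazy" data-original="https://example.com/a.png">
  <img class="lazy thumb" data-original="https://example.com/b.png">
  <img class="eager" src="https://example.com/c.png">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"class": "lazy"})      # dictionary form
by_keyword = soup.find_all(class_="lazy")              # keyword form, same result
by_id = soup.find_all(id="gallery")                    # match on the id attribute
has_data = soup.find_all(attrs={"data-original": True})  # attribute merely present

# Reading the attribute directly avoids running a regex over the tag's text.
urls = [img["data-original"] for img in by_attrs]
print(urls)
```

The `attrs` dictionary is required for names such as `data-original`, which are not valid Python keyword arguments.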

CSS selectors

import requests

import re

from bs4 import BeautifulSoup

response=requests.get('https://www.ithome.com/html/it/340684.htm',timeout=9)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.select('.lazy')

for index,i in enumerate(a):

    result=re.findall(r'[a-zA-Z]+://[^\s]*png',str(i))#find a PNG URL in the tag's markup

    if not result:#skip tags that contain no PNG URL

        continue

    url=result[0]

    res = requests.get(url)

    with open('./test/%d.png'%(index+1),'wb')as f:

        f.write(res.content)
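A few common selector forms for `select` and `select_one`, tried on invented markup. This is a sketch that assumes a Beautiful Soup release recent enough to bundle the soupsieve selector engine.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for trying out selectors.
html = """
<div id="main">
  <ul class="menu"><li class="item">A</li><li class="item active">B</li></ul>
  <p class="item">C</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = [tag.get_text() for tag in soup.select(".item")]    # class selector
nested = [tag.get_text() for tag in soup.select("ul.menu li")]  # descendant selector
direct_child = soup.select("#main > p")                         # id plus child combinator
first_active = soup.select_one("li.active")                     # first match only
no_match = soup.select_one(".missing")                          # None when nothing matches

print(by_class, nested, len(direct_child), first_active.get_text(), no_match)
```

`select` always returns a list, so an empty result is `[]`, while `select_one` returns `None`.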

Getting attributes from tags matched by a CSS selector

import requests

import re

from bs4 import BeautifulSoup

headers={

"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"

}

response=requests.get('https://www.zhihu.com/question/20519068/answer/215288567',headers=headers,timeout=None)

result=response.text

soup=BeautifulSoup(response.content,'lxml')

# print(soup.prettify())

a=soup.select('.lazy')

# print(a)

for index,i in enumerate(a):

    url=i['data-original']

    # result=re.findall(r'[a-zA-Z]+://[^\s]*jpg',str(i))

    # url=result[0]

    res = requests.get(url)

    with open('./test/%d.jpg'%(index+1),'wb')as f:

        f.write(res.content)
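Before downloading, it can help to collect the URLs in a separate step that tolerates missing attributes and relative paths. The sketch below does only that part, offline. The base URL, the markup and the `data-original` values are placeholders chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address, used to resolve relative paths.
base_url = "https://example.com/gallery/"

# Hypothetical markup: one relative URL, one absolute URL, one tag without the attribute.
html = """
<img class="lazy" data-original="/img/1.jpg">
<img class="lazy" data-original="https://cdn.example.com/2.jpg">
<img class="lazy">
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for img in soup.select(".lazy"):
    src = img.get("data-original")  # None instead of KeyError when absent
    if src is None:
        continue                    # nothing to download for this tag
    urls.append(urljoin(base_url, src))  # absolute URLs pass through unchanged

# File names numbered from 1, as in the loop above.
filenames = ["%d.jpg" % (index + 1) for index, _ in enumerate(urls)]
print(urls)
print(filenames)
```

Each collected URL could then be passed to `requests.get` with a timeout, after making sure the output directory exists.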

Getting the text content

li.get_text()
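`get_text()` joins all the text inside a tag, including the text of nested tags. The sketch below, on a made-up `<li>`, shows how the optional separator and `strip` arguments change the result.

```python
from bs4 import BeautifulSoup

# Hypothetical list item with a nested tag and surrounding spaces.
soup = BeautifulSoup("<li> Hello <b>world</b> </li>", "html.parser")
li = soup.li

raw = li.get_text()                   # all text, whitespace kept as-is
stripped = li.get_text(strip=True)    # each piece stripped, then joined directly
spaced = li.get_text(" ", strip=True)  # pieces stripped, joined with a space

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```

By contrast, `li.string` is `None` here, because the tag has more than one child.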
