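Python Web Scraping Basics Tutorial 2

Introduction to beautifulsoup4 / Traversing the Document Tree

bs4 > a Python library for extracting data from HTML or XML files

Use it to parse the HTML or XML that a crawler has fetched.

Install: pip install beautifulsoup4

pip install lxml > the parser library

soup = BeautifulSoup('content to parse, as a str', 'lxml')  # or 'html.parser', the built-in parser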

from bs4 import BeautifulSoup

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title">

lqz

<b>The Dormouse's story</b>

</p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1" name='lqz'>Elsie</a>

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup=BeautifulSoup(html_doc,'lxml')

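# 1 Prettify: the markup is not well-formed (body and html are never closed), but the parser repairs it and prettify() prints it indented
# print(soup.prettify())

# 2 Traverse the document tree with the dot (.) operator
# print(soup.html.body.p)  # descend one level at a time
# print(soup.p)  # skip levels; returns only the first match

# 3 Get the tag name
# print(soup.a.name)

# 4 Get tag attributes ---> attrs is a dict
# print(soup.a.attrs['href'])
# print(soup.a.attrs.get('class'))   # class can hold several values, so this is a list: ['sister']
# print(soup.a.attrs.get('name'))

# 5 Get the tag's content
# text: the text of every descendant of the tag, concatenated
# print(soup.p.text)
# string: the text when the tag has exactly one child string, otherwise None
# print(soup.p.string)
# strings: a generator over every string inside the tag
# print(list(soup.p.strings))

# 6 Nested selection
# print(soup.html.body)

# ---- The rest is good to know but used less often

# 7 Children and descendants
# print(soup.body.contents)  # list of all direct children of body, one level only
# print(list(soup.p.children))  # iterator over all direct children of p, one level only
# print(list(soup.body.descendants))  # generator over all descendants

# 8 Parent and ancestors
# print(soup.a.parent)  # the direct parent of the a tag
# print(list(soup.a.parents))  # all ancestors of the a tag: parent, grandparent, ...

# 9 Siblings
# print(soup.a.next_sibling)  # the next sibling
# print(soup.a.previous_sibling)  # the previous sibling
# print(list(soup.a.next_siblings))  # all following siblings => generator
print(list(soup.a.previous_siblings))  # all preceding siblings => generator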
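
The difference between text, string, and strings is easy to mix up, so here is a minimal sketch on a shortened version of the document above. The shortened HTML and the variable names first_p and link are made up for illustration, and html.parser is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the tutorial's html_doc
html = (
    '<p class="title">lqz<b>The Dormouse\'s story</b></p>'
    '<p class="story"><a href="/elsie" id="link1">Elsie</a>, '
    '<a href="/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p  # dot access returns the first <p> only
print(first_p.text)           # lqzThe Dormouse's story
print(first_p.string)         # None: <p> has two children (a string and <b>)
print(first_p.b.string)       # The Dormouse's story: <b> has exactly one string
print(list(first_p.strings))  # ['lqz', "The Dormouse's story"]

link = soup.a  # the first <a> in the document
print(link.parent.name)       # p
print(repr(link.next_sibling))      # ', ' (siblings can be plain strings, not only tags)
print(link.previous_sibling)  # None: the <a> is the first child of its <p>
```

Note that next_sibling often returns a whitespace or punctuation string rather than the next tag, which is why the sibling attributes are used less often than find_all.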

Searching the Document Tree with beautifulsoup4

# find   find_all

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>

</p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

# Five kinds of filters: string, regular expression, list, True, function

# 1 String ---> the search condition is a string
# res=soup.find_all(name='p')
# res=soup.find_all('p')
# print(res)

# All tags whose class is sister
# res=soup.find_all(class_='sister')
# print(res)

# The tag whose id is link1
# res=soup.find_all(id='link1')
# print(res)

# The parent tag of the text node 'Elsie'
# res=soup.find(string='Elsie').parent  # string= replaces the older text= argument
# print(res)

# Another way: pass the attributes as a dict
# res=soup.find_all(attrs={'class':'sister'})
# res=soup.find_all(attrs={'id':'link1'})
# print(res)

# 2 Regular expression
# import re
# res=soup.find_all(id=re.compile('^l'))
# res=soup.find_all(class_=re.compile('^s'))
# print(res)

# 3 List: matches any item in the list
# res=soup.find_all(id=['link1','link2'])
# print(res)
# print(soup.find_all(name=['a','b']))
# print(soup.find_all(['a','b']))

# 4 True: matches any tag that has the attribute at all
# res=soup.find_all(id=True)  # every tag that has an id
# res=soup.find_all(href=True)
# res=soup.find_all(class_=True)
# print(res)

# 5 Function: called with each tag; the tag is kept when it returns True
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(name=has_class_but_no_id))
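
A function can also filter on an attribute value instead of on the whole tag. The sketch below shows both forms next to a regular-expression filter. The trimmed HTML, the helper ends_with_lacie, and the result variable names are assumptions made for illustration.

```python
import re

from bs4 import BeautifulSoup

# Trimmed, illustrative version of the tutorial's html_doc
html = """
<p id="intro" class="title"><b class="boldest">Title</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')


def has_class_but_no_id(tag):
    # Tag-level filter: receives each Tag object
    return tag.has_attr('class') and not tag.has_attr('id')


def ends_with_lacie(href):
    # Attribute-level filter: receives the attribute value,
    # which is None for tags that lack the attribute
    return bool(href) and href.endswith('lacie')


no_id = [tag.name for tag in soup.find_all(has_class_but_no_id)]
lacie = [a['id'] for a in soup.find_all(href=ends_with_lacie)]
first_two = [a['id'] for a in soup.find_all(id=re.compile(r'^link[12]$'))]

print(no_id)      # ['b', 'p']
print(lacie)      # ['link2']
print(first_two)  # ['link1', 'link2']
```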

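Other parameters of find

name

class_

id

string (called text in older versions)

attrs

-------

limit: caps the number of results; used with find_all. find is essentially find_all with limit=1, except that it returns the single element (or None) instead of a list

recursive: whether to search all descendants or only direct children. The default is True, which searches all descendants

# limit parameter
# res=soup.find_all(href=True,limit=2)  # return at most 2 results
# print(res)

# recursive: search only direct children, or all descendants
# res=soup.find_all(name='b',recursive=False)  # [] here, because b is not a direct child of soup
# res=soup.find_all(name='b')

# Combining traversal and search is recommended
res=soup.html.body.p.find_all(name='b',recursive=False)
print(res)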
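
To make the effect of limit and recursive visible, here is a small sketch on a nested list. The HTML and the ids a, b, b1, and c are invented for this example.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one <li> contains a nested list
html = (
    '<ul id="menu">'
    '<li id="a">A</li>'
    '<li id="b">B<ul><li id="b1">B1</li></ul></li>'
    '<li id="c">C</li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find('ul', id='menu')

# Default: recursive=True walks all descendants, in document order
all_items = [li['id'] for li in menu.find_all('li')]
# recursive=False looks only at direct children of menu
direct_items = [li['id'] for li in menu.find_all('li', recursive=False)]
# limit stops the search after the given number of matches
first_two = [li['id'] for li in menu.find_all('li', limit=2)]

print(all_items)     # ['a', 'b', 'b1', 'c']
print(direct_items)  # ['a', 'b', 'c']
print(first_two)     # ['a', 'b']

# find returns the same element as the first result of find_all(limit=1)
print(menu.find('li') is menu.find_all('li', limit=1)[0])  # True
```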

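CSS selectors

bs4 also supports CSS selectors through select(). It has no XPath support of its own; for XPath, use lxml directly.

Example: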

# html_doc = """

# <html><head><title>The Dormouse's story</title></head>

# <body>

# <p class="title">

#     <b>The Dormouse's story</b>

#     Once upon a time there were three little sisters; and their names were

#     <a href="http://example.com/elsie" class="sister" id="link1">

#         <span>Elsie</span>

#     </a>

#     <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

#     <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

#     <div class='panel-1'>

#         <ul class='list' id='list-1'>

#             <li class='element'>Foo</li>

#             <li class='element'>Bar</li>

#             <li class='element'>Jay</li>

#         </ul>

#         <ul class='list list-small' id='list-2'>

#             <li class='element'><h1 class='yyyy'>Foo</h1></li>

#             <li class='element xxx'>Bar</li>

#             <li class='element'>Jay</li>

#         </ul>

#     </div>

#     and they lived at the bottom of a well.

# </p>

# <p class="story">...</p>

# """

from bs4 import BeautifulSoup

# soup=BeautifulSoup(html_doc,'lxml')

# select() takes a CSS selector
# print(soup.select('.sister'))
# print(soup.select('#link1'))
# print(soup.select('#link1 span'))

# The ultimate shortcut ---> if you cannot write the selector yourself, copy it from the browser's developer tools (right-click the element > Copy > Copy selector)
import requests

res=requests.get('https://www.w3school.com.cn/css/css_selector_attribute.asp')
soup=BeautifulSoup(res.text,'lxml')
# print(soup.select('#intro > p:nth-child(1) > strong'))
print(soup.select('#intro > p:nth-child(1) > strong')[0].text)
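
The example above depends on a live page, so here is an offline sketch of the same idea. It uses select() for lists of matches and select_one() for a single match. The markup is a cut-down, illustrative version of the commented-out html_doc, and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Cut-down, illustrative markup
html = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element xxx">Baz</li></ul>
  <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() always returns a list, possibly empty
items = [li.get_text() for li in soup.select('#list-1 > li.element')]
hrefs = [a['href'] for a in soup.select('a[href^="http"]')]

# select_one() returns the first match, or None when nothing matches
small = soup.select_one('ul.list-small li').get_text()
span_text = soup.select_one('#link1 span').get_text()
missing = soup.select_one('#does-not-exist')

print(items)      # ['Foo', 'Bar']
print(hrefs)      # ['http://example.com/elsie']
print(small)      # Baz
print(span_text)  # Elsie
print(missing)    # None
```

Because select_one() can return None, check the result before reading attributes from it when the selector comes from a page you do not control.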

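Basic usage of selenium

# selenium

Selenium started out as a tool for automated testing. Scrapers use it mainly because the requests module cannot execute JavaScript.

Selenium drives a real browser and fully simulates what a user does in it, such as navigating, typing, clicking, and scrolling, so you get the page as it looks after rendering. It supports several browsers.

# Steps:

    1 Install selenium: pip install selenium

    2 Drive a browser: each browser needs its own driver download

        -For Chrome ---> ChromeDriver: https://registry.npmmirror.com/binary.html?path=chromedriver/

        -The driver version must match the Chrome version, e.g. 111.0.5563.65

    3 Put the downloaded driver in the project directory

    4 Write code to control Chrome

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    import time

    # Selenium 4 takes the driver path through a Service object; recent releases no longer accept the executable_path argument from Selenium 3
    bro = webdriver.Chrome(service=Service('chromedriver.exe'))  # open a Chrome window
    bro.get('https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3')  # go to the URL
    print(bro.page_source)  # content of the current page (HTML)
    with open('1.html','w',encoding='utf-8') as f:
        f.write(bro.page_source)
    time.sleep(5)
    bro.close()  # close the browser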

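Headless browser

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

# Hide the browser's graphical interface; the page data is still retrieved
chrome_options = Options()
chrome_options.add_argument('--window-size=1920,3000')  # set the browser resolution (width,height)
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, which helps on some unusual pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
chrome_options.add_argument('--headless')  # run without a visible window; on a Linux system with no display, startup fails without this flag
# chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # set the browser location by hand

# Selenium 4 uses service= and options=; the older executable_path= and chrome_options= arguments are no longer accepted by recent releases
bro = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)  # open a Chrome instance

bro.get('https://www.cnblogs.com/')  # go to the URL
print(bro.page_source)  # content of the current page (HTML)
time.sleep(5)
bro.close()  # close the browser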

Simulating a Baidu login

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

bro = webdriver.Chrome(service=Service('chromedriver.exe'))  # open a Chrome window; Selenium 4 passes the driver path via Service
bro.get('https://www.baidu.com')

# Add a wait: when an element lookup fails, keep retrying for up to x seconds, then raise an error
bro.implicitly_wait(10)

# Find the login link (an a tag) on the page and click it
# By.LINK_TEXT matches an a tag by its text; '登录' is the link text for "Log in"
btn = bro.find_element(by=By.LINK_TEXT, value='登录')
btn.click()

# Switch to SMS login, then back to account/password login. Both buttons have an id, so look them up by id, which is unique
btn_2 = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changeSmsCodeItem')
btn_2.click()
time.sleep(1)
btn_2 = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changePwdCodeItem')
btn_2.click()
time.sleep(1)

# Fill in the username and password, then submit
name = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__userName')
password = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__password')
name.send_keys('306334678@qq.com')
password.send_keys('1234')
time.sleep(1)
submit=bro.find_element(by=By.ID,value='TANGRAM__PSP_11__submit')
submit.click()
time.sleep(2)
bro.close()  # close the browser

Other uses of Selenium

Finding elements

# Two methods
bro.find_element   returns the first match
bro.find_elements  returns all matches as a list

# You can search by id, tag name, name attribute, class name, link text of an a tag, partial link text of an a tag, CSS selector, or XPath (covered later)
# input_1=bro.find_element(by=By.ID,value='wd')  # by id
# input_1 = bro.find_element(by=By.NAME, value='wd')  # by name attribute
# input_1=bro.find_element(by=By.TAG_NAME,value='input')  # by tag name
# input_1=bro.find_element(by=By.CLASS_NAME,value='s_ipt')  # by class name
# input_1=bro.find_element(by=By.LINK_TEXT,value='登录')  # by the full text of an a tag
# input_1=bro.find_element(by=By.PARTIAL_LINK_TEXT,value='录')  # by part of the text of an a tag
# input_1 = bro.find_element(by=By.CSS_SELECTOR, value='#su')  # by CSS selector

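Getting attributes, text, position, and size

print(tag.get_attribute('src'))  # the most commonly used
tag.text  # text content

# Get the element's id, location, tag name, and size (good to know)
print(tag.id)  # Selenium's internal element id, not the HTML id attribute
print(tag.location)
print(tag.tag_name)
print(tag.size)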

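Waiting for elements to load

Code runs fast, so an element may not have loaded yet when you look it up, and the lookup raises an error. Set a wait to handle this.

# Implicit wait: applies to every lookup; if an element is not found, keep retrying for up to 10s before raising
bro.implicitly_wait(10)

# Explicit wait: a separate wait has to be attached to each element lookup, which is more work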

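Element operations

# Click
tag.click()

# Type text
tag.send_keys('text to type')

# Clear the content
tag.clear()

# Browser object: maximize the window
bro.maximize_window()

# Browser object: save a screenshot of the current window
bro.save_screenshot('main.png')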

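Executing JavaScript

bro.execute_script('alert("hello")')  # the string runs as if it were wrapped in a script tag

# What you can do with it

    -Get the current URL: window.location

    -Open a new tab

    -Scroll the page ---> bro.execute_script('scrollTo(0,document.documentElement.scrollHeight)')

    -Read cookies and any global variables the page defines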
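
The items above can be sketched as small helpers around execute_script. The function names are invented for illustration; each takes any object with an execute_script method, such as the bro driver used in this tutorial.

```python
def current_url(driver):
    """Return the page URL as seen by JavaScript."""
    # A leading `return` in the script hands the JS value back to Python
    return driver.execute_script('return window.location.href')


def scroll_to_bottom(driver):
    """Scroll the page to the bottom, e.g. to trigger lazy loading."""
    driver.execute_script(
        'window.scrollTo(0, document.documentElement.scrollHeight)'
    )


def open_in_new_tab(driver, url):
    """Open url in a new tab (the driver still points at the old tab)."""
    # Extra positional arguments are exposed to the script as arguments[0], ...
    driver.execute_script('window.open(arguments[0])', url)


# Typical use with a real driver:
# print(current_url(bro))
# scroll_to_bottom(bro)
# open_in_new_tab(bro, 'https://www.cnblogs.com/')
```

Passing values through arguments[0] avoids building the script by string concatenation, so quotes inside the URL cannot break the JavaScript.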

Switching tabs

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser=webdriver.Chrome(service=Service('chromedriver.exe'))
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')  # open a second, empty tab
print(browser.window_handles)  # handles of all open tabs
browser.switch_to.window(browser.window_handles[1])
browser.get('https://www.taobao.com')
time.sleep(2)
browser.switch_to.window(browser.window_handles[0])
browser.get('https://www.sina.com.cn')
browser.quit()  # quit() closes every tab; close() would close only the current one

Browser back and forward

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser=webdriver.Chrome(service=Service('chromedriver.exe'))
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('http://www.sina.com.cn/')
browser.back()  # back to taobao
time.sleep(2)
browser.forward()  # forward to sina again
browser.close()

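Exception handling

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser=webdriver.Chrome(service=Service('chromedriver.exe'))
try:
    pass  # browser operations go here
except Exception as e:
    print(e)
finally:
    browser.close()  # always runs, so the browser is closed even after an error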

Selenium in practice

Logging in to cnblogs with Selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import json

bro = webdriver.Chrome(service=Service('./chromedriver.exe'))

try:
    #### 1 Get the cookies (run this part once, then comment it out)
    # bro.get('https://www.cnblogs.com/')
    # bro.implicitly_wait(10)
    # login_btn = bro.find_element(by=By.LINK_TEXT, value='登录')
    # login_btn.click()
    # username = bro.find_element(By.ID, 'mat-input-0')
    # password = bro.find_element(By.ID, 'mat-input-1')
    # submit_btn = bro.find_element(By.CSS_SELECTOR,
    #                               'body > app-root > app-sign-in-layout > div > div > app-sign-in > app-content-container > div > div > div > form > div > button')
    # username.send_keys('xxxx@qq.com')
    # # Type the password, click log in, and solve the captcha by hand; once logged in, press Enter in the console
    # input()
    #
    # # Read the cookies
    # cookie = bro.get_cookies()
    # print(cookie)
    # # Save them to a local file
    # with open('cnblogs.json', 'w', encoding='utf-8') as f:
    #     json.dump(cookie, f)

    ### 2 Open the home page and reuse the saved cookies
    bro.get('https://www.cnblogs.com/')  # not logged in yet
    bro.implicitly_wait(10)
    time.sleep(2)
    # Load the cookies from the local JSON file
    with open('cnblogs.json', 'r', encoding='utf-8') as f:
        cookies = json.load(f)
    for cookie in cookies:
        bro.add_cookie(cookie)
    bro.refresh()  # refresh; the page now loads with the saved cookies
    time.sleep(5)
except Exception as e:
    print(e)
finally:
    bro.close()
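
The save and restore steps in the script above can be pulled out into two helpers. This is a minimal sketch: the names save_cookies and load_cookies are invented here, and the driver argument is assumed to be any object with get_cookies() and add_cookie(), such as bro.

```python
import json


def save_cookies(driver, path):
    """Write the driver's cookies (a list of dicts) to a JSON file."""
    cookies = driver.get_cookies()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cookies, f)
    return len(cookies)


def load_cookies(driver, path):
    """Read cookies from a JSON file and add each one to the driver.

    The driver must already be on a page of the cookies' domain,
    as in the script above, which calls bro.get() before adding them.
    """
    with open(path, 'r', encoding='utf-8') as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)


# Typical use with a real driver:
# save_cookies(bro, 'cnblogs.json')   # after logging in by hand
# load_cookies(bro, 'cnblogs.json')   # in a later session, then bro.refresh()
```

Saved cookies expire, so when the refreshed page is still logged out, repeat the manual login and save the cookies again.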
