day94_11_26: web scraping with find and find_all

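1. Using JSON

Normally, to turn a JSON response body into Python objects you have to pass the text to json.loads yourself:

import json
res1=json.loads(response.text)

That gets tedious, so requests provides a json() method on the response:

res2=response.json() # parse the body as JSON directly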
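As a rough illustration of what both lines do, the sketch below parses a hard-coded JSON string that stands in for response.text. The payload and the helper name parse_body are made up for the example. response.json() performs essentially the same json.loads step and raises an exception (a ValueError subclass) when the body is not valid JSON.

```python
import json

def parse_body(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

body = '{"code": 0, "data": ["a", "b"]}'  # stand-in for response.text
res = parse_body(body)
print(res["data"])
print(parse_body("<html>not json</html>"))
```

Returning None on bad input is only one possible choice; letting the exception propagate is equally reasonable when a non-JSON body should count as a failure.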

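2. SSL certificates

HTTPS is HTTP plus SSL (https = http + ssl). Some sites can only be accessed if the client presents a certificate.

The certificate has to be downloaded through the browser.

# SSL
# https=http+ssl
import requests
response=requests.get('https://www.12306.cn',
                      cert=('/path/server.crt',
                            '/path/key'))
print(response.status_code)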
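requests has two separate settings here that are easy to mix up: cert is the client certificate sent to the server, while verify controls how the server's own certificate is checked. A minimal sketch that sets both on a Session without sending anything; the CA bundle path is a hypothetical placeholder, and the cert paths are the placeholders from the snippet above.

```python
import requests

session = requests.Session()

# cert: the client certificate and private key presented to the server
# (same placeholder paths as in the notes).
session.cert = ('/path/server.crt', '/path/key')

# verify: how the server's certificate is checked. True (the default) uses
# the bundled CA certificates; a path points at a custom CA bundle.
# '/path/ca-bundle.crt' is a hypothetical path for illustration.
session.verify = '/path/ca-bundle.crt'

# Every request made through this session would now use both settings, e.g.
# session.get('https://www.12306.cn')
print(session.cert, session.verify)
```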

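3. Using a proxy

The proxies keyword argument of a get request holds the proxy addresses (free proxies are listed on sites such as 西刺).

On the server side, META.get('REMOTE_ADDR') (for example request.META in Django) returns the address the request arrived from, which is one way to check that the proxy is being used.

import requests
proxies={
    # A dict keeps one value per key, so enable only one 'http' entry at a time.
    # 'http':'http://egon:123@localhost:9743', # proxy with credentials: username and password go before the @
    # 'http':'http://localhost:9743',
    'https':'https://localhost:9743',
    'http':'http://124.205.155.148:9090'
}
response=requests.get('https://www.12306.cn',
                      proxies=proxies)
print(response.status_code)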
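One detail worth checking when writing a proxies dict: a Python dict keeps a single value per key, so if 'http' is written several times only the last entry is used. A small sketch of that behaviour, plus one possible way to rotate between several proxies. The pool list and the pick_proxy helper are hypothetical; the addresses are the ones from the notes.

```python
import random

# A repeated key in a dict literal silently keeps only the last value.
proxies = {
    'http': 'http://localhost:9743',
    'http': 'http://124.205.155.148:9090',
}
print(proxies)  # only one 'http' entry survives

# Hypothetical pool of candidate proxies; pick one per request instead.
pool = [
    'http://localhost:9743',
    'http://124.205.155.148:9090',
]

def pick_proxy(pool):
    """Build a proxies dict for requests from one randomly chosen address."""
    address = random.choice(pool)
    return {'http': address, 'https': address}

chosen = pick_proxy(pool)
print(chosen)
# requests.get(url, proxies=chosen) would then route through that proxy.
```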

4. Setting a timeout

import requests
response=requests.get('https://www.baidu.com',
                      timeout=0.0001) # in seconds; a limit this small will almost certainly raise a timeout error
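Since a timeout this small is expected to fail, the call usually needs to be wrapped. A minimal sketch, assuming a timed-out request should simply be reported as None; the helper name fetch_status is made up for the example.

```python
import requests

def fetch_status(url, timeout):
    """Return the HTTP status code, or None if the request timed out.

    timeout is in seconds and applies both to connecting and to
    waiting for the server to send data.
    """
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return None
    return response.status_code

# Example call (needs network access):
# print(fetch_status('https://www.baidu.com', timeout=0.0001))
```

Other failures, such as a DNS error or a refused connection, raise different exceptions and are deliberately not caught here.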

5. Uploading files

import requests
files={'file':open('a.jpg','rb')} # form field name -> file object opened in binary mode
response=requests.post('http://httpbin.org/post',files=files)
print(response.status_code)
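To see what the files argument actually produces, the request can be built without being sent. The sketch below uses requests.Request(...).prepare() for that; the in-memory bytes are made-up data standing in for a.jpg.

```python
import requests

# In-memory bytes stand in for the contents of a.jpg (made-up data).
fake_image = b'\xff\xd8\xff\xe0 not a real image'

# files maps the form field name to (filename, content, content type).
files = {'file': ('a.jpg', fake_image, 'image/jpeg')}

# Build the request without sending it, to inspect what would go on the wire.
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])      # multipart/form-data with a boundary
print(b'filename="a.jpg"' in prepared.body)  # the filename is part of the body

# With a real file, a with-block makes sure the handle is closed afterwards:
# with open('a.jpg', 'rb') as f:
#     response = requests.post('http://httpbin.org/post', files={'file': f})
```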

As an aside, there are tools for load-testing a server, for example JMeter.

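6. Using bs4

The bs4 library makes it quick to locate elements in a page.

First install bs4 together with a parser such as lxml:

pip install lxml
pip install html5lib
pip install beautifulsoup4

To use it, fetch the page first and build a BeautifulSoup object from the response text:

import requests
from bs4 import BeautifulSoup
url='https://www.autohome.com.cn/news/1/#liststart'
res=requests.get(url)
soup=BeautifulSoup(res.text,'lxml')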

Next comes the basic find method, which returns a single element. It filters on criteria such as id or name. Example:

div=soup.find(id='auto-channel-lazyload-article')
ul=div.find(name='ul')
li_list=ul.find_all(name='li')
# print(len(li_list))
for li in li_list:
    h3=li.find(name='h3')
    if h3:
        title=h3.text  # text of the h3 tag
        print(title)
    a=li.find(name='a')
    if a:
        article_url=a.get('href')  # href attribute of the a tag
        print(article_url)
    img=li.find(name='img')
    if img:
        img_url=img.get('src')
        print(img_url)
    p=li.find(name='p')
    if p:
        content=p.text
        print(content)

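find_all, by contrast, returns every matching element.

Summary:

find:
- name="tag name": match by tag
- id="...", class_="...": match the tag with that id or class
- tag.text: the text content of the tag
- tag.get(attribute name): the value of one of the tag's attributes

find_all: takes the same filters, but returns a list of all matches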
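The difference between the two methods is easiest to see on a small inline document. The markup below is made up (it is not the real autohome page), and the built-in html.parser is used so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like a news list (not the real autohome page).
html = """
<div id="news">
  <ul>
    <li><a href="/a/1.html"><h3>First title</h3></a><p>First summary</p></li>
    <li><a href="/a/2.html"><h3>Second title</h3></a><p>Second summary</p></li>
    <li class="ad"></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml needed

first_li = soup.find(name='li')        # find: first match (or None)
all_li = soup.find_all(name='li')      # find_all: list of every match
print(first_li.h3.text, len(all_li))

articles = []
for li in all_li:
    h3 = li.find(name='h3')
    a = li.find(name='a')
    if h3 and a:                       # skip items missing either tag
        articles.append((h3.text, a.get('href')))
print(articles)

print(soup.find(id='missing'))         # None when nothing matches
print(soup.find_all(class_='ad'))      # class needs the trailing underscore
```

Because find returns None when nothing matches, the if checks in the loop matter: calling .text on a missing tag would raise an AttributeError.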

Other usage:

from bs4 import BeautifulSoup

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" id="bbaa"><b name="xx" age="18">The Dormouse's story</b><b>xxxx</b></p>

<p class="xxx" a="xxx">asdfasdf</p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup=BeautifulSoup(html_doc,'lxml')
ress=soup.prettify()   # pretty-print the markup
soup=BeautifulSoup(ress,'lxml')

Tags can then be reached directly with dot notation on the soup object:

# Traversing the document tree
# print(soup.p.name) # name of the tag
# print(soup.p.attrs) # dict of the tag's attributes
# print(soup.p.string)  # the tag's text, only when it has a single child; otherwise None
# print(list(soup.p.strings))  # generator over every string inside the tag
# print(soup.p.text) # all text inside the tag, joined together
# print(soup.p.b)
# print(soup.body.p.text) # text only, tags are stripped
# print(soup.body.p.contents)    # list of the tag's direct children
# print(list(soup.body.p.children))  # iterator over the direct children
# print(list(soup.body.p.descendants))  # generator over all descendants, at every depth
# print(soup.body.p.parent)  # the parent tag of p, with everything it contains
# print(list(soup.body.p.parents))  # all ancestors
# print(len(list(soup.body.p.parents)))
# print(soup.body.p.previous_sibling)  # previous sibling
# print(soup.find(class_="xxx").previous_sibling)
# print(soup.p.next_sibling) # next sibling
# print(soup.a.previous_sibling)
# print(type(soup.p))
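A few of these attributes behave in ways that are easy to misread, in particular .string versus .text. The sketch below runs them on a tiny made-up document with no whitespace between tags. In a prettified document like the one used in these notes, the whitespace between tags also counts as a sibling, so the sibling results there can be newline strings instead of tags.

```python
from bs4 import BeautifulSoup

html = '<body><p id="a"><b>bold</b>plain</p><p id="b">only text</p></body>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.find_all('p')

# .string is only set when a tag has exactly one child; otherwise it is None.
print(first.string)            # None: <p id="a"> has two children
print(second.string)           # only text

# .text joins every string underneath the tag.
print(first.text)              # boldplain

# .contents is a list of direct children; .descendants walks all levels.
print(len(first.contents))             # 2: the <b> tag and the string 'plain'
print(len(list(first.descendants)))    # 3: <b>, 'bold', 'plain'

# Moving sideways and upwards.
print(first.next_sibling['id'])        # b
print(second.previous_sibling['id'])   # a
print(first.parent.name)               # body
print([t.name for t in first.parents]) # ['body', '[document]']
```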

Searching the document tree:

There are five kinds of filters: string, regular expression, boolean, function, and list.

1. Filtering by string:

# print(soup.find_all(name='b'))

2. Filtering by regular expression (needs import re):

# print(soup.find_all(name=re.compile('^b')))
# print(soup.find_all(id=re.compile('^b')))

3. Filtering by list and by boolean:

# print(soup.find_all(name=['a','b']))
# print(soup.find_all(name=True))

4. Filtering by function:

# def has_class_but_no_id(tag):
#     return tag.has_attr('class') and not tag.has_attr('id')
# print(soup.find_all(name=has_class_but_no_id))
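The five filters can be compared side by side on a cut-down version of the sample document. This is a sketch using the built-in html.parser; the names helper is made up to keep the output short.

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
<p class="title" id="bbaa"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></p>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

def names(tags):
    """Reduce a list of tags to their tag names."""
    return [tag.name for tag in tags]

# 1. String: exact tag name.
print(names(soup.find_all(name='b')))                 # ['b']
# 2. Regex: tag names starting with 'b' (matches body as well as b).
print(names(soup.find_all(name=re.compile('^b'))))    # ['body', 'b']
# 3. List: any of the listed names, in document order.
print(names(soup.find_all(name=['a', 'b'])))          # ['b', 'a', 'a']
# 4. True: every tag.
print(names(soup.find_all(name=True)))                # ['body', 'p', 'b', 'p', 'a', 'a']

# 5. Function: called with each tag, keeps those where it returns True.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(names(soup.find_all(name=has_class_but_no_id))) # ['p']
```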

CSS selectors:

# print(soup.select(".title"))
# print(soup.select("#bbaa"))
# print(soup.select('#bbaa b')[0].attrs.get('name'))

Other arguments:

# recursive=False  search only the tag's direct children
# limit=n  stop after n matches have been found
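A short sketch of the selector calls and of the recursive and limit arguments, run on made-up markup. select_one is not mentioned in the notes; it is the single-result counterpart of select.

```python
from bs4 import BeautifulSoup

html = """
<div id="box"><p class="title"><b name="xx">one</b></p><p>two</p><span><p>nested</p></span></div>
"""
soup = BeautifulSoup(html, 'html.parser')
box = soup.find(id='box')

# CSS selectors: select() returns a list, select_one() the first match or None.
print(soup.select('.title')[0].text)                    # one
print(soup.select('#box b')[0].attrs.get('name'))       # xx
print(soup.select_one('#nothing'))                      # None

# recursive=False looks only at direct children of the tag.
print(len(box.find_all('p')))                           # 3
print(len(box.find_all('p', recursive=False)))          # 2

# limit=n stops after the first n matches.
print(len(box.find_all('p', limit=1)))                  # 1
```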

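7. Driving a website automatically with a testing tool

1. Install the selenium module:

pip install selenium

2. Put the driver (chromedriver) in the project folder or in the Scripts folder of the Python installation. Its version has to match the installed Chrome version:
http://npm.taobao.org/mirrors/chromedriver/78.0.3904.105/

Usage:

import time
from selenium import webdriver
bro=webdriver.Chrome()
bro.get('https://www.baidu.com')
time.sleep(3)
bro.close()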

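Headless (windowless) operation:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys # keyboard keys
import time
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('window-size=1920x3000') # browser resolution
chrome_options.add_argument('--disable-gpu') # the Google documentation mentions this flag as a workaround for a bug
chrome_options.add_argument('--hide-scrollbars') # hide scrollbars, for some special pages
chrome_options.add_argument('blink-settings=imagesEnabled=false') # do not load images, which speeds things up
chrome_options.add_argument('--headless') # no visible browser window; on Linux without a display, startup fails without this
chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" # manually specify the browser to use
# Create only one driver; the alternatives are left commented out.
# bro=webdriver.PhantomJS() # older headless browser
bro=webdriver.Chrome(options=chrome_options) # older selenium releases spell this chrome_options=chrome_options
# bro=webdriver.Chrome() # ordinary windowed Chrome
bro.get('https://www.baidu.com')

Chrome supports headless operation.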

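Controlling the browser automatically:

bro=webdriver.Chrome()
bro.get('https://www.baidu.com')
# print(bro.page_source)
# time.sleep(3)
time.sleep(1)
# locate the search input box
inp=bro.find_element_by_id('kw') # newer selenium releases use bro.find_element(By.ID, 'kw')
# type into the box
inp.send_keys("美女")
inp.send_keys(Keys.ENTER) # press Enter
# alternative: locate the button su and click it
time.sleep(3)
bro.close()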
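Fixed time.sleep calls either wait too long or not long enough. A possible alternative is selenium's explicit waits, sketched below. The function name search_baidu is made up, the element id 'kw' comes from the example above, and the assumption that the result page title contains the keyword is not verified here.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def search_baidu(keyword, wait_seconds=10):
    """Open Baidu headlessly, submit keyword, and return the result page title."""
    options = Options()
    options.add_argument('--headless')
    bro = webdriver.Chrome(options=options)
    try:
        bro.get('https://www.baidu.com')
        # Wait until the search box (id 'kw') exists instead of sleeping.
        inp = WebDriverWait(bro, wait_seconds).until(
            EC.presence_of_element_located((By.ID, 'kw'))
        )
        inp.send_keys(keyword)
        inp.send_keys(Keys.ENTER)
        # Assumption: the result page title mentions the keyword.
        WebDriverWait(bro, wait_seconds).until(EC.title_contains(keyword))
        return bro.title
    finally:
        bro.quit()  # close every window and stop the driver process

# Example call (needs Chrome and a matching chromedriver installed):
# print(search_baidu('selenium'))
```

The try/finally makes sure the browser is shut down even when a wait times out, which a bare bro.close() at the end of a script would not.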
