6. BeautifulSoup4------Logging in to a website automatically (manual version)

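A small example every day: (Logging in to the site used in the tutorial video worked right away. When I practised on a different site by myself, the problems never stopped.)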

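This time I analysed the login for Boss Zhipin (zhipin.com) on my own. It took me a whole afternoon, and I still type the captcha in by hand; recognising and entering the captcha automatically does not work yet. As the saying goes, the master leads you to the door, but the practice is up to you. I need to practise more.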

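Step 1: Visit the site and work out what data the login request needs.

Step 2: Create a Beautiful Soup object with an explicit parser, and extract the data the login needs:

data = {
    'regionCode': '+86',
    'account': account,        # your account
    'password': password,      # your password
    'captcha': captcha,        # the captcha text
    'randomKey': randomKey     # the randomKey carried by the captcha URL
}

Step 3: Once logged in, you can do whatever requires a login. I could not think of anything in particular, so I simply pull out some job listings, which works without logging in as well. The point is just to practise Beautiful Soup.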
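
The captcha handling in step 2 can be tried offline first. The sketch below is only an illustration: the HTML fragment and the key value are made up (the real login page may look different), and it uses the standard-library 'html.parser' so that it runs without lxml. It also reads randomKey with parse_qs instead of splitting on '=', which keeps working if more query parameters are added.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from bs4 import BeautifulSoup

# Hypothetical login-page fragment; the real page's markup may differ.
login_html = """
<form>
  <span class="ipt-wrap">
    <img class="verifyimg" src="/captcha/?randomKey=abc123XYZ">
  </span>
</form>
"""

soup = BeautifulSoup(login_html, "html.parser")

# 'span .verifyimg' matches any element with class "verifyimg" inside a <span>
captcha_src = soup.select("span .verifyimg")[0].get("src")

# Build an absolute URL for downloading the captcha image
captcha_url = urljoin("https://login.zhipin.com/", captcha_src)

# Read randomKey from the query string of the captcha URL
random_key = parse_qs(urlparse(captcha_src).query)["randomKey"][0]

print(captcha_url)
print(random_key)
```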

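 import requests
 from urllib.parse import urljoin
 from bs4 import BeautifulSoup

 # Placeholders: fill in your own credentials
 account = 'your-account'
 password = 'your-password'

 # Step 1: visit the site and work out what data the login request needs
 session = requests.Session()  # keeps cookies between requests; without it, each request would have to carry the auth cookies itself
 bossUrl = 'https://login.zhipin.com/'
 headers = {
     'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
 }
 response = session.get(url=bossUrl, headers=headers)

 # Step 2: create a Beautiful Soup object with an explicit parser and extract the login data
 # (the fields collected in `data` below)
 soup = BeautifulSoup(response.text, 'lxml')

 # Get the URL of the captcha image
 captchaUrl = soup.select('span .verifyimg')[0].get('src')
 # Download it through the same session so that the cookies stay consistent
 img = session.get(urljoin(bossUrl, captchaUrl), headers=headers)

 # Get the randomKey from the captcha URL
 randomKey = captchaUrl.split('=')[1]

 # Save the captcha image so it can be read by eye
 with open('captcha.png', 'wb') as f:
     f.write(img.content)

 # Type in the captcha by hand
 captcha = input('Enter the captcha: ')

 data = {
     'regionCode': '+86',
     'account': account,
     'password': password,
     'captcha': captcha,
     'randomKey': randomKey
 }
 loginUrl = 'https://login.zhipin.com/login/account.json'
 login = session.post(loginUrl, data=data, headers=headers)

 # Step 3: once logged in, you can do whatever requires a login.
 # The listings scraped below are available without logging in; this is just practice.
 boss = session.get("https://www.zhipin.com/", headers=headers)
 jobSoup = BeautifulSoup(boss.text, 'lxml')
 jobPrimary = jobSoup.select('.sub-li')
 for job in jobPrimary:
     job_info = job.find('p').get_text()
     try:
         job_text = job.find_all(name='p', class_='job-text')[0].get_text()
     except IndexError:  # this entry has no 'job-text' paragraph
         job_text = ''
     print(job_info, job_text)

Result: I only extracted the data, without tidying it up

 D:\python.exe F:/django_test/spider/captcha_test.py

 Enter the captcha: 5n47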

 架构师30K - 45K  杭州3-5年本科

 web前端20K - 40K  北京3-5年本科

 测试开发20K - 40K  杭州经验不限本科

 销售运营23K - 24K  上海5-10年本科

 内容营销产品运营15K - 30K  北京5-10年本科

 Android22K - 44K  北京3-5年本科

 iOS(高P)38K - 58K  北京5-10年硕士
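
The fields in the output above run together because get_text() joins the text fragments with nothing in between. A minimal sketch of one way to tidy this, assuming each fragment sits in its own child element: the markup below is invented for illustration and only loosely follows the '.sub-li' entries, and 'html.parser' is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real listing structure may differ.
html = """
<ul>
  <li class="sub-li">
    <p>Architect<span>30K - 45K</span></p>
    <p class="job-text">Hangzhou<em></em>3-5 years<em></em>Bachelor</p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for job in soup.select(".sub-li"):
    # separator=" " puts a space between fragments; strip=True trims each one
    job_info = job.find("p").get_text(" ", strip=True)
    detail = job.find("p", class_="job-text")
    job_text = detail.get_text(" ", strip=True) if detail is not None else ""
    rows.append((job_info, job_text))

print(rows)
```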

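Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

lxml can get by with traversing only part of a document, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and builds the complete DOM tree, so it costs considerably more time and memory and is slower than lxml.

Beautiful Soup makes parsing HTML simple and its API is very friendly. It supports CSS selectors and the HTML parser in the Python standard library, and it can also use lxml's XML parser.

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0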
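
The choice of parser can be illustrated with a small sketch. It tries 'lxml' first and falls back to the standard-library 'html.parser' when lxml is not installed; the markup and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

try:
    # Needs the separate lxml package (pip install lxml)
    soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    # Ships with Python, so it is always available
    soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string
first_p = soup.p.get_text()
print(title_text, first_p)
```

For well-formed input like this, both parsers produce the same tree; they can differ in how they repair broken HTML.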

First, import it:

# Import the module

from bs4 import BeautifulSoup

html = html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

asdf

    <div class="title">

        <b>The Dormouse's story总共</b>

        <h1>f</h1>

    </div>

<div class="story">Once upon a time there were three little sisters; and their names were

    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,

    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</div>

ad<br/>sf

<p class="story">...</p>

</body>

</html>

"""

# Create the Beautiful Soup object and name the parser explicitly; leaving it out triggers a warning

'''

 UserWarning: No parser was explicitly specified, so I'm using the best.... 

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))

'''

soup = BeautifulSoup(html,'lxml')

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('hello.html'),'lxml')

# Find the first <a> tag
tag1 = soup.find(name='a')

# Find all <a> tags
tag2 = soup.find_all(name='a')

# Find the tag with id=link2 (select() returns a list)
tag3 = soup.select('#link2')

1. find_all(name, attrs, recursive, text, **kwargs): get every matching tag

 # tags = soup.find_all('a')
 # print(tags)

 # tags = soup.find_all('a', limit=1)
 # print(tags)

 # tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
 # # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
 # print(tags)

 # ####### Lists #######
 # v = soup.find_all(name=['a','div'])
 # print(v)

 # v = soup.find_all(class_=['sister0', 'sister'])
 # print(v)

 # v = soup.find_all(text=['Tillie'])
 # print(v, type(v[0]))

 # v = soup.find_all(id=['link1','link2'])
 # print(v)

 # v = soup.find_all(href=['link1','link2'])
 # print(v)

 # ####### Regular expressions #######
 import re

 # rep = re.compile('p')
 # rep = re.compile('^p')
 # v = soup.find_all(name=rep)
 # print(v)

 # rep = re.compile('sister.*')
 # v = soup.find_all(class_=rep)
 # print(v)

 # rep = re.compile('http://www.oldboy.com/static/.*')
 # v = soup.find_all(href=rep)
 # print(v)

 # ####### Filtering with a function #######
 # def func(tag):
 #     return tag.has_attr('class') and tag.has_attr('id')
 # v = soup.find_all(name=func)
 # print(v)

 # ## get: read a tag attribute
 # tag = soup.find('a')
 # v = tag.get('id')
 # print(v)
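
The filters listed above can be tried out in one runnable piece. This is a sketch on a cut-down copy of the sample document, parsed with 'html.parser' so that it runs without lxml; the variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="story">
  <a class="sister0" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
  <p class="story">...</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                                # by tag name
limited = soup.find_all("a", limit=1)                       # stop after the first match
by_list = soup.find_all(["a", "p"])                         # any name in the list
by_class = soup.find_all(class_=re.compile("^sister"))      # regex on the class
by_href = soup.find_all(href=re.compile(r"example\.com"))   # regex on an attribute


def has_class_and_id(tag):
    """Keep only tags that carry both a class and an id attribute."""
    return tag.has_attr("class") and tag.has_attr("id")


by_func = soup.find_all(has_class_and_id)

print([t.get("id") for t in by_func])
```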

2. find(name, attrs, recursive, text, **kwargs): get the first matching tag

 tag = soup.find('a')

 print(tag)

 tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')

 tag = soup.find(text='Lacie')

 print(tag)
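
One practical difference between find and find_all is what they return when nothing matches. A small sketch on a made-up one-line document, using 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a></p>', "html.parser")

first = soup.find("a")               # the first matching tag
missing = soup.find("table")         # None when nothing matches
none_found = soup.find_all("table")  # an empty list when nothing matches

print(first, missing, none_found)
```

Because find can return None, it is worth checking the result before calling methods such as get_text() on it.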

3. name: the tag name; attrs: the tag's attributes

 tag = soup.find('a')

 # the <a> tag
 print(tag.name)

 attrs = tag.attrs    # get
 print(tag.attrs)
 #{'class': ['sister0'], 'id': 'link1'}

 tag.attrs = {'i':123} # set (replaces the whole dict)
 tag.attrs['id'] = 'iiiii' # set a single attribute
 print(tag.attrs)
 #{'i': 123, 'id': 'iiiii'}

4. children: all direct child nodes

 '''
 It does not return a list, but we can loop over it to reach every child node.
 Printing .children shows that it is a list iterator object.
 '''

 div = soup.find('div',class_="story")

 print(div.children)

 for children in div.children:

     print(children)

 # Result:

 '''Once upon a time there were three little sisters; and their names were

 <a class="sister0" id="link1">Els<span>f</span>ie</a>

 ,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

  and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 ;

 and they lived at the bottom of a well.'''

5. descendants: all descendant nodes (children, grandchildren, and so on)

 div = soup.find('div',class_="story")

 print(div.descendants)

 for children in div.descendants:

     print(children)

 # Result:

 '''Once upon a time there were three little sisters; and their names were

 <a class="sister0" id="link1">Els<span>f</span>ie</a>

 Els

 <span>f</span>

 f

 ie

 ,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

 Lacie

  and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 Tillie

 ;

 and they lived at the bottom of a well.'''
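
The difference between the two outputs above is easier to see on a tiny document. A sketch with made-up markup and 'html.parser': .children stops at the direct children, while .descendants also walks into every nested tag and text node.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>El<span>f</span>ie</a>, <b>x</b></div>", "html.parser")
div = soup.div

children = list(div.children)        # <a>, the text ', ', <b>
descendants = list(div.descendants)  # the same nodes plus everything nested inside them

print(len(children), len(descendants))
print([str(node) for node in descendants])
```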

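6. CSS selectors

This is another way of searching that achieves much the same as find_all.

In CSS, a tag name is written bare, a class name is prefixed with ., and an id is prefixed with #.
The same syntax can be used here to filter elements. The method is soup.select(), and it returns a list.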

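 # Look up by tag name
 print(soup.select('title'))

 # Look up by class name
 print(soup.select('.sister'))

 # Look up by id
 print(soup.select('#link1'))

 # Combined lookups
 '''Combined lookups work the same way as combining tag names, class names and ids in a CSS file.
    For example, to find the element with id link1 inside a div tag, separate the two with a space.'''
 print(soup.select('div #link1'))

 # Lookups by attribute
 '''An attribute can be added to the lookup; it goes in square brackets. Parts that belong to different nodes are separated by a space.
    Note that an attribute and its tag belong to the same node, so there must be no space between them, otherwise nothing will match.'''
 print(soup.select('a[class="sister"]'))
 print(soup.select('div a[class="sister"]'))

 # Getting the content: every select call above returns a list,
 # so loop over it and call get_text() on each element to get its text.
 for title in soup.select('title'):
     print(title.get_text())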
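
The selector forms above, plus select_one (which returns a single element or None instead of a list), can be checked in a short sketch. It uses a cut-down copy of the sample document and 'html.parser'; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="story">
  <a class="sister0" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

class_ids = [a["id"] for a in soup.select(".sister")]                 # class lookup
inside_div = soup.select("div #link1")                                # descendant combinator
attr_ids = [a["id"] for a in soup.select('div a[href$="tillie"]')]    # attribute ends with "tillie"
first = soup.select_one("a.sister")                                   # first match only
nothing = soup.select_one("#nope")                                    # None when nothing matches

print(class_ids, attr_ids, first.get_text(), nothing)
```

Note that '.sister' does not match the first link, because its class is 'sister0', not 'sister'.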

7. clear: removes all of a tag's children (the tag itself is kept)

 tag = soup.find('body')

 tag.clear()

 print(soup)

 '''Result:

 <html><head><title>The Dormouse's story</title></head>

 <body></body>

 </html>'''

8. decompose: recursively removes the tag together with everything inside it

 tag = soup.find('body')

 tag.decompose()

 print(soup)

 '''Result:

 <html><head><title>The Dormouse's story</title></head>

 </html>'''
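
To see clear and decompose side by side, here is a small sketch on a made-up document with two divs, parsed with 'html.parser': clear empties a tag but leaves it in the tree, while decompose removes the tag itself.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="a"><p>x</p></div><div id="b"><p>y</p></div>', "html.parser"
)

soup.find(id="a").clear()        # empties the first div, keeps the tag
after_clear = str(soup)

soup.find(id="b").decompose()    # removes the second div entirely
after_decompose = str(soup)

print(after_clear)
print(after_decompose)
```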

9. decode: converts to a string (including the current tag); decode_contents: the same, without the current tag

 tag = soup.find('body')

 v = tag.decode()

 print(type(soup))

 print(type(v))

 '''Result:

 <class 'bs4.BeautifulSoup'>

 <class 'str'>'''

10. encode: converts to bytes (including the current tag); encode_contents: the same, without the current tag

 tag = soup.find('body')

 v = tag.encode()

 print(type(soup))

 print(type(v))
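
The four conversion methods differ only in the return type and in whether the tag's own markup is included. A sketch on a made-up one-line document, using 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="x">hi <b>there</b></p>', "html.parser")
tag = soup.find("p")

as_str = tag.decode()               # str, including the <p> tag itself
inner_str = tag.decode_contents()   # str, only what is inside <p>
as_bytes = tag.encode()             # bytes (UTF-8 by default), including <p>
inner_bytes = tag.encode_contents() # bytes, only what is inside <p>

print(as_str)
print(inner_str)
print(as_bytes)
print(inner_bytes)
```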

11. has_attr: checks whether the tag has a given attribute; get_text: gets the text inside the tag; index: gets the position of a child tag within its parent tag
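
A short sketch of the three methods on a made-up document, using 'html.parser'. Note that index is called on the parent and is given the child whose position is wanted.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>a</p><span id="s">b</span></div>', "html.parser")
div = soup.div
span = soup.span

has_id = span.has_attr("id")        # the span has an id attribute
has_class = span.has_attr("class")  # but no class attribute
text = div.get_text()               # all text inside the div, concatenated
position = div.index(span)          # the span is the second child (index 1)

print(has_id, has_class, text, position)
```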

12. Tags related to the current tag

  soup.next

  soup.next_element

  soup.next_elements

  soup.next_sibling

  soup.next_siblings

  tag.previous

  tag.previous_element

  tag.previous_elements

  tag.previous_sibling

  tag.previous_siblings

  tag.parent

  tag.parents
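
A few of these attributes in action, as a sketch on a made-up document parsed with 'html.parser'. The distinction worth noticing is that next_element follows document order (so it can step into the tag), while next_sibling stays on the same level.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="first">one</a><b>two</b></p>', "html.parser")
a = soup.a

sibling_name = a.next_sibling.name               # the <b> next to <a>
next_el = str(a.next_element)                    # the text node inside <a>
prev_id = soup.b.previous_sibling.get("id")      # back from <b> to <a>
parent_name = a.parent.name                      # the enclosing <p>
ancestor_names = [p.name for p in a.parents]     # all the way up to the document

print(sibling_name, next_el, prev_id, parent_name, ancestor_names)
```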

13. Searching among a tag's related tags

  tag.find_next(...)

  tag.find_all_next(...)

  tag.find_next_sibling(...)

  tag.find_next_siblings(...)

  tag.find_previous(...)

  tag.find_all_previous(...)

  tag.find_previous_sibling(...)

  tag.find_previous_siblings(...)

  tag.find_parent(...)

  tag.find_parents(...)

  These take the same arguments as find_all
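
A sketch of a few of these search methods on a made-up document, using 'html.parser'. Each one starts from an existing tag and accepts the same filters as find_all.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><h2>Title</h2><p>first</p><p class="note">second</p></div>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.h2

next_p = h2.find_next_sibling("p").get_text()                      # nearest following <p> sibling
note_p = h2.find_next_sibling("p", class_="note").get_text()       # same, filtered by class
all_p = [p.get_text() for p in h2.find_next_siblings("p")]         # every following <p> sibling
back = soup.find("p", class_="note").find_previous_sibling("h2").get_text()
box_id = soup.p.find_parent("div")["id"]                           # nearest enclosing <div>

print(next_p, note_p, all_p, back, box_id)
```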

14. Setting up relationships between tags

 tag = soup.find('div')

 a = soup.find('a')

 tag.setup(previous_sibling=a)

 print(tag.previous_sibling)

15. Creating a tag

 from bs4.element import Tag

 obj = Tag(name='a',attrs={'id':'qqq','class':'djj'})

 obj.string ='kv'

 print(obj)

 # Result: <a class="djj" id="qqq">kv</a>
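
Besides constructing Tag directly, a tag can also be created with soup.new_tag(), which ties the new tag to an existing soup. A minimal sketch with made-up markup and 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")

# Keyword arguments become attributes of the new tag
link = soup.new_tag("a", href="http://example.com/")
link.string = "kv"

# A new tag is not in the tree until it is attached somewhere
soup.div.append(link)
result = str(soup)

print(result)
```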

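16. insert_after, insert_before: insert after or before the current tag; append: add a tag at the end of the current tag's contents; insert: insert a tag at a given position inside the current tag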
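
The four insertion methods can be compared in one sketch. The document is made up, 'html.parser' is used, and make() is a hypothetical helper written only for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p></div>", "html.parser")
div, p = soup.div, soup.p


def make(name, text):
    """Hypothetical helper: build a new tag with the given text."""
    tag = soup.new_tag(name)
    tag.string = text
    return tag


div.append(make("span", "end"))         # last child of <div>
p.insert_before(make("b", "before"))    # directly before <p>
p.insert_after(make("i", "after"))      # directly after <p>
div.insert(0, make("em", "first"))      # at position 0 inside <div>

result = str(soup)
print(result)
```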

17. wrap: wraps the current tag in the given tag; unwrap: removes the current tag but keeps what it wrapped
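
A minimal sketch of wrap and unwrap on a made-up document, using 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello <b>world</b></p>", "html.parser")

# Put a new <i> tag around the existing <b> tag
soup.b.wrap(soup.new_tag("i"))
after_wrap = str(soup)

# Remove the <b> tag itself but keep its text in place
soup.b.unwrap()
after_unwrap = str(soup)

print(after_wrap)
print(after_unwrap)
```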
