mechanize (1)
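While reading up on web crawlers and simulated logins recently, I came across this package.
mechanize ['mekə.naɪz] means "to mechanize", and the name fits: the package is all about automation.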
mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
- Any URL can be opened, not just http:.
- mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
- Easy HTML form filling.
- Convenient link parsing and following.
- Browser history (.back() and .reload() methods).
- The Referer HTTP header is added properly (optional).
- Automatic observance of robots.txt.
- Automatic handling of HTTP-Equiv and Refresh.
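To illustrate what "observance of robots.txt" means, the sketch below runs the check by hand with the standard library's urllib.robotparser (Python 3). The rules, the user-agent name example-bot and the URLs are all made up for the example; mechanize performs this kind of check for you, and its internals may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers: may this user agent request this URL?
allowed = parser.can_fetch("example-bot", "http://www.example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://www.example.com/private/data.html")
```

Here allowed is true and blocked is false, because only paths under /private/ are disallowed.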
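In other words, mechanize.Browser and mechanize.UserAgentBase implement the urllib2.OpenerDirector interface, so they can open URLs of any supported protocol, not just HTTP.
They also make configuration simpler, since you do not have to create a new OpenerDirector every time.
On top of that come form handling, link handling, browsing history and reload, Refresh handling, observance of robots.txt, and so on.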
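To make the point about OpenerDirector concrete, here is a minimal sketch of the standard-library approach that mechanize spares you from. It uses urllib.request, the Python 3 successor of urllib2. The function make_opener and its parameters are hypothetical names of my own, not part of any library. Every change of configuration means building a new opener:

```python
import urllib.request
from http.cookiejar import CookieJar


def make_opener(use_cookies, proxies=None):
    """Build a fresh OpenerDirector for one fixed configuration (hypothetical helper)."""
    handlers = []
    if use_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    # An explicit (possibly empty) mapping stops ProxyHandler from reading
    # proxy settings from the environment.
    handlers.append(urllib.request.ProxyHandler(proxies or {}))
    return urllib.request.build_opener(*handlers)


# Two configurations require two separate opener objects.
with_cookies = make_opener(use_cookies=True)
without_cookies = make_opener(use_cookies=False)
```

Nothing is fetched here; the sketch only shows that the configuration is fixed when the opener is built, which is the step the quoted feature list says mechanize lets you avoid.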
import re
import mechanize
# (1) Instantiate a browser object
br = mechanize.Browser()
# (2) Open a URL
br.open("http://www.example.com/")
# (3) Follow the second link (nr is zero-based) whose element text matches text_regex
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
# (4) Title of the page
print(br.title())
# (5) Print the URL of the page
print(response1.geturl())
# (6) Headers of the response
print(response1.info())  # headers
# (7) Body of the response
print(response1.read())  # body
# (8) Select the form named "order" in the page body
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
# (9) Set the value of the form control named "cheeses"
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
# (10) Submit the current form. Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()
# print currently selected form (don't call .submit() on this, use br.submit())
print(br.form)
# (11) Go back
response3 = br.back()  # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data() # like .seek(0) followed by .read()
# (12) Reload the page
response4 = br.reload()  # fetches from server
# (13) List every form on the page
for form in br.forms():
    print(form)
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
    print(link)
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
This is an example taken from the documentation; the basic explanations are given alongside the code.
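The follow_link call in step (3) picks a link by a regular expression on its text plus a zero-based index nr. The following standard-library sketch (Python 3) imitates that selection rule on a made-up page so the meaning of nr=1 is easy to see. LinkCollector, find_link and SAMPLE_HTML are hypothetical names for this illustration; this is not how mechanize itself is implemented.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_link(html, text_regex, nr=0):
    """Return the nr-th (zero-based) link whose text matches text_regex."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    matches = [link for link in parser.links if re.search(text_regex, link[1])]
    return matches[nr]  # raises IndexError if there are too few matches


SAMPLE_HTML = """
<a href="/about">About</a>
<a href="/shop1">cheese shop</a>
<a href="/shop2">cheese  shop (annex)</a>
"""

second = find_link(SAMPLE_HTML, r"cheese\s*shop", nr=1)
```

With this sample page, second refers to the /shop2 link: two links match the pattern, and nr=1 selects the second of them.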
You may control the browser's policy by using the methods of mechanize.Browser's base class, mechanize.UserAgent. The code below, also taken from the documentation, shows how:
br = mechanize.Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({
    "http": "joe:password@myproxy.example.com:3128",
    "ftp": "proxy.example.com",
})
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't add Referer (sic) header
br.set_handle_referer(False)
# Don't handle Refresh redirections
br.set_handle_refresh(False)
# Don't handle cookies (pass None instead of a cookiejar)
br.set_cookiejar(None)
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
cj = mechanize.CookieJar()
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)
# To make sure you're seeing all debug output:
import logging
import sys
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
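The proxy strings passed to set_proxies in the code above can carry up to four pieces of information, of which the userinfo and the port are optional. As a rough illustration, the sketch below uses the standard library's urllib.parse (Python 3) to show how such a string decomposes. split_proxy is a hypothetical helper of my own, and mechanize's own parsing may well differ.

```python
from urllib.parse import urlsplit


def split_proxy(proxy):
    """Split 'user:password@host:port' into its parts; missing parts are None."""
    # The leading "//" makes urlsplit treat the string as a network location.
    parts = urlsplit("//" + proxy)
    return parts.username, parts.password, parts.hostname, parts.port


full = split_proxy("joe:password@myproxy.example.com:3128")
bare = split_proxy("proxy.example.com")
```

For the first string all four parts are present; for the second only the host name is set and the rest come back as None.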
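Besides mechanize itself, there are other modules for interacting with web pages in a similar way, including several wrappers around mechanize designed for functional testing of web applications.
In the end they are all built on top of urllib2, so just pick whichever one you find most convenient.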