配置Nutch模拟浏览器以绕过反爬虫限制
原文链接:http://yangshangchuan.iteye.com/blog/2030741
当我们配置Nutch抓取 http://yangshangchuan.iteye.com 的时候,抓取的所有页面内容均为:您的访问请求被拒绝 ...... 这是最简单的反爬虫策略(该策略简单地读取HTTP请求头User-Agent的值来判断是人(浏览器)还是机器爬虫),我们只需要简单地配置Nutch来模拟浏览器(simulate web browser)就可以绕过这种限制。
在nutch-default.xml中有项配置是和User-Agent相关的:
- <property>
- <name>http.agent.description</name>
- <value></value>
- <description>Further description of our bot- this text is used in
- the User-Agent header. It appears in parenthesis after the agent name.
- </description>
- </property>
- <property>
- <name>http.agent.url</name>
- <value></value>
- <description>A URL to advertise in the User-Agent header. This will
- appear in parenthesis after the agent name. Custom dictates that this
- should be a URL of a page explaining the purpose and behavior of this
- crawler.
- </description>
- </property>
- <property>
- <name>http.agent.email</name>
- <value></value>
- <description>An email address to advertise in the HTTP 'From' request
- header and User-Agent header. A good practice is to mangle this
- address (e.g. 'info at example dot com') to avoid spamming.
- </description>
- </property>
- <property>
- <name>http.agent.name</name>
- <value></value>
- <description>HTTP 'User-Agent' request header. MUST NOT be empty -
- please set this to a single word uniquely related to your organization.
- NOTE: You should also check other related properties:
- http.robots.agents
- http.agent.description
- http.agent.url
- http.agent.email
- http.agent.version
- and set their values appropriately.
- </description>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>Nutch-1.7</value>
- <description>A version string to advertise in the User-Agent
- header.</description>
- </property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.version</name>
<value>Nutch-1.7</value>
<description>A version string to advertise in the User-Agent
header.</description>
</property>
在类nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java中可以看到这项配置是如何构成User-Agent的:
- this.userAgent = getAgentString( conf.get("http.agent.name"),
- conf.get("http.agent.version"),
- conf.get("http.agent.description"),
- conf.get("http.agent.url"),
- conf.get("http.agent.email") );
this.userAgent = getAgentString( conf.get("http.agent.name"),
conf.get("http.agent.version"),
conf.get("http.agent.description"),
conf.get("http.agent.url"),
conf.get("http.agent.email") );
- private static String getAgentString(String agentName,
- String agentVersion,
- String agentDesc,
- String agentURL,
- String agentEmail) {
- if ( (agentName == null) || (agentName.trim().length() == 0) ) {
- // TODO : NUTCH-258
- if (LOGGER.isErrorEnabled()) {
- LOGGER.error("No User-Agent string set (http.agent.name)!");
- }
- }
- StringBuffer buf= new StringBuffer();
- buf.append(agentName);
- if (agentVersion != null) {
- buf.append("/");
- buf.append(agentVersion);
- }
- if ( ((agentDesc != null) && (agentDesc.length() != 0))
- || ((agentEmail != null) && (agentEmail.length() != 0))
- || ((agentURL != null) && (agentURL.length() != 0)) ) {
- buf.append(" (");
- if ((agentDesc != null) && (agentDesc.length() != 0)) {
- buf.append(agentDesc);
- if ( (agentURL != null) || (agentEmail != null) )
- buf.append("; ");
- }
- if ((agentURL != null) && (agentURL.length() != 0)) {
- buf.append(agentURL);
- if (agentEmail != null)
- buf.append("; ");
- }
- if ((agentEmail != null) && (agentEmail.length() != 0))
- buf.append(agentEmail);
- buf.append(")");
- }
- return buf.toString();
- }
private static String getAgentString(String agentName,
String agentVersion,
String agentDesc,
String agentURL,
String agentEmail) { if ( (agentName == null) || (agentName.trim().length() == 0) ) {
// TODO : NUTCH-258
if (LOGGER.isErrorEnabled()) {
LOGGER.error("No User-Agent string set (http.agent.name)!");
}
} StringBuffer buf= new StringBuffer(); buf.append(agentName);
if (agentVersion != null) {
buf.append("/");
buf.append(agentVersion);
}
if ( ((agentDesc != null) && (agentDesc.length() != 0))
|| ((agentEmail != null) && (agentEmail.length() != 0))
|| ((agentURL != null) && (agentURL.length() != 0)) ) {
buf.append(" ("); if ((agentDesc != null) && (agentDesc.length() != 0)) {
buf.append(agentDesc);
if ( (agentURL != null) || (agentEmail != null) )
buf.append("; ");
} if ((agentURL != null) && (agentURL.length() != 0)) {
buf.append(agentURL);
if (agentEmail != null)
buf.append("; ");
} if ((agentEmail != null) && (agentEmail.length() != 0))
buf.append(agentEmail); buf.append(")");
}
return buf.toString();
}
在类nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java中使用User-Agent请求头,这里的http.getUserAgent()返回的userAgent就是HttpBase.java中的userAgent:
- String userAgent = http.getUserAgent();
- if ((userAgent == null) || (userAgent.length() == 0)) {
- if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
- } else {
- reqStr.append("User-Agent: ");
- reqStr.append(userAgent);
- reqStr.append("\r\n");
- }
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
reqStr.append("User-Agent: ");
reqStr.append(userAgent);
reqStr.append("\r\n");
}
通过上面的分析可知:在nutch-site.xml中只需要增加如下几种配置之一便可以模拟一个特定的浏览器(Imitating a specific browser):
1、模拟Firefox浏览器:
- <property>
- <name>http.agent.name</name>
- <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>20100101 Firefox/27.0</value>
- </property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
<name>http.agent.version</name>
<value>20100101 Firefox/27.0</value>
</property>
2、模拟IE浏览器:
- <property>
- <name>http.agent.name</name>
- <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>6.0)</value>
- </property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
<name>http.agent.version</name>
<value>6.0)</value>
</property>
3、模拟Chrome浏览器:
- <property>
- <name>http.agent.name</name>
- <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>537.36</value>
- </property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
<name>http.agent.version</name>
<value>537.36</value>
</property>
4、模拟Safari浏览器:
- <property>
- <name>http.agent.name</name>
- <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>534.57.2</value>
- </property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
<name>http.agent.version</name>
<value>534.57.2</value>
</property>
5、模拟Opera浏览器:
- <property>
- <name>http.agent.name</name>
- <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
- </property>
- <property>
- <name>http.agent.version</name>
- <value>19.0.1326.59</value>
- </property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
<name>http.agent.version</name>
<value>19.0.1326.59</value>
</property>
后记:查看User-Agent的方法:
1、http://www.useragentstring.com
配置Nutch模拟浏览器以绕过反爬虫限制的更多相关文章
- 使用HttpClient配置代理服务器模拟浏览器发送请求调用接口测试
在调用公司的某个接口时,直接通过浏览器配置代理服务器可以请求到如下数据: 请求url地址:http://wwwnei.xuebusi.com/rd-interface/getsales.jsp?cid ...
- Python 配置 selenium 模拟浏览器环境,带下载链接
使用浏览器渲染引擎.直接用浏览器在显示网页时解析HTML,应用CSS样式并执行JavaScript的语句. 这方法在爬虫过程中会打开一个浏览器,加载该网页,自动操作浏览器浏览各个网页,顺便把数据抓下来 ...
- python反爬虫解决方法——模拟浏览器上网
之前第一次练习爬虫的时候看网上的代码有些会设置headers,然后后面的东西我又看不懂,今天终于知道了原来这东西是用来模拟浏览器上网用的,因为有些网站会设置反爬虫机制,所以如果要获取内容的话,需要使用 ...
- scrapy反反爬虫策略和settings配置解析
反反爬虫相关机制 Some websites implement certain measures to prevent bots from crawling them, with varying d ...
- 关于千里马招标网知道创宇反爬虫521状态码的解决方案(python代码模拟js生成cookie _clearence值)
一.问题发现 近期我在做代理池的时候,发现了一种以前没有见过的反爬虫机制.当我用常规的requests.get(url)方法对目标网页进行爬取时,其返回的状态码(status_code)为521,这是 ...
- 《Python3反爬虫原理与绕过实战》作者韦世东
可以用(k1,k2)-k1来设置,如果有重复的key,则保留key1,舍弃key2/打印appleMap{1=Apple{id=1,name=苹果1,money=3.25,num=10},2=Appl ...
- python爬虫:使用Selenium模拟浏览器行为
前几天有位微信读者问我一个爬虫的问题,就是在爬去百度贴吧首页的热门动态下面的图片的时候,爬取的图片总是爬取不完整,比首页看到的少.原因他也大概分析了下,就是后面的图片是动态加载的.他的问题就是这部分动 ...
- Python开发爬虫之动态网页抓取篇:爬取博客评论数据——通过Selenium模拟浏览器抓取
区别于上篇动态网页抓取,这里介绍另一种方法,即使用浏览器渲染引擎.直接用浏览器在显示网页时解析 HTML.应用 CSS 样式并执行 JavaScript 的语句. 这个方法在爬虫过程中会打开一个浏览器 ...
- 第三百三十三节,web爬虫讲解2—Scrapy框架爬虫—Scrapy模拟浏览器登录—获取Scrapy框架Cookies
第三百三十三节,web爬虫讲解2—Scrapy框架爬虫—Scrapy模拟浏览器登录 模拟浏览器登录 start_requests()方法,可以返回一个请求给爬虫的起始网站,这个返回的请求相当于star ...
随机推荐
- curl几个选项
1.--cacert 选项请看https://curl.haxx.se/docs/sslcerts.html 2.CURL库怎样验证服务器证书 [复制链接] 中提到:你是客户端, 你希望的是: 你拿 ...
- Linux 零碎知识点
ln -s ../libs/ libs 在当前目录下建立一个符号链接文件libs,使它指向上一层目录的libs文件夹 关于su和su -的区别切换用户是可以使用su tom或者su - tom来实现, ...
- Unable to resolve target 'android-XX'的问题解决
1.问题现象(Unable to resolve target 'android-XX') 将低版本的代码导入eclipse时,常遇到这样的问题:Unable to resolve target 'a ...
- hdu 2571 命运(dp)
Problem Description 穿过幽谷意味着离大魔王lemon已经无限接近了! 可谁能想到,yifenfei在斩杀了一些虾兵蟹将后,却再次面临命运大迷宫的考验,这是魔王lemon设下的又一个 ...
- Tree(prime)
Tree Time Limit : 6000/2000ms (Java/Other) Memory Limit : 32768/32768K (Java/Other) Total Submissi ...
- [置顶] Codeforces Round #190 (Div. 2)(完全)
好久没有写博客了,一直找不到有意义的题可以写,这次也不算多么有意义,只是今天是比较空的一天,趁这个时候写一写. A. B. 有一点贪心,先把每个拿去3的倍数,余下0或1或2,然后三个一起拿. 对于以上 ...
- Android OpenCV实现图片叠加,水印
关于如何用纯OpenCV实现图片叠加的例子实在是太少,太多的是使用 C++,JNI实现的,如果要用C++的话,我们为啥不转行做C++ 下面的例子基于 Android JavaCV 实现了在im_bea ...
- 使用C#创建winform窗体,修改debugwen文件夹下exe应用程序的默认图标
在做一个接口程序是遇到的问题,记录一下: 在解决方案资源管理器上,右击项目名称——属性——点击图标和清单右边的的按纽——去Debug文件夹中找到自己的图标,打开.然后保存.
- phpStorm注册马码
phpstorm已经升级到10.0,原注册码失效,10.0注册方法:注册时选择“License server”输入 http://idea.lanyus.com/ 此工具类工具一直没有让我失望
- C# 3循环 for语句
循环:可以反复执行某段代码,直到不满足循环条件为止. 一.循环的四要素:初始条件.循环条件.状态改变.循环体. 1.初始条件:循环最开始的状态. 2.循环条件:在什么条件下进行循环,不满足此条件,则循 ...