Configuring Nutch to Simulate a Browser and Bypass Anti-Crawler Restrictions

Original link: http://yangshangchuan.iteye.com/blog/2030741

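When we configure Nutch to crawl http://yangshangchuan.iteye.com, every page it fetches contains only: "您的访问请求被拒绝 ......" ("Your access request has been denied ......"). This is the simplest kind of anti-crawler strategy: the site just reads the value of the HTTP request header User-Agent to decide whether the visitor is a human (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.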
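
To make the strategy concrete, here is a minimal sketch of what such a User-Agent check might look like on the server side. It is purely illustrative: the class name, the token list, and the rejection rule are assumptions, not the actual logic used by iteye.com.

```java
import java.util.Locale;

// Hypothetical server-side filter that looks at nothing but the User-Agent header.
class NaiveUserAgentFilter {

    // Assumed list of substrings that a naive filter might treat as crawler signatures.
    private static final String[] CRAWLER_TOKENS = {"nutch", "bot", "spider", "crawler"};

    /** Returns true when the request would be rejected based on the User-Agent alone. */
    static boolean isRejected(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true; // no User-Agent at all: treated as a machine
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : CRAWLER_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A default-looking Nutch agent string is rejected by this filter.
        System.out.println(isRejected("MyCrawler/Nutch-1.7"));
        // A browser-like string passes, which is why changing the header is enough.
        System.out.println(isRejected(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A filter of this kind cannot tell a real browser from a client that merely sends a browser's User-Agent string.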

nutch-default.xml contains several properties related to the User-Agent:

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these properties are combined into the User-Agent:

this.userAgent = getAgentString( conf.get("http.agent.name"),
        conf.get("http.agent.version"),
        conf.get("http.agent.description"),
        conf.get("http.agent.url"),
        conf.get("http.agent.email") );

  private static String getAgentString(String agentName,
                                       String agentVersion,
                                       String agentDesc,
                                       String agentURL,
                                       String agentEmail) {
    if ( (agentName == null) || (agentName.trim().length() == 0) ) {
      // TODO : NUTCH-258
      if (LOGGER.isErrorEnabled()) {
        LOGGER.error("No User-Agent string set (http.agent.name)!");
      }
    }
    StringBuffer buf= new StringBuffer();
    buf.append(agentName);
    if (agentVersion != null) {
      buf.append("/");
      buf.append(agentVersion);
    }
    if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
      buf.append(" (");
      if ((agentDesc != null) && (agentDesc.length() != 0)) {
        buf.append(agentDesc);
        if ( (agentURL != null) || (agentEmail != null) )
          buf.append("; ");
      }
      if ((agentURL != null) && (agentURL.length() != 0)) {
        buf.append(agentURL);
        if (agentEmail != null)
          buf.append("; ");
      }
      if ((agentEmail != null) && (agentEmail.length() != 0))
        buf.append(agentEmail);
      buf.append(")");
    }
    return buf.toString();
  }
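
To see what this method produces, the concatenation logic can be copied into a small standalone class and run outside Nutch. The sketch below is such a copy for experimentation only (logging removed, StringBuilder instead of StringBuffer); the class name and the "MyBot" values are made up for illustration.

```java
// Standalone copy of the concatenation logic quoted above, for experimentation only.
class AgentStringDemo {

    static String getAgentString(String agentName, String agentVersion,
                                 String agentDesc, String agentURL, String agentEmail) {
        StringBuilder buf = new StringBuilder();
        buf.append(agentName);
        if (agentVersion != null) {
            buf.append("/").append(agentVersion); // name and version are joined with "/"
        }
        boolean hasDesc = agentDesc != null && !agentDesc.isEmpty();
        boolean hasUrl = agentURL != null && !agentURL.isEmpty();
        boolean hasEmail = agentEmail != null && !agentEmail.isEmpty();
        if (hasDesc || hasUrl || hasEmail) {
            buf.append(" (");
            if (hasDesc) {
                buf.append(agentDesc);
                if (agentURL != null || agentEmail != null) buf.append("; ");
            }
            if (hasUrl) {
                buf.append(agentURL);
                if (agentEmail != null) buf.append("; ");
            }
            if (hasEmail) buf.append(agentEmail);
            buf.append(")");
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Hypothetical crawler identity with every property filled in.
        // prints: MyBot/Nutch-1.7 (test crawler; http://example.com/bot; bot at example dot com)
        System.out.println(getAgentString("MyBot", "Nutch-1.7", "test crawler",
                "http://example.com/bot", "bot at example dot com"));

        // A browser string split around one "/", with the other properties left empty.
        // prints: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0
        System.out.println(getAgentString("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko",
                "20100101 Firefox/27.0", "", "", ""));
    }
}
```

The second call shows the trick: because the name and version are always joined with "/", any browser User-Agent can be reproduced by splitting it at one of its "/" characters and leaving the description, URL, and email empty.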

The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java sends the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:

String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
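
The fragment above only appends one header line. As a rough illustration of where that line ends up, the following simplified sketch assembles a complete minimal request. It is an assumption-laden stand-in, not the real HttpResponse code: the class name, the buildRequest helper, and the choice of request line and Host header are made up for this example, and the real class sends additional headers.

```java
// Simplified, hypothetical request builder; only the User-Agent branch mirrors the quoted fragment.
class RequestHeaderDemo {

    static String buildRequest(String host, String path, String userAgent) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        if (userAgent == null || userAgent.isEmpty()) {
            // The quoted fragment logs an error and sends no User-Agent header at all.
            System.err.println("User-agent is not set!");
        } else {
            reqStr.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        reqStr.append("\r\n"); // a blank line ends the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("yangshangchuan.iteye.com", "/",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

Whatever string getUserAgent() returns is written to the wire unchanged, so the server sees exactly the value composed from the http.agent.* properties.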

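The analysis above shows that http.agent.name and http.agent.version are joined with "/", while the description, URL, and email properties are empty by default and add nothing. To imitate a specific browser, it is therefore enough to add one of the following configurations to nutch-site.xml: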

1. Imitating Firefox:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>

2. Imitating IE:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>

3. Imitating Chrome:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>

4. Imitating Safari:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>

5. Imitating Opera:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>
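
As a sanity check on the five configurations, the sketch below joins each name/version pair with "/" the way getAgentString does and prints the resulting header values. It assumes http.agent.description, http.agent.url, and http.agent.email are left empty; the class and method names are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Recombines the name/version pairs listed above; the other agent properties are assumed empty.
class BrowserAgentPairs {

    // With empty description, URL, and email, the User-Agent is simply name + "/" + version.
    static String join(String name, String version) {
        return name + "/" + version;
    }

    static Map<String, String> all() {
        Map<String, String> agents = new LinkedHashMap<>();
        agents.put("Firefox", join("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko",
                "20100101 Firefox/27.0"));
        agents.put("IE", join("Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident",
                "6.0)"));
        agents.put("Chrome", join("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari", "537.36"));
        agents.put("Safari", join("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 "
                + "(KHTML, like Gecko) Version/5.1.7 Safari", "534.57.2"));
        agents.put("Opera", join("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR", "19.0.1326.59"));
        return agents;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : all().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

In every case the split point is the last "/" of the browser's real User-Agent string, which is why some of the name values end in an unclosed parenthesis or a bare product token.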

Postscript: ways to check your User-Agent:

1. http://www.useragentstring.com

2. http://whatsmyuseragent.com

3. http://www.enhanceie.com/ua.aspx
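
The sites above report the User-Agent of whatever browser visits them. To check what a program sends, one option is a throwaway local server that echoes the header back. The sketch below does this with the JDK's built-in com.sun.net.httpserver.HttpServer and HttpURLConnection (Java 9 or later for readAllBytes). It does not involve Nutch, and the class and method names are invented for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Starts a local echo server, sends one request, and returns the User-Agent the server saw.
class UserAgentEcho {

    static String roundTrip(String userAgent) {
        HttpServer server = null;
        try {
            // Port 0 asks the OS for any free port on the loopback interface.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String seen = exchange.getRequestHeaders().getFirst("User-Agent");
                byte[] body = String.valueOf(seen).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

The same idea can be used to check a crawler: point it at a server like this one and inspect the header it actually sends.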
