Nutch Exception Roundup

Exception:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-dell\mapred\staging\dell1008071661\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
Cause:
Hadoop cannot set file permissions on Windows; the problem does not occur on Linux.
Solution:
1 Modify the code:
I use nutch-1.7, which depends on Hadoop 1.2.0. Download: (hadoop-core-1.2.0)
In the downloaded release-1.2.0\src directory, search for 'FileUtil' and comment out the body of checkReturnValue so that it no longer throws:

private static void checkReturnValue(boolean rv, File p, FsPermission permission)  {
    /**
    if (!rv) {
      throw new IOException("Failed to set permissions of path: " + p +
                            " to " +
                            String.format("%04o", permission.toShort()));
    }
    **/
}
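
Commenting out the whole body silences the check on every platform. A narrower variant would tolerate the failure only on Windows. The following is a minimal, self-contained sketch of that idea, not Hadoop's actual code: the class name, the extra osName parameter, and the use of a plain String for the permission are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.Locale;

public class PermissionCheckSketch {
    /** Returns true when the given os.name value denotes Windows. */
    public static boolean isWindows(String osName) {
        return osName != null && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * Hypothetical variant of checkReturnValue: tolerates a failed permission
     * change on Windows, but still throws on every other platform.
     */
    public static void checkReturnValue(boolean rv, File p, String octalPermission, String osName)
            throws IOException {
        if (rv) {
            return; // the permission change succeeded
        }
        if (isWindows(osName)) {
            System.err.println("Ignoring failed permission change of " + p + " on " + osName);
            return;
        }
        throw new IOException("Failed to set permissions of path: " + p + " to " + octalPermission);
    }

    public static void main(String[] args) throws IOException {
        // Tolerated: simulated failure on a Windows host.
        checkReturnValue(false, new File("staging"), "0700", "Windows 7");
        // Success path on the real host OS.
        checkReturnValue(true, new File("staging"), "0700", System.getProperty("os.name"));
    }
}
```

In a real patch the same OS test would sit inside FileUtil.checkReturnValue, so Linux deployments keep the original behavior.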

2 Build Hadoop (no need to import it into Eclipse):
Environment: Cygwin, Ant
Running Ant produces: \release-1.2.0\build\hadoop-core-1.2.1-SNAPSHOT.jar
Rename it to hadoop-core-1.2.0.jar and use it to overwrite \apache-nutch-1.7\lib\hadoop-core-1.2.0.jar.

Exception:
java.io.IOException: Job failed!

Solution:

In src:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

Note that the folder name is singular.

In bin:

The plugin folder is singular, so the value here has to be changed.

<property>
  <name>plugin.folders</name>
  <value>./src/plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

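Exception: causes of incomplete HTML downloads in Nutch
1 http://news.163.com/ skipped. Content of size 481597 was truncated to 65376

Solution:

In conf/nutch-default.xml, set parser.skip.truncated to false.

2 Byte limit on HTTP downloads. Set http.content.limit to -1 to disable truncation: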
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
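
The rule stated in the description above can be illustrated with a few lines of Java. This is only a sketch of the documented semantics, not Nutch's implementation; the class and method names are hypothetical, and 65536 is used purely as an example limit.

```java
import java.util.Arrays;

public class ContentLimitSketch {
    /**
     * Applies the documented http.content.limit rule: a nonnegative limit
     * truncates longer content, a negative limit disables truncation.
     */
    public static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit);
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] page = new byte[481597]; // page size taken from the log line above
        System.out.println(applyLimit(page, 65536).length); // prints 65536
        System.out.println(applyLimit(page, -1).length);    // prints 481597
    }
}
```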

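Exception:

The seed http://www.gov.cn/ was added.

+^http://www.gov.cn/ was added to regex-urlfilter.txt.

The configuration is entirely correct, yet the crawler fetched nothing.

Cause:

The site restricts crawlers through the robots exclusion protocol (robots.txt).

Solution:

To change this behavior, comment out the robots.txt check in the Fetcher class: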

 /**
              if (!rules.isAllowed(fit.u.toString())) {
                // unblock
                fetchQueues.finishFetchItem(fit, true);
                if (LOG.isDebugEnabled()) {
                  LOG.debug("Denied by robots.txt: " + fit.url);
                }
                output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
                reporter.incrCounter("FetcherStatus", "robots_denied", 1);
                continue;
              }**/
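
Before patching Fetcher, it can help to confirm that robots.txt really is what blocks the URL. The sketch below is a deliberately simplified prefix matcher for the "User-agent: *" group only. It is not the parser Nutch uses, it ignores Allow lines and wildcards, and the class name and sample rules are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsSketch {
    /** Collects the Disallow prefixes listed under "User-agent: *". */
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) {
                continue; // skip comments and lines without a field
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                prefixes.add(value); // an empty Disallow means "allow everything"
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content, not copied from any real site.
        String sample = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedForAll(sample);
        System.out.println(isAllowed("/private/a.html", rules)); // prints false
        System.out.println(isAllowed("/index.html", rules));     // prints true
    }
}
```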

Exception: unzipBestEffort returned null

Reposted from: http://blog.chinaunix.net/uid-8345138-id-3358621.html

When crawling a certain page, the Nutch crawler throws the following exception:

ERROR http.Http (?:invoke0(?)) - java.io.IOException: unzipBestEffort returned null
ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:472)
ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:151)
ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208)
ERROR http.Http (?:invoke0(?)) - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:173)

Debugging shows that the exception originates from:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)

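Cause of the exception:

The page is sent with chunked transfer encoding, but by default the Nutch crawler reads the body as non-chunked. The bytes handed to the GZIP stream are therefore not valid GZIP data, so constructing the stream fails and the subsequent decompression fails as well.

Whether a response is chunked can be seen in the HTTP headers: a chunked response carries the header transfer-encoding: chunked.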
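
The failure can be reproduced with the JDK alone. The sketch below compresses a string, wraps it in a single HTTP chunk, and shows that the framed bytes no longer start with the GZIP magic number, so GZIPInputStream rejects them. The class and helper names are hypothetical and no Nutch code is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMagicSketch {
    /** True when the data starts with the GZIP magic bytes 0x1f 0x8b. */
    public static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /** Compresses the text with GZIP. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Wraps the payload in one HTTP chunk plus the terminating zero-length chunk. */
    public static byte[] asSingleChunk(byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write((Integer.toHexString(payload.length) + "\r\n").getBytes(StandardCharsets.US_ASCII));
        buf.write(payload);
        buf.write("\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello");
        byte[] chunked = asSingleChunk(compressed);
        System.out.println(looksLikeGzip(compressed)); // prints true
        System.out.println(looksLikeGzip(chunked));    // prints false
        try {
            new GZIPInputStream(new ByteArrayInputStream(chunked));
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints Not in GZIP format
        }
    }
}
```

The chunk-size line in front of the payload is what hides the magic bytes, which is why the body has to be de-chunked before it is decompressed.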

Fix:

1. Modify the interface org.apache.nutch.metadata.HttpHeaders and add:

public final static String TRANSFER_ENCODING = "Transfer-Encoding";

2. The class org.apache.nutch.protocol.http.HttpResponse in Nutch already provides a method for handling chunked content:

private void readChunkedContent(PushbackInputStream in,
StringBuffer line)

We only need to call this method from the HttpResponse constructor, by adding the following code:

String transferEncoding = getHeader(Response.TRANSFER_ENCODING);
if (transferEncoding != null && transferEncoding.equalsIgnoreCase("chunked")) {
    StringBuffer line = new StringBuffer();
    this.readChunkedContent(in, line);
} else {
    readPlainContent(in);
}

With the change in place, run a test.

The site that could not be crawled before can now be crawled.

=========================================================

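Notes:

1. There are two HttpResponse classes, one in protocol.http and one in protocol.httpclient; the one to modify is the former.

2. Nutch 2.0 has removed the readChunkedContent method, so the Nutch 1.5 version is reproduced here. Put this method into HttpResponse: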

private void readChunkedContent(PushbackInputStream in, StringBuffer line)
        throws HttpException, IOException {
    boolean doneChunks = false;
    int contentBytesRead = 0;
    byte[] bytes = new byte[Http.BUFFER_SIZE];
    ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
    while (!doneChunks) {
        if (Http.LOG.isTraceEnabled()) {
            Http.LOG.trace("Http: starting chunk");
        }
        readLine(in, line, false);
        String chunkLenStr;
        // if (LOG.isTraceEnabled()) { LOG.trace("chunk-header: '" + line +
        // "'"); }
        int pos = line.indexOf(";");
        if (pos < 0) {
            chunkLenStr = line.toString();
        } else {
            chunkLenStr = line.substring(0, pos);
            // if (LOG.isTraceEnabled()) { LOG.trace("got chunk-ext: " +
            // line.substring(pos+1)); }
        }
        chunkLenStr = chunkLenStr.trim();
        int chunkLen;
        try {
            chunkLen = Integer.parseInt(chunkLenStr, 16);
        } catch (NumberFormatException e) {
            throw new HttpException("bad chunk length: " + line.toString());
        }
        if (chunkLen == 0) {
            doneChunks = true;
            break;
        }
        if ((contentBytesRead + chunkLen) > http.getMaxContent())
            chunkLen = http.getMaxContent() - contentBytesRead;
        // read one chunk
        int chunkBytesRead = 0;
        while (chunkBytesRead < chunkLen) {
            int toRead = (chunkLen - chunkBytesRead) < Http.BUFFER_SIZE ? (chunkLen - chunkBytesRead)
                    : Http.BUFFER_SIZE;
            int len = in.read(bytes, 0, toRead);
            if (len == -1)
                throw new HttpException("chunk eof after "
                        + contentBytesRead + " bytes in successful chunks"
                        + " and " + chunkBytesRead + " in current chunk");
            // DANGER!!! Will printed GZIPed stuff right to your
            // terminal!
            // if (LOG.isTraceEnabled()) { LOG.trace("read: " + new
            // String(bytes, 0, len)); }
            out.write(bytes, 0, len);
            chunkBytesRead += len;
        }
        readLine(in, line, false);
    }
    if (!doneChunks) {
        if (contentBytesRead != http.getMaxContent())
            throw new HttpException(
                    "chunk eof: !doneChunk && didn't max out");
        return;
    }
    content = out.toByteArray();
    parseHeaders(in, line);
}
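
To make the framing that this method parses easier to follow, here is a stripped-down decoder that depends only on the JDK. It is a sketch under stated assumptions, not a replacement for the Nutch method: the class name is hypothetical, trailer headers are not parsed, and no maximum content size is enforced.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ChunkedDecoderSketch {
    /** Reads one line up to LF and returns it without CR/LF. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') break;
            if (c != '\r') sb.append((char) c);
        }
        if (c == -1 && sb.length() == 0) {
            throw new EOFException("unexpected end of stream");
        }
        return sb.toString();
    }

    /** Decodes a chunked body: hex size line, data, CRLF, repeated until size 0. */
    public static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String header = readLine(in);
            int semi = header.indexOf(';'); // drop optional chunk extensions
            String sizeStr = (semi < 0 ? header : header.substring(0, semi)).trim();
            int size;
            try {
                size = Integer.parseInt(sizeStr, 16);
            } catch (NumberFormatException e) {
                throw new IOException("bad chunk length: " + header);
            }
            if (size < 0) throw new IOException("bad chunk length: " + header);
            if (size == 0) break; // the zero-length chunk ends the body
            byte[] chunk = new byte[size];
            int read = 0;
            while (read < size) {
                int n = in.read(chunk, read, size - read);
                if (n == -1) throw new EOFException("chunk eof after " + read + " bytes");
                read += n;
            }
            out.write(chunk, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = "5\r\nhello\r\n6; ext=1\r\n world\r\n0\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] body = decode(new ByteArrayInputStream(wire));
        System.out.println(new String(body, StandardCharsets.US_ASCII)); // prints hello world
    }
}
```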

3. The place to change in the constructor is where readPlainContent is called.

could only be replicated to  nodes, instead of 

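After a weekend power outage in the server room, Hadoop reported the error in the heading above. The fix was to turn off the firewall on every node. The relevant commands are:

Check the firewall status:

/etc/init.d/iptables status

Stop the firewall temporarily:

/etc/init.d/iptables stop

Prevent the firewall from starting at boot:

/sbin/chkconfig --level  iptables off

Restart iptables:

/etc/init.d/iptables restart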
