Running Nutch in Eclipse

These instructions describe how to set up a development environment for Nutch under the Eclipse IDE. They are intended as a comprehensive starting resource for configuring, building, crawling with and debugging Nutch trunk in that context.

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is faster to edit Nutch in Eclipse but run the scripts from the command line. However, being able to debug Nutch in Eclipse is very useful, especially when applying and testing patches, because you can see them working in a larger context. Even so, you will still benefit greatly from looking at the hadoop.log output. This tutorial covers a fully internal Eclipse/Nutch setup, using only Eclipse tools and associated plugins.

Prerequisites

You need to have Apache Ant installed and configured on your system.
Grab the newest version of Eclipse.
All of the following plugins should be available from the Eclipse Marketplace. If not, you can install them from within Eclipse as follows.
Once you've set up Eclipse, install the Subclipse plugin. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.
Install the IvyDE plugin for Eclipse.
Install the m2e plugin for Eclipse.

Steps

Checkout and Build Nutch

Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk) run this:
```
 svn co https://svn.apache.org/repos/asf/nutch/trunk

 cd trunk
```
For Nutch 2.x run this:
```
 svn co https://svn.apache.org/repos/asf/nutch/branches/2.x

 cd 2.x
```
For Nutch 1.x (i.e. trunk), skip ahead to the step that adds “http.agent.name” to “conf/nutch-site.xml”; the data store steps in between apply only to Nutch 2.x.

At this point you should have decided which data store you want to use. See the Apache Gora documentation for more information. Here are a few of the available storage classes:

  org.apache.gora.hbase.store.HBaseStore

  org.apache.gora.cassandra.store.CassandraStore

  org.apache.gora.accumulo.store.AccumuloStore

  org.apache.gora.avro.store.AvroStore

  org.apache.gora.avro.store.DataFileAvroStore

In “conf/nutch-site.xml”, add the storage class name. E.g., if you pick HBase as the data store, add this to “conf/nutch-site.xml”:

 <property>

  <name>storage.data.store.class</name>

  <value>org.apache.gora.hbase.store.HBaseStore</value>

  <description>Default class for storing data</description>

 </property>

In ivy/ivy.xml, uncomment the dependency for the data store that you selected. E.g., if you plan to use HBase, uncomment this line:
```
  <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
```
Set the default data store in conf/gora.properties. E.g., for HBase as the data store, put this in conf/gora.properties:
```
 gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```
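The three edits above (conf/nutch-site.xml, ivy/ivy.xml and conf/gora.properties) have to name the same store. The following is a minimal sketch for printing the relevant lines side by side, assuming the file locations named above. The function name `show_datastore_settings` is hypothetical, and the output does not tell you whether a dependency in ivy/ivy.xml is still inside a comment.

```shell
# Hypothetical helper: print the data-store-related lines from the three
# files edited above so they can be compared by eye.
# $1 = root of the Nutch 2.x checkout (defaults to the current directory).
show_datastore_settings() {
  root="${1:-.}"
  # Store class named in nutch-site.xml and gora.properties.
  grep -H 'org\.apache\.gora\..*Store' \
    "$root/conf/nutch-site.xml" "$root/conf/gora.properties"
  # Gora dependency lines in ivy.xml (commented-out ones are listed too).
  grep -H 'name="gora-' "$root/ivy/ivy.xml"
}

# Example (run from the checkout root): show_datastore_settings .
```

If the class names or the gora module printed for the three files disagree, recheck the steps above before building.
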
Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. E.g., if Nutch is located at "/home/tejas/Desktop/2.x", set the property to:
```
 <property>

   <name>plugin.folders</name>

   <value>/home/tejas/Desktop/2.x/build/plugins</value>

 </property>
```
Run this command:
```
  ant eclipse
```

Load project in Eclipse

In Eclipse, click on “File” -> “Import...”
Select “Existing Projects into Workspace”
In the next window, set the root directory to the directory where you checked out Nutch 2.x (or trunk). Click “Finish”.
You will now see a new project named 2.x (or trunk) added to the workspace. Wait a moment until Eclipse refreshes its SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.
In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”
In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click on the “Top” button. Sadly, Eclipse will build the workspace again, but this time it won’t take long.

Create Eclipse launcher

Now let's get ready to run something. Let's start with the inject operation. Right click on the project in “Package Explorer” -> select “Run As” -> select “Run Configurations”. Create a new configuration and name it "inject".

For 1.x (i.e. trunk): Set the main class as: org.apache.nutch.crawl.Injector
For 2.x: Set the main class as: org.apache.nutch.crawl.InjectorJob

In the “Arguments” tab, under program arguments, provide the path of the input directory that holds the seed URLs. For 1.x, the Injector expects the crawldb path before the seed directory. Set VM arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”
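
The input directory mentioned above is an ordinary directory of text files with one URL per line. The following is a minimal sketch for creating one. The helper name `make_seed_dir`, the file name seed.txt and the example URL are assumptions chosen for illustration, not names that Nutch requires.

```shell
# Hypothetical helper: create a seed directory holding one file with one
# URL per line.
# $1 = directory to create, remaining arguments = seed URLs.
make_seed_dir() {
  # Require a directory and at least one URL.
  [ "$#" -ge 2 ] || return 1
  dir="$1"
  shift
  mkdir -p "$dir" || return 1
  # printf repeats the format once per remaining argument.
  printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example: make_seed_dir urls http://nutch.apache.org/
```

With the example call, "urls" (relative to the project directory, or given as an absolute path) is the value to enter as the seed directory in the program arguments.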

Click "Apply" and then click "Run". If everything is set up correctly, you should see the inject operation progressing on the console.

If you want to find out the Java class corresponding to any command, look inside the "src/bin/nutch" script: near the bottom there is a block with one branch per command. Here are the important classes corresponding to the crawl cycle:

Operation	Class in Nutch 1.x (i.e. trunk)	Class in Nutch 2.x
inject	org.apache.nutch.crawl.Injector	org.apache.nutch.crawl.InjectorJob
generate	org.apache.nutch.crawl.Generator	org.apache.nutch.crawl.GeneratorJob
fetch	org.apache.nutch.fetcher.Fetcher	org.apache.nutch.fetcher.FetcherJob
parse	org.apache.nutch.parse.ParseSegment	org.apache.nutch.parse.ParserJob
updatedb	org.apache.nutch.crawl.CrawlDb	org.apache.nutch.crawl.DbUpdaterJob

Debug Nutch in Eclipse

Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs.
Here are a few good places to set breakpoints in the 1.x codebase:

Fetcher [line: 1115] - run

Fetcher [line: 530] - fetch

Fetcher$FetcherThread [line: 560] - run()

Generator [line: 443] - generate

Generator$Selector [line: 108] - map

OutlinkExtractor [line: 71 & 74] - getOutlinks

Here are a few good places to set breakpoints in the 2.x codebase:

FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ....

                                   : line 519 : final ProtocolStatus status = output.getStatus();

GeneratorMapper : map() : line 53

GeneratorReducer : reduce() : line 53

OutlinkExtractor : getOutlinks() : line 84

Remote Debugging in Eclipse

Create a new Debug Configuration of type “Remote Java Application” and remember the port (here: 37649).
Launch Nutch from the command line, adding options that load the Java debugger's JDWP agent library, e.g. from bash:

% export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:37649"

% $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

Because of suspend=y, the application is suspended right after launch and waits for the debugger to attach.
Now go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.
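
The option string in the steps above is easy to mistype, so it can help to build it from the port number. The following is a minimal sketch, assuming the same port as above. The function name `jdwp_opts` is hypothetical and only wraps the agent options already shown.

```shell
# Hypothetical helper: build the JDWP agent option string for a given port.
# $1 = debugger port (defaults to 37649, the port used in the steps above).
jdwp_opts() {
  port="${1:-37649}"
  printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:${port}"
}

# Export the options so that a later bin/nutch call picks them up,
# as in the steps above.
NUTCH_OPTS="$(jdwp_opts 37649)"
export NUTCH_OPTS
```

The port passed here must match the one entered in the Remote Java Application configuration in Eclipse.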

Instead of creating an extra launch configuration for every tool you want to debug, this single configuration is enough to debug any tool (parsechecker, indexchecker, URL filter, etc.), even remotely (crawler/tool running on a server, Eclipse debugger running locally).

Debugging and Timeouts

Debugging takes time, especially when inspecting variables, stack traces, etc. It often takes long enough that a timeout fires and stops the application. In the nutch-site.xml used for debugging, set timeouts to a rather high value (or -1 for unlimited), e.g., when debugging the parser:

<property>

  <name>parser.timeout</name>

  <value>-1</value>

</property>
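
Which timeouts matter depends on the tool being debugged. The following is a minimal sketch for listing candidate property names, assuming the properties are declared in conf/nutch-default.xml with each `<name>` element on a single line. The function name `list_timeout_props` is hypothetical. Check each property's description before overriding it, since not every timeout necessarily accepts -1.

```shell
# Hypothetical helper: list property names containing "timeout" in a
# Hadoop-style config file.
# $1 = path to the file (defaults to conf/nutch-default.xml).
list_timeout_props() {
  file="${1:-conf/nutch-default.xml}"
  # Keep only <name>...</name> lines mentioning "timeout", then strip the tags.
  grep '<name>.*timeout.*</name>' "$file" |
    sed -e 's/.*<name>//' -e 's/<\/name>.*//'
}

# Example (run from the checkout root): list_timeout_props
```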

Display Javadoc for Dependent Libraries

Eclipse is able to show Javadocs immediately, not only for Nutch classes but also for dependent libraries. While Eclipse takes the Javadocs of Nutch classes directly from the source files, this is not the case for dependent Ivy-managed libraries. There are two ways to tell Eclipse where to find the Javadocs of dependent libraries: (1) adding the Javadoc URL to a jar file, or (2) using the IvyDE Eclipse plugin. Note that both ways modify the file .classpath. Because the ant eclipse target overwrites the .classpath file, you should make a backup before running it and merge the changes made via Eclipse back afterwards.
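
The backup-and-merge routine around ant eclipse can be sketched as follows. This is a minimal sketch, assuming it is run against the project root. The function name `backup_classpath` and the .classpath.bak file name are hypothetical.

```shell
# Hypothetical helper: keep a copy of .classpath before `ant eclipse`
# regenerates it.
# $1 = project root (defaults to the current directory).
backup_classpath() {
  root="${1:-.}"
  [ -f "$root/.classpath" ] || { echo "no .classpath in $root" >&2; return 1; }
  cp -p "$root/.classpath" "$root/.classpath.bak"
}

# Typical sequence (commented out so the sketch itself has no side effects):
#   backup_classpath . && ant eclipse
#   diff -u .classpath.bak .classpath   # review, then merge Javadoc entries back by hand
```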

Connect a Library to the Javadoc URL

The simplest way to connect a jar library with its Javadocs is to add the Javadoc URL manually in the classpath editor (“Build Path” -> “Configure Build Path” -> “Libraries” tab, expand the jar and edit its “Javadoc location”).

IvyDE

The Nutch build system delegates the management of library dependencies to Apache Ivy. The Eclipse plugin IvyDE integrates Ivy's dependency management into Eclipse. It is well documented, including a description of how to add the managed libraries to the Eclipse project. The main Ivy file is ivy/ivy.xml, but note that every plugin has its own ivy.xml. If you are working on a specific plugin, it is a good idea to add its ivy.xml as well. It is possible to use IvyDE in addition to the libraries placed by ant eclipse in .classpath.

The repository hosting a library often also provides packages containing javadoc and sources. E.g., the JUnit repository https://repo1.maven.org/maven2/junit/junit/4.11/ provides the following files:

junit-4.11-javadoc.jar                             14-Nov-2012 19:21              379344

junit-4.11-sources.jar                             14-Nov-2012 19:21              151329

junit-4.11.jar                                     14-Nov-2012 19:21              245039

junit-4.11.pom                                     14-Nov-2012 19:21                2344

IvyDE can then also fetch the javadoc and source packages (if provided) and show them in Eclipse. Again, the IvyDE documentation has an excellent description of how to enable this in the Source/Javadoc Mapping section of the Ivy preferences. Note that the Ivy cache (usually ~/.ivy2/cache/) must be cleaned before Ivy Resolve is called from Eclipse.

Troubleshooting

eclipse: Cannot create project content in workspace

The Nutch source code must be outside the workspace folder. Alternatively, check out the code with Eclipse (SVN) directly into your workspace instead of creating the project from existing code; Eclipse sometimes refuses to create a project from source code that already sits inside the workspace.

Plugin directory not found

Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, even better, in nutch-site.xml. Ideally, nutch-default.xml should be kept completely intact.

<property>

  <name>plugin.folders</name>

  <value>/home/....../trunk/src/plugin</value>

</property>

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
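
Since two nutch-site.xml files are involved, it can be worth checking what each one says. The following is a minimal sketch, assuming plugin.folders holds a single absolute path and that `<name>` and `<value>` sit on separate lines as in the snippets above. The function names `plugin_folders` and `check_plugin_folders` are hypothetical, and a property inside an XML comment would be picked up as well.

```shell
# Hypothetical helper: print the value of plugin.folders from a
# Hadoop-style config file.
plugin_folders() {
  awk '
    /<name>plugin\.folders<\/name>/ { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Hypothetical check: report whether each config file points at an
# existing directory.
check_plugin_folders() {
  for conf in "$@"; do
    dir="$(plugin_folders "$conf")"
    if [ -n "$dir" ] && [ -d "$dir" ]; then
      echo "$conf: ok ($dir)"
    else
      echo "$conf: missing or not a directory (${dir:-unset})"
    fi
  done
}

# Example: check_plugin_folders conf/nutch-site.xml src/test/nutch-site.xml
```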

Debugging Hadoop classes

Fairly often it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily on the underlying Hadoop infrastructure. You can therefore check out the Hadoop sources into your Eclipse IDE and debug both together. You can:

Check out the Hadoop version that is used within Nutch trunk
Configure a Hadoop project similar to the Nutch project within your Eclipse IDE
Add the Hadoop project as a dependent project of the Nutch project
You can now also set breakpoints within Hadoop classes such as InputFormat implementations

Non-ported Plugins to 2.x

A few plugins have not yet been ported to the Nutch 2.x series. If you are following the above tutorial for building Nutch 2.x, please check Nutch2Plugins for more information.
