Injector Job的主要功能是根据crawlId在hbase中创建一个表,将将文本中的seed注入表中。

(一)命令执行

1、运行命令

[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06

2、查看表中内容

hbase(main):004:0> scan 'sourcetest_webpage'
ROW COLUMN+CELL
com.163.money:http/ column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
com.163.money:http/ column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
com.163.money:http/ column=mk:_injmrk_, timestamp=1425970761871, value=y
com.163.money:http/ column=mk:dist, timestamp=1425970761871, value=0
com.163.money:http/ column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
com.163.money:http/ column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds

3、读取数据库中的内容

由于hbase表使用了字节码表示内容,因此需要通过以下命令来查看具体内容

[jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/ key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1425970759775
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ?锟

(二)源码流程分析

类:org.apache.nutch.crawl.InjectorJob

1、程序入口

 

public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
args);
System.exit(res);
}

2、ToolRunner.run(String[] args)

此步骤主要是调用inject方法,其余均是一些参数合规性的检查

 

public int run(String[] args) throws Exception {
…………
inject(new Path(args[0]));
…………
}

3、inject()方法

nutch均使用 Map<String, Object> run(Map<String, Object> args)来运行具体的job,即其使用Map类参数,并返回Map类参数。

<pre name="code" class="java">public void inject(Path urlDir) throws Exception {

    run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));

  }

4、job的具体配置,并创建hbase中的表格

public Map<String, Object> run(Map<String, Object> args) throws Exception {

    numJobs = 1;
currentJobNum = 0;
currentJob = new NutchJob(getConf(), "inject " + input);
FileInputFormat.addInputPath(currentJob, input);
currentJob.setMapperClass(UrlMapper.class);
currentJob.setMapOutputKeyClass(String.class);
currentJob.setMapOutputValueClass(WebPage.class);
currentJob.setOutputFormatClass(GoraOutputFormat.class); DataStore<String, WebPage> store = StorageUtils.createWebStore(
currentJob.getConfiguration(), String.class, WebPage.class);
GoraOutputFormat.setOutput(currentJob, store, true); currentJob.setReducerClass(Reducer.class);
currentJob.setNumReduceTasks(0); currentJob.waitForCompletion(true);
ToolUtil.recordJobStatus(null, currentJob, results);
}

5、mapper方法

由于Injector Job中无reducer,因此只要关注mapper即可。

mapper主要完成以下几项工作:

(1)对文本中的内容进行分析,并提取其中的参数

(2)根据filter过滤url

(3)反转url作为key,创建Webpage对象作为value,然后将之写入表中。

protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String url = value.toString().trim(); // value is line of text if (url != null && (url.length() == 0 || url.startsWith("#"))) {
/* Ignore line that start with # */
return;
} // if tabs : metadata that could be stored
// must be name=value and separated by \t
float customScore = -1f;
int customInterval = interval;
Map<String, String> metadata = new TreeMap<String, String>();
if (url.indexOf("\t") != -1) {
String[] splits = url.split("\t");
url = splits[0];
for (int s = 1; s < splits.length; s++) {
// find separation between name and value
int indexEquals = splits[s].indexOf("=");
if (indexEquals == -1) {
// skip anything without a =
continue;
}
String metaname = splits[s].substring(0, indexEquals);
String metavalue = splits[s].substring(indexEquals + 1);
if (metaname.equals(nutchScoreMDName)) {
try {
customScore = Float.parseFloat(metavalue);
} catch (NumberFormatException nfe) {
}
} else if (metaname.equals(nutchFetchIntervalMDName)) {
try {
customInterval = Integer.parseInt(metavalue);
} catch (NumberFormatException nfe) {
}
} else
metadata.put(metaname, metavalue);
}
}
try {
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + url + ":" + e);
url = null;
}
if (url == null) {
context.getCounter("injector", "urls_filtered").increment(1);
return;
} else { // if it passes
String reversedUrl = TableUtil.reverseUrl(url); // collect it
WebPage row = WebPage.newBuilder().build();
row.setFetchTime(curTime);
row.setFetchInterval(customInterval); // now add the metadata
Iterator<String> keysIter = metadata.keySet().iterator();
while (keysIter.hasNext()) {
String keymd = keysIter.next();
String valuemd = metadata.get(keymd);
row.getMetadata().put(new Utf8(keymd),
ByteBuffer.wrap(valuemd.getBytes()));
} if (customScore != -1)
row.setScore(customScore);
else
row.setScore(scoreInjected); try {
scfilters.injectedScore(url, row);
} catch (ScoringFilterException e) {
if (LOG.isWarnEnabled()) {
LOG.warn("Cannot filter injected score for url " + url
+ ", using default (" + e.getMessage() + ")");
}
}
context.getCounter("injector", "urls_injected").increment(1);
row.getMarkers()
.put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
Mark.INJECT_MARK.putMark(row, YES_STRING);
context.write(reversedUrl, row);
}
}

(三)重点源码学习

版权声明:本文为博主原创文章,未经博主允许不得转载。

Injector Job深入分析 分类: H3_NUTCH 2015-03-10 15:44 334人阅读 评论(0) 收藏的更多相关文章

  1. Find The Multiple 分类: 搜索 POJ 2015-08-09 15:19 3人阅读 评论(0) 收藏

    Find The Multiple Time Limit: 1000MS Memory Limit: 10000K Total Submissions: 21851 Accepted: 8984 Sp ...

  2. 周赛-Equidistant String 分类: 比赛 2015-08-08 15:44 6人阅读 评论(0) 收藏

    time limit per test 1 second memory limit per test 256 megabytes input standard input output standar ...

  3. Drainage Ditches 分类: POJ 图论 2015-07-29 15:01 7人阅读 评论(0) 收藏

    Drainage Ditches Time Limit: 1000MS Memory Limit: 10000K Total Submissions: 62016 Accepted: 23808 De ...

  4. SQL 存储过程 分页 分类: SQL Server 2014-05-16 15:11 449人阅读 评论(0) 收藏

    set ANSI_NULLS ON set QUOTED_IDENTIFIER ON go -- ============================================= -- Au ...

  5. PIE(二分) 分类: 二分查找 2015-06-07 15:46 9人阅读 评论(0) 收藏

    Pie Time Limit: 5000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) Total Submissio ...

  6. 周赛-DZY Loves Chessboard 分类: 比赛 搜索 2015-08-08 15:48 4人阅读 评论(0) 收藏

    DZY Loves Chessboard time limit per test 1 second memory limit per test 256 megabytes input standard ...

  7. Ultra-QuickSort 分类: POJ 排序 2015-08-03 15:39 2人阅读 评论(0) 收藏

    Ultra-QuickSort Time Limit: 7000MS   Memory Limit: 65536K Total Submissions: 48111   Accepted: 17549 ...

  8. Windows7下QT5开发环境搭建 分类: QT开发 2015-03-09 23:44 65人阅读 评论(0) 收藏

    Windows7下QT开法环境常见搭配方法有两种. 第一种是:QT Creator+QT SDK: 第二种是:VS+qt-vs-addin+QT SDK: 以上两种均可,所需文件见QT社区,QT下载地 ...

  9. cf 61E. Enemy is weak 树状数组求逆序数(WA) 分类: Brush Mode 2014-10-19 15:16 104人阅读 评论(0) 收藏

    #include <iostream> #include <algorithm> #include <cstdio> #include <cstring> ...

随机推荐

  1. 通过一个案例彻底读懂10046 trace--字节级深入破解

    转载请注明出处:http://blog.csdn.net/guoyjoe/article/details/37840583 2014.7.23晚20:30 Oracle support组猫大师分享&l ...

  2. GBX的Graph(最短路)

    Problem B: Graph Time Limit: 2 Sec  Memory Limit: 128 MB Submit: 1  Solved: 1 [cid=1000&pid=1&am ...

  3. 解决配置Ubuntu中vnc远程显示灰屏

    解决配置Ubuntu中vnc远程显示灰屏a. 缺失图形化工具b.  ~/.vnc/xstartup 权限不对1. Ubuntu 16.04 安装 VNC 及 Mate 桌面环境https://www. ...

  4. Five ways to maximize Java NIO and NIO.2--转

    原文地址:http://www.javaworld.com/article/2078654/java-se/java-se-five-ways-to-maximize-java-nio-and-nio ...

  5. 利用日志使管理Linux更轻松

    利用日志使管理Linux更轻松 操作系统的日志主要具有审计与监测的功能,通过对日志信息的分析,可以检查错误发生的原因,监测追踪入侵者及受到攻击时留下的痕迹,甚至还能实时的进行系统状态的监控.有效利用日 ...

  6. golang excel

    github.com/tealeg/xlsx 封装的接口简单易用 package main import ( "bufio" "fmt" "githu ...

  7. JavaScript--数据结构之栈

    4.1栈是一种高效的数据结构,是一种特殊的列表.栈内元素只能通过列表的一端访问,也就称为栈顶.后入的先出的操作.Last in First out.所以他的访问每次是访问到栈顶的元素,要想访问其余的元 ...

  8. erroe:plot.new() : figure margins too large

    使用R时多次出现这个错误,plot.new() : figure margins too large,提示图片边界太大 解决方法,win.graph(width=4.875, height=2.5,p ...

  9. [原创]react-vio-form 快速构建React表单应用

    react-vio-form 是一个react的快速轻量表单库,能快速实现表单构建.提供自定义表单格式.表单校验.表单信息反馈.表单信息隔离等功能.可采用组件声明或者API的形式来实现表单的功能 de ...

  10. BZOJ2402: 陶陶的难题II(树链剖分,0/1分数规划,斜率优化Dp)

    Description Input 第一行包含一个正整数N,表示树中结点的个数.第二行包含N个正实数,第i个数表示xi (1<=xi<=10^5).第三行包含N个正实数,第i个数表示yi ...