Injector Job的主要功能是根据crawlId在hbase中创建一个表,将将文本中的seed注入表中。

(一)命令执行

1、运行命令

[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06

2、查看表中内容

hbase(main):004:0> scan 'sourcetest_webpage'
ROW COLUMN+CELL
com.163.money:http/ column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
com.163.money:http/ column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
com.163.money:http/ column=mk:_injmrk_, timestamp=1425970761871, value=y
com.163.money:http/ column=mk:dist, timestamp=1425970761871, value=0
com.163.money:http/ column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
com.163.money:http/ column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds

3、读取数据库中的内容

由于hbase表使用了字节码表示内容,因此需要通过以下命令来查看具体内容

[jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/ key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1425970759775
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ?锟

(二)源码流程分析

类:org.apache.nutch.crawl.InjectorJob

1、程序入口

 

public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
args);
System.exit(res);
}

2、ToolRunner.run(String[] args)

此步骤主要是调用inject方法,其余均是一些参数合规性的检查

 

public int run(String[] args) throws Exception {
…………
inject(new Path(args[0]));
…………
}

3、inject()方法

nutch均使用 Map<String, Object> run(Map<String, Object> args)来运行具体的job,即其使用Map类参数,并返回Map类参数。

<pre name="code" class="java">public void inject(Path urlDir) throws Exception {

    run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));

  }

4、job的具体配置,并创建hbase中的表格

public Map<String, Object> run(Map<String, Object> args) throws Exception {

    numJobs = 1;
currentJobNum = 0;
currentJob = new NutchJob(getConf(), "inject " + input);
FileInputFormat.addInputPath(currentJob, input);
currentJob.setMapperClass(UrlMapper.class);
currentJob.setMapOutputKeyClass(String.class);
currentJob.setMapOutputValueClass(WebPage.class);
currentJob.setOutputFormatClass(GoraOutputFormat.class); DataStore<String, WebPage> store = StorageUtils.createWebStore(
currentJob.getConfiguration(), String.class, WebPage.class);
GoraOutputFormat.setOutput(currentJob, store, true); currentJob.setReducerClass(Reducer.class);
currentJob.setNumReduceTasks(0); currentJob.waitForCompletion(true);
ToolUtil.recordJobStatus(null, currentJob, results);
}

5、mapper方法

由于Injector Job中无reducer,因此只要关注mapper即可。

mapper主要完成以下几项工作:

(1)对文本中的内容进行分析,并提取其中的参数

(2)根据filter过滤url

(3)反转url作为key,创建Webpage对象作为value,然后将之写入表中。

protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String url = value.toString().trim(); // value is line of text if (url != null && (url.length() == 0 || url.startsWith("#"))) {
/* Ignore line that start with # */
return;
} // if tabs : metadata that could be stored
// must be name=value and separated by \t
float customScore = -1f;
int customInterval = interval;
Map<String, String> metadata = new TreeMap<String, String>();
if (url.indexOf("\t") != -1) {
String[] splits = url.split("\t");
url = splits[0];
for (int s = 1; s < splits.length; s++) {
// find separation between name and value
int indexEquals = splits[s].indexOf("=");
if (indexEquals == -1) {
// skip anything without a =
continue;
}
String metaname = splits[s].substring(0, indexEquals);
String metavalue = splits[s].substring(indexEquals + 1);
if (metaname.equals(nutchScoreMDName)) {
try {
customScore = Float.parseFloat(metavalue);
} catch (NumberFormatException nfe) {
}
} else if (metaname.equals(nutchFetchIntervalMDName)) {
try {
customInterval = Integer.parseInt(metavalue);
} catch (NumberFormatException nfe) {
}
} else
metadata.put(metaname, metavalue);
}
}
try {
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + url + ":" + e);
url = null;
}
if (url == null) {
context.getCounter("injector", "urls_filtered").increment(1);
return;
} else { // if it passes
String reversedUrl = TableUtil.reverseUrl(url); // collect it
WebPage row = WebPage.newBuilder().build();
row.setFetchTime(curTime);
row.setFetchInterval(customInterval); // now add the metadata
Iterator<String> keysIter = metadata.keySet().iterator();
while (keysIter.hasNext()) {
String keymd = keysIter.next();
String valuemd = metadata.get(keymd);
row.getMetadata().put(new Utf8(keymd),
ByteBuffer.wrap(valuemd.getBytes()));
} if (customScore != -1)
row.setScore(customScore);
else
row.setScore(scoreInjected); try {
scfilters.injectedScore(url, row);
} catch (ScoringFilterException e) {
if (LOG.isWarnEnabled()) {
LOG.warn("Cannot filter injected score for url " + url
+ ", using default (" + e.getMessage() + ")");
}
}
context.getCounter("injector", "urls_injected").increment(1);
row.getMarkers()
.put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
Mark.INJECT_MARK.putMark(row, YES_STRING);
context.write(reversedUrl, row);
}
}

(三)重点源码学习

版权声明:本文为博主原创文章,未经博主允许不得转载。

Injector Job深入分析 分类: H3_NUTCH 2015-03-10 15:44 334人阅读 评论(0) 收藏的更多相关文章

  1. Find The Multiple 分类: 搜索 POJ 2015-08-09 15:19 3人阅读 评论(0) 收藏

    Find The Multiple Time Limit: 1000MS Memory Limit: 10000K Total Submissions: 21851 Accepted: 8984 Sp ...

  2. 周赛-Equidistant String 分类: 比赛 2015-08-08 15:44 6人阅读 评论(0) 收藏

    time limit per test 1 second memory limit per test 256 megabytes input standard input output standar ...

  3. Drainage Ditches 分类: POJ 图论 2015-07-29 15:01 7人阅读 评论(0) 收藏

    Drainage Ditches Time Limit: 1000MS Memory Limit: 10000K Total Submissions: 62016 Accepted: 23808 De ...

  4. SQL 存储过程 分页 分类: SQL Server 2014-05-16 15:11 449人阅读 评论(0) 收藏

    set ANSI_NULLS ON set QUOTED_IDENTIFIER ON go -- ============================================= -- Au ...

  5. PIE(二分) 分类: 二分查找 2015-06-07 15:46 9人阅读 评论(0) 收藏

    Pie Time Limit: 5000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) Total Submissio ...

  6. 周赛-DZY Loves Chessboard 分类: 比赛 搜索 2015-08-08 15:48 4人阅读 评论(0) 收藏

    DZY Loves Chessboard time limit per test 1 second memory limit per test 256 megabytes input standard ...

  7. Ultra-QuickSort 分类: POJ 排序 2015-08-03 15:39 2人阅读 评论(0) 收藏

    Ultra-QuickSort Time Limit: 7000MS   Memory Limit: 65536K Total Submissions: 48111   Accepted: 17549 ...

  8. Windows7下QT5开发环境搭建 分类: QT开发 2015-03-09 23:44 65人阅读 评论(0) 收藏

    Windows7下QT开法环境常见搭配方法有两种. 第一种是:QT Creator+QT SDK: 第二种是:VS+qt-vs-addin+QT SDK: 以上两种均可,所需文件见QT社区,QT下载地 ...

  9. cf 61E. Enemy is weak 树状数组求逆序数(WA) 分类: Brush Mode 2014-10-19 15:16 104人阅读 评论(0) 收藏

    #include <iostream> #include <algorithm> #include <cstdio> #include <cstring> ...

随机推荐

  1. Linux-PS1变量详解

    1.PS1 要修改linux终端命令行颜色,我们需要用到PS1,PS1是Linux终端用户的一个环境变量,用来说明命令行提示符的设置.在终端输入命令:#set,即可在输出中找到关于PS1的定义如下: ...

  2. pip 更新安装失败解决方法

    python3 -m ensurepip https://stackoverflow.com/questions/28664082/python-no-module-pip-main-error-wh ...

  3. pat(A) 2-06. 数列求和(模拟摆竖式相加)

    1.链接:http://www.patest.cn/contests/ds/2-06 2.思路:模拟摆竖式相加,因为同样位置上的数字同样,那么同一位上的加法就能够用乘法来表示 3.代码: #inclu ...

  4. HDOJ 5357 Easy Sequence DP

    a[i] 表示以i字符开头的合法序列有多少个 b[i] 表示以i字符结尾的合法序列有多少个 up表示上一层的'('的相应位置 mt[i] i匹配的相应位置 c[i] 包括i字符的合法序列个数  c[i ...

  5. [LuoguP4892]GodFly的寻宝之旅 状压DP

    链接 基础状压DP,预处理出sum,按照题意模拟即可 复杂度 \(O(n^22^n)\) #include<bits/stdc++.h> #define REP(i,a,b) for(in ...

  6. Filebeat的下载(图文讲解)

    第一步:进入Elasticsearch的官网 https://www.elastic.co/ 第二步:点击downloads https://www.elastic.co/downloads 第三步: ...

  7. 洛谷 P2692 覆盖

    P2692 覆盖 题目背景 WSR的学校有B个男生和G个女生都来到一个巨大的操场上扫地. 题目描述 操场可以看成是N 行M 列的方格矩阵,如下图(1)是一个4 行5 列的方格矩阵.每个男生负责打扫一些 ...

  8. 水题ing

    T1: https://www.luogu.org/problemnew/show/P1724幻想乡,东风谷早苗是以高达控闻名的高中生宅巫女.某一天,早苗终于入手了最新款的钢达姆模型.作为最新的钢达姆 ...

  9. amazeui学习笔记--css(常用组件11)--分页Pagination

    amazeui学习笔记--css(常用组件11)--分页Pagination 一.总结 1.分页使用:还是ul包li的形式: 分页组件,<ul> / <ol> 添加 .am-p ...

  10. hbase单机安装和简单使用

    电脑太卡了,使用不了hadoop和hdfs了,所以今天安装了一个伪分布式,数据存储在本地磁盘,也没有向HDFS中存,也没有使用自己的zookeeper,安装过程中还出了点小问题,总结一下,免得忘了. ...