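In-Depth Analysis of the Injector Job    Category: H3_NUTCH    2015-03-10 15:44
The Injector Job creates a table in HBase named after the crawlId, then injects the seed URLs from the seed text files into that table.
(I) Running the command
1. Run the command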
[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06
2. Inspect the table contents
hbase(main):004:0> scan 'sourcetest_webpage'
ROW COLUMN+CELL
com.163.money:http/ column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
com.163.money:http/ column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
com.163.money:http/ column=mk:_injmrk_, timestamp=1425970761871, value=y
com.163.money:http/ column=mk:dist, timestamp=1425970761871, value=0
com.163.money:http/ column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
com.163.money:http/ column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds
3. Read the database contents
Because the HBase table stores its values as raw bytes, use the following command to see the actual content:
[jediael@master local]$ bin/nutch readdb -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/ key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1425970759775
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ?锟
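The raw cell values in the HBase scan and the readable fields in the readdb dump are the same data. The following is a minimal sketch of how to decode them by hand, assuming the cells hold big-endian Java primitives (the default byte order of java.nio.ByteBuffer). The class name CellValueDecoder and its methods are made up for this illustration and are not part of Nutch or Gora.

```java
import java.nio.ByteBuffer;

// Illustrative helper (not part of Nutch): decodes the raw cell bytes
// printed by the HBase shell, assuming big-endian Java primitives.
class CellValueDecoder {

    static int decodeInt(byte[] cell) {
        return ByteBuffer.wrap(cell).getInt();
    }

    static long decodeLong(byte[] cell) {
        return ByteBuffer.wrap(cell).getLong();
    }

    static float decodeFloat(byte[] cell) {
        return ByteBuffer.wrap(cell).getFloat();
    }

    public static void main(String[] args) {
        // f:fi = \x00'\x8D\x00 (the quote character is byte 0x27)
        byte[] fi = {0x00, 0x27, (byte) 0x8D, 0x00};
        // f:ts = \x00\x00\x01L\x02{\x08_ ('L' = 0x4C, '{' = 0x7B, '_' = 0x5F)
        byte[] ts = {0x00, 0x00, 0x01, 0x4C, 0x02, 0x7B, 0x08, 0x5F};
        // s:s = ?\x80\x00\x00 ('?' = 0x3F)
        byte[] score = {0x3F, (byte) 0x80, 0x00, 0x00};

        System.out.println("fetchInterval = " + decodeInt(fi));   // 2592000
        System.out.println("fetchTime     = " + decodeLong(ts));  // 1425970759775
        System.out.println("score         = " + decodeFloat(score)); // 1.0
    }
}
```

The three decoded values agree with the fetchInterval, fetchTime and score lines of the dump, which suggests that f:fi, f:ts and s:s are the columns behind those fields.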
(II) Source code flow analysis
Class: org.apache.nutch.crawl.InjectorJob
1. Program entry point
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
args);
System.exit(res);
}
2. ToolRunner.run(String[] args)
This step mainly calls the inject method; everything else is argument validation.
public int run(String[] args) throws Exception {
…………
inject(new Path(args[0]));
…………
}
3. The inject() method
Nutch runs every concrete job through Map<String, Object> run(Map<String, Object> args): the job takes its arguments as a Map and returns its results as a Map.
public void inject(Path urlDir) throws Exception {
run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
}
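The Map-in, Map-out convention can be illustrated with a small standalone sketch. Everything here is hypothetical: ArgMapSketch, its toArgMap helper and the "seedDir" key only mimic the shape of ToolUtil.toArgMap and Nutch.ARG_SEEDDIR, whose real implementations and values are not shown in this post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the pattern used by Nutch jobs:
// arguments go in as a Map, results come back as a Map.
class ArgMapSketch {

    // Assumed key name; the real value of Nutch.ARG_SEEDDIR is not shown in the post.
    static final String ARG_SEEDDIR = "seedDir";

    // Builds a map from alternating key/value arguments.
    static Map<String, Object> toArgMap(Object... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected key/value pairs");
        }
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            map.put(String.valueOf(args[i]), args[i + 1]);
        }
        return map;
    }

    // A toy job: reads its input from the argument map and reports it back.
    static Map<String, Object> run(Map<String, Object> args) {
        Map<String, Object> results = new HashMap<>();
        results.put("input", args.get(ARG_SEEDDIR));
        return results;
    }

    // Convenience wrapper, shaped like inject(Path urlDir).
    static Map<String, Object> inject(String urlDir) {
        return run(toArgMap(ARG_SEEDDIR, urlDir));
    }

    public static void main(String[] args) {
        System.out.println(inject("seeds"));
    }
}
```

One benefit of this shape is that every job exposes the same entry point, so a caller can chain jobs without knowing their individual parameter lists.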
4. Configure the job and create the table in HBase
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  numJobs = 1;
  currentJobNum = 0;
  currentJob = new NutchJob(getConf(), "inject " + input);
  FileInputFormat.addInputPath(currentJob, input);
  currentJob.setMapperClass(UrlMapper.class);
  currentJob.setMapOutputKeyClass(String.class);
  currentJob.setMapOutputValueClass(WebPage.class);
  currentJob.setOutputFormatClass(GoraOutputFormat.class);
  DataStore<String, WebPage> store = StorageUtils.createWebStore(
      currentJob.getConfiguration(), String.class, WebPage.class);
  GoraOutputFormat.setOutput(currentJob, store, true);
  currentJob.setReducerClass(Reducer.class);
  currentJob.setNumReduceTasks(0);
  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
}
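5. The mapper method
The Injector Job has no reducer (the number of reduce tasks is set to 0), so only the mapper matters.
The mapper does the following:
(1) Parses each line of the seed file and extracts its parameters.
(2) Normalizes the URL and filters it with the URL filters.
(3) Uses the reversed URL as the key and a new WebPage object as the value, then writes the pair to the table.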
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String url = value.toString().trim(); // value is line of text
if (url != null && (url.length() == 0 || url.startsWith("#"))) {
/* Ignore line that start with # */
return;
}
// if tabs : metadata that could be stored
// must be name=value and separated by \t
float customScore = -1f;
int customInterval = interval;
Map<String, String> metadata = new TreeMap<String, String>();
if (url.indexOf("\t") != -1) {
String[] splits = url.split("\t");
url = splits[0];
for (int s = 1; s < splits.length; s++) {
// find separation between name and value
int indexEquals = splits[s].indexOf("=");
if (indexEquals == -1) {
// skip anything without a =
continue;
}
String metaname = splits[s].substring(0, indexEquals);
String metavalue = splits[s].substring(indexEquals + 1);
if (metaname.equals(nutchScoreMDName)) {
try {
customScore = Float.parseFloat(metavalue);
} catch (NumberFormatException nfe) {
}
} else if (metaname.equals(nutchFetchIntervalMDName)) {
try {
customInterval = Integer.parseInt(metavalue);
} catch (NumberFormatException nfe) {
}
} else
metadata.put(metaname, metavalue);
}
}
try {
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + url + ":" + e);
url = null;
}
if (url == null) {
context.getCounter("injector", "urls_filtered").increment(1);
return;
} else { // if it passes
String reversedUrl = TableUtil.reverseUrl(url); // collect it
WebPage row = WebPage.newBuilder().build();
row.setFetchTime(curTime);
row.setFetchInterval(customInterval);
// now add the metadata
Iterator<String> keysIter = metadata.keySet().iterator();
while (keysIter.hasNext()) {
String keymd = keysIter.next();
String valuemd = metadata.get(keymd);
row.getMetadata().put(new Utf8(keymd),
ByteBuffer.wrap(valuemd.getBytes()));
}
if (customScore != -1)
row.setScore(customScore);
else
row.setScore(scoreInjected);
try {
scfilters.injectedScore(url, row);
} catch (ScoringFilterException e) {
if (LOG.isWarnEnabled()) {
LOG.warn("Cannot filter injected score for url " + url
+ ", using default (" + e.getMessage() + ")");
}
}
context.getCounter("injector", "urls_injected").increment(1);
row.getMarkers()
.put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
Mark.INJECT_MARK.putMark(row, YES_STRING);
context.write(reversedUrl, row);
}
}
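The seed-line parsing and the key construction in the mapper can be tried out without Hadoop or HBase. The sketch below is a simplified, standalone approximation and not the Nutch implementation: SeedLineSketch and Seed are made-up names, the metadata keys "nutch.score" and "nutch.fetchInterval" are assumed values for nutchScoreMDName and nutchFetchIntervalMDName, and reverseUrl only imitates what TableUtil.reverseUrl appears to produce in the dump above (http://money.163.com/ becomes com.163.money:http/).

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Simplified, standalone approximation of the mapper's parsing and key logic.
class SeedLineSketch {

    // Assumed metadata key names (not shown in the post).
    static final String SCORE_KEY = "nutch.score";
    static final String INTERVAL_KEY = "nutch.fetchInterval";

    static final class Seed {
        String url;
        float score = -1f;   // -1 means "no custom score given"
        int interval;
        final Map<String, String> metadata = new TreeMap<>();
    }

    // Parses "url<TAB>name=value<TAB>name=value"; returns null for blank lines and comments.
    static Seed parse(String line, int defaultInterval) {
        String text = line.trim();
        if (text.isEmpty() || text.startsWith("#")) {
            return null;
        }
        String[] splits = text.split("\t");
        Seed seed = new Seed();
        seed.url = splits[0];
        seed.interval = defaultInterval;
        for (int i = 1; i < splits.length; i++) {
            int eq = splits[i].indexOf('=');
            if (eq == -1) {
                continue; // skip anything that is not name=value
            }
            String name = splits[i].substring(0, eq);
            String value = splits[i].substring(eq + 1);
            try {
                if (name.equals(SCORE_KEY)) {
                    seed.score = Float.parseFloat(value);
                } else if (name.equals(INTERVAL_KEY)) {
                    seed.interval = Integer.parseInt(value);
                } else {
                    seed.metadata.put(name, value);
                }
            } catch (NumberFormatException e) {
                // Unparsable number: keep the default, as the mapper does.
            }
        }
        return seed;
    }

    // Reverses the host labels and appends ":scheme[:port]" plus the path and query.
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        sb.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Seed seed = parse("http://money.163.com/\tnutch.score=2.5\towner=test", 2592000);
        System.out.println(reverseUrl(seed.url) + " score=" + seed.score
                + " interval=" + seed.interval + " metadata=" + seed.metadata);
    }
}
```

Putting the reversed host first keeps pages of the same domain adjacent in the row-key order of the table, which is the usual motivation for this key layout.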
(III) Key source code study
Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.