[Nutch 2.2.1 Source Code Analysis, Part 5] The Basic Indexing Flow

I. Relationships Among the Main Classes

SolrIndexerJob extends IndexerJob

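1. IndexerJob: mainly builds the Hadoop indexing job in createIndexJob() and defines IndexerMapper, which turns each WebPage into a NutchDocument.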

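2. SolrIndexerJob: mainly provides the entry point for the solrindex command, registers SolrWriter as the index writer, runs the job, and commits to Solr.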

3. IndexUtil: has essentially one method, public NutchDocument index(String key, WebPage page), which builds a NutchDocument (the document later sent to Solr) from the page data.

II. Program Call Flow

Nutch's launcher script, bin/nutch, contains the following:

elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob

So the program entry point is in the SolrIndexerJob class.

(1) org.apache.nutch.indexer.solr.SolrIndexerJob

1. Program entry point

  public static void main(String[] args) throws Exception {
    final int res = ToolRunner.run(NutchConfiguration.create(),
        new SolrIndexerJob(), args);
    System.exit(res);
  }

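The program is launched through ToolRunner.run(); see the companion article "Basic Principles of Running Hadoop Programs with ToolRunner".

The first argument loads the Nutch-related configuration, mainly Hadoop's core-default.xml and core-site.xml plus Nutch's nutch-default.xml and nutch-site.xml.

The second argument is the Tool instance whose run(String[]) method will be executed, here a SolrIndexerJob.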
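To make the roles of the two arguments concrete, here is a minimal, dependency-free sketch of the pattern: the runner hands the configuration to the tool and then delegates to its run(String[]). MiniTool, MiniRunner, MiniIndexerTool and the Map-based configuration are hypothetical stand-ins invented for illustration, not Hadoop's classes; the real ToolRunner also parses generic options such as -D key=value before passing the remaining arguments on.

```java
import java.util.HashMap;
import java.util.Map;

class ToolRunnerSketch {
    // Hypothetical stand-in for org.apache.hadoop.util.Tool (illustration only).
    interface MiniTool {
        void setConf(Map<String, String> conf);
        int run(String[] args);
    }

    // Hypothetical stand-in for ToolRunner: inject the configuration, then delegate.
    static class MiniRunner {
        static int run(Map<String, String> conf, MiniTool tool, String[] args) {
            tool.setConf(conf);      // first argument: configuration handed to the tool
            return tool.run(args);   // second argument: the tool whose run(String[]) is invoked
        }
    }

    // A toy tool with the same usage rule as the real job: <solr url> and <batchId> are required.
    static class MiniIndexerTool implements MiniTool {
        private Map<String, String> conf;

        @Override
        public void setConf(Map<String, String> conf) { this.conf = conf; }

        @Override
        public int run(String[] args) {
            if (args.length < 2) return -1;
            System.out.println("would index batch " + args[1] + " into " + args[0]
                + " using " + conf.size() + " config entries");
            return 0;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.key", "example.value"); // placeholder setting
        int res = MiniRunner.run(conf, new MiniIndexerTool(), args);
        System.exit(res);
    }
}
```

The value returned by run(String[]) becomes the process exit code, which is why main() in the real class passes it to System.exit().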

2. The run(String[]) method of SolrIndexerJob

  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
      return -1;
    }
    if (args.length == 4 && "-crawlId".equals(args[2])) {
      getConf().set(Nutch.CRAWL_ID_KEY, args[3]);
    }
    try {
      indexSolr(args[0], args[1]);
      return 0;
    } catch (final Exception e) {
      LOG.error("SolrIndexerJob: " + StringUtils.stringifyException(e));
      return -1;
    }
  }

It first validates the arguments and then calls indexSolr(String, String).
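The argument handling can be tried in isolation. The sketch below mirrors only the two checks in run(String[]); ArgCheckSketch, its check() method and the placeholder configuration key are hypothetical names used for illustration, and a plain Map stands in for the Hadoop Configuration.

```java
import java.util.HashMap;
import java.util.Map;

class ArgCheckSketch {
    // Placeholder key; the real job uses the constant Nutch.CRAWL_ID_KEY.
    static final String CRAWL_ID_KEY = "crawl.id.placeholder";

    // Returns 0 when the arguments are acceptable and -1 otherwise,
    // mirroring the checks in SolrIndexerJob.run(String[]).
    static int check(String[] args, Map<String, String> conf) {
        if (args.length < 2) {
            System.err.println("Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]");
            return -1;
        }
        // The crawl id is honored only when -crawlId is exactly the third argument
        // and the argument list has exactly four entries.
        if (args.length == 4 && "-crawlId".equals(args[2])) {
            conf.put(CRAWL_ID_KEY, args[3]);
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(check(new String[] {"http://localhost:8983/solr"}, conf));
        System.out.println(check(new String[] {"http://localhost:8983/solr", "-all", "-crawlId", "demo"}, conf));
        System.out.println(conf.get(CRAWL_ID_KEY));
    }
}
```

Note that the real method additionally wraps indexSolr() in a try/catch and returns -1 when the indexing itself fails.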

3. The indexSolr(String, String) method

  public void indexSolr(String solrUrl, String batchId) throws Exception {
    LOG.info("SolrIndexerJob: starting");
    run(ToolUtil.toArgMap(
        Nutch.ARG_SOLR, solrUrl,
        Nutch.ARG_BATCH, batchId));
    // do the commits once and for all the reducers in one go
    getConf().set(SolrConstants.SERVER_URL,solrUrl);
    SolrServer solr = SolrUtils.getCommonsHttpSolrServer(getConf());
    if (getConf().getBoolean(SolrConstants.COMMIT_INDEX, true)) {
      solr.commit();
    }
    LOG.info("SolrIndexerJob: done.");
  }

4. The run(Map<String,Object>) method

  @Override
  public Map<String,Object> run(Map<String,Object> args) throws Exception {
    String solrUrl = (String)args.get(Nutch.ARG_SOLR);
    String batchId = (String)args.get(Nutch.ARG_BATCH);
    NutchIndexWriterFactory.addClassToConf(getConf(), SolrWriter.class);
    getConf().set(SolrConstants.SERVER_URL, solrUrl);
    currentJob = createIndexJob(getConf(), "solr-index", batchId);
    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);
    return results;
  }
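indexSolr() packs its two parameters into a map with ToolUtil.toArgMap(), and run(Map) unpacks them again by key. The sketch below shows that round trip under stated assumptions: ArgMapSketch and its toArgMap() are a hypothetical re-creation, not Nutch's implementation, and the key strings are placeholders for Nutch.ARG_SOLR and Nutch.ARG_BATCH.

```java
import java.util.HashMap;
import java.util.Map;

class ArgMapSketch {
    // Placeholder names; the real code uses Nutch.ARG_SOLR and Nutch.ARG_BATCH.
    static final String ARG_SOLR = "solr";
    static final String ARG_BATCH = "batch";

    // Hypothetical toArgMap-style helper: consumes (name, value) pairs
    // and returns them as a map.
    static Map<String, Object> toArgMap(Object... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected pairs of argName, argValue");
        }
        Map<String, Object> res = new HashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            res.put(String.valueOf(pairs[i]), pairs[i + 1]);
        }
        return res;
    }

    public static void main(String[] args) {
        Map<String, Object> argMap = toArgMap(
            ARG_SOLR, "http://localhost:8983/solr",
            ARG_BATCH, "-all");
        // The receiving side casts the values back, as run(Map) does.
        String solrUrl = (String) argMap.get(ARG_SOLR);
        String batchId = (String) argMap.get(ARG_BATCH);
        System.out.println(solrUrl + " " + batchId);
    }
}
```

Passing arguments as a map lets the same run(Map) method be driven from the command line and from other callers without re-parsing strings.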

(2) org.apache.nutch.indexer.IndexerJob

1. The createIndexJob() method

  protected Job createIndexJob(Configuration conf, String jobName, String batchId)
  throws IOException, ClassNotFoundException {
    conf.set(GeneratorJob.BATCH_ID, batchId);
    Job job = new NutchJob(conf, jobName);
    // TODO: Figure out why this needs to be here
    job.getConfiguration().setClass("mapred.output.key.comparator.class",
        StringComparator.class, RawComparator.class);
    Collection<WebPage.Field> fields = getFields(job);
    StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,
        IndexerMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(IndexerOutputFormat.class);
    return job;
  }

2. The map-side methods: setup(), map() and cleanup()

  public static class IndexerMapper
      extends GoraMapper<String, WebPage, String, NutchDocument> {
    public IndexUtil indexUtil;
    public DataStore<String, WebPage> store;
    protected Utf8 batchId;

    @Override
    public void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR));
      indexUtil = new IndexUtil(conf);
      try {
        store = StorageUtils.createWebStore(conf, String.class, WebPage.class);
      } catch (ClassNotFoundException e) {
        throw new IOException(e);
      }
    }

    protected void cleanup(Context context) throws IOException ,InterruptedException {
      store.close();
    }

    @Override
    public void map(String key, WebPage page, Context context)
    throws IOException, InterruptedException {
      ParseStatus pstatus = page.getParseStatus();
      if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
          || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        return; // filter urls not parsed
      }
      Utf8 mark = Mark.UPDATEDB_MARK.checkMark(page);
      if (!batchId.equals(REINDEX)) {
        if (!NutchJob.shouldProcess(mark, batchId)) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
          }
          return;
        }
      }
      NutchDocument doc = indexUtil.index(key, page);
      if (doc == null) {
        return;
      }
      if (mark != null) {
        Mark.INDEX_MARK.putMark(page, Mark.UPDATEDB_MARK.checkMark(page));
        store.put(key, page);
      }
      context.write(key, doc);
    }
  }
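The skip rules at the top of map() can be condensed into one boolean decision. The sketch below is a simplification under stated assumptions: MapDecisionSketch is a hypothetical class, plain strings replace Utf8, the constants are assumed to match the "-all" and "-reindex" options from the usage message, and shouldProcess() encodes the semantics I assume for NutchJob.shouldProcess() (a page without a mark is never processed; otherwise the mark must equal the batch id unless all batches were requested).

```java
class MapDecisionSketch {
    static final String ALL = "-all";         // assumed value of Nutch.ALL_BATCH_ID_STR
    static final String REINDEX = "-reindex"; // assumed value of the REINDEX constant

    // Assumed semantics of NutchJob.shouldProcess(mark, batchId).
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) return false;
        return ALL.equals(batchId) || mark.equals(batchId);
    }

    // True when map() would go on to call indexUtil.index() for this page.
    static boolean shouldIndex(boolean parsedOk, boolean isRedirect,
                               String updatedbMark, String batchId) {
        if (!parsedOk || isRedirect) return false;   // "filter urls not parsed"
        if (!REINDEX.equals(batchId) && !shouldProcess(updatedbMark, batchId)) {
            return false;                            // different batch id
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex(true, false, "batch-1", "batch-1")); // same batch
        System.out.println(shouldIndex(true, false, "batch-2", "batch-1")); // other batch
        System.out.println(shouldIndex(true, false, null, REINDEX));        // reindex ignores marks
        System.out.println(shouldIndex(false, false, "batch-1", ALL));      // parse failed
    }
}
```

Even when this decision is positive, the real map() still drops the page if indexUtil.index() returns null.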

3. The call to context.write()

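Because the job was configured with job.setOutputFormatClass(IndexerOutputFormat.class) and has no reduce tasks (job.setNumReduceTasks(0)), every (key, doc) pair passed to context.write() goes straight to IndexerOutputFormat. That output format presumably forwards each NutchDocument to the index writer registered earlier through NutchIndexWriterFactory.addClassToConf(), namely SolrWriter, which is how the document ends up in the Solr index.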
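One way to picture this hand-off is a record writer that ignores the key and passes each document to every registered index writer. The sketch below is only a model of that idea: DocWriter, RecordingWriter and FanOutRecordWriter are hypothetical types, documents are plain maps, and nothing here is taken from the actual IndexerOutputFormat source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FanOutSketch {
    // Hypothetical stand-in for an index writer such as SolrWriter.
    interface DocWriter {
        void write(Map<String, String> doc);
        void close();
    }

    // Collects documents in memory instead of sending them to a server.
    static class RecordingWriter implements DocWriter {
        final List<Map<String, String>> received = new ArrayList<>();
        boolean closed = false;

        @Override
        public void write(Map<String, String> doc) { received.add(doc); }

        @Override
        public void close() { closed = true; }
    }

    // Hypothetical stand-in for the record writer behind an output format:
    // the key is ignored and the document goes to every registered writer.
    static class FanOutRecordWriter {
        private final List<DocWriter> writers;

        FanOutRecordWriter(List<DocWriter> writers) { this.writers = writers; }

        void write(String key, Map<String, String> doc) {
            for (DocWriter w : writers) w.write(doc);
        }

        void close() {
            for (DocWriter w : writers) w.close();
        }
    }

    public static void main(String[] args) {
        RecordingWriter solrLike = new RecordingWriter();
        List<DocWriter> writers = new ArrayList<>();
        writers.add(solrLike);
        FanOutRecordWriter out = new FanOutRecordWriter(writers);

        Map<String, String> doc = new HashMap<>();
        doc.put("id", "com.example.www:http/");
        out.write("com.example.www:http/", doc); // what context.write(key, doc) would trigger
        out.close();                              // end of the map task
        System.out.println(solrLike.received.size() + " document(s), closed=" + solrLike.closed);
    }
}
```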

(3) public class IndexUtil

1. The index() method

  public NutchDocument index(String key, WebPage page) {
    NutchDocument doc = new NutchDocument();
    doc.add("id", key);
    doc.add("digest", StringUtil.toHexString(page.getSignature()));
    if (page.getBatchId() != null) {
      doc.add("batchId", page.getBatchId().toString());
    }
    String url = TableUtil.unreverseUrl(key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Indexing URL: " + url);
    }
    try {
      doc = filters.filter(doc, url, page);
    } catch (IndexingException e) {
      LOG.warn("Error indexing "+key+": "+e);
      return null;
    }
    // skip documents discarded by indexing filters
    if (doc == null) return null;
    float boost = 1.0f;
    // run scoring filters
    try {
      boost = scoringFilters.indexerScore(url, doc, page, boost);
    } catch (final ScoringFilterException e) {
      LOG.warn("Error calculating score " + key + ": " + e);
      return null;
    }
    doc.setScore(boost);
    // store boost for use by explain and dedup
    doc.add("boost", Float.toString(boost));
    return doc;
  }
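The shape of index() is: seed a few fields, run the document through a chain of indexing filters (any of which may discard it by returning null), then record the boost. The sketch below reproduces that shape with hypothetical types: IndexChainSketch, DocFilter and runFilters() are invented for illustration, documents are plain maps, and the assumption that the chain stops at the first null is mine rather than something shown in the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexChainSketch {
    // Hypothetical stand-in for an indexing filter: may add fields,
    // or return null to discard the document.
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    // Runs the filters in order; a null result stops the chain (assumed behavior).
    static Map<String, String> runFilters(List<DocFilter> filters,
                                          Map<String, String> doc, String url) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) return null;
        }
        return doc;
    }

    // Mirrors the shape of IndexUtil.index(): seed fields, filter, then record the boost.
    static Map<String, String> index(String key, String url, List<DocFilter> filters) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", key);
        doc = runFilters(filters, doc, url);
        if (doc == null) return null;            // discarded by a filter
        float boost = 1.0f;                      // a scoring filter could adjust this
        doc.put("boost", Float.toString(boost));
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> filters = new ArrayList<>();
        filters.add((doc, url) -> { doc.put("url", url); return doc; }); // adds a field
        filters.add((doc, url) -> url.endsWith(".tmp") ? null : doc);     // drops some pages
        System.out.println(index("com.example.www:http/", "http://www.example.com/", filters));
        System.out.println(index("com.example.www:http/a.tmp", "http://www.example.com/a.tmp", filters));
    }
}
```

The first call prints a document carrying id, url and boost; the second prints null because the second filter discards the page, which corresponds to the "skip documents discarded by indexing filters" branch.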

III. Field Indexing in Plugins

1. The basic fields are indexed in public class BasicIndexingFilter implements IndexingFilter.
