Nutch 2.x with ElasticSearch: Crawling and Indexing
- <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.5" conf="*->default"/>
- <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
- <dependency org="log4j" name="log4j" rev="1.2.16" conf="*->master" />
- #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
- #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
- #gora.sqlstore.jdbc.user=sa
- #gora.sqlstore.jdbc.password=
- gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
- <property>
- <name>storage.data.store.class</name>
- <value>org.apache.gora.hbase.store.HBaseStore</value>
- </property>
- <property>
- <name>http.agent.name</name>
- <value>NutchCrawler</value>
- </property>
- <property>
- <name>parser.character.encoding.default</name>
- <value>utf-8</value>
- </property>
- <property>
- <name>http.accept.language</name>
- <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
- </property>
- <property>
- <name>generate.batch.id</name>
- <value>1</value>
- </property>
- <configuration>
- <property>
- <name>hbase.rootdir</name>
- <value>file:///data/hbase</value>
- </property>
- <property>
- <name>hbase.zookeeper.property.dataDir</name>
- <value>/data/zookeeper</value>
- </property>
- </configuration>
ant clean
- <property>
- <name>plugin.folders</name>
- <value>/home/eryk/workspace/nutch/runtime/local/plugins</value>
- </property>
- <dependency>
- <groupId>net.sourceforge.nekohtml</groupId>
- <artifactId>nekohtml</artifactId>
- <version>1.9.15</version>
- </dependency>
- <dependency>
- <groupId>org.ccil.cowan.tagsoup</groupId>
- <artifactId>tagsoup</artifactId>
- <version>1.2</version>
- </dependency>
- <dependency>
- <groupId>rome</groupId>
- <artifactId>rome</artifactId>
- <version>1.0</version>
- </dependency>
- <dependency>
- <groupId>org.elasticsearch</groupId>
- <artifactId>elasticsearch</artifactId>
- <version>0.90.5</version>
- <optional>true</optional>
- </dependency>
- <dependency>
- <groupId>org.restlet.jse</groupId>
- <artifactId>org.restlet.ext.jackson</artifactId>
- <version>2.0.5</version>
- <exclusions>
- <exclusion>
- <artifactId>jackson-core-asl</artifactId>
- <groupId>org.codehaus.jackson</groupId>
- </exclusion>
- <exclusion>
- <artifactId>jackson-mapper-asl</artifactId>
- <groupId>org.codehaus.jackson</groupId>
- </exclusion>
- </exclusions>
- <optional>true</optional>
- </dependency>
- <dependency>
- <groupId>org.apache.gora</groupId>
- <artifactId>gora-core</artifactId>
- <version>0.3</version>
- <exclusions>
- <exclusion>
- <artifactId>jackson-mapper-asl</artifactId>
- <groupId>org.codehaus.jackson</groupId>
- </exclusion>
- </exclusions>
- <optional>true</optional>
- </dependency>
- Map<String,Object> argMap = ToolUtil.toArgMap(
- Nutch.ARG_THREADS, threads,
- Nutch.ARG_DEPTH, depth,
- Nutch.ARG_TOPN, topN,
- Nutch.ARG_SOLR, solrUrl,
- ElasticConstants.CLUSTER,elasticSearchAddr, // build the index with Elasticsearch
- Nutch.ARG_SEEDDIR, seedDir,
- Nutch.ARG_NUMTASKS, numTasks,
- Nutch.ARG_BATCH,batchId, // works around a NullPointerException
- GeneratorJob.BATCH_ID,batchId); // also meant to work around the NullPointerException; appears to have no effect
- run(argMap);
- public void open(TaskAttemptContext job) throws IOException {
- String clusterName = job.getConfiguration().get(ElasticConstants.CLUSTER);
- if (clusterName != null && !clusterName.contains(":")) {
- node = nodeBuilder().clusterName(clusterName).client(true).node();
- } else {
- node = nodeBuilder().client(true).node();
- }
- LOG.info(String.format("clusterName=[%s]",clusterName));
- if(clusterName != null && clusterName.contains(":")){
- String[] addr = clusterName.split(":");
- client = new TransportClient()
- .addTransportAddress(new InetSocketTransportAddress(addr[0],Integer.parseInt(addr[1])));
- }else{
- client = node.client();
- }
- bulk = client.prepareBulk();
- defaultIndex = job.getConfiguration().get(ElasticConstants.INDEX, "index");
- maxBulkDocs = job.getConfiguration().getInt(
- ElasticConstants.MAX_BULK_DOCS, DEFAULT_MAX_BULK_DOCS);
- maxBulkLength = job.getConfiguration().getInt(
- ElasticConstants.MAX_BULK_LENGTH, DEFAULT_MAX_BULK_LENGTH);
- }
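The `open` method above treats the `ElasticConstants.CLUSTER` value either as a cluster name or as a `host:port` transport address, depending on whether it contains a colon. A minimal sketch of that decision as a standalone, null-safe helper follows. The class `EsTarget` and its members are hypothetical names introduced for illustration; they are not part of Nutch or Elasticsearch, and the port range check is an added assumption.

```java
/** Hypothetical helper (not part of Nutch or Elasticsearch) that classifies the configured value. */
final class EsTarget {
    final String clusterName; // set only when the value is a plain cluster name
    final String host;        // set only when the value has the form host:port
    final int port;           // -1 when no host:port address was given

    private EsTarget(String clusterName, String host, int port) {
        this.clusterName = clusterName;
        this.host = host;
        this.port = port;
    }

    /** True when the value should be used as a transport address rather than a cluster name. */
    boolean isTransportAddress() {
        return host != null;
    }

    /** Parses a value such as "a2:9300" or "mycluster"; null or blank means "nothing configured". */
    static EsTarget parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return new EsTarget(null, null, -1);
        }
        String v = value.trim();
        int idx = v.lastIndexOf(':');
        if (idx < 0) {
            // No colon: treat the whole value as a cluster name.
            return new EsTarget(v, null, -1);
        }
        String host = v.substring(0, idx);
        String portText = v.substring(idx + 1);
        if (host.isEmpty()) {
            throw new IllegalArgumentException("missing host in: " + v);
        }
        int port;
        try {
            port = Integer.parseInt(portText);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid port in: " + v, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range in: " + v);
        }
        return new EsTarget(null, host, port);
    }
}
```

With a helper like this, the writer could call `EsTarget.parse(...)` once and branch on `isTransportAddress()`, so an unset property never reaches a `contains(":")` call.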
- 2013-11-03 22:57:36,682 INFO elasticsearch.node - [Ikonn] started
- 2013-11-03 22:57:36,682 INFO elastic.ElasticWriter - clusterName=[a2:9300]
- 2013-11-03 22:57:36,692 INFO elasticsearch.plugins - [Electron] loaded [], sites []
- 2013-11-03 22:57:36,863 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
- 2013-11-03 22:57:36,864 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
- 2013-11-03 22:57:36,864 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
- 2013-11-03 22:57:36,865 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
- 2013-11-03 22:57:37,946 INFO elastic.ElasticWriter - Processing remaining requests [docs = 86, length = 130314, total docs = 86]
- 2013-11-03 22:57:37,988 INFO elastic.ElasticWriter - Processing to finalize last execute
- 2013-11-03 22:57:41,986 INFO elastic.ElasticWriter - Previous took in ms 1590, including wait 3998
- 2013-11-03 22:57:42,020 INFO elasticsearch.node - [Ikonn] stopping ...
- 2013-11-03 22:57:42,032 INFO elasticsearch.node - [Ikonn] stopped
- 2013-11-03 22:57:42,032 INFO elasticsearch.node - [Ikonn] closing ...
- 2013-11-03 22:57:42,039 INFO elasticsearch.node - [Ikonn] closed
- 2013-11-03 22:57:42,041 WARN mapred.FileOutputCommitter - Output path is null in cleanup
- 2013-11-03 22:57:42,057 INFO elastic.ElasticIndexerJob - Done
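The log line `Processing remaining requests [docs = 86, length = 130314, total docs = 86]` relates to the `maxBulkDocs` and `maxBulkLength` settings read in `open`: documents are buffered into a bulk request, and whatever is still pending gets sent when the writer closes. The following is a rough sketch of that kind of threshold-based buffering, not the ElasticWriter source. The class `BulkBuffer`, its method names, and the use of string length as the size measure are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of bulk buffering with a document-count and a length threshold. */
final class BulkBuffer {
    private final int maxDocs;
    private final int maxLength;
    private final List<String> pending = new ArrayList<>();
    private int pendingLength = 0;
    private int flushCount = 0;
    private long totalDocs = 0;

    BulkBuffer(int maxDocs, int maxLength) {
        this.maxDocs = maxDocs;
        this.maxLength = maxLength;
    }

    /** Buffers one document and flushes as soon as either threshold is reached. */
    void add(String doc) {
        pending.add(doc);
        pendingLength += doc.length();
        totalDocs++;
        if (pending.size() >= maxDocs || pendingLength >= maxLength) {
            flush();
        }
    }

    /** Sends whatever is still buffered, analogous to "Processing remaining requests". */
    void close() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // A real writer would execute the bulk request here; this sketch only counts flushes.
        flushCount++;
        pending.clear();
        pendingLength = 0;
    }

    int pendingDocs() { return pending.size(); }
    int flushCount() { return flushCount; }
    long totalDocs() { return totalDocs; }
}
```

In the run logged above, 86 documents with a combined length of 130314 were still pending at close time, which suggests neither threshold was reached during the job and the whole batch went out in the final flush.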