ElasticSearch-hadoop saveToEs source code analysis

The call path through the classes is:

EsSpark ->
EsRDDWriter ->
RestService ->
RestRepository ->
RestClient

Their roles:

  • EsSpark — the entry point for reading from and writing to Elasticsearch
  • EsRDDWriter — calls RestService to create a PartitionWriter and writes the partition's data into Elasticsearch
  • RestService — responsible for creating the RestRepository and the PartitionWriter
  • RestRepository — the high-level abstraction over bulk writes; underneath it uses NetworkClient to issue the actual HTTP bulk requests
  • RestClient — builds the _bulk request, executes it over HTTP, and retries the entries that were rejected

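Before tracing the source, here is a minimal usage sketch of how this call chain gets triggered from user code (a sketch only: it assumes a local Spark master and an Elasticsearch node at localhost:9200, and the "spark/docs" index/type plus the sample documents are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark.rdd.EsSpark

    object SaveToEsDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("saveToEs-demo")
          .setMaster("local[2]")                 // assumption: local test run
          .set("es.nodes", "localhost:9200")     // assumption: ES reachable here
        val sc = new SparkContext(conf)

        val docs = sc.makeRDD(Seq(
          Map("title" -> "doc-1", "views" -> 10),
          Map("title" -> "doc-2", "views" -> 20)))

        // The entry point analysed below: EsSpark -> EsRDDWriter -> RestService -> RestRepository -> RestClient
        EsSpark.saveToEs(docs, "spark/docs")

        sc.stop()
      }
    }

The same call is also available as the implicit rdd.saveToEs(...) added by importing org.elasticsearch.spark._.
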
The corresponding source for each class is traced below:

https://github.com/elastic/elasticsearch-hadoop/blob/2.1/spark/core/main/scala/org/elasticsearch/spark/rdd/EsSpark.scala

    def saveToEs(rdd: RDD[_], resource: String) {
      saveToEs(rdd, Map(ES_RESOURCE_WRITE -> resource))
    }

    def saveToEs(rdd: RDD[_], resource: String, cfg: Map[String, String]) {
      saveToEs(rdd, collection.mutable.Map(cfg.toSeq: _*) += (ES_RESOURCE_WRITE -> resource))
    }

    def saveToEs(rdd: RDD[_], cfg: Map[String, String]) {
      CompatUtils.warnSchemaRDD(rdd, LogFactory.getLog("org.elasticsearch.spark.rdd.EsSpark"))
      if (rdd == null || rdd.partitions.length == 0) {
        return
      }
      val sparkCfg = new SparkSettingsManager().load(rdd.sparkContext.getConf)
      val config = new PropertiesSettings().load(sparkCfg.save())
      config.merge(cfg.asJava)
      rdd.sparkContext.runJob(rdd, new EsRDDWriter(config.save()).write _)
    }
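
The three overloads funnel into the last one: per-call settings from cfg are merged on top of the es.* settings read from the SparkConf, and runJob then invokes EsRDDWriter.write once per partition. As a hedged example, write-specific options can be passed through the three-argument overload (reusing the docs RDD from the sketch above; es.mapping.id and es.batch.size.entries are standard ES-Hadoop settings, the field name "title" is just from the sample data):

    // Per-call settings override the SparkConf-level es.* settings for this write only.
    EsSpark.saveToEs(docs, "spark/docs", Map(
      "es.mapping.id"         -> "title",   // use the "title" field as the document _id
      "es.batch.size.entries" -> "1000"))   // flush the bulk buffer every 1000 entries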

https://github.com/elastic/elasticsearch-hadoop/blob/2.1/spark/core/main/scala/org/elasticsearch/spark/rdd/EsRDDWriter.scala

    def write(taskContext: TaskContext, data: Iterator[T]) {
      val writer = RestService.createWriter(settings, taskContext.partitionId, -1, log)
      taskContext.addOnCompleteCallback(() => writer.close())
      if (runtimeMetadata) {
        writer.repository.addRuntimeFieldExtractor(metaExtractor)
      }
      while (data.hasNext) {
        writer.repository.writeToIndex(processData(data))
      }
    }
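
write runs once per partition: it creates a PartitionWriter pinned to that partition, registers a completion callback so the writer (and its bulk buffer) is closed and flushed when the task finishes, and then streams every record of the partition into the repository. Stripped of the ES specifics, the lifecycle amounts to this (a simplified sketch, not library code):

    // Generic shape of the per-partition write loop above.
    def writePartition[T](records: Iterator[T], writeOne: T => Unit, close: () => Unit): Unit =
      try records.foreach(writeOne)   // writer.repository.writeToIndex(processData(data))
      finally close()                 // ES-Hadoop achieves this via taskContext.addOnCompleteCallback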

https://github.com/elastic/elasticsearch-hadoop/blob/2.1/mr/src/main/java/org/elasticsearch/hadoop/rest/RestService.java

    public static PartitionWriter createWriter(Settings settings, int currentSplit, int totalSplits, Log log) {
        Version.logVersion();
        InitializationUtils.discoverEsVersion(settings, log);
        InitializationUtils.discoverNodesIfNeeded(settings, log);
        InitializationUtils.filterNonClientNodesIfNeeded(settings, log);
        InitializationUtils.filterNonDataNodesIfNeeded(settings, log);
        List<String> nodes = SettingsUtils.discoveredOrDeclaredNodes(settings);
        // check invalid splits (applicable when running in non-MR environments) - in this case fall back to Random..
        int selectedNode = (currentSplit < 0) ? new Random().nextInt(nodes.size()) : currentSplit % nodes.size();
        // select the appropriate nodes first, to spread the load before-hand
        SettingsUtils.pinNode(settings, nodes.get(selectedNode));
        Resource resource = new Resource(settings, false);
        log.info(String.format("Writing to [%s]", resource));
        // single index vs multi indices
        IndexExtractor iformat = ObjectUtils.instantiate(settings.getMappingIndexExtractorClassName(), settings);
        iformat.compile(resource.toString());
        RestRepository repository = (iformat.hasPattern() ? initMultiIndices(settings, currentSplit, resource, log) : initSingleIndex(settings, currentSplit, resource, log));
        return new PartitionWriter(settings, currentSplit, totalSplits, repository);
    }
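
The key step is node pinning: each Spark partition is assigned one of the discovered/declared nodes via currentSplit % nodes.size() (or a random node when the split id is unknown, e.g. outside MR), so the bulk load is spread across the cluster before any request is sent. A toy illustration of that rule (the node addresses are invented):

    import scala.util.Random

    // Stand-in for SettingsUtils.discoveredOrDeclaredNodes(settings)
    val nodes = Vector("10.0.0.1:9200", "10.0.0.2:9200", "10.0.0.3:9200")

    def pinnedNode(currentSplit: Int): String =
      if (currentSplit < 0) nodes(Random.nextInt(nodes.size))   // non-MR fallback: random node
      else nodes(currentSplit % nodes.size)                     // round-robin by partition id

    (0 until 5).foreach(p => println(s"partition $p -> ${pinnedNode(p)}"))
    // partition 0 -> 10.0.0.1:9200
    // partition 1 -> 10.0.0.2:9200
    // partition 2 -> 10.0.0.3:9200
    // partition 3 -> 10.0.0.1:9200
    // partition 4 -> 10.0.0.2:9200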

https://github.com/elastic/elasticsearch-hadoop/blob/2.1/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java

    /**
     * Writes the objects to index.
     *
     * @param object object to add to the index
     */
    public void writeToIndex(Object object) {
        Assert.notNull(object, "no object data given");
        lazyInitWriting();
        doWriteToIndex(command.write(object));
    }
    private void doWriteToIndex(BytesRef payload) {
        // check space first
        if (payload.length() > ba.available()) {
            if (autoFlush) {
                flush();
            }
            else {
                throw new EsHadoopIllegalStateException(
                        String.format("Auto-flush disabled and bulk buffer full; disable manual flush or increase capacity [current size %s]; bailing out", ba.capacity()));
            }
        }

        data.copyFrom(payload);
        payload.reset();
        dataEntries++;

        if (bufferEntriesThreshold > 0 && dataEntries >= bufferEntriesThreshold) {
            if (autoFlush) {
                flush();
            }
            else {
                // handle the corner case of manual flush that occurs only after the buffer is completely full (think size of 1)
                if (dataEntries > bufferEntriesThreshold) {
                    throw new EsHadoopIllegalStateException(
                            String.format(
                                    "Auto-flush disabled and maximum number of entries surpassed; disable manual flush or increase capacity [current size %s]; bailing out",
                                    bufferEntriesThreshold));
                }
            }
        }
    }
    public void flush() {
        BitSet bulk = tryFlush();
        if (!bulk.isEmpty()) {
            throw new EsHadoopException(String.format("Could not write all entries [%s/%s] (maybe ES was overloaded?). Bailing out...", bulk.cardinality(), bulk.size()));
        }
    }
    public BitSet tryFlush() {
        if (log.isDebugEnabled()) {
            log.debug(String.format("Sending batch of [%d] bytes/[%s] entries", data.length(), dataEntries));
        }

        BitSet bulkResult = EMPTY;
        try {
            // double check data - it might be a false flush (called on clean-up)
            if (data.length() > 0) {
                bulkResult = client.bulk(resourceW, data);
                executedBulkWrite = true;
            }
        } catch (EsHadoopException ex) {
            hadWriteErrors = true;
            throw ex;
        }

        // discard the data buffer, only if it was properly sent/processed
        //if (bulkResult.isEmpty()) {
        // always discard data since there's no code path that uses the in flight data
        discard();
        //}
        return bulkResult;
    }
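
So doWriteToIndex never sends a document on its own: it appends the serialized payload to an in-memory buffer and only flushes when the byte capacity (es.batch.size.bytes) or the entry threshold (es.batch.size.entries) is hit, throwing instead if auto-flush is disabled; tryFlush then ships the whole buffer as a single _bulk request and returns a BitSet whose set bits mark the entries that were still rejected. A condensed model of the buffering side (a sketch with simplified types; the real code works on BytesRef/TrackingBytesArray):

    // Illustrative only: byteCapacity ~ ba.capacity(), entryThreshold ~ bufferEntriesThreshold.
    final class BulkBuffer(byteCapacity: Int, entryThreshold: Int, autoFlush: Boolean,
                           send: String => Unit) {
      private val buf = new StringBuilder
      private var entries = 0

      def add(doc: String): Unit = {
        if (doc.length > byteCapacity - buf.length) {           // no room left for this payload
          if (autoFlush) flush()
          else sys.error("auto-flush disabled and bulk buffer full")
        }
        buf.append(doc)
        entries += 1
        if (entryThreshold > 0 && entries >= entryThreshold) {  // entry-count threshold reached
          if (autoFlush) flush()
          else if (entries > entryThreshold) sys.error("auto-flush disabled and max entries surpassed")
        }
      }

      def flush(): Unit = {                                     // one _bulk request per flush
        if (buf.nonEmpty) send(buf.toString)
        buf.clear()
        entries = 0
      }
    }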

https://github.com/elastic/elasticsearch-hadoop/blob/2.1/mr/src/main/java/org/elasticsearch/hadoop/rest/RestClient.java

    public BitSet bulk(Resource resource, TrackingBytesArray data) {
        Retry retry = retryPolicy.init();
        int httpStatus = 0;
        boolean isRetry = false;

        do {
            // NB: dynamically get the stats since the transport can change
            long start = network.transportStats().netTotalTime;
            Response response = execute(PUT, resource.bulk(), data);
            long spent = network.transportStats().netTotalTime - start;

            stats.bulkTotal++;
            stats.docsSent += data.entries();
            stats.bulkTotalTime += spent;
            // bytes will be counted by the transport layer
            if (isRetry) {
                stats.docsRetried += data.entries();
                stats.bytesRetried += data.length();
                stats.bulkRetries++;
                stats.bulkRetriesTotalTime += spent;
            }
            isRetry = true;

            httpStatus = (retryFailedEntries(response, data) ? HttpStatus.SERVICE_UNAVAILABLE : HttpStatus.OK);
        } while (data.length() > 0 && retry.retry(httpStatus));

        return data.leftoversPosition();
    }
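
bulk sends the buffer in a do/while loop: retryFailedEntries inspects the bulk response and keeps only the rejected documents inside the TrackingBytesArray, the HTTP status is pinned to SERVICE_UNAVAILABLE to drive the retry policy, and the loop re-sends just those leftovers until everything is accepted or the policy gives up; the returned BitSet marks the documents that were never written. A rough model of that loop (sendBulk is a hypothetical stand-in for execute(PUT, resource.bulk(), data) plus retryFailedEntries):

    // Returns the original positions of the documents that could not be written.
    def bulkWithRetry(docs: Vector[String],
                      sendBulk: Vector[String] => Set[Int],   // indices rejected by this attempt
                      maxRetries: Int): Set[Int] = {
      var pending  = docs.zipWithIndex                         // (document, original position)
      var attempts = 0
      do {
        val rejected = sendBulk(pending.map(_._1))
        // keep only the rejected entries for the next round, like retryFailedEntries does
        pending = pending.zipWithIndex.collect { case ((doc, pos), i) if rejected(i) => (doc, pos) }
        attempts += 1
      } while (pending.nonEmpty && attempts < maxRetries)
      pending.map(_._2).toSet                                  // the "leftovers": never-accepted docs
    }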
