1 Prepare the analyzer

Built-in analyzers

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Chinese word segmentation

smartcn

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html

ik

$ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

Reference: https://github.com/medcl/elasticsearch-analysis-ik

Other plugins

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html

2 Create the index: prepare the mapping and decide on shards and replicas

# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_routing_shards": 30,
    "index.number_of_replicas": 1,
    "index.translog.durability": "async",
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_doc": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": { "type": "text", "analyzer": "ik_smart" },
        "name": { "type": "keyword", "doc_values": false },
        "age": { "type": "integer", "index": false },
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}'

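Where:

_source controls whether the original JSON document is stored
_all controls whether a catch-all inverted index is built over the values of all fields
analyzer specifies the analyzer used to tokenize the field
doc_values controls whether the field is also stored column-wise (used for sorting and aggregations)
index controls whether the field is added to the inverted index (that is, whether it is searchable)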

The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
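The same create-index call can be assembled programmatically, which makes it easier to vary the shard and replica counts per environment. The following is a minimal Python sketch, assuming only the standard library; the function names `build_index_body` and `build_create_index_request` are illustrative and not part of any Elasticsearch client. It only builds the request and does not send it.

```python
import json
import urllib.request


def build_index_body(shards=10, routing_shards=30, replicas=1):
    """Return the settings and mapping from the curl command above as a dict."""
    return {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_routing_shards": routing_shards,
            "index.number_of_replicas": replicas,
            "index.translog.durability": "async",
            "index.merge.scheduler.max_thread_count": 1,
            "index.refresh_interval": "30s",
        },
        "mappings": {
            "_doc": {
                "_all": {"enabled": False},
                "_source": {"enabled": False},
                "properties": {
                    "title": {"type": "text", "analyzer": "ik_smart"},
                    "name": {"type": "keyword", "doc_values": False},
                    "age": {"type": "integer", "index": False},
                    "created": {
                        "type": "date",
                        "format": "strict_date_optional_time||epoch_millis",
                    },
                },
            }
        },
    }


def build_create_index_request(host, index, body):
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{host}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


body = build_index_body()
req = build_create_index_request("http://localhost:9200", "testdoc", body)
# To actually create the index: urllib.request.urlopen(req)
```

Keeping the body as a dict means the values that cannot be changed later (such as the shard count) are set in one obvious place before the index is created.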

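Data types

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

There are two string types, text and keyword. The difference is that text fields are analyzed (split into tokens) while keyword fields are indexed as a single, unmodified term.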
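To make the difference concrete, here is a toy Python model of what ends up in the inverted index for each type. It is only an illustration: a lowercase-and-split-on-whitespace function stands in for a real analyzer (ik_smart and the built-in analyzers do far more), and the function names are hypothetical.

```python
from collections import defaultdict


def index_terms(value, field_type):
    """Return the terms one field value contributes to the index (toy model)."""
    if field_type == "keyword":
        # keyword: the whole value is one term, unchanged
        return [value]
    if field_type == "text":
        # text: stand-in analyzer, lowercase then split on whitespace
        return value.lower().split()
    raise ValueError(f"unsupported field type: {field_type}")


def build_inverted_index(docs, field_type):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, value in docs.items():
        for term in index_terms(value, field_type):
            inverted[term].add(doc_id)
    return inverted


docs = {1: "Quick brown fox", 2: "Quick red fox"}
text_index = build_inverted_index(docs, "text")
keyword_index = build_inverted_index(docs, "keyword")
```

In this model a search for `quick` matches both documents in `text_index` but nothing in `keyword_index`, where only the exact original strings are terms.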

text

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

keyword

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

3 Import data

3.1 Call the index API

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
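As a rough sketch of a single-document index call against the testdoc index defined earlier, the Python below builds a PUT request to `/testdoc/_doc/<id>` (the 6.x URL shape with the `_doc` type). The helper name `build_index_request` and the sample field values are made up for illustration; the request is built but not sent.

```python
import json
import urllib.request


def build_index_request(host, index, doc_id, doc):
    """Build (but do not send) a request that indexes one document by id."""
    return urllib.request.Request(
        url=f"{host}/{index}/_doc/{doc_id}",
        # ensure_ascii=False keeps Chinese text readable in the payload
        data=json.dumps(doc, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Sample values are hypothetical; the field names come from the mapping above.
doc = {
    "title": "中文分词测试",
    "name": "barney",
    "age": 30,
    "created": "2019-03-27T03:14:50Z",
}
index_req = build_index_request("http://localhost:9200", "testdoc", 1, doc)
# To actually index the document: urllib.request.urlopen(index_req)
```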

3.2 Prepare a Hive external table

See: https://www.cnblogs.com/barneywill/p/10300951.html

4 Test

# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'

or

# curl -XGET 'http://localhost:9200/testdoc/_search?q=*'

5 Problems

Error: all nodes failed

2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
... 8 more

Solution: increase index.number_of_shards. It can only be set when the index is created; the default is 5 (in 6.x; 1 since 7.0).

Error: es_rejected_execution_exception

Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]

Cause:

thread_pool.write.queue_size

For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

The queue_size allows to control the size of the queue of pending requests that have no threads to execute them. By default, it is set to -1 which means it is unbounded. When a request comes in and the queue is full, it will abort the request.

Check the thread_pool statistics

# curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7

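This usually happens when the write rate, concurrency, or overall load exceeds what ES can handle: once the queue is full, further requests are rejected.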
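Instead of grepping the output, the write pool counters can be pulled out of the `_nodes/stats` response directly. The sketch below assumes the response has already been fetched and parsed into a dict; `write_pool_stats` is a hypothetical helper, and the sample numbers are made up (loosely mirroring the log above), not real measurements.

```python
def write_pool_stats(nodes_stats):
    """Map node name -> active/queue/rejected counters of the write thread pool."""
    result = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("write")
        if pool is None:
            continue  # node reported no write pool
        result[node.get("name", node_id)] = {
            "active": pool.get("active", 0),
            "queue": pool.get("queue", 0),
            "rejected": pool.get("rejected", 0),
        }
    return result


# Made-up sample in the shape returned by GET /_nodes/stats
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {
                "write": {
                    "threads": 32,
                    "queue": 200,
                    "active": 32,
                    "rejected": 70,
                    "largest": 32,
                    "completed": 5686436,
                }
            },
        }
    }
}

stats = write_pool_stats(sample)
# Nodes whose write pool has rejected at least one request
saturated = [name for name, s in stats.items() if s["rejected"] > 0]
```

A steadily growing `rejected` counter while `queue` sits at its capacity is the signature of the error above.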

Solutions:

1) Tune the configuration

index.refresh_interval: -1
index.number_of_replicas: 0
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024

See: https://www.cnblogs.com/barneywill/p/10615249.html
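Note that the four settings listed above live at different levels: index.refresh_interval and index.number_of_replicas are dynamic index settings that can be changed through the index settings API, while indices.memory.index_buffer_size and thread_pool.write.queue_size are node-level settings that belong in elasticsearch.yml. A minimal Python sketch for the dynamic part follows, assuming the testdoc index from earlier; the helper and constant names are illustrative, and the request is built but not sent.

```python
import json
import urllib.request

# Applied before a bulk import: no periodic refresh, no replicas to write to.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

# Applied after the import: the values used when the index was created.
AFTER_LOAD_SETTINGS = {"index": {"refresh_interval": "30s", "number_of_replicas": 1}}


def build_settings_request(host, index, settings):
    """Build (but do not send) a PUT /<index>/_settings request."""
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


before = build_settings_request("http://localhost:9200", "testdoc", BULK_LOAD_SETTINGS)
after = build_settings_request("http://localhost:9200", "testdoc", AFTER_LOAD_SETTINGS)
# To apply: urllib.request.urlopen(before) ... run the import ... urllib.request.urlopen(after)
```

Restoring the original values after the import matters: with refresh disabled, newly written documents do not become searchable, and with zero replicas there is no redundancy.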

2) Reduce the write pressure
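One common way to reduce pressure on the client side is to send smaller batches and back off when a batch is rejected. The Python below is a generic sketch of exponential backoff, not the es-hadoop implementation: `Rejected`, `send_with_backoff`, and the `send` callable are all hypothetical names introduced for illustration. For Hive jobs, es-hadoop has its own batch and retry options (for example es.batch.size.entries and es.batch.write.retry.count); check its configuration reference for the exact semantics.

```python
import time


class Rejected(Exception):
    """Hypothetical marker: the cluster rejected the batch because its queue was full."""


def send_with_backoff(send, batch, retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(batch); on Rejected, wait base_delay * 2**attempt and try again.

    Re-raises Rejected once `retries` retries have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return send(batch)
        except Rejected:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Usage would look like `for batch in chunked(docs, 500): send_with_backoff(my_send, batch)`, where `my_send` is whatever function posts a batch and raises `Rejected` when the response reports es_rejected_execution_exception.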
