[Original] Big Data Basics: ElasticSearch (4) The ES Data Import Process
1 Prepare the analyzer
Built-in analyzers
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
Chinese word segmentation
smartcn
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html
ik
$ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip
Reference: https://github.com/medcl/elasticsearch-analysis-ik
Other plugins
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html
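After installing a plugin (the node must be restarted for it to load), the analyzer can be verified with the _analyze API. A minimal sketch, assuming the ik plugin above is installed and ES listens on localhost:9200; the sample text is arbitrary:
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_analyze?pretty' -d '
{"analyzer": "ik_smart", "text": "大数据基础之ElasticSearch"}'
The response lists the produced tokens, which is a quick way to compare ik_smart, ik_max_word, smartcn, or the built-in analyzers before fixing the mapping.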
2 Create the index: prepare the mapping and decide on shards and replication
# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_routing_shards": 30,
    "index.number_of_replicas": 1,
    "index.translog.durability": "async",
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_doc": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": { "type": "text", "analyzer": "ik_smart" },
        "name": { "type": "keyword", "doc_values": false },
        "age": { "type": "integer", "index": false },
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}'
Where:
_source controls whether the original JSON document is stored
_all controls whether an inverted index is built over the whole original JSON
analyzer specifies which analyzer tokenizes the field
doc_values controls whether the field is stored in columnar form (for sorting and aggregations)
index controls whether the field gets an inverted index (i.e. is searchable)
The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
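To confirm the settings and mapping were applied as intended, they can simply be read back (a quick check against the testdoc index created above):
# curl 'http://localhost:9200/testdoc/_settings?pretty'
# curl 'http://localhost:9200/testdoc/_mapping?pretty'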
Data types
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
There are two string types, text and keyword; the difference is that text is analyzed (tokenized) while keyword is not.
text
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html
keyword
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
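The difference is easy to see with the _analyze API against the mapping above: the text field is tokenized, while the keyword field is kept as a single token. A small sketch, assuming the testdoc index from section 2; the sample values are arbitrary:
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_analyze?pretty' -d '
{"field": "title", "text": "大数据基础之ElasticSearch"}'
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_analyze?pretty' -d '
{"field": "name", "text": "barney will"}'
The first call returns several tokens (ik_smart), the second returns "barney will" as one token.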
3 Import the data
3.1 Call the index API
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
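A minimal sketch of indexing a single document and a small bulk request into the testdoc index above; the ids and field values here are made up for illustration:
# curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_doc/1' -d '
{"title": "大数据基础之ElasticSearch", "name": "barney", "age": 30, "created": "2019-03-27T03:14:50Z"}'
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_doc/_bulk' --data-binary '
{"index": {"_id": "2"}}
{"title": "es数据导入", "name": "will", "age": 31, "created": 1553656490000}
{"index": {"_id": "3"}}
{"title": "bulk测试", "name": "test", "age": 32, "created": 1553656491000}
'
Note that _bulk expects newline-delimited JSON (one action line plus one source line per document, with a trailing newline), which is why --data-binary is used instead of -d.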
3.2 Prepare a Hive external table
See: https://www.cnblogs.com/barneywill/p/10300951.html
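The Hive side is essentially an external table backed by the ES-Hadoop storage handler, into which rows are pushed with INSERT ... SELECT. The sketch below is only illustrative: the elasticsearch-hadoop jar path, the source table some_hive_table, and the es.nodes address are placeholders; see the link above for the full walkthrough.
# hive -e "
ADD JAR /path/to/elasticsearch-hadoop-6.6.2.jar;
-- external table mapped onto the testdoc index created in section 2
CREATE EXTERNAL TABLE testdoc_es (title string, name string, age int, created timestamp)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'testdoc/_doc', 'es.nodes' = 'localhost:9200');
-- push rows from an ordinary Hive table into ES
INSERT OVERWRITE TABLE testdoc_es SELECT title, name, age, created FROM some_hive_table;
"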
4 Test
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'
or
# curl -XGET 'http://localhost:9200/testdoc/_search?q=*'
5 Problems
Error: all nodes failed
2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
... 8 more
Solution: increase index.number_of_shards; it can only be set at index creation time and defaults to 5.
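To check the current shard count and how data is spread across shards before recreating the index with more shards, the _cat APIs are handy (assuming the testdoc index above):
# curl 'http://localhost:9200/_cat/shards/testdoc?v'
# curl 'http://localhost:9200/_cat/indices/testdoc?v'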
Error: es_rejected_execution_exception
Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]
Cause:
thread_pool.write.queue_size
For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.
The queue_size setting controls the size of the queue of pending requests that have no threads to execute them. By default, it is set to -1, which means it is unbounded. When a request comes in and the queue is full, the request is aborted.
View the thread_pool statistics:
# curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7
Rejections usually happen when the write rate, concurrency, or overall load exceeds what ES can handle; once the queue is full, further requests are rejected.
Solutions:
1) Tune the configuration (the curl sketch after this list applies the dynamic settings)
index.refresh_interval: -1
index.number_of_replicas: 0
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024
See: https://www.cnblogs.com/barneywill/p/10615249.html
2) Reduce the write pressure (e.g. smaller bulk batches or fewer concurrent writers)
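Of the four settings in 1), the first two are dynamic index settings that can be changed on a live index, while indices.memory.index_buffer_size and thread_pool.write.queue_size are node-level settings that go into elasticsearch.yml and require a restart. A sketch of the dynamic part, assuming the testdoc index (remember to restore refresh_interval and number_of_replicas once the import finishes):
# curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_settings' -d '
{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}'
# after the bulk load, restore the values from the index creation in section 2
# curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_settings' -d '
{"index": {"refresh_interval": "30s", "number_of_replicas": 1}}'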