1 Prepare the analyzer

Built-in analyzers

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Chinese word segmentation

smartcn

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html

ik

The plugin version must match the Elasticsearch version (6.6.2 here). Install it on every node and restart the node afterwards.

$ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

Reference: https://github.com/medcl/elasticsearch-analysis-ik
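Before using an analyzer in a mapping, it can help to see which tokens it produces. The sketch below builds a request for the `_analyze` API using only the Python standard library. It assumes a node at a base URL such as http://localhost:9200 with the ik plugin installed; the function names are illustrative and not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build a POST request for the _analyze API (nothing is sent here)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

Calling `analyze("http://localhost:9200", "ik_smart", some_text)` and then the same text with `ik_max_word` shows the difference between the two granularities that the ik plugin provides.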

Other plugins

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html

2 Create the index: prepare the mapping and decide the number of shards and replicas

# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_routing_shards": 30,
    "index.number_of_replicas": 1,
    "index.translog.durability": "async",
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_doc": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": { "type": "text", "analyzer": "ik_smart" },
        "name": { "type": "keyword", "doc_values": false },
        "age": { "type": "integer", "index": false },
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}'

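Where:

_source controls whether the original JSON document is stored
_all controls whether the values of all fields are combined into one catch-all field and indexed (deprecated since 6.0)
analyzer specifies how a text field is tokenized
doc_values controls whether the field is also stored column-wise (used for sorting and aggregations)
index controls whether the field is added to the inverted index, that is, whether it is searchable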

The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
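Because the shard count cannot be changed after the index exists, it can be convenient to generate the request body from a script and vary the numbers per environment. The following is a minimal sketch that reproduces the body of the curl command above; the function names and the default arguments are illustrative, and the request is only built, not sent.

```python
import json
import urllib.request


def build_index_body(shards=10, routing_shards=30, replicas=1):
    """Return the settings and mapping used for the testdoc index."""
    return {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_routing_shards": routing_shards,
            "index.number_of_replicas": replicas,
            "index.translog.durability": "async",
            "index.merge.scheduler.max_thread_count": 1,
            "index.refresh_interval": "30s",
        },
        "mappings": {
            "_doc": {
                "_all": {"enabled": False},
                "_source": {"enabled": False},
                "properties": {
                    "title": {"type": "text", "analyzer": "ik_smart"},
                    "name": {"type": "keyword", "doc_values": False},
                    "age": {"type": "integer", "index": False},
                    "created": {
                        "type": "date",
                        "format": "strict_date_optional_time||epoch_millis",
                    },
                },
            }
        },
    }


def build_create_index_request(base_url, index, body):
    """Build the PUT request that creates the index (nothing is sent here)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the result of `build_create_index_request("http://localhost:9200", "testdoc", build_index_body())` to `urllib.request.urlopen` would be the equivalent of the curl call.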

Data types

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

There are two string types, text and keyword. A text field is analyzed (split into tokens); a keyword field is indexed as a single, unanalyzed term.

text

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

keyword

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
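The difference between the two string types shows up at query time: an analyzed text field is normally searched with a match query, while a keyword field is matched exactly with a term query. The sketch below picks the clause from the field definitions of the testdoc mapping above; `query_for` is a hypothetical helper, not an Elasticsearch API.

```python
# Field definitions copied from the testdoc mapping.
PROPERTIES = {
    "title": {"type": "text", "analyzer": "ik_smart"},
    "name": {"type": "keyword", "doc_values": False},
    "age": {"type": "integer", "index": False},
}


def query_for(field, value, properties=PROPERTIES):
    """Pick a query clause that fits how the field is indexed."""
    definition = properties[field]
    if definition.get("index", True) is False:
        raise ValueError(f"field {field!r} is not indexed and cannot be searched")
    if definition["type"] == "text":
        # Analyzed field: the query text is tokenized before matching.
        return {"query": {"match": {field: value}}}
    # keyword and other non-analyzed types: exact term lookup.
    return {"query": {"term": {field: value}}}
```

With this mapping, `age` raises an error because it was declared with "index": false, which matches the meaning of the index option described earlier.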

3 Import data

3.1 Call the index API

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
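When many documents are loaded from a script, sending them one by one through the index API is slow; the bulk API accepts many index actions in one newline-delimited JSON body. The sketch below builds such a body for the testdoc index and its _doc type. The helper names are illustrative and the request is only built, not sent.

```python
import json
import urllib.request


def build_bulk_body(docs):
    """Serialize documents as NDJSON: one action line, then one source line."""
    if not docs:
        raise ValueError("at least one document is required")
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # let Elasticsearch assign the id
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def build_bulk_request(base_url, index, docs, doc_type="_doc"):
    """Build the POST request for /<index>/<type>/_bulk (6.x style URL)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/{doc_type}/_bulk",
        data=build_bulk_body(docs).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
```

Keeping each bulk request to a moderate number of documents limits the load a single request puts on the write thread pool.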

3.2 Prepare a Hive external table

Details: https://www.cnblogs.com/barneywill/p/10300951.html

4 Test

# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'

or

# curl -XGET 'http://localhost:9200/testdoc/_search?q=*'
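To check an import from a script, the total hit count in the `_search` response is usually enough. A small sketch, assuming the response JSON has already been parsed into a dict: in 6.x `hits.total` is a plain number, while 7.x returns an object with a `value` field, so the hypothetical helper below accepts both shapes.

```python
def total_hits(search_response):
    """Return the total number of matching documents from a _search response."""
    total = search_response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x shape: {"value": 123, "relation": "eq"}
        return total["value"]
    # 6.x shape: a plain integer
    return total
```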

5 Problems

Error: all nodes failed

2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
... 8 more

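Fix: increase index.number_of_shards so that writes are spread across more shards. It can only be set when the index is created; the default is 5 in Elasticsearch 6.x.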

Error: es_rejected_execution_exception

Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]

Cause:

thread_pool.write.queue_size

The write thread pool handles single-document index/delete/update requests and bulk requests. It is a fixed pool sized to the number of available processors, with a queue_size of 200. The maximum size for this pool is 1 + the number of available processors.

The queue_size controls how many pending requests may wait when no thread is free to execute them. For fixed thread pools in general the default is -1, which means the queue is unbounded, but the write pool defaults to 200 (the "queue capacity = 200" in the log above). When a request arrives and the queue is full, the request is rejected.

Check the thread_pool statistics

# curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7

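This usually happens when the write rate, concurrency, or overall load exceeds what ES can process: once the queue is full, further requests are rejected.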
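The same statistics can be read from a script instead of grep. The sketch below summarizes the write pool per node from a parsed `_nodes/stats` response; `write_pool_summary` is a hypothetical helper, and it assumes the response contains the `thread_pool` section for every node.

```python
def write_pool_summary(nodes_stats):
    """Map node name to the active, queue and rejected counters of the write pool."""
    summary = {}
    for node_id, node in nodes_stats["nodes"].items():
        pool = node["thread_pool"]["write"]
        summary[node.get("name", node_id)] = {
            "active": pool["active"],
            "queue": pool["queue"],
            "rejected": pool["rejected"],
        }
    return summary


def nodes_with_rejections(nodes_stats):
    """Return the names of nodes whose write pool has rejected any request."""
    summary = write_pool_summary(nodes_stats)
    return sorted(name for name, pool in summary.items() if pool["rejected"] > 0)
```

A `rejected` counter that keeps growing during an import means the cluster is receiving writes faster than it can process them.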

Fixes:

1) Tune the configuration

index.refresh_interval: -1
index.number_of_replicas: 0
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024

Details: https://www.cnblogs.com/barneywill/p/10615249.html
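The four settings are not applied in the same place. index.refresh_interval and index.number_of_replicas are dynamic index settings and can be changed on a live index through `PUT /<index>/_settings`; indices.memory.index_buffer_size and thread_pool.write.queue_size are node-level settings that belong in elasticsearch.yml and need a node restart. The sketch below covers the dynamic part only and assumes the values from section 2 (30s, 1 replica) should be restored after the import; the names are illustrative and the request is only built, not sent.

```python
import json
import urllib.request

# Applied before a large import: no periodic refresh, no replica copies.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

# Applied after the import: the values used when the index was created.
RESTORE_SETTINGS = {"index": {"refresh_interval": "30s", "number_of_replicas": 1}}


def build_settings_request(base_url, index, settings):
    """Build the PUT request for /<index>/_settings (nothing is sent here)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Restoring the replica count after the import makes Elasticsearch copy the finished shards once, instead of indexing every document twice during the load.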

2) Reduce the write pressure
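On the client side, reducing pressure usually means smaller batches, fewer concurrent writers, and retrying rejected items after a pause instead of failing the job. In a bulk response, a rejected item carries status 429. The sketch below finds those items and computes capped exponential backoff delays; the helper names and default delays are assumptions for illustration, not es-hadoop behavior. For Hive jobs, es-hadoop has its own batch and retry options (for example es.batch.size.entries and es.batch.write.retry.count); check the es-hadoop configuration reference for the exact names and defaults of your version.

```python
def rejected_positions(bulk_response):
    """Return the positions of bulk items that were rejected with status 429."""
    positions = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item has a single key naming the action: index, create, update or delete.
        result = next(iter(item.values()))
        if result.get("status") == 429:
            positions.append(position)
    return positions


def backoff_delays(retries, base_seconds=1.0, cap_seconds=60.0):
    """Delays in seconds for successive retries: doubling each time, up to a cap."""
    return [min(cap_seconds, base_seconds * 2 ** attempt) for attempt in range(retries)]
```

A writer would resend only the documents at `rejected_positions(response)`, sleeping for the next value from `backoff_delays` before each attempt.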
