[Original] Big Data Basics: ElasticSearch (5) Important Settings and Tuning
Index Settings (important index settings)
Index level settings can be set per-index. Settings may be:
1 static index settings
They can only be set at index creation time or on a closed index.
index.number_of_shards
The number of primary shards that an index should have. Defaults to 5 in Elasticsearch 6.x (7.0 and later default to 1). This setting can only be set at index creation time. It cannot be changed on a closed index. Note: the number of shards is limited to 1024 per index.
2 dynamic index settings
They can be changed on a live index using the update-index-settings API.
index.number_of_replicas
The number of replicas each primary shard has. Defaults to 1.
index.refresh_interval
How often to perform a refresh operation, which makes recent changes to the index visible to search. Defaults to 1s. Can be set to -1 to disable refresh.
index.blocks.read_only
Set to true to make the index and index metadata read only, false to allow writes and metadata changes.
index.blocks.read
Set to true to disable read operations against the index.
index.blocks.write
Set to true to disable data write operations against the index. Unlike read_only, this setting does not affect metadata. For instance, you can close an index with a write block, but not an index with a read_only block.
index.merge.scheduler.max_thread_count
The maximum number of threads on a single shard that may be merging at once. Defaults to Math.max(1, Math.min(4, Runtime.getRuntime().availableProcessors() / 2)) which works well for a good solid-state-disk (SSD). If your index is on spinning platter drives instead, decrease this to 1.
index.translog.durability
Whether or not to fsync and commit the translog after every index, delete, update, or bulk request. This setting accepts the following parameters:
request: (default) fsync and commit after every request. In the event of hardware failure, all acknowledged writes will already have been committed to disk.
async: fsync and commit in the background every sync_interval. In the event of hardware failure, all acknowledged writes since the last automatic commit will be discarded.
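The default quoted above for index.merge.scheduler.max_thread_count is easier to read with concrete numbers. The following is a minimal Python sketch of that formula only; the function name is hypothetical, and integer division is assumed to match the Java expression.

```python
def default_merge_thread_count(available_processors: int) -> int:
    """Mirror Math.max(1, Math.min(4, availableProcessors / 2)).

    Java divides two ints by truncating, so floor division is used here
    (the two agree for the positive values that matter).
    """
    return max(1, min(4, available_processors // 2))


if __name__ == "__main__":
    # Print the default for a few illustrative core counts.
    for cpus in (1, 2, 4, 8, 32):
        print(cpus, default_merge_thread_count(cpus))
```

On spinning disks the documentation quoted above still recommends overriding this value with 1, whatever the formula yields.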
Tuning for indexing speed
1 Use bulk requests
Bulk requests will yield much better performance than single-document index requests.
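As a rough illustration of what a bulk request carries, the sketch below builds the newline-delimited JSON body that the _bulk endpoint expects. The function name is hypothetical, and it assumes the target index (and, on 6.x, the type) is given in the URL so that each action line can stay empty.

```python
import json
from typing import Any, Iterable, Mapping


def build_bulk_body(docs: Iterable[Mapping[str, Any]]) -> str:
    """Build an NDJSON body of index actions for the _bulk endpoint.

    Each document becomes two lines: an action line and the source line.
    The action line is empty because the index is assumed to come from
    the request URL; no _id is given, so ids are auto-generated.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(build_bulk_body([{"title": "a"}, {"title": "b"}]), end="")
```

The body would be sent with Content-Type: application/x-ndjson. A suitable batch size depends on document size and cluster capacity, so it is best found by measurement.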
2 Use multiple workers/threads to send data to Elasticsearch
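Use multiple threads, but keep the concurrency low enough that Elasticsearch can keep up; otherwise it starts rejecting requests with errors.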
Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When it happens, you should pause indexing a bit before trying again, ideally with randomized exponential backoff.
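The retry advice above can be sketched as follows. This is a minimal example of randomized exponential backoff, not a client implementation: `send` is a hypothetical callable that performs one indexing request and returns the HTTP status code, and the delay parameters are illustrative assumptions.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call send() and retry while it returns 429, up to max_retries times.

    Returns the last status code seen. sleep and rng are injectable so the
    behaviour can be tested without real waiting.
    """
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            return status
        # The cap doubles on every attempt; the actual pause is a random
        # fraction of it so that parallel workers do not retry in lockstep.
        cap = min(max_delay, base_delay * (2 ** attempt))
        sleep(rng() * cap)
        status = send()
    return status
```

A caller that still sees 429 after the last retry should treat the batch as failed and slow down the overall indexing rate.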
3 Increase the refresh interval
The default index.refresh_interval is 1s, which forces Elasticsearch to create a new segment every second. Increasing this value (to, say, 30s) will allow larger segments to flush and decrease future merge pressure.
4 Disable refresh and replicas for initial loads
If you need to load a large amount of data at once, you should disable refresh by setting index.refresh_interval to -1 and set index.number_of_replicas to 0. This will temporarily put your index at risk since the loss of any shard will cause data loss, but at the same time indexing will be faster since documents will be indexed only once. Once the initial loading is finished, you can set index.refresh_interval and index.number_of_replicas back to their original values.
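The two steps described above (relax the settings, load, then restore) amount to two update-settings request bodies. The sketch below only builds those JSON bodies; the function name and the default "original" values are assumptions, so substitute whatever the index actually used before the load.

```python
import json
from typing import Tuple


def initial_load_settings(
    original_refresh_interval: str = "1s",
    original_replicas: int = 1,
) -> Tuple[str, str]:
    """Return (before_load, after_load) bodies for the _settings endpoint."""
    # Before the load: no periodic refresh and no replica copies.
    before = json.dumps(
        {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    )
    # After the load: put back the values the index had originally.
    after = json.dumps(
        {
            "index": {
                "refresh_interval": original_refresh_interval,
                "number_of_replicas": original_replicas,
            }
        }
    )
    return before, after


if __name__ == "__main__":
    before, after = initial_load_settings("30s", 1)
    print(before)
    print(after)
```

Each body would be sent with PUT to the index's _settings endpoint, the first before the load starts and the second once it has finished.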
5 Disable swapping
You should make sure that the operating system is not swapping out the Java process by disabling swapping.
# swapoff -a
6 Give memory to the filesystem cache
The filesystem cache will be used in order to buffer I/O operations. You should make sure to give at least half the memory of the machine running Elasticsearch to the filesystem cache.
7 Use auto-generated ids
When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.
8 Use faster hardware
If indexing is I/O bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives. In particular SSD drives are known to perform better than spinning disks.
9 Indexing buffer size
If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most 512 MB indexing buffer per shard doing heavy indexing (beyond that indexing performance does not typically improve).
indices.memory.index_buffer_size
Accepts either a percentage or a byte size value. It defaults to 10%, meaning that 10% of the total heap allocated to a node will be used as the indexing buffer size shared across all shards.
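To get a feel for these numbers, the sketch below does the arithmetic for a percentage-based buffer. It is a rough estimate only: the buffer is shared across shards, so an even split among the heavily indexing shards is an assumption, and both function names are hypothetical.

```python
def per_shard_buffer_mb(heap_mb: float, percent: float, busy_shards: int) -> float:
    """Estimate the indexing buffer per heavily indexing shard, in MB.

    Assumes the shared buffer (heap * percent) is split evenly, which is
    only an approximation of how the node hands it out.
    """
    return heap_mb * percent / 100 / busy_shards


def percent_for_target(heap_mb: float, busy_shards: int, target_mb: float = 512) -> float:
    """Percentage of heap needed to give each busy shard target_mb."""
    return busy_shards * target_mb / heap_mb * 100


if __name__ == "__main__":
    # Illustrative node: 16 GB heap, 4 shards doing heavy indexing.
    print(per_shard_buffer_mb(16 * 1024, 10, 4))
    print(percent_for_target(16 * 1024, 4))
```

With the default 10% on this illustrative node, each busy shard would get roughly 410 MB under the even-split assumption, and about 12.5% would be needed to reach the 512 MB ceiling mentioned above.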
Changing settings
1 Dynamic index settings
$ curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/testdoc/_settings' -d '{
"index": {
"refresh_interval":"-1",
"number_of_replicas":0,
"translog.durability":"async"
}
}'
These settings can be changed repeatedly; setting a value to null restores its default.
2 Cluster settings
$ vi elasticsearch.yml
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024
After editing, copy the file to every node and restart.
Note that the following settings are deprecated:
The bulk thread pool has been renamed to the write thread pool. This change was made to reflect the fact that this thread pool is used to execute all write operations: single-document index/delete/update requests, as well as bulk requests.
thread_pool.index.type
thread_pool.index.size
thread_pool.index.queue_size
thread_pool.bulk.type
thread_pool.bulk.size
thread_pool.bulk.queue_size
In addition, the thread pool settings above cannot be changed through the API (that is, http://localhost:9200/_cluster/settings).
The prefix on all thread pool settings has been changed from threadpool to thread_pool.
Thread pool settings are now node-level settings. As such, it is not possible to update thread pool settings via the cluster settings API.
References:
https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-disk-usage.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-translog.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-merge.html