Elasticsearch queries: pagination performance tests on large data sets

1. Test environment

python 3.7

elasticsearch 6.8

elasticsearch-dsl 7

Install elasticsearch-dsl:

pip install elasticsearch-dsl

Check connectivity to elasticsearch:

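from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(hosts=['http://127.0.0.1:9200'])

# Prefix-match on the name field and return only the id field of each hit
s = Search(using=client, index="my_store_index").query("match_phrase_prefix", name="us")
s = s.source(['id'])
# Per-request basic-auth credentials
s = s.params(http_auth=["test", "test"])

response = s.execute()
for hit in response:
    # Only id is kept in _source, so print hit.id (hit.name is not returned)
    print(hit.meta.score, hit.id)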

11.642133 945d0426-033e-4a8a-86db-b776c6c9a082

11.642133 3c1aead4-aa6f-4256-a126-f29f84c9ac89

11.642133 77782add-ab58-4eb6-85af-bcbe79be9623

11.642133 75a02b9a-be31-4a78-a3d9-9af72f98cbf9

11.642133 d5aacf16-61fc-4f0c-b05d-3d57c8ab6236

11.642133 30912e1d-4662-4f24-bd5b-5a997e44c290

11.642133 95c28501-66a6-4786-917b-0f1e38707648

11.642133 605f4e11-08c8-4d60-b803-7925cf325cea

11.642133 5dd93a29-e75c-44e3-9f26-bd90e588bc1d

11.642133 84e97af5-4e99-466f-bd82-10cd2b79aa18
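
For reference, the Search object above should serialize to a small JSON request body. The sketch below builds that body by hand with plain dictionaries, so it can be inspected without a cluster. The helper name `build_prefix_query_body` is hypothetical and is not part of elasticsearch-dsl.

```python
import json


def build_prefix_query_body(field, prefix, source_fields):
    """Build a search body with a match_phrase_prefix query and _source filtering.

    Hypothetical helper for illustration: it mirrors what the DSL call chain
    above is expected to send, but it is not elasticsearch-dsl code.
    """
    return {
        "query": {"match_phrase_prefix": {field: prefix}},
        "_source": list(source_fields),
    }


body = build_prefix_query_body("name", "us", ["id"])
print(json.dumps(body, indent=2))
```

With elasticsearch-dsl itself, `s.to_dict()` returns the body the library will send, which is a quick way to check a query before running it.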

2. Performance of returning a large result set in one request with from + size

The code below uses from + size to return 100,000 records in a single request, which took 17279 ms:

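from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

def from_size_query(client):
    s = Search(using=client, index="my_store_index")
    s = s.params(http_auth=["test", "test"], request_timeout=50)
    # Match every document whose name does not start with "us"
    q = Q('bool', must_not=[Q('match_phrase_prefix', name='us')])
    s = s.query(q)
    s = s.source(['id'])
    # from=0, size=100000; this exceeds the default index.max_result_window
    # of 10000, so the index setting must have been raised for this to run
    s = s[0:100000]
    response = s.execute()
    print(f'hit total {response.hits.total}')
    # took is the server-side search time reported by elasticsearch, in ms
    print(f'request time {response.took}ms')

client = Elasticsearch(hosts=['http://127.0.0.1:9200'])
from_size_query(client)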

hit total 485070

request time 17279ms
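
A from + size request of this depth only runs if the index allows it: elasticsearch rejects any request where from + size exceeds index.max_result_window, which defaults to 10000. The sketch below imitates that check on the client side and shows the settings body that raises the limit. The function names are hypothetical, and the check is an imitation of the server rule, not library code.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # elasticsearch default for index.max_result_window


def from_size_body(offset, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return a from/size search body, or raise if it exceeds the window."""
    if offset + size > max_result_window:
        raise ValueError(
            f"from + size = {offset + size} exceeds max_result_window "
            f"= {max_result_window}"
        )
    return {"from": offset, "size": size, "_source": ["id"]}


def raise_window_settings(new_window):
    """Settings body for PUT <index>/_settings that raises the window."""
    return {"index": {"max_result_window": new_window}}


try:
    from_size_body(0, 100000)
except ValueError as err:
    print(err)

print(from_size_body(0, 100000, max_result_window=100000))
print(raise_window_settings(100000))
```

Raising the window makes the request possible but not cheap: each shard still has to collect from + size hits before the results are merged, which is consistent with the 17279 ms measured above.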

3. Performance of returning a large result set page by page with search after

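The code below uses search_after to return 100,000 records over multiple requests. The results show that once each page reaches 5000 records, the total time barely changes. Considering the CPU and memory cost of larger pages, a size of 3000 or 4000 looks reasonable for this test data.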

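from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

def search_after_query(client, result):
    s = Search(using=client, index="my_store_index")
    s = s.params(http_auth=["test", "test"], request_timeout=50)
    # Match every document whose name does not start with "us"
    q = Q('bool', must_not=[Q('match_phrase_prefix', name='us')])
    s = s.query(q)
    # Resume from a previous cursor if the caller supplied one
    if result['after_value']:
        s = s.extra(search_after=[result['after_value']])
    s = s.source(['id'])
    s = s[:result['size']]
    # search_after needs a deterministic sort; id is assumed to be unique
    s = s.sort('id')
    response = s.execute()
    fetch = len(response.hits)
    # Accumulate the server-side time (ms) and count down the requests left
    result['total'] += response.took
    result['times'] -= 1
    # Keep paging while pages come back full and requests remain
    while fetch == result['size'] and result['times'] > 0:
        # The sort value of the last hit is the cursor for the next page
        sort_val = response.hits.hits[-1].sort[-1]
        s = s.extra(search_after=[sort_val])
        response = s.execute()
        fetch = len(response.hits)
        result['total'] += response.took
        result['times'] -= 1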

client = Elasticsearch(hosts=['http://127.0.0.1:9200'])

# Each (times, size) pair fetches 100000 records in total
for times, size in [(100, 1000), (50, 2000), (25, 4000), (20, 5000),
                    (10, 10000), (5, 20000), (2, 50000)]:
    result = {"total": 0, "times": times, "size": size, "after_value": None}
    search_after_query(client, result)
    print(f'size {result["size"]}  request  {times} times total {result["total"]}ms ')

size 1000  request  100 times total 14111ms

size 2000  request  50 times total 11987ms

size 4000  request  25 times total 11167ms

size 5000  request  20 times total 10589ms

size 10000  request  10 times total 9930ms

size 20000  request  5 times total 9978ms

size 50000  request  2 times total 9946ms
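
The cursor logic of search_after_query can be tried without a cluster. The sketch below is a pure Python simulation: `fake_search` is a hypothetical stand-in for one elasticsearch request over an in-memory list of ids, and the loop feeds the last sort value of each page into the next request, in the same way as the paging loop above.

```python
import bisect


def fake_search(sorted_ids, size, search_after=None):
    """Hypothetical stand-in for one search request sorted by id.

    Returns up to `size` ids strictly greater than `search_after`.
    """
    start = 0 if search_after is None else bisect.bisect_right(sorted_ids, search_after)
    return sorted_ids[start:start + size]


def page_with_search_after(sorted_ids, size, max_requests):
    """Collect ids page by page, using the last id of each page as the cursor."""
    collected = []
    requests = 0
    cursor = None
    while requests < max_requests:
        page = fake_search(sorted_ids, size, cursor)
        requests += 1
        collected.extend(page)
        if len(page) < size:
            break  # a short page means the result set is exhausted
        cursor = page[-1]
    return collected, requests


# Zero-padded ids so that string order matches numeric order
ids = [f"id-{n:05d}" for n in range(2500)]
fetched, requests = page_with_search_after(ids, size=1000, max_requests=100)
print(len(fetched), requests)
```

Because the cursor is a sort value, the sort key has to be unique per document. With duplicate values, a page boundary that falls inside a run of equal values would skip the remaining documents of that run.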

4. Performance of returning a large result set page by page with scroll

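The code below uses scroll to fetch 100,000 records over multiple requests. Across the different sizes the total time changes little, and at every size it is higher than the search_after total from the previous section; the elasticsearch documentation likewise no longer recommends scroll for deep pagination.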

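from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

def search_scroll_query(client, result):
    s = Search(using=client, index="my_store_index")
    # scroll='1m' keeps the search context alive for one minute per request
    s = s.params(request_timeout=50, scroll='1m')
    # Match every document whose name does not start with "us"
    q = Q('bool', must_not=[Q('match_phrase_prefix', name='us')])
    s = s.query(q)
    s = s.source(['id'])
    s = s[:result['size']]
    response = s.execute()
    fetch = len(response.hits)
    result['total'] += response.took
    result['times'] -= 1
    scroll_id = response._scroll_id
    # Keep scrolling while pages come back full and requests remain
    while fetch == result['size'] and result['times'] > 0:
        # Later pages come from the scroll API, which returns plain dicts
        response = client.scroll(scroll_id=scroll_id, scroll='1m', request_timeout=50)
        scroll_id = response['_scroll_id']
        fetch = len(response['hits']['hits'])
        result['total'] += response['took']
        result['times'] -= 1
    # Release the server-side scroll context instead of waiting for it to expire
    client.clear_scroll(scroll_id=scroll_id)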

# Credentials are set on the client because client.scroll is called directly
client = Elasticsearch(hosts=['http://127.0.0.1:9200'], http_auth=["test", "test"])

# Each (times, size) pair fetches 100000 records in total
for times, size in [(100, 1000), (50, 2000), (25, 4000), (20, 5000),
                    (10, 10000), (5, 20000), (2, 50000)]:
    result = {"total": 0, "times": times, "size": size}
    search_scroll_query(client, result)
    print(f'size {result["size"]}  request  {times} times total {result["total"]}ms ')

size 1000  request  100 times total 16573ms

size 2000  request  50 times total 17678ms

size 4000  request  25 times total 16719ms

size 5000  request  20 times total 16031ms

size 10000  request  10 times total 16008ms

size 20000  request  5 times total 16074ms

size 50000  request  2 times total 14390ms
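
The two result tables can be compared directly. The sketch below copies the measured totals from the search_after and scroll runs above and prints, for each page size, the average time per request and how much slower scroll was. The variable and function names are illustrative only.

```python
# (size, requests, total_ms) copied from the measured output above
search_after_runs = [(1000, 100, 14111), (2000, 50, 11987), (4000, 25, 11167),
                     (5000, 20, 10589), (10000, 10, 9930), (20000, 5, 9978),
                     (50000, 2, 9946)]
scroll_runs = [(1000, 100, 16573), (2000, 50, 17678), (4000, 25, 16719),
               (5000, 20, 16031), (10000, 10, 16008), (20000, 5, 16074),
               (50000, 2, 14390)]


def compare(after_runs, scroll_runs):
    """Pair the runs by page size and compute per-request averages and the ratio."""
    rows = []
    for (size, requests, after_ms), (_, _, scroll_ms) in zip(after_runs, scroll_runs):
        rows.append({
            "size": size,
            "after_ms_per_request": after_ms / requests,
            "scroll_ms_per_request": scroll_ms / requests,
            "scroll_over_after": scroll_ms / after_ms,
        })
    return rows


rows = compare(search_after_runs, scroll_runs)
for row in rows:
    print(f'size {row["size"]:>5}  '
          f'search_after {row["after_ms_per_request"]:8.1f} ms/req  '
          f'scroll {row["scroll_ms_per_request"]:8.1f} ms/req  '
          f'ratio {row["scroll_over_after"]:.2f}')
```

In these runs scroll was slower than search_after at every page size.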
