Reworking research-report search with Elasticsearch: a practical walkthrough

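Background:

  1. System overview: analysts read each research report and enter its category, abstract, and related metadata by hand; the system then uses that metadata to look up the report's URI.

  2. Current implementation: the old system stores the metadata in MSSQL and splits the different kinds of keyword into separate columns for searching.

  3. Problems:

    - Queries are cumbersome and rigid: to find reports from a given institution whose title contains "周报" (weekly report), the user has to tick the matching fields and then type a condition for each one.

    - Queries are slow: with close to ten million rows, the response time is 4-5 s.

  4. Improvement: move the search to Elasticsearch (ES) and support fuzzy matching on several keywords at once. The fields hold short text rather than long documents, so results are not ranked by _score.

    - Example: typing "国泰君安周报" (Guotai Junan weekly report) returns all weekly reports related to Guotai Junan.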

1. Create the index

Field meanings (JSON has no # comment syntax, so they are listed here rather than inline): title = combined report title, author = author, institution = issuing institution, industry = industry, grade = rating, doc_type = report category, time = publication time, doc_uri = file address, doc_size = file size, market = market code. The original command left the value of number_of_replicas blank, so that setting is omitted below and the index uses the Elasticsearch default.

curl -X PUT 'localhost:9200/src_test_1' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc_test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "author": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "institution": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "industry": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "grade": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "doc_type": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "time": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        },
        "doc_uri": {
          "type": "text",
          "index": false
        },
        "doc_size": {
          "type": "integer",
          "index": false
        },
        "market": {
          "type": "byte"
        }
      }
    }
  }
}'
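The six searchable fields in the mapping share identical settings, so the request body can also be generated rather than typed out. The following is a minimal sketch under stated assumptions: the helper names ik_text_field and build_index_body are hypothetical, and number_of_replicas is left out because the source does not give a value for it.

```python
import json

# Field names come from the mapping above; the helper names are hypothetical.
SEARCHABLE_FIELDS = ["title", "author", "institution", "industry", "grade", "doc_type"]


def ik_text_field():
    """A text field analyzed with ik_max_word at both index and search time."""
    return {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
    }


def build_index_body(shards=1):
    """Build the index-creation body: six analyzed text fields plus four other fields."""
    properties = {name: ik_text_field() for name in SEARCHABLE_FIELDS}
    properties["time"] = {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
    }
    # Stored but not searchable
    properties["doc_uri"] = {"type": "text", "index": False}
    properties["doc_size"] = {"type": "integer", "index": False}
    properties["market"] = {"type": "byte"}
    return {
        "settings": {"number_of_shards": shards},
        "mappings": {"doc_test": {"properties": properties}},
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted after `-d` in the curl command
    print(json.dumps(build_index_body(), indent=2))
```

Generating the body keeps the analyzer settings in one place, so switching analyzers later means changing a single function.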

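※ Note: Chinese text fields that need fuzzy matching should generally be mapped as text (keyword fields are not analyzed and are meant for exact lookups) and should use the ik_max_word analyzer.

※ Why ik_max_word: in this scenario a search for "国泰" should match the word "国泰" itself. With the default analyzer the query would be split into the single characters "国" and "泰", and each would be searched separately.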
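The difference between single-character and word-level tokens can be shown with a toy model. This is only an illustration, not the real analyzers: TOY_DICTIONARY is a hypothetical word list standing in for the IK dictionary, and both tokenizer functions are simplified stand-ins written for this example.

```python
# Toy illustration only: neither function is a real Elasticsearch analyzer.
# TOY_DICTIONARY is a hypothetical word list standing in for the IK dictionary.
TOY_DICTIONARY = {"国泰", "君安", "国泰君安", "泰国", "旅游", "周报"}


def single_char_tokens(text):
    """Split text into single characters, ignoring whitespace."""
    return {ch for ch in text if not ch.isspace()}


def dictionary_tokens(text, dictionary=TOY_DICTIONARY):
    """Emit every dictionary word found anywhere in the text (overlaps allowed)."""
    found = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.add(text[start:end])
    return found


def matches(query, document, tokenize):
    """True when every query token also appears among the document tokens."""
    query_tokens = tokenize(query)
    return bool(query_tokens) and query_tokens <= tokenize(document)


if __name__ == "__main__":
    for title in ("国泰君安周报", "泰国旅游周报"):
        print(title,
              matches("国泰", title, single_char_tokens),
              matches("国泰", title, dictionary_tokens))
```

In this toy model the single-character tokens let "国泰" match a title about "泰国" (Thailand), because both characters are present, while the word-level tokens match only the title that actually contains "国泰".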

2. Import the data (CSV, in batches)

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

BATCH_SIZE = 100000  # documents per bulk request
MARKET_CODES = {'CN': 0, 'HK': 1, 'World': 2}  # any other exchange type becomes 99

data_will_insert = []

# Read the CSV with pandas; if the text comes out garbled, try encoding="ISO-8859-1"
src_data = pd.read_csv('ResearchReportEx.csv')

for _, row in src_data.iterrows():
    # Work out the market code
    market = MARKET_CODES.get(row['ExchangeType'], 99)
    # pd.isna reliably detects empty cells; an identity test against NaN does not
    if pd.isna(row['Kind2Name']):
        doc_type = row['KindName']
    else:
        doc_type = '%s|%s' % (row['KindName'], row['Kind2Name'])
    data_will_insert.append({
        '_index': 'src_test_1',
        '_type': 'doc_test',
        '_source': {
            'title': row['Title'],
            'author': row['AuthorName'],
            'time': row['CreateTime'] + ':00',  # append seconds to match the date format
            'institution': row['InstituteNameCN'],
            'doc_type': doc_type,
            'industry': '' if pd.isna(row['IndustryName']) else row['IndustryName'],
            'grade': '' if pd.isna(row['GradeName']) else row['GradeName'],
            'doc_uri': row['FileURL'],
            'doc_size': row['Size'],
            'market': market,
        },
    })
    # Send a bulk request once a full batch of 100000 documents has accumulated
    if len(data_will_insert) >= BATCH_SIZE:
        success, _ = bulk(es, data_will_insert, index='src_test_1', raise_on_error=True)
        print('Performed %d actions' % success)
        data_will_insert = []

# Insert whatever is still left in the list
if data_will_insert:
    success, _ = bulk(es, data_will_insert, index='src_test_1', raise_on_error=True)
    print('Performed %d actions' % success)
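The row conversion and the batching can also be separated from the Elasticsearch calls, which makes them easy to check without a running cluster. A minimal sketch using only the standard library follows; the function names is_missing, text_or_empty, row_to_action, and batched are hypothetical, and the column names are the ones used in the import script.

```python
import math
from itertools import islice

MARKET_CODES = {"CN": 0, "HK": 1, "World": 2}  # same codes as the import script
UNKNOWN_MARKET = 99


def is_missing(value):
    """True for None and for float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def text_or_empty(value):
    """Replace a missing cell with an empty string."""
    return "" if is_missing(value) else value


def row_to_action(row, index="src_test_1", doc_type="doc_test"):
    """Convert one CSV row (a dict or a pandas Series) into a bulk action."""
    kind = row["KindName"]
    if not is_missing(row["Kind2Name"]):
        kind = "%s|%s" % (kind, row["Kind2Name"])
    return {
        "_index": index,
        "_type": doc_type,
        "_source": {
            "title": row["Title"],
            "author": row["AuthorName"],
            "time": row["CreateTime"] + ":00",
            "institution": row["InstituteNameCN"],
            "doc_type": kind,
            "industry": text_or_empty(row["IndustryName"]),
            "grade": text_or_empty(row["GradeName"]),
            "doc_uri": row["FileURL"],
            "doc_size": row["Size"],
            "market": MARKET_CODES.get(row["ExchangeType"], UNKNOWN_MARKET),
        },
    }


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each list yielded by batched(...) can then be passed to bulk(es, batch) exactly as in the script, so the final partial batch needs no special handling.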

3. Query

import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

# ES connection
es = Elasticsearch()


# Decorator that reports how long the wrapped function takes
def cal_run_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        end_time = time.time()
        print(str(func) + '---run time--- %s' % str(end_time - start_time))
        return res
    return wrapper


@cal_run_time
def query_in_es():
    body = {
        "query": {
            "bool": {
                "must": [
                    {
                        # Equivalent to requiring every keyword in "query" to appear
                        # somewhere in the text formed by joining all listed fields
                        "multi_match": {
                            "query": "国泰 报告",
                            "type": "cross_fields",  # match across fields
                            # Search these 6 fields (the original was missing the comma after "grade")
                            "fields": ["title", "institution", "grade",
                                       "doc_type", "author", "industry"],
                            "operator": "and"
                        }
                    },
                    {
                        "range": {
                            "time": {
                                "gte": '2018-02-01'  # default lower bound on publication time
                            }
                        }
                    }
                ],
            }
        }
    }
    # Run the query described by body and stream every hit
    scanResp = scan(es, body, scroll="10m", index="src_test_1", doc_type="doc_test", timeout="10m")
    row_num = 0
    for resp in scanResp:
        print(resp['_source'])
        row_num += 1
    print(row_num)


query_in_es()
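In practice the keywords and the start date come from the user rather than being hard-coded. A minimal sketch of building the same request body from input follows; the function name build_query and its parameters are hypothetical, while the query structure mirrors the one used in query_in_es.

```python
# The six searchable fields from the mapping
SEARCH_FIELDS = ["title", "institution", "grade", "doc_type", "author", "industry"]


def build_query(keywords, since="2018-02-01", fields=SEARCH_FIELDS):
    """Build a cross_fields query: every keyword must appear in some listed field."""
    # Collapse runs of whitespace so that "  国泰   报告 " becomes "国泰 报告"
    normalized = " ".join(keywords.split())
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": normalized,
                            "type": "cross_fields",
                            "fields": list(fields),
                            "operator": "and",
                        }
                    },
                    # Only reports published on or after `since`
                    {"range": {"time": {"gte": since}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(build_query("国泰君安周报"))
```

The returned dictionary can be passed to scan in place of the literal body, for example scan(es, build_query("国泰 报告"), index="src_test_1").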

※ Result: the search is quite fast; a multi-keyword query takes only a few tenths of a second.
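A single timed run can be skewed by caching or a cold connection, so repeating the call gives a steadier number. A minimal sketch follows; the helper name time_calls is hypothetical, and it accepts any zero-argument callable.

```python
import statistics
import time


def time_calls(func, repeats=5):
    """Call func `repeats` times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the query function to measure
    print(time_calls(lambda: sum(range(100000))))
```

time.perf_counter is used here instead of time.time because it is a monotonic, high-resolution clock meant for measuring short intervals.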
