A Q&A System Based on Spring Boot, Part 1: Hello World with elasticsearch 7.2

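It has been a while since I last wrote code. Recently I decided to build a small system on spring boot + vue + elasticsearch + NLP (semantic relevance) for practice. Later on it could serve as a prototype for a chatbot, a customer-service system, and so on.

So here is the first post: a hello world introduction to elasticsearch.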

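I. Installing es

Goal: install a single-node es locally to play with.

1. Download es

At the time of writing, the latest download on the official site is: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.2.0-linux-x86_64.tar.gz

After downloading, extract it into a directory, for example your development directory: your_path/elasticsearch

2. Edit the configuration file

a. Configuration file path: config/elasticsearch.yml

b. Change the following entries to your own values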

# Use a descriptive name for your cluster:
# (cluster name)
cluster.name: my-ces
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
# (node name)
node.name: ces-node-1
# Path to directory where to store the data (separate multiple locations by comma):
# (where es stores its data; "~" is not expanded in this file, so ${HOME} is used instead)
path.data: ${HOME}/es/data
#
# Path to log files:
# (es runtime logs)
path.logs: ${HOME}/es/logs
# Set the bind address to a specific IP (IPv4 or IPv6):
# (bind to the local loopback address)
network.host: _local_
#
# Set a custom port for HTTP:
# (listening port)
http.port: 9200

3. Run and test

a. Run bin/elasticsearch

b. Open a browser and go to localhost:9200. If it shows something like the following, the node is up.

{
  "name" : "ces-node-1",   // the node name set in the config
  "cluster_name" : "my-ces",   // the cluster name set in the config
  "cluster_uuid" : "6XOfx0eQReG3iMKek9hdTA",
  "version" : {
    "number" : "7.2.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "508c38a",
    "build_date" : "2019-06-20T15:54:18.811730Z",
    "build_snapshot" : false,
    "lucene_version" : "8.0.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
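The same check can be scripted instead of done in a browser. Below is a minimal sketch, assuming the node listens on localhost:9200 as configured; the helper names fetch_root_info and check_node are made up for this illustration, and only the Python standard library is used.

```python
import json
from urllib.request import urlopen


def fetch_root_info(base_url="http://localhost:9200"):
    """GET the root endpoint of the node and decode the JSON body."""
    with urlopen(base_url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_node(info, expected_node, expected_cluster):
    """Compare the root response with the names set in elasticsearch.yml.

    Returns a list of human-readable mismatches (empty if all is well).
    """
    problems = []
    if info.get("name") != expected_node:
        problems.append("node name is %r, expected %r"
                        % (info.get("name"), expected_node))
    if info.get("cluster_name") != expected_cluster:
        problems.append("cluster name is %r, expected %r"
                        % (info.get("cluster_name"), expected_cluster))
    return problems
```

With the node running, check_node(fetch_root_info(), "ces-node-1", "my-ces") should come back as an empty list if the config was picked up.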

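4. Install and test the ik plugin

What is ik

ik is a word-segmentation (analysis) plugin. To search Chinese text with es, you need to install it.

Installation

Follow the instructions at https://github.com/medcl/elasticsearch-analysis-ik to install and test it.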
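To see what the plugin does to a piece of text, the _analyze endpoint of es can be called with one of the ik analyzers. The following is a rough sketch using only the standard library; it assumes the node is on localhost:9200 and the plugin is installed, and the function names are invented for this example.

```python
import json
from urllib.request import Request, urlopen


def build_analyze_body(text, analyzer="ik_max_word"):
    """Request body for POST /_analyze."""
    return {"analyzer": analyzer, "text": text}


def extract_tokens(response):
    """Pull the token strings out of an _analyze response."""
    return [item["token"] for item in response.get("tokens", [])]


def analyze(text, analyzer="ik_max_word", base_url="http://localhost:9200"):
    """Send the text to es and return the tokens the analyzer produced."""
    payload = json.dumps(build_analyze_body(text, analyzer)).encode("utf-8")
    request = Request(
        base_url + "/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=5) as resp:
        return extract_tokens(json.loads(resp.read().decode("utf-8")))
```

Calling analyze with the same sentence once for "ik_max_word" and once for "ik_smart" makes the difference between the two analyzers easy to compare.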

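II. Creating an index

An index is roughly analogous to a table in a MySQL database: before you can store data in es, you first have to create an index in es.

Many online tutorials show how to create an index through the raw HTTP API, which often leaves readers who are unfamiliar with es or HTTP confused. Here I will use the es Python package instead.

Install the Python elasticsearch package

pip install "elasticsearch>=7,<8"

(The version range keeps the client on the 7.x line to match the 7.2 server. A plain pip install elasticsearch may pull a newer major version, which is not expected to work against a 7.2 node.)

Define mapping.json

This file defines what the index looks like: which fields are searchable and which are not. Suppose we have question-and-answer data like this:

question: 世界上最高的山峰是什么 (What is the highest mountain in the world?)

answer: 当然是珠峰了 (Mount Everest, of course)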

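We want to search this data with es and turn it into a question-answering bot, so we define the index structure below. The question field is indexed and analyzed with the ik analyzers (ik_max_word at index time, ik_smart at search time); the answer field is stored but not indexed. The settings block can be ignored for now. The file is read with json.load later, and JSON does not allow // comments, so these notes live here rather than inside the file.

{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "question": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
                "index": true,
                "boost": 8
            },
            "answer": {
                "type": "text",
                "index": false
            }
        }
    }
}

Save it as es_index_mapping.json.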
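Because the mapping file is later read with json.load, which does not accept // comments or trailing commas, it can help to validate the file before using it. Here is a small sketch; parse_mapping is a hypothetical helper, not part of any library.

```python
import json


def parse_mapping(text):
    """Parse mapping JSON text and return the sorted field names.

    Raises ValueError with the offending line number if the text is not
    valid JSON (for example, if a // comment was left in the file).
    """
    try:
        mapping = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError("mapping is not valid JSON (line %d): %s"
                         % (err.lineno, err.msg))
    return sorted(mapping["mappings"]["properties"])
```

For the file described here, passing its contents to parse_mapping should list the answer and question fields.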

Create the index

With the Python es client this is straightforward. Here is the code:

from elasticsearch import Elasticsearch
from elasticsearch import helpers
from common.conf import ServiceConfig  # my own config module, not part of the elasticsearch package
import os.path as path
import json

class EsDriver:
    def __init__(self):
        self.service_conf = ServiceConfig()
        # hosts is effectively: [{"host": "localhost", "port": 9200}]
        self.es = Elasticsearch(hosts=self.service_conf.get_es_hosts())

    def create_index(self, index_name):
        dir_root = path.normpath("%s/.." % path.dirname(path.abspath(__file__)))
        with open(dir_root + "/data/es_index_mapping.json", 'r', encoding='utf-8') as json_file:
            index_mapping_json = json.load(json_file)
        # Call indices.create with an index name of your choosing; that creates the index
        return self.es.indices.create(index=index_name, body=index_mapping_json)
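Calling indices.create a second time with the same name makes es report that the index already exists. One way around that, sketched here under the assumption of a 7.x Python client, is to check indices.exists first. The function name ensure_index is my own for this illustration.

```python
def ensure_index(es, index_name, mapping):
    """Create index_name with the given mapping unless it already exists.

    `es` is expected to be an elasticsearch.Elasticsearch client (7.x).
    Returns True if the index was created, False if it was already there.
    """
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name, body=mapping)
    return True
```

This keeps a setup script safe to rerun; it does not update the mapping of an index that already exists.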

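III. Bulk-importing data

With the index created, the next step is to load data into it. The Python es package provides a bulk import helper, so a few lines of code are enough.

Suppose you have a file qa.processed.txt in this format:

query\t['answer1','answer2'], for example:

你开心吗\t["很开心"]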
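The answer column in this format looks like a Python list literal. If you want the individual answers rather than the raw string, one option (an assumption on my part, not something the format requires) is to parse it with ast.literal_eval. A sketch, with parse_qa_line as a made-up helper name:

```python
import ast


def parse_qa_line(line):
    """Split one 'query<TAB>[answers]' line into (query, list_of_answers).

    Returns None for lines that do not match the expected format.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return None
    query, raw_answers = parts
    try:
        # literal_eval only evaluates literals, so it is safe on file input
        answers = ast.literal_eval(raw_answers)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(answers, list):
        return None
    return query.strip(), [str(answer) for answer in answers]
```

The import code in this post stores the answer column as-is; parsing it is only worth it if you later want to pick among several answers.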

class EsDriver:
    ...
    def bulk_insert(self, index_name, bulk_size=500):
        doc_list = []
        with open('/data/qa.processed.txt', 'r', encoding='utf-8') as qa_file:
            for line in qa_file:
                ls = line.strip().split('\t')
                if len(ls) != 2:
                    continue  # skip malformed lines
                doc_list.append({
                    "_index": index_name,  # which index to insert into
                    "_type": "_doc",
                    "_source": {
                        "question": ls[0],  # query
                        "answer": ls[1]  # answer
                    }
                })
                if len(doc_list) >= bulk_size:
                    # Use the es helper's bulk method to insert the batch into the index
                    helpers.bulk(self.es, doc_list, stats_only=True)
                    doc_list.clear()
        # Flush whatever is left over after the last full batch
        if doc_list:
            helpers.bulk(self.es, doc_list, stats_only=True)
        print("bulk insert done")

Once the steps above have run, the data is loaded into es.
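helpers.bulk also accepts any iterable of actions and splits it into batches itself through its chunk_size argument, so the manual batching can be replaced by a generator. The following is a sketch of that variant, assuming the 7.x Python client; generate_actions and bulk_insert_streaming are names chosen for this example.

```python
def generate_actions(lines, index_name):
    """Yield one bulk action per well-formed 'question<TAB>answer' line."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        yield {
            "_index": index_name,
            "_source": {"question": fields[0], "answer": fields[1]},
        }


def bulk_insert_streaming(es, index_name, file_path, bulk_size=500):
    """Stream the file into es; helpers.bulk batches by chunk_size itself."""
    # Imported here so generate_actions stays usable without the package.
    from elasticsearch import helpers

    with open(file_path, "r", encoding="utf-8") as qa_file:
        return helpers.bulk(
            es,
            generate_actions(qa_file, index_name),
            chunk_size=bulk_size,
            stats_only=True,
        )
```

With stats_only=True the call returns a pair of counts (successful, failed), which is handy for a final sanity check. Only one batch is held in memory at a time.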

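IV. Searching

With the data imported, the next step is to search it, which the search function of the es package handles. Say you want to search for: 你好 (hello)

How would the code look?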

class EsDriver:
    ...
    def search(self, query, index_name):
        return self.es.search(index=index_name, body={
            "query": {
                "match": {
                    "question": query
                }
            }
        })

Then print the returned result to see what the response looks like.
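The matches sit under hits.hits in the response, each with a _score and the original document under _source. A small sketch of pulling out the best answers follows; top_hits is a hypothetical helper and SAMPLE_RESPONSE is made-up data in the 7.x response shape, not real output.

```python
# Made-up response in the 7.x shape, for illustration only (not real output).
SAMPLE_RESPONSE = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "qa",
                "_id": "1",
                "_score": 1.0,
                "_source": {"question": "你好", "answer": "你好呀"},
            },
        ],
    }
}


def top_hits(response, limit=3):
    """Return (score, question, answer) tuples for the first `limit` hits."""
    hits = response.get("hits", {}).get("hits", [])
    results = []
    for hit in hits[:limit]:
        source = hit.get("_source", {})
        results.append(
            (hit.get("_score"), source.get("question"), source.get("answer"))
        )
    return results
```

es returns hits sorted by score by default, so the first tuple is the best match for a simple bot reply.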

Appendix: a few common status operations

Index status

curl -X GET "localhost:9200/_cat/indices?v&pretty"

Cluster status

curl -X GET "localhost:9200/_cat/health?v&pretty"

Index mapping & settings

curl -X GET "localhost:9200/customer?pretty"
(customer is the index name)

Fetch a document in an index by id

curl -X GET "localhost:9200/customer/_doc/1?pretty"
(customer is the index name)
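The same four lookups can be made from Python. This is a sketch assuming the 7.x client, where cat.indices, cat.health, indices.get and get are the counterparts of the curl calls; collect_status is a name invented here.

```python
def collect_status(es, index_name, doc_id):
    """Gather the results of the four status calls in one dict.

    `es` is expected to be an elasticsearch.Elasticsearch client (7.x).
    """
    return {
        # GET /_cat/indices?v  (plain-text table of all indices)
        "indices": es.cat.indices(v=True),
        # GET /_cat/health?v  (plain-text cluster health line)
        "health": es.cat.health(v=True),
        # GET /<index>  (mapping and settings of one index)
        "index": es.indices.get(index=index_name),
        # GET /<index>/_doc/<id>  (one document by id)
        "doc": es.get(index=index_name, id=doc_id),
    }
```

For example, collect_status(es, "customer", 1) mirrors the commands above. Note that es.get raises an exception if the document does not exist.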

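Coming up in later posts: offline dataset processing (building features and loading the data into es) and setting up the Java project.

If you are interested, feel free to add me on WeChat (crazy042438) so we can work on this together.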