ES设置查询的相似度算法

bonelee 2024-10-01 11:14:14 原文

`similarity`

Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similaritysetting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.

Similarities are mostly useful for text fields, but can also apply to other field types.

Custom similarities can be configured by tuning the parameters of the built-in similarities. For more details about this expert options, see the similarity module.

The only similarities which can be used out of the box, without any further configuration are:

BM25: The Okapi BM25 algorithm. The algorithm used by default in Elasticsearch and Lucene. See Pluggable Similarity Algorithms for more information.
classic: The TF/IDF algorithm which used to be the default in Elasticsearch and Lucene. See Lucene’s Practical Scoring Function for more information.
boolean: A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.

The similarity can be set on the field level when a field is first created, as follows:

PUT my_index

{

  "mappings": {

    "my_type": {

      "properties": {

        "default_field": {

          "type": "text"

        },

        "classic_field": {

          "type": "text",

          "similarity": "classic"

        },

        "boolean_sim_field": {

          "type": "text",

          "similarity": "boolean"

COPY AS CURL VIEW IN CONSOLE

	The `default_field` uses the `BM25` similarity.
	The `classic_field` uses the `classic` similarity (ie TF/IDF).
	The `boolean_sim_field` uses the `boolean` similarity.

Default and Base Similarities

By default, Elasticsearch will use whatever similarity is configured as default. However, the similarity functions queryNorm() and coord() are not per-field. Consequently, for expert users wanting to change the implementation used for these two methods, while not changing the default, it is possible to configure a similarity with the name base. This similarity will then be used for the two methods.

You can change the default similarity for all fields in an index when it is created:

PUT /my_index

{

  "settings": {

    "index": {

      "similarity": {

        "default": {

          "type": "classic"

        }

      }

    }

  }

}

If you want to change the default similarity after creating the index you must close your index, send the follwing request and open it again afterwards:

PUT /my_index/_settings

{

  "settings": {

    "index": {

      "similarity": {

        "default": {

          "type": "classic"

        }

      }

    }

  }

}

from:https://www.elastic.co/guide/en/elasticsearch/reference/5.4/index-modules-similarity.html

ES设置查询的相似度算法的更多相关文章

文本相似度算法——空间向量模型的余弦算法和TF-IDF
1.信息检索中的重要发明TF-IDF TF-IDF是一种统计方法,TF-IDF的主要思想是,如果某个词或短语在一篇文章中出现的频率TF高,并且在其他文章中很少出现,则认为此词或者短语具有很好的类别区分 ...
文本相似度余弦值相似度算法 VS L氏编辑距离（动态规划）
设置n为字符串s的长度.("我是个小仙女") 设置m为字符串t的长度.("我不是个小仙女") 如果n等于0,返回m并退出.如果m等于0,返回n并退出.构造两个向 ...
Atitit.列表页面and条件查询的实现最佳实践(1)------设置查询条件and提交查询and返回json数据
Atitit.列表页面and条件查询的实现最佳实践(1)------设置查询条件and提交查询and返回json数据 1. 1. 配置条件字段@Conditional 1 1 2. 2. 配置条件字段 ...
ES : 软件工程学的复杂度理论及物理学解释
系统论里面总是有一些通用的专业术语比如复杂度.熵.焓,复杂度专门独立出来,成为复杂度理论文章摘抄于:<非线性动力学> 刘秉政编著 5.5 复杂性及其测度热力学的几个专业术语熵. ...
python结巴分词余弦相似度算法实现
过余弦相似度算法计算两个字符串之间的相关度,来对关键词进行归类.重写标题.文章伪原创等功能, 让你目瞪口呆.以下案例使用的母词文件均为txt文件,两种格式:一种内容是纯关键词的txt,每行一个关键词就 ...
ES 复合查询
ES在查询过程中比较多遇到符合查询,既需要多个字段过滤也需要特殊情况处理,本文简单介绍几种查询组合方便快捷查询ES. bool布尔查询有一个或者多个布尔子句组成 filter 只过滤符合条件的 ...
Spark/Scala实现推荐系统中的相似度算法（欧几里得距离、皮尔逊相关系数、余弦相似度：附实现代码）
在推荐系统中,协同过滤算法是应用较多的,具体又主要划分为基于用户和基于物品的协同过滤算法,核心点就是基于"一个人"或"一件物品",根据这个人或物品所具有的属性, ...
elasticsearch算法之词项相似度算法(一)
一.词项相似度 elasticsearch支持拼写纠错,其建议词的获取就需要进行词项相似度的计算:今天我们来通过不同的距离算法来学习一下词项相似度算法: 二.数据准备计算词项相似度,就需要首先将词项 ...
es的查询、排序查询、分页查询、布尔查询、查询结果过滤、高亮查询、聚合函数、python操作es
今日内容概要 es的查询 Elasticsearch之排序查询 Elasticsearch之分页查询 Elasticsearch之布尔查询 Elasticsearch之查询结果过滤 Elasticse ...

随机推荐

Linux 下配置，安装Hadoop
1.从官网上下载hadoop-2.4.1.tar.gz,我的版本为hadoop-2.4.1,可在http://pan.baidu.com/s/1cLAKCQ 下载. 2.解压hadoop-2.4.1. ...
Jmeter执行多条Mysql语句报错
花了很长时间找原因,Jmeter一直返回的是MySql语法错误,就写了两条很简单的删除语句,并且在MySql里可以正常执行包括换了jdbc驱动包,更改不同的Query Type等后来发现两条语句拆 ...
Java线程池原理与架构分析
/** * 一.线程池:提供了一个线程队列,队列中保存着所有等待状态的线程.避免了创建与销毁额外开销,提高了响应速度 * 二.线程池的体系结构 * java.util.concurrent.Execu ...
Maven 学习笔记（一）
定义 Maven 是基于项目对象模型(POM)的软件项目管理工具,它采用纯 java 编写,用于管理项目的构建,最早在 Jakata Turbine 项目中开始被使用.它包含了一个项目对象模型(Pro ...
ROS-导航功能-RVIZ
前言:slam使用激光雷达完成了地图构建,现在介绍一下自主导航.move_base用于实现最优路径规划,amcl用于实现机器人定位. 前提:已下载并编译了相关功能包集,如还未下载,可通过git下载:h ...
利用keytool颁发https证书方法
1.首先生成私有认证机构命令:keytool -genkeypair -alias CAname 补充:keytool -list 命令增加 -v 可以查看CA详细信息 2.然后生成私有证书命 ...
Vue模拟酷狗APP问题总结
一.NewSongs.vue中的38行 this.$http.get('/proxy/?json=true') 里面这个路径的获取二.router文件夹中的index.js 中的 comp ...
SQL where 条件顺序对性能的影响有哪些
经常有人问到oracle中的Where子句的条件书写顺序是否对SQL性能有影响,我的直觉是没有影响,因为如果这个顺序有影响,Oracle应该早就能够做到自动优化,但一直没有关于这方面的确凿证据.在网上 ...
POJ 1611 The Suspects【并查集】
解题思路:一共给出 n个人,m组,接下来是m组数据,每一组开头是该组共有的人 num,则接下来输入的num个数,这些数是一组的又因为最开始只有编号为0的人携带有病毒,且只有同一组的人会相互传染,问最 ...
POJ 1328 Radar Installation 【贪心区间选点】
解题思路:给出n个岛屿,n个岛屿的坐标分别为(a1,b1),(a2,b2)-----(an,bn),雷达的覆盖半径为r 求所有的岛屿都被覆盖所需要的最少的雷达数目. 首先将岛屿坐标进行处理,因为雷达的 ...