ES 相似度算法设置（续）

bonelee 2024-10-29 22:06:04 原文

Tuning BM25

One of the nice features of BM25 is that, unlike TF/IDF, it has two parameters that allow it to be tuned:

k1: This parameter controls how quickly an increase in term frequency results in term-frequency saturation. The default value is 1.2. Lower values result in quicker saturation, and higher values in slower saturation.
b: This parameter controls how much effect field-length normalization should have. A value of 0.0disables normalization completely, and a value of 1.0 normalizes fully. The default is 0.75.

The practicalities of tuning BM25 are another matter. The default values for k1 and b should be suitable for most document collections, but the optimal values really depend on the collection. Finding good values for your collection is a matter of adjusting, checking, and adjusting again.

The similarity algorithm can be set on a per-field basis. It’s just a matter of specifying the chosen algorithm in the field’s mapping:

PUT /my_index

{

  "mappings": {

    "doc": {

      "properties": {

        "title": {

          "type":       "string",

          "similarity": "BM25"

        },

        "body": {

          "type":       "string",

          "similarity": "default"

	The `title` field uses BM25 similarity.
	The `body` field uses the default similarity (see Lucene’s Practical Scoring Function).

Currently, it is not possible to change the similarity mapping for an existing field. You would need to reindex your data in order to do that.

Configuring BM25

Configuring a similarity is much like configuring an analyzer. Custom similarities can be specified when creating an index. For instance:

PUT /my_index

{

  "settings": {

    "similarity": {

      "my_bm25": {

        "type": "BM25",

        "b":    0

      }

    }

  },

  "mappings": {

    "doc": {

      "properties": {

        "title": {

          "type":       "string",

          "similarity": "my_bm25"

        },

        "body": {

          "type":       "string",

          "similarity": "BM25"

        }

      }

    }

  }

}

参考：https://www.elastic.co/guide/en/elasticsearch/guide/current/changing-similarities.html

ES 相似度算法设置（续）的更多相关文章

ES BM25 TF-IDF相似度算法设置——
Pluggable Similarity Algorithms Before we move on from relevance and scoring, we will finish this ch ...
文本相似度算法——空间向量模型的余弦算法和TF-IDF
1.信息检索中的重要发明TF-IDF TF-IDF是一种统计方法,TF-IDF的主要思想是,如果某个词或短语在一篇文章中出现的频率TF高,并且在其他文章中很少出现,则认为此词或者短语具有很好的类别区分 ...
文本相似度余弦值相似度算法 VS L氏编辑距离（动态规划）
设置n为字符串s的长度.("我是个小仙女") 设置m为字符串t的长度.("我不是个小仙女") 如果n等于0,返回m并退出.如果m等于0,返回n并退出.构造两个向 ...
python结巴分词余弦相似度算法实现
过余弦相似度算法计算两个字符串之间的相关度,来对关键词进行归类.重写标题.文章伪原创等功能, 让你目瞪口呆.以下案例使用的母词文件均为txt文件,两种格式:一种内容是纯关键词的txt,每行一个关键词就 ...
Spark/Scala实现推荐系统中的相似度算法（欧几里得距离、皮尔逊相关系数、余弦相似度：附实现代码）
在推荐系统中,协同过滤算法是应用较多的,具体又主要划分为基于用户和基于物品的协同过滤算法,核心点就是基于"一个人"或"一件物品",根据这个人或物品所具有的属性, ...
elasticsearch算法之词项相似度算法(一)
一.词项相似度 elasticsearch支持拼写纠错,其建议词的获取就需要进行词项相似度的计算:今天我们来通过不同的距离算法来学习一下词项相似度算法: 二.数据准备计算词项相似度,就需要首先将词项 ...
ES设置查询的相似度算法
similarity Elasticsearch allows you to configure a scoring algorithm or similarity per field. The si ...
Java 比较两个字符串的相似度算法（Levenshtein Distance）
转载自: https://blog.csdn.net/JavaReact/article/details/82144732 算法简介: Levenshtein Distance,又称编辑距离,指的是两 ...
百度面试题字符串相似度算法 similar_text 和页面相似度算法
在百度的面试,简直就是花样求虐. 首先在面试官看简历的期间,除了一个自己定义字符串相似度,并且写出求相似度的算法. ...这个确实没听说过,php的similar_text函数也是闻所未闻的.之前看s ...

随机推荐

apue学习笔记（第五章标准I/O）
本章讲述标准I/O库流和FILE对象对于标准I/O库,它们的操作是围绕流进行的.流的定向决定了所读.写的字符是单字节还是多字节的. #include <stdio.h> #includ ...
何为SLAM
名词解释: SLAM (simultaneous localization and mapping),也称为CML (Concurrent Mapping and Localizatio ...
html中图片上传预览的实现
本地图片预览第一种方法 <!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type& ...
[ACM] hdu 1029 Ignatius and the Princess IV （动归或hash）
Ignatius and the Princess IV Time Limit : 2000/1000ms (Java/Other) Memory Limit : 65536/32767K (Ja ...
存储过程清理N天前数据
CREATE OR REPLACE PROCEDURE APICALL_LOG_INTERFACE_CLEAN ( CLEANDAY IN Number --天数 ) AS v_cleanDay nu ...
HDU 3461 Code Lock（并查集的应用+高速幂）
* 65536kb,仅仅能开到1.76*10^7大小的数组. 而题目的N取到了10^7.我開始做的时候没注意,用了按秩合并,uset+rank达到了2*10^7所以MLE,所以貌似不能用按秩合并. 事 ...
Django之邮件发送
settings.py #settings 添加如下配置进行邮件发送 #邮件服务器 EMAIL_HOST = "smtp.qq.com" #邮件发送的端口 EMAIL_PORT = ...
FTP匿名登录或弱口令漏洞及服务加固
漏洞描述 FTP 弱口令或匿名登录漏洞,一般指使用 FTP 的用户启用了匿名登录功能,或系统口令的长度太短.复杂度不够.仅包含数字.或仅包含字母等,容易被黑客攻击,发生恶意文件上传或更严重的入侵行为. ...
【caffe】Caffe的Python接口-官方教程-00-classification-详细说明（含代码）
00-classification 主要讲的是如何利用caffenet(与Alex-net稍稍不同的模型)对一张图片进行分类(基于imagenet的1000个类别) 先说说教程到底在哪(反正我是找了半 ...
cocos2d-x AssetsManager libcurl使用心得
libcurl使用心得最新正在写cocosclient更新的逻辑.研究了一下cocos2d-x自带的Libcurl,下面是自己在使用过程中的心得和遇到的未解问题.希望大家一起讨论一下,欢迎大家指导. ...