Ignoring TF/IDF

Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:

  • WiFi
  • Garden
  • Pool

The vacation home documents look something like this:

{ "description": "A delightful four-bedroomed house with ... " }

We could use a simple match query:

GET /_search
{
"query": {
"match": {
"description": "wifi garden pool"
}
}
}

However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have—the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of1 to any documents that match, regardless of TF/IDF:

GET /_search
{
"query": {
"bool": {
"should": [
{ "constant_score": {
"query": { "match": { "description": "wifi" }}
}},
{ "constant_score": {
"query": { "match": { "description": "garden" }}
}},
{ "constant_score": {
"query": { "match": { "description": "pool" }}
}}
]
}
}
}

Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

GET /_search
{
"query": {
"bool": {
"should": [
{ "constant_score": {
"query": { "match": { "description": "wifi" }}
}},
{ "constant_score": {
"query": { "match": { "description": "garden" }}
}},
{ "constant_score": {
"boost": 2
"query": { "match": { "description": "pool" }}
}}
]
}
}
}

A matching pool clause would add a score of 2, while the other clauses would add a score of only 1 each.

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.

We could improve our vacation home documents by adding a not_analyzed features field to our vacation homes:

{ "features": [ "wifi", "pool", "garden" ] } 这样改写有什么好处?省索引空间吗?

参考:https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html#ignoring-tfidf

ES忽略TF-IDF评分——使用constant_score的更多相关文章

  1. Elasticsearch由浅入深(十)搜索引擎:相关度评分 TF&IDF算法、doc value正排索引、解密query、fetch phrase原理、Bouncing Results问题、基于scoll技术滚动搜索大量数据

    相关度评分 TF&IDF算法 Elasticsearch的相关度评分(relevance score)算法采用的是term frequency/inverse document frequen ...

  2. Elasticsearch学习之相关度评分TF&IDF

    relevance score算法,简单来说,就是计算出,一个索引中的文本,与搜索文本,他们之间的关联匹配程度 Elasticsearch使用的是 term frequency/inverse doc ...

  3. 25.TF&IDF算法以及向量空间模型算法

    主要知识点: boolean model IF/IDF vector space model     一.boolean model     在es做各种搜索进行打分排序时,会先用boolean mo ...

  4. TF/IDF计算方法

    FROM:http://blog.csdn.net/pennyliang/article/details/1231028 我们已经谈过了如何自动下载网页.如何建立索引.如何衡量网页的质量(Page R ...

  5. 信息检索中的TF/IDF概念与算法的解释

    https://blog.csdn.net/class_brick/article/details/79135909 概念 TF-IDF(term frequency–inverse document ...

  6. 55.TF/IDF算法

    主要知识点: TF/IDF算法介绍 查看es计算_source的过程及各词条的分数 查看一个document是如何被匹配到的         一.算法介绍 relevance score算法,简单来说 ...

  7. TF/IDF(term frequency/inverse document frequency)

    TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...

  8. 基于TF/IDF的聚类算法原理

        一.TF/IDF描述单个term与特定document的相关性TF(Term Frequency): 表示一个term与某个document的相关性. 公式为这个term在document中出 ...

  9. 使用solr的函数查询,并获取tf*idf值

    1. 使用函数df(field,keyword) 和idf(field,keyword). http://118.85.207.11:11100/solr/mobile/select?q={!func ...

随机推荐

  1. Linux安装httpd2.4.10

    1. cd /mnt tar zxvf httpd-2.4.10.tar.gz ./configure --prefix=/mnt/apache2 --enable-dav --enable-modu ...

  2. Hessian原理与程序设计

     Hessian是比較经常使用的binary-rpc.性能较高,适合互联网应用.主要使用在普通的webservice 方法调用.交互数据较小的场景中.hessian的数据交互基于http协议,通常he ...

  3. 工厂方法模式之C++实现

    说明:本文仅供学习交流,转载请标明出处.欢迎转载. 工厂方法模式与简单工厂模式的差别在于:在简单工厂模式中.全部的产品都是有一个工厂创造,这样使得工厂承担了太大的造产品的压力,工厂内部必须考虑所以的产 ...

  4. css:清除浮动 overflow

    是因为overflow除了(visible)会重新给他里面的元素建立块级格式化(block formatting context)floats, position absolute, inline-b ...

  5. apt-get update --> Bad header line (fresh install) Ign http://archive.ubuntu.com natty-security/multiverse Sources/DiffIndex W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/natty/Rele

    apt-get update --> Bad header line (fresh install) fresh natty install i386 desktop. I get this e ...

  6. 使用HTML5制作简单的RPG游戏

    很久以前就想着做一个游戏,但什么都不会又不知道从哪里开始,胡乱找来一些书籍和资料结果太深奥看不懂,无奈只能放弃.这一弃就是十多年,倥偬半生,眼看垂垂老矣,还是没能有什么成果. 近年来游戏引擎越来越多, ...

  7. oracle中视图v$sql的用途

    1.获取正在执行的sql语句.sql语句的执行时间.sql语句的等待事件: select a.sql_text,b.status,b.last_call_et,b.machine,b.event,b. ...

  8. golang生成随机函数的实现

    golang生成随机数可以使用math/rand包, 示例如下: package main import ( "fmt" "math/rand" ) func ...

  9. c++ 系统函数实现文件拷贝

    #include "stdafx.h" #include <string> #include<windows.h> #include<iostream ...

  10. 怎么使用Aspose.Cells读取excel 转化为Datatable

    说明:vs2012 asp.net mvc4 c# 使用Aspose.Cells 读取Excel 转化为Datatable 1.HTML前端代码 <%@ Page Language=" ...