Elasticsearch Study Notes (12): filter and query

I. The keyword sub-field and the keyword data type

1. Prepare test data

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

2. Inspect the mapping

GET /forum/_mapping/article

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "hidden": {
            "type": "boolean"
          },
          "postDate": {
            "type": "date"
          },
          "userID": {
            "type": "long"
          }
        }
      }
    }
  }
}

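In ES 5.2, when a string field is mapped dynamically as text (type=text), ES creates two fields by default. One is the field itself, e.g. articleID, which is analyzed. The other is field.keyword, e.g. articleID.keyword, which is not analyzed. Because of ignore_above: 256, a value longer than 256 characters is not indexed into this sub-field at all (it is skipped, not truncated).

articleID.keyword is a sub-field that recent ES versions create automatically, and it is not analyzed. So when an articleID value arrives it is indexed twice: once as articleID itself, which is analyzed and whose tokens go into the inverted index, and once as articleID.keyword, which is not analyzed and goes into the inverted index as a single string, provided it is at most 256 characters long.

So when a term filter is applied to a text field, consider matching against the built-in field.keyword. The catch is the default 256-character limit. Where possible it is therefore better to create the mapping by hand and make the field not analyzed. In recent ES versions there is no need to specify not_analyzed; setting type=keyword is enough.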

3. Tests

Test 1: search on articleID

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

Result: the document is not found

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Test 2: search on articleID.keyword

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

Result: the document is found

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01"
        }
      }
    ]
  }
}

Test 3: term query on a numeric field

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userID": 1
        }
      }
    }
  }
}

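term filter/query: the search text is not analyzed. It is looked up in the inverted index exactly as you typed it.

By contrast, if the search text were analyzed, "hello world" --> "hello" and "world", and the two words would be matched against the inverted index separately.

With term, "hello world" --> "hello world", and the inverted index is searched for the single term "hello world".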

4. Inspect the analysis output

GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}

GET /forum/_analyze
{
  "field": "articleID.keyword",
  "text": "XHDK-A-1293-#fJ3"
}

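A text field is analyzed by default. When the inverted index is built, every articleID value is tokenized; after that the original articleID string no longer exists in the index, only the individual tokens do.

term does not analyze the search text, so XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3. At index time, however, articleID was analyzed: XHDK-A-1293-#fJ3 --> xhdk, a, 1293, fj3. The unanalyzed search string therefore matches none of the indexed tokens.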
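The mismatch can be reproduced outside Elasticsearch with a small Python sketch. The analyze function here is only a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and all helper names are invented for this illustration; real analysis has more rules.

```python
import re


def standard_like_analyze(text):
    """Rough approximation (an assumption): split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs, analyze):
    """Map every term produced by `analyze` to the set of doc ids containing it."""
    index = {}
    for doc_id, value in docs.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}

# articleID (text): analyzed into tokens.
text_index = build_inverted_index(docs, standard_like_analyze)
# articleID.keyword: the whole value is stored as one term.
keyword_index = build_inverted_index(docs, lambda value: [value])

query = "XHDK-A-1293-#fJ3"  # a term lookup uses the search text as-is
print(standard_like_analyze(query))      # ['xhdk', 'a', '1293', 'fj3']
print(text_index.get(query, set()))      # set(): no such term in the analyzed index
print(keyword_index.get(query, set()))   # {1}
```

The lookup against the analyzed index fails because the only terms it holds are the lowercase fragments, which is the behaviour seen in test 1 and test 2.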

5. Define a field with the keyword data type

(1) Delete the index

DELETE /forum

(2) Recreate the index

PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
  }
}

(3) Prepare the data

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

(4) Test the articleID query

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

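6. Summary

(1) term filter: searches by exact value; numbers, booleans and dates support it natively

(2) A text field must be mapped as not analyzed when the index is created (type=keyword in recent versions) before a term query can match its full value

(3) It is equivalent to a single WHERE condition in SQL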

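II. How filters are executed

1. The bitset mechanism

Each filter builds a bitset from its lookup in the inverted index and uses it to store the result. A simple data structure implements a complex feature, which saves memory and improves performance. A bitset is a binary array in which every element is 0 or 1 and marks whether a doc matches a filter condition: 1 if it matches, 0 if it does not. For example: [0, 1, 1].

ES then iterates over the bitsets of all filter conditions, starting with the sparsest one, to find the documents that satisfy every condition (scanning the sparsest bitset first eliminates as many documents as possible early on).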
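The sparsest-first intersection can be sketched in a few lines of Python. This is only an illustration of the idea with plain lists of 0/1 flags; the function names are invented here, and the real implementation inside Lucene uses compact bit structures rather than Python lists.

```python
def build_bitset(matching_positions, num_docs):
    """One flag per document position: 1 if the filter matched, else 0."""
    matching = set(matching_positions)
    return [1 if position in matching else 0 for position in range(num_docs)]


def intersect_sparsest_first(bitsets):
    """Return the positions set in every bitset, scanning the sparsest first."""
    if not bitsets:
        return []
    ordered = sorted(bitsets, key=sum)  # fewest set bits first
    candidates = [pos for pos, bit in enumerate(ordered[0]) if bit]
    for bitset in ordered[1:]:
        candidates = [pos for pos in candidates if bitset[pos]]
        if not candidates:
            break  # nothing left to check against the denser bitsets
    return candidates


# Positions 0..3 stand for documents 1..4 of the test data.
user_is_1 = build_bitset([0, 1], 4)          # userID = 1
not_hidden = build_bitset([0, 1, 2], 4)      # hidden = false
posted_jan_1 = build_bitset([0, 2], 4)       # postDate = 2017-01-01

print(user_is_1)                                                       # [1, 1, 0, 0]
print(intersect_sparsest_first([user_is_1, not_hidden, posted_jan_1])) # [0]
```

Only position 0 (document 1) survives all three filters, and the densest bitset is consulted last, when the candidate list is already short.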

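2. Caching bitsets

ES tracks queries: a filter condition that is used more than a certain number of times within the most recent 256 queries has its bitset cached automatically. The required number of uses is not fixed.

Bitsets are not cached for small segments (fewer than 10,000 documents, or less than 3% of the total index size). A small segment is fast to scan anyway, and segments are merged automatically in the background, so a small segment soon merges with other small segments into a larger one. Caching its bitset would be pointless because the segment disappears quickly.

Automatic update of cached bitsets: if documents are added or modified, the cached bitset is updated automatically.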
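As a rough illustration of this policy, the sketch below tracks the most recent filters in a bounded history and decides whether a bitset is worth caching for a given segment. The class name, the min_uses value and the threshold arguments are all assumptions made for the example; the exact counts and limits depend on the ES/Lucene version.

```python
from collections import Counter, deque


class FilterUsageTracker:
    """Toy model of 'cache a filter once it is used often enough recently'."""

    def __init__(self, history_size=256, min_uses=5):
        self.history = deque(maxlen=history_size)  # most recent filter keys
        self.counts = Counter()                    # uses within the history window
        self.min_uses = min_uses                   # hypothetical threshold

    def record(self, filter_key):
        # When the window is full, the oldest entry is about to fall out.
        if len(self.history) == self.history.maxlen:
            oldest = self.history[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.history.append(filter_key)
        self.counts[filter_key] += 1

    def should_cache(self, filter_key, segment_docs, index_docs,
                     min_segment_docs, min_segment_ratio):
        # Small segments are cheap to scan and soon merged away: never cache.
        if segment_docs < min_segment_docs:
            return False
        if segment_docs < min_segment_ratio * index_docs:
            return False
        return self.counts[filter_key] >= self.min_uses


tracker = FilterUsageTracker(history_size=4, min_uses=2)
for key in ["userID:1", "userID:1", "hidden:false", "postDate:2017-01-01"]:
    tracker.record(key)

print(tracker.should_cache("userID:1", 50_000, 100_000, 10_000, 0.03))  # True
print(tracker.should_cache("userID:1", 500, 100_000, 10_000, 0.03))     # False
```

The second call returns False only because the segment is too small, even though the filter itself is used often enough.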

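3. filter compared with query

The advantage of filter over query is that its results can be cached.

In most cases filters run before the query, so that as much data as possible is filtered out first.

query: computes each doc's relevance score against the search condition and sorts by that score

filter: simply filters out the wanted data, without computing a relevance score and without sorting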

III. Combining multiple filter conditions with bool

1. Search for posts whose post date is 2017-01-01 or whose post ID is XHDK-A-1293-#fJ3, and whose post date must not be 2017-01-02

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term": {"postDate": "2017-01-01"}},
            {"term": {"articleID": "XHDK-A-1293-#fJ3"}}
          ],
          "must_not": {
            "term": {
              "postDate": "2017-01-02"
            }
          }
        }
      }
    }
  }
}

2. Search for posts whose post ID is XHDK-A-1293-#fJ3, or whose post ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term": {"articleID": "XHDK-A-1293-#fJ3"}},
            {"bool": {
              "must": [
                {"term": {"articleID": "JODL-X-1937-#pV7"}},
                {"term": {"postDate": "2017-01-01"}}
              ]
            }}
          ]
        }
      }
    }
  }
}
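To check what the two bool filters should return for the four test documents, the logic can be mimicked in plain Python. This is a sketch under simplifying assumptions: articleID is treated as an exact keyword value, term and bool_filter are invented helper names, and a should list is required to match at least once only when there is no must clause.

```python
docs = [
    {"_id": 1, "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": False, "postDate": "2017-01-01"},
    {"_id": 2, "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": False, "postDate": "2017-01-02"},
    {"_id": 3, "articleID": "JODL-X-1937-#pV7", "userID": 2, "hidden": False, "postDate": "2017-01-01"},
    {"_id": 4, "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": True, "postDate": "2017-01-02"},
]


def term(field, value):
    """Exact-value match on a single field."""
    return lambda doc: doc.get(field) == value


def bool_filter(must=(), should=(), must_not=()):
    """Combine clauses; each clause is a function doc -> bool."""
    def matches(doc):
        if any(clause(doc) for clause in must_not):
            return False
        if not all(clause(doc) for clause in must):
            return False
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches


def search(condition):
    return [doc["_id"] for doc in docs if condition(doc)]


first = bool_filter(
    should=[term("postDate", "2017-01-01"), term("articleID", "XHDK-A-1293-#fJ3")],
    must_not=[term("postDate", "2017-01-02")],
)
second = bool_filter(
    should=[
        term("articleID", "XHDK-A-1293-#fJ3"),
        bool_filter(must=[term("articleID", "JODL-X-1937-#pV7"),
                          term("postDate", "2017-01-01")]),
    ],
)

print(search(first))   # [1, 3]
print(search(second))  # [1, 3]
```

Because a nested bool is just another clause, the second filter composes the same way the nested JSON does.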

IV. term and terms

V. range filter

Test data:

Add a view-count field to the posts

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }

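1. Search for posts with a view count between 30 and 60

gt means greater than and gte means greater than or equal to; lt means less than and lte means less than or equal to. With gt 30 and lt 60 both bounds are exclusive, so a post with exactly 30 views is not returned.

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gt": 30,
            "lt": 60
          }
        }
      }
    }
  }
}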
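The difference between the exclusive and inclusive bounds is easy to see on the four view counts from the test data. The range_filter helper below is a hypothetical stand-in written for this example, not an Elasticsearch client call.

```python
import operator

# Bound name -> comparison applied as compare(value, bound).
_COMPARISONS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def range_filter(**bounds):
    """Return a predicate that is True when a value satisfies every bound."""
    def matches(value):
        return all(_COMPARISONS[name](value, bound) for name, bound in bounds.items())
    return matches


view_cnt = {1: 30, 2: 50, 3: 100, 4: 80}  # _id -> view_cnt


def search(predicate):
    return [doc_id for doc_id, count in view_cnt.items() if predicate(count)]


print(search(range_filter(gt=30, lt=60)))    # [2]     30 itself is excluded
print(search(range_filter(gte=30, lte=60)))  # [1, 2]  30 is included
```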

2. Search for posts published within the last month

Relative to a fixed date (30 days before 2017-03-10):

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "2017-03-10||-30d"
          }
        }
      }
    }
  }
}

Relative to the current time:

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "now-30d"
          }
        }
      }
    }
  }
}
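To see which threshold these two expressions resolve to, here is a minimal Python sketch of the date arithmetic. It handles only the two shapes used above (now with an optional day offset, and a YYYY-MM-DD anchor followed by || and a day offset); the function name is invented, and time zones and the other date-math units are deliberately ignored.

```python
import re
from datetime import datetime, timedelta

_NOW = re.compile(r"now(?:([+-])(\d+)d)?")
_ANCHORED = re.compile(r"(\d{4}-\d{2}-\d{2})\|\|([+-])(\d+)d")


def resolve_day_math(expression, now):
    """Resolve 'now-30d' or '2017-03-10||-30d' style expressions to a datetime."""
    match = _NOW.fullmatch(expression)
    if match:
        base, sign, days = now, match.group(1), match.group(2)
    else:
        match = _ANCHORED.fullmatch(expression)
        if match is None:
            raise ValueError(f"unsupported expression: {expression!r}")
        base = datetime.strptime(match.group(1), "%Y-%m-%d")
        sign, days = match.group(2), match.group(3)
    if days is None:
        return base
    offset = timedelta(days=int(days))
    return base - offset if sign == "-" else base + offset


threshold = resolve_day_math("2017-03-10||-30d", now=datetime(2017, 3, 10))
print(threshold.date())  # 2017-02-08

# With the test data, no postDate is later than the threshold.
post_dates = {1: "2017-01-01", 2: "2017-01-02", 3: "2017-01-01", 4: "2017-01-02"}
recent = [doc_id for doc_id, day in post_dates.items()
          if datetime.strptime(day, "%Y-%m-%d") > threshold]
print(recent)  # []
```

With the sample posts all dated early January 2017, a threshold of 2017-02-08 matches nothing, so a later postDate is needed to see a hit.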

VI. Precise search control with match query

Test data: add a title field to the posts

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }

Document 5 was not created by the earlier bulk requests, so its update item is expected to fail with a document-missing error unless that document is indexed first.

1. match query

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": "java elasticsearch"
    }
  }
}

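Equivalent to:

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}

If the title field is analyzed, match performs a full-text search and returns documents whose title contains java, elasticsearch, or both.

If the field is not analyzed, match behaves as an exact-value lookup (equivalent to a term query) and returns only documents whose title is exactly "java elasticsearch".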

With operator set to and, the full-text search returns only documents whose title contains both "java" and "elasticsearch":

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}

Equivalent to:

{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
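The effect of the operator on the sample titles can be checked with a short Python sketch. It uses the titles of documents 1 to 4 and a crude tokenizer (lowercase, split on non-alphanumerics) as an assumed stand-in for the analyzer; match here is a local helper, not a client API.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
}


def analyze(text):
    """Crude tokenizer used for both the titles and the query text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(query, operator="or"):
    """Return ids whose title contains any ('or') or all ('and') query terms."""
    terms = analyze(query)
    combine = all if operator == "and" else any
    hits = []
    for doc_id, title in titles.items():
        title_terms = set(analyze(title))
        if combine(term in title_terms for term in terms):
            hits.append(doc_id)
    return hits


print(match("java elasticsearch"))                  # [1, 2, 3, 4]
print(match("java elasticsearch", operator="and"))  # [1, 4]
```

The default behaves like a should list over the individual terms, while and behaves like a must list, which is the equivalence stated above.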

With minimum_should_match set to 75%, the full-text search returns documents that match at least 75% of the given terms:

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}

Equivalent to:

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
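The step from "75%" to 3 is plain arithmetic: 75% of 4 clauses is 3, and a fractional result is rounded down. A minimal sketch, assuming only non-negative integers and percentages (the other minimum_should_match forms are not covered, and the function name is invented here):

```python
def required_matches(num_clauses, minimum_should_match):
    """Translate an integer or an 'NN%' string into a required clause count."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # Integer division rounds the computed count down.
        return num_clauses * percent // 100
    return int(minimum_should_match)


print(required_matches(4, "75%"))  # 3
print(required_matches(3, "75%"))  # 2  (2.25 rounded down)
print(required_matches(4, 3))      # 3
```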

2. Search title by combining several conditions with bool

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark" }},
      "should": [
        { "match": { "title": "hadoop" }},
        { "match": { "title": "elasticsearch" }}
      ]
    }
  }
}

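How a bool query that combines several conditions computes the relevance score

Roughly, the scores of the matching must and should clauses are added together and divided by the total number of must and should clauses.

Rank 1: contains java and every should keyword (hadoop and elasticsearch)

Rank 2: contains java and the should keyword elasticsearch

Rank 3: contains java but none of the should keywords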
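The ranking can be reproduced with a toy calculation. The sketch below assumes, purely for illustration, that every matching clause contributes a score of 1.0; real per-clause scores come from the similarity model and will differ, so only the resulting order is meaningful. The helper names are invented.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
}


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bool_score(title, must, must_not, should):
    """Sum of matching clause scores divided by the number of must + should clauses.

    Returns None when the document is excluded. Each matching clause is
    assumed to score 1.0 (a simplification).
    """
    words = tokens(title)
    if any(word in words for word in must_not):
        return None
    if not all(word in words for word in must):
        return None
    matched = len(must) + sum(1 for word in should if word in words)
    return matched / (len(must) + len(should))


scores = {}
for doc_id, title in titles.items():
    score = bool_score(title, must=["java"], must_not=["spark"],
                       should=["hadoop", "elasticsearch"])
    if score is not None:
        scores[doc_id] = score

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [4, 1, 2]
```

Document 3 is dropped because it lacks java, and the remaining three documents come out in the order listed above.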

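should clauses affect the relevance score.

must guarantees that a document contains the keyword, and the must condition is also used to compute the document's relevance score for the search.

Once must is satisfied, the should conditions do not have to match, but the more of them match, the higher the document's relevance score.

By default, should may match none of its conditions. In the search above, for example, "this is java blog" matches no should condition.

There is one exception: if there is no must, at least one should condition has to match.

The search below has 4 should conditions. By default, matching any one of them is enough for a document to be returned.

With minimum_should_match you can control precisely how many of the 4 should conditions must match before a document is returned.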

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch" }},
        { "match": { "title": "hadoop" }},
        { "match": { "title": "spark" }}
      ],
      "minimum_should_match": 3
    }
  }
}
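Counting the matching should clauses per title shows what this request should return. As before, this is a sketch with a crude tokenizer standing in for the analyzer, using the titles of documents 1 to 4; the helper names are made up for the example.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
}
should_keywords = ["java", "elasticsearch", "hadoop", "spark"]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matched_should_count(title):
    """Number of should keywords that appear in the title."""
    words = tokens(title)
    return sum(1 for keyword in should_keywords if keyword in words)


def search(minimum_should_match):
    return [doc_id for doc_id, title in titles.items()
            if matched_should_count(title) >= minimum_should_match]


print({doc_id: matched_should_count(t) for doc_id, t in titles.items()})
# {1: 2, 2: 1, 3: 1, 4: 3}
print(search(1))  # [1, 2, 3, 4]  default when there is no must clause
print(search(3))  # [4]
```

Raising the requirement from 1 to 3 narrows the result from all four titles to document 4 alone.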
