Elasticsearch - Nested Mapping and Filters

Because nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query to access them:

GET /my_index/blogpost/_search
{
   "query": {
      "bool": {
         "must": [
            { "match": { "title": "eggs" }},
            {
               "nested": {
                  "path": "comments",
                  "query": {
                     "bool": {
                        "must": [
                           { "match": { "comments.name": "john" }},
                           { "match": { "comments.age": 28 }}
                        ]
                     }
                  }
               }
            }
         ]
      }
   }
}

① The title clause operates on the root document.
② The nested clause “steps down” into the nested comments field. It no longer has access to fields in the root document, nor fields in any other nested document.
③ The comments.name and comments.age clauses operate on the same nested document.

A nested field can contain other nested fields. Similarly, a nested query can contain other nested queries. The nesting hierarchy is applied as you would expect.

Of course, a nested query could match several nested documents. Each matching nested document would have its own relevance score, but these multiple scores need to be reduced to a single score that can be applied to the root document.

By default, the nested query averages the scores of the matching nested documents. This can be controlled by setting the score_mode parameter to avg, max, sum, or even none (in which case the root document gets a constant score of 1.0).

GET /my_index/blogpost/_search
{
   "query": {
      "bool": {
         "must": [
            { "match": { "title": "eggs" }},
            {
               "nested": {
                  "path": "comments",
                  "score_mode": "max",
                  "query": {
                     "bool": {
                        "must": [
                           { "match": { "comments.name": "john" }},
                           { "match": { "comments.age": 28 }}
                        ]
                     }
                  }
               }
            }
         ]
      }
   }
}

① The score_mode of max gives the root document the _score from the best-matching nested document.
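To make the score reduction concrete, here is a minimal Python sketch of how several nested scores could be collapsed into one root score for each score_mode. The function name reduce_nested_scores and the sample scores are made up for illustration; this only mirrors the behaviour described in the text and is not Elasticsearch's actual implementation.

```python
from statistics import mean


def reduce_nested_scores(scores, score_mode="avg"):
    """Collapse the scores of matching nested documents into one root score."""
    if not scores:
        raise ValueError("no nested document matched, so the root document is not a hit")
    if score_mode == "avg":
        return mean(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "none":
        # Constant score, as stated in the text above.
        return 1.0
    raise ValueError(f"unknown score_mode: {score_mode}")


# Hypothetical scores of three matching comments under one blog post.
nested_scores = [0.5, 1.5, 1.0]
for mode in ("avg", "max", "sum", "none"):
    print(mode, reduce_nested_scores(nested_scores, mode))
```

With max, a post with one excellent comment match outranks a post with many mediocre ones; with sum, the number of matching comments matters as well.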

If placed inside the filter clause of a Boolean query, a nested query behaves much like one placed in a scoring clause, except that the score_mode parameter no longer applies. Because it is being used as a non-scoring query (it includes or excludes, but doesn’t score), a score_mode doesn’t make sense since there is nothing to score.
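As a rough illustration of the filter form, the following Python sketch builds such a request body as a plain dictionary and prints it as JSON. The helper name nested_filter is hypothetical; the comments path and field names are taken from the example above, and the body is only constructed here, not sent to a cluster.

```python
import json


def nested_filter(path, conditions):
    """Build a bool query whose filter clause holds a nested query.

    `conditions` maps full field names (for example "comments.name") to the
    values that must match inside one and the same nested document.
    """
    return {
        "bool": {
            "filter": [
                {
                    "nested": {
                        "path": path,
                        # No score_mode here: a filter clause does not score.
                        "query": {
                            "bool": {
                                "must": [
                                    {"match": {field: value}}
                                    for field, value in conditions.items()
                                ]
                            }
                        },
                    }
                }
            ]
        }
    }


body = {"query": nested_filter("comments", {"comments.name": "john", "comments.age": 28})}
print(json.dumps(body, indent=2))
```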

curl -XPOST "http://localhost:9200/index-1/movie/" -d'
{
   "title": "The Matrix",
   "cast": [
      {
         "firstName": "Keanu",
         "lastName": "Reeves"
      },
      {
         "firstName": "Laurence",
         "lastName": "Fishburne"
      }
   ]
}'

Given many such movies in our index we can find all movies with an actor named "Keanu" using a search request such as:

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "cast.firstName": "keanu"
            }
         }
      }
   }
}'

Running the above query indeed returns The Matrix. The same is true if we try to find movies that have an actor with the first name "Keanu" and last name "Reeves":

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "cast.firstName": "keanu"
                     }
                  },
                  {
                     "term": {
                        "cast.lastName": "reeves"
                     }
                  }
               ]
            }
         }
      }
   }
}'

Or at least so it seems. However, let's see what happens if we search for movies with an actor with "Keanu" as first name and "Fishburne" as last name.

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "cast.firstName": "keanu"
                     }
                  },
                  {
                     "term": {
                        "cast.lastName": "fishburne"
                     }
                  }
               ]
            }
         }
      }
   }
}'

At first glance, this clearly should not match The Matrix, as there's no such actor amongst its cast. However, Elasticsearch will return The Matrix for the above query. After all, the movie does contain an actor with "Keanu" as first name and an (albeit different) actor with "Fishburne" as last name. Based on the above query it has no way of knowing that we want the two term filters to match the same unique object in the list of actors. And even if it did, the way the data is indexed it wouldn't be able to handle that requirement.
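The reason becomes clearer with a small simulation. The Python sketch below assumes that arrays of objects are flattened into multi-valued fields such as cast.firstName and cast.lastName, losing the link between values that came from the same object. The helpers flatten and matches_all_terms are hypothetical and only imitate this behaviour; lowercasing stands in for what the default analyzer does to text.

```python
def flatten(document):
    """Flatten arrays of objects into multi-valued dotted fields."""
    flat = {}
    for field, value in document.items():
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            for obj in value:
                for key, inner in obj.items():
                    # Values from different objects end up in the same list.
                    flat.setdefault(f"{field}.{key}", []).append(str(inner).lower())
        else:
            flat[field] = value if isinstance(value, list) else [value]
    return flat


def matches_all_terms(flat_doc, terms):
    """Emulate a bool filter whose must clauses are all term filters."""
    return all(value in flat_doc.get(field, []) for field, value in terms.items())


matrix = {
    "title": "The Matrix",
    "cast": [
        {"firstName": "Keanu", "lastName": "Reeves"},
        {"firstName": "Laurence", "lastName": "Fishburne"},
    ],
}

flat = flatten(matrix)
print(flat["cast.firstName"])  # ['keanu', 'laurence']
print(flat["cast.lastName"])   # ['reeves', 'fishburne']
print(matches_all_terms(flat, {"cast.firstName": "keanu", "cast.lastName": "fishburne"}))  # True
```

Each term filter finds its value somewhere in the corresponding list, so the combination matches even though no single actor carries both values.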

Nested mapping and filter to the rescue

Luckily, Elasticsearch provides a way to filter on multiple fields within the same object in an array: mapping such fields as nested. To try this out, let's create a new index with the "cast" field mapped as nested.

curl -XPUT "http://localhost:9200/index-2" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "cast": {
               "type": "nested"
            }
         }
      }
   }
}'

After indexing the same movie document into the new index we can now find movies based on multiple properties of each actor by using a nested filter. Here's how we would search for movies starring an actor named "Keanu Fishburne":

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "cast",
               "filter": {
                  "bool": {
                     "must": [
                        {
                           "term": {
                              "firstName": "keanu"
                           }
                        },
                        {
                           "term": {
                              "lastName": "fishburne"
                           }
                        }
                     ]
                  }
               }
            }
         }
      }
   }
}'

As you can see we've wrapped our initial bool filter in a nested filter. The nested filter contains a path property where we specify that the filter applies to the cast property of the searched document. It also contains a filter (or a query) which will be applied to each value within the nested property.
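Conceptually, a nested filter evaluates its inner filter against each cast entry separately. Here is a minimal Python sketch of that idea; the function nested_matches is hypothetical and is an illustration of the semantics, not of how Elasticsearch executes the filter internally.

```python
def nested_matches(document, path, terms):
    """Return True if a single object under `path` satisfies every term."""
    return any(
        all(str(obj.get(field, "")).lower() == value for field, value in terms.items())
        for obj in document.get(path, [])
    )


matrix = {
    "title": "The Matrix",
    "cast": [
        {"firstName": "Keanu", "lastName": "Reeves"},
        {"firstName": "Laurence", "lastName": "Fishburne"},
    ],
}

print(nested_matches(matrix, "cast", {"firstName": "keanu", "lastName": "fishburne"}))  # False
print(nested_matches(matrix, "cast", {"firstName": "keanu", "lastName": "reeves"}))     # True
```

Because both terms have to hold for the same object, the made-up actor "Keanu Fishburne" no longer matches.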

As intended, running the nested request against index-2 doesn't return The Matrix, while modifying it to match "Reeves" as last name instead will make it match The Matrix. However, there's one caveat.

Including nested values in parent documents

If we go back to our very first query, filtering only on actors' first names without using a nested filter, as in the request below, we won't get any hits.

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "cast.firstName": "keanu"
            }
         }
      }
   }
}'

This happens because movie documents no longer have cast.firstName fields. Instead each element in the cast array is, internally in ElasticSearch, indexed as a separate document.

We can of course still search for movies based only on first names amongst the cast, but we have to use a nested filter, like this:

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "cast",
               "filter": {
                  "term": {
                     "firstName": "keanu"
                  }
               }
            }
         }
      }
   }
}'

The above request returns The Matrix. However, having to use nested filters or queries when all we want to do is filter on a single property is a bit tedious. To be able to use the power of nested filters for complex criteria while still being able to filter on values in arrays as if we hadn't mapped such properties as nested, we can modify our mappings so that the nested values are also included in the parent document. This is done using the include_in_parent property, like this:

curl -XPUT "http://localhost:9200/index-3" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "cast": {
               "type": "nested",
               "include_in_parent": true
            }
         }
      }
   }
}'

In an index such as the one created with the above request, we can filter on combinations of values within the same complex object in the cast array using nested filters, while still being able to filter on single fields without them. However, we now need to carefully consider where to use, and where not to use, nested filters in our queries: a query for "Keanu Fishburne" will match The Matrix using a regular bool filter, while it won't when wrapped in a nested filter. In other words, when using include_in_parent we may get unexpected results, with queries matching documents they shouldn't, if we forget to use nested filters.
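The trade-off can be sketched in a few lines of Python. The code below assumes that a nested cast is stored as separate hidden documents, and that include_in_parent additionally copies their values into the root document as flat multi-valued fields. All function names (index_movie, plain_terms, nested_terms) are hypothetical and only model the behaviour described above.

```python
def index_movie(movie, include_in_parent):
    """Return (root_fields, nested_docs) for a movie whose cast is nested."""
    nested_docs = [
        {f"cast.{key}": value.lower() for key, value in actor.items()}
        for actor in movie["cast"]
    ]
    root_fields = {"title": [movie["title"]]}
    if include_in_parent:
        # Copy every nested value into the root document as a flat field.
        for doc in nested_docs:
            for field, value in doc.items():
                root_fields.setdefault(field, []).append(value)
    return root_fields, nested_docs


def plain_terms(root_fields, terms):
    """A regular bool filter: each term only has to match somewhere in the root."""
    return all(value in root_fields.get(field, []) for field, value in terms.items())


def nested_terms(nested_docs, terms):
    """A nested filter: all terms have to match within one nested document."""
    return any(all(doc.get(f) == v for f, v in terms.items()) for doc in nested_docs)


matrix = {
    "title": "The Matrix",
    "cast": [
        {"firstName": "Keanu", "lastName": "Reeves"},
        {"firstName": "Laurence", "lastName": "Fishburne"},
    ],
}
keanu_fishburne = {"cast.firstName": "keanu", "cast.lastName": "fishburne"}

root, nested = index_movie(matrix, include_in_parent=True)
print(plain_terms(root, keanu_fishburne))     # True: the unwanted match is back
print(nested_terms(nested, keanu_fishburne))  # False: the nested filter stays strict

root_only, _ = index_movie(matrix, include_in_parent=False)
print(plain_terms(root_only, {"cast.firstName": "keanu"}))  # False: no flat fields in the root
```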


Array Type

Read the doc on elasticsearch.org

As its name suggests, it can be an array of native types (string, int, …) but also an array of objects (the basis used for “objects” and “nested”).

Here are some valid indexing examples:

{
   "Article" : [
      {
         "id" : 12,
         "title" : "An article title",
         "categories" : [1,3,5,7],
         "tag" : ["elasticsearch", "symfony", "Obtao"],
         "author" : [
            {
               "firstname" : "Francois",
               "surname" : "francoisg",
               "id" : 18
            },
            {
               "firstname" : "Gregory",
               "surname" : "gregquat",
               "id" : "2"
            }
         ]
      },
      {
         "id" : 13,
         "title" : "A second article title",
         "categories" : [1,7],
         "tag" : ["elasticsearch", "symfony", "Obtao"],
         "author" : [
            {
               "firstname" : "Gregory",
               "surname" : "gregquat",
               "id" : "2"
            }
         ]
      }
   ]
}

This example contains several kinds of arrays:

  • categories: array of integers
  • tag: array of strings
  • author: array of objects (inner objects or nested)

We explicitly mention the “simple” types because it can be easier and more maintainable to store a flattened value rather than the complete object.
Using a non-relational structure should make you think about a specific model for your search engine:

  • To filter: if you just want to filter, search or aggregate on the textual value of an object, flatten that value into the parent object.
  • To get the list of objects linked to a parent (when you do not need to filter on or index these objects): just store the list of ids and hydrate them with Doctrine and Symfony.
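As an illustration of these two modelling options, here is a small Python sketch that turns a relational article into a search document. The function to_search_document and the field names author_names and author_ids are hypothetical choices for this example, not part of the mapping used in this article.

```python
def to_search_document(article):
    """Build the document sent to the search engine from a relational article."""
    return {
        "id": article["id"],
        "title": article["title"],
        # Flattened textual values: enough to filter, search or aggregate on.
        "author_names": [author["firstname"] for author in article["authors"]],
        # Only the ids: the full objects are loaded from the database later.
        "author_ids": [author["id"] for author in article["authors"]],
    }


article = {
    "id": 12,
    "title": "An article title",
    "authors": [
        {"id": 18, "firstname": "Francois", "surname": "francoisg"},
        {"id": 2, "firstname": "Gregory", "surname": "gregquat"},
    ],
}

print(to_search_document(article))
```

The search document stays small and flat, and the application hydrates the full author objects from their ids when it needs them.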

Inner objects

Inner objects are simply JSON objects embedded in a parent document, such as the “author” entries in the example above. The mapping for this example could be:

fos_elastica:
    clients:
        default: { host: "%elastic_host%", port: "%elastic_port%" }
    indexes:
        blog:
            types:
                article:
                    mappings:
                        title: ~
                        categories: ~
                        tag: ~
                        author:
                            type: object
                            properties:
                                firstname: ~
                                surname: ~
                                id:
                                    type: integer

You can filter or query on these “inner objects”. For example:

The query author.firstname=Francois will return the post with id 12 (and not the one with id 13).

You can read more on the Elasticsearch website

Inner objects are easy to configure. As Elasticsearch documents are “schema-less”, you can index them without specifying any mapping.

The limitation of this method lies in the way Elasticsearch stores your data. Reusing the above example, here is the internal representation of our objects:

[
   {
      "id" : 12,
      "title" : "An article title",
      "categories" : [1,3,5,7],
      "tag" : ["elasticsearch", "symfony", "Obtao"],
      "author.firstname" : ["Francois", "Gregory"],
      "author.surname" : ["francoisg", "gregquat"],
      "author.id" : [18, 2]
   },
   {
      "id" : 13,
      "title" : "A second article title",
      "categories" : [1,7],
      "tag" : ["elasticsearch", "symfony", "Obtao"],
      "author.firstname" : ["Gregory"],
      "author.surname" : ["gregquat"],
      "author.id" : [2]
   }
]

The consequence is that the query:

{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  { "term": { "author.firstname": "francois" }},
                  { "term": { "author.surname": "gregquat" }}
               ]
            }
         }
      }
   }
}

(author.firstname=francois AND author.surname=gregquat) will return document 12, even though no single author has both values. In the case of inner objects, this query can be translated as “Which documents have at least one author.surname equal to gregquat and at least one author.firstname equal to francois?”

To fix this problem, you must use the nested type.

Nested objects

First important difference: nested must be specified explicitly in your mapping.

The mapping looks like the object one; only the type changes:

fos_elastica:
    clients:
        default: { host: "%elastic_host%", port: "%elastic_port%" }
    indexes:
        blog:
            types:
                article:
                    mappings:
                        title: ~
                        categories: ~
                        tag: ~
                        author:
                            type: nested
                            properties:
                                firstname: ~
                                surname: ~
                                id:
                                    type: integer

This time, the internal representation will be:

[
   {
      "id" : 12,
      "title" : "An article title",
      "categories" : [1,3,5,7],
      "tag" : ["elasticsearch", "symfony", "Obtao"],
      "author" : [
         {
            "firstname" : "Francois",
            "surname" : "francoisg",
            "id" : 18
         },
         {
            "firstname" : "Gregory",
            "surname" : "gregquat",
            "id" : 2
         }
      ]
   },
   {
      "id" : 13,
      "title" : "A second article title",
      "categories" : [1,7],
      "tag" : ["elasticsearch", "symfony", "Obtao"],
      "author" : [
         {
            "firstname" : "Gregory",
            "surname" : "gregquat",
            "id" : 2
         }
      ]
   }
]

This time, we keep the object structure.

Nested objects have their own filters, which allow you to filter on each nested object individually. Continuing with the example that exposed the limitation of inner objects, we can write this query:

{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "author",
               "filter": {
                  "bool": {
                     "must": [
                        {
                           "term": {
                              "author.firstname": "francois"
                           }
                        },
                        {
                           "term": {
                              "author.surname": "gregquat"
                           }
                        }
                     ]
                  }
               }
            }
         }
      }
   }
}

We can translate it as “Who has an author object whose surname is equal to ‘gregquat’ and whose firstname is ‘francois’”. This query will return no result.

One problem remains, and it is penalizing when working with big objects: when you want to change a single value of a nested object, you have to reindex the whole parent document (including all its nested objects).
If the objects are heavy and often updated, the impact on performance can be significant.

To fix this problem, you can use the parent/child associations.

Parent/Child

Parent/child associations are very similar to OneToMany relationships (one parent, several children).
The relationship remains hierarchical: an object type is associated with only one parent, and it’s impossible to create a ManyToMany relationship.

We are going to link our article to a category:

fos_elastica:
    clients:
        default: { host: "%elastic_host%", port: "%elastic_port%" }
    indexes:
        blog:
            types:
                category:
                    mappings:
                        id: ~
                        name: ~
                        description: ~
                article:
                    mappings:
                        title: ~
                        tag: ~
                        author: ~
                    _routing:
                        required: true
                        path: category
                    _parent:
                        type: "category"
                        identifier: "id" # optional, as id is the default value
                        property: "category" # optional, as the default value is the type value

When indexing an article, a reference to its Category is also indexed (category.id).
So we can index categories and articles separately while keeping the references between them.

As with nested, there are filters and queries that allow you to search on parents or children:

  • Has Parent Filter / Has Parent Query: filter/query on parent fields, returns the child objects. In our case, we could filter articles whose parent category contains “symfony” in its description.
  • Has Child Filter / Has Child Query: filter/query on child fields, returns the parent objects. In our case, we could filter Categories for which “francoisg” has written an article.

{
   "query": {
      "has_child": {
         "type": "article",
         "query": {
            "filtered": {
               "query": { "match_all": {}},
               "filter": {
                  "term": { "tag": "symfony" }
               }
            }
         }
      }
   }
}

This query will return the Categories that have at least one article tagged with “symfony”.

The queries are here written in JSON, but are easily transformable into PHP with the Elastica library.
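To make the direction of each lookup explicit, here is a minimal Python sketch of the has_child and has_parent semantics over in-memory lists. The sample categories and articles are made up for illustration, and the two functions only imitate what the filters return; they say nothing about how Elasticsearch joins the documents internally.

```python
# Hypothetical sample data: each article references its parent category by id.
categories = [
    {"id": 1, "name": "Search", "description": "elasticsearch and symfony"},
    {"id": 2, "name": "Misc", "description": "everything else"},
]
articles = [
    {"id": 12, "category": 1, "tag": ["elasticsearch", "symfony"]},
    {"id": 13, "category": 2, "tag": ["elasticsearch"]},
]


def has_child(parents, children, predicate):
    """Return the parents that have at least one child satisfying `predicate`."""
    parent_ids = {child["category"] for child in children if predicate(child)}
    return [parent for parent in parents if parent["id"] in parent_ids]


def has_parent(parents, children, predicate):
    """Return the children whose parent satisfies `predicate`."""
    parent_ids = {parent["id"] for parent in parents if predicate(parent)}
    return [child for child in children if child["category"] in parent_ids]


# Categories with at least one article tagged "symfony".
tagged = has_child(categories, articles, lambda a: "symfony" in a["tag"])
print([c["name"] for c in tagged])  # ['Search']

# Articles whose category description contains the word "symfony".
described = has_parent(categories, articles, lambda c: "symfony" in c["description"].split())
print([a["id"] for a in described])  # [12]
```

Note that the condition is always written against one side of the relationship while the results come from the other side.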
