Exploring data in depth with term vectors

1. Introduction to term vectors

Term vectors return statistics for each term inside a given field of a document.

term information: term frequency in the field, term positions, start and end offsets, term payloads

term statistics: enabled by setting term_statistics=true; total term frequency, how often a term occurs across all documents; document frequency, how many documents contain the term

field statistics: document count, how many documents contain this field; sum of document frequency, the sum of the df of all terms in the field; sum of total term frequency, the sum of the ttf of all terms in the field
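The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `analyze` is a hypothetical stand-in for a whitespace tokenizer plus lowercase filter, and `field_stats` is a made-up helper, not part of any Elasticsearch client. It recomputes the counters by hand for two short sample texts.

```python
from collections import Counter


def analyze(text):
    # Assumption: whitespace tokenizer + lowercase filter, nothing else.
    return text.lower().split()


def field_stats(texts):
    # One Counter per document: term -> term_freq within that document.
    per_doc = [Counter(analyze(t)) for t in texts]
    doc_freq = Counter()  # term -> number of documents containing it
    ttf = Counter()       # term -> total occurrences across all documents
    for counts in per_doc:
        for term, tf in counts.items():
            doc_freq[term] += 1
            ttf[term] += tf
    return {
        "doc_count": sum(1 for c in per_doc if c),  # docs with at least one term
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
        "doc_freq": dict(doc_freq),
        "ttf": dict(ttf),
        "term_freq": [dict(c) for c in per_doc],
    }


texts = ["hello test test test ", "other hello test ..."]
stats = field_stats(texts)
print(stats["doc_count"], stats["sum_doc_freq"], stats["sum_ttf"])
```

Note that term_freq is per document, while doc_freq and ttf are computed over the whole set of documents.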

GET /twitter/tweet/1/_termvectors

GET /twitter/tweet/1/_termvectors?fields=text

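Term statistics and field statistics are not exact: documents that have been deleted are not taken into account.

To be honest, this is rarely used. When it is, it's usually because you need to explore your data. For example, you want to see how many documents a given term, say 大话西游, appears in. Or you want to know how many docs contain a given field, say film_desc, the description of a film.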

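2. Index-time term vector experiment

Term vectors involve a lot of term-level and field-level statistics. There are two ways to collect them:

(1) index-time: you configure it in the mapping, and the term and field statistics are generated directly when the document is indexed

(2) query-time: you never generated any term vector information beforehand, but when you request the term vectors you still get them, because the statistics are computed on the fly and returned to you

In this lecture we won't type any commands by hand; just copy the ones I've prepared. The point here is not search or aggregation syntax. The point is to learn how to collect term vector information, how to read it, and how to use it to explore data.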

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "fullname": "Leo Li",
  "text": "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname": "Leo Li",
  "text": "other hello test ..."
}

GET /my_index/my_type/1/_termvectors
{
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "",
  "_version": 1,
  "found": true,
  "took": 10,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5,
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
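To read the tokens part of this response, it helps to reproduce it by hand. The sketch below is an illustration, not how Elasticsearch computes it internally: `whitespace_tokens` is a hypothetical helper that imitates a whitespace tokenizer with lowercasing, and the payload is simply base64-decoded (the type_as_payload filter stores the token type as the payload).

```python
import base64
import re


def whitespace_tokens(text):
    # Assumption: tokens are maximal runs of non-whitespace characters,
    # lowercased; position is the token index, offsets are character indexes.
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "term": match.group().lower(),
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        })
    return tokens


tokens = whitespace_tokens("hello test test test ")
for token in tokens:
    print(token)

# The payload in the response is base64-encoded bytes.
payload = base64.b64decode("d29yZA==").decode("utf-8")
print(payload)
```

The offsets line up with the response: end_offset is exclusive, so hello covers characters 0 to 5 and the first test covers 6 to 10.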

3. Query-time term vector experiment

GET /my_index/my_type/1/_termvectors
{
  "fields": ["fullname"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

Generally speaking, if circumstances allow, just use query-time term vectors: whatever data you want to explore, compute it on the spot.

4. Term vectors for a manually supplied doc
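To see at a glance which fields already carry index-time term vectors and which ones would be computed at query time, you can inspect the mapping. A minimal sketch, assuming the properties section of the mapping is available as a plain dict; `split_by_term_vector` is a hypothetical helper, not a client API.

```python
def split_by_term_vector(properties):
    # A field stores term vectors at index time only if its mapping sets
    # "term_vector" to something other than "no" (the default).
    index_time, query_time = [], []
    for name, config in properties.items():
        if config.get("term_vector", "no") != "no":
            index_time.append(name)
        else:
            query_time.append(name)
    return sorted(index_time), sorted(query_time)


properties = {
    "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store": True,
        "analyzer": "fulltext_analyzer",
    },
    "fullname": {"type": "text", "analyzer": "fulltext_analyzer"},
}

index_time, query_time = split_by_term_vector(properties)
print(index_time, query_time)
```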

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

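Supplying a doc by hand is not really about the doc itself. It is about supplying the terms you want to probe, for example hello test, which you simply put into a field.

The text is analyzed into terms, and for each term the statistics are computed against all the docs that already exist.

This is quite useful: it lets you choose by hand which terms to explore, so you could probe the statistics of the term "大话西游".

5. Manually specifying the analyzer used to generate term vectors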
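The idea can be sketched as follows. This is a simplified illustration, assuming the term statistics come only from the docs already indexed and not from the supplied text; `analyze` and `probe` are hypothetical helpers, with a whitespace-plus-lowercase analyzer standing in for the real one.

```python
from collections import Counter


def analyze(text):
    # Assumption: whitespace tokenizer + lowercase filter.
    return text.lower().split()


def probe(text, indexed_texts):
    # term_freq comes from the supplied text; doc_freq and ttf come from
    # the documents that are already indexed.
    indexed = [Counter(analyze(t)) for t in indexed_texts]
    result = {}
    for term, tf in Counter(analyze(text)).items():
        result[term] = {
            "term_freq": tf,
            "doc_freq": sum(1 for counts in indexed if term in counts),
            "ttf": sum(counts[term] for counts in indexed),
        }
    return result


indexed_texts = ["hello test test test ", "other hello test ..."]
probed = probe("hello test", indexed_texts)
print(probed)
```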

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "text": "standard"
  }
}
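Why would you swap the analyzer? Because the terms you get back depend on it. The sketch below is only a rough approximation: `whitespace_lowercase` imitates the custom analyzer from the mapping, and `rough_standard` is a crude regex stand-in for the standard analyzer (the real one follows Unicode text segmentation rules). Both function names are hypothetical.

```python
import re


def whitespace_lowercase(text):
    # Imitates the whitespace tokenizer + lowercase filter.
    return text.lower().split()


def rough_standard(text):
    # Crude stand-in for the standard analyzer: keep runs of word
    # characters, drop tokens that are pure punctuation.
    return re.findall(r"\w+", text.lower())


text = "other hello test ..."
custom_terms = whitespace_lowercase(text)
standard_terms = rough_standard(text)
print(custom_terms)
print(standard_terms)
```

With the whitespace tokenizer the "..." survives as a term; with a standard-like analyzer it disappears, so the statistics you see change accordingly.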

6. Terms filter

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

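This filters the term vector results down to the terms you want to see, based on the term statistics.

It is also quite useful: when exploring data, you can filter out terms whose frequency is too low and simply ignore them.

7. Multi term vectors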
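Here is a rough sketch of what such a filter does. Elasticsearch ranks the remaining terms by a tf-idf score; the formula below is a simplified assumption, not the exact one it uses. `filter_terms` and the sample term "rare" are hypothetical.

```python
import math


def filter_terms(terms, doc_count, max_num_terms, min_term_freq=1, min_doc_freq=1):
    # Step 1: drop terms below the frequency thresholds.
    kept = {
        term: s for term, s in terms.items()
        if s["term_freq"] >= min_term_freq and s["doc_freq"] >= min_doc_freq
    }

    # Step 2: rank by a simplified tf-idf score (assumption, see lead-in).
    def score(item):
        _, s = item
        return s["term_freq"] * math.log(1 + doc_count / s["doc_freq"])

    ranked = sorted(kept.items(), key=score, reverse=True)
    # Step 3: keep at most max_num_terms terms.
    return [term for term, _ in ranked[:max_num_terms]]


terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
    "rare": {"term_freq": 1, "doc_freq": 1},  # hypothetical low-frequency term
}

top = filter_terms(terms, doc_count=2, max_num_terms=3, min_doc_freq=2)
print(top)
```

Raising min_doc_freq to 2 is what removes the low-frequency term here; max_num_terms only caps how many of the surviving terms are returned.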

GET _mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "",
      "term_statistics": true
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "",
      "fields": [
        "text"
      ]
    }
  ]
}

GET /my_index/_mtermvectors
{
  "docs": [
    {
      "_type": "my_type",
      "_id": "",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_type": "my_type",
      "_id": ""
    }
  ]
}

GET /my_index/my_type/_mtermvectors
{
  "docs": [
    {
      "_id": "",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_id": ""
    }
  ]
}

GET /_mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "hello test test test"
      }
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "other hello test ..."
      }
    }
  ]
}
