1. Characteristics of Groonga

Slides: http://mroonga.org/publication/presentation/groonga-mysqluc2011.pdf

1.1. Groonga overview

Groonga is a fast and accurate full text search engine based on an inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results. Also, Groonga allows updates without read locks. These characteristics result in superior performance in real-time applications.

Groonga is also a column-oriented database management system (DBMS). Compared with well-known row-oriented systems, such as MySQL and PostgreSQL, column-oriented systems are better suited to aggregate queries. Because of this advantage, Groonga can compensate for the weaknesses of row-oriented systems.

The basic functions of Groonga are provided in a C library. Also, libraries for using Groonga in other languages, such as Ruby, are provided by related projects. In addition, Groonga-based storage engines are provided for MySQL and PostgreSQL. These libraries and storage engines allow any application to use Groonga. See usage examples.

1.2. Full text search and Instant update

In widely used DBMSs, updates are processed immediately. For example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure.

Groonga also uses inverted indexes but supports instant updates. In addition, Groonga allows you to search documents even while the document collection is being updated. Due to these characteristics, Groonga is very flexible as a full text search engine. Also, Groonga maintains steady performance because it divides a large task, inverted index merging, into smaller tasks.

1.3. Column store and aggregate query

People can collect more than enough data in the Internet era. However, it is difficult to extract informative knowledge from a large database, and such a task requires a many-sided analysis through trial and error. For example, search refinement by date, time and location may reveal hidden patterns. Aggregate queries are useful for this kind of task.

An aggregate query groups search results by specified column values and then counts the number of records in each group. For example, an aggregate query in which a location column is specified counts the number of records per location. Making a graph from the result of an aggregate query against a date column is an easy way to visualize changes over time. Also, a combination of refinement by location and an aggregate query against a date column allows visualization of changes over time in a specific location. Thus refinement and aggregation are important for data mining.
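The refinement and aggregation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Groonga's implementation; the records, the column names and the `aggregate` and `refine` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical records; the column names are illustrative only.
records = [
    {"date": "2011-10-01", "location": "Tokyo"},
    {"date": "2011-10-01", "location": "Osaka"},
    {"date": "2011-10-02", "location": "Tokyo"},
    {"date": "2011-10-02", "location": "Tokyo"},
]

def aggregate(rows, column):
    """Group rows by the value of `column` and count the rows in each group."""
    return Counter(row[column] for row in rows)

def refine(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]

# Number of records per location.
per_location = aggregate(records, "location")

# Refinement by location followed by aggregation against the date column:
# changes over time in one specific location.
tokyo_per_date = aggregate(refine(records, "location", "Tokyo"), "date")

print(per_location)
print(tokyo_per_date)
```

Plotting `tokyo_per_date` against its keys would give the "changes over time in a specific location" graph mentioned in the text.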

A column-oriented architecture allows Groonga to process aggregate queries efficiently because a column-oriented database, which stores records by column, allows an aggregate query to access only the specified column. On the other hand, an aggregate query on a row-oriented database, which stores records by row, has to access neighboring columns even though those columns are not required.
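A toy comparison of the two layouts may make the difference concrete. The sketch below keeps everything in memory, so it shows only which values each layout has to touch, not real I/O cost. The data and function names are hypothetical.

```python
# Row-oriented layout: all fields of a record are stored together.
rows = [
    ("2011-10-01", "Tokyo", "a long document body ..."),
    ("2011-10-01", "Osaka", "another long document body ..."),
    ("2011-10-02", "Tokyo", "yet another long document body ..."),
]

# Column-oriented layout: all values of one column are stored together.
date_col, location_col, body_col = (list(col) for col in zip(*rows))
columns = {"date": date_col, "location": location_col, "body": body_col}

def count_by_location_rows(rows):
    """Aggregate on the row layout: every field of every row is visited."""
    counts = {}
    for _date, location, _body in rows:
        counts[location] = counts.get(location, 0) + 1
    return counts

def count_by_column(columns, name):
    """Aggregate on the column layout: only the requested column is visited."""
    counts = {}
    for value in columns[name]:
        counts[value] = counts.get(value, 0) + 1
    return counts

# Both layouts give the same answer; they differ in how much data is read.
print(count_by_location_rows(rows))
print(count_by_column(columns, "location"))
```

On disk, the row layout forces the large `body` values to be read along with `location`, while the column layout can skip them entirely.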

1.4. Inverted index and tokenizer

An inverted index is a traditional data structure used for large-scale full text search. A search engine based on an inverted index extracts index terms from a document when the document is added. Then, at retrieval time, a query is divided into index terms to find documents containing those index terms. In this way, index terms play an important role in full text search, and thus the way index terms are extracted is key to a better search engine.
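A minimal inverted index can be sketched as a mapping from each index term to the set of documents that contain it. The sketch assumes a whitespace tokenizer and AND semantics for multi-term queries. The `InvertedIndex` class and its methods are hypothetical names and do not reflect Groonga's internal structures.

```python
from collections import defaultdict

def tokenize(text):
    """Extract index terms; here simply lowercase words split on whitespace."""
    return text.lower().split()

class InvertedIndex:
    def __init__(self):
        # index term -> set of ids of the documents that contain it
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Extract index terms from a document when it is added."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Divide the query into index terms and intersect their postings."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = InvertedIndex()
index.add(1, "Groonga is a full text search engine")
index.add(2, "MySQL is a row-oriented database")
index.add(3, "Groonga is a column-oriented database")

print(index.search("groonga database"))
```

Because `add` updates the postings directly, a document is searchable as soon as it has been added. Keeping that property at scale is the hard part that the text attributes to Groonga.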

A tokenizer is a module that extracts index terms. A Japanese full text search engine commonly uses a word-based tokenizer (hereafter referred to as a word tokenizer) and/or a character-based n-gram tokenizer (hereafter referred to as an n-gram tokenizer). A word tokenizer-based search engine is superior in time, space and precision, which is the fraction of documents in a search result that are relevant. On the other hand, an n-gram tokenizer-based search engine is superior in recall, which is the fraction of the perfect search result (all relevant documents) that is actually retrieved. The best choice depends on the application in practice.

Groonga supports both word and n-gram tokenizers. The simplest built-in tokenizer uses spaces as word delimiters. Built-in n-gram tokenizers (n = 1, 2, 3) are also available by default. In addition, yet another built-in word tokenizer is available if MeCab, a part-of-speech and morphological analyzer, is embedded. Note that a tokenizer is pluggable and you can develop your own tokenizer, such as a tokenizer based on another part-of-speech tagger or a named-entity recognizer.
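The two tokenizer families can be illustrated with a rough sketch. These functions are hypothetical stand-ins, not Groonga's built-in tokenizers, which are written in C and presumably handle details that this sketch ignores.

```python
def word_tokenize(text):
    """Word tokenizer that treats whitespace as the word delimiter."""
    return text.split()

def ngram_tokenize(text, n=2):
    """Character n-gram tokenizer: every run of n consecutive characters."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_tokenize("full text search"))
print(ngram_tokenize("東京都", 2))
```

The bigram example shows the precision/recall trade-off. "東京都" (Tokyo Metropolis) yields the terms "東京" and "京都", so a query for "京都" (Kyoto) also matches it. Nothing is missed (high recall), but some matches are irrelevant (lower precision).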

1.5. Sharable storage and read lock-free

Multi-core processors are mainstream today and the number of cores per processor is increasing. In order to exploit multiple cores, executing multiple queries in parallel or dividing a query into sub-queries for parallel processing is becoming more important.

A database of Groonga can be shared with multiple threads/processes. Also, multiple threads/processes can execute read queries in parallel even when another thread/process is executing an update query because Groonga uses read lock-free data structures. This feature is suited to a real-time application that needs to update a database while executing read queries. In addition, Groonga allows you to build flexible systems. For example, a database can receive read queries through the built-in HTTP server of Groonga while accepting update queries through MySQL.

1.6. Geo-location (latitude and longitude) search

Location services are getting more convenient because of mobile devices with GPS. For example, if you are going to have lunch or dinner at a nearby restaurant, a local search service for restaurants may be very useful, and for such services, fast geo-location search is becoming more important.

Groonga provides inverted index-based fast geo-location search, which supports a query to find points in a rectangle or circle. Groonga gives high priority to points near the center of an area. Also, Groonga supports distance measurement and you can sort points by distance from any point.
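The query types mentioned here (points in a rectangle, points in a circle, sorting by distance) can be sketched with a brute-force scan. This is not the inverted index-based method Groonga uses. It assumes a spherical Earth with the haversine formula and a rectangle that does not cross the antimeridian. The function names and sample points are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def in_rectangle(point, top_left, bottom_right):
    """True if point lies in the rectangle (assumed not to cross the antimeridian)."""
    lat, lon = point
    return bottom_right[0] <= lat <= top_left[0] and top_left[1] <= lon <= bottom_right[1]

def in_circle(point, center, radius_m):
    """True if point lies within radius_m meters of center."""
    return haversine_m(*point, *center) <= radius_m

def sort_by_distance(named_points, center):
    """Sort (name, point) pairs by distance from center, nearest first."""
    return sorted(named_points, key=lambda item: haversine_m(*item[1], *center))

# Approximate coordinates, for illustration only.
tokyo = (35.6812, 139.7671)
shinjuku = (35.6896, 139.7006)
osaka = (34.7025, 135.4959)
places = [("Osaka", osaka), ("Shinjuku", shinjuku), ("Tokyo", tokyo)]

nearby = [name for name, p in places if in_circle(p, tokyo, 10_000)]
ordered = [name for name, _ in sort_by_distance(places, tokyo)]
print(nearby, ordered)
```

A scan like this visits every point, which is what an index is meant to avoid. The sketch only pins down what the queries compute.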

1.7. Groonga library

The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database. Also, libraries for languages other than C/C++, such as Ruby, are provided in related projects. See related projects for details.

1.8. Groonga server

Groonga provides a built-in server command which supports HTTP, the memcached binary protocol and the Groonga Query Transfer Protocol (GQTP). Also, a Groonga server supports query caching, which significantly reduces response time for repeated read queries. Using this command, Groonga is available even on a server that does not allow you to install new libraries.
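As a rough idea of what a read query over HTTP looks like, the sketch below only builds a request URL and does not contact a server. It assumes the `/d/select` endpoint, the `table`, `match_columns`, `query` and `drilldown` parameters of Groonga's `select` command, and port 10041 as the usual default; check these against the Groonga reference for your version. The `Entries` table, its columns and the `build_select_url` helper are hypothetical.

```python
from urllib.parse import urlencode

def build_select_url(host, port, **params):
    """Build a URL for a `select` request to a Groonga HTTP server (assumed layout)."""
    return f"http://{host}:{port}/d/select?{urlencode(params)}"

# Full text search on a hypothetical `content` column, with an aggregate
# (drilldown) on a hypothetical `location` column.
url = build_select_url(
    "localhost",
    10041,
    table="Entries",
    match_columns="content",
    query="groonga",
    drilldown="location",
)
print(url)
```

Since a repeated read query produces an identical URL, it is the kind of request that the server-side query cache described above can answer quickly.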

1.9. Mroonga storage engine

Groonga works not only as an independent column-oriented DBMS but also as a storage engine for well-known DBMSs. For example, Mroonga is a MySQL pluggable storage engine that uses Groonga. With Mroonga, you can use Groonga for column-oriented storage and full text search. A combination of a built-in storage engine, MyISAM or InnoDB, and a Groonga-based full text search engine is also available. Each combination has its good and bad points, and the best one depends on the application. See related projects for details.

Source: http://groonga.org/docs/characteristic.html

To be analyzed!

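Groonga open-source search engine: column store for aggregation, no built-in distribution; when Groonga runs as a storage engine for MySQL or PostgreSQL, sharding and replication are handled by MySQL itself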