Stratio’s Cassandra Lucene Index

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near real-time search similar to ElasticSearch or Solr, including full text search capabilities and free multivariable, geospatial and bitemporal search. This is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and returns the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
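
The merge step performed by the coordinator can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code: the function name `merge_top_n`, the `(score, row_id)` tuples and the sample scores are all made up for the example.

```python
import heapq

def merge_top_n(partial_results, n):
    """Combine per-node top-n lists into a global top-n (illustrative only).

    partial_results holds one list per node, each made of (score, row_id) tuples.
    """
    all_hits = (hit for node_hits in partial_results for hit in node_hits)
    # Highest score first. Each node contributes at most n hits, so the
    # coordinator inspects at most n * number_of_nodes candidates.
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[0])

# Hypothetical partial results returned by three nodes for n = 2.
node_a = [(0.91, "t7"), (0.40, "t2")]
node_b = [(0.88, "t9"), (0.75, "t4")]
node_c = [(0.95, "t1"), (0.10, "t5")]

top2 = merge_top_n([node_a, node_b, node_c], 2)
print(top2)
```

Because every node has already discarded everything outside its own n best rows, the coordinator never has to look at the full data set.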

Index filtered searches are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job input can dramatically reduce the amount of data to be processed, avoiding a full scan.

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

This project is not intended to replace Apache Cassandra denormalized tables, inverted indexes, and/or secondary indexes. It is just a tool for performing certain kinds of queries that are really hard to address with Apache Cassandra’s out-of-the-box features.

More detailed information is available in Stratio’s Cassandra Lucene Index documentation.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

  • Full text search
  • Geospatial search
  • Bitemporal search
  • Boolean (and, or, not) search
  • Near real-time search
  • Relevance scoring and sorting
  • General top-k queries
  • Custom analyzers
  • CQL complex types (list, set, map, tuple and UDT)
  • CQL user defined functions (UDF)
  • Third-party CQL-based drivers compatibility
  • Spark compatibility
  • Hadoop compatibility

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Columns with TTL
  • Indexing static columns

Requirements

  • Cassandra (the required version is identified by the first three numbers of the plugin version)
  • Java >= 1.7 (OpenJDK and Sun have been tested)
  • Maven >= 3.0
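
As a small illustration of the versioning rule above, the following Python sketch derives the matching Cassandra version from a plugin version string. The helper name `required_cassandra_version` and the version `2.1.8.5` are hypothetical and used only for the example.

```python
def required_cassandra_version(plugin_version):
    """Return the Cassandra version encoded in a plugin version string."""
    parts = plugin_version.split(".")
    if len(parts) < 3:
        raise ValueError("expected at least three dot-separated numbers")
    # The first three numbers identify the compatible Cassandra release.
    return ".".join(parts[:3])

# "2.1.8.5" is a made-up plugin version used only for illustration.
cassandra_version = required_cassandra_version("2.1.8.5")
print(cassandra_version)
```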

Build and install

Stratio’s Cassandra Lucene Index is distributed as a plugin for Apache Cassandra. Thus, you just need to build a JAR containing the plugin and add it to Cassandra’s classpath:

  • Build the plugin with Maven: mvn clean package

  • Copy the generated JAR to the lib folder of your compatible Cassandra installation:

    cp plugin/target/cassandra-lucene-index-plugin-*.jar <CASSANDRA_HOME>/lib/

  • Start/restart Cassandra as usual

Alternatively, patching can also be done with the following Maven profile, specifying the path of your Cassandra installation. This task also deletes any previous versions of the plugin’s JAR from the <CASSANDRA_HOME>/lib/ directory:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

If you don’t have an installed version of Cassandra, there is also an alternative profile to let Maven download and patch the proper version of Apache Cassandra:

mvn clean package -Pdownload_and_patch -Dcassandra_home=<CASSANDRA_HOME>

Now you can run Cassandra and do some tests using the Cassandra Query Language:

<CASSANDRA_HOME>/bin/cassandra -f
<CASSANDRA_HOME>/bin/cqlsh

The Lucene index files will be stored in the same directories as Cassandra’s data files. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

For more details about Apache Cassandra please see its documentation.

Example

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
    id INT PRIMARY KEY,
    user TEXT,
    body TEXT,
    time TIMESTAMP,
    latitude FLOAT,
    longitude FLOAT,
    lucene TEXT
);

We have created a column called lucene to link the index searches. This column will not store data. Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id    : {type : "integer"},
            user  : {type : "string"},
            body  : {type : "text", analyzer : "english"},
            time  : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
            place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
        }
    }'
};

This will index all the columns in the table with the specified types, and the index will be refreshed once per second. Alternatively, you can explicitly refresh all the index shards with an empty search with consistency ALL:

CONSISTENCY ALL
SELECT * FROM tweets WHERE lucene = '{refresh:true}';
CONSISTENCY QUORUM

Now, to search for tweets within a certain date range:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"}
}' limit 100;

The same search can be performed forcing an explicit refresh of the involved index shards:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
    refresh : true
}' limit 100;
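
When the dates come from application code, the date-range searches above can be assembled programmatically. The following Python sketch is one way to do it; the helper name `range_search` is hypothetical, and it emits strictly quoted JSON on the assumption that the index parses it the same way as the relaxed syntax used in this document.

```python
import json
from datetime import date

DATE_FORMAT = "%Y/%m/%d"  # Python equivalent of the index pattern yyyy/MM/dd

def range_search(field, lower, upper, limit=100, refresh=False):
    """Build a CQL statement running a Lucene range filter (hypothetical helper)."""
    search = {"filter": {"type": "range", "field": field,
                         "lower": lower.strftime(DATE_FORMAT),
                         "upper": upper.strftime(DATE_FORMAT)}}
    if refresh:
        # Ask the involved index shards to refresh before searching.
        search["refresh"] = True
    # CQL string literals escape a single quote by doubling it.
    literal = json.dumps(search).replace("'", "''")
    return "SELECT * FROM tweets WHERE lucene = '{}' LIMIT {};".format(literal, int(limit))

statement = range_search("time", date(2014, 4, 25), date(2014, 5, 1), refresh=True)
print(statement)
```

Formatting the bounds with the same pattern declared in the index schema keeps the search consistent with how the time field was mapped.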

Now, to search for the top 100 most relevant tweets whose body field contains the phrase “big data gives organizations” within the aforementioned date range:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}' limit 100;

To refine the search to get only the tweets written by users whose name starts with “a”:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                 {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}' limit 100;

To get the 100 most recent filtered results you can use the sort option:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                 {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}' limit 100;
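
The searches above nest several clauses, so it can help to compose them from smaller pieces. The Python sketch below rebuilds the previous search as a dictionary and serializes it; the helper `must` is hypothetical, and the output is strictly quoted JSON, assumed to be accepted in place of the relaxed syntax shown in this document.

```python
import json

def must(*clauses):
    """Wrap clauses in a boolean condition where all of them have to match."""
    return {"type": "boolean", "must": list(clauses)}

search = {
    # The filter restricts the candidate rows without affecting relevance.
    "filter": must(
        {"type": "range", "field": "time",
         "lower": "2014/04/25", "upper": "2014/05/01"},
        {"type": "prefix", "field": "user", "value": "a"},
    ),
    # The query selects rows by phrase match on the body field.
    "query": {"type": "phrase", "field": "body",
              "value": "big data gives organizations", "slop": 1},
    # The sort returns the most recent rows first.
    "sort": {"fields": [{"field": "time", "reverse": True}]},
}

search_json = json.dumps(search)
print(search_json)
```

Further conditions, such as the geospatial ones described next, can be added as extra arguments to `must` without touching the rest of the search.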

The previous search can be restricted to a geographical bounding box:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                 {type:"prefix", field:"user", value:"a"},
                 {type:"geo_bbox",
                  field:"place",
                  min_latitude:40.225479,
                  max_latitude:40.560174,
                  min_longitude:-3.999278,
                  max_longitude:-3.378550} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}' limit 100;

Alternatively, you can restrict the search to retrieve tweets that are within a specific distance from a geographical position:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                 {type:"prefix", field:"user", value:"a"},
                 {type:"geo_distance",
                  field:"place",
                  latitude:40.393035,
                  longitude:-3.732859,
                  max_distance:"10km",
                  min_distance:"100m"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}' limit 100;
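
To build an intuition for what the distance condition above selects, the Python sketch below checks whether a point lies between a minimum and a maximum distance from a center, using the haversine formula on a spherical Earth. It is only a conceptual model: the index relies on Lucene for its spatial computations, and the function names, the inclusive bounds and the sample tweet position are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_ring(lat, lon, center_lat, center_lon, min_km, max_km):
    """True if the point is between min_km and max_km away from the center."""
    distance = haversine_km(lat, lon, center_lat, center_lon)
    return min_km <= distance <= max_km

# Center taken from the search above; the tweet position is hypothetical.
center = (40.393035, -3.732859)
tweet = (40.4168, -3.7038)

distance_km = haversine_km(tweet[0], tweet[1], center[0], center[1])
matches = within_ring(tweet[0], tweet[1], center[0], center[1], 0.1, 10.0)
print(round(distance_km, 1), matches)
```

A tweet located exactly at the center would be rejected here, because its distance falls below the 100 m minimum.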

Finally, if you want to restrict the search to a certain token range:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                 {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                 {type:"prefix", field:"user", value:"a"},
                 {type:"geo_distance",
                  field:"place",
                  latitude:40.393035,
                  longitude:-3.732859,
                  max_distance:"10km",
                  min_distance:"100m"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}' AND token(id) >= token(0) AND token(id) < token(10000000) limit 100;

This last kind of search is the basis for the support of Hadoop, Spark and other MapReduce frameworks.
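
A rough Python sketch of how a job might divide the work follows: the token ring is cut into contiguous sub-ranges and one statement is generated per sub-range, so that each task scans only its own slice. It assumes the Murmur3Partitioner, whose tokens span -2^63 to 2^63-1, and it compares against literal token values; the function names are hypothetical, and real Hadoop or Spark connectors compute their splits in their own way.

```python
MIN_TOKEN = -2**63      # lower end of the Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1   # upper end of the Murmur3Partitioner token range

def split_token_range(splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Return `splits` contiguous (start, end] sub-ranges covering (lo, hi]."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    step = (hi - lo) // splits
    bounds = [lo + i * step for i in range(splits)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))

def split_statements(search_json, splits):
    """Build one filtered CQL statement per token sub-range."""
    template = ("SELECT * FROM tweets WHERE lucene = '{}' "
                "AND token(id) > {} AND token(id) <= {};")
    return [template.format(search_json, start, end)
            for start, end in split_token_range(splits)]

ranges = split_token_range(4)
statements = split_statements('{"filter": {"type": "prefix", "field": "user", "value": "a"}}', 4)
for statement in statements:
    print(statement)
```

Since every statement carries the same Lucene filter, each task receives only the rows that already satisfy the search.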

Please refer to Stratio’s comprehensive Cassandra Lucene Index documentation for further details.
