Elasticsearch is a great search engine: flexible, fast and fun. So how can I get started with it? This post will go through how to get content from a SQL database into Elasticsearch.

Elasticsearch has a set of pluggable services called rivers. A river runs inside an Elasticsearch node, and imports content into the index. There are rivers for Twitter, Redis, files, and of course, SQL databases. The river-jdbc plugin connects to SQL databases using JDBC drivers. In this post we will use PostgreSQL, since it is freely available, and populate it with some content that is also freely available.

So let’s get started:

  1. Download and install Elasticsearch
  2. Start Elasticsearch by running bin/elasticsearch from the installation folder
  3. Install the river-jdbc plugin for Elasticsearch version 1.00RC

     ./bin/plugin -install river-jdbc -url http://bit.ly/1dKqNJy

  4. Download the PostgreSQL JDBC jar file and copy it into the plugins/river-jdbc folder. You should probably get the latest version, which is for JDBC 4.1
  5. Install PostgreSQL from http://www.postgresql.org/download/
  6. Get the booktown database: download its SQL file (we import it with psql further down)
  7. Restart Elasticsearch
  8. Start PostgreSQL

By this time you should have Elasticsearch and PostgreSQL running, and river-jdbc ready to use.

Now we need to put some content into the database using psql, the PostgreSQL command-line tool:

psql -U postgres -f booktown.sql

To send commands to Elasticsearch we will use an online service that works as a mixture of Gist, the code snippet sharing service, and Sense, a Google Chrome plugin that provides a developer console for Elasticsearch. The service is hosted by http://qbox.io, who provide hosted Elasticsearch services.

Check that everything was correctly installed by opening a browser to http://sense.qbox.io/gist/8361346733fceefd7f364f0ae1ebe7efa856779e

Select the topmost line in the left-hand pane and press CTRL+Enter on your keyboard. You may also click on the little triangle that appears to the right, if you are more of a mouse-click kind of person.

You should now see a status message, showing the version of Elasticsearch, node name and such.

Now let’s stop fiddling around the porridge and create a river for our database:

curl -XPUT "http://localhost:9200/_river/mybooks/_meta" -d'
{
  "type": "jdbc",
  "jdbc": {
    "driver": "org.postgresql.Driver",
    "url": "jdbc:postgresql://localhost:5432/booktown",
    "user": "postgres",
    "password": "postgres",
    "index": "booktown",
    "type": "books",
    "sql": "select * from authors"
  }
}'

This will create a “one-shot” river that connects to PostgreSQL on Elasticsearch startup, and pulls the contents from the authors table into the booktown index. The index parameter controls which index the data will be put into, and the type parameter decides the document type within that index. To verify that the river was correctly uploaded, execute:

GET /_river/mybooks/_meta
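
For readers who would rather script this step, here is a minimal Python sketch that assembles the same river definition and wraps it in a PUT request. The helper names `build_river_meta` and `build_put_request` are made up for this example; the URL, credentials and SQL are the ones used above. Nothing is sent until you pass the request to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_river_meta(sql, index="booktown", doc_type="books"):
    """Return the river-jdbc definition used in this post as a dict."""
    return {
        "type": "jdbc",
        "jdbc": {
            "driver": "org.postgresql.Driver",
            "url": "jdbc:postgresql://localhost:5432/booktown",
            "user": "postgres",
            "password": "postgres",
            "index": index,    # index the rows are written to
            "type": doc_type,  # document type inside that index
            "sql": sql,
        },
    }


def build_put_request(river_name, meta, host="http://localhost:9200"):
    """Prepare (but do not send) the PUT that registers the river."""
    body = json.dumps(meta).encode("utf-8")
    return urllib.request.Request(
        f"{host}/_river/{river_name}/_meta", data=body, method="PUT"
    )


request = build_put_request("mybooks", build_river_meta("select * from authors"))
print(request.get_method(), request.full_url)
# To actually register the river: urllib.request.urlopen(request)
```

Keeping the definition in a function makes it easy to swap in the other SQL statements used later in this post.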

Restart Elasticsearch, and watch the log for status messages from river-jdbc. Connection problems, SQL errors or other problems should appear in the log. If everything went OK, you should see something like …SimpleRiverMouth] bulk [1] success [19 items]

The time has come to check out what we got.

GET /booktown/_search

You should now see all the contents from the authors table. The number of items reported under “hits” -> “total” is the same as what we just saw in the log: 19.
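
If you want to check the count from a script rather than by eye, the total sits under `hits.total` in the JSON response. The sketch below reads it from a hand-written, heavily trimmed stand-in for the response (the real one also carries the documents themselves); `hit_count` is a hypothetical helper.

```python
import json


def hit_count(search_response):
    """Return the number of matching documents reported by a search."""
    return search_response["hits"]["total"]


# Hand-written stand-in for the body returned by GET /booktown/_search;
# the inner "hits" array is left empty to keep the example short.
sample = json.loads('{"hits": {"total": 19, "hits": []}}')

print(hit_count(sample))
```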


But looking more closely at the data, we can see that the _id field has been auto-assigned random values. This means that the next time we run the river, all the contents will be added again as new documents.

Luckily, river-jdbc supports some specially labeled fields that let us control how the contents should be indexed.

Reading up on the docs, we change the SQL definition in our river to:

select id as _id, first_name,
last_name from authors

We need to start afresh and scrap the index we just created:

DELETE /booktown

Restart Elasticsearch. Now you should see a meaningful id in your data.

At this point we could start toying around with queries, mappings and analyzers. But that’s not much fun with so little content. We need to join in some tables and get some more interesting data. We can join in the books table, and get all the books for all authors.

SELECT authors.id as _id, authors.last_name, authors.first_name,
books.id, books.title, books.subject_id
FROM public.authors left join public.books on books.author_id = authors.id

Delete the index, restart Elasticsearch and examine the data. Now you see that we only get one book per author. Executing the SQL statement in pgAdmin returns 22 rows, while in Elasticsearch we get 19. This is because of the _id field: each time a row is indexed with an _id that already exists, it overwrites the existing document.
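
The overwrite is easy to reproduce outside Elasticsearch. The sketch below treats the index as a plain dictionary keyed by `_id`, which is only a rough model of what happens, and feeds it a few made-up joined rows: an author with two books ends up as a single document holding the last book seen.

```python
# Made-up rows, shaped like the result of the join above.
rows = [
    {"_id": 1, "last_name": "Author A", "title": "First book"},
    {"_id": 1, "last_name": "Author A", "title": "Second book"},
    {"_id": 2, "last_name": "Author B", "title": "Only book"},
]

index = {}
for row in rows:
    # Indexing with an existing _id replaces the stored document.
    index[row["_id"]] = {k: v for k, v in row.items() if k != "_id"}

print(len(rows), "rows ->", len(index), "documents")
print(index[1]["title"])
```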

River-jdbc supports structured objects, which allow us to create arbitrarily structured JSON documents simply by using SQL aliases. The _id column is used for identity, and the structured objects from rows with the same _id are appended to the existing data. The double quotes around the aliases are escaped with backslashes because the statement goes into the "sql" string of the river definition. This is perhaps best shown by an example:

SELECT authors.id as _id, authors.last_name, authors.first_name,
books.id as \"Books.id\", books.title as \"Books.title\",
books.subject_id as \"Books.subject_id\"
FROM public.authors left join public.books on books.author_id = authors.id order by authors.id

Again, delete the index, restart Elasticsearch, wait a few seconds before you search, and you will find structured data in the search results.
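
To get a feel for what the plugin does with those dotted aliases, here is a rough Python model. It is not river-jdbc's code: it assumes that a dotted alias such as `Books.title` becomes a nested field, and that rows sharing an `_id` are folded into one document in which repeated fields collect their distinct values in a list. The rows are made up.

```python
def merge_row(doc, row):
    """Fold one SQL row into a nested document, following dotted aliases."""
    for column, value in row.items():
        if column == "_id":
            continue
        *parents, leaf = column.split(".")
        node = doc
        for name in parents:
            node = node.setdefault(name, {})
        if leaf not in node:
            node[leaf] = value                # first value seen for this field
        elif isinstance(node[leaf], list):
            if value not in node[leaf]:
                node[leaf].append(value)      # another distinct value
        elif node[leaf] != value:
            node[leaf] = [node[leaf], value]  # second distinct value: start a list
    return doc


rows = [
    {"_id": 1, "last_name": "Author A", "Books.id": 10, "Books.title": "First book"},
    {"_id": 1, "last_name": "Author A", "Books.id": 11, "Books.title": "Second book"},
    {"_id": 2, "last_name": "Author B", "Books.id": 12, "Books.title": "Only book"},
]

documents = {}
for row in rows:
    merge_row(documents.setdefault(row["_id"], {}), row)

print(documents[1])
```

In this model the author fields stay single values because they repeat unchanged on every row, while the book fields grow into lists.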

Now we have seen that it is quite easy to get data into Elasticsearch using river-jdbc. We have also seen how it can handle updates. That gets us quite far. Unfortunately, it doesn’t handle deletions. If a record is deleted from the database, it will not automatically be deleted from the index. There have been some attempts to create support for it, but in the latest release it has been completely dropped.

This is due to the river plugin system having some serious problems, and it will perhaps be deprecated some time after the 1.0 release, or at least no longer actively promoted as “the way” (see the “semi-official statement” in the LinkedIn Elasticsearch group). While it is extremely easy to use rivers to get data, there are a lot of problems in having a data integration process running in the same space as Elasticsearch itself. Architecturally, it is perhaps more correct to leave the search engine to itself, and build integration systems on the side.

Among the recommended alternatives are:
>Use an ETL tool like Talend
>Create your own script
>Edit the source application to send updates to Elasticsearch
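
As an illustration of the "create your own script" option, the sketch below builds a request body for Elasticsearch's bulk API from rows you have already read from the database. It also adds delete actions for ids that are in the index but no longer in the database, which covers the deletion gap described earlier. The rows, the `indexed_ids` set and the helper name `bulk_body` are made up; fetching rows from PostgreSQL and posting the body to `/_bulk` are left out.

```python
import json


def bulk_body(rows, indexed_ids, index="booktown", doc_type="books"):
    """Build a newline-delimited bulk body: index current rows, delete stale ids."""
    lines = []
    current_ids = set()
    for row in rows:
        doc = dict(row)
        doc_id = doc.pop("_id")
        current_ids.add(doc_id)
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    for stale_id in sorted(indexed_ids - current_ids):
        action = {"delete": {"_index": index, "_type": doc_type, "_id": stale_id}}
        lines.append(json.dumps(action))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


rows = [{"_id": 1, "last_name": "Author A"}, {"_id": 2, "last_name": "Author B"}]
body = bulk_body(rows, indexed_ids={1, 2, 3})
print(body)
```

Because each row carries its own `_id`, running the script twice overwrites documents instead of duplicating them, just as with the river.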

Jörg Prante, the man behind river-jdbc, recently started creating a replacement called Gatherer. It is a gathering framework plugin for fetching and indexing data into Elasticsearch, with scalable components.

Anyway, we have data in our index! Rivers may have their problems when used on a large scale, but you would be hard pressed to find anything easier to get started with. Getting data into the index easily is essential when exploring ideas and concepts, creating POCs or just fooling around.

This post has run out of space, but perhaps we can look at some interesting queries next time?
