原文链接

Preface

There are many awesome and powerful distributed NoSQL in the world, like Couchbase, MongoDB, Canssandra, etc. but developing a new one is still a challengeable, interesting and attractive thing for me, why?

  • It can satisfy our special needs for our cloud services.
  • We just need a key-value store, with some simple additional functionalities, we don’t need a complex solution.
  • We can control the whole thing, especially for fixing bugs and improvement.
  • Inventing the wheel is not good, but I can learn much in the process.

A key-value store may need below features:

  • Simple protocol.
  • Simple API.
  • High performance.
  • High availability.
  • Cluster support.

I knew this would be a hard journey first. But after a long hard work, I develop ledis-cluster, a key-value store based on LedisDB + xcodis + redis-failover.

Pre Solution Thinking

Before I develop ledis-cluster, I thought some other solutions which are valuable to be recorded here too.

MySQL

Aha, first I just wanted to use MySQL as a key-value store. This thought amazed my colleagues before, and I think now it may surprise many other guys too.

MySQL is a relational database and can be used as a key-value store easily and sufficiently. We can use a simple table to store key value data like below:

CREATE TABLE kv (    k varbinary(256),    v blob,    PRIMARY KEY(k),) ENGINE=innodb;

When I worked in Tencent game infrastructure department, we used this way to serve many Tencent games and it works well.

But I don’t want to use MySQL as a key-value store now, MySQL is a little heavy and needs some experienced operations people, this is impossible for our team.

Redis

Redis is an awesome NoSQL, it has an amazing performance, supports many useful data structures (kv, hash, list, set and zset), supplies a simple protocol for client user.

I have read the Redis’s code (it is very simple!) many times, used it for about three years in many productions, and I am absolutely confident of maintaining it.

But Redis has a serious problem: memory limitation. We can not store huge data in one machine. Using redis cluster is a good way to solve memory limitation, and there are many existing solutions, like official redis cluster, twemproxy or codis , but I still think another stuff saving huge data exceeding memory limitation in one machine is needed, so I develop LedisDB.

LedisDB

LedisDB is a fast NoSQL, similar to Redis. It has some good features below:

  • Uses Redis protocol, most of the Redis clients can use LedisDB directly.
  • Supports multi data structures(kv, hash, list, set, zset).
  • Uses rocksdb, leveldb or other fast databases as the backend to store huge data, exceeding memory limitation.
  • High performance, see benchmark. Although it is a little slower than Redis, it can still be used in production.

A simple example:

//start ledis server
ledis-server//another shell
ledis-cli -p 6380ledis> set a 1
OK
ledis> get a
“1"

As we see, LedisDB is simple, we can switch to it easily if we used Redis before.

LedisDB now supports rocksdb, leveldb, goleveldb, boltdb and lmdb as the backend storage, we can choose the best one for our actual environment. In our company projects, we use rocksdb which has a awesome performance and many configurations to be tuned, and I will also use it for the following example.

Data Security Guarantee

LedisDB can store huge data in one machine, so the data security needs to be considered cautiously. LedisDB uses below ways to guarantee it.

Backup

We can back up LedisDB and then restore later. Redis saving RDB may block service for some time, but LedisDB doesn’t have this problem. Thanks to rocksdb fast generating snapshot technology, backing up LedisDB is very fast and easy.

Binlog

LedisDB will first log write operations in binlog, then commit changes into backend storage, this is similar to MySQL.

Redis also has AOF, but the AOF file may grow largely, then rewriting AOF may also block service for some time. LedisDB will rotate binlog and write to the new one when current binlog is larger than maximum size (1GB).

Replication

An old saying goes like this: “don’t put all your eggs in one basket”. Similarly, don’t put all our data in one machine.

LedisDB supports asynchronous or semi-synchronous replication. We can not break CAP(Consistency, Availability, Partition tolerance) theorem, for replication, partition tolerance must exist, so we have to choose between consistency and availability.

If we want to guarantee full data security, we may use semi-synchronous replication, but most of time, asynchronous replication is enough.

Monitor and Failover

In the actual production environment, we use a master LedisDB and one or more slaves to construct the topology. We must monitor them in real time because any machine in the topology may be down at any time.

If a slave is down, we may not care too much, this is not a serious problem. But if the master is down (aha, a terrible accident!), we must resolve it quickly.

Generally, we can not expect the master to re-work quickly and infallibly, so electing a best new master from current slaves and doing failover is a better way when master is down.

Redis uses a sentinel feature to monitor the topology and do failover when the master is down. But this sentinel can not be used in LedisDB, so I develop another sentinel: redis-failover, monitoring and doing failover for Redis/LedisDB

redis-failover uses `ROLE` command to check master and get all slaves every second. If the master is down, redis-failover will select the best slave from last `ROLE` returned slaves. The election algorithm is simple, using `INFO` command to get “slave_priority” and “slave_repl_offset” value, if a slave has a higher priority or a larger repliction offset with same priority, the slave will be elected as the new master.

redis-failover may have single point problem too, I use zookeeper or raft to support redis-failover cluster. Zookeeper or raft will elect a leader and let it monitor and do failover, if the leader is down, a new leader will be elected quickly.

Cluster

Although LedisDB can store huge data, the growing data may still exceed the capability of the system in the near future.

Splitting data and storing them into multi machines may be the only feasible way(We don’t have money to buy a mainframe), but how to split the data? and how to find the data by a key? I think an easy solution is to define a key routing rule (mapping key to the actual machine).

For example, we have two machines, n0 and n1, and the key routing rule is simple hash like `crc32(key) % 2`. For key “abc”, the calculation result is 0, so we know that the corresponding data is in n0.

The above solution is easy, but we can not use it in production. If we add another machine, the machine number is 3, all the old data mapping relationship will be broken, and we have to relocate huge amount of data.

Using consistency hash may be better, but I prefer using hash + routing table. We don’t map a key to a machine directly, but to a virtual node named slot, then define a routing table mapping slot to the actual machine.

Continuing with the example, assume we use 1024 slots and 2 machines, the slot and machine mapping is [slot0 — slot511] -> n0, [slot512 — slot1023] -> n1. For a key, first using `crc32(key) % 1024` to get a slot index, then we can find the machine with this slot from the routing table.

This solution may be complex but have a big advantage for re-sharding. If we add another machine n2, change the routing table that mapping slot0 to n2, and we only need to migrate all slot0 data from n0 to n2. The bigger for slot number, the smaller for split data in a slot, and we only migrate little data for one slot.

xcodis uses above way to support LedisDB cluster. Now the slot number is 256, which is a little small that may increase the probability of mapping some busy keys into a slot.

Because of origin LedisDB db index implementation limitation, xcodis can not use bigger slot number than 256, so a better way is to support customizing a routing table for a busy key later. For example, for a key, xcodis should first try to find the associated slot in the routing table, if not found, then use hash.

Another radical choice is to change LedisDB code and upgrade all data saved before. This is may be a huge work, so I will not consider it unless I have no idea to resolve above problems.

xcodis is a proxy supporting redis/LedisDB cluster, the benefit of proxy is that we can hide all cluster information from client users and users can use it easily like using a single server.

In addition to proxy, there are also some other ways to support cluster too:

  • Official Redis cluster, but it is still in development and should not be used in production now, and it can not be used in LedisDB.
  • Customizing client SDK, the SDK can know whole cluster information and do the right key routing for the user. But this way is not universal and we must write many SDKs for different languages (c, java, php, go, etc.), a hard work!

Final Architecture

At last, the final architecture may look below:

  • Use LedisDB to save huge data in one machine.
  • Use Master/slave to guarantee data security.
  • Use redis-failover to monitor the system and do failover.
  • Use xcodis to support cluster.

 This architecture may be not perfect, but is simple and enough for us. Now we have only use LedisDB and xcodis in our projects, not the whole architecture, but we have been testing and will try to deploy it in production in the near future.

Summary

Building up a key-value store is not a easy work, and I don’t think what I do above can beat other existing awesome NoSQLs, but it’s a valuable attempt, I have learned much and meet many new friends in the progress.

Now, I’am the only person to develop the whole thing and need help, if you have interested in what I do, please contact me, maybe we really can build up an awesome NoSQL. :-)

Mail: siddontang@gmail.com

Github: github.com/siddontang

  

分布式存储——Build up a High Availability Distributed Key-Value Store的更多相关文章

  1. etcd -> Highly-avaliable key value store for shared configuration and service discovery

    The name "etcd" originated from two ideas, the unix "/etc" folder and "d&qu ...

  2. greenplum表的distributed key值查看

    greenplum属于分布式的数据库,MPP+Share nothing的体系,查询的效率很快.不过,这是建立在数据分散均匀的基础上的.如果DK值设置不合理的话,完全有可能出现所有数据落在单个节点上的 ...

  3. [System Design] Design a distributed key value caching system, like Memcached or Redis

    https://www.interviewbit.com/problems/design-cache/ Features: This is the first part of any system d ...

  4. Build Telemetry for Distributed Services之OpenTracing实践

    官网:https://opentracing.io/docs/best-practices/ Best Practices This page aims to illustrate common us ...

  5. 分布式系统(Distributed System)资料

    这个资料关于分布式系统资料,作者写的太好了.拿过来以备用 网址:https://github.com/ty4z2008/Qix/blob/master/ds.md 希望转载的朋友,你可以不用联系我.但 ...

  6. 想从事分布式系统,计算,hadoop等方面,需要哪些基础,推荐哪些书籍?--转自知乎

    作者:廖君链接:https://www.zhihu.com/question/19868791/answer/88873783来源:知乎 分布式系统(Distributed System)资料 < ...

  7. 从事分布式系统,计算,hadoop

    作者:廖君链接:https://www.zhihu.com/question/19868791/answer/88873783来源:知乎 分布式系统(Distributed System)资料 < ...

  8. ASF (0) - ASF Java 项目总览

    Apache .NET Ant Library This is a library of Ant tasks that help developing .NET software. It includ ...

  9. 资源list:Github上关于大数据的开源项目、论文等合集

    Awesome Big Data A curated list of awesome big data frameworks, resources and other awesomeness. Ins ...

随机推荐

  1. java8大基本类型

    文章转载自:Java的8种基本数据类型 请阅读原文. 关于Java的8种基本数据类型,其名称.位数.默认值.取值范围及示例如下表所示: 序号 数据类型 位数 默认值 取值范围 举例说明 1 byte( ...

  2. asp.netMVC中使用aop进行关注点分离

    资源地址:https://stackoverflow.com/questions/23244400/aspect-oriented-programming-in-asp-net-mvc 从页面复制过来 ...

  3. cmd生成大文件

    用cmd生成一个大小一定的文件 输入fsutil file createnew  文件位置  文件大小(以字节为单位1024b=1kb) 列如:fsutil file createnew  d:\my ...

  4. springboot中,使用redisTemplate操作redis

    知识点: springboot中整合redis springboot中redisTemplate的使用 redis存数据时,key出现乱码问题 一:springboot中整合redis (1)pom. ...

  5. hive的shell用法(脑子糊涂了,对着脚本第一行是 #!/bin/sh 疯狂执行hive -f 结果报错)

    hive脚本的执行方式 hive脚本的执行方式大致有三种: hive控制台执行: hive -e "SQL"执行: hive -f SQL文件执行:参考hive用法: usage: ...

  6. python中常见的日期处理方法

    今天:today = datetime.date.today() 昨天:yesterday = today - datetime.timedelta(days=1) 上个月:last_month = ...

  7. (二)AppScan使用教程

    1.新建扫描:一般选择 常规扫描 2.选择扫描的平台:web或app 3.扫描配置向导 ①配置URL和服务器 ②配置登录管理 在扫描的过程中,可能会不小心碰到退出按钮导致Appscan注销.因此,要登 ...

  8. python2.7 psycopg2

    psycopg2 安装 sql='''INSERT INTO "CNYB"."PRE_DQ_PLANT"("ID", "ORG_I ...

  9. LInux、xshell(windows)以及finalshell(mac)的常用命令

    一.Linux历史知识: 应用:安装在各种服务器之上,用于嵌入式 版本:内核版本,发行版本(各个公司对其优化) 二.目录介绍 root:系统管理员登录的默认目录 home:其他用户进来的默认目录 us ...

  10. 【基础算法-ST表】入门 -C++

    前言 学了树状数组看到ST表模板跃跃欲试的时候发现完全没思路,因为给出的查询的时间实在太短了!几乎是需要完成O(1)查询.所以ST表到底是什么神仙算法能够做到这么快的查询? ST表 ST表是一个用来解 ...