摘自:https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency and survivability(hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies.

The entry point for database clients is the SQL interface. Every node in a CockroachDB cluster can act as a client SQL gateway. A SQL gateway transforms and executes client SQL statements to key-value (KV) operations, which the gateway distributes across the cluster as necessary and returns results to the client. CockroachDB implements a single, monolithic sorted mapfrom key to value where both keys  and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace called ranges. Each range is backed by data stored in a local KV storage engine (we use RocksDB, a variant of LevelDB). Range data is replicated to a configurable number of additional CockroachDB nodes. Ranges are merged and split to maintain a target size, by default 64M. The relatively small size facilitates quick repair and rebalancing to address node failures, new capacity and even read/write load. However, the size must be balanced against the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontally scalability:

  • adding more nodes increases the capacity of the cluster by the amount of storage on each node (divided by a configurable replication factor), theoretically up to 4 exabytes (4E) of logical data;
  • client queries can be sent to any node in the cluster, and queries can operate independently (w/o conflicts), meaning that overall throughput is a linear factor of the number of nodes in the cluster.
  • queries are distributed (ref: distributed SQL) so that the overall throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:

  • uses a distributed consensus protocol for synchronous replication of data in each key value range. We’ve chosen to use the Raft consensus algorithm; all consensus state is stored in RocksDB.
  • single or batched mutations to a single range are mediated via the range's Raft instance. Raft guarantees ACID semantics.
  • logical mutations which affect multiple ranges employ distributed transactions for ACID semantics. CockroachDB uses an efficient non-locking distributed commit protocol.

CockroachDB achieves survivability:

  • range replicas can be co-located within a single datacenter for low latency replication and survive disk or machine failures. They can be distributed across racks to survive some network switch failures.
  • range replicas can be located in datacenters spanning increasingly disparate geographies to survive ever-greater failure scenarios from datacenter power or networking loss to regional power failures (e.g. { US-East-1a, US-East-1b, US-East-1c }{ US-East, US-West, Japan }{ Ireland, US-East, US-West}{ Ireland, US-East, US-West, Japan, Australia }).

CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from a historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. CockroachDB implements a limited form of linearizability, providing ordering for any observer or chain of observers.

Similar to Spanner directories, CockroachDB allows configuration of arbitrary zones of data. This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don’t allow movement of fine grained data on the level of entity groups.

Architecture

CockroachDB implements a layered architecture. The highest level of abstraction is the SQL layer (currently unspecified in this document). It depends directly on the SQL layer, which provides familiar relational concepts such as schemas, tables, columns, and indexes. The SQL layer in turn depends on the distributed key value store, which handles the details of range addressing to provide the abstraction of a single, monolithic key value store. The distributed KV store communicates with any number of physical cockroach nodes. Each node contains one or more stores, one per physical device.

Each store contains potentially many ranges, the lowest-level unit of key-value data. Ranges are replicated using the Raft consensus protocol. The diagram below is a blown up version of stores from four of the five nodes in the previous diagram. Each range is replicated three ways using raft. The color coding shows associated range replicas.

Each physical node exports two RPC-based key value APIs: one for external clients and one for internal clients (exposing sensitive operational features). Both services accept batches of requests and return batches of responses. Nodes are symmetric in capabilities and exported interfaces; each has the same binary and may assume any role.

Nodes and the ranges they provide access to can be arranged with various physical network topologies to make trade offs between reliability and performance. For example, a triplicated (3-way replica) range could have each replica located on different:

  • disks within a server to tolerate disk failures.
  • servers within a rack to tolerate server failures.
  • servers on different racks within a datacenter to tolerate rack power/network failures.
  • servers in different datacenters to tolerate large scale network or power outages.

Up to F failures can be tolerated, where the total number of replicas N = 2F + 1 (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

CockroachDB——类似spanner的开源版,底层使用rocksdb存储,mvcc,支持事务,raft一致性,licence是CockroachDB Community License Agreement的更多相关文章

  1. FineUI(开源版)v6.0中FState服务器端验证的实现原理

    前言 1. FineUI(开源版)是完整开源,最早发起于 2008-04,下载全部源代码:http://fineui.codeplex.com/ 2. 你可以通过捐赠作者来支持FineUI(开源版)的 ...

  2. FineUI(开源版)v4.2.2发布(8年125个版本,官网示例突破300个)!

    开源版是 FineUI 的基石,从 2008 年至今已经持续发布了 120 多个版本,拥有会员 15,000 多位,捐赠会员达到 1,200 多位.   FineUI(开源版)v4.2.2 是 8 年 ...

  3. FineUI(专业版)v1.2.0 和 FineUI(开源版)v4.1.1 同时发布!

    FineUI(开源版)v4.1.1 (建议所有 v4.x 升级到此版本):http://fineui.com/demo/ +2014-08-15 v4.1.1        -修正Form中表单字段设 ...

  4. 禅道开源版 Ldap认证插件开发

    禅道开源版-Ldap插件开发 背景 由于开源版无法使用ldap认证,所以在此分享一下自己开发禅道的ldap开发过程,希望对你有所帮助. 简单说一下这个插件的功能: 1.跳过原有禅道认证,使用ldap认 ...

  5. FineUI开源版之TreeGrid实现

    FineUI开源版是没有树表格的,但是又需要,怎么办呢?在博客园看到一位大大的文章 http://www.cnblogs.com/shiworkyue/p/4211002.html 然后参考,不知道为 ...

  6. 这个接口管理平台 eoLinker 开源版部署指南你一定不想错过

    本文主要内容是讲解如何在本地部署eoLinker开源版. 环境要求 1.PHP 5.5+ / PHP7+(推荐) 2.Mysql 5.5+ / Mariadb 5.5+ 3.Nginx(推荐) / A ...

  7. 开源免费接口管理平台eoLinker AMS开源版 V3.2.0更新,增加批量导出导入接口功能!

    eoLinker是一个免费开源的针对开发人员需求而设计的接口管理工具,通过简单的操作来帮助开发者进行接口文档管理.接口自动化测试.团队协作.数据获取.安全防御监控等功能,降低企业的接口管理成本,提高项 ...

  8. 【开源】接口管理平台eoLinker AMS 开源版3.1.5同步线上版!免费增加大量功能!

    概要:eoLinker是一个免费开源的针对开发人员需求而设计的接口管理工具,通过简单的操作来帮助开发者进行接口文档管理.接口自动化测试.团队协作.数据获取.安全防御监控等功能,降低企业的接口管理成本, ...

  9. 部署eolinker开源版接口管理

    想找一个API接口管理的软件,为了安全性和扩展性考虑,希望是开源的,而且可以在内网独立部署.网上翻找了资料,经过一份比对之后,最终采用eolinker.过去有使用过RAP,但是感觉界面实在是太丑了. ...

随机推荐

  1. Windows 10 IIS所有的html返回空白

    这是一个神奇的现象.因为使用IIS已经有N多年了,喜欢使用它是因为它随手可得.自从装上windows10以来,直至今天才用它来调试客户端程序.想在上面放一个静态的json数据,省的还要去建立一个Web ...

  2. IF ERRORLEVEL 和 IF %ERRORLEVEL% 区别

      IF ERRORLEVEL 1 ( command )    与  IF %ERRORLEVEL%  LEQ 1 ( command  )  等效 也就是 ERRORLEVEL 1 等效于 &qu ...

  3. day09-文件的操作

    目录 文件的基本操作 文件 什么是文件 如何使用文件 打开&关闭文件 打开&关闭文件 del f和f.close()的区别 文件路径 打开模式(不写默认是r) 编码格式 补充(open ...

  4. animation与transition区别

    transition: 过渡属性 过渡所需要时间 过渡动画函数 过渡延迟时间:默认值分别为:all 0 ease 0 1.局限性: 1)只能设置一个属性 2)需要伪类/事件触发才执行 3)只能设置动画 ...

  5. vmware vSphere client中,选择文件->部署OVF模板,报错处理方法

    在vmware vSphere client中,选择文件->部署OVF模板,选择指定的OVA文件,按步骤进行,则会出现这样的错误:此OVF软件包使用了不受支持的功能.OVF软件包需要不支持的硬件 ...

  6. 4.Linux的进程

    4.1 Linux的进程 4.1.1 进程的概述 有关进程的一些基本概念: 1.什么是进程: 当程序被触发后,执行者的权限与属性.程序的程序代码与所需的数据都会被加载到内存中,操作系统并给予这个内存内 ...

  7. JavaScript学习笔记之对象

    目录 1.自定义对象 2.Array 3.Boolean 4.Date 5.Math 6.Number 7.String 8.RegExp 9.Function 10.Event 在 JavaScri ...

  8. redis 和 memcached的区别

    redis和memcached的区别   Redis 和 Memcache 都是基于内存的数据存储系统.Memcached是高性能分布式内存缓存服务:Redis是一个开源的key-value存储系统. ...

  9. 《零压力学Python》 之 第四章知识点归纳

    第四章(决策和循环)知识点归纳 if condition: indented_statements [ elif condition: Indented_statements] [else: Inde ...

  10. 【郑轻邀请赛 F】 Tmk吃汤饭

    [题目链接]:https://acm.zzuli.edu.cn/zzuliacm/problem.php?id=2132 [题意] [题解] 很容易想到用队列来模拟; 这个队列维护的是正在煮的4个人煮 ...