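Google's Globally Distributed Database: Spanner
At OSDI 2012, Google published the Spanner database. In my view, Spanner's handling of versioning and of external consistency for transactions, and its use of TrueTime + timestamps to keep replicas in sync worldwide, are well worth studying. I also think that understanding its timing logic matters a great deal when deploying a distributed DB over a large area (typically nationwide to global) and keeping replication in sync.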
key points:
external consistency -> txn sequence
truetime + timestamp, sync & multi-version
global deployment
2PC 2PL
3 basic txns(RW, RO, snapshot)
Spanner: Globally-Distributed Database
Implementation
Different environments, each one a universe:
test, development, production, ...
Hierarchy
- universe: global
The universe master and the placement driver are currently singletons.
- zone: the unit of administrative deployment; logical & physical isolation
zone master & location proxy
- spanserver
- tablet
Spanserver
software stack
1 leader plus several replicas, in different data centers
all have:
tablet
$$
(\text{key}: \text{string},\ \text{timestamp}: \text{int64}) \rightarrow \text{string}
$$
Colossus: a distributed filesystem, the successor to GFS
Paxos state machine: one on top of each tablet to support replication; it implements a consistently replicated bag of mappings. The set of replicas is a Paxos group.
Each state machine stores its metadata and log in its corresponding tablet. Paxos implementation supports long-lived leaders with time-based leader leases.
Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
Paxos: the implementation is pipelined; writes are applied in order
leader uniquely has:
- lock table: the state for two-phase locking
- transaction manager: for distributed transactions, across Paxos group
Directories and Placement
built on top of the k/v map: a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix.
tablet: unlike Bigtable, a Spanner tablet is a container that may encapsulate multiple partitions of the row space
Movedir: a background task that moves directories between Paxos groups. It is not a single txn: it registers the fact that it is starting a move, moves the data in the background, and then uses a transaction to atomically move the small remaining amount. (The unit actually moved is a fragment, not a whole big directory.)
Data model
- schematized semi-relational tables
- a query language
- general-purpose transactions
Spanner’s data model is not purely relational, in that rows must have names.
hierarchies: declared in database schemas via INTERLEAVE IN, which gives locality relationships between tables.
TrueTime
API:
- TT.now(): returns an interval [earliest, latest]
- TT.after(t): true if t has definitely passed
- TT.before(t): true if t has definitely not arrived
underlying time references: GPS and atomic clocks
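The three API calls above can be sketched as a toy class. This is a minimal illustration, not Spanner's implementation: the class names, the fixed `epsilon` bound, and the injectable `clock` are all assumptions made here (real TrueTime derives its uncertainty from GPS and atomic-clock time masters).

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus a fixed uncertainty bound (hypothetical)."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed constant; real uncertainty varies over time
        self.clock = clock

    def now(self):
        # The true time is assumed to lie somewhere inside this interval.
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived yet.
        return self.now().latest < t
```

Note that `after(t)` and `before(t)` can both be false at once: that is exactly the case where `t` falls inside the current uncertainty interval.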
Concurrency Control
two-phase commit generates a Paxos write for the prepare phase that has no corresponding Spanner client write.
transactions:
- read-write: (including Standalone writes)
- read-only: without locking, any replica that is sufficiently up-to-date
- snapshot-reads: read in the past, no locking, any replica that is sufficiently up-to-date
Paxos leader lease:
timed leases: make leadership long-lived; replicas grant lease votes
lease interval: [leader discovers it has a quorum of lease votes, leader no longer has a quorum of lease votes]
Smax: the maximum timestamp used by a leader.
two-phase commit: a protocol that keeps all participants consistent; if it does not succeed, the txn is rolled back
- prepare phase
- commit phase
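The two phases can be sketched generically as follows. This is plain textbook two-phase commit with hypothetical `Participant` and `two_phase_commit` names; it leaves out Paxos logging, timestamps, and failure recovery, all of which Spanner layers on top.

```python
class Participant:
    """Hypothetical participant that votes in the prepare phase."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only if every vote was yes, otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

For example, `two_phase_commit([Participant("a"), Participant("b", can_commit=False)])` rolls back both participants, because a single "no" vote is enough to abort.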
RW txn:
writes are buffered at the client until commit
wound-wait: avoids deadlock
both kinds of leader acquire write locks:
- non-coordinator participant leader
- coordinator leader: skip prepare phase
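The wound-wait rule mentioned above can be illustrated with a tiny lock table. This is a simplified sketch with hypothetical names (`LockTable`, `acquire`, the return strings); it assumes exclusive locks only and uses a smaller start timestamp to mean an older txn.

```python
class LockTable:
    """Toy exclusive-lock table applying wound-wait (hypothetical sketch)."""

    def __init__(self):
        self.holder = {}      # key -> (txn_id, start_ts)
        self.aborted = set()  # txns that were wounded

    def acquire(self, key, txn_id, start_ts):
        current = self.holder.get(key)
        if current is None or current[0] == txn_id:
            self.holder[key] = (txn_id, start_ts)
            return "granted"
        holder_id, holder_ts = current
        if start_ts < holder_ts:
            # Requester is older: wound (abort) the younger holder and take the lock.
            self.aborted.add(holder_id)
            self.holder[key] = (txn_id, start_ts)
            return "wounded-holder"
        # Requester is younger: it waits for the older holder to finish.
        return "wait"
```

Because a txn only ever waits for an older one, a cycle of waiting txns cannot form, which is why the scheme avoids deadlock.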
RO txn:
execution flow:
- assign a timestamp s_read
- execute the transaction’s reads as snapshot reads at s_read.
simply select s_read = TT.now().latest
single Paxos group
Define LastTS() to be the timestamp of the last committed write at a Paxos group.
multiple Paxos groups
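The read-only flow above (pick s_read, then run snapshot reads at s_read) can be sketched against a toy multi-version store. The `MultiVersionStore` class and its method names are assumptions for illustration; it models a single replica and ignores safe time, locking, and Paxos.

```python
import bisect


class MultiVersionStore:
    """Toy (key, timestamp) -> value store for one replica (hypothetical)."""

    def __init__(self):
        self._timestamps = {}  # key -> sorted list of commit timestamps
        self._values = {}      # key -> values aligned with _timestamps
        self.last_ts = 0       # timestamp of the last committed write

    def write(self, key, ts, value):
        stamps = self._timestamps.setdefault(key, [])
        values = self._values.setdefault(key, [])
        i = bisect.bisect_right(stamps, ts)
        stamps.insert(i, ts)
        values.insert(i, value)
        self.last_ts = max(self.last_ts, ts)

    def snapshot_read(self, key, s_read):
        # Return the newest version whose timestamp is <= s_read, or None.
        stamps = self._timestamps.get(key, [])
        i = bisect.bisect_right(stamps, s_read)
        return self._values[key][i - 1] if i else None
```

Reading at `s_read = store.last_ts` returns the latest committed version of each key, while reading at an older timestamp returns what the data looked like at that point, without taking any locks.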
Schema-Change Transactions
Discussion
Paxos, TrueTime, consistency
strong consistency across data centers
data model: not purely relational (can use SQL)
tablets are replicated, with concurrency coordination by Paxos
txns with multiple Paxos groups --- 2PC coordination
leader
What is the actual difference compared with a classical distributed database?
consistent versions of the data
read-only access to the data
the core idea: timestamps & version control
time mechanism
global-time consistency: timestamp order is not affected by clock uncertainty
commit time: interval
given two txns, determine whether one actually happened before the other
Participant leader -> Transaction manager -> Paxos group
three basic txn types (RW, RO, snapshot) provide external consistency; global timestamps keep regions in sync and give txns a definite order
Concurrency control: done through timestamp management
timestamp -> multi-version -> snapshot
almost all the work in Spanner revolves around the ordering of timestamps!
condition: multiple data centers
target: external consistency ~= linearizability
Two phase locking:
- growing phase: acquire lock
- shrinking phase: release lock
- 2PC: distributed system, global coordination across nodes
- 2PL: one node, multiple txns, acquiring and managing resources (locks)
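The growing/shrinking rule of two-phase locking can be sketched like this. The `TwoPhaseLockingTxn` class is a hypothetical illustration that only tracks the locks of one txn; it does not model conflicts between txns.

```python
class TwoPhaseLockingTxn:
    """Toy txn that enforces the two phases of 2PL (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False  # becomes True at the first release

    def acquire(self, lock):
        # Growing phase: locks may only be acquired before any release.
        if self.shrinking:
            raise RuntimeError("2PL violated: acquire after release")
        self.held.add(lock)

    def release(self, lock):
        # Shrinking phase: once a lock is released, no new lock may be taken.
        self.shrinking = True
        self.held.discard(lock)
```

Acquiring `"a"` and `"b"`, releasing `"a"`, and then trying to acquire `"c"` raises `RuntimeError`, since the txn has already entered its shrinking phase.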
TrueTime: local clock -> global clock, which is essential for a globally distributed system because of its sync needs.
uncertainty interval [earliest, latest]: make it as small as possible (increase accuracy) -> shorter waits while locks are held -> higher efficiency
Thus, timestamps + TrueTime can provide a globally accessible time service for applications around the world.
external-consistency invariant: if txn T2 starts after txn T1 commits, then s1 < s2 (s1 and s2 are their commit timestamps)
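To see why a smaller uncertainty interval helps, here is a sketch of commit wait: the leader picks a commit timestamp s no smaller than TT.now().latest and only makes the write visible once TT.after(s) is true. The `FakeTrueTime` class, the integer millisecond clock, and `commit_with_wait` are assumptions for illustration; advancing the fake clock stands in for actually sleeping.

```python
class FakeTrueTime:
    """Simulated TrueTime with an integer millisecond clock (hypothetical)."""

    def __init__(self, epsilon_ms):
        self.t_ms = 0
        self.epsilon_ms = epsilon_ms

    def now(self):
        return (self.t_ms - self.epsilon_ms, self.t_ms + self.epsilon_ms)

    def after(self, ts):
        earliest, _ = self.now()
        return earliest > ts

    def advance(self, ms):
        self.t_ms += ms


def commit_with_wait(tt, apply_write):
    # Choose the commit timestamp at the upper edge of the uncertainty interval.
    _, latest = tt.now()
    s = latest
    # Commit wait: hold the write back until s has definitely passed.
    waited_ms = 0
    while not tt.after(s):
        tt.advance(1)  # stands in for sleeping 1 ms
        waited_ms += 1
    apply_write(s)
    return s, waited_ms
```

In this simulation the wait is roughly twice epsilon, so a txn that starts after the write became visible necessarily gets a larger timestamp, and halving the uncertainty roughly halves the time spent waiting.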