boltdb

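There are many articles about boltdb online, especially on WeChat official accounts, for example:

boltdb source code analysis series: transactions, Tencent Cloud Developer Community (tencent.com)

These articles are well written, but they do not necessarily cover the points I care about, so I record those points below.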

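The relationship between node, page, bucket, tx, and db

page: the memory region that the disk data is mmap'ed into; it can be treated as the on-disk data itself
- a page needs a contiguous region of memory
node: the in-memory data structure that wraps a B+ tree node
bucket: a B+ tree; it can be thought of as a table
tx: a read-only or read-write transaction
- a bucket is an in-memory structure, and each tx creates its own
- the nodes materialized in a tx (those modified, plus the ancestors read on the way down to them) are all recorded in the bucket
- when a read-write tx finally writes to disk it allocates new pages; existing data pages are not modified in place
db: the entire database file
- the freelist in db records the free pages of the db file (pages that have been released and can be reused)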
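
The "allocate new pages, never modify old ones" rule can be sketched with a toy model. This is not boltdb code: the `db`, `tx`, `begin`, `get`, `put`, and `commit` names below are hypothetical, and a whole page is reduced to a single string. The point it illustrates is that a reader pinned to the old root keeps seeing the old data after a writer commits.

```go
package main

import "fmt"

type pgid int

// db is a toy database "file": a list of immutable pages plus a root pointer.
type db struct {
	pages []string // pages[i] is the content of page i; never rewritten
	root  pgid     // page holding the current committed state
}

// tx is a toy transaction pinned to the root that was current when it began.
type tx struct {
	db    *db
	root  pgid
	dirty *string // pending new content, held in memory until commit
}

func (d *db) begin() *tx { return &tx{db: d, root: d.root} }

// get returns the tx's own pending write if any, else its snapshot.
func (t *tx) get() string {
	if t.dirty != nil {
		return *t.dirty
	}
	return t.db.pages[t.root]
}

func (t *tx) put(v string) { t.dirty = &v }

// commit appends a new page and repoints the root; old pages stay untouched.
func (t *tx) commit() {
	if t.dirty == nil {
		return
	}
	t.db.pages = append(t.db.pages, *t.dirty)
	t.db.root = pgid(len(t.db.pages) - 1)
}

func main() {
	d := &db{pages: []string{"v1"}}
	reader := d.begin() // snapshot taken before the write
	writer := d.begin()
	writer.put("v2")
	writer.commit()
	fmt.Println(reader.get(), d.begin().get()) // prints "v1 v2"
}
```

The old page can only be reused once no reader like `reader` still points at it, which is what the freelist described later is for.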

tx.commit

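boltdb rebalances the B+ tree only in commit, and writes to disk after that. In other words, the multiple writes made within one transaction are all spilled to dirty pages and written to disk together at commit time.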

func (tx *Tx) Commit() error {
    _assert(!tx.managed, "managed tx commit not allowed")
    if tx.db == nil {
        return ErrTxClosed
    } else if !tx.writable {
        return ErrTxNotWritable
    }
    // TODO(benbjohnson): Use vectorized I/O to write out dirty pages.
    // Rebalance nodes which have had deletions.
    var startTime = time.Now()
    tx.root.rebalance()
    if tx.stats.Rebalance > 0 {
        tx.stats.RebalanceTime += time.Since(startTime)
    }
    // spill data onto dirty pages.
    startTime = time.Now()
    if err := tx.root.spill(); err != nil {
        tx.rollback()
        return err
    }
    // ... (rest of Commit omitted in this excerpt)
}

Because a txn may have inserted many keys, a single node may have to be split more than once:

func (n *node) split(pageSize int) []*node {
    var nodes []*node
    node := n
    for {
        // Split node into two.
        a, b := node.splitTwo(pageSize)
        nodes = append(nodes, a)
        // If we can't split then exit the loop.
        if b == nil {
            break
        }
        // Set node to b so it gets split on the next iteration.
        node = b
    }
    return nodes
}

node.go
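
To see why the loop can run several times, here is a minimal sketch that mimics its shape on a plain slice of item sizes. It assumes a simplified `splitTwo` that greedily fills one page; boltdb's real `splitTwo` also takes a fill percentage and a minimum key count into account, so the cut points below are illustrative only.

```go
package main

import "fmt"

// splitTwo cuts sizes into a head that fits in pageSize and the remaining tail.
// The head always takes at least one item so the caller's loop makes progress.
// A nil tail means no further split is needed.
func splitTwo(sizes []int, pageSize int) (head, tail []int) {
	total := 0
	for i, s := range sizes {
		if i > 0 && total+s > pageSize {
			return sizes[:i], sizes[i:]
		}
		total += s
	}
	return sizes, nil
}

// split repeatedly peels one page-sized chunk off the front, like node.split.
func split(sizes []int, pageSize int) [][]int {
	var out [][]int
	rest := sizes
	for {
		a, b := splitTwo(rest, pageSize)
		out = append(out, a)
		if b == nil {
			break
		}
		rest = b
	}
	return out
}

func main() {
	// Five 40-byte items with a 100-byte page: two items fit per page.
	fmt.Println(split([]int{40, 40, 40, 40, 40}, 100)) // prints [[40 40] [40 40] [40]]
}
```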

When data is written out, it goes bottom-up: child nodes are spilled before their parent.

// spill writes the nodes to dirty pages and splits nodes as it goes.
// Returns an error if dirty pages cannot be allocated.
func (n *node) spill() error {
    var tx = n.bucket.tx
    if n.spilled {
        return nil
    }
    // Spill child nodes first. Child nodes can materialize sibling nodes in
    // the case of split-merge so we cannot use a range loop. We have to check
    // the children size on every loop iteration.
    sort.Sort(n.children)
    for i := 0; i < len(n.children); i++ {
        if err := n.children[i].spill(); err != nil {
            return err
        }
    }
    // We no longer need the child list because it's only used for spill tracking.
    n.children = nil
    // Split nodes into appropriate sizes. The first node will always be n.
    var nodes = n.split(tx.db.pageSize)
    // ... (the excerpt continues in the next listing)

node.go
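
The bottom-up order is a post-order traversal: a parent can only be written once its children have been given their new page ids. A minimal sketch, with a toy `node` type and a hypothetical `spillOrder` helper that records the visiting order instead of writing pages:

```go
package main

import "fmt"

// node is a toy tree node; only the shape of the tree matters here.
type node struct {
	name     string
	children []*node
}

// spill visits all children before the node itself (post-order).
// The index-based loop mirrors boltdb's, which cannot use range because
// the children slice may grow while it is being walked.
func (n *node) spill(order *[]string) {
	for i := 0; i < len(n.children); i++ {
		n.children[i].spill(order)
	}
	*order = append(*order, n.name)
}

// spillOrder returns the order in which nodes would be written out.
func spillOrder(root *node) []string {
	var order []string
	root.spill(&order)
	return order
}

func main() {
	root := &node{name: "root", children: []*node{
		{name: "a", children: []*node{{name: "a1"}, {name: "a2"}}},
		{name: "b"},
	}}
	fmt.Println(spillOrder(root)) // prints [a1 a2 a b root]
}
```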

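How is a large node handled? A run of contiguous pages is allocated as one big page and the data is stored in it. At the same time, the page previously associated with the node can be freed. Because the whole design is append-only (copy-on-write), the old page can be released once the new transaction has committed and no read transaction still accesses it.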

    for _, node := range nodes {
        // Add node's page to the freelist if it's not new.
        if node.pgid > 0 {
            tx.db.freelist.free(tx.meta.txid, tx.page(node.pgid))
            node.pgid = 0
        }
        // Allocate contiguous space for the node.
        p, err := tx.allocate((node.size() / tx.db.pageSize) + 1)
        if err != nil {
            return err
        }
        // ... (excerpt truncated)

node.go
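
The page count passed to `tx.allocate` is plain integer arithmetic, `size/pageSize + 1`. A small worked example, assuming a 4096-byte page size (the actual value comes from the OS) and a hypothetical helper name `pagesNeeded`:

```go
package main

import "fmt"

// pagesNeeded mirrors the expression (node.size() / pageSize) + 1.
// Integer division truncates, so the +1 covers the partial last page.
func pagesNeeded(size, pageSize int) int {
	return size/pageSize + 1
}

func main() {
	const pageSize = 4096 // assumed page size for the example
	for _, size := range []int{100, 4096, 10000} {
		fmt.Printf("size=%d -> %d page(s)\n", size, pagesNeeded(size, pageSize))
	}
}
```

With these inputs the counts are 1, 2, and 3: a 10000-byte node gets three contiguous pages, and a node of exactly one page size gets two, since the formula always rounds up past an exact multiple.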

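Which nodes need a rebalance? Only nodes marked unbalanced are considered, and among those, the ones whose size is at most 25% of the page size or whose key count is at most minKeys() (2 for a branch node, 1 for a leaf node).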

func (n *node) rebalance() {
    if !n.unbalanced {
        return
    }
    n.unbalanced = false
    // Update statistics.
    n.bucket.tx.stats.Rebalance++
    // Ignore if node is above threshold (25%) and has enough keys.
    var threshold = n.bucket.tx.db.pageSize / 4
    if n.size() > threshold && len(n.inodes) > n.minKeys() {
        return
    }
    // ... (excerpt truncated)
}

node.go
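
The early-return condition can be restated as a small predicate. This is a sketch, not boltdb's API: `needsRebalance` is a hypothetical name, `minKeys` follows the 1-for-leaf, 2-for-branch rule, and the `unbalanced` flag check that precedes it in the real code is left out.

```go
package main

import "fmt"

// minKeys is the key count a node must exceed to be left alone.
func minKeys(isLeaf bool) int {
	if isLeaf {
		return 1
	}
	return 2
}

// needsRebalance is the negation of the early return in node.rebalance:
// a node is skipped only if it is both large enough and has enough keys.
func needsRebalance(size, keys, pageSize int, isLeaf bool) bool {
	threshold := pageSize / 4
	return !(size > threshold && keys > minKeys(isLeaf))
}

func main() {
	const pageSize = 4096 // assumed page size; threshold is 1024 bytes
	fmt.Println(needsRebalance(2000, 10, pageSize, true))  // big leaf, many keys
	fmt.Println(needsRebalance(500, 10, pageSize, true))   // too small
	fmt.Println(needsRebalance(2000, 2, pageSize, false))  // branch with only 2 keys
}
```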

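When a node is materialized from a page, it is cached in the bucket, because a materialized node is one that may be about to change. This happens in Cursor.node(), which walks from the root down to the leaf under the cursor. boltdb calls it on the write paths (for example Bucket.Put and Bucket.Delete) after the cursor has been positioned, rather than on every cursor move.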

func (c *Cursor) node() *node {
    _assert(len(c.stack) > 0, "accessing a node with a zero-length cursor stack")
    // If the top of the stack is a leaf node then just return it.
    if ref := &c.stack[len(c.stack)-1]; ref.node != nil && ref.isLeaf() {
        return ref.node
    }
    // Start from root and traverse down the hierarchy.
    var n = c.stack[0].node
    if n == nil {
        n = c.bucket.node(c.stack[0].page.id, nil)
    }
    for _, ref := range c.stack[:len(c.stack)-1] {
        _assert(!n.isLeaf, "expected branch node")
        n = n.childAt(int(ref.index))
    }
    _assert(n.isLeaf, "expected leaf node")
    return n
}

// node creates a node from a page and associates it with a given parent.
func (b *Bucket) node(pgid pgid, parent *node) *node {
    _assert(b.nodes != nil, "nodes map expected")
    // Retrieve node if it's already been created.
    if n := b.nodes[pgid]; n != nil {
        return n
    }
    // Otherwise create a node and cache it.
    n := &node{bucket: b, parent: parent}
    if parent == nil {
        b.rootNode = n
    } else {
        parent.children = append(parent.children, n)
    }
    // Use the inline page if this is an inline bucket.
    var p = b.page
    if p == nil {
        p = b.tx.page(pgid)
    }
    // Read the page into the node and cache it.
    n.read(p)
    b.nodes[pgid] = n
    // Update statistics.
    b.tx.stats.NodeCount++
    // ... (excerpt truncated)
}
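
The caching in Bucket.node is a memoization keyed by page id: the first request decodes the page into a node, and later requests return the same pointer, so all changes within the tx land on one object. A minimal sketch with toy `bucket` and `node` types (the `loads` counter is added here only to make the effect visible):

```go
package main

import "fmt"

type pgid uint64

// node is a toy stand-in for a decoded B+ tree node.
type node struct{ pgid pgid }

// bucket caches one node per page id for the lifetime of a transaction.
type bucket struct {
	nodes map[pgid]*node
	loads int // how many times a page was actually decoded
}

// node returns the cached node for id, decoding the page only on first use.
func (b *bucket) node(id pgid) *node {
	if n := b.nodes[id]; n != nil {
		return n
	}
	n := &node{pgid: id}
	b.loads++
	b.nodes[id] = n
	return n
}

func main() {
	b := &bucket{nodes: make(map[pgid]*node)}
	first, second := b.node(3), b.node(3)
	fmt.Println(first == second, b.loads) // prints "true 1"
}
```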

freelist

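It tracks the pages in the database file that have been freed.

Structure

ids: all free page ids
pending: {txid, pageids[]}, a map from txid to the page ids that the transaction is about to free
cache: a map used as a lookup index over all free and pending page ids

->pending: when pages enter pending

During tx.commit, the pages behind the old nodes touched by the transaction are all put into pending
- node.spill puts each node's old page (a node corresponds to a page) into the freelist's pending

pending->release: when pending pages are released

In the commit phase of a tx, the old pages touched by the transaction are first put into the freelist's pending: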

func (f *freelist) free(txid txid, p *page) {
    if p.id <= 1 {
        panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.id))
    }
    // Free page and all its overflow pages.
    var ids = f.pending[txid]
    for id := p.id; id <= p.id+pgid(p.overflow); id++ {
        // Verify that page is not already free.
        if f.cache[id] {
            panic(fmt.Sprintf("page %d already freed", id))
        }
        // Add to the freelist and cache.
        ids = append(ids, id)
        f.cache[id] = true
    }
    f.pending[txid] = ids
}
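
Two details of `free` are easy to miss: a page with overflow frees the whole contiguous run `id..id+overflow`, and `cache` catches double frees. The sketch below reproduces both on a toy freelist. It is an illustration under assumptions: it takes an id and an overflow count instead of a `*page`, and it returns an error where boltdb panics.

```go
package main

import "fmt"

type pgid uint64
type txid uint64

// freelist is a toy version holding only the fields that free touches.
type freelist struct {
	pending map[txid][]pgid // pages freed by each tx, not yet reusable
	cache   map[pgid]bool   // every page id that is free or pending
}

func newFreelist() *freelist {
	return &freelist{pending: make(map[txid][]pgid), cache: make(map[pgid]bool)}
}

// free records page id and its overflow pages as pending for tx.
func (f *freelist) free(tx txid, id pgid, overflow uint32) error {
	if id <= 1 {
		return fmt.Errorf("cannot free page 0 or 1: %d", id)
	}
	for p := id; p <= id+pgid(overflow); p++ {
		if f.cache[p] {
			return fmt.Errorf("page %d already freed", p)
		}
		f.pending[tx] = append(f.pending[tx], p)
		f.cache[p] = true
	}
	return nil
}

func main() {
	f := newFreelist()
	_ = f.free(9, 20, 2)          // a 3-page run: 20, 21, 22
	fmt.Println(f.pending[9])     // prints [20 21 22]
	fmt.Println(f.free(9, 21, 0)) // prints page 21 already freed
}
```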

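db.beginRWTx: when a read-write transaction starts, it tries to release pending pages that no open transaction can still see, that is, pages freed by transactions older than the oldest open read transaction.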

func (f *freelist) release(txid txid) {
    m := make(pgids, 0)
    for tid, ids := range f.pending {
        if tid <= txid {
            // Move transaction's pending pages to the available freelist.
            // Don't remove from the cache since the page is still free.
            m = append(m, ids...)
            delete(f.pending, tid)
        }
    }
    sort.Sort(m)
    f.ids = pgids(f.ids).merge(m)
}
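
The pending-to-free move can be traced with a toy freelist. This sketch keeps only `ids` and `pending`, and it appends and re-sorts where boltdb merges two sorted lists; the transaction ids in `main` are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

type pgid uint64
type txid uint64

// freelist is a toy version with only the fields that release touches.
type freelist struct {
	ids     []pgid          // pages that can be allocated again
	pending map[txid][]pgid // pages freed by a tx but possibly still visible to readers
}

// release moves the pending pages of every tx with id <= upTo into ids.
func (f *freelist) release(upTo txid) {
	for tid, ids := range f.pending {
		if tid <= upTo {
			f.ids = append(f.ids, ids...)
			delete(f.pending, tid) // deleting during range is safe in Go
		}
	}
	sort.Slice(f.ids, func(i, j int) bool { return f.ids[i] < f.ids[j] })
}

func main() {
	f := &freelist{pending: map[txid][]pgid{5: {10, 11}, 7: {3}}}
	// Suppose the oldest open read tx has id 6: only tx <= 5 can be released.
	f.release(5)
	fmt.Println(f.ids, len(f.pending)) // prints [10 11] 1
	// Once that reader closes, the pages freed by tx 7 become reusable too.
	f.release(7)
	fmt.Println(f.ids, len(f.pending)) // prints [3 10 11] 0
}
```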
