Generating sstables with TableBuilder (include/leveldb/table_builder.h, table/table_builder.cc)

LevelDB uses TableBuilder to build sstables, and wraps it in a BuildTable interface that converts a memtable into an sstable.

The format of an sstable is:

datablock1 | datablock2 | ... | metablock1 | metablock2 | ... | metaindexblock | indexblock | footer

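A datablock is a block that stores KV data. A metablock holds meta information about the datablocks; in the code analyzed here the only metablock is the filter block written by Finish. The metaindexblock is the index of the metablocks, and the indexblock is the index of the datablocks.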

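The data format of the footer is:

metaindexblockhandle | indexblockhandle | padding | magic

metaindexblockhandle points to the metaindexblock and indexblockhandle points to the indexblock. padding fills the two handles out to a fixed length, and magic is an 8-byte magic number that identifies the file as an sstable.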
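
The footer layout can be sketched as follows. This is a minimal illustration, not LevelDB's own code: EncodeFooter, PutVarint64 and PutFixed32 are hypothetical helpers written for this example, and the sketch assumes that each handle is a pair of varint64 values, that the two handles are zero-padded to 40 bytes, and that the magic number is the constant kTableMagicNumber from LevelDB's table/format.h stored as two little-endian 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Magic number of LevelDB table files (assumed to match table/format.h).
constexpr uint64_t kTableMagicNumber = 0xdb4775248b80fb57ull;
// A handle is two varint64 values, each at most 10 bytes long.
constexpr size_t kMaxHandleLength = 10 + 10;
constexpr size_t kFooterLength = 2 * kMaxHandleLength + 8;

// Appends v in varint64 form: 7 payload bits per byte, high bit = "more".
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends v as 4 little-endian bytes.
void PutFixed32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Hypothetical helper: encodes a footer from two (offset, size) handles.
std::string EncodeFooter(uint64_t metaindex_offset, uint64_t metaindex_size,
                         uint64_t index_offset, uint64_t index_size) {
  std::string dst;
  PutVarint64(&dst, metaindex_offset);
  PutVarint64(&dst, metaindex_size);
  PutVarint64(&dst, index_offset);
  PutVarint64(&dst, index_size);
  dst.resize(2 * kMaxHandleLength);  // zero padding up to 40 bytes
  PutFixed32(&dst, static_cast<uint32_t>(kTableMagicNumber & 0xffffffffu));
  PutFixed32(&dst, static_cast<uint32_t>(kTableMagicNumber >> 32));
  assert(dst.size() == kFooterLength);
  return dst;
}
```

Because the footer has a fixed length, a reader can locate it by seeking to the end of the file minus kFooterLength without knowing anything else about the file.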

The TableBuilder class

The TableBuilder class is defined as:

class LEVELDB_EXPORT TableBuilder
{
  public:
    // ...

    // Add key,value to the table being constructed.
    // REQUIRES: key is after any previously added key according to comparator.
    // REQUIRES: Finish(), Abandon() have not been called
    void Add(const Slice &key, const Slice &value);

    // Advanced operation: flush any buffered key/value pairs to file.
    // Can be used to ensure that two adjacent entries never live in
    // the same data block.  Most clients should not need to use this method.
    // REQUIRES: Finish(), Abandon() have not been called
    void Flush();

    // ...

    // Finish building the table.  Stops using the file passed to the
    // constructor after this function returns.
    // REQUIRES: Finish(), Abandon() have not been called
    Status Finish();

    // ...

  private:
    // ..
    void WriteBlock(BlockBuilder *block, BlockHandle *handle);
    void WriteRawBlock(const Slice &data, CompressionType, BlockHandle *handle);

    struct Rep;
    Rep *rep_;
};

Member variables of TableBuilder

TableBuilder has a single member variable, rep_, of type Rep *.

The Rep struct

Rep encapsulates the data that TableBuilder uses while building an sstable:

struct TableBuilder::Rep
{
    Options options;
    Options index_block_options;
    WritableFile *file;
    uint64_t offset;
    Status status;
    BlockBuilder data_block;
    BlockBuilder index_block;
    std::string last_key;
    int64_t num_entries;
    bool closed; // Either Finish() or Abandon() has been called.
    FilterBlockBuilder *filter_block;

    // We do not emit the index entry for a block until we have seen the
    // first key for the next data block.  This allows us to use shorter
    // keys in the index block.  For example, consider a block boundary
    // between the keys "the quick brown fox" and "the who".  We can use
    // "the r" as the key for the index block entry since it is >= all
    // entries in the first block and < all entries in subsequent
    // blocks.
    //
    // Invariant: r->pending_index_entry is true only if data_block is empty.
    bool pending_index_entry;
    BlockHandle pending_handle; // Handle to add to index block

    std::string compressed_output;

    // ...
};

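Here file is the file that stores the sstable, offset is the current write offset in that file, data_block builds the current datablock, index_block builds the indexblock, last_key is the most recently added key, num_entries is the number of entries in the sstable, and filter_block builds the filter block (for example a bloom filter). pending_handle is the handle of the datablock that was just written; it is added to index_block once the next key is known. compressed_output is a scratch buffer that holds the compressed form of the block being written.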

Several auxiliary structures are used here. BlockBuilder was analyzed in https://www.cnblogs.com/YuNanlong/p/9427787.html. FilterBlockBuilder will be analyzed in a later post.

The WritableFile class

The WritableFile class defines the interface for file operations:

class LEVELDB_EXPORT WritableFile
{
  public:
    WritableFile() = default;

    WritableFile(const WritableFile &) = delete;
    WritableFile &operator=(const WritableFile &) = delete;

    virtual ~WritableFile();

    virtual Status Append(const Slice &data) = 0;
    virtual Status Close() = 0;
    virtual Status Flush() = 0;
    virtual Status Sync() = 0;
};

The concrete implementation is the PosixWritableFile class:

class PosixWritableFile : public WritableFile
{
  private:
    // buf_[0, pos_-1] contains data to be written to fd_.
    std::string filename_;
    int fd_;
    char buf_[kBufSize];
    size_t pos_;

  public:
    // ...

    virtual Status Append(const Slice &data)
    {
        // See the analysis below
    }

    virtual Status Close()
    {
        // ...
    }

    virtual Status Flush()
    {
        return FlushBuffered();
    }

    Status SyncDirIfManifest()
    {
        const char *f = filename_.c_str();
        const char *sep = strrchr(f, '/');
        Slice basename;
        std::string dir;
        if (sep == nullptr)
        {
            dir = ".";
            basename = f;
        }
        else
        {
            dir = std::string(f, sep - f);
            basename = sep + 1;
        }
        Status s;
        if (basename.starts_with("MANIFEST"))
        {
            int fd = open(dir.c_str(), O_RDONLY);
            if (fd < 0)
            {
                s = PosixError(dir, errno);
            }
            else
            {
                if (fsync(fd) < 0)
                {
                    s = PosixError(dir, errno);
                }
                close(fd);
            }
        }
        return s;
    }

    virtual Status Sync()
    {
        // See the analysis below
    }

  private:
    Status FlushBuffered()
    {
        // ...
    }

    Status WriteRaw(const char *p, size_t n)
    {
        // ...
    }
};

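Here filename_ is the file name, fd_ is the file descriptor, buf_ is the write buffer, and pos_ is the number of bytes currently buffered, that is, the offset at which the free space in buf_ begins.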

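Append writes data to the file. It first copies as much of the data as fits into the free space of the buffer, and returns if everything fit: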

        size_t n = data.size();
        const char *p = data.data();

        // Fit as much as possible into buffer.
        size_t copy = std::min(n, kBufSize - pos_);
        memcpy(buf_ + pos_, p, copy);
        p += copy;
        n -= copy;
        pos_ += copy;
        if (n == 0)
        {
            return Status::OK();
        }

If the data did not all fit, the buffer, which is now full, is flushed to the file:

        // Can't fit in buffer, so need to do at least one write.
        Status s = FlushBuffered();
        if (!s.ok())
        {
            return s;
        }

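The size of the remaining data is then compared with the buffer size. If it is smaller than the buffer, it is copied into the now empty buffer; otherwise it is written directly to the file. WriteRaw(const char *p, size_t n) calls write to write the data to the file: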

        // Small writes go to buffer, large writes are written directly.
        if (n < kBufSize)
        {
            memcpy(buf_, p, n);
            pos_ = n;
            return Status::OK();
        }
        return WriteRaw(p, n);
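
The three paths through Append (everything fits, the remainder is re-buffered, the remainder is written directly) can be exercised with a toy model. ToyBufferedFile is a hypothetical class written for this illustration: the "file" is a std::string, error handling is omitted, and the buffer size kToyBufSize is deliberately tiny so that each branch is easy to trigger.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr size_t kToyBufSize = 8;  // illustrative only; the real buffer is much larger

// Toy stand-in for PosixWritableFile that follows the same Append logic.
class ToyBufferedFile {
 public:
  void Append(const std::string& data) {
    size_t n = data.size();
    const char* p = data.data();

    // Fit as much as possible into the buffer.
    assert(pos_ <= kToyBufSize);
    size_t copy = std::min(n, kToyBufSize - pos_);
    std::memcpy(buf_ + pos_, p, copy);
    p += copy;
    n -= copy;
    pos_ += copy;
    if (n == 0) return;

    // The buffer is full: flush it, then decide where the rest goes.
    FlushBuffered();
    if (n < kToyBufSize) {
      std::memcpy(buf_, p, n);  // small remainder goes to the buffer
      pos_ = n;
      return;
    }
    WriteRaw(p, n);  // large remainder bypasses the buffer
  }

  void Flush() { FlushBuffered(); }

  const std::string& file() const { return file_; }
  size_t buffered() const { return pos_; }
  int raw_writes() const { return raw_writes_; }

 private:
  void FlushBuffered() {
    WriteRaw(buf_, pos_);
    pos_ = 0;
  }

  void WriteRaw(const char* p, size_t n) {
    if (n == 0) return;
    file_.append(p, n);
    raw_writes_++;
  }

  std::string file_;       // stands in for the file behind fd_
  char buf_[kToyBufSize];  // buf_[0, pos_-1] holds data not yet written
  size_t pos_ = 0;
  int raw_writes_ = 0;
};
```

Appending 3 bytes leaves the file untouched; appending 7 more fills the buffer, flushes 8 bytes and re-buffers the last 2; appending 16 more flushes the refilled buffer and writes the remaining 10 bytes in one direct write.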

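Sync forces the parts of the file that have not yet reached the disk to be written out instead of staying in memory, which guarantees durability. It first calls SyncDirIfManifest; if the file is a MANIFEST file, that function calls fsync on the containing directory so that new files referred to by the manifest are present in the filesystem: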

        // Ensure new files referred to by the manifest are in the filesystem.
        Status s = SyncDirIfManifest();
        if (!s.ok())
        {
            return s;
        }

Then the data in the buffer is flushed to the file:

        s = FlushBuffered();

Finally the file's data is forced to disk with fdatasync:

        if (s.ok())
        {
            if (fdatasync(fd_) != 0)
            {
                s = PosixError(filename_, errno);
            }
        }
        return s;

The BlockHandle class

The BlockHandle class encapsulates a pointer to a block inside an sstable:

class BlockHandle
{
  public:
    BlockHandle();

    // The offset of the block in the file.
    uint64_t offset() const { return offset_; }
    void set_offset(uint64_t offset) { offset_ = offset; }

    // The size of the stored block
    uint64_t size() const { return size_; }
    void set_size(uint64_t size) { size_ = size; }

    void EncodeTo(std::string *dst) const;
    Status DecodeFrom(Slice *input);

    // Maximum encoding length of a BlockHandle
    enum
    {
        kMaxEncodedLength = 10 + 10
    };

  private:
    uint64_t offset_;
    uint64_t size_;
};

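Here offset_ is the offset of the block in the file and size_ is the size of the block. EncodeTo appends the encoded BlockHandle to a string, and DecodeFrom decodes a BlockHandle from the data wrapped in a Slice.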
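
The value kMaxEncodedLength = 10 + 10 follows from the encoding: both fields are stored as varint64, and a 64-bit value needs at most 10 bytes at 7 payload bits per byte. The round trip can be sketched as below. ToyHandle, EncodeHandle and DecodeHandle are hypothetical names for this illustration and operate on std::string rather than Slice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct ToyHandle {
  uint64_t offset = 0;
  uint64_t size = 0;
};

// Appends v in varint64 form: 7 payload bits per byte, high bit = "more".
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Reads one varint64 starting at *pos; returns false on truncated input.
bool GetVarint64(const std::string& src, size_t* pos, uint64_t* value) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *value = result;
      return true;
    }
  }
  return false;
}

// Encodes offset and size back to back, as EncodeTo is described to do.
std::string EncodeHandle(const ToyHandle& h) {
  std::string dst;
  PutVarint64(&dst, h.offset);
  PutVarint64(&dst, h.size);
  return dst;
}

// Decodes a handle from the front of src; returns false if src is malformed.
bool DecodeHandle(const std::string& src, ToyHandle* h) {
  size_t pos = 0;
  return GetVarint64(src, &pos, &h->offset) && GetVarint64(src, &pos, &h->size);
}
```

Small offsets and sizes therefore cost only a few bytes per index entry, while the worst case stays bounded at 20 bytes.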

Member functions of TableBuilder

First, the Add function:

void TableBuilder::Add(const Slice &key, const Slice &value)

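Add first checks whether an index entry is pending for index_block. If so, it calls FindShortestSeparator to compute the index key of the previous block from last_key and the new key, encodes pending_handle into a string to serve as the value, and adds the pair to index_block: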

    Rep *r = rep_;
    assert(!r->closed);
    if (!ok())
        return;
    if (r->num_entries > 0)
    {
        assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
    }

    if (r->pending_index_entry)
    {
        assert(r->data_block.empty());
        r->options.comparator->FindShortestSeparator(&r->last_key, key);
        std::string handle_encoding;
        r->pending_handle.EncodeTo(&handle_encoding);
        r->index_block.Add(r->last_key, Slice(handle_encoding));
        r->pending_index_entry = false;
    }

Next, the key is added to filter_block:

    if (r->filter_block != nullptr)
    {
        r->filter_block->AddKey(key);
    }

Then the actual KV pair is stored in data_block:

    r->last_key.assign(key.data(), key.size());
    r->num_entries++;
    r->data_block.Add(key, value);

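Finally, if the estimated size of the current data_block has reached the configured block_size, Flush is called to write the block to the file object: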

    const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
    if (estimated_block_size >= r->options.block_size)
    {
        Flush();
    }

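FindShortestSeparator deserves attention here. To save space in the index block, LevelDB does not store last_key itself as the index key; any key that separates the two blocks is sufficient. For example, blp is enough to separate block from blrck.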
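
The separator rule can be sketched for plain byte strings as follows. ShortestSeparator is a hypothetical free function modeled on the behavior of a bytewise comparator; it ignores the internal-key suffix that LevelDB's real index keys carry, so it is an approximation of the idea rather than the library's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Shortens *start to a key k with *start <= k < limit (bytewise order) when
// bumping the first differing byte allows it; otherwise leaves *start as is.
void ShortestSeparator(std::string* start, const std::string& limit) {
  const size_t min_length = std::min(start->size(), limit.size());
  size_t diff = 0;
  while (diff < min_length && (*start)[diff] == limit[diff]) {
    diff++;
  }
  if (diff >= min_length) {
    return;  // one key is a prefix of the other: nothing to shorten
  }
  const uint8_t byte = static_cast<uint8_t>((*start)[diff]);
  if (byte < 0xff && byte + 1 < static_cast<uint8_t>(limit[diff])) {
    (*start)[diff] = static_cast<char>(byte + 1);
    start->resize(diff + 1);  // drop everything after the bumped byte
    assert(*start < limit);
  }
}
```

With start = "block" and limit = "blrck" the common prefix is "bl", the first differing byte 'o' is bumped to 'p', and the result is "blp". When the differing bytes are adjacent, as in "abc" and "abd", no shorter separator exists under this rule and the key is left unchanged.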

Add calls Flush:

void TableBuilder::Flush()

Flush first calls WriteBlock to write the datablock that has not yet been written to the file object:

    Rep *r = rep_;
    assert(!r->closed);
    if (!ok())
        return;
    if (r->data_block.empty())
        return;
    assert(!r->pending_index_entry);
    WriteBlock(&r->data_block, &r->pending_handle);

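If the write succeeded, pending_index_entry is set to true and r->file->Flush writes the contents of the file buffer (encapsulated by WritableFile) to the file. The filter_block, if there is one, is then told the offset at which the next datablock starts: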

    if (ok())
    {
        r->pending_index_entry = true;
        r->status = r->file->Flush();
    }
    if (r->filter_block != nullptr)
    {
        r->filter_block->StartBlock(r->offset);
    }
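
The interplay between Add, Flush and the deferred index entry can be modeled in a few lines. ToyTableBuilder is a hypothetical class for illustration only: a block is just the concatenation of keys and values, there is no trailer or compression, and the index key is last_key unchanged where the real code would shorten it with FindShortestSeparator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class ToyTableBuilder {
 public:
  explicit ToyTableBuilder(size_t block_size) : block_size_(block_size) {}

  void Add(const std::string& key, const std::string& value) {
    assert(last_key_.empty() || key > last_key_);  // keys must be increasing
    if (pending_index_entry_) {
      // The first key of the next block is known now, so the entry for the
      // previous block can be emitted. The real code shortens last_key_ here.
      assert(data_block_.empty());
      index_.emplace_back(last_key_, pending_offset_);
      pending_index_entry_ = false;
    }
    last_key_ = key;
    data_block_ += key + value;
    if (data_block_.size() >= block_size_) {
      Flush();
    }
  }

  void Flush() {
    if (data_block_.empty()) return;
    assert(!pending_index_entry_);
    pending_offset_ = offset_;  // remember where this block starts
    offset_ += data_block_.size();
    data_block_.clear();
    pending_index_entry_ = true;
  }

  void Finish() {
    Flush();
    if (pending_index_entry_) {
      // No next key exists; the real code calls FindShortSuccessor here.
      index_.emplace_back(last_key_, pending_offset_);
      pending_index_entry_ = false;
    }
  }

  const std::vector<std::pair<std::string, size_t>>& index() const {
    return index_;
  }

 private:
  size_t block_size_;
  size_t offset_ = 0;
  size_t pending_offset_ = 0;
  bool pending_index_entry_ = false;
  std::string data_block_;
  std::string last_key_;
  std::vector<std::pair<std::string, size_t>> index_;
};
```

After a block is flushed the index stays unchanged until the next Add or Finish call, which is the invariant stated in the Rep comment: pending_index_entry is true only while data_block is empty.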

Flush calls WriteBlock:

void TableBuilder::WriteBlock(BlockBuilder *block, BlockHandle *handle)

WriteBlock first calls block->Finish() to append the restarts array and num_of_restarts to the block:

    // File format contains a sequence of blocks where each block has:
    //    block_data: uint8[n]
    //    type: uint8
    //    crc: uint32
    assert(ok());
    Rep *r = rep_;
    Slice raw = block->Finish();

It then compresses the block according to the configured option:

    Slice block_contents;
    CompressionType type = r->options.compression;
    // TODO(postrelease): Support more compression options: zlib?
    switch (type)
    {
    case kNoCompression:
        block_contents = raw;
        break;

    case kSnappyCompression:
    {
        std::string *compressed = &r->compressed_output;
        if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
            compressed->size() < raw.size() - (raw.size() / 8u))
        {
            block_contents = *compressed;
        }
        else
        {
            // Snappy not supported, or compressed less than 12.5%, so just
            // store uncompressed form
            block_contents = raw;
            type = kNoCompression;
        }
        break;
    }
    }
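
The size test in the Snappy branch uses integer arithmetic, so the compressed form is kept only when it is strictly smaller than the raw size minus one eighth of it. A minimal sketch of that predicate, with WorthStoringCompressed as a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the size check in WriteBlock: keep the compressed form only if it
// saves more than raw_size / 8 bytes (about 12.5%).
bool WorthStoringCompressed(size_t raw_size, size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For a 4096-byte block the cutoff is 4096 - 512 = 3584, so a 3583-byte result is stored compressed while a 3584-byte result falls back to kNoCompression.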

Finally it writes the block to the file object through WriteRawBlock:

    WriteRawBlock(block_contents, type, handle);
    r->compressed_output.clear();
    block->Reset();

WriteBlock calls WriteRawBlock:

void TableBuilder::WriteRawBlock(const Slice &block_contents,
                                 CompressionType type,
                                 BlockHandle *handle)

WriteRawBlock records the block's offset and size in handle, then uses r->file->Append to append the block data (everything except type and crc) to the file object:

    Rep *r = rep_;
    handle->set_offset(r->offset);
    handle->set_size(block_contents.size());
    r->status = r->file->Append(block_contents);

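It then builds the trailer, which consists of type and a masked crc covering both the block contents and type, appends it to the file, and advances offset: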

    if (r->status.ok())
    {
        char trailer[kBlockTrailerSize];
        trailer[0] = type;
        uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
        crc = crc32c::Extend(crc, trailer, 1); // Extend crc to cover block type
        EncodeFixed32(trailer + 1, crc32c::Mask(crc));
        r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));
        if (r->status.ok())
        {
            r->offset += block_contents.size() + kBlockTrailerSize;
        }
    }
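
The trailer computation can be reproduced without LevelDB's crc32c module. The sketch below uses a slow bit-by-bit CRC32C in place of the library's table-driven version; Crc32cExtend, Mask, Unmask and BuildTrailer are names chosen for this example, and the mask constant and rotation are assumed to match util/crc32c.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kTrailerSize = 5;  // 1-byte type + 4-byte masked crc
constexpr uint32_t kMaskDelta = 0xa282ead8u;

// Bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78).
// Passing a previous result as crc continues that checksum.
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; i++) {
    crc ^= static_cast<unsigned char>(data[i]);
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Rotate right by 15 bits and add a constant, so that a stored crc does not
// look like the crc of the bytes that contain it.
uint32_t Mask(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + kMaskDelta;
}

uint32_t Unmask(uint32_t masked) {
  const uint32_t rot = masked - kMaskDelta;
  return (rot >> 17) | (rot << 15);
}

// Builds the 5-byte trailer for a block with the given compression type.
std::string BuildTrailer(const std::string& block_contents, unsigned char type) {
  const char t = static_cast<char>(type);
  uint32_t crc = Crc32cExtend(0, block_contents.data(), block_contents.size());
  crc = Crc32cExtend(crc, &t, 1);  // the checksum also covers the type byte
  const uint32_t masked = Mask(crc);
  std::string trailer(1, t);
  for (int i = 0; i < 4; i++) {  // little-endian fixed32
    trailer.push_back(static_cast<char>((masked >> (8 * i)) & 0xff));
  }
  assert(trailer.size() == kTrailerSize);
  return trailer;
}
```

A reader verifies a block by unmasking the stored value and comparing it with the CRC32C of the block contents followed by the type byte.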

Next, the Finish function:

Status TableBuilder::Finish()
{
    Rep *r = rep_;
    Flush();
    assert(!r->closed);
    r->closed = true;

    BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;

    // Write filter block
    if (ok() && r->filter_block != nullptr)
    {
        WriteRawBlock(r->filter_block->Finish(), kNoCompression,
                      &filter_block_handle);
    }

    // Write metaindex block
    if (ok())
    {
        BlockBuilder meta_index_block(&r->options);
        if (r->filter_block != nullptr)
        {
            // Add mapping from "filter.Name" to location of filter data
            std::string key = "filter.";
            key.append(r->options.filter_policy->Name());
            std::string handle_encoding;
            filter_block_handle.EncodeTo(&handle_encoding);
            meta_index_block.Add(key, handle_encoding);
        }

        // TODO(postrelease): Add stats and other meta blocks
        WriteBlock(&meta_index_block, &metaindex_block_handle);
    }

    // Write index block
    if (ok())
    {
        if (r->pending_index_entry)
        {
            r->options.comparator->FindShortSuccessor(&r->last_key);
            std::string handle_encoding;
            r->pending_handle.EncodeTo(&handle_encoding);
            r->index_block.Add(r->last_key, Slice(handle_encoding));
            r->pending_index_entry = false;
        }
        WriteBlock(&r->index_block, &index_block_handle);
    }

    // Write footer
    if (ok())
    {
        Footer footer;
        footer.set_metaindex_handle(metaindex_block_handle);
        footer.set_index_handle(index_block_handle);
        std::string footer_encoding;
        footer.EncodeTo(&footer_encoding);
        r->status = r->file->Append(footer_encoding);
        if (r->status.ok())
        {
            r->offset += footer_encoding.size();
        }
    }
    return r->status;
}

Finish first flushes the last datablock, then writes the filter block (if any), the metaindexblock and the indexblock, and finally the footer.

The BuildTable interface

LevelDB calls the BuildTable interface to build an sstable file:

Status BuildTable(const std::string &dbname,
                  Env *env,
                  const Options &options,
                  TableCache *table_cache,
                  Iterator *iter,
                  FileMetaData *meta)

A file object is created to hold the sstable:

    Status s;
    meta->file_size = 0;
    iter->SeekToFirst();

    std::string fname = TableFileName(dbname, meta->number);
    if (iter->Valid())
    {
        WritableFile *file;
        s = env->NewWritableFile(fname, &file);
        if (!s.ok())
        {
            return s;
        }

A TableBuilder object is created to build the sstable, and the iterator is used to walk the memtable and add each entry to the sstable in order:

        TableBuilder *builder = new TableBuilder(options, file);
        meta->smallest.DecodeFrom(iter->key());
        for (; iter->Valid(); iter->Next())
        {
            Slice key = iter->key();
            meta->largest.DecodeFrom(key);
            builder->Add(key, iter->value());
        }
        // Finish and check for builder errors
        s = builder->Finish();
        if (s.ok())
        {
            meta->file_size = builder->FileSize();
            assert(meta->file_size > 0);
        }
        delete builder;

The file is then forced to disk and closed:

        // Finish and check for file errors
        if (s.ok())
        {
            s = file->Sync();
        }
        if (s.ok())
        {
            s = file->Close();
        }
        delete file;
        file = nullptr;
