LevelDB Source Code Reading (3): The Put Operation

In the earlier post on installing and using leveldb on Linux, we wrote the following test program. The code and its output are shown below:

#include <iostream>

#include <string>

#include <assert.h>

#include "leveldb/db.h"    

using namespace std;

int main(void)

{       

    leveldb::DB      *db;

    leveldb::Options  options;

    options.create_if_missing = true;    

    // open

    leveldb::Status status = leveldb::DB::Open(options,"/tmp/testdb", &db);

    assert(status.ok());    

    string key = "name";

    string value = "chenqi";    

    // write

    status = db->Put(leveldb::WriteOptions(), key, value);

    assert(status.ok());    

    // read

    status = db->Get(leveldb::ReadOptions(), key, &value);

    assert(status.ok());    

    cout<<value<<endl;    

    // delete

    status = db->Delete(leveldb::WriteOptions(), key);

    assert(status.ok());        

    status = db->Get(leveldb::ReadOptions(),key, &value);

    if(!status.ok()) {

        cerr<<key<<"    "<<status.ToString()<<endl;

    } else {

        cout<<key<<"==="<<value<<endl;

    }   

    // close

    delete db;    

    return 0;

}

chenqi

name    NotFound:

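The entry points of LevelDB's write path are DBImpl::Put and DBImpl::Delete in db_impl.cc under the db directory. Both functions are thin wrappers around DBImpl::Write: each one packs its single operation into a WriteBatch and passes it to DBImpl::Write. In other words, LevelDB internally treats a single write as a batch write that contains exactly one operation. The source is as follows: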

// Convenience methods

Status DBImpl::Put(const WriteOptions& o, const Slice& key, const Slice& val) {

  return DB::Put(o, key, val);

}

Status DBImpl::Delete(const WriteOptions& options, const Slice& key) {

  return DB::Delete(options, key);

}

// Default implementations of convenience methods that subclasses of DB

// can call if they wish

Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {

  WriteBatch batch;

  batch.Put(key, value);

  return Write(opt, &batch);

}

Status DB::Delete(const WriteOptions& opt, const Slice& key) {

  WriteBatch batch;

  batch.Delete(key);

  return Write(opt, &batch);

}
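The wrapping shown above can be illustrated with a small self-contained model. This is only a sketch: ToyOp, ToyBatch and ToyDB are hypothetical names invented for the illustration, and a std::map stands in for the real storage. It shows the same shape as the source: one real write path that takes a batch, plus Put and Delete as one-entry batches.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy types for illustration only; these are not LevelDB classes.
struct ToyOp {
  bool is_delete;
  std::string key;
  std::string value;  // unused for deletes
};

class ToyBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    ops_.push_back(ToyOp{false, key, value});
  }
  void Delete(const std::string& key) {
    ops_.push_back(ToyOp{true, key, std::string()});
  }
  const std::vector<ToyOp>& ops() const { return ops_; }

 private:
  std::vector<ToyOp> ops_;
};

class ToyDB {
 public:
  // The single write path: apply every operation of the batch in order.
  void Write(const ToyBatch& batch) {
    for (const ToyOp& op : batch.ops()) {
      if (op.is_delete) {
        table_.erase(op.key);
      } else {
        table_[op.key] = op.value;
      }
    }
  }
  // Convenience wrappers: a single Put or Delete is a batch of size one.
  void Put(const std::string& key, const std::string& value) {
    ToyBatch batch;
    batch.Put(key, value);
    Write(batch);
  }
  void Delete(const std::string& key) {
    ToyBatch batch;
    batch.Delete(key);
    Write(batch);
  }
  // Returns true and fills *value if the key is present.
  bool Get(const std::string& key, std::string* value) const {
    assert(value != nullptr);
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<std::string, std::string> table_;
};
```

Because everything funnels through Write, any logic added there (logging, sequencing, grouping) automatically covers single writes as well, which is the design the real code relies on.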

DBImpl::Put and DBImpl::Delete both end up calling DBImpl::Write to perform the write. The source of DBImpl::Write is as follows:

Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {

  Writer w(&mutex_);

  w.batch = my_batch;

  w.sync = options.sync;

  w.done = false;

  MutexLock l(&mutex_);

  writers_.push_back(&w);

  while (!w.done && &w != writers_.front()) {

    w.cv.Wait();

  }

  if (w.done) {

    return w.status;

  }

  // May temporarily unlock and wait.

  Status status = MakeRoomForWrite(my_batch == NULL);

  uint64_t last_sequence = versions_->LastSequence();

  Writer* last_writer = &w;

  if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions

    WriteBatch* updates = BuildBatchGroup(&last_writer);

    WriteBatchInternal::SetSequence(updates, last_sequence + 1);

    last_sequence += WriteBatchInternal::Count(updates);

    // Add to log and apply to memtable.  We can release the lock

    // during this phase since &w is currently responsible for logging

    // and protects against concurrent loggers and concurrent writes

    // into mem_.

    {

      mutex_.Unlock();

      status = log_->AddRecord(WriteBatchInternal::Contents(updates));

      bool sync_error = false;

      if (status.ok() && options.sync) {

        status = logfile_->Sync();

        if (!status.ok()) {

          sync_error = true;

        }

      }

      if (status.ok()) {

        status = WriteBatchInternal::InsertInto(updates, mem_);

      }

      mutex_.Lock();

      if (sync_error) {

        // The state of the log file is indeterminate: the log record we

        // just added may or may not show up when the DB is re-opened.

        // So we force the DB into a mode where all future writes fail.

        RecordBackgroundError(status);

      }

    }

    if (updates == tmp_batch_) tmp_batch_->Clear();

    versions_->SetLastSequence(last_sequence);

  }

  while (true) {

    Writer* ready = writers_.front();

    writers_.pop_front();

    if (ready != &w) {

      ready->status = status;

      ready->done = true;

      ready->cv.Signal();

    }

    if (ready == last_writer) break;

  }

  // Notify new head of write queue

  if (!writers_.empty()) {

    writers_.front()->cv.Signal();

  }

  return status;

}

Let us analyze it piece by piece.

  Writer w(&mutex_);  // a Writer can be thought of as one pending write task

  w.batch = my_batch;

  w.sync = options.sync;

  w.done = false;

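  MutexLock l(&mutex_);  // locks in the constructor, unlocks in the destructor

  writers_.push_back(&w);  // append w to the writers_ queue

  // Producer-consumer pattern

  while (!w.done && &w != writers_.front()) {

    w.cv.Wait();  // the thread may be woken up more than once

  }

  // Writes may be merged and handled by another thread, so the write may already be done by the time this thread wakes up; if so, return directly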

  if (w.done) {

    return w.status;

  }

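After MutexLock l takes the lock, a thread that reaches w.cv.Wait() releases the lock while it waits and reacquires it when it is signaled. The effect is that threads submitting writes push_back into the queue one after another, but only the thread at the head of the queue is allowed to continue. The goal is to merge as many queued write requests as possible into one large batch to improve throughput. The merge takes the sync option into account: as BuildBatchGroup shows later, a sync write is never merged into a group led by a non-sync write.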

Status status = MakeRoomForWrite(my_batch == NULL);

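Next, MakeRoomForWrite is called to check whether the write may proceed. If a background compaction has failed, it returns that error. If the number of level-0 files has reached kL0_SlowdownWritesTrigger (default 8), the write is delayed by 1 ms, at most once per write. If the memtable is not full, it returns immediately and the write may proceed. If the memtable is full and the immutable memtable has not been compacted yet, it blocks until the background compaction finishes. If the number of level-0 files has reached kL0_StopWritesTrigger (default 12), it also blocks and waits for the background compaction. Otherwise it turns the current memtable into the immutable memtable, creates a new log file and a new memtable, and loops again, after which the write can proceed. The source is as follows: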

// REQUIRES: mutex_ is held

// REQUIRES: this thread is currently at the front of the writer queue

Status DBImpl::MakeRoomForWrite(bool force) {

  mutex_.AssertHeld();

  assert(!writers_.empty());

  bool allow_delay = !force;

  Status s;

  while (true) {

    if (!bg_error_.ok()) {

      // Yield previous error

      s = bg_error_;  // a background compaction failed

      break;

    } else if (

        allow_delay &&

        versions_->NumLevelFiles(0) >= config::kL0_SlowdownWritesTrigger) {

      // We are getting close to hitting a hard limit on the number of

      // L0 files.  Rather than delaying a single write by several

      // seconds when we hit the hard limit, start delaying each

      // individual write by 1ms to reduce latency variance.  Also,

      // this delay hands over some CPU to the compaction thread in

      // case it is sharing the same core as the writer.

      mutex_.Unlock();

      env_->SleepForMicroseconds(1000);

      allow_delay = false;  // Do not delay a single write more than once

      mutex_.Lock();

    } else if (!force &&

               (mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) {

      // There is room in current memtable, so the write can proceed

      break;

    } else if (imm_ != NULL) {

      // We have filled up the current memtable, but the previous

      // one is still being compacted, so we wait.

      Log(options_.info_log, "Current memtable full; waiting...\n");

      bg_cv_.Wait();

    } else if (versions_->NumLevelFiles(0) >= config::kL0_StopWritesTrigger) {

      // There are too many level-0 files.

      Log(options_.info_log, "Too many L0 files; waiting...\n");

      bg_cv_.Wait();

    } else {

      // Attempt to switch to a new memtable and trigger compaction of old

      assert(versions_->PrevLogNumber() == 0);

      uint64_t new_log_number = versions_->NewFileNumber();

      WritableFile* lfile = NULL;

      s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);

      if (!s.ok()) {

        // Avoid chewing through file number space in a tight loop.

        versions_->ReuseFileNumber(new_log_number);

        break;

      }

      delete log_;

      delete logfile_;

      logfile_ = lfile;

      logfile_number_ = new_log_number;

      log_ = new log::Writer(lfile);

      imm_ = mem_;

      has_imm_.Release_Store(imm_);

      mem_ = new MemTable(internal_comparator_);

      mem_->Ref();

      force = false;   // Do not force another compaction if have room

      MaybeScheduleCompaction();  // check whether a background compaction should be started

    }

  }

  return s;

}
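The order in which MakeRoomForWrite tests its conditions matters, so it may help to condense one loop iteration into a pure decision function. The following is a minimal sketch under stated assumptions: ToyState, Action and Decide are hypothetical names, the state is passed in as plain values instead of being read from the DB, and the two thresholds use the default values 8 and 12 mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical snapshot of the state that one loop iteration inspects.
struct ToyState {
  bool bg_error;                  // a background compaction has failed
  bool allow_delay;               // this write has not been delayed yet
  bool force;                     // caller passed a NULL batch
  int level0_files;               // number of level-0 table files
  std::size_t mem_usage;          // approximate bytes used by the memtable
  std::size_t write_buffer_size;  // configured memtable size limit
  bool has_imm;                   // an immutable memtable is still pending
};

enum class Action {
  kFail,               // return the background error
  kSleep1ms,           // delay this write once, then re-check
  kProceed,            // there is room, the write may go ahead
  kWaitForCompaction,  // block on the background condition variable
  kSwitchMemtable      // rotate memtable and log, then re-check
};

const int kSlowdownTrigger = 8;  // stands in for config::kL0_SlowdownWritesTrigger
const int kStopTrigger = 12;     // stands in for config::kL0_StopWritesTrigger

// Mirrors the if / else-if chain: the first matching condition wins.
Action Decide(const ToyState& s) {
  assert(kSlowdownTrigger <= kStopTrigger);
  if (s.bg_error) return Action::kFail;
  if (s.allow_delay && s.level0_files >= kSlowdownTrigger) {
    return Action::kSleep1ms;
  }
  if (!s.force && s.mem_usage <= s.write_buffer_size) return Action::kProceed;
  if (s.has_imm) return Action::kWaitForCompaction;
  if (s.level0_files >= kStopTrigger) return Action::kWaitForCompaction;
  return Action::kSwitchMemtable;
}
```

One consequence that is easy to miss when reading the original loop: the slowdown check comes before the "memtable has room" check, so a write can be delayed by 1 ms even though the memtable could accept it right away.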

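A few optimizations here are worth noting:

1. When the number of level-0 tables is getting close to the threshold, sleep for 1 ms.

When the file system reports a write as complete, the data has not necessarily reached the disk. A count close to the threshold means the next compaction is about to happen; if too much data has piled up in the file system buffers by then, the disk can suddenly be saturated. A compaction may also already be running, in which case sleeping yields CPU cycles to that more important task.

2. The log has the highest priority and is always written, whereas immutable memtables can only be written out one at a time.

That concludes the analysis of MakeRoomForWrite. Back in DBImpl::Write, we continue reading. The grouped writes are written in one go: first to the log file, then to the memtable.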

  uint64_t last_sequence = versions_->LastSequence();

  Writer* last_writer = &w;

  if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions

    WriteBatch* updates = BuildBatchGroup(&last_writer);

    WriteBatchInternal::SetSequence(updates, last_sequence + 1);  // set the starting sequence of the write batch

    last_sequence += WriteBatchInternal::Count(updates);  // the sequence after the write succeeds

    // Add to log and apply to memtable.  We can release the lock

    // during this phase since &w is currently responsible for logging

    // and protects against concurrent loggers and concurrent writes

    // into mem_.

    {

      mutex_.Unlock();

      status = log_->AddRecord(WriteBatchInternal::Contents(updates));  // write to the log file

      bool sync_error = false;

      if (status.ok() && options.sync) {

        status = logfile_->Sync();

        if (!status.ok()) {

          sync_error = true;

        }

      }

      if (status.ok()) {

        status = WriteBatchInternal::InsertInto(updates, mem_);  // write to the memtable

      }

      mutex_.Lock();

      if (sync_error) {

        // The state of the log file is indeterminate: the log record we

        // just added may or may not show up when the DB is re-opened.

        // So we force the DB into a mode where all future writes fail.

        RecordBackgroundError(status);

      }

    }

    if (updates == tmp_batch_) tmp_batch_->Clear();

    versions_->SetLastSequence(last_sequence);  // record the newest sequence

  }
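The sequence-number bookkeeping in this snippet is plain arithmetic, and a worked example may make it easier to follow. The helper below is a hypothetical sketch (SequenceRange and AssignSequences are not LevelDB names): a batch with N operations starts at last_sequence + 1, and the stored last sequence advances by N.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic in DBImpl::Write.
struct SequenceRange {
  uint64_t first;  // sequence stamped on the batch (its first operation)
  uint64_t last;   // value passed to SetLastSequence after the write
};

SequenceRange AssignSequences(uint64_t last_sequence, uint32_t op_count) {
  assert(op_count > 0);  // an empty batch would consume no sequence numbers
  SequenceRange range;
  range.first = last_sequence + 1;
  range.last = last_sequence + op_count;
  return range;
}
```

For example, if the last sequence is 100 and the merged batch holds 3 operations, the batch is stamped with 101 and the last sequence becomes 103; the next batch then starts at 104.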

This part calls BuildBatchGroup, which collects as many write tasks as it can from the queue of waiting writers. The source of BuildBatchGroup is as follows:

// REQUIRES: Writer list must be non-empty

// REQUIRES: First writer must have a non-NULL batch

WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {

  assert(!writers_.empty());

  Writer* first = writers_.front();

  WriteBatch* result = first->batch;

  assert(result != NULL);

  size_t size = WriteBatchInternal::ByteSize(first->batch);

  // Allow the group to grow up to a maximum size, but if the

  // original write is small, limit the growth so we do not slow

  // down the small write too much.

  size_t max_size = 1 << 20;

  if (size <= (128<<10)) {

    max_size = size + (128<<10);

  }

  *last_writer = first;

  std::deque<Writer*>::iterator iter = writers_.begin();

  ++iter;  // Advance past "first"

  for (; iter != writers_.end(); ++iter) {

    Writer* w = *iter;

    if (w->sync && !first->sync) {

      // Do not include a sync write into a batch handled by a non-sync write.

      break;

    }

    if (w->batch != NULL) {

      size += WriteBatchInternal::ByteSize(w->batch);

      if (size > max_size) {

        // Do not make batch too big

        break;

      }

      // Append to *result
      // Keep the merged write requests in the member variable tmp_batch_ so they do not get mixed into the caller's batch
      if (result == first->batch) {

        // Switch to temporary batch instead of disturbing caller's batch

        result = tmp_batch_;

        assert(WriteBatchInternal::Count(result) == 0);

        WriteBatchInternal::Append(result, first->batch);

      }

      WriteBatchInternal::Append(result, w->batch);

    }

    *last_writer = w;

  }

  return result;

}

The WriteBatchInternal code called here lives in write_batch.cc under the db directory.
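The two rules that bound a group in BuildBatchGroup (the size cap and the sync rule) can be tried out in isolation. The sketch below is an illustration only: ToyWriter, MaxGroupBytes and GroupLength are hypothetical names, each queued writer is reduced to a byte count and a sync flag, and every writer is assumed to carry a non-NULL batch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a queued Writer: only what grouping looks at.
struct ToyWriter {
  std::size_t batch_bytes;
  bool sync;
};

// Upper bound on the merged batch size for a given leader batch size.
std::size_t MaxGroupBytes(std::size_t first_bytes) {
  std::size_t max_size = 1 << 20;    // 1 MiB cap
  if (first_bytes <= (128 << 10)) {  // small leader: 128 KiB or less
    max_size = first_bytes + (128 << 10);
  }
  return max_size;
}

// Returns how many writers from the front of the queue join the group.
std::size_t GroupLength(const std::vector<ToyWriter>& queue) {
  assert(!queue.empty());
  const ToyWriter& first = queue[0];
  std::size_t size = first.batch_bytes;
  const std::size_t max_size = MaxGroupBytes(size);
  std::size_t count = 1;  // the leader is always part of its own group
  for (std::size_t i = 1; i < queue.size(); ++i) {
    // A sync write must not ride along with a non-sync leader.
    if (queue[i].sync && !first.sync) break;
    size += queue[i].batch_bytes;
    if (size > max_size) break;  // do not make the batch too big
    ++count;
  }
  return count;
}
```

Note that the sync rule is one-directional: a sync leader may absorb non-sync writes (they simply get a stronger guarantee than they asked for), but a non-sync leader stops at the first sync write in the queue.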

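Back in DBImpl::Write once more, the last piece of code walks the queue from the front and marks the finished tasks as done; if other tasks remain in the queue, it wakes up the first one.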

  while (true) {

    Writer* ready = writers_.front();

    writers_.pop_front();

    if (ready != &w) {

      ready->status = status;

      ready->done = true;

      ready->cv.Signal();

    }

    if (ready == last_writer) break;

  }

  // Notify new head of write queue

  if (!writers_.empty()) {

    writers_.front()->cv.Signal();

  }

  return status;
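Putting the pieces together, the leader/follower protocol of DBImpl::Write can be modeled with standard C++ primitives. This is a simplified sketch, not LevelDB's code: ToyGroupLog, Waiter and RunWriters are hypothetical names, std::mutex and std::condition_variable stand in for port::Mutex and port::CondVar, the "log" is an in-memory vector, and the size cap and sync rule of BuildBatchGroup are left out, so the leader simply takes everything that is queued.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical toy model of the writer queue; none of these names are LevelDB's.
class ToyGroupLog {
 public:
  // Blocks until `record` is in the log, appended either by this thread
  // (as leader) or by another leader that picked the record up.
  void Write(const std::string& record) {
    Waiter w;
    w.record = record;
    std::unique_lock<std::mutex> lock(mu_);
    queue_.push_back(&w);
    while (!w.done && &w != queue_.front()) {
      w.cv.wait(lock);  // releases mu_ while waiting, reacquires on wake-up
    }
    if (w.done) return;  // a leader already wrote our record

    // We are at the front: take everything queued right now as one group.
    std::vector<Waiter*> group(queue_.begin(), queue_.end());
    lock.unlock();  // only the front waiter gets here, so log_ has one writer
    for (Waiter* member : group) log_.push_back(member->record);
    ++group_commits_;
    lock.lock();

    // Pop the whole group and wake every follower in it.
    for (Waiter* member : group) {
      assert(queue_.front() == member);
      queue_.pop_front();
      if (member != &w) {
        member->done = true;
        member->cv.notify_one();
      }
    }
    // Hand leadership to the new head of the queue, if any.
    if (!queue_.empty()) queue_.front()->cv.notify_one();
  }

  // Call these only after all writer threads have finished.
  std::size_t size() const { return log_.size(); }
  std::size_t group_commits() const { return group_commits_; }

 private:
  struct Waiter {
    std::string record;
    bool done = false;
    std::condition_variable cv;
  };
  std::mutex mu_;
  std::deque<Waiter*> queue_;
  std::vector<std::string> log_;
  std::size_t group_commits_ = 0;
};

// Starts `threads` writers that each submit `per_thread` records.
void RunWriters(ToyGroupLog* log, int threads, int per_thread) {
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([log, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        log->Write(std::to_string(t) + ":" + std::to_string(i));
      }
    });
  }
  for (std::thread& th : pool) th.join();
}
```

Calling RunWriters(&log, 8, 50) on a fresh ToyGroupLog leaves log.size() at 400, while log.group_commits() depends on thread scheduling and may be lower than 400 whenever several records were written by one leader. As in the original, a Waiter lives on the stack of the thread that submitted it, which is safe because that thread cannot return before the leader has marked it done and released the mutex.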

References:

1.http://blog.csdn.net/joeyon1985/article/details/47154249

2.http://masutangu.com/2017/06/leveldb_1/

3.https://zhuanlan.zhihu.com/jimderestaurant?topic=LevelDB
