LevelDB的源码阅读(三) Put操作
在Linux上leveldb的安装和使用中我们写了这么一段测试代码,内容以及输出结果如下:
#include <iostream>
#include <string>
#include <assert.h>
#include "leveldb/db.h" using namespace std; int main(void)
{ leveldb::DB *db;
leveldb::Options options;
options.create_if_missing = true; // open
leveldb::Status status = leveldb::DB::Open(options,"/tmp/testdb", &db);
assert(status.ok()); string key = "name";
string value = "chenqi"; // write
status = db->Put(leveldb::WriteOptions(), key, value);
assert(status.ok()); // read
status = db->Get(leveldb::ReadOptions(), key, &value);
assert(status.ok()); cout<<value<<endl; // delete
status = db->Delete(leveldb::WriteOptions(), key);
assert(status.ok()); status = db->Get(leveldb::ReadOptions(),key, &value);
if(!status.ok()) {
cerr<<key<<" "<<status.ToString()<<endl;
} else {
cout<<key<<"==="<<value<<endl;
} // close
delete db; return ;
}
chenqi
name NotFound:
Leveldb的写数据流程入口为db文件夹下db_impl.cc文件中的DBImpl::Put和DBImpl::Delete,这两个文件是DBImpl::Write接口的封装,将写操作封装成WriteBatch传入DBImpl::Write进行操作,可见leveldb在内部是将单独的写操作也作为只有一个操作的批量写操作来进行的.源码内容如下:
// Convenience methods
Status DBImpl::Put(const WriteOptions& o, const Slice& key, const Slice& val) {
return DB::Put(o, key, val);
} Status DBImpl::Delete(const WriteOptions& options, const Slice& key) {
return DB::Delete(options, key);
} // Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
WriteBatch batch;
batch.Put(key, value);
return Write(opt, &batch);
} Status DB::Delete(const WriteOptions& opt, const Slice& key) {
WriteBatch batch;
batch.Delete(key);
return Write(opt, &batch);
}
DBImpl::Put和DBImpl::Delete最终调用 DBImpl::Write完成写操作,DBImpl::Write源码内容如下:
Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
Writer w(&mutex_);
w.batch = my_batch;
w.sync = options.sync;
w.done = false; MutexLock l(&mutex_);
writers_.push_back(&w);
while (!w.done && &w != writers_.front()) {
w.cv.Wait();
}
if (w.done) {
return w.status;
} // May temporarily unlock and wait.
Status status = MakeRoomForWrite(my_batch == NULL);
uint64_t last_sequence = versions_->LastSequence();
Writer* last_writer = &w;
if (status.ok() && my_batch != NULL) { // NULL batch is for compactions
WriteBatch* updates = BuildBatchGroup(&last_writer);
WriteBatchInternal::SetSequence(updates, last_sequence + );
last_sequence += WriteBatchInternal::Count(updates); // Add to log and apply to memtable. We can release the lock
// during this phase since &w is currently responsible for logging
// and protects against concurrent loggers and concurrent writes
// into mem_.
{
mutex_.Unlock();
status = log_->AddRecord(WriteBatchInternal::Contents(updates));
bool sync_error = false;
if (status.ok() && options.sync) {
status = logfile_->Sync();
if (!status.ok()) {
sync_error = true;
}
}
if (status.ok()) {
status = WriteBatchInternal::InsertInto(updates, mem_);
}
mutex_.Lock();
if (sync_error) {
// The state of the log file is indeterminate: the log record we
// just added may or may not show up when the DB is re-opened.
// So we force the DB into a mode where all future writes fail.
RecordBackgroundError(status);
}
}
if (updates == tmp_batch_) tmp_batch_->Clear(); versions_->SetLastSequence(last_sequence);
} while (true) {
Writer* ready = writers_.front();
writers_.pop_front();
if (ready != &w) {
ready->status = status;
ready->done = true;
ready->cv.Signal();
}
if (ready == last_writer) break;
} // Notify new head of write queue
if (!writers_.empty()) {
writers_.front()->cv.Signal();
} return status;
}
以下逐段进行分析
Writer w(&mutex_);// writer可以理解为一个任务
w.batch = my_batch;
w.sync = options.sync;
w.done = false; MutexLock l(&mutex_);//构造时上锁, 析构时解锁
writers_.push_back(&w);// 把w推进writers_队列
// 生产者消费者模型
while (!w.done && &w != writers_.front()) {
w.cv.Wait();//线程可能多次被唤醒
}
//写操作有可能被合并处理,因此有可能取到的时候写入已经完成。完成的话直接返回
if (w.done) {
return w.status;
}
mutex l上锁之后, 到了"w.cv.Wait()"的时候, 会先释放锁等待, 然后收到signal时再次上锁. 这段代码的作用就是多线程在提交任务的时候, 一个接一个push_back进队列. 但只有位于队首的线程有资格继续运行下去. 目的是把尽可能多的写请求((要求sync选项一致))合并成一个大batch提升效率.
Status status = MakeRoomForWrite(my_batch == NULL);
接下来是调用MakeRoomForWrite查看是否允许写,如果后台Compact失败,则返回错误,如果level0的文件数达到配置的SlowdownWritesTrigger(默认为8),则对每个写操作都延迟1ms,如果level0的文件数达到配置的kL0_StopWritesTrigger(默认为12),则阻塞写操作,等待后台Compact结束.如果memtable不满,则直接返回,可写.如果memtable已满,并且immutable table不为空,阻塞,等待Compact结束.否则,将memtable改为immutable table,新建memtable,返回可写.源码内容如下:
// REQUIRES: mutex_ is held
// REQUIRES: this thread is currently at the front of the writer queue
Status DBImpl::MakeRoomForWrite(bool force) {
mutex_.AssertHeld();
assert(!writers_.empty());
bool allow_delay = !force;
Status s;
while (true) {
if (!bg_error_.ok()) {
// Yield previous error
s = bg_error_;//后台Compact出错
break;
} else if (
allow_delay &&
versions_->NumLevelFiles() >= config::kL0_SlowdownWritesTrigger) {
// We are getting close to hitting a hard limit on the number of
// L0 files. Rather than delaying a single write by several
// seconds when we hit the hard limit, start delaying each
// individual write by 1ms to reduce latency variance. Also,
// this delay hands over some CPU to the compaction thread in
// case it is sharing the same core as the writer.
mutex_.Unlock();
env_->SleepForMicroseconds();
allow_delay = false; // Do not delay a single write more than once
mutex_.Lock();
} else if (!force &&
(mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) {
// There is room in current memtable 也就是说可以写
break;
} else if (imm_ != NULL) {
// We have filled up the current memtable, but the previous
// one is still being compacted, so we wait.
Log(options_.info_log, "Current memtable full; waiting...\n");
bg_cv_.Wait();
} else if (versions_->NumLevelFiles() >= config::kL0_StopWritesTrigger) {
// There are too many level-0 files.
Log(options_.info_log, "Too many L0 files; waiting...\n");
bg_cv_.Wait();
} else {
// Attempt to switch to a new memtable and trigger compaction of old
assert(versions_->PrevLogNumber() == );
uint64_t new_log_number = versions_->NewFileNumber();
WritableFile* lfile = NULL;
s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
if (!s.ok()) {
// Avoid chewing through file number space in a tight loop.
versions_->ReuseFileNumber(new_log_number);
break;
}
delete log_;
delete logfile_;
logfile_ = lfile;
logfile_number_ = new_log_number;
log_ = new log::Writer(lfile);
imm_ = mem_;
has_imm_.Release_Store(imm_);
mem_ = new MemTable(internal_comparator_);
mem_->Ref();
force = false; // Do not force another compaction if have room
MaybeScheduleCompaction();//检查是否启动后台compact
}
}
return s;
}
这里有几点值得注意的优化:
1.在level 0的table数量快要接近阈值的时候, sleep 1ms.
因为文件系统表示写入完成并不一定真写到硬盘里了. 数量接近阈值说明快要进行下一次compaction了. 这时候如果文件系统的buffer积压了太多, 会造成硬盘一下子满负载. 还有可能已经在compact了, 这时候sleep就可以让CPU周期给更重要的任务.
2.log具有最高优先级无论如何都要写, 但immemtable只能一张一张写.
MakeRoomForWrite分析结束,我们重新回到DBImpl::Write往下看.写的时候一次性写入,首先写入log文件,然后写到memtable里.
uint64_t last_sequence = versions_->LastSequence();
Writer* last_writer = &w;
if (status.ok() && my_batch != NULL) { // NULL batch is for compactions
WriteBatch* updates = BuildBatchGroup(&last_writer);
WriteBatchInternal::SetSequence(updates, last_sequence + ); //设置writebatch的起始sequence
last_sequence += WriteBatchInternal::Count(updates); //写成功后的sequence // Add to log and apply to memtable. We can release the lock
// during this phase since &w is currently responsible for logging
// and protects against concurrent loggers and concurrent writes
// into mem_.
{
mutex_.Unlock();
status = log_->AddRecord(WriteBatchInternal::Contents(updates)); //写入日志文件
bool sync_error = false;
if (status.ok() && options.sync) {
status = logfile_->Sync();
if (!status.ok()) {
sync_error = true;
}
}
if (status.ok()) {
status = WriteBatchInternal::InsertInto(updates, mem_); //写入memtable
}
mutex_.Lock();
if (sync_error) {
// The state of the log file is indeterminate: the log record we
// just added may or may not show up when the DB is re-opened.
// So we force the DB into a mode where all future writes fail.
RecordBackgroundError(status);
}
}
if (updates == tmp_batch_) tmp_batch_->Clear(); versions_->SetLastSequence(last_sequence); //设置现在的最新更新sequence
}
这里调用了BuildBatchGroup,从等待写队列中获取尽可能多的写任务,BuildBatchGroup源码内容如下:
// REQUIRES: Writer list must be non-empty
// REQUIRES: First writer must have a non-NULL batch
WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
assert(!writers_.empty());
Writer* first = writers_.front();
WriteBatch* result = first->batch;
assert(result != NULL); size_t size = WriteBatchInternal::ByteSize(first->batch); // Allow the group to grow up to a maximum size, but if the
// original write is small, limit the growth so we do not slow
// down the small write too much.
size_t max_size = << ;
if (size <= (<<)) {
max_size = size + (<<);
} *last_writer = first;
std::deque<Writer*>::iterator iter = writers_.begin();
++iter; // Advance past "first"
for (; iter != writers_.end(); ++iter) {
Writer* w = *iter;
if (w->sync && !first->sync) {
// Do not include a sync write into a batch handled by a non-sync write.
break;
} if (w->batch != NULL) {
size += WriteBatchInternal::ByteSize(w->batch);
if (size > max_size) {
// Do not make batch too big
break;
} // Append to *result
// 把合并的写请求保存在成员变量 tmp_batch_ 中,避免和调用者的写请求混淆在一起
if (result == first->batch) {
// Switch to temporary batch instead of disturbing caller's batch
result = tmp_batch_;
assert(WriteBatchInternal::Count(result) == );
WriteBatchInternal::Append(result, first->batch);
}
WriteBatchInternal::Append(result, w->batch);
}
*last_writer = w;
}
return result;
}
调用到的WriteBatchInternal相关代码在db文件夹下write_batch.cc中。
再次回到DBImpl::Write看最后一段代码,从头开始检查队列, 把完成的任务标记为done, 如果队列还有别的任务, 继续唤醒第一个.
while (true) {
Writer* ready = writers_.front();
writers_.pop_front();
if (ready != &w) {
ready->status = status;
ready->done = true;
ready->cv.Signal();
}
if (ready == last_writer) break;
} // Notify new head of write queue
if (!writers_.empty()) {
writers_.front()->cv.Signal();
} return status;
参考文献:
1.http://blog.csdn.net/joeyon1985/article/details/47154249
2.http://masutangu.com/2017/06/leveldb_1/
3.https://zhuanlan.zhihu.com/jimderestaurant?topic=LevelDB
LevelDB的源码阅读(三) Put操作的更多相关文章
- 25 BasicUsageEnvironment0基本使用环境基类——Live555源码阅读(三)UsageEnvironment
25 BasicUsageEnvironment0基本使用环境基类——Live555源码阅读(三)UsageEnvironment 25 BasicUsageEnvironment0基本使用环境基类— ...
- 24 UsageEnvironment使用环境抽象基类——Live555源码阅读(三)UsageEnvironment
24 UsageEnvironment使用环境抽象基类——Live555源码阅读(三)UsageEnvironment 24 UsageEnvironment使用环境抽象基类——Live555源码阅读 ...
- 26 BasicUsageEnvironment基本使用环境——Live555源码阅读(三)UsageEnvironment
26 BasicUsageEnvironment基本使用环境--Live555源码阅读(三)UsageEnvironment 26 BasicUsageEnvironment基本使用环境--Live5 ...
- LevelDB的源码阅读(三) Get操作
在Linux上leveldb的安装和使用中我们写了这么一段测试代码,内容以及输出结果如下: #include <iostream> #include <string> #inc ...
- LevelDB的源码阅读(四) Compaction操作
leveldb的数据存储采用LSM的思想,将随机写入变为顺序写入,记录写入操作日志,一旦日志被以追加写的形式写入硬盘,就返回写入成功,由后台线程将写入日志作用于原有的磁盘文件生成新的磁盘数据.Leve ...
- LevelDB的源码阅读(二) Open操作
在Linux上leveldb的安装和使用中我们写了一个测试代码,内容如下: #include "leveldb/db.h" #include <cassert> #in ...
- LevelDB的源码阅读(一)
源码下载 git clone https://github.com/google/leveldb.git 项目结构 db/, 数据库逻辑 doc/, MD文档 helpers/, LevelDB内存版 ...
- SparkSQL(源码阅读三)
额,没忍住,想完全了解sparksql,毕竟一直在用嘛,想一次性搞清楚它,所以今天再多看点好了~ 曾几何时,有一个叫做shark的东西,它改了hive的源码...突然有一天,spark Sql突然出现 ...
- SpringMVC源码阅读(三)
先理一下Bean的初始化路线 org.springframework.beans.factory.support.AbstractBeanDefinitionReader public int loa ...
随机推荐
- angular自定义分页组件(实用)
功能描述:分页,点击按钮或者下一页获取分页接口,同时active到对应页码. html模块: <page page-count="totalPage" on-click-pa ...
- JavaScrpit判断横竖屏
JavaScript function setLandscapeClass(){ if(window.orientation === 90 || window.orientation === -90 ...
- 刚 安装 Oracle时,登录会出现的问题, ora-28000: the account is locked
来这个连接 就OK了: https://jingyan.baidu.com/article/48b37f8d1ec3fd1a6464883a.html 第一步: 进入cmd: 然后将下面的sql输入就 ...
- springMVC上传错误StandardMultipartHttpServletRequest
异常信息 java.lang.ClassCastException: org.springframework.security.web.servletapi.HttpServlet3RequestFa ...
- Android项目实战(三十六):给背景加上阴影效果
圆角背景大家应该经常用: 一个drawable资源文件 里面控制corner圆角 和solid填充色 <shape xmlns:android="http://schemas.and ...
- iOS 计时器三种定时器的用法NSTimer、CADisplayLink、GCD
原文:http://www.cocoachina.com/ios/20160919/17595.html DEMO链接
- MySQL时间操作的系统函数用法
我要查询获得当天凌晨30分的datetime值的方式:select ADDDATE(CURDATE(), INTERVAL TIME_TO_SEC(TIMEDIFF("00:30:00&qu ...
- 1_3 C语言解决求n!
求n!(n为键盘输入的任意整数值).要求分别用while语句和for语句实现 用while语句实现: #include <stdio.h> int main() { int n; scan ...
- AVL树的单双旋转操作
把必须重新平衡的节点称为å.对于二叉树,å的两棵子树的高度最多相差2,这种不平衡可能有四种情况: 对å的左儿子的左子树进行插入节点(左-左) 对å的左儿子的右子树进行插入节点(左-右) 对å的右儿子的 ...
- xamarin android alertdialog详解
说明一下:学习xamarin android一段时间,准备写一些xamarin android相关的例子,alertdialog也是使用的非常多得空间之一,非常感谢鸟巢上的小猪,我也是看着他写的教程学 ...