Lookup Store is mainly used in Paimon for Lookup Compaction and Lookup join. It converts remote columnar files into a local format suited to KV lookups.

Hash

https://github.com/linkedin/PalDB

Sort

https://github.com/dain/leveldb

https://github.com/apache/paimon/pull/3770

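Overall file structure:

Advantages over the Hash file

Written once, which avoids file merges
Written sequentially, preserving the original key order; later lookups that follow key order get better cache efficiency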

SortLookupStoreWriter

SortLookupStoreWriter#put

put

@Override
public void put(byte[] key, byte[] value) throws IOException {
    dataBlockWriter.add(key, value);
    if (bloomFilter != null) {
        bloomFilter.addHash(MurmurHashUtils.hashBytes(key));
    }
    lastKey = key;
    // flush once the BlockWriter holds more than blockSize bytes;
    // the default is cache-page-size = 64 KB
    if (dataBlockWriter.memory() > blockSize) {
        flush();
    }
    recordCount++;
}

flush

private void flush() throws IOException {
    if (dataBlockWriter.size() == 0) {
        return;
    }
    // write the data block to the data file and record its position and length
    BlockHandle blockHandle = writeBlock(dataBlockWriter);
    MemorySlice handleEncoding = writeBlockHandle(blockHandle);
    // add the BlockHandle to the index writer, which is itself a BlockWriter
    indexBlockWriter.add(lastKey, handleEncoding.copyBytes());
}

writeBlock

private BlockHandle writeBlock(BlockWriter blockWriter) throws IOException {
    // close the block: get its full contents; the BlockWriter's backing array
    // is not released here but reused for the next block
    MemorySlice block = blockWriter.finish();
    totalUncompressedSize += block.length();
    // attempt to compress the block
    BlockCompressionType blockCompressionType = BlockCompressionType.NONE;
    if (blockCompressor != null) {
        int maxCompressedSize = blockCompressor.getMaxCompressedSize(block.length());
        byte[] compressed = allocateReuseBytes(maxCompressedSize + 5);
        int offset = encodeInt(compressed, 0, block.length());
        int compressedSize =
                offset
                        + blockCompressor.compress(
                                block.getHeapMemory(),
                                block.offset(),
                                block.length(),
                                compressed,
                                offset);
        // Don't use the compressed data if it saves less than 12.5%.
        if (compressedSize < block.length() - (block.length() / 8)) {
            block = new MemorySlice(MemorySegment.wrap(compressed), 0, compressedSize);
            blockCompressionType = this.compressionType;
        }
    }
    totalCompressedSize += block.length();
    // create block trailer: every block is followed by a trailer that records
    // the compression type and a checksum computed by crc32c(...)
    BlockTrailer blockTrailer =
            new BlockTrailer(blockCompressionType, crc32c(block, blockCompressionType));
    MemorySlice trailer = BlockTrailer.writeBlockTrailer(blockTrailer);
    // create a handle to this block: BlockHandle records the block's
    // start position and length
    BlockHandle blockHandle = new BlockHandle(position, block.length());
    // write data: append the block to the file on disk
    writeSlice(block);
    // write trailer: 5 bytes
    writeSlice(trailer);
    // clean up state
    blockWriter.reset();
    return blockHandle;
}
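The compression cutoff in writeBlock can be checked with a small worked example. This is only a sketch of the comparison itself; the class and method names are hypothetical, and in the real code compressedSize also includes the encoded uncompressed-length prefix (offset).

```java
final class CompressionThresholdSketch {
    // Mirrors the check in writeBlock: keep the compressed bytes only if they are
    // smaller than the raw length minus one eighth of it (integer division).
    static boolean useCompressed(int rawLength, int compressedSize) {
        return compressedSize < rawLength - (rawLength / 8);
    }

    public static void main(String[] args) {
        int raw = 64 * 1024; // 65536 bytes, so the cutoff is 65536 - 8192 = 57344
        System.out.println(useCompressed(raw, 57343)); // true
        System.out.println(useCompressed(raw, 57344)); // false
    }
}
```

A block that does not clear the cutoff is written uncompressed, and its trailer records BlockCompressionType.NONE.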

close

public LookupStoreFactory.Context close() throws IOException {
    // flush current data block
    flush();
    LOG.info("Number of record: {}", recordCount);
    // write bloom filter
    @Nullable BloomFilterHandle bloomFilterHandle = null;
    if (bloomFilter != null) {
        MemorySegment buffer = bloomFilter.getBuffer();
        bloomFilterHandle =
                new BloomFilterHandle(position, buffer.size(), bloomFilter.expectedEntries());
        writeSlice(MemorySlice.wrap(buffer));
        LOG.info("Bloom filter size: {} bytes", bloomFilter.getBuffer().size());
    }
    // write index block: flush the accumulated index entries to the file
    BlockHandle indexBlockHandle = writeBlock(indexBlockWriter);
    // write footer: it records the bloom filter handle and the index block handle
    Footer footer = new Footer(bloomFilterHandle, indexBlockHandle);
    MemorySlice footerEncoding = Footer.writeFooter(footer);
    writeSlice(footerEncoding);
    // close file
    fileOutputStream.close();
    LOG.info("totalUncompressedSize: {}", MemorySize.ofBytes(totalUncompressedSize));
    LOG.info("totalCompressedSize: {}", MemorySize.ofBytes(totalCompressedSize));
    return new SortContext(position);
}

BlockWriter

add

public void add(byte[] key, byte[] value) {
    int startPosition = block.size();
    // write the key length
    block.writeVarLenInt(key.length);
    // write the key
    block.writeBytes(key);
    // write the value length
    block.writeVarLenInt(value.length);
    // write the value
    block.writeBytes(value);
    int endPosition = block.size();
    // record each KV pair's start position in an int list, to serve as the index
    positions.add(startPosition);
    // track alignment: the block stays aligned only while every KV entry has the same length
    if (aligned) {
        int currentSize = endPosition - startPosition;
        if (alignedSize == 0) {
            alignedSize = currentSize;
        } else {
            aligned = alignedSize == currentSize;
        }
    }
}

Here block is backed by a growable MemorySegment, in other words a byte[]; when a write would exceed the current array length, the array is expanded.
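The entry layout produced by add can be sketched as follows. This is a minimal illustration, assuming a 7-bit variable-length integer encoding with a continuation bit; EntryLayoutSketch and its helpers are hypothetical names rather than Paimon's API, and the exact encoding used by writeVarLenInt may differ. A ByteArrayOutputStream stands in for the growable block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class EntryLayoutSketch {
    // Writes a non-negative int in 7-bit groups, lowest group first;
    // the high bit of each byte marks that another byte follows (assumed encoding).
    static void writeVarLenInt(ByteArrayOutputStream out, int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative length: " + value);
        }
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Appends [varint key length][key][varint value length][value]
    // and returns the entry's start offset, which the writer keeps as its position.
    static int addEntry(ByteArrayOutputStream block, byte[] key, byte[] value) {
        int start = block.size();
        writeVarLenInt(block, key.length);
        block.write(key, 0, key.length);
        writeVarLenInt(block, value.length);
        block.write(value, 0, value.length);
        return start;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        int first = addEntry(block, bytes("k1"), bytes("v1"));
        int second = addEntry(block, bytes("k2"), bytes("value2"));
        // The two entries are 6 and 10 bytes long, so this block is not aligned.
        System.out.println(first + " " + second + " " + block.size()); // 0 6 16
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```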

finish

public MemorySlice finish() throws IOException {
    if (positions.isEmpty()) {
        throw new IllegalStateException();
    }
    // when every entry written through this BlockWriter has the same length,
    // per-entry positions are unnecessary: only the aligned size is recorded
    // and the reader computes the offsets itself
    if (aligned) {
        block.writeInt(alignedSize);
    } else {
        for (int i = 0; i < positions.size(); i++) {
            block.writeInt(positions.get(i));
        }
        block.writeInt(positions.size());
    }
    block.writeByte(aligned ? ALIGNED.toByte() : UNALIGNED.toByte());
    return block.toSlice();
}

Summary

The write path is very simple: data is written out block by block, and the position of each block is recorded to serve as the index.
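The write path summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Paimon's implementation: BlockedWriterSketch and IndexEntry are hypothetical names, entries are stored as plain "key=value" text, an in-memory stream stands in for the file, and compression, trailers, the bloom filter and the footer are left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BlockedWriterSketch {
    // One index entry per flushed block: the block's last key plus its location in the file.
    static final class IndexEntry {
        final String lastKey;
        final int offset;
        final int length;

        IndexEntry(String lastKey, int offset, int length) {
            this.lastKey = lastKey;
            this.offset = offset;
            this.length = length;
        }
    }

    final ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for the output file
    final List<IndexEntry> index = new ArrayList<>();
    private final ByteArrayOutputStream block = new ByteArrayOutputStream();
    private final int blockSize;
    private String lastKey;

    BlockedWriterSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    // Keys are expected to arrive in sorted order.
    void put(String key, String value) {
        byte[] entry = (key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        block.write(entry, 0, entry.length);
        lastKey = key;
        if (block.size() > blockSize) {
            flush();
        }
    }

    // Appends the buffered block to the file and records its position in the index.
    void flush() {
        if (block.size() == 0) {
            return;
        }
        byte[] bytes = block.toByteArray();
        index.add(new IndexEntry(lastKey, file.size(), bytes.length));
        file.write(bytes, 0, bytes.length);
        block.reset();
    }

    public static void main(String[] args) {
        BlockedWriterSketch writer = new BlockedWriterSketch(10);
        writer.put("k1", "v1");
        writer.put("k2", "v2"); // buffer reaches 12 bytes, which exceeds 10, so it is flushed
        writer.put("k3", "v3");
        writer.flush();         // the final partial block
        System.out.println(writer.index.size()); // 2
    }
}
```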

SortLookupStoreReader

The read path mainly checks whether a key exists and, if it does, returns the corresponding value or row number.

public byte[] lookup(byte[] key) throws IOException {
    // check the bloom filter first to rule out keys that are definitely absent
    if (bloomFilter != null && !bloomFilter.testHash(MurmurHashUtils.hashBytes(key))) {
        return null;
    }
    MemorySlice keySlice = MemorySlice.wrap(key);
    // seek the index to the block containing the key
    indexBlockIterator.seekTo(keySlice);
    // if indexIterator does not have a next, it means the key does not exist in this iterator
    if (indexBlockIterator.hasNext()) {
        // seek the current iterator to the key:
        // read the data block located by the BlockHandle found in the index block
        BlockIterator current = getNextBlock();
        // binary search again inside that data block; return the value if the key matches
        if (current.seekTo(keySlice)) {
            return current.next().getValue().copyBytes();
        }
    }
    return null;
}

Looking up one key involves two binary searches: one in the index block and one in the data block.
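The two-level structure can be sketched with sorted maps. This is an illustration only, assuming String keys and values; TwoLevelLookupSketch is a hypothetical class, TreeMap stands in for the binary searches, and each inner map stands in for a data block that the real reader would load through its BlockHandle.

```java
import java.util.Map;
import java.util.TreeMap;

final class TwoLevelLookupSketch {
    // Index: last key of each block -> that block's sorted entries.
    private final TreeMap<String, TreeMap<String, String>> index = new TreeMap<>();

    void addBlock(TreeMap<String, String> block) {
        index.put(block.lastKey(), block);
    }

    String lookup(String key) {
        // Level 1: the first index entry whose key (a block's last key) is >= the target.
        Map.Entry<String, TreeMap<String, String>> entry = index.ceilingEntry(key);
        if (entry == null) {
            return null; // the target is larger than every key in the file
        }
        // Level 2: search inside that single block.
        return entry.getValue().get(key);
    }

    public static void main(String[] args) {
        TreeMap<String, String> first = new TreeMap<>();
        first.put("a", "1");
        first.put("c", "3");
        TreeMap<String, String> second = new TreeMap<>();
        second.put("e", "5");
        second.put("g", "7");

        TwoLevelLookupSketch store = new TwoLevelLookupSketch();
        store.addBlock(first);
        store.addBlock(second);
        System.out.println(store.lookup("e")); // 5
        System.out.println(store.lookup("d")); // null
    }
}
```

Because the index is keyed by each block's last key, only one data block has to be read per lookup.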

BlockReader

// create an iterator over the block
public BlockIterator iterator() {
    BlockAlignedType alignedType =
            BlockAlignedType.fromByte(block.readByte(block.length() - 1));
    int intValue = block.readInt(block.length() - 5);
    if (alignedType == ALIGNED) {
        return new AlignedIterator(block.slice(0, block.length() - 5), intValue, comparator);
    } else {
        int indexLength = intValue * 4;
        int indexOffset = block.length() - 5 - indexLength;
        MemorySlice data = block.slice(0, indexOffset);
        MemorySlice index = block.slice(indexOffset, indexLength);
        return new UnalignedIterator(data, index, comparator);
    }
}
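How the last five bytes of a block are interpreted can be sketched as follows. This is a hedged illustration: BlockTrailerSketch is a hypothetical class, the flag values 0 and 1 and the little-endian byte order are assumptions, and ByteBuffer stands in for MemorySlice.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BlockTrailerSketch {
    static final byte ALIGNED = 0;   // assumed flag value
    static final byte UNALIGNED = 1; // assumed flag value

    // Returns the byte offset of record i inside the block's data region.
    static int entryOffset(byte[] block, int i) {
        ByteBuffer buf = ByteBuffer.wrap(block).order(ByteOrder.LITTLE_ENDIAN);
        byte flag = buf.get(block.length - 1);
        int intValue = buf.getInt(block.length - 5);
        if (flag == ALIGNED) {
            return i * intValue; // intValue is the fixed record size
        }
        int indexOffset = block.length - 5 - intValue * 4; // intValue is the record count
        return buf.getInt(indexOffset + i * 4);
    }

    // Layout: [data][one int per record position][record count][flag].
    static byte[] unalignedBlock(byte[] data, int[] positions) {
        ByteBuffer buf =
                ByteBuffer.allocate(data.length + positions.length * 4 + 5)
                        .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(data);
        for (int position : positions) {
            buf.putInt(position);
        }
        buf.putInt(positions.length);
        buf.put(UNALIGNED);
        return buf.array();
    }

    // Layout: [data][record size][flag].
    static byte[] alignedBlock(byte[] data, int recordSize) {
        ByteBuffer buf = ByteBuffer.allocate(data.length + 5).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(data);
        buf.putInt(recordSize);
        buf.put(ALIGNED);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] unaligned = unalignedBlock(new byte[10], new int[] {0, 3, 7});
        byte[] aligned = alignedBlock(new byte[12], 4);
        System.out.println(entryOffset(unaligned, 2)); // 7
        System.out.println(entryOffset(aligned, 2));   // 8
    }
}
```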

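SliceComparator

A keyComparator is passed in here; it compares keys during the binary search, for example in the index. The comparison does not work on the original data directly but orders MemorySlice instances.

During a comparison, each field of the key is read and deserialized from the MemorySegment, then cast to Comparable and compared.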

public SliceComparator(RowType rowType) {
    int bitSetInBytes = calculateBitSetInBytes(rowType.getFieldCount());
    this.reader1 = new RowReader(bitSetInBytes);
    this.reader2 = new RowReader(bitSetInBytes);
    this.fieldReaders = new FieldReader[rowType.getFieldCount()];
    for (int i = 0; i < rowType.getFieldCount(); i++) {
        fieldReaders[i] = createFieldReader(rowType.getTypeAt(i));
    }
}

@Override
public int compare(MemorySlice slice1, MemorySlice slice2) {
    reader1.pointTo(slice1.segment(), slice1.offset());
    reader2.pointTo(slice2.segment(), slice2.offset());
    for (int i = 0; i < fieldReaders.length; i++) {
        boolean isNull1 = reader1.isNullAt(i);
        boolean isNull2 = reader2.isNullAt(i);
        if (!isNull1 || !isNull2) {
            if (isNull1) {
                return -1;
            } else if (isNull2) {
                return 1;
            } else {
                FieldReader fieldReader = fieldReaders[i];
                Object o1 = fieldReader.readField(reader1, i);
                Object o2 = fieldReader.readField(reader2, i);
                @SuppressWarnings({"unchecked", "rawtypes"})
                int comp = ((Comparable) o1).compareTo(o2);
                if (comp != 0) {
                    return comp;
                }
            }
        }
    }
    return 0;
}
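The ordering rule in compare (field by field, with null sorting before any non-null value) can be sketched on already-deserialized rows. This is an illustration only: NullFirstRowComparatorSketch is a hypothetical class, rows are plain Object[] arrays of equal length, and the real comparator reads each field from a MemorySegment first.

```java
import java.util.Comparator;

final class NullFirstRowComparatorSketch implements Comparator<Object[]> {
    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public int compare(Object[] row1, Object[] row2) {
        // Both rows are assumed to have the same number of fields.
        for (int i = 0; i < row1.length; i++) {
            Object o1 = row1[i];
            Object o2 = row2[i];
            if (o1 == null && o2 == null) {
                continue; // two nulls are equal, move on to the next field
            }
            if (o1 == null) {
                return -1; // null sorts first
            }
            if (o2 == null) {
                return 1;
            }
            int comp = ((Comparable) o1).compareTo(o2);
            if (comp != 0) {
                return comp; // the first differing field decides the order
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        NullFirstRowComparatorSketch comparator = new NullFirstRowComparatorSketch();
        // The first fields are equal; the null second field sorts first.
        System.out.println(comparator.compare(new Object[] {1, null}, new Object[] {1, "a"})); // -1
    }
}
```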

The lookup itself is a plain binary search, which works because the keys were written in sorted order.

public boolean seekTo(MemorySlice targetKey) {
    int left = 0;
    int right = recordCount - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        // aligned iterator: seek directly to mid * recordSize
        // unaligned iterator: jump via the position index written by the writer
        seekTo(mid);
        // read one key-value entry
        BlockEntry midEntry = readEntry();
        int compare = comparator.compare(midEntry.getKey(), targetKey);
        if (compare == 0) {
            polled = midEntry;
            return true;
        } else if (compare > 0) {
            polled = midEntry;
            right = mid - 1;
        } else {
            left = mid + 1;
        }
    }
    return false;
}
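The behavior of seekTo on a miss matters for the index lookup: the search returns false, yet the iterator is left at the smallest entry greater than the target. A minimal sketch of that behavior over an int array follows; SeekSketch is a hypothetical class, and an explicit position field replaces the polled entry used in the code above.

```java
final class SeekSketch {
    private final int[] keys; // sorted ascending
    private int position;     // index of the next entry to return

    SeekSketch(int[] sortedKeys) {
        this.keys = sortedKeys;
    }

    // Returns true on an exact match. On a miss, the iterator is positioned at the
    // first key greater than the target, or at the end if no such key exists.
    boolean seekTo(int target) {
        int left = 0;
        int right = keys.length - 1;
        position = keys.length;
        while (left <= right) {
            int mid = left + (right - left) / 2;
            if (keys[mid] == target) {
                position = mid;
                return true;
            } else if (keys[mid] > target) {
                position = mid; // best candidate so far
                right = mid - 1;
            } else {
                left = mid + 1;
            }
        }
        return false;
    }

    boolean hasNext() {
        return position < keys.length;
    }

    int next() {
        return keys[position++];
    }

    public static void main(String[] args) {
        SeekSketch iterator = new SeekSketch(new int[] {10, 20, 30});
        System.out.println(iterator.seekTo(15)); // false
        System.out.println(iterator.next());     // 20
    }
}
```

In the index block each key is the last key of a data block, so the entry the iterator lands on identifies the only block that could contain the target.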

Summary

Lookup process

First, check the bloom filter.
Next, search the index for the block handle that covers the key.
Finally, read the block that handle points to and search it for the matching key and value.

Paimon lookup store implementation