Choosing a fast unique identifier (UUID) for Lucene
Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it couldn't care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.
One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is about 4X slower than the fastest option tested. Understanding why requires knowing how Lucene finds terms.
BlockTree terms dictionary
The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreq, totalTermFreq, etc.), as well as the postings (documents, offsets, positions and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.
The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block. On lookup it first checks the index based on the requested term's prefix, then seeks to the appropriate on-disk block and scans to find the term.
In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain, especially when the index is cold (the pages are not cached by the OS's IO cache), since it avoids a costly disk seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!
Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.
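To make the intuition concrete, here is a toy model in plain Java. It is not Lucene's FST-based check: it assumes, purely for illustration, that a segment can rule out an id only when the id falls outside the segment's smallest and largest stored id. The class and method names (`SegmentRuleOutSketch`, `Segment`, `segmentsToSeek`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy model of per-segment rule-out; NOT how Lucene's terms index is implemented. */
final class SegmentRuleOutSketch {

    /** Hypothetical stand-in for a segment: it only remembers its min and max id. */
    static final class Segment {
        final long minId;
        final long maxId;

        Segment(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        /** False means the segment can be skipped without touching the disk. */
        boolean mightContain(long id) {
            return id >= minId && id <= maxId;
        }
    }

    /** Counts the segments that cannot be ruled out and would need a block seek. */
    static int segmentsToSeek(List<Segment> segments, long id) {
        int seeks = 0;
        for (Segment segment : segments) {
            if (segment.mightContain(id)) {
                seeks++;
            }
        }
        return seeks;
    }

    /** Sequential ids: each segment covers its own disjoint range. */
    static List<Segment> sequentialSegments(int segmentCount, long idsPerSegment) {
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            segments.add(new Segment(i * idsPerSegment, (i + 1) * idsPerSegment - 1));
        }
        return segments;
    }

    /** Random ids: every segment's range ends up spanning almost the whole id space. */
    static List<Segment> randomSegments(int segmentCount, int idsPerSegment, long seed) {
        Random random = new Random(seed);
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int j = 0; j < idsPerSegment; j++) {
                long id = random.nextLong();
                min = Math.min(min, id);
                max = Math.max(max, id);
            }
            segments.add(new Segment(min, max));
        }
        return segments;
    }

    public static void main(String[] args) {
        long probe = 12_345L;
        System.out.println("sequential ids, segments to seek: "
            + segmentsToSeek(sequentialSegments(22, 1_000L), probe));
        System.out.println("random ids, segments to seek:     "
            + segmentsToSeek(randomSegments(22, 10_000, 42L), probe));
    }
}
```

In this model a sequential id needs a seek in only one of the 22 segments, while with random ids essentially no segment can be ruled out. The real terms index works on term prefixes rather than a min/max range, so treat this only as a picture of the effect.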
Testing Performance
I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.
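The "best of 5 runs" measurement can be sketched roughly as follows. This is not the author's tester (its source is linked above); `BestOfRunsSketch`, `bestNanos` and `kLookupsPerSec` are hypothetical names, and the task passed in stands in for the batch of id lookups.

```java
/** Minimal best-of-N timing helper; a sketch, not the benchmark used in the post. */
final class BestOfRunsSketch {

    /** Runs the task the given number of times and returns the fastest wall-clock time. */
    static long bestNanos(Runnable task, int runs) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Converts a lookup count and elapsed nanoseconds into thousands of lookups per second. */
    static double kLookupsPerSec(long lookups, long nanos) {
        return lookups / (nanos / 1e9) / 1000.0;
    }

    public static void main(String[] args) {
        final long lookups = 2_000_000L;
        // Placeholder workload standing in for the real id lookups.
        Runnable fakeLookups = new Runnable() {
            @Override
            public void run() {
                long sum = 0;
                for (long i = 0; i < lookups; i++) {
                    sum += i ^ (i >>> 3);
                }
                if (sum == 42) {
                    System.out.println("unlikely"); // keeps the loop from being optimized away
                }
            }
        };
        long best = bestNanos(fakeLookups, 5);
        System.out.println("K lookups/sec: " + kLookupsPerSec(lookups, best));
    }
}
```

Taking the best run rather than the mean reduces noise from JIT warm-up and GC pauses, at the cost of reporting an optimistic figure.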
Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.
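As a minimal sketch of that binary form, a `java.util.UUID` can be packed into 16 bytes with a `ByteBuffer`. The class and method names here are hypothetical; in a Lucene application the resulting array would typically be wrapped in a `BytesRef` before being used as a term.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Packs a 128-bit UUID into 16 raw bytes instead of its 36-character text form. */
final class UuidBinarySketch {

    /** Most significant 64 bits first, so byte order matches the hex string order. */
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
    }

    /** Inverse of toBytes. */
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] binary = toBytes(id);
        System.out.println("text length:   " + id.toString().length());
        System.out.println("binary length: " + binary.length);
        System.out.println("round trip ok: " + id.equals(fromBytes(binary)));
    }
}
```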
I tested the following identifier sources:
- Sequential IDs (0, 1, 2, ...), binary encoded.
- Zero-padded sequential IDs (00000000, 00000001, ...), binary encoded.
- Nanotime, binary encoded. But remember that nanotime is tricky.
- UUID V1, derived from a timestamp, nodeID and sequence counter, using this implementation.
- UUID V4, randomly generated using Java's UUID.randomUUID().
- Flake IDs, using this implementation.
For the UUIDs and Flake IDs I also tested binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads.
| ID Source | K lookups/sec, 1 thread |
|---|---|
| Zero-pad sequential | 593.4 |
| UUID v1 [binary] | 509.6 |
| Nanotime | 461.8 |
| UUID v1 | 430.3 |
| Sequential | 415.6 |
| Flake [binary] | 338.5 |
| Flake | 231.3 |
| UUID v4 [binary] | 157.8 |
| UUID v4 | 149.4 |
Zero-padded sequential ids, encoded in binary, are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.
But for most applications, sequential ids are not practical. The second fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs, since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of its timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.
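The common-prefix suspicion can be illustrated with a small sketch that lays out the same timestamp in two byte orders and counts the shared leading bytes of two ids. The Flake-like layout (64-bit timestamp, 48-bit node, 16-bit sequence) is an assumption about the implementation tested; the UUID V1 field order follows RFC 4122. All names are hypothetical, and both layouts are fed the same timestamp values only to keep the comparison simple.

```java
import java.nio.ByteBuffer;

/** Compares how many leading bytes two ids share under two different field orders. */
final class IdLayoutSketch {

    /** Assumed Flake-like layout: timestamp high bits first, then node, then sequence. */
    static byte[] flakeLike(long timestamp, long node, int sequence) {
        ByteBuffer b = ByteBuffer.allocate(16);
        b.putLong(timestamp);                  // 64-bit timestamp, big-endian
        b.putShort((short) (node >>> 32));     // upper 16 bits of the 48-bit node id
        b.putInt((int) node);                  // lower 32 bits of the node id
        b.putShort((short) sequence);          // 16-bit sequence
        return b.array();
    }

    /** UUID V1 field order: time_low, time_mid, version + time_hi, clock_seq, node. */
    static byte[] uuidV1Like(long timestamp, long node, int clockSeq) {
        ByteBuffer b = ByteBuffer.allocate(16);
        b.putInt((int) timestamp);                                    // time_low
        b.putShort((short) (timestamp >>> 32));                       // time_mid
        b.putShort((short) (0x1000 | ((timestamp >>> 48) & 0x0FFF))); // version 1 + time_hi
        b.putShort((short) (0x8000 | (clockSeq & 0x3FFF)));           // variant + clock_seq
        b.putShort((short) (node >>> 32));                            // node, upper 16 bits
        b.putInt((int) node);                                         // node, lower 32 bits
        return b.array();
    }

    /** Number of leading bytes the two arrays have in common. */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        long node = 0x0011_2233_4455L;
        long t1 = 1_400_000_000_000L;
        long t2 = t1 + 3_600_000L; // one hour later, if the unit is milliseconds
        System.out.println("flake-like shared prefix:   "
            + commonPrefixLength(flakeLike(t1, node, 0), flakeLike(t2, node, 0)));
        System.out.println("uuid-v1-like shared prefix: "
            + commonPrefixLength(uuidV1Like(t1, node, 0), uuidV1Like(t2, node, 0)));
    }
}
```

With the high-order timestamp bytes first, ids generated close together in time share a long prefix, so a lookup has to walk further into the term before the terms differ. Whether that fully explains the measured gap is still only a hypothesis.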
I also separately tested varying the base across 10, 16, 36, 64 and 256; in general, for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) would be best.
There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as HotSpot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.
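For reference, here is a sketch of what encoding a sequential id in a given base with zero padding might look like. How the original test mapped digits to bytes is not stated, so storing each digit as its raw value (0 to base-1) is an assumption, and the names are hypothetical. Base 256 corresponds to the fully binary encoding.

```java
import java.util.Arrays;

/** Fixed-width, zero-padded encoding of a non-negative long in bases 2..256. */
final class BaseEncodingSketch {

    /** Number of digits needed to write maxValue in the given base. */
    static int digitsNeeded(long maxValue, int base) {
        int digits = 1;
        while (maxValue >= base) {
            maxValue /= base;
            digits++;
        }
        return digits;
    }

    /** Encodes value as exactly width digits, most significant first, zero-padded. */
    static byte[] encode(long value, int base, int width) {
        if (value < 0 || base < 2 || base > 256) {
            throw new IllegalArgumentException("unsupported value or base");
        }
        byte[] out = new byte[width];
        for (int i = width - 1; i >= 0; i--) {
            out[i] = (byte) (value % base); // raw digit value, an assumed mapping
            value /= base;
        }
        if (value != 0) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        return out;
    }

    public static void main(String[] args) {
        long maxId = 99_999_999L; // largest id when indexing 100 million sequential ids
        for (int base : new int[] {10, 16, 36, 64, 256}) {
            int width = digitsNeeded(maxId, base);
            System.out.println("base " + base + ": " + width + " bytes per id, e.g. "
                + Arrays.toString(encode(258L, base, width)));
        }
    }
}
```

Higher bases give shorter terms, and the fixed width keeps binary sort order identical to numeric order, which is what lets each segment hold a contiguous range of terms.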
I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).
Finally, I used Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google Summer of Code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!