Improving AbiWord's Piece Table
One of the most critical parts of any word processor is the backend used to store its text. It should be fast to look up text, fast to insert and erase text at a random location, undo friendly, etc.
The AbiWord backend has all these virtues and some more. It's (IMHO) the most impressive piece of code in the whole AbiWord project, and one that has been exceptionally stable over the years. In short, Jeff's code rocks™.
However, improvement is still possible. I will show a modified piece table that replaces the current O(n) complexity of the insertion and lookup operations with O(log(n)) operations.
Nota Bene: In this discussion, “n” is the number of pieces, not the number of characters.
Current Piece Table
If you already know how the piece table works, you can skip this section.
TODO: Write this section
In the meantime, you can read several good descriptions in the article Data Structures for Text Sequences (by Charles Crowley) and in this Piece Table Description.
The piece table that AbiWord uses is like the one explained in these articles, except that it has a little cache (the last served piece and the next one in the piece table are cached), and that after a change to the piece table, when you do a lookup, a vector is created to mirror the doubly linked list of pieces (obviously, an O(n) operation).
This vector increases the speed at which pieces are served (as long as it remains valid) and looked up (the lookup operation becomes O(log(n)) once the vector is up to date).
Unfortunately, the vector comes at a price. It slows down the first lookup after an insert/erase operation, it takes more memory, and it complicates the code that uses the pf_Fragments class, as it has to signal when the fragments become dirty (AbiWord's “fragments” are what this document calls “pieces”).
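To make the idea concrete, here is a minimal sketch of the “mirror vector” approach (the names and types are illustrative, not the actual pf_Fragments code): after the list changes, the vector is rebuilt once in O(n), and later lookups binary-search the cached offsets in O(log(n)).

#include <cstddef>
#include <vector>

struct Piece
{
    std::size_t size;    // number of characters covered by this piece
    Piece* next;         // next piece in the doubly linked list (prev omitted)
};

struct MirrorVector
{
    std::vector<Piece*> pieces;          // pieces in document order
    std::vector<std::size_t> offsets;    // starting offset of each piece

    // O(n): rebuild the mirror after the piece list has changed.
    void rebuild(Piece* head)
    {
        pieces.clear();
        offsets.clear();
        std::size_t off = 0;
        for (Piece* p = head; p; p = p->next)
        {
            pieces.push_back(p);
            offsets.push_back(off);
            off += p->size;
        }
    }

    // O(log(n)): find the piece that contains a document position.
    Piece* lookup(std::size_t pos) const
    {
        if (pieces.empty())
            return 0;
        std::size_t lo = 0, hi = pieces.size();
        while (hi - lo > 1)
        {
            std::size_t mid = lo + (hi - lo) / 2;
            if (offsets[mid] <= pos)
                lo = mid;    // piece mid starts at or before pos
            else
                hi = mid;
        }
        return pieces[lo];
    }
};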
Red-Black trees
You should be able to find plenty of explanations of how red-black trees work on the net.
The complexity guarantees of red-black trees are O(log(n)) for insertion, erase, lookup, next and previous in the worst case. The next and previous operations have an average complexity of O(1) (on average, you just have to follow two pointers to reach the next or previous node).
TODO: Give some pointers to red-black trees descriptions.
Suggested modifications
The modification that I suggest is to replace the doubly linked list with a self-balancing tree. We need to establish a key and a comparison operation to make the change possible.
As we want to make lookups (i.e., to pass from a document's position to a piece) in O(log(n)), it seems natural to choose as key something related to the document's position range that is covered by each piece.
If we choose as key the beginning position and the size of the piece, we'll have trees like the next one:
It's obvious that lookup is now done in O(log(n)), but if we do an insertion in the middle of the document, we will have to update the “beginning position” of the upper half of the document's pieces (half the tree). As we need to walk from a node to the next one, and we need to visit O(n) nodes in the worst case, the insertion operation will be O(n).
Nota Bene: It may seem that it should be O(n log(n)), because the worst case of a “go to the next node” operation is O(log(n)), but we will prove later that the average cost of this operation is just O(1) (TODO: write down the proof!).
To solve this problem, we will “distribute” the offset information among several nodes, so it will be harder to recover the offset of a piece (O(log(n)) instead of O(1)), but it will be faster to “fix” the offsets of all the nodes in the tree (O(log(n)) instead of O(n)). We will put in each node only the size of its left subtree, the size of its right subtree, and its own size.
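A minimal sketch of the node layout just described (the field names are illustrative, not necessarily those of the reference implementation):

#include <cstddef>

struct Node
{
    std::size_t size_left;    // total size of the left subtree
    std::size_t size;         // size of the piece stored in this node
    std::size_t size_right;   // total size of the right subtree
    Node* left;
    Node* right;
    Node* parent;
    bool is_red;              // red-black color
};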
With this change, the lookup operation remains O(log(n)), and the insertion operation also becomes O(log(n)), as we no longer have to update the whole tree, but just all the ancestors of the modified piece (and any leaf has O(log(n)) ancestors).
With this strategy, the new tree will look like this one:
Now, if we insert/erase a node, and let's say that this node is a left son, we just “fix” the size_left of its parent, and then repeat the fixing process with the parent, walking up to the root. This fix should be done before the eventual rebalancing of the tree starts. And, of course, the sizes should also be updated after each rotation in the rebalancing of the tree.
total = 0;
while (node != root)
{
    total = node->size_left + node->size + node->size_right;
    if (node->parent->left == node)
        node->parent->size_left = total;
    else
        node->parent->size_right = total;
    node = node->parent;    /* walk up and fix the next ancestor */
}
As rebalancing the tree is an O(log(n)) operation for a red-black tree (the variant of self-balancing tree that I've used here), and the “fix size” operation is also an O(log(n)) operation, the whole cost of inserting/erasing a node is O(log(n)).
To calculate the offset of a node, we start with the size_left field of this node, and add the size_left + size of all the ancestors for which this node is in the right subtree. For example, to calculate the offset of the node that has a size of 12, we start with its size_left (0), and we jump to its parent (size 1). As we are the left son, we don't take into account the contribution of the parent. We then jump to the grandparent (size 8), and this time, we're in the right subtree of the grandparent, so we add the size_left and the size of the grandparent to the previous offset (0 + 9 + 8 = 17). We jump to the parent of the grandparent (the root of the tree), and as we're in the left subtree, we don't take into account the contribution of the root. We're done, the offset of the node is 17.
offset = node->size_left;
while (node != root)
{
    if (node->parent->right == node)
        offset += node->parent->size_left + node->parent->size;
    node = node->parent;    /* the walk up happens on every iteration */
}
The lookup operation is trivially O(log(n)), due to the invariants of the red-black tree. (The lookup operation is a linear function of the height of the tree, and the height of the tree is always less than 2 log(n) in a red-black tree.)
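As a sketch (reusing the Node layout shown earlier, so the names are again illustrative), the lookup walks down from the root, using size_left to decide whether the target position falls in the left subtree, inside this node's own piece, or in the right subtree:

Node* lookup(Node* root, std::size_t pos)
{
    Node* node = root;
    while (node)
    {
        if (pos < node->size_left)
            node = node->left;                        // position is in the left subtree
        else if (pos < node->size_left + node->size)
            return node;                              // position is inside this piece
        else
        {
            pos -= node->size_left + node->size;      // skip the left subtree and this piece
            node = node->right;
        }
    }
    return 0;                                         // position is past the end of the document
}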
Never assume, measure!
I've performed two performance tests. In the first one, I throw 1,000,000 characters at the piece table, each one at a random position. The piece table will finish with roughly 1,000,000 pieces. That's the equivalent of a dense document of 30,000 pages (and with a good deal of format changes).
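The test driver looks roughly like the following sketch (hypothetical code: the PieceTable class and its insertCharacter method are stand-ins, not the real interface of the reference implementation). It inserts characters at random positions and reports the mean insertion time per batch:

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    PieceTable pt;                            // hypothetical piece table interface
    const unsigned long N = 1000000;
    const unsigned long BATCH = 10000;        // clock() is too coarse to time a single insert
    unsigned long size = 0;

    std::srand(42);
    for (unsigned long b = 0; b < N / BATCH; ++b)
    {
        std::clock_t start = std::clock();
        for (unsigned long i = 0; i < BATCH; ++i)
        {
            unsigned long pos = size ? std::rand() % size : 0;
            pt.insertCharacter(pos, 'x');     // hypothetical method name
            ++size;
        }
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%lu pieces: %.2f us per insert\n", size, 1e6 * secs / BATCH);
    }
    return 0;
}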
The mean time for the insertion operation goes from ~2 μs (that's 2 × 10⁻⁶ seconds) when the piece table is empty to ~10 μs when it has 1 million pieces (on a 750 MHz computer with 256 MB of memory). The experimental data are the blue squares, and the theoretical curve is the black line. I guess that the two dots that are visibly off the theoretical curve are just due to a process switch between the start and the end of a measurement. (One of the other ten processes that were running on my computer must have got several cycles while I was measuring.)
So far, so good. The delete operation, however, hides more surprises than its peer, the insert operation. To interpret the next figure, we should divide it in two parts. The first part is the lower branch, which starts at 0 pieces and between 2 and 3 μs, and ends at 250,000 pieces and between 7 and 8 μs. The second branch (the upper one) goes from 250,000 pieces back to 0 pieces.
When the delete operation is performed in a piece table with a big piece, it will split the piece in two. When it is performed in a piece table with plenty of pieces that contain only one character, then it will delete a piece.
The delete operation starts by making more and more pieces, until it reaches a stability point at which the number of destructions equals the number of creations (in our figure, when the piece table has 250,000 pieces). After this point the number of destructions becomes the dominant factor, and we end up coming back to 0 pieces.
Now, why does the delete operation show this hysteresis? My guess is that in the second branch the tree is extremely dispersed in the computer's memory. The tree had 250,000 nodes, which were packed into several MB. When we start deleting them randomly, the mean distance (in the computer's memory) between two nodes increases, and this distance induces more and more page misses (and that becomes the dominant factor). But I'm just guessing.
We're not yet lost, as we can reduce the number of page misses. To reduce them I will focus on:
- Reduce the memory size. The “color” of the node can be optimized to the point of not adding a single bit to the size of the node structure. The size_right field can also be suppressed entirely without any bad consequence (all the operations keep the same speed, maybe even get a bit faster) [DONE]. The node structure can be allocated using a memory pool. That way, the bookkeeping memory that the general-purpose allocator uses to handle each structure can be optimized away.
- Increase the spatial locality of the nodes. Using the memory pool (again), we can put all the nodes together, and thus keep in the same page (or in a reduced set of pages) all the information needed to walk through the tree (a minimal pool sketch follows this list).
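A minimal pool sketch, assuming the Node layout shown earlier (illustrative names again, not the reference implementation): nodes are carved out of large contiguous blocks, so neighbouring nodes tend to share pages, and the per-allocation bookkeeping of the general-purpose allocator disappears. Freed nodes go onto a free list threaded through the parent pointer.

#include <cstddef>
#include <vector>

class NodePool
{
public:
    NodePool() : m_free(0) {}

    ~NodePool()
    {
        for (std::size_t i = 0; i < m_blocks.size(); ++i)
            delete [] m_blocks[i];
    }

    Node* allocate()
    {
        if (!m_free)
            grow();
        Node* n = m_free;
        m_free = m_free->parent;    // pop from the free list
        return n;
    }

    void release(Node* n)
    {
        n->parent = m_free;         // push onto the free list
        m_free = n;
    }

private:
    enum { BLOCK_SIZE = 1024 };     // nodes allocated per contiguous block

    void grow()
    {
        Node* block = new Node[BLOCK_SIZE];
        m_blocks.push_back(block);
        for (std::size_t i = 0; i < BLOCK_SIZE; ++i)
            release(&block[i]);
    }

    Node* m_free;                   // head of the free list
    std::vector<Node*> m_blocks;    // all allocated blocks, kept for cleanup
};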
Conclusion
I've shown that it's possible to have a backend in which all the operations have a worst case of O(log(n)), while the usual cases (forward and backward movement) can still be resolved in an average time of O(1).
Is it a priority for the AbiWord project to switch to this kind of piece table? IMHO, no. It's not even near a priority. As I said in the introduction, the current piece table is already a high quality implementation, and it has received several useful performance improvements over time.
The current bottleneck in AbiWord right now is in the layout part (TODO: give some figures. Assertions without facts suck.). Nevertheless, the same kind of structure that I propose here can also be used to solve the O(n) operations that AbiWord has in the layout code.
Anyway, let's say that I wanted to solve this problem, not because I considered it very important, but because it was the second time that I tried to solve it, and I knew that it was possible :-)
That said, once the performance problems of the layout code are fixed, the piece table will eventually show its head in profilers, and I hope that these modifications will help at that time.
I've done a reference implementation of a piece table like the one that I describe here. You can download the code: PieceTable2.zip. In the zip you will find two different backends for the piece table, a red-black tree and a doubly linked list. It also contains an almost complete regression test. Update: This version contains a piece table without the size_right member.
The code has been tested with MSVC 6 and gcc 2.95.3.
TODO: Remove exceptions (AbiWord doesn't like C++ exceptions), fix the (2) functions that have sub-basic exception guarantees, complete the regression test (mostly done), fully comment the code, and profile it (done).