Why GUID primary keys are a database’s worst nightmare
http://csharptest.net/1250/why-guid-primary-keys-are-a-databases-worst-nightmare/
When you ask most people why using a GUID column for a primary key in a database might be slower than an auto-incremented number, the answer you're likely to get is usually along these lines:
“Guids are 16 bytes, and integers are only 4 bytes.”
While this information is technically accurate, it is completely the wrong answer to the question. The performance delta between a 16-byte key and a 4-byte key is most likely negligible and almost certainly impossible to measure accurately. To understand what is wrong with a GUID, you first need to know a little about how a database stores its data.
The most common storage structure used by databases (to the best of my knowledge) is the B+Tree. B+Trees offer great performance even for very large sets of data. They do, however, suffer from one problem; let's call it 'key density'.
By key density I'm referring to the density, or close proximity, of the keys being accessed at any given moment as compared to the universe of all keys in storage. For example, let's say I have 10 million records, each keyed by a unique integer numbered from 1 to 10,000,000. I can say that keys {44, 46, 52} have a high density, whereas keys {100, 10553, 733555} have a very low density.
Cache is everything. When a database needs to read or write a record, it traverses nodes in the tree, reading them into memory from disk. Disk IO is the single biggest time expense a database has, so to reduce the number of reads, nodes visited while fetching or writing a record are usually cached. This allows more efficient retrieval of the record the next time it is requested. As more low-density keys are accessed, more and more unique nodes are fetched from disk into cache. Yet every cache has its limits, bound primarily by the hardware available and often by software configuration. Once the available cache space has been used, stale/old nodes are removed from the cache to make room for newer ones.
So now let us imagine how this applies when using an auto-numbered field. To insert the next new row, I'll need to travel down the right edge (highest key values) of the tree. Once at the leaf, insert and done (oversimplified). Since each write uses the next possible key after the last one used, it is practically guaranteed to find the nodes it needs in cache. A few quick memory look-ups later, the record can be inserted without reading from disk. It should now be obvious why Guids will have problems…
Creating a new Guid is essentially done by taking a value 'uniformly at random' from the entire range of possible Guid values. That means that, unlike our auto-numbered field with its high key density, our Guid keys are designed to be sparse, to have a low key density. Because of this, the odds of two Guid values landing in the same tree node are statistically worse than winning the lottery. Using a Guid as a primary key practically obliterates the database's ability to find nodes in the cache based upon previous queries. By the time the database has passed about 10 million rows, your performance will fall off drastically.
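To see just how scattered random Guids are, here is a minimal sketch (my own illustration, not from the original post) that simulates a run of consecutive inserts and counts how often a new key even lands in the same top-level region of the key space as the previous one:

using System;

static class GuidScatterDemo
{
    static void Main()
    {
        // Simulate 100,000 consecutive inserts and count how often the new
        // key shares even the first hex digit of its most significant field
        // with the previous key.
        int sameRegion = 0, total = 100000;
        string prev = Guid.NewGuid().ToString("N");
        for (int i = 0; i < total; i++)
        {
            string next = Guid.NewGuid().ToString("N");
            if (next[0] == prev[0])
                sameRegion++;
            prev = next;
        }
        // Expect roughly 1 in 16: consecutive writes scatter across the
        // whole key space instead of clustering like an auto-number does.
        Console.WriteLine(sameRegion + " of " + total + " consecutive keys shared the first hex digit");
    }
}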
Want more proof? Read the Guid article on Wikipedia, under the section "Sequential algorithms", which discusses this very topic. It also discusses a solution, first introduced by Jimmy Nilsson in 2002, called a COMB Guid: a 'combined' Guid that merges the random value with a timestamp.
So are Guids evil in a database? No. It's the algorithm that generates the Guid that is the issue, not the Guid itself. As mentioned in the article linked above, there are other algorithms, and numerous possibilities are documented online. I've implemented one that out-performs the native Guid.NewGuid() implementation and will release it in the next few weeks. Until then, this is fairly similar to the algorithm I'm using…
static class SequentialGuidGenerator
{
    // Sequence counter, seeded from the clock; disambiguates Guids
    // created within the same timestamp tick.
    static int _sequenced = (int)DateTime.UtcNow.Ticks;

    // Cryptographically random source for the trailing six bytes.
    private static readonly System.Security.Cryptography.RNGCryptoServiceProvider _random =
        new System.Security.Cryptography.RNGCryptoServiceProvider();
    private static readonly byte[] _buffer = new byte[6];

    public static Guid NewGuid()
    {
        long ticks = DateTime.UtcNow.Ticks;
        int sequenceNum = System.Threading.Interlocked.Increment(ref _sequenced);
        lock (_buffer)
        {
            _random.GetBytes(_buffer);
            // Layout: 64 bits of timestamp, 16 bits of sequence, then
            // 48 random bits, so new values sort roughly in creation order.
            return new Guid(
                (int)(ticks >> 32), (short)(ticks >> 16), (short)ticks,
                (byte)(sequenceNum >> 8), (byte)sequenceNum,
                _buffer[0], _buffer[1], _buffer[2], _buffer[3], _buffer[4], _buffer[5]
            );
        }
    }
}
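Used in place of Guid.NewGuid(), consecutive calls now produce values whose leading bytes ascend with the clock, so inserts land next to one another at the right edge of the index much like an auto-number. A quick usage sketch (the output values are illustrative only):

Guid a = SequentialGuidGenerator.NewGuid();
Guid b = SequentialGuidGenerator.NewGuid();
// Both begin with the same timestamp-derived bytes, differing only in the
// sequence and random portions, so b sorts immediately after a.
Console.WriteLine(a);
Console.WriteLine(b);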
The problem with this code is that not all databases compare Guids the same way, or in the same byte order. So be cautious about using this approach until you understand how your particular database stores and compares a Guid type.
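SQL Server is the classic example: it does not order a uniqueidentifier by its leading bytes, and the System.Data.SqlTypes.SqlGuid struct exposes those comparison rules from .NET. This little sketch (my addition, not from the original post) shows the two orderings disagreeing:

using System;
using System.Data.SqlTypes;

static class GuidOrderingDemo
{
    static void Main()
    {
        Guid x = new Guid("00000000-0000-0000-0000-010000000000");
        Guid y = new Guid("01000000-0000-0000-0000-000000000000");

        // .NET compares the leading fields first, so x sorts before y.
        Console.WriteLine(x.CompareTo(y));                            // -1

        // SQL Server weighs the trailing six bytes most heavily, so x
        // sorts after y; a "sequential" Guid must put its timestamp there.
        Console.WriteLine(new SqlGuid(x).CompareTo(new SqlGuid(y)));  // 1
    }
}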
Lessons Learned and Applied Concept
The interesting and rather undocumented aspect of this issue is that it applies just as well to all types of composite keys. Let's take an example: we have a simple logging structure being written to a database by NLog. The rows are identified by a Guid, but we almost never query these records by Id. Most of the time when we query this data we are looking within a range of dates for a specific event. How would you model the primary key and indexes? Most people want as small a primary key as possible, so the natural assumption is to use the ID of the record. This, as we've already covered, is generally a bad idea simply because we are using Guids, but even more so because our queries will be time-based. By promoting the timestamp into the primary key we not only gain better query performance, we also remove the problem of the GUID identifier. Let's see the example:
TABLE LogFile {
  Column DateTime timestamp,
  Column GUID id,
  Column int eventType,
  Column string message,
  Primary Key ( timestamp, id )
}
With the primary key using the timestamp as its first comparison, we will always be writing to the same area within the table and will consistently hit the cache on writes. When seeking data, the timestamp groups all the needed records together, so the data we are after is stored as densely as possible, requiring the fewest possible reads.
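As a sketch of the kind of query this key serves (my own illustration; the table and column names come from the example above, and the connection is assumed to be open), a time-range seek with ADO.NET might look like this:

using System;
using System.Data.SqlClient;

static class LogFileQuery
{
    // Fetch all events of one type within a time window. With the primary
    // key leading on timestamp, this reads one contiguous range of the table.
    static void QueryWindow(SqlConnection conn, DateTime from, DateTime to, int eventType)
    {
        using (SqlCommand cmd = new SqlCommand(
            "SELECT timestamp, id, message FROM LogFile " +
            "WHERE timestamp >= @from AND timestamp < @to AND eventType = @type", conn))
        {
            cmd.Parameters.AddWithValue("@from", from);
            cmd.Parameters.AddWithValue("@to", to);
            cmd.Parameters.AddWithValue("@type", eventType);
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetDateTime(0) + ": " + reader.GetString(2));
            }
        }
    }
}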
Now let's drift this example a little, to a new scenario. Say the log above is being written from an ASP.NET application in which all users are authenticated. We want to add the user's identity to the LogFile data being written, and we want to constrain queries to a given user. In the example above it may be safe to simply modify Primary Key ( timestamp, id ) to include the user as the first key. So Primary Key ( userId, timestamp, id ) will now perform well for a single user, right? The answer is yes and no; it depends greatly on the application. Introducing userId as the leading key means we could be writing to as many places in the file as we have users. If the application is a mobile app polling our web server every minute, then we've just scattered our writes across all N thousand users in our system. Yet if our application requires a human being using a web browser or mobile app, the number of active write points in the file drops considerably… at least until you're Facebook, at which point you cash out and go home :)
One last example. Say you are building SkyDrive, or GDrive, or whatever, and you want to store the following entities: Customer, Drive, Folder, File. Each entity is identified by a GUID. What does the File's primary key look like? I'd probably key the File by (CustomerID, DriveID, FolderID, Timestamp, FileID). You would obviously need an ancillary index on FileID in order to access the file directly. Food for thought anyway…
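In the same notation as the LogFile example, that design might be sketched like this (a sketch of the key shape only, not a complete schema; the name column is illustrative):

TABLE File {
  Column GUID customerId,
  Column GUID driveId,
  Column GUID folderId,
  Column DateTime timestamp,
  Column GUID fileId,
  Column string name,
  Primary Key ( customerId, driveId, folderId, timestamp, fileId ),
  Index ( fileId )
}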
The reason I bring all this up is that there is no rule of thumb for a good primary key. Each application has different needs and different requirements. What you should take away is that the density of the keys being written to in the first field of your primary key will ultimately dictate your write throughput, and likely your scalability. Be conscious of this and choose keys wisely. The Guid ID of a record is not always the best answer for a primary key.
As a side note, using Guids as a primary key in a B+Tree will work, but it gets much slower at large volumes (2+ million rows). Using a sequential Guid generator like the one above, or an ordered natural key (like a qualified file name), will serve you much better. Ordered (or near-ordered) keys provide linear scalability, whereas unique random GUIDs will start to suffer once you've exceeded the cache space.