GUID vs INT Debate【转】
http://blogs.msdn.com/b/sqlserverfaq/archive/2010/05/27/guid-vs-int-debate.aspx
I recently read a blog post on what was better using GUIDs or Integer values. This is been an age long debate and there are advocates in both camps stressing on the disadvantages of the other. Well both implementations have their advantages and disadvantages. At the outset, I shall mention that the answer to this debate is: IT DEPENDS! J
It is highly dependent on your database design, migration needs and overall architecture. There is a good reason why SQL Server replication uses GUIDs to track the changes to the replicated articles. So, it is not that the usage to GUIDs is necessarily a bad practice. SQL Server Books Online lists the following disadvantages for uniqueidentifier data type:
- The values are long and obscure. This makes them difficult for users to type correctly, and more difficult for users to remember.
- The values are random and cannot accept any patterns that may make them more meaningful to users.
- There is no way to determine the sequence in which uniqueidentifier values were generated. They are not suited for existing applications that depend on incrementing key values serially.
- At 16 bytes, the uniqueidentifier data type is relatively larger than other data types, such as 4-byte integers. This means indexes that are built using uniqueidentifier keys might be relatively slower than indexes using an int key.
If you are using NEWID function in SQL Server, then this generates random UUIDs which have a huge domain but the chances of GUID collisions are always there though the probability is very slim in nature. If you are using NEWID function to generate uniqueidentifiers as row identifiers in your table, then you need to think again! Uniqueness of the row should be enforced using aUnique or Primary Key constraint on the table. NewSequentialID function uses identification number of the computer network card plus a unique number from the CPU clock to generate the uniqueidentifier (Reference article). So the chance of getting a globally unique value is practically guaranteed as long as the machine has a network card. Moreover, possibility of a GUID collision while using NewSequentialID is virtually impossible.
I shall use a simple example of how database operations can be affected when using the following as clustered indexes:
- Random GUIDs (using NEWID function)
- Sequential GUIDs (using NewSequentialGUID function)
- BIGINT identity value
- INT identity value
The last three would always introduce a hot-spot in your database table when concurrent inserts are occurring on the table as the INSERT operations would always insert into the last page of the table as the data for a clustered index is ordered in ascending order by default. This however is a discussion for a different post altogether. The tables that I have created are quite narrow and typically the average record size would be small:
create table tblGUID (RecID uniqueidentifier default newid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblSeqGUID (RecID uniqueidentifier default newsequentialid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblINT (RecID bigint identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
create table tblBigINT (RecID int identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
The clustered index for all the above tables are defined on the RecID column. The tables have no other indexes defined on them. I created a WHILE loop to insert 1 million records in each table and here are the times taken:
tblGUID: 22 seconds
tblSequential GUID: 16 seconds
tblBIGINT: 23 seconds
tblINT: 23 seconds
Given that you have a beefy server, the above time difference would not make much of a difference unless and until you only have a high number of concurrent INSERT workload on the server or during a Data Load operation which would cause a significant impact. What is interesting to note is that the fragmentation on the tables after the first batch of 1 million inserts.
| Object Name | Index Name | Pages | Average Record Size | Extents | Average Page Density | Logical Fragmentation | Extent Fragmentation | 
| tblGUID | cidx_tblGUID | 9608 | 51.89 | 1209.00 | 69.27 | 99.14 | 0.25 | 
| tblSeqGUID | cidx_tblSeqGUID | 6697 | 51.89 | 845.00 | 99.39 | 0.76 | 0.12 | 
| tblBigINT | cidx_tblBigINT | 5671 | 43.89 | 714.00 | 99.95 | 0.48 | 0.14 | 
| tblINT | cidx_tblINT | 5194 | 39.89 | 653.00 | 99.62 | 0.37 | 0.15 | 
If you look at the above data, you will see that the random GUIDs have 99% logical fragmentation in the tables. This is due to the random nature of the GUIDs generated which end up causing high number of page splits in the database.
Since GUIDs require more storage than integers, you will notice the number of pages for the tables using GUIDs have higher page counts. After inserting another million records in the above tables, I inserted 10,000 records (in batches of 100) in the above tables and tracked the times for the inserts at a more granular level:
tblGuid : 23 seconds (10 ms to 1120 ms)
tblSeqGuid : 0 seconds (3 ms to 10 ms)
tblBigINT : 0 seconds (3 ms to 13 ms)
tblINT : 0 seconds (3 ms to 20 ms)
You will find that the minimum and maximum times mentioned above show a high degree of variation for the random GUID table. The rest of the times are more or less comparable. The time taken for an insert is dependent on the range that the newly generated GUID falls into. The above data clearly shows that using random GUIDs can not only create fragmentation and page splits during Insert/Update operations.
Since NewSequentialID function uses the network card identifier to generate a new value, it might not be preferred where privacy is a concern because it is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
Now when I run a simple SELECT query against the above tables (using a cold buffer cache), I find the a COUNT(*) for the entire table shows the following IO statistics:
Table 'tblGUID'. Scan count 9, logical reads 20942, physical reads 236, read-ahead reads 20564, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 375 ms, elapsed time = 6212 ms.
Table 'tblSeqGUID'. Scan count 9, logical reads 14519, physical reads 383, read-ahead reads 14216, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 466 ms, elapsed time = 1442 ms.
Table 'tblBigINT'. Scan count 9, logical reads 12229, physical reads 287, read-ahead reads 12007, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 374 ms, elapsed time = 1039 ms.
Table 'tblINT'. Scan count 9, logical reads 11144, physical reads 252, read-ahead reads 10983, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 392 ms, elapsed time = 908 ms.
You will see that the number of pages read for the table using random GUIDs is the highest due to the high fragmentation on the tables. After I rebuilt the clustered index of the tblGUID table, the same query with a cold buffer cache only performed 14K+ reads which is comparable to the tblSeqGUID table.
Table 'tblGUID'. Scan count 9, logical reads 14376, physical reads 143, read-ahead reads 10642, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
However if you have frequent inserts in the table, you cannot keep the fragmentation down to a negligible value.
I have summarized what we discussed above in the table below:
| Criteria | GUIDs | Seq. GUIDs | BIGINT | INT | 
| Storage | 16 bytes | 16 bytes | 8 bytes | 4 bytes | 
| Insert/Update performance | Slowest | Comparable but the index keys are larger in size. For wider tables, this would be slower than Integer values. | Faster than sequential GUIDs | Fastest | 
| Hotspot contention | Very rare | Yes | Yes | Highest, due to smaller size of RIDs | 
| Fragmentation/Page Splits | High | Minimal | Minimal | Minimal | 
| JOIN Performance/SORT operations | Least performance (Rank 4 = Least performance) | Better than random GUIDs due lesser fragmentation (Rank: 3) | High performance (Rank: 2) | High Performance (Rank: 1) | 
| Logical reads | Rank 4=Highest | Rank 3 | Rank 2 | Rank 1=Least | 
| Merging data across servers | Beneficial | Beneficial | Difficult | Difficult | 
| Uniqueness | Rare chance of duplicates | Globally unique. Virtually no chance of collisions | Limited by range of BIGINT | Limited by range of INT | 
| 
 | 
The above considerations make the use of GUIDs unfavorable for a clustered index in environments which have large number of queries performing JOIN operations in OLTP and when referential integrity is enforced on the database among multiple tables. Throw non-clustered indexes that you created on the table as covering indexes for the frequently run queries against the database, you can have a significant performance bottleneck.
So hopefully the above points will help you choose an appropriate column for the clustered index on your table. The other strategy is to use natural keys. With this strategy, you do not use any kind of manufactured key, but instead use a business key to uniquely identify records. For example, a table that is used to store customer information might use the social security number column as the primary key instead of an identity column. The drawback to this approach is that the primary key might become large if more than one column is required to uniquely identify a record. Furthermore, this compound key must be propagated to other tables to support one or more foreign key relationships. These relationships, in turn, adversely affect join performance.
Based on the above considerations, you should be able to take an educated decision whether to use an integer, business key or GUIDs for your primary key of your table.
As a parting note, I would like to say that there is no short cut to testing. You would need to test the design that you decide to implement with a workload which is as close to your production workload to make sure that your design is performing as expected.
System Specifications used for the test:
Operating System: Microsoft Windows Server 2008 R2 Enterprise (x64)
Machine: Hewlett-Packard HP Z800 Workstation
Processor: 2 Quad Core [Intel(R) Xeon(R) CPU E5506 @ 2.13GHz, 2128 Mhz, 4 Core(s), 4 Logical Processor(s)]
RAM: 16.0 GB
Hard Disk: Barracuda 7200.12 SATA 3Gb/s 500GB Hard Drive
SQL Server: SQL Server 2008 R2
Regards,
Amit Banerjee
SEE, SQL Support
GUID vs INT Debate【转】的更多相关文章
- 【转载】GUID vs INT Debate
		I recently read a blog post on what was better using GUIDs or Integer values. This is been an age lo ... 
- Guid和Int还有Double、Date的ToString方法的常见格式(转载)
		Guid的常见格式: 1.Guid.NewGuid().ToString("N") 结果为: 38bddf48f43c48588e0d78761eaa1ce6 2.Gu ... 
- Guid和Int还有Double、Date的ToString方法的常见格式
		Guid的常见格式: 1.Guid.NewGuid().ToString("N") 结果为: 38bddf48f43c48588e0d78761eaa1ce6 2.Gu ... 
- Guid vs Int
		http://www.uml.org.cn/net/200512283.htm http://www.cnblogs.com/twodays/archive/2004/07/19/25562.aspx ... 
- guid转int
		如果你想生成一个数字序列,你将会获得一个19位长的序列. 下面的方法会把GUID转换为Int64的数字序列. private static long GenerateIntID() { ... 
- 使用Guid做主键和int做主键性能比较
		使用Guid做主键和int做主键性能比较 在数据库的设计中我们常常用Guid或int来做主键,根据所学的知识一直感觉int做主键效率要高,但没有做仔细的测试无法 说明道理.碰巧今天在数据库的优化过程中 ... 
- Sql server 数据库 int 和guid 两者的比较
		我们公司的数据库全部是使用GUID做主键的,很多人习惯使用int做主键.所以呢,这里总结一下,将两种数据类型做主键进行一个比较. 使用INT做主键的优点: 1.需要很小的数据存储空间,仅仅需要4 by ... 
- Guid与id区别
		一.产生Guid的方法 1.SqlServer中使用系统自带的NEWID函数: select NEWID() 2.C#中,使用Guid类型的NewGuid方法: Guid gid; ... 
- SQL GUID和自增列做主键的优缺点
		我们公司的数据库全部是使用GUID做主键的,很多人习惯使用int做主键.所以呢,这里总结一下,将两种数据类型做主键进行一个比较. 使用INT做主键的优点: 1.需要很小的数据存储空间,仅仅需要4 by ... 
随机推荐
- 2013 Multi-University Training Contest 9
			HDU-4687 Boke and Tsukkomi 题意:给定一个简单图,询问哪些边如果选择的话会使得最大的连边数减少. 解法:套用一般图的最大匹配算法(带花树)先算出最大匹配数,然后枚举一条边被选 ... 
- maven使用实例记录
			Maven入门指南① :Maven 快速入门及简单使用Maven入门指南② :Maven 常用命令,手动创建第一个 Maven 项目Maven入门指南③:坐标和依赖Maven入门指南④:仓库 Mave ... 
- jquery使用技巧
			1. 禁用右键点击(Disable right-click) <!DOCTYPE html> <html> <head> <meta charset=&quo ... 
- caffe的data_reader.cpp分析一下干了点什么
			首先说明:下面的内容不一定对 类body: 变量:LayerParameter param_ :它里面放的是:body传进来的layerparameter的参数: BlockingQueue<s ... 
- (五)Super VLAN
- Fragments
			http://developer.android.com/guide/components/fragments.html#Design Content Design Philosophy Creati ... 
- Android 视频投射之NanoHTTPD
			Android 视频投射之NanoHTTPD 号称用一个java文件实现Http服务器 有必要对其源码及例子进行分析 public abstract class NanoHTTPD { //异步执行请 ... 
- Oracle 遇到的错误及处理整理 - 记录
			1. 启动监听提示: TNS-: TNS:permission denied TNS-: TNS:protocol adapter error TNS-: Insufficient privilege ... 
- Oracle数据库和DB2数据库分页SQL的区别举例
			--------------------------ORACLE数据库分页SQL举例------------------------------------------------ SELECT * ... 
- hiho_1068_RMQ_st算法
			题目 给出一数组A,编号从1到n,然后进行q次查询,每次查询给出一个边界[beg, end],要求给出数组A中范围[beg, end]之内的最小值. 题目链接: RMQ_ST 分析 区间问题使用线段树 ... 
