How does database indexing work?
When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.
Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire table space must be searched at N block accesses.
Whereas with a sorted field, a Binary Search may be used, this has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.
Cluster & Non-cluster Intex:
cluster : used for easy retrival data from DB by altering the way that records are stored.
non-cluster : not alter the way it stores, but create a complete seperate object within the table. It point back to the original table rows after searching.
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indexes require additional space on the disk, (No-cluster Index)since the indexes are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.
How does it work?
Firstly, let’s outline a sample database table schema;
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows, and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).
Example 1
Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block sizeB = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.
A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value given that the id field is a key field. But since the id field is also sorted a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.
Now the firstName field is neither sorted, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks that the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.
Example 2
Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/18 = 277,778 blocks.
Now a search using the firstName field can utilise the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 277,778 block accesses required by the non-indexed table.
When should it be used?
Given that creating an index requires additional disk space (277,778 blocks extra from the above example), and that too many indexes can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.
Since indexes are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.
How does database indexing work?的更多相关文章
- What is the difference between a binary tree, a binary search tree, a B tree and a B+ tree?
Binary Tree : It is a tree data structure in which each node has at most two children. As such there ...
- 「2014-2-6」TokuMX and MongoDB related materials collection
简介参考 TokuMX 和 MongoDB 各自的官方站点. ## Tokutek 最重要的特点和 marketing word 是所谓 fractal tree indexing te ...
- EPiServer 简单项目总结
国内用到的EPiServer应该不多,所以记录点用到过的东西,以便分享 1.EPiServer office site http://www.episerver.com/ 2.EPiServer CM ...
- 【Professional English】Words Summary
01.数据库管理系统(Database Management Systems,DBMS) A database management system (DBMS) is a computer softw ...
- Exploring Micro-frameworks: Spring Boot--转载
原文地址:http://www.infoq.com/articles/microframeworks1-spring-boot Spring Boot is a brand new framework ...
- sql index改怎么建
https://stackoverflow.com/questions/11299217/how-can-i-optimize-this-sql-query-using-indexes ------- ...
- 对HashMap的一次记录
HashMap的具体学习,认识了解. 前言 也是最近开始面试才发现,HashMap是问的真多.以前听学长或自己在网上看到过一些面试资料都在说集合.线程这块比较重要,面试的重点.自己也是有那抵触情绪,所 ...
- Importing/Indexing database (MySQL or SQL Server) in Solr using Data Import Handler--转载
原文地址:https://gist.github.com/maxivak/3e3ee1fca32f3949f052 Install Solr download and install Solr fro ...
- [转]Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications
This article is from blog of Amazon CTO Werner Vogels. -------------------- Today is a very exciting ...
随机推荐
- Java应用短信猫
首先确定短信猫正常连接到主机,并安装SIM卡.先用超级终端测试短息猫能不能用.安装minicom:#sudo apt-get install minicom安装完成后,执行#sudo minicom ...
- 例题6-7 Trees on the level ,Uva122
本题考查点有以下几个: 对数据输入的熟练掌握 二叉树的建立 二叉树的宽度优先遍历 首先,特别提一下第一点,整个题目有相当一部分耗时在了第一个考查点上(虽然有些不必要,因为本应该有更简单的方法).这道题 ...
- Redis源码研究--双向链表
之前看的内容,占个位子,以后补上. ----------8月4日--------------- 双向链表这部分看的比较爽,代码写的中规中矩,心里窃喜,跟之前学的<数据结构>这本书中差不多. ...
- 生动有趣的动画Toast--第三方开源--NiftyNotification
NiftyNotification在github上的项目主页是:https://github.com/sd6352051/NiftyNotificationNiftyNotification本身又依赖 ...
- 使用virtualenv或zc.buildout创建Python-tornado分离环境
originally created by shuliang under CC BY-NC-ND 3.0 license 一.引言 学习编程,好比练功,总得先有个环境,搭台子是必须的.为了照顾初学者, ...
- 《深入浅出WPF》重点摘要(—)Binding自动通知机制
最近因为公司的项目需要用WPF开发,就学习了一下WPF.刚开始只是用到什么就百度什么,虽然功能是实现了,但还是没有弄清楚原理(如果不弄清原理,会感觉很心虚,整个人会没底气),所以决定找个教程系统地学一 ...
- Azure IaaS for IT Pros Online Event 总结
微软一个为期4天的一个有关于Azure的介绍,主要总结了些Azure现有的技术以及将会推出东西 主题链接 http://channel9.msdn.com/Events/Microsoft-Azure ...
- python之方法总结
python的OOP的方法有3种: 1. 实例方法: 接收self参数 2. 类方法: 接收cls参数, 并要用classmethod()注册或者@classmethod注解. 3. 静态方法: 不接 ...
- always语言指导原则
1.每个always只有一个@(event-expression). 2.always块可以表示时序逻辑和组合逻辑. 3.带有posedge和negedge关键字的是表示沿触发的时序逻辑,没有的表示组 ...
- I/O空间映射
转自:http://www.cnblogs.com/hydah/archive/2012/04/10/2232117.html 注:部分资料和图片来源于网络,本文在学习过程中对网络资源进行再整理. I ...