https://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/

I sometimes hear: “make everything utf-8 in your database, and all will be fine”. This so-called advice could not be further from the truth. Using UTF-8 will indeed take care of internationalization and code-page problems, but it comes at a price, which may be too high for you to pay, especially if you never realized it was there.

Indexing is everything, or at least: good indexing makes or breaks your database. The fact remains: the smaller your indexes, the more index records can be loaded into memory and the faster the searches will be. So using small indexes pays off. Period. But what does this have to do with UTF-8?

First off: beware of the VARCHAR

As you know, a VARCHAR field holds a variable amount of data; you only declare the maximum length that it can store. So a VARCHAR(255) can hold 255 characters, but when you store only 5 characters, it will only use 5 characters of storage. The other 250 are not wasted. This is completely different from a CHAR(255), where storing a 5-character string results in 250 characters of padding. So VARCHAR() has a big advantage over CHAR() when you have variable-sized strings. But you have to realize that this advantage applies to disk storage only. It does not apply to any other data structure that MySQL uses internally, or to indexes.

How MySQL treats varchars

When MySQL needs to sort records, it must set aside some space for sorting that data. This space is allocated before the actual sorting takes place, which means that MySQL needs to know up front how much memory to allocate. When we sort on VARCHAR fields, MySQL handles this by allocating for the worst case, which is the maximum size the VARCHAR field can take. For example: when you have declared a field as VARCHAR(100), MySQL will reserve space for 100 characters plus an additional 1 or 2 bytes for holding the length of the string (1 when the length is 255 or less, 2 otherwise). So this busts the myth that “you can safely use VARCHAR(255) for all fields without problems”.
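To make the gap between what is stored and what is reserved concrete, here is a rough Python sketch. The helper names are made up for this illustration, it assumes a single-byte charset such as latin1, and the arithmetic only follows the rules described above; it is not taken from the MySQL source.

```python
def length_prefix_bytes(max_bytes: int) -> int:
    """Length prefix as described above: 1 byte up to 255, otherwise 2."""
    return 1 if max_bytes <= 255 else 2


def stored_bytes(value: str, max_chars: int) -> int:
    """Approximate on-disk size of one VARCHAR value (single-byte charset)."""
    if len(value) > max_chars:
        raise ValueError("value does not fit in the column")
    return len(value.encode("latin-1")) + length_prefix_bytes(max_chars)


def reserved_bytes(max_chars: int) -> int:
    """Worst-case size reserved for the same column (single-byte charset)."""
    return max_chars + length_prefix_bytes(max_chars)


print(stored_bytes("hello", 100))  # 6: five data bytes plus one length byte
print(reserved_bytes(100))         # 101
print(reserved_bytes(255))         # 256
```

So a 5-character value in a VARCHAR(255) costs about 6 bytes on disk, but roughly 256 bytes wherever the worst case is reserved.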

Characters and bytes: or the UTF8-problem

Did you notice that I talk about “characters” and “bytes”? That’s because those two terms are not the same. A byte equals 8 bits, and can hold any number ranging from 0 to 255 (or -128..127, if you have read my two’s complement blog). The size of a character, however, depends on the character encoding used, and this is where the UTF-8 “problem” kicks in. Back in the old days, when most people stored strings in a latin1 charset, every character could be stored in a single byte. Thus: varchar(100) would be 100 bytes (+1 for the length). But a single byte is not enough to hold ALL characters in the world (for instance, Arabic and Japanese characters cannot be stored in latin1). That’s why UTF-8 can use multiple bytes for some characters. The “standard” characters are stored in 1 byte, so most utf8 strings are almost the same size as latin1 strings, but when you need other characters it can use up to 4 bytes per character. If you would like to know more about UTF-8, there are excellent other blogs about it.
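The difference between characters and bytes is easy to see outside of MySQL. The following Python sketch counts both for a few sample strings. The samples and the helper name are my own, and Python's "latin-1" codec is only an approximation of MySQL's latin1 charset.

```python
def byte_length(text: str, encoding: str):
    """Bytes needed for text in the given encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


samples = [
    "joshua",              # plain ASCII
    "Jos\u00e9",           # "José", with an accented e
    "\u65e5\u672c\u8a9e",  # "日本語" (Japanese)
]

for text in samples:
    print(text, len(text), byte_length(text, "latin-1"), byte_length(text, "utf-8"))

# joshua 6 6 6
# José 4 4 5
# 日本語 3 None 9
```

The character count stays the same, but the byte count depends on the encoding, and the Japanese sample cannot be represented in latin1 at all.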

You do have to realize that MySQL’s utf8 character set only uses a maximum of 3 bytes per character, which means that not ALL UTF-8 characters can be stored in it: anything that needs 4 bytes (the characters outside the Basic Multilingual Plane) is left out. MySQL 5.5.3 and later offer a separate utf8mb4 character set for those. That’s why it can get confusing when you read that UTF-8 uses up to 4 bytes, while MySQL counts with 3.
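As a rough way to see which characters fall outside the 3-byte limit, the Python sketch below checks the UTF-8 length of each character. The function names are hypothetical and the check only mirrors the 3-byte rule described above; it does not talk to MySQL.

```python
def fits_in_three_bytes(ch: str) -> bool:
    """True if a single character needs at most 3 bytes in UTF-8."""
    return len(ch.encode("utf-8")) <= 3


def first_oversized_index(text: str):
    """Index of the first character that needs 4 bytes, or None if all fit."""
    for index, ch in enumerate(text):
        if not fits_in_three_bytes(ch):
            return index
    return None


print(len("a".encode("utf-8")))           # 1
print(len("\u00e9".encode("utf-8")))      # 2 (e with acute accent)
print(len("\u20ac".encode("utf-8")))      # 3 (euro sign)
print(len("\U0001F600".encode("utf-8")))  # 4 (an emoji)

print(first_oversized_index("Jos\u00e9"))         # None
print(first_oversized_index("hi \U0001F600"))     # 3
```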

Let’s define a table with an index:

CREATE TABLE `tbl` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `first_name` varchar(100) character set latin1 collate latin1_general_ci NOT NULL,
  `last_name` varchar(100) character set latin1 collate latin1_general_ci NOT NULL,
  `birth_date` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `first_name` (`first_name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

This creates a simple table with a primary key on id and a single secondary index on first_name. You need to add at least 2 rows, otherwise the EXPLAIN will not work correctly for this example. So add some data and find out which index will be used when issuing the following query:

EXPLAIN SELECT * FROM tbl WHERE first_name LIKE 'joshua';

Output:

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | tbl   | range | first_name    | first_name | 102     | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
1 row in set (0.00 sec)

The most important field here is key_len, which is 102 bytes: 100 bytes for the VARCHAR(100), since latin1 needs one byte per character, plus 2 length bytes (key_len always counts 2 length bytes for a VARCHAR, whatever its declared size).

Now, let’s adjust the fields to UTF-8:

ALTER TABLE `tbl` CHANGE `first_name` `first_name` VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
EXPLAIN SELECT * FROM tbl WHERE first_name LIKE 'joshua';

Output:

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | tbl   | range | first_name    | first_name | 302     | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
1 row in set (0.00 sec)

Immediately you should see the impact. The key_len is 200 bytes larger, which means that we can hold fewer index records in memory, which means more disk reads, which means a slower database.
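The two key_len values can be reproduced with simple arithmetic. The sketch below is inferred from the two EXPLAIN outputs above, for an index on a whole NOT NULL VARCHAR column; the function name is made up and this is not MySQL's actual code.

```python
VARCHAR_LENGTH_BYTES = 2  # length bytes counted in key_len, as seen above


def estimated_key_len(max_chars: int, max_bytes_per_char: int) -> int:
    """Estimate key_len for an index on a whole NOT NULL VARCHAR column."""
    return max_chars * max_bytes_per_char + VARCHAR_LENGTH_BYTES


latin1_len = estimated_key_len(100, 1)  # latin1: 1 byte per character
utf8_len = estimated_key_len(100, 3)    # MySQL utf8: up to 3 bytes per character

print(latin1_len)             # 102
print(utf8_len)               # 302
print(utf8_len - latin1_len)  # 200
```

The same formula shows why the declared length matters: a VARCHAR(255) in utf8 would come out at 255 * 3 + 2 = 767 bytes.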

But it doesn’t stop at the indexes. As said, this limitation applies to all internal buffers. All temporary sorting uses fixed-length buffers, so a result set that could be sorted in memory with latin1 can just as easily end up in a temporary table on disk with UTF-8 because of its size. It WILL then perform less efficiently because of the extra disk reads and writes.

Conclusion:

MySQL and its inner workings can be insanely complex. It’s important to never assume anything and to test everything. Don’t convert everything to UTF-8 just because; make sure you have good reasons NOT to use a single-byte encoding like latin1. If you need to use the UTF-8 encoding, then make sure that you use the correct sizes. Don’t make everything VARCHAR(255) just so that you can store really long names. The penalties for “disrespecting” the database can and will be severe. :)
