David Berube  | June 12, 2018 |  Posted In: MySQL

Modern applications often feature the use of data in many different languages. This is often true even of applications that only offer a user-facing interface in a single language. Many users may, for example, need to enter names which, although using Latin characters, feature diacritics; in other cases, they may need to enter text which contains Chinese or Japanese characters. Even if a user is capable of using an application localized for only one language, it may be necessary to deal with data from a wide variety of languages.

Additionally, increased use of mobile phones has led to changes in communications behaviour; this includes a vastly increased use of standardized characters intended to convey emotions, often called “emojis” or “emoticons.” Originally, such information was conveyed using ASCII text, such as “:-)” to indicate happiness – but, as noted, this has changed, with many devices automatically converting such sequences into single character “emojis.” Such emojis are not typically stored as a graphic; instead, they are now a standard part of Unicode.

Since Unicode is a long established standard, and since MySQL has had support for Unicode for quite some time, one would imagine it would be seamless and easy to include them in your application.

Unfortunately, there are several problems that may complicate that path for many users – first, though, let’s discuss some background, so that we can fully understand the problem.

What is encoding?

“Encoding,” as you may already be aware, refers to the mapping of characters to binary values – or “code points”. One of the oldest standards still in use is ASCII; in this encoding, the binary sequence “100 0001” is equivalent to the uppercase character “A”. Many characters cannot be encoded into US-ASCII; in fact, since it uses only seven bits per character, it can store only 128 different code points. Some of these code points are characters – like the “A” already mentioned – and others carry alternative meanings, such as formatting.

For example, “000 1001” represents a “tab” in US-ASCII. Later, ASCII coding was replaced with various 8-bit encodings, which could hold more different code points – but it was ultimately a standard called Unicode which dethroned ASCII. Unicode actually encompasses a number of different encodings – but it is UTF8 which is the most important, and that’s what we will discuss in this post.
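To make the ASCII mapping concrete, here is a minimal Python sketch (the function name `ascii_bits` is purely illustrative) that prints the seven-bit pattern for a character and confirms how many code points seven bits can hold:

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern ASCII uses for a single character."""
    code_point = ord(ch)
    if code_point > 127:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code_point, "07b")


print(ascii_bits("A"))   # 1000001
print(ascii_bits("\t"))  # 0001001
print(2 ** 7)            # 128 possible code points
```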

“Collation” is a related concept; this refers to how characters are sorted. This may, at first, seem simple and logical. However, in practice, it can be more complicated. For example, some poorly programmed systems inadvertently sort in a “case sensitive manner” when “case insensitive” would be more appropriate. Such a system may sort “b,a,B,A,c” as “A,B,a,b,c” – whereas it may be more desirable to sort it as “A,a,B,b,c.” This is an example of differing collations. In languages other than English, there may be more than one reasonable way to sort a list of strings; this is particularly true in languages that do not use an alphabet, such as Chinese or Japanese.
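The two orderings described above can be reproduced in a few lines of Python. This is only a sketch of the idea of a collation, not of how MySQL implements one; the function names are illustrative:

```python
letters = ["b", "a", "B", "A", "c"]


def sort_case_sensitive(items):
    # Compares raw code points, so every uppercase letter sorts before lowercase.
    return sorted(items)


def sort_case_insensitive(items):
    # Compare the lowercased form first; break ties with the original string
    # so that "A" lands just before "a".
    return sorted(items, key=lambda s: (s.lower(), s))


print(sort_case_sensitive(letters))    # ['A', 'B', 'a', 'b', 'c']
print(sort_case_insensitive(letters))  # ['A', 'a', 'B', 'b', 'c']
```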

Why can encoding be a problem in MySQL?

Unicode adoption was by no means universal, and by no means quick. For a very long time, MySQL’s default encoding was latin1; this supports basic English text and common punctuation reasonably well. However, it has limited support for other languages, and it does not support modern emoji characters. Eventually, MySQL very reasonably changed its default to a Unicode encoding (utf8mb4, as of MySQL 8.0) – which, one would imagine, fixed the issue for many people… except that existing databases were not converted, and many databases still, to this day, have some, or even all, tables encoded as latin1 – not as a conscious choice, but simply as a relic of an older time.

Additionally, “utf8” encoding in MySQL does not, in fact, mean standard UTF8. Standard UTF8 encoding involves a variable number of bytes per character, with a maximum of four bytes per character; most characters, however, use three or fewer. MySQL’s “utf8”, for legacy technical reasons, supports a maximum of three bytes – which, regrettably, means that it does not work with four-byte characters, which include emojis and some mathematical symbols.
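A quick way to see which characters fall outside the three-byte limit is to count their UTF-8 bytes. The following Python sketch does that; `fits_mysql_utf8` is a hypothetical helper that only approximates MySQL’s check by byte length:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most three bytes in UTF-8."""
    return all(utf8_length(ch) <= 3 for ch in text)


samples = {
    "Latin A": "A",
    "e with acute": "\u00e9",
    "Chinese character": "\u4e2d",
    "grinning face emoji": "\U0001F600",
}
for label, ch in samples.items():
    print(label, utf8_length(ch), fits_mysql_utf8(ch))
```

Only the emoji needs four bytes, so it is the only sample that a three-byte “utf8” column cannot store.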

As a result, many databases are using MySQL’s “utf8” encoding or its older “latin1” default. In both cases, you may receive vexing “Incorrect string value: ” errors when users attempt to enter unsupported characters.

Changing encoding and collations

Both encoding and collation can be set on a per-column level in MySQL. You can also set this value on a per-table level, which sets the default for new columns; further, you can set it on the database level, which sets the default for new tables. Finally, you can set it at the server level, which specifies a default for new databases.
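The layering can be pictured as a chain of fallbacks. The Python sketch below is a simplified model only: `effective_charset` is a hypothetical function, and in MySQL itself the default is copied when an object is created, so changing a database default later does not alter existing tables or columns.

```python
from typing import Optional


def effective_charset(
    server: str,
    database: Optional[str] = None,
    table: Optional[str] = None,
    column: Optional[str] = None,
) -> str:
    """Pick the most specific character set that was explicitly given."""
    for setting in (column, table, database):
        if setting is not None:
            return setting
    return server


# A column-level setting wins over the table default.
print(effective_charset("latin1", database="utf8", table="utf8", column="utf8mb4"))
# With nothing more specific, the server default applies.
print(effective_charset("latin1"))
```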

Let’s walk through changing the encoding and collation for the MySQL sample database “sakila”. You can download this database at the following URL:

https://dev.mysql.com/doc/index-other.html

First, let’s start by examining the “actor” table:

mysql> SHOW CREATE TABLE actor\G
*************************** 1. row ***************************
Table: actor
Create Table: CREATE TABLE `actor` (
`actor_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`first_name` varchar(45) DEFAULT NULL,
`last_name` varchar(45) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`actor_id`),
KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

As we can see here, the encoding on this table is set to UTF8; all of the VARCHAR columns listed are also encoded as UTF8. If one of them were encoded with a different encoding, it would be listed as part of its column definition, e.g. “first_name varchar(45) CHARACTER SET latin1 DEFAULT NULL” instead of “first_name varchar(45) DEFAULT NULL”.

To change the encoding and collation for a particular column, we can use the MODIFY COLUMN clause of ALTER TABLE:

ALTER TABLE actor MODIFY COLUMN first_name VARCHAR(45) CHARACTER SET utf8mb4;

This, unsurprisingly enough, changes the character set to utf8mb4 – meaning this column can now support emojis and other four-byte characters. Let’s see what that does to our table definition:

mysql> show create table actor\G
*************************** 1. row ***************************
Table: actor
Create Table: CREATE TABLE `actor` (
`actor_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`first_name` varchar(45) CHARACTER SET utf8mb4 DEFAULT NULL,
`last_name` varchar(45) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`actor_id`),
KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

We can see that the “first_name” column has been changed to utf8mb4; however, the “last_name” column is still using the default character set, utf8.

We can use the following command to set the default charset and convert all of the individual columns to our new character set:

ALTER TABLE actor CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Note that the above command has a COLLATE clause; although we are focusing on changing encodings in this post, you can have either a CHARACTER SET clause, a COLLATE clause, or both in all of the commands we’ve mentioned – allowing you to change either the encoding or the collation or both at once.

Let’s see what this command does to our table definition:

mysql> show create table actor\G
*************************** 1. row ***************************
Table: actor
Create Table: CREATE TABLE `actor` (
`actor_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`first_name` varchar(45) DEFAULT NULL,
`last_name` varchar(45) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`actor_id`),
KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)

As noted, MySQL only displays per-column encodings in table definitions if they are different from the default. We can see, therefore, that all of the columns are now in utf8mb4 encoding. Additionally, it only displays table level collations if they are different from the default – and since utf8mb4_general_ci is the default collation for utf8mb4 (in MySQL 5.7 and earlier; MySQL 8.0 defaults to utf8mb4_0900_ai_ci), it won’t display it either at the table level or the column level. (If we had changed it to a different collation – say, utf8mb4_bin or utf8mb4_unicode_ci – it would, in fact, show up.)

At this point, we’ve successfully converted a single table to utf8mb4. However, this approach seems onerous for a large database – is there a better way?

Converting a database at a time with mysql_change_database_encoding

For the purposes of this blog, I’ve encapsulated the logic to run the relevant commands for an entire database into a short Ruby script. You can download and install as follows:

git clone git@github.com:djberube/mysql_change_database_encoding.git
cd mysql_change_database_encoding
bundle

The script uses MySQL’s INFORMATION_SCHEMA database to get a list of all tables, and then migrates them:

MYSQL_DATABASE=sakila MYSQL_USER=some_mysql_user MYSQL_PASSWORD=some_mysql_password ruby mysql_change_database_encoding.rb --collation utf8mb4_unicode_ci --encoding utf8mb4 --direct --no-osc
Connecting to sakila
Processing database settings.
-- Setting database global settings.
Running SQL:
ALTER DATABASE `sakila` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0009s
-- Migrating without OSC
Running SQL:
ALTER TABLE `actor` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0036s
-- Migrating without OSC
Running SQL:
ALTER TABLE `address` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0670s
-- Migrating without OSC
Running SQL:
ALTER TABLE `category` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0293s
-- Migrating without OSC
Running SQL:
ALTER TABLE `city` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0400s
-- Migrating without OSC
Running SQL:
ALTER TABLE `country` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0239s
-- Migrating without OSC
Running SQL:
ALTER TABLE `customer` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0607s
.. snip...

I’ve cut the output down a bit for brevity. First, this script sets the default encoding and collation for the entire database; then it converts each table using the ALTER TABLE .. CONVERT TO CHARACTER SET command.
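The sequence of statements in that output is simple enough to sketch. The Python below is not the Ruby script itself, only an illustration of the same two steps; the function names and the `TABLE_LIST_QUERY` constant are assumptions made for this example, and nothing here connects to a server.

```python
# Query one could run to list the tables of a schema (shown for reference only;
# %s is a placeholder for the schema name).
TABLE_LIST_QUERY = (
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_SCHEMA = %s AND TABLE_TYPE = 'BASE TABLE'"
)


def quote_identifier(name: str) -> str:
    # MySQL quotes identifiers with backticks; embedded backticks are doubled.
    return "`" + name.replace("`", "``") + "`"


def build_migration_sql(database, tables, charset, collation):
    """Return the ALTER statements: database default first, then each table."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET {charset} COLLATE {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements


for sql in build_migration_sql(
    "sakila", ["actor", "address"], "utf8mb4", "utf8mb4_unicode_ci"
):
    print(sql)
```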

You may have noticed the “Migrating without OSC” lines in the output; OSC, or online schema change, is a technique for reducing the impact of database migrations on production installations. A typical technique for doing this is to create a duplicate of your table, set up triggers to keep that duplicate up to date, change the new table, and then swap them – this is sufficiently complicated that it’s nontrivial to DIY, and so there are a few very nice tools available to do this. By using one of these tools, we can run schema changes in production environments while reducing the performance impact – having to lock a large table while converting it to utf8mb4 may, indeed, take a large system down.

pt-online-schema-change

Percona Toolkit has a great tool for OSC, called pt-online-schema-change; the script mentioned above has built-in support for pt-online-schema-change. You can download it from here:

https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html

We can re-run our script using pt-online-schema-change by removing the “--no-osc” option and replacing it with, logically enough, a “--osc” option:

# MYSQL_DATABASE=sakila MYSQL_USER=some_mysql_user MYSQL_PASSWORD=some_mysql_password ruby mysql_change_database_encoding.rb --collation utf8mb4_unicode_ci --encoding utf8mb4 --direct --osc
Connecting to sakila
Processing database settings.
-- Setting database global settings.
Running SQL:
ALTER DATABASE `sakila` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-> 0.0007s
This SQL will be run using pt-online-schema-change:
CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
The following command will be run:
No slaves found. See --recursion-method if host spacepancake has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Child tables:
`sakila`.`film_actor` (approx. 5462 rows)
.. snip...

Note that pt-online-schema-change can only be run against tables with a primary key; for tables without one, the mysql_change_database_encoding.rb script will automatically fall back to directly running MySQL commands if the --direct flag is set.
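If you would rather call pt-online-schema-change yourself for a single table, the invocation can be assembled along these lines. This Python sketch is hand-written for illustration and is not what the script above runs; `build_pt_osc_command` is a hypothetical helper, and connection options such as host, user, and password are deliberately left out. It relies only on the tool’s `--alter`, `--dry-run`, and `--execute` options and its `D=database,t=table` DSN syntax.

```python
import shlex


def build_pt_osc_command(database, table, charset, collation, execute=False):
    """Build an argument list for pt-online-schema-change (dry run by default)."""
    alter = f"CONVERT TO CHARACTER SET {charset} COLLATE {collation}"
    return [
        "pt-online-schema-change",
        "--alter", alter,
        f"D={database},t={table}",
        "--execute" if execute else "--dry-run",
    ]


command = build_pt_osc_command("sakila", "actor", "utf8mb4", "utf8mb4_unicode_ci")
# shlex.join quotes the --alter value so the line can be pasted into a shell.
print(shlex.join(command))
```

Defaulting to `--dry-run` means nothing is changed until you explicitly pass `execute=True`, which seems a sensible habit on a production system.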

If you encounter any issues with the above script, please let me know via http://berubeconsulting.com/ or via Github. Pull requests are welcome.

Potential problems

Of course, there are several issues which may occur when changing your encoding or collation.

MySQL Version

Firstly, note that utf8mb4 support is only available in MySQL 5.5.3 or later; earlier than that, and you’re limited to MySQL’s nonstandard UTF8 implementation, with a maximum of three bytes per code point. In this case, it is generally advisable to upgrade to a recent version of MySQL – though you could, if desired, use the above approach to migrate your database to utf8 encoding.

Applications that need variable encoding

The second issue is that the approach detailed above – where a script automatically migrates all of the different tables – will result in every table having its encoding and/or collation changed to the same destination encoding and collation. That’s not necessarily a problem – but some applications do, indeed, make use of varying encodings for different tables and, in some cases, different columns in the same table. If so, you’d do well to use the above SQL examples as a guide, and manually create a SQL script – or a shell script that repeatedly calls pt-online-schema-change – which will do the migration for you. However, in many cases, a single encoding is both possible and desirable.

Key Length

Additionally, note that maximum key lengths may be an issue for MySQL 5.6 and earlier installations. This is because earlier installations have a maximum key size limitation on indices (767 bytes per indexed column for InnoDB by default); compared to utf8 columns, utf8mb4 columns have a higher maximum length on disk per character, and the limit is easy to bump into once you switch to utf8mb4. For example, many schemas have VARCHAR(255) columns – those are created by Ruby on Rails by default if one does not specify a column length – and an indexed VARCHAR(255) column, which fits within the limit as utf8, exceeds it as utf8mb4. You could write a script that automatically resizes these indices or their associated columns, but I would recommend either upgrading to 5.7 or, if running 5.5 or later, enabling the innodb_large_prefix setting, which allows larger indices (it also requires the DYNAMIC or COMPRESSED row format).
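The arithmetic behind this is worth spelling out. The Python sketch below assumes the 767-byte InnoDB index prefix limit that applies when innodb_large_prefix is not enabled; the constant and function names are illustrative.

```python
# Assumed limit: InnoDB index key prefix without innodb_large_prefix.
INNODB_PREFIX_LIMIT_BYTES = 767

# Maximum bytes MySQL reserves per character for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def index_bytes(length: int, charset: str) -> int:
    """Worst-case bytes needed to index a VARCHAR(length) column in full."""
    return length * MAX_BYTES_PER_CHAR[charset]


def fits_in_index(length: int, charset: str) -> bool:
    return index_bytes(length, charset) <= INNODB_PREFIX_LIMIT_BYTES


def max_indexable_chars(charset: str) -> int:
    """Longest VARCHAR that can still be indexed in full under the limit."""
    return INNODB_PREFIX_LIMIT_BYTES // MAX_BYTES_PER_CHAR[charset]


print(index_bytes(255, "utf8"), fits_in_index(255, "utf8"))        # 765 True
print(index_bytes(255, "utf8mb4"), fits_in_index(255, "utf8mb4"))  # 1020 False
print(max_indexable_chars("utf8mb4"))                              # 191
```

This is why VARCHAR(191) shows up so often in schemas that were migrated to utf8mb4 on older servers.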

False positives

Finally, note that for some legacy installations, the mere fact of a column, table, or database being marked as “latin1” encoded or “utf8” encoded may not, in fact, mean that the data is actually encoded in that way; this may be because an application incorrectly marked the encoding of its data. In that case, recovery may be complex or impossible, and will certainly be situation dependent – particularly since this issue may not affect all rows.
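One common form of this problem is UTF-8 bytes stored in a column labelled latin1. The Python sketch below reproduces the effect and its reversal for that one case only; the function names are hypothetical, and Python’s "latin-1" codec (ISO-8859-1) is used as a stand-in for MySQL’s latin1, which differs for some byte values. Real data may be damaged in other ways that this round trip will not repair.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being interpreted as latin1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_misread_text(garbled: str) -> str:
    """Undo the misreading: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "caf\u00e9"
garbled = misread_as_latin1(original)
print(garbled)                       # the accented letter becomes two characters
print(repair_misread_text(garbled))  # the original text again
```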

Of course, to ensure that your particular application works without incident on a new encoding – and, to a lesser extent, collation – it’s wise to thoroughly test any changes in a staging environment; if feasible, it’s likely wise to test on a copy of the production environment as well.

Conclusion

Unicode support is no longer an arcane, unapproachable topic; it’s both possible and highly advisable to ensure that your application works well for international users and for the growing number of users who use emojis. This is quickly becoming not merely a value-add, but an expected part of an application’s feature set, and implementing full support in your MySQL application is relatively straightforward.

If you’ve found this post useful, feel free to let me know at djberube@berubeconsulting.com, or via http://berubeconsulting.com.

Questions, comments, and reports of any inaccuracies are welcome.

David Berube

David Berube is a freelance Ruby on Rails developer and MySQL performance consultant. He specializes in maintaining legacy systems. He authored the books “Practical Rails Gems” and “Practical Reporting with Ruby and Rails”, and co-authored the book “Practical Rails Plugins,” and he's written for venues like Dr. Dobb's Journal, Linux Pro Magazine, and IBM DeveloperWorks. Website: berubeconsulting.com
