Firebird Character Sets and Collations
Firebird Character Sets and Collations
Every CHAR or VARCHAR field can (or, better: must) have a character set assigned. Firebird uses this information to correctly store the bytes that make up the character string.
In order to be able to sort or compare strings, you also need to define a collation. A collation defines the sort ordering and uppercase conversions for a string.
Firebird is unable to transliterate between character sets. So you must set the correct values on the server and on the client if everything is to work smoothely.
An Example
In the German language there are the "Umlauts", special vowels with a double-dot (diaeresis) over them. A common last name in Germany is Müller. (If you don't have umlauts on your keyboard, you could also write "Mueller", but that's not what we want to discuss here ;-)
When you convert Müller to uppercase you get MÜLLER, so there is an uppercase Ü and a lowercase ü.
When you want to perform a lexicographic compare on the name, you have several options:
- You can treat the Ü like a U (German "Duden" dictionary)
- You can treat the Ü like UE (German telephone book)
- You can treat the Ü like a special character, sorted in after Z (a common practice in Scandinavia)
Creating a Database
You can define the default character set for a new database in the CREATE DATABASE statement:
CREATE DATABASE <database>
USER <username>
PASSWORD <password>
PAGE_SIZE <pagesize>
DEFAULT CHARACTER SET <charset>
For example:
CREATE DATABASE localhost:meter
USER SYSDBA
PASSWORD masterkey
PAGE_SIZE 4096
DEFAULT CHARACTER SET ISO8859_1;
From now on, any VARCHAR or CHAR field will default to the ISO8859_1 character set. You can, however, specify a special character set for each column:
CREATE TABLE users (
CZECH_NAME VARCHAR(50) CHARACTER SET ISO8859_2,
...
Collations
There is no default collation. So you should define a collation for every field that is to be used for sorting (ORDER BY) or comparing (UPPER):
CREATE TABLE users (
NAME VARCHAR(50) COLLATE DE_DE,
...
COLLATE DE_DE means: "use a collation for the German language (the first DE), applying the rules from Germany (the second DE)"
You can also specify the collation with the ORDER BY clause:
ORDER BY LASTNAME COLLATE FR_CA, FIRSTNAME COLLATE FR_CA
or with the WHERE clause:
WHERE LASTNAME COLLATE FR_CA = :lastnametosearch
or when searching:
WHERE UPPER (LAST_NAME COLLATE SV_SV) = 'PAULSEN';
The UPPER() function
UPPER() only works correctly if there is a collation defined for the parameter field:
WHERE UPPER (NAME COLLATE DE_DE) = 'MÜLLER';
Specifying the client character set
ISQL
SET NAMES ISO8859_1;
InterBase Objects (Ibo) by Jason Wharton
The TIb_Connection class has a string property named CharSet. Assign it the name of the character set to use:
Ib_Connection1.CharSet := 'ISO8859_1';
InterBase Express (IBX), built into Delphi
The TIbDatabase class has a TStrings property named Params. Add a field with the name lc_ctype and specify the character set:
IbDatabase1.Params.Add ('lc_ctype=ISO8859_1');
PHP
In PHP you define the Client Character Set when you connect (or pconnect) to the database.
$db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");
Conversions
Conversions between character sets are always done as: CHARSET1 -> UNICODE -> CHARSET2
With NONE or OCTETS as the connection character set, the bytes are just copied: NONE/OCTETS -> CHARSET2 and CHARSET1 -> NONE/OCTETS.
Case insensitive searching
I have written a separate article about this.
Character Sets and Collations
| Character Set | Languages | Collation | Comments |
|---|---|---|---|
| Generic Character Sets | |||
| NONE | All | NONE | No character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 latin letters. |
| OCTETS | BINARY | All | OCTETS | Same as NONE. Cannot be used as client connection character set. Space character is #x00. Will be displayed as hex in ISQL 2.0. |
| ASCII | English | ASCII | English |
| Unicode based Character Sets | |||
| UNICODE_FSS | All | UNICODE_FSS | Unicode UTF-8. An old implementation that accepts malformed strings and does not enforce correct max. string length. All characters 3 bytes, no case mapping.
Superseded in Firebird 2.0 with the UTF8 character set. Deprecated. |
| UTF8 | All | UCS_BASIC | UCS_BASIC sorts in Unicode code-point order (Firebird 2.0) |
| UNICODE | Sorts using the Unicode Collation Algorithm (UCA) (Firebird 2.0) | ||
| UTF-8 | Case insensitive collation (Firebird 2.1) | ||
| UNICODE_CI_AI | Case insentive, Accent insensitive collation for Unicode (Firebird 2.5) | ||
| Current Character Sets | |||
| ISO8859_1 | Western Europe | ISO8859_1 | Latin-1 |
| DA_DA | Danish/Danmark | ||
| DE_DE | German/Germany | ||
| DU_NL | Dutch/The Netherlands | ||
| EN_UK | English/United Kingdom | ||
| EN_US | English/USA | ||
| ES_ES | Spanish/Spain | ||
| ES_ES_CI_AI | Spanish/Spain, case insensitive, accent insensitive (Firebird 2.0) | ||
| FI_FI | Finnish/Finnland | ||
| FR_CA | French/Canada | ||
| FR_FR | French/France | ||
| FR_FR_CI_AI | French/France, case insensitive, accent insensitive (Firebird 2.1) | ||
| IS_IS | Icelandic/Iceland | ||
| IT_IT | Italian/Italy | ||
| NO_NO | Norwegian/Norway | ||
| PT_PT | Portuguese/Portugal | ||
| PT_BR | Portuguese/Brasil (Firebird 2.0). Case+Accent insensitive | ||
| SV_SV | Swedish/Sweden | ||
| ISO8859_2 | Central Europe | ISO8859_2 | Central Europe |
| CS_CZ | Czech | ||
| ISO_HUN | Hungarian | ||
| ISO_PLK | Polish (Firebird 2.0) | ||
| ISO8859_3 | Southern Europe | ISO8859_3 | Maltese, Esperanto |
| ISO8859_4 | North European | ISO8859_4 | Estonian, Latvian, Lithuanian, Greenlandic, Lappish |
| ISO8859_5 | Cyrillic | ISO8859_5 | Russian, Ukrainian |
| ISO8859_6 | Arabic | ISO8859_6 | |
| ISO8859_7 | Modern Greek | ISO8859_7 | |
| ISO8859_8 | Hebrew | ISO8859_8 | |
| ISO8859_9 | Turkish | ISO8859_9 | |
| ISO8859_13 | Baltic | ISO8859_13 | Baltic |
| LT_LT | Lithuanian | ||
| WIN1250 | Central Europe | WIN1250 | Central Europe |
| BS_BA | Bosnian (Firebird 2.0) | ||
| WIN_CZ | Czech, case-insensitive (Firebird 2.0) | ||
| WIN_CZ_CI_AI | Czech, case-insensitive, accent-insensitive (Firebird 2.0) | ||
| PXW_CSY | Czech | ||
| PXW_HUN | Hungarian | ||
| PXW_HUNDC | Hungarian, Dictionary sort | ||
| PXW_PLK | Polish | ||
| PXW_SLOV | Slovanian | ||
| WIN1251 | Cyrillic | WIN1251 | Cyrillic |
| WIN1251_UA | Ukrainian | ||
| PXW_CYRL | Cyrillic, Paradox compatibility | ||
| WIN1252 | Western Europe, America | WIN1252 | Latin-1 with Windows extensions |
| WIN_PTBR | Brasilian Portuguese (Firebird 2.0). Case+Accent insensitive | ||
| PXW_INTL | Paradox ANSI International | ||
| PXW_INTL850 | Paradox Multi-Lingual Latin-1 | ||
| PXW_NORDAN4 | Paradox Norwegian and Danish | ||
| PXW_SPAN | Paradox Spanish | ||
| PXW_SWEDFIN | Paradox Swedish, Finnish | ||
| WIN1253 | Modern Greek | WIN1253 | |
| PXW_GREEK | Paradox Greek | ||
| WIN1254 | Turkish | WIN1254 | |
| PXW_TURK | Paradox Turkish | ||
| WIN1255 | Hebrew | WIN1255 | |
| WIN1256 | Arabic | WIN1256 | |
| WIN1257 | Baltic | WIN1257 | Baltic |
| WIN1257_LV | Latvian dictionary collation (Firebird 2.0) | ||
| WIN1257_LT | Lithuanian dictionary collation (Firebird 2.0) | ||
| WIN1257_EE | Estonian dictionary collation (Firebird 2.0) | ||
| WIN1258 | Vietnamese | Vietnamese (Firebird 2.0) | |
| MS-DOS, dBASE and Paradox compatibility | |||
| DOS437 | Western Europe, America | DOS437 | English/USA |
| DB_DEU437 | dBASE German | ||
| DB_ESP437 | dBASE Spanish | ||
| DB_FIN437 | dBASE Finnish | ||
| DB_FRA437 | dBASE French | ||
| DB_ITA437 | dBASE Italian | ||
| DB_NLD437 | dBASE Dutch | ||
| DB_SVE437 | dBASE Swedisch | ||
| DB_UK437 | dBASE English/UK | ||
| DB_US437 | dBASE English/US | ||
| PDOX_ASCII | Paradox ASCII code page | ||
| PDOX_INTL | Paradox International English code page | ||
| PDOX_SWEDFIN | Paradox Swedish/Finnish code page | ||
| DOS737 | Greek | DOS737 | Greek |
| DOS775 | Baltic | DOS775 | Baltic |
| DOS850 | Western Europe, America | DOS850 | Latin-1 (without Euro € symbol) |
| DB_DEU850 | dBASE German | ||
| DB_ESP850 | dBASE Spanish | ||
| DB_FRA850 | dBASE French/France | ||
| DB_FRC850 | dBASE French/Canada | ||
| DB_ITA850 | dBASE Italian | ||
| DB_NLD850 | dBASE Dutch | ||
| DB_PTB850 | dBASE Portuguese/Brasil | ||
| DB_SVE850 | dBASE Swedish | ||
| DB_UK850 | dBASE English/UK | ||
| DB_US850 | dBASE English/USA | ||
| DOS852 | Central Europe | DOS852 | Latin-2 (Central Europe) |
| DB_CSY | dBASE Czech | ||
| DB_PLK | dBASE Polish | ||
| DB_SLO | dBASE Slovakian | ||
| PDOX_CSY | Paradox Czech | ||
| PDOX_HUN | Paradox Hungarian | ||
| PDOX_PLK | Paradox Polish | ||
| PDOX_SLO | Paradox Slovakian | ||
| DOS857 | Turkish | DOS857 | Turkish |
| DB_TRK | dBASE Turkish | ||
| DOS858 | DOS858 | Latin-1 plus Euro symbol € | |
| DOS860 | Portuguese | DOS860 | Portuguese |
| DB_PTG860 | dBASE Portuguese | ||
| DOS861 | Icelandic | DOS861 | Icelandic |
| PDOX_ISL | Paradox Icelandic | ||
| DOS862 | Hebrew | DOS862 | Hebrew |
| DOS863 | Canadian French | DOS863 | French/Canada |
| DB_FRC863 | dBASE French/Canada | ||
| DOS864 | Arabic | DOS864 | Arabic |
| DOS865 | Scandinavian | DOS865 | Nordic |
| DB_NOR865 | dBASE Norwegian | ||
| DB_DAN865 | dBASE Danish | ||
| PDOX_NORDAN4 | Paradox Norwegian | ||
| DOS866 | Russian | DOS866 | Russian |
| DOS869 | Greek | DOS869 | Modern Greek |
| Others | |||
| BIG_5 | Chinese | BIG_5 | Chinese |
| KOI8R | Russian | Russian character set and dictionary collation (Firebird 2.0) | |
| KOI8U | Ukrainian | Ukrainian character set and dictionary collation (Firebird 2.0) | |
| CYRL | Russian/Ukrainian | CYRL | Cyrillic |
| DB_RUS | dBASE Russian | ||
| PDOX_CYRL | Paradox Cyrillic | ||
| KSC_5601 | Korean | KSC_5601 | Unified Korean Hangeul, also known as windows-949 |
| KSC_DICTIONARY | Korean dictionary ordering | ||
| NEXT | NeXT Computers | NEXT | NeXTSTEP encoding |
| NXT_DEU | German | ||
| NXT_ESP | Spanish | ||
| NXT_FRA | French | ||
| NXT_ITA | Italian | ||
| NXT_US | US-English | ||
| SJIS_0208 | Japanese | SJIS_0208 | Shift-JIS |
| EUCJ_0208 | Japanese | EUCJ_0208 | EUC Japanese |
| GB_2312 | Chinese | GB_2312 | Simplified Chinese (HongKong, PRC), a subset of GBK/windows-936 |
| CP943C | Japanese | CP943C_UNICODE | Japanese character set (Firebird 2.1) |
| TIS620 | Thai | TIS620_UNICODE | Thai character set, single byte (Firebird 2.1) |
Which one to choose?
The question now is: which character set do I choose for my database?
Note: I don't have any experience with Asian scripts (Chinese, Korean, Japanese) so I can't give you any hint on these
- You should chose the DOS, dBASE and Paradox character sets only if you have legacy applications to support
- The WINxxx character sets are extensions of the corresponding ISOxxx character sets, however you will have problems on non-Windows systems. So if you have a cross-platform application, stay with the ISOxxx character sets.
- The ISOxxx character sets are missing a few characters of the WINxxx character sets (like typographic dash signs) or they can have some different characters, so your application must be prepared to handle this.
Unicode?
The Unicode situation dramatically improved with Firebird 2.0. Now there is the new UTF8 character set that correctly handles Unicode strings in UTF-8 format. The Unicode collation algorithm has been implemented so now you can use UPPER() and the new LOWER() function without the need to specify a collation.
See also: www.firebirdsql.org/index.php?op=doc&id=fb_1_5_charsets
See also: www.collation-charts.org
Firebird Character Sets and Collations的更多相关文章
- 10.1.5 Connection Character Sets and Collations
10.1.5 Connection Character Sets and Collations Several character set and collation system variables ...
- MySQL: Connection Character Sets and Collations
character_set_server collation_servercharacter_set_databasecollation_database character_set_clientch ...
- 02:PostgreSQL Character Sets
在利用postGIS导入shapefile文件到postgresql数据库的时候,老是提示字符串的问题,或者是乱码,试了好几种都不行,于是度娘之.... 使用默认的UTF8,提示信息是:建议使用LAT ...
- Character Sets: Migrating to utf8mb4 with pt_online_schema_change
David Berube | June 12, 2018 | Posted In: MySQL Modern applications often feature the use of data ...
- Character Sets, Collation, Unicode :: utf8_unicode_ci vs utf8_general_ci
w Hi, You can check and compare sort orders provided by these two collations here: http://www.collat ...
- information_schema系列之字符集校验(CHARACTER_SETS,COLLATIONS,COLLATION_CHARACTER_SET_APPLICABILITY)
1:CHARACTER_SETS 首先看一下查询前十条的结果: root@localhost [information_schema]>select * from CHARACTER_SETS ...
- MySQL设置字符集CHARACTER SET
本文地址:http://www.cnblogs.com/yhLinux/p/4036506.html 在 my.cnf 配置文件中设置相关选项,改变为相应的character set. 设置数据库编码 ...
- MySQL基础知识:Character Set和Collation
A character set is a set of symbols and encodings. A collation is a set of rules for comparing chara ...
- mysql set names 命令和 mysql 字符编码问题
先看下面的执行结果: (root@localhost)[(none)]mysql>show variables like 'character%'; +--------------------- ...
随机推荐
- 物理cpu与逻辑cpu概述
物理cpu与逻辑cpu概述(本博客属于转载部分内容:主要学习目的用于大数据平台Hadoop之yarn资源调度的配置) 一.yarn资源调度器中主要的资源分类 1.memory(内存) 2. ...
- swift-自动计算字符串的宽高
写一个方法来继承String //自动控制文字换行及宽度 extension String { func textSizeWithFont(font: UIFont, constrainedToSiz ...
- 绝对好用的浏览器json解析网址
你们是否经常在浏览器输入请求地址解析遇到中文乱码的情况,今天我找到了一个好用的浏览器解析json网址,绝对好用. 1.直接输入网址 http://pro.jsonlint.com/ 2.输入要解析的j ...
- WPF通过鼠标滑轮缩放显示图片
如果你使用WinForm比较难实现通过滚动鼠标滑轮来对图片进行缩放显示,那么,你应该考虑一下使用WPF,既然是下一代Windows客户端开发平台,明显是有一定优势的,不然,MS是吃饱了撑着. 首先 ...
- Golang - 处理json
目录 Golang - 处理json 1. 编码json 2. 解码json Golang - 处理json 1. 编码json 使用json.Marshal()函数可以对一组数据进行JSON格式的编 ...
- VS2015 C++ 获取 Edit Control 控件的文本内容,以及把获取到的CString类型的内容转换为 int 型
UpdateData(true); //读取编辑框内容,只要建立好控件变量后调用这个函数使能,系统就会自动把内容存在变量里 //这里我给 Edit Control 控件创建了一个CString类型.V ...
- 传输模型,网络层次划分,三次握手,四次挥手,IP与端口,套接字socket
了解套接字之前,需要先了解基本的传输模型 其次,还需要了解网络的七层划分和四层结构 在python中,数据链路层相当于硬件层,python不需要了解,只用在传输层进行学习就足够了 其中,传输层分为TC ...
- LightOJ - 1232 - Coin Change (II)
先上题目: 1232 - Coin Change (II) PDF (English) Statistics Forum Time Limit: 1 second(s) Memory Limit: ...
- 自己实现android側滑菜单
当今的android应用设计中.一种主流的设计方式就是会拥有一个側滑菜单,以图为证: 实现这种側滑效果,在5.0曾经我们用的最多的就是SlidingMenu这个开源框架,而5.0之后.goog ...
- Ubuntu安装及ubuntu系统使用菜岛教程
Ubuntu是一款广受欢迎的开源Linux发行版,和其他Linux操作系统相比,Ubuntu非常易用,和Windows相容性很好,非常适合Windows用户的迁移,在其八年的成长过程中已经获得了两千多 ...