Notes on Chinese Encodings

I was confused about this for years. Only recently did I collect my thoughts, and now Unicode is clear to me. My company uses the Simplified Chinese version of Windows XP, whose default code page for Chinese text is GBK (code page 936). So when I write .cpp files containing Chinese characters in UTF-8 (CP_UTF8, i.e. code page 65001), the files need a BOM (the bytes EF BB BF). Otherwise the cl.exe compiler does not seem able to recognize which code page the .cpp file uses, even though chrome.exe and even notepad.exe easily recognize UTF-8 files without a BOM.

So, to compile your .cpp files correctly on Simplified Chinese Windows XP, either add a BOM to your UTF-8 encoded .cpp files or save the files as GBK.
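Whether a file already carries the BOM can be checked by looking at its first three bytes. The following is only a minimal sketch: the names has_utf8_bom and add_utf8_bom are made up for this note, and both work on a buffer that has already been read into memory.

```cpp
#include <string>

// The UTF-8 byte order mark: EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Returns true if the buffer starts with the UTF-8 BOM.
// (Hypothetical helper, not part of any library.)
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.compare(0, 3, kUtf8Bom) == 0;
}

// Returns the buffer with a BOM in front, unless one is already there.
std::string add_utf8_bom(const std::string& bytes)
{
    return has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
}
```

Reading the .cpp file in binary mode, passing its contents through add_utf8_bom() and writing it back would give the same effect as saving it "with signature" in an editor.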

Learning from practice is a great idea. At first I wrote my .cpp files in VIM, which displayed the Chinese correctly, and I thought "wow, VIM's great!!" (of course I knew little about UTF-8 then). However, displaying a Chinese character deserves more thought:

We type characters into a text file with an IME or some other tool, but the bytes that end up on the disk depend on the code page we "chose". (Which makes me wonder: did we really choose it, or were we forced to use some code page? Never mind.) Either way, some specific code page is used to save the data. For example, "你好" followed by a line feed is saved as follows:

--------------------------------------------------------------

Encoding           | Bytes on disk            | Notes
UTF-8 with BOM     | EFBB BFE4 BDA0 E5A5 BD0A | The first 3 bytes, EFBB BF, are the BOM; the last byte, 0A, is the line feed (\n).
UTF-8 without BOM  | E4BD A0E5 A5BD 0A        | The two Chinese characters are saved as E4BD A0E5 A5BD, 6 bytes.
GBK                | C4E3 BAC3 0A             | The difference: the two characters are saved as C4E3 BAC3, 4 bytes.
ANSI               | N/A                      | Taken as plain 7-bit ASCII, it has no codes for Chinese characters. (On Windows, "ANSI" really means the system default code page, which on a Simplified Chinese system is GBK itself.)

---------------------------------------------------------------
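The byte counts in the table can be reproduced with a small hex dump. This is just a sketch: hex_dump and the three constants are names invented here, and "你好" is spelled out as explicit byte escapes so that the result does not depend on how the source file itself is encoded.

```cpp
#include <cstdio>
#include <string>

// Formats each byte of the buffer as two uppercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}

// "你好" as raw bytes in each encoding (values taken from the table above).
const std::string kBom  = "\xEF\xBB\xBF";             // 3 bytes
const std::string kUtf8 = "\xE4\xBD\xA0\xE5\xA5\xBD"; // 6 bytes
const std::string kGbk  = "\xC4\xE3\xBA\xC3";         // 4 bytes
```

Calling hex_dump(kBom + kUtf8) gives the nine bytes of the "UTF-8 with BOM" row, without the trailing line feed.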

When compiling, cl does not know for certain which code page we used, so consider the three cases:

if we save as UTF-8 without BOM, "你好" is saved as exactly the 6 bytes shown above;

if we save as UTF-8 with BOM, "你好" is saved as the same 6 bytes after a 3-byte BOM (9 bytes in total);

if we save as GBK, "你好" is saved as 4 bytes.

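It gets interesting when we test with the following code:

 #include <fstream>
 #include <string>

 int main()
 {
     std::string str("你好");     // narrow literal, kept as char bytes
     std::ofstream out("1.txt");
     out << str;                  // writes those bytes to 1.txt unchanged
     out.close();
     return 0;
 }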

I got the following results (again on Simplified Chinese Windows XP):

save option     hex result        encoding of 1.txt
GBK             C4E3 BAC3         GBK
UTF-8 BOM       C4E3 BAC3         GBK
UTF-8 no BOM    E4BD A0E5 A5BD    UTF-8

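According to the preceding discussion, "你好" in a .cpp file tends to end up GBK-encoded in the compiled program. (After some tests, I worked out why.) Here "you好" is not what matters; what matters is that "你好" initializes str as char data, which is MultiByte. When cl.exe knows the original encoding from the BOM, it converts E4BD A0E5 A5BD to C4E3 BAC3 without being asked. That's OK. Without a BOM, cl.exe presumably takes the bytes to be GBK already and passes them through unchanged, which is why the third row still holds the UTF-8 bytes.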

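Here is another example:

 #include <fstream>
 #include <string>

 // to_utf8() is defined in the appendix below.

 int main(int argc, char* argv[])
 {
     std::wstring wsz(L"你好");   // wide literal, decoded by the compiler
     std::string str = to_utf8(&wsz[0], static_cast<int>(wsz.size()));
     std::ofstream outfile("1.txt");
     outfile << str;              // writes the UTF-8 bytes to 1.txt
     outfile.close();
     return 0;
 }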

save option     hex result                text in 1.txt
GBK             E4BD A0E5 A5BD            你好
UTF-8 BOM       E4BD A0E5 A5BD            你好
UTF-8 no BOM    E6B5 A3E7 8AB2 E382 BD    浣犲ソ

The third test fails because the compiler does not know which code page the source file uses. It assumes the L"你好" literal is GBK, which is of course wrong! The six UTF-8 bytes are read as three GBK characters, so wsz already holds "浣犲ソ" before to_utf8() is called, and to_utf8() faithfully encodes exactly that. To avoid this kind of failure, simply add a BOM to UTF-8 files or change the encoding of the file to GBK. to_utf8() cannot help here: it only sees wide characters and has no way of knowing that the bytes in the source file were misread.
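The misreading can be imitated by regrouping the six UTF-8 bytes into two-byte units, the way a GBK decoder would. The sketch below is an illustration only: both function names are invented, and the three-entry lookup is filled in by hand from the result reported above, so it stands in for the real GBK table rather than replacing it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hand-entered excerpt of the GBK table: only the three double-byte
// codes that occur in this example. A real decoder needs the full table.
char32_t gbk_pair_to_code_point(unsigned char lead, unsigned char trail)
{
    unsigned pair = (static_cast<unsigned>(lead) << 8) | trail;
    switch (pair) {
    case 0xE4BD: return 0x6D63; // 浣
    case 0xA0E5: return 0x72B2; // 犲
    case 0xA5BD: return 0x30BD; // ソ
    default:     return 0xFFFD; // not in this excerpt
    }
}

// Simplified GBK-style decoding: a byte below 0x80 is ASCII, any other
// byte is treated as the lead byte of a two-byte character.
std::vector<char32_t> decode_as_gbk(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            out.push_back(lead);
            i += 1;
        } else if (i + 1 >= bytes.size()) {
            out.push_back(0xFFFD); // lead byte with nothing after it
            i += 1;
        } else {
            unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
            out.push_back(gbk_pair_to_code_point(lead, trail));
            i += 2;
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "你好" (E4 BD A0 E5 A5 BD) yields three code points instead of two, which is exactly how two characters turn into the three of "浣犲ソ".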

Before the next example, some more terminology:

MultiByte: std::string or char*

WideChar: std::wstring or wchar_t*

MultiByteToWideChar: 6 parameters, converts from std::string (char*) to std::wstring (wchar_t*)

WideCharToMultiByte: 8 parameters, converts from std::wstring (wchar_t*) to std::string (char*)

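Appendix: to_utf8()

 #include <windows.h>
 #include <string>

 // Converts nLength wide (UTF-16) characters to a UTF-8 std::string.
 std::string to_utf8(const wchar_t *lpwstrSrc, int nLength)
 {
     // First call: no output buffer, size 0, returns the bytes needed.
     int nSize = ::WideCharToMultiByte(
             CP_UTF8,
             0,
             lpwstrSrc,
             nLength,
             NULL,
             0,
             NULL,
             NULL
             );
     if (nSize == 0)
         return "";
     std::string strDest;
     strDest.resize(nSize);
     // Second call: write the UTF-8 bytes into the string's buffer.
     ::WideCharToMultiByte(
             CP_UTF8,
             0,
             lpwstrSrc,
             nLength,
             &strDest[0],
             nSize,
             NULL,
             NULL
             );
     return strDest;
 }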
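To try the same conversion on a machine without windows.h, the UTF-16 to UTF-8 step can be written out by hand. This is a rough portable stand-in, not the Windows implementation: utf16_to_utf8 is a name made up here, it takes std::u16string instead of wchar_t*, and replacing unpaired surrogates with U+FFFD is a choice of this sketch.

```cpp
#include <cstddef>
#include <string>

// Encodes UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& src)
{
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine it with the following low surrogate.
            if (i + 1 < src.size() && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD; // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;     // stray low surrogate
        }
        if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx and 3 trail bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With the input written as u"\u4F60\u597D" (the code points of "你好"), the result is the six bytes E4 BD A0 E5 A5 BD from the tables above; the escapes keep the test independent of the source file's own encoding.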

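In the initialization

std::wstring wsz(L"你好");

"你好" is converted to wide characters by the compiler. When the file has no BOM, this is done with the system default code page, as if CP_ACP had been passed to MultiByteToWideChar.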

Note that converting Chinese text held in a char* to another code page takes 2 steps:

1. convert the chars to wide chars with MultiByteToWideChar, passing CP_ACP (or whichever code page the chars are actually in);

2. convert the wide chars to the target code page with WideCharToMultiByte.

Below are some notes I gathered from thinking and searching:

Unicode is a standard that assigns every character in the world a number (a code point) in the range 0x000000~0x10FFFF. That fits in 21 bits, less than 3 bytes, and technically leaves room for 17 * 65536 = 1,114,112 code points. Storing every code point at one fixed width makes files too large to transfer, so several Unicode Transformation Formats came up: UTF-8, UTF-16 and UTF-32. Among these, UTF-8 is the most popular on the internet, and its small size for ASCII-heavy text earned it that place. As originally designed, UTF-8 could represent 0x00000000~0x7FFFFFFF (format: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx, 6 bytes, 31 bits), which is more than Unicode needs; the current definition (RFC 3629) stops at 4 bytes, which is enough for 0x10FFFF. See Unicode for more information.
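The ranges behind those formats can be written down as a short function. It is a sketch under the current 4-byte definition of UTF-8; the name utf8_length and the constant are invented for this note.

```cpp
// Number of bytes UTF-8 (limited to 4 bytes, as in RFC 3629) uses for
// one code point, or 0 if the value cannot be encoded.
int utf8_length(unsigned long cp)
{
    if (cp < 0x80)
        return 1;                       // 0xxxxxxx
    if (cp < 0x800)
        return 2;                       // 110xxxxx 10xxxxxx
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       // surrogates are not characters
    if (cp < 0x10000)
        return 3;                       // 1110xxxx 10xxxxxx 10xxxxxx
    if (cp <= 0x10FFFF)
        return 4;                       // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return 0;                           // beyond the Unicode range
}

// 17 planes of 65536 code points each: 0x000000 through 0x10FFFF.
const unsigned long kCodePointCount = 17UL * 65536UL; // 1114112 == 0x110000
```

Both characters of "你好" (U+4F60 and U+597D) fall in the 3-byte range, which matches the 6 bytes seen in the UTF-8 dumps earlier.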

GBK and the other non-Unicode code pages cannot represent every character in the world, which sometimes causes problems: if we want to write different scripts (say German and Chinese) in one text file, we would have to keep switching code pages. Unicode does this job much more neatly. The price is the larger size of Unicode-encoded text (for example 3 bytes per Chinese character in UTF-8 against 2 in GBK), which is always a minus.

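Tips when using VIM:

:set fenc=utf-8 (or :set fenc=gbk, :set fenc=cp936) sets the encoding the file is saved in

:set bomb / :set nobomb turns the BOM on or off; :set bomb? shows the current setting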
