UTF-8 and UTF-16 Encoding

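For the history of Unicode: if your English is good, just go read it on unicode.org; if not, head over to dengyunze's summary, 《关于UTF8,UTF16,UTF32,UTF16-LE,UTF16-BE》 ("On UTF8, UTF16, UTF32, UTF16-LE, UTF16-BE").
That post lays it out clearly. To cover every character in the world, a scheme built on 2-byte units such as UTF-16 looks like the natural choice. But since "most" characters (a "most" that some Western tech geeks came up with off the top of their heads, of course~) fit in a single byte, spending 2 bytes on each of them felt far too wasteful, so they went with UTF-8. The "minority" characters (Chinese, Japanese and Korean, for example) are still fully representable in UTF-8; they just cost more bytes, 3 for most CJK characters versus 2 in UTF-16.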

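UTF-8 and UTF-16 are both variable-length, so why would the Western geeks find UTF-16 so wasteful? Because nearly all of their text is ASCII (0x00 - 0x7F). The smallest unit in UTF-8 is 1 byte, while the smallest unit in UTF-16 is 2 bytes, so every ASCII character costs twice as much in UTF-16... (↓ see "code unit size" in the figure below)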

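Note: in the figure above, UTF-16 and UTF-16LE come out the same, but that is not guaranteed in general. Plain "UTF-16" leaves the byte order to a BOM at the start of the data; when there is no BOM, the Unicode standard says to assume big-endian, although plenty of tools (on Windows especially) default to little-endian.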
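To make the size and byte-order points concrete, here is a small Java sketch that leans only on the JDK's built-in charsets. The class name `CodeUnitSizes` and its helper methods are made up for illustration. It counts the bytes a few sample characters take in each encoding and dumps the bytes of "A" under the three UTF-16 flavours.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitSizes {
    /** Bytes needed for s in UTF-8 (1-byte code units). */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed for s in UTF-16 (2-byte code units); explicit byte order, so no BOM is added. */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Formats bytes as upper-case hex separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+0041 'A', U+00E9 (e with acute), U+4E2D (a CJK ideograph), U+1F600 (an emoji)
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X: %d byte(s) in UTF-8, %d byte(s) in UTF-16%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Java's plain "UTF-16" encoder writes a big-endian BOM first.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```

ASCII is the only range where UTF-8 clearly wins on size. For most CJK text UTF-16 is the smaller of the two, and beyond U+FFFF both need 4 bytes.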

So how does UTF-8 actually lay a character out? ↓ See the figure below.

↓↓ An example

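The layout matches the figure two above. In the first byte, reading from the left, the number of 1s before the first 0 is the total number of bytes used for this character (a first byte that starts with 0 stands alone), and every byte after the first starts with 10. Put differently, 110xxxxx means one more byte follows, 1110xxxx means two more, and 11110xxx means three more. The original UTF-8 design went up to 6 bytes and could carry 31 bits; since RFC 3629 it has been capped at 4 bytes, which is enough for every Unicode code point up to U+10FFFF.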
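The bit layout described above can be sketched in a few lines of Java. This is only an illustration under my own naming (`Utf8Sketch`, `encode`, `sequenceLength` are not from any library), and it assumes the modern 4-byte limit. `encode` turns one code point into UTF-8 bytes by hand, and `sequenceLength` reads the length back out of the first byte by counting its leading 1s.

```java
final class Utf8Sketch {
    /** Encodes a single Unicode code point as UTF-8 (1 to 4 bytes). */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /**
     * Total sequence length announced by a lead byte: the count of leading 1 bits,
     * or 1 when the byte starts with 0. The caller must pass a lead byte, not a
     * continuation byte (10xxxxxx).
     */
    static int sequenceLength(byte lead) {
        int ones = Integer.numberOfLeadingZeros(~lead & 0xff) - 24;
        return ones == 0 ? 1 : ones;
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4E2D)) {               // U+4E2D, a CJK ideograph
            System.out.printf("%02X ", b & 0xff);     // prints: E4 B8 AD
        }
        System.out.println();
        System.out.println(sequenceLength((byte) 0xE4)); // 3
    }
}
```

Comparing the result with `new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)` is a quick way to check the hand-rolled version against the JDK.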

The UTF-16 encoding process

v  = 0x64321
v′ = v - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001

vh = v′ >> 10
   = 01 0101 0000 // higher 10 bits of v′
vl = v′ & 0x3FF
   = 11 0010 0001 // lower  10 bits of v′

w1 = 0xD800 + vh
   = 1101 1000 0000 0000
   +        01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950 // first code unit of UTF-16 encoding

w2 = 0xDC00 + vl
   = 1101 1100 0000 0000
   +        11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21 // second code unit of UTF-16 encoding
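The same arithmetic, written as a short Java sketch. The names `Utf16Sketch`, `encode` and `decode` are mine, and the method assumes its input is a valid code point that is not itself a surrogate. The last line of `main` cross-checks the result against the JDK's own `Character.toChars`.

```java
final class Utf16Sketch {
    /** Encodes one code point as 1 or 2 UTF-16 code units. */
    static char[] encode(int v) {
        if (v < 0x10000) {
            return new char[] { (char) v };        // BMP: one code unit, value unchanged
        }
        int vPrime = v - 0x10000;                  // 20 bits remain
        int vh = vPrime >> 10;                     // higher 10 bits
        int vl = vPrime & 0x3FF;                   // lower 10 bits
        char w1 = (char) (0xD800 + vh);            // high (lead) surrogate
        char w2 = (char) (0xDC00 + vl);            // low (trail) surrogate
        return new char[] { w1, w2 };
    }

    /** Reverses encode() for a surrogate pair. */
    static int decode(char w1, char w2) {
        return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00);
    }

    public static void main(String[] args) {
        char[] units = encode(0x64321);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D950 DF21
        System.out.printf("%X%n", decode(units[0], units[1]));            // 64321
        System.out.println(java.util.Arrays.equals(units, Character.toChars(0x64321))); // true
    }
}
```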

To finish, here is a Java version of the conversion between UTF-8 and UTF-16 in both directions. The code comes from Lucene 3.6.

/**
 * Interprets the given byte array as UTF-8 and converts to UTF-16. The
 * {@link CharsRef} will be extended if it doesn't provide enough space to
 * hold the worst case of each byte becoming a UTF-16 codepoint.
 * <p>
 * NOTE: Full characters are read, even if this reads past the length passed
 * (and can result in an ArrayOutOfBoundsException if invalid UTF-8 is
 * passed). Explicit checks for valid UTF-8 are not performed.
 */
// TODO: broken if chars.offset != 0
public static void UTF8toUTF16(byte[] utf8, int offset, int length,
    CharsRef chars) {
  int out_offset = chars.offset = 0;
  final char[] out = chars.chars = ArrayUtil.grow(chars.chars, length);
  final int limit = offset + length;
  while (offset < limit) {
    int b = utf8[offset++] & 0xff;
    if (b < 0xc0) {
      assert b < 0x80;
      out[out_offset++] = (char) b;
    } else if (b < 0xe0) {
      out[out_offset++] = (char) (((b & 0x1f) << 6) + (utf8[offset++] & 0x3f));
    } else if (b < 0xf0) {
      out[out_offset++] = (char) (((b & 0xf) << 12)
          + ((utf8[offset] & 0x3f) << 6) + (utf8[offset + 1] & 0x3f));
      offset += 2;
    } else {
      assert b < 0xf8 : "b=" + b;
      int ch = ((b & 0x7) << 18) + ((utf8[offset] & 0x3f) << 12)
          + ((utf8[offset + 1] & 0x3f) << 6)
          + (utf8[offset + 2] & 0x3f);
      offset += 3;
      if (ch < UNI_MAX_BMP) {
        out[out_offset++] = (char) ch;
      } else {
        int chHalf = ch - 0x0010000;
        out[out_offset++] = (char) ((chHalf >> 10) + 0xD800);
        out[out_offset++] = (char) ((chHalf & HALF_MASK) + 0xDC00);
      }
    }
  }
  chars.length = out_offset - chars.offset;
}

/** Encode characters from a char[] source, starting at
 *  offset for length chars. After encoding, result.offset will always be 0.
 */
public static void UTF16toUTF8(final char[] source, final int offset, final int length, BytesRef result) {
  int upto = 0;
  int i = offset;
  final int end = offset + length;
  byte[] out = result.bytes;
  // Pre-allocate for worst case 4-for-1
  final int maxLen = length * 4;
  if (out.length < maxLen)
    out = result.bytes = new byte[maxLen];
  result.offset = 0;

  while(i < end) {
    final int code = (int) source[i++];

    if (code < 0x80)
      out[upto++] = (byte) code;
    else if (code < 0x800) {
      out[upto++] = (byte) (0xC0 | (code >> 6));
      out[upto++] = (byte)(0x80 | (code & 0x3F));
    } else if (code < 0xD800 || code > 0xDFFF) {
      out[upto++] = (byte)(0xE0 | (code >> 12));
      out[upto++] = (byte)(0x80 | ((code >> 6) & 0x3F));
      out[upto++] = (byte)(0x80 | (code & 0x3F));
    } else {
      // surrogate pair
      // confirm valid high surrogate
      if (code < 0xDC00 && i < end) {
        int utf32 = (int) source[i];
        // confirm valid low surrogate and write pair
        if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
          utf32 = (code << 10) + utf32 + SURROGATE_OFFSET;
          i++;
          out[upto++] = (byte)(0xF0 | (utf32 >> 18));
          out[upto++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
          out[upto++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
          out[upto++] = (byte)(0x80 | (utf32 & 0x3F));
          continue;
        }
      }
      // replace unpaired surrogate or out-of-order low surrogate
      // with substitution character
      out[upto++] = (byte) 0xEF;
      out[upto++] = (byte) 0xBF;
      out[upto++] = (byte) 0xBD;
    }
  }
  //assert matches(source, offset, length, out, upto);
  result.length = upto;
}
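The listing above relies on a few constants it does not show (`SURROGATE_OFFSET`, `HALF_MASK`, `UNI_MAX_BMP`). As a hedged illustration, the sketch below assumes `SURROGATE_OFFSET` is `0x10000 - (0xD800 << 10) - 0xDC00`, which is the value the expression `(code << 10) + utf32 + SURROGATE_OFFSET` needs in order to produce the code point; I have not copied it from the Lucene source. The class name `RoundTripSketch` is made up. It also shows that the three "substitution character" bytes are just U+FFFD in UTF-8, and does a round trip using only the JDK.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {
    // Assumed value: folds "- 0xD800" on the high surrogate, "- 0xDC00" on the low
    // surrogate and "+ 0x10000" into a single constant.
    static final int SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    /** Combines a valid surrogate pair into a code point, mirroring the expression in the listing. */
    static int combine(char high, char low) {
        return (high << 10) + low + SURROGATE_OFFSET;
    }

    /** UTF-16 (Java String) -> UTF-8 bytes -> UTF-16 again, using the JDK's own codecs. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.printf("%X%n", combine((char) 0xD950, (char) 0xDF21)); // 64321

        // U+FFFD REPLACEMENT CHARACTER is EF BF BD in UTF-8.
        for (byte b : "\uFFFD".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xff);
        }
        System.out.println();

        String sample = "A\u4E2D\uD83D\uDE00";
        System.out.println(sample.equals(roundTrip(sample))); // true
    }
}
```

For everyday code the JDK calls in `roundTrip` are usually all you need; hand-written loops like Lucene's mostly pay off when you want to reuse buffers and avoid allocations.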
