What's the difference between UTF-8 and UTF-8 without BOM?
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
Answer1
The UTF-8 BOM is a sequence of Bytes at the start of a text-stream (EF BB BF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
Answer2
The other excellent answers already answered that:
- There is no official difference between UTF-8 and BOM-ed UTF-8
- A BOM-ed UTF-8 string will start with the three following bytes.
EF BB BF - Those bytes, if present, must be ignored when extracting the string from the file/stream.
But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
- The legitimate ISO-8859-1 string "ABC"
- The legitimate UTF-8 string "ABC"
So while it can be cool to recognize the encoding of a file content by looking at the first bytes, you should not rely on this, as show by the example above
Encodings should be known, not divined.
评论
You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid character when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes.
Convert a file to utf-8, but still shown as ascii
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
What's the difference between UTF-8 and UTF-8 without BOM?的更多相关文章
- UTF—8与UTF—8(无bom)格式
BOM——Byte Order Mark,就是字节序标记 在UCS 编码中有一个叫做"ZERO WIDTH NO-BREAK SPACE"的字符,它的编码是FEFF.而FFFE在U ...
- Unicode、UTF-8 和 ISO8859-1到底有什么区别
说明:本文转载于新浪博客,旨在方便知识总结.原文地址:http://blog.sina.com.cn/s/blog_673c81990100t1lc.html 本文主要包括以下几个方面:编码基本知识, ...
- Unicode、UTF-8 和 ISO8859-1
Unicode.UTF-8 和 ISO8859-1到底有什么区别 1.本文主要包括以下几个方面:编码基本知识,java,系统软件,url,工具软件等. 在下面的描述中,将以"中文" ...
- UniEAP UTF 用户手册 (引擎)
目录 第1章 概述 5 1.1 术语解释 5 第2章 测试文件组织 6 2.1 测试执行文件详解 7 2.1.1 参数配置 7 2.1.2 测试报告配置 9 2.1.3 浏览器类型配置 9 2.1.4 ...
- Python3中文教程
搜索 此文档来源自网络 安装 PYTHON❝ Tempora mutantur nos et mutamur in illis. (时光流转,吾等亦随之而变.) ❞ — 古罗马谚语 深入欢迎来到 Py ...
- linux实践——字符集
一.ASCII码 首先是看得懂ASCII码表: 二 八 十 十六 缩写/字符 0000 0000 0 0 00 NUL(null) 0000 0001 1 1 01 SOH(start of head ...
- GBK、GB2312、iso-8859-1之间的区别
转自:http://blog.csdn.net/jerry_bj/article/details/5714745 GBK.GB2312.iso-8859-1之间的区别 GB2312,由中华人民共和国政 ...
- 【转】关于Python脚本开头两行的:#!/usr/bin/python和# -*- coding: utf-8 -*-的作用 – 指定文件编码类型
原文网址:http://www.crifan.com/python_head_meaning_for_usr_bin_python_coding_utf-8/ #!/usr/bin/python 是用 ...
- 【转】Java编程之字符集问题研究
发现这是对字集说得最明了的一篇文章了. 转发自:http://tomcat-oracle.iteye.com/blog/2037160 1. 概述 本文主要包括以下几个方面:编码基本知识,java,系 ...
- Android之单复选框及Spinner实现二级联动
一.基础学习 1.图形学真的很神奇啊....查了些资料做出了3D云标签,哈哈...其实直接拿来用的,我们要效仿鲁迅先生的拿来主义,嘿嘿~~3D标签云就是做一个球面,然后再球面上取均匀分布的点,把点坐标 ...
随机推荐
- 在TextView中设置DrawableLeft不显示的问题
1.在XML中使用 android:drawableLeft="@drawable/icon" 2.代码中动态变化 Drawable drawable= getResources( ...
- sql server数据库,禁用启用触发器各种情况!
一.禁用和启用单个触发器 禁用: ALTER TABLE trig_example DISABLE TRIGGER trig1 GO 恢复: ALTER TABLE trig_example ENAB ...
- JZOJ.5231【NOIP2017模拟8.5】序列问题
Description Input 输入文件名为seq.in.首先输入n.接下来输入n个数,描述序列 A. Output 输出文件名为seq.out.输出一行一个整数代表答案. Sample ...
- 【BZOJ4624】农场种植 FFT
[BZOJ4624]农场种植 Description 农夫约翰想要在一片巨大的土地上建造一个新的农场. 这块土地被抽象为个 R*C 的矩阵.土地中的每个方格都可以用来生产一种食物:谷物(G)或者是牲畜 ...
- NavigationBar 背景颜色,字体颜色
// 设置状态栏颜色 [application setStatusBarStyle:UIStatusBarStyleLightContent]; // 设置导航栏 [[UINavigationBar ...
- NET Framework 4.5新特性 (三)64位平台支持大于2 GB大小的数组
64位平台.NET Framework数组限制不能超过2GB大小.这种限制对于需要使用到大型矩阵和向量计算的工作人员来说,是一个非常大问题. 无论RAM容量有多大有多少,一旦你使用大型矩阵和向量计算工 ...
- flex组合流动布局实例---利用css的order属性改变盒子排列顺序
flex弹性盒子 <div class="container"> <div class="box yellow"></div> ...
- git、git bash、git shell的区别
之前安装了github(CSDN上找的,官网的下不来,貌似要FQ - -)后,自带了git shell,如图: 输命令的时候发现网上的一些命令不管用,譬如:git ls –a 查看隐藏的 .git 文 ...
- DTD的学习和理解
看log4j的官方文档,上面说提供了XML格式的配置,但是没有XML具体示例.发现文档中说的是一个DTD文档,但我根本不知道DTD是什么,于是就简单了解一下.顺带做一下笔记. 注:结合笔记看log4j ...
- delphi -----(去掉窗口最大化,最小化、关闭),主窗口,和子窗口之间的设置
一.去掉窗口最大化,最小化.关闭 borderIcons:biSystemMenu:false borderStyle:bsSizeable 二.主子窗口 主main: //调用子窗体procedur ...