UTF-8 Invalid Byte Sequences
Chances are, some of you have run into the "invalid byte sequence in UTF-8" error when dealing with user-submitted data. A Google search shows that my hunch isn’t off.
Among the search results are plenty of answers—some using the deprecated iconv library—that might lead you to a sufficient fix. However, among the slew of queries are few answers on how to reliably replicate and test the issue.
In developing the Griddler gem we ran into some cases where the data being posted back to our controller had invalid UTF-8 bytes. For Griddler, our failing test case needs to simulate an email body that is encoded as UTF-8 but contains an invalid byte.
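What are valid and invalid bytes? This table on Wikipedia tells us bytes 192, 193, and 245-255 never appear in valid UTF-8. Other bytes above 127 are only valid in the right position: a continuation byte (128-191), for example, has to follow a lead byte. In a Ruby string literal we can write such a byte with an escape. Note that "\255" is an octal escape, so it produces byte 173 (0xAD), a stray continuation byte, rather than byte 255 (which would be "\377" or "\xFF"). Either one makes the string invalid: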
> "hi \255"
=> "hi \xAD"
There’s our string with the invalid byte! How do we know for sure? In that IRB session we can simulate a comparable issue by sending the string a message it won’t like, such as split or gsub.
> "hi \255".split(' ')
ArgumentError: invalid byte sequence in UTF-8
from (irb):9:in `split'
from (irb):9
from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'
Yup. It certainly does not like that.
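If you would rather ask the string directly than wait for an exception, String#valid_encoding? answers the same question. The sketch below is only an illustration: invalid_bytes? is a made-up helper name, not something from Griddler, and the .dup calls are there just so the literals can be safely re-tagged on newer Rubies.

```ruby
# Hypothetical helper (not part of Griddler): true when the string's bytes
# are not well-formed in the encoding the string is tagged with.
def invalid_bytes?(string)
  !string.valid_encoding?
end

# "\255" is an octal escape: 0o255 == 173 == 0xAD, a UTF-8 continuation byte
# with no lead byte in front of it.
octal = "hi \255".dup.force_encoding('UTF-8')
puts octal.bytes.to_a.last   # 173
puts invalid_bytes?(octal)   # true

# "\xFF" is byte 255, one of the values that never appears in valid UTF-8.
puts invalid_bytes?("hi \xFF".dup.force_encoding('UTF-8'))   # true

# Plain ASCII is always valid UTF-8.
puts invalid_bytes?("hi joel")   # false
```

Checking validity up front like this is handy in a test setup, since it confirms the fixture really is broken before any code under test touches it.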
Let’s create a very real-world, enterprise-level, business-critical test case:
invalid_byte_spec.rb
require 'rspec'

def replace_name(body, name)
  body.gsub(/joel/, name)
end

describe 'replace_name' do
  it 'removes my name' do
    body = "hello joel"
    replace_name(body, 'hank').should eq "hello hank"
  end

  it 'clears out invalid UTF-8 bytes' do
    body = "hello joel\255"
    replace_name(body, 'hank').should eq "hello hank"
  end
end
The first test passes as expected, and the second will fail as expected but not with the error we want. By adding that extra byte we should see an exception raised similar to what we simulated in IRB. Instead it’s failing in the comparison with the expected value.
1) replace_name clears out invalid UTF-8 bytes
Failure/Error: replace_name(body, 'hank').should eq "hello hank"
expected: "hello hank"
got: "hello hank\xAD"
(compared using ==)
# ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'
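Why isn’t it failing properly? If we pry into our running test we find out that inside our file the strings being passed around are encoded as ASCII-8BIT instead of UTF-8. On Ruby 1.9 a source file without a magic encoding comment defaults to US-ASCII, and a literal containing a high-byte escape gets tagged as ASCII-8BIT, an encoding in which every byte is valid. (From Ruby 2.0 on, source files default to UTF-8, so the same literal would already be tagged UTF-8.)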
[2] pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding
=> #<Encoding:ASCII-8BIT>
As a result we’ll have to force that string’s encoding to UTF-8:
it 'clears out invalid UTF-8 bytes' do
  body = "hello joel\255".force_encoding('UTF-8')
  replace_name(body, 'hank').should eq "hello hank" # an ArgumentError raised by gsub fails the example here
end
By running the test now we will see our desired exception:
1) replace_name clears out invalid UTF-8 bytes
Failure/Error: body.gsub(/joel/, name)
ArgumentError:
invalid byte sequence in UTF-8
# ./invalid_byte_spec.rb:4:in `gsub'
# ./invalid_byte_spec.rb:4:in `replace_name'
# ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'
Finished in 0.00426 seconds
2 examples, 1 failure
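The same difference can be reproduced outside RSpec. Here is a minimal sketch, assuming nothing beyond core Ruby; gsub_outcome is a hypothetical helper written for this illustration. It shows that force_encoding leaves the bytes alone and only changes the label, and that the label decides whether gsub complains.

```ruby
# Hypothetical helper: runs the same gsub as replace_name and reports
# either :ok or the message of the ArgumentError it raised.
def gsub_outcome(body)
  body.gsub(/joel/, 'hank')
  :ok
rescue ArgumentError => e
  e.message
end

raw = "hello joel\255"

# Same bytes, two different encoding labels.
binary = raw.dup.force_encoding('ASCII-8BIT')
utf8   = raw.dup.force_encoding('UTF-8')

puts binary.bytes.to_a == utf8.bytes.to_a   # true: the bytes are untouched
puts gsub_outcome(binary)                   # ok
puts gsub_outcome(utf8)                     # invalid byte sequence in UTF-8
```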
Now that we’re comfortably in the red part of red/green/refactor we can move on to getting this passing by updating our replace_name method.
def replace_name(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end
And the test?
Finished in 0.04252 seconds
2 examples, 0 failures
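One caveat worth knowing: treating the input as 'binary' means every byte above 127 is undefined in the conversion, so valid multibyte characters get dropped along with the broken ones. On Ruby 2.1 or newer, String#scrub is an alternative that only removes the bytes that are actually invalid. The sketch below compares the two; the method names replace_name_scrubbed and replace_name_binary are invented for this comparison and are not how Griddler does it.

```ruby
# Sketch of an alternative, assuming Ruby 2.1+ for String#scrub.
def replace_name_scrubbed(body, name)
  body.scrub('').gsub(/joel/, name)
end

# The version from above, repeated under a different name for comparison.
def replace_name_binary(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end

broken   = "hello joel\255".dup.force_encoding('UTF-8')
accented = "h\u00E9llo joel" # valid UTF-8 with a two-byte character

puts replace_name_scrubbed(broken, 'hank')    # hello hank
puts replace_name_binary(broken, 'hank')      # hello hank
puts replace_name_scrubbed(accented, 'hank')  # héllo hank
puts replace_name_binary(accented, 'hank')    # hllo hank
```

Which one fits depends on the data: if the body should only ever contain ASCII, stripping everything else may be exactly what you want.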
For such a small piece of code we admittedly had to jump through some hoops. Through that process, however, we learned a bit about character encoding and how to put ourselves in the right position—through the red/green/refactor cycle—to fix bugs we will undoubtedly run into while writing software.
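Another approach, when cleaning a file line by line, is to round-trip each line through UTF-16BE so that invalid bytes are replaced with "?":
# encoding: utf-8
require 'json'

f = "dsp-cpi"
File.foreach(f) do |line|
  # Going through UTF-16BE forces a real conversion; on some older Rubies
  # encoding UTF-8 to UTF-8 is a no-op and leaves invalid bytes in place.
  line = line.encode("UTF-16BE", :invalid => :replace, :replace => "?").encode("UTF-8")
  # ... work with the cleaned line here
end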