Chances are, some of you have run into the invalid byte sequence in UTF-8 error when dealing with user-submitted data. A Google search shows that my hunch isn’t off.

Among the search results are plenty of answers—some using the deprecated iconv library—that might lead you to a sufficient fix. However, among the slew of queries are few answers on how to reliably replicate and test the issue.

In developing the Griddler gem we ran into some cases where the data being posted back to our controller had invalid UTF-8 bytes. For Griddler, our failing case needs to simulate the body of an email having an invalid byte, and encoded as UTF-8.

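What are valid and invalid bytes? This table on Wikipedia tells us bytes 192, 193, and 245-255 are off limits. In a Ruby string literal we can write a raw byte with a numeric escape. Note that "\255" is an octal escape, so it produces byte 0xAD (decimal 173) rather than 255; 0xAD is a continuation byte, which is just as invalid when no lead byte precedes it: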

> "hi \255"
=> "hi \xAD"

There’s our string with the invalid byte! How do we know for sure? In that IRB session we can trigger the error by sending the string a message it won’t like, such as split or gsub.

> "hi \255".split(' ')
ArgumentError: invalid byte sequence in UTF-8
from (irb):9:in `split'
from (irb):9
from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'

Yup. It certainly does not like that.
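
If we'd rather not provoke an exception just to confirm a string is broken, Ruby can tell us directly. The sketch below is not from the original session: `byte_report` is a hypothetical helper name, and it relies on `String#bytes` and `String#valid_encoding?` to show which byte we actually wrote and whether Ruby considers the string well formed. The `force_encoding` call is there only so the result does not depend on the source file's encoding.

```ruby
# Hypothetical helper: list a string's raw bytes and whether they are
# well formed in the string's current encoding.
def byte_report(text)
  { bytes: text.bytes, valid: text.valid_encoding? }
end

# "\255" is an octal escape, so the last byte is 0xAD (173), not 255.
octal = "hi \255".dup.force_encoding(Encoding::UTF_8)

# "\xFF" really is byte 255, one of the values the table lists as off limits.
hex = "hi \xFF".dup.force_encoding(Encoding::UTF_8)

p byte_report(octal)  # bytes end in 173, valid is false
p byte_report(hex)    # bytes end in 255, valid is false
p byte_report("hi")   # plain ASCII, valid is true
```

Both strings are invalid, just for different reasons: 0xAD is a continuation byte with nothing to continue, while 0xFF can never appear in UTF-8 at all.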

Let’s create a very real-world, enterprise-level, business-critical test case:

invalid_byte_spec.rb

require 'rspec'

def replace_name(body, name)
  body.gsub(/joel/, name)
end

describe 'replace_name' do
  it 'removes my name' do
    body = "hello joel"

    replace_name(body, 'hank').should eq "hello hank"
  end

  it 'clears out invalid UTF-8 bytes' do
    body = "hello joel\255"

    replace_name(body, 'hank').should eq "hello hank"
  end
end

The first test passes as expected, and the second will fail as expected but not with the error we want. By adding that extra byte we should see an exception raised similar to what we simulated in IRB. Instead it’s failing in the comparison with the expected value.

1) replace_name clears out invalid UTF-8 bytes
Failure/Error: replace_name(body, 'hank').should eq "hello hank"
expected: "hello hank"
got: "hello hank\xAD" (compared using ==)
# ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Why isn’t it failing properly? If we pry into our running test we find out that inside our file the strings being passed around are encoded as ASCII-8BIT instead of UTF-8.

[2] pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding
=> #<Encoding:ASCII-8BIT>

As a result we’ll have to force that string’s encoding to UTF-8:

it 'clears out invalid UTF-8 bytes' do
  body = "hello joel\255".force_encoding('UTF-8')

  replace_name(body, 'hank').should_not raise_error(ArgumentError)
  replace_name(body, 'hank').should eq "hello hank"
end
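
To see why the encoding label makes the difference, here is a small sketch that runs outside of RSpec. It is an illustration rather than part of the Griddler code: `gsub_outcome` is a hypothetical helper, and `String#b` (Ruby 2.0 and later, so newer than the 1.9.3 used here) stands in for the ASCII-8BIT string we found with pry.

```ruby
# Hypothetical helper: run the same gsub as replace_name and report what
# happened instead of letting the exception escape.
def gsub_outcome(body)
  body.gsub(/joel/, 'hank')
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

# String#b returns a copy tagged ASCII-8BIT, matching what pry showed us.
binary_body = "hello joel\255".b

# force_encoding only swaps the label; the bytes themselves are untouched.
utf8_body = binary_body.dup.force_encoding(Encoding::UTF_8)

puts binary_body.bytes == utf8_body.bytes  # same bytes under both labels
p gsub_outcome(binary_body)  # substitution succeeds, the stray byte survives
p gsub_outcome(utf8_body)    # the ArgumentError we saw in IRB
```

Any byte sequence is acceptable in a binary string, so gsub has nothing to complain about until we tell Ruby the bytes are meant to be UTF-8.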

Running the test now, we see our desired exception:

1) replace_name clears out invalid UTF-8 bytes
Failure/Error: body.gsub(/joel/, name)
ArgumentError:
invalid byte sequence in UTF-8
# ./invalid_byte_spec.rb:4:in `gsub'
# ./invalid_byte_spec.rb:4:in `replace_name'
# ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Finished in 0.00426 seconds
2 examples, 1 failure

Now that we’re comfortably in the red part of red/green/refactor we can move on to getting this passing by updating our replace_name method.

def replace_name(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end
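
One caveat worth knowing about this fix: treating the source as 'binary' means every byte above 127 has no UTF-8 equivalent, so legitimate multibyte characters are stripped along with the bad byte. The sketch below compares it with `String#scrub`, which exists in Ruby 2.1 and later (so not in the 1.9.3 used above) and drops only the invalid bytes. The method names `strip_via_binary` and `strip_via_scrub` are hypothetical, chosen for the comparison.

```ruby
# The approach from replace_name: every non-ASCII byte is undefined when
# converting from binary, so all of them are replaced with ''.
def strip_via_binary(body)
  body.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Alternative for Ruby 2.1+: scrub removes only the bytes that are invalid.
def strip_via_scrub(body)
  body.scrub('')
end

# "café joel" followed by the stray 0xAD byte, built from binary pieces so
# the example does not depend on how the literal is parsed.
mixed = ("caf\u00E9 joel".b << 0xAD.chr).force_encoding(Encoding::UTF_8)

p strip_via_binary(mixed)  # the accented e is lost along with the bad byte
p strip_via_scrub(mixed)   # only the bad byte is removed
```

For email bodies that may contain accented names or non-Latin text, that difference decides whether a user's content survives the cleanup.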

And the test?

Finished in 0.04252 seconds
2 examples, 0 failures

For such a small piece of code we admittedly had to jump through some hoops. Through that process, however, we learned a bit about character encoding and how to put ourselves in the right position—through the red/green/refactor cycle—to fix bugs we will undoubtedly run into while writing software.

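# encoding: utf-8
require 'json'

# Round-trip each line through UTF-16BE so that invalid bytes are replaced with "?".
f = "dsp-cpi"
File.open(f).each do |line|
  line = line.encode("UTF-16BE", :invalid => :replace, :replace => "?").encode("UTF-8")
  # line is now valid UTF-8 and safe to hand to split, gsub and friends
end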
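
The same round trip can be tried without a file on disk. This is a sketch under assumptions: `clean_lines` is a hypothetical name, and a `StringIO` stands in for the opened file so the input is visible in the example.

```ruby
require 'stringio'

# Hypothetical helper: read lines from any IO-like object and return them
# with invalid bytes replaced by "?", using the UTF-16BE round trip.
def clean_lines(io)
  io.each_line.map do |line|
    line.encode('UTF-16BE', invalid: :replace, replace: '?').encode('UTF-8')
  end
end

# Two lines of input, the second carrying a stray 0xAD byte.
raw = "good line\nbad \xAD line\n".dup.force_encoding(Encoding::UTF_8)

cleaned = clean_lines(StringIO.new(raw))
p cleaned
p cleaned.all?(&:valid_encoding?)
```

Unlike the 'binary' approach, this keeps valid multibyte characters, and the "?" marks where something was dropped.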
