[Reference] IBM: sun.io.MalformedInputException and text encoding conversions that transform numerals to their word equivalents
Problem (Abstract)
When WebSphere Application Server converts content from a file or string, numbers may be converted to their word equivalents and sun.io.MalformedInputException errors may be thrown. This is most likely when PDFBox is used to extract text.
Symptom
Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. This transformation has been seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems as well.
We have a stand-alone test case that can confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
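To confirm that the argument actually reached the JVM, a small check such as the following can be run with the same Java executable and arguments. This is a minimal sketch: the class name NioFlagCheck is hypothetical, and the code only reads the system property; it does not verify that the SDK honors it.

```java
public class NioFlagCheck {
    // Returns true only when the system property ibm.stream.nio is set to "true",
    // for example via the JVM argument -Dibm.stream.nio=true.
    static boolean nioForced() {
        return Boolean.getBoolean("ibm.stream.nio");
    }

    public static void main(String[] args) {
        // Print the raw property value (null when the argument was not passed).
        System.out.println("ibm.stream.nio = " + System.getProperty("ibm.stream.nio"));
        System.out.println("NIO requested: " + nioForced());
    }
}
```

Run it as [JAVA_EXECUTABLE] -Dibm.stream.nio=true NioFlagCheck after compiling; if the property prints as null, the argument was not applied to that JVM.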
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. When NIO is used instead, these exceptions are caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see whether it is set to a UTF-8 locale. It may read something like this:
# echo $LANG
en_US.UTF-8
Alter the variable by removing the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
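To see what the JVM itself picks up from the environment, a check along these lines may help. It is a sketch only: the class name LocaleCheck and the helper hasUtf8Suffix are hypothetical, and the suffix test is a simple string comparison rather than a full locale parser.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class LocaleCheck {
    // True when a LANG-style value ends with a UTF-8 codeset suffix, e.g. "en_US.UTF-8".
    static boolean hasUtf8Suffix(String lang) {
        if (lang == null) {
            return false;
        }
        String upper = lang.toUpperCase(Locale.ROOT);
        return upper.endsWith(".UTF-8") || upper.endsWith(".UTF8");
    }

    public static void main(String[] args) {
        String lang = System.getenv("LANG");
        System.out.println("LANG = " + lang);
        System.out.println("LANG has UTF-8 suffix: " + hasUtf8Suffix(lang));
        // The encoding the JVM derived from the environment at startup.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("Default charset = " + Charset.defaultCharset());
    }
}
```

Running it before and after changing LANG shows whether the new value is visible to the Java process.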
MalformedInputException may also occur when running your application on WebSphere Application Server, in which case it is written to standard error.
Why is Java IO used for converting text?
Java IO is retained in the IBM SDK, instead of NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as MalformedInputException, while NIO does not.
The JVM can be forced to use NIO by adding the JVM argument -Dibm.stream.nio=true, as stated above.
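The difference between reporting malformed input and silently replacing it can be illustrated with the standard java.nio.charset API. This is a sketch of the two error-handling policies in general, not the IBM SDK's internal converter code; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedInputDemo {
    // Strict decode: malformed bytes raise a CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decode: malformed bytes are replaced with U+FFFD and nothing is thrown.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte 0xFF never appears in valid UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("Lenient: " + decodeLenient(bad));
        try {
            System.out.println("Strict: " + decodeStrict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("Strict decode failed: " + e);
        }
    }
}
```

In both cases the valid part of the input ("123") is still decodable, which matches the observation that the exception does not alter the resulting string.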
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.