【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States
Problem(Abstract)
When converting contents from a file or string using WebSphere Application Server, numbers may be converted to their word equivalents, especially if using PDFBOX to extract text, along with sun.io.MalformedInputExceptions.
Symptom
Text extracted from UTF-8 sources, such as PDFs, are displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that using UTF-8, you may find that numbers and non-alpha characters are transforming into word equivalents. This is a problem seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.
We have a stand-alone test case that can confirm if the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and execute the following against the Java™ executable that is bundled with your WebSphere Application Server.
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
Because of the IBM SDK's use of Java IO for text and font conversion, these transformation issues occur. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. By switching to NIO, these exceptions would be caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it set to UTF-8. It may read something like this:
# echo $LANG
en_US.UTF-8
Alter the variable and remove the .UTF-8 appended to the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
MalformedInputException may also occur when running your application on WebSphere Application Server and would be output to the standard error.
Why is Java IO used for converting text?
Java IO is retained in the IBM SDK for performance reasons instead of using NIO, or New IO. By design, Java IO will throw exceptions when errors are encountered, such as the MalformedInputExcpetion error, while NIO will not.
The JVM can be forced to use NIO if the JVM argument is used as stated above.
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States的更多相关文章
- c#字符编码,System.Text.Encoding类,字符编码大全:如Unicode编码、GB18030、UTF-8,UTF-7,GB2312,ASCII,UTF32,Big5
本页列出来目前window下所有支持的字符编码 ---c#通过 System.Text.Encoding.GetEncodings()获取,里面可以对其进行查询,筛选,对同一个字符,在不同编码进行查 ...
- System.Text.Encoding.Default
string strTmp = "abcdefg某某某";int i= System.Text.Encoding.Default.GetBytes(strTmp).Length;/ ...
- java.io.IOException: Malformed \uxxxx encoding.
java.io.IOException: Malformed \uxxxx encoding. at com.dong.frame.util.ReadProperties.read(ReadProp ...
- System.Text.Encoding.cs
ylbtech-System.Text.Encoding.cs 1.程序集 mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77 ...
- LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs
问题代码: b=b'\x01\x02\x03' x=binascii.b2a_hex(b.decode('hex')[::-1].encode('hex')) python2下是不报错的,因为pyth ...
- UnicodeMath数学公式编码_翻译(Unicode Nearly Plain - Text Encoding of Mathematics Version 3)
目录 完整目录 1. 简介 2. 编码简单数学表达式 2.1 分数 2.2 上标和下标 2.3 空白(空格)字符使用 3. 编码其他数学表达式 3.1 分隔符 强烈推荐本文简明版UnicodeMath ...
- (https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014550004)Topic: Caught java.io.CharConversionException. ERRORCODE=-4220, SQLSTATE=null
270002WDPN 3 Posts 0 people l ...
- SqlException with message "Caught java.io.CharConversionException." and ERRORCODE=-4220
Technote (troubleshooting) Problem(Abstract) When an application uses the IBM Data Server Driver for ...
- sqoop从DB2迁移数据到HDFS
Sqoop import job failed to read data from DB2 database which has UTF8 encoding. Essentially, even th ...
随机推荐
- Bluefish Editor - 蓝鱼编辑器 for Web Designer
Bluefish is a GTK+ HTML editor for the experienced web designer. Its features include nice wizards f ...
- .NET MVC Dropzone 上传图片
在nuget控制台输入:Install-Package dropzone @{ Layout = null; } <!DOCTYPE html> <html> <head ...
- mac修改默认打开方式
首先选中你要修改默认打开方式的文件,右键单击这个文件,在弹出的菜单中,选择“查看简介”: 然后在弹出的菜单中,找到“打开方式”选项,从下来的菜单中,找到你希望默认打开这个文件的程序: 然后点击下面的“ ...
- java处理日期时间代码
public static String FORMATE_DATE_STR = "yyyy-MM-dd"; public static String FORMATE_TIME_ST ...
- CF343E Pumping Stations(最小割树)
没学过最小割树的出门左转. 我们已经知道了两点的最小割就是最小割树上,对应两点之间路径的权值的最小值. 找到最小割树中权值的最小的边. 那么一定是先选完一侧的点在选完另一侧的点. 因为当前边最小,那么 ...
- 【Latex常见问题总结】
1. 非数学符号如max/min将下标放到正下方,这个问题折腾了很久, 下标不在正下方会带俩两个问题,一是有时候不够美观,二是会使得数学公式过长越界,需要换行. 解决方案:将符号转换为数学符号, \m ...
- ansible 定义主机用户和密码
定义主机组用户和密码 [webservers] ansible[01:04] ansible_ssh_user='root' ansible_ssh_pass='AAbb0101' [root@ftp ...
- COGS——T 2739. 凯伦和咖啡
http://www.cogs.pro/cogs/problem/problem.php?pid=2739 ★★☆ 输入文件:coffee.in 输出文件:coffee.out 简单对比时 ...
- 【Linux】进程调度概述
1 可运行队列 (基于实时进程调度) 调度程序中最主要的数据结构式运行队列(runqueue).可运行队列是给定处理器上的可运行进程的链表,每一个处理器一个. 每一个可投入运行的进程都唯一的归属于一个 ...
- HDU 1285--确定比赛名次【拓扑排序 && 邻接表实现】
确定比赛名次 Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) Total Sub ...