Problem(Abstract)

When converting contents from a file or string using WebSphere Application Server, numbers may be converted to their word equivalents, especially if using PDFBOX to extract text, along with sun.io.MalformedInputExceptions.

Symptom

Text extracted from UTF-8 sources, such as PDFs, are displayed incorrectly.

For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.

Diagnosing the problem

When extracting text from a source that using UTF-8, you may find that numbers and non-alpha characters are transforming into word equivalents. This is a problem seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.

We have a stand-alone test case that can confirm if the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and execute the following against the Java™ executable that is bundled with your WebSphere Application Server.

[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace

Resolving the problem

Because of the IBM SDK's use of Java IO for text and font conversion, these transformation issues occur. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:

-Dibm.stream.nio=true

I am getting a MalformedInputException. How can I resolve this?

This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. By switching to NIO, these exceptions would be caught and not reported to the log.

You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it set to UTF-8. It may read something like this:

# echo $LANG
en_US.UTF-8

Alter the variable and remove the .UTF-8 appended to the end of the string. From the command prompt on UNIX and Linux, you can type the following:

# export LANG=en_US

Alternatively, you can add this environment variable from the administration console.

MalformedInputException may also occur when running your application on WebSphere Application Server and would be output to the standard error.

Why is Java IO used for converting text?
Java IO is retained in the IBM SDK for performance reasons instead of using NIO, or New IO. By design, Java IO will throw exceptions when errors are encountered, such as the MalformedInputExcpetion error, while NIO will not.

The JVM can be forced to use NIO if the JVM argument is used as stated above.

Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.

【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States的更多相关文章

  1. c#字符编码,System.Text.Encoding类,字符编码大全:如Unicode编码、GB18030、UTF-8,UTF-7,GB2312,ASCII,UTF32,Big5

    本页列出来目前window下所有支持的字符编码  ---c#通过 System.Text.Encoding.GetEncodings()获取,里面可以对其进行查询,筛选,对同一个字符,在不同编码进行查 ...

  2. System.Text.Encoding.Default

    string strTmp = "abcdefg某某某";int i= System.Text.Encoding.Default.GetBytes(strTmp).Length;/ ...

  3. java.io.IOException: Malformed \uxxxx encoding.

    java.io.IOException: Malformed \uxxxx encoding.  at com.dong.frame.util.ReadProperties.read(ReadProp ...

  4. System.Text.Encoding.cs

    ylbtech-System.Text.Encoding.cs 1.程序集 mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77 ...

  5. LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs

    问题代码: b=b'\x01\x02\x03' x=binascii.b2a_hex(b.decode('hex')[::-1].encode('hex')) python2下是不报错的,因为pyth ...

  6. UnicodeMath数学公式编码_翻译(Unicode Nearly Plain - Text Encoding of Mathematics Version 3)

    目录 完整目录 1. 简介 2. 编码简单数学表达式 2.1 分数 2.2 上标和下标 2.3 空白(空格)字符使用 3. 编码其他数学表达式 3.1 分隔符 强烈推荐本文简明版UnicodeMath ...

  7. (https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014550004)Topic: Caught java.io.CharConversionException. ERRORCODE=-4220, SQLSTATE=null

    270002WDPN                                            3 Posts                             0 people l ...

  8. SqlException with message "Caught java.io.CharConversionException." and ERRORCODE=-4220

    Technote (troubleshooting) Problem(Abstract) When an application uses the IBM Data Server Driver for ...

  9. sqoop从DB2迁移数据到HDFS

    Sqoop import job failed to read data from DB2 database which has UTF8 encoding. Essentially, even th ...

随机推荐

  1. sql获得某个时间段的数据

    CONVERT(Date, 时间字符串变量 ) between CONVERT(Date,'2015/2/10') and CONVERT(Date,'2015-3-10')

  2. 用IIS怎样在局域网内建网站

    IIS服务器组建一览 IIS(Internet Information Server,互联网信息服务)是一种Web(网页)服务组件,其中包括Web服务器.FTP服务器.NNTP服务器和SMTP服务器, ...

  3. 关于C++程序运行程序是出现的this application has requested the runtime to terminate it in an unusual way. 异常分析

    今天运行程序是出现了this application has requested the runtime  to terminate it in an unusual way. 的异常报告,以前也经常 ...

  4. python字符串、列表、元组

    字符串的常用方法: name.count('h')统计h在name中出现的次数 name.find('h')查找h的索引 '?'.join(name)使用问好拼接 name.encode('gb231 ...

  5. ZBrush中关于标记的特殊情况

    在ZBrush®中使用Marker标记调控板来记忆物体属性,因此能在任何时间回到标记并使用它给其他物体或改变物体作为参考点,在使用Marker标记调控板时回出现很多特殊情况,本文小编就这些特殊情况做一 ...

  6. luogu P2000 拯救世界 生成函数_麦克劳林展开_python

    模板题. 将所有的多项式按等比数列求和公式将生成函数压缩,相乘后麦克劳林展开即可. Code: n=int(input()) print((n+1)*(n+2)*(n+3)*(n+4)//24)

  7. Codeforces Round #499 (Div. 2) F. Mars rover_dfs_位运算

    题解: 首先,我们可以用 dfsdfsdfs 在 O(n)O(n)O(n) 的时间复杂度求出初始状态每个点的权值. 不难发现,一个叶子节点权值的取反会导致根节点的权值取反当且仅当从该叶子节点到根节点这 ...

  8. freeswitch 编码协商

    编辑 /usr/local/freeswitch/conf/sip_profiles/internal.xml 添加注释     <param name="inbound-zrtp-p ...

  9. UDP Linux编程(客户端&服务器端)

    服务器端 服务器不用绑定地址,他只需要进行绑定相应的监听端口即可. #include <sys/types.h> #include <sys/socket.h> #includ ...

  10. 小学生都能学会的python(运算符 和 while循环)

    ---恢复内容开始--- 小学生都能学会的python(运算符和编码) 一.格式化输出 #占位:"%s"占位,占得是字符串,"%d"占位,占的是数字. # 让用 ...