Problem(Abstract)

When converting contents from a file or string using WebSphere Application Server, numbers may be converted to their word equivalents, especially if using PDFBOX to extract text, along with sun.io.MalformedInputExceptions.

Symptom

Text extracted from UTF-8 sources, such as PDFs, are displayed incorrectly.

For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.

Diagnosing the problem

When extracting text from a source that using UTF-8, you may find that numbers and non-alpha characters are transforming into word equivalents. This is a problem seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.

We have a stand-alone test case that can confirm if the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and execute the following against the Java™ executable that is bundled with your WebSphere Application Server.

[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace

Resolving the problem

Because of the IBM SDK's use of Java IO for text and font conversion, these transformation issues occur. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:

-Dibm.stream.nio=true

I am getting a MalformedInputException. How can I resolve this?

This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. By switching to NIO, these exceptions would be caught and not reported to the log.

You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it set to UTF-8. It may read something like this:

# echo $LANG
en_US.UTF-8

Alter the variable and remove the .UTF-8 appended to the end of the string. From the command prompt on UNIX and Linux, you can type the following:

# export LANG=en_US

Alternatively, you can add this environment variable from the administration console.

MalformedInputException may also occur when running your application on WebSphere Application Server and would be output to the standard error.

Why is Java IO used for converting text?
Java IO is retained in the IBM SDK for performance reasons instead of using NIO, or New IO. By design, Java IO will throw exceptions when errors are encountered, such as the MalformedInputExcpetion error, while NIO will not.

The JVM can be forced to use NIO if the JVM argument is used as stated above.

Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.

【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States的更多相关文章

  1. c#字符编码,System.Text.Encoding类,字符编码大全:如Unicode编码、GB18030、UTF-8,UTF-7,GB2312,ASCII,UTF32,Big5

    本页列出来目前window下所有支持的字符编码  ---c#通过 System.Text.Encoding.GetEncodings()获取,里面可以对其进行查询,筛选,对同一个字符,在不同编码进行查 ...

  2. System.Text.Encoding.Default

    string strTmp = "abcdefg某某某";int i= System.Text.Encoding.Default.GetBytes(strTmp).Length;/ ...

  3. java.io.IOException: Malformed \uxxxx encoding.

    java.io.IOException: Malformed \uxxxx encoding.  at com.dong.frame.util.ReadProperties.read(ReadProp ...

  4. System.Text.Encoding.cs

    ylbtech-System.Text.Encoding.cs 1.程序集 mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77 ...

  5. LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs

    问题代码: b=b'\x01\x02\x03' x=binascii.b2a_hex(b.decode('hex')[::-1].encode('hex')) python2下是不报错的,因为pyth ...

  6. UnicodeMath数学公式编码_翻译(Unicode Nearly Plain - Text Encoding of Mathematics Version 3)

    目录 完整目录 1. 简介 2. 编码简单数学表达式 2.1 分数 2.2 上标和下标 2.3 空白(空格)字符使用 3. 编码其他数学表达式 3.1 分隔符 强烈推荐本文简明版UnicodeMath ...

  7. (https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014550004)Topic: Caught java.io.CharConversionException. ERRORCODE=-4220, SQLSTATE=null

    270002WDPN                                            3 Posts                             0 people l ...

  8. SqlException with message "Caught java.io.CharConversionException." and ERRORCODE=-4220

    Technote (troubleshooting) Problem(Abstract) When an application uses the IBM Data Server Driver for ...

  9. sqoop从DB2迁移数据到HDFS

    Sqoop import job failed to read data from DB2 database which has UTF8 encoding. Essentially, even th ...

随机推荐

  1. Fragment间相互调用并传值

    public class MainFragment extends Fragment { private static final String ARG_DATE="com.example. ...

  2. How to include custom library into maven local repository?--转

    原文地址:https://www.mkyong.com/maven/how-to-include-library-manully-into-maven-local-repository/ There ...

  3. 截图 gif 图小工具

    GifCam gyazo gyazo.com Windows 10 中的内置截屏(Win+G)

  4. http协议以及防盗链技术

    http协议,又称为超文本传输协议,顾名思义,http协议不仅能传输文本,还能传输图片,视频,压缩包等文件,http协议是建立在tcp/ip协议的基础之上的,http协议对php程序员来讲可以说是重中 ...

  5. Android中的事件分发机制

    Android中的事件分发机制 作者:丁明祥 邮箱:2780087178@qq.com 这篇文章这周之内尽量写完 参考资料: Android事件分发机制完全解析,带你从源码的角度彻底理解(上) And ...

  6. SwipeRefreshLayout的使用,下拉刷新

    1. <?xml version="1.0" encoding="utf-8"?> <RelativeLayout xmlns:android ...

  7. linux系统管理-软件包管理

    概述: inux家族中的软件包管理有很多工具. 一种是在debiton系列的linux中,以像ubuntu的apt-get为代表.对于此种方式的管理方式,个人感觉挺简单方便的, 一种是在Fedora和 ...

  8. 401 - Unauthorized: Access is denied due to invalid credentials.

    solution:change application pool from ApplicationPoolIdentity to NetworkService.

  9. 04《UML大战需求分析》之四

    在学习完顺序图之后,流程分析的三种图,我已经学习完了我,但是我还需要大量地锻炼,这样才可以更加熟练地掌握几种图的使用和方法.接下来,我学习了用例图,用来描述系统的行为. 虽然是一同学习的,但是对用例图 ...

  10. ZOJ 3203 Light Bulb( 三分求极值 )

    链接:传送门 题意: 求影子长度 L 的最大值 思路:如果 x = 0 ,即影子到达右下角时,如果人继续向后走,那么影子一定是缩短的,所以不考虑这种情况.根据图中的辅助线外加相似三角形定理可以得到 L ...