Problem(Abstract)

When WebSphere Application Server converts content from a file or string, numerals may be converted to their word equivalents, especially when PDFBox is used to extract text. sun.io.MalformedInputException errors may also be reported.

Symptom

Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.

For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.

Diagnosing the problem

When extracting text from a source that uses UTF-8, you may find that numerals and non-alphabetic characters are transformed into their word equivalents. This transformation is seen on Linux; on other operating systems, MalformedInputExceptions are more likely to be seen instead.

A stand-alone test case is available to confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.

[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace

Resolving the problem

These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:

-Dibm.stream.nio=true
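
To confirm that the argument actually reached the JVM, a small diagnostic class can print the relevant settings. This is a minimal sketch, not part of the IBM test case: the class name EncodingSettingsCheck and the helper describe are hypothetical, and it only reads the property named above together with the standard file.encoding property, the default charset, and the LANG environment variable.

```java
import java.nio.charset.Charset;

public class EncodingSettingsCheck {

    /** Formats one setting, showing "(not set)" when the value is absent. */
    static String describe(String name, String value) {
        return name + " = " + (value == null ? "(not set)" : value);
    }

    public static void main(String[] args) {
        // Property from this technote; null unless passed with -Dibm.stream.nio=...
        System.out.println(describe("ibm.stream.nio", System.getProperty("ibm.stream.nio")));
        // Standard property that usually reflects the platform encoding
        System.out.println(describe("file.encoding", System.getProperty("file.encoding")));
        // Charset the JVM uses when no charset is given explicitly
        System.out.println(describe("default charset", Charset.defaultCharset().name()));
        // Locale setting inherited from the shell or server environment
        System.out.println(describe("LANG", System.getenv("LANG")));
    }
}
```

Run it with the same Java executable and the same JVM arguments as the server, for example [JAVA_EXECUTABLE] -Dibm.stream.nio=true EncodingSettingsCheck, and check that the first line shows true.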

I am getting a MalformedInputException. How can I resolve this?

This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. By switching to NIO, these exceptions would be caught and not reported to the log.

You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it is set to a UTF-8 locale. It may read something like this:

# echo $LANG
en_US.UTF-8

Alter the variable and remove the .UTF-8 appended to the end of the string. From the command prompt on UNIX and Linux, you can type the following:

# export LANG=en_US

Alternatively, you can add this environment variable from the administration console.

MalformedInputException may also occur when your application runs on WebSphere Application Server, in which case it is written to standard error.

Why is Java IO used for converting text?

Java IO, rather than NIO (New IO), is retained in the IBM SDK for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.

The JVM can be forced to use NIO if the JVM argument is used as stated above.
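
The difference between reporting and silently handling malformed input can be illustrated with the public java.nio.charset API. This is a minimal sketch under stated assumptions: it does not show the IBM SDK's internal converters, and the class name DecodeSketch and its two helper methods are hypothetical. It only demonstrates that a decoder can either replace malformed bytes or raise an exception for them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {

    /** Lenient decode: malformed bytes are replaced with U+FFFD, no exception. */
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict decode: malformed bytes raise a CharacterCodingException. */
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] good = "123 Hello Motto".getBytes(StandardCharsets.UTF_8);
        // The byte 0xFF never appears in valid UTF-8
        byte[] bad = {'1', '2', (byte) 0xFF, '3'};

        System.out.println(decodeLeniently(good));
        System.out.println(decodeLeniently(bad));
        try {
            decodeStrictly(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

Naming the charset explicitly, as both helpers do, also keeps application code from depending on the LANG setting or the JVM's default charset.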

Does the Oracle JDK suffer similar problems?

Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
