Problem(Abstract)

When WebSphere Application Server converts content from a file or string, numerals may be converted to their word equivalents, especially when PDFBox is used to extract text. sun.io.MalformedInputException errors may also be reported.

Symptom

Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.

For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.

Diagnosing the problem

When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. This problem is seen on Linux; however, MalformedInputException errors are likely to be seen on other operating systems.

We have a stand-alone test case that can confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command, using the Java™ executable that is bundled with your WebSphere Application Server:

[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace

Resolving the problem

These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:

-Dibm.stream.nio=true
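
To confirm that the argument actually reached the JVM, a small check such as the following can be run with the same Java executable. This is only a sketch: the class name NioFlagCheck and the helper describe are hypothetical, and the code only reports the system property value. It does not verify how the IBM SDK interprets it.

```java
class NioFlagCheck {
    // Hypothetical helper: turns the raw property value into a readable status.
    static String describe(String value) {
        return value == null ? "not set" : "set to '" + value + "'";
    }

    public static void main(String[] args) {
        // System.getProperty returns null when -Dibm.stream.nio=... was not passed.
        String nio = System.getProperty("ibm.stream.nio");
        System.out.println("ibm.stream.nio is " + describe(nio));
        // The default encoding is printed as well, because it is relevant
        // to the conversion problem described in this document.
        System.out.println("file.encoding is " + describe(System.getProperty("file.encoding")));
    }
}
```

For example, run it as [JAVA_EXECUTABLE] -Dibm.stream.nio=true NioFlagCheck and compare the output with a run that omits the argument.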

I am getting a MalformedInputException. How can I resolve this?

This exception does not alter the resulting string, which is still output after the exception is reported. Java IO is designed to throw exceptions when it encounters errors. If you switch to NIO, these exceptions are caught and are not written to the log.

You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it is set to a UTF-8 locale. It may read something like this:

# echo $LANG
en_US.UTF-8

Change the variable to remove the .UTF-8 suffix from the end of the string. On UNIX and Linux, you can type the following at the command prompt:

# export LANG=en_US

Alternatively, you can add this environment variable from the administration console.
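
To see which LANG value and default character set the JVM itself picked up, a check along these lines may help. It is a minimal sketch: the class name LocaleCheck and the helper hasUtf8Suffix are hypothetical, and the suffix test covers only the common .UTF-8 and .UTF8 spellings.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class LocaleCheck {
    // Hypothetical helper: true when a LANG value such as "en_US.UTF-8"
    // carries a UTF-8 suffix; false for "en_US" or an unset (null) value.
    static boolean hasUtf8Suffix(String lang) {
        if (lang == null) {
            return false;
        }
        String upper = lang.toUpperCase(Locale.ROOT);
        return upper.endsWith(".UTF-8") || upper.endsWith(".UTF8");
    }

    public static void main(String[] args) {
        // LANG as inherited by this JVM process; null if the variable is unset.
        String lang = System.getenv("LANG");
        System.out.println("LANG = " + lang);
        System.out.println("LANG has UTF-8 suffix = " + hasUtf8Suffix(lang));
        // The charset the JVM uses when no encoding is given explicitly.
        System.out.println("Default charset = " + Charset.defaultCharset());
    }
}
```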

A MalformedInputException may also occur when your application runs on WebSphere Application Server; in that case, it is written to standard error.

Why is Java IO used for converting text?
Java IO, rather than NIO (New IO), is retained in the IBM SDK for performance reasons. By design, Java IO throws exceptions such as MalformedInputException when it encounters errors, while NIO does not.

The JVM can be forced to use NIO if the JVM argument is used as stated above.
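
The difference between reporting and suppressing malformed input can be illustrated with the public java.nio.charset API. This sketch does not reproduce the IBM SDK's internal sun.io converters; it only shows, under that assumption, how the same malformed bytes can either raise an exception or be replaced silently. The class name DecodeDemo and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Lenient decode: malformed bytes are replaced with U+FFFD and
    // no exception is thrown.
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decode: malformed bytes raise a CharacterCodingException
    // (java.nio.charset.MalformedInputException is a subclass).
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // "123" followed by 0xFF, a byte that is never valid in UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("lenient: " + lenient(bad));
        try {
            System.out.println("strict: " + strict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```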

Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
