[Reference] IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States
Problem(Abstract)
When WebSphere Application Server converts the contents of a file or string, numbers may be converted to their word equivalents, especially when PDFBox is used to extract the text. sun.io.MalformedInputException errors may be reported at the same time.
Symptom
Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. This transformation is seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems as well.
A stand-alone test case is available to confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
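To confirm that the argument actually reached the JVM, a small check such as the following can be run with the same Java executable. This is only a sketch: the class name NioFlagCheck and the helper isNioForced are hypothetical, and the exact-match comparison against "true" is an assumption based on the value shown above, not on documented IBM SDK behavior.

```java
public class NioFlagCheck {

    // Hypothetical helper: treats only the exact value "true" as enabled,
    // because that is the value given in the technote. How the IBM SDK
    // itself parses the property is not covered here.
    static boolean isNioForced(String propertyValue) {
        return "true".equals(propertyValue);
    }

    public static void main(String[] args) {
        // System.getProperty returns null when the -D argument was not passed.
        String value = System.getProperty("ibm.stream.nio");
        System.out.println("ibm.stream.nio = " + value);
        System.out.println("NIO forced: " + isNioForced(value));
        // The default encoding is often useful context when diagnosing conversions.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
    }
}
```

Running it as [JAVA_EXECUTABLE] -Dibm.stream.nio=true NioFlagCheck should report the property as set; running it without the argument should report null.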
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are encountered. After switching to NIO, these exceptions are caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see whether it is set to UTF-8. It may read something like this:
# echo $LANG
en_US.UTF-8
Change the variable to remove the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
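The same check can be made from inside the JVM, which is useful because the application server process may not inherit the LANG value of your interactive shell. The following is a minimal sketch; the class name LangCheck and the helper stripUtf8Suffix are hypothetical, and matching the ".utf8" spelling in addition to ".UTF-8" is an assumption about how some systems write the codeset.

```java
public class LangCheck {

    // Hypothetical helper: removes a trailing ".UTF-8" codeset suffix
    // (also ".utf8", case-insensitive) and leaves other values unchanged.
    static String stripUtf8Suffix(String lang) {
        if (lang == null) {
            return null;
        }
        return lang.replaceFirst("(?i)\\.utf-?8$", "");
    }

    public static void main(String[] args) {
        // System.getenv returns null when the variable is not set.
        String lang = System.getenv("LANG");
        System.out.println("LANG as seen by the JVM: " + lang);
        String suggested = stripUtf8Suffix(lang);
        if (lang != null && !lang.equals(suggested)) {
            System.out.println("Suggested value without the UTF-8 suffix: " + suggested);
        }
    }
}
```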
MalformedInputException may also occur when running your application on WebSphere Application Server, in which case it is written to standard error.
Why is Java IO used for converting text?
Java IO is retained in the IBM SDK, in preference to NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.
The JVM can be forced to use NIO with the -Dibm.stream.nio=true JVM argument described above.
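The difference between reporting and silently tolerating malformed input can be illustrated with the standard java.nio.charset API. This sketch says nothing about IBM SDK internals or the legacy sun.io converters; it only shows that the String constructor substitutes a replacement character for bad bytes, while a CharsetDecoder configured with CodingErrorAction.REPORT raises an exception instead. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Lenient decoding: malformed bytes are replaced with U+FFFD, no exception.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: malformed bytes cause a CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'1', '2', '3', (byte) 0xFF};
        System.out.println("lenient length: " + decodeLenient(bad).length());
        try {
            decodeStrict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

In both cases the well-formed part of the input ("123") is unaffected, which is consistent with the statement above that the exception does not alter the resulting string.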
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.