Technote (troubleshooting)

Problem (Abstract)

During an insert from the CLP, no codepage conversion takes place if the operating system codepage and the database codepage are both UTF-8. In this case the data to be inserted must also be encoded in UTF-8.
If the data has a different encoding than the database codepage (this can be verified with any hex editor), change the operating system codepage to match the data's encoding so that the data is converted to the database codepage.

Symptom

Error executing Select SQL statement. Caught by java.io.CharConversionException. ERRORCODE=-4220

Caused by: java.nio.charset.MalformedInputException: Input length = 4759
    at com.ibm.db2.jcc.b.u.a(u.java:19)
    at com.ibm.db2.jcc.b.bc.a(bc.java:1762)

Cause

When data is inserted using the CLP and the operating system and database codepages are both UTF-8, the characters do not go through codepage conversion. If the data to be inserted is not actually encoded in UTF-8, the database can end up holding byte sequences that are not valid UTF-8, and the above error results when the data is retrieved.

To verify the encoding of the data to be inserted, use any editor that shows the hex representation of characters and check the bytes of the non-ASCII characters you are trying to insert. If you see only 1 byte per non-ASCII character, the data is not UTF-8, and you need to force codepage conversion when inserting from the CLP into the UTF-8 database.
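
If no hex editor is at hand, the same inspection can be sketched in a few lines of Python. This is only an illustration: the function name `non_ascii_runs` and the sample strings are hypothetical, not part of DB2 or the CLP. It reports each run of consecutive bytes with the high bit set, so a run of length 1 hints at a single-byte encoding such as codepage 819. Note that two adjacent non-ASCII characters in a single-byte encoding would also form a longer run, so treat the result as a hint rather than proof.

```python
def non_ascii_runs(raw: bytes):
    """Return (offset, hex_string) for each run of consecutive bytes >= 0x80."""
    runs = []
    start = None
    for i, b in enumerate(raw):
        if b >= 0x80:
            if start is None:
                start = i          # a new non-ASCII run begins here
        elif start is not None:
            runs.append((start, raw[start:i].hex().upper()))
            start = None
    if start is not None:          # run extends to the end of the data
        runs.append((start, raw[start:].hex().upper()))
    return runs


# Hypothetical sample: the text "caf\u00e9 \u00a9" stored in two encodings.
sample_latin1 = b"caf\xe9 \xa9"
sample_utf8 = "caf\u00e9 \u00a9".encode("utf-8")

print(non_ascii_runs(sample_latin1))  # [(3, 'E9'), (5, 'A9')]
print(non_ascii_runs(sample_utf8))    # [(3, 'C3A9'), (6, 'C2A9')]
```

In practice the bytes would come from the input file, for example `open(path, "rb").read()`.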

To force codepage conversion during an insert from the CLP into a Unicode database from a non-Unicode data source, make sure that the operating system codepage is non-Unicode and matches the codepage of the data.
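
What the conversion does to the bytes can be illustrated with a short Python sketch. It is not how DB2 implements the conversion; it assumes the source data is in codepage 819 (ISO-8859-1), and the helper name `convert_to_utf8` is hypothetical. The bytes are first interpreted in the source codepage and then re-encoded as UTF-8.

```python
def convert_to_utf8(raw: bytes, source_codepage: str = "iso-8859-1") -> bytes:
    """Interpret raw bytes in the source codepage and re-encode them as UTF-8."""
    return raw.decode(source_codepage).encode("utf-8")


# Single bytes as they appear in codepage 819 data.
for single_byte in (b"\xc3", b"\xb3", b"\xa9"):
    converted = convert_to_utf8(single_byte)
    print(single_byte.hex().upper(), "->", converted.hex().upper())
# C3 -> C383
# B3 -> C2B3
# A9 -> C2A9
```

When no conversion happens, the single bytes on the left are stored as they are, which is what leaves invalid UTF-8 in the database.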

Problem Details

An example problem scenario is as follows:

  1. Create a database of type UTF-8:
    CREATE DATABASE <db> USING CODESET utf-8 TERRITORY US
  2. Create a table that holds character data:
    CREATE TABLE test (col char(20))
  3. Check operating system locale:
    locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
  4. Insert the non-ASCII characters 'Ã', '³', and '©', which have codepoints 0x'C3', 0x'B3', and 0x'A9' in codepage 819, into the table:
    INSERT INTO test VALUES ('Ã')
    INSERT INTO test VALUES ('³')
    INSERT INTO test VALUES ('©')
  5. By running the following statement, you can see that all INSERT statements caused only one byte to be inserted into the table:
    SELECT col, HEX(col) FROM test
    Ã C3
    ³ B3
    © A9
    However, the UTF-8 representations of those characters are 0x'C383' for 'Ã', 0x'C2B3' for '³', and 0x'C2A9' for '©'. So these three rows in the table contain byte sequences that are invalid in UTF-8.
  6. When selecting from the column using a JDBC application, the following error occurs. This is expected because the table contains invalid UTF-8 data:
    Error executing Select SQL statement. Caught by java.io.CharConversionException. ERRORCODE=-4220
    Caused by: java.nio.charset.MalformedInputException: Input length = 4759
        at com.ibm.db2.jcc.b.u.a(u.java:19)
        at com.ibm.db2.jcc.b.bc.a(bc.java:1762)
  7. Delete all rows with incorrect Unicode codepoints from the test table:
    DELETE FROM test
  8. Change the locale to one that matches the codepage of the data to be inserted, for example:
    export LANG=en_US
    The exact locale name depends on your system; locale -a lists the names that are available. One way to determine the codepage of your data is described here: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text. If you prepare the data yourself in an editor, check the editor's documentation to find out how to set the codepage it uses when saving data.
  9. Insert the data into the table:
    INSERT INTO test VALUES ('Ã')
    INSERT INTO test VALUES ('³')
    INSERT INTO test VALUES ('©')
  10. Verify that the inserted data was converted to UTF-8 during the insert:
    SELECT col, HEX(col) FROM test
    Ã C383
    ³ C2B3
    © C2A9
  11. Run your Java application that selects the Unicode data. No exception should be reported.
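
The comparison made in steps 5 and 10 can also be done mechanically. The following Python sketch takes hex strings such as those returned by HEX(col) and checks whether the bytes decode as UTF-8. The function name `is_valid_utf8_hex` is hypothetical, and the sample values are the ones shown in the scenario above.

```python
def is_valid_utf8_hex(hex_value: str) -> bool:
    """Return True if the bytes described by hex_value decode as strict UTF-8."""
    try:
        bytes.fromhex(hex_value).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Values from step 5 (no conversion) and step 10 (conversion enforced).
for hex_value in ("C3", "B3", "A9", "C383", "C2B3", "C2A9"):
    status = "valid" if is_valid_utf8_hex(hex_value) else "INVALID"
    print(hex_value, status)
# C3 INVALID
# B3 INVALID
# A9 INVALID
# C383 valid
# C2B3 valid
# C2A9 valid
```

A value reported as INVALID here is the kind of data that a strict UTF-8 decoder, such as the one behind the MalformedInputException above, will reject.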

Environment

UNIX, Linux, Unicode database

Diagnosing the problem

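Verify that non-ASCII data in the Unicode database is stored with valid Unicode (UTF-8) codepoints, for example by comparing the output of HEX() on the affected column with the expected UTF-8 byte sequences.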

Resolving the problem

Reinsert the data with codepage conversion enforced by setting the operating system codepage to match the codepage of the data to be inserted.
