[Java] How to detect and replace 4-byte UTF-8 encoded characters (a range that includes emoji)

> Excellent articles referenced

> How to detect and replace 4-byte UTF-8 encoded characters (a range that includes emoji)

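Our project has a requirement to save information submitted from an H5 page on mobile devices.

As we all know, mobile input methods often ship with built-in emoticons, and emoji are especially popular. Some emoji are encoded as 4 bytes in UTF-8, but with its current version and character set configuration our MySQL database can only store UTF-8 sequences of up to 3 bytes (storing 4-byte sequences would require upgrading the version and configuring a different character set). As a result, saving user input that contains such an emoji fails. For background, see the article 《十分钟搞清字符集和字符编码》 (Understanding character sets and character encodings in ten minutes).

Our requirement does not include emoji support: when we encounter an emoji, showing a prompt or filtering it out is enough.

After reading 《【异常处理】Incorrect string value: '\xF0\x90\x8D\x83...' for column... Emoji表情字符过滤的Java实现》 (a Java implementation of emoji filtering) and "Why a surrogate java regexp finds hypen-minus", we learned that the replacement can be done with the following code: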

msg.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");

It works.
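For readers who find the surrogate-based pattern hard to follow, here is a minimal sketch of the same filtering expressed in terms of code points. It relies on the fact that the characters needing 4 bytes in UTF-8 are exactly the supplementary code points U+10000 to U+10FFFF, which Java strings store as surrogate pairs. As far as I understand, the regex engine reads \ud800\udc00 and \udbff\udfff in the pattern above as U+10000 and U+10FFFF, so its first range covers the same code points, while its second range \ud800-\udfff additionally catches unpaired surrogates. The class name SupplementaryFilter and its methods are hypothetical names for illustration, and the sketch assumes Java 8 or later.

```java
class SupplementaryFilter {

    /** True if the string contains any code point above U+FFFF (4 bytes in UTF-8). */
    static boolean containsSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Removes every supplementary code point by walking the string code point by code point. */
    static String removeSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !Character.isSupplementaryCodePoint(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /** Same idea as a regex, written with explicit code point escapes (Java 7+ syntax). */
    static String removeWithRegex(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is the surrogate pair for U+1F600, an emoji
        String msg = "a\uD83D\uDE00b";
        System.out.println(containsSupplementary(msg));
        System.out.println(removeSupplementary(msg));
        System.out.println(removeWithRegex(msg));
    }
}
```

Unlike the original pattern, this sketch leaves unpaired surrogates in place; whether those need to be stripped as well depends on what the database rejects.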

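However, being limited in ability, I never managed to understand how the hexadecimal regular expression above works, so I wrote a piece of code of my own to detect and replace 4-byte UTF-8 characters (it has not been fully tested and only illustrates the general idea).

The UTF-8 encoding rules come from 《十分钟搞清字符集和字符编码》, and the conversion between bytes and hexadecimal follows 《Java中byte与16进制字符串的互相转换》 (Converting between byte and hex strings in Java).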

package com.nicchagil.tc.emojifilter;

import java.util.Arrays;

import java.util.HashMap;

import java.util.Map;

public class UTF8Utils {

    public static void main(String[] args) {

        String s = "琥珀蜜蜡由于硬度很低,打磨起来非123常简单,需要的工具也非常简单,自己去买蜜蜡是非常不划算的,完全可以自己磨蜜蜡原石的样子";

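        // Name the charset explicitly: a bare getBytes() uses the platform default, which may not be UTF-8
        System.out.println(UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8)));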

        System.out.println(UTF8Utils.bytesToHex(UTF8Utils.remove4BytesUTF8Char(s)));

    }

    public static Map<String, Integer> hexMap = new HashMap<String, Integer>();

    public static Map<String, Integer> byteMap = new HashMap<String, Integer>();

    static {

        hexMap.put("0", 2);

        hexMap.put("1", 2);

        hexMap.put("2", 2);

        hexMap.put("3", 2);

        hexMap.put("4", 2);

        hexMap.put("5", 2);

        hexMap.put("6", 2);

        hexMap.put("7", 2);

        hexMap.put("c", 4);

        hexMap.put("d", 4);

        hexMap.put("e", 6);

        hexMap.put("f", 8);

        byteMap.put("0", 1);

        byteMap.put("1", 1);

        byteMap.put("2", 1);

        byteMap.put("3", 1);

        byteMap.put("4", 1);

        byteMap.put("5", 1);

        byteMap.put("6", 1);

        byteMap.put("7", 1);

        byteMap.put("c", 2);

        byteMap.put("d", 2);

        byteMap.put("e", 3);

        byteMap.put("f", 4);

    }

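    /**
     * Checks whether the string contains a character encoded as 4 bytes in UTF-8
     * (converts the bytes to hex first, then inspects the hex string).
     * @param s the string to check
     * @return true if the string contains a 4-byte UTF-8 character
     */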

    public static boolean contains4BytesChar(String s) {

        if (s == null || s.trim().length() == 0) {

            return false;

        }

        String hex = UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));

        System.out.println("full hex : " + hex);

        String firstChar = null;

        while (hex != null && hex.length() > 1) {

            firstChar = hex.substring(0, 1);

            System.out.println("firstChar : " + firstChar);

            if ("f".equals(firstChar)) {

                System.out.println("it is f start, it is 4 bytes, return.");

                return true;

            }

            if (hexMap.get(firstChar) == null) {

                System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");

                // todo, throw exception for this case

                return false;

            }

            hex = hex.substring(hexMap.get(firstChar), hex.length());

            System.out.println("remain hex : " + hex);

        }

        return false;

    }

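    /**
     * Checks whether the string contains a character encoded as 4 bytes in UTF-8
     * (walks the byte array directly, one lead byte at a time).
     * @param s the string to check
     * @return true if the string contains a 4-byte UTF-8 character
     */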

    public static boolean contains4BytesChar2(String s) {

        if (s == null || s.trim().length() == 0) {

            return false;

        }

        byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);

        if (bytes == null || bytes.length == 0) {

            return false;

        }

        int index = 0;

        byte b;

        String hex = null;

        String firstChar = null;

        int step;

        while (index <= bytes.length - 1) {

            System.out.println("while loop, index : " + index);

            b = bytes[index];

            hex = byteToHex(b);

            if (hex == null || hex.length() < 2) {

                System.out.println("fail to check whether contains 4 bytes char(1 byte hex char too short), default return false.");

                // todo, throw exception for this case

                return false;

            }

            firstChar = hex.substring(0, 1);

            if (firstChar.equals("f")) {

                return true;

            }

            if (byteMap.get(firstChar) == null) {

                System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");

                // todo, throw exception for this case

                return false;

            }

            step = byteMap.get(firstChar);

            System.out.println("while loop, index : " + index + ", step : " + step);

            index = index + step;

        }

        return false;

    }

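    /**
     * Removes the characters that are encoded as 4 bytes in UTF-8.
     * @param s the string to filter
     * @return the UTF-8 bytes of the string with all 4-byte characters removed
     */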

    public static byte[] remove4BytesUTF8Char(String s) {

        byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);

        byte[] removedBytes = new byte[bytes.length];

        int index = 0;

        String hex = null;

        String firstChar = null;

        for (int i = 0; i < bytes.length; ) {

            hex = UTF8Utils.byteToHex(bytes[i]);

            if (hex == null || hex.length() < 2) {

                System.out.println("fail to remove 4 bytes char(1 byte hex char too short), return null.");

                // todo, throw exception for this case

                return null;

            }

            firstChar = hex.substring(0, 1);

            if (byteMap.get(firstChar) == null) {
                System.out.println("fail to remove 4 bytes char(no firstchar mapping), return null.");
                // todo, throw exception for this case
                return null;
            }

            if (firstChar.equals("f")) {
                // lead byte starts with hex digit f: skip the whole 4-byte sequence without copying it
                i += byteMap.get(firstChar);
                continue;
            }

            for (int j = 0; j < byteMap.get(firstChar); j++) {

                removedBytes[index++] = bytes[i++];

            }

        }

        return Arrays.copyOfRange(removedBytes, 0, index);

    }

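    /**
     * Converts the string's UTF-8 bytes to hex and formats the result
     * with a space after the hex sequence of each character.
     * @param s the string to format
     * @return the space-separated hex string
     */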

    public static String splitForReading(String s) {

        if (s == null || s.trim().length() == 0) {

            return "";

        }

        String hex = UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));

        System.out.println("full hex : " + hex);

        if (hex == null || hex.length() == 0) {

            System.out.println("fail to translate the bytes to hex.");

            // todo, throw exception for this case

            return "";

        }

        StringBuilder sb = new StringBuilder();

        int index = 0;

        String firstChar = null;

        String splittedString = null;

        while (index < hex.length()) {

            firstChar = hex.substring(index, index + 1);

            if (hexMap.get(firstChar) == null) {

                System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");

                // todo, throw exception for this case

                return "";

            }

            splittedString = hex.substring(index, index + hexMap.get(firstChar));

            sb.append(splittedString).append(" ");

            index = index + hexMap.get(firstChar);

        }

        System.out.println("formatted sb : " + sb);

        return sb.toString();

    }

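    /**
     * Converts a byte array to a hexadecimal string.
     * @param bytes the byte array
     * @return the hexadecimal string, or null if the array is null or empty
     */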

    public static String bytesToHex(byte[] bytes) {

        if (bytes == null || bytes.length == 0) {

            return null;

        }

        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < bytes.length; i++) {

            int r = bytes[i] & 0xFF;

            String hexResult = Integer.toHexString(r);

            if (hexResult.length() < 2) {

                sb.append(0); // pad with a leading 0

            }

            sb.append(hexResult);

        }

        return sb.toString();

    }

    /**
     * Converts a single byte to a two-digit hexadecimal string.
     * @param b the byte
     * @return the hexadecimal string
     */

    public static String byteToHex(byte b) {

        int r = b & 0xFF;

        String hexResult = Integer.toHexString(r);

        StringBuilder sb = new StringBuilder();

        if (hexResult.length() < 2) {

            sb.append(0); // pad with a leading 0

        }

        sb.append(hexResult);

        return sb.toString();

    }

}
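The lookup tables in UTF8Utils key on the first hex digit of each lead byte. The same UTF-8 rule can also be read directly from the high bits of the lead byte, which avoids the round trip through hex strings. The following is only a sketch under that reading of the encoding rules; the class Utf8LeadByte and its method names are hypothetical and not part of the code above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class Utf8LeadByte {

    /** Returns the sequence length announced by a UTF-8 lead byte, or -1 if b cannot start a sequence. */
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx (continuation byte) or invalid
    }

    /** Encodes s as UTF-8, copies every sequence shorter than 4 bytes, and decodes the result. */
    static String remove4ByteChars(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int i = 0;
        while (i < in.length) {
            int len = sequenceLength(in[i]);
            if (len < 0 || i + len > in.length) {
                throw new IllegalArgumentException("malformed UTF-8 at byte " + i);
            }
            if (len < 4) {
                out.write(in, i, len); // keep 1- to 3-byte sequences
            }
            i += len;                  // always advance past the whole sequence
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 written as a surrogate pair, followed by a 3-byte Chinese character
        System.out.println(remove4ByteChars("a\uD83D\uDE00\u6211"));
    }
}
```

Throwing on a malformed lead byte instead of returning null is a design choice of this sketch; it corresponds to the "todo, throw exception" notes in the original methods.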

Next, let's take a quick look at the UTF-8 encoding of a few kinds of characters:

package com.nicchagil.tc.emojifilter;

public class UTF8HexTester {

    public static void main(String[] args) {

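        // Encode explicitly as UTF-8 so the output does not depend on the platform default charset
        String s = "1";
        System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8)));

        s = "a";
        System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8)));

        s = "我";
        System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8)));

        s = "我很帅";
        System.out.println("the hex of “" + s + "” : " + UTF8Utils.bytesToHex(s.getBytes(java.nio.charset.StandardCharsets.UTF_8)));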

    }

}

Output:

the hex of “1” : 31

the hex of “a” : 61

the hex of “我” : e68891

the hex of “我很帅” : e68891e5be88e5b885
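The samples above stop at 3 bytes per character. As a small sketch of the 4-byte case, the snippet below builds U+1F600 (an emoji) from its code point, so the source file needs no emoji literal, and prints its UTF-8 hex together with its length in Java chars and in code points. The class name Utf8HexDemo and the helper hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDemo {

    /** Lower-case hex of the string's UTF-8 bytes, two digits per byte. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600

        System.out.println(hex(emoji));                               // f09f9880
        System.out.println(emoji.length());                           // 2 (a surrogate pair)
        System.out.println(emoji.codePointCount(0, emoji.length()));  // 1
    }
}
```

The leading f0 is what the byteMap and hexMap tables in UTF8Utils recognise as the start of a 4-byte sequence.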

> Building a test channel

Emoji are hard to type on a PC, and the best input route is still a phone, so let's put together a simple web application to receive emoji.

<!DOCTYPE html>

<html>

<head>

<meta charset="UTF-8">

<title>Emoji</title>

</head>

<script type="text/javascript" src="https://code.jquery.com/jquery-1.12.3.min.js"></script>

<body>

<form id="myform" action="http://192.168.1.3:8080/emoji/EmojiFilterServlet" >

    Input parameter :

    <input type='text' name='msg' />

    <br/>

    <input type='button' value=' ajax submit ' onclick="save();" />

    <input type='submit' value=' form submit ' />

</form>

</body>

<script type="text/javascript">

function save() {

    // alert('start save...');

    var data = $('#myform').serialize();

    // alert(data);

    $.ajax({

        type : "POST",

        url : "http://192.168.1.3:8080/emoji/EmojiFilterServlet",

        data : data,

        success : function(d) {

            alert(d);

        }

    });

}

</script>

</html>

package com.nicchagil.tc.emojifilter;

import java.io.IOException;

import java.nio.charset.Charset;

import javax.servlet.ServletException;

import javax.servlet.http.HttpServlet;

import javax.servlet.http.HttpServletRequest;

import javax.servlet.http.HttpServletResponse;

/**

 * Servlet implementation class EmojiFilterServlet

 */

public class EmojiFilterServlet extends HttpServlet {

    private static final long serialVersionUID = 1L;

    /**

     * @see HttpServlet#HttpServlet()

     */

    public EmojiFilterServlet() {

        super();

        // TODO Auto-generated constructor stub

    }

    /**

     * @see HttpServlet#doGet(HttpServletRequest request, HttpServletResponse response)

     */

    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

        this.doPost(request, response);

    }

    /**

     * @see HttpServlet#doPost(HttpServletRequest request, HttpServletResponse response)

     */

    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

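        // Assumption: the form data is UTF-8. Without this call some containers decode the POST body with another default charset.
        request.setCharacterEncoding("UTF-8");

        String msg = request.getParameter("msg");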

        System.out.println("msg -> " + msg);

    }

}
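To reason about what the servlet receives without a phone at hand, it can help to decode a form value by hand. The sketch below assumes the page submits UTF-8 form data, in which case an emoji such as U+1F600 travels as the percent-encoded bytes %F0%9F%98%80 and decodes to a surrogate pair on the Java side. The class FormDecodeDemo and its method are hypothetical helpers, not part of the servlet above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodeDemo {

    /** Decodes an application/x-www-form-urlencoded value as UTF-8. */
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String msg = decodeUtf8("hi%F0%9F%98%80");

        // 4 chars: 'h', 'i', and the two surrogates of U+1F600
        System.out.println(msg.length());
        // Code point of the third character, printed in hex
        System.out.println(Integer.toHexString(msg.codePointAt(2)));
    }
}
```

If the decoded msg parameter in the servlet does not contain the expected surrogate pair, the request was probably decoded with a charset other than UTF-8 somewhere along the way.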
