java7版本中可以这样写:

source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");

java6和java7版本中可以这样写:

source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

Behind the scene

Assuming that you are running your regex on Oracle's implementation, your regex

"([\ud800-\udbff\udc00-\udfff])"

is compiled as such:

StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
Pattern.union. A ∪ B:
Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
BitClass. Match any of these 1 character(s):
[U+002D]
SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match

The character class is parsed as \ud800-\udbff\udc00-\udfff. Since \udbff\udc00 forms a valid surrogate pairs, it represent the code point U+10FC00.

Wrong solution

There is no point in writing:

"[\ud800-\udbff][\udc00-\udfff]"

Since Oracle's implementation matches by code point, and valid surrogate pairs will be converted to code point before matching, the regex above can't match anything, since it is searching for 2 consecutive lone surrogate which can form a valid pair.

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

This solution has been tested to work in Java 6 and 7 (Oracle implementation).

The regex above compiles to:

StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match

Note that I am specifying the characters with string literal Unicode escape sequence, and not the escape sequence in regex syntax.

// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")

Java 6 doesn't recognize surrogate pairs when it is specified with regex syntax, so the regex recognize \\ud800 as one character and tries to compile the range \\udc00-\\udbff where it fails. We are lucky that it throws an Exception for this input; otherwise, the error will go undetected. Java 7 parses this regex correctly and compiles to the same structure as above.


From Java 7 and above, the syntax \x{h..h} has been added to support specifying characters beyond BMP (Basic Multilingual Plane) and it is the recommended method to specify characters in astral planes.

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

This regex also compiles to the same structure as above.

本文转自:http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus

java过滤四字节和六字节特殊字符的更多相关文章

  1. Java中char占用几个字节

    在讨论这个问题之前,我们需要先区分unicode和UTF. unicode :统一的字符编号,仅仅提供字符与编号间映射.符号数量在不断增加,已超百万.详细:[https://zh.wikipedia. ...

  2. 《深入理解Java虚拟机》学习笔记之字节码执行引擎

    Java虚拟机的执行引擎不管是解释执行还是编译执行,根据概念模型都具有统一的外观:输入的是字节码文件,处理过程是字节码解析的等效过程,输出的是执行结果. 运行时栈帧结构 栈帧(Stack Frame) ...

  3. Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream)

    Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 在我们学习字 ...

  4. 关于java中char占几个字节,汉字占几个字节

    我们平常说,java中char占2个字节,可又说汉字在不通的编码格式中所占的位数是不同的,比如gbk中汉字占2个字节,utf8中多数占3个字节,少数占4个.而所有汉字在java程序中我们都可以简单的用 ...

  5. java动态代理——字段和方法字节码的基础结构及Proxy源码分析三

    前文地址:https://www.cnblogs.com/tera/p/13280547.html 本系列文章主要是博主在学习spring aop的过程中了解到其使用了java动态代理,本着究根问底的 ...

  6. 工作三年终于社招进字节跳动!字节跳动,阿里,腾讯Java岗面试经验汇总

    前言 我大概我是从去年12月份开始看书学习,到今年的6月份,一直学到看大家的面经基本上百分之90以上都会,我就在5月份开始投简历,边面试边补充基础知识等.也是有些辛苦.终于是在前不久拿到了字节跳动的o ...

  7. Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式

    解析:Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式.面向字节的操作为以8位为单位对二进制的数据进行操作,对数据不进行转换,这些类都是InputStream和Out ...

  8. “全栈2019”Java第四十六章:继承与字段

    难度 初级 学习时间 10分钟 适合人群 零基础 开发语言 Java 开发环境 JDK v11 IntelliJ IDEA v2018.3 文章原文链接 "全栈2019"Java第 ...

  9. Java过滤特殊字符

    Java正则表达式过滤 1.Java过滤特殊字符的正则表达式----转载 java过滤特殊字符的正则表达式[转载] 2010-08-05 11:06 Java过滤特殊字符的正则表达式   关键字: j ...

随机推荐

  1. [CF735E/736C]Ostap and Tree

    题目大意: 一个$n(n\le100)$个点的树,将一些点染成黑点,求满足每个点到最近黑点的距离$\le k(k\le\min(20,n-1))$的方案数. 思路: 树形DP. 用$f[i][j]$表 ...

  2. [Interview]读懂面试问题,在面试官面前变被动为主动

    面试是供需双方心理的较量,作为求职者来说,了解对方问题的内涵,做到“明明白白他的心”,就能变被动为主动.因此,读懂面试问题,掌握面试考官的提问的目的,有准备.有针对性地回答,对提高应聘的成功率是有很大 ...

  3. HTML5 boilerplate 笔记(转)

    最近在研究HTML5 boilerplate的模版,以此为线索可以有条理地学习一些前端的best practice,好过在W3C的文档汪洋里大海捞针……啊哈哈哈…… 开头的IE探测与no-js类是什么 ...

  4. SQLSERVER WINDBG调试:mssqlwiki.com

    https://mssqlwiki.com/2012/10/16/sql-server-exception_access_violation-and-sql-server-assertion/ SQL ...

  5. faststone 注册码

    用户名:c1ikm密码:AXMQX-RMMMJ-DBHHF-WIHTV 或 AXOQS-RRMGS-ODAQO-APHUU

  6. mysql 中查询一个字段是否为null的sql

    查询mysql数据库表中字段为null的记录: select * 表名 where 字段名 is null 查询mysql数据库表中字段不为null的记录: select * 表名 where 字段名 ...

  7. ylbtech-LanguageSamples-Events(事件)

    ylbtech-Microsoft-CSharpSamples:ylbtech-LanguageSamples-Events(事件) 1.A,示例(Sample) 返回顶部 “事件”示例 本示例演示如 ...

  8. Java笔记5:单例模式

    一.应用杨景 在计算机系统中,线程池.缓存.日志对象.对话框.打印机.显卡的驱动程序对象常被设计成单例.这些应用都或多或少具有资源管理器的功能.每台计算机可以有若干个打印机,但只能有一个Printer ...

  9. 【Python3 爬虫】15_Fiddler抓包分析

    我们要抓取一些网页源码看不到的信息,例如:淘宝的评论等 我们可以使用工具Fiddler进行抓取 软件下载地址:https://pan.baidu.com/s/1nPKPwrdfXM62LlTZsoiD ...

  10. 《memcached全面剖析》

    第1章 memcached的基础 1.1 memcached是什么? memcached是高性能的分布式内存缓存服务器. 一般的做法是,通过缓存数据库查询结果,减少数据库访问次数,以提高动态web应用 ...