java7版本中可以这样写:

source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");

java6和java7版本中可以这样写:

source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

Behind the scene

Assuming that you are running your regex on Oracle's implementation, your regex

"([\ud800-\udbff\udc00-\udfff])"

is compiled as such:

StartS. Start unanchored match (minLength=1)

java.util.regex.Pattern$GroupHead

Pattern.union. A ∪ B:

  Pattern.union. A ∪ B:

    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.

    BitClass. Match any of these 1 character(s):

      [U+002D]

  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF

java.util.regex.Pattern$GroupTail

java.util.regex.Pattern$LastNode

Node. Accept match

The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pairs, it represent the code point U+10FC00.

Wrong solution

There is no point in writing:

"[\ud800-\udbff][\udc00-\udfff]"

Since Oracle's implementation matches by code point, and valid surrogate pairs will be converted to code point before matching, the regex above can't match anything, since it is searching for 2 consecutive lone surrogate which can form a valid pair.

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

This solution has been tested to work in Java 6 and 7 (Oracle implementation).

The regex above compiles to:

StartS. Start unanchored match (minLength=1)

Pattern.union. A ∪ B:

  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.

  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.

java.util.regex.Pattern$LastNode

Node. Accept match

Note that I am specifying the characters with string literal Unicode escape sequence, and not the escape sequence in regex syntax.

// Only works in Java 7

input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")

Java 6 doesn't recognize surrogate pairs when it is specified with regex syntax, so the regex recognize \\ud800 as one character and tries to compile the range \\udc00-\\udbff where it fails. We are lucky that it throws an Exception for this input; otherwise, the error will go undetected. Java 7 parses this regex correctly and compiles to the same structure as above.

From Java 7 and above, the syntax \x{h..h} has been added to support specifying characters beyond BMP (Basic Multilingual Plane) and it is the recommended method to specify characters in astral planes.

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

This regex also compiles to the same structure as above.

本文转自:http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus

java过滤四字节和六字节特殊字符的更多相关文章

Java中char占用几个字节
在讨论这个问题之前,我们需要先区分unicode和UTF. unicode :统一的字符编号,仅仅提供字符与编号间映射.符号数量在不断增加,已超百万.详细:[https://zh.wikipedia. ...
《深入理解Java虚拟机》学习笔记之字节码执行引擎
Java虚拟机的执行引擎不管是解释执行还是编译执行,根据概念模型都具有统一的外观:输入的是字节码文件,处理过程是字节码解析的等效过程,输出的是执行结果. 运行时栈帧结构栈帧(Stack Frame) ...
Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream)
Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream) 作者:尹正杰版权声明:原创作品,谢绝转载!否则将追究法律责任. 在我们学习字 ...
关于java中char占几个字节，汉字占几个字节
我们平常说,java中char占2个字节,可又说汉字在不通的编码格式中所占的位数是不同的,比如gbk中汉字占2个字节,utf8中多数占3个字节,少数占4个.而所有汉字在java程序中我们都可以简单的用 ...
java动态代理——字段和方法字节码的基础结构及Proxy源码分析三
前文地址:https://www.cnblogs.com/tera/p/13280547.html 本系列文章主要是博主在学习spring aop的过程中了解到其使用了java动态代理,本着究根问底的 ...
工作三年终于社招进字节跳动！字节跳动，阿里，腾讯Java岗面试经验汇总
前言我大概我是从去年12月份开始看书学习,到今年的6月份,一直学到看大家的面经基本上百分之90以上都会,我就在5月份开始投简历,边面试边补充基础知识等.也是有些辛苦.终于是在前不久拿到了字节跳动的o ...
Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式
解析:Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式.面向字节的操作为以8位为单位对二进制的数据进行操作,对数据不进行转换,这些类都是InputStream和Out ...
“全栈2019”Java第四十六章：继承与字段
难度初级学习时间 10分钟适合人群零基础开发语言 Java 开发环境 JDK v11 IntelliJ IDEA v2018.3 文章原文链接 "全栈2019"Java第 ...
Java过滤特殊字符
Java正则表达式过滤 1.Java过滤特殊字符的正则表达式----转载 java过滤特殊字符的正则表达式[转载] 2010-08-05 11:06 Java过滤特殊字符的正则表达式关键字: j ...

随机推荐

使用DFS求任意两点的所有路径
先上代码: public static void findAllPaths(Integer nodeId,Integer targetNodeId, Map<Integer,ArrayList& ...
制作推送证书 and Code=3000 "未找到应用程序的“aps-environment”的权利字符串"
制作推送证书 step1. 打开苹果开发者网站 step2. 从Member Center进入Certificates, Identifiers & Profiles step3. 选择要制作 ...
扩展gridview轻松实现冻结行和列
在实际的项目中,由于项目的需要,数据量比较大,同时显示栏位也比较多,要做gridview里显示完整,并做到用户体验比较好,这就需要冻结表头和关键列.由于用到的地方比较多,我们可以护展一个gridvie ...
微信php分享页面自定义标题与内容
1.因为现在分享页面,发给朋友或者朋友圈都没办法自定义标题.图片和内容,所以必须要有微信公众号 2.如果有微信公众号可直接登录,如果没有要注册,注册完或者登录了 3.查看你的权限,左侧最下面开发的接口 ...
OpenWrt 路由器过滤广告的N种方法
路由器已经成为每个家庭不可缺少的角色,手机.电脑.电视,凡是需要互联网的设备都要用到它.那么路由器除了给我们的网络设备分发网络外,还有其他用途吗? 现在很多人家里都用着智能路由器,智能路由器究竟怎么智 ...
前端传递参数,由于控制器层类实现了struts2的ModelDriven而产生的一个异常
产生的异常如下: ognl.MethodFailedException: Method "setId" failed for object com.aliyun.pcitcAliy ...
单元测试 2 & 初识模块3
单元测试 - 创建测试用例单元测试是什么? (老鸟可以无视下面这段话.) hi,新同学们,咱们的PHP代码里满布着好多函数和类,经常互相调用,你改的一个函数/方法可能是"比较底层" ...
centos 7 安装gcc g++
yum install gcc gcc-c++ over. ps:如果系统报yum命令未找到,退出重新登陆试试root.
Netty4 ServerBootstrap 初始化channelFactory ReflectiveChannelFactory
只需要在启动之前传入你需要用的channel类型就可以了. ServerBootstrap初始化channelFactory过程: 最后我们再来看看这个channelFactory的使用场景:
transition状态下Mecanim动画的跳转
来自: http://blog.csdn.net/o_oxo_o/article/details/21325901 Unity中Mecanim里面动画状态的变化,是通过设置参数(Parameter)或 ...

java过滤四字节和六字节特殊字符

Behind the scene

Wrong solution

Solution

java过滤四字节和六字节特殊字符的更多相关文章

随机推荐

热门专题