java过滤四字节和六字节特殊字符
java7版本中可以这样写:
source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");
java6和java7版本中可以这样写:
source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");
Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.
This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.
Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.
Behind the scene
Assuming that you are running your regex on Oracle's implementation, your regex
"([\ud800-\udbff\udc00-\udfff])"
is compiled as such:
StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
Pattern.union. A ∪ B:
Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
BitClass. Match any of these 1 character(s):
[U+002D]
SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match
The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pairs, it represent the code point U+10FC00.
Wrong solution
There is no point in writing:
"[\ud800-\udbff][\udc00-\udfff]"
Since Oracle's implementation matches by code point, and valid surrogate pairs will be converted to code point before matching, the regex above can't match anything, since it is searching for 2 consecutive lone surrogate which can form a valid pair.
Solution
If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:
input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
This solution has been tested to work in Java 6 and 7 (Oracle implementation).
The regex above compiles to:
StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
Note that I am specifying the characters with string literal Unicode escape sequence, and not the escape sequence in regex syntax.
// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")
Java 6 doesn't recognize surrogate pairs when it is specified with regex syntax, so the regex recognize \\ud800 as one character and tries to compile the range \\udc00-\\udbff where it fails. We are lucky that it throws an Exception for this input; otherwise, the error will go undetected. Java 7 parses this regex correctly and compiles to the same structure as above.
From Java 7 and above, the syntax \x{h..h} has been added to support specifying characters beyond BMP (Basic Multilingual Plane) and it is the recommended method to specify characters in astral planes.
input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
This regex also compiles to the same structure as above.
本文转自:http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus
java过滤四字节和六字节特殊字符的更多相关文章
- Java中char占用几个字节
在讨论这个问题之前,我们需要先区分unicode和UTF. unicode :统一的字符编号,仅仅提供字符与编号间映射.符号数量在不断增加,已超百万.详细:[https://zh.wikipedia. ...
- 《深入理解Java虚拟机》学习笔记之字节码执行引擎
Java虚拟机的执行引擎不管是解释执行还是编译执行,根据概念模型都具有统一的外观:输入的是字节码文件,处理过程是字节码解析的等效过程,输出的是执行结果. 运行时栈帧结构 栈帧(Stack Frame) ...
- Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream)
Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 在我们学习字 ...
- 关于java中char占几个字节,汉字占几个字节
我们平常说,java中char占2个字节,可又说汉字在不通的编码格式中所占的位数是不同的,比如gbk中汉字占2个字节,utf8中多数占3个字节,少数占4个.而所有汉字在java程序中我们都可以简单的用 ...
- java动态代理——字段和方法字节码的基础结构及Proxy源码分析三
前文地址:https://www.cnblogs.com/tera/p/13280547.html 本系列文章主要是博主在学习spring aop的过程中了解到其使用了java动态代理,本着究根问底的 ...
- 工作三年终于社招进字节跳动!字节跳动,阿里,腾讯Java岗面试经验汇总
前言 我大概我是从去年12月份开始看书学习,到今年的6月份,一直学到看大家的面经基本上百分之90以上都会,我就在5月份开始投简历,边面试边补充基础知识等.也是有些辛苦.终于是在前不久拿到了字节跳动的o ...
- Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式
解析:Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式.面向字节的操作为以8位为单位对二进制的数据进行操作,对数据不进行转换,这些类都是InputStream和Out ...
- “全栈2019”Java第四十六章:继承与字段
难度 初级 学习时间 10分钟 适合人群 零基础 开发语言 Java 开发环境 JDK v11 IntelliJ IDEA v2018.3 文章原文链接 "全栈2019"Java第 ...
- Java过滤特殊字符
Java正则表达式过滤 1.Java过滤特殊字符的正则表达式----转载 java过滤特殊字符的正则表达式[转载] 2010-08-05 11:06 Java过滤特殊字符的正则表达式 关键字: j ...
随机推荐
- 手把手教你使用FineUI开发一个b/s结构的取送货管理信息系统系列博文索引
近阶段接到一些b/s类型的软件项目,但是团队成员之前大部分没有这方面的开发经验,于是自己选择了一套目前网上比较容易上手的开发框架(FineUI),计划录制一套视频讲座,来讲解如何利用FineUI快速开 ...
- Kubernetes DNS的配置
Kubernetes集群机制通过DNS进行服务名和ip的映射,如果没有配置dns,你可以通过下面命令查询到集群ip kubectl get svc --namespace=kube-system 得到 ...
- ubuntu16.04画图软件kolourpaint
1.安装kolourpaint sudo apt-get install kolourpaint4 -y 在里面搜索“kolourpaint”这个软件名.
- 小心!Ubuntu14.04 升级到16.04 的几个坑
收录待用,修改转载已取得腾讯云授权 昨天趁着周末把服务器升级了一把,遇到的坑可不少: sudo apt update sudo apt dist-upgrade 坑1:升级失败后,改用下面命令: su ...
- What is Continuous Integration?
什么叫持续集成? 原文: https://docs.microsoft.com/en-us/azure/devops/what-is-continuous-integration ---------- ...
- windowsclient开发--依据可下载url另存为文件(微信windowsclient这样做的)
能够我的blog的标题会让你误解,那么好,没图说了xx: 比方微信windowsclient发送了一张图片,我们能够预览这张图片,还能够保存到本地: 那么windows程序是怎样下载这张图片的呢? 是 ...
- Node.js meitulu图片批量下载爬虫1.06版
//====================================================== // https://www.meitulu.com图片批量下载Node.js爬虫1. ...
- Android中Service概述
Service是Android中一种非常重要的组件,一般来说有两种用途:用Service执行长期执行的操作,而且与用户没有UI界面的交互:某个应用程序的Service能够被其它应用程序的组件调用以便提 ...
- selenium 问题:Unable to connect to host 127.0.0.1 on port 7055 after 45000 ms
问题:Unable to connect to host 127.0.0.1 on port 7055 after 45000 ms 原因: selenium-server-standalone-x. ...
- IOS AppUI规格指南