Greedy, Reluctant, and Possessive Quantifiers in Java Regular Expressions

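In Java regular expressions, quantifiers let you specify how many times something must occur for a match. For convenience, the Pattern API specification describes three kinds of quantifier: greedy, reluctant, and possessive. At first glance, X?, X??, and X?+ appear to do exactly the same thing, since each matches X once or not at all, but there are subtle differences between them.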

Greedy	Reluctant	Possessive	Meaning
`X?`	`X??`	`X?+`	X, once or not at all
`X*`	`X*?`	`X*+`	X, zero or more times
`X+`	`X+?`	`X++`	X, one or more times
`X{n}`	`X{n}?`	`X{n}+`	X, exactly n times
`X{n,}`	`X{n,}?`	`X{n,}+`	X, at least n times
`X{n,m}`	`X{n,m}?`	`X{n,m}+`	X, at least n but not more than m times

Before we start, here is a small test harness that can be run repeatedly:

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

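// Interactive test harness: reads a regex and an input string, then prints every match.
public class RegexDemo {
	public static void main(String[] args) {
		Scanner sc = new Scanner(System.in);
		while (true) {
			System.out.print("\nRegex:");
			if (!sc.hasNextLine()) {
				break; // stop cleanly at end of input instead of throwing
			}
			Pattern pattern = Pattern.compile(sc.nextLine());
			System.out.print("String to search: ");
			if (!sc.hasNextLine()) {
				break;
			}
			Matcher matcher = pattern.matcher(sc.nextLine());
			boolean found = false;
			// Each find() call looks for the next match after the previous one.
			while (matcher.find()) {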
				System.out.println("Found the text \"" + matcher.group()
						+ "\" starting at index " + matcher.start()
						+ " and ending at index " + matcher.end() + ".");
				found = true;
			}
			if (!found) {
				System.out.println("No match found.");
			}
		}
	}
}

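Let's start with the greedy quantifiers by building three regular expressions: the letter "a" followed by ?, *, or +. Here is what happens when these expressions are tested against an empty input string: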

Regex:a?
String to search: 
Found the text "" starting at index 0 and ending at index 0.

Regex:a*
String to search: 
Found the text "" starting at index 0 and ending at index 0.

Regex:a+
String to search: 
No match found.

Zero-Length Matches

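In the example above, the first two matches succeed because the expressions a? and a* both allow zero occurrences of the letter "a". Unlike a typical match, though, the start and end indices are both 0. The empty input string has no length, so the test simply matches nothing at index 0. Matches of this kind are called zero-length matches. A zero-length match can occur in several places: in an empty input string, at the beginning of an input string, after the last character of an input string, or between any two characters of an input string. Zero-length matches are easy to spot because they always start and end at the same index.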
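The zero-length behaviour described above can also be checked without the interactive harness. The following is a minimal sketch; the class `ZeroLengthDemo` and its helper `spans` are hypothetical names introduced here for illustration, not part of the original harness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class ZeroLengthDemo {
    // Returns one "[start,end)" entry per match; a zero-length match has start == end.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add("[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Test each greedy quantifier against the empty string.
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println(regex + " on \"\" -> " + spans(regex, ""));
        }
    }
}
```

For a? and a* the list holds the single span [0,0), while a+ yields an empty list, which mirrors the "No match found." line printed by the harness.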
Let's look at a few more examples of zero-length matches. Change the input string to the single letter "a" and you will notice something interesting:

Regex:a?
String to search: a
Found the text "a" starting at index 0 and ending at index 1.
Found the text "" starting at index 1 and ending at index 1.

Regex:a*
String to search: a
Found the text "a" starting at index 0 and ending at index 1.
Found the text "" starting at index 1 and ending at index 1.

Regex:a+
String to search: a
Found the text "a" starting at index 0 and ending at index 1.

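All three quantifiers found the letter "a", but the first two also found a zero-length match at index 1, that is, after the last character of the input string. Remember that the matcher sees the character "a" as sitting in the cell between index 0 and index 1, and that the test harness loops until it can no longer find a match. Depending on the quantifier used, the presence of "nothing" at the index after the last character may or may not trigger a match.
Now change the input string to five a's in a row, and you get the following: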

Regex:a?
String to search: aaaaa
Found the text "a" starting at index 0 and ending at index 1.
Found the text "a" starting at index 1 and ending at index 2.
Found the text "a" starting at index 2 and ending at index 3.
Found the text "a" starting at index 3 and ending at index 4.
Found the text "a" starting at index 4 and ending at index 5.
Found the text "" starting at index 5 and ending at index 5.

Regex:a*
String to search: aaaaa
Found the text "aaaaa" starting at index 0 and ending at index 5.
Found the text "" starting at index 5 and ending at index 5.

Regex:a+
String to search: aaaaa
Found the text "aaaaa" starting at index 0 and ending at index 5.

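The expression a? finds a separate match for each character, because it matches "a" zero times or once. The expression a* finds two separate matches: first all of the a's, then the zero-length match after the last character, at index 5. Finally, a+ matches all occurrences of the letter "a" and ignores the presence of "nothing" at the last index.
At this point you might wonder what the first two quantifiers do when they encounter a letter other than "a". For example, what happens when they meet the letter "b" in "ababaaaab"?
Let's find out: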

Regex:a?
String to search: ababaaaab
Found the text "a" starting at index 0 and ending at index 1.
Found the text "" starting at index 1 and ending at index 1.
Found the text "a" starting at index 2 and ending at index 3.
Found the text "" starting at index 3 and ending at index 3.
Found the text "a" starting at index 4 and ending at index 5.
Found the text "a" starting at index 5 and ending at index 6.
Found the text "a" starting at index 6 and ending at index 7.
Found the text "a" starting at index 7 and ending at index 8.
Found the text "" starting at index 8 and ending at index 8.
Found the text "" starting at index 9 and ending at index 9.

Regex:a*
String to search: ababaaaab
Found the text "a" starting at index 0 and ending at index 1.
Found the text "" starting at index 1 and ending at index 1.
Found the text "a" starting at index 2 and ending at index 3.
Found the text "" starting at index 3 and ending at index 3.
Found the text "aaaa" starting at index 4 and ending at index 8.
Found the text "" starting at index 8 and ending at index 8.
Found the text "" starting at index 9 and ending at index 9.

Regex:a+
String to search: ababaaaab
Found the text "a" starting at index 0 and ending at index 1.
Found the text "a" starting at index 2 and ending at index 3.
Found the text "aaaa" starting at index 4 and ending at index 8.

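Even though the letter "b" appears in cells 1, 3, and 8, the output reports a zero-length match at those positions. The regular expression a? is not specifically looking for the letter "b"; it is only looking for the presence, or absence, of the letter "a". If the quantifier allows "a" to match zero times, any character in the input that is not an "a" shows up as a zero-length match. The remaining a's are matched according to the rules discussed in the previous examples.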
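To make the point about non-matching characters concrete, the sketch below collects only the positions of zero-length matches. The class `NonMatchingCharsDemo` and the method `zeroLengthStarts` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class NonMatchingCharsDemo {
    // Collects the start index of every zero-length match (start == end).
    static List<Integer> zeroLengthStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        // a? and a* report an empty match at each "b" and at the end of the input.
        System.out.println("a? -> " + zeroLengthStarts("a?", input)); // prints a? -> [1, 3, 8, 9]
        System.out.println("a* -> " + zeroLengthStarts("a*", input)); // prints a* -> [1, 3, 8, 9]
        // a+ requires at least one "a", so it never produces an empty match.
        System.out.println("a+ -> " + zeroLengthStarts("a+", input)); // prints a+ -> []
    }
}
```

Indices 1, 3, and 8 are the cells holding "b", and index 9 is the position after the last character.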
To match a pattern exactly n times, simply put the number inside a pair of braces:

Regex:a{3}
String to search: aa
No match found.

Regex:a{3}
String to search: aaa
Found the text "aaa" starting at index 0 and ending at index 3.

Regex:a{3}
String to search: aaaa
Found the text "aaa" starting at index 0 and ending at index 3.

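Here, the regular expression a{3} looks for three consecutive occurrences of the letter "a". The first test fails because the input string does not contain enough a's to match. The second test's input string contains exactly three a's, which triggers a match. The third test also triggers a match, because the input string begins with exactly three a's; whatever follows is irrelevant to that first match. If the pattern occurs again after that point, it triggers further matches: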

Regex:a{3}
String to search: aaaaaaaaa
Found the text "aaa" starting at index 0 and ending at index 3.
Found the text "aaa" starting at index 3 and ending at index 6.
Found the text "aaa" starting at index 6 and ending at index 9.

To require a pattern to appear at least n times, add a comma (,) after the number:

Regex:a{3,}
String to search: aaaaaaaaa
Found the text "aaaaaaaaa" starting at index 0 and ending at index 9.

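With the same input string, this test finds only one match, because the nine a's in a row satisfy the requirement of "at least" three a's. Finally, to specify an upper limit on the number of occurrences, add a second number inside the braces: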

Regex:a{3,6}
String to search: aaaaaaaaa
Found the text "aaaaaa" starting at index 0 and ending at index 6.
Found the text "aaa" starting at index 6 and ending at index 9.

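Here, the first match is forced to stop at the upper limit of 6 characters. The second match contains the remaining three a's, which happens to be the minimum number of characters allowed for a match. If the input string were one letter shorter, there would be no second match, because only two a's would remain.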
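The three counted forms discussed above can be compared side by side. This is a small sketch; `BoundedQuantifierDemo` and `matches` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class BoundedQuantifierDemo {
    // Returns the text of every match found by repeated calls to find().
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa"; // nine a's, as in the transcripts above
        System.out.println(matches("a{3}", nine));   // prints [aaa, aaa, aaa]
        System.out.println(matches("a{3,}", nine));  // prints [aaaaaaaaa]
        System.out.println(matches("a{3,6}", nine)); // prints [aaaaaa, aaa]
        // With eight a's only two remain after the first match, so there is no second match.
        System.out.println(matches("a{3,6}", "aaaaaaaa")); // prints [aaaaaa]
    }
}
```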

Capturing Groups and Character Classes with Quantifiers

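So far we have only applied quantifiers to a single character. In fact, a quantifier attaches to just one character at a time, so the regular expression abc+ means "a, followed by b, followed by c one or more times". It does not mean "abc" one or more times. However, quantifiers can also be attached to character classes and capturing groups: [abc]+ means a or b or c, one or more times, and (abc)+ means the group "abc", one or more times.
To illustrate, let's require the group (dog) three times in a row.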

Regex:(dog){3}
String to search: dogdogdogdogdogdog
Found the text "dogdogdog" starting at index 0 and ending at index 9.
Found the text "dogdogdog" starting at index 9 and ending at index 18.

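The example above finds two matches, because the quantifier applies to the entire capturing group. Remove the parentheses, however, and the regex becomes dog{3}: the quantifier {3} now applies only to the letter "g", so the pattern requires "doggg" and the match fails on this input. Similarly, a quantifier can be applied to an entire character class: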

Regex:[abc]{3}
String to search: abccabaaaccbbbc
Found the text "abc" starting at index 0 and ending at index 3.
Found the text "cab" starting at index 3 and ending at index 6.
Found the text "aaa" starting at index 6 and ending at index 9.
Found the text "ccb" starting at index 9 and ending at index 12.
Found the text "bbc" starting at index 12 and ending at index 15.
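The difference between quantifying a group, a character class, and a single trailing character can be checked by counting matches. The sketch below uses a hypothetical class `GroupQuantifierDemo` with a hypothetical helper `countMatches`; the input "doggg" is an extra string added here for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class GroupQuantifierDemo {
    // Counts how many times find() succeeds on the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String dogs = "dogdogdogdogdogdog";
        // {3} applies to the whole group "dog".
        System.out.println(countMatches("(dog){3}", dogs)); // prints 2
        // {3} applies only to "g", so the pattern needs "doggg".
        System.out.println(countMatches("dog{3}", dogs));    // prints 0
        System.out.println(countMatches("dog{3}", "doggg")); // prints 1
        // {3} applies to the whole character class.
        System.out.println(countMatches("[abc]{3}", "abccabaaaccbbbc")); // prints 5
    }
}
```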

Differences Among Greedy, Reluctant, and Possessive Quantifiers

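There are subtle differences among greedy, reluctant, and possessive quantifiers.
Greedy quantifiers are called "greedy" because they force the matcher to read in (or "eat") as much of the input as possible before attempting the first match; with a pattern such as .* that is the entire input string. If this first attempt fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it tries to match against is 1 or 0 characters.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and reluctantly eat one character at a time while looking for a match. The last thing they try is the entire input string.
Finally, possessive quantifiers always eat as much as they can, the entire input string in the case of .*+, and try once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
To illustrate, consider the input string xfooxxxxxxfoo.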

Regex:.*foo
String to search: xfooxxxxxxfoo
Found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Regex:.*?foo
String to search: xfooxxxxxxfoo
Found the text "xfoo" starting at index 0 and ending at index 4.
Found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Regex:.*+foo
String to search: xfooxxxxxxfoo
No match found.

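The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search ends.
The second example is reluctant, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, the matcher is forced to swallow the first letter (an "x"), which triggers the first match, from index 0 to index 4. The test harness continues until the input string is exhausted and finds another match from index 4 to index 13.
The third example fails to find a match because its quantifier is possessive. In this case the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at the end of the expression.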
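The three results above can be reproduced in one short program. This is a minimal sketch; `QuantifierModesDemo` and `matches` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class QuantifierModesDemo {
    // Returns the text of every match found by repeated calls to find().
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: eats everything, then backs off until the last "foo" fits.
        System.out.println(matches(".*foo", input));  // prints [xfooxxxxxxfoo]
        // Reluctant: grows one character at a time, so it stops at the first "foo".
        System.out.println(matches(".*?foo", input)); // prints [xfoo, xxxxxxfoo]
        // Possessive: eats everything and never gives anything back.
        System.out.println(matches(".*+foo", input)); // prints []
    }
}
```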
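Use a possessive quantifier when you want to seize all of something without ever backing off. In cases where a match is not found immediately, it will outperform the equivalent greedy quantifier.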
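As an illustration of "seizing everything without backing off", one possible approach (an assumption made here, not something the text above prescribes) is to quantify a character class that cannot match the text that follows it. The pattern [^f]*+foo and the class name `PossessiveUseDemo` are hypothetical examples chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class PossessiveUseDemo {
    // Returns the text of every match found by repeated calls to find().
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // [^f]*+ seizes every non-"f" character and never backs off.
        // Because it cannot consume the "f" of "foo", no backtracking is needed to match.
        System.out.println(matches("[^f]*+foo", "xfooxxxxxxfoo")); // prints [xfoo, xxxxxxfoo]
        // When no "foo" exists, nothing the quantifier consumed is ever given back.
        System.out.println(matches("[^f]*+foo", "xxxxxx"));        // prints []
    }
}
```

Here the possessive quantifier finds the same matches a greedy [^f]*foo would, since the character class stops by itself in front of each "f".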
