Finding Comments in Source Code Using Regular Expressions
Many text editors have advanced find (and replace) features. When I’m programming, I like to use an editor with regular expression search and replace. This feature allows one to find text based on complex patterns rather than just on literals. On occasion I want to examine each of the comments in my source code and either edit them or remove them. I found that it was difficult to write a regular expression that would find C style comments (the comments that start with /* and end with */) because my text editor does not implement the “non-greedy matching” feature of regular expressions.
First Try
When first attempting this problem, most people consider the regular expression:
/\*.*\*/
This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the comment.
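The first problem with this approach is that .* does not match new lines, so a comment that spans more than one line is not found. In this example only the second comment matches: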
/* First comment
first comment—line two*/
/* Second comment */
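To see this failure outside an editor, here is a minimal sketch using Python's re module. Python is my choice for illustration only, and the variable names are hypothetical. As in most editors, . in Python does not match a line break by default.

```python
import re

# The sample text from above: one two-line comment, one single-line comment.
source = "/* First comment\nfirst comment, line two*/\n/* Second comment */"

first_try = re.compile(r"/\*.*\*/")

# "." stops at line breaks, so the two-line comment is never matched.
matches = first_try.findall(source)
print(matches)  # ['/* Second comment */']
```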
Second Try
This can be overcome easily by replacing the . with [^] (in some regular expression packages) or more generally with (.|[\r\n]):
/\*(.|[\r\n])*\*/
This reveals a second, more serious problem: the expression matches too much. Regular expressions are greedy; they take in as much as they can. Consider the case in which your file has two comments. This regular expression will match them both, along with anything in between:
start_code();
/* First comment */
more_code();
/* Second comment */
end_code();
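A small sketch of the greedy behaviour, assuming Python's re module as a stand-in for the editor (the names are hypothetical). search is used rather than findall because the pattern contains a capturing group.

```python
import re

source = (
    "start_code();\n"
    "/* First comment */\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();"
)

second_try = re.compile(r"/\*(.|[\r\n])*\*/")

# The greedy * runs to the end of the text and backs up to the LAST "*/",
# so a single match covers both comments and the code between them.
match = second_try.search(source)
print(match.group(0))
```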
Third Try
To fix this, the regular expression must accept less. Instead of accepting just any character with a . we need to limit the characters that can appear inside the comment:
/\*([^*]|[\r\n])*\*/
This simplistic approach doesn’t accept any comments with a * in them:
/*
* Common multi-line comment style.
*/
/* Second comment */
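The sketch below, again assuming Python's re module with hypothetical names, shows the comment with an interior * being skipped.

```python
import re

source = (
    "/*\n"
    " * Common multi-line comment style.\n"
    " */\n"
    "/* Second comment */"
)

third_try = re.compile(r"/\*([^*]|[\r\n])*\*/")

# The first comment has a "*" in its body, which [^*] refuses to cross,
# so only the second comment is found.
found = [m.group(0) for m in third_try.finditer(source)]
print(found)  # ['/* Second comment */']
```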
Fourth Try
This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * together with the character that follows it, provided that character isn’t a /:
/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/
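This works better but again accepts too much in some cases. It will accept any even number of *. Because each * also consumes the character after it, a run of * is taken in pairs, and the * that is supposed to end the comment can be swallowed as the second half of a pair. In this example the match runs from the start of the first comment to the end of the second: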
start_code();
/****
* Common multi-line comment style.
****/
more_code();
/*
* Another common multi-line comment style.
*/
end_code();
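Here is a hedged sketch of that overrun using Python's re module (an assumption for illustration; the names are hypothetical). The four * before the first closing / are consumed two at a time, so the first comment never ends.

```python
import re

source = (
    "start_code();\n"
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "more_code();\n"
    "/*\n"
    " * Another common multi-line comment style.\n"
    " */\n"
    "end_code();"
)

fourth_try = re.compile(r"/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/")

# One match swallows both comments and the code between them.
match = fourth_try.search(source)
print(match.group(0))
```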
Fifth Try
What we tried before will work if we accept any number of * followed by anything other than a * or a /:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/
Now the regular expression does not accept enough again. It’s working better than ever, but it still misses one case: it does not accept comments that end in multiple *:
/****
* Common multi-line comment style.
****/
/****
* Another common multi-line comment style.
*/
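A minimal check of this case, sketched with Python's re module (my assumption; the names are hypothetical). The comment that closes with ****/ is missed, while the one that closes with a single */ is found.

```python
import re

source = (
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "/****\n"
    " * Another common multi-line comment style.\n"
    " */"
)

fifth_try = re.compile(r"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/")

# Only the second comment ends in exactly one "*" before the "/".
found = [m.group(0) for m in fifth_try.finditer(source)]
print(found)
```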
Solution
Now we just need to modify the comment end to allow any number of *:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
We now have a regular expression that we can paste into text editors that support regular expressions. Finding our comments is a matter of pressing the find button. You might be able to simplify this expression somewhat for your particular editor. For example, in some regular expression implementations, [^] already includes [\r\n], so every [\r\n] can be removed from the expression.
This is easy to augment so that it will also find // style comments:
(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
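A quick check of the finished expression, sketched with Python's re module (an assumption for illustration; the sample text and names are hypothetical). finditer with group(0) is used so that the whole match is returned rather than the inner groups.

```python
import re

source = (
    "start_code(); // line comment\n"
    "/****\n"
    " * Block comment.\n"
    " ****/\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();"
)

solution = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

# Each comment is matched separately; the code between them is left alone.
found = [m.group(0) for m in solution.finditer(source)]
for comment in found:
    print(repr(comment))
```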
Tool | Expression and Usage | Notes |
---|---|---|
nedit | (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*) Usage: Ctrl+F to find, put in expression, check the Regular Expression check box. | [^] does not include new line. |
grep | (/\*([^*]|(\*+[^*/]))*\*+/)|(//.*) Usage: grep -E "(/\*([^*]|(\*+[^*/]))*\*+/)|(//.*)" <files> | Does not support multi-line comments; prints each line that completely contains a comment. |
perl | /((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/ Usage: perl -e "$/=undef;print<>=~/((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/g;" < <file> | Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. Each / must be escaped because it delimits the expression. $/=undef; is used so that the file is not matched line by line like grep. |
Java | "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)" Usage: System.out.println(sourcecode.replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)","")); | Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String. |
An Easier Method
Non-greedy Matching
Most regular expression packages support non-greedy matching. A non-greedy quantifier matches as little as possible: it takes another character only when the rest of the pattern cannot match otherwise. We can modify our second try to use the non-greedy matcher *? instead of the greedy matcher *. With this new tool, the middle of our comment stops at the first place where the end of the comment matches:
/\*(.|[\r\n])*?\*/
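As a rough illustration, here is the non-greedy expression run through Python's re module (my assumption; the sample text and names are hypothetical). The second pattern is a Python-specific simplification: the re.DOTALL flag lets . match line breaks, so the (.|[\r\n]) alternation is no longer needed.

```python
import re

source = (
    "start_code();\n"
    "/****\n"
    " * First comment\n"
    " ****/\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();"
)

non_greedy = re.compile(r"/\*(.|[\r\n])*?\*/")

# *? stops at the first "*/" it can reach, so each comment is its own match.
found = [m.group(0) for m in non_greedy.finditer(source)]
print(found)

# Python-specific shortcut: with re.DOTALL, "." also matches line breaks.
dotall = re.compile(r"/\*.*?\*/", re.DOTALL)
print(dotall.findall(source) == found)
```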
Tool | Expression and Usage | Notes |
---|---|---|
nedit | /\*(.|[\r\n])*?\*/ Usage: Ctrl+F to find, put in expression, check the Regular Expression check box. | [^] does not include new line. |
grep | /\*.*?\*/ Usage: grep -E '/\*.*?\*/' <file> | Does not support multi-line comments; prints each line that completely contains a comment. POSIX extended syntax (-E) has no non-greedy operator; GNU grep offers one only with -P. |
perl | /\*(?:.|[\r\n])*?\*/ Usage: perl -0777ne 'print m!/\*(?:.|[\r\n])*?\*/!g;' <file> | Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. / does not have to be escaped because ! delimits the expression. -0777 is used to enable slurp mode and -n enables automatic reading. |
Java | "/\\*(?:.|[\\n\\r])*?\\*/" Usage: System.out.println(sourcecode.replaceAll("/\\*(?:.|[\\n\\r])*?\\*/","")); | Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String. |
Caveats
Comments Inside Other Elements
Although our regular expression describes C-style comments very well, there are still problems when something appears to be a comment but is actually part of a larger element.
someString = "An example comment: /* example */";
// The comment around this code has been commented out.
// /*
some_code();
// */
The solution to this is to write regular expressions that describe each of the possible larger elements, find these as well, decide what type of element each is, and discard the ones that are not comments. There are tools called lexers or tokenizers that can help with this task. A lexer accepts regular expressions as input, scans a stream, picks out tokens that match the regular expressions, and classifies the token based on which expression it matched. The greedy property of regular expressions is used to ensure the longest match. Although writing a full lexer for C is beyond the scope of this document, those interested should look at lexer generators such as Flex and JFlex.
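As a rough sketch of that idea (not a full lexer), the Python snippet below matches string literals and line comments alongside block comments, then discards the strings. Python's re module, the group names, and the find_comments function are all my assumptions for illustration. Only double-quoted strings are recognized, so character literals and other C constructs would still need their own expressions.

```python
import re

# Hypothetical token pattern: each alternative describes one larger element.
TOKEN = re.compile(
    r'(?P<string>"(?:\\.|[^"\\\n])*")'                # double-quoted string
    r"|(?P<line_comment>//[^\n]*)"                     # // to end of line
    r"|(?P<block_comment>/\*(?:[^*]|\*+[^*/])*\*+/)"   # /* ... */
)


def find_comments(source):
    """Return the comments in source, ignoring comment-like text in strings."""
    return [
        m.group(0)
        for m in TOKEN.finditer(source)
        if m.lastgroup != "string"  # classify by which alternative matched
    ]


source = (
    'someString = "An example comment: /* example */";\n'
    "// /*\n"
    "some_code();\n"
    "// */\n"
)
print(find_comments(source))  # ['// /*', '// */']
```

Because the string literal is consumed as one token, the /* example */ inside it is never considered on its own, and the /* and */ that sit inside line comments are reported as part of those line comments.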