JJTree Tutorial for Advanced Java Parsing
The Problem
JJTree is part of JavaCC, a parser/scanner generator for Java. It is a preprocessor for JavaCC that inserts parse-tree-building actions at various places in the JavaCC source. To follow along you need to understand the core concepts of parsing. Also review the basic JJTree documentation and the samples provided in the JavaCC distribution (version 4.0).
JJTree is magically powerful, but it is just as complex. We used it quite successfully at my startup, www.moola.com. After some basic research into grammar rules, lookaheads and node annotations, plus some prototyping, I felt quite comfortable with the tool. However, just recently, when I had to use JJTree again, I hit the same steep learning curve as if I had never seen JJTree before.
How do you write a tutorial that gets you back in shape quickly without forcing a full relearning?
The Solution
Here I capture my notes in a form that spares me that same learning curve in the future. You can think of my approach as a layered improvement of a grammar that follows these steps:
- get lexer
- complete grammar
- optimize produced AST
- define custom node
- define actions
- write evaluator
I always start simple and then go more complex, and this is exactly how I will document it. In each example I start with a trivial portion of the grammar and then add to it to force specific behavior. Let's hope this saves all of us the relearning.
Reorder tokens from more specific to less specific
Tokens in the TOKEN section can be declared in any order, but you have to pay very close attention to that order. JavaCC takes the longest match, and when several tokens match the same text it goes down the list from the top and picks the first one that matches. For example, notice how "interface" and "exception" are defined before TERM. If we had defined "interface" after TERM, "interface" would never get matched; TERM would.
TOKEN : {
<INTERFACE: "interface" >
| < EXCEPTION: "exception" >
| < ENUM: "enum" >
| < STRUCT: "struct" >
| < STRING_LITERAL: "'" (~["'","\n","\r"])* "'" >
| < TERM: <LETTER> (<LETTER>|<DIGIT>)* >
| < NUMBER: <INTEGER> | <FLOAT> >
| < INTEGER: ["0"-"9"] (["0"-"9"])* >
| < FLOAT: (["0"-"9"])+ "." (["0"-"9"])* >
| < DIGIT: ["0"-"9"] >
| < LETTER: ["_","a"-"z","A"-"Z"] >
}
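Ordering is also the reason why we can't just use "interface" inline in the definition of productions: an inline string literal counts as declared at the point where it appears, so a TERM declared earlier in the TOKEN section will always match first.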
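The tie-breaking rule is easy to try out without JavaCC. The sketch below is a hypothetical helper, not JavaCC code: it imitates the rule with java.util.regex (longest match wins; among equal lengths the earlier declaration wins), using a regex written to mirror TERM. The names TokenOrderDemo, firstToken, keywordFirst and termFirst are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JavaCC: imitates its tie-breaking rule.
class TokenOrderDemo {

    // Returns the name of the token kind that wins at the start of input,
    // or null if nothing matches. Map order stands for declaration order.
    static String firstToken(Map<String, String> declarations, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, String> d : declarations.entrySet()) {
            Matcher m = Pattern.compile(d.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // Strictly greater: on a tie the earlier declaration is kept.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = d.getKey();
            }
        }
        return winner;
    }

    // "interface" declared before TERM, as in the TOKEN section above.
    static Map<String, String> keywordFirst() {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("INTERFACE", "interface");
        d.put("TERM", "[_a-zA-Z][_a-zA-Z0-9]*");
        return d;
    }

    // The wrong order: TERM declared before "interface".
    static Map<String, String> termFirst() {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("TERM", "[_a-zA-Z][_a-zA-Z0-9]*");
        d.put("INTERFACE", "interface");
        return d;
    }

    public static void main(String[] args) {
        System.out.println(firstToken(keywordFirst(), "interface Foo")); // INTERFACE
        System.out.println(firstToken(termFirst(), "interface Foo"));    // TERM
        System.out.println(firstToken(keywordFirst(), "interfaces"));    // TERM (longer match)
    }
}
```

The last call shows why the keyword can safely come first: an identifier that merely starts with "interface" is longer, so it still comes out as TERM.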
Remove some nodes from final AST
Some nodes do not have any special meaning and should be excluded from the final AST. This is done by using #void like this:
void InterfaceDecl() #void : {
}{
ExceptionClause()
|
EnumClause()
|
StructClause()
|
MethodDecl()
}
Add action to a production
You will definitely need to add actions to your productions for the parser to be useful. Here I capture the text of the current token (t.image) and put it into the jjtThis node, which resolves to my custom node class ASTTypeDecl. You bind a variable "t" to a token using "="; the action itself goes in curly braces right after the token and can refer to that token as "t" and to the current AST node as "jjtThis".
void TypeDecl() : {
Token t;
}
{
<VOID>
|
t=<TERM> { jjtThis.name = t.image; } ("[]")?
}
Here I further set the isArray property to true, but only if "[]" is found after the <TERM>:
void TypeDecl() : {
Token t;
}
{
<VOID>
|
t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?
}
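The actions above assume that the node class has name and isArray members. JJTree only generates a node class when its file does not exist yet, so ASTTypeDecl.java can be edited by hand. Here is a minimal sketch of what it might look like, assuming MULTI=true and the default AST prefix. The SimpleNode below is only a stand-in so that the example compiles on its own (the generated one has many more members), and the toString() format is my own choice.

```java
// Stand-in for the SimpleNode class that JJTree generates.
class SimpleNode {
    protected int id;

    public SimpleNode(int id) {
        this.id = id;
    }
}

// Custom node for the TypeDecl production.
class ASTTypeDecl extends SimpleNode {
    public String name;      // set by the action after <TERM>; stays null for <VOID>
    public boolean isArray;  // set by the action after "[]"

    public ASTTypeDecl(int id) {
        super(id);
    }

    // Assumption for this sketch: a node without a name came from the <VOID> branch.
    @Override
    public String toString() {
        if (name == null) {
            return "void";
        }
        return isArray ? name + "[]" : name;
    }
}
```

Public fields keep the grammar actions short; setters would work just as well if you prefer jjtThis.setName(t.image).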
Multiple actions inside one production rule
Just as we saw earlier, you can access the values of multiple tokens in one production rule. Notice how I declare two separate tokens, "t" and "n":
void ConstDecl() : {
Token t;
Token n;
}
{
LOOKAHEAD(2)
t=<TERM> { jjtThis.name = t.image; } "=" n=<NUMBER> { jjtThis.value = Integer.valueOf(n.image); }
|
<TERM>
}
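One caveat with the action above: NUMBER also covers <FLOAT>, and Integer.valueOf throws a NumberFormatException for an image such as "1.5". Here is a small sketch of a more forgiving conversion; NumberImages and parse are hypothetical names, and it assumes the value field is declared as Number.

```java
// Hypothetical helper: converts the image of a NUMBER token to a Number.
class NumberImages {

    static Number parse(String image) {
        // FLOAT images always contain a dot, INTEGER images never do.
        if (image.indexOf('.') >= 0) {
            return Double.valueOf(image);
        }
        return Integer.valueOf(image);
    }
}
```

In the grammar the action would then read jjtThis.value = NumberImages.parse(n.image);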
Lookaheads
There are certain points in complex grammars that cannot be parsed unambiguously with just one token of lookahead. If you are writing a high-performance parser you might need to rewrite the grammar. But if you do not care about performance you can force a lookahead of more than one token.
The generator will give you a warning about such ambiguities. Go to the rule it refers to and set a lookahead of 2 or more like this:
void EnumDeclItem() : {}
{
LOOKAHEAD(2)
<TERM> "=" <NUMBER>
|
<TERM>
}
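What LOOKAHEAD(2) buys you can be pictured with a hand-written decision function. This only illustrates the idea and is not the code JavaCC generates; token kinds are represented as plain strings, and LookaheadDemo and chooseAlternative are hypothetical names.

```java
import java.util.List;

// Illustration only: how a two-token lookahead picks a branch of EnumDeclItem.
class LookaheadDemo {

    // Both alternatives start with TERM, so looking at kinds.get(pos) alone
    // cannot decide. The second token settles it.
    static String chooseAlternative(List<String> kinds, int pos) {
        if (pos >= kinds.size() || !kinds.get(pos).equals("TERM")) {
            return "no match";
        }
        boolean assignFollows =
                pos + 1 < kinds.size() && kinds.get(pos + 1).equals("=");
        return assignFollows ? "TERM = NUMBER" : "TERM";
    }
}
```

With one token of lookahead the parser would always commit to the first alternative as soon as it sees a TERM, and then fail on an item that has no "=".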
Node return values
It is possible to return nodes from productions, just like function return values. Here I declare that an ASTTypeDecl will be returned:
ASTTypeDecl TypeDecl() : {
Token t;
}
{
<VOID>
|
t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?
{ return jjtThis; }
}
Once you start having a lot of expressions in one production it is better to group them together so that the return statement applies to all of them. The above example actually contains a bug, because the return statement is attached to only one branch of the "|" choice and not to both. We can easily fix the issue by using parentheses to force the order of precedence:
ASTTypeDecl TypeDecl() : {
Token t;
}
{
(
<VOID>
|
t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?
)
{ return jjtThis; }
}
Build abstract syntax tree as you go
Once all your productions return values you can build the AST on the fly while parsing. Just provide four overloaded add() methods in the ASTInterfaceDecl class and call them like this:
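// No #void here: with #void no ASTInterfaceDecl node is created,
// so jjtThis.add() would have nothing to call.
void InterfaceDecl() : {
ASTExceptionClause ex;
ASTEnumClause en;
ASTStructClause st;
ASTMethodDecl me;
}
{
ex=ExceptionClause() { jjtThis.add(ex); }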
|
en=EnumClause() { jjtThis.add(en); }
|
st=StructClause() { jjtThis.add(st); }
|
me=MethodDecl() { jjtThis.add(me); }
}
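The node class side of this could look roughly as follows. It is a sketch under assumptions: the four clause classes are empty stand-ins so that the example compiles on its own (in a real project JJTree generates them and they extend SimpleNode, as does ASTInterfaceDecl), and keeping one list per clause type is my own choice.

```java
import java.util.ArrayList;
import java.util.List;

// Empty stand-ins for node classes that JJTree would generate.
class ASTExceptionClause {}
class ASTEnumClause {}
class ASTStructClause {}
class ASTMethodDecl {}

// One overloaded add() per clause type; the compiler picks the overload
// from the declared type of the variable used in the grammar action.
class ASTInterfaceDecl {
    final List<ASTExceptionClause> exceptions = new ArrayList<>();
    final List<ASTEnumClause> enums = new ArrayList<>();
    final List<ASTStructClause> structs = new ArrayList<>();
    final List<ASTMethodDecl> methods = new ArrayList<>();

    void add(ASTExceptionClause ex) { exceptions.add(ex); }
    void add(ASTEnumClause en)      { enums.add(en); }
    void add(ASTStructClause st)    { structs.add(st); }
    void add(ASTMethodDecl me)      { methods.add(me); }
}
```

Typed lists mean that later code, such as an evaluator, can walk the methods or enums directly without casting generic child nodes.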
Use <EOF>
Quite often you get your grammar written and start celebrating, only to notice that part of the file is not being parsed... This happens because you did not tell the parser to read all content to the end of the file, so it feels free to stop parsing at will. Force parsing to reach the end of the file by demanding the <EOF> token in the topmost production:
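void InterfaceDecl() #void : {
}{
// <EOF> has to follow the clauses; as just another "|" alternative
// it would not force the parser to consume the whole file.
(
ExceptionClause()
|
EnumClause()
|
StructClause()
|
MethodDecl()
)*
<EOF>
}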
The Final Word
JJTree works incredibly well. There is no excuse for regex parsing any more... Don't even try to convince me!
Drop me a line if you need help with JJTree; I will be glad to share my experience with you.
References
- The JavaCC FAQ by Theodore S. Norvell