What is ANTLR 3?

ANTLR - ANother Tool for Language Recognition - is a tool used to construct formal language software tools (or just language tools) such as translators, compilers, recognizers, and static/dynamic program analyzers. Developers use ANTLR to reduce the time and effort needed to build and maintain language processing tools. In common terminology, ANTLR is a compiler generator or compiler compiler (in the tradition of tools such as Lex/Flex and Yacc/Bison), and it is used to generate the source code for language recognizers, analyzers and translators from language specifications. ANTLR takes as its input a grammar - a precise description of a language augmented with semantic actions - and generates source code files and other auxiliary files. The target language of the generated source code (e.g. Java, C/C++, C#, Python, Ruby) is specified in the grammar.

Software developers and language tool implementors can use ANTLR to implement Domain-Specific Languages, to generate parts of language compilers and translators, or even to help them build tools that parse complex XML.

As stated above, ANTLR 3 generates the source code for various tools that can be used to recognize, analyze and transform input data relative to a language that is defined in a specified grammar file. The basic types of language processing tools that ANTLR can generate are Lexers (a.k.a. scanners, tokenizers), Parsers, and TreeParsers (a.k.a. tree walkers, cf. visitors).

What exactly does ANTLR 3 do?

ANTLR reads a language description file called a grammar and generates a number of source code files and other auxiliary files. Most uses of ANTLR generate at least one (and quite often both) of these tools:

  • Lexer:
    This reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens using patterns you specify, and generates a token stream as output. It can also flag some tokens such as whitespace and comments as hidden using
    a protocol that ANTLR parsers automatically understand and respect.
  • Parser:
    This reads a token stream (normally generated by a lexer), matches phrases in your language via the rules (patterns) you specify, and typically performs some semantic action for each phrase (or sub-phrase) matched. Each match could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing.
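The lexer's job described above can be pictured with a small hand-written tokenizer. This is a rough sketch in plain Python, not code that ANTLR generates; the names TOKEN_SPEC and tokenize, and the particular patterns, are assumptions made for illustration.

```python
import re

# Hypothetical token patterns, tried in order; ANTLR derives the real ones from lexer rules.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("WHITESPACE", r"[ \t\r\n\f]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Divide a character stream into a list of (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        # lastgroup is the name of the alternative that matched, i.e. the token type
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling tokenize("12+3") yields three pairs: a NUMBER, a PLUS and another NUMBER. A parser would then consume this list instead of the raw characters.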

ANTLR's Abstract Syntax Tree (AST) processing is especially powerful. If you also specify a tree grammar, ANTLR will generate a Tree Parser for you that can contain custom actions or StringTemplate output statements. The next
version of ANTLR (3.1) will support rewrite rules that can be used to express tree transformations.

Most language tools will:

  1. Use a Lexer and Parser in series to check the word-level and phrase-level structure of the input and if no fatal
    errors are encountered, create an intermediate tree representation such as an Abstract Syntax Tree (AST),
  2. Optionally modify (i.e. transform or rewrite) the intermediate tree representation (e.g. to perform optimizations)
    using one or more Tree Parsers, and
  3. Produce the final output using a Tree Parser to process the final tree representation. This might mean generating source code or another textual representation from the tree (perhaps using StringTemplate),
    or performing some other custom actions driven by the final tree representation.

Simpler language tools may omit the intermediate tree and build the actions or output stage directly into the parser. The calculator shown below uses only a Lexer and a Parser.
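To make the three stages concrete, here is a minimal hand-written sketch in Python. It does not use ANTLR at all: the tiny input language (numbers and names joined by + or -), the tuple-based tree, and the function names lex, parse, fold and emit are all assumptions chosen for illustration.

```python
import re


def lex(text):
    """Stage 1a: split input into number, name and operator tokens.

    Whitespace (and, in this sketch, any unrecognized character) is skipped.
    """
    return re.findall(r"[0-9]+|[A-Za-z_]+|[+-]", text)


def parse(tokens):
    """Stage 1b: build a left-associative tree for: operand (('+' | '-') operand)*"""
    def operand(index):
        if index >= len(tokens) or tokens[index] in ("+", "-"):
            raise SyntaxError("expected a number or a name")
        token = tokens[index]
        return int(token) if token.isdigit() else token

    tree = operand(0)
    for i in range(1, len(tokens), 2):
        if tokens[i] not in ("+", "-"):
            raise SyntaxError(f"expected an operator, found {tokens[i]!r}")
        tree = (tokens[i], tree, operand(i + 1))
    return tree


def fold(tree):
    """Stage 2: rewrite the tree, replacing constant subtrees with their value."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree[0], fold(tree[1]), fold(tree[2])
    if isinstance(left, int) and isinstance(right, int):
        return left + right if op == "+" else left - right
    return (op, left, right)


def emit(tree):
    """Stage 3: walk the final tree and produce text (here, prefix notation)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({op} {emit(left)} {emit(right)})"
    return str(tree)
```

For the input "1 + 2 + x", parse produces the tree ("+", ("+", 1, 2), "x"), fold rewrites it to ("+", 3, "x"), and emit renders "(+ 3 x)". In an ANTLR project, stages 2 and 3 would typically be tree grammars rather than hand-written functions.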

ANTLR, Then and Now

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with its new LL(*) parsing engine. ANTLR provides a framework for the generation of recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).

When it was released, PCCTS supported C as its only target language, but through consulting with NeXT Computer, PCCTS gained C++ support after 1994. PCCTS's immediate successor was ANTLR 2 and it supported Java, C# and Python as target languages in
addition to C++.

Target languages

ANTLR 3 already supports Java, C#, Objective-C, C, Python and Ruby as target languages. Support for additional target languages including C++, Perl6 and Oberon (yes, Oberon) is either expected or already in progress. This is due in part to the fact that it is much easier to add support for a target language (or customize the code generated by an existing target) in ANTLR 3.

Why should I use ANTLR 3?

Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features, including an improved analysis engine, significantly enhanced parsing strength via LL(*) parsing with arbitrary lookahead, vastly improved tree construction rewrite rules, and the availability of the simply fantastic AntlrWorks IDE, offer productivity benefits over other comparable generative language processing toolkits.

How do I use ANTLR 3?

1. Get ANTLR 3

Download and install ANTLR 3 from the ANTLR website.

2. Run ANTLR 3 on a simple grammar

2.1 Create a simple grammar

Java

grammar SimpleCalc;

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr    : term ( ( PLUS | MINUS )  term )* ;

term    : factor ( ( MULT | DIV ) factor )* ;

factor  : NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER  : (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

fragment DIGIT  : '0'..'9' ;

C#

Note: language=CSharp2 with ANTLR 3.1; ANTLR 3.0.1 uses the older CSharp target

grammar SimpleCalc;

options {
    language=CSharp2;
}

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}

@members {
    public static void Main(string[] args) {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            Console.Error.WriteLine(e.StackTrace);
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr    : term ( ( PLUS | MINUS )  term )* ;

term    : factor ( ( MULT | DIV ) factor )* ;

factor  : NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER  : (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = Hidden; } ;

fragment DIGIT  : '0'..'9' ;

Objective-C

To be written. Volunteers?

grammar SimpleCalc;

options {
    language=ObjC;
}

OR : '||' ;

C

grammar SimpleCalc;

options {
    language=C;
}

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}

@members {

 #include "SimpleCalcLexer.h"

 int main(int argc, char * argv[])
 {

    pANTLR3_INPUT_STREAM           input;
    pSimpleCalcLexer               lex;
    pANTLR3_COMMON_TOKEN_STREAM    tokens;
    pSimpleCalcParser              parser;

    input  = antlr3AsciiFileStreamNew          ((pANTLR3_UINT8)argv[1]);
    lex    = SimpleCalcLexerNew                (input);
    tokens = antlr3CommonTokenStreamSourceNew  (ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
    parser = SimpleCalcParserNew               (tokens);

    parser  ->expr(parser);

    // Must manually clean up
    //
    parser ->free(parser);
    tokens ->free(tokens);
    lex    ->free(lex);
    input  ->close(input);

    return 0;
 }

}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr    : term   ( ( PLUS | MINUS )  term   )*
        ;

term    : factor ( ( MULT | DIV   )  factor )*
        ;

factor  : NUMBER
        ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER      : (DIGIT)+
            ;

WHITESPACE  : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+
              {
                 $channel = HIDDEN;
              }
            ;

fragment
DIGIT       : '0'..'9'
            ;

Python

grammar SimpleCalc;

options {
    language = Python;
}

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}

@header {
import sys
import traceback

from SimpleCalcLexer import SimpleCalcLexer
}

@main {
def main(argv, otherArg=None):
  char_stream = ANTLRFileStream(sys.argv[1])
  lexer = SimpleCalcLexer(char_stream)
  tokens = CommonTokenStream(lexer)
  parser = SimpleCalcParser(tokens);

  try:
    parser.expr()
  except RecognitionException:
    traceback.print_stack()
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr    : term ( ( PLUS | MINUS )  term )* ;

term    : factor ( ( MULT | DIV ) factor )* ;

factor  : NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER  : (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

fragment DIGIT  : '0'..'9' ;

2.2 Run ANTLR 3 on the simple grammar

java org.antlr.Tool SimpleCalc.g

ANTLR will generate source files for the lexer and parser (e.g. SimpleCalcLexer.java and SimpleCalcParser.java). Copy these into the appropriate places for your development environment and compile them.

2.3 Revisit the simple grammar and learn basic ANTLR 3 syntax

Let's break this example down and develop the simple calculator from the beginning. This can help you learn ANTLR by example.

Before you start

You will learn best by following along, experimenting, and looking at the generated source code. To do that, you'll need:

  • A simple text editor,
  • An installed copy of ANTLR 3.1, or
  • An installed copy of ANTLR Works (free, highly recommended, and contains its own copy of ANTLR)

Define a trivial grammar

Any language processing system has at least two components:

  1. A lexer that takes a stream of characters and divides the stream into tokens according to pre-set rules, and
  2. A parser that reads the tokens and interprets them according to its rules.

Let's start by defining the rules for a simple arithmetic expression: 100+23:

Trivial calculator
grammar SimpleCalc;

add     : NUMBER PLUS NUMBER ;

NUMBER  : ('0'..'9')+ ;

PLUS    : '+' ;

This example contains two lexer rules - NUMBER and PLUS - and the parser rule add. Lexer rules always start with an uppercase letter, while parser rules start with lowercase letters.

  • NUMBER defines a token (named "NUMBER") that contains any character between 0 and 9, inclusive, repeated one or more times. The .. operator creates
    a character range, while + means "one or more times". (This suffix should look familiar if you know regular expressions.)
  • PLUS defines a token with a single character: +.
  • add defines a parser rule that says "expect a NUMBER token, a PLUS token, and a NUMBER token in that order." Any other tokens, or tokens
    in a different order, will trigger an error message.
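If you know regular expressions, the three rules just described can be approximated in a few lines of Python. This is only an analogy under stated assumptions: ANTLR keeps lexing and parsing as separate stages, and the names NUMBER, PLUS, ADD and matches_add below are ordinary Python variables invented for this sketch.

```python
import re

NUMBER = r"[0-9]+"   # ('0'..'9')+ : a character range, repeated one or more times
PLUS = r"\+"         # '+'         : a single literal character

# add : NUMBER PLUS NUMBER ;  the three tokens must appear in exactly this order
ADD = re.compile(f"{NUMBER}{PLUS}{NUMBER}")


def matches_add(text):
    """Return True if the whole input has the shape NUMBER PLUS NUMBER."""
    return ADD.fullmatch(text) is not None
```

With these definitions, matches_add("100+23") is true, while inputs such as "100+" or "+100" are rejected, just as the add rule would report an error for them. The analogy only works because add is not recursive; nested rules need a real parser.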

Flesh out the calculator

Let's first allow more complex expressions such as 1 or 1+2 or 1+2+3+4. Such an expression starts with a single number, which can then be followed by a plus sign and a number (possibly more than once):

Repeated addition
add : NUMBER (PLUS NUMBER)* ;

The * symbol means "zero or more times".
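One way to picture what "zero or more times" means for a recognizer is as a loop. The sketch below is hand-written Python, not ANTLR output; it assumes the input has already been lexed into a list of token type names, and recognize_add is a hypothetical helper name.

```python
def recognize_add(tokens):
    """Return True if the token types match: NUMBER (PLUS NUMBER)*"""
    pos = 0
    # The rule always starts with one NUMBER.
    if pos >= len(tokens) or tokens[pos] != "NUMBER":
        return False
    pos += 1
    # (PLUS NUMBER)* : repeat for as long as the next token begins another round.
    while pos < len(tokens) and tokens[pos] == "PLUS":
        pos += 1
        if pos >= len(tokens) or tokens[pos] != "NUMBER":
            return False
        pos += 1
    # Every token must have been consumed.
    return pos == len(tokens)
```

A single NUMBER is accepted because the loop body may run zero times; NUMBER PLUS is rejected because a started repetition must be completed.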

If you want to implement both addition and subtraction, you can make a small adjustment:

Addition and subtraction
add   : NUMBER ((PLUS | MINUS) NUMBER)* ;

MINUS : '-' ;

As you may have guessed, | means "or" as in "PLUS or MINUS".

If you want to parse complete arithmetic expressions such as 1+2*3, there's a standard recursive way to do it:

Recursive expression definition
expr    : term ( ( PLUS | MINUS )  term )* ;
term    : factor ( ( MULT | DIV ) factor )* ;
factor  : NUMBER ;

MULT    : '*' ;
DIV     : '/' ;

To evaluate an expression, always start with expr.
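The layering of expr, term and factor is what gives multiplication and division a higher precedence than addition and subtraction. A hand-written recursive-descent sketch in Python shows the idea, with one method per rule. Note the assumptions: the tutorial grammar only recognizes input, so the evaluation added here, the use of integer division, and the names Calc and evaluate are illustrative choices rather than anything ANTLR produces.

```python
import re


class Calc:
    """One method per grammar rule; each returns the value of what it matched."""

    def __init__(self, text):
        # Crude lexer for the sketch: numbers and the four operators only.
        self.tokens = re.findall(r"[0-9]+|[-+*/]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr : term ( ( PLUS | MINUS ) term )* ;
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term : factor ( ( MULT | DIV ) factor )* ;
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value //= self.factor()  # integer division, an assumption of this sketch
        return value

    def factor(self):
        # factor : NUMBER ;
        token = self.advance()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected NUMBER, found {token!r}")
        return int(token)


def evaluate(text):
    """Start at expr, as the text says, and require that all input is consumed."""
    calc = Calc(text)
    value = calc.expr()
    if calc.peek() is not None:
        raise SyntaxError(f"unexpected token {calc.peek()!r}")
    return value
```

For 1+2*3, expr asks term for its first operand, and the second term consumes 2*3 as a unit before the addition happens, so evaluate("1+2*3") returns 7 rather than 9.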

Handle white space

Our grammar is intolerant of white space: it will give warnings about spaces, tabs, returns, etc. Let's tell the Lexer that it's safe to discard any white space it finds.

First, we have to define white space:

  • A space is ' '
  • A tab is written '\t'
  • A newline (line feed) is written '\n'
  • A carriage return is written '\r'
  • A form feed has a decimal value of 12 and a hexadecimal value of 0C. ANTLR uses Unicode, so we define this as 4 hex digits: '\u000C'

Put these together with an "or", allow one or more to occur together, and you have

Defining whitespace
WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ ;
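As a cross-check of the character list, the same set can be written as a Python regular expression. This is an analogy only, not ANTLR syntax; WHITESPACE and is_whitespace_token are names made up for this sketch.

```python
import re

# The same five characters as the WHITESPACE rule, one or more times.
WHITESPACE = re.compile("[\t \r\n\u000C]+")

# The form feed can be spelled several ways; all of these are the same character.
FORM_FEED_SPELLINGS = ["\u000C", "\f", chr(12), chr(0x0C)]


def is_whitespace_token(text):
    """Return True if the whole string would be matched by the WHITESPACE rule."""
    return WHITESPACE.fullmatch(text) is not None
```

Here is_whitespace_token(" \t\r\n\f") is true, while an empty string is rejected because + requires at least one character.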

However, if we write the expression 3 + 4*5, the lexer will generate NUMBER WHITESPACE PLUS WHITESPACE NUMBER MULT NUMBER, and this will cause the parser to complain about the unknown WHITESPACE tokens. We need a way to hide them from the parser.

ANTLR maintains two channels of communication between the lexer and the parser - a default channel and a hidden channel. The parser listens to only one channel at a time (usually the default one), so you can
"hide" a token by assigning it to the hidden channel.

There can be more than two channels and the parser can listen to them individually or get the text from all the channels merged together. This is useful when you are writing a text-processing tool that needs to pass through the whitespace and comments to the
output while letting the parser ignore those elements.
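The channel idea can be modelled in a few lines of Python. This is a simplified sketch, not the ANTLR runtime: the Token tuple, the channel numbers, and the helper names assign_channels, visible_types and merged_text are assumptions made for illustration.

```python
from collections import namedtuple

Token = namedtuple("Token", ["type", "text", "channel"])

# Illustrative channel numbers; the real constants come from the ANTLR runtime.
DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1

# Lexer output for "3 + 4*5", as listed in the text above.
LEXED = [
    ("NUMBER", "3"), ("WHITESPACE", " "), ("PLUS", "+"), ("WHITESPACE", " "),
    ("NUMBER", "4"), ("MULT", "*"), ("NUMBER", "5"),
]


def assign_channels(pairs, hidden_types=("WHITESPACE",)):
    """Tag each (type, text) pair with a channel; hidden types go to the hidden channel."""
    return [
        Token(kind, text, HIDDEN_CHANNEL if kind in hidden_types else DEFAULT_CHANNEL)
        for kind, text in pairs
    ]


def visible_types(tokens, channel=DEFAULT_CHANNEL):
    """What a parser listening to a single channel would see."""
    return [token.type for token in tokens if token.channel == channel]


def merged_text(tokens):
    """Text from all channels merged together, in original order."""
    return "".join(token.text for token in tokens)
```

With tokens = assign_channels(LEXED), visible_types(tokens) contains only NUMBER, PLUS, NUMBER, MULT, NUMBER, while merged_text(tokens) still reproduces the original "3 + 4*5". Nothing is thrown away; the parser simply does not look at the hidden channel.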

You hide the token by setting the token's $channel flag to the constant HIDDEN. This requires adding a little code to the lexer, which you do by adding curly brackets:

Defining whitespace
WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;

If you're following along at the keyboard, try generating the Lexer and Parser code now and search the Lexer for channel = HIDDEN.

ANTLR generates Java code by default. You'll learn how to change that in just a minute.

Tidy up the code

You can use a few techniques to make your grammar more readable:

  1. Add comments including single-line // and multi-line /* ... */
  2. Gather your simple token definitions (single characters, single words, etc.) into a tokens section at the top of the file.
  3. Consider defining sub-parts of tokens with fragment rules. A fragment will never generate a token by itself but
    can be used as part of the rule defining another token.

Here's a tidied-up copy:

Tidier grammar
grammar SimpleCalc;

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr    : term ( ( PLUS | MINUS )  term )* ;

term    : factor ( ( MULT | DIV ) factor )* ;

factor  : NUMBER ;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER  : (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

fragment DIGIT  : '0'..'9' ;

Turn this into a stand-alone program

If you try to run this parser, nothing happens.

You need to add some code to make this work as a stand-alone tool, and you may want to add some variables at the top of the parser. All of this happens in a @members { ... } block near the top of the file:

Main entry point for Java
@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

This shows the usual pattern: take an input stream, feed it to the generated Lexer, get a token stream from the lexer, feed that to the parser, and then call one of the methods on the parser. (Each parser rule adds a corresponding method to the parser.)

Don't like Java? You can use ANTLR to generate code for Java, C, C++, C#, Objective-C, Python, Ruby, and other languages (see Code Generation Targets) as well as add your own generator. Use an options block to switch languages:

Typical options block
grammar SimpleCalc;

options {
    language=CSharp2;
}

Your five minutes are up!

You've just seen:

  • How to write lexer rules
  • How to write basic parser rules
  • How to direct tokens away from the parser (to ignore them)
  • How to insert executable code into a parser.

Some points to consider:

  • You can insert custom actions anywhere.
  • Most of your custom code winds up in the last stage of the parsing process. Here it was in the Parser; if you used an AST, it would be in the tree parser.

What next?

This covers the majority of the things you need to know to develop a grammar. You may want to work through another of the tutorials:

You could also:

Special constructs (reference)

Each entry gives the construct, its description, and an example.

  • (...)*
    Description: Kleene closure - matches zero or more occurrences.
    Example: LETTER DIGIT* - match a LETTER followed by zero or more occurrences of DIGIT.
  • (...)+
    Description: Positive Kleene closure - matches one or more occurrences.
    Examples: ('0'..'9')+ - match one or more occurrences of a numerical digit. LETTER (LETTER|DIGIT)+ - match a LETTER followed by one or more occurrences of either LETTER or DIGIT.
  • fragment
    Description: fragment in front of a lexer rule instructs ANTLR that the rule is only used as part of another lexer rule (i.e. it only builds a fragment of a recognized token).
    Example: fragment DIGIT : '0'..'9' ; NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;
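For readers more at home with regular expressions, the reference examples above map onto Python patterns roughly as follows. This is an approximation for intuition only: LETTER is not defined in the table, so the ASCII definition below is an assumption, and EXAMPLES and matches are names invented for this sketch.

```python
import re

LETTER = "[A-Za-z]"  # assumed definition; the reference table does not define LETTER
DIGIT = "[0-9]"      # plays the role of a fragment: reused in other patterns, never a token itself

EXAMPLES = {
    "LETTER DIGIT*": re.compile(f"{LETTER}{DIGIT}*"),
    "('0'..'9')+": re.compile("[0-9]+"),
    "LETTER (LETTER|DIGIT)+": re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})+"),
    "(DIGIT)+ ('.' (DIGIT)+ )?": re.compile(rf"{DIGIT}+(?:\.{DIGIT}+)?"),
}


def matches(construct, text):
    """Return True if the whole text matches the pattern for the named construct."""
    return EXAMPLES[construct].fullmatch(text) is not None
```

For instance, matches("LETTER DIGIT*", "a") is true because * allows zero digits, whereas matches("LETTER (LETTER|DIGIT)+", "a") is false because + demands at least one more character. The last entry mirrors the NUMBER rule from the fragment example, where ? makes the fractional part optional.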
