Writing a simple Lexer in PHP/C++/Java
Contents
0. Comparison of parser generators
1. Writing a simple lexer in PHP
2. phc
3. JLexPHP: A PHP Lexer(xx.lex.php) Created By Java By x.lex File Input
4. JFlex
5. JLex: A Lexical Analyzer Generator for Java
6. PhpParser
0. Comparison of parser generators
Relevant Link:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators
http://www2.cs.tum.edu/projekte/cup/examples.php
http://pygments.org/docs/lexers/
http://www.phpdeveloper.org/news/5678
1. Writing a simple lexer in PHP
0x1: Introduction
Router::connect('/login', array('Sessions::add'));
The map /login -> Sessions::add may be translated into the following token stream by the lexer
<T_MAP, "map">
<T_URL, "/login">
<T_BLOCKSTART, "->">
<T_IDENTIFIER, "Sessions::add">
The parser can then parse the following notation:
<whitespace> := [\s]
<map> := "map"
<url> := [a-z/]
<blockstart> := "->"
<identifier> := [a-zA-Z0-9:]
<mapBlock> := <map> <whitespace>* <url>+ <whitespace>* <blockstart> <whitespace>* <identifier>+
Every rule identified by a := is called a production rule. I leave the parsing part for a later post, because that would be too much for a single posting.
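Because every symbol in the <mapBlock> rule is itself a regular pattern, the whole production can be checked with a single regular expression. The sketch below is only an illustration of what the rule accepts, not how the lexer and parser in this post divide the work. The class name MapBlockSketch and its parse method are hypothetical, and the identifier class is assumed to be [a-zA-Z0-9:].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks whether a line matches the <mapBlock> production.
final class MapBlockSketch {
    // <map> <whitespace>* <url>+ <whitespace>* <blockstart> <whitespace>* <identifier>+
    // Assumption: <identifier> is the character class [a-zA-Z0-9:].
    private static final Pattern MAP_BLOCK = Pattern.compile(
        "map\\s*([a-z/]+)\\s*->\\s*([a-zA-Z0-9:]+)");

    /** Returns {url, identifier} if the whole line matches the rule, otherwise null. */
    static String[] parse(String line) {
        Matcher m = MAP_BLOCK.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("map /login -> Sessions::add");
        System.out.println(parts[0] + " => " + parts[1]);
    }
}
```

A line such as root -> Pages::home is rejected by this sketch because it does not start with the map terminal; a real parser would try the other productions instead.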
0x2: Implementation
The basic idea is that we match a list of regexes against the current line. If one of them matches, we store that token and advance our offset to the first character after the match. If no token is found and we are not yet at the end of the line, raise an exception (because something invalid is in front of our offset).
Let's stick with the example already mentioned above. We want to create tokens for an input file like map /login -> Sessions::add or root -> Pages::home.
The first thing we need is an array of terminal symbols that map to token identifiers.
<?php
class Lexer {
    protected static $_terminals = array(
        "/^(root)/" => "T_ROOT",
        "/^(map)/" => "T_MAP",
        "/^(\s+)/" => "T_WHITESPACE",
        "/^(\/[A-Za-z0-9\/:]+[^\s])/" => "T_URL",
        "/^(->)/" => "T_BLOCKSTART",
        "/^(::)/" => "T_DOUBLESEPARATOR",
        "/^(\w+)/" => "T_IDENTIFIER",
    );
}
?>
Now, let's implement a run method that accepts an array of source lines and returns an array of tokens. It calls the helper _match method that performs the actual matching and raises an exception if no token was found.
public static function run($source) {
    $tokens = array();

    foreach($source as $number => $line) {
        $offset = 0;
        while($offset < strlen($line)) {
            $result = static::_match($line, $number, $offset);
            if($result === false) {
                // Report the 1-based line number (not the line's text).
                throw new Exception("Unable to parse line " . ($number + 1) . ".");
            }
            $tokens[] = $result;
            // Skip past the text that was just matched.
            $offset += strlen($result['match']);
        }
    }

    return $tokens;
}
Note how we advance the offset on every iteration and work towards the end of the string. To get the full picture, here's the _match helper method.
protected static function _match($line, $number, $offset) {
    // Only look at the part of the line that has not been consumed yet.
    $string = substr($line, $offset);

    foreach(static::$_terminals as $pattern => $name) {
        if(preg_match($pattern, $string, $matches)) {
            return array(
                'match' => $matches[1],
                'token' => $name,
                'line' => $number + 1
            );
        }
    }

    return false;
}
We use the preg_match function to check whether one of our patterns matches the current string. If you look closely, you can see that every regex is anchored at the start of the string (^) and wrapped in a capture group (()), so a token is only found exactly at the current offset and we can read its text from the group. We also store the current line number, because we need it in our parser and to display helpful error messages.
Let's run our lexer with some example input:
$input = array('root -> Foo::bar');
$result = Lexer::run($input);
var_dump($result);
0x3: Wrapping Up
One thing to be aware of is that the token regexes must be listed in the correct order, namely from specific to general. If we put T_IDENTIFIER before T_ROOT, the root keyword would always be matched as an identifier.
In a complete lexer, the matching state machine should use a "greedy" strategy, that is, match the longest possible lexical pattern.
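The longest-match idea can be sketched in Java as follows. This is a minimal illustration rather than a port of the PHP class above: the class name LongestMatchLexer and the "NAME:lexeme" output format are hypothetical, and the T_URL pattern is simplified. At each offset every pattern is tried, the longest match wins, and rule order only breaks ties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a longest-match ("maximal munch") tokenizer.
final class LongestMatchLexer {
    // Token names mirror the PHP example; insertion order is kept for tie-breaking.
    private static final Map<String, Pattern> TERMINALS = new LinkedHashMap<>();
    static {
        TERMINALS.put("T_ROOT", Pattern.compile("root"));
        TERMINALS.put("T_MAP", Pattern.compile("map"));
        TERMINALS.put("T_WHITESPACE", Pattern.compile("\\s+"));
        TERMINALS.put("T_URL", Pattern.compile("/[A-Za-z0-9/:]+")); // simplified
        TERMINALS.put("T_BLOCKSTART", Pattern.compile("->"));
        TERMINALS.put("T_DOUBLESEPARATOR", Pattern.compile("::"));
        TERMINALS.put("T_IDENTIFIER", Pattern.compile("\\w+"));
    }

    /** Splits one line into tokens, each formatted as "NAME:lexeme". */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int offset = 0;
        while (offset < line.length()) {
            String bestName = null;
            int bestEnd = offset;
            for (Map.Entry<String, Pattern> e : TERMINALS.entrySet()) {
                Matcher m = e.getValue().matcher(line).region(offset, line.length());
                // lookingAt() anchors at the region start, like ^ in the PHP patterns.
                // The strict ">" keeps the earlier rule when two matches tie in length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = e.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("Unable to tokenize at offset " + offset);
            }
            tokens.add(bestName + ":" + line.substring(offset, bestEnd));
            offset = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("rootbeer -> Foo::bar"));
    }
}
```

With a first-match strategy the word rootbeer would be split into the root keyword followed by an identifier; with longest match it comes out as one T_IDENTIFIER, while a bare root is still reported as T_ROOT because that rule is listed first.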
Relevant Link:
http://nitschinger.at/Writing-a-simple-lexer-in-PHP
http://www.codediesel.com/php/building-a-simple-parser-and-lexer-in-php/
2. phc
phc is an open source compiler for PHP with support for plugins. In addition, it can be used to pretty-print or obfuscate PHP code, as a framework for developing applications that process PHP scripts, or to convert PHP into XML and back, enabling processing of PHP scripts using XML tools.
1. phc for PHP programmers
1) Compile PHP source into an (optimized) executable (supports entire PHP standard library).
2) Compile a web application into an (optimized) extension (supports entire PHP standard library).
3) Pretty-print PHP code.
4) Obfuscate PHP code (--obfuscate flag - experimental).
5) Combine many PHP scripts into a single file (--include flag - experimental).
6) Optimize PHP code using classical compiler optimizations (in the dataflow branch - very experimental).
2. phc for tools developers
1) Analyse, modify or refactor PHP scripts using C++ plugins.
2) Convert PHP into a well-defined XML format, process it with your own tools, and convert it back to PHP.
3) Operate on ASTs, simplified ASTs, or 3-address code.
4) Analyse or optimize PHP code using an SSA-based IR (in the dataflow branch - very experimental).
0x1: Installation Instructions
1. g++ version 3.4 or higher
2. make
3. Boost version 1.34 or higher
4. PHP5 embed SAPI (version 5.2.x recommended; refer to PHP embed SAPI installation instructions for more details). This is required to compile PHP code with phc.
5. Xerces-C++ if you want support for XML parsing (you don't need Xerces for XML unparsing).
6. Boehm garbage collector is used in phc, but not in code compiled by phc. If unavailable, it can be disabled with --disable-gc, but phc will leak all memory it uses.
//The following dependencies are optional:
1. a DOT viewer such as graphviz if you want to be able to view the graphical output generated by phc (for example, syntax trees)
/*
Under Debian/Ubuntu, the following command will install nearly all dependencies:
apt-get install build-essential libboost-all-dev libxerces27-dev graphviz libgc-dev
*/
0x2: Running phc
<?php
echo "Hello world!";
?>
//Save the script above as helloworld.php, then compile it
phc -c helloworld.php -o helloworld
//This creates an executable helloworld, which can then be run
./helloworld
0x3: Traversing the Tree
<?php
$x = 5;
if($x == 5)
echo "yes";
else
echo "no";
?>
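phc parses this script into an AST (abstract syntax tree), which can be viewed graphically with a DOT viewer such as graphviz.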
Relevant Link:
http://www.phpcompiler.org/downloads.html
http://www.phpcompiler.org/
http://www.phpcompiler.org/doc/latest/install.html
http://www.phpcompiler.org/doc/latest/runningphc.html
http://www.phpcompiler.org/doc/latest/treetutorial1.html#treetutorial1
http://www.phpcompiler.org/doc/latest/manual.html
3. JLexPHP: A PHP Lexer(xx.lex.php) Created By Java By x.lex File Input
A lexer generator for PHP. It is based on JLex and requires Java to generate the lexer. Once generated, the lexer only requires PHP to run
0x1: x.lex (lexical rules file)
<?php # vim:ft=php
include 'jlex.php';

%%

%{
//<YYINITIAL> L? \" (\\.|[^\\\"])* \" { $this->createToken(CParser::TK_STRING_LITERAL); }
/* blah */
%}

%function nextToken
%line
%char
%state COMMENTS

ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+

%%

<YYINITIAL> {NUMBER} {
return $this->createToken();
}
<YYINITIAL> {WHITE_SPACE} { }
<YYINITIAL> "+" {
return $this->createToken();
}
<YYINITIAL> "-" {
return $this->createToken();
}
<YYINITIAL> "*" {
return $this->createToken();
}
<YYINITIAL> "/" {
return $this->createToken();
}
<YYINITIAL> ";" {
return $this->createToken();
}
<YYINITIAL> "//" {
$this->yybegin(self::COMMENTS);
}
<COMMENTS> [^\n] {
}
<COMMENTS> [\n] {
$this->yybegin(self::YYINITIAL);
}
<YYINITIAL> . {
throw new Exception("bah!");
}
0x2: Lexer Generator(By Java Language)
//create the jar file
javac -Xlint:unchecked JLexPHP/Main.java
jar cvf JLexPHP.jar JLexPHP/*.class
//read the lexical rules file and generate the lexer
java -cp JLexPHP.jar JLexPHP.Main simple.lex
This produces simple.lex.php, a PHP file that contains the code of the generated lexer.
0x3: Calling simple.lex.php to tokenize a file
<?php
$scanner = new Yylex(fopen("file", "r"));
// Keep requesting tokens until yylex() returns a falsy value.
while ($scanner->yylex())
    ;
?>
Relevant Link:
https://github.com/wez/JLexPHP/blob/master/JLexPHP/Main.java
https://github.com/wez/JLexPHP
http://wezfurlong.org/blog/2006/nov/parser-and-lexer-generators-for-php/
4. JFlex
1. JFlex is a lexical analyzer generator (also known as scanner generator) for Java, written in Java.
2. A lexical analyzer generator takes as input a specification with a set of regular expressions and corresponding actions. It generates a program (a lexer) that reads input, matches the input against the regular expressions in the spec file, and runs the corresponding action if a regular expression matched.
3. Lexers usually are the first front-end step in compilers, matching keywords, comments, operators, etc., and generating an input token stream for parsers. Lexers can also be used for many other purposes.
4. JFlex lexers are based on deterministic finite automata (DFAs). They are fast, without expensive backtracking.
5. JFlex is designed to work together with the LALR parser generator CUP by Scott Hudson, and the Java modification of Berkeley Yacc BYacc/J by Bob Jamison. It can also be used together with other parser generators like ANTLR or as a standalone tool.
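To make the DFA point concrete, here is a small hand-written, table-driven automaton in Java. It is only a sketch of the general technique and is not code that JFlex produces. The class name TinyDfa, its states, and its method are all hypothetical, and the two patterns it recognizes are assumed to be IDENT=[A-Za-z_][A-Za-z_0-9]* and NUMBER=[0-9]+. Each input character triggers exactly one table lookup, so the scan never backtracks.

```java
// Hypothetical hand-written DFA that classifies a lexeme in a single pass.
final class TinyDfa {
    // States of the automaton.
    private static final int START = 0, IN_IDENT = 1, IN_NUMBER = 2, DEAD = 3;
    // Character classes (columns of the transition table).
    private static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // TRANSITIONS[state][characterClass] gives the next state.
    private static final int[][] TRANSITIONS = {
        /* START     */ { IN_IDENT, IN_NUMBER, DEAD },
        /* IN_IDENT  */ { IN_IDENT, IN_IDENT,  DEAD },
        /* IN_NUMBER */ { DEAD,     IN_NUMBER, DEAD },
        /* DEAD      */ { DEAD,     DEAD,      DEAD },
    };

    private static int classOf(char c) {
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_') {
            return LETTER;
        }
        if (c >= '0' && c <= '9') {
            return DIGIT;
        }
        return OTHER;
    }

    /** Returns "IDENT", "NUMBER", or "ERROR" for the whole input string. */
    static String classifyLexeme(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            // One table lookup per character; no character is read twice.
            state = TRANSITIONS[state][classOf(input.charAt(i))];
        }
        switch (state) {
            case IN_IDENT:  return "IDENT";
            case IN_NUMBER: return "NUMBER";
            default:        return "ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(classifyLexeme("foo_1"));
        System.out.println(classifyLexeme("42"));
    }
}
```

A generated lexer works on the same principle but builds a much larger table from all the regular expressions in the specification and also tracks where the last accepting state was seen.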
Relevant Link:
http://jflex.de/
5. JLex: A Lexical Analyzer Generator for Java
JLex is a lexical analyzer generator, written for Java, in Java
Relevant Link:
http://www.cs.princeton.edu/~appel/modern/java/JLex/
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html#SECTION1
6. PhpParser
PhpParser generates a pure Java parser for PHP programs. Invoking this parser yields an explicit parse tree suitable for further analysis. This package is based upon
1. JFlex 1.4.
2. Cup 0.10k
3. Grammar and lexer specifications of PHP 4.3..
0x1: Project settings for IntelliJ IDEA
1. Project > Language Level: 1.7
2. Modules > Sources: only src/java_cup, src/project, src/jFlex
0x2: Building and cleaning the project with Ant from within Eclipse
1. Project > Properties > Builders
2. Deactivate the Java Builder.
3. New ...
4. Select "Ant builder"
5. Name it "Ant build" or "PhpParser build" (or any other suitable name).
6. In the Main tab, select the build.xml in the project directory as Buildfile and the project directory as Base directory.
7. In the Targets tab for "Manual build", select "build".
8. In the Targets tab for "During a clean", select "cleanall".
9. OK the changes for both dialogs.
10. Build the project using Project > Build Project and clean it using Project > Clean ...
//Or you can build the project using the command line from within the project main directory
ant build
build.xml
<project name="PhpParser" basedir="." default="build">

    <!-- PROPERTIES *************************************************************-->

    <!-- java/javac properties -->
    <property name="src.dir" value="src"/>
    <property name="src.project.dir" value="${src.dir}/project"/>
    <property name="src.spec.dir" value="${src.dir}/spec"/>
    <property name="src.jflex.dir" value="${src.dir}/JFlex"/>
    <property name="src.cup.dir" value="${src.dir}/java_cup"/>

    <property name="build.dir" value="build"/>
    <property name="build.java.dir" value="${build.dir}/java"/>
    <property name="build.class.dir" value="${build.dir}/class"/>

    <property name="lexparse.package" value="at.ac.tuwien.infosys.www.phpparser"/>
    <property name="lexparse.dir" value="${build.java.dir}/at/ac/tuwien/infosys/www/phpparser"/>

    <property name="javadoc.dir" value="doc/html"/>
    <property name="javadoc.lexparse.dir" value="${javadoc.dir}/phpparser"/>

    <!-- lexer generator and generated lexer -->
    <property name="lexgen.main" value="JFlex.Main"/>
    <property name="lexgen.input" value="${src.spec.dir}/php.jflex"/>
    <!-- the lexer name is specified with the %class option in the input file -->
    <property name="lexer.name" value="PhpLexer"/>
    <property name="lexer.source" value="${lexer.name}.java"/>
    <property name="lexer.class" value="${lexer.name}.class"/>

    <!-- parser generator and generated parser -->
    <property name="parsegen.main" value="java_cup.Main"/>
    <property name="parsegen.input" value="${src.spec.dir}/php.cup"/>
    <!-- CAUTION: when changing this property, consult the parser generator's input file first -->
    <property name="parser.name" value="PhpParser"/>
    <property name="parser.source" value="${parser.name}.java"/>
    <property name="parser.sym.name" value="PhpSymbols"/>
    <property name="parser.sym.source" value="${parser.sym.name}.java"/>

    <!-- classpath -->
    <path id="classpath">
        <pathelement location="${build.class.dir}"/>
        <!-- -necessary because of JFlex Messages bundle -->
        <pathelement location="${src.jflex.dir}"/>
    </path>

    <!-- TARGETS ****************************************************************-->

    <target name="cup" description="Compiles the modified Cup.">
        <mkdir dir="${build.class.dir}"/>
        <javac srcdir="${src.cup.dir}" destdir="${build.class.dir}" debug="on" includeantruntime="false">
            <compilerarg line="-encoding GBK"/>
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="jflex" description="Compiles the modified JFlex.">
        <javac srcdir="${src.jflex.dir}" destdir="${build.class.dir}" debug="on" includeantruntime="false">
            <compilerarg line="-encoding GBK"/>
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="lexer.source" depends="cup,jflex"
            description="Uses the lexer generator to create a Java lexer from the input file.">
        <mkdir dir="${lexparse.dir}"/>
        <java classname="${lexgen.main}" fork="yes">
            <arg value="${lexgen.input}"/>
            <arg value="-d"/>
            <arg value="${lexparse.dir}"/>
            <classpath refid="classpath"/>
        </java>
    </target>

    <target name="parser.source" depends="cup"
            description="Uses the parser generator to create a Java parser from the input file.">
        <mkdir dir="${lexparse.dir}"/>
        <java classname="${parsegen.main}" fork="yes">
            <arg value="-parser"/>
            <arg value="${parser.name}"/>
            <arg value="-symbols"/>
            <arg value="${parser.sym.name}"/>
            <arg value="-nonterms"/>
            <arg value="-expect"/>
            <arg value=""/>
            <arg value="${parsegen.input}"/>
            <classpath refid="classpath"/>
        </java>
        <move file="${basedir}/${parser.source}" todir="${lexparse.dir}"/>
        <move file="${basedir}/${parser.sym.source}" todir="${lexparse.dir}"/>
    </target>

    <target name="javac"
            description="Internal target for Java development. Doesn't try to generate lexer and parser.">
        <mkdir dir="${build.class.dir}"/>
        <javac destdir="${build.class.dir}" debug="on" includeantruntime="false">
            <compilerarg line="-encoding GBK"/>
            <src>
                <pathelement path="${src.project.dir}"/>
                <pathelement path="${build.java.dir}"/>
            </src>
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="javadoc" depends="javac" description="Generates JavaDoc.">
        <javadoc destdir="${javadoc.lexparse.dir}" packagenames="${lexparse.package}" Windowtitle="PhpParser 1.0">
            <sourcepath>
                <pathelement path="${src.project.dir}"/>
                <pathelement path="${build.java.dir}"/>
            </sourcepath>
            <classpath refid="classpath"/>
        </javadoc>
    </target>

    <target name="build" depends="lexer.source,parser.source,javac,javadoc"
            description="Builds the whole project together with the generated lexer and parser."/>

    <target name="clean" description="Cleans up.">
        <delete dir="${build.java.dir}"/>
        <delete dir="${build.class.dir}"/>
        <delete dir="${graphs.dir}"/>
        <delete file="${jar.file}"/>
    </target>

    <target name="cleanall" depends="clean" description="Cleans up JFlex, Cup and JavaDoc as well.">
        <delete dir="${lib.dir}/JFlex"/>
        <delete dir="${lib.dir}/java_cup"/>
        <delete dir="${javadoc.dir}"/>
    </target>

    <target name="dist">
        <mkdir dir="dist"/>
    </target>

    <target name="help">
        <echo message="You probably want to do 'ant build'. Otherwise, type 'ant -projecthelp' for help."/>
    </target>

</project>
The build produces a Java lexer and parser engine that we can instantiate and call directly from our own code.
0x3: Usage
Example.java
import at.ac.tuwien.infosys.www.phpparser.*;
import java.io.*;
import java.util.*;

class Example {

    public static void main(String[] args) {

        if (args.length == 0) {
            System.out.println("Please specify one or more PHP files to be parsed.");
            System.exit(1);
        }

        for (int i = 0; i < args.length; i++) {

            String fileName = args[i];

            // Lex and parse the file into a parse tree.
            ParseTree parseTree = null;
            try {
                PhpParser parser = new PhpParser(new PhpLexer(new FileReader(fileName)));
                ParseNode rootNode = (ParseNode) parser.parse().value;
                parseTree = new ParseTree(rootNode);
            } catch (FileNotFoundException e) {
                System.err.println("File not found: " + fileName);
                System.exit(1);
            } catch (Exception e) {
                System.err.println("Error parsing " + fileName);
                System.err.println(e.getMessage());
                e.printStackTrace();
                System.exit(1);
            }

            // The leaves of the parse tree are the tokens, in source order.
            System.out.println("*** Printing tokens for file " + fileName + "...");
            for (Iterator iter = parseTree.leafIterator(); iter.hasNext(); ) {
                ParseNode leaf = (ParseNode) iter.next();
                System.out.println(leaf.getLexeme());
            }
        }
    }
}
Compile
pushd D:\eclipse-javaEE\workspace\phpparser\doc\example
javac -classpath ../../build/class Example.java
Run (on Windows, use ; instead of : as the classpath separator)
java -classpath ../../build/class:. Example test1.php test2.php
0x4: Directory layout
build.xml
README
build
    class: generated java class files
    java: generated java source files (PHP Lexer and Parser)
doc
    various documentation files
src
    java_cup: modified version of the Cup parser generator
    jflex: modified version of the JFlex scanner generator
    project: parse tree data structures
    spec: specification (input) files for Cup and JFlex
Relevant Link:
https://github.com/oliverklee/phpparser/blob/master/src/spec/php.cup
https://github.com/oliverklee/phpparser/blob/master/src/spec/php.jflex
https://github.com/oliverklee/phpparser
Copyright (c) 2015 LittleHann All rights reserved