Handwritten Parsers & Lexers in Go

(Original post: https://blog.gopheracademy.com/advent-2014/parsers-lexers/)

In these days of web apps and REST APIs it seems that writing parsers is a dying art. You may think parsers are a complex undertaking reserved only for programming language designers, but I’d like to dispel this idea. Over the past few years I’ve written parsers for JSON, CSS3, and database query languages, and the more parsers I write, the more I love them.

The Basics

Let’s start off with the basics: what is a lexer and what is a parser? When we parse a language (or, technically, a “formal grammar”) we do it in two phases. First we break up a series of characters into tokens. For a SQL-like language these tokens may be “whitespace”, “number”, “SELECT”, etc. This process is called lexing (or tokenizing or scanning).

Take this simple SQL SELECT statement as an example:

SELECT * FROM mytable

When we tokenize this string we’d see it as:

`SELECT` • `WS` • `ASTERISK` • `WS` • `FROM` • `WS` • `STRING<"mytable">`

This process, called lexical analysis, is similar to how we break up words in a sentence when we read. These tokens then get fed to a parser, which performs syntactic analysis.

The parser’s job is to make sense of these tokens and make sure they’re in the right order. This is similar to how we derive meaning from combining words in a sentence. Our parser will construct an abstract syntax tree (AST) from our series of tokens and the AST is what our application will use.

In our SQL SELECT example, our AST may look like:

type SelectStatement struct {
    Fields    []string
    TableName string
}
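
For the SELECT statement above, the parser we build below will return a value equivalent to:

    &SelectStatement{
        Fields:    []string{"*"},
        TableName: "mytable",
    }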

Parser Generators

Many people use parser generators to automatically write a parser and lexer for them. There are many tools made to do this: lex, yacc, and ragel. There’s even a Go implementation of yacc built into the Go toolchain.

However, after using parser generators many times I’ve found them to be problematic. First, they involve learning a new language to declare your language format. Second, they’re difficult to debug. For example, try reading the Ruby language’s yacc file. Eek!

After watching a talk by Rob Pike on lexical scanning and reading the implementation of the go standard library package, I realized how much easier and simpler it is to hand-write your parser and lexer. Let’s walk through the process with a simple example.

Writing a Lexer in Go

Defining our tokens

Let’s start by writing a simple parser and lexer for SQL SELECT statements. First, we need to define what tokens we’ll allow in our language. We’ll only allow a small subset of the SQL language:

// Token represents a lexical token.
type Token int

const (
    // Special tokens
    ILLEGAL Token = iota
    EOF
    WS

    // Literals
    IDENT // fields, table_name

    // Misc characters
    ASTERISK // *
    COMMA    // ,

    // Keywords
    SELECT
    FROM
)

We’ll use these tokens to represent sequences of characters. For example, WS will represent one or more whitespace characters and IDENT will represent an identifier such as a field name or a table name.
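
When debugging a scanner it helps to print token names instead of raw integers. The original example doesn’t include this, but a minimal String() method is easy to sketch:

    // String returns a human-readable name for the token.
    // (Not part of the original example; added as a debugging convenience.)
    func (tok Token) String() string {
        switch tok {
        case ILLEGAL:
            return "ILLEGAL"
        case EOF:
            return "EOF"
        case WS:
            return "WS"
        case IDENT:
            return "IDENT"
        case ASTERISK:
            return "ASTERISK"
        case COMMA:
            return "COMMA"
        case SELECT:
            return "SELECT"
        case FROM:
            return "FROM"
        }
        return "UNKNOWN"
    }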

Defining character classes

It’s useful to define functions that will let us check the type of character. Here we’ll define three: one to check if a character is whitespace, one to check if it is a letter, and one to check if it is a digit (scanIdent() below relies on it):

func isWhitespace(ch rune) bool {
    return ch == ' ' || ch == '\t' || ch == '\n'
}

func isLetter(ch rune) bool {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
}

func isDigit(ch rune) bool {
    return ch >= '0' && ch <= '9'
}

It’s also useful to define an “EOF” rune so that we can treat EOF like any other character:

var eof = rune(0)

Scanning our input

Next we’ll want to define our Scanner type. This type will wrap our input reader with a bufio.Reader so we can peek ahead at characters. We’ll also add helper functions for reading and unreading characters from our underlying reader.

// Scanner represents a lexical scanner.
type Scanner struct {
    r *bufio.Reader
}

// NewScanner returns a new instance of Scanner.
func NewScanner(r io.Reader) *Scanner {
    return &Scanner{r: bufio.NewReader(r)}
}

// read reads the next rune from the buffered reader.
// Returns the rune(0) if an error occurs (or io.EOF is returned).
func (s *Scanner) read() rune {
    ch, _, err := s.r.ReadRune()
    if err != nil {
        return eof
    }
    return ch
}

// unread places the previously read rune back on the reader.
func (s *Scanner) unread() { _ = s.r.UnreadRune() }

The entry function into Scanner will be the Scan() method, which returns the next token and the literal string it represents:

// Scan returns the next token and literal value.
func (s *Scanner) Scan() (tok Token, lit string) {
    // Read the next rune.
    ch := s.read()

    // If we see whitespace then consume all contiguous whitespace.
    // If we see a letter then consume as an ident or reserved word.
    if isWhitespace(ch) {
        s.unread()
        return s.scanWhitespace()
    } else if isLetter(ch) {
        s.unread()
        return s.scanIdent()
    }

    // Otherwise read the individual character.
    switch ch {
    case eof:
        return EOF, ""
    case '*':
        return ASTERISK, string(ch)
    case ',':
        return COMMA, string(ch)
    }

    return ILLEGAL, string(ch)
}

This entry function starts by reading the first character. If the character is whitespace then it is consumed with all contiguous whitespace characters. If it’s a letter then it’s treated as the start of an identifier or keyword. Otherwise we’ll check to see if it’s one of our single character tokens.
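
To see the scanner in action, we can drive Scan() in a loop until we hit EOF. This is just a usage sketch, assuming the scanner code above lives in the same main package:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := NewScanner(strings.NewReader("SELECT * FROM mytable"))
        for {
            tok, lit := s.Scan()
            if tok == EOF {
                break
            }
            // Prints the token (its name, if you added the String() method
            // earlier) followed by the literal text it was scanned from.
            fmt.Printf("%v %q\n", tok, lit)
        }
    }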

Scanning contiguous characters

When we want to consume multiple characters in a row we can do this in a simple loop. Here in scanWhitespace() we’ll consume whitespace characters until we hit a non-whitespace character:

// scanWhitespace consumes the current rune and all contiguous whitespace.
func (s *Scanner) scanWhitespace() (tok Token, lit string) {
    // Create a buffer and read the current character into it.
    var buf bytes.Buffer
    buf.WriteRune(s.read())

    // Read every subsequent whitespace character into the buffer.
    // Non-whitespace characters and EOF will cause the loop to exit.
    for {
        if ch := s.read(); ch == eof {
            break
        } else if !isWhitespace(ch) {
            s.unread()
            break
        } else {
            buf.WriteRune(ch)
        }
    }

    return WS, buf.String()
}

The same logic can be applied to scanning our identifiers. Here in scanIdent() we’ll read all letters, digits, and underscores until we hit a different character:

// scanIdent consumes the current rune and all contiguous ident runes.
func (s *Scanner) scanIdent() (tok Token, lit string) {
    // Create a buffer and read the current character into it.
    var buf bytes.Buffer
    buf.WriteRune(s.read())

    // Read every subsequent ident character into the buffer.
    // Non-ident characters and EOF will cause the loop to exit.
    for {
        if ch := s.read(); ch == eof {
            break
        } else if !isLetter(ch) && !isDigit(ch) && ch != '_' {
            s.unread()
            break
        } else {
            _, _ = buf.WriteRune(ch)
        }
    }

    // If the string matches a keyword then return that keyword.
    switch strings.ToUpper(buf.String()) {
    case "SELECT":
        return SELECT, buf.String()
    case "FROM":
        return FROM, buf.String()
    }

    // Otherwise return as a regular identifier.
    return IDENT, buf.String()
}

This function also checks at the end if the literal string is a reserved word. If so then a specialized token is returned.
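
A side note on design: once a language has more than a handful of keywords, the switch gets unwieldy. A common alternative is a map lookup, which is also how the standard library’s go/token package implements its Lookup function. The sketch below is not part of the original example:

    // keywords maps uppercased literals to their keyword tokens.
    // (A hypothetical alternative to the switch above.)
    var keywords = map[string]Token{
        "SELECT": SELECT,
        "FROM":   FROM,
    }

    // lookupIdent returns the keyword token for lit, or IDENT otherwise.
    func lookupIdent(lit string) Token {
        if tok, ok := keywords[strings.ToUpper(lit)]; ok {
            return tok
        }
        return IDENT
    }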

Writing a Parser in Go

Setting up the parser

Once we have our lexer ready, parsing a SQL statement becomes easier. First let’s define our Parser:

// Parser represents a parser.
type Parser struct {
    s   *Scanner
    buf struct {
        tok Token  // last read token
        lit string // last read literal
        n   int    // buffer size (max=1)
    }
}

// NewParser returns a new instance of Parser.
func NewParser(r io.Reader) *Parser {
    return &Parser{s: NewScanner(r)}
}

Our parser simply wraps our scanner but also adds a buffer for the last read token. We’ll define helper functions for scanning and unscanning so we can use this buffer:

// scan returns the next token from the underlying scanner.
// If a token has been unscanned then read that instead.
func (p *Parser) scan() (tok Token, lit string) {
    // If we have a token on the buffer, then return it.
    if p.buf.n != 0 {
        p.buf.n = 0
        return p.buf.tok, p.buf.lit
    }

    // Otherwise read the next token from the scanner.
    tok, lit = p.s.Scan()

    // Save it to the buffer in case we unscan later.
    p.buf.tok, p.buf.lit = tok, lit

    return
}

// unscan pushes the previously read token back onto the buffer.
func (p *Parser) unscan() { p.buf.n = 1 }

Our parser also doesn’t care about whitespace at this point, so we’ll define a helper function to find the next non-whitespace token. Because the scanner collapses contiguous whitespace into a single WS token, skipping at most one token here is enough:

// scanIgnoreWhitespace scans the next non-whitespace token.
func (p *Parser) scanIgnoreWhitespace() (tok Token, lit string) {
    tok, lit = p.scan()
    if tok == WS {
        tok, lit = p.scan()
    }
    return
}

Parsing the input

Our parser’s entry function will be the Parse() method. This function will parse the next SELECT statement from the reader. If we had multiple statements in our reader then we could call this function repeatedly.

func (p *Parser) Parse() (*SelectStatement, error)

Let’s break this function down into small parts. First we’ll define the AST structure we want to return from our function:

stmt := &SelectStatement{}

Then we’ll make sure there’s a SELECT token. If we don’t see the token we expect then we’ll return an error to report the string we found instead.

if tok, lit := p.scanIgnoreWhitespace(); tok != SELECT {
    return nil, fmt.Errorf("found %q, expected SELECT", lit)
}

Next we want to parse a comma-delimited list of fields. In our parser we’re just considering identifiers and an asterisk as possible fields:

for {
    // Read a field.
    tok, lit := p.scanIgnoreWhitespace()
    if tok != IDENT && tok != ASTERISK {
        return nil, fmt.Errorf("found %q, expected field", lit)
    }
    stmt.Fields = append(stmt.Fields, lit)

    // If the next token is not a comma then break the loop.
    if tok, _ := p.scanIgnoreWhitespace(); tok != COMMA {
        p.unscan()
        break
    }
}

After our field list we want to see a FROM keyword:

// Next we should see the "FROM" keyword.
if tok, lit := p.scanIgnoreWhitespace(); tok != FROM {
    return nil, fmt.Errorf("found %q, expected FROM", lit)
}

Then we want to see the name of the table we’re selecting from. This should be an identifier token:

tok, lit := p.scanIgnoreWhitespace()
if tok != IDENT {
    return nil, fmt.Errorf("found %q, expected table name", lit)
}
stmt.TableName = lit

If we’ve gotten this far then we’ve successfully parsed a simple SQL SELECT statement so we can return our AST structure:

return stmt, nil

Congrats! You’ve just built a working parser!
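
To try it end to end, assemble the Parse() fragments above into a single method and call it. A minimal usage sketch, assuming the scanner and parser share one main package (imports: fmt, log, strings):

    func main() {
        in := "SELECT first_name, last_name FROM my_table"
        stmt, err := NewParser(strings.NewReader(in)).Parse()
        if err != nil {
            log.Fatal(err)
        }
        // Prints: fields=[first_name last_name] table=my_table
        fmt.Printf("fields=%v table=%s\n", stmt.Fields, stmt.TableName)
    }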

Diving in deeper

You can find the full source of this example (with tests) at:

https://github.com/benbjohnson/sql-parser

This parser example was heavily influenced by the InfluxQL parser. If you’re interested in diving deeper and understanding multiple statement parsing, expression parsing, or operator precedence then I encourage you to check out the repository:

https://github.com/influxdb/influxdb/tree/master/influxql

If you have any questions or just love chatting about parsers, please find me on Twitter at @benbjohnson.
