Simple Matching

LPeg is a powerful notation for matching text data, which is more capable than Lua string patterns and standard regular expressions. However, like any language you need to know the basic words and how to combine them.

The best way to learn is to play with patterns in an interactive session, first by defining some shortcuts:

$ lua -llpeg
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> match = lpeg.match -- match a pattern against a string
> P = lpeg.P -- match a string literally
> S = lpeg.S -- match anything in a set
> R = lpeg.R -- match anything in a range

If you dont' want to create shorcuts manually, you can do this:

> setmetatable(_ENV or _G, { __index = lpeg or require"lpeg" })
    

I don't recommend doing this in serious code, but to explore LPeg, it is very convenient.

Matches occur against the start of the string, and successful matches return the position immediately after the successful match, or nil if unsuccesful. (Here I'm using the fact that f'x' is equivalent to f('x') in Lua; using single quotes has the same meaning as double quotes.)

> = match(P'a','aaa')
2
> = match(P'a','123')
nil

It works like string.find, except it only returns one index.

You can match against ranges or sets of characters:

> = match(R'09','123')
2
> = match(S'123','123')
2

Matching more than one item is done with the ^ operator. In this case, the match is equivalent to the Lua pattern '^a+' - one or more occurrances of 'a':

> = match(P'a'^1,'aaa')
4

Combining patterns in order is done with the * operator. This is equivalent to '^ab*' - one 'a' followed by zero or more 'b's:

> = match(P'a'*P'b'^0,'abbc')
4

So far, lpeg is giving us a more verbose way of expressing regular expressions, but these patterns are composible - they can be easily built up from simpler patterns, without awkward string operations. In this way, lpeg patterns can be made to be easier to read than their equivalent regular expressions. Note that you can often leave out an explicit P call when constructing patterns, if one of the arguments is already a pattern:

> maybe_a = P'a'^-1  -- one or zero matches of 'a'
> match_ab = maybe_a * 'b'
> = match(match_ab, 'ab')
3
> = match(match_ab, 'b')
2
> = match(match_ab, 'aaab')
nil

The + operator means either one or the other pattern:

> either_ab = (P'a' + P'b')^1 -- sequence of either 'a' or 'b'
> = either_ab:match 'aaa'
4
> = either_ab:match 'bbaa'
5

Note that the pattern object has a match method!

Of course, S'ab'^1 would be a shorter way to say this, but the arguments here can be arbitary patterns.

Basic Captures

Getting the index after a match is all very well, and you can then use string.sub to extract the strings. But there are ways of explicitly asking for captures:

> C = lpeg.C  -- captures a match
> Ct = lpeg.Ct -- a table with all captures from the pattern

The first is equivalent to how '(...)' is used in Lua patterns (or '\(...\)' in regular expressions)

> digit = R'09' -- anything from '0' to '9'
> digits = digit^1 -- a sequence of at least one digit
> cdigits= C(digits) -- capture digits
> = cdigits:match '123'
123

So to get the string value, enclose the pattern in C.

This pattern doesn't cover a general integer, which may have a '+' or '-' up front:

> int = S'+-'^-1 * digits
> = match(C(int),'+23')
+23

Unlike with Lua patterns or regular expressions, you don't have to worry about escaping 'magic' characters - every character in a string stands for itself: '(','+','*', etc are just their ASCII equilvalents.

A special kind of capture is provided by the / operator - it passes the captured string through a function or a table. Here I'm adding one to the result, just to show that the result has been converted into a number with tonumber:

> =  match(int/tonumber,'+123') + 1
124

Note that multiple captures can be returned by a match, just like string.match. This is equivalent to '^(a+)(b+)':

> = match(C(P'a'^1) * C(P'b'^1), 'aabbbb')
aa bbbb

Building more complicated Patterns

Consider general floating-point numbers:

> function maybe(p) return p^-1 end
> digits = R'09'^1
> mpm = maybe(S'+-')
> dot = '.'
> exp = S'eE'
> float = mpm * digits * maybe(dot*digits) * maybe(exp*mpm*digits)
> = match(C(float),'2.3')
2.3
> = match(C(float),'-2')
-2
> = match(C(float),'2e-02')
2e-02

This lpeg pattern is easier to read than the regular expression equivalent '[-+]?[0-9]+\.?[0-9]+([eE][+-]?[0-9]+)?'; shorter is always better! One reason is that we can work with patterns as expressions: factor out common patterns, write functions for convenience and clarity, etc. Note that there is no penalty for writing things out in this fashion; lpeg remains a very fast way to parse text!

More complicated structures can be composed from these building blocks. Consider the task of parsing a list of floating point numbers. A list is a number followed by zero or more groups consisting of a comma and a number:

> listf = C(float) * (',' * C(float))^0
> = listf:match '2,3,4'
2 3 4

That's cool, but it would be even cooler to have this as an actual list. This is where lpeg.Ct comes in; it collects all the captures within a pattern into a table.

= match(Ct(listf),'1,2,3')
table: 0x84fe628

Stock Lua does not pretty-print tables, but you can use [? Microlight] for this job:

> tostring = require 'ml'.tstring
> = match(Ct(listf),'1,2,3')
{"1","2","3"}

The values are still strings. It's better to write listf so that it converts its captures:

> floatc = float/tonumber
> listf = floatc * (',' * floatc)^0

This way of capturing lists is very general, since you can put any expression that captures in the place of floatc. But this list pattern is still too restrictive, because generally we want to ignore whitespace

> sp = P' '^0  -- zero or more spaces (like '%s*')
> function space(pat) return sp * pat * sp end -- surrond a pattern with optional space
> floatc = space(float/tonumber)
> listc = floatc * (',' * floatc)^0
> = match(Ct(listc),' 1,2, 3')
{1,2,3}

It's a matter of taste, but here I prefer to allow optional space around the items, rather than allowing space specifically around the delimiter ','.

With lpeg, we can be programmers again with pattern matching, and reuse patterns:

function list(pat)
pat = space(pat)
return pat * (',' * pat)^0
end

So, a list of identifiers (according to the usual rules):

> idenchar = R('AZ','az')+P'_'
> iden = idenchar * (idenchar+R'09')^0
> = list(C(iden)):match 'hello, dolly, _x, s23'
"hello" "dolly" "_x" "s23"

Using explicit ranges seems old-fashioned and error-prone. A more portable solution is to use the lpeg equivalent of character classes, which are by definition locale-independent:

> l = {}
> lpeg.locale(l)
> for k in pairs(l) do print(k) end
"punct"
"alpha"
"alnum"
"digit"
"graph"
"xdigit"
"upper"
"space"
"print"
"cntrl"
"lower"
> iden = (l.alpha+P'_') * (l.alnum+P'_')^0

Given this definition of list, it's easy to define a simple subset of the common CSV format, where each record is a list separated by a linefeed:

> rlistf =  list(float/tonumber)
> csv = Ct( (Ct(listf)+'\n')^1 )
> = csv:match '1,2.3,3\n10,20, 30\n'
{{1,2.3,3},{10,20,30}}

One good reason to learn lpeg is that it performs very satisfactorily. This pattern is a lot faster than parsing the data with Lua string matching.

String Substitution

I will show that lpeg can do all that string.gsub can do, and more generally and flexibly.

One operator that we have not used yet is -, which means 'either/or'. Consider the problem of matching double-quoted strings. In the simplest case, they are a double-quote followed by any characters which are not a double-quote, followed by a closing double-quote. P(1) matches any single character, i.e. it is the equivalent of '.' in string patterns. A string may be empty, so we match zero or more non-quote characters:

> Q = P'"'
> str = Q * (P(1) - Q)^0 * Q
> = C(str):match '"hello"'
"\"hello\""

Or you may want to extract the contents of the string, without quotes. In this context, just using 1 instead of P(1) is not ambiguous, and in fact this is how you will usually see this 'any x which is not a P' pattern:

> str2 = Q * C((1 - Q)^0) * Q
> = str2:match '"hello"'
"hello"

This pattern is obviously generalizable; often the terminating pattern is not the same as the final pattern:

function extract_quote(openp,endp)
openp = P(openp)
endp = endp and P(endp) or openp
local upto_endp = (1 - endp)^1
return openp * C(upto_endp) * endp
end
> return  extract_quote('(',')'):match '(and more)'
"and more"
> = extract_quote('[[',']]'):match '[[long string]]'
"long string"

Now consider translating Markdown code (back-slash enclosed text) into the format understood by the Lua wiki (double-brace enclosed text). The naive way is to extract the string and concatenate the result, but this is clumsy and (as we will see) limit our options tremendously.

function subst(openp,repl,endp)
openp = P(openp)
endp = endp and P(endp) or openp
local upto_endp = (1 - endp)^1
return openp * C(upto_endp)/repl * endp
end
> =  subst('`','{{%1}}'):match '`code`'
"{{code}}"
> = subst('_',"''%1''"):match '_italics_'
"''italics''"

We've come across the capture-processing operator / before, using tonumber to convert numbers. It also understands strings in a very similar format to string.gsub, where %n means the n-th capture.

This operation can be expressed exactly as:

> = string.gsub('_italics_','^_([^_]+)_',"''%1''")
"''italics''"

But the advantage is that we don't have to build up a custom string pattern and worry about escaping 'magic' characters like '(' and ').

lpeg.Cs is a substitution capture, and it provides a more general module of global string substitution. In the lpeg manual, there is this equivalent to string.gsub:

function gsub (s, patt, repl)
patt = P(patt)
local p = Cs ((patt / repl + 1)^0)
return p:match(s)
end > = gsub('hello dog, dog!','dog','cat')
"hello cat, cat!"

To understand the difference, here's that pattern using plain C:

> p = C((P'dog'/'cat' + 1)^0)
> = p:match 'hello dog, dog!'
"hello dog, dog!" "cat" "cat"

The C here just captures the whole match, and each '/' adds a new capture with the value of the replacement string.

With Cs, everything gets captured, and a string is built out of all the captures. Some of those captures get modified by '/', and so we have substitutions.

In Markdown, block quoted lines begin with '> '.

lf = P'\n'
rest_of_line_nl = C((1 - lf)^0*lf) -- capture chars upto \n
quoted_line = '> '*rest_of_line_nl -- block quote lines start with '> '
-- collect the quoted lines and put inside [[[..]]]
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n" > = quote:match '> hello\n> dolly\n'
"[[[
> hello
> dolly
]]]
"

That's not quite right - Cs captures everything, including the '> '. But we can force some captures to return empty strings: }}}

function empty(p)
return C(p)/''
end quoted_line = empty ('> ') * rest_of_line_nl
...

Now things will work correctly!

Here is the program used to convert this document from Markdown to Lua wiki format:

local lpeg = require 'lpeg'

local P,S,C,Cs,Cg = lpeg.P,lpeg.S,lpeg.C,lpeg.Cs,lpeg.Cg

local test = [[
## A title here _we go_ and `a:bonzo()`: one line
two line
three line and `more_or_less_something` [A reference](http://bonzo.dog) > quoted
> lines ]] function subst(openp,repl,endp)
openp = P(openp) -- make sure it's a pattern
endp = endp and P(endp) or openp
-- pattern is 'bracket followed by any number of non-bracket followed by bracket'
local contents = C((1 - endp)^1)
local patt = openp * contents * endp
if repl then patt = patt/repl end
return patt
end function empty(p)
return C(p)/''
end lf = P'\n'
rest_of_line = C((1 - lf)^1)
rest_of_line_nl = C((1 - lf)^0*lf) -- indented code block
indent = P'\t' + P' '
indented = empty(indent)*rest_of_line_nl
-- which we'll assume are Lua code
block = Cs(indented^1)/' [[[!Lua\n%1]]]\n' -- use > to get simple quoted block
quoted_line = empty('> ')*rest_of_line_nl
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n" code = subst('`','{{%1}}')
italic = subst('_',"''%1''")
bold = subst('**',"'''%1'''")
rest_of_line = C((1 - lf)^1)
title1 = P'##' * rest_of_line/'=== %1 ==='
title2 = P'###' * rest_of_line/'== %1 ==' url = (subst('[',nil,']')*subst('(',nil,')'))/'[%2 %1]' item = block + title1 + title2 + code + italic + bold + quote + url + 1
text = Cs(item^1) if arg[1] then
local f = io.open(arg[1])
test = f:read '*a'
f:close()
end print(text:match(test))

Due to an escaping problem with this Wiki, I had to substitute '[' for '{', etc in this source. Be warned!

SteveDonovan, 12 June 2012


Group and back captures

This section will dissect the behavior of group and back captures (Cg() and Cb() respectively).

Group captures (hereafter "groups") come in two flavors: named and anonymous.

    Cg(C"baz" * C"qux", "name") -- named group.

    Cg(C"foo" * C"bar")         -- anonymous group.
    

Let's first get the easy one out of the way: named groups inside table captures.

    Ct(Cc"foo" * Cg(Cc"bar" * Cc"baz", "TAG")* Cc"qux"):match""
--> { "foo", "qux", TAG = "bar" }

In a table capture, the value of the first capture inside the group ("bar") is assigned to the corresponding key ("TAG") in the table. As you can see, Cc"baz" got lost in the process. The label must be a string (or a number that will be automatically converted to a string).

Note that the group must be a direct child of the table, otherwise, the table capture will not handle it:

    Ct(C(Cg(1,"foo"))):match"a"
--> {"a"}

Of captures and values

Before delving into groups proper, we must first explore a subtlety in the way captures handle their subcaptures.

Some captures operate on the values produced by their subcaptures, while others operate on the capture objects. This is sometimes counter-intuitive.

Let's take the following pattern:

    (1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "bc", "b", "c"

As you can see, it inserts three values in the capture stream.

Let's wrap it in a table capture:

    Ct(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> { "bc", "b", "c" }

Ct() operates on values. In the last example, the three values that are inserted in order in the table.

Now, let's try a substitution capture:

    Cs(1 * C( C"b" * C"c" ) * 1):match"abcd"
--> "abcd"

Cs() operates on captures. It scans the first level of its nested captures, and only takes the first value of each one. In the above example, "b" and "c" are thus discarded. Here's another example that may make things more clear:

    function the_func (bcd)
assert(bcd == "bcd")
return "B", "C", "D"
end Ct(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> {"B", "C", "D"} -- All values are inserted. Cs(1 * ( C"bcd" / the_func ) * 1):match"abcde"
--> "aBe" -- the "C" and "D" have been discarded.

A more detailed account of the by value / by capture behaviour of each kind of capture will be the topic of another section.


Capture opacity

Another important thing to realise is that most captures shadow their subcaptures, but some don't. As you can see in the last example, the value of C"bcd" is passed to the /function capture, but it doesn't end in the final capture list. Ct() and Cs() are also opaque in this regard. They only produce, resectively, one table or one string.

On the other hand, C() is transparent. As we've seen above, the subcaptures of C() are also inserted in the stream.

    C(C"b" * C"c"):match"bc" --> "bc", "b", "c"
    

The only transparent captures are C() and the anonymous Cg().


Anonymous groups

Cg() wraps its subcaptures in a single capture object, but doesn't produce anything of its own. Depending on the context, either all of its values will be inserted, or only the first one.

Here are a few examples for the anonymous groups:

    (1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "b", "c", "d" Ct(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> { "b", "c", "d" } Cs(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
--> "abe" -- "c" and "d" are dropped.

Where this behavior is useful? In folding captures.

Let's write a very basic calculator, that adds or substracts one digit numbers.

    function calc(a, op, b)
a, b = tonumber(a), tonumber(b)
if op == "+" then
return a + b
else
return a - b
end
end digit = R"09" calculate = Cf(
C(digit) * Cg( C(S"+-") * C(digit) )^0
, calc
)
calculate:match"1+2-3+4"
--> 4

The capture tree will look like this [*]:

    {"Cf", func = calc, children = {
{"C", val = "1"},
{"Cg", children = {
{"C", val = "+"},
{"C", val = "2"}
} },
{"Cg", children = {
{"C", val = "-"},
{"C", val = "3"}
} },
{"Cg", children = {
{"C", val = "+"},
{"C", val = "4"}
} }
} }

You probably see where this is going... Like Cs(), Cf() operates on capture objects. It will first extract the first value of the first capture, and use it as the initial value. If there are no more captures, this value becomes the value of the Cf().

But we have more captures. In our case, it will pass all the values of the second capture (the group) to calc(), tacked after the value of the first one. Here's the evaluation of the above Cf()

    first_arg = "1"
next_ones: "+", "2"
first_arg = calc("1", "+", "2") -- 3, calc() returns numbers next_ones: "-", "3"
first_arg = calc(3, "-", "3") next_ones: "+", "4"
first_arg = calc(0, "+", "4") return first_arg -- Tadaaaa.

[*] Actually, at match time, the capture objects only store their bounds and auxiliary data (like calc() for the Cf()). The actual values are produced sequencially after the match has completed, but, it makes things more clear as displayed above. In the above example, the values of the nested C() and Cg(C(),C()) are actually produced one at a time, at each corresponding cycle of the folding process.


Named groups

The ( named Cg() / Cb() ) pair has a behavior similar to the anonymous Cg(), but the values captured in the named Cg() are not inserted locally. They are teleported, and end up inserted in the stream at the place of the Cb().

Here's an example:

    ( 1 * Cg(C"bc", "FOOO") * C"d" * 1 * Cb"FOOO" * Cb"FOOO"):match"abcde"
-- > "d", "bc", "bc"

Warp... and duplication if there is more than one Cb(). Another example:

    ( 1 * Cg(C"b" * C"c" * C"d", "FOOO") * C"e" * Ct(Cb"FOOO") ):match"abcde"
--> "e", { "b", "c", "d" }

Usually, for the sake of clarty, in my code, I alias Cg() to Tag(). I use the former for anonymous groups, and the latter for named groups.

Cb"FOOO" will look back for a corresponding Cg() that has succeeded. It goes back and up in the tree, and consumes captures. In other words, it searches its elder siblings, and the elder siblings of its parents, but not the parents themselves. Neither does it test the children of the siblings/siblings of ancestors.

It proceeds as follows (start from the [ #### ] <--- [[ START ]] and follow the numbers back up).

The [ numbered ] captures are the captures that are tested on order. The ones marked with [ ** ] are not, for the various reasons listed. This is hairy, but AFAICT complete.

    Cg(-- [ ** ] ... This one would have been seen,
-- if the search hadn't stopped at *the one*.
"Too late, mate."
, "~@~"
) * Cg( -- [ 3 ] The search ends here. <--------------[[ Stop ]]
"This is *the one*!"
, "~@~"
) * Cg(-- [ ** ] ... The great grand parent.
-- Cg with the right tag, but direct ancestor,
-- thus not checked. Cg( -- [ 2 ] ... Cg, but not the right tag. Skipped.
Cg( -- [ ** ] good tag but masked by the parent (whatever its type)
"Masked"
, "~@~"
)
, "BADTAG"
) * C( -- [ ** ] ... grand parent. Not even checked. (
Cg( -- [ ** ] ... This subpattern will fail after Cg() succeeds.
-- The group is thus removed from the capture tree, and will
-- not be found dureing the lookup.
"FAIL"
, "~@~"
)
* false
) + Cmt( -- [ ** ] ... Direct parent. Not assessed.
C(1) -- [ 1 ] ... Not a Cg. Skip. * Cb"~@~" -- [ #### ] <----------------- [[ START HERE ]] --
, function(subject, index, cap1, cap2)
return assert(cap2 == "This is *the one*!")
end
)
)
, "~@~" -- [ ** ] This label goes with the great grand parent.
)

[转]http://lua-users.org/wiki/LpegTutorial的更多相关文章

  1. 【搬运工】——初识Lua(转)

    使用 Lua 编写可嵌入式脚本 Lua 提供了高级抽象,却又没失去与硬件的关联. 虽然编译性编程语言和脚本语言各自具有自己独特的优点,但是如果我们使用这两种类型的语言来编写大型的应用程序会是什么样子呢 ...

  2. [翻译]lpeg入门教程

    原文地址:http://lua-users.org/wiki/LpegTutorial 简单匹配 LPeg是一个用于文本匹配的有力表达方式,比Lua原生的字符串匹配和标准正则表达式更优异.但是,就像其 ...

  3. 025_lua脚本语言

    一.--cat /opt/nginx/conf/conf.dlua_package_path '/opt/nginx/conf/lua/?.lua;;'; --lua模块路径,其中”;;”表示默认搜索 ...

  4. 打印Lua的Table对象

    小伙伴们再也不用为打印lua的Table对象而苦恼了, 本人曾也苦恼过,哈哈 不过今天刚完成了这个东西, 以前在网上搜过打印table的脚本,但是都感觉很不理想,于是,自己造轮子了~ 打印的效果,自己 ...

  5. wireshark lua脚本

    1.目的:解析rssp2协议   2.如何使用wireshark lua插件 将编写的(假设为rssp2.lua)lua文本,放入wireshark 安装目录下,放哪里都行只要dofile添加了路径. ...

  6. lua UT测试工具

    luaunit Luaunit is a unit-testing framework for Lua, in the spirit of many others unit-testing frame ...

  7. LuaSrcDiet工具介绍(lua源码处理软件)

    Diet Food Diet (nutrition), the sum of the food consumed by an organism or group Dieting, the delibe ...

  8. Prototype based langue LUA

    Prototype-based programming https://en.wikipedia.org/wiki/Prototype-based_programming Prototype-base ...

  9. lua unit test introduction

    Unit Test Unit testing is about testing your code during development, not in production. Typically y ...

随机推荐

  1. [问题2015S06] 复旦高等代数 II(14级)每周一题(第七教学周)

    [问题2015S06]  设 \(V\) 是数域 \(\mathbb{K}\) 上的 \(n\) 维线性空间, \(\varphi\) 是 \(V\) 上的线性变换. (1) 求证: 对任一非零向量 ...

  2. C语言面试题(三)

    这篇主要聚焦在排序算法,包括常见的选择排序,插入排序,冒泡排序,快速排序.会对这四种排序的时间复杂度和空间复杂度进行探究. a.选择排序 int main(int argc,char **argv){ ...

  3. win7 用户目录

    robocopy "C:\Users" "D:\Users" /E /COPYALL /XJ /XD "C:\Users\Administrator& ...

  4. 从零开始HTML(一 2016/9/19)

    就是准备跟着W3C上的教程过一遍HTML啦,边看边记录更便于理解记忆吧~ 1.属性 HTML 标签可以拥有属性.属性提供了有关 HTML 元素的更多的信息.属性总是以名称/值对的形式出现,比如:nam ...

  5. socket详解(一)《转》

    在客户/服务器通信模式中, 客户端需要主动创建与服务器连接的 Socket(套接字), 服务器端收到了客户端的连接请求, 也会创建与客户连接的 Socket. Socket可看做是通信连接两端的收发器 ...

  6. Google Protocol Buffer的安装与.proto文件的定义

    什么是protocol Buffer呢? Google Protocol Buffer( 简称 Protobuf) 是 Google 公司内部的混合语言数据标准. 我理解的就是:它是一种轻便高效的结构 ...

  7. 【Todo】【读书笔记】机器学习-周志华

    书籍位置: /Users/baidu/Documents/Data/Interview/机器学习-数据挖掘/<机器学习_周志华.pdf> 一共442页.能不能这个周末先囫囵吞枣看完呢.哈哈 ...

  8. Windows上搭建hadoop开发环境

    前言 Windows下运行Hadoop,通常有两种方式:一种是用VM方式安装一个Linux操作系统,这样基本可以实现全Linux环境的Hadoop运行:另一种是通过Cygwin模拟Linux环境.后者 ...

  9. Linux 1

    1. 使用虚拟控制台  登录后按Alt+F2键这时又可以看到"login:"提示符, 这个就是第二个虚拟控制台. 一般新安装的Linux有四个虚拟控制台, 可以用Alt+F1~Al ...

  10. 博客打开慢?请禁用WordPress默认的谷歌字体!

    最近几天,谷歌中国挂了之后,发现我的博客打开极慢,原以为是空间问题,可一查,发现同台服务器的用户打开并不慢,排除了空间问题后,这边查询元素发现博客打开时加载了一个链接地址“fonts.googleap ...