(Repost) shlex — Parse Shell-style Syntaxes
Original: https://pythoncaff.com/docs/pymotw/shlex-parse-shell-style-syntaxes/171
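Purpose: Lexical analysis of shell-style syntaxes.
The shlex module implements a class for parsing simple shell-like syntaxes. It can be used for writing a domain-specific language, or for parsing quoted strings (a task that is more complex than it seems on the surface).
Parsing Quoted Strings##
A common problem when working with input text is identifying a sequence of quoted words as a single entity. Splitting the text on quotes does not always work as expected, especially if there are nested levels of quotes. Take the following text as an example: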
This string has embedded "double quotes" and
'single quotes' in it, and even "a 'nested example'".
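A naive approach would be to construct a regular expression that finds the parts of the text outside the quotes and separates them from the text inside the quotes, or vice versa. That would be unnecessarily complex, and prone to errors caused by edge cases such as apostrophes or typos. A better solution is to use a true parser, such as the one provided by the shlex module. Here is a simple example that uses the shlex class to identify the tokens in an input file and print them: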
shlex_example.py
import shlex
import sys

if len(sys.argv) != 2:
    print('Please specify one filename on the command line.')
    sys.exit(1)

filename = sys.argv[1]
with open(filename, 'r') as f:
    body = f.read()
print('ORIGINAL: {!r}'.format(body))
print()

print('TOKENS:')
lexer = shlex.shlex(body)
for token in lexer:
    print('{!r}'.format(token))
When run on data with embedded quotes, the parser produces the list of expected tokens.
$ python3 shlex_example.py quotes.txt
ORIGINAL: 'This string has embedded "double quotes" and\n\'singl
e quotes\' in it, and even "a \'nested example\'".\n'
TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'
Isolated quotes such as apostrophes are handled the same way. Consider this input file:
This string has an embedded apostrophe, doesn't it?
The token with the embedded apostrophe is identified correctly.
$ python3 shlex_example.py apostrophe.txt
ORIGINAL: "This string has an embedded apostrophe, doesn't it?"
TOKENS:
'This'
'string'
'has'
'an'
'embedded'
'apostrophe'
','
"doesn't"
'it'
'?'
Making Safe Strings for Shells##
The quote() function performs the inverse operation, escaping existing quotes and adding missing quotes for strings to make them safe to use in shell commands.
shlex_quote.py
import shlex

examples = [
    "Embedded'SingleQuote",
    'Embedded"DoubleQuote',
    'Embedded Space',
    '~SpecialCharacter',
    r'Back\slash',
]

for s in examples:
    print('ORIGINAL : {}'.format(s))
    print('QUOTED : {}'.format(shlex.quote(s)))
    print()
It is still usually safer to use a list of arguments when using subprocess.Popen, but in situations where that is not possible quote() provides some protection by ensuring that special characters and white space are quoted properly.
$ python3 shlex_quote.py
ORIGINAL : Embedded'SingleQuote
QUOTED : 'Embedded'"'"'SingleQuote'
ORIGINAL : Embedded"DoubleQuote
QUOTED : 'Embedded"DoubleQuote'
ORIGINAL : Embedded Space
QUOTED : 'Embedded Space'
ORIGINAL : ~SpecialCharacter
QUOTED : '~SpecialCharacter'
ORIGINAL : Back\slash
QUOTED : 'Back\slash'
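As a small illustration of how quote() and split() mirror each other, the sketch below builds a command line from a list of arguments and then parses it back. The argument values are hypothetical and chosen only to include an apostrophe and a space; nothing is executed.

```python
import shlex

# Hypothetical argument list; the values are illustrative only.
args = ["grep", "-r", "it's here", "My Documents/notes.txt"]

# Quote each argument separately, then join the results with spaces.
command = " ".join(shlex.quote(arg) for arg in args)
print(command)

# Parsing the command line again recovers the original arguments.
round_trip = shlex.split(command)
print(round_trip == args)
```

Arguments made up only of safe characters, such as grep and -r, are left unquoted, while the others are wrapped in single quotes.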
Embedded Comments##
Since the parser is intended to be used with command languages, it needs to handle comments. By default, any text following a # is considered part of a comment and ignored. Due to the nature of the parser, only single-character comment prefixes are supported. The set of comment characters used can be configured through the commenters property.
$ python3 shlex_example.py comments.txt
ORIGINAL: 'This line is recognized.\n# But this line is ignored.
\nAnd this line is processed.'
TOKENS:
'This'
'line'
'is'
'recognized'
'.'
'And'
'this'
'line'
'is'
'processed'
'.'
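The commenters property can be changed before iterating over the lexer. The following sketch, using a made-up input string, switches the comment character to a semicolon. Once # is no longer a comment character, it comes back as an ordinary one-character token.

```python
import shlex

# Made-up input: ';' starts a comment here, '#' does not.
text = "keep this ; drop this\nkeep # too"

lexer = shlex.shlex(text)
lexer.commenters = ';'  # replace the default '#' comment prefix

tokens = list(lexer)
print(tokens)
```

Everything from the semicolon to the end of that line is discarded, and parsing resumes on the next line.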
Splitting Strings into Tokens##
To split an existing string into component tokens, the convenience function split() is a simple wrapper around the parser.
shlex_split.py
import shlex
text = """This text has "quoted parts" inside it."""
print('ORIGINAL: {!r}'.format(text))
print()
print('TOKENS:')
print(shlex.split(text))
The result is a list.
$ python3 shlex_split.py
ORIGINAL: 'This text has "quoted parts" inside it.'
TOKENS:
['This', 'text', 'has', 'quoted parts', 'inside', 'it.']
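split() also accepts a comments argument, which is off by default. The sketch below uses a hypothetical command line to show the difference. Without comment handling, the # character is returned as a token like any other.

```python
import shlex

# Hypothetical command line with a trailing comment.
line = 'ls -l "my file.txt" # list one file'

# Default: comment characters are not treated specially.
with_comment_text = shlex.split(line)
print(with_comment_text)

# comments=True drops everything from '#' to the end of the line.
without_comment_text = shlex.split(line, comments=True)
print(without_comment_text)
```

In both cases the quoted filename is returned as a single token with the quotes removed, because split() parses in POSIX mode by default.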
Including Other Sources of Tokens##
The shlex class includes several configuration properties that control its behavior. The source property enables a feature for code (or configuration) re-use by allowing one token stream to include another. This is similar to the Bourne shell source operator, hence the name.
shlex_source.py
import shlex
text = "This text says to source quotes.txt before continuing."
print('ORIGINAL: {!r}'.format(text))
print()
lexer = shlex.shlex(text)
lexer.wordchars += '.'
lexer.source = 'source'
print('TOKENS:')
for token in lexer:
    print('{!r}'.format(token))
The string "source quotes.txt" in the original text receives special handling. Since the source property of the lexer is set to "source", when the keyword is encountered, the filename given by the next token is automatically included. In order to cause the filename to appear as a single token, the . character needs to be added to the list of characters that are included in words (otherwise "quotes.txt" becomes three tokens, "quotes", ".", "txt"). This is what the output looks like.
$ python3 shlex_source.py
ORIGINAL: 'This text says to source quotes.txt before
continuing.'
TOKENS:
'This'
'text'
'says'
'to'
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'
'before'
'continuing.'
The source feature uses a method called sourcehook() to load the additional input source, so a subclass of shlex can provide an alternate implementation that loads data from locations other than files.
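As a minimal sketch of such a subclass, the example below resolves source names from an in-memory dictionary instead of the file system. The class name DictSourceLexer and the SNIPPETS mapping are hypothetical names introduced for illustration. The sketch relies on sourcehook() returning a tuple of a name and an open file-like object.

```python
import io
import shlex

# Hypothetical in-memory "files", keyed by the name used after 'source'.
SNIPPETS = {
    'greeting': 'hello world',
}


class DictSourceLexer(shlex.shlex):
    """Lexer that loads sourced input from SNIPPETS, not from disk."""

    def sourcehook(self, newfile):
        # Return (name, stream), the same shape as the default hook.
        return (newfile, io.StringIO(SNIPPETS[newfile]))


lexer = DictSourceLexer('start source greeting end')
lexer.source = 'source'

tokens = list(lexer)
print(tokens)
```

The tokens from the included snippet appear in place of the source keyword and its argument, just as with the file-based example.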
Controlling the Parser##
An earlier example demonstrated changing the wordchars value to control which characters are included in words. It is also possible to set the quotes attribute to use additional or alternative quote characters. Each quote must be a single character, so it is not possible to have different open and close quotes (no parsing on parentheses, for example).
shlex_table.py
import shlex
text = """|Col 1||Col 2||Col 3|"""
print('ORIGINAL: {!r}'.format(text))
print()
lexer = shlex.shlex(text)
lexer.quotes = '|'
print('TOKENS:')
for token in lexer:
    print('{!r}'.format(token))
In this example, each table cell is wrapped in vertical bars.
$ python3 shlex_table.py
ORIGINAL: '|Col 1||Col 2||Col 3|'
TOKENS:
'|Col 1|'
'|Col 2|'
'|Col 3|'
It is also possible to control the whitespace characters used to split words.
shlex_whitespace.py
import shlex
import sys

if len(sys.argv) != 2:
    print('Please specify one filename on the command line.')
    sys.exit(1)

filename = sys.argv[1]
with open(filename, 'r') as f:
    body = f.read()
print('ORIGINAL: {!r}'.format(body))
print()

print('TOKENS:')
lexer = shlex.shlex(body)
lexer.whitespace += '.,'
for token in lexer:
    print('{!r}'.format(token))
If the example in shlex_example.py is modified to treat period and comma as whitespace, the results change: those characters no longer appear as tokens.
$ python3 shlex_whitespace.py quotes.txt
ORIGINAL: 'This string has embedded "double quotes" and\n\'singl
e quotes\' in it, and even "a \'nested example\'".\n'
TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
'and'
'even'
'"a \'nested example\'"'
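The whitespace setting can also be replaced outright rather than extended. The sketch below is one possible way to split a comma-separated record while honoring quotes. It assumes POSIX mode and the whitespace_split attribute, neither of which is used in the example above, and the input record is invented.

```python
import shlex

# Invented comma-separated record; one field contains a quoted comma.
record = 'name=Ada Lovelace,role="admin,editor",active'

lexer = shlex.shlex(record, posix=True)
lexer.whitespace = ','         # only commas separate tokens
lexer.whitespace_split = True  # split on whitespace characters only

fields = list(lexer)
print(fields)
```

Because the space is no longer a whitespace character, it stays inside the first field, and the comma inside the double quotes does not split the second field.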
Error Handling##
When the parser encounters the end of its input before all quoted strings are closed, it raises ValueError. When that happens, it is useful to examine some of the properties maintained by the parser as it processes the input. For example, infile refers to the name of the file being processed (which might be different from the original file, if one file sources another). The lineno attribute reports the line when the error is discovered. The lineno is typically the end of the file, which may be far away from the first quote. The token attribute contains the buffer of text not already included in a valid token. The error_leader() method produces a message prefix in a style similar to Unix compilers, which enables editors such as emacs to parse the error and take the user directly to the invalid line.
shlex_errors.py
import shlex

text = """This line is ok.
This line has an "unfinished quote.
This line is ok, too.
"""

print('ORIGINAL: {!r}'.format(text))
print()

lexer = shlex.shlex(text)

print('TOKENS:')
try:
    for token in lexer:
        print('{!r}'.format(token))
except ValueError as err:
    first_line_of_error = lexer.token.splitlines()[0]
    print('ERROR: {} {}'.format(lexer.error_leader(), err))
    print('following {!r}'.format(first_line_of_error))
The example produces this output.
$ python3 shlex_errors.py
ORIGINAL: 'This line is ok.\nThis line has an "unfinished quote.
\nThis line is ok, too.\n'
TOKENS:
'This'
'line'
'is'
'ok'
'.'
'This'
'line'
'has'
'an'
ERROR: "None", line 4: No closing quotation
following '"unfinished quote.'
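The same ValueError is raised by the split() convenience function, so user input can be validated with a small wrapper. The helper name try_split below is hypothetical. It is a sketch of one way to report the problem without letting the exception propagate.

```python
import shlex


def try_split(line):
    """Return (tokens, None) on success or (None, message) on failure."""
    try:
        return shlex.split(line), None
    except ValueError as err:
        return None, str(err)


good = try_split("echo 'all quotes closed'")
bad = try_split('echo "unterminated')

print(good)
print(bad)
```

Returning the message as a value keeps the caller free to decide whether to prompt again, log the error, or abort.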
POSIX vs. Non-POSIX Parsing##
The default behavior for the parser is to use a backwards-compatible style that is not POSIX-compliant. For POSIX behavior, set the posix argument when constructing the parser.
shlex_posix.py
import shlex

# Raw strings keep the backslashes literal and avoid warnings
# about the invalid escape sequence \e.
examples = [
    'Do"Not"Separate',
    '"Do"Separate',
    r'Escaped \e Character not in quotes',
    r'Escaped "\e" Character in double quotes',
    r"Escaped '\e' Character in single quotes",
    r"Escaped '\'' \"\'\" single quote",
    r'Escaped "\"" \'\"\' double quote',
    "\"'Strip extra layer of quotes'\"",
]

for s in examples:
    print('ORIGINAL : {!r}'.format(s))
    print('non-POSIX: ', end='')

    non_posix_lexer = shlex.shlex(s, posix=False)
    try:
        print('{!r}'.format(list(non_posix_lexer)))
    except ValueError as err:
        print('error({})'.format(err))

    print('POSIX : ', end='')
    posix_lexer = shlex.shlex(s, posix=True)
    try:
        print('{!r}'.format(list(posix_lexer)))
    except ValueError as err:
        print('error({})'.format(err))

    print()
Here are a few examples of the differences in parsing behavior.
$ python3 shlex_posix.py
ORIGINAL : 'Do"Not"Separate'
non-POSIX: ['Do"Not"Separate']
POSIX : ['DoNotSeparate']
ORIGINAL : '"Do"Separate'
non-POSIX: ['"Do"', 'Separate']
POSIX : ['DoSeparate']
ORIGINAL : 'Escaped \\e Character not in quotes'
non-POSIX: ['Escaped', '\\', 'e', 'Character', 'not', 'in',
'quotes']
POSIX : ['Escaped', 'e', 'Character', 'not', 'in', 'quotes']
ORIGINAL : 'Escaped "\\e" Character in double quotes'
non-POSIX: ['Escaped', '"\\e"', 'Character', 'in', 'double',
'quotes']
POSIX : ['Escaped', '\\e', 'Character', 'in', 'double',
'quotes']
ORIGINAL : "Escaped '\\e' Character in single quotes"
non-POSIX: ['Escaped', "'\\e'", 'Character', 'in', 'single',
'quotes']
POSIX : ['Escaped', '\\e', 'Character', 'in', 'single',
'quotes']
ORIGINAL : 'Escaped \'\\\'\' \\"\\\'\\" single quote'
non-POSIX: error(No closing quotation)
POSIX : ['Escaped', '\\ \\"\\"', 'single', 'quote']
ORIGINAL : 'Escaped "\\"" \\\'\\"\\\' double quote'
non-POSIX: error(No closing quotation)
POSIX : ['Escaped', '"', '\'"\'', 'double', 'quote']
ORIGINAL : '"\'Strip extra layer of quotes\'"'
non-POSIX: ['"\'Strip extra layer of quotes\'"']
POSIX : ["'Strip extra layer of quotes'"]
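The split() function takes the same posix flag, with POSIX mode as its default. The short sketch below uses a hypothetical command line to show the most visible difference, which is whether the surrounding quotes are kept in the token.

```python
import shlex

# Hypothetical command line with one quoted argument.
line = 'run "my program" --flag'

posix_tokens = shlex.split(line)  # posix=True is the default
legacy_tokens = shlex.split(line, posix=False)

print(posix_tokens)
print(legacy_tokens)
```

POSIX mode strips the quotes, which is usually what is wanted when the tokens are passed on as program arguments. Non-POSIX mode preserves them as part of the token.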
See also#
- Standard library documentation for shlex
- cmd -- Tools for building interactive command interpreters.
- argparse -- Command line option parsing.
- subprocess -- Run commands after parsing the command line.
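All translations in this article are for learning and communication purposes only. When reposting, be sure to credit the translator and the source, and include a link to this article.
Our translation work follows the CC license. If our work infringes on your rights, please contact us promptly.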