现在有一个需求,比如给定如下数据:

0-0-0 0:0:0 #### the 68th annual golden globe awards ####  the king s speech earns 7 nominations  ####  <LOCATION>LOS ANGELES</LOCATION> <ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION> historical drama British king stammer beat competitors Tuesday grab seven nominations Golden Globe Awards nominations included best film drama nod contested award organizers said films competing best picture <ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION> earned nominations best performance actor olin <PERSON>Firth</PERSON> best performance actress <PERSON>Helena Bonham</PERSON> arter best supporting actor <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON> best screenplay <PERSON>David Seidler</PERSON> best movie score <ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION> earned nods apiece Black Swan Inception Kids Right tied place movie race nominations best motion picture comedy musical category <ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION> compete Nominated best actor motion picture olin <ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION> best actress motion picture nominees <PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON> Rabbit Hole <PERSON>Jennifer Lawrence</PERSON> <ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION> categories Glee nominee nods followed Rock Boardwalk Empire Dexter Good Wife Mad Men Modern Family Pillars Earth Temple <PERSON>Grandin</PERSON> tied nods apiece awards announced Jan 

  要求按行把<></>标签内的字符串中的空格替换成下划线_,并且将数据转换形式,例:<X>A B C</X>需要转换成A_B_C/X

  由于正则表达式匹配是贪婪模式,即尽可能匹配到靠后,那么就非常麻烦,而且仅仅是用?是无法真正保证是非贪婪的。所以需要在正则匹配时给之前匹配好的字符串标一个名字。

python下,正则最终写出来是这样:

 LABEL_PATTERN = re.compile('(<(?P<label>\S+)>.+?</(?P=label)>)')

  接下来我们需要做是在原字符串中找出对应的子串,并且记下他们的位置,接下来就是预处理出需要替换成的样子,再用一个正则就好了。

 LABEL_CONTENT_PATTERN = re.compile('<(?P<label>\S+)>(.*?)</(?P=label)>')

  对字符串集合做整次的map,对每一个字符串进行匹配,再吧这两部分匹配结果zip在一起,就可以获得一个start-end的tuple,大致这样。

 ('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')

  处理的代码如下:

 def read_file(path):
if not os.path.exists(path):
print 'path : \''+ path + '\' not find.'
return []
content = ''
try:
with open(path, 'r') as fp:
content += reduce(lambda x,y:x+y, fp)
finally:
fp.close()
return content.split('\n') def get_label(each):
pair = zip(LABEL_PATTERN.findall(each),
map(lambda x: x[1].replace(' ', '_')+'/'+x[0], LABEL_CONTENT_PATTERN.findall(each)))
return map(lambda x: (x[0][0], x[1]), pair) src = read_file(FILE_PATH)
pattern = map(get_label, src)

  接下来简单处理以下就好:

 for i in range(0, len(src)):
for pat in pattern[i]:
src[i] = re.sub(pat[0], pat[1], src[i])

  所有代码:

 # -*- coding: utf-8 -*-
import re
import os # FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
LABEL_PATTERN = re.compile('(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile('<(?P<label>\S+)>(.*?)</(?P=label)>') def read_file(path):
if not os.path.exists(path):
print 'path : \''+ path + '\' not find.'
return []
content = ''
try:
with open(path, 'r') as fp:
content += reduce(lambda x,y:x+y, fp)
finally:
fp.close()
return content.split('\n') def get_label(each):
pair = zip(LABEL_PATTERN.findall(each),
map(lambda x: x[1].replace(' ', '_')+'/'+x[0], LABEL_CONTENT_PATTERN.findall(each)))
return map(lambda x: (x[0][0], x[1]), pair) src = read_file(FILE_PATH)
pattern = map(get_label, src) for i in range(0, len(src)):
for pat in pattern[i]:
src[i] = re.sub(pat[0], pat[1], src[i])

[Python正则表达式] 字符串中xml标签的匹配的更多相关文章

  1. python3.4学习笔记(十二) python正则表达式的使用,使用pyspider匹配输出带.html结尾的URL

    python3.4学习笔记(十二) python正则表达式的使用,使用pyspider匹配输出带.html结尾的URL实战例子:使用pyspider匹配输出带.html结尾的URL:@config(a ...

  2. python之字符串中有关%d,%2d,%02d的问题

    python之字符串中有关%d,%2d,%02d的问题 在python中,通过使用%,实现格式化字符串的目的.(这与c语言一致) 其中,在格式化整数和浮点数时可以指定是否补0和整数与小数的位数. 首先 ...

  3. js去除字符串中的标签

    var str="<p>js去除字符串中的标签</p>"; var result=str.replace(/<.*?>/ig,"&qu ...

  4. .NET获取Html字符串中指定标签的指定属性的值

    using System.Text; using System.Text.RegularExpressions; //以上为要用到的命名空间 /// <summary> /// 获取Htm ...

  5. python 判断字符串中是否只有中文字符

    python 判断字符串中是否只有中文字符 学习了:https://segmentfault.com/q/1010000007898150 def is_all_zh(s): for c in s: ...

  6. python判断字符串中是否包含子字符串

    python判断字符串中是否包含子字符串 s = '1234问沃尔沃434' if s.find('沃尔沃') != -1:     print('存在') else:     print('不存在' ...

  7. python 统计字符串中指定字符出现次数的方法

    python 统计字符串中指定字符出现次数的方法: strs = "They look good and stick good!" count_set = ['look','goo ...

  8. Python 去除字符串中的空行

    Python 去除字符串中的空行 mystr = 'adfa\n\n\ndsfsf' print("".join([s for s in mystr.splitlines(True ...

  9. Python访问字符串中的值

    Python访问字符串中的值: 1.可以使用索引下标进行访问,索引下标从 0 开始: # 使用索引下标进行访问,索引下标从 0 开始 strs = "ABCDEFG" print( ...

随机推荐

  1. C# 计时器

    一.Stopwatch 主要用于测试代码段使用了多少时间 使用方法: Stopwatch sw=new Stopwatch(); sw.Start(); ... sw.Stop(); Console. ...

  2. 【iCore3 双核心板_FPGA】例程十三:FSMC总线通信实验——复用地址模式

    实验指导书及代码包下载: http://pan.baidu.com/s/1nuYpI8x iCore3 购买链接: https://item.taobao.com/item.htm?id=524229 ...

  3. Apache按日切分日志

    apache按日切分日志,使用apache自带的rotatelogs切分 语法: rotatelogs [ -l ] logfile [ rotationtime [ offset ]] | [ fi ...

  4. 【C51】UART串口通信

    我们常需要单片机和其他模块进行通信,数据传输,常用的方式就是串口通信技术. 常用来 单片机<-->电脑,  单片机<-->单片机之间通信. 串行通信 versus 并行通信 并 ...

  5. Nginx下Magento伪静态规则,适用于LNMP一键包

    文件名为:magento.conf(下载),将其放在 /usr/local/nginx/conf/ 文件夹下 然后在 /usr/local/nginx/conf/vhost/www.yourname. ...

  6. zabbix监控交换机

    zabbix可以通过snmp协议监控交换机 前提: 交换机需要开启snmp协议,通过snmpwalk 可以抓取到数据就可以了 snmpwalk -v 2c -c public *.*.*.* 1.创建 ...

  7. SQL Server索引进阶第五篇:索引包含列 .

    包含列解析所谓的包含列就是包含在非聚集索引中,并且不是索引列中的列.或者说的更通俗一点就是:把一些底层数据表的数据列包含在非聚集索引的索引页中,而这些数据列又不是索引列,那么这些列就是包含列.同时,这 ...

  8. cocos2dx 3.x(动态改变精灵的背景图片)

    //更换精灵CCSprite的图片有两种方式. //直接通过图片更换 //使用setTexture(CCTexture2D*)函数,可以重新设置精灵类的纹理图片. // auto bg = Sprit ...

  9. Linux系统编程--文件IO操作

    Linux思想即,Linux系统下一切皆文件. 一.对文件操作的几个函数 1.打开文件open函数 int open(const char *path, int oflags); int open(c ...

  10. 使用ocr的自动备份还原ocr

    1.查看ocr自动备份ocrconfig -showbackup 2.停止所有节点的集群件 3.还原ocr文件ocrconfig -restore <file-name> 4.重启crs, ...