[Python Regular Expressions] Matching XML tags in a string

Suppose we have the following requirement. Given data such as:

0-0-0 0:0:0 #### the 68th annual golden globe awards ####  the king s speech earns 7 nominations  ####  <LOCATION>LOS ANGELES</LOCATION> <ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION> historical drama British king stammer beat competitors Tuesday grab seven nominations Golden Globe Awards nominations included best film drama nod contested award organizers said films competing best picture <ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION> earned nominations best performance actor olin <PERSON>Firth</PERSON> best performance actress <PERSON>Helena Bonham</PERSON> arter best supporting actor <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON> best screenplay <PERSON>David Seidler</PERSON> best movie score <ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION> earned nods apiece Black Swan Inception Kids Right tied place movie race nominations best motion picture comedy musical category <ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION> compete Nominated best actor motion picture olin <ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION> best actress motion picture nominees <PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON> Rabbit Hole <PERSON>Jennifer Lawrence</PERSON> <ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION> categories Glee nominee nods followed Rock Boardwalk Empire Dexter Good Wife Mad Men Modern Family Pillars Earth Temple <PERSON>Grandin</PERSON> tied nods apiece awards announced Jan

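The task is to process the data line by line: inside every <X>...</X> tag pair, replace the spaces in the enclosed string with underscores (_), then rewrite the result in a new form. For example, <X>A B C</X> becomes A_B_C/X.

Regular-expression quantifiers are greedy by default, meaning they match as far to the right as possible, which makes this awkward. Adding ? to make a quantifier non-greedy is not enough on its own either: it stops at the first closing tag it finds, with no guarantee that this tag has the same name as the opening one. So we give the already-matched tag name a group name and refer back to it when matching the closing tag.

In Python, the final regular expression looks like this: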

 LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
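To make the difference concrete, here is a minimal sketch comparing a greedy pattern, a non-greedy pattern, and a pattern with a named backreference. The sample strings and variable names are made up for illustration, and the tag name is matched with \w+ instead of the post's \S+ purely to keep the demonstration easy to follow.

```python
import re

text = '<A>x y</A> mid <B>z</B>'

# Greedy: .+ runs on to the LAST closing tag and swallows both elements.
greedy = re.findall(r'<\w+>.+</\w+>', text)

# Non-greedy: .+? stops at the first closing tag, giving one match per element.
lazy = re.findall(r'<\w+>.+?</\w+>', text)

# But non-greedy alone accepts ANY closing tag, even one with a different name.
mixed = '<A>x</B> y</A>'
unpaired = re.findall(r'<\w+>.+?</\w+>', mixed)

# Named group + backreference: the closing tag must repeat the opening name.
paired_pattern = re.compile(r'<(?P<label>\w+)>.+?</(?P=label)>')
paired = [m.group(0) for m in paired_pattern.finditer(mixed)]

print(greedy)    # ['<A>x y</A> mid <B>z</B>']
print(lazy)      # ['<A>x y</A>', '<B>z</B>']
print(unpaired)  # ['<A>x</B>']
print(paired)    # ['<A>x</B> y</A>']
```

The last two results show why the backreference matters: only (?P=label) forces the match to end at the closing tag that belongs to the opening one.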

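Next, we find the matching substrings in the original string and record them. Then we precompute what each one should be replaced with, which takes one more regular expression that captures the tag name and the content separately:

 LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

Map over the whole collection of strings, match each string with both patterns, then zip the two sets of results together. This gives a list of (original, replacement) tuples, roughly like this: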
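As a quick check of what the two patterns return, the sketch below runs findall on a short sample line. The sample line and the variable names whole and parts are made up for illustration; the patterns are the ones from the post, written as raw strings.

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

line = 'in <LOCATION>LOS ANGELES</LOCATION> actor <PERSON>Firth</PERSON> won'

# With more than one group, findall returns one tuple per match, holding the
# groups in order. Here: (whole tagged substring, tag name).
whole = LABEL_PATTERN.findall(line)

# Here: (tag name, content between the tags).
parts = LABEL_CONTENT_PATTERN.findall(line)

print(whole)
# [('<LOCATION>LOS ANGELES</LOCATION>', 'LOCATION'), ('<PERSON>Firth</PERSON>', 'PERSON')]
print(parts)
# [('LOCATION', 'LOS ANGELES'), ('PERSON', 'Firth')]
```

The first pattern supplies the text to search for, and the second supplies the pieces needed to build its replacement.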

 ('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
 ('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
 ('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
 ('<PERSON>Firth</PERSON>', 'Firth/PERSON')
 ('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
 ('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
 ('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
 ('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
 ('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
 ('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
 ('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
 ('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
 ('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
 ('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
 ('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
 ('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
 ('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
 ('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
 ('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')

The processing code is as follows:

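 def read_file(path):
     # Return the file's lines, or an empty list if the file is missing.
     if not os.path.exists(path):
         print("path: '" + path + "' not found.")
         return []
     # The with statement closes the file itself, so no try/finally is needed.
     with open(path, 'r') as fp:
         return fp.read().split('\n')

 def get_label(each):
     # Pair every tagged substring with its replacement,
     # e.g. ('<X>A B C</X>', 'A_B_C/X').
     originals = [m[0] for m in LABEL_PATTERN.findall(each)]
     replacements = [content.replace(' ', '_') + '/' + label
                     for label, content in LABEL_CONTENT_PATTERN.findall(each)]
     return list(zip(originals, replacements))

 src = read_file(FILE_PATH)
 # One list of (original, replacement) pairs per input line.
 pattern = [get_label(line) for line in src]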

After that, a simple substitution pass finishes the job:

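 for i in range(len(src)):
     for pat in pattern[i]:
         # pat[0] is literal text, not a regex, so use str.replace;
         # re.sub would misread characters such as ( or + in the content.
         src[i] = src[i].replace(pat[0], pat[1])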
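As an alternative to collecting (original, replacement) pairs and substituting them afterwards, the whole conversion can be sketched as a single re.sub call with a replacement function. This is not the approach used in this post. The names TAG_PATTERN, convert and convert_line are hypothetical, and the pattern is the post's second pattern with the content group given a name.

```python
import re

# Same idea as LABEL_CONTENT_PATTERN, with the content captured by name.
TAG_PATTERN = re.compile(r'<(?P<label>\S+)>(?P<content>.*?)</(?P=label)>')

def convert(match):
    # Build 'A_B_C/X' from a match on '<X>A B C</X>'.
    return match.group('content').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    # re.sub calls convert once per tagged substring, left to right.
    return TAG_PATTERN.sub(convert, line)

line = 'nod <PERSON>Tom Hooper</PERSON> in <LOCATION>LOS ANGELES</LOCATION> today'
result = convert_line(line)
print(result)  # nod Tom_Hooper/PERSON in LOS_ANGELES/LOCATION today
```

Because the replacement is computed from the match object itself, no second search for the original text is needed, and tag contents are never interpreted as a pattern.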

The complete code:

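 # -*- coding: utf-8 -*-
 import os
 import re

 # FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
 FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'

 # Whole tagged substring; the closing tag must repeat the opening tag's name.
 LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
 # Tag name and enclosed content, captured separately.
 LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

 def read_file(path):
     # Return the file's lines, or an empty list if the file is missing.
     if not os.path.exists(path):
         print("path: '" + path + "' not found.")
         return []
     # The with statement closes the file itself, so no try/finally is needed.
     with open(path, 'r') as fp:
         return fp.read().split('\n')

 def get_label(each):
     # Pair every tagged substring with its replacement,
     # e.g. ('<X>A B C</X>', 'A_B_C/X').
     originals = [m[0] for m in LABEL_PATTERN.findall(each)]
     replacements = [content.replace(' ', '_') + '/' + label
                     for label, content in LABEL_CONTENT_PATTERN.findall(each)]
     return list(zip(originals, replacements))

 src = read_file(FILE_PATH)
 # One list of (original, replacement) pairs per input line.
 pattern = [get_label(line) for line in src]

 for i in range(len(src)):
     for pat in pattern[i]:
         # pat[0] is literal text, not a regex, so use str.replace.
         src[i] = src[i].replace(pat[0], pat[1])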
