NLP (2): Acquiring Data Sources and Normalization

Original link: http://www.one2know.cn/nlp2/

Why we do this

Convert the acquired data into a uniform format, yielding normalized and structured data.

String operations

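# Create a list of strings and a string object

namesList = ['Tuffy','Ali','Nysha','Tim']

sentence = 'My dog sleeps on sofa'

# What join does

names = ';'.join(namesList) # join all items into one string, separated by ';'

print(type(names),':',names)

# What split does

wordList = sentence.split(' ') # split one string into a list of strings at each ' '

print(type(wordList),':',wordList)

# + concatenates strings; * repeats a string

print('a'+'a'+'a')

print('b' * 3)

# Indexing characters in a string (named text to avoid shadowing the built-in str)

text = 'Python NLTK'

print(text[1])

print(text[-3])

Output: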

<class 'str'> : Tuffy;Ali;Nysha;Tim

<class 'list'> : ['My', 'dog', 'sleeps', 'on', 'sofa']

aaa

bbb

y

L
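
One detail about `split` is worth a quick illustration: passing `' '` splits at every single space, so repeated spaces produce empty strings, whereas `split()` with no argument splits on any run of whitespace. A minimal sketch, where the sample string `messy` is made up for illustration and is not from the original post:

```python
# Hypothetical sample string with repeated spaces.
messy = 'My  dog sleeps   on sofa'

by_space = messy.split(' ')    # splits at every single space
by_whitespace = messy.split()  # splits on runs of whitespace

print(by_space)
print(by_whitespace)

# join and split() together normalize the spacing
normalized = ' '.join(by_whitespace)
print(normalized)
```

For text gathered from mixed sources, the no-argument form is usually the safer default because it also handles tabs and newlines.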

String operations in depth

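text = 'NLTK Dolly Python' # named text to avoid shadowing the built-in str

# Access characters 0 through 3

print(text[:4])

# Access from index 11 to the end

print(text[11:])

# Access 'Dolly': indices 5 to 10, excluding 10

print(text[5:10])

# Negative indices count from the end

print(text[-12:-7])

# Using in

if 'NLTK' in text:

    print('found NLTK')

# Using replace

replaced = text.replace('Dolly','Dorothy')

print('Replaced String:',replaced)

# Print the characters one by one

for s in replaced:

    print(s,end='/')

Output: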

NLTK

Python

Dolly

Dolly

found NLTK

Replaced String: NLTK Dorothy Python

N/L/T/K/ /D/o/r/o/t/h/y/ /P/y/t/h/o/n/
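
The slices above rely on hard-coded positions. When the position of a word is not known in advance, `str.find` can locate it first. A minimal sketch, using a hypothetical helper named `find_span` that is not part of the original post:

```python
def find_span(text, word):
    """Return (start, end) indices of the first occurrence of word, or None."""
    start = text.find(word)  # find returns -1 when word is absent
    if start == -1:
        return None
    return start, start + len(word)

text = 'NLTK Dolly Python'
span = find_span(text, 'Dolly')
print(span)

if span is not None:
    start, end = span
    print(text[start:end])  # same slice as text[5:10]
```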

Reading PDFs with Python

from PyPDF2 import PdfReader # replaces the older PdfFileReader class, which PyPDF2 3.0 removed

# Define a function that reads a PDF; the password is optional

def get_text_pdf(pdf_filename,password=''):

    # Open in binary mode; the with block closes the file when done

    with open(pdf_filename,'rb') as pdf_file:

        read_pdf = PdfReader(pdf_file)

        # If a password was given, use it to decrypt the file

        if password != '':

            read_pdf.decrypt(password)

        # Read the text: build a list of strings, one per page

        text = []

        for page in read_pdf.pages:

            text.append(page.extract_text())

    return '\n'.join(text)

# Test

if __name__ == "__main__":

    pdfFile = 'sample-one-line.pdf'

    pdfFileEncrypted = 'sample-one-line.protected.pdf'

    print('PDF 1:\n',get_text_pdf(pdfFile))

    print('PDF 2:\n',get_text_pdf(pdfFileEncrypted,'tuffy'))

Output:

PDF 1:

 This is a sample PDF document I am using to demonstrate in the tutorial.

PDF 2:

 This is a sample PDF document

password protected.
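
Text extracted from a PDF often carries stray line breaks and leading spaces, as the second output above shows. One common normalization step is to collapse all whitespace. A minimal sketch, assuming a hypothetical helper `normalize_whitespace` and a sample string modeled on the output above:

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Sample modeled on the extracted text shown above.
raw = ' This is a sample PDF document\npassword protected.'
cleaned = normalize_whitespace(raw)
print(cleaned)
```

Note that this also removes paragraph breaks, so it suits sentence-level processing better than tasks that need the original layout.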

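Reading Word documents with Python

Each document contains multiple paragraphs, and each paragraph contains multiple Run objects. A Run marks a change in formatting: font, size, color, or other style elements (underline, bold, italic, and so on). Whenever any of these changes, a new Run object is created.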

import docx

def get_text_word(word_filename):

    doc = docx.Document(word_filename)

    full_text = []

    for para in doc.paragraphs:

        full_text.append(para.text)

    return '\n'.join(full_text)

# Test

if __name__ == "__main__":

    docFile = 'sample-one-line.docx'

    print('Document in full :\n',get_text_word(docFile))

    # Other features

    doc = docx.Document(docFile)

    print('Number of paragraphs:',len(doc.paragraphs))

    print('Second paragraph text:',doc.paragraphs[1].text)

    print('Second paragraph style:',doc.paragraphs[1].style)

    # Print all the run objects in the first paragraph

    # The run objects reflect changes in text style

    print('First paragraph:',doc.paragraphs[0].text)

    print('Number of runs in paragraph 1 :',len(doc.paragraphs[0].runs))

    for idx,run in enumerate(doc.paragraphs[0].runs):

        print('Run %s : %s' % (idx,run.text))

    # Check the style of run objects: underline, bold, italic

    print('is Run 5 underlined:',doc.paragraphs[0].runs[5].underline)

    print('is Run 1 bold:',doc.paragraphs[0].runs[1].bold)

    print('is Run 3 italic',doc.paragraphs[0].runs[3].italic)

Output:

Document in full :

 This is a sample PDF document with some text in BOLD, some in ITALIC and some underlined. We are also embedding a Title down below.

This is my TITLE.

This is my third paragraph.

Number of paragraphs: 3

Second paragraph text: This is my TITLE.

Second paragraph style: _ParagraphStyle('Title') id: 2046137402144

First paragraph: This is a sample PDF document with some text in BOLD, some in ITALIC and some underlined. We are also embedding a Title down below.

Number of runs in paragraph 1 : 8

Run 0 : This is a sample PDF document with

Run 1 : some text in BOLD

Run 2 : ,

Run 3 : some in ITALIC

Run 4 :  and

Run 5 : some underlined.

Run 6 :  We are also embedding a Title down below

Run 7 : .

is Run 5 underlined: True

is Run 1 bold: True

is Run 3 italic True
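
The run boundaries in the output above follow the formatting changes in the paragraph. The idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not how python-docx or Word work internally, and the function `split_into_runs` and the styled sample are hypothetical:

```python
from itertools import groupby

def split_into_runs(chars):
    """Group (character, style) pairs into (text, style) runs.

    A new run starts whenever the style differs from the previous character's.
    """
    runs = []
    for style, group in groupby(chars, key=lambda pair: pair[1]):
        runs.append((''.join(ch for ch, _ in group), style))
    return runs

# Hypothetical styled text: 'some ' is plain, 'BOLD' is bold, ' text' is plain.
styled = ([(c, 'plain') for c in 'some '] +
          [(c, 'bold') for c in 'BOLD'] +
          [(c, 'plain') for c in ' text'])

for idx, (text, style) in enumerate(split_into_runs(styled)):
    print('Run %s : %r (%s)' % (idx, text, style))
```

Real documents may contain extra run boundaries with no visible style change, as Run 6 and Run 7 above suggest, so code that processes runs should not assume one run per style.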

Creating a custom corpus

Built from txt, PDF, and Word files:

import pdf,word # presumably the PDF and Word scripts above, saved as pdf.py and word.py

import os

from nltk.corpus.reader.plaintext  import PlaintextCorpusReader

# A function that opens a plain text file

def get_text(text_filename):

    with open(text_filename,'r') as file: # read-only; closed automatically

        return file.read() # contents as a string

# Create a new folder

newCorpusDir = 'mycorpus/'

if not os.path.isdir(newCorpusDir):

    os.mkdir(newCorpusDir)

# Read the three files

txt1 = get_text('sample_feed.txt')

txt2 = pdf.get_text_pdf('sample-pdf.pdf')

txt3 = word.get_text_word('sample-one-line.docx')

# Write the three strings into the new folder

files = [txt1,txt2,txt3]

for idx,f in enumerate(files):

    with open(newCorpusDir+str(idx) + '.txt','w') as fout:

        fout.write(f)

# Create a PlaintextCorpusReader object

newCorpus = PlaintextCorpusReader(newCorpusDir,'.*')

# Test

print(newCorpus.words()) # all the words in the corpus

print(newCorpus.sents(newCorpus.fileids()[1])) # the sentences in 1.txt

print(newCorpus.paras(newCorpus.fileids()[0])) # the paragraphs in 0.txt

Output:

['i', 'want', 'to', 'eat', 'dinner', 'i', 'want', 'to', ...]

[['A', 'generic', 'NLP'], ['(', 'Natural', 'Language', 'Processing', ')', 'toolset'], ...]

[[['i', 'want', 'to', 'eat', 'dinner']], [['i', 'want', 'to', 'run']]]
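
The nested lists printed by `paras` are paragraphs, each holding sentences, each holding words. The sketch below reproduces that nesting with deliberately naive splitting (blank lines for paragraphs, periods for sentences, whitespace for words). It is not how NLTK tokenizes, the function `naive_paras` is hypothetical, and the sample text is an assumption based on the output above:

```python
def naive_paras(text):
    """Return paragraphs -> sentences -> words, using very naive splitting."""
    paras = []
    for block in text.split('\n\n'):  # a blank line separates paragraphs
        sentences = [s.split() for s in block.split('.') if s.strip()]
        if sentences:
            paras.append(sentences)
    return paras

# Assumed sample, modeled on the paragraph output shown above.
sample = 'i want to eat dinner\n\ni want to run'
print(naive_paras(sample))
```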

Reading content from an RSS feed

RSS stands for Rich Site Summary.
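
An RSS 2.0 feed is an XML document: a `channel` element with a title, followed by a list of `item` elements (the posts). To make that structure concrete, here is a minimal sketch that parses a tiny hand-written feed with the standard library's `xml.etree.ElementTree`. The feed content is hypothetical, and this is only an illustration of the format, not a replacement for a dedicated feed parser:

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document (hypothetical content).
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><description>&lt;p&gt;Hello&lt;/p&gt;</description></item>
    <item><title>Second post</title><description>&lt;p&gt;World&lt;/p&gt;</description></item>
  </channel>
</rss>"""

channel = ET.fromstring(rss_text).find('channel')
feed_title = channel.findtext('title')
posts = channel.findall('item')  # one element per post

print('Feed Title :', feed_title)
print('Number of posts :', len(posts))
print('Post Title :', posts[0].findtext('title'))
print('Raw content :', posts[0].findtext('description'))
```

Real feeds vary (Atom, namespaced extensions, malformed XML), which is the main reason to prefer a dedicated parsing library in practice.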

Using Mashable's feed as an example (url=http://feeds.mashable.com/Mashable):

import feedparser

# Load the feed; it is downloaded and parsed automatically

myFeed = feedparser.parse('http://feeds.mashable.com/Mashable')

# Check the feed title and count the posts

print('Feed Title :',myFeed['feed']['title'])

print('Number of posts :',len(myFeed.entries)) # entries is a list of all posts

# The first post in the entries list

post = myFeed.entries[0]

print('Post Title :',post.title)

# Access the post's raw HTML content and save it

content = post.content[0].value

print('Raw content :\n',content)

with open('sample-html.html','w') as fout: # closed automatically

    fout.write(content)

Output:

Feed Title : Mashable

Number of posts : 30

Post Title : Revolut launches new, effortless way to donate to charities

Raw content :

 <img alt="" src="https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1007924%252F37167fff-e81c-446d-849a-37d0b625b7a7.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=ReYfFvy3gpD2t0oTt7_Z4kd7NQo=&amp;source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com" /><div style="float: right; width: 50px;"><a href="https://twitter.com/share?via=Mashable&text=Revolut+launches+new%2C+effortless+way+to+donate+to+charities&url=https%3A%2F%2Fmashable.com%2Farticle%2Frevolut-donations" style="margin: 10px;"><img alt="Twitter" border="0" src="https://a.amz.mshcdn.com/assets/feed-tw-e71baf64f2ec58d01cd28f4e9ef6b2ce0370b42fbd965068e9e7b58be198fb13.jpg" /></a><a href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fmashable.com%2Farticle%2Frevolut-donations&src=sp" style="margin: 10px;"><img alt="Facebook" border="0" src="https://a.amz.mshcdn.com/assets/feed-fb-8e3bd31e201ea65385a524ef67519d031e6851071807055648790d6a4ca77139.jpg" /></a></div><p><a href="https://www.revolut.com/">Revolut</a> is a UK-based financial services company that offers clients a bank account and a pre-paid card, with many of its services free or incurring a lower fee than you'd get from a typical bank. It's now also offering a new feature that makes it really easy to donate to charities — every time you make a payment. </p>

<p>The feature, called Donations, lets you round up your Revolut card payments and donate the spare change to a charity of your choice. The service is kicking off with three charities: <a href="https://www.ilga-europe.org/">ILGA-Europe</a>, <a href="https://www.savethechildren.net/">Save the Children</a> and <a href="https://www.worldwildlife.org/">WWF</a>. </p>

<div><p>SEE ALSO: <a href="http://mashable.com/article/instagram-stories-donation-sticker-causes?utm_campaign&amp;utm_cid=a-seealso&amp;utm_context=textlink&amp;utm_medium=rss&amp;utm_source">You can now donate through stickers in Instagram Stories</a> <a href="https://mashable.com/article/revolut-donations">Read more...</a></p></div>More about <a href="https://mashable.com/category/donations/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Donations</a>, <a href="https://mashable.com/category/revolut/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Revolut</a>, <a href="https://mashable.com/tech/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Tech</a>, and <a href="https://mashable.com/category/big-tech-companies/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Big Tech Companies</a><img src="http://feeds.feedburner.com/~r/Mashable/~4/s9f4V3jFdyg" height="1" width="1" alt=""/>

Parsing HTML with BeautifulSoup

BeautifulSoup can be used to parse any HTML or XML content.

The HTML to be parsed:

<img alt="" src="https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1008631%252F256dd624-5852-4df0-81b3-e686a3ac5fd2.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=o6SwiPnemiiF5QUbmAb8lh89GJw=&amp;source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com" /><div style="float: right; width: 50px;"><a href="https://twitter.com/share?via=Mashable&text=Android+might+finally+get+a+better+AirDrop+alternative&url=https%3A%2F%2Fmashable.com%2Farticle%2Fandroid-fast-share-airdrop-alternative" style="margin: 10px;"><img alt="Twitter" border="0" src="https://a.amz.mshcdn.com/assets/feed-tw-e71baf64f2ec58d01cd28f4e9ef6b2ce0370b42fbd965068e9e7b58be198fb13.jpg" /></a><a href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fmashable.com%2Farticle%2Fandroid-fast-share-airdrop-alternative&src=sp" style="margin: 10px;"><img alt="Facebook" border="0" src="https://a.amz.mshcdn.com/assets/feed-fb-8e3bd31e201ea65385a524ef67519d031e6851071807055648790d6a4ca77139.jpg" /></a></div><p>Google might finally deliver a viable version of AirDrop for Android phones.</p>

<p>The company is testing a new Android feature called "Fast Share" that would allow phone owners to wirelessly transmit photos, text, and other files to nearby devices using Bluetooth. The currently unreleased feature was uncovered by two separate publications, <a href="https://9to5google.com/2019/06/29/google-android-fast-share/">9to5Google</a> and <a href="https://www.xda-developers.com/fast-share-android-beam-airdrop-android/">XDA Developers</a>.</p>

<p>According to screenshots posted by the publications, Fast Share allows you to share photos, text, and URLs with devices that are nearby even if you don't have an internet connection. Interestingly, the list of devices in the screenshots includes an iPhone as well as a Chromebook and Pixel 3 phone, suggesting the intention is for Fast Share to enable cross-platform sharing. <a href="https://mashable.com/article/android-fast-share-airdrop-alternative">Read more...</a></p>More about <a href="https://mashable.com/tech/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Tech</a>, <a href="https://mashable.com/category/google/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Google</a>, <a href="https://mashable.com/category/airdrop/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Airdrop</a>, <a href="https://mashable.com/category/android-q/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Android Q</a>, and <a href="https://mashable.com/tech/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Tech</a><img src="http://feeds.feedburner.com/~r/Mashable/~4/VGJnaGJtlxQ" height="1" width="1" alt=""/>

Parsing code:

from bs4 import BeautifulSoup

# Pass the HTML file to a BeautifulSoup object as a str

with open('sample-html.html','r') as f:

    html_doc = f.read()

soup = BeautifulSoup(html_doc,'html.parser')

# Strip the tags and get the text

print('Full text HTML Stripped:')

print(soup.get_text())

# Get the first occurrence of a given tag

print('Accessing the <img> tag :',end=' ')

print(soup.img)

# Get the text of the first occurrence of a given tag

print('Accessing the text of <p> tag :',end=' ')

print(soup.p.string)

# Access an attribute of the first occurrence of a given tag

print('Accessing property of <img> tag :',end=' ')

print(soup.img['src'])

# Get every occurrence of a given tag

# .string is None when a tag has more than one child (e.g. text plus <a> tags)

print('Accessing all occurences of the <p> tag :')

for p in soup.find_all('p'):

    print(p.string)

Output:

Full text HTML Stripped:

Google might finally deliver a viable version of AirDrop for Android phones.

The company is testing a new Android feature called "Fast Share" that would allow phone owners to wirelessly transmit photos, text, and other files to nearby devices using Bluetooth. The currently unreleased feature was uncovered by two separate publications, 9to5Google and XDA Developers.

According to screenshots posted by the publications, Fast Share allows you to share photos, text, and URLs with devices that are nearby even if you don't have an internet connection. Interestingly, the list of devices in the screenshots includes an iPhone as well as a Chromebook and Pixel 3 phone, suggesting the intention is for Fast Share to enable cross-platform sharing. Read more...More about Tech, Google, Airdrop, Android Q, and Tech

Accessing the <img> tag : <img alt="" src="https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1008631%252F256dd624-5852-4df0-81b3-e686a3ac5fd2.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=o6SwiPnemiiF5QUbmAb8lh89GJw=&amp;source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com"/>

Accessing the text of <p> tag : Google might finally deliver a viable version of AirDrop for Android phones.

Accessing property of <img> tag : https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1008631%252F256dd624-5852-4df0-81b3-e686a3ac5fd2.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=o6SwiPnemiiF5QUbmAb8lh89GJw=&source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com

Accessing all occurences of the <p> tag :

Google might finally deliver a viable version of AirDrop for Android phones.

None

None
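
For paragraphs that mix text with nested tags, such as the second and third `<p>` elements above, the text is spread over several nodes. As a rough standard-library alternative to `get_text()`, the sketch below collects every text node with `html.parser`. The class `TextExtractor`, the function `strip_tags`, and the sample snippet are hypothetical and not part of the original post:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each chunk of text found outside of tags.
        self.parts.append(data)

def strip_tags(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

# Hypothetical snippet with text and nested <a> tags inside one <p>.
html_snippet = '<p>Uncovered by <a href="#">9to5Google</a> and <a href="#">XDA Developers</a>.</p>'
stripped = strip_tags(html_snippet)
print(stripped)
```

This handles simple fragments only; it makes no attempt to skip script or style content, so a full HTML parsing library remains the better choice for real pages.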
