Natural Language Processing (an English Speech)

Here a 2-gram model is used to extract rough topic sentences from an English speech. The text of the speech is available at: http://pythonscraping.com/files/inaugurationSpeech.txt

An n-gram is a sequence of n consecutive words.
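As a small illustration of this definition, the sketch below lists the 2-grams of a short phrase. The helper name `word_ngrams` and the sample phrase are made up for this example and are not part of the script in this post.

```python
def word_ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    # A list of len(words) items has len(words) - n + 1 windows of length n.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = 'we the people of the United States'.split(' ')
bigrams = word_ngrams(sample, 2)
```

With seven words in `sample`, `bigrams` holds six 2-grams, starting with `we the` and ending with `United States`.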

The code is given below (based on Python 2.7); for details, see 《Python网络数据采集》 (Web Scraping with Python).

# -*- coding: utf-8 -*-
from urllib2 import urlopen
import re
import string
import operator

# Word filtering: despite its name, isCommon returns True only when
# the n-gram contains none of the common words listed below
def isCommon(ngram):
    ngrams=ngram.split(' ')
    # Common words that carry little meaning and should be filtered out
    commonWords=['the', 'be', 'and', 'of', 'a', 'in', 'to', 'have', 'it', 'i', 'for', 'you', 'he',
                 'with', 'on', 'do', 'say', 'this', 'they', 'is', 'an', 'at', 'but', 'we', 'his',
                 'from', 'that', 'not', 'by', 'she', 'or', 'what', 'go', 'their', 'can', 'who',
                 'get', 'if', 'would', 'her', 'all', 'my', 'make', 'about', 'know', 'will', 'as',
                 'up', 'one', 'time', 'has', 'been', 'there', 'year', 'so', 'think', 'when', 'which',
                 'them', 'some', 'me', 'people', 'take', 'out', 'into', 'just', 'see', 'him', 'your',
                 'come', 'could', 'now', 'than', 'like', 'other', 'how', 'then', 'its', 'our', 'two',
                 'more', 'these', 'want', 'way', 'look', 'first', 'also', 'new', 'because', 'day', 'use',
                 'no', 'man', 'find', 'here', 'thing', 'give', 'many', 'well']
    # Reject the n-gram if any of its words is a common word
    for word in ngrams:
        if word.lower() in commonWords:
            return False
    return True

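# Data cleaning
def cleanInput(input):
    # Collapse runs of newlines and spaces into a single space,
    # and remove citation markers such as [1]
    input=re.sub(r'\n+',' ',input)
    input=re.sub(r'\[[0-9]*\]','',input)
    input=re.sub(' +',' ',input)
    # Drop any non-ASCII characters; the result is a unicode string
    input=input.decode('utf-8').encode('ascii','ignore').decode('ascii')
    cleanInput=[]
    input=input.split(' ')
    for item in input:
        # Strip leading and trailing punctuation; string.punctuation is
        # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        item=item.strip(string.punctuation)
        # Keep words longer than one character, plus the words 'a' and 'i'
        if len(item)>1 or (item.lower()=='a' or item.lower()=='i'):
            cleanInput.append(item)
    return cleanInput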

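# input is the whole text as one string; n is the number of words per n-gram
def ngrams(input,n):
    input=cleanInput(input)
    # Dictionary mapping each n-gram to the number of times it occurs
    output={}
    # A list of len(input) words has len(input)-n+1 windows of n words
    for i in range(len(input)-n+1):
        ngramTemp=' '.join(input[i:i+n])
        if isCommon(ngramTemp):
            if ngramTemp not in output:
                output[ngramTemp]=0
            output[ngramTemp]+=1
    return output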

html=urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read().decode('utf-8')
content=str(html)
# Use a separate name for the result so the ngrams function is not shadowed
ngramCounts=ngrams(content,2)
# key=operator.itemgetter(0) would sort the (2-gram, count) pairs by the 2-gram text
# key=operator.itemgetter(1) sorts them by count
# reverse=True gives descending order
sortedNGrams=sorted(ngramCounts.items(),key=operator.itemgetter(1),reverse=True)
# Print the meaningful 2-grams together with their counts
print sortedNGrams

# Collect the 2-grams found above
keywords=[]
for i in range(0,len(sortedNGrams)):
    word=sortedNGrams[i]
    # Keep only the 2-grams that occur more than 2 times
    if int(word[1])>2:
        keywords.append(word[0])

# Set that holds every sentence of the speech
sentences=set()
# main_sentences stores the result
main_sentences=set()
i=content.split('.')
for j in i:
    sentences.add(j)

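for keyword in keywords:
    for sentence in sentences:
        # Take one sentence that contains the 2-gram; sets are unordered,
        # so this is not necessarily the first such sentence in the speech
        b=sentence.find(keyword)
        if b!=-1:
            # Collapse repeated spaces and remove newlines from the sentence
            sentence=re.sub(" +"," ",sentence)
            sentence=re.sub("\n+","",sentence)
            main_sentences.add(sentence)
            break
for i in main_sentences:
    print i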

The 2-grams obtained (those occurring more than 2 times) are:

[u'United States', u'General Government', u'executive department', u'legislative body', u'Mr Jefferson', u'Chief Magistrate', u'called upon', u'same causes', u'whole country', u'Government should']
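The counting and filtering that produce a list like this can also be sketched with `collections.Counter` from the standard library. This is an alternative formulation and not the code used in this post; the function name `frequent_bigrams`, the `min_count` parameter and the toy word list are assumptions made for illustration, and the common-word filter is left out to keep the sketch short.

```python
from collections import Counter


def frequent_bigrams(words, min_count):
    """Return the 2-grams that occur more than min_count times, most frequent first."""
    counts = Counter(' '.join(words[i:i + 2]) for i in range(len(words) - 1))
    # most_common() yields (2-gram, count) pairs sorted by descending count.
    return [bigram for bigram, count in counts.most_common() if count > min_count]


# Toy input (hypothetical): 'United States' occurs three times,
# the other two 2-grams occur twice each.
words = 'United States law United States law United States'.split(' ')
result = frequent_bigrams(words, 2)
```

Here `result` contains only `United States`, because it is the only 2-gram in the toy list that occurs more than twice.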

There are quite a few output sentences, so they are not pasted here; this is only a rudimentary first pass over the speech.
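Because the script keeps its sentences in a set, the sentence picked for each 2-gram is arbitrary rather than the first one in reading order. If the first match in the text is wanted, one possible variation keeps the sentences in a list instead. This is only a sketch under that assumption; the function name `first_sentences`, the sample text and the keyword list are hypothetical.

```python
import re


def first_sentences(text, keywords):
    """For each keyword, pick the first sentence (in text order) that contains it."""
    # Normalise whitespace first so a keyword split across a line break still matches.
    sentences = [re.sub(r'\s+', ' ', s).strip() for s in text.split('.')]
    picked = []
    for keyword in keywords:
        for sentence in sentences:
            if keyword in sentence:
                # Avoid listing the same sentence twice for different keywords.
                if sentence not in picked:
                    picked.append(sentence)
                break
    return picked


text = 'The United\nStates is large. The Chief Magistrate spoke. The United States grew.'
summary = first_sentences(text, ['United States', 'Chief Magistrate'])
```

A list preserves the order of the speech, so `summary` holds the first and second sentences of the sample text and skips the third.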
