nltk 中的 sents 和 words

nltk 中的 sents 和 words ，为后续处理做准备。

#!/usr/bin/env python

# -*- coding: utf-8 -*-

from nltk.corpus import gutenberg

sents = gutenberg.sents("burgess-busterbrown.txt")

print(sents[1:20])

words = gutenberg.words("burgess-busterbrown.txt")

print(words[1:20])

输出：

[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', 'the', 'Green', 'Forest', 'to', 'chase', 'out', 'the', 'Black', 'Shadows', '.'], ['Once', 'more', 'he', 'yawned', ',', 'and', 'slowly', 'got', 'to', 'his', 'feet', 'and', 'shook', 'himself', '.'], ['Then', 'he', 'walked', 'over', 'to', 'a', 'big', 'pine', '-', 'tree', ',', 'stood', 'up', 'on', 'his', 'hind', 'legs', ',', 'reached', 'as', 'high', 'up', 'on', 'the', 'trunk', 'of', 'the', 'tree', 'as', 'he', 'could', ',', 'and', 'scratched', 'the', 'bark', 'with', 'his', 'great', 'claws', '.'], ['After', 'that', 'he', 'yawned', 'until', 'it', 'seemed', 'as', 'if', 'his', 'jaws', 'would', 'crack', ',', 'and', 'then', 'sat', 'down', 'to', 'think', 'what', 'he', 'wanted', 'for', 'breakfast', '.'], ['While', 'he', 'sat', 'there', ',', 'trying', 'to', 'make', 'up', 'his', 'mind', 'what', 'would', 'taste', 'best', ',', 'he', 'was', 'listening', 'to', 'the', 'sounds', 'that', 'told', 'of', 'the', 'waking', 'of', 'all', 'the', 'little', 'people', 'who', 'live', 'in', 'the', 'Green', 'Forest', '.'], ['He', 'heard', 'Sammy', 'Jay', 'way', 'off', 'in', 'the', 'distance', 'screaming', ',', '"', 'Thief', '!'], ['Thief', '!"'], ['and', 'grinned', '.'], ['"', 'I', 'wonder', ',"', 'thought', 'Buster', ',', '"', 'if', 'some', 'one', 'has', 'stolen', 'Sammy', "'", 's', 'breakfast', ',', 'or', 'if', 'he', 'has', 'stolen', 'the', 'breakfast', 'of', 'some', 'one', 'else', '.'], ['Probably', 'he', 'is', 'the', 'thief', 'himself', '."'], ['He', 'heard', 'Chatterer', 'the', 'Red', 'Squirrel', 'scolding', 'as', 'fast', 'as', 'he', 'could', 'make', 'his', 'tongue', 'go', 'and', 'working', 'himself', 'into', 'a', 'terrible', 'rage', '.'], ['"', 'Must', 'be', 'that', 'Chatterer', 'got', 'out', 'of', 'bed', 'the', 'wrong', 'way', 'this', 'morning', ',"', 'thought', 'he', '.'], ['He', 'heard', 'Blacky', 'the', 'Crow', 'cawing', 'at', 'the', 'top', 'of', 'his', 'lungs', ',', 'and', 'he', 'knew', 'by', 'the', 'sound', 'that', 'Blacky', 'was', 'getting', 'into', 'mischief', 'of', 'some', 'kind', '.'], ['He', 'heard', 'the', 'sweet', 'voices', 'of', 'happy', 'little', 'singers', ',', 'and', 'they', 'were', 'good', 'to', 'hear', '.'], ['But', 'most', 'of', 'all', 'he', 'listened', 'to', 'a', 'merry', ',', 'low', ',', 'silvery', 'laugh', 'that', 'never', 'stopped', 'but', 'went', 'on', 'and', 'on', ',', 'until', 'he', 'just', 'felt', 'as', 'if', 'he', 'must', 'laugh', 'too', '.'], ['It', 'was', 'the', 'voice', 'of', 'the', 'Laughing', 'Brook', '.'], ['And', 'as', 'Buster', 'listened', 'it', 'suddenly', 'came', 'to', 'him', 'just', 'what', 'he', 'wanted', 'for', 'breakfast', '.']]

['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']

Process finished with exit code 0

nltk 中的 sents 和 words的更多相关文章

在 NLTK 中使用 Stanford NLP 工具包
转载自:http://www.zmonster.me/2016/06/08/use-stanford-nlp-package-in-nltk.html 目录 NLTK 与 Stanford NLP 安 ...
nltk中的三元词组，二元词组
在做英文文本处理时,常常会遇到这样的情况,需要我们提取出里面的词组进行主题抽取,尤其是具有行业特色的,比如金融年报等.其中主要进行的是进行双连词和三连词的抽取,那如何进行双连词和三连词的抽取呢?这是本 ...
在nltk中调用stanfordparser处理中文
出现unicode decode error 解决办法是修改nltk包internals.py的java()下增加cmd的参数,cmd = ["-Dfile.encoding=UTF-8&q ...
NLTK中的词性
NOUN n,VERB v ,ADJ a, ADV r, ADJ_SAT s NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x' ...
Python自然语言处理实践: 在NLTK中使用斯坦福中文分词器
http://www.52nlp.cn/python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E5%AE%9E%E8%B7%B5-% ...
python+NLTK 自然语言学习处理三：如何在nltk/matplotlib中的图片中显示中文
我们首先来加载我们自己的文本文件,并统计出排名前20的字符频率 if __name__=="__main__": corpus_root='/home/zhf/word' word ...
使用Python中的NLTK和spaCy删除停用词与文本标准化
概述了解如何在Python中删除停用词与文本标准化,这些是自然语言处理的基本技术探索不同的方法来删除停用词,以及讨论文本标准化技术,如词干化(stemming)和词形还原(lemmatizatio ...
【NLP】Python NLTK获取文本语料和词汇资源
Python NLTK 获取文本语料和词汇资源作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...
Python文本处理nltk基础
自然语言处理 -->计算机数据 ,计算机可以处理vector,matrix 向量矩阵. NLTK 自然语言处理库,自带语料,词性分析,分类,分词等功能. 简单版的wrapper,比如textbl ...

随机推荐

GoLand2019.03破解与汉化
1.准备工作 (请认真的做好准备工作,否则中途可能会操作失败.) GoLand是JetBrains公司发布的商业版的GO语言编辑器(收费的),本屌目前还没钱购买正版,所以本次教程是以Windows平台 ...
Go语言基础篇（1） —— 编写第一个Go程序
创建文件hello_world.go package main //包,表名代码所在的包 import "fmt" //引入依赖 //main方法 func main(){ fmt ...
从头认识js-DOM1
前面说过一个完整的js实现,包括ECMAScript,BOM,DOM三部分,现在就来讲讲DOM的有关知识. DOM(文档对象模型)是针对HTML和XML文档的一个API(应用程序接口).DOM描绘来一 ...
Ado.net01
------------恢复内容开始------------ 1.ExcuteReader using System; using System.Data.SqlClient; using Syste ...
preload & prefetch
原文地址在我的笔记里,觉得还行就给个 star 吧:) 关于 preload 和 prefetch 早有耳闻,知道它们可以优化页面加载速度,然具体情况却了解不多.搜索了相关的资料后对其有了些认识,在 ...
JavaScript sort() 对json进行排序(数组)
function up(x,y){//升序 return x[val.prop] - y[val.prop] } function down(x,y){//降序 return y[val.prop] ...
【asp.net core】实现动态 Web API
序言: 远程工作已经一个月了,最近也算是比较闲,每天早上起床打个卡,快速弄完当天要做的工作之后就快乐摸鱼去了.之前在用 ABP 框架(旧版)的时候就觉得应用服务层写起来真的爽,为什么实现了个 IApp ...
Markdown For EditPlus插件发布(基于EditPlus快速编辑Markdonw文件，写作爱好的福音来啦)
详细介绍: Markdown For EditPlus插件使用说明开发缘由特点好处: 中文版使用说明相关命令(输入字符敲空格自动输出): EditPlus常用快捷键: 相关教程: English ...
python学习基础知识
学习python前最好知道的知识点: python之父:Guido van Rossum python是一种面向对象语言目前python最新的版本是3.8,python2已经逐渐淘汰 python的 ...
mysql实现读写分离
MySQL读写分离概述 1.读写分离介绍对于目前单机运行MySQL服务.会导致MySQL连接数过多.最终导致mysql的宕机.因此可以使用多台MySQL服务器一起承担压力.考虑到项目中读写比例的不一 ...

nltk 中的 sents 和 words

nltk 中的 sents 和 words的更多相关文章

随机推荐

热门专题