nltk 中的 sents 和 words

nltk 中的 sents 和 words ，为后续处理做准备。

#!/usr/bin/env python

# -*- coding: utf-8 -*-

from nltk.corpus import gutenberg

sents = gutenberg.sents("burgess-busterbrown.txt")

print(sents[1:20])

words = gutenberg.words("burgess-busterbrown.txt")

print(words[1:20])

输出：

[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', 'the', 'Green', 'Forest', 'to', 'chase', 'out', 'the', 'Black', 'Shadows', '.'], ['Once', 'more', 'he', 'yawned', ',', 'and', 'slowly', 'got', 'to', 'his', 'feet', 'and', 'shook', 'himself', '.'], ['Then', 'he', 'walked', 'over', 'to', 'a', 'big', 'pine', '-', 'tree', ',', 'stood', 'up', 'on', 'his', 'hind', 'legs', ',', 'reached', 'as', 'high', 'up', 'on', 'the', 'trunk', 'of', 'the', 'tree', 'as', 'he', 'could', ',', 'and', 'scratched', 'the', 'bark', 'with', 'his', 'great', 'claws', '.'], ['After', 'that', 'he', 'yawned', 'until', 'it', 'seemed', 'as', 'if', 'his', 'jaws', 'would', 'crack', ',', 'and', 'then', 'sat', 'down', 'to', 'think', 'what', 'he', 'wanted', 'for', 'breakfast', '.'], ['While', 'he', 'sat', 'there', ',', 'trying', 'to', 'make', 'up', 'his', 'mind', 'what', 'would', 'taste', 'best', ',', 'he', 'was', 'listening', 'to', 'the', 'sounds', 'that', 'told', 'of', 'the', 'waking', 'of', 'all', 'the', 'little', 'people', 'who', 'live', 'in', 'the', 'Green', 'Forest', '.'], ['He', 'heard', 'Sammy', 'Jay', 'way', 'off', 'in', 'the', 'distance', 'screaming', ',', '"', 'Thief', '!'], ['Thief', '!"'], ['and', 'grinned', '.'], ['"', 'I', 'wonder', ',"', 'thought', 'Buster', ',', '"', 'if', 'some', 'one', 'has', 'stolen', 'Sammy', "'", 's', 'breakfast', ',', 'or', 'if', 'he', 'has', 'stolen', 'the', 'breakfast', 'of', 'some', 'one', 'else', '.'], ['Probably', 'he', 'is', 'the', 'thief', 'himself', '."'], ['He', 'heard', 'Chatterer', 'the', 'Red', 'Squirrel', 'scolding', 'as', 'fast', 'as', 'he', 'could', 'make', 'his', 'tongue', 'go', 'and', 'working', 'himself', 'into', 'a', 'terrible', 'rage', '.'], ['"', 'Must', 'be', 'that', 'Chatterer', 'got', 'out', 'of', 'bed', 'the', 'wrong', 'way', 'this', 'morning', ',"', 'thought', 'he', '.'], ['He', 'heard', 'Blacky', 'the', 'Crow', 'cawing', 'at', 'the', 'top', 'of', 'his', 'lungs', ',', 'and', 'he', 'knew', 'by', 'the', 'sound', 'that', 'Blacky', 'was', 'getting', 'into', 'mischief', 'of', 'some', 'kind', '.'], ['He', 'heard', 'the', 'sweet', 'voices', 'of', 'happy', 'little', 'singers', ',', 'and', 'they', 'were', 'good', 'to', 'hear', '.'], ['But', 'most', 'of', 'all', 'he', 'listened', 'to', 'a', 'merry', ',', 'low', ',', 'silvery', 'laugh', 'that', 'never', 'stopped', 'but', 'went', 'on', 'and', 'on', ',', 'until', 'he', 'just', 'felt', 'as', 'if', 'he', 'must', 'laugh', 'too', '.'], ['It', 'was', 'the', 'voice', 'of', 'the', 'Laughing', 'Brook', '.'], ['And', 'as', 'Buster', 'listened', 'it', 'suddenly', 'came', 'to', 'him', 'just', 'what', 'he', 'wanted', 'for', 'breakfast', '.']]

['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']

Process finished with exit code 0

nltk 中的 sents 和 words的更多相关文章

在 NLTK 中使用 Stanford NLP 工具包
转载自:http://www.zmonster.me/2016/06/08/use-stanford-nlp-package-in-nltk.html 目录 NLTK 与 Stanford NLP 安 ...
nltk中的三元词组，二元词组
在做英文文本处理时,常常会遇到这样的情况,需要我们提取出里面的词组进行主题抽取,尤其是具有行业特色的,比如金融年报等.其中主要进行的是进行双连词和三连词的抽取,那如何进行双连词和三连词的抽取呢?这是本 ...
在nltk中调用stanfordparser处理中文
出现unicode decode error 解决办法是修改nltk包internals.py的java()下增加cmd的参数,cmd = ["-Dfile.encoding=UTF-8&q ...
NLTK中的词性
NOUN n,VERB v ,ADJ a, ADV r, ADJ_SAT s NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x' ...
Python自然语言处理实践: 在NLTK中使用斯坦福中文分词器
http://www.52nlp.cn/python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E5%AE%9E%E8%B7%B5-% ...
python+NLTK 自然语言学习处理三：如何在nltk/matplotlib中的图片中显示中文
我们首先来加载我们自己的文本文件,并统计出排名前20的字符频率 if __name__=="__main__": corpus_root='/home/zhf/word' word ...
使用Python中的NLTK和spaCy删除停用词与文本标准化
概述了解如何在Python中删除停用词与文本标准化,这些是自然语言处理的基本技术探索不同的方法来删除停用词,以及讨论文本标准化技术,如词干化(stemming)和词形还原(lemmatizatio ...
【NLP】Python NLTK获取文本语料和词汇资源
Python NLTK 获取文本语料和词汇资源作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...
Python文本处理nltk基础
自然语言处理 -->计算机数据 ,计算机可以处理vector,matrix 向量矩阵. NLTK 自然语言处理库,自带语料,词性分析,分类,分词等功能. 简单版的wrapper,比如textbl ...

随机推荐

css雪碧图压缩
cssgaga下载地址链接: https://pan.baidu.com/s/1Q9xH_XzumIc7vTLCZ3tr5A 提取码: stqe CssGaga功能特性合并import的CSS文件 ...
微信小程序学习动手撸一个校园网小程序
动手撸一个校园网微信小程序高考完毕,想必广大学子和家长们都在忙着查询各所高校的信息,刚好上手微信小程序,当练手也当为自己的学校做点宣传,便当即撸了一个校园网微信小程序. 效果预览源码地址:Gith ...
[译]HTML&CSS Lesson7: 设置背景和渐变色
背景对网站的设计有重大的影响.它有利于建立网站的整体感觉,设置分组,分配优先级,对网站的可用性也有相当大的影响. 在CSS中,元素的背景可以是一个纯色,一张图,一个渐变色或者它们的组合.在我们决定如何 ...
前端实现html转pdf方法总结
最近要搞前端html转pdf的功能.折腾了两天,略有所收,踩了一些坑,所以做些记录,为后来的兄弟做些提示,也算是回馈社区.经过一番调(sou)研(suo)发现html导出pdf一般有这几种方式,各有各 ...
nginx如何连接多个服务？
记录一下: 刚开始用nginx部署,在项目文件内touch了一个nginx.conf配置文件,然后将这个conf文件软链接到nginx的工作目录中 sudo ln -s /home/ubuntu/xx ...
CyclicBarrier源码探究（JDK 1.8）
CyclicBarrier也叫回环栅栏,能够实现让一组线程运行到栅栏处并阻塞,等到所有线程都到达栅栏时再一起执行的功能."回环"意味着CyclicBarrier可以多次重复使用,相 ...
实现Sobel算子滤波、Robers算子滤波、Laplace算子滤波
前几天,老师布置了这样一个任务,读取图片并显示,反色后进行显示:进行Sobel算子滤波,然后反色,进行显示:进行Robers算子滤波,然后反色,进行显示.我最后加上了Laplace算子滤波,进行了比较 ...
SpringBoot——Cache使用原理及Redis整合
前言及核心概念介绍前言本篇主要介绍SpringBoot2.x 中 Cahe 的原理及几个主要注解,以及整合 Redis 作为缓存的步骤核心概念先来看看核心接口的作用及关系图: CachingP ...

nltk 中的 sents 和 words

nltk 中的 sents 和 words的更多相关文章

随机推荐

热门专题