【389】Implement N-grams using NLTK
Ref: n-grams in python, four, five, six grams?
Ref: "Elegant n-gram generation in Python"
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [i for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [i for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [i for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')] 3 grams:
[('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')] 4 grams:
[('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]
Another method to output:
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [' '.join(list(i)) for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [' '.join(list(i)) for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [' '.join(list(i)) for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .'] 3 grams:
["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .'] 4 grams:
["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']
获取一段文字中的大写字母开头的词组和单词
import nltk
from nltk.corpus import stopwords
a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."
tokens = nltk.word_tokenize(a)
caps = []
for i in range(1, 4):
for eles in nltk.ngrams(tokens, i):
length = len(list(eles))
for j in range(length):
if eles[j][0].islower() or not eles[j][0].isalpha():
break
elif j == length - 1:
caps.append(' '.join(list(eles))) caps = list(set(caps))
caps = [c for c in caps if c.lower() not in stopwords.words('english')]
print(caps) outputs:
['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']
【389】Implement N-grams using NLTK的更多相关文章
- 【leetcode】Implement strStr() (easy)
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【leetcode】Implement strStr()
Implement strStr() Implement strStr(). Returns the index of the first occurrence of needle in haysta ...
- 【Leetcode】【Easy】Implement strStr()
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【LeetCode225】 Implement Stack using Queues★
1.题目 2.思路 3.java代码 import java.util.LinkedList; import java.util.Queue; public class MyStack { priva ...
- 【LeetCode232】 Implement Queue using Stacks★
1.题目描述 2.思路 思路简单,这里用一个图来举例说明: 3.java代码 public class MyQueue { Stack<Integer> stack1=new Stack& ...
- 【LeetCode】Implement strStr()(实现strStr())
这道题是LeetCode里的第28道题. 题目描述: 实现 strStr() 函数. 给定一个 haystack 字符串和一个 needle 字符串,在 haystack 字符串中找出 needle ...
- 28. Implement strStr()【easy】
28. Implement strStr()[easy] Implement strStr(). Returns the index of the first occurrence of needle ...
- 【LeetCode】哈希表 hash_table(共88题)
[1]Two Sum (2018年11月9日,k-sum专题,算法群衍生题) 给了一个数组 nums, 和一个 target 数字,要求返回一个下标的 pair, 使得这两个元素相加等于 target ...
- 【LeetCode】String to Integer (atoi) 解题报告
这道题在LeetCode OJ上难道属于Easy.可是通过率却比較低,究其原因是须要考虑的情况比較低,非常少有人一遍过吧. [题目] Implement atoi to convert a strin ...
随机推荐
- 解决sql中上下左右backspace不能用的方法
一. 解决输入 BACKSPACE 键变成 ^h 的问题 #su - oracle $stty erase ^h. 要永久生效,可以加入到用户环境配置文件 .bash_profile 中 , 加入如下 ...
- tf.assign,tf.assign_add,tf.assign_sub
a = tf.Variable(0.0,dtype=tf.float32) with tf.Session() as sess: sess.run(tf.global_variables_initia ...
- MySQL产生随机字符
MySQL产生随机字符 UUID简介 UUID含义是通用唯一识别码 (Universally Unique Identifier),这是一个软件建构的标准,也是被开源软件基金会 (Open Softw ...
- 解决ExtNET ExtJS 特定日期选择月份跳转导致无法选择月份的问题
背景 项目使用 Ext.NET 2.2.0.40838 , 对应Ext JS4.2版本. 结果 2017/3/31 号的时候偶然间点日历选择控件选择2月,10月等月份突然就跳到3月份,9月份之类. 就 ...
- Linux coredump 的打开和关闭
(转载自 http://blog.sina.com.cn/s/blog_6b3765230100lazj.html) ulimit -c 输出如果为0,则说明coredump没有打开 ulimit - ...
- Html5——视频标签使用
video标签: 上面的例子使用一个 Ogg 文件,适用于Firefox.Opera 以及 Chrome 浏览器.要确保适用于 Safari 浏览器,视频文件必须是 MPEG4 类型.video 元素 ...
- JavaWeb中四大域对象的作用范围
JavaWeb的四大作用域为:PageContext,ServletRequest,HttpSession,ServletContext: PageContext域:作用范围是整个JSP页面,是四大作 ...
- 在javascript中toString 和valueOf的区别
1.toString()方法:主要用于Array.Boolean.Date.Error.Function.Number等对象转化为字符串形式.日期类的toString()方法返回一个可读的日期和字符串 ...
- git push declined due to email privacy restrictions 解决方法
push declined due to email privacy restrictions 今天push的时候发现了这个问题无法push 解决: 进入github主页==>setting = ...
- 本地计算机上的OracleDBConsoleorcl服务启动后停止
emca -repos dropemca -repos createemca -config dbcontrol db 这三步你都运行成功了也没有报错?最后没有提示你dbcontrol已经启动了么?, ...