【389】Implement N-grams using NLTK
Ref: n-grams in python, four, five, six grams?
Ref: "Elegant n-gram generation in Python"
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [i for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [i for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [i for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')] 3 grams:
[('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')] 4 grams:
[('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]
Another method to output:
import nltk sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [' '.join(list(i)) for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [' '.join(list(i)) for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [' '.join(list(i)) for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .'] 3 grams:
["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .'] 4 grams:
["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']
获取一段文字中的大写字母开头的词组和单词
import nltk
from nltk.corpus import stopwords
a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."
tokens = nltk.word_tokenize(a)
caps = []
for i in range(1, 4):
for eles in nltk.ngrams(tokens, i):
length = len(list(eles))
for j in range(length):
if eles[j][0].islower() or not eles[j][0].isalpha():
break
elif j == length - 1:
caps.append(' '.join(list(eles))) caps = list(set(caps))
caps = [c for c in caps if c.lower() not in stopwords.words('english')]
print(caps) outputs:
['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']
【389】Implement N-grams using NLTK的更多相关文章
- 【leetcode】Implement strStr() (easy)
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【leetcode】Implement strStr()
Implement strStr() Implement strStr(). Returns the index of the first occurrence of needle in haysta ...
- 【Leetcode】【Easy】Implement strStr()
Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...
- 【LeetCode225】 Implement Stack using Queues★
1.题目 2.思路 3.java代码 import java.util.LinkedList; import java.util.Queue; public class MyStack { priva ...
- 【LeetCode232】 Implement Queue using Stacks★
1.题目描述 2.思路 思路简单,这里用一个图来举例说明: 3.java代码 public class MyQueue { Stack<Integer> stack1=new Stack& ...
- 【LeetCode】Implement strStr()(实现strStr())
这道题是LeetCode里的第28道题. 题目描述: 实现 strStr() 函数. 给定一个 haystack 字符串和一个 needle 字符串,在 haystack 字符串中找出 needle ...
- 28. Implement strStr()【easy】
28. Implement strStr()[easy] Implement strStr(). Returns the index of the first occurrence of needle ...
- 【LeetCode】哈希表 hash_table(共88题)
[1]Two Sum (2018年11月9日,k-sum专题,算法群衍生题) 给了一个数组 nums, 和一个 target 数字,要求返回一个下标的 pair, 使得这两个元素相加等于 target ...
- 【LeetCode】String to Integer (atoi) 解题报告
这道题在LeetCode OJ上难道属于Easy.可是通过率却比較低,究其原因是须要考虑的情况比較低,非常少有人一遍过吧. [题目] Implement atoi to convert a strin ...
随机推荐
- Jquery的框架解析
最近闲的刁痛,想看看jQuery源码.但是这个源码看起来 还是挺费劲的.所以呢整理一份框架出来, 避免走入jQuery关键字的误区,我用Gys代替关键字jQuery. 下面是源码: (function ...
- tp3.2 支付宝app支付
pay方法 /** *支付宝支付 */ public function pay($param) { vendor('alipay.AopSdk');// 加载类库 $config = array( ' ...
- KVM总结-KVM性能优化之CPU优化
前言 任何平台根据场景的不同,都有相应的优化.不一样的硬件环境.网络环境,同样的一个平台,它跑出的效果也肯定不一样.就好比一辆法拉利,在高速公路里跑跟乡村街道跑,速度和激情肯定不同… 所以,我们做运维 ...
- Android数据传递,使用广播BroadcastReceiver;
Android数据传递有很多种,Intent意图传递或使用Bundle去传递,接口监听回调传递数据,也可以把数据保存起来,使用的时候去读取等等等...,"当你知道足够多的数据传递的方式之后, ...
- solr学习(六):使用自定义int/long类型主键
需求分析: 我不想使用solr默认的主键id,我想换成其他的,比如我的文章id为article_id,我想让article_id作为主键. 而且,我的主键是int类型,而solr的主键默认是strin ...
- Python NLTK——python与nltk配置
按照<Python自然语言处理>中的步骤安装Python后nltk总是部署失败,出现如下提示: >>> import nltk Traceback (most recen ...
- hive计算周一的日期
) FreeMarker --',-7)?date('yyyy-MM-dd'),'week')?string('yyyy-MM-dd')}'
- fabric镜像安装脚本分析
#!/bin/bash # # Copyright IBM Corp. All Rights Reserved. # # SPDX-License-Identifier: Apache-2.0 # e ...
- FreeMarker的空值运算符和逻辑运算符
1.空值处理运算符 如果你在模板中使用了变量但是在代码中没有对变量赋值,那么运行生成时会抛出异常.但是有些时候,有的变量确实是null,怎么解决这个问题呢? 判断某变量是否存在:“??” 用法为:va ...
- 【Excel技能】字符串包含某字符串个数?替换许多组字符串?
=len(单元格A)-len(substitute(单元格A,某字符串,)) 原理:将某字符串替换成空,前后字符串长即为减去的这个字符串长度,这个字符串出现个数=前后字符串长度之差/这个字符串长度 = ...