Ref: Natural Language Toolkit

Ref: n-grams in python, four, five, six grams?

Ref: "Elegant n-gram generation in Python"

import nltk

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [i for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [i for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [i for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')] 3 grams:
[('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')] 4 grams:
[('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]

Another method to output:

import nltk

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good.""" # 1 gram tokens = nltk.word_tokenize(sentence) print("1 gram:\n", tokens, "\n") # 2 grams n = 2 tokens_2 = nltk.ngrams(tokens, n) print("2 grams:\n", [' '.join(list(i)) for i in tokens_2], "\n") # 3 grams n = 3 tokens_3 = nltk.ngrams(tokens, n) print("3 grams:\n", [' '.join(list(i)) for i in tokens_3], "\n") # 4 grams n = 4 tokens_4 = nltk.ngrams(tokens, n) print("4 grams:\n", [' '.join(list(i)) for i in tokens_4], "\n") outputs:
1 gram:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.'] 2 grams:
['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .'] 3 grams:
["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .'] 4 grams:
["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']

获取一段文字中的大写字母开头的词组和单词

import nltk
from nltk.corpus import stopwords
a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."
tokens = nltk.word_tokenize(a)
caps = []
for i in range(1, 4):
for eles in nltk.ngrams(tokens, i):
length = len(list(eles))
for j in range(length):
if eles[j][0].islower() or not eles[j][0].isalpha():
break
elif j == length - 1:
caps.append(' '.join(list(eles))) caps = list(set(caps))
caps = [c for c in caps if c.lower() not in stopwords.words('english')]
print(caps) outputs:
['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']

【389】Implement N-grams using NLTK的更多相关文章

  1. 【leetcode】Implement strStr() (easy)

    Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...

  2. 【leetcode】Implement strStr()

    Implement strStr() Implement strStr(). Returns the index of the first occurrence of needle in haysta ...

  3. 【Leetcode】【Easy】Implement strStr()

    Implement strStr(). Returns the index of the first occurrence of needle in haystack, or -1 if needle ...

  4. 【LeetCode225】 Implement Stack using Queues★

    1.题目 2.思路 3.java代码 import java.util.LinkedList; import java.util.Queue; public class MyStack { priva ...

  5. 【LeetCode232】 Implement Queue using Stacks★

    1.题目描述 2.思路 思路简单,这里用一个图来举例说明: 3.java代码 public class MyQueue { Stack<Integer> stack1=new Stack& ...

  6. 【LeetCode】Implement strStr()(实现strStr())

    这道题是LeetCode里的第28道题. 题目描述: 实现 strStr() 函数. 给定一个 haystack 字符串和一个 needle 字符串,在 haystack 字符串中找出 needle ...

  7. 28. Implement strStr()【easy】

    28. Implement strStr()[easy] Implement strStr(). Returns the index of the first occurrence of needle ...

  8. 【LeetCode】哈希表 hash_table(共88题)

    [1]Two Sum (2018年11月9日,k-sum专题,算法群衍生题) 给了一个数组 nums, 和一个 target 数字,要求返回一个下标的 pair, 使得这两个元素相加等于 target ...

  9. 【LeetCode】String to Integer (atoi) 解题报告

    这道题在LeetCode OJ上难道属于Easy.可是通过率却比較低,究其原因是须要考虑的情况比較低,非常少有人一遍过吧. [题目] Implement atoi to convert a strin ...

随机推荐

  1. VS2012常用快捷键!

    Shift+Alt+Enter: 切换全屏编辑Ctrl+B,T / Ctrl+K,K: 切换书签开关Ctrl+B,N / Ctrl+K,N: 移动到下一书签Ctrl+B,P: 移动到上一书签Ctrl+ ...

  2. MySQL库操作,表操作,数据操作。

      数据库服务器:本质就是一台计算机,该计算机上安装有数据库管理软件的服务端,供客户端访问使用. 1数据库管理系统RDBMS(本质就是一个C/S架构的套接字),关系型数据库管理系统. 库:(文件夹)- ...

  3. 当别人给你一个wsdl或者webservice接口时

    一  通过wsdl生成客户端 引用于http://www.cnblogs.com/yisheng163/p/4524808.html 参照http://www.cnblogs.com/xdp-gacl ...

  4. qt tcp 通信实例

    #include "mainwindow.h" #include "ui_mainwindow.h" #include <QHostAddress> ...

  5. vim more

      启用鼠标 :set mouse=a 跳转到下一函数 下一个函数开头 ]] 当前函数末尾/下一个函数的末尾 ][ 当前函数开头/上一个函数的开头 [[ 选项可以按任何顺序生效,可以放在文件名前或后边 ...

  6. Java GC如何判断对象是否为垃圾

    查找内存中不再使用的对象 引用计数法 引用计数法就是如果一个对象没有被任何引用指向,则可视之为垃圾.这种方法的缺点就是不能检测到环的存在. 2.根搜索算法 根搜索算法的基本思路就是通过一系列名为”GC ...

  7. python二进制转换

    例一.题目描述: 输入一个整数,输出该数二进制表示中1的个数.其中负数用补码表示. 分析: python没有unsigned int类型 >>> print ("%x&qu ...

  8. 【译】在Flask中使用Celery

    为了在后台运行任务,我们可以使用线程(或者进程). 使用线程(或者进程)的好处是保持处理逻辑简洁.但是,在需要可扩展的生产环境中,我们也可以考虑使用Celery代替线程.   Celery是什么? C ...

  9. php 目录

    1.laravel pswd  oauth2 2.理解二叉树 3.IOC容器 4.微信支付 4.1 微信支付申请 5.DI 6.支付宝 支付H5 7.html 转 word 8.php-fpm 启动详 ...

  10. 关于IK 分词器

    准备: 1 创建索引: PUT my_index PUT my_index2 2 先做好映射: PUT /my_index/*/_mapping { "properties": { ...