sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频教程)

https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/?completed=/stemming-nltk-tutorial/

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016 @author: daxiong
"""
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer #训练数据
train_text=state_union.raw("2005-GWBush.txt")
#测试数据
sample_text=state_union.raw("2006-GWBush.txt")
'''
Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain.
The pre-packaged models may therefore be unsuitable:
use PunktSentenceTokenizer(text) to learn parameters from the given text
'''
#我们现在训练punkttokenizer(分句器)
custom_sent_tokenizer=PunktSentenceTokenizer(train_text)
#训练后,我们可以使用punkttokenizer(分句器)
tokenized=custom_sent_tokenizer.tokenize(sample_text) '''
nltk.pos_tag(["fire"]) #pos_tag(列表)
Out[19]: [('fire', 'NN')]
''' #文本词性标记函数
def process_content():
try:
for i in tokenized[0:5]:
words=nltk.word_tokenize(i)
tagged=nltk.pos_tag(words)
print(tagged)
except Exception as e:
print(str(e)) process_content()

One of the more powerful aspects of the NLTK module is the Part of Speech tagging that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs...etc. Even more impressive, it also labels by tense, and more. Here's a list of the tags, what they mean, and some examples:

POS tag list:

CC	coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-abverb where, when

How might we use this? While we're at it, we're going to cover a new sentence tokenizer, called the PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use. First, let's get some imports out of the way that we're going to use:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

Now, let's create our training and testing data:

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

One is a State of the Union address from 2005, and the other is from 2006 from past President George W. Bush.

Next, we can train the Punkt tokenizer like:

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

Then we can actually tokenize, using:

tokenized = custom_sent_tokenizer.tokenize(sample_text)

Now we can finish up this part of speech tagging script by creating a function that will run through and tag all of the parts of speech per sentence like so:

def process_content():
try:
for i in tokenized[:5]:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
print(tagged) except Exception as e:
print(str(e)) process_content()

The output should be a list of tuples, where the first element in the tuple is the word, and the second is the part of speech tag. It should look like:

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'NNP'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'DT'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')] [('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NNS'), (',', ','), ('distinguished', 'VBD'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'NN'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBN'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')] [('Tonight', 'NNP'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'NN'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'NN'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')] [('(', 'NN'), ('Applause', 'NNP'), ('.', '.'), (')', ':')] [('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]

At this point, we can begin to derive meaning, but there is still some work to do. The next topic that we're going to cover is chunking, which is where we group words, based on their parts of speech, into hopefully meaningful groups.

自然语言15_Part of Speech Tagging with NLTK的更多相关文章

  1. 自然语言15.1_Part of Speech Tagging 词性标注

    QQ:231469242 欢迎喜欢nltk朋友交流 https://en.wikipedia.org/wiki/Part-of-speech_tagging In corpus linguistics ...

  2. 自然语言12_Tokenizing Words and Sentences with NLTK

    https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ # -*- coding: utf-8 -*- ...

  3. 词性标注 parts of speech tagging

    In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging ...

  4. 自然语言处理NLP程序包(NLTK/spaCy)使用总结

    NLTK和SpaCy是NLP的Python应用,提供了一些现成的处理工具和数据接口.下面介绍它们的一些常用功能和特性,便于对NLP研究的组成形式有一个基本的了解. NLTK Natural Langu ...

  5. 自然语言27_Converting words to Features with NLTK

    https://www.pythonprogramming.net/words-as-features-nltk-tutorial/ Converting words to Features with ...

  6. 自然语言18.1_Named Entity Recognition with NLTK

    QQ:231469242 欢迎nltk爱好者交流 https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?c ...

  7. Part of Speech Tagging

    Natural Language Processing with Python Charpter 6.1 suffix_fdist处代码稍微改动. import nltk from nltk.corp ...

  8. 自然语言14_Stemming words with NLTK

    https://www.pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/ # -*- ...

  9. python and 我爱自然语言处理

    曾经因为NLTK的 缘故开始学习Python,之后渐渐成为我工作中的第一辅助脚本语言,虽然开发语言是C/C++,但平时的很多文本数据处理任务都交给了Python.离 开腾讯创业后,第一个作品课程图谱也 ...

随机推荐

  1. PHP与MySQL

    这周学习了PHP与MySQL的搭接下面来给大家分享一下: 1.账号注册,论坛发帖... 思路:通过form表单提交到PHP页面,PHP页面往MySQL中插入数据: 2.账号登陆 思路:form提交数据 ...

  2. nginx启动、重启、关闭

    一.启动 cd usr/local/nginx/sbin ./nginx 二.重启 更改配置重启nginx kill -HUP 主进程号或进程号文件路径 或者使用 cd /usr/local/ngin ...

  3. Shell命令_awk命令

    # awk '条件1{动作1} 条件2{动作2}-' 文件名 条件( Pattern): 一般使用关系表达式作为条件 x > 10 判断变量 x是否大于10 x>=10 大于等于 x< ...

  4. xgboost

    xgboost后面加了一个树的复杂度 对loss函数进行2阶泰勒展开,求得最小值, 参考链接:https://homes.cs.washington.edu/~tqchen/pdf/BoostedTr ...

  5. java 中hashmap和hashtable的区别

    1.hashmap是线程不安全的,能够有key,value能有null hashtable是线程安全的,key,value不能有null

  6. sed巧用,随手记

    删除文件后几行,或者前几行 方法1 都当做最后一行,来删除  for((i=1;i<4;i++)); do sed -i '$d' seq01.txt ; done  方法二 倒过来看 cat ...

  7. Elasticsearch 2.X 版本Java插件开发简述

    1:elasticsearch插件分类简述 2:Java插件开发要点 3:如何针对不同版本elasticsearch提供多版本的插件 4:插件具有外部依赖时遇到的一些问题(2016-09-07更新) ...

  8. 数据库开发基础-SQl Server 视图

    1.视图的概述            视图其实就是一条查询sql语句,用于显示一个或多个表或其他视图中的相关数据.视图将一个查询的结果作为一个表来使用,因此视图可以被看作是存储的查询或一个虚拟表.视图 ...

  9. [vijos1907][NOIP2014]飞扬的小鸟

    Description 是一款风靡一时的休闲手机游戏.玩家需要不断控制点击手机屏幕的频率来调节小鸟的飞行高度,让小鸟顺利通过画面右方的管道缝隙.如果小鸟一不小心撞到了水管或者掉在地上的话,便宣告失败. ...

  10. 【faster-rcnn】训练自己的数据——修改图片格式、类别

    修改图片格式 matlab代码 其实内部一些代码是用了rbg的fast-rcnn代码的. \datasets\VOCdevkit2007\VOCcode\VOCinit.m里面,查找'jpg',改成' ...