https://en.wikipedia.org/wiki/Part-of-speech_tagging

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

Principle

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch.

Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. Linguists distinguish parts of speech to various fine degrees, reflecting a chosen "tagging system".

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. For example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus). Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech, and found that about as many words were ambiguous there as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as Ncmsan for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.

History

The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kučera and W. Nelson Francis, in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word-frequency and of part-of-speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, by this time (2005) it had been superseded by larger corpora such as the 100-million-word British National Corpus.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

Use of Hidden Markov Models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.
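As a rough illustration of the counting involved (a toy sketch, not the code of CLAWS or any other real tagger), the fragment below estimates P(next tag | previous tag) from tag-bigram counts and then scores the competing readings of "can" after the article "the". The tag names (DET, NOUN, MODAL, ...) and all counts are invented for the example.

    import java.util.HashMap;
    import java.util.Map;

    /** Minimal sketch: estimate P(nextTag | prevTag) from tag-bigram counts. */
    public class BigramTagModel {
        private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
        private final Map<String, Integer> unigramCounts = new HashMap<>();

        /** Record one observed tag pair, e.g. (DET, NOUN), while scanning a tagged corpus. */
        public void observe(String prevTag, String nextTag) {
            bigramCounts.computeIfAbsent(prevTag, k -> new HashMap<>())
                        .merge(nextTag, 1, Integer::sum);
            unigramCounts.merge(prevTag, 1, Integer::sum);
        }

        /** Maximum-likelihood estimate of P(nextTag | prevTag). */
        public double probability(String prevTag, String nextTag) {
            int total = unigramCounts.getOrDefault(prevTag, 0);
            if (total == 0) return 0.0;
            int joint = bigramCounts.getOrDefault(prevTag, Map.of()).getOrDefault(nextTag, 0);
            return (double) joint / total;
        }

        public static void main(String[] args) {
            BigramTagModel model = new BigramTagModel();
            // Invented counts standing in for what a real corpus scan would produce:
            // after an article, nouns and adjectives are far more common than modal verbs.
            for (int i = 0; i < 40; i++) model.observe("DET", "NOUN");
            for (int i = 0; i < 40; i++) model.observe("DET", "ADJ");
            for (int i = 0; i < 20; i++) model.observe("DET", "NUM");
            // "the can": compare the noun and modal readings of "can".
            System.out.println("P(NOUN | DET)  = " + model.probability("DET", "NOUN"));  // 0.4
            System.out.println("P(MODAL | DET) = " + model.probability("DET", "MODAL")); // 0.0
        }
    }

Higher-order models extend the same bookkeeping to tag triples, at the cost of needing far more data to estimate the extra probabilities reliably.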

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item is very likely to be a preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
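The exhaustive strategy is easy to sketch. The fragment below (illustrative only, with made-up per-word tag probabilities) enumerates every tag assignment for a three-word sentence, multiplies the probability of each choice, and keeps the highest-scoring combination; this is the scheme whose cost the dynamic-programming methods described below avoid.

    import java.util.List;
    import java.util.Map;

    /** Sketch of brute-force disambiguation: try every tag combination, keep the most probable. */
    public class EnumerateAll {

        static double best = -1.0;
        static String bestTags = "";

        /** Recursively assign a tag to each word, multiplying the probability of each choice. */
        static void search(List<Map<String, Double>> choices, int i, double prob, String tags) {
            if (i == choices.size()) {
                if (prob > best) { best = prob; bestTags = tags.trim(); }
                return;
            }
            for (Map.Entry<String, Double> e : choices.get(i).entrySet()) {
                search(choices, i + 1, prob * e.getValue(), tags + " " + e.getKey());
            }
        }

        public static void main(String[] args) {
            // "the can rusts" with invented per-word tag probabilities (assumptions for the example).
            List<Map<String, Double>> choices = List.of(
                    Map.of("DET", 1.0),                    // "the"
                    Map.of("NOUN", 0.7, "MODAL", 0.3),     // "can"
                    Map.of("VERB", 0.8, "NOUN", 0.2));     // "rusts"
            search(choices, 0, 1.0, "");
            // Best combination: DET NOUN VERB (probability 1.0 * 0.7 * 0.8).
            System.out.println(bestTags + "  p=" + best);
        }
    }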

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing (1997) [2], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous.
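That baseline is simple enough to write down directly. The sketch below uses a toy lexicon with Brown-style tags (the word list and its counts are assumptions made up for the example) and tags every unknown word as a proper noun (NP):

    import java.util.Map;

    /** Baseline tagger: most frequent tag for known words, "NP" (proper noun) for unknown words. */
    public class BaselineTagger {
        // Toy lexicon mapping each word to its single most frequent tag (assumed, not real data).
        private static final Map<String, String> MOST_FREQUENT_TAG = Map.of(
                "the", "AT", "sailor", "NN", "dogs", "NNS", "hatch", "NN");

        static String tag(String word) {
            return MOST_FREQUENT_TAG.getOrDefault(word.toLowerCase(), "NP");
        }

        public static void main(String[] args) {
            for (String w : "The sailor dogs the hatch".split(" ")) {
                System.out.print(w + "/" + tag(w) + " ");
            }
            // Prints The/AT sailor/NN dogs/NNS the/AT hatch/NN -- note that "dogs" gets the
            // frequent-but-wrong plural-noun tag, the kind of error this baseline cannot avoid.
            System.out.println();
        }
    }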

CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech (DeRose 1990, p. 82)).

HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.[1]

Dynamic programming methods

In 1987, Steven DeRose[2] and Ken Church[3] independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
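For concreteness, here is a minimal Viterbi decoder for a first-order (tag-pair) HMM in the spirit of these methods; the two-tag model, the probabilities, and the absence of smoothing or log-space arithmetic are simplifications assumed for the example, not details taken from DeRose's or Church's systems.

    import java.util.Arrays;

    /** Minimal Viterbi decoder for a first-order HMM tagger (toy probabilities). */
    public class ViterbiSketch {
        /**
         * @param emit    emit[t][i] = P(word_i | tag t)
         * @param trans   trans[s][t] = P(tag t | previous tag s)
         * @param initial initial[t] = P(tag t at sentence start)
         * @return the most probable tag index for each word
         */
        static int[] decode(double[][] emit, double[][] trans, double[] initial) {
            int tags = initial.length, words = emit[0].length;
            double[][] score = new double[words][tags];
            int[][] back = new int[words][tags];

            for (int t = 0; t < tags; t++) score[0][t] = initial[t] * emit[t][0];
            for (int i = 1; i < words; i++) {
                for (int t = 0; t < tags; t++) {
                    for (int s = 0; s < tags; s++) {
                        double candidate = score[i - 1][s] * trans[s][t] * emit[t][i];
                        if (candidate > score[i][t]) { score[i][t] = candidate; back[i][t] = s; }
                    }
                }
            }
            // Trace back from the best final tag.
            int[] path = new int[words];
            for (int t = 1; t < tags; t++)
                if (score[words - 1][t] > score[words - 1][path[words - 1]]) path[words - 1] = t;
            for (int i = words - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
            return path;
        }

        public static void main(String[] args) {
            // Two tags {0 = NOUN, 1 = VERB} over three words; all numbers are assumptions.
            double[] initial = {0.6, 0.4};
            double[][] trans = {{0.3, 0.7}, {0.8, 0.2}};           // trans[prev][next]
            double[][] emit  = {{0.5, 0.1, 0.4}, {0.1, 0.6, 0.3}}; // emit[tag][wordPosition]
            System.out.println(Arrays.toString(decode(emit, trans, initial))); // prints [0, 1, 0]
        }
    }

The dynamic-programming table keeps only the best score per (position, tag) cell, so the cost grows linearly with sentence length rather than exponentially with the number of ambiguous words.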

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's, and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov models are now the standard method for part-of-speech assignment.

Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
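To make "similar contexts" concrete, the toy sketch below (not any published induction algorithm) counts each word's immediate right-hand neighbours over a tiny invented corpus and compares the resulting context vectors with cosine similarity; "the" and "a" come out maximally similar, while "the" and "eat" do not overlap at all. Real unsupervised taggers iterate this kind of statistic and cluster the words into emergent categories.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Toy distributional sketch: words with similar neighbour counts get high cosine similarity. */
    public class ContextSimilarity {

        /** Count, for each target word, how often each immediate right neighbour follows it. */
        static Map<String, Map<String, Integer>> rightNeighbourCounts(String[] tokens) {
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i + 1 < tokens.length; i++) {
                counts.computeIfAbsent(tokens[i], k -> new HashMap<>())
                      .merge(tokens[i + 1], 1, Integer::sum);
            }
            return counts;
        }

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            Set<String> keys = new HashSet<>(a.keySet());
            keys.addAll(b.keySet());
            double dot = 0, na = 0, nb = 0;
            for (String k : keys) {
                int x = a.getOrDefault(k, 0), y = b.getOrDefault(k, 0);
                dot += x * y; na += x * x; nb += y * y;
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Invented mini-corpus; in practice this would be a large untagged text.
            String[] corpus = ("the dog barked a dog slept the cat slept a cat barked "
                    + "dogs eat meat cats eat fish").split(" ");
            Map<String, Map<String, Integer>> ctx = rightNeighbourCounts(corpus);
            System.out.println("sim(the, a)   = " + cosine(ctx.get("the"), ctx.get("a")));   // 1.0
            System.out.println("sim(the, eat) = " + cosine(ctx.get("the"), ctx.get("eat"))); // 0.0
        }
    }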

Both supervised and unsupervised approaches can be further subdivided into rule-based, stochastic, and neural approaches.

Other taggers and methods

Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill Tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity. Unlike the Brill tagger where the rules are ordered sequentially, the POS and morphological tagging toolkit RDRPOSTagger stores rules in the form of a Ripple Down Rules tree.

Many machine learning methods have also been applied to the problem of POS tagging. Methods such as support vector machines (SVM), maximum entropy classifiers, the perceptron, and nearest-neighbor classification have all been tried, and most can achieve accuracy above 95%.

A direct comparison of several methods is reported (with references) at [3]. This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable.

However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported there are the best that can be achieved with a given approach; nor even the best that have been achieved with a given approach.

A more recent development is the use of the structure regularization method for part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.[4]

Issues

While there is broad agreement about basic categories, a number of edge cases make it difficult to settle on a single "correct" set of tags, even in a single language such as English. For example, it is hard to say whether "fire" is an adjective or a noun in

 the big green fire truck

A second important example is the use/mention distinction, as in the following example, where "blue" could be replaced by a word from any POS (the Brown Corpus tag set appends the suffix "-NC" in such cases):

 the word "blue" has 4 letters.

Words in a language other than that of the "main" text are commonly tagged as "foreign", usually in addition to a tag for the role the foreign word is actually playing in context.

There are also many cases where POS categories and "words" do not map one to one, for example:

 David's
gonna
don't
vice versa
first-cut
cannot
pre- and post-secondary
look (a word) up

In the last example, "look" and "up" arguably function as a single verbal unit, despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems.

It is unclear whether it is best to treat words such as "be", "have", and "do" as categories in their own right (as in the Brown Corpus), or as simply verbs (as in the LOB Corpus and the Penn Treebank). "be" has more forms than other English verbs, and occurs in quite different grammatical contexts, complicating the issue.

The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use, and include versions for multiple languages.

POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit may be virtually impossible. At the other extreme, S. Petrov, D. Das, and R. McDonald ("A Universal Part-of-Speech Tagset", http://arxiv.org/abs/1104.2086) have proposed a "universal" tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition, etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier on smaller tag-sets.

A different issue is that some cases are in fact ambiguous. Beatrice Santorini gives examples in "Part-of-speech Tagging Guidelines for the Penn Treebank Project" (3rd rev., June 1990) [4], including the following case (p. 32), in which "entertaining" can be either an adjective or a verb and there is no syntactic way to decide:

 The Duchess was entertaining last night.

Part-of-Speech Tagging in Chinese with HanLP

http://www.hankcs.com/nlp/part-of-speech-tagging.html

Part-of-speech tagging (POS tagging), also called word-class tagging or simply tagging, is the process of assigning the correct part of speech to every word produced by word segmentation, that is, deciding whether each word is a noun, a verb, an adjective, or something else. For Chinese, POS tagging is relatively easy, because Chinese words rarely vary in part of speech: most words have only one possible tag, or one tag occurs far more often than the runner-up. Reportedly, simply picking the most frequent tag for each word already yields a Chinese POS tagger with roughly 80% accuracy.

An HMM makes considerably higher accuracy possible; this section introduces the POS tagging module in HanLP.

Open-source project

The code in this section has been merged into the open-source HanLP project: http://www.hankcs.com/nlp/hanlp.html

Training

HanLP uses a first-order hidden Markov model in which the hidden states are POS tags and the observed states are words.

Corpus

The training data is the 2014 People's Daily segmented corpus:

  1. 人民网/nz 1月1日/t 讯/ng 据/p 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v ,/w 美国/nsf 华尔街/nsf 股市/n 在/p 2013年/t 的/ude1 最后/f 一天/mq 继续/v 上涨/vn ,/w 和/cc [全球/n 股市/n]/nz 一样/uyy ,/w 都/d 以/p [最高/a 纪录/n]/nz 或/c 接近/v [最高/a 纪录/n]/nz 结束/v 本/rz 年/qt 的/ude1 交易/vn 。/w
  2. 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v 说/v ,/w 标普/nz 500/m 指数/n 今年/t 上升/vi 29.6%/m ,/w 为/p 1997年/t 以来/f 的/ude1 最大/gm 涨幅/n ;/w [道琼斯/ntc 工业/n 平均/a 指数/n]/nz 上升/vi 26.5%/m ,/w 为/p 1996年/t 以来/f 的/ude1 最大/gm 涨幅/n ;/w [纳斯/nrf 达/v 克/q]/nz 上涨/vi 38.3%/m 。/w
  3. 就/d 12月31日/t 来说/uls ,/w 由于/p 就业/vn 前景/n 看好/v 和/cc [经济/n 增长/v]/nz 明年/t 可能/v 加速/vn ,/w 消费者/n 信心/n 上升/vi 。/w 工商/n 协进会/nis (/w ConferenceBoard/x )/w 报告/n ,/w 12月/t 消费者/n 信心/n 上升/vi 到/v 78.1/m ,/w 明显/a 高于/v 11月/t 的/ude1 72/m 。/w
  4. 另据/nz 《/w [华尔街/nsf 日报/n]/nz 》/w 报道/v ,/w 2013年/t 是/vshi 1995年/t 以来/f [美国/nsf 股市/n]/nz 表现/v 最好/d 的/ude1 一年/mq 。/w 这/rzv 一年/mq 里/f ,/w 投资/v [美国/nsf 股市/n]/nz 的/ude1 明智/a 做法/n 是/vshi 追/v 着/uzhe “/w 傻钱/nz ”/w 跑/v 。/w 所谓/v 的/ude1 “/w 傻钱/nz ”/w 策略/n ,/w 其实/d 就是/v 买入/vn 并/cc 持有/v 美国/nsf 股票/n 这样/rzv 的/ude1 普通/a 组合/vn 。/w 这个/rz 策略/n 要/v 比/p [对冲/vn 基金/n]/nz 和/cc 其它/rz 专业/n 投资者/nnd 使用/v 的/ude1 更为/d 复杂/a 的/ude1 投资/vn 方法/n 效果/n 好/a 得/ude3 多/a 。/w (/w 老/a 任/v )/w

Word-POS frequency dictionary

Counting how often each word occurs with each part of speech gives the core dictionary:

  1. 爱 v 3622 vn 598
  2. 爱因斯坦 nrf 20
  3. 爱国 a 178
  4. 爱国主义 n 68
  5. 飙升 v 200 vn 8
  6. 顺风 vi 27 vn 2
  7. 顺风吹火 i 1
  8. 顺风球 n 1
  9. 顺风耳 n 4
  10. 顺风车 nz 126
  11. 购 v 217 vg 151 vn 106
  12. 购书 v 7 vn 5
  13. 购买 v 3875 vn 637
  14. 购买人 n 7
  15. 购买力 n 42
  16. 购买户 n 1
  17. 购买欲 n 1
  18. 购买群 n 1
  19. 购买者 n 93
  20. 购买证 n 1
  21. 购入 v 115 vn 18
  22. ……

The dictionary confirms that Chinese words are indeed mostly unambiguous in part of speech, and that what ambiguity exists is concentrated between the verb (v) and nominal-verb (vn) tags. That said, the copy of the 2014 People's Daily segmented corpus I obtained does not appear to have been rigorously proofread: many words carry only a single tag, and there are quite a few errors. Perhaps when I have the chance (the funding or the academic backing), I will train on a higher-quality corpus; fortunately HanLP also maintains a general-purpose corpus-processing package, so I will leave that thread for later.
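As a concrete illustration of how even this dictionary alone supports the "most frequent tag" baseline mentioned earlier (the roughly 80% figure), the sketch below parses a few lines in the core-dictionary format shown above (word, then alternating tag and count) and keeps each word's highest-frequency tag. The parsing format is inferred from the listing, not taken from HanLP's source code.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: read "word tag count tag count ..." lines and keep each word's most frequent tag. */
    public class MostFrequentTag {

        static Map<String, String> load(String[] lines) {
            Map<String, String> best = new HashMap<>();
            for (String line : lines) {
                String[] parts = line.trim().split("\\s+");
                String word = parts[0];
                String bestTag = null;
                int bestCount = -1;
                for (int i = 1; i + 1 < parts.length; i += 2) {
                    int count = Integer.parseInt(parts[i + 1]);
                    if (count > bestCount) { bestCount = count; bestTag = parts[i]; }
                }
                best.put(word, bestTag);
            }
            return best;
        }

        public static void main(String[] args) {
            // A few entries copied from the listing above.
            String[] lines = {"爱 v 3622 vn 598", "购 v 217 vg 151 vn 106", "购买 v 3875 vn 637"};
            Map<String, String> tagger = load(lines);
            // Each word is mapped to v, its most frequent tag; tagging with this alone
            // is the roughly-80%-accuracy baseline mentioned earlier.
            System.out.println(tagger);
        }
    }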

Transition matrix

Counting the transition frequencies between tags gives the transition matrix:

The full transition matrix is in fact very large; download it to see it in full: 词性标注 转移矩阵.xls (the POS-tagging transition matrix).

Tagging

From the transition matrix and the frequencies in the core dictionary we can compute the HMM's initial, transition, and emission probabilities, and then solve for the best tag sequence. For the Viterbi algorithm and its implementation, see 《通用维特比算法的Java实现》 (a Java implementation of the generic Viterbi algorithm).
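A minimal sketch of the emission side of that computation, assuming simple maximum-likelihood estimates P(word | tag) = count(word, tag) / count(tag) built from the core dictionary's counts; HanLP's real implementation also has to handle smoothing, unknown words, and numerical underflow, which are omitted here.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: maximum-likelihood emission probabilities P(word | tag) from (word, tag) counts. */
    public class EmissionModel {
        private final Map<String, Map<String, Integer>> wordTagCounts = new HashMap<>(); // word -> tag -> count
        private final Map<String, Integer> tagTotals = new HashMap<>();                  // tag -> total count

        void add(String word, String tag, int count) {
            wordTagCounts.computeIfAbsent(word, k -> new HashMap<>()).merge(tag, count, Integer::sum);
            tagTotals.merge(tag, count, Integer::sum);
        }

        /** P(word | tag) = count(word, tag) / count(tag); 0 if the pair was never seen. */
        double emit(String word, String tag) {
            int total = tagTotals.getOrDefault(tag, 0);
            if (total == 0) return 0.0;
            return (double) wordTagCounts.getOrDefault(word, Map.of()).getOrDefault(tag, 0) / total;
        }

        public static void main(String[] args) {
            EmissionModel model = new EmissionModel();
            model.add("爱", "v", 3622);   // counts taken from the core-dictionary listing above
            model.add("爱", "vn", 598);
            model.add("购买", "v", 3875);
            model.add("购买", "vn", 637);
            System.out.println("P(爱 | v)  = " + model.emit("爱", "v"));
            System.out.println("P(爱 | vn) = " + model.emit("爱", "vn"));
        }
    }

These emission estimates, together with the transition matrix above, are exactly what a Viterbi decoder like the one sketched earlier consumes.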

Testing

Take "我的爱就是爱自然语言处理" ("my love is loving natural language processing") as an example:

    String text = "我的爱就是爱自然语言处理";
    Segment segment = new Segment();
    System.out.println("Before tagging: " + segment.seg(text));
    segment.enableSpeechTag(true);
    System.out.println("After tagging: " + segment.seg(text));

Output

Before tagging: [我/rr, 的/ude1, 爱/v, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

After tagging: [我/rr, 的/ude1, 爱/vn, 就是/v, 爱/v, 自然语言/gm, 处理/vn]

The two occurrences of "爱" get different tags: the first is a nominal verb (vn), the second a plain verb (v).

Another example:

Before tagging: [教授/nnt, 正在/d, 教授/nnt, 自然语言/gm, 处理/vn, 课程/n]

After tagging: [教授/nnt, 正在/d, 教授/v, 自然语言/gm, 处理/vn, 课程/n]

HanLP's POS tagging is already showing promising results.

The HanLP POS tag set

HanLP's HMM POS tagging model is trained on the 2014 People's Daily segmented corpus, to which a small number of words found only in the 1998 People's Daily corpus were later added. The HanLP tag set is therefore compatible with 《ICTPOS3.0汉语词性标记集》 (the ICTPOS 3.0 Chinese POS tag set) as well as with 《现代汉语语料库加工规范——词语切分与词性标注》 (the specification for processing modern Chinese corpora: word segmentation and POS tagging).

