http://www.cnblogs.com/yuxc/archive/2012/02/09/2344474.html

Chapter8   

Analyzing Sentence Structure  分析句子结构

Earlier chapters focused on words: how to identify them, analyze their structure, assign them to lexical categories, and access their meanings. We have also seen how to identify patterns in word sequences or n-grams. However, these methods only scratch the surface of the complex constraints that govern sentences. We need a way to deal with the ambiguity that natural language is famous for. We also need to be able to cope with the fact that there are an unlimited number of possible sentences, and we can only write finite programs to analyze their structures and discover their meanings.

The goal of this chapter is to answer the following questions:

  1. How can we use a formal grammar to describe the structure of an unlimited set of sentences?

我们如何使用形式化语法来描述句子的无限制集的结构?

  1. How do we represent the structure of sentences using syntax trees?

我们如何使用句法树来表示句子结构?

  1. How do parsers analyze a sentence and automatically build a syntax tree?

语法分析器如何分析一个句子并且自动地构建一个语法树?

Along the way, we will cover the fundamentals of English syntax, and see that there are systematic aspects of meaning that are much easier to capture once we have identified the structure of sentences.

8.1   Some Grammatical Dilemmas 一些语法困境

Linguistic Data and Unlimited Possibilities 语言数据和无限可能

Previous chapters have shown you how to process and analyse text corpora, and we have stressed the challenges for NLP in dealing with the vast amount of electronic language data that is growing daily. Let's consider this data more closely, and make the thought experiment that we have a gigantic(巨大的) corpus consisting of everything that has been either uttered(完全) or written in English over, say, the last 50 years. Would we be justified in calling this corpus "the language of modern English"? There are a number of reasons why we might answer No. Recall that inChapter 3, we asked you to search the web for instances of the pattern the of. Although it is easy to find examples on the web containing this word sequence, such as New man at the of IMG (http://www.telegraph.co.uk/sport/2387900/New-man-at-the-of-IMG.html), speakers of English will say that most such examples are errors, and therefore not part of English after all.

Accordingly, we can argue that the "modern English" is not equivalent to the very big set of word sequences in our imaginary corpus. Speakers of English can make judgements about these sequences, and will reject some of them as being ungrammatical(不合文法的).

Equally, it is easy to compose a new sentence and have speakers agree that it is perfectly good English. For example, sentences have an interesting property that they can be embedded inside larger sentences. Consider the following sentences:

(1)

a.

Usain Bolt broke the 100m record

b.

The Jamaica Observer reported that Usain Bolt broke the 100m record

c.

Andre said The Jamaica Observer reported that Usain Bolt broke the 100m record

d.

I think Andre said the Jamaica Observer reported that Usain Bolt broke the 100m record

If we replaced whole sentences with the symbol S, we would see patterns like Andre said S and I think S. These are templates for taking a sentence and constructing a bigger sentence. There are other templates we can use, like S but S, and S when S. With a bit of ingenuity(独创性) we can construct some really long sentences using these templates. Here's an impressive example from a Winnie the Pooh story by A.A. Milne, In which Piglet is Entirely Surrounded by Water:

[You can imagine Piglet's joy when at last the ship came in sight of him.] In after-years he liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull's egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment, luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke the Piglet up and just gave him time to jerk himself back into safety and say, "How interesting, and did she?" when — well, you can imagine his joy when at last he saw the good ship, Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him...

This long sentence actually has a simple structure that begins S but S when S. We can see from this example that language provides us with constructions which seem to allow us to extend sentences indefinitely. It is also striking that we can understand sentences of arbitrary length that we've never heard before: it's not hard to concoct(混合构造) an entirely novel sentence, one that has probably never been used before in the history of the language, yet all speakers of the language will understand it.

The purpose of a grammar is to give an explicit description of a language. But the way in which we think of a grammar is closely intertwined with what we consider to be a language. Is it a large but finite set of observed utterances and written texts? Is it something more abstract like the implicit knowledge that competent speakers have about grammatical sentences? Or is it some combination of the two? We won't take a stand on this issue, but instead will introduce the main approaches.

In this chapter, we will adopt the formal framework of "generative grammar", in which a "language" is considered to be nothing more than an enormous collection of all grammatical sentences, and a grammar is a formal notation that can be used for "generating" the members of this set. Grammars use recursive productions of the form S → S and S, as we will explore in Section 8.3. In Chapter 10 we will extend this, to automatically build up the meaning of a sentence out of the meanings of its parts.

Ubiquitous Ambiguity 普遍存在的歧义性

A well-known example of ambiguity is shown in (2), from the Groucho Marx movie, Animal Crackers (1930):

(2)

While hunting in Africa, I shot an elephant in my pajamas. How an elephant got into my pajamas I'll never know.

Let's take a closer look at the ambiguity in the phrase: I shot an elephant in my pajamas. First we need to define a simple grammar:

>>> groucho_grammar = nltk.parse_cfg("""

... S -> NP VP

... PP -> P NP

... NP -> Det N | Det N PP | 'I'

... VP -> V NP | VP PP

... Det -> 'an' | 'my'

... N -> 'elephant' | 'pajamas'

... V -> 'shot'

... P -> 'in'

... """)

This grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase in my pajamas describes the elephant or the shooting event.

>>> sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

>>> parser = nltk.ChartParser(groucho_grammar)

>>> trees = parser.nbest_parse(sent)

>>> for tree in trees:

...     print tree

...

(S

  (NP I)

  (VP

    (V shot)

    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))

(S

  (NP I)

  (VP

    (VP (V shot) (NP (Det an) (N elephant)))

    (PP (P in) (NP (Det my) (N pajamas)))))

The program produces two bracketed structures, which we can depict as trees, as shown in (3b):

(3)

a.

b.

Notice that there's no ambiguity concerning the meaning of any of the words; e.g. the word shot doesn't refer to the act of using a gun in the first sentence, and using a camera in the second sentence.

Note

Your Turn: Consider the following sentences and see if you can think of two quite different interpretations: Fighting animals could be dangerous. Visiting relatives can be tiresome. Is ambiguity of the individual words to blame? If not, what is the cause of the ambiguity?

This chapter presents grammars and parsing, as the formal and computational methods for investigating and modeling the linguistic phenomena we have been discussing. As we shall see, patterns of well-formedness and ill-formedness in a sequence of words can be understood with respect to the phrase structure and dependencies. We can develop formal models of these structures using grammars and parsers. As before, a key motivation is natural language understanding. How much more of the meaning of a text can we access when we can reliably recognize the linguistic structures it contains? Having read in a text, can a program "understand" it enough to be able to answer simple questions about "what happened" or "who did what to whom"? Also as before, we will develop simple programs to process annotated corpora and perform useful tasks

Python自然语言处理学习笔记(69)的更多相关文章

  1. python自然语言处理学习笔记1

    1.搭建环境 下载anaconda并安装,(其自带python2.7和一些常用包,NumPy,Matplotlib),第一次启动使用spyder 2.下载nltk import nltk nltk.d ...

  2. Python自然语言处理学习笔记之性别识别

    从今天起开始写自然语言处理的实践用法,今天学了文本分类,并没用什么创新的东西,只是把学到的知识点复习一下 性别识别(根据给定的名字确定性别) 第一步是创建一个特征提取函数(feature extrac ...

  3. python自然语言处理——学习笔记:Chapter3纠错

    2017-12-06更新:很多代码执行结果与书中不一致,是因为python的版本不一致.如果发现有问题,可以参考英文版: http://www.nltk.org/book/ 第三章,P87有一段处理h ...

  4. python自然语言处理学习笔记2

    基础语法 搜索文本----词语索引使我们看到词的上下 text1.concordance("monstrous") 词出现在相似的上下文中 text1.similar(" ...

  5. Python自然语言处理学习笔记之信息提取步骤&分块(chunking)

    一.信息提取模型 信息提取的步骤共分为五步,原始数据为未经处理的字符串, 第一步:分句,用nltk.sent_tokenize(text)实现,得到一个list of strings 第二步:分词,[ ...

  6. Python自然语言处理学习笔记之评价(evaluationd)

    对模型的评价是在test set上进行的,本文首先介绍测试集应该满足的特征,然后介绍四种评价方法. 一.测试集的选择 1.首先,测试集必须是严格独立于训练集的,否则评价结果一定很高,但是虚高,不适用于 ...

  7. Python自然语言处理学习笔记之选择正确的特征(错误分析 error analysis)

    选择合适的特征(features)对机器学习的效率非常重要.特征的提取是一个不断摸索的过程(trial-and-error),一般靠直觉来发现哪些特征对研究的问题是相关的. 一种做法是把你能想到的所有 ...

  8. Requests:Python HTTP Module学习笔记(一)(转)

    Requests:Python HTTP Module学习笔记(一) 在学习用python写爬虫的时候用到了Requests这个Http网络库,这个库简单好用并且功能强大,完全可以代替python的标 ...

  9. python网络爬虫学习笔记

    python网络爬虫学习笔记 By 钟桓 9月 4 2014 更新日期:9月 4 2014 文章文件夹 1. 介绍: 2. 从简单语句中開始: 3. 传送数据给server 4. HTTP头-描写叙述 ...

随机推荐

  1. Sublime Text 使用简介

    Sublime Text使用介绍 如果说Notepad++是一款不错Code神器,那么Sublime Text应当称得上是神器滴哥.Sublime Text最大的优点就是跨平台,Mac和Windows ...

  2. redux学习笔记

    中文api:http://cn.redux.js.org/docs/react-redux/troubleshooting.html 3.6 Reducer Store 收到 Action 以后,必须 ...

  3. C#微信公众号开发 -- (三)用户关注之后自动回复

    通过了上一篇文章之后的微信开发者验证之后,我们就可以做微信公众号的代码开发了. 当我们点击关注某个公众号的时候,有时候会发现他会自动给我们回复一条消息,比如欢迎关注XXX公众号.这个功能其实是在点击关 ...

  4. java中collection、map、set、list简介 (转)

    Collection接口  Collection是最基本的集合接口,一个Collection代表一组Object,即Collection的元素(Elements).一些Collection允许相同的元 ...

  5. Openfire:安装指南

    本文的英文原文来自 http://www.igniterealtime.org/builds/openfire/docs/latest/documentation/install-guide.html ...

  6. c编程:提示用户输入一个0—9的数字进行猜测电脑产生的随机数。一共有三次机会。

    // //  main.c //  使用c语言进行编程: 题目:由电脑生成一个由0-9之间的随机数,提示用户也输入一个数字进行猜测.当猜测三次仍不中的时候结束程序. 编译环境:Xcode6.3 特别介 ...

  7. 安装aptana插件报Error opening the editor. java.lang.NullPointerException

    Aptana的官方网站下载eclipse的插件:  http://update.aptana.com/update/studio/3.2/ ,可以在线安装也可以下载插件后再安装,我是以在线的形式安装的 ...

  8. jeesite 经常出现java.lang.ClassNotFoundException: org.springframework.web.context.ContextLoaderL解决思路

    本来jeesite是用的maven,但是我开发的项目用这个框架没用maven, 经常报这个错,我能用的办法就是将springmvc—web.jar包删掉,然后重新添加

  9. OpenCV学习目录(持续更新)

    这个暑假开始,需要用到图像处理相关的东西,于是我选择了OpenCV库,这里记录下我的整个学习过程. 参考资料: <OpenCV 2计算机视觉编程手册> 张静 译,科学出版社 1. Linu ...

  10. javascript 学习笔记之模块化编程

    题外: 进行web开发3年多了,javascript(后称js)用的也比较多,但是大部分都局限于函数的层次,有些公共的js函数可重用性不好,造成了程序的大量冗余,可读性差(虽然一直保留着注释的习惯,但 ...