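1. What is the Stanford Parser?

The Stanford Parser is one of the tools provided by the Stanford NLP Group. It performs syntactic parsing and supports English, Chinese, German, French, Arabic, and other languages.

The compiled jar files, source code, javadoc, and so on can be downloaded here: http://nlp.stanford.edu/software/lex-parser.shtml#Download

The FAQ is at http://nlp.stanford.edu/software/parser-faq.shtml, and reading it answers most basic questions. Of course, you have to be able to read English, right? Haha.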

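2. How to use the Stanford Parser (for Chinese)

This post only covers calling the parser from a Java project. Download the archive from the address above, unpack it, and add the following two jar files to the Java build path: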

stanford-parser.jar
stanford-parser-xxx-models.jar

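The package edu.stanford.nlp.parser.lexparser.demo inside stanford-parser.jar contains two very simple examples that can be run directly.

Those examples show English usage, however, and handling Chinese is somewhat different. The FAQ page mentioned in section 1 does include a brief description of how to process Chinese (question 24), but it does so directly with the java command.

The command given in the FAQ is: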

$ java -server -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 /u/nlp/data/lexparser/chineseFactored.ser.gz chinese-onesent-utf8.txt

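Clearly, everything starts from the class edu.stanford.nlp.parser.lexparser.LexicalizedParser. Once we understand that class's main function, we will roughly know how to process Chinese.

Before that, here is what the FAQ says about the pre-trained Chinese grammars: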

The parser is supplied with 5 Chinese grammars (and, with access to suitable training data, you could train other versions). You can find them inside the supplied stanford-parser-YYYY-MM-DD-models.jar file (in the GUI, select this file and then navigate inside it; at the command line, use jar -tf to see its contents). All of these grammars are trained on data from the Penn Chinese Treebank, and you should consult their site for details of the syntactic representation of Chinese which they use. They are:

  Xinhua (mainland, newswire): xinhuaPCFG.ser.gz (PCFG), xinhuaFactored.ser.gz (Factored), xinhuaFactoredSegmenting.ser.gz (Factored, segmenting)
  Mixed Chinese: chinesePCFG.ser.gz (PCFG), chineseFactored.ser.gz (Factored)

The PCFG parsers are smaller and faster. But the Factored parser is significantly better for Chinese, and we would generally recommend its use. The xinhua grammars are trained solely on Xinhua newspaper text from mainland China. We would recommend their use for parsing material from mainland China. The chinese grammars also include some training material from Hong Kong SAR and Taiwan. We'd recommend their use if parsing material from these areas or a mixture of text types. Note, though that all the training material uses simplified characters; traditional characters were converted to simplified characters (usually correctly). Four of the parsers assume input that has already been word segmented, while the fifth does word segmentation internal to the parser. This is discussed further below. The parser also comes with 3 Chinese example sentences, in files whose names all begin with chinese.
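The selection advice above can be sketched as a small helper. This is only an illustration: ChineseGrammarChooser is a hypothetical class, not part of the Stanford Parser, and the grammar file names are assumptions based on the FAQ, so check them with jar -tf against your models jar. The directory prefix is the one used in the code in section 4.

```java
/** Hypothetical helper (not part of the Stanford Parser) that picks a Chinese grammar. */
final class ChineseGrammarChooser {
    // Directory inside the models jar, as used in the code in section 4 of this post.
    static final String MODEL_DIR = "edu/stanford/nlp/models/lexparser/";

    /**
     * @param mainlandOnly      true if the text is mainland newswire-style material
     * @param factored          true to prefer the (slower, more accurate) Factored parser
     * @param needsSegmentation true if the input is NOT word-segmented yet
     */
    static String choose(boolean mainlandOnly, boolean factored, boolean needsSegmentation) {
        if (needsSegmentation) {
            // Assumption: the single segmenting grammar is the Xinhua factored one.
            return MODEL_DIR + "xinhuaFactoredSegmenting.ser.gz";
        }
        String corpus = mainlandOnly ? "xinhua" : "chinese";
        String kind = factored ? "Factored" : "PCFG";
        return MODEL_DIR + corpus + kind + ".ser.gz";
    }

    public static void main(String[] args) {
        // Mixed text types, already segmented, accuracy preferred over speed.
        System.out.println(choose(false, true, false));
    }
}
```

Keeping the choice in one place makes it easy to switch between the faster PCFG grammars during development and the Factored ones for final runs.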

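3. Analysis of the main function of the LexicalizedParser class

Start with the javadoc for the basic usage. The main function supports many options. It can build and serialize a parser from treebank data, and it can parse sentences from a file or from the content of a URL.

It has two main functions: training a parser, and parsing sentences with a parser. The following is taken from the javadoc of main: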

Sample Usages:

  • Train a parser (saved to serializedGrammarFilename) from a directory of trees (trainFilesPath, with an optional fileRange, e.g., 0-1000):
    java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -saveToSerializedFile serializedGrammarFilename
  • Train a parser (not saved) from a directory of trees, and test it (reporting scores) on a directory of trees:
    java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -testTreebank testFilePath [fileRange]
  • Parse one or more files, given a serialized grammar and a list of files:
    java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] serializedGrammarPath filename [filename] ...
  • Test and report scores for a serialized grammar on trees in an output directory:
    java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath [fileRange]

If serializedGrammarPath ends in .gz, the grammar is read and written in gzip format.

If serializedGrammarPath is a URL starting with http://, the parser is read from that URL.
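The two path rules above can be sketched with the JDK alone. This is a minimal illustration of the described behavior, not the parser's actual loading code; GrammarStreams and its method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch of how a grammar path could be turned into an input stream. */
final class GrammarStreams {
    /** A path is treated as a URL when it starts with http://. */
    static boolean isUrl(String path) {
        return path.startsWith("http://");
    }

    /** A path is treated as gzip-compressed when it ends in .gz. */
    static boolean isGzipped(String path) {
        return path.endsWith(".gz");
    }

    /** Opens the grammar from a URL or a local file, decompressing when needed. */
    static InputStream open(String path) throws IOException {
        InputStream raw = isUrl(path)
                ? URI.create(path).toURL().openStream()
                : new FileInputStream(path);
        InputStream in = new BufferedInputStream(raw);
        try {
            return isGzipped(path) ? new GZIPInputStream(in) : in;
        } catch (IOException e) {
            in.close(); // do not leak the underlying stream if the gzip header is invalid
            throw e;
        }
    }
}
```

A caller would wrap the result in a try-with-resources block, for example try (InputStream in = GrammarStreams.open("chineseFactored.ser.gz")) { ... }.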

The fileRange argument specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1 (if all your trees are in a single file, just give a dummy argument such as 0 or 1).
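To make the fileRange semantics concrete, here is a rough sketch of how such a range specification could be matched against file names. It is not the parser's own implementation. FileRangeSketch is a hypothetical class, the file names are made up, and taking the last run of digits in the name is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of matching a file name against a range spec like "1-300,500-725,9000". */
final class FileRangeSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns true if the last number found in fileName lies in one of the ranges. */
    static boolean accepts(String rangeSpec, String fileName) {
        Matcher m = DIGITS.matcher(fileName);
        long number = -1;
        while (m.find()) {
            number = Long.parseLong(m.group()); // keep the last run of digits (assumption)
        }
        if (number < 0) {
            return false; // no number in the name at all
        }
        for (String part : rangeSpec.split(",")) {
            String[] bounds = part.trim().split("-");
            long low = Long.parseLong(bounds[0].trim());
            // A single value such as "9000" is a range of one.
            long high = bounds.length > 1 ? Long.parseLong(bounds[1].trim()) : low;
            if (number >= low && number <= high) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("1-300,500-725,9000", "chtb_0600.fid")); // inside 500-725
        System.out.println(accepts("1-300,500-725,9000", "chtb_0400.fid")); // in no range
    }
}
```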

The parser can write a grammar as a serialized Java object file, as a text file, or as both, using the following command:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]

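If no files to parse are given, a default sentence is parsed.

Parameters: many other options can appear in the same position as -v. The most commonly used ones are listed below (some descriptions are quoted from the javadoc as-is):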

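-tLPP class Specifies the TreebankLangParserParams to use. This is required when parsing a language other than English or using a treebank other than the English Penn Treebank, and the option must appear before any other language-specific options. (Specifying it is recommended even when loading a serialized grammar; it is necessary if the language pack specifies a needed character encoding or you wish to specify language-specific options on the command line.)

-encoding charset Specifies the character encoding of the input and output files. When this option appears after -tLPP, it overrides the value set in the TreebankLangParserParams.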
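The ordering rule for these two options can be captured when building the argument array in code. The sketch below only assembles the arguments. ParserArgs is a hypothetical helper, and the class name edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams is an assumption about the Chinese language pack, so check it against your parser version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles arguments for LexicalizedParser.main in a safe order. */
final class ParserArgs {
    static String[] build(String tlppClass, String encoding, String grammar, String... files) {
        List<String> args = new ArrayList<>();
        if (tlppClass != null) {
            // -tLPP goes first, before any other language-specific option.
            args.add("-tLPP");
            args.add(tlppClass);
        }
        if (encoding != null) {
            // Placed after -tLPP so that it overrides the language pack's default encoding.
            args.add("-encoding");
            args.add(encoding);
        }
        args.add(grammar);
        for (String file : files) {
            args.add(file);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] parserArgs = build(
                "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams", // assumed class name
                "utf-8",
                "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",
                "chinese-onesent-utf8.txt");
        System.out.println(String.join(" ", parserArgs));
    }
}
```

The resulting array would then be passed to LexicalizedParser.main, in the same way as the hand-written array in section 4.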

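-tokenized Indicates that the input is already tokenized (words separated by whitespace). When this option is present, the parser skips its own tokenization and splits on whitespace only. Unless you specify a special escaper with -escaper, make sure the special characters in your tokens follow the conventions of the treebank in use (for example, for the Penn English Treebank you must convert "(" to "-LRB-", "3/4" to "3\/4", and so on).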
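As an illustration of the escaping mentioned above, here is a minimal sketch that handles only parentheses and slashes in whitespace-tokenized input. It is not the library's PTBEscapingProcessor. PtbEscapeSketch is a hypothetical class, and it assumes the input has not been escaped already.

```java
/** Hypothetical sketch of Penn-Treebank-style escaping for pre-tokenized input. */
final class PtbEscapeSketch {
    /** Escapes a single token; assumes the token is not already escaped. */
    static String escapeToken(String token) {
        switch (token) {
            case "(":
                return "-LRB-";
            case ")":
                return "-RRB-";
            default:
                return token.replace("/", "\\/"); // "3/4" becomes "3\/4"
        }
    }

    /** Escapes every whitespace-separated token on a line and rejoins with single spaces. */
    static String escapeLine(String line) {
        StringBuilder sb = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(escapeToken(token));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeLine("He paid 3/4 ( roughly )"));
    }
}
```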

-escaper class Specifies a class of type Function<List<HasWord>,List<HasWord>> that performs the required escaping. For example, it could change "(" to "-LRB-" for the Penn English Treebank. A provided escaper that does such things for the Penn English Treebank is edu.stanford.nlp.process.PTBEscapingProcessor.

-tokenizerFactory class Specifies a TokenizerFactory class to be used for tokenization.

-tokenizerOptions options Specifies the options the TokenizerFactory needs for tokenization, as a comma-separated list. For PTBTokenizer, options of interest include americanize=false and asciiQuotes (for German). Any tokenizer option that makes the tokenization differ from the one used for the parser's training data may degrade the parser's performance.

-sentences token Specifies a token that marks sentence boundaries. A value of newline causes sentence breaking on newlines. A value of onePerElement causes each element (using the XML -parseInside option) to be treated as a sentence. All other tokens will be interpreted literally, and must be exactly the same as tokens returned by the tokenizer. For example, you might specify "|||" and put that symbol sequence as a token between sentences. If no explicit sentence breaking option is chosen, sentence breaking is done based on a set of language-particular sentence-ending patterns.
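The literal-token case described above amounts to splitting a token stream on a delimiter. The following sketch illustrates that idea only. SentenceSplitSketch is a hypothetical class, not the parser's own code, and dropping empty sentences is a choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: split a token stream into sentences on a literal delimiter token. */
final class SentenceSplitSketch {
    static List<List<String>> split(List<String> tokens, String delimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals(delimiter)) {
                // The delimiter must match a whole token exactly; it is not part of any sentence.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // last sentence without a trailing delimiter
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("w1", "w2", "|||", "w3", "w4", "w5");
        System.out.println(split(tokens, "|||"));
    }
}
```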

-parseInside element Specifies that parsing should only be done for tokens inside the indicated XML-style elements (done as simple pattern matching, rather than XML parsing). For example, if this is specified as sentence, then the text inside the sentence element would be parsed. Using "-parseInside s" gives you support for the input format of Charniak's parser. Sentences cannot span elements. Whether the contents of the element are treated as one sentence or potentially multiple sentences is controlled by the -sentences flag. The default is potentially multiple sentences. This option gives support for extracting and parsing text from very simple SGML and XML documents, and is provided as a user convenience for that purpose. If you want to really parse XML documents before NLP parsing them, you should use an XML parser, and then call to a LexicalizedParser on appropriate CDATA.
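The "simple pattern matching" mentioned above can be approximated with a regular expression. This is a rough sketch under that assumption, not the parser's actual matching code, and ParseInsideSketch is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull out the text inside XML-style elements by pattern matching. */
final class ParseInsideSketch {
    static List<String> textInside(String element, String document) {
        String name = Pattern.quote(element);
        // Opening tag (optionally with attributes), shortest possible body, closing tag.
        Pattern p = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>", Pattern.DOTALL);
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(document);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String doc = "<doc><s>First one .</s> skipped <s id=\"2\">Second one .</s></doc>";
        System.out.println(textInside("s", doc));
    }
}
```

Like the option itself, this works only for very simple markup. Nested elements of the same name or text inside CDATA sections need a real XML parser.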

-tagSeparator char Specifies to look for tags on words following the word and separated from it by a special character char. For instance, many tagged corpora have the representation "house/NN" and you would use -tagSeparator /. Notes: This option requires that the input be pretokenized. The separator has to be only a single character, and there is no escaping mechanism. However, splitting is done on the last instance of the character in the token, so that cases like "3\/4/CD" are handled correctly. The parser will in all normal circumstances use the tag you provide, but will override it in the case of very common words in cases where the tag that you provide is not one that it regards as a possible tagging for the word. The parser supports a format where only some of the words in a sentence have a tag (if you are calling the parser programmatically, you indicate them by having them implement the HasTag interface). You can do this at the command-line by only having tags after some words, but you are limited by the fact that there is no way to escape the tagSeparator character.
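Splitting on the last instance of the separator, as described above, could look like the following sketch. TagSeparatorSketch is a hypothetical class, and treating a leading or trailing separator as "no tag" is an assumption made here.

```java
/** Hypothetical sketch: split "word/TAG" on the last occurrence of the separator. */
final class TagSeparatorSketch {
    /** Returns {word, tag}; tag is null when the token carries no tag. */
    static String[] split(String token, char separator) {
        int i = token.lastIndexOf(separator);
        if (i <= 0 || i == token.length() - 1) {
            return new String[] { token, null }; // no separator, or nothing on one side of it
        }
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("3\\/4/CD", '/');
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```

Because there is no escaping mechanism, an untagged token that itself contains the separator (for example 3\/4 with no tag) would still be split, which is the limitation noted above.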

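-maxLength leng Specifies the maximum length of sentence to parse, which limits memory consumption. If it is not specified, the parser grows to handle longer sentences as they are encountered, but may run out of memory doing so.

-outputFormat styles Specifies the output format: penn for prettyprinting as in the Penn treebank files, or oneline for printing sentences one per line, words, wordsAndTags, dependencies, typedDependencies, or typedDependenciesCollapsed. Multiple styles may be given as a comma-separated list. See the documentation of the TreePrint class for more information.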
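When the -outputFormat value is assembled in code, it can help to validate the comma-separated list before handing it to the parser. The sketch below checks only against the styles named in the paragraph above. OutputFormatSketch is a hypothetical class, and TreePrint may accept more styles than are listed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: validate a comma-separated -outputFormat value. */
final class OutputFormatSketch {
    // Only the styles mentioned in this post; the real list in TreePrint may be longer.
    static final Set<String> KNOWN = new LinkedHashSet<>(Arrays.asList(
            "penn", "oneline", "words", "wordsAndTags",
            "dependencies", "typedDependencies", "typedDependenciesCollapsed"));

    static List<String> parse(String styles) {
        List<String> result = new ArrayList<>();
        for (String raw : styles.split(",")) {
            String style = raw.trim();
            if (style.isEmpty()) {
                continue; // tolerate stray commas
            }
            if (!KNOWN.contains(style)) {
                throw new IllegalArgumentException("Style not known to this sketch: " + style);
            }
            result.add(style);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("penn,typedDependenciesCollapsed"));
    }
}
```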

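-outputFormatOptions Provides options that control the behavior of the various -outputFormat choices, such as lexicalize, stem, markHeadNodes, or xml. The options should be given as a comma-separated list.

-writeOutputFiles Writes the output to files that have the same names as the input files but a ".stp" extension. The format of these files depends on the -outputFormat option. If this option is not specified, output goes to stdout.

-outputFilesExtension The extension for output files, stp by default. Only takes effect together with -writeOutputFiles.

-outputFilesDirectory The directory for output files (only takes effect together with -writeOutputFiles). If not specified, output files are written to the same directory as the input files.

-nthreads Parsing files and testing on treebanks can use multiple threads. This option specifies how many threads to use. A negative value means to use as many threads as the machine has CPU cores.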
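The rule for negative -nthreads values maps directly onto a JDK call. This is a minimal sketch of that rule, not the parser's own code. ThreadCountSketch is a hypothetical class, and non-negative values are simply passed through.

```java
/** Hypothetical sketch of how a requested thread count could be resolved. */
final class ThreadCountSketch {
    static int resolve(int requested) {
        // Negative means "one thread per available core"; other values are used as given.
        return requested < 0 ? Runtime.getRuntime().availableProcessors() : requested;
    }

    public static void main(String[] args) {
        System.out.println("-nthreads -1 would use " + resolve(-1) + " threads on this machine");
    }
}
```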

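See the package documentation for more information.

4. Parsing a Chinese sentence with the LexicalizedParser class

Enough background; let's go straight to the code. The following parses a Chinese sentence: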

package com.RE.SPtest;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

/**
 * Created with IntelliJ IDEA.
 * Author: st316
 * Date: 13-12-3
 * Time: 3:42 PM
 */
public class SPtest {
    public static void main(String[] args) {
        // The same arguments as the FAQ command line, passed programmatically.
        String[] arg2 = {
                "-encoding", "utf-8",                                        // charset of the input and output
                "-outputFormat", "penn,typedDependenciesCollapsed",          // tree plus collapsed typed dependencies
                "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",  // grammar inside the models jar
                "F:\\RelationExtract\\data\\chinese-onesent-utf8.txt"        // file to parse
        };
        LexicalizedParser.main(arg2);
    }
}

Output:

5. Meaning of the parse output

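ROOT: root of the sentence being parsed
IP: simple clause
NP: noun phrase
VP: verb phrase
PU: punctuation, such as periods, question marks, and exclamation marks
LCP: localizer phrase
PP: prepositional phrase
CP: complementizer phrase, typically a modifying clause formed with 的
DNP: phrase formed with 的, typically expressing possession
ADVP: adverb phrase
ADJP: adjective phrase
DP: determiner phrase
QP: quantifier phrase
NN: common noun
NR: proper noun
NT: temporal noun
PN: pronoun
VV: verb
VC: the copula 是
CC: coordinating conjunction
VE: 有 as a main verb
VA: predicative adjective
AS: aspect marker (e.g., 了)
VRD: verb-resultative compound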
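To connect these labels to actual output, the sketch below pulls the labels out of a Penn-style bracketed tree and looks them up in a small glossary. It is only an illustration: TreeLabelGlossary is a hypothetical class, the tree string is made up (with pinyin standing in for Chinese words), and the glossary covers only a few of the labels above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: list the labels in a bracketed parse tree with their meanings. */
final class TreeLabelGlossary {
    static final Map<String, String> GLOSSARY = new LinkedHashMap<>();
    static {
        GLOSSARY.put("ROOT", "root of the sentence being parsed");
        GLOSSARY.put("IP", "simple clause");
        GLOSSARY.put("NP", "noun phrase");
        GLOSSARY.put("VP", "verb phrase");
        GLOSSARY.put("NR", "proper noun");
        GLOSSARY.put("NN", "common noun");
        GLOSSARY.put("VV", "verb");
        GLOSSARY.put("PU", "punctuation");
    }

    // A label is the first symbol after an opening parenthesis.
    private static final Pattern LABEL = Pattern.compile("\\(\\s*([^\\s()]+)");

    /** Returns the labels in the order they appear in the bracketed tree. */
    static List<String> labels(String pennTree) {
        List<String> out = new ArrayList<>();
        Matcher m = LABEL.matcher(pennTree);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    static String describe(String label) {
        return GLOSSARY.getOrDefault(label, "(not in this glossary)");
    }

    public static void main(String[] args) {
        // Made-up example tree; pinyin stands in for the Chinese words.
        String tree = "(ROOT (IP (NP (NR Zhongguo)) (VP (VV fazhan)) (PU .)))";
        for (String label : labels(tree)) {
            System.out.println(label + ": " + describe(label));
        }
    }
}
```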

References:

http://wenku.baidu.com/link?url=KZDYgJDnme7yIDOCoPpNClV1Z95yiyf5n2YiT4BD-6eNTVcPM8sPTYmx5qxajsX6snGTgpaHUcsB0oI2W2jQOAC2nwdzUkdVkmwnEHQp0jG

http://nlp.stanford.edu/software/parser-faq.shtml

http://nlp.stanford.edu/software/lex-parser.shtml
