[Paper Reading] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
banerjee+@cs.cmu.edu alavie@cs.cmu.edu
Important Snippets:
1. In order to be both effective and useful, an automatic metric for MT evaluation has to satisfy several basic criteria. The primary and most intuitive requirement is that the metric have very high correlation with
quantified human notions of MT quality. Furthermore, a good metric should be as sensitive as possible to differences in MT quality between different systems, and between different versions of the same system. The metric should be
consistent (same MT system on similar texts should produce similar scores), reliable (MT systems that score similarly can be trusted to perform similarly) and general (applicable to different MT tasks in a wide range of domains and scenarios). Needless
to say, satisfying all of the above criteria is extremely difficult, and all of the metrics that have been proposed so far fall short of adequately addressing most if not all of these requirements.
2. METEOR is based on an explicit word-to-word matching between the MT output being evaluated and one or more reference translations. Our current matching supports not only matching between words that are identical
in the two strings being compared, but can also match words that are simple morphological variants of each other.
3. Each possible matching is scored based on a combination of several features. These currently include unigram precision, unigram recall, and a direct measure of how out-of-order the words of the MT output are with respect to
the reference.
4. Furthermore, our results demonstrated that recall plays a more important role than precision in obtaining high levels of correlation with human judgments.
5. BLEU does not take recall into account directly.
6. BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity
Penalty, which penalizes translations for being “too short”.
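As a reminder of what that compensation looks like, here is a minimal sketch of the Brevity Penalty as defined for BLEU: 1 when the candidate length c exceeds the reference length r, and exp(1 - r/c) otherwise. The function name and the handling of an empty candidate are my own assumptions, not part of either paper.

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU-style brevity penalty: 1 if c > r, else exp(1 - r/c)."""
    if candidate_len <= 0:
        # Assumption: an empty candidate receives no credit at all.
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

A candidate half as long as the reference is scaled by exp(-1), but the penalty says nothing about which reference words were missed, which is why it is only an indirect substitute for recall.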
7. BLEU and NIST suffer from several weaknesses:
>The Lack of Recall
>Use of Higher Order N-grams
>Lack of Explicit Word-matching Between Translation and Reference
>Use of Geometric Averaging of N-grams
8. METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference
translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported.
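The "best score over references" rule can be sketched in a few lines. The scorer is passed in as a parameter; `toy_recall` below is a hypothetical stand-in used only to make the sketch runnable, not the METEOR score itself.

```python
def toy_recall(system, reference):
    """Hypothetical stand-in scorer: fraction of reference words found in system."""
    system_words = set(system)
    return sum(1 for word in reference if word in system_words) / len(reference)

def best_reference_score(system, references, score_fn):
    """Score the system output against each reference independently; keep the best."""
    return max(score_fn(system, reference) for reference in references)
```

For example, `best_reference_score(["a", "b"], [["a", "c"], ["a", "b"]], toy_recall)` scores the two references separately and reports the higher of the two values.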
9. Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping between unigrams, such that
every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string.
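A small sketch of that definition, assuming a mapping is represented as a list of (system position, reference position) pairs. The representation and the function name are my own choices. Because every pair links a system position to a reference position, the "no unigrams in the same string" condition holds by construction, so only the "zero or one" condition needs checking.

```python
def is_alignment(mappings):
    """Return True if no system or reference position is used more than once.

    mappings: list of (system_pos, reference_pos) pairs.
    """
    system_positions = [s for s, _ in mappings]
    reference_positions = [r for _, r in mappings]
    return (len(set(system_positions)) == len(system_positions)
            and len(set(reference_positions)) == len(reference_positions))
```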
10. This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.
11. In the first phase an external module lists all the possible unigram mappings between the two strings.
12. Different modules map unigrams based on different criteria. The “exact” module maps two unigrams if they are exactly the same (e.g. “computers” maps to “computers” but not “computer”). The “porter stem” module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g. “computers” maps to both “computers” and “computer”). The “WN synonymy” module maps two unigrams if they are synonyms of each other.
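The first phase can be sketched as functions that list candidate (system position, reference position) pairs. This is only an illustration: `toy_stem` is a crude hypothetical stand-in that strips a trailing "s" and is not the Porter algorithm, and the WN synonymy module is left out because it needs WordNet.

```python
def exact_module(system, reference):
    """List every (i, j) where system[i] and reference[j] are identical."""
    return [(i, j)
            for i, s in enumerate(system)
            for j, r in enumerate(reference)
            if s == r]

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def stem_module(system, reference, stem=toy_stem):
    """List every (i, j) whose words are identical after stemming."""
    return [(i, j)
            for i, s in enumerate(system)
            for j, r in enumerate(reference)
            if stem(s) == stem(r)]
```

With `system = ["computers", "are", "fast"]` and `reference = ["the", "computer", "is", "fast"]`, the exact module only pairs the two occurrences of "fast", while the stem module also pairs "computers" with "computer".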
13. In the second phase of each stage, the largest subset of these unigram mappings is selected such
that the resulting set constitutes an alignment as defined above.
14. If more than one subset constitutes an alignment and has the same (largest) number of mappings, METEOR selects the set that has the fewest unigram mapping crosses.
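A sketch of the tie-break, assuming alignments are lists of (system position, reference position) pairs. Following the paper's definition as I understand it, two mappings (i, j) and (k, l) cross when (i - k) * (j - l) is negative. The function names are mine, and the sketch picks among candidate alignments that are already given rather than searching for them.

```python
from itertools import combinations

def count_crosses(alignment):
    """Count pairs of mappings (i, j), (k, l) with (i - k) * (j - l) < 0."""
    return sum(1
               for (i, j), (k, l) in combinations(alignment, 2)
               if (i - k) * (j - l) < 0)

def pick_alignment(candidates):
    """Among the largest candidate alignments, return one with the fewest crosses."""
    largest = max(len(candidate) for candidate in candidates)
    return min((c for c in candidates if len(c) == largest), key=count_crosses)
```

Size is compared first, so a smaller alignment never wins just because it has no crosses.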
15. By default the first stage uses the “exact” mapping module, the second the “porter stem” module and the third the “WN synonymy” module.
16. Once the final alignment has been produced, METEOR computes:
unigram precision (P)
unigram recall (R)
Fmean, which combines the precision and recall via a harmonic mean
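The formulas are not reproduced in these notes, so the sketch below uses the definitions as given in the METEOR paper: P is the number of mapped unigrams divided by the length of the system translation, R is the same count divided by the length of the reference, and Fmean = 10PR / (R + 9P). The function names and the zero-length handling are my own assumptions.

```python
def unigram_precision(matches, system_len):
    """Mapped unigrams divided by the number of unigrams in the system translation."""
    return matches / system_len if system_len else 0.0

def unigram_recall(matches, reference_len):
    """Mapped unigrams divided by the number of unigrams in the reference."""
    return matches / reference_len if reference_len else 0.0

def fmean(precision, recall):
    """Harmonic mean weighted 9:1 towards recall: 10PR / (R + 9P)."""
    denominator = recall + 9 * precision
    return 10 * precision * recall / denominator if denominator else 0.0
```

The weighting is easy to see by swapping the arguments: `fmean(0.5, 1.0)` is about 0.91, while `fmean(1.0, 0.5)` is about 0.53, so low recall hurts far more than low precision.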
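To take into account longer matches, METEOR computes a penalty for a given alignment as follows.
First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation.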
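The penalty formula itself is missing from these notes. In the paper it is Penalty = 0.5 * (#chunks / #unigrams_matched)^3, and the final score is Fmean * (1 - Penalty). A minimal sketch under that assumption, with alignments represented as (system position, reference position) pairs and function names of my own choosing:

```python
def count_chunks(alignment):
    """Count maximal runs that are adjacent in both system and reference."""
    chunks = 0
    previous = None
    for s, r in sorted(alignment):  # walk mappings in system order
        if previous is None or s != previous[0] + 1 or r != previous[1] + 1:
            chunks += 1  # adjacency broken on either side: a new chunk starts
        previous = (s, r)
    return chunks

def fragmentation_penalty(alignment):
    """0.5 * (chunks / matched unigrams) ** 3; zero when nothing is matched."""
    if not alignment:
        return 0.0
    return 0.5 * (count_chunks(alignment) / len(alignment)) ** 3

def meteor_score(fmean_value, alignment):
    """Final score: Fmean scaled down by the fragmentation penalty."""
    return fmean_value * (1.0 - fragmentation_penalty(alignment))
```

As an illustration, aligning "the president spoke to the audience" with "the president then spoke to the audience" gives the pairs `[(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (5, 6)]`, which form two chunks over six matched unigrams, so the penalty is 0.5 * (2/6)^3. With no adjacent matches at all, the number of chunks equals the number of matched unigrams and the penalty reaches its maximum of 0.5.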
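Conclusion: METEOR weights recall more heavily than precision, whereas BLEU is precision-based and only compensates for the lack of recall through its Brevity Penalty. METEOR also incorporates more information: explicit word-to-word matching, stemming, synonymy, and a word-order penalty.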