Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts.-paper

http://www.aclweb.org/anthology/N07-1058

Volume:Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Authors:Michael Heilman | Kevyn Collins-Thompson | Jamie Callan | Maxine Eskenazi
Month:April
Year:2007
Venues:NAACL | HLT

The data is not publicly available.

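1. Introduction

For L1 (native) English readers, grammatical ability at an advanced level is roughly the same as at an early stage, because they acquire grammar through use and interaction. L2 learners, by contrast, learn grammar from textbooks, so even advanced L2 learners may not have a strong command of it. Grammar is therefore important for predicting and assessing readability for L2 readers, for example verb tenses and the passive voice.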

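2. Language model readability prediction for first language texts

Advantages of statistical language models over traditional formulas:

1) Higher accuracy on short texts and web texts

2) They output a probability distribution rather than a single predicted value

3) Language models can provide more data about the relative difficulty of the words in a text

The statistical model used in the paper is multinomial naive Bayes (the same as in the previous paper)

Although the unigram model is a weak model, it requires less training data than more complex models such as bigrams and trigrams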
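The unigram approach above can be illustrated with a minimal sketch. It assumes add-alpha smoothing, a uniform prior over levels, and an invented toy corpus; none of these come from the paper, and the names `train_unigram_models` and `level_distribution` are hypothetical.

```python
import math
from collections import Counter

def train_unigram_models(docs_by_level, alpha=1.0):
    """Build one add-alpha smoothed unigram model per reading level."""
    vocab = {w for docs in docs_by_level.values() for d in docs for w in d.split()}
    models = {}
    for level, docs in docs_by_level.items():
        counts = Counter(w for d in docs for w in d.split())
        # The extra +1 reserves probability mass for words never seen in training.
        total = sum(counts.values()) + alpha * (len(vocab) + 1)
        models[level] = (counts, total)
    return models, alpha

def level_distribution(text, models, alpha):
    """Return P(level | text), assuming a uniform prior over levels."""
    log_scores = {}
    for level, (counts, total) in models.items():
        log_scores[level] = sum(
            math.log((counts[w] + alpha) / total) for w in text.split()
        )
    # Normalize in log space to avoid numerical underflow on long texts.
    top = max(log_scores.values())
    exp_scores = {lvl: math.exp(s - top) for lvl, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lvl: v / norm for lvl, v in exp_scores.items()}

# Toy training data, invented for illustration only.
toy_corpus = {
    1: ["the cat sat on the mat", "the dog ran to the cat"],
    5: ["the committee evaluated the proposal",
        "the analysis revealed significant variation"],
}
models, alpha = train_unigram_models(toy_corpus)
dist = level_distribution("the cat ran", models, alpha)
predicted_level = max(dist, key=dist.get)
```

The function returns a full distribution over levels rather than a single number, which is advantage 2) in the list above; taking the most probable level is only one way to turn it into a prediction.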

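3. Grammatical construction readability prediction for second language texts

3.1 Features for grammar-based prediction

The Stanford parser is used to produce constituent structure trees

PCFG scores can be used to filter out problematic texts in the corpus

The parser uses its default training set, the Penn Treebank, because those texts are similar to the reading materials of L2 learners

The predictor uses Tgrep2, a search tool for tree structures that finds instances of target patterns. A Tgrep2 pattern specifies dominance, sisterhood, precedence, and other relationships between nodes in the parse tree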
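To make the idea of a tree pattern concrete, here is a small Python sketch that checks one immediate-dominance relation on a bracketed parse, roughly what the Tgrep2 pattern `VP < VBN` expresses. It is not Tgrep2 itself: the helper names (`parse_tree`, `subtrees`, `count_immediate_dominance`) and the example sentence are hypothetical, and the parse string is written by hand rather than produced by the Stanford parser.

```python
def parse_tree(s):
    """Parse a Penn Treebank style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def subtrees(node):
    """Yield every non-leaf node in the tree, top-down."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def count_immediate_dominance(tree, parent_label, child_label):
    """Count nodes labeled parent_label that have a direct child labeled child_label."""
    return sum(
        1
        for label, children in subtrees(tree)
        if label == parent_label
        and any(isinstance(c, tuple) and c[0] == child_label for c in children)
    )

# Hand-written parse of a passive sentence (an assumption, not parser output).
sentence = ("(S (NP (DT The) (NN book)) "
            "(VP (VBD was) (VP (VBN written) (PP (IN by) (NP (NNP Ann))))))")
tree = parse_tree(sentence)
vbn_phrases = count_immediate_dominance(tree, "VP", "VBN")
```

Only the inner `VP` directly contains a `VBN`, so it is the single match. Sisterhood and precedence checks would compare children of the same parent or their left-to-right order in the same way.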

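Notes:

Penn Treebank:

The PTB corpus commonly used in NLP
Penn Treebank is the name of a project whose goal is to annotate corpora; the annotations include part-of-speech tags and syntactic parses.

Source: the 1989 Wall Street Journal
Size: 1M words, 2,499 articles

Stanford parser:

-- It is both a highly optimized probabilistic context-free grammar (PCFG) and lexicalized dependency parser, and a lexicalized PCFG parser.

-- It uses the authoritative Penn Treebank as training data, and currently provides parsing for English, Chinese, German, Arabic, Italian, Bulgarian, Portuguese, and other languages.

Tgrep2: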

Like its predecessor tgrep, which was written by Richard Pito, Tgrep2 is a search engine for finding structures in a corpus of trees. The most common application of these programs is in extracting data from the Penn Treebank corpora of parsed sentences.

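First grammar feature set: the features are computed at the word level so that they are not affected by sentence length. It contains 22 grammatical features, such as the passive voice, past tense, perfect and continuous tenses, and relative clauses

Second grammar feature set: 12 grammar features that do not require extensive syntactic parsing, such as sentence length, the different verb tenses, and the POS tags of words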
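A rough sketch of how shallow features like those in the second set could be computed from POS-tagged text is shown below. The notes do not list the paper's 12 features, so the feature names and the tag groupings in `TAG_FEATURES` are hypothetical; the input is assumed to be already tagged with Penn Treebank POS tags.

```python
from collections import Counter

# Hypothetical mapping from feature name to Penn Treebank POS tags.
TAG_FEATURES = {
    "past_tense_verbs": {"VBD"},
    "present_tense_verbs": {"VBP", "VBZ"},
    "past_participles": {"VBN"},
    "modals": {"MD"},
}

def grammar_feature_rates(tagged_sentences):
    """Compute average sentence length plus per-word rates of each tag group."""
    tags = [tag for sent in tagged_sentences for _, tag in sent]
    n_words = len(tags)
    counts = Counter(tags)
    features = {"avg_sentence_length": n_words / len(tagged_sentences)}
    for name, tagset in TAG_FEATURES.items():
        # Dividing by the word count keeps the rate independent of sentence length.
        features[name] = sum(counts[t] for t in tagset) / n_words
    return features

# Toy input, tagged by hand for illustration.
tagged = [
    [("She", "PRP"), ("walked", "VBD"), ("home", "NN")],
    [("He", "PRP"), ("can", "MD"), ("swim", "VB")],
]
rates = grammar_feature_rates(tagged)
```

Each document becomes a short numeric vector, which is the kind of input a distance-based classifier can work with.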

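3.2 Classification algorithm for grammar-based features

kNN (k-nearest neighbors) model prediction × confidence value + language model prediction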
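The combination above might look like the following sketch. The notes do not give the exact formula, so everything here is an assumption: Euclidean distance, the mean of the neighbors' levels as the prediction, neighbor agreement as the confidence value, and a convex combination controlled by a hypothetical `weight` parameter.

```python
import math

def knn_predict(query, training, k=3):
    """Predict a level as the mean level of the k nearest training vectors.

    Also returns a crude confidence: the share of neighbors that agree with
    the most common neighbor level (an assumption, not the paper's definition).
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    levels = [level for _, level in nearest]
    prediction = sum(levels) / len(levels)
    most_common = max(set(levels), key=levels.count)
    confidence = levels.count(most_common) / len(levels)
    return prediction, confidence

def combine(lm_level, grammar_level, confidence, weight=0.5):
    """Interpolate the two predictions; the grammar share grows with confidence."""
    g = weight * confidence
    return (1 - g) * lm_level + g * grammar_level

# Toy feature vectors with their reading levels, invented for illustration.
training = [
    ((0.10, 0.02), 2),
    ((0.12, 0.03), 2),
    ((0.30, 0.10), 6),
    ((0.32, 0.12), 7),
]
grammar_level, confidence = knn_predict((0.11, 0.02), training, k=3)
combined = combine(lm_level=3.0, grammar_level=grammar_level, confidence=confidence)
```

Scaling the grammar model's share by its confidence means a query with disagreeing neighbors falls back mostly on the language model prediction.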

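4. Experiments

Evaluation criteria:

1) Correlation coefficient: how well the predicted values match the human-assigned values

2) Mean squared error (MSE), which penalizes severe errors more heavily. Precision, recall, and accuracy are not used because a prediction that is off by 5 levels and one that is off by 1 level are not equally serious errors.

3) 9-fold cross-validation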
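The first two criteria can be sketched in a few lines. The notes only say "correlation coefficient", so Pearson correlation is an assumption here, and the level lists are invented to show why MSE is preferred over accuracy.

```python
import math

def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual levels."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def pearson_correlation(predicted, actual):
    """Pearson correlation between predicted and actual levels."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

def accuracy(predicted, actual):
    """Share of exact matches, shown only for contrast with MSE."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Toy levels, invented for illustration.
actual = [1, 2, 3, 4]
always_off_by_one = [2, 3, 4, 5]   # never exact, but never far off
one_big_miss = [1, 2, 3, 9]        # mostly exact, with one 5-level error

mse_near = mean_squared_error(always_off_by_one, actual)
mse_far = mean_squared_error(one_big_miss, actual)
```

Accuracy ranks `one_big_miss` higher (0.75 against 0.0), while MSE ranks it lower (6.25 against 1.0), because the single 5-level error dominates the squared-error sum.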

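4.2 Corpus

The corpus is very noisy, and this hurts grammar-based prediction more. For example, an image caption has little effect on a unigram model, but it does affect the syntactic (dependency) analysis!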

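5. Results

The LM performs better than the grammar-based model, but linearly interpolating the two models performs better still

The two feature sets were also compared separately: the first set includes more complex syntactic structures and has lower MSE and higher correlation, but the second set does not do badly either. This suggests that when the corpus is large and computing power is limited, even simple grammar features such as POS tags and word counts are effective!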

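6. Discussion

Reasons why the LM is more effective than the grammar-based model on both the L1 and L2 corpora:

1) The LM captures every word in the text

2) Noise has a large effect on the grammar features!

3) English is a morphologically impoverished language; many tasks such as text classification and information extraction do not even need to consider morphology-related grammatical features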

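7. Conclusion

1) The vocabulary-based n-gram language model performs better, and combining the two models with linear interpolation performs better still

2) Grammar is important for second-language readability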
