Authors:

Luo Si, Carnegie Mellon University, Pittsburgh, PA

Jamie Callan, Carnegie Mellon University, Pittsburgh, PA

Atlanta, Georgia, USA, October 05 - 10, 2001
ACM New York, NY, USA ©2001

Data (not publicly available): educational Web pages, 91 in total. Pages were grouped into three readability levels: Kindergarten-Grade 2, Grade 3-Grade 5, and Grade 6-Grade 8.

monosyllable: a word with a single syllable

2. READABILITY METRICS

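The first is aimed at beginner and intermediate learners.

The second gives higher difficulty scores than the others.

The third is more widely used.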

3. STATISTICAL LANGUAGE MODELS

Linear models are widely used to combine models, and the EM algorithm is used to find the best parameters.
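
As a rough illustration of how EM can fit a single combination weight, here is a minimal sketch. The function name em_mixture_weight and the toy probabilities are assumptions made for this note, not taken from the paper; p_a and p_b stand for the probabilities two component models assign to the same held-out items.

```python
def em_mixture_weight(p_a, p_b, lam=0.5, iterations=50):
    """Estimate the weight lam of model A in lam*A + (1-lam)*B with EM.

    p_a, p_b: probabilities the two models assign to the same held-out items
    (all values must be positive).
    """
    for _ in range(iterations):
        # E-step: how much of each item's probability is explained by model A
        resp = [lam * a / (lam * a + (1 - lam) * b) for a, b in zip(p_a, p_b)]
        # M-step: the new weight is the average responsibility of model A
        lam = sum(resp) / len(resp)
    return lam


# Toy held-out probabilities (hypothetical values)
p_a = [0.9, 0.8, 0.7, 0.2]
p_b = [0.3, 0.4, 0.5, 0.6]
lam = em_mixture_weight(p_a, p_b)
print(round(lam, 3))
```

Each iteration cannot decrease the held-out likelihood, so the weight drifts toward whichever model explains the data better.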

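A linear interpolation formula combines the language model and the sentence length model: the former uses n-grams, the latter accounts for sentence length.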
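
A small sketch of what such an interpolation could look like, assuming each model already yields a probability for every readability level. The function combine, the level labels, the probabilities, and the weight 0.7 are all made up for illustration.

```python
def combine(p_lm, p_len, lam):
    """Linearly interpolate two per-level probability distributions."""
    return {level: lam * p_lm[level] + (1 - lam) * p_len[level] for level in p_lm}


# Hypothetical per-level probabilities for one document
p_lm = {"K-2": 0.2, "3-5": 0.5, "6-8": 0.3}   # from the language model
p_len = {"K-2": 0.1, "3-5": 0.3, "6-8": 0.6}  # from the sentence length model

combined = combine(p_lm, p_len, lam=0.7)
best = max(combined, key=combined.get)
print(best)
```

Because both inputs sum to 1 and the weights sum to 1, the combined scores are again a probability distribution over levels.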

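1) A unigram language model assumes that the probability of generating a word is independent of its context. Although unigram models work poorly as models of human language, they are adequate for many applications and have the advantage that they can be trained on small amounts of data.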
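
To make the unigram idea concrete, here is a minimal sketch that trains one unigram model per level and picks the level with the highest log-likelihood. The add-one smoothing, the two toy levels, and the helper names are assumptions for this note and are not claimed to match the paper's setup.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def log_likelihood(tokens, model):
    """Sum of log word probabilities; words outside the vocabulary are skipped."""
    return sum(math.log(model[w]) for w in tokens if w in model)


# Toy training text per level (hypothetical)
training = {
    "easy": "the cat sat the dog ran".split(),
    "hard": "the cell divides the enzyme binds".split(),
}
vocab = {w for tokens in training.values() for w in tokens}
models = {level: train_unigram(tokens, vocab) for level, tokens in training.items()}

doc = "the enzyme binds".split()
scores = {level: log_likelihood(doc, model) for level, model in models.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```

Word order plays no role here: shuffling the words of doc leaves every score unchanged, which is exactly the context-independence assumption.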

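2) A feature's importance is judged by whether its value is directly or inversely proportional to difficulty. The conclusion is that sentence length is an important feature, while the monosyllable feature used by the formula-based methods does not suit this data set. Sentence length is then assumed to follow a normal distribution.

4. EXPERIMENTS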
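
A sketch of a normal-distribution sentence length model, assuming one Gaussian per level and a uniform prior over levels. The sample lengths, level labels, and function names are invented for illustration.

```python
import math
from statistics import mean, stdev


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def level_posterior(x, params):
    """P(level | sentence length x) under a uniform prior over levels."""
    dens = {level: normal_pdf(x, mu, sigma) for level, (mu, sigma) in params.items()}
    total = sum(dens.values())
    return {level: d / total for level, d in dens.items()}


# Hypothetical average sentence lengths (in words) observed per level
lengths_by_level = {"K-2": [5, 6, 7, 6, 5], "6-8": [14, 16, 18, 15, 17]}
params = {level: (mean(v), stdev(v)) for level, v in lengths_by_level.items()}

post = level_posterior(7, params)
print(max(post, key=post.get))
```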

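Formula-based methods such as KF only output the final level a text belongs to, but our data set does not contain those levels. Our statistical method can provide a soft metric in the form of probabilities.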

-------------------------

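N-Gram rests on one assumption:
the occurrence of the n-th word depends only on the preceding n-1 words and on no other words. (This is the Markov assumption, also used in hidden Markov models.) The probability of a whole sentence is then the product of the conditional probabilities of its words, and each of these can be estimated from counts in a corpus. Suppose sentence T consists of the word sequence w1,w2,w3...wn. By the chain rule, its probability is:

P(T)=P(w1,w2,w3...wn)=p(w1)*p(w2|w1)*p(w3|w1,w2)...p(wn|w1,w2...wn-1)
The N-Gram models used most often are Bi-Gram and Tri-Gram, which truncate each history to the previous one or two words:
Bi-Gram:P(T)=p(w1|begin)*p(w2|w1)*p(w3|w2)...p(wn|wn-1)
Tri-Gram:P(T)=p(w1|begin1,begin2)*p(w2|w1,begin1)*p(w3|w2,w1)...p(wn|wn-1,wn-2)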
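
The Bi-Gram formula can be sketched with plain counts, as below. This is an unsmoothed maximum-likelihood estimate on a toy corpus; the marker "<s>" plays the role of begin, and the corpus and function names are assumptions for this note.

```python
from collections import Counter

BEGIN = "<s>"  # stands in for the "begin" marker in the formula


def train_bigram(sentences):
    """Count contexts and bigrams, prepending the begin marker to each sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [BEGIN] + sentence
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams):
    """P(T) = p(w1|begin) * p(w2|w1) * ... * p(wn|wn-1), estimated from counts."""
    tokens = [BEGIN] + sentence
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: the estimate is undefined, treat as 0
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


corpus = [["i", "like", "tea"], ["i", "like", "milk"], ["you", "like", "tea"]]
contexts, bigrams = train_bigram(corpus)
p = sentence_prob(["i", "like", "tea"], contexts, bigrams)
print(p)  # 2/3 * 2/2 * 2/3 = 4/9
```

Any sentence containing an unseen bigram gets probability 0 under this estimate, which is why practical models add smoothing.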

https://github.com/lijingpeng/kaggle/blob/master/competitions/Bag_of_Words/bags_of_words.ipynb (includes Bayes and regression classifiers)
