Paper:

MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches
to lexical diversity assessment 

Link:

https://link.springer.com/content/pdf/10.3758%2FBRM.42.2.381.pdf

LD: lexical diversity

TTR: type–token ratio

The drawback of TTR is that it is sensitive to variation in text length.

vocd-D: this index is also a function of text length.
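The length sensitivity of TTR can be seen in a minimal Python sketch. The helper name `ttr` and the toy passage are illustrative assumptions, not material from the paper; repeating the passage adds tokens but no new types, so the ratio falls.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct types divided by number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Toy passage (hypothetical): 9 tokens, 5 types.
passage = "of the people by the people for the people".split()

short_ttr = ttr(passage)       # 5 types / 9 tokens
long_ttr = ttr(passage * 10)   # still 5 types, but now 90 tokens

print(f"short text TTR = {short_ttr:.3f}")
print(f"long text TTR  = {long_ttr:.3f}")
```

The longer text scores lower even though its vocabulary is identical, which is the text length problem that the indices below try to correct.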

CONSIDERATIONS IN THE ASSESSMENT OF LEXICAL DIVERSITY 

Text Length 

The first weakness of LD measures is their sensitivity to text length. The gradual decrease in type count can be an indication of the thematic saturation of a text or corpus. That is, when a text reaches the point at which no new types are being encountered, we can say that the text is (fully) representative of the word types that are indicative of that text’s theme. The practical benefit is that it allows researchers greater confidence that their corpora comprise texts of a sufficient length to represent suitably their linguistic function. MTLD is a notion closely related to thematic saturation.

Textual Homogeneity

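The second weakness of LD measures is that an LD index is taken to describe a text under the assumption of textual homogeneity. The homogeneity assumption concerns the distribution of types within a text: in practice, different rhetorical devices and strategies give different parts of a text different levels of diversity. Every text has a structure, and every structure serves a rhetorical purpose. That purpose can be expressed through many rhetorical forms within the text, but no single one of them can represent the text as a whole.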

Sequential and Nonsequential Analysis Processing 

For example, nonsequential analysis has the advantage of avoiding local clustering of content words, which Malvern et al. (2004) argued may lead to a distorted view of the overall text. Landauer, Laham, Rehder, and Schreiner (1997) went even further, claiming that there may be little benefit to word order when it comes to deriving meaning from texts.

INDICES OF LEXICAL DIVERSITY 

vocd-D 

The calculation of vocd-D is the result of a series of random text samplings. The approach begins its calculation by taking from the text 100 random samples of 35 tokens. The TTR for each of these samples is calculated, and the mean TTR is stored. The same procedure is then repeated for samples from 36 to 50 tokens. An empirical TTR curve is then created from the means of each of these samples.
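The sampling step described above can be sketched as follows. This covers only the empirical TTR curve; the later step of deriving D from that curve is not described in these notes and is not attempted here. The function name `empirical_ttr_curve`, the fixed seed, sampling without replacement, and the synthetic token list are all assumptions made for illustration.

```python
import random


def empirical_ttr_curve(tokens, sizes=range(35, 51), n_samples=100, seed=0):
    """Mean TTR of random samples, one mean per sample size (35..50 tokens)."""
    if len(tokens) < max(sizes):
        raise ValueError("text is shorter than the largest sample size")
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    curve = {}
    for n in sizes:
        ttrs = []
        for _ in range(n_samples):
            sample = rng.sample(tokens, n)  # assumption: draw without replacement
            ttrs.append(len(set(sample)) / n)
        curve[n] = sum(ttrs) / n_samples
    return curve


# Synthetic text (hypothetical): 200 tokens built from 40 types.
toy_tokens = [f"w{i % 40}" for i in range(200)]
curve = empirical_ttr_curve(toy_tokens)
for n in (35, 50):
    print(n, round(curve[n], 3))
```

The mean TTR drops as the sample size grows, which is the curve shape that vocd-D relies on.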

HD-D

The hypergeometric distribution represents the probability of drawing (without replacement) a certain number of tokens of a particular type from a sample of a particular size. The way we have used this distribution for our own HD-D index is to calculate, for each lexical type in a text, the probability of encountering any of its tokens in a random sample of 42 words drawn from the text. The probabilities for all lexical types in the text are then added together, and the sum is used as an index of the text’s LD.
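A minimal sketch of the sum described above, using only the standard library. For a type with frequency f in a text of N tokens, the probability that a random sample of n tokens contains at least one of its tokens is 1 - C(N - f, n) / C(N, n). The function name `hdd_sum` is an assumption, and the sketch follows the wording of these notes (a plain sum of probabilities) rather than any particular published implementation.

```python
from collections import Counter
from math import comb


def hdd_sum(tokens, sample_size=42):
    """Sum, over all types, of P(type appears at least once in a random sample)."""
    n_tokens = len(tokens)
    if n_tokens < sample_size:
        raise ValueError("text is shorter than the sample size")
    total_ways = comb(n_tokens, sample_size)
    total = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability of drawing zero tokens of this type.
        p_absent = comb(n_tokens - freq, sample_size) / total_ways
        total += 1.0 - p_absent
    return total


all_distinct = hdd_sum([str(i) for i in range(50)])  # 50 types, each used once
one_type = hdd_sum(["a"] * 50)                       # a single type, used 50 times
print(all_distinct, one_type)
```

By linearity of expectation this sum equals the expected number of types in a random 42-token sample, so dividing it by 42 would put it on the same scale as TTR.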

Other LD Indices Used in This Study

Log correction.

Because the text length problem of LD is related to frequency, log values have long been used as an LD corrective factor.

Frequency correction.

A second approach to correcting for the text length effect is the frequency distribution of types.

For example, consider the sentence The friendly man liked both the big dog and the little dog, which contains nine types and 12 tokens, and then consider the sentence The friendly man, whom the big dog liked, liked a little dog, which also contains nine types and 12 tokens. Note that the first sentence contains 3 tokens of the type the, whereas the second sentence contains only 2 tokens of the type the; however, for the second sentence, the word liked has a frequency of 2, whereas it is just 1 in the first sentence.
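The counts in this example can be checked with a short sketch. The helper `type_frequencies` and its simple lowercase, letters-only tokenization are assumptions made for illustration.

```python
import re
from collections import Counter


def type_frequencies(sentence):
    """Frequency of each word type, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


first = type_frequencies(
    "The friendly man liked both the big dog and the little dog"
)
second = type_frequencies(
    "The friendly man, whom the big dog liked, liked a little dog"
)

for name, freqs in (("first", first), ("second", second)):
    print(name, len(freqs), "types,", sum(freqs.values()), "tokens,",
          sorted(freqs.values(), reverse=True))
```

Both sentences have the same TTR (9/12), yet their frequency distributions differ, which is the information a frequency-based correction draws on.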

Whereas vocd-D is determined by the sums of probabilities of encountering each type in the text in sample sizes from 35 to 50 tokens, K is determined by the sums of probabilities of encountering each type in the text when the sample size is set to just 2 words.

MTLD 

Processing MTLD 

MTLD is an index of a text’s LD, evaluated sequentially. It is calculated as the mean length of sequential word strings in a text that maintain a given TTR value (here, .720). During the calculation process, each word of the text is evaluated sequentially for its TTR. For example, . . . of (1.00) the (1.00) people (1.00) by (1.00) the (.800) people (.667) for (.714) the (.625) people (.556) . . . and so forth. However, when the default TTR factor size value (here, .720) is reached, the factor count increases by a value of 1, and the TTR evaluations are reset. Thus, given the previous example, MTLD would execute . . . of (1.00) the (1.00) people (1.00) by (1.00) the (.800) people (.667) |||FACTORS = FACTORS + 1||| for (1.00) the (1.00) people (1.00) . . . and so forth.
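The running evaluation in this example can be reproduced with a short sketch. The function name `trace_factors` is an assumption, as is the choice to count a factor when the running TTR is at or below the threshold; the text only says the value is "reached".

```python
def trace_factors(tokens, threshold=0.720):
    """Return (token, running TTR) pairs and the number of completed factors."""
    trace = []
    factors = 0
    seen = set()   # types seen in the current segment
    count = 0      # tokens seen in the current segment
    for tok in tokens:
        count += 1
        seen.add(tok)
        ttr = len(seen) / count
        trace.append((tok, round(ttr, 3)))
        if ttr <= threshold:   # assumption: "reached" means at or below .720
            factors += 1       # FACTORS = FACTORS + 1
            seen.clear()       # reset the TTR evaluation
            count = 0
    return trace, factors


trace, factors = trace_factors("of the people by the people for the people".split())
for tok, value in trace:
    print(f"{tok} ({value})")
print("factors:", factors)
```

The printed values match the second sequence in the example: the TTR falls to .667 at the sixth word, one factor is counted, and the evaluation restarts at 1.00 with "for".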

Forward and Reverse Processing 

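Both a forward and a reverse pass are computed because, if the text is processed only from start to finish, differences in segmentation sizes cause large variation in the result.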

Calculation of MTLD Value 

The total number of words in the text is divided by the total factor count. For example, if the text has 340 words and the factor count is 4.404, then the MTLD value is 77.203. Two such MTLD values are calculated, one for forward processing and one for reverse processing. The mean of the two values is the final MTLD value.
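Putting the steps together, a minimal sketch of the whole calculation might look like this. The function names are assumptions. The fractional factor count (such as 4.404) implies that a leftover segment contributes a partial factor; these notes do not say how, so the sketch assumes the partial factor is (1 - TTR of the remainder) / (1 - .720), and that assumption should be checked against the paper before use.

```python
def mtld_one_direction(tokens, threshold=0.720):
    """Words divided by factor count for a single pass over the tokens."""
    factors = 0.0
    seen = set()
    count = 0
    for tok in tokens:
        count += 1
        seen.add(tok)
        if len(seen) / count <= threshold:
            factors += 1.0
            seen.clear()
            count = 0
    if count > 0:
        # Assumed partial-factor rule for the leftover segment (not in the notes).
        remainder_ttr = len(seen) / count
        factors += (1.0 - remainder_ttr) / (1.0 - threshold)
    if factors == 0:
        raise ValueError("no factor reached; text too short or has no repeated words")
    return len(tokens) / factors


def mtld(tokens, threshold=0.720):
    """Mean of the forward and reverse passes."""
    forward = mtld_one_direction(tokens, threshold)
    reverse = mtld_one_direction(tokens[::-1], threshold)
    return (forward + reverse) / 2


worked_example = round(340 / 4.404, 3)   # the arithmetic from the text
toy_value = mtld("of the people by the people for the people".split())
partial_value = mtld("a b c a".split())  # only a partial factor in each direction
print(worked_example, toy_value, round(partial_value, 2))
```

On the nine-word toy phrase each direction completes exactly one factor, so both passes give 9 / 1 and the mean is 9.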
