Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?

 from nltk.corpus import gutenberg as gb

 def validate_zipf(text,ranklimit):
fdist=nltk.FreqDist([w for w in text if w.isalpha()])
x=range(ranklimit)
freq=[]
for key in fdist.keys():
freq.append(fdist[key])
y=sorted(freq,reverse=True)[:ranklimit]
pylab.plot(x,y) def test():
text=gb.words(fileids=['shakespeare-hamlet.txt'])
validate_zipf(text,150)

运行的结果为:

Zipf’s Law的更多相关文章

  1. 齐夫定律, Zipf's law,Zipfian distribution

    齐夫定律(英语:Zipf's law,IPA英语发音:/ˈzɪf/)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律. 它可以表述为: 在 ...

  2. Zipf's law

    w https://www.bing.com/knows/search?q=马太效应&mkt=zh-cn&FORM=BKACAI 马太效应(Matthew Effect),指强者愈强. ...

  3. Zipf定律

    http://www.360doc.com/content/10/0811/00/84590_45147637.shtml 英美在互联网具有绝对霸权 Zipf定律是美国学者G.K.齐普夫提出的.可以表 ...

  4. 幂次法则power law

    幂次法则分布和高斯分布是两种广泛存在的数学分布.可以预测和统计相关数据. pig中用其处理数据倾斜,实现负载均衡. 个体的规模和其名次之间存在着幂次方的反比关系,R(x)=ax(-b次方) 其中,x为 ...

  5. 齐普夫-Zipf定律

    python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...

  6. 【机器学习Machine Learning】资料大全

    昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...

  7. [IR] Information Extraction

    阶段性总结 Boolean retrieval 单词搜索 [Qword1 and Qword2]               O(x+y) [Qword1 and Qword2]- 改进: Gallo ...

  8. [IR] Compression

    关系:Vocabulary vs. collection size Heaps’ law: M = kTbM is the size of the vocabulary, T is the numbe ...

  9. TF/IDF(term frequency/inverse document frequency)

    TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...

随机推荐

  1. struts2的工作原理

    在学习struts2就必须的了解一下它的工作原理: 首先来看一下这张图 这张工作原理图是官方提供的: 一个请求在Struts2框架中的处理大概分为以下几个步骤 1.客户端初始化一个指向Servlet容 ...

  2. UVa11235 RMQ

    input 1<=n,q<=100000 升序序列a1 a2 a3 ... an -100000<=ai<=100000 q行i j 1<=i,j<=n 输入结束标 ...

  3. IE 和 FF 写不同的CSS

    .FireFox 下如何使连续长字段自动换行 众所周知IE中直接使用word-wrap:break-word 就可以了, FF中我们使用JS插入的技巧来解决 <style type=" ...

  4. Zookeeper: configuring on centos7

    thispassage is referenced, appreciated. ZooKeeper installation: Download from this site Install java ...

  5. 源代码管理工具-GIT

    源代码管理工具-GIT ---- 一. 掌握 - git 概述 1. git 简介? 什么是git? git是一款开源的分布式版本控制工具在世界上所有的分布式版本控制工具中,git是最快.最简单.最流 ...

  6. 从零开始学习OpenGL ES之一 – 基本概念

    我曾写过一些文章介绍iPhone OpenGL ES编程,但大部分针对的是已经至少懂得一些3D编程知识的人.作为起点,请下载我的OpenGL Xcode项目模板,而不要使用Apple提供的模板.你可以 ...

  7. UVA - 437 The Tower of Babylon(dp-最长递增子序列)

    每一个长方形都有六种放置形态,其实可以是三种,但是判断有点麻烦直接用六种了,然后按照底面积给这些形态排序,排序后就完全变成了LIS的问题.代码如下: #include<iostream> ...

  8. android studio sexy editor性感编辑器设置

    sexy editor下载地址:http://download.csdn.net/detail/yy1300326388/9166223 我自己也有上传CSDN资源 rainyday0524@163. ...

  9. Apache无法启动提示the requested operation has failed

    主要参考这篇 http://apps.hi.baidu.com/share/detail/15868128 但还是遇到一些问题,记录如下: 1. 配置完成后,restart apache,出现 the ...

  10. memcached添加IP白名单,只允许指定服务器调用

    由于memcached默认安装是不用配置密码的(具体的密码配置我也没怎么研究,据说是有的,大家感兴趣去找一找) 然而memcached链接也是非常简单的 linux命令链接使用  Telnet IP地 ...