Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?

 from nltk.corpus import gutenberg as gb

 def validate_zipf(text,ranklimit):
fdist=nltk.FreqDist([w for w in text if w.isalpha()])
x=range(ranklimit)
freq=[]
for key in fdist.keys():
freq.append(fdist[key])
y=sorted(freq,reverse=True)[:ranklimit]
pylab.plot(x,y) def test():
text=gb.words(fileids=['shakespeare-hamlet.txt'])
validate_zipf(text,150)

运行的结果为:

Zipf’s Law的更多相关文章

  1. 齐夫定律, Zipf's law,Zipfian distribution

    齐夫定律(英语:Zipf's law,IPA英语发音:/ˈzɪf/)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律. 它可以表述为: 在 ...

  2. Zipf's law

    w https://www.bing.com/knows/search?q=马太效应&mkt=zh-cn&FORM=BKACAI 马太效应(Matthew Effect),指强者愈强. ...

  3. Zipf定律

    http://www.360doc.com/content/10/0811/00/84590_45147637.shtml 英美在互联网具有绝对霸权 Zipf定律是美国学者G.K.齐普夫提出的.可以表 ...

  4. 幂次法则power law

    幂次法则分布和高斯分布是两种广泛存在的数学分布.可以预测和统计相关数据. pig中用其处理数据倾斜,实现负载均衡. 个体的规模和其名次之间存在着幂次方的反比关系,R(x)=ax(-b次方) 其中,x为 ...

  5. 齐普夫-Zipf定律

    python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...

  6. 【机器学习Machine Learning】资料大全

    昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...

  7. [IR] Information Extraction

    阶段性总结 Boolean retrieval 单词搜索 [Qword1 and Qword2]               O(x+y) [Qword1 and Qword2]- 改进: Gallo ...

  8. [IR] Compression

    关系:Vocabulary vs. collection size Heaps’ law: M = kTbM is the size of the vocabulary, T is the numbe ...

  9. TF/IDF(term frequency/inverse document frequency)

    TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...

随机推荐

  1. ubuntu 安装LNMP

    How To Install Linux, nginx, MySQL, PHP (LEMP) stack on Ubuntu 12.04 PostedJune 13, 2012 802.8kviews ...

  2. lucene-SpanFirstQuery 和SpanNearQuery 跨度查询

    1.SpanFirstQuery查询 对出现在一个域中前n个位置的跨度查询. public void testSpanFirstQuery() throws Exception{ SpanzFirts ...

  3. DHCP详细工作过程(转)

    DHCP客户端通过和DHCP服务器的交互通讯以获得IP地址租约.为了从DHCP服务器获得一个IP地址,在标准情况下DHCP客户端和DHCP服务器之间会进行四次通讯.DHCP协议通讯使用端口UDP 67 ...

  4. APK瘦身

    APK瘦身 主要从一下三方面来瘦身: 1. Java 源代码 1) ,这方面主要是通过最简洁的代码实现最直接的功能,还有就是提出上线前不必要的java代码,可以使用UCDector进行分析,从而对代码 ...

  5. hdu_3886_Final Kichiku “Lanlanshu”(数位DP)

    题目连接:http://acm.hdu.edu.cn/showproblem.php?pid=3886 题意:这题的题意有点晦涩难懂,大概意思就是给你一个区间,让你找一些满足递增递减条件的数,举个列: ...

  6. 保存iptables的防火墙规则的方法【转载】

    转自: 保存iptables的防火墙规则的方法 - 51CTO.COMhttp://os.51cto.com/art/201103/249504.htm 保存iptables的防火墙规则的方法如下: ...

  7. struts2语法--error页面如何捕获?

    如果地址栏输入了不带后缀或者action为后缀, 不存在的页面跳转到error.jsp: struts.xml配置" <package name="default" ...

  8. Servlet程序开发--Servlet 与 表单

    servlet程序: doPost方法时为了防止表单提交时post方式的问题.否则只能处理get请求 package org.lxh.servletdemo ; import java.io.* ; ...

  9. 【树状数组】 poj 2352

    题意:给出n个平面二维坐标,对于每个坐标,如果这个坐标跟(0,0)形成的矩形内包含的点数为 k (包含边界,但不包含坐标本身),那么这个坐标就是 level k.输出level 0 - n-1的点数分 ...

  10. POJ 2723 HDU 1816 Get Luffy Out

    二分答案 + 2-SAT验证 #include<cstdio> #include<cstring> #include<cmath> #include<stac ...