Zipf’s Law
Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?
from nltk.corpus import gutenberg as gb def validate_zipf(text,ranklimit):
fdist=nltk.FreqDist([w for w in text if w.isalpha()])
x=range(ranklimit)
freq=[]
for key in fdist.keys():
freq.append(fdist[key])
y=sorted(freq,reverse=True)[:ranklimit]
pylab.plot(x,y) def test():
text=gb.words(fileids=['shakespeare-hamlet.txt'])
validate_zipf(text,150)
运行的结果为:

Zipf’s Law的更多相关文章
- 齐夫定律, Zipf's law,Zipfian distribution
齐夫定律(英语:Zipf's law,IPA英语发音:/ˈzɪf/)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律. 它可以表述为: 在 ...
- Zipf's law
w https://www.bing.com/knows/search?q=马太效应&mkt=zh-cn&FORM=BKACAI 马太效应(Matthew Effect),指强者愈强. ...
- Zipf定律
http://www.360doc.com/content/10/0811/00/84590_45147637.shtml 英美在互联网具有绝对霸权 Zipf定律是美国学者G.K.齐普夫提出的.可以表 ...
- 幂次法则power law
幂次法则分布和高斯分布是两种广泛存在的数学分布.可以预测和统计相关数据. pig中用其处理数据倾斜,实现负载均衡. 个体的规模和其名次之间存在着幂次方的反比关系,R(x)=ax(-b次方) 其中,x为 ...
- 齐普夫-Zipf定律
python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...
- 【机器学习Machine Learning】资料大全
昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...
- [IR] Information Extraction
阶段性总结 Boolean retrieval 单词搜索 [Qword1 and Qword2] O(x+y) [Qword1 and Qword2]- 改进: Gallo ...
- [IR] Compression
关系:Vocabulary vs. collection size Heaps’ law: M = kTbM is the size of the vocabulary, T is the numbe ...
- TF/IDF(term frequency/inverse document frequency)
TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...
随机推荐
- SCP测试服务器的上行/下行带宽
SCP测试服务器的上行/下行带宽,这个咋弄呢?有时间再研究一下.
- switch case多值匹配
switch case多值匹配一般有两种情况 1.列举(将所有值列举出来) var n= 3;switch (n){ case 1: case 2: case 3: ...
- CSS中如何把Span标签设置为固定宽度
一.形如<span>ABC</span>独立行设置SPAN为固定宽度方法如下: span {width:60px; text-align:center; display:blo ...
- POJ3264/RMQ
题目链接 /* 询问一段区间内的元素差值最大是多少,用RMQ维护一个最大值和一个最小值,相减即可. */ #include<cstdio> #include<cstring> ...
- 在写一个iOS应用之前必须做的7件事
转载自:http://www.cocoachina.com/ios/20160316/15685.html 原文:https://medium.com/ios-os-x-development/7-t ...
- 旋转关节(Revolute Joint)
package{ import Box2D.Common.Math.b2Vec2; import Box2D.Dynamics.b2Body; import Box2D.Dynamics.Joints ...
- Codeforces 691A Fashion in Berland
水题. #pragma comment(linker, "/STACK:1024000000,1024000000") #include<cstdio> #includ ...
- Chapter 1 First Sight——9
One of the best things about Charlie is he doesn't hover. 一件最好的事是查理兹他不在附近. He left me alone to unpac ...
- CentOS 6.5 开机启动指定服务
gedit /etc/rc.d/rc.local #关闭防火墙 service iptables stop #开启samba服务 service smb start #开启ntopng 端口5000 ...
- ant调用shell命令(Ubuntu)
ant中调用Makefile,使用shell中的make命令 <?xml version="1.0" encoding="utf-8" ?> < ...