Zipf’s Law
Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
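As a quick check of the arithmetic in that example, the sketch below assumes an ideal Zipfian text in which f × r = k holds exactly. The helper name zipf_frequency and the constant k = 30000 are made up for illustration and do not come from any real corpus.

```python
def zipf_frequency(rank, k=30000):
    """Frequency predicted by an ideal Zipf's Law, f = k / r.

    k is an arbitrary, assumed constant; a real text only
    approximates this relationship.
    """
    return k / rank


# The 50th word type should be three times as frequent as the 150th.
ratio = zipf_frequency(50) / zipf_frequency(150)
print(ratio)  # 3.0
```

The ratio does not depend on the choice of k, since k cancels out when two frequencies are divided.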
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?
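import nltk
import pylab
from nltk.corpus import gutenberg as gb

def validate_zipf(text, ranklimit):
    # Count alphabetic tokens only (punctuation and numbers are skipped)
    fdist = nltk.FreqDist(w for w in text if w.isalpha())
    # Frequencies of the ranklimit most common words, most frequent first
    y = sorted(fdist.values(), reverse=True)[:ranklimit]
    # Ranks start at 1 and match the number of frequencies actually found
    x = range(1, len(y) + 1)
    pylab.plot(x, y)
    pylab.show()

def test():
    # Requires the Gutenberg corpus: nltk.download('gutenberg')
    text = gb.words(fileids=['shakespeare-hamlet.txt'])
    validate_zipf(text, 150)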
Running test() plots the frequency of the 150 most common words in Hamlet against their rank.
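For part b, one possible approach is sketched below. It is not the original author's solution: the function names random_text, rank_frequencies, and plot_zipf, the fixed seed, and the use of collections.Counter instead of nltk.FreqDist are all assumptions made to keep the example self-contained. It also builds the string with join rather than repeated concatenation, which gives the same result with less copying.

```python
import random
from collections import Counter


def random_text(n_chars, alphabet="abcdefg ", seed=0):
    """Build a string of n_chars characters drawn uniformly from alphabet.

    The space in the alphabet acts as the word separator.
    The seed is fixed only so that runs are repeatable.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n_chars))


def rank_frequencies(text):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter(text.split())  # split() tokenizes on whitespace
    return sorted(counts.values(), reverse=True)


def plot_zipf(freqs):
    """Plot frequency against rank on log-log axes."""
    # Imported here so the counting code works without matplotlib installed
    from matplotlib import pyplot as plt

    ranks = range(1, len(freqs) + 1)  # ranks start at 1
    plt.loglog(ranks, freqs)
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.show()
```

Calling plot_zipf(rank_frequencies(random_text(1000000))) draws the random-text curve, and passing it the sorted frequencies from a real text draws the curve to compare against. Random text of this kind is often reported to give a roughly straight, if stepped, line on log-log axes, which would suggest that a Zipf-like shape is not unique to natural language.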