Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?

 from nltk.corpus import gutenberg as gb

 def validate_zipf(text,ranklimit):
fdist=nltk.FreqDist([w for w in text if w.isalpha()])
x=range(ranklimit)
freq=[]
for key in fdist.keys():
freq.append(fdist[key])
y=sorted(freq,reverse=True)[:ranklimit]
pylab.plot(x,y) def test():
text=gb.words(fileids=['shakespeare-hamlet.txt'])
validate_zipf(text,150)

运行的结果为:

Zipf’s Law的更多相关文章

  1. 齐夫定律, Zipf's law,Zipfian distribution

    齐夫定律(英语:Zipf's law,IPA英语发音:/ˈzɪf/)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律. 它可以表述为: 在 ...

  2. Zipf's law

    w https://www.bing.com/knows/search?q=马太效应&mkt=zh-cn&FORM=BKACAI 马太效应(Matthew Effect),指强者愈强. ...

  3. Zipf定律

    http://www.360doc.com/content/10/0811/00/84590_45147637.shtml 英美在互联网具有绝对霸权 Zipf定律是美国学者G.K.齐普夫提出的.可以表 ...

  4. 幂次法则power law

    幂次法则分布和高斯分布是两种广泛存在的数学分布.可以预测和统计相关数据. pig中用其处理数据倾斜,实现负载均衡. 个体的规模和其名次之间存在着幂次方的反比关系,R(x)=ax(-b次方) 其中,x为 ...

  5. 齐普夫-Zipf定律

    python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...

  6. 【机器学习Machine Learning】资料大全

    昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...

  7. [IR] Information Extraction

    阶段性总结 Boolean retrieval 单词搜索 [Qword1 and Qword2]               O(x+y) [Qword1 and Qword2]- 改进: Gallo ...

  8. [IR] Compression

    关系:Vocabulary vs. collection size Heaps’ law: M = kTbM is the size of the vocabulary, T is the numbe ...

  9. TF/IDF(term frequency/inverse document frequency)

    TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...

随机推荐

  1. WPF之TabControl控件用法

    先创建实体基类:NotificationObject(用来被实体类继承) 实现属性更改通知接口: using System; using System.Collections.Generic; usi ...

  2. 转 使用SQL从AWR收集数据库性能变化趋势

    使用SQL从AWR收集数据库性能变化趋势 为了对数据库一段时间的性能情况有个全面了解,显然AWR是一个非常有用的工具, 但很多人只会在数据库有性能问题时才会生成问题时段的awr报告去分析.虽然AWR ...

  3. c++ 日志操作

    程序需要一个简单的日志类,为此简单学习了Boost.Log和google的glog,前者功能非常强大,后者非常小巧但是不够灵活,最终打算自己写一个. 环境: win7 32位旗舰版.VS2010旗舰版 ...

  4. HDU 4460 Friend Chains(map + spfa)

    Friend Chains Time Limit : 2000/1000ms (Java/Other)   Memory Limit : 32768/32768K (Java/Other) Total ...

  5. java 设计模式之工厂模式与反射的结合

    工厂模式: /**  * @author Rollen-Holt 设计模式之 工厂模式  */   interface fruit{     public abstract void eat(); } ...

  6. Linux中的挂载和卸载

    mkdir /home/xxx   创建挂载点 mount /dev/cdrom /home/xxx   把cdrom中的内容挂载到xxx目录 umount /dev/cdrom 卸载 /dev/sr ...

  7. qt博客

    http://blog.csdn.net/foruok/article/category/418962/1

  8. access restriction

    一.既然存在访问规则,那么修改访问规则即可.打开项目的Build Path Configuration页面,打开报错的JAR包,选中Access rules条目,选择右侧的编辑按钮,添加一个访问规则即 ...

  9. 实例:SSH结合Easyui实现Datagrid的新增功能和Validatebox的验证功能

    在我前面一篇分页的基础上,新增了添加功能和添加过程中的Ajax与Validate的验证功能.其他的功能在后面的博客写来,如果对您有帮助,敬请关注. 先看一下实现的效果: (1)点击添加学生信息按键后跳 ...

  10. STL入门2

    1:给出n个字符串,输出每个字符串是第几个出现的字符串?多组数据 2:对每组数据,第一行输入n表示接下来有n个字符串 1 <= n <= 100000接下来的n行,每行输入一个非空的且长度 ...