• Requirement: given an article, which words appear in it, and which appear most often?

Word Frequency in English Text

English text: analyzing word frequency in Hamlet

Counting English word frequencies takes two steps:

  • Denoise and normalize the text
  • Represent the frequencies with a dictionary

Code:

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
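The manual dictionary plus sort above can also be done with the standard library's `collections.Counter`, whose `most_common` method bundles the counting and the sorting into one call. A minimal sketch using the same normalization as `getText()` (the `sample` string here is just a stand-in for hamlet.txt):

```python
from collections import Counter

def top_words(text, n=10):
    # Same normalization as getText(): lowercase, punctuation -> spaces
    text = text.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, " ")
    # Counter.most_common replaces the manual dict + items().sort()
    return Counter(text.split()).most_common(n)

sample = "To be, or not to be: that is the question."
print(top_words(sample, 2))  # the two most frequent words with their counts
```

`Counter` is a `dict` subclass, so `counts.get(word, 0) + 1` style updates still work on it if you prefer to build the counts incrementally.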

Word Frequency in Chinese Text

Chinese text: analyzing the characters in Romance of the Three Kingdoms (《三国演义》)

Counting Chinese word frequencies takes two steps:

  • Segment the Chinese text into words
  • Represent the frequencies with a dictionary

#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰        390
孔明曰        390
不能          384
如此          378
张飞          358

We can clearly see some irrelevant or duplicate entries in this list.

Optimized Version

Counting Chinese word frequencies now takes three steps:

  • Segment the Chinese text into words
  • Represent the frequencies with a dictionary
  • Extend the program to fix the remaining problems

We put the irrelevant and duplicate words into an excludes set and filter them out.

#CalThreeKingdomsV2.py
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
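The elif chain that folds aliases into one canonical name (诸葛亮/孔明曰 → 孔明, and so on) grows awkward as more names are added; the same merging can be expressed as a lookup table. A sketch of that variant, where the alias table mirrors the chain in CalThreeKingdomsV2.py and `demo` is a hand-made word list standing in for jieba's output:

```python
# Alias -> canonical name, mirroring the elif chain in CalThreeKingdomsV2.py
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def count_people(words, excludes=()):
    counts = {}
    for w in words:
        if len(w) == 1:        # drop single characters, as in the original
            continue
        w = ALIASES.get(w, w)  # fold aliases into the canonical name
        counts[w] = counts.get(w, 0) + 1
    for w in excludes:
        counts.pop(w, None)    # pop(..., None) never raises, unlike del
    return counts

demo = ["玄德", "曰", "孔明曰", "云长", "玄德曰", "将军"]
print(count_people(demo, excludes={"将军"}))
```

Using `counts.pop(word, None)` instead of `del counts[word]` also avoids a KeyError when an excluded word happens not to appear in the text at all.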

Word Frequency for Postgraduate Entrance Exam (考研) English

Applying word-frequency statistics to the English texts of the postgraduate entrance exam, we can identify the key words that appear most often.

Text link: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA  Password: fw3r

# CalHamletV1.py
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

pyTxt = getText()      # text with all punctuation removed
words = pyTxt.split()  # list of words
counts = {}            # dictionary of word -> count
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\
"was", "are", "have", "were", "had", "that", "for", "it",\
"on", "be", "as", "with", "by", "not", "their", "they",\
"from", "more", "but", "or", "you", "at", "has", "we", "an",\
"this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\
"its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \
"10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \
"work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \
"40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \
"both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \
"give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \
"27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \
"something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
x = len(counts)
print(x)
# Page through the remaining words interactively, 100 at a time
# (the first page starts at index 100, right after the initial block)
r = 0
ans = input("Enter 1 to continue: ")
while ans == "1":
    r += 100
    for i in range(r, r + 100):
        word, count = items[i]
        print("\"{}\"".format(word), end=", ")
    ans = input("Enter 1 to continue: ")
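The interactive paging loop at the end is easier to test, and safer against running past the end of the list, if the slicing is separated from the `input()` handling. A small sketch of that split (the function name `pages` is my own, not from the original):

```python
def pages(items, page_size=100):
    # Yield successive page_size-long slices of (word, count) pairs;
    # the last page is simply shorter, so no IndexError is possible
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]

# The interactive driver then stays a thin loop over the generator:
#   for page in pages(items):
#       print(", ".join('"{}"'.format(w) for w, _ in page))
#       if input("Enter 1 to continue: ") != "1":
#           break

demo = [("alpha", 3), ("beta", 2), ("gamma", 1)]
print([len(p) for p in pages(demo, page_size=2)])  # two pages: 2 items, then 1
```

Comparing the raw `input()` string against "1" also avoids `eval(input(...))`, which would execute arbitrary expressions typed by the user.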
