• 需求:一篇文章,出现了哪些词?哪些词出现得最多?

英文文本词频统计

英文文本:Hamlet 分析词频

统计英文词频分为两步:

  • 文本去噪及归一化
  • 使用字典表达词频

代码:

#CalHamletV1.py
def getText():
txt = open("hamlet.txt", "r").read()
txt = txt.lower()
for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格
return txt hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

运行结果:

the        1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436

中文文本词频统计

中文文本:《三国演义》分析人物

统计中文词频分为两步:

  • 中文文本分词
  • 使用字典表达词频
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
if len(word) == 1:
continue
else:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

运行结果:

曹操		953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358

能很明显的看到有一些不相关或重复的信息

优化版本

统计中文词频分为三步:

  • 中文文本分词
  • 使用字典表达词频
  • 扩展程序解决问题

我们将不相关或重复的信息放在 excludes 集合里面进行排除。

#CalThreeKingdomsV2.py
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
if len(word) == 1:
continue
elif word == "诸葛亮" or word == "孔明曰":
rword = "孔明"
elif word == "关公" or word == "云长":
rword = "关羽"
elif word == "玄德" or word == "玄德曰":
rword = "刘备"
elif word == "孟德" or word == "丞相":
rword = "曹操"
else:
rword = word
counts[rword] = counts.get(rword,0) + 1
for word in excludes:
del counts[word]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

考研英语词频统计

将词频统计应用到考研英语中,我们可以统计出出现次数较多的关键单词。

文本链接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密码: fw3r

# CalHamletV1.py
def getText():
txt = open("86_17_1_2.txt", "r").read()
txt = txt.lower()
for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格
return txt pyTxt = getText() #获得没有任何标点的txt文件
words = pyTxt.split() #获得单词
counts = {} #字典,键值对
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\
"was", "are", "have", "were", "had", "that", "for", "it",\
"on", "be", "as", "with", "by", "not", "their", "they",\
"from", "more", "but", "or", "you", "at", "has", "we", "an",\
"this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\
"its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \
"10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \
"work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \
"40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \
"both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \
"give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \
"27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \
"something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
for word in words:
if word not in excludes:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count)) x = len(counts)
print(x) r = 0 next = eval(input("1继续")) while next == 1:
r += 100
for i in range(r, r+100):
word, count = items[i]
print ("\"{}\"".format(word), end = ", ")
next = eval(input("1继续"))

【Python】词频统计的更多相关文章

  1. python词频统计及其效能分析

    1) 博客开头给出自己的基本信息,格式建议如下: 学号2017****7128 姓名:肖文秀 词频统计及其效能分析仓库:https://gitee.com/aichenxi/word_frequenc ...

  2. Python 词频统计

    利用Python做一个词频统计 GitHub地址:FightingBob [Give me a star , thanks.] 词频统计 对纯英语的文本文件[Eg: 瓦尔登湖(英文版).txt]的英文 ...

  3. 大数据python词频统计之本地分发-file

    统计某几个词在文章出现的次数 -file参数分发,是从客户端分发到各个执行mapreduce端的机器上 1.找一篇文章The_Man_of_Property.txt如下: He was proud o ...

  4. 大数据python词频统计之hdfs分发-cacheArchive

    -cacheArchive也是从hdfs上进分发,但是分发文件是一个压缩包,压缩包内可能会包含多层目录多个文件 1.The_Man_of_Property.txt文件如下(将其上传至hdfs上) ha ...

  5. 大数据python词频统计之hdfs分发-cacheFile

    -cacheFile 分发,文件事先上传至Hdfs上,分发的是一个文件 1.找一篇文章The_Man_of_Property.txt: He was proud of him! He could no ...

  6. python词频统计

    1.jieba 库 -中文分词库 words = jieba.lcut(str)  --->列表,词语 count = {} for word in words: if len(word)==1 ...

  7. python瓦登尔湖词频统计

    #瓦登尔湖词频统计: import string path = 'D:/python3/Walden.txt' with open(path,'r',encoding= 'utf-8') as tex ...

  8. Python中文词频统计

    以下是关于小说的中文词频统计 这里有三个文件,分别为novel.txt.punctuation.txt.meaningless.txt. 这三个是小说文本.特殊符号和无意义词 Python代码统计词频 ...

  9. 用Python实现一个词频统计(词云+图)

    第一步:首先需要安装工具python 第二步:在电脑cmd后台下载安装如下工具: (有一些是安装好python电脑自带有哦) 有一些会出现一种情况就是安装不了词云展示库 有下面解决方法,需看请复制链接 ...

  10. Python——字符串、文件操作,英文词频统计预处理

    一.字符串操作: 解析身份证号:生日.性别.出生地等. 凯撒密码编码与解码 网址观察与批量生成 2.凯撒密码编码与解码 凯撒加密法的替换方法是通过排列明文和密文字母表,密文字母表示通过将明文字母表向左 ...

随机推荐

  1. 一问一答学习PyQT6,对比WxPython和PyQt6的差异

    在我的基于WxPython的跨平台框架完成后,对WxPython的灵活性以及强大功能有了很深的了解,在跨平台的桌面应用上我突然对PyQt6的开发也感兴趣,于是准备了开发环境学习PyQt 6,并对比下W ...

  2. Pod的优雅上下线

    Pod的优雅上下线依赖k8s的监控检查机制,以及 Pod lifecycle Hooks,通过这些kubernetes的机制,配合服务发现的流量管理机制,实现业务的优雅上下线. 基础概念 Pod 健康 ...

  3. 前端(一)-Html

    1.网页基本信息 <!DOCTYPE html> 浏览器使用的规范 <head> 网页头 <body> 主体部分 <meta> 元数据 meta的nam ...

  4. 原生js元素拖动效果

    <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8&quo ...

  5. 【Java 温故而知新系列】基础知识-05 面向对象

    1.面向对象概述 面向对象(Object-Oriented,简称OO)是一种编程思想,核心思想是将现实世界中的事物抽象为程序中的"对象",通过对象之间的交互来解决问题. 对象 对象 ...

  6. 京东h5st参数js逆向

    扣代码的环节挺简单的就不讲了 直接到重点 发现许多包都会有一个h5st的加密参数 那么我们就要看这个参数是怎么生成的 我们可以根据请求堆栈 找到h5st的入口 当然还有一种更简单的方法 就是直接全局搜 ...

  7. .NET 中 Logger 常被忽视的方法 BeginScope

    BeginScope 方法是 .NET 中 ILogger 接口的一部分,用于创建日志记录的作用域(Scope).这种作用域可以将特定的上下文信息包含在日志中,从而提高日志的可读性和调试效率. 配置日 ...

  8. 一键部署,玩转AI!天翼云Llama 3大模型学习机来了!

    近日,Meta公司发布了其最新研发成果--开源大模型Llama 3,共包含Llama 3 8B和Llama 3 70B两种规格,参数量级分别为80亿与700亿,并表示这是目前同体量下性能最好的开源模型 ...

  9. Centos7下oracle12c的安装与配置

    一.硬件资源配置(虚拟机) CentOS7@VMware Workstation 10 Pro,分配资源:CPU:2颗,内存:4GB,硬盘空间:20GB+30GB 二.软件环境配置 软件上传 xshe ...

  10. 从图像到信息,AI识图开启智能识别新时代

    当前用户在使用各类应用时,文字.图片和视频已经成为互联网内容的主要载体,许多用户在浏览过程中通常想要选取页面上的文字或者得到图片或者视频画面的具体信息.比如,识别文本中的特定实体(如电话号码.邮箱.网 ...