• 需求:一篇文章,出现了哪些词?哪些词出现得最多?

英文文本词频统计

英文文本:Hamlet 分析词频

统计英文词频分为两步:

  • 文本去噪及归一化
  • 使用字典表达词频

代码:

#CalHamletV1.py
def getText():
txt = open("hamlet.txt", "r").read()
txt = txt.lower()
for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格
return txt hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

运行结果:

the        1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436

中文文本词频统计

中文文本:《三国演义》分析人物

统计中文词频分为两步:

  • 中文文本分词
  • 使用字典表达词频
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
if len(word) == 1:
continue
else:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

运行结果:

曹操		953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358

能很明显的看到有一些不相关或重复的信息

优化版本

统计中文词频分为三步:

  • 中文文本分词
  • 使用字典表达词频
  • 扩展程序解决问题

我们将不相关或重复的信息放在 excludes 集合里面进行排除。

#CalThreeKingdomsV2.py
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
if len(word) == 1:
continue
elif word == "诸葛亮" or word == "孔明曰":
rword = "孔明"
elif word == "关公" or word == "云长":
rword = "关羽"
elif word == "玄德" or word == "玄德曰":
rword = "刘备"
elif word == "孟德" or word == "丞相":
rword = "曹操"
else:
rword = word
counts[rword] = counts.get(rword,0) + 1
for word in excludes:
del counts[word]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count))

考研英语词频统计

将词频统计应用到考研英语中,我们可以统计出出现次数较多的关键单词。

文本链接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密码: fw3r

# CalHamletV1.py
def getText():
txt = open("86_17_1_2.txt", "r").read()
txt = txt.lower()
for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格
return txt pyTxt = getText() #获得没有任何标点的txt文件
words = pyTxt.split() #获得单词
counts = {} #字典,键值对
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\
"was", "are", "have", "were", "had", "that", "for", "it",\
"on", "be", "as", "with", "by", "not", "their", "they",\
"from", "more", "but", "or", "you", "at", "has", "we", "an",\
"this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\
"its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \
"10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \
"work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \
"40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \
"both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \
"give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \
"27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \
"something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
for word in words:
if word not in excludes:
counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
word, count = items[i]
print ("{0:<10}{1:>5}".format(word, count)) x = len(counts)
print(x) r = 0 next = eval(input("1继续")) while next == 1:
r += 100
for i in range(r, r+100):
word, count = items[i]
print ("\"{}\"".format(word), end = ", ")
next = eval(input("1继续"))

【Python】词频统计的更多相关文章

  1. python词频统计及其效能分析

    1) 博客开头给出自己的基本信息,格式建议如下: 学号2017****7128 姓名:肖文秀 词频统计及其效能分析仓库:https://gitee.com/aichenxi/word_frequenc ...

  2. Python 词频统计

    利用Python做一个词频统计 GitHub地址:FightingBob [Give me a star , thanks.] 词频统计 对纯英语的文本文件[Eg: 瓦尔登湖(英文版).txt]的英文 ...

  3. 大数据python词频统计之本地分发-file

    统计某几个词在文章出现的次数 -file参数分发,是从客户端分发到各个执行mapreduce端的机器上 1.找一篇文章The_Man_of_Property.txt如下: He was proud o ...

  4. 大数据python词频统计之hdfs分发-cacheArchive

    -cacheArchive也是从hdfs上进分发,但是分发文件是一个压缩包,压缩包内可能会包含多层目录多个文件 1.The_Man_of_Property.txt文件如下(将其上传至hdfs上) ha ...

  5. 大数据python词频统计之hdfs分发-cacheFile

    -cacheFile 分发,文件事先上传至Hdfs上,分发的是一个文件 1.找一篇文章The_Man_of_Property.txt: He was proud of him! He could no ...

  6. python词频统计

    1.jieba 库 -中文分词库 words = jieba.lcut(str)  --->列表,词语 count = {} for word in words: if len(word)==1 ...

  7. python瓦登尔湖词频统计

    #瓦登尔湖词频统计: import string path = 'D:/python3/Walden.txt' with open(path,'r',encoding= 'utf-8') as tex ...

  8. Python中文词频统计

    以下是关于小说的中文词频统计 这里有三个文件,分别为novel.txt.punctuation.txt.meaningless.txt. 这三个是小说文本.特殊符号和无意义词 Python代码统计词频 ...

  9. 用Python实现一个词频统计(词云+图)

    第一步:首先需要安装工具python 第二步:在电脑cmd后台下载安装如下工具: (有一些是安装好python电脑自带有哦) 有一些会出现一种情况就是安装不了词云展示库 有下面解决方法,需看请复制链接 ...

  10. Python——字符串、文件操作,英文词频统计预处理

    一.字符串操作: 解析身份证号:生日.性别.出生地等. 凯撒密码编码与解码 网址观察与批量生成 2.凯撒密码编码与解码 凯撒加密法的替换方法是通过排列明文和密文字母表,密文字母表示通过将明文字母表向左 ...

随机推荐

  1. 优化永不止步:TinyVue v3.20.0 正式发布,更美观的官网UI,更友好的文档搜索,更强大的主题配置能力~

    你好,我是 Kagol,个人公众号:前端开源星球. 我们非常高兴地宣布,2024年12月4日,TinyVue 发布了 v3.20.0 . 本次 3.20.0 版本主要有以下重大变更: OpenTiny ...

  2. Java工具类SignUtil

    import javax.crypto.Mac; import javax.crypto.SecretKey; import javax.crypto.spec.SecretKeySpec; impo ...

  3. linxux学习01

    Linux第一天 1.为什么要学习linux? 因为大数据中绝大部分核心组件都是基于linux操作系统运行的,企业中基本上都是linux系统. 2.怎么去学linux?(什么是大数据) 大数据技术组件 ...

  4. 一个简易socket通信结构

    服务端 基本的结构 工作需要又需要用到socketTCP通讯,这么多年了,终于稍微能写点了.让我说其实也说不出个啥来,看了很多的异步后稍微对异步socket的导流 endreceive后 再begin ...

  5. 数据同步-同步mysql到iceberg后如何确定数据一致性

    一.数据打快照做数据比较 1.mysql创建快照 优点:可以选择时间做快照,然后对比 缺点:需要额外的存储空间和处理时间,不好自动化,大表做快照成本高 2.实现方式 create database 快 ...

  6. RocketMQ实战—9.营销系统代码初版

    大纲 1.基于条件和画像筛选用户的业务分析和实现 2.全量用户促销活动数据模型分析以及创建操作 3.Producer和Consumer的工程代码实现 4.基于抽象工厂模式的消息推送实现 5.全量用户促 ...

  7. windows配置maven

    1.下载mavenhttps://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/ 中找到相应的版本2.解压3.配置环境变量MAVEN_HOMED: ...

  8. 百万架构师第四十课:RabbitMq:RabbitMq-工作模型与JAVA编程|JavaGuide

    来源:https://javaguide.net RabbitMQ 1-工作模型与Java编程 课前准备 预习资料 Windows安装步骤 Linux安装步骤 官网文章中文翻译系列 环境说明 操作系统 ...

  9. 深度对比:PostgreSQL 和 SQL Server 在统计信息维护中的关键差异

    深度对比:PostgreSQL 和 SQL Server 在统计信息维护中的关键差异 数据库统计信息的作用 在数据库系统中,查询优化在决定应用程序性能方面起着至关重要的作用. 高效的查询依赖于最新的数 ...

  10. Vue3 数据响应式原理与高效数据操作全解析

    一.Vue3 数据响应式原理 (一)Proxy 替代 Object.defineProperty 在 Vue2 中,数据响应式是通过 Object.defineProperty 实现的.这种方法虽然能 ...