Segmenting a novel with Python jieba and counting word frequencies

1. Key points

"""

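1) cut()

    a) codecs.open() opens the file with an explicit encoding, which avoids decoding problems

    b) f.readline() reads one line; f.readlines() reads all remaining lines into a list

    c) words = " ".join(jieba.cut(line)) segments the line and joins the words with single spaces

2) lcut()

    Returns the segmented words as a list, whereas cut() returns a generator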

"""

2. Strip punctuation, segment the text, and save the result to a file

import codecs
import jieba

def fenCi():
    """
    Strip punctuation, segment each line, and write the result to seg.txt.
    :return: None
    """
    # Punctuation marks to drop after segmentation
    punctuation = ["，", "！", "“", "”", "。", "？", "：", "...", "、"]
    # The with statement closes both files even if an error occurs
    with codecs.open("深渊主宰系统.txt", 'r', encoding='utf-8') as f, \
            open("seg.txt", 'w', encoding='utf-8') as f1:
        for line in f:
            words = " ".join(jieba.cut(line.strip()))
            for p in punctuation:
                words = words.replace(p, "")
            # Collapse the extra spaces left behind by the removed punctuation
            words = " ".join(words.split())
            # Skip separator lines and lines that are too short to be useful
            if words.startswith('-') or words.startswith('.') or len(words) < 10:
                continue
            # Write one segmented line per source line
            f1.write(words + "\n")
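Chaining replace() calls only removes the punctuation marks that are listed explicitly. As a possible alternative, tokens can be filtered by Unicode category, which also catches marks that are not in the list. This is a sketch, not the method used in fenCi(); is_punct_or_space and clean_tokens are hypothetical helper names, and the token list stands in for the output of jieba.lcut().

```python
import unicodedata

def is_punct_or_space(token):
    """Return True if every character is punctuation or whitespace."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in token
    )

def clean_tokens(tokens):
    """Drop tokens that consist only of punctuation or whitespace."""
    return [t for t in tokens if t and not is_punct_or_space(t)]

# Hand-written token list standing in for jieba.lcut() output
sample = ["他", "说", "：", "“", "你好", "！", "”", "...", "\n"]
print(" ".join(clean_tokens(sample)))
```

Filtering whole tokens rather than editing the joined string also avoids leaving double spaces where a punctuation mark used to be.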

3. Chinese word-frequency count

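def zhongwen():
    """
    Chinese word-frequency count.
    Only words of two or more characters are counted.
        jieba.lcut segments the text and returns the tokens as a list.
    :return: None
    """
    # Read the whole novel; the with statement closes the file afterwards
    with codecs.open("深渊主宰系统.txt", 'r', encoding='utf-8') as f:
        text = f.read()
    counts = {}
    wordsList = jieba.lcut(text)
    for word in wordsList:
        word = word.replace("，", "").replace("！", "").replace("“", "") \
            .replace("”", "").replace("。", "").replace("？", "").replace("：", "") \
            .replace("...", "").replace("、", "").strip(' ').strip('\r\n')
        # Skip empty tokens and single characters
        if len(word) <= 1:
            continue
        counts[word] = counts.get(word, 0) + 1  # count the word
    items = list(counts.items())  # convert the dict to a list of (word, count) pairs
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
    # Print the top 15 (fewer if the text has fewer distinct words)
    for word, counter in items[:15]:
        print("Word: {}, count: {}".format(word, counter))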
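The counting and sorting steps in zhongwen() can also be written with collections.Counter from the standard library. The sketch below assumes the tokens have already been produced (for example by jieba.lcut()); top_words is a hypothetical helper name and the token list is made up.

```python
from collections import Counter

def top_words(tokens, n=15, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    stripped = (t.strip() for t in tokens)
    counts = Counter(w for w in stripped if len(w) >= min_len)
    # most_common() sorts by count, descending, and copes with fewer than n entries
    return counts.most_common(n)

# Made-up token list standing in for jieba.lcut() output
tokens = ["深渊", "主宰", "的", "深渊", "系统", "，", "主宰", "深渊"]
for word, count in top_words(tokens, n=3):
    print("Word: {}, count: {}".format(word, count))
```

Counter replaces the manual counts.get(word, 0) + 1 bookkeeping, and most_common() replaces the explicit list conversion and sort.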

4. English word-frequency count

def get_txt():
    # Read the whole file and lower-case it so that counting is case-insensitive
    with open("1.txt", "r", encoding='UTF-8') as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")      # replace special characters with spaces
    return txt

def yingwen():
    """
    English word-frequency count.
    :return: None
    """
    file_txt = get_txt()
    words = file_txt.split()    # split on whitespace to get the list of words
    counts = {}
    for word in words:
        # Skip single-character words
        if len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    # Print the top 5 (fewer if the text has fewer distinct words)
    for word, count in items[:5]:
        print("{0:<5}->{1:>5}".format(word, count))
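Instead of replacing each special character with a space, the words can be extracted directly with a regular expression. A minimal sketch, assuming that a word is a run of ASCII letters that may contain apostrophes; count_english_words is a hypothetical name and the sample text is invented.

```python
import re
from collections import Counter

def count_english_words(text, n=5):
    """Return the n most frequent words longer than one character."""
    # Lower-case, then pull out runs of letters (apostrophes allowed inside a word)
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if len(w) > 1)
    return counts.most_common(n)

sample_text = "To be, or not to be: that is the question. To be!"
for word, count in count_english_words(sample_text):
    print("{0:<5}->{1:>5}".format(word, count))
```

With this pattern a contraction such as "don't" stays one word, whereas any character outside a to z, including digits, acts as a separator.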
