[Python] jieba word segmentation: adding a dictionary, removing stop words and single characters (python, 2020.2.10)

The source code is as follows:

    import jieba

    # Alternative to add_dict(): load the file as a jieba user dictionary.
    # jieba.load_userdict("E:/xinxi2.txt")

    DICT_PATH = "E:/xinxi2.txt"          # dictionary crawled from Baidu
    INPUT_PATH = "E:/luntan.txt"         # text to process: crawled CSDN forum titles
    OUTPUT_PATH = "E:/luntan_cut2.txt"   # file the segmented text is appended to
    STOPWORDS_PATH = "E:/stop.txt"       # stop-word list, one word per line

    # Add the dictionary words
    def add_dict():
        with open(DICT_PATH, "r", encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                if word:
                    # Passing a str with tune=True adjusts the word frequency
                    # so that jieba treats the word as a whole.
                    jieba.suggest_freq(word, True)

    # Build the stop-word set
    def stopwordslist(filepath):
        with open(filepath, "r", encoding="utf-8") as f:
            return {line.strip() for line in f}

    # Remove stop words from a sentence; the remaining words are joined into one string
    def seg_sentence(sentence, stopwords):
        outstr = ""
        for word in jieba.cut(sentence.strip()):
            if word not in stopwords and word != "\t":
                outstr += word
        return outstr

    # Segment every line of the input file
    def cut():
        stopwords = stopwordslist(STOPWORDS_PATH)  # load once instead of once per sentence
        number = 0
        with open(INPUT_PATH, "r", encoding="utf-8") as src, \
             open(OUTPUT_PATH, "a", encoding="utf-8") as dst:
            for line in src:
                # Stop words are removed first, then the remaining text is segmented again
                line = seg_sentence(line.rstrip("\n"), stopwords)
                # Keep only tokens of two or more characters (drops single characters)
                words = [w.strip() for w in jieba.cut(line) if len(w.strip()) >= 2]
                if words:
                    dst.write(" ".join(words) + "\n")
                number += 1
                print("Processed", number, "lines")

    # Stop-word removal only, line by line (not used by the main flow)
    def cut_all():
        stopwords = stopwordslist(STOPWORDS_PATH)
        with open("E:/luntan_cut.txt", "r", encoding="utf-8") as inputs, \
             open("E:/luntan_stop.txt", "a", encoding="utf-8") as outputs:
            for line in inputs:
                line_seg = seg_sentence(line, stopwords)  # the return value is a string
                outputs.write(line_seg + "\n")

    if __name__ == "__main__":
        add_dict()
        cut()
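
The filtering step of the script (drop stop words, then drop single-character tokens) can be tried on its own without jieba or any files. The following is only a minimal sketch: `filter_tokens` is a hypothetical helper that does not appear in the script above, and the sample tokens and stop words are made up for illustration.

```python
# Hypothetical helper for illustration; not part of the original script.
def filter_tokens(tokens, stopwords):
    """Drop stop words, whitespace-only tokens, and single-character tokens."""
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < 2:       # empty after stripping, or a single character
            continue
        if token in stopwords:   # stop word
            continue
        kept.append(token)
    return kept


# Made-up tokens of the kind a segmenter such as jieba.cut might return.
sample_tokens = ["我", "的", "电脑", "无法", "安装", "Python", "了", "\t"]
sample_stopwords = {"的", "了", "无法"}

print(filter_tokens(sample_tokens, sample_stopwords))
# → ['电脑', '安装', 'Python']
```

Keeping the stop words in a set makes each membership check cheap, which matters once the list grows to a few thousand entries.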

Source of luntan.txt: https://www.cnblogs.com/zlc364624/p/12285055.html

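A stop-word list can be found and downloaded with a Baidu search, or you can create a txt file yourself and add your own words, one per line.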
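
As a rough illustration of that file format, the sketch below writes a tiny stop-word file with one word per line and reads it back into a set. The file name `stop_demo.txt`, the helper `load_stopwords`, and the three sample words are assumptions made for this example only; a real list would be much longer.

```python
import os
import tempfile


# Hypothetical helper for illustration; not part of the original script.
def load_stopwords(path):
    """Read one stop word per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Write a small demo list to a temporary directory (placeholder words).
demo_path = os.path.join(tempfile.mkdtemp(), "stop_demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("的\n了\n\n吗\n")   # the blank line is ignored when loading

stopwords = load_stopwords(demo_path)
print(len(stopwords), "stop words loaded")
```

Saving the file as UTF-8 and opening it with `encoding="utf-8"` avoids decoding errors on Windows, where the default encoding is often not UTF-8.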

The dictionary crawled from Baidu is described in an earlier post: https://www.cnblogs.com/zlc364624/p/12289008.html

The result is as follows:

