Python: segmenting documents, computing TF-IDF values, and sorting

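This program reads a set of documents, segments each one into words with jieba, and saves the segmented text to a file. It then uses sklearn to compute the TF-IDF value of every term in each segmented document, and finally sorts all of the results into one large file.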

Dependencies:

sklearn

jieba

Note: this program was adapted from another developer's program.
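
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that sklearn's TfidfTransformer applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document. The function name tfidf_rows and the sample corpus are made up for illustration. The sketch splits on whitespace, whereas CountVectorizer's default token pattern also ignores single-character tokens, so the two can differ on real data.

```python
import math
from collections import Counter


def tfidf_rows(corpus):
    """Return one {term: weight} dict per document.

    Assumes sklearn's defaults: raw counts as tf,
    idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation per document.
    """
    docs = [Counter(text.split()) for text in corpus]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    rows = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


# Hypothetical, already-segmented documents (tokens separated by spaces)
corpus = ["北京 天气 晴朗", "北京 交通 拥堵", "上海 天气 多云"]
rows = tfidf_rows(corpus)
for term, w in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(term, round(w, 4))
```

In the first document, the term that appears in only one document (晴朗) ends up with a higher weight than the two terms shared with other documents, which is the behaviour TF-IDF is meant to produce.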

# -*- coding: utf-8 -*-
"""
@author: jiangfuqiang
"""
# Written for Python 3, where str is Unicode by default.
import os
import re
import time

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

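def getFileList(path):
    """Return the names of the non-hidden entries in path, plus path itself."""
    filelist = []
    for f in os.listdir(path):
        # Skip hidden entries (names that start with a dot)
        if not f.startswith('.'):
            filelist.append(f)
    return filelist, path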

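def fenci(filename, path, segPath):
    """Segment one document with jieba and save the space-separated tokens."""
    # The input files are assumed to be UTF-8 encoded
    with open(os.path.join(path, filename), 'r', encoding='utf-8') as f:
        text = f.read()
    # Directory that holds the segmentation results
    if not os.path.exists(segPath):
        os.mkdir(segPath)
    # Segment the document; cut_all=True selects jieba's full mode
    seg_list = jieba.cut(text, cut_all=True)
    # Drop whitespace, a few punctuation marks, and any token that
    # contains ASCII letters, digits, or underscores
    punctuation = {'=', '[', ']', '(', ')'}
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())  # removes spaces and line breaks
        # re.ASCII keeps \w from matching Chinese characters
        if seg and seg not in punctuation and not re.search(r'\w+', seg, re.ASCII):
            result.append(seg)
    # Join the tokens with spaces and save them locally
    out_path = os.path.join(segPath, filename + '-seg.txt')
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(' '.join(result))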

# Read the segmented documents and compute TF-IDF
def Tfidf(filelist, sFilePath, path):
    corpus = []
    for ff in filelist:
        # fenci() saved each result as <path>/<name>-seg.txt
        fname = os.path.join(path, ff + '-seg.txt')
        with open(fname, 'r', encoding='utf-8') as f:
            corpus.append(f.read())
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    # All terms in the corpus (scikit-learn releases before 1.0 use get_feature_names())
    word = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)
    for i in range(len(weight)):
        out_path = os.path.join(sFilePath, str(i).zfill(5) + '.txt')
        print('----------writing all the tf-idf in the', i, 'file into', out_path)
        # One "term  weight" line per term in the vocabulary
        with open(out_path, 'w', encoding='utf-8') as f:
            for j in range(len(word)):
                f.write(word[j] + "  " + str(weight[i][j]) + "\n")

if __name__ == "__main__":
    # Directory for the TF-IDF results
    sFilePath = "/home/lifeix/soft/allfile/tfidffile" + str(time.time())
    # Directory for the segmented documents
    segPath = '/home/lifeix/soft/allfile/segfile'
    (allfile, path) = getFileList('/home/lifeix/soft/allkeyword')
    for ff in allfile:
        print("Using jieba on " + ff)
        fenci(ff, path, segPath)
    Tfidf(allfile, sFilePath, segPath)
    # Sort every result line by weight (column 2), numerically, largest first
    os.system("sort -nrk 2 " + sFilePath + "/*.txt >" + sFilePath + "/sorted.txt")
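
The final step shells out to the sort command, so it only works where that tool is available. As a rough alternative, the same ordering can be sketched in Python, assuming each line has the "term  weight" layout that Tfidf writes. The helper name sort_by_weight and the sample lines are hypothetical, and lines with equal weights may come out in a different order than with sort.

```python
def sort_by_weight(lines):
    """Sort 'term  weight' lines by weight, largest first."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rows.append((parts[0], float(parts[1])))
    # list.sort is stable, so equal weights keep their input order
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical lines in the "term  weight" format
sample = ["天气  0.5179", "晴朗  0.6809", "交通  0.0", ""]
for term, weight in sort_by_weight(sample):
    print(term, weight)
```

To sort real output, the lines of every result file in the TF-IDF directory could be read into one list and passed to this function.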

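
The token filter in fenci appears intended to keep only tokens that are non-empty, are not one of a few punctuation marks, and contain no ASCII letters, digits, or underscores. A small sketch of that rule in isolation follows; the helper keep_token and the sample tokens are made up for illustration. One point to watch in Python 3 is that \w matches Chinese characters unless re.ASCII is passed, so without the flag every token would be dropped.

```python
import re

PUNCTUATION = {'=', '[', ']', '(', ')'}
ASCII_WORD = re.compile(r'\w+', re.ASCII)  # matches [a-zA-Z0-9_] only


def keep_token(seg):
    """Return the cleaned token, or None if it should be dropped."""
    seg = ''.join(seg.split())  # remove spaces, tabs, and line breaks
    if not seg or seg in PUNCTUATION or ASCII_WORD.search(seg):
        return None
    return seg


# Hypothetical tokens as a segmenter might emit them
tokens = ['北京', ' ', '\n', '(', 'jieba', '2014', '天 气']
kept = [t for t in map(keep_token, tokens) if t is not None]
print(kept)  # prints ['北京', '天气']
```

Whether English words and numbers should be discarded depends on the corpus; dropping the ASCII_WORD check keeps them.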