Extracting keywords from every file in a file list with Python

Description:

Collect all files under a given path, extract the 300 most frequent words from each file, and save them in a database.
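
The core step, counting words and keeping the most frequent ones, can be sketched with the standard library alone. This is a minimal illustration rather than the script's own method: a simple regular expression stands in for NLTK's tokenizer, and the function name `top_words` and its parameters are hypothetical.

```python
import re
from collections import Counter


def top_words(text, limit=300, min_length=3):
    """Return up to `limit` words from `text`, most frequent first.

    Hypothetical helper for illustration. It keeps purely alphabetic
    words of at least `min_length` characters, similar to the filter
    `w.isalpha() and len(w) > 2` used in the script.
    """
    # Assumption: matching runs of ASCII letters replaces NLTK's tokenizer.
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(limit)]


if __name__ == "__main__":
    sample = "privacy policy data privacy data privacy of us"
    print(top_words(sample, limit=2))
```

Like the script, this sketch is case-sensitive, so "Privacy" and "privacy" are counted separately unless the text is lowercased first.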

Prerequisite: NLTK must be installed and configured.
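
One quick way to confirm this prerequisite before running the script is to check that the package can be located. The sketch below assumes Python 3, the helper name `is_installed` is hypothetical, and it only checks that the package is importable, not that any NLTK data has been downloaded.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("nltk"):
        print("nltk is available")
    else:
        print("nltk is missing; install it first, e.g. with: pip install nltk")
```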

#!/usr/bin/python
# coding=utf-8
'''
function : This script extracts keywords from privacy policy files
           and stores them in a database named mydb.
author   : Chicho
date     : 2014/7/28
running  : python key_extract.py -d path_of_file
'''

import sys, getopt
import nltk
import MySQLdb
from nltk.corpus import PlaintextCorpusReader

corpus_root = ""

if __name__ == '__main__':
    # long options must be given as a list; "directory=" takes a value
    opts, args = getopt.getopt(sys.argv[1:], "d:h", ["directory=", "help"])
    # get the directory
    for op, value in opts:
        if op in ("-d", "--directory"):
            corpus_root = value
    # The getopt approach above is a little complicated. A simpler
    # alternative is to pass the path as the only argument and read it
    # from sys.argv:
    #     running : python key_extract.py path_of_file
    #     corpus_root = sys.argv[1]

    # corpus_root is the directory of privacy policy files; all of them are HTML files
    filelists = PlaintextCorpusReader(corpus_root, '.*')
    # get the list of files
    files = filelists.fileids()
    # connect to the database; replace the placeholders with your own settings
    # (your_port must be an integer, e.g. MySQL's default 3306)
    conn = MySQLdb.connect(host='your_personal_host_ip_address', user='rusername', port=your_port, passwd='U_password')
    # get the cursor
    curs = conn.cursor()
    # use UTF-8 for the connection
    conn.set_character_set('utf8')
    curs.execute('set names utf8')
    curs.execute('SET CHARACTER SET utf8;')
    curs.execute('SET character_set_connection=utf8;')
    # conn.text_factory = lambda x: unicode(x, 'utf8', "ignore")
    # conn.text_factory = str

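    # create a database named mydb (left commented out, as in the original)
    # try:
    #     curs.execute("create database mydb")
    # except Exception as e:
    #     print(e)
    conn.select_db('mydb')
    # the table filekeywords (with id and name columns) must already exist;
    # add one column per keyword: key0 ... key299
    # (if a column already exists, the error is printed and the loop stops)
    try:
        for i in range(300):
            sql = "alter table filekeywords add " + "key" + str(i) + " varchar(45)"
            curs.execute(sql)
    except Exception as e:
        print(e)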

    i = 0
    for privacyfile in files:
        # f = open(privacyfile, 'r', encoding='utf-8')
        # insert a row for this file; parameterized queries let the driver
        # escape the values
        curs.execute("insert into filekeywords set id = %s", (i,))
        curs.execute("update filekeywords set name = %s where id = %s", (privacyfile, i))
        # get the words in the privacy policy
        wordlist = [w for w in filelists.words(privacyfile) if w.isalpha() and len(w) > 2]
        # get the keywords, most frequent first; most_common() assumes NLTK 3,
        # where keys() is no longer sorted by frequency
        fdist = nltk.FreqDist(wordlist)
        vol = [word for word, count in fdist.most_common(300)]
        for j in range(len(vol)):
            # column names cannot be bound as parameters, so only the values are
            sql = "update filekeywords set key" + str(j) + " = %s where id = %s"
            curs.execute(sql, (vol[j], i))
        i = i + 1
    conn.commit()
    curs.close()
    conn.close()
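
The script stores each keyword in its own column (key0 to key299). An alternative layout, shown here as a sketch rather than as the author's method, keeps one row per file and keyword position. To stay self-contained it uses the standard library's sqlite3 with an in-memory database instead of MySQL; the table name `file_keywords`, its columns, and the functions `make_connection` and `store_keywords` are assumptions made for this example.

```python
import sqlite3


def make_connection():
    """Create an in-memory database with the hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_keywords ("
        " file_name TEXT NOT NULL,"
        " pos INTEGER NOT NULL,"
        " keyword TEXT NOT NULL,"
        " PRIMARY KEY (file_name, pos))"
    )
    return conn


def store_keywords(conn, file_name, keywords):
    """Insert one row per keyword; `pos` is its position in `keywords`."""
    rows = [(file_name, pos, word) for pos, word in enumerate(keywords)]
    # Placeholders (?) let the driver handle quoting of the values.
    conn.executemany(
        "INSERT INTO file_keywords (file_name, pos, keyword) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


if __name__ == "__main__":
    conn = make_connection()
    store_keywords(conn, "policy.html", ["privacy", "data", "information"])
    query = "SELECT pos, keyword FROM file_keywords WHERE file_name = ? ORDER BY pos"
    for row in conn.execute(query, ("policy.html",)):
        print(row)
    conn.close()
```

With this layout the number of keywords per file is not fixed by the schema, and no `alter table` step is needed.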

When reposting, please credit the source: http://blog.csdn.net/chichoxian/article/details/42003603
