Reading doc files with Python

import os, time, fnmatch

from docx import Document # provided by the python-docx package

class search:

  def __init__(self, path, search_string, file_filter):
    self.search_path = path
    self.search_string = search_string
    self.file_filter = file_filter
    print("Search %s in %s..." % (self.search_string, self.search_path))
    print("_" * 80)
    time_begin = time.time()
    file_count = self.walk()
    print("_" * 80)
    print("%s files searched in %0.2fsec." % (
      file_count, (time.time() - time_begin)
    ))

  # Walk every file under search_path and count the ones that match a filter
  def walk(self):
    file_count = 0
    for root, dirlist, filelist in os.walk(self.search_path, followlinks=True):
      for filename in filelist:
        for file_filter in self.file_filter:
          if fnmatch.fnmatch(filename, file_filter):
            self.search_file(os.path.join(root, filename))
            file_count += 1
            break # do not search the same file twice when several patterns match
    return file_count

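  # Scan the text of each paragraph in the file and print excerpts around matches
  def search_file(self, filepath):
    d = Document(filepath)
    for para in d.paragraphs:
      content = para.text # compare against the paragraph text, not the paragraph list
      if self.search_string in content:
        print(filepath)
        self.cutout_content(content)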

  # Cut out excerpts around each match and print them
  def cutout_content(self, content):
    current_pos = 0
    search_string_len = len(self.search_string)
    for i in range(max_cutouts):
      try:
        # Find the next occurrence of self.search_string at or after current_pos
        pos = content.index(self.search_string, current_pos)
      except ValueError:
        break
      # The display window spans content_extract characters before and after the match position
      start = max(pos - content_extract, 0) # clamp so a negative index does not wrap around
      content_window = content[start : pos + content_extract]
      print(">>>", repr(content_window))
      current_pos = pos + search_string_len # continue just past this match
    print()

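# Main entry point
if __name__ == "__main__":
  search_path = r"c:\Users\Administrator\Desktop"
  file_filter = ("*.docx",) # fnmatch patterns; python-docx cannot open legacy .doc files
  search_string = "history"
  content_extract = 35 # characters of context taken on each side of a match
  max_cutouts = 20 # maximum number of excerpts printed per cutout_content call
  search(search_path, search_string, file_filter)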
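
The excerpt-cutting step can also be tried on its own, without any Word file. The following is only a minimal sketch, not part of the script above: the function name `find_excerpts` and its parameters `needle`, `extract` and `max_cutouts` are made up for illustration, and the window is taken the same way as in `cutout_content` (`extract` characters before the match position and `extract` characters from it onward).

```python
def find_excerpts(content, needle, extract=35, max_cutouts=20):
    """Return up to max_cutouts text windows around each match of needle.

    Hypothetical helper for illustration; it only uses built-in str methods.
    """
    excerpts = []
    if not needle:
        return excerpts  # an empty search string would match everywhere
    pos = content.find(needle)
    while pos != -1 and len(excerpts) < max_cutouts:
        start = max(pos - extract, 0)  # clamp so a negative index never wraps around
        excerpts.append(content[start:pos + extract])
        # Resume the search just past the current match
        pos = content.find(needle, pos + len(needle))
    return excerpts


if __name__ == "__main__":
    sample = "a short history of history books"
    for excerpt in find_excerpts(sample, "history", extract=10):
        print(">>>", repr(excerpt))
```

Returning a list instead of printing makes the windowing easy to check separately from the directory walk and the python-docx parsing.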


