Reading the contents of a Word .docx file with python-docx

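This is my first blog post and I wasn't sure what to write about, so I am recording the problems I ran into while learning Python so that I can look them up later. I'm a beginner and the writing is rough; if you spot any mistakes, please point them out!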

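Chinese encoding problems are always a headache. I wanted to read the contents of a Word file with Python, but open() kept raising errors (a .docx file is a ZIP archive of XML parts, not plain text). After searching online I found that Python has a module dedicated to reading .docx files, python-docx. It can only read .docx files, not .doc files, and it is very convenient to use.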

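Install python-docx:

pip install python-docx

(Note: it is not pip install docx! A package named docx can be installed too, but it is a different package, and importing it always failed for me with an error about a missing exceptions module.)

After that, python-docx can be used to read the text of a Word file.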

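The code is as follows:

from docx import Document

# Path to the .docx file; the doubled backslashes are escapes for a Windows path
path = "C:\\Users\\Administrator\\Desktop\\word.docx"

document = Document(path)

# Print the text of every paragraph in the document body
for paragraph in document.paragraphs:
    print(paragraph.text)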

Run it and the text is printed.

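I then tried to read a .doc file with docx.

The code is as follows:

import os

import docx

for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print(filename[:-4])
        # Only the name string changes here; the file on disk is still a .doc
        doc = docx.Document(filename[:-4] + ".docx")
        for para in doc.paragraphs:
            print(para.text)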

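The result is an error: docx.opc.exceptions.PackageNotFoundError: Package not found. The .doc file still cannot be read.

To quote the first commenter: "Changing the file extension does not change how the file is encoded, so the text still cannot be read. The .doc file has to be saved as a .docx file first, and then python-docx can read its contents."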
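To see why renaming does not help, it is enough to look at the first bytes of the file: a .docx is a ZIP container, while a legacy .doc is an OLE2 compound file with its own signature. The sketch below uses only the standard library; the function name sniff_word_format and the fake file it creates are assumptions for the demo. It does not convert anything; it only tells the two containers apart.

```python
import os
import tempfile
import zipfile

# Signature at the start of OLE2 compound files, the container used by legacy .doc
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess the container format from the file's bytes, ignoring its extension."""
    if zipfile.is_zipfile(path):
        return "zip"  # what a real .docx looks like on disk
    with open(path, "rb") as f:
        if f.read(8) == OLE2_MAGIC:
            return "ole2"  # legacy .doc, whatever the file is called
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    # A fake legacy file that has merely been given a .docx name
    renamed = os.path.join(tmp, "renamed.docx")
    with open(renamed, "wb") as f:
        f.write(OLE2_MAGIC + b"\x00" * 504)

    # A minimal ZIP archive standing in for a real .docx container
    zipped = os.path.join(tmp, "real.docx")
    with zipfile.ZipFile(zipped, "w") as archive:
        archive.writestr("word/document.xml", "<xml/>")

    renamed_kind = sniff_word_format(renamed)
    zipped_kind = sniff_word_format(zipped)

print(renamed_kind)
print(zipped_kind)
```

A check like this could be run before calling docx.Document so that renamed .doc files are reported clearly and can then be saved as .docx in Word or a similar program.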

# Document also has methods for adding headings, page breaks, paragraphs, pictures, sections and so on, described below
  |  add_heading(self, text='', level=1)
  |      Return a heading paragraph newly added to the end of the document,
  |      containing *text* and having its paragraph style determined by
  |      *level*. If *level* is 0, the style is set to `Title`. If *level* is
  |      1 (or omitted), `Heading 1` is used. Otherwise the style is set to
  |      `Heading {level}`. Raises |ValueError| if *level* is outside the
  |      range 0-9.
  |
  |  add_page_break(self)
  |      Return a paragraph newly added to the end of the document and
  |      containing only a page break.
  |
  |  add_paragraph(self, text='', style=None)
  |      Return a paragraph newly added to the end of the document, populated
  |      with *text* and having paragraph style *style*. *text* can contain
  |      tab (``\t``) characters, which are converted to the appropriate XML
  |      form for a tab. *text* can also include newline (``\n``) or carriage
  |      return (``\r``) characters, each of which is converted to a line
  |      break.
  |
  |  add_picture(self, image_path_or_stream, width=None, height=None)
  |      Return a new picture shape added in its own paragraph at the end of
  |      the document. The picture contains the image at
  |      *image_path_or_stream*, scaled based on *width* and *height*. If
  |      neither width nor height is specified, the picture appears at its
  |      native size. If only one is specified, it is used to compute
  |      a scaling factor that is then applied to the unspecified dimension,
  |      preserving the aspect ratio of the image. The native size of the
  |      picture is calculated using the dots-per-inch (dpi) value specified
  |      in the image file, defaulting to 72 dpi if no value is specified, as
  |      is often the case.
  |
  |  add_section(self, start_type=2)
  |      Return a |Section| object representing a new section added at the end
  |      of the document. The optional *start_type* argument must be a member
  |      of the :ref:`WdSectionStart` enumeration, and defaults to
  |      ``WD_SECTION.NEW_PAGE`` if not provided.
  |
  |  add_table(self, rows, cols, style=None)
  |      Add a table having row and column counts of *rows* and *cols*
  |      respectively and table style of *style*. *style* may be a paragraph
  |      style object or a paragraph style name. If *style* is |None|, the
  |      table inherits the default table style of the document.
  |
  |  save(self, path_or_stream)
  |      Save this document to *path_or_stream*, which can be either a path to
  |      a filesystem location (a string) or a file-like object.

docx has many other features that I am still learning; for details, see the official documentation: https://python-docx.readthedocs.io/en/latest/user/quickstart.html
