Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Parsing .doc files

Requires poi-scratchpad 3.7 (poi-scratchpad-3.7.jar)

POI-HWPF - A Quick Guide

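Basic text extraction

For basic text extraction, use org.apache.poi.hwpf.extractor.WordExtractor. Its constructor accepts either an InputStream or an HWPFDocument.

getText() returns all of the text in the document.

getParagraphText() returns the text of each paragraph in turn, as a String array.

getTextFromPieces() reads the text directly from the document's text pieces, not page by page. It is very fast, but it tends to return things that are not visible text on the page.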
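The methods above belong to org.apache.poi.hwpf.extractor.WordExtractor. A minimal sketch of how they might be called is shown here; the class name BasicDocTextExtraction and the fallback file name sample.doc are placeholders for illustration only, and the input is assumed to be a binary .doc file with the POI and poi-scratchpad JARs on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class BasicDocTextExtraction {

    public static void main(String[] args) throws IOException {
        // Placeholder path: pass the location of a real .doc file as the first argument.
        String path = args.length > 0 ? args[0] : "sample.doc";

        try (InputStream in = new FileInputStream(path)) {
            // WordExtractor also has a constructor that takes an HWPFDocument.
            WordExtractor extractor = new WordExtractor(in);

            // All of the document text as a single string.
            System.out.println(extractor.getText());

            // The same text, one array element per paragraph.
            String[] paragraphs = extractor.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                System.out.println("[" + i + "] " + paragraphs[i]);
            }
        }
    }
}
```

getTextFromPieces() is left out of the sketch on purpose, since its output may contain content that is not visible in the document.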

Extracting specific text attributes

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

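Step 1: Create an HWPFDocument

Step 2: Get the Range

getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.

int numParagraphs(): Returns the number of paragraphs in a range.

int numSections(): Returns the number of sections in a range. (These are Word sections, the kind created with a section break from the Insert > Break menu.)

Step 3: Get the paragraphs

getParagraph(int index): Returns the paragraph at the given index within the range.

text(): Returns the text of the paragraph. (Paragraph inherits this method from Range; a CharacterRun has it as well.)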

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class ParseWordDocTest {

    public static void main(String[] args) throws Exception {
        // HWPFDocument reads only the binary .doc format, so the input must be a .doc file
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc")) {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Visit every character run in the paragraph
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor()); // color index
                    System.out.println("Font size " + run.getFontSize()); // font size, in half-points
                    System.out.println("Font Name " + run.getFontName()); // font name
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode()); // bold, italic, underline code
                    System.out.println("Text is " + run.text()); // text of the run
                }
            }
        }
    }
}
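Note on the printed text: as far as I know, the strings returned by text() in HWPF keep Word's own control characters, so a paragraph's last run usually ends with a carriage return, and text inside a table cell ends with the cell mark (character code 7). A minimal sketch of trimming those before further processing, using only the JDK; the class and method names are hypothetical and not part of POI.

```java
public class HwpfTextCleaner {

    // Character code 7 is assumed here to be the table cell mark in HWPF text.
    private static final char CELL_MARK = 7;

    /**
     * Removes trailing paragraph marks, line feeds, and cell marks
     * from a string such as the result of run.text().
     * Characters in the middle of the string are left untouched.
     */
    public static String stripTrailingMarks(String raw) {
        if (raw == null) {
            return "";
        }
        int end = raw.length();
        while (end > 0) {
            char c = raw.charAt(end - 1);
            if (c == '\r' || c == '\n' || c == CELL_MARK) {
                end--;
            } else {
                break;
            }
        }
        return raw.substring(0, end);
    }

    public static void main(String[] args) {
        // prints [Hello]
        System.out.println("[" + stripTrailingMarks("Hello\r") + "]");
    }
}
```

In the loop above, this would be applied as stripTrailingMarks(run.text()) before printing or comparing the text.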

Parsing .docx files

Requires poi-ooxml 3.7 (poi-ooxml-3.7.jar)

http://poi.apache.org/document/quick-guide-xwpf.html

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    public static void main(String[] args) throws Exception {
        // XWPFDocument reads only the XML-based .docx format
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx")) {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                // Each run is a stretch of text that shares the same formatting
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    // getText(0) returns the first text element of the run
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold? " + r.isBold());
                    System.out.println("Italic? " + r.isItalic());
                }
            }
        }
    }
}
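HWPFDocument only reads the binary .doc format and XWPFDocument only reads .docx, so handing a file to the wrong class fails. Since a file extension can be wrong, one option is to look at the first bytes of the file before choosing a class. The sketch below uses only the JDK; the class and method names are hypothetical. It relies on two well-known signatures: OLE2 compound files (the container of .doc) start with D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (the container of .docx) start with 50 4B 03 04. These signatures identify the container only, not Word specifically, so the result is a hint rather than a guarantee.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class WordFormatSniffer {

    // Signature of an OLE2 compound file (used by binary .doc).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (used by .docx).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "OLE2", "ZIP", or "UNKNOWN" for the given leading bytes. */
    public static String sniffBytes(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "OLE2"; // candidate for HWPFDocument
        }
        if (startsWith(head, ZIP)) {
            return "ZIP"; // candidate for XWPFDocument
        }
        return "UNKNOWN";
    }

    /** Reads up to 8 bytes from the start of the file and classifies them. */
    public static String sniffFile(String path) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = new FileInputStream(path)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than 8 bytes
                }
                n += r;
            }
        }
        return sniffBytes(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(sniffFile(args[0]));
    }
}
```

The file is opened separately for the check, so the stream later passed to HWPFDocument or XWPFDocument still starts at the first byte.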
