Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Parsing .doc files

Requires the poi-scratchpad 3.7 jar.

POI-HWPF - A Quick Guide

Basic text extraction

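For basic text extraction, use org.apache.poi.hwpf.extractor.WordExtractor. Its constructor accepts either an InputStream or an HWPFDocument.

getText() returns all of the text in the document.

getParagraphText() returns the text of each paragraph in turn, as an array.

getTextFromPieces() reads the text directly from the document's text pieces, not page by page. It is very fast, but it tends to return things that are not text from the page.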
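The extraction calls above can be sketched as follows. This is a minimal sketch, assuming poi-scratchpad is on the classpath; the class name WordExtractorSketch and the command-line path argument are placeholders invented for illustration, not part of POI.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper class; only WordExtractor and its methods come from POI.
final class WordExtractorSketch {

    /** Returns the whole text of a binary .doc file as one string. */
    static String extractAllText(File docFile) throws IOException {
        InputStream in = new FileInputStream(docFile);
        try {
            return new WordExtractor(in).getText();
        } finally {
            in.close();
        }
    }

    /** Returns the text of a binary .doc file, one array entry per paragraph. */
    static String[] extractParagraphs(File docFile) throws IOException {
        InputStream in = new FileInputStream(docFile);
        try {
            return new WordExtractor(in).getParagraphText();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: WordExtractorSketch <path-to-.doc>");
            return;
        }
        File docFile = new File(args[0]);
        System.out.println(extractAllText(docFile));
        String[] paragraphs = extractParagraphs(docFile);
        for (int i = 0; i < paragraphs.length; i++) {
            // Paragraph strings usually end with a line terminator, so trim for display.
            System.out.println(i + ": " + paragraphs[i].trim());
        }
    }
}
```

The file is opened once per call to keep each helper independent; for large documents it would be reasonable to build one WordExtractor and call both methods on it.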

Extracting specific text attributes

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

Step 1: create the HWPFDocument

Step 2: get the Range

getRange(): returns the range that covers the whole of the document, but excludes any headers and footers.

int numParagraphs(): returns the number of paragraphs in a range.

int numSections(): returns the number of sections in a range. A "section" here is the Word section created through Insert > Break.

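Step 3: get the paragraphs

getParagraph(int index): returns the paragraph at the given index within the range.

text(): returns the text of a paragraph or character run (the sample code calls run.text()).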

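import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// The class name is chosen for illustration; the original showed only main().
public class ParseWordDocTest {
    public static void main(String[] args) throws Exception {
        // HWPF reads only the binary .doc format, so the input must be a .doc
        // file; a .docx file has to go through XWPF instead.
        InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc");
        try {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers.
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Visit each character run (a span of uniformly formatted text).
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor());
                    System.out.println("Font size " + run.getFontSize());
                    System.out.println("Font Name " + run.getFontName());
                    // Bold, italic, underline code
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode());
                    System.out.println("Text is " + run.text());
                }
            }
        } finally {
            istream.close();
        }
    }
}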

Parsing .docx files

Requires the poi-ooxml 3.7 jar.

http://poi.apache.org/document/quick-guide-xwpf.html

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    public static void main(String[] args) throws Exception {
        InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx");
        try {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                // Each run is a span of text that shares one set of formatting.
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    // getText(0) reads the first text element of the run.
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold? " + r.isBold());
                    System.out.println("Italic? " + r.isItalic());
                }
            }
        } finally {
            istream.close();
        }
    }
}
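Because HWPFDocument handles only binary .doc files and XWPFDocument handles only .docx files, it can help to check a file's signature before choosing between them. The sketch below uses only the JDK and is not a POI API; the class WordFormatSniffer and its Format enum are names invented for illustration. It relies on two well-known signatures: OLE2 compound files (the .doc container) start with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (the .docx container) start with 50 4B 03 04.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI: guesses the container format from the
// first bytes of a stream.
final class WordFormatSniffer {

    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    private static final int[] OLE2_MAGIC =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Reads up to 8 bytes from the stream and classifies the container. */
    static Format sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended before 8 bytes were available
            }
            read += n;
        }
        if (startsWith(header, read, OLE2_MAGIC)) {
            return Format.DOC_OLE2;
        }
        if (startsWith(header, read, ZIP_MAGIC)) {
            return Format.DOCX_ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0};
        System.out.println(sniff(new ByteArrayInputStream(zipHeader)));
    }
}
```

The signature identifies only the container: other OLE2 files (such as .xls) and other ZIP archives match as well, so this is a first filter, not proof that the file is a Word document. The check also consumes the bytes it reads, so the file should be reopened (or the stream reset) before it is passed to HWPFDocument or XWPFDocument.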
