Java: Using OpenOffice to Convert Uploaded Word Documents to HTML

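Why convert uploaded Word documents to HTML in the first place? Imagine a system where every article published on a page starts out as a Word document. The usual workflow is: open the document in Word, press Ctrl+A and Ctrl+C, go to the article publishing page, press Ctrl+V. That doesn't look like much trouble, but what if the document contains a lot of images? Awkwardly, every image has to be uploaded again.

If a finished Word document could simply be uploaded to the server and then displayed on the appropriate page, that would be very pleasant; at the very least, the people publishing the content would be happy. The programmers might feel differently, though.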

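Converting Word to HTML works like this:

1. The user uploads a Word document to the server.

2. The server calls OpenOffice to open the uploaded Word document.

3. OpenOffice saves the Word document in HTML format.

4. Done.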
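
Step 1 is not covered by the code in this post. As a rough sketch only, the upload could be written to disk like this before it is handed to OpenOffice. The class name UploadStore, the method save and the choice to accept only .doc files are all assumptions made for illustration; how the InputStream is obtained depends on the web framework in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper (not from the original post): stores an uploaded
// stream under a random file name so concurrent uploads cannot collide.
class UploadStore {

    static Path save(InputStream upload, Path dir, String originalName) throws IOException {
        // Assumption: only .doc files are accepted, as in the post's test.
        if (!originalName.toLowerCase().endsWith(".doc")) {
            throw new IllegalArgumentException("Only .doc files are accepted: " + originalName);
        }
        Files.createDirectories(dir);
        // The client-supplied name is never used as a path on the server.
        Path target = dir.resolve(UUID.randomUUID() + ".doc");
        Files.copy(upload, target);
        return target;
    }
}
```

The returned path can be turned into a File with toFile() and passed to the conversion code.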

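As you can see, this requires OpenOffice to be installed on the server. MS Office would also work, but OpenOffice has the advantage of being cross-platform. Note that the tests in this article were run on MS Win7 Ultimate X64.

Here is the step-by-step implementation.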

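1. Download OpenOffice from http://download.openoffice.org/index.html. Easy enough.

2. Download JODConverter from http://www.artofsolving.com/opensource/jodconverter. This is a third-party jar that drives OpenOffice to perform format conversions.

3. Make a cup of hot tea and wait for the downloads.

4. Install OpenOffice. When the installation finishes, open cmd and start OpenOffice as a service: C:\Program Files (x86)\OpenOffice.org 3\program>soffice -headless -accept="socket,port=8100;urp;"

5. Open Eclipse.

6. Drink the tea and wait for Eclipse to start.

7. Create a new Eclipse project and import the jars under Jodconverter/lib.

8. Coding...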
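
Before running any conversion, it can help to confirm that the service started in step 4 is actually listening on port 8100. The following is a minimal sketch using a plain socket probe. OfficePortCheck and isListening are hypothetical names, not part of JODConverter, and the probe only shows that something accepts connections on the port, not that it is OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of JODConverter): checks whether anything
// accepts TCP connections on the given host and port.
class OfficePortCheck {

    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing usable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // 8100 is the port passed to soffice in step 4.
        boolean up = isListening("127.0.0.1", 8100, 1000);
        System.out.println(up
                ? "Port 8100 is reachable"
                : "Nothing is listening on port 8100; start soffice first");
    }
}
```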


package com.mzule.doc2html.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ConnectException;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.artofsolving.jodconverter.DocumentConverter;
import com.artofsolving.jodconverter.openoffice.connection.OpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter;

/**
 * Utility class that converts a Word document into an HTML string.
 *
 * @author MZULE
 */
public class Doc2Html {

    public static void main(String[] args) {
        System.out.println(toHtmlString(new File("C:/test/test.doc"), "C:/test"));
    }

    /**
     * Converts a Word document into an HTML file.
     *
     * @param docFile  the Word document to convert
     * @param filepath the directory in which the converted HTML file is stored
     * @return the converted HTML file
     */
    public static File convert(File docFile, String filepath) {
        // Create the file that will hold the HTML
        File htmlFile = new File(filepath + "/" + new Date().getTime() + ".html");
        // Create the OpenOffice connection
        OpenOfficeConnection con = new SocketOpenOfficeConnection(8100);
        try {
            con.connect();
        } catch (ConnectException e) {
            // Without a connection the conversion cannot succeed, so stop here
            throw new IllegalStateException("Failed to connect to OpenOffice on port 8100", e);
        }
        try {
            // Create the converter and convert the document to HTML
            DocumentConverter converter = new OpenOfficeDocumentConverter(con);
            converter.convert(docFile, htmlFile);
        } finally {
            // Close the OpenOffice connection even if the conversion fails
            con.disconnect();
        }
        return htmlFile;
    }

    /**
     * Converts a Word document into an HTML file and returns the HTML code.
     *
     * @param docFile  the document to convert
     * @param filepath the location where the document's images are saved
     * @return the converted HTML code
     */
    public static String toHtmlString(File docFile, String filepath) {
        // Convert the Word document
        File htmlFile = convert(docFile, filepath);
        StringBuilder htmlSb = new StringBuilder();
        // Read the HTML file using the platform default charset.
        // Line breaks are kept so words at line ends are not glued together.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(htmlFile)))) {
            String line;
            while ((line = br.readLine()) != null) {
                htmlSb.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Delete the temporary file
        htmlFile.delete();
        // Return the cleaned-up HTML text
        return clearFormat(htmlSb.toString(), filepath);
    }

    /**
     * Removes HTML markup that is not needed.
     *
     * @param htmlStr    HTML containing complex markup
     * @param docImgPath path prepended to every image source
     * @return the HTML with the unneeded markup removed
     */
    protected static String clearFormat(String htmlStr, String docImgPath) {
        // Regex that captures the body; DOTALL lets '.' span line breaks
        Pattern bodyPattern = Pattern.compile("<BODY\\b.*</BODY>", Pattern.DOTALL);
        Matcher bodyMatcher = bodyPattern.matcher(htmlStr);
        if (bodyMatcher.find()) {
            // Take the BODY content and turn the BODY tag into a DIV
            htmlStr = bodyMatcher.group().replaceFirst("<BODY", "<DIV")
                    .replace("</BODY>", "</DIV>");
        }
        // Adjust image addresses. A literal replace is used so that '\' and '$'
        // in docImgPath are not interpreted as regex replacement syntax.
        htmlStr = htmlStr.replace("<IMG SRC=\"", "<IMG SRC=\"" + docImgPath + "/");
        // Alternative: turn <P>...</P> into <div>...</div> and keep the styles
        // htmlStr = htmlStr.replaceAll("(?s)(<P)([^>]*>.*?)(</P>)", "<div$2</div>");
        // Normalize <P ...>...</P> to <p>...</p> and drop its attributes
        htmlStr = htmlStr.replaceAll("(?s)<P\\b[^>]*(>.*?)</P>", "<p$1</p>");
        // Remove unneeded tags
        htmlStr = htmlStr.replaceAll(
                "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>",
                "");
        // Remove unneeded attributes. Each pass strips at most one attribute
        // per tag, so repeat until nothing changes.
        String attrReg = "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)"
                + "=(?:'[^']*'|\"[^\"]*\"|[^\\s>]+)([^>]*)>";
        String previous;
        do {
            previous = htmlStr;
            htmlStr = htmlStr.replaceAll(attrReg, "<$1$2>");
        } while (!htmlStr.equals(previous));
        return htmlStr;
    }
}
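
To get a feel for what the tag-stripping step in clearFormat does, here is a small standalone sketch that applies the same tag pattern to a hand-written sample. CleanupDemo, stripTags and the sample string are made up for illustration; real OpenOffice output is considerably messier.

```java
// Hypothetical demo class: applies the tag-stripping pattern from
// clearFormat to a small hand-written HTML fragment.
class CleanupDemo {

    // Same pattern as the "remove unneeded tags" step in clearFormat.
    static final String UNWANTED_TAGS =
            "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>";

    static String stripTags(String html) {
        return html.replaceAll(UNWANTED_TAGS, "");
    }

    public static void main(String[] args) {
        String sample = "<P><FONT FACE=\"Arial\"><SPAN LANG=\"en-US\">Hello</SPAN></FONT><o:p></o:p></P>";
        // FONT, SPAN and the Office-specific o:p tags are removed; P is kept.
        System.out.println(stripTags(sample)); // prints <P>Hello</P>
    }
}
```

Only the tags are removed; the text between them is kept, which is why this step is safe to run on inline formatting such as FONT and SPAN.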

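The class isn't organized very well, so please bear with it; the code comments are fairly detailed, so I won't say more.

The two public methods can be used independently. toHtmlString(...) converts the file and returns the HTML code, ready to be stored in a database.

This post draws on http://dangry.iteye.com/blog/858787, with thanks.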
