OpenOffice Word文档转换成Html格式

为什么会想起来将上传的word文档转换成html格式呢？设想，如果一个系统需要发布在页面的文章都是来自word文档，一般会执行下面的流程：使用word打开文档，Ctrl+A，进入发布文章页面，Ctrl+V。看起来也不麻烦，但是，如果文档中包含大量图片呢？尴尬的事是图片都需要重新上传吧？

如果可以将已经编写好的word文档上传到服务器就可以在相应页面进行展示，将会是一件非常惬意的事情，最起码信息发布人员会很开心。程序员可能就不会这么想了，囧。

将Word转Html的原理是这样的：

1、客户上传Word文档到服务器

2、服务器调用OpenOffice程序打开上传的Word文档

3、OpenOffice将Word文档另存为Html格式

4、Over

至此可见，这要求服务器端安装OpenOffice软件，其实也可以是MS Office，不过OpenOffice的优势是跨平台，你懂的。恩，说明一下，本文的测试基于 MS Win7 Ultimate X64 系统。

下面就是规规矩矩的实现。

1、下载OpenOffice，http://download.openoffice.org/index.html So easy...

2、下载Jodconverter http://www.artofsolving.com/opensource/jodconverter 这是一个开启OpenOffice进行格式转化的第三方jar包。

3、泡杯热茶，等待下载。

4、安装OpenOffice，安装结束后，调用cmd，启动OpenOffice的一项服务：C:\Program Files (x86)\OpenOffice.org 3\program>soffice -headless -accept="socket,port=8100;urp;"

5、打开eclipse

6、喝杯热茶，等待eclipse打开。

7、新建eclipse项目，导入Jodconverter/lib 下得jar包。

8、Coding...

package com.mzule.doc2html.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ConnectException;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.artofsolving.jodconverter.DocumentConverter;
import com.artofsolving.jodconverter.openoffice.connection.OpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter;

/**
 * 将Word文档转换成html字符串的工具类
 * 
 * @author MZULE
 * 
 */
public class Doc2Html {

    public static void main(String[] args) {
    System.out
        .println(toHtmlString(new File("C:/test/test.doc"), "C:/test"));
    }

    /**
     * 将word文档转换成html文档
     * 
     * @param docFile
     *                需要转换的word文档
     * @param filepath
     *                转换之后html的存放路径
     * @return 转换之后的html文件
     */
    public static File convert(File docFile, String filepath) {
    // 创建保存html的文件
    File htmlFile = new File(filepath + "/" + new Date().getTime()
        + ".html");
    // 创建Openoffice连接
    OpenOfficeConnection con = new SocketOpenOfficeConnection(8100);
    try {
        // 连接
        con.connect();
    } catch (ConnectException e) {
        System.out.println("获取OpenOffice连接失败...");
        e.printStackTrace();
    }
    // 创建转换器
    DocumentConverter converter = new OpenOfficeDocumentConverter(con);
    // 转换文档问html
    converter.convert(docFile, htmlFile);
    // 关闭openoffice连接
    con.disconnect();
    return htmlFile;
    }

    /**
     * 将word转换成html文件，并且获取html文件代码。
     * 
     * @param docFile
     *                需要转换的文档
     * @param filepath
     *                文档中图片的保存位置
     * @return 转换成功的html代码
     */
    public static String toHtmlString(File docFile, String filepath) {
    // 转换word文档
    File htmlFile = convert(docFile, filepath);
    // 获取html文件流
    StringBuffer htmlSb = new StringBuffer();
    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(htmlFile)));
        while (br.ready()) {
        htmlSb.append(br.readLine());
        }
        br.close();
        // 删除临时文件
        htmlFile.delete();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // HTML文件字符串
    String htmlStr = htmlSb.toString();
    // 返回经过清洁的html文本
    return clearFormat(htmlStr, filepath);
    }

    /**
     * 清除一些不需要的html标记
     * 
     * @param htmlStr
     *                带有复杂html标记的html语句
     * @return 去除了不需要html标记的语句
     */
    protected static String clearFormat(String htmlStr, String docImgPath) {
    // 获取body内容的正则
    String bodyReg = "<BODY .*</BODY>";
    Pattern bodyPattern = Pattern.compile(bodyReg);
    Matcher bodyMatcher = bodyPattern.matcher(htmlStr);
    if (bodyMatcher.find()) {
        // 获取BODY内容，并转化BODY标签为DIV
        htmlStr = bodyMatcher.group().replaceFirst("<BODY", "<DIV")
            .replaceAll("</BODY>", "</DIV>");
    }
    // 调整图片地址
    htmlStr = htmlStr.replaceAll("<IMG SRC=\"", "<IMG SRC=\"" + docImgPath
        + "/");
    // 把<P></P>转换成</div></div>保留样式
    // content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)",
    // "<div$2</div>");
    // 把<P></P>转换成</div></div>并删除样式
    htmlStr = htmlStr.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
    // 删除不需要的标签
    htmlStr = htmlStr
        .replaceAll(
            "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>",
            "");
    // 删除不需要的属性
    htmlStr = htmlStr
        .replaceAll(
            "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"\"[^\"\"]*\"\"|[^>]+)([^>]*)>",
            "<$1$2>");
    return htmlStr;
    }

}

package com.mzule.doc2html.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ConnectException;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.artofsolving.jodconverter.DocumentConverter;
import com.artofsolving.jodconverter.openoffice.connection.OpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter;

/**
 * 将Word文档转换成html字符串的工具类
 * 
 * @author MZULE
 * 
 */
public class Doc2Html {

    public static void main(String[] args) {
    System.out
        .println(toHtmlString(new File("C:/test/test.doc"), "C:/test"));
    }

    /**
     * 将word文档转换成html文档
     * 
     * @param docFile
     *                需要转换的word文档
     * @param filepath
     *                转换之后html的存放路径
     * @return 转换之后的html文件
     */
    public static File convert(File docFile, String filepath) {
    // 创建保存html的文件
    File htmlFile = new File(filepath + "/" + new Date().getTime()
        + ".html");
    // 创建Openoffice连接
    OpenOfficeConnection con = new SocketOpenOfficeConnection(8100);
    try {
        // 连接
        con.connect();
    } catch (ConnectException e) {
        System.out.println("获取OpenOffice连接失败...");
        e.printStackTrace();
    }
    // 创建转换器
    DocumentConverter converter = new OpenOfficeDocumentConverter(con);
    // 转换文档问html
    converter.convert(docFile, htmlFile);
    // 关闭openoffice连接
    con.disconnect();
    return htmlFile;
    }

    /**
     * 将word转换成html文件，并且获取html文件代码。
     * 
     * @param docFile
     *                需要转换的文档
     * @param filepath
     *                文档中图片的保存位置
     * @return 转换成功的html代码
     */
    public static String toHtmlString(File docFile, String filepath) {
    // 转换word文档
    File htmlFile = convert(docFile, filepath);
    // 获取html文件流
    StringBuffer htmlSb = new StringBuffer();
    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(htmlFile)));
        while (br.ready()) {
        htmlSb.append(br.readLine());
        }
        br.close();
        // 删除临时文件
        htmlFile.delete();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // HTML文件字符串
    String htmlStr = htmlSb.toString();
    // 返回经过清洁的html文本
    return clearFormat(htmlStr, filepath);
    }

    /**
     * 清除一些不需要的html标记
     * 
     * @param htmlStr
     *                带有复杂html标记的html语句
     * @return 去除了不需要html标记的语句
     */
    protected static String clearFormat(String htmlStr, String docImgPath) {
    // 获取body内容的正则
    String bodyReg = "<BODY .*</BODY>";
    Pattern bodyPattern = Pattern.compile(bodyReg);
    Matcher bodyMatcher = bodyPattern.matcher(htmlStr);
    if (bodyMatcher.find()) {
        // 获取BODY内容，并转化BODY标签为DIV
        htmlStr = bodyMatcher.group().replaceFirst("<BODY", "<DIV")
            .replaceAll("</BODY>", "</DIV>");
    }
    // 调整图片地址
    htmlStr = htmlStr.replaceAll("<IMG SRC=\"", "<IMG SRC=\"" + docImgPath
        + "/");
    // 把<P></P>转换成</div></div>保留样式
    // content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)",
    // "<div$2</div>");
    // 把<P></P>转换成</div></div>并删除样式
    htmlStr = htmlStr.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
    // 删除不需要的标签
    htmlStr = htmlStr
        .replaceAll(
            "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>",
            "");
    // 删除不需要的属性
    htmlStr = htmlStr
        .replaceAll(
            "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"\"[^\"\"]*\"\"|[^>]+)([^>]*)>",
            "<$1$2>");
    return htmlStr;
    }

}

类组织的不好，博友凑合看，代码注释比较详细了，不多说。

两个公开的方法是独立使用的，toHtmlString(...)方法是转化文件并获取html代码，以备存入数据库。

参考了http://dangry.iteye.com/blog/858787，表示感谢。

OpenOffice Word文档转换成Html格式的更多相关文章

JAVA：借用OpenOffice将上传的Word文档转换成Html格式
为什么会想起来将上传的word文档转换成html格式呢?设想,如果一个系统需要发布在页面的文章都是来自word文档,一般会执行下面的流程:使用word打开文档,Ctrl+A,进入发布文章页面,Ctrl ...
C# word文档转换成PDF格式文档
最近用到一个功能word转pdf,有个方法不错,挺方便的,直接调用即可,记录下方法:ConvertWordToPdf(string sourcePath, string targetPath) so ...
java将XML文档转换成json格式数据
功能将xml文档转换成json格式数据说明依赖包:1. jdom-2.0.2.jar : xml解析工具包;2. fastjson-1.1.36.jar : 阿里巴巴研发的高性能json工具包 ...
将html版API文档转换成chm格式的API文档
文章完全转载自: https://blog.csdn.net/u012557538/article/details/42089277 将html版API文档转换成chm格式的API文档并不是一件难事, ...
ASP.NET将word文档转换成pdf的代码
一.添加引用 using Microsoft.Office.Interop.Word; 二.转换方法 1.方法 C# 代码 /// <summary> /// 把Word文件转换成pdf文 ...
poi解析word文档转换成html(包括图片解析)
需求:将本地上传的word文档解析并放入数据库中代码: import java.io.ByteArrayOutputStream;import java.io.File;import java.io ...
Python将word文档转换成PDF文件
如题. 代码: ''' #將word文档转换为pdf文件 #用到的库是pywin32 #思路上是调用了windows和office功能 ''' #导入所需库 from win32com.client ...
Java利用aspose-words将word文档转换成pdf（破解无水印）
首先下载aspose-words-15.8.0-jdk16.jar包 http://pan.baidu.com/s/1nvbJwnv 引入jar包,编写Java代码 package doc; impo ...
如何用C#把Doc文档转换成rtf格式
先在项目引用里添加上对Microsoft Word 9.0 object library的引用 using System; namespace DocConvert { class DoctoRtf ...

随机推荐

Android中measure过程、WRAP_CONTENT详解以及 xml布局文件解析流程浅析
转自:http://www.uml.org.cn/mobiledev/201211221.asp 今天,我着重讲解下如下三个内容: measure过程 WRAP_CONTENT.MATCH_PAREN ...
【C#】datetimepicker里面如何设置日期为当天日期，而时间设为0：00或23：59？
今天无意中发现要根据日期查询时间,datatimepicker控件会把时间默认成当前时间(当你的控件只显示日期时),这样查询出来的出来的数据会有误差,用来下面的办法成功设置日期为当天日期,而时间设为0 ...
ALGO-5_蓝桥杯_算法训练_最短路
记: 一开始没接触过关于最短距离的算法,便开始翻阅关于图的知识, 得知关于最短距离的算法有Dijkstra算法(堆优化暂未看懂),Bellman-Ford算法,Floyd算法,SPFA算法. 由于数据 ...
关于 appium get_attribute --获取对应属性值 API说明
1.获取 content-desc 的方法为 get_attribute("name") ,而且还不能保证返回的一定是 content-desc (content-desc 为空时 ...
[转]C#调用Excel VBA宏
[转载自]http://www.shangxueba.com/jingyan/95031.html 附上一段原创常用代码计算列标题字符串 Function CalcColumn(ByVal c As ...
centos 安装LAMP环境后装phpmyadmin
首先在CentOS 上安装EPEL 要想安装EPEL,我们先要下载EPEL的rpm安装包. 1. 确认你的CentOS 的版本首先通过以下命令确认你的CentOS 版本 $ cat /etc/Red ...
Android logcat命令详解
一.logcat命令介绍 1.android log系统 2.logcat介绍 logcat是android中的一个命令行工具,可以用于得到程序的log信息 log类是一个日志类,可以在代码中使用lo ...
JVM内部细节之三：字符串及字符串常量池
本人最近正在面试,然后注意到总是有公司喜欢考String的问题,如字符串连接有几种方式,它们之间有什么不同等问题:要不就是给一段代码问创建了几个对象.那么该不该问呢?我认为当面试有一定工作经验的求职者 ...
Hibernate 一对多/多对多
一对多关联(多对一): 一对多关联映射: 在多的一端添加一个外键指向一的一端,它维护的关系是一指向多多对一关联映射: 咋多的一端加入一个外键指向一的一端,它维护的关系是多指向一在配置文件中添加: ...
mvc和mtv
Java中MVC详解以及优缺点总结概念: MVC全名是Model View Controller,是模型(model)-视图(view)-控制器(controller)的缩写,一种软件设计典范,用一 ...

OpenOffice Word文档转换成Html格式

OpenOffice Word文档转换成Html格式的更多相关文章

随机推荐

热门专题