Converting Word documents to PDF (both .doc and .docx), plus a method for extracting specific pages from a PDF

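My company needed to convert Word documents to PDF and extract the first page for display, and both .doc and .docx had to be supported. A .docx file can be converted to PDF directly with POI (through the xdocreport PDF converter), but a .doc file cannot. After reading a lot of material online, the general approach turned out to be sound: convert the .doc file to HTML, then convert the HTML to PDF. In practice, though, most of the published methods were incomplete: either the generated HTML had unclosed tags and could not be converted to PDF, or Chinese text did not show up in the PDF. I combined the various methods into code that I have tested and that works, shown below: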
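The routing described above (direct conversion for .docx, an HTML detour for .doc) can be sketched with the standard library alone. `WordFormat`, `Kind`, `detect` and `route` are hypothetical names used only for illustration; they are not part of the utility class in this post, and the sketch compares extensions case-insensitively, which is an assumption about what callers want.

```java
import java.util.Locale;

/** Hypothetical helper: classifies a file name and names the conversion route. */
final class WordFormat {
    enum Kind { DOC, DOCX, UNSUPPORTED }

    /** Detects the Word format from the file extension, ignoring case. */
    static Kind detect(String fileName) {
        if (fileName == null) {
            return Kind.UNSUPPORTED;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        // Check ".docx" first for readability; ".doc" does not match "x.docx" via endsWith
        if (lower.endsWith(".docx")) {
            return Kind.DOCX;
        }
        if (lower.endsWith(".doc")) {
            return Kind.DOC;
        }
        return Kind.UNSUPPORTED;
    }

    /** Describes the conversion steps this post uses for each format. */
    static String route(Kind kind) {
        switch (kind) {
            case DOCX:
                return "docx -> pdf";
            case DOC:
                return "doc -> html -> xhtml -> pdf";
            default:
                return "unsupported";
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.docx", "REPORT.DOC", "notes.txt"}) {
            System.out.println(name + ": " + route(detect(name)));
        }
    }
}
```

Keeping the format decision in one place means the extension check and the branch that picks the converter cannot drift apart.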

Maven dependencies:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.14</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.14</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.14</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>xdocreport</artifactId>
    <version>1.0.6</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.14</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.3</version>
</dependency>
<dependency>
    <groupId>org.xhtmlrenderer</groupId>
    <artifactId>flying-saucer-pdf</artifactId>
    <version>9.0.7</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>

Code:

/**
*
*/
package cn.test.util.utils;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.commons.collections.MapUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Entities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Document;
import org.xhtmlrenderer.pdf.ITextFontResolver;
import org.xhtmlrenderer.pdf.ITextRenderer;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.lowagie.text.pdf.BaseFont;

/**
* @author gsxs
* <li>Utility class for converting Word documents to PDF</li>
* @since 2019-02-26 15:52:21
*/
public class Word2PDFUtils {

private static final Logger logger = LoggerFactory
.getLogger(Word2PDFUtils.class);

public static void main(String[] args) {
       try {
           word2PDF("D:/Test/test.doc",
                   "D:/Test/test.pdf");
       } catch (Exception e) {
           logger.error("Word to PDF conversion failed", e);
       }
   }

/**
   * Converts a Word document to PDF; handles both .doc and .docx.
   *
   * @param wordFilePath
   *            path of the Word document
   * @param pdfFilePath
   *            path of the PDF to write
   * @return the PDF file, or null if either path is blank
   * @throws Exception
   */
   public static File word2PDF(String wordFilePath, String pdfFilePath)
           throws Exception {
       if (StringUtils.isBlank(pdfFilePath)
               || StringUtils.isBlank(wordFilePath)) {
           logger.info("word2PDF: file path is blank, wordFilePath={}, pdfFilePath={}",
                   wordFilePath, pdfFilePath);
           return null;
       }
       File wordFile = new File(wordFilePath);
       File pdfFile = new File(pdfFilePath);
       return word2PDF(wordFile, pdfFile);
   }

/**
   * Converts a Word document to PDF and keeps only the first page; handles
   * both .doc and .docx.
   *
   * @param wordFile
   *            source Word document
   * @param pdfFile
   *            target PDF file
   * @return the target PDF file, or null if an argument is null or the
   *         source is not a Word document
   * @throws Exception
   */
   public static File word2PDF(File wordFile, File pdfFile) throws Exception {
       if (null == wordFile || null == pdfFile) {
           logger.info("word2PDF: file is null, wordFile={}, pdfFile={}", wordFile,
                   pdfFile);
           return null;
       }
       // Compare extensions case-insensitively so that "A.DOC" is accepted too
       String wordName = wordFile.getName();
       String lowerName = wordName.toLowerCase();
       boolean isDocx = lowerName.endsWith(".docx");
       boolean isDoc = lowerName.endsWith(".doc");
       if (!isDoc && !isDocx) {
           // Wrong format
           logger.info("Not a Word document, path={}", wordFile.getAbsolutePath());
           return null;
       }
       // getAbsoluteFile() gives a bare relative file name a parent directory
       File pdfParentFile = pdfFile.getAbsoluteFile().getParentFile();
       if (!pdfParentFile.exists()) {
           pdfParentFile.mkdirs();
       }
       String absolutePath = pdfParentFile.getAbsolutePath();
       // Strip only the last extension so that "a.b.doc" keeps the base name "a.b"
       wordName = wordName.substring(0, wordName.lastIndexOf('.'));
       String pdfPath = absolutePath + "/pdf/" + wordName + ".pdf";
       File tempPdfFile = new File(pdfPath);
       // The .docx branch writes straight to the temp file, so its directory must exist
       tempPdfFile.getParentFile().mkdirs();
       if (isDoc) {
           String htmlPath = absolutePath + "/html/" + wordName + ".html";
           File htmlFile = new File(htmlPath);
           // .doc: convert to HTML, normalize the tags into XHTML, then convert to PDF
           wordDocToHtml(wordFile, htmlFile);
           convertHtmlToPdf(htmlFile, tempPdfFile);
           // Delete the intermediate HTML file
           boolean delete = htmlFile.delete();
           logger.info("Deleted htmlFile path={}, result={}",
                   htmlFile.getAbsolutePath(), delete);
       } else {
           // .docx: convert to PDF directly; close both streams when done
           try (InputStream in = new FileInputStream(wordFile);
                   OutputStream out = new FileOutputStream(tempPdfFile)) {
               wordConverterToPdf(in, out, null);
           }
       }
       // Extract the first page (the end page passed here is exclusive)
       splitPDFFile(tempPdfFile.getAbsolutePath(), pdfFile.getAbsolutePath(),
               1, 2);
       // Delete the temporary PDF file
       boolean delete = tempPdfFile.delete();
       logger.info("Deleted tempPdfFile path={}, result={}",
               tempPdfFile.getAbsolutePath(), delete);
       return pdfFile;
   }

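/**
   * Converts a Word document to PDF, replacing placeholders along the way.
   *
   * @param source
   *            source Word document; must be .docx
   * @param target
   *            output stream for the PDF
   * @param params
   *            placeholders to replace (may be null)
   * @throws Exception
   */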
   private static void wordConverterToPdf(InputStream source,
           OutputStream target, Map<String, String> params) throws Exception {
       wordConverterToPdf(source, target, null, params);
   }

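/**
   * Converts a Word document to PDF, replacing placeholders along the way.
   *
   * @param source
   *            source Word document; must be .docx
   * @param target
   *            output stream for the PDF
   * @param params
   *            placeholders to replace (may be null)
   * @param options
   *            for example PdfOptions.create().fontEncoding( "windows-1250" ),
   *            or null for the defaults
   * @throws Exception
   */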
   private static void wordConverterToPdf(InputStream source,
           OutputStream target, PdfOptions options, Map<String, String> params)
           throws Exception {
       XWPFDocument doc = new XWPFDocument(source);
       paragraphReplace(doc.getParagraphs(), params);
       // Walk the tables only when there is something to replace
       if (MapUtils.isNotEmpty(params)) {
           for (XWPFTable table : doc.getTables()) {
               for (XWPFTableRow row : table.getRows()) {
                   for (XWPFTableCell cell : row.getTableCells()) {
                       paragraphReplace(cell.getParagraphs(), params);
                   }
               }
           }
       }
       PdfConverter.getInstance().convert(doc, target, options);
   }

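/**
   * Replaces placeholder text in the given paragraphs. A run is replaced
   * only when its text matches a key of params exactly.
   *
   * @param paragraphs
   *            paragraphs to scan
   * @param params
   *            map from placeholder text to replacement text
   */
   private static void paragraphReplace(List<XWPFParagraph> paragraphs,
           Map<String, String> params) {
       if (MapUtils.isNotEmpty(params)) {
           for (XWPFParagraph p : paragraphs) {
               for (XWPFRun r : p.getRuns()) {
                   // Read the run's first text node. getTextPosition() is the
                   // vertical offset of the text, not an index into the run.
                   String content = r.getText(0);
                   if (StringUtils.isNotEmpty(content)
                           && params.containsKey(content)) {
                       r.setText(params.get(content), 0);
                   }
               }
           }
       }
   }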

/**
   * Converts a .doc document to HTML.
   *
   * @param wordFile
   *            source .doc file
   * @param htmlFile
   *            target HTML file
   */
   private static void wordDocToHtml(File wordFile, File htmlFile) {
       if (null == wordFile || null == htmlFile) {
           return;
       }
       File parentFile = htmlFile.getParentFile();
       if (!parentFile.exists()) {
           parentFile.mkdirs();
       }
       String absolutePath = parentFile.getAbsolutePath();
       final String imagepath = absolutePath + "/temp/wordimage/";
       // try-with-resources closes the input stream even if the conversion fails
       try (InputStream in = new FileInputStream(wordFile)) {
           // Load the .doc file into POI's document model
           HWPFDocument wordDocument = new HWPFDocument(in);
           // Create an empty DOM document for the converter to fill
           DocumentBuilderFactory domBuilderFactory = DocumentBuilderFactory
                   .newInstance();
           DocumentBuilder domBuilder = domBuilderFactory.newDocumentBuilder();
           Document dom = domBuilder.newDocument();
           WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                   dom);
           // Save each embedded picture and return the path used in the HTML
           wordToHtmlConverter.setPicturesManager(new PicturesManager() {
               @Override
               public String savePicture(byte[] content,
                       PictureType pictureType, String suggestedName,
                       float widthInches, float heightInches) {
                   File imgPath = new File(imagepath);
                   if (!imgPath.exists()) { // create the picture directory if needed
                       imgPath.mkdirs();
                   }
                   File file = new File(imagepath + suggestedName);
                   try (OutputStream os = new FileOutputStream(file)) {
                       os.write(content);
                   } catch (IOException e) {
                       logger.error("Failed to save picture {}", file, e);
                   }
                   return imagepath + suggestedName;
               }
           });
           // Convert the Word document into the DOM
           wordToHtmlConverter.processDocument(wordDocument);
           // Save all pictures stored in the document
           List<?> pics = wordDocument.getPicturesTable().getAllPictures();
           if (pics != null && !pics.isEmpty()) {
               new File(imagepath).mkdirs();
               for (int i = 0; i < pics.size(); i++) {
                   Picture pic = (Picture) pics.get(i);
                   try (OutputStream os = new FileOutputStream(imagepath
                           + pic.suggestFullFileName())) {
                       pic.writeImageContent(os);
                   } catch (IOException e) {
                       logger.error("Failed to save picture {}",
                               pic.suggestFullFileName(), e);
                   }
               }
           }
           // Get the DOM that the converter filled in
           Document htmlDocument = wordToHtmlConverter.getDocument();
           DOMSource domSource = new DOMSource(htmlDocument);
           TransformerFactory tf = TransformerFactory.newInstance();
           Transformer serializer = tf.newTransformer();
           // Encoding of the generated HTML file; convertHtmlToPdf reads it back as UTF-8
           serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
           serializer.setOutputProperty(OutputKeys.INDENT, "yes");
           serializer.setOutputProperty(OutputKeys.METHOD, "html");
           // Serialize the DOM to the HTML file
           try (OutputStream out = new FileOutputStream(htmlFile)) {
               serializer.transform(domSource, new StreamResult(out));
           }
       } catch (IOException | TransformerException
               | ParserConfigurationException e) {
           logger.error("Failed to convert {} to HTML", wordFile, e);
       }
   }

/**
   * Converts a .doc document to HTML.
   *
   * @param wordFilePath
   *            path of the source .doc file
   * @param htmlFilePath
   *            path of the HTML file to write
   */
   private static void wordDocToHtml(String wordFilePath, String htmlFilePath) {

if (org.apache.commons.lang3.StringUtils.isAnyBlank(wordFilePath,
               htmlFilePath)) {
           return;
       }
       File wordFile = new File(wordFilePath);
       File htmlFile = new File(htmlFilePath);
       wordDocToHtml(wordFile, htmlFile);
   }

/**
   * Converts an HTML file to PDF.
   *
   * @param htmlFile
   *            source HTML file
   * @param pdfFile
   *            target PDF file
   * @return true if the PDF was written, false if an argument is null
   * @throws Exception
   */
   private static boolean convertHtmlToPdf(File htmlFile, File pdfFile)
           throws Exception {
       if (null == htmlFile || null == pdfFile) {
           logger.info("HTML to PDF: a file is null, htmlFile={}, pdfFile={}", htmlFile,
                   pdfFile);
           return false;
       }
       String absoluteFilePath = htmlFile.getParentFile().getAbsolutePath();
       if (!pdfFile.getParentFile().exists()) {
           pdfFile.getParentFile().mkdirs();
       }
       // The HTML produced from .doc is not well-formed: tags such as <meta>
       // are left unclosed, which makes the PDF conversion throw. jsoup
       // re-serializes the document as XHTML so that every tag is closed.
       org.jsoup.nodes.Document parse = Jsoup.parse(htmlFile, "utf-8");
       parse.outputSettings()
               .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
               .escapeMode(Entities.EscapeMode.xhtml);
       String html = parse.html();
       // Rename the font to SimSun so that it matches the registered font file;
       // without this, Chinese text is missing from the PDF. Print your own
       // generated HTML to check the font name: if it is not 宋体, replace
       // whatever font-family your HTML actually uses.
       html = html.replace("font-family:宋体", "font-family: SimSun");
       ITextRenderer renderer = new ITextRenderer();
       renderer.setDocumentFromString(html);
       // Register a font that contains Chinese glyphs
       ITextFontResolver fontResolver = renderer.getFontResolver();
       String osName = System.getProperty("os.name");
       String path;
       if (osName.toLowerCase().startsWith("win")) { // Windows
           path = Word2PDFUtils.class.getClassLoader()
                   .getResource("simsun.ttc").getPath();
       } else { // Linux
           // Read the font file location from the configuration file.
           // PropKit is not imported in this listing; it is presumably a
           // project configuration helper such as JFinal's com.jfinal.kit.PropKit.
           path = PropKit.get("font.path");
       }
       logger.info("font path={}", path);
       fontResolver.addFont(path, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
       // Base URL used to resolve relative image paths
       renderer.getSharedContext().setBaseURL(
               "file:" + absoluteFilePath + "/temp/htmlimage");
       renderer.layout();
       // Close the output stream even if rendering fails
       try (OutputStream os = new FileOutputStream(pdfFile)) {
           renderer.createPDF(os);
           os.flush();
       }
       return true;
   }

/**
   * Converts an HTML file to PDF.
   *
   * @param inputFile
   *            path of the source HTML file
   * @param outputFile
   *            path of the PDF to write
   * @return true if the PDF was written, false if either path is blank
   * @throws Exception
   */
   private static boolean convertHtmlToPdf(String inputFile, String outputFile)
           throws Exception {
       if (org.apache.commons.lang3.StringUtils.isAnyBlank(inputFile,
               outputFile)) {
           logger.info("HTML to PDF: path is blank, inputFile={}, outputFile={}", inputFile,
                   outputFile);
           return false;
       }
       File htmlFile = new File(inputFile);
       File pdfFile = new File(outputFile);
       return convertHtmlToPdf(htmlFile, pdfFile);
   }

/**
   * Copies pages from (inclusive) to end (exclusive) of a PDF into a new file.
   *
   * @param respdfFile
   *            path of the PDF to split
   * @param savepath
   *            path of the new PDF
   * @param from
   *            first page to copy, 1-based
   * @param end
   *            page at which copying stops (exclusive); 0 means copy through
   *            the last page
   */
   private static void splitPDFFile(String respdfFile, String savepath,
           int from, int end) {
       PdfReader reader = null;
       try {
           reader = new PdfReader(respdfFile);
           int n = reader.getNumberOfPages();
           // Clamp the exclusive end so that it never runs past the last page
           if (end == 0 || end > n + 1) {
               end = n + 1;
           }
           com.itextpdf.text.Document document = new com.itextpdf.text.Document(
                   reader.getPageSize(1));
           PdfCopy copy = new PdfCopy(document, new FileOutputStream(savepath));
           document.open();
           for (int j = from; j < end; j++) {
               PdfImportedPage page = copy.getImportedPage(reader, j);
               copy.addPage(page);
           }
           // Closing the document also finishes the PdfCopy writer
           document.close();
       } catch (IOException | DocumentException e) {
           logger.error("Failed to split PDF {}", respdfFile, e);
       } finally {
           if (reader != null) {
               reader.close();
           }
       }
   }

}
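The page-range convention in splitPDFFile is easy to get wrong: its loop runs while `j < end`, so `end` is exclusive, which is why the call with `(1, 2)` yields only the first page. The following standard-library sketch isolates that arithmetic so it can be checked without iText. `PageRange` and `pages` are hypothetical names, and treating `0` as "through the last page" and clamping to the page count are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: computes which 1-based pages a half-open range selects. */
final class PageRange {

    /**
     * Returns the page numbers in [from, endExclusive), clamped to the document.
     * An endExclusive of 0 is treated as "through the last page" (an assumption).
     */
    static List<Integer> pages(int from, int endExclusive, int totalPages) {
        int start = Math.max(from, 1);
        int stop = (endExclusive == 0)
                ? totalPages + 1
                : Math.min(endExclusive, totalPages + 1);
        List<Integer> result = new ArrayList<>();
        for (int page = start; page < stop; page++) {
            result.add(page);
        }
        return result;
    }

    public static void main(String[] args) {
        // First page only, matching the (1, 2) call used for the preview
        System.out.println(pages(1, 2, 5));
        // From page 2 through the end of a 4-page document
        System.out.println(pages(2, 0, 4));
    }
}
```

An empty result signals that there is nothing to copy, which is worth checking before opening an output document, since iText refuses to close a document that has no pages.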

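The .doc-to-HTML step is the same as in other write-ups. The difference is that the generated HTML is then run through jsoup to produce XHTML with properly closed tags before it is converted to PDF, and Chinese font support is added during the PDF conversion.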
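The font step depends on the font name in the generated CSS matching the registered font, and the code does this with a plain string replacement of `font-family:宋体`. As a sketch of a slightly more tolerant variant, the regular expression below also accepts optional whitespace and quotes around the name. `FontFamilyNormalizer` and `normalize` are hypothetical names, and the Chinese font name is written with Unicode escapes (`\u5B8B\u4F53`) only so that the source file stays ASCII.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites the Chinese Song font name to "SimSun" in CSS. */
final class FontFamilyNormalizer {

    // Matches "font-family", optional spaces around the colon, and the
    // two-character Song font name with optional single or double quotes.
    private static final Pattern SONG_FONT = Pattern.compile(
            "font-family\\s*:\\s*['\"]?\u5B8B\u4F53['\"]?",
            Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with every matching declaration renamed to SimSun. */
    static String normalize(String html) {
        return SONG_FONT.matcher(html).replaceAll("font-family: SimSun");
    }

    public static void main(String[] args) {
        String css = "p{font-family : '\u5B8B\u4F53';font-size:12pt;}";
        System.out.println(normalize(css));
    }
}
```

If your generated HTML uses a different font name, the pattern needs that name instead; printing the HTML once, as the post suggests, is the quickest way to find out.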

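If you get a "method not found" exception, it is probably a jar version problem; try enabling the iText dependency that I had commented out. I ran into this exception at first, but later, as I added more dependencies, it no longer appeared even with that dependency commented out. Another dependency probably pulls in this version (2.0.8) of the iText jar, but I cannot tell whether your other dependencies do, hence this note.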
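When a jar version conflict is suspected, it helps to see which jar a class was actually loaded from. The sketch below uses only standard JDK calls (`Class.getProtectionDomain()` and `CodeSource.getLocation()`); `JarLocator` and `locate` are hypothetical names, and the fallback string for classes without a code source is an arbitrary choice.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from. */
final class JarLocator {

    /** Returns the jar or directory URL of the class, or a fallback marker. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // JDK classes loaded by the bootstrap loader have no code source
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. com.lowagie.text.pdf.BaseFont
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(className + " <- " + locate(Class.forName(className)));
    }
}
```

Running it with the class named in the exception shows which jar on the classpath wins, which narrows down the dependency to exclude or pin.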

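One more note: iText appears to have been renamed itextpdf (package com.itextpdf) starting with version 5, so if you include the itextpdf artifact you should not need to add itext separately. The code above still imports com.lowagie.text.pdf.BaseFont, which lives in the old iText jar, so that jar has to be on the classpath; it normally arrives as a transitive dependency of flying-saucer-pdf. With everything in place, the code ran fine on Windows but failed on Linux: an exception reported that the Chinese font was not recognized, and Chinese text was missing from the PDF. The fix is to install the font file on the Linux machine, put the font's location in the configuration file, and read that path manually (as the code already does).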
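The per-OS font lookup can be separated from the rendering code so that a missing path fails early with a clear message. This is a minimal standard-library sketch; `FontPathResolver` and `resolve` are hypothetical names, and the rule "Windows uses the bundled font, everything else uses the configured path" mirrors the branch in the code above rather than any library requirement.

```java
import java.util.Locale;

/** Hypothetical helper: picks the font file path according to the operating system. */
final class FontPathResolver {

    /**
     * @param osName         value of the "os.name" system property
     * @param bundledPath    path of the font shipped with the application
     * @param configuredPath path read from the configuration file
     * @return the path to register with the PDF renderer
     */
    static String resolve(String osName, String bundledPath, String configuredPath) {
        boolean windows = osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("win");
        String chosen = windows ? bundledPath : configuredPath;
        if (chosen == null || chosen.trim().isEmpty()) {
            // Fail early instead of letting the renderer fail on a blank path
            throw new IllegalStateException("No font path available for OS: " + osName);
        }
        return chosen;
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        // Both paths are placeholders for illustration
        System.out.println(resolve(os, "C:/fonts/simsun.ttc", "/usr/share/fonts/simsun.ttc"));
    }
}
```

Throwing on a blank path turns a silently text-less PDF into an immediate configuration error, which is easier to diagnose on a freshly provisioned Linux server.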

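The font file simsun.ttc was shared through Baidu Cloud:

Link: https://pan.baidu.com/s/1iH4iqJB2X_0gB7T4_CClzA
Extraction code: 7rmn

This link no longer works: Baidu says it detected illegal data and blocked the share. The font ships with Windows, so you can search for the file on your own machine and use that. When I tried the font bundled with Windows, though, it did not seem to work once installed on Linux; a specific font file was needed. I cannot share it through Baidu Cloud. It can be downloaded on CSDN for 1 point; if you would rather not spend points, leave your email address and I will send the font file when I see it, although I cannot promise when that will be.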

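If this works for you, please give it a like so that more people see it and more problems get solved. If you run into any problems while using it, please leave a comment so we can discuss.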
