Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).

Update

April 20, 2015: The article and the Visual Studio project are updated and work with the latest PDFBox version (1.8.9). It's also possible to download the project with all dependencies (resolving the dependencies proved to be a bit tricky).

February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.

How to Parse PDF Files

There are several main methods for extracting text from PDF files in .NET:

  • Microsoft IFilter interface and Adobe IFilter implementation.
  • iTextSharp
  • PDFBox

None of these PDF parsing solutions is perfect. We will discuss all these methods below.

1. Parsing PDF using Adobe PDF IFilter

In order to parse PDF files using IFilter interface you need the following:

Sample code:

Hide   Copy Code
using IFilter;

// ...

public static string ExtractTextFromPdf(string path) {
return DefaultParser.Extract(path);
}

Download a sample project:

If you are using the PDF IFilter that comes with Adobe Acrobat Reader you will need to rename the process to "filtdump.exe" otherwise the IFilter interface will return E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].

Disadvantages:

  1. Using unreliable COM interop that handles IFilter interface (and the combination of IFilter COM and Adobe PDF IFilter is especially troublesome).
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
  3. You have to use "filtdump.exe" file name for your application with the latest PDF IFilter implementation that comes with Acrobat Reader.

2. Parsing PDF using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating and not reading PDFs but it supports extracting text from PDF as well.

Sample code:

Hide   Copy Code
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser; // ... public static string ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder(); for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
} return text.ToString();
}
}

Credit: Member 10364982

Download a sample project:

You may consider using LocationTextExtractionStrategy to get better precision.

Hide   Copy Code
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy(); using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder(); for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string[] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
}
return text.ToString();
}
}

Credit: Member 10140900

Disadvantages of iTextSharp:

  1. Licensing if you are not happy with AGPL license

3. Parsing PDF using PDFBox

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).

Using PDFBox in .NET requires adding references to:

  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • pdfbox-1.8.9.dll

and copying the following files the bin directory:

  • commons-logging.dll
  • fontbox-1.8.9.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.Runtime.dll

Using the PDFBox to parse PDFs is fairly easy:

Hide   Copy Code
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util; // ... private static string ExtractTextFromPdf(string path)
{
PDDocument doc = null;
try {
doc = PDDocument.load(path)
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(doc);
}
finally {
if (doc != null) {
doc.close();
}
}
}

Download a sample project:

The size of the required assemblies adds up to almost 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB)
  • IKVM.OpenJDK.SwingAWT.dll (6 MB)
  • pdfbox-1.8.9.dll (4 MB)
  • commons-logging.dll (82 kB)
  • fontbox-1.8.9.dll (180 kB)
  • IKVM.OpenJDK.Text.dll (800 kB)
  • IKVM.OpenJDK.Util.dll (2 MB)
  • IKVM.Runtime.dll (1 MB)

The speed is not so bad: Parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.

Thanks to bobrien100 for improvements suggestions.

Disadvantages:

  1. IKVM.NET Dependencies (18 MB)
  2. Speed (especially the IKVM.NET warm-up time)

Related information

History

  • April 20, 2015 - Updated to work with the latest PDFBox release (1.8.9)
  • November 27, 2014 - Updated to work with the latest PDFBox release (1.8.7)
  • March 10, 2014 - IFilter file name limitations added, iTextSharp sample extended
  • February 27, 2014 - Samples for IFilter and iTextSharp added.
  • February 24, 2014 - Updated to work with the latest PDFBox release (1.8.4)
  • June 20, 2012 - Updated to work with the latest PDFBox release (1.7.0)

Converting PDF to Text in C#的更多相关文章

  1. C#,VB.NET如何将Word转换为PDF和Text

    众所周知,Word是我们日常工作中常用的办公软件之一,有时出于某种需求我们需要将Word文档转换为PDF以及Text.那么如何以C#,VB.NET编程的方式来实现这一功能呢? 下面我将分开介绍如何运用 ...

  2. .net操作PDF的一些资源(downmoon收集)

    因为业务需要,搜集了一些.net操作pdf的一些资源,特在此分享. 1.如何从 Adobe 可移植文档格式 (PDF) 文件中复制文本和图形 http://support.microsoft.com/ ...

  3. Code Project精彩系列(转)

    Code Project精彩系列(转)   Code Project精彩系列(转)   Applications Crafting a C# forms Editor From scratch htt ...

  4. 【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States

    Problem(Abstract) When converting contents from a file or string using WebSphere Application Server, ...

  5. ASP.Net MVC——使用 ITextSharp 完美解决HTML转PDF(中文也可以)

    前言: 最近在做老师交代的一个在线写实验报告的小项目中,有这么个需求:把学生提交的实验报告(HTML形式)直接转成PDF,方便下载和打印. 以前都是直接用rdlc报表实现的,可这次牵扯到图片,并且更为 ...

  6. nodejs将PDF文件转换成txt文本,并利用python处理转换后的文本文件

    目前公司Web服务端的开发是用Nodejs,所以开发功能的话首先使用Nodejs,这也是为什么不直接用python转换的原因. 由于node对文本的处理(提取所需信息)的能力不强,类似于npm上的包: ...

  7. [转]Display PDF within web browser using MVC3

    本文转自:http://www.codeproject.com/Tips/697733/Display-PDF-within-web-browser-using-MVC Introduction I ...

  8. PDF2SWF转换只有一页的PDF文档,在FlexPaper不显示解决方法

    问题:PDF2SWF转换只有一页的PDF文档,在FlexPaper不显示! FlexPaper 与 PDF2SWF 结合是解决在线阅读PDF格式文件的问题的,多页的PDF文件转换可以正常显示,只有一页 ...

  9. PDF转WORD工具 Solid Converter PDF v9.1.6744

    Solid Converter PDF中文破解版(pdf转换成word转换器)是一款功能强大的PDF格式转换软件.Solid Converter PDF允许用户将PDF转换为Word(PDF to W ...

随机推荐

  1. mysql 查询导出(txt,csv,xls)

    1 简介 工作中产品经常会临时找我导出一些数据,导出mysql查询结果数据有几种方法,下面介绍3种. ①  mysql -u  -p  -e "sql" db > filep ...

  2. vue遍历数组和对象的方法以及他们之间的区别

    前言:vue不能直接通过下标的形式来添加数据,vue也不能直接向对象中插值,因为那样即使能插入值,页面也不会重新渲染数据 一,vue遍历数组   1,使用vue数组变异方法 pop() 删除数组最后一 ...

  3. iOS ----------各种判断

    iOS 判断数字 - (BOOL) deptNumInputShouldNumber:(NSString *)str { if (str.length == 0) { return NO; } NSS ...

  4. android中的相对路径

    转载请标明出处:https://www.cnblogs.com/tangZH/p/9939655.html  1.同个文件夹访问 D:\Java\main\A.java D:\Java\main\B. ...

  5. JavaWeb 消息总线框架 Saka V0.0.1 发布

    端午闲着无聊,自己撸了一个简单的框架,可以实现在使用SendClient发送消息,在Spring容器中,符合该消息机制的接收器将能够被执行,目前Saka处于0.0.1版本[Saka-GIthub地址( ...

  6. 【Linux】【Java】CentOS7安装最新版Java1.8.191运行开发环境

    1.前言 本来在写[Linux][Apatch Tomcat]安装与运行.都快写完了. 结果...我忘记安装 Java 环境 然后...新开了博客编辑页面. 最后...我的那个没了...没了...真的 ...

  7. 详解MongoDB中的多表关联查询($lookup)

    一.  聚合框架 聚合框架是MongoDB的高级查询语言,它允许我们通过转换和合并多个文档中的数据来生成新的单个文档中不存在的信息. 聚合管道操作主要包含下面几个部分: 命令 功能描述 $projec ...

  8. CGI 、FastCGI、PHP-CGI、PHP-FPM 定义以及与nginx的应用关系

    CGI common gateway interface,简称cgi,简而言之就是一个接口,一种协议.它的作用就是帮助服务器与语言通信. 这里以nginx和php为例,因为nginx和php的语言不通 ...

  9. 微信小程序发红包

    背景: 近期一个朋友公司要做活动,活动放在小程序上.小程序开发倒是不难,不过要使用小程序给微信用户发红包,这个就有点麻烦 确定模式: 小程序目前没有发红包接口,要实现的话,只能是模拟红包,即小程序上做 ...

  10. Flex Builder 4.6切换语言

    一.修改Flex builder 1.用无格式编辑器打开FlashBuilder.ini 2.把zh_CN替换成"en_US" 二.修改MyEclipse插件 1.用无格式编辑器打 ...