Converting PDF to Text in C#
- Download source files - 82 kB [codeproject.com]
- Download full project including all dependencies [squarepdf.net]
Update
April 20, 2015: The article and the Visual Studio project are updated and work with the latest PDFBox version (1.8.9). It's also possible to download the project with all dependencies (resolving the dependencies proved to be a bit tricky).
February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.
How to Parse PDF Files
There are several main methods for extracting text from PDF files in .NET:
- Microsoft IFilter interface and Adobe IFilter implementation.
- iTextSharp
- PDFBox
None of these PDF parsing solutions is perfect. We will discuss all these methods below.
1. Parsing PDF using Adobe PDF IFilter
In order to parse PDF files using IFilter interface you need the following:
- Windows 2000 or later
- Adobe Acrobat or Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])
- IFilter COM wrapper class [dotlucene.net]
Sample code:
using IFilter;
// ...
public static string ExtractTextFromPdf(string path) {
return DefaultParser.Extract(path);
}
Download a sample project:
- Parsing PDF Files using IFilter [squarepdf.net]
If you are using the PDF IFilter that comes with Adobe Acrobat Reader you will need to rename the process to "filtdump.exe" otherwise the IFilter interface will return E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].
Disadvantages:
- Using unreliable COM interop that handles IFilter interface (and the combination of IFilter COM and Adobe PDF IFilter is especially troublesome).
- A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
- You have to use "filtdump.exe" file name for your application with the latest PDF IFilter implementation that comes with Acrobat Reader.
2. Parsing PDF using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating and not reading PDFs but it supports extracting text from PDF as well.
Sample code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser; // ... public static string ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder(); for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
} return text.ToString();
}
}
Credit: Member 10364982
Download a sample project:
- Parsing PDF Files using iTextSharp [squarepdf.net]
You may consider using LocationTextExtractionStrategy to get better precision.
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy(); using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder(); for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string[] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
}
return text.ToString();
}
}
Credit: Member 10140900
Disadvantages of iTextSharp:
- Licensing if you are not happy with AGPL license
3. Parsing PDF using PDFBox
PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).
Using PDFBox in .NET requires adding references to:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
and copying the following files the bin directory:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
Using the PDFBox to parse PDFs is fairly easy:
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util; // ... private static string ExtractTextFromPdf(string path)
{
PDDocument doc = null;
try {
doc = PDDocument.load(path)
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(doc);
}
finally {
if (doc != null) {
doc.close();
}
}
}
Download a sample project:
- How to convert PDF files to text in C# (.NET) [squarepdf.net]
- How to convert PDF file to text in VB (.NET) [squarepdf.net]
The size of the required assemblies adds up to almost 18 MB:
- IKVM.OpenJDK.Core.dll (4 MB)
- IKVM.OpenJDK.SwingAWT.dll (6 MB)
- pdfbox-1.8.9.dll (4 MB)
- commons-logging.dll (82 kB)
- fontbox-1.8.9.dll (180 kB)
- IKVM.OpenJDK.Text.dll (800 kB)
- IKVM.OpenJDK.Util.dll (2 MB)
- IKVM.Runtime.dll (1 MB)
The speed is not so bad: Parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.
Thanks to bobrien100 for improvements suggestions.
Disadvantages:
- IKVM.NET Dependencies (18 MB)
- Speed (especially the IKVM.NET warm-up time)
Related information
- See this article (with future updates) at SquarePDF.NET.
History
- April 20, 2015 - Updated to work with the latest PDFBox release (1.8.9)
- November 27, 2014 - Updated to work with the latest PDFBox release (1.8.7)
- March 10, 2014 - IFilter file name limitations added, iTextSharp sample extended
- February 27, 2014 - Samples for IFilter and iTextSharp added.
- February 24, 2014 - Updated to work with the latest PDFBox release (1.8.4)
- June 20, 2012 - Updated to work with the latest PDFBox release (1.7.0)
Converting PDF to Text in C#的更多相关文章
- C#,VB.NET如何将Word转换为PDF和Text
众所周知,Word是我们日常工作中常用的办公软件之一,有时出于某种需求我们需要将Word文档转换为PDF以及Text.那么如何以C#,VB.NET编程的方式来实现这一功能呢? 下面我将分开介绍如何运用 ...
- .net操作PDF的一些资源(downmoon收集)
因为业务需要,搜集了一些.net操作pdf的一些资源,特在此分享. 1.如何从 Adobe 可移植文档格式 (PDF) 文件中复制文本和图形 http://support.microsoft.com/ ...
- Code Project精彩系列(转)
Code Project精彩系列(转) Code Project精彩系列(转) Applications Crafting a C# forms Editor From scratch htt ...
- 【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States
Problem(Abstract) When converting contents from a file or string using WebSphere Application Server, ...
- ASP.Net MVC——使用 ITextSharp 完美解决HTML转PDF(中文也可以)
前言: 最近在做老师交代的一个在线写实验报告的小项目中,有这么个需求:把学生提交的实验报告(HTML形式)直接转成PDF,方便下载和打印. 以前都是直接用rdlc报表实现的,可这次牵扯到图片,并且更为 ...
- nodejs将PDF文件转换成txt文本,并利用python处理转换后的文本文件
目前公司Web服务端的开发是用Nodejs,所以开发功能的话首先使用Nodejs,这也是为什么不直接用python转换的原因. 由于node对文本的处理(提取所需信息)的能力不强,类似于npm上的包: ...
- [转]Display PDF within web browser using MVC3
本文转自:http://www.codeproject.com/Tips/697733/Display-PDF-within-web-browser-using-MVC Introduction I ...
- PDF2SWF转换只有一页的PDF文档,在FlexPaper不显示解决方法
问题:PDF2SWF转换只有一页的PDF文档,在FlexPaper不显示! FlexPaper 与 PDF2SWF 结合是解决在线阅读PDF格式文件的问题的,多页的PDF文件转换可以正常显示,只有一页 ...
- PDF转WORD工具 Solid Converter PDF v9.1.6744
Solid Converter PDF中文破解版(pdf转换成word转换器)是一款功能强大的PDF格式转换软件.Solid Converter PDF允许用户将PDF转换为Word(PDF to W ...
随机推荐
- 学习前端笔记1(HTML)
(注:此文是在看过许多学习资料和视频之后,加上自身理解拼凑而成,仅作学习之用.若有版权问题,麻烦及时联系) 标准页面结构: HTML发展历史: 注:每一种HTML需要有对应的doctype声明. H ...
- 后端开发者的Vue学习之路(五)
目录 上节内容回顾 使用第三方组件库 如何发起请求 请求错误处理 请求带参 以get的方式带参: 以post的方式带参: 封装处理 请求的配置 axios实例 实现调用自定义函数来发起请求 抽取axi ...
- Catalan卡特兰数入门
简介 卡特兰数是组合数学中的一种常见数列 它的前几项为: 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, ...
- android FragmentTabhost导航分页
基本模板 public class MainActivity extends FragmentActivity { private FragmentTabHost mTabHost; private ...
- Android PAI (PlayAutoInstall)预装APK 功能
最近刚找到工作,是手机方案公司,刚接触手机系统预装的APP,以及解决方案MTK平台下预装APP的bug,也接触到了Launcher的东西. 然后接触到了第一个需求 PAI预装APK功能 下面是我用到的 ...
- ASP.NET没有魔法——ASP.NET MVC IoC代码篇
上一篇文章主要以文字的形式介绍了IoC及其在ASP.NET MVC中的使用,本章将从以下几点介绍如何使用代码在ASP.NET MVC中实现依赖注入: ● AutoFac及安装 ● 容器的创建 ● 创建 ...
- linux内核数据结构之kfifo【转】
1.前言 最近项目中用到一个环形缓冲区(ring buffer),代码是由linux内核的kfifo改过来的.缓冲区在文件系统中经常用到,通过缓冲区缓解cpu读写内存和读写磁盘的速度.例如一个进程A产 ...
- spring boot 中使用 jpa以及jpa介绍
1.什么是jpa呢?JPA顾名思义就是Java Persistence API的意思,是JDK 5.0注解或XML描述对象-关系表的映射关系,并将运行期的实体对象持久化到数据库中.12.jpa具有什么 ...
- poi包冲突问题(excel)
1. 所需jar包 涉及的poi (1)poi-3.14.jar (HSSF) 依赖:commons-logging-1.2.jar.log4j-1.2.17.jar.commons-codec.1 ...
- sklearn使用——梯度下降及逻辑回归
一:梯度下降: 梯度下降本质上是对极小值的无限逼近.先求得梯度,再取其反方向,以定步长在此方向上走一步,下次计算则从此点开始,一步步接近极小值.需要注意的是步长的取值,如果过小,则需要多次迭代,耗费大 ...