Working with PDF files in C# using PdfBox and IKVM
I have found two primary libraries for programmatically manipulating PDF files; PdfBox and iText. These are both Java libraries, but I needed something I could use with C#. Well, as it turns out there is an implementation of each of these libraries for .NET, each with its own strengths and weaknesses.
Some Navigation Links:
- Example: Extract Text from PDF File
- Example: Split PDF
- Split Specific Pages and Merge Into New File
- Links to Resources Mentioned in this Article
PdfBox - .NET version
The .NET implementation of PdfBox is not a direct port - rather, it uses IKVM to run the Java version inter-operably with .NET. IKVM features an actual .NET implementation of a Java Virtual Machine, and a .net implementation of Java Class Libraries along with tools which enable Java and .NET interoperability.
PdfBox’s dependency on IKVK incurs a lot of baggage in performance terms. When the IKVM libraries load, and (I am assuming) the “’Virtual’ Java Virtual Machine” spins up, things slow way down until the load is complete. On the other hand, for some of the more common things one might want to do with a PDF programmatically, the API is (relatively) straightforward, and well documented.
When you run a project which uses PdfBox, you will notice a lag the first time PdfBox and IKVM are loaded. After that, things seem to perform sufficiently, at least for what I needed to do.
Side Note: iTextSharp
iTextSharp is a direct port of the Java library to .NET.
iTextSharp looks to be the more robust library in terms of fine-grained control, and is extensively documented in a book by one of the authors of the library, iText in Action (Second Edition). However, the learning curve was a little steeper for iText, and I needed to get a project out the door. I will examine iTextSharp in another post, because it looks really cool, and supposedly does not suffer the performance limitations of PdfBox.
Getting started with PdfBox
Before you can use PdfBox, you need to either build the project from source, or download the ready-to-use binaries. I just downloaded the binaries for version 1.2.1 from this helpful gentleman’s site, which, since they depend on IKVM, also includes the IKVM binaries. However, there are detailed instruction for building from source on the PdfBox site. Personally, I would start with the downloaded binaries to see if PdfBox is what you want to use first.
Important to note here: apparently, the PdfBox binaries are dependent upon the exact dependent DLL’s used to build them. See the notes on the PdfBox .Net Version page.
Once you have built or downloaded the binaries, you will need to set references to PdfBox and ALL the included IKVM binaries in your Visual Studio Project. Create a new Visual Studio project named “PdfBoxExamples” and add references to ALL the PdfBox and IKVM binaries. There are a lot.Deal with it. Your project references folder will look like the picture to the right when you are done.
The PdfBox API is quite dense, but there is a handy reference at the Apache Pdfbox site. The PDF file format is complex, to say the least, so when you first take a gander at the available classes and methods presented by the PDF box API, it can be difficult to know where to begin. Also, there is the small issue that what you are looking at is a Java API, so some of the naming conventions are a little different. Also, the PdfBox API often returns what appear to be Java classes. This comes back to that .Net implementation of the Java Class libraries I mentioned earlier.
Things to Do with PdfBox
It seems like there are three common things I often want to do with PDF files: Extract text into a string or text file, split the document into one or more parts, or merge pages or documents together. To get started with using PdfBox we will look at extracting text first, since the set up for this is pretty straightforward, and there isn’t any real Java/.Net weirdness here.
Extracting Text from a PDF File
To do this, we will call upon two PdfBox namespaces (“Packages” in Java, loosely), and two Classes:
The namespace org.apache.pdfbox.pdmodel gives us access to the PDDocument class and the namespace org.apache.pdfbox.util gives us the PDFTextStripper class.
In your new PdfBoxExamples project, add a new class, name it “PdfTextExtractor" and add the following code:
The PdfTextExtractor Class
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util; namespace PdfBoxExamples
{
public class pdfTextExtractor
{
public static String PDFText(String PDFFilePath)
{
PDDocument doc = PDDocument.load(PDFFilePath);
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(doc);
}
}
}
As you can see, we use the PDDocument class (from the org.apache.pdfbox.pdmodel namespace) and initialize is using the static .load method defined as a class member on PDDocument. As long as we pass it a valid file path, the .load method will return an instance of PDDocument, ready for us to work with.
Once we have the PDDocument instance, we need an instance of the PDFTextStripper class, from the namespace org.apache.pdfbox.util. We pass our instance of PDDocument in as a parameter, and get back a string representing the text contained in the original PDF file.
Be prepared. PDF documents can employ some strange layouts, especially when there are tables and/or form fields involved. The text you get back will tend not to retain the formatting from the document, and in some cases can be bizarre.
However, the ability to strip text in this manner can be very useful, For example, I recently needed to download an individual PDF file for each county in the state of Missouri, and strip some tabular data our of each one. I hacked together an iterator/downloader to pull down the files, and the, using a modified version of the text stripping tool illustrated above and some rather painful Regex, I was able to get what I needed.
Splitting the Pages of a PDF File
At the simplest level, suppose you had a PDF file and you wanted to split it into individual pages. We can use theSplitter class, again from the org.apache.pdf.util namespace. Add another class to you project, namedPDFFileSplitter, and copy the following code into the editor:
The PdfFileSplitter Class
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util; namespace PdfBoxExamples
{
public class PDFFileSplitter
{
public static java.util.List SplitPDFFile(string SourcePath,
int splitPageQty = 1)
{
var doc = PDDocument.load(SourcePath);
var splitter = new Splitter();
splitter.setSplitAtPage(splitPageQty); return (java.util.List)splitter.split(doc);
}
}
}
Notice anything strange in the code above? That’s right. We have declared a static method with a return type of java.util.List. WHAT? This is where working with PdfBox and more importantly, IKVM becomes weird/cool. Cool, because I am using a direct Java class implementation in Visual Studio, in my C# code. Weird, because my method returns a bizarre type (from a C# perspective, anyway) that I was unsure what to do with.
I would probably add to the above class so that the splitter persisted the split documents to disk, or change the return type of my method to object[], and use the .ToArray() method, like so:
The PdfFileSplitter Class (improved?)
public static object[] SplitPDFFile(string SourcePath,
int splitPageQty = 1)
{
var doc = PDDocument.load(SourcePath);
var splitter = new Splitter();
splitter.setSplitAtPage(splitPageQty); return (object[])splitter.split(doc).toArray();
}
In any case, the code in either example loads up the specified PDF file into a PDDocument instance, which is then passed to the org.apache.pdfbox.Splitter, along with an int parameter. The output in the example above is a Java ArrayList containing a single page from your original document in each element. Your original document is not altered by this process, by the way.
The int parameter is telling the Splitter how many pages should be in each split section. In other words, if you start with a six-page PDF file, the output will be three two-page files. If you started with a 5-page file, the output would be two two-page files and one single-page file. You get the idea.
Extract Multiple Pages from a PDF Into a New File
Something slightly more useful might be a method which accepts an array of integers as a parameter, with each integer representing a page number within a group to be extracted into a new, composite document. For example, say I needed pages 1, 6, and 7 from a 44 page PDF pulled out and merged into a new document (in reality, I needed to do this for pages 1, 6, and 7 for each of about 200 individual documents). We might add a method to our PdfFileSplitter class as follows:
The ExtractToSingleFile Method
public static void ExtractToSingleFile(int[] PageNumbers,
string sourceFilePath, string outputFilePath)
{
var originalDocument = PDDocument.load(sourceFilePath);
var originalCatalog = originalDocument.getDocumentCatalog();
java.util.List sourceDocumentPages = originalCatalog.getAllPages();
var newDocument = new PDDocument(); foreach (var pageNumber in PageNumbers)
{
// Page numbers are 1-based, but PDPages are contained in a zero-based array:
int pageIndex = pageNumber - 1;
newDocument.addPage((PDPage)sourceDocumentPages.get(pageIndex));
}
newDocument.save(outputFilePath);
}
Below is a simple example to illustrate how we might call this method from a client:
Calling the ExtractToSingleFile Method:
public void ExtractAndMergePages()
{
string sourcePath = @"C:\SomeDirectory\YourFile.pdf";
string outputPath = @"C:\SomeDirectory\YourNewFile.pdf";
int[] pageNumbers = { 1, 6, 7 };
PDFFileSplitter.ExtractToSingleFile(pageNumbers, sourcePath, outputPath);
}
Limit Class Dependency on PdfBox
It is always good to limit dependencies within a project. In this case, especially, I would want to keep those odd Java class references constrained to the highest degree possible. In other words, where possible, I would attempt to either return standard .net types from my classes which consume the PdfBox API, or otherwise complete execution so that client code calling upon this class doesn’t need to be aware of IKVM, or funky C#/Java hybrid types.
Or, I would build out my own “PdfUtilities” library project, within which objects are free to depend upon and intermix this Java hybrid. However, I would make sure public methods defined within the library itself accepted and returned only standard C# types.
In fact, that is precisely what I am doing, and I’ll look at that in a following post.
Links to resources:
- PdfBox.Net version on Apache
- IKVM.net Home Page
- Pdfbox .net binaries version 1.2.1 (includes required IKVM binaries)
- PdfBox API Overview
Working with PDF files in C# using PdfBox and IKVM的更多相关文章
- “Stamping” PDF Files Downloaded from SharePoint 2010
http://blog.falchionconsulting.com/index.php/2012/03/stamping-pdf-files-downloaded-from-sharepoint-2 ...
- How to create PDF files in a Python/Django application using ReportLab
https://assist-software.net/blog/how-create-pdf-files-python-django-application-using-reportlab CONT ...
- Base64转PDF、PDF转IMG(使用pdfbox插件)
--添加依赖 <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox --><dependency> ...
- Csharp:user WebControl Read Adobe PDF Files In Your Web Browser
namespace GeovinDu.PdfViewer { [DefaultProperty("FilePath")] [ToolboxData("<{0}:Sh ...
- C# based on PdfSharp to split pdf files and get MemoryStream C#基于PdfSharp拆分pdf,并生成MemoryStream
install-package PdfSharp -v 1.51.5185-beta using System; using PdfSharp.Pdf; using System.IO; using ...
- PHP class which generates PDF files from UTF-8 encoded HTML
http://www.mpdf1.com/mpdf/index.php
- Converting PDF to Text in C#
Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code). Download source files - 82 kB [c ...
- 操作PDF文档功能的相关开源项目探索——iTextSharp 和PDFBox
原文 操作PDF文档功能的相关开源项目探索——iTextSharp 和PDFBox 很久没自己写写心得日志与大家分享了,一方面是自己有点忙,一方面是自己有点懒,没有及时总结.因为实践是经验的来源,总结 ...
- 常用PDF文档开发库
C++库: 1,PDF类库 PoDoFo http://podofo.sourceforge.net/ PoDoFo 是一个用来操作 PDF 文件格式的 C++ 类库.它还包含一些小工具用来解析 ...
随机推荐
- Mybatis框架基础支持层——反射工具箱之MetaClass(7)
简介:MetaClass是Mybatis对类级别的元信息的封装和处理,通过与属性工具类的结合, 实现了对复杂表达式的解析,实现了获取指定描述信息的功能 public class MetaClass { ...
- 20190328-CSS样式一:字体样式font-、文本样式text-、背景图样式background-
目录 CSS参考手册:http://css.doyoe.com/ 1.字体简写:font:font-style || font-variant || font-weight || font-size ...
- Dynamics 365中的非交互式账号(Non-interactive User)介绍
摘要: 本人微信和易信公众号: 微软动态CRM专家罗勇 ,回复272或者20180616可方便获取本文,同时可以在第一间得到我发布的最新的博文信息,follow me!我的网站是 www.luoyon ...
- 记录自己使用GitHub的点点滴滴
前言 现在大多数开发者都有自己的GitHub账号,很多公司也会以是否有GitHub作为一项筛选简历以及人才的选项了,可见拥有一个GitHub账号的重要性,本文就从最基本的GitHub账号的注册到基本的 ...
- 使用docker部署skywalking
使用docker部署skywalking Intro 之前在本地搭建过一次 skywalking + elasticsearch ,但是想要迁移到别的机器上使用就很麻烦了,于是 docker 就成了很 ...
- Android Studio教程08-与其他app通信
目录 1.向另外一个应用发送用户 1.1. 构建隐含Intent 1.2. 验证是否存在接收Intent的应用 1.3. 启动具有Intent的Activity 2. 获取Activity的结果响应 ...
- 简单理解Java的反射
反射(reflect): JAVA反射机制是在运行状态中,对于任意一个实体类,都能够知道这个类的所有属性和方法:对于任意一个对象,都能够调用它的任意方法和属性:这种动态获取信息以及动态调用对象方法的功 ...
- 关于联想笔记本ThinkPad E470 没有外音 插耳机却有声音的解决办法
碰到这种情况,小编和大家一样选择设备管理器,找到声卡驱动卸载重新装,结果很失望,选择驱动精灵/联想驱动重装声卡,结果很绝望.并没有解决问题. 最后小编参考了一篇文章找到了解决办法 到联想官方网站服务界 ...
- Django--用户认证组件auth(登录用-依赖session,其他用)
一.用户认证组件auth介绍 二.auth_user表添加用户信息 三.auth使用示例 四.auth封装的认证装饰器 一.用户认证组件auth介绍 解决的问题: 之前是把is_login=True放 ...
- element 关闭弹窗时清空表单信息
关闭弹窗时清空表单信息: // 弹框关闭时清空信息 closeDialog () { this.$nextTick(() => { this.$refs['createModelForm'].c ...