Python: Splitting a PDF into Pages and Inserting Them into Word

Techniques used

  1. Python programming basics
  2. Using pyPdf
  3. Driving Word from Python
  4. Using regular expressions
  5. Windows .bat scripting

Below is an example of using the pyPdf library:

    # Note: pyPdf targets Python 2 (print statement, file() builtin)
    from pyPdf import PdfFileWriter, PdfFileReader

    output = PdfFileWriter()
    input1 = PdfFileReader(file("document1.pdf", "rb"))

    # add page 1 from input1 to output document, unchanged
    output.addPage(input1.getPage(0))

    # add page 2 from input1, but rotated clockwise 90 degrees
    output.addPage(input1.getPage(1).rotateClockwise(90))

    # add page 3 from input1, rotated the other way:
    output.addPage(input1.getPage(2).rotateCounterClockwise(90))
    # alt: output.addPage(input1.getPage(2).rotateClockwise(270))

    # add page 4 from input1, but first add a watermark from another pdf:
    page4 = input1.getPage(3)
    watermark = PdfFileReader(file("watermark.pdf", "rb"))
    page4.mergePage(watermark.getPage(0))
    output.addPage(page4)

    # add page 5 from input1, but crop it to half size:
    page5 = input1.getPage(4)
    page5.mediaBox.upperRight = (
        page5.mediaBox.getUpperRight_x() / 2,
        page5.mediaBox.getUpperRight_y() / 2
    )
    output.addPage(page5)

    # print how many pages input1 has:
    print "document1.pdf has %s pages." % input1.getNumPages()

    # finally, write "output" to document-output.pdf
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

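With this library, it is easy to split an existing PDF.

My requirement was to extract a keyword from each page of the PDF and use it as the file name. pyPdf can extract all of the text on a page:

    inputfile.getPage(0).extractText()

This returns a unicode object, which needs to be converted to str:

    inputfile.getPage(0).extractText().encode("utf-8")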
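To illustrate the conversion, here is a minimal sketch written for Python 3, where `str` and `bytes` play the roles that `unicode` and `str` play in the Python 2 code of this post. The sample string is made up; in the post the text comes from extractText().

```python
# Hypothetical page text; in the post it comes from extractText().
page_text = "Blattname: Übersicht"

# encode() turns text into UTF-8 bytes (Python 2: unicode -> str).
encoded = page_text.encode("utf-8")

# decode() reverses the step and restores the original text.
decoded = encoded.decode("utf-8")

print(type(encoded).__name__, len(page_text), len(encoded))
```

The byte string is longer than the text because non-ASCII characters such as "Ü" take more than one byte in UTF-8.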

The keyword on each page is then extracted with the following function:

    import re

    p_sheetName = re.compile(r'Blattname: (.+?)project')

    def getSheetName(text):
        # return the text between "Blattname: " and "project", or None if absent
        m = p_sheetName.search(text)
        if m:
            return m.group(1)
        else:
            return None
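As a quick check of how the pattern behaves, here is a small Python 3 sketch. The sample text is hypothetical (real extracted pages will look different), and `get_sheet_name` is just an illustrative name for the same logic.

```python
import re

# Same pattern as in the post: capture lazily up to the first "project".
SHEET_NAME = re.compile(r"Blattname: (.+?)project")

def get_sheet_name(text):
    """Return the captured sheet name, or None when the page has no match."""
    match = SHEET_NAME.search(text)
    return match.group(1) if match else None

# Hypothetical extracted text; real pages will differ.
sample = "Blattname: Motor controlproject: Plant 7 project: rev B"
print(get_sheet_name(sample))
print(get_sheet_name("no keyword on this page"))
```

The lazy quantifier `.+?` matters here: it stops at the first "project" after the name, whereas a greedy `.+` would run on to the last "project" in the page text.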

The final code is as follows:

    from pyPdf import PdfFileWriter, PdfFileReader
    import re, os

    p_sheetName = re.compile(r'Blattname: (.+?)project')

    def getSheetName(text):
        m = p_sheetName.search(text)
        if m:
            return m.group(1)
        else:
            return None

    def splitpdf(srcFile):
        input1 = file(srcFile, "rb")
        inputfile = PdfFileReader(input1)
        numofpages = inputfile.getNumPages()
        print "pages: %d" % numofpages

        # create a directory named after the source file (without extension)
        folderName, ext_ = os.path.splitext(srcFile)
        if not os.path.isdir(folderName):
            os.makedirs(folderName)

        for page_index in range(1, numofpages + 1):
            # write each page into its own single-page PDF
            page = inputfile.getPage(page_index - 1)
            output = PdfFileWriter()
            output.addPage(page)
            sheetName = getSheetName(page.extractText().encode("utf-8"))

            # save file as "<page number> <sheet name>.pdf"
            saveFileName = os.path.join(folderName, "%d %s.pdf" % (page_index, sheetName))
            print saveFileName
            outputFile = file(saveFileName, "wb")
            output.write(outputFile)
            outputFile.close()

        input1.close()

    splitpdf("E:\\test.pdf")
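The way splitpdf derives the output folder and file names can be tried in isolation. The sketch below is Python 3 and uses `ntpath` so that Windows path rules apply on any operating system (on Windows, `os.path` is `ntpath`). The helper name `page_file_name` and the sheet name "Overview" are assumptions for illustration only.

```python
import ntpath  # Windows path rules on any OS; os.path is ntpath on Windows

def page_file_name(src_file, page_index, sheet_name):
    """Build the per-page output path the same way splitpdf does."""
    # Dropping the extension gives the output folder, e.g. E:\test
    folder, _ext = ntpath.splitext(src_file)
    return ntpath.join(folder, "%d %s.pdf" % (page_index, sheet_name))

# "Overview" is a made-up sheet name for illustration.
print(page_file_name("E:\\test.pdf", 1, "Overview"))
# When no keyword was found, the name contains the literal text "None".
print(page_file_name("E:\\test.pdf", 2, None))
```

Note the second call: because getSheetName returns None when the pattern does not match, such a page ends up in a file whose name contains "None".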

The next step is to pass the PDF path in as a command-line parameter:

    from pyPdf import PdfFileWriter, PdfFileReader
    import re, sys, os, string

    def translator(frm='', to='', delete='', keep=None):
        # Build a function that maps frm -> to and removes the characters in
        # delete (or, if keep is given, everything except keep).
        if len(to) == 1:
            to = to * len(frm)
        trans = string.maketrans(frm, to)
        if keep is not None:
            allchars = string.maketrans('', '')
            delete = allchars.translate(allchars, keep.translate(allchars, delete))
        def translate(s):
            return s.translate(trans, delete)
        return translate

    # strip characters that are invalid in Windows file names
    delete_some_special = translator(delete="/:\\?*><|")

    p_sheetName = re.compile(r'Blattname: (.+?)project')

    def getSheetName(text):
        m = p_sheetName.search(text)
        if m is None:
            # no keyword on this page; fall back to an empty name
            return ''
        return delete_some_special(m.group(1))

    def splitpdf(srcFile):
        try:
            folderName, ext_ = os.path.splitext(srcFile)
            if ext_.lower() != '.pdf':
                raise Exception(os.path.basename(srcFile) + " is not pdf!")

            input1 = file(srcFile, "rb")
            inputfile = PdfFileReader(input1)
            numofpages = inputfile.getNumPages()
            print "pages: %d" % numofpages

            # new directory
            if not os.path.isdir(folderName):
                os.makedirs(folderName)

            for page_index in range(1, numofpages + 1):
                output = PdfFileWriter()
                output.addPage(inputfile.getPage(page_index - 1))
                sheetName = getSheetName(inputfile.getPage(page_index - 1).extractText().encode("utf-8"))

                # save file
                saveFileName = os.path.join(folderName, "%d %s.pdf" % (page_index, sheetName))
                print saveFileName
                outputFile = file(saveFileName, "wb")
                output.write(outputFile)
                outputFile.close()

            input1.close()
            print "Split success!"
            print "please find them at " + folderName
        except Exception, e:
            print e

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            print 'usage: %s filename' % os.path.basename(sys.argv[0])
            sys.exit(1)
        splitpdf(sys.argv[1])

The translator function here filters special characters out of the keyword, because they can cause an error when the new file is created.
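For comparison, the same filtering can be sketched in Python 3, where `string.maketrans` no longer exists and `str.maketrans` takes the characters to delete as its third argument. This is only an illustrative equivalent of the delete-only use of translator, not the code used in this post; the names `FORBIDDEN`, `DELETE_TABLE` and `delete_special` and the sample sheet name are made up.

```python
# Characters the post strips because Windows rejects them in file names.
FORBIDDEN = "/:\\?*><|"

# str.maketrans with a third argument maps each listed character to None,
# so translate() deletes it.
DELETE_TABLE = str.maketrans("", "", FORBIDDEN)

def delete_special(name):
    """Return name with every character in FORBIDDEN removed."""
    return name.translate(DELETE_TABLE)

# Made-up sheet name for illustration.
print(delete_special("Pump A/B: start?"))
```

Building the table once and reusing it mirrors the closure returned by translator: the setup cost is paid a single time, and each page name is cleaned with one translate() call.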

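Splitting the PDF still leaves some manual work: the pages would otherwise have to be imported into Word with VBA. I wanted to do all of this directly in Python, so I used win32com to drive Word.

Below is an example of operating Word: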

    import win32com.client
    from win32com.client import constants

    # Placeholder values (hypothetical); replace them with your own
    filenamein = r"C:\temp\in.doc"
    filenameout = r"C:\temp\out.html"
    OldStr = "old text"
    NewStr = "new text"

    # generate the type-library wrappers so that `constants` is populated
    win32com.client.gencache.EnsureDispatch('Word.Application')
    w = win32com.client.Dispatch('Word.Application')
    # Alternatively, start a separate process:
    # w = win32com.client.DispatchEx('Word.Application')

    # Run in the background: no window, no alerts
    w.Visible = 0
    w.DisplayAlerts = 0

    # Open a file
    doc = w.Documents.Open(FileName=filenamein)
    # doc = w.Documents.Add()  # or create a new document

    # Insert text
    myRange = doc.Range(0, 0)
    myRange.InsertBefore('Hello from Python!')

    # Apply a style (Select() returns nothing, so style the selection afterwards)
    myRange.Select()
    w.Selection.Style = constants.wdStyleHeading1

    # Replace text in the body
    w.Selection.Find.ClearFormatting()
    w.Selection.Find.Replacement.ClearFormatting()
    w.Selection.Find.Execute(OldStr, False, False, False, False, False, True, 1, True, NewStr, 2)

    # Replace text in the header
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.ClearFormatting()
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.Replacement.ClearFormatting()
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.Execute(OldStr, False, False, False, False, False, True, 1, False, NewStr, 2)

    # Table operations
    doc.Tables[0].Rows[0].Cells[0].Range.Text = ''
    doc.Tables[0].Rows.Add()  # add a row

    # Convert to HTML
    wc = win32com.client.constants
    w.ActiveDocument.WebOptions.RelyOnCSS = 1
    w.ActiveDocument.WebOptions.OptimizeForBrowser = 1
    w.ActiveDocument.WebOptions.BrowserLevel = 0  # constants.wdBrowserLevelV4
    w.ActiveDocument.WebOptions.OrganizeInFolder = 0
    w.ActiveDocument.WebOptions.UseLongFileNames = 1
    w.ActiveDocument.WebOptions.RelyOnVML = 0
    w.ActiveDocument.WebOptions.AllowPNG = 1
    w.ActiveDocument.SaveAs(FileName=filenameout, FileFormat=wc.wdFormatHTML)

    # Print
    doc.PrintOut()

    # Close
    # doc.Close()
    w.Documents.Close(wc.wdDoNotSaveChanges)
    w.Quit()

Following the example above, the earlier code is modified as follows:

    from pyPdf import PdfFileWriter, PdfFileReader
    import re, sys, os, string
    import win32com.client
    from win32com.client import constants

    # generate the type-library wrappers so that `constants` is populated
    win32com.client.gencache.EnsureDispatch('Word.Application')

    def translator(frm='', to='', delete='', keep=None):
        if len(to) == 1:
            to = to * len(frm)
        trans = string.maketrans(frm, to)
        if keep is not None:
            allchars = string.maketrans('', '')
            delete = allchars.translate(allchars, keep.translate(allchars, delete))
        def translate(s):
            return s.translate(trans, delete)
        return translate

    # strip characters that are invalid in Windows file names
    delete_some_special = translator(delete="/:\\?*><|")

    p_sheetName = re.compile(r'Blattname: (.+?)project')

    def getSheetName(text):
        m = p_sheetName.search(text)
        if m is None:
            # no keyword on this page; fall back to an empty name
            return ''
        return m.group(1)

    def splitPdfToWord(srcFile):
        wordApp = None  # lets the finally block run safely if Word never started
        try:
            # Word is likely to resolve relative paths against its own working
            # directory, so use an absolute path throughout
            srcFile = os.path.abspath(srcFile)
            folderName, ext_ = os.path.splitext(srcFile)
            if ext_.lower() != '.pdf':
                raise Exception(os.path.basename(srcFile) + " is not pdf!")

            input1 = file(srcFile, "rb")
            inputfile = PdfFileReader(input1)
            numofpages = inputfile.getNumPages()
            print "Total Pages: %d" % numofpages

            wordApp = win32com.client.Dispatch('Word.Application')
            wordApp.Visible = False
            wordApp.DisplayAlerts = 0
            doc = wordApp.Documents.Add()
            sel = wordApp.Selection

            # new directory
            if not os.path.isdir(folderName):
                os.makedirs(folderName)

            for page_index in range(1, numofpages + 1):
                output = PdfFileWriter()
                output.addPage(inputfile.getPage(page_index - 1))
                sheetName = getSheetName(inputfile.getPage(page_index - 1).extractText().encode("utf-8"))

                # heading for this page, using the unfiltered sheet name
                sel.Style = constants.wdStyleHeading1
                sel.TypeText("Page%d %s" % (page_index, sheetName))

                # save the single-page PDF under a file-system-safe name
                sheetName = delete_some_special(sheetName)
                saveFileName = os.path.join(folderName, "%d %s.pdf" % (page_index, sheetName))
                print "Add Page %d" % page_index
                outputFile = file(saveFileName, "wb")
                output.write(outputFile)
                outputFile.close()

                # embed the page as an OLE object, then start a new Word page;
                # the ClassType depends on the installed Acrobat version
                sel.TypeParagraph()
                sel.Style = constants.wdStyleBodyText
                sel.InlineShapes.AddOLEObject(ClassType="AcroExch.Document.11", FileName=saveFileName)
                sel.InsertBreak(Type=constants.wdPageBreak)

            input1.close()
            doc.SaveAs(folderName + ".doc")
            print "Split success!"
            print "please find them at " + folderName
            print "create word document success!"
            print "Location:" + folderName + ".doc"
        except Exception, e:
            print e
        finally:
            if wordApp is not None:
                wordApp.Quit()

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            print 'usage: %s filename' % os.path.basename(sys.argv[0])
            sys.exit(1)
        splitPdfToWord(sys.argv[1])
