Introduction to the pdfminer API: parsing PDFs collected by a web crawler

Installation: pip install pdfminer (on Python 3, install the maintained fork instead: pip install pdfminer.six)

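Crawling data is the first stage of a data analysis project. Some of that data is packaged (or encrypted) as PDF files, which have to be parsed after they are downloaded. pdfminer is a tool for that job.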
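The download step itself is not shown in this post. As a minimal sketch (Python 3, standard library only), it might look like the following. The function names `looks_like_pdf` and `download_pdf` are hypothetical helpers, not part of pdfminer, and the header check is deliberately strict: it assumes the file begins with the `%PDF-` marker.

```python
import urllib.request


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header marker."""
    return data[:5] == b"%PDF-"


def download_pdf(url, dest_path, timeout=30):
    """Download url to dest_path; raise ValueError if the body is not a PDF.

    Hypothetical helper for illustration, not part of pdfminer.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    # Crawlers often get an HTML error page instead of the file they asked for
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF: %r" % data[:20])
    with open(dest_path, "wb") as f:
        f.write(data)
    return dest_path
```

Checking the header before saving keeps HTML error pages out of the folder that is later handed to the parser.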

First, a brief introduction to pdfminer.

The official description reads:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

We will learn how to use it mainly through two examples.

Example 1:

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf

(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf

(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf

(extract a text from an encrypted PDF file)

Options:

 -o filename

    Specifies the output file name. By default, it prints the extracted contents to stdout in text format.

-p pageno[,pageno,...]

    Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages.

-c codec

    Specifies the output codec.

-t type

    Specifies the output format. The following formats are currently supported.

        text : TEXT format. (Default)

        html : HTML format. Not recommended for extraction purposes because the markup is messy.

        xml : XML format. Provides the most information.

        tag : "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF"). 

-I image_directory

    Specifies the output directory for image extraction. Currently only JPEG images are supported.

-M char_margin
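When many downloaded files need converting, the pdf2txt.py options listed above can be driven from Python. The sketch below only uses flags that appear in this post (-o, -p, -c, -t, -P). The helper names are hypothetical, and it assumes pdf2txt.py is on the PATH and directly executable (on Windows you may need to invoke it through the Python interpreter instead).

```python
import subprocess


def build_pdf2txt_args(pdf_path, output=None, pages=None, codec=None,
                       fmt=None, password=None):
    """Build the pdf2txt.py argument list from the options described above."""
    args = ["pdf2txt.py"]
    if output:
        args += ["-o", output]
    if pages:
        # -p takes a comma-separated list; page numbers start at one
        args += ["-p", ",".join(str(p) for p in pages)]
    if codec:
        args += ["-c", codec]
    if fmt:
        # One of: text, html, xml, tag
        args += ["-t", fmt]
    if password:
        args += ["-P", password]
    args.append(pdf_path)
    return args


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py; raises CalledProcessError on a non-zero exit code."""
    return subprocess.check_call(build_pdf2txt_args(pdf_path, **options))
```

For instance, `build_pdf2txt_args("secret.pdf", output="output.txt", password="mypassword")` produces arguments equivalent to the third command in Example 1.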

Example 2:

$ dumppdf.py -a foo.pdf

(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf

(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg

(extract a JPEG image)

Options:

 -a

    Instructs to dump all the objects. By default, it only prints the document trailer (like a header).

-i objno,objno, ...

    Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.

-p pageno,pageno, ...

    Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.

-r (raw)

-b (binary)

-t (text)

    Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified.

    With -r option, the "raw" stream contents are dumped without decompression. With -b option, the decompressed contents are dumped as a binary blob. With -t option, the decompressed contents are dumped in a text format, similar to repr() manner. When -r or -b option is given, no stream header is displayed for the ease of saving it to a file.

-T

    Shows the table of contents.
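The shell redirection in `dumppdf.py -r -i6 foo.pdf > pic.jpeg` can be reproduced from Python by sending the process output to a file. This is a rough sketch under the same assumption as before (dumppdf.py is on the PATH and executable); `dumppdf_stream_args` and `save_stream` are hypothetical names.

```python
import subprocess


def dumppdf_stream_args(pdf_path, object_ids, mode="-r"):
    """Build a dumppdf.py command that dumps the given stream objects.

    mode is one of the stream output flags described above:
    -r (raw), -b (binary) or -t (text).
    """
    if mode not in ("-r", "-b", "-t"):
        raise ValueError("mode must be -r, -b or -t")
    # -i accepts comma-separated object IDs
    ids = ",".join(str(i) for i in object_ids)
    return ["dumppdf.py", mode, "-i", ids, pdf_path]


def save_stream(pdf_path, object_id, out_path, mode="-r"):
    """Write one stream object to out_path, like '> pic.jpeg' in the shell."""
    with open(out_path, "wb") as out:
        subprocess.check_call(
            dumppdf_stream_args(pdf_path, [object_id], mode), stdout=out)
```

Because no stream header is printed with -r or -b, the saved file contains only the stream contents.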

Writing your own PDF parsing script:

# -*- coding: utf-8 -*-
import io

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# The with block closes the PDF file even if parsing fails
with open('PDF/1202268749.pdf', 'rb') as fp:
    # Create a PDF parser bound to the file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the file allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create a PDF resource manager to store shared resources
    rsrcmgr = PDFResourceManager()
    # Set the parameters for layout analysis
    laparams = LAParams()
    # Create a PDF device that aggregates each page into a layout object
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Open the output once; io.open with an encoding accepts unicode text
    # on both Python 2 and Python 3
    with io.open('a.html', 'a', encoding='utf-8') as f:
        # Process every page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    f.write(x.get_text() + u'\n')
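The official description mentions that pdfminer can report the exact location of text on a page. As a small sketch of how that could be used with the layout object returned by device.get_result() in the script above: each layout item carries a bbox attribute (x0, y0, x1, y1), and PDF coordinates start at the bottom-left corner of the page. The helper names `page_text_boxes` and `reading_order` are hypothetical, and the sort is a simple heuristic that ignores multi-column layouts.

```python
def page_text_boxes(layout):
    """Collect (x0, y0, x1, y1, text) for each horizontal text box on a page."""
    # Imported here so reading_order() stays usable without pdfminer installed
    from pdfminer.layout import LTTextBoxHorizontal
    return [tuple(item.bbox) + (item.get_text(),)
            for item in layout
            if isinstance(item, LTTextBoxHorizontal)]


def reading_order(boxes):
    """Sort boxes top-to-bottom, then left-to-right.

    The origin is at the bottom-left of the page, so a larger y1 (top edge)
    means the box sits higher on the page.
    """
    return sorted(boxes, key=lambda b: (-b[3], b[0]))
```

Inside the page loop this would be called as `reading_order(page_text_boxes(layout))`, and the resulting tuples can be written out instead of the raw text.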

References:

pdfminer official site: http://www.unixuser.org/~euske/python/pdfminer/index.html

http://www.cnblogs.com/RoundGirl/p/4979267.html
