Python Web Crawler Framework

The source code for this article comes from https://github.com/Holit/Web-Crawler-Framwork

I. The crawler framework code

 import os
 import threading
 import time
 import urllib.request
 from urllib.parse import urlparse

 from bs4 import BeautifulSoup

 # Input your URL here ####################################
 # urllib needs the scheme (http:// or https://) in the URL
 BaseURL = 'http://127.0.0.1/'
 #########################################################
 # Suffix appended after the page number
 TaxURL = ".html"
 # Input your data-saving path ############################
 # An empty string means the current directory
 SavePath = ""
 #########################################################
 # Input your thread count ################################
 thread_count = 1
 #########################################################
 # Set how many pages each spider thread will crawl #######
 thread_spy_count_ench = 5
 #########################################################

 def mkdir(path):
     # Create the directory; return True if it was created, False if it already existed
     path = path.strip()
     path = path.rstrip("\\")
     if not os.path.exists(path):
         os.makedirs(path)
         return True
     return False

 def download(start, count):
     # Spider main loop: crawl pages start .. start + count - 1
     save_dir = SavePath or "."
     for i in range(start, start + count):
         try:
             # DEBUG ##################################################
             # print("[INFO] Connecting to page #" + str(i) + "...")
             ##########################################################
             # Used to record time
             time_start = time.time()
             # Construct the URL.
             # This only works for sequentially numbered pages such as
             #   https://127.0.0.1/articles/00001.html
             #   https://127.0.0.1/articles/00002.html
             #   https://127.0.0.1/articles/00003.html
             # str(i) is not zero-padded; use str(i).zfill(5) for URLs like these.
             TargetURL = BaseURL + str(i) + TaxURL
             # Create the Request object
             req = urllib.request.Request(TargetURL)
             # Add general headers; you can find suitable values with Fiddler or Chrome
             req.add_header('Host', urlparse(TargetURL).netloc)  # your host, usually the base of the URL
             req.add_header('Referer', TargetURL)  # your referer, usually the URL itself
             req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19')
             # Request object finished; fetch the page and parse the HTML
             with urllib.request.urlopen(req) as res:
                 soup = BeautifulSoup(res, "html.parser")
             ##############################################################
             # Add your functions here....
             # operate_data(data)
             # Use soup.find(...) to locate the div that holds the information;
             # soup makes it easy to operate on HTML tags.
             ##############################################################
             # Change the saving path here.
             # The file name (page number + ".txt") is a placeholder; adjust as needed.
             savetarget = os.path.join(save_dir, str(i) + ".txt")
             # Try to save the file
             try:
                 # Create the directory if it doesn't exist
                 mkdir(save_dir)
                 with open(savetarget, 'w', encoding='utf-8') as f:
                     # Edit this
                     f.write("data")
             except Exception as e:
                 print("  [Failed] - #" + str(i) + " Error : " + str(e))
             else:
                 time_end = time.time()
                 print("  [Succeed] - #" + str(i) + " has saved to path.(" + str(time_end - time_start) + "s)")
         except Exception as e:
             print("  [Global Failure] - #" + str(i) + " Error : " + str(e))

 if __name__ == "__main__":
     # Multithreading
     print("Spidering website...")
     print("Current configuration:")
     print("--Will create " + str(thread_count) + " threads to access.")
     print("--Will save to " + SavePath)
     print("-------------START---------------------------")
     # Wait for the user before starting (works on Windows and Linux)
     input("Press Enter to continue...")
     threads = []
     for i in range(thread_count):
         try:
             t = threading.Thread(target=download,
                                  args=(thread_spy_count_ench * i, thread_spy_count_ench))
             t.start()
             threads.append(t)
             print("[Thread #" + str(i) + "] started successfully")
         except Exception as e:
             print("[Threading@" + str(i) + "] Error:" + str(e))
     # Wait for every worker to finish instead of spinning in a busy loop
     for t in threads:
         t.join()
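The URL construction and the per-thread page split used by the script can be tried without touching the network. The following is a minimal sketch, not part of the original repository: the helper names `page_url`, `page_ranges`, `run_workers` and `fake_download` are hypothetical, and `fake_download` only records the URLs a real `download` would request.

```python
import threading


def page_url(base_url, page, suffix=".html", width=0):
    """Build the URL of one numbered page; width > 0 zero-pads the number."""
    return base_url + str(page).zfill(width) + suffix


def page_ranges(thread_count, pages_each, first_page=0):
    """Return one (start, count) pair per worker, covering consecutive pages."""
    return [(first_page + pages_each * i, pages_each) for i in range(thread_count)]


def run_workers(worker, ranges):
    """Start one thread per (start, count) range and wait for all of them."""
    threads = [threading.Thread(target=worker, args=r) for r in ranges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


seen = []
lock = threading.Lock()


def fake_download(start, count):
    # Stand-in for download(): records the URLs it would fetch.
    for i in range(start, start + count):
        with lock:
            seen.append(page_url("http://127.0.0.1/articles/", i, width=5))


run_workers(fake_download, page_ranges(thread_count=2, pages_each=3, first_page=1))
print(sorted(seen))
```

Joining the threads lets the program exit once every range has been processed, and the `width` argument covers sites whose page numbers are zero-padded.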

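II. Putting the framework's features to use

  1. Text extraction

    Text extraction here means retrieving the content inside <div class='content'>...</div> on the page; everything below assumes this structure. If your page is structured differently, change the selector accordingly.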

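    (1) Approach

      After BeautifulSoup parses the HTML, you get the decoded document, for example

             <div class="content" style="text-align: left">
             Basic content
             </div>

      Now select this block using soup.find (or soup.find_all to get every match).

    (2) Basic code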

 # Select every element whose class is "content", anywhere in the page
 passages_set = soup.find_all(attrs={"class": "content"})
 for passages in passages_set:
     article = str(passages)
     # Text processing: strip the wrapping tag and normalise special characters
     article = article.replace('<div class="content" style="text-align: left">', '')
     article = article.replace(u'\ue505', u' ')  # handle Unicode space characters; without this, gbk cannot encode the text
     article = article.replace(u'\ue4c6', u' ')
     article = article.replace(u'\xa0', u' ')
     article = article.replace('<br/>', '\n')
     article = article.replace('</div>', '')
     # Raw string, so that \t is not read as a tab character
     savetarget = r'D:\test\test.txt'
     try:
         mkdir(r'D:\test')
         # Each passage overwrites the same file; use mode 'a' to keep them all
         with open(savetarget, 'w') as f:
             f.write(article)
     except Exception as e:
         print("  [Failed] - " + str(e))
     else:
         print("  [Succeed] - saved to path.")
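The cleanup steps can also be isolated in a small function and checked against the sample fragment without any network access. This is a sketch under assumptions: `clean_article` is a hypothetical name, and it uses regular expressions to drop the opening `<div ...>` tag whatever its attributes are, which is a slight variation on the exact-string replacement shown in the article.

```python
import re


def clean_article(fragment: str) -> str:
    """Turn the HTML of one <div class="content"> block into plain text."""
    # Drop the opening <div ...> tag regardless of its attributes
    text = re.sub(r'<div\b[^>]*>', '', fragment)
    text = text.replace('</div>', '')
    # Line breaks become newlines
    text = re.sub(r'<br\s*/?>', '\n', text)
    # Characters the article replaces because gbk cannot encode them
    for ch in ('\ue505', '\ue4c6', '\xa0'):
        text = text.replace(ch, ' ')
    return text.strip()


sample = ('<div class="content" style="text-align: left">\n'
          'Basic content<br/>second\xa0line\n'
          '</div>')
print(clean_article(sample))
```

Keeping the cleanup in one function makes it easy to call from the framework's `download` loop and to adjust when a site uses different markup.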

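  2. Image retrieval

    Images are usually retrieved by downloading the resource that the src attribute of an <img> tag points to, e.g. <img src="127.0.0.1/png.png" alt="Hello">.

    Several approaches work, urllib.request.urlretrieve being one of them; this article does not go through them one by one.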
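As one possible way to fill in this step, the sketch below collects the `src` values of `<img>` tags and saves them with `urllib.request.urlretrieve`. It is an illustration, not the author's code: `ImgSrcCollector`, `image_urls` and `download_images` are hypothetical names, and the standard library's `html.parser` is used instead of BeautifulSoup so that the example runs without extra packages (with a `soup` object, the same list could be built from `soup.find_all('img')`).

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def download_images(urls, target_dir):
    """Save each URL into target_dir with urlretrieve; return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path component as the file name
        name = os.path.basename(urlparse(url).path) or "image"
        path = os.path.join(target_dir, name)
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved


page = '<p><img src="png.png"><img src="/a/b.jpg"/></p>'
print(image_urls(page, "http://127.0.0.1/articles/1.html"))
```

Calling `download_images(image_urls(html, TargetURL), some_directory)` would then fetch the files. Relative `src` values are resolved against the page URL first, because `urlretrieve` needs an absolute URL with a scheme.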
