Python: crawling the page http://m.sohu.com and saving it

# Approach: Beautiful Soup spares us the hassle of regular expressions. Fetch
# the page, then extract its JS, CSS and images. Command-line options are
# parsed with getopt, which is convenient. Beautiful Soup must be installed
# first; if it is not, download it from http://www.crummy.com/software/BeautifulSoup/
# Note: this script targets Python 2 (print statements, reload(sys)).
from bs4 import BeautifulSoup
import urllib, urlparse, time
import sys, os
import getopt

reload(sys)
sys.setdefaultencoding("utf-8")

# set default values
clock_time = 60                    # seconds between two snapshots
target_url = "http://m.sohu.com"
target_lib = "/tmp/backup"


def usage():
    print "simple like this :"
    print "main.py -d 60 -u http://m.sohu.com -o /tmp/backup"


def ensure_dir(path):
    # os.makedirs raises an error if the directory already exists
    if not os.path.isdir(path):
        os.makedirs(path)


def save_asset(page_url, asset_url, folder, mode):
    # Resolve relative and protocol-relative references against the page URL,
    # then store the asset under the last component of its URL path.
    try:
        full_url = urlparse.urljoin(page_url, asset_url)
        name = os.path.basename(urlparse.urlparse(full_url).path) or "index"
        data = urllib.urlopen(full_url).read()
        f = open(os.path.join(folder, name), mode)
        f.write(data)
        f.close()
    except Exception as e:
        print str(e)


def getHtml(target_url, target_lib, timestamp):
    response = urllib.urlopen(target_url)
    html = response.read()
    target_lib = target_lib + '/' + timestamp
    ensure_dir(target_lib)

    # save html
    print target_lib
    try:
        f = open(target_lib + "/index.html", "w")
        f.write(html)
        f.close()
        print "save index.html ok!"
    except Exception as e:
        print str(e)

    soup = BeautifulSoup(html, "html.parser")

    # save pictures
    ensure_dir(target_lib + "/images")
    for img in soup.find_all('img'):
        if img.get('src'):
            save_asset(target_url, img.get('src'), target_lib + "/images", "wb")
    print "save picture ok!"

    # save js
    ensure_dir(target_lib + "/js")
    noName = 0
    for script in soup.find_all('script'):
        if script.get('src'):
            save_asset(target_url, script.get('src'), target_lib + "/js", "w")
        elif script.string:
            # JS can be embedded in the document; save it as wuming<N>.js
            # ("wuming" means "unnamed")
            f = open(target_lib + "/js/wuming" + str(noName) + ".js", "w")
            noName += 1
            f.write(script.string)
            f.close()
    print "save js ok!"

    # save css
    ensure_dir(target_lib + "/css")
    for link in soup.find_all('link'):
        if link.get('type') == "text/css" and link.get('href'):
            save_asset(target_url, link.get('href'), target_lib + "/css", "w")
    print "save css ok!"


def main():
    global clock_time
    global target_url
    global target_lib

    if not sys.argv[1:]:
        usage()  # no options given: show the usage, then run with the defaults

    try:
        opts, args = getopt.getopt(sys.argv[1:], "d:u:o:", [])
    except getopt.GetoptError as err:
        print str(err)
        usage()
        sys.exit(2)

    for o, a in opts:
        if o == "-d":
            clock_time = int(a)  # getopt returns strings
        elif o == "-u":
            target_url = a
        elif o == "-o":
            target_lib = a

    lastTime = int(time.time())
    otherStyleTime = time.strftime("%Y%m%d%H%M", time.localtime(lastTime))
    getHtml(target_url, target_lib, otherStyleTime)

    while True:
        nowTime = int(time.time())
        if nowTime - lastTime >= clock_time:
            lastTime = nowTime
            otherStyleTime = time.strftime("%Y%m%d%H%M", time.localtime(nowTime))
            getHtml(target_url, target_lib, otherStyleTime)
            print "update at time " + otherStyleTime
        else:
            time.sleep(1)  # wait instead of spinning the CPU


if __name__ == "__main__":
    main()
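
A `src` or `href` value found in a page is often relative (`css/site.css`), root-relative (`/static/app.js`) or protocol-relative (`//host/img.png`), and it may carry a query string. The following is a minimal sketch, not part of the original script, of how such a reference could be turned into an absolute URL plus a local file name. It is written for Python 3 using only the standard library (the listing above targets Python 2, where `urljoin` lives in the `urlparse` module). The function name `resolve_asset` and the `fallback` parameter are hypothetical names chosen for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_asset(page_url, ref, fallback="unnamed"):
    """Return (absolute_url, local_filename) for an asset reference.

    page_url: URL of the page on which the reference was found.
    ref:      raw value of a src/href attribute.
    fallback: hypothetical default name, used when the URL path has no
              final component (for example when it ends in "/").
    """
    absolute = urljoin(page_url, ref)
    # Look only at the path component, so that query strings and
    # fragments do not leak into the local file name.
    name = posixpath.basename(urlparse(absolute).path)
    return absolute, name or fallback


page = "http://m.sohu.com/news/index.html"
for ref in ("/static/app.js?v=3", "//cdn.example.com/img/logo.png",
            "css/site.css", "http://example.com/"):
    print(resolve_asset(page, ref))
```

Resolving against the page URL before downloading means the same code path handles all reference styles, and taking the base name from the parsed path avoids file names such as `app.js?v=3`.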
