[python] Crawling a site

#!/usr/bin/python
# Python 2 script (uses urllib2 and the print statement).
import urllib
import urllib2
import re
import os

BASE = 'http://www.xxxxxx.net'
dirs = ['js', 'img', 'pay', 'css']
urls = [BASE + '/' + x for x in dirs]

def parse(baseurl):
    # Fetch the autoindex listing for this directory.
    url_cont = urllib2.urlopen(baseurl).read()
    # Non-greedy match so that each link is captured on its own.
    hrefs = re.findall(r'<A HREF="(.*?)">', url_cont)
    files = []
    subdirs = []
    cwd = os.getcwd()
    for href in hrefs:
        # Entries ending in "/" are directories; everything else is a file.
        if href.endswith('/'):
            subdirs.append(href)
        else:
            files.append(href)
    # Drop the first directory link (on an autoindex page this is
    # typically the parent directory). Slicing is safe on an empty list.
    subdirs = subdirs[1:]

    for xfile in files:
        xfileurl = BASE + xfile
        todir = cwd + xfile
        # Make sure the local directory exists before saving into it.
        localdir = os.path.dirname(todir)
        if not os.path.isdir(localdir):
            os.makedirs(localdir)
        print todir
        urllib.urlretrieve(xfileurl, todir)
    for xdir in subdirs:
        todir = cwd + xdir
        try:
            os.mkdir(todir)
        except OSError:
            print "dir exists:", todir
        xdirurl = BASE + xdir
        print xdirurl
        # Recurse into the subdirectory.
        parse(xdirurl)

if __name__ == "__main__":
    for url in urls:
        parse(url)
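The classify-then-recurse idea in the script can be tried without touching a real server. The following is only a minimal sketch in Python 3 (the script above is Python 2): `FAKE_SITE`, `classify_links`, and `crawl` are hypothetical names invented for this illustration, the listing HTML is made up, and the assumption that the first directory link is the parent directory is carried over from the script.

```python
import re

# Hypothetical in-memory "site": path -> autoindex HTML (made up for illustration).
FAKE_SITE = {
    '/js/': '<A HREF="/">Parent Directory</A>\n'
            '<A HREF="/js/lib/">lib/</A>\n'
            '<A HREF="/js/app.js">app.js</A>\n',
    '/js/lib/': '<A HREF="/js/">Parent Directory</A>\n'
                '<A HREF="/js/lib/util.js">util.js</A>\n',
}

def classify_links(html):
    """Split the hrefs of a listing page into (files, dirs)."""
    # Non-greedy group so that each link is matched on its own.
    hrefs = re.findall(r'<a href="(.*?)">', html, flags=re.IGNORECASE)
    dirs = [h for h in hrefs if h.endswith('/')]
    files = [h for h in hrefs if not h.endswith('/')]
    # Assumption: the first directory link is the parent directory, so skip it.
    return files, dirs[1:]

def crawl(path, fetch, found=None):
    """Recursively collect every file path reachable from `path`."""
    if found is None:
        found = []
    files, dirs = classify_links(fetch(path))
    found.extend(files)
    for d in dirs:
        crawl(d, fetch, found)  # same function, one level deeper
    return found

print(crawl('/js/', FAKE_SITE.__getitem__))
```

Passing the fetch function as a parameter keeps the traversal logic separate from the network code, so a real crawler could swap in a function that downloads the listing instead of reading the dictionary.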

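Key points:

1. The site has autoindex enabled, so requesting a directory returns a listing of the files inside it. The script scrapes that listing and sorts the entries into files and directories.

Files are downloaded directly.

For directories, the script takes the path and calls the same function on it, fetching the contents recursively.

2. To download a file, you can use urlretrieve from the urllib module.

3. Alternatively, you can use urlopen -> read -> write to file.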
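Points 2 and 3 can be compared side by side. This is a rough sketch in Python 3, where both functions live in `urllib.request` (in the Python 2 script they are `urllib.urlretrieve` and `urllib2.urlopen`). The helper names `download_with_urlretrieve` and `download_with_urlopen` are hypothetical, and a local `file://` URL stands in for the real site so the example runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def download_with_urlretrieve(url, dest):
    """Point 2: let urlretrieve fetch the URL and write the file."""
    urllib.request.urlretrieve(url, dest)

def download_with_urlopen(url, dest):
    """Point 3: urlopen -> read -> write to file, done by hand."""
    with urllib.request.urlopen(url) as resp, open(dest, 'wb') as out:
        out.write(resp.read())

# Create a small source file and address it with a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / 'source.txt'
source.write_bytes(b'hello autoindex\n')
url = source.as_uri()

copy_a = os.path.join(workdir, 'a.txt')
copy_b = os.path.join(workdir, 'b.txt')
download_with_urlretrieve(url, copy_a)
download_with_urlopen(url, copy_b)

# Both approaches should produce identical files.
print(Path(copy_a).read_bytes() == Path(copy_b).read_bytes())
```

The manual variant takes more lines but gives access to the response object, which can matter if you want to inspect headers or read the body in chunks.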
