Getting Started with Python Web Crawlers

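This is a crawler demo I wrote recently. It uses only a handful of functions and implements basic page crawling, with Tuniu (tuniu.com) as the example site. The code targets Python 2 (it relies on urllib2 and print statements).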

 import urllib2
 import re
 import urlparse
 import robotparser
 import datetime
 import time

 class Throttle:
     """
     Add a delay between two downloads from the same domain
     """
     def __init__(self, delay):
         #  minimum delay (in seconds) between downloads from one domain
         self.delay = delay
         #  timestamp of when each domain was last accessed
         self.domains = {}

     def wait(self, url):
         domain = urlparse.urlparse(url).netloc
         last_accessed = self.domains.get(domain)
         if self.delay > 0 and last_accessed is not None:
             #  total_seconds() keeps fractions and days; .seconds drops both
             elapsed = (datetime.datetime.now() - last_accessed).total_seconds()
             sleep_sec = self.delay - elapsed
             if sleep_sec > 0:
                 print 'sleep: ', sleep_sec, 's'
                 time.sleep(sleep_sec)
         self.domains[domain] = datetime.datetime.now()

 def download(url, proxy=None, user_agent='wawp', num_retries=2):
     print 'Downloading:', url
     headers = {'User-agent': user_agent}
     request = urllib2.Request(url, headers=headers)
     opener = urllib2.build_opener()
     if proxy:
         proxy_param = {urlparse.urlparse(url).scheme: proxy}
         opener.add_handler(urllib2.ProxyHandler(proxy_param))
     try:
         html = opener.open(request).read()
     except urllib2.URLError as e:
         print 'Downloading error:', e.reason, '\n'
         html = ''
         #  retry only on 5xx server errors
         if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
             return download(url, proxy, user_agent, num_retries - 1)
     return html

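 def get_links(html, regstr=r'http://[^w][^"]*\.tuniu\.com'):
     #  [^"]* keeps each match inside one quoted attribute; a greedy .* would
     #  run from the first link on a line to the last one
     rexp = re.compile(regstr)
     return rexp.findall(html)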

 def deduplicate_list(inputList):
     #  remove duplicates while keeping first-seen order
     new_list = []
     for x in inputList:
         if x not in new_list:
             new_list.append(x)
     return new_list

 def crawl_sitemap(url):
     #  proxy=None: connect directly, without a proxy
     sitemap = download(url, proxy=None)
     links = get_links(sitemap)
     print 'links before deduplication: ', links
     newlinks = deduplicate_list(links)
     print 'links after deduplication: ', newlinks
     for link in newlinks:
         print link
         download(link, proxy=None)

 def get_robot(url):
     rp = robotparser.RobotFileParser()
     #  robots.txt lives at the site root, so join with an absolute path
     rp.set_url(urlparse.urljoin(url, '/robots.txt'))
     rp.read()
     return rp

 def link_crawler(seed_url, max_depth=3, link_regex=r'http:\/\/[^w][^"]*\.tuniu\.com', delay=1, proxy=None):
     #  set up the robots.txt check
     rp = get_robot(seed_url)
     #  init vars
     throttle = Throttle(delay)
     crawl_queue = [seed_url]
     #  maps each discovered URL to its depth from the seed
     seen = {seed_url: 0}
     while crawl_queue:
         url = crawl_queue.pop()
         depth = seen[url]
         if depth != max_depth:
             #  joke user agent; note that download() sends 'wawp' instead
             if rp.can_fetch('heimaojingzhang', url):
                 throttle.wait(url)
                 html = download(url, proxy)
                 for link in get_links(html, link_regex):
                     link = urlparse.urljoin(seed_url, link)
                     if link not in seen:
                         seen[link] = depth + 1
                         crawl_queue.append(link)
             else:
                 print 'Blocked by robots.txt ', url

 # TODO:
 # fix bugs (in regex): done on 2017/09/23 23:16
 # delay: done on 2017/09/24 21:36
 # proxy
 # depth: done on 2017/09/23 23:10

 if __name__ == '__main__':
     link_crawler('http://www.tuniu.com/corp/sitemap.shtml', link_regex=r'http:\/\/www\.tuniu\.com\/guide\/[^"]*')
     #  html = download('http://www.tuniu.com/corp/sitemap.shtml')
     #  print  html
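
The depth-limited loop in link_crawler can be tried out without touching the network. Below is a minimal Python 3 sketch (the listing above is Python 2) under some stated assumptions: the real download is replaced by a hypothetical fake_fetch that reads from an in-memory FAKE_SITE, and the example.com URLs, LINK_RE, and the crawl function are illustrative names that do not appear in the original code. The robots.txt check and the throttle are left out so that only the queue and the seen dictionary remain.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML); stands in for real downloads.
FAKE_SITE = {
    'http://example.com/': '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    'http://example.com/a': '<a href="http://example.com/c">C</a> '
                            '<a href="http://example.com/">home</a>',
    'http://example.com/b': '',
    'http://example.com/c': '<a href="http://example.com/d">D</a>',
    'http://example.com/d': '',
}

# Same idea as link_regex above: stop each match at the closing quote.
LINK_RE = re.compile(r'http://example\.com/[^"]*')


def fake_fetch(url):
    """Return the stored HTML for url, or an empty page if it is unknown."""
    return FAKE_SITE.get(url, '')


def crawl(seed_url, fetch, max_depth=3):
    """Depth-limited crawl; returns {url: depth} for every URL discovered."""
    queue = [seed_url]
    seen = {seed_url: 0}
    while queue:
        url = queue.pop()
        depth = seen[url]
        if depth == max_depth:
            continue  # recorded in seen, but never fetched or expanded
        html = fetch(url)
        for link in LINK_RE.findall(html):
            link = urljoin(seed_url, link)
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


if __name__ == '__main__':
    for url, depth in sorted(crawl('http://example.com/', fake_fetch, 2).items()):
        print(depth, url)
```

With max_depth=2 the page at /c is discovered at depth 2 but never fetched, so /d is never seen. With the default max_depth=3, /d is recorded at depth 3. The link from /a back to the home page is ignored because that URL is already in seen, which is what keeps the loop from cycling.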
