Urllib: Web Crawler

1. A simple crawler

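from urllib import request

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; resp is the response object
    data = resp.read()  # read the whole response body as bytes
    # Save the page; the with block closes the file and avoids reusing the name f
    with open('url.html', 'wb') as out:
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

f('http://www.cnblogs.com/alex3714/articles/5248247.html')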

Output:

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: http://www.cnblogs.com/alex3714/articles/5248247.html
91829 bytes received from http://www.cnblogs.com/alex3714/articles/5248247.html.

Process finished with exit code 0
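The download step above can also be written with a timeout and basic error handling. The following is only a minimal sketch, not the original author's code: the fetch function, the HelloHandler class, and the url_demo.html file name are hypothetical, and a throwaway local HTTP server stands in for the real site so the example runs without internet access.

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request
from urllib.error import URLError


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed page for the demo."""

    def do_GET(self):
        body = b'<html>hello</html>'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, path, timeout=10):
    """Download url, save the raw bytes to path, and return the byte count.

    Returns None if the request fails with a URLError (which includes HTTPError).
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except URLError as exc:
        print('failed to fetch %s: %s' % (url, exc))
        return None
    with open(path, 'wb') as out:
        out.write(data)
    return len(data)


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
path = os.path.join(tempfile.gettempdir(), 'url_demo.html')
size = fetch(url, path)
print('%s bytes received from %s.' % (size, url))
server.shutdown()
```

Swapping the local URL for a real one gives the same behavior as the crawler above, with the difference that a failed request prints a message instead of raising.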

2. Crawling multiple pages

from urllib import request
import gevent

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; resp is the response object
    data = resp.read()  # read the whole response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

# Start three coroutines (greenlets), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])

Output:

GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
479631 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.

Process finished with exit code 0

3. Measuring the run time:

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; resp is the response object
    data = resp.read()  # read the whole response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

start_time = time.time()

# Start three coroutines (greenlets), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])

print('cost is %s:' % (time.time() - start_time))

Output: the timing shows that the three requests still ran serially. By default gevent cannot tell that urllib is doing I/O, so no greenlet yields to the others while it waits on the network.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
488624 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
cost is 4.5304529666900635:

Process finished with exit code 0
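The serial behavior can be reproduced without any network access. The sketch below is an illustration under stated assumptions, not the original code: time.sleep stands in for an unpatched blocking call such as urllib's socket read, gevent.sleep stands in for a cooperative one, and the names blocking_task, cooperative_task, and timed are hypothetical. It assumes gevent is installed and that monkey patching has not been applied in the same process.

```python
import time

import gevent


def blocking_task(delay):
    # Ordinary blocking call: gevent never regains control while it waits,
    # so the greenlets end up running one after another.
    time.sleep(delay)


def cooperative_task(delay):
    # Cooperative call: yields to the other greenlets while waiting.
    gevent.sleep(delay)


def timed(task, delay=0.2, count=3):
    """Run count greenlets of task and return the elapsed wall-clock time."""
    start = time.time()
    gevent.joinall([gevent.spawn(task, delay) for _ in range(count)])
    return time.time() - start


blocking_cost = timed(blocking_task)
cooperative_cost = timed(cooperative_task)
print('blocking cost is %.2f' % blocking_cost)
print('cooperative cost is %.2f' % cooperative_cost)
```

With three greenlets and a 0.2 second delay, the blocking version should take roughly the sum of the delays, while the cooperative version should take roughly one delay.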

4. Comparing synchronous and asynchronous timings:

from urllib import request
import gevent
import time

#from gevent import monkey
#monkey.patch_all()  # would make the program's I/O cooperative; left disabled here

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; resp is the response object
    data = resp.read()  # read the whole response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))

Run time: the two are close, and the asynchronous version shows no real advantage. The GET/received pairs still appear strictly in order, so the greenlets ran one after another; the gap between the two numbers is most likely ordinary network variance.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
480499 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
sync cost is 7.112711191177368:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
485666 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55390 bytes received from https://github.com/.
async cost is 4.510450839996338:

Process finished with exit code 0

5. By default gevent cannot tell that urllib is performing I/O. To tie the two together, import and apply gevent's monkey patch:

from gevent import monkey
monkey.patch_all()
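One way to check that the patch took effect is to ask gevent whether the socket module has been replaced. This is a minimal sketch assuming gevent is installed; it uses monkey.is_module_patched from gevent.monkey, and the identity comparison is only an illustration of what patching means, not something the original post relies on.

```python
from gevent import monkey

monkey.patch_all()  # apply the patch before importing anything that uses sockets

import socket

import gevent.socket

# True once the standard socket module has been swapped for gevent's version.
socket_patched = monkey.is_module_patched('socket')
print('socket patched:', socket_patched)

# After patching, the name socket.socket refers to gevent's cooperative class,
# which is why urllib starts yielding to other greenlets without code changes.
same_class = socket.socket is gevent.socket.socket
print('socket.socket is gevent.socket.socket:', same_class)
```

Because urllib looks up the socket class at call time, replacing it in the standard module is enough to make urlopen cooperative.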

# Patch first, before urllib (and through it socket and ssl) is imported
from gevent import monkey
monkey.patch_all()  # replace the standard library's blocking I/O with gevent's cooperative versions

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; resp is the response object
    data = resp.read()  # read the whole response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))

Output: this time all three GET lines are printed before any response arrives, and the asynchronous run finishes in about a third of the synchronous time.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
487577 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55392 bytes received from https://github.com/.
sync cost is 5.784578323364258:
GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
480662 bytes received from https://www.yahoo.com/.
48751 bytes received from https://www.python.org/.
55394 bytes received from https://github.com/.
async cost is 1.8721871376037598:

Process finished with exit code 0
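The three spawn calls can be generated from the urls list, and each greenlet's return value can be read back after joinall. The sketch below is an assumption-laden illustration rather than the original code: fake_fetch is a hypothetical stand-in that waits with gevent.sleep instead of downloading, and it returns the length of the URL string as a fake byte count so the example runs offline.

```python
import time

import gevent


def fake_fetch(url, delay):
    """Hypothetical stand-in for a download: wait cooperatively, return a fake size."""
    gevent.sleep(delay)
    return len(url)


urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start = time.time()
# One greenlet per URL; extra arguments to spawn are passed on to the function.
jobs = [gevent.spawn(fake_fetch, url, 0.2) for url in urls]
gevent.joinall(jobs, timeout=5)

# A finished greenlet exposes its function's return value as .value.
results = {url: job.value for url, job in zip(urls, jobs)}
cost = time.time() - start

for url, size in results.items():
    print('%d (fake) bytes from %s' % (size, url))
print('cost is %.2f' % cost)
```

With a real, monkey-patched download function in place of fake_fetch, the same pattern would let the caller collect the page sizes instead of only printing them inside each greenlet.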
