A First Look at Web Scraping: Fetching Page Data with Python 3 urllib.request

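This post uses Request() and urlopen() from Python 3's urllib.request to fetch a page's source, then uses a regular expression from the re module to pull the needed data out of it.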

#forex.py
#coding:utf-8
'''
urllib.request.urlopen() in Python 3 is equivalent to urllib2.urlopen() in Python 2.
urllib.request.Request() in Python 3 is equivalent to urllib2.Request() in Python 2.
'''
#python3.5
import urllib.request
#python2.7
#import urllib
#import urllib2
import re

def get_html(url, referer):
    # Send a browser-like User-Agent, plus a Referer for anti-hotlinking checks.
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:53.0) Gecko/20100101 Firefox/53.0"
    headers = {"User-Agent": user_agent, "Referer": referer}
    #python3.5
    req = urllib.request.Request(url, headers=headers)
    #python2.7
    #req = urllib2.Request(url, headers=headers)
    #response = urllib2.urlopen(req, timeout=10)
    # Give up after 10 seconds; the with-block closes the connection.
    with urllib.request.urlopen(req, timeout=10) as response:
        # read() returns bytes. The page's charset is not confirmed here, so
        # undecodable bytes are replaced instead of raising an error.
        return response.read().decode("utf-8", errors="replace")

url = referer = "http://quote.forex.hexun.com/EURUSD.shtml"
html = get_html(url, referer)

# One leading digit (0 or 1), a dot, then exactly four digits, e.g. 1.1278.
pattern = re.compile(r'[01]\.[0-9]{4}')
quotes = pattern.findall(html)

print("Hexun EURUSD:\nCur | Open | Yesterday | Low | High")
print(quotes)

Run: python forex.py

Output:

Hexun EURUSD:
Cur | Open | Yesterday | Low | High
['1.1278', '1.1211', '1.1211', '1.1203', '1.1285']
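
The pattern used above matches any four-decimal number starting with 0 or 1 anywhere in the page, so it can also pick up fragments of longer numbers. The sketch below shows one possible way to tighten it with lookarounds. The sample markup and the name extract_quotes are made up for illustration; the real Hexun page structure is not shown in this post.

```python
import re

# Hypothetical page fragment; the actual Hexun markup may look different.
sample_html = '<span id="cur">1.1278</span><span id="open">1.1211</span><em>20170523</em>'

# Same idea as the pattern in forex.py: a 0 or 1, a dot, then four digits.
# The lookbehind and lookahead reject matches embedded in longer numbers.
pattern = re.compile(r'(?<![\d.])[01]\.\d{4}(?!\d)')

def extract_quotes(html):
    """Return every standalone 0.xxxx or 1.xxxx number found in html."""
    return pattern.findall(html)

print(extract_quotes(sample_html))
```

On the sample fragment this keeps the two quote values and skips the date-like number, which contains no decimal point.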

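The Referer header matters because of anti-hotlinking checks: the server inspects the Referer in the request headers to see whether the request came from its own site, and some servers will not respond if it did not. timeout=10 sets the request timeout to 10 seconds.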
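
As a rough sketch of how the Referer header and the timeout fit together, the example below separates building the request from sending it and handles the common failure cases. The names build_request and fetch are illustrative helpers, not part of urllib, and the shortened User-Agent string is a placeholder.

```python
import socket
import urllib.error
import urllib.request

def build_request(url, referer):
    """Build a Request carrying a Referer and a browser-like User-Agent."""
    headers = {
        "User-Agent": "Mozilla/5.0",  # shortened placeholder value
        "Referer": referer,
    }
    return urllib.request.Request(url, headers=headers)

def fetch(url, referer, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    req = build_request(url, referer)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 403).
        print("HTTP error:", exc.code)
    except (urllib.error.URLError, socket.timeout) as exc:
        # No usable answer: DNS failure, refused connection, or timeout.
        print("Request failed:", exc)
    return None
```

Calling fetch(url, url) with the quote page URL mirrors the script above, where the page is used as its own Referer. HTTPError is caught first because it is a subclass of URLError.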

Reference:

http://www.jianshu.com/p/d4ebace4ddcf
