Python Web Scraping with urllib

#coding=utf-8

# urllib helper class

import time

import urllib.request

import urllib.parse

from urllib.error import HTTPError, URLError

import sys

class myUrllib:

	@staticmethod

	def get_headers(headers):

		default_headers = {

			'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',

			#'Referer': r'http://www.baidu.com/',

			'Connection': 'keep-alive',

			'Cookie':'uuid_tt_dd=2845574184150781887; _ga=GA1.2.1608505838; dc_tos=p308'

		}

		# caller-supplied headers override the defaults; None or {} means defaults only
		headers = dict(default_headers, **(headers or {}))

		return headers

	@staticmethod

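	def get(url,headers=None):
		# headers=None avoids sharing one mutable default dict across calls
		headers = myUrllib.get_headers(headers)
		#data=urllib.parse.urlencode(query_data).encode('utf-8')
		# String prefixes: r/R = raw string (no escape processing),
		# u/U = unicode string, b = bytes
		# (the r prefix below only affects the literal '%s', not the value of url)
		url=r'%s'%url
		request = urllib.request.Request(url,headers=headers,method='GET')
		page = None  # stays None if the request fails
		try:
			html = urllib.request.urlopen(request).read()
			page = html.decode('utf-8')
		except HTTPError as e:
			# the server answered with an error status (4xx/5xx)
			print (e.code,e.reason)
		except URLError as e:
			# the server could not be reached (DNS failure, refused connection, ...)
			print (e.reason)
		return page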

	@staticmethod

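	def post(url,data=None,headers=None):
		# None defaults avoid sharing mutable default dicts across calls
		headers = myUrllib.get_headers(headers)
		# encode the form fields as key=value pairs, then as bytes for the request body
		data=urllib.parse.urlencode(data or {})
		binary_data=data.encode('utf-8')
		url=r'%s'%url
		# build the request that carries the form data
		request=urllib.request.Request(url,data=binary_data,headers=headers,method='POST')
		# response=urllib.request.urlopen(request)  # receive the response
		# data=response.read()  # read the response body
		# data=data.decode('utf-8')
		#print (data.encode('gb18030'))
		#print (response.geturl())  # the URL actually retrieved
		# info(): returns an object holding the headers sent back by the remote server.
		# getcode(): returns the HTTP status code, e.g. 200 = success, 404 = not found.
		# geturl(): returns the URL of the resource that was requested.
		page = None  # stays None if the request fails
		try:
			html = urllib.request.urlopen(request).read()
			page = html.decode('utf-8')
		except HTTPError as e:
			print (e.code,e.reason)
		except URLError as e:
			print (e.reason)
		return page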

getInfo = myUrllib.get('http://localhost:88/test/c.php?act=category',{'Referer': r'https://www.baidu.com/'})

print(getInfo)

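# The script stops here, so only the GET request above runs.
# Remove the next line to run the POST example below as well.
sys.exit()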

postInfo = myUrllib.post('http://localhost:88/test/c.php',{'id':1010},{'Referer': r'https://www.baidu.com/'})

print(postInfo)
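
To make the two building blocks of the helper class concrete, here is a minimal sketch of how default headers are merged with caller-supplied ones and how a form dict becomes a POST body. The header values, the extra `name` field, and the `example.com` URL are placeholders chosen for illustration, not values from the script above. No request is sent: building a `Request` object does not touch the network.

```python
import urllib.parse
import urllib.request

# Placeholder defaults, for illustration only.
default_headers = {'User-Agent': 'demo-agent/1.0', 'Connection': 'keep-alive'}
extra_headers = {'Referer': 'https://www.baidu.com/', 'Connection': 'close'}

# Keys from extra_headers override the same keys in default_headers.
merged = dict(default_headers, **extra_headers)

# urlencode() turns a dict into a query string; a POST body must be bytes.
form = {'id': 1010, 'name': 'a b'}
body = urllib.parse.urlencode(form).encode('utf-8')

# example.com is a placeholder; nothing is sent until urlopen() is called.
request = urllib.request.Request('http://example.com/', data=body,
                                 headers=merged, method='POST')

print(merged['Connection'])   # close
print(body)                   # b'id=1010&name=a+b'
print(request.get_method())   # POST
```

Note that `urlencode` replaces the space in `'a b'` with `+`, which is the usual form encoding.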

d:\python\crawler>python urllib01.py
HTTP_HOST:
localhost:88
HTTP_USER_AGENT:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36
HTTP_COOKIE:
uuid_tt_dd=2845574184150781887; _ga=GA1.2.1608505838; dc_tos=p308
HTTP_REFERER:
https://www.baidu.com/
REQUEST_METHOD:
GET
GET DATA:
array(1) {
  ["act"]=>
  string(8) "category"
}
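
The `c.php` script behind `http://localhost:88/test/c.php` is not shown in this post. As a hedged stand-in, the sketch below starts a tiny local HTTP server in Python that echoes the request method, path, and two headers, then fetches it with `urllib.request`. The names `EchoHandler` and `fetch` are invented here, and the server binds to whatever free port the OS hands out instead of port 88.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for c.php: echoes the method, path and two headers."""

    def do_GET(self):
        lines = [
            'REQUEST_METHOD: ' + self.command,
            'PATH: ' + self.path,
            'HTTP_USER_AGENT: ' + self.headers.get('User-Agent', ''),
            'HTTP_REFERER: ' + self.headers.get('Referer', ''),
        ]
        body = '\n'.join(lines).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


# An empty ProxyHandler keeps the local request away from any system proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, headers):
    """Send a GET request with the given headers and return the decoded body."""
    request = urllib.request.Request(url, headers=headers, method='GET')
    with opener.open(request, timeout=5) as response:
        return response.read().decode('utf-8')


# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/test?act=category' % server.server_port
echoed = fetch(url, {'User-Agent': 'demo-agent/1.0',
                     'Referer': 'https://www.baidu.com/'})
print(echoed)

server.shutdown()
server.server_close()
```

Running it prints the echoed request lines, which makes it easy to confirm that custom headers such as `Referer` really reach the server.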

# Setting up a proxy

#coding=utf-8

import urllib.request

import random

from urllib.error import HTTPError, URLError

def proxy_handler(url,iplist,wfile):
	# Try each proxy in turn until one of them returns the page.
	# To use a single random proxy instead: ip = random.choice(iplist)
	for ip in iplist:
		try:
			print('*'*20,'\n ip:',ip)
			# route plain-http requests through the proxy at ip ('host:port')
			proxy_support = urllib.request.ProxyHandler({'http':ip})
			opener = urllib.request.build_opener(proxy_support)
			opener.addheaders = [('User-Agent',r'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36')]
			# make this opener the default used by urllib.request.urlopen()
			urllib.request.install_opener(opener)
			response = urllib.request.urlopen(url)
			code = response.getcode()
			# use a separate name so the input url is unchanged for the next proxy
			final_url = response.geturl()
			print('*'*20,'\n url:',final_url)
			print('*'*20,'\n code:',code)
			info = response.info()
			print('*'*20,'\n info:',info)
			if code == 200:
				page = response.read()
				# write the page to a file
				page = str(page, encoding='utf-8')
				with open(wfile,'w',encoding='UTF-8') as fw:
					fw.write(page)
				print('*'*20,'\n write file:',wfile)
				break
		except HTTPError as e:
			# the server answered with an error status; try the next proxy
			print (e.code,e.reason)
			continue
		except URLError as e:
			# the proxy could not be reached; try the next proxy
			print (e.reason)
			continue

url = r'http://ip.chinaz.com/'

iplist = ['182.42.244.169:808','122.72.18.34:80','52.44.16.168:3129']

wfile = 'page.txt'

proxy_handler(url,iplist,wfile)

d:\python\crawler>python proxy01.py
********************
ip: 182.42.244.169:808
[WinError 10061] No connection could be made because the target machine actively refused it
********************
ip: 122.72.18.34:80
********************
url: http://ip.chinaz.com/
********************
code: 200
********************
info: Cache-Control: private
Content-Length: 33900
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
Set-Cookie: qHistory=aHR0cDovL2lwLmNoaW5hei5jb20rSVAv5pyN5Yqh5Zmo5Zyw5Z2A5p+l6Ki; domain=.chinaz.com; expires=Tue, 05-Feb-2019 15:03:42 GMT; path=/
X-Powered-By: ASP.NET
Date: Mon, 05 Feb 2018 15:03:42 GMT
X-Cache: MISS from GD-SZ-WEB-01
X-Cache-Lookup: MISS from GD-SZ-WEB-01:80
Connection: close
********************
write file: page.txt
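
The proxy script installs each opener globally with `install_opener` and leaves `random.choice(iplist)` commented out. As an alternative sketch (not the author's method), the function below builds a standalone opener for one randomly chosen proxy and leaves the global state untouched. The name `build_proxy_opener`, the `demo-agent/1.0` User-Agent, and the choice to map both `http` and `https` to the same proxy are assumptions made for illustration. The addresses are the ones from the script and may no longer be reachable; nothing is sent over the network until `opener.open(...)` is called.

```python
import random
import urllib.request


def build_proxy_opener(iplist, user_agent='demo-agent/1.0'):
    """Pick one proxy at random and return (ip, opener) without installing it globally."""
    ip = random.choice(iplist)
    # Assumption: the same proxy handles both http and https traffic.
    proxy_support = urllib.request.ProxyHandler({'http': ip, 'https': ip})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', user_agent)]
    return ip, opener


iplist = ['182.42.244.169:808', '122.72.18.34:80', '52.44.16.168:3129']
ip, opener = build_proxy_opener(iplist)
print('chosen proxy:', ip)

# To actually fetch a page through the proxy (needs a live proxy):
# with opener.open('http://ip.chinaz.com/', timeout=10) as response:
#     print(response.getcode())
```

Because the opener is not installed globally, other code that calls `urllib.request.urlopen` keeps its normal behavior, and several openers with different proxies can coexist.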
