On the Self-Cultivation of a Web Crawler

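General format of a URL (parts in square brackets [] are optional):

protocol://hostname[:port]/path/[;parameters][?query][#fragment]

A URL consists of three parts:
- The first part is the protocol: http, https, ftp, file, ed2k, ...
- The second part is the domain name or IP address of the server that hosts the resource. A port number is sometimes required as well; every transfer protocol has a default port, for example 80 for http.
- The third part is the specific location of the resource on that server, such as a directory or file name.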
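To make the parts above concrete, here is a minimal sketch that splits a URL with the standard library's `urllib.parse.urlparse`. The URL itself is a made-up example (an assumption, not from the notes), chosen only so that every optional component is present.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to exercise every component of the format.
example_url = "http://www.example.com:8080/docs/index.html;v=1?q=python&page=2#intro"

parts = urlparse(example_url)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the host name, without the port
print(parts.port)      # the port as an int, or None when the URL omits it
print(parts.path)      # the path to the resource
print(parts.params)    # the ;parameters of the last path segment
print(parts.query)     # the ?query string
print(parts.fragment)  # the #fragment
```

When the port is left out, `parts.port` is `None`, and the protocol's default port (80 for http) is used when the connection is actually made.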

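import urllib.request

# Fetch the page; urlopen() returns a response object
response = urllib.request.urlopen("http://www.fishc.com")
# read() returns the raw body as bytes
html = response.read()
# Decode the bytes to text (this assumes the page is UTF-8 encoded)
html = html.decode('utf-8')
print(html)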
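The snippet above hard-codes `'utf-8'`, which fails on pages served in another encoding such as GBK. As a hedged sketch, the charset can instead be read from the response's `Content-Type` header, falling back to UTF-8 when none is declared. The helper name `decode_body` is hypothetical, and the demo builds a header object by hand so that it runs without network access; with a real response you would pass `response.headers`.

```python
from email.message import Message


def decode_body(raw_bytes, headers, fallback="utf-8"):
    """Decode a response body using the charset declared in its headers.

    `headers` is any email.message.Message-like object; the `headers`
    attribute of a urlopen() response for an HTTP URL is one.
    """
    charset = headers.get_content_charset() or fallback
    return raw_bytes.decode(charset, errors="replace")


# Offline demo: a hand-built header that declares GBK
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=gbk"
demo_text = decode_body("你好".encode("gbk"), demo_headers)
print(demo_text)
```

With a live response this would be called as `decode_body(response.read(), response.headers)`. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.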

2. Downloading an image from a website

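import urllib.request

# Alternative: build a Request object first and pass it to urlopen()
# req = urllib.request.Request('http://placekitten.com/g/600/600')
# response = urllib.request.urlopen(req)
response = urllib.request.urlopen('http://placekitten.com/g/500/600')
cat_img = response.read()

# Write the raw bytes to disk in binary mode
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)

# Inspect the response
print(response.geturl())   # the URL that was actually retrieved
print(response.info())     # the response headers
print(response.getcode())  # the HTTP status code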
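Reading the whole body with `read()` is fine for a small image. For larger files, one possible variation is to copy the response to disk in chunks. The sketch below does that with `shutil.copyfileobj`; the function name `download` and the chunk size are assumptions made for illustration.

```python
import shutil
import urllib.request


def download(url, dest_path, chunk_size=64 * 1024):
    """Stream the body of `url` into `dest_path`; return the bytes written."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as f:
        # Copy chunk by chunk instead of holding the whole body in memory
        shutil.copyfileobj(response, f, chunk_size)
        return f.tell()
```

For the kitten picture above, the call would be `download('http://placekitten.com/g/500/600', 'cat_500_600.jpg')`.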

GET requests data from the server
POST submits data to the server
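In `urllib.request` the method is not named explicitly: a request without a body is sent as GET, and one that carries `data` is sent as POST. The following sketch shows this with `Request.get_method()`. The endpoint is a hypothetical placeholder and no request is actually sent.

```python
import urllib.parse
import urllib.request

url = "http://www.example.com/form"  # hypothetical endpoint, never contacted here

# No body: urllib will send a GET
get_req = urllib.request.Request(url)

# A body (bytes) is attached: urllib will send a POST
form = urllib.parse.urlencode({"i": "hello", "doctype": "json"}).encode("utf-8")
post_req = urllib.request.Request(url, data=form)

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)  # the URL-encoded form, as bytes
```

The same rule applies to `urlopen(url, data)`, which is why passing a second argument is enough to submit a form.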

import urllib.parse
import urllib.request

content = input('Enter the text to translate: ')
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc'

# Form fields sent in the POST body
data = {}
data['i'] = content
data['type'] = 'AUTO'
data['doctype'] = 'json'
data['xmlVersion'] = '1.8'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'
data['action'] = 'FY_BY_ENTER'
data['typoResult'] = 'true'
# urlencode() builds the form string; urlopen() needs it as bytes
data = urllib.parse.urlencode(data).encode('utf-8')

# Passing data makes urlopen() send a POST instead of a GET
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
print(html)

# The response is JSON; to print only the translated text:
# import json
# target = json.loads(html)
# print('Translation: %s' % (target['translateResult'][0][0]['tgt']))

Hiding

# Hiding: disguise the crawler as a browser
import json
import urllib.parse
import urllib.request

content = input('Enter the text to translate: ')
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc'

# Request headers; the User-Agent string is the one a browser would send
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'
}

data = {}
data['i'] = content
data['type'] = 'AUTO'
data['doctype'] = 'json'
data['xmlVersion'] = '1.8'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')

req = urllib.request.Request(url, data, head)
# The line above can be replaced by these two lines:
# req = urllib.request.Request(url, data)
# req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36')

response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)

# Parse the JSON response and print only the translated text
target = json.loads(html)
target = target['translateResult'][0][0]['tgt']
print('Translation: %s' % (target))
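Why the disguise is needed, and that the two ways of setting the header are equivalent, can be checked without sending anything. This is a small sketch under stated assumptions: the URL is a hypothetical placeholder, and the User-Agent string is the one used in the notes.

```python
import urllib.request

url = "http://www.example.com/"  # hypothetical URL; no request is sent in this sketch
browser_ua = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36")

# What urllib announces when no User-Agent is supplied
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)  # starts with "Python-urllib/"

# Way 1: pass the headers dict to the Request constructor
req_a = urllib.request.Request(url, headers={"User-Agent": browser_ua})

# Way 2: create the Request first, then call add_header()
req_b = urllib.request.Request(url)
req_b.add_header("User-Agent", browser_ua)

# urllib stores header names capitalized, so the lookup key is "User-agent"
print(req_a.get_header("User-agent") == browser_ua)
print(req_a.header_items() == req_b.header_items())
```

A server that filters on the default `Python-urllib/...` value is what this technique works around. It does nothing against limits based on request rate or IP address.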

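Proxies

Steps:

1. Create a ProxyHandler. Its argument is a dict of the form {'scheme': 'proxy IP:port'}, where the scheme is http, ftp, and so on (the empty {} below is only a placeholder to be filled in).

proxy_support = urllib.request.ProxyHandler({})

2. Customize and create an opener

opener = urllib.request.build_opener(proxy_support)

3. Install the opener (from then on urlopen() automatically uses the customized opener)

urllib.request.install_opener(opener)

4. Call the opener (this also works without installing it)

opener.open(url)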
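The steps above can be sketched offline to see what each object holds. The proxy address here is a hypothetical placeholder (an assumption, not a working proxy), and the sketch stops before any request is made.

```python
import urllib.request

proxy_address = "127.0.0.1:8080"  # hypothetical placeholder, not a real proxy

# Step 1: map each scheme to the proxy that should handle it
proxy_support = urllib.request.ProxyHandler({"http": proxy_address})
print(proxy_support.proxies)

# Step 2: build an opener that contains this handler
opener = urllib.request.build_opener(proxy_support)
print(proxy_support in opener.handlers)

# Steps 3 and 4 are left as comments because they would touch the network:
# urllib.request.install_opener(opener)   # affects every later urlopen() call
# opener.open(url)                        # affects only this call
```

Installing the opener changes the behaviour of `urlopen()` for the whole process, while calling `opener.open(url)` directly keeps the proxy local to that opener. Note that `ProxyHandler({})` with an empty dict means that no proxy is used at all.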

import urllib.request
import random

url = 'http://www.whatismyip.com.tw'
iplist = ['121.40.199.105:80', '121.40.213.161:80', '121.196.226.246:84', '182.89.185.242:80']

# The argument is a dict {'scheme': 'proxy IP:port'} (scheme: http, ftp, etc.)
# Pick one proxy address at random
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})

# Customize and create an opener
opener = urllib.request.build_opener(proxy_support)
# Imitate a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36')]

# Install the opener (urlopen() will use the customized opener from now on)
urllib.request.install_opener(opener)

# Open the URL
response = urllib.request.urlopen(url)

# Decode the HTML
html = response.read().decode('utf-8')

print(html)

Result:

<!DOCTYPE HTML>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <meta name="viewport" content="width=device-width,initial-scale=1.0">
    <meta name="description" content="查我的IP,查IP國家,查代理IP及真實IP"/>
    <meta name="keywords" content="查ip,ip查詢,查我的ip,我的ip位址,我的ip位置,我的ip國家,偵測我的ip,查詢我的ip,查看我的ip,顯示我的ip,what is my IP,whatismyip,my IP address,my IP proxy"/>
    <link rel="icon" href="data:;base64,iVBORw0KGgo=">
    <title>我的IP位址查詢</title>
  </head>
  <body>
<h1>IP位址</h1> <span data-ip='121.196.226.246'><b style='font-size: 1.5em;'>121.196.226.246</b></span> <span data-ip-country='CN'><i>CN</i></span>
<script type="application/json" id="ip-json">
{
    "ip": "121.196.226.246",
    "ip-country": "CN",
    "ip-real": "",
    "ip-real-country": ""
}
</script>
<script type="text/javascript">
var sc_project=6392240;
var sc_invisible=1;
var sc_security="65d86b9d";
var sc_https=1;
var sc_remove_link=1;
var scJsHost = (("https:" == document.location.protocol) ? "https://secure." : "http://www.");
var _scjs = document.createElement("script");
_scjs.async = true;
_scjs.type = "text/javascript";
_scjs.src = scJsHost + "statcounter.com/counter/counter.js";
var _scnode = document.getElementsByTagName("script")[0];
_scnode.parentNode.insertBefore(_scjs, _scnode);
</script>
<noscript><div class="statcounter"><img class="statcounter" src="http://c.statcounter.com/6392240/0/65d86b9d/1/" alt="statcounter"></div></noscript>
  </body>
</html>
