Python for Informatics, Chapter 12: Networked Programs, Part 6 (Translation)

Note: The original text is Dr. Charles Severance's "Python for Informatics". The code in this article has been rewritten for Python 3.4 and tested on the translator's machine.

12.9 Glossary

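BeautifulSoup: A Python library for parsing HTML documents and extracting data from them. It compensates for most of the imperfections in HTML that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.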

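port: A number that generally indicates which server application you are contacting when you make a socket connection to a server. For example, web traffic usually uses port 80, while email traffic uses port 25.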

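scrape: When a program pretends to be a web browser, retrieves a web page, and then looks at the page's content. Such programs often follow the links in one page to find the next page, so they can traverse a network of pages or a social network.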

socket: A network connection between two applications, over which the applications can send and receive data in both directions.

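spider: The act of a web search engine retrieving a page, then all the pages linked from that page, and so on until it has nearly all of the pages on the Internet, which it then uses to build its search index.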
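The spider and scrape entries both describe following links from page to page. As a minimal sketch of that idea, the code below walks a tiny in-memory "web" breadth-first, visiting each page once. The page names, the `PAGES` dictionary, and the helper names `LinkCollector`, `extract_links`, and `crawl` are all hypothetical and chosen for illustration; a real spider would fetch pages over the network instead.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def crawl(start, fetch):
    """Breadth-first walk: visit each page once, following its links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in extract_links(fetch(page)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web": page name -> HTML text (no network access).
PAGES = {
    'a.html': '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html': '<a href="a.html">back</a> <a href="c.html">C</a>',
    'c.html': '<p>no links here</p>',
}

print(crawl('a.html', lambda name: PAGES.get(name, '')))
# ['a.html', 'b.html', 'c.html']
```

The `seen` set is what keeps the walk from looping forever when pages link back to each other, as `b.html` does here.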

12.10 Exercises

The exercise solutions below were written by the translator and are for reference only.

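Exercise 12.1 Change the socket program socket1.py to prompt the user for the URL so that it can read any web page. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the case where the user enters an improperly formatted or non-existent URL.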

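import codecs
import re
import socket
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')

# Accept only URLs of the form http://name.name.name/...
if re.search(r'^http://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+/', url):
    words = url.split('/')
    hostname = words[2]
    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # connect() takes a single (host, port) tuple, hence the double parentheses
        mysocket.connect((hostname, 80))
    except OSError:
        print(hostname, 'is not a reachable web server')
        sys.exit(1)  # a bare "exit" would not stop the program
    # HTTP lines end with CRLF; a blank line ends the request
    mysocket.sendall(('GET ' + url + ' HTTP/1.0\r\n\r\n').encode())
    # Decode incrementally so a multi-byte character split across two recv() calls survives
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    while True:
        data = mysocket.recv(1024)
        if len(data) < 1:
            break
        print(decoder.decode(data), end='')
    mysocket.close()
else:
    print('The URL you entered is not in the expected format')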
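To see why the solution above takes element 2 of the split result as the host name, it helps to print the pieces that split('/') produces for the sample URL. The sketch below also shows `urllib.parse.urlsplit` from the standard library as an alternative way to get the same value; using it is a suggestion here, not something the exercise asks for.

```python
from urllib.parse import urlsplit

url = 'http://www.py4inf.com/code/socket1.py'

# The empty string comes from the two adjacent slashes in "http://"
words = url.split('/')
print(words)
# ['http:', '', 'www.py4inf.com', 'code', 'socket1.py']

hostname = words[2]
print(hostname)
# www.py4inf.com

# Standard-library alternative that parses the URL structure for you
print(urlsplit(url).hostname)
# www.py4inf.com
```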

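Exercise 12.2 Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document, count the total number of characters, and display that count at the end of the document.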

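import codecs
import re
import socket
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')

if re.search(r'^http://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+/', url):
    words = url.split('/')
    hostname = words[2]
    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # connect() takes a single (host, port) tuple, hence the double parentheses
        mysocket.connect((hostname, 80))
    except OSError:
        print(hostname, 'is not a reachable web server')
        sys.exit(1)  # a bare "exit" would not stop the program
    mysocket.sendall(('GET ' + url + ' HTTP/1.0\r\n\r\n').encode())
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    count = 0
    while True:
        data = mysocket.recv(1024)
        if len(data) < 1:
            break
        text = decoder.decode(data)
        # Show only the part of this chunk that still fits under the 3000-character limit
        if count < 3000:
            print(text[:3000 - count], end='')
        count = count + len(text)
    mysocket.close()
    print()
    # The count includes the HTTP headers, since they arrive on the same socket
    print('Total number of characters received:', count)
else:
    print('The URL you entered is not in the expected format')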
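The display limit and the running total are easier to check when they are separated from the network code. The sketch below applies the same idea to a plain list of text chunks: show at most `limit` characters in total, but keep counting every chunk. The function name `show_and_count` and the sample chunks are hypothetical.

```python
def show_and_count(chunks, limit=3000):
    """Return (text to display, total characters) for an iterable of text chunks."""
    shown = []
    total = 0
    for chunk in chunks:
        remaining = limit - total
        if remaining > 0:
            # Keep only the part of this chunk that fits under the limit
            shown.append(chunk[:remaining])
        total += len(chunk)
    return ''.join(shown), total


# Three chunks totalling 4500 characters; the limit falls inside the second one
chunks = ['a' * 2000, 'b' * 2000, 'c' * 500]
text, total = show_and_count(chunks)
print(len(text), total)
# 3000 4500
```

Slicing each chunk, rather than printing a chunk only when the total is still under the limit, means exactly 3000 characters are shown no matter where the chunk boundaries fall.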

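Exercise 12.3 Use urllib to replicate the previous exercise: (1) retrieve the document from a URL, (2) display up to 3000 characters, and (3) count the total number of characters in the document. Don't worry about the headers for this exercise; simply show the first 3000 characters of the document contents.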

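import codecs
import re
import sys
import urllib.request

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')

if re.search(r'^http://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+/', url):
    try:
        web = urllib.request.urlopen(url)
    except (OSError, ValueError):  # URLError and HTTPError are subclasses of OSError
        print(url, 'is not a valid URL')
        sys.exit(1)  # a bare "exit" would not stop the program
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    counts = 0
    while True:
        data = web.read(1024)
        if len(data) < 1:
            break
        text = decoder.decode(data)
        # Show only the part of this chunk that still fits under the 3000-character limit
        if counts < 3000:
            print(text[:3000 - counts], end='')
        counts = counts + len(text)
    web.close()
    print()
    print('Total number of characters in the document:', counts)
else:
    print('The URL you entered is not in the expected format')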

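Exercise 12.4 Change the urllinks.py program to extract and count the paragraph (p) tags in the retrieved HTML document, and display the number of paragraphs as the program's output. Do not display the paragraph text; only count the tags. Test your program on several small web pages as well as some larger ones.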

from bs4 import BeautifulSoup
import urllib.request

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup object is shorthand for soup.find_all('p')
tags = soup('p')
counts = len(tags)
print('This page has', counts, 'p tags.')
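For readers who do not have BeautifulSoup installed, the same count can be sketched with the standard library's `html.parser` module. This is an alternative illustration, not the approach the exercise asks for, and unlike BeautifulSoup it makes no attempt to repair broken HTML; it simply reports each start tag it sees. The class name `ParagraphCounter`, the function `count_paragraphs`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count the <p> start tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> both arrive as 'p'
        if tag == 'p':
            self.count += 1


def count_paragraphs(html):
    counter = ParagraphCounter()
    counter.feed(html)
    return counter.count


# Sample page with an uppercase tag and an unclosed paragraph
sample = '<html><body><p>one</p><P class="x">two</P><p>three</body></html>'
print(count_paragraphs(sample))
# 3
```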

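Exercise 12.5 (Advanced) Change the socket program so that it only shows the data that comes after the headers and the blank line. Remember that recv receives characters (newlines and all), not lines.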

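import re
import socket
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')

if re.search(r'^http://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+/', url):
    words = url.split('/')
    hostname = words[2]
    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # connect() takes a single (host, port) tuple, hence the double parentheses
        mysocket.connect((hostname, 80))
    except OSError:
        print(hostname, 'is not a reachable web server')
        sys.exit(1)  # a bare "exit" would not stop the program
    mysocket.sendall(('GET ' + url + ' HTTP/1.0\r\n\r\n').encode())
    # Collect the whole response as bytes, because the blank line may span two recv() calls
    web = b''
    while True:
        data = mysocket.recv(1024)
        if len(data) < 1:
            break
        web = web + data
    mysocket.close()
    # The headers end at the first blank line, i.e. the first CRLF CRLF sequence
    pos = web.find(b'\r\n\r\n')
    if pos == -1:
        print('No blank line found, so the headers cannot be separated from the body')
    else:
        print(web[pos + 4:].decode('utf-8', errors='replace'))
else:
    print('The URL you entered is not in the expected format')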
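The reason the solution joins all received bytes before searching is that the blank line separating headers from body can be split across two recv() calls. The sketch below shows this on hand-made sample data, with no network involved; the function name `split_response` and the sample chunks are hypothetical.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (headers, body) at the first blank line."""
    head, separator, body = raw.partition(b'\r\n\r\n')
    # If no blank line is present, partition() returns everything as head and an empty body
    return head, body


# Sample chunks as recv() might deliver them; the CRLF CRLF straddles two chunks
chunks = [
    b'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r',
    b'\n\r\nBut soft ',
    b'what light\n',
]

# No single chunk contains the full separator
print(any(b'\r\n\r\n' in chunk for chunk in chunks))
# False

head, body = split_response(b''.join(chunks))
print(body.decode('utf-8'), end='')
# But soft what light
```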
