Python Web Scraping Example

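I have been learning Python recently and just finished a web scraping example: it uses Python to fetch data on universities around the world and their departments, and saves the result as XML files. The data source is Renren (人人网).

Since I have only just started with Python, the code is not yet very Pythonic.

The core code is as follows: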



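#!/usr/bin/python
import json
import os
import time
import urllib.request
import xml.dom.minidom
from html.parser import HTMLParser


class Dept():
    # Base record: every entity has an id and a name.
    def __init__(self):
        self.id = 0
        self.name = ''


class University(Dept):
    def __init__(self):
        super().__init__()
        # Per-instance list; a class-level list would be shared by all objects.
        self.depts = []


class City(Dept):
    def __init__(self):
        super().__init__()
        self.universities = []


class Country(Dept):
    def __init__(self):
        super().__init__()
        self.cities = []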

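class MyHtmlParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self.depts = []

    def handle_starttag(self, tag, attrs):
        # Department names are read from the value attribute of <option> tags.
        if tag == 'option':
            # attrs is a list of (name, value) pairs; keep non-empty values only.
            for name, value in attrs:
                if name == 'value' and value:
                    self.depts.append(value)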

def readDept(code):
    # Fetch the department list for one university id.
    url = 'http://www.renren.com/GetDep.do?id=' + str(code)
    with urllib.request.urlopen(url) as resp:
        # The response is decoded as GBK; each line is stripped before joining.
        html = ''.join(line.strip().decode('gbk') for line in resp)
    hp = MyHtmlParser()
    hp.feed(html)
    depts = []
    for inst in hp.depts:
        dept = Dept()
        dept.name = inst
        depts.append(dept)
    return depts

def writeXml(city):
    # Build a <city> document holding the city's name, id and universities.
    impl = xml.dom.minidom.getDOMImplementation()
    dom = impl.createDocument(None, 'city', None)
    root = dom.documentElement
    filename = city.name + '.xml'
    nameE = dom.createElement('name')
    nameT = dom.createTextNode(city.name)
    idE = dom.createElement('id')
    idT = dom.createTextNode(str(city.id))
    nameE.appendChild(nameT)
    idE.appendChild(idT)
    root.appendChild(nameE)
    root.appendChild(idE)
    univs = dom.createElement('universities')
    root.appendChild(univs)
    for uni in city.universities:
        # One <university> element with <name>, <id> and <departments>.
        universityE = dom.createElement('university')
        univs.appendChild(universityE)
        uniE = dom.createElement('name')
        uniT = dom.createTextNode(uni.name)
        uidE = dom.createElement('id')
        uidT = dom.createTextNode(str(uni.id))
        uniE.appendChild(uniT)
        uidE.appendChild(uidT)
        universityE.appendChild(uniE)
        universityE.appendChild(uidE)
        deptsE = dom.createElement('departments')
        universityE.appendChild(deptsE)
        for dep in uni.depts:
            deptE = dom.createElement('department')
            deptsE.appendChild(deptE)
            deptNameE = dom.createElement('name')
            deptIdE = dom.createElement('id')
            deptT = dom.createTextNode(dep.name)
            deptIdT = dom.createTextNode(str(dep.id))
            deptNameE.appendChild(deptT)
            deptIdE.appendChild(deptIdT)
            deptE.appendChild(deptNameE)
            # Attach the <id> element as well, so it is not left out of the tree.
            deptE.appendChild(deptIdE)
    # Mode 'w' truncates an existing file, so no separate delete is needed.
    with open(filename, 'w', encoding='utf-8') as f:
        dom.writexml(f, addindent='  ', newl='\n', encoding='utf-8')
    print('write xml :' + city.name + '.xml')

def mkdir(path):
    # Create the directory (and any missing parents) if it does not exist yet.
    path = path.strip().rstrip("/")
    if not os.path.exists(path):
        os.makedirs(path)

def readData(content):
    countries = []
    jdata = json.loads(content)
    # Only the first 100 top-level entries are processed.
    for entry in jdata[:100]:
        country = Country()
        country.name = entry['name']
        country.id = entry['id']
        for prov in entry['provs']:
            # Each entry under 'provs' is stored as a City.
            city = City()
            city.name = prov['name']
            city.id = prov['id']
            city.universities = []
            country.cities.append(city)
            for dic in prov['univs']:
                university = University()
                university.id = dic['id']
                university.name = dic['name']
                # One HTTP request per university to fetch its departments.
                university.depts = readDept(university.id)
                city.universities.append(university)
                print('city = ' + city.name + '\tuniversity = ' + university.name)
            # Write one XML file per city once its universities are loaded.
            writeXml(city)
        countries.append(country)
    return countries

print('Start time: ' + time.strftime('%Y-%m-%d %H:%M:%S'))
# 'data' holds the JSON text that readData parses.
with open('data', 'r') as f:
    content = f.read()
countries = readData(content)
print('End time: ' + time.strftime('%Y-%m-%d %H:%M:%S'))
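
The parsing step in the listing above collects department names from `<option>` tags. Here is a small standalone sketch of that step, reading only the `value` attribute. The class name `OptionValueParser` and the sample HTML are made up for illustration; the real response from the site may be shaped differently.

```python
from html.parser import HTMLParser


class OptionValueParser(HTMLParser):
    """Collects the non-empty value attribute of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != 'option':
            return
        # attrs is a list of (name, value) pairs for the tag.
        for name, value in attrs:
            if name == 'value' and value:
                self.values.append(value)


# Invented sample input, not taken from the real site.
SAMPLE_HTML = (
    '<select>'
    '<option value="">Choose a department</option>'
    '<option value="School of Physics">School of Physics</option>'
    '<option value="School of Law">School of Law</option>'
    '</select>'
)

parser = OptionValueParser()
parser.feed(SAMPLE_HTML)
print(parser.values)
```

The placeholder option with an empty `value` is skipped, so only real entries end up in the list.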

The data file read above was obtained from the following site:

http://s.xnimg.cn/allunivlist.js
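
Judging only by the keys that `readData` accesses (`name`, `id`, `provs`, `univs`), the `data` file is expected to contain a JSON array of countries, each with a list of provinces that in turn list their universities. The sketch below walks a structure of that shape without any network access. The sample content and the helper name `list_universities` are invented for illustration, and the real file may contain additional fields.

```python
import json

# Invented sample in the shape that readData appears to expect.
SAMPLE = '''
[
  {"id": 0, "name": "Country A", "provs": [
    {"id": 1, "name": "City X", "univs": [
      {"id": 1001, "name": "First University"},
      {"id": 1002, "name": "Second University"}
    ]},
    {"id": 2, "name": "City Y", "univs": []}
  ]}
]
'''


def list_universities(content):
    """Return (country, city, university id, university name) tuples."""
    rows = []
    for country in json.loads(content):
        for prov in country['provs']:
            for univ in prov['univs']:
                rows.append(
                    (country['name'], prov['name'], univ['id'], univ['name'])
                )
    return rows


rows = list_universities(SAMPLE)
for row in rows:
    print(row)
```

Checking the nesting on a small sample like this is a cheap way to confirm the keys before starting a long run that issues one request per university.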
