A Python web scraping example

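I have been learning Python recently and just finished a web scraping example: it uses Python to fetch data on universities around the world and their departments, and saves the data as XML files. The data source is Renren (人人网).

Since I have only just started with Python, the code is not very Pythonic yet.

The core code is as follows: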



#!/usr/bin/env python3
import json
import os
import time
import urllib.request
import xml.dom.minidom
from html.parser import HTMLParser

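class Dept:
    # Base record: every entity has an id and a name.
    def __init__(self):
        self.id = 0
        self.name = ''

class University(Dept):
    def __init__(self):
        super().__init__()
        self.depts = []  # one list per instance, not shared by the class

class City(Dept):
    def __init__(self):
        super().__init__()
        self.universities = []

class Country(Dept):
    def __init__(self):
        super().__init__()
        self.cities = []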

class MyHtmlParser(HTMLParser):
    # Collects the value attribute of every <option> tag.
    def __init__(self):
        super().__init__()
        self.depts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            for name, value in attrs:
                # Skip other attributes and empty placeholder values.
                if name == 'value' and value:
                    self.depts.append(value)

def readDept(code):
    # Fetch the <option> list of departments for one university id.
    url = 'http://www.renren.com/GetDep.do?id=' + str(code)
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode('gbk')
    hp = MyHtmlParser()
    hp.feed(html)
    hp.close()
    depts = []
    for inst in hp.depts:
        dept = Dept()
        dept.name = inst
        depts.append(dept)
    return depts

def writeXml(city):
    # Write one city, its universities and their departments to '<city name>.xml'.
    impl = xml.dom.minidom.getDOMImplementation()
    dom = impl.createDocument(None, 'city', None)
    root = dom.documentElement

    def addText(parent, tag, text):
        # Append <tag>text</tag> to parent.
        elem = dom.createElement(tag)
        elem.appendChild(dom.createTextNode(str(text)))
        parent.appendChild(elem)

    addText(root, 'name', city.name)
    addText(root, 'id', city.id)
    univs = dom.createElement('universities')
    root.appendChild(univs)
    for uni in city.universities:
        universityE = dom.createElement('university')
        univs.appendChild(universityE)
        addText(universityE, 'name', uni.name)
        addText(universityE, 'id', uni.id)
        deptsE = dom.createElement('departments')
        universityE.appendChild(deptsE)
        for dep in uni.depts:
            deptE = dom.createElement('department')
            deptsE.appendChild(deptE)
            # Each department gets both a <name> and an <id> child.
            addText(deptE, 'name', dep.name)
            addText(deptE, 'id', dep.id)
    filename = city.name + '.xml'
    # Mode 'w' truncates an existing file, so no explicit os.remove is needed.
    with open(filename, 'w', encoding='utf-8') as f:
        dom.writexml(f, addindent='  ', newl='\n', encoding='utf-8')
    print('write xml: ' + filename)

def mkdir(path):
    # Create the directory (and any missing parents) if it does not exist yet.
    path = path.strip().rstrip('/')
    os.makedirs(path, exist_ok=True)

def readData(content):
    # Build Country -> City -> University objects from the JSON text and
    # write one XML file per city along the way.
    countries = []
    for item in json.loads(content):
        country = Country()
        country.name = item['name']
        country.id = item['id']
        country.cities = []
        for prov in item['provs']:
            city = City()
            city.name = prov['name']
            city.id = prov['id']
            city.universities = []
            country.cities.append(city)
            for dic in prov['univs']:
                university = University()
                university.id = dic['id']
                university.name = dic['name']
                # One HTTP request per university to fetch its departments.
                university.depts = readDept(university.id)
                city.universities.append(university)
                print('city = ' + city.name + '\tuniversity = ' + university.name)
            writeXml(city)
        countries.append(country)
    return countries

print('Start time: ' + time.strftime('%Y-%m-%d %H:%M:%S'))
with open('data', 'r') as f:
    content = f.read()
countries = readData(content)
print('End time: ' + time.strftime('%Y-%m-%d %H:%M:%S'))
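The department parser above keeps only the `value` attribute of each `<option>` tag and stores it as the department name, so `Dept.id` is never filled in. The following is a minimal sketch of a parser that collects both the value and the visible text. It assumes the response consists of `<option value="...">text</option>` elements, which the code above implies but does not show; `OptionParser` and the sample markup are hypothetical and not taken from the site.

```python
from html.parser import HTMLParser


class OptionParser(HTMLParser):
    """Collect (value, text) pairs from <option> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.options = []    # finished (value, text) pairs
        self._value = None   # value of the <option> currently open, if any
        self._text = []      # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self._value = dict(attrs).get('value')
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'option' and self._value is not None:
            self.options.append((self._value, ''.join(self._text).strip()))
            self._value = None


# Made-up sample markup, not a real response from the site.
sample = ('<select><option value="">Choose</option>'
          '<option value="101">Physics</option>'
          '<option value="102">Math</option></select>')
parser = OptionParser()
parser.feed(sample)
parser.close()
# Drop the placeholder entry whose value is empty.
options = [(value, text) for value, text in parser.options if value]
print(options)
```

Whether the value or the text is the better fit for `Dept.name` depends on what the endpoint actually returns, so check a real response before adopting this.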

The data file read at the end of the script was obtained from the following URL:

http://s.xnimg.cn/allunivlist.js
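The script passes the contents of `data` straight to `json.loads`, so the file has to contain pure JSON. If the downloaded `.js` file wraps the array in a JavaScript assignment (an assumption; the post does not say how the file was prepared), the array can be cut out before parsing. A minimal sketch, where `extract_json_array`, the variable name `someList`, and the sample string are all hypothetical:

```python
import json


def extract_json_array(js_text):
    """Parse the text between the first '[' and the last ']' as JSON."""
    start = js_text.find('[')
    end = js_text.rfind(']')
    if start == -1 or end < start:
        raise ValueError('no JSON array found')
    return json.loads(js_text[start:end + 1])


# Made-up sample in the shape readData expects; not real site data.
sample = ('var someList = [{"id": "00", "name": "X", '
          '"provs": [{"id": 1, "name": "P", "univs": []}]}];')
countries = extract_json_array(sample)
print(countries[0]['name'], len(countries[0]['provs']))
```

This only works when the array itself is valid JSON (double-quoted keys and strings); anything looser would need a more tolerant parser.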
