Writing a Web Crawler in Python to Scrape Code from an OJ

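The OJ is about to be upgraded and submitted code may be lost, so it has to be backed up beforehand. At first I naively copied and pasted by hand. When that became unbearable, inspired by Daxiao (大潇) and starting from Congshen's (聪神) original code, I got a web crawler going!

It has been a while since I last touched Python. The original crawler code targets Python 2.7. I tried porting it to Python 3, but that requires replacing a lot of packages, which felt tedious, so I simply improved the 2.7 version instead.

First, here is the original code, with some comments I added: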
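For reference, the package replacements mentioned above are mostly module renames. The following is a minimal Python 3 sketch of the cookie-aware opener and the login form encoding used in this post. It is not the code the post actually runs; the helper names make_opener and encode_login and the credentials are made up for illustration.

```python
# Python 3 spellings of the Python 2.7 modules used in this post.
import http.cookiejar      # Python 2: cookielib
import urllib.parse        # Python 2: urllib (urlencode)
import urllib.request      # Python 2: urllib2


def make_opener():
    """Build an opener that keeps session cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def encode_login(user_name, password):
    """Encode the login form; Python 3 expects bytes as POST data."""
    form = {"userName": user_name, "password": password}
    return urllib.parse.urlencode(form).encode("ascii")


opener, jar = make_opener()
post_data = encode_login("someuser", "secret")  # placeholder credentials
```

No request is sent here. With Python 3, response bodies also arrive as bytes and need to be decoded before the regular expressions below can be applied as text patterns.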

# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

# Install a global opener that keeps cookies, so the login session is reused by later requests
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
urllib2.install_opener(opener)

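# Regular expressions used to strip markup from the scraped pages
class Tool:
    # A-J match HTML entities and tag fragments
    A = re.compile(r"&nbsp;")
    B = re.compile(r"<BR>")
    C = re.compile(r"&lt;")
    D = re.compile(r"&gt;")
    E = re.compile(r"&quot;")
    F = re.compile(r"&amp;")
    G = re.compile(r'Times New Roman">')
    H = re.compile(r"</font>")
    # Apostrophe entity; its exact spelling was lost, so both common forms are matched
    I = re.compile(r"&#39;|&apos;")
    # "语言" is the "Language" label on the code page
    J = re.compile(r'语言.*?face=')

    def replace_char(self,x):
        # Replace each entity or tag fragment with its plain-text equivalent
        x=self.A.sub(" ",x)
        x=self.B.sub("\n\t",x)
        x=self.C.sub("<",x)
        x=self.D.sub(">",x)
        x=self.E.sub("\"",x)
        x=self.F.sub("&",x)
        x=self.G.sub("",x)
        x=self.H.sub("",x)
        x=self.I.sub("\'",x)
        x=self.J.sub("",x)
        return x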

class HTML_Model:
    def __init__(self,u,p):
        self.userName = u                  # login credentials: user name and password
        self.passWord = p
        self.mytool = Tool()
        self.page = 1                      # start crawling from the first status page
        self.postdata = urllib.urlencode({
            'userName':self.userName,
            'password':self.passWord
        })

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        # The request carries the URL and the login form
        req=urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        # Opening this URL performs the login
        myResponse = urllib2.urlopen(req)
        # Read the response page
        myPage = myResponse.read()

        flag = True
        # Keep fetching the next page while flag is True
        while flag:
            # URL of the next status page
            myUrl="http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            myResponse = urllib2.urlopen(myUrl)
            # Read the status page
            myPage = myResponse.read()
            # Search the current page to decide whether to continue, and update flag.
            # A page that lists submitted code contains links such as
            # <a href="/acmhome/solutionCode.do?id=4af76cc2459a0dd30145eb3dd1671dc5" target="_blank">G++</a>
            # So if there are only 84 pages of code, flag becomes False on page 85
            # and page 86 is never requested.
            st="\<a\ href\=.*?G\+\+"
            next = re.search(st,myPage)
            #print(st)
            print(next)
            if next:
                flag=True
                print("True")
            else:
                flag=False
                print("False")
            #print(myPage)
            # Collect the links to every solution on the current page into the list myItem
            myItem = re.findall(r'<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ',myPage,re.S)
            for item in myItem:
                #print(item)
                # Visit the page behind each solution link
                url='http://acm.njupt.edu.cn/acmhome/solutionCode.do?id='+item[37:len(item)-2]
                #print(url)
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                mytem = re.findall(r'语言.*?</font>.*?Times New Roman\"\>.*?\</font\>',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?</strong',myPage,re.S)
                #sName = sName[2:len(sName)]
                for sname in sName:
                    print(sname[2:len(sname)-8])
                    # sname contains the problem number
                    f = open(sname[2:len(sname)-8]+'.txt','w+')
                    # Write the code to the file after running it through the tag filter above
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            self.page = self.page+1

print u'plz input the name'
u=raw_input()
print u'plz input password'
p=raw_input()
#u = "B08020129"
#p = "*******"
myModel = HTML_Model(u,p)
myModel.GetPage()
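The listing above recovers each solution id by slicing the matched link at fixed offsets (item[37:len(item)-2]), which breaks as soon as the markup changes by a single character. A less brittle variant, sketched here in Python 3, lets a capture group return the id directly. The function name solution_ids is hypothetical, and the sample markup is the link quoted in the comment above.

```python
import re

# Capture the id itself instead of slicing the whole match at fixed offsets.
LINK_RE = re.compile(r'<a href="/acmhome/solutionCode\.do\?id=([^"]+)"')


def solution_ids(status_page):
    """Return the solution ids linked from one status page."""
    return LINK_RE.findall(status_page)


sample = ('<td><a href="/acmhome/solutionCode.do?id='
          '4af76cc2459a0dd30145eb3dd1671dc5" target="_blank">G++</a></td>')

ids = solution_ids(sample)
has_more = bool(ids)  # an empty list means we have run past the last page
```

The same list can drive the loop condition, so the separate search for "G++" is no longer needed and submissions in other languages would not end the crawl early.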

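This code has two problems:

First, the tag patterns do not match across lines, so the scraped code still contains tags that span several lines, and the pure code still has to be extracted by hand.

Second, the code page carries no information about the problem's title, so the problem number alone is used as the file name. If the upgraded OJ reorders its problems, the saved code can no longer be matched to its problem.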

The first problem is easy to fix:

Pass re.DOTALL as the second argument when compiling the pattern, so that "." also matches newline characters.

For example:

J = re.compile(r'语言.*?face=',re.DOTALL)
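The effect of the flag can be checked in isolation. The snippet below is a small Python 3 sketch; the string in snippet is an invented stand-in for a code page in which the "语言" label and the font tag sit on different lines.

```python
import re

# Invented stand-in for a code page: the label and the font tag are on separate lines.
snippet = '语言: G++\n<font face="Times New Roman">int main() {}'

single_line = re.compile(r'语言.*?face=')             # "." stops at the newline
multi_line = re.compile(r'语言.*?face=', re.DOTALL)   # "." also matches the newline

no_match = single_line.search(snippet)
cleaned = multi_line.sub("", snippet)
```

Without the flag the pattern finds nothing, so the header survives in the saved file. With it, everything up to face= is removed and only the quoted font name and the code remain.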

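For the second problem, the problem number can be used to fetch the problem's own page (rather than the code page used so far), and the title can then be extracted from that page.

On the problem page, I found that only the title is wrapped in <strong></strong> tags, so it can be matched like this: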

sName2=re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)

Also, the file name must not contain spaces, so they have to be stripped out:

sname2=sname2.replace(" ","")
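Put together, the two steps look roughly like this in Python 3. The function name problem_title and the sample page are made up for illustration. The sketch also returns None when no title is found, a case where indexing the result with [0] would raise an IndexError.

```python
import re

TITLE_RE = re.compile(r'<strong>([^<]+)</strong>', re.S)


def problem_title(problem_page):
    """Return the first <strong> text with spaces removed, or None if absent."""
    titles = TITLE_RE.findall(problem_page)
    if not titles:
        return None
    return titles[0].replace(" ", "")


page = '<h2><strong>A + B Problem</strong></h2>'  # invented sample markup
title = problem_title(page)
```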

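Even so, creating the file still occasionally raises an exception, although running the script again may make the problem go away.

Below is the improved code. The changed lines are marked with "# changed" comments.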
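The post does not say what causes these exceptions. One possible cause, and this is only an assumption, is a title containing characters that the file system rejects, such as "/" or "?". The Python 3 sketch below replaces such characters before the file is opened. The name safe_filename and the character set are illustrative choices, not part of the original script.

```python
import re

# Characters that Windows forbids in file names, plus whitespace.
UNSAFE = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(problem_id, title):
    """Build '<id>.<title>.txt' with unsafe characters replaced by underscores."""
    cleaned = UNSAFE.sub("_", title).strip("_")
    return "{}.{}.txt".format(problem_id, cleaned or "untitled")


name = safe_filename("1001", "A/B Problem?")
```

If the failures come from the network rather than from file names, this will not help, and wrapping the request in a retry would be the thing to try instead.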

# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
urllib2.install_opener(opener)

class Tool:
    A = re.compile(r"&nbsp;")
    B = re.compile(r"<BR>")
    C = re.compile(r"&lt;")
    D = re.compile(r"&gt;")
    E = re.compile(r"&quot;")
    F = re.compile(r"&amp;")
    G = re.compile(r'"Times New Roman">')              # changed: now includes the opening quote
    H = re.compile(r"</font>")
    # Apostrophe entity; its exact spelling was lost, so both common forms are matched
    I = re.compile(r"&#39;|&apos;")
    J = re.compile(r'语言.*?face=',re.DOTALL)           # changed: match across lines

    def replace_char(self,x):
        x=self.A.sub(" ",x)
        x=self.B.sub("\n\t",x)
        x=self.C.sub("<",x)
        x=self.D.sub(">",x)
        x=self.E.sub("\"",x)
        x=self.F.sub("&",x)
        x=self.G.sub("",x)
        x=self.H.sub("",x)
        x=self.I.sub("\'",x)
        x=self.J.sub("",x)
        return x

class HTML_Model:
    def __init__(self,u,p):
        self.userName = u
        self.passWord = p
        self.mytool = Tool()
        self.page = 81                     # changed: start page (was 1)
        self.postdata = urllib.urlencode({
            'userName':self.userName,
            'password':self.passWord
        })

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        req=urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        flag = True
        while flag:
            myUrl="http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            myResponse = urllib2.urlopen(myUrl)
            myPage = myResponse.read()
            st="\<a\ href\=.*?G\+\+"
            next = re.search(st,myPage)
            #print(st)
            print(next)
            if next:
                flag=True
                print("True")
            else:
                flag=False
                print("False")
            #print(myPage)
            myItem = re.findall(r'<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ',myPage,re.S)
            for item in myItem:
                #print(item)
                url='http://acm.njupt.edu.cn/acmhome/solutionCode.do?id='+item[37:len(item)-2]
                #print(url)
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                mytem = re.findall(r'语言.*?</font>.*?Times New Roman\"\>.*?\</font\>',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?</strong',myPage,re.S)
                #sName = sName[2:len(sName)]
                for sname in sName:
                    # changed: fetch the problem page by problem number and read its title
                    url2="http://acm.njupt.edu.cn/acmhome/problemdetail.do?&method=showdetail&id="+sname[8:len(sname)-8]
                    myResponse2=urllib2.urlopen(url2)
                    myPage2=myResponse2.read()
                    sName2=re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)
                    sname2=sName2[0]
                    sname2=sname2.replace(" ","")
                    # print(sName)
                    # changed: the file name is now "<problem number>.<title>.txt"
                    print(sname[8:len(sname)-8]+'.'+sname2)
                    f = open(sname[8:len(sname)-8]+'.'+sname2+'.txt','w+')
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            print(self.page)               # changed: show progress
            self.page = self.page+1

#print u'plz input the name'
#u=raw_input()
#print u'plz input password'
#p=raw_input()
# changed: credentials are hardcoded instead of prompted for
u = "LTianchao"
p = "******"
myModel = HTML_Model(u,p)
myModel.GetPage()



As far as scraping web pages with Python goes, this is only a very simple demo. There is still a lot left to learn in depth (if I find the time).
