Writing a web crawler in Python to scrape code from an OJ

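The OJ is being upgraded and submitted code may be lost, so it has to be backed up beforehand. At first I naively copied and pasted, but eventually I couldn't stand it any longer. Thanks to a hint from Daxiao and the original code from Congshen, a web crawler it is!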

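I hadn't touched Python for a while. The original crawler code targets Python 2.7. I tried porting it to Python 3.0, but that requires replacing quite a few packages, which felt tedious, so I simply improved it on 2.7.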
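For reference, most of the package replacements are renames in the standard library: `urllib2` becomes `urllib.request`, `urllib.urlencode` becomes `urllib.parse.urlencode`, `cookielib` becomes `http.cookiejar`, and `raw_input` becomes `input`. The following is only a minimal Python 3 sketch of the cookie and login-form setup; the helper names `build_cookie_opener` and `encode_login` are made up for illustration and are not part of the original code.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return an opener that stores cookies, plus the jar it writes to."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def encode_login(user, password):
    """Encode the login form; Python 3 expects POST data as bytes."""
    form = {"userName": user, "password": password}
    return urllib.parse.urlencode(form).encode("ascii")


opener, jar = build_cookie_opener()
post_data = encode_login("alice", "secret")  # placeholder credentials
# opener.open(login_url, post_data) would then submit the form.
```

Note that in Python 3 the response body is `bytes`, so it has to be decoded (for this site presumably with a GBK-family codec, which is an assumption based on the `cp936` header) before applying text regular expressions.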

First, let's look at the original code, to which I have added some comments:

# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

# Install a global opener that keeps cookies, so the login session persists
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
urllib2.install_opener(opener)

# Regular expressions used to filter tag and entity residue out of the scraped page
class Tool:
    A = re.compile("&nbsp\;")                       # A-J each match one tag or entity
    B = re.compile("\<BR\>")
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp;")
    G = re.compile("Times\ New\ Roman\"\>")
    H = re.compile("\</font\>")
    I = re.compile("&#39;")                         # apostrophe entity (assumed to be &#39;)
    J = re.compile(r'语言.*?face=')

    def replace_char(self,x):                      # replace each match with its plain-text form
        x=self.A.sub(" ",x)
        x=self.B.sub("\n\t",x)
        x=self.C.sub("<",x)
        x=self.D.sub(">",x)
        x=self.E.sub("\"",x)
        x=self.F.sub("&",x)
        x=self.G.sub("",x)
        x=self.H.sub("",x)
        x=self.I.sub("\'",x)
        x=self.J.sub("",x)
        return x

class HTML_Model:
    def __init__(self,u,p):
        self.userName = u                 # user name and password used to log in
        self.passWord = p
        self.mytool = Tool()
        self.page = 1                      # start crawling from the first page of submissions
        self.postdata = urllib.urlencode({
            'userName':self.userName,
            'password':self.passWord
        })

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        # The request carries the URL and the login form
        req=urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        # Open the URL, which submits the login
        myResponse = urllib2.urlopen(req)
        # Read the page
        myPage = myResponse.read()
        flag = True
        # Keep fetching the next page while flag is True
        while flag:
            # URL of the next status page
            myUrl="http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            # Open and read the next status page
            myResponse = urllib2.urlopen(myUrl)
            myPage = myResponse.read()
            # Use a regex to decide whether there is a next page and update flag.
            # If the current page lists submitted code, it contains a tag like
            # <a href="/acmhome/solutionCode.do?id=4af76cc2459a0dd30145eb3dd1671dc5" target="_blank">G++</a>
            # So if my code spans only 84 pages, flag becomes False on page 85
            # and page 86 is never visited.
            st="\<a\ href\=.*?G\+\+"
            next = re.search(st,myPage)
            #print(st)
            print(next)
            if next:
                flag=True
                print("True")
            else:
                flag=False
                print("False")
            #print(myPage)
            # Collect the links to every solution on the current page into the list myItem
            myItem = re.findall(r'<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ',myPage,re.S)
            for item in myItem:
                #print(item)
                # Visit the page behind each solution link
                url='http://acm.njupt.edu.cn/acmhome/solutionCode.do?id='+item[37:len(item)-2]
                #print(url)
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                mytem = re.findall(r'语言.*?</font>.*?Times New Roman\"\>.*?\</font\>',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?</strong',myPage,re.S)
                #sName = sName[2:len(sName)]
                for sname in sName:
                    print(sname[2:len(sname)-8])
                    # sname contains the problem number
                    f = open(sname[2:len(sname)-8]+'.txt','w+')
                    # Run the code through the tag filter above and write it to the file
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            self.page = self.page+1

print u'plz input the name'
u=raw_input()
print u'plz input password'
p=raw_input()
#u = "B08020129"
#p = *******"
myModel = HTML_Model(u,p)
myModel.GetPage()
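The Tool class above undoes HTML entities one pattern at a time. As an alternative sketch only (Python 3, not what the original code does), the standard library's `html.unescape` can decode all entities in one call after the tags are stripped. The function name `clean_source` and the sample fragment are made up for illustration, and unlike the original this version turns `<BR>` into a plain newline without a tab.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def clean_source(fragment):
    """Turn an HTML fragment holding escaped source code into plain text."""
    text = BR_TAG.sub("\n", fragment)      # line breaks first
    text = ANY_TAG.sub("", text)           # then drop remaining tags
    text = html.unescape(text)             # &lt; &gt; &amp; &quot; &nbsp; ...
    return text.replace("\xa0", " ")       # &nbsp; decodes to U+00A0


sample = "int&nbsp;a&lt;b;<BR>return&nbsp;0;</font>"
print(clean_source(sample))
```

Stripping tags before unescaping matters: literal `<` characters in the submitted code are still escaped as `&lt;` at that point, so they are not mistaken for tags.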

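This code currently has two problems:

First, the tag matching does not support multiple lines, so the scraped code still contains tags that span several lines, and the pure code still has to be extracted by hand.

Second, the code page carries no problem title, so only the problem number is used as the file name. If the problem order changes after the OJ upgrade, the code can no longer be matched to its problem.

The fix for the first problem is fairly simple:

When compiling the regular expression, pass re.DOTALL as the second argument, so that . also matches newlines.

For example: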

J = re.compile(r'语言.*?face=',re.DOTALL)
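The effect of the flag can be seen on a small made-up fragment (Python 3 sketch; the sample string and the ASCII word "Language" standing in for the page's label are assumptions for illustration). Without `re.DOTALL`, `.` stops at the newline, so the lazy `.*?` can never reach `face=` on the next line.

```python
import re

# A tag whose attributes are split across two lines, as on the scraped page.
fragment = 'Language: G++ <font\n size="3" face="Times New Roman">int main() {}</font>'

single_line = re.compile(r'Language.*?face=')
multi_line = re.compile(r'Language.*?face=', re.DOTALL)

print(single_line.search(fragment))        # no match: '.' does not cross '\n'
print(multi_line.sub("", fragment))        # prefix removed across the line break
```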

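For the second problem, we can look up the problem page by its problem number (instead of the code page used so far) and extract the title from the problem page.

On the problem page I found that only the title is wrapped in <strong></strong> tags, so it can be matched like this: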

sName2=re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)

The file name should also not contain spaces, so the spaces are filtered out of the first matched title:

sname2=sname2.replace(" ","")
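Put together, the title lookup can be sketched as a small function (Python 3; `extract_title` and the sample page are hypothetical, and the assumption that the first `<strong>` element is the title comes from the observation above, not from any site documentation). Returning `None` when nothing matches avoids the `IndexError` that `sName2[0]` would raise on an unexpected page.

```python
import re

TITLE = re.compile(r'<strong>([^<]+)</strong>', re.S)


def extract_title(page):
    """Return the first <strong> text with spaces removed, or None."""
    titles = TITLE.findall(page)
    if not titles:
        return None
    return titles[0].replace(" ", "")


sample_page = "<td><strong>A + B Problem</strong></td>"
print(extract_title(sample_page))
```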

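Even so, creating the file still sometimes throws an exception, although running the script again may make the problem go away.

Below is the improved code (the modified parts were in bold in the original post).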
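The cause of these exceptions is not identified here. One plausible cause, and it is only a guess, is a problem title containing characters that are not allowed in file names (for example `/`, `?` or `:`). A defensive sketch in Python 3 that replaces such characters is shown below; `safe_filename` is a hypothetical helper, not part of the original code.

```python
import re

# Characters rejected by common file systems, plus whitespace.
INVALID = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(problem_id, title):
    """Build '<id>.<title>.txt' with unsafe characters replaced by '_'."""
    cleaned = INVALID.sub("_", title).strip("_")
    return "%s.%s.txt" % (problem_id, cleaned or "untitled")


print(safe_filename("1001", "A/B Problem?"))
```

If the failures are instead caused by network errors while fetching the problem page, this helper will not help, and wrapping the request in a retry would be the thing to try.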

# -*- coding: cp936 -*-
import urllib2
import urllib
import re
import thread
import time
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
urllib2.install_opener(opener)

class Tool:
    A = re.compile("&nbsp\;")
    B = re.compile("\<BR\>")
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp;")
    G = re.compile("\"Times\ New\ Roman\"\>")       # now also consumes the opening quote
    H = re.compile("\</font\>")
    I = re.compile("&#39;")                         # apostrophe entity (assumed to be &#39;)
    J = re.compile(r'语言.*?face=',re.DOTALL)       # re.DOTALL lets the match span lines

    def replace_char(self,x):
        x=self.A.sub(" ",x)
        x=self.B.sub("\n\t",x)
        x=self.C.sub("<",x)
        x=self.D.sub(">",x)
        x=self.E.sub("\"",x)
        x=self.F.sub("&",x)
        x=self.G.sub("",x)
        x=self.H.sub("",x)
        x=self.I.sub("\'",x)
        x=self.J.sub("",x)
        return x

class HTML_Model:
    def __init__(self,u,p):
        self.userName = u
        self.passWord = p
        self.mytool = Tool()
        self.page = 81
        self.postdata = urllib.urlencode({
            'userName':self.userName,
            'password':self.passWord
        })

    def GetPage(self):
        myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
        req=urllib2.Request(
            url = myUrl,
            data = self.postdata
            )
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        flag = True
        while flag:
            myUrl="http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName="+self.userName+"&result=1&language=&page="+str(self.page)
            #print(myUrl)
            myResponse = urllib2.urlopen(myUrl)
            myPage = myResponse.read()
            st="\<a\ href\=.*?G\+\+"
            next = re.search(st,myPage)
            #print(st)
            print(next)
            if next:
                flag=True
                print("True")
            else:
                flag=False
                print("False")
            #print(myPage)
            myItem = re.findall(r'<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ',myPage,re.S)
            for item in myItem:
                #print(item)
                url='http://acm.njupt.edu.cn/acmhome/solutionCode.do?id='+item[37:len(item)-2]
                #print(url)
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                mytem = re.findall(r'语言.*?</font>.*?Times New Roman\"\>.*?\</font\>',myPage,re.S)
                #print(mytem)
                sName = re.findall(r'源码--.*?</strong',myPage,re.S)
                #sName = sName[2:len(sName)]
                for sname in sName:
                    # Fetch the problem page by problem number and take its title
                    url2="http://acm.njupt.edu.cn/acmhome/problemdetail.do?&method=showdetail&id="+sname[8:len(sname)-8]
                    myResponse2=urllib2.urlopen(url2)
                    myPage2=myResponse2.read()
                    sName2=re.findall(r'<strong>([^<]+)</strong>',myPage2,re.S)
                    sname2=sName2[0]
                    # Remove spaces from the title before using it in the file name
                    sname2=sname2.replace(" ","")
                    # print(sName)
                    print(sname[8:len(sname)-8]+'.'+sname2[0:len(sname2)])
                    # File name is "<problem number>.<title>.txt"
                    f = open(sname[8:len(sname)-8]+'.'+sname2[0:len(sname2)]+'.txt','w+')
                    f.write(self.mytool.replace_char(mytem[0]))
                    f.close()
                    print('done!')
            print(self.page)
            self.page = self.page+1

#print u'plz input the name'
#u=raw_input()
#print u'plz input password'
#p=raw_input()
u = "LTianchao"
p = "******"
myModel = HTML_Model(u,p)
myModel.GetPage()
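Two details of the listing are fragile: the solution id is cut out with fixed offsets (`item[37:len(item)-2]`), and the "is there a next page" test looks for `G++`, so it only works when every page contains a G++ submission. A possible alternative, sketched here in Python 3 with the hypothetical helpers `solution_ids` and `has_more`, is to capture the id with a regex group and to treat "no solution links" as the end of the listing. The sample row reuses the link format quoted in the comments of the first listing.

```python
import re

# Capture group 1 is the id value inside the solutionCode link.
SOLUTION_LINK = re.compile(r'<a href="/acmhome/solutionCode\.do\?id=([^"]+)"')


def solution_ids(status_page):
    """Return every solution id linked from one status page."""
    return SOLUTION_LINK.findall(status_page)


def has_more(status_page):
    """A page without solution links is taken to be past the last page."""
    return bool(solution_ids(status_page))


row = ('<a href="/acmhome/solutionCode.do?id=4af76cc2459a0dd30145eb3dd1671dc5" '
       'target="_blank">G++</a>')
print(solution_ids(row))
```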



As far as web scraping with Python goes, this is only a very simple demo. There is still a lot to study in depth (if time permits).
