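Learning Python Web Scraping (11): Writing an "AC Automaton" of My Own

0. Preface

This post records the birth of an "AC automaton": a bot that gets problems Accepted automatically.

I had seen people write one in C++, others in C#, and one in Node.js.

Their code struck me as too long, and their AC rates were not great either.

On the way back to the dorm I happened to chat about this with a junior schoolmate.

I thought it over briefly, decided the idea was quite simple, and wrote one on the spot. The results are acceptable.

First, a screenshot:

It could probably keep going: modifying the code or adding a few other search engines would let it AC more problems,

but I deliberately kept the AC count at around 3000, and deliberately stayed right behind the "Five Tiger Generals".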

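1. Crawler approach

The approach is very straightforward:

Simulate a login to HDU
For a given problem
- Search for AC code
  - Extract the code with a regular expression
  - Clean up the code with HTMLParser (unescape the HTML entities)
- Submit
  - If it gets AC, go back to the "for a given problem" step and move on to the next problem
  - Otherwise, keep submitting other code (at most 10 submissions per problem)
  - If it is still not AC after 10 submissions, give up on this problem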
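The submit-and-retry loop in the list above can be sketched on its own, without any network access. This is only an illustration in Python 3 (the script in this post is Python 2): `try_problem`, `submit`, and `is_accepted` are hypothetical names, and the two callbacks stand in for the real HDU requests.

```python
from itertools import islice
from typing import Callable, Iterable

MAX_ATTEMPTS = 10  # the cap described in the list above


def try_problem(pid: str,
                candidates: Iterable[str],
                submit: Callable[[str, str], None],
                is_accepted: Callable[[str], bool],
                max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Submit candidate solutions for one problem until one is accepted.

    Returns True as soon as the problem is accepted, and False once the
    candidates run out or max_attempts submissions have been made.
    """
    for code in islice(candidates, max_attempts):
        submit(pid, code)
        if is_accepted(pid):
            return True
    return False


# Offline demo with fake callbacks: the second candidate "gets AC".
solved = set()


def fake_submit(pid: str, code: str) -> None:
    if "correct" in code:
        solved.add(pid)


result = try_problem("1000", ["wrong answer", "correct answer"],
                     fake_submit, lambda pid: pid in solved)
```

Passing the submit and check steps in as functions keeps the retry policy separate from the HTTP details, so the 10-submission cap can be checked without touching the judge.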

2. Simple, brute-force code

# -*- coding: utf-8 -*-
# Python 2 script: log in to HDU, search CSDN blogs for solutions, submit them.
import requests, re, HTMLParser, time, getpass

host_url = 'http://acm.hdu.edu.cn'
post_url = 'http://acm.hdu.edu.cn/userloginex.php?action=login'
sub_url = 'http://acm.hdu.edu.cn/submit.php?action=submit'
csdn_url = 'http://so.csdn.net/so/search/s.do'
head = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36' }
html_parser = HTMLParser.HTMLParser()
s = requests.session()  # shared session, so the login cookie is reused when submitting

def login(usr, psw):
    s.get(host_url)  # visit the home page first to obtain a session cookie
    data = {'username':usr, 'userpass':psw, 'login':'Sign In'}
    s.post(post_url, data=data)

def check_lan(lan):
    # Map the class attribute of the code block to the language value
    # of the submit form: '5' for Java, '0' for everything else.
    if 'java' in lan:
        return '5'
    return '0'

def parser_code(code):
    # Turn HTML entities (&lt;, &gt;, &amp; ...) back into source code.
    return html_parser.unescape(code).encode('utf-8')

def is_ac(pid, usr):
    # Look for the problem id in the "solved problems" list on the user's status page.
    tmp = requests.get('http://acm.hdu.edu.cn/userstatus.php?user='+usr).text
    accept = re.search('List of solved problems</font></h3>.*?<p align=left><script language=javascript>(.*?)</script><br></p>', tmp, re.S)
    if accept is not None and pid in accept.group(1):
        print '%s was solved' % pid
        return True
    return False

def search_csdn(PID, usr):
    get_data = { 'q':'HDU ' + PID, 't':'blog', 'o':'', 's':'', 'l':'null' }
    search_html = requests.get(csdn_url, params=get_data).text
    linklist = re.findall('<dd class="search-link"><a href="(.*?)" target="_blank">', search_html, re.S)
    for l in linklist:
        print l
        tm_html = requests.get(l, headers=head).text
        # Only accept posts whose title mentions both "hdu" and the problem id.
        title = re.search('<title>(.*?)</title>', tm_html, re.S)
        if title is None:
            continue
        title = title.group(1).lower()
        if PID not in title or 'hdu' not in title:
            continue
        tmp = re.search('name="code" class="(.*?)">(.*?)</pre>', tm_html, re.S)
        if tmp is None:
            print 'code not found'
            continue
        LAN = check_lan(tmp.group(1))
        CODE = parser_code(tmp.group(2))
        # Skip anything that looks like neither C/C++ nor Java.
        if 'include' not in CODE and 'import java' not in CODE:
            continue
        print PID, LAN
        print '--------------'
        submit_data = { 'check':'0', 'problemid':PID, 'language':LAN, 'usercode':CODE }
        s.post(sub_url, headers=head, data=submit_data)
        time.sleep(5)  # give the judge time to finish before checking
        if is_ac(PID, usr):
            break

if __name__ == '__main__':
    usr = raw_input('input your username:')
    psw = getpass.getpass('input your password:')
    login(usr, psw)
    for pro_cnt in xrange(1000, 5680):  # problem ids 1000..5679
        PID = str(pro_cnt)
        if is_ac(PID, usr):
            continue
        search_csdn(PID, usr)

The code is short, fewer than 80 lines. Yes, that really is all of it!
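The extraction step can be tried offline on a small HTML snippet. The sketch below is Python 3, where `html.unescape` takes the place of the Python 2 `HTMLParser.unescape` call. The helper name `extract_code` and the sample page are made up for illustration. The regular expression and the language rule are the ones used in the script.

```python
import html
import re

# Same pattern as in the script: the class attribute names the language,
# the body of the <pre> block is the HTML-escaped source code.
CODE_RE = re.compile(r'name="code" class="(.*?)">(.*?)</pre>', re.S)


def extract_code(page):
    """Return (language_id, source) for the first code block, or None."""
    match = CODE_RE.search(page)
    if match is None:
        return None
    language = '5' if 'java' in match.group(1) else '0'
    source = html.unescape(match.group(2))
    # Keep only code that looks like C/C++ or Java, as the script does.
    if 'include' not in source and 'import java' not in source:
        return None
    return language, source


# Hypothetical blog page fragment.
sample = ('<pre name="code" class="cpp">#include &lt;cstdio&gt;\n'
          'int main(){ return 0; }</pre>')
extracted = extract_code(sample)
```

Here `extracted` holds the language id `'0'` and the source with `&lt;cstdio&gt;` turned back into `<cstdio>`. A page without a matching `<pre>` block gives `None`.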

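3. TODO

I have no plans to polish this post for now, and I do not recommend digging into this thing either. What I do recommend is learning real algorithms, haha!

Here is the (real, Aho-Corasick) AC automaton I wrote a long, long time ago: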

#include <cstdio>
#include <cstring>
#include <algorithm>
#include <queue>
using namespace std;

#define clr( a, b ) memset( a, b, sizeof(a) )

const int SIGMA_SIZE = 26;
const int NODE_SIZE = 500000 + 10;

struct ac_automaton{
    int ch[ NODE_SIZE ][ SIGMA_SIZE ];   // trie transitions
    // f: fail link, val: number of patterns ending at the node,
    // last: nearest proper suffix node at which a pattern ends
    int f[ NODE_SIZE ], val[ NODE_SIZE ], last[ NODE_SIZE ];
    int sz;                              // number of nodes in use

    void init(){
        sz = 1;
        clr( ch[0], 0 ), clr( val, 0 );
    }

    // Insert a lowercase pattern into the trie.
    void insert( char *s ){
        int u = 0, i = 0;
        for( ; s[i]; ++i ){
            int c = s[i] - 'a';
            if( !ch[u][c] ){
                clr( ch[sz], 0 );
                val[sz] = 0;
                ch[u][c] = sz++;
            }
            u = ch[u][c];
        }
        val[u]++;
    }

    // BFS over the trie to compute fail links and fill in missing transitions.
    void getfail(){
        queue<int> q;
        f[0] = 0;
        for( int c = 0; c < SIGMA_SIZE; ++c ){
            int u = ch[0][c];
            if( u ) f[u] = 0, q.push(u), last[u] = 0;
        }
        while( !q.empty() ){
            int r = q.front(); q.pop();
            for( int c = 0; c < SIGMA_SIZE; ++c ){
                int u = ch[r][c];
                if( !u ){
                    ch[r][c] = ch[ f[r] ][c];
                    continue;
                }
                q.push( u );
                int v = f[r];
                while( v && !ch[v][c] ) v = f[v];
                f[u] = ch[v][c];
                last[u] = val[ f[u] ] ? f[u] : last[ f[u] ];
            }
        }
    }

    // Count how many inserted patterns occur in s; each pattern is counted once.
    int work( char* s ){
        int res = 0;
        int u = 0, i = 0, e;
        for( ; s[i]; ++i ){
            int c = s[i] - 'a';
            u = ch[u][c];
            // Start from u if a pattern ends here, otherwise from the nearest
            // suffix node where one does, so that shorter patterns are not
            // missed when u itself is not the end of a pattern.
            e = val[u] ? u : last[u];
            while( e && val[e] ){
                res += val[e];
                val[e] = 0;
                e = last[e];
            }
        }
        return res;
    }
}ac;
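For readers who prefer Python, the same idea can be sketched as follows. This is not a line-by-line port: the class and method names are my own, it uses dictionaries instead of a fixed 26-letter table, and it walks the fail links directly rather than keeping a separate `last` array. Like `work` above, `count_matches` counts each inserted pattern at most once.

```python
from collections import deque


class AhoCorasick:
    def __init__(self):
        self.children = [{}]  # children[u][ch] -> child node index
        self.fail = [0]       # fail link of each node
        self.count = [0]      # number of patterns ending at each node

    def insert(self, word):
        u = 0
        for ch in word:
            if ch not in self.children[u]:
                self.children.append({})
                self.fail.append(0)
                self.count.append(0)
                self.children[u][ch] = len(self.children) - 1
            u = self.children[u][ch]
        self.count[u] += 1

    def build(self):
        # BFS from the root's children, whose fail links stay at the root.
        queue = deque(self.children[0].values())
        while queue:
            r = queue.popleft()
            for ch, u in self.children[r].items():
                queue.append(u)
                v = self.fail[r]
                while v and ch not in self.children[v]:
                    v = self.fail[v]
                self.fail[u] = self.children[v].get(ch, 0)

    def count_matches(self, text):
        seen = set()  # nodes whose patterns were already counted
        total = 0
        u = 0
        for ch in text:
            while u and ch not in self.children[u]:
                u = self.fail[u]
            u = self.children[u].get(ch, 0)
            e = u
            # Follow fail links; stop at the root or at a visited node,
            # since everything further up was counted on that earlier visit.
            while e and e not in seen:
                seen.add(e)
                total += self.count[e]
                e = self.fail[e]
        return total


ac = AhoCorasick()
for pattern in ["she", "he", "say", "shr", "her"]:
    ac.insert(pattern)
ac.build()
matches = ac.count_matches("yasherhs")  # "she", "he" and "her" occur
```

With these five patterns, `matches` is 3, because "say" and "shr" do not appear in the text.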
