Training a Chinese Language Data File for Tesseract 3.02
Download the chi_sim.traineddata language data file
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files
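0. Preparation
For convenience, name the tif file using the format [lang].[fontname].exp[num].tif
lang is the language and fontname is the font.
For example, to train a custom language named mjorcen with the font name normal,
rename the image file to mjorcen.normal.exp0.jpg (the sample image used here is a JPEG rather than a tif, so every command below uses the .jpg extension).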
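The naming convention can be checked mechanically before training starts. The following is a minimal shell sketch, not part of the Tesseract tools: the function name parse_training_name is made up for illustration, and it assumes that neither the language nor the font name contains a dot.

```shell
#!/bin/sh
# Hypothetical helper: split "[lang].[fontname].exp[num].[ext]" into its parts.
# Prints "lang fontname num" on success; returns 1 if the name does not match.
parse_training_name() {
    # Assumption: lang and fontname contain no dots; num is one or more digits.
    parts=$(basename "$1" |
        sed -n -E 's/^([^.]+)\.([^.]+)\.exp([0-9]+)\.[^.]+$/\1 \2 \3/p')
    if [ -z "$parts" ]; then
        echo "unexpected file name: $1" >&2
        return 1
    fi
    echo "$parts"
}
```

Calling parse_training_name mjorcen.normal.exp0.jpg prints the three parts separated by spaces, which makes it easy to confirm that the font name matches the entry written to font_properties later on.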
Image:
Now start training the language data:
1. Generate the .box file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
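Keep the image file and the box file in the same directory.
2. Open the image file with jTessBoxEditor.jar, then correct the box file so that it matches the actual characters in the image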
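After editing, it can help to confirm that the box file is still well formed. In Tesseract 3.x each box line has six space-separated fields: the symbol, then the left, bottom, right and top pixel coordinates, then the page number. The sketch below is only an illustration under that assumption (check_box_file is a made-up name); it checks the shape of each line, not whether the boxes are correct.

```shell
#!/bin/sh
# Hypothetical helper: report box-file lines that do not look like
# "<symbol> <left> <bottom> <right> <top> <page>".
# Returns 0 if every line passes, 1 otherwise.
check_box_file() {
    awk '
        BEGIN { bad = 0 }
        # Every line needs exactly six whitespace-separated fields.
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            # Fields 2 to 6 must be non-negative integers.
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not a number\n", NR, i
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}
```

For example, check_box_file mjorcen.normal.exp0.box prints one message per malformed line and returns a non-zero status if any were found.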

3. Generate the .tr file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
4. Generate the unicharset file
unicharset_extractor mjorcen.normal.exp0.box
5. Create a new font_properties file
Write the line normal 0 0 0 0 0 into it, which describes a default, regular font
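Each font_properties entry is the font name followed by five 0/1 flags, in the order italic, bold, fixed, serif, fraktur. As a rough sketch of how such a line could be produced and validated (font_properties_line is a hypothetical name, not a Tesseract command):

```shell
#!/bin/sh
# Hypothetical helper: print one font_properties entry.
# Usage: font_properties_line <fontname> [italic bold fixed serif fraktur]
font_properties_line() {
    if [ $# -lt 1 ]; then
        echo "usage: font_properties_line <fontname> [5 flags]" >&2
        return 1
    fi
    font=$1
    shift
    # With no flags given, default all five to 0 (a plain, regular font).
    if [ $# -eq 0 ]; then
        set -- 0 0 0 0 0
    fi
    if [ $# -ne 5 ]; then
        echo "expected 5 flags, got $#" >&2
        return 1
    fi
    for flag in "$@"; do
        case $flag in
            0|1) ;;
            *) echo "flag must be 0 or 1: $flag" >&2; return 1 ;;
        esac
    done
    echo "$font $*"
}
```

Running font_properties_line normal > font_properties would create the file described in this step. The font name must match the [fontname] part of the training image name.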
6. Run the commands
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
The output is as follows:
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 0.365385
Master shape_table:Number of shapes = max unichars = number with multiple unichars =
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ... Writing normproto ...
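7. Add the prefix normal. to the five files unicharset, inttemp, pffmtable, shapetable and normproto in the directory (for example, unicharset becomes normal.unicharset)
8. Run combine_tessdata normal.
9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory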
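The renaming in step 7 is easy to get wrong by hand, because combine_tessdata only picks up files that carry the language prefix. A minimal shell sketch of that step follows; prefix_training_files is a made-up helper, and it assumes it is run from the training directory.

```shell
#!/bin/sh
# Hypothetical helper: give the five training outputs the "<lang>." prefix
# that combine_tessdata looks for. Run it inside the training directory.
prefix_training_files() {
    lang=$1
    # Refuse to rename anything unless all five outputs are present.
    for f in unicharset inttemp pffmtable shapetable normproto; do
        if [ ! -f "$f" ]; then
            echo "missing training output: $f" >&2
            return 1
        fi
    done
    for f in unicharset inttemp pffmtable shapetable normproto; do
        mv "$f" "$lang.$f"
    done
}
```

After prefix_training_files normal succeeds, combine_tessdata normal. can be run as in step 8.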
10. Test
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
debug:
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Tesseract Open Source OCR Engine v3. with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3. with Leptonica
APPLY_BOXES:
Boxes read from boxfile:
Found good blobs.
TRAINING ... Font name = normal
Generated training data for words
E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 0.365385
Master shape_table:Number of shapes = max unichars = number with multiple unichars =
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ... Writing normproto ...
E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type is -
Offset for type is
Offset for type is -
Offset for type is
Offset for type is
Offset for type is
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is
Offset for type is -
Offset for type is -
Offset for type is -
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3. with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Tesseract Open Source OCR Engine v3. with Leptonica
Result with the trained normal data:
应收:
Result with the stock Chinese data (chi_sim):
应收= II苜
Script (requires a Java environment):
The directory structure is as follows:

The script is as follows:
Windows
@echo off
rem Arguments: %1 = image file name, %2 = font name, %3 = output base name (optional)
set "src=%~1"
set "font_name=%~2"
set "desc=%~3"

if not defined src set /p src=" please pass your filename : "
if not defined font_name set /p font_name=" please pass your font_name : "

rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit /b 1
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit /b 1
rem Default the output name to the file name minus its 4-character extension
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%

rem If the directory has no font_properties file, create it and write the font entry
if exist font_properties (
    echo font_properties exist
) else (
    rem The redirection comes first so that the trailing "0" is not read as a handle number
    >font_properties echo %font_name% 0 0 0 0 0
)

rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties

rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
ECHO Please change your results , and press any key to continue
pause>nul

tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr

rem Rename the new files with the font name as prefix
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto

combine_tessdata %font_name%.
if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
ECHO press any key to continue
pause>nul
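Usage:
Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output file name, which defaults to the input file name without its extension when omitted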
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
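The batch script derives the default output name by cutting the last four characters off the file name, which only works for three-letter extensions such as .jpg or .tif. For comparison, a shell version can remove just the final extension whatever its length. This is an illustrative sketch with a made-up function name, not part of the original script:

```shell
#!/bin/sh
# Hypothetical helper: derive the output base name from the image file name
# by removing only the final extension (".jpg", ".tiff", ...).
default_output_name() {
    # ${1%.*} strips the shortest suffix that starts with a dot.
    echo "${1%.*}"
}
```

For instance, default_output_name mjorcen.normal.exp0.jpg yields mjorcen.normal.exp0, the same base name the batch script computes.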
Example run:
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Too many unichars in ambiguity on line
Tesseract Open Source OCR Engine v3. with Leptonica
Please change your results , and press any key to continue
Tesseract Open Source OCR Engine v3. with Leptonica
APPLY_BOXES:
Boxes read from boxfile:
Found good blobs.
TRAINING ... Font name = normal
Generated training data for words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 999.000000
Computing shape distances...
Stopped with merged, min dist 0.365385
Master shape_table:Number of shapes = max unichars = number with multiple unichars =
Read shape table shapetable of shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ... Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type is -
Offset for type is
Offset for type is -
Offset for type is
Offset for type is
Offset for type is
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is -
Offset for type is
Offset for type is -
Offset for type is -
Offset for type is -
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
Linux (taken from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):
#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" >font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz
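The script above fails partway through if one of the training tools is not installed. As a hedged sketch, a small pre-flight check could list the missing commands before any training files are written; missing_tools is a hypothetical helper, not something the Tesseract documentation provides.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given commands that are not on
# PATH, separated by spaces. Returns 0 only if none are missing.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Drop the leading space; the output is empty when everything was found.
    echo "${missing# }"
    [ -z "$missing" ]
}
```

A call such as missing_tools tesseract unicharset_extractor shapeclustering mftraining cntraining combine_tessdata could be placed at the top of the script, aborting early when the return status is non-zero.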
