Tesseract-OCR Training Process (V3.02)

Software:

jTessBoxEditor Version 0.9 (30 April 2013)

Tesseract-OCR win32 v3.02 with Leptonica

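Training steps:

1. In jTessBoxEditor, use Tools -> Merge TIFF to combine the sample images into a single TIFF file (eng.arial.01.tif in the commands below).

2. Generate the box file: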

tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox

3. Open the TIFF in jTessBoxEditor, use Insert or Delete to add or remove character boxes, and adjust each box's coordinates through the x, y, w, h fields.
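For reference, each line of the .box file from step 2 holds a character followed by its left, bottom, right, and top pixel coordinates and a page number, measured from the bottom-left corner of the image. The sketch below shows one way those values might relate to the x, y, w, h fields. The names `Box`, `parse_box_line`, and `to_xywh` are hypothetical, and the top-left origin used for x, y, w, h is an assumption, not something confirmed against jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the character field is kept whole.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def to_xywh(box: Box, image_height: int) -> tuple:
    """Convert bottom-left-origin box coordinates to x, y, w, h.

    Assumption: x, y, w, h are measured from the top-left corner of the image.
    """
    x = box.left
    y = image_height - box.top
    w = box.right - box.left
    h = box.top - box.bottom
    return x, y, w, h


box = parse_box_line("A 10 20 30 50 0")
print(box)
print(to_xywh(box, image_height=100))
```

The image height has to come from the TIFF itself, because the box file does not record it.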

4. Train. If a character cannot be recognized and the error "couldn't find a matching blob" appears, try moving the box or adjusting its coordinates:

tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train

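5. Preprocess the font by extracting the character set from the box file (this writes a file named unicharset):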

unicharset_extractor.exe eng.arial.01.box

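6. Create font_properties.txt with one line per font. For this font the content is: arial 0 0 0 0 0 (the font name followed by five 0/1 flags: italic, bold, fixed, serif, fraktur).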
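The font properties line can also be generated rather than typed by hand, which helps when several fonts are trained together. This is a minimal sketch; `font_properties_line` and `write_font_properties` are hypothetical helper names, not part of Tesseract.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one line: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(path, lines):
    """Write the given lines to `path`, one font per line, with LF endings."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


line = font_properties_line("arial")
print(line)
```

Calling `write_font_properties("font_properties.txt", [line])` in the training directory would produce the file that step 7 passes to mftraining with -F.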

7. Process the font (feature training):

mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr

8. Run character normalization training:

cntraining.exe eng.arial.01.tr

9. Add the prefix "eng.arial.01." to the four files unicharset, inttemp, normproto, and pffmtable (for example, unicharset becomes eng.arial.01.unicharset).
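The renaming in step 9 is easy to get wrong by hand, so it may be worth scripting. The sketch below is one possible approach; `add_prefix` and `TRAINING_OUTPUTS` are hypothetical names, and the demonstration runs in a scratch directory with empty placeholder files instead of real training output.

```python
import tempfile
from pathlib import Path

# The four files named in step 9.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(directory, prefix="eng.arial.01."):
    """Rename each training output in `directory` to '<prefix><name>'."""
    sources = [Path(directory) / name for name in TRAINING_OUTPUTS]
    # Check everything first so a missing file does not leave a half-renamed set.
    missing = [str(source) for source in sources if not source.exists()]
    if missing:
        raise FileNotFoundError("missing training output: " + ", ".join(missing))
    renamed = []
    for source in sources:
        target = source.with_name(prefix + source.name)
        source.replace(target)  # overwrites a stale target from an earlier run
        renamed.append(target.name)
    return renamed


# Demonstration in a scratch directory with empty placeholder files.
with tempfile.TemporaryDirectory() as scratch:
    for name in TRAINING_OUTPUTS:
        (Path(scratch) / name).touch()
    result = add_prefix(scratch)
    print(result)
```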

10. Combine the files into a single traineddata file:

combine_tessdata.exe eng.arial.01.

Output:

Combining tessdata files

TessdataManager combined tesseract data files.

Offset for type 0 is -1

Offset for type 1 is 108

Offset for type 2 is -1

Offset for type 3 is 1660

Offset for type 4 is 327545

Offset for type 5 is 327781

Offset for type 6 is -1

Offset for type 7 is -1

Offset for type 8 is -1

Offset for type 9 is -1

Offset for type 10 is -1

Offset for type 11 is -1

Offset for type 12 is -1

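The values on lines 2, 4, 5, and 6 of the offset listing (types 1, 3, 4, and 5) must not be -1. If they are not, the new language data file has been generated successfully.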
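The offset check can be automated by scanning the text printed by combine_tessdata. The sketch below is an illustration only. The component names attached to types 1, 3, 4, and 5 follow my understanding of the Tesseract 3.x tessdata type order and should be treated as an assumption, and `missing_components` is a hypothetical helper.

```python
import re

# Assumed mapping from tessdata type index to the file it comes from.
REQUIRED_TYPES = {1: "unicharset", 3: "inttemp", 4: "pffmtable", 5: "normproto"}

OFFSET_LINE = re.compile(r"Offset for type\s+(\d+) is\s+(-?\d+)")


def missing_components(output):
    """Return the required components whose offset is -1 or not reported."""
    offsets = {int(kind): int(offset) for kind, offset in OFFSET_LINE.findall(output)}
    return [name for kind, name in REQUIRED_TYPES.items()
            if offsets.get(kind, -1) == -1]


sample = """\
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
Offset for type 6 is -1
"""
print(missing_components(sample))
```

An empty list means all four required components were packed. Any name in the list points at the file that was missing or not renamed in step 9.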

11. Copy the resulting file "eng.arial.01.traineddata" from the current directory into the "tessdata" directory under the Tesseract program directory.

12. Test recognition with the new language data:

tesseract.exe test.jpg result -l eng.arial.01

tesseract.exe a.bmp result2 -l eng.arial.01

To specify the page segmentation mode:

tesseract.exe 42.png result2 -l eng.arial.01 -psm 7

Page segmentation mode options:

-psm N

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR.

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.
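When recognition is driven from a script, it can help to validate the mode number before calling Tesseract. The sketch below builds the same command line as the -psm 7 example above. `recognize_command` and `PSM_MODES` are hypothetical names, and the -psm spelling follows the 3.02 usage shown in this article (Tesseract 4 and later spell the option --psm).

```python
# Mode numbers and descriptions as listed above for Tesseract 3.02.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
}


def recognize_command(image, output_base, lang, psm=3):
    """Return the argument list for one tesseract.exe recognition run."""
    if psm not in PSM_MODES:
        raise ValueError("psm must be between 0 and 10, got %r" % (psm,))
    return ["tesseract.exe", image, output_base, "-l", lang, "-psm", str(psm)]


cmd = recognize_command("42.png", "result2", "eng.arial.01", psm=7)
print(" ".join(cmd))
```

On a machine where Tesseract is installed, the list could be passed to `subprocess.run(cmd, check=True)`. The sketch only builds the command and does not run it.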
