Improving the quality of the output
There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language retraining Tesseract is unlikely to help.
- Image processing
- Page segmentation method
- Dictionaries, word lists, and patterns
- Still having problems?
Image processing
Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.
You can see how Tesseract has processed the image by using the configuration variabletessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.
Rescaling
Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
Binarisation

This is converting an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.
Noise Removal

Noise is random variation of brightness or colour in an image, that can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
Rotation / Deskewing

A skewed image is when an page has been scanned when not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this rotating the page image so that the text lines are horizontal.
Border Removal

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
Tools / Libraries
Examples
If you need an example how to improve image quality programmatically, have a look at this examples:
- OpenCV - Rotation (Deskewing) - c++ example
- Fred's ImageMagick TEXTCLEANER - bash script for processing a scanned document of text to clean the text background.
- rotation_spacing.py - python script for automatic detection of rotation and line spacing of an image of text
- crop_morphology.py - Finding blocks of text in an image using Python, OpenCV and numpy
Page segmentation method
By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region try a different segmentation mode, using the -psm argument. Note that adding a white border to text which is too tightly cropped may also help, see issue 398.
To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
Dictionaries, word lists, and patterns
By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.
Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting the both of the configuration variablesload_system_dawg and load_freq_dawg to false.
It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.
If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
Still having problems?
If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.
Improving the quality of the output的更多相关文章
- Fully Convolutional Networks for Semantic Segmentation 译文
Fully Convolutional Networks for Semantic Segmentation 译文 Abstract Convolutional networks are powe ...
- PhoenixFD插件流体模拟——UI布局【Output】详解
Liquid Output 流体输出 本文主要讲解Output折叠栏中的内容.原文地址:https://docs.chaosgroup.com/display/PHX3MAX/Liquid+Outp ...
- CIImage实现滤镜效果
Core Image also provides autoadjustment methods that analyze an image for common deficiencies and re ...
- 39. Volume Rendering Techniques
Milan Ikits University of Utah Joe Kniss University of Utah Aaron Lefohn University of California, D ...
- Codeforces Round #302 (Div. 1)
转载请注明出处: http://www.cnblogs.com/fraud/ ——by fraud A. Writing Code Programmers working on a ...
- Code Complete阅读笔记(二)
2015-03-06 328 Unusual Data Types ——You can carry this technique to extremes,putting all the ...
- 44个JAVA代码质量管理工具(转)
1. CodePro AnalytixIt’s a great tool (Eclipse plugin) for improving software quality. It has the nex ...
- PA教材提纲 TAW10-1
Unit1 SAP systems(SAP系统) 1.1 Explain the Key Capabilities of SAP NetWeaver(解释SAP NetWeaver的关键能力) Rep ...
- 近年Recsys论文
2015年~2017年SIGIR,SIGKDD,ICML三大会议的Recsys论文: [转载请注明出处:https://www.cnblogs.com/shenxiaolin/p/8321722.ht ...
随机推荐
- win10下网狐荣耀手机端android app编译
基于荣耀版(2017.5.21)12 款游戏..7z这款游戏,网上有下载的 1.解压后进入 cd shoujiduan 2.将client/base复制到client/ciphercode/下,也就是 ...
- spring-data-redis的事务操作深度解析--原来客户端库还可以攒够了事务命令再发?
一.官方文档 简单介绍下redis的几个事务命令: redis事务四大指令: MULTI.EXEC.DISCARD.WATCH. 这四个指令构成了redis事务处理的基础. 1.MULTI用来组装一个 ...
- ab压测工具
在学习ab工具之前,我们需了解几个关于压力测试的概念 吞吐率(Requests per second)概念:服务器并发处理能力的量化描述,单位是reqs/s,指的是某个并发用户数下单位时间内处理的请求 ...
- vue Element-UI组件
一.UI组件 目的: 提高开发效率, 别人提供好一切, 拿过来直接用饿了么团队开源一个基于vue组件库 Element-UI ==> pc端 文档: http://element-cn.elem ...
- no such file to load -- bundler/setup
bundle install rake routesrake aborted!no such file to load -- bundler/setup 解决 gem install bundler
- Egret第三方库的制作和使用(模块化 第三方库)
一.第三方库的制作 官方教程:第三方库的使用方法 水友帖子:新版本第三方库制作细节5.1.x 首先在任意需要创建第三方库的地方,右键,选择"在此处打开命令窗口" 输入egret c ...
- 危险的浮点数float
今天写程序又以为我见鬼了!最后查出来发现原来又是浮点数搞的鬼! 情况大致是这样的,我想要测试向量运算的速度,所以要对一个浮点数向量进行求和运算,代码如下: int vect_size=10000000 ...
- AJAX之三种数据传输格式详解
一.HTML HTML由一些普通文本组成.如果服务器通过XMLHTTPRequest发送HTML,文本将存储在responseText属性中. 从服务器端发送的HTML的代码在浏览器端不需要用Java ...
- 【巷子】---webpack配置非CMD规范的模块
一.前言 webpack在配置多页面开发的时候 ,发现用 import 导入 Zepto 时,会报 Uncaught TypeError: Cannot read property 'createEl ...
- HDFS体系结构(NameNode、DataNode详解)
hadoop项目地址:http://hadoop.apache.org/ NameNode.DataNode详解 (一)分布式文件系统概述 数据量越来越多,在一个操作系统管辖的范围存不下了,那么就分配 ...