There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language retraining Tesseract is unlikely to help.

Image processing

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by using the configuration variabletessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.

Rescaling

Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.

Binarisation

This is converting an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.

Noise Removal

Noise is random variation of brightness or colour in an image, that can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.

Rotation / Deskewing

A skewed image is when an page has been scanned when not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this rotating the page image so that the text lines are horizontal.

Border Removal

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.

Tools / Libraries

Examples

If you need an example how to improve image quality programmatically, have a look at this examples:

Page segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region try a different segmentation mode, using the -psm argument. Note that adding a white border to text which is too tightly cropped may also help, see issue 398.

To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:

 0   Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.

Dictionaries, word lists, and patterns

By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting the both of the configuration variablesload_system_dawg and load_freq_dawg to false.

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.

Still having problems?

If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.

Improving the quality of the output的更多相关文章

  1. Fully Convolutional Networks for Semantic Segmentation 译文

    Fully Convolutional Networks for Semantic Segmentation 译文 Abstract   Convolutional networks are powe ...

  2. PhoenixFD插件流体模拟——UI布局【Output】详解

    Liquid Output 流体输出  本文主要讲解Output折叠栏中的内容.原文地址:https://docs.chaosgroup.com/display/PHX3MAX/Liquid+Outp ...

  3. CIImage实现滤镜效果

    Core Image also provides autoadjustment methods that analyze an image for common deficiencies and re ...

  4. 39. Volume Rendering Techniques

    Milan Ikits University of Utah Joe Kniss University of Utah Aaron Lefohn University of California, D ...

  5. Codeforces Round #302 (Div. 1)

    转载请注明出处: http://www.cnblogs.com/fraud/          ——by fraud A. Writing Code Programmers working on a ...

  6. Code Complete阅读笔记(二)

    2015-03-06   328   Unusual Data Types    ——You can carry this technique to extremes,putting all the ...

  7. 44个JAVA代码质量管理工具(转)

    1. CodePro AnalytixIt’s a great tool (Eclipse plugin) for improving software quality. It has the nex ...

  8. PA教材提纲 TAW10-1

    Unit1 SAP systems(SAP系统) 1.1 Explain the Key Capabilities of SAP NetWeaver(解释SAP NetWeaver的关键能力) Rep ...

  9. 近年Recsys论文

    2015年~2017年SIGIR,SIGKDD,ICML三大会议的Recsys论文: [转载请注明出处:https://www.cnblogs.com/shenxiaolin/p/8321722.ht ...

随机推荐

  1. 【大数据系列】hadoop单机模式安装

    一.添加用户和用户组 adduser hadoop 将hadoop用户添加进sudo用户组 sudo usermod -G sudo hadoop 或者 visudo 二.安装jdk 具体操作参考:c ...

  2. 一劳永逸的搞定 FLEX 布局(转)

    一劳永逸的搞定 flex 布局 寻根溯源话布局 一切都始于这样一个问题:怎样通过 CSS 简单而优雅的实现水平.垂直同时居中.记得刚开始学习 CSS 的时候,看到 float 属性不由得感觉眼前一亮, ...

  3. LeetCode 39 Combination Sum(满足求和等于target的所有组合)

    题目链接: https://leetcode.com/problems/combination-sum/?tab=Description   Problem: 给定数组并且给定一个target,求出所 ...

  4. php5.4 traits

    PHP 5.4中的traits,是新引入的特性,中文还真不知道如何准确翻译好.其实际的目的,是为了有的场合想用多继承,但PHP又没多继承,于是就发明了这样的一个东西.       Traits可以理解 ...

  5. 2015.7.8js-05(简单日历)

    今天做一个简单的小日历,12个月份,鼠标移动其中一个月份时添加高亮并显示本月的活动.其实同理与选项卡致.不过是内容存在js里 window.onload = function(){ var oMain ...

  6. Maven属性(properties)标签的使用

    在命令行使用属性时,是-D,比如:mvn -D input=test Properties 属性是了解POM基础知识的最后一个要素.Maven属性是值占位符,如Ant中的属性.它们的值可以通过使用符号 ...

  7. s3接口认证说明

    S3 Authorization太绕,太头痛,下面解释说明: XS3 REST API基于HMAC(哈希消息身份验证码)密钥使用自定义HTTP方案进行身份验证.要对请求进行身份验证,您首先需要合并请求 ...

  8. 使用disavled属性锁定input内容不可以修改后,打印获取不到对应的值

    当我们需要锁定input内容不让修改时,可以使用disabled="disabled"和readonly="readonly", 官方的解释是:disabled ...

  9. 9.5Django操作数据库的增删改查

    2018-9-5 18:10:52 先贴上笔记 day61 2018-04-28 1. 内容回顾 1. HTTP协议消息的格式: 1. 请求(request) 请求方法 路径 HTTP/1.1\r\n ...

  10. inline-blcok 之间的空白间隙

    前言: inline-blcok 布局时,通常情况下, inline-blocks 之间有空白,尽管通常我们是不想要的,毕竟不像padding或者margin一样好控制,如图: <div cla ...