13.07.2015 Views

Document Binarization with Automatic Parameter Tuning

Algorithm 2. The metaparameters τ_1 and τ_2 are chosen for best performance on a training set; analysis of the DIBCO 2009 and H-DIBCO 2010 image sets indicates that values of 0.25 and 0.50 respectively achieve the best results across the full set. (Using only 23 of the 24 images in cross-fold training occasionally yields other values.) Employing a training set to choose τ_1 and τ_2 diminishes the automatic nature of the full algorithm, but represents a tradeoff made in this particular algorithm for the sake of computation speed.

Algorithm 3 Pick t_hi from {τ_1, τ_2} for image I given τ_1, τ_2, t_lo, and σ_E
    τ_0 ← (τ_1 + τ_2)/2
    c_0 ← TuneC(I, τ_0, t_lo, σ_E)
    c_1 ← TuneC(I, τ_1, t_lo, σ_E)
    c_2 ← TuneC(I, τ_2, t_lo, σ_E)
    B_0 ← Binarize(I, c_0, τ_0, t_lo, σ_E)
    B_1 ← Binarize(I, c_1, τ_1, t_lo, σ_E)
    B_2 ← Binarize(I, c_2, τ_2, t_lo, σ_E)
    D_1 ← ∆(B_0, B_1)
    D_2 ← ∆(B_0, B_2)
    if D_1 < D_2 then
        t_hi ← τ_1
    else
        t_hi ← τ_2
    end if

These two innovations in combination mean that a result originally requiring 363 trial binarizations to compute may be reached in the time required for just eight or nine. This is still slower than using static parameter values, but investing the extra time may be worthwhile for more accurate results. Alternately, in situations where a number of similar documents must be binarized, the parameter tuning may be run for a few trial pages to find appropriate values, which are then set statically for the remainder of the set.

All the algorithms display more or less linear dependence of computation time on the number of pixels in the image.
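The control flow of Algorithm 3 above can be sketched in Python as follows. This is only an illustrative sketch: the callables `tune_c` and `binarize` stand in for the paper's TuneC and Binarize procedures, which are not reproduced here, and `pixel_difference` is an assumed stand-in for ∆ (a simple count of pixels labeled differently), which may not match the measure actually used.

```python
from typing import Callable, Sequence

Binary = Sequence[Sequence[int]]  # a binarization as rows of 0/1 labels


def pixel_difference(a: Binary, b: Binary) -> int:
    """Assumed stand-in for the paper's ∆: number of pixels labeled differently."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def pick_t_hi(
    image,
    tau1: float,
    tau2: float,
    t_lo: float,
    sigma_e: float,
    tune_c: Callable,    # hypothetical: tune_c(image, tau, t_lo, sigma_e) -> c
    binarize: Callable,  # hypothetical: binarize(image, c, tau, t_lo, sigma_e) -> Binary
    delta: Callable[[Binary, Binary], float] = pixel_difference,
) -> float:
    """Choose t_hi from {tau1, tau2} by comparing each result to the midpoint result."""
    tau0 = (tau1 + tau2) / 2

    # Tune c separately for each candidate threshold, then binarize with it.
    binarizations = []
    for tau in (tau0, tau1, tau2):
        c = tune_c(image, tau, t_lo, sigma_e)
        binarizations.append(binarize(image, c, tau, t_lo, sigma_e))
    b0, b1, b2 = binarizations

    # Keep the candidate whose binarization is closer to the midpoint's.
    d1 = delta(b0, b1)
    d2 = delta(b0, b2)
    return tau1 if d1 < d2 else tau2
```

With the values reported above, a call would pass `tau1=0.25` and `tau2=0.50`; ties go to `tau2`, matching the `else` branch of the pseudocode.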
Executing Algorithm 2 takes an average of 892 seconds per megapixel on a 2.4 GHz Xeon processor running as a single thread (without parallelism). By contrast, Algorithm 3 runs in 18.1 seconds per megapixel under the same conditions, while a single execution of the base algorithm under static parameters takes 2.12 seconds per megapixel.

2.4 Further Algorithmic Variants

For best results, many binarization methods compute an initial labeling using some base technique and then apply one or more postprocessing algorithms to improve it. For example, Su et al. [22] remove components of three pixels or less. More complicated modeling and classification algorithms can identify and remove noise components while retaining real ink [1]. The unknown pixel classification of Lelore and Bouchara [13] may also be viewed as post-processing of a sort.

Similar techniques can be applied to the results from the method described herein. Indeed, both the three-pixel component filter and a more complex classification-based approach reminiscent of Agrawal & Doermann have been tested informally and found to reduce binarization error for the DIBCO test images, relative to ground truth. Unfortunately, gains in training can come at the cost of a loss of generality in practice. This suspicion is borne out by the DIBCO 2011 contest results. The two algorithms that ranked in first and second place according to the contest scoring methodology perform very well on most of the test images, but fail severely on one or two examples (PR6 and PR7). One might view this as a form of overfitting the problem: these algorithms are specialized to do very well on images that match the expectations of the designers, but cannot handle the full range of images that might be encountered “in the wild”. This paper aims to develop binarization techniques that work automatically and reliably on as wide a range of images as possible.
Thus the experiments eschew post-processing heuristics of the type described above because they might decrease error on a given test set at the cost of generality. Such tricks are nevertheless worth mentioning because they may prove useful whenever the images to be binarized are amenable.
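For readers who do want to try the simplest of these heuristics, the small-component filter attributed to Su et al. [22] might look roughly like the sketch below. It is not taken from any of the cited implementations; it assumes ink pixels are labeled 1 and background 0, and it assumes 8-connectivity by default, neither of which is specified in the text. The function name `remove_small_components` is hypothetical.

```python
from collections import deque


def remove_small_components(binary, max_size=3, connectivity=8):
    """Return a copy of `binary` with ink components of at most `max_size` pixels cleared.

    Assumptions (not from the paper): ink is 1, background is 0, and
    components are 8-connected unless `connectivity=4` is requested.
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    out = [list(row) for row in binary]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if out[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search to collect one connected ink component.
            component = [(r, c)]
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and out[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        component.append((ny, nx))
                        queue.append((ny, nx))
            # Components of max_size pixels or fewer are treated as noise.
            if len(component) <= max_size:
                for y, x in component:
                    out[y][x] = 0
    return out
```

As the discussion above cautions, a filter like this may also delete genuine small marks such as dots and diacritics, so whether it helps depends on the images at hand.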
