Document Binarization with Automatic Parameter Tuning



Nicholas R. Howe

Abstract

Document analysis systems often begin with binarization as a first processing stage. Although numerous techniques for binarization have been proposed, the results produced can vary in quality and often prove sensitive to the settings of one or more control parameters. This paper examines a promising approach to binarization based upon simple principles, and shows that its success depends most significantly upon the values of two key parameters. It further describes an automatic technique for setting these parameters in a manner that tunes them to the individual image, yielding a final binarization algorithm that can cut total error by one-third with respect to the baseline version. The results of this method advance the state of the art on recent benchmarks.

1 Introduction

Physical documents in the real world exhibit a multitude of colors and shadings, and the appearance of even a single document may vary widely depending upon factors such as lighting, viewing angle, etc. Digital reproductions of physical documents usually make use of a 24-bit color representation, or perhaps 8-bit grayscale. Such formats omit some of the information present in the original, but retain more than enough for most applications. Certain document processing schemes go much further: they retain only a single bit per document pixel. Binary document representations are very useful despite their low fidelity because the information lost in binarization is more or less unrelated to the symbolic content of the document. Most documents are produced using monochromatic ink on paper, and their meaning is embodied solely in the distribution of the ink, a pattern that the binary document explicitly represents.

Of course, inferring the correct binarization of a document from its color or grayscale representation can prove challenging.
Physical degradation of the document, adverse lighting or imaging conditions, and limitations on resolution can all conspire to obscure the original pattern (see footnote 1). To meet this challenge, researchers have proposed many algorithms to binarize document images. Indeed, intense activity in this field has made the Document Image Binarization Contest (DIBCO) series one of the most popular competitions within the document analysis research community, attracting seventeen entries in its most recent iteration [19]. Nevertheless, the results from these contests reveal that there remains room for improvement in the quality of automatic binarization.

Footnote 1: Some ambiguity remains over how to define the ground truth binarization. For instance, should an unintended ink spill made by a scribe be included? For the record, we stipulate that the ground truth should identify the set of all pixels at least 50% covered with ink during the production of the document. In practice, the ground truth used on most benchmark datasets probably does not correspond perfectly with this ideal, but represents the imperfect judgment of a human expert [4].

This paper makes two related contributions to the study of automatic document binarization. First, it describes a method that can achieve excellent results on a wide variety of challenging document images. Second, as a means to these results it investigates a method to automatically determine the best parameter values suited to each particular image. Automatic parameter estimation has received little research attention to date but offers two advantages when done well: it makes the method easier to employ since the user doesn't need to worry about finding the best parameter settings, and it also improves the average result because each image can use its own ideal value instead of a single compromise setting that works adequately but suboptimally for all images.

1.1 Prior Work

Before considering new approaches, it is worth surveying the history and development of binarization algorithms. Because most documents are printed in dark ink on light paper, many classic methods rely on thresholding the image intensity. Under ideal circumstances, as noted by Otsu [17], a simple global threshold suffices to distinguish markings from the background. In more difficult cases the overall intensity may vary over the document, so that a single global threshold may simultaneously prove too high in some areas and too low in others. Thus both Niblack [16] and Sauvola and Pietikainen [21] have proposed locally adaptive techniques that adjust the threshold used according to the mean and variance in some local region around each pixel. Although generally more accurate than Otsu for document images with varying intensities, locally adaptive thresholding can sometimes fail catastrophically in areas of low variance, motivating hierarchical approaches such as the Decompose algorithm [7]. Adaptive thresholding also presents the user with additional parameters to set, relating to the size of the local region and to the relative importance of the local mean vs. its deviation as the local threshold is determined.

The winners of recent DIBCO events go beyond locally adaptive thresholding. They achieve their results partially by using thresholds, but most also perform modeling of the ink and background classes to enable more accurate individual pixel classification. Lu et al. [14] model the document background via an iterative polynomial smoothing, and then choose local thresholds based on detected text stroke edges. This method won DIBCO 2009, and a revised version tied as a winner of H-DIBCO 2010 (handwritten documents only).
The second winner in 2010 was a method described by Bar-Yosef et al. [3] that grows foreground regions iteratively based upon local modeling of the foreground and background within a 7×7 pixel window. Lelore and Bouchara [13] won DIBCO 2011 with a technique that first uses coarse thresholding to partition pixels into three groups: ink, background, and unknown. Models describe the ink and background clusters, and guide decisions on the unknown pixels. The method also employs a higher resolution grid generated from the original image via linear interpolation.

Several other ideas proposed in recent years provide motivation for the base algorithm employed in this paper. A number of projects have explored the use of Markov random fields (MRF) for binarization [12, 18, 15, 11]. Inspired by the human visual system, some work has employed center-surround filters to identify areas that are darker than their local environment, and thus likely to be ink [23]. These play a role similar to that of the Laplacian filter used herein. Finally, various approaches have used edge detection to improve binarization results [10, 20]. Although these ideas have been explored separately, the specific combination used in this work appears particularly effective.

Automatic estimation of appropriate parameter settings for document binarization has received limited attention to date. Gatos et al. describe a parameter-free method that relies on detailed modeling of the document background [9]. Dawoud describes a method based upon cross section sequence that combines results at multiple threshold levels into a single binarization [8]. Badekas and Papamarkos [2] describe a method inspired by work on edge detection. Their technique first produces binarizations over a range of parameter settings, and estimates a ground truth via chi-square analysis on the cumulative votes of the multiple binarizations at each pixel. It then iteratively narrows the parameter range until unable to do so further, giving the final estimated best setting. Although the method described herein also begins by computing binarizations over a range of parameter values, it differs by not attempting to estimate a ground truth, and requires no further iteration to choose the ideal parameter value.


2 Approach

The base approach to binarization used in this paper was introduced recently [11] and rests on three mutually supporting strategies. First, it defines the target binarization as a labeling on pixels that minimizes a global energy function inspired by a Markov random field model. Second, in formulating the data-fidelity term of this energy it relies on the Laplacian of the image intensity to distinguish ink from background. This grants a crucial invariance to differences in contrast and overall intensity. Third, it incorporates edge discontinuities into the smoothness term of the global energy function, biasing ink boundaries to align with edges and allowing a stronger smoothness incentive over the rest of the image. The paragraphs below explain each of these points in greater detail.

The global energy function operates on binarizations B, which label each pixel indexed by (i, j) as either ink or background, B_ij ∈ {0, 1}. The energy takes on a typical additive form, with terms representing the fidelity of a particular labeling when compared to the intensity data, and other terms representing the smoothness or regularity of the solution. In particular, the energy includes a cost L^0_ij or L^1_ij capturing how well the label B_ij chosen for each pixel matches its appearance, and irregularity costs C^h_ij and C^v_ij for each pixel whose label differs respectively from those of its horizontal or vertical neighbor.

\[
E_I(B) = \sum_{i=0}^{m} \sum_{j=0}^{n} \left[ L^0_{ij} (1 - B_{ij}) + L^1_{ij} B_{ij} \right]
       + \sum_{i=0}^{m-1} \sum_{j=0}^{n} C^h_{ij} \, (B_{ij} \neq B_{i+1,j})
       + \sum_{i=0}^{m} \sum_{j=0}^{n-1} C^v_{ij} \, (B_{ij} \neq B_{i,j+1})
\tag{1}
\]

Assume that the Boolean expressions above evaluate to either 0 or 1 in the usual manner according to their truth value. With this energy the optimal binarization will tend to conform to the intensity contours while smoothing over small irregularities resulting from noise sources.
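As a concrete illustration, Equation 1 can be evaluated directly for a tiny image. The sketch below is a minimal example under stated assumptions: the function names `energy` and `brute_force_minimum` are hypothetical, the cost arrays are toy values rather than values derived from a real image, and the exhaustive search is used only because the grid is 2×2; it is not a practical minimizer.

```python
from itertools import product


def energy(B, L0, L1, Ch, Cv):
    """Evaluate the energy of Equation 1 for a 0/1 labeling B.

    All arguments are equally sized lists of lists. Following the paper's
    indexing, Ch[i][j] penalizes B[i][j] != B[i+1][j] and Cv[i][j]
    penalizes B[i][j] != B[i][j+1].
    """
    rows, cols = len(B), len(B[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            # Data-fidelity term: L0 is paid for label 0, L1 for label 1.
            total += L0[i][j] * (1 - B[i][j]) + L1[i][j] * B[i][j]
            # Smoothness terms: paid only where neighboring labels differ.
            if i + 1 < rows and B[i][j] != B[i + 1][j]:
                total += Ch[i][j]
            if j + 1 < cols and B[i][j] != B[i][j + 1]:
                total += Cv[i][j]
    return total


def brute_force_minimum(L0, L1, c):
    """Exhaustively minimize Equation 1 with a uniform penalty c (toy sizes only)."""
    rows, cols = len(L0), len(L0[0])
    C = [[c] * cols for _ in range(rows)]
    best = None
    for bits in product((0, 1), repeat=rows * cols):
        B = [list(bits[r * cols:(r + 1) * cols]) for r in range(rows)]
        e = energy(B, L0, L1, C, C)
        if best is None or e < best[0]:
            best = (e, B)
    return best


# Toy costs (assumed values): label 1 is cheap in the left column only.
L0 = [[2, -1], [2, -1]]
L1 = [[-2, 1], [-2, 1]]
low_c = brute_force_minimum(L0, L1, 0.5)   # data term dominates
high_c = brute_force_minimum(L0, L1, 10)   # smoothness term dominates
```

With the small penalty the minimizer follows the data term pixel by pixel; with the large penalty every mismatch costs more than the data term can repay, so the whole patch receives a single label.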
The degree of smoothing relative to data fidelity will depend on the relative magnitudes of L^b_ij to C^h_ij and C^v_ij.

The label costs L^0_ij and L^1_ij should be invariant to the local image illumination, and thus are taken from the Laplacian of the image intensity:

\[ L^0_{ij} = \nabla^2 I_{ij} \tag{2} \]

\[ L^1_{ij} = -\nabla^2 I_{ij} \tag{3} \]

Intuitively this will tend to separate ink from background because the Laplacian measures the divergence of the gradient. It will thus be positive at intensity valleys (ink) and negative at intensity peaks or plateaus (background). The data terms of the energy function become a summation of the label-signed Laplacian over all pixels in the image. For a particular component of ink or background, Green's theorem tells us that the summed Laplacian over all its pixels is mathematically equivalent to the gradient flux across its boundary. In other words, the energy contribution of each component is determined solely by what happens at its boundary. This makes intuitive sense but can cause trouble for components that intersect the image edges: such regions can occasionally be mislabeled because their entire natural boundary is not visible, and thus their true energy contribution can only be estimated. In practice, this causes trouble occasionally for noisy background areas that are largely isolated from the rest of the document background by ink markings. Several potential solutions exist. For example, one could simply fix L^1_ij to a large negative value for all pixels (i, j) on the image border, under the assumption that all ink is framed by background regions. Rather than committing to such a strong assumption, this work adopts a more conservative strategy, looking for bright outlier pixels and applying a fixed constant L^1_ij to them. This still ensures that large background regions will receive the proper label, while not preventing identification of ink pixels on the image boundary.
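The sign convention of Equations 2 and 3 can be checked on a toy image. The following sketch assumes a 4-neighbor discrete Laplacian with replicated borders; the text does not specify the discretization, so that choice and the helper names `laplacian` and `label_costs` are illustrative assumptions only.

```python
def laplacian(I):
    """4-neighbor discrete Laplacian with replicated borders (an assumed discretization)."""
    rows, cols = len(I), len(I[0])

    def px(i, j):
        # Clamp indices so that border pixels reuse their nearest in-image value.
        return I[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    return [[px(i - 1, j) + px(i + 1, j) + px(i, j - 1) + px(i, j + 1) - 4 * px(i, j)
             for j in range(cols)]
            for i in range(rows)]


def label_costs(I):
    """Return (L0, L1) as in Equations 2 and 3: L0 = Laplacian, L1 = -Laplacian."""
    lap = laplacian(I)
    return lap, [[-v for v in row] for row in lap]


# A dark vertical stroke (intensity 50) on bright paper (intensity 200).
I = [[200, 200, 50, 200, 200] for _ in range(3)]
L0, L1 = label_costs(I)

# The same patch under uniformly brighter illumination (+30 everywhere).
I_brighter = [[v + 30 for v in row] for row in I]
```

In this example the Laplacian is positive on the stroke, so labeling those pixels as ink (cost L^1) lowers the energy, and it is negative on the paper pixels beside the stroke. Adding a constant to every intensity leaves the costs unchanged, which is the invariance to overall intensity that motivates the choice.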
To be precise, the algorithm modifies L^1_ij for any pixel more than two standard deviations σ^r_ij brighter than the local mean μ^r_ij, as computed over nearby pixels weighted by a Gaussian of radius r. This may be viewed as an extremely conservative application of locally adaptive thresholding, where only pixels most certain to be background are labeled as such. In the equation below, φ will take on a large negative value.

\[
L^1_{ij} =
\begin{cases}
-\nabla^2 I_{ij} & \text{if } I_{ij} \le \mu^r_{ij} + 2\sigma^r_{ij} \\
\phi & \text{if } I_{ij} > \mu^r_{ij} + 2\sigma^r_{ij}
\end{cases}
\tag{4}
\]

The neighbor mismatch penalties C^h_ij and C^v_ij offer the opportunity to employ the third strategy mentioned above, incorporating the Canny edge map. (Recall that Canny [6] first smooths the image with a Gaussian filter of small radius σ_E, then finds edges at local directed maxima of the image gradient, choosing to retain only those edges selected using a hysteresis procedure with two thresholds t_hi and t_lo.) The algorithm sets the mismatch penalties to a uniform value c everywhere except between pixels where a Canny edge has identified a likely discontinuity. To be more specific, Canny identifies pixels as edges, while Equation 1 requires locating discontinuities in the connections between pairs of pixels. To address this discrepancy, the formulation below zeros out the discontinuity penalty between Canny edge pixels and their brighter neighbors, effectively choosing to include the Canny edge pixels within the inked area. The opposite choice would also be self-consistent, but would inhibit detection of single pixel width strokes. In the equations below, E_ij is true when pixel (i, j) is a Canny edge pixel.

\[
C^h_{ij} =
\begin{cases}
0 & \text{if } E_{ij} \wedge (I_{ij} < I_{i+1,j}) \\
0 & \text{if } E_{i+1,j} \wedge (I_{ij} \ge I_{i+1,j}) \\
c & \text{otherwise}
\end{cases}
\tag{5}
\]

\[
C^v_{ij} =
\begin{cases}
0 & \text{if } E_{ij} \wedge (I_{ij} < I_{i,j+1}) \\
0 & \text{if } E_{i,j+1} \wedge (I_{ij} \ge I_{i,j+1}) \\
c & \text{otherwise}
\end{cases}
\tag{6}
\]

The choice of constant discontinuity penalties everywhere except at edges deserves a note. One might imagine using a penalty that varies continuously according to the similarity in intensity between the neighbors.
Empirically this approach seems less successful, perhaps because it actually gives little guidance about the best precise location of the ink-background transition: the intensity differences tend to be fairly large everywhere within a few pixels of the actual boundary, and thus it becomes too easy to choose the wrong location.

2.1 Parameters

The algorithm described above includes six free parameters, of which only two strongly influence the binarization outcome and require varying settings for different images. The four less important parameters can be set to a constant with negligible consequences for all images tested, either because they do not strongly influence the result or because the best value appears not to fluctuate much for different images. Thus finding automatic values for the two important parameters suffices to create a method that can be applied as a "black box" with little or no parameter testing required.

Two of the less important parameters appear in Equation 4: r and φ. Of these, r must be large enough to encompass at least a few background pixels, and thus should be set to some value larger than the expected stroke width. φ can be any sufficiently negative value. No attempt is made to optimize these parameters, and the experiments all use r = 20 and φ = −500, for images with grayscale intensity in the range from 0 to 255. The remaining less important parameters are two of the three that control the Canny edge detection algorithm, namely the lower of the two edge detection thresholds t_lo and the radius of smoothing applied prior to edge detection σ_E. Bad values for these parameters can certainly hurt binarization quality, but fortunately the experiments will show that there exists a uniform setting that works on all test images with near-optimal results.

The two important parameters that remain interact with each other to determine the final binarization result. Most crucial is c, whose magnitude determines the relative balance between data fidelity and regularization.
The high Canny edge detection threshold t_hi matters most when c is large because the detection or nondetection of an edge can determine whether an entire ink component appears or disappears in the minimal energy solution. The high Canny threshold plays a particularly important role in images that exhibit ink bleeding through from the opposite side of the paper: since the bled ink tends to have weaker edges, fortuitous thresholding can eliminate most of the false ink components.

Early experiments with this binarization scheme, and indeed for most approaches to binarization, have relied on choosing parameter values that work reasonably well across a wide range of images. This can be achieved by searching for the best values on a training set of images similar to the ones that will be used for testing, and yields good results [11]. However, even globally optimal parameter values embody some compromise on individual images, and tuning the settings to the image can offer significant gains. The next section explains how to do this.

2.2 Automatic Parameter Tuning

To understand how the crucial c parameter can be set automatically, it is worthwhile to examine the behavior of the base binarization algorithm under a range of different settings. Figure 1 shows the binarization results on a small image patch for a logarithmically increasing set of values. When c is very low, the binarization looks like a simple sign operator on the Laplacian, with many small noisy components. As c increases the noise components become more consolidated and many disappear. For a range of middle values of c, the binarization stays fairly stable, with only small changes as c increases. At the highest values, the result becomes unstable again as large ink components disappear, and occasionally join or have voids filled in. As c goes to infinity, the binarization will tend toward the majority label in each edge-isolated component.

The region of stable results at middle levels of c is significant, and turns out to show up consistently on ink-and-paper documents, for reasons that can be explained.
To visualize the phenomenon, define the normalized binarization instability ξ_ν(c) in terms of Δ(B^c, B^{νc}), the fraction of pixels that change labels between two values of c differing by a factor ν.

\[ \Delta(B, B') = \sum_{i=0}^{m} \sum_{j=0}^{n} (B_{ij} \neq B'_{ij}) \tag{7} \]

\[ \xi_\nu(c) = \frac{\Delta(B^c, B^{\nu c})}{mn \, (\ln(\nu c) - \ln(c))} \tag{8} \]

In the equation above, B^c_ij is the binarization label at pixel (i, j) using mismatch penalty c, and the Boolean operator ≠ converts to 0 or 1 in the customary fashion. As Figure 2 shows, the general shape of the instability curve is an intrinsic property of the image that does not depend on the exact value of ν chosen; thus in most cases ξ can drop its subscript. Histograms provide a useful analogy: although the details vary slightly, a histogram made using a sufficient number of samples from the same distribution always takes on the general appearance of that distribution for any reasonable choice of bin size. Indeed, the instability plots may be seen as histograms of pixel label transition events with respect to c, with logarithmically increasing bin sizes.

Figure 3 shows as solid lines the instability plots for several document images. Although the stable zone differs somewhat in appearance and location in each case, its nearly universal presence is striking. Yet in hindsight, the observation of such a stable zone at middling c values should not come as such a surprise. The twin peaks on either side arise from understandable mechanisms that apply for all document images, and thus the absence of a stable zone for any particular image would arise from exceptional circumstances.

The c parameter controls the incentive for neighboring patches to share the same label despite a data fidelity term indicating otherwise. The left peak in the instability curve appears at low levels of c, as it becomes large enough to overcome small noise fluctuations and label most large homogeneous foreground and background regions correctly.
On the other hand, the right peak appears when c becomes large enough to force a uniform label even across regions of different underlying ground truth. Unless the noise amplitude exceeds the contrast between foreground and background, a zone of stability will normally separate the two. Furthermore, this stability zone coincides with binarizations that adhere to the underlying image structure while ignoring noise. Figure 3 shows as dotted lines the binarization error (based on the F-measure, as defined in Equation 14 below), which tends to reach a minimum at or near the lowest point in the instability curve. Computing the instability curve does not require access to the ground truth, yet observing the low point of the stable region reveals a value of c that typically achieves low ground-truth binarization error.
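The instability curve of Equations 7 and 8 is simple to compute once binarizations are available for an increasing sequence of c values. The sketch below is illustrative only: the binarizations are hand-made toy data rather than the output of the actual algorithm, and `most_stable_index` is an assumed stand-in rule (the lowest point between the two tallest local maxima), not a transcription of the paper's Algorithm 1.

```python
import math


def delta(B1, B2):
    """Equation 7: count the pixels whose labels differ between two binarizations."""
    return sum(a != b for row1, row2 in zip(B1, B2) for a, b in zip(row1, row2))


def instability_curve(cs, binarizations):
    """Equation 8 evaluated at each c in cs[:-1], using the next value as nu * c."""
    npix = len(binarizations[0]) * len(binarizations[0][0])
    return [delta(binarizations[k], binarizations[k + 1])
            / (npix * (math.log(cs[k + 1]) - math.log(cs[k])))
            for k in range(len(cs) - 1)]


def most_stable_index(xi):
    """Assumed selection rule: lowest point between the two tallest local maxima."""
    peaks = [k for k in range(len(xi))
             if (k == 0 or xi[k] >= xi[k - 1])
             and (k == len(xi) - 1 or xi[k] >= xi[k + 1])]
    if len(peaks) < 2:
        # Fallback when two peaks cannot be identified: global minimum.
        return min(range(len(xi)), key=lambda k: xi[k])
    lo, hi = sorted(sorted(peaks, key=lambda k: xi[k], reverse=True)[:2])
    return min(range(lo, hi + 1), key=lambda k: xi[k])


# Toy binarizations of a 2x4 patch for increasing c (hand-made, not real output).
cs = [10, 20, 40, 80, 160]
Bs = [
    [[1, 0, 1, 1], [0, 1, 0, 0]],  # low c: noisy labels
    [[1, 1, 1, 1], [0, 0, 0, 0]],  # noise removed
    [[1, 1, 1, 1], [0, 0, 0, 0]],  # unchanged: stable zone
    [[1, 1, 1, 0], [0, 0, 0, 0]],  # ink begins to erode
    [[0, 0, 0, 0], [0, 0, 0, 0]],  # high c: ink disappears
]
xi = instability_curve(cs, Bs)
chosen_c = cs[most_stable_index(xi)]
```

Note that nothing in this computation refers to a ground truth; the choice of c depends only on how much the labeling changes between neighboring settings.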


Figure 1: Binarization results for different values of c on one portion of a document image (panels, in order: c = 5.00, 7.33, 10.7, 15.7, 23.1, 33.8, 49.6, 72.7, 106, 156, 229, 336). A range of stability for intermediate c values corresponds to the most accurate result.

Figure 2: Binarization instability measured with varying step factor in c and normalized by the log of the step factor. The shape (and thus the location of the minimum) depends very little on the step factor chosen.

The link between the instability minimum and the binarization error minimum is at heart merely an empirical observation, but the philosophical considerations just given provide justification for the belief that it will apply to most document images. The algorithms developed below rest upon two related hypotheses, which the experiments in Section 3 largely bear out: First, the two instability peaks described above will reliably appear and can be consistently identified. Second, the point of maximum stability between the identified peaks will yield near-minimal binarization error. The truth of these hypotheses will depend upon the circumstances which produced a given document image. For example, some documents include multiple types of markings at different contrast levels, such as those where ink bleeds through from the reverse side. These may muddy the situation by generating additional instability peaks, but in most cases the extra peaks don't alter the result greatly because they tend to be more diffuse and thus less prominent. Figure 5 below describes a few more difficult cases, including unusual images where the optimal c varies widely across the page. Nevertheless the experiments indicate that such problems remain mostly limited to unusual situations.

The parameter-setting technique for c summarized as Algorithm 1 computes curves like those in Figure 3 and chooses c at the minimum point between the two


Figure 3: Binarization instability and error vs. c for the DIBCO 2009 and H-DIBCO 2010 images. The solid curve shows the fraction of the image area whose label changes between successive values of c. A vertical line identifies the value of c chosen by Algorithm 1 using this curve. The broken curve shows the error (difference from 1.0) of the F-measure of the binarization. The horizontal axis shows c on a log scale. Images are given left to right in the order of Table 1.


Figure 4: Binarization instability (left in pair) and error (right in pair) visualized with respect to varying values of c (vertical axis; logarithmic range from 20 to 5120) and t_hi (horizontal axis; linear range from 0.15 to 0.65) for four representative images (H2009-T1, H2009-T2, P2009-T1, P2009-T2). Despite the very different patterns on display, the parameter neighborhoods with lower error coincide in each instance with higher stability (both indicated by lighter shades). The converse does not always hold: the second image shows two areas of stability, with only the more central one corresponding to low error. The second, more peripheral stability zone is rejected by Algorithm 1.

[...]bility, measured using the simple unnormalized formula below.

\[
\zeta_{\nu,\tau}(c, t_{hi}) = \frac{1}{4} \left[
\Delta(B^{c,t_{hi}}, B^{\nu c,t_{hi}})
+ \Delta(B^{c,t_{hi}}, B^{c/\nu,t_{hi}})
+ \Delta(B^{c,t_{hi}}, B^{c,t_{hi}-\tau})
+ \Delta(B^{c,t_{hi}}, B^{c,t_{hi}+\tau})
\right]
\tag{9}
\]

Other DIBCO images not shown behave similarly. Full explanation of this phenomenon lies beyond the scope of this paper, which merely observes and seeks to exploit the pattern.

2.3 Computational Efficiency

Completing Algorithm 2 requires 363 trial binarization computations (33 candidates for c times 11 candidates for t_hi) and thus runs much more slowly than a single binarization with static parameter settings. However, two modifications to the algorithm can substantially mitigate the speed difference.

The first takes advantage of the fact that the binarization result does not usually change substantially between successive values of c. Data structures built to minimize Equation 1 for one value of c can be modified and reused with a similar c value, achieving noticeable economies. This strategy, introduced by Boykov and Kolmogorov [5], minimizes Equation 1 by finding the minimum cut on a graph derived from the image.
The method grows search trees from both the source and the sink of the image graph to help find augmenting paths, and it turns out that the trees grown for one value of c can be largely reused at nearby values to realize significant time savings. Empirical tests of this implementation show that it can speed computation by a factor of fifteen or more. In other words, it computes all 33 trial binarizations required for one iteration of Algorithm 1 in little more than the time that it would normally take to compute just two or three.

The second modification relies on a heuristic trick, and stems from the observation that the presence or absence of ink bleed-through significantly influences the optimal value of t_hi. Images showing bleed-through typically require a larger t_hi to avoid identifying the spurious bled ink components as real. By contrast, in images without bleed-through the algorithm achieves its best results with a lower t_hi, because this allows it to detect fainter edges. Thus it appears that an explicitly bimodal algorithm allowing only two possible values {τ_1, τ_2} for t_hi might still perform well as compared with Algorithm 2, which tests eleven. The experimental results below confirm this hypothesis.

Detecting the presence or absence of bleed-through in a document is a nontrivial task in the general case, so it might seem impossible to determine which of even two candidate t_hi values to use. Fortunately, an extension of the stability criterion provides a simple heuristic that does fairly well empirically. In addition to the two candidate values, Algorithm 3 computes the binarization using their mean value τ_0 = (τ_1 + τ_2)/2. This midpoint binarization can be compared to the other two, and the closer is judged the more stable and thus selected for the final result.

Algorithm 3 computes results for only three t_hi values and thus runs more than three times as fast as Algorithm 2. The metaparameters τ_1 and τ_2 are chosen for best performance on a training set; analysis of the DIBCO 2009 and H-DIBCO 2010 image sets indicates that values of 0.25 and 0.50 respectively achieve the best results across the full set. (Using only 23 of the 24 images in cross-fold training occasionally yields other values.) Employing a training set to choose τ_1 and τ_2 diminishes the automatic nature of the full algorithm, but represents a tradeoff made in this particular algorithm for the sake of computation speed.

Algorithm 3: Pick t_hi from {τ_1, τ_2} for image I given τ_1, τ_2, t_lo, and σ_E

    τ_0 ← (τ_1 + τ_2)/2
    c_0 ← TuneC(I, τ_0, t_lo, σ_E)
    c_1 ← TuneC(I, τ_1, t_lo, σ_E)
    c_2 ← TuneC(I, τ_2, t_lo, σ_E)
    B_0 ← Binarize(I, c_0, τ_0, t_lo, σ_E)
    B_1 ← Binarize(I, c_1, τ_1, t_lo, σ_E)
    B_2 ← Binarize(I, c_2, τ_2, t_lo, σ_E)
    D_1 ← Δ(B_0, B_1)
    D_2 ← Δ(B_0, B_2)
    if D_1 < D_2 then
        t_hi ← τ_1
    else
        t_hi ← τ_2
    end if

These two innovations in combination mean that a result originally requiring 363 trial binarizations to compute may be reached in the time required for just eight or nine. This is still slower than using static parameter values, but investing the extra time may be worthwhile for more accurate results. Alternatively, in situations where a number of similar documents must be binarized, the parameter tuning may be run for a few trial pages to find appropriate values, which are then set statically for the remainder of the set.

All the algorithms display more or less linear dependence of computation time on the number of pixels in the image.
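The selection logic of Algorithm 3 translates almost line for line into code. In the sketch below, `pick_t_hi` is a hypothetical name, and the `tune_c` and `binarize` arguments are caller-supplied stand-ins for the TuneC and Binarize procedures, whose real implementations are not shown here. The toy versions exist only to make the example runnable and do not resemble the real procedures.

```python
def delta(B1, B2):
    """Number of pixels whose labels differ (Equation 7)."""
    return sum(a != b for row1, row2 in zip(B1, B2) for a, b in zip(row1, row2))


def pick_t_hi(image, tau1, tau2, t_lo, sigma_e, tune_c, binarize):
    """Choose t_hi from {tau1, tau2} by comparing each result to the midpoint result.

    tune_c(image, t_hi, t_lo, sigma_e) -> c and
    binarize(image, c, t_hi, t_lo, sigma_e) -> labeling
    are assumed interfaces standing in for TuneC and Binarize.
    """
    tau0 = (tau1 + tau2) / 2
    labelings = []
    for tau in (tau0, tau1, tau2):
        c = tune_c(image, tau, t_lo, sigma_e)
        labelings.append(binarize(image, c, tau, t_lo, sigma_e))
    b0, b1, b2 = labelings
    # The candidate whose binarization lies closer to the midpoint result
    # is judged the more stable one; ties go to tau2, as in the pseudocode.
    return tau1 if delta(b0, b1) < delta(b0, b2) else tau2


# Toy stand-ins: the "image" holds made-up edge strengths, and a pixel is
# labeled ink when its value reaches t_hi. The tuned c is a dummy constant.
def toy_tune_c(image, t_hi, t_lo, sigma_e):
    return 100.0


def toy_binarize(image, c, t_hi, t_lo, sigma_e):
    return [[1 if v >= t_hi else 0 for v in row] for row in image]


image_a = [[0.2, 0.3, 0.45, 0.45, 0.9]]
image_b = [[0.2, 0.3, 0.3, 0.45, 0.9]]
choice_a = pick_t_hi(image_a, 0.25, 0.5, 0.1, 0.6, toy_tune_c, toy_binarize)
choice_b = pick_t_hi(image_b, 0.25, 0.5, 0.1, 0.6, toy_tune_c, toy_binarize)
```

Passing the two procedures as arguments keeps the selection rule separate from any particular binarization implementation, which also makes the rule easy to exercise with stubs as done here.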
Executing Algorithm 2 takes an average of 892 seconds per megapixel on a 2.4 GHz Xeon processor running as a single thread (without parallelism). By contrast, Algorithm 3 runs in 18.1 seconds per megapixel under the same conditions, while a single execution of the base algorithm under static parameters takes 2.12 seconds per megapixel.

2.4 Further Algorithmic Variants

For best results, many binarization methods compute an initial labeling using some base technique and then apply one or more postprocessing algorithms to improve it. For example, Su et al. [22] remove components of three pixels or less. More complicated modeling and classification algorithms can identify and remove noise components while retaining real ink [1]. The unknown pixel classification of Lelore and Bouchara [13] may also be viewed as post-processing of a sort.

Similar techniques can be applied to the results from the method described herein. Indeed, both the three-pixel component filter and a more complex classification-based approach reminiscent of Agrawal & Doermann have been tested informally and found to reduce binarization error for the DIBCO test images, relative to ground truth. Unfortunately, gains in training can come at the potential cost of a loss of generality in practice. This suspicion is borne out by the DIBCO 2011 contest results. The two algorithms that ranked in first and second place according to the contest scoring methodology perform very well on most of the test images, but fail severely on one or two examples (PR6 and PR7). One might view this as a form of overfitting the problem: these algorithms are specialized to do very well on images that match the expectations of the designers, but cannot handle the full range of images that might be encountered "in the wild". This paper aims to develop binarization techniques that work automatically and reliably on as wide a range of images as possible.
Thus the experiments eschew post-processing heuristics of the type described above because they might decrease error on a given test set at the cost of generality. Such tricks are nevertheless worth mentioning because they may prove useful whenever the images to be binarized are amenable.


3 Experimental Results

The DIBCO events have provided a platform for side-by-side comparison of different binarization results to a ground truth created by a team of human referees. Three contests have been held to date: two included both handwritten and printed document images, and one consisted of just handwritten documents. Images from the first two contests were used to develop the algorithms described in this paper, with the third set held out for post-hoc testing.

A comprehensive survey of the parameter space has been prepared for this paper, providing the context necessary to evaluate the effectiveness of the automatic parameter tuning algorithms. It samples the 4-dimensional parameter space on a regular grid, surrounding and including the region found to produce the best binarizations in prior work [11]. The specific values tested for each parameter are:

\[ t_{hi} \in T_{hi} \equiv \{0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65\} \tag{10} \]

\[ t_{lo} \in T_{lo} \equiv \{0, 0.05, 0.10, 0.15, 0.20\} \tag{11} \]

\[ \sigma_E \in S_E \equiv \{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\} \tag{12} \]

\[ c \in C \equiv \{ 40 \cdot 2^{i/4} \mid i = 0, \ldots, 28 \} \tag{13} \]

Results from the parameter survey can answer several interesting questions. They reveal both the best possible performance with a fixed set of parameters over all images (static parameters), and the best possible performance in hindsight using the optimal parameter settings for each individual image (ideal parameters). The difference between these two values is the space where parameter-tuning algorithms operate: they cannot aspire to do better than the ideal, although they may certainly do worse than the best static setting. If the difference between the static and ideal numbers is small, then parameter tuning may not be worthwhile because any possible gains are limited.

A number of different statistics can measure and compare the effectiveness of binarization algorithms. The DIBCO events compute some half-dozen numbers, with four used to determine the winners.
The initial experiments herein focus on the F-measure, defined in Equation 14, because its interpretation is simple and it serves as a proxy for the other measures of binarization quality. An F-measure of 100% represents perfection. In this paper the term error refers to the difference between the observed F-measure and a perfect 100% score.

\[ F = \frac{2 \cdot R \cdot P}{R + P} \tag{14} \]

where the recall R and precision P are defined in terms of the number of true positive pixels N_TP, the number of false positive pixels N_FP, and the number of false negative pixels N_FN in the binarization B as compared to ground truth G.

\[ R = \frac{N_{TP}}{N_{TP} + N_{FN}} \tag{15} \]

\[ P = \frac{N_{TP}}{N_{TP} + N_{FP}} \tag{16} \]

Table 1 summarizes binarization results on the DIBCO 2009 and H-DIBCO 2010 images for various parameter-tuning algorithms. The static parameters for the experiments represented in this table are determined in leave-one-out style. In other words, for the ith image, the computation uses whatever static parameter setting would have given the best mean result on the other 23 images. Table 2 shows the range of parameter values chosen via this technique for the various algorithms. (Note that even algorithms with static parameter settings may show a range in Table 2 if the static values computed differ for various leave-one-out training folds.)

The second column of Table 1 gives the F-measure when all parameters are static; this is the baseline algorithm [11]. The third column gives the result achieved with an oracle that knows the ideal values of c and t_hi for each image; this represents the best possible outcome achievable by tuning the two parameters, but is not realizable in practice without prior access to ground truth. The ideal parameter settings reduce the overall error by more than one third compared to static parameters, showing that tuning methods have significant potential if they can come close to this ideal. The table does not show results from an oracle that optimizes all four parameters. If available, such an oracle could achieve a mean F-measure of 95.3% across all 24 images, hardly more than the two-parameter oracle. This confirms the conjecture that tuning all four parameters offers little reward in exchange for the risk.

The remaining columns show results for the various parameter-tuning strategies described in Section 2. Column four tunes c using Algorithm 1 and employs static settings for t_hi, t_lo, and σ_E. It does somewhat better than the all-static settings, but still shows room for improvement. Column five tunes both c and t_hi using Algorithm 2 and employs static settings for t_lo and σ_E. This method achieves 83% of the level of improvement made by the oracle. The final column shows the results of the two-state tuning for t_hi described in Algorithm 3. This method does nearly as well as the fully tuned t_hi, despite its much faster computation time.

Table 3 shows the results for the DIBCO 2011 images, used as a hands-off test set. Unlike the previous table, these are not trained leave-one-out style. To ensure an unbiased test, the algorithms were fixed before any results were examined, and static parameter values are set using the first two sets of contest images. The numbers here look more diverse because some of the new images are more extreme and therefore more challenging than the training images. However, the general picture remains the same: tuning c gives better results than static parameter settings, and tuning both c and t_hi improves the binarization quality still further for most images. A few exceptions lower the overall average, particularly H2011-6. Close inspection of this case reveals that the stability curve for t_hi on this image contains two basins of stability, and the minimization criterion chooses the wrong one by a slight margin. The two-state tuning does not suffer from this problem, because it selects t_hi from a more constrained set of choices.
In hindsight this may be an unforeseen practical advantage of Algorithm 3.

Despite the success of the tuning methods described, the best F-measure achieved by tuning still lies farther from the oracular best for the DIBCO 2011 images than in the experiments with earlier images. This reflects the diversity of the test set. Some of the images represent adverse examples for the stability criterion advocated in this paper; the two worst appear in Figure 5.

Table 1: F-Measure for DIBCO 2009 and H-DIBCO 2010 Images

    Image      Static  Oracle  Alg. 1  Alg. 2  Alg. 3
    H2009-T1   96.4    96.6    96.6    96.1    96.6
    H2009-T2   93.3    93.4    92.5    93.1    93.1
    H2009-1    92.7    95.9    95.4    95.8    95.8
    H2009-2    90.1    96.4    88.8    96.2    95.7
    H2009-3    94.7    95.6    92.0    94.8    95.1
    H2009-4    92.1    94.7    93.9    94.6    94.5
    H2009-5    86.5    92.7    91.9    92.3    92.6
    H2010-1    93.7    96.2    95.0    95.4    96.0
    H2010-2    89.0    96.1    95.4    95.7    95.4
    H2010-3    94.4    94.7    93.3    94.7    94.7
    H2010-4    92.9    94.4    93.5    93.8    93.9
    H2010-5    92.9    96.5    96.1    96.5    96.4
    H2010-6    90.9    91.2    90.4    90.9    90.9
    H2010-7    95.0    95.2    94.8    95.0    95.1
    H2010-8    92.6    93.7    92.2    93.5    93.4
    H2010-9    92.6    93.8    92.9    92.1    93.5
    H2010-10   88.8    92.7    90.8    92.5    87.2
    P2009-T1   88.5    97.4    96.7    97.3    97.2
    P2009-T2   98.0    98.5    98.1    98.4    98.5
    P2009-1    93.9    94.3    90.4    94.0    94.2
    P2009-2    96.8    96.9    96.8    96.8    96.9
    P2009-3    97.6    98.4    98.2    98.2    98.3
    P2009-4    93.5    93.8    92.8    92.9    92.8
    P2009-5    88.6    91.5    84.7    91.2    84.3
    Hand       92.3    94.7    93.3    94.3    94.1
    Print      93.9    95.8    93.9    95.6    94.6
    All        92.7    95.0    93.5    94.7    94.3

Table 2: Parameter Ranges for Table 1

    Parameter  Static  Oracle     Tune c     Tune t_hi  Dual
    c          160     47.6–1280  56.6–3044  80–1076    67.3–3044
    t_hi       0.4     0.15–0.65  0.2–0.3    0.15–0.6   0.25 or 0.5
    t_lo       0.1     0–0.2      0.1–0.15   0.1        0.1–0.15
    σ_E        0.6     0.3–0.8    0.5–0.6    0.6        0.5–0.6

Table 3: F-Measure for DIBCO 2011 Images

    Image      Static  Oracle  Tune c  Tune t_hi  Dual
    H2011-1    77.3    93.9    88.2    90.2       90.0
    H2011-2    97.4    97.5    96.6    97.3       97.4
    H2011-3    93.2    95.2    94.9    94.5       93.5
    H2011-4    92.5    92.7    92.5    92.0       92.0
    H2011-5    92.4    97.1    96.2    97.1       96.2
    H2011-6    88.2    94.4    91.1    60.6       92.1
    H2011-7    87.6    92.5    92.3    89.6       89.8
    H2011-8    95.3    95.7    94.9    95.3       95.4
    P2011-1    93.0    95.3    94.3    95.1       94.7
    P2011-2    77.6    90.8    73.5    71.5       72.3
    P2011-3    95.6    96.6    96.3    96.4       96.4
    P2011-4    95.0    95.2    94.5    95.2       95.1
    P2011-5    94.8    94.9    93.7    94.2       94.6
    P2011-6    66.7    93.1    89.9    86.0       86.3
    P2011-7    90.2    93.3    89.7    93.1       93.1
    P2011-8    90.3    92.2    88.5    91.3       91.5
    Hand       90.5    94.9    93.3    89.6       93.3
    Print      87.9    93.9    90.0    90.3       90.5
    All        89.2    94.4    91.7    90.0       91.9

[Figure 5 panels: H2011-1 (left); P2011-2, detail (right)]

Figure 5: The DIBCO 2011 test images contain several challenging examples. Both these images have unusual stability curves. At left, the high-amplitude noise on the right side generates instability at values of c that are optimal for the rest of the page. At right, the stability criterion chooses a c value that preserves the bleed-through ink.

3.1 Comparison with Other Algorithms

Results available for the DIBCO 2011 image set make possible a quantitative comparison between the algorithms developed herein and other recent work. The DIBCO 2011 contest used three primary criteria in addition to the F-measure. Of these, the peak signal-to-noise ratio (PSNR) correlates fairly strongly with the F-measure. If B is the binarization and G the ground truth binary image,

$$PSNR = -10 \log(\Delta(B, G)) \qquad (17)$$

The misclassification penalty metric (MPM) penalizes errors according to their distance from the ink boundary. Define D_ij as the distance of pixel (i, j) from this boundary in the ground truth.

$$MPM = \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} D_{ij} (B_{ij} \neq G_{ij})}{2 \sum_{i=0}^{m} \sum_{j=0}^{n} D_{ij} (1 - G_{ij})} \qquad (18)$$
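As a concrete reading of Equations 14 to 17, the following Python sketch computes the F-measure and PSNR for two small binary images. It rests on assumptions that are not stated in this section: ink pixels are coded as 1 and background as 0, Δ(B, G) is taken to be the fraction of pixels on which B and G disagree, and the logarithm is base 10. The function names are hypothetical.

```python
import math


def confusion_counts(B, G):
    """Count true positives, false positives, and false negatives (ink = 1)."""
    tp = fp = fn = 0
    for row_b, row_g in zip(B, G):
        for b, g in zip(row_b, row_g):
            if b and g:
                tp += 1
            elif b and not g:
                fp += 1
            elif g and not b:
                fn += 1
    return tp, fp, fn


def f_measure(B, G):
    """F-measure in percent, following Equations 14-16."""
    tp, fp, fn = confusion_counts(B, G)
    if tp == 0:
        return 0.0  # no correctly detected ink; avoids division by zero
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 100.0 * 2 * recall * precision / (recall + precision)


def psnr(B, G):
    """PSNR per Equation 17, assuming Delta is the mismatch fraction."""
    total = sum(len(row) for row in G)
    mismatches = sum(
        1 for row_b, row_g in zip(B, G) for b, g in zip(row_b, row_g) if b != g
    )
    if mismatches == 0:
        return math.inf  # identical images
    return -10.0 * math.log10(mismatches / total)
```

For example, with `G = [[1, 1, 0, 0], [1, 0, 0, 0]]` and `B = [[1, 0, 0, 0], [1, 1, 0, 0]]` there are two true positives, one false positive, and one false negative, so recall and precision are both 2/3 and the F-measure is about 66.7%. Two of the eight pixels disagree, giving a PSNR of about 6.02.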


The distance reciprocal distortion metric (DRD) attempts to account for human perceptual salience by weighting errors according to pixel values in a local 5×5 neighborhood, using a weight matrix W computed as the normalized inverse distance.

$$DRD = \frac{1}{N_8} \sum_{i=0}^{m} \sum_{j=0}^{n} \Gamma_{ij} (B_{ij} \neq G_{ij}) \qquad (19)$$
$$\Gamma_{ij} = \sum_{h=-2}^{2} \sum_{k=-2}^{2} W_{hk} (B_{ij} \neq G_{i+h,j+k}) \qquad (20)$$

N_8 is the number of non-uniform blocks obtained when G is divided into 8 × 8 blocks. Note that better binarizations will have higher F-measure and PSNR, but lower MPM and DRD. For further details on these metrics, see the DIBCO 2011 contest report [19].

Algorithm 3 is similar to entry #11 that was actually submitted to the DIBCO 2011 contest. That entry was based on a less comprehensive survey of the parameter space and thus differs in the setting of the static parameter values and in the range of c values tested. Entry #11 nevertheless had the highest mean F-measure across all the images, the highest mean peak signal-to-noise ratio, the best mean misclassification penalty metric, and the fifth-best distance reciprocal distortion metric. Strangely, despite these successes, entry #11 ranked only third in the official DIBCO 2011 contest results, and the two other entries that placed first and second had mean scores over the full image set that were uniformly worse on all four measures. The scoring methodology of the contest chose the winner based upon summed ranks over all images on various quality measures, rather than directly on the mean metric scores. The two higher-rated methods ranked well on most images, and although they showed extremely poor performance on a few of the printed documents, this ultimately had limited effect on the final score [19]. With rank-based scoring the comparison of one method relative to another is affected by the performance (and presence or absence) of all the other methods in the contest.

Table 4 compares the results of the algorithms developed in this paper with other recent methods on the DIBCO 2011 image set.
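The gap between rank-sum scoring and mean-score comparison can be illustrated with a small Python sketch. Everything here is hypothetical: the method names, the scores, and the use of a single higher-is-better measure (the contest summed ranks over several measures). It is meant only to show how a method with the better mean can lose on summed ranks, and how an extra entrant shifts the totals of existing ones.

```python
def rank_sum_scores(scores):
    """Sum per-image ranks for each method (rank 1 = best; lower total wins).

    scores[method] is a list of per-image values where higher is better.
    Ties are broken by method order, an arbitrary simplification.
    """
    methods = list(scores)
    n_images = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for i in range(n_images):
        # Best score on image i comes first.
        ordered = sorted(methods, key=lambda m: scores[m][i], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return totals


def mean_scores(scores):
    """Mean per-image score for each method."""
    return {m: sum(v) / len(v) for m, v in scores.items()}


# Hypothetical data: A is steady, B is slightly better except for one failure.
two_way = {"A": [90, 90, 90, 90], "B": [91, 91, 91, 50]}
# Adding a third hypothetical entrant C changes A's total but not B's.
three_way = dict(two_way, C=[90.5, 90.5, 90.5, 40])
```

With `two_way`, the means are 90.0 for A and 80.75 for B, yet `rank_sum_scores` gives A a total of 7 and B a total of 5, so B wins on ranks. With `three_way`, A's total rises to 10 while B stays at 5 and C scores 9, even though A's own results are unchanged.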
The first section shows selected entries from the contest (those that beat Entry #11 on at least one measure), the second section shows other recent algorithms (footnote 2), and the third section shows the algorithms described in this paper, trained using the DIBCO 2009 and H-DIBCO 2010 images for a fair comparison. The table gives the mean result for each of the four metrics averaged across all the test images, and also the rank score total used to determine the DIBCO 2011 winner. For algorithms not entered in the contest, the rank score is simulated by treating the algorithm in question as an additional entrant, indicated with an asterisk in the table. Of course, adding a contestant changes the rank scores of the actual entrants, making them worse whenever the new entry does better on an image. The modified scores of existing entries are not shown in the table, but the information is accounted for in the parenthetical overall rank shown in the rightmost column. As the table reveals, either of Algorithms 2 or 3 would have won the contest on rank scoring, and would also easily have placed first in mean score on each of the four quality measures as compared with all previous algorithms.

4 Conclusion

Tuning parameters offers both risks and potential rewards. Choosing the parameter values that perform best for a given image can lower binarization error by substantial amounts as compared with a static setting suitable for generic images. On the other hand, poorly chosen parameter values can sabotage the result: the potential losses in binarization quality generally dwarf the potential gains. Thus one must be careful to tune only where there is reasonable confidence of success.

This paper has introduced a stability heuristic criterion that helps to choose suitable parameter values for individual images. The approach hypothesizes that good parameter values are marked by low variability in the binarization solution with respect to changes in the parameter values.
This criterion leads to an algorithm that successfully picks good c values on all the images tested. A similar heuristic applied to the choice of t_hi also chooses good parameter values in most cases, although serious failures were observed with this technique on a handful of images. The most successful method tested uses the stability criterion to choose c, and selects t_hi from a constrained set of two possible values. This approach delivers a substantial fraction of the maximum gain possible from tuning both parameters, and can be computed at about one eighth the speed of a simple binarization with static parameters. Reference code for the technique will be available on the author's website.

Table 4: Algorithms compared by scores on the DIBCO 2011 image set.

    Method       F     PSNR  MPM (×10^-2)  DRD (×10^-3)  Ranks
    Entry 11     88.7  17.8  5.36          8.67          429 (3)
    Entry 10     80.9  16.1  104.48        64.42         309 (1)
    Entry 8      85.2  17.2  15.66         9.07          346 (2)
    Entry 3      85.1  16.4  5.88          8.09          649 (12)
    Entry 4      85.2  16.6  6.28          4.43          489 (5)
    Entry 6      83.6  16.7  8.08          4.57          470 (4)
    Entry 14     78.0  14.9  7.62          7.35          835 (17)

    Gatos [9]    84.3  16.3  6.39          6.08          663 (11) ∗
    Lelore [12]  56.2  12.5  2.97          13.30         811 (17) ∗
    Lu [14]      79.7  15.5  39.67         21.47         579 (8) ∗
    Su [22]      87.8  17.7  5.38          4.65          435 (3) ∗

    Static       89.2  18.2  9.05          5.76          438 (3) ∗
    Alg. 1       91.7  19.2  4.43          3.40          322 (2) ∗
    Alg. 2       90.0  19.1  3.89          3.78          323 (1) ∗
    Alg. 3       91.7  19.3  3.87          3.48          317 (1) ∗

Footnote 2: The numbers for Lelore & Bouchara exclude image PR6 because its results were not available.

The parameter survey results make it clear that for many images the tuning algorithms given herein come close to maximizing the potential of the base binarization algorithm. Further improvements in result quality will likely have to come through development of new base algorithms. Some of these new algorithms may be application-specific, whereas this work has striven for broad applicability. Whether the stability criterion that has proven so useful here will apply to other approaches remains to be seen, and is a topic for future work.
In any case, tuning parameters with the algorithms described herein substantially advances the current state of the art in document binarization, as evidenced by comparative results on the DIBCO 2011 test images.

Acknowledgment

The author thanks those who kindly shared results or implementation details of their algorithms for comparison purposes, including Basilis Gatos, Su Bolan, and Frédéric Bouchara.

References

[1] Agrawal, M., Doermann, D.: Stroke-like pattern noise removal in binary document images. In: International Conference on Document Analysis and Recognition, pp. 17–21 (2011)
[2] Badekas, E., Papamarkos, N.: Estimation of proper parameter values for document binarization. International Journal of Robotics and Automation 24(1), 66–78 (2009)
[3] Bar-Yosef, I., Beckman, I., Kedem, K., Dinstein, I.: Binarization, character extraction, and writer identification of historical Hebrew calligraphy documents. Int. J. Doc. Anal. Recognit. 9(2), 89–99 (2007)
[4] Barney-Smith, E.: An analysis of binarization ground truthing. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 27–33. Boston (2010)
[5] Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004)
[6] Canny, J.: A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 8(6), 679–714 (1986)


[7] Chen, Y., Leedham, D.: Decompose algorithm for thresholding degraded historical document images. IEEE Proceedings on Vision, Image and Signal Processing 152(6), 702–714 (2005)
[8] Dawoud, A.: Iterative cross section sequence graph for handwritten character segmentation. IEEE Transactions on Image Processing 16(8), 2150–2154 (2007)
[9] Gatos, B., Pratikakis, I., Perantonis, S.J.: Adaptive degraded document image binarization. Pattern Recognition 39(3), 317–327 (2006)
[10] Gatos, B., Pratikakis, I., Perantonis, S.J.: Improved document image binarization by using a combination of multiple binarization techniques and adapted edge information. In: International Conference on Pattern Recognition, pp. 1–4 (2008)
[11] Howe, N.: A Laplacian energy for document binarization. In: International Conference on Document Analysis and Recognition, pp. 6–10 (2011)
[12] Lelore, T., Bouchara, F.: Document image binarization using Markov field model. In: International Conference on Document Analysis and Recognition, pp. 551–555 (2009)
[13] Lelore, T., Bouchara, F.: Super-resolved binarization of text based on FAIR algorithm. In: International Conference on Document Analysis and Recognition, pp. 839–843 (2011)
[14] Lu, S., Su, B., Tan, C.L.: Document image binarization using background estimation and stroke edges. International Journal on Document Analysis and Recognition 13(4), 303–314 (2010)
[15] Mishra, A., Alahari, K., Jawahar, C.V.: An MRF model for binarization of natural scene text. In: International Conference on Document Analysis and Recognition (2011)
[16] Niblack, W.: An Introduction to Digital Image Processing. Prentice-Hall, Englewood Cliffs, New Jersey (1986)
[17] Otsu, N.: A threshold selection method from graylevel histogram. IEEE Trans. on System, Man, Cybernetics 19(1), 62–66 (1978)
[18] Peng, X., Setlur, S., Govindaraju, V., Sitaram, R.: Markov random field based binarization for hand-held devices captured document images. In: Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing, pp. 71–76 (2010)
[19] Pratikakis, I., Gatos, B., Ntirogiannis, K.: ICDAR 2011 document image binarization contest (DIBCO 2011). In: International Conference on Document Analysis and Recognition, pp. 1506–1510 (2011)
[20] Ramírez-Ortegón, M.A., Tapia, E., Ramírez-Ramírez, L.L., Rojas, R., Cuevas, E.: Transition pixel: A concept for binarization based on edge detection and gray-intensity histograms. Pattern Recognition pp. 1233–1243 (2010)
[21] Sauvola, N., Pietikainen, M.: Adaptive document image binarization. Pattern Recognition 33(2), 225–236 (2000)
[22] Su, B., Lu, S., Tan, C.L.: Binarization of historical document images using the local maximum and minimum. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 159–166 (2010)
[23] Vonikakis, V., Andreadis, I., Papamarkos, N., Gasteratos, A.: Adaptive document binarization: A human vision approach. In: 2nd International Conference on Computer Vision Theory and Applications, pp. 104–110. Barcelona (2007)
