Document Image Retrieval using Signatures as Queries - CEDAR

cedar.buffalo.edu

Document Image Retrieval using Signatures as Queries - CEDAR

Document Image Retrieval usingSignatures as QueriesSargur Srihari, Shravya Shetty, Harish Srinivasan,Siyuan Chen, Chen HuangCEDAR, Department of Computer Science and EngineeringUniversity at Buffalo, State University of New YorkGady Agam, Ophir FriederIllinois Institute of TechnologyUSA


Signature-based Document RetrievalSearch business document archive based on signatureimage queries


Sample Queries: Cropped SignaturesSamples for different writersSamples for same writer


RetrievalSignature Retrieval– Rank documents based on similarity to query signature


Indexing DocumentsStep1: Signature Block ExtractionStep3: Signature Feature ExtractionStep2


Step 1: Signature Block Extraction• Connected component analysis• Component Feature Extraction• to identify logos, noise – Horizontal Run,Horizontal & Vertical Profile, Density, Size• to distinguish handwriting from print– Slope, Stroke Orientation (Gabor Filter), Density, Size• to distinguish handwritten text from signatures– Maxima and Minima count, Height variation,Slope, Size• Classification using SVM• Signature Block Identification– Large signature components are merged withother neighboring possible signaturecomponents– Overlapping blocks are mergedIn 92%(276/300) of cases resulting blockcontained most of the signature


Step 2: Noise RemovalConnected Componentsh 1w 1w 1w mh mFisher Linear DiscriminantFeature 1 =relative contour size= ( h 1+w 1) / (h m+w m) = 0.4034Feature 2 =aspect ratio=h 1/ w 1= 0.9355After projectionSamples are 1-DA Bayes classifieris designed for theprojected samples


Sample images before and after noise removal


Step 3: Signature Feature ExtractionSignature Image under a 4 x 8 Division1024 bit Feature Vector


Document Retrieval


Matching algorithmDistance between query and each indexed document indatabase is calculated using normalized correlationdistance


Example of Distance ValuesWriter 1Writer 2Writer 3Values aresmallerwithin writer


Dataset for Manually Extracted SignaturesTotal No of Signatures 447Number of Writers 40Number of Signatures With Printed Text 137Avg Number of Signature Types per Writer 2Max Number of Signature Types per Writer 6All Samples for a Writer60 queries


Performance for Manual ExtractionPrecision-RecallF Measure: Harmonic MeanOf Precision and RecallA precision of 74.5% wasobtainedat a recall of 78.28%.On considering the Top 10ranks, a F-measure value of76.3 was obtained.


Dataset for Automatic Extraction(Entire Document)Automatic Signature Extraction was tested on a datasetof 300 documents, including documents with•Printed Text•Handwritten Text•Logos•Noise•Scratches, Scribbles, Words Circled•Lines and Black Borders•Tables, Seals•Poorly Scanned DocumentsTested on 80 queries


Performance Improvement usingQuery ExpansionAutomatic Relevance Feedback• A query expansion is done using the feedback(retrieval results) of the matching algorithm• The signature image returned by the matchingalgorithm with the lowest score is added to theexisting query to formulate an expanded query• Retrieval is then performed using the expanded query


Performance for Automatic ExtractionPrecision-RecallF Measure: Harmonic MeanOf Precision and RecallComparison of Precision-Recall Curves ofsignature retrieval results for 35 writers –Precision of 56% at a Recall of 63%On considering the Top10 ranks, a F-measurevalue of 59.6 wasobtained.


Conclusion and Future Work• Method can be extended to several otherdocument image retrieval applications• Potential improvements:increasing the feature setusing contextual informationhandling multiple signature typeshandling multiple signatures on same document

More magazines by this user
Similar magazines