
WWW/Internet - Portal do Software Público Brasileiro



ISBN: 978-972-8939-25-0 © 2010 IADIS

Figure 2. Directory-based server search system

A specification file written by the server administrator is attached to each OCR engine. The specification file is written in XML (eXtensible Markup Language) so that automated data handling becomes easy. The file describes the specifications of the server, including the location (URL) of the server, the location of the application interface program (CGI), the name of the OCR engine used, supported languages, supported document types, etc. The portal server has a robot program that collects the specification data automatically and periodically from the OCR servers. The robot analyzes each XML specification file and updates the database entries. A simple search program picks up the OCR servers that match the client's needs from the database and shows the search results.

2.3 Available Servers

Table 1 shows the WeOCR servers known to the author as of Sept. 2010. All the servers are built on Open Source OCR software. Although these OCR engines work very well with some limited kinds of documents, their average performance is generally inferior to that of commercial products. Thanks to Tesseract OCR (Smith 2007), the situation is improving.

Table 1. WeOCR servers (as of Sept. 2010)

| Server name / location | Sub-ID | Engine | Deployed | Description |
|---|---|---|---|---|
| Michelle / Japan (formerly known as ocr1.sc) | ocrad | GNU Ocrad | Feb 2005 | Western European languages |
| | gocr | GOCR | Nov 2005 | Western European languages |
| Maggie / Japan (formerly known as appsv.ocrgrid) | ocrad | GNU Ocrad | Jun 2006 | Western European languages |
| | hocr | HOCR | Jun 2006 | Hebrew OCR |
| | tesseract | Tesseract OCR | Aug 2006 | English OCR |
| | scene | Tesseract OCR + private preprocessor | Apr 2007 | Scene text recognition for English (experimental) |
| | nhocr | NHocr | Sep 2008 | Japanese OCR |
| Jimbocho / Japan | | | Feb 2010 | (mirror of ocrad, tesseract and nhocr on Maggie) |

Since there was no Open Source OCR for Japanese, we also developed a Japanese OCR engine called NHocr from scratch, as a contribution to the developers and end-users who are eager to have one. (See http://code.google.com/p/nhocr/)

Unfortunately, there is no server outside our laboratory today. There was once a Turkish WeOCR server, located in Turkey, with private (closed) post-processing. It was shut down a couple of years ago for unknown reasons. We have heard from some developers that they are interested in deploying OCR servers. However, none has been completed or opened to the public as far as we know. Although some
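The directory mechanism described under Figure 2 (an XML specification file per OCR engine, a robot that parses it into a database, and a search step that filters on the client's needs) can be illustrated with a minimal Python sketch. The paper does not give the actual schema, so the element names (`ocrserver`, `url`, `cgi`, `engine`, `language`, `doctype`), the example URLs, and the function names below are all hypothetical. The database is a plain dictionary, and the periodic download of the files from the OCR servers is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification file. The element names and URLs are
# illustrative assumptions, not the actual WeOCR format.
EXAMPLE_SPEC = """\
<ocrserver>
  <url>http://ocr.example.org/</url>
  <cgi>http://ocr.example.org/cgi-bin/ocr.cgi</cgi>
  <engine>Tesseract OCR</engine>
  <languages>
    <language>English</language>
  </languages>
  <doctypes>
    <doctype>printed</doctype>
  </doctypes>
</ocrserver>
"""


def parse_spec(xml_text):
    """Turn one specification file into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "url": root.findtext("url", default="").strip(),
        "cgi": root.findtext("cgi", default="").strip(),
        "engine": root.findtext("engine", default="").strip(),
        "languages": [e.text.strip() for e in root.iter("language") if e.text],
        "doctypes": [e.text.strip() for e in root.iter("doctype") if e.text],
    }


def update_database(database, xml_text):
    """Robot step: parse a specification and insert or refresh its entry."""
    entry = parse_spec(xml_text)
    database[entry["url"]] = entry  # keyed by server URL (an assumption)
    return entry


def search_servers(database, language=None, doctype=None):
    """Search step: return the entries that match the client's needs."""
    results = []
    for entry in database.values():
        if language is not None and language not in entry["languages"]:
            continue
        if doctype is not None and doctype not in entry["doctypes"]:
            continue
        results.append(entry)
    return results


database = {}
update_database(database, EXAMPLE_SPEC)
matches = search_servers(database, language="English")
```

In this sketch a real robot would first fetch each specification file over HTTP and then call `update_database` on the downloaded text. Keying the entries by server URL means a re-collected file overwrites the older entry instead of duplicating it.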
