17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


... extracts text and metadata from the PDF document and appends it to the optional user-supplied comment which accompanies the uploaded document. The text is hidden in an HTML comment so that it will not be visible to users when they view the document comment. Since MediaWiki indexes the full contents of the comment (including the hidden full text), the text contents of the PDF will also be indexed. The text for the index is constructed as follows:

> The TET connector feeds the values of all document info fields to the index.

> The text contents of all pages are extracted and concatenated.

> If the size of the extracted text is below the size limit, it is fed to the index in full. The advantage of this method is that search results will display the search term in context.

> If the size of the extracted text exceeds the size limit, the text is reduced to unique words (i.e. multiple instances of the same word are reduced to a single instance of the word).

> If the size of the reduced text is below the size limit, it is fed to the index. Otherwise it is truncated, i.e. some text towards the end of the document will not be indexed.

The predefined limit is 512 KB, but this can be changed in PDFIndexer.php. If one of the size tests described above hits the limit, a warning message will be written to MediaWiki’s DebugLogFile if MediaWiki logging is activated.
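
The size tests described above can be sketched in PHP roughly as follows. This is only an illustration of the described steps, not the code shipped in PDFIndexer.php: the function names `buildIndexText` and `appendHiddenText`, the constant `INDEX_LIMIT`, the whitespace-based word splitting, and the escaping of double hyphens are all assumptions made for this example.

```php
<?php
// Hypothetical sketch of the indexing rules described above.
// All names here are assumptions; the real PDFIndexer.php may differ.

const INDEX_LIMIT = 524288; // 512 KB, the predefined limit named in the text

/**
 * Build the text for the index from info field values and page texts.
 * Returns array(text, warnings). A caller could pass the warnings on to
 * MediaWiki's debug log, e.g. with wfDebug().
 */
function buildIndexText(array $infoValues, array $pageTexts, int $limit = INDEX_LIMIT): array
{
    $warnings = array();

    // Info field values first, followed by the concatenated page contents.
    $text = implode("\n", array_merge($infoValues, $pageTexts));

    // Below the limit: index the full text, so search hits show context.
    if (strlen($text) < $limit) {
        return array($text, $warnings);
    }
    $warnings[] = 'extracted text hit the size limit; reducing to unique words';

    // Keep only the first occurrence of each word (case-sensitive here,
    // which is an assumption of this sketch).
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $reduced = implode(' ', array_unique($words));

    // Still too large: cut off the end of the word list.
    // Note that substr() counts bytes and may split a multi-byte character.
    if (strlen($reduced) >= $limit) {
        $warnings[] = 'unique word list hit the size limit; truncating';
        $reduced = substr($reduced, 0, $limit);
    }

    return array($reduced, $warnings);
}

/**
 * Append the index text to the user-supplied comment, hidden in an HTML comment.
 * A space is inserted between consecutive hyphens so that the text cannot
 * close the HTML comment early (an assumption of this sketch).
 */
function appendHiddenText(string $userComment, string $indexText): string
{
    $safe = preg_replace('/-(?=-)/', '- ', $indexText);
    return $userComment . "\n<!-- " . $safe . " -->";
}
```

With a small limit the three outcomes are easy to see: `buildIndexText(array(), array('a b a b a b'), 8)` yields the reduced word list `a b` with one warning, while a text whose unique word list is still too long is cut at the limit and produces two warnings.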

Searching for PDF documents. Since PDF documents are treated as images by MediaWiki, you must search for them in the Image namespace. This can be achieved by activating the Image checkbox in the list of namespaces in the Advanced search dialog (see Figure 4.2). The Image namespace will not be searched by default. However, searching it by default can be enabled in the LocalSettings.php preferences file as follows:

    $wgNamespacesToBeSearchedDefault = array(
        NS_MAIN => true,
        NS_IMAGE => true,
    );

Fig. 4.2 Searching PDF documents in MediaWiki<br />

The search results will display a list of documents which contain the search term. If the full text has been indexed (as opposed to the abbreviated word list for long documents), some additional terms will be displayed before and after the search term to provide context. Since the PDF text contents are fed to the MediaWiki index in HTML form, line numbers will be displayed in front of the text. These line numbers are not relevant for PDF documents, and you can safely ignore them.

4.6 TET Connector for MediaWiki
