PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
tracts text and metadata from the PDF document and appends it to the optional usersupplied<br />
comment which accompanies the uploaded document. The text is hidden in<br />
an HTML comment so that it will not be visible to users when they view the document<br />
comment. Since MediaWiki indexes the full contents of the comment (including the<br />
hidden full text) the text contents of the PDF will also be indexed. The text for the index<br />
is constructed as follows:<br />
> The <strong>TET</strong> connector feeds the value of all document info fields to the index.<br />
> The text contents of all pages are extracted and concatenated.<br />
> If the size of the extracted text is below a limit, it will completely be fed to the index.<br />
The advantage of this method is that search results will display the search term in<br />
context.<br />
> If the size of the extracted text exceeds a limit, the text is reduced to unique words<br />
(i.e. multiple instances of the same word are reduced to a single instance of the<br />
word).<br />
> If the size of the reduced text is below a limit, it will be fed to the index. Otherwise it<br />
will be truncated, i.e. some text towards the end of the document will not be indexed.<br />
The predefined limit is 512 KB, but this can be changed in PDFIndexer.php. If one of the<br />
size tests described above hits the limit, a warning message will be written to Media-<br />
Wiki’s DebugLogFile if MediaWiki logging is activated.<br />
Searching for PDF documents. Since PDF documents are treated as images by Media-<br />
Wiki you must search them in the Image namespace. This can be achieved by activating<br />
the Image checkbox in the list of namespaces in the Advanced search dialog (see Figure<br />
4.2). The Image namespace will not be searched by default. However, this setting can be<br />
enabled in the LocalSettings.php preferences file as follows:<br />
$wgNamespacesToBeSearchedDefault = array(<br />
NS_MAIN<br />
=> true,<br />
NS_IMAGE<br />
=> true,<br />
}<br />
Fig. 4.2 Searching PDF documents in MediaWiki<br />
The search results will display a list of documents which contain the search term. If the<br />
full text has been indexed (as opposed to the abbreviated word list for long documents)<br />
some additional terms will be displayed before and after the search term to provide<br />
context. Since the PDF text contents are fed to the MediaWiki index in HTML form, line<br />
numbers will be displayed in front of the text. These line numbers are not relevant for<br />
PDF documents, and you can safely ignore them.<br />
4.6 <strong>TET</strong> Connector for MediaWiki 47