PDFlib Text Extraction Toolkit (TET) Manual
4.6 <strong>TET</strong> Connector for MediaWiki<br />
MediaWiki is the free wiki software which is used to run Wikipedia and many other<br />
community Web sites. More details on MediaWiki can be found at<br />
www.mediawiki.org/wiki/MediaWiki.<br />
Note Protected documents can be indexed with the shrug option under certain conditions (see<br />
Chapter 5.1, »Indexing protected PDF Documents«, page 49, for details). This is prepared in the<br />
Connector files, but you must manually enable this option.<br />
Requirements and installation. The <strong>TET</strong> distribution contains a <strong>TET</strong> connector which<br />
can be used to index PDF documents that are uploaded to a MediaWiki site. MediaWiki<br />
does not support PDF documents natively, but allows you to upload PDFs as »images«.<br />
The <strong>TET</strong> connector for MediaWiki indexes all PDF documents as they are uploaded. PDF<br />
documents which already exist in MediaWiki will not be indexed. The following requirements<br />
must be met:<br />
> PHP 5.0 or above<br />
> MediaWiki 1.11.2 or above (see below for older versions)<br />
> A <strong>TET</strong> distribution package for Unix, Linux, Mac, or Windows.<br />
To set up the <strong>TET</strong> connector for MediaWiki, perform the following steps:<br />
> Install the <strong>TET</strong> binding for PHP as described in Section 3.8, »PHP Binding«, page 29.<br />
> Copy /connectors/MediaWiki/PDFIndexer.php to<br />
/extensions/PDFIndexer/PDFIndexer.php.<br />
> If you need support for CJK text, copy the CMap files in /resource/cmap<br />
to /extensions/PDFIndexer/resource/cmap.<br />
> Add the following lines to the MediaWiki configuration file LocalSettings.php:<br />
# Index uploaded PDFs to make them searchable<br />
include("extensions/PDFIndexer/PDFIndexer.php");<br />
> To avoid warnings when uploading PDF documents, it is recommended that you<br />
add the following lines to /includes/DefaultSettings.php so that<br />
.pdf becomes a well-known file type extension:<br />
/**<br />
* This is the list of preferred extensions for uploading files. Uploading files<br />
* with extensions not in this list will trigger a warning.<br />
*/<br />
$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'pdf' );<br />
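The two configuration steps above can also be combined in LocalSettings.php alone. The fragment below is only a sketch of that alternative, not the method given in this manual. It assumes that LocalSettings.php has already loaded the default settings, so that $wgFileExtensions exists when these lines run, and that the connector was copied to the relative path used in the include line above.

```php
# LocalSettings.php fragment (sketch; assumes the defaults are already loaded)

# Index uploaded PDFs to make them searchable
include("extensions/PDFIndexer/PDFIndexer.php");

# Assumption: $wgFileExtensions already holds the default list at this point.
# Appending 'pdf' keeps the defaults and avoids the upload warning
# without modifying DefaultSettings.php itself.
if ( !in_array( 'pdf', $wgFileExtensions ) ) {
    $wgFileExtensions[] = 'pdf';
}
```

Keeping the change in LocalSettings.php means it is less likely to be lost when MediaWiki is upgraded and DefaultSettings.php is replaced.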
Working with MediaWiki versions older than 1.11.2. The <strong>TET</strong> connector for MediaWiki<br />
does not work properly in MediaWiki versions older than 1.11.2 due to a bug in MediaWiki;<br />
a PHP error message occurs instead. In order to use the <strong>TET</strong> connector with older<br />
MediaWiki versions you must apply a simple patch to the file includes/SpecialUpload.php<br />
as detailed here:<br />
svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/<br />
SpecialUpload.php?sortby=file&r1=30403&r2=30402&pathrev=30403<br />
How the <strong>TET</strong> connector for MediaWiki works. The <strong>TET</strong> connector for MediaWiki consists<br />
of the PHP module PDFIndexer.php. Using one of MediaWiki’s predefined hooks it is<br />
hooked up so that it will be called whenever a new PDF document is uploaded. It ex-<br />