17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



4.6 TET Connector for MediaWiki

MediaWiki is the free wiki software which is used to run Wikipedia and many other community Web sites. More details on MediaWiki can be found at www.mediawiki.org/wiki/MediaWiki.

Note: Protected documents can be indexed with the shrug option under certain conditions (see Chapter 5.1, »Indexing protected PDF Documents«, page 49, for details). This is prepared in the Connector files, but you must manually enable this option.

Requirements and installation. The TET distribution contains a TET connector which can be used to index PDF documents that are uploaded to a MediaWiki site. MediaWiki does not support PDF documents natively, but allows you to upload PDFs as »images«. The TET connector for MediaWiki indexes all PDF documents as they are uploaded. PDF documents which already exist in MediaWiki will not be indexed. The following requirements must be met:

> PHP 5.0 or above

> MediaWiki 1.11.2 or above (see below for older versions)

> A TET distribution package for Unix, Linux, Mac, or Windows.
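The two version requirements above can be checked with a few lines of PHP. This is a minimal sketch, not part of the TET distribution: the function name tet_connector_requirement_problems is hypothetical, and the MediaWiki version is passed in as a plain string because how you obtain it depends on your installation.

```php
<?php
// Sketch only: this helper is a hypothetical illustration and is not
// shipped with TET or MediaWiki.
function tet_connector_requirement_problems($phpVersion, $mediaWikiVersion)
{
    $problems = array();

    // The manual asks for PHP 5.0 or above.
    if (version_compare($phpVersion, '5.0.0', '<')) {
        $problems[] = 'PHP 5.0 or above is required, found ' . $phpVersion;
    }

    // MediaWiki versions older than 1.11.2 need the patch described
    // later in this section.
    if (version_compare($mediaWikiVersion, '1.11.2', '<')) {
        $problems[] = 'MediaWiki ' . $mediaWikiVersion
            . ' is older than 1.11.2 and must be patched first';
    }

    return $problems;
}

// Example call: the running PHP version and an assumed MediaWiki version.
print_r(tet_connector_requirement_problems(PHP_VERSION, '1.11.2'));
```

An empty result means both version requirements are met. The check does not verify that the TET binding for PHP is installed.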

To set up the TET connector for MediaWiki, perform the following steps:

> Install the TET binding for PHP as described in Section 3.8, »PHP Binding«, page 29.

> Copy /connectors/MediaWiki/PDFIndexer.php to /extensions/PDFIndexer/PDFIndexer.php.

> If you need support for CJK text, copy the CMap files in /resource/cmap to /extensions/PDFIndexer/resource/cmap.

> Add the following lines to the MediaWiki configuration file LocalSettings.php:

    # Index uploaded PDFs to make them searchable
    include("extensions/PDFIndexer/PDFIndexer.php");

> To avoid warnings when uploading PDF documents, it is recommended to add the following lines to /includes/DefaultSettings.php to make .pdf a well-known file type extension:

    /**
     * This is the list of preferred extensions for uploading files. Uploading files
     * with extensions not in this list will trigger a warning.
     */
    $wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'pdf' );
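The two configuration steps above could also be kept together in LocalSettings.php, which avoids editing a file that is replaced when MediaWiki is upgraded. The following fragment is a sketch under that assumption: it relies on LocalSettings.php being evaluated after DefaultSettings.php, so that $wgFileExtensions already holds the default list when the fragment runs. It is not the method described in the steps above, so check it against your MediaWiki version.

```php
# Fragment for LocalSettings.php (the file's opening <?php tag is already present).

# Index uploaded PDFs to make them searchable (line taken from the step above)
include("extensions/PDFIndexer/PDFIndexer.php");

# Assumption: append 'pdf' to the default list here instead of editing
# includes/DefaultSettings.php, so that PDF uploads do not trigger a warning.
$wgFileExtensions[] = 'pdf';
```

Appending to the array keeps whatever defaults your MediaWiki version ships with, whereas assigning a new array would replace them.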

Working with MediaWiki versions older than 1.11.2. The TET connector for MediaWiki does not work properly in MediaWiki versions older than 1.11.2 due to a bug in MediaWiki; a PHP error message occurs instead. In order to use the TET connector with older MediaWiki versions you must apply a simple patch to the file includes/SpecialUpload.php as detailed here:

svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SpecialUpload.php?sortby=file&r1=30403&r2=30402&pathrev=30403

How the TET connector for MediaWiki works. The TET connector for MediaWiki consists of the PHP module PDFIndexer.php. It is registered with one of MediaWiki’s predefined hooks so that it is called whenever a new PDF document is uploaded. It ex-

