16.01.2013 Views

Microsoft Sharepoint Products and Technologies Resource Kit eBook


588 Part VII: Information Management in SharePoint Products and Technologies

Filters

When a file is created, it is stored in the proprietary format of the application that created it. For example, a document created in Microsoft Excel is in a different format than a document created in Notepad. SharePoint Portal Server must convert the contents of each document to a generic format before it can place the content into an index. The process of changing the format of the document is called filtering.

The purpose of a filter is to remove the proprietary formatting while extracting the text of the document as well as the document’s properties. When a document is run through a filter, it is converted into an unbroken string of Unicode characters. Filtering is run by the filter daemon process (Mssdmn.exe).
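
The idea of filtering can be illustrated with a minimal Python sketch. This is not how Mssdmn.exe or any SharePoint component is implemented; the class `HtmlTextFilter` and the function `filter_html` are hypothetical names introduced here, and HTML is used only because it is a convenient format to strip. The sketch removes the markup and returns the document text as one Unicode string, together with a small dictionary of properties.

```python
from html.parser import HTMLParser


class HtmlTextFilter(HTMLParser):
    """Hypothetical filter: drops HTML markup, keeps text and a title property."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []   # pieces of body text, in document order
        self.properties = {}   # document properties (here: only the title)
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        if self._in_title:
            # The title is treated as a property, not as body text.
            self.properties["title"] = stripped
        else:
            self.text_parts.append(stripped)


def filter_html(raw_bytes, encoding="utf-8"):
    """Return (text, properties) for an HTML document given as bytes."""
    parser = HtmlTextFilter()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    # Join the pieces into a single string of Unicode characters.
    return " ".join(parser.text_parts), parser.properties
```

Calling `filter_html` on the bytes of a small HTML page yields the visible text with all tags removed, plus the page title as a property.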

IFilters

The component that knows how to interpret a particular file format is called an IFilter (or Index filter). The gatherer uses IFilters to scan documents and to extract their text and properties (metadata). An IFilter also filters out embedded formatting while keeping information about the position of the text within the document.
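
As a rough illustration of what keeping position information can mean, the following Python sketch records the character offset of each word in a block of filtered text. The function name `words_with_positions` is hypothetical, and the sketch does not reflect the actual IFilter interface or the data structures SharePoint Portal Server uses; it only shows the general idea of pairing extracted text with its location.

```python
import re


def words_with_positions(text):
    """Return (word, character offset) pairs for already-filtered text."""
    # \w+ matches runs of letters, digits, and underscores; punctuation is skipped.
    return [(match.group(0), match.start()) for match in re.finditer(r"\w+", text)]
```

An index that stores offsets like these can answer phrase or proximity queries, because it knows not only that two words occur in a document but also how far apart they are.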

SharePoint Portal Server includes IFilters for the following: Microsoft Office files, HTML files, Tagged Image File Format (TIFF) files, and text files. The IFilter for Microsoft Publisher, however, is included in SharePoint Portal Server Service Pack 1. SharePoint Portal Server also accepts filters for other applications. IFilters for popular file types such as PDF can be obtained from the vendor. If the gatherer component has no IFilter for a particular file type, it will attempt to extract the text of a document using a generic filter called the Null IFilter.

If an IFilter is not available for a document type that you would like to include in an index, you can contact the vendor of the application used to create the document, or you can create a customized IFilter component by using the SharePoint Portal Server SDK. An IFilter must be registered with the operating system and associated with a file type before SharePoint Portal Server can use it in the indexing process. If you obtain an IFilter from a vendor, the vendor should include registration instructions. The IFilter must be registered on the computer running SharePoint Portal Server that is crawling that file type. File types are indicated by file extension, such as .doc for a Microsoft Word document. Once an IFilter is registered and a file type has an IFilter associated with it, documents of that file type can be crawled and included in the index. If no filter is registered for a file type, only the file properties are included in the index.
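
The association between file extensions and filters can be sketched in Python as a simple lookup table. This is only a model of the behavior described above, not the Windows registration mechanism itself. All names here (`FILTER_REGISTRY`, `register_filter`, `plain_text_filter`, `index_entry`) are hypothetical, and the properties recorded (file name and size) are chosen purely for illustration.

```python
import os

# Hypothetical registry mapping a lowercase file extension to a filter callable.
FILTER_REGISTRY = {}


def register_filter(extension, filter_func):
    """Associate a filter with a file extension such as '.txt'."""
    FILTER_REGISTRY[extension.lower()] = filter_func


def plain_text_filter(raw_bytes):
    """Toy filter for text files: decode the bytes to a Unicode string."""
    return raw_bytes.decode("utf-8", errors="replace")


def index_entry(path, raw_bytes):
    """Build an index entry; text is included only if a filter is registered."""
    extension = os.path.splitext(path)[1].lower()
    entry = {
        "properties": {"name": os.path.basename(path), "size": len(raw_bytes)}
    }
    filter_func = FILTER_REGISTRY.get(extension)
    if filter_func is not None:
        entry["text"] = filter_func(raw_bytes)
    # With no registered filter, only the file properties are indexed.
    return entry


register_filter(".txt", plain_text_filter)
```

With this setup, `index_entry("notes.txt", b"hello")` produces an entry containing both properties and text, while a file with an unregistered extension produces an entry with properties only.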

Filtering Limitations

SharePoint Portal Server has a limit to the amount of text data that it can filter from a single document. This limit applies only to the text in the document. It does not apply to graphics or any other type of content. By default, the limit is 64 KB of text
