Microsoft SharePoint Products and Technologies Resource Kit eBook

Plug-Ins

Chapter 21: The Architecture of the Gatherer

in a single document. If a document has more text than the limit, SharePoint Portal Server stops the indexing process, considers the document indexed, and moves on to the next document.

SharePoint Portal Server has a registry value that limits the maximum file size that it will crawl. The registry value is called Max Download Size and is located in the HKLM/Software/Microsoft/SPS Search/Gathering Manager key. By default, this value is set to 16 MB. SharePoint Portal Server also has a registry entry that defines how many times larger than the file the extracted text can be. This registry value is stored at HKLM/Software/Microsoft/SPS Search/Gathering Manager/Max Grow Factor. By default, this value is set to four. This means that the text extracted from a file can be at most four times the file size.
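As a rough illustration of how the two defaults combine, the following sketch checks a file against the 16 MB download limit and computes the text ceiling implied by a grow factor of four. The function and constant names are hypothetical, and treating the size limit as inclusive is an assumption; this is not how SharePoint Portal Server itself is implemented.

```python
MB = 1024 * 1024

# Defaults quoted in the text; the constant names are hypothetical.
MAX_DOWNLOAD_SIZE_MB = 16
MAX_GROW_FACTOR = 4


def will_crawl(file_size_bytes, max_download_mb=MAX_DOWNLOAD_SIZE_MB):
    """Return True if the file is within the maximum crawlable size.

    Assumption: a file exactly at the limit is still crawled.
    """
    return file_size_bytes <= max_download_mb * MB


def max_text_bytes(file_size_bytes, grow_factor=MAX_GROW_FACTOR):
    """Upper bound on extracted text: the grow factor times the file size."""
    return file_size_bytes * grow_factor
```

Under these defaults, a 2 MB file could yield at most 8 MB of extracted text, and a file at the 16 MB limit could yield at most 64 MB.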

After the protocol handler and IFilter have been loaded, the gatherer then connects to the content source and begins streaming data out of the content source, obeying the rules you’ve created for the content source. (Please refer to Chapter 22 and the discussion on creating and managing content sources to learn about the Site Hit Frequency, Site Path, and other rules that can be created as part of the content source.)

The gatherer first streams out the metadata and then streams out the content of the document. The metadata includes the document’s properties and its permissions. Permissions gathered from the document are referenced when the result set is built, and any documents the user does not have access to are clipped from the result set before it is presented back to the user.
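The clipping step can be pictured with a small sketch. It assumes each indexed document carries the set of principals that were allowed to read it at crawl time; the `IndexedDocument` type, the `trim_results` function, and the principal names are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    url: str
    # Permissions captured with the metadata at crawl time (hypothetical shape).
    allowed_principals: frozenset


def trim_results(results, user_principals):
    """Keep only the documents that at least one of the user's principals may read."""
    user = set(user_principals)
    return [doc for doc in results if doc.allowed_principals & user]


results = [
    IndexedDocument("http://portal/docs/plan.doc", frozenset({"DOMAIN\\managers"})),
    IndexedDocument("http://portal/docs/faq.doc", frozenset({"DOMAIN\\everyone"})),
]
visible = trim_results(results, {"DOMAIN\\alice", "DOMAIN\\everyone"})
```

Here `visible` contains only the FAQ document, because the user holds none of the principals recorded for the plan.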

The data streams are sent through a series of plug-ins, which are components that perform certain functions on the data stream. By default, four plug-ins (Schema, Indexer, PQS, and AutoCat) ship with SharePoint Portal Server 2003 and are used to perform different functions.
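One way to picture a series of plug-ins is as a chain in which every chunk of the stream is handed to each component in turn. The sketch below uses a hypothetical `RecordingPlugin` that only records what it sees; the real plug-in interface is not described here, and the order in which the four plug-ins run is assumed to follow the order in which they are listed.

```python
class RecordingPlugin:
    """Hypothetical stand-in for a plug-in: records each chunk it is handed."""

    def __init__(self, name):
        self.name = name
        self.seen = []

    def process(self, chunk):
        self.seen.append(chunk)
        return chunk  # pass the data on unchanged to the next plug-in


def run_pipeline(plugins, stream):
    """Send every chunk of the stream through each plug-in in order."""
    for chunk in stream:
        for plugin in plugins:
            chunk = plugin.process(chunk)


# Plug-in names come from the text; their order here is an assumption.
PLUGIN_NAMES = ("Schema", "Indexer", "PQS", "AutoCat")
plugins = [RecordingPlugin(name) for name in PLUGIN_NAMES]

# The gatherer streams the metadata first and the content second.
run_pipeline(plugins, ["metadata", "content"])
```

After the run, each plug-in's `seen` list holds the metadata chunk followed by the content chunk.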

Schema Plug-In<br />

The Schema plug-in is a new feature in this version of Search for SharePoint Portal Server. This plug-in is responsible for adding new objects to the schema used by SharePoint Products and Technologies as they are discovered during the crawling process. When needed, this plug-in is also responsible for alias name resolution to Active Directory account names. This is useful for the profile database.

Indexer Plug-In<br />

The Indexer is responsible for several functions, including word breaking, stemming, and noise word removal. Each of these is discussed in the following sections.
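The three functions can be sketched on a toy scale as follows. The regular-expression word breaker, the suffix-stripping stemmer, and the short noise word list are simplified stand-ins chosen for illustration, not the language-specific components that SharePoint Portal Server uses. Removing noise words before stemming is also an assumption of this sketch.

```python
import re

# Illustrative noise word list; real lists are language-specific and longer.
NOISE_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})
SUFFIXES = ("ing", "ed", "s")


def break_words(text):
    """Toy word breaker: lowercase the text and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def stem(word):
    """Toy stemmer: strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Break the text into words, drop noise words, then stem what is left."""
    return [stem(w) for w in break_words(text) if w not in NOISE_WORDS]


terms = index_terms("The gatherer crawled the documents")
# terms == ["gatherer", "crawl", "document"]
```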

Word Breakers<br />

The data stream from the content source is an unbroken string of Unicode characters. Word breakers are needed to determine where the word boundaries are so that the
