11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0

About Tokenizers

The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream).

Characters in the input stream may be discarded, such as whitespace or other delimiters. Characters may also be added or replaced, for example when aliases or abbreviations are mapped to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It is also possible for more than one token to have the same position or to refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text.
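A tokenizer is normally configured for a text field type with a tokenizer element nested inside an analyzer element in the schema. The fragment below is a minimal sketch of that arrangement, not a recommended configuration: the field type name "text" and the choice of StandardTokenizerFactory followed by LowerCaseFilterFactory are assumptions made for illustration.

```xml
<!-- Illustrative field type; the name "text" is an assumption -->
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <!-- The class attribute names a TokenizerFactory, not a Tokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional first filter stage that consumes the tokenizer's output -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the tokens are usable as produced, the filter element can be dropped and the tokenizer left as the only component of the analyzer.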


The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the TokenizerFactory API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from Tokenizer, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline.

A TypeTokenFilterFactory is available that creates a TypeTokenFilter, which filters tokens based on their TypeAttribute. The set of token types to filter on is the one exposed by factory.getStopTypes.
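As a rough sketch of how such a type-based filter might be wired into an analyzer, the fragment below places a TypeTokenFilterFactory after a StandardTokenizerFactory. The field type name "text_typed" and the file name "stoptypes.txt" are hypothetical; the file is assumed to list one token type per line (for example, a type such as `<NUM>` as emitted by the standard tokenizer).

```xml
<!-- Illustrative only: field type name and types file are assumptions -->
<fieldType name="text_typed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- types: file listing token types, one per line.
         useWhitelist="false": tokens whose type is listed are removed;
         set it to "true" to keep only the listed types instead. -->
    <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="false"/>
  </analyzer>
</fieldType>
```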

For a complete list of the available TokenFilters, see the section Tokenizers.

When to Use a CharFilter vs. a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related (e.g., MappingCharFilter and ASCIIFoldingFilter) or nearly identical (e.g., PatternReplaceCharFilterFactory and PatternReplaceFilte
