11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0

Understanding Analyzers, Tokenizers, and Filters

The following sections describe how Solr breaks down and works with textual data. There are three main concepts to understand: analyzers, tokenizers, and filters.

Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or they may be composed of a series of tokenizer and filter classes.

Tokenizers break field data into lexical units, or tokens.

Filters examine a stream of tokens and keep, transform, or discard them, or create new ones. Tokenizers and filters may be combined to form pipelines, or chains, where the output of one is input to the next. Such a sequence of tokenizers and filters is called an analyzer, and the resulting output of an analyzer is used to match query results or build indices.
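As a rough illustration of the chain described above, the following Python sketch models a tokenizer and two filters as plain functions and composes them into an analyzer. It does not use Solr or Lucene classes; the names `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`, and `make_analyzer`, as well as the stop-word list, are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable, Iterator, List

Tokenizer = Callable[[str], Iterator[str]]
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Break field text into tokens at whitespace."""
    yield from text.split()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Transform each token to lowercase."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Discard a few common words (illustrative stop list, not Solr's)."""
    stop_words = {"a", "an", "the", "of"}
    for token in tokens:
        if token not in stop_words:
            yield token


def make_analyzer(
    tokenizer: Tokenizer, filters: List[TokenFilter]
) -> Callable[[str], List[str]]:
    """Chain one tokenizer and any number of filters into an analyzer."""

    def analyze(text: str) -> List[str]:
        stream: Iterable[str] = tokenizer(text)
        for token_filter in filters:
            stream = token_filter(stream)  # output of one stage feeds the next
        return list(stream)

    return analyze


analyzer = make_analyzer(whitespace_tokenizer, [lowercase_filter, stop_filter])
print(analyzer("The Power of RAM"))  # ['power', 'ram']
```

The order of the filters matters in this sketch: lowercasing runs before the stop filter so that "The" is recognized as a stop word.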

Using Analyzers, Tokenizers, and Filters

Although the analysis process is used for both indexing and querying, the same analysis process need not be used for both operations. For indexing, you often want to simplify, or normalize, words: for example, setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on. Doing so can increase recall because, for example, "ram", "Ram" and "RAM" would all match a query for "ram". To increase query-time precision, a filter could be employed to narrow the matches by, for example, ignoring all-cap acronyms if you're interested in male sheep, but not Random Access Memory.
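The "ram" example can be sketched in Python as follows. This is a toy model, not Solr code: `recall_analyzer`, `precision_analyzer`, and `matching_docs` are hypothetical names, and treating any all-caps token of two or more characters as an acronym is an assumption made only for this illustration.

```python
from typing import Callable, Iterable, List

PUNCTUATION = '.,;:!?"'


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation."""
    stripped = (t.strip(PUNCTUATION) for t in text.split())
    return [t for t in stripped if t]


def lowercase(tokens: Iterable[str]) -> List[str]:
    """Normalize every token to lowercase."""
    return [t.lower() for t in tokens]


def drop_all_caps(tokens: Iterable[str]) -> List[str]:
    """Discard all-caps tokens of two or more characters (assumed acronym rule)."""
    return [t for t in tokens if not (len(t) > 1 and t.isupper())]


def recall_analyzer(text: str) -> List[str]:
    """Lowercase everything, so "ram", "Ram" and "RAM" become the same term."""
    return lowercase(tokenize(text))


def precision_analyzer(text: str) -> List[str]:
    """Remove acronyms before lowercasing, so "RAM" never becomes "ram"."""
    return lowercase(drop_all_caps(tokenize(text)))


def matching_docs(
    docs: List[str], analyzer: Callable[[str], List[str]], query: str
) -> List[str]:
    """Return the documents whose analyzed terms include any query term."""
    query_terms = set(lowercase(tokenize(query)))
    return [d for d in docs if query_terms & set(analyzer(d))]


docs = ["a ram in the field", "Ram sales rose", "16 GB of RAM"]
print(matching_docs(docs, recall_analyzer, "ram"))     # all three documents
print(matching_docs(docs, precision_analyzer, "ram"))  # the first two only
```

With the first analyzer every document matches (higher recall); with the second, the document about memory is excluded (higher precision for a reader interested in sheep).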

The tokens output by the analysis process define the values, or terms, of that field and are used either to build an index of those terms when a new document is added, or to identify which documents contain the terms you are querying for.
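To make the role of terms concrete, here is a minimal Python sketch of an index built from analyzed tokens and then consulted at query time. It is a simplification under stated assumptions: the `analyze` function stands in for whatever analyzer a field uses, and `build_index` and `search` are hypothetical helpers, not Solr APIs.

```python
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Stand-in analyzer: whitespace tokenization plus lowercasing."""
    return [token.lower() for token in text.split()]


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document IDs whose field contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return IDs of documents containing every term of the analyzed query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "Solr breaks down text", 2: "Text analysis in Solr", 3: "Male sheep"}
index = build_index(docs)
print(sorted(search(index, "solr TEXT")))  # [1, 2]
```

Because the same `analyze` function is applied to documents and to the query, "TEXT" in the query finds "text" and "Text" in the documents.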

For More Information

These sections show you how to configure field analyzers and also serve as a reference for the details of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can configure your own analysis classes if you have special needs that cannot be met with the included filters or tokenizers.

For Analyzers, see:

Analyzers: Detailed conceptual information about Solr analyzers.

Running Your Analyzer: Detailed information about testing and running your Solr analyzer.

For Tokenizers, see:

About Tokenizers: Detailed conceptual information about Solr tokenizers.

Tokenizers: Information about configuring tokenizers, and about the tokenizer factory classes included in this distribution of Solr.

For Filters, see:

About Filters: Detailed conceptual information about Solr filters.

Filter Descriptions: Information about configuring filters, and about the filter factory classes included in this distribution of Solr.

CharFilterFactories: Information about filters for pre-processing input characters.

To find out how to use Tokenizers and Filters with various languages, see:

Language Analysis: Information about tokenizers and filters for character set conversion or for use with specific languages.

