11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0


You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>.
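
As a minimal sketch of such a declaration, the field type below places a <tokenizer> element inside an <analyzer>. The field type name "text" and the lowercase filter are illustrative assumptions rather than values taken from this guide.

```xml
<!-- Hypothetical field type; the name and the filter are chosen only for illustration -->
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <!-- The tokenizer is listed first and splits the field text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Any filters follow the tokenizer and post-process its token stream -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

An analyzer holds exactly one tokenizer, while the number of filters after it can vary.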

The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the org.apache.solr.analysis.TokenizerFactory interface. A TokenizerFactory's create() method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer, it passes a Reader object that provides the content of the text field.

Tokenizers discussed in this section:

Standard Tokenizer
Classic Tokenizer
Keyword Tokenizer
Letter Tokenizer
Lower Case Tokenizer
N-Gram Tokenizer
Edge N-Gram Tokenizer
ICU Tokenizer
Path Hierarchy Tokenizer
Regular Expression Pattern Tokenizer
UAX29 URL Email Tokenizer
White Space Tokenizer
Related Topics

Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.
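
For instance, a pattern-based tokenizer could receive its regular expression through an attribute. The sketch below assumes a hypothetical field type named "semicolonDelimited" whose values are separated by a semicolon and a space. The pattern attribute belongs to solr.PatternTokenizerFactory, while the field type name and the delimiter are illustrative assumptions.

```xml
<!-- Hypothetical field type for values such as "red; green; blue" -->
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <!-- The pattern attribute is handed to the factory as an argument;
         here the regular expression matches the delimiter between tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
```

Each factory defines its own set of accepted attributes, so the arguments that apply depend on the tokenizer class named in the class attribute.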

The following sections describe the tokenizer factory classes included in this release of Solr.

For more information about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

Note that words are split at hyphens.

The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.StandardTokenizerFactory
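
A minimal analyzer using this factory might look like the sketch below. It sets no arguments and is wrapped in a bare <analyzer> element, which assumes it sits inside a field type definition not shown here.

```xml
<!-- Sketch only: an analyzer that applies the Standard Tokenizer with default settings -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```

Under the rules described above, an input such as "Please, email john.doe@foo.com by 03-09, re: m37-xq." would be expected to produce the tokens "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq". The interior periods are kept, while the "@" character and the hyphens split the text.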
