11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0


In: "babaloo"

Out: "ba", "bab", "baba", "babal"

Example:

Edge n-gram range of 2 to 5, from the back side:

In: "babaloo"

Out: "oo", "loo", "aloo", "baloo"
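The two In/Out pairs above can be reproduced with a few lines of Python. This is only a sketch of the edge n-gram idea, not Solr's implementation: the function name `edge_ngrams` and its parameters are hypothetical and chosen for illustration.

```python
def edge_ngrams(text, min_size, max_size, side="front"):
    """Return the edge n-grams of `text` with lengths min_size..max_size.

    side="front" takes prefixes, side="back" takes suffixes.
    Hypothetical helper for illustration; it does not call Solr or Lucene.
    """
    if side not in ("front", "back"):
        raise ValueError("side must be 'front' or 'back'")
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    grams = []
    # Never ask for a gram longer than the input itself.
    for n in range(min_size, min(max_size, len(text)) + 1):
        grams.append(text[:n] if side == "front" else text[-n:])
    return grams


print(edge_ngrams("babaloo", 2, 5))               # ['ba', 'bab', 'baba', 'babal']
print(edge_ngrams("babaloo", 2, 5, side="back"))  # ['oo', 'loo', 'aloo', 'baloo']
```

Grams are emitted shortest first, matching the order of the Out lines above.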

ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.

You can customize this tokenizer's behavior by specifying per-script rule files. To add per-script rules, add a rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.
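To make the code:rulefile format concrete, the following Python sketch splits such a value into script codes and resource paths. It is an assumption-laden illustration rather than Solr's own parser: `parse_rulefiles` is a hypothetical helper, and the validation it performs (four ASCII letters before the colon) is inferred from the format description above.

```python
def parse_rulefiles(value):
    """Split a rulefiles string into a {script_code: resource_path} dict.

    Hypothetical helper for illustration; Solr performs its own parsing.
    """
    rules = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        code, sep, path = pair.partition(":")
        if not sep or not path:
            raise ValueError(f"expected code:rulefile, got {pair!r}")
        # ISO 15924 script codes are four letters, e.g. "Latn" or "Cyrl".
        if len(code) != 4 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not a four-letter script code: {code!r}")
        rules[code] = path
    return rules


print(parse_rulefiles("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"))
```

Only the first colon in each pair separates the script code from the path, so resource paths that themselves contain colons are kept intact.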

The default solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization (like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specialized handling of double and single quotation marks) and for syllable tokenization of Khmer, Lao, and Myanmar.

Factory class: solr.ICUTokenizerFactory

Arguments:

rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

Example:
