11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0


Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.


In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
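
The rule described above can be checked outside Solr with an ordinary regular expression. The sketch below is only an illustration in Python: the pattern `[A-Z][A-Za-z]*` is an assumption derived from the prose description, and the helper name `extract_capitalized` is hypothetical, not part of Solr.

```python
import re
from typing import List

# Assumed pattern: one capital letter followed by zero or more letters
# of either case, as described in the text above.
CAPITALIZED_WORD = re.compile(r"[A-Z][A-Za-z]*")


def extract_capitalized(text: str) -> List[str]:
    """Return each match of the pattern as a token, in order of appearance."""
    return CAPITALIZED_WORD.findall(text)


if __name__ == "__main__":
    sample = "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
    print(extract_capitalized(sample))
```

Because the whole match is used as the token, this corresponds to selecting capture group 0 of the expression.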

Example:

Extract part numbers that are preceded by "SKU", "Part", or "Part Number" (case sensitive), with an optional colon separator. Part numbers must consist of numeric digits, optionally with hyphens. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.


In: "SKU: 1234, Part Number 5678, Part: 126-987"

Out: "1234", "5678", "126-987"
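
To make the group numbering concrete, the following Python sketch applies a pattern consistent with the description and keeps only group 3. The full pattern shown here is an assumption reconstructed from the prose (only the subexpression "[0-9-]+" is quoted in the text), and `extract_part_numbers` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed pattern. Counting left parentheses from left to right:
#   group 1: (SKU|Part(\sNumber)?)  the case-sensitive prefix
#   group 2: (\sNumber)             the optional " Number" suffix of "Part"
#   group 3: ([0-9-]+)              the part number itself
PART_NUMBER = re.compile(r"(SKU|Part(\sNumber)?):?\s([0-9-]+)")


def extract_part_numbers(text: str) -> List[str]:
    """Return the contents of capture group 3 for every match."""
    return [match.group(3) for match in PART_NUMBER.finditer(text)]


if __name__ == "__main__":
    print(extract_part_numbers("SKU: 1234, Part Number 5678, Part: 126-987"))
```

Selecting group 3 rather than group 0 is what discards the "SKU" or "Part" prefix and the separator, leaving only the digits and hyphens.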

UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

Periods (dots) that are not followed by whitespace are kept as part of the token.

Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

Recognizes and preserves as single tokens the following:

    Internet domain names containing top-level domains validated against the white list in the IANA Root Zone Database when the tokenizer was generated

    email addresses

    file://, http(s)://, and ftp:// URLs

    IPv4 and IPv6 addresses

The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.UAX29URLEmailTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
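
As a rough intuition for two of the behaviors described above (keeping URLs and email addresses whole, and ignoring over-long tokens), here is a deliberately simplified Python sketch. It is not the UAX#29 algorithm and not Solr's implementation: it does not validate top-level domains, recognize IP addresses, or apply the period and hyphen rules. The regular expressions, the function name `approx_tokenize`, and the example.com addresses are all assumptions made for illustration.

```python
import re
from typing import List

# Alternatives are tried left to right, so URLs and email addresses are
# matched as single tokens before the generic "word" fallback applies.
TOKEN = re.compile(
    r"(?:https?|ftp|file)://[^\s<>]+"     # simplified URL
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"      # simplified email address
    r"|\w+"                               # any other run of word characters
)


def approx_tokenize(text: str, max_token_length: int = 255) -> List[str]:
    """Tokenize text, dropping tokens longer than max_token_length."""
    return [t for t in TOKEN.findall(text) if len(t) <= max_token_length]


if __name__ == "__main__":
    sample = ("Visit http://example.com/contact.htm?from=external&a=10 "
              "or e-mail bob@example.com")
    print(approx_tokenize(sample))
```

In this sketch "e-mail" is split at the hyphen into "e" and "mail" because the word contains no number, while the URL and the email address each survive as one token.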

Example:
