The fast diacritic sensitive searches option works the same way, but for diacritic sensitivity.
(Diacritics are ancillary glyphs added to a letter, like an umlaut or accent mark.) By
default, if you search for "resume," you'll see matches for "résumé" as well. That's
usually appropriate, but not in every case. By turning on the diacritic-sensitive index,
you tell MarkLogic to maintain both diacritic-insensitive and diacritic-sensitive term list
entries, so you can resolve diacritic-sensitive matches out of indexes. 3
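To make the idea of keeping both kinds of term list entries concrete, here is a minimal Python sketch. It is not MarkLogic's implementation: the function names and the two dictionaries are hypothetical, and the folding rule (canonical decomposition, then dropping combining marks) is only an assumption about how "follows the Unicode definition" might be approximated.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Set, Tuple

def fold_diacritics(word: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Characters with no Unicode decomposition mapping, such as U+0142,
    pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_term_lists(docs: Dict[int, str]) -> Tuple[Dict[str, Set[int]], Dict[str, Set[int]]]:
    """Build a diacritic-sensitive and a diacritic-insensitive term list.

    Both map a term to the set of document ids that contain it
    (hypothetical structure, for illustration only).
    """
    sensitive: Dict[str, Set[int]] = defaultdict(set)
    insensitive: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            sensitive[word].add(doc_id)
            insensitive[fold_diacritics(word)].add(doc_id)
    return sensitive, insensitive

docs = {1: "my resume", 2: "mon résumé"}
sensitive, insensitive = build_term_lists(docs)

# Insensitive lookup finds both documents; sensitive lookup only the exact form.
print(sorted(insensitive["resume"]))
print(sorted(sensitive["resume"]))
```

With both lists available, either kind of query can be answered by a direct lookup, at the cost of storing an extra entry per term.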
STEMMED INDEXES<br />
Stemming is another situation where MarkLogic provides optimized indexing options.<br />
Stemming is the process of reducing inflected (or sometimes derived) words to their<br />
root form. It allows a query for "run" to additionally match "runs," "running," and<br />
"ran" because they all have the same stem root of "run." Performing a stemmed<br />
search increases recall by expanding the set of results that can come back. Often that's<br />
desirable, but sometimes not, such as when you're searching for proper nouns or precise<br />
metadata tags.<br />
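The recall trade-off can be sketched with a toy inverted index in Python. Everything here is an illustrative assumption: the `TOY_STEMS` table stands in for a real stemming library, and `build_indexes` and `search` are hypothetical helpers, not MarkLogic APIs.

```python
from collections import defaultdict

# Hypothetical stem table; a real stemmer is far more complete.
TOY_STEMS = {"runs": "run", "running": "run", "ran": "run"}

def stem(word):
    """Return the toy stem of a word, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)

def build_indexes(docs):
    """Index every document under both its exact words and their stems."""
    word_index = defaultdict(set)
    stem_index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            word_index[word].add(doc_id)
            stem_index[stem(word)].add(doc_id)
    return word_index, stem_index

def search(term, word_index, stem_index, stemmed=True):
    """Look up a term in the stemmed or the unstemmed index."""
    term = term.lower()
    if stemmed:
        return set(stem_index.get(stem(term), set()))
    return set(word_index.get(term, set()))

docs = {1: "she runs daily", 2: "he ran home", 3: "a long run"}
word_index, stem_index = build_indexes(docs)

print(sorted(search("run", word_index, stem_index, stemmed=True)))
print(sorted(search("run", word_index, stem_index, stemmed=False)))
```

The stemmed lookup returns every document containing an inflection of "run", while the unstemmed lookup returns only the document with the exact word.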
MarkLogic gives you the choice of whether to maintain stemmed or unstemmed<br />
indexes, or both. These indexes act somewhat like master index settings in that they<br />
impact all of the other text indexes. For example, the fast phrase searches option looks at<br />
the master index settings to determine if phrases should be indexed as word pairs or<br />
stem pairs, or both.<br />
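As a rough illustration of how master settings might drive a dependent index, the following Python sketch emits word pairs, stem pairs, or both for a phrase index. The flag names mirror the option names in the text, but the function, the term tuples, and the stem table are hypothetical.

```python
# Hypothetical stem table standing in for a real stemming library.
TOY_STEMS = {"running": "run", "shoes": "shoe"}

def stem(word):
    """Return the toy stem of a word, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)

def adjacent_pairs(tokens):
    """Return each pair of neighbouring tokens, in order."""
    return list(zip(tokens, tokens[1:]))

def phrase_terms(text, stemmed_searches, word_searches):
    """Return the two-word phrase terms to index for a piece of text.

    Which kinds of pairs are produced depends on the two master settings.
    """
    tokens = text.lower().split()
    terms = set()
    if word_searches:
        terms |= {("word-pair", a, b) for a, b in adjacent_pairs(tokens)}
    if stemmed_searches:
        stems = [stem(t) for t in tokens]
        terms |= {("stem-pair", a, b) for a, b in adjacent_pairs(stems)}
    return terms

print(sorted(phrase_terms("running shoes", stemmed_searches=True, word_searches=True)))
```

Enabling both settings doubles the phrase entries in this sketch, which hints at why the choice affects index size as well as matching behavior.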
In versions 8 and earlier, MarkLogic by default enables the stemmed searches option and<br />
leaves word searches disabled. Usually that's fine, but it can lead to surprises—e.g., if<br />
you're searching for a term in a metadata field where you don't want stemming. For<br />
such cases, you can enable the word searches index and pass "unstemmed" as an option<br />
while constructing the search query constraint. In version 9, the settings are reversed,<br />
and MarkLogic by default disables stemmed searches and enables word searches.<br />
MarkLogic uses a language-specific stemming library to identify the stem (or sometimes<br />
stems) for each word. It has to be language-specific because words like "chat" have<br />
different meanings in English and French and thus the roots are different. It's also<br />
possible to create language-specific custom dictionaries to modify the stemming<br />
behavior. For details, see the Search Developer's Guide. 4<br />
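The interplay between language-specific stemming and custom dictionaries can be sketched in Python as a lookup with per-language overrides. The `Stemmer` class and its tables are hypothetical and tiny; they only mimic the behavior described here, including the "servlets" case from the footnote, and say nothing about how MarkLogic stores its dictionaries.

```python
# Hypothetical base stem tables, one per language.
BASE_STEMS = {
    "en": {"chats": "chat", "chatting": "chat", "chatted": "chat"},
    "fr": {"chats": "chat"},
}

class Stemmer:
    """Toy stemmer: custom per-language entries override the base tables."""

    def __init__(self, base, custom=None):
        self.base = base
        self.custom = custom or {}

    def stem(self, word, lang):
        word = word.lower()
        overrides = self.custom.get(lang, {})
        if word in overrides:
            return overrides[word]
        # Unknown words stem to themselves.
        return self.base.get(lang, {}).get(word, word)

default_stemmer = Stemmer(BASE_STEMS)
custom_stemmer = Stemmer(BASE_STEMS, custom={"en": {"servlets": "servlet"}})

# The same surface form is treated differently depending on the language.
print(default_stemmer.stem("chatting", "en"), default_stemmer.stem("chatting", "fr"))

# Jargon missing from the base table only stems once a custom entry is added.
print(default_stemmer.stem("servlets", "en"), custom_stemmer.stem("servlets", "en"))
```

Checking the custom table first means domain-specific entries win without modifying the base data.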
3 MarkLogic follows the Unicode definition of what is and isn't a diacritic, which can be important when indexing<br />
certain characters. For example, ł (U+0142, "LATIN SMALL LETTER L WITH STROKE") does not<br />
have a decomposition mapping in Unicode. Consequently, that character is not considered to have a diacritic.<br />
4 Domain-specific technical jargon often isn't included in the default stemming library. For example, the<br />
word "servlets" should stem to "servlet" but doesn't by default.<br />