15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server

Inside-MarkLogic-Server

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

The fast diacritic sensitive searches option works the same way, but for diacritic sensitivity.<br />

(Diacritics are ancillary glyphs added to a letter, like an umlaut or accent mark.) By<br />

default, if you search for "resume," you'll see matches for "résumé" as well. That's<br />

usually appropriate, but not in every case. By turning on the diacritic-sensitive index,<br />

you tell MarkLogic to maintain both diacritic-insensitive and diacritic-sensitive term list<br />

entries, so you can resolve diacritic-sensitive matches out of indexes. 3<br />

STEMMED INDEXES<br />

Stemming is another situation where MarkLogic provides optimized indexing options.<br />

Stemming is the process of reducing inflected (or sometimes derived) words to their<br />

root form. It allows a query for "run" to additionally match "runs," "running," and<br />

"ran" because they all have the same stem root of "run." Performing a stemmed<br />

search increases recall by expanding the set of results that can come back. Often that's<br />

desirable, but sometimes not, such as when you're searching for proper nouns or precise<br />

metadata tags.<br />

MarkLogic gives you the choice of whether to maintain stemmed or unstemmed<br />

indexes, or both. These indexes act somewhat like master index settings in that they<br />

impact all of the other text indexes. For example, the fast phrase searches option looks at<br />

the master index settings to determine if phrases should be indexed as word pairs or<br />

stem pairs, or both.<br />

In versions 8 and earlier, MarkLogic by default enables the stemmed searches option and<br />

leaves word searches disabled. Usually that's fine, but it can lead to surprises—e.g., if<br />

you're searching for a term in a metadata field where you don't want stemming. For<br />

such cases, you can enable the word searches index and pass "unstemmed" as an option<br />

while constructing the search query constraint. In version 9, the settings are reversed,<br />

and MarkLogic by default disables stemmed searches and enables word searches.<br />

MarkLogic uses a language-specific stemming library to identify the stem (or sometimes<br />

stems) for each word. It has to be language-specific because words like "chat" have<br />

different meanings in English and French and thus the roots are different. It's also<br />

possible to create language-specific custom dictionaries to modify the stemming<br />

behavior. For details, see the Search Developer's Guide. 4<br />

3 MarkLogic follows the Unicode definition of what is and isn't a diacritic, which can be important when indexing<br />

certain characters. For example, ł (U+0142, "LATIN SMALL LETTER L WITH STROKE") does not<br />

have a decomposition mapping in Unicode. Consequently, that character is not considered to have a diacritic.<br />

4 Domain-specific technical jargon often isn't included in the default stemming library. For example, the<br />

word "servlets" should stem to "servlet" but doesn't by default.<br />

68

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!