15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server

Inside-MarkLogic-Server

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

You can watch as the server does the scoring math if you use the Admin Interface to<br />

turn on the diagnostic flag "relevance", which dumps the math to the server error<br />

log file. For programmatic access, you can pass "relevance-trace" as an option to<br />

cts:search() and use cts:relevance-info() to extract an XML-formatted trace<br />

log. Tracing has a performance impact, so it's not recommended for production.<br />

STOP WORDS<br />

A stop word is a word that is so common and with so little meaning by itself that it can<br />

be ignored for purposes of search indexing. MarkLogic does not require or use stop<br />

words. This means it's possible to search for phrases such as "to be or not to be" which<br />

consist solely of stop words, or phrases like "Vitamin A" where one of the words is a<br />

stop word.<br />

However, MarkLogic does limit how large the positions list can be for any particular<br />

term. A positions list tracks the locations in documents where the term appears<br />

and supports proximity queries and long phrase resolution. The maximum size is a<br />

configurable database setting called positions list max size and is usually several hundred<br />

megabytes. Once a term has appeared so many times that its positions list exceeds<br />

this limit, the positional data for that term is removed and no longer used in query<br />

resolution (because it would be too inefficient to do so). You can detect whether any<br />

terms have exceeded this limit by looking for a file named StopKeySet having nonzero<br />

length in one of the on-disk stands.<br />

FIELDS<br />

MarkLogic allows administrators to turn on and off various indexes. The list of indexes<br />

is long: stemmed searches, word searches, fast phrase searches, fast case-sensitive<br />

searches, fast diacritic-sensitive searches, fast element word searches, element word<br />

positions, fast element phrase searches, trailing wildcard searches, trailing wildcard<br />

word positions, three-character searches, three-character word positions, two-character<br />

searches, and one-character searches. Each index improves performance for certain<br />

types of queries at the expense of increased disk consumption and longer load times.<br />

In some situations, it makes sense to enable certain indexes for parts of documents but<br />

not for other parts. For example, the wildcard indexes may make sense (i.e., justify their<br />

overhead) for titles, authors, and abstracts but not for the longer full body text.<br />

Fields let you define different index settings for different subsets of your documents.<br />

Each field gets a unique name, a list of elements or attributes to include, and another<br />

list to exclude. For example, you can include titles, authors, and abstracts in a field but<br />

exclude any footnotes within the abstract. Or, maybe you want to include all document<br />

content in a field but remove just elements from the field's index.<br />

That's possible, too. At request time, you use field-aware functions like cts:fieldword-query()<br />

to express a query constraint against a field.<br />

75

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!