Part 4: Index Construction

More documents

Recommendations

Info

Use the same algorithm for disk? Sec. 4.2 Can we use the same index construction algorithm for larger collections, but by using disk instead of memory? • I.e. scan the documents, and for each token write the corresponding posting (term, doc, freq) on a new file • Finally sort the postings and build the postings lists for all the terms No: Sorting T = 100,000,000 records (term, doc, freq) on disk is too slow – too many disk seeks • See next slide We need an external sorting algorithm.
Bottleneck Sec. 4.2 Parse and build postings entries one doc at a time Now sort postings entries by term (then by doc within each term) Doing this with random disk seeks would be too slow – must sort T = 100M records If every comparison took 2 disk seeks, and N items could be sorted with N log 2 N comparisons, how long would this take? 2*ds*Nlog 2 N s= 2*5*10 -3 * 10 8 log 2 10 8 = 10 6 * log 2 10 8 = 10 6 * 26,5 = 2,65 * 10 7 s = 307 days! What can we do?
Page 1 and 2: Part 4: Index Construction Francesc
Page 3 and 4: Sec. 4.1 Hardware basics Many desi
Page 5 and 6: Sec. 4.1 Hardware basics Servers u
Page 7 and 8: Hardware assumptions Sec. 4.1 symb
Page 9 and 10: Reuters RCV1 statistics Sec. 4.2 sy
Page 11 and 12: Key step After all documents have
Page 13: Sort-based index construction Sec.
Page 17 and 18: Blocks consider different documents
Page 19 and 20: Block sorted-based indexing Sec. 4.
Page 21 and 22: Remaining problem with sort-based a
Page 23 and 24: SPIMI-Invert Sec. 4.3 When the memo
Page 25 and 26: Distributed indexing Sec. 4.4 For
Page 27 and 28: Google data centers If in a non-fa
Page 29 and 30: Parallel tasks Sec. 4.4 We will us
Page 31 and 32: Inverters Sec. 4.4 An inverter col
Page 33 and 34: MapReduce Sec. 4.4 The index const
Page 35 and 36: Schema for index construction in Ma
Page 37 and 38: Simplest approach Sec. 4.5 Maintai
Page 39 and 40: Sec. 4.5 Logarithmic merge Maintai
Page 41 and 42: Logarithmic merge Sec. 4.5 Auxilia
Page 43 and 44: Dynamic indexing at search engines
Page 45 and 46: Other sorts of indexes Sec. 4.5 P

Part 4: Index Construction

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?