Algorithms and Data Structures for External Memory

Foundations and Trends® in Theoretical Computer Science

Vol. 2, No. 4 (2006) 305–474

© 2008 J. S. Vitter

DOI: 10.1561/0400000014

Algorithms and Data Structures for External Memory

Jeffrey Scott Vitter

Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907–2107, USA, jsv@purdue.edu

Abstract

Data sets in large applications are often too massive to fit completely inside the computer’s internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this manuscript, we survey the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices, geometric data, and graphs.
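
The remark that disk striping can be nonoptimal for sorting can be made concrete with a short calculation. The sketch below is illustrative only and relies on notation that does not appear in the abstract: $N$ items to sort, internal memory holding $M$ items, blocks of $B$ items, $D$ disks, and the shorthand $n = N/B$, $m = M/B$. It assumes the standard parallel disk model, in which one I/O step transfers one block to or from each of the $D$ disks, and it assumes $m > D$ so that the striped logarithm is well defined.

```latex
% Assumed notation (not from the abstract): n = N/B, m = M/B, D disks, m > D.
\begin{align*}
  % Optimal sorting bound when the D disks are used independently:
  \mathrm{Sort}(N) &= \Theta\!\left(\frac{n}{D}\,\log_{m} n\right) \\[4pt]
  % Striping behaves like a single disk with logical block size DB,
  % so n and m are each replaced by n/D and m/D in the one-disk bound:
  \mathrm{Sort}_{\text{striped}}(N)
    &= \Theta\!\left(\frac{n}{D}\,\log_{m/D} \frac{n}{D}\right) \\[4pt]
  % Ratio of the two bounds when n is much larger than D:
  \frac{\mathrm{Sort}_{\text{striped}}(N)}{\mathrm{Sort}(N)}
    &\approx \frac{\log m}{\log (m/D)}
\end{align*}
```

Under these assumptions the overhead is negligible when $D$ is small compared with $m$, but it grows as $D$ approaches $m$, which is the regime where using the disks independently is expected to pay off.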
