11.07.2015 Views

Data Structures and Algorithm Analysis - Computer Science at ...


• read(byte[] b): Read some bytes from the current position in the file. The current position moves forward as the bytes are read.
• write(byte[] b): Write some bytes at the current position in the file (overwriting the bytes already at that position). The current position moves forward as the bytes are written.
• seek(long pos): Move the current position in the file to pos. This allows bytes at arbitrary places within the file to be read or written.
• close(): Close a file at the end of processing.

8.5 External Sorting

We now consider the problem of sorting collections of records too large to fit in main memory. Because the records must reside in peripheral or external memory, such sorting methods are called external sorts. This is in contrast to the internal sorts discussed in Chapter 7 which assume that the records to be sorted are stored in main memory. Sorting large collections of records is central to many applications, such as processing payrolls and other large business databases. As a consequence, many external sorting algorithms have been devised. Years ago, sorting algorithm designers sought to optimize the use of specific hardware configurations, such as multiple tape or disk drives. Most computing today is done on personal computers and low-end workstations with relatively powerful CPUs, but only one or at most two disk drives. The techniques presented here are geared toward optimized processing on a single disk drive. This approach allows us to cover the most important issues in external sorting while skipping many less important machine-dependent details.
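As a rough illustration of the file operations listed before Section 8.5, the following sketch assumes they correspond to the methods of Java's java.io.RandomAccessFile class. The class name is an assumption, since the text names only the methods. The scratch file and the names FileOpsDemo and overwriteInPlace are hypothetical and exist only for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FileOpsDemo {
    /** Writes a message, overwrites part of it in place, and returns the file contents. */
    static String overwriteInPlace() {
        try {
            File tmp = File.createTempFile("fileops", ".dat"); // hypothetical scratch file
            tmp.deleteOnExit();
            RandomAccessFile file = new RandomAccessFile(tmp, "rw");
            try {
                // write() advances the current position past the bytes written.
                file.write("hello world".getBytes(StandardCharsets.US_ASCII));
                // seek() moves the current position to an arbitrary byte offset.
                file.seek(6);
                // Writing here overwrites the bytes already stored at offsets 6..10.
                file.write("WORLD".getBytes(StandardCharsets.US_ASCII));
                // Rewind, then read() fills the buffer from the current position.
                file.seek(0);
                byte[] buf = new byte[11];
                int n = file.read(buf); // number of bytes actually read
                return new String(buf, 0, n, StandardCharsets.US_ASCII);
            } finally {
                file.close(); // close() releases the file at the end of processing
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(overwriteInPlace());
    }
}
```

Note that read(byte[] b) returns the number of bytes actually read, which may be fewer than the buffer length, so production code should check the return value rather than assume a full buffer.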
Readers who have a need to implement efficient external sorting algorithms that take advantage of more sophisticated hardware configurations should consult the references in Section 8.6.

When a collection of records is too large to fit in main memory, the only practical way to sort it is to read some records from disk, do some rearranging, then write them back to disk. This process is repeated until the file is sorted, with each record read perhaps many times. Given the high cost of disk I/O, it should come as no surprise that the primary goal of an external sorting algorithm is to minimize the number of times information must be read from or written to disk. A certain amount of additional CPU processing can profitably be traded for reduced disk access.

Before discussing external sorting techniques, consider again the basic model for accessing information from disk. The file to be sorted is viewed by the programmer as a sequential series of fixed-size blocks. Assume (for simplicity) that each block contains the same number of fixed-size data records. Depending on the application, a record might be only a few bytes (composed of little or nothing more than the key) or might be hundreds of bytes with a relatively small key field. Records are assumed not to cross block boundaries. These assumptions can be
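To make the block model and the read, rearrange, write cycle described above more concrete, here is a minimal sketch of a single pass over a file. It is not a complete external sort. It reads each fixed-size block, sorts the records inside it in memory, and writes the block back to the same position. Several details are assumptions made only for illustration and do not come from the text: each record is a bare 4-byte integer key, each block holds four records, and the names BlockPassDemo, sortEachBlock, and demo are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

class BlockPassDemo {
    static final int RECORD_SIZE = 4;       // assumption: a record is just a 4-byte int key
    static final int RECORDS_PER_BLOCK = 4; // assumption: tiny blocks keep the demo readable
    static final int BLOCK_SIZE = RECORD_SIZE * RECORDS_PER_BLOCK;

    /** Reads each whole block, sorts its records in memory, and writes it back in place. */
    static void sortEachBlock(RandomAccessFile file) throws IOException {
        long numBlocks = file.length() / BLOCK_SIZE; // a partial trailing block is ignored
        byte[] block = new byte[BLOCK_SIZE];
        int[] keys = new int[RECORDS_PER_BLOCK];
        for (long b = 0; b < numBlocks; b++) {
            file.seek(b * BLOCK_SIZE);          // records never cross block boundaries
            file.readFully(block);              // one block read
            ByteBuffer buf = ByteBuffer.wrap(block);
            for (int i = 0; i < keys.length; i++) {
                keys[i] = buf.getInt();         // decode fixed-size records
            }
            Arrays.sort(keys);                  // the in-memory "rearranging" step
            buf.clear();                        // reset position to reuse the same bytes
            for (int key : keys) {
                buf.putInt(key);
            }
            file.seek(b * BLOCK_SIZE);
            file.write(block);                  // one block write
        }
    }

    /** Builds a scratch file from the keys, runs one pass, and returns the stored keys. */
    static int[] demo(int[] keys) {
        try {
            File tmp = File.createTempFile("blocks", ".dat"); // hypothetical scratch file
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                for (int k : keys) {
                    file.writeInt(k);
                }
                sortEachBlock(file);
                file.seek(0);
                int[] out = new int[keys.length];
                for (int i = 0; i < out.length; i++) {
                    out[i] = file.readInt();
                }
                return out;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = demo(new int[] {9, 3, 7, 1, 8, 2, 6, 4});
        System.out.println(Arrays.toString(result)); // prints [1, 3, 7, 9, 2, 4, 6, 8]
    }
}
```

After this pass each block is sorted internally, but the file as a whole is not. This is why the process must be repeated, and why the number of block reads and writes (here one of each per block) is the cost that matters.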
