11.07.2015 Views

Data Structures and Algorithm Analysis - Computer Science at ...


Sec. 8.5 External Sorting

The moral is that, with a single disk drive, there often is no such thing as efficient sequential processing of a data file. Thus, a sorting algorithm might be more efficient if it performs a smaller number of non-sequential disk operations rather than a larger number of logically sequential disk operations that require a large number of seeks in practice.

As mentioned previously, the record size might be quite large compared to the size of the key. For example, payroll entries for a large business might each store hundreds of bytes of information including the name, ID, address, and job title for each employee. The sort key might be the ID number, requiring only a few bytes. The simplest sorting algorithm might be to process such records as a whole, reading the entire record whenever it is processed. However, this will greatly increase the amount of I/O required, because only a relatively few records will fit into a single disk block. Another alternative is to do a key sort. Under this method, the keys are all read and stored together in an index file, where each key is stored along with a pointer indicating the position of the corresponding record in the original data file. The key and pointer combination should be substantially smaller than the size of the original record; thus, the index file will be much smaller than the complete data file. The index file will then be sorted, requiring much less I/O because the index records are smaller than the complete records.

Once the index file is sorted, it is possible to reorder the records in the original database file. This is typically not done for two reasons. First, reading the records in sorted order from the record file requires a random access for each record. This can take a substantial amount of time and is only of value if the complete collection of records needs to be viewed or processed in sorted order (as opposed to a search for selected records). Second, database systems typically allow searches to be done on multiple keys. For example, today’s processing might be done in order of ID numbers. Tomorrow, the boss might want information sorted by salary. Thus, there might be no single “sorted” order for the full record. Instead, multiple index files are often maintained, one for each sort key. These ideas are explored further in Chapter 10.

8.5.1 Simple Approaches to External Sorting

If your operating system supports virtual memory, the simplest “external” sort is to read the entire file into virtual memory and run an internal sorting method such as Quicksort. This approach allows the virtual memory manager to use its normal buffer pool mechanism to control disk accesses. Unfortunately, this might not always be a viable option. One potential drawback is that the size of virtual memory is usually limited to something much smaller than the disk space available. Thus, your input file might not fit into virtual memory. Limited virtual memory can be overcome by adapting an internal sorting method to make use of your own buffer pool.
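
Returning to the key sort described earlier in this section, the following Python sketch illustrates the idea of building and sorting an index of (key, pointer) pairs instead of moving whole records. It is only an illustration under stated assumptions: the fixed 64-byte record layout (a 4-byte ID followed by a 60-byte payload) and the names `RECORD`, `build_index`, `key_sort`, and `read_record` are hypothetical and do not come from the text. For brevity the index is held in an in-memory list; in the setting the text describes, it would be written to an index file and might itself need an external sort.

```python
import io
import struct

# Hypothetical fixed-size record: 4-byte unsigned ID (the sort key)
# followed by a 60-byte payload (name, address, job title, ...).
RECORD = struct.Struct("<I60s")


def build_index(data_file):
    """Scan the data file once, collecting (key, byte offset) pairs."""
    index = []
    data_file.seek(0)
    offset = 0
    while True:
        chunk = data_file.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of file (or a trailing partial record)
        key, _payload = RECORD.unpack(chunk)
        index.append((key, offset))
        offset += RECORD.size
    return index


def key_sort(data_file):
    """Return the index sorted by key; the data file itself is not reordered."""
    index = build_index(data_file)
    index.sort()  # sorts small (key, offset) pairs, not full records
    return index


def read_record(data_file, offset):
    """Fetch one full record by following a pointer from the index."""
    data_file.seek(offset)
    return RECORD.unpack(data_file.read(RECORD.size))


# Usage: an in-memory stand-in for the data file, holding three records.
ids = [42, 7, 19]
data = io.BytesIO(
    b"".join(RECORD.pack(i, f"employee {i}".encode()) for i in ids)
)
sorted_index = key_sort(data)
smallest_key, smallest_offset = sorted_index[0]
first_record = read_record(data, smallest_offset)
```

Each `read_record` call seeks to an arbitrary offset, which mirrors the random access per record that the text gives as the first reason the original file is usually left unsorted. Keeping a separate sorted index per key (ID, salary, and so on) matches the second reason.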
