28.07.2013 Views

Performance Analysis and Optimization of the Hurricane File System ...

Performance Analysis and Optimization of the Hurricane File System ...

Performance Analysis and Optimization of the Hurricane File System ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

CHAPTER 5. MICROBENCHMARKS – READ OPERATION 36<br />

16 hash lists to be used, resulting in 8 hash lists that were shared. For example, processors 0 to 3 shared<br />

hash lists with processors 8 to 11. The wide variability on 12 processors, as shown by <strong>the</strong> large error bars<br />

in Figure 5.2, was <strong>the</strong> result <strong>of</strong> some threads sharing hash lists while o<strong>the</strong>r threads had exclusive access to<br />

hash lists. Never<strong>the</strong>less, <strong>the</strong> performance degradation on a contended hash list was higher than expected,<br />

considering that only 2 processors were sharing any contended hash list. <strong>Performance</strong> did not simply slow<br />

down by a factor <strong>of</strong> 2, as one may intuitively expect. The choice <strong>of</strong> lock types was examined in a later<br />

experiment to determine if it was <strong>the</strong> culprit behind this severity. Regardless, it was necessary to modify <strong>the</strong><br />

ORS hash function to reduce hash list collisions under a variety <strong>of</strong> workloads. It was surprising that hash<br />

collisions occurred under <strong>the</strong> simple microbenchmark workload.<br />

Figure 5.4 magnifies <strong>the</strong> left portion <strong>of</strong> Figure 5.2. It shows that <strong>the</strong>re is also a serious performance<br />

degradation between 1 processor to 2 processors. The costs <strong>of</strong> transitioning from a uniprocessor to multi-<br />

processor architecture can be clearly seen. Overheads may include memory bus contention, memory bank<br />

contention, multiprocessor locks, cache-coherence misses, <strong>and</strong> cache-coherence traffic. These problems do<br />

not exist on uniprocessors. As a result, two concurrent threads required twice as long to complete <strong>the</strong>ir<br />

independent tasks compared to one thread. <strong>Performance</strong> between 2 processors <strong>and</strong> 8 processors showed ideal<br />

scalability since hash collisions did not occur in this range <strong>of</strong> processors.<br />

5.4 <strong>Optimization</strong>s<br />

We implemented a number <strong>of</strong> optimizations to improve scalability. These optimizations include (1) a mod-<br />

ified ORS hash function, (2) larger ORS hash table, <strong>and</strong> (3) padded ORS hash list headers <strong>and</strong> cache<br />

entries. The fully optimized results are shown in Figure 5.5 1 . This section describes <strong>the</strong> investigation<br />

process that led to <strong>the</strong>se optimizations. We used <strong>the</strong> perfect memory model in Section 5.4.1 to examine<br />

memory bottlenecks. A base line experiment was run in Section 5.4.2 to verify scalability <strong>of</strong> <strong>the</strong> underlying<br />

infrastructure. The impact <strong>of</strong> lock types <strong>and</strong> cache associativity were examined in Sections 5.4.3 <strong>and</strong> 5.4.4,<br />

respectively. As described in Section 5.4.5, modifying <strong>the</strong> ORS hash function alone was not adequate <strong>and</strong><br />

introduced o<strong>the</strong>r performance problems. Section 5.4.6 indicates that padding <strong>the</strong> ORS hash list headers<br />

alleviated <strong>the</strong>se problems. Increasing <strong>the</strong> number <strong>of</strong> hash lists (increasing <strong>the</strong> size <strong>of</strong> <strong>the</strong> hash table) can<br />

be done without detrimental effects, as shown in Section 5.4.7. Fur<strong>the</strong>r padding <strong>of</strong> <strong>the</strong> ORS cache entries<br />

significantly improved absolute performance in Section 5.4.8. In <strong>the</strong> investigation process, we discovered a<br />

potential problem with <strong>the</strong> K42 padding <strong>and</strong> alignment facilities. Finally, Section 5.4.9 describes <strong>the</strong> fully<br />

optimized configuration in detail.<br />

The experimental configurations used here are based on those used in Section 5.2 with minor modifications<br />

as described in each subsection. We continue to use extent-based files in <strong>the</strong> experiments to explore ORS<br />

1 Error bars were removed from <strong>the</strong> unoptimized 12 processor configuration.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!