Is Parallel Programming Hard, And, If So, What Can You Do About It?

Therefore, as shown in Figure 2.4, memory references are often severe obstacles for modern CPUs.

Thus far, we have only been considering obstacles that can arise during a given CPU's execution of single-threaded code. Multi-threading presents additional obstacles to the CPU, as described in the following sections.

2.1.3 Atomic Operations

One such obstacle is atomic operations. The whole idea of an atomic operation in some sense conflicts with the piece-at-a-time assembly-line operation of a CPU pipeline. To hardware designers' credit, modern CPUs use a number of extremely clever tricks to make such operations look atomic even though they are in fact being executed piece-at-a-time, but even so, there are cases where the pipeline must be delayed or even flushed in order to permit a given atomic operation to complete correctly. The resulting effect on performance is depicted in Figure 2.5.

Figure 2.5: CPU Meets an Atomic Operation

Unfortunately, atomic operations usually apply only to single elements of data. Because many parallel algorithms require that ordering constraints be maintained between updates of multiple data elements, most CPUs provide memory barriers. These memory barriers also serve as performance-sapping obstacles, as described in the next section.

Quick Quiz 2.2: What types of machines would allow atomic operations on multiple data elements?

2.1.4 Memory Barriers

Memory barriers will be considered in more detail in Section 12.2 and Appendix C. In the meantime, consider the following simple lock-based critical section:

    spin_lock(&mylock);
    a = a + 1;
    spin_unlock(&mylock);

If the CPU were not constrained to execute these statements in the order shown, the effect would be that the variable "a" would be incremented without the protection of "mylock", which would certainly defeat the purpose of acquiring it. To prevent such destructive reordering, locking primitives contain either explicit or implicit memory barriers. Because the whole purpose of these memory barriers is to prevent reorderings that the CPU would otherwise undertake in order to increase performance, memory barriers almost always reduce performance, as depicted in Figure 2.6.

Figure 2.6: CPU Meets a Memory Barrier

2.1.5 Cache Misses

An additional multi-threading obstacle to CPU performance is the "cache miss". As noted earlier, modern CPUs sport large caches in order to reduce the performance penalty that would otherwise be incurred due to slow memory accesses.
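To make the atomic-operation discussion in Section 2.1.3 above concrete, the following minimal standalone sketch performs an atomic increment. It uses the C11 <stdatomic.h> interface rather than the Linux kernel's atomic primitives, and the counter variable and count_event() function are hypothetical names introduced purely for illustration.

    /* Minimal sketch of an atomic increment using C11 atomics.
     * Standalone userspace stand-in for kernel atomic primitives. */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long counter;     /* zero-initialized shared counter */

    void count_event(void)
    {
            /* Read-modify-write that the CPU must treat as indivisible,
             * typically a single lock-prefixed instruction on x86. */
            atomic_fetch_add(&counter, 1);
    }

    int main(void)
    {
            count_event();
            printf("counter = %ld\n", atomic_load(&counter));
            return 0;
    }

Even though this is a single statement in the source, the underlying read-modify-write is exactly the kind of operation that can force the pipeline to stall or flush, as described above.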
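The implicit memory barriers mentioned in Section 2.1.4 can be illustrated with a toy userspace spinlock. The sketch below is not the kernel's spin_lock()/spin_unlock() implementation; it is a simplified stand-in built from C11 atomics, in which acquire semantics on lock acquisition and release semantics on lock release supply the ordering that keeps the increment of "a" inside the critical section.

    /* Toy spinlock sketch: acquire/release operations play the role of
     * the implicit memory barriers described in the text. */
    #include <stdatomic.h>

    static atomic_flag mylock = ATOMIC_FLAG_INIT;
    static long a;

    static void spin_lock(atomic_flag *lock)
    {
            /* Acquire semantics: later loads and stores cannot be
             * reordered before the lock acquisition. */
            while (atomic_flag_test_and_set_explicit(lock,
                                                     memory_order_acquire))
                    continue;       /* spin until the flag was clear */
    }

    static void spin_unlock(atomic_flag *lock)
    {
            /* Release semantics: earlier loads and stores cannot be
             * reordered after the lock release. */
            atomic_flag_clear_explicit(lock, memory_order_release);
    }

    void increment_a(void)
    {
            spin_lock(&mylock);
            a = a + 1;              /* stays within the critical section */
            spin_unlock(&mylock);
    }

Without the acquire and release ordering, the CPU would be free to move the increment of "a" outside the lock, which is precisely the destructive reordering the text warns against.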
