[Fig. 6. Updating-based recovery mechanism. P0 and P1 are processors and MC is the home node.]

After completing this process, we know for sure that the blocks belonging to the recovered page are not cached. Thus, from that moment on, directory caches can keep proper track of them.

C.2 Updating-based Recovery Mechanism

The main advantage of the flushing-based recovery is that its implementation in real systems is feasible and straightforward. However, flushing all cached blocks may increase the miss rate of processor caches (this is analyzed in Section V). To address this potential drawback, we propose an alternative implementation based on updating the directory cache information, which works as follows.

First, the initiator issues a recovery request to the corresponding page keeper.

Second, on the arrival of the recovery request, the page keeper locks the corresponding TLB entry and looks for the blocks within the page that are present in its cache. The addresses of those blocks are coded in a bit vector, which is included in a recovery response. After composing the bit vector, the keeper checks its MSHR structure and waits for any outstanding operations on the page blocks. Once the pending operations complete, the recovery response is sent to the home memory controller.

Third, upon receipt of a recovery response, the home memory controller proceeds to update its directory cache according to the received bit vector. In particular, it creates a new directory cache entry for every block cached by the keeper. The sharing code of every new entry can be easily set because, at that moment, the keeper is the only node with a valid copy of the block. When the directory cache update finishes, the home node sends a recovery target done message back to the page keeper.

Fourth, when the keeper receives the recovery target done message, it marks the TLB entry corresponding to the page as shared, unlocks that TLB entry, and sends a recovery done message to the initiator, finalizing the recovery process. Figure 6 shows an example of how the updating-based recovery mechanism works.

TABLE I
System parameters.

Memory Parameters
  Processor frequency                   3.2 GHz
  Cache block size                      64 bytes
  Processor cache                       2MB, 4-way
  Processor cache access latency        2ns
  Directory cache                       256KB, 4-way
  Directory cache access latency        2ns
  Directory cache coverage ratio        2x, worst-case 0.25x
  Memory access latency (local bank)    60ns
  Page size                             4KB (64 blocks)

Network Parameters
  Network topology                      Hypercube
  Data message size                     68 and 72 bytes
  Control message size                  4 and 8 bytes
  Network bandwidth                     12.8GB/s
  Inter-die link latency                2ns
  Inter-processor link latency          20ns
  Flit size                             4 bytes
  Link bandwidth                        1 flit/cycle

After completing the updating-based recovery mechanism for a page, we know for sure that the directory cache keeps proper track of the page blocks.
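As an illustration of steps two and three, the following C sketch shows how the keeper might compose the recovery response's bit vector and how the home node might turn it into directory cache entries. It assumes the parameters of Table I (4KB pages and 64-byte blocks, hence 64 blocks per page); cache_holds, wait_for_mshr_drain, and dir_cache_insert are hypothetical stand-ins for the hardware structures involved, not interfaces defined in the paper:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SIZE        4096u
    #define BLOCK_SIZE       64u
    #define BLOCKS_PER_PAGE  (PAGE_SIZE / BLOCK_SIZE)  /* 64 blocks per page */

    /* Hypothetical hooks into the keeper's cache and MSHR (assumptions). */
    bool cache_holds(uint64_t block_addr);
    void wait_for_mshr_drain(uint64_t page_addr);

    /* Keeper side (step two): bit i of the result is set iff block i of
     * page page_addr is present in the keeper's cache. */
    uint64_t compose_recovery_vector(uint64_t page_addr)
    {
        uint64_t vec = 0;
        for (unsigned i = 0; i < BLOCKS_PER_PAGE; i++)
            if (cache_holds(page_addr + (uint64_t)i * BLOCK_SIZE))
                vec |= 1ull << i;
        wait_for_mshr_drain(page_addr);  /* wait out pending ops on the page */
        return vec;                      /* shipped in the recovery response */
    }

    /* Hypothetical directory-cache insertion at the home node (assumption). */
    void dir_cache_insert(uint64_t block_addr, unsigned keeper_id);

    /* Home side (step three): allocate one directory entry per cached block;
     * the sharing code is simply {keeper}, the only node with a valid copy. */
    void update_directory(uint64_t page_addr, uint64_t vec, unsigned keeper_id)
    {
        for (unsigned i = 0; i < BLOCKS_PER_PAGE; i++)
            if (vec & (1ull << i))
                dir_cache_insert(page_addr + (uint64_t)i * BLOCK_SIZE, keeper_id);
    }

Since a 4KB page holds exactly 64 blocks, the whole vector fits in a single 64-bit word, so the recovery response remains a small control message.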
IV. Evaluation Methodology

We evaluate our proposals with full-system simulation using Virtutech Simics [13] running Solaris 10 and extended with the Wisconsin GEMS toolset [14], which enables detailed simulation of multiprocessor systems. The interconnection network is modeled with GARNET [15]. Finally, we also use the McPAT tool [16], assuming a 45nm process technology, to measure energy consumption.

For the evaluation of our proposals, we first model a cache coherent HyperTransport system optimized with directory caches similar to those of the AMD Magny-Cours. The simulated system has 8 processors (16 cores) and its parameters are shown in Table I. We refer to this system as the base architecture, and our proposals are implemented upon it.

We simulate a wide variety of parallel workloads from three suites (SPLASH-2 [17], ALPBench [18], and PARSEC [19]), two scientific benchmarks, and two commercial workloads [20]. Due to simulation time constraints, we are not able to simulate these benchmarks with large working sets. Consequently, as done in most works [7], [21], [12], we simulate the applications assuming smaller data sets. To avoid altering the results, we reduce the size of both the processor caches and the directory caches by a factor of four. Notice that, since the sizes of all the simulated caches are proportionally reduced, the coverage ratio of directory caches is the same as in Magny-Cours (2x).

All the reported experimental results correspond to the parallel phase of the benchmarks. We account for the variability in multi-threaded workloads by doing multiple simulation runs for each benchmark and injecting small random perturbations in the timing of the memory system.

V. Performance Evaluation

Our proposal is based on the fact that most referred blocks are privately used by processors. Crosses in Figure 7 show the fraction of actual private blocks. As observed, about 75% (on average) of the referred blocks are private. Since our proposal works at a page granularity, it cannot identify all the private blocks because, when a page contains
