Simultaneous Multithreading – Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors

JURIJ ŠILC, Computer Systems Department, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, SLOVENIA, jurij.silc@ijs.si, http://www-csd.ijs.si/silc
BORUT ROBIČ, Faculty of Computer and Information Sc., University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, SLOVENIA, borut.robic@fri.uni-lj.si, http://www-csd.ijs.si/robic/robic.html
THEO UNGERER, Dept. of Computer Design and Fault Tolerance, University of Karlsruhe, 76128 Karlsruhe, GERMANY, ungerer@ira.uka.de, http://goethe.ira.uka.de/people/ungerer/

Abstract: - The paper discusses the reasons and possibilities of exploiting thread-level parallelism in modern microprocessors. The performance of a superscalar processor suffers when instruction-level parallelism is low. The underutilization due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is utilized by potentially issuing instructions from different threads simultaneously. Depending on the specific simultaneous multithreaded processor design, only a single instruction pipeline is used, or a single issue unit issues instructions from different instruction buffers simultaneously.

Key-Words: - instruction-level parallelism, microprocessor, multithreaded processor, simultaneous multithreading, superscalar, thread-level parallelism.

1 Introduction
Multithreaded processors, which appeared in the 1980s, aimed at a low execution time of a multithreaded workload, while superscalar processors, which appeared in the 1990s, aimed at a low execution time of a single program. Contemporary superscalar microprocessors are able to issue multiple instructions each clock cycle from a conventional linear instruction stream. VLSI technology will allow future microprocessors with an issue bandwidth of 8–32 instructions per cycle (IPC) [22,25,26]. As the issue rate of future microprocessors increases, the compiler or the hardware will have to extract more instruction-level parallelism from a sequential program. However, the instruction-level parallelism found in a conventional instruction stream is limited. Instruction-level parallelism studies which allow single control flow branch speculation have reported parallelism of around 7 IPC with infinite resources [30,17] and around 4 IPC with large sets of resources [4]. Contemporary high-performance microprocessors therefore exploit speculative parallelism by dynamic branch prediction and speculative execution of the predicted branch path to increase single-thread performance.

Underutilization of a superscalar processor, which is due to missing instruction-level parallelism, can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle.
Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is utilized by potentially issuing instructions from different threads simultaneously.

In this paper we survey the evolution of multithreaded processors which eventually resulted in simultaneous multithreading processors.

2 Multithreaded Processors
The minimal requirement for a multithreaded processor is the ability to pursue two or more threads of control in parallel within the processor pipeline—i.e., it must provide two or more independent program counters—and a mechanism that triggers a thread switch. Thread-switch overhead must be very low, from zero to only a few cycles. A fast context switch is supported by multiple program counters and often by multiple register sets on the processor chip.

There are three approaches to multithreaded processors [24]. In cycle-by-cycle interleaving, an instruction of another thread is fetched and fed into the execution pipeline at each processor cycle. In block interleaving, the instructions of a thread are executed successively and a context switch occurs when an event occurs that may cause latency. Finally, in simultaneous multithreading the wide superscalar instruction issue is combined with the multiple-context approach. Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor.
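To make the difference between the first two approaches concrete, the following minimal Python sketch (ours, not from the surveyed processors; the thread count, the miss events, and the round-robin policy are assumptions chosen for illustration) models which thread feeds the pipeline in each cycle under the two policies.

```python
# Minimal sketch of two multithreading policies (illustrative assumptions only).

def cycle_by_cycle(threads, num_cycles):
    """Cycle-by-cycle interleaving: a different thread feeds the pipeline
    every cycle, here in simple round-robin order."""
    return [threads[cycle % len(threads)] for cycle in range(num_cycles)]

def block_interleaving(threads, num_cycles, miss_events):
    """Block interleaving: the current thread runs until a latency-causing
    event (here: a (thread, cycle) pair in `miss_events`, e.g. a cache miss)
    triggers a context switch to the next thread."""
    schedule, current = [], 0
    for cycle in range(num_cycles):
        schedule.append(threads[current])
        if (threads[current], cycle) in miss_events:
            current = (current + 1) % len(threads)   # switch thread
    return schedule

if __name__ == "__main__":
    ts = ["T0", "T1", "T2"]
    print(cycle_by_cycle(ts, 6))                     # T0 T1 T2 T0 T1 T2
    print(block_interleaving(ts, 6, {("T0", 2)}))    # T0 T0 T0 T1 T1 T1
```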


more often than necessary. This makes an extremely fast context switch necessary, preferably with zero-cycle context switch overhead.

Switch-on-store switches after store instructions. This technique may be used to support the implementation of sequential consistency, which means that the next memory access instruction can only be performed after the store has completed in memory.

Switch-on-branch switches after branch instructions. The technique can be applied to simplify processor design by renouncing branch prediction and speculative execution. The branch misspeculation penalty is avoided, but single-thread performance is decreased. However, it may be effective for programs with a high percentage of branches that are hard to predict or even unpredictable.

4.2 Dynamic Techniques
The context switch is triggered by a dynamic event. In general, all the instructions between the fetch stage and the stage that triggers the context switch are discarded, leading to a higher context switch overhead than with static context switch strategies. Several dynamic models can be defined:

• Switch-on-cache-miss: The switch-on-cache-miss model switches the context if a load or store misses in the cache. The idea is that only those loads that miss in the cache have long latencies and cause context switches. Such a context switch is detected in a late stage of the pipeline. A large number of subsequent instructions have already entered the pipeline and must be discarded. Thus context switch overhead is considerably increased.

• Switch-on-signal: The switch-on-signal model switches context on occurrence of a specific signal, for example, one signaling an interrupt, trap, or message arrival.

• Switch-on-use: Context switches sometimes also occur sooner than needed. If a compiler schedules instructions so that a load from shared memory is issued several cycles before the value is used, the context switch should not occur until the actual use of the value. This strategy is implemented in the switch-on-use model, which switches when an instruction tries to use the (still missing) value from a load. This can be, for example, a load that missed in the cache. The switch-on-use model can also be seen as a lazy strategy that extends either the static switch-on-load strategy (lazy-switch-on-load) or the switch-on-cache-miss strategy (lazy-switch-on-cache-miss). To implement the switch-on-use model, a valid bit is added to each register (by a simple form of scoreboard); a minimal sketch of this bookkeeping follows the list. The bit is cleared when the loading to the corresponding register is issued and set when the result returns from the network. A thread switches context if it needs a value from a register whose valid bit is still cleared.

• Conditional-switch: The conditional-switch model couples an explicit switch instruction with a condition. The context is switched only when the condition is fulfilled; otherwise the context switch is ignored. A conditional-switch instruction may be used, for example, after a group of load/store instructions. The context switch is ignored if all load instructions (in the preceding group) hit the cache; otherwise, the context switch is performed. Moreover, a conditional-switch instruction could also be added between a group of loads and their subsequent use to realize a lazy context switch (instead of implementing the switch-on-use model).
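The following Python sketch (ours, not from any of the surveyed designs; the register file size and method names are assumptions) illustrates the valid-bit bookkeeping behind the switch-on-use model: issuing a load clears the valid bit of its destination register, the returning result sets it again, and a context switch is triggered only when an instruction tries to read a register whose bit is still cleared.

```python
# Minimal sketch of the switch-on-use model (illustrative assumptions only).

class Scoreboard:
    """One valid bit per register, as a simple form of scoreboard."""
    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs

    def issue_load(self, reg):
        # A load into `reg` has been issued; its value is not yet available.
        self.valid[reg] = False

    def load_returned(self, reg):
        # The result came back from the memory system / network.
        self.valid[reg] = True

    def can_use(self, reg):
        """True if the value can be used; False means a context switch
        must be triggered (switch-on-use)."""
        return self.valid[reg]


sb = Scoreboard()
sb.issue_load(5)              # load into r5 issued, valid bit cleared
if not sb.can_use(5):         # an instruction tries to use r5 too early
    print("switch context")   # switch-on-use: value still missing
sb.load_returned(5)           # result arrives, valid bit set again
assert sb.can_use(5)          # now the value can be used without switching
```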
The explicit-switch, conditional-switch, and switch-on-signal techniques enhance the instruction set architecture by additional instructions. The implicit-switch technique may favor a specific instruction set architecture encoding to simplify instruction class detection. All other techniques are microarchitectural techniques without the necessity of instruction set architecture changes.

4.3 Processor Examples
The MIT Sparcle [1] and the MSparc [20] processors switch context in case of remote memory accesses or failed synchronizations. They can be classified as block interleaving processors using switch-on-cache-miss and switch-on-signal techniques. Since both switching reasons are revealed in a late stage of the pipeline, the succeeding instructions that are already loaded into the pipeline cannot be used. Reloading of the pipeline and the software implementation of the context switch cause a context switch cost of 14 processor cycles in the Sparcle. The MSparc processor is similar to the Sparcle except for hardware support for context switching; a context switch is performed one cycle after the event is recognized. In the case of the MIT Sparcle, context switches are used only to hide long memory latencies, since small pipeline delays are assumed to be hidden by proper ordering of instructions by an optimizing compiler.

The multithreaded Rhamma processor [10,11] decouples execution and load/store pipelines. Both pipelines execute instructions of different threads. In contrast to the MIT Sparcle, the Rhamma processor is designed to bridge all kinds of latencies by a fast context switch applying various static and dynamic block-interleaving strategies.

Further block interleaving processor proposals include the PL/PS-Machine (Preload and Poststore) [14], which is most similar to the Rhamma processor, and the


Komodo microcontroller [3,15], a multithreaded Java microcontroller aimed at embedded real-time systems, with a hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements.

Compared to the cycle-by-cycle interleaving technique, block interleaving needs a smaller number of threads, and a single thread can execute at full speed until the next context switch. In addition, single-thread performance is similar to the performance of a comparable processor without multithreading.

5 Simultaneous Multithreading
Cycle-by-cycle interleaving and block interleaving are multithreading techniques which are most efficient when applied to scalar RISC or VLIW processors. Combining multithreading with the superscalar technique naturally leads to a technique where all hardware contexts are active simultaneously, competing each cycle for all available resources. This technique, called simultaneous multithreading (SMT), inherits from superscalars the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware resources for multiple contexts. The result is a processor that can issue multiple instructions from multiple threads each cycle. Therefore, not only can unused cycles in the case of latencies be filled by instructions of alternative threads, but so can unused issue slots within one cycle.

Thread-level parallelism can come from either multithreaded, parallel programs or from individual, independent programs in a multiprogramming workload, while instruction-level parallelism is utilized from the individual threads. Because an SMT processor simultaneously exploits thread-level and instruction-level parallelism, it uses its resources more efficiently and thus achieves better throughput and speedup than single-threaded superscalar processors for multithreaded (or multiprogramming) workloads. The trade-off is a slightly more complex hardware organization.

The SMT approach combines a wide superscalar instruction issue with the multithreading approach by providing several register sets on the processor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized. The SMT fetch unit can take advantage of the interthread competition for instruction bandwidth in two ways. First, it can partition this bandwidth among the threads and fetch from several threads each cycle. In this way, it increases the probability of fetching only nonspeculative instructions. Second, the fetch unit can be selective about which threads it fetches. For example, it may fetch those that will provide the most immediate performance benefit.
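The core idea of filling one cycle's issue slots from several threads can be illustrated with the following Python sketch (ours, not a description of any surveyed processor; the issue width, the per-thread ready counts, and the round-robin slot assignment are assumptions).

```python
# Minimal sketch of SMT issue-slot filling (illustrative assumptions only).

def smt_issue(ready, issue_width=8):
    """Fill up to `issue_width` issue slots in one cycle from several threads.
    `ready` maps a thread id to the number of instructions it can issue this
    cycle; slots are handed out round-robin across the threads."""
    slots, remaining = [], dict(ready)
    while len(slots) < issue_width and any(remaining.values()):
        for tid in list(remaining):
            if len(slots) == issue_width:
                break
            if remaining[tid] > 0:
                slots.append(tid)       # one issue slot used by thread `tid`
                remaining[tid] -= 1
    return slots

# A single thread with little instruction-level parallelism wastes slots ...
print(smt_issue({"T0": 3}))                    # ['T0', 'T0', 'T0'] (5 slots unused)
# ... while several threads together can use the full issue bandwidth.
print(smt_issue({"T0": 3, "T1": 4, "T2": 2}))  # all 8 slots filled
```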
The main drawback of SMT may be that it complicates the issue stage, which is always central to the multiple threads. A functional partitioning as demanded for processors of the 10⁹-transistor era can therefore not easily be reached.

SMT processors can be organized in the following two ways.

5.1 Shared-resources Technique
They may share an aggressive pipeline among multiple threads when there is insufficient instruction-level parallelism in any one thread to use the pipeline fully. Instructions of different threads share all resources, such as the fetch buffer, the physical registers for renaming registers of different register sets, the instruction window, and the reorder buffer. Thus SMT adds minimal hardware complexity to conventional superscalars; hardware designers can focus on building a fast single-threaded superscalar and add multithread capability on top. The complexity added to superscalars by multithreading includes the thread tag for each internal instruction representation, multiple register sets, and the abilities of the fetch and retire units to fetch/retire instructions of different threads.

5.2 Replicated-resources Technique
The second organizational form replicates all internal buffers of a superscalar such that each buffer is bound to a specific thread. Instruction fetch, decode, rename, and retire units may be multiplexed between the threads or be duplicated themselves. The issue unit is able to issue instructions of different instruction windows simultaneously to the execution units. This form of organization adds more changes to the organization of superscalar processors but leads to a natural partitioning of the instruction window and simplifies the issue and retire stages.
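As a rough illustration of the two organizational forms (our sketch, not from the paper; the buffer contents and field names are assumptions), the shared-resources form keeps one instruction window whose entries carry a thread tag, while the replicated-resources form binds one window to each thread and lets the issue unit draw from all of them simultaneously.

```python
# Minimal sketch of the two SMT organizations (illustrative assumptions only).
from collections import namedtuple

Instr = namedtuple("Instr", "thread op")   # each entry carries a thread tag

# Shared-resources form: one instruction window holds entries of all threads.
shared_window = [Instr(0, "add"), Instr(1, "load"), Instr(0, "mul"), Instr(2, "store")]
issue_from_shared = shared_window[:2]      # the issue stage picks any ready entries

# Replicated-resources form: one window per thread, bound to that thread;
# the issue unit issues from different windows simultaneously, which
# partitions the window naturally and simplifies the issue and retire stages.
per_thread_windows = {0: ["add", "mul"], 1: ["load"], 2: ["store"]}
issue_from_replicated = [(tid, w[0]) for tid, w in per_thread_windows.items() if w]

print(issue_from_shared)
print(issue_from_replicated)
```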


5.3 Processor Examples
Until recently no commercial SMT processors existed. There were, however, several projects simulating different configurations of SMT processors and (at least) one which implemented SMT in hardware [6]. In what follows we briefly mention some of them.

At the University of Washington two simulations of the SMT processor architecture were made. The first was based on an enhancement of the Alpha 21164 processor [29]. Simulations were conducted to evaluate processor configurations of up to an 8-threaded and 8-issue superscalar. This maximum configuration with 32 execution units showed a throughput of 6.64 IPC on the SPEC92 benchmark suite. The next simulation was based on a hypothetical out-of-order issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000 [7]. This approach evaluated more realistic processor configurations, again with the 8-threaded and 8-issue superscalar organization, and reached a throughput of 5.4 IPC on the same benchmarks.

The SMT processor simulated at the University of Karlsruhe [23] was based on a simplified PowerPC 604 processor with multimedia enhancement [21]. The simulations showed that a single-threaded, 8-issue maximum processor (assuming an abundance of resources) reaches an IPC count of only 1.60, while an 8-threaded 8-issue processor reaches an IPC of 6.07. A more realistic processor model reaches an IPC of 1.21 in the single-threaded 8-issue model and 3.11 in the 8-threaded 8-issue model. Increasing the issue bandwidth from 4 to 8 yields only a marginal gain (except for the 4-threaded to 8-threaded maximum processor). Increasing the number of threads from single-threaded to 2-threaded or 4-threaded yields a high gain for the 2-issue to 8-issue models, and a significant gain for the 8-threaded model. The steepest performance increases arise for the 4-issue model from the single-threaded (IPC of 1.21) to the 2-threaded (IPC of 2.07) and to the 4-threaded (IPC of 2.97) cases. In [21] a 2-threaded 4-issue or a 4-threaded 4-issue processor configuration is suggested as a realistic next-generation processor.

Researchers at the University of California at Irvine combined out-of-order execution within an instruction stream with the simultaneous execution of instructions of different instruction streams [19], which resulted in a superscalar digital signal processor. Based on simulations, a performance gain of 20–55% due to multithreading was achieved across a range of benchmarks. Similarly, researchers at the Polytechnic University of Catalunya combined simultaneous multithreaded execution and out-of-order execution with an integrated vector unit and vector instructions [8]. Recently, a commercial four-threaded SMT processor, the Alpha 21464, has been announced.

6 Conclusion
Research on multithreaded architectures has been motivated by two concerns: tolerating memory latency and bridging synchronization waits by rapid context switches. Older multithreaded processor approaches from the 1980s usually extend scalar RISC processors by a multithreading technique and focus on effectively bridging very long remote memory access latencies. Such processors will only be useful as processor nodes in distributed-shared-memory multiprocessors. However, developing a processor that is specifically designed for distributed-shared-memory multiprocessors is commonly regarded as too expensive. Multiprocessors today comprise standard off-the-shelf microprocessors and almost never specifically designed processors (with the exception of the Tera MTA and SPELL). Therefore, newer multithreaded processor approaches also strive to tolerate smaller latencies that arise from primary cache misses that hit in the secondary cache, from long-latency operations, or even from unpredictable branches.

Multithreaded processors aim at a low execution time of a multithreaded workload, while a superscalar processor aims at a low execution time of a single program.
Depending on the implemented multithreading technique, a multithreaded processor running only a single thread does not reach the same efficiency as a comparable single-threaded processor. The penalty may be only slight in the case of a block-interleaving processor, or several times as long as the run time on a single-threaded processor in the case of a cycle-by-cycle interleaving processor.

References:
[1] A. Agarwal, J. Babb, D. Chaiken, G. D'Souza, K.L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, G. Maa, K. Mackenzie, Sparcle: A multithreaded VLSI processor for parallel computing, Lect. Notes Comput. Sc., Vol.748, 1993, pp.359.
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B.J. Smith, The Tera computer system, Proc. 1990 Int. Conf. Supercomput., Amsterdam, The Netherlands, June 1990, pp.1-6.
[3] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer, A multithreaded Java microcontroller for thread-oriented real-time event-handling, Proc. 1999 Conf. PACT, Newport Beach, CA, 1999, pp.34-39.
[4] M. Butler, T.-Y. Yeh, Y.N. Patt, M. Alsup, H. Scales, M. Shebanow, Single instruction stream parallelism is greater than two, Proc. 18th Ann. Symp. Comp. Arch., Toronto, Canada, May 1991, pp.276-286.
[5] M. Dorojevets, COOL multithreading in HTMT SPELL-1 processors, Int. Journal on High Speed Electronics and Systems, 1999 (to be published).
[6] M.N. Dorozhevets, P. Wolcott, The El'brus-3 and MARS-M: Recent advances in Russian high-performance computing, The Journal of Supercomputing, Vol.6, 1992, pp.5-48.
[7] S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.M. Stamm, D.M. Tullsen, Simultaneous multithreading: A platform for next-generation processors, IEEE Micro, Vol.17, September/October 1997, pp.12-19.
[8] R. Espasa, M. Valero, Exploiting instruction- and data-level parallelism, IEEE Micro, Vol.17, September/October 1997, pp.20-27.
[9] A. Formella, J. Keller, T. Walle, HPP: A high performance PRAM, Lect. Notes Comput. Sc., Vol.1123, 1996, pp.425-434.
[10] W. Grünewald, T. Ungerer, Towards extremely fast context switching in a block multithreaded processor, Proc. 22nd Euromicro Conf., Prague, Czech Republic, Sep. 1996, pp.592-599.
[11] W. Grünewald, T. Ungerer, A multithreaded processor designed for distributed shared memory systems, Proc. Int. Conf. Advances in Parall. Distrib. Comput., Shanghai, China, March 1997, pp.206-213.
[12] R.H. Halstead, T. Fujita, MASA: A multithreaded processor architecture for parallel symbolic computing, Proc. 15th Ann. Symp. Comp. Arch., Honolulu, HI, May/June 1988, pp.443-451.
[13] C. Hansen, MicroUnity's MediaProcessor architecture, IEEE Micro, Vol.16, August 1996, pp.34-41.
[14] K.M. Kavi, D.L. Levine, A.R. Hurson, A non-blocking multithreaded architecture, Proc. 5th Int. Conf. Advanced Comput., Madras, India, Dec. 1997, pp.171-177.
[15] J. Kreuzinger, R. Marston, T. Ungerer, U. Brinkschulte, C. Krakowski, The Komodo project: Thread-based event handling supported by a multithreaded Java microcontroller, Proc. 25th Euromicro Conf., Milano, Italy, Sep. 1999, pp.122-128.
[16] J. Kreuzinger, T. Ungerer, Context-switching techniques for decoupled multithreaded processors, Proc. 25th Euromicro Conf., Milano, Italy, Sep. 1999.
[17] M.S. Lam, R.P. Wilson, Limits of control flow on parallelism, Proc. 18th Ann. Symp. Comp. Arch., Toronto, Canada, May 1992, pp.46-57.
[18] J. Laudon, A. Gupta, M. Horowitz, Interleaving: A multithreading technique targeting multiprocessors and workstations, Proc. 6th Int. Conf. ASPLOS, San Jose, CA, Oct. 1994, pp.308-318.
[19] M. Loikkanen, N. Bagherzadeh, A fine-grain multithreading superscalar architecture, Proc. 1996 Conf. PACT, Boston, MA, Oct. 1996, pp.163-168.
[20] A. Mikschl, W. Damm, MSparc: A multithreaded Sparc, Lect. Notes Comput. Sc., Vol.1123, 1996, pp.461-469.
[21] H. Oehring, U. Sigmund, T. Ungerer, MPEG-2 video decompression on simultaneous multithreaded multimedia processors, Proc. 1999 Conf. PACT, Newport Beach, CA, 1999.
[22] Y.N. Patt, S.J. Patel, M. Evers, D.H. Friendly, J. Stark, One billion transistors, one uniprocessor, one chip, Computer, Vol.30, No.9, 1997, pp.51-57.
[23] U. Sigmund, T. Ungerer, Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip, Proc. 4th PASA Workshop Parall. Sys. and Algorithms, Jülich, Germany, Apr. 1996, pp.147-159.
[24] J. Šilc, B. Robič, T. Ungerer, Asynchrony in parallel computing: From dataflow to multithreading, Parallel and Distributed Computing Practices, Vol.1, No.1, 1998, pp.3-30.
[25] J. Šilc, B. Robič, T. Ungerer, Processor Architecture: From Dataflow to Superscalar and Beyond, Springer-Verlag, Berlin, New York, 1999.
[26] J. Šilc, T. Ungerer, B. Robič, A survey of new research directions in microprocessors, Microprocessors and Microsystems, Vol.24, No.4, 2000, pp.175-190.
[27] B.J. Smith, The architecture of HEP, In: J.S. Kowalik (ed.), Parallel MIMD Computation: HEP Supercomputer and Its Applications, MIT Press, Cambridge, MA, 1985, pp.41-55.
[28] M. Thistle, B.J. Smith, A processor architecture for Horizon, Proc. Supercomputing Conf., Orlando, FL, Nov. 1988, pp.35-41.
[29] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, Proc. 22nd Ann. Int. Symp. Comp. Arch., Santa Margherita Ligure, Italy, June 1995, pp.392-403.
[30] D.W. Wall, Limits of instruction-level parallelism, Proc. 4th Int. Conf. ASPLOS, Santa Clara, CA, Apr. 1991, pp.176-188.
