Simultaneous Multithreading – Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors

JURIJ ŠILC, Computer Systems Department, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, SLOVENIA, jurij.silc@ijs.si, http://www-csd.ijs.si/silc
BORUT ROBIČ, Faculty of Computer and Information Sc., University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, SLOVENIA, borut.robic@fri.uni-lj.si, http://www-csd.ijs.si/robic/robic.html
THEO UNGERER, Dept. of Computer Design and Fault Tolerance, University of Karlsruhe, 76128 Karlsruhe, GERMANY, ungerer@ira.uka.de, http://goethe.ira.uka.de/people/ungerer/

Abstract: - The paper discusses the reasons and possibilities of exploiting thread-level parallelism in modern microprocessors. The performance of a superscalar processor suffers when instruction-level parallelism is low. The underutilization due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is utilized by potentially issuing instructions from different threads simultaneously. Depending on the specific simultaneous multithreaded processor design, only a single instruction pipeline is used, or a single issue unit issues instructions from different instruction buffers simultaneously.

Key-Words: - instruction-level parallelism, microprocessor, multithreaded processor, simultaneous multithreading, superscalar, thread-level parallelism.

1 Introduction
Multithreaded processors, which appeared in the 1980s, aimed at a low execution time of a multithreaded workload, while superscalar processors, which appeared in the 1990s, aimed at a low execution time of a single program. Contemporary superscalar microprocessors are able to issue multiple instructions each clock cycle from a conventional linear instruction stream. VLSI technology will allow future microprocessors with an issue bandwidth of 8–32 instructions per cycle (IPC) [22,25,26]. As the issue rate of future microprocessors increases, the compiler or the hardware will have to extract more instruction-level parallelism from a sequential program. However, the instruction-level parallelism found in a conventional instruction stream is limited. Instruction-level parallelism studies which allow single control flow branch speculation have reported parallelism of around 7 IPC with infinite resources [30,17] and around 4 IPC with large sets of resources [4]. Contemporary high-performance microprocessors therefore exploit speculative parallelism by dynamic branch prediction and speculative execution of the predicted branch path to increase single-thread performance.

Underutilization of a superscalar processor, which is due to missing instruction-level parallelism, can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle.
Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is utilized by potentially issuing instructions from different threads simultaneously.

In this paper we survey the evolution of multithreaded processors which eventually resulted in simultaneous multithreading processors.

2 Multithreaded Processors
The minimal requirement for a multithreaded processor is the ability to pursue two or more threads of control in parallel within the processor pipeline—i.e., it must provide two or more independent program counters—and a mechanism that triggers a thread switch. Thread-switch overhead must be very low, from zero to only a few cycles. A fast context switch is supported by multiple program counters and often by multiple register sets on the processor chip.

There are three approaches to multithreaded processors [24]. In cycle-by-cycle interleaving, an instruction of another thread is fetched and fed into the execution pipeline at each processor cycle. In block interleaving, the instructions of a thread are executed successively and a context switch occurs when an event occurs that may cause latency. Finally, in simultaneous multithreading the wide superscalar instruction issue is combined with the multiple-context approach. Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor.
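To make the difference between the first two approaches concrete, the following minimal Python sketch (ours, not from the surveyed processors; the thread count, the miss events, and the round-robin policy are assumptions chosen for illustration) models which thread feeds the pipeline in each cycle under the two policies.

```python
# Minimal sketch of two multithreading policies (illustrative assumptions only).

def cycle_by_cycle(threads, num_cycles):
    """Cycle-by-cycle interleaving: a different thread feeds the pipeline
    every cycle, here in simple round-robin order."""
    return [threads[cycle % len(threads)] for cycle in range(num_cycles)]

def block_interleaving(threads, num_cycles, miss_events):
    """Block interleaving: the current thread runs until a latency-causing
    event (here: a (thread, cycle) pair in `miss_events`, e.g. a cache miss)
    triggers a context switch to the next thread."""
    schedule, current = [], 0
    for cycle in range(num_cycles):
        schedule.append(threads[current])
        if (threads[current], cycle) in miss_events:
            current = (current + 1) % len(threads)   # switch thread
    return schedule

if __name__ == "__main__":
    ts = ["T0", "T1", "T2"]
    print(cycle_by_cycle(ts, 6))                     # T0 T1 T2 T0 T1 T2
    print(block_interleaving(ts, 6, {("T0", 2)}))    # T0 T0 T0 T1 T1 T1
```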


more often than necessary. This makes an extremely fast context switch necessary, preferably with zero-cycle context switch overhead.

Switch-on-store switches after store instructions. This technique may be used to support the implementation of sequential consistency, which means that the next memory access instruction can only be performed after the store has completed in memory.

Switch-on-branch switches after branch instructions. The technique can be applied to simplify processor design by renouncing branch prediction and speculative execution. The branch misspeculation penalty is avoided, but single-thread performance is decreased. However, it may be effective for programs with a high percentage of branches that are hard to predict or even unpredictable.

4.2 Dynamic Techniques
The context switch is triggered by a dynamic event. In general, all the instructions between the fetch stage and the stage that triggers the context switch are discarded, leading to a higher context switch overhead than with static context switch strategies. Several dynamic models can be defined:

• Switch-on-cache-miss: The switch-on-cache-miss model switches the context if a load or store misses in the cache. The idea is that only those loads that miss in the cache have long latencies and cause context switches. Such a context switch is detected in a late stage of the pipeline. A large number of subsequent instructions have already entered the pipeline and must be discarded. Thus context switch overhead is considerably increased.

• Switch-on-signal: The switch-on-signal model switches context on occurrence of a specific signal, for example, one signaling an interrupt, trap, or message arrival.

• Switch-on-use: Context switches sometimes also occur sooner than needed. If a compiler schedules instructions so that a load from shared memory is issued several cycles before the value is used, the context switch should not occur until the actual use of the value. This strategy is implemented in the switch-on-use model, which switches when an instruction tries to use the (still missing) value from a load. This can be, for example, a load that missed in the cache. The switch-on-use model can also be seen as a lazy strategy that extends either the static switch-on-load strategy (lazy-switch-on-load) or the switch-on-cache-miss strategy (lazy-switch-on-cache-miss). To implement the switch-on-use model, a valid bit is added to each register (by a simple form of scoreboard); a minimal sketch of this bookkeeping follows the list. The bit is cleared when the loading to the corresponding register is issued and set when the result returns from the network. A thread switches context if it needs a value from a register whose valid bit is still cleared.

• Conditional-switch: The conditional-switch model couples an explicit switch instruction with a condition. The context is switched only when the condition is fulfilled; otherwise the context switch is ignored. A conditional-switch instruction may be used, for example, after a group of load/store instructions. The context switch is ignored if all load instructions (in the preceding group) hit the cache; otherwise, the context switch is performed. Moreover, a conditional-switch instruction could also be added between a group of loads and their subsequent use to realize a lazy context switch (instead of implementing the switch-on-use model).
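The following Python sketch (ours, not from any of the surveyed designs; the register file size and method names are assumptions) illustrates the valid-bit bookkeeping behind the switch-on-use model: issuing a load clears the valid bit of its destination register, the returning result sets it again, and a context switch is triggered only when an instruction tries to read a register whose bit is still cleared.

```python
# Minimal sketch of the switch-on-use model (illustrative assumptions only).

class Scoreboard:
    """One valid bit per register, as a simple form of scoreboard."""
    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs

    def issue_load(self, reg):
        # A load into `reg` has been issued; its value is not yet available.
        self.valid[reg] = False

    def load_returned(self, reg):
        # The result came back from the memory system / network.
        self.valid[reg] = True

    def can_use(self, reg):
        """True if the value can be used; False means a context switch
        must be triggered (switch-on-use)."""
        return self.valid[reg]


sb = Scoreboard()
sb.issue_load(5)              # load into r5 issued, valid bit cleared
if not sb.can_use(5):         # an instruction tries to use r5 too early
    print("switch context")   # switch-on-use: value still missing
sb.load_returned(5)           # result arrives, valid bit set again
assert sb.can_use(5)          # now the value can be used without switching
```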
The explicit-switch, conditional-switch, and switch-on-signal techniques enhance the instruction set architecture by additional instructions. The implicit-switch technique may favor a specific instruction set architecture encoding to simplify instruction class detection. All other techniques are microarchitectural techniques without the necessity of instruction set architecture changes.

4.3 Processor Examples
The MIT Sparcle [1] and the MSparc [20] processors switch context in case of remote memory accesses or failed synchronizations. They can be classified as block interleaving processors using switch-on-cache-miss and switch-on-signal techniques. Since both switching reasons are revealed in a late stage of the pipeline, the succeeding instructions that are already loaded into the pipeline cannot be used. Reloading of the pipeline and the software implementation of the context switch cause a context switch cost of 14 processor cycles in the Sparcle. The MSparc processor is similar to the Sparcle except for hardware support for context switching; a context switch is performed one cycle after the event is recognized. In the case of the MIT Sparcle, context switches are used only to hide long memory latencies, since small pipeline delays are assumed to be hidden by proper ordering of instructions by an optimizing compiler.

The multithreaded Rhamma processor [10,11] decouples execution and load/store pipelines. Both pipelines execute instructions of different threads. In contrast to the MIT Sparcle, the Rhamma processor is designed to bridge all kinds of latencies by a fast context switch applying various static and dynamic block-interleaving strategies.

Further block interleaving processor proposals include the PL/PS-Machine (Preload and Poststore) [14], which is most similar to the Rhamma processor, and the


Komodo microcontroller [3,15], a multithreaded Java microcontroller aimed at embedded real-time systems, with a hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements.

Compared to the cycle-by-cycle interleaving technique, block interleaving needs a smaller number of threads, and a single thread can execute at full speed until the next context switch. In addition, single-thread performance is similar to the performance of a comparable processor without multithreading.

5 Simultaneous Multithreading
Cycle-by-cycle interleaving and block interleaving are multithreading techniques which are most efficient when applied to scalar RISC or VLIW processors. Combining multithreading with the superscalar technique naturally leads to a technique where all hardware contexts are active simultaneously, competing each cycle for all available resources. This technique, called simultaneous multithreading (SMT), inherits from superscalars the ability to issue multiple instructions each cycle; and like multithreaded processors it contains hardware resources for multiple contexts. The result is a processor that can issue multiple instructions from multiple threads each cycle. Therefore, not only can unused cycles in the case of latencies be filled by instructions of alternative threads, but so can unused issue slots within one cycle.

Thread-level parallelism can come from either multithreaded, parallel programs or from individual, independent programs in a multiprogramming workload, while instruction-level parallelism is utilized from the individual threads. Because an SMT processor simultaneously exploits thread-level and instruction-level parallelism, it uses its resources more efficiently and thus achieves better throughput and speedup than single-threaded superscalar processors for multithreaded (or multiprogramming) workloads. The trade-off is a slightly more complex hardware organization.

The SMT approach combines a wide superscalar instruction issue with the multithreading approach by providing several register sets on the processor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized. The SMT fetch unit can take advantage of the interthread competition for instruction bandwidth in two ways. First, it can partition this bandwidth among the threads and fetch from several threads each cycle. In this way, it increases the probability of fetching only nonspeculative instructions. Second, the fetch unit can be selective about which threads it fetches. For example, it may fetch those that will provide the most immediate performance benefit.
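The core idea of filling one cycle's issue slots from several threads can be illustrated with the following Python sketch (ours, not a description of any surveyed processor; the issue width, the per-thread ready counts, and the round-robin slot assignment are assumptions).

```python
# Minimal sketch of SMT issue-slot filling (illustrative assumptions only).

def smt_issue(ready, issue_width=8):
    """Fill up to `issue_width` issue slots in one cycle from several threads.
    `ready` maps a thread id to the number of instructions it can issue this
    cycle; slots are handed out round-robin across the threads."""
    slots, remaining = [], dict(ready)
    while len(slots) < issue_width and any(remaining.values()):
        for tid in list(remaining):
            if len(slots) == issue_width:
                break
            if remaining[tid] > 0:
                slots.append(tid)       # one issue slot used by thread `tid`
                remaining[tid] -= 1
    return slots

# A single thread with little instruction-level parallelism wastes slots ...
print(smt_issue({"T0": 3}))                    # ['T0', 'T0', 'T0'] (5 slots unused)
# ... while several threads together can use the full issue bandwidth.
print(smt_issue({"T0": 3, "T1": 4, "T2": 2}))  # all 8 slots filled
```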
The main drawback of SMT may be that it complicates the issue stage, which is always central to the multiple threads. A functional partitioning as demanded for processors of the 10⁹-transistor era can therefore not easily be reached.

SMT processors can be organized in the following two ways.

5.1 Shared-resources Technique
They may share an aggressive pipeline among multiple threads when there is insufficient instruction-level parallelism in any one thread to use the pipeline fully. Instructions of different threads share all resources, such as the fetch buffer, the physical registers for renaming registers of different register sets, the instruction window, and the reorder buffer. Thus SMT adds minimal hardware complexity to conventional superscalars; hardware designers can focus on building a fast single-threaded superscalar and add multithread capability on top. The complexity added to superscalars by multithreading includes the thread tag for each internal instruction representation, multiple register sets, and the abilities of the fetch and retire units to fetch/retire instructions of different threads.

5.2 Replicated-resources Technique
The second organizational form replicates all internal buffers of a superscalar such that each buffer is bound to a specific thread. Instruction fetch, decode, rename, and retire units may be multiplexed between the threads or be duplicated themselves. The issue unit is able to issue instructions of different instruction windows simultaneously to the execution units. This form of organization adds more changes to the organization of superscalar processors but leads to a natural partitioning of the instruction window and simplifies the issue and retire stages.
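As a rough illustration of the two organizational forms (our sketch, not from the paper; the buffer contents and field names are assumptions), the shared-resources form keeps one instruction window whose entries carry a thread tag, while the replicated-resources form binds one window to each thread and lets the issue unit draw from all of them simultaneously.

```python
# Minimal sketch of the two SMT organizations (illustrative assumptions only).
from collections import namedtuple

Instr = namedtuple("Instr", "thread op")   # each entry carries a thread tag

# Shared-resources form: one instruction window holds entries of all threads.
shared_window = [Instr(0, "add"), Instr(1, "load"), Instr(0, "mul"), Instr(2, "store")]
issue_from_shared = shared_window[:2]      # the issue stage picks any ready entries

# Replicated-resources form: one window per thread, bound to that thread;
# the issue unit issues from different windows simultaneously, which
# partitions the window naturally and simplifies the issue and retire stages.
per_thread_windows = {0: ["add", "mul"], 1: ["load"], 2: ["store"]}
issue_from_replicated = [(tid, w[0]) for tid, w in per_thread_windows.items() if w]

print(issue_from_shared)
print(issue_from_replicated)
```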


5.3 Processor Examples
Until recently no commercial SMT processors existed. There were, however, several projects simulating different configurations of SMT processors and (at least) one which implemented SMT in hardware [6]. In what follows we briefly mention some of them.

At the University of Washington two simulations of the SMT processor architecture were made. The first was based on an enhancement of the Alpha 21164 processor [29]. Simulations were conducted to evaluate processor configurations of up to an 8-threaded and 8-issue superscalar. This maximum configuration with 32 execution units showed a throughput of 6.64 IPC on the SPEC92 benchmark suite. The next simulation was based on a hypothetical out-of-order issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000 [7]. This approach evaluated more realistic processor configurations, again with the 8-threaded and 8-issue superscalar organization, and reached a throughput of 5.4 IPC on the same benchmarks.

The SMT processor simulated at the University of Karlsruhe [23] was based on a simplified PowerPC 604 processor with multimedia enhancement [21]. The simulations showed that a single-threaded, 8-issue maximum processor (assuming an abundance of resources) reaches an IPC count of only 1.60, while an 8-threaded 8-issue processor reaches an IPC of 6.07. A more realistic processor model reaches an IPC of 1.21 in the single-threaded 8-issue model and 3.11 in the 8-threaded 8-issue model. Increasing the issue bandwidth from 4 to 8 yields only a marginal gain (except for the 4-threaded to 8-threaded maximum processor). Increasing the number of threads from single-threaded to 2-threaded or 4-threaded yields a high gain for the 2-issue to 8-issue models, and a significant gain for the 8-threaded model. The steepest performance increases arise for the 4-issue model from the single-threaded (IPC of 1.21) to the 2-threaded (IPC of 2.07) and to the 4-threaded (IPC of 2.97) cases. In [21] a 2-threaded 4-issue or a 4-threaded 4-issue processor configuration is suggested as a realistic next-generation processor.

Researchers at the University of California at Irvine combined out-of-order execution within an instruction stream with the simultaneous execution of instructions of different instruction streams [19], which resulted in a superscalar digital signal processor. Based on simulations, a performance gain of 20–55% due to multithreading was achieved across a range of benchmarks. Similarly, researchers at the Polytechnic University of Catalunya combined simultaneous multithreaded execution and out-of-order execution with an integrated vector unit and vector instructions [8]. Recently, a commercial four-threaded SMT processor, the Alpha 21464, has been announced.

6 Conclusion
Research on multithreaded architectures has been motivated by two concerns: tolerating memory latency and bridging synchronization waits by rapid context switches. Older multithreaded processor approaches from the 1980s usually extend scalar RISC processors by a multithreading technique and focus on effectively bridging very long remote memory access latencies. Such processors will only be useful as processor nodes in distributed-shared-memory multiprocessors. However, developing a processor that is specifically designed for distributed-shared-memory multiprocessors is commonly regarded as too expensive. Multiprocessors today comprise standard off-the-shelf microprocessors and almost never specifically designed processors (with the exception of the Tera MTA and SPELL). Therefore, newer multithreaded processor approaches also strive to tolerate smaller latencies that arise from primary cache misses that hit in the secondary cache, from long-latency operations, or even from unpredictable branches.

Multithreaded processors aim at a low execution time of a multithreaded workload, while a superscalar processor aims at a low execution time of a single program.
Depending on the implemented multithreading technique, a multithreaded processor running only a single thread does not reach the same efficiency as a comparable single-threaded processor. The penalty may be only slight in the case of a block-interleaving processor, or several times as long as the run time on a single-threaded processor in the case of a cycle-by-cycle interleaving processor.

References:
[1] A. Agarwal, J. Babb, D. Chaiken, G. D'Souza, K.L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, G. Maa, K. Mackenzie, Sparcle: A multithreaded VLSI processor for parallel computing, Lect. Notes Comput. Sc., Vol.748, 1993, pp.359.
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B.J. Smith, The Tera computer system, Proc. 1990 Int. Conf. Supercomput., Amsterdam, The Netherlands, June 1990, pp.1-6.
[3] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer, A multithreaded Java microcontroller for thread-oriented real-time event-handling, Proc. 1999 Conf. PACT, Newport Beach, CA, 1999, pp.34-39.
[4] M. Butler, T.-Y. Yeh, Y.N. Patt, M. Alsup, H. Scales, M. Shebanow, Single instruction stream parallelism is greater than two, Proc. 18th Ann. Symp. Comp. Arch., Toronto, Canada, May 1991, pp.276-286.
[5] M. Dorojevets, COOL multithreading in HTMT SPELL-1 processors, Int. Journal on High Speed Electronics and Systems, 1999 (to be published).
[6] M.N. Dorozhevets, P. Wolcott, The El'brus-3 and MARS-M: Recent advances in Russian high-performance computing, The Journal of Supercomputing, Vol.6, 1992, pp.5-48.
[7] S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.M. Stamm, D.M. Tullsen, Simultaneous multithreading: A platform for next-generation processors, IEEE Micro, Vol.17, September/October 1997, pp.12-19.
[8] R. Espasa, M. Valero, Exploiting instruction- and data-level parallelism, IEEE Micro, Vol.17, September/October 1997, pp.20-27.
[9] A. Formella, J. Keller, T. Walle, HPP: A high performance PRAM, Lect. Notes Comput. Sc., Vol.1123, 1996, pp.425-434.
[10] W. Grünewald, T. Ungerer, Towards extremely fast context switching in a block multithreaded processor, Proc. 22nd Euromicro Conf., Prague, Czech Republic, Sep. 1996, pp.592-599.
[11] W. Grünewald, T. Ungerer, A multithreaded processor designed for distributed shared memory systems, Proc. Int. Conf. Advances in Parall. Distrib. Comput., Shanghai, China, March 1997, pp.206-213.
[12] R.H. Halstead, T. Fujita, MASA: A multithreaded processor architecture for parallel symbolic computing, Proc. 15th Ann. Symp. Comp. Arch., Honolulu, HI, May/June 1988, pp.443-451.
[13] C. Hansen, MicroUnity's MediaProcessor architecture, IEEE Micro, Vol.16, August 1996, pp.34-41.
[14] K.M. Kavi, D.L. Levine, A.R. Hurson, A non-blocking multithreaded architecture, Proc. 5th Int. Conf. Advanced Comput., Madras, India, Dec. 1997, pp.171-177.
[15] J. Kreuzinger, R. Marston, T. Ungerer, U. Brinkschulte, C. Krakowski, The Komodo project: Thread-based event handling supported by a multithreaded Java microcontroller, Proc. 25th Euromicro Conf., Milano, Italy, Sep. 1999, pp.122-128.
[16] J. Kreuzinger, T. Ungerer, Context-switching techniques for decoupled multithreaded processors, Proc. 25th Euromicro Conf., Milano, Italy, Sep. 1999.
[17] M.S. Lam, R.P. Wilson, Limits of control flow on parallelism, Proc. 18th Ann. Symp. Comp. Arch., Toronto, Canada, May 1992, pp.46-57.
[18] J. Laudon, A. Gupta, M. Horowitz, Interleaving: A multithreading technique targeting multiprocessors and workstations, Proc. 6th Int. Conf. ASPLOS, San Jose, CA, Oct. 1994, pp.308-318.
[19] M. Loikkanen, N. Bagherzadeh, A fine-grain multithreading superscalar architecture, Proc. 1996 Conf. PACT, Boston, MA, Oct. 1996, pp.163-168.
[20] A. Mikschl, W. Damm, MSparc: A multithreaded Sparc, Lect. Notes Comput. Sc., Vol.1123, 1996, pp.461-469.
[21] H. Oehring, U. Sigmund, T. Ungerer, MPEG-2 video decompression on simultaneous multithreaded multimedia processors, Proc. 1999 Conf. PACT, Newport Beach, CA, 1999.
[22] Y.N. Patt, S.J. Patel, M. Evers, D.H. Friendly, J. Stark, One billion transistors, one uniprocessor, one chip, Computer, Vol.30, No.9, 1997, pp.51-57.
[23] U. Sigmund, T. Ungerer, Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip, Proc. 4th PASA Workshop Parall. Sys. and Algorithms, Jülich, Germany, Apr. 1996, pp.147-159.
[24] J. Šilc, B. Robič, T. Ungerer, Asynchrony in parallel computing: From dataflow to multithreading, Parallel and Distributed Computing Practices, Vol.1, No.1, 1998, pp.3-30.
[25] J. Šilc, B. Robič, T. Ungerer, Processor Architecture: From Dataflow to Superscalar and Beyond, Springer-Verlag, Berlin, New York, 1999.
[26] J. Šilc, T. Ungerer, B. Robič, A survey of new research directions in microprocessors, Microprocessors and Microsystems, Vol.24, No.4, 2000, pp.175-190.
[27] B.J. Smith, The architecture of HEP, In: J.S. Kowalik (ed.), Parallel MIMD Computation: HEP Supercomputer and Its Applications, MIT Press, Cambridge, MA, 1985, pp.41-55.
[28] M. Thistle, B.J. Smith, A processor architecture for Horizon, Proc. Supercomputing Conf., Orlando, FL, Nov. 1988, pp.35-41.
[29] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, Proc. 22nd Ann. Int. Symp. Comp. Arch., Santa Margherita Ligure, Italy, June 1995, pp.392-403.
[30] D.W. Wall, Limits of instruction-level parallelism, Proc. 4th Int. Conf. ASPLOS, Santa Clara, CA, Apr. 1991, pp.176-188.
