In Telegraphos-I the local portion of the shared memory of each node resides on the TurboChannel network interface. Thus, local shared memory accesses have to pay at least one TurboChannel roundtrip delay. Telegraphos II and more recent versions do not suffer from this problem, because the local portion of the shared memory is just a portion of the computer's main memory. In the first version of Telegraphos we used (rather small) FPGAs. To perform even simple local operations, several FPGAs need to cooperate. Each time some data travel from one FPGA to another, the cost is (usually) increased by one more TurboChannel cycle (80 ns). Recent versions of Telegraphos do not suffer from this problem, as they are being designed in ASIC technology.

To place the Telegraphos architecture in perspective, we compare it with other architectures ranging from workstation clusters to large-scale multiprocessors (Table 1). We see that Telegraphos is significantly better than previous-generation multiprocessors (like the Intel iPSC/2), is comparable to modern workstation clusters (like the SCI Dolphin cluster and the Memory Channel used for DEC's workstation clusters) and multiprocessors (like the Intel Paragon XP/S and Cedar), and is significantly better than a Local Area Network whose communication is based on software-implemented TCP/IP on top of Ethernet. We believe that the performance of future Telegraphos systems will be significantly improved, for the reasons mentioned above.

System name                          Link Latency (µsec)  Throughput (Mbps)  Source
SCI Dolphin Workstation Cluster      4 (one way)          -                  [19]
Memory Channel                       < 5 (one way)        -                  [5]
Intel iPSC/2                         350                  39                 [21]
Local Area Net (Ethernet - TCP/IP)   800                  6                  [21]
Intel Paragon XP/S                   15                   1600               [21]
Cedar                                1.1                  190                [21]
Telegraphos I                        7.2 (roundtrip)      103                -

Table 1: Comparative performance of Telegraphos-I and related architectures.

4 Telegraphos using SCI-over-ATM

We are currently developing a new architecture for a Telegraphos system that provides:

1. PCI bus interface
2. ATM network connectivity
3. SCI framing.

There are several reasons for defining this new generation of the architecture. The main ones that drive this effort (which are also goals of the architecture) are two: (i) caching of all data for high performance, and (ii) use of standards for easy scalability and "openness" to a large number of systems provided by different vendors.

4.1 Data Caching

One of the main drawbacks of architectures such as Telegraphos-I, PRAM [17], SHRIMP [2], etc., is that shared data backed by remote main memory cannot be cached. The reason is that all these architectures are designed with their shared, network memories non-cacheable, since their interfaces are attached to the workstation's I/O bus. In a typical conventional architecture, all memory attached to the I/O bus cannot be cached, for coherency reasons. New emerging processor architectures seem to be able to overcome this limitation, and provide mechanisms that allow designers to build systems with all memory space cacheable. Such features currently appear in the specification of DEC's Alpha processor [18], but we expect that their usefulness will attract other manufacturers as well.

An analysis of a remote read operation shows how the various consistency problems may be solved with processor features such as the ones included in Alpha. Let us assume that two processors, P1 and P2, are interconnected through a Telegraphos network, and that they are caching all data. We also assume that processor P1 reads a variable v which resides in P2. Then:

1. P1 issues a load instruction that causes a data miss, which requests the data from the Telegraphos interface;
2. the interface issues a remote read operation to P2 for the variable v;
3. the interface eventually receives a response from P2 (from the network) containing the value v;
4. v is provided to the cache, and from there to the processor.
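To make the four steps concrete, the toy C program below simulates the sequence on two nodes. All names and data structures here are our own invention for illustration; in the real system these steps are carried out by the Telegraphos interface hardware, transparently to the program.

```c
/* A minimal simulation of the remote read sequence above; all names
 * are hypothetical, and the "network" is just an array lookup. */
#include <stdio.h>
#include <stdint.h>

#define NODES 2
#define WORDS 16

static uint64_t main_memory[NODES][WORDS]; /* per-node main memory */
static uint64_t cache_mem[NODES][WORDS];   /* per-node cache (toy) */
static int      cache_valid[NODES][WORDS]; /* cache valid bits     */

/* Steps 2-3: the interface sends a read request over the network to
 * the home node and eventually receives the response.              */
static uint64_t tg_remote_read(int home, int word)
{
    return main_memory[home][word];
}

/* Step 1 happens when the load misses in the cache; step 4 installs
 * the value in the local cache, from where the load completes.     */
static uint64_t load(int self, int home, int word)
{
    if (home == self)                     /* local access           */
        return main_memory[self][word];
    if (!cache_valid[self][word]) {       /* step 1: data miss      */
        cache_mem[self][word] = tg_remote_read(home, word);
        cache_valid[self][word] = 1;      /* step 4: cache fill     */
    }
    return cache_mem[self][word];
}

int main(void)
{
    main_memory[1][3] = 42;               /* v resides on P2 (node 1) */
    printf("P1 reads v = %llu\n", (unsigned long long)load(0, 1, 3));
    return 0;
}
```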


The solution for caching in this case comes from the Alpha architecture, considering that the interface is attached to the I/O bus: the processor allows cacheable data over the I/O bus [18]. Furthermore, the ability of the processor to implement a consistency protocol enforces consistency between all data in the cache and the Telegraphos interface. Since current implementations of Alpha-based workstations do not seem to provide all this functionality, as an alternative we consider providing two bus interfaces on the card: one for the outgoing data through the I/O bus, and one for the memory bus of the workstation. In this case, all incoming data can be cached, since they are read through the memory bus, but special consideration has to be paid to the fact that the shared data may be stale if the system's cache is write-back.

Regarding processor P2, the data validity issues are resolved as follows:

- If v is in P2's cache, then the validity of the data read by P1 is guaranteed, provided that P2 has configured the memory portion where v resides as cacheable with write-through, or write-back with update, as the Alpha architecture handbook describes [18]. If this capability is not provided, then the data must be made consistent before the remote read is served; this can be implemented through memory barriers (MB instructions on Alpha), as sketched at the end of this subsection.
- If v is only in P2's main memory, then the data is valid.

Thus, several alternatives exist for building an interface for Telegraphos on the I/O bus that enables caching as well. As the features described above are not included in currently available systems, caching of remotely read data can be implemented only by attaching the Telegraphos interface to the memory bus (in addition to the I/O bus), so that incoming data arrives at the processor through that bus.
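The memory-barrier rule for P2 can be illustrated as follows. This is a minimal sketch assuming a GCC-style compiler; `wmb()` is a hypothetical wrapper around Alpha's MB instruction, with a portable compiler built-in as a stand-in on other targets.

```c
/* Sketch of the P2-side consistency rule described above, assuming a
 * write-back cache without hardware update: P2 must order its update
 * of v before any remote read can observe the ready flag.  The paper
 * names Alpha's MB instruction as the mechanism; wmb() is our own
 * hypothetical wrapper for it. */
#include <stdint.h>

#if defined(__alpha__)
#  define wmb() __asm__ __volatile__("mb" ::: "memory")
#else
#  define wmb() __sync_synchronize()  /* portable stand-in */
#endif

static volatile uint64_t v;       /* shared variable homed on P2   */
static volatile int v_ready;      /* flag P1 polls via remote reads */

void p2_publish(uint64_t value)
{
    v = value;    /* update the shared variable                    */
    wmb();        /* memory barrier (MB on Alpha): make v globally */
                  /* visible before the flag can be observed       */
    v_ready = 1;
}
```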


4.2 Use of Standards

The available Telegraphos prototype system is not an "open" design, since neither the interface nor the network uses standardized or widely used interfaces. It is important, though, to design interfaces and networks that can interconnect heterogeneous systems provided by a wide range of different vendors. So, the new version of the Telegraphos architecture is directed towards the use of standards. The main parameters that need to be considered in such an environment are the minimum packet size and the bandwidth/latency requirements. These are the main parameters, since the I/O bus choice is clearly made by the vendors providing the workstations. All these considerations led us to the choice of the following standards:

1. ATM network technology
2. SCI packet framing
3. PCI bus interface.

ATM is an emerging network transport technology that provides high bandwidth, low latency, and interoperability with other ATM systems. The choice of ATM is an important one in our architecture for two reasons:

- It fits the requirements of the new-generation Telegraphos, which is expected to have a larger packet size. Such a longer packet size fits well with the characteristics of ATM, which has a cell size of 53 bytes, with 48 bytes of useful data in the ATM Adaptation Layer 5 (AAL-5) that we intend to use.
- ATM is a technology that seems likely to become widespread in the near future in both LAN and WAN environments.

The remote memory operations implemented by Telegraphos require a reliable network that does not drop packets and that delivers packets in order. Fortunately, recent ATM switches [8, 20] provide flow control and guarantee in-order packet delivery.

SCI is also a standard, allowing scalability and the ability to interconnect with other SCI systems from different vendors. As coherence is an option in SCI and not a concern of the Telegraphos architecture, our interface will only provide SCI framing, without using (or supporting) the SCI coherence options. To implement SCI-over-ATM, a number of ATM virtual circuits will be reserved to carry remote memory requests framed in an SCI format. When the host workstation issues a remote memory read or write operation, the Telegraphos interface will use one of the special ATM VCs to send this request to the appropriate host. The Telegraphos interface on the destination host will receive the ATM cells over the special VC number and treat them as a shared memory operation. It will assemble the SCI packet from possibly several ATM cells, and execute the read or write operation requested. As long as the ATM network provides in-order, guaranteed delivery of packets, the shared memory operations will work without a problem.
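The segmentation step just described might look as follows. This is a sketch only: the `sci_request` layout and the `atm_send_cell` hook are invented for illustration and do not follow the actual SCI packet or AAL-5 formats (in particular, AAL-5 adds its own trailer and CRC, which we omit).

```c
/* Illustrative sketch of SCI-over-ATM framing: a remote memory
 * request, framed SCI-style, is carried in 48-byte ATM cell payloads
 * over a reserved virtual circuit. */
#include <stdint.h>
#include <string.h>

#define ATM_PAYLOAD 48               /* useful bytes per AAL-5 cell */

struct sci_request {                 /* simplified, invented layout */
    uint16_t dest_node;              /* target of the memory access */
    uint16_t command;                /* e.g. read or write          */
    uint64_t address;                /* remote memory address       */
    uint64_t data;                   /* payload for writes          */
};

/* Stub for the hardware hook that queues one cell on the reserved
 * VC; a real interface would hand the cell to the ATM link here.   */
static void atm_send_cell(int vc, const uint8_t payload[ATM_PAYLOAD])
{
    (void)vc;
    (void)payload;
}

/* Segment one SCI-framed request into as many ATM cells as needed;
 * in-order delivery lets the receiver reassemble the SCI packet.   */
void send_sci_over_atm(int vc, const struct sci_request *req)
{
    const uint8_t *p = (const uint8_t *)req;
    size_t left = sizeof *req;
    while (left > 0) {
        uint8_t cell[ATM_PAYLOAD] = {0};          /* pad last cell  */
        size_t n = left < ATM_PAYLOAD ? left : ATM_PAYLOAD;
        memcpy(cell, p, n);
        atm_send_cell(vc, cell);
        p += n;
        left -= n;
    }
}
```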
Finally, PCI seems to be the choice of upcoming high-performance workstations and PCs. Its ability to reach 1 Gbps throughput (increasing to 2 Gbps in PCI-2), as well as its low latency and the fact that it is a standard, are attractive features that will allow the development of high-speed, "open" systems that can accommodate several vendors' interface cards.

5 Related Work

The design of efficient shared-memory systems has been the focus of several groups in the last decade. Building efficient shared-memory multiprocessor systems is crucial for applications that need high performance. Several shared-memory multiprocessors have been built, from small-scale bus-based multiprocessors [16] to large-scale distributed-memory machines [4, 1]. Although networks of workstations may have an (aggregate) computing power comparable to that of a supercomputer (while costing significantly less), they have rarely been used to support high-performance computing, because communication on a network of workstations has traditionally been very expensive, making it prohibitively expensive for an application to use more than a few workstations.

There have been several projects to provide efficient communication primitives in networks of workstations via a combination of hardware and software: PRAM [17], MERLIN [22, 11], Galactica Net [7], Hamlyn [3], DEC's Memory Channel [5], and SHRIMP [2] provide efficient message passing on networks of workstations based on memory-mapped interfaces. Their shared-memory support, though, is limited, because they do not provide individual single remote memory accesses; thus a processor that wants to access a few words out of a page is forced to replicate the whole page locally and then access its data; moreover, as long as the page is replicated, it has to be kept coherent. SHRIMP and PRAM provide efficient methods of keeping copies of pages coherent, but do not provide user applications the ability to access a remote page without keeping a local copy of this page as well. Thus, the total amount of shared memory a processor may see at any time is limited by the amount of its local memory. In Telegraphos, instead, the amount of shared memory that a processor may see at any time is the total amount of shared memory in the system. Besides that, Telegraphos provides several sophisticated shared-memory primitives, like remote atomic operations and non-blocking fetch operations.
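As an illustration of the non-blocking fetch primitive just mentioned, the fragment below shows how a remote read could be overlapped with local computation. The `tg_fetch_*` names are hypothetical; the paper describes a hardware primitive, not a programming API, so the stubs merely stand in for the interface.

```c
/* Sketch of overlapping a non-blocking remote fetch with local work. */
#include <stdint.h>

typedef int tg_handle_t;

/* Toy stubs standing in for the (hypothetical) interface: start a
 * remote fetch without blocking, and later wait for its completion. */
static uint64_t pending_value;                 /* toy "network" state */
static tg_handle_t tg_fetch_start(int node, uint64_t remote_addr)
{
    (void)node;
    pending_value = remote_addr * 2;           /* pretend response    */
    return 0;                                  /* handle for the op   */
}
static uint64_t tg_fetch_wait(tg_handle_t h)
{
    (void)h;
    return pending_value;
}

/* Issue the fetch early, do useful local work while the remote read
 * is in flight, and only then consume the fetched value.            */
uint64_t overlap_example(int node, uint64_t addr)
{
    tg_handle_t h = tg_fetch_start(node, addr);
    uint64_t local = 0;
    for (int i = 0; i < 1000; i++)
        local += (uint64_t)i;
    return local + tg_fetch_wait(h);
}
```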


6 Summary

In this paper we describe Telegraphos, a distributed system suitable for efficiently supporting both message-passing and shared-memory applications on top of high-speed networks. Telegraphos has a memory-mapped network interface that avoids almost all software-imposed communication overhead. It uses the page mapping and protection mechanisms, existing in almost all virtual memory systems, to implement protection in message passing. Telegraphos also implements a fast remote-write hardware primitive that enables one processor to send a message to another processor by simply writing directly into the receiving processor's memory. No software is involved in passing the message, apart from the initialization phase, which makes sure that the sender is allowed to send messages to the receiver. The receiver gets the message by simply reading its local memory.
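A sketch of what message passing over the remote-write primitive could look like from user code, assuming (hypothetically) that the initialization phase has already mapped a page of the receiver's memory into the sender's address space at `remote_buf`; the mailbox layout and flag protocol are our own invention.

```c
/* Message passing via remote writes: the sender performs plain
 * stores, which the hardware turns into writes to the receiver's
 * memory; the receiver only reads its own local memory. */
#include <stdint.h>

struct mailbox {
    volatile uint64_t data;   /* message word                      */
    volatile uint64_t full;   /* written last, after data          */
};

/* Sender: two ordinary stores; writing the flag after the data
 * relies on the in-order delivery of remote writes noted above.   */
void send_msg(struct mailbox *remote_buf, uint64_t msg)
{
    remote_buf->data = msg;
    remote_buf->full = 1;
}

/* Receiver: poll local memory until the flag arrives; no software
 * is involved in the transfer itself.                             */
uint64_t receive_msg(struct mailbox *local_buf)
{
    while (!local_buf->full)
        ;
    local_buf->full = 0;
    return local_buf->data;
}
```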


Besides being efficient, Telegraphos is also affordable, because it can be connected into an existing workstation environment and upgrade it into a loosely-coupled multiprocessor.

We believe that Telegraphos demonstrates that it is feasible to build inexpensive shared-memory systems based on existing workstations. The main idea is to provide hardware support for the necessary shared-memory operations, while leaving complicated coherence decisions to software and to users who are willing to pay the cost of coherence if they are going to benefit from it.

Acknowledgments

Part of this work was developed in the ESPRIT/HPCN project "SHIPS", and will be used for the OMI project "Arches", funded by the European Union. We deeply appreciate this financial support, without which this work would not have existed.

We would like to thank P. Vatsolaki, D. Gustavson, G. Dramitinos, and C. Papachristos for useful comments on earlier drafts of this paper.

References

[1] BBN Laboratories. Butterfly Parallel Processor Overview. Technical Report 6148, BBN Laboratories, Cambridge, MA, March 1986.

[2] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In Proceedings of the Twenty-First Int. Symposium on Computer Architecture, pages 142-153, Chicago, IL, April 1994.

[3] G. Buzzard, D. Jacobson, S. Marovich, and J. Wilkes. Hamlyn: a high-performance network interface, with sender-based memory management. In Proceedings of Hot Interconnects III, August 1995.

[4] T. H. Dunigan. Kendall Square Multiprocessor: Early Experiences and Performance. Technical Report ORNL/TM-12065, Oak Ridge National Laboratory, May 1992.

[5] R. Gillet. Memory Channel. In Proceedings of Hot Interconnects III, August 1995.

[6] Hakan Grahn and Per Stenstrom. Efficient Strategies for Software-Only Directory Protocols in Shared Memory Multiprocessors. In Proceedings of the Twenty-Second ISCA, Santa Margherita Ligure, Italy, June 1995.

[7] Andrew W. Wilson Jr., Richard P. LaRowe Jr., and Marc J. Teller. Hardware Assist for Distributed Shared Memory. In Proceedings of the Thirteenth International Conference on Distributed Computing Systems, pages 246-255, Pittsburgh, PA, May 1993.

[8] M. Katevenis, S. Sidiropoulos, and C. Courcoubetis. Weighted Round-Robin Cell Multiplexing in a General-Purpose ATM Switch Chip. IEEE Journal on Sel. Areas in Communications, 8(9):1265-1279, 1991.

[9] M. Katevenis, P. Vatsolaki, and A. Efthymiou. Pipelined Memory Shared Buffer for VLSI Switches. In Proceedings of the ACM SIGCOMM '95 Conference, August 1995. URL: file://ftp.ics.forth.gr/tech-reports/1995/1995.SIGCOMM95.PipeMemoryShBuf.ps.gz.

[10] Manolis Katevenis. Telegraphos: High-Speed Communication Architecture for Parallel and Distributed Computer Systems. Technical Report 123, ICS-FORTH, May 1994.

[11] C. Maples. A High-Performance Memory-Based Interconnection System for Multicomputer Environments. In Proceedings of the Supercomputing Conference, pages 295-304, 1992.

[12] E.P. Markatos and C.E. Chronaki. Trace-Driven Simulations of Data-Alignment and Other Factors Affecting Update and Invalidate Based Coherent Memory. In Proceedings of the ACM International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '94), pages 44-52, January 1994.

[13] E.P. Markatos and G. Dramitinos. Implementation of a Reliable Remote Memory Pager. In Proceedings of the USENIX 1996 Technical Conference, January 1996. An earlier version was published as TR 129, Institute of Computer Science, FORTH. URL: file://ftp.ics.forth.gr/tech-reports/1995/1995.TR129.remote_memory_paging.ps.gz.

[14] H. E. Meleis and D. N. Serpanos. Designing Communication Subsystems for High-Speed Networks. IEEE Network Magazine, 6(4):40-46, July 1992.

[15] R. Rettberg and R. Thomas. Contention is No Obstacle to Shared-Memory Multiprocessing. Communications of the ACM, 29(12):1202-1212, December 1986.

[16] Sequent Computer Systems Inc. Balance 8000 System, 1985.

[17] D. Serpanos. Scalable Shared-Memory Interconnections. PhD thesis, Princeton University, Dept. of Computer Science, October 1990.

[18] R. Sites. Alpha AXP Architecture. Communications of the ACM, 36(2):33-44, February 1993.

[19] Dolphin Interconnect Solutions. Dolphin Breaks Cluster Latency Barrier with SCI Adapter, 1995. Press announcement.

[20] R. Souza, P. Krishnakumar, C. Ozveren, R. Simcoe, B. Spinney, R. Thomas, and R. Walsh. GIGAswitch System: A High-Performance Packet-Switching Platform. Digital Technical Journal, 1(6):9-22, 1994.

[21] B.K. Totty. Experimental Analysis of Data Management for Distributed Data Structures. Master's thesis, University of Illinois at Urbana-Champaign, 1992.

[22] Larry Wittie and Creve Maples. Merlin: Massively Parallel Heterogeneous Computing. In Proceedings of the 1989 ICPP, pages I:142-150, 1989.
