LARGE-SCALE PARALLEL GRAPH-BASED SIMULATIONS - MATSim
5. The receiver must unpack the message and tell the computer where to store the received data.

There are time delays associated with each of these phases. It is important to note that some of these time delays are incurred even for an empty message (“latency”), whereas others depend on the size of the message (“bandwidth restriction”). Effects of time delays are explained in Section 4.4.
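The latency/bandwidth distinction can be captured in the usual linear communication cost model, T(n) = T_latency + n / bandwidth. The following sketch uses made-up, illustrative numbers (0.5 ms latency, 100 Mbit/s bandwidth), not measurements from this work:

```python
def message_time(n_bytes, latency_s=0.5e-3, bandwidth_bytes_per_s=100e6 / 8):
    """Linear communication cost model: a fixed start-up cost ("latency")
    plus a size-dependent transmission term ("bandwidth restriction")."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Even an empty message pays the full latency:
empty = message_time(0)          # 0.0005 s
# A large message is dominated by the bandwidth term:
large = message_time(10_000_000)  # about 0.8 s of transmission on top of latency
```

This is why, as discussed below, minimizing the *number* of messages matters more than minimizing their size for the message counts typical of a parallel traffic simulation.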
4.1.2 Domain Decomposition
On PC clusters, two general strategies are possible for parallelization of data domains:

Task parallelization – The different modules of a transportation simulation package (traffic flow simulation, routing, activities generation, learning, pre-/postprocessing) are run on different computers. This approach is for example used by DynaMIT [17] or DYNASMART [18].
The advantage of this approach is that it is conceptually straightforward and fairly insensitive to network bottlenecks. The disadvantage is that the slowest module dominates the computing speed – for example, if, among the different modules, the traffic flow simulation uses up most of the computing time, then task parallelization of the modules will not help.
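The “slowest module dominates” argument can be made concrete with a toy throughput calculation (the per-module run times below are invented for illustration only):

```python
# Hypothetical per-iteration run times (seconds) of the modules when each
# module runs on its own machine (task parallelization).
module_times = {
    "traffic flow simulation": 300.0,
    "routing": 60.0,
    "activities generation": 20.0,
    "learning": 10.0,
}

serial_time = sum(module_times.values())          # one machine runs everything
task_parallel_time = max(module_times.values())   # pipeline limited by slowest stage

speedup = serial_time / task_parallel_time
# With the traffic flow simulation using most of the time, the speedup from
# task parallelization stays small (here 390/300 = 1.3), no matter how many
# machines are added.
```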
Domain decomposition – In this approach, each module is distributed across several CPUs. In fact, for most of the modules this is straightforward, since in current practical implementations activity generation, route generation, and learning are done for each traveler separately. Only the traffic flow simulation has tight interaction between the travelers, as explained in the following.
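Because activity generation, route generation, and learning are independent per traveler, those modules parallelize trivially: travelers can simply be dealt out to CPUs. A minimal sketch (the function name and round-robin scheme are illustrative choices, not from the text):

```python
def partition_travelers(traveler_ids, n_cpus):
    """Trivial decomposition for the per-traveler modules: since each
    traveler's activities, routes, and learning are computed separately,
    travelers can be dealt out round-robin to the available CPUs."""
    parts = [[] for _ in range(n_cpus)]
    for i, traveler in enumerate(traveler_ids):
        parts[i % n_cpus].append(traveler)
    return parts
```

Only the traffic flow simulation, where travelers interact on shared links, needs the graph-based decomposition described next.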
For PC clusters, the most costly communication operation is the initiation of a message (“latency”). In consequence, the number of CPUs that need to communicate with each other should be minimized. This is achieved through a domain decomposition (see Figure 4.2) of the traffic network graph. As long as the domains remain compact, each CPU will, on average, have at most six neighbors (Euler’s theorem for planar graphs). Since network graphs are irregular structures, a method to deal with this irregularity is needed. METIS [91] is a software package that specifically deals with decomposing graphs for parallel computation and is explained in more detail in Section 4.3.1.
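The idea behind such a decomposition can be illustrated with a deliberately naive stand-in: grow roughly equal-sized, compact domains by breadth-first search from seed nodes. This is only a sketch of the goal (compact, balanced domains); real work should use METIS, which minimizes the cut far better:

```python
from collections import deque

def grow_domains(adjacency, n_domains):
    """Naive illustration of graph decomposition: grow roughly equal-sized,
    compact domains by breadth-first search. Not a substitute for METIS."""
    nodes = list(adjacency)
    target = len(nodes) / n_domains  # ideal domain size
    assigned = {}
    domain = 0
    for seed in nodes:
        if seed in assigned:
            continue
        queue = deque([seed])
        size = 0
        while queue and size < target:
            v = queue.popleft()
            if v in assigned:
                continue
            assigned[v] = domain
            size += 1
            queue.extend(w for w in adjacency[v] if w not in assigned)
        domain = min(domain + 1, n_domains - 1)
    return assigned
```

Because BFS grows contiguous regions, neighboring network links tend to land on the same CPU, which is exactly what keeps the number of communicating CPU pairs small.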
The quality of the graph decomposition has consequences for parallel efficiency (load balancing): if one CPU has much more work to do than all the others, then all the others are obliged to wait for it, which is inefficient. For the current work, with the numbers of CPUs and network sizes used here, the “latency problem” (explained in Section 4.4) always dominates load balancing issues; however, it is generally useful to employ the actual computational load per network entity for the graph decomposition [57].
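The load-balancing quality of a given decomposition can be quantified by an imbalance factor: the time the slowest CPU needs divided by the ideal, perfectly balanced time. A small sketch (names and example loads are illustrative):

```python
def load_imbalance(entity_loads, assignment, n_cpus):
    """Imbalance factor of a decomposition: max per-CPU load divided by the
    average per-CPU load. 1.0 means perfect load balancing; 1.5 means the
    other CPUs spend a third of each time step waiting for the slowest one."""
    per_cpu = [0.0] * n_cpus
    for entity, cpu in assignment.items():
        per_cpu[cpu] += entity_loads[entity]
    ideal = sum(per_cpu) / n_cpus
    return max(per_cpu) / ideal

# Weighting entities by actual computational load can repair an imbalance:
loads = {"a": 1.0, "b": 1.0, "c": 2.0}
bad = load_imbalance(loads, {"a": 0, "b": 1, "c": 1}, 2)   # 1.5
good = load_imbalance(loads, {"a": 0, "b": 0, "c": 1}, 2)  # 1.0
```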
For shared memory machines, other forms of parallelization are possible, based on individual network links or individual travelers. A dispatcher could distribute links for computation in a round-robin fashion to the CPUs of the shared memory machine [31]; technically, threads [72] would be used for this. This is called fine-grained parallelism, as opposed to the coarse-grained parallelism which is more appropriate for message passing architectures. As stated above, the main drawback of this method is that one needs an expensive machine if one wants to use large numbers of CPUs.
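The round-robin dispatch of links to threads can be sketched as follows. This only illustrates the dispatch structure; a real implementation would use a language with true shared-memory thread parallelism (CPython threads serialize CPU-bound work), and the function names here are hypothetical:

```python
import threading

def update_links_fine_grained(links, n_threads, update):
    """Fine-grained shared-memory parallelism: links are dealt out
    round-robin to worker threads, which all operate on the same shared
    data structures (no message passing between domains)."""
    def worker(my_links):
        for link in my_links:
            update(link)

    threads = [
        threading.Thread(target=worker, args=(links[i::n_threads],))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In contrast to the domain decomposition above, no link is tied to a particular CPU, so load balancing is automatic; the price is that all CPUs must share one memory, which is what makes large such machines expensive.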