
DISS. ETH NO.

LARGE-SCALE PARALLEL GRAPH-BASED SIMULATIONS

A dissertation submitted to the
SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH
for the degree of
Doctor of Sciences

presented by
NURHAN ÇETİN
Master of Science in Computer Science
The Pennsylvania State University
born 01.12.1974
citizen of The Republic of Turkey

accepted on the recommendation of
Prof. Dr. Kai Nagel, examiner
Prof. Dr. Kay W. Axhausen, co-examiner

2005


Abstract

When systems are modeled, different techniques are used. Computer simulation is one of these techniques. It draws attention because it makes it possible to model a system, real or theoretical, to execute the model on a computer, and to analyze the output of that execution. The execution of a model on a computer develops through time, i.e. the states of the different parts of the system, such as variables and environment, are updated over time according to the rules defined in the model.

Computer simulations are attractive because they allow models with complex objects/variables, allow objects to have complex relationships, allow users to model artificial worlds, etc. This thesis focuses on different parts of a transportation planning system, MATSIM (Multi-Agent Transportation SIMulation), which is a computer simulation.

In MATSIM, as in other multi-agent simulations, all entities are treated at the individual level. Their behavior and interactions, both with each other and with the environment, are defined by their internal rules.

There are two layers in a transportation planning system: the physical layer, which includes a traffic flow simulator, and the strategic layer. In the traffic flow simulator, the agents interact with each other and with the environment based on the rules defined in the model. In the strategic layer, the agents make their strategies. The relationship between these two layers is best understood in an implementation called a framework.

A framework couples modules such as the traffic flow simulator, router, agent database, activity generator, etc. A traffic flow simulator defines the rules of interaction of the entities in the system. The traffic flow simulator used in MATSIM is based on a queue model developed by Gawron. It reads the street network of the area to be simulated and the plans of the agents, then executes these plans according to the rules of the queue model. The output of the traffic flow simulation, the events, is used to evaluate the performance of the plans. The evaluation is carried out by the modules of the strategic layer. The evaluated plans are fed back to the traffic flow simulator by starting a new iteration.

Parallel computing techniques are applied to the traffic flow simulator in order to handle large-scale scenarios at a microscopic level of detail. Different communication media and different communication libraries are used during this process.

The coupling of modules by the framework is via files. From the viewpoint of a traffic flow simulator, this means two files: plans as input and events as output. To avoid the inefficiencies of file I/O, a message passing approach is developed for plans and events. Different methods for creating and transferring different types of messages are investigated.

The traffic flow simulator based on the queue model can also be used for simulating other types of entities, such as Internet data packet traffic. As the Internet grows, analyzing the data flowing through it becomes increasingly interesting to researchers.


Zusammenfassung

Various techniques can be used to model systems. Computer simulation makes it possible to simulate a real or a theoretical system on a computer and then to analyze the output. Such a model is built in the computer and changed iteratively, i.e. the internal states are updated at every time step according to the rules defined in the model.

The advantage of computer simulations is that the models allow a greater complexity of the objects and their relationships than would be possible with an analytical treatment.

The focus of this thesis is on the different parts of the transportation planning system MATSIM (Multi-Agent Transportation SIMulation), which uses these techniques.

Like most multi-agent simulations, MATSIM treats all agents on an individual basis. Their behavior and their interactions (both with other agents and with the environment) are defined by rules.

There are two layers in a transportation planning system: the physical layer, which contains the traffic flow simulation, and the strategic layer. In the traffic flow simulation, the agents react to the other agents and to the environment. In the strategic layer, the decisions of the agents are modeled. The relationship between these two layers is formed by a framework. This framework connects the individual modules (traffic flow simulation, route generator, agent database, etc.).

The traffic flow simulation presented here is based on a queue model that was developed by Gawron. The street network and the plans of the agents are used as input. During the simulation, so-called events are output, which the modules can use to evaluate the quality of these plans. The plans are then slightly modified and tested again in the next run of the simulation.

In order to handle the size of the scenario used here, the traffic flow simulation must be distributed over several computers (distributed computing). Different communication media and libraries were evaluated.

The modules of the framework are connected via files. From the viewpoint of the traffic flow simulation, two kinds of files are used: plans as input, and events as output. Since reading and writing files can be very slow, a further approach was developed: sending plans and events as messages over the network. Several variants were compared.

The queue-based traffic flow simulation presented here can, besides simulating traffic, also be used, for example, for the data flow in computer networks. Such applications will gain in importance, not least through the growth of the Internet.


Acknowledgments

First of all, I would like to thank my advisor, Prof. Kai Nagel, for his guidance in making this thesis possible and for his support during the past years I have spent at ETH Zurich.

I would also like to thank my co-advisor Prof. Kay Axhausen for agreeing to be co-examiner and for the remarks he made to improve this thesis.

Many thanks to my office mate of 4 years, Bryan Raney, for all the interesting and helpful discussions that we had. Those discussions helped me a lot to broaden my vision.

I would like to thank Christian Gloor for the productive talks about work, computer science and life.

Thanks to Dr. Fabrice Marchal for not only giving me directions in Java but also being a friend beyond office life.

I would like to thank Marc Schmitt and the IT Support Group (a.k.a. ISG) for the maintenance of the computational resources used during my work. I am grateful to Martin Wyser, who took over the responsibility for the Xibalba cluster from Marc.

Thanks to Adrian Burri and Hinnerk Spindler for providing the data used in Figure 2.4 and Figure 8.1, respectively.

I am very grateful to Duncan Cavens, Bryan Raney and Lisa von Boehmer for proofreading this manuscript.

Thanks to Prof. Şebnem Baydere for her support in making it possible to take the initial steps towards my academic career, and to Prof. Feyzi İnanc for his support and his advice about academia and life.

Many thanks to my friends Özge, Canan, Mehtap, Pınar, İlker, Cenk, Mcan, Özlem, Chris, Onur, Emrah, Giray, Mahir, Gürhan, Berna, Gültek, Duygu, Bülo, Selin, Erdem, Nur, Selçuk, Hanna, Fuat and Volkan for their support and their friendship.

Last but not least, many thanks to my family for being supportive in whatever I do and whatever I choose.


Contents

Abstract i
Zusammenfassung ii
Acknowledgments iii

1 Introduction 1

2 The Queue Model for Traffic Dynamics 5
   2.1 Introduction 5
   2.2 Queue Model 6
      2.2.1 Gawron's Queue Model 6
      2.2.2 Fair Intersections and Parallel Update 7
      2.2.3 Graph Data as Input for Queue Simulation 11
      2.2.4 Vehicle Plans as Input for Queue Simulation 12
      2.2.5 Events as Output of Queue Simulation 13
   2.3 Other Work 14
   2.4 The Basic Benchmark 15
   2.5 A Practical Scenario for the Benchmarks 15
   2.6 Summary 16

3 Sequential Queue Model 17
   3.1 Introduction 17
   3.2 The Benchmark 18
   3.3 Performance Issues for C++/STL and C Functions 18
      3.3.1 The Standard Template Library 19
      3.3.2 Containers: Map vs. Vector for Graph Data 19
      3.3.3 Containers: Multimap vs. Linked List for Parking and Waiting Queues 24
      3.3.4 Containers: Ring, Deque and List Implementations of Link Queues 26
   3.4 Reading Input Files for Traffic Simulators 29
      3.4.1 The Extensible Markup Language, XML 29
      3.4.2 Structured Text 30
      3.4.3 XML vs. Structured Text Files: Plans Reading 30
      3.4.4 XML vs. Structured Text Files: Graph Data Reading 31
   3.5 Writing Events Files 33
   3.6 Conclusions and Discussion 35
   3.7 Summary 36


4 Parallel Queue Model 37
   4.1 Introduction 37
      4.1.1 Message Exchange 38
      4.1.2 Domain Decomposition 39
   4.2 Parallel Computing in Transportation Simulations 40
   4.3 Implementation 40
      4.3.1 Handling Domain Decomposition 41
      4.3.2 Handling Message Exchanging 42
      4.3.3 Communication Software 42
   4.4 Theoretical Performance Expectations 44
   4.5 Experimental Results 48
      4.5.1 Comparison of Different Communication Hardware: Ethernet vs. Myrinet 48
      4.5.2 Comparison of Different Communication Software: MPI vs. PVM 50
      4.5.3 Comparison of Different Packing Algorithms 51
      4.5.4 Different Domain Decomposition Algorithms 55
   4.6 Conclusions and Discussion 57
   4.7 Summary 58

5 Coupling the Traffic Simulation to Mental Modules 60
   5.1 Introduction 60
   5.2 Coupling Modules via Files 60
      5.2.1 Description of a Framework 60
      5.2.2 Performance Issues of Reading an Events File 63
      5.2.3 Performance Issues of Plan Writing 67
   5.3 Other Coupling Mechanisms 68
      5.3.1 Module Coupling via Subroutine Calls 68
      5.3.2 Module Coupling via Remote Procedure Calls (e.g. CORBA, Java RMI) 69
      5.3.3 Module Coupling via WWW Protocols 70
      5.3.4 Module Coupling via Databases 70
      5.3.5 Module Coupling via Messages 72
   5.4 Conclusions and Discussion 72
   5.5 Summary 73

6 Events Recorder 74
   6.1 Introduction 74
   6.2 The Competing File I/O Performance for Events 76
   6.3 Other Work 76
   6.4 Benchmarks 77
   6.5 Test Case 78
   6.6 Raw vs. XML Events 78
   6.7 Buffered vs. Immediate Reporting of Events 79
      6.7.1 Reporting Buffered Events 79
      6.7.2 Immediately Reported Events 79
   6.8 Theoretical Expectation for Buffered Events 79
      6.8.1 Packing Time Prediction 81
      6.8.2 Sending and Receiving Time Prediction 82
      6.8.3 Unpacking Time Prediction 83
      6.8.4 Writing Time Prediction 84
      6.8.5 Performance Prediction for Buffered Events: Putting it together 84


   6.9 Results of the Buffered Events 84
      6.9.1 Packing 85
      6.9.2 Sending 85
      6.9.3 Receiving 88
      6.9.4 Unpacking 90
      6.9.5 Writing into File 92
      6.9.6 Summary of "buffered events recording" 94
   6.10 Theoretical Expectations and Results of Immediately Reported Events 94
   6.11 Performance of Different Packing Methods for Events 97
      6.11.1 Using memcpy and Creating a Byte Array 97
      6.11.2 Using MPI_Pack and MPI_Unpack 97
      6.11.3 Using MPI Struct 98
      6.11.4 Classdesc 99
      6.11.5 Comparison of Results 100
   6.12 Conclusions and Discussion 100
   6.13 Summary 102

7 Plans Server 104
   7.1 Introduction 104
   7.2 The Competing File I/O Performance for Plans 104
   7.3 Benchmarks 105
      7.3.1 General 105
      7.3.2 mpiJava 105
   7.4 Java and C++ Implementations of the Plans Server 106
      7.4.1 Packing and Unpacking 107
      7.4.2 Storing Agents in the Plans Server 108
   7.5 Theoretical Expectations 109
      7.5.1 PSs Pack 110
      7.5.2 PSs Send and TSs Receive 110
      7.5.3 TSs Unpack 111
      7.5.4 TSs Pack and Send 111
      7.5.5 PSs Unpack 111
      7.5.6 Multi-casting Plans 112
   7.6 Results 112
      7.6.1 PSs Pack 112
      7.6.2 PSs Send 112
      7.6.3 TSs Receive 114
      7.6.4 TSs Unpack 115
      7.6.5 TSs Pack and Send 116
      7.6.6 PSs Receive and Unpack 117
   7.7 Conclusions and Summary 119

8 Going beyond Vehicle Traffic 121
   8.1 Introduction 121
   8.2 Queue Model as a Possible Microscopic Model for Internet Packet Traffic 122
   8.3 Summary 125

9 Summary 127

Curriculum Vitae 135


List of Figures

1.1 Physical and strategic layers of a traffic simulation system 2

2.1 Gawron's queue model 6
2.2 Simplifying the intersection logic 8
2.3 Pseudo code for traffic dynamics defined in the queue model 9
2.4 Test suite results for the intersection dynamics 9
2.5 Handling intersections according to the modified version of the fair intersections algorithm 10
2.6 Handling intersections according to Metropolis sampling 10
2.7 Handling intersections according to the modified Metropolis sampling 11
2.8 An example of the graph data in the XML format 12
2.9 An example for the plans data in the XML format 13
2.10 An example for the events data in the XML format 14

3.1 STL-containers for the graph data 20
3.2 The STL-map for the graph data 21
3.3 Insertion in the middle of an STL-vector by insert(position, object) 21
3.4 The STL-vector for the graph data 22
3.5 Linear search for the graph data 22
3.6 Sorting the graph data stored in an STL-vector 23
3.7 RTR and Speedup for using different data structures for the graph data 24
3.8 Declarations for waiting and parking queues with the STL-multimap 25
3.9 Declarations for waiting and parking queues with linked lists 25
3.10 RTR and Speedup for using different data structures for waiting and parking queues 26
3.11 Ring structure: insertion at the end, deletion from the beginning 28
3.12 RTR and Speedup for using different data structures for the spatial queues and the buffers 28
3.13 Reading plans from a structured text file, by using an STL-vector 32
3.14 Reading plans from a structured text file, by using fscanf 33

4.1 Handling the boundaries and split links 41
4.2 Domain decomposition by METIS for Switzerland 42
4.3 Pseudo code for parallel implementation of the queue model 43
4.4 Calculation of neighbors of computing nodes 47
4.5 RTR and Speedup curves for the Parallel Queue Model 49
4.6 RTR and Speedup graphs for PVM and MPI comparison 51
4.7 The data of a vehicle to be packed 52
4.8 Packing vehicle data with memcpy 53
4.9 Packing vehicle data with MPI_Pack 53


4.10 Packing vehicle data with MPI Struct 54
4.11 RTR graphs for different packing algorithms 55
4.12 RTR and Speedup graphs for METIS with single constraint 56

5.1 An example plan in the XML format 61
5.2 Physical and strategic layers of the framework coupled via files 62
5.3 Reading events by using the STL-map 64
5.4 Reading events by using the C++ operator >> 65
5.5 Reading events by using atoi/atof or strtod/strtol 66
5.6 Coupling via subroutine calls during within-day re-planning 69

6.1 Interaction between TSs and ERs 75
6.2 Pseudo code for the actions of TSs and ERs when events are buffered 80
6.3 Pseudo code for the actions of TSs and ERs when events are reported immediately 80
6.4 Pseudo code for packing a raw event 81
6.5 Pseudo code for packing an XML event 81
6.6 Time measurements for packing events 86
6.7 Time measurements for sending events 87
6.8 Comparison of Ethernet vs. Myrinet when sending events 88
6.9 Myrinet, multi-cast results for sending events 89
6.10 Ethernet, multi-cast results for sending events 90
6.11 Time measurements when receiving events over Myrinet 91
6.12 Comparison of Ethernet vs. Myrinet when receiving events 92
6.13 Time measurements for unpacking on top of the effective receiving time 93
6.14 Summary figures for the 1 ER case 95
6.15 Linear scale version of summary figures 96
6.16 Time measurements for sending when events are reported immediately 96
6.17 Pseudo code for packing different data types with memcpy 97
6.18 Pseudo code for packing different data types with MPI_Pack 98
6.19 A C-type struct 98
6.20 Pseudo code for packing different data types with MPI Struct 99
6.21 Pseudo code for packing different data types with Classdesc 100
6.22 Performance of different serialization methods 101

7.1 Pseudo code for the interaction of PSs and TSs 106
7.2 Sequence of task execution of TSs and PSs 107
7.3 Pseudo code for packing different data types with memcpy 108
7.4 An example for the methods of BytesUtil 108
7.5 Data structures for agents in a C++ Plans Server 109
7.6 Data structures for agents in a Java Plans Server 109
7.7 Time measurements for packing plans 113
7.8 Time measurements for sending plans over Myrinet 114
7.9 Time measurements for the effective receiving time of plans over Myrinet 115
7.10 Time measurements for unpacking plans on top of the effective receiving time over Myrinet 116
7.11 Time measurements for packing agent IDs by TSs 117
7.12 Time measurements for sending agent IDs by TSs to PSs over Myrinet 117
7.13 Time measurements for receiving agent IDs by PSs over Myrinet 118
7.14 Time measurements for unpacking agent IDs by PSs on top of the effective receiving time over Myrinet 118
7.15 Summary figures for the single PS case 119
7.16 Linear scale version of summary figures 119

8.1 Round-trip travel times for different sizes of messages 124


List of Tables

3.1 Performance results for reading different types of plans files and approaches 31
3.2 Performance results for reading the graph data 33
3.3 Performance results for writing the events file 34
3.4 Summary table of the serial performance results for different data structures of the traffic flow simulator 36

4.1 Summary table of the parallel performance results for different data structures of the traffic flow simulator 59

5.1 Performance results for reading the events file 67

6.1 Performance prediction table for buffered events 84
6.2 Performance results for ERs writing the events file 94
6.3 Summary table of the performance results of events transferred between TSs and ERs 103

7.1 Summary table of the performance results of plans transferred between TSs and PSs 120


Chapter 1

Introduction

In the area of "modeling and simulation," one typically designs a model of a system of interest, and then executes that model on a computer. This simulated model typically shows how the system of interest develops over time. Advantages of this approach over the observation of nature or experiments with nature include:

- Formulating and validating the computational model forces one to truly grasp the aspects of the dynamics of a system which make it function the way one observes.
- It is much easier to extract full information from the model that runs in the computer than from any experimental setting.
- One can change the model so that it reflects artificial rather than real worlds.
- One can make forecasts.

Because of these and many other advantages, computer simulation has joined the areas of "(analytical) theory" and "experiment" as a third method of scientific investigation.

With respect to spatially extended systems, one of the first areas where simulation was employed is the area of partial differential equations (PDEs): Models that had been formulated in mathematical terms before computers existed were re-formulated for the computer ("discretized") and then run. It quickly turned out that formulating computer-amenable versions of the partial differential equations was far from straightforward, and the sciences of Applied Mathematics and Scientific Computing have emerged around these issues.

An alternative way to model spatially extended systems is to model the involved particles directly. This is in contrast to PDEs, which in some sense model fields of particles. In this area of particle models, the introduction of computers has perhaps changed the field even more

than in the area of PDEs: It is now possible to simulate systems with $10^8$ or more particles, which makes it possible to simulate the evolution of (tiny) samples of material directly on the molecular level.

Typical particles are relatively simple entities: For example, atoms can be adequately described by variables such as location, velocity, mass, charge, and angular momentum. The same is true for granular materials, such as sand (e.g. [90]). There are, however, other systems where the particle approach seems intuitive but the particles are no longer simple. This is true, for example, when modeling humans (socio-economic systems), Internet packets, or certain aspects of biological systems. This is where multi-agent simulations [22] come in. They still model the involved particles directly, as particle simulations do, but they spend much more intellectual and computational effort on modeling and simulating the internal dynamics of the particles. This means that one is faced with three sub-problems:


1. Simulation of the physical system,
2. Simulation of the internal dynamics of the particles,
3. Simulation of the interaction between these two.

When the internal dynamics of the particles consists of mental processes, the simulation of the internal dynamics is sometimes called the strategic layer of the complete simulation system. Accordingly, the simulation of the physical system is then called the physical layer of the complete simulation system. Figure 1.1 illustrates the two layers and their interactions in a traffic simulation system.

[Figure 1.1: Physical and strategic layers of a traffic simulation system. The strategical world (concepts which are in someone's head) sends plans (acts, routes, ...) to the physical world and receives performance info in return; the physical world comprises limits on acceleration/braking, excluded volume, vehicle-vehicle interaction, vehicle-system interaction, pedestrian-vehicle interaction, etc.]

Many systems where multi-agent simulation would be interesting are large. For example, a typical metropolitan area traffic system (the main example of this text) is used by several millions of travelers. A typical ecosystem can consist of several millions of animals, not counting entities such as bacteria. The immune system, sometimes also modeled by multi-agent approaches, contains about $10^{12}$ T-cells.

Therefore, the simulation of large multi-agent systems needs to be considered. As in other areas, in large-scale multi-agent systems the use of parallel computers needs to be evaluated. As will be explained later, in parallel computing one segments the system of interest into several pieces, and gives each piece to a different computing node.¹ Since the computing nodes work on the problem simultaneously, the collection of computing nodes solves the problem much faster than a single computing node would. The interesting question is how to segment the problem so that the simulation runs efficiently. Perhaps contrary to intuition, just distributing the agents is usually not a good idea, since agents that often interact with each other may end up on different computing nodes, and the necessary information exchange between those computing nodes makes the computation inefficient. Rather, one needs to group the agents such that agents that interact often are on the same computing node. Since much interaction is spatial, this means that agents, when they move around in space during the simulation, need to be moved around between the computing nodes.

¹ A computing node can be a computer or a CPU (Central Processing Unit) of a computer with more than one CPU.


This thesis will explore parallel computing issues in the area of multi-agent mobility simulations. As a specific example, it will explore parallel multi-agent simulations in the area of transport planning. Within that area, it will explore the following two sub-problems:

- Parallel traffic flow simulation. This item corresponds to "1. Simulation of the physical system" in the list above.
- Exchange of information between the traffic flow simulation and the strategic layer in a parallel computing context. This item corresponds to "3. Simulation of the interaction" in the list above.

Despite the focus on transport planning, the concepts developed in this thesis are general enough to be useful for the simulation of any kind of system where mobile particles with complex internal dynamics move around and interact in a physical world. This definition includes all simulations where humans move around. In addition, the traffic flow simulation used in this work (the so-called queue simulation) is general enough that it can be applied to problems where the dynamics of packet movement in a graph is of interest.

Traditional (static) transportation planning uses a four-step process in modeling travel demand. These four steps are:

- Trip Generation: estimation of the number of incoming trips of possible destinations and outgoing trips of possible origins in a region.
- Trip Distribution: producing the origin-destination (OD) matrix by matching origins with destinations to complete the trips.
- Modal Split: determining the travel mode (taking public transport, driving a car, walking, etc.).
- Traffic Assignment: assigning a route to each traveler to get to its destination.

This model does not meet the requirements of modern transportation planning. There are two main reasons for this:

1. In the four-step process, information is aggregated into traffic streams, so there is no access to information at the individual level. In other words, steady-state streams do not distinguish between travelers.
2. Static modeling in the four-step process misses time-dependency, so temporal effects such as time-dependent congestion spill-back are not covered.

The first item can be solved by treating travelers as individual entities. A known solution is called activity-based demand generation (ADG) [34], which generates daily activity plans for each individual. For example, an individual can have an activity plan, composed of a set of activities such as "being at home", "working", "leisure", etc., planned for a day.

The activities of activity-based demand generation are scheduled over time; thus activity-based demand generation is time-dependent, as opposed to the lack of time-dependency of static assignment mentioned in item 2 above. Hence, an alternative technique called Dynamic Traffic Assignment (DTA) has been employed in the transportation planning area (e.g. [19, 20, 27, 5]). This model includes spill-back queues formed during the movement of travelers along links and nodes. Static assignment has the advantage over DTA of having a unique solution, which can be proved mathematically. DTA with spill-back queues, on the other hand, does not guarantee a unique solution, and this makes it harder to find an analytical solution. Consequently, solutions are obtained computationally.

Two basic components of DTA are route generation for individuals and network loading. Network loading is the process in which the routes are executed. Typically, a simulation technique is used for the network loading part of DTA. To couple DTA and ADG, DTA also needs to maintain the travelers as individual entities, as ADG does. This means that individual entities have individual attributes and that decisions are made on an individual basis. Hence, an agent-based or multi-agent approach is employed to emphasize the individual entities.

Traffic dynamics with spill-back is solved by systematic relaxation. The systematic relaxation process performs a multi-agent learning method based on the following sequence (a code sketch of the loop is given below):

1. Make an initial guess for the routes of all agents.
2. Execute all agents' routes simultaneously in a traffic flow simulation (network loading).
3. Re-calculate some or all of the routes using the knowledge of the network loading.
4. Go back to 2.

From the viewpoint of the conceptual layers explained above, the route generation happens at the strategic layer and the network loading (item 2) corresponds to the physical layer. The queue model considered in this thesis corresponds to the network loading part. It is the model on which the traffic flow simulation described in this thesis is based.
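To make the control flow of the relaxation loop concrete, here is a minimal, self-contained C++ sketch. All names in it (Plan, Events, network_loading, replan) are illustrative stand-ins: in MATSIM the router and the traffic flow simulation are separate modules coupled via files or messages, not function calls.

#include <cstdio>
#include <vector>

struct Plan   { int route_id; };           // stand-in for an agent's daily plan
struct Events { double avg_travel_time; }; // stand-in for the simulation output

// 2. Network loading: execute all plans simultaneously (dummy here).
Events network_loading(const std::vector<Plan>& plans) {
    return Events{600.0};
}

// 3. Re-plan a fraction of the agents using the performance information.
void replan(std::vector<Plan>& plans, const Events& ev, double fraction) {
    if (ev.avg_travel_time <= 0.0) return;
    int n = static_cast<int>(plans.size() * fraction);
    for (int i = 0; i < n; ++i) plans[i].route_id++;  // pretend re-routing
}

int main() {
    std::vector<Plan> plans(1000);           // 1. initial guess for all agents
    for (int it = 0; it < 50; ++it) {        // 4. go back to 2.
        Events ev = network_loading(plans);  //    physical layer
        replan(plans, ev, 0.1);              //    strategic layer
        std::printf("iteration %d done\n", it);
    }
    return 0;
}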

This thesis is organized as follows: A multi-agent traffic flow simulator, based on the queue model, for the physical layer of a transportation planning system called MATSIM [50] is presented in Chapter 2. Chapter 2 also explains the input data, namely the street network and the plans of travelers, and the output data, which is composed of the events that occurred during the network loading process. The computational aspects of the sequential execution of the traffic flow simulator are discussed in Chapter 3. How parallel computing is introduced into the traffic flow simulation is explained in Chapter 4. Chapter 5 discusses different methods to couple the strategic and the physical layers of MATSIM. Chapter 6 and Chapter 7 explain how the different types of data can be exchanged between modules: in particular, how the output data (events data) is extracted out of the traffic flow simulation and how the input data (plans data) is brought into the traffic flow simulation, respectively. Chapter 8 gives a vision of how the traffic flow simulation can be used for Internet packet traffic, and is followed by a summary.


Chapter 2

The Queue Model for Traffic Dynamics

2.1 Introduction

A traffic flow simulation consistent with the multi-agent approach discussed in Chapter 1 and, e.g., by [67, 65] should fulfill the following conditions:

- The model should have individual travelers/vehicles¹ in order to be consistent with all agent-oriented learning approaches.
- The model should be simple, in order to be comparable with static assignment and in order to allow concentration on computational issues rather than modeling issues. This includes the ability to parallelize the simulation in software.
- The model should be computationally fast so that scenarios of a meaningful size can be run within acceptable periods of time.

For the work presented here, a fourth condition is also stated:

- The model should be somewhat realistic so that meaningful comparisons to real-world results can be made.

These conditions make the use of existing software packages, such as DynaMIT [17], DYNASMART [18], or TRANSIMS [59], difficult, since these packages are already fairly complex and complicated. An alternative is to select a simple model for large-scale microscopic network simulations, and to re-implement it. If one wants queue spill-back, there are essentially two starting points: queueing theory, and the theory of kinematic waves.

In queueing theory, one can build networks of queues and servers [76, 14, 73]. Packets enter the network at an arbitrary queue. Once in a queue, they typically wait in a first-in, first-out (FIFO) queue until they are served; servers serve queues at a given rate. Once a packet is served, it enters the next queue.

This can be directly applied to car/vehicle traffic, where packets correspond to vehicles, queues correspond to links, and serving rates correspond to link capacities. The decision of a vehicle about which link to enter after it is served at an intersection is given by the vehicle's route, which is a list of nodes (intersections) that the vehicle must pass through during its trip.

¹ Terminology: In multi-agent simulations, agents are the units. The traffic flow simulation described here simulates vehicle traffic; the agents in the traffic model are vehicles, and "agents" and "vehicles" are used interchangeably. Although a vehicle is generally not restricted to a "car", it represents a car, and accordingly a driver of person-type, throughout this thesis.


Handling Constraints - Original Algorithm

for all links do
    while vehicle has arrived at end of link
          AND vehicle can be moved according to capacity
          AND there is space on destination link do
        move vehicle to next link
    end while
end for

Figure 2.1: Gawron's queue model.

A shortcoming of this type of approach is that it does not model spill-back. If queues have size restrictions, then packets exceeding that restriction are typically dropped [76]. Since this is not realistic for traffic, an alternative is to refuse further acceptance of vehicles once the queue is full ("physical queues"). This means that the serving rate of the upstream server is influenced by a full queue downstream. Gawron presents an example of such a model in [26]. A detailed algorithmic description is given in Figure 2.1.

An important issue with physical queues is that the intersection logic needs to be adapted. Without physical queues (i.e. with "point queues"), the outgoing links can always accept all incoming vehicles, so the maximum flow through each incoming link is just given by that link's capacity. However, when outgoing links have limited space, that space needs to be distributed among the incoming links which compete for it.

In the original algorithm (Figure 2.1), links are processed in an arbitrary but fixed sequence. This has the consequence that the most favored link at a given intersection is the one that is processed next after the congested outgoing link has been processed. This could, for example, mean that a small side road obtains priority over a large main road.

A better way to handle this problem is to allocate flow under congested conditions according to capacity [16]. For example, if there are two incoming links with capacities 2 and 4 vehicles per time step, and the outgoing link has 3 spaces available, then 1 space should be allocated to the first incoming link and 2 to the second. Section 2.2.2 explains intersection handling in more detail.
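To make the arithmetic of this example concrete, the following is a minimal sketch of a capacity-proportional split of the available spaces. It illustrates the allocation rule only; the simulator's actual implementation uses a randomized selection, as described in Section 2.2.2.

#include <cstdio>

int main() {
    const double cap1 = 2.0, cap2 = 4.0; // capacities of the two incoming links
    const int spaces = 3;                // free spaces on the outgoing link

    // Allocate the available spaces in proportion to the incoming capacities:
    // 3 * 2/6 = 1 and 3 * 4/6 = 2, matching the example in the text.
    int share1 = static_cast<int>(spaces * cap1 / (cap1 + cap2) + 0.5);
    int share2 = spaces - share1;

    std::printf("link 1: %d space(s), link 2: %d space(s)\n", share1, share2);
    return 0;
}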

A shortcoming of queue models is that the speed of the backwards traveling kinematic wave ("jam wave") is not correctly modeled. A vehicle that leaves the link at the downstream end immediately opens up a new space at the upstream end into which a new vehicle can enter, meaning that the kinematic wave speed is roughly one link per time step, rather than a realistic velocity. This becomes visible in the dissolution of jams, say at the end of a rush hour: If a queue extends over a sequence of links, then the jam should dissolve from the downstream end. In the queue model, it will essentially dissolve from the upstream end. More details of this, including schematic fundamental diagrams, can be found in [74, 26].

2.2 Queue Model

2.2.1 Gawron's Queue Model

The so-called queue model introduced by Gawron [26] is used as the base of the traffic dynamics of the traffic flow simulation. Gawron's queue model defines three key concepts, namely free flow travel time, storage constraint and capacity constraint.

Each link has, from the input files, the attributes free flow velocity $v_0$, length $\ell$, capacity $C$ and number of lanes $n_{lanes}$. The free flow travel time is calculated as $T_0 = \ell / v_0$. Each vehicle must spend at least the free flow travel time on a link before leaving it.

The storage constraint of a link is defined as the maximum number of vehicles that a link can hold at the same time. It is calculated as $N_{storage} = \ell \cdot n_{lanes} / L_{veh}$, where $L_{veh}$ is the space a single vehicle occupies on average in a jam, which is the inverse of the jam density. $L_{veh} = 7.5$ m is taken throughout this work.

The capacity constraint (flow capacity) of a link, on the other hand, defines an upper bound on the number of vehicles that can be released from a link in a given time step. This constraint is given as input.
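The following is a minimal sketch of a link data structure that derives these quantities from the input attributes; the struct and its member names are illustrative assumptions, not the simulator's actual code.

// Minimal sketch of a link with Gawron's three key quantities.
struct Link {
    double length;        // l, in meters (from input file)
    double freeSpeed;     // v0, in m/s (from input file)
    double flowCapacity;  // C, in vehicles per time step (from input file)
    int    numLanes;      // n_lanes (from input file)

    // Free flow travel time T0 = l / v0: the minimum time on the link.
    double freeFlowTravelTime() const { return length / freeSpeed; }

    // Storage constraint: maximum number of vehicles the link can hold,
    // with L_veh = 7.5 m per vehicle in a jam (inverse of the jam density).
    int storageCapacity() const {
        return static_cast<int>(length * numLanes / 7.5);
    }
};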

The intersection logic by Gawron is that all links are processed in an arbitrary but fixed sequence, and a vehicle is moved to the next link if (1) it has arrived at the end of the link, (2) it can be moved according to capacity, and (3) there is space on the destination link. Figure 2.1 gives the algorithm. The three conditions mean the following:

- A vehicle that enters a link at time $t_0$ cannot leave the link before time $t_0 + T_0$, where $T_0$ is the free flow link travel time as explained above.

- The condition "vehicle can be moved according to capacity" is determined as $N < \lfloor C \rfloor$, or $N = \lfloor C \rfloor$ and $rnd < C - \lfloor C \rfloor$, where $\lfloor C \rfloor$ is the integer part of the link capacity $C$ (in vehicles per time step), $C - \lfloor C \rfloor$ is the fractional part of the capacity, $N$ is the number of vehicles which have already left the link in the current time step, and $rnd$ is a random number with $0 \le rnd < 1$. According to this formula, up to $\lfloor C \rfloor$ vehicles can leave the link in each time step, and one additional vehicle leaves with a probability equal to the fractional part of the capacity (a code sketch of this check is given after Figure 2.2).

- The condition "there is space on the destination link" refers to the storage constraint of the destination link explained above.

2.2.2 Fair Intersections and Parallel Update

Each link has, at its downstream end, a buffer of size $\lceil C \rceil$, i.e. the first integer number being larger than or equal to the link capacity (in vehicles per time step). Vehicles are then moved from the link (the spatial queue) into the buffer according to the capacity constraint and only if there is space in the buffer; once in the buffer, vehicles can be moved across intersections without looking at the flow capacity constraints. This approach is borrowed from lattice gas automata, where particle movements are also separated into a "propagate" and a "scatter" step [24]. Vehicles move through the nodes without any delay, as all the constraints that define the eligible vehicles of a link are determined by the link properties (see Figure 2.2).

[Figure 2.2: Simplifying the intersection logic by introducing a separate buffer for each link besides the spatial queue. At each node, vehicles move from the spatial queue into the buffer according to the capacity constraint, and from the buffer onto the next link according to the storage constraint.]
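As a minimal sketch, the stochastic capacity check described above could be implemented as follows; the function and its parameter names are illustrative assumptions, not the simulator's actual code.

#include <cstdlib>

// Sketch: may one more vehicle leave a link with flow capacity C (in
// vehicles per time step) after N vehicles have already left in this step?
bool canMoveAccordingToCapacity(int N, double C) {
    int intPart = static_cast<int>(C);   // floor(C), since C >= 0
    double fracPart = C - intPart;       // fractional part of C
    if (N < intPart) return true;        // integer capacity not yet used up
    if (N == intPart) {
        // One additional vehicle leaves with probability equal to fracPart.
        double rnd = std::rand() / (RAND_MAX + 1.0);
        return rnd < fracPart;
    }
    return false;
}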

As a desired side effect, this makes the update in the algorithm completely parallel: If a vehicle is moved out of a full link, the new empty space will only open in the buffer and not on the link, and will thus not become available at the upstream end until the next time step – at which time it will be shared between the incoming links according to the method described above. This has the advantage that all information which is necessary for the computation of a time step is available locally at each intersection before a time step starts – and in consequence there is no information exchange between intersections during the computation of a time step. Further details are given in algorithmic form in Figure 2.3.

In order to systematically test the intersection logic, an intersection test suite was implemented [7]. This test suite goes through several different intersection layouts and tests them one by one to see if the dynamics behaves according to the specifications. The results of possible layouts typically look as shown in Figure 2.4.

The curves in Figure 2.4 show time versus the number of vehicles that have left the link so far. Thus, the slope of a curve equals the measured flow capacity in vehicles per second. For the data in the figure, one link with a capacity of 500 vehicles/sec and one link with a capacity of 2000 vehicles/sec merge into a link with a capacity of 500 vehicles/sec. The curves are, for the different algorithms explained below, time-dependent accumulated vehicle counts for the two incoming links. For approximately the first 50-100 time steps, both incoming links operate at full capacity (500 and 2000 vehicles/second) and fill the outgoing link. Until approximately time step 3400, both links discharge at rates of 400 and 100 vehicles/sec, respectively. After that time, the first link is empty, and the second link then discharges at 500 vehicles/sec. Not all algorithms are similarly faithful in generating the desired dynamics; the thick black lines denote results from the algorithm which is the current implementation in the traffic flow simulator. Further details are explained in [7].

In Figure 2.4, Algorithm-1 uses Gawron's original algorithm as described in Section 2.2.1 and in [26]. This algorithm may lead to wrong results. For example, when a vehicle leaves a full link, a free space becomes available immediately, so that another vehicle can enter the link in the same time step. Hence, the results of the simulation depend on the sequence in which the links are processed. As stated above, the parallel update is used to get rid of this problem in the traffic flow simulation.

Algorithm-2 uses the "fair intersections and parallel update" approach described above, and is provided in Figure 2.3. Algorithm-3, given in Figure 2.5, is very similar to Algorithm-2.


Vehicle Movement through Intersections

// Propagate vehicles along links:
for all links do
    while vehicle has arrived at end of link
          AND vehicle can be moved according to capacity
          AND there is space in the buffer (see Fig 2.2) do
        move vehicle from link to buffer
    end while
end for

// Move vehicles across intersections:
for all nodes do
    while there are still eligible links do
        Select an eligible link randomly, proportional to capacity
        Mark link as non-eligible
        while there are vehicles in the buffer of that link do
            Check the first vehicle in the buffer of the link
            if the destination link has space then
                Move vehicle from buffer to destination link
                Proceed to the next vehicle in the buffer
            else
                Break the inner while loop and proceed to the next eligible link
            end if
        end while
    end while
end for

Figure 2.3: Vehicle movement at the intersections. Note that the algorithm separates the flow capacity from the intersection dynamics.
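The step "select an eligible link randomly, proportional to capacity" in Figure 2.3 amounts to a standard weighted random draw. The following sketch shows one way to do it; the function, its parameters and the use of std::mt19937 are illustrative assumptions, not the simulator's actual code.

#include <cstddef>
#include <random>
#include <vector>

// Sketch: pick the index of an eligible incoming link with probability
// proportional to its flow capacity; returns -1 if no link is eligible.
int selectLinkProportionalToCapacity(const std::vector<double>& capacities,
                                     const std::vector<bool>& eligible,
                                     std::mt19937& rng) {
    double total = 0.0;
    for (std::size_t i = 0; i < capacities.size(); ++i)
        if (eligible[i]) total += capacities[i];
    if (total <= 0.0) return -1;

    std::uniform_real_distribution<double> uni(0.0, total);
    double r = uni(rng);
    for (std::size_t i = 0; i < capacities.size(); ++i) {
        if (!eligible[i]) continue;
        if (r < capacities[i]) return static_cast<int>(i);
        r -= capacities[i];
    }
    return static_cast<int>(capacities.size()) - 1;  // guard against rounding
}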

[Figure 2.4: Test suite results for the intersection dynamics. The plot shows accumulated vehicle counts ("Count", 0-600) over time (0-8000 time steps) for the two incoming links ("link 400" and "link 200") under algorithms 1-5. The curves show the number of discharging vehicles from the two incoming links as explained in Section 2.2.2.]


Algorithm-3 for Vehicle Movement through Intersections:

Same as Alg. 2.3 up to this point
// Move vehicles across intersections:
for all nodes do
    while there are still eligible links do
        Select an eligible link randomly, proportional to capacity
        if the destination link has space then
            Move one vehicle from buffer to destination link
            Mark link as non-eligible and proceed to the next link
        else
            Proceed to the next link
        end if
    end while
end for

Figure 2.5: Handling intersections according to the modified version of the fair intersections algorithm. Similar to Algorithm 2.3, except that each link can now push only one vehicle at a time.

Algorithm-4 for Vehicle Movement through Intersections:

for all nodes do
    if node visited for the first time then
        Choose first incoming link randomly
    end if
    for i = 1..(the number of incoming links) do
        Choose next incoming link via Metropolis sampling
        if link buffer is empty then
            Mark link as non-eligible
        else
            Take first vehicle in the buffer
            if destination link for vehicle has space then
                Move that vehicle from buffer to destination link
            else
                Mark link as non-eligible
            end if
        end if
    end for
end for

Figure 2.6: Handling intersections according to Metropolis sampling.

Algorithm-3, given in Figure 2.5, is very similar to Algorithm-2, except that instead of serving all the "eligible" vehicles of an incoming link to their destination links, only one vehicle is moved at a time. Hence, Algorithm-3 and Algorithm-2 do not show any difference when no link has a capacity greater than 1.

Algorithm-4 implements the fair intersections approach with a difference: the selection is done via Metropolis sampling [55], with one exception. When a node is processed for the first time, the first incoming link is selected randomly. In general, if the next link $j$ has a lower capacity than the current link $i$, then link $j$ is selected with a probability that depends on the ratio of the capacity of link $j$ to the capacity of link $i$. Pseudo code of the algorithm is given in Figure 2.6.
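To make the selection rule concrete, the following is a minimal C++ sketch of such a Metropolis-style link selection. It is an illustration under assumptions, not the actual MATSIM code: the proposal scheme (cycling to the next incoming link) and the acceptance rule min(1, C_next/C_current) are filled in for the sake of the example.

    #include <cstdlib>
    #include <vector>

    // Hypothetical link type; only the capacity is needed here.
    struct Link { double capacity; };

    // Draw a uniform random number in [0,1).
    static double uniformRand() { return std::rand() / (RAND_MAX + 1.0); }

    // Metropolis-style walk over the incoming links of a node: a proposed
    // move to the next link is always accepted if its capacity is at least
    // as large as the current one; otherwise it is accepted with
    // probability C_next / C_current.
    std::size_t nextLinkMetropolis(const std::vector<Link>& incoming,
                                   std::size_t current) {
        std::size_t proposal = (current + 1) % incoming.size();
        double ratio = incoming[proposal].capacity / incoming[current].capacity;
        if (ratio >= 1.0 || uniformRand() < ratio)
            return proposal;   // accept the proposed link
        return current;        // reject: stay on the current link
    }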



Algorithm-5 for Vehicle Movement through Intersections:

for all nodes do
    if node visited for the first time then
        Choose first incoming link randomly according to capacity
    end if
    for i = 1..(the number of incoming links) do
        Choose next incoming link via Metropolis sampling
        if link buffer is empty then
            Mark link as non-eligible
        else
            Take first vehicle in the buffer
            if destination link for vehicle has space then
                Move that vehicle from buffer to destination link
            else
                Mark link as non-eligible
            end if
        end if
    end for
end for

Figure 2.7: Handling intersections according to the modified Metropolis sampling.

Finally, Algorithm-5 is similar to Algorithm-4, except that if a node is visited for the first time, the first incoming link is selected according to the flow capacity. The algorithm is given in Figure 2.7.

The queue model reads flow capacities, free speeds and link lengths from the input files and from these calculates the free flow link travel times (the link length divided by the free speed). The free flow link travel time defines the minimum time that vehicles must spend on that particular link. While this lower bound is known, the upper bound on the time a vehicle spends on a link before moving to the next link depends on how long the vehicle waits at the end of the link. If the randomized selection is not in favor of a link on which a vehicle is ready to move to the next link (Figure 2.3), the travel time on that link increases.

A remark has to be made about flow capacities. When several very short links (such as links with a buffer size of 1) exist, they reduce the number of vehicles discharged from the longer links, since the available space is reduced by the short links.²

2.2.3 Graph Data as Input for Queue Simulation

The traffic flow simulation is fed by the graph data (the street network) and the plans of the vehicles to be executed. Plans are explained in Section 2.2.4. Before the execution of the plans, the simulation reads the nodes and links of the street network. The street network is defined in the XML [97] format; a rough example is shown in Figure 2.8. XML is explained in detail in Section 3.4.1.

² The problem can be seen as follows: assume a short link with a given non-integer capacity (per second), with long links of the same capacity both upstream and downstream. Then, according to standard queuing theory, the queue length on the short link follows a random walk. However, when that random walk makes the short link completely full, the upstream link is no longer allowed to discharge into the short link. Since this happens fairly often with short links, short links reduce the effective capacity. Note that the effective capacity reduction is felt on the upstream link. This phenomenon has little effect with the long links of the Swiss street network described in Section 2.5, but became apparent in validation studies with the Navtec network of the Zurich area, which has many short links.



    <network>
        <nodes>
            <node id="1" x="651700" y="137200"/>
            <node id="2" x="652220" y="137600"/>
        </nodes>
        <links>
            <link id="2" from="2" to="1" length="657"
                  capacity="12000" freespeed="11.1" permlanes="1"/>
            <link id="3" from="1" to="2" length="657"
                  capacity="12000" freespeed="11.1" permlanes="1"/>
        </links>
    </network>

Figure 2.8: An example of the graph data in the XML format

Each node is identified by a unique ID and x-y coordinates. Each link has the attributes ID, the IDs of the nodes that it connects, length, capacity, free flow speed and number of permanent lanes. The capacity is given in "vehicles per time unit" and refers to the flow (capacity) constraint of the link.

The graph data example in Figure 2.8 is composed of 2 nodes and 2 links. Links connect nodes and define a direction; for example, link 2 goes from node 2 to node 1.

Each node in the traffic flow simulation keeps track of its outgoing and incoming links. When a link of the graph data is read, pointers to it are placed in the arrays for incoming and outgoing links at the nodes that the link connects. The arrays for outgoing and incoming links are used especially when the movement of vehicles across nodes (intersections) is realized as written in Figure 2.3: nodes check the buffers of incoming links for vehicles ready to move to any of the outgoing links. Furthermore, in the parallel implementation explained in Chapter 4, the vehicles that move across the boundaries are packed into messages by the nodes. A minimal sketch of this wiring is given below.
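As an illustration, the following C++ sketch shows how the incoming/outgoing arrays could be filled while the network is read; the class and member names (Node, Link, registerLink) are hypothetical and not taken from the actual code.

    #include <vector>

    struct Link;   // forward declaration

    // Hypothetical node type: keeps pointers to its adjacent links.
    struct Node {
        std::vector<Link*> inLinks;    // links ending at this node
        std::vector<Link*> outLinks;   // links starting at this node
    };

    // Hypothetical link type: knows the nodes it connects.
    struct Link {
        Node* fromNode;
        Node* toNode;
    };

    // Called once per link element when the network file is read:
    // the link registers itself at both of its end nodes.
    void registerLink(Link* link) {
        link->fromNode->outLinks.push_back(link);  // outgoing at "from" node
        link->toNode->inLinks.push_back(link);     // incoming at "to" node
    }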

Each link is mainly composed of a spatial queue and a buffer, so that the flow constraint is separated from the intersection logic as described earlier. Both the buffer and the spatial queue are simply queues of pointers to vehicles. Besides these two structures, there are three more supplementary queues defined for each link:

Parking queue: holds vehicles of initial legs (see Section 2.2.4) with start times in the future.

Waiting queue: holds vehicles of initial legs (see Section 2.2.4) whose start time is up but which cannot make it into the traffic because of full links.

Storage: holds the second or higher legs of vehicles. These legs can be executed only once the execution of the previous legs is completed.

Links are also responsible for putting the constraints into practice; hence, nodes do not need to deal with constraints at all. As shown in Figure 2.8, the capacity constraint, which determines the size of the buffer, is read from the input data. The storage constraint is calculated from the length and the number of permanent lanes given in the input data (Section 2.2.1). A minimal sketch of such a link follows.
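Putting the pieces together, a link could be laid out as in the following minimal C++ sketch; the concrete container choices shown here (std::deque, std::multimap) anticipate the discussion in Chapter 3, and all names are illustrative assumptions.

    #include <deque>
    #include <map>

    struct Veh;           // vehicle, defined elsewhere
    typedef int Time;

    // Hypothetical link layout following the description above.
    struct Link {
        // true FIFO queues with finite size (see Section 3.3.4):
        std::deque<Veh*> spatialQueue;  // vehicles driving/queuing on the link
        std::deque<Veh*> buffer;        // vehicles ready to cross the node

        // supplementary queues, sorted by start time (see Section 3.3.3):
        std::multimap<Time, Veh*> parkingQueue;  // initial legs, future start
        std::multimap<Time, Veh*> waitingQueue;  // start time up, link full
        std::multimap<Time, Veh*> storage;       // second and higher legs

        int bufferSize;       // from the capacity (flow) constraint
        int storageCapacity;  // from length and number of permanent lanes
    };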

2.2.4 Vehicle Plans as Input for Queue Simulation

Vehicles are inserted into one of the queues defined on the links (Section 2.2.3) according to their start times and leg numbers. Hence, the simulation needs to know about the graph data before reading any vehicle information.


    <person id="6357250">
        <plan>
            <act type="h" x100="387345" y100="276590" link="14584"/>
            <leg mode="car" deptime="06:54:35" travtime="00:30">
                <route>4902 4903 4904 4905 4906 4907 4908 4909</route>
            </leg>
            <act type="w" x100="387345" y100="276590"
                 link="14606" dur="08:00"/>
        </plan>
    </person>

Figure 2.9: An example of the plans data in the XML format

An example of a person's plan is given in Figure 2.9. Each person has a unique ID and a plan. A plan is composed of a set of activities. Each activity defines a location, given by the coordinates of the location and a link ID, on which the activity will start. Each pair of consecutive activities describes a leg of the plan. The leg provides information about the means of transportation, the earliest time at which a vehicle can start its execution, the expected travel time from the start activity location to the end activity location of the leg, and a set of node IDs that defines the route to be followed when moving from the start activity location to the end activity location.

The traffic flow simulation creates a new agent/vehicle for each leg defined in a person's plan. In case a person has more than one leg, the simulation makes sure that the higher-numbered legs wait for the completion of the execution of the previous legs.

2.2.5 Events as Output of Queue Simulation

Since the queue simulation does not aggregate data (Section 5.2.1), it only produces events as output for the other modules in the system, which are better able to check the correctness of their own data aggregation. An event is produced whenever a vehicle moves from one queue to another or leaves the simulation for one of various reasons. Possible events are of the following types (not limited to those listed here):

Departure: moving from the parking queue of a link to the waiting queue of the same link, since the start time has arrived.

Leaving a waiting queue: moving from the waiting queue of a link to its spatial queue to start simulating.

Leaving a link: leaving the current link.

Entering a link: entering the next link (the vehicle leaves the current link just before this event happens).

Being stuck and leaving the simulation: being stuck in congestion for a specific time period and leaving the simulation afterwards.

Arrival: arrival at the final destination.

A set of events of a vehicle in the XML (Section 3.4.1) format is shown in Figure 2.10. The example shows the events created while the plan of vehicle 6465 is executed.




    <event time="06:00" type="departure" vehid="6465"
           legnum="0" link="1523" from="3827"/>
    <event time="06:00" type="wait2link" vehid="6465"
           legnum="0" link="1523" from="3827"/>
    <event time="06:01" type="left link" vehid="6465"
           legnum="0" link="1523" from="3827"/>
    <event time="06:01" type="entered link" vehid="6465"
           legnum="0" link="1524" from="3828"/>
    <event time="06:28" type="left link" vehid="6465"
           legnum="0" link="1524" from="3828"/>
    <event time="06:28" type="entered link" vehid="6465"
           legnum="0" link="1525" from="3829"/>
    <event time="06:34" type="arrival" vehid="6465"
           legnum="0" link="1525" from="3829"/>

Figure 2.10: An example of the events data in the XML format

The vehicle starts simulating at 6 AM on link 1523 and, during its trip to destination link 1525, it traverses link 1524. All the events belong to leg 0. The upstream ends of links 1523, 1524 and 1525 are located at nodes 3827, 3828 and 3829, respectively.

2.3 Other Work

Two arguments against the queue model are often that the intersection behavior is "unfair" in standard implementations, and that the speed of the backwards traveling ("kinematic") jam wave is incorrectly modeled. The first problem was overcome by a better modeling of the intersection logic, as described in Section 2.2.2. The second problem still remains. What can be done about it?

If one wants to avoid a detailed traffic flow simulation, such as is implemented in TRANSIMS [82] for example, then a possible solution is to use what is sometimes called "mesoscopic models" or "smoothed particle hydrodynamics" [28]. The idea is to have individual particles in the simulation, but to have them moved by aggregate equations of motion. These equations of motion should be selected so that in the fluid-dynamical limit the Lighthill-Whitham-Richards equation [48] is recovered [15].

The number of vehicles in a segment is updated according to

\[ N_j(t+\Delta t) \;=\; N_j(t) \;+\; \Delta t \left[\, q_{j-1,j}(t) \;-\; q_{j,j+1}(t) \;+\; s_j(t) \,\right] \tag{2.1} \]

where $N_j(t)$ is the number of vehicles in segment $j$ at time $t$, $q_{j-1,j}(t)$ is the flow of vehicles from segment $j-1$ into segment $j$ at time $t$, and $s_j(t)$ is the source/sink term given by entry and exit rates.

What is missing is the specification of the flow rates $q_{j-1,j}$. A possible specification is given by the cell transmission model [15]:

\[ q_{j-1,j}(t) \;=\; \min\left\{\, v\,N_{j-1}(t),\;\; q_{\max},\;\; w\,\bigl(N_{\max} - N_j(t)\bigr) \,\right\} \tag{2.2} \]

where $q_{\max}$ is the capacity constraint, $w$ is the jam wave speed, $v$ is the free speed, $N_{\max}$ is the maximum number of vehicles on the link, and all other variables have the same meaning as before.



Note that this now exactly enforces the storage constraint by setting $q_{j-1,j}$ to zero once $N_j$ has reached $N_{\max}$. In addition, the kinematic jam wave speed is given explicitly via $w$. There is some interaction between the length of a segment, the time step, and $w$ that needs to be considered. The network version of the cell transmission model [16] also specifies how to implement fair intersections. The cell transmission model is implemented under the name NETCELL [9].
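As a concrete illustration of Equations (2.1) and (2.2), the following C++ sketch performs one update step for a chain of segments; the variable names, units and boundary handling are assumptions made for this example and are not taken from any of the cited implementations.

    #include <algorithm>
    #include <vector>

    // One cell-transmission-model update step for a chain of segments.
    // N[j]  : vehicles in segment j
    // qMax  : capacity constraint (vehicles per time step)
    // v, w  : free speed and jam wave speed (in cells per time step)
    // nMax  : maximum number of vehicles per segment
    void ctmStep(std::vector<double>& N, double qMax,
                 double v, double w, double nMax) {
        std::size_t n = N.size();
        std::vector<double> q(n + 1, 0.0);  // q[j] = flow from cell j-1 into j

        // Equation (2.2): flow between neighboring cells; the boundaries
        // are closed here (q[0] = q[n] = 0) purely for simplicity.
        for (std::size_t j = 1; j < n; ++j)
            q[j] = std::min({ v * N[j - 1], qMax, w * (nMax - N[j]) });

        // Equation (2.1) with time step 1 and no source/sink term (s_j = 0).
        for (std::size_t j = 0; j < n; ++j)
            N[j] += q[j] - q[j + 1];
    }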

Other link dynamics are provided, for example, by DynaMIT [19], DYNASMART [20] or DYNEMO [61]. These are based on the same mass conservation equation as Equation (2.1), but use different specifications for $q_{j-1,j}$. In fact, DynaMIT and DYNASMART calculate vehicle speeds at the time of entry into the segment depending on the number of vehicles already in the segment. The number of vehicles that can potentially leave a link in a given time step is, in consequence, given indirectly via this speed computation. Since this is not enough to enforce physical queues, physical queuing restrictions are added to this description. DYNEMO varies a vehicle's speed continuously along the link based on the traffic conditions of the current and the next segment.

2.4 The Basic Benchmark

A real-world scenario is preferred for the benchmarks throughout this thesis instead of using a synthetic scenario or using only theoretical performance predictions.

A theoretical performance prediction gives an idea about what to expect. However, such predictions may miss performance-relevant details that appear only when real-world data is used. For example, if a test data set is small enough to fit into a computer's memory while the real-world data is bigger than the available memory, the predictions based on the small data set will include cache effects that do not carry over. In particular, if the data set fits into the cache, which is a high-speed memory system, there might be a significant speed-up since the data is accessed at higher speed.

Synthetic scenarios are generated from synthetic data. They are used to make generalizations about the performance of real-world scenarios. If a real-world scenario with enough information to test all the features of a benchmark is not available, a synthetic scenario with full information is useful. Furthermore, where applicable, the results are easier to transfer between different scenarios. On the other hand, if the real-world data to be mimicked covers a lot of details, then the generation of a similar synthetic scenario becomes harder.

2.5 A Practical Scenario for the Benchmarks

One of the conditions that a traffic flow simulation must fulfill is that it should be able to run scenarios of a meaningful size within acceptable periods of time. From the transportation planning point of view, such scenarios are large-scale real-world problems, which include millions of agents and all kinds of traffic.

The street network of Switzerland used in the benchmarks of this thesis was originally developed for the Federal Office for Spatial Development (ARE) [6]. The network was later extended to include the major European transit corridors for a railway-related study [85]. The version of the street network used throughout this thesis is a derivative of the extended network. It contains 10 564 nodes and 28 622 links.

The nodes of the street network are the intersections of roads and are defined by geographical coordinates. The links are the roads that connect two nodes. Each link is unidirectional and has attributes such as type, length, speed, and capacity (the capacity constraint).

A scenario called ch6-9 is used in the benchmarks throughout this work. It contains around 1 million trips which start between 6:00 AM and 9:00 AM; the scenario is thus intended to simulate morning rush-hour traffic. These trips are based on a realistic demand [65].

The steps followed when converting trips to agents and plans are: (1) a unique agent is assigned to each trip; (2) the starting and ending links of the trip become the home and work locations of the agent, and corresponding activities are created at these locations; (3) each trip becomes a leg between these two activities; and (4) each trip is completed with a route from the start link to the end link based on free flow travel times in the network.

As stated in Chapter 1, a systematic relaxation is used to bring the system from its initial state to a relaxed state. For the relaxation, the initial plans are fed into the traffic flow simulator, which represents the physical world of the framework as explained in Chapter 1. The results of the traffic flow simulation are used to improve the plans of some agents, which are then merged with those of the remaining agents (whose plans have not been changed). The merged plans are fed into the traffic flow simulator again. Each iteration, therefore, involves reading the input files (plans and graph data), executing all the plans, and improving some of the plans according to the output of the traffic flow simulation. The process is repeated until the system is relaxed, which takes about 50 iterations. Earlier investigations have shown that this is more than enough to reach relaxation [68]. A sketch of this iteration loop is given below.
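The following C++ fragment outlines one such relaxation run; all types and function names are illustrative stubs, with each module reduced to a single call.

    struct Network {};
    struct Plans {};
    struct Events {};

    // Module interfaces, simplified to one call each (declarations only):
    Events runTrafficFlowSimulation(const Network& network, const Plans& plans);
    void replanSomeAgents(Plans& plans, const Events& events);

    // Illustrative outline of the systematic relaxation loop.
    void relax(Plans& plans, const Network& network, int iterations = 50) {
        for (int it = 0; it < iterations; ++it) {
            // 1. Execute all plans in the traffic flow simulator;
            //    the simulator produces events as output (Section 2.2.5).
            Events events = runTrafficFlowSimulation(network, plans);

            // 2. Improve the plans of some agents based on the events;
            //    the remaining agents keep their previous plans.
            replanSomeAgents(plans, events);
        }
        // After about 50 iterations the system is considered relaxed [68].
    }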

During the performance tests of the traffic flow simulation given in the next chapters, the ch6-9 scenario is simulated for 3 hours, i.e. 10800 time steps (one time step corresponds to one simulated second).

2.6 Summary

Using a traffic flow simulator for transportation planning is a method for network loading, which is one of the two components of Dynamic Traffic Assignment (DTA). Traffic flow simulations differ in criteria such as resolution (individual vs. aggregated entities), how realistic the behavior of the entities is, the modes of the entities, and the time resolution.

Existing traffic flow simulators are realistic and detailed enough, but their complexity makes them difficult to use. The queue simulation presented here is not only favored for its simplicity but is also realistic enough for forecasting in transportation planning [65].

The standard queue-based implementation comes with two main shortcomings: it exhibits unfair behavior at the intersections, and it incorrectly models the speed of backwards traveling jam waves. The former is remedied by improving the standard intersection behavior as explained in Section 2.2.2. A solution to the latter is still absent: in the queue model, a traffic jam dissolves from the upstream end, whereas in kinematic wave theory it dissolves from the downstream end. Some other transportation planning software packages resolve this problem; however, they either suffer from being too complicated, such as the CA model, or are not modeled at the level of individual entities and in consequence lack agent-oriented approaches.

Despite this remaining shortcoming, the queue model meets the criteria of being comparable to static traffic assignment, as shown in [65], of computing fast, and of being receptive to agent-oriented approaches.



Chapter 3

Sequential Queue Model

3.1 Introduction

The queue model is explained in Chapter 2. The computational concern is mentioned as one of the reasons why the queue model was chosen:

"The model should be computationally fast so that scenarios of a meaningful size can be run within acceptable periods of time."

In order to run scenarios of meaningful size, parallel computing is used for the traffic flow simulation. This will be explained in detail in Chapter 4. With parallel computing, the same simulation runs on different computers on different pieces of data; for example, the same traffic flow simulation code simulates different geographical areas. Because of the parallel execution, the results can be obtained faster than with single-CPU execution.

Although parallel programming speeds up the execution of a program, it is not enough by itself. Improving the single-CPU version of the same program is just as significant in terms of performance.

The first traffic flow simulator written as part of this thesis was implemented in C. That simulator displayed considerable computational performance, but turned out to be difficult to maintain. For that reason, an alternative traffic flow simulator was programmed in C++ [80, 70], taking advantage of the new possibilities that the Standard Template Library (STL, Section 3.3.1) offers. However, that new C++ code turned out to be about a factor of two slower than the old C code.

Many system designers prefer object-oriented programming (such as C++ and Java [42]) because the complexity of systems has increased over the last decades. C++ has become a dominant programming language for such complex systems. Moreover, recommendations on how to approach certain problems in C++ have been developed (e.g. [53]). Today, with careful programming, C++ using the STL can be as fast as C.

For that reason, it was attempted to bring the C++ traffic flow simulator to the same computational speed as the C traffic flow simulator. This was done by implementing and testing several different approaches recommended by [53], and by inserting C code into time-critical pieces of the code. The goal was to find out where the C++ implementation has performance disadvantages compared to the C implementation, and how severe these disadvantages are. The results of the investigation can then be used to make informed decisions regarding the trade-off between maintainability and computational performance of the code.



3.2 The Benchmark

The traffic flow simulator described in Section 2.2 is part of the transportation planning system explained in Chapter 1. One of the goals of such a system is to run a realistic scenario of meaningful size for data analysis and predictions. This planning system is not complete in the sense of common transportation planning, which also includes all modes of transportation, freight traffic, etc. Such a complete system involves about 7.5 million travelers and more than 26 million trips, including short pedestrian trips, etc. [2].

However, in order to make the transportation planning system explained in Chapter 1 useful in the real world, the scenario "ch6-9" described in Section 2.5 is used. ch6-9 is a subset of the data for the full 24-hour car-only simulation and contains about 1 million trips.

When the traffic flow simulation is coupled with the strategy generation modules via files as explained in Section 5.2.1, it takes its data from the input files and writes its results to the output files. The computational performance of the traffic flow simulation is measured excluding the performance of input reading and output writing. There are two reasons for this:

As investigated in Chapter 6 and Chapter 7, external modules can be defined to handle the input and output of a traffic flow simulation; using files is just an implementation issue, and a message passing approach, for example, may be used to replace files with messages. Hence, the performance of the simulation itself (i.e., how the graph data and vehicles are represented, how the data is accessed, how the rules are executed) is the main concern.

I/O requires access to the disk where the files are stored. However, I/O performance is limited by disk speeds, and file I/O operations usually deliver low performance.

During the measurements of the traffic flow simulation performance, a 3-hour time period is simulated. Time steps are incremented by 1 simulation-time second; therefore, the total number of simulated time steps is 10800. In each time step, 3 basic movements are accomplished: movement through intersections, movement along links, and movement from the waiting queues (where vehicles wait to enter the simulation) to the spatial queues. Each of these movement steps loops over all the nodes or all the links of the graph data. Accordingly, they dominate the overall performance.

The figures in the next sections, which show computational performance curves, are plotted over the number of CPUs, since the results of some approaches depend on the number of CPUs.

3.3 Performance Issues for C++/STL and C Functions

C++ has been made more functional by the introduction of the Standard Template Library (STL). The STL is an extensive library of common containers and functions written using C++ templates. In this section, some remarks are given regarding the experiences with using different STL containers for different purposes in the traffic flow simulation. Although some of the results just confirm common sense, others are specific to the situation at hand in this work.

The next section gives a brief definition of the STL. The sections following Section 3.3.1 discuss different implementation alternatives and their performance for the different parts of the traffic flow simulator. Section 3.3.2 compares the STL-map and the STL-vector used to represent the street network. Using the STL-multimap for the parking and waiting queues of the links in the street network is explained in Section 3.3.3. The same section promotes an alternative data structure, namely a self-implemented singly linked list, and gives the test results for these two implementations. Section 3.3.4 discusses different implementations for the link queues, i.e., the spatial queue and the link buffer. The alternatives are the STL-deque, the STL-list and a self-implemented data structure, Ring.
3.3.1 The Standard Template Library

The Standard Template Library (STL) is a C++ library composed of the following components:

Collections of standard container types. Containers are implemented as templates, a special feature of C++, and can contain objects of any type. Examples are map, deque (double-ended queue), vector, list, etc.

Algorithms defined on containers. Examples are: accessing an element, sorting the elements of a container, searching for an element, etc.

Iterators used for traversing the elements of a container.

The STL not only hides the implementation details of its components but also provides elegant data structures and algorithms for users. The encapsulation and abstraction properties of the STL enable programmers to focus on application-specific issues.

The STL provides two types of containers. Sequence containers store data in a linear sequence; the "sequence" depends on the time and position of insertion, and the position of an element in the container is independent of the element's value. vector, deque and list are of this type.

Associative containers, on the other hand, are sorted data structures. They associate the domain of one type (key) with the domain of another type (value). The position of an element in such a container depends on its key. Examples are map, multimap, set, etc.

Operations defined on containers, such as insert, delete, or retrieve, differ in performance. Container selection therefore depends on the characteristics of the application and on call frequencies. Some examples are given in the next subsections.

Each iterator represents a certain position in a container. Regardless of the container for which it is defined, an iterator comes with a set of basic operators: ++ (stepping forward to the next element), == (equal), != (not equal), = (assignment) and * (dereference). Since an iterator is an object, the user must create instances of the iterator class prior to using it. A short usage example follows.
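For instance, a minimal loop over a container using these basic operators looks as follows (the container and its contents are arbitrary):

    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> values;          // a sequence container
        values.push_back(3);
        values.push_back(7);

        // an iterator instance, positioned at the first element:
        std::vector<int>::iterator it = values.begin();
        for (; it != values.end(); ++it)  // !=, ++ : basic operators
            std::cout << *it << '\n';     // *  : dereference operator
        return 0;
    }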

3.3.2 Containers: Map vs Vector for Graph Data

Accessing the graph data, i.e. the street network, is one of the key issues in the simulation. This is because every single item (a link or a node) of the graph is visited several times in each time step of the simulation run. Moreover, the search algorithm for the elements of the container representing the graph data is crucial, since searching for a single element in the graph data happens more than once: (1) when plans are read, the start and end locations have to be searched for in the graph data, and (2) every time a vehicle on a link at the border needs to enter the next link, the next link is searched for in the graph data to find out to which computing node it belongs in the parallel implementation.

The overall approach of using STL containers for the graph data looks as in Figure 3.1 (using graph nodes as the example¹).

¹ For non-C-experts: typedef aa bb means that from now on, bb will be translated to aa before the compiler does anything else. The statement is particularly useful to convert fairly technical expressions such as map<Id,Node*> into something readable such as Nodes (indicating a container that contains nodes).


    // make "Nodes" a container type:
    typedef CONTAINER<Node*> Nodes;
    // make "NodesIterator" an iterator over Nodes:
    typedef Nodes::iterator NodesIterator;
    // declare the container that will contain the nodes:
    Nodes nodes;

Figure 3.1: STL containers for the graph data

The iterator is useful in order to be able to go through all nodes and do something with them without having to worry about the efficiency of retrieval. Specific examples are given below. Several operations are then needed with respect to that container:

Adding new nodes during initialization.

Going through all nodes in each time step of the simulation (using the iterator).

Finding nodes by their "name" ("key").

Two implementations were tested: (1) using a map container and (2) using a vector container.

Map

map is an associative container that can be indexed by any type. Indices (keys) can be simple types such as integers or sophisticated objects. An STL-map represents a mapping from one type (the key type) to another type (the value type). Hence, it allows the management of key-value pairs.

An advantage of using the STL-map for nodes and links is that it is possible to straightforwardly retrieve them by their label number: a command such as nodes[1234] is possible and will retrieve the node with the label number 1234. ID numbers are typically non-sequential, so it is not possible to use standard array indexing instead.

Sample code using an STL-map for nodes (links are analogous) looks as in Figure 3.2. The code means that there is a class Node defined somewhere else, and the STL-map container is loaded with pointers to node instances.

The advantage of using the STL-map is that nodes can be addressed by their IDs with exactly the same syntax as one is used to from arrays. A slight disadvantage may be the make_pair syntax that one needs to get used to, and the retrieval via second in the iterative loop. The iterator loop syntax is awkward but standard for all containers.

Vector

An STL-vector is a sequence type of container, composed of contiguous blocks of objects. Element insertion into an STL-vector container can be done at any point of the sequence. If the insert(position,object) method is used, insertion at the beginning or in the middle becomes expensive: since the elements are arranged contiguously, all elements that follow the insertion point need to be shifted. An example of this case is illustrated in Figure 3.3.


    typedef map<Id, Node*> Nodes;
    typedef Nodes::iterator NodesIterator;
    Nodes nodes;
    [...]
    read a node information;
    nodes.insert(make_pair(nodeId, node));
    [...]
    // go through all nodes and do something with them:
    for (NodesIterator it = nodes.begin();
         it != nodes.end(); ++it) {
        Node* node = it->second;
        node->doSomethingWithIt();
    }
    [...]
    // find a node by Id:
    Node* node = nodes[theId];

Figure 3.2: The STL-map for the graph data
Figure 3.2: The STL-map for the graph data<br />

BEFORE CALLING insert(3, O19):

    index:    0    1    2    3    4    5    6
    element:  O13  O2   O48  O9   O26  O33  O14

AFTER CALLING insert(3, O19):

    index:    0    1    2    3    4    5    6    7
    element:  O13  O2   O48  O19  O9   O26  O33  O14

Figure 3.3: Insertion in the middle of an STL-vector. insert(position,object) is a method defined on the STL-vector. The elements behind the insertion position are shifted when the insert command is used.

In general, the performance of insertion into any container depends on the type of the container and on where the insertion takes place. In particular, insertion at the end of an STL-vector is very fast.

Some code elements using an STL-vector for nodes (once more, links are analogous) are shown in Figure 3.4. The insert of the map is replaced by a push_back, which means that the new node pointer is simply added at the end of the STL-vector. The iterator is essentially the same as before, except that one does not need the second, because the element that is retrieved is no longer a (pointer to a) pair, but a (pointer to a) node.

An issue with an STL-vector data structure is now searching for a key, for example to find the pointer to the node that is denoted by an ID number. A naive solution would be a linear search; rough code is shown in Figure 3.5.



    typedef vector<Node*> Nodes;
    typedef Nodes::iterator NodesIterator;
    Nodes nodes;
    [...]
    read a node;
    nodes.push_back(node);
    [...]
    // sort the nodes (see Figure 3.6):
    sort();
    [...]
    // go through all nodes and do something with them:
    for (NodesIterator it = nodes.begin();
         it != nodes.end(); ++it) {
        Node* node = *it;
        node->doSomethingWithIt();
    }
    [...]
    // find a node by Id:
    Node* node = findNodeById(theId);

Figure 3.4: The STL-vector for the graph data

    Node* findNodeById(Id theId) {
        for (NodesIterator it = nodes.begin();
             it != nodes.end(); ++it) {
            Node* node = *it;
            if (node->getId() == theId)
                return node;
        }
        // node with given Id not found:
        error();
    }

Figure 3.5: Linear search for the graph data

However, a better approach is to pre-sort the STL-vector according to the node IDs and then to use a binary search instead, which reduces the average case and worst case access times from $O(N)$ to $O(\log N)$. Fortunately, both sorting and binary search are already provided by the STL, so they are easy to use. The only issue is to provide the sorting criterion to the algorithm. The code for sorting the elements of the graph data stored in an STL-vector, together with the sorting criterion, is given in Figure 3.6.

Often, links and nodes are already provided in the correct order by the files. In that case, initialization time can be further reduced by checking that they are indeed provided in the



    // calling the sorting algorithm
    void sort() {
        sort(nodes.begin(), nodes.end(), comparisonClass());
    }

    // defining the comparison class
    class comparisonClass {
    private:
        // comparison function defines ascending order
        bool keyLess(const int& k1, const int& k2) const {
            return (k1 < k2);
        }
    public:
        // comparison based on IDs
        // comparing two objects lhs and rhs
        template<class T>
        bool operator()(const T* lhs, const T* rhs) const {
            return keyLess(lhs->id(), rhs->id());
        }
        // comparing an object lhs with a value k
        template<class T>
        bool operator()(const T* lhs, const int& k) const {
            return keyLess(lhs->id(), k);
        }
        // comparing a value k with an object rhs
        template<class T>
        bool operator()(const int& k, const T* rhs) const {
            return keyLess(k, rhs->id());
        }
    };

Figure 3.6: Sorting the graph data stored in an STL-vector

correct sequence, and thus sorting can be skipped.
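The binary search lookup itself is not reproduced in the figures; a minimal sketch of how it could be done with the STL on the pre-sorted vector, reusing the comparison class of Figure 3.6, is given here (the error handling mirrors Figure 3.5):

    #include <algorithm>

    // Binary search on the pre-sorted nodes vector: O(log N) per lookup.
    Node* findNodeById(Id theId) {
        NodesIterator it = std::lower_bound(nodes.begin(), nodes.end(),
                                            theId, comparisonClass());
        if (it != nodes.end() && (*it)->id() == theId)
            return *it;
        // node with given Id not found:
        error();
        return 0;
    }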

Results

Figure 3.7 shows the simulation runtime results for using the STL-vector and the STL-map structures to represent the graph data. Figure 3.7(a) plots the data points for the RTR. RTR means Real Time Ratio and expresses how much faster the simulation runs than real time. Figure 3.7(b) contains the speed-up, which shows how much the execution speeds up when the number of traffic flow simulators running in parallel in the system is increased. The concepts of RTR and speed-up are covered in detail in Chapter 4. The data points are labeled "Single" and "Double", which means that the results were gathered by running either one simulation or two simulations per computer, respectively.

The performance gain from using the STL-vector for the graph data instead of the STL-map is seen in Figure 3.7. In these tests, the STL-multimap (Section 3.3.3) is used as the data structure for the parking and waiting queues, and the self-implemented Ring structure (Section 3.3.4) represents the spatial queues and buffers of the links.


[Two panels: (a) Real Time Ratio vs. Number of CPUs, (b) Speedup vs. Number of CPUs; curves for Single/Double combined with Vector/Map.]

Figure 3.7: RTR and Speedup for using different data structures for the graph data. "Single" means only one traffic flow simulation is run per computing node. "Double" refers to running two traffic flow simulations per computing node. In this test, an STL-multimap is used for the parking and waiting queues, and the Ring class is used for the spatial queues and link buffers.

Using the STL-vector relatively accelerates the traffic flow simulation by 15% for large numbers of CPUs; for small numbers of CPUs, the relative performance increase observed is up to 18% compared to the results of the STL-map. The STL-map performance mainly suffers from searching for and accessing an item in the container. Hence, for better performance, the STL-map can be replaced by the STL-vector and the search algorithm can be changed to binary search.

Map vs Vector for Graph Data: Recommendations

Although using the STL-vector to represent the graph data elements, namely nodes and links, along with the STL's binary search algorithm is 15-18% faster than using the STL-map and the STL-map's find method, it comes with a higher programming overhead. For cases that require faster computation, the STL-vector is recommended. If one prefers to avoid the programming overhead, the STL-map should be chosen.

3.3.3 Containers: Multimap vs Linked List for Parking and Waiting Queues

Parking and waiting queues are zones where a vehicle waits until it is ready to enter the simulation. In other words, a vehicle waits in these containers until its start time has arrived.

A person's plan can have more than one leg. Each leg is defined as a route between two locations. If a person has a plan which includes a trip from home to work and then from work to leisure, the person's plan has two legs.

When a person's plan is read by the simulation, a vehicle is created for each leg. If it is the first leg, the vehicle is added to the parking queue of the link at which the vehicle starts. In each time step, the parking queue of each link is checked for vehicles that are ready to start at the current time step. Those vehicles are moved to the waiting queue of the link. If the spatial queue is not full, the vehicle is moved from the waiting queue into the spatial queue so that it can start its trip.


    typedef multimap<Time, Veh*> WaitQueue;
    WaitQueue waitQueue;
    typedef multimap<Time, Veh*> ParkQueue;
    ParkQueue parkQueue;

Figure 3.8: Declarations for the waiting and parking queues with the STL-multimap
Figure 3.8: Declarations for waiting and parking queues with the STL-multimap<br />

    typedef Linked<Time, Veh*> WaitQueue;
    WaitQueue waitQueue;
    typedef Linked<Time, Veh*> ParkQueue;
    ParkQueue parkQueue;

Figure 3.9: Declarations for the waiting and parking queues with linked lists
Figure 3.9: Declarations for waiting and parking queues with linked lists<br />

Hence, in each time step, the waiting and parking queues are checked for eligible vehicles. In realistic scenarios, most of the vehicles wait in these queues because their drivers are actually performing an activity. Checking for all eligible vehicles, accessing their information and moving them to other queues when necessary comes at a computational cost. Therefore, an appropriate data structure should be used for these queues. That data structure needs to make the vehicle with the next scheduled departure available at low performance cost. For this, a partial ordering would in fact be sufficient. However, there is no data structure in the STL which supplies a fully efficient partial ordering. Therefore, two fully ordered data structures were tested: the STL-multimap, and a self-implemented singly linked list.

Note that this section only discusses the waiting and parking queues. Data structures for the link queues will be explained in the next subsection.

Multimap

An easy-to-use, fully sorted data structure is the STL-multimap. One simply inserts key-item pairs, with the keys equal to the start times of the vehicles and the items being pointers to the vehicles, and the resulting data structure is automatically sorted. The difference between the STL-map, as mentioned above, and the STL-multimap is that the latter accepts multiple elements for the same key. This is necessary since several vehicles may want to depart in the same time step. The declarations using the STL-multimap are given in Figure 3.8; a short usage sketch follows.
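As a usage illustration, vehicles whose start time has arrived could be drained from such a queue as follows; Time, Veh and the queue type follow Figure 3.8, while the function itself is an assumption:

    #include <map>

    typedef int Time;
    struct Veh {};
    typedef std::multimap<Time, Veh*> ParkQueue;

    // Move every vehicle whose start time has arrived out of the
    // parking queue; the multimap keeps the entries sorted by time,
    // so only the front of the container has to be inspected.
    void serveDepartures(ParkQueue& parkQueue, Time now) {
        while (!parkQueue.empty() && parkQueue.begin()->first <= now) {
            Veh* veh = parkQueue.begin()->second;
            parkQueue.erase(parkQueue.begin());
            // ... move veh to the waiting queue of its link ...
            (void)veh;
        }
    }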

User-defined singly linked list

Three operations are defined on these queues: insertion of an element into the queue, retrieving the first element of the queue, and deleting the first element of the queue. Unfortunately, these operations are rather costly with the STL-multimap. Based on the experience with the C version of the simulation, where a linked list had been used for these queues, a linked list was also implemented in the C++ version to handle the waiting and parking queues.

The Linked class in Figure 3.9 represents a singly linked list where each item in the list has a pointer to the next item. Insertion at the end of the list and insertion into the sorted list according to a key value are available. The latter is important so that vehicles can be sorted according to their start times. A sketch of the sorted insertion is given below.
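The Linked class itself is not listed in the thesis; the following is a minimal sketch of how the key-ordered insertion of such a singly linked list could be implemented, with all identifiers being assumptions:

    // Minimal singly linked list sorted by a key (e.g. a start time).
    template<class Key, class Item>
    class Linked {
        struct Node {
            Key key; Item item; Node* next;
            Node(const Key& k, const Item& i) : key(k), item(i), next(0) {}
        };
        Node* head;
    public:
        Linked() : head(0) {}

        // Insert so that the list stays sorted by ascending key: walk
        // until the next node's key is larger, then splice the node in.
        void insertSorted(const Key& key, const Item& item) {
            Node** pos = &head;
            while (*pos != 0 && (*pos)->key <= key)
                pos = &(*pos)->next;
            Node* node = new Node(key, item);
            node->next = *pos;
            *pos = node;
        }

        bool empty() const { return head == 0; }
        Item& front() { return head->item; }   // retrieve first element
        void popFront() {                      // delete first element
            Node* old = head; head = head->next; delete old;
        }
    };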



[Two panels: (a) Real Time Ratio vs. Number of CPUs, (b) Speedup vs. Number of CPUs; curves for Single/Double combined with Linked List/Map.]

Figure 3.10: RTR and Speedup for using different data structures for the waiting and parking queues. An STL-vector is used for the graph data, and the Ring class is used for the spatial queues and the link buffers.

Results

Figure 3.10 shows the RTR and speed-up results of the simulation runtime for the STL-multimap and the linked list implementations of the waiting and parking queues. In these tests, the vector container of the STL is employed for the graph data (Section 3.3.2), and the spatial queues and link buffers are implemented using the self-implemented Ring class (Section 3.3.4).

The linked list implements the queues with better performance. Quantitatively, the linked list version increases the performance relatively by 9% as the number of CPUs increases. With a small number of CPUs, the relative gain is about 18%.

Multimap vs Linked List for Parking and Waiting Queues: Recommendations

Using a singly linked list class with proper methods, similar to the STL's containers, is recommended over the STL-multimap: not only are the operations on it faster, but the implementation details can also be hidden from users, as is done in the STL's containers.

3.3.4 Containers: Ring, Deque and List Implementations of Link Queues

Each link of the traffic flow simulator has two more queues: one for the spatial queue and one for the buffer used in the through-node movement explained in Chapter 2. In contrast to Section 3.3.3, the link and buffer queues are true FIFO (First In First Out) data structures, and they have finite size.

The way links operate is important in terms of performance, since the links are accessed several times in each time step of the simulation:

The waiting queue of each link is checked to see if there are vehicles to move from the waiting queue to the spatial queue.

Each spatial queue is checked to see if there are vehicles to move into the buffer of the link according to the capacity constraint.

The buffer of each link is checked to find out if there is any vehicle to move to the next link.



In this section, two STL containers are tested, namely list and deque. Because of the low performance they deliver, a user-defined data structure, Ring, is implemented and tested in addition.

List

The STL-list is a doubly linked list. With the STL-list, insertion anywhere is fast, but it provides only sequential access. Since random access is not needed, this should in theory be a fast data structure for the purpose at hand. Unfortunately, the STL-list comes with high overhead, as explained below.

Deque

The STL-deque (double-ended queue) is similar in usage and syntax to the STL-vector. It allows random access and inserts elements fast at either end. Therefore, the STL-deque is the data structure of choice when most insertions and deletions take place at the beginning or at the end of a container. The STL-deque differs from the STL-vector in terms of memory management. When resizing is necessary, the STL-deque allocates memory in chunks as it grows, with room for a fixed number of elements in each reallocation; in other words, the STL-deque uses a series of smaller blocks of memory. The STL-vector, on the other hand, allocates its memory in one contiguous block, i.e., the STL-vector is represented by one long block of memory.

Self-implemented Ring data structure

To overcome the inefficiencies of the STL-list and the STL-deque, a new data structure, Ring, is implemented. Ring is a circular vector: removal takes place at the beginning and insertion at the end. Removing from the beginning of an STL-vector is a costly operation, since it includes moving all the remaining elements forward. To get rid of this difficulty, supplementary pointers are used to keep track of the head and the tail of the data structure. Hence, with this new structure, only the head/tail pointers move back and forth, not the elements as in the STL-vector container. The same pointers are also used for insertion.

Figure 3.11(a) shows how insertion takes place at the end of the Ring structure. The supplementary pointers head and tail are used to keep track of the elements. The maximum size of the structure is 8 in the example. When the size is 0, the head and the tail point to NIL. Then an object, O1, is requested to be inserted. A pointer to the object is placed in the first cell of the structure; both the head and the tail point to this cell (and thereby to O1) after the insertion. Then another object, O2, is to be inserted. Since the call is push_back, it is placed at the end, i.e., in the next cell after the last item (O1). The tail pointer is advanced. Now the head points to O1 and the tail points to O2. The current size becomes 2.

Figure 3.11(b) illustrates deletion from the beginning, using pop_front. The structure is full, i.e., the size is 8. The head points to the first element (O1) and the tail points to the last element (O8). After the deletion, the tail remains the same, but the head is advanced from O1 to the next object (O2). If a further deletion is requested, the head is moved one more cell (to O3). After two deletions, the size becomes 6.

It is important to note that this works because of the fixed maximum size, which corresponds to the maximum number of vehicles on the link. A minimal sketch of such a Ring is given after Figure 3.11.



[Diagram: (a) push_back(O1) and push_back(O2) on an empty Ring of maximum size 8, with the head and tail pointers starting at NIL; (b) two pop_front() calls on a full Ring holding O1-O8, advancing the head pointer while the tail stays at O8.]

Figure 3.11: Operations on the Ring structure. (a) Insertion at the end. (b) Deletion from the beginning.
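The Ring code itself is likewise not listed; the following minimal sketch shows how such a fixed-capacity circular vector could be implemented (here with a head index and an element count instead of two pointers; all identifiers are assumptions):

    #include <vector>

    // Fixed-capacity circular vector: insertion at the end (push_back),
    // removal at the beginning (pop_front); only the head/tail positions
    // move, the elements themselves are never shifted.
    template<class T>
    class Ring {
        std::vector<T> cells;
        std::size_t head, count;
    public:
        explicit Ring(std::size_t maxSize)   // e.g. max. vehicles on the link
            : cells(maxSize), head(0), count(0) {}

        bool empty() const { return count == 0; }
        bool full()  const { return count == cells.size(); }

        void push_back(const T& x) {         // insert at the tail
            cells[(head + count) % cells.size()] = x;
            ++count;
        }
        T& front() { return cells[head]; }   // oldest element
        void pop_front() {                   // remove at the head
            head = (head + 1) % cells.size();
            --count;
        }
    };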

[Two panels: (a) Real Time Ratio vs. Number of CPUs, (b) Speedup vs. Number of CPUs; curves for Single/Double combined with Ring/Deque/List.]

Figure 3.12: RTR and Speedup for using different data structures for the spatial queues and the buffers. An STL-multimap is used for the parking and waiting queues, and an STL-vector is used for the graph data.

Results

Figure 3.12 shows the comparison results for using the STL-list, the STL-deque and the Ring class as the main data structure for the queues of the links. In these tests, the vector container of the STL represents the graph data (Section 3.3.2), and the parking and waiting queues are handled via the STL-multimap (Section 3.3.3).

The STL-list gives the worst performance. This is because of the memory management of the STL-list design: a well-known example is that to store an integer value (4 bytes), the STL-list needs 12 bytes per element in addition to the list itself, whereas an STL-vector, for example, needs only 4 bytes to store an integer. The Ring class speeds up the performance the most, and the STL-deque performance lies between the STL-list and the Ring class.

Changing the data structure from the STL-list to the STL-deque speeds up the execution relatively by 67% for a large number of traffic flow simulators in the system, while the difference is around 13% with a small number of CPUs.



The transition from the STL-list to the Ring class results in 71% and 29% relatively better performance with a small and a large number of CPUs, respectively.

Ring, Deque and List for Link Cells: Recommendations

Since users can implement their own containers, similar to the STL's containers, to overcome the inefficiencies that appear at the application level, a circular vector, called Ring here, is recommended when the maximum number of elements in a container is known. When there is no upper limit on the number of elements, the STL-deque is suggested if most of the insertions and deletions take place at either end. When random access is not required, the STL-list should be chosen.

3.4 Reading Input Files for Traffic Simulators

In applications with different cooperating modules, the format used to represent shared data becomes more significant. There is no single good solution for this problem, since the possible solutions give either good performance or flexibility, but usually not both at the same time. How the input data is kept in the simulation exhibits the same trade-off.

This section and the next section investigate the input files (the street network and plans) and the output file (events) of a traffic simulator, respectively. In this section, representing data in the XML [97] format and in the structured text file format are compared in terms of I/O performance, along with the programming issues associated with these formats.

The different programming approaches tested for plans (Section 3.4.3) are reading XML plans using expat, and reading raw plans from a structured text file using the C++ input operator >> and the C function fscanf. Reading the street network information from an XML file using the expat parser is compared to reading the same information from a structured text file by using the C function sscanf in Section 3.4.4.

3.4.1 The Extensible Markup Language, XML

XML [97] is the abbreviation for Extensible Markup Language. It is a markup language which shares the virtues of HTML (Hypertext Markup Language). HTML [11] is widely used especially for putting data on the World Wide Web such that anyone can access the data regardless of location or time. HTML is known for its simplicity and portability. However, HTML focuses on the appearance of documents, not their contents, and is therefore limited in its features. This limitation has caused XML, which is oriented towards content, to become very popular as a markup language.

XML is simple, portable, easily maintainable and adaptable. One can design his/her own customized markup languages by using XML, since data is stored in a self-explanatory manner.

For example, the following shows a valid XML tag with 6 valid attribute-value pairs. Each attribute-value pair is in attribute="value" format.

<link id="2" from="2" to="1" length="657" capacity="12000" freespeed="11.1" />

Some of the benefits of using XML files are:

- XML allows users to create their own sets of tags.
- The sequence of attributes is not important: when reading data in, the search is done for the attribute names to obtain the corresponding values. Therefore, new attributes can be added in any sequence, and rearranging does not cause changes in the reading code.
- Complex input like trees and hierarchies can be implemented.
- XML allows users without prior knowledge to understand the language, as it is self-describing.
- XML promotes flexible context-dependent data, e.g. the description of a bus trip within a leg can be completely different from the description of a car trip within a leg.

3.4.2 Structured Text

The structured text file format is application-dependent. Therefore, it is user-defined in many cases. The structured format used here is the column-based format. The example below is the column-based text line corresponding to the XML example given above. Without looking at the XML tag above, it is impossible to understand what these numbers mean, because the numbers in a column-based text file are unlikely to be self-explanatory. One might put a title line at the top of the file to explain what each column corresponds to. This can help when the number of columns is small, as in the example below. If each line is composed of, for example, 30 columns, then it becomes difficult to follow the columns of the lines.

2 2 1 657 12000 11.1

When reading a structured text file, which is composed of a set of numbers, the numbers need to be read in the same sequence as they are written in the file. Rearranging or inserting new columns between the existing columns requires changes in the file-reading code to keep it consistent with the correct sequence. Despite this drawback, structured text files are usually better than XML files performance-wise.

3.4.3 XML vs. Structured Text Files: Plans Reading

The plans (Section 2.2.4) contain all the information about agents, including their routes. A scenario with approximately 1 million agents is kept in a structured text file of 34 MBytes and in an XML file of 330 MBytes. The XML file is about 10 times bigger because of the self-explanatory attributes of XML.

When reading an XML file, the attributes are parsed. An XML parser called expat [21], which is written in C, is employed. What a parser does is to provide users with the opening element tags, the closing element tags, and the text data. Afterwards, the users should implement code to handle the values passed in.
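To illustrate how expat hands these pieces to user code, the following is a hedged sketch of an element handler; the file name and the printed output are invented for the example, and this is not the actual MATSIM wrapper code.

#include <expat.h>
#include <cstdio>
#include <cstring>

// Called by expat for every opening tag; atts is a NULL-terminated
// list of alternating attribute names and values.
static void startElement(void* userData, const char* name, const char** atts) {
    if (std::strcmp(name, "link") == 0) {
        for (int i = 0; atts[i]; i += 2) {
            // e.g. atts[i] == "id", atts[i+1] == "2"
            std::printf("attribute %s = %s\n", atts[i], atts[i + 1]);
        }
    }
}

// Called by expat for every closing tag.
static void endElement(void* userData, const char* name) {}

int main() {
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetElementHandler(parser, startElement, endElement);
    std::FILE* f = std::fopen("network.xml", "r");   // hypothetical file name
    char buf[4096];
    std::size_t len;
    while ((len = std::fread(buf, 1, sizeof buf, f)) > 0)
        XML_Parse(parser, buf, (int)len, len < sizeof buf);  // last chunk => final
    XML_ParserFree(parser);
    std::fclose(f);
}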

In the structured text plans file case, a fixed number of integers is read and, according to some of these numbers, another chunk of integers is read to complete a single agent's information. A rough example of this type, reflecting the example in the XML format shown in Figure 2.9, is illustrated below. All the numbers regarding each plan can be written in a single line; the example separates them into several lines for readability purposes.

6357250 0 24875
14584 14606 1800
0 8
4902 4903 4904 4905 4906 4907 4908 4909


XML File, expat    Structured Text File, operator >>    Structured Text File, fscanf
159s               31s                                   36s

Table 3.1: Performance results for reading different types of plans file and approaches.

In the example, 6357250 is the vehicle ID, the 0 in the first line shows the leg number, and 24875 is the start time of the plan in seconds (06:54:35). The start accessory ID and the end accessory ID are 14584 and 14606, respectively; an accessory can be an activity location, a parking lot or a transit stop. The duration of the leg is 1800 seconds (30 minutes). The 0 in the third line shows the mode of transport (car). 8 is the number of intermediate nodes between the start and the end activity locations; these nodes are listed in the last line.

The performance depends on the user implementation. One version of the implementation keeps each vehicle's data as integers in an STL-vector while reading. When the vehicle is created, the program accesses the values from the STL-vector. The code elements are given in Figure 3.13.

Yet another version uses the C library function fscanf to read chunks of data. The data is directly stored into integer arrays, similar to the STL-vector used above. Once the vehicle is created, its variables are set using these integer arrays. A rough example code is given in Figure 3.14.

Results

Table 3.1 shows the results for the different reading approaches and for the different types of the plans file. The scenario used is the one explained in Section 2.5, i.e., around 1 million agents are read. The numbers show the time for reading and for constructing the agents. Once the agents are created, they are inserted into one of the supplementary structures of the links, such as waiting queues and parking queues. These queues are of the STL-multimap type.

Reading the same data from the structured text plans file is about 80% faster than the XML file version. Despite its lower performance values, XML is a promising technology because of its benefits as given in Section 3.4.1.

An important remark to be made is that the lower performance values of XML come mostly from implementation inefficiencies, not from the format itself: While expat parses the input plans file, an object-oriented wrapper around expat inserts each person's data (plans and the other attributes) into an STL-deque. If the traffic flow simulator needs to read the next person, the wrapper calls pop_front() to get and to remove the first element from the STL-deque. A problem resides here: The STL-deque is used in a way that it keeps the objects themselves, as opposed to keeping pointers to the objects. When a pop_front() call is made on such a container, before deleting the element, the wrapper copies the element into a temporary variable. Then, the element is deleted from the STL-deque and the temporary variable is returned (copied) to the traffic flow simulator. If one used pointers to objects instead of the objects themselves, not only would memory allocation be done once and in an efficient way, but also only the pointers would be copied between the different components instead of the objects themselves, which would result in less overhead.
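A minimal sketch of the difference follows; the Person type and its fields are placeholders, not the actual MATSIM classes.

#include <deque>
#include <string>

struct Person {
    int id;
    std::string plan;   // placeholder for plans and other attributes
};

int main() {
    // Variant used in the wrapper: the deque holds objects, so every
    // pop_front() implies copying a whole Person out of the container.
    std::deque<Person> byValue;
    byValue.push_back(Person());
    Person copy = byValue.front();   // full object copy
    byValue.pop_front();

    // Cheaper variant: the deque holds pointers; only an address is
    // copied, and each Person is allocated exactly once.
    std::deque<Person*> byPointer;
    byPointer.push_back(new Person());
    Person* p = byPointer.front();   // pointer copy only
    byPointer.pop_front();
    delete p;                        // the owner must free the object
}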

3.4.4 XML vs Structured Text Files: Graph Data Reading

The graph data can also be kept in two different types of files. Nodes and links are defined by either XML attributes or column-based numbers. The XML graph data file reading is the same as the XML plans reading: a parser parses the file and the user code saves the values. The size of the XML graph data file for the network defined in Section 2.5 is around 4 MBytes.

#include <fstream>
#include <vector>
using namespace std;

// size of the fixed-length part; its last entry stores the length
// of the variable-length part (values match the example above)
const int FIXED_LENGTH = 8;

class Plan {
public:
    // vector for the elements of the fixed length part
    vector<int> fixedPart;
    // vector for the elements of the variable length part (the route)
    vector<int> variablePart;

    // ... set and get methods to access both vectors ...

    void readNextPlan(ifstream& plansfile) {
        // read the fixed length part:
        fixedPart.resize(FIXED_LENGTH);
        for (int i = 0; i < FIXED_LENGTH; ++i)
            plansfile >> fixedPart[i];     // read an integer and put it into the vector
        // read the variable length part (routes); the number of items
        // in the variable length part is stored in the fixed part:
        variablePart.resize(fixedPart[FIXED_LENGTH - 1]);
        for (size_t j = 0; j < variablePart.size(); ++j)
            plansfile >> variablePart[j];  // read an integer and put it into the vector
    }
};

int main() {
    // define the plans input file
    ifstream plansfile("plans.txt");
    // create a plan object
    Plan myPlan;
    while (plansfile.good()) {
        myPlan.readNextPlan(plansfile);
        // create a new vehicle and use the get methods of myPlan
        // to set the data of the vehicle
    }
}

Figure 3.13: Reading plans from a structured text file, by using an STL-vector

If the same graph data is kept in a column-based text file, the file size is 2 MBytes. It is read line by line and each column is extracted from the line. This version uses the C library function sscanf to pick the values after reading each line.

#include <cstdio>

// maximum number of integers per part, and size of the fixed length part;
// its last entry stores the length of the variable length part
#define MAXSIZE 1024
#define FIXED_LENGTH 8

int main() {
    FILE* file = fopen("plans.txt", "r");
    // create an integer array for the elements of the fixed length part
    int fixedPart[MAXSIZE];
    // create an integer array for the elements of the variable length part
    int variablePart[MAXSIZE];
    while (!feof(file)) {
        // read the fixed length part:
        for (int i = 0; i < FIXED_LENGTH; ++i)
            fscanf(file, "%d", &fixedPart[i]);    // read an integer item into the array
        // read the variable length part:
        for (int j = 0; j < fixedPart[FIXED_LENGTH - 1]; ++j)
            fscanf(file, "%d", &variablePart[j]); // read an integer item into the array
        // create a new vehicle and set the vehicle variables
        // using the values stored in the arrays
    }
    fclose(file);
}

Figure 3.14: Reading plans from a structured text file, by using fscanf

XML File, expat    Structured Text File, sscanf
1.14s              0.66s

Table 3.2: Performance results for reading the graph data

Table 3.2 shows that reading the graph data of the scenario described in Section 2.5 from a column-based text file is 1.7 times (relatively 42%) faster than reading it from an XML file. The elements of the graph data are stored in an STL-vector.
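A minimal sketch of this line-by-line approach, assuming the column layout of the link example in Section 3.4.2 (the file name and field types are assumptions):

#include <cstdio>

int main() {
    FILE* file = fopen("network.txt", "r");   // hypothetical file name
    char line[256];
    int id, from, to;
    double length, capacity, freespeed;
    // read the file line by line and extract each column with sscanf
    while (fgets(line, sizeof line, file)) {
        sscanf(line, "%d %d %d %lf %lf %lf",
               &id, &from, &to, &length, &capacity, &freespeed);
        // ... append the link to the STL-vector holding the graph data ...
    }
    fclose(file);
}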

XML vs Structured Text Files: Recommendations

The choice of a file format between structured text files and XML files is a trade-off between flexibility, extensibility and elegance on the one hand, and good performance on the other. If the computational issues can be ignored for an application, using XML files is recommended.

3.5 Writing Events Files

Events generated by traffic flow simulators are fed back to different modules in the framework, as explained in detail in Section 5.2.1. Among the different modules are the router, the agent database, the activity generator, etc. When modules in a system are coupled via files, writing those files (plans and events) might also be interesting to investigate.


Explanation        Writing Time (Raw)
Local Disk, C++    61s
via NFS, C++       81s
Local Disk, C      57s
via NFS, C         66s

Table 3.3: Performance results for writing the events file.

In the framework, plans are written by the agent database, based on the routes generated by the router, before the simulation starts. The performance issues for plans writing are explained in Section 5.2.3.

Events, on the other hand, are written by the traffic flow simulators during each simulation run. In this section, writing raw events using the C++ output operator << and the C function fprintf is tested on disks that are both local and remote to the machine on which the traffic simulator runs. The results² are shown in Table 3.3.

By default, during the tests reported throughout this thesis, the files and the runtime executables of MATSIM [50] are all on local disks of the computing nodes. Therefore, no I/O operations via the network are performed unless stated otherwise. In the table, the label Local Disk indicates no network contribution.

The Network File System (NFS) [72] allows machines to mount a disk partition of a remote machine as if it were on a local hard disk. NFS comes with a cost because it accesses the remote files over the network. The cost can be seen in the table: the contribution of NFS in these numbers shows a performance degradation by a factor of about 1.2-1.3.

The file writing is accomplished both with C and C++ I/O functions, namely, with the fprintf function and the << operator. The results show that there is little performance difference between the C and C++ I/O functions when the files are on local disks. When the files are written via NFS, the difference becomes more apparent.

Another remark is that using endl in C++ output streams makes writing into a file much slower than using \n. Besides adding a newline character as \n does, endl also flushes the output buffer. Therefore a write() system call is done for each line written, which is an expensive operation. The C++ writing results in the table are from using \n.
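A minimal sketch of the difference (the file name and loop are invented for the example):

#include <fstream>

int main() {
    std::ofstream events("events.txt");
    for (int i = 0; i < 1000000; ++i) {
        events << i << "\n";          // newline only: stays buffered, fast
        // events << i << std::endl;  // newline plus flush: one write()
        //                            // system call per line, much slower
    }
}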

Reading and Writing Big Files: Recommendations

If a big file is read as strings, using C functions such as strtod/strtol and atoi/atof to convert the strings to the appropriate data types should be preferred. The C++ operator >> can be used to read the data directly into the correct types without any conversion, but this method comes with lower performance. Similarly, the C++ output operator << is a bit slower than the C function fprintf. However, when the performance issues are not taken into consideration, the C++ operators >> and << should be chosen, since their usage is very straightforward.
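A sketch of the recommended string-conversion approach; the input line reuses the column-based link example, and everything else is invented for the illustration:

#include <cstdlib>
#include <cstdio>

int main() {
    const char* line = "2 2 1 657 12000 11.1";   // one column-based line
    char* end;
    // strtol/strtod convert in place and report where parsing stopped,
    // so the columns can be walked without extra string copies.
    long id   = std::strtol(line, &end, 10);
    long from = std::strtol(end, &end, 10);
    long to   = std::strtol(end, &end, 10);
    double length    = std::strtod(end, &end);
    double capacity  = std::strtod(end, &end);
    double freespeed = std::strtod(end, &end);
    std::printf("%ld %ld %ld %g %g %g\n",
                id, from, to, length, capacity, freespeed);
}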

2 The theoretical performance prediction for writing events is investigated in Section 6.8.4.



3.6 Conclusions and Discussion

Computationally fast programs can be achieved not only by introducing parallel programming techniques into an implementation, but also by accelerating the sequential parts of the implementation. When the entities of a data structure are accessed frequently, the type of the data structure has a prominent effect on performance. The sequential implementation of the traffic flow simulation is improved such that:

- The storage for graph data is modified so that the STL-map is replaced by an STL-vector and the searching algorithm is changed to binary search. The performance gets better by 15-18% compared to the STL-map version.
- The storage for parking and waiting queues, from which the vehicles at the beginning are often accessed and removed, is advanced from the STL-multimap to a user-implemented singly linked list structure, which results in a 9-18% performance increase.
- Representing link cells as an STL-list degrades performance by 13-67% compared to using the STL-deque. An even better speed-up (40-71% relative to the STL-list) is achieved when using a user-defined data structure called Ring, which is nothing but a circular STL-vector composed of pointers to vehicles.

Operations on files depend on the format of the data stored in them. XML-type files are flexible in terms of management and are elegant, but they usually give worse performance. The inherent simplicity of structured text files offers better performance, but a lack of flexibility limits their applicability.

The input stream operator >> of C++ promotes easy usage by letting users not worry about the types of the input to be read, but suffers from low performance.

Therefore, the following conclusions are drawn for the best design of MATSIM:

- The graph data, i.e. the links and nodes, being the most frequently accessed data, is best represented by the STL-vector. The binary search algorithm of the STL should be used for finding an element in the vector structure.
- The parking and waiting queues of the links are the data structures which hold the vehicles that have not started being simulated yet, either because of full links or because of travelers performing activities at the location. Removing all the eligible vehicles at the beginning of these queues prior to inserting them into other queues, and inserting new vehicles into these queues, are best performed on a data structure defined as a singly linked list, so that each vehicle points to another vehicle.
- The spatial queues and buffers of the links are best implemented by using a fixed-size vector whose elements are pointers to vehicles. The movement operations such as insertion and deletion should be based on pointers to vehicles.
- For the input and output data, XML files should be preferred to structured text files, since XML allows constructing user-defined complex structures in a more efficient way.

Run Time   Graph Data   Parking-Waiting Queues   Link Queues
615s       STL-vector   STL-multimap             STL-list
533s       STL-vector   STL-multimap             STL-deque
438s       STL-vector   STL-multimap             Ring
539s       STL-map      STL-multimap             Ring
356s       STL-vector   Linked List              Ring

Table 3.4: Summary table of the serial performance results for different data structures of the traffic flow simulator.

3.7 Summary

The concepts of fast computation and easy-to-use programming methods can be coupled. C has been around since the 1970s and has allowed programming at both higher and lower levels. However, as more complex applications come into prominence, new languages are needed since:

- these applications are complicated enough, content-wise, that programming techniques which ease the burden on programmers are preferred, and
- these applications usually exhibit a hierarchy of entities.

Object-oriented languages such as C++ [70, 80] and Java [42] were written to fill the deficiencies of C-type languages and are used to handle the complexity of applications.

The first implementation of the traffic flow simulator was written in C. Despite being computationally fast, it obstructed adding new features easily. Writing in C++ with the improvements explained in the previous sections provides a simulation which not only is computationally fast, but also has hierarchical entities that are re-usable and, as a matter of fact, easy to use as well as easy to modify. Naturally, some arguments presented here might be specific to the implementation achieved.

Table 3.4 summarizes the different implementations of the containers in the traffic flow simulator. The run times shown in the table are from the time measurements when the number of computing nodes (CPUs) is chosen as 1. The results exclude any file input and output operations, as explained in Section 3.2. The input file reading results are already shown in Table 3.1 and in Table 3.2 for the plans and the street network, respectively. The output file writing results for events are given in Table 3.3.


Chapter 4

Parallel Queue Model

4.1 Introduction

Serial computation has been around for years. In this traditional computing manner,

- problems run on a single computer/computing node,
- instructions of a program are executed one after the other by the CPU, and
- only one instruction may be executed at a time.

Data transmission through hardware, which is limited by the speed of light [83], determines the speed of a serial computer. In addition to physical constraints, there are also economic limitations, since it is increasingly expensive to make a single processor faster. These limitations saturate the performance of serial computers and make it ever harder to build faster ones.

Ultimately, this work is concerned with the agent-based simulation of large scale transportation scenarios. A typical scenario would be the 24-hour (about 10^5 seconds) simulation of a metropolitan area consisting of 10 million travelers. Typical computational speeds of such traffic flow simulations with 1-second update steps are 100 000 vehicles in real time [58, 56, 68]. This results in a computation time on the order of 10^5 · 100 = 10^7 seconds, i.e., more than 100 days. This number is just a rough estimate and subject to the following changes: increases in CPU speed will reduce the number; more realistic driving logic will increase the number; smaller time steps [64, 84] will increase the number.

This means that such a traffic flow simulation running on a single computing node is too slow for the practical or academic treatment of large scale problems. In addition, computer time is needed for activity generation, route generation, learning, etc. In consequence, it makes sense to explore parallel/distributed computing as an option. Parallel/distributed computing has the advantages of using non-local resources, a competitive cost/performance ratio, and overcoming the finite memory constraint that single computers are subject to. In parallel computing, computational problems are solved by using several computing resources, which may consist of a single computer with multiple processors, a number of computers connected through a network, which is called a PC cluster, or a combination of both. In order to solve a computational problem through parallel computing, one must think about (i) how to partition the tasks into subtasks, and (ii) how to provide the data exchange between the subtasks. Before explaining these issues, parallel architectures will be discussed in the following paragraphs.

The categorization of parallel computers has been done in many different ways, among which Flynn's Classical Taxonomy [83] is the one most commonly used. This classification depends on the dimensions (single or multiple) of instructions and data. Each combination gives a different category:

- Single Instruction Single Data (SISD): The same instruction stream executes on one data stream, which results in deterministic execution. Most PCs, single-CPU workstations and mainframes have this feature.
- Single Instruction Multiple Data (SIMD): The same instruction stream executes on different data on different computing nodes. Examples are the CM-2, IBM 9000 and Cray C90.
- Multiple Instruction Single Data (MISD): Different instruction streams run on the same data. It is the least commonly used category.
- Multiple Instruction Multiple Data (MIMD): The most popular type of parallel computers. Each processor runs a different set of instructions on different data. Execution can be synchronous or asynchronous, deterministic or non-deterministic. Most supercomputers and PC clusters are of this type.

4.1.1 Message Exchange

This work concentrates on clusters of coupled PCs, i.e., Linux [39] boxes connected through a 100 Mbit Ethernet [77] Local Area Network (LAN). Using this type of cluster, which is cost-effective, one can achieve a performance close to that of a vector computer [57]. This is, in part, due to the fact that multi-agent simulations do not vectorize well, so that vector computers offer no particular advantage. Hence, PC clusters are expected to be the dominant high performance computing technology in the area of multi-agent traffic flow simulations for many years to come.

With respect to the data exchange between subtasks, there are, in general, two main approaches to inter-processor communication. One of them is called message passing between processors; the alternative is called shared-address space, where variables are kept in a common pool globally available to all processors. Each paradigm has its own advantages and disadvantages.

In the shared-address space approach, all variables are globally accessible by all processors. Despite the multiple processors operating independently, they share the same memory resources. The shared-address space approach makes it simpler for the user to achieve parallelism, but since the memory bandwidth is limited, severe bottlenecks are inevitable with an increasing number of processors; alternatively, such shared memory parallel computers become very expensive. For those reasons, message passing is the focus here.

In the message passing approach, there are independent cooperating processors. Each processor has a private local memory in order to keep its variables and data, and thus can access local data very rapidly. If an exchange of information is needed between the processors, they communicate and synchronize by passing messages, which are simple send and receive instructions. Message passing can be imagined to be similar to sending a letter. The following phases happen during a message passing operation:

1. The message needs to be packed, i.e. the computer is told which data needs to be sent.
2. The message is sent.
3. The message may then take some time on the network until it finally arrives in the receiver's inbox.
4. The receiver has to officially receive the message, i.e. to take it out of the inbox.
5. The receiver must unpack the message and tell the computer where to store the received data.
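In MPI, which is introduced in Section 4.3.3 below, these phases map onto a pair of calls. The following minimal sketch sends an array of integers from rank 0 to rank 1; the buffer size and message tag are arbitrary choices for the example:

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int buf[4] = {0, 0, 0, 0};
    if (rank == 0) {
        // "packing": the buffer, element count and type describe the data
        buf[0] = 42;
        MPI_Send(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);        // send
    } else if (rank == 1) {
        MPI_Status status;
        // receive: take the message out of the "inbox" and state
        // where the received data is to be stored
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
}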

There are time delays associated with each of these phases. It is important to note that some of these time delays are incurred even for an empty message (“latency”), whereas others depend on the size of the message (“bandwidth restriction”). The effects of these time delays are explained in Section 4.4.

4.1.2 Domain Decomposition

On PC clusters, two general strategies are possible for parallelization:

- Task parallelization – The different modules of a transportation simulation package (traffic flow simulation, routing, activities generation, learning, pre-/postprocessing) are run on different computers. This approach is for example used by DynaMIT [17] or DYNASMART [18]. The advantage of this approach is that it is conceptually straightforward, and fairly insensitive to network bottlenecks. The disadvantage is that the slowest module will dominate the computing speed – for example, if the traffic flow simulation uses up most of the computing time, then task parallelization of the modules will not help.

- Domain decomposition – In this approach, each module is distributed across several CPUs. For most of the modules, this is straightforward, since in current practical implementations activity generation, route generation, and learning are done for each traveler separately. Only the traffic flow simulation has tight interactions between the travelers, as explained in the following.

For PC clusters, the most costly communication operation is the initiation of a message (“latency”). In consequence, the number of CPUs that need to communicate with each other should be minimized. This is achieved through a domain decomposition (see Figure 4.2) of the traffic network graph. As long as the domains remain compact, each CPU will, on average, have at most six neighbors (Euler's theorem for planar graphs). Since network graphs are irregular structures, a method to deal with this irregularity is needed. METIS [91] is a software package that specifically deals with decomposing graphs for parallel computation; it is explained in more detail in Section 4.3.1.

The quality of the graph decomposition has consequences for parallel efficiency (load balancing): If one CPU has a lot more work to do than all other CPUs, then all other CPUs are obliged to wait for it, which is inefficient. For the current work, with of the order of 100 CPUs and networks with of the order of 10^4 links, the “latency problem” (explained in Section 4.4) always dominates the load balancing issues; however, it is generally useful to employ the actual computational load per network entity for the graph decomposition [57].

For shared memory machines, other forms of parallelization are possible, based on individual network links or individual travelers. A dispatcher could distribute links for computation in a round-robin fashion to the CPUs of the shared memory machine [31]; technically, threads [72] would be used for this. This is called fine-grained parallelism, as opposed to the coarse-grained parallelism, which is more appropriate for message passing architectures. As stated above, the main drawback of this method is that one needs an expensive machine if one wants to use large numbers of CPUs.


4.2 Parallel Computing in Transportation Simulations

Parallel computing has been employed in several transportation simulation projects. One of the first was PARAMICS [8], which started as a high performance computing project on a Connection Machine CM-2. In order to fit the specific architecture of that machine, cars/travelers were not truly objects but particles with a limited amount of internal state information. PARAMICS was later ported to a CM-5, where it was simultaneously made more object-oriented. In [8], a computational speed of 120 000 vehicles with an RTR (real time ratio, see Section 4.4) of 3 is reported, on 32 CPUs of a Cray T3E.

At about the same time, it was shown that on coupled workstation architectures it is possible to efficiently implement vehicles in an object-like fashion, and a parallel computing prototype with “intelligent” vehicles was developed [56]. This later resulted in the research code PAMINA [68], which was the technical basis for the parallel version of TRANSIMS [57]. In tests (using Ethernet [77] only, on a network with 20 000 links and about 100 000 vehicles simultaneously in the simulation), TRANSIMS [57] ran about 10 times faster than real time with the default parameters, and about 65 times faster than real time after tuning. These numbers refer to 32 CPUs; adding more CPUs did not yield further improvement. The parallel concepts behind TRANSIMS are the same as those behind the queue model described in Chapter 2, and in consequence TRANSIMS is up against the same latency problem as the queue model. However, for unknown reasons its computational speed is lower than predicted by latency alone.

Some other early implementations of parallel traffic flow simulations are discussed in [10, 60]. A parallel implementation of AIMSUN [23] reports a speed-up of 3.5 on 8 CPUs using threads (i.e., the shared memory technology explained above) [4].

DYNEMO [71] is a macro-particle model, similar to DYNASMART [18] described below. A parallel version was implemented about five years ago [61]. A speed-up of 15 on 19 CPUs connected through 100 Mbit Ethernet was reported on a traffic network of Berlin and Brandenburg with 13 738 links. Larger numbers of CPUs were reported to be inefficient.

DynaMIT [17] uses functional decomposition (task parallelization) as its parallelization concept [17]. This means that different modules, such as the router, the traffic flow (supply) simulation, the demand estimation, etc., can be run in parallel, but the traffic flow (supply) simulation runs on a single CPU only. Functional decomposition is outside the scope of this thesis. DYNASMART [18] also reports the intention of implementing functional decomposition.

Note that in terms of raw simulation speed, the performance values presented in this work are more than an order of magnitude faster than anything listed above. In addition, this is not achieved by a smaller scenario size, but by diligent model selection, efficient implementation, and by hardware improvements based on knowledge of where the computational bottlenecks are. That is, this approach makes it possible to run very large scale scenarios as everyday research topics, rather than to have them as the result of computationally intensive studies only.

4.3 Implementation

As discussed in the previous sections, the parallel target architecture for this traffic flow simulation is a PC cluster. The suitable approach for this architecture is domain decomposition, i.e. to decompose the street network graph into several pieces and to give each piece to a different CPU. Information exchange between the CPUs is achieved via messages.

When parallelizing a transportation simulation, one needs to decide where to split the underlying street network, and how to achieve the message exchange. Both questions can only be answered with respect to a particular traffic model, but the lessons learned here can be used for other models.
[Figure: processors P0 and P1 with nodes N1, N2, N3 at the domain boundary.]

Figure 4.1: Handling the boundaries and split links

4.3.1 Handling Domain Decomposition

In general, one wants to split as far away from an intersection as possible. This implies that one should split links in the middle, as, for example, TRANSIMS [57] does. However, for the queue model, the “middle of the link” does not make sense, since there is no real representation of space. In consequence, one can either split at the downstream end or at the upstream end of a link. The downstream end is undesirable because vehicles driving towards an intersection are influenced by the intersection to a greater degree than vehicles driving away from it. For that reason, in the queue simulation the links are split right after the intersection (Figure 4.1).

A good partitioning algorithm must decompose a domain in such a way that each subpart gets a fair share of the load. This issue is also known as load balancing. Load balancing ensures that no single CPU is overloaded or idle. In the application presented throughout this thesis, a software package called METIS [91] is employed for the domain decomposition. It has been chosen since it gives good results with large irregular graphs, such as the underlying street network given in Section 2.5.

METIS differs from the traditional graph partitioning algorithms because of the multilevel partitioning algorithms it uses. The traditional graph algorithms do the partitioning directly on the original graph. They are usually slow and do not give good quality.

METIS uses multilevel recursive bisection or multilevel k-way partitioning for higher quality results. Multilevel recursive bisection performs a sequence of bisections on the graph. It does not necessarily result in the best quality partitioning; however, it is widely used because of its simplicity.

The multilevel k-way partitioning of METIS is utilized throughout this thesis. It is a 3-phase partitioning technique:

- The original graph is coarsened down to fewer nodes by collapsing nodes and links. This makes it easier to find the best partition boundary of the graph.

[Figure: the partitioned network plotted in projected coordinates.]

Figure 4.2: Decomposed street network of Switzerland, extracted from the map of the whole of Europe, on which the METIS software package is used. The number of partitions is 8 in this example; 7 of them are colored separately; the 8th partition, which covers the rest of Europe, is also colored but cut off here.

- Then, k-way partitioning is performed on the smaller, coarsened graph.
- Finally, the decomposed graph is uncoarsened to find a k-way partitioning of the original graph.

Both multilevel recursive bisection and multilevel k-way partitioning also aim to reduce the edge-cut, i.e., the number of split links whose end nodes belong to different partitions. A result of METIS partitioning Switzerland's street network can be seen in Figure 4.2.
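For illustration, a call to the k-way routine of METIS 4.x might look as follows. This is a hedged sketch based on the METIS interface of that era, not the code used in this thesis, and the small graph is a toy example:

#include <metis.h>   // METIS 4.x header (assumption about the installed version)
#include <cstdio>

int main() {
    // Toy graph in CSR form: 4 nodes in a ring (0-1-2-3-0).
    int nvtxs = 4;
    idxtype xadj[5]   = {0, 2, 4, 6, 8};          // adjacency offsets per node
    idxtype adjncy[8] = {1, 3, 0, 2, 1, 3, 0, 2}; // neighbor lists
    int wgtflag = 0, numflag = 0;                 // no weights, C-style numbering
    int nparts = 2, options[5] = {0}, edgecut;    // options[0] = 0: defaults
    idxtype part[4];                              // resulting partition per node
    METIS_PartGraphKway(&nvtxs, xadj, adjncy, NULL, NULL,
                        &wgtflag, &numflag, &nparts, options, &edgecut, part);
    for (int i = 0; i < nvtxs; ++i)
        std::printf("node %d -> partition %d\n", i, (int)part[i]);
}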

4.3.2 Handling Message Exchange

Once the domain decomposition method breaks a problem into several subproblems, single or multiple programs are executed over the subproblems on different computing nodes at the same time. It is common that the subproblems are not fully independent, i.e., exchanging information at the boundaries of the different subproblems is necessary.

With respect to the application presented here, message passing implies that a CPU that “owns” a split link reports, via a message, the number of empty spaces to the CPU which “owns” the intersection from which vehicles can enter the link. After this, the intersection update can be done in parallel for all intersections. Next, a CPU that “owns” an intersection reports, via a message, the vehicles that have moved to the CPUs which “own” the outgoing links. Pseudo code showing how this is implemented with message passing is given in Figure 4.3. In fact, the algorithms in Figure 2.3 and Figure 4.3 together give the whole pseudo code of the parallel queue model traffic dynamics. For efficiency reasons, all messages to the same CPU in the same time step should be merged into a single message in order to incur the latency overhead only once.

4.3.3 Communication Software

The communication among the processors can be achieved by using a message passing library, which provides functions to send and receive data. There are several libraries, such as MPI (Message Passing Interface) [51] or PVM (Parallel Virtual Machine) [63], for this purpose.


Algorithm – Parallel computing implementation

According to Alg. 2.3, propagate vehicles along links.
for all split links do
    SEND the number of empty spaces of the link to the other processor.
end for
for all split links do
    RECEIVE the number of empty spaces of the link from the other processor.
end for
According to Alg. 2.3, move vehicles across intersections.
for all split links do
    SEND the vehicles which just entered a split link to the other processor.
end for
for all split links do
    RECEIVE the vehicles (if any) from the neighbor at the other end of the link.
    Place these vehicles into the local queues.
end for

Figure 4.3: Parallel implementation of the queue model.
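A sketch of how one of these exchange steps can look in MPI follows. The data layout (one integer of free space per shared split link, gathered into one message per neighbor so that the latency is incurred only once) and all names are illustrative assumptions, not the actual implementation:

#include <mpi.h>
#include <vector>

// Hypothetical description of the split links shared with one neighbor CPU.
struct NeighborDomain {
    int rank;                    // MPI rank of the neighbor
    std::vector<int> freeSpace;  // empty spaces, one entry per shared split link
};

// One boundary exchange: all values for a neighbor go out as a single
// message. This sketch relies on MPI buffering the small messages; a
// real implementation would use MPI_Isend/MPI_Irecv to rule out deadlock.
void exchangeFreeSpace(std::vector<NeighborDomain>& neighbors) {
    for (std::size_t i = 0; i < neighbors.size(); ++i)
        MPI_Send(&neighbors[i].freeSpace[0], (int)neighbors[i].freeSpace.size(),
                 MPI_INT, neighbors[i].rank, 0, MPI_COMM_WORLD);
    for (std::size_t i = 0; i < neighbors.size(); ++i) {
        MPI_Status status;
        MPI_Recv(&neighbors[i].freeSpace[0], (int)neighbors[i].freeSpace.size(),
                 MPI_INT, neighbors[i].rank, 0, MPI_COMM_WORLD, &status);
    }
}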

Both PVM and MPI are software packages/libraries that allow heterogeneous PCs interconnected by a computer network to exchange data. They both define an interface for different programming languages such as C/C++ or Fortran. For the purposes of parallel traffic flow simulation, the differences between PVM and MPI are negligible. In principle, CORBA (Common Object Request Broker Architecture) [92] would be an alternative to MPI or PVM, in particular for task parallelization; but practical experience shows that it is difficult to use and, because of its strict client-server paradigm, not well suited to systems which assume that all tasks are on equal hierarchical levels.

Among these approaches, MPI is chosen since it has a slightly stronger focus on computational performance. Some key features of MPI can be summarized as follows:

- The specification of MPI is machine independent, i.e., data exchange among different machine architectures will not cause any data loss because of the different word lengths of the machines. This is also a feature of PVM.
- Different processes of a parallel program can execute different executable binary files (i.e. task parallelization). This is also provided by PVM.
- MPI does not dictate specific behavior on errors other than indicating what the error is. This is because MPI expects users to go with high-quality implementations, knowing that prescribed error recovery specifications would limit the portability of MPI.
- MPI allows processes to be defined in different inter-communicators. Different inter-communicators are capable of communicating with each other. As explained in Chapter 7, this helps in coupling the different modules available in the application presented here.

MPI is designed to operate on different communication technologies. Here, MPI is used both over Ethernet [77] and over Myrinet [54]. Moreover, PVM is tested on the same technologies; however, the results for PVM on Myrinet do not outperform the results for PVM on Ethernet. A brief description of the Myrinet technology is given below; the results and comparison figures can be found in Section 4.5.
in Section 4.5.<br />

43


¢¡ ¥<br />

<br />

<br />

¡<br />

* 6<br />

§¡ ¥<br />

¢¡ ¥<br />

8<br />

£<br />

Myrinet [54] is a high-performance packet-communication and switching technology designed by a company called Myricom to provide a high-speed communication medium for PC clusters. Compared to other technologies, Myrinet has much less protocol overhead, and therefore provides much better throughput and latency. The one-way latency of Myrinet is about 6 microseconds [54]; 10 Gbit Ethernet reportedly has an end-to-end latency of 21 microseconds [36]. A measurement using the ping command on the Fast Ethernet LAN reports a round-trip latency of 0.20-0.25 msecs.

PCs in a cluster interconnected by Myrinet are linked via low-overhead routers and switches, as opposed to connecting one machine directly to another. Most of the fault-tolerance features, such as flow control, error control, etc., are provided by the low-overhead switches.

4.4 Theoretical Performance Expectations

The problem size and the memory requirement of a sequential program are the determining factors when measuring the performance of the program. If the memory needs of a sequential program can be supplied by the system, the execution time of the program becomes directly proportional to the problem size. Thus, the prediction of the performance of a sequential program is straightforward.

As far as parallel programs are concerned, the problem size and the memory are still essential factors, but they are not enough to explain the more complicated behavior of parallel programs: load balancing and communication overhead complicate the performance measurement.

The performance of parallel programs can be monitored by different metrics. Among these metrics, execution time and speed-up are the most commonly used. In addition to these two metrics, a third metric, the real-time ratio (RTR), is considered. The RTR describes how much faster than reality the simulation is running.

In a log-log plot, the speed-up curve can be obtained from the RTR curve by a simple vertical shift; this vertical shift corresponds to a division by the RTR of the single-CPU version of the simulation. Speed-up curves put more emphasis on the efficiency of the implementation and less emphasis on absolute speed. An additional difference is that speed-up is independent of the problem size, except for the Ethernet saturation level, which does depend on the problem size; RTR, in contrast, depends on the problem size, except for the Ethernet saturation level, which does NOT depend on the problem size.
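In other words, if RTR(p) denotes the real-time ratio obtained on p CPUs, the speed-up is simply

S(p) = RTR(p) / RTR(1) ,

so that on logarithmic axes the two curves differ only by the constant offset log RTR(1).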

The execution time of a parallel program is defined as the total time elapsed from the time the first processor starts execution to the time the last processor completes the execution. During the execution on a PC cluster, each processor is either computing or communicating. Consequently,

T(p) = T_cmp(p) + T_cmm(p)    (4.1)

where T(p) is the execution time, p is the number of processors, T_cmp(p) is the computation time and T_cmm(p) is the communication time.

For a problem that can be parallelized using domain decomposition, the time required for the computation, T_cmp(p), can be approximated as the runtime of the computation on a single CPU divided by the number of processors. T_cmp(p) also includes overhead effects, such as the handling of the boundary conditions by both adjoining CPUs, and unequal domain size effects, i.e., load balancing problems. Therefore, the theoretical value can be written as

T_cmp(p) = (T_1 / p) · (1 + f_ovr + f_dmn)    (4.2)


where f_ovr and f_dmn account for the overhead and load balancing effects, T_1 is the serial execution time¹ and p is the number of CPUs. Under the assumption that f_ovr and f_dmn are small enough, T_cmp(p) is approximated as

T_cmp(p) ≈ T_1 / p .    (4.3)

The communication time, T_cmm(p), generally has two contributors: bandwidth and latency.

Bandwidth is the transfer rate of data, for example measured in bytes per second. It is determined by at least two contributions: the node bandwidth and the network bandwidth. The node bandwidth, b_nd, is the bandwidth of the connection from a CPU to the network. If two computers communicate with each other, this is the maximum bandwidth they can reach. Hence, this is sometimes also called the “point-to-point” bandwidth.

The node bandwidth contribution to the communication time per time step is expressed as

N_spl(p) · S_msg / b_nd    (4.4)

where N_spl,tot is the total number of split links in the simulation, N_spl(p) = N_spl,tot / p is the number of split links per computational node, and S_msg is the message size.

The network bandwidth, b_net, is given by the technology and the topology of the network. Typical technologies are 100 Mbit Fast Ethernet, Gigabit Ethernet, etc. [77]. Typical topologies are bus topologies, switched topologies, two-dimensional topologies (e.g. grid/torus), hypercube topologies, etc. For example, a traditional Local Area Network (LAN) uses 100 Mbit Ethernet with a shared bus topology. In a shared bus topology, the same medium is used for all communications between the computers, i.e., they have to share the network bandwidth.

In a switched topology, the network bandwidth is given by the backplane of the switch. Often, the backplane bandwidth is high enough to let all nodes communicate with each other at full node bandwidth, and for practical purposes one can thus neglect the network bandwidth effect for switched networks.

The network bandwidth contribution to the communication time per time step is

p · N_spl(p) · S_msg / b_net = N_spl,tot · S_msg / b_net .    (4.5)

The cluster used for the tests throughout this work has a switched topology; thus, b_net comes from the technical data of the central switch.

Latency, the second contributor to the communication time, is the time necessary to initiate the communication. Latency is the limiting factor of 10/100 Mbit Ethernet LANs. Newer technologies such as Gigabit Ethernet and Myrinet promote lower latencies.

If all the contributing factors are taken into account, the communication time per time step is formulated as follows:

T_cmm(p) = n_sub · ( n_nb(p) · t_lt + N_spl(p) · S_msg / b_nd + N_spl,tot · S_msg / b_net )    (4.6)

which will be explained in the following paragraphs.

n_sub is the number of sub-time-steps. Since two boundary exchanges per time step are done, n_sub = 2 for the application presented in this thesis.

¹ The serial or sequential execution time of a problem can be measured by running the problem on a single computing node.



n_nb(p) is the number of neighbor domains each CPU communicates with. All information which goes to the same CPU is collected and sent as a single message, thus incurring the latency only once per neighbor domain. For p = 1, n_nb is zero, since there is no other domain to communicate with. For p = 2, it is one. For larger p, and assuming that domains are always connected, Euler's theorem for planar graphs says that the average number of neighbors cannot be more than six.

Figure 4.4 shows an area composed of hexagons. Each hexagon represents a computing node, and the total number of computing nodes is p; thus, the figure shows the domain decomposition of the area into p partitions, arranged as a √p × √p patch. The hexagons are painted with 4 different colors. Each color represents a different number of edges shared with neighboring hexagons; the four colors correspond, from lightest to darkest, to the cases of 2, 3, 4 and 6 neighbors. The total number of edges shared by neighboring partitions is calculated as follows: two of the hexagons (two opposite corners) have 2 neighbors; two of the hexagons (the other two corners) have 3 neighbors; 4·(√p − 2) of the hexagons (the remaining ones on the edges) have 4 neighbors; and (√p − 2)² of the hexagons (the ones in the middle) have 6 neighbors. Thus, the average number of neighbors becomes:

n_nb(p) = [ 2·2 + 2·3 + 4·(√p − 2)·4 + (√p − 2)²·6 ] / p = (6p − 8·√p + 2) / p    (4.7)

The numerator of this formula is an integer whenever √p is an integer. Based on the geometric argument behind Equation 4.7, the following expression is used:

n_nb(p) = 6 − 8/√p + 2/p    (4.8)

which has n_nb(1) = 0, as desired, and n_nb(p) → 6 for p → ∞.

t_lt is the latency (or start-up time) of each message. As said above, t_lt is between 0.20 and 0.25 milliseconds for the Fast Ethernet network of the cluster used throughout this thesis.

Consequently, the combined time for one time step is

T(p) = (T_1 / p) · (1 + f_ovr + f_dmn) + n_sub · ( n_nb(p) · t_lt + N_spl(p) · S_msg / b_nd + N_spl,tot · S_msg / b_net ) .    (4.9)

According to the discussion above, for p → ∞ the number of neighbors becomes constant, i.e., n_nb(p) → 6, while the total number of split links in the simulation, N_spl,tot, grows and eventually converges to the total number of links of the network. In consequence, for f_ovr and f_dmn small enough, the time per time step approaches

T(p → ∞) → n_sub · ( 6 · t_lt + N_spl,tot · S_msg / b_net ) :    (4.10)

- for a shared or bus topology, b_net is relatively small and constant, so the network bandwidth term dominates and grows as long as additional CPUs create additional split links;
- for a switched or a parallel supercomputer topology, one assumes b_net → ∞ and obtains T(p → ∞) → n_sub · 6 · t_lt, a constant latency floor.

Thus, in a shared topology, adding CPUs will eventually increase the simulation time, making the simulation slower. In a non-shared topology, adding CPUs will eventually not make the simulation any faster, but at least it will not be detrimental to the computational speed.


[Figure: a √p × √p arrangement of hexagonal domains; the corner, edge and interior hexagons are shaded according to their 2, 3, 4 or 6 neighbors.]

Figure 4.4: Calculation of neighbors of computing nodes

The dominant term in a shared topology for p → ∞ is the network bandwidth term; the dominant term in a non-shared topology is the latency term.

By taking the latency of 100 Mbit Fast Ethernet cards as 0.225 ms, the following calculation is done to find the saturation level of Fast Ethernet. Each processor sends messages twice per time step to all of its (up to six) neighbors, resulting in 2 · 6 = 12 latency contributions, or 12 · 0.225 ms = 2.7 ms per time step. In other words, the cluster can maximally do 1000/2.7 ≈ 370 time steps per second. If the time step of a simulation is one second, then with 100 Mbit Ethernet, 370 is also the maximum real time ratio of the parallel simulation, i.e. the number which says how much faster than reality the simulation is. Note that the limiting value does not depend on the problem size or on the speed of the algorithm; it is a limiting number for any parallel computation of a 2-dimensional system on a PC cluster using Ethernet LAN.
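The saturation arithmetic is simple enough to write down directly; the following C++ fragment (an illustrative sketch using the constants quoted above, not part of the simulation code) reproduces the limit:

    #include <cstdio>

    int main() {
        const double t_lt   = 0.225e-3; // [s] latency per message (Fast Ethernet)
        const int    n_exch = 2;        // message exchanges per time step
        const int    n_nbr  = 6;        // neighbor domains for large p

        // Latency cost of one time step: 2 * 6 * 0.225 ms = 2.7 ms.
        double t_step = n_exch * n_nbr * t_lt;
        std::printf("latency per step: %.4f s -> max %.0f steps/s\n",
                    t_step, 1.0 / t_step);  // prints roughly 370
        return 0;
    }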

The only way this number can be improved under the assumptions made is to use faster communication hardware. Gigabit Ethernet hardware is faster, but standard driver implementations give away that advantage [45]. In contrast, Myrinet [54] is a communication technology specifically designed for this situation. Interestingly, as will be seen later, it is possible to recoup the cost for a Myrinet network by being able to work with a smaller cluster.



4.5 Experimental Results

The parallel queue model is used as the traffic flow simulation within the project of a microscopic and activity-based simulation of all of Switzerland. Computational performance results are reported here; validation results with respect to a real-world scenario can be found in [65].

The cluster used during the tests is composed of 32 computers, each of which is a Pentium III 1 GHz dual-CPU node. Besides a default 10 Mbit Ethernet [77] communication layer between these computing nodes, two more network interfaces were available: Fast Ethernet [77] and Myrinet [54]. Throughout the rest of this thesis, the term Ethernet will refer to Fast Ethernet. Fast Ethernet is the follow-up of the 10 Mbit Ethernet technology. It offers a speed of 100 Mbit/s. Even though it is 10 times faster than 10 Mbit Ethernet, they are both specified by the same standards. Due to further developments in Ethernet technology, Gigabit Ethernet, giving a data rate of 1 Gbit/s, has also come into the picture.

In terms of software, all computing nodes are dual boot: RedHat Linux [40] and Microsoft Windows [13]. However, all the tests performed in this work are done only on Linux. More information about the cluster technology used throughout this work is given in [46].

The following performance numbers refer to the scenario "ch6-9" explained in Section 2.5, containing around 1 million agents and a street network with 10 564 nodes and 28 622 links. Moreover, as also stated in Section 2.5, the scenario is simulated for 3 hours, excluding input reading and output writing.

In the following sections, different computing issues are discussed: Section 4.5.1 compares the execution times of the parallel traffic flow simulation over different communication media, namely Ethernet and Myrinet. The communication libraries PVM and MPI are tested and the results are shown in Section 4.5.2. Packing the number of empty spaces on the links of the street network and packing the vehicles moving across the boundaries by using different packing algorithms are discussed in Section 4.5.3. Finally, employing different options of the METIS decomposition library is described in Section 4.5.4.

4.5.1 Comparison of Different Communication Hardware: Ethernet vs. Myrinet

The most important plot is Figure 4.5(a). It shows computational real time ratio (RTR) numbers as a function of the number of CPUs. Note that, with 60 CPUs with Myrinet, an RTR of 900 is achieved. This means that 24 hours of all car traffic in Switzerland are simulated in less than two minutes! This performance is achieved with Myrinet communication hardware; when using 100 Mbit Ethernet hardware, peak performance is at about 300 RTR. Due to the lack of availability of more computing nodes, the tests could not go beyond 60 computing nodes. But the practical results follow the predicted RTR curve for the available computing nodes.

When the measurements for the curves in Figure 4.5(a) are taken, the spatial queues and buffers of the links are implemented by the self-implemented Ring class as explained in Section 3.3.4. The graph data is stored in an STL-vector (Section 3.3.2). Finally, the supplementary data structures such as waiting and parking queues are implemented by using the linked list structure described in Section 3.3.3.

The plot also shows two different curves for achieving the performance with single-CPU or with dual-CPU machines; obviously there are differences, but they are less important. The lower values of dual-CPU machines are due to the fact that the bandwidth of the network card is shared between two processes running on a single machine.

[Figure: two log-log plots over the number of CPUs (1-256). Panel (a), "RTR for a 3-hour run of 6-9 Scenario", shows the real time ratio for single/Myrinet, single/Ethernet, double/Myrinet, double/Ethernet and the theoretical value; panel (b), "Speedup for a 3-hour run of 6-9 Scenario", shows the corresponding speedup curves.]

Figure 4.5: RTR and Speedup curves for the Parallel Queue Model. The results are measured when spatial queues and link buffers are of the Ring type, waiting and parking queues are Linked List and the graph data is stored in an STL-vector. See Chapter 3 for further details.

However, the performance decrease of dual-CPU machines compared to single-CPU machines is less than a factor of 1.5, which is presumably due to the fact that one process can communicate while the other is computing.

One advantage of dual-CPU machines is encountered when investigating the cost/performance ratio. The cost of a single-CPU machine is only a little lower than that of a dual-CPU machine. Furthermore, the performance difference between these two setups, as stated above, is less than a factor of 1.5. For example, the RTR when using 56 computing nodes on Fast Ethernet is 300 with single-CPU and 284 with dual-CPU machines. The cost of 56 single-CPU machines is around twice the cost of 28 dual-CPU machines. Thus, the cost/performance ratio of dual-CPU machines is competitive with that of single-CPU machines.

There is a super-linear speed-up between 32 and 60 computing nodes when Myrinet is used. This is presumably due to cache effects, i.e., the sub-domains become small enough that their data fits into cache.

The theoretical curve for the execution time in Figure 4.5(a) is calculated as follows. The computation time, T_cmp(p), is taken from Equation 4.3, where T_1 is the measurement taken from a sequential run executing one time step. The measured value is about 0.065 seconds. The communication time is formulated as

    T_{cmm}(p) = n_{exch} \cdot N_{nbr}(p) \cdot t_{lt} ,

where, as stated earlier, n_exch equals 2 and N_nbr(p) is calculated from Equation 4.8. Finally, t_lt is the latency, measured at 0.225 milliseconds on the cluster.
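Combining both terms gives the theoretical curve; the sketch below (illustrative only, using the measured constants T_1 = 0.065 s and t_lt = 0.225 ms, and neglecting the bandwidth term as in the switched case) computes the model RTR as a function of p:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double T1     = 0.065;    // [s] one sequential time step (measured)
        const double t_lt   = 0.225e-3; // [s] per-message latency (measured)
        const int    n_exch = 2;        // exchanges per time step

        for (int p = 1; p <= 256; p *= 2) {
            double nNbr = 6.0 - 8.0 / std::sqrt((double)p) + 2.0 / p; // Eq. 4.8
            double T    = T1 / p + n_exch * nNbr * t_lt;              // Eq. 4.9, latency part
            std::printf("p=%3d  T=%.5f s  RTR=%6.1f\n", p, T, 1.0 / T);
        }
        return 0;
    }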

In Figure 4.5(b), the corresponding speed-up curves are shown. As stated in Section 4.4, speed-up curves can be obtained by shifting RTR curves vertically. The shifting factor is about 5 here. A speed-up of 32 with 60 CPUs is reached when using Myrinet.

The most important results can be summarized as follows:

On PC clusters (Linux boxes) with Ethernet, the parallel traffic flow simulation speed theoretically saturates at 370 simulation time steps per second. Up to the maximum number of nodes that could be used, the practical values follow the theoretical curve. This statement is independent of the scenario size or the size of the PC cluster. In contrast, on PC clusters with Myrinet, no saturation effect was observed for the scenario sizes considered.

If the simulation time step is one second, then "300 simulation time steps per second" translates into a real time ratio of 300, meaning the simulation runs 300 times faster than real time.

It is interesting to compare two different hardware configurations:

- 56 single-CPU machines using 100 Mbit LAN. Real time ratio 300. Cost approx. … for the machines plus approx. … for a full bandwidth switch, resulting in approx. … overall.

- 28 dual-CPU machines using Myrinet. Real time ratio 900. Cost approx. …, Myrinet included.

That is, the Myrinet setup is not only faster, but somewhat unexpectedly also cheaper. A Myrinet setup has the additional advantage that smaller scenarios than the one discussed here will run even faster, whereas on the Ethernet cluster, smaller scenarios will run with the same computational speed as large scenarios.

As mentioned in Section 4.4, the speed-up curves show the same performance saturation as do the RTR curves. Even larger scenarios reach greater speed-up, but saturate at the same RTR on Ethernet.

Improving the single-CPU version of the simulation, as explained in Chapter 3, also shows up in the parallel computing results. For example, Figure 3.7 shows the results when improving the data structure used for the graph data.

4.5.2 Comparison of Different Communication Software: MPI vs. PVM

During the tests, MPI [51] is used as the communication software. Yet, PVM [63] is also utilized to see whether it makes a difference. One might say that software performance is limited by hardware performance. However, it is also significant how software is designed to get the most benefit from the hardware.

PVM and MPI have been compared for years. For the purposes of the application presented here, their capabilities are rather similar, as explained in Section 4.3.3. Figure 4.6 compares the results of using PVM or MPI. The curves are created when an STL-map is used for the graph data (Section 3.3.2), when the parking and waiting queues are represented by an STL-multimap (Section 3.3.3), and when the Ring class, as explained in Section 3.3.4, is employed for the spatial queues and link buffers.

When the underlying computer network is chosen to be Ethernet, MPI and PVM perform similarly. Presumably, Ethernet being commonly used as a communication infrastructure pushes software developers to improve software designed for it. When a special infrastructure, such as Myrinet, is used, the support it gets is proportional to the demand by the users. Both MPI and PVM support Myrinet, but personal experience shows that only MPI is able to exploit the hardware advantage of Myrinet: As seen in Figure 4.6, PVM on Myrinet behaves as if it ran on Ethernet. The reasons for this remain unclear; the attempts of the developers of PVM over Myrinet at instrumenting the software on the cluster did not lead to a solution.

The important consequence is:

If one wants to use high performance communications hardware, such as Myrinet or Infiniband, for PC clusters, then the use of MPI is strongly recommended since it is significantly better supported than any other parallel communication standard.

[Figure: two log-log plots over the number of CPUs (1-256). Panel (a), "RTR for a 3-hour run of 6-9 Scenario, PVM TEST", and panel (b), "Speedup for a 3-hour run of 6-9 Scenario, PVM TEST", each show curves for Eth-MPI, Myri-MPI, Eth-PVM and Myri-PVM.]

Figure 4.6: RTR and Speedup graphs for the PVM and MPI comparison. An STL-map is used for the graph data, an STL-multimap represents the parking and waiting queues, and the Ring class is used for spatial queues and link buffers.

Therefore, the results shown in other parts of this thesis are normally measured over Myrinet using MPI; exceptions are specified.

4.5.3 Comparison of Different Packing Algorithms

In this work, different types of data are exchanged between different modules: vehicles, events (Chapter 6), plans (Chapter 7), etc. They need to be packed prior to sending. Different packing methods for vehicles are discussed below to give an impression of the contribution of packing to the overall computing time.

In general, some packing methods pack only the necessary part of an object. On the other hand, instead of dealing with individual data pieces, the instances of an object can be packed as a whole. The latter is known as object serialization.

Object Serialization

Object serialization can be defined as writing the content (or state) of an object such that it can be re-constructed from that content (or state). The content is converted into bytes using the object serialization method. Some object oriented programming languages such as Java [42] provide methods to define a class as serializable and to write the contents of object instances. Some of the known problems of object serialization, in general, are the following (a small illustration follows the list):

- If the class to be serialized is an extension of other available classes, then those classes must also be defined as serializable.
- If only a part of the information of a very large object needs to be serialized, then object serialization becomes inefficient in terms of space and time.
- Some information of an object needs to remain private, hence it must not be serialized.
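In C++, the second problem is compounded by a more basic one: an object whose payload lives behind a pointer cannot be serialized by copying its raw bytes. The following minimal sketch (illustrative, not MATSIM code; the Vehicle struct is hypothetical) shows why:

    #include <cstring>
    #include <vector>

    struct Vehicle {
        long id;
        std::vector<long> route;   // variable-length payload lives on the heap
    };

    int main() {
        Vehicle v{123, {4902, 4903, 4904}};
        char buf[sizeof(Vehicle)];

        // This copies the vector's internal bookkeeping (pointers), not the
        // route nodes themselves: a receiver unpacking buf would see pointers
        // into the sender's address space. Whole-object byte copies therefore
        // fail exactly where Figure 4.7 has its variable-length route array.
        std::memcpy(buf, &v, sizeof(Vehicle));
        return 0;
    }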

Messages containing vehicles

The MATSIM traffic flow simulator packs two types of data in each time step: the number of empty spaces of the split links and the vehicles to be moved to split links. The size of the packet which contains the number of empty spaces is the same in each time step, since the number of split links does not change. The packet, in this case, just includes the IDs of the split links and the number of empty spaces on the corresponding split links.

As far as vehicles are concerned, the packet size differs depending on the number of vehicles actually transmitted. The data of a vehicle to be packed into a packet is shown in Figure 4.7. Thus, for each vehicle transmitted, different types of data are packed. One of the most important remarks about the types listed in the figure is that the length of the long integer array for the list of the remaining nodes of a route is different for each vehicle. This makes packing hard for simple parallel computing packing commands.

    a long integer for the vehicle ID,
    a long integer for the next link ID that the vehicle will be on,
    an integer for the size of the remaining route of the vehicle,
    a long integer array for the route itself (list of nodes),
    a double for the activity duration,
    a double for the activity end time,
    an integer for the leg number,
    a long integer for the final destination link ID.

Figure 4.7: The data of a vehicle to be packed

Default implementation: memcpy

The default implementation for packing in the traffic flow simulator is written by using the C function memcpy. memcpy creates byte arrays by converting all data types into bytes. The receiving side also uses memcpy to unpack variables of different data types from a byte array. The packing of data is shown in Figure 4.8. The unpacking is similar to the code given in the figure. One drawback of this method is that when a packet is being prepared, the pointer that keeps track of the position of the next available memory slot in the packet must be advanced manually.

    // define the packet
    char packet[PACKETSIZE];
    char *pos = packet;   // next free position in the packet

    memcpy(pos, &vehicleId, sizeof(vehicleId));
    pos += sizeof(vehicleId);   // advance the pointer to the end of the newly added data
    memcpy(pos, &nextLinkId, sizeof(nextLinkId));
    pos += sizeof(nextLinkId);
    memcpy(pos, &routeLength, sizeof(routeLength));
    pos += sizeof(routeLength);
    for (all the nodes in the route) {
        memcpy(pos, &nodeId, sizeof(nodeId));
        pos += sizeof(nodeId);
    }
    memcpy(pos, &activityDuration, sizeof(activityDuration));
    pos += sizeof(activityDuration);
    memcpy(pos, &activityEndTime, sizeof(activityEndTime));
    pos += sizeof(activityEndTime);
    memcpy(pos, &legNumber, sizeof(legNumber));
    pos += sizeof(legNumber);
    memcpy(pos, &finalDestinationLinkId, sizeof(finalDestinationLinkId));
    pos += sizeof(finalDestinationLinkId);

Figure 4.8: Packing vehicle data with memcpy

Using MPI Pack and MPI Unpack

If the communicating processes run on different architectures with different machine representations, the conversions done by memcpy might be different on both sides and might cause incorrect unpacking and assignment of values. Therefore, a good option is to use the MPI Pack and MPI Unpack library calls. They are similar to memcpy in the sense that they also provide conversions of different data types into bytes or vice versa. However, since the types are converted into MPI types first, having machines with different representations does not appear to be a problem. An example of packing with MPI Pack is presented in Figure 4.9. Unpacking, again, is similar to packing. Advancing the offset pointer is not necessary here, as the MPI calls provide it internally.

    MPI::INT.Pack(packet, vehicleId, sizeof(vehicleId));
    MPI::INT.Pack(packet, nextLinkId, sizeof(nextLinkId));
    MPI::INT.Pack(packet, routeLength, sizeof(routeLength));
    for (all the nodes in the route)
        MPI::INT.Pack(packet, nodeId, sizeof(nodeId));
    MPI::DOUBLE.Pack(packet, activityDuration, sizeof(activityDuration));
    MPI::DOUBLE.Pack(packet, activityEndTime, sizeof(activityEndTime));
    MPI::INT.Pack(packet, legNumber, sizeof(legNumber));
    MPI::INT.Pack(packet, finalDestinationLinkId, sizeof(finalDestinationLinkId));

Figure 4.9: Packing vehicle data with MPI Pack

Using MPI Struct

The previous two methods pack individual variables of objects, which are vehicles in the traffic flow simulator. A more elegant way of packing would be packing objects all at once as opposed to piece by piece. MPI Struct does this: it allows packing whole objects. Despite MPI Struct being a desirable method for object serialization, object serialization fails when each instance of an object type uses a different size of an array. In that case, a fixed size should be defined for all object instances. For example, one has to fix the number of nodes each vehicle should go through, i.e. the route. Vehicles of a real scenario are not supposed to visit the same nodes, i.e., they do not have the same plans. Therefore, node lists (routes) to be visited being of variable length is a problem when using MPI Struct. In order to solve this problem, the program sets the maximum number of nodes to be visited among all vehicles as the size of the node array.

    // define a struct corresponding to a vehicle
    typedef struct {
        int vid, lid, routesize, route[MAXROUTELENGTH], legid, dlid;
        double actDur, actEnd;
    } vehicle_struct;

    // commit the new type
    create corresponding MPI_Struct type based on vehicle_struct
    commit the new type

    // define the packet
    vehicle_struct packet[PACKETSIZE];

    // packing the i-th vehicle
    packet[i].vid       = vehicleId;
    packet[i].lid       = nextLinkId;
    packet[i].routesize = routeLength;
    for (j = 0; j < MAXROUTELENGTH; j++)
        packet[i].route[j] = route[j];
    packet[i].actDur = activityDuration;
    packet[i].actEnd = activityEndTime;
    packet[i].legid  = legId;
    packet[i].dlid   = destinationLinkId;

Figure 4.10: Packing vehicle data with MPI Struct

In Figure 4.10, vehicle packing by using MPI Struct is shown. The three methods are explained in more detail in Section 6.11.

Results

Figure 4.11(a) and Figure 4.11(b) show the resulting RTR graphs on single and dual CPUs of the computing nodes, respectively, when using different packing algorithms for exchanging the number of empty spaces and the vehicles. The tests are done when an STL-vector is used for the graph data (Section 3.3.2), when the parking and waiting queues are represented by an STL-multimap (Section 3.3.3), and when the Ring class, as explained in Section 3.3.4, is employed for the spatial queues and link buffers.

The following notation is used in these figures: Tests are repeated for both Myrinet and Ethernet (Myri vs. Eth) when using single (Figure 4.11(a)) and double (Figure 4.11(b)) processes per computing node. The tests done are:

- Packing both the number of empty spaces and the vehicles with memcpy (ME,MV)
- Packing the number of empty spaces with memcpy and the vehicles with MPI Pack (ME,PV)
- Packing the number of empty spaces with memcpy and the vehicles with MPI Struct (ME,SV)
- Packing both the number of empty spaces and the vehicles with MPI Pack (PE,PV)
- Packing both the number of empty spaces and the vehicles with MPI Struct (SE,SV)

[Figure: two log-log plots of Real Time Ratio over the number of CPUs (1-256). Panel (a), "RTR with different packing algs when running single process per node", and panel (b), "RTR with different packing algs when running double processes per node", each show curves for Myri and Eth combined with ME,MV; ME,PV; ME,SV; PE,PV; SE,SV.]

Figure 4.11: RTR graphs for different packing algorithms. During these tests, an STL-vector is used for the graph data, an STL-multimap is used for waiting and parking queues and the Ring class is used for spatial queues and link buffers.

Quantitatively, packing only the vehicles with MPI Pack and MPI Struct slows the total execution time down by 2% and 5%, respectively, compared to memcpy. If both the vehicles and the empty spaces are packed with MPI Pack and MPI Struct, the performance loss is 3% and 9% compared to the memcpy approach. MPI Struct gives the worst performance among these three, because it fixes the route array length to a maximum length.

The main result can be summarized as follows:

MATSIM should replace the memcpy approach with the MPI Pack and MPI Unpack commands, since they offer more robustness with respect to data types with only very little performance overhead. In contrast, the vote is open with respect to MPI Struct: Advantages with respect to object handling are counter-balanced by the need to define a fixed maximum route length and the resulting inefficiencies.

As stated in Section 6.11, the performance of these functions depends on the data to be exchanged and its size. In spite of MPI Struct giving the worst performance when sending vehicles and the number of empty spaces, it gives the best performance when sending the events generated by the traffic flow simulators to the strategy generation modules. More details are given in Section 6.11.

4.5.4 Different Domain Decomposition Algorithms

One could ask if a different domain decomposition might make a difference. It was already argued earlier that no difference is expected once latency saturation sets in. METIS [91] provides different partitioning concepts with different refinement algorithms. The default version, named METIS PartGraphKway, is used. It not only reduces the number of non-contiguous sub-domains but also tries to minimize the connectivity of the sub-domains. The performance results of the MATSIM traffic flow simulator presented earlier are generated using this default option.

One can put weights on nodes or on links or on both, such that the weights dominate the partitioning. The first alternative tried is the so-called standard feedback. The method produces a single weight for each element. Since the work of the queue simulation consists mostly of computing the intersection dynamics, the computational load is essentially proportional to the number of intersections. Thus, with this method the weights are put on the nodes. Once a single constraint is computed for each node after a simulation run, the statistics are written into a file, which will be used by the domain decomposition process in the next simulation run (iteration). The standard feedback partitioning thus attempts to spread the network nodes equally across all CPUs while maintaining contiguous domains.

[Figure: two log-log plots over the number of CPUs (1-256). Panel (a), "RTR for a 3-hour run of 6-9 with different partitioning algorithms", and panel (b), the corresponding speedup, each show curves for the default partitioning and for feedback by the number of incoming links, by computing time and by vehicle count.]

Figure 4.12: RTR and Speedup graphs for METIS partitioning with standard feedback. Different values are taken into consideration as feedback for the next iteration. An STL-map is used to represent the graph data, an STL-multimap is used for waiting and parking queues and the Ring class is used for spatial queues and link buffers.

Three different constraints are tested for standard feedback partitioning:

- the number of vehicles processed by a node,
- the computing time spent on a node,
- the number of incoming links of a node.

Figure 4.12 compares the constraints above, used for the ch6-9 scenario (Section 2.5). The measurements are taken under the following circumstances: The graph data is implemented by an STL-map as explained in Section 3.3.2. The parking and waiting queues are represented by an STL-multimap as described in Section 3.3.3, and the self-implemented Ring class, explained in Section 3.3.4, is employed for the spatial queues and link buffers. All the test results are obtained when the parallel traffic flow simulation is run on Myrinet.

The figure shows that neither using the computing time nor the number of vehicles processed gives any improvement. Only setting the number of incoming links of the nodes as weights gives better performance compared to the other two approaches.
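For concreteness, a call with such node weights might look as follows. This is a sketch assuming the METIS 4 C API; the thesis does not show its actual invocation, and the wrapper name and weight array are hypothetical:

    #include <metis.h>   /* METIS 4.x C API (assumed here) */

    /* Partition the street network graph into nparts domains, weighting each
       node (intersection) by its number of incoming links. */
    void partitionByIncomingLinks(int nvtxs, idxtype *xadj, idxtype *adjncy,
                                  idxtype *inLinkWeight, int nparts,
                                  idxtype *part) {
        int wgtflag    = 2;      /* weights on vertices only */
        int numflag    = 0;      /* C-style numbering starting at 0 */
        int options[5] = {0};    /* 0 => use the METIS defaults */
        int edgecut;             /* returns the number of cut edges */

        METIS_PartGraphKway(&nvtxs, xadj, adjncy,
                            inLinkWeight, NULL /* no edge weights */,
                            &wgtflag, &numflag, &nparts,
                            options, &edgecut, part);
    }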

When nodes (or links) have several weights, the refinement algorithm is called multi-constraint partitioning. In the traffic flow simulation, this can best be understood by assuming that those weights refer to different time slices. In earlier investigations it had been found that there were some performance-wise differences when using the multi-constraint partitioning in METIS when applied to the so-called "Gotthard" scenario. In this scenario, 50'000 travelers/vehicles start, with a random starting time between 6 AM and 7 AM, at random locations all over Switzerland, and with a destination in Lugano/Ticino. Therefore, towards the end of the simulation, most of the vehicles accumulate on a couple of CPUs. This results in giving more workload to some nodes than to the others. The unbalanced workload can be uniformly distributed among the CPUs in the next iteration if METIS takes the workload of the nodes in different time slices into consideration.

The question is if, under such unbalanced circumstances, the network can be partitioned such that the load is equally balanced at all times. A counter-example would be a distribution where one CPU has nodes with a lot of traffic initially but no traffic later, and another CPU has no traffic initially, but a lot of traffic later. That simulation would run faster if both CPUs traded approximately half of their nodes: Then both CPUs would always be busy on about half of their nodes. This is exactly what multi-constraint partitioning attempts to achieve.

The multi-constraint partitioning is implemented in such a way that the computational load on each node per hour is recorded during the simulation run. Each load per hour corresponds to a constraint for the node. Thus, each node is specified by more than one constraint (the simulation runs for more than 3600 time steps, i.e. more than 1 hour). Then, these recorded hourly values are dumped into a file along with the corresponding node IDs, and the file is used in the next run by the domain decomposition process. For the ch6-9 scenario, which has a roughly uniform traffic load all over Switzerland, the multi-constraint partitioning does not yield a systematic improvement.

Recommendation: METIS

For demand scenarios that are uniformly distributed over the graph data, the default partitioning technique of METIS, METIS PartGraphKway, can be employed. For non-uniformly distributed traffic demand, the algorithms which take the different weights into consideration should be preferred.

4.6 Conclusions and Discussion

The most important result of the investigations regarding parallelization is that there is a natural limit to computational speed on parallel computers which use Ethernet [77] as their communication medium, and that speed is about 370 updates per second. If a simulation uses 1-second time steps, then this translates into a real time ratio of about 370 (the maximum practical value is 300). This has important consequences both for real time and for large scale applications that need to be considered. Also, in contrast to other areas of computing, it seems that waiting for better commodity hardware will not solve the problem in this case: Latency is the technical reason for this limit.

One option to go beyond this limit is to use more expensive special purpose hardware. Such hardware is typically provided by computing centers, which operate dedicated parallel computers such as the Cray T3E [38], or the IBM SP2 [37], or any of the ASCI (Advanced Strategic Computing Initiative) [47, 69] computers in the U.S. An intermediate solution is the use of Myrinet [54], which this chapter shows to be an effective approach, both in terms of technology and in terms of monetary cost.

On the algorithmic side, the following options exist: First, for the queue simulation it is in fact possible to reduce the number of communication exchanges per time step from two to one. This should yield a factor of two in speed-up. Next, in some cases, it may be possible to operate with time steps longer than one second. This should in particular be possible with kinematic wave models, since in those models the backwards waves no longer travel infinitely fast. The fastest time in such simulations would be given by the shortest free speed link travel time in the whole system. In addition, one could prohibit the simulation from splitting links with short free speed link travel times, leading to further improvement.



In Section 4.1.2, task parallelization was briefly discussed. There it was pointed out that this will not pay off if the traffic flow simulation poses by far the largest computational burden. However, after parallelizing the traffic flow simulation, this is no longer true. Task parallelization would mean that, for example, the activity generator, the router, and the learning module would run in parallel with the traffic flow simulation. One way to implement this would be to not pre-compute plans any more, as is done in day-to-day simulations, but to request them just before the traveler starts. A nice side-effect would be that such an architecture would also allow within-day re-planning without any further computational re-design.

The most important conclusions for MATSIM can be drawn as follows:

- PC clusters should be preferred to parallel/vector computers.
- The communication hardware between the PCs of a cluster should be Myrinet technology, since it reduces the latency problem that exists on some other technologies such as Ethernet.
- MPI should be utilized because of its better-formed computational aspects and its better support.
- To minimize the contribution of the latency incurred by each message, several items must be packed into a single message.
- Packing several items into a single message should be implemented by using MPI Pack and MPI Unpack, since they are more robust compared to other C-type functions.
- The different types of domain decomposition provided by the METIS library should be selected according to the scenario used. The default method of METIS performs well when the traffic is, more or less, evenly distributed over the graph data.

4.7 Summary

The time consumption of large-scale applications can be diminished with the assistance of parallel programming. Today's systems bring together different cooperating modules. These modules can be distributed among different computing nodes to achieve task parallelization. Even the modules themselves, which extend the overall computing time of the system by their slowness, can be split such that each subpart handles only a part of the whole data (domain decomposition).

From a traffic flow simulation point of view, parallelization is achieved by decomposing the street network among computing nodes and distributing the agents according to the result of the decomposition. When sub-domains are not fully independent of each other, i.e., routes of some agents extend over several sub-domains, providing communication between sub-domains is unavoidable. Among several tools, MPI (Message Passing Interface) [51] is chosen because it yields better performance than the others, and because it gets continuous support from its developers.

Since each message exchange involves latency, which is a problem when the communication medium is Ethernet [77], exchanging only two messages per time step, one for declaring storage constraints and one for the vehicles' information, is able to handle the data flow on split links. Also, packing all vehicles which have the same destination computing node into a single message cuts back the contribution of latency to the time consumption.

Myrinet [54] is a good hardware alternative when one wants to avoid the latency caused by Ethernet, since latency is much lower on Myrinet. Hence, it lowers the communication cost.



Time  CPUs  GD      PW-Q         L-Q   CM    CL   Pack        DD
12s   d/62  vector  linked list  Ring  Myri  MPI  memcpy      default
36s   d/62  vector  linked list  Ring  Eth   MPI  memcpy      default
35s   s/32  map     multimap     Ring  Myri  MPI  memcpy      default
80s   s/32  map     multimap     Ring  Eth   MPI  memcpy      default
82s   s/32  map     multimap     Ring  Myri  PVM  memcpy      default
99s   s/32  map     multimap     Ring  Eth   PVM  memcpy      default
49s   s/16  vector  multimap     Ring  Myri  MPI  memcpy      default
51s   s/16  vector  multimap     Ring  Myri  MPI  MPI-Pack    default
54s   s/16  vector  multimap     Ring  Myri  MPI  MPI-Struct  default
35s   d/28  map     multimap     Ring  Myri  MPI  memcpy      default
29s   d/28  map     multimap     Ring  Myri  MPI  memcpy      SF-IL

Table 4.1: Summary table of the parallel performance results for different data structures of the traffic flow simulator.

When the communication cost is low, the computation usually needs to be improved. These improvements are made not only to the parallelization code but also to the sequential part of the program, as discussed in Chapter 3. In terms of parallelization, how data is packed is an issue requiring investigation. The choice among different methods for packing depends on how elegantly packing is achieved as well as on the time consumption of these methods. User-defined packing functions can be built in addition to the functions offered by the communication software.

Despite the explicit preparation effort required for making programs parallel, the parallelization of large-scale applications is inevitable for time/cost reasons. Economic issues lead to PC clusters instead of expensive special parallel computers.

Table 4.1 summarizes the most important performance numbers, which are collected when switching different parameters on. The abbreviations used in the table mean the following: CPUs is the number of CPUs, which can be double (two processes per computing node) or single; GD refers to the graph data; PW-Q shows which data structure is used for the parking and waiting queues; L-Q shows the data structure option for the link queues; CM means the communication medium (Myrinet or Ethernet); CL points out the communication library (MPI or PVM); Pack refers to the packing algorithm used during the tests (memcpy, MPI Pack, MPI Struct); DD shows the domain decomposition algorithm, i.e., default means using the default option of METIS and SF-IL means standard feedback using the number of incoming links of the nodes.



Chapter 5

Coupling the Traffic Simulation to Mental Modules

5.1 Introduction

Chapter 1 gives a description of the two-layer framework used to relax a congested system. The physical layer is where the agents interact with each other and the environment. This layer is the network loading part of DTA [19, 20, 27, 5] and it corresponds to the traffic flow simulator in the framework. The traffic flow simulator defines the interaction rules for the agents. These rules are defined in Chapter 2.

The second layer, the strategic layer, is where the agents make their strategies according to what they have experienced in the physical layer. For example, if agents experience congestion in the physical layer, some of the agents try to avoid the congestion next time by making new strategies in the strategic layer.

As seen in Figure 1.1, the physical layer of the framework exchanges plans and performance information with the strategy generation modules, which generate strategies for the agents in the system.

5.2 Coupling Modules via Files

5.2.1 Description of a Framework

A multi-agent learning method is implemented in a system called the "framework" to model the travel behavior of people in a geographical region during a certain period of time. The framework is composed of several modules with different tasks. There are different ways to couple these modules. This section explains coupling via files, where two files are prevalent: the plans file and the events file.

As its name implies, the most important entities in a multi-agent learning method are the agents. Each agent has attributes which impinge on its decisions. Decisions are made about the type, location and timing of activities, the routes between the locations of activities, etc. Moreover, each agent in the framework has a plan it follows. Each plan contains a score, which is calculated by the agent after the plan is executed. A plan can have several legs, each of which connects two activities. Each leg mainly carries the following information: the mode of transportation, the estimated trip time, the estimated start time of the trip and the list of graph nodes that the agent must traverse to arrive at the location of the end activity.

    <person id="6357250">
        <plan>
            <act type="h" x100="387345" y100="276590" link="14584" />
            <leg mode="car" dep_time="07:00" trav_time="00:30">
                <route>4902 4903 4904 4905 4906</route>
            </leg>
            <act type="w" x100="388689" y100="279136" link="14606" dur="08:00" />
            <leg mode="car" dep_time="16:30" trav_time="00:15">
                <route>4905 4903</route>
            </leg>
            <act type="h" x100="387345" y100="276590" link="14584" />
        </plan>
    </person>

Figure 5.1: An example plan in the XML format

Figure 5.2 shows the components of the framework and how data is moved between these components. The modules here are coupled via files. A complete initial plans file is fed into the traffic flow simulation(s) to be executed in the first iteration. Plans are written in the XML [97] format. Issues regarding the usage of different input formats are discussed in Section 3.4. A typical plan in the XML format is given in Figure 5.1.

In the figure, the plan of agent 6357250 has two legs: The agent leaves home, which is on link 14584, at 7 AM and goes to work by car. On the way to work, the agent goes through the 5 nodes denoted in the "route" attribute of the leg. This trip is expected to take 30 minutes. When the agent gets to work, it works for 8 hours, then drives back home via 2 nodes. The resolution of the x and y coordinates of locations is based on 100x100 meter blocks of census information. This is why they are named x100 and y100.

The distribution of agents is accomplished via domain decomposition as explained in Section 4.3.1. In Figure 5.2, the arrow between the two traffic flow simulations shows the communication between them, i.e., the message exchanges as mentioned in Section 4.3.2.

In the framework, the entire output of the simulation consists of events, which are output directly when they happen. For example, an agent can depart, can enter/leave a link, etc. The traffic flow simulations just write all kinds of events because they do not aggregate data; instead, this is done by the other modules themselves. The router in the framework, for example, uses these events to compute the link travel times by recording the times of link entering/leaving events. Separating data aggregation from the simulation philosophically means that the simulation is responsible for the correctness of the simulation, whereas the other modules, such as the router and the agent database, are responsible for the correctness of the data aggregation.

One of the main modules in the framework is called the Agent Database. Agents in the agent database keep plans and the scores of plans. They decide which plan to use in the next iteration (the next day) in one of the following ways (a sketch of the first option follows the list):

- select a random plan based on scores,
- request new routes (from the router) with a probability,
- request a change in activities (from the activity generator) with a probability.
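As an illustration of the first option, a score-based random selection can be sketched as follows. This is a minimal example assuming selection probabilities proportional to exp(beta * score); it is not the actual MATSIM selection code:

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // Pick a plan index with probability proportional to exp(beta * score).
    int selectPlan(const std::vector<double>& scores, double beta) {
        std::vector<double> w(scores.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < scores.size(); ++i) {
            w[i] = std::exp(beta * scores[i]);
            sum += w[i];
        }
        double r = sum * std::rand() / RAND_MAX;   // uniform draw in [0, sum]
        for (std::size_t i = 0; i < w.size(); ++i) {
            if ((r -= w[i]) <= 0.0) return (int)i;
        }
        return (int)w.size() - 1;                  // guard against rounding
    }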

In the first iteration (when only one plan per agent exists), 100% of the initial plans are used to create the plans file read by the traffic flow simulators. Both traffic flow simulators in Figure 5.2 write an events file during the execution of the plans. The events are read by the strategy generation modules, namely the router and the agent database. (In the current version, the activity generator does not read the events file, but in future versions it will; the dotted arrow in the figure illustrates this situation.) The agents in the agent database calculate the scores of the plans based on the events.

[Figure: data-flow diagram of the framework. The mental layer contains the activity generator, the agent database and the router; the physical layer contains the two coupled traffic simulators. A 100% initial plans file feeds the first iteration; each iteration produces an events file read by the mental layer and a 100% plans file executed by the traffic simulators; the labeled flows read 10%, 10%, 20%, 20% and 100%.]

Figure 5.2: Physical and strategic layers of the framework coupled via files.

If an agent decides (with a probability) to modify its activities, the activity generator is informed. The activity generator mutates the end time and duration of the activities of the agent and provides the modified activities back to the agent database. The agent database then informs the router about the changes in the activities so that the router can create new routes between the modified activities.

If an agent decides (with a probability) to get new routes, they are requested from the router. Specific types of events, namely entering and exiting a link, are used by the router to calculate link travel times, which give information about congestion in the physical layer. The router uses this information to change the routes of the agents which have made a request and provides the modified plans to the agent database.

In Figure 5.2, the coupling via files is illustrated. 10% of the agents decide to change the timing of their activities and 10% of the agents decide to get new routes. Hence, the router gets requests to change a total of 20% of all routes.

When the router gives the newly created plans back to the agent database, the agent database merges these new plans with the plans that the agents selected based on scores, to create a 100% plans file for the next iteration.

Each iteration corresponds to a "day"; therefore, at the end of each day, new plans for some agents are re-computed for tomorrow based on today's experiences. Thus, the system implements day-to-day planning.

The advantage of using events as feedback data is that they are very easy to implement in the traffic flow simulation of the framework. The events format can be plain text or XML; the advantages and disadvantages of both are discussed in Section 6.6. An example of an XML event looks like this:

    <event time="06:00" type="departure" veh_id="6465" legnum="0" link="1523" from="3827" />

which means that at 6 AM, the agent numbered 6465 leaves link 1523, whose upstream end is located at node 3827. This event occurs while executing leg number 0 of the agent.

To sum it up: all agents execute their plans in each iteration simultaneously in the physical layer by interacting with each other (multi-agent) and with the environment; they record the performance values of their experiences from the iterations; the performance records are used to update the agents' mental state (learning).

5.2.2 Performance Issues of Reading an Events File

Events

As mentioned above, the events generated by the traffic flow simulators are fed back to different modules in the framework. The router and the agent database (and the activity generator in future versions) read these events.

Events files are very large. For example, the "ch6-9" scenario (Section 2.5) generates a raw events file of 2 GBytes, which includes approximately 53 million events. Therefore, it is worthwhile to investigate different reading algorithms for events.

Each raw event is described by a set of numbers. The example below means that at 06:00:36 AM (second 21636 of the simulated day), vehicle 6381934 has departed (specified by the flag 6) on link 17 (whose from-node is 1000) while executing leg 0 of its plan.

    21636 6381934 0 17 1000 6

Original implementation: Reading events into an STL-map

As explained in Section 2.2.5, the events generated by the traffic flow simulators are of one of the following types: "entering the simulation/departure", "moving from the waiting queue to the link", "entering a link", "leaving a link", "being stuck in congestion for a specific time period and leaving the simulation afterwards" and "arrival at the final destination". All these different types are generated by the traffic flow simulators since the traffic flow simulators do not involve any data aggregation, i.e., when an event occurs, the traffic flow simulator simply dumps it into a file.

The strategy generation modules, which read the events, make distinctions between the different event types. For example, from the viewpoint of the router, only entering and exiting a link are interesting, since they are used to calculate link travel times.

The original implementation of the router reads the data for each event into an STL (Standard Template Library, Section 3.3.1) vector using the input stream operator >> of C++ [80]. If the event data is of the type "entering a link", then an actual event is created by extracting the values from the STL-vector, and the event is inserted into a C++ container map. If an event is of the type "exit a link", the corresponding enter-a-link event is found in the STL-map container. The link travel time is calculated using these two event timestamps and is added to the corresponding link's travel time and time bin. Then, the event in the container is deleted. The code is given in Figure 5.3.

The events input file used during the tests with the code in Figure 5.3 is about 700 MBytes in size and contains 18.5 million raw-written events. However, keeping that many events in an STL-map data structure suffers from excessive memory usage. In addition, using an intermediate string vector prior to the data conversion contributes considerably to the low performance.

    // define a map for enter events;
    // it has two keys, vehicle ID and link ID
    typedef map< pair<int, int>, Event* > eventMapType;
    eventMapType eventMap;

    while (not EOF) {
        read a line from the events file
        retrieve the values into a vector
        extract values (vehID, linkID, etc.) from the vector
        if (flag is "enter a link") {
            // create a new event with the extracted values
            thisEvent = new Event(values);
            // insert this event into the map using (vehID, linkID) as key
            eventMap[make_pair(vehID, linkID)] = thisEvent;
        } else if (flag is "exit a link" or "arrival") {
            // find the corresponding enter-a-link entry
            eventMap.find(make_pair(vehID, linkID));
            if (enter event is found) {
                // calculate travel time
                travel time = exiting time - enter event time
                add it to the travel time of the corresponding time bin of the link
                delete enter event from map
            }
        }
    }

Figure 5.3: Reading events by using the STL-map

Reducing event processing overhead

In this section, it is tested what happens in terms of performance when some minimal data aggregation is already done in the traffic flow simulation. For this, it is useful to retrace the argument that led to the introduction of events files: In the original TRANSIMS [82] implementation, the simulation emits the aggregated link travel time data every 900 time steps. However, a major problem with that approach was that it necessitated always making the output of the traffic flow simulation fully consistent with the input to the strategy modules. For example, a module that needs arrival/departure times for activities needs completely different data than a router that needs link travel times. Also, using aggregated data invites the use of inconsistent aggregation approaches. For example, the traffic flow simulation in the original specification averages link travel times into time bins corresponding to link exit times, while the router preferably needs average link travel times for link entry times. Using aggregated data in the exchange between the traffic flow simulation and the strategy modules means that every time a strategy module is interested in a different approach to data aggregation, the traffic flow simulation code needs to be modified.

In addition, the file size advantage of data aggregation is not as large as it seems: In the near future, high resolution networks having several hundred thousand links will be introduced, and emitting average link travel times for every link every 900 time steps will also create large amounts of data.

Therefore, an intermediate approach is tested. This approach avoids the memory allocation<br />

64


ifstream eventsfile
while (not EOF)
    read a line from events file
    retrieve the values into vector
    extract values (including eventTime
        and enterTime) from vector
    if (the event flag is "leave link")
        // calculate travel time
        travel time = eventTime - enterTime
        add it to the travel time of
            the corresponding time bin of the link

Figure 5.4: Reading events by using C++ operator >>
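A minimal C++ sketch of this reader is given below, assuming the field order used later in Figure 5.5 (eventTime, vehicleID, legNumber, linkID, fromNodeID, eventFlag, enterTime); the file name and the flag value are again illustrative.

#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream eventsFile("events.txt");   // hypothetical file name
    std::string line, field;
    while (std::getline(eventsFile, line)) {
        // read the values of the line into a temporary string vector via >>
        std::istringstream iss(line);
        std::vector<std::string> v;
        while (iss >> field) v.push_back(field);
        if (v.size() < 7) continue;           // skip malformed lines
        // convert only the fields needed for the travel time calculation
        double eventTime = std::atof(v[0].c_str());
        int    eventFlag = std::atoi(v[5].c_str());
        double enterTime = std::atof(v[6].c_str());
        if (eventFlag == 2) {                 // assumed code for "leave link"
            double travelTime = eventTime - enterTime;
            // ... add travelTime to the time bin of the link ...
        }
    }
    return 0;
}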

This approach avoids the memory allocation of the STL-map data structure during the event reading phase, but apart from that it leaves the use of events for data exchange intact. Note that the STL-map data structure is only necessary for the temporary storage of link entry events for which the corresponding link exit event has not yet been found. Therefore, if the necessary information can be merged into a single event, the problem is resolved.

This can be achieved by having the vehicle (or agent) in the traffic flow simulation memorize its own link entry event. The link entry event is then no longer emitted by the traffic flow simulation; instead, the link exit event is expanded, as shown below (using the XML syntax, although plain text was used in the benchmarks).

<event type="exit" id="123" time="09:03:01" link_id="456" travel_time="00:01:03" />

This example denotes a link exit event at 9h 03' 01'' from link number 456 by agent ID 123, with the agent having been on the link for one minute and 3 seconds. From this, any module can reconstruct the same data as before; the only differences are that the link entry event is reported only implicitly, and at some later point in time.

This is called "on the fly" in the following. When reading events, the values are still read into an STL-vector using the >> operator of C++, and then the STL-vector is accessed to retrieve the relevant values, as shown in Figure 5.4. The reduced events file has a size of 400 MBytes and contains about 10 million events.

Using C instead of C++ file input syntax

The last implementation gets rid of the temporary STL-vector. The events can be read using the C library functions strtod and strtol instead of the C++ >> operator. The events are read line by line as strings; then these two functions are used to parse the values, i.e., to convert the values from strings to appropriate types like double or integer. In this implementation, the strtol and strtod functions can also be replaced by the functions atoi and atof, which take character arrays and convert them into other types. Examples showing how to use these functions are given in Figure 5.5. In these two implementations, the file size is also reduced to 400 MBytes, and the file only contains 10 million events.


char myline[MAXSIZE];
while (not EOF)
    get a line from the events file into myline
    set pointer myptr to point to beginning of myline

    // ATOF/ATOI CASE
    read eventTime with atof(myptr)
    move myptr forward to the first blank
    read vehicleID with atoi(myptr)
    move myptr forward to the first blank
    read legNumber with atoi(myptr)
    move myptr forward to the first blank
    read linkID with atoi(myptr)
    move myptr forward to the first blank
    read fromNodeID with atoi(myptr)
    move myptr forward to the first blank
    read eventFlag with atoi(myptr)
    move myptr forward to the first blank
    read enterTime with atof(myptr)
    move myptr forward to the first blank
    // ATOF/ATOI CASE

    // STRTOD/STRTOL CASE
    read eventTime with strtod(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read vehicleID with strtol(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read legNumber with strtol(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read linkID with strtol(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read fromNodeID with strtol(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read eventFlag with strtol(myptr,&pEnd)
    move myptr forward to the position of pEnd
    read enterTime with strtod(myptr,&pEnd)
    move myptr forward to the position of pEnd
    // STRTOD/STRTOL CASE

    if (the event flag is "leave link")
        travel time = eventTime - enterTime
        add it to the travel time of
            the corresponding time bin of the link

Figure 5.5: Reading events by using atoi/atof or strtod/strtol
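The sketch below shows the strtod/strtol variant of Figure 5.5 as compilable C++. Since both functions skip leading whitespace and return the end position in pEnd, the pointer can simply be advanced from field to field; the file name and the flag value are illustrative assumptions.

#include <cstdio>
#include <cstdlib>

int main() {
    const int MAXSIZE = 256;
    std::FILE* eventsFile = std::fopen("events.txt", "r");  // hypothetical name
    if (!eventsFile) return 1;
    char myline[MAXSIZE];
    while (std::fgets(myline, MAXSIZE, eventsFile)) {
        char* p = myline;
        char* pEnd;
        double eventTime  = std::strtod(p, &pEnd);     p = pEnd;
        long   vehicleID  = std::strtol(p, &pEnd, 10); p = pEnd;
        long   legNumber  = std::strtol(p, &pEnd, 10); p = pEnd;
        long   linkID     = std::strtol(p, &pEnd, 10); p = pEnd;
        long   fromNodeID = std::strtol(p, &pEnd, 10); p = pEnd;
        long   eventFlag  = std::strtol(p, &pEnd, 10); p = pEnd;
        double enterTime  = std::strtod(p, &pEnd);
        if (eventFlag == 2) {                 // assumed code for "leave link"
            double travelTime = eventTime - enterTime;
            // ... add travelTime to the time bin of linkID ...
        }
    }
    std::fclose(eventsFile);
    return 0;
}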

Results

The results of the four implementations are shown in Table 5.1; 18.5e6 and 10e6 are the numbers of events processed. "On the fly" means that no supplementary data structure is used for temporary purposes, as explained above.


                 >>, map,    >>, on the fly,   strtod, on the fly,   atof, on the fly,
                 18.5e6      10e6              10e6                  10e6
Memory Usage     185.00 MB   5.13 MB           5.12 MB               5.11 MB
Reading Time     11 mins     5.3 mins          29 secs               28 secs

Table 5.1: Performance results for reading the events file

The STL-map version uses up the most memory space. Eliminating the STL-map results in an improvement of about 95% in terms of memory usage. Where the reading time is concerned, getting rid of the temporary string vector, into which the data is read by using >>, gives the better performance. For example, without the STL-map, a transition from the C++ style of input parsing (>>) to C-style input parsing results in a performance increase of 91%. The atoi-type and the strtol-type C functions do not yield any difference in performance.

In terms of the whole approach, these differences are huge. MATSIM [50] currently needs about 2 hours per iteration, where the events file is read twice (once by the agent database and once by the router). Using an efficient approach to events files, as described above, would reduce the time per iteration to less than 1 hour 40 min.

Recommendation: Reading Raw Events

If raw events are read from a file, the implementation choice is between extensibility and performance. Using an STL-vector to store the read values as strings and converting those strings to the appropriate data types performs worst, but its extensibility pays off. Instead of using an STL-map to store the events and an STL-vector to read them, the traffic flow simulator of MATSIM should perform some data aggregation to reduce the overhead resulting from the STL's structures.

5.2.3 Performance Issues of Plan Writing

In the current implementation of the framework, the agent database writes the plans file and the traffic flow simulator(s) read it. Different reading approaches for plans and their performance figures are explained in Section 3.4.3. When the ch6-9 scenario (with 1 million agents) is used, the writing performance is recorded as follows:

- Where raw plans are concerned, the format of the output file is column-based structured text; hence, the file consists only of the data values. The data is written by the C++ operator << in 17 seconds after the data is retrieved from memory in 2 seconds. Therefore, writing 1 million plans is completed in 19 seconds.

- Where XML plans are concerned, the data values have to be written in valid XML tag form with self-explanatory attributes. Prior to writing the XML plans, the data retrieval from memory takes 123 seconds. Then, the data values are written into a file by forming XML tags in 149 seconds. Consequently, the total time spent for writing XML plans is 272 seconds.


5.3 Other Coupling Mechanisms

Coupling modules via files is a rather old technology; in the area of traffic flow simulation, it was taken from TRANSIMS [82]. The main advantage of files is:

- Modules can be coupled even if they run under different operating systems or use different programming languages.

If the files are in addition plain ASCII, a further advantage is that

- files can be easily read and changed for debugging and specific studies.

The main disadvantages are that this is a fairly slow technology and that one needs considerable resources in terms of disk space. This gets even worse if one uses plain ASCII instead of some binary format. In the case of the traffic flow simulation, disk I/O for module coupling easily accounts for more than 50% of the computing time.

This lets one look for alternatives.

5.3.1 Module Coupling via Subroutine Calls

The arguably best established method to couple computational modules is to use subroutine calls. Combining, say, agent database, simulation, and router could look as follows:

- Start the agent database, which reads an agent file with initial plans etc.

- The agent database calls the traffic flow simulation, with a pointer/reference to the agents' plans as an argument, and a pointer/reference to some memory area to store the events. E.g.

      Plans plans = new Plans();
      read_agent_file(plans);
      Events events = new Events();
      run_traffic_simulation(plans, events);
      ...

- The agent database then calls the router in a similar way:

      ...
      run_router(plans, events);
      ...

- Etc.

Obviously, any other method to transmit information between modules, for example via a global class, can be used.

An additional advantage of this approach is that it allows, with relatively small modifications, within-day re-planning. One possibility for this, which would completely follow the design from above, would be to advance the simulation only minute-by-minute, and to run the re-planning modules in between. An example is shown in Figure 5.6.

The main disadvantages of it are:

- It works only if all modules run on the same operating system.

- It is easy only if all modules use the same programming language.


while (not finished) {
    advance_traffic_simulation_by_one_minute(plans, events);
    for (all replanning modules)
        run_replanning_module();
}

Figure 5.6: Coupling via subroutine calls during within-day re-planning

- It is efficient only if all modules share the same internal representation of plans and events.

The subroutine call approach is also no longer as simple once the traffic flow simulation uses parallel computing: There needs to be some mechanism that transmits the plans from the calling module (say, the agent database) and transmits the events back. This could, for example, be achieved by messages between the master and the slaves of the parallel traffic flow simulation, but it means that an additional technology beyond simple subroutine calling needs to be employed.

The third item is the most difficult and technical of the three. For illustration, let us assume that the three modules were developed by three different teams, without the initial intention of coupling them. In consequence, all three modules will have different internal representations of plans and events. In order to allow communication, the three teams need to decide on the internal representation that is used in the subroutine calls. Let us assume that they agree to use the internal representation of the agent database. This means that, say, the traffic flow simulation, when receiving the call, needs to go through all plans and convert the relevant information to its own internal representation. This needs to be done for all modules.

Admittedly, using an XML representation does not fully avoid the problem: Here, too, one needs to agree on a common format, or at least a common structure of the file. Still, there are fewer options (in particular no choice between pointers, references, or direct objects) and no inter-language issues, and XML parsers are relatively easy to write.

5.3.2 Module Coupling via Remote Procedure Calls (e.g. CORBA, Java RMI)

An alternative to files is to use RPC (Remote Procedure Calls) [33]. Such systems, of which CORBA (Common Object Request Broker Architecture) [92] is an example, allow one to call subroutines on a remote machine (called the "server") in a similar way as if they were on the local machine (called the "client"). There are at least two different ways in which this could be used from the framework's viewpoint:

1. The file-based data exchange could be replaced by remote procedure calls. Here, all the information would be stored in some large data structures, which would be passed as arguments in the call.

2. One could return to the "subroutine" approach discussed in Sec. 5.3.1, except that the strategic modules could now sit on remote machines, which means that they could be programmed in a different programming language under a different OS.

Another option is to use Java RMI [43], which allows Remote Method Invocation (i.e. RPC on Java objects) in an extended way. Client and server can exchange not only data but also pieces of code. For instance, a computing node could manage the agent database and request from a specific server the code of the module to compute the mode choice of its agents. It is easier with Java RMI than with CORBA to have all nodes act as both servers and clients and thus to reduce communication bottlenecks. However, the choice of the programming language is restricted to Java.

It is important to note the difference between RPC and parallel computing, as discussed in Chapter 4. RPCs are just a replacement for standard subroutine calls; they are useful in the case that two programs that need to be coupled use different platforms and/or (in the case of CORBA) different programming languages. That is, the simulation would execute on different platforms, but it would not gain any computational speed by doing so, since there would always be just one computer doing work. In contrast, parallel computing splits up a single module onto many CPUs so that it runs faster. Finally, distributed computing attempts to combine the interoperability aspects of remote procedure calls with the performance aspects of parallel computing.

The main advantage of CORBA and other Object Broker mechanisms is to glue heterogeneous components together. Both the DynaMIT [17] and DYNASMART [18] projects use CORBA to federate the different modules of their respective real-time traffic prediction systems. The operational constraint is that the different modules are written in different languages on different platforms, sometimes coming from different projects. For instance, the graphical viewers typically run on Windows PCs, while the simulation modules and the database persistency are handled by Unix machines. Also, legacy software for data collection and ITS devices needs to be able to communicate with the real-time architecture of the system. Using CORBA provides a tighter coupling than the file-based approach and a cleaner solution for remote calls. Its client-server approach is also useful for critical applications where components may crash or fail to answer requests. However, the application design is more or less centered around the objects that will be shared by the Object Broker. Therefore, it loses some evolvability compared to, for instance, XML exchanges.

5.3.3 Module Coupling via WWW Protocols

Everybody knows from personal experience that it is possible to embed requests and answers into the HTTP [12] protocol. A more flexible extension of this would once more use XML. The difference to the RPC approach of the previous section is that for the RPC approach there needs to be some agreement between the modules in terms of objects and classes. For example, there needs to be a structurally similar "traveler" class in order to keep the RPC simple. If the two modules do not have common object structures, then one of the two codes needs to add some of the other code's object structures, and copy the relevant information into that new structure before sending out the information. This is no longer necessary when the protocols are entirely based on text (including XML); then there needs to be only an agreement on how to convert object information into an XML structure. The XML approach is considerably more flexible; in particular, it can survive unilateral changes in format. The downside is that such formats are considerably slower, because parsing the text file and converting it into object information takes time.

5.3.4 Module Coupling via Databases

Another alternative is to couple the modules via a database. This could be a standard relational database, such as Oracle [95] or MySQL [94]. Modules could communicate with the database directly, or via files.


The database would have a similar role as the XML files mentioned above. However, since the database serves the role of a central repository, not all agent information needs to be sent around every time. In fact, each module can actively request just the information that it needs, and (for example) only deposit the information that is changed or added.

This sounds like the perfect technology for multi-agent simulations. What are the drawbacks? The main drawback that appeared is that such a database is a serious performance bottleneck for large-scale applications with several millions of agents. This refers to a scenario where about 1 million Swiss travelers are simulated during the morning rush hour [66]. The main performance bottleneck was where agents had to choose between already existing plans according to the score of these plans. The problem is that the different plans which refer to the same agent are not stored at the same place inside the database: Plans are just appended at the end of the database in the sequence in which they are generated. In consequence, some sorting was necessary that moved all plans of a given agent together into one location. It turned out that it was faster to first dump the information triple (travelerID, planID, planScore) to a file and then sort the file with the Unix "sort" command than to first do the sorting (or indexing) in the database and then output the sorted result. All in all, on the ch6-9 scenario the database operations together consumed about 30 min of computing time per iteration, compared to less than 15 min for the traffic flow simulation. That seems unacceptable, in particular since one wants to be able to do scenarios that are about a factor of 10 larger (24 hours with 15 million inhabitants instead of 5 hours with 7.5 million inhabitants).

An alternative is to implement the database entirely in memory, so that it never commits to disk during the simulation. This could be achieved by tuning the parameters of the standard database, or by re-writing the database functionality in software. The advantage of the latter is that one can use an object-oriented approach, while using an object-oriented database directly is probably too slow.

The approach of a self-implemented "database in software" is indeed used by Urbansim [86, 96]. In Urbansim there is a central Object Broker/store which resides in memory and which is the single interlocutor of all the modules. Modules can be made remote, but since Urbansim calls the modules sequentially, this offers no performance gain, and since the system is written in Java [42], it also offers no portability gain. The design of Urbansim forces the module writers to use a certain canvas to generate their modules. This guarantees that their modules will work with the overall simulation.

The Object Broker in Urbansim originally used Java objects, but that turned out to be too slow. The current implementation of the Object Broker uses an efficient array storage of objects so as to minimize the memory footprint. The Urbansim authors have been able to simulate systems with about 1.5 million objects (Salt Lake City area).

In this work, an object-oriented design is used for a similar but simpler system, which maintains strategy information on several millions of agents. A system with 1 million agents, each with about 6 plans, needs about 1 GByte of memory, thus getting close to the 4 GByte limit imposed by 32 bit memory architectures [3].

Regarding the timing (period-to-period vs. within-period re-planning), the database approach is in principle open to any approach, since modules could run simultaneously and operate on agent attributes quasi-simultaneously. In practice, Urbansim schedules the modules sequentially, as in the file-based approach. The probable reason for this restriction is that there are numerous challenges with simultaneously running modules.


5.3.5 Module Coupling via Messages

Yet another approach is to couple the modules by using messages. For example, one could use MPI [51], as was done for the parallel traffic flow simulation in Chapter 4. There are essentially two paths one can take:

- Have each module run on a single CPU only, but use messages to communicate between the modules. In particular, use that mechanism to implement within-day re-planning. This path is investigated in detail by [29].

- Stick with day-to-day re-planning, but have the individual modules run in parallel. This path is investigated in the following chapters of this thesis.

5.4 Conclusions and Discussion

The framework shown here includes different modules at different conceptual layers. This framework is used when a congested system is to be transferred to a relaxed state. The transition from the initial state to the relaxed state requires improving the time schedules of the agents' plans compared to the old ones. This improvement involves different modules in the framework: the traffic flow simulation, the router, the agent database and the activity generator. The data (plans and events) exchanged between these modules can be managed in different ways.

If the data is provided via files, an agreement on the file format needs to be arranged. Files in structured text format are simple, but not generic when operations on the file are involved. Files in the XML [97] format might be slow when creating the data corresponding to the values read in, but their extensibility cannot be disregarded.

Coupling via files makes it possible to run different modules written in different languages on different operating systems. However, file operations involve disk accesses, which slow the system down.

Modules coupled via subroutine calls, despite the efficiency and simplicity of this approach, are restricted to being written in the same language and run on the same operating system. Moreover, when the data is split over parallel modules, additional effort, such as exchanging messages, is required.

Using RPC [33] is another alternative to coupling modules via files, but this method is usually tightly coupled and provides restricted choices (e.g. of programming language).

Yet another alternative to coupling modules via files is the utilization of XML at the HTTP [12] level, but it leads to the same arguments as stated in Section 5.2.1.

Databases can be used for coupling modules, but they usually become bottlenecks when a fairly large-scale application is concerned and the database is written to disk. Instead of writing it out to disk, the database can be kept in memory, but then memory constraints dominate the performance.

Thus, technologies providing interoperability between modules are emerging. The trade-off with the current technologies is between computational performance, effective usage of resources, and flexibility.

The design issues of a framework implementation for MATSIM can be summarized as follows:

- Using files to couple the different modules in the framework is preferred, since computing nodes with bigger disk space allow users to store large data sets that cannot fit into the available memory. However, it yields low performance because of the disk accesses.


- Subroutine calls are not chosen to couple the modules of MATSIM: although they give better results for data sets small enough to fit into the memory of a computing node, they restrict users to strict rules regarding the computing resources.

- In the Remote Procedure Call method, the calls are similar to subroutine calls but they are remote, i.e., the callee and the caller are on different computing nodes. MATSIM does not use RPCs to couple its modules since the RPC performance is low.

- Standard relational databases are avoided because they become a bottleneck when a real-world problem is to be solved.

- MATSIM should replace its current implementation of coupling modules via files with coupling modules via message exchanges. The importance of this method is explained in the next two chapters.

5.5 Summary

A replacement for the traditional four-step process has been explained. The framework overcomes the shortcomings of the four-step process by:

- activity-based demand generation, which generates a daily activity plan for each individual,

- employing DTA [19, 20, 27, 5] instead of static modeling to promote time-dependency.

A process called systematic relaxation is used to solve traffic dynamics with congestion. Systematic relaxation uses a multi-agent learning method based on iterations, in each of which plans are executed, their performance is recorded, and some of the routes are improved.

The framework is conceptually divided into two layers. The strategies generated at the strategic layer are executed by the physical layer (traffic flow simulation). Agents know more than one strategy. Among the available strategies, they can select one, or they can request a new route, or they can request modified timing information for their activities (in which case a new route is created accordingly).

The performance information on plans from the traffic flow simulation is given in terms of events. The traffic flow simulation is kept apart from data aggregation; hence it is only concerned with the correctness of the simulation. Data aggregation, such as link travel times and scores, is done in the strategy generation modules.

The coupling of the different modules can be accomplished via different methods, each of which has its own advantages and disadvantages, as explained in the subsections above. The current implementation of the framework uses a file-based approach; in the next two chapters, however, a new approach via exchanging messages is discussed.


Chapter 6

Events Recorder

6.1 Introduction

The events recorder (ER) is a module which collects the events generated during a simulation run. In the original implementation, the traffic flow simulator (TS) generates and writes^1 events into a file, and the other modules read the events from the file. Hence, the modules of the system are coupled via file I/O.

An alternative to the file-based coupling of the modules of a system is coupling them via messages. As should be clear from Chapter 5.2.1, from the viewpoint of a traffic flow simulation this entails two types of messages:

- Plans that are fed into the traffic flow simulation

- Events that are retrieved out of the traffic flow simulation

Since plans are more complicated structures, this text will consider events first; using messages for plans will be considered in Chapter 7. As was explained in Chapter 5.2.1, events are used by all strategy generation modules to extract performance information. For example, the router extracts link travel times, or a mental map extracts individual agents' paths.

The challenge within the present work is to consider the most typical cases that occur within a parallel simulation environment. In particular, one wants to investigate the situation when the traffic flow simulation is parallel in the sense of Chapter 4. Therefore, one might consider what happens with respect to events collection when a traffic flow simulation is distributed across several computing nodes. One might also consider what happens when the events are divided into subsets, and each subset is sent to a different ER. Such a situation is plausible when several nodes with disks are available and each ER can write its subset of the events to its own local disk ("distributed (events-)recording"). Another case where this is plausible is when there are multiple distributed agent databases, each one responsible for only a subset of the agents. Finally, one might consider the case when more than one ER receives the same full events information ("multi-casting"). Such a situation occurs when more than one module listens to the same stream of events.

Figure 6.1 shows two examples of the distribution of events onto ERs. In the examples, there are three TSs and two ERs in the system. The system is populated with three agents only.

^1 The events recorder can possibly write the events to a file, but the more probable application is that the strategy generation modules take the events information right away from the message stream and the events are never written to a file.


[Figure 6.1 appears here: two panels, (a) Distributed Recording and (b) Multi-casting, each showing three Traffic Simulators holding the agents A1, A2 and A3, connected to two Event Recorders.]

Figure 6.1: Interaction between TSs and ERs. (a) distributed recording – agents have dedicated ERs. (b) multi-casting – events are multi-cast to ERs.

These agents have different events occurring on different TS domains. For example, the execution of the plans of agents A2 and A3 generates events on all three TSs, while agent A1 has events occurring on only two TSs. The big thick arrows stand for the communication between the TSs and the ERs. Each dashed line originating from an agent indicates to which ER the events from that agent are reported.

In Figure 6.1(a), the agents are assigned to ERs in a round robin fashion (distributed recording). The events of agents A1 and A3 are reported to the same ER, whereas the events of agent A2 are collected by the other ER. Since the dedicated ER information is part of the agent itself, the agent carries this information along when it has to be moved to another TS according to the domain decomposition.

Figure 6.1(b) shows the same system without any dedicated ERs (multi-casting). In this case, all events of all the agents on the TSs are multi-cast to all the ERs.

6.2 The Competing File I/O Performance for Events

When events are read from and written into a file, the timing regarding the I/O performance is recorded as follows: 10 million raw events are read as strings into an STL (Standard Template Library, Section 3.3.1) vector in 332 seconds. The conversion from strings to the appropriate data types by using the atoi/atof functions of C takes 17 seconds. Hence, the total time for completing the reading is 349 seconds. Before writing an event into a file, the data values of the event have to be retrieved. The data retrieval for 10 million events is completed in 11 seconds. Then, they are written into a file by using the C++ operator << in 61 seconds. Thus, the total time for writing 10 million events is 72 seconds. As a result, the file I/O performance on raw events is measured as 421 seconds.

Similarly, for 10 million XML [97] events, reading and parsing are completed in 413 seconds via expat [21]. The data conversion from strings to the proper types is accomplished in 21 seconds by using the C functions strtol/strtod. Therefore, reading 10 million XML events is completed in 434 seconds. Prior to writing, the data values are retrieved in 10 seconds, and they are written by the C++ operator << as XML tags in 223 seconds, which gives a total writing time of 233 seconds. Consequently, the file I/O performance for 10 million XML events is measured as 667 seconds. In order to reduce the contribution of these I/O performance numbers, this chapter investigates passing events in messages.

6.3 Other Work

Technically, a multi-casting scenario can be realized by using (true) multi-casting, as was shown by [29]. However, that work also showed that (i) the standard multi-cast implementation is not useful for simulation work, since the arrival of the messages is not guaranteed; (ii) writing a protocol addition that makes the messages reliable is difficult; (iii) using (reliable) TCP/IP [79] instead has lower performance, since in contrast to true multi-casting it opens separate communication channels to each receiver; (iv) any solution based on standard Internet protocols typically runs on standard Local Area Network (LAN) hardware such as Ethernet [77], but often not on the specialized hardware provided by mini-supercomputers or supercomputers. Examples of such specialized hardware are Myrinet [54] and Infiniband [41]. (Myrinet provides a TCP/IP implementation, but it is non-standard, rarely used, and often not installed by computing centers.)

When coupling the modules of a system via messages, the message format and the transmission methods matter as well. Communication systems such as MPI [51] and PVM [63] offer high performance, but they lack support for flexibility, since both the sender and the receiver side must have a priori agreements on the format of the messages to be exchanged. Object serialization, in contrast, as offered by systems such as Java [42], CORBA [92] etc., and XML-type data formats provide somewhat more flexibility, but along with significantly lower performance.

PBIO (Portable Binary I/O) [25] offers a solution combining coupling flexibility and high performance. Its data format is similar to the XML format in that it carries the meta data information in the message. PBIO also benefits from reusing the receive buffer, as opposed to the MPI_Pack and MPI_Unpack routines' need for a second buffer to do the data conversions. In a heterogeneous environment, PBIO's low level data conversion functions perform the data conversion when necessary. PBIO gives the best performance when a homogeneous system is in question; in a heterogeneous environment, MPI and PBIO challenge each other. The MPI comparison tests in [25] reportedly involve MPI_Pack/MPI_Unpack. However, as explained in Sec. 6.11, MPI_Struct could perform better than memcpy and MPI_Pack/MPI_Unpack when fixed length data is to be exchanged. Moreover, it is also reported that PBIO does not provide any facility to detect under/overflow in the data conversion.

6.4 Benchmarks

In the tests presented here, a number of CPUs^2 between 1 and 24 runs a stub version of the traffic flow simulation (TS). A stub version is used in order to exclude the computing time of the traffic flow simulation itself from the benchmarks below. The stub version first reads all the pre-generated events from a file into memory before it starts any benchmarks. The stub version is constructed in such a way that the final set of events arriving at the events recorder is always the same, no matter what the number of CPUs is. The benchmark is set up as follows:

1. TSs read the pre-generated events from a file into memory.

2. TSs pack the events.

3. The packed events are sent to ERs.

4. Each ER receives the events.

5. The received events are unpacked into memory.

6. ERs write the events from memory into a file.

The gettimeofday() function is used to measure the time spent during the operations on the events. It returns the current time by reading the TSC (time stamp counter), which is incremented with each CPU clock cycle.

Myrinet is used as the main communication medium. The measurement of the sending time for the events is also repeated for 100 Mbit Ethernet [77]. In order to get performance predictions regarding the transmission time, PMB (Pallas MPI Benchmark) [30] is used. That benchmark provides the results of the so-called ping-pong test, which sends a packet to a receiver that sends it immediately back; many different packet sizes are tested. The results of these tests are the latency and the bandwidth numbers.
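As an illustration of what such a measurement does, the following is a minimal MPI ping-pong sketch in C++ (not the PMB code itself); the packet size and the repetition count are arbitrary choices for this sketch.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int SIZE = 140 * 1000;     // packet size in bytes (illustrative)
    const int REPS = 1000;           // repetitions to average out noise
    std::vector<char> buf(SIZE);
    MPI_Status status;

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; ++i) {
        if (rank == 0) {             // "ping" side
            MPI_Send(&buf[0], SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf[0], SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {      // "pong" side: echo the packet back
            MPI_Recv(&buf[0], SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf[0], SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneWay = (MPI_Wtime() - t0) / (2.0 * REPS);  // half round trip
    if (rank == 0)
        std::printf("effective latency for %d bytes: %g secs\n", SIZE, oneWay);
    MPI_Finalize();
    return 0;
}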

In order to mimic the communication patterns of a real-world situation, the following is done with respect to the distribution of the events onto the TSs and onto the ERs in the case of distributed events-writing:

1. The agents are distributed among the ERs in a round robin fashion.

2. Domain decomposition of the street network takes place on the TSs as described in Chapter 4.
3. TSs read those events that occur on their part of the domain.

4. During the events sending, the TSs send events only to those ERs that are "responsible" (in the case of distributed recording), or all events to all ERs (in the case of multi-casting).

Note that the events are not distributed uniformly across the domains, and therefore the TSs will have different numbers of events to process. This corresponds to what will later be used in practice. Since every agent has a different number of events, the total number of events received on the distributed ERs is also different.

^2 The cluster used here is composed of dual-CPU computational nodes. Tests are done using only one CPU of each computational node unless otherwise specified.

In order to minimize the effect of the non-uniform distribution of events among the TSs (a result of the domain decomposition), the packing time values shown in the benchmarks are obtained as follows: By noting the number of events on each TS and the benchmark times, the final benchmark time values are calculated on all TSs as if the same number of events occurred on every TS. For example, when 2 TSs are used, the numbers of events on the TSs are approximately 2.5 and 7.5 million. If the events had been distributed perfectly, each TS would have been responsible for 5 million events. Therefore, the packing time for 5 million events on each TS is calculated by scaling the actually measured packing times linearly to 5 million events, and the maximum of these scaled values is the one plotted in the figures. Since the cluster is composed of computing nodes of the same type, the behaviors of the computing nodes usually do not differ much among themselves. One should also note that the packing of events by a computing node is independent of the other computing nodes.

When the measured time involves data transmission (such as send and receive), such an interpolation is not possible, since the transmission time includes the waiting time for packets to be received. For a computing node, the time elapsed while waiting for the receiver side to start receiving depends on how much the receiver is occupied by the other computing nodes. Packets are sent immediately after they are packed by a computing node; therefore, the receiving sequence of the receivers is determined by the packing sequence of the senders. In conclusion, when the transmission time is involved in a measurement, the highest "true" number among the computing nodes is taken as the result, without any interpolation. In the same experiment as above, this means that the times to pack and send 2.5 and 7.5 million events are measured, the corresponding packing times are subtracted from them to obtain the sending times, and the larger value is selected as the total time to complete collecting all 10 million events.

6.5 Test Case

The scenario used is the ch6-9 scenario, as described in Section 2.5. This real world scenario is used for the reasons explained in Section 2.4. The scenario generates approximately 23 million events during its simulation of three hours of traffic. Because of the memory constraints of the computing nodes, the first 10 million of these events were used for the benchmarks.

6.6 Raw vs. XML Events

In this work, two types of events transmission are tested: raw events and events in XML [97] form (for general remarks on XML, see Section 3.4). Both types have their own advantages and disadvantages. Raw events are simple and fast, but they need to be packed as bytes on the sending side. Similarly, an unpacking routine needs to be run on the receiving side to convert the bytes into the proper types.

XML events, on the other hand, are slower but more generic. The packing routine is much simpler, since it uses string functions on the sending side. If the modules of a system are coupled via files, then no unpacking is necessary on the receiving side for writing the events into a file. Since XML is a plain ASCII format, the XML events are written directly into the file as they are (no processing is necessary, as opposed to the raw events).

6.7 Buffered vs. Immediate Reporting of Events

Besides the two types of events generated (XML and raw), there are also two ways of reporting events.

6.7.1 Reporting Buffered Events

The first way of reporting events is to report them in chunks. Events are added to a buffer, which is limited in size. When a certain number of events, SENDSIZE, is hit, the whole buffer is sent as one message to the ER. When the ER gets the message (the big buffer), it unpacks the events and then writes them to a file. In the tests, SENDSIZE is defined as 5000 events. A buffer size of 5000 is used since this is a good trade-off between memory consumption and computational performance. In addition, some tests with a buffer size of 10000 events indicated no difference in performance. The procedure is given in Figure 6.2(a).

The ER tries to receive packets all the time, since it does not know the exact time at which an event/message occurs. When the simulation is done, which means that no more events will be generated, all the TSs in the system notify all the ERs. At this point the ER finishes. Figure 6.2(b) gives the pseudo code of the ER's actions.

6.7.2 Immediately Reported Events

The second way of reporting events is to report an event immediately after it is generated. Figure 6.3(a) shows how events are reported immediately. The procedure is very similar to that of the buffered events case, except that SENDSIZE equals 1: After the events are read into memory, the time measurement is switched on. Then the events are packed and sent one by one. After all the events have been sent, the time measurement is switched off.

On the receiving (ER) side, first the time measurement is switched on. Then the main procedure starts its execution. At this point, the ER receives only one event per message, unpacks it into memory and writes it into the file. After being informed that no more events will be generated, the time measurement ends. The algorithm for collecting immediate events by an ER is shown in Figure 6.3(b).

When reporting events immediately, the message passing suffers from processing overhead, since a buffer with only one event is too small a message to be sent efficiently. The obtained results are given in Section 6.10.

6.8 Theoretical Expectation for Buffered Events

In this case, the measurements are taken based on cumulative events. The number of events forming a single message is 5000.


Algorithm A – Traffic Simulator Reporting Buffered Events

read events into memory
time measurement starts
while not all events processed do
    pack events from memory into a buffer
    if number of packed events hits SENDSIZE
        send buffered events to ER
end while
time measurement ends
inform ERs that all events have been reported

(a) The Traffic Simulator

Algorithm B – Events Recorder Collecting Buffered Events

time measurement starts
while not all events collected do
    listen to MPI port
    if a packet arrived then
        receive packet which contains SENDSIZE events
        unpack SENDSIZE events into memory
        write SENDSIZE events into file
    end if
end while
time measurement ends

(b) The Events Recorder

Figure 6.2: Buffered Events Case. (a) Traffic Simulator Code (b) Events Recorder Code
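As an illustration, the following C++/MPI sketch shows how the two sides of Figure 6.2 might exchange a buffer. The message tags, the 28-byte raw event layout (cf. Section 6.8) and the termination handshake are assumptions of this sketch, not the actual thesis implementation.

#include <mpi.h>
#include <vector>

const int SENDSIZE    = 5000;      // events per message
const int EVENT_BYTES = 28;        // 1 double + 5 ints (see Section 6.8)
const int TAG_EVENTS  = 1;
const int TAG_DONE    = 2;         // "no more events" notification

// TS side (Algorithm A): ship one full buffer of packed events to an ER
void sendBuffer(std::vector<char>& buf, int erRank) {
    MPI_Send(&buf[0], SENDSIZE * EVENT_BYTES, MPI_BYTE,
             erRank, TAG_EVENTS, MPI_COMM_WORLD);
}

// ER side (Algorithm B): receive packets until every TS has reported "done"
void collectEvents(int numTS) {
    std::vector<char> buf(SENDSIZE * EVENT_BYTES);
    int done = 0;
    MPI_Status status;
    while (done < numTS) {
        // listen for any packet from any TS
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_DONE) {
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &status);
            ++done;
        } else {
            MPI_Recv(&buf[0], SENDSIZE * EVENT_BYTES, MPI_BYTE,
                     status.MPI_SOURCE, TAG_EVENTS, MPI_COMM_WORLD, &status);
            // ... unpack SENDSIZE events from buf and write them to file ...
        }
    }
}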

Algorithm A – Traffic Simulator Reporting Events Immediately

read events into memory
time measurement starts
while not all events processed do
    pack an event from memory into a buffer
    send buffer with one event to ER
end while
time measurement ends
inform ERs that all events have been reported

(a) The Traffic Simulator

Algorithm B – Events Recorder Collecting Events Immediately

time measurement starts
while not all events collected do
    listen to MPI port
    if a packet arrived then
        receive packet which contains one event
        unpack the event into memory
        write the event into file
    end if
end while
time measurement ends

(b) The Events Recorder

Figure 6.3: Immediate Events Case. (a) Traffic Simulator Code. (b) Events Recorder Code



Packing a Raw Event

memcpy(buffer, event time)
memcpy(buffer, event type)
memcpy(buffer, vehicle ID)
memcpy(buffer, leg number)
memcpy(buffer, link ID of link on which event occurred)
memcpy(buffer, from node of link)
increment the number of events in the buffer

Figure 6.4: Pseudo Code for Packing a Raw Event. Each event pack involves 6 memcpy calls.
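Written out in C++, the packing of Figure 6.4 might look as follows; the Event struct mirrors the field list of Section 6.8, while the function name is ours.

#include <cstring>

struct Event { double time; int type, vehId, legNum, linkId, fromNode; };

// pack one raw event into the send buffer with 6 memcpy calls (Figure 6.4);
// returns the number of bytes written: 8 + 5*4 = 28
int packRawEvent(const Event& e, char* buffer) {
    int off = 0;
    std::memcpy(buffer + off, &e.time,     sizeof(double)); off += sizeof(double);
    std::memcpy(buffer + off, &e.type,     sizeof(int));    off += sizeof(int);
    std::memcpy(buffer + off, &e.vehId,    sizeof(int));    off += sizeof(int);
    std::memcpy(buffer + off, &e.legNum,   sizeof(int));    off += sizeof(int);
    std::memcpy(buffer + off, &e.linkId,   sizeof(int));    off += sizeof(int);
    std::memcpy(buffer + off, &e.fromNode, sizeof(int));    off += sizeof(int);
    return off;
}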

Packing an XML Event

create a char array using sprintf to write the values of an event
memcpy(buffer, char array created)
increment the number of events in the buffer

Figure 6.5: Pseudo Code for Packing an XML Event. Each event pack creates a char array with the values of the event.
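The XML counterpart of Figure 6.5, sketched in C++ below, formats the event with sprintf and copies the resulting string into the send buffer. The attribute names follow the example tag shown in Section 6.8.1; encoding the type attribute as an integer is a simplification of this sketch.

#include <cstdio>
#include <cstring>

struct Event { double time; int type, vehId, legNum, linkId, fromNode; };

// pack one event as an XML tag (Figure 6.5); at most ~120 bytes per event
int packXmlEvent(const Event& e, char* buffer) {
    char tag[160];
    int len = std::sprintf(tag,
        "<event time=\"%.0f\" type=\"%d\" vehid=\"%d\" legnum=\"%d\" "
        "link=\"%d\" from=\"%d\" />",
        e.time, e.type, e.vehId, e.legNum, e.linkId, e.fromNode);
    std::memcpy(buffer, tag, len);
    return len;
}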

Each event contains 1 double value (for the event's time) and 5 integer values (for the vehicle ID, the leg number, the from-node of the link, the link ID and the event type).

6.8.1 Packing Time Prediction

Raw Events

The packing of raw events is done by using the C library function memcpy. Pseudo code for packing a raw event is given in Figure 6.4. The memcpy function is called for each integer and each double value of an event.

A clock cycle counter program [35] shows that packing a raw event, with its 5 integers and 1 double value, by the corresponding memcpy calls costs approximately 400 clock cycles per event.

As described in Section 4.5, the cluster nodes used for the benchmarks have PIII 1 GHz CPUs. Given the CPU speed of 1 billion cycles per second, the execution time for packing 10 million raw events will be:

    400 cycles × 10e6 events / 1e9 cycles per sec = 4.0 secs

XML Events

Packing an XML event is achieved by using the sprintf and memcpy functions. The algorithm in Figure 6.5 gives the pseudo code. An example of an XML event is given in the following:

<event time="21636" type="departure" vehid="6381" legnum="0" link="17" from="1000" />

which means that at time 6:00:36AM, vehicle 6381, executing leg 0, has departed from link 17 (whose from-node is node 1000).

Initially, the XML events were packed by using the stringstream functions of the STL (Standard Template Library, Section 3.3.1). However, the low performance of the stringstream functions resulted in a newer implementation. In the newer implementation with the sprintf and memcpy functions, the number of clock cycles is approximately 9500 per event, most of which is used by the sprintf function. This is why packing XML events is slower than packing raw events. Given 9500 clock cycles per event, 10 million XML events will be packed in

    9500 cycles × 10e6 events / 1e9 cycles per sec = 95 secs

6.8.2 Sending and Receiving Time Prediction

PMB (Section 6.4) indirectly measures the time between the first byte leaving the sending side and the last byte arriving at the receiving side. Therefore, it measures the sum of the receiving time and the sending time, minus the overlap between them. However, in practice the resulting times are typically caused by bottlenecks at either end. For example, one could imagine that on the sending side all data is moved into an (infinitely large) communication buffer maintained by the network card. Once the data has arrived in that buffer, the measurement of the sending time would stop, but the data would still reside physically on the sending node. Similar effects could take place on the receiving side. The assumption made here is that the PMB times are caused either by the sending or by the receiving side, and that the times on the other side will be significantly smaller.

In order to calculate the time consumption of a message, both the bandwidth and the latency contributions have to be taken into account, if the latency is defined as the start-up time for a message:

    time = latency + message size / bandwidth

However, PMB reports the cumulative "effective latency" and "effective bandwidth"; therefore the formula becomes:

    time = effective latency(message size) = message size / effective bandwidth(message size)

Raw Events

A packet of buffered events consists of 5000 events. One double and one integer correspond to 8 bytes and 4 bytes, respectively. Having 1 double and 5 integers to represent one event results in

    5000 events × (8 + 5 × 4) bytes = 140,000 bytes = 140 KB

per packet.

To find the corresponding latency and bandwidth values over Myrinet for a packet size of 140 KB, PMB is used. From the graphs generated on the cluster, the latency for a packet of 140 KB is 620 µs, meaning that it takes 620 µs to transmit those 140 KB.

Packing 10 million events as packets of 5000 events gives a total of 2000 messages. Therefore, the theoretical expectation for transferring 2000 messages of 140 KB in size is:

    2000 messages × 620 µs = 1.24 secs

This means that transferring 2000 messages of data, each of which is 140 KB in size, between a single TS and a single ER should theoretically take about 1.2 secs.
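The estimate can be condensed into a small helper, sketched below; the inputs are the numbers used in this subsection and in the XML calculation that follows.

#include <cstdio>

// predicted transfer time = number of messages × effective latency per message
double predictedTransferSecs(long nEvents, int eventsPerMsg, double effLatencySecs) {
    long nMessages = nEvents / eventsPerMsg;
    return nMessages * effLatencySecs;
}

int main() {
    // raw events: 2000 messages of 140 KB at 620 microseconds each
    std::printf("raw: %.2f secs\n", predictedTransferSecs(10000000L, 5000, 620e-6));
    // XML events: 2000 messages of 600 KB at 2.4 milliseconds each
    std::printf("XML: %.2f secs\n", predictedTransferSecs(10000000L, 5000, 2.4e-3));
    return 0;
}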

XML Events

The theoretical time for transferring the XML events can be calculated similarly to the raw events case. Each XML event has a maximum of 120 bytes. Hence, one packet of 5000 events results in

    5000 events × 120 bytes = 600,000 bytes = 600 KB

According to PMB, the corresponding latency for a packet size of 600 KB is 2.4 ms. Given the fact that 10 million XML events can be transferred in 2000 messages, the theoretical time for transferring 2000 messages of 600 KB can be calculated as:

    2000 messages × 2.4 ms = 4.8 secs

6.8.3 Unpacking Time Prediction<br />

Raw Events<br />

The theoretical value for unpacking the buffered events should be the same as that for packing<br />

except that it does not depend on the number of TSs given that the number of ERs is constant<br />

during a set of runs.<br />

Unpacking a raw event consists of calling the memcpy function 5 times for the integer values and once for the double value. The number of clock cycles is found to be 410 per event. Therefore, the total unpacking time of 10 million raw events will be
\frac{10^7 \times 410\ \text{cycles}}{10^9\ \text{cycles/s}} = 4.1\ \text{s.}
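For illustration, a minimal C++ sketch of this unpacking step follows. It is an assumption of how such a routine might look, not the actual MATSIM code; the field layout (one double followed by five integers) follows the event description above.

#include <cstring>

struct RawEvent {
    double time;     // timestamp of the event
    int fields[5];   // e.g. event type, vehicle ID, link ID, ... (assumed layout)
};

// Unpack one raw event from a received byte buffer: one memcpy for the
// double and five for the integers, advancing the read position each time.
const char* unpack_event(const char* buf, RawEvent& ev) {
    std::memcpy(&ev.time, buf, sizeof(double));
    buf += sizeof(double);
    for (int i = 0; i < 5; ++i) {
        std::memcpy(&ev.fields[i], buf, sizeof(int));
        buf += sizeof(int);
    }
    return buf;  // points at the next event within the packet
}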

XML Events<br />

When unpacking an XML event from a received packet, two different meanings of unpacking exist. The first one is useful particularly when the events are written into a file. The second approach might be employed when one prefers to access the data values stored in the XML tags; the latter corresponds to the way the raw events are unpacked.

When XML tags are only to be extracted, the procedure is as follows: since XML events are just strings starting with "<" and ending with ">", the unpacking procedure reads each string between these special characters and saves the string as a whole, i.e., as a tag. Thus, a simple search is done for the XML tags. It takes 3000 clock cycles for an XML event to be extracted as a tag from a received packet, and extracting 10 million events in this way results in
\frac{10^7 \times 3000\ \text{cycles}}{10^9\ \text{cycles/s}} = 30\ \text{s.}

After all the XML tags are extracted from a received packet, they are written into a file as explained in the next sub-section.
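A minimal sketch of this tag search is given below; the function name and buffer handling are assumptions for illustration, not the original implementation.

#include <cstddef>
#include <string>
#include <vector>

// Scan a received character buffer and save each "<...>" substring as one
// complete tag, without parsing any attribute values.
std::vector<std::string> extract_tags(const char* buf, std::size_t len) {
    std::vector<std::string> tags;
    std::size_t i = 0;
    while (i < len) {
        while (i < len && buf[i] != '<') ++i;   // find the start of a tag
        std::size_t start = i;
        while (i < len && buf[i] != '>') ++i;   // find the end of the tag
        if (i < len) tags.push_back(std::string(buf + start, i - start + 1));
        ++i;
    }
    return tags;
}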

In order to store the value of each attribute, similar to the raw events case, the XML tags are parsed into values. Parsing an XML tag via expat [21] and storing its values separately takes 50000 clock cycles. Therefore, the total unpacking time for 10 million XML events is
\frac{10^7 \times 50\,000\ \text{cycles}}{10^9\ \text{cycles/s}} = 500\ \text{s.}

OPERATION                    RAW     XML
Packing Time                 4.0 s   95 s
Sending and Receiving Time   1.2 s   4.8 s
Unpacking Time               4.1 s   30+500 s
Writing Time                 56 s    20 s

Table 6.1: Performance prediction table for buffered events.

6.8.4 Writing Time Prediction<br />

Raw Events<br />

Each attribute of a raw event is written into a file separately, forming structured text. The number of clock cycles needed is 5600 per event when the C++ output operator << is used. Hence,
\frac{10^7 \times 5600\ \text{cycles}}{10^9\ \text{cycles/s}} = 56\ \text{s}
is required for all the events to be dumped into a file.
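As an illustration of this structured-text output, a minimal sketch follows; the field names and the tab separator are assumptions, not the exact MATSIM file format.

#include <fstream>

// Write one raw event as tab-separated text; each attribute is converted
// from its binary representation to a string during the write.
void write_event(std::ofstream& out, double time, int event_type,
                 int vehicle_id, int link_id) {
    out << time << '\t' << event_type << '\t'
        << vehicle_id << '\t' << link_id << '\n';
}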

XML Events<br />

When an XML event is written into a file by using the C++ output operator <<, writing each event uses up 2000 cycles, which results in
\frac{10^7 \times 2000\ \text{cycles}}{10^9\ \text{cycles/s}} = 20\ \text{s}

to write all the events into a file. Writing XML events is faster than writing raw events because raw events need to be converted to strings prior to writing; XML events, on the other hand, are already in string format.

6.8.5 Performance Prediction for Buffered Events: Putting it together<br />

A table of the performance predictions for each individual step is shown in Table 6.1. The table demonstrates that exchanging the raw/plain events is expected to be faster for the operations that involve the "effective" message exchange. However, XML is more flexible and extensible, as explained in Section 3.4.1.

6.9 Results of the Buffered Events<br />

The overall simulation time, including the initial events reading, on a single TS was recorded with 1 ER, 2 ERs and 4 ERs when using the buffered raw events, and likewise when using the buffered XML events, where "unpacking an XML event" means retrieving an XML tag without parsing for values (see Section 6.8.3). These numbers include approximately 35-40 seconds for reading the input data. In the following, the input reading times will be ignored, and the actual performance measurements of the other contributions will be described.

The performance results of the different operations on buffered events are given in the following sections: packing of both the raw and XML events, using the C memcpy function, is reported in Section 6.9.1. Section 6.9.2 compares the sending time of the raw events and the XML events on different communication media (Myrinet and Ethernet), and compares the different types of collection of events (distributed and multicast cases). The effective receiving time is measured both on Myrinet and Ethernet in Section 6.9.3. Unpacking the received events of both types, i.e. raw and XML, by using memcpy similar to the packing case is discussed in Section 6.9.4. Finally, the time for writing events is measured for both raw and XML events in Section 6.9.5.

6.9.1 Packing<br />

The time spent for packing events is plotted in Figure 6.6. The curves in this figure show only<br />

the “packing time”. To measure the packing time for events, the algorithm in Figure 6.2(a) has<br />

been changed in a way that the events are only packed, but not sent to the other side. As one<br />

can see, the performance values follow closely the theoretical predictions.<br />

Adding more ERs to the system in the “multi-casting” sense has no effect on the packing<br />

time values since the “total” packing time of TSs is independent of the number of ERs in the<br />

system.<br />

6.9.2 Sending<br />

Figure 6.7 shows the total time spent for sending all the events in the distributed ERs case when Myrinet [54] is used as the communication medium. The time measurement starts before the first event is packed and ends right after the last packet is sent; hence, the numbers collected include the packing time as well. The numbers presented in Figure 6.7 are calculated by subtracting the time for packing from the total time for sending and packing on all TSs. Then, the maximum of these differences is taken, since it also shows how long the TSs wait before a receive command is issued.

The theoretical curve is derived (Section 6.8.2) under the assumption that the performance restrictions lie entirely on the side of the sender, i.e. that when a TS issues a send command, the ER is ready to receive. Of course, this is not the case in reality. The MPI_Send function is blocking, which means each MPI_Send call needs to wait until a corresponding MPI_Recv command is issued. In other words, especially for small numbers of ERs, the TSs compete with each other.

The important features of these plots are:

- The bottleneck for sending events lies almost entirely with the sender: with XML events, up to 8 TSs can send with full bandwidth before saturation sets in, presumably caused by the receiver. The reason for this is that the sender is most of the time busy packing events (Figure 6.6), while at this point the receiver immediately discards events. This is no longer true when unpacking (Section 6.8.3) and possibly writing is added to the receiver.

- Eventually, as the number of TSs increases, the curves start to saturate (or even increase) because of the competition among senders to get access to the receiver buffer.

- Myrinet results show that the network cards saturate earlier on the sender side than on the receiver side. This could be due to the rendezvous protocol 4, used for sending messages larger than a certain threshold via the GM (Glenn's Messages 3) one-sided put operation.

3 GM is the name of the low-level communication layer for Myrinet [54].
4 There is another protocol, called the Eager protocol, used for small messages. When a send is issued and the matching receive is not yet posted, the small message is saved temporarily in an (unexpected) buffer before the actual send occurs. Allocating buffers for large messages does not work. With small messages, therefore, a good bandwidth is not expected since the message must be copied.

Figure 6.6: Time elapsed for packing events. (a) Packing raw events. (b) Packing XML events. Note: Having n ERs refers to the "distributed ER" method, i.e. each ER receives 1/n of the events. The packing time for the "multi-casting ER" method is the same, since data is packed only once.

The rendezvous protocol ensures that a handshake between the sender and receiver occurs prior to the message sending. GM's put operation then writes the large message directly into the receive buffer; hence the operation finishes without the remote side being involved.

The time measurement for sending events over Ethernet [77] was also taken; the results are shown in Figure 6.8. For Ethernet, the latency being higher and the bandwidth being lower than those of Myrinet [54] explains the difference between the Myrinet and Ethernet values in the figure. One also notices that with Ethernet, a single full-bandwidth sender immediately saturates the receiver: in contrast to Myrinet, multiple TSs sending to one ER is not any faster than one TS sending to one ER. This also means that, for Ethernet, using multiple ERs for distributed recording is indeed an advantage.

So far, it was assumed that when there are multiple ERs, they all receive only a part of the information (distributed recording). As mentioned in Section 6.4, a different scenario


Figure 6.7: Time for sending events, distributed recording, Myrinet. (a) Sending raw events.<br />

(b) Sending XML events.<br />

is to assume that there are multiple ERs, but that they represent different strategy generation modules and therefore all want to receive the full event information. This is called "multicast". Multicast, in general, means sending the same message to a list of recipients on a network; therefore, during these tests the TSs report all the events to all the ERs in the system. MPI [51] does not provide a multicast function. In order to multicast the events to all the ERs, a simple for loop is used, calling the MPI_Send function once per ER.
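A minimal sketch of this loop is shown below; the names num_ers, er_rank and EVENTS_TAG are assumptions for illustration, not the original code.

#include <mpi.h>

// "Multicast" a packed events buffer by sending the same bytes once per ER.
void multicast_events(char* byte_array, int packed_bytes,
                      const int* er_rank, int num_ers, int EVENTS_TAG) {
    for (int er = 0; er < num_ers; ++er)
        MPI_Send(byte_array, packed_bytes, MPI_BYTE,
                 er_rank[er], EVENTS_TAG, MPI_COMM_WORLD);
}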

The results are plotted in Figures 6.9 and 6.10 for Myrinet and Ethernet, respectively. As the number of ERs increases, the sending time increases almost linearly in the number of multicasting ERs, as one would expect, since the internal command is just a loop over all ERs.

When Ethernet is used as the communication medium, having events distributed<br />

among ERs should be preferred. If the communication is achieved over Myrinet,<br />

multi-cast is also a noteworthy option.<br />

Figure 6.8: Distributed recording, comparison of Ethernet vs Myrinet when sending events. (a) Sending raw events. (b) Sending XML events. Note: Having n ERs means that approximately 1/n of the events are sent to each ER.

6.9.3 Receiving<br />

There is no clear way to measure the time consumption of a receive operation. This is because when an MPI_Recv command is executed before the corresponding MPI_Send is issued, the measurement on the receiver side will include the time the sender spends on operations taking place before MPI_Send. Therefore, the curves in Figure 6.11 effectively show the combined effects of packing and sending on the sender side and receiving on the receiver side over Myrinet; more precisely, they effectively show t_pack + t_send + t_recv. They exclude the unpacking and writing of events by the ER code shown in Figure 6.2. In order not to include the events reading time of the TSs, the time measurement starts right after the first packet arrives at the ER and ends after all the packets are fetched. This will, in fact, exclude the packing and sending time of the first packet, but this is a small error given 2000 or more packets.
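A minimal sketch of this measurement follows; the buffer size, tag handling and the packet count are assumptions, not the actual ER code.

#include <mpi.h>

// "Effective receive": the clock starts only after the first packet has
// arrived, so the input reading time of the TSs is excluded.
double timed_receive(char* buf, int max_bytes, int total_packets) {
    MPI_Status status;
    MPI_Recv(buf, max_bytes, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);            // first packet: not timed
    double t0 = MPI_Wtime();
    for (int p = 1; p < total_packets; ++p)       // fetch the remaining packets
        MPI_Recv(buf, max_bytes, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
    return MPI_Wtime() - t0;   // combined pack + send + receive time
}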

In the resulting figures, the curve is decreasing as the number of TSs increases. As explained<br />

in the previous paragraphs, the time measurement does not involve unpacking or further opera-<br />


Figure 6.9: Multi-casting, time for sending events. (a) Sending Raw events. (b) Sending XML<br />

events.<br />

tions such as the writing time of the events. This means that the majority of the time on the receiving side is spent waiting for the senders to issue the MPI_Send command. Since, during this period, the senders basically pack the events, the packing time dominates the curves shown in Figure 6.11. One important observation from the results obtained up to this point is that, since the packing plus sending time is close to the receiving time and packing uses up most of that time, the actual receiving time is smaller than the packing time and the rest of the time is spent idle.

The same tests are also repeated over Ethernet; the results are presented in Figure 6.12. Since packing and unpacking of events do not depend on the communication medium, the packing time values are the same as those of the Myrinet case shown in Figure 6.6. Given that the sending time measurement (Figure 6.8) is higher than the packing time measurement (Figure 6.6), one might conclude that most of the time in the Ethernet case is spent in sending or receiving rather than in packing, as opposed to the Myrinet case.


Figure 6.10: Multi-casting over Ethernet, time for sending events. (a) Sending Raw events. (b)<br />

Sending XML events.<br />

6.9.4 Unpacking<br />

The time measurement for unpacking events starts when the first packet arrives at the ER. After all the packets are retrieved and the last event is unpacked, the time measurement is switched off. Therefore, receiving is included in these measurements, and any further operations, such as writing, are excluded from them.

When a measurement includes only the receiving time on the ER side, as explained in the previous section, the ER spends most of the time waiting for the TSs to send something (over Myrinet); this waiting time corresponds to the packing time of the TSs. On the other hand, when a measurement also includes unpacking, the ERs spend time in the actual unpacking process as opposed to waiting for the TSs to pack the data. This also fits the theoretical calculations of packing and unpacking explained in Sections 6.8.1 and 6.8.3.

Figure 6.13 shows the time spent for receiving and unpacking the events on the ER side. The unpacking time mostly determines the curves. The theoretical curves are the sum of the theoretical values for receiving and unpacking. Adding more ERs to the system will decrease the


Figure 6.11: Time elapsed on ER side when receiving events over Myrinet. (a) Receiving Raw<br />

events (b) Receiving XML events.<br />

unpacking time, since fewer events will be handled by each ER. As shown in Figure 6.11, increasing the number of ERs does not make much difference in terms of receiving. However, in Figure 6.13, it is distinctive that the unpacking time values diminish as the number of ERs increases. Therefore, one might conclude that the transfer time of data between modules is very small.

When the raw events are used, packing and unpacking take the same amount of time, as expected theoretically. Given that sending and receiving do not take long, one might conclude that both ERs and TSs spend more time in packing and unpacking than in sending and receiving or in waiting for each other (see Figure 6.13(a)).

On the other hand, when XML events are transferred, the packing time is three times the unpacking time. Therefore, the ERs finish unpacking earlier and wait for the TSs to finish packing and sending. The waiting time of the ERs in this situation is the difference between the packing time and the unpacking time; it is included in the theoretical curve for unpacking XML events. When the number of TSs is small, the waiting time of the ERs increases, since there are more events to be packed per TS. The curve flattens out around 30 seconds for one ER in


Figure 6.12: Comparison of Ethernet vs Myrinet when receiving events. (a) Receiving Raw<br />

events. (b) Receiving XML events.<br />

Figure 6.13(b). This number is the same as the theoretical value for unpacking in the one-ER case, as calculated in the previous sections.

As explained in Section 6.8.3, instead of writing events into a file, an ER can parse the XML event strings and store the values, completely eliminating the coupling of modules via files. In this work, this is done by using expat [21]. The results are shown in Figure 6.13(c).

6.9.5 Writing into File<br />

For the time measurement of writing events into a file, the measurement starts before the first event is received and ends after the last one is dumped into the file. The writing-only times of the ERs are shown in Table 6.2.

As seen in the table, the time that an ER spends on writing events is independent of the number of traffic flow simulators in the system. This is because, during the experiments, agents are assigned to ERs in a round-robin fashion no matter what the number of TSs in the system is. Given a fixed number of ERs, the number of events collected and consequently the writing time of


Figure 6.13: Time elapsed for unpacking events on top of the effective receiving time. (a)<br />

Unpacking Raw events. (b) Unpacking XML events. This only includes extracting XML tags<br />

as strings from a received packet. (c) Unpacking XML events. This includes parsing values of<br />

attributes.<br />

events by the ERs will stay the same as the number of TSs changes.

As said in Section 3.5, in the table, Local disk means the files are kept on the local disks of the computing nodes on which the simulations run. Via NFS means the files are on a remote

                           Writing Time
Explanation               Raw    XML
1 ER, Local Disk, C++     59 s   25 s
2 ERs, Local Disk, C++    30 s   13 s
4 ERs, Local Disk, C++    15 s   6 s
1 ER, via NFS, C++        81 s   N/A
1 ER, Local Disk, C       57 s   N/A
1 ER, via NFS, C          66 s   N/A

Table 6.2: Performance results for ERs writing the events file.

machine and the simulation accesses the file via NFS (Network File System) [72]. C++ means writing is achieved via the C++ operator <<, whereas C refers to writing with the fprintf function.

6.9.6 Summary of “buffered events recording”<br />

Figure 6.14 shows the combined results for the buffered events when there is only one ER in the system. The curves are aggregated according to the sequence in which the operations occur; therefore, each curve is drawn on top of the previous operations. In order to ensure the integrity of the curves, the values of the packing time measurement are taken as they are (as opposed to the interpolation explained in Section 6.4).

The framework should fully replace the events maintenance via files by sending the events directly to a listening module. By doing so, for the raw events, the computational performance is 10 times better over Myrinet and 3 times better over Ethernet. For XML events, eliminating files completely comes with the higher overhead of parsing the XML events. Nevertheless, this is necessary if one wants to access the values stored in the XML strings.

6.10 Theoretical Expectations and Results of Immediately<br />

Reported Events<br />

When taking measurements, measuring only the duration of a single command, such as pack, send, receive, unpack or write, and then adding those durations up for 10 million events, is impracticable with the timing devices commonly available in computers. Hence, the measurements have to be taken in a cumulative sense. For example, the gettimeofday() command has an accuracy of one microsecond; anything faster than a microsecond will not be measured correctly with this command. Therefore, the measured results can be misleading.
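The following sketch illustrates the cumulative approach with gettimeofday; it is an assumed helper for illustration, not the benchmark code itself.

#include <sys/time.h>

// Time n repetitions of an operation with one pair of gettimeofday calls,
// instead of timing each sub-microsecond operation individually.
double time_cumulative(void (*op)(), long n) {
    struct timeval t0, t1;
    gettimeofday(&t0, 0);
    for (long i = 0; i < n; ++i)
        op();                         // e.g. pack or unpack one event
    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}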

Each immediately reported raw event is packed in 0.4 µs, as in Section 6.8.1, so packing 10 million raw events takes 4 s in a system of a single ER and a single TS. Therefore, there is no difference in packing between events reported buffered and events reported immediately.

If the events are reported immediately as they occur, the main problem arises when transferring them to the ERs individually. When, in general, a send command is issued, a header is added to the data, the data is copied to the send buffer, and then the actual send occurs. In other
94


)<br />

)<br />

£<br />

¡<br />

©<br />

(<br />

(<br />

¨<br />

¨<br />

<br />

$<br />

¡<br />

<br />

<br />

<br />

)<br />

¢<br />

¢¤<br />

<br />

¡<br />

<br />

1024<br />

256<br />

TS-p<br />

TS-s<br />

ER-er<br />

ER-u<br />

ER-w<br />

Raw Events, Myrinet<br />

1024<br />

256<br />

TS-p<br />

TS-s<br />

ER-er<br />

ER-u<br />

ER-w<br />

Raw Events, Ethernet<br />

Time in Secs<br />

64<br />

16<br />

Time in Secs<br />

64<br />

16<br />

4<br />

4<br />

1<br />

1<br />

1 2 4 8 16 32<br />

Number of Traffic Simulators<br />

(a)<br />

1 2 4 8 16<br />

Number of Traffic Simulators<br />

(b)<br />

1024<br />

XML Events, Myrinet<br />

1024<br />

XML Events, Ethernet<br />

256<br />

256<br />

Time in Secs<br />

64<br />

16<br />

TS-p<br />

TS-s<br />

4 ER-r<br />

ER-u<br />

ER-w<br />

ER-parse<br />

1<br />

1 2 4 8 16 32<br />

Number of Traffic Simulators<br />

(c)<br />

Time in Secs<br />

64<br />

16<br />

TS-p<br />

TS-s<br />

4 ER-er<br />

ER-u<br />

ER-w<br />

ER-parse<br />

1<br />

1 2 4 8 16<br />

Number of Traffic Simulators<br />

(d)<br />

Figure 6.14: Summary figures. The results are for single ER case. (a) Raw events, Myrinet.<br />

(b) Raw events, Ethernet. (c) XML events, Myrinet. (d) XML events, Ethernet. – The thick<br />

lines denote the time consumption when events writing would be fully replaced by sending the<br />

events directly to a listening module. The meanings of labels are as follows: pack, send, receive<br />

(effective), unpack, and write.<br />

words, each send command involves latency. If packets are too small, the total transfer time is<br />

mainly determined by the latency.<br />

The contribution of latency to the sending time is captured in the following tests. The theoretical sending time is calculated as follows: each raw event has 1 double and 5 integers, which results in 28 bytes per packet, given that one double and one integer correspond to 8 bytes and 4 bytes, respectively. Based on PMB [30] over Myrinet, the corresponding effective latency of a packet size of 28 bytes is 10 µs. Therefore, the theoretical time for sending 10 million events from a TS to an ER one by one can be found as:

t_{\mathrm{send}} = 10\,000\,000 \times 10\ \mu\text{s} = 100\ \text{s.}

From Figure 6.16, one can see that the transmission of the immediately reported events suffers from the latency contribution, compared to the buffered version shown in Figure 6.7(a). The theoretical value for unpacking should again be the same as the packing value, except that it is independent of the number of TSs. If the unpacking time were drawn on top of the effective receiving time for the immediately reported events, the figure would be a vertical upward shift of Figure 6.13(a), because of the latency contribution to the transmission time in the immediately reported events case.


Figure 6.15: Same plots as Figure 6.14, but on linear scale.<br />


Figure 6.16: Sending time for immediately reported events.<br />

The theoretical value of writing raw events is around 56 seconds. The writing time is also independent of the type of reporting, namely buffered or immediate.

create a C-type byte array
memcpy(byte_array, integer_item, sizeof(integer_item))
advance the pointer of byte_array by sizeof(integer_item)
memcpy(byte_array, double_item, sizeof(double_item))
advance the pointer of byte_array by sizeof(double_item)
send(byte_array, pointer, MPI::BYTE, destination)

Figure 6.17: Pseudo code for packing different data types with memcpy

In case events are sent to the listening modules, this should be accomplished by<br />

adding several events in a single packet to reduce the contribution of latency to the<br />

sending time.<br />

6.11 Performance of Different Packing Methods for Events<br />

As explained in Section 6.9, when transferring data between modules, the packing and unpacking of events usually take more time than the actual sending and receiving. Thus, the packing algorithms are investigated in this section. Object serialization and different packing algorithms are discussed in Section 4.5.3, where the data to be exchanged refers to vehicles. In this section, the different packing approaches are applied to events; the raw events are used in the tests presented here. The packing algorithms discussed are memcpy in Section 6.11.1, MPI_Pack in Section 6.11.2, MPI_Struct in Section 6.11.3 and Classdesc in Section 6.11.4.

6.11.1 Using memcpy and Creating a Byte Array<br />

When using memcpy, the send and receive buffers are simple byte/char arrays. The function memcpy is used to convert all data types into bytes. When adding data to a buffer, one must explicitly advance the pointer that points to the next available position in the buffer. When one needs to add additional information to the buffer, such as the number of items in the buffer, this can easily be done with memcpy.

At the abstract level, the data is sent and received as byte arrays. When creating buffers for different purposes, such as transferring events or plans, the same higher-level functions (such as packing functions) can be used. Hence, this method benefits from using a generic type of information.

A problem occurs if a cluster of computers with different machine representations is used: the defined data types can be converted differently into and from a byte array when the sender and the receiver do not share a common machine representation.

Figure 6.17 shows the instructions for packing an integer and a double into a byte/char array. The last line shows the send call: the byte array, with its used length given by the current value of the pointer, is sent to the destination as the MPI::BYTE type.
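A compilable variant of the pseudo code in Figure 6.17 is sketched below, using the C bindings of MPI; the buffer size and the message tag are assumptions.

#include <cstring>
#include <mpi.h>

// Pack one integer and one double into a byte array with memcpy, advancing
// the write position explicitly, then send the used part as bytes.
void pack_and_send(int integer_item, double double_item, int destination) {
    char byte_array[64];
    int pos = 0;
    std::memcpy(byte_array + pos, &integer_item, sizeof(integer_item));
    pos += sizeof(integer_item);
    std::memcpy(byte_array + pos, &double_item, sizeof(double_item));
    pos += sizeof(double_item);
    MPI_Send(byte_array, pos, MPI_BYTE, destination, 0, MPI_COMM_WORLD);
}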

6.11.2 Using MPI_Pack and MPI_Unpack

create a C-type byte array
MPI::INT.Pack(integer_item, 1, byte_array, max_size, pointer, comm);
MPI::DOUBLE.Pack(double_item, 1, byte_array, max_size, pointer, comm);
send(byte_array, pointer, MPI::BYTE, destination)

Figure 6.18: Pseudo code for packing different data types with MPI_Pack

typedef struct {
    int event_type, vehicleID, linkID;
    double time;
} my_event_struct;

Figure 6.19: A C-type struct. It needs to be defined prior to using MPI_Struct

Like memcpy, MPI_Pack incrementally fills a byte buffer with data, which means that it is also easy to add additional information into the message besides the events stream itself. The same functions are used for the different-purpose buffers. This method has an advantage over the memcpy method: the different machine representation problem is solved by MPI_Pack and MPI_Unpack, since these functions use MPI data types, which are the same on different computer architectures. In other words, these methods benefit from the standardization of MPI data types between platforms.

The corresponding code for packing a single integer and a single double value using MPI_Pack and MPI_Unpack is shown in Figure 6.18. In the code, max_size is the size boundary of the created memory buffer, the count argument shows how many items of that particular type will be packed (in this example, 1 integer and 1 double), and pointer marks the current position in the buffer.


6.11.3 Using MPI_Struct

typedef struct {
    int integer_item;
    double double_item;
} my_struct;

define an array of my_struct type
create corresponding MPI struct using MPI::Datatype::Create_struct
commit MPI struct as mpi_struct_type

struct_array[index].integer_item = integer_item;
struct_array[index].double_item = double_item;
send(struct_array, index, mpi_struct_type, destination)

Figure 6.20: Pseudo code for packing different data types with MPI_Struct

In order to pack an integer and a double into a buffer using MPI_Struct, as stated earlier, one must define a corresponding C structure such that it can be packed. A simple example is shown in Figure 6.20. After a C-type struct is defined, it is committed as an MPI type by using the command MPI::Datatype::Create_struct. Once the struct array is filled with data, it is sent.
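For illustration, a sketch of these steps in the C bindings follows (the thesis itself uses the equivalent C++ binding MPI::Datatype::Create_struct); the helper function and its name are assumptions.

#include <cstddef>
#include <mpi.h>

typedef struct {
    int integer_item;
    double double_item;
} my_struct;

// Build and commit an MPI datatype that mirrors my_struct, so that an
// array of structs can be sent in a single call.
MPI_Datatype make_mpi_struct_type() {
    int blocklens[2] = { 1, 1 };
    MPI_Aint displs[2];
    displs[0] = offsetof(my_struct, integer_item);
    displs[1] = offsetof(my_struct, double_item);
    MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype mpi_struct_type;
    MPI_Type_create_struct(2, blocklens, displs, types, &mpi_struct_type);
    MPI_Type_commit(&mpi_struct_type);
    return mpi_struct_type;
}
// usage: MPI_Send(struct_array, index, mpi_struct_type, destination, tag, comm);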

6.11.4 Using Classdesc


int integer_item;
double double_item;
MPIbuf buffer;

// sender code
buffer << integer_item << double_item;
buffer.send(destination, tag);

// receiver code
buffer.get(source, tag);
buffer >> integer_item >> double_item;

Figure 6.21: Pseudo code for packing different data types with Classdesc

6.11.5 Comparison of Results<br />

Among the methods presented here, the MPI_Pack/MPI_Unpack and Classdesc packing and unpacking methods are the slowest. The MPI_Pack/MPI_Unpack functions convert data into byte arrays, and most of the addressing issues are handled via MPI calls, which causes overhead. Classdesc also converts everything into byte arrays, but it does not allow users to reuse buffers; instead, it enlarges buffers by reallocating their memory areas.

Figure 6.22(a) shows the time elapsed for packing when using the methods explained in the previous sections. Classdesc's tendency to extend the buffer size becomes a big problem when the available memory cannot hold everything and eventually swapping sets in. The effect can be seen in Figure 6.22(a) for small numbers of TSs.

MPI_Struct does the packing quickly, but a drawback of this method appears when the size of the data in the struct is unknown. As a result, variable-length data cannot be handled elegantly by this method. An example of this problem is given in Section 4.5.3.

The buffer created by MPI_Pack is transferred faster than the other three (see Figure 6.22(b)). However, the numbers are so close to each other that the underlying communication medium is clearly fast enough, and the main bottleneck comes from the packing (Figure 6.22(a)) and unpacking (Figure 6.22(c)) processes.

When events are passed via messages, MPI_Struct does the packing in the least time. However, when variable-length data is to be packed, the packing cannot be handled elegantly by this method. Depending on the data size, MPI_Pack or memcpy can be utilized. Classdesc is straightforward to use, but it requires well-formed C++ classes.

6.12 Conclusions and Discussion<br />

As discussed in Section 5.4, when the modules in a framework are coupled via files, the problem of I/O being a bottleneck appears. Besides efforts to improve the I/O performance, the I/O bottleneck can be avoided by using "messages".

In this chapter, events from the traffic flow simulators are considered, since they are the input for the different strategy generation modules in the framework. When coupling modules via files, 10 million events are read in 349 seconds and written in 72 seconds, giving a total of 421 seconds. When the modules are coupled via raw message passing, the total time required for packing and sending on one side and unpacking and allocating on the other side takes only
100


256<br />

64<br />

Packing Only, 1ER, Raw Events<br />

memcpy<br />

mpi_pack<br />

mpi_struct<br />

classdesc<br />

Time in Secs<br />

16<br />

4<br />

1<br />

0.25<br />

1 2 4 8 16<br />

256<br />

64<br />

Number of Traffic Simulators<br />

(a)<br />

Sending Only, 1ER, Raw Events<br />

Myri, memcpy<br />

Myri, mpi_pack<br />

Myri, mpi_struct<br />

Myri, classdesc<br />

Time in Secs<br />

16<br />

4<br />

1<br />

0.25<br />

0.0625<br />

1 2 4 8 16<br />

Number of Traffic Simulators<br />

(b)<br />

Effective Receive + Unpack, 1ER, Raw Events<br />

256<br />

64<br />

Myri, memcpy<br />

Myri, mpi_unpack<br />

Myri, mpi_struct<br />

Myri, classdesc<br />

Time in Secs<br />

16<br />

4<br />

1<br />

0.25<br />

1 2 4 8 16<br />

Number of Traffic Simulators<br />

(c)<br />

Figure 6.22: Performance of Different Serialization Methods. (a) Packing, (b) Sending, (c)<br />

Receiving and Unpacking.<br />

9.3 seconds. Thus, introducing a message passing approach for events among the modules, and keeping the events in memory, makes the data exchange about 45 times faster than the setup coupled via files (421 seconds versus 9.3 seconds).

The data format also needs to be considered. While coupling via messages based on the raw events completes in 9.3 seconds, the same setup for the XML events takes 629.8 seconds. Where computational issues are concerned, this difference says to keep it simple, as with the raw events. However, the extensibility and flexibility of the XML format can outweigh the simplicity and better performance of the raw events.



When the events are reported to the strategy generation modules immediately as they occur, the performance of the system is dragged down by the latency contribution of each send call. Hence, sending several events in a single packet, as shown in the buffered events case, is a necessity.

The most important conclusions of replacing the events file with messages for events are drawn below:

- The events data should be provided to the different modules via messages instead of via files.
- If events are exchanged as messages, they should be of the raw type in the current implementation of MATSIM. If the parsing of XML tags is improved, events should be packed into XML strings.
- To minimize the latency contribution to the total execution time, several events must be packed into a single packet.
- Using Myrinet, multicast of events to different modules needs to be considered, since in a framework the same full information might be used by more than one module.
- Ethernet gives better performance when the events are distributed among modules.

Among the different methodologies, MPI_Pack should be chosen, since it is more robust than the other packing algorithms. When the data length is fixed, MPI_Struct should be preferred.

6.13 Summary<br />

Chapter 5 gives different methodologies to couple the modules defined in a framework. "Coupling" promotes data exchange between the modules. There are two main data streams in a framework: plans and events. Events are covered in this chapter.

An events recorder is an external module which keeps track of the events generated by the traffic flow simulators during a simulation run. There are three main issues to be considered when several events recorders are introduced into the system:

- which events recorder collects which events (distributed ERs vs. multicast),
- the type of events (raw vs. XML),
- how to arrange the packets of events (single vs. buffered).

The tests for "multicasting" are useful in the sense that they give an idea about the performance when more than one strategy generation module needs the full information about the events. Distributed ERs means that the agents in the system are distributed in such a way that all ERs are responsible for more-or-less the same number of agents, i.e., each agent has a dedicated ER.

An event format is a choice between flexibility and computationally better performance. The XML events are flexible, but obtaining values out of an XML tag is time-consuming. The simplicity and better performance of the raw events are offset by their inflexibility.

When sending events to an events recorder, several events are buffered into the same message. This reduces the contribution of latency, which is incurred with each packet sent.

Moreover, with the time measurement command gettimeofday, the measurements should be taken on a cumulative basis. This is because the inaccuracy of the command might


Time    nERs  nTSs  ET   OP      Note
5 s     1     1     raw  pack    memcpy
101 s   1     1     XML  pack    memcpy
21 s    1     1     raw  pack    MPI_Pack
3 s     1     1     raw  pack    MPI_Struct
69 s    1     1     raw  pack    classdesc
1.2 s   1     1     raw  send    myri, buf, dist
4 s     1     1     XML  send    myri, buf, dist
25 s    1     1     raw  send    eth, buf, dist
86 s    1     1     XML  send    eth, buf, dist
11 s    8     1     raw  send    myri, buf, mcast
35 s    8     1     XML  send    myri, buf, mcast
224 s   8     1     raw  send    eth, buf, mcast
687 s   8     1     XML  send    eth, buf, mcast
88 s    1     1     raw  send    myri, imm, dist
6 s     1     1     raw  recv    myri, buf, dist, eff
105 s   1     1     XML  recv    myri, buf, dist, eff
6 s     1     1     raw  unpack  memcpy, buf, dist, on top of recv
101 s   1     1     XML  unpack  memcpy, buf, dist, on top of recv
59 s    1     1     raw  write   <<, local
21 s    1     1     XML  write   <<, local
81 s    1     1     raw  write   <<, remote
57 s    1     1     raw  write   fprintf, local
66 s    1     1     raw  write   fprintf, remote

Table 6.3: Summary table of the performance results of events transferred between TSs and ERs.

cause a problem when measuring the operations one by one, especially when these operations are very fast.

Table 6.3 summarizes the most important performance numbers, measured during the different operations by switching different parameters on. The abbreviations used in the table are as follows: nERs and nTSs are the numbers of events recorders and traffic flow simulators; ET shows the event type (raw or XML); OP shows the operation measured; buf and imm mean that the events are reported in chunks or immediately as they occur, respectively; dist and mcast denote the distributed and multicast cases; eff is short for effective receiving; on top of recv refers to a time measured including the effective receive time.



Chapter 7<br />

Plans Server<br />

7.1 Introduction<br />

The systematic relaxation approach, explained in Chapter 1, is a simulation-based solution to the traffic dynamics with spill-back. Each iteration of the relaxation takes persons and their plans as input, executes them, and outputs performance information about the plans. In this thesis, that performance is output in the form of timestamped events. The timing information, for example, is later used to alter some of the routes such that the congestion in the following iteration lessens.

A robust but slow implementation of this uses files: the traffic flow simulation reads plans from a file, runs them, produces events and writes them into a file, as described in Section 5.2.1. Then, some strategy generation modules read the events to make adjustments to some of the plans or to select among the existing plans. The traffic flow simulation reads the updated plans and executes them, and so on.

In such a set-up, file I/O is a bottleneck because of the limitations on I/O operations imposed by disk speeds. A faster alternative to file I/O is passing the data in messages. An example is shown in Chapter 6: events can be transmitted to the events recorders (ERs) via messages. Similarly, the traffic flow simulators (TSs) can receive plans from a server, called the plans server (PS), instead of reading them from a file. The PS is a means for performance benchmarking; therefore, it does not construct any plans by itself but rather reads them from a file. The question of interest is how much time it takes to get these plans to the TSs under various set-ups.

7.2 The Competing File I/O Performance for Plans<br />

If the ch6-9 scenario (with 1 million agents) is used in a system coupled via files, the I/O performance for plans reading and plans writing is recorded as follows:

When the plans are raw, reading by fscanf takes 25 seconds; 11 seconds are spent on the memory allocation for the values read. Thus, the total time for the raw plans to be completely read is 36 seconds. The raw plans are written by the C++ operator << in 17 seconds, after the data is retrieved from memory in 2 seconds. Thus, writing 1 million plans is completed in 19 seconds. The total time of file I/O operations for raw plans is accordingly 55 seconds.

When XML [97] plans are used, reading the XML plans with expat [21] takes 151 seconds. When the XML plans are read, the data values are kept in strings; these strings are then converted into the appropriate data types (such as integers, doubles, etc.) in 3 seconds by using string functions. Finally, 5 seconds are needed to allocate the memory for the person objects to be stored. Thus, the total time for reading becomes 159 seconds. On the other hand, prior to writing the XML plans, the data retrieval from memory takes 123 seconds. Then, the data values are written into a file, forming XML tags, in 149 seconds. Consequently, the total time spent on file I/O for XML plans is 431 seconds.

Under these circumstances, this chapter investigates a message passing alternative for plans, avoiding the reading and writing times.

7.3 Benchmarks<br />

7.3.1 General<br />

The benchmark simply measures the time for transferring plans between PSs and TSs. PSs are implemented both in C++ [80] and Java [42] to compare their performance. When using more than one PS, the agents are distributed among the PSs in a round-robin fashion, such that each PS is responsible for an approximately equal number of agents.

When the simulation starts, the TSs read the street network information and the domain decomposition output. Then, they wait for the PSs to finish reading the plans. Since the PSs are assigned to sets of agents, they only keep, while reading plans, the records of the agents that they are responsible for. Once the plans are in memory, the PSs multicast to the TSs all the agent IDs and the links on which the agents start their execution. The PSs also specify the earliest time at which an agent starts simulating; this is important for the synchronization of the TSs running in parallel.

The TSs retrieve the information about all agents and check which agents start on the local network described by the domain decomposition. The TSs send these agent IDs, i.e. the IDs of the agents that start execution on their local domains, back to the PSs as feedback, so that the PSs can send the complete agent information and the plans to the TSs in the next step. Once the agent information is received, the TSs start executing the plans. The pseudo code of the benchmark is given in Figure 7.1.

One might notice that some empty messages are exchanged before the time measurements start. This is necessary to synchronize all the modules in the system. Figure 7.2 graphically shows the execution sequence of the tasks on a time-line when a single PS and a single TS interact.
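A minimal sketch of such a synchronization exchange is given below; it is an assumed illustration, not the exact benchmark code.

#include <mpi.h>

// Exchange a zero-length message with a peer before the timers start,
// so that both sides leave this function at approximately the same time.
void sync_with(int peer_rank, int tag) {
    char sbuf = 0, rbuf = 0;
    MPI_Status status;
    MPI_Sendrecv(&sbuf, 0, MPI_BYTE, peer_rank, tag,
                 &rbuf, 0, MPI_BYTE, peer_rank, tag,
                 MPI_COMM_WORLD, &status);
}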

7.3.2 mpiJava<br />

Since all the other modules in the framework are implemented in C++, the plans server was also originally written in C++. A second implementation of the plans server in Java was done in order to compare the performance of C++ and Java.

Java [42] is an object-oriented programming language developed by Sun Microsystems. It was created to address weaknesses of C++, such as the lack of garbage collection and multithreading. One problem with Java, from this application's point of view, is that the MPI standard provides language-specific bindings for C, Fortran and C++, but no Java binding exists. Several groups have tried to develop MPI-like bindings for Java independently. Among these, mpiJava [93] was chosen, since it is just a simple wrapper around the MPI implementation used in this thesis, MPICH [52], and not a commercial effort.

Another problem arises when using modules written in different languages: having PSs in Java and TSs in C++ requires MPICH and mpiJava to communicate. However, the fact that mpiJava is a wrapper around MPICH helps solve this problem.



Algorithm A – Plans Server<br />

while not EOF do<br />

read a plan<br />

keep the earliest time that an agent starts simulating

if agent is mine then<br />

save agent details and its plan<br />

end if<br />

end while<br />

exchange fictitious messages with TSs to be synchronized<br />

time measurement for IDs starts<br />

pack all agent IDs and start link IDs along with simulation start time<br />

multi-cast packet of agent IDs and start link IDs to TSs<br />

collect from TSs feedback about which agent starts on which TS<br />

time measurement for IDs ends<br />

exchange fictitious messages with TSs to be synchronized<br />

time measurement for plans starts<br />

pack plans of agents into a single packet for each TS<br />

send packets of plans to TSs<br />

time measurement for plans ends<br />

(a) Plans Server<br />

Algorithm B – Traffic Simulator<br />

read domain decomposition result<br />

exchange fictitious messages with PSs to be synchronized<br />

time measurement for IDs starts<br />

receive agent IDs, start link IDs and earliest start time from PSs<br />

unpack agent IDs and start link IDs<br />

record simulation start time<br />

send back agent IDs that have start links which are on my sub-domain<br />

time measurement for IDs ends<br />

exchange fictitious messages with PSs to be synchronized<br />

time measurement for plans starts<br />

receive agents’ info and their plans<br />

unpack agents’ info and their plans<br />

time measurement for plans ends<br />

start simulating<br />

(b) Traffic Simulator<br />

Figure 7.1: Interaction of Plans Servers with Traffic Simulators<br />

7.4 Java and C++ Implementations of the Plans Server<br />

Although the data structures available in C++ and Java follow common approaches, and furthermore the plans server implementation in Java is a projection of the one in C++, their performance results might differ. This section gives the details of the data structures and operations used in the different plans server implementations. Section 7.4.1 discusses packing data via different methods in the plans servers: specifically, the plans server written in C++ packs the data by using the C function memcpy, while the plans server written in Java utilizes a self-implemented class called BytesUtil to pack the data. Section 7.4.2 explains the different data structures used in the plans servers to store the agents. The plans server in C++ uses

Figure 7.2: Sequence of task execution of the TSs and PSs on a time-line: the PS packs and sends all agent IDs and start link IDs, the TS unpacks them, finds its local IDs and sends them back, and the PS then packs and sends the local agents' plans, which the TS unpacks before it starts simulating; the time measurements bracket the ID exchange and the plans exchange separately.

STL-multimap and the one in Java employs TreeMap structures to store the agents.<br />

7.4.1 Packing and Unpacking

The PS/TS in C++ packs/unpacks messages by calling memcpy as many times as there are items to be packed/unpacked. memcpy serves as a conversion function between different data types and bytes (1). When different items are packed into or unpacked from the same byte buffer, the pointer into the buffer must be moved forward explicitly. An example is shown in Figure 7.3.

The Java implementation of the PS uses the same approach as the C++ version by converting all the data into bytes (2) using a class called BytesUtil. The BytesUtil conversion methods take a byte buffer, the data itself and a position in the buffer as input arguments and write the converted data into the buffer starting at that position. The position is incremented by the methods implicitly. An example is given in Figure 7.4.

(1) In C/C++, a char is one byte long; hence char and byte are used interchangeably in C/C++.
(2) In Java, however, a char is two bytes long, so the two types differ. Since the TSs are written in C++, the PSs in Java use byte when transferring the data.


int    integer_item;
double double_item;
char   buffer[BUFFER_SIZE];
char  *pos = buffer;

memcpy(pos, &integer_item, sizeof(integer_item));
pos += sizeof(integer_item);   /* move pointer into buffer explicitly */
memcpy(pos, &double_item, sizeof(double_item));
pos += sizeof(double_item);

Figure 7.3: Pseudo code for packing different data types with memcpy
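For completeness, the unpacking on the receiving side (see Section 7.5.3) mirrors this scheme. The following is a minimal sketch, not the thesis code; the function and buffer names are illustrative:

#include <cstring>

void unpackExample(const char *buffer) {   // buffer as received over MPI
    int    integer_item;
    double double_item;
    const char *pos = buffer;

    std::memcpy(&integer_item, pos, sizeof(integer_item));
    pos += sizeof(integer_item);           // advance past the int
    std::memcpy(&double_item, pos, sizeof(double_item));
    pos += sizeof(double_item);            // advance past the double
}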

// Conversion function: from int to bytes.
// The other functions are analogous,
// but with different numbers of bits.
public int intToBytes(int num, byte[] bytes, int startIndex) {
    bytes[startIndex]     = (byte) (num & 0xff);
    bytes[startIndex + 1] = (byte) ((num >> 8)  & 0xff);
    bytes[startIndex + 2] = (byte) ((num >> 16) & 0xff);
    bytes[startIndex + 3] = (byte) ((num >> 24) & 0xff);
    return startIndex + 4;
}

int integer_item;
double double_item;
byte[] buffer;
int start_index;

start_index = intToBytes(integer_item, buffer, start_index);
start_index = doubleToBytes(double_item, buffer, start_index);

Figure 7.4: An example of the methods of BytesUtil

7.4.2 Storing Agents in the Plans Server

The data structure for storing the agents is nontrivial, not only with respect to the memory it consumes but also with respect to the time needed to access the agents. The PS in C++ uses the STL (Standard Template Library, Section 3.3.1) multimap. The multimap holds pointers to the agents, so the memory consumption is dominated by the agents themselves. In addition, each PS creates a linked list for each TS; each linked list holds pointers to the agents that the corresponding TS is interested in. When the TSs send back the agent IDs to request the agent data, the PSs look the IDs up in the multimap and append the pointer to each agent to the corresponding TS's linked list. The code is given in Figure 7.5.

The PS in Java uses a TreeMap to store all the agents read at the beginning. It then creates a Vector for each TS once it knows the domain decomposition and hence which TS each agent belongs to. The code is given in Figure 7.6.


// C++ version
int key;
for (the number of agents) times {
    create an agent;
    agents.insert(make_pair(key, agent));
}

LinkedList sub_agents[];   // as many as TSs
create a sub_agents linked list for each TS
receive distribution information of agents from TSs
for (each agent in agents) {
    find the TS ID that the agent belongs to;
    sub_agents[TS_ID].push_back(agent);
}

for (each traffic flow simulator) {
    prepare a send packet using the corresponding sub_agents linked list;
}

Figure 7.5: Data structures for agents in a C++ Plans Server

// Java version
Object key;
TreeMap agents = new TreeMap();
for (the number of agents) times {
    create an agent;
    agents.put(key, agent);
}

Vector[] sub_agents;   // as many as TSs
create a sub_agents vector for each TS
receive distribution information of agents from TSs
for (each agent in agents) {
    find the TS ID that the agent belongs to;
    sub_agents[TS_ID].add(agent);
}

for (each traffic flow simulator) {
    prepare a send packet using the corresponding sub_agents vector;
}

Figure 7.6: Data structures for agents in a Java Plans Server

7.5 Theoretical Expectations

The theoretical expectations are calculated using a program that counts clock cycles, taken from [35], and PMB [30], which measures the performance of MPI. PMB is explained in Section 6.4.


The calculations are based on the modules of the framework written in C++ and on a system with a single PS. The PSs and TSs run on a cluster whose nodes have PIII 1 GHz CPUs (10^9 cycles per second); more details about the cluster can be found in Section 4.5. The underlying network used in the tests here is Myrinet [54].
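As an aside, cycle counting on IA-32 CPUs such as the PIII is typically done with the rdtsc instruction. The following is a minimal sketch of the idea only, not the counter program of [35]:

#include <stdint.h>

// Read the IA-32 time-stamp counter: the number of cycles since reset.
static inline uint64_t readTsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Usage: uint64_t c = readTsc(); /* operation */ c = readTsc() - c;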

7.5.1 PSs Pack

Agent IDs and their start link IDs

A PS creates a single packet composed of all agent IDs assigned to itself and their start link IDs. The packing procedure for each pair consists of 2 calls to the C function memcpy and uses up about 330 clock cycles according to the clock cycle counter. Hence, for a system with a single PS, it will take

    330 cycles * 10^6 agents / 10^9 cycles/s = 0.33 s

to pack about 1 million agent IDs and their start link IDs.

Agents and their plans

Again, a single packet is created for all agent information and plans by using the memcpy library call. The data packed for each agent contains an agent ID, a start link ID, a route length, the node IDs that the agent must pass through during its trip, the duration of the activity, an end time of the activity, a leg number and a destination link ID.

Counting the clock cycles shows that each agent is packed in 1800 clock cycles. Therefore, 1 million agents will be packed in

    1800 cycles * 10^6 agents / 10^9 cycles/s = 1.80 s.

7.5.2 PSs Send and TSs Receive

As discussed in Section 6.8.2, the latency and the bandwidth values measured by PMB for packets of different sizes on the cluster are additive. Consequently, the theoretical expectation for the transmission time of a message of size m is

    t_transmit = t_latency + m / bandwidth.
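A back-of-the-envelope check of these expectations can be scripted directly from the formula. A minimal sketch follows; the constants are illustrative, not the measured PMB values:

#include <cstdio>

// Theoretical transmission time: latency plus size over bandwidth.
double transmitTime(double latency_s, double bytes, double bandwidth_Bps) {
    return latency_s + bytes / bandwidth_Bps;
}

int main() {
    // Illustrative numbers: ~230 MB/s plus a small per-message latency.
    double t = transmitTime(20e-6, 8.0 * 1024 * 1024, 230e6);
    std::printf("8 MB packet: %.3f s\n", t);   // about 0.035 s
    return 0;
}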

Agent IDs and their start link IDs

Using 1 million agents gives a single packet approximately 8 MBytes in size, since agent IDs and link IDs are integer values (4 bytes each). Using PMB, the transmission time for an 8 MB packet is found to be approximately 35 milliseconds. Therefore, 0.035 seconds is needed for a single PS to send all agent IDs and their start link IDs to a single TS.

Agents and their plans

During the tests, it is observed that a packet which contains all the agent information and plans is approximately 95 MBytes long. PMB gives a transmission time of approximately 420 milliseconds for a packet of this size; the theoretical value is therefore 0.42 seconds.



7.5.3 TSs Unpack

Agent IDs and their start link IDs

Unpacking a single agent ID and its start link ID takes, as expected, the same amount of time as packing them, since memcpy is called twice, as in the packing case. Given that unpacking an agent ID and its start link ID needs around 330 clock cycles, in a system consisting of a single PS and a single TS, the TS unpacks 1 million agent IDs and their start link IDs in

    330 cycles * 10^6 agents / 10^9 cycles/s = 0.33 s.

Agents and their plans

Similarly, unpacking a single agent with its plans takes 8800 clock cycles, whereas packing only takes 1800. The difference of 7000 clock cycles comes from creating the agents, searching for the start and end link IDs in the local network, and setting the agent information before the simulation starts. Therefore, with a single PS and a single TS, 1 million agents are effectively unpacked in

    8800 cycles * 10^6 agents / 10^9 cycles/s = 8.80 s.

7.5.4 TSs Pack and Send

Local Agent IDs

When the TSs, as shown in the algorithm in Figure 7.1(b), receive all the agent IDs and their start link IDs, they unpack these values and then search for the link IDs that are local to them (based on the domain decomposition). If a link ID is local to a TS, all the agent IDs that start on that link are added to a send buffer that will be sent back to the corresponding PS, i.e., the PS responsible for that agent. Thus, for a TS, packing local agent IDs means searching for the start link IDs among the local links and packing those agent IDs that start on the TS's local network. The search algorithm is binary search, for the performance reasons explained in Section 3.3.2.

The number of clock cycles needed for searching a link ID among the local links and packing an agent ID on a local link is recorded as 830 cycles. Therefore, 1 million agent IDs will be packed in

    830 cycles * 10^6 agents / 10^9 cycles/s = 0.83 s.
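The local-link lookup can be pictured as a binary search over the sorted list of local link IDs. A minimal sketch, with illustrative names:

#include <algorithm>
#include <vector>

// localLinks must be sorted ascending (done once after domain decomposition).
bool isLocalLink(const std::vector<int>& localLinks, int linkId) {
    return std::binary_search(localLinks.begin(), localLinks.end(), linkId);
}

// For each received (agentId, startLinkId) pair:
//   if (isLocalLink(localLinks, startLinkId)) pack agentId into the send buffer.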

A TS sends the agent IDs back in a single packet of 4 MBytes. Using PMB, the transmission time for a 4 MB packet is roughly half that of an 8 MB packet, i.e., approximately 18 milliseconds, so about 0.018 seconds is needed for a TS to send its local agent IDs back to a single PS.

7.5.5 PSs Unpack

Local Agent IDs

An agent ID is unpacked by a PS in 550 cycles.



This number includes finding the agent ID among the plans and setting its corresponding TS value. In a system with 1 million agents, all the IDs will be unpacked by a PS in

    550 cycles * 10^6 agents / 10^9 cycles/s = 0.55 s.

7.5.6 Multi-casting Plans

As an alternative, the plans can be multi-cast to the TSs; each TS then receives all the plans and stores only those that are local to its network. When there is only a single PS and the PS packs the plans for multi-cast, the packing is done once and takes 1.80 seconds, as explained above. This packing process results in a message/packet with a size of 95 MBytes. Sending this big packet to one TS takes 0.42 seconds. If there are N TSs in the system, sending the packet to all the TSs takes N * 0.42 seconds, since each send incurs its own transmission time. Finally, when the packet is retrieved by a TS, the TS unpacks it and checks whether a plan/person starts on its local domain. As said above, a TS unpacks the complete plans in 8.80 seconds (unpacking the values takes 1.80 seconds; creating and inserting objects into the appropriate links takes 7.00 seconds) and checks whether a plan starts on its sub-domain in 0.83 seconds. Thus, when the plans are multi-cast to the TSs, the theoretical time for packing and sending by a PS and receiving and unpacking by the TSs is approximately

    1.80 s + N * 0.42 s + 8.80 s + 0.83 s = (11.43 + 0.42 N) s;

for example, N = 4 TSs gives about 13.1 seconds.

7.6 Results

The tests are repeated for systems with different numbers of PSs and TSs. The theoretical values are also added to the figures and labeled "TV".

7.6.1 PSs Pack

The packing times for IDs and plans are shown in Figure 7.7. This figure shows only the packing time: the measurement starts right before the first data item is packed and ends right after the last one is in the buffer. The theoretical curve and the measured curves for a system with one PS are approximately equal, as expected. The curves are constant over the number of TSs because the PSs must pack all the agents they hold, no matter how many TSs are in the system.

As the number of PSs increases, the packing time decreases, since adding more PSs to the system decreases the number of agents each PS has to pack.

Since there is no built-in Java function to convert the various data types into bytes or vice versa, several functions are explicitly implemented in Java, as shown in Figure 7.4. Figure 7.7 shows that these supplementary functions in Java are 6 and 3 times slower than the memcpy function in C when packing IDs and plans, respectively.

7.6.2 PSs Send

The time measurement for the sending time alone over Myrinet is shown in Figure 7.8. As the number of PSs in the system increases, the total sending time decreases, since the send buffers contain fewer elements.


[Figure 7.7 shows two log-scale plots over the number of traffic simulators (1-16): (a) "For PSs, Pack Only (IDs & StartLinks)" and (b) "For PSs, Pack Only (Agents & Routes)", each with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.7: Time elapsed for packing. (a) Packing agent IDs and start link IDs. (b) Packing agents. Having n PSs means that approximately 10^6/n agents are handled by each PS.

The curves in Figure 7.8(a) go up linearly. This is because the PSs send all agent IDs and their start link IDs to all the TSs; the latency contribution of each transfer therefore accumulates as the number of recipients (TSs) increases. Multi-casting the agent IDs and their start link IDs is handled by a "for" loop in which the PSs send the same data to all TSs in the system; after each send to a TS completes, the PS transfers the data to the next TS. In order to minimize the competition among PSs for the same TSs, the sequence of the "for" loop is shuffled for each PS. Hence, each PS follows a different sequence of TS IDs when sending the data.

The curves in Figure 7.8 are almost equal for the same number of PSs regardless of whether Java or C++ PSs are used. This is because in both cases the send and receive functions are the ones provided by MPI.

When sending the agent information and the plans, the PSs create a separate packet for each TS in the system and then initiate the sends in a "for" loop. Again, the sequence of the "for" loop is shuffled to minimize the effects of PSs competing for the same TSs in the same order.
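The shuffled send order can be sketched as follows. This is a minimal illustration, not the thesis code; the rank variables and buffer are assumptions, while MPI_Send and the standard-library calls are the real APIs:

#include <mpi.h>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Send the same buffer to every TS, in a per-PS shuffled order, so that
// the PSs do not all compete for the same TS at the same time.
void multicastToTSs(char *buffer, int bufferSize,
                    int firstTsRank, int nTS, int myRank) {
    std::vector<int> tsRanks(nTS);
    std::iota(tsRanks.begin(), tsRanks.end(), firstTsRank);
    std::mt19937 rng(myRank);                       // different order per PS
    std::shuffle(tsRanks.begin(), tsRanks.end(), rng);

    for (int ts : tsRanks)
        MPI_Send(buffer, bufferSize, MPI_BYTE, ts, /*tag=*/0, MPI_COMM_WORLD);
}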


[Figure 7.8 shows two log-scale plots over the number of traffic simulators (1-16): (a) "For PSs, Send Only (IDs & StartLinks)" and (b) "For PSs, Send Only (Agents & Routes)", each with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.8: Time elapsed only for sending over Myrinet. (a) Sending agent IDs and start link IDs. (b) Sending agents.

7.6.3 TSs Receive

The receiving time measurement starts just before the MPI_Recv command is called. As shown in the algorithm in Figure 7.1, the overall measurement ends after unpacking finishes; in order to measure the receiving time alone, the unpacking process shown in Figure 7.1(b) is excluded from the time measurement. Moreover, "empty" messages are exchanged before the time measurement for receiving takes place, to synchronize the processes and to exclude other intermediate operations.

The effective receiving time curves over Myrinet are shown in Figure 7.9. If one adds up the figures for packing time (Figure 7.7) and sending time (Figure 7.8), the result is the same as Figure 7.9. This is what one expects: when a receive command is issued, the receiver has to wait until the data arrives. Therefore, the receiving time includes not only the actual receiving time but also the waiting time for the sender. While the TSs wait for the data from the PSs, the PSs pack the data; that is why the receiving time is the sum of the sending and packing times.

The curves are nearly constant since most of the receiving time is waiting time for the data. The waiting-time contribution comes from the PSs, which are packing (see Figure 7.7).


[Figure 7.9 shows two log-scale plots over the number of traffic simulators (1-16): (a) "For TSs, Effective Receive (IDs & StartLinks)" and (b) "For TSs, Effective Receive (Agents & Routes)", each with curves for C++ and Java with 1, 2 and 4 PSs.]

Figure 7.9: Time elapsed for the effective receiving time over Myrinet. (a) Effective receiving of agent IDs and start link IDs. (b) Effective receiving of agents.

Figure 7.9(a) shows a slight increase of the curves as the number of TSs increases. This is because, for a constant number of PSs, the data is transferred to all the TSs (Figure 7.8).

7.6.4 TSs Unpack

When the TSs unpack the agent IDs and their start link IDs, the total unpacking time is almost constant. This is because every TS must retrieve all agent IDs and start link IDs to find out which ones start on its local domain. The resulting curves for unpacking on top of the effective receiving time over Myrinet are shown in Figure 7.10(a).

Even if the number of PSs changes, the total number of agents in the system does not. Hence, the total number of agent IDs and start link IDs to be unpacked on the TS side stays the same. The differences between the curves in Figure 7.10(a) come from the fact that they represent the unpacking time on top of the effective receiving time.

If the data transferred is the agent information and the plans, each TS gets a packet which consists of all the agents that start on its own sub-domain.


[Figure 7.10 shows two log-scale plots over the number of traffic simulators (1-16): (a) "For TSs, Effective Receive + Unpack (IDs & StartLinks)" and (b) "For TSs, Effective Receive + Unpack (Agents & Routes)", each with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.10: Time elapsed for unpacking on top of the effective receiving time over Myrinet. (a) Unpacking agent IDs and start link IDs. (b) Unpacking agents.

As the number of TSs increases, the number of agents that start on the sub-domain of a given TS decreases. Figure 7.10(b) shows the unpacking time for agents on top of the effective receiving time.

7.6.5 TSs Pack and Send

After a TS unpacks the agent IDs and their start link IDs, it must check which agents start in its local sub-domain. The search is done via binary search. If the start link ID of an agent is found in the local domain of a TS, the TS adds the agent ID to a send buffer that will be sent to the dedicated PS. Thus, before a TS packs agent IDs, the search algorithm is run for all the start link IDs. In other words, the packing time for local agent IDs on the TSs is dominated by the search algorithm and is independent of the numbers of TSs and PSs in the system. The resulting packing time curves are shown in Figure 7.11. There is no difference between the C++ and Java versions of the PSs, since the packing is done by the TSs and is independent of the PS implementation.

Figure 7.12 shows the sending time over Myrinet for the agent IDs sent by the TSs.

[Figure 7.11 shows a log-scale plot, "For TSs, Pack Only (IDs)", over the number of traffic simulators (1-16), with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.11: Time elapsed for packing agent IDs by TSs

[Figure 7.12 shows a log-scale plot, "For TSs, Send Only (IDs)", over the number of traffic simulators (1-16), with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.12: Time elapsed for sending agent IDs by TSs to PSs over Myrinet

Since the TSs send packets to several or all of the PSs in the system, the latency contribution to the total sending time increases with the number of PSs. On the other hand, as the number of TSs increases, the packet size decreases, i.e., each packet is transferred faster. Moreover, as the number of TSs increases, the competition among TSs for the same PSs becomes more pronounced. The curves in Figure 7.12 show all of these effects.

7.6.6 PSs Receive and Unpack

Figure 7.13 shows the effective receiving time over Myrinet of the agent IDs at the PSs. Given a constant number of PSs, the total number of IDs that will be sent to each PS is independent of the number of TSs. Furthermore, as stated earlier, the effective receiving time also includes the waiting time that passes before a packet enters the receive buffer. The curves are almost constant because after a PS issues a receive command, it waits for the TSs, which execute a binary search to find the local link IDs and pack the agent IDs on the local links.


[Figure 7.13 shows a log-scale plot, "For PSs, Effective Receive (IDs)", over the number of traffic simulators (1-16), with curves for C++ and Java with 1, 2 and 4 PSs.]

Figure 7.13: Time elapsed for receiving agent IDs by PSs over Myrinet

[Figure 7.14 shows a log-scale plot, "For TSs, Effective Receive + Unpack (IDs)", over the number of traffic simulators (1-16), with curves for C++ and Java with 1, 2 and 4 PSs, plus the theoretical value (TV) for C++ with 1 PS.]

Figure 7.14: Time elapsed for unpacking agent IDs by PSs on top of the effective receiving time over Myrinet

As shown in Figure 7.11, the time elapsed for packing is constant, since the binary search dominates the elapsed time and is executed for all the link IDs. This constant effect is also seen in Figure 7.13.

Figure 7.14 shows the unpacking time on the PSs on top of the effective receiving time over Myrinet. The most obvious result is the difference between the Java and C++ implementations of the unpacking procedure of a PS. Although the Java implementation unpacks the agent IDs 3.5 times more slowly than the C++ one in a system with one PS, increasing the number of PSs reduces the difference; for example, in a system with 4 PSs, the Java version is 2.7 times slower than the C++ version.


[Figure 7.15 shows two log-scale plots over the number of traffic simulators (1-16), one titled "Plans Server, C++, Myrinet" and one titled "Plans Server, Java, Myrinet", with stacked curves labeled PS-p all-ids, PS-m all-ids, TS-er all-ids, TS-u all-ids, TS-p loc-ids, TS-s loc-ids, PS-er loc-ids, PS-u loc-ids, PS-p agents, PS-s agents, TS-er agents, TS-u agents, and File I/O (reading file).]

Figure 7.15: Summary figures for the single PS case. (a) Plans Server written in C++. (b) Plans Server written in Java. The thick lines denote the time consumption when plans reading is fully replaced by sending the plans directly to a traffic flow simulator. Each label denotes a curve showing the total time of all operations up to and including the operation in the label; for example, the curve "PS-m all-ids" is drawn on top of "PS-p all-ids". The label abbreviations are: p = pack, m = multicast, er = effective receive, u = unpack, s = send.

[Figure 7.16 shows the same two plots as Figure 7.15 on a linear scale (0-350 seconds).]

Figure 7.16: Same plots as Figure 7.15, but on a linear scale.

7.7 Conclusions and Summary

Plans are the input to the traffic flow simulators. Traditionally, plans are read from a file. Because of the file I/O inefficiency explained in Section 5.4, transferring plans between modules via messages is explored in this chapter.

The "ch6-9" scenario with approximately 1 million agents is read in 159 seconds in C++ and 240 seconds in Java; these numbers include the memory allocation of the agents. The plans of the agents in the same scenario are written in 272 seconds in C++ and 100 seconds in Java. Hence, the total time required for I/O is 431 and 340 seconds for the C++ and Java cases, respectively.

When the plans fit into memory, instead of dumping them into files, the plans servers pack them from memory and send them to the traffic flow simulators. The traffic flow simulators receive, unpack and allocate the agents, then start simulating. Two big chunks of data are transferred between the plans servers and the traffic flow simulators to accomplish this setup:

- the agent IDs and their start link IDs, so that the traffic flow simulators can identify the agents that start on a link belonging to them under the domain decomposition;

- the agent information and plans, so that the traffic flow simulators can start simulating.

Time   nPSs  nTSs  OP             Note
1.87s  1     1     pack agents    C++, memcpy
5.06s  1     1     pack agents    Java, BytesUtil
0.48s  1     1     send agents    C++, mpi, MPICH
0.50s  1     1     send agents    Java, mpi, mpiJava
2.35s  1     1     recv agents    C++, mpi, MPICH, eff
5.56s  1     1     recv agents    Java, mpi, mpiJava, eff
11.2s  1     1     unpack agents  C++, memcpy, on top of recv
14.4s  1     1     unpack agents  Java, BytesUtil, on top of recv

Table 7.1: Summary of the performance results for plans transferred between TSs and PSs.

According to the tests, the total time elapsed for packing, sending, receiving, unpacking and allocating memory for the agents is approximately 13 seconds in the C++ case and 22 seconds in the Java case. Therefore, the transition from the XML plans file to messages speeds up the time for the traffic flow simulators to obtain the plans by a factor of 33 in C++ and 16 in Java. The summary figures are shown in Figure 7.15 and Figure 7.16.

The most important conclusions about transferring plans via messages are the following:

- Reading plans from files and writing them into files should be replaced by sharing plans among modules via messages.

- Conversion of data into a byte array in Java performs somewhat worse than in C++.

- Since the underlying MPI implementation is the same for mpiJava and MPICH, the MPI functions do not show a difference.

- Plans can be multi-cast to the modules so that they are transmitted once as a whole.

Plans shared via files can thus be fully replaced by sending the plans directly to a traffic flow simulator. By doing so, where plans in the XML file format are concerned, the computational performance is 12 times better in C++ and 11 times better in Java.

Table 7.1 summarizes the most important performance numbers, collected for the operations under different circumstances. The abbreviations used in the table are as follows: nPSs and nTSs are the numbers of plans servers and traffic flow simulators; OP is the operation measured; MPICH and mpiJava are the MPI implementations used for C++ and Java, respectively; eff is short for effective receiving; on top of recv means that the measured time includes the effective receive time.


Chapter 8

Going beyond Vehicle Traffic

8.1 Introduction

Historically, the demand for connecting users to one another drove developments in communication systems. Communication over a dedicated circuit was the first step. This is known as a circuit-switched network, which establishes a physical channel/circuit/path dedicated to a single connection for the duration of the transmission between two end-points. The telephone system is circuit-switched, as it is connection-oriented.

The Internet, an appealing evolution in this history, initiated the changeover from circuit-switched networks to packet-switched networks, which allow sharing of the physical channel among multiple virtual/logical dedicated connections. In this type of connection, messages are transmitted as packets, which are re-assembled at the destination to form the original message.

Telephone networks are underpinned by mathematics at all levels, such as design, control and management [88]. These networks guarantee quick transmission and arrival of the data in the same order it was sent; the entire message follows the same path. Telephone networks are known for their static nature, since variability in these networks is rare. The routers keep track of the active connections in order to forward the data. Data transfers in the telephone network are modeled by the Poisson distribution, since call arrivals are mutually independent and call durations are exponentially distributed with a single parameter [14]. The static nature of telephone networks matches the Poisson picture (the aggregated traffic becomes less bursty as the number of traffic sources increases), so analysis of the data and predictions can easily be made.

Data packet traffic (packet-switched networks) exhibits different characteristics than voice traffic (telephone networks). As opposed to voice traffic,

- the variability in duration for data packet traffic is vast,

- the packets of a message might follow different routes on the way to the destination,

- each packet carries header information, so routers only check the header to forward the data.

Therefore, data packet traffic requires more than a single-parameter Poisson distribution to be understood. In contrast to a Poisson process, as the number of sources (users) increases, the resulting aggregated data packet traffic becomes more bursty instead of smoother. Previous works ([88], [49]) showed increasing evidence that self-similar (fractal) behavior arises in data packet networks on large time scales.


A process X is said to be self-similar with self-similarity (Hurst) parameter H if the aggregated processes X^(m) have the same correlation structure as X. Consequently, the variance of the arithmetic mean decreases more slowly than the reciprocal of the sample size [87].

The self-similar nature of LAN traffic (aggregated traffic, i.e., the number of packets or bytes per time unit sent over the Ethernet [77] by all active hosts) was shown by Leland et al. in [87]. In other words, LAN traffic measured on microsecond/second scales exhibits the same characteristics as traffic on larger time scales. Paxson and Floyd showed in [62] that WAN traffic also exhibits self-similarity.

Self-similar processes can be described by a power-law function, which roughly relates new scales to old scales by constant factors. The power law describes systems in which large events are rare and small ones are quite common. For example, only a few web sites are visited by enormous numbers of people, in contrast to the millions of web pages accessed by few. Self-similar processes have the advantage over the Poisson distribution that there is no defined natural length of a "burst", which can range from a few milliseconds to minutes and hours.

Besides the self-similarity of data packet traffic at the macroscopic level (aggregated traffic), Willinger et al. demonstrated in [89] that data packet traffic at the microscopic level (the traffic pattern displayed by individual source-destination pairs) follows heavy-tail models.

Huisinga et al. [81] give a microscopic model for data packet traffic in the Internet, described as a simple one-dimensional model. The model investigates congested and free-flow phases in the presence of a slow router in the system. It introduces a simple cellular automaton model which defines a finite buffer on each router. When a packet needs to leave a router for the next destination, its movement must obey a router-specific probability, besides the availability of buffer space in the next router. All buffers are FIFO queues and are updated in parallel. Travel times of the packets are measured for the free-flow and jammed regimes as a defective router is introduced into the system. The results show that in both cases the travel times obey power-law characteristics.

8.2 Queue Model as a Possible Microscopic Model for Internet Packet Traffic

In this section, the queue model described in Chapter 2 is investigated as a possible model for data packet traffic in the Internet. The queue model can be used to simulate "internetworks", since routing operates at this level. Moreover, since the queue model is designed to be large-scale, large scenarios such as Distributed Denial of Service (DDoS) attacks can be simulated at this level.

The graph data is described as follows. Routers and hosts are the nodes; those which are part of several networks can have more than one interface, and each interface is assigned a unique IP address. Links, on the other hand, can refer to cables, modems, Digital Subscriber Lines (DSL) or satellites, which connect two nodes. The agents of such a system are the data packets. In contrast to vehicles, they only know their destination but not their route.

The queue model explained in Chapter 2 needs some modifications to fit Internet packet traffic better. The storage constraint of a link corresponds to the number of sites/spaces available on the link. It is inversely proportional to the packet length, i.e.,

    N_sites = C_link / l_packet,

where C_link is the storage capacity of the link and l_packet is the (fixed) packet length.

The spatial queue and the buffer of a link can be thought of as the incoming and outgoing memories of a network card. Specifically, the spatial queue is the outgoing memory and the buffer becomes the incoming memory. Thus, when a packet is about to leave the sender-side network card, it is put into the outgoing memory, and upon its arrival at the receiver side it is put into the incoming memory.

Consequently, packets are moved from the outgoing memory of the sender side to the incoming memory of the receiver side at the capacity of the link. The node-to-node bandwidth (the amount of data that can pass between two nodes in one second) is given by the capacity of the link, which is, for example, 100 Mbit/s for a 100 Mbit Ethernet LAN. This corresponds to the flow capacity defined in the queue model.

The last constraint of the queue model is the free flow travel time, which in vehicle traffic is the ratio of the link length to the free flow velocity. Where Internet packet traffic is concerned, the free travel time is defined as the node-to-node latency, which includes the processing overhead of initialization at the network card, copying data between memory and the network, and the transfer time of the data from the sender to the receiver.
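Putting the three constraints together, a link in the adapted queue model could be sketched as follows. This is a minimal illustration under the assumptions above; the names are not from the thesis code:

#include <queue>

// One directed link of the packet-traffic queue model.
struct Link {
    std::queue<int> outgoing;   // spatial queue = outgoing card memory
    std::queue<int> incoming;   // buffer        = incoming card memory
    double storage;             // N_sites = C_link / l_packet
    double bandwidth;           // flow capacity, e.g. 12.5 MB/s for 100 Mbit
    double latency;             // free travel time = node-to-node latency
};

// Per time step, at most bandwidth * dt / l_packet packets move from the
// sender's outgoing queue into the receiver's incoming queue, provided the
// receiver's storage constraint is not violated.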

When the agents are vehicles, they are aware of their routes, since each route is predefined in the queue simulation. Internet packets, in contrast, only know their destination. When a router receives an incoming packet, it only checks the header of the packet to find, via a routing table, the next hop that the packet should follow. Hence, while the nodes in vehicle traffic carry no constraints, this no longer holds when Internet data packets are the agents in the simulation: each node (router) must be given a capacity for the number of packets it can handle per unit time.

Vehicles are also told when and on which link to start, and during a simulation the creation of vehicles other than the ones defined at the beginning is unusual. When Internet data packets are chosen as agents, they contradict some features of vehicles. Packets only know their destinations; their routes from source to destination are computed on the fly. Moreover, the dynamics of the Internet allows new packets to be created; for example, requests via ftp [32] or HTTP [12] result in the creation of new packets to carry the response back to the requester. As the types of packets in the Internet vary, they cause different event handlers to take the corresponding actions: the responses to an ftp and an HTTP request, for example, are different.

Besides the existence of packets for different purposes, packets in the Internet have variable length. The queue model, on the other hand, assumes that each vehicle occupies a space of fixed length (7.5 m); therefore, when the queue model is applied to Internet packet traffic, the packet size is fixed.

Besides the modifications explained above, some new constraints and parameters need to be introduced as well. Some examples are:

- the number of packets per second that a node can forward, i.e., the number of packets that a node can move from its incoming buffer to its outgoing buffer;

- the IP addresses and masks to be assigned to each interface of the hosts.

One of the most important features that must be added to the queue simulation in order to handle Internet data packets is the creation of routing tables at the nodes (routers). Although they can be built before the simulation starts, they should be updated regularly according to the congestion information of the network. A very simple routing algorithm associates each destination with the next hop, meaning that the destination can be "optimally" reached via that next hop; a sketch of this idea is given below.
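A minimal sketch of such a next-hop table follows; the names are illustrative, not the thesis code:

#include <map>

// destination node ID -> next hop node ID; one table per router.
typedef std::map<int, int> RoutingTable;

// Forwarding: look up the next hop for an incoming packet's destination.
// Returns -1 if the destination is unknown (the packet is discarded).
int nextHop(const RoutingTable& table, int destination) {
    RoutingTable::const_iterator it = table.find(destination);
    return (it == table.end()) ? -1 : it->second;
}

// Periodically, the tables would be rebuilt (e.g. shortest paths on a
// congestion-weighted graph), mirroring the regular updates described above.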

A simple attempt [75] at employing the queue simulation as a simulator of Internet packet traffic was made with changes similar to the ones explained above. Tests are done using the star topology, where all nodes are connected to a central router. This topology is selected because of its simplicity, but bottlenecks are unavoidable because all data must pass through the centralized router.

[Figure 8.1 shows a log-log plot of round-trip travel time versus message size (10 to 100000), with curves for Ping, Uncongested qsim, Congested qsim - 5 hosts, and Congested qsim - 10 hosts.]

Figure 8.1: Round-trip travel times for messages of different sizes. For the ping packets, congestion is not avoidable. The other curves show the results when the queue model simulates congestion or no congestion.

Although realistic Internet traffic patterns are lacking, packets are created roughly in the following form:

<packet id="1" type="HTTP" startTime="100"
        sourceIp="128.96.34.133"
        destinationIp="128.96.33.130"
        size="1000" ttl="7" />

where each packet is defined by a unique ID, a type field, a start time, a size, a source IP address and a destination IP address. The ttl field is short for Time To Live, which specifies how many hops a packet may travel before being discarded or returned.

Some simple tests with ping packets are done. ping is used to determine whether an Internet connection is active: in order to verify reachability, it sends a packet to a specified Internet host and waits for a reply. The result of ping is the round-trip travel time.

Figure 8.1 shows the round-trip travel times for messages of different sizes on a 100 Mbit LAN. The tests are set up so that the destination is 100 m away from the source(s). The speed of the link is 12.5 MBytes/s. Each second simulates 100000 steps; thus one step corresponds to about 10^-5 s.

The curves for ping and uncongested qsim are the results between one source and one destination. The values labeled uncongested qsim, gathered from the simulation, are not as high as those from the ping command. This is because during this test only one packet is sent from source to destination, without encountering any congestion; the ping results come from the real world, where the traffic towards the destination node is not predictable. The curve labeled congested qsim - 5 hosts considers a congested system with 5 hosts, in which 4 of the hosts send ping packets every second to the single destination. Congested qsim - 10 hosts is similar to the 5-host case, but now 9 hosts exhaust a single destination.

The Ethernet packet size is about 1500 bytes, i.e., 12 kbit. For a 100 Mbit Ethernet, this means that about 8333 packets are processed per second. If one takes simulation time steps of 1 second, then in order to handle 8333 packets a 100 Mbit Ethernet card would need incoming and outgoing buffers large enough to hold 8333 packets without any overflow. If one takes simulation time steps of 1 millisecond, then the number of packets processed per time step becomes approximately 8, and 8 packets with a size of 1500 bytes give a total of 12 KB. Given that the Ethernet card used has a memory of 18 KB (approximately 12 Ethernet packets), one can conclude that 8 packets per millisecond can be handled without an overflow on the Ethernet card.
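The buffer-sizing arithmetic above can be written out explicitly; a small sketch using the constants quoted in the text:

#include <cstdio>

int main() {
    const double link_bps    = 100e6;    // 100 Mbit Ethernet
    const double packet_bits = 12000;    // 1500 bytes = 12 kbit
    const double card_mem_B  = 18000;    // 18 KB card memory

    double pkts_per_sec = link_bps / packet_bits;    // ~8333
    double pkts_per_ms  = pkts_per_sec / 1000.0;     // ~8
    double need_B       = pkts_per_ms * 1500.0;      // ~12 KB

    std::printf("%.0f pkts/s, %.1f pkts/ms, need %.0f B of %.0f B\n",
                pkts_per_sec, pkts_per_ms, need_B, card_mem_B);
    return 0;
}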

As stated in Section 4.5.1, a Real Time Ratio (RTR) of 900 can be achieved by running the queue model in parallel with a simulation time step of 1 second. If the simulation runs with time steps of 1 millisecond, an RTR of 900 translates into 0.9, which means that with 1 millisecond time steps the parallel queue model runs close to real time. Thus, the parallel queue model can be utilized for Internet packet traffic with an RTR of about 1. One should note that the modifications made to the queue model to simulate Internet packet traffic are elementary; more sophisticated simulators of Internet packet traffic provide facilities such as more complicated routing algorithms and topologies. For example, the parallel implementation of a widely known network simulator, NS [1], reports a speed-up of 3 for a system with 192 nodes decomposed over 4 computing nodes [44]; the events produced during those tests are reportedly between 15 and 70 million.

The domain decomposition explained in Section 4.1.2 is still useful: the subnetworks, for example, can be distributed among the computing nodes, and any traffic between two subnetworks on two different computing nodes can be carried by MPI [51].

Some of the modules in the framework, such as the events recorder, would be rather useless here, for the following reason: people in a human mobility simulation react rather slowly to new circumstances, although the thought process can be very complex; the Internet, on the contrary, reacts very quickly based on very simple rules.

8.3 Summary

The attention that the Internet draws leads scientists to analyze the data flowing through it. Although similar analyses have been done for telephone networks, the Internet case is not as simple.

Aggregated data traffic becomes more bursty as the number of users increases. This contrasts with telephone networks, which are analyzed with the Poisson distribution. Both LAN and WAN data traffic have been shown to be self-similar, and self-similar processes are described by power laws. It has been shown in various papers ([88], [62], [49], [89]) that the data fit power-law plots.

Huisinga et al. give a microscopic description of data packet transport in the Internet using a simple cellular automaton model. For similar purposes, the queue model described in Chapter 2 can be employed. This means the graph data needs to be redefined and the rules, especially the constraints, need to be adapted to Internet data packet traffic. Furthermore, packets become agents that only know their destination but not the intermediate nodes between source and destination. Nodes in the Internet, contrary to nodes in vehicle traffic, are given a constraint which limits the number of packets per second that a node can process.

Last but not least, a routing table needs to be created at each node and must be updated regularly according to the congestion in the system. Routing tables provide the information needed by the nodes, which are supposed to forward packets (if necessary) arriving in their buffers.

The parallel queue simulation, along with the domain decomposition, could be employed to observe Internet packet traffic. The simulation would give an RTR of 1 when the simulation time step is chosen as 1 millisecond.


Chapter 9<br />

Summary<br />

Among different simulation techniques, multi-agent simulations [22] attract attention since<br />

they enable agents to be defined as complex, because of the rules, and intelligent, because of<br />

the ability to adapt and to learn. Multi-agent simulations, as the name implies, allow multiple<br />

agents to be executed simultaneously based on the rules. This approach gives the possibility<br />

of observing the behaviors of the agents interacting with each other and also helps forecasting<br />

about possible behaviors in future.<br />

The modules in the traditional four-step process for transportation planning describe human<br />

behavior but in aggregated flows. Because of its shortcomings, Dynamic Traffic Assignment<br />

(DTA)(e.g. [19, 20, 27, 5]) model is used to represent the agents at the individual entity level.<br />

To solve DTA with spill-back queues due to congestion, systematic relaxation is employed.<br />

Within relaxation, agents gradually learn from the previous experiences (iterations) where they<br />

interact with the other agents and the environment. The rules defined for agents are executed<br />

during each iteration. After an iteration, each agent records and evaluates its performance.<br />

Evaluation is the learning step. Consequently, the system goes from a congested state to a<br />

relaxed state after some iterations.<br />

The execution of rules are integrated into a traffic flow simulation based on the queue model.<br />

Agents in the simulations are described along with their routes from a source location to a destination<br />

location. The traffic flow simulation takes these routes as input. It produces events of<br />

the agents as agents interact with each other and the environment. These events are interpreted<br />

by the other modules such as router, agent database and activity generator. The router produces<br />

new plans on request, the activity generator changes the end time and the duration of activities<br />

on request and the agent database merges plans come from different sources (routers and<br />

agents) to produce the plans input file for the next iteration.<br />

Object-oriented programming languages such as Java [42] and C++ [80] characterize multiagent<br />

simulations in the best way because they represent internal object structure and agent(object)-<br />

to-agent interactions in the cleanest way. C++ is chosen as the implementation language of the<br />

work represented in this thesis.<br />

One of the reason for using C++ as the programming language is that it promises computationally<br />

fast programs. Running a set of iterations as described above might take enormous<br />

time that is not preferred, especially when an application is detailed at the individual agent level.<br />

Meaningful size scenarios contain several millions of agents, therefore large-scale applications<br />

make the computational performance worse.<br />

Software improvements within sequential computing can already speed up an application. For example, the choice of data structures for frequently accessed data makes a measurable difference, depending on the structure itself, on how its elements are accessed, and on the operations it supports, as the following sketch illustrates.
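As an illustration (not a recommendation; which container wins depends on the access pattern, cf. [53]), two ways of keeping link travel times keyed by link id:

    #include <algorithm>
    #include <map>
    #include <vector>

    using Entry = std::pair<int, double>;  // (link id, travel time)

    // Sorted vector: compact and cache-friendly for read-mostly data.
    double lookupVector(const std::vector<Entry>& v, int linkId) {
        auto it = std::lower_bound(v.begin(), v.end(), linkId,
            [](const Entry& e, int id) { return e.first < id; });
        return (it != v.end() && it->first == linkId) ? it->second : -1.0;
    }

    // Node-based map: simpler when insertions and lookups interleave.
    double lookupMap(const std::map<int, double>& m, int linkId) {
        auto it = m.find(linkId);
        return it != m.end() ? it->second : -1.0;
    }

    int main() {
        std::vector<Entry> v = {{1, 10.0}, {5, 12.5}, {9, 8.0}};  // sorted by id
        std::map<int, double> m(v.begin(), v.end());
        return lookupVector(v, 5) == lookupMap(m, 5) ? 0 : 1;
    }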



Probably a better way to reduce the computation time of a large-scale multi-agent application is to run it in parallel. Among the available methods, domain decomposition suits the work presented here well. It divides the problem into a set of subproblems and assigns each subproblem to a different computing node, aiming at two goals:

- balancing the load on the computing nodes,
- minimizing the communication between the computing nodes.

Load balancing ensures that each computing node receives a fair share of the overall problem, so that no node is either overloaded or idle. The second goal, reducing the communication between computing nodes, arises because the subproblems generated by the domain decomposition are usually not fully independent of each other; solving a subproblem therefore requires exchanging information at the boundaries (see the sketch below).
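A minimal sketch of this bookkeeping; a real implementation would obtain the node-to-CPU assignment from a partitioning library such as METIS [91], and the names here are hypothetical:

    #include <vector>

    struct Link { int from, to; };  // a directed street/graph link

    // Each graph node is assigned to a CPU (the partition). Links whose
    // endpoints live on different CPUs are cut by the partition -- these
    // are exactly the boundaries where information must be exchanged, so
    // a good decomposition keeps this count small while keeping the
    // per-CPU load balanced.
    int countCutLinks(const std::vector<Link>& links,
                      const std::vector<int>& cpuOfNode) {
        int cut = 0;
        for (const Link& l : links)
            if (cpuOfNode[l.from] != cpuOfNode[l.to]) ++cut;
        return cut;
    }

    int main() {
        std::vector<Link> links = {{0, 1}, {1, 2}, {2, 3}};
        std::vector<int>  cpuOfNode = {0, 0, 1, 1};  // two CPUs
        return countCutLinks(links, cpuOfNode);      // one boundary link
    }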

With respect to transportation planning, the domain is the street network, i.e. the graph data. Each sub-graph is assigned to a computing node, and each computing node runs a separate traffic flow simulation on its part of the graph. If an agent's trip leaves the sub-graph held by one computing node, that node transfers the agent in question to the next computing node via a message; this is called message passing. Message passing can be implemented with several software libraries, such as MPI [51], PVM [63], or CORBA [92]. MPI was chosen among these for its computational performance and for the development effort invested in it.
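A minimal sketch of such a transfer with MPI; the two-integer representation of an agent is a deliberate simplification of the actual message layout:

    #include <mpi.h>

    // Send an agent -- reduced here to (id, next link) -- to the neighboring
    // CPU that owns the adjacent part of the street network.
    void transferAgent(int agentId, int nextLink, int neighborRank) {
        int buf[2] = { agentId, nextLink };
        MPI_Send(buf, 2, MPI_INT, neighborRank, /*tag=*/0, MPI_COMM_WORLD);
    }

    void receiveAgent(int neighborRank, int& agentId, int& nextLink) {
        int buf[2];
        MPI_Recv(buf, 2, MPI_INT, neighborRank, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        agentId  = buf[0];
        nextLink = buf[1];
    }

    // Run with at least two ranks, e.g. "mpirun -np 2 ./a.out".
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            transferAgent(17, 42, 1);
        } else if (rank == 1) {
            int id, link;
            receiveAgent(0, id, link);
        }
        MPI_Finalize();
    }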

When further reductions in the computation time of a large-scale application are of interest, better hardware is another option: as stated earlier in this work, Myrinet [54] is a cost-effective, high-performance packet-communication and switching technology, and it can be used to reduce the latency that Ethernet [77] contributes to each message.

Message passing can also be used for inter-module communication. The relaxation described above takes place in a framework that contains different strategic and physical modules, each responsible for a different task. These modules are not fully independent of each other, since they share data such as plans and events; therefore, agreements on the data representation are needed. None of the available wire formats is a perfect solution, but they offer different advantages, such as better performance or better extensibility. The data can be shared between modules via files, but that approach suffers from file I/O bottlenecks; instead, the data can be passed between the modules as messages.
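As a sketch of the message-based alternative, a plan can be flattened into one binary buffer and handed to any message-passing layer; a self-describing text format such as XML [97] would be more extensible but slower. The layout below is an assumption for illustration:

    #include <cstring>
    #include <vector>

    // Flatten a plan (departure time + route links) into one byte buffer
    // that any message-passing layer can send as-is.
    std::vector<char> packPlan(double departure, const std::vector<int>& route) {
        std::vector<char> buf(sizeof(double) + route.size() * sizeof(int));
        std::memcpy(buf.data(), &departure, sizeof(double));
        if (!route.empty())
            std::memcpy(buf.data() + sizeof(double), route.data(),
                        route.size() * sizeof(int));
        return buf;
    }

    int main() {
        std::vector<int> route = {3, 7, 12};
        std::vector<char> msg = packPlan(8.5 * 3600, route);  // 08:30 departure
        return msg.size() == sizeof(double) + 3 * sizeof(int) ? 0 : 1;
    }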

The queue model simulation can be extended beyond transportation planning. One possible area is the simulation of Internet packet traffic, which attracts the attention of researchers concerned with the analysis of data flows, with traffic statistics, and with prediction.
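The analogy is direct: a network link with a finite buffer and a finite forwarding rate behaves like a road link with a storage constraint and a flow capacity. A hedged sketch of such a reinterpreted queue link:

    #include <queue>

    // A queue-model link reinterpreted for packets: 'capacity' limits how
    // many packets leave per time step (flow capacity), 'storage' is the
    // finite buffer (the storage constraint that causes spill-back).
    struct QueueLink {
        int capacity;            // packets forwarded per time step
        int storage;             // buffer size
        std::queue<int> buffer;  // packet ids waiting on the link

        bool accept(int packetId) {
            if ((int)buffer.size() >= storage) return false;  // buffer full
            buffer.push(packetId);
            return true;
        }
        int forward() {          // one simulation time step
            int moved = 0;
            while (moved < capacity && !buffer.empty()) {
                buffer.pop();    // hand the packet to the next link (omitted)
                ++moved;
            }
            return moved;
        }
    };

    int main() {
        QueueLink link{2, 4};            // capacity 2 per step, buffer of 4
        for (int p = 0; p < 6; ++p)
            link.accept(p);              // packets 4 and 5 are refused
        return link.forward();           // forwards 2 packets this step
    }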



Bibliography

[1] Information Sciences Institute at Univ. of Southern California. The Network Simulator. See www.isi.edu/nsnam/ns, accessed 2005.
[2] M. Balmer, K. Nagel, and B. Raney. Large scale multi-agent simulations for transportation applications. ITS Journal, in press.
[3] M. Balmer, B. Raney, and K. Nagel. Coupling activity-based demand generation to a truly agent-based traffic simulation – activity time allocation. Presented at the EIRASS workshop on Progress in Activity-Based Analysis, Maastricht, NL, May 2004. Also presented at STRC'04, see www.strc.ch.
[4] J. Barceló, J. L. Ferrer, D. Garcia, M. Florian, and E. Le Saux. Parallelization of microscopic traffic simulation for ATT systems. In P. Marcotte and S. Nguyen, editors, Equilibrium and Advanced Transportation Modelling, pages 1–26. Kluwer Academic Publishers, 1998.
[5] J. A. Bottom. Consistent anticipatory route guidance. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 2000.
[6] Bundesamt für Raumentwicklung (ARE), Bern. Räumliche Auswirkungen der Verkehrsinfrastrukturen, Methodologische Vorstudie, Information und Pflichtenheft für die Anbieter, 13.6.2001.
[7] A. Burri. Intersection dynamics in queue models. Term project report, Swiss Federal Institute of Technology, 2002. See sim.inf.ethz.ch/papers.
[8] G. D. B. Cameron and C. I. D. Duncan. PARAMICS — parallel microscopic simulation of road traffic. Journal of Supercomputing, 10(1):25, 1996.
[9] R. Cayford, W.-H. Lin, and C. F. Daganzo. The NETCELL simulation package: Technical description. California PATH Research Report UCB-ITS-PRR-97-23, University of California, Berkeley, 1997.
[10] G. L. Chang, T. Junchaya, and A. J. Santiago. A real-time network traffic simulation model for ATMS applications: Part I — simulation methodologies. IVHS Journal, 1(3):227–241, 1994.
[11] The World Wide Web Consortium. HTML: HyperText Markup Language. See www.w3.org/MarkUp, accessed 2005.
[12] The World Wide Web Consortium. HTTP: HyperText Transfer Protocol. See www.w3.org/Protocols, accessed 2005.
[13] Microsoft Corporation. MS Windows. See www.microsoft.com/windows, accessed 2005.
[14] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, MA, USA, 1991.
[15] C. F. Daganzo. The cell transmission model: A dynamic representation of highway traffic consistent with the hydrodynamic theory. Transportation Research B, 28B(4):269–287, 1994.
[16] C. F. Daganzo. The cell transmission model, part II: Network traffic. Transportation Research B, 29B(2):79–93, 1995.
[17] US Dept. of Transportation Federal Highway Administration. DynaMIT prototype description. See www.dynamictrafficassignment.org/dynamit.htm, accessed 2005.
[18] US Dept. of Transportation Federal Highway Administration. DYNASMART-X prototype description. See www.dynamictrafficassignment.org/dsmart_x.htm, accessed 2005.
[19] DYNAMIT www page. See mit.edu/its and dynamictrafficassignment.org, accessed 2005.
[20] DYNASMART www page. See www.dynasmart.com and dynamictrafficassignment.org, accessed 2005.
[21] Expat www page. James Clark's Expat XML parser library. See expat.sourceforge.net, accessed 2005.
[22] J. Ferber. Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence. Addison-Wesley, 1999.
[23] J. L. Ferrer and J. Barceló. AIMSUN2: Advanced Interactive Microscopic Simulator for Urban and non-urban Networks. Internal report, Departamento de Estadística e Investigación Operativa, Facultad de Informática, Universitat Politècnica de Catalunya, 1993.
[24] U. Frisch, B. Hasslacher, and Y. Pomeau. Lattice-gas automata for the Navier-Stokes equation. Phys. Rev. Letters, 56:1505, 1986.
[25] G. Eisenhauer, F. Bustamante, and K. Schwan. Native data representation: An efficient wire format for high performance distributed computing. IEEE Transactions on Parallel and Distributed Systems, 13:1234–1246, 2002.
[26] C. Gawron. An iterative algorithm to determine the dynamic user equilibrium in a traffic simulation model. International Journal of Modern Physics C, 9(3):393–407, 1998.
[27] C. Gawron. Simulation-based traffic assignment. PhD thesis, University of Cologne, Cologne, Germany, 1998. Available via www.zaik.uni-koeln.de/~paper.
[28] R. A. Gingold and J. J. Monaghan. Smoothed particle hydrodynamics – theory and application to non-spherical stars. Royal Astronomical Society, Monthly Notices, 181:375–389, 1977.
[29] C. Gloor. Distributed Intelligence in Real-World Mobility Simulations. PhD thesis, Swiss Federal Institute of Technology ETH, 2005.
[30] Pallas GmbH. Pallas MPI Benchmark. See www.pallas.com/e/products/pmb, accessed 2005.
[31] P. Gonnet. A thread-based distributed traffic micro-simulation. Term project, Swiss Federal Institute of Technology ETH, Zürich, Switzerland, 2001.
[32] Network Working Group. File Transfer Protocol. See www.faqs.org/rfcs/rfc959.html, accessed 2005.
[33] The Open Group. Technical report for Remote Procedure Call. See www.opengroup.org/public/pubs/catalog/c706.htm, accessed 2005.
[34] D. Hensher and J. King, editors. The Leading Edge of Travel Behavior Research. Pergamon, Oxford, 2001.
[35] W. A. Hunt. Clock cycles counter. See www.cs.utexas.edu/users/hunt/class/2003-fall/cs352/lectures/class01a.pdf, accessed 2005.
[36] J. Hurwitz and W. Feng. Initial end-to-end performance evaluation of 10-Gigabit Ethernet. IEEE Hot Interconnects, 2003.
[37] IBM SP2 web page. RS/6000 SP System. See www.rs6000.ibm.com/hardware/largescale, accessed 2005.
[38] Cray Inc. See www.cray.com, accessed 2005.
[39] Linux Online Inc. The Linux home page at Linux Online. See www.linux.org, accessed 2005.
[40] Red Hat Online Inc. Red Hat Linux. See www.redhat.com, accessed 2005.
[41] InfiniBand Trade Association www page. InfiniBand. See www.infinibandta.org, accessed 2005.
[42] Java technology. See java.sun.com, accessed 2005.
[43] Java Remote Method Invocation (RMI). See java.sun.com/products/jdk/rmi, accessed 2005.
[44] K. G. Jones and S. R. Das. Parallel execution of a sequential network simulator. In Proceedings of the 32nd Conference on Winter Simulation, pages 418–424, 2000.
[45] C. Kurmann, T. Stricker, and F. Rauch. Speculative defragmentation – leading Gigabit Ethernet to true zero-copy communication. Cluster Computing: The Journal of Networks, Software Tools and Applications, 4(4):7–18, March 2001.
[46] C. Kurmann, F. Rauch, and T. Stricker. Cost/performance tradeoffs in network interconnects for clusters of commodity PCs. International Parallel and Distributed Processing Symposium, www.ipdps.org, April 2003.
[47] Lawrence Livermore National Laboratory. ASC at Livermore. See www.llnl.gov/asci, accessed 2005.
[48] M. J. Lighthill and G. B. Whitham. On kinematic waves. I: Flood movement in long rivers. II: A theory of traffic flow on long crowded roads. Proceedings of the Royal Society A, 229:281–345, 1955.
[49] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. SIGCOMM, pages 251–262, 1999.
[50] MATSIM www page. MultiAgent Transportation SIMulation. See www.matsim.org, accessed 2005.
[51] MPI www page. MPI: Message Passing Interface. See www-unix.mcs.anl.gov/mpi/, accessed 2005.
[52] MPICH www page. MPICH implementation of MPI: Message Passing Interface. See www-unix.mcs.anl.gov/mpi/mpich/, accessed 2005.
[53] S. Meyers. Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library. Addison-Wesley, 2001.
[54] Myricom www page. Myrinet. See www.myri.com, accessed 2005. Myricom, Inc., Arcadia, CA.
[55] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[56] K. Nagel. High-speed microsimulations of traffic flow. PhD thesis, University of Cologne, 1994/95. See www.inf.ethz.ch/~nagel/papers or www.zaik.uni-koeln.de/~paper.
[57] K. Nagel and M. Rickert. Parallel implementation of the TRANSIMS micro-simulation. Parallel Computing, 27(12):1611–1639, 2001.
[58] K. Nagel and A. Schleicher. Microscopic traffic modeling on parallel high performance computers. Parallel Computing, 20:125–146, 1994.
[59] K. Nagel, P. Stretz, M. Pieck, S. Leckey, R. Donnelly, and C. L. Barrett. TRANSIMS traffic flow characteristics. Los Alamos Unclassified Report (LA-UR) 97-3530, Los Alamos National Laboratory, Los Alamos, NM, see transims.tsasa.lanl.gov, 1997.
[60] W. Niedringhaus, J. Opper, L. Rhodes, and B. Hughes. IVHS traffic modeling using parallel computing: Performance results. In Proceedings of the International Conference on Parallel Processing, pages 688–693. IEEE, 1994.
[61] K. Nökel and M. Schmidt. Parallel DYNEMO: Meso-scopic traffic flow simulation on large networks. Networks and Spatial Economics, 2(4):387–403, December 2002.
[62] V. Paxson and S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226–244, 1995.
[63] PVM www page. PVM: Parallel Virtual Machine. See www.epm.ornl.gov/pvm, accessed 2005.
[64] H. A. Rakha and M. W. Van Aerde. Comparison of simulation modules of TRANSYT and INTEGRATION models. Transportation Research Record, 1566:1–7, 1996.
[65] B. Raney. Learning Framework for Large-Scale Multi-Agent Simulations. PhD thesis, Swiss Federal Institute of Technology ETH, 2005.
[66] B. Raney and K. Nagel. Truly agent-based strategy selection for transportation simulations. Paper 03-4258, Transportation Research Board Annual Meeting, Washington, D.C., 2003.
[67] B. Raney and K. Nagel. An improved framework for large-scale multi-agent simulations of travel behavior. In P. Rietveld, B. Jourquin, and K. Westin, editors, Towards Better Performing European Transportation Systems. Accepted for publication.
[68] M. Rickert. Traffic simulation on distributed memory computers. PhD thesis, University of Cologne, Cologne, Germany, 1998. See www.zaik.uni-koeln.de/~paper.
[69] Sandia National Laboratories. ASC at Sandia. See www.sandia.gov/ASC, accessed 2005.
[70] G. Satir and D. Brown. C++: The Core Language. O'Reilly & Associates, Inc., 1995.
[71] T. Schwerdtfeger. Makroskopisches Simulationsmodell für Schnellstraßennetze mit Berücksichtigung von Einzelfahrzeugen (DYNEMO). PhD thesis, University of Karlsruhe, Germany, 1987.
[72] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts. John Wiley & Sons, Inc., 2001.
[73] H. P. Simão and W. B. Powell. Numerical methods for simulating transient, stochastic queueing networks. Transportation Science, 26:296–311, 1992.
[74] P. M. Simon and K. Nagel. Simple queueing model applied to the city of Portland. International Journal of Modern Physics C, 10(5):941–960, 1999.
[75] H. Spindler. Personal communication.
[76] W. Stallings. Queuing analysis. See ftp://shell.shore.net/members/w/s/ws/Support/QueuingAnalysis.pdf, 2000.
[77] IEEE 802 LAN/MAN Standards Committee. Ethernet IEEE 802 standards. See www.ieee802.org, accessed 2005.
[78] R. Standish. Classdesc Library. See parallel.hpc.unsw.edu.au/rks/classdesc, accessed 2005.
[79] W. R. Stevens. UNIX Network Programming. Prentice Hall, 1990.
[80] B. Stroustrup. The Design and Evolution of C++. Addison-Wesley, 1994.
[81] T. Huisinga, R. Barlovic, W. Knospe, A. Schadschneider, and M. Schreckenberg. A microscopic model for packet transport in the Internet. Physica A, pages 249–256, 2001.
[82] TRANSIMS www page. TRansportation ANalysis and SIMulation System. See transims.tsasa.lanl.gov, accessed 2005. Los Alamos National Laboratory, Los Alamos, NM.
[83] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1994.
[84] VISSIM www page. See www.ptv.de, accessed 2004. Planung Transport und Verkehr (PTV) GmbH.
[85] M. Vrtic, K. W. Axhausen, R. Koblo, and M. Vödisch. Entwicklung bimodales Personenverkehrsmodell als Grundlage für Bahn2000, 2. Etappe, Auftrag 1. Report to the Swiss National Railway and to the Dienst für Gesamtverkehrsfragen, Prognos AG, Basel, 1999. See www.ivt.baug.ethz.ch/vrp/ab115.pdf for a related report.
[86] P. Waddell, A. Borning, M. Noth, N. Freier, M. Becke, and G. Ulfarsson. Microsimulation of urban development and location choices: Design and implementation of UrbanSim. Networks and Spatial Economics, 3(1):43–67, 2003.
[87] W. E. Leland, M. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic. SIGCOMM, pages 183–193, 1993.
[88] W. Willinger and V. Paxson. Where mathematics meets the Internet. Notices of the AMS, 45(8):961–970, 1998.
[89] W. Willinger, V. Paxson, and M. S. Taqqu. Self-similarity and heavy tails: Structural modeling of network traffic. In R. Adler, R. Feldman, and M. S. Taqqu, editors, A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Birkhäuser Verlag, Boston, 1998.
[90] D. E. Wolf, M. Schreckenberg, and A. Bachem, editors. Traffic and Granular Flow. World Scientific, Singapore, 1996.
[91] METIS library. See www-users.cs.umn.edu/~karypis/metis, accessed 2005.
[92] CORBA: Common Object Request Broker Architecture. See www.corba.org, accessed 2005.
[93] mpiJava, a Java interface to the standard MPI. See www.hpjava.org/mpiJava.html, accessed 2005.
[94] MySQL, an open-source SQL database. See www.mysql.com, accessed 2005.
[95] Oracle database server. See www.oracle.com/products, accessed 2005.
[96] URBANSIM. See www.urbansim.org, accessed 2003.
[97] XML, eXtensible Markup Language. See www.w3.org/XML, accessed 2005.


CURRICULUM VITAE: NURHAN ÇETIN

December 1st, 1974: born in Turkey, citizen of the Republic of Turkey.
1980-1985: Elementary School of Yeşilbahar, Istanbul.
1985-1988: Secondary School of Göztepe, Istanbul.
1988-1991: High School of Erenköy, Istanbul.
1991-1994: Environmental Engineering, Marmara University, Istanbul.
1994-1997: B.Sc. in Computer Engineering, Department of Computer Engineering, Marmara University, Istanbul.
1996-1997: Worked at the Computer Center of Marmara University, Istanbul.
1997-1999: Worked at the Computer Center of Yeditepe University, Istanbul.
1999-2000: M.Sc. in Computer Science, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
2000: Worked at America Online, Herndon, VA, USA.
2000-2005: Research and Teaching Assistant in the Modelling and Simulation group headed by Prof. Kai Nagel, Institute of Computational Science, Swiss Federal Institute of Technology Zürich, Zürich, Switzerland.

PUBLICATIONS

Towards truly agent-based traffic and mobility simulations; M. Balmer, N. Cetin, K. Nagel, B. Raney; Autonomous Agents and Multiagent Systems (AAMAS'04); New York, NY, USA, 2004.

An agent-based microsimulation model of Swiss travel: First results; N. Cetin, B. Raney, A. Voellmy, M. Vrtic, K. Axhausen, K. Nagel; Networks and Spatial Economics; Volume 3, Pages 23–41, 2003.

A parallel queue model approach to traffic simulations; N. Cetin, K. Nagel, A. Burri; Transportation Research Board (TRB) Conference; Washington, D.C., USA, 2003.

Large-scale multi-agent transportation simulations; N. Cetin, K. Nagel, B. Raney, A. Voellmy; 42nd European Regional Science Association (ERSA) Congress; Dortmund, Germany, 2002.

Towards a microscopic traffic simulation of all of Switzerland; N. Cetin, B. Raney, A. Voellmy, M. Vrtic, K. Nagel; Proceedings of the International Conference of Computational Science; Amsterdam, The Netherlands, 2002.

Large-scale multi-agent transportation simulations; N. Cetin, K. Nagel, B. Raney, A. Voellmy; Computational Physics Conference; Aachen, Germany, 2001.

Large-scale transportation simulations on Beowulf clusters; N. Cetin, K. Nagel; Swiss Transport Research Conference; Ascona, Switzerland, 2001.

Solaris 2.x System Administration Course Notes; N. Cetin, O. Demir, G. Devlet, D. Erdil, G. Küçük, M. Yildiz; Istanbul, Turkey, 1997.

MaROS: A framework for application development on mobile hosts; S. Baydere, N. Cetin, O. Demir, G. Devlet, D. Erdil, G. Küçük, M. Yildiz; Proceedings of the IASTED International Conference on Parallel and Distributed Systems; Barcelona, Spain, June 1997.
