DISS. ETH NO.<br />
<strong>LARGE</strong>-<strong>SCALE</strong> <strong>PARALLEL</strong> <strong>GRAPH</strong>-<strong>BASED</strong><br />
<strong>SIMULATIONS</strong><br />
A dissertation submitted to the<br />
SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH<br />
for the degree of<br />
Doctor of Sciences<br />
presented by<br />
NURHAN ÇETİN<br />
Master of Science in Computer Science<br />
The Pennsylvania State University<br />
born 01.12.1974<br />
citizen of<br />
The Republic of Turkey<br />
accepted on the recommendation of<br />
Prof. Dr. Kai Nagel, examiner<br />
Prof. Dr. Kay W. Axhausen, co-examiner<br />
2005
Abstract<br />
Different techniques are used to model systems, and computer simulation is one of them. It<br />
draws attention since it makes it possible to model a system, real or theoretical, to execute the<br />
model on a computer, and to analyze the output of that execution. The execution of a model<br />
on a computer develops through time, i.e. the states of the different parts of the system, such<br />
as variables and the environment, are updated over time according to the rules defined in the<br />
model.<br />
Computer simulations come into prominence since they allow models with complex objects<br />
and variables, complex relationships between objects, artificial worlds, etc. This thesis focuses<br />
on different parts of a transportation planning system, MATSIM (Multi-Agent Transportation<br />
SIMulation), which is a computer simulation.<br />
In MATSIM, similar to other multi-agent simulations, all entities are treated at the individual<br />
level. Their behavior and interactions, both with each other and with the environment, are<br />
defined by their internal rules.<br />
There are two layers in a transportation planning system: the physical layer, which includes<br />
a traffic flow simulator, and the strategic layer. In the traffic flow simulator, the agents<br />
interact with each other and with the environment based on the rules defined in the model. In<br />
the strategic layer, the agents make their strategies. The relationship between these two layers<br />
is best understood in an implementation called a framework.<br />
A framework couples modules such as the traffic flow simulator, router, agent database,<br />
activity generator, etc. A traffic flow simulator defines the rules of interaction of the entities in<br />
the system. The traffic flow simulator used in MATSIM is based on a queue model developed<br />
by Gawron. It reads the street network of the area to be simulated and the plans of the agents,<br />
then it executes these plans according to the rules of the queue model. The output of the traffic<br />
flow simulation, the events, is used to evaluate the performance of the plans. The evaluation<br />
is done by the modules of the strategic layer. The evaluated plans are fed back to the traffic<br />
flow simulator by starting a new iteration.<br />
Parallel computing techniques are applied to the traffic simulator to handle large-scale scenarios<br />
detailed at the microscopic level. Different communication media and different communication<br />
libraries are evaluated in this process.<br />
The coupling of modules by the framework is via files. From the viewpoint of a traffic flow<br />
simulator, this means two files: plans as input and events as output. To avoid the inefficiencies<br />
of file I/O, a message passing approach is developed for plans and events. Different methods<br />
for creating and transferring different types of messages are investigated.<br />
The traffic flow simulator based on the queue model can also be used for simulating other<br />
types of entities, such as Internet data packet traffic. As the Internet grows, analyzing the data<br />
flowing through it becomes increasingly interesting to researchers.<br />
Zusammenfassung<br />
Various techniques can be used to model systems. Using computer simulation, it is possible<br />
to simulate a real or theoretical system on a computer and then analyze the output. Such<br />
a model is implemented on the computer and changed iteratively, i.e. the internal states are<br />
updated at every time step according to the rules defined in the model.<br />
The advantage of computer simulations is that the models allow greater complexity of the<br />
objects and their relationships than would be possible with an analytical treatment.<br />
The focus of this work is on the different parts of the transportation planning system<br />
MATSIM (Multi-Agent Transportation SIMulation), which uses these techniques.<br />
Like most multi-agent simulations, MATSIM treats all agents on an individual basis. Their<br />
behavior and their interactions (both with other agents and with the environment) are defined<br />
by rules.<br />
There are two layers in a transportation planning system: the physical layer, which contains<br />
the traffic flow simulation, and the strategic layer. In the traffic flow simulation, the agents<br />
react to other agents and to the environment. In the strategic layer, the decisions of the agents<br />
are modeled. The relationship between these two layers is formed by a framework. This<br />
framework connects the individual modules (traffic flow simulation, route generator, agent<br />
database, etc.).<br />
The traffic flow simulation presented here is based on a queue model developed by Gawron.<br />
The street network and the plans of the agents are used as input. During the simulation,<br />
so-called events are output, with which the modules can evaluate the quality of these plans.<br />
The plans are then slightly modified and tested again in the next run of the simulation.<br />
To handle the size of the scenario used here, the traffic flow simulation must be distributed<br />
over several computers (distributed computing). Different communication media and libraries<br />
were evaluated.<br />
The modules of the framework are coupled via files. From the viewpoint of the traffic flow<br />
simulation, two kinds of files are used: plans as input and events as output. Since reading<br />
and writing files can be very slow, a further approach was developed: sending plans and<br />
events as messages over the network. Several variants were compared.<br />
The queue-based traffic flow simulation presented here can be used not only for simulating<br />
traffic but also, for example, for the data flow in computer networks. Such applications will<br />
gain importance, not least through the growth of the Internet.<br />
Acknowledgments<br />
First of all, I would like to thank my advisor, Prof. Kai Nagel, for his guidance in making this<br />
thesis possible and for his support during the past years I have spent at ETH Zurich.<br />
I would also like to thank my co-advisor, Prof. Kay Axhausen, for agreeing to be co-examiner<br />
and for the remarks he made to improve this thesis.<br />
Many thanks to my office mate of four years, Bryan Raney, for all the interesting and helpful<br />
discussions that we had. Those discussions helped me a lot to broaden my vision.<br />
I would like to thank Christian Gloor for the productive talks about work, computer science<br />
and life.<br />
Thanks to Dr. Fabrice Marchal for not only giving me guidance in Java but also for being a<br />
friend beyond office life.<br />
I would like to thank Marc Schmitt and the IT Support Group (a.k.a. ISG) for the maintenance<br />
of the computational resources used during my work. I am grateful to Martin Wyser,<br />
who took over the responsibility for the Xibalba cluster from Marc.<br />
Thanks to Adrian Burri and Hinnerk Spindler for providing the data used in Figure 2.4 and<br />
Figure 8.1, respectively.<br />
I am very grateful to Duncan Cavens, Bryan Raney and Lisa von Boehmer for proofreading<br />
this manuscript.<br />
Thanks to Prof. Şebnem Baydere for her support in taking the first steps towards my academic<br />
career, and to Prof. Feyzi İnanc for his support and his advice about academia and<br />
life.<br />
Many thanks to my friends Özge, Canan, Mehtap, PIrnal, Ilker, Cenk, Mcan, Özlem, Chris,<br />
Onur, Emrah, Giray, Mahir, Gürhan, Berna, Gültek, Duygu, Bülo, Selin, Erdem, Nur, Selçuk,<br />
Hanna, Fuat and Volkan for their support and their friendship.<br />
Last but not least, many thanks to my family for being supportive whatever I do and whatever<br />
I choose.<br />
Contents<br />
Abstract<br />
Zusammenfassung<br />
Acknowledgments<br />
i<br />
ii<br />
iii<br />
1 Introduction 1<br />
2 The Queue Model for Traffic Dynamics 5<br />
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
2.2 Queue Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />
2.2.1 Gawron’s Queue Model . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />
2.2.2 Fair Intersections and Parallel Update . . . . . . . . . . . . . . . . . . 7<br />
2.2.3 Graph Data as Input for Queue Simulation . . . . . . . . . . . . . . . . 11<br />
2.2.4 Vehicle Plans as Input for Queue Simulation . . . . . . . . . . . . . . 12<br />
2.2.5 Events as Output of Queue Simulation . . . . . . . . . . . . . . . . . . 13<br />
2.3 Other Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
2.4 The Basic Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
2.5 A Practical Scenario for the Benchmarks . . . . . . . . . . . . . . . . . . . . . 15<br />
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
3 Sequential Queue Model 17<br />
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17<br />
3.2 The Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
3.3 Performance Issues for C++/STL and C Functions . . . . . . . . . . . . . . . . 18<br />
3.3.1 The Standard Template Library . . . . . . . . . . . . . . . . . . . . . 19<br />
3.3.2 Containers: Map vs Vector for Graph Data . . . . . . . . . . . . . . . 19<br />
3.3.3 Containers: Multimap vs Linked List for Parking and Waiting Queues . 24<br />
3.3.4 Containers: Ring, Deque and List Implementations of Link Queues . . 26<br />
3.4 Reading Input Files for Traffic Simulators . . . . . . . . . . . . . . . . . . . . 29<br />
3.4.1 The Extensible Markup Language, XML . . . . . . . . . . . . . . . . 29<br />
3.4.2 Structured Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />
3.4.3 XML vs. Structured Text Files: Plans Reading . . . . . . . . . . . . . 30<br />
3.4.4 XML vs Structured Text Files: Graph Data Reading . . . . . . . . . . 31<br />
3.5 Writing Events Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />
3.6 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
4 Parallel Queue Model 37<br />
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
4.1.1 Message Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
4.1.2 Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
4.2 Parallel Computing in Transportation Simulations . . . . . . . . . . . . . . . . 40<br />
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
4.3.1 Handling Domain Decomposition . . . . . . . . . . . . . . . . . . . . 41<br />
4.3.2 Handling Message Exchanging . . . . . . . . . . . . . . . . . . . . . . 42<br />
4.3.3 Communication Software . . . . . . . . . . . . . . . . . . . . . . . . 42<br />
4.4 Theoretical Performance Expectations . . . . . . . . . . . . . . . . . . . . . . 44<br />
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
4.5.1 Comparison of Different Communication Hardware: Ethernet vs. Myrinet 48<br />
4.5.2 Comparison of Different Communication Software: MPI vs. PVM . . . 50<br />
4.5.3 Comparison of Different Packing Algorithms . . . . . . . . . . . . . . 51<br />
4.5.4 Different Domain Decomposition Algorithms . . . . . . . . . . . . . . 55<br />
4.6 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
5 Coupling the Traffic Simulation to Mental Modules 60<br />
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60<br />
5.2 Coupling Modules via Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60<br />
5.2.1 Description of a Framework . . . . . . . . . . . . . . . . . . . . . . . 60<br />
5.2.2 Performance Issues of Reading an Events File . . . . . . . . . . . . . . 63<br />
5.2.3 Performance Issues of Plan Writing . . . . . . . . . . . . . . . . . . . 67<br />
5.3 Other Coupling Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . 68<br />
5.3.1 Module Coupling via Subroutine Calls . . . . . . . . . . . . . . . . . 68<br />
5.3.2 Module Coupling via Remote Procedure Calls (e.g. CORBA, Java RMI) 69<br />
5.3.3 Module Coupling via WWW Protocols . . . . . . . . . . . . . . . . . 70<br />
5.3.4 Module Coupling via Databases . . . . . . . . . . . . . . . . . . . . . 70<br />
5.3.5 Module Coupling via Messages . . . . . . . . . . . . . . . . . . . . . 72<br />
5.4 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73<br />
6 Events Recorder 74<br />
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />
6.2 The Competing File I/O Performance for Events . . . . . . . . . . . . . . . . . 76<br />
6.3 Other Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76<br />
6.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77<br />
6.5 Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
6.6 Raw vs. XML Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
6.7 Buffered vs. Immediate Reporting of Events . . . . . . . . . . . . . . . . . . . 79<br />
6.7.1 Reporting Buffered Events . . . . . . . . . . . . . . . . . . . . . . . . 79<br />
6.7.2 Immediately Reported Events . . . . . . . . . . . . . . . . . . . . . . 79<br />
6.8 Theoretical Expectation for Buffered Events . . . . . . . . . . . . . . . . . . . 79<br />
6.8.1 Packing Time Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 81<br />
6.8.2 Sending and Receiving Time Prediction . . . . . . . . . . . . . . . . . 82<br />
6.8.3 Unpacking Time Prediction . . . . . . . . . . . . . . . . . . . . . . . 83<br />
6.8.4 Writing Time Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 84<br />
6.8.5 Performance Prediction for Buffered Events: Putting it together . . . . 84<br />
6.9 Results of the Buffered Events . . . . . . . . . . . . . . . . . . . . . . . . . . 84<br />
6.9.1 Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85<br />
6.9.2 Sending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85<br />
6.9.3 Receiving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
6.9.4 Unpacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90<br />
6.9.5 Writing into File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />
6.9.6 Summary of “buffered events recording” . . . . . . . . . . . . . . . . 94<br />
6.10 Theoretical Expectations and Results of Immediately Reported Events . . . . . 94<br />
6.11 Performance of Different Packing Methods for Events . . . . . . . . . . . . . . 97<br />
6.11.1 Using memcpy and Creating a Byte Array . . . . . . . . . . . . . . . 97<br />
6.11.2 Using MPI Pack and MPI Unpack . . . . . . . . . . . . . . . . . . 97<br />
6.11.3 Using MPI Struct . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
6.11.4 Classdesc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
6.11.5 Comparison of Results . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
6.12 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
6.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102<br />
7 Plans Server 104<br />
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
7.2 The Competing File I/O Performance for Plans . . . . . . . . . . . . . . . . . 104<br />
7.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />
7.3.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />
7.3.2 mpiJava . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105<br />
7.4 Java and C++ Implementations of the Plans Server . . . . . . . . . . . . . . . 106<br />
7.4.1 Packing and Unpacking . . . . . . . . . . . . . . . . . . . . . . . . . 107<br />
7.4.2 Storing Agents in the Plans Server . . . . . . . . . . . . . . . . . . . . 108<br />
7.5 Theoretical Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />
7.5.1 PSs Pack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
7.5.2 PSs Send and TSs Receive . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
7.5.3 TSs Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
7.5.4 TSs Pack and Send . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
7.5.5 PSs unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
7.5.6 Multi-casting Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
7.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
7.6.1 PSs Pack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
7.6.2 PSs Send . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
7.6.3 TSs Receive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114<br />
7.6.4 TSs Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
7.6.5 TSs Pack and Send . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.6.6 PSs Receive and Unpack . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
7.7 Conclusions and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
8 Going beyond Vehicle Traffic 121<br />
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
8.2 Queue Model as a Possible Microscopic Model for Internet Packet Traffic . . . 122<br />
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
9 Summary 127<br />
Curriculum Vitae 135<br />
List of Figures<br />
1.1 Physical and strategic layers of a traffic simulation system . . . . . . . . . . . 2<br />
2.1 The Gawron’s queue model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />
2.2 Simplifying the intersection logic . . . . . . . . . . . . . . . . . . . . . . . . . 8<br />
2.3 Pseudo code for traffic dynamics defined in the queue model . . . . . . . . . . 9<br />
2.4 Test suite results for the intersection dynamics . . . . . . . . . . . . . . . . . . 9<br />
2.5 Handling intersections according to the modified version of fair intersections<br />
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />
2.6 Handling intersections according to Metropolis sampling . . . . . . . . . . . . 10<br />
2.7 Handling intersections according to the modified Metropolis sampling . . . . . 11<br />
2.8 An example of the graph data in the XML format . . . . . . . . . . . . . . . . 12<br />
2.9 An example for the plans data in the XML format . . . . . . . . . . . . . . . . 13<br />
2.10 An example for the events data in the XML format . . . . . . . . . . . . . . . 14<br />
3.1 STL-containers for the graph data . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
3.2 The STL-map for the graph data . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
3.3 Insertion in the middle of an STL-vector by insert(position,object) 21<br />
3.4 The STL-vector for the graph data . . . . . . . . . . . . . . . . . . . . . . 22<br />
3.5 Linear search for the graph data . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
3.6 Sorting the graph data stored in an STL-vector . . . . . . . . . . . . . . . . 23<br />
3.7 RTR and Speedup for using different data structures for the graph data . . . . . 24<br />
3.8 Declarations for waiting and parking queues with the STL-multimap . . . . 25<br />
3.9 Declarations for waiting and parking queues with linked lists . . . . . . . . . . 25<br />
3.10 RTR and Speedup for using different data structures for waiting and parking<br />
queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
3.11 Ring Structure: Insertion at the end, Deletion from the beginning . . . . . . . . 28<br />
3.12 RTR and Speedup for using different data structures for the spatial queues and<br />
the buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />
3.13 Reading plans from a structured text file, by using an STL-vector . . . . . . 32<br />
3.14 Reading plans from a structured text file, by using fscanf . . . . . . . . . . . 33<br />
4.1 Handling the boundaries and split links . . . . . . . . . . . . . . . . . . . . . . 41<br />
4.2 Domain decomposition by METIS for Switzerland . . . . . . . . . . . . . . . 42<br />
4.3 Pseudo code for parallel implementation of queue model . . . . . . . . . . . . 43<br />
4.4 Calculation of neighbors of computing nodes . . . . . . . . . . . . . . . . . . 47<br />
4.5 RTR and Speedup curves for Parallel Queue Model . . . . . . . . . . . . . . . 49<br />
4.6 RTR and Speedup graphs for PVM and MPI comparison . . . . . . . . . . . . 51<br />
4.7 The data of a vehicle to be packed . . . . . . . . . . . . . . . . . . . . . . . . 52<br />
4.8 Packing vehicle data with memcpy . . . . . . . . . . . . . . . . . . . . . . . . 53<br />
4.9 Packing vehicle data with MPI Pack . . . . . . . . . . . . . . . . . . . . . . 53<br />
4.10 Packing vehicle data with MPI Struct . . . . . . . . . . . . . . . . . . . . . 54<br />
4.11 RTR graphs for different packing algorithms . . . . . . . . . . . . . . . . . . . 55<br />
4.12 RTR and Speedup graphs for METIS with single constraint . . . . . . . . . . . 56<br />
5.1 An example plan in the XML format . . . . . . . . . . . . . . . . . . . . . . . 61<br />
5.2 Physical and strategic layers of the framework coupled via files . . . . . . . . . 62<br />
5.3 Reading events by using the STL-map . . . . . . . . . . . . . . . . . . . . . . 64<br />
5.4 Reading events by using C++ operator >> . . . . . . . . . . . . . . . . . . . . 65<br />
5.5 Reading events by using atoi/atof or strtod/strtol . . . . . . . . . . 66<br />
5.6 Coupling via subroutine calls during within-day re-planning . . . . . . . . . . 69<br />
6.1 Interaction between TSs and ERs . . . . . . . . . . . . . . . . . . . . . . . . . 75<br />
6.2 Pseudo code for the actions for TSs and ERs when events are buffered . . . . . 80<br />
6.3 Pseudo code for the actions of TSs and ERs when events reported immediately 80<br />
6.4 Pseudo Code for Packing a Raw Event . . . . . . . . . . . . . . . . . . . . . . 81<br />
6.5 Pseudo Code for Packing an XML Event . . . . . . . . . . . . . . . . . . . . . 81<br />
6.6 Time measurements for packing events . . . . . . . . . . . . . . . . . . . . . . 86<br />
6.7 Time measurements for sending events . . . . . . . . . . . . . . . . . . . . . . 87<br />
6.8 Comparison of Ethernet vs Myrinet when sending events . . . . . . . . . . . . 88<br />
6.9 Myrinet, Multi-cast results for sending events . . . . . . . . . . . . . . . . . . 89<br />
6.10 Ethernet, Multi-cast results for sending events . . . . . . . . . . . . . . . . . . 90<br />
6.11 Time measurements when receiving events over Myrinet . . . . . . . . . . . . 91<br />
6.12 Comparison of Ethernet vs Myrinet when receiving events . . . . . . . . . . . 92<br />
6.13 Time measurements for unpacking on top of the effective receiving time . . . . 93<br />
6.14 Summary figures for 1ER case . . . . . . . . . . . . . . . . . . . . . . . . . . 95<br />
6.15 Linear scale version of summary figures . . . . . . . . . . . . . . . . . . . . . 96<br />
6.16 Time measurements for sending when events reported immediately . . . . . . . 96<br />
6.17 Pseudo code for packing different data types with memcpy . . . . . . . . . . . 97<br />
6.18 Pseudo code for packing different data types with MPI Pack . . . . . . . . . . 98<br />
6.19 A C-type struct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
6.20 Pseudo code for packing different data types with MPI Struct . . . . . . . . 99<br />
6.21 Pseudo code for packing different data types with Classdesc . . . . . . . . 100<br />
6.22 Performance of Different Serialization Methods . . . . . . . . . . . . . . . . . 101<br />
7.1 Pseudo code for interaction of PSs and TSs . . . . . . . . . . . . . . . . . . . 106<br />
7.2 Sequence of Tasks Execution of TSs and PSs . . . . . . . . . . . . . . . . . . 107<br />
7.3 Pseudo code for packing different data types with memcpy . . . . . . . . . . . 108<br />
7.4 An example for the methods of BytesUtil . . . . . . . . . . . . . . . . . . . . 108<br />
7.5 Data structures for agents in a C++ Plans Server . . . . . . . . . . . . . . . . . 109<br />
7.6 Data structures for agents in a Java Plans Server . . . . . . . . . . . . . . . . 109<br />
7.7 Time measurements for packing plans . . . . . . . . . . . . . . . . . . . . . . 113<br />
7.8 Time measurements for sending plans over Myrinet . . . . . . . . . . . . . . . 114<br />
7.9 Time measurements for the effective receiving time of plans over Myrinet . . . 115<br />
7.10 Time measurements for unpacking plans on top of the effective receiving time<br />
over Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
7.11 Time measurements for packing agent IDs by TSs . . . . . . . . . . . . . . . . 117<br />
7.12 Time measurements for sending agent IDs by TSs to PSs over Myrinet . . . . . 117<br />
7.13 Time measurements for receiving agent IDs by PSs over Myrinet . . . . . . . . 118<br />
7.14 Time measurements for unpacking agent IDs by PSs on top of the effective<br />
receiving time over Myrinet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
7.15 Summary figures for the single PS case . . . . . . . . . . . . . . . . . . . . . 119<br />
7.16 Linear scale version of summary figures . . . . . . . . . . . . . . . . . . . . . 119<br />
8.1 Round-trip travel times for different sizes of messages . . . . . . . . . . . . . . 124<br />
List of Tables<br />
3.1 Performance results for reading different types of plans file and approaches . . 31<br />
3.2 Performance results for reading the graph data . . . . . . . . . . . . . . . . . . 33<br />
3.3 Performance results for writing the events file . . . . . . . . . . . . . . . . . . 34<br />
3.4 Summary table of the serial performance results for different data structures of<br />
the traffic flow simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
4.1 Summary table of the parallel performance results for different data structures<br />
of the traffic flow simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59<br />
5.1 Performance results for reading the events file . . . . . . . . . . . . . . . . . . 67<br />
6.1 Performance prediction table for buffered events . . . . . . . . . . . . . . . . . 84<br />
6.2 Performance results for ERs writing the events file . . . . . . . . . . . . . . . 94<br />
6.3 Summary table of the performance results of events transfered between TSs<br />
and ERs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103<br />
7.1 Summary table of the performance results of plans transfered between TSs and<br />
PSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120<br />
Chapter 1<br />
Introduction<br />
In the area of “modeling and simulation,” one typically designs a model of a system of interest,<br />
and then executes that model on a computer. This simulated model typically shows how the<br />
system of interest develops over time. Advantages of this approach over the observation of<br />
nature or experiments with nature include:<br />
- Formulating and validating the computational model forces one to truly grasp the aspects<br />
of the dynamics of a system which make it function the way one observes.<br />
- It is much easier to extract full information from the model that runs in the computer than<br />
from any experimental setting.<br />
- One can change the model so that it reflects artificial rather than real worlds.<br />
- One can make forecasts.<br />
Because of these and many other advantages, computer simulation has joined the areas of<br />
“(analytical) theory” and “experiment” as a third method of scientific investigation.<br />
With respect to spatially extended systems, one of the first areas where simulation was<br />
employed is in the area of partial differential equations (PDEs): Models that had been formulated<br />
in mathematical terms before computers existed were re-formulated for the computer<br />
(“discretized”) and then run. It quickly turned out that formulating computer-amenable versions<br />
of the partial differential equations was far from straightforward, and the sciences of<br />
Applied Mathematics and Scientific Computing have emerged around these issues.<br />
An alternative way to model spatially extended systems is to model the involved particles<br />
directly. This is in contrast to PDEs, which in some sense model fields of particles. In this area<br />
of particle models, the introduction of computers has perhaps changed the field even more<br />
than in the area of PDEs: it is now possible to simulate systems with very large numbers of<br />
particles, which makes it possible to simulate the evolution of (tiny) samples of material<br />
directly on the molecular level.<br />
Typical particles are relatively simple entities: For example, atoms can be adequately described<br />
by variables such as location, velocity, mass, charge, angular momentum. The same is<br />
true for granular materials, such as sand (e.g. [90]). There are, however, other systems where<br />
the particle approach seems intuitive but the particles are no longer simple. This is true, for example,<br />
when modeling humans (socio-economic systems), Internet packets, or certain aspects<br />
of biological systems. This is where multi-agent simulations [22] come in. They still model the involved particles directly, as particle simulations do, but they spend much more intellectual and computational effort on modeling and simulating the internal dynamics of the particles.
This means that one is faced with three sub-problems:<br />
[Figure 1.1: The strategic world (concepts which are in someone's head: plans, i.e. acts, routes, ...) exchanges plans and performance info with the physical world (limits on accel/brake, excluded volume, veh-veh interaction, veh-system interaction, ped-veh interaction, etc.).]
Figure 1.1: Physical and strategic layers of a traffic simulation system.
1. Simulation of the physical system,<br />
2. Simulation of the internal dynamics of the particles,<br />
3. Simulation of the interaction between these two.<br />
When the internal dynamics of the particles consists of mental processes, then the simulation<br />
of the internal dynamics is sometimes called the strategic layer of the complete simulation<br />
system. Accordingly, the simulation of the physical system is then called the physical layer<br />
of the complete simulation system. Figure 1.1 illustrates the two layers and their interactions in a traffic simulation system.
Many systems where multi-agent simulation would be interesting are large. For example,<br />
a typical metropolitan area traffic system (the main example of this text) is used by several<br />
millions of travelers. A typical ecosystem can consist of several millions of animals, not counting entities such as bacteria. The immune system, sometimes also modeled by multi-agent approaches, contains a vast number of T-cells.
Therefore, the simulation of large multi-agent systems needs to be considered. As in other<br />
areas, in large-scale multi-agent systems the use of parallel computers needs to be evaluated.<br />
As will be explained later, in parallel computing one segments the system of interest into several<br />
pieces, and gives each piece to a different computing node 1 . Since the computing nodes work<br />
on the problem simultaneously, the collection of computing nodes solves the problem much<br />
faster than a single computing node would. The interesting question is how to segment the<br />
problem so that the simulation runs efficiently. Perhaps contrary to intuition, just distributing<br />
the agents is usually not a good idea, since agents that often interact with each other may<br />
end up on different computing nodes, and the necessary information exchange between those<br />
computing nodes makes the computation inefficient. Rather, one needs to group the agents<br />
such that agents that interact often are on the same computing node. Since much interaction is<br />
spatial, this means that agents, when they move around in space during the simulation, need to<br />
be moved around between the computing nodes.<br />
¹ A computing node can be a computer, or a CPU (Central Processing Unit) of a computer with more than one CPU.
This thesis will explore parallel computing issues in the area of multi-agent mobility simulations.<br />
As a specific example, it will explore parallel multi-agent simulations in the area of<br />
transport planning. Within that area, it will explore the following two sub-problems:<br />
Parallel traffic flow simulation. This item corresponds to “1. Simulation of the physical<br />
system” in the list above.<br />
Exchange of information between the traffic flow simulation and strategic layer in a parallel<br />
computing context. This item corresponds to “3. Simulation of the interaction” in<br />
the list above.<br />
Despite the focus on transport planning, the concepts developed in this thesis are general<br />
enough to be useful for the simulation of any kind of system where mobile particles with<br />
complex internal dynamics move around and interact in a physical world. This definition will<br />
include all simulations where humans move around. In addition, the traffic flow simulation<br />
used in this work (the so-called queue simulation) is general enough that it can be applied to<br />
problems where the dynamics of packet movement in a graph is of interest.<br />
The traditional (static) transportation planning uses a four-step process in modeling travel<br />
demand. These four steps are:<br />
Trip Generation: estimation of the number of incoming trips of possible destinations and<br />
outgoing trips of possible origins in a region.<br />
Trip Distribution: producing the origin-destination (OD) matrix by matching origins with<br />
destinations to complete the trips.<br />
Modal Split: determining the travel mode (taking public transport, driving a car, walking,<br />
etc.).<br />
Traffic Assignment: assigning a route for each traveler to get to their destination.
This model does not meet the requirements of modern transportation planning. There are<br />
two main reasons for this:<br />
1. In the four-step process, information is aggregated into traffic streams such that there is no access to information at the individual level. In other words, the steady-state streams do not distinguish between individual travelers.
2. Static modeling in the four-step process misses time-dependency, so temporal effects such as time-dependent congestion spill-back are not covered.
The first item can be solved by treating travelers as individual entities. A known solution is called activity-based demand generation (ADG) [34], which generates daily activity plans for each individual. For example, an individual can have an activity plan, composed of a set of activities such as “being at home”, “working”, “leisure”, etc., planned for a day.
The activities of the activity-based demand generation are scheduled over time; thus, activity-based demand generation is time-dependent, in contrast to the lack of time-dependency of static assignment noted in item 2 above. To address that item, an alternative technique called Dynamic Traffic Assignment (DTA) has been employed in the transportation planning area (e.g. [19, 20, 27, 5]). This model includes spill-back queues formed during the movement
of travelers along links and nodes. Static assignment has the advantage over DTA of having a unique solution, whose properties can be proven mathematically. DTA with spill-back queues, on the other hand, does not guarantee a unique solution, which makes it harder to find an analytical solution. Consequently, one resorts to computational solutions.
Two basic components of DTA are route generation for individuals and network loading. Network loading is the process in which the routes are executed; typically, it is solved by simulation. To couple DTA and ADG, DTA needs to maintain the travelers as individual entities, as ADG does. This means that individual entities have individual attributes and that decisions are made on an individual basis. Hence, an agent-based or multi-agent approach is employed to emphasize the individual entities.
Traffic dynamics with spill-back is solved by systematic relaxation. The systematic relaxation process implements a multi-agent learning method based on the following sequence:
1. Make an initial guess for the routes of all agents.<br />
2. Execute all agents’ routes simultaneously in a traffic flow simulation (network loading).<br />
3. Re-calculate some or all of the routes using the knowledge of the network loading.<br />
4. Go back to 2.<br />
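A minimal sketch of this relaxation loop follows; `network_loading` and `reroute` stand in for the traffic flow simulation and the route generation module, and the replanning fraction is an assumed parameter, since the text does not fix how many agents re-plan per iteration:

```python
import random

def systematic_relaxation(agents, initial_route, network_loading, reroute,
                          iterations=50, replan_fraction=0.1):
    """Multi-agent learning by systematic relaxation (sketch).

    agents:          list of agent ids
    initial_route:   agent -> initial route guess
    network_loading: routes -> link travel times (the traffic flow simulation)
    reroute:         (agent, link_times) -> new route (the strategic layer)
    """
    # 1. Make an initial guess for the routes of all agents.
    routes = {a: initial_route(a) for a in agents}
    for _ in range(iterations):
        # 2. Execute all routes simultaneously (network loading).
        link_times = network_loading(routes)
        # 3. Re-calculate the routes of a subset of agents using that knowledge.
        for a in random.sample(agents, int(replan_fraction * len(agents))):
            routes[a] = reroute(a, link_times)
        # 4. Go back to 2 (next loop iteration).
    return routes
```

Note that steps 2 and 3 alternate until the system is relaxed; in practice a convergence criterion would replace the fixed iteration count.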
From the viewpoint of conceptual layers explained above, the route generation happens at<br />
the strategic layer and the network loading (item 2) corresponds to the physical layer. The<br />
queue model considered in this thesis corresponds to the network loading part. It is the model<br />
on which the traffic flow simulation described in this thesis is based.<br />
This thesis is organized as follows: Chapter 2 presents a multi-agent traffic flow simulator, based on the queue model, for the physical layer of a transportation planning system called MATSIM [50]. Chapter 2 also explains the input data, namely the street network and the plans of travelers, and the output data, which is composed of the events that occur during the network loading process. The computational aspects of the sequential execution of the traffic flow simulator are discussed in Chapter 3. Chapter 4 explains how parallel computing is introduced into the traffic flow simulation. Chapter 5 discusses different methods to couple the strategic and the physical layers of MATSIM. Chapters 6 and 7 explain how the different types of data are exchanged between modules; in particular, how the output data (events data) is extracted from the traffic flow simulation and how the input data (plans data) is fed into it, respectively. Chapter 8 gives a vision of how the traffic flow simulation can be used for Internet packet traffic, and is followed by a summary.
Chapter 2<br />
The Queue Model for Traffic Dynamics<br />
2.1 Introduction<br />
A traffic flow simulation consistent with the multi-agent approach discussed in Chapter 1 and<br />
e.g., by [67, 65] should fulfill the following conditions:<br />
The model should have individual travelers/vehicles 1 in order to be consistent with all<br />
agent-oriented learning approaches.<br />
The model should be simple, in order to be comparable with static assignment and in order to allow concentration on computational rather than modeling issues. This includes the possibility of parallelizing the software.
The model should be computationally fast so that scenarios of a meaningful size can be<br />
run within acceptable periods of time.<br />
For the work presented here, a fourth condition is also stated:<br />
The model should be somewhat realistic so that meaningful comparisons to real-world<br />
results can be made.<br />
These conditions make the use of existing software packages, such as DynaMIT [17], DYNASMART [18], or TRANSIMS [59], difficult, since these packages are already fairly complex. An alternative is to select a simple model for large-scale microscopic network simulations and to re-implement it. If one wants queue spill-back, there are essentially two starting points: queueing theory, and the theory of kinematic waves.
In queueing theory, one can build networks of queues and servers [76, 14, 73]. Packets enter the network at an arbitrary queue. Once in a queue, they typically wait in a first-in first-out (FIFO) queue until they are served; servers serve queues at a given rate. Once a packet is served, it enters the next queue.
This can be directly applied to car/vehicle traffic, where packets correspond to vehicles,<br />
queues correspond to links, and serving rates correspond to link capacities. The decision of a<br />
vehicle about which link to enter after it is served at an intersection is given by the vehicle’s<br />
route, which is a list of nodes (intersections) that a vehicle must pass through during its trip.<br />
¹ Terminology: In multi-agent simulations, agents are the units. The traffic flow simulation described here simulates vehicle traffic, so the agents in the traffic model are vehicles; the terms agent and vehicle are used interchangeably. Although a vehicle is in general not restricted to a “car”, throughout this thesis it represents a car and, accordingly, a driver.
Handling Constraints - Original algorithm<br />
for all links do<br />
while vehicle has arrived at end of link<br />
AND vehicle can be moved according to capacity<br />
AND there is space on destination link do<br />
move vehicle to next link<br />
end while<br />
end for<br />
Figure 2.1: Gawron's queue model
A shortcoming of this type of approach is that it does not model spill-back. If queues have<br />
size restrictions, then packets exceeding that restriction are typically dropped [76]. Since this<br />
is not realistic for traffic, an alternative is to refuse further acceptance of vehicles once the<br />
queue is full (“physical queues”). This means that the serving rate of the upstream server is<br />
influenced by a full queue downstream. Gawron presents an example of such a model in [26].<br />
A detailed algorithmic description is given in Figure 2.1.<br />
An important issue with physical queues is that the intersection logic needs to be adapted. Without physical queues (i.e. with “point queues”), the outgoing links can always accept all incoming vehicles, so the maximum flow through each incoming link is just given by that link's capacity. However, when outgoing links have limited space, that space needs to be distributed among the incoming links which compete for it.
In the original algorithm (Figure 2.1), links are processed in an arbitrary but fixed sequence.<br />
This has the consequence that the most favored link in a given intersection is the one that is<br />
processed next after the congested outgoing link has been processed. This could for example<br />
mean that a small side road obtains priority over a large main road.<br />
A better way to handle this problem is to allocate flow under congested conditions according<br />
to capacity [16]. For example, if there are two incoming links with capacities 2 and 4 vehicles<br />
per time step, and the outgoing link has 3 spaces available, then 1 space should be allocated<br />
to the first incoming link and 2 to the second. Section 2.2.2 explains intersection handling in<br />
more detail.<br />
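This allocation can be sketched as follows; the largest-remainder rounding is an assumption, since the text does not specify how fractional shares are resolved:

```python
def allocate_spaces(capacities, spaces):
    """Distribute the free spaces on a congested outgoing link among the
    competing incoming links, proportionally to their flow capacities (sketch).

    Fractional shares are resolved by largest remainder -- an assumption,
    since the text does not state the rounding rule.
    """
    total = sum(capacities)
    shares = [spaces * c / total for c in capacities]
    alloc = [int(s) for s in shares]  # integer parts of the proportional shares
    # Hand out the remaining spaces to the largest fractional remainders.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:spaces - sum(alloc)]:
        alloc[i] += 1
    return alloc

# The example from the text: incoming capacities 2 and 4, 3 free spaces.
print(allocate_spaces([2, 4], 3))  # → [1, 2]
```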
A shortcoming of queue models is that the speed of the backwards traveling kinematic wave<br />
(“jam wave”) is not correctly modeled. A vehicle that leaves the link at the downstream end<br />
immediately opens up a new space at the upstream end into which a new vehicle can enter,<br />
meaning that the kinematic wave speed is roughly one link per time step, rather than a realistic<br />
velocity. This becomes visible in the dissolution of jams, say at the end of a rush hour: If a<br />
queue extends over a sequence of links, then the jam should dissolve from the downstream end.<br />
In the queue model, it will essentially dissolve from the upstream end. More details of this,<br />
including schematic fundamental diagrams, can be found in [74, 26].<br />
2.2 Queue Model<br />
2.2.1 Gawron’s Queue Model<br />
The so-called queue model introduced by Gawron [26] is used as the basis of the traffic dynamics in the traffic flow simulation. Gawron's queue model defines three key concepts, namely free flow travel time, storage constraint, and capacity constraint.
Each link has, from the input files, the attributes free flow velocity v_0, length L, capacity C, and number of lanes n_lanes. The free flow travel time is calculated as T_0 = L / v_0. Each vehicle must spend at least the free flow travel time on a link before leaving it.
The storage constraint of a link is defined as the maximum number of vehicles that a link can hold at the same time. It is calculated as N_max = L * n_lanes / ell, where ell is the space a single vehicle occupies in a jam on average, i.e. the inverse of the jam density; ell = 7.5 m is taken throughout this work.
The capacity constraint (flow capacity) of a link, on the other hand, defines an upper bound on the number of vehicles that can be released from a link at a given time. This constraint is given as input.
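A minimal sketch of these two per-link quantities, with the 7.5 m jam spacing used throughout this work:

```python
VEHICLE_SPACE_M = 7.5  # space one vehicle occupies in a jam (1 / jam density)

def free_flow_travel_time(length_m, free_speed_ms):
    """Minimum time a vehicle must spend on the link: T_0 = L / v_0."""
    return length_m / free_speed_ms

def storage_constraint(length_m, lanes):
    """Maximum number of vehicles the link can hold: N_max = L * n_lanes / ell."""
    return length_m * lanes / VEHICLE_SPACE_M
```

For the 657 m, single-lane link of the later network example, this gives a free flow travel time of about 59 s at 11.1 m/s and a storage of about 87 vehicles.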
The intersection logic by Gawron is that all links are processed in an arbitrary but fixed<br />
sequence, and a vehicle is moved to the next link if (1) it has arrived at the end of the link, (2)<br />
it can be moved according to capacity, and (3) there is space on the destination link. Figure 2.1<br />
gives the algorithm. The three conditions mean the following:<br />
A vehicle that enters a link at time t_0 cannot leave the link before time t_0 + T_0, where T_0 is the free flow link travel time as explained above.
The condition “vehicle can be moved according to capacity” is determined as
N < int(C), or N = int(C) and rnd < frac(C),
where int(C) is the integer part of the capacity C of the link (in vehicles per time step), frac(C) is its fractional part, N is the number of vehicles which have already left the link in the current time step, and rnd is a random number with 0 <= rnd < 1. According to this formula, on average C vehicles per time step can leave the link.
Each link has a buffer of size ceil(C), i.e. the first integer number larger than or equal to the link capacity (in vehicles per time step).
vehicles per time step). Vehicles are then moved from the link (the spatial queue) into the<br />
buffer according to the capacity constraint and only if there is space in the buffer; once in<br />
the buffer, vehicles can be moved across intersections without looking at the flow capacity<br />
constraints. This approach is borrowed from lattice gas automata, where particle movements<br />
are also separated into a “propagate” and a “scatter” step [24]. Vehicles move through the nodes without any delay, as all the constraints that define the eligible vehicles of a link are determined by the link properties.
[Figure 2.2: each link consists of a spatial queue followed by a buffer; vehicles move from the spatial queue into the buffer according to the capacity constraint, and from the buffer across the node into the destination link according to the storage constraint.]
Figure 2.2: Simplifying the intersection logic by introducing a separate buffer for each link besides the spatial queue.
As a desired side effect, this makes the update in the algorithm completely parallel: If a<br />
vehicle is moved out of a full link, the new empty space will only open in the buffer and not<br />
on the link, and will thus not become available at the upstream end until the next time step –<br />
at which time it will be shared between the incoming links according to the method described<br />
above. This has the advantage that all information which is necessary for the computation of a<br />
time step is available locally at each intersection before a time step starts – and in consequence<br />
there is no information exchange between intersections during the computation of a time step.<br />
Further details are given in algorithmic form in Figure 2.3.<br />
In order to systematically test the intersection logic, an intersection test suite was implemented<br />
[7]. This test suite goes through several different intersection layouts and tests them<br />
one by one to see if the dynamics behaves according to the specifications. The results of possible<br />
layouts typically look as shown in Figure 2.4.<br />
The curves in Figure 2.4 show time versus the number of vehicles that have left the link so<br />
far. Thus, the slope of the curve equals the measured flow capacity in vehicles per second. For<br />
the data in the figure, one link with a capacity of 500 vehicles/sec and one link with a capacity<br />
of 2000 vehicles/sec merge into a link with a capacity of 500 vehicles/sec. The curves are, for<br />
different algorithms explained below, time-dependent accumulated vehicle counts for the two<br />
incoming links. For approximately the first 50-100 time steps, both incoming links operate at<br />
full capacity (500 and 2000 vehicles/second) and fill the outgoing link. Until approximately<br />
time step 3400, both links discharge at rates 400 and 100 vehicles/sec, respectively. After that<br />
time, the first link is empty, and the second link now discharges at 500 vehicles/sec. Not all<br />
algorithms are similarly faithful in generating the desired dynamics; the thick black lines denote<br />
results from the algorithm which is the current implementation in the traffic flow simulator.<br />
Further details are explained in [7].<br />
In Figure 2.4, Algorithm-1 uses Gawron’s original algorithm as described in Section 2.2.1<br />
and in [26]. This algorithm may lead to wrong results. For example, when a vehicle leaves<br />
a full link, a free space becomes available immediately, so that another vehicle can enter the<br />
link in the same time step. Hence, the results of the simulation are dependent on the sequence<br />
in which the links are processed. As stated above, parallel update is used to get rid of this<br />
problem in the traffic flow simulation.<br />
Algorithm-2 uses the “fair intersections and parallel update” approach described above, and<br />
is provided in Figure 2.3. Algorithm-3, given in Figure 2.5, is very similar to Algorithm-2 ex-<br />
Vehicle Movement through Intersections<br />
// Propagate vehicles along links:<br />
for all links do<br />
while vehicle has arrived at end of link<br />
AND vehicle can be moved according to capacity<br />
AND there is space in the buffer (see Fig 2.2) do<br />
move vehicle from link to buffer<br />
end while<br />
end for<br />
// Move vehicles across intersections:<br />
for all nodes do<br />
while there are still eligible links do<br />
Select an eligible link randomly proportional to capacity<br />
Mark link as non-eligible<br />
while there are vehicles in the buffer of that link do<br />
Check the first vehicle in the buffer of the link<br />
if the destination link has space then<br />
Move vehicle from buffer to destination link<br />
Proceed to the next vehicle in the buffer<br />
else<br />
Break the inner while loop and proceed to the next eligible link<br />
end if<br />
end while<br />
end while<br />
end for<br />
Figure 2.3: Vehicle movement at the intersections. Note that the algorithm separates the flow<br />
capacity from intersection dynamics.<br />
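An executable sketch of the two phases of Figure 2.3 follows; the data structures are simplified assumptions, and free flow travel times and full vehicle routes are omitted for brevity:

```python
import math
import random
from collections import deque

class Link:
    def __init__(self, capacity, storage):
        self.capacity = capacity       # flow capacity C (vehicles per time step)
        self.storage = storage         # storage constraint (max vehicles on link)
        self.queue = deque()           # spatial queue of (vehicle, next_link)
        self.buffer = deque()          # buffer of size ceil(C)
        self.buffer_size = math.ceil(capacity)

    def has_space(self):
        return len(self.queue) + len(self.buffer) < self.storage

def may_release(moved, capacity):
    # Capacity condition from Section 2.2.1: moved < int(C),
    # or moved == int(C) with probability frac(C).
    whole = int(capacity)
    return moved < whole or (moved == whole and random.random() < capacity - whole)

def propagate(links):
    """Phase 1: move vehicles from the spatial queue into the buffer
    according to the capacity constraint and the buffer size."""
    for link in links:
        moved = 0
        while link.queue and len(link.buffer) < link.buffer_size \
                and may_release(moved, link.capacity):
            link.buffer.append(link.queue.popleft())
            moved += 1

def scatter(nodes):
    """Phase 2: move vehicles across intersections; eligible incoming links
    are selected randomly proportional to capacity, and only the storage
    constraint of the destination link is checked."""
    for incoming_links in nodes:       # a node is a list of its incoming links
        eligible = list(incoming_links)
        while eligible:
            weights = [l.capacity for l in eligible]
            link = random.choices(eligible, weights=weights)[0]
            eligible.remove(link)      # mark link as non-eligible
            while link.buffer:
                vehicle, dest = link.buffer[0]
                if dest is None:       # end of route: vehicle leaves the simulation
                    link.buffer.popleft()
                elif dest.has_space():
                    link.buffer.popleft()
                    dest.queue.append((vehicle, None))
                else:
                    break              # blocked: proceed to next eligible link
```

Because `propagate` only fills the buffer and `scatter` only reads buffers and destination-link occupancies, a time step needs no information exchange between intersections while it is being computed, mirroring the parallel update described above.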
[Figure 2.4: accumulated vehicle counts (0 to 600) over time (0 to 8000 time steps) for the two incoming links (“link 400” and “link 200”) under algorithms 1 to 5.]
Figure 2.4: Test suite results for the intersection dynamics. The curves show the number of discharging vehicles from two incoming links as explained in Section 2.2.2.
Algorithm-3 for Vehicle Movement through Intersections:<br />
Same as in Figure 2.3 up to this point
// Move vehicles across intersections:<br />
for all nodes do<br />
while there are still eligible links do<br />
Select an eligible link randomly proportional to capacity<br />
if the destination link has space then<br />
Move one vehicle from buffer to destination link<br />
Mark link as non-eligible and proceed to the next link<br />
else<br />
Proceed to the next link<br />
end if<br />
end while<br />
end for<br />
Figure 2.5: Handling intersections according to a modified version of the fair intersections algorithm. Similar to the algorithm in Figure 2.3, except that each link can now push only one vehicle at a time.
Algorithm-4 for Vehicle Movement through Intersections:<br />
for all nodes do<br />
if node visited for the first time then<br />
Choose first incoming link randomly<br />
end if<br />
for i = 1..(the number of incoming links) do<br />
Choose next incoming link via Metropolis sampling<br />
if link buffer is empty then<br />
Mark link as non-eligible<br />
else<br />
Take first vehicle in the buffer<br />
if destination link for vehicle has space then<br />
Move that vehicle from buffer to destination link<br />
else<br />
Mark link as non-eligible<br />
end if<br />
end if<br />
end for<br />
end for<br />
Figure 2.6: Handling intersections according to Metropolis sampling.<br />
cept that instead of serving all the “eligible” vehicles from an incoming link to their destination<br />
links, only one vehicle is moved at a time. Hence, Algorithm-3 and Algorithm-2 do not show<br />
any difference when links do not have capacities greater than 1.<br />
Algorithm-4 implements the fair intersections approach with a difference: the selection is done via Metropolis sampling [55], with one exception: when a node is processed for the first time, the first incoming link is selected randomly. In general, if the next link j has a lower capacity than the current link i, then link j is selected with a probability that depends on the ratio of the capacity of link j to the capacity of link i. Pseudo code of the algorithm is given in Figure 2.6.
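The Metropolis-style link selection can be sketched as follows; the acceptance probability min(1, C_j / C_i) is an assumption, since the exact acceptance rule is not spelled out here:

```python
import random

def metropolis_next_link(current, links, capacity):
    """Metropolis-style proposal of the next incoming link (sketch).

    A candidate link j is proposed uniformly at random. If its capacity is
    lower than that of the current link i, it is accepted with probability
    capacity[j] / capacity[i]; otherwise it is always accepted. Over many
    steps, links are thus visited roughly proportionally to their capacities.
    """
    candidate = random.choice(links)
    ratio = capacity[candidate] / capacity[current]
    if ratio >= 1 or random.random() < ratio:
        return candidate
    return current
```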
Algorithm 5 for Vehicle Movement through Intersections:<br />
for all nodes do<br />
if node visited for the first time then<br />
Choose first incoming link randomly according to capacity<br />
end if<br />
for i = 1..(the number of incoming links) do<br />
Choose next incoming link via Metropolis sampling<br />
if link buffer is empty then<br />
Mark link as non-eligible<br />
else<br />
Take first vehicle in the buffer<br />
if destination link for vehicle has space then<br />
Move that vehicle from buffer to destination link<br />
else<br />
Mark link as non-eligible<br />
end if<br />
end if<br />
end for<br />
end for<br />
Figure 2.7: Handling intersections according to the modified Metropolis sampling.<br />
Finally, Algorithm-5 is similar to Algorithm-4 except that if a node is visited for the first<br />
time, the first incoming link is selected according to the flow capacity. The algorithm is given<br />
in Figure 2.7.<br />
The queue model reads flow capacities, free speeds and link lengths from the input files and calculates the free flow link travel times accordingly. The free flow link travel time defines the minimum time that a vehicle must spend on that particular link. While this lower bound is known, the upper bound on the time a vehicle spends on a link before moving to the next one depends on how long the vehicle waits at the end of the link: if the randomized selection is not in favor of a link on which a vehicle is ready to move (Figure 2.3), the travel time on that link increases.
A remark has to be made about flow capacities: when several very short links (such as links with a buffer size of 1) exist, they reduce the number of vehicles discharged from the longer links, as the available space is reduced by the short links².
2.2.3 Graph Data as Input for Queue Simulation<br />
The traffic flow simulation is fed by the graph data (the street network) and the plans of vehicles<br />
to be executed. Plans are explained in Section 2.2.4. Before the execution of plans, the<br />
simulation reads nodes and links of the street network. The street network is defined in the<br />
XML [97] format and a rough example is shown in Figure 2.8. XML is explained in detail in<br />
2 The problem can be seen as follows: Assume a short link with a given non-integer capacity (per second), with<br />
long links of the same capacity both upstream and downstream. Then, according to standard queuing theory, the<br />
queue length on the short link follows a random walk. However, when that random walk makes the short link<br />
completely full, then the upstream link is no longer allowed to discharge into the short link. Since this happens<br />
fairly often with short links, this means that short links reduce the effective capacity. Note that the effective<br />
capacity reduction is felt for the upstream link. This phenomenon has little effect with the long links of the Swiss<br />
street network defined in Section 2.5, but became apparent with validation studies with the Navtec network of the<br />
Zurich area, which has many short links.<br />
<network>
  <nodes>
    <node id="1" x="651700" y="137200"/>
    <node id="2" x="652220" y="137600"/>
  </nodes>
  <links>
    <link id="2" from="2" to="1" length="657"
          capacity="12000" freespeed="11.1" permlanes="1"/>
    <link id="3" from="1" to="2" length="657"
          capacity="12000" freespeed="11.1" permlanes="1"/>
  </links>
</network>
Figure 2.8: An example of the graph data in the XML format
Section 3.4.1.<br />
Each node is identified by a unique ID and x-y coordinates. Each link has the attributes ID,<br />
node IDs that it connects, length, capacity, free flow speed and number of permanent lanes. The<br />
capacity is given in terms of “vehicles per time unit” and refers to the capacity (flow) constraint<br />
of the link.<br />
The graph data example in Figure 2.8 is composed of 2 nodes and 2 links. Links connect<br />
nodes by defining a direction, for example, link 2 is from node 2 to node 1.<br />
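Reading such a network file can be sketched with Python's standard library XML parser; the attribute names follow Figure 2.8:

```python
import xml.etree.ElementTree as ET

def read_network(xml_text):
    """Parse the nodes and links of the street network (sketch)."""
    root = ET.fromstring(xml_text)
    # Each node: unique ID plus x-y coordinates.
    nodes = {n.get("id"): (float(n.get("x")), float(n.get("y")))
             for n in root.iter("node")}
    # Each link: ID, the node IDs it connects, length, capacity,
    # free flow speed, and number of permanent lanes.
    links = [{"id": l.get("id"), "from": l.get("from"), "to": l.get("to"),
              "length": float(l.get("length")),
              "capacity": float(l.get("capacity")),
              "freespeed": float(l.get("freespeed")),
              "permlanes": int(l.get("permlanes"))}
             for l in root.iter("link")]
    return nodes, links

example = """<network><nodes>
  <node id="1" x="651700" y="137200"/><node id="2" x="652220" y="137600"/>
</nodes><links>
  <link id="2" from="2" to="1" length="657"
        capacity="12000" freespeed="11.1" permlanes="1"/>
</links></network>"""
nodes, links = read_network(example)
```

From the `from`/`to` attributes, the simulation can then build the arrays of incoming and outgoing link pointers at each node, as described below.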
Each node in the traffic flow simulation keeps track of its outgoing and incoming links.<br />
When a link of the graph data is read, pointers to it are placed in the arrays for incoming or<br />
outgoing links at the nodes that the link connects. The arrays for outgoing and incoming links<br />
are used, especially, when the movement of vehicles across nodes (intersections) is realized as<br />
written in Figure 2.3. Nodes check the buffers of incoming links for vehicles ready to move to<br />
any of the outgoing links. Furthermore, in the parallel implementation explained in Chapter 4,<br />
the vehicles that move across the boundaries are packed into messages by the nodes.<br />
Each link is mainly composed of a spatial queue and a buffer to separate the flow constraint<br />
from the intersection logic as described earlier. Both the buffer and the spatial queue are nothing<br />
but queues of pointers to vehicles. Besides these two structures, there are 3 more supplementary<br />
queues defined for each link:<br />
Parking queue: holds vehicles of initial legs (see Section 2.2.4) with start times in the<br />
future.<br />
Waiting queue: holds vehicles of initial legs (see Section 2.2.4) whose start time is up<br />
but which cannot make it into the traffic because of full links.<br />
Storage: holds vehicles of the second or higher legs. These legs can be executed only after the execution of the previous legs has been completed.
Links are also responsible for enforcing the constraints; hence, the nodes themselves do not need to deal with any constraints. As shown in Figure 2.8, the capacity constraint, which determines the size of the buffer, is read from the input data. The storage constraint is calculated from the length and the number of permanent lanes given in the input data (Section 2.2.1).
2.2.4 Vehicle Plans as Input for Queue Simulation<br />
Vehicles are inserted into one of the queues defined on links (Section 2.2.3) according to their<br />
start times and leg numbers. Hence, the simulation needs to know about the graph data before<br />
<person id="6357250">
  <plan>
    <act type="h" x100="387345" y100="276590" link="14584"/>
    <leg mode="car" dep_time="06:54:35" trav_time="00:30">
      <route>4902 4903 4904 4905 4906 4907 4908 4909</route>
    </leg>
    <act type="w" x100="387345" y100="276590"
         link="14606" dur="08:00"/>
  </plan>
</person>
Figure 2.9: An example for the plans data in the XML format
reading any vehicle information.<br />
An example of a person’s plan is given in Figure 2.9. Each person has a unique ID and a<br />
plan. A plan is composed of a set of activities. Each activity defines a location, given by the<br />
coordinates of the location and a link ID, on which the activity will start. Each pair of consecutive activities<br />
describes a leg of the plan. The leg provides information about the means of transportation,<br />
the earliest time that a vehicle can start its execution, the expected travel time from the start<br />
activity location to the end activity location of the leg, and a set of node IDs that defines a route<br />
that is supposed to be followed when moving from the start activity location to the end activity<br />
location.<br />
The traffic flow simulation creates a new agent/vehicle for each leg defined in a person’s<br />
plan. In case a person has more than one leg, the simulation makes sure that the higher-numbered<br />
legs wait for the completion of the execution of the previous legs.<br />
2.2.5 Events as Output of Queue Simulation<br />
Since the queue simulation does not aggregate data (Chapter 5.2.1), it only produces events as<br />
the output for the other modules in the system, which are better able to check the correctness<br />
of their own data aggregation. An event is produced whenever a vehicle moves from one queue<br />
to another or leaves the simulation due to various reasons. Possible events are of the following<br />
types (not limited to those listed here):<br />
departure: moving from the parking queue of a link to the waiting queue of the same<br />
link, since the start time has arrived.<br />
leaving a waiting queue: moving from the waiting queue of a link to its spatial queue to<br />
start simulating.<br />
leaving a link: leaving the current link.<br />
entering a link: entering the next link (vehicle leaves the current link just before this<br />
event happens).<br />
being stuck and leaving the simulation: getting stuck in congestion for a specific time<br />
period and leaving the simulation afterwards.<br />
arrival: arrival at the final destination.<br />
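A minimal record for such events might look as follows; the enum and field names are an illustrative reconstruction based on the event types above and the attributes of Figure 2.10, not MATSim's own data structures:

```cpp
// Event types produced by the queue simulation (illustrative).
enum class EventType { Departure, Wait2Link, LeftLink, EnteredLink, Stuck, Arrival };

struct Event {
    int time;        // simulated second (e.g. 6:00 AM = 21600)
    EventType type;
    int vehId;
    int legNum;
    int linkId;
    int fromNodeId;  // node at the upstream end of the link
};

// The departure event of vehicle 6465 at 06:00 on link 1523 (cf. Figure 2.10):
inline Event exampleDeparture() {
    return Event{6 * 3600, EventType::Departure, 6465, 0, 1523, 3827};
}
```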
A set of events of a vehicle in the XML (Section 3.4.1) format is shown in Figure 2.10.<br />
The example shows the events created while the plan of vehicle 6465 is executed. The vehicle<br />
&lt;event time="06:00" type="departure" veh_id="6465" legnum="0" link="1523" from="3827"/&gt;
&lt;event time="06:00" type="wait2link" veh_id="6465" legnum="0" link="1523" from="3827"/&gt;
&lt;event time="06:01" type="left link" veh_id="6465" legnum="0" link="1523" from="3827"/&gt;
&lt;event time="06:01" type="entered link" veh_id="6465" legnum="0" link="1524" from="3828"/&gt;
&lt;event time="06:28" type="left link" veh_id="6465" legnum="0" link="1524" from="3828"/&gt;
&lt;event time="06:28" type="entered link" veh_id="6465" legnum="0" link="1525" from="3829"/&gt;
&lt;event time="06:34" type="arrival" veh_id="6465" legnum="0" link="1525" from="3829"/&gt;
Figure 2.10: An example of the events data in the XML format<br />
starts simulating at 6 AM on link 1523 and, during its trip to the destination link 1525, traverses<br />
link 1524. All the events belong to leg 0. The upstream ends of links 1523, 1524 and<br />
1525 are located at nodes 3827, 3828 and 3829, respectively.<br />
2.3 Other Work<br />
Two arguments against the queue model are often that the intersection behavior is “unfair”<br />
in standard implementations, and that the speed of the backwards traveling jam (“kinematic”)<br />
wave is incorrectly modeled. The first problem was overcome by a better modeling of the<br />
intersection logic, as described in Section 2.2.2. The second problem still remains. What can<br />
be done about it?<br />
If one wants to avoid a detailed traffic flow simulation, such as is implemented in TRAN-<br />
SIMS [82] for example, then a possible solution is to use what is sometimes called “mesoscopic<br />
models” or “smoothed particle hydrodynamics” [28]. The idea is to have individual particles<br />
in the simulation, but have them moved by aggregate equations of motion. These equations of<br />
motion should be selected so that in the fluid-dynamical limit the Lighthill-Whitham-Richards<br />
[48] equation is recovered [15].<br />
The number of vehicles in a segment is updated according to<br />
$$ N_i(t+1) = N_i(t) + Q_i(t) - Q_{i+1}(t) + S_i(t) \qquad (2.1) $$
where $N_i(t)$ is the number of vehicles in segment $i$ at time $t$, $Q_i(t)$ is the flow of vehicles<br />
from segment $i-1$ into segment $i$ at time $t$, and $S_i(t)$ is the source/sink term given by entry<br />
and exit rates.<br />
What is missing is the specification of the flow rates $Q_i(t)$. A possible specification is<br />
given by the cell transmission model [15]:<br />
$$ Q_i(t) = \min\left\{ N_{i-1}(t),\; q_{\max},\; \frac{w}{v}\,\bigl( N_{\max} - N_i(t) \bigr) \right\} \qquad (2.2) $$
where $q_{\max}$ is the capacity constraint, $w$ is the jam wave speed, $v$ is the free speed, $N_{\max}$ is<br />
the maximum number of vehicles on the link, and all other variables have the same meaning as<br />
before.<br />
Note that this now exactly enforces the storage constraint by setting $Q_i(t)$ to zero<br />
once $N_i(t)$ has reached $N_{\max}$. In addition, the kinematic jam wave speed is given explicitly<br />
via $w$. There is some interaction between the length of a segment, the time step, and $w$ that needs to<br />
be considered. The network version of the cell transmission model [16] also specifies how<br />
to implement fair intersections. The cell transmission model is implemented under the name<br />
NETCELL [9].<br />
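A minimal numerical sketch of one update step of Equations (2.1) and (2.2): the function below iterates a chain of segments, assuming closed boundaries and omitting the source/sink term; all names and parameter values are illustrative, not taken from any of the cited implementations:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One update step of the cell transmission model for a chain of segments.
// n[i]  : number of vehicles in segment i
// qmax  : capacity constraint (max. flow per time step)
// nmax  : storage constraint (max. vehicles per segment)
// wOverV: ratio of jam wave speed w to free speed v
std::vector<double> ctmStep(std::vector<double> n, double qmax,
                            double nmax, double wOverV) {
    // q[i] is the flow from segment i-1 into segment i, Eq. (2.2);
    // boundaries are closed: q[0] = q[n.size()] = 0.
    std::vector<double> q(n.size() + 1, 0.0);
    for (std::size_t i = 1; i < n.size(); ++i) {
        q[i] = std::min(std::min(n[i - 1], qmax), wOverV * (nmax - n[i]));
    }
    // Mass conservation, Eq. (2.1), without the source/sink term S_i(t).
    for (std::size_t i = 0; i < n.size(); ++i) {
        n[i] += q[i] - q[i + 1];
    }
    return n;
}
```

The storage constraint is visible directly: once a segment holds nmax vehicles, the third term of the minimum becomes zero and its inflow is blocked.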
Other link dynamics are, for example, provided by DynaMIT [19], DYNASMART [20] or<br />
DYNEMO [61]. These are based on the same mass conservation equation as Equation (2.1), but<br />
use different specifications for $Q_i(t)$. In fact, DynaMIT and DYNASMART calculate vehicle<br />
speeds at the time of entry into the segment depending on the number of vehicles already in<br />
the segment. The number of vehicles that can potentially leave a link in a given time step is, in<br />
consequence, given indirectly via this speed computation. Since this is not enough to enforce<br />
physical queues, physical queuing restrictions are added to this description. DYNEMO varies<br />
a vehicle’s speed continuously along the link based on traffic conditions of the current and the<br />
next segment.<br />
2.4 The Basic Benchmark<br />
A real-world scenario is preferred for the benchmarks throughout this thesis instead of using a<br />
synthetic scenario or using only the theoretical performance predictions.<br />
A theoretical performance prediction gives an idea of what to expect.<br />
However, such predictions may miss performance-relevant details that appear only<br />
when real-world data is used. For example, if a test data set is small enough to fit<br />
into a computer's memory while the real-world data is larger than the available memory,<br />
predictions based on the small data set will be distorted by caching effects: if the<br />
data set fits into the cache, which is a small high-speed memory, there can be a<br />
significant speed-up since the data is accessed with a higher<br />
speed.<br />
Synthetic scenarios are generated from synthetic data. They are used to make generalizations<br />
about the performance of real-world scenarios. If no real-world scenario with enough information<br />
to test all the features of a benchmark is available, a synthetic scenario with full<br />
information is useful. Furthermore, where applicable, the results are easier to transfer<br />
between different scenarios. On the other hand, the more details a realistic data set needs<br />
to cover, the harder the generation of a comparable synthetic scenario becomes.<br />
2.5 A Practical Scenario for the Benchmarks<br />
One of the conditions that a traffic flow simulation must fulfill is that the simulation should<br />
be able to run scenarios of a meaningful size within acceptable periods of time. From the<br />
transportation planning point of view, such scenarios are large scale real-world problems, which<br />
include millions of agents and all kinds of traffic.<br />
The street network of Switzerland used in the benchmarks of this thesis was originally<br />
developed for the Federal Office for Spatial Development (ARE) [6]. Then, the network was<br />
extended to include the major European transit corridors for a railway-related study [85]. The<br />
version of the street network used through this thesis is a derivation of the extended network.<br />
It contains 10 564 nodes and 28 622 links.<br />
The nodes of the street network are the intersections of roads and are defined by geographical<br />
coordinates. The links are the roads that connect two nodes. Each link is unidirectional and<br />
has attributes such as type, length, speed, and capacity (the capacity constraint).<br />
A scenario called ch6-9 is used in the benchmarks throughout this work. It contains around<br />
1 million trips which start between 6:00 AM and 9:00 AM; the scenario is thus aimed at<br />
simulating morning rush-hour traffic. These trips are based on a realistic demand [65].<br />
The steps followed when converting trips to agents and plans are: (1) a unique agent is<br />
assigned to each trip, and (2) the starting and ending links of a trip become the home and work<br />
locations of the agent. Corresponding activities are created at these locations, so that (3) each<br />
trip becomes a leg between these two activities, and (4) each trip is completed with a route from<br />
the start link to the end link based on free flow travel times in the network.<br />
As stated in Chapter 1, a systematic relaxation is used to carry the initial state of a system to<br />
a relaxed state. With respect to relaxation, the initial plans are fed into the traffic flow simulator<br />
that represents the physical world of the framework as explained in Chapter 1. The results of<br />
the traffic flow simulation are used to improve plans of some agents, which are merged with the<br />
rest of the agents (whose plans have not been changed). The merged plans, then, are fed into<br />
the traffic flow simulator. Each iteration, therefore, involves reading the input files (plans and<br />
graph data), executing all the plans and improving some of the plans according to the output<br />
of the traffic flow simulation. The process is repeated until the system is relaxed, which takes<br />
about 50 iterations. Earlier investigations have shown that this is more than enough to reach<br />
relaxation [68].<br />
During the performance tests of the traffic flow simulation given in the next chapters, the<br />
ch6-9 scenario is simulated for 3 hours, i.e. 10800 time steps (1 time-step means 1 simulated<br />
second).<br />
2.6 Summary<br />
Using a traffic flow simulator for transportation planning is a method for network loading,<br />
which is one of the two components of Dynamic Traffic Assignment (DTA). Traffic flow simulations<br />
are distinguished by criteria such as resolution (individual vs. aggregated entities), how realistic<br />
the behavior of the entities is, the modes of the entities, and the time resolution.<br />
The existing traffic flow simulators are realistic and detailed, but their complexity<br />
makes them difficult to use. The queue simulation presented here is not only favored for its<br />
simplicity but is also realistic enough for forecasting in transportation planning [65].<br />
The standard queue-based implementation comes with two main shortcomings: it exhibits<br />
an unfair behavior at the intersections and incorrectly models the speed of backwards traveling<br />
jam waves. The former is remedied by improving the standard intersection behavior as explained<br />
in Section 2.2.2. The solution to the latter is still absent: in the queue model, a traffic<br />
jam dissolves from the upstream end, as opposed to dissolving from the downstream end<br />
as in kinematic wave theory. Some other transportation planning software packages resolve<br />
the problem; however, they either suffer from being too complicated, such as the CA model,<br />
or are not modeled at the level of individual entities and, in consequence, lack agent-oriented<br />
approaches.<br />
Despite this remaining shortcoming, the queue model meets the criteria of being comparable<br />
to static traffic assignment, as shown in [65], computing fast, and being receptive to<br />
agent-oriented approaches.<br />
Chapter 3<br />
Sequential Queue Model<br />
3.1 Introduction<br />
The queue model is explained in Chapter 2. The computational concern is mentioned as one of<br />
the reasons why the queue model was chosen:<br />
“The model should be computationally fast so that scenarios of a meaningful size<br />
can be run within acceptable periods of time.”<br />
In order to run meaningful scenario sizes, parallel computing is used for the traffic flow<br />
simulation. This will be explained in Chapter 4 in detail. With parallel computing, the same<br />
simulation runs on different computers on different pieces of data. For example, the same<br />
traffic flow simulation code simulates different geographical areas. Because of the parallel<br />
execution, the results can be obtained faster than it can be done with single-CPU execution.<br />
Although parallel programming speeds up the execution of a program, it is not enough by<br />
itself. Improving the single-CPU version of the same program is significant as well in terms of<br />
performance.<br />
The first traffic flow simulator as a part of this thesis was written in C. That simulator<br />
displayed considerable computational performance, but turned out to be difficult to maintain.<br />
For that reason, an alternative traffic flow simulator in C++ [80, 70] was programmed, taking<br />
advantage of the new possibilities that the Standard Template Library (STL, Section 3.3.1)<br />
offers. However, that new C++ code turned out to be about a factor of two slower than the old<br />
C code.<br />
Many system designers prefer object-oriented programming (such as C++ and Java [42])<br />
because the complexity of systems has increased over the last decades. C++ has become a<br />
dominant programming language for those complex systems. Moreover, recommendations on<br />
how to approach certain problems in C++ have been developed (e.g. [53]). Today, with careful<br />
programming, C++ using the STL can be as fast as C.<br />
For that reason, it was attempted to bring the C++ traffic flow simulator to the same computational<br />
speed as the C traffic flow simulator. This was done by implementing and testing<br />
several different approaches recommended by [53], and by inserting C code into time-critical<br />
pieces of the code. The reason for doing this is to find out where the C++ implementation has<br />
performance disadvantages compared to the C implementation, and how severe these<br />
disadvantages are. Results of the investigation can then be used to make informed decisions regarding<br />
the trade-off between maintainability and computational performance of the code.<br />
3.2 The Benchmark<br />
The traffic flow simulator described in Chapter 2.2 is a part of the transportation planning<br />
system explained in Chapter 1. One of the goals of such a system is running a realistic and<br />
meaningful size scenario to make data analysis and predictions. This planning system is not<br />
complete in the sense of common transportation planning, which also includes all modes of<br />
transportation, freight traffic, etc. Such a complete system involves about 7.5 million travelers,<br />
and more than 26 million trips including short pedestrian trips, etc [2].<br />
However, in order to make the transportation planning system explained in Chapter 1 useful<br />
in the real world, a scenario “ch6-9” described in Section 2.5 is used. ch6-9 is a subset of the<br />
data for the full 24-hour car-only simulation, and contains about 1 million trips.<br />
When the traffic flow simulation is coupled with the strategy generation modules via files<br />
as explained in Section 5.2.1, it takes the data from the input files and produces the data into<br />
the output files. The computational performance of the traffic flow simulation is measured<br />
excluding the performance of input reading and output writing. There are two reasons for this:<br />
As investigated in Chapter 6 and Chapter 7, external modules can be defined to handle<br />
input and output of a traffic flow simulation. Using files is just an implementation issue.<br />
Hence, the performance of the simulation itself (i.e., how the graph data and vehicles are<br />
represented, how the data is accessed, how the rules are executed) is the main concern.<br />
I/O requires accesses to the disk where the files are stored. However, I/O performance<br />
is limited by disk speeds, and file I/O operations usually deliver a low performance.<br />
Moreover, using files is just an implementation issue; as explained in Chapter 6 and<br />
Chapter 7, for example, a message passing approach may be used to replace files with<br />
messages.<br />
During the measurements of the traffic flow simulation performance, a 3-hour time period is<br />
simulated. Time steps are incremented by 1 simulation-time second; therefore, the total number<br />
of time steps simulated is 10800. In each time step, 3 basic movements are accomplished:<br />
movement through intersections, movement along links and movement from waiting queues<br />
(where vehicles wait to enter the simulation) to the spatial queues. Each of these movement<br />
steps loops over all the nodes or all the links of the graph data. Accordingly, they dominate the<br />
overall performance.<br />
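The three movement phases can be sketched as the following per-time-step loop; the class and function names are placeholders, not the simulator's actual interface (here they only count how often each entity is visited):

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int intersectionMoves = 0;
    void moveVehiclesThroughIntersection() { ++intersectionMoves; }
};
struct Link {
    int linkMoves = 0, waitingMoves = 0;
    void moveVehiclesAlongLink() { ++linkMoves; }
    void moveWaitingToSpatialQueue() { ++waitingMoves; }
};

// One iteration corresponds to one simulated second; the ch6-9 benchmark
// runs 10800 of these steps (3 simulated hours).
void runSimulation(std::vector<Node>& nodes, std::vector<Link>& links,
                   int timeSteps) {
    for (int t = 0; t < timeSteps; ++t) {
        for (std::size_t i = 0; i < nodes.size(); ++i)
            nodes[i].moveVehiclesThroughIntersection();
        for (std::size_t i = 0; i < links.size(); ++i)
            links[i].moveVehiclesAlongLink();
        for (std::size_t i = 0; i < links.size(); ++i)
            links[i].moveWaitingToSpatialQueue();
    }
}
```

The structure makes the performance argument visible: each time step sweeps once over all nodes and twice over all links, so these loops dominate the runtime.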
The figures in the next sections, which show computational performance curves, are plotted<br />
on a multiple-CPU basis since results of some approaches depend on the number of CPUs.<br />
3.3 Performance Issues for C++/STL and C Functions<br />
C++ has been made more functional by the introduction of the Standard Template Library<br />
(STL). The STL is an extensive library of common containers and functions written using C++<br />
templates. In this section, some remarks regarding the experiences with using different STL<br />
containers for different purposes in the traffic flow simulation are given. Although some of the<br />
results just confirm the common sense, some others are specific to the situation that exists in<br />
this work.<br />
The next section gives a brief description of the STL. The sections following Section 3.3.1<br />
discuss different implementation alternatives and their performance for the different parts of the<br />
traffic flow simulator. Section 3.3.2 compares STL-map and STL-vector used to represent<br />
the street network. Using STL-multimap for the parking and waiting queues of the links in<br />
the street network is explained in Section 3.3.3. The same section promotes an alternative data<br />
structure, namely, a self-implemented singly linked list, and gives the test results for these two<br />
implementations. Section 3.3.4 discusses different implementations for the link queues, i.e.,<br />
the spatial queue and the link buffer. The alternatives are using STL-deque, STL-list and<br />
a self-implemented data structure Ring.<br />
3.3.1 The Standard Template Library<br />
The Standard Template Library (STL) is a C++ library composed of the following components:<br />
Collections of standard container types. Containers are implemented as templates, a<br />
special feature of C++, and can contain objects of any type. Examples are map, deque<br />
(double-ended queue), vector, list, etc.<br />
Algorithms defined on containers. Examples are: accessing an element, sorting the elements<br />
in a container, searching for an element, etc.<br />
Iterators used for traversing the elements of a container.<br />
The STL not only hides the implementation details of its components but also provides<br />
elegant data structures and algorithms for users. The encapsulation and abstraction properties of<br />
the STL enable programmers to focus on application-specific issues.<br />
The STL provides two types of containers: Sequence containers store data in a linear sequence.<br />
The “sequence” depends on time and position of insertion. The position of an element<br />
in the container is independent of the element’s value. vector, deque and list are of this<br />
type.<br />
Associative containers, on the other hand, are sorted data structures. They associate the<br />
domain of one type (key) with the domain of another type (value). The position of an element<br />
in such a container depends on its key. Examples are map, multimap, set, etc.<br />
Operations defined on containers, such as insert, delete, or retrieve, differ in performance.<br />
Container selection is dependent on characteristics of the applications and on call frequency.<br />
Some examples are given in the next subsections.<br />
Each iterator represents a certain position in a container. Regardless of the container for<br />
which it is defined, an iterator comes with a set of basic operators. The basic operators are ++<br />
(stepping forward to the next element), == (equal), != (not equal), = (assignment operator)<br />
and * (dereference operator). Since an iterator is an object, the user must create instances of<br />
the iterator class prior to using them.<br />
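As a small illustration, these basic operators suffice to traverse any container, e.g. a vector of integers:

```cpp
#include <vector>

// Sum the elements of a vector using only the basic iterator operators
// described above: = (assignment), != , ++ and * (dereference).
int sumWithIterator(const std::vector<int>& values) {
    int sum = 0;
    std::vector<int>::const_iterator it = values.begin();  // = (assignment)
    for (; it != values.end(); ++it) {                     // != and ++
        sum += *it;                                        // * (dereference)
    }
    return sum;
}
```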
3.3.2 Containers: Map vs Vector for Graph Data<br />
Accessing the graph data, i.e. the street network, is one of the key issues in the simulation.<br />
This is because every single item (a link or a node) of the graph is visited several times in each<br />
time step during the simulation run. Moreover, the searching algorithm for the elements of a<br />
container, which represents the graph data, is crucial since searching for a single element in the<br />
graph data is done more than once: (1) when plans are read, the start and the end locations have<br />
to be searched in the graph data, and (2) every time a vehicle on a link at the border needs to<br />
enter the next link, the next link is searched in the graph data to find out to which computing<br />
node it belongs in the parallel implementation.<br />
The overall approach of using STL containers for the graph data looks as in Figure 3.1 (using<br />
graph nodes as the example 1 ).<br />
1 For non-C-experts: typedef aa bb means that from now on, bb will be translated to aa before the<br />
19
// make "Nodes" a container type:
typedef CONTAINER&lt;Node*&gt; Nodes;
// make "NodesIterator" an iterator over Nodes:
typedef Nodes::iterator NodesIterator;
// declare the container that will contain the nodes:
Nodes nodes;
Figure 3.1: STL-containers for the graph data<br />
The iterator is useful in order to be able to go through all nodes and do something with them<br />
without having to worry about the efficiency of retrieval. Specific examples are given below.<br />
Several operations are, then, needed with respect to that container:<br />
Adding new nodes during initialization.<br />
Going through all nodes in each time step of the simulation (using the iterator).<br />
Finding nodes by their “name” (“key”).<br />
Two implementations were tested: (1) using a map container and (2) using a vector<br />
container.<br />
Map<br />
map is an associative container that can be indexed by any type. Indices (keys) can be simple<br />
types such as integers or sophisticated objects. An STL-map represents a mapping from one<br />
type (key type) to another type (value type). Hence, it allows the management of key-value<br />
pairs.<br />
An advantage of using the STL-map for nodes and links is that it is possible to straightforwardly<br />
retrieve them by their label number: a command such as nodes[1234] is possible<br />
and will retrieve the node with the label number 1234. ID numbers are typically non-sequential,<br />
so it is not possible to use standard array indexing instead.<br />
Sample code using an STL-map for nodes (links are analogous) looks as in Figure 3.2. The<br />
code means that there is a class Node defined somewhere else, and the STL-map container<br />
is loaded with pointers to node instances.<br />
The advantage of using the STL-map is that nodes can be addressed using their IDs using<br />
exactly with the same syntax as one is used from arrays. A slight disadvantage may be the<br />
make pair syntax that one needs to get used to, and the retrieval via second in the iterative<br />
loop. The iterator loop syntax is awkward but standard for all containers.<br />
Vector<br />
An STL-vector is a sequence type of container, which is composed of contiguous blocks of<br />
objects. Element insertion in an STL-vector container can be done at any point in the sequence.<br />
If the insert(position,object) method is used, insertion becomes expensive<br />
at the beginning or in the middle. Since elements are arranged contiguously, all elements that<br />
follow the insertion point need to be shifted. An example of this case is illustrated in Figure 3.3.<br />
compiler does anything else. The statement is particularly useful to convert fairly technical expressions such as<br />
map&lt;Id, Node*&gt; into something readable such as Nodes (indicating a container that contains nodes).<br />
typedef map&lt;Id, Node*&gt; Nodes;
typedef Nodes::iterator NodesIterator;
Nodes nodes;
[...]
read node information;
nodes.insert(make_pair(nodeId, node));
[...]
// go through all nodes and do something with them:
for (NodesIterator it = nodes.begin(); it != nodes.end(); ++it) {
    Node* node = it-&gt;second;
    node-&gt;doSomethingWithIt();
}
[...]
// find a node by Id:
Node* node = nodes[theId];
Figure 3.2: The STL-map for the graph data<br />
index:     0    1    2    3    4    5    6
element: O13   O2  O48   O9  O26  O33  O14
BEFORE CALLING insert(3, O19)

index:     0    1    2    3    4    5    6    7
element: O13   O2  O48  O19   O9  O26  O33  O14
AFTER CALLING insert(3, O19)

Figure 3.3: Insertion in the middle of an STL-vector. insert(position, object) is<br />
a method defined on the STL-vector. The elements behind the insertion position are shifted<br />
when the insert command is used.<br />
In general, the performance of the insertion into any container depends on the type of container<br />
and where the insertion takes place. In particular, insertion at the end of an STL-vector is<br />
very fast.<br />
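The difference can be made concrete with a small sketch: building the same vector by repeated insertion at the front shifts every existing element on each call (O(n²) in total), whereas building it with push_back only appends (amortized O(n) in total):

```cpp
#include <vector>

// Insert n-1, n-2, ..., 0 at the front: every call shifts all elements
// already in the vector, as illustrated in Figure 3.3.
std::vector<int> buildByFrontInsert(int n) {
    std::vector<int> v;
    for (int i = n - 1; i >= 0; --i)
        v.insert(v.begin(), i);  // O(v.size()) shift per insertion
    return v;
}

// Append 0, 1, ..., n-1 at the end: no shifting, amortized constant per call.
std::vector<int> buildByPushBack(int n) {
    std::vector<int> v;
    for (int i = 0; i < n; ++i)
        v.push_back(i);
    return v;
}
```

Both functions produce the identical vector 0..n-1; only the cost of getting there differs.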
Some code elements to use an STL-vector for nodes (once more, links are analogous)<br />
look as in Figure 3.4. The insert of the map is replaced by a push_back, which means that<br />
the new node pointer is just added at the end of the STL-vector. The iterator is essentially<br />
the same as before, except that one does not need the second, because the element that is<br />
retrieved by it is no longer a (pointer to a) pair but a (pointer to a) node.<br />
An issue with an STL-vector data structure is now searching for a key, for example to<br />
find the pointer to a node that is denoted by an ID number. A naive solution would be a linear<br />
search, the rough code is shown in Figure 3.5.<br />
typedef vector&lt;Node*&gt; Nodes;
typedef Nodes::iterator NodesIterator;
Nodes nodes;
[...]
read a node;
nodes.push_back(node);
[...]
// sort the nodes (using the sorting criterion of Figure 3.6):
sort();
[...]
// go through all nodes and do something with them:
for (NodesIterator it = nodes.begin(); it != nodes.end(); ++it) {
    Node* node = *it;
    node-&gt;doSomethingWithIt();
}
[...]
// find a node by Id:
Node* node = findNodeById(theId);
Figure 3.4: The STL-vector for the graph data<br />
Node* findNodeById(Id theId) {
    for (NodesIterator it = nodes.begin(); it != nodes.end(); ++it) {
        Node* node = *it;
        if (node-&gt;getId() == theId)
            return node;
    }
    // node with given Id not found:
    error();
}
Figure 3.5: Linear search for the graph data<br />
However, a better approach is to pre-sort the STL-vector according to the node IDs<br />
and then to use a binary search instead, which reduces the average-case and worst-case access<br />
times from O(N) to O(log N). Fortunately, both sorting and binary search are already<br />
provided by the STL, so they are easy to use. The only issue is to provide the sorting<br />
criterion to the algorithm. The code for sorting the elements of the graph data stored in an<br />
STL-vector and the sorting criterion are given in Figure 3.6.<br />
Often, links and nodes are already provided in the correct order by the files. In that case,<br />
initialization time can be further reduced by checking that they are indeed provided in the<br />
// calling the sorting algorithm
void sort() {
    sort(nodes.begin(), nodes.end(), comparisonClass());
}

// defining the comparison class
class comparisonClass {
private:
    // the comparison function defines ascending order
    bool keyLess(const int&amp; k1, const int&amp; k2) const {
        return (k1 &lt; k2);
    }
public:
    // comparison based on IDs
    // comparing two objects lhs and rhs
    template &lt;class T&gt;
    bool operator()(const T* lhs, const T* rhs) const {
        return keyLess(lhs-&gt;id(), rhs-&gt;id());
    }
    // comparing an object lhs with a value k
    template &lt;class T&gt;
    bool operator()(const T* lhs, const int&amp; k) const {
        return keyLess(lhs-&gt;id(), k);
    }
    // comparing a value k with an object rhs
    template &lt;class T&gt;
    bool operator()(const int&amp; k, const T* rhs) const {
        return keyLess(k, rhs-&gt;id());
    }
};
Figure 3.6: Sorting the graph data stored in an STL-vector<br />
correct sequence, and thus sorting can be skipped.<br />
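Once the STL-vector is sorted, the binary search itself can be expressed with the STL's lower_bound algorithm. The sketch below uses plain integer IDs instead of node pointers; for the node case, the comparison class of Figure 3.6 would be passed as the fourth argument to lower_bound:

```cpp
#include <algorithm>
#include <vector>

// Binary search in a pre-sorted vector of IDs: O(log N) instead of the
// O(N) linear scan of Figure 3.5. Returns the position of theId, or -1
// if the ID is absent.
int findIndexById(const std::vector<int>& sortedIds, int theId) {
    std::vector<int>::const_iterator it =
        std::lower_bound(sortedIds.begin(), sortedIds.end(), theId);
    if (it == sortedIds.end() || *it != theId)
        return -1;  // Id not found
    return static_cast<int>(it - sortedIds.begin());
}
```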
Results<br />
Figure 3.7 shows the simulation runtime results for using the STL-vector and the STL-map structures to represent the graph data. Figure 3.7(a) plots the data points for the RTR. RTR means Real Time Ratio, which shows how much faster the simulation runs than real life. Figure 3.7(b) contains the speed-up, which shows how much the execution speeds up when the number of traffic flow simulators running in parallel in the system is increased. The concepts of RTR and speed-up are covered in detail in Chapter 4. The data points are labeled as “Single” and “Double”, which means that the results are gathered by running either one simulation or two simulations per computer, respectively.
Performance gain using the STL-vector for the graph data instead of the STL-map is<br />
seen in Figure 3.7. In these tests, the STL-multimap (Section 3.3.3) is used as the data<br />
structure for the parking queues and waiting queues and the self-implemented Ring structure<br />
[Figure 3.7, two panels: (a) “RTR, Diff. Data Str. for Graph Data” — Real Time Ratio vs. Number of CPUs (1-128); (b) “Speedup, Diff. Data Str. for Graph Data” — Speedup vs. Number of CPUs (1-128); curves: Single/Double × Vector/Map.]
Figure 3.7: RTR and Speedup for using different data structures for the graph data. “Single”<br />
means only one traffic flow simulation is run per computing node. “Double” refers to running<br />
two traffic flow simulations per computing node. In this test, an STL-multimap is used for parking and waiting queues, and the Ring class is used for spatial queues and link buffers.
(Section 3.3.4) represents the spatial queues and buffers of the links.<br />
Using the STL-vector accelerates the traffic flow simulation by about 15% relative to the STL-map for large numbers of CPUs; for small numbers of CPUs, the relative performance increase is up to 18%. The STL-map mainly suffers from the cost of searching for and accessing an item in the container. Hence, replacing the STL-map by the STL-vector and switching the search algorithm to binary search yields better performance.
Map vs Vector for Graph Data: Recommendations<br />
Although using the STL-vector to represent the graph data elements, namely nodes<br />
and links, along with the STL’s binary search algorithm is 15-18% faster than using<br />
the STL-map and the STL-map’s find method, it comes with higher programming<br />
overhead. For the cases that require faster computation, the STL-vector is recommended.<br />
If one prefers to skip the programming overhead, the STL-map should be<br />
chosen.<br />
3.3.3 Containers: Multimap vs Linked List for Parking and Waiting<br />
Queues<br />
Parking and waiting queues are zones where a vehicle waits until it is ready to enter the simulation. In other words, a vehicle waits in these containers until its departure time is reached.
A person’s plan can have more than one leg. Each leg is defined as a route between two<br />
locations. If a person has a plan, which includes a trip from home to work and then from work<br />
to leisure, then it means the person’s plan has two legs.<br />
When a person’s plan is read by the simulation, a vehicle is created for each leg. If it is the<br />
first leg, then the vehicle is added to the parking queue of the link at which the vehicle starts. In<br />
each time step, the parking queue of each link is checked for vehicles that are ready to start at<br />
the current time step. Those vehicles are moved to the waiting queue of the link. If the spatial<br />
    typedef multimap<Time, Veh*> WaitQueue;
    WaitQueue waitQueue;

    typedef multimap<Time, Veh*> ParkQueue;
    ParkQueue parkQueue;

Figure 3.8: Declarations for waiting and parking queues with the STL-multimap
    typedef Linked<Time, Veh*> WaitQueue;
    WaitQueue waitQueue;

    typedef Linked<Time, Veh*> ParkQueue;
    ParkQueue parkQueue;

Figure 3.9: Declarations for waiting and parking queues with linked lists
queue is not full, the vehicle is moved from the waiting queue into the spatial queue so that it<br />
can start its trip.<br />
Hence, in each time step, waiting and parking queues are checked for the eligible vehicles.<br />
In realistic scenarios, most of the vehicles wait in these queues because their drivers are actually<br />
performing an activity. Checking for all eligible vehicles and accessing their information and<br />
moving them to other queues when necessary comes with computational cost. Therefore, an<br />
appropriate data structure should be used for these queues. That data structure needs to make<br />
available the vehicle with the next scheduled departure at low performance cost. For this, a<br />
partial ordering would in fact be sufficient. However, there is no data structure in the STL<br />
which supplies a fully efficient partial ordering. Therefore, two fully ordered data structures<br />
were tested: the STL-multimap, and a self-implemented singly linked list.<br />
Note that this section only discusses waiting and parking queues. Data structures for link<br />
cells will be explained in the next subsection.<br />
Multimap<br />
An easy-to-use fully sorted data structure is the STL-multimap. One just inserts key-item<br />
pairs, with the keys equal to the start time of vehicles and the items being pointers to vehicles,<br />
and the resulting data structure is automatically sorted. The difference between the STL-map,<br />
as mentioned above, and the STL-multimap is that the latter accepts multiple elements for<br />
the same key. This is necessary since it is possible that several vehicles want to depart at the<br />
same time step. The declarations by using STL-multimap are given in Figure 3.8.<br />
User-defined singly linked list<br />
There are three operations defined on these queues: Insertion of an element into a queue, retrieving<br />
the first element of the queue, and deleting the first element of the queue. Unfortunately<br />
these operations are rather costly with the STL-multimap. Based on the experience with the C version of the simulation, where a linked list had been used for these queues, a linked list is also implemented in the C++ version to handle the waiting and parking queues.
The Linked class in Figure 3.9 represents a singly linked list where each item in the list<br />
has a pointer to the next item. Insertion at the end of the list and insertion according to a key<br />
value into the sorted list are available. The latter is important so that vehicles can be sorted<br />
according to their start times.<br />
[Figure 3.10, two panels: (a) “RTR, Diff. Data Str. for Parking and Waiting Queue” — Real Time Ratio vs. Number of CPUs (1-128); (b) “Speedup, Diff. Data Str. for Parking and Waiting Queue” — Speedup vs. Number of CPUs (1-128); curves: Single/Double × Linked List/Map.]
Figure 3.10: RTR and Speedup for using different data structures for waiting and parking<br />
queues. An STL-vector is used for the graph data and the Ring class is used for the spatial<br />
queues and the link buffers.<br />
Results<br />
Figure 3.10 shows RTR and speed-up results for the simulation runtime with the STL-multimap and the linked list implementations of the waiting and parking queues. In these tests, the vector container of the STL is employed for the graph data (Section 3.3.2), and the spatial queues and link buffers are implemented by using the self-implemented Ring class (Section 3.3.4).
The linked list implementation of the queues performs better. Quantitatively, the linked list version improves the performance by about 9% for large numbers of CPUs. For small numbers of CPUs, the relative gain is about 18%.
Multimap vs Linked List for Parking and Waiting Queues: Recommendations
Using a singly linked list class with proper methods, similar to the STL's containers, is recommended over the STL-multimap: not only are the operations on it faster, but the implementation details can also be hidden from users, as is done in the STL's containers.
3.3.4 Containers: Ring, Deque and List Implementations of Link Queues<br />
Each link of the traffic flow simulator has two more queues: one for the spatial queue and<br />
one for the buffer used in through-node movement as explained in Chapter 2. In contrast to<br />
Section 3.3.3, the link and the buffer queues are true FIFO (First In First Out) data structures,<br />
and they have finite size.<br />
The way links operate is important in terms of performance since the links are accessed<br />
several times in each time step of the simulation:<br />
- The waiting queue of each link is checked to see if there are vehicles to move from the waiting queue to the spatial queue.
- Each spatial queue is checked to see if there are vehicles to move into the buffer of the link according to the capacity constraint.
- The buffer of each link is checked to find out if there is any vehicle to move to the next link.
In this section, two STL containers are tested, namely the list and the deque. Because of their low performance, a user-defined data structure, Ring, is additionally implemented and tested.
List<br />
The STL-list is a doubly linked list. With the STL-list, insertion anywhere is fast but<br />
it provides only sequential access. Since random access is not needed, this should in theory<br />
be a fast data structure for the purpose here. Unfortunately, the STL-list comes with high<br />
overhead as explained below.<br />
Deque<br />
The STL-deque (double-ended queue) is similar in usage and syntax to the STL-vector.<br />
It allows random access and inserts elements fast at either end. Therefore the STL-deque<br />
is the data structure of choice when most insertions and deletions take place at the beginning<br />
or at the end of a container. The STL-deque is different from the STL-vector in terms<br />
of memory management. When resizing is necessary, the STL-deque allocates memory in<br />
chunks as it grows, with room for a fixed number of elements in each reallocation. In other<br />
words, the STL-deque uses a series of smaller blocks of memory. The STL-vector, on the<br />
other hand, allocates its memory in a contiguous block, i.e., the STL-vector is represented<br />
by one long block of memory.<br />
Self-implemented Ring data structure<br />
To overcome the inefficiencies of the STL-list and the STL-deque, a new data structure,<br />
Ring is implemented. Ring is a circular vector. Removal is at the beginning and insertion<br />
is at the end. Removing from the beginning of the STL-vector is a costly operation since<br />
it includes the forward movement of all the remaining elements. In order to get rid of this<br />
difficulty, supplementary pointers are used to keep track of the head and the tail of the data<br />
structure. Hence, with this new structure, only head/tail pointers move back and forth, not the<br />
elements as in the STL-vector container. The same pointers are also used for insertion.<br />
Figure 3.11(a) shows how insertion takes place at the end of Ring structure. The supplementary<br />
pointers head and tail are used to keep track of elements. The maximum size of the<br />
structure is 8 in the example. When the size is 0, the head and the tail point to NIL. Then, an<br />
object, O1, is requested to be inserted. A pointer to the object is placed in the first cell of the<br />
structure. Both the head and the tail point to this cell (accordingly O1) after the insertion. Then<br />
another object, O2, is to be inserted. Since the call is push back, it is placed at the end, i.e.,<br />
the next cell after the last item (O1). The tail pointer is advanced. Now, the head points to O1<br />
and the tail points to O2. The current size becomes 2.<br />
Figure 3.11(b) illustrates deletion from the beginning, by using pop front. The structure<br />
is full, i.e., the size is 8. The head points to the first element(O1) and the tail points the last<br />
element (O8). After deletion, the tail remains the same, but the head is advanced to the next<br />
object (O2) from O1. If further deletion is requested, the head is moved one more cell (to O3).<br />
After two deletions, the size becomes 6.<br />
It is important to note that this works because of the fixed maximum size. This corresponds<br />
to the maximum number of vehicles on the link.<br />
[Figure 3.11, diagram: (a) insertion at the end via push_back — the head and tail pointers on a circular structure of maximum size 8; after inserting O1 both point to O1, after inserting O2 the tail advances to O2. (b) deletion from the beginning via pop_front — starting from a full structure (O1..O8), the head advances from O1 to O2 to O3 while the tail stays at O8.]
Figure 3.11: Operations on the Ring structure. (a) Insertion at the end. (b) Deletion from the<br />
beginning.<br />
[Figure 3.12, two panels: (a) “RTR, Diff. Data Str. for Link Queues” — Real Time Ratio vs. Number of CPUs (1-128); (b) “Speedup, Diff. Data Str. for Link Queues” — Speedup vs. Number of CPUs (1-128); curves: Single/Double × Ring/Deque/List.]
Figure 3.12: RTR and Speedup for using different data structures for the spatial queues and<br />
the buffers. An STL-multimap is used for parking and waiting queues. An STL-vector is<br />
used for graph data.<br />
Results<br />
Figure 3.12 shows the comparison results for using the STL-list, the STL-deque and the Ring class as the main data structure of the link queues. In these tests, the vector container of the STL represents the graph data (Section 3.3.2), and the parking and waiting queues are handled via the STL-multimap (Section 3.3.3).
The STL-list gives the worst performance. This is because of the memory management of the STL-list design. A well-known example is that to store an integer value (4 bytes), the STL-list needs an additional 12 bytes of per-element overhead on top of the data itself. An STL-vector, on the other hand, needs only 4 bytes to store an integer. The Ring class yields the best performance, and the STL-deque performance lies between the STL-list and the Ring class.
Changing the data structure from the STL-list to the STL-deque speeds up the execution by about 67% for large numbers of traffic flow simulators in the system, while the difference is around 13% for small numbers of CPUs. The transition from the STL-list to the Ring class results in 71% and 29% relatively better performance for small and large numbers of CPUs, respectively.
Ring, Deque and List for Link Cells: Recommendations
Since users can implement their own containers, similar to the STL's containers, to overcome inefficiencies that appear at the application level, a circular vector called Ring is recommended when the maximum number of elements in a container is known. When there is no upper limit on the number of elements, the STL-deque is suggested if most of the insertions and deletions take place at either end. When insertions and deletions also occur away from the ends and random access is not required, the STL-list can be chosen.
3.4 Reading Input Files for Traffic Simulators<br />
In applications with different cooperating modules, the format used to represent shared data<br />
becomes more significant. There is no single good solution for this problem since the possible<br />
solutions give either good performance or flexibility but not usually both at the same time. How<br />
the input data is kept in the simulation exhibits the same problem.<br />
This section and the next section investigate the input files (the street network and plans)<br />
and the output file (events) of a traffic simulator, respectively. In this section, representing data in the XML [97] format is compared with the structured text file format in terms of I/O performance, along with the programming issues related to these formats.
The programming approaches tested for plans (Section 3.4.3) are reading XML plans using expat and reading raw plans from a structured text file using the C++ input operator >> and the C function fscanf. Reading the street network information from an XML file using the expat parser is compared to reading the same information from a structured text file by using the C function sscanf in Section 3.4.4.
3.4.1 The Extensible Markup Language, XML<br />
XML [97] is the abbreviation for Extensible Markup Language. It is a markup language, which<br />
has the virtues of HTML (Hypertext Markup Language). HTML [11] is widely used especially<br />
for putting data on the World Wide Web such that anyone can access the data without regard<br />
to the location or the time. HTML is known for its simplicity and portability. HTML focuses<br />
on the appearance of documents, not their contents. Hence, it is limited in its features. This limitation has helped XML, which is oriented towards content, become very popular as a markup language.
XML is simple, portable, easily maintainable and adaptable. One can design his/her own<br />
customized markup languages by using XML since data is stored in a self-explanatory manner.<br />
For example, the following shows a valid XML tag with 6 valid attribute-value pairs. Each<br />
attribute-value pair is in attribute=”value” format.<br />
    <link id="2" from="2" to="1" length="657" capacity="12000" freespeed="11.1" />
Some of the benefits of using XML files are:
- XML allows users to create their own sets of tags.
- The sequence of attributes is not important: when reading data in, the search is done for the attribute names to obtain the corresponding values. Therefore, new attributes can be added in any sequence, and rearranging does not cause changes in the reading code.
- Complex input like trees and hierarchies can be implemented.
- XML allows users without prior knowledge to understand the language, as it is self-describing.
- XML promotes flexible context-dependent data, e.g. the description of a bus trip within a leg can be completely different from the description of a car trip within a leg.
3.4.2 Structured Text<br />
The structured text file format is application-dependent. Therefore, it is user-defined in many<br />
cases. The structured format used here is the column-based format. The example below is the<br />
corresponding column-based text line of the XML example given above. However, without looking at the XML tag above, it is impossible to understand what these numbers mean, because the numbers in a column-based text file are unlikely to be self-explanatory. One might put a title line at the top of the file to explain what each column corresponds to. This helps when the number of columns is small, as in the example below. If each line is composed of, for example, 30 columns, then it becomes difficult to follow the columns of the lines.

    2 2 1 657 12000 11.1
When reading a structured text file, which is composed of a set of numbers, the numbers<br />
need to be read in the same sequence as they are written in the file. Rearranging or inserting<br />
new columns between the columns that are already there requires changes in file-reading code<br />
to keep the consistency with the correct sequence. Despite this drawback, the structured text<br />
files are usually better than the XML format files performance-wise.<br />
3.4.3 XML vs. Structured Text Files: Plans Reading<br />
The plans (Section 2.2.4) contain all the information about agents, including their routes. A<br />
scenario with approximately 1 million agents is kept in a structured text file of 34 MBytes and in an XML file of 330 MBytes. The XML file is 10 times bigger because of the self-explanatory attributes of XML.
When reading an XML file, the attributes are parsed. An XML parser called expat [21], which is written in C, is employed. What a parser does is to provide users with the opening element tags, the closing element tags, and the text data in between. Afterwards, the users should implement code to handle the values passed in.
In the structured text plans file case, a fixed number of integers is read and according to some<br />
of these numbers another chunk of integers is read to complete a single agent’s information.<br />
A rough example of this type, reflecting the example in the XML format shown in Figure 2.9,<br />
is illustrated below. All the numbers regarding each plan can be written in a single line. The<br />
example separates them into several lines for readability purposes.<br />
    6357250 0 24875
    14584 14606 1800
    0 8
    4902 4903 4904 4905 4906 4907 4908 4909
XML File, expat    Structured Text File, operator>>    Structured Text File, fscanf
159s               31s                                 36s

Table 3.1: Performance results for reading different types of plans files and approaches.
In the example, 6357250 is the vehicle ID, 0 in the first line shows the leg number, and<br />
24875 is the start time of the plan in seconds (06:54:35). The start accessory ID and the end<br />
accessory ID are 14584 and 14606, respectively. An accessory can be an activity location, a<br />
parking or a transit stop. Duration of the leg is 1800 seconds (30 minutes). The 0 in the third<br />
line shows the mode of transport (car). 8 is the number of the intermediate nodes between the<br />
start and the end activity locations, and these nodes are lined up in the last line.<br />
The performance is implementation-dependent. One implementation keeps each vehicle's data as integers in an STL-vector while reading. When the vehicle is created, the program accesses the values from the STL-vector. The code elements are given in Figure 3.13.
Yet another version uses a C library function called fscanf to read chunks of data. The<br />
data is directly stored into integer arrays similar to the STL-vector used above. Once the<br />
vehicle is created, its variables are set using these integer arrays. A rough example code is<br />
given in Figure 3.14.<br />
Results<br />
Table 3.1 shows the results for different reading approaches and for different types of the plans<br />
file. The scenario used is the one explained in Section 2.5, i.e., around 1 million agents are<br />
read. The numbers show the time for reading and for constructing agents. Once the agents are<br />
created, they are inserted into one of the supplementary structures of the links, such as waiting<br />
queues and parking queues. These queues are of the STL-multimap type.<br />
Reading the same data from the structured text plans file gives better results by 80% relative<br />
to the XML file version. Despite its lower performance values, XML is a promising technology<br />
because of its benefits as given in Section 3.4.1.<br />
An important remark to be made is that the lower performance values of XML come mostly<br />
from the implementation inefficiencies, not from format itself: While expat parses the input<br />
plans file, an object-oriented wrapper around expat inserts each person’s data (plans and the<br />
other attributes) into an STL-deque. If the traffic flow simulator needs to read the next person,<br />
the wrapper calls pop front() to get and to remove the first element from the STL-deque.<br />
A problem resides here: The STL-deque is used in a way that it keeps the objects as opposed<br />
to keeping pointers to the objects. When a pop front() call is made on such a container,<br />
before deleting the element, the wrapper copies the element into a temporary variable. Then,<br />
the element is deleted from the STL-deque and the temporary variable is returned (copied)<br />
to the traffic flow simulator. If one used pointers to objects instead of the objects themselves, not only would memory allocation be done once and in an efficient way, but also only the pointers would be copied between the different components instead of the objects themselves, which would result in less overhead.
3.4.4 XML vs Structured Text Files: Graph Data Reading<br />
The graph data can also be kept in two different types of files. Nodes and links are defined by<br />
either XML attributes or column-based numbers. The XML graph data file reading is the same<br />
    class Plan {
        // data structures
        // vector for the elements of the fixed length part
        vector<int> fixedPart;
        // vector for the elements of the variable length part
        vector<int> variablePart;

        // define set and get methods to access both vectors

        void readNextPlan(ifstream& plansfile) {
            // read fixed length part:
            for (each item i in fixed length part of plan) {
                // read an integer and put it into the vector
                plansfile >> fixedPart[i];
            }
            // read variable length part (routes):
            // number of items in variable length part
            // is stored in fixed part
            for (each item j in variable length part of plan) {
                // read an integer and put it into the vector
                plansfile >> variablePart[j];
            }
        }
    };
    ....
    main() {
        // define the plans input file
        ifstream plansfile;
        // create a plan object
        Plan myPlan;
        while (not EOF) {
            myPlan.readNextPlan(plansfile);
            create a new vehicle;
            use get methods of myPlan to set data of vehicle;
        }
    }

Figure 3.13: Reading plans from a structured text file, by using an STL-vector
as the XML plans reading: a parser parses the file and the user code saves the values. The size<br />
of the XML graph data file for the network defined in Section 2.5 is around 4 MBytes.<br />
If the same graph data is kept in a column-based text file, the file size is 2 MBytes. It is read line by line and each column is extracted from the line. This version uses the C library function sscanf to pick out the values after reading each line.
    main() {
        // create an integer array
        // for the elements of the fixed length part
        int fixedPart[MAXSIZE];
        // create an integer array
        // for the elements of the variable length part
        int variablePart[MAXSIZE];
        while (not EOF) {
            // read fixed length part:
            for (each item i in fixed length part of plan) {
                // read an integer item from file into the array
                fscanf(file, "%d", &fixedPart[i]);
            }
            // read variable length part:
            for (each item j in variable length part of plan) {
                // read an integer item from file into the array
                fscanf(file, "%d", &variablePart[j]);
            }
            create a new vehicle;
            set the vehicle variables using values stored in arrays;
        }
    }

Figure 3.14: Reading plans from a structured text file, by using fscanf
XML File, expat    Structured Text File, sscanf
1.14s              0.66s

Table 3.2: Performance results for reading the graph data.
Table 3.2 shows that reading graph data of the scenario described in Section 2.5 from a<br />
column-based text file is 1.7 times (relatively 42%) faster than that from an XML file. The<br />
elements of the graph data are stored in an STL-vector.<br />
XML vs Structured Text Files: Recommendations
The choice between structured text files and XML files is a trade-off between flexibility, extensibility and elegance on the one hand and good performance on the other. If the computational issues can be ignored for an application, using XML files is recommended.
3.5 Writing Events Files<br />
Events generated by traffic flow simulators are fed back to different modules in the framework<br />
as explained in detail in Section 5.2.1. Among the different modules are the router, agent<br />
database, activity generator, etc. When modules in a system are coupled via files, writing files<br />
Explanation         Writing Time (Raw)
Local Disk, C++     61s
via NFS, C++        81s
Local Disk, C       57s
via NFS, C          66s

Table 3.3: Performance results for writing the events file.
(plans and events) might also be interesting to investigate. In the framework, plans are written<br />
by the agent database based on the routes generated by the router before the simulation starts.<br />
The performance issues for plans writing are explained in Section 5.2.3.<br />
Events, on the other hand, are written by the traffic flow simulators during each simulation run. In this section, writing raw events using the C++ output operator << and the C function fprintf is tested on disks that are both local and remote to the machine on which the traffic simulator runs. The results 2 are shown in Table 3.3.
By default, during the tests reported throughout this thesis, the files and the runtime executables of MATSIM [50] are all on the local disks of the computing nodes. Therefore, no I/O operations over the network are performed unless stated otherwise. In the table, the label “Local Disk” denotes no network contribution.
The Network File System (NFS) [72] allows machines to mount a disk partition on a remote<br />
machine as if it was on a local hard disk. NFS comes with a cost because it accesses the remote<br />
files using the network. The cost can be seen in the table: the NFS contribution in these numbers amounts to a performance degradation by a factor of roughly 1.2-1.3.
The file writing is accomplished both with C and C++ I/O functions, namely, with the<br />
fprintf function and the operator. The results show that there is a little performance<br />
difference between C and C++ I/O functions when the files are on local disks. When the files<br />
are written via NFS, the difference becomes more apparent.<br />
Another remark is that using endl in C++ output streams makes writing to a file<br />
much slower than using \n. Besides adding a newline character as \n does, endl also flushes<br />
the output buffer. Therefore a write() system call is issued for each line of output,<br />
which is expensive. The C++ writing results in the table are obtained using \n.<br />
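The effect can be illustrated with a small sketch (illustrative code, not from MATSIM): both variants produce byte-identical output, and only the flushing behavior differs.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Write one simulated event per line. When flush_each_line is true, the
// stream is flushed after every line, which is exactly what std::endl adds
// on top of '\n'; otherwise the stream buffer decides when the underlying
// write() call is issued.
void write_events(std::ostream& os, int n_events, bool flush_each_line) {
    for (int i = 0; i < n_events; ++i) {
        os << i << "\tLINK_LEAVE\n";
        if (flush_each_line)
            os.flush();              // std::endl == '\n' followed by flush()
    }
}
```

With a file stream, the flushing variant triggers one write() system call per line, while the buffered variant batches many lines per call.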
Reading and Writing Big Files: Recommendations<br />
If a big file is read as strings, C functions such as strtod/strtol and<br />
atoi/atof should be preferred for converting the strings to the appropriate data types.<br />
The C++ operator >> can be used to read the data directly into the correct types without<br />
any conversion, but this method comes with lower performance. Similarly, the C++<br />
output operator << is a bit slower than the C function fprintf. However, when<br />
performance is not a concern, the C++ operators >> and <<<br />
should be chosen since their usage is very straightforward.<br />
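As a sketch of the recommended C-style conversion (the helper name parse_fields is illustrative and assumes whitespace-separated numeric fields):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Scan one whitespace-separated line of numbers with strtod. The function
// advances the end pointer past each parsed value, so the whole line can be
// converted in one pass without creating intermediate string objects, which
// is what makes this faster than reading via the C++ >> operator.
std::vector<double> parse_fields(const std::string& line) {
    std::vector<double> values;
    const char* p = line.c_str();
    char* end = nullptr;
    for (double v = std::strtod(p, &end); p != end; v = std::strtod(p, &end)) {
        values.push_back(v);
        p = end;   // continue scanning after the last parsed number
    }
    return values;
}
```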
2 The theoretical performance prediction for writing events is investigated in Section 6.8.4.<br />
3.6 Conclusions and Discussion<br />
Computationally fast programs can be achieved not only by introducing parallel programming<br />
techniques into an implementation, but also by accelerating its sequential parts.<br />
When the entities of a data structure are accessed frequently, the choice of data structure has a<br />
prominent effect on performance. The sequential implementation of the traffic flow simulation<br />
is improved as follows:<br />
Storage for graph data is modified in such a way that the STL-map is replaced by an STL-vector<br />
and the searching algorithm is changed to binary search. Performance<br />
improves by 15-18% compared to the STL-map version.<br />
Storage for the parking and waiting queues, in which vehicles wait at the beginning of<br />
the links and which are frequently accessed for removals, is advanced from the STL-multimap<br />
to a user-implemented singly linked list structure, resulting in a 9-18% performance increase.<br />
Representation of link cells as an STL-list degrades performance by 13-67% compared<br />
to using the STL-deque. An even better speed-up (40-71% relative to the STL-list)<br />
is achieved with a user-defined data structure called Ring, which is nothing<br />
but a circular STL-vector composed of pointers to vehicles.<br />
Operations on files depend on the format of the data stored in them. XML-type files<br />
are flexible in terms of management and are elegant, but they usually give worse performance.<br />
The inherent simplicity of structured text files offers better performance, but a lack<br />
of flexibility limits their applicability.<br />
The input stream operator >> of C++ promotes easy usage by letting users not worry<br />
about the types of the input to be read, but suffers from low performance.<br />
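The Ring mentioned in the third item can be sketched as follows (a minimal illustration of the idea, not the thesis code; class and member names are made up):

```cpp
#include <cstddef>
#include <vector>

struct Vehicle { int id; };

// A fixed-size circular buffer over an STL-vector of vehicle pointers.
// Insertion and removal only move two indices, so no elements are shifted
// and no memory is allocated during the simulation, unlike with
// STL-list or STL-deque.
class Ring {
public:
    explicit Ring(std::size_t capacity) : slots_(capacity + 1, nullptr) {}
    bool empty() const { return head_ == tail_; }
    bool full()  const { return (tail_ + 1) % slots_.size() == head_; }
    bool push_back(Vehicle* v) {             // vehicle enters the link cell
        if (full()) return false;
        slots_[tail_] = v;
        tail_ = (tail_ + 1) % slots_.size();
        return true;
    }
    Vehicle* pop_front() {                   // first vehicle leaves the cell
        if (empty()) return nullptr;
        Vehicle* v = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        return v;
    }
private:
    std::vector<Vehicle*> slots_;
    std::size_t head_ = 0, tail_ = 0;
};
```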
Therefore, the following conclusions are drawn for the best design of MATSIM:<br />
The graph data, i.e. the links and nodes, being the most frequently accessed data, is<br />
best represented by the STL-vector. The binary search algorithm of the STL should be<br />
used for finding an element in the vector structure.<br />
The parking and waiting queues of the links are the data structures which hold the vehicles.<br />
The vehicles in these queues are not moving in the simulation yet, either because of full<br />
links or because of travelers performing activities at the location. Removing all the eligible<br />
vehicles at the beginning of these queues prior to inserting them into other queues,<br />
and inserting new vehicles into these queues, are best performed on a data structure defined<br />
as a singly linked list, so that each vehicle points to another vehicle.<br />
The spatial queues and buffers of the links are best implemented by using a fixed size<br />
vector, elements of which are pointers to vehicles. The movement operations such as<br />
insertion and deletion should be based on pointers to vehicles.<br />
For the input and output data, XML files should be preferred to structured text files since<br />
XML allows constructing user-defined complex structures in a more convenient way.<br />
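The singly linked list recommended above for the parking and waiting queues can be sketched as an intrusive list, where each vehicle carries the pointer to its successor (illustrative names and fields, not the MATSIM implementation):

```cpp
// Each vehicle stores the pointer to the next vehicle in its queue, so
// removing all eligible vehicles from the front is a pointer walk with no
// allocation or element shifting. The queue is assumed to be ordered by
// departure time (an assumption made for this sketch).
struct Vehicle {
    int departure_time;
    Vehicle* next = nullptr;
};

struct ParkingQueue {
    Vehicle* head = nullptr;
    Vehicle* tail = nullptr;

    void push_back(Vehicle* v) {             // insert a new vehicle at the end
        v->next = nullptr;
        if (tail) tail->next = v; else head = v;
        tail = v;
    }
    // Detach and return the front chain of vehicles whose departure time has
    // been reached; they would then be inserted into the link's wait queue.
    Vehicle* pop_eligible(int now) {
        Vehicle* first = head;
        Vehicle* last = nullptr;
        while (head && head->departure_time <= now) {
            last = head;
            head = head->next;
        }
        if (!head) tail = nullptr;
        if (last) { last->next = nullptr; return first; }
        return nullptr;
    }
};
```

Detaching the whole eligible front chain is a single pointer walk; the detached chain can then be appended elsewhere without per-vehicle allocation.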
Run Time Graph Data Parking-Waiting Queues Link Queues<br />
615s STL-vector STL-multimap STL-list<br />
533s STL-vector STL-multimap STL-deque<br />
438s STL-vector STL-multimap Ring<br />
539s STL-map STL-multimap Ring<br />
356s STL-vector Linked List Ring<br />
Table 3.4: Summary table of the serial performance results for different data structures of the<br />
traffic flow simulator.<br />
3.7 Summary<br />
The concepts of fast computation and easy-to-use programming methods can be coupled. C has<br />
been around since the 1970s and allows programming at both higher and lower levels. However,<br />
as more complex applications come into prominence, new languages are needed since:<br />
these applications are complicated enough content-wise; new programming techniques<br />
that ease the burden on programmers of complex applications are preferred; and<br />
these applications usually exhibit a hierarchy of entities.<br />
Object-oriented languages such as C++ [70, 80] and Java [42] were created to fill the deficiencies<br />
of C-type languages and are used to handle the complexity of applications.<br />
The first implementation of the traffic flow simulator was written in C. Despite being computationally<br />
fast, it obstructed adding new features easily. Writing in C++ with the improvements<br />
explained in the previous sections provides a simulation which is not only computationally<br />
fast, but also has hierarchical entities, which are re-usable and, as a matter of fact, easy to<br />
use as well as easy to modify. Naturally, some arguments presented here might be specific to<br />
the implementation at hand.<br />
Table 3.4 summarizes different implementations for the containers in the traffic flow simulator.<br />
The run times shown in the table are measured with the number of<br />
computing nodes (CPUs) set to 1. The results exclude any file input and output operations<br />
as explained in Section 3.2. The input file reading results are already shown in Table 3.1<br />
and in Table 3.2 for plans and the street network, respectively. The output file writing results<br />
for events are given in Table 3.3.<br />
Chapter 4<br />
Parallel Queue Model<br />
4.1 Introduction<br />
Serial computation has been around for years. In this traditional computing approach,<br />
problems run on a single computer/computing node,<br />
the instructions of a program are executed one after the other by the CPU, and<br />
only one instruction may be executed at a time.<br />
Data transmission through hardware, which is limited by the speed of light [83], determines<br />
the speed of a serial computer. In addition to physical constraints, there are also economic limitations,<br />
since it is increasingly expensive to make a single processor faster. These limitations<br />
saturate the performance of serial computers, making it ever harder to build faster ones.<br />
Ultimately, this work is concerned with the agent-based simulation of large scale transportation scenarios.<br />
A typical scenario would be the 24-hour (about 10^5 seconds) simulation of a metropolitan area<br />
consisting of 10 million travelers. Typical computational speeds of current traffic flow simulations<br />
with 1-second update steps are 100 000 vehicles in real time [58, 56, 68]. This results in a<br />
computation time of 10^7 / 10^5 × 10^5 seconds = 10^7 seconds ≈ 115 days. This number is just a rough<br />
estimate and subject to the following changes: increases in CPU speed will reduce the number;<br />
more realistic driving logic will increase the number; smaller time steps [64, 84] will increase<br />
the number.<br />
This means that such a traffic flow simulation running on a single computing node is too<br />
slow for practical or academic treatment of large scale problems. In addition, computer time is<br />
needed for activity generation, route generation, learning, etc. In consequence, it makes sense<br />
to explore parallel/distributed computing as an option. Parallel/distributed computing has the<br />
advantages of using non-local resources, a competitive cost/performance ratio, and overcoming<br />
the finite memory constraint that single computers are subject to. In parallel computing, computational<br />
problems are solved by using several computing resources, which may consist of a<br />
single computer with multiple processors, or a number of computers connected through a network<br />
(called a PC cluster), or a combination of both. In order to solve a computational<br />
problem through parallel computing, one must think about (i) how to partition the tasks into<br />
subtasks, and (ii) how to provide the data exchange between the subtasks. Before explaining<br />
these issues, parallel architectures will be discussed in the following paragraphs.<br />
The categorization of parallel computers has been done in many different ways, among<br />
which Flynn’s Classical Taxonomy [83] is the one most commonly used. This classification<br />
depends on the dimensions (single or multiple) of instructions and data. Each combination<br />
gives a different category:<br />
Single Instruction Single Data (SISD): The same instruction stream operates on a single data stream,<br />
which results in deterministic execution. Most PCs, single-CPU workstations and mainframes<br />
fall into this category.<br />
Single Instruction Multiple Data (SIMD): The same instruction stream operates on different data<br />
on different computing nodes. Examples are the CM-2, IBM 9000 and Cray C90.<br />
Multiple Instruction Single Data (MISD): Different instruction streams run on the same<br />
data. This is the least commonly used category.<br />
Multiple Instruction Multiple Data (MIMD): The most popular type of parallel computers.<br />
Each processor runs a different set of instructions on different data. Execution can be<br />
synchronous or asynchronous, deterministic or non-deterministic. Most supercomputers<br />
and PC clusters are of this type.<br />
4.1.1 Message Exchange<br />
This work concentrates on clusters of coupled PCs, i.e., Linux [39] boxes connected through<br />
a 100 Mbit Ethernet [77] Local Area Network (LAN). Using this type of cluster, which is cost-effective,<br />
one can achieve a performance close to that of a vector computer [57]. This is, in part, due<br />
to the fact that multi-agent simulations do not vectorize well, so that vector computers offer no<br />
particular advantages. Hence, PC clusters are expected to be the dominant high performance<br />
computing technology in the area of multi-agent traffic flow simulations for many years to<br />
come.<br />
With respect to data exchange between subtasks, there are, in general, two main approaches<br />
to inter-processor communication. One of them is called message passing between processors;<br />
the alternative is called shared-address space, where variables are kept in a common pool<br />
globally available to all processors. Each paradigm has its own advantages and disadvantages.<br />
In the shared-address space approach, all variables are globally accessible by all processors.<br />
Despite multiple processors operating independently, they share the same memory resources.<br />
The shared-address space approach makes it simpler for the user to achieve parallelism, but<br />
since the memory bandwidth is limited, severe bottlenecks are inevitable with an increasing<br />
number of processors; alternatively, such shared memory parallel computers become very<br />
expensive. For those reasons, message passing is the focus here.<br />
In the message passing approach, there are independent cooperating processors. Each processor<br />
has a private local memory in order to keep the variables and data, and thus can access<br />
local data very rapidly. If an exchange of information is needed between the processors,<br />
the processors communicate and synchronize by passing messages, which are simple send and<br />
receive instructions. Message passing can be imagined to be similar to sending a letter. The<br />
following phases happen during a message passing operation.<br />
1. The message needs to be packed, i.e. the computer is told which data needs to be sent.<br />
2. The message is sent.<br />
3. The message may then take some time on the network until it finally arrives in the receiver’s<br />
inbox.<br />
4. The receiver has to officially receive the message, i.e. to take it out of the inbox.<br />
5. The receiver must unpack the message and tell the computer where to store the received<br />
data.<br />
There are time delays associated with each of these phases. It is important to note that some<br />
of these time delays are incurred even for an empty message (“latency”), whereas others depend<br />
on the size of the message (“bandwidth restriction”). Effects of time delays are explained in<br />
Section 4.4.<br />
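The five phases can be made concrete with an in-memory sketch; a real implementation would use a message passing library such as MPI (Section 4.3.3), so the byte-vector "network" and the function names below are purely illustrative.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Phases 1 and 5 of the message passing operation: packing a list of
// vehicle ids into a flat byte buffer and unpacking it again. The byte
// vector stands in for the message travelling over the network.
using Message = std::vector<std::uint8_t>;

Message pack(const std::vector<std::int32_t>& vehicle_ids) {   // phase 1
    Message msg(vehicle_ids.size() * sizeof(std::int32_t));
    std::memcpy(msg.data(), vehicle_ids.data(), msg.size());
    return msg;
}

std::vector<std::int32_t> unpack(const Message& msg) {         // phase 5
    std::vector<std::int32_t> ids(msg.size() / sizeof(std::int32_t));
    std::memcpy(ids.data(), msg.data(), msg.size());
    return ids;
}
```

Phases 2-4 (send, transit, receive) then amount to handing the packed buffer to the communication library; latency is paid per message, the bandwidth cost per byte.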
4.1.2 Domain Decomposition<br />
On PC clusters, two general strategies for parallelization are possible:<br />
Task parallelization – The different modules of a transportation simulation package<br />
(traffic flow simulation, routing, activities generation, learning, pre-/postprocessing) are<br />
run on different computers. This approach is for example used by DynaMIT [17] or<br />
DYNASMART [18].<br />
The advantage of this approach is that it is conceptually straightforward, and fairly insensitive<br />
to network bottlenecks. The disadvantage of this approach is that the slowest<br />
module will dominate the computing speed – for example, if the traffic flow simulation<br />
among different modules is using up most of the computing time, then task parallelization<br />
of the modules will not help.<br />
Domain decomposition – In this approach, each module is distributed across several<br />
CPUs. In fact, for most of the modules, this is straightforward since in current practical<br />
implementations activity generation, route generation, and learning are done for each<br />
traveler separately. Only traffic flow simulation has tight interaction between the travelers<br />
as explained in the following.<br />
For PC clusters, the most costly communication operation is the initiation of a message<br />
(“latency”). In consequence, the number of CPUs that need to communicate with each other<br />
should be minimized. This is achieved through a domain decomposition (see Figure 4.2) of the<br />
traffic network graph. As long as the domains remain compact, each CPU will, on average, have<br />
at most six neighbors (Euler’s theorem for planar graphs). Since network graphs are irregular<br />
structures, a method to deal with this irregularity is needed. METIS [91] is a software package<br />
that specifically deals with decomposing graphs for parallel computation and is explained in<br />
more detail in Section 4.3.1.<br />
The quality of the graph decomposition has consequences for parallel efficiency (load balancing):<br />
if one CPU has a lot more work to do than all other CPUs, then all other CPUs are<br />
obliged to wait for it, which is inefficient. For the current work, with on the order of 10^2 CPUs<br />
and networks with on the order of 10^4 links, the “latency problem” (explained in Section 4.4) always dominates load<br />
balancing issues; however, it is generally useful to employ the actual computational load per<br />
network entity for the graph decomposition [57].<br />
For shared memory machines, other forms of parallelization are possible, based on individual<br />
network links or individual travelers. A dispatcher could distribute links for computation<br />
in a round-robin fashion to the CPUs of the shared memory machine [31]; technically, threads<br />
[72] would be used for this. This is called fine-grained parallelism, as opposed to the coarse-grained<br />
parallelism, which is more appropriate for message passing architectures. As stated<br />
above, the main drawback of this method is that one needs an expensive machine if one wants<br />
to use large numbers of CPUs.<br />
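The round-robin dispatch of links to threads can be sketched as follows (illustrative only; a real simulator would also have to synchronize the intersection logic across threads):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Fine-grained, shared-memory parallelism: links are dealt out round-robin
// to the threads of one machine, and each thread updates its share. The
// strided index sets are disjoint, so no locking is needed for this part.
void update_links_round_robin(std::vector<int>& link_loads, unsigned n_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&link_loads, t, n_threads]() {
            // thread t owns links t, t + n_threads, t + 2*n_threads, ...
            for (std::size_t i = t; i < link_loads.size(); i += n_threads)
                link_loads[i] += 1;          // stand-in for the link update
        });
    }
    for (auto& w : workers) w.join();
}
```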
4.2 Parallel Computing in Transportation Simulations<br />
Parallel computing has been employed in several transportation simulation projects. One of the<br />
first was PARAMICS [8], which started as a high performance computing project on a Connection<br />
Machine CM-2. In order to fit the specific architecture of that machine, cars/travelers were<br />
not truly objects but particles with a limited amount of internal state information. PARAMICS<br />
was later ported to a CM-5, where it was simultaneously made more object-oriented. In [8], a<br />
computational speed of 120 000 vehicles with an RTR (real time ratio, see Section 4.4) of 3 is<br />
reported, on 32 CPUs of a Cray T3E.<br />
At about the same time, it was shown that on coupled workstation architectures it is possible<br />
to efficiently implement vehicles in object-like fashion, and a parallel computing prototype<br />
with “intelligent” vehicles was developed [56]. This later resulted in the research code PAM-<br />
INA [68], which was the technical basis for the parallel version of TRANSIMS [57]. In the<br />
tests (using Ethernet [77] only, on a network with 20 000 links, about 100 000 vehicles simultaneously<br />
in the simulation), TRANSIMS [57] ran about 10 times faster than real time with<br />
the default parameters, and about 65 times faster than real time after tuning. These numbers<br />
refer to 32 CPUs; adding more CPUs did not yield further improvement. The parallel concepts<br />
behind TRANSIMS are the same as behind the queue model described in Chapter 2, and in consequence<br />
TRANSIMS is up against the same latency problem as the queue model. However,<br />
for unknown reasons the computational speed is lower than predicted by latency alone.<br />
Some other early implementations of parallel traffic flow simulations are discussed in [10,<br />
60]. A parallel implementation of AIMSUN [23] reports a speed-up of 3.5 on 8 CPUs using<br />
threads (which rely on the shared memory technology explained above) [4].<br />
DYNEMO [71] is a macro-particle model, similar to DYNASMART [18] described below.<br />
A parallel version was implemented about five years ago [61]. A speed-up of 15 on 19 CPUs<br />
connected through 100 Mbit Ethernet was reported on a traffic network of Berlin and Brandenburg<br />
with 13738 links. Larger numbers of CPUs were reported to be inefficient.<br />
DynaMIT [17] uses functional decomposition (task parallelization) as a parallelization concept<br />
[17]. This means that different modules, such as the router, the traffic flow (supply) simulation,<br />
the demand estimation, etc., can be run in parallel, but the traffic flow (supply) simulation<br />
runs on a single CPU only. Functional decomposition is outside the scope of this thesis.<br />
DYNASMART [18] also reports the intention of implementing functional decomposition.<br />
Note that in terms of raw simulation speed, the performance values presented in this work<br />
are more than an order of magnitude faster than anything listed above. In addition, this is not<br />
achieved by a smaller scenario size, but by diligent model selection, efficient implementation,<br />
and hardware improvements based on knowledge of where the computational bottlenecks are.<br />
That is, this approach makes it possible to run very large scale scenarios as everyday research<br />
topics, rather than to have them only as the result of computationally intensive studies.<br />
4.3 Implementation<br />
As discussed in the previous sections, the parallel target architecture for this traffic flow simulation<br />
is a PC cluster. The suitable approach for this architecture is domain decomposition, i.e.<br />
to decompose the street network graph into several pieces, and to give each piece to a different<br />
CPU. Information exchange between CPUs is achieved via messages.<br />
When parallelizing a transportation simulation, one needs to decide where to split the<br />
underlying street network, and how to achieve the message exchange. Both questions can only<br />
be answered with respect to a particular traffic model, but lessons learned here can be used for<br />
Figure 4.1: Handling the boundaries and split links<br />
other models.<br />
4.3.1 Handling Domain Decomposition<br />
In general, one wants to split as far away from the intersection as possible. This implies that one<br />
should split links in the middle, for example, as TRANSIMS [57] does. However, for the queue<br />
model, “middle of the link” does not make sense since there is no real representation of space. In<br />
consequence, one can split either at the downstream end or at the upstream end of the link. The<br />
downstream end is undesirable because vehicles driving towards an intersection are influenced<br />
by the intersection to a greater degree than vehicles driving away from it. For that<br />
reason, in the queue simulation the links are split right after the intersection (Figure 4.1).<br />
A good partitioning algorithm must decompose a domain in such a way that each subpart<br />
gets a fair share of the load. This issue is also known as load balancing. Load balancing<br />
ensures that no single CPU is overloaded or idle. In the application presented throughout this<br />
thesis, a software package called METIS [91] is employed for domain decomposition. It has<br />
been chosen since it gives good results with large irregular graphs, such as the underlying<br />
street network described in Section 2.5.<br />
METIS differs from traditional graph partitioning algorithms because of the multilevel partitioning<br />
algorithms it uses. Traditional graph algorithms do the partitioning directly on the<br />
original graph; they are usually slow and do not produce partitions of good quality.<br />
METIS uses multilevel recursive bisection or multilevel k-way partitioning for higher quality<br />
results. Multilevel recursive bisection performs a sequence of bisections on the graph. It<br />
does not necessarily result in the best quality partitioning; however, it is widely used because of<br />
its simplicity.<br />
The multilevel k-way partitioning of METIS is utilized throughout this thesis. It is a 3-phase<br />
partitioning technique:<br />
The original graph is coarsened down to fewer nodes by collapsing nodes and links. This<br />
Figure 4.2: Decomposed street network of Switzerland, extracted from the map of the whole of Europe<br />
and partitioned with the METIS software package. The number of partitions is 8 in this example; 7<br />
of them are colored separately; the 8th partition, also colored, contains the rest of Europe<br />
and is hence cut off here.<br />
makes it easier to find the best partition boundary of the graph.<br />
Then, k-way partitioning is achieved on the smaller large-grained graph.<br />
Finally, the decomposed graph is uncoarsened to find a k-way partitioning of the original<br />
graph.<br />
Both multilevel recursive bisection and multilevel k-way partitioning also aim to reduce the<br />
edge-cut, which is the number of split links whose end nodes belong to different partitions. A result<br />
of METIS partitioning Switzerland’s street network can be seen in Figure 4.2.<br />
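The edge-cut objective can be checked directly on a partitioned graph: given the link list and a partition number per node, the split links are exactly those links whose end nodes carry different partition numbers (illustrative helper, not part of METIS):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Count the edge-cut of a partition: the number of links whose two end
// nodes fall into different partitions. These are the links that become
// split links and therefore require inter-CPU messages.
std::size_t edge_cut(const std::vector<std::pair<int, int>>& links,
                     const std::vector<int>& partition_of_node) {
    std::size_t cut = 0;
    for (const auto& link : links)
        if (partition_of_node[link.first] != partition_of_node[link.second])
            ++cut;
    return cut;
}
```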
4.3.2 Handling Message Exchanging<br />
Once the domain decomposition method breaks a problem into several subproblems, single<br />
or multiple programs are executed on the subproblems on different computing nodes at the same<br />
time. It is common that the subproblems are not fully independent, i.e., exchanging information<br />
at the boundaries of the different subproblems is necessary.<br />
With respect to the application presented here, message passing implies that a CPU that<br />
“owns” a split link reports, via a message, the number of empty spaces to the CPU that<br />
“owns” the intersection from which vehicles can enter the link. After this, the intersection<br />
update can be done in parallel for all intersections. Next, a CPU that “owns” an intersection<br />
reports, via a message, the vehicles that have moved to the CPUs that “own” the outgoing<br />
links. Pseudo code of how this is implemented using message passing is shown in Figure 4.3.<br />
In fact, the algorithms in Figure 2.3 and Figure 4.3 together give the whole pseudo code for<br />
the parallel queue model traffic dynamics. For efficiency reasons, all messages to the same<br />
CPU in the same time step should be merged into a single message in order to incur the latency<br />
overhead only once.<br />
4.3.3 Communication Software<br />
The communication among the processors can be achieved by using a message passing library,<br />
which provides functions to send and receive data. There are several libraries such as MPI<br />
(Message Passing Interface) [51] or PVM (Parallel Virtual Machine) [63] for this purpose.<br />
Algorithm – Parallel computing implementation<br />
According to Alg. 2.3, propagate vehicles along links.<br />
for all split links do<br />
SEND the number of empty spaces of the link to the other processor.<br />
end for<br />
for all split links do<br />
RECEIVE the number of empty spaces of the link from the other processor.<br />
end for<br />
According to Alg. 2.3, move vehicles across intersections.<br />
for all split links do<br />
SEND vehicles which just entered a split link to the other processor.<br />
end for<br />
for all split links do<br />
RECEIVE the vehicles (if any) from the neighbor at the other end of the link.<br />
Place these vehicles into the local queues.<br />
end for<br />
Figure 4.3: Parallel implementation of queue model.<br />
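For a single split link, the two exchange rounds of Figure 4.3 can be emulated on one machine as follows (an illustrative sketch with in-memory "messages"; the real code exchanges MPI messages between the two CPUs):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One split link, owned by the downstream CPU: it has a fixed capacity and
// currently holds some vehicles (represented by ids).
struct SplitLink {
    std::size_t capacity;
    std::vector<int> vehicles;               // vehicles already on the link
};

// Round 1: the downstream side reports the free spaces on the split link.
std::size_t report_empty_spaces(const SplitLink& link) {
    return link.capacity - link.vehicles.size();
}

// Round 2: the upstream side moves at most that many waiting vehicles
// across, and the downstream side places them into its local queue.
std::size_t transfer(std::vector<int>& upstream_buffer, SplitLink& link) {
    std::size_t n = std::min(report_empty_spaces(link), upstream_buffer.size());
    link.vehicles.insert(link.vehicles.end(),
                         upstream_buffer.begin(), upstream_buffer.begin() + n);
    upstream_buffer.erase(upstream_buffer.begin(), upstream_buffer.begin() + n);
    return n;
}
```

Round 1 guarantees that the upstream CPU never sends more vehicles than the downstream link can accept, so the intersection updates can run in parallel without conflicts.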
PVM and MPI are software packages/libraries that allow heterogeneous PCs interconnected by<br />
a computer network to exchange data. They both define an interface for different programming<br />
languages such as C/C++ or Fortran. For the purposes of parallel traffic flow simulation, the<br />
differences between PVM and MPI are negligible. In principle, CORBA (Common Object<br />
Request Broker Architecture) [92] would be an alternative to MPI or PVM, in particular for<br />
task parallelization; but practical experience shows that it is difficult to use and that, because of its<br />
strict client-server paradigm, it is not well suited to systems which assume that all tasks are<br />
on equal hierarchical levels.<br />
Among these approaches, MPI is chosen since it has a slightly stronger focus on computational<br />
performance. Some key features of MPI can be summarized as follows:<br />
The specification of MPI is machine independent, i.e., data exchange among different<br />
machine architectures will not cause any data loss because of different word lengths of<br />
the machines. This is also a feature of PVM.<br />
Different processes of a parallel program can execute different executable binary files<br />
(i.e. task parallelization). This is also provided by PVM.<br />
MPI does not dictate specific behavior on errors other than indicating what the error is.<br />
This is because MPI expects users to rely on high-quality implementations, knowing that<br />
mandated error recovery specifications would limit the portability of MPI.<br />
MPI allows processes to be defined in different inter-communicators. Different inter-communicators<br />
are capable of communicating with each other. As explained in Chapter 7,<br />
this helps in coupling the different modules available in the application presented here.<br />
MPI is designed to operate on different communication technologies. Here, MPI is used both over<br />
Ethernet [77] and over Myrinet [54]. Moreover, PVM was tested on the same technologies; however,<br />
the results for PVM on Myrinet do not outperform those for PVM on Ethernet. A brief<br />
description of the Myrinet technology is given below. The results and comparison figures can be found<br />
in Section 4.5.<br />
Myrinet [54] is a high-performance packet-communication and switching technology designed<br />
by the company Myricom to provide a high-speed communication medium for PC<br />
clusters. Compared to other technologies, Myrinet has much less protocol overhead,<br />
and therefore provides much better throughput and latency. The one-way latency of Myrinet<br />
is 6 µsec [54]. 10 Gbit Ethernet reportedly has an end-to-end latency of 21 µsec [36]. A<br />
measurement using the ping command on the Fast Ethernet LAN reports a round-trip latency of<br />
0.20-0.25 msec.<br />
PCs in a cluster interconnected by Myrinet are linked via low-overhead routers and switches,<br />
as opposed to connecting one machine directly to another. Most fault-tolerance features,<br />
such as flow control and error control, are handled by the low-overhead switches.<br />
4.4 Theoretical Performance Expectations<br />
The problem size and the memory requirement of a sequential program are the determining<br />
factors of its performance. If the memory needs of a sequential<br />
program can be supplied by the system, the execution time of the program becomes directly<br />
proportional to the problem size. Thus, predicting the performance of a sequential program<br />
is straightforward.<br />
As far as parallel programs are concerned, the problem size and the memory are still<br />
essential factors, but they are not enough to explain the more complicated behavior of parallel<br />
programs: when measuring the performance of a parallel program, load balancing and communication<br />
overhead complicate the analysis.<br />
The performance of parallel programs can be monitored by different metrics. Among these<br />
metrics, execution time and speed-up are the most commonly used. In addition to these two<br />
metrics, another metric, called the real-time ratio (RTR), is considered here. The RTR describes how much faster<br />
than reality the simulation is running.<br />
In a log-log plot, the speed-up curve can be obtained from the RTR curve by a simple<br />
vertical shift; this vertical shift corresponds to a division by the RTR of the single-CPU version<br />
of the simulation. Speed-up curves put more emphasis on the efficiency of the implementation<br />
and less emphasis on absolute speed. An additional difference is that speed-up is independent<br />
of the problem size except at the Ethernet saturation level, which depends on the problem size,<br />
while RTR depends on the problem size except at the Ethernet saturation level, which does NOT<br />
depend on the problem size.<br />
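Written out, with T_sim denoting the simulated time period and T(p) the execution time on p CPUs, the relation between the two metrics is:

```latex
\mathrm{RTR}(p) = \frac{T_{\mathrm{sim}}}{T(p)}, \qquad
S(p) = \frac{T(1)}{T(p)} = \frac{\mathrm{RTR}(p)}{\mathrm{RTR}(1)},
\quad\text{hence}\quad
\log S(p) = \log \mathrm{RTR}(p) - \log \mathrm{RTR}(1) .
```

That is, in a log-log plot the speed-up curve is the RTR curve shifted vertically by the constant log RTR(1).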
The execution time of a parallel program is defined as the total time elapsed from the time<br />
the first processor starts execution to the time the last processor completes the execution. During<br />
execution, on a PC cluster, each processor is either computing or communicating. Consequently,<br />
T(p) = T_cmp(p) + T_cmm(p) (4.1)<br />
where T(p) is the execution time, p is the number of processors, T_cmp(p) is the computation time and<br />
T_cmm(p) is the communication time.<br />
For a problem that can be parallelized using domain decomposition, the time required for<br />
the computation, T_cmp(p), can be approximated in terms of the runtime of the computation on a<br />
single CPU divided by the number of processors. T_cmp(p) also includes the overhead effects<br />
such as handling the boundary conditions by both CPUs and unequal domain size effects, i.e.,<br />
load balancing problems. Therefore, the theoretical value can be written as<br />
T_cmp(p) = (T_1 / p) · (1 + f_ovr(p) + f_dmn(p)) (4.2)<br />
¡<br />
¡<br />
¤<br />
¤<br />
¦ <br />
£<br />
¤<br />
¦ <br />
£<br />
£<br />
where 6 and 6# ¦ are the overhead and load balancing effects, ¨£© ¦¥<br />
¡¦¥<br />
is the serial execution<br />
¡<br />
6 6 ¦<br />
¤£© ¤¥<br />
time 1 and is the number of CPUs. Under the circumstances of and being small<br />
enough, is approximated as:<br />
§¡ ¥<br />
¤¥ ¦£©<br />
¤£© ¦¥<br />
¡¦¥<br />
(4.3)<br />
¡<br />
Communication time, T_cmm(p), generally has two contributors: bandwidth and latency.<br />
Bandwidth is the transfer rate of data, for example measured in terms of bytes per second.<br />
It is defined by at least two contributions: node bandwidth, and network bandwidth. Node<br />
bandwidth is the bandwidth of the connection from a CPU to the network. If two computers<br />
communicate with each other, this is the maximum bandwidth they can reach. Hence, this is<br />
sometimes also called the “point-to-point” bandwidth.<br />
The node bandwidth contribution to the communication time is expressed as<br />
(N_spl(p) / p) (S_msg / b_nd) (4.4)<br />
where N_spl(p) is the number of split links in the simulation; N_spl(p)/p is the number of split<br />
links per computational node; S_msg is the message size and b_nd is the node bandwidth.<br />
The network bandwidth is given by the technology and the topology of the network. Typical<br />
technologies are 100 Mbit Fast Ethernet, Gigabit Ethernet, etc ([77]). Typical topologies are<br />
bus topologies, switched topologies, two-dimensional topologies (e.g. grid/torus), hypercube<br />
topologies, etc. For example, a traditional Local Area Network (LAN) uses 100 Mbit Ethernet,<br />
with a shared bus topology. In a shared bus topology, the same medium is used for all<br />
communications between computers, i.e., they have to share the network bandwidth.<br />
In a switched topology, the network bandwidth is given by the backplane of the switch.<br />
Often, the backplane bandwidth is high enough to have all nodes communicate with each other<br />
at full node bandwidth, and for practical purposes one can thus neglect the network bandwidth<br />
effect for switched networks.<br />
The network bandwidth contribution to the communication time is formulated as:<br />
N_spl(p) (S_msg / b_net) (4.5)<br />
where b_net is the network bandwidth.<br />
The cluster used for the tests throughout this work has a switched topology. Thus, b_net comes<br />
from the technical data of the central switch.<br />
Latency, the second contributor to communication time, is the time necessary to initiate the<br />
communication. Latency is the limiting factor of 10/100 Mbit Ethernet LANs. Newer technologies<br />
such as Gigabit Ethernet and Myrinet offer lower latencies.<br />
If all the contributing factors are taken into account, the communication time per time step<br />
is formulated as follows:<br />
T_cmm(p) = N_sub ( n_nb(p) t_lt + (N_spl(p)/p)(S_msg/b_nd) + N_spl(p)(S_msg/b_net) ) (4.6)<br />
which will be explained in the following paragraphs.<br />
N_sub is the number of sub-time-steps. Since two boundary exchanges per time step are<br />
done, N_sub = 2 for the application represented in this thesis.<br />
1 The serial or sequential execution time of a problem can be measured by running the problem on a single<br />
computing node.<br />
n_nb(p) is the number of neighbor domains each CPU communicates with. All information which<br />
goes to the same CPU is collected and sent as a single message, thus incurring the latency only<br />
once per neighbor domain. For p = 1, n_nb is zero since there is no other domain to communicate<br />
with. For p = 2, it is one. For p → ∞ and assuming that domains are always connected, Euler's<br />
theorem for planar graphs says that the average number of neighbors cannot be more than six.<br />
Figure 4.4 shows an area composed of hexagons. Each hexagon represents a computing<br />
node and the total number of computing nodes is p. Thus, the figure shows the domain decomposition<br />
of the area on p partitions. The hexagons are painted with 4 different colors. Each<br />
color represents a different number of edges of hexagons shared with the neighbors; the conditions<br />
for the 2-, 3-, 4- and 6-neighbor cases are shaded from lightest to darkest in this order.<br />
The total number of edges shared by neighboring partitions is calculated as follows: two of the<br />
corner hexagons have 2 neighbors; the other two corner hexagons have 3 neighbors; 4(√p − 2) of<br />
the hexagons (the remaining ones on the edges) have 4 neighbors and (√p − 2)^2 of the hexagons<br />
(the ones in the middle) have 6 neighbors. Thus, the average number of neighbors becomes:<br />
n_nb(p) = ( 2·2 + 2·3 + 4(√p − 2)·4 + (√p − 2)^2·6 ) / p = (6p − 8√p + 2) / p (4.7)<br />
The numerator of the formula is an integer (and even) if √p is an integer. Based on the geometric<br />
argument in Equation 4.7, the following is used:<br />
n_nb(p) = 2 (3√p − 1)(√p − 1) / p (4.8)<br />
which is the same formula in factored form; it has n_nb(1) = 0 as desired, and n_nb → 6 for p → ∞.<br />
t_lt is the latency (or start-up time) of each message. As said above, t_lt is between 0.20 and 0.25<br />
milliseconds for the Fast Ethernet network of the cluster used throughout this thesis.<br />
Consequently, the combined time for one time step is<br />
T(p) = (T_1 / p)(1 + f_ovr(p) + f_dmn(p)) + N_sub ( n_nb(p) t_lt + (N_spl(p)/p)(S_msg/b_nd) + N_spl(p)(S_msg/b_net) ) (4.9)<br />
According to the discussion above, for p → ∞ the number of neighbors becomes constant,<br />
i.e., n_nb(p) → 6, while the number of split links in the simulation grows proportionally to √p.<br />
In consequence, for f_ovr and f_dmn small enough, and writing c_lt, c_nd and c_net for the constant<br />
prefactors of the latency, node bandwidth and network bandwidth terms:<br />
for a shared or bus topology, b_net is relatively small and constant, thus<br />
T(p) ≈ T_1/p + c_lt + c_nd/√p + c_net √p ;<br />
for a switched or a parallel supercomputer topology, one assumes b_net → ∞ and obtains<br />
T(p) ≈ T_1/p + c_lt + c_nd/√p .<br />
Thus, in a shared topology, adding CPUs will eventually increase the simulation time, thus<br />
making the simulation slower. In a non-shared topology, adding CPUs will eventually not<br />
make the simulation any faster, but at least it will not be detrimental to computational speed.<br />
[Figure: hexagonal area with corners labeled 1, √p and p; each edge contains (√p − 2) elements.]<br />
Figure 4.4: Calculation of neighbors of computing nodes<br />
The dominant term in a shared topology for p → ∞ is the network bandwidth; the dominant<br />
term in a non-shared topology is the latency.<br />
By taking the latency of 100 Mbit Fast Ethernet cards as 0.225 ms, the following calculation<br />
is done to find out the saturation level of Fast Ethernet. Each processor sends messages twice<br />
per time step to all (up to six) neighbors, resulting in 2 × 6 = 12 latency contributions, or 2.7 ms,<br />
per time step. In other words, the cluster can maximally do 1000/2.7 ≈ 370 time steps per second.<br />
If the time step of a simulation is one second, then with a 100 Mbit Ethernet, 370 is also the maximum real<br />
time ratio of the parallel simulation, i.e. the number which says how much faster than reality<br />
the simulation is. Note that the limiting value does not depend on the problem size or on the<br />
speed of the algorithm; it is a limiting number for any parallel computation of a 2-dimensional<br />
system on a PC cluster using Ethernet LAN.<br />
The only way this number can be improved under the assumptions made is to use faster<br />
communication hardware. Gigabit Ethernet hardware is faster, but standard driver implementations<br />
give away that advantage [45]. In contrast, Myrinet [54] is a communication technology<br />
specifically designed for this situation. Interestingly, as will be seen later, it is possible to<br />
recoup the cost for a Myrinet network by being able to work with a smaller cluster.<br />
4.5 Experimental Results<br />
The parallel queue model is used as the traffic flow simulation within the project of a microscopic<br />
and activity-based simulation of all of Switzerland. Computational performance results<br />
are reported here; validation results with respect to a real world scenario can be found in [65].<br />
The cluster used during the tests is composed of 32 computers each of which is a Pentium<br />
III 1GHz dual CPU node. Besides a default 10 Mbit Ethernet [77] communication layer between<br />
these computing nodes, two more network interfaces were available: Fast Ethernet [77]<br />
and Myrinet [54]. Throughout the rest of this thesis, the term Ethernet will refer to Fast Ethernet.<br />
Fast Ethernet is the follow-up to the 10 Mbit Ethernet technology. It offers a speed of 100 Mbit/s.<br />
Even though it is 10 times faster than 10 Mbit Ethernet, both are specified by the same<br />
standards. Due to further developments in Ethernet technology, Gigabit Ethernet, giving a<br />
data rate of 1 Gbit/s, has also come into the picture.<br />
In terms of software, all computing nodes are dual boot: RedHat Linux [40] and Microsoft<br />
Windows [13]. However, all the tests performed in this work are done only on Linux. More<br />
information about the cluster technology used throughout this work is given in [46].<br />
The following performance numbers refer to the scenario “ch6-9” explained in Section 2.5<br />
containing around 1 million agents and a street network with 10 564 nodes and 28 622 links.<br />
Moreover, as also stated in Section 2.5, the scenario is simulated for 3 hours excluding input<br />
reading and output writing.<br />
In the following sections, different computing issues are discussed: Section 4.5.1 compares<br />
the execution times of the parallel traffic flow simulation over different communication media,<br />
namely, Ethernet and Myrinet. The communication libraries, PVM and MPI are tested and<br />
the results are shown in Section 4.5.2. Packing the number of empty spaces on the links of<br />
the street network and packing the vehicles moving across the boundaries by using different<br />
packing algorithms are discussed in Section 4.5.3. Finally, employing different options of<br />
METIS decomposition library is given in Section 4.5.4.<br />
4.5.1 Comparison of Different Communication Hardware: Ethernet vs.<br />
Myrinet<br />
The most important plot is Figure 4.5(a). It shows computational real time ratio (RTR) numbers<br />
as a function of the number of CPUs. Note that, with 60 CPUs with Myrinet, an RTR of 900<br />
is achieved. This means that 24 hours of all car traffic in Switzerland are simulated in less<br />
than two minutes! This performance is achieved with Myrinet communication hardware; by<br />
using 100 Mbit Ethernet hardware, peak performance is at about 300 RTR. Due to the lack<br />
of availability of more computing nodes, the tests could not go further than 60 in terms of<br />
number of computing nodes. But the practical results follow the predicted curve for RTR for<br />
the available computing nodes.<br />
When the measurements for the curves in Figure 4.5(a) were taken, the spatial queues and<br />
buffers of the links were implemented by the self-implemented Ring class as explained in Section<br />
3.3.4. The graph data is stored in an STL-vector (Section 3.3.2). Finally, the supplementary<br />
data structures such as waiting and parking queues are implemented by using the<br />
linked list structure described in Section 3.3.3.<br />
The plot also shows two different curves for achieving the performance with single-CPU<br />
or with dual-CPU machines; there are differences, but they are comparatively unimportant. The<br />
lower values of dual-CPU machines are due to the fact that the bandwidth of the network<br />
card is shared between two processes running on a single machine. However, the performance<br />
[Figure: (a) RTR for a 3-hour run of the 6-9 scenario; (b) Speedup for a 3-hour run of the 6-9<br />
scenario. Both panels are log-log plots over 1-256 CPUs, with curves for single and double CPUs<br />
per node over Myrinet and Ethernet; panel (a) also shows the theoretical value.]<br />
Figure 4.5: RTR and Speedup curves for Parallel Queue Model. The results are measured when<br />
spatial queues and link buffers are of the Ring type, waiting and parking queues are Linked<br />
List and the graph data is stored in an STL-vector. See Chapter 3 for further details.<br />
decrease of dual-CPU machines compared to single-CPU machines is less than a factor of<br />
1.5, which is presumably due to the fact that one process can communicate while the other is<br />
computing.<br />
One advantage of dual-CPU machines is encountered when investigating the cost/performance<br />
ratio. The cost of a single-CPU machine is only a little lower than that of a dual-CPU machine.<br />
Furthermore, the performance difference between these two setups, as stated above, is less than<br />
a factor of 1.5. For example, RTR for using 56 computing nodes on Fast Ethernet is 300 and<br />
284 with single-CPU and dual-CPU machines, respectively. The cost of 56 single-CPU machines<br />
is around twice the cost of 28 dual-CPU machines. Thus, the cost/performance ratio of<br />
dual-CPU machines is competitive with that of single-CPU machines.<br />
There is a super-linear speed-up between 32 and 60 computing nodes when Myrinet is used.<br />
This is presumably due to cache effects, i.e., the sub-domains become small enough that their<br />
data fits into cache.<br />
The theoretical curve in Figure 4.5(a) is calculated as follows. The computation<br />
time, T_cmp(p), is taken from Equation 4.3, where T_1 is the measurement taken from a<br />
sequential run executing one time step; the measured value is about 0.065 seconds. The communication<br />
time is formulated as<br />
T_cmm(p) ≈ N_sub n_nb(p) t_lt<br />
where, as stated earlier, N_sub equals 2 and n_nb(p) is calculated from Equation 4.8. Finally, t_lt is<br />
the latency, measured at 0.225 milliseconds on the cluster.<br />
In Figure 4.5(b), the corresponding speed-up curves are shown. As stated in Section 4.4,<br />
speed-up curves can be obtained by shifting RTR curves vertically. The shifting factor is about<br />
5 here. A speed-up of 32 with 60 CPUs is reached when using Myrinet.<br />
The most important results can be summarized as follows:<br />
On PC clusters (Linux boxes) with Ethernet, parallel traffic flow simulation speed<br />
theoretically saturates at 370 simulation time steps per second. For the node counts<br />
that could be tested, the measured values follow the theoretical curve. This statement<br />
is independent of scenario size or size of the PC cluster.<br />
In contrast, on PC clusters with Myrinet, no saturation effect was observed for the<br />
scenario sizes considered.<br />
If the simulation time step is one second, then “300 simulation time steps per second” translates<br />
into a real time ratio of 300, meaning the simulation runs 300 times faster than real time.<br />
It is interesting to compare two different hardware configurations:<br />
56 single-CPU machines using 100 Mbit LAN: real time ratio 300; the cost is approximately [...] for<br />
the machines plus approximately [...] for a full bandwidth switch, resulting in approximately [...] overall.<br />
28 dual-CPU machines using Myrinet: real time ratio 900; the cost is approximately [...],<br />
Myrinet included.<br />
That is, the Myrinet setup is not only faster, but somewhat unexpectedly also cheaper. A<br />
Myrinet setup has the additional advantage that smaller scenarios than the one discussed here<br />
will run even faster, whereas on the Ethernet cluster, smaller scenarios will run with the same<br />
computational speed as large scenarios.<br />
As mentioned in Section 4.4, the speed-up curves show the same performance saturation as<br />
do the RTR curves. Larger scenarios reach a greater speed-up, but saturate at the same RTR<br />
on Ethernet.<br />
Improvements to the single-CPU version of the simulation, explained in Chapter 3, also show up<br />
in the parallel computing results. For example, Figure 3.7 shows the results when improving the<br />
data structure used for the graph data.<br />
4.5.2 Comparison of Different Communication Software: MPI vs. PVM<br />
During the tests, MPI [51] is used as communication software. Yet, PVM [63] is also utilized to<br />
see whether it makes a difference. One might say that software performance is limited by<br />
hardware performance. However, it also matters how the software is designed to get the most<br />
benefit from the hardware.<br />
PVM and MPI have been compared for years. For the purposes of the application presented<br />
here, their capabilities are rather similar as explained in Section 4.3.3. Figure 4.6 compares<br />
the results of using PVM or MPI. The curves are created when an STL-map is used for the<br />
graph data (Section 3.3.2), when the parking and waiting queues are represented by an STL-multimap<br />
(Section 3.3.3), and when the Ring class, as explained in Section 3.3.4, is employed<br />
for the spatial queues and link buffers.<br />
When the underlying computer network is chosen to be Ethernet, MPI and PVM perform<br />
similarly. Presumably, because Ethernet is so commonly used as a communication infrastructure,<br />
software built on top of it has been thoroughly optimized. When a special<br />
infrastructure, such as Myrinet, is used, the support it gets is proportional to the demand by<br />
the users. Both MPI and PVM support Myrinet, but personal experience shows that only MPI<br />
is able to exploit the hardware advantage of Myrinet: As seen in Figure 4.6, PVM on Myrinet<br />
behaves as if it runs on Ethernet. The reasons for this remain unclear; even attempts by the<br />
developers of PVM-over-Myrinet to instrument the software on the cluster could not produce a solution.<br />
The important consequence is:<br />
If one wants to use high performance communications hardware, such as Myrinet or<br />
Infiniband, for PC clusters, then the use of MPI is strongly recommended since it is<br />
significantly better supported than any other parallel communication standard.<br />
[Figure: (a) RTR and (b) Speedup for a 3-hour run of the 6-9 scenario, PVM test. Both panels are<br />
log-log plots over 1-256 CPUs, with curves Eth-MPI, Myri-MPI, Eth-PVM and Myri-PVM.]<br />
Figure 4.6: RTR and Speedup graphs for the PVM and MPI comparison. An STL-map is used for<br />
the graph data, an STL-multimap represents the parking and waiting queues, and the Ring<br />
class is used for spatial queues and link buffers.<br />
Therefore, the results shown in other parts of this thesis are normally measured over Myrinet<br />
using MPI; exceptions are specified.<br />
4.5.3 Comparison of Different Packing Algorithms<br />
In this work, different types of data are exchanged between different modules: vehicles, events<br />
(Chapter 6) and plans (Chapter 7) etc. They need to be packed prior to sending. Different<br />
packing methods for vehicles are discussed below to give an impression of the contribution of<br />
packing to the overall computing time.<br />
In general, some packing methods pack only the necessary part of an object. On the other<br />
hand, instead of dealing with individual data pieces, the instances of an object can be packed<br />
as a whole. The latter is known as object serialization.<br />
Object Serialization<br />
Object serialization can be defined as writing of the content (or state) of an object such that it<br />
can be re-constructed from that content (or state). The content is converted into bytes using the<br />
object serialization method. Some object oriented programming languages such as Java [42]<br />
provide methods to define a class as serializable and to write the contents of object instances.<br />
Some of the known problems of object serialization, in general, are:<br />
If the class to be serialized is an extension of other available classes, then those classes<br />
must also be defined as serializable.<br />
If only a part of the information of a very large object needs to be serialized, then the<br />
object serialization becomes inefficient in terms of space and time.<br />
Some information of an object needs to remain private, hence it must not be serialized.<br />
Messages containing vehicles<br />
The MATSIM traffic flow simulator packs two types of data in each time step: the number<br />
of empty spaces of the split links and the vehicles to be moved to split links. The size of the<br />
packet, which contains the number of empty spaces, is the same in each time step since the<br />
a long integer for vehicle ID,<br />
a long integer for the next link ID that the vehicle will be on,<br />
an integer for the size of the remaining route of the vehicle,<br />
a long integer array for the route itself (list of nodes),<br />
a double for the activity duration,<br />
a double for the activity end time,<br />
an integer for the leg number,<br />
a long integer for the final destination link ID.<br />
Figure 4.7: The data of a vehicle to be packed<br />
number of split links does not change. The packet, in this case, just includes IDs of split links<br />
and the number of empty spaces on corresponding split links.<br />
As far as vehicles are concerned, the packet size differs depending on the number of vehicles<br />
actually transmitted. The data of a vehicle to be packed into a packet is shown in Figure 4.7.<br />
Thus, for each vehicle transmitted, different types of data are packed. One of the most important<br />
remarks about the types listed in the figure is that the length of the long integer array for the<br />
list of the remaining nodes of a route is different for each vehicle. This makes packing hard for<br />
simple parallel computing packing commands.<br />
Default implementation: memcpy<br />
The default implementation for packing in the traffic flow simulator is written by using the C<br />
function memcpy. memcpy creates byte arrays by converting all data types into bytes. The<br />
receiving side also uses memcpy to unpack variables of different data types from a byte array.<br />
The packing of data is shown in Figure 4.8. The unpacking is similar to the code given in<br />
the figure. One drawback of this method is that when a packet is being prepared, the pointer<br />
to keep track of the position of the next available memory slot on the packet must be advanced<br />
manually.<br />
Using MPI Pack and MPI Unpack<br />
If the communicating processes run on different architectures with different machine representations,<br />
conversions done by memcpy might be different on both sides and might cause<br />
incorrect unpacking and assignment of values. Therefore, a good option is to use MPI Pack<br />
and MPI Unpack library calls. They are similar to memcpy in the sense that they also provide<br />
conversions of different data types into bytes or vice versa. However, since types are converted<br />
into MPI types first, having machines with different representations does not appear to be a<br />
problem.<br />
An example of packing with MPI Pack is presented in Figure 4.9. Unpacking, again, is<br />
similar to packing. Advancing the offset pointer is not necessary here as MPI calls provide it<br />
internally.<br />
Using MPI Struct<br />
The previous two methods pack individual variables of objects, which are vehicles in the traffic<br />
flow simulator. A more elegant way of packing would be packing objects all at once as opposed<br />
to piece by piece. MPI Struct does this. It allows to pack objects.<br />
Despite MPI Struct being a desirable method in object serialization, object serialization<br />
fails when each instance of an object type uses a different size of an array. In that case, a fixed<br />
// define the packet<br />
char array packet;<br />
memcpy(packet, vehicle ID, sizeof(vehicle ID))<br />
// advance the pointer to the end of newly added data<br />
memcpy(packet, next link ID, sizeof(next link ID))<br />
// advance the pointer to the end of newly added data<br />
memcpy(packet, length of route, sizeof(length of route))<br />
// advance the pointer to the end of newly added data<br />
for all the nodes in the route {<br />
memcpy(packet, node ID, sizeof(node ID))<br />
// advance the pointer to the end of newly added data<br />
}<br />
memcpy(packet, activity duration, sizeof(activity duration))<br />
// advance the pointer to the end of newly added data<br />
memcpy(packet, activity end time, sizeof(activity end time))<br />
// advance the pointer to the end of newly added data<br />
memcpy(packet, leg number, sizeof(leg number))<br />
// advance the pointer to the end of newly added data<br />
memcpy(packet, final destination link ID, sizeof(final destination link ID))<br />
// advance the pointer to the end of newly added data<br />
Figure 4.8: Packing vehicle data with memcpy<br />
// define the packet<br />
char array packet;<br />
MPI::INT.Pack(packet, vehicle ID, sizeof(vehicle ID))<br />
MPI::INT.Pack(packet, next link ID, sizeof(next link ID))<br />
MPI::INT.Pack(packet, length of route, sizeof(length of route))<br />
for all the nodes in the route {<br />
MPI::INT.Pack(packet, node ID, sizeof(node ID))<br />
}<br />
MPI::DOUBLE.Pack(packet, activity duration, sizeof(activity duration))<br />
MPI::DOUBLE.Pack(packet, activity end time, sizeof(activity end time))<br />
MPI::INT.Pack(packet, leg number, sizeof(leg number))<br />
MPI::INT.Pack(packet, final destination link ID, sizeof(final destination link ID))<br />
Figure 4.9: Packing vehicle data with MPI Pack<br />
size should be defined for all object instances. For example, one has to fix the number of nodes<br />
each vehicle will go through, i.e. the length of the route. The vehicles of a real scenario are not<br />
supposed to visit the same nodes, i.e., they do not have the same plans. Therefore, the variable<br />
length of the node lists (routes) to be visited is a problem when using MPI Struct. In order to<br />
solve this problem, the program sets the size of the node array to the maximum number of nodes<br />
to be visited among all vehicles.<br />
// define a struct corresponding to a vehicle<br />
typedef struct {<br />
int vid, lid, route size, route[MAXROUTELENGTH], leg id, dlid;<br />
double actDur, actEnd;<br />
} vehicle struct;<br />
// commit the new type<br />
create corresponding MPI Struct type based on vehicle struct<br />
commit the new type<br />
// define the packet<br />
vehicle struct array packet;<br />
// packing the i-th vehicle<br />
packet[i].vid = vehicle ID;<br />
packet[i].lid = next link ID;<br />
packet[i].route size = length of route;<br />
for MAXROUTELENGTH times {<br />
packet[i].route[j] = j-th node of the route;<br />
}<br />
packet[i].actdur = activity duration;<br />
packet[i].actend = activity end time;<br />
packet[i].leg id = leg ID;<br />
packet[i].dlid = destination link ID;<br />
Figure 4.10: Packing vehicle data with MPI Struct<br />
In Figure 4.10, vehicle packing by using MPI Struct is shown. All three methods are explained<br />
in more detail in Section 6.11.<br />
Results<br />
Figure 4.11(a) and Figure 4.11(b) show the results for RTR graphs on single and dual CPUs<br />
of the computing nodes, respectively, when using different packing algorithms for exchanging<br />
the number of empty spaces and vehicles. The tests are done when an STL-vector is used<br />
for the graph data (Section 3.3.2), when the parking and waiting queues are represented by an<br />
STL-multimap (Section 3.3.3), and when the Ring class, as explained in Section 3.3.4, is<br />
employed for the spatial queues and link buffers.<br />
The following notation is used in these figures: Tests are repeated for both Myrinet and Ethernet<br />
(Myri vs Eth) when using single (Figure 4.11(a)) and double (Figure 4.11(b)) processes<br />
per computing node. The tests done are:<br />
Packing both the number of empty spaces and the vehicles with memcpy (ME,MV)<br />
Packing the number of empty spaces with memcpy and the vehicles with MPI Pack<br />
(ME,PV)<br />
Packing the number of empty spaces with memcpy and the vehicles with MPI Struct<br />
(ME,SV)<br />
[Figure: (a) RTR with different packing algorithms when running a single process per node;<br />
(b) RTR with different packing algorithms when running double processes per node. Log-log<br />
plots over 1-256 CPUs; curves for Myrinet and Ethernet with the ME,MV; ME,PV; ME,SV; PE,PV<br />
and SE,SV packing combinations.]<br />
Figure 4.11: RTR graphs for different packing algorithms. During these tests, an STL-vector<br />
is used for the graph data, an STL-Multimap is used for waiting and parking queues and the<br />
Ring class is used for spatial queues and link buffers.<br />
Packing both the number of empty spaces and the vehicles with MPI Pack (PE,PV)<br />
Packing both the number of empty spaces and the vehicles with MPI Struct (SE,SV)<br />
Quantitatively, packing only the vehicles with MPI_Pack and MPI_Struct slows the total<br />
execution time by 2% and 5%, respectively, compared to memcpy. If both the vehicles and<br />
the empty spaces are packed with MPI_Pack and MPI_Struct, the performance loss is 3% and<br />
9%, respectively, compared to the memcpy approach. MPI_Struct gives the worst performance among<br />
these three because it fixes the route array length to a maximum length.<br />
The main result can be summarized as follows:<br />
MATSIM should replace the memcpy approach with the MPI_Pack and MPI_Unpack<br />
commands, since they offer more robustness with respect to data types at only<br />
very little performance overhead. In contrast, the verdict is still open with respect to<br />
MPI_Struct: advantages with respect to object handling are counter-balanced by<br />
the need to define a fixed maximum route length and the resulting inefficiencies.<br />
As stated in Section 6.11, the performance of these functions depends on the data to be<br />
exchanged and its size. Although MPI_Struct gives the worst performance when sending<br />
vehicles and the number of empty spaces, it gives the best performance when sending events<br />
generated by traffic flow simulators to strategy generation modules. More details are given in<br />
Section 6.11.<br />
4.5.4 Different Domain Decomposition Algorithms<br />
One could ask if a different domain decomposition might make a difference. It was already<br />
argued earlier that no difference is expected once latency saturation sets in. METIS [91] provides<br />
different partitioning concepts with different refinement algorithms. The default version,<br />
named METIS_PartGraphKway, is used. It not only reduces the number of non-contiguous<br />
sub-domains but also tries to minimize the connectivity of sub-domains. The performance<br />
results of the MATSIM traffic flow simulator presented earlier are generated when using this<br />
default option.<br />
One can put weights on nodes or on links or on both such that the weights dominate the<br />
partitioning. The first alternative tried is the so-called standard feedback. The method produces<br />
[Figure 4.12 plots omitted: (a) Real Time Ratio and (b) Speedup for a 3-hour run of 6-9 versus Number of CPUs (1-256), each with curves for the default partitioning and for standard feedback based on the number of incoming links, the computing time, and the vehicle count.]<br />
Figure 4.12: RTR and Speedup graphs for METIS partitioning with standard feedback. Different<br />
values are taken into consideration as feedback for next iteration. An STL-map is used to<br />
represent the graph data, an STL-multimap is used for waiting and parking queues and the<br />
Ring class is used for spatial queues and link buffers.<br />
a single weight for each element. Since the work of the queue simulation consists mostly of<br />
computing the intersection dynamics, the computational load is essentially proportional to the<br />
number of intersections. Thus, the weights are on the nodes with this method. Once a single<br />
constraint is computed for each node after a simulation run, the statistics are written into a file,<br />
which will be used by the domain decomposition process in the next simulation run (iteration).<br />
The standard feedback partitioning, thus, attempts to spread the network nodes equally across<br />
all CPUs while maintaining contiguous domains. Three different constraints are tested for<br />
standard feedback partitioning:<br />
the number of vehicles processed by a node,<br />
the computing time spent on a node,<br />
the number of incoming links of a node.<br />
Figure 4.12 compares the constraints above used for the ch6-9 scenario (Section 2.5). The<br />
measurements are taken under the following circumstances: The graph data is implemented by<br />
an STL-map as explained in Section 3.3.2. The parking and waiting queues are represented<br />
by an STL-multimap as described in Section 3.3.3, and the self-implemented Ring class,<br />
explained in Section 3.3.4, is employed for the spatial queues and link buffers. All the test<br />
results are obtained when the parallel traffic flow simulation is run on Myrinet.<br />
The figure shows that neither the computing time nor the number of vehicles processed gives any<br />
improvement. Only setting the number of incoming links of the nodes as weights gives better<br />
performance compared to the other two approaches.<br />
When nodes (or links) have several weights, the refinement algorithm is called multi-constraint<br />
partitioning. In the traffic flow simulation, this can best be understood by assuming<br />
that those weights refer to different time slices. In earlier investigations, some<br />
performance differences had been found when using the multi-constraint<br />
partitioning in METIS applied to the so-called “Gotthard” scenario. In this scenario,<br />
50’000 travelers/vehicles start, with a random starting time between 6AM and 7AM, at random<br />
locations all over Switzerland, and with a destination in Lugano/Ticino. Therefore, towards the<br />
end of the simulation, most of the vehicles accumulate on a couple of CPUs. This results in<br />
giving more workload to some nodes than the others. The unbalanced workload can be uniformly<br />
distributed among CPUs in the next iteration if METIS takes the workload of the nodes<br />
in different time slices into consideration.<br />
The question is if under such unbalanced circumstances the network can be partitioned such<br />
that the load is equally balanced at all times. A counter-example would be a distribution where<br />
one CPU has nodes with a lot of traffic initially but no traffic later, and another CPU has no<br />
traffic initially, but a lot of traffic later. That simulation would run faster if both CPUs traded<br />
approximately half of their nodes: Then both CPUs would always be busy on about half of<br />
their nodes. This is exactly what multi-constraint partitioning attempts to achieve.<br />
The multi-constraint partitioning is implemented in such a way that the computational load on<br />
each node per hour is recorded during the simulation run. Each load per hour corresponds to a<br />
constraint for the node. Thus, each node is specified by more than one constraint (the simulation<br />
runs for more than 3600 time steps, i.e., more than one hour). Then, these recorded hourly values are dumped<br />
into a file along with the corresponding node IDs, and the file is used in the next run by the domain decomposition<br />
process. For the ch6-9 scenario, which has a roughly uniform traffic load all over<br />
Switzerland, the multi-constraint partitioning does not yield a systematic improvement.<br />
Recommendation: METIS<br />
For demand scenarios that are uniformly distributed over the graph data, the default<br />
partitioning technique of METIS, METIS_PartGraphKway, can be employed. For<br />
non-uniformly distributed traffic demand, the algorithms that take the different<br />
weights into consideration should be preferred.<br />
4.6 Conclusions and Discussion<br />
The most important result of the investigations regarding parallelization is that there is a natural<br />
limit to computational speed on parallel computers that use Ethernet [77] as their communication<br />
medium, and that speed is about 370 updates per second. If a simulation uses 1-second<br />
time steps, then this translates into a real time ratio of about 370 (the maximum practical value<br />
is 300). This has important consequences both for real time and for large scale applications that<br />
need to be considered. Also, in contrast to other areas of computing, it seems that waiting for<br />
better commodity hardware will not solve the problem in this case: latency is the technical<br />
reason for this limit.<br />
One option to go beyond this limit is to use more expensive special purpose hardware.<br />
Such hardware is typically provided by computing centers, which operate on dedicated parallel<br />
computers such as the Cray T3E [38], or the IBM SP2 [37], or any of the ASCI (Advanced<br />
Strategic Computing Initiative) [47, 69] computers in the U.S. An intermediate solution is the<br />
use of Myrinet [54], which this chapter shows to be an effective approach, both in terms of<br />
technology and in terms of monetary cost.<br />
On the algorithmic side, the following options exist: First, for the queue simulation it is<br />
in fact possible to reduce the number of communication exchanges per time step from two to<br />
one. This should yield a factor of two in speed-up. Next, in some cases, it may be possible<br />
to operate with time steps longer than one second. This should in particular be possible with<br />
kinematic wave models, since in those models the backwards waves no longer travel infinitely<br />
fast. The fastest time in such simulations would be given by the shortest free speed link travel<br />
time in the whole system. In addition, one could prohibit the simulation from splitting links<br />
with short free speed link travel time, leading to further improvement.<br />
In Section 4.1.2, task parallelization was briefly discussed. There it was pointed out that<br />
this will not pay off if the traffic flow simulation poses the by far largest computational burden.<br />
However, after parallelizing the traffic flow simulation, this is no longer true. Task parallelization<br />
would mean that for example the activity generator, the router, and the learning module<br />
would run in parallel with the traffic flow simulation. One way to implement this would be to<br />
not pre-compute plans any more, as is done in day-to-day simulations, but to request them just<br />
before the traveler starts. A nice side-effect of this would be that such an architecture would<br />
also allow within-day re-planning without any further computational re-design.<br />
The most important conclusions can be drawn for MATSIM as below:<br />
PC clusters should be preferred to parallel/vector computers.<br />
The communication hardware between the PCs of a cluster should be Myrinet,<br />
since it reduces the latency problem that exists on some other technologies such<br />
as Ethernet.<br />
MPI should be utilized because of its better performance and better<br />
support.<br />
To minimize the latency contribution incurred by each message, several<br />
items must be packed into a single message.<br />
Packing several items into a single message should be implemented by using MPI_Pack<br />
and MPI_Unpack, since they are more robust than plain C-type functions such as memcpy.<br />
The different types of domain decomposition provided by the METIS library should be<br />
selected according to the scenario used. The default method of METIS performs well<br />
when the traffic is, more or less, evenly distributed on the graph data.<br />
4.7 Summary<br />
The time consumption of large-scale applications can be diminished with the assistance of parallel<br />
programming. Today's systems comprise different cooperating modules. These modules can be<br />
distributed among different computing nodes to achieve task parallelization. Even the modules<br />
themselves, whose slowness extends the overall computing time of the system, can be<br />
split such that each subpart handles only a part of the whole data (domain decomposition).<br />
From a traffic flow simulation point of view, parallelization is achieved by decomposing<br />
the street network among computing nodes and distributing agents according to the result of<br />
the decomposition. When sub-domains are not fully independent of each other, i.e., routes of<br />
some agents extend over several sub-domains, providing communication between sub-domains<br />
is unavoidable. Among several tools, MPI (Message Passing Interface) [51] is chosen because<br />
it yields better performance than the others, and because it gets continuous support from its<br />
developers.<br />
Since each message exchange involves latency, which is a problem when the communication<br />
medium is Ethernet [77], the data flow on split links is handled by exchanging only two messages<br />
per time step: one declaring the storage constraints and one carrying the vehicles' information.<br />
Also, packing all vehicles that have the same destination computing node into a<br />
single message cuts back the contribution of latency to the time consumption.<br />
Myrinet [54] is a good alternative hardware when one wants to avoid the latency caused by<br />
Ethernet, since latency is much lower on Myrinet. Hence, it lowers the communication cost.<br />
Time CPUs GD PW-Q L-Q CM CL Pack DD<br />
12s d/62 vector linked list Ring Myri MPI memcpy default<br />
36s d/62 vector linked list Ring Eth MPI memcpy default<br />
35s s/32 map multimap Ring Myri MPI memcpy default<br />
80s s/32 map multimap Ring Eth MPI memcpy default<br />
82s s/32 map multimap Ring Myri PVM memcpy default<br />
99s s/32 map multimap Ring Eth PVM memcpy default<br />
49s s/16 vector multimap Ring Myri MPI memcpy default<br />
51s s/16 vector multimap Ring Myri MPI MPI-Pack default<br />
54s s/16 vector multimap Ring Myri MPI MPI-Struct default<br />
35s d/28 map multimap Ring Myri MPI memcpy default<br />
29s d/28 map multimap Ring Myri MPI memcpy SF-IL<br />
Table 4.1: Summary table of the parallel performance results for different data structures of<br />
the traffic flow simulator.<br />
Once the communication cost is reduced, the computation usually needs to be improved. These improvements<br />
are made not only to the parallelization code but also to the sequential part of the<br />
program, as discussed in Chapter 3. In terms of parallelization, how data is packed is an issue<br />
requiring investigation. The choice among different packing methods depends on how elegantly<br />
packing is achieved as well as on the time consumption of these methods. User-defined<br />
packing functions can be built in addition to the functions offered by the communication software.<br />
Despite the explicit preparation effort needed for making programs parallel, parallelization of large-scale<br />
applications is inevitable for time/cost reasons. Economic issues lead to PC clusters<br />
instead of expensive special parallel computers.<br />
Table 4.1 summarizes the most important performance numbers, which are collected when<br />
switching different parameters on. The abbreviations used in the table mean as follows: CPUs<br />
is the number of CPUs, which can be double (two processes per computing node) or single;<br />
GD refers to the graph data; PW-Q shows which data structure is used for the parking and<br />
waiting queues; L-Q shows the data structure option for the link queues; CM means the<br />
communication medium (Myrinet or Ethernet); CL points out the communication library (MPI<br />
or PVM); Pack refers to the packing algorithm used during the tests (memcpy, MPI_Pack,<br />
MPI_Struct); DD shows the domain decomposition algorithm, i.e., default means using<br />
the default option of METIS, and SF-IL means standard feedback using the number of incoming<br />
links of the nodes.<br />
Chapter 5<br />
Coupling the Traffic Simulation to Mental<br />
Modules<br />
5.1 Introduction<br />
Chapter 1 gives a description of the two-layer framework used to relax a congested system. The<br />
physical layer is where the agents interact with each other and the environment. This layer is<br />
the network loading part of DTA [19, 20, 27, 5] and it corresponds to the traffic flow simulator<br />
in the framework. The traffic flow simulator defines the interaction rules for the agents. These<br />
rules are defined in Chapter 2.<br />
The second layer, the strategic layer, is where the agents make their strategies according to<br />
what they have experienced in the physical layer. For example, if agents experience congestion<br />
in the physical layer, some of the agents try to avoid the congestion next time by making new<br />
strategies in the strategic layer.<br />
As seen in Figure 1.1, the physical layer of the framework exchanges plans and performance<br />
information with the strategy generation modules, which generate strategies for the agents in<br />
the system.<br />
5.2 Coupling Modules via Files<br />
5.2.1 Description of a Framework<br />
A multi-agent learning method is implemented in a system called the “framework” to model the<br />
travel behavior of people in a geographical region during a certain period of time. The framework<br />
is composed of several modules with different tasks. There are different ways to couple<br />
these modules. This section explains coupling via files where two files are prevalent: the plans<br />
file and the events file.<br />
As its name implies, the most important entities in a multi-agent learning method are agents.<br />
Each agent has attributes, which impinge on its decisions. Decisions are made about type, location<br />
and timing information of activities, routes between locations of activities, etc. Moreover,<br />
each agent in the framework has a plan it follows. Each plan contains a score, which is calculated<br />
by the agent after the plan is executed. A plan can have several legs, each of which<br />
connects two activities. Each leg mainly carries the following information: the mode of transportation,<br />
the estimated trip time, the estimated start time of the trip and the list of graph nodes<br />
that the agent must traverse to arrive at the location of the end activity.<br />
<person id="6357250">
    <plan>
        <act type="h" x100="387345" y100="276590" link="14584" />
        <leg mode="car" dep_time="07:00" trav_time="00:30">
            <route>4902 4903 4904 4905 4906</route>
        </leg>
        <act type="w" x100="388689" y100="279136" link="14606"
             dur="08:00" />
        <leg mode="car" dep_time="16:30" trav_time="00:15">
            <route>4905 4903</route>
        </leg>
        <act type="h" x100="387345" y100="276590" link="14584" />
    </plan>
</person>
Figure 5.1: An example plan in the XML format<br />
Figure 5.2 shows the components of the framework and how data is moved between these<br />
components. Modules here are coupled via files. A complete initial plans file is fed into traffic<br />
flow simulation(s) to be executed in the first iteration. Plans are written in the XML [97] format.<br />
Issues regarding usage of different input formats are discussed in Section 3.4. A typical plan in<br />
the XML format is given in Figure 5.1.<br />
In the figure, the plan of agent 6357250 has two legs: The agent leaves home which is on<br />
link 14584 at 7 AM and goes to work by car. On the way to work, the agent goes through<br />
5 nodes denoted in the “route” attribute of the leg. This trip is expected to take 30 minutes.<br />
When the agent gets to work, it works for 8 hours, then drives back home via 2 nodes. The<br />
resolution of the x and y coordinates of the locations is based on 100x100 meter blocks of census<br />
information. This is why they are named x100 and y100.<br />
Distribution of agents is accomplished via domain decomposition as explained in Section<br />
4.3.1. In Figure 5.2, the arrow between two traffic flow simulations shows the communication<br />
among them, i.e., message exchanges as mentioned in Section 4.3.2.<br />
In the framework, the entire output of the simulation consists of events which are output<br />
directly when they happen. For example, an agent can depart, can enter/leave a link, etc. The<br />
traffic flow simulation just writes all kinds of events because it does not aggregate data; instead,<br />
this is done by the other modules themselves. The router in the framework, for example,<br />
uses these events to compute the link travel times by recording times of link entering/leaving<br />
events. Separating data aggregation from the simulation philosophically means that the simulation<br />
is responsible for the correctness of the simulation itself, whereas the other modules, such as the router and<br />
the agent database, are responsible for the correctness of the data aggregation.<br />
One of the main modules in the framework is called the Agent Database. Agents in an<br />
agent database keep plans and scores of plans. They decide which plan to use in the next<br />
iteration (next day) based on one of the following ways:<br />
select a random plan based on scores,<br />
request new routes (from the router) with a probability,<br />
request change in activities (from the activity generator) with a probability.<br />
In the first iteration (only one plan per agent exists), 100% of initial plans are used to create<br />
the plans file read by the traffic flow simulators. Both traffic flow simulators in Figure 5.2 write<br />
[Figure 5.2 diagram: the mental layer (activity generator, agent database, router) and the physical layer (two communicating traffic simulators) exchange a 100% plans file and an events file; 10% of the agents request activity changes and 10% request new routes, so the router handles 20% of the plans. The 100% initial plans file seeds the first iteration.]<br />
Figure 5.2: Physical and strategic layers of the framework coupled via files.<br />
an events file during the execution of the plans. Events are read by strategy generation modules,<br />
namely, the router and the agent database 1 . Agents in the agent database calculate the scores of<br />
the plans based on events.<br />
If an agent decides (with a probability) to modify activities, the activity generator is informed.<br />
The activity generator mutates the end time and duration of activities of an agent<br />
and provides modified activities back to the agent database. The agent database, then, informs<br />
the router about the changes in activities so that the router can create new routes between the<br />
modified activities.<br />
If an agent decides (with a probability) to get new routes, they are requested from the router.<br />
Specific types of events, namely entering and exiting a link, are used by the router to calculate<br />
link travel times, which give information about congestion in the physical layer. The router uses<br />
this information to change the routes of the agents, which have made a request and provides<br />
the modified plans to the agent database.<br />
In Figure 5.2, the coupling via files is illustrated: 10% of the agents decide to change the<br />
timing of their activities and 10% of the agents decide to get new routes. Hence, the router gets requests<br />
to change a total of 20% of all routes.<br />
When the router gives newly created plans back to the agent database, the agent database<br />
merges these new plans with the plans that the agents selected based on scores to create a 100%<br />
plans file for the next iteration.<br />
Each iteration corresponds to a “day”, therefore at the end of each day, new plans for some<br />
agents are re-computed for tomorrow based on today’s experiences. Thus, the system implements<br />
day-to-day planning.<br />
The advantage of using events as feedback data is that they are very easy to implement into<br />
1 In the current version, the activity generator does not read the events file, but in future versions it will. The<br />
dotted arrow in the figure illustrates this situation.<br />
the traffic flow simulation of the framework. The events format can be plain text or XML;<br />
advantages and disadvantages of both are discussed in Section 6.6. An example of an XML<br />
event looks like this:<br />
<event time="06:00" type="departure" veh_id="6465" legnum="0" link="1523" from="3827" />
which means that at 6 AM, the agent numbered 6465 departs on link 1523, whose upstream end is located<br />
at node 3827. This event occurs while executing leg number 0 of the agent's plan.<br />
To sum it up, all agents execute plans in each iteration simultaneously in the physical layer<br />
by interacting with each other (multi-agent) and with the environment; they record performance<br />
values of experiences from iterations; performance records are used to update the agents’ mental<br />
state (learning).<br />
5.2.2 Performance Issues of Reading an Events File<br />
Events<br />
As mentioned above, events generated by traffic flow simulators are fed back to different modules<br />
in the framework. The router and the agent database (and the activity generator in future<br />
versions) read these events.<br />
Event files are really big. For example, the “ch6-9” scenario (Section 2.5) generates a raw<br />
events file of 2 GBytes, which includes approximately 53 million events. Therefore, it is worthwhile<br />
to investigate different reading algorithms for events.<br />
Each raw event is described by a set of numbers. The example below means that at time<br />
06:30:06 AM, vehicle 6381934 has departed (specified by 6) on link 17 (whose from-node is 1000)<br />
while executing leg 0 of its plan:<br />
21636 6381934 0 17 1000 6<br />
Original implementation: Reading events into an STL-map<br />
As explained in Section 2.2.5, the events generated by traffic flow simulators are of one of the<br />
following types: “entering the simulation/departure”, “moving from waiting queue to link”,<br />
“entering a link”, “leaving a link”, “being stuck in congestion for a specific time period and<br />
leaving the simulation afterwards” and “arrival at final destination”. All these different types<br />
are generated by traffic flow simulators since the traffic flow simulators do not involve any data<br />
aggregation, i.e., when an event occurs, the traffic flow simulator simply dumps it into a file.<br />
The other strategy generation modules, which read the events, make distinctions between<br />
the different event types. For example, from the viewpoint of the router, only entering and<br />
exiting a link are interesting since they are used to calculate link travel times.<br />
The original implementation of the router reads the data for each event into an STL (Standard<br />
Template Library, Section 3.3.1) vector using the input stream operator >> of C++ [80]. If the<br />
event data is of the type “entering a link”, then an actual event is created by extracting values<br />
from the STL-vector, and the event is inserted into a C++ container map. If an event is of<br />
the type “exit-a-link”, the corresponding enter-a-link event is found in the STL-map container.<br />
The link travel time is calculated using these two event timestamps and is added to the corresponding<br />
link’s travel time and time bin. Then, the event in the container is deleted. The code<br />
is given in Figure 5.3.<br />
The events input file used during the tests with the code in Figure 5.3 is about 700 MBytes<br />
in size and contains 18.5 million raw-written events. However, keeping that many events in an<br />
STL-map data structure suffers from excessive memory usage. In addition, using an intermediate<br />
string vector prior to the data conversion is a major cause of the low performance.<br />
// define a map for enter events
// with two keys, vehicle ID and link ID
typedef map< pair<int,int>, Event* > eventMapType;
eventMapType eventMap;

while (not EOF) {
    read a line from events file
    retrieve the values into vector
    extract values (vehID, linkID, etc.) from vector
    if (flag is enter-a-link) {
        // create a new event with the extracted values
        this_event = new Event(values)
        // insert this event into the map using (vehID, linkID) as key
        eventMap[make_pair(vehID, linkID)] = this_event
    } else if (flag is exit-a-link or arrival) {
        // find the corresponding enter-a-link entry
        eventMap.find(vehID, linkID)
        if (enter event is found) {
            // calculate travel time
            travel_time = exiting time - enter event time
            add it to the travel time of
            the corresponding time bin of the link
            delete enter event from map
        }
    }
}
Figure 5.3: Reading events by using the STL-map<br />
Reducing event processing overhead<br />
In this section, it is tested what happens in terms of performance when some minimal data<br />
aggregation is already done in the traffic flow simulation. For this, it is useful to retrace the<br />
argument that led to the introduction of events files: In the original TRANSIMS [82] implementation,<br />
the simulation emits the aggregated link travel time data every 900 time steps.<br />
However, a major problem with that approach was that it necessitated to always make the output<br />
from the traffic flow simulation fully consistent with the input to strategy modules. For<br />
example, a module that needs arrival/departure times for activities needs completely different<br />
data than a router that needs link travel times. Also, using aggregated data invites the use of<br />
inconsistent aggregation approaches. For example, the traffic flow simulation in the original<br />
specification averages link travel times into time bins corresponding to link exit times, while<br />
the router preferably needs average link travel times for link entry times. Using aggregated<br />
data in the exchange between traffic flow simulation and strategy modules means that every<br />
time the strategy module is interested in a different approach to data aggregation, the traffic<br />
flow simulation code needs to be modified.<br />
In addition, the file size advantage of data aggregation is not as large as it seems: In the near<br />
future, high resolution networks having several hundred thousands of links will be introduced<br />
and emitting average link travel times for every link every 900 time steps will also create large<br />
amounts of data.<br />
Therefore, an intermediate approach is tested. This approach avoids the memory allocation<br />
ifstream events_file

while (not EOF) {
    read a line from events file
    retrieve the values into vector
    extract values (including eventTime
                    and enterTime) from vector
    if (the event flag is "leave link") {
        // calculate travel time
        travel_time = eventTime - enterTime
        add it to the travel time of
        the corresponding time bin of the link
    }
}
Figure 5.4: Reading events by using C++ operator >><br />
of the STL-map data structure during the event reading phase but apart from that leaves the use<br />
of events for data exchange intact. Note that the STL-map data structure is only necessary for<br />
the temporary storage of link entry events for which the corresponding link exit event has not<br />
yet been found. Therefore, if the necessary information can be merged into a single event, the<br />
problem is resolved.<br />
This can be achieved by having the vehicle (or agent) in the traffic flow simulation memorize<br />
its own link entry event. The link entry event is then no longer emitted by the traffic flow<br />
simulation, and instead the link exit event is expanded, as shown below (using the XML syntax,<br />
although plain text was used in the benchmarks).<br />
<event type="exit" id="123" time="09:03:01" linkid="456" traveltime="00:01:03" /><br />
This example denotes a link exit event at 9h 03’ 01” from link number 456 by agent ID 123<br />
with the agent having been on the link for one minute and 3 seconds. From this, any module<br />
can reconstruct the same data as before; the only differences are that the link entry event is<br />
reported only implicitly, and at some later point in time.<br />
This is called “on the fly” in the following. When reading events, the values are still read<br />
into an STL-vector using the >> operator of C++, and then the STL-vector is accessed<br />
to retrieve the relevant values as shown in Figure 5.4. The reduced size of the events file is<br />
400 MBytes and it contains about 10 million events.<br />
Using C instead of C++ file input syntax<br />
The last implementation gets rid of the temporary STL-vector. The events can be read using<br />
the C library functions strtod and strtol instead of the C++ >> operator. The events are<br />
read line by line as strings, then these two functions are used to parse the values, i.e., to convert<br />
the values from strings to appropriate types like double or integer. In this implementation, the<br />
strtol and strtod functions can be replaced by the functions atoi and atof, which<br />
take character arrays and convert them into other types. An example showing how to use these<br />
functions is shown in Figure 5.5. In these two implementations, the file size is likewise reduced<br />
to 400 MBytes for about 10 million events.<br />
char myline[MAXSIZE];<br />
while (not EOF)<br />
    get a line from the events file into myline<br />
    set pointer myptr to point to beginning of myline<br />
    // ATOF/ATOI CASE<br />
    read eventTime with atof(myptr)<br />
    move myptr forward to the first blank<br />
    read vehicleID with atoi(myptr)<br />
    move myptr forward to the first blank<br />
    read legNumber with atoi(myptr)<br />
    move myptr forward to the first blank<br />
    read linkID with atoi(myptr)<br />
    move myptr forward to the first blank<br />
    read fromNodeID with atoi(myptr)<br />
    move myptr forward to the first blank<br />
    read eventFlag with atoi(myptr)<br />
    move myptr forward to the first blank<br />
    read enterTime with atof(myptr)<br />
    // ATOF/ATOI CASE<br />
    // STRTOD/STRTOL CASE<br />
    read eventTime with strtod(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read vehicleID with strtol(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read legNumber with strtol(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read linkID with strtol(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read fromNodeID with strtol(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read eventFlag with strtol(myptr, &pEnd)<br />
    move myptr forward to the position of pEnd<br />
    read enterTime with strtod(myptr, &pEnd)<br />
    // STRTOD/STRTOL CASE<br />
    if (the event flag is ``leave link'')<br />
        travel time = eventTime - enterTime<br />
        add it to the travel time of the corresponding time bin of the link<br />
Figure 5.5: Reading events by using atoi/atof or strtod/strtol<br />
Results<br />
The results of the four implementations are shown in Table 5.1. 18.5e6 and 10e6 are the numbers<br />
of events. &ldquo;On the fly&rdquo; means no supplementary data structure is used for temporary purposes,<br />
as explained above.<br />
<br />
              >>, map,   >>, on the fly,  strtod, on the fly,  atof, on the fly,<br />
              18.5e6     10e6             10e6                 10e6<br />
Memory Usage  185.00 MB  5.13 MB          5.12 MB              5.11 MB<br />
Reading Time  11 mins    5.3 mins         29 secs              28 secs<br />
Table 5.1: Performance results for reading the events file<br />
The STL-map version uses up the most memory. Eliminating the STL-map results in roughly<br />
a 95% improvement in terms of memory usage. Where reading time is concerned, getting rid<br />
of the temporary string vector, into which the data is read by using >>, gives the larger gain:<br />
without the STL-map, a transition from C++-style input parsing (>>) to C-style input parsing<br />
results in a performance increase of 91%. The atoi-type and the strtol-type C functions do<br />
not yield any difference in performance.<br />
In terms of the whole approach, these differences are huge. MATSIM [50] currently needs<br />
about 2 hours per iteration, where the events file is read twice (once by the agent database and<br />
once by the router). Using an efficient approach to events files, as described above, would<br />
reduce the time per iteration to less than 1 hour 40 min.<br />
Recommendation: Reading Raw Events<br />
If raw events are read from a file, the implementation choice is between extensibility<br />
and performance. Using an STL-vector to store the read values as strings, and<br />
converting those strings to the appropriate data types, performs worst, but its<br />
extensibility pays off. Instead of using an STL-map to store the events and an<br />
STL-vector to read them, the traffic flow simulator of MATSIM should perform<br />
some data aggregation to reduce the overhead resulting from the STL structures.<br />
5.2.3 Performance Issues of Plan Writing<br />
In the current implementation of the framework, the agent database writes the plans file and<br />
the traffic flow simulator(s) read it. Different reading approaches for plans and their performance<br />
figures are explained in Section 3.4.3. When the ch6-9 scenario (with 1 million agents) is used,<br />
the writing performance is recorded as follows:<br />
When raw plans are concerned, the format of the output file is column-based structured<br />
text; hence, the file consists only of the data values. The data is written by the<br />
C++ << operator in 17 seconds after the data is retrieved from memory in 2 seconds.<br />
Therefore, writing 1 million plans is completed in 19 seconds.<br />
When XML plans are concerned, the data values have to be written as valid XML tags<br />
with self-explanatory attributes. Prior to writing XML plans, the data retrieval<br />
from memory takes 123 seconds. Then, the data values are written into a file by forming<br />
XML tags in 149 seconds. Consequently, the total time spent for writing XML plans is<br />
272 seconds.<br />
5.3 Other Coupling Mechanisms<br />
Coupling modules via files is a rather old technology; in the area of traffic flow simulation, it<br />
was taken from TRANSIMS [82]. The main advantage of files is:<br />
Modules can be coupled even if they run under different operating systems or use different<br />
programming languages<br />
If the files are in addition plain ASCII, a further advantage is that<br />
files can be easily read and changed for debugging and specific studies.<br />
The main disadvantages are that this is a fairly slow technology, and that one needs considerable<br />
resources in terms of disk space. This gets even worse if one uses plain ASCII instead<br />
of some binary format. In the case of the traffic flow simulation, disk I/O for module coupling<br />
is easily more than 50% of the computing time.<br />
This lets one look for alternatives.<br />
5.3.1 Module Coupling via Subroutine Calls<br />
The arguably best established method to couple computational modules is to use subroutine<br />
calls. Combining, say, agent database, simulation, and router could look as follows:<br />
Start the agent database which reads an agent file with initial plans etc.<br />
The agent database calls the traffic flow simulation, with a pointer/reference to the agents’<br />
plans as an argument, and a pointer/reference to some memory area to store the events.<br />
E.g.<br />
Plans plans = new Plans();<br />
read_agent_file(plans);<br />
Events events = new Events();<br />
run_traffic_simulation(plans, events);<br />
...<br />
The agent database then calls the router in a similar way<br />
...<br />
run_router(plans, events);<br />
...<br />
Etc.<br />
Obviously, any other method to transmit information between modules, as for example within<br />
a global class, can be used.<br />
An additional advantage of this approach is that it allows, with relatively small modifications,<br />
within-day re-planning. One possibility for this, which would completely follow the<br />
design from above, would be to advance the simulation only minute-by-minute, and to run the<br />
re-planning modules in between. An example is shown in Figure 5.6.<br />
The main disadvantages of it are:<br />
It works only if all modules run on the same operating system.<br />
It is easy only if all modules use the same programming language.<br />
while (not finished)<br />
    advance_traffic_simulation_by_one_minute(plans, events);<br />
    for (all replanning modules)<br />
        run_replanning_module();<br />
Figure 5.6: Coupling via subroutine calls during within-day re-planning<br />
It is efficient only if all modules share the same internal representation of plans and<br />
events.<br />
The subroutine call approach is no longer as simple once the traffic flow simulation uses<br />
parallel computing: There needs to be some mechanism that transmits the plans from the<br />
calling module (say, the agent database) and transmits the events back. This could, for<br />
example, be achieved by messages between the master and the slaves of the parallel traffic<br />
flow simulation, but this means that an additional technology beyond simple subroutine<br />
calling needs to be employed.<br />
The third item is the most difficult and technical of the three. For illustration, let us assume<br />
that the three modules were developed by three different teams, without the initial intention of<br />
coupling them. In consequence, all three modules will have different internal representations<br />
of plans and events. In order to allow communication, the three teams need to decide on the<br />
internal representation that is used in the subroutine calls. Let us assume that they agree to<br />
use the internal representation of the agent database. This means that, say, the traffic flow<br />
simulation, when receiving the call, needs to go through all plans and convert the relevant<br />
information to its own internal representation. This needs to be done for all modules.<br />
To be truthful, using an XML representation does not fully avoid the problem: Also here,<br />
one needs to agree on a common format or at least a common structure of the file. Still, there<br />
are fewer options (in particular no choice between pointers, references, or direct objects) and<br />
no inter-language issues, and XML parsers are relatively easy to write.<br />
5.3.2 Module Coupling via Remote Procedure Calls (e.g. CORBA, Java<br />
RMI)<br />
An alternative to files is to use RPC (Remote Procedure Call) [33]. Such systems, of which<br />
CORBA (Common Object Request Broker Architecture) [92] is an example, allow one to call<br />
subroutines on a remote machine (called “server”) in a similar way as if they were on the local<br />
machine (called “client”). There are at least two different ways how this could be used from<br />
the framework’s viewpoint:<br />
1. The file-based data exchange could be replaced by using remote procedure calls. Here,<br />
all the information would be stored in some large data structures, which would be passed<br />
as arguments in the call.<br />
2. One could return to the “subroutine” approach discussed in Sec. 5.3.1, except that the<br />
strategic modules could now sit on remote machines, which means that they could be<br />
programmed in a different programming language under a different OS.<br />
Another option is to use Java RMI [43] which allows Remote Method Invocation (i.e. RPC<br />
on Java objects) in an extended way. Client and server can exchange not only data but also<br />
pieces of code. For instance, a computing node could be managing the agent database and<br />
request from a specific server the code of the module to compute the mode choice of its agents.<br />
It is easier with Java RMI than with CORBA to have all nodes acting as servers and clients and<br />
to reduce communication bottlenecks. However, the choice of the programming language is<br />
restricted to Java.<br />
It is important to notice the difference between RPC and parallel computing, discussed in<br />
Chapter 4. RPCs are just a replacement for standard subroutine calls which are useful for the<br />
case that two programs that need to be coupled use different platforms and/or (in the case of<br />
CORBA) different programming languages. That is, the simulation would execute on different<br />
platforms, but it would not gain any computational speed by doing that since there would<br />
always be just one computer doing work. In contrast, parallel computing splits up a single<br />
module on many CPUs so that it runs faster. Finally, distributed computing attempts to combine<br />
the interoperability aspects of remote procedure calls with the performance aspects of parallel<br />
computing.<br />
The main advantage of using CORBA and other Object Broker mechanisms is to glue<br />
heterogeneous components. Both the DynaMIT [17] and DYNASMART [18] projects use<br />
CORBA to federate the different modules of their respective real-time traffic prediction system.<br />
The operational constraint is that the different modules are written in different languages on<br />
different platforms, sometimes from different projects. For instance, the graphical viewers are<br />
typically run on Windows PCs while the simulation modules and the database persistency are<br />
carried out by Unix machines. Also, legacy software for the data collection and ITS devices<br />
need to be able to communicate with the real-time architecture of the system. Using CORBA<br />
provides a tighter coupling than the file-based approach and a cleaner solution to remote calls.<br />
Its client-server approach is also useful for critical applications where components may crash<br />
or fail to answer requests. However, the application design is more or less centered around the<br />
objects that will be shared by the Object Broker. Therefore, it loses some evolvability compared<br />
to XML exchanges, for instance.<br />
5.3.3 Module Coupling via WWW Protocols<br />
It is possible to embed requests and answers into the HTTP [12] protocol. A more flexible<br />
extension of this would once more use XML. The<br />
difference to the RPC approach of the previous section is that for the RPC approach there needs<br />
to be some agreement between the modules in terms of objects and classes. For example, there<br />
needs to be a structurally similar “traveler” class in order to keep the RPC simple. If the two<br />
modules do not have common object structures, then one of the two codes needs to add some<br />
of the other code’s object structures, and copy the relevant information into that new structure<br />
before sending out the information. This is no longer necessary when protocols are entirely<br />
based on text (including XML); then there needs to be only an agreement of how to convert<br />
object information into an XML structure. The XML approach is considerably more flexible;<br />
in particular, it can survive unilateral changes in format. The downside is that such formats are<br />
considerably slower because parsing the text file and converting it into object information takes<br />
time.<br />
5.3.4 Module Coupling via Databases<br />
Another alternative is to couple the modules via a database. This could be a standard relational<br />
database, such as Oracle [95] or MySQL [94]. Modules could communicate with the database<br />
directly, or via files.<br />
The database would have a similar role as the XML files mentioned above. However, since<br />
the database serves the role of a central repository, not all agent information needs to be sent<br />
around every time. In fact, each module can actively request just the information that is needed,<br />
and (for example) only deposit the information that is changed or added.<br />
This sounds like the perfect technology for multi-agent simulations. What are the drawbacks?<br />
The main drawback is that such a database is a serious performance bottleneck<br />
for large scale applications with several millions of agents. This refers to a scenario where<br />
about 1 million Swiss travelers are simulated during the morning rush hour [66]. The main performance<br />
bottleneck occurred where agents had to choose between already existing plans according<br />
to the score of these plans. The problem is that the different plans which refer to the same<br />
agent are not stored at the same place inside the database: Plans are just added at the end of the<br />
database in the sequence they are generated. In consequence, some sorting was necessary that<br />
moved all plans of a given agent together into one location. It turned out that it was faster to<br />
first dump the information triple (travelerID, planID, planScore) to a file and then sort the file<br />
with the Unix “sort” command rather than first doing the sorting (or indexing) in the database<br />
and then outputting the sorted result. All in all, on the ch6-9 scenario the database operations<br />
together consumed about 30 min of computing time per iteration, compared to less than 15 min<br />
for the traffic flow simulation. That seems unacceptable, in particular since one wants to be<br />
able to do scenarios that are about a factor of 10 larger (24 hours with 15 million inhabitants<br />
instead of 5 hours with 7.5 million inhabitants).<br />
An alternative is to implement the database entirely in memory, so that it never commits to<br />
disk during the simulation. This could be achieved by tuning the parameters of the standard<br />
database, or by re-writing the database functionality in software. The advantage of the latter is<br />
that one can use an object-oriented approach, while using an object-oriented database directly<br />
is probably too slow.<br />
The approach of a self-implemented “database in software” is indeed used by Urbansim<br />
[86, 96]. In Urbansim there is a central Object Broker/store which resides in memory and<br />
which is the single interlocutor of all the modules. Modules can be made remote, but since<br />
Urbansim calls modules sequentially, this offers no performance gain, and since the system<br />
is written in Java [42], it also offers no portability gain. The design of Urbansim forces the<br />
module writers to use a certain canvas to generate their modules. This guarantees that their<br />
module will work with the overall simulation.<br />
The Object Broker in Urbansim originally used Java objects, but that turned out to be too<br />
slow. The current implementation of the Object Broker uses efficient array storage of objects<br />
so as to minimize memory footprint. Urbansim authors have been able to simulate systems with<br />
about 1.5 million objects (Salt Lake City area).<br />
In this work, an object-oriented design is used for a similar but simpler system, which<br />
maintains strategy information on several million agents. A system with 1 million agents,<br />
with about 6 plans each, needs about 1 GByte of memory, thus getting close to the 4 GByte<br />
limit imposed by 32-bit memory architectures [3].<br />
Regarding the timing (period-to-period vs. within-period re-planning), the database approach<br />
is in principle open to any approach, since modules could run simultaneously and operate<br />
on agent attributes quasi-simultaneously. In practice, Urbansim schedules the modules<br />
sequentially, as in the file-based approach. The probable reason for this restriction is that there<br />
are numerous challenges with simultaneously running modules.<br />
5.3.5 Module Coupling via Messages<br />
Yet another approach is to couple the modules by using messages. For example, one could use<br />
MPI [51] as was done for the parallel traffic flow simulation in Chapter 4. There seem to be<br />
essentially two paths that one can take:<br />
Have each module run on a single CPU only, but use messages to communicate between<br />
the modules. In particular, use that mechanism to implement within-day re-planning.<br />
This path is investigated in detail by [29].<br />
Stick with day-to-day re-planning, but have the individual modules run in parallel.<br />
This path is investigated in the following chapters of this thesis.<br />
5.4 Conclusions and Discussion<br />
The framework shown here includes different modules at different conceptual layers. The<br />
framework is used when a congested system is to be driven to a relaxed state. The transition<br />
from the initial state to the relaxed state requires improving the agents' plans, including their<br />
time schedules, compared to the old ones. This improvement involves different modules of the<br />
framework: the traffic flow simulation, the router, the agent database and the activity generator.<br />
Data (plans and events) can be exchanged between these modules in different ways.<br />
If the data is provided via files, an agreement on file format needs to be arranged. Files in<br />
structured text format are simple but not generic when operations on the file are involved. Files<br />
in the XML [97] format might be slow when creating the data corresponding to the values read<br />
in, but their extensibility cannot be disregarded.<br />
Coupling via files makes running different modules written in different languages on different<br />
operating systems possible. However, file operations involve disk accesses, which slows<br />
the system down.<br />
Subroutine calls, despite their efficiency and simplicity, restrict the modules of the system<br />
to be written in the same language and to be run on the same operating system. Moreover,<br />
when the data is split across parallel modules, additional effort, such as exchanging messages,<br />
is required.<br />
Using RPC [33] is another alternative to coupling modules via files but this method is<br />
usually tightly-coupled and provides restricted choices (e.g. programming language).<br />
Yet another alternative to coupling modules via files is the utilization of XML at HTTP [12]<br />
level but it results in the same arguments as stated in Section 5.2.1.<br />
Databases can be used for coupling modules, but they usually become a bottleneck when a<br />
fairly large-scale application is concerned and the database is written to disk. Instead of writing<br />
it out to disk, the database can be kept in memory, but then memory constraints dominate the<br />
performance.<br />
Thus, technologies providing interoperability between modules are emerging. The tradeoff<br />
with the current technologies is between computational performance, effective usage of<br />
resources and flexibility.<br />
The design issues of a framework implementation for MATSIM can be concluded as below:<br />
Using files to couple the different modules of the framework is preferred since computing<br />
nodes with large disks allow users to store data sets that cannot fit into the available<br />
memory. However, this gives low performance because of the disk accesses.<br />
Subroutine calls are not chosen to couple the modules of MATSIM: they give good<br />
results only for data sets small enough to fit into the memory of a computing node,<br />
and they force users to obey strict rules regarding computing resources.<br />
In the Remote Procedure Call method, calls are similar to subroutine calls but remote,<br />
i.e., the callee and the caller are on different computing nodes. MATSIM does not use<br />
RPCs to couple its modules since the RPC performance is low.<br />
Standard relational databases are avoided because they become a bottleneck when<br />
real-world problem sizes have to be solved.<br />
MATSIM should replace its current implementation of coupling modules via files with<br />
coupling modules via message exchanges. The importance of this method is explained<br />
in the next two chapters.<br />
5.5 Summary<br />
A replacement for the traditional four-step process is explained. The framework overcomes the<br />
shortcomings of the four-step process by:<br />
employing activity-based demand generation, which generates a daily activity plan for each<br />
individual, and<br />
employing DTA [19, 20, 27, 5] instead of static modeling to promote time-dependency.<br />
A process called systematic relaxation is used to solve traffic dynamics with congestion.<br />
Systematic relaxation uses a multi-agent learning method based on iterations, in each of<br />
which plans are executed, their performance is recorded and some of the routes are improved.<br />
The framework is conceptually divided into two layers. The strategies generated at the<br />
strategic layer are executed by the physical layer (traffic flow simulation). Agents know more<br />
than one strategy. Among the available strategies, they can select one, request a new<br />
route, or request a modification of the timing information of their activities (in which case<br />
a new route is created).<br />
Performance information about plans is returned from the traffic flow simulation in terms of<br />
events. The traffic flow simulation is kept apart from data aggregation; hence it is only concerned<br />
with the correctness of the simulation. Data aggregation, such as computing link travel times<br />
and scores, is done in the strategy generation modules.<br />
The coupling of different modules can be accomplished via different methods, each of<br />
which has its own advantages and disadvantages as explained in the subsections. The current<br />
implementation of the framework uses a file-based approach; however, in the next two chapters<br />
a new approach, via exchanging messages, is discussed.<br />
Chapter 6<br />
Events Recorder<br />
6.1 Introduction<br />
The events recorder (ER) is a module which collects the events generated during a simulation<br />
run. In the original implementation, the traffic flow simulator (TS) generates and writes¹ events<br />
into a file, and the other modules read the events from the file. Hence, modules of the system<br />
are coupled via file I/O.<br />
An alternative to file-based coupling of modules of a system is coupling them via messages.<br />
As can be clear from Chapter 5.2.1, from the viewpoint of a traffic flow simulation, this entails<br />
two types of messages:<br />
Plans that are fed into the traffic flow simulation<br />
Events that are retrieved out from the traffic flow simulation<br />
Since plans are more complicated structures, this text will consider events first; using messages<br />
for plans will be considered in Chapter 7. As was explained in Section 5.2.1, events are<br />
used by all strategy generation modules to extract performance information. For example, the<br />
router extracts link travel times, or a mental map extracts individual agents’ paths.<br />
The challenge within the present work is to consider the most typical cases that occur within<br />
a parallel simulation environment. In particular, one wants to investigate the situation when the<br />
traffic flow simulation is parallel in the sense of Chapter 4. Therefore, one might consider what<br />
happens with respect to events collection when a traffic flow simulation is distributed across<br />
several computing nodes. One might also consider what happens when the events are divided<br />
into subsets, and each subset is sent to a different ER. Such a situation might be plausible<br />
when several nodes with disks are available and each ER could write its subset of the events<br />
to its own local disk (“distributed (events-)recording”). Another case where this is plausible<br />
is when there are multiple distributed agent databases, each one only responsible for a subset<br />
of agents. Finally, one might consider the case when more than one ER receives the same full<br />
events information (“multi-casting”). Such a situation occurs when more than one module<br />
listens to the same stream of events.<br />
Figure 6.1 shows two examples for the events distribution on ERs. In the examples, there<br />
are three TSs and two ERs in the system. The system is populated with only three agents. These<br />
agents have events occurring on different TS domains. For example, the execution of the plans of<br />
¹ The events recorder can possibly write events to the file, but the more probable application is that the strategy<br />
generation modules take the events information right away from the message stream and the events are never<br />
written to file.<br />
[Figure 6.1: two diagrams, (a) “Distributed Recording” and (b) “Multi-casting”, each showing three Traffic Simulators (hosting agents A1, A2 and A3) communicating with two Event Recorders.]<br />
Figure 6.1: Interaction between TSs and ERs. (a) distributed recording – agents have dedicated<br />
ERs. (b) multi-casting – events are multi-cast to ERs.<br />
agents A2 and A3 generates events on all three TSs, while agent A1 has events occurring on only<br />
two TSs. The big thick arrows represent the communication between TSs and ERs. Each dashed<br />
line originating from an agent indicates to which ER the events from that agent are reported.<br />
In Figure 6.1(a), the agents are assigned to ERs in a round-robin fashion (distributed recording). The events of agents A1 and A3 are reported to the same ER, whereas the events of agent A2 are collected by the other ER. Since the dedicated ER information is part of the agent itself, the agent carries this information along when it has to be moved to another TS according to the domain decomposition.
Figure 6.1(b) shows the same system without any dedicated ERs (multi-casting). In this<br />
case, all events of all the agents on TSs are multi-cast to all the ERs.<br />
6.2 The Competing File I/O Performance for Events<br />
When events are read from and written to a file, the I/O performance is recorded as follows: 10 million raw events are read as strings into an STL vector (Standard Template Library, Section 3.3.1) in 332 seconds. The conversion from strings to the appropriate
data types by using atoi/atof functions of C takes 17 seconds. Hence, the total time for<br />
completing reading is 349 seconds. Before writing an event into a file, the data values of the<br />
event have to be retrieved. The data retrieval for 10 million events is completed in 11 seconds.<br />
Then, they are written into a file using the C++ output operator << in 61 seconds. Thus, the total time for writing 10 million events is 72 seconds. As a result, the file I/O time for 10 million raw events is measured as 421 seconds.
Similarly, for 10 million XML [97] events, reading and parsing via expat [21] are completed in 413 seconds. The data conversion from strings to the proper types takes 21 seconds, using the C functions strtol/strtod. Therefore, reading 10 million events is completed in 434 seconds. Prior to writing, the data values are retrieved in 10 seconds and are written as XML tags with the C++ output operator << in 223 seconds, which gives a total writing time of 233 seconds. Consequently, the file I/O time for 10 million XML events is measured as 667 seconds. In order to reduce this I/O overhead, this chapter investigates passing events in messages.
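The string-to-type conversion step can be sketched as follows. The field layout (time, event type, vehicle ID, leg number, link, from-node) follows the event structure described in Section 6.8; the struct and function names are illustrative, not MATSIM's actual code:

```cpp
#include <cstdlib>
#include <string>
#include <vector>
#include <cassert>

// Hypothetical typed form of one raw event (field names are illustrative).
struct RawEvent {
    double time;
    int type, vehId, legNum, link, fromNode;
};

// Convert one event, pre-split into string fields, using the C conversion
// functions atof/atoi as described in the text.
RawEvent convertEvent(const std::vector<std::string>& f) {
    RawEvent e;
    e.time     = std::atof(f[0].c_str());
    e.type     = std::atoi(f[1].c_str());
    e.vehId    = std::atoi(f[2].c_str());
    e.legNum   = std::atoi(f[3].c_str());
    e.link     = std::atoi(f[4].c_str());
    e.fromNode = std::atoi(f[5].c_str());
    return e;
}
```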
6.3 Other Work<br />
Technically, a multi-casting scenario can be realized by using (true) multi-casting, as was<br />
shown by [29]. However, that work also showed that (i) the standard multi-cast implementation<br />
is not useful for simulation work since arrival of the messages is not guaranteed; (ii) writing a<br />
protocol addition that makes the messages reliable is difficult; (iii) using (reliable) TCP/IP [79]<br />
instead has lower performance since in contrast to true multi-casting it will open separate communication<br />
channels to each receiver; (iv) any solution based on standard Internet protocols<br />
typically runs on standard Local Area Network (LAN) hardware such as Ethernet [77], but<br />
often not on the specialized hardware provided by mini-supercomputers or supercomputers.<br />
Examples for such specialized hardware are Myrinet [54] and Infiniband [41]. (Myrinet provides<br />
a TCP/IP implementation, but it is non-standard, rarely used, and often not installed by<br />
computing centers.)<br />
When coupling the modules of a system via messages, the message format and the transmission methods matter as well. Communication systems such as MPI [51] and PVM [63] offer high performance, but they lack flexibility, since the sender and the receiver must agree a priori on the format of the messages to be exchanged. Object serialization, in contrast, as provided by systems such as Java [42], CORBA [92] etc., and XML-type data formats offer somewhat more flexibility, but at significantly lower performance.
PBIO (Portable Binary I/O) [25] offers a solution that combines coupling flexibility with high performance. Its data format is similar to the XML format in that the meta-data is carried inside the message. PBIO also benefits from reusing the receive buffer, whereas the MPI_Pack and MPI_Unpack routines need a second buffer for the data conversions. In a heterogeneous environment, PBIO's low-level data conversion functions perform the conversion when necessary. PBIO gives the best performance on a homogeneous system; in a heterogeneous environment, MPI and PBIO are competitive. The MPI comparison tests in [25] reportedly involve MPI_Pack/MPI_Unpack. However, as explained in Section 6.11, MPI_Struct could perform better than memcpy and MPI_Pack/MPI_Unpack when fixed-length data is to be exchanged. Moreover, it is also reported that PBIO does not provide any facility to detect under-/overflow in the data conversion.
6.4 Benchmarks<br />
In the tests presented here, between 1 and 24 CPUs² run a stub version of the traffic flow simulation (TS). A stub version is used in order to exclude the computing time of
the traffic flow simulation itself from the benchmarks below. The stub version first reads all<br />
the pre-generated events from a file into the memory before it starts any benchmarks. The stub<br />
version is constructed in a way that the final set of events arriving at the event recorder is always<br />
the same, no matter what the number of CPUs is. The benchmark is set up as follows:<br />
1. TSs read the pre-generated events from a file into the memory.<br />
2. TSs pack the events.<br />
3. The packed events are sent to ERs.<br />
4. Each ER receives the events.<br />
5. The received events are unpacked into the memory.<br />
6. ERs write the events from the memory into a file.<br />
The gettimeofday() function is used to measure the time spent on the operations on the events. It returns the current time by reading the TSC (time stamp counter), which increments with each CPU clock cycle.
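A minimal sketch of such a wall-clock helper is given below; the function name nowSeconds is illustrative, and the dissertation's actual timing code may differ:

```cpp
#include <sys/time.h>
#include <cassert>

// Wall-clock time in seconds, built on gettimeofday() as described above.
// gettimeofday() returns seconds and microseconds since the epoch.
double nowSeconds() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// usage: double t0 = nowSeconds(); /* work */ double elapsed = nowSeconds() - t0;
```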
Myrinet is used as the main communication medium. The time measurement for sending the events is also repeated for 100 Mbit Ethernet [77]. In order to obtain performance predictions
regarding the transmission time, PMB (Pallas MPI Benchmark) [30] is used. That benchmark<br />
provides results of the so-called ping-pong test, which sends a packet to a receiver that sends<br />
it immediately back; many different packet sizes are tested. The results of these tests are the<br />
latency and the bandwidth numbers.<br />
In order to mimic the communication patterns of a real-world situation, the following is<br />
done with respect to the distribution of the events onto TSs and onto ERs in the case of distributed<br />
events-writing:<br />
1. The agents are distributed among the ERs in a round robin fashion.<br />
2. Domain decomposition of the street network takes place on the TSs as described in Chapter<br />
4.<br />
² The cluster used here is composed of dual-CPU computational nodes. Tests are done using only one CPU of each computational node unless otherwise specified.
3. TSs read those events that occur on their part of the domain.<br />
4. During the events sending, the TSs send the events only to those ERs that are “responsible”<br />
(in the case of distributed reporting) or all events to all ERs (in the case of multicasting).<br />
Note that events are not distributed uniformly across domains, and therefore TSs will have<br />
different numbers of events to process. This corresponds to what will later be used in practice.<br />
Since every agent has a different number of events, the total number of events received on the<br />
distributed ERs is also different.<br />
In order to minimize the effect of the non-uniform distribution of events (a result of the domain decomposition) among TSs, the packing time values shown in the benchmarks are normalized as follows: given the number of events on each TS and the measured benchmark times, the final values are calculated as if the same number of events occurred on every TS. For example, when using 2 TSs, the numbers of events on the TSs are approximately 2.5 and 7.5 million. Had the events been distributed perfectly, each TS would have been responsible for 5 million events; the packing time of 5 million events on each TS is therefore calculated by scaling the actual measurement accordingly. The maximum of these rescaled values is the one plotted in the figure. Since the cluster is composed of computing nodes of the same type, the behavior of the nodes usually does not differ much among themselves. Note also that packing events on one computing node is independent of the other computing nodes.
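The normalization can be sketched as follows; the function and variable names are illustrative, and the measured times in the test are hypothetical stand-ins, not the dissertation's measurements:

```cpp
#include <vector>
#include <algorithm>
#include <cmath>
#include <cassert>

// Rescale each TS's measured packing time to the load it would carry under a
// perfectly even split of the events, then take the maximum (the plotted value).
// Assumes packing time is proportional to the number of events on a TS.
double normalizedPackingTime(const std::vector<long>& eventsPerTS,
                             const std::vector<double>& measuredSecs) {
    long total = 0;
    for (long n : eventsPerTS) total += n;
    double even = double(total) / eventsPerTS.size();  // ideal events per TS
    double worst = 0.0;
    for (std::size_t i = 0; i < eventsPerTS.size(); ++i)
        worst = std::max(worst, measuredSecs[i] * even / eventsPerTS[i]);
    return worst;
}
```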
When the measured time involves data transmission (such as send and receive), this interpolation is not applicable, since the transmission time includes the waiting time for packets to be received. For a computing node, the time spent waiting for the receiver to start receiving depends on how much the receiver is occupied by the other computing nodes. Packets are sent immediately after they are packed, so the receiving sequence of the receivers is determined by the packing sequence of the senders. Consequently, whenever transmission time is involved in a measurement, the highest “true” number among the computing nodes, without any interpolation, is taken as the result. In the experiment above, for example, the sending times for the 2.5 and 7.5 million events are obtained by subtracting the corresponding packing times from the measured pack-and-send times, and the larger of the two values is selected as the total time to complete collecting all 10 million events.
6.5 Test Case<br />
The scenario used is the ch6-9 scenario as described in Section 2.5. This real world scenario<br />
is used because of the reasons explained in Section 2.4. The scenario generates approximately<br />
23 million events during its simulation of three hours of traffic. Because of the memory constraints<br />
of the computing nodes, the first 10 million of these events were used for the benchmarks.<br />
6.6 Raw vs. XML Events<br />
In this work, two types of event transmission are tested: raw events or events in XML [97] form (for general remarks on XML, see Section 3.4). Both types have their own advantages
and disadvantages. The raw events are simple and fast but need to be packed as bytes on the<br />
sending side. Similarly, an unpacking routine needs to be run on the receiving side to convert<br />
bytes into the proper types.<br />
XML events, on the other hand, are slower but more generic. The packing routine is much<br />
simpler since it uses string functions on the sending side. If the modules of a system are<br />
coupled via files, then on the receiving side, no unpacking is necessary for writing events into<br />
a file. Since XML is a plain ASCII format, the XML events are written directly into the file as they are (no processing is necessary, in contrast to the raw events).
6.7 Buffered vs. Immediate Reporting of Events<br />
Besides the types of the events generated (XML and raw), there are also two ways of reporting<br />
events.<br />
6.7.1 Reporting Buffered Events<br />
The first type of reporting is to report events in chunks. Events are added to a buffer, which is limited in size. When a certain number of events, SENDSIZE, is reached, the whole buffer is sent as one message to the ER. When the ER gets the message (the big buffer), it unpacks the events and then writes them to a file. In the tests, SENDSIZE is defined as 5000 events.
A buffer size of 5000 is used since this is a good trade-off between memory consumption and<br />
computational performance. In addition, some tests with a buffer size of 10000 events indicated<br />
no difference in performance. The procedure is given in Figure 6.2(a).<br />
The ER tries to receive packets all the time, since it does not know at which time an event/message occurs. When the simulation is done, meaning that no more events will be generated, all the TSs in the system notify all the ERs. At this point the ER finishes. Figure 6.2(b) gives the pseudo code of the ER actions.
6.7.2 Immediately Reported Events<br />
The second type of reporting is that an event is reported immediately after it is generated. Figure 6.3(a) shows how events are reported immediately. The procedure is very similar to that of the buffered events case except that SENDSIZE equals 1: After events are read into
memory, the time measurement is switched on. Then the events are packed and sent one by<br />
one. After all the events are sent, the time measurement is switched off.<br />
On the receiving (ER) side, first the time measurement is switched on. Then the main procedure starts its execution. The ER receives one event per message, unpacks it into memory, and writes it into the file. After being informed that no more events will be generated, the time measurement ends. The algorithm for collecting immediate events on an ER is shown in Figure 6.3(b).
When reporting events immediately, the message passing suffers from per-message processing overhead, since a buffer containing a single event is very small. The obtained results are given in Section 6.10.
6.8 Theoretical Expectation for Buffered Events<br />
In this case, measurements are taken based on cumulative events. The number of events, which<br />
form a single message, is 5000. Each event contains 1 double value (for the event’s time) and<br />
Algorithm A – Traffic Simulator Reporting Buffered Events<br />
read events into memory<br />
time measurement starts<br />
while not all events processed do<br />
pack events from memory into a buffer<br />
if number of packed events hits SENDSIZE<br />
send buffered events to ER<br />
end while<br />
time measurement ends<br />
Inform ERs that all events have been reported<br />
(a) The Traffic Simulator<br />
Algorithm B – Events Recorder Collecting Buffered Events<br />
time measurement starts<br />
while not all events collected do<br />
listen MPI Port<br />
if a packet arrived then<br />
receive packet which contains SENDSIZE events<br />
unpack SENDSIZE events into memory<br />
write SENDSIZE events into file<br />
end if<br />
end while<br />
time measurement ends<br />
(b) The Events Recorder<br />
Figure 6.2: Buffered Events Case. (a) Traffic Simulator Code (b) Events Recorder Code<br />
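The buffering logic of Figure 6.2(a) can be sketched as follows. The actual MPI_Send is abstracted into a callback, and the packed representation is simplified to one number per event; all names are illustrative:

```cpp
#include <vector>
#include <functional>
#include <cassert>

// Events are appended to a buffer and flushed as one message whenever
// SENDSIZE events have accumulated (in the dissertation, the send is an
// MPI_Send to the events recorder).
const std::size_t SENDSIZE = 5000;

struct BufferedSender {
    std::vector<double> buffer;                            // packed events (schematic)
    std::function<void(const std::vector<double>&)> send;  // e.g. wraps MPI_Send
    std::size_t messagesSent = 0;

    void report(double event) {
        buffer.push_back(event);
        if (buffer.size() == SENDSIZE) flush();
    }
    void flush() {  // also called once at simulation end for the partial buffer
        if (buffer.empty()) return;
        send(buffer);
        ++messagesSent;
        buffer.clear();
    }
};
```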
Algorithm A – Traffic Simulator Reporting Events Immediately<br />
read events into memory<br />
time measurement starts<br />
while not all events processed do<br />
pack an event from memory into a buffer<br />
send buffer with one event to ER<br />
end while<br />
time measurement ends<br />
Inform ERs that all events have been reported<br />
(a) The Traffic Simulator<br />
Algorithm B – Events Recorder Reporting Events Immediately<br />
time measurement starts<br />
while not all events collected do<br />
listen MPI Port<br />
if a packet arrived then<br />
receive packet which contains one event<br />
unpack the event into memory<br />
write the event into file<br />
end if<br />
end while<br />
time measurement ends<br />
(b) The Events Recorder<br />
Figure 6.3: Immediate Events Case. (a) Traffic Simulator Code. (b) Events Recorder Code<br />
Packing a Raw Event<br />
memcpy(buffer,event time)<br />
memcpy(buffer,event type)<br />
memcpy(buffer,vehicle ID)<br />
memcpy(buffer,leg number)<br />
memcpy(buffer,link ID of link on which event occurred)<br />
memcpy(buffer,from node of link)<br />
increment the number of events in the buffer<br />
Figure 6.4: Pseudo Code for Packing a Raw Event. Each event pack includes 6 memcpy calls<br />
Packing an XML Event<br />
create a char array using sprintf to write the values of an event<br />
memcpy(buffer,char array created)<br />
increment the number of events in the buffer<br />
Figure 6.5: Pseudo Code for Packing an XML Event. Each event pack creates a char array<br />
with the values of events.<br />
5 integer values (for the vehicle ID, the leg number, the from-node of the link, the link ID and<br />
the event type).<br />
6.8.1 Packing Time Prediction<br />
Raw Events<br />
Packing of raw events is done by using the C-library function memcpy. Pseudo code of<br />
packing a raw event is given in Figure 6.4. The memcpy function is called for each integer and<br />
each double value of an event.<br />
A clock cycle counter program [35] shows that a raw event that has 5 integers and 1 double<br />
value and that is packed by performing several memcpy functions will result in approximately<br />
400 clock cycles per event.<br />
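The packing scheme of Figure 6.4 can be sketched as follows, assuming the usual 8-byte double and 4-byte int; the struct layout and names are illustrative:

```cpp
#include <cstring>
#include <cassert>

// One raw event: 1 double + 5 ints = 8 + 5*4 = 28 bytes (as in Section 6.8).
// Field order follows Figure 6.4; the exact types in MATSIM may differ.
struct Event {
    double time;
    int type, vehId, legNum, link, fromNode;
};

// Pack one event into buf with one memcpy per field; returns bytes written.
std::size_t packRawEvent(const Event& e, char* buf) {
    std::size_t off = 0;
    std::memcpy(buf + off, &e.time,     sizeof e.time);     off += sizeof e.time;
    std::memcpy(buf + off, &e.type,     sizeof e.type);     off += sizeof e.type;
    std::memcpy(buf + off, &e.vehId,    sizeof e.vehId);    off += sizeof e.vehId;
    std::memcpy(buf + off, &e.legNum,   sizeof e.legNum);   off += sizeof e.legNum;
    std::memcpy(buf + off, &e.link,     sizeof e.link);     off += sizeof e.link;
    std::memcpy(buf + off, &e.fromNode, sizeof e.fromNode); off += sizeof e.fromNode;
    return off;  // 28 bytes on a typical platform
}
```

Unpacking on the ER side (Section 6.8.3) is the mirror image: the same sequence of memcpy calls in the opposite direction.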
As described in Section 4.5, the cluster nodes used for the benchmarks have PIII 1 GHz CPUs. Given the CPU speed of 10^9 cycles per second, the execution time of packing 10 million raw events will be:

(400 cycles/event × 10^7 events) / (10^9 cycles/s) = 4 s.
XML Events<br />
Packing an XML event is achieved by using sprintf and memcpy functions. The algorithm<br />
in Figure 6.5 gives the pseudo code. An XML event example is given in the following:<br />
<event time="21636" type="departure" vehid="6381" legnum="0" link="17" from="1000" />

which means that at time 21636 (shortly after 6:00 AM), vehicle 6381, executing leg 0, has departed from link 17 (whose from-node is 1000).
XML events were packed by using stringstream functions of STL (Standard Template<br />
Library, Section 3.3.1). However, the low performance of stringstream functions resulted<br />
in a newer implementation based on the sprintf and memcpy functions, for which the number of clock cycles is approximately 9500 per event, most of it spent in the sprintf function. This is why packing XML events is slower than packing raw events.
Given 9500 clock cycles per event, 10 million XML events will be packed in

(9500 cycles/event × 10^7 events) / (10^9 cycles/s) = 95 s.
6.8.2 Sending and Receiving Time Prediction<br />
PMB (Section 6.4) indirectly measures the time from the first byte leaving the sending side to the last byte arriving at the receiving side. Therefore, it measures the sum of the receiving
time and the sending time, minus the overlap between them. However, in practice the resulting<br />
times are typically caused by bottlenecks at either end. For example, one could imagine that on<br />
the sending side all data is moved into an (infinitely large) communication buffer maintained<br />
by the network card. Once the data has arrived in that buffer, the measurement of the sending<br />
time would stop, but the data would still reside physically on the sending node. Similar effects<br />
could take place on the receiving side. An assumption is that the PMB times are caused either<br />
by the sending or by the receiving side, and that the times on the other side will be significantly<br />
smaller.<br />
In order to calculate the time consumption of a message, both the bandwidth and the latency have to be taken into account, where the latency is defined as the start-up time of a message:

transmission time = latency + (message size / bandwidth).
However, PMB reports the cumulative “effective latency” and “effective bandwidth”; therefore the formula becomes:

transmission time = effective latency = message size / effective bandwidth.
Raw Events<br />
A packet of the buffered events consists of 5000 events. One double and one integer correspond to 8 bytes and 4 bytes, respectively. Having 1 double and 5 integers to represent one event results in

5000 events × (8 + 5 × 4) bytes/event = 140,000 bytes ≈ 140 KB

per packet.
To find out the corresponding latency and bandwidth values over Myrinet for a packet size of 140 KB, PMB is used. From the graphs generated on the cluster, for a packet of 140 KB the effective latency is 620 µs, meaning that it takes 620 µs to transmit those 140 KB.
Packing 10 million events into packets of 5000 events gives a total of 2000 messages. Therefore, the theoretical expectation for transferring 2000 messages of 140 KB in size can be found as:

2000 messages × 620 µs/message = 1.24 s.

This means that transferring 2000 messages of data, each of which is 140 KB in size, between a single TS and a single ER should theoretically take about 1.2 s.
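The arithmetic above can be restated compactly; the constants are taken from the text, and the names are illustrative:

```cpp
#include <cassert>

// Raw-event transfer arithmetic of Section 6.8.2: 28 bytes per event,
// 5000 events per packet, 10 million events, and a 620 microsecond
// effective latency per 140 KB packet (PMB measurement over Myrinet).
const int    BYTES_PER_EVENT   = 8 + 5 * 4;         // 1 double + 5 ints
const long   EVENTS_TOTAL      = 10000000L;
const int    EVENTS_PER_PACKET = 5000;
const double PACKET_LATENCY_S  = 620e-6;

const int    packetBytes = BYTES_PER_EVENT * EVENTS_PER_PACKET;  // 140,000 bytes
const long   numPackets  = EVENTS_TOTAL / EVENTS_PER_PACKET;     // 2000 messages
const double transferS   = numPackets * PACKET_LATENCY_S;        // 1.24 s
```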
XML Events<br />
The theoretical time for transferring the XML events can be calculated similar to the raw events<br />
case. Each XML event has a maximum of 120 bytes. Hence, one packet of 5000 events results<br />
in

5000 events × 120 bytes/event = 600,000 bytes ≈ 600 KB.
According to PMB, the corresponding effective latency for a packet size of 600 KB is 2.4 ms. Given the fact that 10 million XML events can be transferred in 2000 messages, the theoretical time for transferring 2000 messages of 600 KB can be calculated as:

2000 messages × 2.4 ms/message = 4.8 s.
6.8.3 Unpacking Time Prediction<br />
Raw Events<br />
The theoretical value for unpacking the buffered events should be the same as that for packing<br />
except that it does not depend on the number of TSs given that the number of ERs is constant<br />
during a set of runs.<br />
Unpacking a raw event consists of calling the memcpy function 5 times for the integer values and once for the double value. The number of clock cycles is found to be 410 per event. Therefore, the total unpacking time of 10 million raw events will be

(410 cycles/event × 10^7 events) / (10^9 cycles/s) = 4.1 s.
XML Events<br />
When unpacking an XML event from a received packet, two different meanings of “unpacking” exist. The first one is useful particularly when events are written into a file. The second approach may be employed when one prefers to access the data values stored in the XML tags; the latter is the way the raw events are unpacked.
When the XML tags are only to be extracted, the procedure is as follows: Since XML events are just strings that start with “<” and end with “/>”, the unpacking procedure reads each string between these delimiters and saves the string as a whole, i.e., as a tag. Thus, a simple search for the XML tags is done. It takes 3000 clock cycles for an XML event to be extracted as a tag from a received packet, and extracting 10 million events in this way results in

(3000 cycles/event × 10^7 events) / (10^9 cycles/s) = 30 s.
After all the XML tags are extracted from a received packet, these values are written into a file<br />
as explained in the next sub-section.<br />
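The tag-only extraction can be sketched as a simple scan; this is an illustrative implementation, not the dissertation's code:

```cpp
#include <string>
#include <vector>
#include <cassert>

// XML events are plain strings starting with '<' and ending with "/>",
// so a received packet can be split into tags by a simple search,
// without parsing any attribute values.
std::vector<std::string> extractTags(const std::string& packet) {
    std::vector<std::string> tags;
    std::string::size_type pos = 0;
    while ((pos = packet.find('<', pos)) != std::string::npos) {
        std::string::size_type end = packet.find("/>", pos);
        if (end == std::string::npos) break;  // incomplete tag at the end
        tags.push_back(packet.substr(pos, end + 2 - pos));
        pos = end + 2;
    }
    return tags;
}
```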
In order to store the value of each attribute, similar to the raw events case, the XML tags are parsed into values. Parsing an XML tag via expat [21] and storing its values separately takes 50,000 clock cycles. Therefore, the total unpacking time for 10 million XML events is

(50,000 cycles/event × 10^7 events) / (10^9 cycles/s) = 500 s.
OPERATION RAW XML<br />
Packing Time 4.0 s 95 s<br />
Sending and Receiving Time 1.2 s 4.8 s<br />
Unpacking Time 4.1 s 30+500 s<br />
Writing Time 56 s 20 s<br />
Table 6.1: Performance prediction table for buffered events.<br />
6.8.4 Writing Time Prediction<br />
Raw Events<br />
Each attribute of a raw event is written into a file separately, forming a structured text. The number of clock cycles needed is 5600 per event when the C++ output operator << is used. Hence,

(5600 cycles/event × 10^7 events) / (10^9 cycles/s) = 56 s
is required for all the events to be dumped into a file.<br />
XML Events<br />
When an XML event is written into a file by using the C++ output operator <<, writing each event uses up 2000 cycles, which results in

(2000 cycles/event × 10^7 events) / (10^9 cycles/s) = 20 s
to write all the events into a file. Writing XML events is faster than writing raw events because<br />
raw events need to be converted to the strings prior to writing. XML events, on the other hand,<br />
are already in the string format.<br />
6.8.5 Performance Prediction for Buffered Events: Putting it together<br />
A table of the performance predictions for each individual step is shown in Table 6.1. The table demonstrates that exchanging the raw/plain events is expected to be faster for the operations that involve the “effective” message exchange. However, XML is more flexible and extensible, as explained in Section 3.4.1.
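As a cross-check, the per-event cycle counts quoted in Section 6.8 reproduce the CPU-bound rows of Table 6.1; the sending/receiving row depends on the network latency and is omitted here, and all names are illustrative:

```cpp
#include <cassert>
#include <cmath>

// Back-of-the-envelope check of Table 6.1 from the cycle counts quoted in
// Section 6.8 and a 1 GHz (10^9 cycles/s) CPU.
const double CPU_HZ = 1e9;
const double EVENTS = 1e7;  // 10 million events

double secs(double cyclesPerEvent) { return cyclesPerEvent * EVENTS / CPU_HZ; }

const double packRaw   = secs(400);    // 4.0 s
const double packXml   = secs(9500);   // 95 s
const double unpackRaw = secs(410);    // 4.1 s
const double tagXml    = secs(3000);   // 30 s  (tag extraction only)
const double parseXml  = secs(50000);  // 500 s (full parse via expat)
const double writeRaw  = secs(5600);   // 56 s
const double writeXml  = secs(2000);   // 20 s
```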
6.9 Results of the Buffered Events<br />
Using the buffered raw events, the overall simulation time, including the initial events reading, is measured on a single TS, respectively with 1 ER, 2 ERs, and 4 ERs. The corresponding simulation times are also recorded when using the buffered XML events, where “unpacking an XML event” means retrieving an XML tag without parsing it for values (see Section 6.8.3). All of these numbers include approximately 35-40 seconds for reading the input data. In the following, the input reading times will be ignored, and the actual performance measurements of the other contributions will be described.
The performance results of different operations when buffered events are concerned are<br />
given in the following sections: Packing both the raw and XML events is reported only by<br />
using the C memcpy function in Section 6.9.1. Section 6.9.2 compares the sending time of<br />
the raw events and XML events on different communication media (Myrinet and Ethernet),<br />
" <br />
84
and compares the different types of collection of events (distributed and multicast cases). The<br />
effective receiving time is measured both on Myrinet and Ethernet in Section 6.9.3. Unpacking<br />
the received events of both types, i.e. raw and XML, by using memcpy similar to packing case<br />
is discussed in Section 6.9.4. Finally, the time for writing events is measured for both raw and<br />
XML events in Section 6.9.5.<br />
6.9.1 Packing<br />
The time spent for packing events is plotted in Figure 6.6. The curves in this figure show only<br />
the “packing time”. To measure the packing time for events, the algorithm in Figure 6.2(a) has<br />
been changed in a way that the events are only packed, but not sent to the other side. As one<br />
can see, the performance values follow closely the theoretical predictions.<br />
Adding more ERs to the system in the “multi-casting” sense has no effect on the packing<br />
time values since the “total” packing time of TSs is independent of the number of ERs in the<br />
system.<br />
6.9.2 Sending<br />
Figure 6.7 shows the total time spent for sending all the events in the distributed ERs case when<br />
Myrinet [54] is used as the communication medium. The time measurement starts before the<br />
first event is packed and ends right after the last packet is sent. Hence, the numbers collected<br />
include the packing time as well. The numbers presented in Figure 6.7 are calculated by subtracting<br />
the time for packing from the total time for sending and packing on all TSs. Then, the<br />
maximum number is taken out of these subtracted values since it also shows how long TSs wait<br />
before a receive command is issued.<br />
The theoretical curve is derived (Section 6.8.2) under the assumption that the performance<br />
restrictions lie entirely on the side of the sender. That is, when a TS issues a send command,<br />
the ER is ready to receive. Of course, this is not the case in reality. The MPI_Send function is blocking, which means each MPI_Send call needs to wait until a corresponding MPI_Recv command is issued. In other words, especially for small numbers of ERs, the TSs compete with each other.
The important features of these plots are:

- The bottleneck for sending events lies almost entirely with the sender: with XML events, up to 8 TSs can send with full bandwidth before saturation sets in, presumably caused by the receiver.

- The reason for this is that the sender is busy packing events most of the time (Figure 6.6), while at this point the receiver immediately discards events. This is no longer true when unpacking (Section 6.8.3) and possibly writing is added to the receiver.

- Eventually, as the number of TSs increases, the curves start to saturate (or even increase) because of the competition among the senders for access to the receiver buffer.
The Myrinet results show that the network cards saturate earlier on the sender side than on the receiver side. This could be due to the rendezvous protocol, used for sending messages larger than a certain threshold, via the GM (Glenn's Messages³) one-sided put operation. The rendezvous⁴
³ GM is the name of the low-level communication layer for Myrinet [54].
⁴ There is another protocol, called the Eager protocol, used for small messages. When a send is issued and the matching receive is not yet posted, the small message is saved temporarily in an (unexpected-message) buffer before the actual send occurs. Allocating such buffers for large messages does not work. With small messages, therefore, a good bandwidth is not expected, since the message must be copied.
(Log-log plots of the packing time in seconds versus the number of traffic simulators, with curves for 1, 2, and 4 ERs and the theoretical value; panel (a) raw events, panel (b) XML events.)
Figure 6.6: Time elapsed for packing events. (a) Packing raw events. (b) Packing XML events.<br />
Note: Having n ERs refers to the "distributed ER" method, i.e. each ER receives 1/n of the events. The packing time for the "multi-casting ER" method is the same, since the data is packed only once.
protocol ensures that handshaking between the sender and the receiver occurs before the message is sent. GM's put operation writes the large message directly into the receive buffer, hence the operation finishes without the remote side being involved.
The time for sending events over Ethernet [77] was also measured. The results are shown in Figure 6.8. For Ethernet, the higher latency and lower bandwidth compared to Myrinet [54] explain the difference between the Myrinet and Ethernet values in the figure. One also notices that with Ethernet, a sender at full bandwidth immediately saturates the receiver: in contrast to Myrinet, multiple TSs sending to one ER is not any faster than one TS sending to one ER. This also means that, for Ethernet, using multiple ERs for distributed recording is indeed an advantage.
So far it was assumed that when there are multiple ERs, they all receive only a part of the information (distributed recording). As mentioned in Section 6.4, a different scenario
[Two log-scale plots, "Sending Only, Raw Events" and "Sending Only, XML Events": time in seconds vs. number of traffic simulators (1-32), with Myrinet curves for 1, 2, and 4 ERs and the theoretical value for 1 ER.]
Figure 6.7: Time for sending events, distributed recording, Myrinet. (a) Sending raw events.<br />
(b) Sending XML events.<br />
is to assume that there are multiple ERs, but that they represent different strategy generation modules and therefore all of them want to receive the full event information. This is called "multi-cast". Multi-casting, in general, means sending the same message to a list of recipients on a network; during these tests the TSs therefore report all events to all ERs in the system. MPI [51] does not support a multi-cast function. In order to multi-cast the events to all ERs, a simple for loop calls the MPI_Send function once per ER.
The results are plotted in Figures 6.9 and 6.10 for Myrinet and Ethernet, respectively. The sending time increases almost linearly in the number of multi-casting ERs, as one would expect, since the internal command is just a loop over all ERs.
When Ethernet is used as the communication medium, having events distributed<br />
among ERs should be preferred. If the communication is achieved over Myrinet,<br />
multi-cast is also a noteworthy option.<br />
[Two log-scale plots, "Sending Only, Raw Events" and "Sending Only, XML Events": time in seconds vs. number of traffic simulators (1-16), with curves for Myrinet and Ethernet with 1, 2, and 4 ERs and the theoretical value for Ethernet with 1 ER.]
Figure 6.8: Distributed recording, comparison of Ethernet vs Myrinet when sending events. (a)<br />
Sending Raw events. (b) Sending XML events. Note: Having n ERs means that approximately 1/n of the events are sent to each ER.
6.9.3 Receiving<br />
There is no clear way to measure the time consumption of a receive operation. This is because when an MPI_Recv command is executed before the corresponding MPI_Send is issued, the measurement on the receiver side will include the time the sender spends on operations taking place before MPI_Send. Therefore, the curves in Figure 6.11 effectively show the combined effects of packing and sending on the sender side and receiving on the receiver side over Myrinet; more precisely, they effectively show the sum of the packing, sending, and receiving times. They exclude the unpacking of events and the writing by the ER code shown in Figure 6.2. In order not to include the events reading time of the TSs, the time measurement starts right after the first packet arrives at the ER and ends after all the packets are fetched. This in fact excludes the packing and sending time of the first packet, but this is a small error given 2000 or more packets.
In the resulting figures, the curves decrease as the number of TSs increases. As explained in the previous paragraphs, the time measurement does not involve unpacking or further opera-
[Two log-scale plots, "Multicast Sending Only, Raw Events" and "Multicast Sending Only, XML Events": time in seconds vs. number of traffic simulators (1-32), with Myrinet curves for 1, 2, 4, and 8 ERs.]
Figure 6.9: Multi-casting, time for sending events. (a) Sending Raw events. (b) Sending XML<br />
events.<br />
tions such as writing the events. This means that the majority of the time on the receiving side is spent waiting for the senders to issue the MPI_Send command. Since the senders basically pack the events during this period, the packing time dominates the curves shown in Figure 6.11. An important observation from the results obtained up to this point: since the packing plus sending time is close to the receiving time, and packing uses up most of that time, one may conclude that the actual receiving time is smaller than the packing time and that the rest of the time is spent idle.
The same tests were also repeated over Ethernet; the results are presented in Figure 6.12. Since the packing and unpacking of events do not depend on the communication medium, the packing time values are the same as those of the Myrinet case shown in Figure 6.6. Given that the sending time measurements (Figure 6.8) are higher than the packing time measurements (Figure 6.6), one may conclude that, as opposed to the Myrinet case, most of the time in the Ethernet case is spent in sending or receiving rather than in packing.
[Two log-scale plots, "Multicast Sending Only, Raw Events, Ethernet" and "Multicast Sending Only, XML Events, Ethernet": time in seconds vs. number of traffic simulators (1-32), with Ethernet curves for 1, 2, 4, and 8 ERs.]
Figure 6.10: Multi-casting over Ethernet, time for sending events. (a) Sending Raw events. (b)<br />
Sending XML events.<br />
6.9.4 Unpacking<br />
The time measurement for unpacking events starts when the first packet arrives at the ER. After all the packets are retrieved and the last event is unpacked, the time measurement is switched off. Therefore, receiving is included in these measurements, while any further operations, such as writing, are excluded.
When a measurement includes only the receiving time on the ER side, as explained in the previous section, the ER spends most of the time waiting for the TSs to send something (over Myrinet). This waiting time corresponds to the packing time of the TSs. On the other hand, when the measurement is taken so that unpacking is also included, the ERs spend the time in the actual unpacking process as opposed to waiting for the TSs to pack the data. This also fits the theoretical calculation of packing and unpacking explained in Sections 6.8.1 and 6.8.3.
Figure 6.13 shows the time spent for receiving and unpacking the events on the ER side. The unpacking time mostly determines the curves. The theoretical curves are the sum of the theoretical values for receive and unpack. Adding more ERs to the system will decrease the
[Two log-scale plots, "Effective Receive, Raw Events" and "Effective Receive, XML Events": time in seconds vs. number of traffic simulators (1-32), with Myrinet curves for 1, 2, and 4 ERs.]
Figure 6.11: Time elapsed on the ER side when receiving events over Myrinet. (a) Receiving Raw events. (b) Receiving XML events.
unpacking time, since fewer events will be handled by each ER. As shown in Figure 6.11, increasing the number of ERs does not make much difference in terms of receiving. In Figure 6.13, however, the unpacking time values clearly diminish as the number of ERs increases. Therefore, one may conclude that the transfer time of data between modules is very small.
When the raw events are used, packing and unpacking take the same amount of time, as expected theoretically. Given that sending and receiving do not take long, one may conclude that both the ERs and the TSs spend more time in packing and unpacking than in sending and receiving or in waiting for each other (see Figure 6.13(a)).
On the other hand, when XML events are transferred, the packing time is three times the unpacking time. Therefore, the ERs finish unpacking earlier and wait for the TSs to finish packing and sending. The waiting time of the ERs in this situation is the difference between the packing time and the unpacking time; it is included in the theoretical curve for unpacking XML events. When the number of TSs is small, the waiting time of the ERs increases, since each TS has more events to pack. The curve flattens out around 30 seconds for one ER in
[Two log-scale plots, "Effective Receive, Raw Events" and "Effective Receive, XML Events": time in seconds vs. number of traffic simulators (1-16), with curves for Myrinet and Ethernet with 1, 2, and 4 ERs.]
Figure 6.12: Comparison of Ethernet vs Myrinet when receiving events. (a) Receiving Raw<br />
events. (b) Receiving XML events.<br />
Figure 6.13(b). This number is the same as the theoretical value for unpacking in the one-ER case as calculated in the previous sections.
As explained in Section 6.8.3, instead of writing events into a file, an ER can parse the XML event strings and store the values, which completely eliminates the coupling of modules via files. In this work, this is done using expat [21]. The results are shown in Figure 6.13(c).
6.9.5 Writing into File<br />
For the time measurement of writing events into a file, the measurement starts before the first event is received and ends after the last one is dumped into the file. The writing-only times of the ERs are shown in Table 6.2.
As seen in the table, the time that an ER spends on writing events is independent of the number of traffic flow simulations in the system. This is because during the experiments the agents are assigned to the ERs in a round-robin fashion, no matter what the number of TSs in the system is. Given a fixed number of ERs, the number of events collected and consequently the writing time of
[Three log-scale plots: "Effective Receive + Unpack, Raw Events", "Effective Receive + Unpack, XML Events", and "Effective Receive + Unpack + Parse, XML Events"; time in seconds vs. number of traffic simulators, with Myrinet curves for 1, 2, and 4 ERs and the theoretical value for 1 ER.]
Figure 6.13: Time elapsed for unpacking events on top of the effective receiving time. (a)<br />
Unpacking Raw events. (b) Unpacking XML events. This only includes extracting XML tags<br />
as strings from a received packet. (c) Unpacking XML events. This includes parsing values of<br />
attributes.<br />
events handled by each ER remain the same as the number of TSs changes.
As explained in Section 3.5, Local disk in the table means that the files are kept on the local disks of the computing nodes on which the simulations run. Via NFS means that the files are on a remote
                              Writing Time
Explanation                   Raw    XML
1 ER,  Local Disk, C++        59s    25s
2 ERs, Local Disk, C++        30s    13s
4 ERs, Local Disk, C++        15s     6s
1 ER,  via NFS,    C++        81s    N/A
1 ER,  Local Disk, C          57s    N/A
1 ER,  via NFS,    C          66s    N/A
Table 6.2: Performance results for ERs writing the events file.
machine and the simulation accesses the file via NFS (Network File System) [72]. C++ means that writing is done via the C++ stream output operator, whereas C refers to writing with the fprintf function.
6.9.6 Summary of “buffered events recording”<br />
Figure 6.14 shows the combined results for the buffered events when there is only one ER in the system. The curves are aggregated according to the sequence in which the operations occur; each curve is therefore drawn on top of the previous operations. In order to ensure the integrity of the curves, the values of the packing time measurement are taken as they are (as opposed to the interpolation explained in Section 6.4).
The framework should fully replace the maintenance of events via files by sending the events directly to a listening module. By doing so, for raw events, the computational performance is 10 times better over Myrinet and 3 times better over Ethernet. For XML events, eliminating files completely comes with the higher overhead of parsing the XML events. Nevertheless, this is necessary if one wants to access the values stored in the XML strings.
6.10 Theoretical Expectations and Results of Immediately<br />
Reported Events<br />
When taking measurements, measuring only the duration of a single command, such as pack, send, receive, unpack or write, and then adding those up for 10 million events, is impracticable with the timing devices commonly available in computers. Hence, the measurements have to be taken in a cumulative sense. For example, the gettimeofday() command has an accuracy of one microsecond; anything faster than a microsecond will not be measured correctly with this command, so the measured results can be misleading.
Each immediately reported raw event is packed in 0.4 µs, as in Section 6.8.1, and packing 10 million raw events thus takes about 4 s in a system of a single ER and a single TS. Therefore, there is no difference in packing events whether they are reported buffered or immediately.
If the events are reported immediately as they occur, the main problem arises when transferring them to the ERs individually. When a send command is issued, in general, a header is added to the data, the data is copied to the send buffer, and then the actual send occurs. In other
[Four log-scale plots summarizing the buffered case for one ER: Raw events over Myrinet and Ethernet, XML events over Myrinet and Ethernet; time in seconds vs. number of traffic simulators, with curves TS-p, TS-s, ER-r, ER-u, ER-w (and ER-parse for XML).]
Figure 6.14: Summary figures. The results are for the single-ER case. (a) Raw events, Myrinet. (b) Raw events, Ethernet. (c) XML events, Myrinet. (d) XML events, Ethernet. – The thick lines denote the time consumption when the writing of events is fully replaced by sending the events directly to a listening module. The labels mean: pack (TS-p), send (TS-s), receive (effective, ER-r), unpack (ER-u), and write (ER-w).
words, each send command involves latency. If packets are too small, the total transfer time is<br />
mainly determined by the latency.<br />
The contribution of latency to the sending time is captured in the following tests. The theoretical sending time is calculated as follows: each raw event consists of 1 double and 5 integers, which results in 28 bytes per packet, given that one double and one integer correspond to 8 bytes and 4 bytes, respectively. Based on PMB [30] over Myrinet, the effective latency for a packet size of 28 bytes is 10 µs. Therefore, the theoretical time for sending 10 million events from a TS to an ER one by one is:
10,000,000 events × 10 µs/event = 100 s
From Figure 6.16 one can see that the transmission of the immediately reported events suffers from the latency contribution, compared to the buffered version shown in Figure 6.7(a). The theoretical value for unpacking should also be the same as the packing value, except that it is independent of the number of TSs. If the unpacking time were drawn on top of the effective receiving time for the immediately reported events, the figure would be a vertical upward shift of Figure 6.13(a), because of the latency contribution to the transmission time in the immediately reported events case.
[Four linear-scale plots: Raw and XML events over Myrinet and Ethernet; time in seconds (0-700) vs. number of traffic simulators, with curves TS-p, TS-s, ER-r, ER-u, ER-w, and ER-parse.]
Figure 6.15: Same plots as Figure 6.14, but on linear scale.<br />
[Log-scale plot "Sending Only, Raw Immediate Events": time in seconds vs. number of traffic simulators (1-16), with Myrinet curves for 1, 2, and 4 ERs and the theoretical value for immediate sending.]
Figure 6.16: Sending time for immediately reported events.<br />
The theoretical value for writing raw events is around 56 seconds. The writing time is also independent of the type of reporting, namely buffered or immediate.
create a C-type byte array
memcpy(byte_array, integer_item, sizeof(integer_item))
advance the pointer of byte_array by sizeof(integer_item)
memcpy(byte_array, double_item, sizeof(double_item))
advance the pointer of byte_array by sizeof(double_item)
send(byte_array, pointer, MPI::BYTE, destination)
Figure 6.17: Pseudo code for packing different data types with memcpy<br />
If events are sent to listening modules, several events should be put into a single packet in order to reduce the contribution of latency to the sending time.
6.11 Performance of Different Packing Methods for Events<br />
As explained in Section 6.9, when transferring data between modules, the packing and unpacking of events usually take more time than the actual sending and receiving. Thus, the packing algorithms are investigated in this section. Object serialization and different packing algorithms are discussed in Section 4.5.3, where the exchanged data refers to vehicles. In this section, the different packing approaches are applied to events; the raw events are used in the tests presented here. The packing algorithms discussed are memcpy in Section 6.11.1, MPI_Pack in Section 6.11.2, MPI_Struct in Section 6.11.3, and Classdesc in Section 6.11.4.
6.11.1 Using memcpy and Creating a Byte Array<br />
When using memcpy, the send and receive buffers are simple byte/char arrays, and the function memcpy is used to convert all data types into bytes. When adding data to a buffer, one must explicitly advance a pointer that points to the next available position in the buffer. When additional information has to be added to the buffer, such as the number of items in it, this is easily done with memcpy.
At the abstract level, the data is sent and received as byte arrays. When creating buffers for different purposes, such as transferring events or plans, the same higher-level functions (such as packing functions) can be used. Hence, this method benefits from its generic representation of the data.
A problem occurs if a cluster of computers with different machine representations is used: the defined data types may be converted differently into and from a byte array when the sender and the receiver do not share a common machine representation (e.g. byte order).
Figure 6.17 shows the instructions for packing an integer and a double into a byte/char array. The last line shows the send call: the byte array, whose size is given by the position pointer, is sent to the destination as the MPI::BYTE type.
6.11.2 Using MPI_Pack and MPI_Unpack
The MPI_Pack and MPI_Unpack functions pack explicitly typed items into a contiguous buffer and keep track of the current position in the packed
create a C-type byte array
MPI::INT.Pack(integer_item, 1, byte_array, max_size, pointer, comm);
MPI::DOUBLE.Pack(double_item, 1, byte_array, max_size, pointer, comm);
send(byte_array, pointer, MPI::BYTE, destination)
Figure 6.18: Pseudo code for packing different data types with MPI_Pack
typedef struct {
    int event_type, vehicle_ID, link_ID;
    double time;
} my_event_struct;

Figure 6.19: A C-type struct. It needs to be defined prior to using MPI_Struct
data, which means that it is also easy to add additional information into the message besides the events stream itself. The same functions are used for buffers serving different purposes. This method has an advantage over the memcpy method: the problem of different machine representations is solved by MPI_Pack and MPI_Unpack, since these functions use MPI data-types, which are the same on different computer architectures. In other words, these methods benefit from the standardization of MPI data-types across platforms.
The corresponding code for packing a single integer and a single double value using MPI_Pack and MPI_Unpack is shown in Figure 6.18.
In the code, max_size is the size of the memory buffer created, the second argument gives how many items of that particular type will be packed (in this example, there are 1 integer and 1 double), and pointer keeps track of the current position in the buffer.

6.11.3 Using MPI_Struct

MPI also allows user-defined data types: a C-type struct corresponding to the data can be committed as an MPI type and then sent directly (Figure 6.19).
typedef struct {
    int integer_item;
    double double_item;
} my_struct;

define an array of my_struct type
create corresponding MPI struct using MPI::Datatype::Create_struct
commit MPI struct as mpi_struct_type

struct_array[index].integer_item = integer_item;
struct_array[index].double_item = double_item;
send(struct_array, index, mpi_struct_type, destination)

Figure 6.20: Pseudo code for packing different data types with MPI_Struct
In order to pack an integer and a double into a buffer using MPI_Struct, as stated earlier, one must define a corresponding C structure so that it can be packed. A simple example is shown in Figure 6.20.
After the C-type struct is defined, it is committed as an MPI type using the command MPI::Datatype::Create_struct. Once the struct array is filled with data, it is sent.
6.11.4 Using Classdesc

Classdesc, discussed in Section 4.5.3, serializes C++ objects into byte arrays. Figure 6.21 shows how an integer and a double are packed into an MPIbuf buffer and sent.
int integer_item;
double double_item;
MPIbuf buffer;

// sender code
buffer << integer_item << double_item;
buffer.send(destination, tag);

// receiver code
buffer.get(source, tag);
buffer >> integer_item >> double_item;

Figure 6.21: Pseudo code for packing different data types with Classdesc
6.11.5 Comparison of Results<br />
Among the methods presented here, the MPI_Pack/MPI_Unpack and Classdesc packing and unpacking methods are the slowest ones. The MPI_Pack/MPI_Unpack functions convert data into byte arrays, and most of the addressing issues are handled via MPI calls, which causes overhead. Classdesc also converts everything into byte arrays, but it does not allow users to reuse buffers; instead, it enlarges the buffers by reallocating their memory areas.
Figure 6.22(a) shows the time elapsed for packing with the methods explained in the previous sections. Classdesc's tendency to extend the buffer size becomes a big problem when the available memory cannot hold everything and swapping eventually sets in. The effect can be seen in Figure 6.22(a) for small numbers of TSs.
MPI_Struct does the packing quickly, but a drawback of this method appears when the size of the data in the struct is unknown. As a result, variable-length data cannot be handled elegantly by this method. An example of this problem is given in Section 4.5.3.
The buffer created by MPI_Pack is transferred faster than those of the other three methods (see Figure 6.22(b)). However, the numbers are so close to each other that the underlying communication medium is evidently fast enough, and the main bottleneck is the packing (Figure 6.22(a)) and unpacking (Figure 6.22(c)) processes.
When events are passed via messages, MPI_Struct does the packing in the least time. However, when variable-length data is to be packed, it cannot be handled elegantly by this method. Depending on the data size, MPI_Pack or memcpy can be utilized. Classdesc is straightforward to use, but it requires well-formed C++ classes.
6.12 Conclusions and Discussion<br />
As discussed in Section 5.4, when the modules in a framework are coupled via files, I/O becomes a bottleneck. Besides efforts to improve the I/O performance, the I/O bottleneck can be avoided by using "messages".
In this chapter, the events from the traffic flow simulators are considered, since they are input for different strategy generation modules in the framework. When coupling modules via files, 10 million events are read in 349 seconds and written in 72 seconds, giving a total of 421 seconds. When modules are coupled via raw message passing, the total time required for packing and sending on one side and unpacking and allocating on the other side takes only
[Three log-scale plots for the single-ER, raw-events case, comparing memcpy, MPI_Pack, MPI_Struct, and Classdesc over Myrinet: "Packing Only", "Sending Only", and "Effective Receive + Unpack"; time in seconds vs. number of traffic simulators (1-16).]
Figure 6.22: Performance of Different Serialization Methods. (a) Packing, (b) Sending, (c)<br />
Receiving and Unpacking.<br />
9.3 seconds. Thus, introducing a message passing approach for the events among modules and keeping the events in memory makes the setup roughly 45 times faster than the setup coupled via files (421 s vs. 9.3 s).
The data format also needs to be considered. While coupling via messages with raw events takes 9.3 seconds, the same setup with XML events takes 629.8 seconds. Where computational issues are concerned, this difference tells us to keep it simple, as with the raw events. However, the extensibility and flexibility of the XML format may outweigh the simplicity and better performance of the raw events.
When the events are reported to the strategy generation modules immediately as they occur,<br />
the performance of the system is dragged down by the latency contribution of each send<br />
call. Hence, sending several events in a single packet, as shown in the buffered events case, is a<br />
necessity.<br />
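The buffering argument above can be sketched in code. This is an illustrative C++ sketch, not MATSIM code: the `Event` layout, the buffer capacity, and the counting in place of a real MPI send are all assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative raw event: a fixed-size record, as in the "raw events" case.
struct Event {
    std::int32_t time;
    std::int32_t agent_id;
    std::int32_t link_id;
};

// Buffers events and flushes them as one packet once the buffer is full,
// so the per-send latency is paid once per packet instead of once per event.
class EventBuffer {
public:
    explicit EventBuffer(std::size_t capacity) : capacity_(capacity) {}

    void add(const Event& e) {
        buf_.push_back(e);
        if (buf_.size() == capacity_) flush();
    }

    void flush() {
        if (buf_.empty()) return;
        // In the real system this would be a message send; here we just count.
        ++packets_sent_;
        events_sent_ += buf_.size();
        buf_.clear();
    }

    int packets_sent() const { return packets_sent_; }
    std::size_t events_sent() const { return events_sent_; }

private:
    std::size_t capacity_;
    std::vector<Event> buf_;
    int packets_sent_ = 0;
    std::size_t events_sent_ = 0;
};
```

With a capacity of 256 events per packet, reporting 1000 events costs four sends instead of a thousand.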
The most important conclusions of replacing the events file with messages for events are<br />
drawn below:<br />
- The events data should be provided to the different modules via messages instead of via files.<br />
- If events are exchanged as messages, they should be of raw type in the current implementation<br />
of MATSIM. If the parsing of XML tags is improved, events should be packed into<br />
XML strings.<br />
- To minimize the latency contribution to the total execution time, several events must be<br />
packed into a single packet.<br />
- Using Myrinet, multi-casting events to different modules needs to be considered, since in<br />
a framework the same full information might be used by more than one module.<br />
- On Ethernet, performance is markedly better when the events are distributed among modules<br />
than when they are multi-cast.<br />
- Among the different methodologies, MPI Pack should be chosen since it is more robust<br />
than the other packing algorithms. When the data length is fixed, MPI Struct should be<br />
preferred.<br />
6.13 Summary<br />
Chapter 5 gives different methodologies to couple modules defined in a framework. “Coupling”<br />
promotes data exchange between modules. There are two main data streams in a framework:<br />
plans and events. Events are covered in this chapter.<br />
An events recorder is an external module which keeps track of the events generated by the traffic<br />
flow simulators during a simulation run. There are three main issues to be considered when several<br />
events recorders are introduced into the system:<br />
- which events recorder collects which events (distributed ERs vs. multi-cast)<br />
- the type of events (raw vs. XML)<br />
- how to arrange packets of events (single vs. buffered)<br />
The tests for “multi-casting” are useful in the sense that they give us an idea of the performance<br />
when more than one strategy generation module needs the full information about the events.<br />
Distributed ERs means that the agents in the system are distributed such that all ERs are responsible<br />
for more-or-less the same number of agents, i.e., each agent has a dedicated ER.<br />
The event format is a choice between flexibility and better computational performance.<br />
The XML events are flexible, but extracting values out of an XML tag is time-consuming. The simplicity<br />
and better performance of the raw events are offset by their inflexibility.<br />
When sending events to an events recorder, several events are buffered into the same message.<br />
This reduces the contribution of the latency, which is incurred for each packet sent.<br />
Moreover, with the time measurement command gettimeofday, the measurements<br />
should be taken on a cumulative basis. This is because the inaccuracy of the command might<br />
Time nERs nTSs ET OP Note<br />
5s 1 1 raw pack memcpy<br />
101s 1 1 XML pack memcpy<br />
21s 1 1 raw pack MPI Pack<br />
3s 1 1 raw pack MPI Struct<br />
69s 1 1 raw pack classdesc<br />
1.2s 1 1 raw send myri, buf, dist<br />
4s 1 1 XML send myri, buf, dist<br />
25s 1 1 raw send eth, buf, dist<br />
86s 1 1 XML send eth, buf, dist<br />
11s 8 1 raw send myri, buf, mcast<br />
35s 8 1 XML send myri, buf, mcast<br />
224s 8 1 raw send eth, buf, mcast<br />
687s 8 1 XML send eth, buf, mcast<br />
88s 1 1 raw send myri, imm, dist<br />
6s 1 1 raw recv myri, buf, dist, eff<br />
105s 1 1 XML recv myri, buf, dist, eff<br />
6s 1 1 raw unpack memcpy, buf, dist, on top of recv<br />
101s 1 1 XML unpack memcpy, buf, dist, on top of recv<br />
59s 1 1 raw write , local<br />
21s 1 1 XML write , local<br />
81s 1 1 raw write , remote<br />
57s 1 1 raw write fprintf, local<br />
66s 1 1 raw write fprintf, remote<br />
Table 6.3: Summary table of the performance results of events transferred between TSs and ERs.<br />
cause a problem when measuring the operations one by one, especially when these operations<br />
are really fast.<br />
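The cumulative measurement idea can be illustrated as follows; the loop body is a stand-in for any operation too fast for gettimeofday to time individually.

```cpp
#include <sys/time.h>
#include <cstddef>

// Returns wall-clock time in seconds with microsecond resolution.
static double now_seconds() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Timing each fast operation individually is dominated by the timer's own
// resolution and overhead; timing the whole loop once and dividing by the
// iteration count gives a usable per-operation estimate instead.
double time_per_op_cumulative(std::size_t n) {
    volatile double x = 0.0;   // volatile so the loop is not optimized away
    double t0 = now_seconds();
    for (std::size_t i = 0; i < n; ++i)
        x = x + 1.0;           // the "really fast" operation
    double t1 = now_seconds();
    return (t1 - t0) / n;
}
```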
Table 6.3 summarizes the most important performance numbers measured during<br />
the different operations by switching different parameters on. The abbreviations used in the<br />
table are as follows: nERs and nTSs are the numbers of events recorders and traffic<br />
flow simulators; ET is the event type (raw or XML); OP is the operation measured;<br />
buf and imm mean that the events are reported in chunks and immediately as they occur, respectively;<br />
dist and mcast denote the distributed and multi-cast cases; eff is short for effective<br />
receiving; on top of recv refers to a time measured including the effective receive time.<br />
Chapter 7<br />
Plans Server<br />
7.1 Introduction<br />
The systematic relaxation approach, explained in Chapter 1, is a simulation-based solution<br />
to traffic dynamics with spill-back. Each iteration of the relaxation takes persons and their<br />
plans as input, executes them, and outputs performance information about the plans. In this thesis,<br />
that performance is output in the form of timestamped events. The timing information, for<br />
example, is later used to alter some of the routes such that the congestion in the following iteration<br />
lessens.<br />
A robust but slow implementation of this is using files: the traffic flow simulation reads<br />
plans from a file, runs them, produces events and writes them into a file as described in Section<br />
5.2.1. Then, some strategy generation modules read events to make adjustments in some<br />
of the plans or to select among the existing plans. The traffic flow simulation reads the updated<br />
plans and executes them and so on.<br />
In such a set-up, file I/O is a bottleneck because of the limitations on I/O operations imposed<br />
by disk speeds. A faster alternative to file I/O is passing data in messages. An example is shown<br />
in Chapter 6: events can be transmitted to events recorders (ERs) via messages. Similarly,<br />
traffic flow simulators (TSs) can receive plans from a server called the plans server (PS), instead<br />
of reading them from a file. The PS is a means for performance benchmarking; therefore, it<br />
does not construct any plans by itself but rather it reads them from a file. The question of<br />
interest is how much time it takes to get these plans to the TSs under various set-ups.<br />
7.2 The Competing File I/O Performance for Plans<br />
If the ch6-9 scenario (with 1 million agents) is used in a system coupled via files, the I/O<br />
performance for plans reading and plans writing is recorded as follows:<br />
When the plans are raw plans, reading them by fscanf takes 25 seconds; 11 seconds are<br />
spent on the memory allocation for the values read. Thus, the total time for the raw plans<br />
to be completely read is 36 seconds. The raw plans are written by the C++ output stream operator<br />
in 17 seconds after the data is retrieved from memory in 2 seconds. Thus, writing 1<br />
million plans is completed in 19 seconds. The total time of the file I/O operations for raw<br />
plans is accordingly 55 seconds.<br />
When XML [97] plans are used, reading XML plans by expat [21] takes 151 seconds.<br />
When the XML plans are read, the data values are kept in strings. Then, these strings<br />
are converted into the appropriate data types (such as integers, doubles, etc.) in 3 seconds<br />
by using string functions. Finally, 5 seconds are needed to allocate the memory for<br />
the person objects to be stored. Thus, the total time for reading becomes 159 seconds.<br />
On the other hand, prior to writing XML plans, the data retrieval from memory takes<br />
123 seconds. Then, the data values are written into a file by forming XML tags in 149<br />
seconds. Consequently, the total time spent for file I/O on XML plans is 431 seconds.<br />
Under these circumstances, this chapter investigates a message passing alternative for plans<br />
that avoids these reading and writing times.<br />
7.3 Benchmarks<br />
7.3.1 General<br />
The benchmark is simply set up to measure the time for transferring plans between PSs and TSs.<br />
PSs are implemented both in C++ [80] and Java [42] to compare their performance values.<br />
When using more than one PS, agents are distributed in a round-robin fashion among PSs such<br />
that each PS is responsible for an approximately equal number of agents.<br />
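Such a round-robin assignment can be sketched in one line; the 0-based agent index used here is an illustrative assumption, not MATSIM's actual agent numbering.

```cpp
// Round-robin assignment of agents to plans servers: agent i goes to
// PS i mod n_ps, so every PS ends up with roughly the same number of agents.
int ps_of_agent(int agent_index, int n_ps) {
    return agent_index % n_ps;
}
```

For example, 10 agents over 3 PSs yields counts of 4, 3 and 3.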
When the simulation starts, TSs read the street network information and the domain decomposition<br />
output. Then, they wait for PSs to finish the plans reading. Since PSs are assigned<br />
to a set of agents, while reading plans, they only keep the records of the agents that they are<br />
responsible for. Once the plans are in the memory, PSs multi-cast to TSs all the agent IDs and<br />
the links on which the agents start their execution. The PSs also communicate the earliest start time<br />
among all agents. This is important for the synchronization of the TSs running in parallel.<br />
TSs retrieve the information about all agents and check which agents start on their local<br />
network described by the domain decomposition. TSs send these agent IDs, i.e. IDs of agents<br />
that start execution on their local domains, back to PSs as feedback so that PSs send the complete<br />
agents information and the plans to TSs in the next step. Once the agents information is<br />
received, TSs start executing plans. The pseudo code of the benchmark is given in Figure 7.1.<br />
One might notice that some empty messages are exchanged before starting time measurements.<br />
This is necessary to synchronize all the modules in the system. Figure 7.2 graphically<br />
shows the execution sequence of tasks on a time-line when a single PS and a single TS interact.<br />
7.3.2 mpiJava<br />
Since all the other modules in the framework are implemented in C++, the plans server was<br />
also originally written in C++. A second implementation of the plans server, in Java, was created to<br />
compare the performance of C++ and Java.<br />
Java [42] is an object-oriented programming language developed by Sun Microsystems.<br />
It was created to address weaknesses of C++ such as the lack of garbage collection and of built-in multithreading.<br />
One problem with Java from the application’s point of view is that the MPI standard<br />
provides language-specific bindings for C, Fortran and C++, but no Java binding exists.<br />
Several groups have independently tried to develop MPI-like bindings for Java. Among<br />
these, mpiJava [93] was chosen since it is just a simple wrapper around the MPI implementation, MPICH [52],<br />
used in this thesis, and not a commercial effort.<br />
Another problem arises when modules written in different languages are used. Having PSs in<br />
Java and TSs in C++ requires MPICH and mpiJava to communicate. However, since mpiJava is a<br />
wrapper around MPICH, this works without further effort.<br />
Algorithm A – Plans Server<br />
while not EOF do<br />
read a plan<br />
keep the earliest time that an agent starts simulating<br />
if agent is mine then<br />
save agent details and its plan<br />
end if<br />
end while<br />
exchange fictitious messages with TSs to be synchronized<br />
time measurement for IDs starts<br />
pack all agent IDs and start link IDs along with simulation start time<br />
multi-cast packet of agent IDs and start link IDs to TSs<br />
collect from TSs feedback about which agent starts on which TS<br />
time measurement for IDs ends<br />
exchange fictitious messages with TSs to be synchronized<br />
time measurement for plans starts<br />
pack plans of agents into a single packet for each TS<br />
send packets of plans to TSs<br />
time measurement for plans ends<br />
(a) Plans Server<br />
Algorithm B – Traffic Simulator<br />
read domain decomposition result<br />
exchange fictitious messages with PSs to be synchronized<br />
time measurement for IDs starts<br />
receive agent IDs, start link IDs and earliest start time from PSs<br />
unpack agent IDs and start link IDs<br />
record simulation start time<br />
send back agent IDs that have start links which are on my sub-domain<br />
time measurement for IDs ends<br />
exchange fictitious messages with PSs to be synchronized<br />
time measurement for plans starts<br />
receive agents’ info and their plans<br />
unpack agents’ info and their plans<br />
time measurement for plans ends<br />
start simulating<br />
(b) Traffic Simulator<br />
Figure 7.1: Interaction of Plans Servers with Traffic Simulators<br />
7.4 Java and C++ Implementations of the Plans Server<br />
Although there are common approaches to the data structures available in C++ and Java, and<br />
furthermore the plans server implementation in Java is a projection of the one in C++, their<br />
performance results might be different. This section gives the details of data structures and<br />
operations used in different plans server implementations. Section 7.4.1 discusses packing<br />
data via different methods from plans servers. Specifically, the plans server written in C++<br />
packs the data by using the C function memcpy. The plans server written in Java, utilizes a<br />
self-implemented class called BytesUtil to pack the data. Section 7.4.2 explains different<br />
data structures on different plans servers to store the agents. The plans server in C++ uses<br />
[Figure: time-line of one PS (left) and one TS (right). For the ID exchange, the PS packs and multi-casts all agent IDs and start link IDs; the TS receives and unpacks them, finds the IDs local to its sub-domain, and packs and sends these local IDs back; the PS receives and unpacks them. For the plans exchange, the PS packs and sends the local agents’ plans; the TS receives and unpacks them and starts simulating. Time measurements bracket each of the two exchanges on both sides.]<br />
Figure 7.2: Sequence of Task Execution by TSs and PSs<br />
STL-multimap and the one in Java employs TreeMap structures to store the agents.<br />
7.4.1 Packing and Unpacking<br />
PS/TS in C++ packs/unpacks messages by calling memcpy as many times as the number of<br />
items to be packed/unpacked. memcpy can be called as a conversion function between different<br />
data types and bytes 1 . When different items are packed into/unpacked from the same byte<br />
buffer, pointers regarding the buffer must be moved forward explicitly. An example is shown<br />
in Figure 7.3.<br />
The Java implementation of the PS uses the same approach as the C++ version by converting<br />
all the data into bytes 2 by using a class called BytesUtil. The BytesUtil conversion<br />
methods take a byte buffer, the data itself and a position in the buffer as input arguments and<br />
write the data into the buffer starting at that position. The methods advance the position and<br />
return the updated index.<br />
1 In C/C++, a char is one byte long. Hence, the two are used interchangeably in C/C++.<br />
2 In Java, however, a char is two bytes long, so the two have different meanings. Since the TSs are written in C++,<br />
the PSs in Java use byte when transferring the data.<br />
int integer_item;<br />
double double_item;<br />
byte array buffer;<br />
memcpy(buffer, integer_item, sizeof(integer_item));<br />
move pointer to buffer by sizeof(integer_item);<br />
memcpy(buffer, double_item, sizeof(double_item));<br />
move pointer to buffer by sizeof(double_item);<br />
Figure 7.3: Pseudo code for packing different data types with memcpy<br />
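A compilable version of the Figure 7.3 pseudo code might look as follows; the function names and the explicit offset handling are illustrative, not MATSIM's actual packing code.

```cpp
#include <cstring>
#include <cstddef>

// Packs an int followed by a double into buf with memcpy, moving the offset
// forward explicitly after each item. Returns the number of bytes written.
std::size_t pack_int_double(unsigned char* buf, int i, double d) {
    std::size_t pos = 0;
    std::memcpy(buf + pos, &i, sizeof(i));
    pos += sizeof(i);                 // move the buffer pointer explicitly
    std::memcpy(buf + pos, &d, sizeof(d));
    pos += sizeof(d);
    return pos;
}

// Unpacking mirrors packing: the same memcpy calls, the same explicit offsets.
std::size_t unpack_int_double(const unsigned char* buf, int* i, double* d) {
    std::size_t pos = 0;
    std::memcpy(i, buf + pos, sizeof(*i));
    pos += sizeof(*i);
    std::memcpy(d, buf + pos, sizeof(*d));
    pos += sizeof(*d);
    return pos;
}
```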
// Conversion function: from int to bytes<br />
// The other functions are analogous<br />
// but with different numbers of bits.<br />
public int intToBytes(int num, byte[] bytes, int startIndex) {<br />
  bytes[startIndex] = (byte) (num & 0xff);<br />
  bytes[startIndex + 1] = (byte) ((num >> 8) & 0xff);<br />
  bytes[startIndex + 2] = (byte) ((num >> 16) & 0xff);<br />
  bytes[startIndex + 3] = (byte) ((num >> 24) & 0xff);<br />
  return startIndex + 4;<br />
}<br />
<br />
int integer_item;<br />
double double_item;<br />
byte[] buffer;<br />
int start_index;<br />
start_index = intToBytes(integer_item, buffer, start_index);<br />
start_index = doubleToBytes(double_item, buffer, start_index);<br />
Figure 7.4: An example for the methods of BytesUtil<br />
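For comparison, a C++ counterpart of the BytesUtil method above could look like this; the name int_to_bytes and the least-significant-byte-first order (matching the Java code in Figure 7.4, which writes the low byte first) are assumptions of this sketch.

```cpp
#include <cstdint>
#include <cstddef>

// Writes the int least-significant byte first (as the Java intToBytes above
// does) and returns the next free index, so calls can be chained.
std::size_t int_to_bytes(std::int32_t num, unsigned char* bytes,
                         std::size_t start) {
    bytes[start]     = static_cast<unsigned char>(num & 0xff);
    bytes[start + 1] = static_cast<unsigned char>((num >> 8)  & 0xff);
    bytes[start + 2] = static_cast<unsigned char>((num >> 16) & 0xff);
    bytes[start + 3] = static_cast<unsigned char>((num >> 24) & 0xff);
    return start + 4;
}
```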
7.4.2 Storing Agents in the Plans Server<br />
The data structure for storing the agents is nontrivial, not only with respect to the memory it<br />
consumes but also with respect to the access time to the agents. The PS in C++ uses the STL<br />
(Standard Template Library, Section 3.3.1) multimap. The STL-multimap for the agents<br />
holds pointers to the agents; therefore, the memory consumption is as big as the agents themselves.<br />
Each PS creates one linked list per TS. Each linked list holds pointers to the agents<br />
that the corresponding TS is interested in. When the TSs send back the agent IDs to request the<br />
agents’ data, the PSs search for the agent IDs in the STL-multimap and add the pointer to each agent<br />
to the corresponding TS’s linked list. The code is given in Figure 7.5.<br />
The PS in Java uses a Java TreeMap to store all the agents read at the beginning. Then, it<br />
creates a Java Vector for each TS once it knows the domain decomposition and which<br />
TSs the agents belong to. The code is given in Figure 7.6.<br />
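The multimap-plus-linked-lists scheme can be sketched as runnable C++; the Agent struct and the ts_of map standing in for the TSs' feedback are illustrative assumptions.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <vector>

// Minimal agent record; the real MATSIM agent carries a full plan.
struct Agent { int id; };

// Stores agent pointers in an STL multimap keyed by agent ID (as the C++ PS
// does) and distributes them into one linked list per TS, given each agent's
// TS assignment (ts_of stands in for the feedback the TSs send back).
std::vector<std::list<Agent*>>
distribute(const std::multimap<int, Agent*>& agents,
           const std::map<int, int>& ts_of, int n_ts) {
    std::vector<std::list<Agent*>> sub_agents(n_ts);  // one list per TS
    for (const auto& kv : agents) {
        auto it = ts_of.find(kv.first);  // which TS wants this agent?
        if (it != ts_of.end())
            sub_agents[it->second].push_back(kv.second);
    }
    return sub_agents;
}
```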
// C++ version<br />
int key;<br />
for (the number of agents) times {<br />
  create an agent;<br />
  agents.insert(make_pair(key, agent));<br />
}<br />
LinkedList sub_agents; // as many as TSs<br />
create a sub_agent linked list for each TS<br />
receive distribution information of agents from TSs<br />
for each agent in agents {<br />
  find the TS ID of the agent that it belongs to<br />
  sub_agents[TS_ID].push_back(agent);<br />
}<br />
for each traffic flow simulator {<br />
  prepare a send packet using<br />
  the corresponding sub_agent linked list<br />
}<br />
Figure 7.5: Data structures for agents in a C++ Plans Server<br />
// Java version<br />
Object key;<br />
TreeMap agents = new TreeMap();<br />
for (the number of agents) times {<br />
  create an agent;<br />
  agents.put(key, agent);<br />
}<br />
Vector[] sub_agents; // as many as TSs<br />
create a sub_agent vector for each TS<br />
receive distribution information of agents from TSs<br />
for each agent in agents {<br />
  find the TS ID of the agent that it belongs to<br />
  sub_agents[TS_ID].add(agent);<br />
}<br />
for each traffic flow simulator {<br />
  prepare a send packet using<br />
  the corresponding sub_agent vector<br />
}<br />
Figure 7.6: Data structures for agents in a Java Plans Server<br />
7.5 Theoretical Expectations<br />
To calculate the figures for the theoretical expectations, a program from [35] that counts clock cycles<br />
and PMB [30], which measures the performance of MPI, are used. PMB is explained<br />
in Section 6.4. The calculations below are based on the framework modules written in<br />
C++ and on a system with a single PS. The PSs and TSs are run on a cluster whose nodes have<br />
PIII 1 GHz (1 billion cycles per second) CPUs. More details about the cluster can be found in<br />
Section 4.5. The underlying network used in the tests here is Myrinet [54].<br />
7.5.1 PSs Pack<br />
Agent IDs and their start link IDs<br />
A PS creates a single packet composed of all the agent IDs assigned to itself and their start link<br />
IDs. The packing procedure is composed of 2 calls to the C function memcpy for each agent.<br />
Packing an agent ID and a start link ID uses up about 330 clock cycles according to the clock<br />
cycle counter. Hence, for a system with a single PS, it will take<br />
330 cycles x 1,000,000 agents / 10^9 cycles/sec = 0.33 sec<br />
to pack about 1 million agent IDs and their start link IDs.<br />
Agents and their plans<br />
Again, a single packet is created for all the agents’ information and their plans by using the memcpy<br />
library call. The data packed for each agent contains an agent ID, a start link ID, a route length,<br />
the node IDs that the agent must pass through during its trip, the duration of the activity, an end time of<br />
the activity, a leg number and a destination link ID.<br />
Counting the clock cycles shows that each agent is packed in 1800 clock cycles. Therefore,<br />
1 million agents will be packed in<br />
1800 cycles x 1,000,000 agents / 10^9 cycles/sec = 1.80 sec.<br />
7.5.2 PSs Send and TSs Receive<br />
As discussed in Section 6.8.2, the latency and the bandwidth contributions measured by PMB<br />
for packets of different sizes on the cluster are additive. Consequently, the theoretical<br />
expectation for the transmission time is:<br />
transmission time = latency + (packet size) / bandwidth<br />
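The additive latency-plus-bandwidth model can be written as a one-line helper; the numbers in the example below are illustrative, not the measured cluster values.

```cpp
// Additive transmission model: the time to move one packet is the latency
// plus the packet size divided by the bandwidth.
double transmission_time(double latency_s, double bytes,
                         double bandwidth_bytes_per_s) {
    return latency_s + bytes / bandwidth_bytes_per_s;
}
```

With, say, 0.1 ms latency and roughly 229 MB/s of bandwidth, an 8 MB packet would take about 0.035 s.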
Agent IDs and their start link IDs<br />
Using 1 million agents gives a single packet of approximately 8 MBytes in size, knowing that<br />
agent IDs and link IDs are integer values (4 bytes each). Using PMB, the transmission time for<br />
an 8 MB packet is found to be approximately 35 milliseconds.<br />
Therefore, 0.035 seconds is needed for a single PS to send all agent IDs and their start link<br />
IDs to a single TS.<br />
Agents and their plans<br />
During the tests, it is observed that a packet which contains all the agent information and plans<br />
is approximately 95 MBytes long. PMB gives a transmission time of approximately 420 milliseconds<br />
for a packet of this size. The theoretical value for sending all agents and their plans from a single<br />
PS to a single TS is therefore about 0.42 seconds.<br />
7.5.3 TSs Unpack<br />
Agent IDs and their start link IDs<br />
Unpacking a single agent ID and its start link ID takes, as expected, the same amount of time<br />
as packing a single agent ID and its start link ID, because the memcpy function is called<br />
twice, as in the packing case. Given that the number of clock cycles needed to unpack an agent<br />
ID and its start link ID is around 330, in a system consisting of a single PS and a single TS, the TS<br />
unpacks 1 million agent IDs and their start link IDs in<br />
330 cycles x 1,000,000 agents / 10^9 cycles/sec = 0.33 sec.<br />
Agents and their plans<br />
Similarly, unpacking a single agent with its plans takes 8800 clock cycles, whereas packing<br />
takes only 1800. The difference of 7000 clock cycles comes from creating the agents, searching<br />
for the start and end link IDs in the local network and setting the agent information before starting<br />
the simulation. Therefore, when there is a single PS and a single TS, 1 million agents are effectively<br />
unpacked in<br />
8800 cycles x 1,000,000 agents / 10^9 cycles/sec = 8.80 sec.<br />
7.5.4 TSs Pack and Send<br />
Local Agent IDs<br />
When the TSs, as shown in Algorithm 7.1(b), receive all the agent IDs and their start link IDs, they<br />
unpack these values and then search for the link IDs that are local to them (based on the domain<br />
decomposition). If a link ID is local to a TS, then all the agent IDs that start on that link are<br />
added to a send buffer that will be sent back to the corresponding PS, which is responsible for<br />
those agents. Thus, for a TS, packing the local agent IDs means searching for the start link IDs<br />
among the local links and packing those agents that start on the TS’s local network. The search<br />
algorithm is binary search, for the performance reasons explained in Section 3.3.2.<br />
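The local-link search could be sketched as follows, assuming (as an illustration) that the local link IDs are kept sorted so that std::binary_search applies; the function name and the (agent ID, start link ID) pair representation are assumptions of this sketch.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Checks whether each agent's start link belongs to this TS's sub-domain by
// binary search over the sorted local link IDs, and collects the IDs of the
// agents that start locally for the reply packet to the PS.
std::vector<int> find_local_agents(
    const std::vector<int>& sorted_local_links,
    const std::vector<std::pair<int, int>>& agent_start  // (agent ID, start link ID)
) {
    std::vector<int> local_ids;
    for (const auto& as : agent_start)
        if (std::binary_search(sorted_local_links.begin(),
                               sorted_local_links.end(), as.second))
            local_ids.push_back(as.first);
    return local_ids;
}
```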
The number of clock cycles needed for searching a link ID in the local links and packing<br />
an agent ID that is on a local link is recorded as 830 cycles; therefore, 1 million agent IDs will<br />
be packed in<br />
830 cycles x 1,000,000 agents / 10^9 cycles/sec = 0.83 sec.<br />
A TS sends the agent IDs back in a single packet of 4 MBytes. Using PMB, the transmission time<br />
for a 4 MB packet is roughly half of the value for an 8 MB packet, i.e. approximately 18 milliseconds<br />
(0.018 seconds) per transfer.<br />
7.5.5 PSs Unpack<br />
Local Agent IDs<br />
An agent ID is unpacked by a PS in 550 cycles. This number includes finding the agent ID in<br />
the plan and setting its corresponding TS value. In a system with 1 million agents, all the IDs<br />
will be unpacked by a PS in<br />
550 cycles x 1,000,000 agents / 10^9 cycles/sec = 0.55 sec.<br />
7.5.6 Multi-casting Plans<br />
As an alternative, the plans can be multi-cast to the TSs; accordingly, each TS receives all the plans and<br />
stores only the ones that are local to its network. When there is only a single PS and the<br />
PS packs the plans for the multi-cast, the packing is done once and takes 1.80 seconds, as explained<br />
above. This packing process results in a message/packet with a size of 95 MBytes. Sending this<br />
big packet to one TS takes 0.42 seconds. If there are n TSs in the system, sending the packet<br />
to all the TSs takes n x 0.42 seconds, since each send incurs the transfer cost again. Finally, when the<br />
packet is retrieved by a TS, the TS unpacks the packet and checks whether a plan/person starts on its<br />
local domain. As stated above, a TS unpacks the complete plans in 8.80 seconds (unpacking the<br />
values takes 1.80 seconds and creating and inserting objects into the appropriate links takes<br />
7.00 seconds) and checks whether a plan starts on its sub-domain in 0.83 seconds. Thus, when the plans<br />
are multi-cast to the TSs, the theoretical time for packing and sending by a PS and receiving and<br />
unpacking by the TSs is approximately<br />
1.80 + n x 0.42 + 8.80 + 0.83 = 11.43 + 0.42 n seconds.<br />
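The theoretical multi-cast cost as a function of the number of TSs can be captured in a small helper using the component times derived above (pack 1.80 s, 0.42 s per send, unpack 8.80 s, sub-domain check 0.83 s).

```cpp
// Theoretical multi-cast cost for a single PS: pack once, send the 95 MB
// packet to each of n TSs (every send pays the transfer again), then each
// TS unpacks and filters for its sub-domain.
double multicast_total_seconds(int n_ts) {
    const double pack = 1.80, send_per_ts = 0.42;
    const double unpack = 8.80, filter = 0.83;
    return pack + send_per_ts * n_ts + unpack + filter;
}
```

For one TS this gives 11.85 seconds; for 16 TSs, 18.15 seconds.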
7.6 Results<br />
The tests are repeated for systems with different numbers of PSs and TSs. The theoretical values<br />
are also added to the figures and labeled “TV”.<br />
7.6.1 PSs Pack<br />
The packing times for IDs and plans are shown in Figure 7.7. This figure shows only the packing<br />
time: the measurement starts right before the first data item is packed and ends right after the last<br />
item is in the buffer. The theoretical curve and the resulting curves for a system with one PS<br />
are approximately equal, as expected. The curves are constant along the y-axis because the PSs<br />
must pack all the agents that they have, no matter how many TSs are in the system.<br />
As the number of PSs increases, the packing time decreases, since adding more PSs to the<br />
system decreases the number of agents per PS to be packed.<br />
Since there is no built-in Java function to convert the different data types into bytes and<br />
vice versa, several such functions are explicitly implemented in Java, as shown in Figure 7.4.<br />
Figure 7.7 shows that these supplementary functions in Java are 6 and 3 times slower than the<br />
memcpy function in C when packing IDs and plans, respectively.<br />
7.6.2 PSs Send<br />
The time measurement only for the sending time over Myrinet is shown in Figure 7.8. As<br />
the number of PSs in system increases, the total sending time decreases since the send buffer<br />
contains less elements.<br />
[Figure 7.7 appears here: log-scale plots of packing time (seconds) versus the number of traffic simulators (1-16), with curves for C++ and Java PSs with 1, 2 and 4 PSs, plus the theoretical value (TV) for the C++ 1-PS case.]
Figure 7.7: Time elapsed for packing. (a) Packing agent IDs and start link IDs. (b) Packing agents. Having n PSs means that approximately 1/n of the agents are handled by each PS.
The curves in Figure 7.8(a) go up linearly. This is because the PSs send all agent IDs and their start link IDs to all the TSs; therefore, the latency contribution of each transfer accumulates as the number of recipients (TSs) increases. Multi-casting the agent IDs and their start link IDs is handled by a "for" loop in which the PSs multi-cast the same data to all TSs in the system; after each send to a TS completes, the PS transfers the data to the next TS. In order to minimize the competition among PSs for the same TSs, the sequence of the "for" loop is shuffled for each PS. Hence, each PS follows a different sequence of TS IDs when sending the data.
The curves in Figure 7.8 are almost equal for the same number of PSs whether Java or C++ PSs are used. This is because in both cases the send and receive functions are the ones provided by MPI.
When sending the agents' information and the plans, the PSs create a separate packet for each TS in the system and then initiate the send to that TS in a "for" loop. Again, the sequence of the loop is shuffled to minimize the effects of PSs competing for the same TSs in the same order.
[Figure 7.8 appears here: log-scale plots of sending time versus the number of traffic simulators (1-16), with the same curve set as Figure 7.7.]
Figure 7.8: Time elapsed only for sending over Myrinet. (a) Sending agent IDs and start link IDs. (b) Sending agents.
7.6.3 TSs Receive
The receiving time measurement starts just before the MPI_Recv command is called. As shown in the algorithm in Figure 7.1, the overall measurement ends after unpacking finishes; in order to measure the receiving time alone, the unpacking process shown in Figure 7.1(b) is excluded from the time measurement. Moreover, "empty" messages are exchanged before the receiving measurement takes place, to synchronize the processes and to exclude other intermediate operations.
The effective receiving time curves over Myrinet are shown in Figure 7.9. If one adds up the figures for the packing time (Figure 7.7) and the sending time (Figure 7.8), the result is the same as Figure 7.9. This is what one expects: when a receive command is issued, the receiver must wait until the data arrives. Therefore, the receiving time includes not only the actual receiving time but also the time spent waiting for the sender. While the TSs wait for data from the PSs, the PSs pack the data; that is why the receiving time is the sum of the sending and packing times.
The curves are nearly constant, since most of the receiving time consists of waiting for data. The waiting time contribution comes from the PSs, which are packing (see Figure 7.7).
[Figure 7.9 appears here: log-scale plots of effective receiving time versus the number of traffic simulators (1-16), with curves for C++ and Java PSs with 1, 2 and 4 PSs.]
Figure 7.9: Time elapsed for the effective receiving time over Myrinet. (a) Effective receiving of agent IDs and start link IDs. (b) Effective receiving of agents.
Figure 7.9(a) shows a slight increase of the curve as the number of TSs increases. This results from the fact that, for a constant number of PSs, the data is transferred to all the TSs (Figure 7.8).
7.6.4 TSs Unpack
When the TSs unpack the agent IDs and their start link IDs, the total unpacking time is almost constant. This is because every TS must retrieve all agent IDs and start link IDs to find out which ones start on its local domain. The resulting curves for unpacking on top of the effective receiving time over Myrinet are shown in Figure 7.10(a).
Even if the number of PSs changes, the total number of agents in the system does not. Hence, the total number of agent IDs and start link IDs to be unpacked on the TS side stays the same. The difference between the curves in Figure 7.10(a) comes from the fact that they represent the unpacking time on top of the effective receiving time.
If the data transferred is the agents' information and the plans, each TS gets a packet that consists of all the agents that start on its own sub-domain. As the number of TSs increases,
[Figure 7.10 appears here: log-scale plots of effective receive plus unpack time versus the number of traffic simulators (1-16), with the same curve set plus the theoretical value (TV) for the C++ 1-PS case.]
Figure 7.10: Time elapsed for unpacking on top of the effective receiving time over Myrinet. (a) Unpacking agent IDs and start link IDs. (b) Unpacking agents.
the total number of agents starting on the sub-domain of a TS decreases. Figure 7.10(b) shows the unpacking time for agents on top of the effective receiving time.
7.6.5 TSs Pack and Send
After a TS unpacks the agent IDs and their start link IDs, it must check which agents start in its local sub-domain. The search is performed with a binary search method. If the start link ID of an agent is found in the local domain of a TS, the TS adds the agent ID to a send buffer that will be sent to the dedicated PS. Thus, before a TS packs agent IDs, the search algorithm is run for all the start link IDs. In other words, the packing time for local agent IDs by the TSs is dominated by the search algorithm and is independent of the number of TSs and PSs in the system. The resulting curves for the packing time are shown in Figure 7.11. There is no difference between the C++ and Java versions of the PSs, since the packing is done by the TSs and is independent of the implementation of the PSs.
Figure 7.12 shows the sending time over Myrinet for the agent IDs by TSs. Since TSs send<br />
[Figure 7.11 appears here: log-scale plot of the TSs' packing time for agent IDs versus the number of traffic simulators (1-16).]
Figure 7.11: Time elapsed for packing agent IDs by TSs.
[Figure 7.12 appears here: log-scale plot of the TSs' sending time for agent IDs versus the number of traffic simulators (1-16).]
Figure 7.12: Time elapsed for sending agent IDs by TSs to PSs over Myrinet.
packets to several or all of the PSs in the system, the latency contribution to the total sending time increases as the number of PSs increases. On the other hand, as the number of TSs increases, the packet size decreases, i.e., each packet is transferred faster. Moreover, as the number of TSs in the system increases, the competition between TSs for the PSs becomes more pronounced. The curves in Figure 7.12 show all of these effects.
7.6.6 PSs Receive and Unpack
Figure 7.13 shows the effective receiving time over Myrinet for the agent IDs received by the PSs. Given a constant number of PSs, the total number of IDs sent to each PS is independent of the number of TSs. Furthermore, as stated earlier, the effective receiving time also includes the waiting time before a packet enters the receive buffer. The curves are almost constant because, after a PS issues a receive command, it waits for the TSs, which execute a binary search algorithm to find the local link IDs and pack the agent IDs on the local links. As shown in
[Figure 7.13 appears here: log-scale plot of the PSs' effective receiving time for agent IDs versus the number of traffic simulators (1-16).]
Figure 7.13: Time elapsed for receiving agent IDs by PSs over Myrinet.
[Figure 7.14 appears here: log-scale plot of the PSs' effective receive plus unpack time for agent IDs versus the number of traffic simulators (1-16), including the theoretical value (TV) for the C++ 1-PS case.]
Figure 7.14: Time elapsed for unpacking agent IDs by PSs on top of the effective receiving time over Myrinet.
Figure 7.11, the time elapsed for packing is constant, since the binary search algorithm dominates the elapsed time and is executed for all the link IDs. This constant effect is also seen in Figure 7.13.
Figure 7.14 shows the unpacking time for the PSs on top of the effective receiving time over Myrinet. The most obvious result is the difference between the Java and C++ implementations of the unpacking procedure of a PS. Although the Java implementation unpacks the agent IDs 3.5 times slower than the C++ one for a system with one PS, increasing the number of PSs reduces the difference; for example, in a system with 4 PSs, the Java version is 2.7 times slower than the C++ version.
[Figure 7.15 appears here: log-scale stacked-time plots, one panel for the C++ plans server over Myrinet and one for the Java plans server over Myrinet; curve labels range from PS-p all-ids to TS-u agents, plus file I/O.]
Figure 7.15: Summary figures. The results are for the single-PS case. (a) Plans server written in C++. (b) Plans server written in Java. The thick lines denote the time consumption if reading the plans were fully replaced by sending them directly to a traffic flow simulator. Each label denotes a curve showing the total time of the operations up to and including the operation of the label; for example, the curve "PS-m all ids" is drawn on top of "PS-p all ids". The label abbreviations mean: p = pack, m = multicast, er = effective receive, u = unpack, s = send.
[Figure 7.16 appears here: the same two panels on a linear scale, 0-350 seconds.]
Figure 7.16: Same plots as Figure 7.15, but on a linear scale.
7.7 Conclusions and Summary
Plans are input to traffic flow simulators. Traditionally, plans are read from a file. Because of the file I/O inefficiency explained in Section 5.4, transferring plans between modules via messages is explored in this chapter.
The "ch6-9" scenario with approximately 1 million agents is read in 159 seconds in C++ and 240 seconds in Java. These numbers include the memory allocation for the agents. The plans of the agents in the same scenario are written in 272 seconds in C++ and 100 seconds in Java. Hence, the total time required for I/O is 431 and 340 seconds for the C++ and Java cases, respectively.
When the plans fit into memory, instead of dumping them into files, the plans servers pack them from memory and send them to the traffic flow simulators. The traffic flow simulators receive, unpack and allocate the agents, and then start simulating. Two big chunks of data are transferred
between the plans servers and the traffic flow simulators to accomplish this setup:
- the agent IDs and their start link IDs, so that the traffic flow simulators can distinguish the agents that start on a link belonging to their own sub-domain under the domain decomposition;
- the agents' information and their plans, so that the traffic flow simulators can start simulating.

Time   nPSs  nTSs  OP             Note
1.87s  1     1     pack agents    C++, memcpy
5.06s  1     1     pack agents    Java, BytesUtil
0.48s  1     1     send agents    C++, mpi, MPICH
0.50s  1     1     send agents    Java, mpi, mpiJava
2.35s  1     1     recv agents    C++, mpi, MPICH, eff
5.56s  1     1     recv agents    Java, mpi, mpiJava, eff
11.2s  1     1     unpack agents  C++, memcpy, on top of recv
14.4s  1     1     unpack agents  Java, BytesUtil, on top of recv

Table 7.1: Summary table of the performance results for plans transferred between TSs and PSs.
According to the tests, the total time elapsed for packing, sending, receiving, unpacking and allocating memory for the agents is approximately 13 seconds in the C++ case and 22 seconds in the Java case. Therefore, the transition from the XML plans file to messages speeds up the time for the traffic flow simulators to obtain the plans by 33 times in C++ and by 16 times in Java. The summary figures are shown in Figure 7.15 and Figure 7.16.
The most important conclusions about transferring plans via messages are the following:
- Reading plans from a file and writing them into files should be replaced by sharing plans among modules via messages.
- Converting data into a byte array in Java performs somewhat worse than in C++.
- Since the underlying MPI implementation is the same for mpiJava and MPICH, the MPI functions do not show a difference.
- Plans can be multi-cast to modules so that they are transmitted once as a whole.
If sharing plans via files were fully replaced by sending the plans directly to a traffic flow simulator, the computational performance, as far as plans in the XML file format are concerned, would be 12 times better in C++ and 11 times better in Java.
Table 7.1 summarizes the most important performance numbers, which were collected for operations under different circumstances. The abbreviations used in the table mean the following: nPSs and nTSs are the numbers of plans servers and traffic flow simulators; OP is the operation measured; MPICH and mpiJava are the MPI implementations used for C++ and Java, respectively; eff is short for effective receiving; on top of recv refers to the time measured including the effective receive time.
Chapter 8
Going beyond Vehicle Traffic
8.1 Introduction
Historically, the demand for connecting users incited developments in communication systems. Communication over a dedicated circuit was the first step. This is known as a circuit-switched network, which establishes a physical channel (circuit, path) dedicated to a single connection for the duration of the transmission between two end-points. The telephone system is circuit-switched, as it is connection-oriented.
The Internet, an appealing evolution in this history, initiated the changeover from circuit-switched networks to packet-switched networks, which allow sharing the physical channel among multiple virtual/logical connections. In this type of connection, messages are transmitted as packets, which are re-assembled at the destination to form the original message.
Telephone networks are encompassed by mathematics at all levels, such as design, control and management [88]. These networks guarantee quick transmission and arrival of data in the order in which it is sent; the entire message follows the same path. Telephone networks are said to have a static nature, since variability in them is rare. The routers keep track of the active connections in order to forward the data. Data transfers in the telephone network are modeled by a Poisson distribution, since call arrivals are mutually independent and call durations are exponentially distributed with a single parameter [14]. The static nature of telephone networks matches the Poisson model (the aggregated traffic becomes less bursty as the number of traffic sources increases), so analysis of the data and predictions can easily be carried out.
Data packet traffic (in packet-switched networks) exhibits different characteristics from voice traffic (in telephone networks). As opposed to voice traffic,
- the variability in duration for data packet traffic is vast,
- the packets of a message might follow different routes on the way to the destination,
- each packet carries header information, so routers only check the header to forward the data.
Therefore, more than a single-parameter Poisson distribution is required to understand data packet traffic. In contrast to a Poisson process, as the number of sources (users) increases, the resulting aggregated data packet traffic becomes more bursty instead of smoother. Previous works ([88], [49]) showed that there is increasing evidence that self-similar (fractal) behavior arises in data packet networks on large time scales. A process is said
to be self-similar with self-similarity parameter H (the Hurst parameter) if the aggregated processes have the same correlation structure as the original process. As a consequence, the variance of the arithmetic mean decreases more slowly than the reciprocal of the sample size [87].
The self-similar nature of LAN traffic (aggregated traffic, i.e., the number of packets or bytes per time unit sent over the Ethernet [77] by all active hosts) was shown by Leland et al. in [87]. In other words, LAN traffic measured on the scale of microseconds or seconds exhibits the same characteristics as on larger time scales. Paxson and Floyd showed in [62] that WAN traffic is also self-similar.
Self-similar processes can be described by a power-law function, which roughly relates new scales to old scales by constant factors. A power law describes systems in which large events are rare and small ones are quite common. For example, only a few web sites are visited by an enormous number of people, in contrast to the millions of web pages accessed by far fewer. Self-similar processes have the advantage over the Poisson distribution that there is no defined natural length of a "burst", which can range from a few milliseconds to minutes or hours.
Besides the self-similarity of data packet traffic at the macroscopic level (aggregated traffic), Willinger et al. demonstrated in [89] that data packet traffic at the microscopic level (the traffic pattern displayed by individual source-destination pairs) follows heavy-tailed models.
Huisinga et al. [81] give a microscopic model for data packet traffic in the Internet, described as a simple one-dimensional model. The model investigates the congested and free-flow phases in the presence of a slow router in the system. It introduces a simple cellular automaton with a finite buffer on each router. When a packet needs to leave a router for the next destination, its movement must obey a router-specific probability, in addition to the availability of buffer space in the next router. All buffers are FIFO queues and are updated in parallel. Travel times of the packets are measured for the free-flow and jammed regimes as a defect router is introduced into the system. The results show that in both cases the travel times obey power-law characteristics.
8.2 Queue Model as a Possible Microscopic Model for Internet Packet Traffic
In this section, the queue model described in Chapter 2 is investigated as a possible model for data packet traffic in the Internet. The queue model can be used to simulate internetworks, since routing is employed at this level. Moreover, since the queue model is designed to be large-scale, large scenarios such as Distributed Denial of Service (DDoS) attacks can be simulated at this level.
The graph data is described as follows: routers and hosts are the nodes. Those which are part of several networks can have more than one interface; each interface is then assigned a unique IP address. Links, on the other hand, can refer to cables, modems, Digital Subscriber Lines (DSL) or satellites, which connect two nodes. The agents of such a system are the data packets. In contrast to vehicles, they only know their destination, not their route.
The queue model explained in Chapter 2 needs some modifications to fit Internet packet traffic better. The storage constraint of a link corresponds to the number of sites/spaces available on the link; it is inversely proportional to the packet length, i.e.,

    N_sites = C_link / l_packet ,

where C_link is the storage capacity of the link and l_packet is the packet length.
The spatial queue and the buffer of the link can be thought of as the incoming and outgoing memories of a network card: the spatial queue is the outgoing memory and the buffer is the incoming memory. Thus, a packet about to leave the sender-side network card is put into the outgoing memory, and it is put into the incoming memory upon its arrival at the receiver side.
Consequently, the packets are moved from the outgoing memory of the sender side to the incoming memory of the receiver side at the rate given by the capacity of the link. The node-to-node bandwidth (the amount of data that can pass between two nodes in one second) is given by the capacity of the link, e.g. 100 Mbit/s for a 100 Mbit Ethernet LAN. This corresponds to the flow capacity defined in the queue model.
The last constraint of the queue model is the free-flow travel time, which in vehicle traffic is the ratio of the link length to the free-flow velocity. For packet traffic in the Internet, the free travel time is defined as the node-to-node latency, which includes the processing overhead of initialization at the network card, copying data between memory and the network, and the transfer time of the data from the sender to the receiver.
When the agents are vehicles, they are aware of their routes, since the route is predefined in the queue simulation. Internet packets, in contrast, only know their destination. When a router receives an incoming packet, it only checks the header of the packet to find, via a routing table, the next hop that the packet should follow. Hence, the nodes in vehicle traffic carry no constraints, but this no longer holds when Internet data packets are the agents of the simulation: each node (router) has a limit on the number of packets it can handle per unit time.
Vehicles are also informed when and on which link to start. During the simulation, the creation of new vehicles beyond those defined at the beginning is unusual. When Internet data packets are chosen as agents, they contradict some features of vehicles. Packets only know their destinations; their routes from source to destination are computed on the fly. Moreover, the dynamics of the Internet allows new packets to be created; for example, FTP [32] or HTTP [12] requests result in the creation of new packets to carry the response to the requester. As the types of packets in the Internet vary, they cause different event handlers to take the corresponding actions; the responses to an FTP and to an HTTP request, for example, are different.
Besides serving different purposes, packets in the Internet have variable length. The queue model, on the other hand, assumes that each vehicle occupies a space of fixed length (7.5 m); therefore, when the queue model is applied to Internet packet traffic, the packet size is fixed.
Besides the modifications explained above, some new constraints and parameters need to be introduced as well. Some examples are
- the number of packets per second that a node can forward, i.e., the number of packets that a node can move from its incoming buffer to its outgoing buffer;
- the IP addresses and masks to be assigned to each interface of the hosts.
One of the most important features that must be added to the queue simulation in order to handle Internet data packets is the creation of routing tables at the nodes (routers). Although the tables can be built before the simulation starts, they should be updated regularly according to congestion information from the network. A very simple routing algorithm associates the destination with a next hop, meaning that the destination can be "optimally" reached via that next hop.
A simple attempt [75] at employing the queue simulation as a simulator of Internet packet traffic was made with changes similar to the ones explained above. The tests use a star topology, where all nodes are connected to a central router. This topology
[Figure 8.1 appears here: log-log plot of round-trip travel time versus message size (10 to 100000), with curves for ping, uncongested qsim, and congested qsim with 5 and 10 hosts.]
Figure 8.1: Round-trip travel times for different sizes of messages. For the ping packets, congestion is not avoidable. The other curves show the results when the queue model simulates congestion or no congestion.
is selected because of its simplicity, but bottlenecks are unavoidable because all data must pass through the central router.
Even though realistic Internet traffic patterns are lacking, packets are created roughly in the following form:

    <packet id="1" type="HTTP" startTime="100"
            sourceIp="128.96.34.133"
            destinationIp="128.96.33.130"
            size="1000" ttl="7"
    />
where each packet is defined by a unique ID, a type field, a start time, a size, a source IP address and a destination IP address. The ttl field is short for Time To Live, which specifies how many hops a packet may travel before being discarded or returned.
Some simple tests with ping packets were done. ping is used to determine whether an Internet connection is active: to verify reachability, it sends a packet to a specified Internet host and waits for a reply. The result of ping is a round-trip travel time.
Figure 8.1 shows the round-trip travel times for messages of different sizes on a 100 Mbit LAN. The tests are set up so that the destination is 100 m away from the source(s). The speed of the link is 12.5 MByte/s. Each second simulates 100000 steps; thus, one step corresponds to about 10^-5 s.
The curves for ping and uncongested qsim are results between one source and one destination. The values labeled uncongested qsim, gathered from the simulation, are not as high as those from the ping command. This is because during this test only one packet is sent from the source to the destination, without encountering any congestion; the ping results come from the real world, where the traffic towards the destination node is not predictable. The curve labeled congested qsim - 5 hosts considers a congested system with 5 hosts: 4 of the hosts send ping packets every second to the single destination. Congested qsim - 10 hosts is similar to the 5-host case, but now 9 hosts exhaust the single destination.
The Ethernet packet size is about 1500 bytes, i.e. 12 kbit. For a 100 Mbit Ethernet, this means that 8333 packets are processed per second. With simulation time steps of 1 second, a 100 Mbit Ethernet card would thus need to increase its incoming and outgoing buffer sizes to hold 8333 packets in order to avoid overflow. With simulation time steps of 1 millisecond, the number of packets processed per millisecond becomes approximately 8, and 8 packets of 1500 bytes give a total of 12 KB. Given that the Ethernet card used has a memory of 18 KB (approximately 12 Ethernet packets), one can conclude that 8 packets per millisecond can be handled without an overflow on the Ethernet card.
As stated in Section 4.5.1, a Real Time Ratio (RTR) of 900 can be achieved by running<br />
the queue model in parallel with a simulation time step of 1 second. If the simulation runs<br />
with time steps of 1 millisecond, then an RTR of 900 translates into 0.9, which means that<br />
with 1 millisecond as the simulation time step the parallel queue model runs close to real<br />
time. Thus, the parallel queue model can be utilized for Internet packet traffic with an RTR<br />
of about 1. One should note that the modifications made to the queue model to simulate Internet<br />
packet traffic are elementary. More sophisticated simulations of Internet packet traffic provide<br />
additional facilities such as more complicated routing algorithms and topologies. For example,<br />
the parallel implementation of a widely known network simulator, NS [1], reports a speed-up<br />
of 3 for a system of 192 nodes decomposed onto 4 computing nodes [44]; the events produced<br />
during those tests are reportedly between 15 and 70 million.<br />
The domain decomposition explained in Section 4.1.2 is still useful. The subnetworks, for<br />
example, can be distributed among the computing nodes, and any traffic between two subnetworks<br />
on two different computing nodes can be carried by MPI [51].<br />
Some of the modules in the framework, such as the events recorder, would be of little use,<br />
for the following reason: people in a human mobility simulation react rather slowly to new<br />
circumstances, although their thought processes can be very complex. The Internet, by contrast,<br />
reacts very quickly based on very simple rules.<br />
8.3 Summary<br />
The attention that the Internet draws has led scientists to analyze the data flowing<br />
through it. Although similar analyses have been done for telephone networks, the Internet<br />
is not as simple a case as the telephone networks.<br />
Aggregated data traffic becomes more bursty as the number of users increases. This contrasts<br />
with telephone networks, which are well described by Poisson models. Both LAN and WAN<br />
data traffic have been shown to be self-similar. Self-similar processes are described by the<br />
power law; it has been shown in various papers ([88], [62], [49], [89]) that the measured data<br />
fit power-law plots.<br />
Huisinga et al. [81] give a microscopic description of data packet transport in the Internet<br />
using a simple cellular automaton model. For similar purposes, the queue model described<br />
in Chapter 2 can be employed. This means that the graph data needs to be redefined and the<br />
rules, especially the constraints, need to be adapted to Internet packet traffic. Furthermore,<br />
packets become agents that know only the destination, but not the intermediate nodes between<br />
the source and the destination. Nodes in the Internet, contrary to nodes in vehicle traffic, are<br />
defined by a constraint that limits the number of packets per second a node can process.<br />
Last but not least, a routing table needs to be created at each node and must be updated<br />
regularly according to the congestion in the system. Routing tables tell the nodes where to<br />
forward (if necessary) the packets arriving in their buffers.<br />
The parallel queue simulation along with the domain decomposition could be employed to<br />
observe Internet packet traffic. The simulation would give an RTR of about 1 when the<br />
simulation time step is chosen as 1 millisecond.<br />
Chapter 9<br />
Summary<br />
Among different simulation techniques, multi-agent simulations [22] attract attention since<br />
they enable agents to be defined as complex, because of the rules, and intelligent, because of<br />
the ability to adapt and to learn. Multi-agent simulations, as the name implies, allow multiple<br />
agents to be executed simultaneously based on the rules. This approach makes it possible<br />
to observe the behaviors of agents interacting with each other and also helps to forecast<br />
possible future behaviors.<br />
The modules in the traditional four-step process for transportation planning describe human<br />
behavior, but only in aggregated flows. Because of these shortcomings, Dynamic Traffic<br />
Assignment (DTA) models (e.g. [19, 20, 27, 5]) are used to represent travelers at the individual<br />
entity level. To solve DTA with spill-back queues due to congestion, systematic relaxation is<br />
employed. Within the relaxation, agents gradually learn from previous experiences (iterations)<br />
in which they interact with the other agents and the environment. The rules defined for the<br />
agents are executed during each iteration. After an iteration, each agent records and evaluates<br />
its performance; this evaluation is the learning step. Consequently, the system moves from a<br />
congested state to a relaxed state after some iterations.<br />
The execution of the rules is integrated into a traffic flow simulation based on the queue<br />
model. Agents in the simulation are described along with their routes from a source location<br />
to a destination location. The traffic flow simulation takes these routes as input and produces<br />
events as the agents interact with each other and the environment. These events are interpreted<br />
by the other modules, such as the router, the agent database and the activity generator. The<br />
router produces new plans on request, the activity generator changes the end times and<br />
durations of activities on request, and the agent database merges plans coming from different<br />
sources (routers and agents) to produce the plans input file for the next iteration.<br />
Object-oriented programming languages such as Java [42] and C++ [80] suit multi-agent<br />
simulations best because they represent internal object structure and agent-to-agent<br />
interactions in the cleanest way. C++ was chosen as the implementation language of the<br />
work presented in this thesis.<br />
One of the reasons for using C++ as the programming language is that it promises<br />
computationally fast programs. Running a set of iterations as described above can take an<br />
enormous amount of time, especially when an application is detailed at the individual agent<br />
level. Scenarios of meaningful size contain several million agents, so large-scale applications<br />
make computational performance a pressing concern.<br />
Software enhancements for sequential computing sometimes help an application to<br />
speed up. For example, the data structures used to store frequently accessed data make a<br />
difference, depending on the structures themselves, on the ways of accessing their elements,<br />
and on the methods available for operations on them.<br />
A probably better way to reduce the computation time of a large-scale multi-agent application<br />
is to make it run in parallel. Among the different methods, domain decomposition suits<br />
the work presented here well. This method divides the problem into a set of subproblems<br />
and assigns each subproblem to a different computing node. It aims at two goals:<br />
- balancing the load on the computing nodes,<br />
- minimizing the communication between computing nodes.<br />
Load balancing makes sure that each computing node gets a fair share of the big problem,<br />
so that no computing node is either overloaded or idle. The second goal, reducing the<br />
communication between computing nodes, arises because the subproblems generated by the<br />
domain decomposition are usually not fully independent of each other. Hence, solving such a<br />
subproblem requires exchanging information at the boundaries.<br />
With respect to transportation planning, a domain refers to the street network or the graph<br />
data. Each sub-graph is therefore assigned to a computing node, and each of these computing<br />
nodes runs a separate traffic flow simulation on its graph data. If an agent’s trip goes<br />
beyond the graph data assigned to a computing node, then that computing node makes sure<br />
that the agent in question is transferred to the next computing node via messages. This is<br />
called message passing. Message passing can be implemented via several software libraries,<br />
such as MPI [51], PVM [63] or CORBA [92]. MPI was chosen among those for performance<br />
reasons and because of the effort that has been put into its development.<br />
When further reductions in the computation time of a large-scale application are of interest,<br />
improving the hardware is another option: as stated earlier in this work, Myrinet [54] is a<br />
cost-effective, high-performance packet-communication and switching technology and can be<br />
used to reduce the latency that Ethernet [77] contributes to each message.<br />
Message passing can also be used for inter-modular communication. The relaxation<br />
described above is achieved in a framework that contains different strategic and physical<br />
modules, each of which is responsible for a different task. However, these modules are not<br />
fully independent of each other, as they share data such as plans and events. Therefore, some<br />
agreement must be reached on the data representation. The available wire formats do not offer<br />
a perfect solution, but they provide different advantages such as better performance or<br />
extensibility. The data between modules can be shared via files, but this method suffers from<br />
file I/O bottlenecks. Hence, instead of using files, data can be passed between the modules via<br />
messages.<br />
The queue model simulation can be extended to go beyond transportation planning. One<br />
possible area is simulating Internet packet traffic, since Internet packet traffic draws the<br />
attention of researchers where the analysis of data flows, of statistics and of predictions is<br />
concerned.<br />
Bibliography<br />
[1] Information Sciences Institute at Univ. of Southern California. The Network Simulator.<br />
See www.isi.edu/nsnam/ns, accessed 2005.<br />
[2] M. Balmer, K. Nagel, and B. Raney. Large scale multi-agent simulations for transportation<br />
applications. ITS Journal, in press.<br />
[3] M. Balmer, B. Raney, and K. Nagel. Coupling activity-based demand generation to a truly<br />
agent-based traffic simulation – activity time allocation. In Presented at EIRASS workshop<br />
on Progress in activity-based analysis, Maastricht, NL, May 2004. Also presented at<br />
STRC’04, see www.strc.ch.<br />
[4] J. Barcelo, J.L. Ferrer, D. Garcia, M. Florian, and E. Le Saux. Parallelization of microscopic<br />
traffic simulation for ATT systems. In P. Marcotte and S. Nguyen, editors, Equilibrium<br />
and advanced transportation modelling, pages 1–26. Kluwer Academic Publishers,<br />
1998.<br />
[5] J.A. Bottom. Consistent anticipatory route guidance. PhD thesis, Massachusetts Institute<br />
of Technology, Cambridge, MA, 2000.<br />
[6] Bundesamt für Raumentwicklung (ARE), Bern. Räumliche Auswirkungen der<br />
Verkehrsinfrastrukturen, Methodologische Vorstudie, Information und Pflichtenheft für<br />
die Anbieter, 13.6. 2001.<br />
[7] A. Burriad. Intersection dynamics in queue models. Term project report, Swiss Federal<br />
Institute of Technology, 2002. See sim.inf.ethz.ch/papers.<br />
[8] G. D. B. Cameron and C. I. D. Duncan. PARAMICS — Parallel microscopic simulation<br />
of road traffic. Journal of Supercomputing, 10(1):25, 1996.<br />
[9] R. Cayford, W.-H. Lin, and C.F. Daganzo. The NETCELL simulation package: Technical<br />
description. California PATH Research Report UCB-ITS-PRR-97-23, University of<br />
California, Berkeley, 1997.<br />
[10] G.L. Chang, T. Junchaya, and A.J. Santiago. A real-time network traffic simulation model<br />
for ATMS applications: Part I — Simulation methodologies. IVHS Journal, 1(3):227–<br />
241, 1994.<br />
[11] The World Wide Web Consortium. HTML: HyperText Markup Language. See<br />
www.w3.org/MarkUp, accessed 2005.<br />
[12] The World Wide Web Consortium. HTTP: HyperText Transfer Protocol. See<br />
www.w3.org/Protocols, accessed 2005.<br />
[13] Microsoft Corporation. MS Windows. See www.microsoft.com/windows, accessed 2005.<br />
[14] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, MA, U.S.A., 1991.<br />
[15] C.F. Daganzo. The cell transmission model: A dynamic representation of highway traffic<br />
consistent with the hydrodynamic theory. Transportation Research B, 28B(4):269–287,<br />
1994.<br />
[16] C.F. Daganzo. The cell transmission model, part II: Network traffic. Transportation<br />
Research B, 29B(2):79–93, 1995.<br />
[17] US Dept. of Transportation Federal Highway Administration. DynaMIT prototype description.<br />
See www.dynamictrafficassignment.org/dynamit.htm, accessed 2005.<br />
[18] US Dept. of Transportation Federal Highway Administration. DYNASMART-X prototype<br />
description. See www.dynamictrafficassignment.org/dsmart x.htm, accessed 2005.<br />
[19] DYNAMIT www page. See mit.edu/its and dynamictrafficassignment.org, accessed 2005.<br />
[20] DYNASMART www page. See www.dynasmart.com and dynamictrafficassignment.org,<br />
accessed 2005.<br />
[21] Expat www page. James Clark’s Expat XML parser library. See expat.sourceforge.net,<br />
accessed 2005.<br />
[22] J. Ferber. Multi-agent systems. An Introduction to distributed artificial intelligence.<br />
Addison-Wesley, 1999.<br />
[23] J.L. Ferrer and J. Barceló. AIMSUN2: Advanced Interactive Microscopic Simulator for<br />
Urban and non-urban Networks. Internal report, Departamento de Estadística e Investigación<br />
Operativa, Facultad de Informática, Universitat Politècnica de Catalunya, 1993.<br />
[24] U. Frisch, B. Hasslacher, and Y. Pomeau. Lattice-gas automata for Navier-Stokes equation.<br />
Phys. Rev. Letters, 56:1505, 1986.<br />
[25] G. Eisenhauer, F. Bustamante, and K. Schwan. Native data representation: An efficient<br />
wire format for high performance distributed computing. IEEE Transactions on Parallel<br />
and Distributed Systems, 13:1234–1246, 2002.<br />
[26] C. Gawron. An iterative algorithm to determine the dynamic user equilibrium in a traffic<br />
simulation model. International Journal of Modern Physics C, 9(3):393–407, 1998.<br />
[27] C. Gawron. Simulation-based traffic assignment. PhD thesis, University of Cologne,<br />
Cologne, Germany, 1998. available via www.zaik.uni-koeln.de/˜paper.<br />
[28] R.A. Gingold and J.J. Monaghan. Smoothed particle hydrodynamics - theory and application<br />
to non-spherical stars. Royal Astronomical Society, Monthly Notices, 181:375–389,<br />
1977.<br />
[29] C. Gloor. Distributed Intelligence in Real-World Mobility Simulations. PhD thesis, Swiss<br />
Federal Institute of Technology ETH, 2005.<br />
[30] Pallas GmbH. Pallas MPI Benchmark. See www.pallas.com/e/products/pmb, accessed<br />
2005.<br />
[31] P. Gonnet. A thread-based distributed traffic micro-simulation. Term project, Swiss Federal<br />
Institute of Technology ETH, Zürich, Switzerland, 2001.<br />
[32] Network Working Group. File Transfer Protocol. See www.faqs.org/rfcs/rfc959.html,<br />
accessed 2005.<br />
[33] The Open Group. Technical report for Remote Procedure Call. See<br />
www.opengroup.org/public/pubs/catalog/c706.htm, accessed 2005.<br />
[34] D. Hensher and J. King, editors. The Leading Edge of Travel Behavior Research. Pergamon,<br />
Oxford, 2001.<br />
[35] W. A. Hunt. Clock cycles counter. See www.cs.utexas.edu/users/hunt/class/2003-<br />
fall/cs352/lectures/ class01a.pdf, accessed 2005.<br />
[36] J. Hurwitz and W. Feng. Initial end-to-end performance evaluation of 10-Gigabit Ethernet.<br />
IEEE Hot Interconnects, 2003.<br />
[37] IBM SP2 web page. RS/6000 SP System. See www.rs6000.ibm.com/hardware/largescale,<br />
accessed 2005.<br />
[38] Cray Inc. See www.cray.com, accessed 2005.<br />
[39] Linux Online Inc. The Linux home page at Linux Online. See www.linux.org, accessed<br />
2005.<br />
[40] Red Hat Online Inc. Red Hat Linux. See www.redhat.com, accessed 2005.<br />
[41] InfiniBand Trade Association www page. InfiniBand. See www.infinibandta.org, accessed<br />
2005.<br />
[42] See java.sun.com. Java technology, accessed 2005.<br />
[43] java.sun.com/products/jdk/rmi. Java Remote Method Invocation (RMI), accessed 2005.<br />
[44] K. G. Jones and S. R. Das. Parallel execution of a sequential network simulator. In<br />
Proceedings of the 32nd Conference on Winter Simulation, pages 418–424, 2000.<br />
[45] C. Kurmann, T. Stricker, and F. Rauch. Speculative defragmentation - leading Gigabit<br />
Ethernet to true zero-copy communication cluster computing. Journal of Networks, Software<br />
Tools and Applications, 4(4):7–18, March 2001.<br />
[46] Rauch F. Kurmann, C. and T Stricker. Cost/performance tradeoffs in network interconnects<br />
for clusters of commodity PCs. International Parallel and Distributed Processing<br />
Symposium,www.ipdps.org, April 2003.<br />
[47] Lawrence Livermore National Laboratory. ASC at Livermore. See www.llnl.gov/asci,<br />
accessed 2005.<br />
[48] M. J. Lighthill and G. B. Whitham. On kinematic waves. I: Flood movement in long rivers.<br />
II: A Theory of traffic flow on long crowded roads. Proceedings of the Royal Society A,<br />
229:281–345, 1955.<br />
[49] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet<br />
topology. SIGCOMM, pages 251–262, 1999.<br />
[50] MATSIM www page. MultiAgent Transportation SIMulation. See www.matsim.org,<br />
accessed 2005.<br />
[51] MPI www page. www-unix.mcs.anl.gov/mpi/, accessed 2005. MPI: Message Passing<br />
Interface.<br />
[52] MPICH www page. www-unix.mcs.anl.gov/mpi/mpich/, accessed 2005. MPI: Message<br />
Passing Interface MPICH implementation.<br />
[53] S. Meyers. Effective STL: 50 specific ways to improve your use of the Standard Template<br />
Library. Addison-Wesley, 2001.<br />
[54] Myricom www page. Myrinet. See www.myri.com, accessed 2005. Myricom, Inc.,<br />
Arcadia, CA.<br />
[55] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations<br />
by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.<br />
[56] K. Nagel. High-speed microsimulations of traffic flow. PhD thesis, University of Cologne,<br />
1994/95. See www.inf.ethz.ch/˜nagel/papers or www.zaik.uni-koeln.de/˜paper.<br />
[57] K. Nagel and M. Rickert. Parallel implementation of the TRANSIMS micro-simulation.<br />
Parallel Computing, 27(12):1611–1639, 2001.<br />
[58] K. Nagel and A. Schleicher. Microscopic traffic modeling on parallel high performance<br />
computers. Parallel Computing, 20:125–146, 1994.<br />
[59] K. Nagel, P. Stretz, M. Pieck, S. Leckey, R. Donnelly, and C. L. Barrett. TRANSIMS traffic<br />
flow characteristics. Los Alamos Unclassified Report (LA-UR) 97-3530, Los Alamos<br />
National Laboratory, Los Alamos, NM, see transims.tsasa.lanl.gov, 1997.<br />
[60] W. Niedringhaus, J. Opper, L. Rhodes, and B. Hughes. IVHS traffic modeling using<br />
parallel computing: Performance results. In Proceedings of the International Conference<br />
on Parallel Processing, pages 688–693. IEEE, 1994.<br />
[61] Klaus Nökel and Matthias Schmidt. Parallel DYNEMO: Meso-scopic traffic flow simulation<br />
on large networks. Networks and Spatial Economics, 2(4):387–403, December<br />
2002.<br />
[62] V. Paxson and S. Floyd. Wide-Area traffic: The failure of Poisson modeling. IEEE/ACM<br />
Transactions on Networking, 3(3):226–244, 1995.<br />
[63] PVM www page. See www.epm.ornl.gov/pvm, accessed 2005. PVM: Parallel Virtual<br />
Machine.<br />
[64] H. A. Rakha and M. W. Van Aerde. Comparison of simulation modules of TRANSYT<br />
and INTEGRATION models. Transportation Research Record, 1566:1–7, 1996.<br />
[65] B. Raney. Learning Framework for Large-Scale Multi-Agent Simulations. PhD thesis,<br />
Swiss Federal Institute of Technology ETH, 2005.<br />
[66] B. Raney and K. Nagel. Truly agent-based strategy selection for transportation simulations.<br />
Paper 03-4258, Transportation Research Board Annual Meeting, Washington, D.C.,<br />
2003.<br />
[67] B. Raney and K. Nagel. An improved framework for large-scale multi-agent simulations<br />
of travel behavior. In P. Rietveld, B. Jourquin, and K. Westin, editors, Towards better<br />
performing European Transportation Systems. accepted.<br />
[68] M. Rickert. Traffic simulation on distributed memory computers. PhD thesis, University<br />
of Cologne, Cologne, Germany, 1998. See www.zaik.uni-koeln.de/˜paper.<br />
[69] Sandia National Laboratories. ASC at Sandia. See www.sandia.gov/ASC, accessed 2005.<br />
[70] G. Satir and D. Brown. C++, The Core Language. O’Reilly & Associates, Inc., 1995.<br />
[71] T. Schwerdtfeger. Makroskopisches Simulationsmodell für Schnellstraßennetze mit Berücksichtigung<br />
von Einzelfahrzeugen (DYNEMO). PhD thesis, University of Karlsruhe,<br />
Germany, 1987.<br />
[72] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts. John Wiley &<br />
Sons, Inc., 2001.<br />
[73] H.P. Simão and W.B. Powell. Numerical methods for simulating transient, stochastic<br />
queueing networks. Transportation Science, 26:296–311, 1992.<br />
[74] P. M. Simon and K. Nagel. Simple queueing model applied to the city of Portland. International<br />
Journal of Modern Physics C, 10(5):941–960, 1999.<br />
[75] Hinnerk Spindler. Personal communication.<br />
[76] William Stallings. Queuing analysis. ftp://shell.shore.net/members/w/s/ws/Support/<br />
QueuingAnalysis.pdf, 2000.<br />
[77] IEEE 802 LAN/MAN Standards Committee. Ethernet IEEE 802 standards. See<br />
www.ieee802.org, accessed 2005.<br />
[78] R. Standish. Classdesc Library. See parallel.hpc.unsw.edu.au/rks/classdesc, accessed<br />
2005.<br />
[79] W. R. Stevens. UNIX Network Programming. Prentice Hall, 1990.<br />
[80] B. Stroustrup. The Design and Evolution of C++. Addison-Wesley, 1994.<br />
[81] T. Huisinga, R. Barlovic, W. Knospe, A. Schadschneider, and M. Schreckenberg. A microscopic<br />
model for packet transport in the Internet. Physica A, pages 249–256, 2001.<br />
[82] TRANSIMS www page. TRansportation ANalysis and SIMulation System. transims.tsasa.lanl.gov,<br />
accessed 2005. Los Alamos National Laboratory, Los Alamos, NM.<br />
[83] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The<br />
Benjamin/Cummings Publishing Company, Inc., 1994.<br />
[84] VISSIM www page. www.ptv.de, accessed 2004. Planung Transport und Verkehr (PTV)<br />
GmbH.<br />
[85] M. Vrtic, K.W.Axhausen, R. Koblo, and M. Vödisch. Entwicklung bimodales Personenverkehrsmodell<br />
als Grundlage für Bahn2000, 2. Etappe, Auftrag 1. Report to the Swiss<br />
National Railway and to the Dienst für Gesamtverkehrsfragen, Prognos AG, Basel, 1999.<br />
See www.ivt.baug.ethz.ch/vrp/ab115.pdf for a related report.<br />
[86] P. Waddell, A. Borning, M. Noth, N. Freier, M. Becke, and G. Ulfarsson. Microsimulation<br />
of urban development and location choices: Design and implementation of UrbanSim.<br />
Networks and Spatial Economics, 3(1):43–67, 2003.<br />
[87] W. E. Leland, M. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of<br />
Ethernet traffic. SIGCOMM, pages 183–193, 1993.<br />
[88] W. Willinger and V. Paxson. Where mathematics meets the Internet. Notices of the AMS,<br />
45(8):961–970, 1998.<br />
[89] W. Willinger, V. Paxson, and M. S. Taqqu. Self-similarity and heavy tails: Structural<br />
modeling of network traffic. To appear in R. Adler, R. Feldman, and M. S. Taqqu, editors,<br />
A Practical Guide to Heavy Tails: Statistical Techniques and Applications, 1998.<br />
Birkhauser Verlag, Boston.<br />
[90] D.E. Wolf, M. Schreckenberg, and A. Bachem, editors. Traffic and granular flow. World<br />
Scientific, Singapore, 1996.<br />
[91] www-users.cs.umn.edu/˜karypis/metis. METIS library, accessed 2005.<br />
[92] www.corba.org. CORBA: Common Object Request Broker Architecture, accessed 2005.<br />
[93] See www.hpjava.org/mpiJava.html. mpiJava, a Java interface to the standard MPI, accessed<br />
2005.<br />
[94] www.mysql.com. MYSQL, an open-source SQL database, accessed 2005.<br />
[95] www.oracle.com/products. Oracle database server, accessed 2005.<br />
[96] www.urbansim.org. URBANSIM, accessed 2003.<br />
[97] www.w3.org/XML. XML, eXtensible Markup Language, accessed 2005.<br />
CURRICULUM VITAE: NURHAN ÇETIN<br />
December 1st, 1974 born in Turkey, citizen of the Republic of Turkey.<br />
1980-1985 Elementary School of Yeşilbahar, Istanbul.<br />
1985-1988 Secondary School of Göztepe, Istanbul.<br />
1988-1991 High School of Erenköy, Istanbul.<br />
1991-1994 Environmental Engineering, Marmara University, Istanbul.<br />
1994-1997 B.Sc in Computer Engineering,<br />
Department of Computer Engineering,<br />
Marmara University, Istanbul.<br />
1996-1997 Worked at Computer Center of Marmara University, Istanbul.<br />
1997-1999 Worked at Computer Center of Yeditepe University, Istanbul.<br />
1999-2000 M.Sc. in Computer Science,<br />
Department of Computer Science and Engineering<br />
The Pennsylvania State University, University Park, PA, USA<br />
2000 Worked at America Online, Herndon, VA, USA.<br />
2000-2005 Research and Teaching Assistant in the<br />
Modelling and Simulation group headed by Prof. Kai Nagel,<br />
Institute of Computational Science,<br />
Swiss Federal Institute of Technology Zürich,<br />
Zürich, Switzerland.<br />
PUBLICATIONS<br />
Towards truly agent-based traffic and mobility simulations;<br />
M. Balmer, N. Cetin, K. Nagel, B. Raney;<br />
Autonomous agents and multiagent systems (AAMAS’04);<br />
New York, NY, USA, 2004.<br />
An agent-based microsimulation model of Swiss travel: First results;<br />
N. Cetin, B. Raney, A. Völlmy, M. Vrtic, K. Axhausen, K. Nagel;<br />
Networks and Spatial Economics;<br />
Volume 3, Pages 23–41, 2003.<br />
A parallel queue model approach to traffic simulations;<br />
N. Cetin, K. Nagel, A. Burri;<br />
Transportation Research Board (TRB) Conference;<br />
Washington, D.C., USA, 2003.<br />
Large-scale multi-agent transportation simulations;<br />
N. Cetin, K. Nagel, B. Raney, A. Voellmy;<br />
42nd European Regional Science Association (ERSA) Congress;<br />
Dortmund, Germany, 2002.<br />
Towards a microscopic traffic simulation of all of Switzerland;<br />
N. Cetin, B. Raney, A. Voellmy, M. Vrtic, K. Nagel;<br />
Proceedings of the International Conference of Computational Science;<br />
Amsterdam, The Netherlands, 2002.
Large-scale multi-agent transportation simulations;<br />
N. Cetin, K. Nagel, B. Raney, A. Voellmy;<br />
Computational Physics Conference;<br />
Aachen, Germany 2001.<br />
Large-scale transportation simulations on Beowulf clusters;<br />
N. Cetin, K. Nagel;<br />
Swiss Transport Research Conference;<br />
Ascona, Switzerland, 2001.<br />
Solaris 2.x System Administration Course Notes;<br />
N. Cetin, O. Demir, G. Devlet, D. Erdil, G. Küçük, M. Yildiz;<br />
Istanbul, Turkey, 1997.<br />
MaROS: A framework for application development on mobile hosts;<br />
S. Baydere, N. Cetin, O. Demir, G. Devlet, D. Erdil, G. Küçük, M. Yildiz;<br />
Proceedings of the IASTED International Conference on Parallel and Distributed Systems;<br />
Barcelona, Spain, June 1997.