Dynamic and Adaptive I/O Aggregator Pattern.

Introduction.

The popularity of multi-core processors provides a flexible way to increase the computational capability of clusters. Nowadays, parallel applications are shifting to multi-core clusters, and new optimization techniques are demanded to exploit this architecture. In this kind of environment, the efficiency of parallel applications is maximized when the workload is evenly distributed among nodes and the overhead introduced by the parallelization process is minimized: the cost of communication, synchronization, and I/O operations must be kept as low as possible. To achieve this, the interconnection subsystem used to support the interchange of messages must be fast enough to avoid becoming a bottleneck.

In addition, scientific applications usually work with large sets of data that must be transferred among processes and also stored on disk, provoking several hot spots in communications, but also in I/O operations. Although multi-core clusters increase the computational capability of clusters, one of the major limits of this kind of cluster is the I/O subsystem, because the I/O requests initiated by many cores may saturate the I/O bus. Therefore, we conclude that it is necessary to develop techniques for improving the performance of I/O subsystems, especially in multi-core clusters. The main goal of this work is to improve the scalability and performance of MPI-based applications executed on multi-core clusters by reducing the overhead of the I/O subsystem, in particular by improving the performance of collective I/O operations.

Collective I/O: Two-Phase I/O.

Many applications use collective I/O operations to read/write data from/to disk. One of the most widely used techniques is Two-Phase I/O, extended by Thakur and Choudhary in ROMIO. Two-Phase I/O takes place in two phases: a data redistribution phase and an I/O phase. In the first phase, by means of communication, small file requests are grouped into larger ones.
In the second phase, contiguous transfers are performed to or from the file system. Before that, Two-Phase I/O divides the file into equal contiguous parts, called file domains (FDs), and assigns each FD to one of a configurable number of compute nodes, called aggregators. Each aggregator is responsible for aggregating all the data that maps inside its assigned FD and for transferring that FD to or from the file system. In the default implementation of Two-Phase I/O, the assignment of aggregators (the aggregator pattern) is fixed, independent of the distribution of data over the processes. This fixed aggregator pattern may create an I/O bottleneck, as a consequence of the multiple requests performed by aggregators to collect all the data assigned to their FDs. Therefore, in this work we propose replacing the rigid assignment of aggregators over the processes with one of two new adaptive aggregation patterns that we have designed.

Dynamic and Adaptive Aggregation Patterns.

In this proposal we improve the collective I/O technique called Two-Phase I/O. To decide the optimal aggregator pattern, we use two aggregation criteria:

1. Reduce the number of communications: this criterion assigns each aggregator to the node that holds the highest number of contiguous data blocks of the file domain associated with that aggregator. We call this criterion aggregation-by-communication-number (ACN).
2. Reduce the volume of communications: this criterion assigns each aggregator to the node that holds the largest amount of data of the file domain associated with that aggregator. We call this criterion aggregation-by-volume-number (AVN).

The result is a new dynamic and adaptive I/O aggregator pattern based on the local data that each node stores. The new aggregator pattern is dynamic because it is calculated at runtime. It is also adaptive because each application has its own pattern and can select the aggregation criterion (ACN or AVN) that most reduces the communication phase of the Two-Phase I/O technique. Moreover, for each collective read/write operation performed during execution, the new aggregator pattern may change the aggregation criterion if necessary.

Evaluation.

We have used the BISP3D application to evaluate our proposal. BISP3D is a 3-dimensional simulator of BJT and HBT bipolar devices. This application uses Two-Phase I/O to write its results and is programmed using MPI. For our evaluation, BISP3D was executed on the HECToR multi-core cluster using three different meshes and different numbers of processes. We evaluated our proposal by modifying the Two-Phase I/O implementation of MPICH2 to include the new aggregation patterns presented in this work: ACN and AVN. Figure 1 shows the overall speedup achieved using the original Two-Phase I/O and our modified Two-Phase I/O with ACN and AVN. In most cases, our modified Two-Phase I/O reduces the overall execution time of the BISP3D application by between 20% and 30% with one of the aggregation patterns. In some cases, the overall time is reduced by up to 50%. This means that, with the appropriate aggregator pattern (ACN or AVN), the number or the volume of communications is reduced, and therefore the overall execution time is also reduced. In the worst cases (which represent less than 10% of the evaluations), the loss introduced by our technique was close to 5%; for these cases, the default aggregator pattern seems to be good enough.
Figure 1. Overall speedup of BISP3D with the ACN and AVN aggregation patterns for BISP3D-Mesh1, BISP3D-Mesh2, and BISP3D-Mesh3, running on 8 to 128 processes.