Tesis y Tesistas 2020 - Postgrado - Fac. de Informática - UNLP

Doctorate in Computer Science

Dr. Diego Miguel Montezanti
e-mail: dmontezanti@lidi.info.unlp.edu.ar

Advisors: Eng. Armando Eduardo De Giusti, Dr. Dolores Rexachs del Rosario
Codirectors: Dr. Marcelo Naiouf, Dr. Emilio Luque Fadón
Thesis defense date: March 18, 2020
SEDICI: http://sedici.unlp.edu.ar/handle/10915/98816

SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems

Keywords: transient faults; fault detection; process replication; automatic recovery; silent data corruption; HPC systems; fault injection; checkpoints.

Motivation

Reliability and fault tolerance have become increasingly relevant in HPC, because faults of different kinds are more likely to occur in these systems. This is fundamentally due to the increasing complexity of processors in the search for better performance, which leads to a rise in the scale of integration and in the number of components that work near their technological limits and are increasingly prone to failures. Another factor is the growth in the size of parallel systems, in terms of number of cores and processing nodes, to obtain greater computational power.

As applications demand longer uninterrupted computation times, the impact of faults grows, due to the cost of relaunching an execution that was aborted because of a fault or that concluded with erroneous results. Consequently, these applications need to run on highly available and reliable systems, which requires strategies capable of providing detection, protection and recovery against faults.

In the coming years, systems are expected to reach exascale, with supercomputers capable of performing on the order of 10^18 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that they will not complete their executions.
Recent studies show that, as systems continue to include more processors, the Mean Time Between Errors decreases, resulting in higher failure rates and an increased risk of corrupted results; large parallel applications are expected to deal with errors that occur every few minutes, requiring external help to progress efficiently. Silent Data Corruptions are the most dangerous errors that can occur, since they can generate incorrect results in programs that appear to execute correctly. Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge towards resilience in HPC. In message passing applications, a silent error affecting a single task can produce a pattern of corruption that spreads to all communicating processes; in the worst case, the erroneous final results cannot be detected at the end of the execution and will be taken as correct.

Since scientific applications have execution times on the order of hours or even days, it is essential to find strategies that allow applications to reach correct solutions in a bounded time, despite the underlying failures. These strategies also prevent energy consumption from skyrocketing, since without them the executions would have to be relaunched from the beginning. However, the most popular parallel programming models used in supercomputers lack support for fault tolerance.

In this context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications to provide reliability to their results, within a predictable time.

To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting of message passing applications that run on multicore clusters. SEDAR is based on process replication and on monitoring both the messages to be sent and the local computation, taking advantage of the intrinsic hardware redundancy of multicores.

SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness in detecting transient faults and automatically recovering executions, using an analytical verification model; a SEDAR prototype is also implemented. The tests carried out with this prototype characterize the temporal behavior, i.e. the overhead introduced by each variant. They also show the flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time), which supports SEDAR as a viable and effective methodology to tolerate transient faults in HPC.
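
The detection principle described above (replicate the computation and validate each message before it is sent) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the SEDAR prototype: the names `validated_send` and `SilentErrorDetected` are hypothetical, the two replicas are modeled as plain callables instead of processes or threads on separate cores, and the "send" is any function that accepts the message.

```python
from typing import Any, Callable


class SilentErrorDetected(Exception):
    """Hypothetical signal: the two replicas disagree on an outgoing message."""


def validated_send(compute: Callable[[], Any],
                   replica_compute: Callable[[], Any],
                   send: Callable[[Any], None]) -> Any:
    """Run the same computation twice and send the result only if both agree.

    Assumption: comparing the message contents just before sending is enough
    to stop a corrupted value from spreading to other processes.
    """
    primary = compute()
    replica = replica_compute()
    if primary != replica:
        # A mismatch means a transient fault affected one of the replicas.
        raise SilentErrorDetected("replicas produced different message contents")
    send(primary)
    return primary


if __name__ == "__main__":
    outbox = []

    # Fault-free case: both replicas compute the same value, so it is sent.
    validated_send(lambda: sum(range(10)), lambda: sum(range(10)), outbox.append)

    # Faulty case: a simulated bit flip in the replica is caught before sending.
    try:
        validated_send(lambda: sum(range(10)),
                       lambda: sum(range(10)) ^ (1 << 3),
                       outbox.append)
    except SilentErrorDetected as err:
        print("detected:", err)

    print("messages sent:", outbox)
```

In this sketch a corrupted message never reaches `send`, so the corruption stays confined to the process where the fault occurred, which is the property the text attributes to monitoring the messages to be sent.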
Unlike specific strategies, which provide partial resilience for certain applications at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.

Thesis contributions

The main contributions of this thesis are:
• The development of a functionally valid fault tolerance methodology that integrates duplication (for detection) with the checkpoint & restart that is used to guarantee recovery from permanent failures, thus obtaining a strategy that ensures both completion and reliability of the results.
• The description and verification of the functional behavior, using a model that considers all the possible failure scenarios, demonstrating the effectiveness of detection and of the recovery mechanism based on multiple system-level checkpoints.
• The empirical verification of the model's predictions, through controlled fault-injection experiments.
• The implementation of a prototype of an automatic tool that is capable of recovering without user intervention, which integrates the detection mechanism with the recovery mechanism based on multiple checkpoints.
• The detail of the experimental work carried out to attach SEDAR to parallel applications.
• The determination of the amount of necessary resources, together with the temporal characterization and the evaluation of the overheads of each of the three alternatives. This shows the benefits obtained both in runtime and in reliability, and therefore the feasibility of SEDAR to tolerate transient failures in HPC.
• The evidence of SEDAR's flexibility to adapt to a particular compromise between the cost and the obtained performance.

Future Research Lines

• Extending the experimental validation, using the recovery algorithm based on application-level checkpoints.
• Calculating the optimal checkpoint interval, in order to minimize both the execution overhead and the rework to be done, quantifying the relationship between detection latency and the communications pattern.
• Optimally supporting the occurrence of multiple unrelated faults with the recovery strategy based on multiple checkpoints, and predicting the temporal response.
• Implementing a dynamic adaptation of the recovery mechanism, and auxiliary tools to provide reports and statistics to the user for subsequent analysis.
• Integrating SEDAR with architectures that tolerate permanent faults, to support both types of faults with a single functional tool for exascale, considering the impact of energy consumption on resilience.
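
The recovery mechanism based on multiple system-level checkpoints, listed among the contributions, can be illustrated with a small Python simulation. It is a sketch under assumptions that the text does not spell out: because detection has some latency, a checkpoint may already contain the corruption, so if the same error reappears after a rollback the sketch discards that checkpoint and restores the previous one. All names (`run_with_recovery`, `make_faulty_step`, `SilentErrorDetected`) are hypothetical, checkpoints are in-memory deep copies instead of system-level images, and the fault is a single simulated bit flip.

```python
import copy
from typing import Callable, List, Optional, Set, Tuple


class SilentErrorDetected(Exception):
    """Hypothetical signal raised by the detection mechanism."""


def run_with_recovery(step: Callable[[dict, int], None], initial_state: dict,
                      n_steps: int, checkpoint_every: int) -> Tuple[dict, List[int]]:
    """Run n_steps steps, rolling back through a chain of checkpoints on error."""
    # Each checkpoint is (index of the next step to run, copy of the state).
    checkpoints = [(0, copy.deepcopy(initial_state))]
    state = copy.deepcopy(initial_state)
    restored: List[int] = []          # step indices of the checkpoints restored
    last_failed: Optional[int] = None
    i = 0
    while i < n_steps:
        try:
            step(state, i)
        except SilentErrorDetected:
            # Same step failing again after a rollback: assume the restored
            # checkpoint was taken after the corruption and discard it.
            if last_failed == i and len(checkpoints) > 1:
                checkpoints.pop()
            last_failed = i
            i, saved = checkpoints[-1]
            state = copy.deepcopy(saved)
            restored.append(i)
            continue
        i += 1
        if i % checkpoint_every == 0 and i < n_steps:
            checkpoints.append((i, copy.deepcopy(state)))
    return state, restored


def make_faulty_step(fault_at: int, check_at: Set[int]) -> Callable[[dict, int], None]:
    """Build a step that injects one transient bit flip and validates later."""
    injected = {"done": False}

    def step(state: dict, i: int) -> None:
        state["value"] += i
        state["replica"] += i          # duplicated computation
        if i == fault_at and not injected["done"]:
            state["value"] ^= 1 << 4   # transient fault: happens only once
            injected["done"] = True
        if i in check_at and state["value"] != state["replica"]:
            raise SilentErrorDetected(f"mismatch detected at step {i}")

    return step


if __name__ == "__main__":
    step = make_faulty_step(fault_at=3, check_at={5})
    final, restored = run_with_recovery(step, {"value": 0, "replica": 0},
                                        n_steps=8, checkpoint_every=2)
    print(final, restored)
```

With a fault injected at step 3 and validation at step 5, the checkpoint taken before step 4 already holds the corrupted value. The first rollback therefore fails again, and the run completes correctly only after falling back to the checkpoint taken before step 2, which is why a single stored checkpoint would not be enough in this scenario.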