
• Do the platform specifications of the job and the agent match (operating system or memory requirements, for example)?

• Does the type of the job match the type queried by the agent?

• Are all specified source files physically available for download?

• If the job has predecessors defined, are they all finished?
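The following sketch summarizes these four checks in code; the Job fields, the agent dictionary keys, and the helper names are illustrative assumptions, not JoSchKa's actual data model:

```python
# Minimal sketch of the four matching checks. All field and parameter
# names are hypothetical, not JoSchKa's actual API.
from dataclasses import dataclass

@dataclass
class Job:
    job_type: str
    platform: dict        # e.g. {"os": "windows", "min_memory_mb": 512}
    source_files: list    # files the agent must be able to download
    predecessors: list    # IDs of jobs that must be finished first

def is_candidate(job, agent, finished_ids, available_files):
    """True if the job passes all four checks for the querying agent."""
    # 1. Platform specifications (OS, memory, ...) must match.
    if job.platform.get("os") not in (None, agent["os"]):
        return False
    if job.platform.get("min_memory_mb", 0) > agent["memory_mb"]:
        return False
    # 2. The job type must equal the type queried by the agent.
    if job.job_type != agent["requested_type"]:
        return False
    # 3. Every specified source file must be available for download.
    if not all(f in available_files for f in job.source_files):
        return False
    # 4. All defined predecessors must already be finished.
    return all(p in finished_ids for p in job.predecessors)

agent = {"os": "windows", "memory_mb": 2048, "requested_type": "SIM"}
job = Job("SIM", {"os": "windows", "min_memory_mb": 512}, ["run.exe"], [])
print(is_candidate(job, agent, finished_ids=set(), available_files={"run.exe"}))
```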

If all tests evaluate to true, the job is added to a candidate list from which the scheduler has to select one single job. Before describing the selection process, it is necessary to explain why selecting a suitable job is a problem at all and why the scheduler cannot simply pick any job. Many publications (like [8, 9]) deal with intelligent scheduling algorithms, but they all run on a reliable cluster or a supercomputer and generally have full knowledge of the nodes and the jobs' runtime requirements. None of the common desktop job distribution systems like BOINC, Condor or Alchemi tries to observe the nodes and to exploit this additional knowledge, especially to improve the total system performance. Here, jobs have to be distributed to heterogeneous, unpredictably failing nodes, so other methods are needed; in particular, monitoring the nodes is essential. These methods are described in the following sections.

2.1 Fault Tolerance<br />

When using a dedicated cluster, one can be sure that every node executes its jobs without failing before the currently running job is finished. But when the nodes are part of an insecure pool, or if they are used as standard desktop computers, we have to face failures and unfinished jobs, because users normally shut down or reboot the PCs. Two mechanisms are integrated to counteract this problem:

• the upload of intermediary results from the agent to the server, and

• a heartbeat signal from the agent to the server.

The user has to activate the intermediary upload explicitly. If it is active, the agent uploads all new or changed result files at periodic intervals. One can then decide whether a job has to be stopped or, in case of a node failure, started again. This decision has to be made by the user and of course depends on the characteristics of the problem the jobs are working on.
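As an illustration, the agent-side upload could look like the following loop; the upload_file callback, the interval, and the change detection via modification times are assumptions, not the actual JoSchKa implementation:

```python
# Sketch of the periodic intermediary-result upload on the agent side.
# upload_file is a hypothetical callback that transfers a file to the server.
import os
import time

def upload_loop(result_dir, interval_s, upload_file):
    seen = {}  # file name -> last modification time already uploaded
    while True:
        for name in os.listdir(result_dir):
            path = os.path.join(result_dir, name)
            mtime = os.path.getmtime(path)
            # Upload only files that are new or changed since the last round.
            if seen.get(name) != mtime:
                upload_file(path)
                seen[name] = mtime
        time.sleep(interval_s)
```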

If the user does not block a failed job explicitly, the server starts it again after a certain time period. For every job, the server manages an internal counter, which is periodically decreased by 1. Every time the agent executing the job sends the periodic heartbeat signal, the counter is reset to its maximum value. If the node running the agent shuts down, the heartbeat for this job goes missing and the counter eventually reaches zero. The server then creates a new ID for the job and marks it available for a new run.
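A minimal sketch of this server-side watchdog follows; the counter maximum and all names are illustrative assumptions, since the paper does not show the actual code:

```python
# Sketch of the server-side heartbeat watchdog.
import uuid

MAX_COUNTER = 5  # assumed number of missed ticks before re-release

class JobRecord:
    def __init__(self):
        self.id = uuid.uuid4().hex
        self.counter = MAX_COUNTER
        self.state = "running"

def on_heartbeat(job):
    # A heartbeat from the executing agent resets the counter.
    job.counter = MAX_COUNTER

def watchdog_tick(running_jobs):
    # Called periodically by the server: every counter drops by 1.
    for job in running_jobs:
        job.counter -= 1
        if job.counter <= 0:
            # No heartbeats for too long: the node has presumably failed.
            # Give the job a new ID and mark it available for a new run.
            job.id = uuid.uuid4().hex
            job.state = "available"
```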

Due to programming mistakes or other software errors, it may happen that jobs fail even on reliable nodes. A special mechanism takes care of restarting such jobs only a finite number of times: jobs which fail too often are blocked automatically by the server until the user fixes the problem and removes the lock manually. The acceptable number of failures before auto-blocking can be changed by the user.
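This auto-blocking logic might be sketched as follows; the field names and the user-configurable limit are again illustrative assumptions:

```python
# Sketch of the automatic blocking of repeatedly failing jobs.
def on_job_failed(job, max_failures):
    job.failures = getattr(job, "failures", 0) + 1
    if job.failures >= max_failures:
        # Failed too often: block until the user fixes the problem
        # and removes the lock manually.
        job.state = "blocked"
    else:
        job.state = "available"  # eligible for another restart
```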
