
Elektronika 2009-11.pdf - Instytut Systemów Elektronicznych



A system automating repairs of IT systems

(System automatyzujący naprawy systemów IT)

MSc MAREK KAMIŃSKI

Gdańsk University of Technology, Faculty of Electronics, Telecommunication

Lufthansa Systems Poland, Sp. z o.o., Gdańsk

The rapid and progressive informatization of almost all domains of our lives has become an undisputed fact. Because a huge number of IT systems of various kinds need supervision and maintenance 24 hours a day, 365 days a year, many IT companies now offer their customers a new service: remote monitoring, technical support and assistance in taking care of their systems.

Monitoring is relatively easy to implement, because many monitoring systems have already been developed [1] and they usually accomplish their task satisfactorily. We distinguish traditional monitoring systems [2-7] (usually with centralized monitoring logic) from those dedicated to monitoring grid or cluster structures [8-11] (usually with distributed monitoring logic). Regardless of its kind, a monitoring system aims to give administrators of the monitored systems a clear indication of what is wrong. The next step is to solve the problem (to repair the system), so integrating monitoring and repair seems natural.

Repairs, however, are usually more complicated than monitoring, as they often involve manual and time-consuming interventions by administrators. Observations made by the author of this article show that, in many cases, even repairs can be automated and, moreover, integrated with the monitoring task. Their automation is a complex problem and still an area of active research. The solutions proposed so far are usually too trivial to handle complex repairs in real-world situations. Others have vulnerabilities that exclude their industrial application. Some solutions use the concept of timeouts or of retrying failed actions a finite number of times, hoping the problem will not reoccur, or the concept of event handlers [4-6], which are simple executions of remote commands [7] telling the system which program to execute on undesired monitoring results. Some solutions address only the automation of administrative tasks [12,13] and do not integrate with monitoring. Other attempts focus on describing the architecture a system should have to facilitate automatic repair [14-16]. Efforts were also made to integrate the Nagios Monitoring System [4-6] with the CFengine system (in a particular situation, regarding network problems [17]), and some works show the directions in which monitoring and healing (repairing) can go [18].
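To make the two simple approaches mentioned above concrete (event handlers and retrying a failed action a finite number of times), the following minimal Python sketch may help. It is only an illustration: the state names, `run_with_retries`, `handle_result` and the simulated `restart_service` command are hypothetical and are not taken from Nagios, CFengine or any of the cited systems.

```python
from typing import Callable, Dict

# Hypothetical monitoring states; the names are assumptions for illustration.
OK, CRITICAL = "OK", "CRITICAL"


def run_with_retries(action: Callable[[], bool], max_attempts: int) -> int:
    """Run `action` until it reports success or the attempts run out.

    Returns the number of attempts used on success, or 0 if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
    return 0


def handle_result(state: str,
                  handlers: Dict[str, Callable[[], bool]],
                  max_attempts: int = 3) -> str:
    """Event-handler dispatch: map an undesired state to a repair command."""
    handler = handlers.get(state)
    if handler is None:
        return "no action"
    attempts = run_with_retries(handler, max_attempts)
    if attempts:
        return f"repaired after {attempts} attempt(s)"
    return "escalate to administrator"


# Simulated repair command that fails twice before it succeeds.
calls = {"n": 0}


def restart_service() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(handle_result(CRITICAL, {CRITICAL: restart_service}))
# prints "repaired after 3 attempt(s)"
```

The sketch also shows why such schemes are limited: the handler is a single fixed command, and when the retries are exhausted the only remaining option is to hand the problem back to an administrator.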

Goal and context

This article focuses mainly on the Repair Management System (RMS), which is one part of the developed Repair Management Framework (RMF), aimed at automating the process of repairing IT systems. The RMF also comprises the Repair Management Model (RMM) and the repair library. Both of them were formally and precisely specified using the Z notation [19-21] and are described in [22]. The RMM introduces two mathematical models (a model of monitoring and a model of repair processes) general enough to cover the existing problems and solutions, while the RMS, with its complex architecture, uses those concepts as its foundation to incorporate and exploit existing monitoring solutions for triggering and conducting repairs automatically. The repair library introduces an abstraction layer (in the form of an API, Application Programming Interface) that allows repair algorithms to be constructed easily, taking advantage of the programming language they are embedded in and of a set of predefined routines (procedures and functions) that hide from programmers many unimportant details of monitoring and repairing the particular problems they solve. All parts of the RMF have already been instantiated and are being tested at Lufthansa Systems Poland.
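The idea of a repair library, in which a repair algorithm is ordinary code in a host language built on predefined routines, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: `SimulatedHost`, `disk_usage`, `purge_temp_files`, `rotate_logs` and `repair_full_disk` are hypothetical names, the host is simulated, and the sketch does not reproduce the actual API of the repair library described in [22].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimulatedHost:
    """Stand-in for a monitored object; a real library would query real systems."""
    disk_used_pct: int
    log: List[str] = field(default_factory=list)


# Hypothetical predefined routines that hide low-level details.
def disk_usage(host: SimulatedHost) -> int:
    return host.disk_used_pct


def purge_temp_files(host: SimulatedHost) -> None:
    host.log.append("purge_temp_files")
    host.disk_used_pct -= 10  # simulated effect of the step


def rotate_logs(host: SimulatedHost) -> None:
    host.log.append("rotate_logs")
    host.disk_used_pct -= 25  # simulated effect of the step


# A repair algorithm written as plain code on top of the routines.
def repair_full_disk(host: SimulatedHost, threshold: int = 90) -> bool:
    """Apply progressively stronger steps, re-checking before each one."""
    for step in (purge_temp_files, rotate_logs):
        if disk_usage(host) < threshold:
            break
        step(host)
    return disk_usage(host) < threshold


host = SimulatedHost(disk_used_pct=97)
print(repair_full_disk(host), host.log)
# prints: True ['purge_temp_files']
```

Because the algorithm is plain code, loops, conditions and re-checks between steps come from the host language, while the routines keep the monitoring and repair details out of sight.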

Definition of the solved problem

The problem addressed in this article may be briefly defined as follows:

• An IT company continuously supervises the work of many objects of various kinds,

54 ELEKTRONIKA 11/2009
