10.07.2015 Views

Coordinated Atomic Actions and System Fault Tolerance

Coordinated Atomic Actions and System Fault Tolerance

Coordinated Atomic Actions and System Fault Tolerance

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

3. CA <strong>Actions</strong>A CA action is a generalised form of the basic atomic action structure.Multi-Threaded Enclosure <strong>and</strong> Coordination - CA actions provide a mechanism forenclosing <strong>and</strong> coordinating interactions among activities, <strong>and</strong> ensuring consistentaccess to objects (resources) in the presence of complex concurrency <strong>and</strong> potentialfaults.<strong>Fault</strong> <strong>Tolerance</strong> - If an exception is raised inside a CA action, appropriate forward<strong>and</strong>/or backward recovery measures will be invoked cooperatively in order toreach some mutually consistent conclusion <strong>and</strong>, if possible, to recover.Multiple Outcomes - CA actions combine exception h<strong>and</strong>ling with the nested actionstructure to allow multiple outcomes, e.g. a normal outcome or some possibleexceptional outcomes.CA actions (as well as ACID transactions, conversations, etc.) are nested structuringunits of system design <strong>and</strong> execution.10 February, 2004


3. CA <strong>Actions</strong>: Action NestingT1T2CA actionraise e isignal ε jTnExternal objectsact upone = {e 1 , e 2 , e 3 , ...}ε = {ε 1 , ε 2 , ε 3 ,...}Recursive relation:Exceptions inside the CA action must be declared with theaction definition <strong>and</strong> h<strong>and</strong>led within the actionExceptions to be signalled from the action to its environment(e.g. the enclosing action) must be specified in the CA actioninterfaceε nested is a subset of e enclosing11 February, 2004


3. CA <strong>Actions</strong>: Concurrent ExceptionsIn a distributed system different activities may raise different exceptions <strong>and</strong> theexceptions may be raised simultaneously.Concurrent exceptions must be h<strong>and</strong>led in a coordinated manner.Example: If there are two exceptions fire_alarm <strong>and</strong> gaze_leakage raisedconcurrently in two distributed cooperating activities, their separate h<strong>and</strong>ling, orh<strong>and</strong>ling them in any order, or ignoring either of them can cause serious harm.There is a variety of reasons why several exceptions may be raised concurrently. Forexample, it is often difficult to interrupt the normal operations of the other nodesimmediately after an exception has been raised or the several exceptions can besymptoms of the same problems.12 February, 2004


3. CA <strong>Actions</strong>: Concurrent Exception ResolutionAn exception resolution graph approach is developed in order to find the exceptionthat “covers” all the exceptions raised concurrently. The graph imposes partialorder on all internal action exceptions.Universal Exception e4universal exceptionlevel 3Emergency Engine Loss Exception e3e 1 ∧ e 2 ∧ e 3level 2Left EngineException e1Right EngineException e2e 1 ∧ e 2 e 1 ∧ e 3e 2 ∧ e 3level 0level 1e 1 ee 2 e 3113 February, 2004


3. CA <strong>Actions</strong>: Long-Lived ActivitiesCA actions allow us to design long-lived activities as they support:• exception h<strong>and</strong>ling: you do not have to always abort <strong>and</strong> go back, yourather try to h<strong>and</strong>le the problem <strong>and</strong> continue• action nesting (recursive system structuring, choice of the right level ofgranularity) – lost computation can be minimised• explicit application-specific programming of cooperation/competitionwith respect to shared resources – minimise periods when sharedresource are locked14 February, 2004


3. CA <strong>Actions</strong>: A Case StudyThe FZI (Forschungszentrum Informatik, Germany) have specified <strong>and</strong> provided asimulator for the <strong>Fault</strong>-Tolerant Production Cell.It represents a manufacturing process involving six devices: two conveyor belts (afeed belt <strong>and</strong> a deposit belt), an elevating rotary table, two presses <strong>and</strong> a rotaryrobot that has two orthogonal extendible arms.The task of the cell is to get metal blanks from its “environment” via the feed belt,transform them into forged plates by using one of the presses, <strong>and</strong> then returnthem to the environment via the deposit belt.The challenge posed by FZI is to design a control system that maintains specifiedsafety <strong>and</strong> liveness properties even in the presence of a large number <strong>and</strong> variety ofdevice <strong>and</strong> sensor failures, <strong>and</strong> which continues to operate even if one of the pressesis non-operational.Our aim was to show how concurrent exception h<strong>and</strong>ling <strong>and</strong> CA actions aid both thedesign <strong>and</strong> validation of this control system.15 February, 2004


3. CA <strong>Actions</strong>: A Case Studyenvironmentsensortraffic ligh for insertiondeposit beltrobotarm_1arm_2press_216 February, 2004feed beltsystem clock alarm signal!elevatingrotarytablepress_1


3. CA <strong>Actions</strong>: A Case StudyOur design uses 12 main CA actions; each action controls one step of the blankprocessing <strong>and</strong> typically involves passing a blank between two devices.A control program was implemented that uses a Java implementation of a distributedCA action support (this scheme makes use of the nested multi-threaded transactionfacilities provided by the Arjuna transaction support system).concurrentthreadsCA action LoadPress1RobotSensorRobot(Arm1)rotate robotextend arm 1retractarm 1Press1SensorPress1move press 1to the middlepositionsynchronizingdrop blankExternalobjectBlankaccess17 February, 2004


3. CA <strong>Actions</strong>: A Case StudyPre-conditionsrobot offBlank on arm 1Both arms retractedrobot at one of the defined anglespress 1 offno blank in press 1press 1 in bottom positionpost-conditionsrobot offno blank on arm 1both arms retractedrobot angle: arm 1 towards press 1press 1 offblank in press 1press 1 in middle positionException to signalexceptional post-conditionsException to signalexceptional post-conditionsPress 1 failurerobot offblank on arm 1both arms retracted(rotary sensor or motorfailure) & Press 1 failurerobot offblank on arm 1robot angle: arm 1 towardspress 2press 1 offboth arms retractedpress 1 offno blank in press 1no blank in press 118 February, 2004


3. CA <strong>Actions</strong>: A Case StudyThe program controls the FZI simulator. Overlaid on the screen are rectangularoutlines showing the scopes of these CA actions.As the simulator is operated each of these outlines is gradually shaded in each timethe corresponding CA action is executed - the colour of this shading is changedwhen the CA action is involved in exception h<strong>and</strong>ling, in response to simulatedfaults.A fault injector was implemented.All injected device or sensor failures were caught successfully <strong>and</strong> h<strong>and</strong>ledimmediately by our control program. A previously unknown bug in the FZIsimulator was detected by a CA action <strong>and</strong> recovered automatically by the actionusing the retry operation.19 February, 2004


3. CA <strong>Actions</strong>: A Case Study20 February, 2004


3. CA <strong>Actions</strong>: Ongoing ResearchSeveral more case studies have been developed:• a real-time Production Cell• a distributed auction system• a railway control system• a distributed internet Gamma computationCA actions were intentionally developed as a general concept.There are concrete schemes for concurrent OO systems, process-oriented systems,message-passing distributed systems, component-based systems.These are “close” environments <strong>and</strong> architectures ...21 February, 2004


4. Integration of Complex Web ApplicationsThe focus of this ongoing work is on developing techniques for buildingdependable Web applications. The complex Web applications are being <strong>and</strong> willbe built by integration of existing Web services. This is typical of the emergingservice-oriented architecture paradigm.Existing <strong>and</strong> future Web applications with high dependability requirements:banking (bank portals), auctions, internet shopping, hotel/car/flight/trainreservation <strong>and</strong> booking, e-business, e-science (including the Grid computing),business account management, …22 February, 2004


4. Integration of Complex Web ApplicationsComplex systems of systems. Open systems. Web services to be integrated:• are ready-made (COTS) components• are autonomous systems without general control• may not provide sufficient quality of service (“dirty” boxes - have bugs, do not fit,have poor specification <strong>and</strong> documentation, etc.)• are black boxes (no source code, no spec) with known interfaces• belong to different organizations• are heterogeneous: they have different st<strong>and</strong>ards, fault assumptions, followdifferent conventions• may be in operation when being integrated• should provide individual services when integrated <strong>and</strong> when the integratedsystems fails• may change their behaviour on the fly.23 February, 2004


4. Integration of Complex Web ApplicationsIn addition to that:• integrated systems are to be used by general public lacking computer-relatedskills• the Internet is a poor communication medium: low quality, not predictable.Dependability is a serious concern. These systems are inherently complex <strong>and</strong> areprone to many faults of many types. Besides, multiple abnormal situations arelikely to happen concurrently.Conventional hardware fault tolerance techniques (replication, ordered delivery,group communication, TMR techniques <strong>and</strong> retry – cf OMG <strong>Fault</strong> TolerantCORBA service) can offer only partial solutions.24 February, 2004


4. Integration of Complex Web ApplicationsWe need fault tolerance mechanisms to deal with• users’ mistakes• components mistakes• component systems not delivering the service requested• environmental faults• component mismatches• application developers’ mistakes• <strong>and</strong> with all types of errors propagated by the underlying levels (OS,middleware, hardware) when they fail to deliver the required services.This is software fault tolerance at the application level (i.e., the level of integratedWeb applications).25 February, 2004


4. Integration of Complex Web ApplicationsForward error recovery because rollback/abort is not the right approach to dealwith the faults of all these type. Compensation is a particular case of exceptionh<strong>and</strong>ling (BPEL focus on compensation is misleading).Besides, canonical ACID transactions are not applicable for the Web:• the OASIS BTP (IONA, Sun, BEA <strong>System</strong>s, Choreology, etc.),• WS-Transactions&WS-Cooperation (submitted to W3C by IBM, MS <strong>and</strong> BEA).Existing schemes do not offer structured solutions, do not addresscooperative/competitive systems, do not rely on cooperative exception h<strong>and</strong>ling.CA actions are a good choice but there is a need for applying the general conceptto dealing with external non-ACID resources/components.26 February, 2004


5. CA Action Design of Web ApplicationsGeneral requirements for CA actions in the Web context. They should allow for:• dealing with component systems that are outside of our control• relaxing entry/exit synchronization (people, documents, organizations, goods, etc.)• new participants to be forked/joined• other component systems to be invited/involved into an action when necessaryBut they should• keep atomicity, exceptions <strong>and</strong> error propagation under control• provide co-operative exception h<strong>and</strong>lingWe rely on C.T. Davies’ concept of spheres of control while extending CA actions.We should talk about controlled atomicity.27 February, 2004


5. CA Action Design of Web Applications: Travel AgencyWe use the canonical Travel Agency (TA) case study as a running example todemonstrate our ideas. TA was designed as a pair:• the TA front end - client side (TAFE-CS): web front-end, exception h<strong>and</strong>ling,communication• the TA front end - server side (TAFE-SS): access to existing services, tripcomposition, exception h<strong>and</strong>ling, component system monitoringLinking Interface (LIF) is a reduction of the component system (WS) specificationthat includes its functional, temporal <strong>and</strong> dependability descriptions, which arerequired for integration.clientTAFE-CSLIFLIFTravel AgencyTAFE-SSLIFLIFKLMGeoDBHiltonHertzSabena28 February, 2004


5. CA Action Design of Web Applications: StructuringApplication-level software fault tolerance in TA is achieved by structuring TAusing nested CA actions <strong>and</strong> by employing disciplined exception h<strong>and</strong>lingwithin this framework.Cooperative exception h<strong>and</strong>ling at the action level can involve individual WSs,people including the client, TA support, component system support (ifpossible).29 February, 2004


5. CA Action Design of Web Applications: StructuringEach client session is a CA action consisting of a number of nested actionsperforming: availability checking, trip booking, trip cancellation, payment,etc.client controllerTA CS controlleractioncheck_availabilityactionbookingTA SS controlleraction session30 February, 2004


5. CA Action Design of Web Applications: Structuringclient controllerTA CS controlleractionrequestactionconsult_servicesTA SS controlleraction check_availabilityct controllerflightcarhotelaction compose_tripsfa controllerKLMSabenaBAaction flight_availability31 February, 2004


5. CA Action Design of Web Applications: ImplementationAn experimental implementation of the Internet Travel Agency using Java RMI<strong>and</strong> JavaServer Page (JSP), <strong>and</strong> a distributed (RMI) CA action supportdeveloped in Newcastle.General implementation structure:TA ClientSideTA Server SideclientHTTP/HTMLJSPRMIclientHTTP/HTMLJSPRMILIFsRMIlegacy componentsTA ClientSide32 February, 2004


5. CA Action Design of Web Applications: ImplementationClientbrowserHTTPrequestLegacycomponentClientbrowserRMICORBASOAPHTMLTA HTTPserver(JSP/ASP)HTTPRequestRMICORBARMICORBATA <strong>System</strong>RMICORBATACA actions33 February, 2004


5. CA Action Design of Web ApplicationsInterfacing subsystems (LIFs) provide a number of functionalities, includingsupport for:• turning a component system into a CA action participant taking part in allaction-specific activities such as cooperative exception h<strong>and</strong>ling (includingresolution of concurrent exceptions at the action level), action entry <strong>and</strong> exitsynchronisation (when necessary), etc.• structuring the whole TA recursively using CA actions with component systemstaking part in these actions• performing systematic <strong>and</strong> disciplined local error detection <strong>and</strong> exceptionh<strong>and</strong>ling at the level of component systems <strong>and</strong> dealing with mismatchesExperience:• structuring integrated WSs using CA actions• systematic dealing with external non-ACID resources (weakening theatomicity).34 February, 2004


5. CA Action Design of Web Applications: New CharacteristicsCA actions allow for disciplined exception h<strong>and</strong>ling <strong>and</strong> system structuring forfault tolerance. New characteristics:• control of the WSs accessed during action execution• cooperation of WSs (normal <strong>and</strong> abnormal behaviour)• dedicated processes representing participation of the individual WSs in theactions.• flexible participation (using participant forking-joining)• flexible structuring (action nesting <strong>and</strong> action composition)35 February, 2004


5. CA Action Design of Web Applications: WSCAOngoing joint work between INRIA Rocquencourt (V. Issarny’s group) <strong>and</strong>Newcastle U.The WSCA (Web Service Composition <strong>Actions</strong>) scheme is an extension of CAactions <strong>and</strong> their adaptation for this application area.An XML-based language is under development. The WSCA language (WSCAL)builds on coming W3C st<strong>and</strong>ards: Web Service Description Language (WSDL)<strong>and</strong> Web Services Conversation Language (WSCL).It allows dependable composition of WSs to be specified at an abstract levelsupporting recursive system structuring <strong>and</strong> cooperative exception h<strong>and</strong>ling.36 February, 2004


5. CA Action Design of Web Applications: Future WorkOur future work in the WS area focuses:• on developing a middleware WSCAL support to allow the skeletons of thecomposed applications to be automatically generated• <strong>and</strong> on introducing extended CA actions into existing business-logic description<strong>and</strong> composition languages (BPEL4WS).37 February, 2004


6. Future Work <strong>and</strong> Conclusions<strong>Fault</strong> tolerance in open dynamic loosely-coupled systems:• Mapping the CA action concept into the context of agent systems. In particular,introducing nested structuring <strong>and</strong> cooperative exception h<strong>and</strong>ling using LGI• Development of the CA action-based schemes for mobile (e.g. tuple-based)environmentsSoftware engineering issues:• Formalisation, including action <strong>and</strong> participant refinement <strong>and</strong> decomposition• CA actions in software architecture/ADLsCA actions for fault tolerance in service-oriented architectures.38 February, 2004


6. Future Work <strong>and</strong> ConclusionsIt is unfortunate <strong>and</strong> counterproductive to focus all efforts on making systemsfaultless.Exception h<strong>and</strong>ling is the most general means for tolerating faults of the widestpossible range.<strong>Fault</strong> tolerance features to (e.g. exception h<strong>and</strong>ling mechanisms) should match thedevelopment paradigm, the specific characteristics of the application domain, thecomputational model.(Recursive) system structuring <strong>and</strong> providing system fault tolerance should goh<strong>and</strong> in h<strong>and</strong> in developing complex systems.39 February, 2004


Thanks to Brian R<strong>and</strong>ell, Avelino Zorzo, Valerie Issarny, Jie Xu, CeciliaRubira, Panos Perriorelis, Ian Welch, Robert Stroud, Galip-FerdaTartanoglu, Cliff Jones, Joey Coleman <strong>and</strong> Nicole Levy.Two EC Projects:Design for Validation - DeVa (1996-1999)Dependable <strong>System</strong>s of <strong>System</strong>s - DSoS (2000-2003)40 February, 2004

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!