SPACE SHUTTLE FAULT TOLERANCE: ANALOG AND DIGITAL TEAMWORK

Hugh Blair-Smith, Down to the Metal, Dennis, MA
(formerly of C. S. Draper Lab, Cambridge, MA)

Abstract

The Space Shuttle control system (including the avionics suite) was developed during the 1970s to meet stringent survivability requirements that were then extraordinary but today may serve as a standard against which modern avionics can be measured. In 30 years of service, only two major malfunctions have occurred, both due to failures far beyond the reach of fault tolerance technology: the explosion of an external fuel tank, and the destruction of a launch-damaged wing by re-entry friction.
The Space Shuttle is among the earliest systems (if not the earliest) designed to a “FO-FO-FS” criterion, meaning that it had to Fail (fully) Operational after any one failure, then Fail Operational after any second failure (even of the same kind of unit), then Fail Safe after most kinds of third failure. The computer system had to meet this criterion using a Redundant Set of 4 computers plus a backup of the same type, which was (ostensibly!) a COTS type. Quadruple redundancy was also employed in the hydraulic actuators for elevons and rudder. Sensors were installed with quadruple, triple, or dual redundancy. For still greater fault tolerance, these three redundancies (sensors, computers, actuators) were made independent of each other so that the reliability criterion applies to each category separately. The mission rule for Shuttle flights, as distinct from the design criterion, became “FO-FS,” so that a mission continues intact after any one failure, but is terminated with a safe return after any second failure of the same type.
To avoid an unrecoverable flat spin during the most dynamic flight phases, the overall system had to continue safe operation within 400 msec of any failure, but the decision to shut down a computer had to be made by the crew. Among the interesting problems to be solved were “control slivering” and “sync holes.” The first flight test (Approach and Landing only) was the proof of the pudding: when a key wire harness solder joint was jarred loose by the Shuttle’s being popped off the back of its 747 mother ship, one of the computers “went bananas” (actual quote from an IBM expert).
MIT Roles In Apollo And Space Shuttle

I was never on NASA’s payroll, but worked as a staff member (1959-1981) at the Charles Stark Draper Laboratory, formerly the MIT Instrumentation Lab in the Apollo years when it was the prime contractor for Guidance, Navigation & Control. CSDL, spun off as an independent corporation in 1973 but still a part of the MIT community, was in a consulting role to NASA and IBM Federal Systems in the Shuttle program. We who had designed the Apollo Guidance Computer knew that NASA wasn’t going that way again, so we re-invented ourselves as fault tolerance experts.

Fault tolerance for hard failures in Apollo was relatively simple: if the crew thought the Primary GN&C System (PGNCS) had stopped working, they would switch to a totally different backup system. We took that as a requirement that the PGNCS shall suffer no such failures, ever, and moved heaven and earth to achieve that goal—and succeeded! Part of that effort was a not-so-simple approach to computer recovery from transient faults: readiness to stifle low-priority tasks in overload conditions, and saving the status quo frequently and effectively rebooting (very quickly) when things got too weird. That combination was what saved the Apollo 11 lunar landing when a radar interface went haywire; don’t let Richard Nixon [1] or anyone else tell you that the computer “failed” and a heroic human took over and saved the day!
NASA was under pressure to maximize use of Commercial Off-The-Shelf (COTS) subsystems in the Space Shuttle, theoretically for economy. They had no more confidence in such units than we did, and so designed in massive redundancy and advanced techniques to manage it. Thus it was that our role focused on working with NASA on the architecture of the whole vehicle to make the analog and digital parts cooperate in fault tolerance, and specifically debating with IBM Federal Systems on how to give the spacecraft computers “instantaneous” fault tolerance.
Aside from confidence in suppliers, another issue that excited the crews’ interest in high reliability was that the Shuttle would be one of the first aircraft to implement control in the “fly-by-wire” mode. It was all very well for Apollo-style spacecraft to be fly-by-wire because pilots hadn’t known any other type, but if it looked and flew more like an airplane, the astronauts tended to remember how good it felt to know that there were strong steel cables between their hand controls and the aerosurfaces, albeit with a power boost from hydraulics. It was our very good fortune to have Bob Crippen as the astronaut expert in this area, with his broad and deep understanding of what we were doing.
Once we were satisfied that the Orbiter architecture would support analog-digital teamwork (as in Apollo, we weren’t involved with the booster), we focused primarily on the Flight Computer Operating System (FCOS), which in turn was focused on task management, including redundancy management, among the General Purpose Computers (GPCs) and the sensors and actuators. It wasn’t any sort of a DOS (since there were no disk drives), and it didn’t have to handle directly the crew displays and inputs, as there are separate Display Electronics Units (DEUs) for that. The focus on task management was well suited to our experience in the same area in Apollo, whose multi-tasking “executive” was much too simple to be called an Operating System. As we’ll see, the exact architecture of the priority-oriented task management was crucial to our approach to redundancy management.
Reliability Requirements And Degrees

Perhaps the biggest difference between Apollo and the Shuttle was the degree to which the latter, though a glider, had to emulate a regular powered aircraft. One requirement is for a cross-range capability of 1500 miles, meaning that the airframe has to be highly maneuverable—the downside of that is a capability (if not properly controlled) of going into an unrecoverable flat spin during descent. It was soon determined that “properly controlled” meant no major defects in control could last longer than 400 msec. Obviously, the crew couldn’t switch to a backup system under that constraint, so fully automatic recovery from any single failure became a requirement.

Such a recovery can be categorized as either Fail Safe (FS) or Fail Operational (FO), depending on whether the mission is aborted or allowed to continue undiminished after that failure. Since Shuttle missions can be quite long (weeks), treating the first recovery as FO requires an equally fully automatic recovery from a second failure, even of the same type of unit. The only relief in that requirement is an assumption that failures do not happen simultaneously; that there’s time to clean up the configuration after a first failure before a second failure of the same type takes place. The combination of two similar failures can then be called either FO-FS or FO-FO, depending on whether an abort is called on the second failure. The presence of an independent backup system, which the crew can switch to after a third failure if it doesn’t happen in one of those control-critical moments, suggests that the Shuttle avionics architecture could be called FO-FO-FS—Fail Operational-Fail Operational-Fail Safe—well, mostly safe anyway. NASA’s mission rules take the conservative approach of the FO-FS level, since two failures of one type of unit may not be a coincidence.
Reliability Approach Throughout

The deliberations on vehicle-wide fault tolerance architecture settled on quadruple modular redundancy (QMR) with voting for critical units, with a sprinkling of triple modular redundancy (TMR) with voting for units that were the best but not the only way to do their jobs, and a few dual-redundant, less critical, units. To obtain the benefits of analog-digital teamwork, this approach extended well beyond GPCs: to data buses, peripheral electronics, and even hydraulics, where servovalves driving secondary actuators perform QMR voting of their own.

The advantage of modular redundancy with voting is that reliability does not depend on built-in tests, which can always be shown to have incomplete coverage; the challenge is that voting requires strict rules for which signals or actions are enough alike to be considered the same (and thus qualified to be part of the majority). Units with significant analog parts have to be forgiven differences within a tolerance, while all-digital units must produce identical results. Given this, and given the stipulation that failures happen one at a time, QMR voting is 4-out-of-4 initially, then 3-out-of-4 when a failure occurs; then it becomes TMR by ceasing to try working with the failed unit. Similarly, TMR votes 3-out-of-3 until one of those 3 fails, then 2-out-of-3. As you can’t make a majority out of an electorate of 2, the only assured recovery from a third failure is a resort to the independent backup.
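To make the degradation rule concrete, here is a minimal sketch (hypothetical C, not Shuttle flight code) of majority voting over the currently active members of a redundant set, with a tolerance band for “enough alike”; an all-digital unit would use a tolerance of zero, i.e. exact equality. A member outvoted by the majority is simply dropped from the active set, so QMR degrades to TMR, while a set of 2 can detect a disagreement but can no longer outvote it.

```c
#include <math.h>
#include <stdbool.h>

#define MAX_UNITS 4

typedef struct {
    double value[MAX_UNITS];   /* latest signal from each redundant unit      */
    bool   active[MAX_UNITS];  /* still part of the voting set?               */
    double tolerance;          /* "enough alike" band; 0.0 for all-digital units */
} RedundantSet;

/* Count how many active units agree with unit i within the tolerance. */
static int agreeing(const RedundantSet *s, int i)
{
    int n = 0;
    for (int j = 0; j < MAX_UNITS; j++)
        if (s->active[j] && fabs(s->value[j] - s->value[i]) <= s->tolerance)
            n++;                       /* count includes unit i itself */
    return n;
}

/* One voting pass: any active unit outvoted by a strict majority of the
 * active set is dropped and never consulted again (QMR -> TMR -> duplex). */
void vote_and_degrade(RedundantSet *s)
{
    int members = 0;
    for (int j = 0; j < MAX_UNITS; j++)
        if (s->active[j]) members++;

    if (members < 3)
        return;  /* an electorate of 2 cannot form a majority; handled elsewhere */

    for (int i = 0; i < MAX_UNITS; i++)
        if (s->active[i] && agreeing(s, i) <= members / 2)
            s->active[i] = false;      /* a minority of 1 against 3 (or 2) loses */
}
```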
Voting With Multi-Port Hydraulics

Before getting into the computer system, let’s look at how the secondary hydraulic actuators achieve the FO-FO-FS standard when moving the aero control surfaces: elevons, rudder, and body flap. Each of these surfaces is driven by a large powerful primary actuator which in itself must never fail, just as each wing must never fail. Hydraulic pressure is fed into the secondary actuator by 4 servovalves, any one of which can feed in the full pressure. If 3 of them call for deflection up and 1 calls for deflection down, the majority overpower the minority in feeding pressure to the primary actuator. When the minority servovalve is closed and locked by disabling its servoamplifier, there is still 2-out-of-3 hydraulic voting (Figure 1).

Figure 1. Hydraulic Actuator Voting
Teamwork With QMR Actuator Servos

Since the hydraulic actuators are used in the critical phases of descent, approach, and landing, their relationship to GPCs and data buses is the key to the digital fault tolerance. During critical phases, the 4 GPCs running the Primary Avionics Software System (PASS—a fifth GPC is running the Backup Flight System) are configured as a Redundant Set (RS) with each GPC driving one of the Flight Critical data buses, and each of those buses controls one servoamplifier at each secondary actuator. Thus every primary/secondary actuator set is driven by the entire RS with a FO-FO level of fault tolerance.

As suggested above, Shuttle avionics is a system of distributed computers, with many local processors in addition to the GPCs. Most of these are designated by the functional name of Modulator/Demodulator (MDM) because of their dependence on direction from the data buses, but they are in fact embedded processors capable of interpreting polling queries and actuator commands from a GPC, controlling whatever subsystem they belong to, and returning sensor data to the entire RS. The fact that they communicate with the GPCs by whole coherent messages is important because it allows the GPCs to be synchronized at a fairly coarse granularity, as opposed to, say, every instruction execution.

Figure 2 shows all the data paths involved when GPCs in the RS send command messages to a typical actuator (shown schematically in the upper right corner). A command message is a coherent semantic unit with a content such as “Go to +5° deflection for the next 40 msec.” The horizontal lines between GPCs and MDMs are the 4 aft-subsystem flight critical data buses, each controlled by one GPC. Each of the Flight Critical/Aft-subsystem MDMs (FA1-4) controls one ServoAmplifier (ASA). Note how the ΔP signals from the servovalves feed back into the ASAs and the Monitors, from which they are passed to the MDM. That’s how the GPCs can poll these MDMs to see whether any ΔP signal is so far out of line as to be adjudged a servo failure.

Figure 2. GPCs To MDMs To ASAs To Actuators
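As a purely illustrative sketch (the actual Shuttle bus message formats are not reproduced here), such a coherent command message need carry little more than a surface identifier, a commanded deflection, and a validity interval:

```c
#include <stdint.h>

/* Hypothetical shape of one actuator command message: a single coherent
 * semantic unit such as "go to +5 degrees deflection for the next 40 msec". */
typedef struct {
    uint8_t  surface_id;        /* which elevon / rudder / body flap segment   */
    int16_t  deflection_cdeg;   /* commanded deflection, hundredths of a degree */
    uint16_t valid_msec;        /* how long the command remains in force        */
} ActuatorCommand;
```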
Teamwork With Redundant Sensors

For input data to the GPCs, the fault tolerance design requires that each GPC in the RS receive exactly the same data from the set of redundant copies of any sensor. The software examines the multiple instances of input data and applies a filtering rule appropriate to the unit and its degree of redundancy, e.g. average, intermediate value, etc. If all the digital gear is working, this will produce the exact same filtered data in all GPCs in the RS, plus the exact same opinion of which sensor data is out of line, if any. That’s important to keep the calculations identical throughout the RS. (Cases where not all the digital gear is working are covered further on.)
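As an illustration only (the real PASS selection filters varied by sensor type and degree of redundancy), the sketch below shows one such rule in hypothetical C: a mid-value select over three redundant readings, plus an out-of-tolerance check. Because every GPC in the RS starts from bit-for-bit identical input data, every GPC computes bit-for-bit the same filtered value and the same verdict on which reading, if any, is out of line.

```c
#include <math.h>

/* Hypothetical mid-value-select filter for a triply redundant sensor. */
typedef struct {
    double selected;      /* value handed on to flight control              */
    int    suspect;       /* index of an out-of-tolerance reading, or -1    */
} FilterResult;

FilterResult mid_value_select(const double r[3], double tolerance)
{
    FilterResult out = { 0.0, -1 };

    /* The middle of three values is the maximum of the pairwise minima. */
    double lo01 = fmin(r[0], r[1]);
    double lo12 = fmin(r[1], r[2]);
    double lo02 = fmin(r[0], r[2]);
    out.selected = fmax(fmax(lo01, lo12), lo02);

    /* Flag any reading too far from the selected (middle) value. */
    for (int i = 0; i < 3; i++)
        if (fabs(r[i] - out.selected) > tolerance)
            out.suspect = i;

    return out;
}
```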
The MDMs are omitted from Figure 3, but should be understood to be in the long horizontal lines, one per Rate Gyro. The cross-coupling shown is actually performed by the data buses, since each MDM sends its data on all 4 buses. This implies that the 4 MDMs cannot all transmit at once, which seems contrary to the principle of synchronized data gathering, but they do sample and hold the data as simultaneously as their respective controlling GPCs sent polling commands; then they take turns on the bus set using normal contention resolution.

Figure 3. GPCs In RS Listening To All Rate Gyros
The Need For Exact Data Uniformity

If TMR had been a fully satisfactory fault tolerance approach, it would have been good enough to make the 3 GPCs in a TMR-level RS acquire input data that was substantially though not bit-for-bit equal, and issue commands that are substantially though not bit-for-bit equal. Going to quad redundancy and achieving graceful degradation of many types of units poses a number of complications, but above all, a sort of “smoking gun” issue that establishes the need for identical data in all GPCs: “Control Slivering.” With 4 GPCs calculating maneuvers that are almost the same, the system could put actuators into a 2-against-2 force fight, resulting in no maneuver. For Shuttle purposes, we made this argument about a 180° roll maneuver (common enough for spacecraft), but in a conference with a commercial-aircraft orientation, perhaps we should talk about a 180° turn instead! This is not any kind of failure case. With all digital and analog units working within their design tolerances, 2 GPCs could call for a maneuver of +179.99° and the 2 others could call for a maneuver of -179.99°.
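A tiny numerical sketch (hypothetical C) shows how this happens with no unit being “wrong”: if each GPC computes the shortest-path rotation to a target 180° away from its own, slightly different, estimate of the current attitude, estimates differing by only 0.02° yield commanded maneuvers of opposite sign.

```c
#include <math.h>
#include <stdio.h>

/* Shortest-path rotation from 'current' to 'target', in degrees, wrapped
 * into (-180, +180]. A hypothetical stand-in for a GPC's maneuver planner. */
double shortest_rotation_deg(double target, double current)
{
    return fmod(target - current + 540.0, 360.0) - 180.0;
}

int main(void)
{
    /* Two GPCs whose attitude estimates differ by a mere 0.02 degrees: */
    printf("%+.2f\n", shortest_rotation_deg(180.0, +0.01));  /* prints +179.99 */
    printf("%+.2f\n", shortest_rotation_deg(180.0, -0.01));  /* prints -179.99 */
    return 0;
}
```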
Method Of GPC Synchronization

Having dictated exact uniformity of data in all GPCs in the RS, we had to define task management to support it. Part of the problem was that the GPCs weren’t really designed for strictly synchronized operation, so they were supplied with a set of interconnecting 28V discretes, such that 3 discretes (“sync lines”) from each GPC could be read as a 3-bit “sync code” by each of the other GPCs. What task management in the FCOS has to do, without violating the requirement for priority-driven task execution, is to establish enough levels of “sync points” to make all changes of priority level occur at the same point of logical flow in all GPCs. As a practical matter, there are only three occasions to switch to a different task: the normal end of a task, the expiration of a timer interval, and the completion of an input/output (I/O) operation. The timer and I/O occasions are the fundamental causes of program interrupts, but in computers as loosely coupled as these, there’s no way to force each interrupt to appear at exactly the same point in the logic flow in all GPCs.

Our solution was to aggregate effective logic flow into much larger blocks than instruction executions, and to require a sync point when ready to pass from one to the next. Interrupts butt in wherever they can, as in any computer system, but they are not allowed to affect mission logic flow except under control of a sync point. To assure timely response to such tamed interrupts, we made a rule that every task has to invoke a “programmed sync point” at least every millisecond. (This would never work in an ordinary mainframe environment where many unrefined programs are running around without regard to rules and freezing the works with infinite loops and deadly embraces, but Shuttle software is built in a much more disciplined environment.) The expiration of a timer interval flags the next sync point to become a “timer sync point,” and an I/O completion flags the next sync point to become an “IOC sync point.” These flags take the form of priorities, with programmed sync points having the lowest priority, timer sync points the next higher, and I/O completions the highest.
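A minimal sketch of that flagging scheme (hypothetical C; the real FCOS sync routines were written in assembly language, as described later) keeps one pending flag per interrupt source, and each programmed sync point proposes the highest-priority pending code to the rest of the RS:

```c
/* Hypothetical sketch of how a programmed sync point chooses the sync code
 * to drive onto the sync lines. Interrupt handlers only set pending flags;
 * the logic flow changes nowhere except at a sync point. */
enum SyncCode {
    SYNC_NONE       = 0,   /* 000: no sync point requested */
    SYNC_PROGRAMMED = 1,   /* 001: programmed sync point   */
    SYNC_TIMER      = 2,   /* 010: timer interrupt pending */
    SYNC_IO_DONE    = 3    /* 011: I/O completion pending  */
};

static volatile int timer_pending;   /* set by the timer interrupt handler */
static volatile int io_pending;      /* set by the I/O completion handler  */

enum SyncCode propose_sync_code(void)
{
    /* Highest-priority pending event wins; plain block boundary otherwise. */
    if (io_pending)    return SYNC_IO_DONE;
    if (timer_pending) return SYNC_TIMER;
    return SYNC_PROGRAMMED;
}
```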
These priority values are exchanged among GPCs as sync codes, via the sync lines mentioned above. The effect is that no GPC proceeds to a new logic block, either the next one in the same task or a different one that responds to an interrupt-level sync point, until all GPCs have come to a common point in their logic flow and agree (under a timeout limit) on a priority level. That’s how synchronization with data uniformity is maintained throughout the RS in the absence of failures. Next we must consider how the synchronization scheme responds to failure conditions.
Case 1: GPC “Total” Failure

If a GPC fails, as happened on the first Approach and Landing Test flight, it will either be a no-show at the next programmed sync point, or (if it’s running at all) may enter a sync point prematurely and see all the others as no-shows. Either way, the timeout limit leads to the conclusion that a failure to sync has occurred, and every GPC reaching this conclusion turns on an alarm light to alert the crew, who alone are empowered to shut down a GPC. The Computer Annunciation Matrix, described in a later section, presents enough information for a confident decision.
Case 2: Sensor Failure

If a sensor (or the MDM that provides the only access to it) fails, it’s essential that all GPCs in the RS perceive this failure on the same I/O cycle, so that they can simultaneously exclude it from their data filter and resolve not to listen to that sensor again. Whenever a sensor failure is simply an error in numerical value and doesn’t trigger a low-level I/O error condition, it can be caught by any of several “old-fashioned” software techniques: parity and cyclic redundancy checks, reasonableness tests based on serial correlation or a check on uniformity within a tolerance with corresponding data from redundant sensors.

Where there is a low-level I/O error, there is generally a fault in some digital subsystem, but it’s not always clear whether the fault lies in an MDM or bus (on the far side of the cross-coupling from the GPCs) or on the GPC side of the cross-coupling. That is, certain digital subsystem failures can present the appearance of a sensor/MDM failure to some GPCs but not to others, which leads into the next failure case.
Case 3: GPC Bus Receiver Failure

If a GPC becomes unable to receive data from a particular data bus, it can’t by itself distinguish that case from a bus failure, an MDM failure, or a sensor failure. The importance of the distinction is of course that if all GPCs in the RS see the same problem, the GPCs must be working correctly and the “string” (bus/MDM/sensor) must be at fault, while if only one GPC sees a problem, it must take the blame as the failed unit so that the other GPCs can continue normal use of that string and sensor. For this reason, the number of sync codes was increased to distinguish normal I/O completion from I/O completion with errors. That part of the design went through several iterations, of which the version presented here is the simplest to describe but doesn’t lack any of the final functionality.
Sync Code System (No I/O Errors)

This section goes into more detail about how GPC synchronization resolves interrupt priority races and detects faults in any GPC’s logic flow, without getting into I/O errors. For these purposes, four sync codes are enough for each GPC to issue as shown in Table 1.

Table 1. Basic Sync Codes (Binary)

Code  Interpretation
000   No sync point requested
001   Programmed sync point
010   Timer interrupt
011   I/O Completion interrupt
In the absence of any interrupts, the GPC with the fastest clock gets to a programmed sync point first, issues sync code 001, and waits for the others in the RS to do likewise, but only long enough for the GPC with the slowest clock to catch up. If all the others turn up with code 001 within the time limit, this GPC proceeds into the next block of the current task. Any other GPC that doesn’t catch up within the tolerance on clock rates is considered by this GPC to be out of sync, and it removes that GPC from its record of the RS and lights the appropriate crew alarm, but then proceeds into the next block as usual. If there’s only 1 no-show out of the original 4, the “local” RS is reduced to 3; but if all the other 3 are no-shows, the local RS is reduced to 1, in which case the lonely GPC does not attempt to enter any more sync points. Presumably, the lonely GPC is convicted of failure, but only the crew may turn it off.
When a timer interrupt occurs in any GPC, it has no immediate effect but lies in wait for a programmed sync point. When that occurs, this GPC issues sync code 010 to propose that course to the RS, and waits for the others to agree on 010 within the time limit. Assuming for now that no I/O Complete interrupt occurs while this is going on, the sequence is similar to the 001 case, except that all GPCs that agree on code 010 proceed to process the timer interrupt and perform whatever change of task that implies. Processing either type of interrupt is simply a matter of elevating the priority of whatever task is supposed to follow the interrupt event, and is so quick that it doesn’t need to do a programmed sync point of its own. Also, any GPCs that fail to sync may be showing 001 rather than being no-shows, but it’s a fail to sync just the same. If all but one of the RS agree on 010, the timer logic in the lonely GPC may have failed (stopped), but if only one winds up with 010 while the others time out agreeing on 001, its timer logic may be triggering falsely.
When an I/O Complete interrupt occurs in any GPC, it also has no immediate effect but lies in wait for a programmed sync point. (For now, assume no timer interrupt occurs around this time.) When the programmed sync point occurs, this GPC issues sync code 011 to propose that course to the RS, and waits for the others to agree on 011 within the time limit. The sequence is again similar, except that all GPCs that agree on code 011 proceed to process the I/O completion and perform whatever task change that implies. If all but one of the GPCs agree on 011, the I/O logic in the lonely GPC may have failed, but if only one GPC winds up with 011 while the others time out agreeing on 001, its I/O completion logic may be triggering falsely.
When both types of interrupt occur at approximately the same time, they may occur in the same or different orders in the GPCs, so an additional step is seen to be required whenever a timer sync code (010) has to yield to an I/O Completion code (011). The GPC observing this event maintains a record of it, and the next sync point to occur (most likely in the high-priority task invoked by the I/O completion) looks for it and issues 010, rather than 001, if it finds it. Thus, a timer interrupt that loses a close race with an I/O completion still gets processed within a millisecond—though whether the task it invokes has a higher or lower priority than the I/O-triggered task is another question. This combination case has more paths to enumerate (and to verify and validate), but the priority organization of both the sync codes and the tasks triggered by the interrupt events keeps the task management fairly straightforward.
Sync Code System (With I/O Errors)

This section covers how GPC synchronization at I/O completion uses extended sync codes to determine the correct response to input errors detected by a GPC’s hardware-level checking. Since the sensors (including feedback from actuators) are organized into 4 strings, and since we don’t have to consider multiple simultaneous failures, four extended sync codes are added to the original four as shown in Table 2.

Table 2. Full Set Of Sync Codes (Binary)

Code  Interpretation
000   No sync point requested
001   Programmed sync point
010   Timer interrupt
011   I/O Completion interrupt, no error
100   I/O Completion, error on string 1
101   I/O Completion, error on string 2
110   I/O Completion, error on string 3
111   I/O Completion, error on string 4
At I/O Completion sync points, the codes 011, 100, 101, 110, and 111 are all considered to be in “agreement” as far as synchronization goes, but the fault tolerance logic takes a second look at them as soon as synchronization is assured. If all GPCs in the RS reported code, say, 110, they all evict string 3 data (for the sensor currently being sampled) from their filter calculations and resolve never to look at that sensor again.

If just one GPC reports, say, 101, while all the rest of the RS report 011, the lonely GPC is judged to have failed to sync after all, with the usual consequences. This is required because that GPC is unable to perform the same input filtering calculations as the others, and the others are not “willing” to reduce their input data set by eliminating data that they all received correctly. Under our rule about one failure at a time, the only combinations that have to be considered are as shown in Table 3, where A, B, and C designate GPCs other than Self.
Table 3. I/O Sync Codes In A 4-GPC RS

Self  A    B    C    Interpretation & Action
011   011  011  011  No string fail; use all data
011   011  011  1xx  C failed sync; use all data
011   011  1xx  011  B failed sync; use all data
011   1xx  011  011  A failed sync; use all data
1xx   011  011  011  Self failed sync; leave RS
100   100  100  100  String 1 fail; use 2-3-4 data
101   101  101  101  String 2 fail; use 1-3-4 data
110   110  110  110  String 3 fail; use 1-2-4 data
111   111  111  111  String 4 fail; use 1-2-3 data
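The decision rule of Table 3 is small enough to sketch in a few lines of hypothetical C: either the whole RS agrees (use all data, or unanimously evict one string), or there is a lone dissenter, which may be a peer or Self, and the dissenter is the one judged to have failed to sync.

```c
/* Hypothetical digest of Table 3 for one GPC ("self") in a 4-GPC RS.
 * Codes: 3 = I/O complete, no error; 4..7 = I/O complete, error on
 * string (code - 3). Assumes at most one new failure per I/O cycle. */
typedef enum {
    USE_ALL_DATA,        /* no string failed (a dissenting peer leaves the RS) */
    DROP_STRING,         /* unanimous error report: evict that string's data   */
    PEER_FAILED_SYNC,    /* one peer dissents: it is the failed unit           */
    SELF_FAILED_SYNC     /* we are the lone dissenter: we leave the RS         */
} IoVerdict;

IoVerdict judge_io_cycle(int self, const int peer[3], int *failed_string)
{
    int dissenters = 0;
    for (int i = 0; i < 3; i++)
        if (peer[i] != self) dissenters++;

    if (dissenters == 0) {               /* whole RS agrees                     */
        if (self == 3) return USE_ALL_DATA;
        *failed_string = self - 3;       /* codes 100..111 -> strings 1..4      */
        return DROP_STRING;              /* only case where failed_string is set */
    }
    if (dissenters == 1 && self == 3)    /* one peer saw an error we did not    */
        return PEER_FAILED_SYNC;
    return SELF_FAILED_SYNC;             /* we are the odd one out              */
}
```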
Although I’ve been using “sensor” in connection with the data being treated in this way, it’s important to realize that all the actuators include sensors to report back on their own performance, e.g. the ΔP feedback from the hydraulics. Similarly, subsystems that are primarily sensors often accept effector commands, e.g. alignment commands for IMUs. So most MDMs participate in two-way conversations.
As stated above, not all the local processors are MDMs. Figure 4 lists all the types and shows how they are connected through the 7 different types of data buses to all 5 GPCs—one of which is running the backup software (BFS). The BFS GPC normally listens to all buses so as to stay current with the state of the vehicle, but does not command any bus except its own Intercomputer bus; it commands all buses when backup mode is selected. Whatever the type of local processor, the data bus protocols for message format, contention resolution, etc. are the same as described above for MDMs.

Figure 4. Shuttle Data Bus Architecture
Sync Point Routines And Sync Holes

The only GPC code we at MIT generated was for the sync routines. In contrast to all the flight-related code, which was written in HAL/S (a very high-level engineering language incorporating vectors, matrices, and differential equations), the sync routines were in assembly language, more like kernel programming with every nerdy slick trick allowed, and all the microseconds carefully counted out. The central loop of each sync routine made use of the fact that all 15 sync lines (3 from each of the 5 GPCs) could be read in by a single instruction and passed through a current RS mask to look for agreement, with about 10 μsec per iteration—remember, that was blazing fast in the 1970s. This implied that with random phasing between the loops in various GPCs, one could exit a sync point as much as 10 μsec ahead of another.
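The shape of that central loop, transcribed very loosely into C, is roughly as follows; the flight version was hand-counted assembly language, and the function names, timing constants, and simplifications here are invented for illustration (in particular, agreement on the extended I/O codes and across differing priority levels is not shown).

```c
#include <stdint.h>

/* Hypothetical sketch of a sync-point wait loop. One hardware read returns
 * all 15 sync lines (3 bits per GPC); rs_mask says which GPCs are still in
 * this GPC's view of the Redundant Set. Each iteration of the real loop
 * took roughly 10 usec; values here are purely illustrative. */
extern uint16_t read_sync_lines(void);          /* assumed hardware access */
extern void     drive_own_sync_code(int code);  /* assumed hardware access */

#define ITERATION_US 10
#define TIMEOUT_US   400   /* generous cover for the spread in clock rates */

/* Returns a bitmask of RS members that never showed the expected code. */
int wait_at_sync_point(int own_code, int rs_mask)
{
    drive_own_sync_code(own_code);

    int late = rs_mask;                          /* everyone is late at first */
    for (int waited = 0; waited <= TIMEOUT_US; waited += ITERATION_US) {
        uint16_t lines = read_sync_lines();
        for (int gpc = 0; gpc < 5; gpc++) {
            int code = (lines >> (3 * gpc)) & 0x7;     /* that GPC's 3 bits  */
            if ((rs_mask & (1 << gpc)) && code == own_code)
                late &= ~(1 << gpc);             /* it has arrived, agreeing */
        }
        if (late == 0) break;                    /* full agreement: proceed  */
    }
    return late;   /* non-zero: these GPCs failed to sync; alert the crew    */
}
```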
We spent vast amounts of effort searching for ways in which sync codes could change at such times that they could be seen differently by different GPCs, further complicated by the fact that each GPC had its own clock rate, and refined the code to minimize these “sync holes.” Fortunately, we were doing this at a time when the main-engine people and the thermal-tile people were struggling with their development problems and taking all the blame for delays, which gave us time to grind the sync holes down to very small probabilities. As far as I know, no Shuttle flight has experienced one.
Computer Annunciation Matrix

Figure 5 shows this dedicated display, which presents each GPC’s vote as to whether itself or some other GPC has failed to sync. It treats all 5 GPCs equally, relying on the crew to know which computer is running BFS and which PASS computers are the RS. As the labeling shows, the top row is GPC 1’s votes regarding each of the 5 GPCs, the second row is GPC 2’s votes, and so on. Correspondingly, each column displays the “peer group’s” opinion of one GPC. As long as there is no fail to sync, the whole matrix is dark. Any GPC’s vote against another GPC turns on a white light; when a GPC perceives that it has fallen out of the RS, it turns on the yellow light in the main diagonal. If the crew turns off a GPC, the entire corresponding row and column go dark so as not to clutter the unit’s possible later displays.

Figure 5. Computer Annunciation Matrix: CAM
Various patterns of lights are quite informative, and help the crew decide whether a GPC that failed to sync might be worth trying again in a later mission phase. For example, suppose that, in an RS of GPCs 1-4, GPC 2 finds that it’s the only one unable to read one of the buses, but is otherwise running normally. The pattern of Figure 6 suggests that GPC 2 might be usable again if some configurational change were made, e.g. giving it command of a different bus.

Figure 6. CAM After GPC 2 Failed To Sync
The Proof Of The Pudding

The first free-flight test of the prototype Space Shuttle, Enterprise, was in 1977. It was carried aloft by a specially adapted Boeing 747 (still in use to ferry Shuttles from Edwards AFB to Kennedy Space Center) and was popped off the mother ship’s back to perform the Approach and Landing phases. At the instant of separation, the CAM pattern of Figure 6 appeared for a split second, then that of Figure 7, making it clear that GPC 2 was very sick indeed. The crew moded the GPC through STANDBY to OFF and the CAM went dark while the remaining RS of GPCs 1, 3, and 4 controlled the vehicle to a smooth flawless landing.
Figure 7. CAM After GPC 2 Failed Completely

The immediate reaction by some of the NASA and IBM people was that GPC 2’s troubles must have been caused by all that complicated fault tolerance logic dreamed up by those MIT “scientists” (a sort of cuss word in the mouths of some who had wearied of our harangues on making the scheme complete). Several MIT and IBM engineers pored over the post-flight hex dumps of GPC 2’s memory, looking for very small-needle clues in a very large haystack. One of the best IBMers, Lynn Killingbeck, concluded that it looked like the computer had “gone bananas” and that turned out to be exactly right. When the GPC was returned to IBM’s hardware labs, they found that the wire harness connecting its I/O Processor (IOP, see Figure 4) to its CPU had one cold-solder joint that had been jolted loose by the release from the 747. As luck would have it, the wire in question triggered an ultra-priority interrupt as it rattled against its connector pin. Nothing that the software was trying to do was going to get anywhere with that going on—bananas, indeed!

One of Lynn’s colleagues, in honor of his swift and accurate analysis, presented him with a varnished mahogany trophy featuring two bananas fresh from the grocery (Figure 8).

Figure 8. Lynn Killingbeck’s Desk Trophy
In those days, IBM was a strict white-shirt and sober-tie kind of organization, so it was no great surprise to learn that Lynn’s manager came along after a couple of days and observed that although the trophy was appropriate for the moment, it wasn’t the sort of thing that upheld IBM’s corporate image. “Lynn, we’d like to suggest that you eat the evidence,” he said in his resonant bass voice.
A Heretical Observation

It has long been orthodox, in both mainframe and control computers, to emphasize the feature of program interrupts to support the time-sharing that is essential (in very different ways) to both application areas. While the Shuttle GPCs had this feature and used it, the GPC synchronization scheme described here goes a long way toward emulating a system in which no interrupts exist. In effect, each interrupt had to be encapsulated and prevented from altering the logic flow until the higher-level software was prepared to consider such alterations. I believe that verification of the Shuttle PASS software was made much easier and quicker by thus reducing the variety of logic flow paths. Not many people remember that even in the Apollo Guidance Computer, a primitive precursor of this approach existed: jobs that had to be restartable after transient disturbances protected themselves by sampling a variable NEWJOB at convenient times.
As stated above, mainframes are generally time-shared between production programs and new programs under development. The latter must be expected to violate standards of good behavior (e.g. infinite loops), so that it’s critical for interrupts to seize control of the machine without any cooperation from them.

In real-time control systems, by contrast, all software elements are necessarily production code, which has many reasons to comply with strict behavior protocols: jitter, transport lag, optimal disposition of cycle overruns, etc. From this base, it was not difficult to add a protocol for frequent sampling of the conditions that can alter logic flow by elevating the priority of one task over another—that is, to perceive the events that trigger interrupts in a way that actually used the interrupts but could easily have been done without them.
So that’s the heresy: I suggest that real-time control computers may be better off without program interrupts than with them, emulating their effects without over-complicating the logic flows.
References

Almost all of this material is from my memory of the work we did at the time, supported by a few looks at “Digital Development Memos” produced by the CSDL group named in the acknowledgments below, plus illustrations (copied for use in slide shows) from documents by NASA, IBM, and Rockwell International’s Space Division. Feeling that references in a technical paper should cite resources that interested readers can access, I must say with regret that aside from a few memos in my basement, even I can’t lay hands on the primary documents. Nevertheless, I cannot forbear to include one reference, even if it is to a side issue:

[1] Mindell, David A., 2008, Digital Apollo, Cambridge, MA, MIT Press, p. 232.
Acknowledgements

I must acknowledge my colleagues at the Charles Stark Draper Laboratory, Inc. with whom I worked on fault tolerance during the Space Shuttle Avionics development years, basically the 1970s: Division Head Eldon C. Hall, Group Leader Alan I. Green, Albert L. Hopkins, Ramon Alonso, Malcolm W. Johnston, Gary Schwartz, Frank Gauntt, Bill Weinstein, and others.

NASA people who contributed to development include Bob Crippen, Cline Frasier, Bill Tindall, Phil Shaffer, Ken Cox, and many others.

From IBM Federal Systems Division: Lynn Killingbeck, Steve Palka, Tony Macina, and more.

From Rockwell International Space Division: Vic Harrison, Bob D’Evelyn, Ron Loeliger, Bob Peller, Gene O’Hern, Sy Rubenstein, Ken McQuade, and more.

Rich Katz, of the Office of Logic Design at NASA’s Goddard Space Flight Center, encouraged me to present this material in a Lessons Learned session at the 2006 MAPLD (Military-Aerospace Programmable Logic Devices) conference. While it does not address such devices, it does cover problems that participants needed to understand, bringing forward solutions that had been developed a generation before.
28th Digital Avionics Systems Conference
October 25-29, 2009