LABORATOIRE INTERFACES CAPTEURS<br />
ET MICRO-ÉLECTRONIQUE<br />
Doctoral School of IAEM - Lorraine<br />
Department of Electronics and Electrical Engineering<br />
A dissertation submitted to the University Paul Verlaine - Metz, France<br />
in partial fulfillment of the requirements for the degree of Doctor of Philosophy<br />
Discipline : Electronic Systems<br />
Specialty : Microelectronics<br />
DESIGN METHODOLOGY OF A FAULT-TOLERANT<br />
JOURNALIZED STACK PROCESSOR ARCHITECTURE<br />
by<br />
MOHSIN AMIN<br />
Thesis defended on June 9, 2011<br />
Doctoral Committee :<br />
PROF. LUC HEBRARD University of Strasbourg, France President of jury<br />
PROF. AHMED BOURIDANE University of Northumbria, Newcastle, UK Reviewer<br />
PROF. FERNANDO MORAES University of PUCRS, Porto Alegre, Brazil Reviewer<br />
DR. CAMILLE DIOU Paul Verlaine University - Metz, France Co-Supervisor<br />
PROF. FABRICE MONTEIRO Paul Verlaine University - Metz, France Supervisor<br />
LICM - 7 Rue Marconi, Technopôle, 57070 Metz, France<br />
Tel : +33 (0)3 87 31 56 57 - Fax : +33 (0)3 87 54 73 07 - www.licm.fr
I DEDICATE THIS WORK TO<br />
MY BELOVED BROTHER (LATE) QAISER AMIN<br />
May God give him peaceful rest forever!
Acknowledgements

A PhD thesis is a great experience: one works on very stimulating topics and challenging problems and, perhaps most importantly for me, meets and collaborates with extraordinary people. Along with earning a degree and research skills, I have learned the French language, experienced a new culture, and learned to live in a different climate. I have been in France for five years, yet there is still much more to explore.

First and foremost, many thanks go to Prof. Fabrice MONTEIRO and Dr. Camille DIOU for supervising my PhD thesis, for teaching me many new things, for their guidance and support, for all the fruitful discussions, and for their company during conference trips. I am grateful to them for letting me pursue my research interests with sufficient freedom while still being there to guide me. I am also grateful to the director of the LICM, Prof. Abbas DANDACHE, and to Dr. Camel TANOUGAST for their kind support during my stay at LICM-Metz.

My greetings go to Prof. Ahmed BOURIDANE, Northumbria University, Newcastle, UK, and Prof. Fernando MORAES, University PUCRS, Porto Alegre, Brazil, who honored me by accepting to review this thesis. I am also grateful to the president of the jury, Prof. Luc HEBRARD, University of Strasbourg, France, for presiding over the defense.

I am thankful to my colleague Dr. Abbas RAMAZANI, who guided me a great deal during my thesis. I would like to thank my officemates Frédéric, Hussain, Kevin, Mazan, Medhi and Rita for the good times we have had, and I wish good luck to the next ones: Alaa-Aldin, Cédric, David, Luca, Mokhtar, Salah and Said. Some of them are now more than officemates. I would like to express my gratitude and appreciation to Aamir, Armaghan, Fahad, Jawad, KB, Liaquat, Rafiq, Sadiq and Sundar. Special thanks to Sajid Butt for his unconditional friendship, his support, and for reminding me to focus on finishing my PhD.

Last but certainly not least, I owe a great deal to my family for their emotional support during my PhD. Many thanks to my parents, my brother Qasim, my wife Ayesha, and my sister Saba, who all contributed a lot (probably the most) to my life during this period in many ways. Love to my beloved children Mohammad Abu-Bakar and Aleeza. Finally, special thanks to the Higher Education Commission of Pakistan for funding my PhD thesis.

Thanks, folks!

Mohsin AMIN
Contents<br />
GENERAL INTRODUCTION 7<br />
I. STATE OF THE ART 13<br />
1 Dependability and Fault Tolerance 13<br />
1.1 Problematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
1.1.1 Common Source of Faults and their Consequences . . . . . . . . . . . . . . 15<br />
1.2 Basic Concepts and Taxonomy of Dependable Computing . . . . . . . . . . . . . . 18<br />
1.2.1 Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
1.3 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.4 Threats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
1.4.1 System Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
1.4.2 Characteristics of a Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.5 Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.1 Fault Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.2 Fault Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.3 Fault Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
1.5.4 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
1.6 Techniques Applied at Different Levels . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
1.6.1 FT Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
1.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
2 Methods to Design and Evaluate FT Processors 31<br />
2.1 Error Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
2.1.1 Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32<br />
2.1.2 Temporal/Time Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />
2.1.3 Information Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.2 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.2.1 Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.2.2 Temporal Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
2.2.3 Information Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
2.3 Error Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
2.4 FT Processor Design Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41<br />
2.5 FT Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
2.5.1 Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
2.5.2 Error Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
2.5.3 The Fault Injection Framework . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49<br />
II. QUALITATIVE AND QUANTITATIVE STUDY 53<br />
3 Design Methodology and Model Specifications 53<br />
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53<br />
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
3.2.1 Concurrent Error Detection: Parity Codes . . . . . . . . . . . . . . . . . . . 54<br />
3.2.2 Error Recovery: Rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
3.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.4 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.5 Design Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59<br />
3.5.1 Challenge # 1: Self Checking Processor Core Requirements . . . . . . . . . 59<br />
3.5.2 Challenge # 2: Temporary Storage Needed: Hardware Journal . . . . . . . . 60<br />
3.5.3 Challenge # 3: Processor-Memory Interfacing . . . . . . . . . . . . . . . 62<br />
3.5.4 Challenge # 4: Optimal Sequence Duration for Efficient Implementation of<br />
Rollback Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
3.6 Model Specifications and Global Design Flow . . . . . . . . . . . . . . . . . . . . . 63<br />
3.7 Functional Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
3.7.1 Model-I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
3.7.2 Model-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68<br />
3.7.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70<br />
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
4 Design and Implementation of a Self Checking Processor 77<br />
4.1 Processor Design Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
4.1.1 Advantages of Stack Processor . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
4.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80<br />
4.3 Hardware Model of the Stack Processor . . . . . . . . . . . . . . . . . . . . . . . . 82<br />
4.4 Design Challenges in FT Stack Processor . . . . . . . . . . . . . . . . . . . . . . . 84<br />
4.4.1 Challenge I: Self Checking Mechanism . . . . . . . . . . . . . . . . . . . . 85<br />
4.4.2 Challenge II: Performance Improvement . . . . . . . . . . . . . . . . . . . . 85<br />
4.5 Solution-I: Self Checking Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Error Detecting in ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
4.5.2 Error Detecting in Register and Data-Path . . . . . . . . . . . . . . . . . . . 92<br />
4.5.3 Self-Checking Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.5.4 Store Sensitive Elements (SE) . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.5.5 Protecting Opcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
4.6 Solution-II: Performance Aspects of Self-Checking Processor Core . . . . . . . . . . 94<br />
4.6.1 Solution-II (a): Multiple-byte Instructions . . . . . . . . . . . . . . . . . . . 94<br />
4.6.2 Solution-II (b): 2-Stage Pipelining to resolve Multi-clock Instruction Execution 95<br />
4.6.3 Reducing Overhead for Conditional Branches . . . . . . . . . . . . . . . . . 96<br />
4.7 Implementation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
5 Design of a Self Checking Hardware Journal 103<br />
5.1 Error Detection and Correction in the Journal . . . . . . . . . . . . . . . . . . . . . 104<br />
5.2 Principle of the technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
5.3 Journal Architecture and Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 107<br />
5.3.1 Modes of SCHJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />
5.4 Risk of data contamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.5 Implementation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.5.1 Minimizing the Size of the Journal . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.5.2 Dynamic Sequence Duration . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
6 Fault Tolerant Processor Validation 121<br />
6.1 Design Hypothesis and Properties to be Checked . . . . . . . . . . . . . . . . . . . 122<br />
6.2 Error Injection Methodology and Error Profiles . . . . . . . . . . . . . . . . . . . . 122<br />
6.3 Experimental Validation of Self-Checking Methodology . . . . . . . . . . . . . . . 123<br />
6.4 Performance Degradation due to Re-execution . . . . . . . . . . . . . . . . . . . . . 126<br />
6.4.1 Evaluating Performance Degradation . . . . . . . . . . . . . . . . . . . . . 127<br />
6.5 Effect of Error Injection on Rate of Rollback . . . . . . . . . . . . . . . . . . . . . . 130<br />
6.6 Comparison with LEON FT-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
GENERAL CONCLUSION AND PROSPECTS 135<br />
A Canonical Stack Computers: 139<br />
B Instruction Set of Stack Processor 141<br />
B.1 Data Operations in Stack Processor: . . . . . . . . . . . . . . . . . . . . . . . . . . 145
C Instruction Set of Pipelined Stack Processor 147<br />
D List of Acronyms 153<br />
E List of publications 157
GENERAL INTRODUCTION<br />
General Introduction<br />
Nowadays, devices are becoming more sensitive to strikes by high-energy particles, which can cause single-event upsets (SEUs) when they hit the surface of a silicon device. These SEUs can result in soft errors that emerge as bit flips in memory or as signal noise in combinational logic.

In recent years, microprocessor performance has increased exponentially thanks to modern design trends, but susceptibility to environmental effects has increased as well [Kop11]. As clock speeds increase and feature sizes decrease, systems become more susceptible to ionizing radiation that leaks through the atmosphere. In addition, soft errors may be triggered by environmental factors such as static discharges or fluctuations in temperature and power supply voltage. The occurrence of soft errors in modern electronic systems is continuously increasing [Nic10].
Dependability is an important concern for current and future generation processor design [RI08].<br />
Conventional approaches to dependable processor design employ space or time redundancy [RR08]. Processor replication has long been used as a fault tolerance (FT) technique against transient faults [Kop04]. It is a costly solution requiring more than 100% area overhead (and also power overhead), since at least duplication is required for error detection (at least triplication for error correction/masking), plus additional voting circuitry. In practice, it is an expensive way to detect errors at the register level, especially when SEUs are considered. Software-based temporal approaches have lower hardware overheads and can significantly improve reliability [RI08]. For example, in duplex execution all instructions are executed twice to detect transient errors [MB07]. However, this technique tends to induce significant time overheads, making severe time constraints hard to meet in real-time designs. These approaches may provide robust fault tolerance, but they incur high penalties in performance, area, and power [RR08].

Explicit redundancy is suitable for mission-critical applications where hardware cost is not an important constraint. However, with rapid technology scaling, today almost every system needs at least some consideration of FT features [FGAD10]. These systems demand more cost-effective FT solutions that may have less coverage than hardware redundancy, but substantial coverage nonetheless [RR08]. Therefore, research is needed into alternative, unconventional, and cost-effective solutions.
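The duplex-execution idea mentioned above can be illustrated with a minimal sketch (an illustration under assumed names, not a construct from this thesis): each operation is executed twice and the two results are compared; a transient fault that corrupts only one execution shows up as a mismatch. Note that comparison alone only detects the error; correction would require a third run, voting, or a rollback.

```python
import random

def duplex_execute(op, x, fault_rate=0.0):
    """Run op twice; a transient fault may corrupt one run.

    Returns (result, error_detected). A mismatch between the two
    runs signals a transient error; the result is then discarded.
    """
    r1 = op(x)
    r2 = op(x)
    # Model a transient fault as a random single-bit flip in one run.
    if random.random() < fault_rate:
        r2 ^= 1 << random.randrange(8)
    if r1 != r2:
        return None, True          # error detected, result discarded
    return r1, False               # runs agree, result accepted

square = lambda v: (v * v) & 0xFF

assert duplex_execute(square, 9) == (81, False)              # fault-free
assert duplex_execute(square, 9, fault_rate=1.0) == (None, True)
```

This also makes the time overhead discussed above visible: every operation costs at least two executions, which is exactly why duplex execution strains real-time deadlines.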
We propose a new hardware/software co-design methodology to tolerate transient faults in the processor. The methodology relies on two main choices: fast error detection and low-cost error recovery. Error detection must be fast so that errors are caught before they reach the system boundaries and cause catastrophic failures 1 . Consequently, hardware-based concurrent error detection (CED) has been chosen. To limit the overall cost, we may accept a small time penalty in error correction; in this scenario, software-based rollback is employed. It reduces the overall cost compared to hardware-based recovery, while affecting overall performance little, because the proposed methodology targets ground applications, where errors occur far less often than in space.
A hypothetical dependable memory (DM) is attached to the processor. Moreover, to make rollback fast and to simplify memory management, an intermediate data storage is placed between the processor and the DM. Here, architectural choices are important to make the overall methodology successful. For example, a processor core with a minimum of internal states to be checked (for error detection) and loaded and stored (for rollback recovery) makes this technique effective (less expensive and fast). The FT processor has been modeled at the VHDL-RTL level. Finally, the processor's self-checking ability and the performance degradation due to re-execution have been tested by artificial error injection in the simulated model.
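The role of the intermediate storage between processor and DM can be sketched as a toy model (class and method names are invented for illustration; the actual design is a VHDL hardware journal): writes made during a sequence are buffered; if an error is detected before the sequence ends, the buffer is discarded and the sequence can be re-executed from the last validated state; otherwise the buffered data is committed to the DM.

```python
class JournalizedMemory:
    """Toy model of a journal sitting between the processor and the
    dependable memory (DM). Names and structure are illustrative."""

    def __init__(self):
        self.dm = {}        # dependable main memory (validated data only)
        self.journal = {}   # unvalidated writes of the current sequence

    def write(self, addr, value):
        self.journal[addr] = value      # buffered, not yet in DM

    def read(self, addr):
        # Most recent value wins: journal first, then DM (default 0).
        return self.journal.get(addr, self.dm.get(addr, 0))

    def end_sequence(self, error_detected):
        if error_detected:
            self.journal.clear()        # rollback: discard the sequence
            return "rolled back"
        self.dm.update(self.journal)    # commit validated data to DM
        self.journal.clear()
        return "committed"

m = JournalizedMemory()
m.write(0x10, 42)
assert m.read(0x10) == 42 and 0x10 not in m.dm   # visible, not committed
assert m.end_sequence(error_detected=False) == "committed"
m.write(0x10, 99)                                # corrupted sequence
assert m.end_sequence(error_detected=True) == "rolled back"
assert m.read(0x10) == 42                        # DM kept the clean value
```

The point of the design choice is visible here: errors never reach the DM, because only sequences that finished without a detected error are ever committed.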
The contributions of this work are as follows. We propose a new methodology based on hardware/software co-design to strike a compromise between protection and time/area constraints. For fast error detection, hardware-based concurrent detection is employed; for low hardware overheads, software-based micro-rollback recovery is used. To reduce the overall area overheads, we employ a stack processor from the MISC class. This processor has a minimum of internal registers, which results in low-cost error detection while also being suitable for efficient error recovery. Furthermore, to mask errors from entering the DM, an intermediate temporary data storage is introduced between the processor and the DM.
This thesis is partitioned into six chapters.<br />
Chapter 1: It outlines the background and describes the motivation for on-line error detection and fast correction in embedded microprocessors. It presents the basic concepts and terminologies related to dependable embedded processor design, and further explores the attributes, threats, and means to attain dependability. Lastly, the dependability techniques applied at different levels are discussed.

Chapter 2: This chapter presents different redundancy techniques to detect and correct errors. It explores the FT methodologies employed in existing fault-tolerant processors. The last part is dedicated to the validation methodology of a dependable processor.

Chapter 3: This chapter identifies the model specifications and design methodology of the desired architecture. It addresses the overall problem by exploring the design paradigm and the related constraints of the proposed approach. Later, the processor-memory interface is finalized through different functional implementations.
Chapter 4: The proposed FT processor has two parts: a self-checking processor core (SCPC) and a self-checking hardware journal (SCHJ). This chapter develops the design methodology of the self-checking processor core. The processor is chosen from the MISC (minimum instruction set computer) class; we therefore first clarify the reasons for choosing such a specialized processor. Later on, the error detection and recovery mechanisms are finalized. Finally, the hardware model of the self-checking processor core is synthesized for an Altera Stratix III using Quartus II.

1 where the cost of harmful consequences is orders of magnitude, or even incommensurably, higher than the benefit provided by correct service delivery [LRL04]
Chapter 5: This chapter discusses the hardware design and protection scheme of the self-checking hardware journal (SCHJ), a temporary data storage that masks errors from entering the dependable main memory. Finally, the overall hardware model of the FT processor is synthesized for an Altera Stratix III using Quartus II.

Chapter 6: Lastly, the FT model is evaluated in the presence of errors. The evaluation is based on the self-checking ability and on the performance degradation in the presence of errors. The obtained results validate the protection techniques proposed in chapter 3.

Finally, the last section discusses conclusions and perspectives.
I. STATE OF THE ART<br />
Chapter 1<br />
Dependability and Fault Tolerance<br />
It is a complex task to design embedded systems for critical real-time applications. Such systems must not only guarantee to meet the hard real-time deadlines imposed by their physical environment, but also guarantee to do so dependably, despite the occurrence of faults [Pow10]. The need for fault-tolerant (FT) computing has become more and more important in recent years [Che08] and will likely become the norm. In the past, FT was the exclusive domain of very specialized applications such as safety-critical systems. However, modern design trends are making circuits more sensitive, and now all real-time systems should have at least some FT features. FT is therefore an important need of the time.
Modern society hinges on automated industry. In some sensitive industrial sectors, even a single fault can result in a million-dollar loss (e.g., in banking and stock markets) or in loss of life (e.g., in an air traffic control system). Industries such as automotive, avionics, and energy production require availability, performance, and real-time response to avoid catastrophic failures. In table 1.1, the cost per hour of a control-system failure is compared across application domains to show the importance of FT in the industrial sector.
Table 1.1: Cost/hour for failure of control system [Pie07]<br />
Application Domain Cost (Euro/hour)<br />
Cell-phone Operator 40k<br />
Airline Reservation 90k<br />
ATM Machine (Banking) 2.5M<br />
Automobile Assembling Unit 6M<br />
Stock Transaction 6.5M<br />
Most of these systems (table 1.1) rely on embedded systems, and the design of an FT processor is one of the basic requirements for dependable embedded applications. Accordingly, we propose to design a fault-tolerant processor to tolerate the transient faults that result from SEUs. In this introductory chapter, we address the basic concepts and terminologies related to fault-tolerant computing. The chapter is divided into three main parts: the first part discusses the current trends that increase the probability of faults, together with the sources and consequences of those faults; the second part discusses the concepts of dependable computing; and the third part explores the means to attain dependability.
1.1 Problematic

For many years, researchers focused on performance issues, and through their tireless efforts, fueled by deep technology scaling, they have greatly improved overall performance. However, the gains promised by Moore's law are approaching saturation, while dependability is decreasing due to ever-increasing physical faults. Several trends in the search for high performance have increased the need for dependable architecture design; some of them are discussed below:
Smaller Technologies / Design Scales

Although the scaling of transistors and wires has steadily improved processor performance and reduced cost, it has also adversely affected long-term chip lifetime reliability. When a transistor is exposed to high-energy ionizing radiation, electron-hole pairs are created [SSF + 08]. Transistor source and diffusion nodes accumulate charge, which may invert the logic state of the transistor [Muk08]. Device dimensions projected to shrink below 18 nanometers by 2015 significantly threaten next-generation technologies [RYKO11]. Regarding transient faults, smaller devices hold less charge to maintain register states, which makes them more sensitive to noise. As the noise margin decreases, the probability that a high-energy particle strike disturbs the charge on a device increases, and with it the probability of transient faults. The lower voltages used for power-efficiency reasons will further increase the susceptibility of future chips [FGAD10].
More Transistors per Chip

More transistors require more wires to connect them, increasing the chances of faults both during fabrication and during the operation of such devices. Modern processors are more prone to faults because of their greater number of transistors and registers. Moreover, temperature is another factor causing transient and permanent faults: the more devices on a chip, the more power is drawn from the supply. Higher supply power per unit area increases the leakage power dissipated per unit area, which raises the temperature and, with it, the probability of errors.
Complex Design

Today's processors have become far more complicated than those of the past, which increases the probability of design faults and also makes the debugging of errors more difficult. Research effort is oriented towards alternative methods of increasing system performance without increasing the sensitivity of the circuit, but unfortunately the bottleneck has been reached, and the alternative solutions are more complex and make fault debugging an even more difficult task.
In short, devices are becoming more sensitive to ionizing radiation (which may cause soft errors), to operating-point variation through temperature or supply-voltage fluctuations, and to parasitic effects that result in static leakage currents [ITR07]. Changing parameters such as dimensions, noise margin, and supply voltage can no longer be fruitful for increasing performance. In the near future, because of small feature sizes and high frequencies, the failure trend in modern computing systems will increase further, since the saturation level has already been reached. This increase is leading to a rising soft-error rate in logic and memory chips [Bau05], which affects reliability even at sea level [WA08]. To assure circuit integrity, FT must be an important design consideration for modern circuits: a dependable system must embed tolerance mechanisms against possible errors.
1.1.1 Common Source of Faults and their Consequences

Today, one significant threat to the reliability of digital circuits concerns the sensitivity of logic states to various noise sources, especially in certain specific environments, such as space or nuclear systems, where the collision of charged particles can result in transient faults. Such particles include cosmic rays produced by the sun and alpha particles produced by the disintegration of radioactive isotopes.

For space applications, FT is a mandatory requirement due to the severe radiation environment. As manufacturing technology is scaled towards finer geometries, the probability of SEUs is increasing. With present technology, dependability is not only required for some critical applications: even for commodity systems, dependability needs to be above a certain level for the system to be useful for anything [FGAD10]. Radiation-induced soft errors are becoming an increasingly important threat to the reliability of digital circuits, even in ground-level applications [Nic10].
Transient faults can be caused by on-chip perturbations, such as power-supply noise, or by external noise [NX06]. Researchers have identified three common sources of soft errors in semiconductors. First, alpha particles, discovered in the 1970s, proved to be the main source of soft errors in computer systems, especially DRAM [MW07]. Second, high-energy neutrons from cosmic radiation can induce soft errors in semiconductor devices via the secondary ions produced by neutron reactions with silicon nuclei [ZL09], as shown in figure 1.1, where a single high-energy neutron has disturbed the internal charge distribution of the whole device. Third, a soft-error source is induced by low-energy cosmic neutron interactions with the isotope boron-10 in IC materials, specifically in Boro-Phospho-Silicate Glass (BPSG), widely used to form insulator layers in IC manufacturing; this has recently proved to be the dominant source of soft errors in SRAMs fabricated with BPSG [WA08].
Figure 1.1: An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs in its wake, which cause a charge disturbance [MW04]

Figure 1.2 represents the sequence of events that may occur once an energetic particle hits the substrate, provoking ionization. The ionization may generate a set of electron-hole pairs that create a transient current injected into, or extracted from, the struck node. Depending on the amplitude and duration of this current pulse, a transient voltage pulse may appear at the hit node. This is characterized as the
fault. There is a fault latency period that defines the time needed for that fault to become an error in<br />
the circuit. This will only occur if the transient voltage pulse changes the state of a storage element (flip-flop), generating a bit-flip. The bit-flip may generate an error if the content of this flip-flop is used for a certain operation. From the application point of view, however, it is not mandatory that this error manifests as a failure in the system; there is also an error latency that defines the time needed for that error to become a failure.

Figure 1.2: A high-energy particle strike resulting in an error: ionization, transient current, transient voltage pulse, fault effect, error

The common term for any measurable effect
resulting from the deposition of energy from a single ionizing particle strike, is a Single Event Effect<br />
(SEE). The most relevant SEEs are classified in figure 1.3.
Figure 1.3: Classification of faults on the basis of single event effects (SEE) [Pie07]. Soft errors comprise single event transients (SET), single event upsets (SEU), which manifest as single bit upsets (SBU) or multi bit upsets (MBU), and single event functional interrupts (SEFI). Hard errors comprise single event latch-up (SELU) and single event gate-rupture/burnout (SEGR/SEB).
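The fault latency and error latency described above can be condensed into a toy model (function and parameter names are invented for illustration): a particle strike flips a stored bit (fault); the fault becomes an error only if the corrupted bit is actually read, and the error becomes a failure only if it propagates to the delivered service.

```python
def fault_error_failure(is_read, reaches_service):
    """Trace whether a particle-induced bit flip becomes an error,
    and whether that error becomes a failure (illustrative model)."""
    flipflop = 0
    flipflop ^= 1          # fault: an SEU flips the stored bit
    # Fault -> error only if the corrupted content is actually used.
    error = is_read and flipflop != 0
    # Error -> failure only if it propagates to the delivered service.
    failure = error and reaches_service
    return error, failure

# A flipped bit that is never read is a fault, but neither error nor failure.
assert fault_error_failure(is_read=False, reaches_service=True) == (False, False)
# A read corrupted bit is an error; it fails only if the service uses it.
assert fault_error_failure(is_read=True, reaches_service=False) == (True, False)
assert fault_error_failure(is_read=True, reaches_service=True) == (True, True)
```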
Single Event Upset (SEU)

An SEU is mostly a soft error caused by the transient signal induced by a single energetic particle strike [JES06]. In [Bau05], it is said to occur when a radiation event causes a charge disturbance large enough to reverse or flip the data state of a memory cell, register, latch, or flip-flop. The error is called soft because the device is not permanently damaged by the radiation: when new data is written to the struck memory cell, the device will store it correctly [Bau05].

The SEU is a very serious problem because it is one of the major sources of failure in digital systems [Nic10]. It will likely pose serious threats to the future of robust computing [RK09] and requires serious attention. It may manifest itself as a Single Bit Upset (SBU) or a Multiple Bit Upset (MBU).
Single Bit Upset (SBU) and Multiple Bit Upset (MBU)<br />
An SBU is a single radiation event that results in one bit flip, whereas an MBU is a single radiation event that results in more than one bit being flipped. Each bit flip is essentially an SEU, so SBUs and MBUs are considered subsets of the SEU. SBUs usually make up the major fraction and MBUs a small fraction of the total number of observed SEUs. However, the MBU probability is steadily increasing as geometries shrink [BCT08, QGK + 06]. This thesis addresses SBUs; in future work, the methodology will be extended to MBUs.
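The SBU/MBU distinction can be made concrete with a minimal Python sketch of SEU injection (a hypothetical illustration, not the fault-injection framework used in this thesis): an upset is modeled as XORing a data word with a mask, and the upset is classified by the number of flipped bits.

```python
def inject_seu(word: int, mask: int) -> int:
    """Model a single-event upset by flipping the bits set in `mask`."""
    return word ^ mask

def upset_kind(mask: int) -> str:
    """Classify the upset by the number of flipped bits."""
    flipped = bin(mask).count("1")
    if flipped == 1:
        return "SBU"
    return "MBU" if flipped > 1 else "no upset"

data = 0b10101100
sbu_word = inject_seu(data, 0b00000100)  # one bit flipped -> SBU
mbu_word = inject_seu(data, 0b00011000)  # two bits flipped -> MBU
```

Any single-bit mask yields an SBU; masks with several bits set model the multi-bit upsets that become more likely as geometries shrink.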
Single Event Transient (SET)<br />
An SET is a transient pulse in the logic path of an IC. Similar to an SEU, it is induced by the charge deposition of a single ionizing particle. An SET can propagate along the logic path where it was created, and may be latched into a register, latch or flip-flop, causing the stored value to change.
Single Event Functional Interrupt (SEFI)<br />
Xilinx [BCT08] defines an SEFI as an SEE that interferes with the normal operation of a complex digital circuit. As for the previously mentioned SETs, further investigation of SEFI rates is not considered in this thesis.
Single Event Latch-Up (SELU)<br />
A spurious current spike induced by an ionizing particle in a transistor may be amplified by the large positive feedback of the parasitic thyristor structure and cause a virtual short between Vdd and ground, resulting in an SELU [NTN + 09]. SELUs are not addressed in this thesis.
Single Event Gate Rupture (SEGR) and Single Event Burnout (SEB)<br />
Single Event Gate Rupture (SEGR) is a single-ion-induced condition in power MOSFETs that may result in the formation of a conducting path in the gate oxide. Single Event Burnout (SEB) is a condition that can cause device destruction due to a high-current state in a power transistor. Both are permanent faults and are not addressed in this thesis.
1.2 Basic Concepts and Taxonomy of Dependable Computing
This part defines the basic terminology of dependable computing, largely drawn from [LB07, Lap04]. In this section, we identify the important methods, and their characteristics, for making a system tolerant to faults.
1.2.1 Dependability<br />
Dependability is the ability to deliver service that can justifiably be trusted [LRL04]. The definition is focused on trust: in other words, the dependability of a system is its ability to avoid service failures that are more frequent and more severe than is acceptable. Dependability relies on a set of measures, applied in all phases of the product life, to ensure that the functionality will be maintained while accomplishing the mission for which the system has been designed. According to Laprie [LB07], the dependability of a system is the property that justifies placing confidence in the service it delivers.
1.3 Attributes

Figure 1.4: Dependability tree: dependability and security break down into attributes (availability, reliability, safety, confidentiality, integrity, maintainability), threats (faults, errors, failures) and means (fault prevention, fault tolerance, fault removal, fault forecasting).
Dependability is a vast concept based on various attributes as shown in figure 1.4.<br />
• Availability: the readiness for correct service;
• Reliability: the continuity of correct service;
• Safety: the absence of catastrophic consequences for the user(s) and the environment;
• Integrity: the absence of improper system alterations;
• Maintainability: the ability to undergo modifications and repairs.
Moreover, when dealing with security issues, an additional attribute called confidentiality is also considered, as shown in figure 1.4. Confidentiality is the absence of unauthorized disclosure of information. Other attributes related to security are availability and integrity, which have already been discussed among the dependability attributes [VK07].
It is difficult to fully satisfy all of the dependability attributes in a system at the same time, because doing so increases the cost, power consumption and hardware area of the system. One therefore balances these attributes according to the system's needs; it has even been stated in [FGAM10] that it is impossible to design a 100% dependable system. For example, in order to improve the availability of a component, maintenance is sometimes overlooked and safety decreases accordingly. Here, two types of systems are considered:
• a web server
• a nuclear reactor
Let us see which dependability and security attributes are more important for each system. For a university web server, availability is the most important attribute because every student needs to access it regularly, whereas for a nuclear reactor the attributes of availability, reliability, safety and maintainability are all important considerations. [Pie07] sums up the importance of these attributes in table 1.2, where 4 points are given to the most important attributes and 1 point to the least important. As table 1.2 shows, each application has its own dependability and security requirements.
Table 1.2: Dependability attributes for a university web server and a nuclear reactor [Pie07], rated from 4 points (very important) down to 1 point (least important)

Attributes         University Web Server   Nuclear Reactor
Availability               3                      4
Reliability                1                      4
Safety                     1                      4
Confidentiality            2                      1
Integrity                  2                      3
Maintainability            2                      4

1.4 Threats
There are three fundamental threats to a dependable computer: (i) faults, (ii) errors and (iii) failures. A fault is defined as an erroneous state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design [Pie06]. A fault is active when it produces an error; otherwise it is considered a dormant (sleeping) fault. An active fault can be an internal fault that was previously dormant. An error is itself caused by a fault, and a failure occurs when the delivered service deviates from the correct service because of an error. The three have a cause-and-effect relationship between them (as shown in figure 1.5). In general, an active fault causes an error, which can propagate from one place to another inside the system: in figure 1.6, an error produced in the processor is transferred to main memory. Furthermore, if an error reaches the boundaries of the system, it may result in the failure of the system, causing the service provided to deviate from its specification [GMT08] (see figure 1.5). If the initial system is a sub-system of a global system, this failure in turn causes a fault in the global system. In this way the chain of fault, error and failure keeps progressing.
Figure 1.5: Fault, error and failure chain: within a sub-system, a fault is activated into an error, which propagates into a failure; the failure's consequences become a fault in the global system, and the chain continues.

Figure 1.6: Error propagation from the processor to main memory through READ/WRITE operations.
An SEU may result in a system failure, as in figure 1.7: a high-energy neutron strike (caused by cosmic rays) on a VLSI circuit results in an SBU (active fault), which provokes an error in a traffic control system and finally results in the failure of the system.
1.4.1 System Failure<br />
A system delivers correct service when it respects its functionality, whereas a system failure is a deviation of the service delivered by the system from its specification [Pie06]. Such a deviation can take the form of incorrect service, or of no service at all [GMT08]; the transition from incorrect back to correct service is a service restoration (see figure 1.8).
A service failure may occur because the system no longer respects its functionality, or because the functional specifications were not correctly defined for that system under certain conditions. FT techniques, on the other hand, allow a system to continuously deliver its service according to its correct functionality even in the presence of faults.
Figure 1.7: A single fault (a signal stuck at 1, always A = 1) causing the failure of a traffic control system through the fault → error → failure chain.
Figure 1.8: Service failure and service restoration: a service failure is the transition from correct to incorrect service (a wrong signal instead of the correct one); service restoration is the transition back.

1.4.2 Characteristics of a Fault
Faults can be characterized by five attributes: cause, nature, duration, extent and value. Figure 1.9 illustrates each of these basic characteristics; they are discussed in the following sections.
Figure 1.9: Fault characteristics: cause (specification mistakes, implementation, external disturbances, component defects), nature (software — HDL, programming — or hardware — logical, electronic CMOS, digital, analog), duration (transient, intermittent, permanent), extent (local, global) and value (determinate, indeterminate).

Cause

A fault can be caused by four salient problems:
1. Specification Mistakes: These include incorrect algorithms, architectures, or incorrect design specifications, as in row 1 of figure 1.10, where a fault is caused by a wrong interconnection between the two systems.
2. Implementation Mistakes: The implementation can introduce faults through poor design, poor component selection, poor construction, or hardware/software coding mistakes, as in rows 2 and 3 of figure 1.10. Row 2 shows a programming fault: c should be incremented only when a is less than b, but with the condition a <= b it is also incremented when a equals b. Similarly, row 3 of figure 1.10 shows an HDL fault in which r1 loads the result of the addition a + b into the register c.
3. Component Defects: These include random device defects, manufacturing imperfections, and component wear-out, whether in a logical component or in electronic CMOS, as shown in rows 4 and 5 of figure 1.10.
4. External Disturbances: These include operator mistakes, radiation, electromagnetic interference, and environmental extremes, as in row 6 of figure 1.10. Moreover, due to shrinking noise margins, a '1' can be read as a '0' if its voltage is lower than the threshold Vm (as shown in row 7 of figure 1.10).
24 CHAPTER 1. DEPENDABILITY AND FAULT TOLERANCE<br />
Sr.<br />
No.<br />
1<br />
2<br />
3<br />
4<br />
5<br />
6<br />
7<br />
Nature<br />
Fault at different level<br />
Specification Mistakes<br />
Programming fault<br />
HDL<br />
Component defect 1<br />
Component defect 2<br />
External Disturbance<br />
Fault due to lower noise<br />
Correct state<br />
A B C<br />
if a < = b<br />
c : c + 1;<br />
end;<br />
r1 : c
• Permanent: A permanent fault, once it has occurred, persists until the end of the execution. Even a single permanent fault can create multiple errors until it is repaired; such errors are called hard errors.
• Intermittent: An intermittent fault appears, disappears, and reappears repeatedly within a very short period. It can occur repeatedly, but not continuously, over a long time in a device.
The errors in modern computers may result from permanent, intermittent and transient faults. However, transient faults occur considerably more often than permanent ones, and are much harder to detect [RS09]. The ratio of transient to permanent faults can vary between 2:1, 100:1 or higher, and is continuously increasing [Kop04].

Extent

The fault extent specifies whether the fault is localized to a given hardware or software module, or whether it globally affects the hardware, the software, or both.
Value<br />
The fault value can be either determinate or indeterminate. A determinate fault is one whose status remains unchanged over time unless there is an external action upon it, whereas an indeterminate fault is one whose status at some time t may differ from its status at another time.
1.5 Means<br />
There are four means to attain dependability: fault prevention, fault tolerance, fault removal and<br />
fault forecasting. Fault prevention and fault tolerance aim to provide the ability to deliver a service<br />
that can be trusted, while fault removal and fault forecasting aim to reach confidence in that ability by<br />
justifying that the functional and the dependability and security specifications are adequate and that<br />
the system is likely to meet them [LRL04, LB07].<br />
1.5.1 Fault Prevention<br />
Fault prevention is the ability to avoid the occurrence or introduction of faults. It includes any technique that attempts to prevent faults from occurring, such as design reviews, component screening, testing and other quality control methods.
1.5.2 Fault Removal<br />
Fault removal is the ability to reduce the number and severity of faults. It can be conducted during corrective or preventive maintenance processes: corrective maintenance aims to remove faults that have already produced errors and starts after error detection, while preventive maintenance aims to remove faults before they can cause errors [LB07].
1.5.3 Fault Forecasting<br />
Fault forecasting is the ability to estimate the present number, the future incidence, and the likely consequences of faults. It is conducted by evaluating the system behavior with respect to fault occurrence or activation, and has both a qualitative and a quantitative aspect. The main approaches to probabilistic fault forecasting, aimed at deriving probabilistic estimates, are modeling and testing [LB07].
1.5.4 Fault Tolerance<br />
Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults [ALR01]. Ideally, an FT system is capable of executing its tasks correctly regardless of faults; in practice, however, no one can guarantee the flawless execution of tasks under all circumstances, so real FT systems are designed to tolerate the faults most likely to occur. FT is the means addressed in this work. It rests on three pillars: fault masking, error detection, and error correction/recovery.
Fault Masking<br />
Fault masking hides the effects of failures by ensuring that redundant information outweighs the incorrect information [Pie06]. It is a structural redundancy technique that completely masks faults within the system's redundant modules: a number of identical modules execute the same functions, and their outputs are voted on to remove the errors created by a faulty module. Triple Modular Redundancy (TMR) is a commonly used fault-masking technique.
Through fault masking, we achieve dependability by hiding the faults that occur, preventing their effects from spreading through the system. It can tolerate both software and hardware faults, as shown in figure 1.11, and such a system does not need error detection and correction to maintain its dependability. Fault masking has not been directly employed in this thesis; however, TMR will be used for comparison in later chapters.
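The voting principle behind TMR can be sketched in a few lines of Python (an illustrative software model only; real TMR is a hardware structure): a bitwise majority voter masks a fault in any single module.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter: each output bit takes the value held by
    at least two of the three module outputs, masking any single fault."""
    return (a & b) | (b & c) | (a & c)

# One replica suffers a bit-flip; the voter still yields the correct value.
correct = 0b1011
faulty = correct ^ 0b0100  # SEU in one of the three modules
masked = majority_vote(correct, faulty, correct)
```

Note that the voter masks the error without detecting or reporting it, which is why a pure TMR system needs no separate error detection stage.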
Error Detection<br />
If fault masking is not employed, then error detection may be employed in an FT system. Error detection is the building block of an FT system, because a system cannot tolerate an error it does not know about. Error detection mechanisms form the basis of an error-resilient system, as any fault occurring during operation needs to be detected before the system can take corrective action to tolerate it [LBS + 11]. Even if a system cannot recover from a detected error, it can at least halt the process or inform the user that an error has been detected and that the results are no longer reliable.
Error Correction/Recovery<br />
Detecting an error is sufficient for providing safety, but we would also like the system to recover from faulty states. Recovery hides the effects of the error from the user: after recovery, the system can resume operation and ideally remain live. Error recovery is an important feature with respect to the reliability and availability attributes, because both metrics require the system to recover from its errors without user intervention.
Error detection and recovery are addressed in this thesis; they will be discussed in detail in chapter 2, where various techniques of error detection (section 2.1) and correction (section 2.2) are presented.
1.6 Techniques Applied at Different Levels<br />
Figure 1.11 illustrates the dependability techniques applied at different levels in hardware and software systems, in which fault avoidance (fault prevention) is the primary method for improving system dependability. It may be realized through hardware or software measures: in a hardware-based system, fault avoidance means preventing specification and implementation faults, component defects and external disturbances, while in a software-based system it requires preventing specification and implementation faults. Fault masking, on the other hand, ensures dependability by masking the faults themselves; TMR is a well-known example of this technique. If fault masking is not applied, then FT is a practical choice for overcoming errors.
1.6.1 FT Techniques<br />
Fault-tolerance techniques for integrated circuits can be applied at different moments in the circuit design flow. They can be applied in the electrical design phase, for example through transistor sizing, transistor redundancy, or the addition of electrical sensors. Other techniques can be added at the logic design step, such as hardware and time redundancy in the logic blocks and in the software application. Figure 1.12 is a further extension of the previously discussed figure 1.2: it represents the different phases at which faults can be tolerated (detected and corrected), and in each phase a different fault-tolerance technique can be used. We address fault tolerance at the hardware-redundancy and self-checking levels, the two higher levels (shown as 'c' and 'd' in figure 1.12).
1.7 Conclusions<br />
The goal of this chapter was to introduce the concepts of dependability in embedded systems. In<br />
fulfilling this objective, we have introduced the main issues related to the design and analysis of fault<br />
tolerant systems. Here, we have discussed the different types of faults and their characteristics, because our final objective is the design of a fault-tolerant computing system against single event effects such as SEUs (Single Event Upsets).

Figure 1.11: Dependability techniques: fault avoidance (against specification mistakes, implementation faults, external disturbances and component defects), fault masking, and fault tolerance (error detection and recovery) applied to hardware and software faults, errors and system failures.
In addition, this chapter has addressed dependability issues arising from non-permanent disturbances. Our goal is to propose a new design methodology for dependable processor architectures; consequently, chapter 2 will discuss some existing methodologies for detecting and correcting errors.
Figure 1.12: Sequence of events from ionization to failure (ionization → transient current → transient voltage pulse → fault effect → flip-flop error → failure, with a fault latency and an error latency in between) and the fault-tolerance techniques applied at different times [Pie07]: (a) sensors (detectors), (b) time redundancy (detection and mitigation), (c) hardware redundancy and error-correcting codes (detection and mitigation), (d) self-checking mechanism with recovery (detection and mitigation), (e) redundancy/spare components. Fault tolerance at levels c and d is addressed in this thesis.
Chapter 2<br />
Methods to Design and Evaluate FT<br />
Processors<br />
The goal of FT techniques is to limit the effects of a fault, i.e. to increase the probability that an error is tolerated by the system. A common feature of all FT techniques is the use of redundancy. Redundancy is simply the addition of hardware resources, information, or time beyond what is needed for normal system operation [Poe05]. It can be hardware redundancy (some hardware modules are replicated), time redundancy (parts of a program are executed multiple times), information redundancy (the circuit or program carries redundant information), or a mixture of these three solutions.
Traditional solutions involving extensive redundancy are too expensive in area, power, and performance [BBV + 05], while cheaper approaches do not provide the necessary fault detection and correction abilities. Fault-tolerant embedded systems have to be optimized in order to meet time and area constraints [PIEP09]. Therefore, special attention is required when choosing redundancy techniques for critical applications.
Accordingly, this chapter presents a comparison of existing FT techniques in terms of their error detection and correction abilities, time delays and hardware overheads. From these comparisons, we will identify the techniques that can effectively fulfill our design objectives. The later part of the chapter explores the redundancy techniques employed in different FT processors, and the last section addresses the evaluation methods used to check the effectiveness of FT methodologies in a processor.
2.1 Error Detection<br />
Error detection originates an error signal or message within the system; it has been previously discussed in section 1.5.4. It can be based on preemptive detection or concurrent checking. Preemptive detection is mostly an offline technique: it takes place while normal service delivery is suspended and checks the system for latent errors and dormant faults. Concurrent detection, by contrast, is an online technique that takes place during normal service delivery [ALR01]. Similarly, Bickham defines concurrent error detection (CED) as the process of detecting and reporting errors while, at the same time, performing the normal operations of the system [Bic10].
CED techniques are widely used to enhance system dependability [HCTS10, CTS + 10, WL10]. The basic principle of CED techniques is summed up in [MM00]: a system is considered which realizes a function (f) and produces an output in response to an input sequence. A CED scheme generally contains another unit which independently predicts some special characteristic of the system output for every input sequence; finally, a checker unit compares the two outputs to produce an error signal. The architecture of a general CED scheme is shown in figure 2.1.
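As an illustration of this predictor/checker structure, the following Python sketch uses parity as the predicted output characteristic; the protected function (a mod-16 increment) and the predictor are hypothetical choices for the example, not taken from [MM00].

```python
def parity(x: int) -> int:
    """Output characteristic used by this example: parity of the word."""
    return bin(x).count("1") & 1

def function_f(x: int) -> int:
    """The protected function (hypothetical): increment modulo 16."""
    return (x + 1) % 16

def predict_parity(x: int) -> int:
    """Independent predictor of the output characteristic.
    (A real predictor derives it more cheaply than recomputing f.)"""
    return parity((x + 1) % 16)

def ced_step(x: int, fault_mask: int = 0):
    """One CED cycle: compute f(x) (optionally hit by an injected fault),
    then let the checker compare actual vs. predicted parity."""
    out = function_f(x) ^ fault_mask
    error = parity(out) != predict_parity(x)
    return out, error
```

Any fault that flips an odd number of output bits changes the parity and raises the error signal; even-bit faults escape this particular characteristic, which is why the choice of predicted characteristic matters.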
Figure 2.1: General architecture of a concurrent error detection scheme [MM00]: the input feeds both the function (f) and an output-characteristics predictor; a checker compares the output of f with the predicted output characteristics and raises an error signal on mismatch.
Several CED-based redundancy techniques have been proposed and used commercially for designing reliable computing systems [HCTS10, SG10]. They are classified into three classes: hardware redundancy, time redundancy, and information redundancy; an FT system uses one or more of them. These techniques mainly differ in their error-detection capabilities and in the constraints they impose on the system design. In the next sections, we explore the commonly used error detection techniques.
2.1.1 Hardware Redundancy<br />
Hardware redundancy is the most commonly used approach [Bic10]. It refers to the addition of extra hardware resources, such as doubling the system and using a comparator at the output to detect errors. Here, consideration is given to the structure of the circuit and not to its functionality. It is equally effective against transient, timing and permanent faults; however, its area and power requirements are considerable. It can be classified into two sub-types: (i) duplication with comparison and (ii) duplication with complement redundancy.
Duplication with comparison (DWC), also called dual modular redundancy (DMR) [JHW + 08], is a simple and easy-to-implement error detection technique (see figure 2.2). It has good error detection capability: theoretically, it can detect 100% of all possible errors by running all operations on two copies of a component and comparing the results [MS07]. However, it cannot detect design bugs, errors in the comparator, or combinations of simultaneous errors in both modules. Replication can be performed at different granularities (units vs. cores), but always comes at a considerable hardware cost (more than 200%). A classic example of DMR is the IBM S/390 mainframe processor [SG10], where the I-unit (fetch and decode units) and E-unit (execution unit) are duplicated and their signals compared for transient fault detection.
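A software model of DWC can be sketched as follows (illustrative only; the module functions are hypothetical stand-ins for two hardware copies of the same unit):

```python
def dwc(module_a, module_b, x):
    """Run the same operation on two module copies and compare the
    results; a mismatch raises the error signal."""
    out_a, out_b = module_a(x), module_b(x)
    return out_a, out_a != out_b  # (output, error signal)

def square(x):
    return x * x

def faulty_square(x):
    """Second copy with a modeled bit-flip fault on bit 1 of the output."""
    return (x * x) ^ 0b10
```

As the text notes, this detects any error that makes the two outputs differ, but not an identical error in both copies, nor a fault in the comparator itself.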
Figure 2.2: Duplication with comparison (DWC): the input drives two copies of F; a comparator checks their outputs and raises an error signal on mismatch.
There is another, complementary technique called duplication with complement redundancy (DWCR) [Jab09]. It is similar to DWC, but in the second module the input signals, output control signals and internal data signals are of opposite polarity, so that a common-mode disturbance does not produce the same error in both modules and thereby cause a system failure.
Here as well, the area and power overhead is more than 200%, and this method increases design complexity compared to simple duplication. The technique is used in dual-checker rail (DCR) designs, where the two outputs are complementary in the absence of errors; such schemes are sometimes employed in controllers.
2.1.2 Temporal/Time Redundancy<br />
This type of redundancy requires a single unit to perform an operation twice, one computation after the other; a difference between the two subsequent computations indicates the presence of an error [AFK05]. In this approach there is a penalty in terms of extra time, but the area penalty is smaller than for DMR: the only additional hardware required is the comparator and some temporary storage. It is a replication technique in time, with no consideration given to the functionality of the circuit.
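The twice-in-time scheme of figure 2.3 can be sketched as a single unit computing twice (a simplified Python model; the `transient_mask` parameter is a hypothetical way of injecting a transient fault into the first computation):

```python
def time_redundant(f, x, transient_mask=0):
    """Compute f(x) at time t (possibly hit by a transient), buffer the
    result, recompute at t + delta_t on the same unit, and compare."""
    result_t = f(x) ^ transient_mask  # first computation, maybe upset
    buffered = result_t               # stored in the buffer
    result_t2 = f(x)                  # recomputation on the same unit
    return result_t2, buffered != result_t2  # (output, error signal)
```

A transient affecting only one of the two computations is detected, whereas a permanent fault corrupts both computations identically and escapes, which motivates the encoded variant of figure 2.4.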
In this scheme, intermittent and transient faults are detected (as shown in figure 2.3) but permanent faults are not. For permanent fault detection, the circuit is modified as shown in figure 2.4: the computation on the input data is first performed at time t, and the result of this computation is stored in a buffer. The same data is then used to repeat the computation on the same functional block at time t + δt; this time, however, the input data is first encoded in some manner. The result of the second computation is decoded and compared to the result produced before, and any discrepancy reveals a permanent fault in the functional block.

Figure 2.3: Time redundancy for temporary and intermittent fault detection: F processes the same input at times t and t + δt; the first result is buffered and a comparator raises an error signal on mismatch.
Figure 2.4: Time redundancy for permanent error detection: the input is encoded before the computation at t + δt; the decoded result is compared with the buffered result of the plain computation at t, and a mismatch raises the error signal.
An alternative approach is redundant execution with shifted operands (RESO) [PF06], where some instructions are executed redundantly with shifted operands on the same functional units. Shifting the result back by the same amount yields the original result computed with unshifted operands. Re-executing instructions detects transient faults, whereas re-executing with shifted operands also detects permanent faults. The scheme works when the function possesses the required properties, such as linearity.
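For a linear operation such as addition, RESO can be sketched as follows (a simplified model using Python's unbounded integers; a real ALU must also manage the bits shifted out of range, and `stuck_adder` is a hypothetical model of a permanent fault, not a circuit from [PF06]):

```python
def reso(add, a, b, shift=1):
    """Redundant execution with shifted operands: recompute the addition
    with both operands shifted left, shift the result back, and compare.
    For a linear operation, (a << k) + (b << k) == (a + b) << k."""
    plain = add(a, b)
    shifted = add(a << shift, b << shift) >> shift
    return plain, plain != shifted  # (result, error signal)

def good_adder(x, y):
    return x + y

def stuck_adder(x, y):
    """Adder whose output bit 0 is permanently stuck at 1."""
    return (x + y) | 1
```

Because the stuck bit lands in different bit positions of the plain and shifted results, the comparison exposes the permanent fault that plain re-execution would miss.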
Time redundancy directly affects system performance, although its hardware cost is generally less than that of hardware redundancy; temporal-redundancy-based systems are therefore comparatively slower. To overcome this issue, many systems use pipelining to hide the latency from the client. Temporal redundancy brings no energy benefit either: it uses twice as much active power as a non-redundant unit.
2.1.3 Information Redundancy<br />
The basic idea behind information redundancy is to add redundant information to the data being transmitted, stored or processed, in order to determine whether errors have been introduced [IK03]. Data is protected through mathematical encoding and can be reused after decoding (as in figure 2.5). The encoding and decoding circuitry adds delays that make this approach slower than DMR, but the area overhead is much lower. In coding, consideration is given to the information stored, and possibly to the functionality of the circuit, but not to its structure. Typically, information redundancy is used to protect storage elements (memories, caches, register files, etc.) [HCTS10], e.g. in the POWER6 and POWER7 processors [KMSK09]. Codes are classified by their detection and correction abilities, code efficiency and complexity. In this section, we discuss only error detection codes.
Figure 2.5: Information redundancy principle (encode the data by adding redundancy; transmit or store it in the presence of noise; decode the data by checking the redundancy)
Error detecting codes (EDC) have less hardware overhead than error correcting codes. There are different EDCs, e.g. parity, Borden, Berger and Bose codes. We will not go into much detail, but their salient features will be compared.
Parity coding is the simplest strategy and has the lowest hardware overhead [ARM + 11]. It is based on computing even or odd parity over a data word of length N. The parity can be calculated with an XOR operation over the data bits. A parity code has a distance of 2 and can detect all odd-bit errors.
Figure 2.6: Parity coder in data storage
Before storing data in the register, the parity generator computes the required parity bit (as shown in figure 2.6). Then both the computed parity and the original data are stored in the register.
36 CHAPTER 2. METHODS TO DESIGN AND EVALUATE FT PROCESSORS<br />
When data is retrieved, a parity checker computes the parity of the stored data bits. The parity checker compares the computed parity with the stored parity, and an error signal is set accordingly. Similarly, parity coding can also be used to protect logic functions (see figure 2.7). It is commonly used in computers to check for errors in buses, memory, and registers [IK03].
Figure 2.7: Functional Parity
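The store-and-check flow of figure 2.6 can be sketched as follows (an illustrative model; the `store`/`check` names are assumptions, not a real register-file API):

```python
def parity(word, width=8):
    """Even parity: XOR of all data bits."""
    p = 0
    for i in range(width):
        p ^= (word >> i) & 1
    return p

def store(word):
    """Write path of figure 2.6: keep the data together with its parity bit."""
    return word, parity(word)

def check(word, stored_parity):
    """Read path: recompute parity; a mismatch raises the error signal."""
    return parity(word) != stored_parity

data, p = store(0b1011_0010)
clean = check(data, p)                  # False: clean read-back
one_flip = check(data ^ 0b100, p)       # True: any odd number of flips is caught
two_flips = check(data ^ 0b110, p)      # False: an even number escapes (distance 2)
```

The last case illustrates the distance-2 limitation stated above: only odd-bit errors are detectable.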
Cyclic redundancy checks (CRCs) are another class of EDC, commonly employed to detect errors in digital systems [IK03]. Cyclic codes are parity check codes with the additional property that the cyclic shift of a codeword is also a codeword. If
(Cn−1, Cn−2, . . . , C1, C0) is a codeword, then
(Cn−2, Cn−3, . . . , C0, Cn−1) is also a codeword.
The idea is to append a checksum to the end of the data frame in such a way that the polynomial represented by the resulting frame is divisible by the generator polynomial G(x) that the sender and receiver have agreed upon. When the receiver gets the checksummed frame, it divides it by G(x); if the remainder is not zero, there has been a transmission error. The best generator polynomials are therefore those least likely to divide evenly into a frame that contains errors. CRCs are distinguished by the generator polynomials they use. A CRC cannot directly indicate the erroneous bit positions during decoding; hence it is limited to error detection.
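The divide-and-check procedure can be illustrated at the bit level as follows. The generator G(x) = x^3 + x + 1 is an arbitrary small example chosen for readability, not a standard CRC polynomial:

```python
def poly_mod(bits, g):
    """Remainder of the bit polynomial `bits` divided by `g` over GF(2)."""
    bits = list(bits)
    for i in range(len(bits) - len(g) + 1):
        if bits[i]:                      # long division: XOR g under a leading 1
            for j, gb in enumerate(g):
                bits[i + j] ^= gb
    return bits[len(bits) - len(g) + 1:]

def crc_encode(frame, g):
    """Append a checksum so the whole codeword is divisible by G(x)."""
    checksum = poly_mod(frame + [0] * (len(g) - 1), g)
    return frame + checksum

def crc_ok(codeword, g):
    """Receiver check: a zero remainder means no detected transmission error."""
    return not any(poly_mod(codeword, g))

G = [1, 0, 1, 1]                 # G(x) = x^3 + x + 1 (example generator)
cw = crc_encode([1, 1, 0, 1], G)
ok_before = crc_ok(cw, G)        # codeword divides evenly: no error
cw[2] ^= 1                       # single-bit transmission error
ok_after = crc_ok(cw, G)         # nonzero remainder: error detected
```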
The Borden codes are another class of codes that can detect unidirectional errors (errors that cause either 0 → 1 or 1 → 0 transitions, but not both). They are the optimal codes for unidirectional error detection. The Berger EDC is capable of detecting all unidirectional errors. It is formed by appending check bits to the data word; the check bits constitute the binary representation of the number of 0s in the data word. For example, a 3-bit data word needs 2 check bits. The Berger code is simpler to handle than the Borden codes. The Bose code is more efficient than the Berger code: it provides the same error detecting capability, but with fewer check bits. In short, code efficiency increases with code complexity, and choosing the right code depends on the application needs.
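The Berger construction can be sketched in a few lines (a minimal illustration of the check-symbol rule stated above; the helper names are assumptions):

```python
from math import ceil, log2

def berger_check(word, width):
    """Berger check symbol: the binary count of 0s in the data word."""
    return width - bin(word).count("1")

def check_width(width):
    """Check bits needed to represent every possible 0-count (0..width)."""
    return ceil(log2(width + 1))

bits_needed = check_width(3)        # 3-bit data word -> 2 check bits
stored = berger_check(0b101, 3)     # one 0 in the word -> check symbol 1
# A unidirectional error (here only 1 -> 0 flips) always changes the
# 0-count monotonically, so the recomputed check symbol cannot match:
miss_1 = berger_check(0b100, 3)
miss_2 = berger_check(0b001, 3)
```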
In arithmetic processing circuits (such as an ALU) the previously discussed codes are incapable of detecting errors, because when two data symbols are subjected to an arithmetic operation, the result is a new data symbol that cannot be uniquely expressed as a combination of the inputs [FP02, Nic02]. Arithmetic codes, in contrast, are useful for checking arithmetic operations, where parity would not be preserved [FP02, IK03]. The information part of an operand is processed through a typical arithmetic operator, while a check symbol is concurrently generated (based on the information bits) [Bic10]. They have two classical implementations: AN and residue codes.
AN codes are the simplest form of arithmetic codes [Muk08]. They are formed by multiplying each data word N by a constant A. The following equation illustrates the property that defines an AN code:
A (N1 + N2) = A (N1) + A (N2) (2.1)
AN codes are preserved only under arithmetic operations; they are not valid for logical and shift operations. They are not commonly employed due to their high hardware and timing penalty.
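Equation (2.1) can be exercised directly. The choice A = 3 below is an assumed example (3N codes are a classic instance); with A = 3, any single-bit flip changes the value by ±2^i, which is never a multiple of 3, so it breaks divisibility and is detected:

```python
A = 3   # example multiplier; real designs pick A so that checking is cheap

def an_encode(n):
    """Encode a data word by multiplying it by the constant A."""
    return A * n

def an_valid(code):
    """Every valid AN codeword is a multiple of A."""
    return code % A == 0

x, y = an_encode(14), an_encode(23)
s = x + y                       # equation (2.1): arithmetic acts on codewords
sum_valid = an_valid(s)         # True, and s // A recovers 14 + 23
fault_valid = an_valid(s ^ (1 << 4))   # a single-bit fault: no longer valid
```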
Figure 2.8: Residue codes adder [FFMR09].<br />
Residue codes are another type of arithmetic code, in which the information used for checking is called the residue. The residue r of an operand A is the remainder of A divided by the modulo base m [Bic10]. Two computations occur simultaneously (see figure 2.8). In the first computation, two operands, A and B, undergo an arithmetic operation in the ALU, and a residue generator then produces a residue code from the ALU result. In the second computation, each operand concurrently enters a residue generator, and the residues undergo the same ALU operation as in the first computation (addition in this case) [FFMR09]. Finally, the two residues are compared to detect errors.
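The two concurrent paths of figure 2.8 can be sketched as follows (a behavioural model only; the mod-3 base and the fault model are assumptions for illustration):

```python
M = 3   # modulo base; mod-3 needs only a 2-bit residue channel

def residue_checked_add(a, b):
    """Sketch of figure 2.8: the ALU adds the operands while a concurrent
    path adds their residues; the residue of the ALU result must match."""
    alu_result = a + b                       # first (main) computation
    residue_path = (a % M + b % M) % M       # second, concurrent computation
    error = (alu_result % M) != residue_path
    return alu_result, error

good, err = residue_checked_add(25, 17)      # (42, False): residues agree
# Model a fault corrupting the ALU output: the residue comparison fails.
faulty = (25 + 17) ^ 0b100
detected = (faulty % M) != ((25 % M + 17 % M) % M)
```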
2.2 Error Correction<br />
Error correction has previously been discussed in section 1.5.4. As with error detection, correction techniques are classified into three subclasses: hardware, information and temporal redundancy.
2.2.1 Hardware Redundancy<br />
Adding a third module and replacing the comparator with a voter in DMR leads to Triple Modular Redundancy (TMR), as shown in figure 2.9. In addition to detecting errors, TMR can also correct them. A more general approach, N-Modular Redundancy (NMR), is discussed in [KKB07]. In these techniques the effects of faults are masked: all the components work simultaneously and their outputs are fed into a voter. The output of the voter will be correct if at least two of the components are non-faulty. Static redundancy techniques are simple, but they have high area and power overheads.
Figure 2.9: Triple modular redundancy (TMR) (three copies of F feed a voter whose output masks the fault)
TMR has been a prominent FT solution in aircraft [Yeh02] and space shuttles, where not only processors but entire systems are replicated for robustness. TMR can also be implemented at the software level: an approach proposed in [SMR + 07] uses a software implementation of TMR in which operating system processes are triplicated and run on multiple available cores. Input replication and output comparison are done by a system call emulation unit.
TMR can be employed to address single-bit data errors (SET, persistent, non-persistent) occurring in a cell [Car01]. In TMR the single point of failure is the voter: if a fault occurs in the voter, the whole system fails. However, the voter is typically small and hence often assumed to be reliable. There is a significant area and power penalty (approximately a factor of 3.0 − 3.5) associated with TMR as compared to a non-redundant design [JHW + 08].
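The masking behaviour of the TMR voter can be demonstrated with a bitwise 2-of-3 majority (a behavioural sketch; the XOR fault masks model bit flips in a module's output):

```python
def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority: the value present on at least two inputs wins."""
    return (a & b) | (a & c) | (b & c)

def tmr(module, x, fault_masks=(0, 0, 0)):
    """Run three copies of `module`; each mask models flips in one copy."""
    outs = [module(x) ^ f for f in fault_masks]
    return tmr_vote(*outs)

double = lambda v: 2 * v
fault_free = tmr(double, 21)                     # 42
one_faulty = tmr(double, 21, (0b1000, 0, 0))     # still 42: fault masked
two_faulty = tmr(double, 21, (0b1000, 0b1000, 0))  # wrong: TMR is defeated
```

The last line shows the stated limit: the vote is only correct while at least two modules are non-faulty.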
2.2.2 Temporal Redundancy<br />
For error correction with temporal redundancy, a computation is repeated on the same hardware at three different time intervals and the results are finally voted [MMPW07]. This requires three times more clock cycles to execute the same task. It can only correct errors due to transient faults, provided that the fault duration is shorter than the computation time. Since it needs additional time to repeat the computations, it can only be employed in systems with low or no timing constraints. However, it has low area overheads compared to TMR.
2.2.3 Information Redundancy<br />
Error correcting codes (ECC) can provide cheaper solutions than other well-known redundancy techniques like TMR [CPB + 06]. They are commonly used to protect memory (see figure 2.10). The overhead of a code depends on (i) the additional bits required to protect the information and (ii) the additional hardware/latency for encoding and decoding. The encoding/decoding latency can, however, be reduced if it is executed in parallel.
Figure 2.10: Error detecting and correcting memory block
Among the different ECC codes, those commonly employed in digital circuits include Hamming codes, Hsiao codes and Reed-Solomon codes. These codes can correct errors in addition to detecting them. There are two key parameters of error correcting codes: (i) the number of erroneous bits that can be detected and (ii) the number of erroneous bits that can be corrected. A code's error detection/correction properties are based on its ability to partition the set of 2^n n-bit words into a code space of 2^m code words and a non-code space of 2^n − 2^m words [FP02]. The simplest block codes are Hamming codes; they are single-error-correcting, double-error-detecting (SEC-DED) codes [LBS + 11], but not both simultaneously. They are the earliest linear ECC codes. They are quite useful in cases where only a single error has significant probability, but they carry the hazard of miscorrecting double errors.
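The single-error-correcting behaviour can be illustrated with the classic (7,4) Hamming arrangement (a sketch assuming the usual convention of parity bits at the power-of-two positions 1, 2 and 4):

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]      # d[0] = least significant bit
    c = [0] * 8                                  # index 0 unused
    c[3], c[5], c[6], c[7] = d[0], d[1], d[2], d[3]
    c[1] = c[3] ^ c[5] ^ c[7]                    # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]                    # ... with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]                    # ... with bit 2 set
    return c[1:]

def hamming74_correct(code):
    """Recompute parities; the syndrome names the bad position (0 = clean)."""
    c = [0] + list(code)
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7])
         | ((c[2] ^ c[3] ^ c[6] ^ c[7]) << 1)
         | ((c[4] ^ c[5] ^ c[6] ^ c[7]) << 2))
    if s:
        c[s] ^= 1                                # single-error correction
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, s

cw = hamming74_encode(0b1011)
cw[4] ^= 1                                       # flip position 5 (a data bit)
recovered, position = hamming74_correct(cw)      # (0b1011, 5)
```

A double error would produce a nonzero syndrome pointing at the wrong position, which is exactly the miscorrection hazard noted above.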
The Hsiao codes (also called advanced Hamming codes) are other commonly used codes for protecting/correcting errors in memory [Mon07]. They have faster encoding and error detection than Hamming codes [Hsi10].
More powerful codes may be constructed by using appropriate generator polynomials. Among them, Reed-Solomon codes are cyclic codes that require complex encoding and decoding circuitry and are especially well suited to applications where errors occur in bursts; that is why they are mostly employed in channel coding. On the other hand, convolutional coding schemes are useful in data storage and transmission systems, such as memories and data networks [FP02].
2.3 Error Recovery<br />
Recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and without faults that can be activated again [ALR01]. Error recovery can only be initiated upon detection of a fault/error; therefore, the system should have a built-in self-checking mechanism. Modern microprocessors have various built-in error detection capabilities, such as error detection in memory, caches and registers, illegal op-code detection, and so on [MBS07]. Recovery can operate at a higher level, based on error handling (eliminating errors from the system state), or at a lower level, based on fault handling (preventing faults from being activated again).
Recovery hides the consequences of faults from the user. It is more suitable for transient and intermittent faults; for permanent faults, recovery alone is generally not sufficient. One mandatory feature is fault handling (see figure 2.11), which eliminates faults from the system state [LB07] and prevents them from being activated again. This requires further features such as diagnosis, which reveals and localizes the cause(s) of error(s) [LRL04]. Diagnosis can also enable more effective error handling: if the cause of the error is localized, the recovery procedure can act on the associated components without affecting the other parts of the system and the related system functionality. In this work, we address soft errors caused by transient faults; therefore, fault-handling techniques will not be explored further.
There are two sub-types of error recovery: forward error recovery (FER) and backward error recovery (BER). In FER, the system does not restore its states but continues to make forward progress; a compensator overcomes the faults (as shown in figure 2.11 (FER)). For example, in TMR the voter masks (compensates) the fault, and in ECC the error correcting circuitry corrects the (corrigible) error.
BER involves restoring the system to a previously known safe state. In other words, the state transformation consists of returning the system to a saved state that existed prior to error detection [ALR01]. For successful BER, the system must know: (i) which states are to be saved, and where, for the recovery point; (ii) which algorithm to use; and (iii) what the system does after recovery.
There are two known algorithms for saving the BER recovery states: checkpointing and logging. The choice depends on the micro-architecture of the core and the recovery requirements, because the two have different costs for different types of state, and many BER systems use a hybrid of both; a system presented in [SMHW02] uses hybrid BER. A practical criterion of choice is that if there are few registers and recoveries are not frequent, checkpointing is preferred; if there are many registers and recovery is frequent, logging is preferred.
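The checkpointing alternative can be sketched as follows (a toy model; the `CheckpointedCore` class and its register names are illustrative assumptions, not an actual micro-architecture):

```python
class CheckpointedCore:
    """Minimal BER sketch: checkpointing copies the whole register set at a
    recovery point; on a detected error the set is restored (rollback)."""

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "pc": 0}
        self.shadow = dict(self.regs)      # shadow copy = recovery point

    def checkpoint(self):
        self.shadow = dict(self.regs)      # cheap when there are few registers

    def rollback(self):
        self.regs = dict(self.shadow)      # restore last known-good state

core = CheckpointedCore()
core.regs.update(r0=7, pc=4)
core.checkpoint()
core.regs.update(r0=99, pc=8)              # possibly corrupted work
core.rollback()                            # error detected: back to checkpoint
```

A logging scheme would instead record each individual register write and undo them one by one, which is why it wins when registers are many and recoveries frequent.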
Figure 2.11: Basic strategies for implementing Error Recovery. (In BER, error detection triggers a rollback; in FER, it triggers compensation. In both cases, transient and intermittent faults allow service continuation, while a permanent fault invokes fault handling and a maintenance call.)
Another important aspect is where to save the states of the recovery point. A shadow register file can be created in the core to save the states of the sensitive elements; the backup values in the shadow copy can then be used for rollback and recovery [AHHW08]. Other techniques, which require high reliability, store the states of the internal registers off-chip; when the states are restored, ECC is employed to avoid possible errors. Recently, much development has been done in BER and many low-cost computers employ it; for example, IBM employs checkpoint recovery in the POWER6 micro-architecture [MSSM10].
2.4 FT Processor Design Trends<br />
Recently, fault-tolerant computing has begun to draw more and more attention in a wide range of industrial and academic communities, due to increased safety and reliability demands [ZJ08]. Today, FT is a need of real-time industrial applications [RI08]. Since high-cost solutions are mostly unacceptable to industry, modern processors avoid hardware replication and tend to employ alternative techniques with lower power and area overheads (such as information redundancy or hybrid redundancy). Information redundancy (e.g. employing ECC) has less hardware overhead; however, it may incur an additional performance penalty.
The performance penalty and hardware overhead depend on the type of ECC. The choice of ECC depends on three constraints: power, area and error coverage requirements. Codes with better error coverage often have higher time penalties and hardware overheads. Parity codes are fast and have low hardware cost, whereas commonly employed ECCs like Hamming codes have better error coverage (e.g. SEC-DED).
The performance overhead can be minimized (masked) to an extent by calculating the parity bits in parallel. On the other hand, a common trend to reduce the hardware penalty is to compromise on error coverage and employ low-cost error detecting codes (e.g. simple parity or mod-3 coding). Likewise, some well-known processors of the last decade, like the Power6, the Itanium series and the SPARC64 V, employ parity predictors and modulo codes in their arithmetic/logic units to reduce power and cost.
ECCs are commonly employed to protect caches and data storage [QLZ05]. For example, the Itanium processor can detect 2-bit errors in the cache by relying on ECC. IBM processors have write-through L1 caches and use simple parity in the L1 cache and ECC in the L2 cache. On the other hand, Intel uses ECC even in the L1 caches.
Checkpointing with rollback is an alternative trend. It can be an effective FT solution for processors with few internal states (registers): the higher the number of internal states, the higher the performance (time) penalty for checking, loading and storing them. Some modern processors that employ this methodology reduce the performance penalty by checkpointing after every superscalar block of instructions; common examples are the Power6 and Power7.
A newer trend in processor design is to employ flexible error coverage and allow the user to choose the level of protection and redundancy needed for a particular application, e.g. the ARM Cortex-R series of application-specific processors. For higher error coverage, DMR is employed, whereas for lower coverage ECC is employed; however, the area overhead will then always be higher than 200%. In the following sections, we discuss the different FT methodologies employed in some well-known FT processors of the last decade.
SPARC64 V [AKT + 08]<br />
The SPARC64 V microprocessor is designed for mission-critical UNIX servers. In order to achieve uninterrupted operation, these servers must be resistant to soft errors. Data integrity is also highly important because of the dangers that silent data corruption (SDC) can pose in mission-critical systems. To meet these requirements, the processor was designed not only to correct SRAM errors, but also to detect errors in logic circuits and to recover from those errors when practical.
There are three smaller cache arrays of 128KB each, namely the level-1 instruction cache, the level-1 data cache and the branch history cache (BRHIS). The level-1 data cache is write-back and protected by the same SEC-DED codes as the level-2 cache. The level-1 instruction cache and BRHIS are covered by parity check. When an error is detected during a level-1 instruction cache read, the read entry is invalidated and re-fetched from the ECC-protected level-2 cache. An error in BRHIS is treated as a cache miss and the processor delays execution of the conditional branch instruction until the correct branch address is calculated. The processor takes a minor performance hit but is able to continue correct instruction execution.
Tags for level-1 instruction and data caches are parity-protected. Both level-1 caches are inclusion<br />
caches; tag information is duplicated in the level-2 tag. When a parity error is detected in a level-1 tag<br />
access, the level-2 tag is interrogated for the correct copy of the tag. The level-1 cache access is then<br />
re-executed. The last major SRAM array on the chip is the Translation Look-aside Buffer (TLB).<br />
The TLB is protected by parity check, and a parity error in the TLB is treated as a miss. The correct page table entry is fetched from the ECC-protected main memory during re-execution. In addition
to implementing cache and TLB protection, the SPARC64 V is designed to detect single bit SRAM<br />
errors in other smaller SRAM arrays and recover from those errors as well.<br />
The processor logic circuits are protected by byte parity check to detect single bit logic errors in<br />
each byte. Parity check bits are calculated at the location of new data value generation and passed<br />
with the associated data through the processor logic circuits. Parity bits are checked at the receiving<br />
end.<br />
Arithmetic/logic units are equipped with byte parity predictors. The byte parity predictors calcu-<br />
late the parity bits for each output byte of an arithmetic/logic unit using the same input signals as the<br />
unit to be checked. These independently calculated byte parity bits are compared with the byte parity<br />
bits calculated from the output of the arithmetic/logic unit. Multipliers are checked with a modulo-3<br />
scheme.<br />
The byte parity predictors in the arithmetic/logic unit do not detect point errors that result in an even number of bit flips in the output byte, and the modulo-3 scheme used in the multipliers does not detect point errors that leave the modulo-3 residue unchanged. These checks, however, do detect the majority of single point errors and are cost-effective compared to a full duplication-and-compare implementation. When a parity error is detected in the logic circuits or small SRAM arrays, the processor stops issuing new instructions and clears all intermediate states. It then restarts execution at the instruction directly following the last correctly executed instruction by using the check-pointed states. This action is called instruction retry.
The checkpoint and instruction retry mechanisms are implemented in the processor for recovery<br />
from branch misprediction. Thus, the additional cost associated with utilizing these mechanisms for<br />
error recovery is small. Furthermore, many microprocessors today feature either ECC or byte parity<br />
for large on-chip SRAM arrays. Compared with those microprocessors, the SPARC64 V micropro-<br />
cessor only requires additional transistors for implementing byte parity bits, byte parity predictors and<br />
the associated parity checkers in the logic circuits and small SRAM arrays. The number of transistors<br />
devoted to the error detection mechanisms of the SPARC64 V microprocessor is about 10% of the transistors for logic gates, latches and parity-protected small SRAM arrays.
LEON3 FT<br />
LEON3 is the successor of the LEON2 processor developed for the European Space Agency (ESA). The LEON3FT [GC06] is a fault-tolerant version of the standard LEON3 (a clone of the SPARC V8). In LEON3FT, consideration is given only to the protection of data storage, not to the functionality of the processor: there is no protection for the control unit, data path or ALU circuitry.
The internal registers are protected with ECC codes plus a shadow copy. Upon a detected parity error, a duplicate copy of the data is read out from a redundant location in the register file, replacing the failed data. A few internal registers have four-bit error detection capability; however, the majority of registers only have two-bit error detection.
The cache memory in LEON3-FT consists of separate instruction and data caches, each 8 Kbytes large. Each cache has two parts, tag and data RAMs. The tag and data memories are implemented with on-chip block RAM and protected with four parity bits per 32-bit word, allowing up to four simultaneous errors per cache word to be detected. Upon a detected error, the corresponding cache line is deleted and the instruction is restarted. This operation takes 6 clock cycles (idle states) and is transparent to software. For diagnostic purposes, error counters are provided to monitor detected and corrected errors in both the tag and data parts of the caches.
Boeing 777 the control system<br />
In the Boeing 777, the control system is made reliable through redundant channels with different processors and diverse software to protect against design errors as well as hardware faults [BT02]. It uses heterogeneous triple-triple modular redundancy [Yeh02] (as shown in figure 2.12): three different processor architectures (Intel 80486, Motorola 68040 and AMD 29050) execute the same operation. However, it is an expensive solution that can only be employed in mission-critical applications.
ARM Cortex R Series [ARM09]<br />
The ARM Cortex-R series is a family of embedded processors for real-time industrial applications. They are highly customizable, so that manufacturers can choose the features that suit their application needs.
If the ECC build option is enabled, a 64-bit ECC scheme protects the instruction cache: the data RAMs include eight bits of ECC code for every 64 bits of data. The data cache is protected by a 32-bit ECC scheme: the data RAMs include seven bits of ECC code for every 32 bits of data.
If the parity build option is enabled, the caches are protected by parity bits: for both the instruction and data caches, the data RAMs include one parity bit per byte of data.
The processor can also be implemented with a second, redundant copy of most of the logic. The second copy shares the cache RAMs of the master core, so that only one set of caches is used. Comparing the outputs of the redundant core with those of the master core detects faults.
Figure 2.12: The triple-TMR in Boeing 777 [Yeh02] (three channels, each voting over an Intel 80486, a Motorola 68040 and an AMD 29050; the channel outputs are voted again so that an error in any component is masked)
Power6 [MSSM10, KMSK09, KKS + 07]
The Power6 processor, designed by IBM, uses inline checkers instead of TMR, which costs less power and hardware overhead. It has built-in self-checking in the data and control flow paths: residue checking is employed for the floating-point unit, and logical consistency checkers for the control logic. It has a recovery unit which checkpoints after a group of superscalar instructions completes. The inline checkers write into a fault isolation register that decides whether the current state is error free; in case of error detection, the recovery unit initiates instruction-retry recovery. The memory bus, including the input-output unit, is protected by ECC codes. The L1 cache is protected by simple parity, while the L2 and L3 caches and all signals into and out of the chip to L3 have ECC protection.
Intel Itanium 9300 Series [Int09]<br />
The Intel Itanium 9300 series processors are high-performance processors. The L2, L3 and directory caches are protected with ECC, which can correct all single-bit errors and most double-bit errors. Moreover, hardware-assisted scrubbing support is available for the L2, L3 and directory caches. Memory is also protected thermally: different thermal sensors send information to the memory controllers, which consequently increase fan speed to regulate the temperature. The internal registers of the processor are protected by ECC. Additionally, there are redundant clocks and soft-error-hardened latches and registers to improve resistance to soft errors.
2.5 FT Evaluation<br />
In the semiconductor industry, testing expenses increase the overall cost of IC design and manufacturing. Generally, industrial testing is meant to find permanent faults that can be introduced at manufacturing time. However, the most frequently occurring faults in computer systems are temporary effects like transient and intermittent faults; they are the main cause of digital system failures [VFM06]. Due to the increasing probability of transient faults in the latest technologies, more and more designers will have to analyse the potential impact of these faults on the behaviour of their circuits.
The error model used to evaluate faults depends on their duration. Permanent faults can be tolerated by replacing the faulty component, whereas a temporary fault disappears by itself. Intermittent faults are treated with either the permanent or the temporary model, depending on how often they occur. Some common techniques to evaluate FT systems are discussed in [WCS08]. Among them, fault injection is widely accepted as an effective approach to evaluate fault tolerance [LN09, Nic10].
2.5.1 Fault Injection<br />
Fault injection is a validation technique for FT systems which consists in carrying out controlled experiments in which the system's behavior is observed in the presence of faults explicitly induced by the voluntary introduction of faults into the system [ACC + 93]. In other words, it is the purposeful introduction of faults (or errors) into a target [NBV + 09]: an intentional activation of faults in order to observe the behavior of the system under fault. The objective is to compare the nominal behavior of the circuit (without fault injection) with its behavior in the presence of faults injected during the execution of an application.
Fault injection techniques have become popular to evaluate and improve the dependability of embedded processor-based systems [LAT07]. Fault injection can be accomplished at the physical or the simulation level.
1. Physical fault injection: faults are injected directly into the hardware by disturbing its environment (heavy-ion radiation, electromagnetic interference, laser, etc.) [BGB + 08, Too11]. Many methods have been proposed, aimed primarily at the validation of physical systems, including injection on circuit pins, injection of heavy ions, disruption of power supplies, and laser fault injection [GPLL09]. None of these approaches can be used for evaluation before the circuit is actually manufactured. An alternative is therefore to employ injection techniques that allow earlier analysis of the design, typically at the register transfer or gate level, e.g. by introducing mistakes in an RTL description.
2. Simulated fault injection: fault injection campaigns can be performed using several approaches, especially simulation for high-level approaches. Simulation has been widely used for its simplicity, versatility, and controllability [NL11]. It is more expensive in time; however, it may allow more comprehensive analysis, provide more accurate results, and cost less than physical fault injection [NL11]. Fine-grained access to the internal states of the processor is easy with simulated fault injection, which is why it offers better controllability/observability. In this technique, the system under test is simulated on another computer system, and faults are produced by altering logical values during the simulation.
Simulated fault injection is a special case of injecting soft errors that can support various levels of abstraction of the system, such as architectural, functional, and logic [CP02], and for this reason it has been widely used to study fault effects. This technique has several other advantages. For example, its greatest advantage over the others is the controllability/observability of all the modelled components. Another positive aspect is the possibility to carry out the validation of the system during the design phase, before a final design exists. Alternate physical/simulation environments to perform safety analysis are discussed in [Bau05, RNS + 05].
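A minimal Python sketch of the simulated-fault-injection idea: the device under test is modeled in software, and a fault is produced by altering a logical value during the simulation. The toy instruction set, register names, and 8-bit datapath are our own illustration, not taken from any particular tool.

```python
def simulate(program, regs, fault_cycle=None, fault_reg=None, fault_bit=0):
    """Run a toy register-level simulation; optionally flip one bit
    of one register at a given cycle to emulate a transient fault."""
    for cycle, (op, dst, a, b) in enumerate(program):
        if op == "mov":
            regs[dst] = regs[a]
        elif op == "add":
            regs[dst] = (regs[a] + regs[b]) & 0xFF   # 8-bit datapath
        if cycle == fault_cycle and fault_reg is not None:
            regs[fault_reg] ^= (1 << fault_bit)      # inject the fault
    return regs

program = [("mov", "r1", "r0", None), ("add", "r2", "r0", "r1")]
golden = simulate(program, {"r0": 3, "r1": 0, "r2": 0})
faulty = simulate(program, {"r0": 3, "r1": 0, "r2": 0},
                  fault_cycle=0, fault_reg="r1", fault_bit=2)
print(golden["r2"], faulty["r2"])  # golden: 6, faulty: 10
```

Comparing the golden (fault-free) run with the faulty run is exactly the nominal-versus-faulty comparison described above.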
2.5.2 Error Models<br />
To design an FT system, it is important to be aware of the possible faults that can appear in it. Some commonly used fault models are shown in table 2.1. In practice, an architecture is designed to overcome possible errors rather than the underlying physical phenomena: such a system detects the active faults that produce errors, without being aware of their physical cause.
Table 2.1: Fault modeling at different abstraction levels

  Level        Model
  Programming  Instruction, sequences, etc.
  HDL          Functional model, register
  Logic        Gate level
  Electronic   CMOS transistor
  Technology   Physical layout
Error models can be classified along three axes: type of error, error duration, and number of simultaneous errors [Sor09]. A commonly considered error model is the bridging model, which covers short-circuits and cross-talk. This low-level error model is suited to detecting fabrication defects that can cause a short-circuit between two connections/wires.
The fail-stop error model is a higher-level error model: in a system based on it, all components stop working when an error is detected. Such systems are used in critical applications such as ATMs (automated teller machines), where a single error in a calculation can cause a loss of hundreds of dollars. The system stops working if an uncorrectable error is detected.
In the delay error model, the circuit produces the correct response but after an unexpected delay. This type of error can occur due to various internal physical phenomena of the device. Some related work is discussed in [EKD + 05].
Here, we are interested in bit-flip errors, which are largely representative of transient errors due to SEUs (SBUs and MBUs). Moreover, they are easy to model at many abstraction levels.
2.5.3 The Fault Injection Framework<br />
A fault injection framework usually needs at least three types of information:
(a) when the fault is to be injected: which condition will trigger the fault injection during the simulation;
(b) where the fault is to be injected: in which location the fault will be injected;
(c) what kind of fault is to be injected: what its effect will be.
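The three questions above can be captured in a small campaign driver. A hedged Python sketch follows; the `Fault` record, the toy register file, and the single-instruction program are illustrative assumptions, not part of any existing framework.

```python
import random
from dataclasses import dataclass

@dataclass
class Fault:
    trigger: int      # when: instruction count at which to inject
    location: str     # where: which register to corrupt
    bit: int          # what: which bit to flip (single bit-flip model)

def random_fault(n_instr, regs, width):
    """Non-deterministic profile: pick trigger, location and bit at random."""
    return Fault(random.randrange(n_instr), random.choice(list(regs)),
                 random.randrange(width))

def run_with_fault(program, regs, fault):
    """Simulate a toy program, injecting the fault when its trigger fires."""
    for count, (op, dst, src) in enumerate(program):
        if op == "mov":
            regs[dst] = regs[src]
        if count == fault.trigger:                  # when
            regs[fault.location] ^= 1 << fault.bit  # where + what
    return regs
```

A deterministic campaign fixes all three fields in advance; the non-deterministic variant draws them with `random_fault`.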
Fault Trigger (when)<br />
Fault injection may follow a deterministic or a non-deterministic (time) profile. A non-deterministic fault trigger may inject a fault after some amount of simulated time, whereas in the deterministic approach the fault is triggered by counting the simulated instructions: the fault is injected after a specified number of instructions have been executed. In simulated fault injection, non-deterministic behavior is obtained by choosing the amount of time, or the number of simulated instructions, at random.
A deterministic fault trigger may limit the scope of the fault injection by restricting the injection to a specific interval. In real applications, faults occur at random instants, so the practical solution is to use non-deterministic approaches, combining different trigger conditions for specific situations.
Fault Location (where)<br />
In processors, faults may affect the ALU, internal registers, or memory addresses, and hence the output of any instruction using the affected logic. In all cases, a change in a processor register or in memory can represent a realistic fault. A fault location is often described deterministically, but it can also be described non-deterministically by letting the fault injection framework choose at random which processor register to inject the fault into.
Fault Effect (what)<br />
As explained previously, the most common effect of a transient fault on a processor register or memory is the inversion of the state of one bit (single bit flip). By flipping a bit in a register or at a memory address, we can inject a fault as it would occur in a real situation; the value of the altered bit is toggled to the opposite value. This upset model is the standard transient fault model used in the reliability literature [Muk08]. A deterministic fault operation specifies which bit to flip, but the operation can also be non-deterministic, letting the fault injection framework choose the bit at random.
The above-mentioned details are the basic information needed by a fault injection simulator. Our exact choices are presented in chapter 6, where the validation methodology based on artificial error injection is described.
2.6 Conclusions<br />
In this chapter, we explored the existing design methodologies and validation techniques for FT processors. Today, FT processors employ a variety of redundancy techniques, each with its own area/time overheads. Hardware redundancy offers faster error detection and correction but has high area overheads, whereas temporal redundancy has lower area overheads but high time overheads.
In the past, FT processors were only used for mission-critical applications and mostly relied on hardware replication. Nowadays, however, every system needs at least some consideration of FT, and expensive solutions are not acceptable for most embedded systems. Therefore, modern processors either rely on hybrid techniques or focus on information redundancy techniques to reduce power and area requirements. The available low-cost solutions lack fast error detection. The need is to develop an alternate fault tolerance methodology that provides fast error detection at low power/area overheads.
In the last sections, different methods of processor evaluation were discussed. Among them, simulated fault injection has many advantages over physical fault injection, including better controllability/observability and architecture validation during the initial development stages.
II. QUALITATIVE AND QUANTITATIVE STUDY
Chapter 3<br />
Design Methodology and Model<br />
Specifications<br />
3.1 Motivation<br />
Due to current technology trends, there is growing concern that transient faults will occur more frequently in the future [FGAD10]. Since this reliability threat is projected to affect the broad computing market, traditional solutions involving excessive redundancy are too expensive in area, power, and performance [BBV + 05, SHLR + 09]. Research on FT system design with minimum hardware overhead has therefore gained great importance in the last few years.
Previously, research in this domain aimed only at attaining high dependability with minimum performance degradation, with little consideration for low-cost hardware solutions. Consequently, dependability was mostly attained through expensive solutions such as hardware replication. Well-known FT processors developed in the past, such as the Stratus, LEON-FT, Sun FT SPARC, and IBM S/390, employed hardware redundancy (either DMR or TMR). These processors have high power and hardware overheads and do not address the needs of everyday applications.
The available FT solutions often incur significant penalties in area, cost, or performance, and they are unable to tolerate faults efficiently [PIEP09]; they cannot fulfill the needs of common industrial applications. Some temporal redundancy techniques have minimum hardware overheads, but their significant time overheads limit overall performance. On the other hand, hardware replication is faster but increases cost and power requirements. It is a great challenge to build efficient FT systems with reduced time and hardware overheads, and efficient design optimization techniques are required to meet both constraints. Consequently, in this work we propose an FT processor design methodology offering an acceptable compromise between protection and area/time overheads. The next section explores the proposed methodology.
3.2 Methodology<br />
We want to design a fault-tolerant processor with minimum error detection latency and low hardware overhead. The challenge is to find a compromise between protection and hardware overheads: hardware detection and correction is fast but has high hardware overheads, whereas software-based detection and correction is slower but has low hardware overhead.
Consequently, for fast error detection we choose hardware-based concurrent error detection, so that an error is detected before it reaches the system boundaries and results in a catastrophic failure. For hardware savings, on the other hand, we accept an additional time penalty in error correction. Moreover, in real applications faults do not occur often, so a software-based rollback mechanism can be chosen to recover from errors.
Figure 3.1: Proposed methodology: a fault-tolerant computer with low hardware and time trade-offs, combining fast error detection (concurrent error detection) with low hardware overhead (rollback mechanism).
The resulting hardware-software co-design methodology (see figure 3.1) has the ability to detect errors as soon as they occur and to start the error recovery strategy immediately, preventing the propagation of errors throughout the system. The proposed methodology is suitable for non-safety-critical FT applications where error occurrence rates are not too high.
In the next section, we will discuss the most suitable CED and recovery mechanisms for the above<br />
scenario.<br />
3.2.1 Concurrent Error Detection: Parity Codes<br />
The implementation of CED usually requires extra hardware. One of the most straightforward and commonly used CED approaches is DMR. Theoretically, it can detect 100% of errors (except simultaneous errors in both modules and errors in the comparator) [MS07]. However, with this technique the total area exceeds 200% of the original circuit. The choice of checking strategy is a compromise between error coverage and acceptable overhead, and cost-effective solutions are the objective of further investigations in error detection. EDC have a smaller area overhead [Pat10] and are often considered sufficient for non-safety-critical processors [MS07]. Among EDC, we employ the simplest codes, because our objective is to show the feasibility of our approach; once the overall methodology has shown interesting results, we may employ stronger codes with better error coverage.
Parity codes are the simplest and cheapest known EDC. They detect all errors of odd bit multiplicity and require extra circuitry for parity bit generation and output parity verification. Their hardware overhead is much lower than that of the DMR approach, and they can be employed to protect registers, data buses, RAM, and bit-sliced circuits [Pie06].
Their disadvantage is that multiple-bit faults of even multiplicity remain unrecognized. The example of an 8 × 8-bit register file in figure 3.2 illustrates this fact: faults in registers 1 and 3 are detected by the parity check, whereas faults in registers 2 and 5 remain undetected.
Parity codes do not need complex encoding and decoding circuitry, and the gate count of the complete on-line checking scheme is small. Moreover, for soft errors, which are random in time and space, the likelihood of multiple errors within one clock cycle is exceedingly low. In this scenario, a less expensive approach such as parity-based error detection can suffice [Gha11].
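The odd-multiplicity coverage of parity, and its even-multiplicity blind spot, can be checked in a few lines of Python. The 8-bit word and the specific fault masks are illustrative choices mirroring figure 3.2.

```python
def parity(word):
    """XOR of all bits of an integer word (the parity bit)."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b10110000           # register contents (3 ones, parity 1)
check = parity(stored)        # parity bit stored alongside the word

single = stored ^ 0b00000100  # one bit flipped (odd multiplicity)
double = stored ^ 0b00010100  # two bits flipped (even multiplicity)

print(parity(single) != check)  # True: the error is detected
print(parity(double) != check)  # False: the error escapes the check
```

This is exactly the behavior in figure 3.2: odd-multiplicity faults change the parity, even-multiplicity faults do not.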
Figure 3.2: Limitation of the parity check. An 8 × 8-bit register file (rows x1 to x8, each with an odd-parity bit) is shown in a fault-free environment and in a noisy environment; odd-multiplicity bit faults are detected by the parity check, while even-multiplicity faults remain undetected.
Lisboa [LC08] employed a similar approach, using a standard parity-based technique to detect errors in single-output combinational circuits. In that work, a second circuit generates an extra output signal, named the check bit, and two circuits based on reduced-area XOR gates verify the parity of the inputs and outputs to detect soft errors.
3.2.2 Error Recovery: Rollback<br />
For minimum hardware overhead, software-based error correction is useful. One straightforward solution is temporal fault masking (software TMR). It has low hardware overhead, but it triples the execution time and, moreover, does not match the hardware CED already chosen. An alternate approach is software rollback, in which transient, non-persistent faults are tolerated (repaired) by repeating the operation in a controlled manner on the same hardware [RRTV02].
Rollback overcomes errors by returning to a point prior to the occurrence of the fault [MG09]. It allows a system to tolerate a failure by periodically saving the entire state, so that if an error is detected, the system rolls back to the prior checkpoint to recover [JPS08]. It requires little hardware overhead, and the resulting architecture can overcome errors at low cost. It is a good candidate for situations where recovery delays are acceptable, and the rollback principle combined with CED can be an efficient approach to error recovery [BP02, EAWJ02].
Figure 3.3: Rollback execution: the program moves through states A to F over times t to t+6 within a Sequence Duration (SD); since no error was detected during the last SD, the data was validated at VP n-1 and the SEs were stored before the current SD began.
Figure 3.4: Error detection during a Sequence Duration (SD): an error detected during instruction execution in the current SD triggers a rollback to VP n-1.
Our strategy to implement fault recovery is based on rollback execution, a classically employed<br />
software technique in real-time embedded systems [KKB07]. It relies on the following behaviors (see<br />
figure 3.3):<br />
• program (or thread) execution is split into sequences of fixed maximal length;
• to be validated, each sequence must reach its end without any error being detected;
• the data generated during a faulty sequence must be dismissed, and execution restarted from the beginning of the faulty sequence.
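These behaviors amount to a simple checkpoint/retry loop. A hedged Python sketch follows, in which `execute` and `detect_error` are placeholders for the processor's sequence execution and its hardware CED; the function and parameter names are ours.

```python
def run_with_rollback(sequences, initial_state, execute, detect_error):
    """Execute fixed-length sequences; validate each at its end,
    or discard its results and re-execute it from the checkpoint."""
    state = dict(initial_state)          # last validated SEs
    for seq in sequences:
        while True:
            trial = execute(dict(state), seq)   # work on a copy
            if not detect_error(trial):
                state = trial                   # validation point: commit
                break                           # proceed to next sequence
            # error detected: 'trial' is dismissed and the loop
            # re-executes the sequence from the validated checkpoint
    return state
```

Note that the un-validated results live only in `trial`: a detected error throws them away without ever touching the committed `state`, which is the role the journal plays in hardware.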
If an error occurs within an instruction sequence, the processor registers can be restored with the previously saved contents. In figure 3.4, an error is detected during instruction execution, so the rollback mechanism is invoked and re-execution starts from the states stored at the previous validation point (VP). A VP occurs after a fixed interval of instructions. In figure 3.5, on the other hand, no error is found during the Sequence Duration (SD), and all the data written during the SD is validated at the VP.
Figure 3.5: No error detected during the SD: the SEs are stored at VP n-1, instructions execute during the current SD (SED of which is spent storing the SEs), and with no error detected the data is validated at VP n.
The SD represents the full length of the sequence, including the time taken to store the sensitive elements (SEs) as well as the active instruction execution. In the remainder of this work, the processor internal states are called SEs. Let us denote the minimum time to load the processor SEs as 'SED' (see figure 3.5). The ratio of active sequence duration is then '(SD-SED)/SD', whereas the ratio of time spent loading the SEs is 'SED/SD'. Assuming a program length of 10,000 instructions and neglecting the possibility of errors, SD=10 leads to about 1000 SE loads, whereas SD=100 leads to about 100 SE loads. The SED/SD penalty (SE loading) is thus 10 times smaller with SD=100; in other words, a larger SD can result in faster program execution.
On the other hand, if the probability of errors is not ignored, the time penalty due to re-execution varies with the sequence length. At higher error injection rates, smaller SDs are the more effective compromise, because the probability that a sequence fails validation is higher for bigger SDs, and vice versa. This is further discussed in chapters 5 and 6.
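The arithmetic above can be reproduced directly. A small sketch with the same illustrative numbers (10,000 instructions, errors neglected):

```python
def se_load_count(program_length, sd):
    """Number of SE store/load phases for sequences of length SD,
    neglecting errors (so every sequence validates the first time)."""
    return program_length // sd

def sed_share(sed, sd):
    """Fraction of each sequence spent storing/loading the SEs."""
    return sed / sd

print(se_load_count(10_000, 10))   # 1000 SE loads
print(se_load_count(10_000, 100))  # 100 SE loads: 10x smaller SED/SD penalty
```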
Figure 3.6: Time overhead in rollback: instructions a1 to a7 execute in clock cycles 1 to 7; an error detected at a7 forces the SEs to be reloaded and part of the sequence to be re-executed, so execution completes only at cycle 14.
The rollback principle is the repetition of an erroneous operation starting from a defined (saved) checkpoint in the past. There is a time penalty in the case of error detection. For example, an error is detected at 'a7' in figure 3.6: the processor rolls back, reloads the SEs, and re-executes from the previously saved states. This requires additional clock cycles to re-execute the sequence, and the delay is higher if the SD is large (more instructions in the sequence).
3.3 Limitations<br />
A system relying on rollback cannot communicate data to a real-time environment until the data is known to be error-free. If erroneous data reaches the system peripherals, it cannot be recovered and may cause catastrophic failures. This is a fundamental problem of rollback recovery and has been discussed in [NMGT06]. The common approach is to wait until the data is validated; in the present methodology, output events are therefore delayed by one SD, which may conflict with real-time constraints.
A special output unit would be needed to monitor the output control signals. However, real-time communication is not within the scope of the present work and will be considered in future work.
3.4 Hypothesis<br />
Among other underlying hypotheses, we suppose that the processor core is connected to a dependable memory (DM), in which data is kept safe without any risk of corruption. Under this assumption, all internal errors produced in the DM are detected and corrected by the DM itself. The DM is therefore internally safe storage, but it must be protected from errors coming from outside, which means that only validated data may be written into the DM.
A lot of work has been dedicated in the past to the protection of memory devices [MS06, Hsi10], making this hypothesis realistic.
3.5 Design Challenges<br />
Choosing concurrent error detection for error detection and rollback for recovery is not enough to achieve our design objectives. An effective implementation of this scenario requires appropriate choices, in particular concerning the processor architecture. These design choices should improve dependability, cost, and overall performance. In the following sections, we analyze the major features required for a successful implementation.
3.5.1 Challenge # 1: Self Checking Processor Core Requirements<br />
The choice of a base processor architecture is the first step towards the implementation of an FT processor, because not all processor architectures fit this context. For a successful implementation, we must determine the required key features of the processor.
• minimum hardware: we aim to design an FT processor with a small hardware "fingerprint". This reduces the chance of data contamination, since the greater the area exposed to the environment, the greater the chance of errors being provoked. A smaller area also allows a more efficient architecture on smaller silicon dies, and thus a much higher yield [TM95].
• minimum internal states to be checked and stored: (i) with concurrent error detection, the hardware overhead necessary to check all the internal states simultaneously may be rather large; a reduced number of internal states helps reduce this overhead. (ii) Rollback recovery requires the internal states of the processor to be saved periodically, incurring a time penalty that can be lowered with a reduced number of internal states.
The commonly employed RISC (Reduced Instruction Set Computer) class machines have a large register file and do not fit the proposed methodology, for the following reasons:
(a) more registers mean more expensive CED;
(b) more registers imply more time spent periodically saving register contents;
(c) a large number of registers requires a large number of instruction bits as register specifiers, meaning less dense code;
(d) CPU registers are more expensive than external memory locations.
On the other hand, CISC (Complex Instruction Set Computer) machines require a complex control architecture, which complicates the overall implementation methodology, has high memory requirements, and increases the probability of design errors.
In short, we cannot rely on the classical processors (RISC or CISC). Our choice is a simple processor architecture with minimum internal states. This reduces the overall area/time penalty and makes the processor more robust against external disturbances. On the other hand,
it should have minimal complexity of the architecture and provide better utilization of chip<br />
resources.<br />
3.5.2 Challenge # 2: Temporary Storage Needed: Hardware Journal<br />
If a self-checking processor is directly connected to the DM (see figure 3.7), then the validated data (VD, written in a previously validated SD) and the un-validated data (UVD, produced by the sequence currently executing) must be managed inside the DM, which induces an additional time penalty.
In such a case, a paginated memory (one page per sequence) is generally employed, with un-validated and validated pages used to manage rollback. If an error is detected in the current sequence, the corresponding page is discarded and the previous page is restored. This approach is slower and requires additional pointers to handle the pages; these pointers can be either dedicated registers (faster) or dedicated variables (slower and more risky).
Moreover, there is an additional risk that these pointers become corrupted, losing track of the validated and un-validated pages. The DM would then no longer be a safe data storage, violating the basic hypothesis. Furthermore, if a large amount of data is copied between pages, or between pages and the main pool of data in memory, this takes a lot of time, and the system requires a bigger DM to store the validated and un-validated data separately.
Figure 3.7: Untrusted data flowing into the dependable memory (DM): the processor writes directly to the DM.
An alternate approach that simplifies this scenario is to employ temporary data storage between the processor and the DM. It strongly reduces the time penalty and, to some extent, the risk of error. Furthermore, it simplifies the periodic saving of data, and only validated data is transferred to the DM.
The basic idea is to place hardware devices on the path between the processor and the DM, controlling the way data flows from one side to the other and preventing un-trustable data from ending up in the DM (as suggested in figure 3.8). This is achieved by first writing the non-secure/non-validated data to a temporary location, and transferring it to the DM only after sequence validation. The self-checking processor core (SCPC) can detect errors and re-execute instructions from the last secure states (in case of error detection). In this way, external errors (from the environment or the processor) are masked from entering the DM (as shown in figure 3.8).

Figure 3.8: Data stored in a temporary location (written by the self-checking processor core) before being written to the dependable memory.

The underlying idea behind the journalization mechanism is to prevent un-trustable data from flowing into the DM and to allow easy recovery from faulty situations. Hence, a temporary location (called the self-checking hardware journal in later chapters) is needed to mask errors from entering the DM.
The Need for a Self-Checking Hardware Journal
Data stored inside this temporary location can itself be corrupted by transient faults, such as SEUs (see figure 3.9). Hence, an error detecting and correcting mechanism is needed to ensure the reliable operation of this temporary data storage.
Figure 3.9: Data corruption in the temporary storage: a transient fault can strike the non-validated data held between the self-checking processor core and the DM, so supposedly validated data may reach the DM corrupted.
Suppose that we write data into the journal at time 't', that no error is detected during the sequence up to the VP, and that the data is ready to be transferred to the DM. Is this data dependable? No: the data remained in the journal for a time 'tx', and the possibility of a fault occurring during tx cannot be ignored. Hence, a self-checking mechanism is needed to detect errors in the journal and thus protect the DM from data contamination, as shown in figure 3.10. It makes the journal a safe temporary data storage.
Figure 3.10: Protecting the DM from contamination: a dependable temporary storage between the self-checking processor core and the DM ensures that only trustable, validated data reaches the DM.
Separate Storage of Validated and Un-validated Data<br />
The data initially written in the temporary location is un-validated; if no error occurs during the present sequence, the data is validated at the validation point. At any instant, the temporary storage therefore holds two types of data: un-validated data and validated data. Consequently, it must contain two different parts, one to store validated data and the other to store un-validated data. In addition, this separation makes it easy to transfer the validated data towards the DM.
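One way to picture the journal's role is a small software model. This is a hedged sketch: the class and method names are ours, and for brevity the validated part is collapsed into an immediate transfer to the DM at the validation point.

```python
class Journal:
    """Temporary storage between the processor and the DM: writes land
    in an un-validated area and reach the DM only on validation."""

    def __init__(self, dm):
        self.dm = dm             # dependable memory (modeled as a dict)
        self.unvalidated = {}    # data of the sequence being executed

    def write(self, addr, value):
        self.unvalidated[addr] = value

    def read(self, addr):
        # The most recent copy of a word may still be un-validated.
        return self.unvalidated.get(addr, self.dm.get(addr))

    def validate(self):
        """End of an error-free sequence (VP): commit to the DM."""
        self.dm.update(self.unvalidated)
        self.unvalidated.clear()

    def rollback(self):
        """Error detected during the sequence: discard the un-validated
        data; the DM was never touched, so no cleanup is needed there."""
        self.unvalidated.clear()
```

The key property, mirroring the text, is that `rollback` never has to undo anything in the DM: erroneous data is confined to the un-validated area.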
3.5.3 Challenge # 3: Processor-<strong>Memory</strong> Interfacing<br />
The overall performance of the FT processor can be limited by the absence of efficient interfacing between the processor, the temporary location, and the memory. In most processors, the majority of instructions involve a read or a write from/to memory, so overall performance suffers if there is a long critical path or if more than one clock cycle is needed to read and write data. In our case, the situation is even more delicate, because there is a temporary data storage between the processor and the DM. An intelligent interface is needed to mask errors from entering the DM while providing an efficient interconnect between the modules.
In this scenario, two interfaces are possible: the processor communicates with the DM via the journal, or the processor communicates with the journal and the memory in parallel. The challenge is to evaluate both processor models from the dependability and performance-degradation points of view and to choose the most suitable one.
3.5.4 Challenge # 4: Optimal Sequence Duration for Efficient Implementation<br />
of Rollback Mechanism<br />
The objective of the rollback technique is to restore the system state (in case of error) by overwriting<br />
the current sequence states with the previously validated states of the SEs, as shown in figure 3.4. There<br />
are two performance-limiting factors: (i) the time taken to periodically store the SEs, and (ii) the<br />
un-validation of the SD and the reloading of the SEs on error detection. Reducing the time penalty of<br />
reloading the SEs calls for long sequences, so that the overall number of SE load/store operations<br />
remains smaller than ‘(SD-SED)/SD’. On the other hand, with longer sequences at higher error rates<br />
there is less chance of a sequence reaching validation; the rollback rate, and with it the time<br />
penalty and performance degradation, grows. Therefore it is advisable to use long sequences at low<br />
error rates and short sequences at high error rates.<br />
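This trade-off can be illustrated with a simple first-order model (our own sketch, not taken from the thesis): if each instruction is corrupted independently with probability p, a sequence of SD instructions completes error-free with probability (1 − p)^SD, and the expected number of runs before it commits grows geometrically with SD:<br />

```cpp
#include <cmath>

// Illustrative model of the rollback trade-off (our assumption, not the
// thesis's emulator): expected number of times a sequence of `sd`
// instructions must be executed before it validates, when each instruction
// is hit by an error with independent probability `p`.
double expected_runs(double p, int sd) {
    double q = std::pow(1.0 - p, sd);  // probability of an error-free sequence
    return 1.0 / q;                    // geometric retries: mean runs until success
}
```

With p = 0.01, expected_runs is roughly 1.1 for SD = 10 but roughly 2.7 for SD = 100, consistent with the observation that long sequences only pay off at low error rates.<br />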
3.6 Model Specifications and Global Design Flow<br />
The basic role of the journal is to hold the new data generated during the currently executing<br />
sequence until it can be validated (at the end of that sequence). On sequence validation, this data<br />
can be transferred to the DM; otherwise it is simply dismissed and the current sequence restarts from<br />
the beginning, using the SEs (held in the DM) corresponding to the state prevailing at the end of the<br />
previous sequence.<br />
Figure 3.11: Overall design specifications (self-checking processor core, HW journal, dependable memory)<br />
Our global design strategy for the FT processor is organised in four steps, summarized in figure<br />
3.12. Step I states the proposed model specification, shown as a block diagram in figure 3.11: it<br />
captures the design requirements, in this case the SCPC, the DM and a hardware journal that masks<br />
errors from entering the DM. Moreover, the architecture must respect challenges 1-4 mentioned in<br />
the previous section.<br />
Step II refines the design strategy through several functional implementations and is discussed<br />
in the next section. Step III, the hardware implementation, is presented in chapters 4 and 5. Finally,<br />
step IV is concerned with validating the overall approach by artificial error injection; it is presented<br />
in chapter 6.<br />
3.7 Functional Implementation<br />
This section is concerned with step II of figure 3.12 and aims to refine the proposed model.<br />
There are two possible connections between the processor and the DM: (i) Model-I, where the<br />
processor is connected to the DM via a journal pair and cannot read from the DM in a single clock<br />
cycle; and (ii) Model-II, where the processor is connected to the journal and the DM in parallel and<br />
can directly read from the DM
Figure 3.12: Global design flow (Step-I: model specifications; Step-II: functional implementation, refined into Models 1-3; Step-III: HW implementation; Step-IV: testing and HW validation)<br />
and the Journal simultaneously. The two approaches differ in the type of connection between the<br />
SCPC and the DM, and consequently in overall dependability and performance. In this section we<br />
finalize the processor-memory (DM) interfacing.<br />
Which scenario is better can be judged by developing the corresponding functional models and<br />
comparing the simulation curves (clock cycles per instruction (CPI) vs. error injection rate (EIR))<br />
obtained by artificial error injection.<br />
Hypothesis:<br />
To simplify the functional model, the following hypotheses are assumed:<br />
(a) the processor core is self-checking;<br />
(b) a dependable memory is attached to the processor, where data can safely reside without risk<br />
of provoking errors;<br />
(c) the journal and cache are considered dependable data-storage places;<br />
(d) all instructions are assumed to execute in one clock cycle;<br />
(e) re-execution of instructions can recover from soft errors.<br />
Benchmarks<br />
A set of benchmarks consisting of the main kernels of typical tasks in the target application has<br />
been selected and divided into three groups. The first group exercises memory operations (permu-<br />
tation/sorting), the second is representative of arithmetic-dominated algorithms, and the third of<br />
control-dominated algorithms. All applications have significant memory requirements, since each<br />
time they read their inputs from memory and, after execution, write the results back to it (these<br />
benchmarks are not designed to evaluate I/O events, so they only read from and write back to memory).<br />
(a) Benchmark Group-I: Bubble sort, one of the simplest algorithms for sorting an array, has been<br />
considered. It consists of repeatedly exchanging out-of-order pairs of adjacent array elements<br />
held in memory, looping until all elements are in order. It has been implemented in a serial<br />
fashion: only one pair is examined at a time. Its time complexity is O(n²), since up to n passes<br />
must be made through the array, where n is the number of elements.<br />
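A minimal C++ sketch of this kernel (illustrative; the actual benchmark is written for the stack processor's instruction set):<br />

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial bubble sort, as in benchmark group I: repeatedly swap out-of-order
// adjacent pairs, looping until a full pass makes no exchange.
void bubble_sort(std::vector<int>& a) {
    bool swapped = true;
    while (swapped) {                  // at most n passes over the array
        swapped = false;
        for (std::size_t i = 1; i < a.size(); ++i) {
            if (a[i - 1] > a[i]) {     // out-of-order adjacent pair
                std::swap(a[i - 1], a[i]);
                swapped = true;
            }
        }
    }
}
```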
(b) Benchmark Group-II: This is a memory-computation benchmark that requires writing data back<br />
to close addresses. This version of matrix multiply multiplies two 7 × 7 matrices in O(n^k) with<br />
k > 2. This runtime is obtained by implementing a vector-matrix multiplier, which stores an<br />
initial matrix away and repeatedly returns its product with an input vector.<br />
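The kernel can be sketched as follows (an illustrative C++ rendering; the benchmark itself runs on the stack processor, and the type names are ours):<br />

```cpp
#include <array>

// Vector-matrix multiplier as in benchmark group II: a stored 7x7 matrix is
// repeatedly applied to input vectors, and results are written back to
// neighbouring addresses.
constexpr int N = 7;
using Vec = std::array<int, N>;
using Mat = std::array<Vec, N>;

Vec multiply(const Mat& m, const Vec& x) {
    Vec y{};                           // y = m * x
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            y[i] += m[i][j] * x[j];
    return y;
}
```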
(c) Benchmark Group-III: The control benchmark processes data coming from sensors, previously<br />
stored in the memory. The outputs are stored in memory to be used later by the actuators. We<br />
chose logic and arithmetic equations for the data because some industrial systems need to control<br />
their actuators through this kind of equation. There are two assumptions: (i) measurements from<br />
the sensors are stored in memory; (ii) results will later be sent to the actuators.<br />
The control equations are:<br />
Y0 = A × [(X0 + X1) − (X2 − X3)]/[X4 × (−X5)]<br />
Y1 = NOT [(X6 OR X1) AND (X9 XOR X7)] AND [NOT (X8)]<br />
if ((X8 + X9) < A)<br />
Y2 = B × [(Y0 + X1) − (X9 × X8)]/[X1 − X5] + C<br />
else<br />
Y2 = [(Y0 + X1) − (X9 × X8)]/[X1 − X5] + D<br />
Y3 = NOT [(X6 OR Y1) AND (X9 XOR X7)] AND [NOT (Y1)]<br />
3.7.1 Model-I<br />
In Model-I (shown in figure 3.13) the processor is connected to the DM via a cache memory and a<br />
pair of journals. The journal pair masks errors so that they cannot propagate into the DM. The write<br />
operation (processor to memory) is modified and performed in three steps, as shown in figure 3.13:<br />
(i) the write is performed simultaneously in the cache and in the Un-Validated Journal (UVJ); (ii) if<br />
no error is detected, then at the VP the data from the UVJ is transferred to the Validated Journal (VJ,<br />
which contains only validated data); and (iii) finally, the validated data is written to the DM.<br />
At the VP all the last sure states of the SEs are preserved and validated. As shown in figure<br />
3.13, data is transferred from UVJ to VJ and finally to the DM. If an error is detected during an SD,<br />
the processor retries instruction execution from the preceding VP (as shown in figure 3.4). In this way, the
Figure 3.13: Model-I with data cache and a pair of journals (step 1: write to data cache and UVJ; step 2: UVJ to VJ at validation; step 3: VJ to DM; reads come from the cache)<br />
Figure 3.14: Cache with associative mapping (the address from the processor is compared with all stored addresses simultaneously; on a hit the data is returned, otherwise the address is not found in the cache)<br />
system restores its prior dependable states and the DM remains preserved from errors. On sequence<br />
validation, the un-validated data is validated by transferring it synchronously from the UVJ to the VJ<br />
in a single clock cycle. The processor can read directly from the cache memory.<br />
Associative mapping is employed in the cache: each block holds both the memory address and<br />
the corresponding data. The incoming address is simultaneously compared with all stored addresses<br />
using the internal logic of the associative memory, as shown in figure 3.14. If a match is found, the<br />
corresponding data is read out; otherwise the required data is read from memory. When new data<br />
is written into the cache, the controller first matches against the existing addresses so as to overwrite<br />
data at the same address. If no match occurs, the data is written in a new position together with its<br />
address. Associative memory is fast but also expensive; the cost depends on how big the cache is.<br />
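Functionally, the associative lookup can be sketched as follows (an illustrative C++ model using a hash map; the hardware instead performs all address comparisons in parallel with dedicated comparators, and the class and member names here are ours):<br />

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Functional model of the fully associative cache: each entry pairs a memory
// address with its data.
class AssocCache {
    std::unordered_map<uint32_t, uint32_t> entries;   // address -> data
public:
    std::optional<uint32_t> read(uint32_t addr) const {
        auto it = entries.find(addr);                 // "compare with all addresses"
        if (it == entries.end()) return std::nullopt; // miss: fetch from memory
        return it->second;                            // hit: data read out
    }
    void write(uint32_t addr, uint32_t data) {
        entries[addr] = data;  // overwrite on address match, else a new position
    }
};
```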
Model-I: Simulation Results<br />
A functional model (emulator) of the processor/journal/cache has been developed in C++. The<br />
emulator acts as a virtual machine, ahead of hardware implementation, for testing various fault models<br />
and protection techniques. In addition, it allows us to evaluate architectural choices and to record<br />
both the internal processor states and the program execution duration. It lets us compute the average<br />
clock cycles per instruction under different memory-access patterns.<br />
Figure 3.15: FT evaluation (benchmarks from the MPSoC application feed the stack-processor emulator under periodic, random and burst error injection, and the consequences are observed)<br />
Errors have been artificially injected into the processor emulator (as shown in figure 3.15) and<br />
the performance of the FT processor model has then been evaluated. The goal of this experimental<br />
setup is to evaluate the effect of error injection on system performance. For simplicity, the actual<br />
time overhead of the periodic saving of the SEs is ignored. The fault-injection profiles considered are<br />
shown in figure 3.16.<br />
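The three profiles of figure 3.16 can be modelled as follows (an illustrative sketch with our own naming; the emulator's actual injection code may differ):<br />

```cpp
#include <random>

// Periodic errors: one error every k-th program cycle.
struct Periodic {
    int k;
    bool hit(int cycle) const { return k > 0 && cycle % k == 0; }
};

// Random errors: each cycle is hit independently with probability p.
struct Random {
    std::mt19937 gen{42};              // fixed seed for reproducible runs
    std::bernoulli_distribution err;
    explicit Random(double p) : err(p) {}
    bool hit(int) { return err(gen); }
};

// Burst errors: a run of `len` consecutive faulty cycles starting at `start`.
struct Burst {
    int start, len;
    bool hit(int cycle) const { return cycle >= start && cycle < start + len; }
};
```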
The emulator receives the previously described groups of benchmarks, representing target appli-<br />
cations, together with a set of representative data. The input of the emulator is a classical hexadecimal<br />
file. Our evaluation criterion is the ratio of the average number of Clocks per Instruction (CPI) vs. the Error
Figure 3.16: Periodic, random and burst error models (errors plotted against the number of program cycles)<br />
Injection Rate (EIR). The goal of the simulations is to evaluate the performance degradation of the<br />
proposed model in the presence of high error injection rates.<br />
Figures 3.17, 3.18 and 3.19 present the results for benchmark groups 1, 2 and 3 with SDs of 10,<br />
50 and 100. CPI* is the clock cycles per instruction of the dependable architecture (with error<br />
injection) and CPI is the clock cycles per instruction without error injection. The ratio CPI*/CPI<br />
gives the proportion of additional clock cycles required for re-execution due to rollback.<br />
Figure 3.17 shows the simulation results of the computation benchmark. In this graph there are<br />
two horizontal reference lines: the bottom (green, continuous dotted) line extends the value of<br />
CPI*/CPI in the absence of errors, while the top (red, non-continuous dotted) line is drawn at 2 ×<br />
CPI*/CPI. Each figure contains three curves, drawn for SDs of 10, 50 and 100 respectively. The<br />
curves overlap at low Error Injection Rate (EIR). As the EIR increases, CPI*/CPI grows exponentially,<br />
and more sharply for the higher SDs of 50 and 100: at high EIR the re-execution rate rises, which in<br />
turn raises the overall CPI*/CPI ratio. This model gives a good CPI*/CPI ratio at low EIR, but at<br />
high EIR the ratio increases rapidly because of instruction re-execution on error detection and the<br />
additional clock cycles spent on cache misses. These two problems are addressed in Model-II.<br />
3.7.2 Model-II<br />
Model-II consists of three parts, a self-checking processor, a journal and the DM, as shown in figure<br />
3.20. It has a modified journal architecture with two internal parts, one containing validated<br />
data and the other containing un-validated data, as shown in figure 3.22.<br />
Figure 3.17: Model-I: additional CPI for benchmark group I (computation benchmark, periodic error injection; CPI*/CPI for SD = 10, 50 and 100 vs. EIR)<br />
When the processor needs to read, it checks the data in the Journal and the DM simultaneously<br />
(thanks to parallel access). Associative mapping is employed in the journal: each block holds both<br />
the memory address and the corresponding data (as shown in figure 3.22). If a journal miss occurs,<br />
the required data is delivered from the DM to the processor during the same clock cycle. If the data is<br />
found both in the journal and in the DM, the controller (MUX) prefers the data from the journal, as it<br />
is the most recently written (see figure 3.21).<br />
To allow simultaneous read and write, the journal must have two address ports, since some<br />
instructions need two journal operations (one read and one write) at the same time. Newly written<br />
data is stored in the UVJ. If no error is detected in the sequence, the data is validated (as shown in<br />
figure 3.22) and transferred to the VJ. On the other hand, if an error is detected, all the data written<br />
during the sequence is discarded (as shown in figure 3.23) and the processor rolls back and restarts<br />
execution from the last known states of the SEs. In the next section, we evaluate the performance<br />
of this architecture.<br />
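The behaviour just described can be captured in a small functional model (an illustrative C++ sketch with our own naming, not the RTL developed in chapter 5):<br />

```cpp
#include <cstdint>
#include <unordered_map>

// Functional model of the Model-II journal: un-validated writes go to `uv`;
// sequence validation moves them to `v`; rollback discards them.
struct Journal {
    std::unordered_map<uint32_t, uint32_t> uv, v;  // un-validated / validated

    void write(uint32_t a, uint32_t d) { uv[a] = d; }
    void validate() {                  // at the VP: UVJ -> VJ
        for (const auto& e : uv) v[e.first] = e.second;
        uv.clear();
    }
    void rollback() { uv.clear(); }    // error: drop everything from this sequence

    // Returns true on a journal hit, preferring the most recent write;
    // on a miss the data would come from the DM in the same clock cycle.
    bool read(uint32_t a, uint32_t& d) const {
        if (auto it = uv.find(a); it != uv.end()) { d = it->second; return true; }
        if (auto it = v.find(a); it != v.end()) { d = it->second; return true; }
        return false;
    }
};
```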
Model-II: Simulation Results<br />
The experimental protocol remains the same as for Model-I. It can be observed from the<br />
simulation curves of figures 3.24, 3.25 and 3.26 that Model-II is more efficient than Model-I: even<br />
in the presence of high error rates, the CPI*/CPI required to run the dependable architecture is<br />
significantly smaller than for the previous architecture.<br />
Figure 3.18: Model-I: additional CPI for benchmark group II (permutation benchmark, random error injection; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
From the simulation results of figures 3.24, 3.25 and 3.26, the CPI*/CPI ratio is smaller than for<br />
Model-I. In figure 3.24, with an SD of 10, when the EIR varies from 2e-4 to 2e-2 the additional CPI<br />
increases by only 50% (CPI*/CPI reaches 1.5 on the y-axis): the execution time grows by only 50%<br />
even though the EIR becomes 100 times higher. This demonstrates good performance for the proposed<br />
architecture. For example, if we accept a 50% increase in CPI, with an SD of 10 there can be 20 errors<br />
per 1000 instructions, whereas with an SD of 50 there can be only 6 errors per 1000 instructions.<br />
Furthermore, the SD has a direct impact on the size of the journal memory and consequently on its area.<br />
3.7.3 Comparison<br />
A comparison between Model-I and Model-II is summarized in table 3.1. In both models, the<br />
effect of rollback is more pronounced at higher SDs such as 50 and 100. For example, since a VP<br />
occurs only after every hundred instructions when SD = 100, there are more chances for errors to<br />
strike within a sequence, so the CPI*/CPI ratio rises more rapidly than for SD = 10 or 50: the large<br />
interval between two consecutive VPs leaves more room for error occurrence and thus increases the<br />
rate of instruction re-execution.<br />
From the performance point of view, in Model-II the parallel access to the memory and the journal<br />
on read operations increases the overall efficiency of the system, resulting in lower CPI ratios at<br />
higher EIRs than Model-I; no clock cycles are wasted when data is not found in the journal. It thus<br />
outperforms the previous model, as shown in figures 3.24, 3.25 and 3.26.<br />
Figure 3.19: Model-I: additional CPI for benchmark group III (control benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
Figure 3.20: Block diagram of Model-II (the self-checking processor core reads from and writes to the HW journal, reads the DM directly, and the journal writes validated data to the DM)<br />
From the dependability point of view, Model-II is the better choice because its single journal gives<br />
it a minimal hardware overhead compared to Model-I; more area exposed to the environment also<br />
increases the chances of provoking errors. Both problems, performance degradation at high error<br />
injection rates and effective on-chip area, are addressed better than in Model-I. Therefore we choose<br />
Model-II for further development; the results obtained are encouraging enough to carry on the<br />
research relying on this model. In the next two chapters, we design the processor (chapter 4) and<br />
the journal (chapter 5).<br />
Figure 3.21: Processor can simultaneously read from Journal and DM (the address from the processor is compared with all stored journal addresses simultaneously; on a MISS the main memory is accessed)<br />
Figure 3.22: No error detected during SD and data is validated at VP (the data written during the sequence is marked validated at the VP and can be transferred towards main memory)<br />
3.8 Conclusions<br />
This chapter has summarized an alternative approach to designing an FT processor. We have presented<br />
the architecture specification and design methodology of the proposed scheme. It is a combined<br />
hardware/software approach in which error detection is achieved concurrently by hardware means using<br />
parity codes, and rollback is used for error recovery. The major advantage of this scenario is the ability to
Figure 3.23: Error detected and all the data written during SD is deleted (the un-validated data cannot be transferred towards main memory; the processor rolls back to the preceding VP)<br />
Table 3.1: Comparison of the Processor-<strong>Memory</strong> Models<br />
read from memory (DM) — Model-I: Processor ⇐ Cache ⇐ DM; Model-II: Processor ⇐ DM<br />
read from Cache/Journal — Model-I: Processor ⇐ Cache; Model-II: Processor ⇐ Journal<br />
write to DM — Model-I: Processor ⇒ UVJ ⇒ VJ ⇒ DM; Model-II: Processor ⇒ Journal ⇒ DM<br />
Cache/Journal size requirement — Model-I: comparatively bigger cache required (to avoid cache MISS); Model-II: no MISS in the Journal thanks to parallel access<br />
Performance — Model-I: medium performance at high error rate; Model-II: reasonably good performance even at high error rate<br />
have an effective FT mechanism with limited hardware and time overheads. The overall methodology<br />
can succeed only if certain design challenges are respected: choosing an appropriate processor<br />
with a minimum of internal states to load and store, designing an intermediate self-checking hardware<br />
journal that prevents errors from entering the dependable memory, and selecting a reasonable<br />
sequence duration for a given error rate.<br />
The last part of the chapter was dedicated to defining the processor-memory interface. Ac-<br />
cordingly, we proposed two different models, Model-I and Model-II. On comparison, Model-II has been<br />
chosen for further development into a VHDL-RTL model, being the more reasonable from the dependabil-<br />
ity and performance points of view. In this model, on a write to memory the data passes via the temporary
Figure 3.24: Model-II: additional CPI for benchmark group I (permutation benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
Figure 3.25: Model-II: additional CPI for benchmark group II (computation benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
storage towards the DM, while on a read the processor can read directly from the DM. In this way the DM<br />
remains preserved from error propagation coming from the processor. In the next chapters, we develop<br />
the VHDL-RTL model of the FT processor.<br />
Figure 3.26: Model-II: additional CPI for benchmark group III (control benchmark, burst error injection; CPI*/CPI for SD = 10, 50 and 100 vs. EIR)<br />
Chapter 4<br />
Design and Implementation of a Self<br />
Checking Processor<br />
We aim to design a fault-tolerant processor made of two parts: a self-checking processor core (SCPC)<br />
and a self-checking hardware journal (SCHJ). In this chapter, we focus only on the design of the<br />
SCPC, as highlighted in figure 4.1.<br />
Figure 4.1: Design of a self checking processor core (SCPC) (the SCPC is highlighted alongside the SCHJ and the DM)<br />
To explore the SCPC, the chapter is divided into several sections. In the first section, we<br />
start modelling the processor by choosing the architecture family that best fulfils the design<br />
objectives identified in chapter 3. Next, the (non-FT) processor hardware model is presented and<br />
explored, and its performance and dependability challenges are identified; the later sections address<br />
their solutions. A generic model is described in VHDL-RTL (Register Transfer Level) and synthesized<br />
with Altera Quartus II. Experimental results are presented in terms of throughput (number of bits<br />
processed per second) and area usage. Finally, the fault-tolerance capacity of the SCPC is validated<br />
in chapter 6.<br />
4.1 Processor Design Strategy<br />
The FT strategy we have chosen was discussed in section 3.2. Here we choose a processor<br />
architecture that fits it well, as presented in figure 4.2 (a continuation of the previously presented<br />
figure 3.2). Hardware-based concurrent error detection is expensive when there are many internal<br />
states to check, which runs against the design constraints; fast error detection at low hardware<br />
overhead therefore requires a processor with a minimum of internal states to check (figure 4.2<br />
presents the criteria behind the processor choice). The software-based rollback mechanism has low<br />
hardware overhead but becomes slow when the processor has many internal states to save periodically.<br />
Moreover, rollback on error detection is faster when the number of internal states to restore is<br />
lower.<br />
Consequently, our choice goes to a processor architecture belonging to the minimum instruc-<br />
tion set computer (MISC) class, which possesses many of the characteristics desired for our design<br />
strategy. Within the MISC class, we have chosen a basic stack processor architecture: a simple and<br />
flexible architecture with a very reduced number of internal registers [JDMD07]. An alternative choice<br />
could be an accumulator-based processor, but such processors depend heavily on random access to<br />
memory and are therefore less efficient. The chosen stack processor architecture also relies on memory<br />
accesses, but most of them are very predictable (neighbouring addresses), being related to stack<br />
operation, and can be handled very effectively.<br />
4.1.1 Advantages of Stack Processors<br />
Stack processors offer several additional advantages from both the protection and performance points<br />
of view. Some classical advantages are discussed in [KJ89]; others are detailed below.<br />
A stack-based processor can yield a more reliable architecture than its RISC counterpart because<br />
it has fewer internal states and a smaller on-chip area, which reduces its exposure to environmental<br />
contamination. Most RISC-based approaches contain register banks that make them more sensitive to<br />
SEUs and MBUs, whereas in a stack processor the number of internal registers is far smaller. For<br />
example, the stack processor presented in [Jal09] has six internal registers: TOS (Top of Stack),<br />
NOS (Next of Stack), TORS (Top of Return Stack), IP (Instruction Pointer), DSP (Data Stack Pointer)<br />
and RSP (Return Stack Pointer). This is far fewer than in modern RISC processors (e.g. the LEON3FT<br />
has more than 150 internal registers [Aer11]). In an FT processor, all internal registers must be<br />
protected against transient faults [ARM + 11].<br />
Furthermore, the many internal registers of RISC-based architectures widen the instruction word:<br />
more registers mean larger address decoding, which increases the propagation delay. That is why<br />
RISC (and modern CISC) processors need multi-stage pipelining to restore average throughput by<br />
hiding internal latency, together with better branch scheduling. For example, the Pentium 4 has a<br />
20-stage pipeline, and a miss in the caches or branch-prediction buffers can incur a 30-cycle penalty<br />
for a mispredicted branch (20 cycles in the pipeline, 10 in the memory), whereas the RTX (a stack-<br />
based processor) has a fixed 2-cycle overhead in all cases [PB04]. Furthermore, a processor's natural
Figure 4.2: Criteria behind the choice of the stack processor (fault-tolerant computing with low hardware and time trade-offs → fast concurrent error detection and a low-overhead rollback mechanism → a processor with minimum internal states → MISC → stack processor)<br />
resistance against SEUs decreases as the number of pipeline stages increases [MW04].<br />
Fast periodic backup<br />
Stack processors have various advantages over RISC-based machines, such as higher clock speeds, low<br />
procedure-call overhead and fast interrupt handling [Sha06]. The clock speed is higher because<br />
instructions operate between the two tops of the stack (given internal stack caching or an internal<br />
hardware stack). Procedure-call overhead is low because only a few registers need to be saved<br />
to memory across procedure calls. Interrupt handling is fast since interrupt routines can execute<br />
immediately, the hardware taking care of stack management. An architecture of a stack-based Java<br />
processor has been evaluated in [Sch08], and the results show better performance and a smaller gate<br />
count than a RISC-family processor on an FPGA.<br />
Commercially, stack processors have been used for medical imaging, hard-disc drives and satellite<br />
applications; well-known examples include the Novix NC4000, the Harris RTX2000 and the Silicon Com-<br />
posers SC32 [PB04]. They are deployed in space applications for their reasonable performance and low<br />
power overheads [HH06], e.g. SCIP, a stack-based processor designed for spaceships [Hay05]. Re-<br />
cently, the GreenArrays project has employed stack-based architectures to design multi-computer chips,<br />
with attractive features such as minimum cost and energy combined with high
80 CHAPTER 4. DESIGN AND IMPLEMENTATION OF A SELF CHECKING PROCESSOR<br />
performance [Gre10, Bai10].<br />
The FT-processor design serves our long-term objective: devising a new fault-tolerant<br />
multi-resource system based on message passing, in which the current fault-tolerant processor design<br />
will be used as a processing node. It was clear from the beginning that severe constraints on<br />
area consumption apply to the architectural design of a single node in order to match the future<br />
massively parallel objective, while preserving individual performance as much as possible. The stack<br />
machine remains a viable architecture, owing to its smaller size and lower cost and power requirements.<br />
Stack processors can yield simple and smart cores for parallel distributed applications [Gre10].<br />
On the other hand, the stack machine favours sequential instruction execution. It fits<br />
control-dominant applications well, but is less suitable for data-dominant applications such as<br />
video streaming.<br />
4.2 Proposed Architecture<br />
The architecture of the stack processor has been presented in [Jal09]. It is inspired by the second-<br />
generation canonical stack processor [KJ89]. The stack taxonomy is based on three attributes: the<br />
number of stacks, the size of the stack buffer memories, and the number of operands in the<br />
instruction format. They are represented by three coordinate axes in figure 4.3. These dimensions<br />
allow various combinations. Among these choices, the canonical stack machine has multiple,<br />
large stacks and is a 0-operand (ML0) machine, as shown in figure 4.3. 0-operand means that<br />
all instruction operand locations are implicit, so it is not necessary to give their addresses in the<br />
instruction. In the case of a stack, the implicit location is the top of the stack.<br />
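To make the 0-operand model concrete, the following minimal sketch (in Python, not the VHDL of the actual design) shows how an ML0 machine executes instructions without any operand addresses: sources and destination are the implicit tops of the data stack. The opcode names mirror the thesis's instruction set; the evaluator itself is an illustrative assumption, not the real control logic.<br />

```python
# Minimal 0-operand (ML0) evaluator: operands are the implicit tops of the
# data stack, so opcodes carry no addresses. Illustrative sketch only.
def run(program, data_stack=None):
    ds = data_stack or []          # data stack (DS); ds[-1] is TOS, ds[-2] is NOS
    for op, *arg in program:
        if op == "LIT":            # push an immediate constant
            ds.append(arg[0])
        elif op == "ADD":          # TOS <- TOS + NOS, both operands implicit
            ds.append(ds.pop() + ds.pop())
        elif op == "DUP":          # duplicate TOS
            ds.append(ds[-1])
        elif op == "DROP":         # discard TOS
            ds.pop()
    return ds

# LIT 3, LIT 4, ADD leaves 7 on TOS with no operand addressing anywhere.
print(run([("LIT", 3), ("LIT", 4), ("ADD",)]))  # [7]
```

Note how ADD names no registers at all: this is why an 8-bit opcode suffices for the whole instruction set.<br />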
To satisfy the simplicity requirement there are two stacks 1 : the data stack (DS) and the return stack (RS).<br />
The first is used for expression evaluation and subroutine parameter passing. The second is used for<br />
subroutine return addresses, interruption addresses and temporary data copies. The two stacks allow<br />
accessing multiple values within one clock cycle, which improves speed. With a separate stack<br />
for return addresses, subroutine calls and returns can be performed in parallel<br />
with data operations. This reduces program size and system complexity, which improves system<br />
performance.<br />
Concerning the size of the stack buffers, we have chosen large stack buffers residing in the dependable<br />
memory, which allows data to be stored repeatedly without loss. The DM is on-chip, so the data can be<br />
accessed in a single clock cycle. In addition, there is no restriction on the stack depth.<br />
There are three registers named TOS (Top Of Stack), NOS (Next Of Stack) and TORS (Top Of<br />
Return Stack), which hold the top of the data stack (DS), the next element of the DS and the top of<br />
the return stack (RS) respectively. NOS and TORS do not exist in the canonical model; they proved useful,<br />
allowing a simplified instruction set [Jal09]. The DS and RS stacks reside in main memory (DM), a<br />
feature similar to first-generation stack machines. They do not have address registers but are addressed<br />
by internal pointers, namely the data stack pointer (DSP) and the return stack pointer (RSP). We have chosen<br />
1 According to the Turing definition, the minimal number of stacks for a pure stack machine is two [KJ89].
[Figure 4.3: Multiple and large stacks, 0-operand (ML0) computer chosen from the three-axis design space]<br />
this feature to protect the data contents because, according to our hypothesis, the DM is dependable<br />
storage and these stacks therefore remain fault-secure.<br />
The proposed stack-based architecture contains the data bus, the data stack (DS) and return stack (RS)<br />
with their top-of-stack registers, the arithmetic/logic unit (ALU), the instruction pointer register (IP), the<br />
instruction buffer with instruction register, and the control logic (hardwired control), as shown in figure 4.4.<br />
The input/output module (shown in figure 4.4) requires special management to be fault-tolerant;<br />
this is not treated in this work.<br />
The ALU performs arithmetic and logic operations, including addition, subtraction, logical<br />
functions (AND, OR, XOR), test for zero and others. It operates on the top of the data<br />
stack (operands and result): TOS and NOS are the two top elements of the DS, while TORS is the top<br />
element of the RS. The IP holds the address of the next instruction to be executed. The IP may be loaded<br />
from the bus to implement branches, or incremented to fetch the next sequential instruction<br />
from program memory. Like the DS and RS, the program memory also resides in the DM.<br />
The MAR (Memory Address Register) unit that exists in the canonical stack processor has been elim-<br />
inated from our model, because the program memory together with the IP (Instruction Pointer) is sufficient<br />
to manage all the instructions and provide the address of the next instruction to be executed. The result-<br />
ing processor has a simple set of 37 instructions, each executed in one clock cycle except<br />
one (the STORE instruction), which requires two clock cycles. The complete instruction set<br />
of 37 instructions is given in appendix B, where all the instructions are expressed at<br />
RTL (Register Transfer Level). Thanks to the limited instruction set and the 0-operand model, an 8-bit<br />
opcode is sufficient to represent all instructions.<br />
Table: List of Acronyms<br />
DS: Data Stack<br />
RS: Return Stack<br />
I/O: Input-Output<br />
NOS: Next Of Stack<br />
TORS: Top Of Return Stack<br />
TOS: Top Of Stack<br />
ALU: Arithmetic Logic Unit<br />
IP: Instruction Pointer<br />
[Figure 4.4: Simplified stack machine]<br />
4.3 Hardware Model of the Stack Processor<br />
The processor hardware model has been described in VHDL at the RTL level. The initial processor<br />
model (non-FT version) has been synthesized with Altera Quartus II (version 7.1).<br />
It consists of the arithmetic and logic unit (ALU), internal registers, instruction buffer, control unit<br />
and the data path connecting them, as shown in figure 4.5. The DS and RS are addressed by the two pointer<br />
registers DSP and RSP respectively. The three on-chip registers TOS, NOS and TORS resolve<br />
possible conflicts when transferring data between the two stacks, e.g. during execution of<br />
R2D, D2R, OVER or ROT. As an illustration, consider the R2D (Return Stack to Data Stack)<br />
instruction. Thanks to the availability of TORS, TOS and NOS inside the processor, no conflict in accessing<br />
the data bus occurs: the contents of TORS are written into TOS, TOS into NOS, the DSP is incremented,<br />
NOS is written into the DS, RS[RSP] is read into TORS and the RSP is decremented.<br />
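The R2D data movements can be written out step by step in a small Python sketch (an idealized model, not the actual RTL): every transfer has its own source register, so none of them competes for the data bus. Register names follow the thesis; the state encoding is an assumption for illustration.<br />

```python
# One-cycle R2D (return stack -> data stack) move, as described in the text.
# State: TOS/NOS/TORS registers plus DS/RS stacks addressed by DSP/RSP.
def r2d(state):
    s = dict(state)
    s["DSP"] += 1                        # make room on the data stack
    s["DS"] = s["DS"] + [state["NOS"]]   # NOS spills into DS[DSP]
    s["NOS"] = state["TOS"]              # TOS shifts down to NOS
    s["TOS"] = state["TORS"]             # TORS enters the data-stack side
    s["TORS"] = s["RS"][s["RSP"]]        # refill TORS from RS[RSP]
    s["RSP"] -= 1                        # pop the return stack
    return s

before = {"TOS": 1, "NOS": 2, "TORS": 9, "DS": [], "RS": [7, 8], "DSP": -1, "RSP": 1}
after = r2d(before)
print(after["TOS"], after["NOS"], after["TORS"])   # 9 1 8
```

All reads use the old state, modelling the register transfers happening in parallel within one clock.<br />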
In a stack processor, data execution is normally faster than in classical processors because data<br />
is implicitly available on the two tops of the stack, instead of having to be read from addressed<br />
registers or memory. This effectively reduces the length of the critical path. For a better understanding,<br />
the simplified data path for arithmetic and logic instructions is shown in figure 4.6. The processor<br />
reads memory in parallel to compensate for the 'one element less' on the stack balance; the memory read<br />
just fills the place left empty by the instruction execution (for the next instruction). Therefore,<br />
[Figure 4.5: Modified stack processor]<br />
the processor does not need to wait for address decoding before accessing the operands.<br />
Mostly, each block of program memory (16-bit) contains two successive instructions (8-bit +<br />
8-bit). The instructions residing in program memory pass through the instruction buffer (IB) and are<br />
decoded in the control unit, which activates all the MUXes accordingly. The IB<br />
is fed with a pair of bytes, LSB followed by MSB, as shown in figure 4.18. It consists of cascaded<br />
byte-size (8-bit) buffers connected via multiplexers, which control the flow of instructions in the<br />
IB (as shown in figure 4.18). The interconnection between the multiplexers is controlled by the<br />
instruction buffer management unit (IBMU) (as shown in figure 4.6). The IB and IBMU will be<br />
discussed later in detail.<br />
Although the stack processor has fixed-length opcodes, some instructions need additional infor-<br />
mation (immediate data) to be executed. Therefore the instruction block is sometimes larger than 8 bits.<br />
Such instructions include branches and calls that need an additional 16-bit address (or 8-bit displacement)<br />
to determine the target address (e.g. see appendix C for ZBRA d, SBRA d) and instructions requiring<br />
an immediate data constant (e.g. LIT a, DLIT a). This challenge is further addressed in section 4.4.2.<br />
The control unit manages the components of the processor; it reads and decodes the program<br />
instructions, transforming them into a series of control signals which activate other parts of the processor<br />
via the MUXes. The salient jobs of the control unit are:<br />
• Decode the numerical instruction code into a set of commands or signals for each of the<br />
MUXes;<br />
• Update the DSP and RSP pointers;<br />
• Activate read or write from memory according to the active instruction;<br />
• Select the correct operation in the ALU.<br />
The IP points to the next instruction to be executed. The IP prepares the program memory to feed<br />
the IB according to the next instruction execution requirements.<br />
[Figure 4.6: Simplified data-path of the proposed model (arithmetic and logic instructions)]<br />
The execution of conditional/unconditional branches has been discussed in [KJ89] and further<br />
explored in the design of the modified stack processor [Jal09]. The stack processor is fast in branch<br />
execution due to its minimal pipelining [PB04]. However, every branch instruction is followed by a NOP<br />
(no-operation) because the IB is flushed to load new instructions, which is a performance penalty. This issue<br />
has not been addressed in this work; however, one possible solution is proposed in section 4.6.3.<br />
4.4 Design Challenges in FT Stack Processor<br />
This section is dedicated to the implementation of the FT methodology in the stack processor. The<br />
required architecture should have self-checking ability along with minimum performance degradation.<br />
These two challenges are addressed in this section.<br />
4.4.1 Challenge I: Self Checking Mechanism<br />
An architecture with a minimum number of internal registers does not guarantee that errors<br />
cannot be provoked. External disturbances can still contaminate the execution<br />
of the processor (even if far less frequently than in other classes of processors such as RISC and CISC<br />
implementations). A self-checking mechanism is therefore needed for the internal registers and the ALU.<br />
4.4.2 Challenge II: Performance Improvement<br />
Depending on the architectural choices made to implement the FT stack processor, two per-<br />
formance limitations restrict the overall execution speed: (i) multi-clock in-<br />
struction execution and (ii) multiple-byte instruction blocks. Both issues add delays<br />
to program execution.<br />
Challenge II-a: Multi-clock Instruction Execution<br />
Most instructions require a single clock cycle in the data path to be executed, but a<br />
few require multiple clock cycles, such as DUP, OVER, R2D, CPR2D, D2R,<br />
FETCH, STORE, PUSH_DSP, PUSH_RSP, LIT, DLIT and CALL. Their minimal clock count cannot be<br />
one in a non-pipelined architecture because of conflicts in accessing the data bus in the same direction.<br />
[Figure 4.7: Different instruction types from the execution point of view (without pipelining): most instructions take 1 clock, some take 2 clocks, and only one instruction (STORE) takes 3 clocks]<br />
For a better understanding, let us examine the DUP (duplication) instruction, which requires two clock<br />
cycles in the data path. Here, the contents of TOS must be copied<br />
into NOS and the contents of NOS transferred to the third position in the DS (pointed to by the DSP).<br />
If the instruction were executed in one clock cycle, we would lose the data in the third element of the<br />
DS, because NOS would be written at the address pointed to by the DSP; without a prior increment<br />
of the DSP, one data element is lost.<br />
On the other hand, DUP can be successfully executed in two clock cycles: during the first clock cycle<br />
(t) a new place is created by computing DSP + 1, and in the next cycle (t + 1) the contents of NOS<br />
are written into DS[DSP] and TOS into NOS, as shown in figure 4.8. Such multi-cycle<br />
instructions degrade performance and call for execution pipelining.<br />
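The two-cycle schedule can be sketched as follows (a Python model of the per-cycle steps, under the assumption of one DS memory access per cycle; not the real control logic):<br />

```python
# Why DUP needs two clocks: the slot DS[DSP+1] must exist before NOS can
# spill into it, and only one DS access is possible per cycle.
def dup_two_cycles(tos, nos, ds, dsp):
    # cycle t: create a new place on the data stack
    dsp = dsp + 1
    # cycle t+1: NOS -> DS[DSP], TOS -> NOS (TOS itself is duplicated)
    ds = ds + [None] * (dsp + 1 - len(ds))  # grow the modelled stack memory
    ds[dsp] = nos
    nos = tos
    return tos, nos, ds, dsp

# TOS=5, NOS=3, DS=[3]: after DUP the 3 survives in DS and 5 sits in TOS and NOS.
tos, nos, ds, dsp = dup_two_cycles(5, 3, [3], 0)
print(tos, nos, ds, dsp)   # 5 5 [3, 3] 1
```

Collapsing both steps into one cycle would overwrite DS[DSP] before the pointer moves, which is exactly the data loss described above.<br />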
[Figure 4.8: Execution of the duplication (DUP) instruction in 2 clock cycles]<br />
Challenge II-b: Multiple-byte Instructions Block<br />
The instruction opcodes are one byte long. They have implicit source and destination reg-<br />
isters and do not need explicit addressing; e.g. the instruction ADD (addition) means adding TOS to<br />
NOS and storing the result in TOS. However, 7 of the 37 instructions require an additional parameter,<br />
either an immediate 8-bit or 16-bit constant (LIT and DLIT respectively), a 16-bit<br />
absolute address (LBRA, CALL) or an 8-bit displacement (SBRA, ZBRA) (see appendix C, table 8).<br />
On the other hand, the program memory is 16 bits wide whereas the instruction opcode is 8<br />
bits. The average flow (in bits) of instructions being executed is lower than the instruction pre-fetching<br />
flow capacity, the first being close to 8 bits while the latter is closer to 16 bits. Therefore, instructions<br />
are loaded into the IB at almost twice the rate of execution (considering that an 8-bit opcode is executed in<br />
one clock cycle). This requires intelligent instruction buffer management to: (i) monitor the input and output<br />
flow; (ii) manage the flow of variable-block instructions (such as LIT and LBRA).<br />
To execute a single instruction per cycle, the next instruction to be executed should reach the control<br />
unit in the next clock cycle (t + 1). Figure 4.9 illustrates this issue. First, we<br />
suppose that the instruction being executed is ADD (the instruction to be executed lies<br />
[Figure 4.9: Multiple-byte instructions]<br />
in register R5). It is a 1-byte instruction, so the opcode of the next instruction to be executed must be<br />
in register R4. This opcode must reach the control unit in the next clock cycle (t + 1). Second, if we<br />
suppose LIT a is the active instruction (at time t), then in the next clock (t + 1) the contents of R3 should<br />
reach the control unit.<br />
The solutions to the above-mentioned challenges are addressed in the following sections.<br />
4.5 Solution-I: Self Checking Mechanism<br />
The processor needs error detection inside the ALU and its internal states. We start with<br />
the design of a self-checking ALU.<br />
4.5.1 Error Detecting in ALU<br />
There is no single code that can simultaneously protect both arithmetic and logic opera-<br />
tions. Consequently, we use a combination of arithmetic and logic codes (often called<br />
'combination codes') to protect both: a modulo-3 residue code for arithmetic opera-<br />
tions on one side, and a parity code for logic operations on the other. We have chosen these<br />
codes because they are simple and yet effective enough to prove the effectiveness of our approach.<br />
Moreover, they require minimal resources, as can be seen from the results in [SFRB05].<br />
In [SFRB05], ALUs designed with different error detection techniques were simulated using the<br />
Quartus II simulation tool provided by Altera. The FPGA resource utilization of two built-in-<br />
error-detection (BIED) techniques (Berger check and residue/parity code check) was recorded from<br />
the simulation. Figure 4.11 shows the resource utilization comparison for the two BIED<br />
techniques together with a TMR ALU and an ALU without any error detection.<br />
[Figure 4.10: Data-path of the protected processor's ALU]<br />
It is obvious from figure 4.11 [SFRB05] that the EDALU (error-detecting ALU with modulo-3<br />
residue/parity check) uses 54% fewer logic elements than the TMR ALU, and the Berger<br />
check prediction ALU uses 42% fewer logic elements than the TMR ALU. According to these<br />
results, the ALU with residue/parity check clearly has better resource utilization than the one with Berger<br />
codes.<br />
ALU instructions fall into two groups: arithmetic and logical (see figure 4.12). By grouping the<br />
instructions, the active area of the circuit at any instant is reduced [SFRB05]. For example, a strike<br />
on the module generating the arithmetic parity would not affect the logical module, and vice versa.<br />
Error Detecting in Arithmetic Instructions<br />
A remainder calculated from the data symbols X and Y can protect an arithmetic operation in the<br />
ALU. Here, error detection in arithmetic instructions is based on a modulo-3 residue check.<br />
In the ALU, they are executed as two concurrent computations (as shown in figure 4.13). On one hand,<br />
the two operands, X and Y, undergo the arithmetic operation and the result is stored in S. On the other, the<br />
[Figure 4.11: Resource utilization chart for various ALU designs [SFRB05]: TMR (triple modular redundancy) ALU 3106 logic elements, Berger code check (BC) ALU 1792, residue/parity code check (ED) ALU 1432, unprotected ALU 986]<br />
residues, PA_X and PA_Y, undergo the equivalent arithmetic operation and generate the predicted residue.<br />
The outcome, PA_S, is stored with S (as shown in figure 4.13). In the next clock, the residue generator<br />
(mod-3 generator) produces PA'_S (the residue of S), which is compared with the already stored<br />
residue PA_S. In case of discrepancy, the error alarm signal is raised. The mathematics behind the<br />
residue check codes is shown below:<br />
X = x_n x_{n-1} x_{n-2} ... x_2 x_1 x_0<br />
Y = y_n y_{n-1} y_{n-2} ... y_2 y_1 y_0<br />
where X and Y are the data/information symbols applied to the input of the ALU, and<br />
C = c_m c_{m-1} c_{m-2} ... c_2 c_1 c_0<br />
is the check divisor used to calculate the residue. The remainders determined by dividing the<br />
ALU data symbols X and Y by the check divisor C are<br />
PA_X = X mod C,<br />
PA_Y = Y mod C,<br />
where PA_X and PA_Y represent the remainders of X and Y respectively. The ALU output is<br />
S = X ∘ Y, where ∘ = ADD, SUB or MUL,<br />
and its regenerated residue is<br />
PA'_S = S mod C.<br />
PA_S is the predicted remainder check symbol, which is given by<br />
PA_S = (PA_X ∘ PA_Y) mod C.<br />
The error signal is generated by the comparator, according to the following function:<br />
[Figure 4.12: The ALU protects the logical and arithmetic instructions separately: logical instructions with a 1-bit parity check, arithmetic instructions with a 2-bit mod-3 residue]<br />
Error Signal = 1 if PA_S ≠ PA'_S<br />
Error Signal = 0 if PA_S = PA'_S<br />
PL_S represents the logic parity, which will be generated locally for the next instruction.<br />
For instance, if X = 10, Y = 11 and C = 3:<br />
Residue of X: PA_X = 10 mod 3 = 1<br />
Residue of Y: PA_Y = 11 mod 3 = 2<br />
First concurrent computation: S = X + Y = 10 + 11 = 21<br />
Residue of the first computation: PA'_S = 21 mod 3 = 0<br />
Addition of PA_X and PA_Y: 1 + 2 = 3<br />
Residue: PA_S = 3 mod 3 = 0<br />
Thus, the residues PA_S and PA'_S are equal; therefore, no error.<br />
Error Detecting in Logic Instructions<br />
Error detection in logical instructions is based on the calculation of a parity bit from the information<br />
symbols in X and Y. The parity calculation is simple: the parity bit is calculated by XORing<br />
the information bits. Of the two variants, even parity and odd parity, we use<br />
even parity, which means that the parity bit is set to 1 if the number of ones in a given set of bits<br />
(not including the parity bit) is odd, making the entire set of bits (including the parity bit) even. By<br />
comparing the parity bits of the input and the output, the error signal is set high or low.<br />
[Figure 4.13: Remainder check technique for error detection in arithmetic instructions]<br />
This can be represented by the simple logical equations below:<br />
PL_X = x15 XOR x14 XOR x13 XOR ... XOR x0<br />
PL_Y = y15 XOR y14 XOR y13 XOR ... XOR y0<br />
where PL_X and PL_Y represent the parity of X and Y respectively, with<br />
X = (x15 x14 x13 ... x0),<br />
Y = (y15 y14 y13 ... y0).<br />
The logic operation is S = X ∘ Y, where ∘ = AND or OR.<br />
The predicted parity is PL_S = PL_X ⊕ PL_Y,<br />
and the regenerated parity is PL'_S = s15 XOR s14 XOR ... XOR s0.<br />
The error signal is generated by the comparator, according to the following function:<br />
Error Signal = 1 if PL_S ≠ PL'_S<br />
Error Signal = 0 if PL_S = PL'_S<br />
Similarly, PA_S will be generated locally for the next instruction (if needed). It is a synchronous<br />
system and the error will be detected in the next clock. In the ALU, the error latency is one clock cycle.<br />
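The two checks can be sketched together in software (a Python model with C = 3 and 16-bit words as in the text; the hardware versions are concurrent circuits, not sequential code, and the logic-side sketch uses XOR, for which the parity of the result is exactly PL_X ⊕ PL_Y):<br />

```python
# Concurrent-check model: the predicted check symbol is computed from the
# operands' check symbols, then compared with the symbol regenerated from
# the ALU result S. A mismatch raises the error signal.
def residue_check_add(x, y, c=3):
    pax, pay = x % c, y % c            # stored residues of the operands
    s = x + y                          # main ALU computation
    pas = (pax + pay) % c              # predicted residue PA_S
    pas_regen = s % c                  # regenerated residue PA'_S
    return s, pas != pas_regen         # error signal = 1 on mismatch

def parity(v, width=16):               # even parity over the data bits
    return bin(v & ((1 << width) - 1)).count("1") & 1

def parity_check_xor(x, y):
    pls = parity(x) ^ parity(y)        # predicted parity PL_S (exact for XOR)
    s = x ^ y                          # logic operation
    return s, pls != parity(s)         # error signal = 1 on mismatch

print(residue_check_add(10, 11))       # (21, False): residues agree, no error
```

Corrupting S between the main computation and the regeneration step would make the two symbols disagree, which is what the hardware comparator detects one clock later.<br />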
[Figure 4.14: Parity check technique for error detection in logic instructions]<br />
4.5.2 Error Detecting in Register and Data-Path<br />
For error detection in registers, we again rely on parity codes. Each register concurrently checks for<br />
errors by matching the regenerated parity bit against the already stored one. A mismatch raises the error<br />
signal (as shown in figure 4.15). Note that a single parity check can only detect single-bit<br />
errors, or more generally errors of odd multiplicity.<br />
[Figure 4.15: Parity check technique for error detection in register(s)]<br />
4.5.3 Self-Checking Processor<br />
With the protections of subsections 4.5.1 and 4.5.2, the processor has built-in self-checking facilities to<br />
detect SBUs. The error coverage could be improved with alternative EDCs; however, this would also<br />
increase the circuit complexity.<br />
[Figure 4.16: Error occurring in the protected ALU]<br />
4.5.4 Store Sensitive Elements (SE)<br />
The six internal states TOS, NOS, TORS, DSP, RSP and IP must be saved at the end of each valid<br />
sequence for possible rollback. We have decided to store them in the DM. The procedure uses six<br />
consecutive instructions, whereby the contents of TORS are stored in the RS and the others in the DS. The posi-<br />
tive aspect of this approach is that it adds no hardware overhead inside the processor, while<br />
the downside is the extra performance penalty of storing the SEs. One possible combination of instructions<br />
is:<br />
CALL a<br />
CPR2D<br />
PUSH_RSP<br />
PUSH_DSP<br />
DUP<br />
DUP<br />
An alternative solution for protecting the internal states is to use internal shadow registers holding the end<br />
values of the previous valid sequence. On rollback, the shadow copies are loaded back into the corresponding<br />
registers. The advantage of this scheme is that a single clock cycle suffices to save or restore the registers.<br />
However, it doubles the register count of the SEs, and these shadow copies must also be protected,<br />
which incurs extra hardware overhead. Moreover, it is not a favourable choice for context swapping.<br />
4.5.5 Protecting Opcode<br />
The program memory is already inside the DM; therefore, there is no risk of faults there, but faults may<br />
occur in the opcode during execution. Fortunately, the opcode can be protected<br />
without additional hardware penalty: even a 6-bit code can encode 64 different<br />
instructions, and we have an 8-bit opcode for only 37 instructions, which allows us to employ<br />
low-overhead EDCs without any additional hardware penalty.<br />
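As a hedged illustration of this headroom (the thesis does not fix a specific code here), the two spare bits of the 8-bit field could carry, for example, two parity bits over the two halves of a 6-bit opcode, detecting any single-bit error:<br />

```python
# Sketch: protect a 6-bit opcode with the 2 spare bits of the 8-bit field.
# The check bits are even parity over the low and high 3-bit halves; this is
# one possible low-overhead EDC, not the code mandated by the design.
def encode(op6):
    p_lo = bin(op6 & 0b000111).count("1") & 1
    p_hi = bin(op6 & 0b111000).count("1") & 1
    return (op6 << 2) | (p_hi << 1) | p_lo     # 8-bit protected opcode

def check(word8):
    op6 = word8 >> 2
    return encode(op6) == word8                # False => bit error detected

w = encode(0b100101)
print(check(w), check(w ^ 0b00010000))         # True False
```

Any single flip changes exactly one parity group or one check bit, so the comparison always fails on a single-bit error.<br />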
4.6 Solution-II: Performance Aspects of Self-Checking Processor Core<br />
Owing to the chosen stack architecture, data is implicitly available on the two tops of the stack,<br />
which reduces the length of the critical path. But for high time performance, (i) the instruction execution<br />
rate (clocks per instruction) should be approximately one, and (ii) for multiple-byte instructions, the next<br />
instruction to be executed must reach the control unit in clock cycle (t + 1). In other words, there<br />
should be a continuous flow of instructions inside the IB. The instruction buffer management unit (IBMU)<br />
is dedicated to this task.<br />
The IBMU generates six control signals; five of them are dedicated to the data flow<br />
control in the IB, namely SM1, SM2, SM3, SM4 and SM5, as shown in figure 4.17, while the sixth<br />
(SM6) is reserved for the IP. The next sections address solutions to each challenge.<br />
4.6.1 Solution-II (a): Multiple-byte Instructions<br />
There are seven multiple-byte instructions (blocks of 2 or 3 bytes). The IBMU controls the flow of<br />
instructions in the IB by pre-fetching the next instruction and making it available to the control unit in the<br />
next clock cycle (t + 1). The IBMU controls a series of cascaded buffers with multiple intercon-<br />
nections to cope with complex conditions, as shown in figure 4.18. The decisions inside the IBMU<br />
are taken according to the predefined states of an FSM (Finite State Machine); transitions between those<br />
states depend upon the present state of the IB.<br />
Table 4.1: Instruction types<br />
b7 b6 b5 | details<br />
1 0 0 | 1-byte<br />
1 1 1 | 1-byte (multi-clock)<br />
1 0 1 | 2-bytes<br />
1 1 0 | 3-bytes<br />
0 1 1 | 1-byte + IP-change<br />
0 0 1 | 2-bytes + IP-change<br />
0 1 0 | 3-bytes + IP-change<br />
4.6. SOLUTION-II: PERFORMANCE ASPECTS OF SELF-CHECKING PROCESSOR CORE 95<br />
Figure 4.17: Instruction buffer management unit (IBMU). (The 16-bit program address feeds the program memory, which delivers LSB/MSB bytes to the instruction buffer registers R1–R5 under control signals SM1–SM5; SM6 drives the instruction pointer (IP). 16-bit adders implement the IP-change paths IP ← d + IP, IP ← a and IP ← TORS used by ZBRA d, SBRA d, LBRA a, CALL a and RETURN, while the 8-bit opcode and the 3-bit Inst_type feed the control unit, which generates the control signals.)
4.6.2 Solution-II (b): 2-Stage Pipelining to Resolve Multi-clock Instruction Execution
The majority of the instructions require a single clock cycle to be processed in the data-path, while others require multiple clocks. To implement the pipelining, we need to differentiate between the single and multiple-clock-cycle instructions. The three most significant bits (b7 b6 b5) of the opcode are reserved to determine the type of the instruction, as shown in figure 4.19 (a). Effectively, the instructions that require multiple clock cycles per execution have been given the code '111' in b7b6b5. We can differentiate between the various instructions on the basis of instruction length and IP change, as shown in table 4.1. An IP change occurs in instructions containing a jump.
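As an illustration, the decoding of table 4.1 can be sketched in Python (a behavioral sketch, not part of the VHDL model; the function name and mapping are ours, taken directly from the table):

```python
# Sketch of the instruction-type decoding of table 4.1: the three MSBs
# (b7 b6 b5) of the 8-bit opcode encode instruction length and IP change.
INSTRUCTION_TYPES = {
    (1, 0, 0): "1-byte",
    (1, 1, 1): "1-byte (multi-clock)",
    (1, 0, 1): "2-bytes",
    (1, 1, 0): "3-bytes",
    (0, 1, 1): "1-byte + IP-change",
    (0, 0, 1): "2-bytes + IP-change",
    (0, 1, 0): "3-bytes + IP-change",
}

def instruction_type(opcode):
    """Classify an 8-bit opcode from its three most significant bits."""
    msbs = ((opcode >> 7) & 1, (opcode >> 6) & 1, (opcode >> 5) & 1)
    return INSTRUCTION_TYPES.get(msbs, "undefined")
```

For example, an opcode starting with '111' is classified as a 1-byte multi-clock instruction, while one starting with '010' is a 3-byte instruction containing a jump.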
The multiple-clock-cycle instructions have been analysed (with various instruction combinations) to find possible conflicts between them. It has been found that if they are executed in a two-stage pipeline, all conflicts in addressing the memory can be avoided. During the first stage, the stack pointers are incremented (DSP + 1 / RSP + 1) according to the type of instruction, while in the second stage the rest of the instruction is executed; there are then no conflicts in accessing the memory. The DSP (data stack pointer) and RSP (return stack pointer) point to the tops of the DS and RS in the DM, respectively. This results in the two-stage execution pipeline shown in figure 4.19 (b).
Figure 4.18: Instruction buffer. (Cascaded 8-bit registers R1–R5 with multiplexers M1–M5 driven by the control signals SM1–SM5; the buffer delivers the opcode together with the M-operand and L-operand bytes.)
During pipelining, part of the next instruction is pre-executed (DSP + 1 / RSP + 1) along with the active (present) instruction in a given clock cycle. In this way, the remaining part of that instruction can then be executed in a single clock cycle, during the next clock cycle. The control unit takes the 8-bit opcode of the present instruction to generate the control signals for all the associated MUXs. Simultaneously, the three MSBs of the opcode of the next instruction to be executed are also extracted, as these MSBs identify the type of the instruction.
To evaluate the effectiveness of the pipelining, we have executed a sample benchmark consisting of five instructions (shown in figure 4.20). Without pipelining this program requires 9 clock cycles, while only 5 are needed with pipelining: an improvement of about 45%.
Therefore, with pipelining, all instructions can be executed in a single control cycle except the STORE instruction, which requires two control clock cycles. Indeed, the STORE instruction needs to execute DSP + 1 twice, which cannot be done in a single clock cycle. The complete list of the instructions is given in the tables in appendix C.
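The cycle counts for this benchmark can be checked with a small sketch (illustrative Python; the per-instruction cycle counts are assumptions read off figure 4.20: ADD needs no stack-pointer pre-increment, while the four other instructions each spend one extra cycle on it when not pipelined):

```python
# Cycle-count sketch for the five-instruction benchmark of figure 4.20.
# Per-instruction non-pipelined cycle counts are read off the figure.
NON_PIPELINED_CYCLES = {"ADD": 1, "DUP": 2, "R2D": 2, "PUSH_DSP": 2, "LIT": 2}

def cycles(program, pipelined):
    if pipelined:
        # Stage 1 (DSP + 1) of each instruction overlaps stage 2 of the
        # previous one, so every instruction retires in one clock cycle.
        return len(program)
    return sum(NON_PIPELINED_CYCLES[i] for i in program)

program = ["ADD", "DUP", "R2D", "PUSH_DSP", "LIT"]
print(cycles(program, pipelined=False))  # 9 clock cycles
print(cycles(program, pipelined=True))   # 5 clock cycles
```

The saving of (9 − 5)/9 ≈ 44% matches the roughly 45% improvement quoted above.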
4.6.3 Reducing Overhead for Conditional Branches<br />
It has been previously discussed that instructions are loaded into the IB at almost twice the rate of execution (an 8-bit opcode is executed in one clock cycle). Therefore, the next instruction to be executed is already pre-fetched in the IB. However, in the case of a jump instruction the IB must be flushed and new
Figure 4.19: (a) Opcode description and (b) pipelined execution model. (In the opcode, bits b0–b4 precisely describe the instruction and are needed for present-instruction execution, while bits b5–b7 give the instruction type, i.e. DSP+1, DSP−1, RSP+1, RSP−1, and are needed for pre-execution. In the pipelined execution model, each clock cycle overlaps the present instruction's execution with the pre-execution of the next instruction.)
Figure 4.20: A sample program (ADD, DUP, R2D, PUSH_DSP, LIT a) executed through the non-pipelined and the pipelined stack processor core. (Non-pipelined, the micro-operations of the five instructions occupy clock cycles 1 to 9; pipelined, the first-stage operation (DSP ← DSP + 1, or NOP when there is no next instruction) overlaps the second-stage operations, e.g. TOS ← TOS + NOS, DS[DSP] ← NOS, TOS ← TORS, TOS ← data, so the program completes in 5 cycles.)
instructions must be loaded, which can result in a performance penalty. This can be overcome if we take advantage of the fact that instructions are loaded faster than they are consumed in the stack processor. The approach is based on loading both targets of the jump inside the IB (as shown in figure 4.23). Therefore, there will be no extra NOPs as a consequence of a jump instruction. However, it may increase the complexity of the instruction buffer management unit, and possibly a larger IB will be needed. The VHDL-RTL implementation of such a solution is not considered in this work.
Figure 4.21: Timing diagram for a sample program (ADD, DUP, R2D, PUSH_DSP, LIT a) executed twice: once in the non-pipelined version (clock cycles 1 to 9) and then in the pipelined version (clock cycles 1 to 5).
4.7 Implementation Results<br />
The self-checking processor core has been synthesized with Altera Quartus II. Figure 4.22 shows the implementation design flow of the SCPC, modeled in VHDL-RTL and implemented on an Altera Stratix III EP3SE50F484C2 device. From the results, the following observations can be made:

• Area occupation: the results obtained in terms of area are reported in table 4.2. The area required for the SCPC is minimal; it can be a suitable core processor for future MPSoC development.
Table 4.2: Implementation area

     | Comb. ALUTs (Ded. Logic)
SCPC | 861 (278)
• Performance analysis: although in this chapter we have only modelled a processor core (SCPC), and the model needs to be completed with a self-checking hardware journal (SCHJ), as studied in the next chapters, the processor performance aspects can be analyzed to assess the effectiveness of the stack approach. In stack-based machines, the clock cycle is short because the operands are implicitly available on the two tops of the stack. It is interesting to note that the chosen stack processor requires two-stage pipelining to obtain rather good performance. All instructions (except STORE) can be executed in a single clock cycle. The performance of the architecture was checked and the results are shown in figure 4.24. The results depict the execution of instructions in a single clock cycle.
Figure 4.22: Implementation design flow. (The proposed model (.vhd) is synthesized with Quartus II v7.1 and simulated on an Altera Stratix device, reporting area and frequency.)

Figure 4.23: Strategy to overcome the performance overhead due to conditional branches. (Both targets of a conditional branch, cond. 1 and cond. 2, are loaded into the instruction buffer registers so that either can be issued at t + 1.)
• Self-checking analysis: we validate the error detection ability by injecting a simple error (SBU). The complete validation of the overall model will be presented in chapter 6, where different error scenarios will be injected artificially to check the effectiveness of the overall approach. Here, the implementation results in figure 4.25 show the processor in read/write mode (mode 01). (The working modes of the processor will be discussed in chapter 5.) At some instant, an error is artificially injected into the self-checking processor. On the detection of
Figure 4.24: Implementation of a self-checking processor core (execution trace of the sample program LIT 5, LIT 6, ADD, DUP, LIT A00, STORE, ADD).
an error, the forward instruction execution stops and the processor rolls back.
Figure 4.25: Error detected in the SCPC

4.8 Conclusions
In this chapter, we have designed a self-checking processor core (SCPC) that tolerates SBUs, along with measures to improve performance. Design choices have been made to ensure fast error detection in the resulting processor with minimum hardware overhead. Error detection is based on combinational codes (residue-parity), while error recovery is based on the rollback mechanism.
An interesting point is the choice of a MISC stack computer architecture. It is a simple processor with reduced internal state, which is favourable for both CED and rollback. It occupies a small area on chip, which is favourable from both the dependability and hardware-saving points of view.

To improve the instruction execution rate, the processor uses a two-stage execution pipeline. The instruction buffer management unit controls the flow of multiple-byte instructions in the instruction buffer. We therefore take advantage of the high code density of variable-length instructions while enabling a two-stage execution pipeline in which a part of the next instruction is pre-executed along with the present instruction.
In the next chapter, we will discuss the design and implementation of the self-checking hardware journal, which prevents errors from entering the dependable memory.
Chapter 5<br />
Design of a Self Checking Hardware Journal<br />
This chapter focuses on the design of the self-checking hardware journal (SCHJ), which is the centerpiece of our strategy to devise a fault-tolerant processor against transient faults (as shown in figure 5.1).
Figure 5.1: Design of the SCHJ (placed between the SCPC and the DM).
The basic role of the SCHJ is to hold the new data generated during the currently executed sequence until it can be validated at the end of that sequence (see figure 5.2). If the sequence is validated, this data can be transferred to the DM. Otherwise, in the case of error detection during the current sequence, this data is simply skipped and the current sequence can restart from the beginning using the trustable data held in the DM, corresponding to the state prevailing at the end of the previous sequence. However, an error detection and correction mechanism is needed in the journal to detect possible errors provoked in the journal during this temporary stay.

This chapter explores the construction and operation of the SCHJ and is organized as follows. The first section describes the self-checking methodology. The following section describes the hardware architecture and operation of the journal. Finally, to evaluate the operation of the self-checking hardware journal, a generic model is described in VHDL-RTL (register transfer level) and
is synthesized with Altera Quartus II.

Figure 5.2: Protecting the DM from contamination. (Non-validated data from the self-checking processor core, which may be hit by transient faults, is held in the dependable temporary storage; only trustable validated data reaches the dependable memory.)
5.1 Error Detection and Correction in the Journal<br />
It has been shown in section 3.5.2 that the journal should have a built-in self-checking mechanism, because data stored inside this temporary location can also be corrupted as a consequence of transient faults affecting it (see figures 5.2 and 5.3).

In the journal, part of the data belongs to the present SD (stored in the upper, un-validated part, fig 5.3 (a)) and the rest of the data belongs to the previous SD (VD in the lower part of the journal, fig 5.3 (b)). If an error is detected in the data belonging to the present sequence, we can roll back to the previous validated state. However, if an error occurs in data that does not correspond to the present state, we cannot roll back, because the states of the SEs are no longer saved in the memory, as shown in figure 5.3 (b). This means that only error detection is needed in the UVJ, whereas error correction is needed in addition to error detection inside the VJ.
ECC will be employed for the detection and correction of errors in the SCHJ. Among ECCs, Hamming codes and Hsiao codes are the most commonly employed [Sta06]. Of the two, the Hsiao code is more efficient and requires less hardware overhead than the Hamming code [GBT05]. It has been widely used in designing dependable memories [Che08]. Hsiao codes have been employed for three decades and are still the most efficient codes used in industry [GBT05, Che08].
5.2 Principle of the technique<br />
The Hsiao codes [Hsi10] will be employed in the self-checking HW journal. They provide fast encoding and fast error detection in the decoding process, and are obtained by shortening Hamming codes. The construction of the code is best described in terms of the parity-check matrix Ho. The selection of the columns of the Ho matrix for a given (n, k) code is based on three conditions:
Figure 5.3: (a) Error(s) in the un-validated journal (UVJ); (b) error(s) in the validated journal (VJ). (The journal is split into un-validated data in the UVJ and validated data in the VJ.)
• Every column should have an odd number of 1's.

• The total number of 1's in the Ho matrix should be a minimum.

• The number of 1's in each row of Ho should be made equal, or as close as possible, to the average number (i.e., the total number of 1's in Ho divided by the number of rows).
The first requirement guarantees that the code generated by Ho has a minimum distance of at least 4. Therefore, it can be used for single-error correction and double-error detection. The second and third requirements yield minimum logic levels in forming the parity or syndrome bits, and less hardware in the implementation of the code. For instance, if r parity-check bits are used to protect k data bits, then the following inequality should hold for Hsiao codes:
$$\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le r} \binom{r}{i} \;\ge\; r + k \qquad (5.1)$$

Precisely, the Ho matrix is constructed as follows:

(a) all $\binom{r}{1}$ weight-1 columns are used for the $r$ check bit positions;
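Equation (5.1) can be checked numerically for the code used later in this chapter (a quick sketch; r and k are the parameters of the (41, 34) Hsiao code):

```python
from math import comb

# Check of equation (5.1) for r = 7 check bits and k = 34 data bits:
# the odd-weight columns must be able to cover all n = r + k = 41 positions.
r, k = 7, 34
odd_weight_columns = sum(comb(r, i) for i in range(1, r + 1, 2))
print(odd_weight_columns)           # 7 + 35 + 21 + 1 = 64
print(odd_weight_columns >= r + k)  # True: 64 >= 41
```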
(b) next, if $\binom{r}{3} \ge k$, select $k$ weight-3 columns out of all possible combinations. If $\binom{r}{3} < k$, all $\binom{r}{3}$ weight-3 columns should be selected. The leftover columns are then first selected from all $\binom{r}{5}$ weight-5 columns, then from the $\binom{r}{7}$ weight-7 columns, and so on until all $k$ columns have unique combinations.
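Steps (a) and (b) can be sketched as follows (illustrative Python; which of the 35 weight-3 columns is dropped for k = 34 is a free choice here, not necessarily the selection actually used in figure 5.4):

```python
from itertools import combinations

# Sketch of the column selection for the (41, 34) code: all 7 weight-1
# columns for the check bits, then 34 of the C(7,3) = 35 weight-3 columns
# for the data bits.
r, k = 7, 34
weight1 = [tuple(1 if j == i else 0 for j in range(r)) for i in range(r)]
weight3 = [tuple(1 if j in trio else 0 for j in range(r))
           for trio in combinations(range(r), 3)]
columns = weight1 + weight3[:k]           # 7 + 34 = 41 columns in total
total_ones = sum(sum(c) for c in columns)
print(len(columns))  # 41
print(total_ones)    # 7 + 3 * 34 = 109 ones in the H-matrix
```

Every selected column has odd weight, so the minimum-distance-4 property of the first condition is preserved.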
If the codeword length $n = k + r$ is exactly equal to

$$\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} \binom{r}{i} \qquad (5.2)$$

for some odd $j \le r$, each row of the Ho matrix will have the following number of 1's:

$$\frac{1}{r}\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} i\binom{r}{i} = \frac{1}{r}\left[ r + 3\,\frac{r(r-1)(r-2)}{3!} + \cdots + j\,\frac{r(r-1)\cdots(r-j+1)}{j!} \right] \qquad (5.3)$$

$$= 1 + \binom{r-1}{2} + \cdots + \binom{r-1}{j-1} \qquad (5.4)$$

If $n$ is not exactly equal to this sum for some $j$, then the arbitrary selection of the $\binom{r}{i}$ cases should make the number of 1's in each row close to the average number.
Single-bit error correction and double-bit error detection are accomplished in the following way. A single-bit error results in a syndrome pattern that matches a column of the parity check matrix Ho. Thus, matching the syndrome pattern to a column of Ho identifies the erroneous bit. If the column corresponds to a check bit, no correction is necessary; otherwise, inverting the identified bit corrects the error [Lal05]. Double-error detection is accomplished by examining the overall parity of the syndrome bits. Since the Hsiao code uses only an odd number of 1's in the columns of Ho, a syndrome pattern corresponding to a single-bit error has odd parity. If instead the syndrome is non-zero with even parity, it indicates the presence of a double error in a code word.
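The decoding rule can be sketched as follows (illustrative Python with a toy r = 3 matrix in the usage example; `classify_syndrome` is our name, not an API of any library):

```python
# SEC-DED decoding sketch based on the odd-weight-column property: a
# non-zero syndrome with odd parity matching a column of Ho points at a
# single correctable bit; any other non-zero syndrome (e.g. the even-parity
# syndrome of a double error) is detected but not correctable.
def classify_syndrome(syndrome, h_columns):
    """syndrome: tuple of r syndrome bits; h_columns: the n columns of Ho."""
    if not any(syndrome):
        return ("no error", None)
    if sum(syndrome) % 2 == 1 and syndrome in h_columns:
        return ("single error", h_columns.index(syndrome))  # invert that bit
    return ("uncorrectable error", None)

# Toy example with r = 3: all odd-weight columns of length 3.
h_columns = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
print(classify_syndrome((0, 1, 0), h_columns))  # ('single error', 1)
print(classify_syndrome((1, 1, 0), h_columns))  # ('uncorrectable error', None)
```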
Figure 5.4: Hsiao parity check matrix (41, 34). (Syndrome rows S1–S7 over data bit positions 0–33 and check bits C1–C7; each row contains 15 or 16 ones.)
Hsiao showed that by using minimum-odd-weight columns, the number of 1's in the Ho-matrix could be minimized (and made less than for a Hamming SEC-DED code). This translates to less hardware
area in the corresponding ECC circuitry. Furthermore, by selecting the odd weight columns in a<br />
way that balances the number of 1’s in each row of the Ho-matrix, the delay of the checker can be<br />
minimized (as the delay is constrained by the maximum weight row).<br />
Effectively, the data residing in the self-checking HW journal is coded with parity bits generated according to the Hsiao codes [Hsi10]. These parity bits ensure that the data written in the journal remains unchanged. Each block (row) in the journal has three parts: the first contains a pair of data and corresponding address, the second consists of the w and v bits, and the third consists of the generated parity, as shown in figure 5.6. We have used the Hsiao (41, 34) code to protect the stored data; this class of codes provides SEC-DED. There are 7 parity bits, and the H-matrix is constructed as follows:

1. all $\binom{7}{1} = 7$ weight-1 columns are used;

2. 34 weight-3 columns are selected out of the $\binom{7}{3} = 35$ possible combinations.

The parity check matrix (Ho) for the (41, 34) Hsiao code is shown in figure 5.4. It has the following features:

1. the total number of 1's in the H-matrix is $7 + 3 \times 34 = 109$;

2. the average number of 1's per row is $109/7 \approx 15.6$.
Moreover, these codes are encoded and decoded in a parallel manner. In encoding, the message bits enter the encoding circuit in parallel and the parity-check bits are formed simultaneously. In decoding, the received bits enter the decoding circuit in parallel; the syndrome bits are formed simultaneously and the received bits are corrected in parallel. Double-error detection is accomplished by examining the number of 1's in the syndrome vector.
5.3 Journal Architecture and Operation<br />
The journal storage space is internally split into two parts: UVJ and UJ, as shown in figure 5.5. At<br />
the end of each valid SD, the contents of UVD turns into VD and thus, the virtual line separating the<br />
upper part from the lower part shifts up to denote the new situation. While, VD is being transfered to<br />
the DM during the execution of the current sequence.<br />
Each row in the SCHJ is 41 bits long. The v and w bits will be discussed later. Together with the 16 address bits and the 16 data bits, they represent the information corresponding to a single block stored in the SCHJ. The remaining bits in the row are parity bits, which represent the information redundancy of the error-correcting code protecting the other bits, as shown in figure 5.6. In order to trust the data temporarily stored in the SCHJ, we need a built-in mechanism to detect and correct errors that may occur due to transient faults. Here, we have chosen to rely on error control coding, a classic and effective approach to protect storage devices [ARM+11]. In section 5.1, we selected a Hsiao (41, 34) code, a systematic single-error-correction and double-error-detection (SEC-DED) code, Hsiao codes being more effective than Hamming codes in terms of cost and reliability [Hsi10].
Figure 5.5: SCHJ structure. (Each row holds address bits, data bits, the v and w bits, and parity bits; the rows are split into unvalidated data and validated data.)
The system is based on model 2 presented in chapter 3, where the data cannot be written directly to the DM (depicted in figure 5.7), in order to ensure its contents are always trusted. The data is first written in the SCHJ and only then to the DM. The corresponding address is always searched in the un-validated area, so that no two data elements in this area correspond to the same address. If the address is found, the data element is updated. Otherwise, a new row is initialized in the unvalidated area with w = 1 and v = 0, and the address, data and parity-bit fields are filled with the adequate values. The w and v bits are used to denote written and validated data, respectively.
Before being transferred to the DM, data awaits the validation of the current sequence at the VP. The waiting delay depends on the number of instructions being executed in an SD. If no error is found at the end of the current sequence, the processor validates the sequence: all the UVD in the SCHJ is validated by switching the corresponding v bits to 1. Otherwise, if any error is detected, the sequence is not validated and the UVD in the SCHJ is discarded by switching the corresponding w bits to 0. Only data having v = 1 can be transferred to the DM.
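The write/validate protocol described above can be sketched behaviorally (illustrative Python, not the hardware organization; the class, method names and dict-based storage are ours):

```python
# Behavioral sketch of the SCHJ write/validate protocol: w marks written
# rows, v marks validated rows; only rows with v = 1 may reach the DM.
class Journal:
    def __init__(self):
        self.rows = {}  # address -> {"data": ..., "w": bit, "v": bit}

    def write(self, addr, data):
        # A write during the current sequence lands in the unvalidated
        # area: an existing row for this address is updated, otherwise a
        # new row is opened with w = 1 and v = 0.
        self.rows[addr] = {"data": data, "w": 1, "v": 0}

    def validate(self):
        # No error at the validation point: unvalidated rows become VD.
        for row in self.rows.values():
            if row["w"] == 1:
                row["v"] = 1

    def invalidate(self):
        # Error detected: the current sequence's rows are discarded (w <- 0).
        for row in self.rows.values():
            if row["v"] == 0:
                row["w"] = 0

    def transfer_to_dm(self, dm):
        # Only validated rows are copied to the dependable memory.
        for addr, row in self.rows.items():
            if row["v"] == 1 and row["w"] == 1:
                dm[addr] = row["data"]
```

For example, a write validated at the end of one sequence still reaches the DM even if the following sequence is invalidated, while the discarded write never does.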
It should be noticed that the last instructions in a sequence are used to write the SEs to the SCHJ. On sequence validation, this data gets its v bit set to 1 and is consequently stored in the DM. In the case of sequence un-validation (see figure 5.8), the SE data is restored from memory on rollback, as the UVD in the SCHJ is discarded, and execution is restarted from the previous VP. Further explanation of the rollback operation can be found in [RAM+09, AMD+10, ARM+11].
As stated before, the on-chip DM is supposed to be fast enough to fulfill the performance requirements of our SCPC. Our strategy of using a SCHJ aims not only to improve FT but also to allow the rollback mechanism to be used with very little time penalty compared to a full hardware approach or no protection at all.
Each row in the SCHJ is protected by a Hsiao code, as shown in figure 5.6. This protection is used in the following way:

• an error detected in the UVD results in sequence un-validation (rollback);

• VD is written row by row to the DM; the VD is the copy of the latest validated sequence.
Figure 5.6: Error detection and correction in the journal (a memory block of the SCHJ). (A row holds 16 address bits, 16 data bits, the w and v bits, and 7 parity bits; a parity generator computes P_X over address + data + w + v, and the error detection and correction unit compares the stored P_X with the regenerated P'_X. No error or a corrigible error leaves the data ready to transmit, while a non-corrigible error triggers a rollback/reset.)
Thus, throwing away this data would prevent correct completion of the program/thread execution and would require a system reset. This can only happen if an error is detected that overpasses the correction capacity of the code (e.g. a two-bit error in a single VD row).
5.3.1 Modes of SCHJ<br />
The overall operation of the SCHJ is depicted in the flow chart of figure 5.9. Four modes of operation are summarized in table 5.1. The ECC checker circuit is activated on each read and write access to the memory. The traffic signals in figures 5.10, 5.11, 5.13 and 5.15 represent the data flow with respect to the write operation because, in read operation, the SCHJ and DM are totally transparent to the processor.

Mode 00 – this mode is active at the start of the program, or at restart if a non-corrigible error is detected in the VJ of the SCHJ. In this mode, the processor resets and re-executes from default values, discarding all the data stored in the journal. All the w and v bits are set to 0 (v ← 0 and w ← 0).
Figure 5.7: Overall architecture. (The error-detecting processor core reads from and writes to the journal; only validated data is written, through error detection and correction, to the dependable main memory, which the core can also read directly.)

Table 5.1: Modes of the journal

Modes | Operation
00    | Initialized
01    | Read/write
10    | Valid (v = 1)
11    | Un-valid (rollback)
Mode 01 – this is the normal read or write mode, depending on the active instruction in the SCPC (rd = 1 or wr = 1). In this mode, the SCPC can write directly into the SCHJ but not into the DM, in order to avoid any risk of data contamination in the DM. However, it has read access to both the SCHJ and the DM (not shown in figure 5.11 to avoid complexity). The data read from the SCHJ is checked for possible errors. On error detection, the processor enters mode 11, in which the rollback mechanism is activated without waiting for the VP of the current sequence.
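The mode transitions, as far as the text describes them, can be sketched as a small table (illustrative Python; the event labels are ours, inferred from the surrounding prose, and do not correspond to actual hardware signals):

```python
# Sketch of the journal mode transitions summarized in table 5.1, as
# inferred from the mode descriptions in this section.
MODE_TRANSITIONS = {
    ("00", "start"): "01",           # initialized -> normal read/write
    ("01", "error"): "11",           # error detected -> un-valid (rollback)
    ("01", "vp_no_error"): "10",     # error-free validation point -> valid
    ("10", "done"): "01",            # validated data handled -> read/write
    ("10", "non_corrigible"): "00",  # uncorrectable error in VJ -> reset
    ("11", "rollback_done"): "01",   # rollback finished -> read/write
}

def next_mode(mode, event):
    # Stay in the current mode for events with no listed transition.
    return MODE_TRANSITIONS.get((mode, event), mode)
```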
Under normal conditions, the processor is mostly in mode 01. As shown in figure 5.12, when the processor needs to read from the SCHJ, the address tags are checked to match the required data (depicted by arrow a in figure 5.12). If the required address is found, then before the data is transferred towards the SCPC it is checked for possible errors, by comparing the stored parity bits with parity bits re-generated according to the Hsiao code in the error detection unit (shown in figure 5.12).

• if an error is detected (shown in figure 5.12), the rollback mechanism is invoked, because the data contents in the UVJ were generated during the current sequence (denoted by the v field set
Figure 5.8: Rollback mechanism on error detection. (At validation point VP_n−1, closing an error-free SD, the data is validated and the SEs are stored; if an error is detected during the current sequence duration (SD), the processor rolls back to VP_n−1 and restores the SEs instead of reaching VP_n. Note: VP is the validation point; SE is a state-determining element of the processor.)
to 0). The enable signal on the data bus is then set to '0' to forbid further data transfers from the SCHJ to the SCPC. All the data contents written during this sequence are considered garbage values (w ← 0).
Figure 5.9: SCHJ operation flow chart. (Reads from the journal pass through error detection; on error, the rollback mechanism is invoked. Validated data headed for the dependable main memory also passes through error detection: a corrigible error is corrected before the write, while a non-corrigible one triggers a reset.)

Figure 5.10: SCHJ mode 00. (The journal is initialized, v = 0 and w = 0; no read or write takes place.)
SCPC, switching to mode 00 (and possibly raising some alarm indicator) is the usual behavior<br />
in this situation.<br />
Mode 11 – this mode is invoked when an error is detected during a read/write operation, as shown in figure 5.15; it has been partially discussed with mode 01. In this mode, all the data written in the UVJ of the SCHJ (i.e. all the data generated during the current sequence) is invalid and discarded (w ← 0).
Then the 01-mode (read/write mode) is activated.

Figure 5.11: SCHJ mode 01. (The error-detecting processor reads from and writes to the dependable journal (w = 1), while writes to the dependable main memory are blocked; on a write, the steps followed are (i) address tag matching, (ii) error detection (e = 1), (iii) rollback mechanism.)
5.4 Risk of data contamination<br />
Figure 5.12: Read of UVD from SCHJ in mode 01.
Inside the UVJ, 2-bit errors can be detected by relying on Hsiao codes for error detection, and recovered through the rollback mechanism. The maximum time penalty to correct such an error is equal to the length of the SD.
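To make the coding scheme concrete, below is a minimal software sketch of an odd-weight-column (Hsiao-style) SEC-DED code for 8-bit words. The H-matrix columns chosen here are illustrative assumptions, not the exact (wider) code used in the SCHJ; they only demonstrate the property the text relies on: a single-bit error yields an odd-weight syndrome (correctable), while any 2-bit error yields a non-zero even-weight syndrome (detected, triggering rollback).

```cpp
#include <array>
#include <cstdint>

// Hypothetical (13,8) Hsiao-style SEC-DED code: 8 data bits + 5 check bits.
// Every column of the parity-check matrix H has odd weight.
// Positions 0..7 carry data (distinct weight-3 columns), 8..12 the check
// bits (unit columns).
constexpr std::array<uint8_t, 13> H = {
    0b00111, 0b01011, 0b01101, 0b01110,
    0b10011, 0b10101, 0b10110, 0b11001,
    0b00001, 0b00010, 0b00100, 0b01000, 0b10000};

uint16_t encode(uint8_t data) {
    uint8_t check = 0;
    for (int i = 0; i < 8; ++i)
        if (data >> i & 1) check ^= H[i];     // check bits = H_data * data
    return static_cast<uint16_t>(data | check << 8);
}

struct DecodeResult { bool double_error; uint8_t data; };

DecodeResult decode(uint16_t cw) {
    uint8_t syn = 0;
    for (int i = 0; i < 13; ++i)
        if (cw >> i & 1) syn ^= H[i];         // syndrome = H * received word
    if (syn != 0) {
        if (!__builtin_parity(syn))           // even-weight syndrome:
            return {true, 0};                 // 2-bit error -> rollback
        for (int i = 0; i < 13; ++i)          // odd weight: locate and flip
            if (H[i] == syn) { cw ^= 1u << i; break; }
    }
    return {false, static_cast<uint8_t>(cw & 0xFF)};
}
```

A corrupted word with one flipped bit decodes back to the original data; two flipped bits set the `double_error` flag, which in the SCHJ corresponds to invoking the rollback.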
Figure 5.13: SCHJ mode 10.
Figure 5.14: Mode 10 of SCHJ operation (un-corrigible error detected).
On the other hand, inside the VJ only single-bit errors (SBUs) can be corrected. If a 2-bit MBU is detected there, the program must be re-executed, which may strain real-time performance, but the dependability hypothesis on the DM remains intact. Moreover, the probability of an MBU is much lower than that of an SBU [QGK+06], so such situations are rare.
Figure 5.15: SCHJ mode 11.
This means that, from a dependability point of view, the VJ is a more critical data store than the UVJ. It is therefore important to know how long data stays inside the VJ. In fact, every data item stored in the journal is transferred to the DM within one SD, so the maximum duration of exposure to data contamination is one SD: the longer the SD, the greater the risk of data contamination inside the SCHJ, and vice versa.
5.5 Implementation Results<br />
The SCHJ have been modeled in VHDL at the RTL level and implemented on a Altera Stratix III<br />
EP3SE50F484C2 device using Altera QuartusII. The results obtained in terms of area for depth of<br />
SCHJ equal to 10 are reported in table 5.2. From the results, the following observations can be done:<br />
• the SCHJ occupies about 40 − 50% of the total area depending on the depth of the journal;<br />
Table 5.2: Implementation area
                  Comb. ALUTs (Ded. Logic)
SCHJ              591 (399)
SCPC and SCHJ     1452 (677)
If a non-corrigible error (e.g. a double error in a single row) is detected in the validated part of the journal (VJ), even a rollback cannot recover it, because the data no longer belongs to the present SD; in this case the processor must be reset, as shown in figure 5.16.
5.5.1 Minimizing the Size of the Journal
From the implementation results in table 5.2, the journal accounts for a significant percentage of the total area of the FT-processor. We have therefore investigated the percentage utilization of the overall processor versus the SCHJ depth. The results, reported in figure 5.17, show that the overall hardware overhead depends directly on the depth of the SCHJ.
Figure 5.16: Non-corrigible error detection.
Figure 5.17: Increase of percentage utilization of FT processor (SCPC + SCHJ) on device EP3SE50F484C2 with increase in the depth.
In fact, the depth of the journal is a relative parameter: it depends on the type of benchmark being employed and on the duration of the SD. From a theoretical point of view, the UVJ depth should equal the maximum SD if the benchmark consists only of instructions that write to memory (e.g. a series of duplication instructions, figure 5.18): on each instruction execution the contents of NOS are written to memory, so the required UVJ size equals the length of the SD (see figure 5.19, arrow a). The opposite extreme occurs for benchmarks whose instructions never, or hardly ever, write to memory (like a series of SWAPs in figure 5.18); there, the required journal depth is minimal.
To address real industrial applications, we need the relationship between SD and journal depth. Accordingly, we have calculated the percentage of writes in the benchmarks already discussed (see section 3.7). They are expensive in processor-memory traffic because they constantly read and write data from/to memory. Even so, the results show that the maximum percentage of writes to memory is 39% (see figure 5.19, arrow b), and this still ignores writes to the same memory addresses,
Figure 5.18: Theoretical limits of Journal Depth.
which can further reduce the required journal depth. This shows that the practical journal depth need not exceed 50% of the SD (leaving an eleven-percent safety margin above the measured 39%). Note that the area occupation results presented earlier were calculated for the worst case (journal depth = SD), whereas the required journal depth is only SD/2.
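The SD/2 sizing rule can be checked with simple arithmetic; the 39% figure is the measured maximum write rate quoted above, and the helper below is only an illustrative back-of-the-envelope calculation:

```cpp
#include <cmath>

// Back-of-the-envelope check of the SD/2 sizing rule. Each write to memory
// consumes one UVJ entry, so the journal depth required by a sequence is at
// most SD times the fraction of instructions that write to memory.
int required_depth(int sd, double write_fraction) {
    return static_cast<int>(std::ceil(sd * write_fraction));
}
```

With SD = 20 and the worst measured 39% write rate, `required_depth(20, 0.39)` gives 8 entries; budgeting SD/2 = 10 entries therefore keeps the eleven-percent safety margin mentioned above.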
Now, to finalize the depth of the journal, it is important to find the relationship between SD and performance degradation. Accordingly, we have developed a processor model using dedicated C++ tools, into which errors are injected artificially; only the injection of SBUs has been considered here. The complete experimental setup will be discussed in chapter 6. The factor CPO (clocks per operation) is chosen to quantify the performance degradation due to re-execution. Ideally, the processor executes a single instruction per clock cycle, which means that CPO ≈ 1. The discussion in section 3.2.2 shows that in a BER system the performance degradation relies on two factors: the rate of re-execution (rollback) and the ratio of effective instruction execution, (SD−SED)/SD. The greater the performance degradation, the higher the CPO.
Curves of CPO vs. SD have been drawn for different error injection rates (figure 5.20). They follow a U-shaped pattern: at low SD the cost of loading the internal states dominates, while for bigger SD the cost of re-execution dominates. The curves also show that a large journal depth is only viable at low EIR; for every EIR there is a limited range of SD in which performance remains good.
Here, if we accept a 20% performance degradation, the minimum SD comes out to be 20
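The U-shape can be reproduced with a first-order analytic model. This is a sketch under simplifying assumptions of ours: i.i.d. errors at rate EIR per instruction, a hypothetical fixed save/reload cost for the internal states, and full re-execution of any invalidated sequence; the thesis's measured curves come from simulation, not this formula.

```cpp
#include <cmath>

// Expected clocks per operation as a function of sequence duration (SD).
// A sequence of SD instructions validates with probability (1 - eir)^SD, so
// the expected number of attempts is geometric: 1 / (1 - eir)^SD. Each
// attempt costs SD instruction cycles plus the state save/reload overhead.
double cpo(double sd, double eir, double save_cost = 5.0) {
    double attempts = 1.0 / std::pow(1.0 - eir, sd);
    return attempts * (sd + save_cost) / sd;
}
// Small SD: the save/reload term dominates; large SD: re-execution dominates.
// Together they give the U-shaped curves of figure 5.20.
```

For instance, with no errors the model reduces to the pure state-saving overhead (CPO = 1.5 for SD = 10 with the assumed 5-cycle save cost), while at EIR = 10⁻⁴ an intermediate SD beats both a very short and a very long one.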
Figure 5.19: Relation between journal depth and percentage write in benchmarks.
Figure 5.20: CPI vs. SD.
(see arrow A). In brief, the practical journal depth lies somewhere near 10 when the journal depth is taken as 50% of the SD.
5.5.2 Dynamic Sequence Duration
In the model presented so far, we have used a fixed SD, which carries both area and performance overheads. With a dynamic SD these problems can be mitigated: the SD then has an average value, as shown in figure 5.21. This allows bigger SDs to be employed with a smaller journal depth, improving the area overhead and the performance degradation at low EIR.
Figure 5.21: Dynamic SD.
Moreover, it allows the SD to be reconfigured dynamically according to the EIR. For example, if sequences are repeatedly un-validated, the system automatically reduces the SD to adjust its value to the observed EIR. The downside is the added complexity of the journal management. Dynamic SD is nevertheless an important consideration for future reductions of the hardware overhead.
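Such an adaptation policy can be sketched as follows. The halving/doubling rule, the success-streak threshold and the bounds are our illustrative assumptions, not parameters taken from the thesis:

```cpp
#include <algorithm>

// Illustrative dynamic-SD controller: shrink the sequence duration when
// sequences are repeatedly invalidated (rollbacks indicate a high EIR), and
// cautiously lengthen it again after a streak of validated sequences.
struct DynamicSD {
    int sd = 64;                          // current sequence duration
    int min_sd = 8, max_sd = 256;         // bounds keep the journal depth sane
    int streak = 0;                       // consecutive validated sequences

    void on_sequence_end(bool validated) {
        if (!validated) {                 // rollback: adapt downwards quickly
            sd = std::max(min_sd, sd / 2);
            streak = 0;
        } else if (++streak == 4) {       // sustained success: grow slowly
            sd = std::min(max_sd, sd * 2);
            streak = 0;
        }
    }
};
```

The asymmetric policy (fast shrink, slow growth) mirrors the text: repeated invalidation must be escaped quickly, while lengthening the SD is only worthwhile once the error rate has demonstrably dropped.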
5.6 Conclusions<br />
The presence of the journal facilitates the rollback mechanism on the one hand, and masks errors (SBU and 2-bit MBU) from entering the DM on the other. To reduce the hardware overhead, Hsiao codes have been employed; they provide effective double-error detection and single-error correction. Thanks to simultaneous access to the memory and the journal during READ operations, the overall efficiency of the system is increased.
The SCHJ occupies a significant percentage of the overall fault-tolerant processor area, so reducing the journal size effectively reduces the global area occupation. The size of the journal depends on the type of benchmarks being employed; for practical applications, the journal depth can be half the duration of the SD. Further reduction in depth is possible by employing dynamic SDs rather than fixed SDs. In the next chapter, we will investigate the error coverage and the performance degradation due to re-execution.
Chapter 6<br />
Fault Tolerant Processor Validation<br />
In the previous chapters, we have designed an FT processor based on concurrent error detection capability and a rollback error recovery strategy. The fault-tolerant design is built on a self-checking processor core (whose architecture follows the MISC philosophy) and on a self-checking hardware journal that prevents errors from flowing into the DM and limits the impact of the rollback mechanism on time performance. The architectures of the self-checking processor and the self-checking hardware journal have been discussed in chapters 4 and 5, respectively.
Figure 6.1: The overall FT-processor to be validated.
In this chapter, we will evaluate the FT capability of the overall FT-processor (SCPC + SCHJ), as highlighted in figure 6.1, in order to validate the design strategy. The evaluation will be carried out through simulation: controlled error injection will be used to artificially force the processor model to face abnormal situations. The FT capability of the processor will be judged by calculating the detected-to-injected error ratios under different simulation scenarios (different application benchmarks and different error injection profiles). The time performance will also be evaluated.
The chapter is organized as follows. First, we analyse the design hypotheses assumed in the methodology, and hence the FT-processor properties to be checked. Then, after a short presentation of the error injection methodology, the experimental results are presented and discussed, both from the FT and the time-performance points of view. Finally, we compare the proposed methodology with the LEON3 FT design methodology.
6.1 Design Hypothesis and Properties to be Checked<br />
Inside the SCPC, parity and remainder codes are employed to detect errors in the internal registers and in the arithmetic/logic circuitry of the ALU. By assumption, the DM is a trustable place where data remain uncorrupted; hence, unsafe data must be prevented from flowing into the DM. This is achieved by the SCHJ, which has error detection and some error correction capability. Its role is to simplify the management of validated and un-validated data and to speed up the rollback mechanism used for error recovery.
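As an illustration of the remainder-code idea applied to the ALU, an addition can be checked by comparing residues. The check base 3 below is our assumption; the thesis's exact code parameters are not restated here:

```cpp
#include <cstdint>

// Residue (remainder) check for an adder: arithmetic is congruence-
// preserving, so the residue mod 3 of the true sum must equal the combined
// operand residues mod 3. A single-bit error changes the result by ±2^k,
// and 2^k mod 3 is never 0 (it alternates between 1 and 2), so every
// single-bit error in the sum produces a residue mismatch.
bool add_is_consistent(uint64_t a, uint64_t b, uint64_t observed_sum) {
    return observed_sum % 3 == (a % 3 + b % 3) % 3;
}
```

The attraction of such codes is that the checker operates on the small residues rather than on full-width operands, which is what keeps the SCPC's error-detection hardware cheap.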
Figure 6.2: Error injection in FT-processor.
The FT capability of the processor must be evaluated as the capacity to correctly handle errors appearing in any part of the SCPC or the SCHJ (figure 6.2), with different error profiles to be tested (different error patterns and rates). Speed performance degradation will also be assessed along with the FT capability, as the impact of rollback is expected to rise with the error rate (due to higher re-execution rates). Accordingly, the rate of rollback vs. the error injection rate can also be calculated.
In short, we will investigate the overall dependability and performance of the proposed FT-processor architecture by addressing the following challenges in the upcoming sections:
• self-checking effectiveness of the FT processor;
• performance degradation due to re-execution; and
• effect of error injection on the rate of rollback.
6.2 Error Injection Methodology and Error Profiles<br />
Before addressing the above-mentioned challenges, it is necessary to choose both the error injection methodology and the error profiles that will be applied, i.e., the error patterns and error rates. Fault injection in the hardware of a system can be implemented in two ways:
1. physical fault injection;
2. simulated fault injection.<br />
In this work, we employ simulated injection of soft errors (due to transient faults), in which errors are injected by altering logical values during the simulation. Simulation-based injection is a special case of fault/error injection that can support various levels of abstraction of the system, such as functional, architectural, logic and power [CP02]. For this reason, it has been widely used in fault-injection studies.
Moreover, this technique has various other advantages, the greatest being the observability and controllability of all the modelled components. Another positive aspect is the possibility of carrying out the validation of the system during the design phase, before a final design exists.
Figure 6.3: Error patterns (errors can occur in any bit, not necessarily the bit shown here).
The faults being considered are SBUs (one bit changing in a single register) and MBUs (multiple bits changing at once in one register). These fault models are commonly used with RTL models [Van08]. The exact error patterns considered in these experiments are shown in figure 6.3: in scenario 1, we consider (a) random single-bit errors, (b) random 2-bit errors and (c) random 3-bit errors. In scenario 2, random harsh errors (from 1-bit up to 8-bit errors in a single register) are considered.
6.3 Experimental Validation of Self-Checking Methodology<br />
We will evaluate the error coverage through simulated fault injection, the objective being to assess the effectiveness of the proposed fault-tolerant scheme and hence determine its limits. This requires an environment in which the effects of transient faults on the final architecture can be analyzed.
By designing this environment, we will be able to run fault-injection experiments to evaluate the effects of SBUs and MBUs caused by transient faults in the processor registers and data path, and hence to analyze the robustness against bit flips due to transient faults.
In practice, the VHDL model at the RTL level used to synthesize the circuit is not the one used for the fault-injection simulation. In order to allow very fast simulation (and hence a large number of simulation campaigns to be conducted within a minimum delay), dedicated C++ tools have been developed that replace the 'discrete event driven' simulation model on which VHDL relies with the faster 'cycle driven' simulation model, which fits synchronous designs very well [CHL97]. For the simulation, strictly equivalent C++ cycle-driven models replaced the original VHDL models at the RTL level.
The starting point in designing the environment is to define how to reproduce transient faults: when to trigger them, where to apply them and what to change. We have chosen a non-deterministic fault-trigger approach in which, during a fixed interval, bit flips can randomly be provoked in the SCPC and the SCHJ.
Figure 6.4: Experimental Setup.
The basic steps of a fault-injection campaign are shown in figure 6.4. The C++-based simulator injects a fault pattern into the processor model by randomly picking bit(s) out of the total bits that form the registers. After fault injection, the simulation is halted after 2 cycles and the self-checking circuitry indicates whether the error has been detected or not. If detected, a counter is incremented; afterwards, a new simulation campaign starts with a new fault-injection profile (as shown in figure 6.4). Finally, a report of the total number of injected/detected errors is generated.
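The campaign loop can be sketched as follows. This is a toy stand-in: a single 16-bit register guarded by one even-parity bit plays the role of the full cycle-driven processor model, so only the inject/check/count structure of figure 6.4 is shown, not the real checkers:

```cpp
#include <cstdint>
#include <random>

// Skeleton of one fault-injection campaign: inject a random single-bit flip
// into a modeled register, let the checker observe it, and count injected
// vs. detected errors for the final report.
struct Campaign {
    int injected = 0, detected = 0;
    std::mt19937 rng{12345};              // fixed seed: reproducible campaigns

    void run(int n_faults) {
        std::uniform_int_distribution<int> bit(0, 15);
        for (int i = 0; i < n_faults; ++i) {
            const uint16_t golden = 0x0F0F;               // reference value
            uint16_t faulty = golden ^ (1u << bit(rng));  // inject one SBU
            ++injected;
            // Parity-based self-checking: any odd number of flipped bits
            // changes the word parity, so every SBU is detected.
            if (__builtin_parity(faulty) != __builtin_parity(golden))
                ++detected;
        }
    }
};
```

Because a single parity bit catches every odd-weight error, such a campaign reports a 100% detection ratio for SBUs, which is consistent with the single-bit results presented below.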
Two injection campaigns were conducted: one injecting single (SBU), double and triple (MBU) random error patterns, and another injecting random harsh errors (random weight from 1 up to 8). The results are presented in the graphs of figures 6.5, 6.6, 6.7 and 6.8, respectively.
Figure 6.5: Single bit error injection.
Figure 6.6: Double bit error injection.
For scenario 1, figure 6.5 shows that the processor detects 100% of the injected single-bit errors. The detection rates for double- and triple-bit errors are higher than 60% and 78%, respectively (figures 6.6 and 6.7). In scenario 2, where harsh patterns are used (1 up to 8 bits flipped randomly), the detection rate still remains significant, greater than 36% for all configurations, as shown in figure 6.8.
It is interesting to notice that, although very simple, low-cost detecting codes are used in the SCPC, the error coverage is still 100% for SBUs. Taking into account the small number of registers to protect in the processor core and the fact that the SCPC area is only a fraction of the total FT-processor area, using better codes in the SCPC could probably improve the FT level without a big impact on area.
This tends to show that the proposed FT-processor design approach is a useful one. It remains necessary to evaluate the impact of increasing error rates on speed performance.
Figure 6.7: Triple bit error injection.
Figure 6.8: Harsh (1 up to 8 bit randomly) error injection.
6.4 Performance Degradation due to Re-execution<br />
To measure the impact of transient faults on system performance, we have evaluated the performance degradation on different sets of benchmarks through simulations. The average number of clock ticks per operation (CPO) has been measured for different EIRs as an indicator of speed performance (the higher the value, the lower the performance), and hence of the performance degradation under different error injection conditions.
In the pipelined journalized stack processor, all instructions execute in one clock cycle (except STORE). Therefore, the average number of clock cycles per operation (CPO), or clock cycles per instruction, is ideally unity. However, when an error is detected, a rollback is executed, which increases the overall time penalty. The greater the rate of rollback, the higher the average CPO,
Figure 6.9: Performance Degradation due to re-execution.
because more clock cycles will be needed to accomplish the required task (see figure 6.9). In other words, the greater the average CPO, the lower the overall performance.
The benchmarks have already been discussed in section 3.7. Table 6.1 summarizes the percentage profiles of reads from and writes to the DM induced by the instructions of each benchmark group running on the SCPC. Note that the instruction set of the SCPC has 36 instructions, 23 of which involve reading from or writing to the memory.
Table 6.1: Read/write profiles in benchmark groups
Group   Read   Write
I       45%    39%
II      57%    38%
III     50%    38%
6.4.1 Evaluating Performance Degradation<br />
The goal is to measure the effect of the SD length on the re-execution penalty. We have drawn graphs of the average clock cycles per operation (CPO) vs. EIR for the different benchmarks. Figures 6.10, 6.11 and 6.12 present the results for the benchmarks of groups I, II and III, respectively, for different SDs (10, 20, 50 and 100). The penalty of loading the SEs has not been considered here. The errors have been injected into the processor and the journal at different EIRs. The analysis of the graphs shows that the curves tend to overlap for the lower values of EIR. This is logical since, in the absence of errors, no extra time penalty due to rollback is incurred, whatever benchmark is used.
Figure 6.10: Simulation curves for group-I.
Figure 6.11: Simulation curves for group-II.
In figure 6.10, moving from point A to B corresponds to a tenfold increase of the error rate. The corresponding increase in CPO remains low (almost unchanged for SD=10 and 20, about 1.1 for SD=50 and 1.6 for SD=100), meaning little or no degradation of speed performance. Similarly, the move from A to C corresponds to a hundredfold error-rate increase: the CPO remains very low for SD=10, and lower than 2 for SD=20 and 50. Similar observations can be made from the graphs of figures 6.11 and 6.12: even with a hundredfold increase in error rate, the time penalty for the lower SDs remains reasonable, which indicates good performance.
At higher EIR, the smaller SDs are the ones that incur the lower time penalty. This is coherent with the predicted results. Indeed, for a given error rate, the risk that a sequence be
Figure 6.12: Simulation curves for group-III.
invalidated is higher for a longer SD, leading to a higher rollback rate.
Taking into account that the SCPC architecture requires little time to save the SEs, it is possible to select a short SD and still keep a good level of performance. Furthermore, this allows a smaller SCHJ depth to be chosen, with reduced area consumption, and it further reduces the risk that errors accumulate in the SCHJ and induce a non-recoverable error.
Figure 6.13: Effect of EIR on rollback for benchmarks group-I.
Figure 6.14: Effect of EIR on rollback for benchmarks group-II.
Figure 6.15: Effect of EIR on rollback for benchmarks group-III.
6.5 Effect of Error Injection on Rate of Rollback<br />
An increase in the rate of rollback is a performance-limiting factor because of the time penalty of re-executing sequences. In this section we therefore analyze how the rate of rollback grows with increasing EIR. For a higher EIR, the rate of re-execution will also increase, which will
further decrease the overall performance. If the error probability is known, it is possible to find the optimal number of checkpoints and possible rollbacks [VSL09]; in a real system, however, the error probability is not known in advance and is difficult to estimate.
In the rollback mechanism, there are two performance-limiting factors: (i) the time taken to store/reload the SEs and (ii) the length of the sequence (SD). To reduce the time penalty of reloading the SEs, long sequences are needed, so that the overall number of SE loads and stores is smaller. This behavior needs to be confirmed by artificial error injection.
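Under a simplifying i.i.d.-error assumption of ours (not the thesis's measured data), the expected rollback count can also be written down directly, which is useful as a sanity check against the injection results:

```cpp
#include <cmath>

// Expected number of rollbacks for a program of n_instr instructions split
// into sequences of length sd, with per-instruction error probability eir.
// A sequence is hit with probability q = 1 - (1 - eir)^sd and is re-executed
// until it validates, i.e. q / (1 - q) rollbacks per sequence on average.
double expected_rollbacks(double n_instr, double sd, double eir) {
    double q = 1.0 - std::pow(1.0 - eir, sd);
    return (n_instr / sd) * q / (1.0 - q);
}
```

The model reproduces the qualitative trend of the figures: the rollback count grows with EIR, and for large SD each rollback additionally wastes a whole long sequence.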
Consequently, we have artificially injected errors into the FT processor to observe their effect on<br />
the rollback mechanism, as shown in figures 6.13, 6.14 and 6.15. (Note: for higher SDs such as 50 and<br />
100, the number of rollbacks at high EIRs is missing because the values fall outside the y-axis range<br />
of the graphs.) The simulation curves show that for low error rates the rollback rate is also low, and<br />
vice versa. Moreover, at higher error rates the effect of rollback is dominant for longer sequences<br />
(larger SD): there is a greater number of rollbacks, which again results in a time penalty and limits<br />
the overall performance.<br />
Therefore, it is advisable to use larger SDs with low error rates and smaller SDs with higher error<br />
rates. An optimal SD can be proposed once the final application is known; this is why the length of<br />
the SD is a user-defined parameter that can be adjusted according to the external environment.<br />
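The SD trade-off described above can be illustrated with a toy model. This is a sketch only: the checkpoint cost, cycle counts and error probabilities below are assumptions chosen for illustration, not measurements from the thesis.

```python
# Toy model (assumed figures, not measured data): a program of `total_cycles`
# cycles is cut into sequences of SD cycles.  Each sequence pays a fixed
# checkpoint cost to save/reload the SEs, and must be fully re-executed
# whenever an error hits it (per-cycle error probability p, i.e. the EIR).

def expected_cycles(total_cycles, sd, p, checkpoint_cost=10):
    p_seq_hit = 1.0 - (1.0 - p) ** sd     # probability an error hits a sequence
    tries = 1.0 / (1.0 - p_seq_hit)       # expected executions (geometric law)
    n_seq = total_cycles / sd
    return n_seq * (sd * tries + checkpoint_cost)

for p in (1e-4, 5e-3):                    # low vs high error rate
    best_sd = min((10, 20, 50, 100),
                  key=lambda sd: expected_cycles(10_000, sd, p))
    print(p, best_sd)
```

Under these assumed numbers the model reproduces the qualitative conclusion above: the larger SD wins at the low error rate, while a smaller SD becomes preferable at the high one.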
6.6 Comparison with LEON FT-3<br />
The LEON3 FT has been discussed previously in section 2.4. In this section, we present a qualitative<br />
comparison between the protection scheme of the LEON3 FT and that of the journalized stack<br />
processor.<br />
The LEON3 FT focuses on protecting the data storage rather than the functionality of the architecture.<br />
The overall scheme relies on ECC and on duplication of internal state. Most of the registers have 2-bit<br />
error detection, whereas a few have 4-bit error detection. There is no protection of the data path, the<br />
ALU functionality or the control unit.<br />
In the FT journalized stack processor, on the other hand, the focus is on protecting the architecture<br />
as a whole. The processor provides single-bit error detection, and the journal part can detect 2-bit<br />
errors. In a future version, other codes with higher coverage could be considered for the processor.<br />
The data path is protected, and the ALU and the control path can be protected without additional<br />
hardware overhead. In brief, the FT journalized processor is still in its development phase; it already<br />
shows interesting features and needs further optimization from the protection point of view.<br />
6.7 Conclusions<br />
In this chapter, we have validated the design of the journalized fault-tolerant stack processor.<br />
During validation, different parameters have been evaluated, such as the self-checking ability, the<br />
impact on time performance and the increase in the rollback rate due to error injection. Finally, the<br />
proposed model has been compared with the LEON3 FT.<br />
With single-error injection, 100% of the errors were detected in several experimental configurations.<br />
With double- and triple-bit error injection, the average detection percentages were about 60%<br />
and 78%, respectively. According to the results obtained with much worse error patterns (up to 8-bit<br />
patterns), correction is still possible, with a rather significant correction rate of about 36%.<br />
The performance degradation results were also satisfactory. The proposed architecture offers<br />
rather good performance even in the presence of high error rates; with large error rates, the time<br />
penalty can remain reasonable by using lower SDs. In practice, it is advisable to use a bigger SD with<br />
low error rates and a smaller SD with higher error rates. Knowing the final application and the average<br />
error profile related to the execution environment, it is possible to choose the most appropriate SD<br />
duration (which is left as a generic parameter in the synthesized models).<br />
GENERAL CONCLUSION AND PROSPECTS<br />
General Conclusions<br />
With the predicted evolutions in technology, soft errors in electronic circuits are becoming a major<br />
issue in the design of complex digital systems, especially in safety-critical applications.<br />
Indeed, current advances in nanotechnology, largely based on shrinking component dimensions,<br />
reducing the supply voltage and increasing the clock speed, are lowering the resulting noise margins. As a<br />
consequence, the sensitivity of digital circuits to high-energy particles and electromagnetic disturbances<br />
is rising very fast, making the probability of Single-Event Upset (SEU) and Multiple-Bit Upset<br />
(MBU) occurrence very high, not only in space but also in ground applications. Hence, taking the<br />
growing risk of these transient faults into account from the beginning of any digital electronic design<br />
is fast becoming a critical need.<br />
Ensuring proper operation, even in the presence of transient faults, requires that the system<br />
have some fault-tolerance capability. Next to the fault tolerance issue, the demand for larger, faster,<br />
more complex and flexible systems that are nevertheless easy to design is endless. Together with the<br />
enhanced means of on-chip communication (Network on Chip – NoC), the increased integration<br />
possibilities of modern electronic circuits now allow all the functionality of a full system to be grouped<br />
in a single chip (System on Chip – SoC). Among all the recent developments, the MPSoC (Multiprocessor<br />
System on Chip) design paradigm is becoming very popular for its capacity to provide both computational<br />
power and flexibility. It brings together a large number of processors (the processing nodes) interconnected<br />
by a NoC (the inter-node communication means). As an MPSoC is not naturally immune<br />
to transient faults, an obvious goal is to develop on-chip capacity for fault tolerance and hence a fault-<br />
tolerant processor to be used as a "processing node".<br />
The work presented in this thesis is dedicated to the design of such an FT processor<br />
using a new architectural approach, the design goals including a high level of protection<br />
against transient faults along with reasonable performance and area overheads. It was clear from the<br />
beginning that severe constraints on area consumption would apply to the architectural<br />
design of the processing node in order to match the massively parallel objective, while preserving<br />
the node performance as much as possible.<br />
The concepts chosen as the basis of the design methodology are on-line concurrent error detection<br />
and error recovery through rollback execution. Central to the new architecture are a<br />
self-checking processor core and a hardware journalization mechanism. The processor core, devised<br />
in the MISC class instead of the classic RISC or CISC classes, is a self-checking processor inspired by the<br />
canonical stack processor [KJ89], able to offer a rather good level of performance with only a limited<br />
amount of hardware. The architectural simplicity (a small amount of logic resources<br />
and internal storage) and the great compactness of the code are important characteristics, favorable<br />
to the self-checking capability and to the rollback recovery implementation. Next to the processor core, a self-<br />
checking hardware journal dedicated to the journalization mechanism prevents error propagation from<br />
the processor core to the main memory and limits the impact of rollback on time performance. Among<br />
the underlying hypotheses, the main memory is assumed to be dependable, i.e. data is<br />
kept reliably in it without any risk of corruption.<br />
On the occurrence of a transient fault, data can be corrupted in the processor. Such errors can be<br />
detected in the processor core but not corrected. Hence, erroneous data could flow out of the processor<br />
core and end up in the dependable memory were the hardware journal not used. The DM<br />
would then no longer be a trustable place, and implementing a software recovery mechanism<br />
would be rather painful, with a lot of data redundancy being necessary in the memory device. Classical<br />
rollback techniques operate with checkpointing: at regular time intervals, the processor state<br />
and the produced data are saved, allowing rollback to the saved point in case of error detection. The best-<br />
suited sequence duration (the distance between two checkpoints) depends on application constraints and<br />
on error occurrence rates. While a larger sequence duration may limit the impact of the rollback<br />
mechanism on time performance in the absence of errors, it requires a larger hardware journal and,<br />
moreover, increases the risk of rollback activation in case of error occurrence.<br />
Data produced in the current sequence can be discarded in case of error detection, as it can be<br />
generated again from the last saved checkpoint. On sequence validation, with no error occurrence in<br />
the ending sequence, the related data is validated and must be transferred to the dependable memory.<br />
Error control coding techniques are used to detect errors in the processor core and in the unvalidated<br />
data in the journal, and to correct errors in the validated data part of the journal.<br />
The fault-tolerant processor architecture has been modeled in VHDL at the RT level and then<br />
synthesized using Altera Quartus II to determine the area requirements and the maximum operating frequency.<br />
Simulated error injection campaigns have been used to determine the effectiveness of the proposed<br />
fault tolerance strategy under different faulty scenarios (varying the error rates and error pattern profiles)<br />
and different sequence durations.<br />
The self-checking ability of the fault-tolerant processor was tested for Single-Event Upsets (SEUs,<br />
1-bit error pattern) and Multiple-Bit Upsets (2- up to 8-bit error patterns in a single 16-bit data word).<br />
Considering SEUs, 100% of the errors are detected and error recovery is close to 100% even for high<br />
error injection rates. With 2-bit and 3-bit patterns, the average detection percentages were about 60%<br />
and 78%, respectively. When harder conditions are considered, with error patterns of up to 8 bits,<br />
correction is still possible with correction rates of about 36%.<br />
Similarly, the performance degradation due to error injection was evaluated. Error recovery being<br />
based on rollback execution upon error detection, the instructions in the faulty sequence are re-executed<br />
from the previously preserved states, hence adding a time penalty, i.e. performance degradation. Higher<br />
error injection rates induce higher rollback rates, resulting in lower performance. The analysis of the<br />
measured performance degradation curves shows that the proposed architecture offers reasonably good<br />
performance even in the presence of high error rates. It also shows that the optimal sequence duration<br />
depends on the average error injection rate and should be adjusted according to the application's external<br />
environment.<br />
Practically, the experimental results demonstrate that the principle of journalization can be rather<br />
effective on a stack-computing-based processor core architecture, and deserves further research effort<br />
to enhance its performance and protection capability.<br />
The future work is divided into two aspects: protection and performance. From the protection point<br />
of view, there is a need to improve the error coverage in the processor part. Presently, simple parity<br />
can only detect odd-bit errors; the challenge is to find codes with low hardware overhead. Moreover,<br />
the opcode (in the control circuitry) can be protected with ECC: in the MISC-based stack methodology<br />
there are 37 instructions, while the present opcode is 8 bits wide, which leaves capacity to add redundancy<br />
bits without additional overhead.<br />
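The parity limitation mentioned above can be illustrated with a short sketch. The 16-bit word width follows the thesis; the specific word and error masks are arbitrary examples.

```python
# A minimal sketch of why simple (even) parity detects only odd-bit errors:
# the parity of a 16-bit word changes if and only if an odd number of bits flip.

def parity(word: int) -> int:
    return bin(word & 0xFFFF).count("1") & 1

def detected(word: int, flip_mask: int) -> bool:
    corrupted = word ^ flip_mask          # inject the error pattern
    return parity(corrupted) != parity(word)

word = 0xA5C3
assert detected(word, 0b1)                # 1-bit error: detected
assert detected(word, 0b10101)            # 3-bit error: detected
assert not detected(word, 0b11)           # 2-bit error: missed
assert not detected(word, 0b1111)         # 4-bit error: missed
```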
From the performance point of view, architectural optimization is required, mainly of the hardware<br />
journal part. The present processor has a critical path in the error-correcting circuitry and the write to<br />
the DM; if this task were split into a 2-stage pipeline, the overall performance could improve a lot. Another<br />
possible improvement is to overcome the performance overhead due to conditional branches, by loading<br />
both targets of the jump into the IB.<br />
In the long term, the continuation of this work should be dedicated to the integration of this fault-<br />
tolerant processor architecture as a building block of a fault-tolerant MPSoC.<br />
Appendix A<br />
Canonical Stack Computers:<br />
The canonical stack processor [KJ89] has been chosen as the basis of the fault-tolerant processor<br />
core. Its characteristics mostly resemble those of second-generation stack machines, which are more<br />
cost-effective than the first generation. In this section we briefly discuss the construction of the canonical<br />
stack machine, as it is helpful for understanding the similarities with, and differences from, the<br />
proposed stack machine.<br />
Figure A.1 shows the block diagram of the Canonical Stack Machine. Each block represents a<br />
logical resource: the data bus, the Data Stack (DS), the Return Stack (RS), the Arithmetic/Logic<br />
Unit (ALU), the Top Of Stack register (TOS), the Program Counter (PC), the Memory Address Register (MAR),<br />
the Instruction Register (IR), and an Input/Output unit (I/O). For simplicity, the canonical machine<br />
is represented in figure A.1 with a single data bus, but real processors may have more than one<br />
data path, to allow instruction fetching and calculations to proceed in parallel.<br />
The DS is a buffer that works according to the LIFO (Last In First Out) principle. Only two<br />
operations, PUSH and POP, can take place on the DS. In a PUSH, the new data element is written at the<br />
topmost position of the DS and the old values are shifted one position downwards. In a POP, the<br />
top value residing in the stack is placed on the data bus and the next cell of the stack<br />
is shifted one place upwards, and so on. The RS is likewise a LIFO implementation; the only<br />
difference is that the return stack is used to store subroutine return addresses instead of instruction<br />
operands.<br />
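The PUSH/POP behaviour described above can be sketched in software. This is a behavioural model only, not the RTL; it assumes the two topmost elements are held in TOS/NOS-style registers backed by a pointer-indexed buffer, as in the register-transfer pseudocode of Appendix B.

```python
# Behavioural sketch of a LIFO data stack with TOS/NOS registers and a
# RAM-based buffer indexed by the data stack pointer (DSP).

class DataStack:
    def __init__(self, depth=16):
        self.ds = [0] * depth   # on-chip stack buffer
        self.dsp = -1           # data stack pointer (empty stack)
        self.tos = 0            # Top Of Stack register
        self.nos = 0            # Next Of Stack register

    def push(self, value):
        self.dsp += 1
        self.ds[self.dsp] = self.nos   # spill old NOS into the buffer
        self.nos = self.tos
        self.tos = value

    def pop(self):
        value = self.tos
        self.tos = self.nos
        self.nos = self.ds[self.dsp]   # refill NOS from the buffer
        self.dsp -= 1
        return value

s = DataStack()
for v in (1, 2, 3):
    s.push(v)
assert [s.pop(), s.pop(), s.pop()] == [3, 2, 1]   # LIFO order
```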
The program memory block has both a Memory Address Register (MAR) and a reasonable<br />
amount of random-access memory. To access the memory, the MAR is first written with the address<br />
to be read or written. Then, on the next system cycle, the program memory is either read onto or<br />
written from the data bus accordingly.<br />
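The two-cycle access protocol just described can be sketched as follows. The class and method names are invented for illustration; only the latch-then-access behaviour comes from the text.

```python
# Behavioural sketch of MAR-based memory access: cycle 1 latches the
# address into the MAR; cycle 2 reads onto, or writes from, the data bus.

class ProgramMemory:
    def __init__(self, size=256):
        self.ram = [0] * size
        self.mar = 0                    # Memory Address Register

    def latch_address(self, addr):      # cycle 1: write the MAR
        self.mar = addr

    def read(self):                     # cycle 2: read onto the data bus
        return self.ram[self.mar]

    def write(self, data_bus):          # cycle 2: write from the data bus
        self.ram[self.mar] = data_bus

mem = ProgramMemory()
mem.latch_address(0x10)
mem.write(0xA5)                         # two-cycle write
mem.latch_address(0x10)
assert mem.read() == 0xA5               # two-cycle read
```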
Figure A.1: Canonical Stack Machine [KJ89]. (Blocks: Data Stack (DS), Return Stack (RS), I/O, Control Logic & IR, ALU, PC, MAR, Program Memory and TOS register, connected by the data bus; data and address lines link the MAR and the program memory.)<br />
Appendix B<br />
Instruction Set of Stack Processor<br />
Arithmetic and logic operations<br />
The basic arithmetic and logic operations in table B.1 are the same as in the canonical machine, but they<br />
have been adapted to our needs. Some additional instructions, such as Addition with carry<br />
(ADC), Subtraction with carry (SUBC), Modulus (MOD), Negative (NEG), NOT-operation (NOT),<br />
Increment (INC), Decrement (DEC) and Sign (SIGN), have been added. These additional instructions<br />
provide more flexibility when programming the Stack Processor. All the instructions are described in<br />
register-transfer-level pseudocode, which is assumed to be self-explanatory.<br />
Stack manipulation operations<br />
Pure stack machines can only access the two top elements of the stack for arithmetic operations. Therefore,<br />
some extra instructions are always needed in order to reach operands other than the TOS, NOS<br />
or TORS. Here, such instructions include Rotate (ROT), RS to DS (R2D), DS to RS (D2R) and Copy RS<br />
to DS (CPR2D). R2D, D2R and CPR2D are generally used for shuffling the DS and RS. (The pseudocode<br />
of all the instructions in this table corresponds to the non-pipelined version of the proposed model.)<br />
Memory Fetch and Store<br />
All the arithmetic and logical operations are performed on the data elements of the stack, so there<br />
must be some way of loading information onto the stack and of storing data to the memory. The<br />
register-transfer pseudocode is given in table B.3 below.<br />
Loading Literals<br />
There must be a way to get constants onto the stack. The instructions to do so include LIT and<br />
DLIT, which load a byte and a word of data, respectively, onto the DS, as shown in table B.4.<br />
Conditional branch<br />
Table B.1: Arithmetic and logic operations<br />
Symbol Instruction Operations<br />
ADD Addition TOS ⇐ TOS + NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
ADC Addition with carry TOS ⇐ TOS + NOS<br />
NOS ⇐ Cout<br />
SUB Subtraction TOS ⇐ TOS - NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MUL Multiplication TOS ⇐ TOS × NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DIV Division TOS ⇐ TOS ÷ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MOD Modulus TOS ⇐ TOS mod NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
AND AND-operation TOS ⇐ TOS & NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
OR OR-operation TOS ⇐ TOS | NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
XOR XOR-operation TOS ⇐ TOS xor NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
NEG Negative TOS ⇐ -TOS<br />
NOT NOT-operation TOS ⇐ not TOS<br />
INC Increment TOS ⇐ TOS + 1<br />
DEC Decrement TOS ⇐ TOS - 1<br />
SIGN Sign if (TOS < 0) then TOS ⇐ 0x0000<br />
When processing data there is a need to take decisions, so the machine must support<br />
conditional branches. The conditional jumps can depend on various conditions.<br />
Subroutine Calls<br />
In a stack machine, most of the instructions are executed between the TOS and the NOS, but in this<br />
architecture, to improve the flexibility of stack-based machines, an RS (Return Stack) is added along with the<br />
Table B.2: Stack manipulation operations<br />
Symbol Instruction Operations<br />
DROP Drop TOS ⇐ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DUP Duplication DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
SWAP Swap TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
OVER Over DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
ROT Rotate TOS ⇐ DS[DSP]<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
R2D Return Stack to Data Stack TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
TORS ⇐ RS[RSP]<br />
RSP ⇐ RSP - 1<br />
CPR2D Copy Return Stack to Data Stack TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
D2R Data Stack to Return Stack TORS ⇐ TOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
RSP ⇐ RSP + 1<br />
RS[RSP] ⇐ TORS<br />
RET Return IP ⇐ TORS<br />
TORS ⇐ RS [RSP]<br />
RSP ⇐ RSP - 1<br />
DS (Data Stack). The proposed machine can efficiently call subroutines.<br />
A subroutine call pushes the value in the PC onto the TOS, and sometimes a known<br />
address can be written directly onto the TOS, as shown below in table B.6.<br />
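The subroutine linkage through the return stack can be sketched as follows. The RET part mirrors the pseudocode of table B.2; the call-side sequence is an assumption made for illustration, and the class is a behavioural model, not the RTL.

```python
# Behavioural sketch of call/return through the return stack (RS):
# a call saves the return address in TORS (spilling the old TORS into the
# RS buffer); RET restores it, as in the RET pseudocode of table B.2.

class Machine:
    def __init__(self):
        self.ip = 0            # instruction pointer
        self.rs = [0] * 16     # return stack buffer
        self.rsp = -1          # return stack pointer
        self.tors = 0          # Top Of Return Stack register

    def call(self, target):
        self.rsp += 1
        self.rs[self.rsp] = self.tors   # spill old TORS (assumed sequence)
        self.tors = self.ip + 1         # return address: next instruction
        self.ip = target

    def ret(self):
        self.ip = self.tors             # IP <= TORS
        self.tors = self.rs[self.rsp]   # TORS <= RS[RSP]
        self.rsp -= 1                   # RSP <= RSP - 1

m = Machine()
m.ip = 5
m.call(100)        # nested calls return in LIFO order
m.call(200)
m.ret()
assert m.ip == 101
m.ret()
assert m.ip == 6
```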
Push and Pop<br />
Table B.3: Memory Fetch and Store<br />
Symbol Instruction Operations<br />
FETCH Fetch Mem_Addr ⇐ TOS<br />
TOS ⇐ Mem<br />
STORE Store Mem_Addr ⇐ TOS<br />
Mem ⇐ NOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ DS[DSP]<br />
DSP ⇐ DSP - 1<br />
NOS ⇐ DS[DSP]<br />
Table B.4: Loading Literals<br />
Symbol Instruction Operations<br />
LIT d8 Writing data (Byte size) to TOS DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(byte)<br />
DLIT d16 Writing data (Word size) to TOS DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(word)<br />
Table B.5: Conditional Branch<br />
Symbol Instruction Operations<br />
ZBRA d Jump to ‘d’ if TOS = 0 if (TOS=0) then<br />
IP ⇐ IP + d<br />
SBRA d Jump to ‘d’ if TOS < 0 if (TOS < 0) then<br />
IP ⇐ IP + d<br />
Table B.7: Push and Pop<br />
Symbol Instruction Operations<br />
PUSH DSP Push Data Stack Pointer DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ DSP<br />
POP DSP Pop Data Stack Pointer DSP ⇐ TOS<br />
TOS ⇐ NOS<br />
DSP ⇐ DSP-1<br />
NOS ⇐ DS[DSP]<br />
PUSH RSP Push Return Stack Pointer DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ RSP<br />
POP RSP Pop Return Stack Pointer RSP ⇐ TOS<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
DSP ⇐ DSP-1<br />
B.1 Data Operations in Stack Processor:<br />
Stack machines manipulate data using postfix operations. Such notation is also<br />
called ‘Reverse Polish’ notation. In postfix operations the operators<br />
come after the operands, and each operator acts upon the most recently seen operands. For example,<br />
consider the following expression:<br />
(24 + 04) × 82<br />
In postfix representation this expression becomes<br />
82 24 04 + ×<br />
Postfix expressions are usually shorter than their infix counterparts. The stack processor can execute<br />
postfix expressions directly, without further burdening the compiler.<br />
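Direct postfix execution on a data stack can be sketched with a minimal evaluator. The token set and function name are illustrative; the example reuses the expression above.

```python
# Minimal sketch of postfix (Reverse Polish) evaluation on a data stack:
# operands are pushed; an operator pops its two operands and pushes the result.

def eval_postfix(tokens):
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            nos, tos = stack.pop(), stack.pop()   # two topmost operands
            stack.append(ops[tok](tos, nos))
        else:
            stack.append(int(tok))
    return stack.pop()

# "82 24 04 + ×" from the example above (× written as *):
assert eval_postfix("82 24 04 + *".split()) == (24 + 4) * 82   # 2296
```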
Appendix C<br />
Instruction Set of Pipelined Stack Processor<br />
We have analyzed the multi-clock instructions to explore the possible conflicts between various<br />
instructions. We have found that all the multi-clock instructions can be subdivided into two parts: the<br />
first part consists of DSP+1/RSP+1, depending on the type of instruction, while the second part contains<br />
the rest of the instruction; such instructions can be recognized by the code ‘111’. Thanks to this<br />
pipelining, the DSP+1/RSP+1 part of the next instruction can be pre-executed along with the current<br />
instruction, and the remaining part is executed in the next clock cycle. Hence, after pipelining, all<br />
the instructions execute in a single clock cycle except STORE, which requires 2 clock cycles (before<br />
pipelining, the STORE instruction required 3 clock cycles): STORE needs to update the DSP twice,<br />
which cannot be done in a single clock cycle, and the rest of the instruction is executed in the next clock.<br />
All the instructions are divided into the two stages so that, after the implementation of the two-stage<br />
pipeline, each instruction is executed in one clock cycle. The complete list of the instructions is<br />
given below.<br />
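The effect of the two-stage split on instruction timing can be illustrated with a toy cycle counter. The per-instruction table below is an assumption derived from the description above, not the RTL.

```python
# Illustrative sketch: stage 1 (the DSP+1/RSP+1 pre-increment, when present)
# overlaps with the previous instruction, so most instructions retire in one
# clock; STORE keeps a second clock because it must update the DSP twice.

# instruction -> (stage-1 micro-op or None, clocks taken by stage 2)
SPLIT = {
    "ADD":   (None,    1),
    "DUP":   ("DSP+1", 1),
    "LIT":   ("DSP+1", 1),
    "STORE": (None,    2),   # two pointer updates cannot share one clock
}

def cycles(program):
    clocks = 0
    for instr in program:
        _stage1, stage2_clocks = SPLIT[instr]
        clocks += stage2_clocks   # stage 1 is hidden behind the previous op
    return clocks

assert cycles(["LIT", "DUP", "ADD"]) == 3     # one clock each
assert cycles(["LIT", "LIT", "STORE"]) == 4   # STORE takes two clocks
```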
Table C.1: Instruction set of stack processor (pipelined model)<br />
Instructions First Stage Second Stage<br />
ADD NOP TOS ⇐ TOS + NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
ADC NOP TOS ⇐ TOS + NOS<br />
NOS ⇐ Cout<br />
SUB NOP TOS ⇐ TOS - NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MUL NOP TOS ⇐ TOS × NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DIV NOP TOS ⇐ TOS ÷ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MOD NOP TOS ⇐ TOS mod NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
AND NOP TOS ⇐ TOS & NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
OR NOP TOS ⇐ TOS | NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
XOR NOP TOS ⇐ TOS xor NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
NEG NOP TOS ⇐ -TOS<br />
NOT NOP TOS ⇐ not TOS<br />
INC NOP TOS ⇐ TOS + 1<br />
DEC NOP TOS ⇐ TOS - 1<br />
SIGN NOP if (TOS < 0) then TOS ⇐ 0x0000<br />
Table C.2: Stack manipulation operations<br />
Instructions First stage Second Stage<br />
DROP NOP TOS ⇐ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DUP DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
SWAP NOP TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
OVER DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
ROT NOP TOS ⇐ DS[DSP]<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
R2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
TORS ⇐ RS[RSP]<br />
RSP ⇐ RSP - 1<br />
CPR2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
D2R RS[RSP] ⇐ TORS TORS ⇐ TOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
RSP ⇐ RSP + 1<br />
RET NOP IP ⇐ TORS<br />
TORS ⇐ RS [RSP]<br />
RSP ⇐ RSP - 1<br />
Table C.3: Memory Fetch and Store<br />
Instructions First Stage Second Stage<br />
FETCH Mem_Addr ⇐ TOS TOS ⇐ Mem<br />
STORE Mem_Addr ⇐ TOS Mem ⇐ NOS<br />
TOS ⇐ DS[DSP] NOS ⇐ DS[DSP]<br />
DSP ⇐ DSP - 1 DSP ⇐ DSP - 1<br />
Table C.4: Loading Literals<br />
Instructions First Stage Second Stage<br />
LIT d8 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(byte)<br />
DLIT d16 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(word)<br />
Table C.5: Conditional Branch<br />
Instructions First Stage Second Stage<br />
ZBRA d NOP if (TOS=0) then<br />
IP ⇐ IP + d<br />
SBRA d NOP if (TOS < 0) then<br />
IP ⇐ IP + d<br />
Table C.8: Instruction Codes and Instruction Lengths<br />
b7 b6 b5 b4 b3 b2 b1 b0 Instruction Length<br />
0 0 0 0 0 0 0 0 NOP 0-byte<br />
1 0 0 0 0 0 0 0 ADD 1-byte<br />
1 0 0 0 0 0 0 1 ADC 1-byte<br />
1 0 0 0 0 0 1 0 SUB 1-byte<br />
1 0 0 0 0 0 1 1 SUBC 1-byte<br />
1 0 0 0 0 1 0 0 MUL 1-byte<br />
1 0 0 0 0 1 0 1 DIV 1-byte<br />
1 0 0 0 0 1 1 0 MOD 1-byte<br />
1 0 0 0 0 1 1 1 AND 1-byte<br />
1 0 0 0 1 0 0 0 OR 1-byte<br />
1 0 0 0 1 0 0 1 XOR 1-byte<br />
1 0 0 0 1 0 1 0 NEG 1-byte<br />
1 0 0 0 1 0 1 1 NOT 1-byte<br />
1 0 0 0 1 1 0 0 INC 1-byte<br />
1 0 0 0 1 1 0 1 DEC 1-byte<br />
1 0 0 0 1 1 1 0 SIGN 1-byte<br />
1 0 0 0 1 1 1 1 DROP 1-byte<br />
1 0 0 1 0 0 0 0 DUP 1-byte<br />
1 0 0 1 0 0 0 1 SWAP 1-byte<br />
1 0 0 1 0 0 1 0 OVER 1-byte<br />
1 0 0 1 0 0 1 1 ROT 1-byte<br />
1 0 0 1 0 1 0 0 R2D 1-byte<br />
1 0 0 1 0 1 0 1 CPR2D 1-byte<br />
1 0 0 1 0 1 1 0 D2R 1-byte<br />
1 0 0 1 0 1 1 1 FETCH 1-byte<br />
1 0 0 1 1 0 0 0 STORE 1-byte<br />
1 0 0 1 1 0 0 1 PUSH_DSP 1-byte<br />
1 0 0 1 1 0 1 0 POP_DSP 1-byte<br />
1 0 0 1 1 0 1 1 PUSH_RSP 1-byte<br />
1 0 0 1 1 1 0 0 POP_RSP 1-byte<br />
1 0 1 0 0 0 0 0 LIT a 2-bytes<br />
1 1 0 0 0 0 0 0 DLIT a 3-bytes<br />
1 1 1 0 0 0 0 0 RET 1-byte + IP-change<br />
0 0 1 0 0 0 0 1 ZBRA 2-bytes + IP-change<br />
0 0 1 0 0 0 1 0 SBRA 2-bytes + IP-change<br />
0 1 0 0 0 0 0 0 LBRA 3-bytes + IP-change<br />
0 1 0 0 0 0 0 1 CALL a 3-bytes + IP-change<br />
Appendix D<br />
List of Acronyms<br />
ALU : Arithmetic Logic Unit.<br />
ASIC : Application Specific Integrated Circuits.<br />
BER : Backward Error Recovery.<br />
BIED : Built-In Error Detection Schemes.<br />
BPSG : Boro-Phos-Silicate-Glass.<br />
CED : Concurrent Error Detection.<br />
CISC : Complex Instruction Set Computer.<br />
CMOS : Complementary Metal Oxide Semiconductor.<br />
CPO : Clock Per Operation.<br />
CPI : Clock Per Instruction.<br />
CRC : Cyclic Redundancy Codes.<br />
DCR : Dual-Checker Rail.<br />
DED : Double Error Detection.<br />
DM : Dependable Memory.<br />
DMR : Dual Modular Redundancy.<br />
DS : Data Stack.<br />
DSP : Data Stack Pointer.<br />
DWC : Duplication With Comparison.<br />
DWCR : Duplication With Complement Redundancy.<br />
ECC : Error Control Coding.<br />
EDC : Error Detecting Codes.<br />
EDCC : Error Detecting and Correction Codes.<br />
EDP : Error Detecting Processor.<br />
ESS : Electronic Switching Systems.<br />
FER : Forward Error Recovery.<br />
FPGA : Field Programmable Gate Array.<br />
FT : Fault Tolerant.<br />
FTMP : Fault Tolerant Multi-Processor.<br />
HD : Hamming Distance.<br />
HDL : Hardware Description Language.<br />
HW : Hardware.<br />
IEEE : Institute of Electrical and Electronics Engineers.<br />
IB : Instruction Buffer.<br />
IBMU : Instruction Buffer Management Unit.<br />
IP : Instruction Pointer.<br />
ISA : Instruction Set Architecture.<br />
LICM : Laboratoire Interfaces Capteurs et Micro-électronique.<br />
LIFO : Last In First Out.<br />
MBU : Multiple Bit Upsets.<br />
MCU : Multiple Cell Upsets.<br />
MISC : Minimum Instruction Set Computer.<br />
MPSoC : Multi-Processor System on Chip.<br />
NASA : National Aeronautics and Space Administration.<br />
NoC : Network on Chip.<br />
NOS : Next Of data Stack.<br />
PC : Program Counter.<br />
RAM : Random Access Memory.<br />
REE : Remote Exploration and Experimentation.<br />
RESO : Redundant Execution with Shifted Operands.<br />
RISC : Reduced Instruction Set Computer.<br />
RS : Return Stack.<br />
RSP : Return Stack Pointer.<br />
RTL : Register Transfer Level.<br />
SCHJ : Self-Checking Hardware Journal.<br />
SCPC : Self-Checking Processor Core.<br />
SD : Sequence Duration.<br />
SE : State determining Elements.<br />
SEB : Single Event Burnout.<br />
SEE : Single Event Effect.<br />
SEGR : Single Event Gate Rupture.<br />
SEFI : Single Event Functional Interrupt.<br />
SEL : Single Event Latchup.<br />
SEU : Single Event Upset.<br />
SET : Single Event Transient.<br />
SEC : Single Error Correction.<br />
SW : Software.<br />
SIFT : Software Implemented Fault Tolerance.
SoC : System on Chip.<br />
STAR : Self-Testing and Repair.<br />
TMR : Triple Modular Redundancy.<br />
TORS : Top Of Return Stack.<br />
TOS : Top Of data Stack.<br />
UJ : Un-validated Journal.<br />
VP : Validation Point.<br />
UVD : Un-Validated Data.<br />
VD : Validated Data.<br />
VHDL : VHSIC Hardware Description Language.<br />
VJ : Validated Journal.<br />
Appendix E<br />
List of publications<br />
• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE,<br />
“A Self-Checking HW Journal for a Fault Tolerant Processor Architecture,” International Jour-<br />
nal of Reconfigurable Computing 2011 (IJRC’11) (Accepted).<br />
• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE,<br />
“A Dependable Stack Processor Core for MPSoC Development,” XXIV Conference on Design<br />
of Circuits and Integrated Systems (DCIS’09), Zaragoza, Spain, November 18-20, 2009.<br />
• Mohsin AMIN, Fabrice MONTEIRO, Camille DIOU, Abbas RAMAZANI, Abbas DANDACHE,<br />
“A HW/SW Mixed Mechanism to Improve the Dependability of a Stack Processor,” 16th<br />
IEEE International Conference on Electronics, Circuits, and Systems (ICECS’09), Hammamet,<br />
Tunisia, December 13-16, 2009.<br />
• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE,<br />
“Journalized Stack Processor for Reliable Embedded Systems,” 1st International Conference<br />
on Aerospace Science and Engineering (ICASE’09), Islamabad, Pakistan, August 18-20, 2009.<br />
• A. Ramazani, M. Amin, F. Monteiro, C. Diou, A. Dandache, “A Fault Tolerant Journalized<br />
Stack Processor Architecture,” 15th IEEE International On-Line Testing Symposium (IOLTS’09),<br />
Sesimbra-Lisbonne, Portugal, 24–27 June 2009.<br />
• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE,<br />
“Error Detecting and Correcting Journal for Dependable Processor Core,” GDR System on Chip<br />
- System in Package (GDR-SoC-SiP’10), Cergy-Paris, France, 9-11 June 2010.<br />
• Mohsin Amin, Camille Diou, Fabrice Monteiro, Abbas Ramazani, “Design Methodology of<br />
Reliable Stack Processor Core,” GDR System on Chip - System in Package 2009 (GDR-SoC-<br />
SiP’09), Orsay-Paris, France, 9-11 June 2010.<br />
• Mohsin AMIN, “Self-Organization in Embedded Systems,” 2nd Winter School on Self Organi-<br />
zation in Embedded Systems, Schloss Dagstuhl, Germany, November 2007.
List of Figures<br />
1.1 An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs<br />
in its wake, causing a charge disturbance [MW04] . . . . . . . . . . . . . . . . . . . 16<br />
1.2 A high-energy particle strike resulting in error(s) . . . . . . . . . . . . . . . . . . . 16<br />
1.3 Classification of faults on the basis of single event effects (SEE) [Pie07]. . . . . . . 17<br />
1.4 Dependability Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.5 Fault, error and failure chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
1.6 Error propagation from processor to main memory . . . . . . . . . . . . . . . . . . 21<br />
1.7 A single fault caused failure of traffic control system . . . . . . . . . . . . . . . . . 22<br />
1.8 Service failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.9 Fault characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />
1.10 Some causes of fault occurrence. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
1.11 Dependability techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />
1.12 Sequence of events from ionization to failure, and a set of fault-tolerant techniques<br />
applied at different times [Pie07]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
2.1 General architecture of a concurrent error detection scheme [MM00] . . . . . . . . 32<br />
2.2 Duplication with comparison (DWC) . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />
2.3 Time redundancy for temporary and intermittent fault detection . . . . . . . . . . . . 34<br />
2.4 Time redundancy for permanent error detection . . . . . . . . . . . . . . . . . . . . 34<br />
2.5 Information redundancy principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.6 Parity coder in data storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.7 Functional Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
2.8 Residue codes adder [FFMR09]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
2.9 Triple modular redundancy (TMR) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.10 Error detecting and correcting memory block . . . . . . . . . . . . . . . . . . . . . 39<br />
2.11 Basic strategies for implementing Error Recovery. . . . . . . . . . . . . . . . . . . . 41<br />
2.12 The triple-TMR in Boeing 777 [Yeh02] . . . . . . . . . . . . . . . . . . . . . . . . 45<br />
3.1 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
3.2 Limitation of parity check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />
3.3 Rollback Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
3.4 Error detection during Sequence Duration (SD) and rollback called . . . . . . . . . . 56<br />
3.5 No-error detected during the SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
3.6 Time overhead in rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.7 Untrusted data flowing into dependable memory (DM) . . . . . . . . . . . . . . . . 60<br />
3.8 Data stored to temporary location before writing to DM . . . . . . . . . . . . . . . . 61<br />
3.9 Data corruption in temporary storage. . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />
3.10 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
3.11 Overall design specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
3.12 Global design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />
3.13 Model-I with data cache and a pair of journals . . . . . . . . . . . . . . . . . . . . . 66<br />
3.14 Cache with associative mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66<br />
3.15 FT evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67<br />
3.16 Periodic, random and burst errors models . . . . . . . . . . . . . . . . . . . . . . . 68<br />
3.17 Model-I: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 69<br />
3.18 Model-I: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . . 70<br />
3.19 Model-I: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 71<br />
3.20 Block diagram of Model-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
3.21 Processor can simultaneously read from Journal and DM . . . . . . . . . . . . . . . 72<br />
3.22 No error detected during SD and data is validated at VP . . . . . . . . . . . . . . . . 72<br />
3.23 Error detected and all the data written during SD is deleted . . . . . . . . . . . . . . 73<br />
3.24 Model-II: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 74<br />
3.25 Model-II: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . 74<br />
3.26 Model-II: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 75<br />
4.1 Design of a self checking processor core (SCPC) . . . . . . . . . . . . . . . . . . . 77<br />
4.2 Criteria behind the choice of the stack processor . . . . . . . . . . . . . . . . . . . 79<br />
4.3 Multiple and large stacks, 0-operand (ML0) computer chosen from a three-axis design<br />
space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81<br />
4.4 Simplified stack machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82<br />
4.5 Modified stack processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83<br />
4.6 Simplified data-path of the proposed model (arithmetic and logic instructions) . . . . 84<br />
4.7 Different instruction types from the execution point of view (without pipelining) . . . 85<br />
4.8 Execution of duplication (DUP) instruction in 2 clock cycles . . . . . . . . . . . . . 86<br />
4.9 Multiple-byte instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
4.10 Data-path of protected-processor’s ALU . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
4.11 Resource utilization chart for various ALU designs [SFRB05] . . . . . . . . . . . . 89<br />
4.12 ALU protecting the logical and arithmetic instructions separately . . . . . . . . . . 90<br />
4.13 Remainder check technique for error detection in arithmetic instructions . . . . . . . 91<br />
4.14 Parity check technique for error detection in logic instructions . . . . . . . . . . . . 92
4.15 Parity check technique for error detection in register(s) . . . . . . . . . . . . . . . . 92<br />
4.16 Error occurred in Protected ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.17 Instruction Buffer Management Unit (IBMU) . . . . . . . . . . . . . . . . . . . . 95<br />
4.18 Instruction buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96<br />
4.19 (a) Opcodes description and (b) pipelined execution model . . . . . . . . . . . . . . 97<br />
4.20 A sample program executed through non-pipelined and pipelined stack processor core 97<br />
4.21 Timing diagram for a sample program executed twice: once in non-pipelined version<br />
and then pipelined version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
4.22 Implementation design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
4.23 Strategy to overcome performance overhead due to conditional branches . . . . . . . 99<br />
4.24 Implementation of a self-checking processor core . . . . . . . . . . . . . . . . . . . 100<br />
4.25 Error detected in SCPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
5.1 Design of SCHJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103<br />
5.2 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
5.3 (a) Error(s) in un-validated journal (b) error(s) in validated journal . . . . . . . . . . 105<br />
5.4 Hsiao Parity Check Matrix (41,34) . . . . . . . . . . . . . . . . . . . . . . . . . . . 106<br />
5.5 SCHJ structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />
5.6 Error detection and correction in journal (a memory block of SCHJ). . . . . . . . . 109<br />
5.7 Overall architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
5.8 Rollback mechanism on error detection. . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
5.9 SCHJ operation flow chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
5.10 SCHJ mode 00. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
5.11 SCHJ mode 01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.12 Read of UVD from SCHJ in mode 01 . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.13 SCHJ mode 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114<br />
5.14 Mode 10 of SCHJ operation (uncorrectable error detected) . . . . . . . . . . . . . . 114<br />
5.15 SCHJ mode 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.16 Uncorrectable error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
5.17 Increase in percentage utilization of the FT processor (SCPC + SCHJ) on device<br />
EP3SE50F484C2 with increasing journal depth. . . . . . . . . . . . . . . . . . . . 116<br />
5.18 Theoretical limits of Journal Depth. . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
5.19 Relation between journal depth and percentage write in benchmarks. . . . . . . . . . 118<br />
5.20 CPI vs. SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
5.21 Dynamic SD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
6.1 The overall FT-processor to be validated. . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
6.2 Error injection in FT-processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
6.3 Error patterns (errors can occur in any bit, not necessarily the bit shown here). . . . . 123<br />
6.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.5 Single bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
6.6 Double bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
6.7 Triple bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126<br />
6.8 Harsh (1 to 8 bits, random) error injection. . . . . . . . . . . . . . . . . . . . . . . 126<br />
6.9 Performance Degradation due to re-execution . . . . . . . . . . . . . . . . . . . . . 127<br />
6.10 Simulation curves for group-I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
6.11 Simulation curves for group-II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
6.12 Simulation curves for group-III. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />
6.13 Effect of EIR on rollback for benchmarks group-I. . . . . . . . . . . . . . . . . . . . 129<br />
6.14 Effect of EIR on rollback for benchmarks group-II. . . . . . . . . . . . . . . . . . . 130<br />
6.15 Effect of EIR on rollback for benchmarks group-III. . . . . . . . . . . . . . . . . . . 130<br />
A.1 Canonical Stack Machine [KJ89] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
List of Tables<br />
1.1 Cost/hour for failure of control system [Pie07] . . . . . . . . . . . . . . . . . . . . . 13<br />
1.2 Dependability attributes for University web-server and Nuclear-reactor [Pie07], where<br />
attributes are classified as: – very important = 4 points, – least important = 1 point . 20<br />
2.1 Fault modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
3.1 Comparison of the Processor-Memory Models . . . . . . . . . . . . . . . . . . . 73<br />
4.1 Instruction types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
4.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
5.1 Modes of Journal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
5.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
6.1 Read/Write profiles in benchmarks groups . . . . . . . . . . . . . . . . . . . . . . . 127<br />
B.1 Arithmetic and logic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />
B.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143<br />
B.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145<br />
C.1 Instruction set of stack processor (pipelined model) . . . . . . . . . . . . . . . . . . 148<br />
C.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />
C.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />
C.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.8 Instruction Codes and Instruction Lengths . . . . . . . . . . . . . . . . . . . . . . . 151
Bibliography<br />
[ACC + 93] J. Arlat, A. Costes, Y. Crouzet, J. C Laprie, and D. Powell. Fault injection and depend-<br />
ability evaluation of fault-tolerant systems. IEEE Transactions on Computers, page<br />
913–923, 1993.<br />
[Aer11] Aeroflex. Dual-Core LEON3FT SPARC v8 processor, 2011.<br />
[AFK05] J. Aidemark, P. Folkesson, and J. Karlsson. A framework for node-level fault toler-<br />
ance in distributed real-time systems. In Proceedings of International Conference on<br />
Dependable Systems and Networks, 2005 (DSN’05), page 656–665, 2005.<br />
[AHHW08] U. Amgalan, C. Hachmann, S. Hellebrand, and H. J Wunderlich. Signature Rollback-A<br />
technique for testing robust circuits. In 26th IEEE VLSI Test Symposium, 2008 (VTS’08),<br />
page 125–130, 2008.<br />
[AKT + 08] H. Ando, R. Kan, Y. Tosaka, K. Takahisa, and K. Hatanaka. Validation of hardware<br />
error recovery mechanisms for the SPARC64 V microprocessor. In Dependable Systems<br />
and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference<br />
on, page 62–69, 2008.<br />
[ALR01] A. Avizienis, J. C Laprie, and B. Randell. Fundamental concepts of dependability.<br />
Research report UCLA CSD Report no. 010028, 2001.<br />
[AMD + 10] M. Amin, F. Monteiro, C. Diou, A. Ramazani, and A. Dandache. A HW/SW<br />
mixed mechanism to improve the dependability of a stack processor. In Proceedings<br />
of 16th IEEE International Conference on Electronics, Circuits, and Systems, 2009<br />
(ICECS’09), page 976–979, 2010.<br />
[ARM09] ARM. Cortex-R4 and Cortex-R4F. Technical reference manual, 2009.<br />
[ARM + 11] Mohsin Amin, Abbas Ramazani, Fabrice Monteiro, Camille Diou, and Abbas Dan-<br />
dache. A Self-Checking hardware journal for a fault tolerant processor architecture.<br />
Hindawi Publishing Corporation, 2011.<br />
[Bai10] G. Bailey. Comparison of GreenArrays chips with Texas Instruments MSP430F5xx as<br />
micropower controllers, June 2010.<br />
[Bau05] R. C Baumann. Radiation-induced soft errors in advanced semiconductor technologies.<br />
IEEE Transactions on Device and Materials Reliability, 5(3):305–316, 2005.<br />
[BBV + 05] D. Bernick, B. Bruckert, P. D Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen.<br />
NonStop advanced architecture. In Proceedings of International Conference on De-<br />
pendable Systems and Networks, 2005 (DSN’05), page 12–21, 2005.<br />
[BCT08] B. Bridgford, C. Carmichael, and C. W Tseng. Single-event upset mitigation selection<br />
guide. Xilinx Application Note, 987, 2008.<br />
[BGB + 08] J. C Baraza, J. Gracia, S. Blanc, D. Gil, and P. J Gil. Enhancement of fault injection<br />
techniques based on the modification of VHDL code. IEEE Transactions on Very Large<br />
Scale Integration (VLSI) Systems, 16(6):693–706, 2008.<br />
[Bic10] R. Bickham. An Analysis of Error Detection Techniques for Arithmetic Logic Units.<br />
PhD thesis, Vanderbilt University, 2010.<br />
[BP02] N. S Bowen and D. K Pradhan. Virtual checkpoints: Architecture and performance.<br />
Computers, IEEE Transactions on, 41(5):516–525, 2002.<br />
[BT02] D. Briere and P. Traverse. AIRBUS A320/A330/A340 electrical flight controls-a fam-<br />
ily of fault-tolerant systems. In the Twenty-Third International Symposium on Fault-<br />
Tolerant Computing System, 1993 (FTCS’93), page 616–623, 2002.<br />
[Car01] C. Carmichael. Triple module redundancy design techniques for virtex FPGAs. Xilinx<br />
Application Note XAPP197, 1, 2001.<br />
[Che08] L. Chen. Hsiao-Code check matrices and recursively balanced matrices. Arxiv preprint<br />
arXiv:0803.1217, 2008.<br />
[CHL97] W-T Chang, S Ha, and E.A. Lee. Heterogeneous simulation - mixing Discrete-Event<br />
models with dataflow. Journal of VLSI Signal Processing, 15(1-2):127–144, 1997.<br />
[CP02] J. A Clark and D. K Pradhan. Fault injection: A method for validating computer-system<br />
dependability. Computer, 28(6):47–56, 2002.<br />
[CPB + 06] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and<br />
M. Orshansky. Bulletproof: A defect-tolerant CMP switch architecture. In Proceedings<br />
of 25th International Symposium on High-Performance Computer Architecture, 2006,<br />
page 5–16, 2006.<br />
[CTS + 10] C. L. Chen, N. N. Tendolkar, A. J. Sutton, M. Y. Hsiao, and D. C. Bossen. Fault-<br />
tolerance design of the IBM enterprise system/9000 type 9021 processors. IBM Journal<br />
of Research and Development, 36(4):765–779, 2010.
[EAWJ02] E. N. Elnozahy, L. Alvisi, Y. M Wang, and D. B Johnson. A survey of rollback-<br />
recovery protocols in message-passing systems. ACM Computing Surveys (CSUR),<br />
34(3):375–408, 2002.<br />
[EKD + 05] D. Ernst, N. S Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,<br />
K. Flautner, et al. Razor: A low-power pipeline based on circuit-level timing specula-<br />
tion. In Proceedings of 36th Annual IEEE/ACM International Symposium on Microar-<br />
chitecture, 2003 (MICRO’03), page 7–18, 2005.<br />
[FFMR09] R. Forsati, K. Faez, F. Moradi, and A. Rahbar. A fault tolerant method for residue<br />
arithmetic circuits. In Proceedings of 2009 International Conference on Information<br />
Management and Engineering, page 59–63, 2009.<br />
[FGAD10] R. Fernández-Pascual, J. M Garcia, M. E Acacio, and J. Duato. Dealing with tran-<br />
sient faults in the interconnection network of CMPs at the cache coherence level. IEEE<br />
Transactions on Parallel and Distributed Systems, 21(8):1117–1131, 2010.<br />
[FGAM10] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic soft error relia-<br />
bility on the cheap. ACM SIGPLAN Notices, 45(3):385–396, 2010.<br />
[FP02] E. Fujiwara and D. K Pradhan. Error-control coding in computers. Computer,<br />
23(7):63–72, 2002.<br />
[GBT05] S. Ghosh, S. Basu, and N. A Touba. Selecting error correcting codes to minimize power<br />
in memory checker circuits. Journal of Low Power Electronics, 1(1):63–72, 2005.<br />
[GC06] J. Gaisler and E. Catovic. Multi-Core processor based on LEON3-FT IP core (LEON3-<br />
FT-MP). In in Proceedings of Data Systems in Aerospace, 2006 (DASIA’06), volume<br />
630, page 76, 2006.<br />
[Gha11] S. Ghaznavi. Soft Error Resistant Design of the AES Cipher Using SRAM-based FPGA.<br />
PhD thesis, University of Waterloo, 2011.<br />
[GMT08] M. Grottke, R. Matias, and K. S Trivedi. The fundamentals of software aging. In Pro-<br />
ceedings of IEEE International Conference on Software Reliability Engineering Work-<br />
shops, 2008 (ISSRE Wksp 2008), page 1–6, 2008.<br />
[GPLL09] C. Godlewski, V. Pouget, D. Lewis, and M. Lisart. Electrical modeling of the effect<br />
of beam profile for pulsed laser fault injection. Microelectronics Reliability, 49(9-<br />
11):1143–1147, 2009.<br />
[Gre10] GreenArrays. GreenArrays chip project, 2010.<br />
[Hay05] J.R. Hayes. The architecture of the scalable configurable instrument processor. Techni-<br />
cal Report SRI-05-030, The Johns Hopkins Applied Physics Laboratory, 2005.
[HCTS10] M. Y. Hsiao, W. C Carter, J. W Thomas, and W. R Stringfellow. Reliability, availabil-<br />
ity, and serviceability of IBM computer systems: A quarter century of progress. IBM<br />
Journal of Research and Development, 25(5):453–468, 2010.<br />
[HH06] A. J Harris and J. R Hayes. Functional programming on a Stack-Based embedded<br />
processor. 2006.<br />
[Hsi10] M. Y Hsiao. A class of optimal minimum odd-weight-column SEC-DED codes. IBM<br />
Journal of Research and Development, 14(4):395–401, 2010.<br />
[IK03] R. K Iyer and Z. Kalbarczyk. Hardware and software error detection. Technical report,<br />
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-<br />
Champaign, Urbana, 2003.<br />
[Int09] Intel. White paper - the intel itanium processor 9300 series. Technical report, 2009.<br />
[ITR07] ITRS. International technology roadmap for semiconductors. 2007.<br />
[Jab09] Jaber. Conception architecturale haut débit et sûre de fonctionnement pour les codes<br />
correcteurs d’erreurs. PhD thesis, Université Paul Verlaine - Metz, France, Metz, 2009.<br />
[Jal09] M. Jallouli. Méthodologie de conception d’architectures de processeur sûres de fonc-<br />
tionnement pour les applications mécatroniques. PhD thesis, Université Paul Verlaine -<br />
Metz, France, Metz, 2009.<br />
[JDMD07] M. Jallouli, C. Diou, F. Monteiro, and A. Dandache. Stack processor architec-<br />
ture and development methods suitable for dependable applications. Reconfigurable<br />
Communication-centric SoCs (ReCoSoC’07), Montpellier, France, 2007.<br />
[JES06] JEDEC Standard JESD89A. Measurement and reporting of alpha particle and terrestrial cosmic<br />
ray-induced soft errors in semiconductor devices. October 2006.<br />
[JHW + 08] J. Johnson, W. Howes, M. Wirthlin, D. L McMurtrey, M. Caffrey, P. Graham, and<br />
K. Morgan. Using duplication with compare for on-line error detection in FPGA-based<br />
designs. In Proceedings of IEEE Aerospace Conference, 2008, page 1–11, 2008.<br />
[JPS08] B. Joshi, D. Pradhan, and J. Stiffler. Fault-Tolerant computing. 2008.<br />
[KJ89] P. J Koopman Jr. Stack computers: the new wave. Halsted Press New York, NY, USA,<br />
1989.<br />
[KKB07] I. Koren, C. M Krishna, and Inc Books24x7. Fault-tolerant systems. Elsevier/Morgan<br />
Kaufmann, 2007.
[KKS + 07] P. Kudva, J. Kellington, P. Sanda, R. McBeth, J. Schumann, and R. Kalla. Fault injec-<br />
tion verification of IBM POWER6 soft error resilience. In Architectural Support for<br />
Gigascale Integration (ASGI) Workshop, 2007.<br />
[KMSK09] J. W Kellington, R. McBeth, P. Sanda, and R. N Kalla. IBM POWER6 processor soft<br />
error tolerance analysis using proton irradiation. In Proceedings of the IEEE Workshop<br />
on Silicon Errors in Logic—Systems Effects (SELSE) Conference, 2009.<br />
[Kop04] H. Kopetz. From a federated to an integrated architecture for dependable embedded<br />
systems. PhD thesis, Technische Univ Vienna, Vienna, Austria, 2004.<br />
[Kop11] H. Kopetz. Real-time systems: design principles for distributed embedded applications,<br />
volume 25. Springer-Verlag New York Inc, 2011.<br />
[Lal05] P. K Lala. Single error correction and double error detecting coding scheme, 2005.<br />
[Lap04] J.C. Laprie. Sûreté de fonctionnement des systèmes : concepts de base et terminologie.<br />
2004.<br />
[LAT07] K. W. Li, J. R. Armstrong, and J. G. Tront. An HDL simulation of the effects of single<br />
event upsets on microprocessor program flow. IEEE Transactions on Nuclear Science,<br />
31(6):1139–1144, 2007.<br />
[LB07] J. Laprie and B. Randell. Origins and integration of the concepts. 2007.<br />
[LBS + 11] I. Lee, M. Basoglu, M. Sullivan, D. H Yoon, L. Kaplan, and M. Erez. Survey of error<br />
and fault detection mechanisms. 2011.<br />
[LC08] C. A.L Lisboa and L. Carro. XOR-based low cost checkers for combinational logic. In<br />
IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems, page<br />
281–289, 2008.<br />
[LN09] Dongwoo Lee and Jongwhoa Na. A novel simulation fault injection method for depend-<br />
ability analysis. IEEE Design & Test of Computers, 26(6):50–61, December 2009.<br />
[LRL04] J. C Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable<br />
and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33,<br />
2004.<br />
[MB07] N. Madan and R. Balasubramonian. Power efficient approaches to redundant multi-<br />
threading. IEEE Transactions on Parallel and Distributed Systems, page 1066–1079,<br />
2007.<br />
[MBS07] A. Meixner, M. E Bauer, and D. Sorin. Argus: Low-cost, comprehensive error detection<br />
in simple cores. In Proceedings of the 40th Annual IEEE/ACM International Symposium<br />
on Microarchitecture, page 210–222, 2007.
[MG09] A. Maloney and A. Goscinski. A survey and review of the current state of rollback-<br />
recovery for cluster systems. Concurrency and Computation: Practice and Experience,<br />
21(12):1632–1666, 2009.<br />
[MM00] S. Mitra and E. J McCluskey. Which concurrent error detection scheme to choose?<br />
2000.<br />
[MMPW07] K. S Morgan, D. L McMurtrey, B. H Pratt, and M. J Wirthlin. A comparison of TMR<br />
with alternative fault-tolerant design techniques for FPGAs. IEEE Transactions on Nu-<br />
clear Science, 54(6):2065–2072, 2007.<br />
[Mon07] Y. Monnet. Etude et modélisation de circuits résistants aux attaques non intrusives par<br />
injection de fautes. Thèse de doctorat, Institut National Polytechnique de Grenoble,<br />
2007.<br />
[MS06] F. MacWilliams and N. Sloane. The theory of error-correcting codes. 2006.<br />
[MS07] A. Meixner and D. J Sorin. Error detection using dynamic dataflow verification. In Pro-<br />
ceedings of the 16th International Conference on Parallel Architecture and Compilation<br />
Techniques, page 104–118, 2007.<br />
[MSSM10] M. J. Mack, W. M. Sauer, S. B. Swaney, and B. G. Mealey. IBM POWER6 reliability.<br />
IBM Journal of Research and Development, 51(6):763–774, 2010.<br />
[Muk08] S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2008.<br />
[MW04] R. Mastipuram and E. C Wee. Soft errors’ impact on system reliability. EDN, Sept, 30,<br />
2004.<br />
[MW07] T. C May and M. H Woods. A new physical mechanism for soft errors in dynamic<br />
memories. In 16th Annual Reliability Physics Symposium, page 33–40, 2007.<br />
[NBV + 09] T. Naughton, W. Bland, G. Vallee, C. Engelmann, and S. L Scott. Fault injection frame-<br />
work for system resilience evaluation: fake faults for finding future failures. In Pro-<br />
ceedings of the 2009 workshop on Resiliency in high performance, page 23–28, 2009.<br />
[Nic02] M. Nicolaidis. Efficient implementations of self-checking adders and ALUs. In<br />
Proceedings of Twenty-Third International Symposium on Fault-Tolerant Computing,<br />
(FTCS-23), page 586–595, 2002.<br />
[Nic10] M. Nicolaidis. Soft Errors in Modern Electronic Systems. Springer Verlag, 2010.<br />
[NL11] J. Na and D. Lee. Simulated fault injection using simulator modification technique.<br />
ETRI Journal, 33(1), 2011.
[NMGT06] J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas. ReViveI/O: efficient han-<br />
dling of I/O in highly-available rollback-recovery servers. In the Twenty-fifth Interna-<br />
tional Symposium on High-Performance Computer Architecture, 2006, page 200–211,<br />
2006.<br />
[NTN + 09] M. Nicolaidis, K. Torki, F. Natali, F. Belhaddad, and D. Alexandrescu. Implementation<br />
and validation of a low-cost single-event latchup mitigation scheme. In IEEE Workshop<br />
on Silicon Errors in Logic–System Effects (SELSE), Stanford, CA, 2009.<br />
[NX06] V. Narayanan and Y. Xie. Reliability concerns in embedded system designs. Computer,<br />
39(1):118–120, 2006.<br />
[Pat10] Anurag Patel. Fault tolerant features of modern processors, 2010.<br />
[PB04] S. Pelc and C. Bailey. Ubiquitous Forth objects. In EuroForth’04, Dagstuhl, Germany,<br />
2004.<br />
[PF06] J. H Patel and L. Y Fung. Concurrent error detection in ALU’s by recomputing with<br />
shifted operands. IEEE Transactions on Computers, 100(7):589–595, 2006.<br />
[Pie06] S.J. Piestrak. Dependable computing: Problems, techniques and their applications. In<br />
First Winter School on Self-Organization in Embedded Systems, Schloss Dagstuhl, Ger-<br />
many, 2006.<br />
[Pie07] S.J. Piestrak. Systèmes numériques tolérants aux fautes, 2007.<br />
[PIEP09] P. Pop, V. Izosimov, P. Eles, and Z. Peng. Design optimization of time-and cost-<br />
constrained fault-tolerant embedded systems with checkpointing and replication. IEEE<br />
Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):389–402, 2009.<br />
[Poe05] Christian Poellabauer. Real-Time systems, 2005.<br />
[Pow10] D. Powell. A generic fault-tolerant architecture for real-time dependable systems.<br />
Springer Publishing Company, Incorporated, 2010.<br />
[QGK + 06] H. Quinn, P. Graham, J. Krone, M. Caffrey, and S. Rezgui. Radiation-induced<br />
multi-bit upsets in SRAM-based FPGAs. IEEE Transactions on Nuclear Science,<br />
52(6):2455–2461, 2006.<br />
[QLZ05] F. Qin, S. Lu, and Y. Zhou. SafeMem: exploiting ECC-memory for detecting memory<br />
leaks and memory corruption during production runs. In 11th International Symposium<br />
on High-Performance Computer Architecture, 2005 (HPCA’11), page 291–302, 2005.<br />
[RAM + 09] A. Ramazani, M. Amin, F. Monteiro, C. Diou, and A. Dandache. A fault tolerant jour-<br />
nalized stack processor architecture. In 15th IEEE International On-Line Testing Sym-<br />
posium, 2009 (IOLTS’09), Sesimbra-Lisbon, Portugal, 2009.
[RI08] G. A. Reis. Software modulated fault tolerance. PhD thesis, Princeton University,<br />
2008.<br />
[RK09] J. A. Rivers and P. Kudva. Reliability challenges and system performance at the architecture level. IEEE Design & Test of Computers, 26(6):62–73, 2009.
[RNS+05] K. Rothbart, U. Neffe, C. Steger, R. Weiss, E. Rieger, and A. Muehlberger. A smart card test environment using multi-level fault injection in SystemC. In Proceedings of the 6th IEEE Latin-American Test Workshop 2005, pages 103–108, March 2005.
[RR08] V. Reddy and E. Rotenberg. Coverage of a microarchitecture-level fault check regimen in a superscalar processor. In IEEE International Conference on Dependable Systems and Networks 2008 (DSN'08), pages 1–10, Anchorage, Alaska, 2008.
[RRTV02] M. Rebaudengo, M. Sonza Reorda, M. Torchiano, and M. Violante. Soft-error detection through software fault-tolerance techniques. In International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 210–218, 2002.
[RS09] B. Rahbaran and A. Steininger. Is asynchronous logic more robust than synchronous logic? IEEE Transactions on Dependable and Secure Computing, pages 282–294, 2009.
[RYKO11] W. Rao, C. Yang, R. Karri, and A. Orailoglu. Toward future systems with nanoscale<br />
devices: Overcoming the reliability challenge. Computer, 44(2):46–53, 2011.<br />
[Sch08] Martin Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, 2008.
[SFRB05] V. Srinivasan, J. W. Farquharson, W. H. Robinson, and B. L. Bhuva. Evaluation of error detection strategies for an FPGA-based self-checking arithmetic and logic unit. In MAPLD International Conference, 2005.
[SG10] L. Spainhower and T. A. Gregg. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development, 43(5.6):863–873, 2010.
[Sha06] Mark Shannon. A C Compiler for Stack Machines. MSc thesis, University of York,<br />
2006.<br />
[SHLR+09] S. K. Sastry Hari, M. L. Li, P. Ramachandran, B. Choi, and S. V. Adve. mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 122–132, 2009.
[SMHW02] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 123–134, 2002.
[SMR+07] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007 (DSN'07), pages 297–306, 2007.
[Sor09] D. J. Sorin. Fault Tolerant Computer Architecture. Morgan & Claypool Publishers, 2009.
[SSF+08] J. R. Schwank, M. R. Shaneyfelt, D. M. Fleetwood, J. A. Felix, P. E. Dodd, P. Paillet, and
V. Ferlet-Cavrois. Radiation effects in MOS oxides. IEEE Transactions on Nuclear<br />
Science, 55(4):1833–1853, 2008.<br />
[Sta06] William Stallings. Computer Organization and Architecture. Prentice Hall, 7th edition,<br />
2006.<br />
[TM95] C. H. Ting and C. H. Moore. MuP21: a high performance MISC processor. Forth Dimensions, 1995.
[Too11] C. Toomey. Statistical Fault Injection and Analysis at the Register Transfer Level using the Verilog Procedural Interface. PhD thesis, Vanderbilt University, 2011.
[Van08] V. P. Vanhauwaert. Fault injection based dependability analysis in an FPGA-based environment. PhD thesis, Institut Polytechnique de Grenoble, Grenoble, France, 2008.
[VFM06] A. Vahdatpour, M. Fazeli, and S. Miremadi. Transient error detection in embedded systems using reconfigurable components. In International Symposium on Industrial Embedded Systems, 2006 (IES'06), pages 1–6, 2006.
[VK07] J. Von Knop. A Process for Developing a Common Vocabulary in the Information Security Area. IOS Press, 2007.
[VSL09] M. Vayrynen, V. Singh, and E. Larsson. Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips. In Proceedings of Design, Automation & Test in Europe Conference & Exhibition, 2009 (DATE'09), pages 484–489, 2009.
[WA08] F. Wang and V. D. Agrawal. Single event upset: An embedded tutorial. In 21st International Conference on VLSI Design, 2008 (VLSID 2008), pages 429–434, 2008.
[WCS08] P. M Wells, K. Chakraborty, and G. S Sohi. Adapting to intermittent faults in multicore<br />
systems. ACM SIGPLAN Notices, 43(3):255–264, 2008.
[WL10] C. F. Webb and J. S. Liptay. A high-frequency custom CMOS S/390 microprocessor.
IBM Journal of Research and Development, 41(4.5):463–473, 2010.<br />
[Yeh02] Y. C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the IEEE Aerospace Applications Conference, volume 1, pages 293–307, 2002.
[ZJ08] Y. Zhang and J. Jiang. Bibliographical review on reconfigurable fault-tolerant control<br />
systems. Annual Reviews in Control, 32(2):229–252, 2008.<br />
[ZL09] J. F. Ziegler and W. A. Lanford. The effect of sea level cosmic rays on electronic devices. Journal of Applied Physics, 52(6):4305–4312, 2009.
RÉSUMÉ<br />
Dans cette thèse, nous proposons une nouvelle approche pour la conception d’un processeur tolérant aux fautes. Celle-ci
répond à plusieurs objectifs dont celui d’obtenir un niveau de protection élevé contre les erreurs transitoires et un<br />
compromis raisonnable entre performances temporelles et coût en surface. Le processeur résultant sera utilisé ultérieurement<br />
comme élément constitutif d’un système multiprocesseur sur puce (MPSoC) tolérant aux fautes. Les concepts mis<br />
en œuvre pour la tolérance aux fautes reposent sur l’emploi de techniques de détection concurrente d’erreurs et de recouvrement<br />
par réexécution. Les éléments centraux de la nouvelle architecture sont, un cœur de processeur à pile de données<br />
de type MISC (Minimal Instruction Set Computer) capable d’auto-détection d’erreurs, et un mécanisme matériel de journalisation<br />
chargé d’empêcher la propagation d’erreurs vers la mémoire centrale (supposée sûre) et de limiter l’impact du<br />
mécanisme de recouvrement sur les performances temporelles.<br />
L’approche méthodologique mise en œuvre repose sur la modélisation et la simulation selon différents modes et niveaux<br />
d’abstraction, le développement d’outils logiciels dédiés, et le prototypage sur des technologies FPGA. Les résultats,
obtenus sans recherche d’optimisation poussée, montrent clairement la pertinence de l’approche proposée, en offrant<br />
un bon compromis entre protection et performances. En effet, comme le montrent les multiples campagnes d’injection<br />
d’erreurs, le niveau de tolérance aux fautes est élevé, avec 100% des erreurs simples détectées et recouvrées, et environ 60% et 78% des erreurs doubles et triples, respectivement. Le taux de recouvrement reste raisonnable pour des erreurs à multiplicité plus élevée,
étant encore de 36% pour des erreurs de multiplicité 8.<br />
Mots clés : Tolérance aux fautes, Processeur à pile de données, MPSoC, Journalisation, Restauration, Injection de<br />
fautes, Modélisation RTL.<br />
ABSTRACT<br />
In this thesis, we propose a new approach to designing a fault-tolerant processor. The methodology addresses several goals, including a high level of protection against transient faults along with reasonable trade-offs in performance and area overhead. The resulting fault-tolerant processor will be used as a building block in a fault-tolerant MPSoC (Multi-Processor System-on-Chip) architecture. The concepts used to achieve fault tolerance are based on concurrent error detection and rollback recovery techniques. The core elements of this architecture are a stack processor core of the MISC (Minimal Instruction Set Computer) class and a hardware journal in charge of preventing error propagation to the main memory (assumed dependable) and of limiting the impact of the rollback mechanism on time performance.
The design methodology relies on modeling at different abstraction levels and simulation modes, developing dedicated software tools, and prototyping on FPGA technology. The results, obtained without seeking thorough optimization, clearly show the relevance of the proposed approach, offering a good compromise between protection and performance. Indeed, fault tolerance, as revealed by several error injection campaigns, proves to be high, with 100% of errors detected and recovered for single-bit error patterns, and about 60% and 78% for double- and triple-bit error patterns, respectively. Furthermore, the recovery rate remains acceptable for larger error patterns, still reaching 36% for 8-bit error patterns.
Keywords: Fault Tolerance, Stack Processor, MPSoC, Journalization, Rollback, Fault Injection, RTL modeling.