
LABORATOIRE INTERFACES CAPTEURS ET MICRO-ÉLECTRONIQUE
Doctoral School of IAEM - Lorraine
Department of Electronics and Electrical Engineering

A dissertation submitted to the University Paul Verlaine - Metz, France,
in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Discipline: Electronic Systems
Specialty: Microelectronics

DESIGN METHODOLOGY OF A FAULT-TOLERANT JOURNALIZED STACK PROCESSOR ARCHITECTURE

by
MOHSIN AMIN

Thesis defended on June 9, 2011

Doctoral Committee:
PROF. LUC HEBRARD, University of Strasbourg, France (President of jury)
PROF. AHMED BOURIDANE, University of Northumbria, Newcastle, UK (Reviewer)
PROF. FERNANDO MORAES, University of PUCRS, Porto Alegre, Brazil (Reviewer)
DR. CAMILLE DIOU, Paul Verlaine University - Metz, France (Co-Supervisor)
PROF. FABRICE MONTEIRO, Paul Verlaine University - Metz, France (Supervisor)

LICM - 7 Rue Marconi, Technopôle, 57070 Metz, France
Tel: +33 (0)3 87 31 56 57 - Fax: +33 (0)3 87 54 73 07 - www.licm.fr


I DEDICATE THIS WORK TO
MY BELOVED BROTHER (LATE) QAISER AMIN
May God give him peaceful rest forever!


Acknowledgements

A PhD thesis is a great experience: working on very stimulating topics and challenging problems, and, perhaps most important for me, meeting and collaborating with extraordinary people. Along with a degree and research skills, I have learned the French language, experienced a new culture, and learned to live in a different climate. I have been in France for five years, yet there is still much more to explore.

First and foremost, many thanks go to Prof. Fabrice MONTEIRO and Dr. Camille DIOU for supervising my PhD thesis and teaching me a lot of new things, for their guidance and support, for all the fruitful discussions, and for their company during the conference trips. I am grateful to them for letting me pursue my research interests with sufficient freedom, while being there to guide me all the same. I am also grateful to the director of the LICM, Prof. Abbas DANDACHE, and to Dr. Camel TANOUGAST for their kind support during my stay at LICM-Metz.

My greetings go to Prof. Ahmed BOURIDANE, Northumbria University, Newcastle, UK, and Prof. Fernando MORAES, University PUCRS, Porto Alegre, Brazil, who honored me by accepting to review this thesis. I am also grateful to the president of the jury, Prof. Luc HEBRARD, University of Strasbourg, France, for presiding over the defense.

I am thankful to my colleague Dr. Abbas RAMAZANI, who guided me a lot during my thesis. I would like to thank my officemates Frédéric, Hussain, Kevin, Mazan, Medhi and Rita for the good times we have had, and I wish good luck to the next ones: Alaa-Aldin, Cédric, David, Luca, Mokhtar, Salah and Said. Some of them are now more than officemates: I would like to express my gratitude and appreciation to Aamir, Armaghan, Fahad, Jawad, KB, Liaquat, Rafiq, Sadiq and Sundar. Special thanks to Sajid Butt for his unconditional friendship, his support, and for reminding me to focus on finishing my PhD.

Last but certainly not least, I owe a great deal to my family for providing me with emotional support during my PhD. Many thanks to my parents, my brother Qasim, my wife Ayesha, and my sister Saba, who all contributed a lot (probably the most) to my life during this period, in many ways. Love to my beloved children Mohammad Abu-Bakar and Aleeza. Finally, special thanks to the Higher Education Commission of Pakistan for funding my PhD thesis.

Thanks, folks!

Mohsin AMIN


Contents

GENERAL INTRODUCTION 7

I. STATE OF THE ART 13

1 Dependability and Fault Tolerance 13
1.1 Problematic . . . . . . . . . . 14
1.1.1 Common Source of Faults and their Consequences . . . . . . . . . . 15
1.2 Basic Concepts and Taxonomy of Dependable Computing . . . . . . . . . . 18
1.2.1 Dependability . . . . . . . . . . 18
1.3 Attributes . . . . . . . . . . 19
1.4 Threats . . . . . . . . . . 20
1.4.1 System Failure . . . . . . . . . . 21
1.4.2 Characteristics of a Fault . . . . . . . . . . 22
1.5 Means . . . . . . . . . . 25
1.5.1 Fault Prevention . . . . . . . . . . 25
1.5.2 Fault Removal . . . . . . . . . . 25
1.5.3 Fault Forecasting . . . . . . . . . . 26
1.5.4 Fault Tolerance . . . . . . . . . . 26
1.6 Techniques Applied at Different Levels . . . . . . . . . . 27
1.6.1 FT Techniques . . . . . . . . . . 27
1.7 Conclusions . . . . . . . . . . 27

2 Methods to Design and Evaluate FT Processors 31
2.1 Error Detection . . . . . . . . . . 31
2.1.1 Hardware Redundancy . . . . . . . . . . 32
2.1.2 Temporal/Time Redundancy . . . . . . . . . . 33
2.1.3 Information Redundancy . . . . . . . . . . 35
2.2 Error Correction . . . . . . . . . . 38
2.2.1 Hardware Redundancy . . . . . . . . . . 38
2.2.2 Temporal Redundancy . . . . . . . . . . 39
2.2.3 Information Redundancy . . . . . . . . . . 39
2.3 Error Recovery . . . . . . . . . . 40
2.4 FT Processor Design Trends . . . . . . . . . . 41
2.5 FT Evaluation . . . . . . . . . . 46
2.5.1 Fault Injection . . . . . . . . . . 46
2.5.2 Error Models . . . . . . . . . . 47
2.5.3 The Fault Injection Framework . . . . . . . . . . 48
2.6 Conclusions . . . . . . . . . . 49

II. QUALITATIVE AND QUANTITATIVE STUDY 53

3 Design Methodology and Model Specifications 53
3.1 Motivation . . . . . . . . . . 53
3.2 Methodology . . . . . . . . . . 54
3.2.1 Concurrent Error Detection: Parity Codes . . . . . . . . . . 54
3.2.2 Error Recovery: Rollback . . . . . . . . . . 56
3.3 Limitations . . . . . . . . . . 58
3.4 Hypothesis . . . . . . . . . . 58
3.5 Design Challenges . . . . . . . . . . 59
3.5.1 Challenge #1: Self Checking Processor Core Requirements . . . . . . . . . . 59
3.5.2 Challenge #2: Temporary Storage Needed: Hardware Journal . . . . . . . . . . 60
3.5.3 Challenge #3: Processor-Memory Interfacing . . . . . . . . . . 62
3.5.4 Challenge #4: Optimal Sequence Duration for Efficient Implementation of Rollback Mechanism . . . . . . . . . . 62
3.6 Model Specifications and Global Design Flow . . . . . . . . . . 63
3.7 Functional Implementation . . . . . . . . . . 63
3.7.1 Model-I . . . . . . . . . . 65
3.7.2 Model-II . . . . . . . . . . 68
3.7.3 Comparison . . . . . . . . . . 70
3.8 Conclusions . . . . . . . . . . 72

4 Design and Implementation of a Self Checking Processor 77
4.1 Processor Design Strategy . . . . . . . . . . 78
4.1.1 Advantages of Stack Processor . . . . . . . . . . 78
4.2 Proposed Architecture . . . . . . . . . . 80
4.3 Hardware Model of the Stack Processor . . . . . . . . . . 82
4.4 Design Challenges in FT Stack Processor . . . . . . . . . . 84
4.4.1 Challenge I: Self Checking Mechanism . . . . . . . . . . 85
4.4.2 Challenge II: Performance Improvement . . . . . . . . . . 85
4.5 Solution-I: Self Checking Mechanism . . . . . . . . . . 87
4.5.1 Error Detecting in ALU . . . . . . . . . . 87
4.5.2 Error Detecting in Register and Data-Path . . . . . . . . . . 92
4.5.3 Self-Checking Processor . . . . . . . . . . 93
4.5.4 Store Sensitive Elements (SE) . . . . . . . . . . 93
4.5.5 Protecting Opcode . . . . . . . . . . 94
4.6 Solution-II: Performance Aspects of Self-Checking Processor Core . . . . . . . . . . 94
4.6.1 Solution-II (a): Multiple-byte Instructions . . . . . . . . . . 94
4.6.2 Solution-II (b): 2-Stage Pipelining to resolve Multi-clock Instruction Execution . . . . . . . . . . 95
4.6.3 Reducing Overhead for Conditional Branches . . . . . . . . . . 96
4.7 Implementation Results . . . . . . . . . . 98
4.8 Conclusions . . . . . . . . . . 100

5 Design of a Self Checking Hardware Journal 103
5.1 Error Detection and Correction in the Journal . . . . . . . . . . 104
5.2 Principle of the technique . . . . . . . . . . 104
5.3 Journal Architecture and Operation . . . . . . . . . . 107
5.3.1 Modes of SCHJ . . . . . . . . . . 109
5.4 Risk of data contamination . . . . . . . . . . 113
5.5 Implementation Results . . . . . . . . . . 115
5.5.1 Minimizing the Size of the Journal . . . . . . . . . . 115
5.5.2 Dynamic Sequence Duration . . . . . . . . . . 119
5.6 Conclusions . . . . . . . . . . 119

6 Fault Tolerant Processor Validation 121
6.1 Design Hypothesis and Properties to be Checked . . . . . . . . . . 122
6.2 Error Injection Methodology and Error Profiles . . . . . . . . . . 122
6.3 Experimental Validation of Self-Checking Methodology . . . . . . . . . . 123
6.4 Performance Degradation due to Re-execution . . . . . . . . . . 126
6.4.1 Evaluating Performance Degradation . . . . . . . . . . 127
6.5 Effect of Error Injection on Rate of Rollback . . . . . . . . . . 130
6.6 Comparison with LEON FT-3 . . . . . . . . . . 131
6.7 Conclusions . . . . . . . . . . 131

GENERAL CONCLUSION AND PROSPECTS 135

A Canonical Stack Computers 139

B Instruction Set of Stack Processor 141
B.1 Data Operations in Stack Processor . . . . . . . . . . 145

C Instruction Set of Pipelined Stack Processor 147

D List of Acronyms 153

E List of publications 157




General Introduction

Nowadays, devices are becoming increasingly sensitive to strikes by high-energy particles. Such a strike has a high chance of causing a single-event upset (SEU) when it hits the surface of a silicon device. This can result in soft errors that emerge as bit flips in memory or as signal noise in combinational logic.

In recent years, microprocessor performance has increased exponentially thanks to modern design trends; however, processors have also become more susceptible to environmental effects [Kop11]. As clock speeds increase and feature sizes decrease, systems become susceptible to ionizing radiation that leaks through the atmosphere. In addition, soft errors may be triggered by environmental factors such as static discharges or fluctuations in temperature and power supply voltage. The occurrence of soft errors in modern electronic systems is continuously increasing [Nic10]. Dependability is therefore an important concern for current and future generations of processor design [RI08].

Conventional approaches to dependable processor design employ space or time redundancy [RR08]. Processor replication has long been used as a fault tolerance (FT) technique against transient faults [Kop04]. It is a costly solution, requiring more than 100% area overhead (and also power overhead), since at least duplication is required for error detection (at least triplication for error correction/masking), plus additional voting circuitry. In practice, it is an expensive way to detect errors at the register level, especially when SEUs are being considered. Software-based temporal approaches have lower hardware overheads and can significantly improve reliability [RI08]. For example, in duplex execution all instructions are executed twice to detect transient errors [MB07]. However, this technique tends to induce significant time overheads, making severe time constraints hard to meet in real-time designs. These approaches may provide robust fault tolerance, but they incur a high penalty in terms of performance, area, and power [RR08].
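The cost asymmetry between duplication and triplication can be illustrated with a toy software model (an illustrative Python sketch under our own assumptions, not the thesis hardware: the ALU is reduced to an addition, and a transient fault to a single bit-flip on one copy's output). Duplication plus a comparator can only detect the mismatch, while triplication plus a majority voter also masks it, at roughly triple the area.

```python
def alu(a, b):
    """Reference ALU operation (addition as a stand-in for the real data path)."""
    return a + b

def faulty_alu(a, b, flip_bit=None):
    """Same ALU, but a transient fault may flip one bit of the result."""
    result = alu(a, b)
    if flip_bit is not None:
        result ^= 1 << flip_bit  # model an SEU as a single bit-flip
    return result

def duplex_detect(a, b, fault_bit=None):
    """Duplication: two copies plus a comparator detect, but cannot correct."""
    r1 = alu(a, b)
    r2 = faulty_alu(a, b, fault_bit)
    return r1 == r2  # False means the comparator flagged an error

def tmr_correct(a, b, fault_bit=None):
    """Triplication (TMR): a majority voter masks one faulty copy."""
    results = [alu(a, b), faulty_alu(a, b, fault_bit), alu(a, b)]
    return max(set(results), key=results.count)  # majority vote over the copies

print(duplex_detect(3, 4))               # True: no fault, outputs agree
print(duplex_detect(3, 4, fault_bit=2))  # False: mismatch detected
print(tmr_correct(3, 4, fault_bit=2))    # 7: the voter masks the faulty copy
```

Note how the duplex version can only report that something went wrong, which is exactly why the thesis pairs cheap detection with a separate recovery mechanism instead of paying for full triplication.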

Explicit redundancy is suitable for mission-critical applications where hardware cost is not an important constraint. However, with rapid technology scaling, today almost every system needs at least some consideration of FT features [FGAD10]. These systems demand more cost-effective FT solutions that may have less coverage than hardware redundancy, but substantial coverage nonetheless [RR08]. Research is therefore needed into alternative, unconventional and cost-effective solutions.

We propose a new hardware/software co-design methodology to tolerate transient faults in the processor. The methodology relies on two main choices: fast error detection and low-cost error recovery. Error detection should be fast so that errors are caught before they reach the system boundaries and cause catastrophic failures, that is, failures where the cost of harmful consequences is orders of magnitude, or even incommensurably, higher than the benefit provided by correct service delivery [LRL04]. Consequently, hardware-based concurrent error detection (CED) has been chosen. To limit the overall cost, we accept a small time penalty in error correction; in this scenario, software-based rollback is employed. It reduces the overall cost compared to hardware-based recovery, and it does not affect overall performance much, because the proposed methodology targets ground-level applications, where errors occur far less often than in space.

A hypothetical dependable memory (DM) is attached to the processor. Moreover, to make rollback fast and to simplify memory management, an intermediate data storage is placed between the processor and the DM. Here, architectural choices are important for the overall methodology to succeed: a processor core with a minimum of internal states to be checked (for error detection) and with load/store access (for rollback recovery) makes the technique effective, that is, inexpensive and fast. The FT processor has been modeled at the VHDL-RTL level. Finally, the processor's self-checking ability and the performance degradation due to re-execution have been tested by artificial error injection in the simulated model.
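The role of the intermediate storage between processor and DM can be sketched in software (an illustrative Python model; the class name `Journal` and its methods are our own shorthand, not the thesis RTL): writes are buffered in the journal, committed to dependable memory only when a sequence ends without a detected error, and discarded wholesale when the sequence rolls back.

```python
class Journal:
    """Toy model of temporary storage between processor and dependable memory (DM)."""
    def __init__(self):
        self.dm = {}        # dependable memory: only validated data lands here
        self.pending = {}   # uncommitted writes of the current sequence

    def write(self, addr, value):
        self.pending[addr] = value  # buffered: not yet visible in DM

    def read(self, addr):
        # The most recent value wins, so check uncommitted data first
        return self.pending.get(addr, self.dm.get(addr))

    def end_sequence(self, error_detected):
        if error_detected:
            self.pending.clear()          # rollback: discard the whole sequence
        else:
            self.dm.update(self.pending)  # commit: data becomes dependable
            self.pending.clear()

j = Journal()
j.write(0x10, 42)
j.end_sequence(error_detected=False)  # clean sequence: 42 reaches DM
j.write(0x10, 99)
j.end_sequence(error_detected=True)   # faulty sequence: 99 is discarded
print(j.read(0x10))  # 42: the DM was never contaminated
```

The key property this sketch demonstrates is the one the methodology relies on: an error detected before commit time never propagates into the DM, so recovery reduces to re-executing the discarded sequence.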

The contributions of this work are as follows. We propose a new methodology based on hardware/software co-design to reach a compromise between protection and time/area constraints. For fast error detection, hardware-based concurrent detection is employed; for low hardware overhead, software-based micro-rollback recovery is used. To reduce the overall area overhead, we employ a stack processor from the MISC class: it has a minimum of internal registers, which results in low-cost error detection, and it is well suited to efficient error recovery. Furthermore, to mask errors from entering the DM, an intermediate temporary data storage is introduced between the processor and the DM.
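To give a flavor of the concurrent detection involved (chapter 3 relies on parity codes), here is a minimal even-parity encoder and checker in Python; the 8-bit word width is an arbitrary choice for the example, not a parameter of the thesis design.

```python
def parity_bit(word, width=8):
    """Even parity: the check bit makes the total number of 1s even."""
    return bin(word & ((1 << width) - 1)).count("1") % 2

def encode(word):
    """Data travels together with its parity bit."""
    return (word, parity_bit(word))

def check(word, p):
    """True if the word is consistent; any single bit-flip trips the check."""
    return parity_bit(word) == p

w, p = encode(0b1011_0010)
print(check(w, p))           # True: word is intact
print(check(w ^ 0b0100, p))  # False: a single bit-flip is detected
print(check(w ^ 0b0110, p))  # True: a double flip escapes simple parity
```

The last line shows the known limitation of single parity: it detects any odd number of flipped bits but misses even-sized upsets, which is consistent with a design aimed at SBUs.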

This thesis is partitioned into six chapters.

Chapter 1: It outlines the background and describes the motivation for on-line error detection and fast correction in embedded microprocessors. It presents the basic concepts and terminologies related to dependable embedded processor design, and further explores the attributes, threats and means to attain dependability. Lastly, the dependability techniques applied at different levels are discussed.

Chapter 2: This chapter presents different redundancy techniques to detect and correct errors. It explores the FT methodologies employed in existing fault-tolerant processors. The last part is dedicated to the validation methodology of a dependable processor.

Chapter 3: This chapter identifies the model specifications and design methodology of the desired architecture. It addresses the overall problem by exploring the design paradigm and the related constraints of the proposed approach. The processor-memory interface is then finalized through different functional implementations.

Chapter 4: The proposed FT processor has two parts: the self-checking processor core (SCPC) and the self-checking hardware journal (SCHJ). This chapter develops the design methodology of the self-checking processor core (SCPC). The processor is chosen from the MISC (minimum instruction set computer) class; we therefore first clarify the reasons for choosing such a specialized processor. Then, the error detection and recovery mechanisms are finalized. Finally, the hardware model of the self-checking processor core is synthesized for an Altera Stratix III FPGA using Quartus II.

Chapter 5: This chapter discusses the hardware design and protection scheme of the self-checking hardware journal (SCHJ), the temporary data storage that masks errors from entering the dependable main memory. Finally, the overall hardware model of the FT processor is synthesized for an Altera Stratix III FPGA using Quartus II.

Chapter 6: Lastly, the FT model is evaluated in the presence of errors. The evaluation is based on the self-checking behavior and on the performance degradation in the presence of errors. The obtained results validate the protection techniques proposed in chapter 3.

Finally, the last section discusses conclusions and perspectives.





I. STATE OF THE ART


Chapter 1

Dependability and Fault Tolerance

It is a complex task to design embedded systems for critical real-time applications. Such systems must not only guarantee to meet the hard real-time deadlines imposed by their physical environment, but also guarantee to do so dependably, despite the occurrence of faults [Pow10]. The need for fault-tolerant (FT) computing has become more and more important in recent years [Che08], and FT will likely become the norm. In the past, FT was the exclusive domain of very specialized applications such as safety-critical systems. However, modern design trends are making circuits more sensitive, and now all real-time systems should have at least some FT features. FT is therefore an important need of the time.

Modern society hinges on automated industry. In some sensitive industrial sectors, even a single fault can result in a million-dollar loss (e.g., in banking and stock markets) or in loss of life (e.g., in an air traffic control system). Industries such as automotive, avionics and energy production require availability, performance and real-time responsiveness to avoid catastrophic failures. In table 1.1, the cost per hour of failure of the control system is compared across application domains to show the importance of FT in the industrial sector.

Table 1.1: Cost/hour for failure of control system [Pie07]

Application Domain            Cost (Euro/hour)
Cell-phone Operator           40k
Airline Reservation           90k
ATM Machine (Banking)         2.5M
Automobile Assembling Unit    6M
Stock Transaction             6.5M

Most of these systems (table 1.1) rely on embedded systems, and the design of an FT processor is one of the basic requirements of dependable embedded applications. Accordingly, we propose to design a fault-tolerant processor to tolerate the transient faults that result from SEUs. In this introductory chapter, we address the basic concepts and terminologies related to fault-tolerant computing. The chapter is divided into three main parts: the first part discusses the current trends that increase the probability of faults, together with the sources and consequences of those faults; the second part discusses the concepts of dependable computing; and the third part explores the means to attain dependability.

1.1 Problematic

For many years, researchers focused on performance issues. Through their tireless efforts, fueled by deep technology scaling, overall performance has improved greatly over the last few years. However, the gains promised by Moore's law are approaching saturation, while dependability is decreasing due to ever-increasing physical faults. Several trends in the search for high performance have increased the need for dependable architecture design; some of them are discussed below.

Smaller Technologies / Design Scales

Although the scaling of transistors and wires has steadily improved processor performance and reduced cost, it has also adversely affected long-term chip lifetime reliability. When a transistor is exposed to high-energy ionizing radiation, electron-hole pairs are created [SSF+08]. Transistor source and diffusion nodes accumulate charge, which may invert the logic state of the transistor [Muk08]. With device dimensions projected to shrink below 18 nanometers by 2015, these effects significantly threaten next-generation technologies [RYKO11]. Regarding transient faults, smaller devices hold less charge in the nodes that store register states, which makes them more sensitive to noise. As the noise margin decreases, the probability that a high-energy particle strike can disturb the charge on a device increases, which in turn increases the probability of transient faults. The lower voltages used for power-efficiency reasons will further increase the susceptibility of future chips [FGAD10].

More Transistors per Chip

More transistors require more wires to connect them, resulting in more chances of faults, both during fabrication and during operation of such devices. Modern processors are more prone to faults because of their greater number of transistors and registers. Moreover, temperature is another factor causing transient and permanent faults: the more devices on a chip, the more power is drawn from the supply. Higher supply power per unit area increases the leakage power dissipated per unit area, which raises the temperature and, with it, the probability of errors.

Complex Design

Today's processors have become more complicated than in the past, which increases the probability of design faults and also makes the debugging of errors more difficult. Research effort is oriented towards alternative methods of increasing system performance without increasing the sensitivity of the circuit, but unfortunately a bottleneck has been reached: the alternative solutions are more complex and make fault debugging an even harder task.

In short, devices are becoming more sensitive to ionizing radiation (which may cause soft errors), to operating-point variation through temperature or supply-voltage fluctuations, and to parasitic effects that result in static leakage currents [ITR07]. Changing parameters such as dimensions, noise margin and supply voltage can no longer be fruitfully exploited to increase performance. In the near future, because of small feature sizes and high frequencies, the failure trend in modern computing systems will worsen further, since the saturation level has already been reached. This leads to a rising soft-error rate in logic and memory chips [Bau05], which affects reliability even at sea level [WA08]. To assure circuit integrity, FT must be an important design consideration for modern circuits: a dependable system must embed a tolerance mechanism against possible errors.

1.1.1 Common Source of Faults and their Consequences

Today, one significant threat to the reliability of digital circuits is the sensitivity of logic states to various noise sources, especially in specific environments such as space or nuclear systems, where the collision of charged particles can result in transient faults. Such particles include cosmic rays produced by the sun and alpha particles produced by the disintegration of radioactive isotopes.

For space applications, FT is a mandatory requirement due to the severe radiation environment. As manufacturing technology is scaled towards finer geometries, the probability of SEUs is increasing. With present technology, dependability is not only required for critical applications: even for commodity systems, dependability needs to be above a certain level for the system to be useful for anything [FGAD10]. Radiation-induced soft errors are becoming an increasingly important threat to the reliability of digital circuits even in ground-level applications [Nic10].

Transient faults can be caused by on-chip perturbations, such as power supply noise, or by external noise [NX06]. Researchers have identified three common sources of soft errors in semiconductors. First, alpha particles, identified in the 1970s, proved to be the main source of soft errors in computer systems, especially DRAM [MW07]. Second, high-energy neutrons from cosmic radiation can induce soft errors in semiconductor devices via the secondary ions produced by neutron reactions with silicon nuclei [ZL09], as shown in figure 1.1, where a single high-energy neutron has disturbed the internal charge distribution of the whole device. Third, soft errors are induced by low-energy cosmic neutron interactions with the isotope boron-10 in IC materials, specifically in Boro-Phospho-Silicate Glass (BPSG), widely used to form insulator layers in IC manufacturing. This was recently shown to be the dominant source of soft errors in SRAMs fabricated with BPSG [WA08].

Figure 1.1: An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs in its wake, which causes a charge disturbance [MW04]

Figure 1.2 represents the sequence of events that may occur once an energetic particle hits the substrate, provoking ionization. This ionization may generate electron-hole pairs that create a transient current, injected into or extracted from the struck node. Depending on the amplitude and duration of this current pulse, a transient voltage pulse may appear at the hit node: this is characterized as the fault. A fault latency period defines the time needed for that fault to become an error in the circuit. This only occurs if the transient voltage changes the logic state of a storage element (flip-flop), generating a bit-flip. The bit-flip may generate an error if the content of that flip-flop is used in a subsequent operation. From the application point of view, however, this error does not necessarily manifest as a failure of the system; an error latency defines the time needed for the error to become a failure.

Figure 1.2: Strike of a high-energy particle resulting in error(s) (ionization → transient current → transient voltage pulse → fault → error)

The common term for any measurable effect resulting from the deposition of energy from a single ionizing particle strike is a Single Event Effect (SEE). The most relevant SEEs are classified in figure 1.3.


Figure 1.3: Classification of faults on the basis of single event effect (SEE) [Pie07]. SEEs fall into two groups. Soft errors: SET (Single Event Transient), SEU (Single Event Upset), which appears as an SBU (Single Bit Upset) or an MBU (Multi Bit Upset), and SEFI (Single Event Functional Interrupt). Hard errors: SELU (Single Event Latch-Up) and SEGR/SEB (Single Event Gate-Rupture/Burnout).

Single Event Upset (SEU)<br />

The SEU is mostly a soft error caused by the transient signal induced by a single energetic particle<br />

strike [JES06]. In [Bau05], it is said to occur when a radiation event causes a charge disturbance large<br />

enough to reverse or flip the data state of a memory cell, register, latch, or flip-flop. The error is called<br />

soft because the device is not permanently damaged by the radiation and when new data is written to<br />

the struck memory cell, the device will store it correctly [Bau05].<br />

The SEU is a very serious problem because it is one of the major sources of failure in digital<br />
systems [Nic10]. It will likely pose serious threats to the future of robust computing [RK09] and<br />
requires serious attention. It may manifest itself as a Single Bit Upset (SBU) or a Multiple Bit Upset<br />

(MBU).<br />

Single Bit Upset (SBU) and Multiple Bit Upset (MBU)<br />

An SBU is a single radiation event that results in one bit flip whereas an MBU is a single radiation<br />

event that results in more than a single bit being flipped. Each bit flip is essentially an SEU; SBUs<br />
and MBUs are therefore considered subsets of SEUs. SBUs usually constitute the major fraction<br />
and MBUs a small fraction of the total number of observed SEUs. However, the MBU<br />
probability is steadily increasing as geometries shrink [BCT08, QGK+06]. This thesis presently<br />
addresses SBUs; in the future, the methodology will be further extended to address MBUs.
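To make the SBU/MBU distinction concrete, the following illustrative sketch (not part of the thesis) models an upset as an exclusive-OR of the affected bit(s) into a stored word; the function names and the 32-bit width are assumptions of the example:

```python
WORD_MASK = 0xFFFFFFFF  # assume a 32-bit storage word

def inject_sbu(word: int, bit: int) -> int:
    """Model a Single Bit Upset: flip exactly one bit of the word."""
    return (word ^ (1 << bit)) & WORD_MASK

def inject_mbu(word: int, bits) -> int:
    """Model a Multiple Bit Upset: one event flips several bits."""
    for b in bits:
        word = inject_sbu(word, b)
    return word

reg = 0x0000000F
print(hex(inject_sbu(reg, 31)))      # SBU on bit 31 -> 0x8000000f
print(hex(inject_mbu(reg, [0, 1])))  # adjacent-bit MBU -> 0xc
```

Flipping the same bit twice restores the original word, which is why XOR is a convenient fault-injection model.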



Single Event Transient (SET)<br />

An SET is a transient pulse in the logic path of an IC. Similar to an SEU, it is induced by a charge<br />

deposition of a single ionizing particle. An SET can be propagated along the logical path where it<br />

was created. It may be latched into a register, latch or flip-flop, causing its output value to change.<br />

Single Event Functional Interrupt (SEFI)<br />

Xilinx [BCT08] defines an SEFI as an SEE that interferes with the normal operation of<br />
a complex digital circuit. As with the previously mentioned SETs, further investigation of SEFI rates<br />
is not considered in this thesis.<br />

Single Event Latch-Up (SELU)<br />

A spurious current spike induced by an ionizing particle in a transistor may be amplified by the<br />
large positive feedback of the parasitic thyristor and cause a virtual short between Vdd and ground, resulting<br />
in an SELU [NTN+09]. SELUs are not addressed in this thesis.<br />

Single Event Gate Rupture (SEGR) and Single Event Burnout (SEB)<br />

Single Event Gate Rupture (SEGR) is a single ion induced condition in power MOSFETs that may<br />

result in the formation of a conducting path in the gate oxide. Single Event Burnout (SEB) is a condition<br />

which can cause device destruction due to a high current state in a power transistor. Both of them are<br />

permanent faults and not addressed in this thesis.<br />

1.2 Basic Concepts and Taxonomy of Dependable Computing<br />

This part defines the basic terminology related to dependable computing. The terminology is<br />
largely extracted from [LB07, Lap04]. In this section, we identify the important methods and their<br />

characteristics to make a system tolerant to faults.<br />

1.2.1 Dependability<br />

Dependability is the ability to deliver service that can justifiably be trusted [LRL04]. The defini-<br />

tion is focused on trust. In other words, the dependability of a system is the ability to avoid service<br />

failures that are more frequent and more severe than acceptable. Dependability relies on a set of<br />

measures, applied through all phases of product life, to ensure that the functionality will be maintained<br />
while the system accomplishes the mission for which it has been designed. According to Laprie [LB07], the<br />

dependability of a system is the property that places a justified confidence in the service it delivers.


1.3 Attributes<br />

Figure 1.4: Dependability Tree: dependability and security are characterized by their attributes (availability, reliability, safety, confidentiality, integrity, maintainability), the threats (faults, errors, failures) and the means (fault prevention, fault tolerance, fault removal, fault forecasting)<br />

Dependability is a vast concept based on various attributes as shown in figure 1.4.<br />

• Availability: it is the readiness for correct service;<br />

• Reliability: it is the continuity of correct service;<br />

• Safety: it is the absence of catastrophic consequences on the user(s) and the environment;<br />

• Integrity: it is the absence of improper system alterations;<br />

• Maintainability: it is the ability to undergo modifications.<br />

Moreover, when dealing with the security issues, an additional attribute called confidentiality is<br />

also considered as shown in figure 1.4. Confidentiality is the absence of unauthorized disclosure of<br />

information. Some other attributes related to security are availability and integrity, which have already<br />

been discussed with dependability attributes [VK07].<br />

It is difficult to fully respect all of the dependability attributes at the same time in a system, because doing so<br />
can increase the cost, power consumption and hardware area of the system. So, one respects these<br />
attributes according to the system needs. It has been stated in [FGAM10] that it is impossible to<br />
design a 100% dependable system. For example, in order to improve the availability of a component,



sometimes one overlooks maintenance and the safety decreases accordingly. Here, two types of<br />

systems have been considered:<br />

• a web server<br />

• a nuclear reactor<br />

Let us see which dependability and security attributes are more important for each of these systems.<br />
In a university web server, availability is the important attribute because every student needs to access it regularly, whereas for a nuclear reactor, attributes like availability, reliability, safety and maintainability are important considerations. [Pie07] sums up the importance of these attributes in table 1.2, where 4 points are given to very important attributes and 1 point to the least important. Hence, as table 1.2 shows, each application has its own dependability and security<br />
requirements.<br />

Table 1.2: Dependability attributes for a university web server and a nuclear reactor [Pie07], where<br />
attributes are rated from very important (4 points) to least important (1 point)<br />

Attributes        University Web Server   Nuclear Reactor<br />
Availability              3                      4<br />
Reliability               1                      4<br />
Safety                    1                      4<br />
Confidentiality           2                      1<br />
Integrity                 2                      3<br />
Maintainability           2                      4<br />

1.4 Threats<br />

There are three fundamental threats to a dependable computer: (i) faults, (ii) errors and<br />
(iii) failures. A fault is defined as an erroneous state of hardware or software resulting from failures of<br />
components, physical interference from the environment, operator error, or incorrect design [Pie06].<br />
A fault is active when it produces an error; otherwise it is considered a dormant/sleeping fault. An<br />
active fault can be an internal fault that was previously dormant. An error is itself caused by a fault,<br />
and a failure occurs when there is a deviation from correct service due to some error. All three have<br />
a cause-and-effect relationship between them (as shown in figure 1.5). In general, an active fault causes<br />
an error, which can propagate from one place to another inside the system. In figure 1.6, an error produced<br />
in the processor has been transferred to main memory. Furthermore, if an error reaches the boundaries<br />
of the system it may result in the failure of the system, causing the service provided to deviate from<br />
its specification [GMT08] (see figure 1.5). If the initial system is a sub-system of a global system,<br />
then the failure can cause a fault in the global system. In this way the chain of fault, error and failure keeps on<br />
progressing.


Figure 1.5: Fault, error and failure chain: activation of a fault produces an error, whose propagation leads to a failure, which can in turn activate a fault in the enclosing global system<br />

Figure 1.6: Error propagation from processor to main memory through READ/WRITE operations<br />

An SEU may result in system failure, as in figure 1.7: a high-energy neutron strike (caused by<br />
cosmic rays) on a VLSI circuit results in an SBU (active fault), which provokes an error in the<br />
traffic control system and finally results in the failure of the system.<br />

1.4.1 System Failure<br />

A correct service is delivered by a system when it respects its functionality, whereas a system<br />
failure is a deviation of the service delivered by the system from its specification [Pie06]. Such<br />
a deviation can be in the form of incorrect service, or no service at all [GMT08]. The<br />
transition from incorrect to correct service is a service restoration (see figure 1.8).<br />

The service failure may occur because the system is no longer respecting its functionality, or because<br />
the functional specifications were not correctly defined for that system under certain conditions. On<br />

the other hand, FT techniques allow a system to continuously deliver its service according to its<br />

correct functionality even in the presence of faults.


Figure 1.7: A single fault (a signal stuck at "always A=1") caused the failure of a traffic control system<br />

Figure 1.8: Service failure: the transition from correct service to incorrect service; the transition back is a service restoration<br />

1.4.2 Characteristics of a Fault<br />

Faults can be characterized by five attributes: cause, nature, duration, extent and value.<br />

Figure 1.9 illustrates each of these basic characteristics of faults. They are discussed in the following<br />

section.<br />

Cause<br />

A fault can be caused by four salient problems:<br />

1. Specification Mistakes: These include incorrect algorithms, architectures, or incorrect design


specifications, as in row 1 of figure 1.10, where a fault is caused by the wrong interconnection between the two systems.<br />

Figure 1.9: Fault characteristics: cause (specification mistakes, implementation, external disturbances, component defects), nature (software: HDL, programming; hardware: logical or electronic CMOS, digital or analog), duration (transient, intermittent, permanent), extent (local, global) and value (determinate, indeterminate)<br />

2. Implementation Mistakes: The implementation can introduce faults due to poor design, poor component selection, poor construction, or hardware/software coding mistakes, as in rows 2 and 3 of figure 1.10. Row 2 shows a programming fault in which c is incremented if a is less than b, but c will not be incremented if a is equal to b, which is a programming error. Similarly, in row 3 of figure 1.10, the HDL statement r1 loads the result of the addition a+b into the register c.<br />

3. Component Defects: These include random device defects, manufacturing imperfections, and component wear-out. The defect can be in a logical component or in electronic CMOS, as shown in rows 4 and 5 of figure 1.10.<br />

4. External Disturbances: These include operator mistakes, radiation, electromagnetic interference, and environment extremes, as in row 6 of figure 1.10. Moreover, due to the reduced noise margin, a ‘1’ can be read as a ‘0’ if its value is lower than the threshold (Vm), as shown in row 7 of figure 1.10.


Figure 1.10: Examples of faults at different levels, each shown against the correct state: (1) a specification mistake (wrong interconnection), (2) a programming fault (if a <= b then c := c + 1), (3) an HDL fault (a register transfer r1), (4) and (5) component defects, (6) an external disturbance and (7) a fault due to a lowered noise margin<br />


• Permanent: A permanent fault, once it occurs, persists until the end of the execution. Even a single permanent fault can create multiple errors until being repaired. Such errors are called hard errors.<br />

• Intermittent: It is a fault which appears, disappears, and reappears repeatedly within a very short period. An intermittent fault can occur repeatedly but not continuously for a long time in a device.<br />

The errors in modern computers may result from permanent, intermittent and transient faults. However, transient faults occur considerably more often than permanent ones, and are much harder to detect [RS09]. The ratio of transient to permanent faults can vary between 2:1, 100:1 or higher, and is continuously increasing [Kop04].<br />

Extent<br />

The fault extent specifies whether the fault is localized to a given hardware or software module or<br />
whether it globally affects the hardware, the software, or both.<br />

Value<br />

The fault value can either be determinate or indeterminate. A determinate fault is one whose<br />
status remains unchanged over time unless there is an external action upon it, whereas an<br />

indeterminate fault is one whose status at some time t may be different from its status at another time.<br />

1.5 Means<br />

There are four means to attain dependability: fault prevention, fault tolerance, fault removal and<br />

fault forecasting. Fault prevention and fault tolerance aim to provide the ability to deliver a service<br />

that can be trusted, while fault removal and fault forecasting aim to reach confidence in that ability by<br />

justifying that the functional and the dependability and security specifications are adequate and that<br />

the system is likely to meet them [LRL04, LB07].<br />

1.5.1 Fault Prevention<br />

Fault prevention is the ability to avoid the occurrence or introduction of faults. Fault prevention<br />

includes any technique that attempts to prevent the occurrence of faults. It can include design reviews,<br />

component screening, testing and other quality control methods.<br />

1.5.2 Fault Removal<br />

Fault removal is the ability to lessen the number and severity of faults. It can be conducted<br />

during corrective or preventive maintenance processes. Corrective maintenance aims to remove faults


that have already produced errors and starts after error detection, while preventive maintenance is aimed at<br />

removing faults before they might have caused errors [LB07].<br />

1.5.3 Fault Forecasting<br />

It is the ability to estimate the present number, the future incidence, and the likely consequences<br />

of faults. It is conducted by performing an evaluation of the system behavior with respect to fault<br />

occurrence or activation; it has two aspects, qualitative and quantitative. The main approaches<br />
to probabilistic fault forecasting, aimed at deriving probabilistic estimates, are modeling and testing<br />

[LB07].<br />

1.5.4 Fault Tolerance<br />

It is intended to preserve the delivery of correct service in the presence of active faults [ALR01].<br />

Ideally, a FT system is capable of executing its tasks correctly regardless of faults. However, in<br />
practice no one can guarantee the flawless execution of tasks under all circumstances. Real<br />
FT systems are designed to tolerate the faults most likely to occur. In this work FT has<br />
been addressed; it rests on three pillars, which are fault masking, error detection and error correction/recovery.<br />

Fault Masking<br />

Fault masking hides the effects of failures by ensuring that redundant information outweighs<br />
the incorrect information [Pie06]. It is a structural redundancy technique that completely masks faults<br />
within the system's redundant modules. A number of identical modules execute the same functions, and<br />
their outputs are voted on to remove errors created by a faulty module; e.g., Triple Modular Redundancy<br />
(TMR) is a commonly used fault-masking technique.<br />

Through fault masking, we achieve dependability by hiding the faults that occur. It prevents the<br />
effects of faults from spreading throughout the system. It can tolerate software and hardware faults,<br />
as shown in figure 1.11. Such a system does not need error detection and correction to maintain system<br />
dependability. Fault masking has not been directly employed in this thesis. However, TMR will be<br />
used for comparison in later chapters.<br />
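As an illustration of fault masking (not taken from the thesis), a TMR voter can be written as a bitwise majority function over the three module outputs; a single faulty replica is outvoted by the two fault-free copies:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit follows at least two modules."""
    return (a & b) | (b & c) | (a & c)

module_out = 0b1010                 # fault-free output of each replica
faulty_out = module_out ^ 0b0100    # one replica suffers a bit-flip

# The voter masks the error: the two correct replicas outvote the faulty one.
print(bin(tmr_vote(module_out, faulty_out, module_out)))  # 0b1010
```

Note that no error signal is produced: the fault is masked silently, which is exactly why pure TMR needs no separate detection and correction stage.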

Error Detection<br />

If fault masking is not employed then error detection may be employed in a FT system. Error<br />

detection is the building block of a FT system, because a system cannot tolerate an error if it is not<br />

known to it. Error detection mechanisms form the basis of an error resilient system as any fault<br />

during operation needs to be detected first before the system can take a corrective action to tolerate<br />

it [LBS+11]. Even if a system cannot recover from the detected error, it can at least halt the process<br />
or inform the user that an error has been detected and that the results are no longer reliable.


1.6. TECHNIQUES APPLIED AT DIFFERENT LEVELS 27<br />

Error Correction/Recovery<br />

Detecting an error is sufficient for providing safety, but we would also like the system to recover<br />

from the faulty states. Recovery hides the effects of the error from the user. After recovery, the system<br />

can resume operation and ideally remain live. Error recovery is an important feature with respect to<br />
the two attributes of reliability and availability, because both metrics require the system<br />
to recover from its errors without user intervention.<br />
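The idea of backward error recovery can be sketched as a simple checkpoint/rollback mechanism (an illustrative sketch only, with assumed names; the journal-based recovery developed later in this thesis is more elaborate):

```python
import copy

class RecoverableSystem:
    """Minimal backward recovery: commit checkpoints, roll back on error."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        """Validate the current state as the new recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the erroneous state and resume from the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

machine = RecoverableSystem({"acc": 0, "pc": 0})
machine.state.update(acc=42, pc=8)
machine.commit()                  # error-free so far: take a checkpoint
machine.state.update(acc=7)       # an error is detected afterwards
machine.rollback()                # recover without user intervention
print(machine.state)              # {'acc': 42, 'pc': 8}
```

The recovery is transparent to the user: after rollback the system resumes from a state known to be error-free.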

Error detection and recovery are addressed in this thesis; they will be discussed in detail in chapter 2. Similarly, various techniques of error detection (in section 2.1) and correction (in section 2.2)<br />

are also discussed.<br />

1.6 Techniques Applied at Different Levels<br />

Figure 1.11 illustrates the dependability techniques applied at different levels in a hardware and<br />

a software system in which fault avoidance (fault prevention) is the primary method to improve the<br />

system dependability. It may be taken into account through hardware or software implementations.<br />

The fault avoidance in a hardware based system can be achieved by preventing specification and<br />

implementation faults, component defects and external disturbances, while in a software based system<br />

it requires prevention of specification and implementation faults. On the other hand, fault masking is<br />

a technique used to ensure dependability, by masking the faults. TMR is a well-known example of<br />

this technique. If fault masking is not applied, then FT is a practical choice to overcome errors.<br />

1.6.1 FT Techniques<br />

Fault tolerant techniques for integrated circuits can be applied at different moments in the circuit<br />

design flow. They can be applied in the electrical design phase, for instance through transistor dimensioning, transistor<br />
redundancy and the addition of electrical sensors. Other techniques can be added at the logic design step, such<br />
as hardware and time redundancy in the logic blocks and in the software application.<br />
Figure 1.12 is a further extension of the previously discussed figure 1.2. The figure represents the different<br />
phases at which faults can be tolerated (detected and corrected); in each phase a different fault tolerant technique can be<br />
used. We address fault tolerance at the hardware redundancy and self-checking levels, which are the<br />
two higher levels (shown as ‘c’ and ‘d’ in figure 1.12).<br />

1.7 Conclusions<br />

The goal of this chapter was to introduce the concepts of dependability in embedded systems. In<br />

fulfilling this objective, we have introduced the main issues related to the design and analysis of fault<br />

tolerant systems. Here, we have discussed different types of faults and their characteristics because


our final objective is the design of a fault tolerant computing system against single event effects such<br />
as SEUs (Single Event Upsets).<br />

Figure 1.11: Dependability techniques. Fault avoidance addresses specification mistakes, implementation, external disturbances and component defects for hardware faults, and specification mistakes and implementation for software faults; faults that are not avoided are masked by fault masking or handled by fault tolerance (error detection and recovery) before the resulting errors lead to a system failure<br />

In addition, this chapter has addressed the dependability issues raised by non-permanent disturbances.<br />
Our goal is to propose a new design methodology for dependable processor architectures.<br />

Consequently, in chapter 2, we will discuss some existing methodologies of detecting and correcting<br />

errors.


Figure 1.12: Sequence of events from ionization to failure (ionization, transient current, transient voltage pulse, fault effect, flip-flop error, failure) and a set of fault tolerant techniques applied at different times (a to e) [Pie07]: sensors (detectors), time redundancy, hardware redundancy, error correction codes, self-checking mechanisms with recovery, and redundant/spare components. Fault tolerance at levels ‘c’ and ‘d’ of the figure is addressed in this thesis.<br />




Chapter 2<br />

Methods to Design and Evaluate FT<br />

Processors<br />

The goal of FT techniques is to limit the effects of a fault, which means increasing the probability<br />
that an error is tolerated by the system. A common feature of all FT techniques is the use of<br />

redundancy. Redundancy is simply the addition of hardware resources, or time beyond what is needed<br />

for normal system operation [Poe05]. It can be hardware (some hardware modules are replicated),<br />

time (parts of a program are executed multiple times), information (the circuit or program has a<br />

redundancy of information) or a mixture of these three solutions.<br />

Traditional solutions involving excessive redundancy are too expensive in area, power, and performance [BBV+05], while cheaper approaches do not provide the necessary fault detection and correction<br />

abilities. Fault-tolerant embedded systems have to be optimized in order to meet time and area con-<br />

straints [PIEP09]. Therefore, special attention is required when choosing redundancy techniques for<br />

critical applications.<br />

Accordingly, this chapter presents a comparison of the existing FT techniques in terms of error<br />
detection and correction ability, time delays and their hardware overheads. From these comparisons<br />
we will identify the techniques that can effectively fulfill our design objectives.<br />

The later part of the chapter explores the redundancy techniques employed in different FT processors.<br />
The last section addresses the evaluation methods used to check the effectiveness of FT<br />
methodologies in a processor.<br />

2.1 Error Detection<br />

Error detection originates an error signal or message within the system. It has been previously<br />

discussed in section 1.5.4. It can be based on preemptive detection or concurrent checking. Preemptive<br />
detection is mostly an offline technique: it takes place while normal service delivery is suspended and<br />
checks the system for latent errors and dormant faults, whereas concurrent detection is an online technique<br />
and takes place during normal service delivery [ALR01]. Similarly, Bickham defines concur-

rent error detection (CED) as a process of detecting and reporting errors while, at the same time,<br />

performing normal operations of the system [Bic10].<br />

CED techniques are widely used to enhance system dependability [HCTS10, CTS + 10, WL10].<br />

The basic principle of CED techniques has been summed up in [MM00]: a system is considered<br />
which realizes a function (f) and produces an output in response to an input sequence. A CED<br />
scheme generally contains another unit, which independently predicts some special characteristic of<br />
the system output for every input sequence. Finally, a checker unit compares the two outputs to<br />
produce an error signal. The architecture of a general CED scheme is shown in figure 2.1.<br />

Figure 2.1: General architecture of a concurrent error detection scheme [MM00]: the function (f) and an output-characteristics predictor both receive the input; a checker compares the output with the predicted output characteristics and raises an error signal<br />

Several CED based redundancy techniques have been proposed and used commercially for de-<br />

signing reliable computing systems [HCTS10, SG10]. They have been classified into three classes:<br />
hardware redundancy, time redundancy, and information redundancy. A FT system uses one<br />
or more of them. These techniques mainly differ in their error-detection capabilities and the con-<br />

straints they impose on the system design. In the next section, we will explore the commonly used<br />

error detection techniques.<br />

2.1.1 Hardware Redundancy<br />

Hardware redundancy is a commonly used approach [Bic10]. It refers to the addition of extra<br />
hardware resources, such as doubling the system and using a comparator at the output to detect errors.<br />
Here, consideration is given to the structure of the circuit and not to its functionality. It is equally<br />
effective for transient, timing and permanent faults. However, the area and power requirements are<br />
quite significant. It can be classified into two sub-types: (i) duplication with comparison and (ii) duplication<br />
with complement redundancy.<br />

Duplication with comparison (DWC) / dual modular redundancy (DMR) [JHW+08] is a simple<br />
and easy-to-implement error detection technique (see figure 2.2). It has a good error detection capability:<br />
theoretically, it can detect 100% of all possible errors by running all operations on two<br />
copies of a component and comparing the results [MS07]. However, it cannot detect design bugs,<br />
errors in the comparator, or combinations of simultaneous error(s) in both modules.<br />
Replication can be performed at different granularities (units vs. cores), but always comes at a considerable<br />
hardware cost (more than 200%). A classic example of DMR is the IBM S/390 mainframe<br />
processor [SG10], where the I-unit (fetch and decode units) and E-unit (execution unit) are duplicated,<br />
and their signals compared for transient fault detection.<br />
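Behaviourally, DWC can be sketched as follows (an illustrative model with assumed names, not the IBM implementation): two copies of the same function process the same input, and a comparator derives the error signal from any mismatch:

```python
def dwc(f_main, f_dup, x):
    """Run two copies of a function on the same input and compare them."""
    out_main, out_dup = f_main(x), f_dup(x)
    return out_main, out_main != out_dup   # (output, error signal)

square = lambda v: v * v
faulty_square = lambda v: (v * v) ^ 0b100  # duplicate with an injected bit-flip

print(dwc(square, square, 6))         # (36, False): no error detected
print(dwc(square, faulty_square, 6))  # (36, True): mismatch raises the error signal
```

As the text notes, a design bug shared by both copies, or an error in the comparator itself, would go undetected because both runs produce the same wrong answer.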

Figure 2.2: Duplication with comparison (DWC): two copies of F receive the same input; a comparator compares their outputs and raises an error signal<br />

There is another, complementary technique called duplication with complement redundancy (DWCR)<br />
[Jab09]. This technique is similar to DWC, but here the input signals, the output control signals and<br />
the internal data signals of the two modules are of opposite polarity, so that a single disturbance does not produce<br />
simultaneous identical errors in both modules and thus cause a system failure.<br />
Here as well, the area and power overhead is more than 200%, and this<br />
method increases the complexity of the design compared to a simple duplication. This technique<br />
is used in the dual-checker rail (DCR), where the two outputs are mutually inverted if there is no error; such schemes are<br />
sometimes employed in controllers.<br />

2.1.2 Temporal/Time Redundancy<br />

This type of redundancy technique requires a single unit to perform an operation twice, one<br />
execution followed by another. If there is a difference between the two computations,<br />
an error exists [AFK05]. In this approach there is a penalty in terms of extra time;<br />
however, the area penalty is smaller than for DMR. Some additional hardware is still required for the comparator<br />
and for additional temporary storage. It is a time replication technique, with no<br />
consideration given to the functionality of the circuit.<br />

In this scheme, intermittent and transient faults are detected (as shown in figure 2.3), but permanent<br />
faults are not. For permanent fault detection, the circuit is modified as shown in<br />
figure 2.4: the computation using the input data is first performed at time t. The<br />
result of this computation is then stored in a buffer. The same data is used to repeat the<br />
computation, using the same functional block, at time t + δt. However, this time the input data


Figure 2.3: Time redundancy for temporary and intermittent fault detection: the same functional block F processes the input at times t and t + δt; a buffer holds the first result and a comparator raises an error signal on mismatch<br />

is first encoded in some manner. The result of the computation is then decoded and compared<br />
to the result produced before. Any discrepancy reveals a permanent fault in the<br />
functional block.<br />

Figure 2.4: Time redundancy for permanent error detection: the input is used directly at time t and through an encoder at time t + δt; after decoding, a comparator checks the buffered first result against the re-computation and raises an error signal<br />

An alternative approach is redundant execution with shifted operands (RESO) [PF06], where some instructions are executed redundantly with shifted operands on the same functional units. Shifting the result back by the same amount yields the original result computed with un-shifted operands. Re-executing instructions detects transient faults, whereas re-executing with shifted operands also detects permanent faults. The scheme only works when the function possesses the required properties, such as linearity.<br />
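The RESO idea can be sketched as follows (Python integers are arbitrary-precision, so the shifted addition cannot overflow; a hardware implementation must reserve extra bit positions):<br />

```python
def reso_check(a: int, b: int, op, shift: int = 1):
    """RESO: execute `op` once normally and once with both operands
    shifted left; for a linear operation such as addition, shifting
    the second result back must reproduce the first result."""
    normal = op(a, b)
    recomputed = op(a << shift, b << shift) >> shift
    return normal, normal == recomputed

result, ok = reso_check(21, 13, lambda x, y: x + y)
assert result == 34 and ok
```

Because the operands occupy different bit positions in the two runs, a permanent fault in one bit slice corrupts the two results differently, which the comparison then exposes.<br />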

Time redundancy directly affects system performance, although the hardware cost is generally lower than that of hardware redundancy; temporal-redundancy-based systems are therefore comparatively slower. To overcome this issue, many systems use pipelining to hide the latency from the client. Temporal redundancy offers no energy benefit either: it uses roughly twice as much active energy as a non-redundant unit.



2.1.3 Information Redundancy<br />

The basic idea behind an information redundancy scheme is to add redundant information to the data being transmitted, stored, or processed, in order to determine whether errors have been introduced [IK03]. Data are protected through mathematical encoding and recovered by decoding (as in figure 2.5). The encoding and decoding circuitry adds delay, making such schemes slower than DMR, but their area overhead is much lower. In coding, consideration is given to the information stored, and sometimes to the functionality of the circuit, but not to the structure of the circuit. Typically, information redundancy is used to protect storage elements (memory, caches, register files, etc.) [HCTS10], e.g. in the Power6 and Power7 [KMSK09]. These codes are classified by their detection and correction capability, code efficiency, and complexity. In this section, we discuss only error detecting codes.<br />

Figure 2.5: Information redundancy principle<br />

Error detecting codes (EDC) have lower hardware overhead than error correcting codes. There are different EDCs, e.g. parity, Borden, Berger, and Bose codes. We will not go into much detail, but their salient features are compared below.<br />

Parity coding is the simplest strategy and has the lowest hardware overhead [ARM + 11]. It is based on computing the even or odd parity of a data word of length N, which amounts to an XOR operation over the data bits. A parity code has a distance of 2 and can detect all odd-bit errors.<br />
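A minimal sketch of the parity generator and checker described here:<br />

```python
def parity_bit(word: int, width: int = 8) -> int:
    """Even-parity bit: the XOR of all data bits (1 iff the word
    contains an odd number of 1s)."""
    p = 0
    for i in range(width):
        p ^= (word >> i) & 1
    return p

data = 0b1011_0110                       # five 1s -> parity bit is 1
stored_parity = parity_bit(data)

# On read-back, recompute the parity and compare with the stored bit.
assert parity_bit(data) == stored_parity             # no error

corrupted = data ^ 0b0000_1000           # single bit flip (odd-bit error)
assert parity_bit(corrupted) != stored_parity        # detected

double = data ^ 0b0001_1000              # 2-bit flip: distance-2 limit
assert parity_bit(double) == stored_parity           # not detectable
```

The last assertion illustrates the distance-2 limitation stated above: any even number of bit flips leaves the parity unchanged.<br />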

Figure 2.6: Parity coder in data storage<br />

Before storing data in the register, the parity generator computes the required parity bit (as shown in figure 2.6). Then both the computed parity and the original data are stored in the register.



When the data is retrieved, a parity checker recomputes the parity from the stored data bits, compares it with the stored parity, and sets an error signal accordingly. Similarly, parity coding can also be used to protect logic functions (see figure 2.7). It is commonly used in computers to check errors in buses, memories, and registers [IK03].<br />

Figure 2.7: Functional Parity<br />

Cyclic redundancy checks (CRCs) are another class of EDC, commonly employed to detect errors in digital systems [IK03]. Cyclic codes are parity check codes with the additional property that a cyclic shift of a codeword is also a codeword: if (Cn−1, Cn−2, ..., C1, C0) is a codeword, then (Cn−2, Cn−3, ..., C0, Cn−1) is also a codeword.<br />

The idea is to append a checksum to the end of the data frame in such a way that the polynomial<br />

represented by the resulting frame is divisible by the generator polynomial G(x) that the sender and<br />

receiver have agreed upon. When the receiver gets the checksummed frame, it divides it by G(x)<br />

and if the remainder is not zero, there has been a transmission error. It is then clear that the best<br />

generator polynomials are those less likely to divide evenly into a frame that contains errors. CRCs are<br />

distinguished by the generator polynomials they use. A CRC cannot directly identify the position of the erroneous bit during decoding; hence it is limited to error detection.<br />
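The division described above is long division over GF(2), where subtraction is XOR. A sketch with a small illustrative generator polynomial (not one prescribed by the text):<br />

```python
def gf2_mod(value: int, bits: int, poly: int, poly_bits: int) -> int:
    """Remainder of `value` divided by `poly`, treating both as
    polynomials over GF(2) (subtraction is XOR)."""
    r = poly_bits - 1
    for i in range(bits - 1, r - 1, -1):
        if (value >> i) & 1:
            value ^= poly << (i - r)     # cancel the leading term
    return value

poly = 0b1011                            # G(x) = x^3 + x + 1, 3 check bits
msg = 0b11010011101100                   # 14-bit data frame
chk = gf2_mod(msg << 3, 17, poly, 4)     # checksum = remainder of msg * x^3
frame = (msg << 3) | chk                 # transmitted frame

assert gf2_mod(frame, 17, poly, 4) == 0              # intact: divides evenly
assert gf2_mod(frame ^ (1 << 6), 17, poly, 4) != 0   # corrupted: detected
```

The sender appends the remainder as the checksum; the receiver's division then yields zero exactly when the frame is a multiple of G(x).<br />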

Borden codes are another class of codes that detect unidirectional errors (errors that cause either 0 → 1 or 1 → 0 transitions, but not both); they are optimal for unidirectional error detection. The Berger EDC also detects all unidirectional errors. It is formed by appending check bits to the data word: the check bits are the binary representation of the number of 0s in the data word. For example, a 3-bit data word requires 2 check bits. Berger codes are simpler to handle than Borden codes. The Bose code is more efficient than the Berger code: it provides the same error detecting capability with fewer check bits. Briefly, as the complexity of a code increases, so does its efficiency; choosing the right code depends on the application's needs.<br />
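A minimal Berger code sketch, using the 3-bit example from the text:<br />

```python
def berger_check_bits(word: int, width: int) -> int:
    """Berger code: the check symbol is the count of 0s in the word."""
    return width - bin(word & ((1 << width) - 1)).count("1")

word = 0b101                      # 3-bit data word -> 2 check bits suffice
check = berger_check_bits(word, 3)
assert check == 1                 # one zero in 101

# A unidirectional 1 -> 0 error increases the zero count: detected.
faulty = 0b001
assert berger_check_bits(faulty, 3) != check
```

Unidirectional errors can only increase (1 → 0 flips) or only decrease (0 → 1 flips) the zero count, so the stored check symbol can never match a unidirectionally corrupted word.<br />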

In arithmetic processing circuits (such as the ALU), the previously discussed codes are incapable of detecting errors, because an arithmetic operation on two data symbols produces a new data symbol that cannot be uniquely expressed as a combination of the inputs [FP02, Nic02].<br />

Arithmetic codes, by contrast, are useful for checking arithmetic operations, where parity would not be preserved [FP02, IK03]. The information part of an operand is processed through a typical arithmetic operator, while a check symbol is concurrently generated (based on the information bits) [Bic10]. They have two classical implementations: AN codes and residue codes.<br />

AN codes are the simplest form of arithmetic codes [Muk08]. They are formed by multiplying each data word ‘N’ by a constant ‘A’. The following equation gives an example of an AN code:<br />

A (N1 + N2) = A (N1) + A (N2) (2.1)<br />

AN codes are preserved only under arithmetic operations; they are not valid for logical and shift operations. They are not commonly employed because of their high hardware and timing penalties.<br />
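A sketch with the illustrative constant A = 3: any single-bit flip changes the value by a power of two, which is never a multiple of 3, so it always invalidates the codeword.<br />

```python
A = 3  # code constant: valid codewords are exact multiples of 3

def an_encode(n: int) -> int:
    return A * n

def an_valid(codeword: int) -> bool:
    return codeword % A == 0

x, y = an_encode(11), an_encode(7)
s = x + y                         # arithmetic acts directly on codewords
assert an_valid(s) and s // A == 11 + 7

# A single-bit flip breaks divisibility by 3: detected.
assert not an_valid(s ^ (1 << 4))
```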

Figure 2.8: Residue codes adder [FFMR09].<br />

Residue codes are another type of arithmetic code, in which the information to be used in checking<br />

is called the residue. The residue, r, of an operand, A, is equal to the remainder of ‘A’ divided by the<br />

modulo-base ‘m’ [Bic10]. The two computations occur simultaneously (see figure 2.8). In the first, the two operands, A and B, undergo an arithmetic operation in the ALU, and a residue generator then produces a residue code from the ALU result. In the second, each operand concurrently enters a residue generator, and these residues undergo the same ALU operation as in the first computation (addition in this case) [FFMR09]. Finally, the two residues are compared to detect errors.
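The checking scheme of figure 2.8 can be sketched as follows, using modulo-base m = 3 (an illustrative choice; the `inject_fault` flag exists only to demonstrate detection):<br />

```python
M = 3  # modulo base of the residue code

def checked_add(a: int, b: int, inject_fault: bool = False) -> int:
    """Residue-checked addition: the residue of the ALU result must
    equal the result of adding the operand residues modulo M."""
    alu_result = a + b
    if inject_fault:
        alu_result ^= 1               # simulate a faulty ALU output bit
    residue_of_result = alu_result % M
    residue_path = ((a % M) + (b % M)) % M   # checker runs concurrently
    if residue_of_result != residue_path:
        raise RuntimeError("residue mismatch: error detected")
    return alu_result

assert checked_add(25, 17) == 42
```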



2.2 Error Correction<br />

Error correction has previously been discussed in section 1.5.4. As with error detection, correction techniques are classified into three subclasses: hardware, information, and temporal redundancy.<br />

2.2.1 Hardware Redundancy<br />

Adding a third module to DMR and replacing the comparator with a voter leads to Triple Modular Redundancy (TMR), as shown in figure 2.9. In addition to detecting errors, TMR can also correct them. A more general approach, N-Modular Redundancy (NMR), is discussed in [KKB07]. In these techniques the effects of faults are masked: all the components work simultaneously and their outputs are fed into a voter, whose output is correct as long as at least two of the components are non-faulty. Static redundancy techniques are simple, but they have high area and power overheads.<br />

Figure 2.9: Triple modular redundancy (TMR)<br />

TMR has been a prominent FT solution in aircraft [Yeh02] and space shuttles, where not only processors but entire systems are replicated for robustness. TMR can also be implemented at software level: the approach proposed in [SMR + 07] triplicates operating-system processes and runs them on the available cores, with input replication and output comparison performed by a system-call emulation unit.<br />

TMR can be employed to address single-bit data errors (SET, persistent, non-persistent) occurring in a cell [Car01]. The single point of failure in TMR is the voter, because if a fault occurs in the voter the whole system fails. However, the voter is typically small and hence often assumed to be reliable. There is a significant area and power penalty (approximately a factor of 3.0 − 3.5) associated with TMR compared to a non-redundant design [JHW + 08].
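The voter itself reduces to a bitwise majority function; a minimal sketch:<br />

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter: each output bit follows at least two of
    the three module outputs, so a single faulty module is masked."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011_0010
faulty = good ^ 0b0100_0000          # one replica suffers a bit flip
assert majority_vote(good, faulty, good) == good   # fault masked
```

This is the same structure a hardware voter implements with one AND-OR cell per output bit, which is why the voter stays small relative to the replicated modules.<br />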



2.2.2 Temporal Redundancy<br />

For error correction with temporal redundancy, a computation is repeated on the same hardware at three different time intervals and the results are finally voted [MMPW07]. This requires three times as many clock cycles to execute the same task. It can only correct errors due to transient faults, provided that the fault duration is shorter than the computation time. Since it needs additional time to repeat the computations, it is suited to systems with low or no timing constraints. In return, it has low area overheads compared to TMR.<br />

2.2.3 Information Redundancy<br />

Error correcting codes (ECC) can provide cheaper solutions than other well-known redundancy techniques such as TMR [CPB + 06]. They are commonly used to protect memory (see figure 2.10). The overhead of a code depends on (i) the additional bits required to protect the information and (ii) the additional hardware and latency for encoding and decoding. The encoding/decoding latency can, however, be reduced by executing it in parallel.<br />

Figure 2.10: Error detecting and correcting memory block<br />

Among the different ECC codes, those commonly employed in digital circuits include Hamming codes, Hsiao codes, and Reed-Solomon codes. These codes can correct errors in addition to detecting them. There are two key parameters of an error correcting code: (i) the number of erroneous bits that can be detected and (ii) the number of erroneous bits that can be corrected. A code's detection/correction properties are based on its ability to partition the set of 2^n n-bit words into a code space of 2^m code words and a non-code space of 2^n − 2^m words [FP02]. The simplest block codes are Hamming codes: they are single error correcting, double error detecting (SEC-DED) codes [LBS + 11], though not both simultaneously. They are the earliest linear ECC codes. They are quite useful in cases where only a single error has significant probability, but they carry the hazard of miscorrecting double errors.<br />
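A minimal Hamming(7,4) sketch; it is single error correcting only (the extra overall parity bit that turns it into a full SEC-DED code is omitted for brevity):<br />

```python
def hamming74_encode(d: int) -> int:
    """Hamming(7,4): data in positions 3,5,6,7; parity in 1,2,4."""
    b = {3: d & 1, 5: (d >> 1) & 1, 6: (d >> 2) & 1, 7: (d >> 3) & 1}
    b[1] = b[3] ^ b[5] ^ b[7]        # covers positions with bit 0 set
    b[2] = b[3] ^ b[6] ^ b[7]        # covers positions with bit 1 set
    b[4] = b[5] ^ b[6] ^ b[7]        # covers positions with bit 2 set
    return sum(b[i] << (i - 1) for i in range(1, 8))

def hamming74_correct(cw: int) -> int:
    """The recomputed syndrome, read as a binary number, is the
    position of the erroneous bit (0 means no error)."""
    bit = lambda i: (cw >> (i - 1)) & 1
    s = ((bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2
         | (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1
         | (bit(1) ^ bit(3) ^ bit(5) ^ bit(7)))
    return cw ^ (1 << (s - 1)) if s else cw

cw = hamming74_encode(0b1011)
assert hamming74_correct(cw ^ (1 << 4)) == cw   # single-bit error fixed
```

Flipping any one of the seven codeword bits yields a syndrome equal to that bit's position, which the decoder then flips back.<br />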

Hsiao codes (also called advanced Hamming codes) are other commonly used codes for the protection/correction of errors in memory [Mon07]. They have faster encoding and error detection than Hamming codes [Hsi10].<br />

More powerful codes may be constructed by using appropriate generator polynomials. Among them, Reed-Solomon codes are cyclic codes that require complex encoding and decoding circuitry and are especially well suited to applications where errors occur in bursts; that is why they are mostly employed in channel coding. Convolutional coding schemes, on the other hand, are useful in data storage and transmission systems, such as memories and data networks [FP02].<br />

2.3 Error Recovery<br />

Recovery transforms a system state that contains one or more errors (and possibly faults) into a state without detected errors and without faults that could be activated again [ALR01]. Error recovery can only be initiated upon detection of a fault or error; the system must therefore have a built-in self-checking mechanism. Nowadays, modern microprocessors have various built-in error detection capabilities, such as error detection in memory, caches, and registers, illegal-opcode detection, and so on [MBS07]. Recovery can operate at a higher level through error handling (eliminating errors from the system state) or at a lower level through fault handling (preventing faults from being activated again).<br />

Recovery hides the consequences of faults from the user. It is well suited to transient and intermittent faults, whereas for permanent faults recovery alone is generally not sufficient: it requires one mandatory additional feature, fault handling (see figure 2.11), which eliminates faults from the system state [LB07] and prevents them from being activated again. This in turn requires further features such as diagnosis, which reveals and localizes the cause(s) of error(s) [LRL04]. Diagnosis can also enable more effective error handling: if the cause of the error is localized, the recovery procedure can act on the associated components without affecting the other parts of the system and the related system functionality. In this work, we address soft errors caused by transient faults; fault handling techniques are therefore not explored further.<br />

There are two subtypes of error recovery: forward error recovery (FER) and backward error recovery (BER). In FER, the system continues to make forward progress without restoring any previous state; a compensator overcomes the faults (as shown in figure 2.11, FER). For example, in TMR the voter masks (compensates) the fault, and in ECC the error correcting circuitry corrects the (corrigible) error.<br />

BER involves restoring the system to a previously known safe state. In other words, the state transformation consists of returning the system to a saved state that existed prior to error detection [ALR01]. For successful BER, the system must know: (i) which states are to be saved, and where, for the recovery point; (ii) which algorithm to use; and (iii) what the system does after recovery.<br />

There are two known algorithms for saving the BER recovery states: checkpointing and logging. The choice depends on the micro-architecture of the core and on the recovery requirements, because the two have different costs for different types of state, and many BER systems use a hybrid of both; the system presented in [SMHW02] uses such a hybrid BER. A practical criterion of choice is that with few registers and infrequent recoveries, checkpointing is preferred; with many registers and frequent recoveries, logging is preferred.
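As a toy illustration of checkpointing (the register names and structure are hypothetical, not those of any processor discussed here):<br />

```python
import copy

class CheckpointedCore:
    """Minimal BER sketch: save the register state at a recovery
    point and roll back to it when an error is detected."""

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "pc": 0}
        self._recovery_point = None

    def checkpoint(self):
        self._recovery_point = copy.deepcopy(self.regs)

    def rollback(self):
        assert self._recovery_point is not None, "no recovery point saved"
        self.regs = copy.deepcopy(self._recovery_point)

core = CheckpointedCore()
core.regs.update(r0=7, pc=100)
core.checkpoint()
core.regs.update(r0=999, pc=104)   # execution continues, then an error hits
core.rollback()                    # restore the saved state
assert core.regs == {"r0": 7, "r1": 0, "pc": 100}
```

Logging would instead record each state change incrementally and undo them one by one, which is why it scales better when registers are numerous and recoveries frequent.<br />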



Figure 2.11: Basic strategies for implementing Error Recovery.<br />

Another important aspect is where to save the states of the recovery point. A shadow register file can be created in the core to save the states of the sensitive elements; the backup values in the shadow copy can then be used for rollback and recovery [AHHW08]. However, some techniques requiring higher reliability store the states of the internal registers off-chip, and ECC is employed when the states are restored to avoid possible errors. Considerable development has recently gone into BER, and many low-cost computers employ it; IBM, for instance, employs checkpoint recovery in the POWER6 micro-architecture [MSSM10].<br />

2.4 FT Processor Design Trends<br />

Recently, fault-tolerant computing has begun to draw more and more attention in a wide range of industrial and academic communities, owing to increased safety and reliability demands [ZJ08]. Today, FT is a need of real-time industrial applications [RI08]. High-cost solutions are mostly unacceptable to industry; consequently, modern processors avoid hardware replication and tend to employ alternative techniques with lower power and area overheads (such as information redundancy or hybrid redundancy). Information redundancy (e.g. employing ECC) has less hardware overhead, but may carry an additional performance penalty.<br />

The performance penalty and hardware overhead depend on the type of ECC. The choice of ECC depends on three constraints: power, area, and error coverage requirements. Codes with better error coverage often have higher time penalties and hardware overheads. Parity codes are fast and have low hardware cost, whereas commonly employed ECCs such as Hamming codes offer better error coverage (e.g. SEC-DED).<br />

The performance overhead can be minimized (masked) to an extent by calculating the parity bits in parallel. On the other hand, a common trend to reduce the hardware penalty is to compromise on error coverage and employ low-cost error detecting codes (e.g. simple parity or mod-3 coding). Likewise, some well-known processors of the last decade, such as the Power6, the Itanium series, and the SPARC64 V, employ parity predictors and modulo codes in their arithmetic/logic units to reduce power and cost.<br />

ECC is commonly employed to protect caches and data storage [QLZ05]. For example, the Itanium processor can detect 2-bit errors in its caches by relying on ECC. IBM processors have write-through L1 caches and use simple parity in the L1 cache with ECC in the L2 cache. Intel, on the other hand, uses ECC even in the L1 caches.<br />

Checkpointing with rollback is an alternative trend. It can be an effective FT solution for processors with few internal states (registers): the higher the number of internal states, the higher the performance (time) penalty for checking, loading, and storing the states. Some modern processors that employ this methodology reduce the performance penalty by checkpointing after every superscalar block of instructions; common examples are the Power6 and Power7.<br />

A newer trend in processor design is to employ flexible error coverage and allow users to choose the level of protection and redundancy they need for a particular application, e.g. the ARM Cortex-R series, an application-specific processor family. For higher error coverage DMR is employed, whereas for lower coverage ECC is used; the area overhead, however, will always be higher than 200%. In the following section, we discuss the different FT methodologies employed in some well-known FT processors of the last decade.<br />

SPARC64 V [AKT + 08]<br />

The SPARC64 V microprocessor is designed for mission-critical UNIX servers. In order to<br />

achieve un-interrupted operation, these servers must be resistant to soft errors. Also, data integrity is<br />

highly important because of the dangers that silent data corruption (SDC) can pose in mission-critical<br />

systems. To meet these requirements, the processor was designed not only to correct SRAM errors,<br />

but also to detect errors in logic circuits and to recover from those errors when practical.<br />

There are three smaller cache arrays of 128KB each, namely the level-1 instruction cache, level-1<br />

data cache and branch history cache (BRHIS). The level-1 data cache is write-back and protected by<br />

the same SEC-DED codes as the level-2 cache. The level-1 instruction cache and BRHIS are covered<br />

by parity check. When an error is detected during level-1 instruction cache read, the read entry is<br />

invalidated and re-fetched from the ECC-protected level-2 cache. An error in BRHIS is treated as a<br />

cache miss and the processor delays execution of the conditional branch instruction until the correct



branch address is calculated. The processor takes a minor performance hit but is able to continue<br />

correct instruction execution.<br />

Tags for level-1 instruction and data caches are parity-protected. Both level-1 caches are inclusion<br />

caches; tag information is duplicated in the level-2 tag. When a parity error is detected in a level-1 tag<br />

access, the level-2 tag is interrogated for the correct copy of the tag. The level-1 cache access is then<br />

re-executed. The last major SRAM array on the chip is the Translation Look-aside Buffer (TLB).<br />

TLB is protected by parity check and a parity error in the TLB is treated as a miss. The correct<br />

page table entry is fetched from the ECC-protected main memory during re-execution. In addition<br />

to implementing cache and TLB protection, the SPARC64 V is designed to detect single bit SRAM<br />

errors in other smaller SRAM arrays and recover from those errors as well.<br />

The processor logic circuits are protected by byte parity check to detect single bit logic errors in<br />

each byte. Parity check bits are calculated at the location of new data value generation and passed<br />

with the associated data through the processor logic circuits. Parity bits are checked at the receiving<br />

end.<br />

Arithmetic/logic units are equipped with byte parity predictors. The byte parity predictors calcu-<br />

late the parity bits for each output byte of an arithmetic/logic unit using the same input signals as the<br />

unit to be checked. These independently calculated byte parity bits are compared with the byte parity<br />

bits calculated from the output of the arithmetic/logic unit. Multipliers are checked with a modulo-3<br />

scheme.<br />

The byte parity predictors in the arithmetic/logic unit do not detect point errors that result in an<br />

even number of bit flips in the output byte, and the modulo-3 scheme used in the multipliers does not detect point errors that give the same modulo-3 residue. These checks, however, do detect the<br />

majority of single point errors and are cost-effective compared to a full duplication and compare<br />

implementation. When a parity error is detected in the logic circuits or small SRAM arrays, the<br />

processor stops issuing new instructions and clears all intermediate states. It then restarts execution<br />

at the instruction directly following the last correctly executed instruction by using the check-pointed<br />

states. This action is called instruction retry.<br />

The checkpoint and instruction retry mechanisms are implemented in the processor for recovery<br />

from branch misprediction. Thus, the additional cost associated with utilizing these mechanisms for<br />

error recovery is small. Furthermore, many microprocessors today feature either ECC or byte parity<br />

for large on-chip SRAM arrays. Compared with those microprocessors, the SPARC64 V micropro-<br />

cessor only requires additional transistors for implementing byte parity bits, byte parity predictors and<br />

the associated parity checkers in the logic circuits and small SRAM arrays. The number of transistors<br />

devoted to the error detection mechanisms of the SPARC64-V microprocessor is about 10% of the<br />

transistors for logic gates, latches and parity-protected small SRAM arrays.<br />

LEON3 FT<br />

LEON3 is the successor of the LEON2 processor developed for the European Space Agency



(ESA). The LEON3FT [GC06] is a fault-tolerant version of the standard LEON3 (a SPARC V8 clone). In LEON3FT, consideration is only given to the protection of data storage, not to the functionality of the processor: there is no protection for the control unit, data path, or ALU circuitry.<br />

The internal registers are protected with ECC codes plus a shadow copy. Upon a detected parity error, a duplicate copy of the data is read out from a redundant location in the register file, replacing the failed data. A few internal registers can detect four-bit errors; the majority, however, can only detect two-bit errors.<br />

The cache memory in LEON3FT consists of separate instruction and data caches, each 8 KB large. Each cache has two parts, tag and data RAM. The tag and data memories are implemented with on-chip block RAM and protected with four parity bits per 32-bit word, allowing detection of up to four simultaneous errors per cache word. Upon a detected error, the corresponding cache line is deleted and the instruction is restarted. This operation takes 6 clock cycles (idle states) and is transparent to software. For diagnostic purposes, error counters are provided to monitor detected and corrected errors in both the tag and data parts of the caches.<br />

The Boeing 777 control system<br />

In the Boeing 777, the control system is made reliable through redundant channels with different processors and diverse software to protect against design errors as well as hardware faults [BT02]. It uses heterogeneous triple-triple modular redundancy [Yeh02] (as shown in figure 2.12): three different processor architectures (Intel 80486, Motorola 68040, and AMD 29050) execute the same operation. However, it is an expensive solution and can only be employed in mission-critical applications.<br />

ARM Cortex R Series [ARM09]<br />

The ARM Cortex-R series is a family of embedded processors for real-time industrial applications. They are highly customizable, so that manufacturers can choose the features that suit their application's needs.<br />

If the ECC build option is enabled, a 64-bit ECC scheme protects the instruction cache: its data RAM includes eight bits of ECC code per 64 bits of data. The data cache is protected by a 32-bit ECC scheme, its data RAM including seven bits of ECC code per 32 bits of data. If the parity build option is enabled, the caches are instead protected by parity bits: for both the instruction and data caches, the data RAMs include one parity bit per byte of data.<br />

The processor can be implemented with a second, redundant copy of most of the logic. The second copy shares the cache RAMs of the master core, so that only one set of caches is used. Comparing the outputs of the redundant core with those of the master core detects faults.



Figure 2.12: The triple-TMR in Boeing 777 [Yeh02]<br />

Power6 [MSSM10, KMSK09, KKS + 07]<br />

The Power6 processor, designed by IBM, uses inline checkers instead of TMR, which costs less power and hardware overhead. It has built-in self-checking ability in the data and control flow paths: residue checking is employed for the floating-point unit, and logical consistency checkers for the control logic. Its recovery unit checkpoints after each group of superscalar instructions completes. The inline checkers write into a fault isolation register that decides whether the current state is error-free; in case of error detection, the recovery unit initiates instruction retry recovery. The memory bus, including the input-output unit, is protected by ECC codes. The L1 cache is protected by simple parity, while the L2 and L3 caches and all signals into and out of the chip to L3 have ECC protection.<br />

Intel Itanium 9300 Series [Int09]<br />

The Intel Itanium 9300 series processors are high-performance processors. The L2, L3, and directory caches are protected with ECC, which can correct all single-bit errors and most double errors; moreover, hardware-assisted scrubbing support is available for the L2, L3, and directory caches. Memory is also thermally protected: thermal sensors send information to the memory controllers, which consequently increase fan speed to regulate the temperature. The internal registers of the processor are protected by ECC. Additionally, there are redundant clocks and soft-error-hardened latches and registers to improve resistance to soft errors.<br />




2.5 FT Evaluation<br />

In the semiconductor industry, testing expenses increase the overall cost of IC design and manufacturing. Generally, industrial testing is meant to find permanent faults introduced at manufacturing time. However, the most frequently occurring faults in computer systems are temporary effects such as transient and intermittent faults; they are the main cause of digital system failure [VFM06]. Owing to the increasing probability of transient faults in the latest technologies, more and more designers will have to analyse the potential impact of these faults on the behaviour of their circuits.<br />

The error model used to evaluate faults depends on their duration. Permanent faults can be tolerated by replacing the faulty component, whereas a temporary fault can disappear on its own. Intermittent faults are treated under either the permanent or the temporary model, depending on how often they occur. Some common techniques to evaluate FT systems are discussed in [WCS08]. Among them, fault injection is widely accepted as an effective approach to evaluating fault tolerance [LN09, Nic10].<br />

2.5.1 Fault Injection<br />

Fault injection is a validation technique for FT systems that consists of controlled experiments in which the system's behaviour in the presence of faults is observed after faults are explicitly and deliberately introduced into the system [ACC + 93]. In other words, it is the purposeful introduction of faults (or errors) into a target [NBV + 09]: an intentional activation of faults in order to observe the behaviour of the system under fault. The objective is to compare the nominal behaviour of the circuit (without fault injection) with its behaviour in the presence of faults injected during the execution of an application.<br />

Fault injection techniques have become popular to evaluate and improve the dependability of<br />
embedded processor-based systems [LAT07]. It can be accomplished at the physical level or in simulation.<br />

1. physical fault injection: faults are injected directly into the hardware by disturbing the environment<br />
(e.g. heavy-ion radiation, electromagnetic interference, laser) [BGB + 08, Too11]. Many<br />
methods have been proposed, based primarily on the validation of physical systems, including<br />
injection on circuit pins, injection of heavy ions, disruption of power supplies, or laser fault<br />
injection [GPLL09]. None of these approaches can be used for dependability evaluation before<br />
the circuit is actually manufactured. Therefore, an alternative solution is to employ injection<br />
techniques that allow earlier analysis of the design, typically at the register transfer or gate<br />
level, e.g. by injecting mistakes into an RTL description.<br />

2. simulated fault injection: fault injection campaigns can be performed using several approaches,<br />
especially simulation for high-level approaches. It has been widely used for its simplicity,<br />
versatility, and controllability [NL11]. Simulation is more expensive in time; however, it<br />
allows a more comprehensive analysis, provides more accurate results, and costs less than<br />
physical fault injection [NL11]. Fine access to the internal states of the processor is easily<br />
possible with simulated fault injection, which is why it offers better controllability/observability.<br />
In this technique, the system under test is simulated on another computer system. The faults are<br />
produced by altering logical values during the simulation.<br />
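As an illustration of altering logical values during simulation and comparing against a golden (fault-free) run, consider the following minimal sketch. It is not tied to any particular simulator: the toy "program" (a list of register increments), the register names and the fault triple are all hypothetical.

```python
def simulate(program, state, fault=None):
    """Interpret a tiny 'increment register' program, optionally flipping one
    bit of one register at a given step (the simulated fault injection)."""
    state = dict(state)
    for step, reg in enumerate(program):
        state[reg] += 1                       # the nominal operation
        if fault and fault[0] == step:        # fault trigger: step count
            _, loc, bit = fault
            state[loc] ^= 1 << bit            # fault effect: single bit flip
    return state

golden = simulate(["r0", "r1", "r0"], {"r0": 0, "r1": 0})
faulty = simulate(["r0", "r1", "r0"], {"r0": 0, "r1": 0}, fault=(1, "r1", 3))

# Comparing the golden run with the faulty run reveals the injected error.
print(golden)   # {'r0': 2, 'r1': 1}
print(faulty)   # {'r0': 2, 'r1': 9}
```

Any divergence between the two final states marks the fault as activated, which is exactly the comparison described above between nominal and faulty behavior.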

Simulated fault injection is a special case of injecting soft errors that can support various levels<br />
of abstraction of the system, such as architectural, functional and logic [CP02], and for this reason it<br />
has been widely used to study fault injection. Moreover, this technique has various other advantages.<br />
For example, its greatest advantage over the others is the controllability/observability of all<br />
the modelled components. Another positive aspect is the possibility of carrying out the<br />
validation of the system during the design phase, before having a final design. Alternative<br />
physical/simulation environments to perform safety analysis are discussed in [Bau05, RNS + 05].<br />

2.5.2 Error Models<br />

To design an FT system, it is important that the system be aware of the possible faults that<br />
can appear in it. Some commonly occurring fault models are shown in table 2.1. However, an architecture<br />
is normally designed to overcome possible errors rather than faults: since such systems are not aware<br />
of the underlying physical phenomena, they can only detect the active faults that produce errors.<br />

Table 2.1: Fault modeling<br />

Level : Model<br />
Programming : Instruction, sequences, etc.<br />
HDL : Functional model, register<br />
Logic : Gate level<br />
Electronic : CMOS transistor<br />
Technology : Physical layout<br />

There are different types of error models; in [Sor09] they are classified along three axes: type of<br />
error, error duration, and number of simultaneous errors. A commonly considered error<br />
model is the bridging model, which covers short-circuits and crosstalk. This model is suited to<br />
detecting fabrication defects that can cause a short-circuit between two connections/wires. It is a<br />
low-level error model.<br />

The fail-stop error model is a higher-level error model. In a system based on the fail-stop model, all<br />
components stop working when an error is detected. Such systems are used in critical applications<br />
such as ATMs (automated teller machines), where a single error in a calculation can result<br />
in a loss of hundreds of dollars. Such a system stops working if a non-correctable error is detected.<br />

A delay error model is one in which the circuit produces the correct response but after a certain<br />

unexpected delay. This type of error can occur due to various internal physical phenomena of the<br />

device. Some related work is discussed in [EKD + 05].<br />

Here, we are interested in bit-flip errors, which are largely representative of transient errors due<br />
to SEUs (SBUs and MBUs). Moreover, they are easy to model at many abstraction levels.



2.5.3 The Fault Injection Framework<br />

A fault injection framework usually needs at least three types of information:<br />

(a) when the fault is to be injected, i.e. what condition will trigger the fault injection during<br />
the simulation;<br />

(b) where the fault is to be injected, i.e. in which location the fault will be injected;<br />

(c) what kind of fault is to be injected, i.e. what its effect will be.<br />

Fault Trigger (when)<br />

Fault injection may be done according to a deterministic or non-deterministic (time) profile. A<br />
non-deterministic fault trigger may inject a fault during a simulation after a random amount of time. On<br />
the other hand, in the deterministic approach, the fault is triggered by counting the number of simulated<br />
instructions: the fault will be injected after a specified number of instructions have been simulated. In<br />
simulated fault injection, the non-deterministic behavior is obtained by specifying the amount of time or<br />
the number of simulated instructions randomly.<br />

A deterministic fault trigger may limit the scope of the fault injection by restricting the<br />
injection to a specific interval. In real-time applications, faults occur at random<br />
instants. The practical solution is to use non-deterministic approaches by combining different<br />
(possible) trigger conditions for specific situations.<br />

Fault Location (where)<br />

In processors, faults may affect the ALU, internal registers or memory addresses, depending on the<br />
output of the instruction using the affected logic. In all cases, a change in processor registers or memory<br />
can represent a real possible fault. A fault location is often described deterministically, but it can also<br />
be described in a non-deterministic way if we let the fault injection framework randomly choose<br />
which processor register to inject the fault into.<br />

Fault Effect (what)<br />

As explained previously, the most common effect of a transient fault on a processor register or<br />
memory is the inversion of the state of a bit (single bit flip). By flipping a bit in a register or in a memory<br />
location we can inject a fault as it occurs in a real situation. The value of the altered bit is always<br />
toggled to the opposite value. This upset model is the standard transient fault model used in the<br />
reliability literature [Muk08]. A deterministic fault operation can be done by specifying which bit<br />
to flip, but it can also be done non-deterministically by letting the fault injection framework randomly<br />
choose the bit to flip.
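The three pieces of information above (when, where, what) can be gathered into a minimal fault descriptor, as in the following sketch of a non-deterministic campaign. The helper names, register names and word width are illustrative assumptions, not part of any existing framework.

```python
import random

def make_random_fault(n_instructions, registers, width, rng):
    """Draw a fault descriptor: trigger (when), location (where), effect (what)."""
    return {
        "when":  rng.randrange(n_instructions),  # instruction-count trigger
        "where": rng.choice(registers),          # which register is hit
        "what":  rng.randrange(width),           # which bit will be flipped
    }

def apply_fault(reg_file, fault):
    # Single bit-flip model: toggle the chosen bit of the chosen register.
    reg_file[fault["where"]] ^= 1 << fault["what"]
    return reg_file

rng = random.Random(0)                           # seeded for reproducibility
fault = make_random_fault(10_000, ["r0", "r1", "r2"], width=16, rng=rng)
regs = apply_fault({"r0": 0, "r1": 0, "r2": 0}, fault)
print(fault, regs)
```

A deterministic campaign would simply fill the same three fields with fixed values instead of random draws.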



The details mentioned above are the basic information needed by a fault injection simulator.<br />
Our exact choices will be presented in chapter 6, where the validation methodology based on<br />
artificial error injection will be detailed.<br />

2.6 Conclusions<br />

In this chapter, we explored the existing design methodologies and validation techniques for FT<br />
processors. Today, FT processors employ a variety of redundancy techniques, each with its own<br />
area/time overheads. Hardware redundancy offers faster error detection and correction but has high<br />
area overheads, whereas temporal redundancy techniques have lower area overheads but high<br />
associated time overheads.<br />

In the past, FT processors were only used for mission-critical applications and mostly relied on<br />
hardware replication. However, nowadays every system needs at least some consideration of FT, and<br />
expensive solutions are not acceptable for most embedded systems. Therefore, modern processors either<br />
rely on hybrid techniques or focus on information redundancy techniques to reduce<br />
power and area requirements. The available low-cost solutions lack fast error detection.<br />
The need is to develop an alternative fault-tolerance methodology that provides fast error detection at<br />
low power/area overheads.<br />

In the last sections, different methods of processor evaluation were discussed. Among them, simulated<br />
fault injection has many advantages compared to physical fault injection, including better<br />
controllability/observability and architecture validation during the initial development stages.




II. QUALITATIVE AND QUANTITATIVE<br />

STUDY<br />



Chapter 3<br />

Design Methodology and Model<br />

Specifications<br />

3.1 Motivation<br />

Due to current technology trends, there is growing concern that transient faults will occur more<br />
frequently in the future [FGAD10]. Since this reliability threat is projected to affect the broad computing<br />
market, traditional solutions involving excessive redundancy are too expensive in area, power, and<br />
performance [BBV + 05, SHLR + 09]. Research into FT system design with minimum hardware<br />
overhead has therefore gained great importance in the last few years.<br />

Previously, research in this domain only aimed at attaining high-level dependability at minimum<br />
performance degradation, and there was no great consideration for low-cost hardware solutions.<br />
Consequently, dependability was mostly attained through expensive solutions like hardware replication.<br />
The well-known FT processors developed in the past, such as Stratus, Leon FT, Sun FT SPARC<br />
and IBM S/390, employed hardware redundancy solutions (either DMR or TMR). These processors<br />
have high power and hardware overheads and do not address the needs of everyday applications.<br />

The available FT solutions often incur significant penalties in area, cost or performance, and they<br />
are unable to tolerate faults efficiently [PIEP09]. They cannot fulfill the needs of common industrial<br />
applications. Some temporal redundancy techniques have minimum hardware overheads; however, they<br />
have significant time overheads that limit the overall performance. On the other hand, hardware<br />
replication is faster but increases the cost and power requirements. It is a great challenge to build<br />
efficient FT systems with reduced time and hardware overheads. Efficient design optimization techniques<br />
are required in order to meet time and hardware constraints in the context of FT systems. Consequently,<br />
in this work we propose an FT processor design methodology offering an acceptable compromise<br />
between protection and area/time overheads. The next section explores the proposed methodology.<br />




3.2 Methodology<br />

We want to design a fault-tolerant processor with minimum error detection latency and low<br />
hardware overhead. In this situation, the challenge is to find a compromise between protection and<br />
hardware overheads. Hardware-based detection and correction is fast but has high hardware overheads;<br />
on the other hand, software-based detection and correction is slower but has low hardware overhead.<br />

Consequently, for fast error detection we choose hardware-based concurrent error detection, so<br />
that an error is detected before it reaches the system boundaries and results in a catastrophic failure. On<br />
the other hand, to save hardware we have to accept an additional time penalty for error correction.<br />
Moreover, in real-time applications, faults do not occur often. Consequently, a software-based rollback<br />
mechanism can be chosen to recover from errors.<br />

Figure 3.1: Proposed Methodology (a fault-tolerant computer with low hardware and time trade-offs:<br />
fast error detection through concurrent error detection, low hardware overhead through rollback)<br />

The resulting hardware-software co-design methodology (see figure 3.1) will have the ability to<br />
detect errors as soon as they occur and to immediately start the error recovery strategy to prevent<br />
the propagation of errors throughout the system. The proposed methodology is suitable for<br />
non-safety-critical FT applications where error occurrence rates are not too high.<br />

In the next section, we will discuss the most suitable CED and recovery mechanisms for the above<br />
scenario.<br />

3.2.1 Concurrent Error Detection: Parity Codes<br />

The implementation of CED usually requires extra hardware. One of the most straightforward<br />
and commonly used CED approaches is DMR. Theoretically, it can detect 100% of errors (except<br />
simultaneous errors in both modules and errors in the comparator) [MS07]. However, this technique<br />
imposes an area overhead higher than 200%. The decision on the checking strategy is a compromise<br />
between error coverage and acceptable overhead. Cost-effective solutions are the objective of<br />
further investigations in error detection. EDC have a smaller area overhead [Pat10] and are often<br />
considered sufficient for non-safety-critical processors [MS07]. Among EDC, we will employ the<br />
simplest codes because our objective is to show the feasibility of our approach. Once the overall<br />
methodology has shown interesting results, we may employ stronger codes with better error<br />
coverage.<br />

Parity codes are the simplest and cheapest known EDC. They provide detection of errors of odd<br />
multiplicity and require extra circuits for check-bit generation and output parity verification. Their<br />
hardware overhead is much lower than that of the DMR approach. They can be employed to protect<br />
registers, data buses, RAM and bit-sliced circuits [Pie06].<br />

The disadvantage is that multiple-bit faults of even multiplicity (divisible by 2) cannot be detected.<br />
The example of an 8 × 8-bit register file in figure 3.2 illustrates this fact: faults in registers 1<br />
and 3 can be detected by the parity check, whereas faults in registers 2 and 5 remain undetected.<br />

Parity codes do not need complex encoding and decoding circuitry and have a smaller gate count for the<br />
complete on-line checking scheme. Moreover, in the case of soft errors, where an error is random in time<br />
and space, the likelihood of multiple errors in one clock cycle is exceedingly low. Therefore, in this<br />
scenario, a less expensive approach such as parity-based error detection can suffice [Gha11].<br />
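The odd-parity behavior described above can be sketched in a few lines. The 8-bit word below is a hypothetical example (not a row taken from figure 3.2); it shows both the detection of an odd-multiplicity fault and the escape of a double fault.

```python
def odd_parity(bits):
    # Parity bit chosen so that the total number of 1s (data + parity) is odd.
    return 1 - (sum(bits) % 2)

def check(bits, parity):
    # True -> parity consistent, i.e. no error detected.
    return (sum(bits) + parity) % 2 == 1

word = [1, 0, 1, 1, 0, 0, 0, 0]
p = odd_parity(word)

one_flip = word[:]; one_flip[2] ^= 1                         # odd multiplicity
two_flips = word[:]; two_flips[2] ^= 1; two_flips[4] ^= 1    # even multiplicity

print(check(word, p))       # True  (fault free)
print(check(one_flip, p))   # False (error detected)
print(check(two_flips, p))  # True  (double error escapes the parity check)
```

The last line is exactly the limitation illustrated by figure 3.2: a fault of even multiplicity leaves the parity unchanged and goes undetected.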

Figure 3.2: Limitation of parity check (an odd-parity-protected 8 × 8-bit register file shown in a<br />
fault-free and in a noisy environment; odd-multiplicity faults are detected, even-multiplicity faults<br />
remain undetected)<br />

Lisboa [LC08] has employed a similar approach; he uses a standard parity-based technique to detect<br />
errors in single-output combinational circuits. In this work, a second circuit generates an extra<br />
output signal, named the check bit, and two circuits based on reduced-area XOR gates verify the<br />
parity of the inputs and outputs to detect soft errors.<br />




3.2.2 Error Recovery: Rollback<br />

For minimum hardware overhead, software-based error correction is useful. One straightforward<br />
solution is time-based fault masking (software TMR). It has low hardware overhead but<br />
incurs a 3× additional time penalty, and moreover it does not match the already chosen hardware CED.<br />
An alternative approach is based on software rollback, where transient, non-persistent faults can<br />
be tolerated (repaired) by repeating the operation in a controlled manner using the same hardware<br />
again [RRTV02].<br />

It can overcome errors by returning to a point prior to the occurrence of the fault [MG09].<br />
Rollback is a technique that allows a system to tolerate a failure by periodically saving the entire<br />
state, so that if an error is detected, the system rolls back to the prior checkpoint to recover [JPS08].<br />
It requires little hardware overhead, and the resulting architecture can overcome errors at low cost.<br />
It is a good candidate for situations where recovery delays are acceptable. The rollback principle can<br />
be an efficient approach for error recovery in combination with CED [BP02, EAWJ02].<br />

Figure 3.3: Rollback Execution (program states A to F reached at times t to t+6; on rollback,<br />
execution returns to the states stored at validation point VP n-1)<br />

Figure 3.4: Error detection during Sequence Duration (SD) and rollback called<br />

Our strategy to implement fault recovery is based on rollback execution, a classically employed<br />




software technique in real-time embedded systems [KKB07]. It relies on the following behaviors (see<br />
figure 3.3):<br />

• program (or thread) execution is split into sequences of fixed maximal length;<br />

• each sequence must reach its end without any error being detected in order to be validated;<br />

• data generated during a faulty sequence must be dismissed and execution restarted from the<br />
beginning of the faulty sequence.<br />
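The behaviors above can be sketched as a tiny simulation. The instruction model (each operation just increments a counter) and the single-transient assumption (the error disappears on re-execution) are simplifications chosen only for illustration.

```python
def run_with_rollback(program, sd, error_at=None):
    """Execute `program` in sequences of length `sd`; on a detected error,
    restore the checkpointed state and re-execute the faulty sequence."""
    state, cycles, i = 0, 0, 0
    while i < len(program):
        checkpoint = state                    # save the SEs at the VP
        ok = True
        for j, op in enumerate(program[i:i + sd]):
            state += op                       # 'execute' one instruction
            cycles += 1
            if error_at == i + j:             # CED signals an error here
                ok = False
                error_at = None               # transient: gone on re-execution
                break
        if ok:
            i += sd                           # sequence validated at the VP
        else:
            state = checkpoint                # rollback: dismiss faulty data
    return state, cycles

print(run_with_rollback([1] * 8, sd=4))               # (8, 8): no error
print(run_with_rollback([1] * 8, sd=4, error_at=6))   # (8, 11): re-execution cost
```

The second run reaches the same final state but needs extra cycles, which is precisely the time penalty of rollback discussed below.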

If an error occurs within an instruction sequence, the processor registers can be restored with the<br />
previously saved contents. In figure 3.4, an error is detected during instruction execution, so the<br />
rollback mechanism is called and re-execution starts from the states stored at the previous validation<br />
point (VP). The VP executes after a fixed interval of instructions. On the other hand, in figure 3.5,<br />
no error is found during the Sequence Duration (SD) and all the data written during the SD is<br />
validated at the VP.<br />

Figure 3.5: No error detected during the SD (the SEs are stored during the SED at the start of each<br />
sequence; the data of the last SD is validated at VP n-1 and that of the current SD at VP n)<br />

The SD represents the full length of the sequence, which includes the time taken to store the sensitive<br />
elements (SEs) as well as the length of active instruction execution. In the remaining work, the<br />
processor's internal states will be called SEs. Let us denote the minimum time to load the processor SEs<br />
as 'SED' (see figure 3.5). The ratio of the active sequence duration is then '(SD-SED)/SD', whereas the<br />
ratio of the time to load the SEs is 'SED/SD'. For SD=10 (assuming a program length of 10,000<br />
instructions and neglecting the possibility of provoked errors), the SEs will be loaded about 1,000 times,<br />
whereas with SD=100 the SEs will be loaded only about 100 times. The SED/SD penalty (loading the<br />
SEs) is thus 10 times lower with SD=100, which means that a larger SD can result in faster program<br />
execution.<br />

On the other hand, if the probability of error provocation is not ignored, then the resulting time penalty<br />
due to re-execution varies with the length of the sequence. At higher error injection rates, lower SDs will<br />
be an effective compromise because the possibility of sequence un-validation is higher for bigger SDs,<br />
and vice versa. This will be further discussed in chapters 5 and 6.<br />
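The back-of-the-envelope comparison of SD=10 versus SD=100 can be checked numerically. One SE save/load per sequence is assumed, and the SED value used in the penalty ratio is a placeholder for illustration only.

```python
def se_load_count(program_len, sd):
    # One save/restore of the sensitive elements (SEs) per sequence.
    return program_len // sd

PROGRAM_LEN = 10_000   # program length in instructions, as in the text
SED = 1                # assumed SE-load time, in sequence time units

for sd in (10, 100):
    loads = se_load_count(PROGRAM_LEN, sd)
    # The journaling penalty ratio SED/SD shrinks as the sequence grows.
    print(f"SD={sd:3d}: {loads:5d} SE loads, SED/SD = {SED / sd:.2f}")
```

Running this confirms the factor of 10: 1,000 SE loads for SD=10 against 100 for SD=100, in the error-free case only.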




Figure 3.6: Time overhead in rollback (an error detected at instruction a7 forces the processor to<br />
reload the SEs and re-execute the sequence, costing additional clock cycles)<br />

The rollback principle is the repetition of an erroneous operation starting from a defined (saved)<br />
checkpoint in the past. There is a time penalty in the case of error detection. For example, an error<br />
is detected at 'a7' in figure 3.6. The processor then rolls back, reloads the SEs and re-executes from<br />
the previously saved state. Consequently, additional clock cycles are needed to re-execute the<br />
sequence. Moreover, the delay can be higher if the SD is large (more instructions in the sequence).<br />

3.3 Limitations<br />

A system relying on rollback cannot communicate data to a real-time environment until it is<br />
known that the data is error-free. If erroneous data enters the system peripherals, then it<br />
cannot be recovered and may result in catastrophic failures. This is a fundamental problem of rollback<br />
recovery and has been discussed in [NMGT06]. The common approach is to wait until the data is<br />
validated. In the present methodology, output events can be addressed with a one-SD delay, which may<br />
conflict with real-time constraints.<br />

There is a need to design a special output unit to monitor the output control signals. However,<br />
real-time communication is not within the scope of the present work and will be considered in<br />
future work.<br />

3.4 Hypothesis<br />

Among other underlying hypotheses, we suppose that the processor core is connected to a dependable<br />
memory (DM) in which data is supposed to be kept safe without any risk of corruption. According to<br />
this assumption, all internal errors produced in the DM are detected and corrected by the DM itself.<br />
The DM is therefore internally a safe storage, but it must be protected from errors coming from<br />
outside, which means that only valid data may be written into the DM.<br />

Indeed, a lot of work has been dedicated in the past to the protection of memory devices [MS06,<br />
Hsi10], making this hypothesis valid.<br />




3.5 Design Challenges<br />

Choosing an error detection mechanism based on concurrent error detection and a recovery mechanism<br />
based on rollback is not enough to achieve our design objectives. An effective implementation of the<br />
above scenario can be realized by making appropriate choices concerning, in particular, the processor<br />
architecture. These design choices should improve the dependability, cost and overall performance.<br />
In the following sections, we will analyze some of the major features required for a successful<br />
implementation of the above scenario.<br />

3.5.1 Challenge # 1: Self Checking Processor Core Requirements<br />

The choice of a base processor architecture is the first step towards the implementation of an FT<br />
processor, because not all processor architectures fit this context. For a successful<br />
implementation, we must determine the required key features of the processor.<br />

• minimum hardware: we aim to design an FT processor with a small hardware "footprint".<br />
This reduces the chance of data contamination, since the greater the area exposed to the<br />
environment, the greater the chance of provoking errors. With a smaller area, a more efficient<br />
architecture can be built on smaller silicon dies, and thus the yield will be much higher [TM95].<br />

• minimum internal states to be checked and stored: (i) with concurrent error detection, the<br />
hardware overhead necessary to check all the internal states simultaneously may be rather high.<br />
Having a reduced number of internal states helps reduce this hardware overhead. (ii) Rollback<br />
recovery requires the internal states of the processor to be saved periodically, incurring a time<br />
penalty/overhead that can be lowered with a reduced number of internal states.<br />

The commonly employed RISC (Reduced Instruction Set Computer) class machines have a<br />
large register file and cannot fit the proposed methodology for the following reasons:<br />

(a) more registers mean more expensive CED;<br />

(b) more registers imply more time consumed in the periodic saving of register contents;<br />

(c) a large number of registers requires a large number of instruction bits as register specifiers,<br />
meaning less dense code;<br />

(d) CPU registers are more expensive than external memory locations.<br />

On the other hand, CISC (Complex Instruction Set Computer) machines require a complex control<br />
architecture, which complicates the overall implementation methodology. They have high memory<br />
requirements and, moreover, increase the probability of design errors.<br />

In short, we cannot rely on classical processors (RISC or CISC). Our choice will be a simple<br />
processor architecture with minimum internal states. This will reduce the overall area/time<br />
penalty and make the processor more robust against external disturbances. On the other hand,



it should have minimal architectural complexity and provide better utilization of chip<br />
resources.<br />

3.5.2 Challenge # 2: Temporary Storage Needed: Hardware Journal<br />

If a self-checking processor is directly connected to the DM (see figure 3.7), then there is a need to<br />
manage the validated data (VD, written in a previously validated SD) and the un-validated data (UVD,<br />
being produced in the present sequence) inside the DM, which induces an additional time penalty.<br />

In such a case, a paginated memory (one page per sequence) is generally employed, with<br />
un-validated and validated pages to manage the rollback. If an error is detected in the current<br />
sequence, the corresponding page is discarded and the previous page must be restored. This is a slower<br />
approach and requires additional pointers to handle the pages; these pointers can either be dedicated<br />
registers (faster) or dedicated variables (slower and riskier).<br />

Moreover, there is an additional risk of corruption of these pointers, which may result in losing track<br />
of the validated and un-validated pages. The DM would then no longer be a safe data storage, which<br />
violates the basic hypothesis. Furthermore, if a large amount of data is copied between pages, or<br />
between pages and the main pool of data in memory, this takes a lot of time, and the system requires<br />
a bigger DM to separately store the validated and un-validated data.<br />

Figure 3.7: Untrusted data flowing into dependable memory (DM) (the processor writes directly to<br />
the DM)<br />

An alternative approach to simplify the above scenario is to employ a temporary data storage between<br />
the processor and the DM. It can strongly reduce the time penalty and also, to some extent, the risk of<br />
error. Furthermore, it simplifies the periodic saving of data, and only validated data is transferred<br />
to the DM.<br />

The basic idea is to place some hardware on the path between the processor and the DM, controlling<br />
the way data flows from one side to the other and preventing un-trustable data from ending up in the<br />
DM (as suggested in figure 3.8). This can be achieved by first writing the non-secure/non-validated<br />
data to a temporary location before transferring it to the DM after sequence validation. The<br />
self-checking processor core (SCPC) can detect errors and re-execute instructions from the last secure<br />
states (in case of error detection). In this way, external errors (environment/processor) will be masked<br />
from entering the DM (as shown in figure 3.8). The underlying idea behind the journalization



Figure 3.8: Data stored to a temporary location before writing to the DM (self-checking processor<br />
core, temporary storage, dependable memory)<br />

mechanism is to prevent un-trustable data from flowing into the DM and to allow easy recovery from<br />
faulty situations. Hence, there is a need for a temporary location (called the self-checking hardware<br />
journal in later chapters) to mask errors from entering the DM.<br />
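The journal's role can be sketched as a small buffer with validate and rollback operations. The class and method names below are hypothetical, chosen only to mirror the description above; they do not correspond to the actual hardware design.

```python
class Journal:
    """Minimal sketch of journalization: writes land in an un-validated area
    first and reach the dependable memory (DM) only after the sequence is
    validated at the VP; on rollback the un-validated data is discarded."""

    def __init__(self, dm):
        self.dm = dm                  # dependable memory (assumed safe)
        self.unvalidated = {}         # data written during the current SD

    def write(self, addr, value):
        self.unvalidated[addr] = value

    def validate(self):               # no error detected during the SD
        self.dm.update(self.unvalidated)
        self.unvalidated.clear()

    def rollback(self):               # error detected: dismiss faulty data
        self.unvalidated.clear()

dm = {}
j = Journal(dm)
j.write(0x10, 42); j.rollback()       # faulty sequence: DM stays untouched
j.write(0x10, 42); j.validate()       # clean sequence: data reaches the DM
print(dm)                             # {16: 42}
```

Note how the faulty write never reaches `dm`: this is the masking of un-trustable data that motivates the hardware journal.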

The Need for a Self-Checking Hardware Journal<br />

Data stored inside this temporary location can also be corrupted in the case of transient faults<br />
affecting it, such as SEUs (see figure 3.9). Hence, there should be an error detecting and correcting<br />
mechanism to ensure the reliable operation of this temporary data storage.<br />

Figure 3.9: Data corruption in temporary storage (non-validated data from the self-checking<br />
processor core can be hit by a transient fault inside the temporary storage before it reaches the<br />
dependable memory as supposedly validated data)<br />

Let us suppose that we have written data in the journal at time 't', that no error was detected<br />
during the sequence up to the VP, and that the data is ready to be transferred to the DM. Is this data<br />
dependable? No, because the data remained in the journal for a time 'tx', and the possibility of a fault<br />
occurring during tx cannot be ignored. Hence, there is a need for a self-checking mechanism to detect<br />
errors in the journal and thus protect the DM from data contamination, as shown in figure 3.10. This<br />
will make the journal a safe temporary data storage.



Figure 3.10: Protecting the DM from contamination (a dependable temporary storage forwards only<br />
trustable, validated data to the dependable memory)<br />

Separate Storage of Validated and Un-validated Data<br />


The data initially written in the temporary location is un-validated; if no error occurs during<br />
the present sequence, the data is validated at the validation point. At any instant, the temporary<br />
storage therefore holds two types of data: un-validated data and validated data. Consequently,<br />
it must contain two different parts: one to store validated data and the other to store un-validated<br />
data. In addition, this separation makes it easy to transfer only the validated data towards the DM.<br />
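
The separation between the two parts can be sketched in software as follows (a behavioural model only; the per-entry flag and the container are illustrative, not the hardware organization):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of a journal split between validated and un-validated data.
struct Entry {
    uint16_t addr;
    uint16_t data;
    bool     validated;
};

struct Journal {
    std::vector<Entry> entries;

    // Writes during the current sequence land in the un-validated part.
    void write(uint16_t addr, uint16_t data) {
        entries.push_back({addr, data, false});
    }

    // Validation point reached with no error detected: promote everything.
    void validate() {
        for (auto& e : entries) e.validated = true;
    }

    // Error detected during the sequence: discard only the un-validated
    // part, keeping validated data still waiting to be copied to the DM.
    void discard_unvalidated() {
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                          [](const Entry& e) { return !e.validated; }),
                      entries.end());
    }
};
```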

3.5.3 Challenge # 3: Processor-<strong>Memory</strong> Interfacing<br />

The overall performance of the FT-processor can be limited in the absence of an efficient interface<br />
between the processor, the temporary location and the memory, since in most processors the majority of<br />
instructions involve a read from or a write to the memory. The overall performance suffers if<br />
there is a long critical path or if more than one clock cycle is needed to read and write the data. In our case,<br />
the situation is even more delicate because there is a temporary data storage between the processor<br />
and the DM. An intelligent interface is therefore needed to mask errors from entering<br />
the DM while still providing an efficient interconnect between the modules.<br />

In this scenario, there are two possible interfaces: the processor communicates with the DM via a journal,<br />
or the processor communicates with the journal and the memory in parallel. The challenge is to evaluate both<br />
processor models from the dependability and performance-degradation points of view and to choose<br />
the most suitable one.<br />

3.5.4 Challenge # 4: Optimal Sequence Duration for Efficient Implementation<br />

of Rollback Mechanism<br />

The objective of the rollback technique is to restore the system state (in case of error) by overwriting<br />
the current sequence states with the previously validated states of the SEs, as shown in figure 3.4. There are<br />
two performance-limiting factors: (i) the time taken to periodically store the SEs and (ii) the un-validation of<br />
the SD and the reload of the SEs on error detection. To reduce the time penalty of reloading the<br />
SEs, long sequences are needed so that the overall number of SE load/store operations stays smaller than



‘(SD-SED)/SD’. On the other hand, for larger sequences at higher error rates there is less chance<br />
of sequence validation; the rollback rate increases, which again results in a time<br />
penalty and performance degradation. Therefore, it is advisable to use long sequences at low error<br />
rates and short sequences at higher error rates.<br />
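
This trade-off can be illustrated with a simple probabilistic model (an illustration only, not the analysis used for the actual results): if each instruction is hit by an error with probability p, a sequence of SD instructions validates with probability (1 − p)^SD, so the expected number of times it must be executed grows quickly with SD:

```cpp
#include <cmath>

// Expected number of executions of one sequence of length sd when each
// instruction fails independently with probability p: the sequence
// validates only if all sd instructions are error-free, i.e. with
// probability (1 - p)^sd, so on average it runs 1 / (1 - p)^sd times.
double expected_executions(double p, int sd) {
    return 1.0 / std::pow(1.0 - p, sd);
}
```

For instance, `expected_executions(1e-4, 10)` is roughly 1.001 (long sequences are nearly free at low error rates), while `expected_executions(0.01, 100)` is roughly 2.73 (at a higher error rate, an SD of 100 almost triples the execution time), matching the advice above.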

3.6 Model Specifications and Global Design Flow<br />

The basic role of the journal is to hold the new data generated during the currently executing<br />
sequence until it can be validated (at the end of that sequence). On sequence validation,<br />
this data can be transferred to the DM; otherwise, it is simply dismissed and the current sequence<br />
restarts from the beginning using the SEs (held in the DM) corresponding to the state prevailing at<br />
the end of the previous sequence.<br />

Figure 3.11: Overall design specifications<br />

Our global design strategy for the FT processor is divided into four steps, as summarized in figure<br />
3.12. Step I states the proposed model specification, shown in the block diagram of figure<br />
3.11. This includes identifying the design requirements, in this case the SCPC, the DM and a hardware<br />
journal to mask errors from entering the DM. Moreover, the architecture must respect<br />
challenges 1-4 mentioned in the previous section.<br />

Step II refines our design strategy through various functional implementations and will be dis-<br />
cussed in the next section. The hardware implementation (Step III) will be presented in chapters 4 and 5. Fi-<br />
nally, Step IV is concerned with the validation of the overall approach by artificial error injection;<br />
it will be presented in chapter 6.<br />

3.7 Functional Implementation<br />

This section is concerned with Step II of figure 3.12 and aims to refine the proposed model.<br />
There are two possible connections between the processor and the DM: (i) Model-I: the processor is<br />
connected to the DM via a journal pair and cannot read from the DM in a single clock cycle; (ii)<br />
Model-II: the processor is connected to the journal and the DM in parallel and can directly read from the DM



Figure 3.12: Global design flow.<br />

and the journal simultaneously. These approaches differ in the type of connection between the SCPC<br />
and the DM, and the overall dependability and performance are affected accordingly. In this section, we<br />
will finalize the processor-memory (DM) interface.<br />

Which scenario is best can be judged by developing the corresponding functional models<br />
and then comparing the simulation curves (clock cycles per instruction (CPI) vs. error injection rate<br />
(EIR)) obtained by artificial error injection.<br />

Hypotheses:<br />

In order to simplify the functional model, the following hypotheses are assumed:<br />

(a) the processor core is self-checking;<br />

(b) a dependable memory is attached to the processor, where data can safely stay without risk<br />
of provoking errors;<br />

(c) the journal/cache are considered dependable data storage places;<br />

(d) all instructions are supposed to execute in a single clock cycle;<br />

(e) re-execution of instructions can recover from soft errors.<br />

Benchmarks<br />

A set of benchmarks consisting of the main kernels of general tasks in the target application has<br />
been selected and divided into three groups. The first group executes memory operations (permu-<br />
tation/sorting), the second group is representative of arithmetic-dominated algorithms and the third<br />
group of control-dominated algorithms. All applications have significant memory requirements be-<br />
cause each time they need to read from memory and, after execution, write back to it (these bench-<br />
marks are not designed to evaluate I/O events, hence they only read from and write back to memory).



(a) Benchmark Group-I: The bubble sorting algorithm has been considered, one of the simplest<br />
algorithms for sorting an array. It consists of repeatedly exchanging out-of-order pairs of<br />
adjacent array elements written in the memory, and looping until all the elements<br />
are sorted. It has been implemented in a serial fashion: only one pair can be examined at<br />
a time. It has a time complexity of O(n²), as n passes must be made through the array, where<br />
n is the number of elements.<br />

(b) Benchmark Group-II: This is a memory-computation benchmark and requires writing data back to<br />
close addresses. This version of matrix multiply multiplies two 7 × 7 matrices in O(n^k) with<br />
k &gt; 2. This runtime is achieved by implementing a vector-matrix multiplier, which stores an<br />
initial matrix away and repeatedly returns its product with an input vector.<br />

(c) Benchmark Group-III: The control benchmark processes data coming from sensors, which<br />
have previously been stored in the memory. The outputs are stored in the memory to be used later<br />
by the actuators. We chose logic and arithmetic equations for the data because some industrial<br />
systems need to control their actuators with this kind of equation. There are two assumptions: (i)<br />
measurements from the sensors are stored in memory; (ii) results will later be sent to the actuators.<br />
The control equations are:<br />

Y0 = A × [(X0 + X1) − (X2 − X3)] / [X4 × (−X5)]<br />

Y1 = NOT [(X6 OR X1) AND (X9 XOR X7)] AND [NOT (X8)]<br />

if ((X8 + X9) &lt; A)<br />

Y2 = B × [(Y0 + X1) − (X9 × X8)] / [X1 − X5] + C<br />

else<br />

Y2 = [(Y0 + X1) − (X9 × X8)] / [X1 − X5] + D<br />

Y3 = NOT [(X6 OR Y1) AND (X9 XOR X7)] AND [NOT (Y1)]<br />

3.7.1 Model-I<br />

In Model-I (shown in figure 3.13) the processor is connected to the DM via a cache memory and a pair<br />
of journals; the journal pair prevents potential errors from propagating to the DM.<br />
The write (processor to memory) is performed in three steps, as shown<br />
in figure 3.13: (i) the write operation is performed simultaneously in the cache and in the Un-Validated<br />
Journal (UVJ); (ii) if no error is detected, then at the VP the data from the UVJ is transferred to the Validated<br />
Journal (VJ), which contains only validated data; and (iii) finally, the validated data is written to the DM.<br />
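
The three-step write above can be sketched as a small software model (an illustration only; the struct and map-based storage are assumptions, not the hardware design, and cache cleanup on rollback is omitted):

```cpp
#include <cstdint>
#include <map>

// Behavioural sketch of the Model-I write path.
struct ModelI {
    std::map<uint16_t, uint16_t> cache, uvj, vj, dm;

    // Step (i): write simultaneously to the cache and the
    // Un-Validated Journal.
    void write(uint16_t addr, uint16_t data) {
        cache[addr] = data;
        uvj[addr]   = data;
    }

    // Step (ii): at the validation point, move UVJ contents to the
    // Validated Journal; step (iii): drain the VJ into the DM.
    void validation_point() {
        for (const auto& kv : uvj) vj[kv.first] = kv.second;
        uvj.clear();
        for (const auto& kv : vj) dm[kv.first] = kv.second;
        vj.clear();
    }

    // Error detected before the VP: the un-validated writes are dropped,
    // so nothing un-trusted ever reaches the DM.
    void rollback() { uvj.clear(); }
};
```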

At the VP, all the last sure states of the SEs are conserved and validated. As shown in figure<br />
3.13, data is transferred from the UVJ to the VJ and finally to the DM. If an error is detected during an SD, the<br />
processor retries the instruction execution from the preceding VP (as shown in figure 3.4). In this way, the
processor retries the instruction execution from preceding VP (as shown in figure 3.4). In this way, the



Figure 3.13: Model-I with data cache and a pair of journals<br />

Figure 3.14: Cache with associative mapping<br />

system restores its prior dependable state and the DM remains preserved from the errors. On sequence<br />
validation, the data in the UVJ is validated by transferring it synchronously from the UVJ to the VJ in a single<br />
clock cycle. The processor can read directly from the cache memory.<br />

Associative mapping is employed in the cache: each block is composed of both the memory address<br />
and the corresponding data. The incoming address is simultaneously compared with all stored ad-<br />
dresses using the internal logic of the associative memory, as shown in figure 3.14. If a match is<br />
found, the corresponding data is read out; otherwise, the required data will be read from memory. When<br />




new data is written into the cache, the controller first checks the existing addresses, so as to<br />
overwrite the data at a matching address. If no match occurs, the data is written in a new position together<br />
with its address. The use of associative memory is fast but also expensive; the cost depends<br />
on how big the cache is.<br />
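
The associative read and write behaviour just described can be sketched as follows (a sequential software model of the lookup; in hardware all address comparisons happen simultaneously):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of an associatively mapped cache: every block stores the full
// memory address alongside the data.
struct AssocCache {
    struct Block { uint16_t addr; uint16_t data; };
    std::vector<Block> blocks;

    // Lookup: compare the incoming address against all stored addresses.
    std::optional<uint16_t> read(uint16_t addr) const {
        for (const auto& b : blocks)            // models the parallel compare
            if (b.addr == addr) return b.data;  // hit: data is read out
        return std::nullopt;                    // miss: fall back to memory
    }

    // Write: overwrite on a matching address, else take a new position.
    void write(uint16_t addr, uint16_t data) {
        for (auto& b : blocks)
            if (b.addr == addr) { b.data = data; return; }
        blocks.push_back({addr, data});
    }
};
```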

Model-I: Simulation Results<br />

A functional model (emulator) of the processor/journal/cache has been developed in C++. The<br />
emulator acts as a virtual machine, before hardware implementation, to test various fault models and<br />
protection techniques. In addition, the emulator allows us to evaluate the architectural choices and to<br />
compute both the internal processor states and the program execution duration. It helps us to calculate the<br />
average number of clock cycles per instruction for different memory accesses.<br />
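
The core of such a measurement can be sketched as follows (a simplification of the actual emulator: one instruction per cycle, rollback restores the start of the current sequence at no extra cycle cost, and the error model is a plain per-cycle probability):

```cpp
#include <cstddef>
#include <random>

// Run n_instr instructions in sequences of length sd; when an injected
// error hits, roll back to the start of the current sequence and
// re-execute it.  Returns the total clock cycles spent, so the ratio
// cycles / n_instr approximates CPI*/CPI for this run.
std::size_t run_with_injection(std::size_t n_instr, std::size_t sd,
                               double error_rate, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution fault(error_rate);
    std::size_t cycles = 0, done = 0;
    while (done < n_instr) {
        std::size_t executed = 0;
        bool error = false;
        while (executed < sd && done + executed < n_instr) {
            ++cycles;
            ++executed;
            if (fault(rng)) { error = true; break; }  // SEU injected
        }
        if (!error) done += executed;  // sequence validated at the VP
        // on error: 'done' is unchanged -> rollback and re-execution
    }
    return cycles;
}
```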

Figure 3.15: FT evaluation<br />

The errors have been artificially injected into the processor emulator (as shown in figure 3.15) and<br />
the performance of the FT processor model has then been evaluated. The goal of this experimental<br />
setup is to evaluate the effect of error injection on system performance. For simplicity, the actual<br />
time overhead of the periodic saving of the SEs is ignored. The fault injection profiles being considered are<br />
shown in figure 3.16.<br />
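
The three injection profiles of figure 3.16 can be generated as lists of faulty cycle numbers, as in the sketch below (the parameter names `period`, `rate` and `burst_len` are illustrative experiment knobs, not values fixed by the architecture):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Periodic errors: one fault every 'period' cycles.
std::vector<std::size_t> periodic_profile(std::size_t cycles,
                                          std::size_t period) {
    std::vector<std::size_t> hits;
    for (std::size_t c = period; c < cycles; c += period) hits.push_back(c);
    return hits;
}

// Random errors: each cycle is faulty independently with probability rate.
std::vector<std::size_t> random_profile(std::size_t cycles, double rate,
                                        unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution fault(rate);
    std::vector<std::size_t> hits;
    for (std::size_t c = 0; c < cycles; ++c)
        if (fault(rng)) hits.push_back(c);
    return hits;
}

// Burst errors: burst_len consecutive faulty cycles starting at 'start'.
std::vector<std::size_t> burst_profile(std::size_t start,
                                       std::size_t burst_len) {
    std::vector<std::size_t> hits;
    for (std::size_t c = start; c < start + burst_len; ++c) hits.push_back(c);
    return hits;
}
```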

The emulator receives the previously described groups of benchmarks, representing target appli-<br />
cations, together with a set of representative data. The input of the emulator is a classical hexadecimal file.<br />
Our evaluation criterion is the ratio of the average number of Clocks per Instruction (CPI) vs. Error



Figure 3.16: Periodic, random and burst errors models<br />

Injection Rate (EIR). The goal of the simulations is to evaluate the performance degradation of the<br />
proposed model in the presence of high error injection rates.<br />

Figures 3.17, 3.18 and 3.19 present the results for benchmark groups 1, 2 and 3 for SDs of 10,<br />
50 and 100. CPI* is the clock cycles per instruction of the dependable architecture (with error<br />
injection) and CPI is the clock cycles per instruction without error injection. The ratio CPI*/CPI<br />
gives the ratio of additional clock cycles required for re-execution due to rollback.<br />

Figure 3.17 shows the simulation results of the computation benchmark. In this graph, there are<br />
two horizontal reference lines: the bottom green dotted line is extended from the value<br />
of CPI*/CPI in the presence of no error, while the top red dashed line is drawn at 2 ×<br />
that value. There are three curves in each figure, drawn at SDs of 10, 50 and 100 respectively. The<br />
curves overlap each other at low Error Injection Rates (EIR). As the EIR increases, the<br />
value of CPI*/CPI also increases exponentially, which is more pronounced for the higher SDs of 50 and<br />
100: at high EIR the rate of re-execution increases, which finally raises<br />
the overall CPI*/CPI ratio. This model gives a good CPI*/CPI ratio at low EIR, but at higher EIR<br />
the ratio increases rapidly because of instruction re-execution due to error detection and of<br />
additional clock cycles in case of cache misses. These two problems are addressed in Model-II.<br />

3.7.2 Model-II<br />

Model-II consists of three parts: a self-checking processor, a journal and the DM, as shown in figure<br />
3.20 (a). It has a modified journal architecture with two internal parts, one containing validated<br />
data and the other containing un-validated data, as shown in figure 3.22.



Figure 3.17: Model-I: additional CPI for benchmark group I (computation benchmark, periodic error injection)<br />

When the processor needs to read, it checks for the data in the journal and the DM simultaneously (due to the<br />
parallel access). Associative mapping is employed in the journal: each block is composed of both the<br />
memory address and the corresponding data (as shown in figure 3.22). If a journal miss occurs, the<br />
required data is nevertheless sent to the processor during the same clock cycle. If the data is found both in the<br />
journal and in the DM, the controller (MUX) prefers the data from the journal, as it is the most<br />
recently written data (see figure 3.21).<br />
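
The read-priority logic just described can be sketched as follows (a map-based software model; the struct and the zero fallback for an address present in neither place are illustrative assumptions):

```cpp
#include <cstdint>
#include <map>

// Sketch of the Model-II read path: journal and DM are accessed in
// parallel, and the multiplexer prefers the journal copy because it is
// the most recently written value.
struct ModelII {
    std::map<uint16_t, uint16_t> journal;  // associative: address -> data
    std::map<uint16_t, uint16_t> dm;

    uint16_t read(uint16_t addr) const {
        auto j = journal.find(addr);  // both lookups happen in the same
        auto m = dm.find(addr);       // clock cycle in hardware
        if (j != journal.end()) return j->second;  // MUX picks the journal
        return m != dm.end() ? m->second : 0;      // journal miss: DM copy
    }
};
```

Because the DM lookup runs in parallel rather than after a journal miss, a miss costs no extra clock cycle, which is the key performance difference from Model-I.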

To allow simultaneous read and write, the journal should have two address ports, since some instructions<br />
may need two journal operations (one read and one write) at the same time. Newly written data<br />
is stored in the UVJ part. If no error is detected during the sequence, the data is validated (as shown in<br />
figure 3.22) and transferred to the VJ part. On the other hand, if an error is detected, all the data written<br />
during the sequence is discarded (as shown in figure 3.23) and the processor rolls back and restarts<br />
execution from the last known states of the SEs. In the next section, we evaluate the performance<br />
of this architecture.<br />

Model-II: Simulation Results<br />

The experimental protocol remains the same as for Model-I. It can be observed from<br />
the simulation curves of figures 3.24, 3.25 and 3.26 that Model-II is more efficient than Model-I:<br />
even in the presence of high error rates, the CPI*/CPI required to run the dependable architecture is<br />
significantly smaller than for the previous architecture.



Figure 3.18: Model-I: additional CPI for benchmark group II (permutation benchmark, random error injection)<br />

From the simulation results of figures 3.24, 3.25 and 3.26, the CPI*/CPI ratio is smaller than for Model-I.<br />
In figure 3.24, with an SD of 10, when the EIR varies from 2e-4 to 2e-2 the additional CPI increases by only<br />
50% (CPI*/CPI reaches 1.5 on the y-axis), which means the execution time increases by only 50% even when<br />
the EIR becomes 100 times higher. This shows good performance for the proposed architecture.<br />
For example, if we accept increasing the CPI by 50%, with an SD of 10 there can be 20 errors per 1000<br />
instructions, whereas with an SD of 50 there will be only 6 errors per 1000 instructions. Furthermore, the<br />
SD has a direct impact on the size of the journal memory in the architecture and subsequently on its area.<br />

3.7.3 Comparison<br />

A comparison between Model-I and Model-II is summarized in table 3.1. In both models, the<br />
effect of rollback is more dominant for the higher SDs of 50 and 100. For example, since a VP occurs only after<br />
every hundred instructions when SD = 100, there are more chances of errors being provoked, which raises the<br />
CPI*/CPI ratio more rapidly than for SD = 10 or 50: the large interval between<br />
two consecutive VPs leaves more room for error occurrence and thus increases<br />
the rate of instruction re-execution.<br />

From the performance point of view, in Model-II the parallel access to the memory and the journal<br />
during read operations increases the overall efficiency of the system, resulting in lower CPI ratios at<br />
higher EIRs than Model-I: no clock cycles are wasted if the data is not found in the<br />
journal. It therefore performs better than our previous model, as shown in figures 3.24, 3.25 and 3.26.



Figure 3.19: Model-I: additional CPI for benchmark group III (control benchmark)<br />

Figure 3.20: Block diagram of Model-II<br />

From the dependability point of view, Model-II is the better choice because it has a minimal<br />
hardware overhead thanks to its single journal, compared with Model-I; more area exposed to the envi-<br />
ronment also increases the chances of provoking errors. Both problems, performance degradation<br />
at higher error injection rates and effective on-chip area, are addressed better than in Model-I. Therefore,<br />
we choose Model-II for further development: the results obtained are quite encouraging and justify carrying<br />
on the research with this model. In the next two chapters, we design both the processor<br />
(chapter 4) and the journal (chapter 5).



Figure 3.21: Processor can simultaneously read from Journal and DM<br />

Figure 3.22: No error detected during SD and data is validated at VP<br />

3.8 Conclusions<br />

This chapter has presented an alternative approach to designing an FT-processor, including the<br />
architecture specification and design methodology of the proposed scheme. It is a combined hardware/software<br />
approach in which error detection is achieved concurrently by hardware means using parity<br />
codes, and rollback is used for error recovery. The major advantage of this scenario is the ability to

Figure 3.23: Error detected and all the data written during SD is deleted<br />

Table 3.1: Comparison of the Processor-<strong>Memory</strong> Models<br />

                                 Model-I                         Model-II<br />
read from memory (DM)            Processor ⇐ Cache ⇐ DM          Processor ⇐ DM<br />
read from Cache/Journal          Processor ⇐ Cache               Processor ⇐ Journal<br />
write to DM                      Processor ⇒ UVJ ⇒ VJ ⇒ DM       Processor ⇒ Journal ⇒ DM<br />
Cache/Journal size requirement   Comparatively bigger            No MISS in Journal<br />
(to avoid cache MISS)            cache required                  due to parallel access<br />
Performance                      Medium performance              Reasonably good performance<br />
                                 at high error rate              even at high error rate<br />

have an effective FT mechanism with limited hardware and time overheads. The overall methodology<br />
can succeed only if certain design challenges are respected: choosing an appropriate processor<br />
with a minimum of internal states to load and store, designing an intermediate self-checking hardware<br />
journal to prevent errors from entering the dependable memory, and choosing a reasonable<br />
sequence duration for a given error rate.<br />

The last part of the chapter was dedicated to defining the processor-memory interface. Ac-<br />
cordingly, we have proposed two different models, Model-I and Model-II. On comparison, Model-II has been<br />
chosen for further development of the VHDL-RTL model because it is more reasonable from the dependabil-<br />
ity and performance points of view. In this model, on a write to memory the data passes via the temporary



Figure 3.24: Model-II: additional CPI for benchmark group I (permutation benchmark)<br />

Figure 3.25: Model-II: additional CPI for benchmark group II (computation benchmark)<br />

storage towards the DM, while on a read the processor can read directly from the DM. In this way, the DM<br />
remains preserved from error propagation coming from the processor. In the next chapters, we develop<br />
the VHDL-RTL model of the FT-processor.



Figure 3.26: Model-II: additional CPI for benchmark group III (control benchmark, burst error injection)<br />




Chapter 4<br />

Design and Implementation of a Self<br />

Checking Processor<br />

We aim to design a fault-tolerant processor that has two parts: a self-checking processor core (SCPC)<br />
and a self-checking hardware journal (SCHJ). In this chapter, we focus only on the design of the<br />
SCPC, as highlighted in figure 4.1.<br />

Figure 4.1: Design of a self checking processor core (SCPC)<br />

To explore the SCPC, the chapter is divided into several sections. In the first section,<br />
we start modelling the processor by choosing the appropriate architecture family that fulfils the basic<br />
design objectives identified in chapter 3. Next, the processor hardware model (non-FT version)<br />
is presented and explored, and its performance and dependability challenges are<br />
identified; the subsequent sections address their solutions. A generic model<br />
is described in VHDL-RTL (Register Transfer Level) and synthesized with Altera Quartus II.<br />
Experimental results are presented in terms of throughput (number of bits processed per second)<br />
and area usage. Finally, the fault-tolerance capacity of the SCPC is validated in chapter 6.<br />




4.1 Processor Design Strategy<br />

The FT strategy we have chosen was discussed in section 3.2. In this part, we choose a<br />
processor architecture that fits it well, as presented in figure 4.2 (a continuation of the<br />
previously presented figure 3.2). Hardware-based concurrent error detection is expensive if there<br />
are many internal states to be checked, which goes against the design constraints. Therefore, to obtain fast<br />
error detection at low hardware overhead, a processor with a minimum of internal states<br />
to check is needed (figure 4.2 presents the criteria for this choice). The software-based rollback mechanism<br />
has a low hardware overhead but will be slow if there are many internal states in the processor to<br />
be periodically saved. Moreover, rollback on error detection is faster when the number of<br />
internal states to restore is lower.<br />

Consequently, our choice goes to a processor architecture belonging to the minimum instruc-<br />
tion set computer (MISC) class, which possesses many characteristics suitable for our design<br />
strategy. Within the MISC class, we have chosen a basic stack processor architecture: a simple and<br />
flexible architecture with a very reduced number of internal registers [JDMD07]. An alternative choice<br />
could have been an accumulator-based processor, but such processors are highly dependent on<br />
random access to memory and are therefore less efficient. The chosen stack proces-<br />
sor architecture also relies on memory accesses, but most of them are very predictable (neighbouring<br />
addresses), as they are related to stack operation, and can be handled very effectively.<br />

4.1.1 Advantages of Stack Processor<br />

Stack processors have various additional advantages, from both the protection and the performance points<br />
of view. Some classical advantages are discussed in [KJ89]; others are detailed below.<br />

A stack-based processor can result in a more reliable architecture than its RISC counterpart<br />
because it has fewer internal states and a smaller on-chip area, which<br />
reduces the chances of external environmental contamination. Most RISC-based approaches contain<br />
register banks that make them more sensitive to SEUs and MBUs, whereas in a stack proces-<br />
sor the number of internal registers is far smaller. For example, the stack processor presented in [Jal09]<br />
has six internal registers: TOS (Top of the Stack), NOS (Next of the Stack), TORS (Top of Return<br />
Stack), IP (Instruction Pointer), DSP (Data Stack Pointer) and RSP (Return Stack Pointer). This is<br />
far fewer than in modern RISC processors (e.g. the LEON3FT has more than 150 internal registers [Aer11]),<br />
and in an FT processor all internal registers must be protected against transient faults [ARM + 11].<br />

Furthermore, the many internal registers of RISC-based architectures widen the instruction length. More registers mean larger address decoding, which increases the propagation delay. That is why RISC (and modern CISC) processors need multi-stage pipelining to restore average throughput by hiding internal latency, along with better branch scheduling. For example, the Pentium 4 has a 20-stage pipeline, and a miss in the caches and branch prediction buffers can incur a 30-cycle penalty for a missed branch (20 cycles in the pipeline, 10 in the memory), whereas the RTX (a stack-based processor) has a fixed 2-cycle overhead in all cases [PB04]. Furthermore, a processor's natural<br />


4.1. PROCESSOR DESIGN STRATEGY 79<br />

[Figure 4.2: Criteria behind the choice of the stack processor. A fault-tolerant computer with low hardware and time trade-offs requires fast error detection (concurrent error detection) and low hardware overhead (rollback mechanism), which both call for a processor with minimum internal states: a minimum instruction set computer (MISC), hence a stack processor.]<br />

resistance against SEUs decreases as the number of pipeline stages increases [MW04].<br />

Fast periodic backup<br />

Stack processors have various advantages over RISC-based machines, such as higher clock speeds, low procedure-call overhead and fast interrupt handling [Sha06]. They reach higher clock speeds because instructions operate between the two tops of the stack (provided there is internal stack caching or an internal hardware stack). They have low procedure-call overhead because only a limited number of registers needs to be saved to memory across procedure calls. Interrupt handling is fast because interrupt routines can execute immediately, the hardware taking care of the stack management. The architecture of a stack-based Java processor has been evaluated in [Sch08], and the results show better performance and a smaller gate count compared to a RISC-family processor on an FPGA.<br />
count compared to a RISC family processor on an FPGA.<br />

Commercially, stack processors have been used for medical imaging, hard disc drives and satellite applications. Some well-known examples include the Novix NC4000, Harris RTX2000 and Silicon Composers SC32 [PB04]. They are deployed in space applications for their reasonable performance and low power overheads [HH06], e.g. SCIP, a stack-based processor designed for spaceships [Hay05]. Recently, the Green-Arrays project has been employing stack-based architectures to design multi-computer chips. The company has designed chips with attractive features such as minimum cost and energy with high<br />



performance [Gre10, Bai10].<br />

The FT-processor design serves our long-term objective: devising a new fault-tolerant multi-resource system based on message passing, in which the current fault-tolerant processor design will be used as a processing node. It was clear from the beginning that severe area constraints apply to the architectural design of a single node in order to match the future massively parallel objective, while preserving individual performance as much as possible. The stack machine remains a viable architecture, owing to its smaller size and lower cost and power requirements. Stack processors can yield simple and smart cores for parallel distributed applications [Gre10].<br />

On the other hand, the stack machine favours sequential instruction execution. It fits well for control-dominant applications, and is less favourable for data-dominant applications like video streaming.<br />

4.2 Proposed Architecture<br />

The architecture of the stack processor has been presented in [Jal09]. It is inspired by the second-generation canonical stack processor [KJ89]. The stack taxonomy is based on three attributes: the number of stacks, the size of the stack buffer memories, and the number of operands in the instruction format. They are represented by the three coordinate axes in figure 4.3. These dimensions allow various combinations. Among these choices, the canonical stack machine has multiple and large stacks and is a 0-operand (ML0) machine, as shown in figure 4.3. 0-operand means that all instruction operand locations are implicit, so it is not necessary to give their addresses in the instruction; in the case of a stack, the implicit location is the top of the stack.<br />

To satisfy the simplicity requirement there are two stacks 1 , the data stack (DS) and the return stack (RS). The first is used for expression evaluation and subroutine parameter passing; the second holds subroutine return addresses, interruption addresses and temporary data copies. The two stacks allow accessing multiple values within one clock cycle, which improves speed. With separate return-address and data stacks, subroutine calls and returns can be performed in parallel with data operations. This can reduce program size and system complexity, which improves system performance.<br />

Concerning the size of the stack buffers, we have chosen large stack buffers that reside in the dependable memory (DM), which allows storing multiple data items without loss. The DM is on-chip, so the data can be accessed in a single clock cycle. In addition, there is no restriction on the stack depth.<br />

There are three registers named TOS (Top Of Stack), NOS (Next Of Stack) and TORS (Top Of Return Stack), which hold the top of the data stack (DS), the next element of the DS and the top of the return stack (RS) respectively. NOS and TORS do not exist in the canonical model; they proved useful, allowing a simplified instruction set [Jal09]. The DS and RS stacks reside in main memory (DM), a feature similar to first-generation stack machines. They do not have address registers but are addressed by internal pointers, namely the data stack pointer (DSP) and the return stack pointer (RSP). We have chosen<br />

1 According to the Turing definition, the minimal number of stacks for a pure stack machine is 2 [KJ89].


[Figure 4.3: The multiple and large stacks, 0-operand (ML0) computer chosen from the three-axis design space (number of stacks, number of instruction operands, stack size).]<br />

this feature to protect the data contents because, according to the hypothesis, the DM is a dependable storage and these stacks thus remain fault-secure.<br />

The proposed stack-based architecture contains the data bus, the data stack (DS) and return stack (RS) with their top-of-stack registers, the arithmetic/logic unit (ALU), the instruction pointer register (IP), the instruction buffer with the instruction register, and the control logic (hardwired control), as shown in figure 4.4. The input/output module (shown in figure 4.4) requires special management to be fault-tolerant; this is not treated in this work.<br />

The ALU performs arithmetic and logic operations, including addition, subtraction, logical functions (AND, OR, XOR), test for zero and others. It operates on the top of the data stack (operands and result): TOS and NOS are the two top elements of the DS, while TORS is the top element of the RS. The IP holds the address of the next instruction to be executed. The IP may be loaded from the bus to implement branches, or incremented to fetch the next sequential instruction from program memory. Like the DS and RS, the program memory also resides in the DM.<br />

The MAR (Memory Address Register) unit that exists in the canonical stack processor has been eliminated from our model, because the program memory together with the IP (Instruction Pointer) is sufficient to manage all the instructions and provide the address of the next instruction to be executed. The resulting processor has a simple set of 37 instructions, each executed in one clock cycle except one (the STORE instruction), which requires two clock cycles. The complete instruction set of 37 instructions is given in appendix B, where all the instructions are expressed at RTL (Register Transfer Level). Thanks to the limited instruction set and the 0-operand model, an 8-bit opcode is sufficient to represent all instructions.<br />



Acronyms: DS = Data Stack; RS = Return Stack; I/O = Input-Output; NOS = Next Of Stack; TORS = Top Of Return Stack; TOS = Top Of Stack; ALU = Arithmetic Logic Unit; IP = Instruction Pointer.<br />

[Figure 4.4: Simplified stack machine. The DS, RS and I/O blocks connect through the data bus to the NOS, TOS and TORS registers, the ALU, the IP and the control unit; the program memory supplies instructions via address and data lines.]<br />

4.3 Hardware Model of the Stack Processor<br />


The processor hardware model has been described in VHDL at RTL. The initial processor model (non-FT version) has been synthesized with Altera Quartus II (version 7.1).<br />

It consists of the arithmetic and logic unit (ALU), the internal registers, the instruction buffer, the control unit and the data path connecting them, as shown in figure 4.5. The DS and RS are addressed by the two pointer registers DSP and RSP respectively. The three on-chip registers TOS, NOS and TORS resolve possible conflicts when transferring data between the two stacks, e.g. during execution of R2D, D2R, OVER or ROT. For further explanation, let us consider the R2D (Return Stack to Data Stack) instruction. Thanks to the availability of TORS, TOS and NOS inside the processor, no conflict occurs in accessing the data bus: the contents of TORS are written into TOS, TOS into NOS, the DSP is incremented, NOS is written into the DS, RS[RSP] is read into TORS and the RSP is decremented.<br />

In a stack processor, data execution is normally faster than in classical processors because the data is implicitly available on the two tops of the stack, instead of having to be read from addressed registers or memory. This effectively reduces the length of the critical path. For a better understanding, the simplified data path for arithmetic and logic instructions is shown in figure 4.6. The processor reads memory in parallel to compensate for the 'one element less' on the stack balance. The memory read just fills the empty place resulting from the instruction execution (for the next instruction); therefore, the processor does not need to wait for address decoding before accessing the operands.<br />

[Figure 4.5: Modified stack processor. Block diagram with the program memory, the instruction buffer (registers R1 to R5) managed by the IBMU, the instruction pointer, the control unit (CU), the ALU, the TOS/NOS/TORS registers and the DSP/RSP pointers with their multiplexers.]<br />

Mostly, each 16-bit block of program memory contains two successive instructions (8-bit + 8-bit). The instructions residing in program memory pass through the instruction buffer (IB) and are decoded in the control unit, which activates all the MUXes accordingly. The IB is fed with pairs of bytes, LSB followed by MSB, as shown in figure 4.18. It consists of cascaded byte-size (8-bit) buffers connected via multiplexers which control the flow of instructions in the IB (as shown in figure 4.18). The interconnection between the multiplexers is controlled by the instruction buffer management unit (IBMU) (as shown in figure 4.6). The IB and IBMU will be discussed later in detail.<br />

Although the stack processor has fixed-length opcodes, some instructions need additional information (immediate data) to be executed, so the instruction block is sometimes larger than 8 bits. Such instructions include branches and calls, which need an additional 16-bit address (or an 8-bit displacement) to know the target address (e.g. see appendix C for ZBRA d, SBRA d), and the instructions requiring an immediate data constant (e.g. LIT a, DLIT a). This challenge is further addressed in section 4.4.2.<br />

The control unit manages the components of the processor; it reads and decodes the program<br />


instructions, and transforms them into a series of control signals which activate the other parts of the processor via the MUXes. The salient jobs of the control unit are:<br />

• Decode the numerical instruction code into a set of commands or signals for each MUX;<br />

• Update the DSP and RSP pointers;<br />

• Activate Read or Write from memory according to the active instruction;<br />

• Select the correct operation in the ALU.<br />

The IP points to the next instruction to be executed and prepares the program memory to feed the IB according to the next instruction's execution requirements.<br />

[Figure 4.6: Simplified data-path of the proposed model (arithmetic and logic instructions): the program memory feeds the instruction buffer (managed by the IBMU) and the control unit, which drives the ALU operating on TOS, NOS and TORS.]<br />

The execution of conditional/unconditional branches has been discussed in [KJ89] and further explored in the design of the modified stack processor [Jal09]. The stack processor is fast in branch execution due to its minimal pipelining [PB04]. However, every branch instruction is followed by a NOP (no-operation) because the IB is flushed to load new instructions, which is a performance penalty. This issue has not been addressed in this work; however, one possible solution is proposed in section 4.6.3.<br />


4.4 Design Challenges in FT Stack Processor<br />

This section is dedicated to the implementation of the FT methodology in the stack processor. The required architecture should have self-checking ability along with minimum performance degradation. These two challenges are addressed in this section.<br />




4.4.1 Challenge I: Self Checking Mechanism<br />

Having a minimum number of internal registers does not guarantee that errors cannot be provoked. External disturbances can still contaminate the execution of the processor (even if this is far less frequent than in other classes of processors, such as RISC and CISC implementations). There is therefore a need for a self-checking mechanism in the internal registers and the ALU.<br />

4.4.2 Challenge II: Performance Improvement<br />

Depending on the architectural choices made to implement the FT stack processor, two performance limitations restrict the overall execution speed: (i) multi-clock instruction execution and (ii) multiple-byte instruction blocks. Both issues add delay to program execution.<br />

Challenge II-a: Multi-clock Instruction Execution<br />

Most instructions require a single clock cycle in the data path to execute, but a few require multiple clock cycles, such as DUP, OVER, R2D, CPR2D, D2R, FETCH, STORE, PUSH_DSP, PUSH_RSP, LIT, DLIT and CALL. Their minimal clock count cannot be one in a non-pipelined architecture because of their conflicting accesses to the data bus in the same direction.<br />

[Figure 4.7: Different instruction types from the execution point of view (without pipelining): most instructions take 1 clock, some take 2 clocks, and only one instruction (STORE) takes 3 clocks.]<br />

For a better understanding, let us explore the DUP (duplication) instruction, which requires two clock cycles in the data path. Here, the contents of TOS must be copied into NOS and the contents of NOS transferred to the third position of the DS (pointed to by the DSP). If the instruction were executed in one clock cycle, the data in the third element of the DS would be lost: NOS would be written at the address pointed to by the DSP, and without a prior increment of the DSP we would lose one data element.<br />

In two clock cycles, on the other hand, it executes successfully: during the first clock cycle (t) a new place is created by incrementing the DSP, and in the next cycle (t + 1) NOS is written into DS[DSP] and TOS is copied into NOS, as shown in figure 4.8. Such multi-cycle instructions degrade performance and call for execution pipelining.<br />
instructions result in performance degradations and needs employing execution pipelining.



[Figure 4.8: Execution of the duplication (DUP) instruction in 2 clocks: at cycle t the DSP is incremented through MUX_ADD_DSP; at cycle t + 1 NOS is written to the data stack and TOS to NOS.]<br />

Challenge II-b: Multiple-byte Instructions Block<br />


The opcodes of the instructions are one byte long. They have implicit source and destination registers and do not need explicit addressing; e.g. the ADD (addition) instruction means adding TOS to NOS and storing the result in TOS. However, 7 of the 37 instructions require an additional parameter, either an immediate 8-bit or 16-bit constant (LIT and DLIT respectively), a 16-bit absolute address (LBRA, CALL) or an 8-bit displacement (SBRA, ZBRA) (see appendix C, table 8).<br />

On the other hand, the program memory is 16 bits wide whereas the instruction opcode is 8 bits. The average flow (in bits) of executed instructions is lower than the instruction pre-fetching flow capacity, the first being close to 8 bits per cycle while the latter is closer to 16 bits per cycle. Therefore, instructions are loaded into the IB at almost twice the rate of execution (considering that an 8-bit opcode is executed in one clock cycle). This requires intelligent instruction buffer management to: (i) monitor the input and output flow, and (ii) manage the flow of variable-size instruction blocks (like LIT, LBRA).<br />
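The flow mismatch can be illustrated with a toy model (the class and method names are our assumptions; the real IB/IBMU are described later in this chapter):

```python
from collections import deque

# Toy model: program memory delivers one 16-bit block (two bytes,
# LSB then MSB) per fetch, while the control unit consumes one opcode
# byte plus 0-2 parameter bytes per executed instruction.
class InstructionBuffer:
    def __init__(self):
        self.regs = deque()  # cascaded byte-size buffers (R1..R5)

    def fetch(self, word16):
        self.regs.append(word16 & 0xFF)          # LSB first
        self.regs.append((word16 >> 8) & 0xFF)   # then MSB

    def consume(self, nbytes):
        # opcode (+ optional immediate bytes) pulled by the control unit
        return [self.regs.popleft() for _ in range(nbytes)]

ib = InstructionBuffer()
ib.fetch(0x2211)                   # loads bytes 0x11 then 0x22
assert ib.consume(1) == [0x11]     # a 1-byte instruction leaves 0x22 pending
```

Since each fetch supplies two bytes while a plain instruction consumes one, the buffer fills twice as fast as it drains, which is why the IBMU must throttle pre-fetching and align multi-byte instruction blocks.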

To execute one instruction per cycle, the next instruction to be executed should reach the control unit in the next clock cycle (t + 1). Figure 4.9 addresses this issue. First, suppose that the instruction currently being executed is ADD (the instruction to be executed lies<br />


[Figure 4.9: Multiple-byte instructions: 1-, 2- and 3-byte instruction blocks (e.g. ADD, LIT, DLIT) flowing through the instruction buffer registers R1 to R5 so that the next opcode reaches the control unit at t + 1.]<br />

in register R5). Since ADD is a 1-byte instruction, the opcode of the next instruction to be executed must be in register R4, and it must reach the control unit in the next clock cycle (t + 1). Second, suppose LIT a is the active instruction (at time t); then in the next clock (t + 1) the contents of R3 should reach the control unit.<br />

The solutions to the above-mentioned challenges are presented in the following sections.<br />

4.5 Solution-I: Self Checking Mechanism<br />

The processor needs error detection inside the ALU and for its internal states. We start with the design of a self-checking ALU.<br />

4.5.1 Error Detecting in ALU<br />

There is no single code that can simultaneously protect both the arithmetic and the logic operations. Consequently, we use a combination of arithmetic and logic codes (often called 'combination codes'): a modulo-3 residue code to protect arithmetic operations on one side, and a parity code to protect logic operations on the other. We have chosen these codes because they are simple and yet effective enough to prove the effectiveness of our approach. Moreover, they require minimum resources, as can be seen from the results in [SFRB05].<br />

In [SFRB05], an ALU designed with different error detection techniques was simulated using the Quartus II simulation tool provided by Altera. The FPGA resource utilization of two built-in error detection (BIED) techniques (Berger check and residue/parity codes check) was recorded from the simulation. Figure 4.11 shows the resource utilization of the two BIED techniques compared with a TMR ALU and an ALU without any error detection.<br />

[Figure 4.10: Data path of the protected processor's ALU: the protected NOS and TOS registers feed the protected ALU, with new data arriving according to the next instruction.]<br />

It is obvious from figure 4.11 [SFRB05] that the EDALU (error-detecting ALU with modulo-3 residue/parity check) uses 54% fewer logic elements than the TMR ALU, and the Berger check prediction ALU uses 42% fewer logic elements than the TMR ALU. According to these results, it is clear that the ALU with residue/parity check has better resource utilization than the one with Berger codes.<br />

ALU instructions fall into two groups: arithmetic and logical (see figure 4.12). By grouping the instructions, the active area of the circuit at any instant is reduced [SFRB05]. For example, a strike on the module generating the arithmetic parity would not affect the logical module, and vice versa.<br />

Error Detecting in Arithmetic Instructions<br />

A remainder calculated from the data symbols X and Y can protect an arithmetic operation in the ALU. Here, error detection in arithmetic instructions is based on a modulo-3 residue check. In the ALU, these instructions trigger two concurrent computations (as shown in figure 4.13). On one hand, the two operands X and Y undergo the arithmetic operation and the result is stored in S. On the other, the<br />


[Figure 4.11: Resource utilization chart for various ALU designs [SFRB05]. Logic element count: TMR ALU (triple modular redundancy) 3106; BC ALU (Berger code check) 1792; ED ALU (residue/parity codes check) 1432; plain ALU 986.]<br />

residues, P_AX and P_AY, undergo the equivalent arithmetic operation to generate the predicted residue P_AS, which is stored with S (as shown in figure 4.13). In the next clock, the parity generator (mod-3 generator) produces P_A'S (the residue of S), which is compared with the already stored P_AS. In case of discrepancy, the error alarm signal is raised. The mathematics behind the residue check codes is shown below:<br />

residue check codes are shown below:<br />

X = xnx(n − 1)x(n − 2).....x2x1x0,<br />

Y = yny(n − 1)y(n − 2).....y2y1y0,<br />

where Xand Y are the data/information symbols applied to the input of the ALU.<br />

C = CmC(m − 1)C(m − 2).....C2C1C0<br />

where C is the check divisor used to calculate the residue. The remainders determined from the<br />

division of ALU data symbols X and Y from the check divisor C is given by<br />

P AX = RxmRx(m − 1).....Rx2Rx1Rx0,<br />

P AY = RymRy(m − 1).....Ry2Ry1Ry0<br />

where P AX and P AY represent the remainders from X and Y respectively and Rxm = xn/C and so<br />

on. The ALU output is represented as follows<br />

P A ′ S<br />

S = X � Y where, � = ADD, SUB or MUL<br />

= S mod C<br />

P AS is the remainder check symbol which is given by<br />

�<br />

P AS = (P AX P AY ) mod C<br />

The error signal is generated by the comparator, which is given by the following function


[Figure 4.12: The ALU protects the logical and arithmetic instruction groups separately: logical instructions by a 1-bit parity check, arithmetic instructions by a 2-bit mod-3 residue.]<br />

Error Signal = 1 if P_AS ≠ P_A'S<br />
Error Signal = 0 if P_AS = P_A'S<br />

P_LS, the logic parity, will be generated locally for the next instruction.<br />
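The residue check can be sketched in Python (a minimal illustration of the scheme, not the actual ALU implementation; the function name is an assumption):

```python
C = 3  # check divisor of the modulo-3 residue code

def checked_add(x, y):
    s = x + y                    # main ALU computation
    pa_x, pa_y = x % C, y % C    # input residues P_AX, P_AY
    pa_s = (pa_x + pa_y) % C     # predicted residue P_AS, stored with S
    pa_s_regen = s % C           # P_A'S, regenerated in the next clock
    error = pa_s != pa_s_regen   # comparator raises the error signal
    return s, error

# Fault-free case: X = 10, Y = 11 -> S = 21, residues agree, no error
assert checked_add(10, 11) == (21, False)
```

A fault that corrupts S is caught whenever it changes S mod 3, which is the coverage (and the limit) of a single modulo-3 residue.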

For instance, if X = 10, Y = 11 and C = 3:<br />
Residue of X: P_AX = 10 mod 3 = 1<br />
Residue of Y: P_AY = 11 mod 3 = 2<br />
First concurrent computation: S = X + Y = 10 + 11 = 21<br />
Residue of the first computation: P_A'S = 21 mod 3 = 0<br />
Addition of P_AX and P_AY: 1 + 2 = 3, so P_AS = 3 mod 3 = 0<br />
The residues P_AS and P_A'S are equal; therefore, no error.<br />

Error Detecting in Logic Instructions<br />

Error detection in logical instructions is based on the calculation of a parity bit from the information symbols in X and Y. The parity calculation is simple: the parity bit is computed by XORing the information bits. Of the two variants, even and odd parity, we use even parity, which means that the parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is odd, making the entire set of bits (including the parity bit) even. By comparing the parities of the inputs and the output, the error signal is set high or low.<br />
comparing the parity bit from input and output, the re-generate/re-configure signal is set as high/low.


[Figure 4.13: Remainder check technique for error detection in arithmetic instructions: X and Y enter the ALU to produce S, while their residues P_AX and P_AY produce the predicted residue P_AS = (P_AX ⊙ P_AY) mod C; a parity generator recomputes P_A'S = S mod C and a checker compares the two to raise the error signal.]<br />

It can be represented by the simple logical equations below:<br />
P_LX = x15 XOR x14 XOR x13 XOR ... XOR x0<br />
P_LY = y15 XOR y14 XOR y13 XOR ... XOR y0<br />
where X = (x15 x14 x13 ... x0), Y = (y15 y14 y13 ... y0), and P_LX and P_LY represent the parities of X and Y respectively. For a logic operation<br />
S = X ⊙ Y, where ⊙ = AND or OR,<br />
the predicted parity is P_LS = P_LX XOR P_LY, and the parity regenerated from the result is P_L'S = s15 XOR s14 XOR ... XOR s0.<br />
The error signal is generated by the comparator, which is given by the following function:<br />
Error Signal = 1 if P_LS ≠ P_L'S<br />
Error Signal = 0 if P_LS = P_L'S<br />

Similarly, P_AS will be generated locally for the next instruction (if needed). It is a synchronous system, and an error is detected on the next clock: in the ALU, the error latency is one clock cycle.<br />


[Figure 4.14: Parity check technique for error detection in logic instructions: the predicted parity P_LS is compared with the regenerated parity P_L'S to produce the error signal.]<br />

4.5.2 Error Detecting in Register and Data-Path<br />

For error detection in the registers, we again rely on parity codes. A register concurrently checks for errors by matching the regenerated parity bit against the stored one; a mismatch raises the error signal (as shown in figure 4.15). Note that a single parity check can only detect single-bit errors, or more generally errors of odd multiplicity.<br />
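This register check can be sketched as follows (illustrative Python, assuming 16-bit registers; the function names are ours, not the hardware description):

```python
def parity16(word):
    # even-parity bit: 1 when the number of 1-bits in the word is odd
    return bin(word & 0xFFFF).count("1") & 1

def write_reg(word):
    return (word, parity16(word))    # store data together with its parity

def read_reg(stored):
    word, p = stored
    error = parity16(word) != p      # regenerate parity and compare
    return word, error

reg = write_reg(0x00FF)
assert read_reg(reg) == (0x00FF, False)   # fault-free read
corrupted = (reg[0] ^ 0x0010, reg[1])     # single-bit upset in the data
assert read_reg(corrupted)[1] is True     # detected
```

Flipping two bits of `reg[0]` would go undetected, which is exactly the even-multiplicity limitation stated above.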

[Figure 4.15: Parity check technique for error detection in register(s): the stored parity P_X is compared with the regenerated parity P'_X to produce the error signal.]<br />



4.5.3 Self-Checking Processor<br />

With the protections of subsections 4.5.1 and 4.5.2, the processor has built-in self-checking facilities to detect SBUs. The error coverage could be improved with alternative EDCs; however, this would also increase the circuit complexity.<br />

[Figure 4.16: An error occurring in the protected ALU, fed by the protected NOS and TOS registers with data according to the next instruction.]<br />

4.5.4 Store Sensitive Elements (SE)<br />

The six internal states TOS, NOS, TORS, DSP, RSP and IP must be saved at the end of each valid sequence for a possible rollback. We have decided to store them in the DM. The procedure consists of six consecutive instructions, by which the contents of TORS are stored in the RS and the others in the DS. The positive aspect of this approach is that it adds no hardware overhead inside the processor, while the downside is the extra performance penalty of storing the SEs. One possible combination of instructions is:<br />

CALL a<br />

CPR2D<br />

PUSH_RSP<br />

PUSH_DSP<br />

DUP<br />

DUP<br />

An alternative to storing the internal states in memory is to use internal shadow registers holding the<br />
end values of the previous valid sequence. On rollback, the shadow copies are loaded back into the<br />
corresponding registers. The advantage of this scheme is that a single clock cycle suffices to save or restore the registers.<br />
However, it doubles the register count of the SEs, and the shadow copies must themselves be protected,<br />
which incurs extra hardware overhead. Moreover, it is not a favourable choice for context swapping.



4.5.5 Protecting Opcode<br />

The program memory resides inside the DM and is therefore already protected; faults may<br />
nevertheless corrupt the opcode during execution. Fortunately, the opcode can be protected<br />
without additional hardware penalty: a 6-bit code is already enough to address 64 different<br />
instructions, and our 8-bit opcode only has to distinguish 37 instructions, which allows us to employ some low-overhead<br />
EDC without any additional storage.<br />
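As an illustration of such a low-overhead EDC, the sketch below (purely hypothetical, and ignoring the constraint of section 4.6.2 that the three opcode MSBs also encode the instruction type) assigns each of the 37 instructions an odd-parity 8-bit opcode, so that any single-bit upset becomes detectable by a parity check alone:<br />

```python
def parity(x: int) -> int:
    """Parity (XOR of all bits) of an integer."""
    return bin(x).count("1") & 1

# Assign each of the 37 instructions an odd-parity 8-bit opcode:
# there are 128 such codes, far more than the 37 we need.
odd_codes = [c for c in range(256) if parity(c) == 1]
assert len(odd_codes) >= 37
opcodes = odd_codes[:37]

# Any single-bit upset turns an odd-parity opcode into an even-parity
# word, so the decoder can flag it without extra stored bits.
for code in opcodes:
    for bit in range(8):
        assert parity(code ^ (1 << bit)) == 0
```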

4.6 Solution-II: Performance Aspects of Self-Checking Processor<br />

Core<br />

Owing to the chosen stack architecture, the data is implicitly available on the two tops of the stack,<br />
which reduces the length of the critical path. But for high time performance, (i) the instruction execution<br />
rate should be approximately one clock per instruction, and (ii) for multiple-byte instructions, the next<br />
instruction to be executed must reach the control unit in clock cycle (t + 1). In other words, there<br />
should be a continuous flow of instructions inside the IB. The instruction buffer management unit (IBMU)<br />
is dedicated to this task.<br />

The IBMU generates six control signals. Five of them are dedicated to the data flow<br />
control in the IB, namely SM1, SM2, SM3, SM4 and SM5, as shown in figure 4.17, while the sixth<br />
(SM6) is reserved for the IP. The next sections address solutions to each of the issues above.<br />

4.6.1 Solution-II (a): Multiple-byte Instructions<br />

There are seven multiple-byte instructions (blocks of 2 or 3 bytes). The IBMU controls the flow of<br />
instructions in the IB by pre-fetching the next instruction and making it available to the control unit during the<br />
next clock cycle (t + 1). The IBMU drives a series of cascaded buffers with multiple interconnections<br />
to cope with complex conditions, as shown in figure 4.18. The decisions inside the IBMU<br />
are taken according to the predefined states of an FSM (Finite State Machine); transitions between those<br />
states depend upon the present state of the IB.<br />

Table 4.1: Instruction types<br />

b7 b6 b5 | details<br />
1 0 0 | 1-byte<br />
1 1 1 | 1-byte (multi-clock)<br />
1 0 1 | 2-bytes<br />
1 1 0 | 3-bytes<br />
0 1 1 | 1-byte + IP-change<br />
0 0 1 | 2-bytes + IP-change<br />
0 1 0 | 3-bytes + IP-change
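A minimal software model of the type decoding implied by table 4.1 (the tuple fields are our own naming, introduced for illustration):<br />

```python
# (length_bytes, ip_change, multi_clock) keyed by the opcode MSBs b7 b6 b5,
# following table 4.1.
INSTR_TYPE = {
    0b100: (1, False, False),
    0b111: (1, False, True),   # 1-byte, multi-clock
    0b101: (2, False, False),
    0b110: (3, False, False),
    0b011: (1, True,  False),  # IP-change (jump-type instruction)
    0b001: (2, True,  False),
    0b010: (3, True,  False),
}

def classify(opcode: int):
    """Extract b7 b6 b5 from an 8-bit opcode and look up its type."""
    return INSTR_TYPE[(opcode >> 5) & 0b111]

assert classify(0b1110_0001) == (1, False, True)
assert classify(0b0100_1010) == (3, True, False)
```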



Figure 4.17: Instruction buffer management unit (IBMU). Jump-type instructions update the IP as IP ← IP + d (ZBRA d, SBRA d), IP ← a (LBRA a, CALL a) or IP ← TORS (RETURN).<br />

4.6.2 Solution-II (b): 2-Stage Pipelining to resolve Multi-clock Instruction Execution<br />

The majority of the instructions require a single clock cycle to be processed in the data-path,<br />
while others require multiple clock cycles. To apply the pipelining, we need to differentiate between the<br />
single and multiple-clock-cycle instructions. The three most significant bits (b7 b6 b5) of the opcode<br />
are reserved to determine the type of the instruction, as shown in figure 4.19 (a). Effectively, the<br />
instructions that require multiple clock cycles per execution have been given the code '111' in<br />
b7 b6 b5. We can differentiate between the various instructions on the basis of instruction length and IP<br />
change, as shown in table 4.1. The IP change occurs in the instructions containing a jump.<br />

The multiple-clock-cycle instructions have been analysed (with various instruction combinations)<br />
to find the possible conflicts between them. It has been found that if they are executed in a two-stage<br />
pipeline, all conflicts in addressing the memory can be avoided. During the first stage, the stack<br />
pointers are incremented (DSP + 1 / RSP + 1) according to the type of instruction, while in the<br />
second stage the rest of the instruction is executed; there are then no conflicts in accessing the memory.<br />
The DSP (Data Stack Pointer) and RSP (Return Stack Pointer) point to the tops of the DS and RS in the DM,<br />
respectively. This results in the two-stage execution pipeline shown in figure 4.19 (b).



Figure 4.18: Instruction buffer<br />

During pipelining, part of the next instruction is pre-executed (DSP + 1 / RSP + 1) together with the active<br />
(present) instruction in a clock cycle. In this way, the remaining part of the next instruction can be executed<br />
in a single clock cycle during the next clock cycle. The control unit takes the 8-bit opcode of the present instruction to<br />
generate the control signals for all the associated MUXs. Simultaneously, the three MSBs of the opcode<br />
of the next instruction to be executed are extracted, as these MSBs identify the type of the<br />
instruction.<br />

To evaluate the effectiveness of the pipelining, we have executed a sample benchmark consisting<br />
of five instructions (shown in figure 4.20). Without pipelining this program requires 9 clock cycles, while<br />
only 5 are needed with the pipelining: an improvement of roughly 45%.<br />
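The cycle counts can be cross-checked with a small model; the per-instruction counts below are read off figure 4.20, and (9 − 5)/9 gives the quoted improvement of roughly 45%:<br />

```python
# Micro-operation counts in the non-pipelined core, as listed in
# figure 4.20 (ADD needs one cycle; the others need a stack-pointer
# update plus the instruction body).
NON_PIPELINED_CYCLES = {"ADD": 1, "DUP": 2, "R2D": 2, "PUSH_DSP": 2, "LIT": 2}
program = ["ADD", "DUP", "R2D", "PUSH_DSP", "LIT"]

non_pipelined = sum(NON_PIPELINED_CYCLES[i] for i in program)

# With two-stage pipelining, the pointer update of the *next* instruction
# overlaps the body of the present one, so each instruction retires in
# one clock cycle.
pipelined = len(program)

assert non_pipelined == 9
assert pipelined == 5
print(f"improvement: {100 * (non_pipelined - pipelined) / non_pipelined:.1f}%")
```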

Therefore, with pipelining, all instructions can be executed in a single control cycle except the STORE<br />
instruction, which requires two control clock cycles: STORE needs to execute<br />
DSP + 1 twice, which cannot be done in a single clock cycle. The complete list of the instructions<br />
is given in the tables in appendix C.<br />

4.6.3 Reducing Overhead for Conditional Branches<br />

It has been previously discussed that instructions are loaded into the IB at almost twice the rate at<br />
which they are executed (an 8-bit opcode is executed in one clock cycle). Therefore, the next instruction to be executed<br />
is already pre-fetched in the IB. However, in case of a jump instruction the IB must be flushed and new<br />




Figure 4.19: (a) Opcodes description and (b) pipelined execution model<br />

Figure 4.20: A sample program executed through non-pipelined and pipelined stack processor core<br />

instructions should be loaded, which results in a performance penalty. This can be overcome by<br />
taking advantage of the fact that instructions are loaded faster than they are consumed in the stack<br />
processor: both possible continuations of the jump are loaded into the IB (as shown in figure 4.23).<br />
Therefore, there are no extra NOPs as a consequence of a jump instruction. However, this increases<br />
the complexity of the instruction buffer management unit, and a larger IB may be needed. The<br />
VHDL-RTL implementation of such a solution is not considered in this work.<br />
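A hypothetical sketch of this dual-path loading (function names and the buffer depth are our own; the thesis deliberately leaves the RTL implementation open):<br />

```python
# Model of the figure-4.23 idea: keep both successors of a conditional
# branch resident in the instruction buffer so that neither outcome
# forces a flush.
def fetch_window(program, pc, branch_target, depth=2):
    """Prefetch `depth` opcodes down both the fall-through and taken paths."""
    fall_through = program[pc + 1 : pc + 1 + depth]
    taken = program[branch_target : branch_target + depth]
    return {"cond. 1": fall_through, "cond. 2": taken}

prog = ["LIT", "ZBRA", "ADD", "DUP", "STORE", "RETURN"]
window = fetch_window(prog, pc=1, branch_target=4)
assert window["cond. 1"] == ["ADD", "DUP"]
assert window["cond. 2"] == ["STORE", "RETURN"]
# Whichever way the branch resolves, the next opcode is already buffered.
```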




Figure 4.21: Timing diagram for a sample program executed twice: once in non-pipelined version<br />
(9 cycles) and then in pipelined version (5 cycles)<br />

4.7 Implementation Results<br />

The self-checking processor core has been synthesized with Altera Quartus II. Figure 4.22 shows<br />
the implementation design flow of the SCPC, modelled in VHDL-RTL and implemented on an Altera<br />
Stratix III EP3SE50F484C2 device. From the results, the following observations<br />
can be made:<br />

• Area occupation: the results obtained in terms of area are reported in table 4.2. The area<br />
required by the SCPC is minimal, so it can be a suitable core processor for future MPSoC development.<br />

Table 4.2: Implementation area<br />

Comb. ALUTs (Ded. Logic)<br />
SCPC: 861 (278)<br />

• Performance analysis: although in this chapter we have only modelled a processor core (SCPC),<br />
and the model needs to be completed with the self-checking hardware journal (SCHJ) studied<br />
in the next chapters, the processor performance can already be analyzed to assess the effectiveness<br />
of the stack approach. In a stack-based machine, the clock cycle is short because<br />
the operands are implicitly available on the two tops of the stack. It is interesting to note that<br />
the chosen stack processor requires two-stage pipelining to obtain rather good performance. All<br />
instructions (except STORE) can be executed in a single clock cycle. The performance of the<br />
architecture was checked and the results are shown in figure 4.24; they depict the execution<br />
of instructions in a single clock cycle.



Figure 4.22: Implementation design flow (proposed VHDL model, synthesis and simulation with Altera Quartus II, giving area and frequency results)<br />

Figure 4.23: Strategy to overcome performance overhead due to conditional branches<br />

• Self-checking analysis: we validate the error detection ability by injecting a simple error (SBU).<br />
The complete validation of the overall model will be presented in chapter 6, where different<br />
error scenarios will be injected artificially to check the effectiveness of the overall<br />
approach. Here, the implementation results in figure 4.25 show the processor in the<br />
read/write mode (mode 01; the working modes of the processor will be discussed in chapter 5).<br />
At some instant, an error is artificially injected into the self-checking processor. On the detection of<br />




Figure 4.24: Implementation of a self-checking processor core (sample program: LIT 5, LIT 6, ADD, DUP, LIT A00, STORE, ADD)<br />

an error, forward instruction execution stops and the processor rolls back.<br />

Figure 4.25: Error detected in SCPC<br />

4.8 Conclusions<br />

In this chapter, we have designed a self-checking processor core (SCPC) tolerant to SBUs, along<br />
with measures to improve its performance. Design choices have been made in order to<br />
ensure fast error detection in the resultant processor with minimum hardware overhead. Error detection<br />
is based on combinational codes (residue and parity), while error recovery is based on the rollback<br />
mechanism.<br />

The interesting point is the choice of a MISC stack computer architecture. It is a simple processor<br />
with few internal states, which is favourable for both CED and rollback. It occupies a small area<br />
on chip, which is favourable from both the dependability and hardware-saving points of view.<br />

To improve the instruction execution rate, the processor uses a two-stage execution pipeline.<br />
The instruction buffer management unit controls the flow of multiple-byte instructions<br />
in the instruction buffer. We therefore take advantage of the high code density of variable-length<br />
instructions while enabling a two-stage execution pipeline in which a part of the next instruction is<br />
pre-executed along with the present instruction.



In the next chapter, we will discuss the design and implementation of the self-checking hardware<br />
journal that prevents errors from entering the dependable memory.




Chapter 5<br />

Design of a Self Checking Hardware Journal<br />

This chapter focuses on the design of the self-checking hardware journal (SCHJ), which is used<br />
as a centerpiece in our strategy to devise a fault-tolerant processor against transient faults (as shown<br />
in figure 5.1).<br />
in figure 5.1).<br />

Figure 5.1: Design of SCHJ<br />

The basic role of the SCHJ is to hold the new data generated during the currently executed<br />
sequence until it can be validated at the end of that sequence (see figure 5.2). If the sequence is<br />
validated, this data can be transferred to the DM. Otherwise, in the case of error detection during the<br />
current sequence, this data is simply skipped and the sequence can restart from the beginning<br />
using the trustable data held in the DM, corresponding to the state prevailing at the end of the<br />
previous sequence. However, an error detection and correction mechanism is also needed in the<br />
journal itself, to catch errors provoked in the journal during the data's temporary stay.<br />

The chapter explores the construction and operation of the SCHJ, and is organized as<br />
follows. The first section describes the self-checking methodology. The following section describes the hardware<br />
architecture and operation of the journal. Finally, to evaluate the self-checking<br />
hardware journal, a generic model is described in VHDL-RTL (Register Transfer Level) and<br />




is synthesized on Altera Quartus II.<br />

Figure 5.2: Protecting DM from contamination.<br />

5.1 Error Detection and Correction in the Journal<br />


It has been shown in section 3.5.2 that the journal should have a built-in self-checking mechanism,<br />
because data stored inside this temporary location can also be corrupted as a consequence of transient<br />
faults affecting it (see figures 5.2 and 5.3).<br />

In the journal, part of the data belongs to the present SD (stored in the upper,<br />
un-validated part of figure 5.3 (a)) and the rest belongs to the previous SD (VD in the lower part<br />
of the journal in figure 5.3 (b)). If an error is detected in the data belonging to the present sequence,<br />
we can roll back to the previously validated state. However, if an error occurs in data that does<br />
not correspond to the present state, we cannot roll back, because the states of the SEs are no longer<br />
saved in the memory, as shown in figure 5.3 (b). This means that error detection is sufficient in the<br />
UVJ, whereas error correction is needed in addition to error detection inside the VJ.<br />

ECC will be employed for the detection and correction of errors in the SCHJ. Among ECCs, the Hamming<br />
codes and Hsiao codes are most commonly employed [Sta06]. Of the two, the Hsiao code is more<br />
efficient and requires less hardware overhead than Hamming [GBT05]. It has been widely used<br />
in designing dependable memories [Che08]. Hsiao codes have been employed for three decades and are<br />
still the most efficient codes used in industry [GBT05, Che08].<br />

5.2 Principle of the technique<br />

The Hsiao codes [Hsi10] will be employed in the self-checking HW journal. They provide fast<br />
encoding and fast error detection in the decoding process. The code is obtained by shortening a Hamming<br />
code. Its construction is best described in terms of the parity-check matrix Ho. The<br />
selection of the columns of the Ho matrix for a given (n, k) code is based on three conditions:



Figure 5.3: (a) Error(s) in un-validated journal (b) error(s) in validated journal<br />

• Every column should have an odd number of 1's.<br />

• The total number of 1's in the Ho matrix should be a minimum.<br />

• The number of 1's in each row of Ho should be made equal, or as close as possible, to the<br />
average number (i.e., the total number of 1's in Ho divided by the number of rows).<br />

The first requirement guarantees that the code generated by Ho has a minimum distance of at least 4;<br />
therefore, it can be used for single-error correction and double-error detection. The second and third<br />
requirements yield a minimum number of logic levels in forming the parity and syndrome bits, and less hardware<br />
in the implementation of the code. For instance, if r parity-check bits are used to protect k data bits, then<br />
the following inequality should hold for Hsiao codes:<br />
the following equation should be true for Hsiao codes:<br />

�≤r<br />

i=1,i=odd<br />

Precisely, Ho matrix is constructed as follows:<br />

(a) all � � r<br />

weight-1 columns are used for the r check bit positions;<br />

1<br />

�<br />

r<br />

i<br />

�<br />

≥ r + k (5.1)



(b) next, if $\binom{r}{3} \ge k$, select $k$ weight-3 columns out of all possible $\binom{r}{3}$ combinations; if $\binom{r}{3} < k$, all<br />
weight-3 columns should be selected. The leftover columns are then selected first from all $\binom{r}{5}$<br />
weight-5 columns, then from the $\binom{r}{7}$ weight-7 columns, and so on until all $k$ columns have unique combinations.<br />

If the codeword length $n = k + r$ is exactly equal to<br />

$$\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} \binom{r}{i} \qquad (5.2)$$

for some odd $j \le r$, each row of the Ho matrix will have the following number of 1's:<br />

$$\frac{1}{r}\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} i\binom{r}{i} = \frac{1}{r}\left[r + 3\,\frac{r(r-1)(r-2)}{3!} + \cdots + j\,\frac{r(r-1)\cdots(r-j+1)}{j!}\right] \qquad (5.3)$$

$$= 1 + \binom{r-1}{2} + \cdots + \binom{r-1}{j-1} \qquad (5.4)$$

If $n$ is not exactly equal to the sum in (5.2) for some $j$, then the arbitrary selection of the $\binom{r}{i}$ cases<br />
should make the number of 1's in each row close to the average number.<br />

Single-bit error correction and double-bit error detection are accomplished in the following<br />
way. A single-bit error results in a syndrome pattern that matches a column of the parity check<br />
matrix Ho; thus, matching the syndrome pattern to a column of Ho identifies the erroneous<br />
bit. If the column corresponds to a check bit, then no correction is necessary; otherwise, bit inversion<br />
corrects the error [Lal05]. Double-error detection is accomplished by examining the overall parity<br />
of the syndrome bits: since the Hsiao code uses only an odd number of 1's in the columns of its Ho, a<br />
syndrome pattern corresponding to a single-bit error has odd parity. If the non-zero syndrome has an even number<br />
of 1's, it indicates the presence of a double error in the code word.<br />
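The correction and detection rule can be demonstrated on a toy Hsiao-style (8, 4) code rather than the (41, 34) code used in the journal; the H matrix below uses four weight-3 data columns and four weight-1 check columns, so every column has odd weight (an illustrative sketch, not the thesis implementation):<br />

```python
# Toy Hsiao-style (8, 4) SEC-DED code. Columns 0-3 carry data (weight 3),
# columns 4-7 the check bits (weight 1); every column has odd weight.
H = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]

def encode(data):
    """Append the r = 4 check bits to a 4-bit data word."""
    checks = [sum(H[r][c] & data[c] for c in range(4)) & 1 for r in range(4)]
    return data + checks

def syndrome(word):
    """Syndrome s = H . word (mod 2) for an 8-bit received word."""
    return [sum(H[r][c] & word[c] for c in range(8)) & 1 for r in range(4)]

def decode(word):
    s = syndrome(word)
    if not any(s):
        return "ok", word
    if sum(s) % 2 == 1:
        # Odd-weight syndrome: single-bit error; it matches one column of H.
        for c in range(8):
            if [H[r][c] for r in range(4)] == s:
                fixed = list(word)
                fixed[c] ^= 1
                return "corrected", fixed
    # Non-zero even-weight syndrome: uncorrectable double error.
    return "double-error", word

cw = encode([1, 0, 1, 1])
assert decode(cw) == ("ok", cw)

single = list(cw); single[2] ^= 1
assert decode(single) == ("corrected", cw)     # error located and inverted

double = list(cw); double[0] ^= 1; double[1] ^= 1
assert decode(double)[0] == "double-error"     # even-parity syndrome
```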

Figure 5.4: Hsiao parity check matrix (41, 34)<br />

Hsiao showed that by using minimum odd-weight columns, the number of 1's in the Ho-matrix<br />
could be minimized (and made less than in a Hamming SEC-DED code). This translates to less hardware<br />



area in the corresponding ECC circuitry. Furthermore, by selecting the odd weight columns in a<br />

way that balances the number of 1’s in each row of the Ho-matrix, the delay of the checker can be<br />

minimized (as the delay is constrained by the maximum weight row).<br />

Effectively, the data residing in the self-checking HW journal is coded with parity bits generated<br />
according to Hsiao codes [Hsi10]. These parity bits ensure that the data written in the journal remains<br />
unchanged. Each block (row) in the journal has three parts: the first contains a pair of data and<br />
corresponding address, the second consists of the w and v bits, and the third consists of the generated<br />
parity, as shown in figure 5.6. We have used the Hsiao (41, 34) code to protect the stored data; this<br />
class of codes provides SEC-DED. There are 7 parity bits, and the H-matrix is constructed as follows:<br />

1. all $\binom{7}{1} = 7$ weight-1 columns are used;<br />

2. we selected 34 weight-3 columns out of all $\binom{7}{3} = 35$ possible combinations.<br />

The parity check matrix (Ho) for the (41, 34) Hsiao code is shown in figure 5.4. It has the following features:<br />

1. the total number of 1's in the H-matrix is equal to 7 + 3 × 34 = 109;<br />

2. the average number of 1's in each row is equal to 109/7 ≈ 15.6.<br />

Moreover, these codes are encoded and decoded in a parallel manner. In encoding, the message bits<br />
enter the encoding circuit in parallel and the parity-check bits are formed simultaneously. In decoding,<br />
the received bits enter the decoding circuit in parallel, the syndrome bits are formed simultaneously,<br />
and the received bits are corrected in parallel. Double-error detection is accomplished by examining<br />
the number of 1's in the syndrome vector.<br />

5.3 Journal Architecture and Operation<br />

The journal storage space is internally split into two parts, UVJ and VJ, as shown in figure 5.5. At<br />
the end of each valid SD, the contents of the UVD turn into VD, and the virtual line separating the<br />
upper part from the lower part shifts up to denote the new situation. Meanwhile, the VD is transferred to<br />
the DM during the execution of the current sequence.<br />

Each row in the SCHJ is 41 bits long. The v and w bits will be discussed later; together with the 16<br />
address bits and the 16 data bits, they represent the information corresponding to a single block<br />
stored in the SCHJ. The remaining bits in the row are parity bits, which represent the information<br />
redundancy related to the error-correcting code protecting the other bits, as shown in figure 5.6. In<br />
order to trust the data temporarily stored in the SCHJ, we need a built-in mechanism to detect and correct<br />
errors that may occur due to transient faults. Here, we have chosen to rely on error control coding,<br />
a classic and effective approach to protect storage devices [ARM + 11]. In section 5.1, we selected<br />
the Hsiao (41, 34) code, a systematic single-error-correction and double-error-detection (SEC-DED)<br />
code, Hsiao codes being more effective than Hamming codes in terms of cost and reliability [Hsi10].
code, Hsiao codes being more effective than Hamming codes in terms of cost and reliability [Hsi10].



Figure 5.5: SCHJ structure (rows of address bits, data bits, w and v bits, parity bits; un-validated data above, validated data below).<br />

The system is based on model 2 presented in chapter 3, where the data cannot be written directly<br />
to the DM (depicted in figure 5.7), in order to ensure that its contents are always trusted. The data is first<br />
written into the SCHJ and only later into the DM. The corresponding address is always searched in the un-validated area,<br />
so that no two data elements in this area correspond to the same address. If the address is found, the data<br />
element is updated. Else, a new row is initialized in the un-validated area with w = 1 and v = 0, and<br />
the address, data and parity-bit fields are filled with the adequate values. The w and v bits denote<br />
written and validated data, respectively.<br />

Before being transferred to the DM, data awaits the validation of the current sequence at the VP.<br />
The waiting delay depends on the number of instructions executed in the SD. If no error is found<br />
at the end of the current sequence, the processor validates the sequence, and all the UVD in the SCHJ is<br />
validated by switching the corresponding v bits to 1. Otherwise, if any error is detected, the sequence<br />
is not validated and the UVD in the SCHJ is discarded by switching the corresponding w bits to<br />
0. Only data having v = 1 can be transferred to the DM.<br />
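The w/v bookkeeping described above can be modelled as follows (an illustrative sketch; the real journal is a hardware structure with address tags and Hsiao parity, not a Python list):<br />

```python
class Journal:
    """Minimal model of the SCHJ with w (written) / v (valid) bits."""
    def __init__(self):
        self.rows = []          # each row: {"addr", "data", "w", "v"}

    def write(self, addr, data):
        # Search only the un-validated area: at most one un-validated
        # row per address, updated in place.
        for row in self.rows:
            if row["w"] and not row["v"] and row["addr"] == addr:
                row["data"] = data
                return
        self.rows.append({"addr": addr, "data": data, "w": 1, "v": 0})

    def validate(self):
        # End of an error-free sequence: all UVD becomes VD (v <- 1).
        for row in self.rows:
            if row["w"] and not row["v"]:
                row["v"] = 1

    def invalidate(self):
        # Error detected: dismiss the UVD (w <- 0); the core rolls back.
        for row in self.rows:
            if not row["v"]:
                row["w"] = 0

    def drain_to_dm(self, dm):
        # Only validated rows (v = 1) may reach the dependable memory.
        for row in self.rows:
            if row["w"] and row["v"]:
                dm[row["addr"]] = row["data"]
        self.rows = [r for r in self.rows if not (r["w"] and r["v"])]

dm = {}
j = Journal()
j.write(0x10, 5); j.write(0x10, 6)   # same address: updated, not duplicated
j.validate(); j.drain_to_dm(dm)
assert dm == {0x10: 6}
j.write(0x20, 7); j.invalidate(); j.drain_to_dm(dm)
assert 0x20 not in dm                # dismissed data never reaches the DM
```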

It is to be noticed that the last instructions in a sequence are used to write the SE to the SCHJ. On<br />
sequence validation, this data gets its v bit set to 1 and is consequently stored in the DM. In the case<br />
of sequence un-validation (see figure 5.8), the SE data is restored from memory on rollback, as the<br />
UVD in the SCHJ is dismissed, and execution is restarted from the previous VP. Further explanation of<br />
the rollback operation can be found in [RAM + 09, AMD + 10, ARM + 11].<br />

As stated before, the on-chip DM is supposed to be fast enough to fulfill the performance requirements<br />
of our SCPC. Our strategy of using a SCHJ aims not only to improve FT but also to allow the<br />
rollback mechanism to be used with very little time penalty compared to a full hardware approach or<br />
no protection at all.<br />

Each row in the SCHJ is protected by a Hsiao code as shown in figure 5.6. This protection is used<br />

in the following way:<br />

• An error detected in the UVD results in sequence un-validation (rollback).<br />

• The VD, which is the copy of the latest validated sequence, is written row by row to the DM.



Figure 5.6: Error detection and correction in journal (a memory block of SCHJ).<br />

Thus, throwing away this data would prevent correct completion of the program/thread execution<br />
and require a system reset. This can only happen if a detected error overpasses the<br />
correction capacity of the code (e.g. a two-bit error in a single VD row).<br />

5.3.1 Modes of SCHJ<br />

The overall operation of the SCHJ is depicted in the flow chart of figure 5.9. The four modes of<br />
operation are summarized in table 5.1. The ECC checker circuit is activated during each read and write<br />
access to the memory. The traffic signals in figures 5.10, 5.11, 5.13 and 5.15 represent the data<br />
flow with respect to the write operation, because in read operation the SCHJ and DM are totally transparent<br />
to the processor.<br />

Mode 00 – this mode is active at the start of the program, or on restart if a non-corrigible error is detected in<br />
the VJ of the SCHJ. In this mode, the processor resets and re-executes from the default values, discarding all<br />
the data stored in the journal. All the w and v bits are set to 0.
the data stored in the journal. All the w and v bit are set to 0 (v



Figure 5.7: Overall architecture<br />

Table 5.1: Modes of Journal<br />

Mode | Operation<br />
00 | Initialized<br />
01 | Read/write<br />
10 | Valid (v = 1)<br />
11 | Un-valid (rollback)<br />


Mode 01 – this is the normal read or write mode, depending on the active instruction in the SCPC (rd = 1 or wr = 1). In this mode, the SCPC can write directly into the SCHJ but not into the DM, in order to avoid any risk of data contamination in the DM. However, it has read access to both the SCHJ and the DM (not shown in figure 5.11 to avoid complexity). The data read from the SCHJ are checked for possible errors; on error detection, the processor enters mode 11, in which the rollback mechanism is activated without waiting for the VP of the current sequence.<br />

Under normal conditions, the processor is mostly in mode 01. As shown in figure 5.12, when the processor needs to read from the SCHJ, the address tags are checked to match the required data (depicted by arrow a in figure 5.12). If the required address is found, then before the data are transferred towards the SCPC, they are checked for possible errors by comparing the stored parity bits with parity bits re-generated according to the Hsiao code in the error detection unit (shown in figure 5.12).<br />

• if an error is detected (shown in figure 5.12), the rollback mechanism is invoked, because the UVJ contains data generated during the current sequence (denoted by the v field set



Figure 5.8: Rollback mechanism on error detection. When no error is detected during the last SD, the data are validated at validation point VP n-1 and the state-determining elements (SEs) of the processor are stored; if an error is detected during the current sequence duration (SD), the SEs are restored and execution rolls back to VP n-1.<br />

to 0). The enable signal on the data bus is then set to ‘0’ to forbid further data transfers from the SCHJ to the SCPC. All the data contents written during this sequence are considered as garbage values (w = 0).<br />



Figure 5.9: SCHJ operation flow chart (reads from the journal pass through error detection; on error, the rollback mechanism is invoked; validated data are transferred towards the main memory, with non-corrigible errors forcing a RESET).<br />

Figure 5.10: SCHJ mode 00 (journal initialized with v = 0 and w = 0; no write to the dependable main memory).<br />


SCPC, switching to mode 00 (and possibly raising some alarm indicator) is the usual behavior<br />

in this situation.<br />

Mode 11 – this mode is invoked when an error is detected during a read/write operation, as shown in figure 5.15; it has already been partially discussed with mode 01. In this mode, all the data written in the UVJ of the SCHJ (i.e., all the data generated during the current sequence) are invalid and discarded (w = 0).<br />



the 01-mode (read/write-mode) is activated.<br />

Figure 5.11: SCHJ mode 01.<br />


5.4 Risk of data contamination<br />

Figure 5.12: Read of UVD from SCHJ in mode 01 (steps followed: i) address tags matching, ii) error detection ‘e’ = 1, iii) rollback mechanism; rollback is triggered when e = ‘1’).<br />

Inside the UVJ, we can detect 2-bit errors and recover from them by relying on the Hsiao code for error detection and on the rollback mechanism for recovery. The maximum time penalty to correct an error is equal to the length of one SD.<br />
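To make the detect/correct decision concrete, the sketch below implements a small SEC-DED code in the spirit of the Hsiao construction (every parity-check column has odd weight) for 16-bit data with 6 check bits, matching the word layout of figure 5.6. The exact H matrix of the SCHJ is not given in the text, so the column choice here is an assumption for illustration:<br />

```cpp
#include <array>
#include <cstdint>

constexpr int K = 16, R = 6, N = K + R;

// Bit-count helper (portable stand-in for a popcount intrinsic).
int weight(uint8_t x) { int w = 0; while (x) { w += x & 1; x >>= 1; } return w; }

// Columns of H: 16 distinct odd-weight-3 columns for the data bits
// (Hsiao style), unit columns for the 6 check bits.
std::array<uint8_t, N> make_columns() {
    std::array<uint8_t, N> col{};
    int d = 0;
    for (int c = 0; c < 64 && d < K; ++c)
        if (weight(static_cast<uint8_t>(c)) == 3) col[d++] = static_cast<uint8_t>(c);
    for (int i = 0; i < R; ++i) col[K + i] = static_cast<uint8_t>(1u << i);
    return col;
}
const std::array<uint8_t, N> COL = make_columns();

// Encode 16 data bits into a 22-bit word (data in bits 0..15, checks 16..21).
uint32_t encode(uint16_t data) {
    uint8_t checks = 0;
    for (int i = 0; i < K; ++i)
        if (data >> i & 1) checks ^= COL[i];
    return uint32_t(data) | uint32_t(checks) << K;
}

enum class Status { OK, CORRECTED, UNCORRECTABLE };

// Syndrome 0: clean. Syndrome equal to one column: single-bit error,
// corrected in place. Anything else (e.g. the even-weight syndrome of a
// double error): uncorrectable, which triggers rollback or reset.
Status decode(uint32_t& word) {
    uint8_t syn = 0;
    for (int i = 0; i < N; ++i)
        if (word >> i & 1) syn ^= COL[i];
    if (syn == 0) return Status::OK;
    for (int i = 0; i < N; ++i)
        if (COL[i] == syn) { word ^= 1u << i; return Status::CORRECTED; }
    return Status::UNCORRECTABLE;
}
```

Because every column has odd weight, any double error yields an even-weight syndrome that can never match a column, so double errors are always flagged rather than miscorrected; this is precisely the property the journal relies on.<br />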




Figure 5.13: SCHJ mode 10 (if no error is detected at the VP, the data are validated, v = 1, and transferred to the main memory).<br />

Figure 5.14: Mode 10 of SCHJ operation, non-corrigible error detected (steps followed: i) data bus available, ii) non-corrigible error detection, iii) RESET).<br />

On the other hand, inside the VJ we can correct only single-bit errors (SBU); if a 2-bit MBU is detected, the program must re-execute, which may strain real-time performance constraints, but our hypothesis that the DM remains secure still holds. Moreover, the probability of an MBU is much lower than that of an SBU [QGK+06], so such situations are rare.<br />

This means that, from the dependability point of view, the VJ is a more critical data storage than the UVJ. Therefore, it is important to know how long the data stay inside the VJ. In fact, every data item<br />




Figure 5.15: SCHJ mode 11 (errors detected and rollback called: the data of the current sequence are un-validated, and only previously validated data may be written to the dependable main memory).<br />

stored in the journal is transferred to the DM within one SD. This means that the maximum duration of the data-contamination risk is one SD: the longer the SD, the higher the risk of data contamination inside the SCHJ, and vice versa.<br />

5.5 Implementation Results<br />

The SCHJ has been modeled in VHDL at RTL level and implemented on an Altera Stratix III EP3SE50F484C2 device using Altera Quartus II. The results obtained in terms of area for an SCHJ depth of 10 are reported in table 5.2. From these results, the following observations can be made:<br />

• the SCHJ occupies about 40–50% of the total area, depending on the depth of the journal;<br />

Table 5.2: Implementation area<br />

Module | Comb. ALUTs (Ded. Logic)<br />
SCHJ | 591 (399)<br />
SCPC and SCHJ | 1452 (677)<br />

If a non-corrigible error (e.g., a double error in a single row) is detected in the validated part of the journal (VJ), then even a rollback cannot recover from it, because the data do not belong to the present SD; in this case, the processor must reset, as shown in figure 5.16.<br />

5.5.1 Minimizing the Size of the Journal<br />

The implementation results in table 5.2 show that the journal accounts for a significant percentage of the total area of the FT processor. We have investigated the impact of the SCHJ depth on the percentage utilization of the overall processor. The results, reported in figure 5.17, show that the overall hardware overhead depends directly on the depth of the SCHJ.



Figure 5.16: Non-corrigible error detection.<br />

Figure 5.17: Increase of percentage utilization of the FT processor (SCPC + SCHJ) on device EP3SE50F484C2 with increasing journal depth (depths of 10, 16, 24, 32, 40, 54 and 62 were evaluated).<br />

In fact, the depth of the journal is a relative parameter: it depends on the type of benchmark being employed and on the duration of the SD. From a theoretical point of view, the UVJ depth should equal the maximum SD if the benchmark consists entirely of instructions that write to memory (e.g., the series of duplication instructions in figure 5.18): on each instruction execution, the contents of NOS are written to memory, so the required UVJ size equals the length of the SD (see figure 5.19, arrow a). At the other extreme, for benchmarks containing instructions that seldom or never write to memory (like the series of SWAPs in figure 5.18), the required depth of the journal is minimal.<br />

To address real industrial applications, we need to find the relationship between SD and journal depth. Accordingly, we have calculated the percentage of writes in the previously discussed benchmarks (see section 3.7). These benchmarks are expensive in processor-memory traffic because they constantly read and write data from/to memory. The results show that the maximum percentage of writes to memory is 39% (see figure 5.19, arrow b), and this still ignores repeated writes to the same memory addresses,



Figure 5.18: Theoretical limits of journal depth, for a sequence duration SD = 10: (a) DUP (duplication) execution, 100% writes to memory, minimum UVJ depth = SD; (b) SWAP execution, 0% writes to memory, minimum UVJ depth = 0.<br />

which can further reduce the required depth of the journal. This shows that the practical journal depth need not exceed 50% of the SD (including an eleven-percent safety margin). Note that the area occupation results presented previously were calculated for the worst case (UVJ depth = SD), whereas the required journal depth is only SD/2.<br />
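The sizing rule of this paragraph can be written down directly. The helper below is a hypothetical illustration, not part of the thesis tools; the 39% worst-case write fraction and the 11% margin are the figures quoted in the text:<br />

```cpp
#include <algorithm>
#include <cmath>

// Journal-depth sizing sketch: each instruction writes at most one row
// into the UVJ, so the required depth is the write fraction of the SD,
// padded by a safety margin. With the worst measured write fraction
// (0.39) and an 11% margin, the practical depth is 0.50 * SD.
int journal_depth(int sd, double write_fraction, double margin = 0.11) {
    double frac = std::min(1.0, write_fraction + margin);
    return static_cast<int>(std::ceil(sd * frac));
}
```

For example, journal_depth(20, 0.39) gives 10, i.e. SD/2; the theoretical extremes of figure 5.18 correspond to write fractions of 1.0 (series of DUP) and 0.0 (series of SWAP).<br />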

To finalize the depth of the journal, it is important to find the relationship between SD and performance degradation. Accordingly, we have developed a processor model using dedicated C++ tools, into which errors are injected artificially; only the injection of SBUs has been considered here. The complete experimental setup will be discussed in chapter 6. The factor CPO (clocks per operation) is chosen to measure the performance degradation due to re-execution: ideally, the processor executes a single instruction per clock cycle, which means CPO ≈ 1. The discussion in section 3.2.2 shows that, in a BER system, the performance degradation depends on two factors: the rate of re-execution (rollback) and the ratio of effective instruction execution, (SD-SED)/SD. The greater the performance degradation, the higher the CPO.<br />

Graphs of CPO vs. SD have been drawn for different error injection rates (figure 5.20). The curves follow a U-shaped pattern because, at low SD, the cost of loading the internal states dominates, while for larger SD the cost of re-execution dominates. The curves show that a larger journal depth is only viable at lower EIRs; moreover, for every EIR there is a range of SD values within which performance remains good.<br />
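The U-shaped behaviour can be reproduced with a simple analytical sketch. This is an assumption on our part, not the thesis' simulation model: a sequence of SD instructions validates only if no error strikes any of its cycles, and every retry re-pays the sequence plus a fixed cost for restoring the SEs:<br />

```cpp
#include <cmath>

// Expected CPO for a sequence of `sd` instructions under a per-cycle
// error injection rate `eir`, with a fixed `restore` cost per attempt.
// Geometric retries give expected clocks = (sd + restore) / P(clean),
// with P(clean) = (1 - eir)^sd; dividing by sd yields the CPO.
// At small sd the restore term dominates; at large sd P(clean)
// collapses and re-execution dominates -- hence the U-shape.
double expected_cpo(int sd, double eir, int restore = 4) {
    double p_clean = std::pow(1.0 - eir, sd);
    return (sd + restore) / (sd * p_clean);
}
```

With eir = 0.001 this toy model already yields a minimum at intermediate SD, qualitatively matching figure 5.20.<br />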

If we accept a 20% performance degradation, then the minimum SD comes out to be 20



Figure 5.19: Relation between journal depth and percentage of writes in benchmarks. From the theoretical point of view (series of DUP: 100% writes; series of SWAP: 0% writes), the worst case gives journal depth = SD; from the practical point of view (bubble sort, matrix multiplication and control benchmarks: at most 36–39% writes), journal depth = SD/2.<br />

Figure 5.20: CPI vs. SD, for error injection rates from 1/100 down to 1/10000. At low SD (region a) the loading of internal states is dominant; at large SD (region b) the effect of re-execution is dominant.<br />

(see arrow A). In brief, the practical journal depth lies somewhere near 10 when accepting a journal depth of 50% of the SD.



5.5.2 Dynamic Sequence Duration<br />

In the model presented so far, we have used a fixed-SD scheme, which carries both area and performance overheads. With a dynamic SD, these problems can be mitigated: the SD then has an average value, as shown in figure 5.21. This allows us to employ larger SDs with a smaller journal depth, which can improve the area overhead and the performance degradation at low EIRs.<br />

Figure 5.21: Dynamic SD (successive sequences SD 1 to SD 5 of varying length share the same hardware journal).<br />

Moreover, a dynamic SD makes it possible to reconfigure the SD according to the EIR. For example, if sequences are repeatedly un-validated, the system automatically reduces the SD to adjust its value to the current EIR. The downside is that this may increase the complexity of the journal management. Dynamic SD is an important consideration for a future reduction of the hardware overhead.<br />
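A minimal sketch of such a controller (hypothetical; the thesis does not specify the adaptation policy) could halve the SD whenever a sequence is un-validated and let it creep back up on clean validations:<br />

```cpp
#include <algorithm>

// Hypothetical dynamic-SD controller: shrink the sequence duration fast
// when sequences are repeatedly un-validated (high EIR), and grow it
// slowly back toward the maximum when they validate cleanly (low EIR).
struct SdController {
    int sd;             // current sequence duration
    int sd_min, sd_max; // allowed range

    int update(bool sequence_validated) {
        if (sequence_validated)
            sd = std::min(sd_max, sd + 1);   // additive increase
        else
            sd = std::max(sd_min, sd / 2);   // multiplicative decrease
        return sd;
    }
};
```

This mirrors the additive-increase/multiplicative-decrease policy used in congestion control: it reacts quickly to bursts of errors while recovering the longer (cheaper) sequences once the error rate drops.<br />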

5.6 Conclusions<br />

The presence of the journal facilitates the rollback mechanism on the one hand, and prevents errors (SBUs and 2-bit MBUs) from entering the DM on the other. To reduce the hardware overhead, the Hsiao code has been employed; it provides effective double-error detection and single-error correction. Thanks to the simultaneous parallel access to the memory and the journal during READ operations, the overall efficiency of the system is increased.<br />

The SCHJ occupies a significant percentage of the overall fault-tolerant processor area, so reducing the journal size can effectively reduce the global area occupation. The size of the journal depends on the type of benchmarks being employed; for practical applications, the journal depth can be half the duration of the SD, and a further reduction in depth is possible by employing a dynamic SD rather than a fixed one. In the next chapter, we will investigate the error coverage and the performance degradation due to re-execution.




Chapter 6<br />

Fault Tolerant Processor Validation<br />

In the previous chapters, we have designed an FT processor based on a concurrent error detection capability and a rollback error recovery strategy. The fault-tolerant design is built on a self-checking processor core (whose architecture follows the MISC philosophy) and on a self-checking hardware journal that prevents errors from flowing into the DM and limits the impact of the rollback mechanism on time performance. The architectures of the self-checking processor and of the self-checking hardware journal have been discussed in chapters 4 and 5, respectively.<br />

Figure 6.1: The overall FT-processor to be validated (self-checking processor core plus self-checking journal, connected to the dependable memory).<br />

In this chapter, we will evaluate the FT capability of the overall FT-processor (SCPC + SCHJ), as highlighted in figure 6.1, in order to validate the design strategy. The evaluation will be carried out through simulation: controlled error injection will be used to artificially force the processor model to face abnormal situations. The FT capability of the processor will be judged by calculating the detected-to-injected error ratios under different simulation scenarios (different application benchmarks and different error injection profiles). The time performance will also be evaluated.<br />

The chapter is organized as follows. First, we analyse the design hypotheses assumed in the methodology and, hence, the FT-processor properties to be checked. Then, after a short presentation of the error injection methodology, the experimental results are presented and discussed, from both the FT and the time-performance points of view. Finally, we compare the proposed methodology with the LEON3FT design methodology.<br />




6.1 Design Hypothesis and Properties to be Checked<br />

Inside the SCPC, parity and remainder codes are employed to detect errors in the internal registers and in the arithmetic/logic circuitry of the ALU. By assumption, the DM is a trustable place where data remain uncorrupted; hence, unsafe data must be prevented from flowing into the DM. This is achieved by the SCHJ, which has error detection and also some error correction capability. Its role is to simplify the management of validated and un-validated data and to speed up the rollback mechanism used for error recovery.<br />

Figure 6.2: Error injection in the FT-processor (artificial errors are injected into both the self-checking processor core and the self-checking journal).<br />

The FT capability of the processor must be evaluated as the capacity to correctly handle errors appearing in any part of the SCPC or the SCHJ (figure 6.2), with different error profiles to be tested (different error patterns and rates). Speed performance degradation will also be assessed along with the FT capability, as the impact of rollback is expected to rise with the error rate (due to higher re-execution rates). Accordingly, the rate of rollback vs. the error injection rate can also be calculated.<br />

In short, we will investigate the overall dependability and performance of the proposed FT-processor architecture by addressing the following challenges in the upcoming sections:<br />

• self-checking effectiveness of the FT processor;<br />

• performance degradation due to re-execution; and<br />

• effect of error injection on rate of rollback.<br />

6.2 Error Injection Methodology and Error Profiles<br />

Before addressing the above-mentioned challenges, it is necessary to choose both the error injection methodology and the error profiles to be applied, i.e., the error patterns and the error rates. Fault injection in the hardware of a system can be implemented in two ways:<br />

1. physical fault injection;



2. simulated fault injection.<br />

In this work, we employ simulated fault injection of soft errors (due to transient faults), in which errors are injected by altering logical values during the simulation. Simulation-based injection is a special case of fault/error injection that can support various levels of abstraction of the system, such as functional, architectural, logic and power [CP02]. For this reason, it has been widely used in fault-injection studies.<br />

Moreover, this technique has various other advantages, the greatest being the observability and controllability of all the modelled components. Another positive aspect is the possibility of validating the system during the design phase, before a final design exists.<br />

Figure 6.3: Error patterns. Scenario 1: (a) random SBU, (b) random MBU (2 bits), (c) random MBU (3 bits); scenario 2: random MBU (1, 2, …, 7 or 8 bits). Errors can occur in any bit, not necessarily the bits shown in the figure.<br />

The faults being considered are SBUs (a single bit flipped in one register) and MBUs (multiple bits flipped at once in one register). These fault models (SBU and MBU) are commonly used with RTL models [Van08]. The exact error patterns considered in these experiments are shown in figure 6.3: in scenario 1, we consider (a) a random single-bit error, (b) a random 2-bit error and (c) a random 3-bit error; in scenario 2, random harsh errors (from 1-bit up to 8-bit errors in a single register) are considered.<br />

6.3 Experimental Validation of Self-Checking Methodology<br />

We will evaluate the error coverage through simulated fault injection, the objective being to assess the effectiveness of the proposed fault-tolerant scheme and, hence, to determine its limits. An environment is needed to analyze the effects of transient faults on the final architecture.



With this environment, we are able to run fault-injection experiments to evaluate the effects of SBUs and MBUs caused by transient faults in the processor registers and data-path, and hence to analyze the robustness against bit flips (due to transient faults).<br />

In practice, the RTL VHDL model used to synthesize the circuit is not the one used for the fault-injection simulation. In order to allow very fast simulation (and hence a large number of simulation campaigns within a minimal delay), dedicated C++ tools have been developed to replace the original ‘discrete event driven’ simulation model on which VHDL relies by the faster ‘cycle driven’ simulation model, which fits synchronous designs very well [CHL97]. For the simulation, strictly equivalent C++ ‘cycle driven’ models replaced the original RTL VHDL models.<br />

The starting point of the environment design is to define how to reproduce transient faults: when to trigger them, where to inject them and what to change. We have chosen a non-deterministic fault-trigger approach during a fixed s where the bit flips can randomly be provoked in the SCPC and SCHJ.<br />

Figure 6.4: Experimental setup. For each campaign: initial setup, fault injection, an error latency of 2 clock cycles, then a check of whether the error was detected; the detected-error and injected-fault counters are incremented, the program is reset and the data are logged. Campaigns 1 to N feed a final report of total errors detected vs. faults injected.<br />

The basic steps of a fault-injection campaign are shown in figure 6.4. The C++-based simulator injects a fault pattern into the processor model by randomly picking bit(s) among all the bits that form the registers. On fault injection, the simulation is arrested after 2 cycles, and the self-checking circuitry indicates whether the error has been detected or not. If detected, a counter is incremented; afterwards, a new simulation campaign starts with a new fault-injection profile (as shown in figure 6.4). Finally, a report of the total number of injected/detected errors is generated.<br />

Two types of injection campaigns were conducted: one injecting single (SBU), double and triple (MBU) random error patterns, and another injecting random harsh errors (random weight from 1 up to 8). The results are presented in the graphs of figures 6.5, 6.6, 6.7 and 6.8, respectively.
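A stripped-down version of one such campaign can be sketched as follows. This is a stand-in for the actual C++ tools, which are not reproduced here; in particular, the single even-parity checker is an assumption used to keep the example self-contained, whereas the real campaigns exercised the SCPC/SCHJ checkers:<br />

```cpp
#include <cstdint>
#include <random>

// One simplified fault-injection campaign: flip `weight` randomly chosen
// bit positions in a random 16-bit register value, then ask a checker
// whether the corruption is detected. The checker here is a single even
// parity bit, so it flags exactly the odd-weight net corruptions.
struct Campaign { long injected = 0, detected = 0; };

bool parity_detects(uint16_t before, uint16_t after) {
    uint16_t diff = before ^ after;
    int w = 0;
    while (diff) { w += diff & 1; diff >>= 1; }
    return (w & 1) != 0;  // odd number of net bit flips
}

Campaign run_campaign(int trials, int weight, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> bit(0, 15);
    std::uniform_int_distribution<int> value(0, 0xFFFF);
    Campaign c;
    for (int t = 0; t < trials; ++t) {
        uint16_t reg = static_cast<uint16_t>(value(rng));
        uint16_t hit = reg;
        for (int b = 0; b < weight; ++b)
            hit ^= static_cast<uint16_t>(1u << bit(rng));  // inject fault
        ++c.injected;
        if (parity_detects(reg, hit)) ++c.detected;
    }
    return c;
}
```

With this toy checker, every SBU is caught while weight-2 injections always escape, which is precisely why the journal uses a SEC-DED code rather than simple parity.<br />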



Figure 6.5: Single-bit error injection (random SEU injection and detection: errors injected vs. errors detected, over 10 campaigns).<br />

Figure 6.6: Double-bit error injection (2-bit random error injection and detection: errors injected, detected and non-detected, over 8 campaigns).<br />

For scenario 1, figure 6.5 shows that the processor detects 100% of the injected single-bit errors. The detection rates for double and triple bit errors are higher than 60% and 78%, respectively (as shown in figures 6.6 and 6.7). In scenario 2, where harsh patterns are used (1 up to 8 bits flipped randomly), the detection rate still remains significant, with a value greater than 36% for all configurations, as shown in figure 6.8.<br />

It is interesting to notice that, while using very simple detecting codes in the SCPC devised for low s, the error coverage is still 100% for SBUs. Taking into account the small number of registers to protect in the processor core and the fact that the SCPC area is only a fraction of the total FT-processor area, using better codes in the SCPC could probably improve the FT level without a big impact on area.<br />

This tends to prove that the proposed FT-processor design approach is a useful one. It remains necessary to evaluate the impact of increasing error rates on speed performance.



Figure 6.7: Triple-bit error injection (3-bit random error injection and detection: errors injected, detected and non-detected, over 8 campaigns).<br />

Figure 6.8: Harsh (1 up to 8 bits randomly) error injection (errors injected, detected and non-detected, over 8 campaigns).<br />

6.4 Performance Degradation due to Re-execution<br />

To measure the impact of transient faults on system performance, we have evaluated the performance degradation on different sets of benchmarks through simulations. The average number of clock ticks per operation (CPO) has been measured for different EIRs as an indicator of speed performance (the higher the value, the lower the performance), and hence of the performance degradation under different error injection conditions.<br />

In the pipelined journalized stack processor, all instructions except STORE are executed in a single clock cycle; therefore, the average number of clock cycles per operation (CPO), or clock cycles per instruction, is ideally unity. However, when an error is detected, the rollback is executed, which increases the overall time penalty. The greater the rate of rollback, the higher the average CPO,



Figure 6.9: Performance degradation due to re-execution (for the same program length, the clock count with rollback exceeds the clock count without rollback).<br />

because more clock cycles are needed to accomplish the required task (see figure 6.9). In other words, the greater the average CPO, the lower the overall performance.<br />

The benchmarks have already been discussed in section 3.7. Table 6.1 summarizes the percentage profiles of reads from and writes to the DM induced, for each benchmark group, by the instructions running on the SCPC. Note that the instruction set of the SCPC has 36 instructions, among which 23 involve reading from or writing to the memory.<br />

Table 6.1: Read/Write profiles in benchmark groups<br />

Group | Read | Write<br />
I | 45% | 39%<br />
II | 57% | 38%<br />
III | 50% | 38%<br />

6.4.1 Evaluating Performance Degradation<br />

The goal is to measure the effect of re-execution as a function of the length of the SD. We have drawn graphs of the average clock cycles per operation (CPO) vs. EIR for different benchmarks. Figures 6.10, 6.11 and 6.12 present the results for the benchmarks of groups I, II and III, respectively, for different SDs (10, 20, 50 and 100). In these graphs, the penalty of loading the SEs has not been considered. The errors have been injected into the processor and the journal at different EIRs. The analysis of the graphs shows that the curves tend to overlap for the lower values of EIR. This is logical since, in the absence of errors, no extra time penalty due to rollback is induced, whatever the benchmark being used.
time penalty due to rollback is induced, whatever the benchmark being used.


128 CHAPTER 6. FAULT TOLERANT PROCESSOR VALIDATION<br />

Figure 6.10: Simulation curves for group I (CPO vs. EIR, for SD = 10, 20, 50 and 100).<br />

Figure 6.11: Simulation curves for group II (CPO vs. EIR, for SD = 10, 20, 50 and 100).<br />

In figure 6.10, moving from point A to B corresponds to an increase of the error rate by a factor of 10. The corresponding increase in CPO remains low (almost unchanged for SD=10 and 20, 1.1 for SD=50 and 1.6 for SD=100), meaning little or no degradation of speed performance. Similarly, the move from A to C corresponds to an error rate increase by a factor of 100: the CPO remains very low for SD=10, and lower than 2 for SD=20 and 50. Similar observations can be made from the graphs in figures 6.11 and 6.12. Even when the error rate increases a hundredfold, the time penalty for the lower SDs remains reasonable, which indicates good performance.<br />

With higher EIRs, the smaller SDs are the ones that incur the lower time penalty. This is also consistent with the predicted results. Indeed, for a given error rate, the risk that a sequence be



Figure 6.12: Simulation curves for group III (CPO vs. EIR, for SD = 10, 20, 50 and 100).<br />

invalidated is higher for a longer SD, leading to a higher rollback rate.<br />

Since the architecture chosen for the SCPC requires little time to save the SEs, it is possible to select a short SD and still obtain a good level of performance. Furthermore, this allows a smaller SCHJ depth to be chosen, reducing area consumption. It further reduces the risk that errors accumulate in the SCHJ and induce a non-recoverable error.<br />
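This dependence of the rollback risk on SD can be sketched with a small probabilistic model (an illustrative assumption: errors hit each executed instruction independently, with a probability equal to the EIR; the function name is ours):

```python
def p_rollback(eir, sd):
    """Probability that at least one error hits a sequence of `sd`
    instructions, i.e. that the sequence is invalidated and rolled back."""
    return 1.0 - (1.0 - eir) ** sd

# Longer sequences face a higher risk of invalidation at the same EIR.
for sd in (10, 20, 50, 100):
    print(sd, round(p_rollback(1e-4, sd), 6))
```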

Figure 6.13: Effect of EIR on rollback for benchmark group I (number of rollbacks vs. EIR, for SD = 10, 20, 50 and 100).<br />



Figure 6.14: Effect of EIR on rollback for benchmark group II (number of rollbacks vs. EIR, for SD = 10, 20, 50 and 100).<br />

Figure 6.15: Effect of EIR on rollback for benchmark group III (number of rollbacks vs. EIR, for SD = 10, 20, 50 and 100).<br />

6.5 Effect of Error Injection on Rate of Rollback<br />


An increase in the rollback rate is a performance-limiting factor, due to the time penalty of re-executing sequences. In this section, we therefore analyze the increase of the rollback rate with increasing EIRs. For higher EIRs, the re-execution rate also increases, which will



further decrease the overall performance. However, if the error probability is known, it is possible to find the optimal number of checkpoints and possible rollbacks [VSL09]. In a real system, the error probability is not known in advance and is difficult to estimate.<br />

In the rollback mechanism, there are two performance-limiting factors: (i) the time taken to store/reload the SEs and (ii) the length of the sequence (SD). Reducing the time penalty of reloading the SEs calls for long sequences, so that the overall number of SE loads and stores is smaller. This behavior needs to be confirmed by artificial error injection.<br />

Consequently, we have artificially injected errors into the FT processor to observe their effect on the rollback mechanism, as shown in figures 6.13, 6.14 and 6.15. (Note: for the higher SDs, 50 and 100, the number of rollbacks at high EIR is missing because the values fall outside the range of the y-axis.) The simulation curves show that for low error rates the rollback rate is also low, and vice versa. Moreover, for higher error rates the effect of rollback is dominant for the longer sequences (SD): there is a greater number of rollbacks, which again results in a time penalty and limits the overall performance.<br />

Therefore, it is advisable to use larger SDs at low error rates and smaller SDs at higher error rates. The optimal SD duration can only be proposed if the final application is known; this is why the SD length is left as a user-defined parameter that can be adjusted according to the external environment.<br />
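The trade-off between checkpointing overhead and rollback risk can be sketched as a toy optimization (illustrative assumptions: a fixed `save_cost` in cycles to save the SEs per sequence, full re-execution of failed sequences, and independent per-instruction errors at the EIR; all names are ours):

```python
def expected_cpo(eir, sd, save_cost=4):
    """Expected clock cycles per operation: each attempt of a sequence
    costs sd + save_cost cycles and succeeds with probability (1-eir)**sd;
    a failed attempt is fully re-executed (geometric retries)."""
    p_success = (1.0 - eir) ** sd
    return (sd + save_cost) / (sd * p_success)

def best_sd(eir, candidates=(10, 20, 50, 100)):
    """Candidate SD with the lowest expected CPO at the given error rate."""
    return min(candidates, key=lambda sd: expected_cpo(eir, sd))

print(best_sd(1e-6))   # -> 100  (low error rate favours a larger SD)
print(best_sd(5e-2))   # -> 10   (high error rate favours a smaller SD)
```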

6.6 Comparison with LEON3 FT<br />

The LEON3 FT has been discussed previously in section 2.4. In this part, we compare the protection scheme of the LEON3 FT with that of the journalized stack processor; this is a qualitative comparison.<br />

The LEON3 FT focuses on the protection of the data storage, not on the functionality of the architecture. The overall scheme uses ECC and duplication of internal states. Most of the registers have 2-bit error detection, whereas a few have 4-bit error detection. There is no protection of the data path, the ALU functionality or the control unit.<br />

On the other hand, in the FT journalized stack processor the focus is on protecting the overall architecture. The processor provides single-bit error detection, and the journal part can detect 2-bit errors. In a newer version, other codes with higher coverage will be considered for the processor. The data path is protected; the ALU and control path can be protected without additional hardware overhead. In brief, the FT journalized processor is still in its development phase, yet it already shows interesting features and needs further optimization from the protection point of view.<br />

6.7 Conclusions<br />

In this chapter, we have validated the design of the journalized fault-tolerant stack processor. During validation, different parameters have been evaluated, such as the self-checking ability, the impact



on time performance, and the increase of the rollback rate due to error injection. Finally, the proposed model has been compared with the LEON3 FT.<br />

For single-error injection, 100% of the errors were detected in several experimental configurations. With double- and triple-bit error injection, the average detection percentages were about 60% and 78%, respectively. According to the results obtained with much harsher error patterns (up to 8-bit errors), correction is still possible, with a rather significant correction rate of about 36%.<br />

The performance degradation results are also satisfactory. The proposed architecture offers rather good performance even in the presence of high error rates: with large error rates, the time penalty can remain reasonable using lower SDs. In practice, it is advisable to use a larger SD at low error rates and a smaller SD at higher error rates. Knowing the final application and the average error profile of the execution environment, it is possible to choose the most appropriate SD duration (which is left as a generic parameter in the synthesizable models).


GENERAL CONCLUSION AND PROSPECTS<br />



General Conclusions<br />

With the predicted evolutions in technology, soft errors in electronic circuits are becoming a major<br />

issue in the design of complex digital systems, especially in applications with safety critical relevance.<br />

Indeed, current advancements in nano-technology, largely based on component dimensions shrinking,<br />

voltage supply reduction and clock speed increase, are lowering the resulting noise margins. As a consequence, the sensitivity of digital circuits to high-energy particles and electromagnetic disturbances is rising very fast, making the probability of Single-Event Upset (SEU) and Multiple-Bit Upset (MBU) occurrence very high, not only in space but also in ground applications. Hence, taking the growing risk of these transient faults into account from the very beginning of any digital design is fast becoming a critical need.<br />

Ensuring proper operation, even in the presence of transient faults, requires that the system hold some fault-tolerance capability. Next to the fault-tolerance issue, the demand for larger, faster, more complex and flexible systems that remain easy to design is endless. Together with the enhanced<br />

means of on-chip communication (Network on Chip – NoC), the increased integration possibilities of modern electronic circuits now allow all the functionality of a full system to be grouped on a single chip (System on Chip – SoC). Among recent developments, the MPSoC (Multiprocessor System on Chip) design paradigm has become very popular for its capacity to provide both computational power and flexibility. It brings together a large number of processors (the processing nodes) linked by a NoC (the inter-node communication medium). Since an MPSoC is not naturally immune to transient faults, an obvious goal is to develop on-chip fault-tolerance capacity, and hence a fault-tolerant processor to be used as the "processing node".<br />

The work presented in this thesis is dedicated to the design of such an FT processor using a new architectural approach; the design goals addressed include a high level of protection against transient faults along with reasonable performance and area overheads. It was clear from the beginning that severe area-consumption constraints would apply to the architectural design of the processing node in order to match the massively parallel objective, while preserving the node performance as much as possible.<br />

The concepts chosen as the basis of the design methodology are on-line concurrent error detection and error recovery through rollback execution. Central to the new architecture are a self-checking processor core and a hardware journalization mechanism. The processor core, devised in the MISC class instead of the classic RISC or CISC, is a self-checking processor inspired by the canonical stack processor [KJ89], able to offer a rather good level of performance with only a limited<br />


amount of hardware. The architectural simplicity (a small amount of logic resources and internal storage) and the great compactness of the code are important characteristics favorable to the self-checking capability and to the rollback recovery implementation. Next to the processor core, a self-checking hardware journal dedicated to the journalization mechanism prevents error propagation from the processor core to the main memory and limits the impact of rollback on time performance. Among the underlying hypotheses, the main memory is supposed to be dependable, i.e., data is assumed to be kept reliably in it without any risk of corruption.<br />

On the occurrence of a transient fault, data can be corrupted in the processor. Such errors can be detected in the processor core but not corrected. Hence, without the hardware journal, erroneous data could flow out of the processor core and end up in the dependable memory. The DM would then no longer be a trustable place, and implementing a software recovery mechanism would be rather painful, with a lot of data redundancy being necessary in the memory device. Classical rollback techniques operate with checkpointing: at regular time intervals, the processor state and the produced data are saved, allowing rollback to the saved point in case of error detection. The best-suited sequence duration (the distance between two checkpoints) depends on application constraints and on error occurrence rates. While a larger sequence duration may limit the impact of the rollback mechanism on time performance in the absence of errors, it requires a larger hardware journal and, moreover, increases the risk of rollback activation when errors occur.<br />

Data produced in the current sequence can be discarded in case of error detection, as it can be generated again from the last saved checkpoint. On sequence validation, i.e., when no error occurred in the ending sequence, the related data is validated and must be transferred to the dependable memory. Error control coding techniques are used to detect errors in the processor core and in the unvalidated data of the journal, and to correct errors in the validated data part of the journal.<br />

The fault tolerant processor architecture has been modeled in VHDL at the RTL level and then<br />

synthesized using Altera Quartus II, to determine area requirements and maximal operation frequency.<br />

Simulated error injection campaigns have been used to determine the effectiveness of the proposed<br />

fault tolerant strategy under different faulty scenarios (varying the error rate and error pattern profiles)<br />

and different sequence durations.<br />

The self-checking ability of the fault-tolerant processor was tested for Single-Event Upsets (SEUs, 1-bit error patterns) and Multiple-Bit Upsets (2- to 8-bit error patterns in a single 16-bit data word). Considering SEUs, 100% of the errors are detected, and error recovery is close to 100% even for high error injection rates. With 2-bit and 3-bit patterns, the average detection percentages were about 60% and 78%, respectively. When harder conditions are considered, with error patterns of up to 8 bits, correction is still possible with correction rates of about 36%.<br />

Similarly, the performance degradation due to error injection was evaluated. Error recovery being based on rollback execution upon error detection, the instructions of the faulty sequence are re-executed from the previously preserved state, hence adding a time penalty, i.e., performance degradation. Higher error injection rates induce higher rollback rates, resulting in lower performance. The analysis of the measured performance degradation curves shows that the proposed architecture offers a reasonably good


performance even in the presence of high error rates. It also shows that the optimal sequence duration depends on the average error injection rate and should be adjusted according to the application's external environment.<br />

Practically, the experimental results demonstrate that the principle of journalization can be rather effective on a stack-computing-based processor core architecture, and deserves more research effort to enhance its performance and protection capability.<br />

The future work is divided into two aspects: protection and performance. From the protection point of view, the error coverage in the processor part needs to be improved: presently, simple parity can only detect odd-bit errors, and the challenge is to find codes with low hardware overhead. Moreover, the opcode (in the control circuitry) can be protected with ECC. The MISC-based stack methodology uses 37 instructions while the present opcode is 8 bits wide, leaving capacity to add redundancy bits without additional overhead.<br />
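The limitation of simple parity noted above can be demonstrated in a few lines (a sketch using a 16-bit word to match the processor's data width; the helper names are ours):

```python
def parity(word):
    """Even-parity bit of a 16-bit word (1 if the number of 1s is odd)."""
    return bin(word & 0xFFFF).count("1") % 2

def flip_bits(word, positions):
    """Inject an error pattern by flipping the given bit positions."""
    for p in positions:
        word ^= 1 << p
    return word

data = 0b1011_0010_1100_0001
stored_parity = parity(data)

# A single-bit (odd) error changes the parity and is detected.
print(parity(flip_bits(data, [3])) != stored_parity)      # -> True

# A double-bit (even) error leaves the parity unchanged: undetected.
print(parity(flip_bits(data, [3, 9])) != stored_parity)   # -> False
```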

From the performance point of view, architectural optimization is required, mainly in the hardware journal part. The present processor has its critical path in the error-correcting circuitry and the write to DM; if this task were split into a two-stage pipeline, the overall performance could improve significantly. Another possible improvement is to overcome the performance overhead due to conditional branches, by loading both targets of a jump into the IB.<br />

In the long term, the continuation of this work should be dedicated to the integration of this fault-tolerant processor architecture as a building block of a fault-tolerant MPSoC.<br />



Appendix A<br />

Canonical Stack Computers:<br />

The canonical stack processor [KJ89] has been chosen as the basis of the fault-tolerant processor core. Its characteristics mostly resemble those of second-generation stack machines, which are more cost-effective than the first generation. In this section we briefly discuss the construction of the canonical stack machine, as it helps in understanding the similarities and differences with the proposed stack machine.<br />

Figure A.1 shows the block diagram of the Canonical Stack Machine. Each block represents a logical resource: the data bus, the Data Stack (DS), the Return Stack (RS), the Arithmetic/Logic Unit (ALU), the Top Of Stack register (TOS), the Program Counter (PC), the Memory Address Register (MAR), the Instruction Register (IR), and an Input/Output unit (I/O). For simplicity, the canonical machine is represented in figure A.1 with a single data bus; real processors may have more than one data path to allow instruction fetching and calculations to proceed in parallel.<br />

The DS is a buffer that works according to the LIFO (Last In, First Out) mechanism. Only two operations, PUSH and POP, can take place on the DS. In a PUSH, the new data element is written at the topmost position of the DS and the old values are shifted one position downwards. In a POP, the top value residing in the stack is placed on the data bus and the next cell of the stack is shifted one place upwards, and so on. Similarly, the RS is also a LIFO implementation; the only difference is that the return stack stores subroutine return addresses instead of instruction operands.<br />
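The LIFO behavior of the DS and RS can be modeled as follows (a logical sketch only; in the hardware, values physically shift through the stack cells, which a Python list does not capture, and the class name is ours):

```python
class LIFOStack:
    """Logical model of the DS/RS buffers: last value pushed is first out."""
    def __init__(self):
        self._cells = []

    def push(self, value):
        # The new element becomes the top of the stack.
        self._cells.append(value)

    def pop(self):
        # The top value is removed and placed on the data bus.
        return self._cells.pop()

ds = LIFOStack()
ds.push(24)
ds.push(4)
print(ds.pop(), ds.pop())   # -> 4 24  (last in, first out)
```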

The program memory block has both a Memory Address Register (MAR) and a reasonable amount of random access memory. To access the memory, the MAR is first written with the address to be read or written; then, on the next system cycle, the program memory is either read onto or written from the data bus accordingly.<br />
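This two-cycle access protocol can be modeled as follows (a behavioral sketch; the class and method names are ours, not part of the thesis):

```python
class ProgramMemory:
    """Model of the MAR-fronted memory: the address is latched first,
    and the actual read/write uses the latched address on the next cycle."""
    def __init__(self, size=256):
        self.cells = [0] * size
        self.mar = 0

    def set_mar(self, address):
        # Cycle 1: latch the address into the MAR.
        self.mar = address

    def read(self):
        # Cycle 2: drive the data bus from the addressed cell.
        return self.cells[self.mar]

    def write(self, value):
        # Cycle 2: capture the data bus into the addressed cell.
        self.cells[self.mar] = value

mem = ProgramMemory()
mem.set_mar(42)
mem.write(0x1234)
mem.set_mar(42)
print(hex(mem.read()))   # -> 0x1234
```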


Figure A.1: Canonical Stack Machine [KJ89] — the Data Stack (DS), Return Stack (RS), I/O, control logic & IR, ALU, TOS register, PC, MAR and program memory connected through a single data bus.<br />


Appendix B<br />

Instruction Set of Stack Processor<br />

Arithmetic and logic operations<br />

The basic arithmetic and logic operations in table B.1 are the same as in the canonical machine, but they have been modified according to our needs. Some additional instructions, such as Addition with carry (ADC), Subtraction with carry (SUBC), Modulus (MOD), Negative (NEG), NOT-operation (NOT), Increment (INC), Decrement (DEC) and Sign (SIGN), have been added. These additional instructions provide more flexibility when programming the Stack Processor. All the instructions are described in register-transfer-level pseudo-code, which is assumed to be self-explanatory.<br />

Stack manipulation operations<br />

Pure stack machines can only access the two top elements of the stack for arithmetic operations. Therefore, some extra instructions are always needed in order to reach operands other than the TOS, NOS or TORS. Here, such instructions include Rotate (ROT), RS to DS (R2D), DS to RS (D2R), and Copy RS to DS (CPR2D). R2D, D2R and CPR2D are generally used for shuffling the DS and RS. (The pseudo-code of all instructions in this table corresponds to the non-pipelined version of the proposed model.)<br />
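The stack-manipulation semantics of table B.2 can be checked against a simple model (a sketch; the list holds the stack top-first, with `s[0]` as TOS and `s[1]` as NOS, and the hardware pointer updates are abstracted away):

```python
# Stack shown top-first: s[0] is TOS, s[1] is NOS.
def swap(s):  return [s[1], s[0]] + s[2:]          # exchange TOS and NOS
def dup(s):   return [s[0]] + s                     # duplicate TOS
def over(s):  return [s[1]] + s                     # copy NOS above TOS
def rot(s):   return [s[2], s[0], s[1]] + s[3:]     # third element to TOS

s = [1, 2, 3]            # TOS = 1, NOS = 2
print(swap(s))           # -> [2, 1, 3]
print(over(s))           # -> [2, 1, 2, 3]
print(rot(s))            # -> [3, 1, 2]
```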

Memory Fetch and Store<br />

All the arithmetic and logical operations are performed on data elements of the stack, so there must be some way of loading information onto the stack and of storing data to the memory. The register-transfer pseudo-code is given in table B.3 below.<br />

Loading Literals<br />

There must be a way to get constants onto the stack. The instructions to do so include LIT and DLIT, which load a byte and a word of data, respectively, onto the DS, as shown below in table B.4.<br />


Conditional branch<br />

Table B.1: Arithmetic and logic operations<br />

Symbol Instruction Operations<br />

ADD Addition TOS ⇐ TOS + NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

ADC Addition with carry TOS ⇐ TOS + NOS<br />

NOS ⇐ Cout<br />

SUB Subtraction TOS ⇐ TOS - NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

MUL Multiplication TOS ⇐ TOS × NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

DIV Division TOS ⇐ TOS ÷ NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

MOD Modulus TOS ⇐ TOS mod NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

AND AND-operation TOS ⇐ TOS & NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

OR OR-operation TOS ⇐ TOS | NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

XOR XOR-operation TOS ⇐ TOS xor NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

NEG Negative TOS ⇐ -TOS<br />

NOT NOT-operation TOS ⇐ not TOS<br />

INC Increment TOS ⇐ TOS + 1<br />

DEC Decrement TOS ⇐ TOS - 1<br />

SIGN Sign if (TOS ≥ 0) then TOS ⇐ 0x0000<br />

When processing data there is a need to take decisions, so the machine must support conditional branches. The conditional jumps can depend on various conditions.<br />

Subroutine Calls<br />

In a stack machine, most instructions operate on the TOS and NOS. In this architecture, to improve the flexibility of stack-based machines, an RS (Return Stack) is added along with


Table B.2: Stack manipulation operations<br />

Symbol Instruction Operations<br />

DROP Drop TOS ⇐ NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

DUP Duplication DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

SWAP Swap TOS ⇐ NOS<br />

NOS ⇐ TOS<br />

OVER Over DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

TOS ⇐ NOS<br />

NOS ⇐ TOS<br />

ROT Rotate TOS ⇐ DS[DSP]<br />

NOS ⇐ TOS<br />

DS[DSP] ⇐ NOS<br />

R2D Return Stack to Data Stack TOS ⇐ TORS<br />

NOS ⇐ TOS<br />

DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

TORS ⇐ RS[RSP]<br />

RSP ⇐ RSP - 1<br />

CPR2D Copy Return Stack to Data Stack TOS ⇐ TORS<br />

NOS ⇐ TOS<br />

DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

D2R Data Stack to Return Stack TORS ⇐ TOS<br />

DSP ⇐ DSP - 1<br />

TOS ⇐ NOS<br />

NOS ⇐ DS[DSP]<br />

RSP ⇐ RSP + 1<br />

RS[RSP] ⇐ TORS<br />

RET Return IP ⇐ TORS<br />

TORS ⇐ RS [RSP]<br />

RSP ⇐ RSP - 1<br />

the DS (Data Stack). The proposed machine can efficiently call subroutines. A subroutine call pushes the value of the PC onto the TOS, and sometimes a known address can be written directly onto the TOS, as shown below in table B.6.<br />


Push and Pop<br />

Table B.3: Memory Fetch and Store<br />

Symbol Instruction Operations<br />

FETCH Fetch Mem_Addr ⇐ TOS<br />

TOS ⇐ Mem<br />

STORE Store Mem_Addr ⇐ TOS<br />

Mem ⇐ NOS<br />

DSP ⇐ DSP - 1<br />

TOS ⇐ DS[DSP]<br />

DSP ⇐ DSP - 1<br />

NOS ⇐ DS[DSP]<br />

Table B.4: Loading Literals<br />

Symbol Instruction Operations<br />

LIT d8 Writing data (Byte size) to TOS DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ data(byte)<br />

DLIT d16 Writing data (Word size) to TOS DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ data(word)<br />

Table B.5: Conditional Branch<br />

Symbol Instruction Operations<br />

ZBRA d Jump to ‘d’ if TOS = 0 if (TOS=0) then<br />

IP ⇐ IP + d<br />

SBRA d Jump to ‘d’ if TOS < 0 if (TOS<0) then<br />

IP ⇐ IP + d<br />



Table B.7: Push and Pop<br />

Symbol Instruction Operations<br />

PUSH DSP Push Data Stack Pointer DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ DSP<br />

POP DSP Pop Data Stack Pointer DSP ⇐ TOS<br />

TOS ⇐ NOS<br />

DSP ⇐ DSP-1<br />

NOS ⇐ DS[DSP]<br />

PUSH RSP Push Return Stack Pointer DSP ⇐ DSP + 1<br />

DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ RSP<br />

POP RSP POP Return Stack Pointer RSP ⇐ TOS<br />

TOS ⇐ NOS<br />

NOS ⇐ DS[DSP]<br />

DSP ⇐ DSP-1<br />

B.1 Data Operations in Stack Processor:<br />

The stack machine manipulates data using postfix operations. Such notation is also called ‘Reverse Polish’ notation. In postfix operations, the operators come after their operands and act upon the most recently seen operands. For example, consider the following expression:<br />

(24 + 04) × 82<br />

This expression in Postfix representation will be<br />

82 24 04 + ×<br />

Postfix expressions are usually shorter than their infix notation. The stack processor can execute postfix expressions directly, without further burdening the compiler.
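The postfix evaluation performed by the stack processor can be sketched as follows (each operand is pushed onto the data stack; each operator pops its two operands and pushes the result; in the code, `*` stands for the × of the example):

```python
def eval_postfix(tokens):
    """Evaluate a postfix (Reverse Polish) expression with a data stack."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # TOS and NOS are the operands
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack.pop()

# (24 + 04) × 82  in postfix:  82 24 04 + ×
print(eval_postfix("82 24 04 + *".split()))   # -> 2296
```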




Appendix C<br />

Instruction Set of Pipelined Stack Processor<br />

We have analyzed the multi-clock instructions to explore the possible conflicts between the various instructions. We found that all multi-clock instructions can be subdivided into two parts: the first part consists of DSP+1 or RSP+1, depending on the type of instruction, while the second part contains the rest of the instruction; such instructions can be recognized by the code ‘111’. Thanks to this pipelining, we can pre-execute the first part of the next instruction (DSP+1/RSP+1) along with the current instruction, and execute the remaining part in the next clock cycle. Hence, after pipelining, all instructions execute in a single clock cycle except STORE, which requires 2 clock cycles (before pipelining, STORE required 3 clock cycles). Indeed, STORE needs to execute DSP − 1 twice, which cannot be done in a single clock cycle; the rest of the instruction is executed in the next clock.<br />

All the instructions have been divided into the two stages so that each instruction executes in one clock cycle after the implementation of the two-stage pipeline. The complete list of instructions is given below.<br />


Table C.1: Instruction set of stack processor (pipelined model)<br />

Instructions First Stage Second Stage<br />

ADD NOP TOS ⇐ TOS + NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

ADC NOP TOS ⇐ TOS + NOS<br />

NOS ⇐ Cout<br />

SUB NOP TOS ⇐ TOS - NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

MUL NOP TOS ⇐ TOS × NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

DIV NOP TOS ⇐ TOS ÷ NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

MOD NOP TOS ⇐ TOS mod NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

AND NOP TOS ⇐ TOS & NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

OR NOP TOS ⇐ TOS | NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

XOR NOP TOS ⇐ TOS xor NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

NEG NOP TOS ⇐ -TOS<br />

NOT NOP TOS ⇐ not TOS<br />

INC NOP TOS ⇐ TOS + 1<br />

DEC NOP TOS ⇐ TOS - 1<br />

SIGN NOP if (TOS ≥ 0) then TOS ⇐ 0x0000


Table C.2: Stack manipulation operations<br />

Instructions First stage Second Stage<br />

DROP NOP TOS ⇐ NOS<br />

NOS ⇐ DS [DSP]<br />

DSP ⇐ DSP - 1<br />

DUP DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

SWAP NOP TOS ⇐ NOS<br />

NOS ⇐ TOS<br />

OVER DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />

TOS ⇐ NOS<br />

NOS ⇐ TOS<br />

ROT NOP TOS ⇐ DS[DSP]<br />

NOS ⇐ TOS<br />

DS[DSP] ⇐ NOS<br />

R2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />

NOS ⇐ TOS<br />

DS[DSP] ⇐ NOS<br />

TORS ⇐ RS[RSP]<br />

RSP ⇐ RSP - 1<br />

CPR2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />

NOS ⇐ TOS<br />

DS[DSP] ⇐ NOS<br />

D2R RS[RSP] ⇐ TORS TORS ⇐ TOS<br />

DSP ⇐ DSP - 1<br />

TOS ⇐ NOS<br />

NOS ⇐ DS[DSP]<br />

RSP ⇐ RSP + 1<br />

RET NOP IP ⇐ TORS<br />

TORS ⇐ RS [RSP]<br />

RSP ⇐ RSP - 1<br />

Table C.3: Memory Fetch and Store<br />

Instructions First Stage Second Stage<br />

FETCH Mem_Addr ⇐ TOS TOS ⇐ Mem<br />

STORE Mem_Addr ⇐ TOS Mem ⇐ NOS<br />

TOS ⇐ DS[DSP] NOS ⇐ DS[DSP]<br />

DSP ⇐ DSP - 1 DSP ⇐ DSP - 1<br />


Table C.4: Loading Literals<br />

Instructions First Stage Second Stage<br />

LIT d8 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ data(byte)<br />

DLIT d16 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />

NOS ⇐ TOS<br />

TOS ⇐ data(word)<br />

Table C.5: Conditional Branch<br />

Instructions First Stage Second Stage<br />

ZBRA d NOP if (TOS=0) then<br />

IP ⇐ IP + d<br />

SBRA d NOP if (TOS<0) then<br />

IP ⇐ IP + d


Table C.8: Instruction Codes and Instruction Lengths<br />

b7 b6 b5 b4 b3 b2 b1 b0 Type of instruction Instruction Length<br />

0 0 0 0 0 0 0 0 NOP 0-byte<br />

1 0 0 0 0 0 0 0 ADD 1-byte<br />

1 0 0 0 0 0 0 1 ADC 1-byte<br />

1 0 0 0 0 0 1 0 SUB 1-byte<br />

1 0 0 0 0 0 1 1 SUBC 1-byte<br />

1 0 0 0 0 1 0 0 MUL 1-byte<br />

1 0 0 0 0 1 0 1 DIV 1-byte<br />

1 0 0 0 0 1 1 0 MOD 1-byte<br />

1 0 0 0 0 1 1 1 AND 1-byte<br />

1 0 0 0 1 0 0 0 OR 1-byte<br />

1 0 0 0 1 0 0 1 XOR 1-byte<br />

1 0 0 0 1 0 1 0 NEG 1-byte<br />

1 0 0 0 1 0 1 1 NOT 1-byte<br />

1 0 0 0 1 1 0 0 INC 1-byte<br />

1 0 0 0 1 1 0 1 DEC 1-byte<br />

1 0 0 0 1 1 1 0 SIGN 1-byte<br />

1 0 0 0 1 1 1 1 DROP 1-byte<br />

1 0 0 1 0 0 0 0 DUP 1-byte<br />

1 0 0 1 0 0 0 1 SWAP 1-byte<br />

1 0 0 1 0 0 1 0 OVER 1-byte<br />

1 0 0 1 0 0 1 1 ROT 1-byte<br />

1 0 0 1 0 1 0 0 R2D 1-byte<br />

1 0 0 1 0 1 0 1 CPR2D 1-byte<br />

1 0 0 1 0 1 1 0 D2R 1-byte<br />

1 0 0 1 0 1 1 1 FETCH 1-byte<br />

1 0 0 1 1 0 0 0 STORE 1-byte<br />

1 0 0 1 1 0 0 1 PUSH_DSP 1-byte<br />

1 0 0 1 1 0 1 0 POP_DSP 1-byte<br />

1 0 0 1 1 0 1 1 PUSH_DSP 1-byte<br />

1 0 0 1 1 1 0 0 POP_RSP 1-byte<br />

1 0 1 0 0 0 0 0 LIT a 2-bytes<br />

1 1 0 0 0 0 0 0 DLIT a 3-bytes<br />

1 1 1 0 0 0 0 0 RET 1-byte + IP-change<br />

0 0 1 0 0 0 0 1 ZBRA 2-bytes + IP-change<br />

0 0 1 0 0 0 1 0 SBRA 2-bytes + IP-change<br />

0 1 0 0 0 0 0 0 LBRA 3-bytes + IP-change<br />

0 1 0 0 0 0 0 1 CALL a 3-bytes + IP-change<br />

b7 b6 b5<br />
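The opcode layout in the table above is regular enough that the instruction length follows from the top three opcode bits alone. A decoder sketch, with the bit-class mapping read directly off the table rows (the byte counts are the table's, excluding the IP change noted for branches and RET):

```python
# Instruction length implied by opcode bits b7 b6 b5 (Table C.8).
LENGTH_BY_CLASS = {
    0b100: 1,   # ALU / stack / memory ops (ADD ... POP_RSP)
    0b101: 2,   # LIT  + 8-bit literal
    0b110: 3,   # DLIT + 16-bit literal
    0b111: 1,   # RET
    0b001: 2,   # ZBRA / SBRA + 8-bit displacement
    0b010: 3,   # LBRA / CALL + 16-bit address
}

def instruction_length(opcode):
    if opcode == 0:             # NOP is listed as 0-byte in the table
        return 0
    return LENGTH_BY_CLASS[opcode >> 5]

print(instruction_length(0b10100000))   # LIT  -> 2
print(instruction_length(0b01000001))   # CALL -> 3
```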



Appendix D<br />

List of Acronyms<br />

ALU : Arithmetic Logic Unit.<br />

ASIC : Application-Specific Integrated Circuit.<br />

BER : Backward Error Recovery.<br />

BIED : Built-In Error Detection Schemes.<br />

BPSG : Boro-Phos-Silicate-Glass.<br />

CED : Concurrent Error Detection.<br />

CISC : Complex Instruction Set Computer.<br />

CMOS : Complementary Metal Oxide Semiconductor.<br />

CPO : Clock Per Operation.<br />

CPI : Clock Per Instruction.<br />

CRC : Cyclic Redundancy Codes.<br />

DCR : Dual-Checker Rail.<br />

DED : Double Error Detection.<br />

DM : Dependable Memory.<br />

DMR : Dual Modular Redundancy.<br />

DS : Data Stack.<br />

DSP : Data Stack Pointer.<br />

DWC : Duplication With Comparison.<br />

DWCR : Duplication With Complement Redundancy.<br />

ECC : Error Control Coding.<br />

EDC : Error Detecting Codes.<br />

EDCC : Error Detecting and Correcting Codes.<br />

EDP : Error Detecting Processor.<br />

ESS : Electronic Switching Systems.<br />

FER : Forward Error Recovery.<br />

FPGA : Field Programmable Gate Array.<br />

FT : Fault Tolerant.<br />

FTMP : Fault Tolerant Multi-Processor.<br />


HD : Hamming Distance.<br />

HDL : Hardware Description Language.<br />

HW : Hardware.<br />

IEEE : Institute of Electrical and Electronics Engineers.<br />

IB : Instruction Buffer.<br />

IBMU : Instruction Buffer Management Unit.<br />

IP : Instruction Pointer.<br />

ISA : Instruction Set Architecture.<br />

LICM : Laboratoire Interfaces Capteurs et Micro-électronique.<br />

LIFO : Last In First Out.<br />

MBU : Multiple Bit Upsets.<br />

MCU : Multiple Cell Upsets.<br />

MISC : Minimum Instruction Set Computer.<br />

MPSoC : Multi-Processor System on Chip.<br />

NASA : National Aeronautics and Space Administration.<br />

NoC : Network on Chip.<br />

NOS : Next Of data Stack.<br />

PC : Program Counter.<br />

RAM : Random Access Memory.<br />

REE : Remote Exploration and Experimentation.<br />

RESO : Redundant Execution with Shifted Operands.<br />

RISC : Reduced Instruction Set Computer.<br />

RS : Return Stack.<br />

RSP : Return Stack Pointer.<br />

RTL : Register Transfer Level.<br />

SCHJ : Self-Checking Hardware Journal.<br />

SCPC : Self-Checking Processor Core.<br />

SD : Sequence Duration.<br />

SE : State determining Elements.<br />

SEB : Single Event Burnout.<br />

SEE : Single Event Effect.<br />

SEGR : Single Event Gate Rupture.<br />

SEFI : Single Event Functional Interrupt.<br />

SEL : Single Event Latchup.<br />

SEU : Single Event Upset.<br />

SET : Single Event Transient.<br />

SEC : Single Error Correction.<br />

SW : Software.<br />

SIFT : Software Implemented Fault Tolerance.


SoC : System on Chip.<br />

STAR : Self-Testing and Repair.<br />

TMR : Triple Modular Redundancy.<br />

TORS : Top Of Return Stack.<br />

TOS : Top Of data Stack.<br />

UJ : Un-validated Journal.<br />

VP : Validation Point.<br />

UVD : Un-Validated Data.<br />

VD : Validated Data.<br />

VHDL : VHSIC Hardware Description Language.<br />

VJ : Validated Journal.<br />



Appendix E<br />

List of Publications<br />

• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE, “A Self-Checking HW Journal for a Fault Tolerant Processor Architecture,” International Journal of Reconfigurable Computing 2011 (IJRC’11) (Accepted).<br />

• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE, “A Dependable Stack Processor Core for MPSoC Development,” XXIV Conference on Design of Circuits and Integrated Systems (DCIS’09), Zaragoza, Spain, November 18-20, 2009.<br />

• Mohsin AMIN, Fabrice MONTEIRO, Camille DIOU, Abbas RAMAZANI, Abbas DANDACHE, “A HW/SW Mixed Mechanism to Improve the Dependability of a Stack Processor,” 16th IEEE International Conference on Electronics, Circuits, and Systems (ICECS’09), Hammamet, Tunisia, December 13-16, 2009.<br />

• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE, “Journalized Stack Processor for Reliable Embedded Systems,” 1st International Conference on Aerospace Science and Engineering (ICASE’09), Islamabad, Pakistan, August 18-20, 2009.<br />

• A. Ramazani, M. Amin, F. Monteiro, C. Diou, A. Dandache, “A Fault Tolerant Journalized Stack Processor Architecture,” 15th IEEE International On-Line Testing Symposium (IOLTS’09), Sesimbra-Lisbonne, Portugal, 24–27 June 2009.<br />

• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE, “Error Detecting and Correcting Journal for Dependable Processor Core,” GDR System on Chip - System in Package (GDR-SoC-SiP’10), Cergy-Paris, France, 9-11 June 2010.<br />

• Mohsin Amin, Camille Diou, Fabrice Monteiro, Abbas Ramazani, “Design Methodology of Reliable Stack Processor Core,” GDR System on Chip - System in Package 2009 (GDR-SoC-SiP’09), Orsay-Paris, France, 9-11 June 2009.<br />


• Mohsin AMIN, “Self-Organization in Embedded Systems,” 2nd Winter School on Self Organization in Embedded Systems, Schloss Dagstuhl, Germany, November 2007.<br />


List of Figures<br />

1.1 An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs<br />

in its wake, which effects charge disturbance [MW04] . . . . . . . . . . . . . . . . 16<br />

1.2 Strike of high energy particle resulted in error(s) . . . . . . . . . . . . . . . . . . . . 16<br />

1.3 Classification of faults on basis of single event effect (SEE) [Pie07]. . . . . . . . . . 17<br />

1.4 Dependability Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

1.5 Fault, error and failure chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

1.6 Error propagation from processor to main memory . . . . . . . . . . . . . . . . . . 21<br />

1.7 A single fault caused failure of traffic control system . . . . . . . . . . . . . . . . . 22<br />

1.8 Service failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />

1.9 Fault characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />

1.10 Some reasons for fault occurrence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />

1.11 Dependability techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />

1.12 Sequence of events from ionization to failure and a set of fault tolerant techniques<br />

applied at different times [Pie07]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />

2.1 General architecture of a concurrent error detection schemes [MM00] . . . . . . . . 32<br />

2.2 Duplication with comparison (DWC) . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />

2.3 Time redundancy for temporary and intermittent fault detection . . . . . . . . . . . . 34<br />

2.4 Time redundancy for permanent error detection . . . . . . . . . . . . . . . . . . . . 34<br />

2.5 Information redundancy principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />

2.6 Parity coder in data storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />

2.7 Functional Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />

2.8 Residue codes adder [FFMR09]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />

2.9 Triple modular redundancy (TMR) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />

2.10 Error detecting and correcting memory block . . . . . . . . . . . . . . . . . . . . . 39<br />

2.11 Basic strategies for implementing Error Recovery. . . . . . . . . . . . . . . . . . . . 41<br />

2.12 The triple-TMR in Boeing 777 [Yeh02] . . . . . . . . . . . . . . . . . . . . . . . . 45<br />

3.1 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />

3.2 Limitation of parity check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />

3.3 Rollback Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />


3.4 Error detection during Sequence Duration (SD) and rollback called . . . . . . . . . . 56<br />

3.5 No-error detected during the SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />

3.6 Time overhead in rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />

3.7 Untrusted data flowing into dependable memory (DM) . . . . . . . . . . . . . . . . 60<br />

3.8 Data stored to temporary location before writing to DM . . . . . . . . . . . . . . . . 61<br />

3.9 Data corruption in temporary storage. . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />

3.10 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />

3.11 Overall design specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />

3.12 Global design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />

3.13 Model-I with data cache and a pair of journals . . . . . . . . . . . . . . . . . . . . . 66<br />

3.14 Cache with associative mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66<br />

3.15 FT evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67<br />

3.16 Periodic, random and burst errors models . . . . . . . . . . . . . . . . . . . . . . . 68<br />

3.17 Model-I: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 69<br />

3.18 Model-I: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . . 70<br />

3.19 Model-I: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 71<br />

3.20 Block diagram of Model-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />

3.21 Processor can simultaneously read from Journal and DM . . . . . . . . . . . . . . . 72<br />

3.22 No error detected during SD and data is validated at VP . . . . . . . . . . . . . . . . 72<br />

3.23 Error detected and all the data written during SD is deleted . . . . . . . . . . . . . . 73<br />

3.24 Model-II: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 74<br />

3.25 Model-II: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . 74<br />

3.26 Model-II: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 75<br />

4.1 Design of a self checking processor core (SCPC) . . . . . . . . . . . . . . . . . . . 77<br />

4.2 Criteria behind the choice of the stack processor . . . . . . . . . . . . . . . . . . . 79<br />

4.3 Multiple and large stacks, 0-operand (ML0) computer chosen from three axis design<br />

space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81<br />

4.4 Simplified stack machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82<br />

4.5 Modified stack processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83<br />

4.6 Simplified data-path of the proposed model (arithmetic and logic instructions) . . . . 84<br />

4.7 Different instructions type from execution point of view (without pipelining) . . . . . 85<br />

4.8 Execution of duplication (DUP) instruction in 2-clock . . . . . . . . . . . . . . . . . 86<br />

4.9 Multiple-byte instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />

4.10 Data-path of protected-processor’s ALU . . . . . . . . . . . . . . . . . . . . . . . . 88<br />

4.11 Resource utilization chart for various ALU designs [SFRB05] . . . . . . . . . . . . 89<br />

4.12 ALU is protecting the Logical and Arithmetic instructions separately . . . . . . . . . 90<br />

4.13 Remainder check technique for error detection in arithmetic instructions . . . . . . . 91<br />

4.14 Parity check technique for error detection in logic instructions . . . . . . . . . . . . 92



4.15 Parity check technique for error detection in register(s) . . . . . . . . . . . . . . . . 92<br />

4.16 Error occurred in Protected ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />

4.17 Instruction buffer management Unit (IBMU) . . . . . . . . . . . . . . . . . . . . . 95<br />

4.18 Instruction buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96<br />

4.19 (a) Opcodes description and (b) pipelined execution model . . . . . . . . . . . . . . 97<br />

4.20 A sample program executed through non-pipelined and pipelined stack processor core 97<br />

4.21 Timing diagram for a sample program executed twice: once in non-pipelined version<br />

and then pipelined version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />

4.22 Implementation design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />

4.23 Strategy to overcome performance overhead due to conditional branches . . . . . . . 99<br />

4.24 Implementation of a self-checking processor core . . . . . . . . . . . . . . . . . . . 100<br />

4.25 Error detected in SCPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />

5.1 Design of SCHJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103<br />

5.2 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />

5.3 (a) Error(s) in un-validated journal (b) error(s) in validated journal . . . . . . . . . . 105<br />

5.4 Hsiao Parity Check Matrix (41,34) . . . . . . . . . . . . . . . . . . . . . . . . . . . 106<br />

5.5 SCHJ structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />

5.6 Error detection and correction in journal (a memory block of SCHJ). . . . . . . . . 109<br />

5.7 Overall architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />

5.8 Rollback mechanism on error detection. . . . . . . . . . . . . . . . . . . . . . . . . 111<br />

5.9 SCHJ operation flow chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />

5.10 SCHJ mode 00. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />

5.11 SCHJ mode 01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />

5.12 Read of UVD from SCHJ in mode 01 . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />

5.13 SCHJ mode 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114<br />

5.14 Mode 10 of SCHJ operation (uncorrectable error detected) . . . . . . . . . . . . . . . 114<br />

5.15 SCHJ mode 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

5.16 Uncorrectable error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

5.17 Increase of percentage utilization of FT processor (SCPC + SCHJ) on device EP3SE50F484C2<br />

with increase in the depth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />

5.18 Theoretical limits of Journal Depth. . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />

5.19 Relation between journal depth and percentage write in benchmarks. . . . . . . . . . 118<br />

5.20 CPI vs. SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />

5.21 Dynamic SD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />

6.1 The overall FT-processor to be validated. . . . . . . . . . . . . . . . . . . . . . . . . 121<br />

6.2 Error injection in FT-processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />

6.3 Error patterns (errors can occur in any bit, not necessarily the bit shown here). . . . . 123<br />

6.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124



6.5 Single bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />

6.6 Double bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />

6.7 Triple bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126<br />

6.8 Harsh (1 up to 8 bit randomly) error injection. . . . . . . . . . . . . . . . . . . . . . 126<br />

6.9 Performance Degradation due to re-execution . . . . . . . . . . . . . . . . . . . . . 127<br />

6.10 Simulation curves for group-I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />

6.11 Simulation curves for group-II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />

6.12 Simulation curves for group-III. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />

6.13 Effect of EIR on rollback for benchmarks group-I. . . . . . . . . . . . . . . . . . . . 129<br />

6.14 Effect of EIR on rollback for benchmarks group-II. . . . . . . . . . . . . . . . . . . 130<br />

6.15 Effect of EIR on rollback for benchmarks group-III. . . . . . . . . . . . . . . . . . . 130<br />

A.1 Canonical Stack Machine [KJ89] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140


List of Tables<br />

1.1 Cost/hour for failure of control system [Pie07] . . . . . . . . . . . . . . . . . . . . . 13<br />

1.2 Dependability attributes for University web-server and Nuclear-reactor [Pie07], where<br />

attributes are classified as: – very important = 4 points, – least important = 1 point . 20<br />

2.1 Fault modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />

3.1 Comparison of the Processor-Memory Models . . . . . . . . . . . . . . . . . . . . . 73<br />

4.1 Instruction types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />

4.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />

5.1 Modes of Journal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />

5.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

6.1 Read/Write profiles in benchmarks groups . . . . . . . . . . . . . . . . . . . . . . . 127<br />

B.1 Arithmetic and logic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />

B.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143<br />

B.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />

B.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />

B.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />

B.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />

B.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145<br />

C.1 Instruction set of stack processor (pipelined model) . . . . . . . . . . . . . . . . . . 148<br />


C.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />

C.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />

C.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />

C.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />

C.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />

C.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />

C.8 Instruction Codes and Instruction Lengths . . . . . . . . . . . . . . . . . . . . . . . 151


Bibliography<br />

[ACC+93] J. Arlat, A. Costes, Y. Crouzet, J. C. Laprie, and D. Powell. Fault injection and dependability evaluation of fault-tolerant systems. IEEE Transactions on Computers, pages 913–923, 1993.<br />

[Aer11] Aeroflex. Dual-Core LEON3FT SPARC v8 processor, 2011.<br />

[AFK05] J. Aidemark, P. Folkesson, and J. Karlsson. A framework for node-level fault tolerance in distributed real-time systems. In Proceedings of International Conference on Dependable Systems and Networks, 2005 (DSN'05), pages 656–665, 2005.<br />

[AHHW08] U. Amgalan, C. Hachmann, S. Hellebrand, and H. J. Wunderlich. Signature Rollback - a technique for testing robust circuits. In 26th IEEE VLSI Test Symposium, 2008 (VTS'08), pages 125–130, 2008.<br />

[AKT+08] H. Ando, R. Kan, Y. Tosaka, K. Takahisa, and K. Hatanaka. Validation of hardware error recovery mechanisms for the SPARC64 V microprocessor. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on, pages 62–69, 2008.<br />

[ALR01] A. Avizienis, J. C Laprie, and B. Randell. Fundamental concepts of dependability.<br />

Research report UCLA CSD Report no. 010028, 2001.<br />

[AMD+10] M. Amin, F. Monteiro, C. Diou, A. Ramazani, and A. Dandache. A HW/SW mixed mechanism to improve the dependability of a stack processor. In Proceedings of 16th IEEE International Conference on Electronics, Circuits, and Systems, 2009 (ICECS'09), pages 976–979, 2010.<br />

[ARM09] ARM. Cortex-R4 and Cortex-R4F. Technical reference manual, 2009.<br />

[ARM+11] Mohsin Amin, Abbas Ramazani, Fabrice Monteiro, Camille Diou, and Abbas Dandache. A Self-Checking hardware journal for a fault tolerant processor architecture. Hindawi Publishing Corporation, 2011.<br />

[Bai10] G. Bailey. Comparison of GreenArrays chips with Texas Instruments MSP430F5xx as micropower controllers, June 2010.<br />


[Bau05] R. C. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability, 5(3):305–316, 2005.<br />

[BBV+05] D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen. NonStop advanced architecture. In Proceedings of International Conference on Dependable Systems and Networks, 2005 (DSN'05), pages 12–21, 2005.<br />

[BCT08] B. Bridgford, C. Carmichael, and C. W. Tseng. Single-event upset mitigation selection guide. Xilinx Application Note, 987, 2008.<br />

[BGB+08] J. C. Baraza, J. Gracia, S. Blanc, D. Gil, and P. J. Gil. Enhancement of fault injection techniques based on the modification of VHDL code. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(6):693–706, 2008.<br />

[Bic10] R. Bickham. An Analysis of Error Detection Techniques for Arithmetic Logic Units.<br />

PhD thesis, Vanderbilt University, 2010.<br />

[BP02] N. S. Bowen and D. K. Pradhan. Virtual checkpoints: Architecture and performance. IEEE Transactions on Computers, 41(5):516–525, 2002.<br />

[BT02] D. Briere and P. Traverse. AIRBUS A320/A330/A340 electrical flight controls - a family of fault-tolerant systems. In the Twenty-Third International Symposium on Fault-Tolerant Computing System, 1993 (FTCS'93), pages 616–623, 2002.<br />

[Car01] C. Carmichael. Triple module redundancy design techniques for Virtex FPGAs. Xilinx Application Note XAPP197, 1, 2001.<br />

[Che08] L. Chen. Hsiao-Code check matrices and recursively balanced matrices. arXiv preprint arXiv:0803.1217, 2008.<br />

[CHL97] W-T Chang, S Ha, and E.A. Lee. Heterogeneous simulation - mixing Discrete-Event<br />

models with dataflow. Journal of VLSI Signal Processing, 15(1-2):127–144, 1997.<br />

[CP02] J. A Clark and D. K Pradhan. Fault injection: A method for validating computer-system<br />

dependability. Computer, 28(6):47–56, 2002.<br />

[CPB+06] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky. Bulletproof: A defect-tolerant CMP switch architecture. In Proceedings of 25th International Symposium on High-Performance Computer Architecture, 2006, pages 5–16, 2006.<br />

[CTS+10] C. L. Chen, N. N. Tendolkar, A. J. Sutton, M. Y. Hsiao, and D. C. Bossen. Fault-tolerance design of the IBM enterprise system/9000 type 9021 processors. IBM Journal of Research and Development, 36(4):765–779, 2010.<br />



[EAWJ02] E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375–408, 2002.<br />

[EKD+05] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, et al. Razor: A low-power pipeline based on circuit-level timing speculation. In Proceedings of 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003 (MICRO'03), pages 7–18, 2005.<br />

[FFMR09] R. Forsati, K. Faez, F. Moradi, and A. Rahbar. A fault tolerant method for residue arithmetic circuits. In Proceedings of 2009 International Conference on Information Management and Engineering, pages 59–63, 2009.<br />

[FGAD10] R. Fernández-Pascual, J. M. Garcia, M. E. Acacio, and J. Duato. Dealing with transient faults in the interconnection network of CMPs at the cache coherence level. IEEE Transactions on Parallel and Distributed Systems, 21(8):1117–1131, 2010.<br />

[FGAM10] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic soft error reliability on the cheap. ACM SIGPLAN Notices, 45(3):385–396, 2010.<br />

[FP02] E. Fujiwara and D. K Pradhan. Error-control coding in computers. Computer,<br />

23(7):63–72, 2002.<br />

[GBT05] S. Ghosh, S. Basu, and N. A Touba. Selecting error correcting codes to minimize power<br />

in memory checker circuits. Journal of Low Power Electronics, 1(1):63–72, 2005.<br />

[GC06] J. Gaisler and E. Catovic. Multi-Core processor based on LEON3-FT IP core (LEON3-FT-MP). In Proceedings of Data Systems in Aerospace, 2006 (DASIA'06), volume 630, page 76, 2006.<br />

[Gha11] S. Ghaznavi. Soft Error Resistant Design of the AES Cipher Using SRAM-based FPGA. PhD thesis, University of Waterloo, 2011.<br />

[GMT08] M. Grottke, R. Matias, and K. S. Trivedi. The fundamentals of software aging. In Proceedings of IEEE International Conference on Software Reliability Engineering Workshops, 2008 (ISSRE Wksp 2008), pages 1–6, 2008.<br />

[GPLL09] C. Godlewski, V. Pouget, D. Lewis, and M. Lisart. Electrical modeling of the effect of beam profile for pulsed laser fault injection. Microelectronics Reliability, 49(9-11):1143–1147, 2009.<br />

[Gre10] Green. Project green array chip, 2010.<br />

[Hay05] J. R. Hayes. The architecture of the scalable configurable instrument processor. Technical Report SRI-05-030, The Johns Hopkins Applied Physics Laboratory, 2005.<br />



[HCTS10] M. Y. Hsiao, W. C. Carter, J. W. Thomas, and W. R. Stringfellow. Reliability, availability, and serviceability of IBM computer systems: A quarter century of progress. IBM Journal of Research and Development, 25(5):453–468, 2010.<br />

[HH06] A. J Harris and J. R Hayes. Functional programming on a Stack-Based embedded<br />

processor. 2006.<br />

[Hsi10] M. Y Hsiao. A class of optimal minimum odd-weight-column SEC-DED codes. IBM<br />

Journal of Research and Development, 14(4):395–401, 2010.<br />

[IK03] R. K. Iyer and Z. Kalbarczyk. Hardware and software error detection. Technical report, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Urbana, 2003.<br />

[Int09] Intel. White paper - the Intel Itanium processor 9300 series. Technical report, 2009.<br />

[ITR07] ITRS. International technology roadmap for semiconductors. 2007.<br />

[Jab09] Jaber. Conception architecturale haut débit et sûre de fonctionnement pour les codes<br />

correcteurs d’erreurs. PhD thesis, Université Paul Verlaine - Metz, France, Metz, 2009.<br />

[Jal09] M. Jallouli. Méthodologie de conception d'architectures de processeur sûres de fonctionnement pour les applications mécatroniques. PhD thesis, Université Paul Verlaine - Metz, France, Metz, 2009.<br />

[JDMD07] M. Jallouli, C. Diou, F. Monteiro, and A. Dandache. Stack processor architecture and development methods suitable for dependable applications. Reconfigurable Communication-centric SoCs (ReCoSoC'07), Montpellier, France, 2007.<br />

[JES06] JEDEC Standard JESD89A. Measurement and reporting of alpha particle and terrestrial cosmic ray-induced soft errors in semiconductor devices. October 2006.<br />

[JHW+08] J. Johnson, W. Howes, M. Wirthlin, D. L. McMurtrey, M. Caffrey, P. Graham, and K. Morgan. Using duplication with compare for on-line error detection in FPGA-based designs. In Proceedings of IEEE Aerospace Conference, 2008, pages 1–11, 2008.<br />

[JPS08] B. Joshi, D. Pradhan, and J. Stiffler. Fault-Tolerant computing. 2008.<br />

[KJ89] P. J Koopman Jr. Stack computers: the new wave. Halsted Press New York, NY, USA,<br />

1989.<br />

[KKB07] I. Koren and C. M. Krishna. Fault-tolerant systems. Elsevier/Morgan Kaufmann, 2007.<br />



[KKS+07] P. Kudva, J. Kellington, P. Sanda, R. McBeth, J. Schumann, and R. Kalla. Fault injection verification of IBM POWER6 soft error resilience. In Architectural Support for Gigascale Integration (ASGI) Workshop, 2007.<br />

[KMSK09] J. W Kellington, R. McBeth, P. Sanda, and R. N Kalla. IBM POWER6 processor soft<br />

error tolerance analysis using proton irradiation. In Proceedings of the IEEE Workshop<br />

on Silicon Errors in Logic—Systems Effects (SELSE) Conference, 2009.<br />

[Kop04] H. Kopetz. From a federated to an integrated architecture for dependable embedded<br />

systems. PhD thesis, Technische Univ Vienna, Vienna, Austria, 2004.<br />

[Kop11] H. Kopetz. Real-time systems: design principles for distributed embedded applications,<br />

volume 25. Springer-Verlag New York Inc, 2011.<br />

[Lal05] P. K Lala. Single error correction and double error detecting coding scheme, 2005.<br />

[Lap04] J.C. Laprie. Sûreté de fonctionnement des systèmes : concepts de base et terminologie.<br />

2004.<br />

[LAT07] K. W. Li, J. R. Armstrong, and J. G. Tront. An HDL simulation of the effects of single<br />

event upsets on microprocessor program flow. IEEE Transactions on Nuclear Science,<br />

31(6):1139–1144, 2007.<br />

[LB07] J. Laprie and R. Brian. Origins and integration of the concepts. 2007.<br />

[LBS+11] I. Lee, M. Basoglu, M. Sullivan, D. H. Yoon, L. Kaplan, and M. Erez. Survey of error and fault detection mechanisms. 2011.<br />

[LC08] C. A. L. Lisboa and L. Carro. XOR-based low cost checkers for combinational logic. In IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems, pages 281–289, 2008.<br />

[LN09] Dongwoo Lee and Jongwhoa Na. A novel simulation fault injection method for dependability analysis. IEEE Design & Test of Computers, 26(6):50–61, December 2009.<br />

[LRL04] J. C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.<br />

[MB07] N. Madan and R. Balasubramonian. Power efficient approaches to redundant multithreading. IEEE Transactions on Parallel and Distributed Systems, pages 1066–1079, 2007.<br />

[MBS07] A. Meixner, M. E Bauer, and D. Sorin. Argus: Low-cost, comprehensive error detection<br />

in simple cores. In Proceedings of the 40th Annual IEEE/ACM International Symposium<br />

on Microarchitecture, page 210–222, 2007.



[MG09] A. Maloney and A. Goscinski. A survey and review of the current state of rollback-recovery for cluster systems. Concurrency and Computation: Practice and Experience, 21(12):1632–1666, 2009.

[MM00] S. Mitra and E. J. McCluskey. Which concurrent error detection scheme to choose? In Proceedings of the International Test Conference (ITC), 2000.

[MMPW07] K. S. Morgan, D. L. McMurtrey, B. H. Pratt, and M. J. Wirthlin. A comparison of TMR with alternative fault-tolerant design techniques for FPGAs. IEEE Transactions on Nuclear Science, 54(6):2065–2072, 2007.

[Mon07] Y. Monnet. Etude et modélisation de circuits résistants aux attaques non intrusives par injection de fautes. PhD thesis, Institut National Polytechnique de Grenoble, 2007.

[MS06] F. MacWilliams and N. Sloane. The theory of error-correcting codes. 2006.

[MS07] A. Meixner and D. J. Sorin. Error detection using dynamic dataflow verification. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 104–118, 2007.

[MSSM10] M. J. Mack, W. M. Sauer, S. B. Swaney, and B. G. Mealey. IBM POWER6 reliability. IBM Journal of Research and Development, 51(6):763–774, 2010.

[Muk08] S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2008.

[MW04] R. Mastipuram and E. C. Wee. Soft errors' impact on system reliability. EDN, September 30, 2004.

[MW07] T. C. May and M. H. Woods. A new physical mechanism for soft errors in dynamic memories. In 16th Annual Reliability Physics Symposium, pages 33–40, 2007.

[NBV+09] T. Naughton, W. Bland, G. Vallee, C. Engelmann, and S. L. Scott. Fault injection framework for system resilience evaluation: fake faults for finding future failures. In Proceedings of the 2009 Workshop on Resiliency in High Performance Computing, pages 23–28, 2009.

[Nic02] M. Nicolaidis. Efficient implementations of self-checking adders and ALUs. In Proceedings of the Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), pages 586–595, 2002.

[Nic10] M. Nicolaidis. Soft errors in modern electronic systems. Springer Verlag, 2010.

[NL11] J. Na and D. Lee. Simulated fault injection using simulator modification technique. ETRI Journal, 33(1), 2011.



[NMGT06] J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas. ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers. In Twelfth International Symposium on High-Performance Computer Architecture (HPCA-12), pages 200–211, 2006.

[NTN+09] M. Nicolaidis, K. Torki, F. Natali, F. Belhaddad, and D. Alexandrescu. Implementation and validation of a low-cost single-event latchup mitigation scheme. In IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), Stanford, CA, 2009.

[NX06] V. Narayanan and Y. Xie. Reliability concerns in embedded system designs. Computer, 39(1):118–120, 2006.

[Pat10] A. Patel. Fault tolerant features of modern processors, 2010.

[PB04] S. Pelc and C. Bailey. Ubiquitous forth objects. In EuroForth'04, Schloss Dagstuhl, Germany, 2004.

[PF06] J. H. Patel and L. Y. Fung. Concurrent error detection in ALU's by recomputing with shifted operands. IEEE Transactions on Computers, 100(7):589–595, 2006.

[Pie06] S. J. Piestrak. Dependable computing: Problems, techniques and their applications. In First Winter School on Self-Organization in Embedded Systems, Schloss Dagstuhl, Germany, 2006.

[Pie07] S. J. Piestrak. Systèmes numériques tolérants aux fautes, 2007.

[PIEP09] P. Pop, V. Izosimov, P. Eles, and Z. Peng. Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):389–402, 2009.

[Poe05] C. Poellabauer. Real-time systems, 2005.

[Pow10] D. Powell. A generic fault-tolerant architecture for real-time dependable systems. Springer Publishing Company, Incorporated, 2010.

[QGK+06] H. Quinn, P. Graham, J. Krone, M. Caffrey, and S. Rezgui. Radiation-induced multi-bit upsets in SRAM-based FPGAs. IEEE Transactions on Nuclear Science, 52(6):2455–2461, 2006.

[QLZ05] F. Qin, S. Lu, and Y. Zhou. SafeMem: exploiting ECC-memory for detecting memory leaks and memory corruption during production runs. In 11th International Symposium on High-Performance Computer Architecture (HPCA-11), pages 291–302, 2005.

[RAM+09] A. Ramazani, M. Amin, F. Monteiro, C. Diou, and A. Dandache. A fault tolerant journalized stack processor architecture. In 15th IEEE International On-Line Testing Symposium (IOLTS'09), Sesimbra-Lisbon, Portugal, 2009.



[RI08] G. A. Reis III. Software modulated fault tolerance. PhD thesis, Princeton University, 2008.

[RK09] J. A. Rivers and P. Kudva. Reliability challenges and system performance at the architecture level. IEEE Design & Test of Computers, 26(6):62–73, 2009.

[RNS+05] K. Rothbart, U. Neffe, C. Steger, R. Weiss, E. Rieger, and A. Muehlberger. A smart card test environment using multi-level fault injection in SystemC. In Proceedings of the 6th IEEE Latin-American Test Workshop, pages 103–108, March 2005.

[RR08] V. Reddy and E. Rotenberg. Coverage of a microarchitecture-level fault check regimen in a superscalar processor. In IEEE International Conference on Dependable Systems and Networks (DSN'08), pages 1–10, Anchorage, Alaska, 2008.

[RRTV02] M. Rebaudengo, S. Reorda, M. Torchiano, and M. Violante. Soft-error detection through software fault-tolerance techniques. In International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 210–218, 2002.

[RS09] B. Rahbaran and A. Steininger. Is asynchronous logic more robust than synchronous logic? IEEE Transactions on Dependable and Secure Computing, pages 282–294, 2009.

[RYKO11] W. Rao, C. Yang, R. Karri, and A. Orailoglu. Toward future systems with nanoscale devices: Overcoming the reliability challenge. Computer, 44(2):46–53, 2011.

[Sch08] M. Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, 2008.

[SFRB05] V. Srinivasan, J. W. Farquharson, W. H. Robinson, and B. L. Bhuva. Evaluation of error detection strategies for an FPGA-based self-checking arithmetic and logic unit. In MAPLD International Conference, 2005.

[SG10] L. Spainhower and T. A. Gregg. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development, 43(5/6):863–873, 2010.

[Sha06] M. Shannon. A C compiler for stack machines. MSc thesis, University of York, 2006.

[SHLR+09] S. K. Sastry Hari, M. L. Li, P. Ramachandran, B. Choi, and S. V. Adve. mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 122–132, 2009.



[SMHW02] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 123–134, 2002.

[SMR+07] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), pages 297–306, 2007.

[Sor09] D. J. Sorin. Fault tolerant computer architecture. Morgan & Claypool Publishers, 2009.

[SSF+08] J. R. Schwank, M. R. Shaneyfelt, D. M. Fleetwood, J. A. Felix, P. E. Dodd, P. Paillet, and V. Ferlet-Cavrois. Radiation effects in MOS oxides. IEEE Transactions on Nuclear Science, 55(4):1833–1853, 2008.

[Sta06] W. Stallings. Computer organization and architecture. Prentice Hall, 7th edition, 2006.

[TM95] C. H. Ting and C. H. Moore. MuP21: a high performance MISC processor. Forth Dimensions, 1995.

[Too11] C. Toomey. Statistical fault injection and analysis at the register transfer level using the Verilog Procedural Interface. PhD thesis, Vanderbilt University, 2011.

[Van08] V. P. Vanhauwaert. Fault injection based dependability analysis in an FPGA-based environment. PhD thesis, Institut Polytechnique de Grenoble, Grenoble, France, 2008.

[VFM06] A. Vahdatpour, M. Fazeli, and S. Miremadi. Transient error detection in embedded systems using reconfigurable components. In International Symposium on Industrial Embedded Systems (IES'06), pages 1–6, 2006.

[VK07] J. Von Knop. A process for developing a common vocabulary in the information security area. IOS Press, 2007.

[VSL09] M. Vayrynen, V. Singh, and E. Larsson. Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips. In Proceedings of Design, Automation & Test in Europe (DATE'09), pages 484–489, 2009.

[WA08] F. Wang and V. D. Agrawal. Single event upset: An embedded tutorial. In 21st International Conference on VLSI Design (VLSID 2008), pages 429–434, 2008.

[WCS08] P. M. Wells, K. Chakraborty, and G. S. Sohi. Adapting to intermittent faults in multicore systems. ACM SIGPLAN Notices, 43(3):255–264, 2008.



[WL10] C. F. Webb and J. S. Liptay. A high-frequency custom CMOS S/390 microprocessor. IBM Journal of Research and Development, 41(4/5):463–473, 2010.

[Yeh02] Y. C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the IEEE Aerospace Applications Conference, volume 1, pages 293–307, 2002.

[ZJ08] Y. Zhang and J. Jiang. Bibliographical review on reconfigurable fault-tolerant control systems. Annual Reviews in Control, 32(2):229–252, 2008.

[ZL09] J. F. Ziegler and W. A. Lanford. The effect of sea level cosmic rays on electronic devices. Journal of Applied Physics, 52(6):4305–4312, 2009.


RÉSUMÉ

Dans cette thèse, nous proposons une nouvelle approche pour la conception d'un processeur tolérant aux fautes. Celle-ci répond à plusieurs objectifs, dont celui d'obtenir un niveau de protection élevé contre les erreurs transitoires et un compromis raisonnable entre performances temporelles et coût en surface. Le processeur résultant sera utilisé ultérieurement comme élément constitutif d'un système multiprocesseur sur puce (MPSoC) tolérant aux fautes. Les concepts mis en œuvre pour la tolérance aux fautes reposent sur l'emploi de techniques de détection concurrente d'erreurs et de recouvrement par réexécution. Les éléments centraux de la nouvelle architecture sont un cœur de processeur à pile de données de type MISC (Minimal Instruction Set Computer) capable d'auto-détection d'erreurs, et un mécanisme matériel de journalisation chargé d'empêcher la propagation d'erreurs vers la mémoire centrale (supposée sûre) et de limiter l'impact du mécanisme de recouvrement sur les performances temporelles.

L'approche méthodologique mise en œuvre repose sur la modélisation et la simulation selon différents modes et niveaux d'abstraction, le développement d'outils logiciels dédiés, et le prototypage sur des technologies FPGA. Les résultats, obtenus sans recherche d'optimisation poussée, montrent clairement la pertinence de l'approche proposée, en offrant un bon compromis entre protection et performances. En effet, comme le montrent les multiples campagnes d'injection d'erreurs, le niveau de tolérance aux fautes est élevé, avec 100% des erreurs simples détectées et recouvrées, et environ 60% et 78% des erreurs doubles et triples. Le taux de recouvrement reste raisonnable pour des erreurs à multiplicité plus élevée, étant encore de 36% pour des erreurs de multiplicité 8.

Mots clés : Tolérance aux fautes, Processeur à pile de données, MPSoC, Journalisation, Restauration, Injection de fautes, Modélisation RTL.

ABSTRACT

In this thesis, we propose a new approach to designing a fault-tolerant processor. The methodology addresses several goals, including a high level of protection against transient faults together with reasonable performance and area overhead trade-offs. The resulting fault-tolerant processor will be used as a building block in a fault-tolerant MPSoC (Multi-Processor System-on-Chip) architecture. The concepts used to achieve fault tolerance are based on concurrent error detection and rollback error recovery techniques. The core elements of this architecture are a stack processor core of the MISC (Minimal Instruction Set Computer) class and a hardware journal in charge of preventing error propagation to the main memory (assumed dependable) and of limiting the impact of the rollback mechanism on time performance.

The design methodology relies on modeling at different abstraction levels and simulation modes, on developing dedicated software tools, and on prototyping on FPGA technology. The results, obtained without seeking thorough optimization, clearly show the relevance of the proposed approach, offering a good compromise between protection and performance. Indeed, fault tolerance, as revealed by several error injection campaigns, proves to be high, with 100% of errors detected and recovered for single-bit error patterns, and about 60% and 78% for double- and triple-bit error patterns, respectively. Furthermore, the recovery rate remains acceptable for larger error patterns, still reaching 36% for 8-bit error patterns.

Keywords: Fault Tolerance, Stack Processor, MPSoC, Journalization, Rollback, Fault Injection, RTL modeling.
