LABORATOIRE INTERFACES CAPTEURS<br />
ET MICRO-ÉLECTRONIQUE<br />
Doctoral School of IAEM - Lorraine<br />
Department of Electronics and Electrical Engineering<br />
A dissertation submitted to the University Paul Verlaine - Metz, France<br />
in partial fulfillment of the requirements for the degree of Doctor of Philosophy<br />
Discipline : Electronic Systems<br />
Specialty : Microelectronics<br />
DESIGN METHODOLOGY OF A FAULT-TOLERANT<br />
JOURNALIZED STACK PROCESSOR ARCHITECTURE<br />
by<br />
MOHSIN AMIN<br />
Thesis defended on June 9, 2011<br />
Doctoral Committee :<br />
PROF. LUC HEBRARD University of Strasbourg, France President of jury<br />
PROF. AHMED BOURIDANE University of Northumbria, Newcastle, UK Reviewer<br />
PROF. FERNANDO MORAES University of PUCRS, Porto Alegre, Brazil Reviewer<br />
DR. CAMILLE DIOU Paul Verlaine University - Metz, France Co-Supervisor<br />
PROF. FABRICE MONTEIRO Paul Verlaine University - Metz, France Supervisor<br />
LICM - 7 Rue Marconi, Technopôle, 57070 Metz, France<br />
Tel : +33 (0)3 87 31 56 57 - Fax : +33 (0)3 87 54 73 07 - www.licm.fr
I DEDICATE THIS WORK TO<br />
MY BELOVED BROTHER (LATE) QAISER AMIN<br />
May God give him peaceful rest forever!
Acknowledgements

A PhD thesis is a great experience: one works on very stimulating topics and challenging problems and, perhaps most importantly for me, meets and collaborates with extraordinary people. Along with earning a degree and research skills, I have learned the French language, experienced a new culture, and learned to live in a different climate. I have been in France for five years, yet there is still much more to explore.

First and foremost, many thanks go to Prof. Fabrice MONTEIRO and Dr. Camille DIOU for supervising my PhD thesis, for teaching me many new things, for their guidance and support, for all the fruitful discussions, and for their company during conference trips. I am grateful to them for letting me pursue my research interests with sufficient freedom while still being there to guide me. I am also grateful to the director of the LICM, Prof. Abbas DANDACHE, and to Dr. Camel TANOUGAST for their kind support during my stay at LICM-Metz.

My greetings go to Prof. Ahmed BOURIDANE, Northumbria University, Newcastle, UK, and Prof. Fernando MORAES, University PUCRS, Porto Alegre, Brazil, who honored me by accepting to review this thesis. I am also grateful to the president of the jury, Prof. Luc HEBRARD, University of Strasbourg, France, for presiding over the defense.

I am thankful to my colleague Dr. Abbas RAMAZANI, who guided me a great deal during my thesis. I would like to thank my officemates Frédéric, Hussain, Kevin, Mazan, Medhi and Rita for the good times we have had, and I wish good luck to the next ones: Alaa-Aldin, Cédric, David, Luca, Mokhtar, Salah and Said. Some of them are now more than officemates. I would like to express my gratitude and appreciation to Aamir, Armaghan, Fahad, Jawad, KB, Liaquat, Rafiq, Sadiq and Sundar. Special thanks to Sajid Butt for his unconditional friendship, his support, and for reminding me to focus on finishing my PhD.

Last but certainly not least, I owe a great deal to my family for their emotional support during my PhD. Many thanks to my parents, my brother Qasim, my wife Ayesha, and my sister Saba, who all contributed a lot (probably the most) to my life during this period in many ways. Love to my beloved children Mohammad Abu-Bakar and Aleeza. Finally, special thanks to the Higher Education Commission of Pakistan for funding my PhD thesis.

Thanks, folks!

Mohsin AMIN
Contents<br />
GENERAL INTRODUCTION 7<br />
I. STATE OF THE ART 13<br />
1 Dependability and Fault Tolerance 13<br />
1.1 Problematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />
1.1.1 Common Source of Faults and their Consequences . . . . . . . . . . . . . . 15<br />
1.2 Basic Concepts and Taxonomy of Dependable Computing . . . . . . . . . . . . . . 18<br />
1.2.1 Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
1.3 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.4 Threats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />
1.4.1 System Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
1.4.2 Characteristics of a Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.5 Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.1 Fault Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.2 Fault Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
1.5.3 Fault Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
1.5.4 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />
1.6 Techniques Applied at Different Levels . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
1.6.1 FT Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
1.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
2 Methods to Design and Evaluate FT Processors 31<br />
2.1 Error Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
2.1.1 Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32<br />
2.1.2 Temporal/Time Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />
2.1.3 Information Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.2 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.2.1 Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.2.2 Temporal Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
2.2.3 Information Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
2.3 Error Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
2.4 FT Processor Design Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41<br />
2.5 FT Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
2.5.1 Fault Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
2.5.2 Error Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
2.5.3 The Fault Injection Framework . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49<br />
II. QUALITATIVE AND QUANTITATIVE STUDY 53<br />
3 Design Methodology and Model Specifications 53<br />
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53<br />
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
3.2.1 Concurrent Error Detection: Parity Codes . . . . . . . . . . . . . . . . . . . 54<br />
3.2.2 Error Recovery: Rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
3.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.4 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.5 Design Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59<br />
3.5.1 Challenge # 1: Self Checking Processor Core Requirements . . . . . . . . . 59<br />
3.5.2 Challenge # 2: Temporary Storage Needed: Hardware Journal . . . . . . . . 60<br />
3.5.3 Challenge # 3: Processor-Memory Interfacing . . . . . . . . . . . . . . . 62<br />
3.5.4 Challenge # 4: Optimal Sequence Duration for Efficient Implementation of<br />
Rollback Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
3.6 Model Specifications and Global Design Flow . . . . . . . . . . . . . . . . . . . . . 63<br />
3.7 Functional Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
3.7.1 Model-I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
3.7.2 Model-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68<br />
3.7.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70<br />
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
4 Design and Implementation of a Self Checking Processor 77<br />
4.1 Processor Design Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
4.1.1 Advantages of Stack Processor . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
4.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80<br />
4.3 Hardware Model of the Stack Processor . . . . . . . . . . . . . . . . . . . . . . . . 82<br />
4.4 Design Challenges in FT Stack Processor . . . . . . . . . . . . . . . . . . . . . . . 84<br />
4.4.1 Challenge I: Self Checking Mechanism . . . . . . . . . . . . . . . . . . . . 85<br />
4.4.2 Challenge II: Performance Improvement . . . . . . . . . . . . . . . . . . . . 85<br />
4.5 Solution-I: Self Checking Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Error Detecting in ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
4.5.2 Error Detecting in Register and Data-Path . . . . . . . . . . . . . . . . . . . 92<br />
4.5.3 Self-Checking Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.5.4 Store Sensitive Elements (SE) . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.5.5 Protecting Opcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
4.6 Solution-II: Performance Aspects of Self-Checking Processor Core . . . . . . . . . . 94<br />
4.6.1 Solution-II (a): Multiple-byte Instructions . . . . . . . . . . . . . . . . . . . 94<br />
4.6.2 Solution-II (b): 2-Stage Pipelining to resolve Multi-clock Instruction Execution 95<br />
4.6.3 Reducing Overhead for Conditional Branches . . . . . . . . . . . . . . . . . 96<br />
4.7 Implementation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
5 Design of a Self Checking Hardware Journal 103<br />
5.1 Error Detection and Correction in the Journal . . . . . . . . . . . . . . . . . . . . . 104<br />
5.2 Principle of the technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
5.3 Journal Architecture and Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 107<br />
5.3.1 Modes of SCHJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109<br />
5.4 Risk of data contamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.5 Implementation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.5.1 Minimizing the Size of the Journal . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.5.2 Dynamic Sequence Duration . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
6 Fault Tolerant Processor Validation 121<br />
6.1 Design Hypothesis and Properties to be Checked . . . . . . . . . . . . . . . . . . . 122<br />
6.2 Error Injection Methodology and Error Profiles . . . . . . . . . . . . . . . . . . . . 122<br />
6.3 Experimental Validation of Self-Checking Methodology . . . . . . . . . . . . . . . 123<br />
6.4 Performance Degradation due to Re-execution . . . . . . . . . . . . . . . . . . . . . 126<br />
6.4.1 Evaluating Performance Degradation . . . . . . . . . . . . . . . . . . . . . 127<br />
6.5 Effect of Error Injection on Rate of Rollback . . . . . . . . . . . . . . . . . . . . . . 130<br />
6.6 Comparison with LEON FT-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
GENERAL CONCLUSION AND PROSPECTS 135<br />
A Canonical Stack Computers: 139<br />
B Instruction Set of Stack Processor 141<br />
B.1 Data Operations in Stack Processor: . . . . . . . . . . . . . . . . . . . . . . . . . . 145
C Instruction Set of Pipelined Stack Processor 147<br />
D List of Acronyms 153<br />
E List of publications 157
GENERAL INTRODUCTION<br />
General Introduction<br />
Nowadays, devices are becoming more sensitive to strikes by high-energy particles, which can cause single-event upsets (SEUs) when they hit the surface of a silicon device. These SEUs can result in soft errors that emerge as bit flips in memory or as signal noise in combinational logic.

In recent years, microprocessor performance has increased exponentially thanks to modern design trends, but susceptibility to environmental effects has increased as well [Kop11]. As clock speeds increase and feature sizes decrease, systems become more susceptible to ionizing radiation that leaks through the atmosphere. In addition, soft errors may be triggered by environmental factors such as static discharges or fluctuations in temperature and power supply voltage. The occurrence of soft errors in modern electronic systems is continuously increasing [Nic10].
Dependability is an important concern for current and future generation processor design [RI08].<br />
Conventional approaches to dependable processor design employ space or time redundancy [RR08]. Processor replication has long been used as a fault tolerance (FT) technique against transient faults [Kop04]. It is a costly solution requiring more than 100% area overhead (and also power overhead), since at least duplication is required for error detection (at least triplication for error correction/masking), plus additional voting circuitry. In practice, it is an expensive way to detect errors at the register level, especially when SEUs are considered. Software-based temporal approaches have lower hardware overheads and can significantly improve reliability [RI08]. For example, in duplex execution all instructions are executed twice to detect transient errors [MB07]. However, this technique tends to induce significant time overheads, making severe time constraints hard to meet in real-time designs. These approaches may provide robust fault tolerance, but they incur high penalties in performance, area, and power [RR08].

Explicit redundancy is suitable for mission-critical applications where hardware cost is not an important constraint. However, with rapid technology scaling, today almost every system needs at least some consideration of FT features [FGAD10]. These systems demand more cost-effective FT solutions that may have less coverage than hardware redundancy, but substantial coverage nonetheless [RR08]. Therefore, research is needed into alternative, unconventional, and cost-effective solutions.
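The duplex-execution idea mentioned above can be illustrated with a minimal sketch (an illustration under assumed names, not a construct from this thesis): each operation is executed twice and the two results are compared; a transient fault that corrupts only one execution shows up as a mismatch. Note that comparison alone only detects the error; correction would require a third run, voting, or a rollback.

```python
import random

def duplex_execute(op, x, fault_rate=0.0):
    """Run op twice; a transient fault may corrupt one run.

    Returns (result, error_detected). A mismatch between the two
    runs signals a transient error; the result is then discarded.
    """
    r1 = op(x)
    r2 = op(x)
    # Model a transient fault as a random single-bit flip in one run.
    if random.random() < fault_rate:
        r2 ^= 1 << random.randrange(8)
    if r1 != r2:
        return None, True          # error detected, result discarded
    return r1, False               # runs agree, result accepted

square = lambda v: (v * v) & 0xFF

assert duplex_execute(square, 9) == (81, False)              # fault-free
assert duplex_execute(square, 9, fault_rate=1.0) == (None, True)
```

This also makes the time overhead discussed above visible: every operation costs at least two executions, which is exactly why duplex execution strains real-time deadlines.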
We propose a new hardware/software co-design methodology to tolerate transient faults in the processor. The methodology relies on two main choices: fast error detection and low-cost error recovery. Error detection must be fast so that errors are caught before they reach the system boundaries and cause catastrophic failures 1 . Consequently, hardware-based concurrent error detection (CED) has been chosen. To limit the overall cost, we may accept a small time penalty in error correction; in this scenario, software-based rollback is employed. It reduces the overall cost compared to hardware-based recovery, while affecting overall performance little, because the proposed methodology targets ground applications, where errors occur far less often than in space.
A hypothetical dependable memory (DM) is attached to the processor. Moreover, to make rollback fast and to simplify memory management, an intermediate data storage is placed between the processor and the DM. Here, architectural choices are important to make the overall methodology successful. For example, a processor core with a minimum of internal states to be checked (for error detection) and loaded and stored (for rollback recovery) makes this technique effective (less expensive and fast). The FT processor has been modeled at the VHDL-RTL level. Finally, the processor's self-checking ability and the performance degradation due to re-execution have been tested by artificial error injection in the simulated model.
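The role of the intermediate storage between processor and DM can be sketched as a toy model (class and method names are invented for illustration; the actual design is a VHDL hardware journal): writes made during a sequence are buffered; if an error is detected before the sequence ends, the buffer is discarded and the sequence can be re-executed from the last validated state; otherwise the buffered data is committed to the DM.

```python
class JournalizedMemory:
    """Toy model of a journal sitting between the processor and the
    dependable memory (DM). Names and structure are illustrative."""

    def __init__(self):
        self.dm = {}        # dependable main memory (validated data only)
        self.journal = {}   # unvalidated writes of the current sequence

    def write(self, addr, value):
        self.journal[addr] = value      # buffered, not yet in DM

    def read(self, addr):
        # Most recent value wins: journal first, then DM (default 0).
        return self.journal.get(addr, self.dm.get(addr, 0))

    def end_sequence(self, error_detected):
        if error_detected:
            self.journal.clear()        # rollback: discard the sequence
            return "rolled back"
        self.dm.update(self.journal)    # commit validated data to DM
        self.journal.clear()
        return "committed"

m = JournalizedMemory()
m.write(0x10, 42)
assert m.read(0x10) == 42 and 0x10 not in m.dm   # visible, not committed
assert m.end_sequence(error_detected=False) == "committed"
m.write(0x10, 99)                                # corrupted sequence
assert m.end_sequence(error_detected=True) == "rolled back"
assert m.read(0x10) == 42                        # DM kept the clean value
```

The point of the design choice is visible here: errors never reach the DM, because only sequences that finished without a detected error are ever committed.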
The contributions of this work are as follows. We propose a new methodology based on hardware/software co-design to strike a compromise between protection and time/area constraints. For fast error detection, hardware-based concurrent detection is employed; for low hardware overheads, software-based micro-rollback recovery is used. To reduce the overall area overheads, we employ a stack processor from the MISC class. This processor has a minimum of internal registers, which results in low-cost error detection while also being suitable for efficient error recovery. Furthermore, to mask errors from entering the DM, an intermediate temporary data storage is introduced between the processor and the DM.
This thesis is partitioned into six chapters.<br />
Chapter 1: It outlines the background and describes the motivation for on-line error detection and fast correction in embedded microprocessors. It presents the basic concepts and terminologies related to dependable embedded processor design, and further explores the attributes, threats, and means to attain dependability. Lastly, the dependability techniques applied at different levels are discussed.

Chapter 2: This chapter presents different redundancy techniques to detect and correct errors. It explores the FT methodologies employed in existing fault-tolerant processors. The last part is dedicated to the validation methodology of a dependable processor.

Chapter 3: This chapter identifies the model specifications and design methodology of the desired architecture. It addresses the overall problem by exploring the design paradigm and the related constraints of the proposed approach. Later, the processor-memory interface is finalized through different functional implementations.
Chapter 4: The proposed FT processor has two parts: a self-checking processor core (SCPC) and a self-checking hardware journal (SCHJ). This chapter develops the design methodology of the self-checking processor core. The processor is chosen from the MISC (minimum instruction set computer) class; we therefore first clarify the reasons for choosing such a specialized processor. Later on, the error detection and recovery mechanisms are finalized. Finally, the hardware model of the self-checking processor core is synthesized for an Altera Stratix III using Quartus II.

1 where the cost of harmful consequences is orders of magnitude, or even incommensurably, higher than the benefit provided by correct service delivery [LRL04]
Chapter 5: This chapter discusses the hardware design and protection scheme of the self-checking hardware journal (SCHJ), a temporary data storage that masks errors from entering the dependable main memory. Finally, the overall hardware model of the FT processor is synthesized for an Altera Stratix III using Quartus II.

Chapter 6: Lastly, the FT model is evaluated in the presence of errors. The evaluation is based on the self-checking ability and on the performance degradation in the presence of errors. The obtained results validate the protection techniques proposed in chapter 3.

Finally, the last section discusses conclusions and perspectives.
I. STATE OF THE ART<br />
Chapter 1<br />
Dependability and Fault Tolerance<br />
It is a complex task to design embedded systems for critical real-time applications. Such systems must not only guarantee to meet the hard real-time deadlines imposed by their physical environment, but also guarantee to do so dependably, despite the occurrence of faults [Pow10]. The need for fault-tolerant (FT) computing has become more and more important in recent years [Che08] and will likely become the norm. In the past, FT was the exclusive domain of very specialized applications such as safety-critical systems. However, modern design trends are making circuits more sensitive, and now all real-time systems should have at least some FT features. FT is therefore an important need of the time.
Modern society hinges on automated industry. In some sensitive industrial sectors, even a single fault can result in a million-dollar loss (e.g., in banking and stock markets) or in loss of life (e.g., in an air traffic control system). Industries such as automotive, avionics, and energy production require availability, performance, and real-time response to avoid catastrophic failures. In table 1.1, the cost per hour of a control-system failure is compared across application domains to show the importance of FT in the industrial sector.
Table 1.1: Cost/hour for failure of control system [Pie07]<br />
Application Domain Cost (Euro/hour)<br />
Cell-phone Operator 40k<br />
Airline Reservation 90k<br />
ATM Machine (Banking) 2.5M<br />
Automobile Assembling Unit 6M<br />
Stock Transaction 6.5M<br />
Most of these systems (table 1.1) rely on embedded systems, and the design of an FT processor is one of the basic requirements for dependable embedded applications. Accordingly, we propose to design a fault-tolerant processor to tolerate the transient faults that result from SEUs. In this introductory chapter, we address the basic concepts and terminologies related to fault-tolerant computing. The chapter is divided into three main parts: the first part discusses the current trends that increase the probability of faults, together with the sources and consequences of those faults; the second part discusses the concepts of dependable computing; and the third part explores the means to attain dependability.
1.1 Problematic

For many years, researchers focused on performance issues, and through their tireless efforts, fueled by deep technology scaling, they have greatly improved overall performance. However, the gains promised by Moore's law are approaching saturation, while dependability is decreasing due to ever-increasing physical faults. Several trends in the search for high performance have increased the need for dependable architecture design; some of them are discussed below:
Smaller Technologies / Design Scales

Although the scaling of transistors and wires has steadily improved processor performance and reduced cost, it has also adversely affected long-term chip lifetime reliability. When a transistor is exposed to high-energy ionizing radiation, electron-hole pairs are created [SSF + 08]. Transistor source and diffusion nodes accumulate charge, which may invert the logic state of the transistor [Muk08]. Device dimensions projected to shrink below 18 nanometers by 2015 significantly threaten next-generation technologies [RYKO11]. Regarding transient faults, smaller devices hold less charge to maintain register states, which makes them more sensitive to noise. As the noise margin decreases, the probability that a high-energy particle strike disturbs the charge on a device increases, and with it the probability of transient faults. The lower voltages used for power-efficiency reasons will further increase the susceptibility of future chips [FGAD10].
More Transistors per Chip

More transistors require more wires to connect them, increasing the chances of faults both during fabrication and during the operation of such devices. Modern processors are more prone to faults because of their greater number of transistors and registers. Moreover, temperature is another factor causing transient and permanent faults: the more devices on a chip, the more power is drawn from the supply. Higher supply power per unit area increases the leakage power dissipated per unit area, which raises the temperature and, with it, the probability of errors.
Complex Design

Today's processors have become far more complicated than those of the past, which increases the probability of design faults and also makes the debugging of errors more difficult. Research effort is oriented towards alternative methods of increasing system performance without increasing the sensitivity of the circuit, but unfortunately the bottleneck has been reached, and the alternative solutions are more complex and make fault debugging an even more difficult task.
In short, devices are becoming more sensitive to ionizing radiation (which may cause soft errors), to operating-point variation through temperature or supply-voltage fluctuations, and to parasitic effects that result in static leakage currents [ITR07]. Changing parameters such as dimensions, noise margin, and supply voltage can no longer be fruitful for increasing performance. In the near future, because of small feature sizes and high frequencies, the failure trend in modern computing systems will increase further, since the saturation level has already been reached. This increase is leading to a rising soft-error rate in logic and memory chips [Bau05], which affects reliability even at sea level [WA08]. To assure circuit integrity, FT must be an important design consideration for modern circuits: a dependable system must embed tolerance mechanisms against possible errors.
1.1.1 Common Source of Faults and their Consequences

Today, one significant threat to the reliability of digital circuits concerns the sensitivity of logic states to various noise sources, especially in certain specific environments, such as space or nuclear systems, where the collision of charged particles can result in transient faults. Such particles include cosmic rays produced by the sun and alpha particles produced by the disintegration of radioactive isotopes.

For space applications, FT is a mandatory requirement due to the severe radiation environment. As manufacturing technology is scaled towards finer geometries, the probability of SEUs is increasing. With present technology, dependability is not only required for some critical applications: even for commodity systems, dependability needs to be above a certain level for the system to be useful for anything [FGAD10]. Radiation-induced soft errors are becoming an increasingly important threat to the reliability of digital circuits, even in ground-level applications [Nic10].
Transient faults can be caused by on-chip perturbations, such as power-supply noise, or by external noise [NX06]. Researchers have identified three common sources of soft errors in semiconductors. First, alpha particles, discovered in the 1970s, proved to be the main source of soft errors in computer systems, especially DRAM [MW07]. Second, high-energy neutrons from cosmic radiation can induce soft errors in semiconductor devices via the secondary ions produced by neutron reactions with silicon nuclei [ZL09], as shown in figure 1.1, where a single high-energy neutron has disturbed the internal charge distribution of the whole device. Third, a soft-error source is induced by low-energy cosmic neutron interactions with the isotope boron-10 in IC materials, specifically in Boro-Phospho-Silicate Glass (BPSG), widely used to form insulator layers in IC manufacturing; this has recently proved to be the dominant source of soft errors in SRAMs fabricated with BPSG [WA08].
Figure 1.1: An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs in its wake, which cause a charge disturbance [MW04]

Figure 1.2 represents the sequence of events that may occur once an energetic particle hits the substrate, provoking ionization. The ionization may generate a set of electron-hole pairs that create a transient current injected into, or extracted from, the struck node. Depending on the amplitude and duration of this current pulse, a transient voltage pulse may appear at the hit node. This is characterized as the
fault. There is a fault latency period that defines the time needed for that fault to become an error in<br />
the circuit. This will only occur if the transient voltage pulse changes the state of a storage element (flip-flop), generating a bit-flip. The bit-flip may generate an error if the content of this flip-flop is used for a certain operation. From the application point of view, however, it is not mandatory that this error manifests as a failure in the system; there is also an error latency that defines the time needed for that error to become a failure.

Figure 1.2: A high-energy particle strike resulting in an error: ionization, transient current, transient voltage pulse, fault effect, error

The common term for any measurable effect
resulting from the deposition of energy from a single ionizing particle strike, is a Single Event Effect<br />
(SEE). The most relevant SEEs are classified in figure 1.3.
Figure 1.3: Classification of faults on the basis of single event effects (SEE) [Pie07]. Soft errors comprise single event transients (SET), single event upsets (SEU), which manifest as single bit upsets (SBU) or multi bit upsets (MBU), and single event functional interrupts (SEFI). Hard errors comprise single event latch-up (SELU) and single event gate-rupture/burnout (SEGR/SEB).
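The fault latency and error latency described above can be condensed into a toy model (function and parameter names are invented for illustration): a particle strike flips a stored bit (fault); the fault becomes an error only if the corrupted bit is actually read, and the error becomes a failure only if it propagates to the delivered service.

```python
def fault_error_failure(is_read, reaches_service):
    """Trace whether a particle-induced bit flip becomes an error,
    and whether that error becomes a failure (illustrative model)."""
    flipflop = 0
    flipflop ^= 1          # fault: an SEU flips the stored bit
    # Fault -> error only if the corrupted content is actually used.
    error = is_read and flipflop != 0
    # Error -> failure only if it propagates to the delivered service.
    failure = error and reaches_service
    return error, failure

# A flipped bit that is never read is a fault, but neither error nor failure.
assert fault_error_failure(is_read=False, reaches_service=True) == (False, False)
# A read corrupted bit is an error; it fails only if the service uses it.
assert fault_error_failure(is_read=True, reaches_service=False) == (True, False)
assert fault_error_failure(is_read=True, reaches_service=True) == (True, True)
```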
Single Event Upset (SEU)

An SEU is mostly a soft error caused by the transient signal induced by a single energetic particle strike [JES06]. In [Bau05], it is said to occur when a radiation event causes a charge disturbance large enough to reverse or flip the data state of a memory cell, register, latch, or flip-flop. The error is called soft because the device is not permanently damaged by the radiation: when new data is written to the struck memory cell, the device will store it correctly [Bau05].

The SEU is a very serious problem because it is one of the major sources of failure in digital systems [Nic10]. It will likely pose serious threats to the future of robust computing [RK09] and requires serious attention. It may manifest itself as a Single Bit Upset (SBU) or a Multiple Bit Upset (MBU).
Single Bit Upset (SBU) and Multiple Bit Upset (MBU)<br />
An SBU is a single radiation event that results in one bit flip, whereas an MBU is a single radiation event that results in more than one bit being flipped. Each bit flip is essentially an SEU, so SBUs and MBUs are considered subsets of the SEU. SBUs usually make up the major fraction and MBUs a small fraction of the total number of observed SEUs. However, the MBU probability is steadily increasing as geometries shrink [BCT08, QGK + 06]. This thesis addresses SBUs; in future work, the methodology will be extended to MBUs.
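The SBU/MBU distinction can be made concrete with a minimal Python sketch of SEU injection (a hypothetical illustration, not the fault-injection framework used in this thesis): an upset is modeled as XORing a data word with a mask, and the upset is classified by the number of flipped bits.

```python
def inject_seu(word: int, mask: int) -> int:
    """Model a single-event upset by flipping the bits set in `mask`."""
    return word ^ mask

def upset_kind(mask: int) -> str:
    """Classify the upset by the number of flipped bits."""
    flipped = bin(mask).count("1")
    if flipped == 1:
        return "SBU"
    return "MBU" if flipped > 1 else "no upset"

data = 0b10101100
sbu_word = inject_seu(data, 0b00000100)  # one bit flipped -> SBU
mbu_word = inject_seu(data, 0b00011000)  # two bits flipped -> MBU
```

Any single-bit mask yields an SBU; masks with several bits set model the multi-bit upsets that become more likely as geometries shrink.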
Single Event Transient (SET)<br />
An SET is a transient pulse in the logic path of an IC. Similar to an SEU, it is induced by the charge deposition of a single ionizing particle. An SET can propagate along the logic path where it was created, and may be latched into a register, latch or flip-flop, causing the stored value to change.
Single Event Functional Interrupt (SEFI)<br />
Xilinx [BCT08] defines an SEFI as an SEE that interferes with the normal operation of a complex digital circuit. As for the previously mentioned SETs, further investigation of SEFI rates is not considered in this thesis.
Single Event Latch-Up (SELU)<br />
A spurious current spike induced by an ionizing particle in a transistor may be amplified by the large positive feedback of the parasitic thyristor structure and cause a virtual short between Vdd and ground, resulting in an SELU [NTN + 09]. SELUs are not addressed in this thesis.
Single Event Gate Rupture (SEGR) and Single Event Burnout (SEB)<br />
Single Event Gate Rupture (SEGR) is a single-ion-induced condition in power MOSFETs that may result in the formation of a conducting path in the gate oxide. Single Event Burnout (SEB) is a condition that can cause device destruction due to a high-current state in a power transistor. Both are permanent faults and are not addressed in this thesis.
1.2 Basic Concepts and Taxonomy of Dependable Computing
This part defines the basic terminology of dependable computing, largely drawn from [LB07, Lap04]. In this section, we identify the important methods, and their characteristics, for making a system tolerant to faults.
1.2.1 Dependability<br />
Dependability is the ability to deliver service that can justifiably be trusted [LRL04]. The definition is focused on trust: in other words, the dependability of a system is its ability to avoid service failures that are more frequent and more severe than is acceptable. Dependability relies on a set of measures, applied in all phases of the product life, to ensure that the functionality will be maintained while accomplishing the mission for which the system has been designed. According to Laprie [LB07], the dependability of a system is the property that justifies placing confidence in the service it delivers.
1.3 Attributes

Figure 1.4: Dependability tree: dependability and security break down into attributes (availability, reliability, safety, confidentiality, integrity, maintainability), threats (faults, errors, failures) and means (fault prevention, fault tolerance, fault removal, fault forecasting).
Dependability is a vast concept based on various attributes as shown in figure 1.4.<br />
• Availability: the readiness for correct service;
• Reliability: the continuity of correct service;
• Safety: the absence of catastrophic consequences for the user(s) and the environment;
• Integrity: the absence of improper system alterations;
• Maintainability: the ability to undergo modifications and repairs.
Moreover, when dealing with security issues, an additional attribute called confidentiality is also considered, as shown in figure 1.4. Confidentiality is the absence of unauthorized disclosure of information. Other attributes related to security are availability and integrity, which have already been discussed among the dependability attributes [VK07].
It is difficult to fully satisfy all of the dependability attributes in a system at the same time, because doing so increases the cost, power consumption and hardware area of the system. One therefore balances these attributes according to the system's needs; it has even been stated in [FGAM10] that it is impossible to design a 100% dependable system. For example, in order to improve the availability of a component, maintenance is sometimes overlooked and safety decreases accordingly. Here, two types of systems are considered:
• a web server
• a nuclear reactor
Let us see which dependability and security attributes are more important for each system. For a university web server, availability is the most important attribute because every student needs to access it regularly, whereas for a nuclear reactor the attributes of availability, reliability, safety and maintainability are all important considerations. [Pie07] sums up the importance of these attributes in table 1.2, where 4 points are given to the most important attributes and 1 point to the least important. As table 1.2 shows, each application has its own dependability and security requirements.
Table 1.2: Dependability attributes for a university web server and a nuclear reactor [Pie07], rated from 4 points (very important) down to 1 point (least important)

Attributes         University Web Server   Nuclear Reactor
Availability               3                      4
Reliability                1                      4
Safety                     1                      4
Confidentiality            2                      1
Integrity                  2                      3
Maintainability            2                      4

1.4 Threats
There are three fundamental threats to a dependable computer: (i) faults, (ii) errors and (iii) failures. A fault is defined as an erroneous state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design [Pie06]. A fault is active when it produces an error; otherwise it is considered a dormant (sleeping) fault. An active fault can be an internal fault that was previously dormant. An error is itself caused by a fault, and a failure occurs when the delivered service deviates from the correct service because of an error. The three have a cause-and-effect relationship between them (as shown in figure 1.5). In general, an active fault causes an error, which can propagate from one place to another inside the system: in figure 1.6, an error produced in the processor is transferred to main memory. Furthermore, if an error reaches the boundaries of the system, it may result in the failure of the system, causing the service provided to deviate from its specification [GMT08] (see figure 1.5). If the initial system is a sub-system of a global system, this failure in turn causes a fault in the global system. In this way the chain of fault, error and failure keeps progressing.
Figure 1.5: Fault, error and failure chain: within a sub-system, a fault is activated into an error, which propagates into a failure; the failure's consequences become a fault in the global system, and the chain continues.

Figure 1.6: Error propagation from the processor to main memory through READ/WRITE operations.
An SEU may result in a system failure, as in figure 1.7: a high-energy neutron strike (caused by cosmic rays) on a VLSI circuit results in an SBU (active fault), which provokes an error in a traffic control system and finally results in the failure of the system.
1.4.1 System Failure<br />
A system delivers correct service when it respects its functionality, whereas a system failure is a deviation of the service delivered by the system from its specification [Pie06]. Such a deviation can take the form of incorrect service, or of no service at all [GMT08]; the transition from incorrect back to correct service is a service restoration (see figure 1.8).
A service failure may occur because the system no longer respects its functionality, or because the functional specifications were not correctly defined for that system under certain conditions. FT techniques, on the other hand, allow a system to continuously deliver its service according to its correct functionality even in the presence of faults.
Figure 1.7: A single fault (a signal stuck at 1, always A = 1) causing the failure of a traffic control system through the fault → error → failure chain.
Figure 1.8: Service failure and service restoration: a service failure is the transition from correct to incorrect service (a wrong signal instead of the correct one); service restoration is the transition back.

1.4.2 Characteristics of a Fault
Faults can be characterized by five attributes: cause, nature, duration, extent and value. Figure 1.9 illustrates each of these basic characteristics; they are discussed in the following sections.
Figure 1.9: Fault characteristics: cause (specification mistakes, implementation, external disturbances, component defects), nature (software — HDL, programming — or hardware — logical, electronic CMOS, digital, analog), duration (transient, intermittent, permanent), extent (local, global) and value (determinate, indeterminate).

Cause

A fault can be caused by four salient problems:
1. Specification Mistakes: These include incorrect algorithms, architectures, or incorrect design specifications, as in row 1 of figure 1.10, where a fault is caused by a wrong interconnection between the two systems.
2. Implementation Mistakes: The implementation can introduce faults through poor design, poor component selection, poor construction, or hardware/software coding mistakes, as in rows 2 and 3 of figure 1.10. Row 2 shows a programming fault: c should be incremented only when a is less than b, but with the condition a <= b it is also incremented when a equals b. Similarly, row 3 of figure 1.10 shows an HDL fault in which r1 loads the result of the addition a + b into the register c.
3. Component Defects: These include random device defects, manufacturing imperfections, and component wear-out, whether in a logical component or in electronic CMOS, as shown in rows 4 and 5 of figure 1.10.
4. External Disturbances: These include operator mistakes, radiation, electromagnetic interference, and environmental extremes, as in row 6 of figure 1.10. Moreover, due to shrinking noise margins, a '1' can be read as a '0' if its voltage is lower than the threshold Vm (as shown in row 7 of figure 1.10).
24 CHAPTER 1. DEPENDABILITY AND FAULT TOLERANCE<br />
Sr.<br />
No.<br />
1<br />
2<br />
3<br />
4<br />
5<br />
6<br />
7<br />
Nature<br />
Fault at different level<br />
Specification Mistakes<br />
Programming fault<br />
HDL<br />
Component defect 1<br />
Component defect 2<br />
External Disturbance<br />
Fault due to lower noise<br />
Correct state<br />
A B C<br />
if a < = b<br />
c : c + 1;<br />
end;<br />
r1 : c
• Permanent: A permanent fault, once it has occurred, persists until the end of the execution. Even a single permanent fault can create multiple errors until it is repaired; such errors are called hard errors.
• Intermittent: An intermittent fault appears, disappears, and reappears repeatedly within a very short period. It can occur repeatedly, but not continuously, over a long time in a device.
The errors in modern computers may result from permanent, intermittent and transient faults. However, transient faults occur considerably more often than permanent ones, and are much harder to detect [RS09]. The ratio of transient to permanent faults can vary between 2:1, 100:1 or higher, and is continuously increasing [Kop04].

Extent

The fault extent specifies whether the fault is localized to a given hardware or software module, or whether it globally affects the hardware, the software, or both.
Value<br />
The fault value can be either determinate or indeterminate. A determinate fault is one whose status remains unchanged over time unless there is an external action upon it, whereas an indeterminate fault is one whose status at some time t may differ from its status at another time.
1.5 Means<br />
There are four means to attain dependability: fault prevention, fault tolerance, fault removal and<br />
fault forecasting. Fault prevention and fault tolerance aim to provide the ability to deliver a service<br />
that can be trusted, while fault removal and fault forecasting aim to reach confidence in that ability by<br />
justifying that the functional and the dependability and security specifications are adequate and that<br />
the system is likely to meet them [LRL04, LB07].<br />
1.5.1 Fault Prevention<br />
Fault prevention is the ability to avoid the occurrence or introduction of faults. It includes any technique that attempts to prevent faults from occurring, such as design reviews, component screening, testing and other quality control methods.
1.5.2 Fault Removal<br />
Fault removal is the ability to reduce the number and severity of faults. It can be conducted during corrective or preventive maintenance processes: corrective maintenance aims to remove faults that have already produced errors and starts after error detection, while preventive maintenance aims to remove faults before they can cause errors [LB07].
1.5.3 Fault Forecasting<br />
Fault forecasting is the ability to estimate the present number, the future incidence, and the likely consequences of faults. It is conducted by evaluating the system behavior with respect to fault occurrence or activation, and has both a qualitative and a quantitative aspect. The main approaches to probabilistic fault forecasting, aimed at deriving probabilistic estimates, are modeling and testing [LB07].
1.5.4 Fault Tolerance<br />
Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults [ALR01]. Ideally, an FT system is capable of executing its tasks correctly regardless of faults; in practice, however, no one can guarantee the flawless execution of tasks under all circumstances, so real FT systems are designed to tolerate the faults most likely to occur. FT is the means addressed in this work. It rests on three pillars: fault masking, error detection, and error correction/recovery.
Fault Masking<br />
Fault masking hides the effects of failures by ensuring that redundant information outweighs the incorrect information [Pie06]. It is a structural redundancy technique that completely masks faults within the system's redundant modules: a number of identical modules execute the same functions, and their outputs are voted on to remove the errors created by a faulty module. Triple Modular Redundancy (TMR) is a commonly used fault-masking technique.
Through fault masking, we achieve dependability by hiding the faults that occur, preventing their effects from spreading through the system. It can tolerate both software and hardware faults, as shown in figure 1.11, and such a system does not need error detection and correction to maintain its dependability. Fault masking has not been directly employed in this thesis; however, TMR will be used for comparison in later chapters.
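The voting principle behind TMR can be sketched in a few lines of Python (an illustrative software model only; real TMR is a hardware structure): a bitwise majority voter masks a fault in any single module.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority voter: each output bit takes the value held by
    at least two of the three module outputs, masking any single fault."""
    return (a & b) | (b & c) | (a & c)

# One replica suffers a bit-flip; the voter still yields the correct value.
correct = 0b1011
faulty = correct ^ 0b0100  # SEU in one of the three modules
masked = majority_vote(correct, faulty, correct)
```

Note that the voter masks the error without detecting or reporting it, which is why a pure TMR system needs no separate error detection stage.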
Error Detection<br />
If fault masking is not employed, then error detection may be employed in an FT system. Error detection is the building block of an FT system, because a system cannot tolerate an error it does not know about. Error detection mechanisms form the basis of an error-resilient system, as any fault occurring during operation needs to be detected before the system can take corrective action to tolerate it [LBS + 11]. Even if a system cannot recover from a detected error, it can at least halt the process or inform the user that an error has been detected and that the results are no longer reliable.
Error Correction/Recovery<br />
Detecting an error is sufficient for providing safety, but we would also like the system to recover from faulty states. Recovery hides the effects of the error from the user: after recovery, the system can resume operation and ideally remain live. Error recovery is an important feature with respect to the reliability and availability attributes, because both metrics require the system to recover from its errors without user intervention.
Error detection and recovery are addressed in this thesis; they will be discussed in detail in chapter 2, where various techniques of error detection (section 2.1) and correction (section 2.2) are presented.
1.6 Techniques Applied at Different Levels<br />
Figure 1.11 illustrates the dependability techniques applied at different levels in hardware and software systems, in which fault avoidance (fault prevention) is the primary method for improving system dependability. It may be realized through hardware or software measures: in a hardware-based system, fault avoidance means preventing specification and implementation faults, component defects and external disturbances, while in a software-based system it requires preventing specification and implementation faults. Fault masking, on the other hand, ensures dependability by masking the faults themselves; TMR is a well-known example of this technique. If fault masking is not applied, then FT is a practical choice for overcoming errors.
1.6.1 FT Techniques<br />
Fault-tolerance techniques for integrated circuits can be applied at different moments in the circuit design flow. They can be applied in the electrical design phase, for example through transistor sizing, transistor redundancy, or the addition of electrical sensors. Other techniques can be added at the logic design step, such as hardware and time redundancy in the logic blocks and in the software application. Figure 1.12 is a further extension of the previously discussed figure 1.2: it represents the different phases at which faults can be tolerated (detected and corrected), and in each phase a different fault-tolerance technique can be used. We address fault tolerance at the hardware-redundancy and self-checking levels, the two higher levels (shown as 'c' and 'd' in figure 1.12).
1.7 Conclusions<br />
The goal of this chapter was to introduce the concepts of dependability in embedded systems. In<br />
fulfilling this objective, we have introduced the main issues related to the design and analysis of fault<br />
tolerant systems. Here, we have discussed the different types of faults and their characteristics, because our final objective is the design of a fault-tolerant computing system against single event effects such as SEUs (Single Event Upsets).

Figure 1.11: Dependability techniques: fault avoidance (against specification mistakes, implementation faults, external disturbances and component defects), fault masking, and fault tolerance (error detection and recovery) applied to hardware and software faults, errors and system failures.
In addition, this chapter has addressed dependability issues arising from non-permanent disturbances. Our goal is to propose a new design methodology for dependable processor architectures; consequently, chapter 2 will discuss some existing methodologies for detecting and correcting errors.
Figure 1.12: Sequence of events from ionization to failure (ionization → transient current → transient voltage pulse → fault effect → flip-flop error → failure, with a fault latency and an error latency in between) and the fault-tolerance techniques applied at different times [Pie07]: (a) sensors (detectors), (b) time redundancy (detection and mitigation), (c) hardware redundancy and error-correcting codes (detection and mitigation), (d) self-checking mechanism with recovery (detection and mitigation), (e) redundancy/spare components. Fault tolerance at levels c and d is addressed in this thesis.
Chapter 2<br />
Methods to Design and Evaluate FT<br />
Processors<br />
The goal of FT techniques is to limit the effects of a fault, i.e. to increase the probability that an error is tolerated by the system. A common feature of all FT techniques is the use of redundancy. Redundancy is simply the addition of hardware resources, information, or time beyond what is needed for normal system operation [Poe05]. It can be hardware redundancy (some hardware modules are replicated), time redundancy (parts of a program are executed multiple times), information redundancy (the circuit or program carries redundant information), or a mixture of these three solutions.
Traditional solutions involving extensive redundancy are too expensive in area, power, and performance [BBV + 05], while cheaper approaches do not provide the necessary fault detection and correction abilities. Fault-tolerant embedded systems have to be optimized in order to meet time and area constraints [PIEP09]. Therefore, special attention is required when choosing redundancy techniques for critical applications.
Accordingly, this chapter presents a comparison of existing FT techniques in terms of their error detection and correction abilities, time delays and hardware overheads. From these comparisons, we will identify the techniques that can effectively fulfill our design objectives. The later part of the chapter explores the redundancy techniques employed in different FT processors, and the last section addresses the evaluation methods used to check the effectiveness of FT methodologies in a processor.
2.1 Error Detection<br />
Error detection originates an error signal or message within the system; it has been previously discussed in section 1.5.4. It can be based on preemptive detection or concurrent checking. Preemptive detection is mostly an offline technique: it takes place while normal service delivery is suspended and checks the system for latent errors and dormant faults. Concurrent detection, by contrast, is an online technique that takes place during normal service delivery [ALR01]. Similarly, Bickham defines concurrent error detection (CED) as the process of detecting and reporting errors while, at the same time, performing the normal operations of the system [Bic10].
CED techniques are widely used to enhance system dependability [HCTS10, CTS + 10, WL10]. The basic principle of CED techniques is summed up in [MM00]: a system is considered which realizes a function (f) and produces an output in response to an input sequence. A CED scheme generally contains another unit which independently predicts some special characteristic of the system output for every input sequence; finally, a checker unit compares the two outputs to produce an error signal. The architecture of a general CED scheme is shown in figure 2.1.
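As an illustration of this predictor/checker structure, the following Python sketch uses parity as the predicted output characteristic; the protected function (a mod-16 increment) and the predictor are hypothetical choices for the example, not taken from [MM00].

```python
def parity(x: int) -> int:
    """Output characteristic used by this example: parity of the word."""
    return bin(x).count("1") & 1

def function_f(x: int) -> int:
    """The protected function (hypothetical): increment modulo 16."""
    return (x + 1) % 16

def predict_parity(x: int) -> int:
    """Independent predictor of the output characteristic.
    (A real predictor derives it more cheaply than recomputing f.)"""
    return parity((x + 1) % 16)

def ced_step(x: int, fault_mask: int = 0):
    """One CED cycle: compute f(x) (optionally hit by an injected fault),
    then let the checker compare actual vs. predicted parity."""
    out = function_f(x) ^ fault_mask
    error = parity(out) != predict_parity(x)
    return out, error
```

Any fault that flips an odd number of output bits changes the parity and raises the error signal; even-bit faults escape this particular characteristic, which is why the choice of predicted characteristic matters.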
Figure 2.1: General architecture of a concurrent error detection scheme [MM00]: the input feeds both the function (f) and an output-characteristics predictor; a checker compares the output of f with the predicted output characteristics and raises an error signal on mismatch.
Several CED-based redundancy techniques have been proposed and used commercially for designing reliable computing systems [HCTS10, SG10]. They are classified into three classes: hardware redundancy, time redundancy, and information redundancy; an FT system uses one or more of them. These techniques mainly differ in their error-detection capabilities and in the constraints they impose on the system design. In the next sections, we explore the commonly used error detection techniques.
2.1.1 Hardware Redundancy<br />
Hardware redundancy is the most commonly used approach [Bic10]. It refers to the addition of extra hardware resources, such as doubling the system and using a comparator at the output to detect errors. Here, consideration is given to the structure of the circuit and not to its functionality. It is equally effective against transient, timing and permanent faults; however, its area and power requirements are considerable. It can be classified into two sub-types: (i) duplication with comparison and (ii) duplication with complement redundancy.
Duplication with comparison (DWC), also called dual modular redundancy (DMR) [JHW + 08], is a simple and easy-to-implement error detection technique (see figure 2.2). It has good error detection capability: theoretically, it can detect 100% of all possible errors by running all operations on two copies of a component and comparing the results [MS07]. However, it cannot detect design bugs, errors in the comparator, or combinations of simultaneous errors in both modules. Replication can be performed at different granularities (units vs. cores), but always comes at a considerable hardware cost (more than 200%). A classic example of DMR is the IBM S/390 mainframe processor [SG10], where the I-unit (fetch and decode units) and E-unit (execution unit) are duplicated and their signals compared for transient fault detection.
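A software model of DWC can be sketched as follows (illustrative only; the module functions are hypothetical stand-ins for two hardware copies of the same unit):

```python
def dwc(module_a, module_b, x):
    """Run the same operation on two module copies and compare the
    results; a mismatch raises the error signal."""
    out_a, out_b = module_a(x), module_b(x)
    return out_a, out_a != out_b  # (output, error signal)

def square(x):
    return x * x

def faulty_square(x):
    """Second copy with a modeled bit-flip fault on bit 1 of the output."""
    return (x * x) ^ 0b10
```

As the text notes, this detects any error that makes the two outputs differ, but not an identical error in both copies, nor a fault in the comparator itself.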
Figure 2.2: Duplication with comparison (DWC): the input drives two copies of F; a comparator checks their outputs and raises an error signal on mismatch.
There is another, complementary technique called duplication with complement redundancy (DWCR) [Jab09]. It is similar to DWC, but in the second module the input signals, output control signals and internal data signals are of opposite polarity, so that a common-mode disturbance does not produce the same error in both modules and thereby cause a system failure.
Here as well, the area and power overhead is more than 200%, and this method increases design complexity compared to simple duplication. The technique is used in dual-checker rail (DCR) designs, where the two outputs are complementary in the absence of errors; such schemes are sometimes employed in controllers.
2.1.2 Temporal/Time Redundancy<br />
This type of redundancy requires a single unit to perform an operation twice, one computation after the other; a difference between the two subsequent computations indicates the presence of an error [AFK05]. In this approach there is a penalty in terms of extra time, but the area penalty is smaller than for DMR: the only additional hardware required is the comparator and some temporary storage. It is a replication technique in time, with no consideration given to the functionality of the circuit.
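The twice-in-time scheme of figure 2.3 can be sketched as a single unit computing twice (a simplified Python model; the `transient_mask` parameter is a hypothetical way of injecting a transient fault into the first computation):

```python
def time_redundant(f, x, transient_mask=0):
    """Compute f(x) at time t (possibly hit by a transient), buffer the
    result, recompute at t + delta_t on the same unit, and compare."""
    result_t = f(x) ^ transient_mask  # first computation, maybe upset
    buffered = result_t               # stored in the buffer
    result_t2 = f(x)                  # recomputation on the same unit
    return result_t2, buffered != result_t2  # (output, error signal)
```

A transient affecting only one of the two computations is detected, whereas a permanent fault corrupts both computations identically and escapes, which motivates the encoded variant of figure 2.4.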
In this scheme, intermittent and transient faults are detected (as shown in figure 2.3) but permanent faults are not. For permanent fault detection, the circuit is modified as shown in figure 2.4: the computation on the input data is first performed at time t, and the result of this computation is stored in a buffer. The same data is then used to repeat the computation on the same functional block at time t + δt; this time, however, the input data is first encoded in some manner. The result of the second computation is decoded and compared to the result produced before, and any discrepancy reveals a permanent fault in the functional block.

Figure 2.3: Time redundancy for temporary and intermittent fault detection: F processes the same input at times t and t + δt; the first result is buffered and a comparator raises an error signal on mismatch.
Figure 2.4: Time redundancy for permanent error detection: the input is encoded before the computation at t + δt; the decoded result is compared with the buffered result of the plain computation at t, and a mismatch raises the error signal.
An alternative approach is redundant execution with shifted operands (RESO) [PF06], where some instructions are executed redundantly with shifted operands on the same functional units. Shifting the result back by the same amount yields the original result computed with unshifted operands. Re-executing instructions detects transient faults, whereas re-executing with shifted operands also detects permanent faults. The scheme works when the function possesses the required properties, such as linearity.
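For a linear operation such as addition, RESO can be sketched as follows (a simplified model using Python's unbounded integers; a real ALU must also manage the bits shifted out of range, and `stuck_adder` is a hypothetical model of a permanent fault, not a circuit from [PF06]):

```python
def reso(add, a, b, shift=1):
    """Redundant execution with shifted operands: recompute the addition
    with both operands shifted left, shift the result back, and compare.
    For a linear operation, (a << k) + (b << k) == (a + b) << k."""
    plain = add(a, b)
    shifted = add(a << shift, b << shift) >> shift
    return plain, plain != shifted  # (result, error signal)

def good_adder(x, y):
    return x + y

def stuck_adder(x, y):
    """Adder whose output bit 0 is permanently stuck at 1."""
    return (x + y) | 1
```

Because the stuck bit lands in different bit positions of the plain and shifted results, the comparison exposes the permanent fault that plain re-execution would miss.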
Time redundancy directly affects system performance, although its hardware cost is generally less than that of hardware redundancy; temporal-redundancy-based systems are therefore comparatively slower. To overcome this issue, many systems use pipelining to hide the latency from the client. Temporal redundancy brings no energy benefit either: it uses twice as much active power as a non-redundant unit.
2.1.3 Information Redundancy<br />
The basic idea behind information redundancy is to add redundant information to the data being transmitted, stored or processed, in order to determine whether errors have been introduced [IK03]. Data is protected through mathematical encoding and can be reused after decoding (as in figure 2.5). The encoding and decoding circuitry adds delays that make this approach slower than DMR, but the area overhead is much lower. In coding, consideration is given to the information stored, and possibly to the functionality of the circuit, but not to its structure. Typically, information redundancy is used to protect storage elements (memories, caches, register files, etc.) [HCTS10], e.g. in the POWER6 and POWER7 processors [KMSK09]. Codes are classified by their detection and correction abilities, code efficiency and complexity. In this section, we discuss only error detection codes.
Figure 2.5: Information redundancy principle (encode the data by adding redundancy; transmit or store it in the presence of noise; decode the data by checking the redundancy)
Error detecting codes (EDC) have less hardware overhead than error correcting codes. There are different EDCs, e.g. parity, Borden, Berger and Bose codes. We will not go into much detail, but their salient features will be compared.
Parity coding is the simplest strategy and has the lowest hardware overhead [ARM + 11]. It is based on computing even or odd parity over a data word of length N. The parity can be calculated with an XOR operation over the data bits. A parity code has a distance of 2 and can detect all odd-bit errors.
Figure 2.6: Parity coder in data storage
Before storing data in the register, the parity generator computes the required parity bit (as shown in figure 2.6). Then both the computed parity and the original data are stored in the register.
36 CHAPTER 2. METHODS TO DESIGN AND EVALUATE FT PROCESSORS<br />
When data is retrieved, a parity checker computes the parity of the stored data bits. The parity checker compares the computed parity with the stored parity, and an error signal is set accordingly. Similarly, parity coding can also be used to protect logic functions (see figure 2.7). It is commonly used in computers to check for errors in buses, memory, and registers [IK03].
Figure 2.7: Functional Parity
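The store-and-check flow of figure 2.6 can be sketched as follows (an illustrative model; the `store`/`check` names are assumptions, not a real register-file API):

```python
def parity(word, width=8):
    """Even parity: XOR of all data bits."""
    p = 0
    for i in range(width):
        p ^= (word >> i) & 1
    return p

def store(word):
    """Write path of figure 2.6: keep the data together with its parity bit."""
    return word, parity(word)

def check(word, stored_parity):
    """Read path: recompute parity; a mismatch raises the error signal."""
    return parity(word) != stored_parity

data, p = store(0b1011_0010)
clean = check(data, p)                  # False: clean read-back
one_flip = check(data ^ 0b100, p)       # True: any odd number of flips is caught
two_flips = check(data ^ 0b110, p)      # False: an even number escapes (distance 2)
```

The last case illustrates the distance-2 limitation stated above: only odd-bit errors are detectable.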
Cyclic redundancy checks (CRCs) are another class of EDC, commonly employed to detect errors in digital systems [IK03]. Cyclic codes are parity check codes with the additional property that the cyclic shift of a codeword is also a codeword. If
(Cn−1, Cn−2, . . . , C1, C0) is a codeword, then
(Cn−2, Cn−3, . . . , C0, Cn−1) is also a codeword.
The idea is to append a checksum to the end of the data frame in such a way that the polynomial represented by the resulting frame is divisible by the generator polynomial G(x) that the sender and receiver have agreed upon. When the receiver gets the checksummed frame, it divides it by G(x); if the remainder is not zero, there has been a transmission error. The best generator polynomials are therefore those least likely to divide evenly into a frame that contains errors. CRCs are distinguished by the generator polynomials they use. A CRC cannot directly indicate the erroneous bit positions during decoding; hence it is limited to error detection.
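The divide-and-check procedure can be illustrated at the bit level as follows. The generator G(x) = x^3 + x + 1 is an arbitrary small example chosen for readability, not a standard CRC polynomial:

```python
def poly_mod(bits, g):
    """Remainder of the bit polynomial `bits` divided by `g` over GF(2)."""
    bits = list(bits)
    for i in range(len(bits) - len(g) + 1):
        if bits[i]:                      # long division: XOR g under a leading 1
            for j, gb in enumerate(g):
                bits[i + j] ^= gb
    return bits[len(bits) - len(g) + 1:]

def crc_encode(frame, g):
    """Append a checksum so the whole codeword is divisible by G(x)."""
    checksum = poly_mod(frame + [0] * (len(g) - 1), g)
    return frame + checksum

def crc_ok(codeword, g):
    """Receiver check: a zero remainder means no detected transmission error."""
    return not any(poly_mod(codeword, g))

G = [1, 0, 1, 1]                 # G(x) = x^3 + x + 1 (example generator)
cw = crc_encode([1, 1, 0, 1], G)
ok_before = crc_ok(cw, G)        # codeword divides evenly: no error
cw[2] ^= 1                       # single-bit transmission error
ok_after = crc_ok(cw, G)         # nonzero remainder: error detected
```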
The Borden codes are another class of codes that can detect unidirectional errors (errors that cause either 0 → 1 or 1 → 0 transitions, but not both). They are the optimal codes for unidirectional error detection. The Berger EDC is capable of detecting all unidirectional errors. It is formed by appending check bits to the data word; the check bits constitute the binary representation of the number of 0s in the data word. For example, a 3-bit data word needs 2 check bits. The Berger code is simpler to handle than the Borden codes. The Bose code is more efficient than the Berger code: it provides the same error detecting capability, but with fewer check bits. In short, code efficiency increases with code complexity, and choosing the right code depends on the application needs.
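The Berger construction can be sketched in a few lines (a minimal illustration of the check-symbol rule stated above; the helper names are assumptions):

```python
from math import ceil, log2

def berger_check(word, width):
    """Berger check symbol: the binary count of 0s in the data word."""
    return width - bin(word).count("1")

def check_width(width):
    """Check bits needed to represent every possible 0-count (0..width)."""
    return ceil(log2(width + 1))

bits_needed = check_width(3)        # 3-bit data word -> 2 check bits
stored = berger_check(0b101, 3)     # one 0 in the word -> check symbol 1
# A unidirectional error (here only 1 -> 0 flips) always changes the
# 0-count monotonically, so the recomputed check symbol cannot match:
miss_1 = berger_check(0b100, 3)
miss_2 = berger_check(0b001, 3)
```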
In arithmetic processing circuits (such as an ALU) the previously discussed codes are incapable of detecting errors, because when two data symbols are subjected to an arithmetic operation, the result is a new data symbol that cannot be uniquely expressed as a combination of the inputs [FP02, Nic02]. Arithmetic codes, in contrast, are useful for checking arithmetic operations, where parity would not be preserved [FP02, IK03]. The information part of an operand is processed through a typical arithmetic operator, while a check symbol is concurrently generated (based on the information bits) [Bic10]. They have two classical implementations: AN and residue codes.
AN codes are the simplest form of arithmetic codes [Muk08]. They are formed by multiplying each data word N by a constant A. The following equation illustrates the property that defines an AN code:
A (N1 + N2) = A (N1) + A (N2) (2.1)
AN codes are preserved only under arithmetic operations; they are not valid for logical and shift operations. They are not commonly employed due to their high hardware and timing penalty.
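Equation (2.1) can be exercised directly. The choice A = 3 below is an assumed example (3N codes are a classic instance); with A = 3, any single-bit flip changes the value by ±2^i, which is never a multiple of 3, so it breaks divisibility and is detected:

```python
A = 3   # example multiplier; real designs pick A so that checking is cheap

def an_encode(n):
    """Encode a data word by multiplying it by the constant A."""
    return A * n

def an_valid(code):
    """Every valid AN codeword is a multiple of A."""
    return code % A == 0

x, y = an_encode(14), an_encode(23)
s = x + y                       # equation (2.1): arithmetic acts on codewords
sum_valid = an_valid(s)         # True, and s // A recovers 14 + 23
fault_valid = an_valid(s ^ (1 << 4))   # a single-bit fault: no longer valid
```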
Figure 2.8: Residue codes adder [FFMR09].<br />
Residue codes are another type of arithmetic code, in which the information used for checking is called the residue. The residue r of an operand A is the remainder of A divided by the modulo base m [Bic10]. Two computations occur simultaneously (see figure 2.8). In the first computation, two operands, A and B, undergo an arithmetic operation in the ALU, and a residue generator then produces a residue code from the ALU result. In the second computation, each operand concurrently enters a residue generator, and the residues undergo the same ALU operation as in the first computation (addition in this case) [FFMR09]. Finally, the two residues are compared to detect errors.
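The two concurrent paths of figure 2.8 can be sketched as follows (a behavioural model only; the mod-3 base and the fault model are assumptions for illustration):

```python
M = 3   # modulo base; mod-3 needs only a 2-bit residue channel

def residue_checked_add(a, b):
    """Sketch of figure 2.8: the ALU adds the operands while a concurrent
    path adds their residues; the residue of the ALU result must match."""
    alu_result = a + b                       # first (main) computation
    residue_path = (a % M + b % M) % M       # second, concurrent computation
    error = (alu_result % M) != residue_path
    return alu_result, error

good, err = residue_checked_add(25, 17)      # (42, False): residues agree
# Model a fault corrupting the ALU output: the residue comparison fails.
faulty = (25 + 17) ^ 0b100
detected = (faulty % M) != ((25 % M + 17 % M) % M)
```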
2.2 Error Correction<br />
Error correction has previously been discussed in section 1.5.4. As with error detection, correction techniques are classified into three subclasses: hardware, information and temporal redundancy.
2.2.1 Hardware Redundancy<br />
Adding a third module and replacing the comparator with a voter in DMR leads to Triple Modular Redundancy (TMR), as shown in figure 2.9. In addition to detecting errors, TMR can also correct them. A more general approach, N-Modular Redundancy (NMR), is discussed in [KKB07]. In these techniques the effects of faults are masked: all the components work simultaneously and their outputs are fed into a voter. The output of the voter will be correct if at least two of the components are non-faulty. Static redundancy techniques are simple, but they have high area and power overheads.
Figure 2.9: Triple modular redundancy (TMR) (three copies of F feed a voter whose output masks the fault)
TMR has been a prominent FT solution in aircraft [Yeh02] and space shuttles, where not only processors but entire systems are replicated for robustness. TMR can also be implemented at the software level: an approach proposed in [SMR + 07] uses a software implementation of TMR in which operating system processes are triplicated and run on multiple available cores. Input replication and output comparison are done by a system call emulation unit.
TMR can be employed to address single-bit data errors (SET, persistent, non-persistent) occurring in a cell [Car01]. In TMR the single point of failure is the voter: if a fault occurs in the voter, the whole system fails. However, the voter is typically small and hence often assumed to be reliable. There is a significant area and power penalty (approximately a factor of 3.0 − 3.5) associated with TMR as compared to a non-redundant design [JHW + 08].
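The masking behaviour of the TMR voter can be demonstrated with a bitwise 2-of-3 majority (a behavioural sketch; the XOR fault masks model bit flips in a module's output):

```python
def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority: the value present on at least two inputs wins."""
    return (a & b) | (a & c) | (b & c)

def tmr(module, x, fault_masks=(0, 0, 0)):
    """Run three copies of `module`; each mask models flips in one copy."""
    outs = [module(x) ^ f for f in fault_masks]
    return tmr_vote(*outs)

double = lambda v: 2 * v
fault_free = tmr(double, 21)                     # 42
one_faulty = tmr(double, 21, (0b1000, 0, 0))     # still 42: fault masked
two_faulty = tmr(double, 21, (0b1000, 0b1000, 0))  # wrong: TMR is defeated
```

The last line shows the stated limit: the vote is only correct while at least two modules are non-faulty.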
2.2.2 Temporal Redundancy<br />
For error correction with temporal redundancy, a computation is repeated on the same hardware at three different time intervals and the results are finally voted [MMPW07]. This requires three times more clock cycles to execute the same task. It can only correct errors due to transient faults, provided that the fault duration is shorter than the computation time. Since it needs additional time to repeat the computations, it can only be employed in systems with low or no timing constraints. However, it has low area overheads compared to TMR.
2.2.3 Information Redundancy<br />
Error correcting codes (ECC) can provide cheaper solutions than other well-known redundancy techniques like TMR [CPB + 06]. They are commonly used to protect memory (see figure 2.10). The overhead of a code depends on (i) the additional bits required to protect the information and (ii) the additional hardware/latency for encoding and decoding. The encoding/decoding latency can, however, be reduced if it is executed in parallel.
Figure 2.10: Error detecting and correcting memory block
Among the different ECC codes, those commonly employed in digital circuits include Hamming codes, Hsiao codes and Reed-Solomon codes. These codes can correct errors in addition to detecting them. There are two key parameters of error correcting codes: (i) the number of erroneous bits that can be detected and (ii) the number of erroneous bits that can be corrected. A code's error detection/correction properties are based on its ability to partition the set of 2^n n-bit words into a code space of 2^m code words and a non-code space of 2^n − 2^m words [FP02]. The simplest block codes are Hamming codes; they are single-error-correcting, double-error-detecting (SEC-DED) codes [LBS + 11], but not both simultaneously. They are the earliest linear ECC codes. They are quite useful in cases where only a single error has significant probability, but they carry the hazard of miscorrecting double errors.
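The single-error-correcting behaviour can be illustrated with the classic (7,4) Hamming arrangement (a sketch assuming the usual convention of parity bits at the power-of-two positions 1, 2 and 4):

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]      # d[0] = least significant bit
    c = [0] * 8                                  # index 0 unused
    c[3], c[5], c[6], c[7] = d[0], d[1], d[2], d[3]
    c[1] = c[3] ^ c[5] ^ c[7]                    # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]                    # ... with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]                    # ... with bit 2 set
    return c[1:]

def hamming74_correct(code):
    """Recompute parities; the syndrome names the bad position (0 = clean)."""
    c = [0] + list(code)
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7])
         | ((c[2] ^ c[3] ^ c[6] ^ c[7]) << 1)
         | ((c[4] ^ c[5] ^ c[6] ^ c[7]) << 2))
    if s:
        c[s] ^= 1                                # single-error correction
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, s

cw = hamming74_encode(0b1011)
cw[4] ^= 1                                       # flip position 5 (a data bit)
recovered, position = hamming74_correct(cw)      # (0b1011, 5)
```

A double error would produce a nonzero syndrome pointing at the wrong position, which is exactly the miscorrection hazard noted above.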
The Hsiao codes (also called advanced Hamming codes) are other commonly used codes for protecting/correcting errors in memory [Mon07]. They have faster encoding and error detection than Hamming codes [Hsi10].
More powerful codes may be constructed by using appropriate generator polynomials. Among them, Reed-Solomon codes are cyclic codes that require complex encoding and decoding circuitry and are especially well suited to applications where errors occur in bursts; that is why they are mostly employed in channel coding. On the other hand, convolutional coding schemes are useful in data storage and transmission systems, such as memories and data networks [FP02].
2.3 Error Recovery<br />
Recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and without faults that can be activated again [ALR01]. Error recovery can only be initiated upon detection of a fault/error; therefore, the system should have a built-in self-checking mechanism. Modern microprocessors have various built-in error detection capabilities, such as error detection in memory, caches and registers, illegal op-code detection, and so on [MBS07]. Recovery can operate at a higher level, based on error handling (eliminating errors from the system state), or at a lower level, based on fault handling (preventing faults from being activated again).
Recovery hides the consequences of faults from the user. It is more suitable for transient and intermittent faults; for permanent faults, recovery alone is generally not sufficient. One mandatory feature is fault handling (see figure 2.11), which eliminates faults from the system state [LB07] and prevents them from being activated again. This requires further features such as diagnosis, which reveals and localizes the cause(s) of error(s) [LRL04]. Diagnosis can also enable more effective error handling: if the cause of the error is localized, the recovery procedure can act on the associated components without affecting the other parts of the system and the related system functionality. In this work, we address soft errors caused by transient faults; therefore, fault-handling techniques will not be explored further.
There are two sub-types of error recovery: forward error recovery (FER) and backward error recovery (BER). In FER, the system does not restore its states but continues to make forward progress; a compensator overcomes the faults (as shown in figure 2.11 (FER)). For example, in TMR the voter masks (compensates) the fault, and in ECC the error correcting circuitry corrects the (corrigible) error.
BER involves restoring the system to a previously known safe state. In other words, the state transformation consists of returning the system to a saved state that existed prior to error detection [ALR01]. For successful BER, the system must know: (i) which states are to be saved, and where, for the recovery point; (ii) which algorithm to use; and (iii) what the system does after recovery.
There are two known algorithms for saving the BER recovery states: checkpointing and logging. The choice depends on the micro-architecture of the core and the recovery requirements, because the two have different costs for different types of state, and many BER systems use a hybrid of both; a system presented in [SMHW02] uses hybrid BER. A practical criterion of choice is that if there are few registers and recoveries are not frequent, checkpointing is preferred; if there are many registers and recovery is frequent, logging is preferred.
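The checkpointing alternative can be sketched as follows (a toy model; the `CheckpointedCore` class and its register names are illustrative assumptions, not an actual micro-architecture):

```python
class CheckpointedCore:
    """Minimal BER sketch: checkpointing copies the whole register set at a
    recovery point; on a detected error the set is restored (rollback)."""

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "pc": 0}
        self.shadow = dict(self.regs)      # shadow copy = recovery point

    def checkpoint(self):
        self.shadow = dict(self.regs)      # cheap when there are few registers

    def rollback(self):
        self.regs = dict(self.shadow)      # restore last known-good state

core = CheckpointedCore()
core.regs.update(r0=7, pc=4)
core.checkpoint()
core.regs.update(r0=99, pc=8)              # possibly corrupted work
core.rollback()                            # error detected: back to checkpoint
```

A logging scheme would instead record each individual register write and undo them one by one, which is why it wins when registers are many and recoveries frequent.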
Figure 2.11: Basic strategies for implementing Error Recovery. (In BER, error detection triggers a rollback; in FER, it triggers compensation. In both cases, transient and intermittent faults allow service continuation, while a permanent fault invokes fault handling and a maintenance call.)
Another important aspect is where to save the states of the recovery point. A shadow register file can be created in the core to save the states of the sensitive elements; the backup values in the shadow copy can then be used for rollback and recovery [AHHW08]. Other techniques, which require high reliability, store the states of the internal registers off-chip; when the states are restored, ECC is employed to avoid possible errors. Recently, much development has been done in BER and many low-cost computers employ it; for example, IBM employs checkpoint recovery in the POWER6 micro-architecture [MSSM10].
2.4 FT Processor Design Trends<br />
Recently, fault-tolerant computing has begun to draw more and more attention in a wide range of industrial and academic communities, due to increased safety and reliability demands [ZJ08]. Today, FT is a need of real-time industrial applications [RI08]. Since high-cost solutions are mostly unacceptable to industry, modern processors avoid hardware replication and tend to employ alternative techniques with lower power and area overheads (such as information redundancy or hybrid redundancy). Information redundancy (e.g. employing ECC) has less hardware overhead; however, it may incur an additional performance penalty.
The performance penalty and hardware overhead depend on the type of ECC. The choice of ECC depends on three constraints: power, area and error coverage requirements. Codes with better error coverage often have higher time penalties and hardware overheads. Parity codes are fast and have low hardware cost, whereas commonly employed ECCs like Hamming codes have better error coverage (e.g. SEC-DED).
The performance overhead can be minimized (masked) to an extent by calculating the parity bits in parallel. On the other hand, a common trend to reduce the hardware penalty is to compromise on error coverage and employ low-cost error detecting codes (e.g. simple parity or mod-3 coding). Likewise, some well-known processors of the last decade, like the Power6, the Itanium series and the SPARC64 V, employ parity predictors and modulo codes in their arithmetic/logic units to reduce power and cost.
ECCs are commonly employed to protect caches and data storage [QLZ05]. For example, the Itanium processor can detect 2-bit errors in the cache by relying on ECC. IBM processors have write-through L1 caches and use simple parity in the L1 cache and ECC in the L2 cache. On the other hand, Intel uses ECC even in the L1 caches.
Checkpointing with rollback is an alternative trend. It can be an effective FT solution for processors with few internal states (registers): the higher the number of internal states, the higher the performance (time) penalty for checking, loading and storing them. Some modern processors that employ this methodology reduce the performance penalty by checkpointing after every superscalar block of instructions; common examples are the Power6 and Power7.
A newer trend in processor design is to employ flexible error coverage and allow the user to choose the level of protection and redundancy needed for a particular application, e.g. the ARM Cortex-R series of application-specific processors. For higher error coverage, DMR is employed, whereas for lower coverage ECC is employed; however, the area overhead will then always be higher than 200%. In the following sections, we discuss the different FT methodologies employed in some well-known FT processors of the last decade.
SPARC64 V [AKT + 08]<br />
The SPARC64 V microprocessor is designed for mission-critical UNIX servers. In order to achieve uninterrupted operation, these servers must be resistant to soft errors. Data integrity is also highly important because of the dangers that silent data corruption (SDC) can pose in mission-critical systems. To meet these requirements, the processor was designed not only to correct SRAM errors, but also to detect errors in logic circuits and to recover from those errors when practical.
There are three smaller cache arrays of 128KB each, namely the level-1 instruction cache, the level-1 data cache and the branch history cache (BRHIS). The level-1 data cache is write-back and protected by the same SEC-DED codes as the level-2 cache. The level-1 instruction cache and BRHIS are covered by parity check. When an error is detected during a level-1 instruction cache read, the read entry is invalidated and re-fetched from the ECC-protected level-2 cache. An error in BRHIS is treated as a cache miss and the processor delays execution of the conditional branch instruction until the correct branch address is calculated. The processor takes a minor performance hit but is able to continue correct instruction execution.
Tags for level-1 instruction and data caches are parity-protected. Both level-1 caches are inclusion<br />
caches; tag information is duplicated in the level-2 tag. When a parity error is detected in a level-1 tag<br />
access, the level-2 tag is interrogated for the correct copy of the tag. The level-1 cache access is then<br />
re-executed. The last major SRAM array on the chip is the Translation Look-aside Buffer (TLB).<br />
The TLB is protected by parity check, and a parity error in the TLB is treated as a miss. The correct page table entry is fetched from the ECC-protected main memory during re-execution. In addition
to implementing cache and TLB protection, the SPARC64 V is designed to detect single bit SRAM<br />
errors in other smaller SRAM arrays and recover from those errors as well.<br />
The processor logic circuits are protected by byte parity check to detect single bit logic errors in<br />
each byte. Parity check bits are calculated at the location of new data value generation and passed<br />
with the associated data through the processor logic circuits. Parity bits are checked at the receiving<br />
end.<br />
Arithmetic/logic units are equipped with byte parity predictors. The byte parity predictors calcu-<br />
late the parity bits for each output byte of an arithmetic/logic unit using the same input signals as the<br />
unit to be checked. These independently calculated byte parity bits are compared with the byte parity<br />
bits calculated from the output of the arithmetic/logic unit. Multipliers are checked with a modulo-3<br />
scheme.<br />
The byte parity predictors in the arithmetic/logic unit do not detect point errors that result in an even number of bit flips in the output byte, and the modulo-3 scheme used in the multipliers does not detect point errors that leave the modulo-3 residue unchanged. These checks, however, do detect the majority of single point errors and are cost-effective compared to a full duplication-and-compare implementation. When a parity error is detected in the logic circuits or small SRAM arrays, the processor stops issuing new instructions and clears all intermediate states. It then restarts execution at the instruction directly following the last correctly executed instruction by using the check-pointed states. This action is called instruction retry.
The checkpoint and instruction retry mechanisms are implemented in the processor for recovery<br />
from branch misprediction. Thus, the additional cost associated with utilizing these mechanisms for<br />
error recovery is small. Furthermore, many microprocessors today feature either ECC or byte parity<br />
for large on-chip SRAM arrays. Compared with those microprocessors, the SPARC64 V micropro-<br />
cessor only requires additional transistors for implementing byte parity bits, byte parity predictors and<br />
the associated parity checkers in the logic circuits and small SRAM arrays. The number of transistors<br />
devoted to the error detection mechanisms of the SPARC64 V microprocessor is about 10% of the transistors for logic gates, latches and parity-protected small SRAM arrays.
LEON3 FT<br />
LEON3 is the successor of the LEON2 processor developed for the European Space Agency (ESA). The LEON3FT [GC06] is a fault-tolerant version of the standard LEON3 (a clone of the SPARC V8). In LEON3FT, consideration is given only to the protection of data storage, not to the functionality of the processor: there is no protection for the control unit, data path or ALU circuitry.
The internal registers are protected with ECC codes plus a shadow copy. Upon a detected parity error, a duplicate copy of the data is read out from a redundant location in the register file, replacing the failed data. A few internal registers have four-bit error detection capability; however, the majority of registers only have two-bit error detection.
The cache memory in LEON3-FT consists of separate instruction and data caches, each 8 Kbytes large. Each cache has two parts, tag and data RAMs. The tag and data memories are implemented with on-chip block RAM and protected with four parity bits per 32-bit word, allowing up to four simultaneous errors per cache word to be detected. Upon a detected error, the corresponding cache line is deleted and the instruction is restarted. This operation takes 6 clock cycles (idle states) and is transparent to software. For diagnostic purposes, error counters are provided to monitor detected and corrected errors in both the tag and data parts of the caches.
Boeing 777 the control system<br />
In the Boeing 777, the control system is made reliable through redundant channels with different processors and diverse software to protect against design errors as well as hardware faults [BT02]. It uses heterogeneous triple-triple modular redundancy [Yeh02] (as shown in figure 2.12): three different processor architectures (Intel 80486, Motorola 68040 and AMD 29050) execute the same operation. However, it is an expensive solution that can only be employed in mission-critical applications.
ARM Cortex R Series [ARM09]<br />
The ARM Cortex-R series is a family of embedded processors for real-time industrial applications. They are highly customizable, so that manufacturers can choose the features that suit their application needs.
If the ECC build option is enabled, a 64-bit ECC scheme protects the instruction cache: the data RAMs include eight bits of ECC code for every 64 bits of data. The data cache is protected by a 32-bit ECC scheme: the data RAMs include seven bits of ECC code for every 32 bits of data.
If the parity build option is enabled, the caches are protected by parity bits: for both the instruction and data caches, the data RAMs include one parity bit per byte of data.
The processor can also be implemented with a second, redundant copy of most of the logic. The second copy shares the cache RAMs of the master core, so that only one set of caches is used. Comparing the outputs of the redundant core with those of the master core detects faults.
Figure 2.12: The triple-TMR in Boeing 777 [Yeh02] (three channels, each voting over an Intel 80486, a Motorola 68040 and an AMD 29050; the channel outputs are voted again so that an error in any component is masked)
Power6 [MSSM10, KMSK09, KKS + 07]
The Power6 processor, designed by IBM, uses inline checkers instead of TMR, which costs less power and hardware overhead. It has built-in self-checking in the data and control flow paths: residue checking is employed for the floating-point unit, and logical consistency checkers for the control logic. It has a recovery unit which checkpoints after a group of superscalar instructions completes. The inline checkers write into a fault isolation register that decides whether the current state is error free; in case of error detection, the recovery unit initiates instruction-retry recovery. The memory bus, including the input-output unit, is protected by ECC codes. The L1 cache is protected by simple parity, while the L2 and L3 caches and all signals into and out of the chip to L3 have ECC protection.
Intel Itanium 9300 Series [Int09]<br />
The Intel Itanium 9300 series processors are high-performance processors. The L2, L3 and directory caches are protected with ECC, which can correct all single-bit errors and most double-bit errors. Moreover, hardware-assisted scrubbing support is available for the L2, L3 and directory caches. Memory is also protected thermally: different thermal sensors send information to the memory controllers, which consequently increase fan speed to regulate the temperature. The internal registers of the processor are protected by ECC. Additionally, there are redundant clocks and soft-error-hardened latches and registers to improve resistance to soft errors.
2.5 FT Evaluation<br />
In the semiconductor industry, testing expenses increase the overall cost of IC design and manufacturing. Generally, industrial testing is meant to find permanent faults that can be introduced at manufacturing time. However, the most frequently occurring faults in computer systems are temporary effects like transient and intermittent faults; they are the main cause of digital system failures [VFM06]. Due to the increasing probability of transient faults in the latest technologies, more and more designers will have to analyse the potential impact of these faults on the behaviour of their circuits.
The error model used to evaluate faults depends on their duration. Permanent faults can be tolerated by replacing the faulty component, whereas a temporary fault disappears by itself. Intermittent faults are treated with either the permanent or the temporary model, depending on how often they occur. Some common techniques to evaluate FT systems are discussed in [WCS08]. Among them, fault injection is widely accepted as an effective approach to evaluate fault tolerance [LN09, Nic10].
2.5.1 Fault Injection<br />
Fault injection is a validation technique for FT systems which consists in carrying out controlled experiments in which the system's behavior is observed in the presence of faults explicitly induced by the voluntary introduction of faults into the system [ACC + 93]. In other words, it is the purposeful introduction of faults (or errors) into a target [NBV + 09]: an intentional activation of faults in order to observe the behavior of the system under fault. The objective is to compare the nominal behavior of the circuit (without fault injection) with its behavior in the presence of faults injected during the execution of an application.
Fault injection techniques have become popular to evaluate and improve the dependability of embedded processor-based systems [LAT07]. Fault injection can be accomplished at the physical or the simulation level.
1. Physical fault injection: faults are injected directly into the hardware by disturbing its environment (heavy-ion radiation, electromagnetic interference, laser, etc.) [BGB + 08, Too11]. Many methods have been proposed, aimed primarily at the validation of physical systems, including injection on circuit pins, injection of heavy ions, disruption of power supplies, and laser fault injection [GPLL09]. None of these approaches can be used for evaluation before the circuit is actually manufactured. An alternative is therefore to employ injection techniques that allow earlier analysis of the design, typically at the register transfer or gate level, e.g. by introducing mistakes in an RTL description.
2. Simulated fault injection: fault injection campaigns can be performed using several approaches, especially simulation for high-level approaches. Simulation has been widely used for its simplicity, versatility, and controllability [NL11]. It is more expensive in time; however, it may allow more comprehensive analysis, provide more accurate results, and cost less than physical fault injection [NL11]. Fine-grained access to the internal states of the processor is easy with simulated fault injection, which is why it offers better controllability/observability. In this technique, the system under test is simulated on another computer system, and faults are produced by altering logical values during the simulation.
Simulated fault injection is a special case of injecting soft errors that can support various levels of abstraction of the system, such as architectural, functional, and logic [CP02], and for this reason it has been widely used to study fault effects. This technique has several other advantages. For example, its greatest advantage over the others is the controllability/observability of all the modelled components. Another positive aspect is the possibility to carry out the validation of the system during the design phase, before a final design exists. Alternate physical/simulation environments to perform safety analysis are discussed in [Bau05, RNS + 05].
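A minimal Python sketch of the simulated-fault-injection idea: the device under test is modeled in software, and a fault is produced by altering a logical value during the simulation. The toy instruction set, register names, and 8-bit datapath are our own illustration, not taken from any particular tool.

```python
def simulate(program, regs, fault_cycle=None, fault_reg=None, fault_bit=0):
    """Run a toy register-level simulation; optionally flip one bit
    of one register at a given cycle to emulate a transient fault."""
    for cycle, (op, dst, a, b) in enumerate(program):
        if op == "mov":
            regs[dst] = regs[a]
        elif op == "add":
            regs[dst] = (regs[a] + regs[b]) & 0xFF   # 8-bit datapath
        if cycle == fault_cycle and fault_reg is not None:
            regs[fault_reg] ^= (1 << fault_bit)      # inject the fault
    return regs

program = [("mov", "r1", "r0", None), ("add", "r2", "r0", "r1")]
golden = simulate(program, {"r0": 3, "r1": 0, "r2": 0})
faulty = simulate(program, {"r0": 3, "r1": 0, "r2": 0},
                  fault_cycle=0, fault_reg="r1", fault_bit=2)
print(golden["r2"], faulty["r2"])  # golden: 6, faulty: 10
```

Comparing the golden (fault-free) run with the faulty run is exactly the nominal-versus-faulty comparison described above.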
2.5.2 Error Models<br />
To design an FT system, it is important to be aware of the possible faults that can appear in it. Some commonly used fault models are shown in table 2.1. In practice, an architecture is designed to overcome possible errors rather than the underlying physical phenomena: such a system detects the active faults that produce errors, without being aware of their physical cause.
Table 2.1: Fault modeling at different abstraction levels

  Level        Model
  Programming  Instruction, sequences, etc.
  HDL          Functional model, register
  Logic        Gate level
  Electronic   CMOS transistor
  Technology   Physical layout
Error models can be classified along three axes: type of error, error duration, and number of simultaneous errors [Sor09]. A commonly considered error model is the bridging model, which covers short-circuits and cross-talk. This low-level error model is suited to detecting fabrication defects that can cause a short-circuit between two connections/wires.
The fail-stop error model is a higher-level error model: in a system based on it, all components stop working when an error is detected. Such systems are used in critical applications such as ATMs (automated teller machines), where a single error in a calculation can cause a loss of hundreds of dollars. The system stops working if an uncorrectable error is detected.
In the delay error model, the circuit produces the correct response but after an unexpected delay. This type of error can occur due to various internal physical phenomena of the device. Some related work is discussed in [EKD + 05].
Here, we are interested in bit-flip errors, which are largely representative of transient errors due to SEUs (SBUs and MBUs). Moreover, they are easy to model at many abstraction levels.
2.5.3 The Fault Injection Framework<br />
A fault injection framework usually needs at least three types of information:
(a) when the fault is to be injected: which condition will trigger the fault injection during the simulation;
(b) where the fault is to be injected: in which location the fault will be injected;
(c) what kind of fault is to be injected: what its effect will be.
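The three questions above can be captured in a small campaign driver. A hedged Python sketch follows; the `Fault` record, the toy register file, and the single-instruction program are illustrative assumptions, not part of any existing framework.

```python
import random
from dataclasses import dataclass

@dataclass
class Fault:
    trigger: int      # when: instruction count at which to inject
    location: str     # where: which register to corrupt
    bit: int          # what: which bit to flip (single bit-flip model)

def random_fault(n_instr, regs, width):
    """Non-deterministic profile: pick trigger, location and bit at random."""
    return Fault(random.randrange(n_instr), random.choice(list(regs)),
                 random.randrange(width))

def run_with_fault(program, regs, fault):
    """Simulate a toy program, injecting the fault when its trigger fires."""
    for count, (op, dst, src) in enumerate(program):
        if op == "mov":
            regs[dst] = regs[src]
        if count == fault.trigger:                  # when
            regs[fault.location] ^= 1 << fault.bit  # where + what
    return regs
```

A deterministic campaign fixes all three fields in advance; the non-deterministic variant draws them with `random_fault`.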
Fault Trigger (when)<br />
Fault injection may follow a deterministic or a non-deterministic (time) profile. A non-deterministic fault trigger may inject a fault after some amount of simulated time, whereas in the deterministic approach the fault is triggered by counting the simulated instructions: the fault is injected after a specified number of instructions have been executed. In simulated fault injection, non-deterministic behavior is obtained by choosing the amount of time, or the number of simulated instructions, at random.
A deterministic fault trigger may limit the scope of the fault injection by restricting the injection to a specific interval. In real applications, faults occur at random instants, so the practical solution is to use non-deterministic approaches, combining different trigger conditions for specific situations.
Fault Location (where)<br />
In processors, faults may affect the ALU, internal registers, or memory addresses, and hence the output of any instruction using the affected logic. In all cases, a change in a processor register or in memory can represent a realistic fault. A fault location is often described deterministically, but it can also be described non-deterministically by letting the fault injection framework choose at random which processor register to inject the fault into.
Fault Effect (what)<br />
As explained previously, the most common effect of a transient fault on a processor register or memory is the inversion of the state of one bit (single bit flip). By flipping a bit in a register or at a memory address, we can inject a fault as it would occur in a real situation; the value of the altered bit is toggled to the opposite value. This upset model is the standard transient fault model used in the reliability literature [Muk08]. A deterministic fault operation specifies which bit to flip, but the operation can also be non-deterministic, letting the fault injection framework choose the bit at random.
The above-mentioned details are the basic information needed by a fault injection simulator. Our exact choices are presented in chapter 6, where the validation methodology based on artificial error injection is described.
2.6 Conclusions<br />
In this chapter, we explored the existing design methodologies and validation techniques for FT processors. Today, FT processors employ a variety of redundancy techniques, each with its own area/time overheads. Hardware redundancy offers faster error detection and correction but has high area overheads, whereas temporal redundancy has lower area overheads but high time overheads.
In the past, FT processors were only used for mission-critical applications and mostly relied on hardware replication. Nowadays, however, every system needs at least some consideration of FT, and expensive solutions are not acceptable for most embedded systems. Therefore, modern processors either rely on hybrid techniques or focus on information redundancy techniques to reduce power and area requirements. The available low-cost solutions lack fast error detection. The need is to develop an alternate fault tolerance methodology that provides fast error detection at low power/area overheads.
In the last sections, different methods of processor evaluation were discussed. Among them, simulated fault injection has many advantages over physical fault injection, including better controllability/observability and architecture validation during the initial development stages.
II. QUALITATIVE AND QUANTITATIVE STUDY
Chapter 3<br />
Design Methodology and Model<br />
Specifications<br />
3.1 Motivation<br />
Due to current technology trends, there is growing concern that transient faults will occur more frequently in the future [FGAD10]. Since this reliability threat is projected to affect the broad computing market, traditional solutions involving excessive redundancy are too expensive in area, power, and performance [BBV + 05, SHLR + 09]. Research on FT system design with minimum hardware overhead has therefore gained great importance in the last few years.
Previously, research in this domain aimed only at attaining high dependability with minimum performance degradation, with little consideration for low-cost hardware solutions. Consequently, dependability was mostly attained through expensive solutions such as hardware replication. Well-known FT processors developed in the past, such as the Stratus, LEON-FT, Sun FT SPARC, and IBM S/390, employed hardware redundancy (either DMR or TMR). These processors have high power and hardware overheads and do not address the needs of everyday applications.
The available FT solutions often incur significant penalties in area, cost, or performance, and they are unable to tolerate faults efficiently [PIEP09]; they cannot fulfill the needs of common industrial applications. Some temporal redundancy techniques have minimum hardware overheads, but their significant time overheads limit overall performance. On the other hand, hardware replication is faster but increases cost and power requirements. It is a great challenge to build efficient FT systems with reduced time and hardware overheads, and efficient design optimization techniques are required to meet both constraints. Consequently, in this work we propose an FT processor design methodology offering an acceptable compromise between protection and area/time overheads. The next section explores the proposed methodology.
3.2 Methodology<br />
We want to design a fault-tolerant processor with minimum error detection latency and low hardware overhead. The challenge is to find a compromise between protection and hardware overheads: hardware detection and correction is fast but has high hardware overheads, whereas software-based detection and correction is slower but has low hardware overhead.
Consequently, for fast error detection we choose hardware-based concurrent error detection, so that an error is detected before it reaches the system boundaries and results in a catastrophic failure. For hardware savings, on the other hand, we accept an additional time penalty in error correction. Moreover, in real applications faults do not occur often, so a software-based rollback mechanism can be chosen to recover from errors.
Figure 3.1: Proposed methodology: a fault-tolerant computer with low hardware and time trade-offs, combining fast error detection (concurrent error detection) with low hardware overhead (rollback mechanism).
The resulting hardware-software co-design methodology (see figure 3.1) has the ability to detect errors as soon as they occur and to start the error recovery strategy immediately, preventing the propagation of errors throughout the system. The proposed methodology is suitable for non-safety-critical FT applications where error occurrence rates are not too high.
In the next section, we will discuss the most suitable CED and recovery mechanisms for the above<br />
scenario.<br />
3.2.1 Concurrent Error Detection: Parity Codes<br />
The implementation of CED usually requires extra hardware. One of the most straightforward and commonly used CED approaches is DMR. Theoretically, it can detect 100% of errors (except simultaneous errors in both modules and errors in the comparator) [MS07]. However, with this technique the total area exceeds 200% of the original circuit. The choice of checking strategy is a compromise between error coverage and acceptable overhead, and cost-effective solutions are the objective of further investigations in error detection. EDC have a smaller area overhead [Pat10] and are often considered sufficient for non-safety-critical processors [MS07]. Among EDC, we employ the simplest codes, because our objective is to show the feasibility of our approach; once the overall methodology has shown interesting results, we may employ stronger codes with better error coverage.
Parity codes are the simplest and cheapest known EDC. They detect all errors of odd bit multiplicity and require extra circuitry for parity bit generation and output parity verification. Their hardware overhead is much lower than that of the DMR approach, and they can be employed to protect registers, data buses, RAM, and bit-sliced circuits [Pie06].
Their disadvantage is that multiple-bit faults of even multiplicity remain unrecognized. The example of an 8 × 8-bit register file in figure 3.2 illustrates this fact: faults in registers 1 and 3 are detected by the parity check, whereas faults in registers 2 and 5 remain undetected.
Parity codes do not need complex encoding and decoding circuitry, and the gate count of the complete on-line checking scheme is small. Moreover, for soft errors, which are random in time and space, the likelihood of multiple errors within one clock cycle is exceedingly low. In this scenario, a less expensive approach such as parity-based error detection can suffice [Gha11].
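The odd-multiplicity coverage of parity, and its even-multiplicity blind spot, can be checked in a few lines of Python. The 8-bit word and the specific fault masks are illustrative choices mirroring figure 3.2.

```python
def parity(word):
    """XOR of all bits of an integer word (the parity bit)."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b10110000           # register contents (3 ones, parity 1)
check = parity(stored)        # parity bit stored alongside the word

single = stored ^ 0b00000100  # one bit flipped (odd multiplicity)
double = stored ^ 0b00010100  # two bits flipped (even multiplicity)

print(parity(single) != check)  # True: the error is detected
print(parity(double) != check)  # False: the error escapes the check
```

This is exactly the behavior in figure 3.2: odd-multiplicity faults change the parity, even-multiplicity faults do not.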
Figure 3.2: Limitation of the parity check. An 8 × 8-bit register file (rows x1 to x8, each with an odd-parity bit) is shown in a fault-free environment and in a noisy environment; odd-multiplicity bit faults are detected by the parity check, while even-multiplicity faults remain undetected.
Lisboa [LC08] employed a similar approach, using a standard parity-based technique to detect errors in single-output combinational circuits. In that work, a second circuit generates an extra output signal, named the check bit, and two circuits based on reduced-area XOR gates verify the parity of the inputs and outputs to detect soft errors.
3.2.2 Error Recovery: Rollback<br />
For minimum hardware overhead, software-based error correction is useful. One straightforward solution is temporal fault masking (software TMR). It has low hardware overhead, but it triples the execution time and, moreover, does not match the hardware CED already chosen. An alternate approach is software rollback, in which transient, non-persistent faults are tolerated (repaired) by repeating the operation in a controlled manner on the same hardware [RRTV02].
Rollback overcomes errors by returning to a point prior to the occurrence of the fault [MG09]. It allows a system to tolerate a failure by periodically saving the entire state, so that if an error is detected, the system rolls back to the prior checkpoint to recover [JPS08]. It requires little hardware overhead, and the resulting architecture can overcome errors at low cost. It is a good candidate for situations where recovery delays are acceptable, and the rollback principle combined with CED can be an efficient approach to error recovery [BP02, EAWJ02].
Figure 3.3: Rollback execution: the program moves through states A to F over times t to t+6 within a Sequence Duration (SD); since no error was detected during the last SD, the data was validated at VP n-1 and the SEs were stored before the current SD began.
Figure 3.4: Error detection during a Sequence Duration (SD): an error detected during instruction execution in the current SD triggers a rollback to VP n-1.
Our strategy to implement fault recovery is based on rollback execution, a classically employed<br />
software technique in real-time embedded systems [KKB07]. It relies on the following behaviors (see<br />
figure 3.3):<br />
• program (or thread) execution is split into sequences of fixed maximal length;
• to be validated, each sequence must reach its end without any error being detected;
• the data generated during a faulty sequence must be dismissed, and execution restarted from the beginning of the faulty sequence.
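These behaviors amount to a simple checkpoint/retry loop. A hedged Python sketch follows, in which `execute` and `detect_error` are placeholders for the processor's sequence execution and its hardware CED; the function and parameter names are ours.

```python
def run_with_rollback(sequences, initial_state, execute, detect_error):
    """Execute fixed-length sequences; validate each at its end,
    or discard its results and re-execute it from the checkpoint."""
    state = dict(initial_state)          # last validated SEs
    for seq in sequences:
        while True:
            trial = execute(dict(state), seq)   # work on a copy
            if not detect_error(trial):
                state = trial                   # validation point: commit
                break                           # proceed to next sequence
            # error detected: 'trial' is dismissed and the loop
            # re-executes the sequence from the validated checkpoint
    return state
```

Note that the un-validated results live only in `trial`: a detected error throws them away without ever touching the committed `state`, which is the role the journal plays in hardware.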
If an error occurs within an instruction sequence, the processor registers can be restored with the previously saved contents. In figure 3.4, an error is detected during instruction execution, so the rollback mechanism is invoked and re-execution starts from the states stored at the previous validation point (VP). A VP occurs after a fixed interval of instructions. In figure 3.5, on the other hand, no error is found during the Sequence Duration (SD), and all the data written during the SD is validated at the VP.
Figure 3.5: No error detected during the SD: the SEs are stored at VP n-1, instructions execute during the current SD (SED of which is spent storing the SEs), and with no error detected the data is validated at VP n.
The SD represents the full length of the sequence, including the time taken to store the sensitive elements (SEs) as well as the active instruction execution. In the remainder of this work, the processor internal states are called SEs. Let us denote the minimum time to load the processor SEs as 'SED' (see figure 3.5). The ratio of active sequence duration is then '(SD-SED)/SD', whereas the ratio of time spent loading the SEs is 'SED/SD'. Assuming a program length of 10,000 instructions and neglecting the possibility of errors, SD=10 leads to about 1000 SE loads, whereas SD=100 leads to about 100 SE loads. The SED/SD penalty (SE loading) is thus 10 times smaller with SD=100; in other words, a larger SD can result in faster program execution.
On the other hand, if the probability of errors is not ignored, the time penalty due to re-execution varies with the sequence length. At higher error injection rates, smaller SDs are the more effective compromise, because the probability that a sequence fails validation is higher for bigger SDs, and vice versa. This is further discussed in chapters 5 and 6.
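The arithmetic above can be reproduced directly. A small sketch with the same illustrative numbers (10,000 instructions, errors neglected):

```python
def se_load_count(program_length, sd):
    """Number of SE store/load phases for sequences of length SD,
    neglecting errors (so every sequence validates the first time)."""
    return program_length // sd

def sed_share(sed, sd):
    """Fraction of each sequence spent storing/loading the SEs."""
    return sed / sd

print(se_load_count(10_000, 10))   # 1000 SE loads
print(se_load_count(10_000, 100))  # 100 SE loads: 10x smaller SED/SD penalty
```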
Figure 3.6: Time overhead in rollback: instructions a1 to a7 execute in clock cycles 1 to 7; an error detected at a7 forces the SEs to be reloaded and part of the sequence to be re-executed, so execution completes only at cycle 14.
The rollback principle is the repetition of an erroneous operation starting from a defined (saved) checkpoint in the past. There is a time penalty in the case of error detection. For example, an error is detected at 'a7' in figure 3.6: the processor rolls back, reloads the SEs, and re-executes from the previously saved states. This requires additional clock cycles to re-execute the sequence, and the delay is higher if the SD is large (more instructions in the sequence).
3.3 Limitations<br />
A system relying on rollback cannot communicate data to a real-time environment until the data is known to be error-free. If erroneous data reaches the system peripherals, it cannot be recovered and may cause catastrophic failures. This is a fundamental problem of rollback recovery and has been discussed in [NMGT06]. The common approach is to wait until the data is validated; in the present methodology, output events are therefore delayed by one SD, which may conflict with real-time constraints.
A special output unit would be needed to monitor the output control signals. However, real-time communication is not within the scope of the present work and will be considered in future work.
3.4 Hypothesis<br />
Among other underlying hypotheses, we suppose that the processor core is connected to a dependable memory (DM), in which data is kept safe without any risk of corruption. Under this assumption, all internal errors produced in the DM are detected and corrected by the DM itself. The DM is therefore internally safe storage, but it must be protected from errors coming from outside, which means that only validated data may be written into the DM.
A lot of work has been dedicated in the past to the protection of memory devices [MS06, Hsi10], making this hypothesis realistic.
3.5 Design Challenges<br />
Choosing concurrent error detection for error detection and rollback for recovery is not enough to achieve our design objectives. An effective implementation of this scenario requires appropriate choices, in particular concerning the processor architecture. These design choices should improve dependability, cost, and overall performance. In the following sections, we analyze the major features required for a successful implementation.
3.5.1 Challenge # 1: Self Checking Processor Core Requirements<br />
The choice of a base processor architecture is the first step towards the implementation of an FT processor, because not all processor architectures fit this context. For a successful implementation, we must determine the required key features of the processor.
• minimum hardware: we aim to design an FT processor with a small hardware "fingerprint". This reduces the chance of data contamination, since the greater the area exposed to the environment, the greater the chance of errors being provoked. A smaller area also allows a more efficient architecture on smaller silicon dies, and thus a much higher yield [TM95].
• minimum internal states to be checked and stored: (i) with concurrent error detection, the hardware overhead necessary to check all the internal states simultaneously may be rather large; a reduced number of internal states helps reduce this overhead. (ii) Rollback recovery requires the internal states of the processor to be saved periodically, incurring a time penalty that can be lowered with a reduced number of internal states.
The commonly employed RISC (Reduced Instruction Set Computer) class machines have a large register file and do not fit the proposed methodology, for the following reasons:
(a) more registers mean more expensive CED;
(b) more registers imply more time spent periodically saving register contents;
(c) a large number of registers requires a large number of instruction bits as register specifiers, meaning less dense code;
(d) CPU registers are more expensive than external memory locations.
On the other hand, CISC (Complex Instruction Set Computer) machines require a complex control architecture, which complicates the overall implementation methodology, has high memory requirements, and increases the probability of design errors.
In short, we cannot rely on the classical processors (RISC or CISC). Our choice is a simple processor architecture with minimum internal states. This reduces the overall area/time penalty and makes the processor more robust against external disturbances. On the other hand,
it should have minimal complexity of the architecture and provide better utilization of chip<br />
resources.<br />
3.5.2 Challenge # 2: Temporary Storage Needed: Hardware Journal<br />
If a self-checking processor is directly connected to the DM (see figure 3.7), then the validated data (VD, written in a previously validated SD) and the un-validated data (UVD, produced by the sequence currently executing) must be managed inside the DM, which induces an additional time penalty.
In such a case, a paginated memory (one page per sequence) is generally employed, with un-validated and validated pages used to manage rollback. If an error is detected in the current sequence, the corresponding page is discarded and the previous page is restored. This approach is slower and requires additional pointers to handle the pages; these pointers can be either dedicated registers (faster) or dedicated variables (slower and more risky).
Moreover, there is an additional risk that these pointers become corrupted, losing track of the validated and un-validated pages. The DM would then no longer be a safe data storage, violating the basic hypothesis. Furthermore, if a large amount of data is copied between pages, or between pages and the main pool of data in memory, this takes a lot of time, and the system requires a bigger DM to store the validated and un-validated data separately.
Figure 3.7: Untrusted data flowing into the dependable memory (DM): the processor writes directly to the DM.
An alternate approach that simplifies this scenario is to employ temporary data storage between the processor and the DM. It strongly reduces the time penalty and, to some extent, the risk of error. Furthermore, it simplifies the periodic saving of data, and only validated data is transferred to the DM.
The basic idea is to place hardware devices on the path between the processor and the DM, controlling the way data flows from one side to the other and preventing un-trustable data from ending up in the DM (as suggested in figure 3.8). This is achieved by first writing the non-secure/non-validated data to a temporary location, and transferring it to the DM only after sequence validation. The self-checking processor core (SCPC) can detect errors and re-execute instructions from the last secure states (in case of error detection). In this way, external errors (from the environment or the processor) are masked from entering the DM (as shown in figure 3.8).

Figure 3.8: Data stored in a temporary location (written by the self-checking processor core) before being written to the dependable memory.

The underlying idea behind the journalization mechanism is to prevent un-trustable data from flowing into the DM and to allow easy recovery from faulty situations. Hence, a temporary location (called the self-checking hardware journal in later chapters) is needed to mask errors from entering the DM.
The Need for a Self-Checking Hardware Journal
Data stored inside this temporary location can itself be corrupted by transient faults, such as SEUs (see figure 3.9). Hence, an error detecting and correcting mechanism is needed to ensure the reliable operation of this temporary data storage.
Figure 3.9: Data corruption in the temporary storage: a transient fault can strike the non-validated data held between the self-checking processor core and the DM, so supposedly validated data may reach the DM corrupted.
Suppose that we write data into the journal at time 't', that no error is detected during the sequence up to the VP, and that the data is ready to be transferred to the DM. Is this data dependable? No: the data remained in the journal for a time 'tx', and the possibility of a fault occurring during tx cannot be ignored. Hence, a self-checking mechanism is needed to detect errors in the journal and thus protect the DM from data contamination, as shown in figure 3.10. It makes the journal a safe temporary data storage.
Figure 3.10: Protecting the DM from contamination: a dependable temporary storage between the self-checking processor core and the DM ensures that only trustable, validated data reaches the DM.
Separate Storage of Validated and Un-validated Data<br />
The data initially written in the temporary location is un-validated; if no error occurs during the present sequence, the data is validated at the validation point. At any instant, the temporary storage therefore holds two types of data: un-validated data and validated data. Consequently, it must contain two different parts, one to store validated data and the other to store un-validated data. In addition, this separation makes it easy to transfer the validated data towards the DM.
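One way to picture the journal's role is a small software model. This is a hedged sketch: the class and method names are ours, and for brevity the validated part is collapsed into an immediate transfer to the DM at the validation point.

```python
class Journal:
    """Temporary storage between the processor and the DM: writes land
    in an un-validated area and reach the DM only on validation."""

    def __init__(self, dm):
        self.dm = dm             # dependable memory (modeled as a dict)
        self.unvalidated = {}    # data of the sequence being executed

    def write(self, addr, value):
        self.unvalidated[addr] = value

    def read(self, addr):
        # The most recent copy of a word may still be un-validated.
        return self.unvalidated.get(addr, self.dm.get(addr))

    def validate(self):
        """End of an error-free sequence (VP): commit to the DM."""
        self.dm.update(self.unvalidated)
        self.unvalidated.clear()

    def rollback(self):
        """Error detected during the sequence: discard the un-validated
        data; the DM was never touched, so no cleanup is needed there."""
        self.unvalidated.clear()
```

The key property, mirroring the text, is that `rollback` never has to undo anything in the DM: erroneous data is confined to the un-validated area.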
3.5.3 Challenge # 3: Processor-<strong>Memory</strong> Interfacing<br />
The overall performance of the FT processor can be limited by the absence of efficient interfacing between the processor, the temporary location, and the memory. In most processors, the majority of instructions involve a read or a write from/to memory, so overall performance suffers if there is a long critical path or if more than one clock cycle is needed to read and write data. In our case, the situation is even more delicate, because there is a temporary data storage between the processor and the DM. An intelligent interface is needed to mask errors from entering the DM while providing an efficient interconnect between the modules.
In this scenario, two interfaces are possible: the processor communicates with the DM via the journal, or the processor communicates with the journal and the memory in parallel. The challenge is to evaluate both processor models from the dependability and performance-degradation points of view and to choose the most suitable one.
3.5.4 Challenge # 4: Optimal Sequence Duration for Efficient Implementation<br />
of Rollback Mechanism<br />
The objective of the rollback technique is to restore the system state (in case of error) by overwriting<br />
the current sequence states with the previously validated states of the SEs, as shown in figure 3.4. There<br />
are two performance-limiting factors: (i) the time taken to periodically store the SEs, and (ii) the<br />
un-validation of the SD and the reloading of the SEs on error detection. Reducing the time penalty of<br />
reloading the SEs calls for long sequences, so that the overall number of SE load/store operations<br />
remains smaller than ‘(SD-SED)/SD’. On the other hand, with longer sequences at higher error rates<br />
there is less chance of a sequence reaching validation; the rollback rate, and with it the time<br />
penalty and performance degradation, grows. Therefore it is advisable to use long sequences at low<br />
error rates and short sequences at high error rates.<br />
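This trade-off can be illustrated with a simple first-order model (our own sketch, not taken from the thesis): if each instruction is corrupted independently with probability p, a sequence of SD instructions completes error-free with probability (1 − p)^SD, and the expected number of runs before it commits grows geometrically with SD:<br />

```cpp
#include <cmath>

// Illustrative model of the rollback trade-off (our assumption, not the
// thesis's emulator): expected number of times a sequence of `sd`
// instructions must be executed before it validates, when each instruction
// is hit by an error with independent probability `p`.
double expected_runs(double p, int sd) {
    double q = std::pow(1.0 - p, sd);  // probability of an error-free sequence
    return 1.0 / q;                    // geometric retries: mean runs until success
}
```

With p = 0.01, expected_runs is roughly 1.1 for SD = 10 but roughly 2.7 for SD = 100, consistent with the observation that long sequences only pay off at low error rates.<br />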
3.6 Model Specifications and Global Design Flow<br />
The basic role of the journal is to hold the new data generated during the currently executing<br />
sequence until it can be validated (at the end of that sequence). On sequence validation, this data<br />
can be transferred to the DM; otherwise it is simply dismissed and the current sequence restarts from<br />
the beginning, using the SEs (held in the DM) corresponding to the state prevailing at the end of the<br />
previous sequence.<br />
Figure 3.11: Overall design specifications (self-checking processor core, HW journal, dependable memory)<br />
Our global design strategy for the FT processor is organised in four steps, summarized in figure<br />
3.12. Step I states the proposed model specification, shown as a block diagram in figure 3.11: it<br />
captures the design requirements, in this case the SCPC, the DM and a hardware journal that masks<br />
errors from entering the DM. Moreover, the architecture must respect challenges 1-4 mentioned in<br />
the previous section.<br />
Step II refines the design strategy through several functional implementations and is discussed<br />
in the next section. Step III, the hardware implementation, is presented in chapters 4 and 5. Finally,<br />
step IV is concerned with validating the overall approach by artificial error injection; it is presented<br />
in chapter 6.<br />
3.7 Functional Implementation<br />
This section is concerned with step II of figure 3.12 and aims to refine the proposed model.<br />
There are two possible connections between the processor and the DM: (i) Model-I, where the<br />
processor is connected to the DM via a journal pair and cannot read from the DM in a single clock<br />
cycle; and (ii) Model-II, where the processor is connected to the journal and the DM in parallel and<br />
can directly read from the DM
Figure 3.12: Global design flow (Step-I: model specifications; Step-II: functional implementation, refined into Models 1-3; Step-III: HW implementation; Step-IV: testing and HW validation)<br />
and the Journal simultaneously. The two approaches differ in the type of connection between the<br />
SCPC and the DM, and consequently in overall dependability and performance. In this section we<br />
finalize the processor-memory (DM) interfacing.<br />
Which scenario is better can be judged by developing the corresponding functional models and<br />
comparing the simulation curves (clock cycles per instruction (CPI) vs. error injection rate (EIR))<br />
obtained by artificial error injection.<br />
Hypothesis:<br />
To simplify the functional model, the following hypotheses are assumed:<br />
(a) the processor core is self-checking;<br />
(b) a dependable memory is attached to the processor, where data can safely reside without risk<br />
of provoking errors;<br />
(c) the journal and cache are considered dependable data-storage places;<br />
(d) all instructions are assumed to execute in one clock cycle;<br />
(e) re-execution of instructions can recover from soft errors.<br />
Benchmarks<br />
A set of benchmarks consisting of the main kernels of typical tasks in the target application has<br />
been selected and divided into three groups. The first group exercises memory operations (permu-<br />
tation/sorting), the second is representative of arithmetic-dominated algorithms, and the third of<br />
control-dominated algorithms. All applications have significant memory requirements, since each<br />
time they read their inputs from memory and, after execution, write the results back to it (these<br />
benchmarks are not designed to evaluate I/O events, so they only read from and write back to memory).<br />
(a) Benchmark Group-I: Bubble sort, one of the simplest algorithms for sorting an array, has been<br />
considered. It consists of repeatedly exchanging out-of-order pairs of adjacent array elements<br />
held in memory, looping until all elements are in order. It has been implemented in a serial<br />
fashion: only one pair is examined at a time. Its time complexity is O(n²), since up to n passes<br />
must be made through the array, where n is the number of elements.<br />
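A minimal C++ sketch of this kernel (illustrative; the actual benchmark is written for the stack processor's instruction set):<br />

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial bubble sort, as in benchmark group I: repeatedly swap out-of-order
// adjacent pairs, looping until a full pass makes no exchange.
void bubble_sort(std::vector<int>& a) {
    bool swapped = true;
    while (swapped) {                  // at most n passes over the array
        swapped = false;
        for (std::size_t i = 1; i < a.size(); ++i) {
            if (a[i - 1] > a[i]) {     // out-of-order adjacent pair
                std::swap(a[i - 1], a[i]);
                swapped = true;
            }
        }
    }
}
```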
(b) Benchmark Group-II: This is a memory-computation benchmark that requires writing data back<br />
to close addresses. This version of matrix multiply multiplies two 7 × 7 matrices in O(n^k) with<br />
k > 2. This runtime is obtained by implementing a vector-matrix multiplier, which stores an<br />
initial matrix away and repeatedly returns its product with an input vector.<br />
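The kernel can be sketched as follows (an illustrative C++ rendering; the benchmark itself runs on the stack processor, and the type names are ours):<br />

```cpp
#include <array>

// Vector-matrix multiplier as in benchmark group II: a stored 7x7 matrix is
// repeatedly applied to input vectors, and results are written back to
// neighbouring addresses.
constexpr int N = 7;
using Vec = std::array<int, N>;
using Mat = std::array<Vec, N>;

Vec multiply(const Mat& m, const Vec& x) {
    Vec y{};                           // y = m * x
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            y[i] += m[i][j] * x[j];
    return y;
}
```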
(c) Benchmark Group-III: The control benchmark processes data coming from sensors, previously<br />
stored in the memory. The outputs are stored in memory to be used later by the actuators. We<br />
chose logic and arithmetic equations for the data because some industrial systems need to control<br />
their actuators through this kind of equation. There are two assumptions: (i) measurements from<br />
the sensors are stored in memory; (ii) results will later be sent to the actuators.<br />
The control equations are:<br />
Y0 = A × [(X0 + X1) − (X2 − X3)]/[X4 × (−X5)]<br />
Y1 = NOT [(X6 OR X1) AND (X9 XOR X7)] AND [NOT (X8)]<br />
if ((X8 + X9) < A)<br />
Y2 = B × [(Y0 + X1) − (X9 × X8)]/[X1 − X5] + C<br />
else<br />
Y2 = [(Y0 + X1) − (X9 × X8)]/[X1 − X5] + D<br />
Y3 = NOT [(X6 OR Y1) AND (X9 XOR X7)] AND [NOT (Y1)]<br />
3.7.1 Model-I<br />
In Model-I (shown in figure 3.13) the processor is connected to the DM via a cache memory and a<br />
pair of journals. The journal pair masks errors so that they cannot propagate into the DM. The write<br />
operation (processor to memory) is modified and performed in three steps, as shown in figure 3.13:<br />
(i) the write is performed simultaneously in the cache and in the Un-Validated Journal (UVJ); (ii) if<br />
no error is detected, then at the VP the data from the UVJ is transferred to the Validated Journal (VJ,<br />
which contains only validated data); and (iii) finally, the validated data is written to the DM.<br />
At the VP all the last sure states of the SEs are preserved and validated. As shown in figure<br />
3.13, data is transferred from UVJ to VJ and finally to the DM. If an error is detected during an SD,<br />
the processor retries instruction execution from the preceding VP (as shown in figure 3.4). In this way, the
Figure 3.13: Model-I with data cache and a pair of journals (step 1: write to data cache and UVJ; step 2: UVJ to VJ at validation; step 3: VJ to DM; reads come from the cache)<br />
Figure 3.14: Cache with associative mapping (the address from the processor is compared with all stored addresses simultaneously; on a hit the data is returned, otherwise the address is not found in the cache)<br />
system restores its prior dependable states and the DM remains preserved from errors. On sequence<br />
validation, the un-validated data is validated by transferring it synchronously from the UVJ to the VJ<br />
in a single clock cycle. The processor can read directly from the cache memory.<br />
Associative mapping is employed in the cache: each block holds both the memory address and<br />
the corresponding data. The incoming address is simultaneously compared with all stored addresses<br />
using the internal logic of the associative memory, as shown in figure 3.14. If a match is found, the<br />
corresponding data is read out; otherwise the required data is read from memory. When new data<br />
is written into the cache, the controller first matches against the existing addresses so as to overwrite<br />
data at the same address. If no match occurs, the data is written in a new position together with its<br />
address. Associative memory is fast but also expensive; the cost depends on how big the cache is.<br />
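Functionally, the associative lookup can be sketched as follows (an illustrative C++ model using a hash map; the hardware instead performs all address comparisons in parallel with dedicated comparators, and the class and member names here are ours):<br />

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Functional model of the fully associative cache: each entry pairs a memory
// address with its data.
class AssocCache {
    std::unordered_map<uint32_t, uint32_t> entries;   // address -> data
public:
    std::optional<uint32_t> read(uint32_t addr) const {
        auto it = entries.find(addr);                 // "compare with all addresses"
        if (it == entries.end()) return std::nullopt; // miss: fetch from memory
        return it->second;                            // hit: data read out
    }
    void write(uint32_t addr, uint32_t data) {
        entries[addr] = data;  // overwrite on address match, else a new position
    }
};
```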
Model-I: Simulation Results<br />
A functional model (emulator) of the processor/journal/cache has been developed in C++. The<br />
emulator acts as a virtual machine, ahead of hardware implementation, for testing various fault models<br />
and protection techniques. In addition, it allows us to evaluate architectural choices and to record<br />
both the internal processor states and the program execution duration. It lets us compute the average<br />
clock cycles per instruction under different memory-access patterns.<br />
Figure 3.15: FT evaluation (benchmarks from the MPSoC application feed the stack-processor emulator under periodic, random and burst error injection, and the consequences are observed)<br />
Errors have been artificially injected into the processor emulator (as shown in figure 3.15) and<br />
the performance of the FT processor model has then been evaluated. The goal of this experimental<br />
setup is to evaluate the effect of error injection on system performance. For simplicity, the actual<br />
time overhead of the periodic saving of the SEs is ignored. The fault-injection profiles considered are<br />
shown in figure 3.16.<br />
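The three profiles of figure 3.16 can be modelled as follows (an illustrative sketch with our own naming; the emulator's actual injection code may differ):<br />

```cpp
#include <random>

// Periodic errors: one error every k-th program cycle.
struct Periodic {
    int k;
    bool hit(int cycle) const { return k > 0 && cycle % k == 0; }
};

// Random errors: each cycle is hit independently with probability p.
struct Random {
    std::mt19937 gen{42};              // fixed seed for reproducible runs
    std::bernoulli_distribution err;
    explicit Random(double p) : err(p) {}
    bool hit(int) { return err(gen); }
};

// Burst errors: a run of `len` consecutive faulty cycles starting at `start`.
struct Burst {
    int start, len;
    bool hit(int cycle) const { return cycle >= start && cycle < start + len; }
};
```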
The emulator receives the previously described groups of benchmarks, representing target appli-<br />
cations, together with a set of representative data. The input of the emulator is a classical hexadecimal<br />
file. Our evaluation criterion is the ratio of the average number of Clocks per Instruction (CPI) vs. the Error
Figure 3.16: Periodic, random and burst error models (errors plotted against the number of program cycles)<br />
Injection Rate (EIR). The goal of the simulations is to evaluate the performance degradation of the<br />
proposed model in the presence of high error injection rates.<br />
Figures 3.17, 3.18 and 3.19 present the results for benchmark groups 1, 2 and 3 with SDs of 10,<br />
50 and 100. CPI* is the clock cycles per instruction of the dependable architecture (with error<br />
injection) and CPI is the clock cycles per instruction without error injection. The ratio CPI*/CPI<br />
gives the proportion of additional clock cycles required for re-execution due to rollback.<br />
Figure 3.17 shows the simulation results of the computation benchmark. In this graph there are<br />
two horizontal reference lines: the bottom (green, continuous dotted) line extends the value of<br />
CPI*/CPI in the absence of errors, while the top (red, non-continuous dotted) line is drawn at 2 ×<br />
CPI*/CPI. Each figure contains three curves, drawn for SDs of 10, 50 and 100 respectively. The<br />
curves overlap at low Error Injection Rate (EIR). As the EIR increases, CPI*/CPI grows exponentially,<br />
and more sharply for the higher SDs of 50 and 100: at high EIR the re-execution rate rises, which in<br />
turn raises the overall CPI*/CPI ratio. This model gives a good CPI*/CPI ratio at low EIR, but at<br />
high EIR the ratio increases rapidly because of instruction re-execution on error detection and the<br />
additional clock cycles spent on cache misses. These two problems are addressed in Model-II.<br />
3.7.2 Model-II<br />
Model-II consists of three parts, a self-checking processor, a journal and the DM, as shown in figure<br />
3.20. It has a modified journal architecture with two internal parts, one containing validated<br />
data and the other containing un-validated data, as shown in figure 3.22.<br />
Figure 3.17: Model-I: additional CPI for benchmark group I (computation benchmark, periodic error injection; CPI*/CPI for SD = 10, 50 and 100 vs. EIR)<br />
When the processor needs to read, it checks the data in the Journal and the DM simultaneously<br />
(thanks to parallel access). Associative mapping is employed in the journal: each block holds both<br />
the memory address and the corresponding data (as shown in figure 3.22). If a journal miss occurs,<br />
the required data is delivered from the DM to the processor during the same clock cycle. If the data is<br />
found both in the journal and in the DM, the controller (MUX) prefers the data from the journal, as it<br />
is the most recently written (see figure 3.21).<br />
To allow simultaneous read and write, the journal must have two address ports, since some<br />
instructions need two journal operations (one read and one write) at the same time. Newly written<br />
data is stored in the UVJ. If no error is detected in the sequence, the data is validated (as shown in<br />
figure 3.22) and transferred to the VJ. On the other hand, if an error is detected, all the data written<br />
during the sequence is discarded (as shown in figure 3.23) and the processor rolls back and restarts<br />
execution from the last known states of the SEs. In the next section, we evaluate the performance<br />
of this architecture.<br />
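The behaviour just described can be captured in a small functional model (an illustrative C++ sketch with our own naming, not the RTL developed in chapter 5):<br />

```cpp
#include <cstdint>
#include <unordered_map>

// Functional model of the Model-II journal: un-validated writes go to `uv`;
// sequence validation moves them to `v`; rollback discards them.
struct Journal {
    std::unordered_map<uint32_t, uint32_t> uv, v;  // un-validated / validated

    void write(uint32_t a, uint32_t d) { uv[a] = d; }
    void validate() {                  // at the VP: UVJ -> VJ
        for (const auto& e : uv) v[e.first] = e.second;
        uv.clear();
    }
    void rollback() { uv.clear(); }    // error: drop everything from this sequence

    // Returns true on a journal hit, preferring the most recent write;
    // on a miss the data would come from the DM in the same clock cycle.
    bool read(uint32_t a, uint32_t& d) const {
        if (auto it = uv.find(a); it != uv.end()) { d = it->second; return true; }
        if (auto it = v.find(a); it != v.end()) { d = it->second; return true; }
        return false;
    }
};
```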
Model-II: Simulation Results<br />
The experimental protocol remains the same as for Model-I. It can be observed from the<br />
simulation curves of figures 3.24, 3.25 and 3.26 that Model-II is more efficient than Model-I: even<br />
in the presence of high error rates, the CPI*/CPI required to run the dependable architecture is<br />
significantly smaller than for the previous architecture.<br />
Figure 3.18: Model-I: additional CPI for benchmark group II (permutation benchmark, random error injection; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
From the simulation results of figures 3.24, 3.25 and 3.26, the CPI*/CPI ratio is smaller than for<br />
Model-I. In figure 3.24, with an SD of 10, when the EIR varies from 2e-4 to 2e-2 the additional CPI<br />
increases by only 50% (CPI*/CPI reaches 1.5 on the y-axis): the execution time grows by only 50%<br />
even though the EIR becomes 100 times higher. This demonstrates good performance for the proposed<br />
architecture. For example, if we accept a 50% increase in CPI, with an SD of 10 there can be 20 errors<br />
per 1000 instructions, whereas with an SD of 50 there can be only 6 errors per 1000 instructions.<br />
Furthermore, the SD has a direct impact on the size of the journal memory and consequently on its area.<br />
3.7.3 Comparison<br />
A comparison between Model-I and Model-II is summarized in table 3.1. In both models, the<br />
effect of rollback is more pronounced at higher SDs such as 50 and 100. For example, since a VP<br />
occurs only after every hundred instructions when SD = 100, there are more chances for errors to<br />
strike within a sequence, so the CPI*/CPI ratio rises more rapidly than for SD = 10 or 50: the large<br />
interval between two consecutive VPs leaves more room for error occurrence and thus increases the<br />
rate of instruction re-execution.<br />
From the performance point of view, in Model-II the parallel access to the memory and the journal<br />
on read operations increases the overall efficiency of the system, resulting in lower CPI ratios at<br />
higher EIRs than Model-I; no clock cycles are wasted when data is not found in the journal. It thus<br />
outperforms the previous model, as shown in figures 3.24, 3.25 and 3.26.<br />
Figure 3.19: Model-I: additional CPI for benchmark group III (control benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
Figure 3.20: Block diagram of Model-II (the self-checking processor core reads from and writes to the HW journal, reads the DM directly, and the journal writes validated data to the DM)<br />
From the dependability point of view, Model-II is the better choice because its single journal gives<br />
it a minimal hardware overhead compared to Model-I; more area exposed to the environment also<br />
increases the chances of provoking errors. Both problems, performance degradation at high error<br />
injection rates and effective on-chip area, are addressed better than in Model-I. Therefore we choose<br />
Model-II for further development; the results obtained are encouraging enough to carry on the<br />
research relying on this model. In the next two chapters, we design the processor (chapter 4) and<br />
the journal (chapter 5).<br />
Figure 3.21: Processor can simultaneously read from Journal and DM (the address from the processor is compared with all stored journal addresses simultaneously; on a MISS the main memory is accessed)<br />
Figure 3.22: No error detected during SD and data is validated at VP (the data written during the sequence is marked validated at the VP and can be transferred towards main memory)<br />
3.8 Conclusions<br />
This chapter has summarized an alternative approach to designing an FT processor. We have presented<br />
the architecture specification and design methodology of the proposed scheme. It is a combined<br />
hardware/software approach in which error detection is achieved concurrently by hardware means using<br />
parity codes, and rollback is used for error recovery. The major advantage of this scenario is the ability to
Figure 3.23: Error detected and all the data written during SD is deleted (the un-validated data cannot be transferred towards main memory; the processor rolls back to the preceding VP)<br />
Table 3.1: Comparison of the Processor-<strong>Memory</strong> Models<br />
read from memory (DM) — Model-I: Processor ⇐ Cache ⇐ DM; Model-II: Processor ⇐ DM<br />
read from Cache/Journal — Model-I: Processor ⇐ Cache; Model-II: Processor ⇐ Journal<br />
write to DM — Model-I: Processor ⇒ UVJ ⇒ VJ ⇒ DM; Model-II: Processor ⇒ Journal ⇒ DM<br />
Cache/Journal size requirement — Model-I: comparatively bigger cache required (to avoid cache MISS); Model-II: no MISS in the Journal thanks to parallel access<br />
Performance — Model-I: medium performance at high error rate; Model-II: reasonably good performance even at high error rate<br />
have an effective FT mechanism with limited hardware and time overheads. The overall methodology<br />
can succeed only if certain design challenges are respected: choosing an appropriate processor<br />
with a minimum of internal states to load and store, designing an intermediate self-checking hardware<br />
journal that prevents errors from entering the dependable memory, and selecting a reasonable<br />
sequence duration for a given error rate.<br />
The last part of the chapter was dedicated to defining the processor-memory interface. Ac-<br />
cordingly, we proposed two different models, Model-I and Model-II. On comparison, Model-II has been<br />
chosen for further development into a VHDL-RTL model, being the more reasonable from the dependabil-<br />
ity and performance points of view. In this model, on a write to memory the data passes via the temporary
Figure 3.24: Model-II: additional CPI for benchmark group I (permutation benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
Figure 3.25: Model-II: additional CPI for benchmark group II (computation benchmark; CPI*/CPI for sequence durations of 10, 50 and 100 vs. EIR)<br />
storage towards the DM, while on a read the processor can read directly from the DM. In this way the DM<br />
remains preserved from error propagation coming from the processor. In the next chapters, we develop<br />
the VHDL-RTL model of the FT processor.<br />
Figure 3.26: Model-II: additional CPI for benchmark group III (control benchmark, burst error injection; CPI*/CPI for SD = 10, 50 and 100 vs. EIR)<br />
Chapter 4<br />
Design and Implementation of a Self<br />
Checking Processor<br />
We aim to design a fault-tolerant processor made of two parts: a self-checking processor core (SCPC)<br />
and a self-checking hardware journal (SCHJ). In this chapter, we focus only on the design of the<br />
SCPC, as highlighted in figure 4.1.<br />
Figure 4.1: Design of a self checking processor core (SCPC) (the SCPC is highlighted alongside the SCHJ and the DM)<br />
To explore the SCPC, the chapter is divided into several sections. In the first section, we<br />
start modelling the processor by choosing the architecture family that best fulfils the design<br />
objectives identified in chapter 3. Next, the (non-FT) processor hardware model is presented and<br />
explored, and its performance and dependability challenges are identified; the later sections address<br />
their solutions. A generic model is described in VHDL-RTL (Register Transfer Level) and synthesized<br />
with Altera Quartus II. Experimental results are presented in terms of throughput (number of bits<br />
processed per second) and area usage. Finally, the fault-tolerance capacity of the SCPC is validated<br />
in chapter 6.<br />
4.1 Processor Design Strategy<br />
The FT strategy we have chosen was discussed in section 3.2. Here we choose a processor<br />
architecture that fits it well, as presented in figure 4.2 (a continuation of the previously presented<br />
figure 3.2). Hardware-based concurrent error detection is expensive when there are many internal<br />
states to check, which runs against the design constraints; fast error detection at low hardware<br />
overhead therefore requires a processor with a minimum of internal states to check (figure 4.2<br />
presents the criteria behind the processor choice). The software-based rollback mechanism has low<br />
hardware overhead but becomes slow when the processor has many internal states to save periodically.<br />
Moreover, rollback on error detection is faster when the number of internal states to restore is<br />
lower.<br />
Consequently, our choice goes to a processor architecture belonging to the minimum instruc-<br />
tion set computer (MISC) class, which possesses many of the characteristics desired for our design<br />
strategy. Within the MISC class, we have chosen a basic stack processor architecture: a simple and<br />
flexible architecture with a very reduced number of internal registers [JDMD07]. An alternative choice<br />
could be an accumulator-based processor, but such processors depend heavily on random access to<br />
memory and are therefore less efficient. The chosen stack processor architecture also relies on memory<br />
accesses, but most of them are very predictable (neighbouring addresses), being related to stack<br />
operation, and can be handled very effectively.<br />
4.1.1 Advantages of Stack Processors<br />
Stack processors offer several additional advantages from both the protection and performance points<br />
of view. Some classical advantages are discussed in [KJ89]; others are detailed below.<br />
A stack-based processor can yield a more reliable architecture than its RISC counterpart because<br />
it has fewer internal states and a smaller on-chip area, which reduces its exposure to environmental<br />
contamination. Most RISC-based approaches contain register banks that make them more sensitive to<br />
SEUs and MBUs, whereas in a stack processor the number of internal registers is far smaller. For<br />
example, the stack processor presented in [Jal09] has six internal registers: TOS (Top of Stack),<br />
NOS (Next of Stack), TORS (Top of Return Stack), IP (Instruction Pointer), DSP (Data Stack Pointer)<br />
and RSP (Return Stack Pointer). This is far fewer than in modern RISC processors (e.g. the LEON3FT<br />
has more than 150 internal registers [Aer11]). In an FT processor, all internal registers must be<br />
protected against transient faults [ARM + 11].<br />
Furthermore, the many internal registers of RISC-based architectures widen the instruction word:<br />
more registers mean larger address decoding, which increases the propagation delay. That is why<br />
RISC (and modern CISC) processors need multi-stage pipelining to restore average throughput by<br />
hiding internal latency, together with better branch scheduling. For example, the Pentium 4 has a<br />
20-stage pipeline, and a miss in the caches or branch-prediction buffers can incur a 30-cycle penalty<br />
for a mispredicted branch (20 cycles in the pipeline, 10 in the memory), whereas the RTX (a stack-<br />
based processor) has a fixed 2-cycle overhead in all cases [PB04]. Furthermore, a processor's natural
Figure 4.2: Criteria behind the choice of the stack processor (fault-tolerant computing with low hardware and time trade-offs → fast concurrent error detection and a low-overhead rollback mechanism → a processor with minimum internal states → MISC → stack processor)<br />
resistance against SEUs decreases as the number of pipeline stages increases [MW04].<br />
Fast periodic backup<br />
Stack processors have various advantages over RISC-based machines, such as higher clock speeds, low<br />
procedure-call overhead and fast interrupt handling [Sha06]. The clock speed is higher because<br />
instructions operate between the two tops of the stack (given internal stack caching or an internal<br />
hardware stack). Procedure-call overhead is low because only a few registers need to be saved<br />
to memory across procedure calls. Interrupt handling is fast since interrupt routines can execute<br />
immediately, the hardware taking care of stack management. An architecture of a stack-based Java<br />
processor has been evaluated in [Sch08], and the results show better performance and a smaller gate<br />
count than a RISC-family processor on an FPGA.<br />
Commercially, stack processors have been used for medical imaging, hard-disc drives and satellite<br />
applications; well-known examples include the Novix NC4000, the Harris RTX2000 and the Silicon Com-<br />
posers SC32 [PB04]. They are deployed in space applications for their reasonable performance and low<br />
power overheads [HH06], e.g. SCIP, a stack-based processor designed for spaceships [Hay05]. Re-<br />
cently, the GreenArrays project has employed stack-based architectures to design multi-computer chips,<br />
with attractive features such as minimum cost and energy combined with high
80 CHAPTER 4. DESIGN AND IMPLEMENTATION OF A SELF CHECKING PROCESSOR<br />
performance [Gre10, Bai10].<br />
The FT-processor design serves our long-term objective: devising a new fault-tolerant<br />
multi-resource system based on message passing, in which the current fault-tolerant processor design<br />
will be used as a processing node. It was clear from the beginning that severe constraints on<br />
area consumption apply to the architectural design of a single node in order to match the future<br />
massively parallel objective, while preserving individual performance as much as possible. The stack<br />
machine remains a viable architecture, owing to its smaller size and lower cost and power requirements.<br />
Stack processors can yield simple and smart cores for parallel distributed applications [Gre10].<br />
On the other hand, the stack machine favours sequential instruction execution. It fits<br />
control-dominant applications well, but is less suitable for data-dominant applications such as<br />
video streaming.<br />
4.2 Proposed Architecture<br />
The architecture of the stack processor has been presented in [Jal09]. It is inspired by the second-<br />
generation canonical stack processor [KJ89]. The stack taxonomy is based on three attributes: the<br />
number of stacks, the size of the stack buffer memories, and the number of operands in the<br />
instruction format. They are represented by three coordinate axes in figure 4.3. These dimensions<br />
allow various combinations. Among these choices, the canonical stack machine has multiple,<br />
large stacks and is a 0-operand (ML0) machine, as shown in figure 4.3. 0-operand means that<br />
all instruction operand locations are implicit, so it is not necessary to give their addresses in the<br />
instruction. In the case of a stack, the implicit location is the top of the stack.<br />
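To make the 0-operand model concrete, the following minimal sketch (in Python, not the VHDL of the actual design) shows how an ML0 machine executes instructions without any operand addresses: sources and destination are the implicit tops of the data stack. The opcode names mirror the thesis's instruction set; the evaluator itself is an illustrative assumption, not the real control logic.<br />

```python
# Minimal 0-operand (ML0) evaluator: operands are the implicit tops of the
# data stack, so opcodes carry no addresses. Illustrative sketch only.
def run(program, data_stack=None):
    ds = data_stack or []          # data stack (DS); ds[-1] is TOS, ds[-2] is NOS
    for op, *arg in program:
        if op == "LIT":            # push an immediate constant
            ds.append(arg[0])
        elif op == "ADD":          # TOS <- TOS + NOS, both operands implicit
            ds.append(ds.pop() + ds.pop())
        elif op == "DUP":          # duplicate TOS
            ds.append(ds[-1])
        elif op == "DROP":         # discard TOS
            ds.pop()
    return ds

# LIT 3, LIT 4, ADD leaves 7 on TOS with no operand addressing anywhere.
print(run([("LIT", 3), ("LIT", 4), ("ADD",)]))  # [7]
```

Note how ADD names no registers at all: this is why an 8-bit opcode suffices for the whole instruction set.<br />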
To satisfy the simplicity requirement there are two stacks 1 : the data stack (DS) and the return stack (RS).<br />
The first is used for expression evaluation and subroutine parameter passing. The second is used for<br />
subroutine return addresses, interruption addresses and temporary data copies. The two stacks allow<br />
accessing multiple values within one clock cycle, which improves speed. With a separate stack<br />
for return addresses, subroutine calls and returns can be performed in parallel<br />
with data operations. This reduces program size and system complexity, which improves system<br />
performance.<br />
Concerning the size of the stack buffers, we have chosen large stack buffers residing in the dependable<br />
memory, which allows data to be stored repeatedly without loss. The DM is on-chip, so the data can be<br />
accessed in a single clock cycle. In addition, there is no restriction on the stack depth.<br />
There are three registers named TOS (Top Of Stack), NOS (Next Of Stack) and TORS (Top Of<br />
Return Stack), which hold the top of the data stack (DS), the next element of the DS and the top of<br />
the return stack (RS) respectively. NOS and TORS do not exist in the canonical model; they proved useful,<br />
allowing a simplified instruction set [Jal09]. The DS and RS stacks reside in main memory (DM), a<br />
feature similar to first-generation stack machines. They do not have address registers but are addressed<br />
by internal pointers, namely the data stack pointer (DSP) and the return stack pointer (RSP). We have chosen<br />
1 According to the Turing definition, the minimal number of stacks for a pure stack machine is two [KJ89].
[Figure 4.3: Multiple and large stacks, 0-operand (ML0) computer chosen from the three-axis design space]<br />
this feature to protect the data contents because, according to our hypothesis, the DM is dependable<br />
storage and these stacks therefore remain fault-secure.<br />
The proposed stack-based architecture contains the data bus, the data stack (DS) and return stack (RS)<br />
with their top-of-stack registers, the arithmetic/logic unit (ALU), the instruction pointer register (IP), the<br />
instruction buffer with instruction register, and the control logic (hardwired control), as shown in figure 4.4.<br />
The input/output module (shown in figure 4.4) requires special management to be fault-tolerant;<br />
this is not treated in this work.<br />
The ALU performs arithmetic and logic operations, including addition, subtraction, logical<br />
functions (AND, OR, XOR), test for zero and others. It operates on the top of the data<br />
stack (operands and result): TOS and NOS are the two top elements of the DS, while TORS is the top<br />
element of the RS. The IP holds the address of the next instruction to be executed. The IP may be loaded<br />
from the bus to implement branches, or incremented to fetch the next sequential instruction<br />
from program memory. Like the DS and RS, the program memory also resides in the DM.<br />
The MAR (Memory Address Register) unit that exists in the canonical stack processor has been elim-<br />
inated from our model, because the program memory together with the IP (Instruction Pointer) is sufficient<br />
to manage all the instructions and provide the address of the next instruction to be executed. The result-<br />
ing processor has a simple set of 37 instructions, each executed in one clock cycle except<br />
one (the STORE instruction), which requires two clock cycles. The complete instruction set<br />
of 37 instructions is given in appendix B, where all the instructions are expressed at<br />
RTL (Register Transfer Level). Thanks to the limited instruction set and the 0-operand model, an 8-bit<br />
opcode is sufficient to represent all instructions.<br />
Table: List of Acronyms<br />
DS: Data Stack<br />
RS: Return Stack<br />
I/O: Input-Output<br />
NOS: Next Of Stack<br />
TORS: Top Of Return Stack<br />
TOS: Top Of Stack<br />
ALU: Arithmetic Logic Unit<br />
IP: Instruction Pointer<br />
[Figure 4.4: Simplified stack machine]<br />
4.3 Hardware Model of the Stack Processor<br />
The processor hardware model has been described in VHDL at the RTL level. The initial processor<br />
model (non-FT version) has been synthesized with Altera Quartus II (version 7.1).<br />
It consists of the arithmetic and logic unit (ALU), internal registers, instruction buffer, control unit<br />
and the data path connecting them, as shown in figure 4.5. The DS and RS are addressed by the two pointer<br />
registers DSP and RSP respectively. The three on-chip registers TOS, NOS and TORS resolve<br />
possible conflicts when transferring data between the two stacks, e.g. during execution of<br />
R2D, D2R, OVER or ROT. As an illustration, consider the R2D (Return Stack to Data Stack)<br />
instruction. Thanks to the availability of TORS, TOS and NOS inside the processor, no conflict in accessing<br />
the data bus occurs: the contents of TORS are written into TOS, TOS into NOS, the DSP is incremented,<br />
NOS is written into the DS, RS[RSP] is read into TORS and the RSP is decremented.<br />
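The R2D data movements can be written out step by step in a small Python sketch (an idealized model, not the actual RTL): every transfer has its own source register, so none of them competes for the data bus. Register names follow the thesis; the state encoding is an assumption for illustration.<br />

```python
# One-cycle R2D (return stack -> data stack) move, as described in the text.
# State: TOS/NOS/TORS registers plus DS/RS stacks addressed by DSP/RSP.
def r2d(state):
    s = dict(state)
    s["DSP"] += 1                        # make room on the data stack
    s["DS"] = s["DS"] + [state["NOS"]]   # NOS spills into DS[DSP]
    s["NOS"] = state["TOS"]              # TOS shifts down to NOS
    s["TOS"] = state["TORS"]             # TORS enters the data-stack side
    s["TORS"] = s["RS"][s["RSP"]]        # refill TORS from RS[RSP]
    s["RSP"] -= 1                        # pop the return stack
    return s

before = {"TOS": 1, "NOS": 2, "TORS": 9, "DS": [], "RS": [7, 8], "DSP": -1, "RSP": 1}
after = r2d(before)
print(after["TOS"], after["NOS"], after["TORS"])   # 9 1 8
```

All reads use the old state, modelling the register transfers happening in parallel within one clock.<br />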
In a stack processor, data execution is normally faster than in classical processors because data<br />
is implicitly available on the two tops of the stack, instead of having to be read from addressed<br />
registers or memory. This effectively reduces the length of the critical path. For a better understanding,<br />
the simplified data path for arithmetic and logic instructions is shown in figure 4.6. The processor<br />
reads memory in parallel to compensate for the 'one element less' on the stack balance; the memory read<br />
just fills the place left empty by the instruction execution (for the next instruction). Therefore,<br />
[Figure 4.5: Modified stack processor]<br />
the processor does not need to wait for address decoding before accessing the operands.<br />
Mostly, each block of program memory (16-bit) contains two successive instructions (8-bit +<br />
8-bit). The instructions residing in program memory pass through the instruction buffer (IB) and are<br />
decoded in the control unit, which activates all the MUXes accordingly. The IB<br />
is fed with a pair of bytes, LSB followed by MSB, as shown in figure 4.18. It consists of cascaded<br />
byte-size (8-bit) buffers connected via multiplexers, which control the flow of instructions in the<br />
IB (as shown in figure 4.18). The interconnection between the multiplexers is controlled by the<br />
instruction buffer management unit (IBMU) (as shown in figure 4.6). The IB and IBMU will be<br />
discussed later in detail.<br />
Although the stack processor has fixed-length opcodes, some instructions need additional infor-<br />
mation (immediate data) to be executed. Therefore the instruction block is sometimes larger than 8 bits.<br />
Such instructions include branches and calls that need an additional 16-bit address (or 8-bit displacement)<br />
to determine the target address (e.g. see appendix C for ZBRA d, SBRA d) and instructions requiring<br />
an immediate data constant (e.g. LIT a, DLIT a). This challenge is further addressed in section 4.4.2.<br />
The control unit manages the components of the processor; it reads and decodes the program<br />
instructions, transforming them into a series of control signals which activate other parts of the processor<br />
via the MUXes. The salient jobs of the control unit are:<br />
• Decode the numerical instruction code into a set of commands or signals for each of the<br />
MUXes;<br />
• Update the DSP and RSP pointers;<br />
• Activate read or write from memory according to the active instruction;<br />
• Select the correct operation in the ALU.<br />
The IP points to the next instruction to be executed. The IP prepares the program memory to feed<br />
the IB according to the next instruction execution requirements.<br />
[Figure 4.6: Simplified data-path of the proposed model (arithmetic and logic instructions)]<br />
The execution of conditional/unconditional branches has been discussed in [KJ89] and further<br />
explored in the design of the modified stack processor [Jal09]. The stack processor is fast in branch<br />
execution due to its minimal pipelining [PB04]. However, every branch instruction is followed by a NOP<br />
(no-operation) because the IB is flushed to load new instructions, which is a performance penalty. This issue<br />
has not been addressed in this work; however, one possible solution is proposed in section 4.6.3.<br />
4.4 Design Challenges in FT Stack Processor<br />
This section is dedicated to the implementation of the FT methodology in the stack processor. The<br />
required architecture should have self-checking ability along with minimum performance degradation.<br />
These two challenges are addressed in this section.<br />
4.4.1 Challenge I: Self Checking Mechanism<br />
An architecture with a minimum number of internal registers does not guarantee that errors<br />
cannot be provoked. External disturbances can still contaminate the execution<br />
of the processor (even if far less frequently than in other classes of processors such as RISC and CISC<br />
implementations). A self-checking mechanism is therefore needed for the internal registers and the ALU.<br />
4.4.2 Challenge II: Performance Improvement<br />
Depending on the architectural choices made to implement the FT stack processor, two per-<br />
formance limitations restrict the overall execution speed: (i) multi-clock in-<br />
struction execution and (ii) multiple-byte instruction blocks. Both issues add delays<br />
to program execution.<br />
Challenge II-a: Multi-clock Instruction Execution<br />
Most instructions require a single clock cycle in the data path to be executed, but a<br />
few require multiple clock cycles, such as DUP, OVER, R2D, CPR2D, D2R,<br />
FETCH, STORE, PUSH_DSP, PUSH_RSP, LIT, DLIT and CALL. Their minimal clock count cannot be<br />
one in a non-pipelined architecture because of conflicts in accessing the data bus in the same direction.<br />
[Figure 4.7: Different instruction types from the execution point of view (without pipelining): most instructions take 1 clock, some take 2 clocks, and only one instruction (STORE) takes 3 clocks]<br />
For a better understanding, let us examine the DUP (duplication) instruction, which requires two clock<br />
cycles in the data path. Here, the contents of TOS must be copied<br />
into NOS and the contents of NOS transferred to the third position in the DS (pointed to by the DSP).<br />
If the instruction were executed in one clock cycle, we would lose the data in the third element of the<br />
DS, because NOS would be written at the address pointed to by the DSP; without a prior increment<br />
of the DSP, one data element is lost.<br />
On the other hand, DUP can be successfully executed in two clock cycles: during the first clock cycle<br />
(t) a new place is created by computing DSP + 1, and in the next cycle (t + 1) the contents of NOS<br />
are written into DS[DSP] and TOS into NOS, as shown in figure 4.8. Such multi-cycle<br />
instructions degrade performance and call for execution pipelining.<br />
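The two-cycle schedule can be sketched as follows (a Python model of the per-cycle steps, under the assumption of one DS memory access per cycle; not the real control logic):<br />

```python
# Why DUP needs two clocks: the slot DS[DSP+1] must exist before NOS can
# spill into it, and only one DS access is possible per cycle.
def dup_two_cycles(tos, nos, ds, dsp):
    # cycle t: create a new place on the data stack
    dsp = dsp + 1
    # cycle t+1: NOS -> DS[DSP], TOS -> NOS (TOS itself is duplicated)
    ds = ds + [None] * (dsp + 1 - len(ds))  # grow the modelled stack memory
    ds[dsp] = nos
    nos = tos
    return tos, nos, ds, dsp

# TOS=5, NOS=3, DS=[3]: after DUP the 3 survives in DS and 5 sits in TOS and NOS.
tos, nos, ds, dsp = dup_two_cycles(5, 3, [3], 0)
print(tos, nos, ds, dsp)   # 5 5 [3, 3] 1
```

Collapsing both steps into one cycle would overwrite DS[DSP] before the pointer moves, which is exactly the data loss described above.<br />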
[Figure 4.8: Execution of the duplication (DUP) instruction in 2 clock cycles]<br />
Challenge II-b: Multiple-byte Instructions Block<br />
The instruction opcodes are one byte long. They have implicit source and destination reg-<br />
isters and do not need explicit addressing; e.g. the instruction ADD (addition) means adding TOS to<br />
NOS and storing the result in TOS. However, 7 of the 37 instructions require an additional parameter,<br />
either an immediate 8-bit or 16-bit constant (LIT and DLIT respectively), a 16-bit<br />
absolute address (LBRA, CALL) or an 8-bit displacement (SBRA, ZBRA) (see appendix C, table 8).<br />
On the other hand, the program memory is 16 bits wide whereas the instruction opcode is 8<br />
bits. The average flow (in bits) of instructions being executed is lower than the instruction pre-fetching<br />
flow capacity, the first being close to 8 bits while the latter is closer to 16 bits. Therefore, instructions<br />
are loaded into the IB at almost twice the rate of execution (considering that an 8-bit opcode is executed in<br />
one clock cycle). This requires intelligent instruction buffer management to: (i) monitor the input and output<br />
flow; (ii) manage the flow of variable-block instructions (such as LIT and LBRA).<br />
To execute a single instruction per cycle, the next instruction to be executed should reach the control<br />
unit in the next clock cycle (t + 1). Figure 4.9 illustrates this issue. First, we<br />
suppose that the instruction being executed is ADD (the instruction to be executed lies<br />
[Figure 4.9: Multiple-byte instructions]<br />
in register R5). It is a 1-byte instruction, so the opcode of the next instruction to be executed must be<br />
in register R4. This opcode must reach the control unit in the next clock cycle (t + 1). Second, if we<br />
suppose LIT a is the active instruction (at time t), then in the next clock (t + 1) the contents of R3 should<br />
reach the control unit.<br />
The solutions to the above-mentioned challenges are addressed in the following sections.<br />
4.5 Solution-I: Self Checking Mechanism<br />
The processor needs error detection inside the ALU and its internal states. We start with<br />
the design of a self-checking ALU.<br />
4.5.1 Error Detecting in ALU<br />
There is no single code that can simultaneously protect both arithmetic and logic opera-<br />
tions. Consequently, we use a combination of arithmetic and logic codes (often called<br />
'combination codes') to protect both: a modulo-3 residue code for arithmetic opera-<br />
tions on one side, and a parity code for logic operations on the other. We have chosen these<br />
codes because they are simple and yet effective enough to prove the effectiveness of our approach.<br />
Moreover, they require minimal resources, as can be seen from the results in [SFRB05].<br />
In [SFRB05], ALUs designed with different error detection techniques were simulated using the<br />
Quartus II simulation tool provided by Altera. The FPGA resource utilization of two built-in-<br />
error-detection (BIED) techniques (Berger check and residue/parity code check) was recorded from<br />
the simulation. Figure 4.11 shows the resource utilization comparison for the two BIED<br />
techniques together with a TMR ALU and an ALU without any error detection.<br />
[Figure 4.10: Data-path of the protected processor's ALU]<br />
It is obvious from figure 4.11 [SFRB05] that the EDALU (error-detecting ALU with modulo-3<br />
residue/parity check) uses 54% fewer logic elements than the TMR ALU, and the Berger<br />
check prediction ALU uses 42% fewer logic elements than the TMR ALU. According to these<br />
results, the ALU with residue/parity check clearly has better resource utilization than the one with Berger<br />
codes.<br />
ALU instructions fall into two groups: arithmetic and logical (see figure 4.12). By grouping the<br />
instructions, the active area of the circuit at any instant is reduced [SFRB05]. For example, a strike<br />
on the module generating the arithmetic parity would not affect the logical module, and vice versa.<br />
Error Detecting in Arithmetic Instructions<br />
A remainder calculated from the data symbols X and Y can protect an arithmetic operation in the<br />
ALU. Here, error detection in arithmetic instructions is based on a modulo-3 residue check.<br />
In the ALU, they are executed as two concurrent computations (as shown in figure 4.13). On one hand,<br />
the two operands, X and Y, undergo the arithmetic operation and the result is stored in S. On the other, the<br />
[Figure 4.11: Resource utilization chart for various ALU designs [SFRB05]: TMR (triple modular redundancy) ALU 3106 logic elements, Berger code check (BC) ALU 1792, residue/parity code check (ED) ALU 1432, unprotected ALU 986]<br />
residues, PA_X and PA_Y, undergo the equivalent arithmetic operation and generate the predicted residue.<br />
The outcome, PA_S, is stored with S (as shown in figure 4.13). In the next clock, the residue generator<br />
(mod-3 generator) produces PA'_S (the residue of S), which is compared with the already stored<br />
residue PA_S. In case of discrepancy, the error alarm signal is raised. The mathematics behind the<br />
residue check codes is shown below:<br />
X = x_n x_{n-1} x_{n-2} ... x_2 x_1 x_0<br />
Y = y_n y_{n-1} y_{n-2} ... y_2 y_1 y_0<br />
where X and Y are the data/information symbols applied to the input of the ALU, and<br />
C = c_m c_{m-1} c_{m-2} ... c_2 c_1 c_0<br />
is the check divisor used to calculate the residue. The remainders determined by dividing the<br />
ALU data symbols X and Y by the check divisor C are<br />
PA_X = X mod C,<br />
PA_Y = Y mod C,<br />
where PA_X and PA_Y represent the remainders of X and Y respectively. The ALU output is<br />
S = X ∘ Y, where ∘ = ADD, SUB or MUL,<br />
and its regenerated residue is<br />
PA'_S = S mod C.<br />
PA_S is the predicted remainder check symbol, which is given by<br />
PA_S = (PA_X ∘ PA_Y) mod C.<br />
The error signal is generated by the comparator, according to the following function:<br />
[Figure 4.12: The ALU protects the logical and arithmetic instructions separately: logical instructions with a 1-bit parity check, arithmetic instructions with a 2-bit mod-3 residue]<br />
Error Signal = 1 if PA_S ≠ PA'_S<br />
Error Signal = 0 if PA_S = PA'_S<br />
PL_S represents the logic parity, which will be generated locally for the next instruction.<br />
For instance, if X = 10, Y = 11 and C = 3:<br />
Residue of X: PA_X = 10 mod 3 = 1<br />
Residue of Y: PA_Y = 11 mod 3 = 2<br />
First concurrent computation: S = X + Y = 10 + 11 = 21<br />
Residue of the first computation: PA'_S = 21 mod 3 = 0<br />
Addition of PA_X and PA_Y: 1 + 2 = 3<br />
Residue: PA_S = 3 mod 3 = 0<br />
Thus, the residues PA_S and PA'_S are equal; therefore, no error.<br />
Error Detecting in Logic Instructions<br />
Error detection in logical instructions is based on the calculation of a parity bit from the information<br />
symbols in X and Y. The parity calculation is simple: the parity bit is calculated by XORing<br />
the information bits. Of the two variants, even parity and odd parity, we use<br />
even parity, which means that the parity bit is set to 1 if the number of ones in a given set of bits<br />
(not including the parity bit) is odd, making the entire set of bits (including the parity bit) even. By<br />
comparing the parity bits of the input and the output, the error signal is set high or low.<br />
[Figure 4.13: Remainder check technique for error detection in arithmetic instructions]<br />
This can be represented by the simple logical equations below:<br />
PL_X = x15 XOR x14 XOR x13 XOR ... XOR x0<br />
PL_Y = y15 XOR y14 XOR y13 XOR ... XOR y0<br />
where PL_X and PL_Y represent the parity of X and Y respectively, with<br />
X = (x15 x14 x13 ... x0),<br />
Y = (y15 y14 y13 ... y0).<br />
The logic operation is S = X ∘ Y, where ∘ = AND or OR.<br />
The predicted parity is PL_S = PL_X ⊕ PL_Y,<br />
and the regenerated parity is PL'_S = s15 XOR s14 XOR ... XOR s0.<br />
The error signal is generated by the comparator, according to the following function:<br />
Error Signal = 1 if PL_S ≠ PL'_S<br />
Error Signal = 0 if PL_S = PL'_S<br />
Similarly, PA_S will be generated locally for the next instruction (if needed). It is a synchronous<br />
system and the error will be detected in the next clock. In the ALU, the error latency is one clock cycle.<br />
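The two checks can be sketched together in software (a Python model with C = 3 and 16-bit words as in the text; the hardware versions are concurrent circuits, not sequential code, and the logic-side sketch uses XOR, for which the parity of the result is exactly PL_X ⊕ PL_Y):<br />

```python
# Concurrent-check model: the predicted check symbol is computed from the
# operands' check symbols, then compared with the symbol regenerated from
# the ALU result S. A mismatch raises the error signal.
def residue_check_add(x, y, c=3):
    pax, pay = x % c, y % c            # stored residues of the operands
    s = x + y                          # main ALU computation
    pas = (pax + pay) % c              # predicted residue PA_S
    pas_regen = s % c                  # regenerated residue PA'_S
    return s, pas != pas_regen         # error signal = 1 on mismatch

def parity(v, width=16):               # even parity over the data bits
    return bin(v & ((1 << width) - 1)).count("1") & 1

def parity_check_xor(x, y):
    pls = parity(x) ^ parity(y)        # predicted parity PL_S (exact for XOR)
    s = x ^ y                          # logic operation
    return s, pls != parity(s)         # error signal = 1 on mismatch

print(residue_check_add(10, 11))       # (21, False): residues agree, no error
```

Corrupting S between the main computation and the regeneration step would make the two symbols disagree, which is what the hardware comparator detects one clock later.<br />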
[Figure 4.14: Parity check technique for error detection in logic instructions]<br />
4.5.2 Error Detecting in Register and Data-Path<br />
For error detection in registers, we again rely on parity codes. Each register concurrently checks for<br />
errors by matching the regenerated parity bit against the already stored one. A mismatch raises the error<br />
signal (as shown in figure 4.15). Note that a single parity check can only detect single-bit<br />
errors, or more generally errors of odd multiplicity.<br />
[Figure 4.15: Parity check technique for error detection in register(s)]<br />
4.5.3 Self-Checking Processor<br />
With the protections of subsections 4.5.1 and 4.5.2, the processor has built-in self-checking facilities to<br />
detect SBUs. The error coverage could be improved with alternative EDCs; however, this would also<br />
increase the circuit complexity.<br />
[Figure 4.16: Error occurring in the protected ALU]<br />
4.5.4 Store Sensitive Elements (SE)<br />
The six internal states TOS, NOS, TORS, DSP, RSP and IP must be saved at the end of each valid<br />
sequence for possible rollback. We have decided to store them in the DM. The procedure uses six<br />
consecutive instructions, whereby the contents of TORS are stored in the RS and the others in the DS. The posi-<br />
tive aspect of this approach is that it adds no hardware overhead inside the processor, while<br />
the downside is the extra performance penalty of storing the SEs. One possible combination of instructions<br />
is:<br />
CALL a<br />
CPR2D<br />
PUSH_RSP<br />
PUSH_DSP<br />
DUP<br />
DUP<br />
An alternative solution for protecting the internal states is to use internal shadow registers holding the end<br />
values of the previous valid sequence. On rollback, the shadow copies are loaded back into the corresponding<br />
registers. The advantage of this scheme is that a single clock cycle suffices to save or restore the registers.<br />
However, it doubles the register count of the SEs, and these shadow copies must also be protected,<br />
which incurs extra hardware overhead. Moreover, it is not a favourable choice for context swapping.<br />
4.5.5 Protecting Opcode<br />
The program memory is already inside the DM; therefore, there is no risk of faults there, but faults may<br />
occur in the opcode during execution. Fortunately, the opcode can be protected<br />
without additional hardware penalty: even a 6-bit code can encode 64 different<br />
instructions, and we have an 8-bit opcode for only 37 instructions, which allows us to employ<br />
low-overhead EDCs without any additional hardware penalty.<br />
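As a hedged illustration of this headroom (the thesis does not fix a specific code here), the two spare bits of the 8-bit field could carry, for example, two parity bits over the two halves of a 6-bit opcode, detecting any single-bit error:<br />

```python
# Sketch: protect a 6-bit opcode with the 2 spare bits of the 8-bit field.
# The check bits are even parity over the low and high 3-bit halves; this is
# one possible low-overhead EDC, not the code mandated by the design.
def encode(op6):
    p_lo = bin(op6 & 0b000111).count("1") & 1
    p_hi = bin(op6 & 0b111000).count("1") & 1
    return (op6 << 2) | (p_hi << 1) | p_lo     # 8-bit protected opcode

def check(word8):
    op6 = word8 >> 2
    return encode(op6) == word8                # False => bit error detected

w = encode(0b100101)
print(check(w), check(w ^ 0b00010000))         # True False
```

Any single flip changes exactly one parity group or one check bit, so the comparison always fails on a single-bit error.<br />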
4.6 Solution-II: Performance Aspects of Self-Checking Processor Core<br />
Owing to the chosen stack architecture, data is implicitly available on the two tops of the stack,<br />
which reduces the length of the critical path. But for high time performance, (i) the instruction execution<br />
rate (clocks per instruction) should be approximately one, and (ii) for multiple-byte instructions, the next<br />
instruction to be executed must reach the control unit in clock cycle (t + 1). In other words, there<br />
should be a continuous flow of instructions inside the IB. The instruction buffer management unit (IBMU)<br />
is dedicated to this task.<br />
The IBMU generates six control signals; five of them are dedicated to the data flow<br />
control in the IB, namely SM1, SM2, SM3, SM4 and SM5, as shown in figure 4.17, while the sixth<br />
(SM6) is reserved for the IP. The next sections address solutions to each challenge.<br />
4.6.1 Solution-II (a): Multiple-byte Instructions<br />
There are seven multiple-byte instructions (blocks of 2 or 3 bytes). The IBMU controls the flow of<br />
instructions in the IB by pre-fetching the next instruction and making it available to the control unit in the<br />
next clock cycle (t + 1). The IBMU controls a series of cascaded buffers with multiple intercon-<br />
nections to cope with complex conditions, as shown in figure 4.18. The decisions inside the IBMU<br />
are taken according to the predefined states of an FSM (Finite State Machine); transitions between those<br />
states depend upon the present state of the IB.<br />
Table 4.1: Instruction types<br />
b7 b6 b5 | details<br />
1 0 0 | 1-byte<br />
1 1 1 | 1-byte (multi-clock)<br />
1 0 1 | 2-bytes<br />
1 1 0 | 3-bytes<br />
0 1 1 | 1-byte + IP-change<br />
0 0 1 | 2-bytes + IP-change<br />
0 1 0 | 3-bytes + IP-change<br />
4.6. SOLUTION-II: PERFORMANCE ASPECTS OF SELF-CHECKING PROCESSOR CORE 95<br />
Figure 4.17: Instruction buffer management unit (IBMU). (The 16-bit program address feeds the program memory, which delivers LSB/MSB bytes to the instruction buffer registers R1–R5 under control signals SM1–SM5; SM6 drives the instruction pointer (IP). 16-bit adders implement the IP-change paths IP ← d + IP, IP ← a and IP ← TORS used by ZBRA d, SBRA d, LBRA a, CALL a and RETURN, while the 8-bit opcode and the 3-bit Inst_type feed the control unit, which generates the control signals.)
4.6.2 Solution-II (b): 2-Stage Pipelining to Resolve Multi-clock Instruction Execution
The majority of the instructions require a single clock cycle to be processed in the data-path, while others require multiple clocks. To implement the pipelining, we need to differentiate between the single and multiple-clock-cycle instructions. The three most significant bits (b7 b6 b5) of the opcode are reserved to determine the type of the instruction, as shown in figure 4.19 (a). Effectively, the instructions that require multiple clock cycles per execution have been given the code '111' in b7b6b5. We can differentiate between the various instructions on the basis of instruction length and IP change, as shown in table 4.1. An IP change occurs in instructions containing a jump.
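As an illustration, the decoding of table 4.1 can be sketched in Python (a behavioral sketch, not part of the VHDL model; the function name and mapping are ours, taken directly from the table):

```python
# Sketch of the instruction-type decoding of table 4.1: the three MSBs
# (b7 b6 b5) of the 8-bit opcode encode instruction length and IP change.
INSTRUCTION_TYPES = {
    (1, 0, 0): "1-byte",
    (1, 1, 1): "1-byte (multi-clock)",
    (1, 0, 1): "2-bytes",
    (1, 1, 0): "3-bytes",
    (0, 1, 1): "1-byte + IP-change",
    (0, 0, 1): "2-bytes + IP-change",
    (0, 1, 0): "3-bytes + IP-change",
}

def instruction_type(opcode):
    """Classify an 8-bit opcode from its three most significant bits."""
    msbs = ((opcode >> 7) & 1, (opcode >> 6) & 1, (opcode >> 5) & 1)
    return INSTRUCTION_TYPES.get(msbs, "undefined")
```

For example, an opcode starting with '111' is classified as a 1-byte multi-clock instruction, while one starting with '010' is a 3-byte instruction containing a jump.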
The multiple-clock-cycle instructions have been analysed (with various instruction combinations) to find possible conflicts between them. It has been found that if they are executed in a two-stage pipeline, all conflicts in addressing the memory can be avoided. During the first stage, the stack pointers are incremented (DSP + 1 / RSP + 1) according to the type of instruction, while in the second stage the rest of the instruction is executed; there are then no conflicts in accessing the memory. The DSP (data stack pointer) and RSP (return stack pointer) point to the tops of the DS and RS in the DM, respectively. This results in the two-stage execution pipeline shown in figure 4.19 (b).
Figure 4.18: Instruction buffer. (Cascaded 8-bit registers R1–R5 with multiplexers M1–M5 driven by the control signals SM1–SM5; the buffer delivers the opcode together with the M-operand and L-operand bytes.)
During pipelining, part of the next instruction is pre-executed (DSP + 1 / RSP + 1) along with the active (present) instruction in a given clock cycle. In this way, the remaining part of that instruction can then be executed in a single clock cycle, during the next clock cycle. The control unit takes the 8-bit opcode of the present instruction to generate the control signals for all the associated MUXs. Simultaneously, the three MSBs of the opcode of the next instruction to be executed are also extracted, as these MSBs identify the type of the instruction.
To evaluate the effectiveness of the pipelining, we have executed a sample benchmark consisting of five instructions (shown in figure 4.20). Without pipelining this program requires 9 clock cycles, while only 5 are needed with pipelining: an improvement of about 45%.
Therefore, with pipelining, all instructions can be executed in a single control cycle except the STORE instruction, which requires two control clock cycles. Indeed, the STORE instruction needs to execute DSP + 1 twice, which cannot be done in a single clock cycle. The complete list of the instructions is given in the tables in appendix C.
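The cycle counts for this benchmark can be checked with a small sketch (illustrative Python; the per-instruction cycle counts are assumptions read off figure 4.20: ADD needs no stack-pointer pre-increment, while the four other instructions each spend one extra cycle on it when not pipelined):

```python
# Cycle-count sketch for the five-instruction benchmark of figure 4.20.
# Per-instruction non-pipelined cycle counts are read off the figure.
NON_PIPELINED_CYCLES = {"ADD": 1, "DUP": 2, "R2D": 2, "PUSH_DSP": 2, "LIT": 2}

def cycles(program, pipelined):
    if pipelined:
        # Stage 1 (DSP + 1) of each instruction overlaps stage 2 of the
        # previous one, so every instruction retires in one clock cycle.
        return len(program)
    return sum(NON_PIPELINED_CYCLES[i] for i in program)

program = ["ADD", "DUP", "R2D", "PUSH_DSP", "LIT"]
print(cycles(program, pipelined=False))  # 9 clock cycles
print(cycles(program, pipelined=True))   # 5 clock cycles
```

The saving of (9 − 5)/9 ≈ 44% matches the roughly 45% improvement quoted above.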
4.6.3 Reducing Overhead for Conditional Branches<br />
It has been previously discussed that instructions are loaded into the IB at almost twice the rate of execution (an 8-bit opcode is executed in one clock cycle). Therefore, the next instruction to be executed is already pre-fetched in the IB. However, in the case of a jump instruction the IB must be flushed and new
Figure 4.19: (a) Opcode description and (b) pipelined execution model. (In the opcode, bits b0–b4 precisely describe the instruction and are needed for present-instruction execution, while bits b5–b7 give the instruction type, i.e. DSP+1, DSP−1, RSP+1, RSP−1, and are needed for pre-execution. In the pipelined execution model, each clock cycle overlaps the present instruction's execution with the pre-execution of the next instruction.)
Figure 4.20: A sample program (ADD, DUP, R2D, PUSH_DSP, LIT a) executed through the non-pipelined and the pipelined stack processor core. (Non-pipelined, the micro-operations of the five instructions occupy clock cycles 1 to 9; pipelined, the first-stage operation (DSP ← DSP + 1, or NOP when there is no next instruction) overlaps the second-stage operations, e.g. TOS ← TOS + NOS, DS[DSP] ← NOS, TOS ← TORS, TOS ← data, so the program completes in 5 cycles.)
instructions must be loaded, which can result in a performance penalty. This can be overcome if we take advantage of the fact that instructions are loaded faster than they are consumed in the stack processor. The approach is based on loading both targets of the jump inside the IB (as shown in figure 4.23). Therefore, there will be no extra NOPs as a consequence of a jump instruction. However, it may increase the complexity of the instruction buffer management unit, and possibly a larger IB will be needed. The VHDL-RTL implementation of such a solution is not considered in this work.
Figure 4.21: Timing diagram for a sample program (ADD, DUP, R2D, PUSH_DSP, LIT a) executed twice: once in the non-pipelined version (clock cycles 1 to 9) and then in the pipelined version (clock cycles 1 to 5).
4.7 Implementation Results<br />
The self-checking processor core has been synthesized with Altera Quartus II. Figure 4.22 shows the implementation design flow of the SCPC, modeled in VHDL-RTL and implemented on an Altera Stratix III EP3SE50F484C2 device. From the results, the following observations can be made:

• Area occupation: the results obtained in terms of area are reported in table 4.2. The area required for the SCPC is minimal; it can be a suitable core processor for future MPSoC development.
Table 4.2: Implementation area

     | Comb. ALUTs (Ded. Logic)
SCPC | 861 (278)
• Performance analysis: although in this chapter we have only modelled a processor core (SCPC), and the model needs to be completed with a self-checking hardware journal (SCHJ), as studied in the next chapters, the processor performance aspects can be analyzed to assess the effectiveness of the stack approach. In stack-based machines, the clock cycle is short because the operands are implicitly available on the two tops of the stack. It is interesting to note that the chosen stack processor requires two-stage pipelining to obtain rather good performance. All instructions (except STORE) can be executed in a single clock cycle. The performance of the architecture was checked and the results are shown in figure 4.24. The results depict the execution of instructions in a single clock cycle.
Figure 4.22: Implementation design flow. (The proposed model (.vhd) is synthesized with Quartus II v7.1 and simulated on an Altera Stratix device, reporting area and frequency.)

Figure 4.23: Strategy to overcome the performance overhead due to conditional branches. (Both targets of a conditional branch, cond. 1 and cond. 2, are loaded into the instruction buffer registers so that either can be issued at t + 1.)
• Self-checking analysis: we validate the error detection ability by injecting a simple error (SBU). The complete validation of the overall model will be presented in chapter 6, where different error scenarios will be injected artificially to check the effectiveness of the overall approach. Here, the implementation results in figure 4.25 show the processor in read/write mode (mode 01). (The working modes of the processor will be discussed in chapter 5.) At some instant, an error is artificially injected into the self-checking processor. On the detection of
Figure 4.24: Implementation of a self-checking processor core (execution trace of the sample program LIT 5, LIT 6, ADD, DUP, LIT A00, STORE, ADD).
an error, the forward instruction execution stops and the processor rolls back.
Figure 4.25: Error detected in the SCPC

4.8 Conclusions
In this chapter, we have designed a self-checking processor core (SCPC) that tolerates SBUs, along with measures to improve performance. Design choices have been made to ensure fast error detection in the resulting processor with minimum hardware overhead. Error detection is based on combinational codes (residue-parity), while error recovery is based on the rollback mechanism.
An interesting point is the choice of a MISC stack computer architecture. It is a simple processor with reduced internal state, which is favourable for both CED and rollback. It occupies a small area on chip, which is favourable from both the dependability and hardware-saving points of view.

To improve the instruction execution rate, the processor uses a two-stage execution pipeline. The instruction buffer management unit controls the flow of multiple-byte instructions in the instruction buffer. We therefore take advantage of the high code density of variable-length instructions while enabling a two-stage execution pipeline in which a part of the next instruction is pre-executed along with the present instruction.
In the next chapter, we will discuss the design and implementation of the self-checking hardware journal, which prevents errors from entering the dependable memory.
Chapter 5<br />
Design of a Self Checking Hardware Journal<br />
This chapter focuses on the design of the self-checking hardware journal (SCHJ), which is the centerpiece of our strategy to devise a fault-tolerant processor against transient faults (as shown in figure 5.1).
Figure 5.1: Design of the SCHJ (placed between the SCPC and the DM).
The basic role of the SCHJ is to hold the new data generated during the currently executed sequence until it can be validated at the end of that sequence (see figure 5.2). If the sequence is validated, this data can be transferred to the DM. Otherwise, in the case of error detection during the current sequence, this data is simply skipped and the current sequence can restart from the beginning using the trustable data held in the DM, corresponding to the state prevailing at the end of the previous sequence. However, an error detection and correction mechanism is needed in the journal to detect possible errors provoked in the journal during this temporary stay.

This chapter explores the construction and operation of the SCHJ and is organized as follows. The first section describes the self-checking methodology. The following section describes the hardware architecture and operation of the journal. Finally, to evaluate the operation of the self-checking hardware journal, a generic model is described in VHDL-RTL (register transfer level) and
is synthesized with Altera Quartus II.

Figure 5.2: Protecting the DM from contamination. (Non-validated data from the self-checking processor core, which may be hit by transient faults, is held in the dependable temporary storage; only trustable validated data reaches the dependable memory.)
5.1 Error Detection and Correction in the Journal<br />
It has been shown in section 3.5.2 that the journal should have a built-in self-checking mechanism, because data stored inside this temporary location can also be corrupted as a consequence of transient faults affecting it (see figures 5.2 and 5.3).

In the journal, part of the data belongs to the present SD (stored in the upper, un-validated part, fig 5.3 (a)) and the rest of the data belongs to the previous SD (VD in the lower part of the journal, fig 5.3 (b)). If an error is detected in the data belonging to the present sequence, we can roll back to the previous validated state. However, if an error occurs in data that does not correspond to the present state, we cannot roll back, because the states of the SEs are no longer saved in the memory, as shown in figure 5.3 (b). This means that only error detection is needed in the UVJ, whereas error correction is needed in addition to error detection inside the VJ.
ECC will be employed for the detection and correction of errors in the SCHJ. Among ECCs, Hamming codes and Hsiao codes are the most commonly employed [Sta06]. Of the two, the Hsiao code is more efficient and requires less hardware overhead than the Hamming code [GBT05]. It has been widely used in designing dependable memories [Che08]. Hsiao codes have been employed for three decades and are still the most efficient codes used in industry [GBT05, Che08].
5.2 Principle of the technique<br />
The Hsiao codes [Hsi10] will be employed in the self-checking HW journal. They provide fast encoding and fast error detection in the decoding process, and are obtained by shortening Hamming codes. The construction of the code is best described in terms of the parity-check matrix Ho. The selection of the columns of the Ho matrix for a given (n, k) code is based on three conditions:
Figure 5.3: (a) Error(s) in the un-validated journal (UVJ); (b) error(s) in the validated journal (VJ). (The journal is split into un-validated data in the UVJ and validated data in the VJ.)
• Every column should have an odd number of 1's.

• The total number of 1's in the Ho matrix should be a minimum.

• The number of 1's in each row of Ho should be made equal, or as close as possible, to the average number (i.e., the total number of 1's in Ho divided by the number of rows).
The first requirement guarantees that the code generated by Ho has a minimum distance of at least 4. Therefore, it can be used for single-error correction and double-error detection. The second and third requirements yield minimum logic levels in forming the parity or syndrome bits, and less hardware in the implementation of the code. For instance, if r parity-check bits are used to protect k data bits, then the following inequality should hold for Hsiao codes:
$$\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le r} \binom{r}{i} \;\ge\; r + k \qquad (5.1)$$

Precisely, the Ho matrix is constructed as follows:

(a) all $\binom{r}{1}$ weight-1 columns are used for the $r$ check bit positions;
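Equation (5.1) can be checked numerically for the code used later in this chapter (a quick sketch; r and k are the parameters of the (41, 34) Hsiao code):

```python
from math import comb

# Check of equation (5.1) for r = 7 check bits and k = 34 data bits:
# the odd-weight columns must be able to cover all n = r + k = 41 positions.
r, k = 7, 34
odd_weight_columns = sum(comb(r, i) for i in range(1, r + 1, 2))
print(odd_weight_columns)           # 7 + 35 + 21 + 1 = 64
print(odd_weight_columns >= r + k)  # True: 64 >= 41
```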
(b) next, if $\binom{r}{3} \ge k$, select $k$ weight-3 columns out of all possible combinations. If $\binom{r}{3} < k$, all $\binom{r}{3}$ weight-3 columns should be selected. The leftover columns are then first selected from all $\binom{r}{5}$ weight-5 columns, then from the $\binom{r}{7}$ weight-7 columns, and so on until all $k$ columns have unique combinations.
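Steps (a) and (b) can be sketched as follows (illustrative Python; which of the 35 weight-3 columns is dropped for k = 34 is a free choice here, not necessarily the selection actually used in figure 5.4):

```python
from itertools import combinations

# Sketch of the column selection for the (41, 34) code: all 7 weight-1
# columns for the check bits, then 34 of the C(7,3) = 35 weight-3 columns
# for the data bits.
r, k = 7, 34
weight1 = [tuple(1 if j == i else 0 for j in range(r)) for i in range(r)]
weight3 = [tuple(1 if j in trio else 0 for j in range(r))
           for trio in combinations(range(r), 3)]
columns = weight1 + weight3[:k]           # 7 + 34 = 41 columns in total
total_ones = sum(sum(c) for c in columns)
print(len(columns))  # 41
print(total_ones)    # 7 + 3 * 34 = 109 ones in the H-matrix
```

Every selected column has odd weight, so the minimum-distance-4 property of the first condition is preserved.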
If the codeword length $n = k + r$ is exactly equal to

$$\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} \binom{r}{i} \qquad (5.2)$$

for some odd $j \le r$, each row of the Ho matrix will have the following number of 1's:

$$\frac{1}{r}\sum_{\substack{i=1 \\ i\ \mathrm{odd}}}^{i \le j} i\binom{r}{i} = \frac{1}{r}\left[ r + 3\,\frac{r(r-1)(r-2)}{3!} + \cdots + j\,\frac{r(r-1)\cdots(r-j+1)}{j!} \right] \qquad (5.3)$$

$$= 1 + \binom{r-1}{2} + \cdots + \binom{r-1}{j-1} \qquad (5.4)$$

If $n$ is not exactly equal to this sum for some $j$, then the arbitrary selection of the $\binom{r}{i}$ cases should make the number of 1's in each row close to the average number.
Single-bit error correction and double-bit error detection are accomplished in the following way. A single-bit error results in a syndrome pattern that matches a column of the parity check matrix Ho. Thus, matching the syndrome pattern to a column of Ho identifies the erroneous bit. If the column corresponds to a check bit, no correction is necessary; otherwise, inverting the identified bit corrects the error [Lal05]. Double-error detection is accomplished by examining the overall parity of the syndrome bits. Since the Hsiao code uses only an odd number of 1's in the columns of Ho, a syndrome pattern corresponding to a single-bit error has odd parity. If instead the syndrome is non-zero with even parity, it indicates the presence of a double error in a code word.
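The decoding rule can be sketched as follows (illustrative Python with a toy r = 3 matrix in the usage example; `classify_syndrome` is our name, not an API of any library):

```python
# SEC-DED decoding sketch based on the odd-weight-column property: a
# non-zero syndrome with odd parity matching a column of Ho points at a
# single correctable bit; any other non-zero syndrome (e.g. the even-parity
# syndrome of a double error) is detected but not correctable.
def classify_syndrome(syndrome, h_columns):
    """syndrome: tuple of r syndrome bits; h_columns: the n columns of Ho."""
    if not any(syndrome):
        return ("no error", None)
    if sum(syndrome) % 2 == 1 and syndrome in h_columns:
        return ("single error", h_columns.index(syndrome))  # invert that bit
    return ("uncorrectable error", None)

# Toy example with r = 3: all odd-weight columns of length 3.
h_columns = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
print(classify_syndrome((0, 1, 0), h_columns))  # ('single error', 1)
print(classify_syndrome((1, 1, 0), h_columns))  # ('uncorrectable error', None)
```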
Figure 5.4: Hsiao parity check matrix (41, 34). (Syndrome rows S1–S7 over data bit positions 0–33 and check bits C1–C7; each row contains 15 or 16 ones.)
Hsiao showed that by using minimum-odd-weight columns, the number of 1's in the Ho-matrix could be minimized (and made less than for a Hamming SEC-DED code). This translates to less hardware
area in the corresponding ECC circuitry. Furthermore, by selecting the odd weight columns in a<br />
way that balances the number of 1’s in each row of the Ho-matrix, the delay of the checker can be<br />
minimized (as the delay is constrained by the maximum weight row).<br />
Effectively, the data residing in the self-checking HW journal is coded with parity bits generated according to the Hsiao codes [Hsi10]. These parity bits ensure that the data written in the journal remains unchanged. Each block (row) in the journal has three parts: the first contains a pair of data and corresponding address, the second consists of the w and v bits, and the third consists of the generated parity, as shown in figure 5.6. We have used the Hsiao (41, 34) code to protect the stored data; this class of codes provides SEC-DED. There are 7 parity bits, and the H-matrix is constructed as follows:

1. all $\binom{7}{1} = 7$ weight-1 columns are used;

2. 34 weight-3 columns are selected out of the $\binom{7}{3} = 35$ possible combinations.

The parity check matrix (Ho) for the (41, 34) Hsiao code is shown in figure 5.4. It has the following features:

1. the total number of 1's in the H-matrix is $7 + 3 \times 34 = 109$;

2. the average number of 1's per row is $109/7 \approx 15.6$.
Moreover, these codes are encoded and decoded in a parallel manner. In encoding, the message bits enter the encoding circuit in parallel and the parity-check bits are formed simultaneously. In decoding, the received bits enter the decoding circuit in parallel; the syndrome bits are formed simultaneously and the received bits are corrected in parallel. Double-error detection is accomplished by examining the number of 1's in the syndrome vector.
5.3 Journal Architecture and Operation<br />
The journal storage space is internally split into two parts: UVJ and UJ, as shown in figure 5.5. At<br />
the end of each valid SD, the contents of UVD turns into VD and thus, the virtual line separating the<br />
upper part from the lower part shifts up to denote the new situation. While, VD is being transfered to<br />
the DM during the execution of the current sequence.<br />
Each row in the SCHJ is 41 bits long. The v and w bits will be discussed later. Together with the 16 address bits and the 16 data bits, they represent the information corresponding to a single block stored in the SCHJ. The remaining bits in the row are parity bits, which represent the information redundancy of the error-correcting code protecting the other bits, as shown in figure 5.6. In order to trust the data temporarily stored in the SCHJ, we need a built-in mechanism to detect and correct errors that may occur due to transient faults. Here, we have chosen to rely on error control coding, a classic and effective approach to protect storage devices [ARM+11]. In section 5.1, we selected a Hsiao (41, 34) code, a systematic single-error-correction and double-error-detection (SEC-DED) code, Hsiao codes being more effective than Hamming codes in terms of cost and reliability [Hsi10].
Figure 5.5: SCHJ structure. (Each row holds address bits, data bits, the v and w bits, and parity bits; the rows are split into unvalidated data and validated data.)
The system is based on model 2 presented in chapter 3, where the data cannot be written directly to the DM (depicted in figure 5.7), in order to ensure its contents are always trusted. The data is first written in the SCHJ and only then to the DM. The corresponding address is always searched in the un-validated area, so that no two data elements in this area correspond to the same address. If the address is found, the data element is updated. Otherwise, a new row is initialized in the unvalidated area with w = 1 and v = 0, and the address, data and parity-bit fields are filled with the adequate values. The w and v bits are used to denote written and validated data, respectively.
Before being transferred to the DM, data awaits the validation of the current sequence at the VP. The waiting delay depends on the number of instructions being executed in an SD. If no error is found at the end of the current sequence, the processor validates the sequence: all the UVD in the SCHJ is validated by switching the corresponding v bits to 1. Otherwise, if any error is detected, the sequence is not validated and the UVD in the SCHJ is discarded by switching the corresponding w bits to 0. Only data having v = 1 can be transferred to the DM.
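The write/validate protocol described above can be sketched behaviorally (illustrative Python, not the hardware organization; the class, method names and dict-based storage are ours):

```python
# Behavioral sketch of the SCHJ write/validate protocol: w marks written
# rows, v marks validated rows; only rows with v = 1 may reach the DM.
class Journal:
    def __init__(self):
        self.rows = {}  # address -> {"data": ..., "w": bit, "v": bit}

    def write(self, addr, data):
        # A write during the current sequence lands in the unvalidated
        # area: an existing row for this address is updated, otherwise a
        # new row is opened with w = 1 and v = 0.
        self.rows[addr] = {"data": data, "w": 1, "v": 0}

    def validate(self):
        # No error at the validation point: unvalidated rows become VD.
        for row in self.rows.values():
            if row["w"] == 1:
                row["v"] = 1

    def invalidate(self):
        # Error detected: the current sequence's rows are discarded (w <- 0).
        for row in self.rows.values():
            if row["v"] == 0:
                row["w"] = 0

    def transfer_to_dm(self, dm):
        # Only validated rows are copied to the dependable memory.
        for addr, row in self.rows.items():
            if row["v"] == 1 and row["w"] == 1:
                dm[addr] = row["data"]
```

For example, a write validated at the end of one sequence still reaches the DM even if the following sequence is invalidated, while the discarded write never does.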
It should be noticed that the last instructions in a sequence are used to write the SEs to the SCHJ. On sequence validation, this data gets its v bit set to 1 and is consequently stored in the DM. In the case of sequence un-validation (see figure 5.8), the SE data is restored from memory on rollback, as the UVD in the SCHJ is discarded, and execution is restarted from the previous VP. Further explanation of the rollback operation can be found in [RAM+09, AMD+10, ARM+11].
As stated before, the on-chip DM is supposed to be fast enough to fulfill the performance requirements of our SCPC. Our strategy of using a SCHJ aims not only to improve FT but also to allow the rollback mechanism to be used with very little time penalty compared to a full hardware approach or no protection at all.
Each row in the SCHJ is protected by a Hsiao code, as shown in figure 5.6. This protection is used in the following way:

• an error detected in the UVD results in sequence un-validation (rollback);

• VD is written row by row to the DM; the VD is the copy of the latest validated sequence.
Figure 5.6: Error detection and correction in the journal (a memory block of the SCHJ). (A row holds 16 address bits, 16 data bits, the w and v bits, and 7 parity bits; a parity generator computes P_X over address + data + w + v, and the error detection and correction unit compares the stored P_X with the regenerated P'_X. No error or a corrigible error leaves the data ready to transmit, while a non-corrigible error triggers a rollback/reset.)
Thus, throwing away this data would prevent correct completion of the program/thread execution and would require a system reset. This can only happen if an error is detected that overpasses the correction capacity of the code (e.g. a two-bit error in a single VD row).
5.3.1 Modes of SCHJ<br />
The overall operation of the SCHJ is depicted in the flow chart of figure 5.9. Four modes of operation are summarized in table 5.1. The ECC checker circuit is activated on each read and write access to the memory. The traffic signals in figures 5.10, 5.11, 5.13 and 5.15 represent the data flow with respect to the write operation because, in read operation, the SCHJ and DM are totally transparent to the processor.

Mode 00 – this mode is active at the start of the program, or at restart if a non-corrigible error is detected in the VJ of the SCHJ. In this mode, the processor resets and re-executes from default values, discarding all the data stored in the journal. All the w and v bits are set to 0 (v ← 0 and w ← 0).
Figure 5.7: Overall architecture. (The error-detecting processor core reads from and writes to the journal; only validated data is written, through error detection and correction, to the dependable main memory, which the core can also read directly.)

Table 5.1: Modes of the journal

Modes | Operation
00    | Initialized
01    | Read/write
10    | Valid (v = 1)
11    | Un-valid (rollback)
Mode 01 – this is the normal read or write mode, depending on the active instruction in the SCPC (rd = 1 or wr = 1). In this mode, the SCPC can write directly into the SCHJ but not into the DM, in order to avoid any risk of data contamination in the DM. However, it has read access to both the SCHJ and the DM (not shown in figure 5.11 to avoid complexity). The data read from the SCHJ is checked for possible errors. On error detection, the processor enters mode 11, in which the rollback mechanism is activated without waiting for the VP of the current sequence.
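The mode transitions, as far as the text describes them, can be sketched as a small table (illustrative Python; the event labels are ours, inferred from the surrounding prose, and do not correspond to actual hardware signals):

```python
# Sketch of the journal mode transitions summarized in table 5.1, as
# inferred from the mode descriptions in this section.
MODE_TRANSITIONS = {
    ("00", "start"): "01",           # initialized -> normal read/write
    ("01", "error"): "11",           # error detected -> un-valid (rollback)
    ("01", "vp_no_error"): "10",     # error-free validation point -> valid
    ("10", "done"): "01",            # validated data handled -> read/write
    ("10", "non_corrigible"): "00",  # uncorrectable error in VJ -> reset
    ("11", "rollback_done"): "01",   # rollback finished -> read/write
}

def next_mode(mode, event):
    # Stay in the current mode for events with no listed transition.
    return MODE_TRANSITIONS.get((mode, event), mode)
```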
Under normal conditions, the processor is mostly in mode 01. As shown in figure 5.12, when the processor needs to read from the SCHJ, the address tags are checked to match the required data (depicted by arrow a in figure 5.12). If the required address is found, then before the data is transferred towards the SCPC it is checked for possible errors, by comparing the stored parity bits with parity bits re-generated according to the Hsiao code in the error detection unit (shown in figure 5.12).

• if an error is detected (shown in figure 5.12), the rollback mechanism is invoked, because the data contents in the UVJ were generated during the current sequence (denoted by the v field set
Figure 5.8: Rollback mechanism on error detection. (At validation point VP_n−1, closing an error-free SD, the data is validated and the SEs are stored; if an error is detected during the current sequence duration (SD), the processor rolls back to VP_n−1 and restores the SEs instead of reaching VP_n. Note: VP is the validation point; SE is a state-determining element of the processor.)
to 0). The enable signal on the data bus is then set to '0' to forbid further data transfers from the SCHJ to the SCPC. All the data contents written during this sequence are considered garbage values (w ← 0).
Figure 5.9: SCHJ operation flow chart. (Reads from the journal pass through error detection; on error, the rollback mechanism is invoked. Validated data headed for the dependable main memory also passes through error detection: a corrigible error is corrected before the write, while a non-corrigible one triggers a reset.)

Figure 5.10: SCHJ mode 00. (The journal is initialized, v = 0 and w = 0; no read or write takes place.)
SCPC, switching to mode 00 (and possibly raising some alarm indicator) is the usual behavior<br />
in this situation.<br />
Mode 11 – this mode is invoked when an error is detected during a read/write operation, as shown in figure 5.15; it has been partially discussed with mode 01. In this mode, all the data written in the UVJ of the SCHJ (i.e. all the data generated during the current sequence) is invalid and discarded (w ← 0).
Then the 01-mode (read/write mode) is activated.

Figure 5.11: SCHJ mode 01. (The error-detecting processor reads from and writes to the dependable journal (w = 1), while writes to the dependable main memory are blocked; on a write, the steps followed are (i) address tag matching, (ii) error detection (e = 1), (iii) rollback mechanism.)
5.4 Risk of data contamination<br />
Figure 5.12: Read of UVD from SCHJ in mode 01.
Inside the UVJ, 2-bit errors can be detected by relying on Hsiao codes for error detection, and recovered through the rollback mechanism. The maximum time penalty to correct such an error is equal to the length of the SD.
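To make the coding scheme concrete, below is a minimal software sketch of an odd-weight-column (Hsiao-style) SEC-DED code for 8-bit words. The H-matrix columns chosen here are illustrative assumptions, not the exact (wider) code used in the SCHJ; they only demonstrate the property the text relies on: a single-bit error yields an odd-weight syndrome (correctable), while any 2-bit error yields a non-zero even-weight syndrome (detected, triggering rollback).

```cpp
#include <array>
#include <cstdint>

// Hypothetical (13,8) Hsiao-style SEC-DED code: 8 data bits + 5 check bits.
// Every column of the parity-check matrix H has odd weight.
// Positions 0..7 carry data (distinct weight-3 columns), 8..12 the check
// bits (unit columns).
constexpr std::array<uint8_t, 13> H = {
    0b00111, 0b01011, 0b01101, 0b01110,
    0b10011, 0b10101, 0b10110, 0b11001,
    0b00001, 0b00010, 0b00100, 0b01000, 0b10000};

uint16_t encode(uint8_t data) {
    uint8_t check = 0;
    for (int i = 0; i < 8; ++i)
        if (data >> i & 1) check ^= H[i];     // check bits = H_data * data
    return static_cast<uint16_t>(data | check << 8);
}

struct DecodeResult { bool double_error; uint8_t data; };

DecodeResult decode(uint16_t cw) {
    uint8_t syn = 0;
    for (int i = 0; i < 13; ++i)
        if (cw >> i & 1) syn ^= H[i];         // syndrome = H * received word
    if (syn != 0) {
        if (!__builtin_parity(syn))           // even-weight syndrome:
            return {true, 0};                 // 2-bit error -> rollback
        for (int i = 0; i < 13; ++i)          // odd weight: locate and flip
            if (H[i] == syn) { cw ^= 1u << i; break; }
    }
    return {false, static_cast<uint8_t>(cw & 0xFF)};
}
```

A corrupted word with one flipped bit decodes back to the original data; two flipped bits set the `double_error` flag, which in the SCHJ corresponds to invoking the rollback.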
Figure 5.13: SCHJ mode 10.
Figure 5.14: Mode 10 of SCHJ operation (un-corrigible error detected).
On the other hand, inside the VJ only single-bit errors (SBUs) can be corrected. If a 2-bit MBU is detected there, the program must be re-executed, which may strain real-time performance, but the dependability hypothesis on the DM remains intact. Moreover, the probability of an MBU is much lower than that of an SBU [QGK+06], so such situations are rare.
Figure 5.15: SCHJ mode 11.
This means that, from a dependability point of view, the VJ is a more critical data store than the UVJ. It is therefore important to know how long data stays inside the VJ. In fact, every data item stored in the journal is transferred to the DM within one SD, so the maximum duration of exposure to data contamination is one SD: the longer the SD, the greater the risk of data contamination inside the SCHJ, and vice versa.
5.5 Implementation Results<br />
The SCHJ have been modeled in VHDL at the RTL level and implemented on a Altera Stratix III<br />
EP3SE50F484C2 device using Altera QuartusII. The results obtained in terms of area for depth of<br />
SCHJ equal to 10 are reported in table 5.2. From the results, the following observations can be done:<br />
• the SCHJ occupies about 40 − 50% of the total area depending on the depth of the journal;<br />
Table 5.2: Implementation area
                  Comb. ALUTs (Ded. Logic)
SCHJ              591 (399)
SCPC and SCHJ     1452 (677)
If a non-corrigible error (e.g. a double error in a single row) is detected in the validated part of the journal (VJ), even a rollback cannot recover it, because the data no longer belongs to the present SD; in this case the processor must be reset, as shown in figure 5.16.
5.5.1 Minimizing the Size of the Journal
From the implementation results in table 5.2, the journal accounts for a significant percentage of the total area of the FT-processor. We have therefore investigated the percentage utilization of the overall processor versus the SCHJ depth. The results, reported in figure 5.17, show that the overall hardware overhead depends directly on the depth of the SCHJ.
Figure 5.16: Non-corrigible error detection.
Figure 5.17: Increase of percentage utilization of FT processor (SCPC + SCHJ) on device EP3SE50F484C2 with increase in the depth.
In fact, the depth of the journal is a relative parameter: it depends on the type of benchmark being employed and on the duration of the SD. From a theoretical point of view, the UVJ depth should equal the maximum SD if the benchmark consists only of instructions that write to memory (e.g. a series of duplication instructions, figure 5.18): on each instruction execution the contents of NOS are written to memory, so the required UVJ size equals the length of the SD (see figure 5.19, arrow a). The opposite extreme occurs for benchmarks whose instructions never, or hardly ever, write to memory (like a series of SWAPs in figure 5.18); there, the required journal depth is minimal.
To address real industrial applications, we need the relationship between SD and journal depth. Accordingly, we have calculated the percentage of writes in the benchmarks already discussed (see section 3.7). They are expensive in processor-memory traffic because they constantly read and write data from/to memory. Even so, the results show that the maximum percentage of writes to memory is 39% (see figure 5.19, arrow b), and this still ignores writes to the same memory addresses,
Figure 5.18: Theoretical limits of Journal Depth.
which can further reduce the required journal depth. This shows that the practical journal depth need not exceed 50% of the SD (leaving an eleven-percent safety margin above the measured 39%). Note that the area occupation results presented earlier were calculated for the worst case (journal depth = SD), whereas the required journal depth is only SD/2.
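The SD/2 sizing rule can be checked with simple arithmetic; the 39% figure is the measured maximum write rate quoted above, and the helper below is only an illustrative back-of-the-envelope calculation:

```cpp
#include <cmath>

// Back-of-the-envelope check of the SD/2 sizing rule. Each write to memory
// consumes one UVJ entry, so the journal depth required by a sequence is at
// most SD times the fraction of instructions that write to memory.
int required_depth(int sd, double write_fraction) {
    return static_cast<int>(std::ceil(sd * write_fraction));
}
```

With SD = 20 and the worst measured 39% write rate, `required_depth(20, 0.39)` gives 8 entries; budgeting SD/2 = 10 entries therefore keeps the eleven-percent safety margin mentioned above.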
Now, to finalize the depth of the journal, it is important to find the relationship between SD and performance degradation. Accordingly, we have developed a processor model using dedicated C++ tools, into which errors are injected artificially; only the injection of SBUs has been considered here. The complete experimental setup will be discussed in chapter 6. The factor CPO (clocks per operation) is chosen to quantify the performance degradation due to re-execution. Ideally, the processor executes a single instruction per clock cycle, which means that CPO ≈ 1. The discussion in section 3.2.2 shows that in a BER system the performance degradation relies on two factors: the rate of re-execution (rollback) and the ratio of effective instruction execution, (SD−SED)/SD. The greater the performance degradation, the higher the CPO.
Curves of CPO vs. SD have been drawn for different error injection rates (figure 5.20). They follow a U-shaped pattern: at low SD the cost of loading the internal states dominates, while for bigger SD the cost of re-execution dominates. The curves also show that a large journal depth is only viable at low EIR; for every EIR there is a limited range of SD in which performance remains good.
Here, if we accept a 20% performance degradation, the minimum SD comes out to be 20
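The U-shape can be reproduced with a first-order analytic model. This is a sketch under simplifying assumptions of ours: i.i.d. errors at rate EIR per instruction, a hypothetical fixed save/reload cost for the internal states, and full re-execution of any invalidated sequence; the thesis's measured curves come from simulation, not this formula.

```cpp
#include <cmath>

// Expected clocks per operation as a function of sequence duration (SD).
// A sequence of SD instructions validates with probability (1 - eir)^SD, so
// the expected number of attempts is geometric: 1 / (1 - eir)^SD. Each
// attempt costs SD instruction cycles plus the state save/reload overhead.
double cpo(double sd, double eir, double save_cost = 5.0) {
    double attempts = 1.0 / std::pow(1.0 - eir, sd);
    return attempts * (sd + save_cost) / sd;
}
// Small SD: the save/reload term dominates; large SD: re-execution dominates.
// Together they give the U-shaped curves of figure 5.20.
```

For instance, with no errors the model reduces to the pure state-saving overhead (CPO = 1.5 for SD = 10 with the assumed 5-cycle save cost), while at EIR = 10⁻⁴ an intermediate SD beats both a very short and a very long one.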
Figure 5.19: Relation between journal depth and percentage write in benchmarks.
Figure 5.20: CPI vs. SD.
(see arrow A). In brief, the practical journal depth lies somewhere near 10 when the journal depth is taken as 50% of the SD.
5.5.2 Dynamic Sequence Duration
In the model presented so far, we have used a fixed SD, which carries both area and performance overheads. With a dynamic SD these problems can be mitigated: the SD then has an average value, as shown in figure 5.21. This allows bigger SDs to be employed with a smaller journal depth, improving the area overhead and the performance degradation at low EIR.
Figure 5.21: Dynamic SD.
Moreover, it allows the SD to be reconfigured dynamically according to the EIR. For example, if sequences are repeatedly un-validated, the system automatically reduces the SD to adjust its value to the observed EIR. The downside is the added complexity of the journal management. Dynamic SD is nevertheless an important consideration for future reductions of the hardware overhead.
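Such an adaptation policy can be sketched as follows. The halving/doubling rule, the success-streak threshold and the bounds are our illustrative assumptions, not parameters taken from the thesis:

```cpp
#include <algorithm>

// Illustrative dynamic-SD controller: shrink the sequence duration when
// sequences are repeatedly invalidated (rollbacks indicate a high EIR), and
// cautiously lengthen it again after a streak of validated sequences.
struct DynamicSD {
    int sd = 64;                          // current sequence duration
    int min_sd = 8, max_sd = 256;         // bounds keep the journal depth sane
    int streak = 0;                       // consecutive validated sequences

    void on_sequence_end(bool validated) {
        if (!validated) {                 // rollback: adapt downwards quickly
            sd = std::max(min_sd, sd / 2);
            streak = 0;
        } else if (++streak == 4) {       // sustained success: grow slowly
            sd = std::min(max_sd, sd * 2);
            streak = 0;
        }
    }
};
```

The asymmetric policy (fast shrink, slow growth) mirrors the text: repeated invalidation must be escaped quickly, while lengthening the SD is only worthwhile once the error rate has demonstrably dropped.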
5.6 Conclusions<br />
The presence of the journal facilitates the rollback mechanism on the one hand, and masks errors (SBU and 2-bit MBU) from entering the DM on the other. To reduce the hardware overhead, Hsiao codes have been employed; they provide effective double-error detection and single-error correction. Thanks to simultaneous access to the memory and the journal during READ operations, the overall efficiency of the system is increased.
The SCHJ occupies a significant percentage of the overall fault-tolerant processor area, so reducing the journal size effectively reduces the global area occupation. The size of the journal depends on the type of benchmarks being employed; for practical applications, the journal depth can be half the duration of the SD. Further reduction in depth is possible by employing dynamic SDs rather than fixed SDs. In the next chapter, we will investigate the error coverage and the performance degradation due to re-execution.
Chapter 6<br />
Fault Tolerant Processor Validation<br />
In the previous chapters, we have designed an FT processor based on concurrent error detection capability and a rollback error recovery strategy. The fault-tolerant design is built on a self-checking processor core (whose architecture follows the MISC philosophy) and on a self-checking hardware journal that prevents errors from flowing into the DM and limits the impact of the rollback mechanism on time performance. The architectures of the self-checking processor and the self-checking hardware journal have been discussed in chapters 4 and 5, respectively.
Figure 6.1: The overall FT-processor to be validated.
In this chapter, we will evaluate the FT capability of the overall FT-processor (SCPC + SCHJ), as highlighted in figure 6.1, in order to validate the design strategy. The evaluation will be carried out through simulation: controlled error injection will be used to artificially force the processor model to face abnormal situations. The FT capability of the processor will be judged by calculating the detected-to-injected error ratios under different simulation scenarios (different application benchmarks and different error injection profiles). The time performance will also be evaluated.
The chapter is organized as follows. First, we analyse the design hypotheses assumed in the methodology, and hence the FT-processor properties to be checked. Then, after a short presentation of the error injection methodology, the experimental results are presented and discussed, both from the FT and the time-performance points of view. Finally, we compare the proposed methodology with the LEON3 FT design methodology.
6.1 Design Hypothesis and Properties to be Checked<br />
Inside the SCPC, parity and remainder codes are employed to detect errors in the internal registers and in the arithmetic/logic circuitry of the ALU. By assumption, the DM is a trustable place where data remain uncorrupted; hence, unsafe data must be prevented from flowing into the DM. This is achieved by the SCHJ, which has error detection and some error correction capability. Its role is to simplify the management of validated and un-validated data and to speed up the rollback mechanism used for error recovery.
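As an illustration of the remainder-code idea applied to the ALU, an addition can be checked by comparing residues. The check base 3 below is our assumption; the thesis's exact code parameters are not restated here:

```cpp
#include <cstdint>

// Residue (remainder) check for an adder: arithmetic is congruence-
// preserving, so the residue mod 3 of the true sum must equal the combined
// operand residues mod 3. A single-bit error changes the result by ±2^k,
// and 2^k mod 3 is never 0 (it alternates between 1 and 2), so every
// single-bit error in the sum produces a residue mismatch.
bool add_is_consistent(uint64_t a, uint64_t b, uint64_t observed_sum) {
    return observed_sum % 3 == (a % 3 + b % 3) % 3;
}
```

The attraction of such codes is that the checker operates on the small residues rather than on full-width operands, which is what keeps the SCPC's error-detection hardware cheap.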
Figure 6.2: Error injection in FT-processor.
The FT capability of the processor must be evaluated as the capacity to correctly handle errors appearing in any part of the SCPC or the SCHJ (figure 6.2), with different error profiles to be tested (different error patterns and rates). Speed performance degradation will also be assessed along with the FT capability, as the impact of rollback is expected to rise with the error rate (due to higher re-execution rates). Accordingly, the rate of rollback vs. the error injection rate can also be calculated.
In short, we will investigate the overall dependability and performance of the proposed FT-processor architecture by addressing the following challenges in the upcoming sections:
• self-checking effectiveness of the FT processor;
• performance degradation due to re-execution; and
• effect of error injection on the rate of rollback.
6.2 Error Injection Methodology and Error Profiles<br />
Before addressing the above-mentioned challenges, it is necessary to choose both the error injection methodology and the error profiles that will be applied, i.e., the error patterns and error rates. Fault injection in the hardware of a system can be implemented in two ways:
1. physical fault injection;
2. simulated fault injection.<br />
In this work, we employ simulated injection of soft errors (due to transient faults), in which errors are injected by altering logical values during the simulation. Simulation-based injection is a special case of fault/error injection that can support various levels of abstraction of the system, such as functional, architectural, logic and power [CP02]. For this reason, it has been widely used in fault-injection studies.
Moreover, this technique has various other advantages, the greatest being the observability and controllability of all the modelled components. Another positive aspect is the possibility of carrying out the validation of the system during the design phase, before a final design exists.
Figure 6.3: Error patterns (errors can occur in any bit, not necessarily the bit shown here).
The faults being considered are SBUs (one bit changing in a single register) and MBUs (multiple bits changing at once in one register). These fault models are commonly used with RTL models [Van08]. The exact error patterns considered in these experiments are shown in figure 6.3: in scenario 1, we consider (a) random single-bit errors, (b) random 2-bit errors and (c) random 3-bit errors. In scenario 2, random harsh errors (from 1-bit up to 8-bit errors in a single register) are considered.
6.3 Experimental Validation of Self-Checking Methodology<br />
We will evaluate the error coverage through simulated fault injection, the objective being to assess the effectiveness of the proposed fault-tolerant scheme and hence determine its limits. This requires an environment in which the effects of transient faults on the final architecture can be analyzed.
By designing this environment, we will be able to run fault-injection experiments to evaluate the effects of SBUs and MBUs caused by transient faults in the processor registers and data path, and hence to analyze the robustness against bit flips due to transient faults.
In practice, the VHDL model at the RTL level used to synthesize the circuit is not the one used for the fault-injection simulation. In order to allow very fast simulation (and hence a large number of simulation campaigns to be conducted within a minimum delay), dedicated C++ tools have been developed that replace the 'discrete event driven' simulation model on which VHDL relies with the faster 'cycle driven' simulation model, which fits synchronous designs very well [CHL97]. For the simulation, strictly equivalent C++ cycle-driven models replaced the original VHDL models at the RTL level.
The starting point in designing the environment is to define how to reproduce transient faults: when to trigger them, where to apply them and what to change. We have chosen a non-deterministic fault-trigger approach in which, during a fixed interval, bit flips can randomly be provoked in the SCPC and the SCHJ.
Figure 6.4: Experimental Setup.
The basic steps of a fault-injection campaign are shown in figure 6.4. The C++-based simulator injects a fault pattern into the processor model by randomly picking bit(s) out of the total bits that form the registers. After fault injection, the simulation is halted after 2 cycles and the self-checking circuitry indicates whether the error has been detected or not. If detected, a counter is incremented; afterwards, a new simulation campaign starts with a new fault-injection profile (as shown in figure 6.4). Finally, a report of the total number of injected/detected errors is generated.
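The campaign loop can be sketched as follows. This is a toy stand-in: a single 16-bit register guarded by one even-parity bit plays the role of the full cycle-driven processor model, so only the inject/check/count structure of figure 6.4 is shown, not the real checkers:

```cpp
#include <cstdint>
#include <random>

// Skeleton of one fault-injection campaign: inject a random single-bit flip
// into a modeled register, let the checker observe it, and count injected
// vs. detected errors for the final report.
struct Campaign {
    int injected = 0, detected = 0;
    std::mt19937 rng{12345};              // fixed seed: reproducible campaigns

    void run(int n_faults) {
        std::uniform_int_distribution<int> bit(0, 15);
        for (int i = 0; i < n_faults; ++i) {
            const uint16_t golden = 0x0F0F;               // reference value
            uint16_t faulty = golden ^ (1u << bit(rng));  // inject one SBU
            ++injected;
            // Parity-based self-checking: any odd number of flipped bits
            // changes the word parity, so every SBU is detected.
            if (__builtin_parity(faulty) != __builtin_parity(golden))
                ++detected;
        }
    }
};
```

Because a single parity bit catches every odd-weight error, such a campaign reports a 100% detection ratio for SBUs, which is consistent with the single-bit results presented below.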
Two injection campaigns were conducted: one injecting single (SBU), double and triple (MBU) random error patterns, and another injecting random harsh errors (random weight from 1 up to 8). The results are presented in the graphs of figures 6.5, 6.6, 6.7 and 6.8, respectively.
Figure 6.5: Single bit error injection.
Figure 6.6: Double bit error injection.
For scenario 1, figure 6.5 shows that the processor detects 100% of the injected single-bit errors. The detection rates for double- and triple-bit errors are higher than 60% and 78%, respectively (figures 6.6 and 6.7). In scenario 2, where harsh patterns are used (1 up to 8 bits flipped randomly), the detection rate still remains significant, greater than 36% for all configurations, as shown in figure 6.8.
It is interesting to notice that, although very simple, low-cost detecting codes are used in the SCPC, the error coverage is still 100% for SBUs. Taking into account the small number of registers to protect in the processor core and the fact that the SCPC area is only a fraction of the total FT-processor area, using better codes in the SCPC could probably improve the FT level without a big impact on area.
This tends to show that the proposed FT-processor design approach is a useful one. It remains necessary to evaluate the impact of increasing error rates on speed performance.
Figure 6.7: Triple bit error injection.
Figure 6.8: Harsh (1 up to 8 bit randomly) error injection.
6.4 Performance Degradation due to Re-execution<br />
To measure the impact of transient faults on system performance, we have evaluated the performance degradation on different sets of benchmarks through simulations. The average number of clock ticks per operation (CPO) has been measured for different EIRs as an indicator of speed performance (the higher the value, the lower the performance), and hence of the performance degradation under different error injection conditions.
In the pipelined journalized stack processor, all instructions execute in one clock cycle (except STORE). Therefore, the average number of clock cycles per operation (CPO), or clock cycles per instruction, is ideally unity. However, when an error is detected, a rollback is executed, which increases the overall time penalty. The greater the rate of rollback, the higher the average CPO,
Figure 6.9: Performance Degradation due to re-execution.
because more clock cycles will be needed to accomplish the required task (see figure 6.9). In other words, the greater the average CPO, the lower the overall performance.
The benchmarks have already been discussed in section 3.7. Table 6.1 summarizes the percentage profiles of reads from and writes to the DM induced by the instructions of each benchmark group running on the SCPC. Note that the instruction set of the SCPC has 36 instructions, 23 of which involve reading from or writing to the memory.
Table 6.1: Read/write profiles in benchmark groups
Group   Read   Write
I       45%    39%
II      57%    38%
III     50%    38%
6.4.1 Evaluating Performance Degradation<br />
The goal is to measure the effect of the SD length on the re-execution penalty. We have drawn graphs of the average clock cycles per operation (CPO) vs. EIR for the different benchmarks. Figures 6.10, 6.11 and 6.12 present the results for the benchmarks of groups I, II and III, respectively, for different SDs (10, 20, 50 and 100). The penalty of loading the SEs has not been considered here. The errors have been injected into the processor and the journal at different EIRs. The analysis of the graphs shows that the curves tend to overlap for the lower values of EIR. This is logical since, in the absence of errors, no extra time penalty due to rollback is incurred, whatever benchmark is used.
Figure 6.10: Simulation curves for group-I.
Figure 6.11: Simulation curves for group-II.
In figure 6.10, moving from point A to B corresponds to a tenfold increase of the error rate. The corresponding increase in CPO remains low (almost unchanged for SD=10 and 20, about 1.1 for SD=50 and 1.6 for SD=100), meaning little or no degradation of speed performance. Similarly, the move from A to C corresponds to a hundredfold error-rate increase: the CPO remains very low for SD=10, and lower than 2 for SD=20 and 50. Similar observations can be made from the graphs of figures 6.11 and 6.12: even with a hundredfold increase in error rate, the time penalty for the lower SDs remains reasonable, which indicates good performance.
At higher EIR, the smaller SDs are the ones that incur the lower time penalty. This is coherent with the predicted results. Indeed, for a given error rate, the risk that a sequence be
Figure 6.12: Simulation curves for group-III.
invalidated is higher for a longer SD, leading to a higher rollback rate.
Taking into account that the SCPC architecture requires little time to save the SEs, it is possible to select a short SD and still keep a good level of performance. Furthermore, this allows a smaller SCHJ depth to be chosen, with reduced area consumption, and it further reduces the risk that errors accumulate in the SCHJ and induce a non-recoverable error.
Figure 6.13: Effect of EIR on rollback for benchmarks group-I.
Figure 6.14: Effect of EIR on rollback for benchmarks group-II.
Figure 6.15: Effect of EIR on rollback for benchmarks group-III.
6.5 Effect of Error Injection on Rate of Rollback<br />
An increase in the rate of rollback is a performance-limiting factor because of the time penalty of re-executing sequences. In this section we therefore analyze how the rate of rollback grows with increasing EIR. For a higher EIR, the rate of re-execution will also increase, which will
further decrease the overall performance. If the error probability is known, it is possible to find the optimal number of checkpoints and possible rollbacks [VSL09]; in a real system, however, the error probability is not known in advance and is difficult to estimate.
In the rollback mechanism, there are two performance-limiting factors: (i) the time taken to store/reload the SEs and (ii) the length of the sequence (SD). To reduce the time penalty of reloading the SEs, long sequences are needed, so that the overall number of SE loads and stores is smaller. This behavior needs to be confirmed by artificial error injection.
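Under a simplifying i.i.d.-error assumption of ours (not the thesis's measured data), the expected rollback count can also be written down directly, which is useful as a sanity check against the injection results:

```cpp
#include <cmath>

// Expected number of rollbacks for a program of n_instr instructions split
// into sequences of length sd, with per-instruction error probability eir.
// A sequence is hit with probability q = 1 - (1 - eir)^sd and is re-executed
// until it validates, i.e. q / (1 - q) rollbacks per sequence on average.
double expected_rollbacks(double n_instr, double sd, double eir) {
    double q = 1.0 - std::pow(1.0 - eir, sd);
    return (n_instr / sd) * q / (1.0 - q);
}
```

The model reproduces the qualitative trend of the figures: the rollback count grows with EIR, and for large SD each rollback additionally wastes a whole long sequence.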
Consequently, we have artificially injected errors into the FT processor to observe their effect on<br />
the rollback mechanism, as shown in figures 6.13, 6.14 and 6.15. (Note: for higher SDs such as 50 and<br />
100, the number of rollbacks at high EIRs is missing because the values fall outside the y-axis range<br />
of the graphs.) The simulation curves show that for low error rates the rollback rate is also low, and<br />
vice versa. Moreover, at higher error rates the effect of rollback is dominant for longer sequences<br />
(larger SD): there is a greater number of rollbacks, which again results in a time penalty and limits<br />
the overall performance.<br />
Therefore, it is advisable to use larger SDs with low error rates and smaller SDs with higher error<br />
rates. An optimal SD can be proposed once the final application is known; this is why the length of<br />
the SD is a user-defined parameter that can be adjusted according to the external environment.<br />
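The SD trade-off described above can be illustrated with a toy model. This is a sketch only: the checkpoint cost, cycle counts and error probabilities below are assumptions chosen for illustration, not measurements from the thesis.

```python
# Toy model (assumed figures, not measured data): a program of `total_cycles`
# cycles is cut into sequences of SD cycles.  Each sequence pays a fixed
# checkpoint cost to save/reload the SEs, and must be fully re-executed
# whenever an error hits it (per-cycle error probability p, i.e. the EIR).

def expected_cycles(total_cycles, sd, p, checkpoint_cost=10):
    p_seq_hit = 1.0 - (1.0 - p) ** sd     # probability an error hits a sequence
    tries = 1.0 / (1.0 - p_seq_hit)       # expected executions (geometric law)
    n_seq = total_cycles / sd
    return n_seq * (sd * tries + checkpoint_cost)

for p in (1e-4, 5e-3):                    # low vs high error rate
    best_sd = min((10, 20, 50, 100),
                  key=lambda sd: expected_cycles(10_000, sd, p))
    print(p, best_sd)
```

Under these assumed numbers the model reproduces the qualitative conclusion above: the larger SD wins at the low error rate, while a smaller SD becomes preferable at the high one.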
6.6 Comparison with LEON FT-3<br />
The LEON3 FT has been discussed previously in section 2.4. In this section, we present a qualitative<br />
comparison between the protection scheme of the LEON3 FT and that of the journalized stack<br />
processor.<br />
The LEON3 FT focuses on protecting the data storage rather than the functionality of the architecture.<br />
The overall scheme relies on ECC and on duplication of internal state. Most of the registers have 2-bit<br />
error detection, whereas a few have 4-bit error detection. There is no protection of the data path, the<br />
ALU functionality or the control unit.<br />
In the FT journalized stack processor, on the other hand, the focus is on protecting the architecture<br />
as a whole. The processor provides single-bit error detection, and the journal part can detect 2-bit<br />
errors. In a future version, other codes with higher coverage could be considered for the processor.<br />
The data path is protected, and the ALU and the control path can be protected without additional<br />
hardware overhead. In brief, the FT journalized processor is still in its development phase; it already<br />
shows interesting features and needs further optimization from the protection point of view.<br />
6.7 Conclusions<br />
In this chapter, we have validated the design of the journalized fault-tolerant stack processor.<br />
During validation, different parameters have been evaluated, such as the self-checking ability, the<br />
impact on time performance and the increase in the rollback rate due to error injection. Finally, the<br />
proposed model has been compared with the LEON3 FT.<br />
With single-error injection, 100% of the errors were detected in several experimental configurations.<br />
With double- and triple-bit error injection, the average detection percentages were about 60%<br />
and 78%, respectively. According to the results obtained with much worse error patterns (up to 8-bit<br />
patterns), correction is still possible, with a rather significant correction rate of about 36%.<br />
The performance degradation results were also satisfactory. The proposed architecture offers<br />
rather good performance even in the presence of high error rates; with large error rates, the time<br />
penalty can remain reasonable by using lower SDs. In practice, it is advisable to use a bigger SD with<br />
low error rates and a smaller SD with higher error rates. Knowing the final application and the average<br />
error profile related to the execution environment, it is possible to choose the most appropriate SD<br />
duration (which is left as a generic parameter in the synthesized models).<br />
GENERAL CONCLUSION AND PROSPECTS<br />
General Conclusions<br />
With the predicted evolutions in technology, soft errors in electronic circuits are becoming a major<br />
issue in the design of complex digital systems, especially in safety-critical applications.<br />
Indeed, current advances in nanotechnology, largely based on shrinking component dimensions,<br />
reducing the supply voltage and increasing the clock speed, are lowering the resulting noise margins. As a<br />
consequence, the sensitivity of digital circuits to high-energy particles and electromagnetic disturbances<br />
is rising very fast, making the probability of Single-Event Upset (SEU) and Multiple-Bit Upset<br />
(MBU) occurrence very high, not only in space but also in ground applications. Hence, taking the<br />
growing risk of these transient faults into account from the beginning of any digital electronic design<br />
is fast becoming a critical need.<br />
Ensuring proper operation, even in the presence of transient faults, requires that the system<br />
have some fault-tolerance capability. Next to the fault tolerance issue, the demand for larger, faster,<br />
more complex and flexible systems that are nevertheless easy to design is endless. Together with the<br />
enhanced means of on-chip communication (Network on Chip – NoC), the increased integration<br />
possibilities of modern electronic circuits now allow all the functionality of a full system to be grouped<br />
in a single chip (System on Chip – SoC). Among all the recent developments, the MPSoC (Multiprocessor<br />
System on Chip) design paradigm is becoming very popular for its capacity to provide both computational<br />
power and flexibility. It brings together a large number of processors (the processing nodes) interconnected<br />
by a NoC (the inter-node communication means). As an MPSoC is not naturally immune<br />
to transient faults, an obvious goal is to develop on-chip capacity for fault tolerance and hence a fault-<br />
tolerant processor to be used as a "processing node".<br />
The work presented in this thesis is dedicated to the design of such an FT processor<br />
using a new architectural approach, the design goals including a high level of protection<br />
against transient faults along with reasonable performance and area overheads. It was clear from the<br />
beginning that severe constraints on area consumption would apply to the architectural<br />
design of the processing node in order to match the massively parallel objective, while preserving<br />
the node performance as much as possible.<br />
The concepts chosen as the basis of the design methodology are on-line concurrent error detection<br />
and error recovery through rollback execution. Central to the new architecture are a<br />
self-checking processor core and a hardware journalization mechanism. The processor core, devised<br />
in the MISC class instead of the classic RISC or CISC classes, is a self-checking processor inspired by the<br />
canonical stack processor [KJ89], able to offer a rather good level of performance with only a limited<br />
amount of hardware. The architectural simplicity (a small amount of logic resources<br />
and internal storage) and the great compactness of the code are important characteristics, favorable<br />
to the self-checking capability and to the rollback recovery implementation. Next to the processor core, a self-<br />
checking hardware journal dedicated to the journalization mechanism prevents error propagation from<br />
the processor core to the main memory and limits the impact of rollback on time performance. Among<br />
the underlying hypotheses, the main memory is assumed to be dependable, i.e. data is<br />
kept reliably in it without any risk of corruption.<br />
On the occurrence of a transient fault, data can be corrupted in the processor. Such errors can be<br />
detected in the processor core but not corrected. Hence, erroneous data could flow out of the processor<br />
core and end up in the dependable memory were the hardware journal not used. The DM<br />
would then no longer be a trustable place, and implementing a software recovery mechanism<br />
would be rather painful, with a lot of data redundancy being necessary in the memory device. Classical<br />
rollback techniques operate with checkpointing: at regular time intervals, the processor state<br />
and the produced data are saved, allowing rollback to the saved point in case of error detection. The best-<br />
suited sequence duration (the distance between two checkpoints) depends on application constraints and<br />
on error occurrence rates. While a larger sequence duration may limit the impact of the rollback<br />
mechanism on time performance in the absence of errors, it requires a larger hardware journal and,<br />
moreover, increases the risk of rollback activation in case of error occurrence.<br />
Data produced in the current sequence can be discarded in case of error detection, as it can be<br />
generated again from the last saved checkpoint. On sequence validation, with no error occurrence in<br />
the ending sequence, the related data is validated and must be transferred to the dependable memory.<br />
Error control coding techniques are used to detect errors in the processor core and in the unvalidated<br />
data in the journal, and to correct errors in the validated data part of the journal.<br />
The fault-tolerant processor architecture has been modeled in VHDL at the RT level and then<br />
synthesized using Altera Quartus II to determine the area requirements and the maximum operating frequency.<br />
Simulated error injection campaigns have been used to determine the effectiveness of the proposed<br />
fault tolerance strategy under different faulty scenarios (varying the error rates and error pattern profiles)<br />
and different sequence durations.<br />
The self-checking ability of the fault-tolerant processor was tested for Single-Event Upsets (SEUs,<br />
1-bit error pattern) and Multiple-Bit Upsets (2- up to 8-bit error patterns in a single 16-bit data word).<br />
Considering SEUs, 100% of the errors are detected and error recovery is close to 100% even for high<br />
error injection rates. With 2-bit and 3-bit patterns, the average detection percentages were about 60%<br />
and 78%, respectively. When harder conditions are considered, with error patterns of up to 8 bits,<br />
correction is still possible with correction rates of about 36%.<br />
Similarly, the performance degradation due to error injection was evaluated. Error recovery being<br />
based on rollback execution upon error detection, the instructions in the faulty sequence are re-executed<br />
from the previously preserved states, hence adding a time penalty, i.e. performance degradation. Higher<br />
error injection rates induce higher rollback rates, resulting in lower performance. The analysis of the<br />
measured performance degradation curves shows that the proposed architecture offers reasonably good<br />
performance even in the presence of high error rates. It also shows that the optimal sequence duration<br />
depends on the average error injection rate and should be adjusted according to the application's external<br />
environment.<br />
Practically, the experimental results demonstrate that the principle of journalization can be rather<br />
effective on a stack-computing-based processor core architecture, and deserves further research effort<br />
to enhance its performance and protection capability.<br />
The future work is divided into two aspects: protection and performance. From the protection point<br />
of view, there is a need to improve the error coverage in the processor part. Presently, simple parity<br />
can only detect odd-bit errors; the challenge is to find codes with low hardware overhead. Moreover,<br />
the opcode (in the control circuitry) can be protected with ECC: in the MISC-based stack methodology<br />
there are 37 instructions, while the present opcode is 8 bits wide, which leaves capacity to add redundancy<br />
bits without additional overhead.<br />
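The parity limitation mentioned above can be illustrated with a short sketch. The 16-bit word width follows the thesis; the specific word and error masks are arbitrary examples.

```python
# A minimal sketch of why simple (even) parity detects only odd-bit errors:
# the parity of a 16-bit word changes if and only if an odd number of bits flip.

def parity(word: int) -> int:
    return bin(word & 0xFFFF).count("1") & 1

def detected(word: int, flip_mask: int) -> bool:
    corrupted = word ^ flip_mask          # inject the error pattern
    return parity(corrupted) != parity(word)

word = 0xA5C3
assert detected(word, 0b1)                # 1-bit error: detected
assert detected(word, 0b10101)            # 3-bit error: detected
assert not detected(word, 0b11)           # 2-bit error: missed
assert not detected(word, 0b1111)         # 4-bit error: missed
```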
From the performance point of view, architectural optimization is required, mainly of the hardware<br />
journal part. The present processor has a critical path in the error-correcting circuitry and the write to<br />
the DM; if this task were split into a 2-stage pipeline, the overall performance could improve a lot. Another<br />
possible improvement is to overcome the performance overhead due to conditional branches, by loading<br />
both targets of the jump into the IB.<br />
In the long term, the continuation of this work should be dedicated to the integration of this fault-<br />
tolerant processor architecture as a building block of a fault-tolerant MPSoC.<br />
Appendix A<br />
Canonical Stack Computers:<br />
The canonical stack processor [KJ89] has been chosen as the basis of the fault-tolerant processor<br />
core. Its characteristics mostly resemble those of second-generation stack machines, which are more<br />
cost-effective than the first generation. In this section we briefly discuss the construction of the canonical<br />
stack machine, as it is helpful for understanding the similarities with, and differences from, the<br />
proposed stack machine.<br />
Figure A.1 shows the block diagram of the Canonical Stack Machine. Each block represents a<br />
logical resource: the data bus, the Data Stack (DS), the Return Stack (RS), the Arithmetic/Logic<br />
Unit (ALU), the Top Of Stack register (TOS), the Program Counter (PC), the Memory Address Register (MAR),<br />
the Instruction Register (IR), and an Input/Output unit (I/O). For simplicity, the canonical machine<br />
is represented in figure A.1 with a single data bus, but real processors may have more than one<br />
data path, to allow instruction fetching and calculations to proceed in parallel.<br />
The DS is a buffer that works according to the LIFO (Last In First Out) principle. Only two<br />
operations, PUSH and POP, can take place on the DS. In a PUSH, the new data element is written at the<br />
topmost position of the DS and the old values are shifted one position downwards. In a POP, the<br />
top value residing in the stack is placed on the data bus and the next cell of the stack<br />
is shifted one place upwards, and so on. The RS is likewise a LIFO implementation; the only<br />
difference is that the return stack is used to store subroutine return addresses instead of instruction<br />
operands.<br />
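The PUSH/POP behaviour described above can be sketched in software. This is a behavioural model only, not the RTL; it assumes the two topmost elements are held in TOS/NOS-style registers backed by a pointer-indexed buffer, as in the register-transfer pseudocode of Appendix B.

```python
# Behavioural sketch of a LIFO data stack with TOS/NOS registers and a
# RAM-based buffer indexed by the data stack pointer (DSP).

class DataStack:
    def __init__(self, depth=16):
        self.ds = [0] * depth   # on-chip stack buffer
        self.dsp = -1           # data stack pointer (empty stack)
        self.tos = 0            # Top Of Stack register
        self.nos = 0            # Next Of Stack register

    def push(self, value):
        self.dsp += 1
        self.ds[self.dsp] = self.nos   # spill old NOS into the buffer
        self.nos = self.tos
        self.tos = value

    def pop(self):
        value = self.tos
        self.tos = self.nos
        self.nos = self.ds[self.dsp]   # refill NOS from the buffer
        self.dsp -= 1
        return value

s = DataStack()
for v in (1, 2, 3):
    s.push(v)
assert [s.pop(), s.pop(), s.pop()] == [3, 2, 1]   # LIFO order
```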
The program memory block has both a Memory Address Register (MAR) and a reasonable<br />
amount of random-access memory. To access the memory, the MAR is first written with the address<br />
to be read or written. Then, on the next system cycle, the program memory is either read onto or<br />
written from the data bus accordingly.<br />
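The two-cycle access protocol just described can be sketched as follows. The class and method names are invented for illustration; only the latch-then-access behaviour comes from the text.

```python
# Behavioural sketch of MAR-based memory access: cycle 1 latches the
# address into the MAR; cycle 2 reads onto, or writes from, the data bus.

class ProgramMemory:
    def __init__(self, size=256):
        self.ram = [0] * size
        self.mar = 0                    # Memory Address Register

    def latch_address(self, addr):      # cycle 1: write the MAR
        self.mar = addr

    def read(self):                     # cycle 2: read onto the data bus
        return self.ram[self.mar]

    def write(self, data_bus):          # cycle 2: write from the data bus
        self.ram[self.mar] = data_bus

mem = ProgramMemory()
mem.latch_address(0x10)
mem.write(0xA5)                         # two-cycle write
mem.latch_address(0x10)
assert mem.read() == 0xA5               # two-cycle read
```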
Figure A.1: Canonical Stack Machine [KJ89]. (Blocks: Data Stack (DS), Return Stack (RS), I/O, Control Logic & IR, ALU, PC, MAR, Program Memory and TOS register, connected by the data bus; data and address lines link the MAR and the program memory.)<br />
Appendix B<br />
Instruction Set of Stack Processor<br />
Arithmetic and logic operations<br />
The basic arithmetic and logic operations in table B.1 are the same as in the canonical machine, but they<br />
have been adapted to our needs. Some additional instructions, such as Addition with carry<br />
(ADC), Subtraction with carry (SUBC), Modulus (MOD), Negative (NEG), NOT-operation (NOT),<br />
Increment (INC), Decrement (DEC) and Sign (SIGN), have been added. These additional instructions<br />
provide more flexibility when programming the Stack Processor. All the instructions are described in<br />
register-transfer-level pseudocode, which is assumed to be self-explanatory.<br />
Stack manipulation operations<br />
Pure stack machines can only access the two top elements of the stack for arithmetic operations. Therefore,<br />
some extra instructions are always needed in order to reach operands other than the TOS, NOS<br />
or TORS. Here, such instructions include Rotate (ROT), RS to DS (R2D), DS to RS (D2R) and Copy RS<br />
to DS (CPR2D). R2D, D2R and CPR2D are generally used for shuffling the DS and RS. (The pseudocode<br />
of all the instructions in this table corresponds to the non-pipelined version of the proposed model.)<br />
Memory Fetch and Store<br />
All the arithmetic and logical operations are performed on the data elements of the stack, so there<br />
must be some way of loading information onto the stack and of storing data to the memory. The<br />
register-transfer pseudocode is given in table B.3 below.<br />
Loading Literals<br />
There must be a way to get constants onto the stack. The instructions to do so include LIT and<br />
DLIT, which load a byte and a word of data, respectively, onto the DS, as shown in table B.4.<br />
Conditional branch<br />
Table B.1: Arithmetic and logic operations<br />
Symbol Instruction Operations<br />
ADD Addition TOS ⇐ TOS + NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
ADC Addition with carry TOS ⇐ TOS + NOS<br />
NOS ⇐ Cout<br />
SUB Subtraction TOS ⇐ TOS - NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MUL Multiplication TOS ⇐ TOS × NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DIV Division TOS ⇐ TOS ÷ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MOD Modulus TOS ⇐ TOS mod NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
AND AND-operation TOS ⇐ TOS & NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
OR OR-operation TOS ⇐ TOS | NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
XOR XOR-operation TOS ⇐ TOS xor NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
NEG Negative TOS ⇐ -TOS<br />
NOT NOT-operation TOS ⇐ not TOS<br />
INC Increment TOS ⇐ TOS + 1<br />
DEC Decrement TOS ⇐ TOS - 1<br />
SIGN Sign if (TOS < 0) then TOS ⇐ 0x0000<br />
When processing data there is a need to take decisions, so the machine must support<br />
conditional branches. The conditional jumps can depend on various conditions.<br />
Subroutine Calls<br />
In a stack machine, most of the instructions are executed between the TOS and the NOS, but in this<br />
architecture, to improve the flexibility of stack-based machines, an RS (Return Stack) is added along with the<br />
Table B.2: Stack manipulation operations<br />
Symbol Instruction Operations<br />
DROP Drop TOS ⇐ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DUP Duplication DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
SWAP Swap TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
OVER Over DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
ROT Rotate TOS ⇐ DS[DSP]<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
R2D Return Stack to Data Stack TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
TORS ⇐ RS[RSP]<br />
RSP ⇐ RSP - 1<br />
CPR2D Copy Return Stack to Data Stack TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
D2R Data Stack to Return Stack TORS ⇐ TOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
RSP ⇐ RSP + 1<br />
RS[RSP] ⇐ TORS<br />
RET Return IP ⇐ TORS<br />
TORS ⇐ RS [RSP]<br />
RSP ⇐ RSP - 1<br />
DS (Data Stack). The proposed machine can efficiently call subroutines.<br />
A subroutine call pushes the value in the PC onto the TOS, and sometimes a known<br />
address can be written directly onto the TOS, as shown below in table B.6.<br />
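The subroutine linkage through the return stack can be sketched as follows. The RET part mirrors the pseudocode of table B.2; the call-side sequence is an assumption made for illustration, and the class is a behavioural model, not the RTL.

```python
# Behavioural sketch of call/return through the return stack (RS):
# a call saves the return address in TORS (spilling the old TORS into the
# RS buffer); RET restores it, as in the RET pseudocode of table B.2.

class Machine:
    def __init__(self):
        self.ip = 0            # instruction pointer
        self.rs = [0] * 16     # return stack buffer
        self.rsp = -1          # return stack pointer
        self.tors = 0          # Top Of Return Stack register

    def call(self, target):
        self.rsp += 1
        self.rs[self.rsp] = self.tors   # spill old TORS (assumed sequence)
        self.tors = self.ip + 1         # return address: next instruction
        self.ip = target

    def ret(self):
        self.ip = self.tors             # IP <= TORS
        self.tors = self.rs[self.rsp]   # TORS <= RS[RSP]
        self.rsp -= 1                   # RSP <= RSP - 1

m = Machine()
m.ip = 5
m.call(100)        # nested calls return in LIFO order
m.call(200)
m.ret()
assert m.ip == 101
m.ret()
assert m.ip == 6
```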
Push and Pop<br />
Table B.3: Memory Fetch and Store<br />
Symbol Instruction Operations<br />
FETCH Fetch Mem_Addr ⇐ TOS<br />
TOS ⇐ Mem<br />
STORE Store Mem_Addr ⇐ TOS<br />
Mem ⇐ NOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ DS[DSP]<br />
DSP ⇐ DSP - 1<br />
NOS ⇐ DS[DSP]<br />
Table B.4: Loading Literals<br />
Symbol Instruction Operations<br />
LIT d8 Writing data (Byte size) to TOS DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(byte)<br />
DLIT d16 Writing data (Word size) to TOS DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(word)<br />
Table B.5: Conditional Branch<br />
Symbol Instruction Operations<br />
ZBRA d Jump to ‘d’ if TOS = 0 if (TOS=0) then<br />
IP ⇐ IP + d<br />
SBRA d Jump to ‘d’ if TOS < 0 if (TOS < 0) then<br />
IP ⇐ IP + d<br />
Table B.7: Push and Pop<br />
Symbol Instruction Operations<br />
PUSH DSP Push Data Stack Pointer DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ DSP<br />
POP DSP Pop Data Stack Pointer DSP ⇐ TOS<br />
TOS ⇐ NOS<br />
DSP ⇐ DSP-1<br />
NOS ⇐ DS[DSP]<br />
PUSH RSP Push Return Stack Pointer DSP ⇐ DSP + 1<br />
DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ RSP<br />
POP RSP Pop Return Stack Pointer RSP ⇐ TOS<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
DSP ⇐ DSP-1<br />
B.1 Data Operations in Stack Processor:<br />
Stack machines manipulate data using postfix operations. Such notation is also<br />
called ‘Reverse Polish’ notation. In postfix operations the operators<br />
come after the operands, and each operator acts upon the most recently seen operands. For example,<br />
consider the following expression:<br />
(24 + 04) × 82<br />
In postfix representation this expression becomes<br />
82 24 04 + ×<br />
Postfix expressions are usually shorter than their infix counterparts. The stack processor can execute<br />
postfix expressions directly, without further burdening the compiler.<br />
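Direct postfix execution on a data stack can be sketched with a minimal evaluator. The token set and function name are illustrative; the example reuses the expression above.

```python
# Minimal sketch of postfix (Reverse Polish) evaluation on a data stack:
# operands are pushed; an operator pops its two operands and pushes the result.

def eval_postfix(tokens):
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            nos, tos = stack.pop(), stack.pop()   # two topmost operands
            stack.append(ops[tok](tos, nos))
        else:
            stack.append(int(tok))
    return stack.pop()

# "82 24 04 + ×" from the example above (× written as *):
assert eval_postfix("82 24 04 + *".split()) == (24 + 4) * 82   # 2296
```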
Appendix C<br />
Instruction Set of Pipelined Stack Processor<br />
We have analyzed the multi-clock instructions to explore the possible conflicts between various<br />
instructions. We have found that all the multi-clock instructions can be subdivided into two parts: the<br />
first part consists of DSP+1/RSP+1, depending on the type of instruction, while the second part contains<br />
the rest of the instruction; such instructions can be recognized by the code ‘111’. Thanks to this<br />
pipelining, the DSP+1/RSP+1 part of the next instruction can be pre-executed along with the current<br />
instruction, and the remaining part is executed in the next clock cycle. Hence, after pipelining, all<br />
the instructions execute in a single clock cycle except STORE, which requires 2 clock cycles (before<br />
pipelining, the STORE instruction required 3 clock cycles): STORE needs to update the DSP twice,<br />
which cannot be done in a single clock cycle, and the rest of the instruction is executed in the next clock.<br />
All the instructions are divided into the two stages so that, after the implementation of the two-stage<br />
pipeline, each instruction is executed in one clock cycle. The complete list of the instructions is<br />
given below.<br />
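The effect of the two-stage split on instruction timing can be illustrated with a toy cycle counter. The per-instruction table below is an assumption derived from the description above, not the RTL.

```python
# Illustrative sketch: stage 1 (the DSP+1/RSP+1 pre-increment, when present)
# overlaps with the previous instruction, so most instructions retire in one
# clock; STORE keeps a second clock because it must update the DSP twice.

# instruction -> (stage-1 micro-op or None, clocks taken by stage 2)
SPLIT = {
    "ADD":   (None,    1),
    "DUP":   ("DSP+1", 1),
    "LIT":   ("DSP+1", 1),
    "STORE": (None,    2),   # two pointer updates cannot share one clock
}

def cycles(program):
    clocks = 0
    for instr in program:
        _stage1, stage2_clocks = SPLIT[instr]
        clocks += stage2_clocks   # stage 1 is hidden behind the previous op
    return clocks

assert cycles(["LIT", "DUP", "ADD"]) == 3     # one clock each
assert cycles(["LIT", "LIT", "STORE"]) == 4   # STORE takes two clocks
```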
Table C.1: Instruction set of stack processor (pipelined model)<br />
Instructions First Stage Second Stage<br />
ADD NOP TOS ⇐ TOS + NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
ADC NOP TOS ⇐ TOS + NOS<br />
NOS ⇐ Cout<br />
SUB NOP TOS ⇐ TOS - NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MUL NOP TOS ⇐ TOS × NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DIV NOP TOS ⇐ TOS ÷ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
MOD NOP TOS ⇐ TOS mod NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
AND NOP TOS ⇐ TOS & NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
OR NOP TOS ⇐ TOS | NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
XOR NOP TOS ⇐ TOS xor NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
NEG NOP TOS ⇐ -TOS<br />
NOT NOP TOS ⇐ not TOS<br />
INC NOP TOS ⇐ TOS + 1<br />
DEC NOP TOS ⇐ TOS - 1<br />
SIGN NOP if (TOS < 0) then TOS ⇐ 0x0000<br />
Table C.2: Stack manipulation operations<br />
Instructions First stage Second Stage<br />
DROP NOP TOS ⇐ NOS<br />
NOS ⇐ DS [DSP]<br />
DSP ⇐ DSP - 1<br />
DUP DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
SWAP NOP TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
OVER DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
TOS ⇐ NOS<br />
NOS ⇐ TOS<br />
ROT NOP TOS ⇐ DS[DSP]<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
R2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
TORS ⇐ RS[RSP]<br />
RSP ⇐ RSP - 1<br />
CPR2D DSP ⇐ DSP + 1 TOS ⇐ TORS<br />
NOS ⇐ TOS<br />
DS[DSP] ⇐ NOS<br />
D2R RS[RSP] ⇐ TORS TORS ⇐ TOS<br />
DSP ⇐ DSP - 1<br />
TOS ⇐ NOS<br />
NOS ⇐ DS[DSP]<br />
RSP ⇐ RSP + 1<br />
RET NOP IP ⇐ TORS<br />
TORS ⇐ RS [RSP]<br />
RSP ⇐ RSP - 1<br />
Table C.3: Memory Fetch and Store<br />
Instructions First Stage Second Stage<br />
FETCH Mem_Addr ⇐ TOS TOS ⇐ Mem<br />
STORE Mem_Addr ⇐ TOS Mem ⇐ NOS<br />
TOS ⇐ DS[DSP] NOS ⇐ DS[DSP]<br />
DSP ⇐ DSP - 1 DSP ⇐ DSP - 1<br />
Table C.4: Loading Literals<br />
Instructions First Stage Second Stage<br />
LIT d8 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(byte)<br />
DLIT d16 DSP ⇐ DSP + 1 DS[DSP] ⇐ NOS<br />
NOS ⇐ TOS<br />
TOS ⇐ data(word)<br />
Table C.5: Conditional Branch<br />
Instructions First Stage Second Stage<br />
ZBRA d NOP if (TOS=0) then<br />
IP ⇐ IP + d<br />
SBRA d NOP if (TOS < 0) then<br />
IP ⇐ IP + d<br />
Table C.8: Instruction Codes and Instruction Lengths<br />
b7 b6 b5 b4 b3 b2 b1 b0 Instruction Length<br />
0 0 0 0 0 0 0 0 NOP 0-byte<br />
1 0 0 0 0 0 0 0 ADD 1-byte<br />
1 0 0 0 0 0 0 1 ADC 1-byte<br />
1 0 0 0 0 0 1 0 SUB 1-byte<br />
1 0 0 0 0 0 1 1 SUBC 1-byte<br />
1 0 0 0 0 1 0 0 MUL 1-byte<br />
1 0 0 0 0 1 0 1 DIV 1-byte<br />
1 0 0 0 0 1 1 0 MOD 1-byte<br />
1 0 0 0 0 1 1 1 AND 1-byte<br />
1 0 0 0 1 0 0 0 OR 1-byte<br />
1 0 0 0 1 0 0 1 XOR 1-byte<br />
1 0 0 0 1 0 1 0 NEG 1-byte<br />
1 0 0 0 1 0 1 1 NOT 1-byte<br />
1 0 0 0 1 1 0 0 INC 1-byte<br />
1 0 0 0 1 1 0 1 DEC 1-byte<br />
1 0 0 0 1 1 1 0 SIGN 1-byte<br />
1 0 0 0 1 1 1 1 DROP 1-byte<br />
1 0 0 1 0 0 0 0 DUP 1-byte<br />
1 0 0 1 0 0 0 1 SWAP 1-byte<br />
1 0 0 1 0 0 1 0 OVER 1-byte<br />
1 0 0 1 0 0 1 1 ROT 1-byte<br />
1 0 0 1 0 1 0 0 R2D 1-byte<br />
1 0 0 1 0 1 0 1 CPR2D 1-byte<br />
1 0 0 1 0 1 1 0 D2R 1-byte<br />
1 0 0 1 0 1 1 1 FETCH 1-byte<br />
1 0 0 1 1 0 0 0 STORE 1-byte<br />
1 0 0 1 1 0 0 1 PUSH_DSP 1-byte<br />
1 0 0 1 1 0 1 0 POP_DSP 1-byte<br />
1 0 0 1 1 0 1 1 PUSH_RSP 1-byte<br />
1 0 0 1 1 1 0 0 POP_RSP 1-byte<br />
1 0 1 0 0 0 0 0 LIT a 2-bytes<br />
1 1 0 0 0 0 0 0 DLIT a 3-bytes<br />
1 1 1 0 0 0 0 0 RET 1-byte + IP-change<br />
0 0 1 0 0 0 0 1 ZBRA 2-bytes + IP-change<br />
0 0 1 0 0 0 1 0 SBRA 2-bytes + IP-change<br />
0 1 0 0 0 0 0 0 LBRA 3-bytes + IP-change<br />
0 1 0 0 0 0 0 1 CALL a 3-bytes + IP-change<br />
Appendix D<br />
List of Acronyms<br />
ALU : Arithmetic Logic Unit.<br />
ASIC : Application Specific Integrated Circuits.<br />
BER : Backward Error Recovery.<br />
BIED : Built-In Error Detection Schemes.<br />
BPSG : Boro-Phos-Silicate-Glass.<br />
CED : Concurrent Error Detection.<br />
CISC : Complex Instruction Set Computer.<br />
CMOS : Complementary Metal Oxide Semiconductor.<br />
CPO : Clock Per Operation.<br />
CPI : Clock Per Instruction.<br />
CRC : Cyclic Redundancy Codes.<br />
DCR : Dual-Checker Rail.<br />
DED : Double Error Detection.<br />
DM : Dependable Memory.<br />
DMR : Dual Modular Redundancy.<br />
DS : Data Stack.<br />
DSP : Data Stack Pointer.<br />
DWC : Duplication With Comparison.<br />
DWCR : Duplication With Complement Redundancy.<br />
ECC : Error Control Coding.<br />
EDC : Error Detecting Codes.<br />
EDCC : Error Detecting and Correction Codes.<br />
EDP : Error Detecting Processor.<br />
ESS : Electronic Switching Systems.<br />
FER : Forward Error Recovery.<br />
FPGA : Field Programmable Gate Array.<br />
FT : Fault Tolerant.<br />
FTMP : Fault Tolerant Multi-Processor.<br />
HD : Hamming Distance.<br />
HDL : Hardware Description Language.<br />
HW : Hardware.<br />
IEEE : Institute of Electrical and Electronics Engineers.<br />
IB : Instruction Buffer.<br />
IBMU : Instruction Buffer Management Unit.<br />
IP : Instruction Pointer.<br />
ISA : Instruction Set Architecture.<br />
LICM : Laboratoire Interfaces Capteurs et Micro-électronique.<br />
LIFO : Last In First Out.<br />
MBU : Multiple Bit Upsets.<br />
MCU : Multiple Cell Upsets.<br />
MISC : Minimum Instruction Set Computer.<br />
MPSoC : Multi-Processor System on Chip.<br />
NASA : National Aeronautics and Space Administration.<br />
NoC : Network on Chip.<br />
NOS : Next Of data Stack.<br />
PC : Program Counter.<br />
RAM : Random Access Memory.<br />
REE : Remote Exploration and Experimentation.<br />
RESO : Redundant Execution with Shifted Operands.<br />
RISC : Reduced Instruction Set Computer.<br />
RS : Return Stack.<br />
RSP : Return Stack Pointer.<br />
RTL : Register Transfer Level.<br />
SCHJ : Self-Checking Hardware Journal.<br />
SCPC : Self-Checking Processor Core.<br />
SD : Sequence Duration.<br />
SE : State determining Elements.<br />
SEB : Single Event Burnout.<br />
SEE : Single Event Effect.<br />
SEGR : Single Event Gate Rupture.<br />
SEFI : Single Event Functional Interrupt.<br />
SEL : Single Event Latchup.<br />
SEU : Single Event Upset.<br />
SET : Single Event Transient.<br />
SEC : Single Error Correction.<br />
SW : Software.<br />
SIFT : Software Implemented Fault Tolerance.
SoC : System on Chip.<br />
STAR : Self-Testing and Repair.<br />
TMR : Triple Modular Redundancy.<br />
TORS : Top Of Return Stack.<br />
TOS : Top Of data Stack.<br />
UJ : Un-validated Journal.<br />
VP : Validation Point.<br />
UVD : Un-Validated Data.<br />
VD : Validated Data.<br />
VHDL : VHSIC Hardware Description Language.<br />
VJ : Validated Journal.<br />
Appendix E<br />
List of publications<br />
• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE,<br />
“A Self-Checking HW Journal for a Fault Tolerant Processor Architecture,” International Jour-<br />
nal of Reconfigurable Computing 2011 (IJRC’11) (Accepted).<br />
• Mohsin AMIN, Abbas RAMAZANI, Fabrice MONTEIRO, Camille DIOU, Abbas DANDACHE,<br />
“A Dependable Stack Processor Core for MPSoC Development,” XXIV Conference on Design<br />
of Circuits and Integrated Systems (DCIS’09), Zaragoza, Spain, November 18-20, 2009.<br />
• Mohsin AMIN, Fabrice MONTEIRO, Camille DIOU, Abbas RAMAZANI, Abbas DANDACHE,<br />
“A HW/SW Mixed Mechanism to Improve the Dependability of a Stack Processor,” 16th<br />
IEEE International Conference on Electronics, Circuits, and Systems (ICECS’09), Hammamet,<br />
Tunisia, December 13-16, 2009.<br />
• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE,<br />
“Journalized Stack Processor for Reliable Embedded Systems,” 1st International Conference<br />
on Aerospace Science and Engineering (ICASE’09), Islamabad, Pakistan, August 18-20, 2009.<br />
• A. Ramazani, M. Amin, F. Monteiro, C. Diou, A. Dandache, “A Fault Tolerant Journalized<br />
Stack Processor Architecture,” 15th IEEE International On-Line Testing Symposium (IOLTS’09),<br />
Sesimbra-Lisbonne, Portugal, 24–27 June 2009.<br />
• Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE,<br />
“Error Detecting and Correcting Journal for Dependable Processor Core,” GDR System on Chip<br />
- System in Package (GDR-SoC-SiP’10), Cergy-Paris, France, 9-11 June 2010.<br />
• Mohsin Amin, Camille Diou, Fabrice Monteiro, Abbas Ramazani, “Design Methodology of<br />
Reliable Stack Processor Core,” GDR System on Chip - System in Package 2009 (GDR-SoC-<br />
SiP’09), Orsay-Paris, France, 9-11 June 2010.<br />
• Mohsin AMIN, “Self-Organization in Embedded Systems,” 2nd Winter School on Self Organi-<br />
zation in Embedded Systems, Schloss Dagstuhl, Germany, November 2007.
List of Figures<br />
1.1 An alpha particle hits a CMOS transistor. The particle generates electron-hole pairs<br />
in its wake, causing a charge disturbance [MW04] . . . . . . . . . . . . . . . . . . . 16<br />
1.2 A high-energy particle strike resulting in error(s) . . . . . . . . . . . . . . . . . . . 16<br />
1.3 Classification of faults on the basis of single event effects (SEE) [Pie07]. . . . . . . 17<br />
1.4 Dependability Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.5 Fault, error and failure chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
1.6 Error propagation from processor to main memory . . . . . . . . . . . . . . . . . . 21<br />
1.7 A single fault caused failure of traffic control system . . . . . . . . . . . . . . . . . 22<br />
1.8 Service failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.9 Fault characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />
1.10 Some causes of fault occurrence. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
1.11 Dependability techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />
1.12 Sequence of events from ionization to failure, and a set of fault-tolerant techniques<br />
applied at different times [Pie07]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
2.1 General architecture of a concurrent error detection scheme [MM00] . . . . . . . . 32<br />
2.2 Duplication with comparison (DWC) . . . . . . . . . . . . . . . . . . . . . . . . . . 33<br />
2.3 Time redundancy for temporary and intermittent fault detection . . . . . . . . . . . . 34<br />
2.4 Time redundancy for permanent error detection . . . . . . . . . . . . . . . . . . . . 34<br />
2.5 Information redundancy principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.6 Parity coder in data storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.7 Functional Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36<br />
2.8 Residue codes adder [FFMR09]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
2.9 Triple modular redundancy (TMR) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
2.10 Error detecting and correcting memory block . . . . . . . . . . . . . . . . . . . . . 39<br />
2.11 Basic strategies for implementing Error Recovery. . . . . . . . . . . . . . . . . . . . 41<br />
2.12 The triple-TMR in Boeing 777 [Yeh02] . . . . . . . . . . . . . . . . . . . . . . . . 45<br />
3.1 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
3.2 Limitation of parity check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />
3.3 Rollback Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
3.4 Error detection during Sequence Duration (SD) and rollback called . . . . . . . . . . 56<br />
3.5 No-error detected during the SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
3.6 Time overhead in rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
3.7 Untrusted data flowing into dependable memory (DM) . . . . . . . . . . . . . . . . 60<br />
3.8 Data stored to temporary location before writing to DM . . . . . . . . . . . . . . . . 61<br />
3.9 Data corruption in temporary storage. . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />
3.10 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
3.11 Overall design specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
3.12 Global design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />
3.13 Model-I with data cache and a pair of journals . . . . . . . . . . . . . . . . . . . . . 66<br />
3.14 Cache with associative mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66<br />
3.15 FT evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67<br />
3.16 Periodic, random and burst errors models . . . . . . . . . . . . . . . . . . . . . . . 68<br />
3.17 Model-I: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 69<br />
3.18 Model-I: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . . 70<br />
3.19 Model-I: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 71<br />
3.20 Block diagram of Model-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
3.21 Processor can simultaneously read from Journal and DM . . . . . . . . . . . . . . . 72<br />
3.22 No error detected during SD and data is validated at VP . . . . . . . . . . . . . . . . 72<br />
3.23 Error detected and all the data written during SD is deleted . . . . . . . . . . . . . . 73<br />
3.24 Model-II: additional CPI for benchmark group I . . . . . . . . . . . . . . . . . . . . 74<br />
3.25 Model-II: additional CPI for benchmark group II . . . . . . . . . . . . . . . . . . . 74<br />
3.26 Model-II: additional CPI for benchmark group III . . . . . . . . . . . . . . . . . . . 75<br />
4.1 Design of a self checking processor core (SCPC) . . . . . . . . . . . . . . . . . . . 77<br />
4.2 Criteria behind the choice of the stack processor . . . . . . . . . . . . . . . . . . . 79<br />
4.3 Multiple and large stacks, 0-operand (ML0) computer chosen from a three-axis design<br />
space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81<br />
4.4 Simplified stack machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82<br />
4.5 Modified stack processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83<br />
4.6 Simplified data-path of the proposed model (arithmetic and logic instructions) . . . . 84<br />
4.7 Different instruction types from the execution point of view (without pipelining) . . . 85<br />
4.8 Execution of duplication (DUP) instruction in 2 clock cycles . . . . . . . . . . . . . 86<br />
4.9 Multiple-byte instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
4.10 Data-path of protected-processor’s ALU . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
4.11 Resource utilization chart for various ALU designs [SFRB05] . . . . . . . . . . . . 89<br />
4.12 ALU protecting the logical and arithmetic instructions separately . . . . . . . . . . 90<br />
4.13 Remainder check technique for error detection in arithmetic instructions . . . . . . . 91<br />
4.14 Parity check technique for error detection in logic instructions . . . . . . . . . . . . 92
4.15 Parity check technique for error detection in register(s) . . . . . . . . . . . . . . . . 92<br />
4.16 Error occurred in Protected ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
4.17 Instruction Buffer Management Unit (IBMU) . . . . . . . . . . . . . . . . . . . . 95<br />
4.18 Instruction buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96<br />
4.19 (a) Opcodes description and (b) pipelined execution model . . . . . . . . . . . . . . 97<br />
4.20 A sample program executed through non-pipelined and pipelined stack processor core 97<br />
4.21 Timing diagram for a sample program executed twice: once in non-pipelined version<br />
and then pipelined version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
4.22 Implementation design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
4.23 Strategy to overcome performance overhead due to conditional branches . . . . . . . 99<br />
4.24 Implementation of a self-checking processor core . . . . . . . . . . . . . . . . . . . 100<br />
4.25 Error detected in SCPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
5.1 Design of SCHJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103<br />
5.2 Protecting DM from contamination. . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
5.3 (a) Error(s) in un-validated journal (b) error(s) in validated journal . . . . . . . . . . 105<br />
5.4 Hsiao Parity Check Matrix (41,34) . . . . . . . . . . . . . . . . . . . . . . . . . . . 106<br />
5.5 SCHJ structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108<br />
5.6 Error detection and correction in journal (a memory block of SCHJ). . . . . . . . . 109<br />
5.7 Overall architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
5.8 Rollback mechanism on error detection. . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
5.9 SCHJ operation flow chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
5.10 SCHJ mode 00. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
5.11 SCHJ mode 01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.12 Read of UVD from SCHJ in mode 01 . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
5.13 SCHJ mode 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114<br />
5.14 Mode 10 of SCHJ operation (uncorrectable error detected) . . . . . . . . . . . . . . 114<br />
5.15 SCHJ mode 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.16 Uncorrectable error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
5.17 Increase in percentage utilization of the FT processor (SCPC + SCHJ) on device<br />
EP3SE50F484C2 with increasing journal depth. . . . . . . . . . . . . . . . . . . . 116<br />
5.18 Theoretical limits of Journal Depth. . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
5.19 Relation between journal depth and percentage write in benchmarks. . . . . . . . . . 118<br />
5.20 CPI vs. SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
5.21 Dynamic SD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
6.1 The overall FT-processor to be validated. . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
6.2 Error injection in FT-processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
6.3 Error patterns (errors can occur in any bit, not necessarily the bit shown here). . . . . 123<br />
6.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.5 Single bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
6.6 Double bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
6.7 Triple bit error injection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126<br />
6.8 Harsh (1 to 8 bits, random) error injection. . . . . . . . . . . . . . . . . . . . . . . 126<br />
6.9 Performance Degradation due to re-execution . . . . . . . . . . . . . . . . . . . . . 127<br />
6.10 Simulation curves for group-I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
6.11 Simulation curves for group-II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
6.12 Simulation curves for group-III. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />
6.13 Effect of EIR on rollback for benchmarks group-I. . . . . . . . . . . . . . . . . . . . 129<br />
6.14 Effect of EIR on rollback for benchmarks group-II. . . . . . . . . . . . . . . . . . . 130<br />
6.15 Effect of EIR on rollback for benchmarks group-III. . . . . . . . . . . . . . . . . . . 130<br />
A.1 Canonical Stack Machine [KJ89] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
List of Tables<br />
1.1 Cost/hour for failure of control system [Pie07] . . . . . . . . . . . . . . . . . . . . . 13<br />
1.2 Dependability attributes for University web-server and Nuclear-reactor [Pie07], where<br />
attributes are classified as: – very important = 4 points, – least important = 1 point . 20<br />
2.1 Fault modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
3.1 Comparison of the Processor-Memory Models . . . . . . . . . . . . . . . . . . . 73<br />
4.1 Instruction types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
4.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
5.1 Modes of Journal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
5.2 Implementation area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
6.1 Read/Write profiles in benchmarks groups . . . . . . . . . . . . . . . . . . . . . . . 127<br />
B.1 Arithmetic and logic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />
B.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143<br />
B.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
B.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145<br />
C.1 Instruction set of stack processor (pipelined model) . . . . . . . . . . . . . . . . . . 148<br />
C.2 Stack manipulation operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />
C.3 Memory Fetch and Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />
C.4 Loading Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.5 Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.6 Subroutine Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.7 Push and Pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
C.8 Instruction Codes and Instruction Lengths . . . . . . . . . . . . . . . . . . . . . . . 151
Bibliography<br />
[ACC + 93] J. Arlat, A. Costes, Y. Crouzet, J. C Laprie, and D. Powell. Fault injection and depend-<br />
ability evaluation of fault-tolerant systems. IEEE Transactions on Computers, page<br />
913–923, 1993.<br />
[Aer11] Aeroflex. Dual-Core LEON3FT SPARC v8 processor, 2011.<br />
[AFK05] J. Aidemark, P. Folkesson, and J. Karlsson. A framework for node-level fault toler-<br />
ance in distributed real-time systems. In Proceedings of International Conference on<br />
Dependable Systems and Networks, 2005 (DSN’05), page 656–665, 2005.<br />
[AHHW08] U. Amgalan, C. Hachmann, S. Hellebrand, and H. J Wunderlich. Signature Rollback-A<br />
technique for testing robust circuits. In 26th IEEE VLSI Test Symposium, 2008 (VTS’08),<br />
page 125–130, 2008.<br />
[AKT + 08] H. Ando, R. Kan, Y. Tosaka, K. Takahisa, and K. Hatanaka. Validation of hardware<br />
error recovery mechanisms for the SPARC64 V microprocessor. In Dependable Systems<br />
and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference<br />
on, page 62–69, 2008.<br />
[ALR01] A. Avizienis, J. C Laprie, and B. Randell. Fundamental concepts of dependability.<br />
Research report UCLA CSD Report no. 010028, 2001.<br />
[AMD + 10] M. Amin, F. Monteiro, C. Diou, A. Ramazani, and A. Dandache. A HW/SW<br />
mixed mechanism to improve the dependability of a stack processor. In Proceedings<br />
of 16th IEEE International Conference on Electronics, Circuits, and Systems, 2009<br />
(ICECS’09), page 976–979, 2010.<br />
[ARM09] ARM. Cortex-R4 and Cortex-R4F. Technical reference manual, 2009.<br />
[ARM + 11] Mohsin Amin, Abbas Ramazani, Fabrice Monteiro, Camille Diou, and Abbas Dan-<br />
dache. A Self-Checking hardware journal for a fault tolerant processor architecture.<br />
Hindawi Publishing Corporation, 2011.<br />
[Bai10] G. Bailey. Comparison of GreenArrays chips with Texas Instruments MSP430F5xx as<br />
micropower controllers, June 2010.<br />
[Bau05] R. C Baumann. Radiation-induced soft errors in advanced semiconductor technologies.<br />
IEEE Transactions on Device and Materials Reliability, 5(3):305–316, 2005.<br />
[BBV + 05] D. Bernick, B. Bruckert, P. D Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen.<br />
NonStop advanced architecture. In Proceedings of International Conference on De-<br />
pendable Systems and Networks, 2005 (DSN’05), page 12–21, 2005.<br />
[BCT08] B. Bridgford, C. Carmichael, and C. W Tseng. Single-event upset mitigation selection<br />
guide. Xilinx Application Note, 987, 2008.<br />
[BGB + 08] J. C Baraza, J. Gracia, S. Blanc, D. Gil, and P. J Gil. Enhancement of fault injection<br />
techniques based on the modification of VHDL code. IEEE Transactions on Very Large<br />
Scale Integration (VLSI) Systems, 16(6):693–706, 2008.<br />
[Bic10] R. Bickham. An Analysis of Error Detection Techniques for Arithmetic Logic Units.<br />
PhD thesis, Vanderbilt University, 2010.<br />
[BP02] N. S Bowen and D. K Pradhan. Virtual checkpoints: Architecture and performance.<br />
Computers, IEEE Transactions on, 41(5):516–525, 2002.<br />
[BT02] D. Briere and P. Traverse. AIRBUS A320/A330/A340 electrical flight controls-a fam-<br />
ily of fault-tolerant systems. In the Twenty-Third International Symposium on Fault-<br />
Tolerant Computing System, 1993 (FTCS’93), page 616–623, 2002.<br />
[Car01] C. Carmichael. Triple module redundancy design techniques for virtex FPGAs. Xilinx<br />
Application Note XAPP197, 1, 2001.<br />
[Che08] L. Chen. Hsiao-Code check matrices and recursively balanced matrices. Arxiv preprint<br />
arXiv:0803.1217, 2008.<br />
[CHL97] W-T Chang, S Ha, and E.A. Lee. Heterogeneous simulation - mixing Discrete-Event<br />
models with dataflow. Journal of VLSI Signal Processing, 15(1-2):127–144, 1997.<br />
[CP02] J. A Clark and D. K Pradhan. Fault injection: A method for validating computer-system<br />
dependability. Computer, 28(6):47–56, 2002.<br />
[CPB + 06] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and<br />
M. Orshansky. Bulletproof: A defect-tolerant CMP switch architecture. In Proceedings<br />
of 25th International Symposium on High-Performance Computer Architecture, 2006,<br />
page 5–16, 2006.<br />
[CTS + 10] C. L. Chen, N. N. Tendolkar, A. J. Sutton, M. Y. Hsiao, and D. C. Bossen. Fault-<br />
tolerance design of the IBM enterprise system/9000 type 9021 processors. IBM Journal<br />
of Research and Development, 36(4):765–779, 2010.
[EAWJ02] E. N. Elnozahy, L. Alvisi, Y. M Wang, and D. B Johnson. A survey of rollback-<br />
recovery protocols in message-passing systems. ACM Computing Surveys (CSUR),<br />
34(3):375–408, 2002.<br />
[EKD + 05] D. Ernst, N. S Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,<br />
K. Flautner, et al. Razor: A low-power pipeline based on circuit-level timing specula-<br />
tion. In Proceedings of 36th Annual IEEE/ACM International Symposium on Microar-<br />
chitecture, 2003 (MICRO’03), page 7–18, 2005.<br />
[FFMR09] R. Forsati, K. Faez, F. Moradi, and A. Rahbar. A fault tolerant method for residue<br />
arithmetic circuits. In Proceedings of 2009 International Conference on Information<br />
Management and Engineering, page 59–63, 2009.<br />
[FGAD10] R. Fernández-Pascual, J. M Garcia, M. E Acacio, and J. Duato. Dealing with tran-<br />
sient faults in the interconnection network of CMPs at the cache coherence level. IEEE<br />
Transactions on Parallel and Distributed Systems, 21(8):1117–1131, 2010.<br />
[FGAM10] S. Feng, S. Gupta, A. Ansari, and S. Mahlke. Shoestring: probabilistic soft error relia-<br />
bility on the cheap. ACM SIGPLAN Notices, 45(3):385–396, 2010.<br />
[FP02] E. Fujiwara and D. K Pradhan. Error-control coding in computers. Computer,<br />
23(7):63–72, 2002.<br />
[GBT05] S. Ghosh, S. Basu, and N. A Touba. Selecting error correcting codes to minimize power<br />
in memory checker circuits. Journal of Low Power Electronics, 1(1):63–72, 2005.<br />
[GC06] J. Gaisler and E. Catovic. Multi-Core processor based on LEON3-FT IP core (LEON3-<br />
FT-MP). In in Proceedings of Data Systems in Aerospace, 2006 (DASIA’06), volume<br />
630, page 76, 2006.<br />
[Gha11] S. Ghaznavi. Soft Error Resistant Design of the AES Cipher Using SRAM-based FPGA.<br />
PhD thesis, University of Waterloo, 2011.<br />
[GMT08] M. Grottke, R. Matias, and K. S Trivedi. The fundamentals of software aging. In Pro-<br />
ceedings of IEEE International Conference on Software Reliability Engineering Work-<br />
shops, 2008 (ISSRE Wksp 2008), page 1–6, 2008.<br />
[GPLL09] C. Godlewski, V. Pouget, D. Lewis, and M. Lisart. Electrical modeling of the effect<br />
of beam profile for pulsed laser fault injection. Microelectronics Reliability, 49(9-<br />
11):1143–1147, 2009.<br />
[Gre10] GreenArrays. GreenArrays chip project, 2010.<br />
[Hay05] J.R. Hayes. The architecture of the scalable configurable instrument processor. Techni-<br />
cal Report SRI-05-030, The Johns Hopkins Applied Physics Laboratory, 2005.
[HCTS10] M. Y. Hsiao, W. C Carter, J. W Thomas, and W. R Stringfellow. Reliability, availabil-<br />
ity, and serviceability of IBM computer systems: A quarter century of progress. IBM<br />
Journal of Research and Development, 25(5):453–468, 2010.<br />
[HH06] A. J Harris and J. R Hayes. Functional programming on a Stack-Based embedded<br />
processor. 2006.<br />
[Hsi10] M. Y Hsiao. A class of optimal minimum odd-weight-column SEC-DED codes. IBM<br />
Journal of Research and Development, 14(4):395–401, 2010.<br />
[IK03] R. K Iyer and Z. Kalbarczyk. Hardware and software error detection. Technical report,<br />
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-<br />
Champaign, Urbana, 2003.<br />
[Int09] Intel. White paper - the intel itanium processor 9300 series. Technical report, 2009.<br />
[ITR07] ITRS. International technology roadmap for semiconductors. 2007.<br />
[Jab09] Jaber. Conception architecturale haut débit et sûre de fonctionnement pour les codes<br />
correcteurs d’erreurs. PhD thesis, Université Paul Verlaine - Metz, France, Metz, 2009.<br />
[Jal09] M. Jallouli. Méthodologie de conception d’architectures de processeur sûres de fonc-<br />
tionnement pour les applications mécatroniques. PhD thesis, Université Paul Verlaine -<br />
Metz, France, Metz, 2009.<br />
[JDMD07] M. Jallouli, C. Diou, F. Monteiro, and A. Dandache. Stack processor architec-<br />
ture and development methods suitable for dependable applications. Reconfigurable<br />
Communication-centric SoCs (ReCoSoC’07), Montpellier, France, 2007.<br />
[JES06] JEDEC Standard JESD89A. Measurement and reporting of alpha particle and terrestrial cosmic<br />
ray-induced soft errors in semiconductor devices. October 2006.<br />
[JHW + 08] J. Johnson, W. Howes, M. Wirthlin, D. L McMurtrey, M. Caffrey, P. Graham, and<br />
K. Morgan. Using duplication with compare for on-line error detection in FPGA-based<br />
designs. In Proceedings of IEEE Aerospace Conference, 2008, page 1–11, 2008.<br />
[JPS08] B. Joshi, D. Pradhan, and J. Stiffler. Fault-Tolerant computing. 2008.<br />
[KJ89] P. J Koopman Jr. Stack computers: the new wave. Halsted Press New York, NY, USA,<br />
1989.<br />
[KKB07] I. Koren, C. M Krishna, and Inc Books24x7. Fault-tolerant systems. Elsevier/Morgan<br />
Kaufmann, 2007.
[KKS + 07] P. Kudva, J. Kellington, P. Sanda, R. McBeth, J. Schumann, and R. Kalla. Fault injec-<br />
tion verification of IBM POWER6 soft error resilience. In Architectural Support for<br />
Gigascale Integration (ASGI) Workshop, 2007.<br />
[KMSK09] J. W Kellington, R. McBeth, P. Sanda, and R. N Kalla. IBM POWER6 processor soft<br />
error tolerance analysis using proton irradiation. In Proceedings of the IEEE Workshop<br />
on Silicon Errors in Logic—Systems Effects (SELSE) Conference, 2009.<br />
[Kop04] H. Kopetz. From a federated to an integrated architecture for dependable embedded<br />
systems. PhD thesis, Technische Univ Vienna, Vienna, Austria, 2004.<br />
[Kop11] H. Kopetz. Real-time systems: design principles for distributed embedded applications,<br />
volume 25. Springer-Verlag New York Inc, 2011.<br />
[Lal05] P. K Lala. Single error correction and double error detecting coding scheme, 2005.<br />
[Lap04] J.C. Laprie. Sûreté de fonctionnement des systèmes : concepts de base et terminologie.<br />
2004.<br />
[LAT07] K. W. Li, J. R. Armstrong, and J. G. Tront. An HDL simulation of the effects of single<br />
event upsets on microprocessor program flow. IEEE Transactions on Nuclear Science,<br />
31(6):1139–1144, 2007.<br />
[LB07] J. Laprie and B. Randell. Origins and integration of the concepts. 2007.<br />
[LBS + 11] I. Lee, M. Basoglu, M. Sullivan, D. H Yoon, L. Kaplan, and M. Erez. Survey of error<br />
and fault detection mechanisms. 2011.<br />
[LC08] C. A.L Lisboa and L. Carro. XOR-based low cost checkers for combinational logic. In<br />
IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems, page<br />
281–289, 2008.<br />
[LN09] Dongwoo Lee and Jongwhoa Na. A novel simulation fault injection method for depend-<br />
ability analysis. IEEE Design & Test of Computers, 26(6):50–61, December 2009.<br />
[LRL04] J. C Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable<br />
and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33,<br />
2004.<br />
[MB07] N. Madan and R. Balasubramonian. Power efficient approaches to redundant multi-<br />
threading. IEEE Transactions on Parallel and Distributed Systems, page 1066–1079,<br />
2007.<br />
[MBS07] A. Meixner, M. E Bauer, and D. Sorin. Argus: Low-cost, comprehensive error detection<br />
in simple cores. In Proceedings of the 40th Annual IEEE/ACM International Symposium<br />
on Microarchitecture, page 210–222, 2007.
[MG09] A. Maloney and A. Goscinski. A survey and review of the current state of rollback-<br />
recovery for cluster systems. Concurrency and Computation: Practice and Experience,<br />
21(12):1632–1666, 2009.<br />
[MM00] S. Mitra and E. J McCluskey. Which concurrent error detection scheme to choose?<br />
2000.<br />
[MMPW07] K. S Morgan, D. L McMurtrey, B. H Pratt, and M. J Wirthlin. A comparison of TMR<br />
with alternative fault-tolerant design techniques for FPGAs. IEEE Transactions on Nu-<br />
clear Science, 54(6):2065–2072, 2007.<br />
[Mon07] Y. Monnet. Etude et modélisation de circuits résistants aux attaques non intrusives par<br />
injection de fautes. Thèse de doctorat, Institut National Polytechnique de Grenoble,<br />
2007.<br />
[MS06] F. MacWilliams and N. Sloane. The theory of error-correcting codes. 2006.<br />
[MS07] A. Meixner and D. J Sorin. Error detection using dynamic dataflow verification. In Pro-<br />
ceedings of the 16th International Conference on Parallel Architecture and Compilation<br />
Techniques, page 104–118, 2007.<br />
[MSSM10] M. J. Mack, W. M. Sauer, S. B. Swaney, and B. G. Mealey. IBM POWER6 reliability.<br />
IBM Journal of Research and Development, 51(6):763–774, 2010.<br />
[Muk08] S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2008.<br />
[MW04] R. Mastipuram and E. C Wee. Soft errors’ impact on system reliability. EDN, Sept, 30,<br />
2004.<br />
[MW07] T. C May and M. H Woods. A new physical mechanism for soft errors in dynamic<br />
memories. In 16th Annual Reliability Physics Symposium, page 33–40, 2007.<br />
[NBV + 09] T. Naughton, W. Bland, G. Vallee, C. Engelmann, and S. L Scott. Fault injection frame-<br />
work for system resilience evaluation: fake faults for finding future failures. In Pro-<br />
ceedings of the 2009 workshop on Resiliency in high performance, page 23–28, 2009.<br />
[Nic02] M. Nicolaidis. Efficient implementations of self-checking adders and ALUs. In<br />
Proceedings of Twenty-Third International Symposium on Fault-Tolerant Computing,<br />
(FTCS-23), page 586–595, 2002.<br />
[Nic10] M. Nicolaidis. Soft Errors in Modern Electronic Systems. Springer Verlag, 2010.<br />
[NL11] J. Na and D. Lee. Simulated fault injection using simulator modification technique.<br />
ETRI Journal, 33(1), 2011.
[NMGT06] J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas. ReViveI/O: efficient han-<br />
dling of I/O in highly-available rollback-recovery servers. In the Twenty-fifth Interna-<br />
tional Symposium on High-Performance Computer Architecture, 2006, page 200–211,<br />
2006.<br />
[NTN + 09] M. Nicolaidis, K. Torki, F. Natali, F. Belhaddad, and D. Alexandrescu. Implementation<br />
and validation of a low-cost single-event latchup mitigation scheme. In IEEE Workshop<br />
on Silicon Errors in Logic–System Effects (SELSE), Stanford, CA, 2009.<br />
[NX06] V. Narayanan and Y. Xie. Reliability concerns in embedded system designs. Computer,<br />
39(1):118–120, 2006.<br />
[Pat10] Anurag Patel. Fault tolerant features of modern processors, 2010.<br />
[PB04] S. Pelc and C. Bailey. Ubiquitous Forth objects. In EuroForth’04, Dagstuhl, Germany,<br />
2004.<br />
[PF06] J. H Patel and L. Y Fung. Concurrent error detection in ALU’s by recomputing with<br />
shifted operands. IEEE Transactions on Computers, 100(7):589–595, 2006.<br />
[Pie06] S.J. Piestrak. Dependable computing: Problems, techniques and their applications. In<br />
First Winter School on Self-Organization in Embedded Systems, Schloss Dagstuhl, Ger-<br />
many, 2006.<br />
[Pie07] S.J. Piestrak. Systèmes numériques tolérants aux fautes, 2007.<br />
[PIEP09] P. Pop, V. Izosimov, P. Eles, and Z. Peng. Design optimization of time-and cost-<br />
constrained fault-tolerant embedded systems with checkpointing and replication. IEEE<br />
Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):389–402, 2009.<br />
[Poe05] Christian Poellabauer. Real-Time systems, 2005.<br />
[Pow10] D. Powell. A generic fault-tolerant architecture for real-time dependable systems.<br />
Springer Publishing Company, Incorporated, 2010.<br />
[QGK + 06] H. Quinn, P. Graham, J. Krone, M. Caffrey, and S. Rezgui. Radiation-induced<br />
multi-bit upsets in SRAM-based FPGAs. IEEE Transactions on Nuclear Science,<br />
52(6):2455–2461, 2006.<br />
[QLZ05] F. Qin, S. Lu, and Y. Zhou. SafeMem: exploiting ECC-memory for detecting memory<br />
leaks and memory corruption during production runs. In 11th International Symposium<br />
on High-Performance Computer Architecture, 2005 (HPCA’11), page 291–302, 2005.<br />
[RAM + 09] A. Ramazani, M. Amin, F. Monteiro, C. Diou, and A. Dandache. A fault tolerant jour-<br />
nalized stack processor architecture. In 15th IEEE International On-Line Testing Sym-<br />
posium, 2009 (IOLTS’09), Sesimbra-Lisbon, Portugal, 2009.
[RI08] G. A. Reis. Software modulated fault tolerance. PhD thesis, Princeton University,<br />
2008.<br />
[RK09] J. A. Rivers and P. Kudva. Reliability challenges and system performance at the architecture level. IEEE Design & Test of Computers, 26(6):62–73, 2009.
[RNS+05] K. Rothbart, U. Neffe, C. Steger, R. Weiss, E. Rieger, and A. Muehlberger. A smart card test environment using multi-level fault injection in SystemC. In Proceedings of the 6th IEEE Latin-American Test Workshop 2005, pages 103–108, March 2005.
[RR08] V. Reddy and E. Rotenberg. Coverage of a microarchitecture-level fault check regimen in a superscalar processor. In IEEE International Conference on Dependable Systems and Networks 2008 (DSN'08), pages 1–10, Anchorage, Alaska, 2008.
[RRTV02] M. Rebaudengo, M. Sonza Reorda, M. Torchiano, and M. Violante. Soft-error detection through software fault-tolerance techniques. In International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 210–218, 2002.
[RS09] B. Rahbaran and A. Steininger. Is asynchronous logic more robust than synchronous logic? IEEE Transactions on Dependable and Secure Computing, pages 282–294, 2009.
[RYKO11] W. Rao, C. Yang, R. Karri, and A. Orailoglu. Toward future systems with nanoscale<br />
devices: Overcoming the reliability challenge. Computer, 44(2):46–53, 2011.<br />
[Sch08] Martin Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, 2008.
[SFRB05] V. Srinivasan, J. W. Farquharson, W. H. Robinson, and B. L. Bhuva. Evaluation of error detection strategies for an FPGA-based self-checking arithmetic and logic unit. In MAPLD International Conference, 2005.
[SG10] L. Spainhower and T. A. Gregg. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development, 43(5.6):863–873, 2010.
[Sha06] Mark Shannon. A C Compiler for Stack Machines. MSc thesis, University of York,<br />
2006.<br />
[SHLR+09] S. K. Sastry Hari, M. L. Li, P. Ramachandran, B. Choi, and S. V. Adve. mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 122–132, 2009.
[SMHW02] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 123–134, 2002.
[SMR+07] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors. Using process-level redundancy to exploit multiple cores for transient fault tolerance. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007 (DSN'07), pages 297–306, 2007.
[Sor09] D. J. Sorin. Fault Tolerant Computer Architecture. Morgan & Claypool Publishers, 2009.
[SSF+08] J. R. Schwank, M. R. Shaneyfelt, D. M. Fleetwood, J. A. Felix, P. E. Dodd, P. Paillet, and
V. Ferlet-Cavrois. Radiation effects in MOS oxides. IEEE Transactions on Nuclear<br />
Science, 55(4):1833–1853, 2008.<br />
[Sta06] William Stallings. Computer Organization and Architecture. Prentice Hall, 7th edition,<br />
2006.<br />
[TM95] C. H. Ting and C. H. Moore. MuP21: a high performance MISC processor. Forth Dimensions, 1995.
[Too11] C. Toomey. Statistical Fault Injection and Analysis at the Register Transfer Level using the Verilog Procedural Interface. PhD thesis, Vanderbilt University, 2011.
[Van08] V. P. Vanhauwaert. Fault injection based dependability analysis in an FPGA-based environment. PhD thesis, Institut Polytechnique de Grenoble, Grenoble, France, 2008.
[VFM06] A. Vahdatpour, M. Fazeli, and S. Miremadi. Transient error detection in embedded systems using reconfigurable components. In International Symposium on Industrial Embedded Systems, 2006 (IES'06), pages 1–6, 2006.
[VK07] J. Von Knop. A Process for Developing a Common Vocabulary in the Information Security Area. IOS Press, 2007.
[VSL09] M. Vayrynen, V. Singh, and E. Larsson. Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips. In Proceedings of Design, Automation & Test in Europe Conference & Exhibition, 2009 (DATE'09), pages 484–489, 2009.
[WA08] F. Wang and V. D. Agrawal. Single event upset: An embedded tutorial. In 21st International Conference on VLSI Design, 2008 (VLSID 2008), pages 429–434, 2008.
[WCS08] P. M Wells, K. Chakraborty, and G. S Sohi. Adapting to intermittent faults in multicore<br />
systems. ACM SIGPLAN Notices, 43(3):255–264, 2008.
[WL10] C. F. Webb and J. S. Liptay. A high-frequency custom CMOS S/390 microprocessor.
IBM Journal of Research and Development, 41(4.5):463–473, 2010.<br />
[Yeh02] Y. C. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the IEEE Aerospace Applications Conference, volume 1, pages 293–307, 2002.
[ZJ08] Y. Zhang and J. Jiang. Bibliographical review on reconfigurable fault-tolerant control<br />
systems. Annual Reviews in Control, 32(2):229–252, 2008.<br />
[ZL09] J. F. Ziegler and W. A. Lanford. The effect of sea level cosmic rays on electronic devices. Journal of Applied Physics, 52(6):4305–4312, 2009.
RÉSUMÉ<br />
Dans cette thèse, nous proposons une nouvelle approche pour la conception d’un processeur tolérant aux fautes. Celle-ci
répond à plusieurs objectifs dont celui d’obtenir un niveau de protection élevé contre les erreurs transitoires et un<br />
compromis raisonnable entre performances temporelles et coût en surface. Le processeur résultant sera utilisé ultérieurement<br />
comme élément constitutif d’un système multiprocesseur sur puce (MPSoC) tolérant aux fautes. Les concepts mis<br />
en œuvre pour la tolérance aux fautes reposent sur l’emploi de techniques de détection concurrente d’erreurs et de recouvrement<br />
par réexécution. Les éléments centraux de la nouvelle architecture sont, un cœur de processeur à pile de données<br />
de type MISC (Minimal Instruction Set Computer) capable d’auto-détection d’erreurs, et un mécanisme matériel de journalisation<br />
chargé d’empêcher la propagation d’erreurs vers la mémoire centrale (supposée sûre) et de limiter l’impact du<br />
mécanisme de recouvrement sur les performances temporelles.<br />
L’approche méthodologique mise en œuvre repose sur la modélisation et la simulation selon différents modes et niveaux<br />
d’abstraction, le développement d’outils logiciels dédiés, et le prototypage sur des technologies FPGA. Les résultats,
obtenus sans recherche d’optimisation poussée, montrent clairement la pertinence de l’approche proposée, en offrant<br />
un bon compromis entre protection et performances. En effet, comme le montrent les multiples campagnes d’injection<br />
d’erreurs, le niveau de tolérance aux fautes est élevé, avec 100% des erreurs simples détectées et recouvrées, et environ 60% et 78% des erreurs doubles et triples, respectivement. Le taux de recouvrement reste raisonnable pour des erreurs à multiplicité plus élevée,
étant encore de 36% pour des erreurs de multiplicité 8.<br />
Mots clés : Tolérance aux fautes, Processeur à pile de données, MPSoC, Journalisation, Restauration, Injection de<br />
fautes, Modélisation RTL.<br />
ABSTRACT<br />
In this thesis, we propose a new approach to designing a fault-tolerant processor. The methodology addresses several goals, including a high level of protection against transient faults along with reasonable trade-offs in performance and area overhead. The resulting fault-tolerant processor will be used as a building block in a fault-tolerant MPSoC (Multi-Processor System-on-Chip) architecture. The concepts used to achieve fault tolerance are based on concurrent error detection and rollback recovery techniques. The core elements of this architecture are a stack processor core of the MISC (Minimal Instruction Set Computer) class and a hardware journal in charge of preventing error propagation to the main memory (assumed dependable) and of limiting the impact of the rollback mechanism on time performance.
The design methodology relies on modeling at different abstraction levels and simulation modes, developing dedicated software tools, and prototyping on FPGA technology. The results, obtained without seeking thorough optimization, clearly show the relevance of the proposed approach, offering a good compromise between protection and performance. Indeed, fault tolerance, as revealed by several error injection campaigns, proves to be high, with 100% of errors detected and recovered for single-bit error patterns, and about 60% and 78% for double- and triple-bit error patterns, respectively. Furthermore, the recovery rate remains acceptable for larger error patterns, still reaching 36% for 8-bit error patterns.
Keywords: Fault Tolerance, Stack Processor, MPSoC, Journalization, Rollback, Fault Injection, RTL modeling.