Robust System Design - VLSI
<strong>Robust</strong> <strong>System</strong> <strong>Design</strong><br />
Subhasish Mitra<br />
<strong>Robust</strong> <strong>System</strong>s Group<br />
Dept. of EE & Dept. of CS<br />
Stanford University<br />
www.stanford.edu/~subh<br />
Acknowledgment: Students & Collaborators<br />
1
Radiation-Induced Soft Errors<br />
“It’s ridiculous. I’ve got a $300,000<br />
server that doesn’t work. The thing<br />
should be bullet-proof.”<br />
Causes: α particles (package), Neutrons (cosmic rays)<br />
2
Who Cares About Soft Errors ?<br />
20K processors server farm<br />
1 major flip-flop error every 20 days<br />
Silent data corruption<br />
$20K → $3,616 bank deposit<br />
Downtime: $100K - $10M / hr.<br />
Memory ECC routinely used<br />
<strong>System</strong> error rates increasing<br />
[Chart: soft error rate contributions of combinational logic, flip-flops, and unprotected memory]<br />
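The two deposit figures on this slide differ by exactly 2^14 = 16,384, so one plausible reading is that a single flipped bit turned one value into the other. The slide does not state the mechanism, so the sketch below is an illustration under that assumption; `flip_bit` is a hypothetical helper.<br />

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted (models a single-bit upset)."""
    return value ^ (1 << bit)


deposit = 20_000                    # 0x4E20
corrupted = flip_bit(deposit, 14)   # clear bit 14 (2**14 = 16384)
print(f"{deposit} -> {corrupted}")  # 20000 -> 3616
```

Nothing crashes in this scenario: the stored value is simply wrong, which is what makes the corruption silent.<br />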
3
<strong>System</strong> Outage Causes<br />
[Chart: HPC, Los Alamos [Schroeder DSN 06] (disk failures excluded). Categories: unknown; power, operator; software; hardware]<br />
[Chart: storage servers [Shazli ITC 08] (disk failures excluded). Categories: power; software; hardware]<br />
Hardware failures can be significant<br />
Depends on application<br />
4
<strong>Robust</strong> <strong>System</strong> <strong>Design</strong> Challenges<br />
Technology challenges: radiation, erratic bits, aging, early-life failures<br />
<strong>System</strong> complexity: post-silicon bugs<br />
Inputs → <strong>Robust</strong> <strong>System</strong> → Acceptable results<br />
Constraints: power, performance<br />
5
<strong>Robust</strong> <strong>System</strong> <strong>Design</strong><br />
Meet user expectations<br />
Despite underlying disturbances<br />
Thorough validation & test<br />
Tolerate imperfect hardware<br />
Beyond silicon-CMOS: imperfection-immune logic<br />
6
Outline<br />
Introduction<br />
Thorough validation & test<br />
Tolerate imperfect hardware<br />
Beyond silicon-CMOS: imperfection-immune logic<br />
Conclusion<br />
7
Who Cares About Post-Silicon Validation ?<br />
<strong>Design</strong><br />
Pre-Silicon Verification<br />
Post-Silicon Validation<br />
High Volume<br />
35% development time, 25% design resources<br />
Barcelona shipment delayed 6 months due to a bug in the TLB.<br />
BIOS fix: 10-20% performance penalty<br />
“Post-silicon cost & complexity is rising faster<br />
than design cost” – S. Yerramilli, V.P., Intel<br />
8
Post-Silicon Bug Localization Challenge<br />
Pinpoint from system failure (e.g., crash)<br />
Bug location, exposing stimulus<br />
Electrical bugs: days to weeks per bug<br />
Flow: run apps. (OS, games) → detect bugs → localize bugs → root-cause & fix<br />
Localization dominates cost<br />
9
IFRA Key Message<br />
IFRA: Instruction Footprint Recording and Analysis<br />
Effective: 96% accurate<br />
No system failure reproduction<br />
No system simulation<br />
Inexpensive: 1% area<br />
Practical<br />
Alpha 21264<br />
Intel Nehalem<br />
10
IFRA Principle<br />
<strong>Design</strong><br />
Phase<br />
Insert recorders inside chip design (1% area cost; 60KB for Alpha 21264)<br />
Post-Si Validation<br />
1. Run tests and record special info. in recorders (non-intrusive)<br />
2. Failure detected? If no, keep running tests; if yes, continue<br />
3. Scan out recorder contents (no failure reproduction; single test run)<br />
4. Post-analyze offline (no system simulation; self-consistency vs. test program binary)<br />
Result: localized bug (location, stimulus)<br />
11
IFRA Hardware in Superscalar Processor<br />
[Figure: superscalar pipeline with a recorder attached to each stage]<br />
Stages: FETCH, DECODE, DISPATCH, ISSUE, EXECUTE, COMMIT, separated by pipeline registers<br />
Blocks shown: Branch Predictor, I-TLB, I-Cache, Fetch Queue, ID assignment, Decoders, Reg Map, Reg Free, Instruction Window, MUL, 2xALU, 2xBr, FPU, 2xLSU, Reg Rename, Phys Regfile, D-Cache, D-TLB, Reorder Buffer, Reg Map, Post-Trigger Generator<br />
Recorders: part of scan chain; no at-speed routing<br />
Alpha 21264 (open source)<br />
12
Recording Example<br />
[Figure: INST1 and INST2 moving through FETCH (Branch Predictor, I-TLB, I-Cache, Fetch Queue) and DECODE (Decoder), with pipeline registers carrying the instruction and its ID]<br />
ID assignment (special ID assignment rule): INST1 gets ID1, INST2 gets ID2<br />
Instruction footprints in Recorder 1: ID1 with auxiliary info PC1; ID2 with auxiliary info PC2<br />
Instruction footprints in Recorder 2: ID1 with auxiliary info decoded bits1; ID2 with auxiliary info decoded bits2<br />
13
Special Rule for Instruction ID<br />
Simplistic schemes inadequate<br />
Speculation + flushes, out-of-order, loops<br />
Multiple clocks, voltage frequency scaling<br />
Special rule [Park TCAD 09]<br />
ID width: log2(4n) bits<br />
n = max. instructions in flight<br />
8 bits for Alpha (n = 64)<br />
No timestamp or global synchronization<br />
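The ID width on this slide can be checked with a few lines of arithmetic. The function name `id_width_bits` is hypothetical, and rounding up for values of n that are not powers of two is an assumption; the slide only gives the n = 64 case.<br />

```python
def id_width_bits(max_in_flight: int) -> int:
    """Bits needed to encode 4n distinct IDs, i.e. ceil(log2(4n))."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be positive")
    # (x - 1).bit_length() equals ceil(log2(x)) for x >= 1, without floating point.
    return (4 * max_in_flight - 1).bit_length()


print(id_width_bits(64))  # Alpha 21264 case from the slide: 8
```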
14
What to Record?<br />
Pipeline stage | Auxiliary information (description) | Bits per recorder | Number of recorders<br />
Fetch | Program Counter | 32 | 4<br />
Decode | Decoding results | 4 | 4<br />
Dispatch | Register names residue | 6 | 4<br />
Issue | Operands residue | 6 | 4<br />
Execution (ALU, MUL) | Result residue | 3 | 4<br />
Execution (Branch) | None | 0 | 2<br />
Execution (Load/Store) | Result residue; Memory address | 35 | 2<br />
Total required storage for all recorders: 60 KBytes<br />
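As a rough cross-check of the table, the sketch below adds up the bits stored per footprint across all recorders. It assumes each footprint holds only the instruction ID (8 bits for Alpha 21264) plus the auxiliary bits listed above, and it treats the number of entries per recorder as a free parameter, since the slide does not give it. The value 1024 used below is purely illustrative.<br />

```python
# (stage, auxiliary bits per footprint, number of recorders), taken from the table
RECORDERS = [
    ("Fetch", 32, 4),
    ("Decode", 4, 4),
    ("Dispatch", 6, 4),
    ("Issue", 6, 4),
    ("Execution (ALU, MUL)", 3, 4),
    ("Execution (Branch)", 0, 2),
    ("Execution (Load/Store)", 35, 2),
]
ID_BITS = 8  # ID width for Alpha 21264 (n = 64)


def bits_per_row(id_bits: int = ID_BITS) -> int:
    """Bits needed to hold one footprint in every recorder (assumed: ID + aux only)."""
    return sum((aux + id_bits) * count for _, aux, count in RECORDERS)


def storage_bytes(entries_per_recorder: int) -> float:
    """Total storage for a hypothetical recorder depth."""
    return bits_per_row() * entries_per_recorder / 8


print(bits_per_row())       # 466
print(storage_bytes(1024))  # 59648.0 bytes for the illustrative depth
```

With that illustrative depth the total lands in the same range as the 60 KBytes quoted on the slide, but the actual recorder organization may differ.<br />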
15
Early Warnings MUST<br />
[Timeline: test program execution from t=0; error after 5 billion cycles (e.g., speedpath); early failure suspect detection; failure after 6 billion cycles (e.g., crash)]<br />
Need to capture the error in recorder storage<br />
Early failure detection (post-triggers)<br />
Error detection: residue, array parity<br />
Deadlock & segfault<br />
Special early warnings pause recording<br />
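A minimal sketch of why an early warning matters when recorder storage is limited. It assumes the recorder behaves like a fixed-size circular buffer that overwrites its oldest entries, which the slide does not state explicitly; the class `Recorder` and its methods are hypothetical names.<br />

```python
from collections import deque


class Recorder:
    """Fixed-capacity footprint store that keeps only the most recent entries."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest entry is dropped when full
        self.paused = False

    def record(self, footprint) -> None:
        if not self.paused:
            self.buffer.append(footprint)

    def post_trigger(self) -> None:
        """Early warning fired: freeze the contents so they are not overwritten."""
        self.paused = True


rec = Recorder(capacity=4)
for cycle in range(10):
    if cycle == 7:          # early warning raised shortly after the error
        rec.post_trigger()
    rec.record(cycle)

print(list(rec.buffer))  # [3, 4, 5, 6]
```

Without the pause, the buffer would end up holding cycles 6 to 9 and the footprints closest to the original error would be lost.<br />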
16
Post-Analysis Overview<br />
Inputs: test program binary; footprints from recorders<br />
Steps: link footprints → high-level analysis → low-level analysis<br />
High-level analysis (micro-architecture independent): control-flow analysis, data-dependency analysis, decoding analysis, load/store analysis<br />
Low-level analysis (micro-architecture dependent): residue consistency check<br />
Output: list of bug location-stimulus pairs<br />
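The idea behind a residue consistency check can be sketched for addition: the residue of a sum must equal the sum of the operand residues, modulo the same constant. The modulus 3 and the function names below are assumptions for illustration; the slides do not specify which residue code IFRA uses.<br />

```python
MODULUS = 3  # assumed residue modulus, chosen for illustration only


def residue(x: int) -> int:
    return x % MODULUS


def add_is_consistent(a_res: int, b_res: int, result_res: int) -> bool:
    """Check a recorded result residue against the recorded operand residues."""
    return (a_res + b_res) % MODULUS == result_res


a, b = 1234, 5678
good = a + b             # 6912
bad = good ^ (1 << 4)    # same result with one bit flipped: 6928

print(add_is_consistent(residue(a), residue(b), residue(good)))  # True
print(add_is_consistent(residue(a), residue(b), residue(bad)))   # False
```

Only the small residues need to be recorded, not the full operands, which keeps the footprint size low.<br />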
17
Link Footprints<br />
[Figure: footprints from the fetch-stage, issue-stage, and execution-stage recorders, each a sequence of (ID, auxiliary info) entries ordered from earlier to later time, linked to the PC/instruction sequence of the test program binary (PC0 INST0 … PC6 INST6)]<br />
Special ID rule ensures:<br />
Uncommitted instructions uniquely identified<br />
Committed instructions correctly linked<br />
18
Debug Example<br />
[Figure: decision diagram. Link footprints, then high-level analyses HLA1 to HLA4, then low-level analysis, ending in bug locations + exposing stimulus]<br />
HLA: High-level analysis<br />
19
IFRA Results – Alpha 21264<br />
Correct localization (96%): exact localization (78%); avg. 6 candidates (22%)<br />
Complete miss (4%)<br />
Total candidates: > 200,000<br />
1 of 200 design blocks, 1 of 1,000 error appearance cycles<br />
Exciting results for Intel Nehalem<br />
20
Low-Cost Error Detection Most Important<br />
Why concurrent error detection (CED) ?<br />
Crashes vs. silent errors<br />
Traditional CED expensive<br />
Low-cost – How ?<br />
New failure mode signatures<br />
Optimize across abstraction layers<br />
Configurable & reuse<br />
22
Low-cost Resilience<br />
[Figure: failure rate vs. time, with early-life, lifetime, and wearout regions]<br />
Problems: early-life failures (infant mortality), with burn-in difficult and Iddq ineffective; transistor aging, with guardbands expensive<br />
Techniques: Circuit Failure Prediction; On-line Diagnostics; Built-In Soft Error Resilience (BISER), 45nm data: errors reduced 1,000X<br />
Global optimization, software orchestrated<br />
BISER Latch Soft Error Correction<br />
[Figure: combinational logic drives input IN of two clocked latches with outputs A and B; one is a redundant latch reused from scan test & debug. A and B feed a C-element whose output OUT has a weak keeper]<br />
C-element behavior:<br />
(A, B) = 00: OUT = 1<br />
(A, B) = 11: OUT = 0<br />
(A, B) = 01 or 10: previous value retained<br />
Key Observation: Latches vulnerable only in Opaque state (Clock = 0)<br />
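The C-element behavior listed on this slide can be modeled in a few lines: when the two latch outputs agree, the output is their inverse; when they disagree, the output holds its previous value. The class name `InvertingCElement` is hypothetical, and the model is purely behavioral (no timing and no keeper strength).<br />

```python
class InvertingCElement:
    """Behavioral model: 00 -> 1, 11 -> 0, 01 or 10 -> hold previous output."""

    def __init__(self, initial_out: int = 0):
        self.out = initial_out

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = 1 - a  # inputs agree: drive the inverted value
        # inputs disagree: keep the previous output (role of the keeper)
        return self.out


c = InvertingCElement()
print(c.update(1, 1))  # 0: both latches hold 1
print(c.update(0, 1))  # 0: a soft error flips latch A, output is retained
print(c.update(1, 1))  # 0: latch A is rewritten on the next transparent phase
```

A single upset in either latch therefore never reaches OUT; only simultaneous upsets in both latches would.<br />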
24
Architecture-Aware BISER Insertion<br />
[Plot: cumulative error coverage (0% to 100%) vs. cumulative latch coverage (0% to 100%), Alpha 21264 error injection]<br />
Annotations: 10X chip-level protection; 2X; 2.5% power penalty; 9% chip-level power penalty<br />
Ack: Prof. S.J. Patel, UIUC for error injector<br />
Optimized BISER insertion: verification-guided ?<br />
25
Reconfigurable Correction – Economy Mode<br />
[Figure: scan / checking flip-flop and <strong>System</strong> flip-flop whose outputs feed a C-element with keeper. Signals: Scan Clock A, Scan Clock B = 1, Scan Data, Capture = 0, Update, <strong>System</strong> Data, <strong>System</strong> Clock, Scan Output, <strong>System</strong> Output]<br />
Integrated design quality<br />
Soft error correction, scan test, post-silicon debug<br />
26
Single Event Multiple Upsets (SEMU)<br />
SEMUs (aka MBUs) increasing<br />
Single error assumption not sufficient<br />
[Chart: radiation experiment results. Measured error rate (arbitrary units, log scale from 1 to 10^4) for the "Basic" flip-flop, 2X flip-flop, BISER flip-flop, and new LEAP flip-flop [IRPS 10]]<br />
27
Circuit Failure Prediction Early Indicator<br />
BEFORE errors appear<br />
Failure Prediction | Error Detection<br />
Before errors appear | After errors appear<br />
+ No corrupt data & states | – Corrupt data & states<br />
+ Low cost | – High cost<br />
+ Self-diagnosis | – Limited diagnosis<br />
Both can be efficiently combined<br />
“A little fire is quickly trodden out;<br />
Which, being suffer'd, rivers cannot quench.”<br />
William Shakespeare, King Henry the Sixth, Part III<br />
29
Circuit Failure Prediction Applicability<br />
Degradation delay SHIFT ≠ delay fault<br />
Transistor aging<br />
e.g., Negative Bias Temperature Instability<br />
Adaptive (≠ worst-case) guardbands<br />
Gate-oxide Early-Life Failures (ELF)<br />
Burn-in alternatives<br />
30
New Gate-oxide ELF Signature: 90nm<br />
Delay shifts over time: distinct from NBTI, PBTI, hot-e<br />
[Plot: standard normal quantile vs. Ids [µA], W = 0.2µm; curves labelled fresh, 10 min. stress, 240 min. stress, 5340 min. stress; outliers marked]<br />
[Plot: Ids [µA] after 5340 min stress vs. fresh Ids [µA]; outliers marked]<br />
[Map: outlier locations (X, Y) random in 0.2µm array]<br />
952 pairs: Ids outliers, 11.6% of entire population<br />
952 pairs: largest Ig increase, 11.6% of entire population<br />
885 pairs: 92%<br />
[Plot: Iddq [A] at VDD=1V vs. stress time [sec], inverter chain input LOW and input HIGH; 100X]<br />
[Plot: delay shifts [ps] vs. stress time [sec], rising and falling transitions; delay jump]<br />
31
Circuit Failure Prediction: How ?<br />
Periodic on-line self-test & diagnostics<br />
Periodic minimal power costs<br />
Clock control reuse: test & debug, DVFS<br />
Concurrent with application execution<br />
Special flip-flops<br />
Self-healing<br />
Optimized lifetime power efficiency [DATE 10]<br />
32
On-line Self-Test & Diagnostics: CASP<br />
Pseudo-random BIST difficult<br />
CASP: Concurrent, Autonomous, Stored Patterns<br />
Multi-core no visible downtime<br />
Software orchestration<br />
Stored test patterns: off-chip FLASH<br />
Test compression: X-Compact<br />
Comparable or better than production tests<br />
Major Technology Trends Favor CASP<br />
33
CASP Online Diagnostics Flow<br />
1. Scheduling: power / performance-aware scheduler; core 4 selected for test, core N in normal operation<br />
2. Pre-processing: prepare core for online diagnostics; core 4 temporarily isolated, core N in normal operation<br />
3. Test Application: thorough scan & functional testing; core 4 under test, core N in normal operation<br />
4. Post-processing: bring core from online diagnostics to normal operation; core 4 resumes operation, core N in normal operation<br />
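The four-phase flow can be sketched as a loop that takes one core at a time through the phases while the remaining cores keep running. The round-robin order below is only a stand-in for the power / performance-aware scheduler named on the slide, and `casp_round` is a hypothetical function name.<br />

```python
PHASES = ("scheduling", "pre-processing", "test application", "post-processing")


def casp_round(cores):
    """Yield (core_under_test, phase, cores_in_normal_operation) for one full round.

    Assumption: cores are tested one at a time in list order (round-robin).
    """
    for core in cores:
        others = [c for c in cores if c != core]
        for phase in PHASES:
            yield core, phase, others


for core, phase, others in casp_round([1, 2, 3, 4]):
    if core == 4:
        print(f"core {core}: {phase:16s} | normal operation: {others}")
```

Because only one core leaves normal operation at any time, the rest of the system stays available, which matches the "no visible downtime" claim made for multi-core systems.<br />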
34
CASP for SUN OpenSPARC T1 Cores<br />
Test coverage: stuck-at 99.5%, transition 96%, true-time 93.5%<br />
Storage: 48 MBytes<br />
Test time per core: 0.3 sec.<br />
0.01% area impact<br />
[Block diagram: 8 processor cores, FPU, crossbar switch, L2, Jbus interface, DRAM control, CASP control, and on-chip buffer (7.5KB); two blocks labelled "modified for CASP support"; off-chip Flash holding 48 MB of compressed test patterns]<br />
~ 8K Verilog LOC modified (out of 100K+)<br />
35
Hardware-Only CASP Inefficient<br />
I/O packet drop, interrupt handling<br />
Visible application performance impact<br />
Solutions<br />
VAST – Virtualization Assisted CASP Self-Test<br />
OS migration<br />
CASP-aware OS scheduling<br />
CPUs | OS | Virtualization s/w<br />
ARM: MP11 x 4 | Linux 2.6.7 | NEC in-house<br />
36
Hardware-Only CASP Inefficient (continued)<br />
[Chart: test coverage vs. efficiency (minimize system performance impact). Labels: Logic BIST; hardware-only CASP; VAST + CASP-aware OS scheduler. Target: high coverage & low cost]<br />
37
CASP-Aware OS Scheduling: Interactive App.<br />
Workload: Firefox<br />
Platform: Dual quad-core Xeon, Linux 2.6.25.9 scheduler modified<br />
Response time:<br />
CASP-aware OS scheduling: < 200ms (no effect)<br />
Hardware-only CASP: > 200ms, 500ms (UNACCEPTABLE)<br />
38
CASP for Uncore in SoCs ?<br />
Uncore CASP challenging<br />
Multiple uncore copies not available<br />
Multiple cores affected during uncore CASP<br />
Uncore significant : OpenSPARC T2 (8-cores, 64 threads)<br />
[Chart: OpenSPARC T2 divided into memory (memory BIST + self-repair), cores (core CASP), and uncore (new uncore CASP)]<br />
New Uncore CASP: utilize self-similarity [VTS 10]<br />
39
Carbon Nanotube (CNT) FETs: Big Promise<br />
BUT, Major barriers<br />
Imperfections: Mis-positioned & metallic CNTs<br />
Imperfection-immune design essential<br />
New solutions: <strong>VLSI</strong>, practical, elegantly simple<br />
First experimental demo: Complex circuits, Latches<br />
[Figures: imperfection-immune half-adder sum and imperfection-immune D-latch. Plots of VOUT (V) vs. VA (V), 0 to 3 V, for VB = 0V and VB = 3V; micrographs of CNTs with 20 µm scale bars]<br />
Collaborator: Prof. H.-S.P. Wong, Stanford<br />
41
Conclusion<br />
<strong>Robust</strong> system design<br />
Efficient techniques practical<br />
Thorough validation & test<br />
IFRA<br />
Tolerate imperfect hardware<br />
BISER + failure prediction + CASP diagnostics<br />
Software-orchestration a MUST<br />
43