20.09.2013 Views

Robust System Design - VLSI


Robust System Design<br />

Subhasish Mitra<br />

Robust Systems Group<br />

Dept. of EE & Dept. of CS<br />

Stanford University<br />

www.stanford.edu/~subh<br />

Acknowledgment: Students & Collaborators<br />

1


Radiation-Induced Soft Errors<br />

“It’s ridiculous. I’ve got a $300,000 server that doesn’t work. The thing should be bullet-proof.”<br />

Causes: α particles (package), Neutrons (cosmic rays)<br />

2


Who Cares About Soft Errors?<br />

20K processors server farm<br />

1 major flip-flop error every 20 days<br />

Silent data corruption<br />

$20K → $3,616 bank deposit<br />

Downtime: $100K - $10M / hr.<br />

Memory ECC routinely used<br />

System error rates increasing<br />

Figure: soft error rate contributions from combinational logic, flip-flops, and unprotected memory<br />
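The bank deposit example above ($20K and $3,616) is consistent with a single flipped bit: 20,000 and 3,616 differ only in bit 14 (2^14 = 16,384). The sketch below checks that arithmetic. The function name `flip_bit` and the plain-integer representation are illustrative assumptions, not details from the talk.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted (models a single-bit upset)."""
    return value ^ (1 << bit)


deposit = 20_000
corrupted = flip_bit(deposit, 14)  # 2**14 = 16,384

# Count how many bit positions differ between the two values.
differing_bits = bin(deposit ^ corrupted).count("1")
print(deposit, "->", corrupted, "differing bits:", differing_bits)
```

Nothing crashes in this scenario, which is why such an error counts as silent data corruption.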

3


System Outage Causes<br />

Hardware failures can be significant<br />

Depends on application<br />

Figure: HPC, Los Alamos [Schroeder DSN 06] (disk failures excluded); outage causes: unknown; power, operator; software; hardware<br />

Figure: Storage servers [Shazli ITC 08] (disk failures excluded); outage causes: power; software; hardware<br />

4


Robust System Design Challenges<br />

Inputs → Robust System → Acceptable results<br />

Technology challenges: radiation, erratic bits, aging, early-life failures<br />

System complexity: post-silicon bugs<br />

Constraints: power, performance<br />

5


Robust System Design<br />

Meet user expectations<br />

Despite underlying disturbances<br />

Thorough validation & test<br />

Tolerate imperfect hardware<br />

Beyond silicon-CMOS: imperfection-immune logic<br />

6


Outline<br />

Introduction<br />

Thorough validation & test<br />

Tolerate imperfect hardware<br />

Beyond silicon-CMOS: imperfection-immune logic<br />

Conclusion<br />

7


Who Cares About Post-Silicon Validation?<br />

Flow: Design → Pre-silicon verification → Post-silicon validation → High volume<br />

35% development time<br />

25% design resources<br />

Barcelona shipment delayed 6 months due to bug in TLB<br />

BIOS fix: 10-20% performance penalty<br />

“Post-silicon cost & complexity is rising faster than design cost” – S. Yerramilli, V.P., Intel<br />

8


Post-Silicon Bug Localization Challenge<br />

Pinpoint from system failure (e.g., crash)<br />

Bug location, exposing stimulus<br />

Electrical bugs: days to weeks per bug<br />

Flow: Run apps. (OS, games) → Detect bugs → Localize bugs → Root-cause & fix<br />

Localization dominates cost<br />

9


IFRA Key Message<br />

IFRA: Instruction Footprint Recording and Analysis<br />

Effective: 96% accurate<br />

No system failure reproduction<br />

No system simulation<br />

Inexpensive: 1% area<br />

Practical<br />

Alpha 21264<br />

Intel Nehalem<br />

10


IFRA Principle<br />

Design phase: insert recorders inside chip design<br />

1% area cost; 60KB for Alpha 21264<br />

Post-Si validation: run tests and record special info. in recorders (non-intrusive)<br />

Failure detected? No: keep running tests. Yes: scan out recorder contents<br />

No failure reproduction; single test run<br />

Post-analyze offline<br />

No system simulation; self-consistency vs. test program binary<br />

Result: localized bug (location, stimulus)<br />
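One way to picture "self-consistency vs. test program binary" is a check that the recorded sequence of program counters is a path the binary actually allows. The sketch below is a simplified illustration, not the IFRA algorithm. The `Binary` mapping, the instruction kinds, and the function name are hypothetical, and speculation, flushes, and indirect jumps are ignored. The 4-byte instruction size matches the Alpha ISA.

```python
from typing import Dict, List, Optional, Tuple

INSTR_BYTES = 4  # Alpha instructions are 4 bytes wide

# Hypothetical decoded binary: PC -> (kind, branch target or None)
Binary = Dict[int, Tuple[str, Optional[int]]]


def first_control_flow_violation(binary: Binary, pcs: List[int]) -> Optional[int]:
    """Return the index of the first recorded PC that cannot legally follow
    its predecessor according to the binary, or None if the trace is consistent."""
    for i in range(1, len(pcs)):
        prev, cur = pcs[i - 1], pcs[i]
        kind, target = binary[prev]
        allowed = {prev + INSTR_BYTES}      # fall-through
        if kind == "branch" and target is not None:
            allowed.add(target)             # conditional branch may be taken
        if kind == "jump":
            allowed = {target}              # unconditional: only the target
        if cur not in allowed:
            return i
    return None


binary: Binary = {
    0x100: ("alu", None),
    0x104: ("branch", 0x110),
    0x108: ("alu", None),
    0x10C: ("alu", None),
    0x110: ("alu", None),
}
good_trace = [0x100, 0x104, 0x110]   # branch taken: legal
bad_trace = [0x100, 0x104, 0x10C]    # neither fall-through nor target
print(first_control_flow_violation(binary, good_trace))
print(first_control_flow_violation(binary, bad_trace))
```

Because the check needs only the recorded footprints and the binary, it requires neither failure reproduction nor system simulation.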

11


IFRA Hardware in Superscalar Processor<br />

Figure: superscalar processor pipeline (Alpha 21264, open source) with recorders attached to each stage<br />

FETCH: Branch Predictor, I-TLB, I-Cache, Fetch Queue, ID assignment<br />

DECODE: Decoders<br />

DISPATCH: Reg Map, Reg Free<br />

ISSUE: Instruction Window<br />

EXECUTE: MUL, 2xALU, 2xBr, FPU, 2xLSU<br />

COMMIT: Reorder Buffer, Reg Map<br />

Other blocks shown: Reg Rename, Phys Regfile, D-Cache, D-TLB; pipeline registers between stages<br />

Recorders: part of scan chain; no at-speed routing<br />

Post-Trigger Generator<br />

12


Recording Example<br />

Figure: two instructions (INST1, INST2) moving through FETCH (Branch Predictor, I-TLB, I-Cache, Fetch Queue) and DECODE (Decoder), with a pipeline register after each stage<br />

ID assignment: INST1 and INST2 receive ID1 and ID2 under the special ID assignment rule; the IDs accompany the instructions through the pipeline registers<br />

Instruction footprints in Recorder 1 (fetch): ID1 with auxiliary info PC1; ID2 with auxiliary info PC2<br />

Instruction footprints in Recorder 2 (decode): ID1 with auxiliary info decoded bits1; ID2 with auxiliary info decoded bits2<br />

13


Special Rule for Instruction ID<br />

Simplistic schemes inadequate<br />

Speculation + flushes, out-of-order, loops<br />

Multiple clocks, voltage frequency scaling<br />

Special rule [Park TCAD 09]<br />

ID width: $\log_2(4n)$ bits<br />

n = max. instructions in flight<br />

8 bits for Alpha (n = 64)<br />

No timestamp or global synchronization<br />
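Reading the ID width on this slide as log2(4n), the Alpha figure can be checked directly: with n = 64, 4n = 256 and log2(256) = 8. The helper below is a minimal sketch. The function name is an assumption, and the round-up for values of n that are not powers of two is also an assumption, since the slide gives only the n = 64 case.

```python
def id_width_bits(max_in_flight: int) -> int:
    """Bits needed for an instruction ID under the log2(4n) rule, rounded up."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be positive")
    # (x - 1).bit_length() equals ceil(log2(x)) for any integer x >= 1.
    return (4 * max_in_flight - 1).bit_length()


alpha_bits = id_width_bits(64)
print(alpha_bits)
```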

14


What to Record?<br />

Pipeline stage | Auxiliary information | Bits per recorder | Number of recorders<br />

Fetch | Program counter | 32 | 4<br />

Decode | Decoding results | 4 | 4<br />

Dispatch | Register names residue | 6 | 4<br />

Issue | Operands residue | 6 | 4<br />

Execution (ALU, MUL) | Result residue | 3 | 4<br />

Execution (Branch) | None | 0 | 2<br />

Execution (Load/Store) | Result residue; memory address | 35 | 2<br />

Total required storage for all recorders: 60 KBytes<br />
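As a rough plausibility check on the 60 KByte total, the sketch below adds up the width of one footprint per recorder. It rests on assumptions the slides do not confirm: that each footprint stores the 8-bit ID plus the listed auxiliary bits, that the "bits per recorder" column is a per-footprint width, and that 60 KBytes means 60 × 1024 bytes. The implied depth is therefore only an estimate, not a figure from the talk.

```python
ID_BITS = 8  # ID width for Alpha 21264 (n = 64), from the ID rule slide

# (stage, auxiliary bits per footprint, number of recorders), from the table above
RECORDERS = [
    ("Fetch", 32, 4),
    ("Decode", 4, 4),
    ("Dispatch", 6, 4),
    ("Issue", 6, 4),
    ("Execution (ALU, MUL)", 3, 4),
    ("Execution (Branch)", 0, 2),
    ("Execution (Load/Store)", 35, 2),
]


def bits_per_footprint_row(recorders=RECORDERS, id_bits=ID_BITS) -> int:
    """Bits needed to hold one footprint in every recorder at once."""
    return sum((id_bits + aux) * count for _, aux, count in recorders)


row_bits = bits_per_footprint_row()
total_bits = 60 * 1024 * 8              # 60 KBytes, assuming 1 KByte = 1024 bytes
implied_depth = total_bits // row_bits  # footprints per recorder, if depth is uniform
print(row_bits, implied_depth)
```

Under these assumptions each recorder would hold on the order of a thousand footprints.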

15


Early Warnings: a MUST<br />

Timeline: test program execution starts at t=0; error after 5 billion cycles (e.g., speedpath); failure after 6 billion cycles (e.g., crash)<br />

Need to capture the error in recorder storage<br />

Early failure detection (post-triggers)<br />

Error detection – residue, array parity<br />

Deadlock & segfault<br />

Special early warnings (early failure suspect detection) pause recording<br />
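The need for early warnings follows from finite recorder storage: if recording continued from the error until a crash a billion cycles later, the footprints around the error would be overwritten. A minimal sketch of that behavior is a circular buffer that freezes when a post-trigger fires. The class name, method names, and buffer depth below are hypothetical, not the recorder design from the talk.

```python
from collections import deque


class FootprintRecorder:
    """Fixed-size circular buffer that stops recording once a post-trigger fires."""

    def __init__(self, depth: int):
        self.entries = deque(maxlen=depth)  # oldest entries are overwritten
        self.paused = False

    def record(self, instr_id: int, aux: int) -> None:
        if not self.paused:
            self.entries.append((instr_id, aux))

    def post_trigger(self) -> None:
        """Called on an early warning (e.g., residue mismatch, parity error)."""
        self.paused = True

    def scan_out(self):
        return list(self.entries)


rec = FootprintRecorder(depth=4)
for cycle in range(10):
    rec.record(instr_id=cycle % 8, aux=0x1000 + 4 * cycle)
    if cycle == 6:              # hypothetical early warning at cycle 6
        rec.post_trigger()
snapshot = rec.scan_out()
print(snapshot)
```

After the trigger, the buffer still holds the footprints from the cycles just before the warning, which is what post-analysis needs.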

16


Post-Analysis Overview<br />

Inputs: test program binary; footprints from recorders<br />

Flow: Link footprints → High-level analysis → Low-level analysis → List of bug location-stimulus pairs<br />

High-level analysis (micro-architecture independent): control-flow analysis, data-dependency analysis, decoding analysis, load/store analysis<br />

Low-level analysis (micro-architecture dependent): residue consistency check<br />
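A residue consistency check relies on residues being preserved by arithmetic: for addition, (a + b) mod m equals ((a mod m) + (b mod m)) mod m, so a result can be checked from a few recorded bits. The sketch below uses m = 7 because a mod-7 residue fits the 3-bit result residue field in the recording table. That modulus is an assumption, as are the function names. Unbounded Python integers are used, so fixed-width wraparound is not modeled.

```python
MODULUS = 7  # assumption: a mod-7 residue fits the 3-bit "result residue" field


def residue(x: int, m: int = MODULUS) -> int:
    return x % m


def add_is_consistent(res_a: int, res_b: int, res_sum: int, m: int = MODULUS) -> bool:
    """Check a recorded addition using residues only: r(a+b) == (r(a) + r(b)) mod m."""
    return (res_a + res_b) % m == res_sum


a, b = 1234, 5678
good_sum = a + b
bad_sum = good_sum ^ (1 << 4)   # model a single-bit error in the result

ok = add_is_consistent(residue(a), residue(b), residue(good_sum))
caught = not add_is_consistent(residue(a), residue(b), residue(bad_sum))
print(ok, caught)
```

With m = 7, any single-bit error in the result is caught, because it changes the value by a power of two and no power of two is divisible by 7.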

17


Link Footprints<br />

Figure: footprints from the fetch-stage, issue-stage, and execution-stage recorders, ordered from earlier to later time, are linked by instruction ID to the instructions in the test program binary (PC0 INST0 through PC6 INST6)<br />

Each footprint pairs an ID with auxiliary info, e.g., fetch-stage (ID: 5, PC3), issue-stage (ID: 0, AUX0); the same ID values recur over time<br />

Special ID rule ensures:<br />

Uncommitted instructions uniquely identified<br />

Committed instructions correctly linked<br />

18


Debug Example<br />

Flow: Link footprints → HLA1, HLA2, HLA3, HLA4 → Low-level analysis → Bug locations + exposing stimulus<br />

HLA: High-level analysis<br />

Figure: tree of analysis questions (shown as ?) whose answers lead to the bug locations + exposing stimulus<br />

19


IFRA Results – Alpha 21264<br />

Correct localization: 96%<br />

Complete miss: 4%<br />

Exact localization: 78%<br />

Avg. 6 candidates: 22%<br />

Total candidates: > 200,000<br />

1 of 200 design blocks, 1 of 1,000 error appearance cycles<br />

Exciting results for Intel Nehalem<br />

20


Outline<br />

Introduction<br />

Thorough validation & test<br />

Tolerate imperfect hardware<br />

Beyond silicon-CMOS: imperfection-immune logic<br />

Conclusion<br />

21


Low-Cost Error Detection Most Important<br />

Why concurrent error detection (CED)?<br />

Crashes vs. silent errors<br />

Traditional CED expensive<br />

Low-cost – How?<br />

New failure mode signatures<br />

Optimize across abstraction layers<br />

Configurable & reuse<br />

22


Low-cost Resilience<br />

Figure: failure rate vs. time, from early life through lifetime to wearout<br />

Early-life failures (infant mortality): burn-in difficult, Iddq ineffective<br />

Transistor aging: guardbands expensive<br />

Circuit Failure Prediction<br />

On-line Diagnostics<br />

Built-In Soft Error Resilience (BISER); 45nm data: errors reduced 1,000X<br />

Global optimization, software orchestrated<br />

23


BISER Latch Soft Error Correction<br />

Figure: combinational logic output (IN) drives two latches (D, C, Q) controlled by Clock; latch outputs A and B feed a C-element with a weak keeper, which produces OUT<br />

Redundant Latch (Scan Test & Debug reuse)<br />

C-element: (A, B) = 00 gives OUT = 1; (A, B) = 11 gives OUT = 0; (A, B) = 01 or 10: previous value retained<br />

Key Observation: Latches vulnerable only in Opaque state (Clock = 0)<br />
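The C-element behavior on this slide can be sketched as a small state machine: OUT changes only when A and B agree, so an upset in one latch leaves OUT at its previous value. The class below follows the slide's inverting truth table (00 gives 1, 11 gives 0). The class and method names are illustrative, and the model is purely logical, with no timing or keeper strength.

```python
class CElement:
    """Inverting C-element with keeper, following the slide's truth table:
    (A, B) = 00 -> OUT = 1, 11 -> OUT = 0, 01 or 10 -> previous value retained."""

    def __init__(self, initial_out: int = 0):
        self.out = initial_out

    def update(self, a: int, b: int) -> int:
        if a == b:              # inputs agree: drive the inverted value
            self.out = 1 - a
        return self.out         # inputs disagree: keeper holds the old value


c = CElement()
trace = []
trace.append(c.update(1, 1))  # both latches hold 1 -> OUT = 0
trace.append(c.update(0, 1))  # soft error flips latch A: OUT held at 0
trace.append(c.update(1, 1))  # latch A holds the correct value again
trace.append(c.update(0, 0))  # both latches hold 0 -> OUT = 1
trace.append(c.update(0, 1))  # soft error flips latch B: OUT held at 1
print(trace)
```

A single upset never propagates to OUT in this model. Simultaneous upsets in both latches are not covered.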

24


Architecture-Aware BISER Insertion<br />

Figure: cumulative error coverage (0% to 100%) vs. cumulative latch coverage (0% to 100%); Alpha 21264 error injection<br />

Annotations: 10X chip-level protection; 2X; 2.5% power penalty; 9% chip-level power penalty<br />

Ack: Prof. S.J. Patel, UIUC for error injector<br />

Optimized BISER insertion: verification-guided?<br />

25


Reconfigurable Correction – Economy Mode<br />

Figure: schematic of a scan / checking flip-flop paired with a system flip-flop; a C-element with keeper drives the system output<br />

Signals: Scan Clock A, Scan Clock B = 1, Scan Data, Capture = 0, Update, System Data, System Clock, Scan Output, System Output<br />

Integrated design quality<br />

Soft error correction, scan test, post-silicon debug<br />

26


Single Event Multiple Upsets (SEMU)<br />

SEMUs (aka MBUs) increasing<br />

Single error assumption not sufficient<br />

Figure: radiation experiment results [IRPS 10]; measured error rate (arbitrary units, log scale from 1 to 10^4) for “Basic” flip-flop, 2X flip-flop, BISER flip-flop, and new LEAP flip-flop<br />

27


Circuit Failure Prediction: Early Indicator<br />

BEFORE errors appear<br />

Failure Prediction | Error Detection<br />

Before errors appear | After errors appear<br />

+ No corrupt data & states | – Corrupt data & states<br />

+ Low cost | – High cost<br />

+ Self-diagnosis | – Limited diagnosis<br />

Both can be efficiently combined<br />

“A little fire is quickly trodden out; Which, being suffer'd, rivers cannot quench.” William Shakespeare, King Henry the Sixth, Part III<br />

29


Circuit Failure Prediction Applicability<br />

Degradation causes a delay SHIFT (≠ delay fault)<br />

Transistor aging<br />

e.g., Negative Bias Temperature Instability<br />

Adaptive (≠ worst-case) guardbands<br />

Gate-oxide Early-Life Failures (ELF)<br />

Burn-in alternatives<br />

30


New Gate-oxide ELF Signature: 90nm<br />

Delay shifts over time: distinct from NBTI, PBTI, hot-e<br />

Figure: standard normal quantile vs. Ids [µA] for W = 0.2µm devices; fresh, 240 min. stress, and 5340 min. stress; outliers marked<br />

Figure: Ids [µA] after 5340 min. stress vs. fresh Ids [µA]; 952 pairs; Ids outliers: 11.6% of entire population<br />

Figure: outlier locations (X, Y) random in 0.2µm array; 952 pairs with largest Ig increase: 11.6% of entire population; 885 pairs, 92%<br />

Figure: Iddq [A] at VDD = 1V (10^-7 to 10^-4) vs. stress time [sec] for inverter chain input LOW and input HIGH; 100X<br />

Figure: delay shifts [ps] (-500 to 1,500) vs. stress time [sec] for rising and falling transitions; delay jump<br />

31


Circuit Failure Prediction: How?<br />

Periodic on-line self-test & diagnostics<br />

Periodic: minimal power costs<br />

Clock control reuse: test & debug, DVFS<br />

Concurrent with application execution<br />

Special flip-flops<br />

Self-healing<br />

Optimized lifetime power efficiency [DATE 10]<br />

32


On-line Self-Test & Diagnostics: CASP<br />

Pseudo-random BIST difficult<br />

CASP: Concurrent, Autonomous, Stored Patterns<br />

Multi-core: no visible downtime<br />

Software orchestration<br />

Stored test patterns: off-chip FLASH<br />

Test compression: X-Compact<br />

Comparable to or better than production tests<br />

Major Technology Trends Favor CASP<br />

33


CASP Online Diagnostics Flow<br />

Flow: Scheduling → Pre-processing → Test Application → Post-processing, then back to Scheduling<br />

Scheduling: power / performance-aware scheduler; Core 4 selected for test; Core N in normal operation<br />

Pre-processing: prepare core for online diagnostics; Core 4 temporarily isolated; Core N in normal operation<br />

Test Application: thorough scan & functional testing; Core 4 under test; Core N in normal operation<br />

Post-processing: bring core from online diagnostics to normal operation; Core 4 resumes operation; Core N in normal operation<br />
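The four-phase flow above can be sketched as a loop that takes one core out of service at a time while every other core keeps running. In the sketch below, a round-robin order stands in for the power / performance-aware scheduler, whose actual policy the slides do not describe. All function names and state strings are illustrative assumptions.

```python
from typing import Dict, List

PHASES = ["scheduling", "pre-processing", "test application", "post-processing"]

# State of the selected core in each phase, taken from the flow description.
STATE_BY_PHASE = {
    "scheduling": "selected for test",
    "pre-processing": "temporarily isolated",
    "test application": "under test",
    "post-processing": "resuming operation",
}


def casp_round(num_cores: int, core_under_test: int) -> List[Dict[int, str]]:
    """Return the per-core state for each phase while one core is tested."""
    timeline = []
    for phase in PHASES:
        snapshot = {core: "normal operation" for core in range(num_cores)}
        snapshot[core_under_test] = STATE_BY_PHASE[phase]
        timeline.append(snapshot)
    return timeline


def round_robin(num_cores: int, rounds: int) -> List[int]:
    """Hypothetical stand-in for the power / performance-aware scheduler."""
    return [r % num_cores for r in range(rounds)]


order = round_robin(num_cores=8, rounds=10)
timeline = casp_round(num_cores=8, core_under_test=4)
busy = [sum(state == "normal operation" for state in snap.values()) for snap in timeline]
print(order, busy)
```

In every phase, seven of the eight cores remain in normal operation, which is the sense in which a multi-core system sees no visible downtime.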

34


CASP for SUN OpenSPARC T1 Cores<br />

Test coverage: stuck-at 99.5%, transition 96%, true-time 93.5%<br />

Storage: 48 MBytes<br />

Test time per core: 0.3 sec.<br />

0.01% area impact<br />

Figure: OpenSPARC T1 block diagram: 8 processor cores (modified for CASP support), crossbar switch (modified for CASP support), FPU, L2, Jbus interface, DRAM control, CASP control, on-chip buffer (7.5KB); off-chip flash holding 48 MB of compressed test patterns<br />

~ 8K Verilog LOC modified (out of 100K+)<br />

35


Hardware-Only CASP Inefficient<br />

I/O packet drop, interrupt handling<br />

Visible application performance impact<br />

Solutions<br />

VAST – Virtualization Assisted CASP Self-Test<br />

OS migration<br />

CASP-aware OS scheduling<br />

Figure: test coverage vs. efficiency (minimize system performance impact) for Logic BIST, hardware-only CASP, and VAST + CASP-aware OS scheduler; goal: high coverage & low cost<br />

CPUs | OS | Virtualization s/w<br />

ARM: MP11 x 4 | Linux 2.6.7 | NEC in-house<br />

37


CASP-Aware OS Scheduling: Interactive App.<br />

Workload: Firefox<br />

Platform: Dual quad-core Xeon, Linux 2.6.25.9 (scheduler modified)<br />

Response time:<br />

CASP-aware OS scheduling: < 200ms; no effect<br />

Hardware-only CASP: > 200ms, 500ms; UNACCEPTABLE<br />

38


CASP for Uncore in SoCs?<br />

Uncore CASP challenging<br />

Multiple uncore copies not available<br />

Multiple cores affected during uncore CASP<br />

Uncore significant: OpenSPARC T2 (8 cores, 64 threads)<br />

Figure: OpenSPARC T2 regions and their test approach: memory (memory BIST + self-repair), cores (core CASP), uncore (new uncore CASP)<br />

New Uncore CASP: utilize self-similarity [VTS 10]<br />

39


Outline<br />

Introduction<br />

Thorough validation & test<br />

Tolerate imperfect hardware<br />

Beyond silicon-CMOS: imperfection-immune logic<br />

Conclusion<br />

40


Carbon Nanotube (CNT) FETs: Big Promise<br />

BUT, Major barriers<br />

Imperfections: Mis-positioned & metallic CNTs<br />

Imperfection-immune design essential<br />

New solutions: VLSI, practical, elegantly simple<br />

First experimental demo: Complex circuits, Latches<br />

Figure: imperfection-immune half-adder sum; VOUT (V) vs. VA (V), both from 0 to 3 V, for VB = 0V and VB = 3V; micrograph showing CNTs, 20 µm scale bar<br />

Figure: imperfection-immune D-latch; micrograph, 20 µm scale bar<br />

Collaborator: Prof. H.-S.P. Wong, Stanford<br />

41


Outline<br />

Introduction<br />

Thorough validation & test<br />

Tolerate imperfect hardware<br />

Beyond silicon-CMOS: imperfection-immune logic<br />

Conclusion<br />

42


Conclusion<br />

<strong>Robust</strong> system design<br />

Efficient techniques practical<br />

Thorough validation & test<br />

IFRA<br />

Tolerate imperfect hardware<br />

BISER + failure prediction + CASP diagnostics<br />

Software-orchestration a MUST<br />

43
