
Trends from Ten Years of Soft Error Experimentation

Anand Dixit
Raymond Heald
Alan Wood

March 24, 2009


Overview

● Introduction
● Test setups
● SER trends for memory and flops
● Multi-cell upsets and design implications
● Conclusions

March 24, 2009
Sun Microsystems


Introduction

● Reliability and Availability – key selling points
   – Soft errors
● Experimental results : 250nm to 65nm
● SER trends
   – SER ↓ as technology has scaled
   – Flops overtake SRAM (on per cell basis)
   – Multi-cell upsets gaining importance
● Changes in test setups over the years


Test Setups


Early high altitude tests

● Broomfield, CO
● ~5000 ft altitude
● ~4X neutrons
● 300 machines
● Few months
● Large error bars
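The large error bars are what counting statistics predict when a few hundred machines run for a few months and see only a handful of upsets. A minimal sketch, assuming upsets follow a Poisson distribution and using the normal approximation for the interval. The 90-day duration and the error counts are made-up illustration values, and `rate_with_error` is a hypothetical helper name.

```python
import math

def rate_with_error(n_errors: int, device_hours: float, z: float = 1.96):
    """Return (rate, relative half-width) for a Poisson error count.

    Uses sqrt(n) as the standard deviation of the count, which is only a
    rough approximation when n is very small.
    """
    if n_errors <= 0 or device_hours <= 0:
        raise ValueError("need a positive error count and exposure")
    rate = n_errors / device_hours
    rel_half_width = z / math.sqrt(n_errors)
    return rate, rel_half_width

# Hypothetical exposure: 300 machines for 90 days (duration is an assumption).
hours = 300 * 90 * 24
for n in (4, 25, 400):
    rate, rel = rate_with_error(n, hours)
    print(f"{n:4d} errors: rate = {rate:.2e} per machine-hour, +/- {rel:.0%}")
```

The relative uncertainty shrinks only as one over the square root of the count, which is the usual argument for moving to an accelerated beam.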


Case for Los Alamos

[Figure: plot with an energy axis in MeV]

Ref: http://wnr.lanl.gov/see/poster.pdf


Test Strategy

● Memory
   – Load pattern in the arrays
   – Continuous read back loops while the part is in the beam
   – Errors correlated in time and space
● Flops
   – Load pattern through scan chain
   – Soak part in the beam
   – Read after the beam is turned off
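The memory flow above (load, read back in a loop, log where and when each mismatch occurs) can be sketched as a small simulation. This is not the authors' test code: the memory is a Python list, and `upset_prob` is a hypothetical per-cell flip probability that stands in for the beam.

```python
import random

def run_memory_test(size, pattern, passes, upset_prob, seed=0):
    """Simulate load / read-back / compare / rewrite on a toy 8-bit memory.

    A real test would read hardware; here random bit flips stand in for
    beam-induced upsets between read-back passes.
    """
    rng = random.Random(seed)
    memory = [pattern] * size          # load the pattern into the array
    log = []                           # (pass index, address, flipped-bit mask)
    for p in range(passes):
        # Stand-in for beam exposure between read-backs.
        for addr in range(size):
            if rng.random() < upset_prob:
                memory[addr] ^= 1 << rng.randrange(8)
        # Read back, compare with the expected pattern, log and repair.
        for addr, value in enumerate(memory):
            diff = value ^ pattern
            if diff:
                log.append((p, addr, diff))
                memory[addr] = pattern
    return log

errors = run_memory_test(size=1024, pattern=0x55, passes=10, upset_prob=1e-3)
for pass_idx, addr, diff in errors:
    print(f"pass {pass_idx}: address {addr:#06x}, flipped bits {diff:#04x}")
```

Logging the pass index and the address for every mismatch is what allows errors to be correlated in time and space afterwards.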


Systems in the beam

● Good
   – Power, cooling
   – Only need one Ethernet cable (typ)
   – Mix different systems
● Bad
   – Needs expert low-level coding
   – Cannot be leveraged
● Ugly
   – Support chips come in the way


Tester based setup


Small low cost tester


Test boards in the beam

● Good
   – More stable (only DUT in the beam)
   – Codes leveraged from early silicon debug
   – Test boards are available earlier than full systems
● Bad
   – Multiple external power supplies
   – Cooling hardware
   – Can't mix systems (depends on tester)


Tradeoff Summary

● System
   – Power, cooling
   – Only need one Ethernet cable (typ)
   – Mix different systems
   – Support chips come in the way
   – Needs expert low-level coding
   – Cannot be leveraged
● Tester
   – More stable (only DUT in the beam)
   – Codes leveraged from early silicon debug
   – Multiple external power supplies
   – Cooling hardware
   – Can't mix systems (depends on tester)


Future : Test chips

- Get data early
- More efficient for flops
- Hope : get flops and memories on a single test chip


Apparent beam attenuation

Technology Node   Test Condition   Apparent Attenuation
180 nm            Systems          5-10%
130 nm            Systems          5-10%
90 nm             Systems          ~30%
65 nm             Test Boards      ~40%
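One way to use these numbers is to scale a measured rate back up to what an unattenuated beam would have produced. The sketch below assumes the apparent attenuation is the fraction of nominal flux that never reaches the device under test. That reading is an interpretation, not something the slide states. `corrected_ser` is a hypothetical helper, and midpoints are used for the 5-10% range.

```python
def corrected_ser(measured_ser: float, attenuation: float) -> float:
    """Scale a measured rate to its unattenuated-beam equivalent.

    Assumes the part saw only (1 - attenuation) of the nominal flux.
    """
    if not 0.0 <= attenuation < 1.0:
        raise ValueError("attenuation must be in [0, 1)")
    return measured_ser / (1.0 - attenuation)

# Attenuation values from the table; 0.075 is the midpoint of 5-10%.
attenuation_by_node = {"180 nm": 0.075, "130 nm": 0.075, "90 nm": 0.30, "65 nm": 0.40}
for node, att in attenuation_by_node.items():
    factor = corrected_ser(1.0, att)
    print(f"{node}: multiply measured SER by {factor:.2f}")
```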


Beam Transmission Spectrum


Test Results


SER trend for SRAM & Flops

[Figure: trend plot built up over three slides, with labels Vdd, SRAM, and Flops]


Effect of Vdd and Area

Vdd ↓ → Critical Charge ↓ → SER ↑
Area ↓ → Sensitive depletion region ↓ → SER ↓
SER is linear with Area and exponential with Vdd
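The two dependencies can be illustrated with a commonly used empirical form, SER ∝ Area × exp(−Qcrit / Qs), with the critical charge taken as Qcrit ≈ C × Vdd. The slides do not say which model was used, so this is only a sketch. The node capacitance and collection parameter are hypothetical placeholder values, not measured data.

```python
import math

def relative_ser(area_um2, vdd, node_cap_ff=1.0, q_s_fc=10.0):
    """Relative SER ~ area * exp(-Qcrit / Qs), with Qcrit = C * Vdd.

    node_cap_ff (fF) and q_s_fc (fC) are placeholder constants chosen only
    to show the shape of the dependence.
    """
    q_crit_fc = node_cap_ff * vdd      # fF * V = fC
    return area_um2 * math.exp(-q_crit_fc / q_s_fc)

base = relative_ser(area_um2=1.0, vdd=1.2)
print("half the area:      ", relative_ser(0.5, 1.2) / base)
print("Vdd 1.2 V -> 1.0 V: ", relative_ser(1.0, 1.0) / base)
```

Halving the area halves the rate, while lowering Vdd raises it through the exponential term, so scaling pushes SER in both directions at once.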


Test Results

[Figure: results plot annotated "Big drop !!"]


6T SRAM cell

SER: NMOS depletion region plays an important role

Ref: Borrowed from EE271 Stanford University


Traditional SRAM layout

[Figure: SRAM cell layout with L-shaped OD]

Ref: Intel's 0.18 μm SRAM cell


Litho friendly layout

[Figure: traditional layout compared with the new industry standard layout]

Ref: Ishida et al., IEDM 98-201.


Raw SEU Rate Per Processor

Tech. (nm)   Relative SEU in FITs/kbit   Uncorrected Mbits/Processor   Relative SEU/Processor   Note
250          3.2                         1.52                          5.0
180          3.0                         1.52                          4.3                      Optical shrink
130          2.4                         3.28                          7.9                      Same die size, 2x memory
90           1.0                         33.6                          33.6                     Massive increase in memory : L2$
65           0.7                         44.3                          30.5                     Only a modest increase in memory

Reflects processor design over the years
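The last column of the table appears to be roughly the per-kbit rate multiplied by the memory per processor. That relationship is an inference from the numbers, not something the slide states. A short check, using only the values in the table:

```python
# (tech node nm, relative SEU per kbit, Mbits per processor, table SEU per processor)
rows = [
    (250, 3.2, 1.52, 5.0),
    (180, 3.0, 1.52, 4.3),
    (130, 2.4, 3.28, 7.9),
    (90, 1.0, 33.6, 33.6),
    (65, 0.7, 44.3, 30.5),
]

def per_processor(seu_per_kbit, mbits):
    """Relative per-processor rate as per-bit rate times memory size."""
    return seu_per_kbit * mbits

for node, per_kbit, mbits, reported in rows:
    est = per_processor(per_kbit, mbits)
    print(f"{node:3d} nm: product = {est:5.1f}, table = {reported:5.1f}")
```

The products agree with the table to within about 6%, which is consistent with rounding in the published figures. The per-bit rate falls from 3.2 to 0.7 while the per-processor rate rises, because memory per processor grows much faster than the per-bit rate drops.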


Multi-cell upsets and Design Implications


Multi-cell upsets


Multi-cell patterns (90 nm)


Design Scenarios : Parity

● Cell: 2.1 μm x 0.7 μm
● Worst case pattern : ~16 μm
● Design guideline example
   – Cells in same word should be at 9-word interleave
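One plausible reading of the 9-word guideline is that two cells of the same word must be separated by at least the worst case pattern extent. The slide does not state the direction, so the sketch below assumes the interleave runs along the 2.1 μm cell dimension. `min_interleave` is a hypothetical helper.

```python
import math

def min_interleave(pattern_extent_um: float, cell_pitch_um: float) -> int:
    """Smallest interleave N with (N - 1) * pitch >= pattern extent.

    With N-way interleave, two cells of the same word have N - 1 other
    cells between them, so their edge-to-edge gap is (N - 1) * pitch.
    """
    if pattern_extent_um <= 0 or cell_pitch_um <= 0:
        raise ValueError("dimensions must be positive")
    return math.ceil(pattern_extent_um / cell_pitch_um) + 1

print(min_interleave(16.0, 2.1))
```

With a 16 μm extent and a 2.1 μm pitch this gives 9, matching the guideline. A different pitch or a larger worst case pattern changes the answer, so the number should be recomputed per technology.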


Design Scenarios : ECC (1)

[Figure labels: "2-rows in error"; "One pattern for silent uncorrected error for 2-bit interleave"]

● Standard ECC
   – 1-bit correct, 2-bit detect
● Design guideline example
   – 2-word interleave => ~ 2 FIT
   – Use more distance for better protection against silent uncorrected errors
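To see why more interleave distance helps a 1-bit-correct, 2-bit-detect code, the sketch below maps a run of adjacent upset cells onto logical words. It is one-dimensional only and assumes physical column c holds a bit of word c % interleave. Both the mapping and the helper names are illustrative assumptions, and multi-row patterns such as the one in the figure are not modeled.

```python
from collections import Counter

def bits_hit_per_word(first_col: int, cluster_width: int, interleave: int) -> Counter:
    """Count upset bits per logical word for a run of adjacent columns."""
    cols = range(first_col, first_col + cluster_width)
    return Counter(c % interleave for c in cols)

def classify(worst_bits: int) -> str:
    """Outcome for a 1-bit-correct, 2-bit-detect code."""
    if worst_bits <= 1:
        return "corrected"
    if worst_bits == 2:
        return "detected, not corrected"
    return "possibly silent"

for width in (1, 2, 3, 4, 5):
    worst = max(bits_hit_per_word(0, width, interleave=2).values())
    print(f"{width} adjacent cells, 2-word interleave: {classify(worst)}")
```

Raising `interleave` spreads the same cluster over more words, so each word sees fewer bad bits.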


Design Scenarios : ECC (2)

● Plot probability of a second bit in error versus distance
   – Relatively unchanged over the years
● Dial in the separation between cells
   – Get probability 'p'
● Calculate
   – Prob. of 2-bit errors (detected but not corrected) = 'p'
   – Prob. of 3 or more bits in error (silent uncorrected) ≈ 'p²'
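The calculation step can be written out directly. A minimal sketch: the raw upset rate and the value of p below are hypothetical inputs, and in practice p would be read off the probability-versus-distance plot for the chosen cell separation.

```python
def multi_bit_rates(upset_rate_fit: float, p: float):
    """Split an upset rate using the p and p^2 approximations.

    p is the probability that a second cell of the same word is also hit
    at the chosen separation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    detected_uncorrected = upset_rate_fit * p        # 2-bit errors
    silent_uncorrected = upset_rate_fit * p ** 2     # 3 or more bits
    return detected_uncorrected, silent_uncorrected

# Hypothetical inputs: 1000 FIT raw upset rate, p = 0.01 at the chosen spacing.
due, sdc = multi_bit_rates(1000.0, 0.01)
print(f"detected but uncorrected: {due:.3g} FIT, silent: {sdc:.3g} FIT")
```

Because the silent rate scales with p squared, a modest increase in separation that lowers p pays off disproportionately against silent errors.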


Conclusions

● Tester based experimental setups
● SER trends
   – SER ↓ as technology has scaled (250nm to 65nm)
   – Flops are above SRAM (on per cell basis)
   – Multi-cell upsets
● Design implications for cell interleaving
● Watch out for flops !
   – Easily overlooked
   – Multiple complex designs in use


Questions ?
