
The Vector Floating-Point Unit in a Synergistic Processor Element of a CELL Processor

Silvia M. Mueller, Ch. Jacobi

IBM Boeblingen 

H-J. Oh, K.D. Tran, S.R. Cottier, B.W. Michael, S.H. Dhong 

IBM Austin 

H. Nishikawa

IBM Yasu

Y. Totsuka

SCEI / SONY

T. Namatame, N. Yano, T. Machida

Toshiba

ARITH-17, June 2005

Vector FPU of a Synergistic Processing Element of a CELL Processor 

CELL Processor: “Supercomputer on a Chip” 

Synergistic Processing Element SPE 

• Provides computational power of CELL processor 

• Optimized for compute-intensive and broadband rich media applications such as real-time 3D graphics, media, DSP, …

• Power & area efficiency are key enablers for multi-core design

SPfpu & DPfpu 

FPU of the SPE 

• Vector FPU operating on 128b data 

– Single precision: 4x32b 

– Double precision: 2x64b 

• Outstanding SP performance 

High frequency, low latency 

SP in main data-flow bit stack 

DP on the side 

SPE 

Silvia M. Mueller | June 2005 | ARITH-17 

© 2005 IBM Corporation


SP-FPU of the SPE 

Operands 

Result 

Performance 

• 5.5-cycle FMA core @ 11fo4 per cycle: about 60fo4 total

• Latency of conventional FMA pipes

– DP: 110 to 120fo4, SP: about 100fo4

• Need to save about 40fo4
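The latency budget above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slide; the variable names are illustrative and not from the original.

```python
# Latency budget of the SPfpu FMA core, using only the figures quoted on the slide.
FMA_CYCLES = 5.5            # pipeline depth of the FMA core
FO4_PER_CYCLE = 11          # cycle time in fo4
CONVENTIONAL_SP_FO4 = 100   # "about 100fo4" for a conventional SP FMA pipe

fma_latency_fo4 = FMA_CYCLES * FO4_PER_CYCLE
required_saving_fo4 = CONVENTIONAL_SP_FO4 - fma_latency_fo4

print(f"FMA core latency: {fma_latency_fo4} fo4")
print(f"Saving versus a conventional SP pipe: {required_saving_fo4} fo4")
```

The exact product is 60.5fo4, which the slide rounds to 60fo4, giving the quoted saving of roughly 40fo4.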

Concepts 

• Co-design all levels of the design: 

– architecture, logic, circuit, layout, floorplan 

• Make common case fast 

– Extra cycle for Integer multiply & converts 

– FMA-type ops, normal range, truncation 

– Simpler rounder saves 15fo4 

• Avoid design tricks with large area overhead 

• Multi-level clock gating: 

– wave, opcode, data dependent 

[Figure: SPfpu block diagram, split into frontend and FMA core. Blocks: Format, Multiplier, Aligner, Carry Save Adder, LZA, Adder, Exp Rounder, Norm. & Round, Result Multiplexer, FPSCR*. Latency annotations on the blocks: save 10FO4, 6FO4, 10FO4, 4FO4, 4FO4, 15FO4, 20FO4]




Impact of Physical Design -- Example 

• Latch insertion delay is 2 to 3fo4, consuming about 25% of the cycle 

– Special latches in CELL to reduce / hide latch delay 

– Customize the logic to make good use of these latches 

• “Mux-latch” 

– Wide mux (up to 7 ports) integrated on input side of the latch 

– Mux delay mostly hidden by latch insertion delay 

– Easy to apply to blocks like aligner, normalizer, operand & result muxes 

– After optimizations also useful for adder, LZA, exponent logic 

• Pulsed latch with integrated AND function 

– 1fo4 transparent window enables limited cycle stealing 

– Very helpful for balancing the maximum path length of the pipe stages: 3% difference 

Improved SPfpu latency by about 12fo4 




Fraction Adder 

Compute sum / abs. difference 

r = x + y          if !effsub
r = x + !y + 1     if effsub, eac=1
r = !(x + !y)      if effsub, eac=0

where eac = cout(x + !y) 
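The three cases above are the usual end-around-carry scheme for a sum or absolute difference. The sketch below models them on unsigned integers. The function name `fraction_add` and the default width of 48 bits are illustrative assumptions, not taken from the hardware.

```python
def fraction_add(x: int, y: int, effsub: bool, width: int = 48) -> int:
    """Return x + y, or |x - y| for an effective subtraction.

    x and y are unsigned fractions of `width` bits (hypothetical model).
    """
    if not effsub:
        return x + y                      # true addition: r = x + y

    mask = (1 << width) - 1
    not_y = ~y & mask                     # 1's complement of the addend
    total = x + not_y                     # x + !y, including the carry-out
    eac = total >> width                  # end-around carry: cout(x + !y)
    sum0 = total & mask                   # x + !y
    sum1 = (sum0 + 1) & mask              # x + !y + 1

    if eac:
        return sum1                       # x > y:  r = x - y
    return ~sum0 & mask                   # x <= y: r = y - x (re-complement)


print(fraction_add(9, 5, effsub=True))   # x > y
print(fraction_add(5, 9, effsub=True))   # x < y
```

Computing `sum0` and `sum1` together and choosing late by the carry-out mirrors what a compound adder provides; the software model does not capture the timing benefit.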

[Figure: fraction adder datapath. Labels: Product; Fraction from Aligner, I(0:24) and (25:72); 3:2 Carry Save Adder (48b) with carry and sum outputs; Sticky field (0:23), OR; 24b INC; Compound Adder; Carry Network; control; subtract; XOR on I0 and I1; sum0, sum1, cout; high and low parts]

• Aligner provides 1’s complement of addend y in case of subtraction 

• Swap re-complement & selection of INC result 

Merging 3 mux stages 

• Selection between sum0 & sum1 

• Optional re-complement of the result 

• First normalizer stage 

6-port mux-latch hides mux delay 

[Figure: two mux latches producing add_result(0:24) and add_result(25:47)]

Mux select signals (based on cout) are timing critical 

• Extra carry-tree to speed up adder carry-out 

• Compute 2 sets of selects and choose based on cout 




Normalizer & Rounder 

Rounding modes 

• Graphics applications optimized for round towards zero / truncation rounding 

• Truncation is simplest of the 4 IEEE rounding modes 

• Allows for simpler and faster rounder hardware 

[Figure: normalizer and rounder datapath. Labels: exp-lz, lzaerr, normalizer, unf, ovf, 25b INC]

SPfpu supports only truncation rounding 

• Fraction rounder turns into a result mux 

– Saves about 15fo4 on the fraction path 

• Exponent rounding becomes timing critical! 

– Full-blown rounder: latency hidden by normalizer & rounder 

– Truncation: normalizer faster than exponent rounding 

⇒ Speed up exponent rounding 

• SP operations with other rounding modes can be emulated by DPfpu: convert, DP op, convert 
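To illustrate why truncation is the cheapest rounding mode, and what the "convert, DP op, convert" emulation looks like in software, here is a small sketch. The helper names are hypothetical, the truncating conversion is only valid for values in the normal single-precision range, and the code models the arithmetic only, not the DPfpu instruction sequence.

```python
import struct


def to_single_nearest(x: float) -> float:
    """Convert a double to single precision with round-to-nearest-even."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_single_truncate(x: float) -> float:
    """Convert a double to single precision by truncation (round toward zero).

    Assumes x lies in the normal single-precision range.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << 29) - 1)   # drop the 52 - 23 = 29 extra fraction bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def sp_add_via_dp(a: float, b: float) -> float:
    """Emulated SP add with round-to-nearest: convert, DP op, convert."""
    return to_single_nearest(to_single_nearest(a) + to_single_nearest(b))


x = 1.0 + 2.0**-24 + 2.0**-25          # 1 plus 0.75 of a single-precision ulp
print(to_single_truncate(x))           # truncation simply drops the low bits
print(to_single_nearest(x))            # nearest needs an increment here
```

Truncation reduces to masking bits, so no fraction increment is needed; round-to-nearest may have to add one unit in the last place, which is the work a full rounder performs.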




Exponent Rounding -- Optimizations 

• SPfpu supports only non-trapped exception handling and truncation rounding 

– No exponent wrapping on UNF and OVF 

– No post-normalization 

• Exponent round logic 

– Subtract # leading zeros, force constants on UNF, OVF 

er = e + 1 + !lz   if lzaerr=0
er = e + 2 + !lz   if lzaerr=1

OVF: er > emax
UNF: er < emin, i.e. sign(er - 1) 

• Pre-compute 3 copies of exponent “e” 

• Fast UNF check 

– Compute copies for lzaerr=0 / 1 and select 

UNF1 = sign(e1 + !lz -1) = sign(e + !lz) 

UNF2 = sign(e2 + !lz -1) = sign(e1 + !lz) 
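The exponent-rounding identities above can be checked with a small bit-level model. This is a sketch under stated assumptions: a 10-bit two's-complement datapath (the width shown in the figure labels), emin = 1 (implied by "er < emin, i.e. sign(er-1)"), and hypothetical helper names.

```python
WIDTH = 10                      # assumed datapath width
MASK = (1 << WIDTH) - 1


def inv(v: int) -> int:
    """Bitwise complement within WIDTH bits (the slide's '!')."""
    return ~v & MASK


def sign(v: int) -> int:
    """Sign bit of a WIDTH-bit two's-complement value."""
    return (v >> (WIDTH - 1)) & 1


def to_signed(v: int) -> int:
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return v - (1 << WIDTH) if sign(v) else v


def exponent_round(e: int, lz: int, lzaerr: int):
    """Return (er, unf) using the pre-computed copies e, e1 = e+1, e2 = e+2."""
    e1 = (e + 1) & MASK
    e2 = (e + 2) & MASK
    er0 = (e1 + inv(lz)) & MASK          # e + 1 + !lz = e - lz
    er1 = (e2 + inv(lz)) & MASK          # e + 2 + !lz = e - lz + 1
    unf0 = sign((e + inv(lz)) & MASK)    # UNF1 = sign(e + !lz)
    unf1 = sign((e1 + inv(lz)) & MASK)   # UNF2 = sign(e1 + !lz)
    # Both candidates are ready before lzaerr is known; lzaerr only selects.
    return (er1, unf1) if lzaerr else (er0, unf0)


er, unf = exponent_round(e=5, lz=7, lzaerr=0)
print(to_signed(er), unf)
```

Because the underflow flag is derived from `e` and `e1` directly, it does not have to wait for `er` to be computed and then decremented, which is the point of the fast UNF check.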

[Figure: ERND exponent rounder datapath. Labels: ex(0:7), ey(0:7), ez(2:9), ez(0:1); cry, 0; 3:2 compressor (8b); 3-way compound adder (10b) with outputs sum+2, sum+1, sum (e2, e1, e); add (10b) for er1; add (10b) for er0; !lz; sign (10b); selects, sel; result mux]




Summary 

• SPfpu: high-frequency, low-latency, power & area efficient FPU design 

– 5.5 cycle FMA at an 11fo4 cycle time 

– About 450K transistors in 1.3 mm², fabricated with IBM 90nm SOI 

– Correct operation observed up to 5.6 GHz at 1.4V 

– Supporting a peak performance of 44.8 GFlops / SPE 

• Key enablers 

– Architecture & implementation optimized for target applications 

– Co-design of architecture, logic, circuit, and floorplan 

– Pipeline stages are fully balanced: 3% max path delay difference 

– Intensive clock gating (wave, opcode & data dependent)
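The peak-performance figure in the summary is consistent with the other numbers on the slides. The sketch below assumes one 4-way single-precision FMA issued per cycle, counted as two floating-point operations per lane; that reading is an assumption, and the variable names are illustrative.

```python
# Peak single-precision throughput per SPE, from figures quoted on the slides.
VECTOR_LANES = 4        # 4 x 32b single-precision lanes in a 128b vector
FLOPS_PER_FMA = 2       # assumption: a fused multiply-add counts as 2 operations
CLOCK_GHZ = 5.6         # highest frequency with correct operation observed

peak_gflops = VECTOR_LANES * FLOPS_PER_FMA * CLOCK_GHZ
print(f"Peak performance: {peak_gflops:.1f} GFlops / SPE")
```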
