The Vector Floating-Point Unit in a Synergistic Processor Element of a CELL Processor

Silvia M. Mueller, Ch. Jacobi (IBM Boeblingen)
H-J. Oh, K.D. Tran, S.R. Cottier, B.W. Michael, S.H. Dhong (IBM Austin)
H. Nishikawa (IBM Yasu)
Y. Totsuka (SCEI / SONY)
T. Namatame, N. Yano, T. Machida (Toshiba)

ARITH-17, June 2005


Vector FPU of a Synergistic Processing Element of a CELL Processor

CELL Processor: “Supercomputer on a Chip”

Synergistic Processing Element SPE
• Provides computational power of CELL processor
• Optimized for compute-intensive and broadband rich media applications such as real-time 3D graphics, media, DSP, …

• Power & area efficiency are key enablers for the multi-core design

FPU of the SPE
• Vector FPU operating on 128b data
  – Single precision: 4x32b
  – Double precision: 2x64b
• Outstanding SP performance
  – High frequency, low latency
  – SP in main data-flow bit stack
  – DP on the side

[Figure: SPE, with SPfpu & DPfpu labeled]

Silvia M. Mueller | June 2005 | ARITH-17

© 2005 IBM Corporation



SP-FPU of the SPE

Performance
• 5.5-cycle FMA core @ 11fo4 → 60fo4 total
• Latency of conventional FMA pipes
  – DP: 110 to 120fo4, SP: about 100fo4
→ Need to save 40fo4
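The latency budget above is simple arithmetic on the quoted figures. A minimal sketch, using only the numbers stated on the slide (the variable names are illustrative, not from the talk):

```python
# Figures quoted on the slide (FO4 = fan-out-of-4 inverter delays).
FMA_CYCLES = 5.5            # depth of the FMA core in cycles
CYCLE_TIME_FO4 = 11         # cycle time in FO4
CONVENTIONAL_SP_FO4 = 100   # "about 100fo4" for a conventional SP FMA pipe

# Total latency of the FMA core in FO4.
core_latency_fo4 = FMA_CYCLES * CYCLE_TIME_FO4

# Latency that has to be removed relative to a conventional SP pipe.
required_saving_fo4 = CONVENTIONAL_SP_FO4 - core_latency_fo4

print(core_latency_fo4)     # 60.5, quoted as 60fo4 on the slide
print(required_saving_fo4)  # 39.5, quoted as 40fo4 on the slide
```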

Concepts
• Co-design all levels of the design:
  – architecture, logic, circuit, layout, floorplan
• Make common case fast
  – Extra cycle for Integer multiply & converts
  – FMA-type ops, normal range, truncation
  – Simpler rounder saves 15fo4
• Avoid design tricks with large area overhead
• Multi-level clock gating:
  – wave, opcode, data dependent

[Figure: SP-FPU block diagram, divided into frontend and FMA core. Blocks: Format, Multiplier, Aligner, Carry Save Adder, LZA, Adder, Exp Rounder, Norm. & Round, Result Multiplexer, FPSCR*. Latency-saving annotations attached to the blocks, in extraction order: 10, 6, 10, 4, 4, 15 and 20 FO4.]


Impact of Physical Design -- Example
• Latch insertion delay is 2 to 3fo4, consuming about 25% of the cycle
  – Special latches in CELL to reduce / hide latch delay
  – Customize the logic to make good use of these latches
• “Mux-latch”
  – Wide mux (up to 7 ports) integrated on input side of the latch
  – Mux delay mostly hidden by latch insertion delay
  – Easy to apply to blocks like aligner, normalizer, operand & result muxes
  – After optimizations also useful for adder, LZA, exponent logic
• Pulsed latch with integrated AND function
  – 1fo4 transparent window enables limited cycle stealing
  – Very helpful for balancing max-path-length of pipe stages → 3% difference
→ Improved SPfpu latency by about 12fo4


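Fraction Adder

Compute sum / abs. difference

$$
r = \begin{cases}
x + y & \text{if } \lnot\mathit{effsub} \\
x + \overline{y} + 1 & \text{if } \mathit{effsub},\ \mathit{eac} = 1 \\
\overline{x + \overline{y}} & \text{if } \mathit{effsub},\ \mathit{eac} = 0
\end{cases}
\qquad \mathit{eac} = \mathrm{cout}(x + \overline{y})
$$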
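A minimal sketch of the three cases above on unsigned integers, to show that the end-around carry (eac) selects between x − y and y − x so the result is always the absolute difference. The 48-bit width and the function name `fraction_add` are assumptions for illustration; the real datapath works on the carry-save outputs rather than on two plain operands.

```python
WIDTH = 48                      # assumed operand width for this sketch
MASK = (1 << WIDTH) - 1

def fraction_add(x: int, y: int, effsub: bool) -> int:
    """Return x + y, or |x - y| for an effective subtraction."""
    if not effsub:
        return x + y                        # plain sum
    not_y = ~y & MASK                       # one's complement of the addend
    s = x + not_y
    eac = s >> WIDTH                        # carry out of x + !y
    if eac:
        return (s + 1) & MASK               # x > y: x + !y + 1 = x - y
    return ~s & MASK                        # x <= y: !(x + !y) = y - x

print(fraction_add(5, 3, effsub=True))      # 2
print(fraction_add(3, 5, effsub=True))      # 2
```

The point of the formulation is that no second subtraction is needed when the operands turn out to be in the "wrong" order: the same adder output is either incremented or inverted.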

[Figure: fraction adder datapath. Labels: Product; Fraction from Aligner, split into I(0:24) and (25:72); 3:2 Carry Save Adder (48b) with carry and sum outputs; Sticky field (0:23) with OR; INC (24b); Compound Adder with sum0, sum1 and cout; Carry Network; XOR stages driven by "subtract", with I0 and I1; control: swap re-complement & selection of INC result; high / low result halves.]

Aligner provides 1’s complement of addend y in case of subtraction

Merging 3 mux stages
• Selection between sum0 & sum1
• Optional re-complement of the result
• First normalizer stage
→ 6-port mux-latch hides mux delay

[Figure: two mux latches, producing add_result(0:24) and add_result(25:47)]

Mux select signals (based on cout) are timing critical
• Extra carry-tree to speed-up adder carry-out
• Compute 2 sets of selects and choose based on cout
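The idea of preparing both candidates and choosing late, once cout is known, can be sketched as a generic carry-select addition. This is not the circuit from the talk: the split into a 48-bit low part and a 25-bit high part is only borrowed from the figure labels, and `split_add` is a hypothetical name.

```python
LOW_BITS = 48                   # assumed width of the low (adder) part
HIGH_BITS = 25                  # assumed width of the high (incrementer) part
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1

def split_add(addend: int, product: int) -> int:
    """Add a LOW_BITS-wide product to a wider addend, carry-select style."""
    low = addend & LOW_MASK
    high = (addend >> LOW_BITS) & HIGH_MASK

    # Both high-part candidates are prepared before the carry is known.
    high_if_cout0 = high
    high_if_cout1 = (high + 1) & HIGH_MASK   # incrementer (INC) result

    low_sum = low + (product & LOW_MASK)
    cout = low_sum >> LOW_BITS               # carry out of the low part

    # Late selection: cout only steers a mux, it does not ripple upward.
    high_result = high_if_cout1 if cout else high_if_cout0
    return (high_result << LOW_BITS) | (low_sum & LOW_MASK)
```

Because cout only drives a select input, the time it takes to compute cout is what matters, which is consistent with the slide's note about an extra carry tree for the adder carry-out.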


Normalizer & Rounder

Rounding modes
• Graphics applications optimized for round towards zero / truncation rounding
• Truncation is simplest of the 4 IEEE rounding modes
• Allows for simpler and faster rounder hardware
→ SPfpu supports only truncation rounding
• Fraction rounder turns into a result mux
  – Saves about 15fo4 on the fraction path
• Exponent rounding becomes timing critical!
  – Full-blown rounder: latency hidden by normalizer & rounder
  – Truncation: normalizer faster than exponent rounding
  → Speed up exponent rounding
• SP operations with other rounding modes can be emulated by DPfpu: convert, DP op, convert

[Figure: normalizer and rounder. Labels: exp-lz, lzaerr, normalizer, unf, ovf, 25b INC.]
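To make the difference between truncation and round-to-nearest concrete, here is a small software sketch that converts a Python float (double precision) to single precision with round-towards-zero. It only illustrates the rounding mode; it says nothing about how the hardware is built, and the helper names are made up for this example. It assumes finite inputs within the single-precision range.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to single precision (round to nearest)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits: int) -> float:
    """Single-precision value for a 32-bit pattern, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_f32_truncate(x: float) -> float:
    """Convert x to single precision, rounding towards zero."""
    nearest = bits_f32(f32_bits(x))
    if abs(nearest) > abs(x):
        # Round-to-nearest moved away from zero: step back by one ulp.
        # Decrementing the bit pattern shrinks the magnitude, sign unchanged.
        nearest = bits_f32(f32_bits(nearest) - 1)
    return nearest

x = 1 + 2**-24 + 2**-30          # just above the midpoint between two floats
print(bits_f32(f32_bits(x)))     # round to nearest: 1.0000001192092896
print(to_f32_truncate(x))        # truncation: 1.0
```

With truncation the extra fraction bits are simply dropped, so no increment of the fraction is needed, which is why the slide describes the fraction rounder as turning into a result mux.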


Exponent Rounding -- Optimizations
• SPfpu supports only non-trapped exception handling and truncation rounding
  → No exponent wrapping on UNF and OVF
  → No post-normalization
• Exponent round logic
  – Subtract # leading zeros, forces constants on UNF, OVF

$$
e_r = \begin{cases}
e + 1 + \overline{lz} & \text{if } \mathit{lzaerr} = 0 \\
e + 2 + \overline{lz} & \text{if } \mathit{lzaerr} = 1
\end{cases}
$$

  – OVF: $e_r > e_{\max}$
  – UNF: $e_r < e_{\min}$, i.e. $\mathrm{sign}(e_r - 1)$
• Pre-compute 3 copies of exponent “e”
• Fast UNF check
  – Compute copies for lzaerr = 0 / 1 and select

$$
\begin{aligned}
\mathit{UNF}_1 &= \mathrm{sign}(e_1 + \overline{lz} - 1) = \mathrm{sign}(e + \overline{lz}) \\
\mathit{UNF}_2 &= \mathrm{sign}(e_2 + \overline{lz} - 1) = \mathrm{sign}(e_1 + \overline{lz})
\end{aligned}
$$
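The identities above can be checked with a short two's-complement model. This is a sketch under stated assumptions: a 10-bit datapath (taken from the "(10b)" figure labels), e1 = e + 1 and e2 = e + 2 as the pre-computed copies, and an underflow condition of "rounded exponent below 1". The function names are illustrative.

```python
WIDTH = 10                       # assumed exponent datapath width
MASK = (1 << WIDTH) - 1

def sign(v: int) -> int:
    """Sign bit of a WIDTH-bit two's-complement value."""
    return (v >> (WIDTH - 1)) & 1

def to_signed(v: int) -> int:
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return v - (1 << WIDTH) if sign(v) else v

def exponent_round(e: int, lz: int, lzaerr: int):
    """Return (rounded exponent, underflow flag) for the selected lzaerr case."""
    not_lz = ~lz & MASK                      # one's complement of the leading-zero count
    e1, e2 = (e + 1) & MASK, (e + 2) & MASK  # pre-computed exponent copies

    er0 = (e1 + not_lz) & MASK               # e + 1 + !lz = e - lz
    er1 = (e2 + not_lz) & MASK               # e + 2 + !lz = e - lz + 1

    # The "- 1" of sign(er - 1) is absorbed by using the next-lower copy.
    unf0 = sign((e + not_lz) & MASK)         # sign(e1 + !lz - 1)
    unf1 = sign((e1 + not_lz) & MASK)        # sign(e2 + !lz - 1)

    # Both cases are computed; lzaerr only selects.
    return (to_signed(er1), unf1) if lzaerr else (to_signed(er0), unf0)

print(exponent_round(e=130, lz=3, lzaerr=0))   # (127, 0)
print(exponent_round(e=2, lz=5, lzaerr=1))     # (-2, 1)
```

Using the lower exponent copy for the underflow check removes the extra decrement from the critical path, which matches the slide's point about pre-computing three copies of e.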

[Figure: exponent rounder (ERND). Labels: inputs ex(0:7), ey(0:7), ez(2:9), ez(0:1) and "cry, 0"; 3:2 compressor (8b); 3-way compound adder (10b) with outputs sum+2, sum+1, sum, used as e2, e1, e; two 10b adders with !lz giving er1 and er0; 10b sign logic; selects (sel) for the result mux.]


Summary
• SPfpu: high-frequency, low-latency, power & area efficient FPU design
  – 5.5 cycle FMA at an 11fo4 cycle time
  – About 450K transistors in 1.3 mm², fabricated with IBM 90nm SOI
  – Correct operation observed up to 5.6 GHz at 1.4V
  – Supporting a peak performance of 44.8 GFlops / SPE
• Key enablers
  – Architecture & implementation optimized for target applications
  – Co-design of architecture, logic, circuit, and floorplan
  – Pipeline stages are fully balanced: 3% max path delay difference
  – Intensive clock gating (wave, opcode & data dependent)
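The peak-performance figure is consistent with the other numbers in the talk under two common assumptions that the slides do not spell out: one FMA issued per lane per cycle, and each FMA counted as two floating-point operations. A quick check:

```python
SP_LANES = 4          # single precision: 4x32b per 128b vector
FLOPS_PER_FMA = 2     # assumption: multiply + add counted separately
CLOCK_GHZ = 5.6       # highest frequency with correct operation (at 1.4V)

# Assumption: one vector FMA can be issued every cycle.
peak_gflops = SP_LANES * FLOPS_PER_FMA * CLOCK_GHZ

print(round(peak_gflops, 1))   # 44.8
```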
