The Vector Floating-Point Unit in a Synergistic Processor Element of a CELL Processor
Silvia M. Mueller, Ch. Jacobi (IBM Boeblingen)
H-J. Oh, K.D. Tran, S.R. Cottier, B.W. Michael, S.H. Dhong (IBM Austin)
H. Nishikawa (IBM Yasu)
Y. Totsuka (SCEI / SONY)
T. Namatame, N. Yano, T. Machida (Toshiba)
ARITH-17, June 2005
Vector FPU of a Synergistic Processing Element of a CELL Processor
CELL Processor: “Supercomputer on a Chip”
Synergistic Processing Element (SPE)
• Provides the computational power of the CELL processor
• Optimized for compute-intensive and broadband rich-media applications such as real-time 3D graphics, media, DSP, …
• Power & area efficiency are key enablers for the multi-core design
FPU of the SPE (SPfpu & DPfpu)
• Vector FPU operating on 128b data
– Single precision: 4x32b
– Double precision: 2x64b
• Outstanding SP performance
– High frequency, low latency
– SP in main data-flow bit stack
– DP on the side
Silvia M. Mueller | June 2005 | ARITH-17
© 2005 IBM Corporation
SP-FPU of the SPE
Performance
• 5.5-cycle FMA core at an 11fo4 cycle time: about 60fo4 total
• Latency of conventional FMA pipes: DP 110 to 120fo4, SP about 100fo4
• Need to save 40fo4
Concepts
• Co-design all levels of the design:
– architecture, logic, circuit, layout, floorplan
• Make the common case fast
– Common case: FMA-type ops, normal range, truncation
– Extra cycle for integer multiply & converts
– Simpler rounder saves 15fo4
• Avoid design tricks with large area overhead
• Multi-level clock gating:
– wave, opcode, data dependent
Figure: SPfpu data-flow diagram, from Operands to Result. Blocks: Format, Multiplier, Aligner, Carry Save Adder, LZA, Adder, Exp Rounder, Norm. & Round, Result Multiplexer, FPSCR*; regions labeled “frontend” and “FMA core”. Savings annotations on the blocks, in extraction order: 10, 6, 10, 4, 4, 15 and 20 FO4.
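The latency budget quoted on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers (5.5 cycles, 11fo4 cycle time, about 100fo4 for a conventional SP FMA pipe); the constant and function names are illustrative and not taken from the presentation.

```python
# Figures quoted on the slide; names are illustrative only.
CYCLE_FO4 = 11.0             # cycle time in FO4
FMA_CYCLES = 5.5             # depth of the FMA core in cycles
CONVENTIONAL_SP_FO4 = 100.0  # "about 100fo4" for a conventional SP FMA pipe


def fma_latency_fo4(cycles=FMA_CYCLES, cycle_fo4=CYCLE_FO4):
    """Total FMA core latency in FO4."""
    return cycles * cycle_fo4


def required_saving_fo4(conventional=CONVENTIONAL_SP_FO4):
    """FO4 that must be removed relative to a conventional SP pipe."""
    return conventional - fma_latency_fo4()


if __name__ == "__main__":
    print(fma_latency_fo4())      # 60.5, the slide rounds this to 60fo4
    print(required_saving_fo4())  # 39.5, the slide rounds this to 40fo4
```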
Impact of Physical Design -- Example
• Latch insertion delay is 2 to 3fo4, consuming about 25% of the cycle
– Special latches in CELL to reduce / hide latch delay
– Customize the logic to make good use of these latches
• “Mux-latch”
– Wide mux (up to 7 ports) integrated on the input side of the latch
– Mux delay mostly hidden by latch insertion delay
– Easy to apply to blocks like aligner, normalizer, operand & result muxes
– After optimizations also useful for adder, LZA, exponent logic
• Pulsed latch with integrated AND function
– 1fo4 transparent window enables limited cycle stealing
– Very helpful for balancing the max path length of pipe stages: 3% difference
• Result: improved SPfpu latency by about 12fo4
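To make the latch numbers concrete, here is a minimal sketch of the two effects described above: the fraction of an 11fo4 cycle consumed by latch insertion, and a deliberately simplified cycle-stealing check in which a stage may overrun its cycle by at most the 1fo4 transparent window and the next stage starts that much later. The timing model, the function names and the rule that the last stage must not borrow are assumptions made for illustration, not the design team's actual timing methodology.

```python
def latch_overhead_fraction(latch_fo4, cycle_fo4=11.0):
    """Share of the cycle spent on latch insertion delay."""
    return latch_fo4 / cycle_fo4


def meets_timing(stage_delays_fo4, cycle_fo4=11.0, window_fo4=1.0):
    """Simplified time-borrowing check (assumed model).

    Each stage may finish up to window_fo4 after its nominal cycle
    boundary; the overrun is carried into the following stage.
    The final stage is not allowed to borrow.
    """
    borrowed = 0.0
    for delay in stage_delays_fo4:
        finish = borrowed + delay
        if finish > cycle_fo4 + window_fo4:
            return False
        borrowed = max(0.0, finish - cycle_fo4)
    return borrowed == 0.0


if __name__ == "__main__":
    # 2 to 3 FO4 of latch delay in an 11 FO4 cycle
    print(round(latch_overhead_fraction(2.0), 2), round(latch_overhead_fraction(3.0), 2))
    # A slightly long first stage is absorbed by a shorter second stage
    print(meets_timing([11.5, 10.5, 11.0]))
```

With `window_fo4=0.0` the same stage delays fail the check, which is the sense in which a small transparent window helps balance stages.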
Fraction Adder
Compute sum / absolute difference:
r = x + y          if !effsub
r = x + !y + 1     if effsub, eac = 1
r = !(x + !y)      if effsub, eac = 0
where eac = cout(x + !y)
Figure: fraction adder data flow. Inputs: Product and Fraction from Aligner, the latter labeled I(0:24) and (25:72). Blocks and labels: 3:2 Carry Save Adder (48b) with carry and sum outputs, sticky field (0:23) with OR, 24b INC, Compound Adder with Carry Network, outputs labeled high, low, sum0, sum1 and cout, XORs on I0 and I1 driven by subtract, and two mux latches producing add_result(0:24) and add_result(25:47). The control block is annotated “Swap re-complement & selection of INC result”.
• Aligner provides 1’s complement of addend y in case of subtraction
• Merging 3 mux stages
– Selection between sum0 & sum1
– Optional re-complement of the result
– First normalizer stage
• 6-port mux-latch hides mux delay
• Mux select signals (based on cout) are timing critical
– Extra carry-tree to speed up adder carry-out
– Compute 2 sets of selects and choose based on cout
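The sum / absolute-difference rule on this slide can be sketched in software as follows. The function mirrors the three cases for r, with eac taken as the carry-out of x + !y and both compound-adder results (sum0 and sum1) computed before one is selected. The function name, the Python integer model and the default 48-bit width are illustrative assumptions; the sketch ignores the INC path, the sticky bits and the merged normalizer stage.

```python
def fraction_add(x, y, effsub, width=48):
    """Sum or absolute difference of two unsigned width-bit fractions.

    Follows the slide's cases:
      r = x + y        if not effsub
      r = x + !y + 1   if effsub and eac == 1
      r = !(x + !y)    if effsub and eac == 0
    where eac = cout(x + !y).
    """
    mask = (1 << width) - 1
    if not effsub:
        return x + y

    not_y = ~y & mask            # 1's complement, as supplied by the aligner
    total = x + not_y
    eac = total >> width         # end-around carry: carry-out of x + !y
    sum0 = total & mask          # compound adder result x + !y
    sum1 = (total + 1) & mask    # compound adder result x + !y + 1

    # eac = 1 means x > y: take sum1. Otherwise re-complement sum0 to get y - x.
    return sum1 if eac else (~sum0 & mask)


if __name__ == "__main__":
    print(fraction_add(13, 5, effsub=True, width=6))   # 8
    print(fraction_add(5, 13, effsub=True, width=6))   # 8
    print(fraction_add(5, 13, effsub=False, width=6))  # 18
```

Because the result is always the magnitude |x - y|, no separate negation step with a second carry propagation is needed; the selection and the re-complement reduce to muxing and inversion.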
Normalizer & Rounder
Rounding modes
• Graphics applications are optimized for round-towards-zero (truncation) rounding
• Truncation is the simplest of the 4 IEEE rounding modes
• Allows for simpler and faster rounder hardware
SPfpu supports only truncation rounding
• Fraction rounder turns into a result mux
– Saves about 15fo4 on the fraction path
• Exponent rounding becomes timing critical!
– Full-blown rounder: latency hidden by normalizer & rounder
– Truncation: normalizer faster than exponent rounding
– Consequence: speed up exponent rounding
• SP operations with other rounding modes can be emulated by the DPfpu: convert, DP op, convert
Figure: normalizer and rounder block diagram (labels: exp-lz, lzaerr, normalizer, unf, ovf, 25b INC).
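The following sketch contrasts truncation with round-to-nearest-even on an integer significand, to show why the truncating fraction rounder degenerates into plain bit selection. It is a generic illustration of the two rounding rules, not a model of the SPfpu hardware; the function names and the convention of dropping `drop` low-order bits are assumptions.

```python
def truncate(sig, drop):
    """Round toward zero: discard the low `drop` bits. No adder needed."""
    return sig >> drop


def round_nearest_even(sig, drop):
    """Round to nearest, ties to even. Assumes drop >= 1.

    Needs an incrementer, and the increment can carry out of the kept
    bits, which then requires a post-normalization shift.
    """
    kept = sig >> drop
    remainder = sig & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
    return kept


if __name__ == "__main__":
    sig = 0b111111                      # 6-bit value, keep the top 4 bits
    print(truncate(sig, 2))             # 15: still fits in 4 bits
    print(round_nearest_even(sig, 2))   # 16: carried out of 4 bits
```

In the truncating case the kept bits never change, so there is no carry-out and no post-normalization step.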
Exponent Rounding -- Optimizations
• SPfpu supports only non-trapped exception handling and truncation rounding
– No exponent wrapping on UNF and OVF
– No post-normalization
• Exponent round logic
– Subtracts the number of leading zeros, forces constants on UNF, OVF
er = e + 1 + !lz   if lzaerr = 0
er = e + 2 + !lz   if lzaerr = 1
OVF: er > emax
UNF: er < emin, i.e. sign(er - 1)
• Pre-compute 3 copies of exponent “e”: e, e1, e2
• Fast UNF check
– Compute copies for lzaerr = 0 / 1 and select
UNF1 = sign(e1 + !lz - 1) = sign(e + !lz)
UNF2 = sign(e2 + !lz - 1) = sign(e1 + !lz)
Figure: exponent rounder (ERND) data flow. Inputs ex(0:7), ey(0:7), ez(2:9) and ez(0:1), with “cry, 0”. Blocks: 3:2 compressor (8b), 3-way compound adder (10b) producing sum, sum+1 and sum+2 as e, e1 and e2, two 10b adders producing er0 and er1 from !lz, a 10b sign computation, and a result mux driven by the selects (sel).
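The exponent-rounding formulas and the fast underflow check can be sanity-checked with a small bit-level model. The sketch below assumes e1 = e + 1 and e2 = e + 2 (the sum+1 and sum+2 outputs of the compound adder), a 10-bit two's-complement datapath, and emin = 1 (inferred from the slide's "sign(er - 1)"). The helper names are illustrative. Since !lz + 1 equals -lz in two's complement, er reduces to e - lz + lzaerr.

```python
WIDTH = 10                    # exponent datapath width shown on the slide
MASK = (1 << WIDTH) - 1


def inv(v):
    """Bitwise complement within WIDTH bits (the slide's '!')."""
    return ~v & MASK


def sign(v):
    """Sign bit of a WIDTH-bit two's-complement value."""
    return (v >> (WIDTH - 1)) & 1


def to_signed(v):
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return v - (1 << WIDTH) if sign(v) else v


def exponent_round(e, lz, lzaerr):
    """Return (er, unf) using pre-computed copies e, e1, e2 of the exponent."""
    e1 = (e + 1) & MASK
    e2 = (e + 2) & MASK
    er0 = (e1 + inv(lz)) & MASK         # e + 1 + !lz, used if lzaerr = 0
    er1 = (e2 + inv(lz)) & MASK         # e + 2 + !lz, used if lzaerr = 1
    unf1 = sign((e + inv(lz)) & MASK)   # sign(e1 + !lz - 1)
    unf2 = sign((e1 + inv(lz)) & MASK)  # sign(e2 + !lz - 1)
    return (er1, unf2) if lzaerr else (er0, unf1)


if __name__ == "__main__":
    er, unf = exponent_round(e=130, lz=3, lzaerr=0)
    print(to_signed(er), unf)   # 127 0
    er, unf = exponent_round(e=2, lz=5, lzaerr=1)
    print(to_signed(er), unf)   # -2 1
```

Both underflow candidates are formed directly from the pre-computed copies, in parallel with er0 and er1, so the late-arriving lzaerr signal only has to drive the final select.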
Summary
• SPfpu: high-frequency, low-latency, power- and area-efficient FPU design
– 5.5-cycle FMA at an 11fo4 cycle time
– About 450K transistors in 1.3mm², fabricated in IBM 90nm SOI
– Correct operation observed up to 5.6 GHz at 1.4V
– Supporting a peak performance of 44.8 GFlops / SPE
• Key enablers
– Architecture & implementation optimized for target applications
– Co-design of architecture, logic, circuit, and floorplan
– Pipeline stages are fully balanced: 3% max path delay difference
– Intensive clock gating (wave, opcode & data dependent)
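One way to arrive at the quoted peak figure, assuming one 4-wide single-precision FMA completes per cycle and each FMA counts as two floating-point operations per lane (the slides give the 4x32b vector width and the 5.6 GHz frequency; the flop-counting convention is an assumption here):

```python
def peak_gflops(freq_ghz=5.6, sp_lanes=4, flops_per_fma=2):
    """Peak single-precision throughput in GFlops for one SPE."""
    return freq_ghz * sp_lanes * flops_per_fma


if __name__ == "__main__":
    print(round(peak_gflops(), 1))  # 44.8
```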