Permedia and GLINT Delta: New Generation Silicon for - Hot Chips
Permedia and GLINT Delta: New Generation Silicon for - Hot Chips
Permedia and GLINT Delta: New Generation Silicon for - Hot Chips
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 1<br />
ERMEDIA <strong>and</strong> <strong>GLINT</strong> <strong>Delta</strong><br />
PERMEDIA<br />
<strong>New</strong> <strong>Generation</strong> <strong>Silicon</strong> <strong>for</strong> 3D Graphics<br />
Neil Trevett<br />
Vice President Marketing<br />
(408) 436 3455<br />
www.3dlabs.com
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 2<br />
Professional<br />
Games Pervasive<br />
3Dlabs <strong>New</strong> <strong>Generation</strong> <strong>Silicon</strong><br />
<strong>GLINT</strong><br />
300SX<br />
3D Blaster<br />
Chip<br />
The first single chip<br />
geometry pipeline<br />
processor<br />
<strong>GLINT</strong><br />
<strong>Delta</strong><br />
<strong>GLINT</strong><br />
500TX<br />
Q195 Q395 Q196<br />
PERMEDIA<br />
1st <strong>Generation</strong> 2nd <strong>Generation</strong><br />
Professional 3D -<br />
Strong competition<br />
to Workstations<br />
Pervasive 3D -<br />
Making 2D chips<br />
obsolete<br />
Q396
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 3<br />
Alternative Approaches...<br />
... to low-cost 3D silicon<br />
No Compromises<br />
Per<strong>for</strong>mance<br />
3Dlabs<br />
3D Per<strong>for</strong>mance?<br />
Matrox<br />
ATI<br />
S3<br />
Rendition<br />
2D <strong>and</strong> 3D<br />
Per<strong>for</strong>mance?<br />
Nvidia<br />
3Dfx<br />
VideoLogic<br />
Pervasive 3D FreeD Games 3D Arcade 3D<br />
No 2D?<br />
Cost
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 4<br />
Pervasive 3D - the Challenge<br />
No compromises!<br />
Video Acceleration<br />
3D <strong>for</strong> games 3D <strong>for</strong> authoring<br />
Pervasive 3D<br />
Fast Windows Acceleration<br />
3D <strong>for</strong> browsing<br />
Fast VGA <strong>for</strong> DOS games<br />
• 3D per<strong>for</strong>mance must be much greater than software only<br />
• Software = 5 million texture-mapped pixels per second<br />
• Hardware should deliver >25 million bilinear-filtered texturemapped<br />
pixels per second
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 5<br />
PERMEDIA ERMEDIA Design Targets<br />
• Robust 100% pixel functionality of all key 3D APIs<br />
• Direct 3D, OpenGL, Heidi, QuickDraw 3D, QuickDraw 3D RAVE<br />
• No 2D compromises<br />
• >30 Million Winmarks<br />
• Fast 3D per<strong>for</strong>mance<br />
• Balanced per<strong>for</strong>mance <strong>for</strong> both textures <strong>and</strong> polygons<br />
• 600,000 textured polygons/second<br />
• 30 Million bilinear filtered texture-mapped pixels /second<br />
• Low cost<br />
• Selling on boards costing
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 6<br />
PERMEDIA ERMEDIA Architecture<br />
High per<strong>for</strong>mance<br />
bus interface<br />
Unified graphics<br />
core <strong>for</strong><br />
accelerated<br />
3D <strong>and</strong> 2D<br />
VGA<br />
Graphics Core<br />
PCI Memory<br />
Interface<br />
RamDac<br />
Interface<br />
Bypass<br />
Bypass <strong>for</strong> direct rendering <strong>and</strong> control registers<br />
Backward compatibility<br />
<strong>for</strong> DOS games <strong>and</strong> boot<br />
Video<br />
High b<strong>and</strong>width<br />
integrated<br />
memory<br />
interface<br />
Drives<br />
RAMDAC
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 7<br />
PERMEDIA ERMEDIA Host Interface<br />
Avoids polling the FIFO<br />
Glueless PCI<br />
Interface<br />
Avoids byte-swapping<br />
PCI Bus<br />
Interface<br />
Rev 2.1<br />
Disconnect<br />
32-bit Slave<br />
32-bit Master<br />
Bi-Endian<br />
Support<br />
Provides setup-fetch<br />
overlap<br />
Memory<br />
Bypass<br />
41x32<br />
FIFO<br />
Interrupt<br />
Controller<br />
DMA<br />
Controller<br />
<strong>Delta</strong><br />
Interface<br />
Allows software drawing<br />
Provides fetchdraw<br />
overlap<br />
Provides upgrade path
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 8<br />
PERMEDIA ERMEDIA Memory Interface<br />
• SGRAM <strong>for</strong> next generation graphics<br />
• Good r<strong>and</strong>om access speed - vital <strong>for</strong> texture mapping<br />
• Block fills - very important <strong>for</strong> clearing buffers<br />
• Write-per-bit mask, needed <strong>for</strong> per-window double buffering<br />
• Upgrade path to 100 MHz <strong>and</strong> beyond<br />
• 2 to 8 MBytes<br />
• Up to 4 pages open at any time<br />
• E.g. front color buffer, back color buffer, depth buffer, texture buffer<br />
Core<br />
Texture<br />
Z-Buffer<br />
Stencil<br />
RGBA<br />
Bypass<br />
Memory<br />
I/F<br />
Unit<br />
64 bit<br />
BANK 0<br />
2 MBytes<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
BANK 1<br />
4 MBytes<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
BANK 2<br />
6 MBytes<br />
etc...
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 9<br />
Consolidated Memory<br />
All buffers in same physical memory<br />
• Efficient <strong>and</strong> Flexible<br />
• Dynamically allocate color <strong>and</strong> depth buffers<br />
• Any spare memory available <strong>for</strong> textures<br />
• Trade resolution <strong>for</strong> depth buffer, color depth <strong>for</strong> texture space etc.<br />
• All data in same memory, scope <strong>for</strong> optimization, e.g.<br />
• Clear depth buffer with framebuffer block fills<br />
• Use texture operations on any image<br />
• Full scene anti-aliasing<br />
• Video texture-mapping<br />
• 3D sprite processing<br />
Depth <strong>and</strong> stencil<br />
Texture<br />
Backbuffer<br />
Frontbuffer
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 10<br />
PERMEDIA ERMEDIA Pixel Core<br />
• Hyper-pipelined function units<br />
• Message passing protocol between units<br />
Rasterizer<br />
Scissor<br />
Stipple<br />
Color DDA Framebuffer<br />
Read<br />
Texture/Fog<br />
Blend<br />
PERMEDIA Core<br />
Localbuffer<br />
Read<br />
Localbuffer<br />
Write<br />
Dither Logic Op<br />
Stencil<br />
Depth<br />
YUV<br />
Framebuffer<br />
Write<br />
Texture<br />
Address<br />
Texture<br />
Read<br />
Host<br />
Out
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 11<br />
Pipeline Principles<br />
• Each unit in the pipeline is independent<br />
• Can be designed, tested <strong>and</strong> synthesized separately<br />
• Unit State Machine<br />
• Wait <strong>for</strong> message in input FIFO<br />
• If message is not relevant, pass to next unit<br />
• Else process message <strong>and</strong> pass on any messages as required<br />
• Return to waiting<br />
• Some units know their place<br />
• Some units are completely self contained<br />
• E.g. scissor/stipple unit<br />
• Some units know where they are in the pipeline<br />
• E.g. YUV can absorb localbuffer data if the chroma test fails
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 12<br />
Unit Pipeline Stage<br />
• The pipeline uses a message passing paradigm<br />
• A message is made up of a tag field <strong>and</strong> a data field<br />
• The tag identifies the message type<br />
Tag<br />
Data<br />
F<br />
I<br />
F<br />
O<br />
Tag<br />
Data<br />
Core Unit<br />
Tag<br />
Data<br />
F<br />
I<br />
F<br />
O<br />
Tag<br />
Data<br />
Two stage FIFO Pipelining as required<br />
Two stage FIFO<br />
Tag = 9 bits<br />
Data = 32 bits<br />
Input Stage<br />
Register Storage<br />
Processing<br />
Control<br />
Output Stage
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 13<br />
Message Passing<br />
• Everything that moves through the pipeline is a message<br />
• Messages are used to program control registers - e.g. enable texture<br />
• Messages are used to carry transient in<strong>for</strong>mation e.g. texture color <strong>for</strong><br />
current pixel<br />
• Messages are used as comm<strong>and</strong>s - e.g. start new primitive<br />
• Messages are used <strong>for</strong> synchronization - e.g. Sync message<br />
• ‘Step’ messages drive the units<br />
• A Step message <strong>for</strong> each pixel to be plotted<br />
• Passive steps are pixels not to be plotted<br />
• If a pixel fails a test it is converted from active to passive<br />
• Passive steps cannot be deleted because they advance DDA units<br />
• Step messages hold the pixel X,Y coordinate in the data field
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 14<br />
Unified 2D/3D Pixel Engine<br />
• 3D is a superset of 2D<br />
• Don’t separate them<br />
• 2D operations use 3D pipeline <strong>and</strong> use special features<br />
• Texture units used <strong>for</strong> tiled blits<br />
• Chroma key test used <strong>for</strong> transparent blits<br />
• Bilinear filter used <strong>for</strong> stretch blits<br />
• Using the 3D units is gate efficient<br />
• No duplication of functions<br />
• No compromise on per<strong>for</strong>mance
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 15<br />
PERMEDIA ERMEDIA Per<strong>for</strong>mance<br />
• 30 Mpixels/sec, 600K polygons/sec, textured, bilinear, no Z<br />
• With full per pixel perspective correction, 16-bit framebuffer, 4-bit palletized<br />
textures, 50 displayed pixels per polygon, meshed, 640x480 at 75Hz<br />
• 640x480 full screen bi-linear textured, x2.5 depth<br />
complexity = 40Hz frame rate<br />
• 2 GBytes/sec Fill rate using SGRAM block fill<br />
• 2 GBytes/sec Color expansion<br />
• > 30 Million Winmarks<br />
• Video Playback per<strong>for</strong>mance - 30fps<br />
• 320x200 YUV source zoomed <strong>and</strong> filtered to 640x480x16-bit RGB
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 16<br />
PERMEDIA ERMEDIA Physical Characteristics<br />
• Packaging<br />
• 256 pin BGA<br />
• Wire-bonded into a plastic BGA package<br />
• 3W at 3.3V<br />
• Process<br />
• 0.35μ, 4 layer metal<br />
• 60 MHz<br />
• Shipping now
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 17<br />
Board Design<br />
Low component count<br />
• Single PERMEDIA Chip<br />
• plus SGRAM, RAMDAC, ROM<br />
• External interfaces<br />
• Glueless PCI Interface<br />
• High per<strong>for</strong>mance 64-bit SGRAM Interface<br />
• High speed pixel port to RAMDAC<br />
RAM<br />
DAC<br />
OSC<br />
PERMEDIA<br />
ROM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
Typically 2MBytes,<br />
with optional<br />
upgrade to 4MBytes
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 18<br />
But where’s the bottleneck?<br />
Geometry!<br />
• The fastest Pentium Pro cannot keep PERMEDIA<br />
saturated if running the geometry in software<br />
1K polygons/MHz on a Pentium<br />
Class machine 3D API<br />
(90K polygons on a P5/90) Trans<strong>for</strong>ms<br />
Lighting<br />
<strong>Delta</strong> Calcs<br />
70% of the<br />
CPU cycles<br />
spent in setup!<br />
100% of<br />
Rasterization in<br />
PERMEDIA<br />
silicon<br />
Rasterization
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 19<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
Breaking the Geometry Bottleneck<br />
• Hardwired 3D Pipeline Processing<br />
• 1M vertex/sec Vertex Setup Processor<br />
• Per<strong>for</strong>ms all delta calculations <strong>and</strong> floating point conversions<br />
• 100 MFlop floating point processor<br />
• Reduces PCI B<strong>and</strong>width - just passing vertices - no slopes<br />
3D API<br />
Trans<strong>for</strong>ms<br />
Lighting<br />
110 Bytes / polygon<br />
<strong>Delta</strong> Calcs<br />
PCI<br />
PCI<br />
<strong>GLINT</strong><br />
<strong>Delta</strong><br />
33 Bytes / polygon<br />
PERMEDIA<br />
3D API<br />
Trans<strong>for</strong>ms<br />
Lighting<br />
PCI<br />
PERMEDIA<br />
Triples CPU<br />
Geometry<br />
Per<strong>for</strong>mance
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 20<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
Setup Processing in a PCI Bridge<br />
Full Bus<br />
Master<br />
Primary<br />
PCI Bus<br />
Provides<br />
setup-fetch<br />
overlap<br />
Allows transparent use of VGA <strong>and</strong> 8514 behind bridge<br />
176 Pin PQFP<br />
3.3V power<br />
5V I/O<br />
PCI 0<br />
Master &<br />
Slave<br />
Interface<br />
DMA<br />
Controller<br />
Input<br />
FIFO<br />
Function 0 Decode with VGA/8514<br />
Function 1 Decode<br />
<strong>Delta</strong><br />
Setup<br />
Engine<br />
Slope <strong>and</strong> Setup Calculations<br />
<strong>for</strong> <strong>GLINT</strong> <strong>and</strong> <strong>Permedia</strong><br />
Bypass<br />
Path<br />
Output<br />
FIFO<br />
PCI 1<br />
Master<br />
Interface<br />
<strong>and</strong><br />
DMA<br />
Control<br />
DMA to<br />
<strong>GLINT</strong> or<br />
PERMEDIA<br />
Secondary<br />
PCI Bus
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 21<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
Setup Engine Functionality<br />
• Follows the message passing architecture of PERMEDIA<br />
• <strong>Delta</strong> is just another unit in front of the rasterizer<br />
• API neutral - low-level functionality<br />
• Triangle primitive setup (AA <strong>and</strong> non-AA)<br />
• Line primitive setup (AA <strong>and</strong> non-AA)<br />
• Interpolation Parameters - XYZ, RGBA, F, STQ, Ks, Kd<br />
• Accepts floating point (IEEE SP) or fixed point inputs<br />
• Texture coordinate auto normalization<br />
• Optional input value clamping<br />
• High precision sub-pixel correction
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 22<br />
<strong>GLINT</strong> <strong>Delta</strong> Setup Engine<br />
Hardwired processing<br />
Input Vertex In<strong>for</strong>mation - 16x32<br />
Vertex Store<br />
0<br />
Vertex Store<br />
1<br />
Vertex Store<br />
2<br />
Data Routing<br />
FMUL FADD FDIV FDIV FConvert<br />
Floating Point Operation Units<br />
VHDL Coded, Inferred Routing<br />
Working<br />
Store<br />
Output<br />
FIFO<br />
Temp<br />
Storage<br />
8x32
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 23<br />
<strong>GLINT</strong> <strong>Delta</strong> Calculations<br />
Floating Point improves robustness <strong>and</strong> visual quality<br />
• Input parameter score-boarding<br />
• All internal calculations in custom floating point <strong>for</strong>mat<br />
• Less dynamic range, but more precision than IEEE<br />
• RGBAZ triangle set-up involves:<br />
• 41 floating point add or subtract<br />
• 27 floating point multiplies<br />
• 5 floating point divides<br />
• plus.. compares, clamping, fixed point/floating point conversions<br />
• Main floating point operators are:<br />
• One multiplier (one pipeline stage, single cycle).<br />
• One adder/subtracter (single cycle)<br />
• Two dividers (5 cycle iterative, autonomous)<br />
• Four comparators<br />
• Float to fixed point conversion with clamping
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 24<br />
Hard-Wired Processing<br />
Cost-effective floating point per<strong>for</strong>mance<br />
• Control is a VHDL state machine.<br />
• No RAM or ROM <strong>for</strong> program storage (less gates)<br />
• No program sequencer or instruction set (less gates)<br />
• No program fetch (less memory b<strong>and</strong>width)<br />
• Data paths are inferred directly from VHDL<br />
• No general purpose routing costs<br />
• No software maintenance<br />
• 35 cents / MFlop
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 25<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
Physical Characteristics<br />
• Low cost device - 176 pin PQFP<br />
• .45μ, 40MHz, 3 layer metal<br />
• Shipping now<br />
• Per<strong>for</strong>mance<br />
• 1M Meshed Shaded, Z buffered triangles/sec<br />
• 2M 2D polylines/sec
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 26<br />
Combined Board Design<br />
<strong>Delta</strong> <strong>and</strong> PERMEDIA PERMEDIA<br />
• Matched Geometry <strong>and</strong> Rasterization per<strong>for</strong>mance<br />
• High per<strong>for</strong>mance Arcade machines, VR/simulation<br />
engines, entry-level desktop OpenGL acceleration<br />
• Sub $350 street price<br />
OSC<br />
RAM<br />
DAC PERMEDIA<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
ROM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM<br />
256x32<br />
SGRAM
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 27<br />
<strong>GLINT</strong> <strong>Delta</strong><br />
Measured Per<strong>for</strong>mance Increases<br />
Tspeed3 V3.0 OpenGL No <strong>Delta</strong> With <strong>Delta</strong> X Faster<br />
With <strong>Delta</strong><br />
Meshed Triangles (Z, Shaded) 50 Pixel per second 155,146 238,997 1.54<br />
Meshed Triangles (Z, flat) 50 Pixel per second 205,870 321,247 1.56<br />
Meshed Triangles (Z, Shaded) 25 Pixel per second 180,744 427,242 2.36<br />
Meshed Triangles (Z, flat) 25 Pixel per second 232,398 573,212 2.47<br />
Meshed Triangles (Z, Shaded) Small Triangles per second 187,454 599,762 3.20<br />
Meshed Triangles (Z, flat) Small Triangles per second 249,629 586,527 2.35<br />
Meshed Triangles (Z, Shaded) Single Pixel Triangles per second 187,454 600,476 3.20<br />
Meshed Triangles (Z, flat) Single Pixel Triangles per second 249,629 586,527 2.35<br />
Meshed Triangles (No Z, Shaded) 50 Pixel per second 182,048 277,016 1.52<br />
Meshed Triangles (No Z, flat) 50 Pixel per second 223,155 365,412 1.64<br />
Meshed Triangles (No Z, Shaded) 25 Pixel per second 199,290 514,781 2.58<br />
Meshed Triangles (No Z, flat) 25 Pixel per second 272,531 585,847 2.15<br />
Meshed Triangles (No Z, Shaded) Small Triangles per second 200,159 646,607 3.23<br />
Meshed Triangles (No Z, flat) Small Triangles per second 271,068 586,527 2.16<br />
Meshed Triangles (No Z, Shaded) Single Pixel Triangles per second 200,079 646,607 3.23<br />
Meshed Triangles (No Z, flat) Single Pixel Triangles per second 271,214 586,527 2.16
3Dlabs - <strong>Hot</strong> <strong>Chips</strong> - Stan<strong>for</strong>d August 1996 - Page 28<br />
Future Directions<br />
• More Geometry Pipeline in Hardwired Logic<br />
• CPUs just aren’t fast enough<br />
• Hardwired logic is more cost-effective<br />
• Unified Memory<br />
• Using system memory <strong>for</strong> texture<br />
• Intel’s AGP - Accelerated Graphics Port<br />
• 3D Graphics on the Motherboard<br />
• High integration - RAMDACs <strong>and</strong> geometry included on-chip<br />
• Aggressive Per<strong>for</strong>mance Increases<br />
• Next generation silicon - single chip million polygon devices<br />
• Major silicon vendors entering graphics chips market