
Permedia and GLINT Delta: New Generation Silicon for - Hot Chips


3Dlabs - Hot Chips - Stanford August 1996

PERMEDIA and GLINT Delta
New Generation Silicon for 3D Graphics

Neil Trevett
Vice President Marketing
(408) 436 3455
www.3dlabs.com


3Dlabs New Generation Silicon

[Roadmap diagram. Timeline: Q1'95, Q3'95, Q1'96, Q3'96. Market segments: Professional, Games, Pervasive. Parts shown across 1st Generation and 2nd Generation: GLINT 300SX, 3D Blaster Chip, GLINT Delta, GLINT 500TX, PERMEDIA.]

• GLINT Delta - the first single chip geometry pipeline processor
• Professional 3D - strong competition to workstations
• Pervasive 3D - making 2D chips obsolete


Alternative Approaches...
... to low-cost 3D silicon

[Positioning chart. Axes: Performance, Cost. Categories: Pervasive 3D, FreeD, Games 3D, Arcade 3D. Vendors shown: 3Dlabs, Matrox, ATI, S3, Rendition, Nvidia, 3Dfx, VideoLogic. Callouts: "No Compromises", "3D Performance?", "2D and 3D Performance?", "No 2D?".]


Pervasive 3D - the Challenge
No compromises!

[Diagram labels: Pervasive 3D, Video Acceleration, 3D for games, 3D for authoring, Fast Windows Acceleration, 3D for browsing, Fast VGA for DOS games.]

• 3D performance must be much greater than software only
  • Software = 5 million texture-mapped pixels per second
  • Hardware should deliver >25 million bilinear-filtered texture-mapped pixels per second


PERMEDIA Design Targets

• Robust 100% pixel functionality of all key 3D APIs
  • Direct3D, OpenGL, Heidi, QuickDraw 3D, QuickDraw 3D RAVE
• No 2D compromises
  • >30 Million Winmarks
• Fast 3D performance
  • Balanced performance for both textures and polygons
  • 600,000 textured polygons/second
  • 30 million bilinear-filtered texture-mapped pixels/second
• Low cost

• Selling on boards costing


PERMEDIA Architecture

[Block diagram. Blocks: PCI, VGA, Graphics Core, Bypass, Memory Interface, Video, RamDac Interface.]

Callouts:
• High performance bus interface
• Unified graphics core for accelerated 3D and 2D
• Bypass for direct rendering and control registers
• Backward compatibility for DOS games and boot
• High bandwidth integrated memory interface
• Drives RAMDAC


PERMEDIA Host Interface

[Block diagram. Blocks: Glueless PCI Interface; PCI Bus Interface Rev 2.1 (Disconnect, 32-bit Slave, 32-bit Master); Bi-Endian Support; Memory Bypass; 41x32 FIFO; Interrupt Controller; DMA Controller; Delta Interface.]

Callouts:
• Avoids polling the FIFO
• Avoids byte-swapping
• Provides setup-fetch overlap
• Allows software drawing
• Provides fetch-draw overlap
• Provides upgrade path


PERMEDIA Memory Interface

• SGRAM for next generation graphics
  • Good random access speed - vital for texture mapping
  • Block fills - very important for clearing buffers
  • Write-per-bit mask, needed for per-window double buffering
  • Upgrade path to 100 MHz and beyond
• 2 to 8 MBytes
• Up to 4 pages open at any time
  • E.g. front color buffer, back color buffer, depth buffer, texture buffer

[Block diagram. Core units (Texture, Z-Buffer, Stencil, RGBA) and Bypass connect to the Memory I/F Unit, which drives a 64 bit bus to banks of two 256x32 SGRAMs. Bank labels: BANK 0 - 2 MBytes, BANK 1 - 4 MBytes, BANK 2 - 6 MBytes, etc.]


Consolidated Memory
All buffers in same physical memory

• Efficient and Flexible
  • Dynamically allocate color and depth buffers
  • Any spare memory available for textures
  • Trade resolution for depth buffer, color depth for texture space etc.
• All data in same memory, scope for optimization, e.g.
  • Clear depth buffer with framebuffer block fills
  • Use texture operations on any image
    • Full scene anti-aliasing
    • Video texture-mapping
    • 3D sprite processing

[Memory map diagram. Regions: Depth and stencil, Texture, Backbuffer, Frontbuffer.]
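The trade-off between resolution, buffers and texture space can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes double-buffered 16-bit color and a 16-bit depth buffer (the depth format is an assumption, not stated on the slide), and the function names are hypothetical.

```python
MBYTE = 1024 * 1024


def buffer_bytes(width, height, bits_per_pixel):
    """Size in bytes of one full-screen buffer."""
    return width * height * bits_per_pixel // 8


def texture_space(total_bytes, width, height, color_bits, depth_bits):
    """Bytes left for textures after front, back and depth buffers are allocated.

    Assumes two color buffers (front and back); pass depth_bits=0 for no depth buffer.
    A negative result means the configuration does not fit.
    """
    used = 2 * buffer_bytes(width, height, color_bits)
    used += buffer_bytes(width, height, depth_bits)
    return total_bytes - used


# 2 MByte board, 640x480, 16-bit color, assumed 16-bit depth
with_depth = texture_space(2 * MBYTE, 640, 480, 16, 16)
# Same board, depth buffer given up in exchange for texture space
without_depth = texture_space(2 * MBYTE, 640, 480, 16, 0)
# 800x600 with depth does not fit in 2 MBytes under these assumptions
too_big = texture_space(2 * MBYTE, 800, 600, 16, 16)

print(with_depth, without_depth, too_big)
```

Under these assumptions a 2 MByte board at 640x480 keeps roughly a quarter of a megabyte for textures, and dropping the depth buffer more than triples that.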


PERMEDIA Pixel Core

• Hyper-pipelined function units
• Message passing protocol between units

[Block diagram of the PERMEDIA Core. Units shown: Rasterizer, Scissor/Stipple, Color DDA, Texture Address, Texture Read, Texture/Fog, Blend, Localbuffer Read, Stencil, Depth, Localbuffer Write, Framebuffer Read, YUV, Dither, Logic Op, Framebuffer Write, Host Out.]


Pipeline Principles

• Each unit in the pipeline is independent
  • Can be designed, tested and synthesized separately
• Unit State Machine
  • Wait for message in input FIFO
  • If message is not relevant, pass to next unit
  • Else process message and pass on any messages as required
  • Return to waiting
• Some units know their place
  • Some units are completely self contained
    • E.g. scissor/stipple unit
  • Some units know where they are in the pipeline
    • E.g. YUV can absorb localbuffer data if the chroma test fails
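The unit state machine described above can be sketched in software. This is a minimal illustration, not the hardware design: the `Unit` class, the tag names (`ScissorMode`, `DepthMode`, `Step`) and the `run_pipeline` helper are all hypothetical, and each unit simply records the messages it acts on.

```python
from collections import deque


class Unit:
    """One pipeline unit: acts on the tags it handles, forwards everything else."""

    def __init__(self, name, handled_tags):
        self.name = name
        self.handled_tags = set(handled_tags)
        self.input_fifo = deque()
        self.next_unit = None
        self.processed = []  # record of messages this unit acted on

    def process(self, tag, data):
        # Placeholder for unit-specific work: record the message and
        # pass it on unchanged.
        self.processed.append((tag, data))
        return [(tag, data)]

    def step(self):
        """One pass of the state machine; returns False when the FIFO is empty."""
        if not self.input_fifo:
            return False                        # wait for a message
        tag, data = self.input_fifo.popleft()
        if tag in self.handled_tags:
            outgoing = self.process(tag, data)  # relevant: process it
        else:
            outgoing = [(tag, data)]            # not relevant: pass it on
        if self.next_unit is not None:
            self.next_unit.input_fifo.extend(outgoing)
        return True                             # return to waiting


def run_pipeline(units, messages):
    """Chain the units, feed the first one, and run until every FIFO drains."""
    for upstream, downstream in zip(units, units[1:]):
        upstream.next_unit = downstream
    units[0].input_fifo.extend(messages)
    while any([unit.step() for unit in units]):
        pass


scissor = Unit("scissor", {"ScissorMode", "Step"})
depth = Unit("depth", {"DepthMode", "Step"})
run_pipeline([scissor, depth],
             [("ScissorMode", 1), ("DepthMode", 1), ("Step", (3, 4))])
print(scissor.processed)
print(depth.processed)
```

Because each unit only looks at its own input FIFO and the message tag, a unit written this way can be tested in isolation, which mirrors the "designed, tested and synthesized separately" point.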


Unit Pipeline Stage

• The pipeline uses a message passing paradigm
• A message is made up of a tag field and a data field
• The tag identifies the message type

[Diagram of one pipeline stage. Tag and Data enter through a two stage FIFO, pass through the Core Unit (Input Stage, Register Storage, Processing, Control, Output Stage; pipelining as required), and leave through a two stage FIFO. Tag = 9 bits, Data = 32 bits.]


Message Passing

• Everything that moves through the pipeline is a message
  • Messages are used to program control registers - e.g. enable texture
  • Messages are used to carry transient information - e.g. texture color for current pixel
  • Messages are used as commands - e.g. start new primitive
  • Messages are used for synchronization - e.g. Sync message
• ‘Step’ messages drive the units
  • A Step message for each pixel to be plotted
  • Passive steps are pixels not to be plotted
  • If a pixel fails a test it is converted from active to passive
  • Passive steps cannot be deleted because they advance DDA units
  • Step messages hold the pixel X,Y coordinate in the data field
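To show why passive steps have to stay in the stream, here is a small sketch of step messages flowing past a scissor test and a color DDA. The 9-bit tag and 32-bit data widths come from the slides; everything else is an assumption made for illustration: the tag values, the packing of X and Y as two 16-bit halves, and the helper names.

```python
TAG_BITS = 9
DATA_BITS = 32
ACTIVE_STEP = 0x001   # hypothetical tag value
PASSIVE_STEP = 0x002  # hypothetical tag value


def make_message(tag, data):
    """Build a (tag, data) message, checking the field widths from the slide."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in 9 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 32 bits")
    return (tag, data)


def pack_xy(x, y):
    """Assumed layout: Y in the upper 16 bits, X in the lower 16 bits."""
    return ((y & 0xFFFF) << 16) | (x & 0xFFFF)


def unpack_xy(data):
    return data & 0xFFFF, (data >> 16) & 0xFFFF


def scissor(message, x_min, x_max):
    """Convert an active step outside [x_min, x_max) into a passive step."""
    tag, data = message
    if tag == ACTIVE_STEP:
        x, _ = unpack_xy(data)
        if not x_min <= x < x_max:
            return make_message(PASSIVE_STEP, data)
    return message


class ColorDDA:
    """Interpolator that advances once per step, active or passive."""

    def __init__(self, start, delta):
        self.value = start
        self.delta = delta

    def step(self):
        current = self.value
        self.value += self.delta
        return current


def draw_span(y, x_start, x_end, x_min, x_max, dda):
    """Return (x, y, color) for each pixel still active after the scissor test."""
    plotted = []
    for x in range(x_start, x_end):
        message = scissor(make_message(ACTIVE_STEP, pack_xy(x, y)), x_min, x_max)
        color = dda.step()  # advances even when the step is passive
        if message[0] == ACTIVE_STEP:
            px, py = unpack_xy(message[1])
            plotted.append((px, py, color))
    return plotted


pixels = draw_span(y=0, x_start=0, x_end=4, x_min=1, x_max=3, dda=ColorDDA(10, 5))
print(pixels)
```

The pixel at x=1 receives the second interpolated color, not the first, because the scissored-out pixel at x=0 still advanced the DDA. Deleting the passive step would shift every later color by one pixel.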


Unified 2D/3D Pixel Engine

• 3D is a superset of 2D
  • Don’t separate them
• 2D operations use 3D pipeline and use special features
  • Texture units used for tiled blits
  • Chroma key test used for transparent blits
  • Bilinear filter used for stretch blits
• Using the 3D units is gate efficient
  • No duplication of functions
  • No compromise on performance
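As a software analogy for two of the mappings listed above, a tiled blit behaves like a texture lookup with wrapping coordinates, and a transparent blit behaves like a texture copy with a chroma key test. The sketch below uses plain nested lists as images; the function names and pixel values are hypothetical and say nothing about how the hardware implements these paths.

```python
def tiled_blit(tile, width, height):
    """Fill a width x height image by wrapping texture coordinates over the tile."""
    tile_h, tile_w = len(tile), len(tile[0])
    return [[tile[y % tile_h][x % tile_w] for x in range(width)]
            for y in range(height)]


def transparent_blit(src, dst, dst_x, dst_y, chroma_key):
    """Copy src into dst at (dst_x, dst_y), skipping texels equal to chroma_key."""
    for row, texels in enumerate(src):
        for col, texel in enumerate(texels):
            if texel != chroma_key:      # chroma test: key-colored texels are dropped
                dst[dst_y + row][dst_x + col] = texel
    return dst


background = tiled_blit([[1, 2], [3, 4]], width=4, height=2)
sprite = [[0, 9],
          [9, 0]]
result = transparent_blit(sprite, background, dst_x=1, dst_y=0, chroma_key=0)
print(result)
```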


PERMEDIA Performance

• 30 Mpixels/sec, 600K polygons/sec, textured, bilinear, no Z
  • With full per pixel perspective correction, 16-bit framebuffer, 4-bit palettized textures, 50 displayed pixels per polygon, meshed, 640x480 at 75Hz
• 640x480 full screen bilinear textured, x2.5 depth complexity = 40Hz frame rate
• 2 GBytes/sec Fill rate using SGRAM block fill
• 2 GBytes/sec Color expansion
• > 30 Million Winmarks
• Video Playback performance - 30fps
  • 320x200 YUV source zoomed and filtered to 640x480x16-bit RGB
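The frame-rate figure follows from the fill rate. As a quick check, and assuming the 30 Mpixels/sec textured fill rate is the only limit, the arithmetic can be written out as below (the function name is hypothetical):

```python
def frame_rate_hz(fill_rate_pixels_per_s, width, height, depth_complexity):
    """Frames per second when each frame draws width * height * depth_complexity pixels."""
    pixels_per_frame = width * height * depth_complexity
    return fill_rate_pixels_per_s / pixels_per_frame


FILL_RATE = 30_000_000  # bilinear textured pixels per second, from the slide

rate = frame_rate_hz(FILL_RATE, 640, 480, 2.5)
print(f"{rate:.1f} Hz")

# The polygon and pixel figures are consistent with each other:
# 600K polygons/sec at 50 displayed pixels per polygon is 30 Mpixels/sec.
pixel_rate_from_polygons = 600_000 * 50
print(pixel_rate_from_polygons)
```

The result is just over 39 Hz, which the slide rounds to 40 Hz.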


PERMEDIA Physical Characteristics

• Packaging
  • 256 pin BGA
  • Wire-bonded into a plastic BGA package
  • 3W at 3.3V
• Process
  • 0.35μ, 4 layer metal
  • 60 MHz
• Shipping now


Board Design
Low component count

• Single PERMEDIA Chip
  • plus SGRAM, RAMDAC, ROM
• External interfaces
  • Glueless PCI Interface
  • High performance 64-bit SGRAM Interface
  • High speed pixel port to RAMDAC

[Board diagram. Components: RAMDAC, OSC, PERMEDIA, ROM, four 256x32 SGRAMs. Typically 2 MBytes, with optional upgrade to 4 MBytes.]


But where’s the bottleneck?
Geometry!

• The fastest Pentium Pro cannot keep PERMEDIA saturated if running the geometry in software

1K polygons/MHz on a Pentium class machine (90K polygons on a P5/90)

[Pipeline diagram. Stages: 3D API, Transforms, Lighting, Delta Calcs, Rasterization. Callouts: "70% of the CPU cycles spent in setup!", "100% of Rasterization in PERMEDIA silicon".]
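The size of the gap can be estimated from the two figures in the deck: the 1K polygons/MHz rule of thumb and the 600K polygons/sec PERMEDIA target. The sketch below is a rough estimate only; it assumes the rule scales linearly with clock speed, and the function names are hypothetical.

```python
POLYGONS_PER_MHZ = 1_000             # rule of thumb from the slide
PERMEDIA_POLYGONS_PER_S = 600_000    # textured polygon rate from the design targets


def software_geometry_rate(cpu_mhz):
    """Polygons per second the CPU can prepare, assuming linear scaling with clock."""
    return cpu_mhz * POLYGONS_PER_MHZ


def rasterizer_utilization(cpu_mhz):
    """Fraction of the PERMEDIA polygon rate the CPU can keep fed."""
    return min(1.0, software_geometry_rate(cpu_mhz) / PERMEDIA_POLYGONS_PER_S)


def clock_to_saturate_mhz():
    """CPU clock needed to saturate PERMEDIA under the same rule of thumb."""
    return PERMEDIA_POLYGONS_PER_S / POLYGONS_PER_MHZ


print(software_geometry_rate(90))    # P5/90 example from the slide
print(rasterizer_utilization(90))
print(clock_to_saturate_mhz())
```

On these numbers a P5/90 feeds about 15% of the rasterizer's polygon rate, and a CPU would need to run at around 600 MHz to saturate it.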


GLINT Delta
Breaking the Geometry Bottleneck

• Hardwired 3D Pipeline Processing
  • 1M vertex/sec Vertex Setup Processor
  • Performs all delta calculations and floating point conversions
  • 100 MFlop floating point processor
  • Reduces PCI Bandwidth - just passing vertices - no slopes

[Diagram comparing two data paths. Labels: 3D API, Transforms, Lighting, Delta Calcs, PCI, GLINT Delta, PERMEDIA; 110 Bytes / polygon; 33 Bytes / polygon. Callout: Triples CPU Geometry Performance.]
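A rough feel for the PCI saving comes from multiplying the bytes-per-polygon figures by a polygon rate. The sketch below reads the diagram as 110 bytes per polygon when the host sends slopes and 33 bytes per polygon when it sends only vertices to Delta; that reading, the 600K polygons/sec rate borrowed from the PERMEDIA targets, and the 132 MB/s peak for a 32-bit, 33 MHz PCI bus are assumptions for illustration.

```python
BYTES_PER_POLYGON_WITH_SLOPES = 110   # assumed: host computes delta values
BYTES_PER_POLYGON_VERTICES_ONLY = 33  # assumed: Delta computes them
PCI_PEAK_MB_PER_S = 132.0             # assumed 32-bit bus at 33 MHz, theoretical peak


def traffic_mb_per_s(bytes_per_polygon, polygons_per_s):
    """Bus traffic in MB/s (1 MB = 10**6 bytes) at a given polygon rate."""
    return bytes_per_polygon * polygons_per_s / 1e6


rate = 600_000
without_delta = traffic_mb_per_s(BYTES_PER_POLYGON_WITH_SLOPES, rate)
with_delta = traffic_mb_per_s(BYTES_PER_POLYGON_VERTICES_ONLY, rate)

print(f"slopes over PCI:   {without_delta:.1f} MB/s "
      f"({without_delta / PCI_PEAK_MB_PER_S:.0%} of peak)")
print(f"vertices over PCI: {with_delta:.1f} MB/s "
      f"({with_delta / PCI_PEAK_MB_PER_S:.0%} of peak)")
```

Under these assumptions the slope traffic alone would take about half of the theoretical bus peak, while vertex-only traffic takes about 15%.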


GLINT Delta
Setup Processing in a PCI Bridge

[Block diagram of Delta sitting between the Primary PCI Bus and the Secondary PCI Bus. Blocks: PCI 0 Master & Slave Interface; DMA Controller; Input FIFO; Function 0 Decode with VGA/8514; Function 1 Decode; Delta Setup Engine (slope and setup calculations for GLINT and Permedia); Bypass Path; Output FIFO; PCI 1 Master Interface and DMA Control (DMA to GLINT or PERMEDIA).]

Callouts:
• Full Bus Master
• Provides setup-fetch overlap
• Allows transparent use of VGA and 8514 behind bridge
• 176 Pin PQFP
• 3.3V power, 5V I/O


GLINT Delta
Setup Engine Functionality

• Follows the message passing architecture of PERMEDIA
• Delta is just another unit in front of the rasterizer
• API neutral - low-level functionality
• Triangle primitive setup (AA and non-AA)
• Line primitive setup (AA and non-AA)
• Interpolation Parameters - XYZ, RGBA, F, STQ, Ks, Kd
• Accepts floating point (IEEE SP) or fixed point inputs
• Texture coordinate auto normalization
• Optional input value clamping
• High precision sub-pixel correction
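For readers unfamiliar with triangle setup, the core of the job is turning per-vertex values into per-pixel increments (the "delta" values). The sketch below shows the generic textbook plane-equation form of that calculation for one interpolated parameter. It is not Delta's actual algorithm or data format; the function name and the example values are hypothetical.

```python
def parameter_gradients(v0, v1, v2, p0, p1, p2):
    """Return (dp/dx, dp/dy) for a parameter that varies linearly over a triangle.

    v0, v1, v2 are (x, y) screen positions; p0, p1, p2 are the parameter
    values (for example red, Z or a texture coordinate) at those vertices.
    Returns None for a degenerate (zero-area) triangle.
    """
    dx1, dy1 = v1[0] - v0[0], v1[1] - v0[1]
    dx2, dy2 = v2[0] - v0[0], v2[1] - v0[1]
    area2 = dx1 * dy2 - dx2 * dy1          # twice the signed triangle area
    if area2 == 0:
        return None
    dp1, dp2 = p1 - p0, p2 - p0
    dpdx = (dp1 * dy2 - dp2 * dy1) / area2
    dpdy = (dp2 * dx1 - dp1 * dx2) / area2
    return dpdx, dpdy


# Example: a parameter defined as p = 2x + 3y + 1 over a right triangle,
# so the expected gradients are 2 per pixel in x and 3 per pixel in y.
corners = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
values = [2 * x + 3 * y + 1 for x, y in corners]
gradients = parameter_gradients(*corners, *values)
print(gradients)
```

The same calculation is repeated for every interpolated parameter, which is why the add, multiply and divide counts for a full triangle setup add up quickly.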


GLINT Delta Setup Engine
Hardwired processing

[Block diagram. Labels: Input Vertex Information - 16x32; Vertex Store 0, Vertex Store 1, Vertex Store 2; Data Routing; Floating Point Operation Units (FMUL, FADD, FDIV, FDIV, FConvert); Working Store; Output FIFO; Temp Storage; 8x32. Note: VHDL Coded, Inferred Routing.]


GLINT Delta Calculations
Floating Point improves robustness and visual quality

• Input parameter score-boarding
• All internal calculations in custom floating point format
  • Less dynamic range, but more precision than IEEE
• RGBAZ triangle set-up involves:
  • 41 floating point add or subtract
  • 27 floating point multiplies
  • 5 floating point divides
  • plus compares, clamping, fixed point/floating point conversions
• Main floating point operators are:
  • One multiplier (one pipeline stage, single cycle)
  • One adder/subtracter (single cycle)
  • Two dividers (5 cycle iterative, autonomous)
  • Four comparators
  • Float to fixed point conversion with clamping
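These counts can be cross-checked against the headline figures quoted elsewhere in the deck (100 MFlop, 1M triangles/sec, 40 MHz). The estimate below is a back-of-the-envelope sketch: it counts only adds, multiplies and divides, and it assumes the multiplier and adder each complete one operation per cycle while each 5-cycle divider completes one result every five cycles.

```python
ADDS, MULTIPLIES, DIVIDES = 41, 27, 5   # per RGBAZ triangle, from the slide
TRIANGLES_PER_S = 1_000_000             # quoted Delta triangle rate
CLOCK_MHZ = 40                          # quoted Delta clock

ops_per_triangle = ADDS + MULTIPLIES + DIVIDES
sustained_mflops = ops_per_triangle * TRIANGLES_PER_S / 1e6

# Assumed peak: one multiply and one add per cycle, plus two dividers
# that each finish one divide every 5 cycles.
peak_mflops = CLOCK_MHZ * 1 + CLOCK_MHZ * 1 + 2 * CLOCK_MHZ / 5

print(ops_per_triangle)
print(sustained_mflops)
print(peak_mflops)
```

Under these assumptions the operator mix peaks at 96 MFlops, close to the quoted 100 MFlop figure, and the triangle rate corresponds to about 73 MFlops of add, multiply and divide work.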


Hard-Wired Processing
Cost-effective floating point performance

• Control is a VHDL state machine
  • No RAM or ROM for program storage (fewer gates)
  • No program sequencer or instruction set (fewer gates)
  • No program fetch (less memory bandwidth)
• Data paths are inferred directly from VHDL
  • No general purpose routing costs
• No software maintenance
• 35 cents / MFlop


GLINT Delta
Physical Characteristics

• Low cost device - 176 pin PQFP
• 0.45μ, 40MHz, 3 layer metal
• Shipping now
• Performance
  • 1M Meshed Shaded, Z buffered triangles/sec
  • 2M 2D polylines/sec


Combined Board Design
Delta and PERMEDIA

• Matched Geometry and Rasterization performance
• High performance Arcade machines, VR/simulation engines, entry-level desktop OpenGL acceleration
• Sub $350 street price

[Board diagram. Components: OSC, RAMDAC, PERMEDIA, GLINT Delta, ROM, eight 256x32 SGRAMs.]


GLINT Delta
Measured Performance Increases

Tspeed3 V3.0 OpenGL (all figures in triangles per second)

Test                                                 No Delta   With Delta   X Faster With Delta
Meshed Triangles (Z, Shaded), 50 Pixel                155,146      238,997   1.54
Meshed Triangles (Z, flat), 50 Pixel                  205,870      321,247   1.56
Meshed Triangles (Z, Shaded), 25 Pixel                180,744      427,242   2.36
Meshed Triangles (Z, flat), 25 Pixel                  232,398      573,212   2.47
Meshed Triangles (Z, Shaded), Small                   187,454      599,762   3.20
Meshed Triangles (Z, flat), Small                     249,629      586,527   2.35
Meshed Triangles (Z, Shaded), Single Pixel            187,454      600,476   3.20
Meshed Triangles (Z, flat), Single Pixel              249,629      586,527   2.35
Meshed Triangles (No Z, Shaded), 50 Pixel             182,048      277,016   1.52
Meshed Triangles (No Z, flat), 50 Pixel               223,155      365,412   1.64
Meshed Triangles (No Z, Shaded), 25 Pixel             199,290      514,781   2.58
Meshed Triangles (No Z, flat), 25 Pixel               272,531      585,847   2.15
Meshed Triangles (No Z, Shaded), Small                200,159      646,607   3.23
Meshed Triangles (No Z, flat), Small                  271,068      586,527   2.16
Meshed Triangles (No Z, Shaded), Single Pixel         200,079      646,607   3.23
Meshed Triangles (No Z, flat), Single Pixel           271,214      586,527   2.16
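The "X Faster" column is simply the With Delta rate divided by the No Delta rate. The short script below recomputes it from the published figures as a consistency check; the row labels are abbreviated and the variable names are hypothetical.

```python
# (test, no_delta, with_delta, published_speedup), triangles per second
RESULTS = [
    ("Z, Shaded, 50 Pixel",         155_146, 238_997, 1.54),
    ("Z, flat, 50 Pixel",           205_870, 321_247, 1.56),
    ("Z, Shaded, 25 Pixel",         180_744, 427_242, 2.36),
    ("Z, flat, 25 Pixel",           232_398, 573_212, 2.47),
    ("Z, Shaded, Small",            187_454, 599_762, 3.20),
    ("Z, flat, Small",              249_629, 586_527, 2.35),
    ("Z, Shaded, Single Pixel",     187_454, 600_476, 3.20),
    ("Z, flat, Single Pixel",       249_629, 586_527, 2.35),
    ("No Z, Shaded, 50 Pixel",      182_048, 277_016, 1.52),
    ("No Z, flat, 50 Pixel",        223_155, 365_412, 1.64),
    ("No Z, Shaded, 25 Pixel",      199_290, 514_781, 2.58),
    ("No Z, flat, 25 Pixel",        272_531, 585_847, 2.15),
    ("No Z, Shaded, Small",         200_159, 646_607, 3.23),
    ("No Z, flat, Small",           271_068, 586_527, 2.16),
    ("No Z, Shaded, Single Pixel",  200_079, 646_607, 3.23),
    ("No Z, flat, Single Pixel",    271_214, 586_527, 2.16),
]

speedups = [round(with_delta / no_delta, 2) for _, no_delta, with_delta, _ in RESULTS]
mismatches = [name for (name, _, _, published), computed in zip(RESULTS, speedups)
              if abs(computed - published) > 1e-9]

print("mismatches:", mismatches)
print("smallest speed-up:", min(speedups))
print("largest speed-up:", max(speedups))
```

The recomputed ratios agree with the published column. The gain grows as triangles get smaller, since small triangles are dominated by setup work and not by pixel fill.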


Future Directions

• More Geometry Pipeline in Hardwired Logic
  • CPUs just aren’t fast enough
  • Hardwired logic is more cost-effective
• Unified Memory
  • Using system memory for texture
  • Intel’s AGP - Accelerated Graphics Port
• 3D Graphics on the Motherboard
  • High integration - RAMDACs and geometry included on-chip
• Aggressive Performance Increases
  • Next generation silicon - single chip million polygon devices
  • Major silicon vendors entering graphics chips market
