01.02.2013 Views

DTJ Volume 7 Number 3 1995 - Digital Technical Journals

DTJ Volume 7 Number 3 1995 - Digital Technical Journals

DTJ Volume 7 Number 3 1995 - Digital Technical Journals

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Digital</strong><br />

<strong>Technical</strong><br />

Journal<br />

I<br />

HIGH PERFORMANCE FORTRAN<br />

IN PARALLEL ENVIRONMENTS<br />

SEQUOIA 2000 RESEARCH<br />

<strong>Volume</strong> 7 <strong>Number</strong> 3<br />

<strong>1995</strong>


Editorial<br />

Jane C. Blake, Managing Editor<br />

Helen L. Patterson, Editor<br />

Kathleen M. Stetson, Editor<br />

Circulation<br />

Catherine M. Phillips, Administrator<br />

Dorothea B. Cassady, Secretary<br />

Production<br />

Terri Autieri, Production Editor<br />

Anne S. Katzeff, Typographer<br />

Peter R. Woodbury, Illustrator<br />

Advisory Board<br />

Samuel H. Fuller, Chairman<br />

Richard W Beane<br />

Donald Z. Harbert<br />

William R. Hawe<br />

Richard J. Hollingsworth<br />

Richard F. Lary<br />

Alan G. Nemeth<br />

Robert M. Supnik<br />

Cover Design<br />

The images on the front and back covers<br />

of this issue are different visualizations<br />

of the same data output from a regional<br />

climate simulation program run by Dr.<br />

John Roads of the Scripps Institution of<br />

Oceanography. The data depicted contain<br />

measures of temperature, liquid and<br />

gaseous water content, and wind vectors;<br />

the topography represented by the data<br />

is the western U.S. in January 1990. Providing<br />

earth scientists with the ability to<br />

visualize such data is one of the objectives<br />

of the Sequoia 2000 research projecta<br />

joint eftort of the University of California,<br />

government agencies, and industry to build<br />

a computing environment for global change<br />

research. This issue presents papers on several<br />

major an::as explored by Sequoia 2000<br />

rese;uchers, including an electronic repository,<br />

networking, and visualization.<br />

The cover was designed by Lucinda O'Neill<br />

of <strong>Digital</strong>'s Design Group. Special thanks go<br />

to Peter Kochevar for supplying the cover<br />

images.<br />

The <strong>Digital</strong> <strong>Technical</strong> journal is a refereed<br />

journal published quarterly by <strong>Digital</strong><br />

Equipment Corporation, 30 Porter Road<br />

L)02/D lO, Littleton, Massachusetts 01460.<br />

Subscriptions ro rhejoumaiare $40.00<br />

(non-U.S. $60) for four issues and $75.00<br />

(non-U.S. $115) for eight issues and must<br />

be prepaid in U.S. funds. University and<br />

college protessors and Ph.D. srudems in<br />

the electrical engineering and computer<br />

science fields receive complimentary subscriptions<br />

upon request. Orders, inquiries,<br />

and address changes should be sent to the<br />

<strong>Digital</strong> Technica!journalat rhe publishedby<br />

address. Inquiries can also be sent electronically<br />

ro drj@digital.com. Single copies<br />

and back issues are available for $16.00 each<br />

by calling DECdirecr at 1-800-DIGITAL<br />

(1-800-344-4825). Recent back issuesofrhe<br />

journal are also available on the Internet at<br />

h rrp:/ jwww.digital.com/info/<strong>DTJ</strong>/home.<br />

hrml. Complete <strong>Digital</strong> Internet listings can<br />

be obtained by sending an electronic mail<br />

message ro info@digiral.com.<br />

<strong>Digital</strong> employees may order subscriptions<br />

through Readers Choice by enteringVTX<br />

PROFILE at the system prompt.<br />

Comments on the content of any paper are<br />

welcomed and may be sent to the managing<br />

editor at the pubhshed-by or network address.<br />

Copyright© <strong>1995</strong> <strong>Digital</strong> Equipment<br />

Corporation. Copying without fee is permitted<br />

provided that such copies are made<br />

for use in educational institutions by faculty<br />

members and are not distributed for commercial<br />

advantage. Abstracting with credit<br />

of <strong>Digital</strong> Equipment Corporation's authorship<br />

is permitted. All rights reserved.<br />

The information in the journal is subject<br />

to change without notice and should not<br />

be construed as a commitment by <strong>Digital</strong><br />

Equipment Corporation or by the companies<br />

herein represented. <strong>Digital</strong> Equipment<br />

Corporation assumes no responsibility for<br />

any errors that may appear in the Journal.<br />

JSSN 0898-901X<br />

Documentation <strong>Number</strong> EY-T838E-TJ<br />

Book production was done by Quantic<br />

Communications, Inc.<br />

The following are trademarks of <strong>Digital</strong><br />

Equipment Corporation: <strong>Digital</strong>, the<br />

DIGITAL logo, Alpha Generation,<br />

AJphaServer, AlphaSrarion, DEC, DEC<br />

OSF /1, DECstation, GIGAswirch,<br />

TURBOchannel, and ULTlUX.<br />

Dore is a registered trademark of Kubota<br />

Pacific Computer Inc.<br />

Ex a byte is a registered trademark of<br />

Exabyre Corporation.<br />

Hewlett-Packard and HP are registered<br />

trademarks of Hewlett-Packard Company<br />

JBM and SP2 are registered trademarks<br />

of! nternarional Business Machines<br />

Corporation.<br />

II lustra is a registered trademark of Illustra<br />

lntormation Technologies, Inc.<br />

Intel is a trademark oflnrel Corporation.<br />

MCf is a registered trademark of MCI<br />

Communications Corporation.<br />

MEMORY CHANNEL is a trademark<br />

of Encore Computer Corporation.<br />

Mosaic is a trademark of Mosaic<br />

Communications Corporation.<br />

Nerscape is a trademark ofNerscape<br />

Communications Corporation.<br />

NewronScripr is a trademark of Apple<br />

Computer, Inc.<br />

NFS is a registered trademark of Sun<br />

Microsysrems, Inc.<br />

OpenGL is a registered trademark and<br />

Open Inventor is a trademark of Silicon<br />

Graphics, Inc.<br />

Picture Tel is a registered trademark of<br />

PictureTcl Corporation.<br />

PostScript is a registered trademark of<br />

Adobe Systems Inc.<br />

SAIC is a registered trademark of Science<br />

Applications International Corporation.<br />

Siemens is ·a registered trademark of<br />

Siemens Nixdorflnformation Systems, Inc.<br />

Sony is a registered trademark of Sony<br />

Corporation.<br />

SPEC is a trademark of the Standard<br />

Performance Evaluation Council.<br />

Telescripr is a trademark of General Magic, Inc.<br />

UNIX is a registered trademark in the<br />

United Stares and other countries, licensed<br />

exclusively through X/Open Company Ltd.


Contents<br />

Foreword<br />

HIGH PERFORMANCE FORTRAN IN<br />

PARALLEL ENVIRONMENTS<br />

Compiling High Performance Fortran for<br />

Distributed-memory Systems<br />

Design of <strong>Digital</strong>'s Parallel Software Environment<br />

SEQUOIA 2000 RESEARCH<br />

An Overview of the Sequoia 2000 Project<br />

The Sequoia 2000 Electronic Repository<br />

Tecate: A Software Platform for Browsing and<br />

Visualizing Data from Networked Data Sources<br />

High-performance 1/0 and Networking Software<br />

in Sequoia 2000<br />

) can C. Bonnev<br />

Jonathan Harris, John A. Bircsak, M. Regina Bolduc,<br />

Jill Ann Diell'ald, Israel Gale, !\cil W. Johnson,<br />

Shin Lee, C. Alexander Kelson, :md Carl D. Offner<br />

Edward G. Benson, David C. P. La France- Linden,<br />

Richard A. Warn::n, and Sant�1 Wirvaman<br />

Michael Sroncbr,lker<br />

Ra\· R. Larson, Christian PLlunr,<br />

Allison G. Woodruff, and MJrri Hearst<br />

Peter D. Kochevar and L.eon�1rd R. Wanger<br />

joseph Pasquale, Eric W. Anderson, Kevin Fall, and<br />

Jonathan S. Kav<br />

<strong>Digital</strong> T�chnical )ounLll Vol.7 l'-:o.3 <strong>1995</strong><br />

3<br />

5<br />

24<br />

39<br />

50<br />

66<br />

84


2<br />

Editor's<br />

Introduction<br />

Scientists have long been moti,·ators<br />

tc>r the development of powert-l1l<br />

computing environments. Two<br />

sections in this issue of the.founw/<br />

address the requirements of scienrinc<br />

and technical computing. The tirst,<br />

fron1 <strong>Digital</strong>'s High Pertonnance<br />

<strong>Technical</strong> Computing Group, looks<br />

at compiler and development tools<br />

that accelerate performance in parallel<br />

environments. The second section<br />

looks ro the future of computing;<br />

University of California and <strong>Digital</strong><br />

researchers present their work on a<br />

large, distributed computing environmem<br />

suited to the needs of earth scientists<br />

studving global changes such<br />

as ocean dynamics, global warming,<br />

and ozone depletion. <strong>Digital</strong> w:�s an<br />

early industry sponsor and particip:mt<br />

in this joint research project, called<br />

Sequoia 2000.<br />

To support the writing of parallel<br />

programs tor computationally intense<br />

environments, <strong>Digital</strong> has extended<br />

DEC fortran 90 by implementing<br />

most of High Performance Fortran<br />

( H Pf) version 1.1. After reviewing<br />

the syntactic features of Fortran 90<br />

and HPF, Jonathan Harris et al. fc>eus<br />

on the HPF compiler design and<br />

expbin the optimizations it pertorms<br />

ro improve interprocessor communication<br />

in a distributed-memory environment,<br />

specincally, in workstation<br />

clusters (tarms) based on <strong>Digital</strong>'s<br />

64-bit Alpha microprocessors.<br />

The run-time support for this distributed<br />

environment is the Parallel<br />

Software Environment (PSE). Ed<br />

Benson, David Lafrance- Linden,<br />

Rich Warren, and Santa vVin'aman<br />

describe the PSE product, which is<br />

layered on the UNIX operating system<br />

and includes tools for developing<br />

Digit�l'lr ll'riting the Foreword ro tl1is<br />

ISSUC.<br />

Our next issue will feature papers<br />

on multimedia and UNIX clusters.<br />

);me C. Blake<br />

Mallrl_r�illg t.ditor


Foreword<br />

Jean C. Bonney<br />

Dirc>ci!J/: J::Yit'l"ll!li Nc>sem·cb<br />

The lntonmrion Utilitv, rhe<br />

Information Highw�l\', the Internet,<br />

rhe Infi.>bahn, rhe Information<br />

Eeononw-rhe sound lwtes ofrhe<br />

1990s. To 111�1ke these concepts<br />

reality, a robust technolob"' intra­<br />

structure is necessary. In 1990,<br />

Digit�1l's research organization saw<br />

rbis need :md ser out to develop an<br />

exl1erimenral rest hed that would<br />

examine assumptions and provide a<br />

basis tr a technolob"' edge in the '90s.<br />

The resulting project was Sequoia<br />

2000, a three-vcar research collabora­<br />

tion bet\veen <strong>Digital</strong>, campuses of the<br />

University ofCaliti.>rni�l, �111d several<br />

other industrv and governmem orga­<br />

nizations. The Sequoia 2000 vision is<br />

l'c>!ahyles! i.e .. /ri/liuiiS uj'hylc>s)<br />

oj'dala in u dislrilmled orchi(lc.<br />

/runsparel!lir ma11a,t.;ed. aJtd<br />

lugicallv cieti'C'd ouer a hl;t.;h-speul<br />

nelnnd' u·i!h isuchmiiOIIS capahililies<br />

l'ia a hosl u/loo/s<br />

-in orher words, a big, bst, easy-ro­<br />

use system.<br />

Although rhe vision is still not reality<br />

today, our more than three years<br />

of participation in Sequoia 2000<br />

research gave us rhe knowledge base<br />

we sought.<br />

AlTer a rigorous process of pro­<br />

poSrnia, Sequoia 2000 began<br />

in June 1991. The t(JCus of the<br />

rcsean.:h was a high-speed, broad­<br />

hrmation ...:ol­<br />

lccred trom satellites. Once rhe data<br />

arc colle...:tcd :111d organized, there is<br />

rhe challenge of massive simulations,<br />

simulations that t(>rccast world climate<br />

ten or even one hundred yeJrs ti·om<br />

now. These were exactly the kinds<br />

of challenges the computer scientists<br />

needed .<br />

Among the m:1jor results of three<br />

ye�ns of work on Sequoia 2000 was<br />

a set of produ...:t requiremems t(>r<br />

large data applications. These require­<br />

mems have been validated through<br />

discussions with customers in tinan­<br />

cial, healthcare, and communications<br />

industries Jnd in governmenr. The<br />

requin:menrs include<br />

• A computing environment built<br />

on an object relational database,<br />

i.e., a thta-cenrric computing<br />

syste m<br />

• A tbtabase that handles a wide<br />

variety of nontraditional objects<br />

such as tor, audio, video, graph­<br />

ics, and inuges<br />

• Support ror a variety ofrraditional<br />

databases and file systems<br />

• The ability to perform necessary<br />

operations ti·om computing<br />

environments that arc intuitive<br />

and have rhe same look and kcl;<br />

the imerface to the environment<br />

should be generic, very high level,<br />

and easily tailored to the user<br />

:lf-1plication<br />

• High -speed data migration<br />

bet\veen secondary and tertiary<br />

storage with the ability to handle<br />

very large data transfers<br />

• Net\vork bandwidth capable<br />

of handling image transmission<br />

across nervvorks in an acceptable<br />

rime hame with quality guarantees<br />

fc->r the dara<br />

• High-quality remote visualization<br />

of any relevant data regardless<br />

oftormat; the user must be able<br />

to manipulate the visual data<br />

interactively<br />

• Reliable, guaranteed , deli very<br />

of data ti·om tertiary storage to<br />

rhe desktop<br />

Sequoia 2000 was also a catalyst<br />

t(>r maturing the POSTGRES research<br />

database soft\vare to the point where<br />

it was ready tor ...:ommercialization.<br />

The commercial version, I !lustra,<br />

is available on Alpha platforms and<br />

is enjoying success in the banking<br />

industry and in geographic informa­<br />

tion system (GIS) applications, as<br />

well as in other government applica­<br />

tions with massive data requirements .<br />

Illusrra is Jlso making inroads into the<br />

Imernet where it is used by on-line<br />

services.<br />

Yet another major result of Sequoia<br />

2000 was a grant fi-om the National<br />

<strong>Digital</strong> Tcclmic�l Journ.1l Vol. 7 No.3 <strong>1995</strong> 3


4<br />

Aeronautics and Space Administra­<br />

tion (l\:ASA) to develop an ;lltemare<br />

archirectu1·e t(Jr the Earth Obsen·ing<br />

Svstem i);ua and lntcmnarion S)·stem<br />

(EOSDIS). EOS[)JS will process the<br />

pctJbvres of re:�l-time dar:� ti·om<br />

the E:�rth Observing System ( EOS)<br />

satellites to be launched at the end<br />

of the dec1de. The altematc int(Jr­<br />

marion :1rchirecrure proposed bv the<br />

Universitv ofCalit(Jrni:� bculn· II'Js<br />

. .<br />

the Sequoi;l 2000 ;Jrchirecrure. lr<br />

will have a m;ljor intluence on the<br />

EOSDIS project.<br />

fm the e;uth sciemisrs, g;lins<br />

1\'Cre made in simul;nion speeds :md<br />

111 access to Luge stores oforg:mized<br />

data. These sciemists used some oF<br />

<strong>Digital</strong>'s tirst Alpha workst;Hion Emm<br />

and sottw;lre prototi'JKS t(Jr their cli­<br />

mate simulations. An eight-processor<br />

Alph:� ,,·orksr:�rion brm pnwided a<br />

t\\'O-to-onc price/ped(mn;mce Jlklll­<br />

tage o1·er the pm1·erful, multimillion­<br />

dollar C:RAY C90 machine. In :mother<br />

earth science apJ)Iiurion, sciemisrs<br />

usi11g Alpha and hicr;uchiul stor;lgc<br />

svstems could simui;He r11·o \'cHs'<br />

worth of climate d,ltJ over the 11 cck­<br />

end without operator imcn'Cntion;<br />

tcmllcrh', t\I'O months' worth ohbt:-t<br />

rook one dav to simui;He :-tnd required<br />

considerable OJ)er;Hor imen·cntion.<br />

Thus m:�ny more simul:�rions could<br />

be processed in ;l tixed rime :-tnd<br />

"time to discoverv" W;ls decrc1sed<br />

considerablv.<br />

Now that ll'e em look ;lt Sequoi:-t<br />

2000 in retrospect, would II'C do<br />

such a project againl The ;111swcr<br />

is a resounding ''l·cs" ti·om ;!II of<br />

us im·oh·ed. It \\';ls a complex proj­<br />

ect that included 12 Unil'lTsin· of<br />

Calitcm1ia bculn· members, 2:1 grad­<br />

uate students, ;md 20 st:1ff Another<br />

8 hculn· members and students pm­<br />

,·idcd additional expertise. Four of<br />

<strong>Digital</strong>'s engineers worked on site,<br />

;md ;l varien· of support personnel<br />

ti·om other industrv sponsors p::�rtici­<br />

p:-tted, including SAlC, the C1li�clrni::�<br />

DepMtmenr ofvVater Resources,<br />

Hewlett-Packard, Metrum, United<br />

States CcologicaJ Survey (USGS),<br />

Hughes Application Int(mnation<br />

Sen·ices, ::�nd the Annv Corps of<br />

Engineers.<br />

But ;Js is the c1se with such ::�mbi­<br />

tious projects, there \\'ere tlll:mtici­<br />

p:ncd ;lnd difticulr lessons tclr ;111<br />

to leam. To experiment 11·ith real-<br />

lite rest beds means considerabh­<br />

more rhan 11ri ring a rigorous set<br />

oFh1-porheses in a proposal. Michael<br />

Stonehr,lker, in his paper, notes a<br />

lllllllbcr of ch,lilenges \\"C t:Ked and<br />

the lessons lc:1rned. One ofthe issues<br />

rh::�t kqlt surbcing was the "grease<br />

;md glue" tell. the inti·astrucrure, that<br />

is, the inreroperabilin· of pieces of<br />

sofn1 are and hard11·are tlut composed<br />

the end-to-end S\'Stem.1'his remains<br />

,\ chal.lenge that needs rese;Jrch if we<br />

;li'C going to achieve the promised<br />

gcds ot'intcrnct\l'orking. Another<br />

stick\· point was scalability. On the<br />

one hand, it is difficult to build a l't:rv<br />

large networked svstem ti·orn scratch.<br />

On the orhcr hand, as we slowlv built<br />

the mass storage svsrem to the point<br />

oF minimal critical mass, liT t(nmd<br />

th:lt the current oft�thc-shelftech­<br />

nologies t(Jr m.1ss storage were not<br />

I'CJch· to be put use tclr our purposes.<br />

So, ves, \\'C be lie1·e the project ll';ls<br />

11·orthll'hile 11·irh some Gll'e


Compiling High<br />

Performance Fortran<br />

for Distributedmemory<br />

Systems<br />

<strong>Digital</strong>'s DEC Fortran 90 compiler implements<br />

most of High Performance Fortran version 1.1,<br />

a language fo r writing parallel programs. The<br />

co mpiler generates code for distributed-memory<br />

machi nes consisting of interconnected work­<br />

stations or servers powered by <strong>Digital</strong>'s Alpha<br />

microprocessors. The DEC Fortran 90 compiler<br />

efficiently implements the features of Fortran 90<br />

and HPF that support parallelism. HPF programs<br />

compiled with <strong>Digital</strong>'s compiler yield perfor­<br />

mance that scales linearly or even superlinearly<br />

on significant applications on both distributed­<br />

memory and shared-memory architectures.<br />

I<br />

Jonathan Harris<br />

John A. Bircsak<br />

M. Regina Bolduc<br />

Jill Ann Diewald<br />

Israel Gale<br />

Neil W. J olmson<br />

Shin Lee<br />

C. Alexandet· Nelson<br />

Carl D. Offner<br />

High Performance Fortran (HPF) is a new program­<br />

ming language for writing parallel programs. It is<br />

based on the Fortran 90 language, with extensions<br />

that enable the programmer ro speci�r how array oper­<br />

ations can be divided among multiple processors tor<br />

increased performance. In HPF, the program specitles<br />

only the pattern in which the data is divided among<br />

the processors; the compiler auromates the low-level<br />

details of synchronization and communication of data<br />

benvecn processors.<br />

<strong>Digital</strong>'s DEC Fortran 90 compiler is the tirsr imple­<br />

mentation of the full H PF version l.l language<br />

(except for transcriptive argument passing, dynamic<br />

remapping, and nested FORALL and 'NHERE con­<br />

structs). The compiler was designed tor a distributed­<br />

memory machine made up of a cluster (or farm) of<br />

workstations and/or servers powered by <strong>Digital</strong>'s<br />

Alpha microprocessors.<br />

In a distributed-memory machine, communication<br />

between processors must be kept to an absolute mini­<br />

mum, because communication across the nenvork is<br />

enormously more time-consuming than anv operation<br />

done locally. <strong>Digital</strong>'s DEC Fortran 90 compiler<br />

includes a number of optimizations to minimize the<br />

cost of communication benveen processors.<br />

This paper briefly reviews the teatures of Fortran 90<br />

and HPF that support parallelism, describes how the<br />

compiler implements these features efficiently, and<br />

concludes with some recent performance results<br />

showing that HPt programs compiled with <strong>Digital</strong>'s<br />

compiler yield pert(xmance that scales linearly or even<br />

superlinearly on significant applications on both<br />

distributed-memory and shared-memory architectures.<br />

Historical Background<br />

The desire to write parallel programs dates back to the<br />

1950s, at least, and probably earlier. The mathematician<br />

John von Neumann, credited with the invention of the<br />

basic architecture of today's serial computers, also<br />

invented cellular automata, the precursor of today's<br />

massively parallel machines. The continuing motiva­<br />

tion tor parallelism is provided by the need to solve<br />

computationally intense problems in a reasonable time<br />

and


6<br />

which range ti·om collections of workstations con­<br />

nected by standard ti ber-optic networks to rightlv cou­<br />

pled CPUs with custom high-speed interconn�ction<br />

networks, are cheaper than single-processor svstems<br />

with equivalent performance . In many cases, �qu iva­<br />

lent smgle-processor svstems do not exist and could<br />

not be constructed with existing technologv.<br />

Historically, one of the difficulties \\ : ith parallel<br />

machmes has been writing paral lel programs. The work<br />

of paralleli zing a program w�1s fa r fi·om the original sci­<br />

ence be�ng explored; it required progra mmers to keep<br />

track of a great deal of inti:m11ation unrelated ro the<br />

actual computations; and it was done using ad hoc<br />

methods that were not portable to other machines.<br />

The experience gained fi·om this work however led<br />

'<br />

'<br />

to a consensus on a better way to write portable<br />

Fortran programs that would pcrtorm well on a varietv<br />

of paral lel machines. The High Pcrt


c<br />

i n teger n, number_o f_i terations, i , j,k<br />

parameter(n=16)<br />

real A( n,n), Temp(n,n)<br />

. . . (Ini t ialize A, numbe r_o f_i terations)<br />

do k=1 , number_o f_i terations<br />

Update non-edge eleme nts on ly<br />

do i=2, n-1<br />

Figure 1<br />

A Computation Expressed in fortran 77<br />

Figure 2<br />

do j=2, n-1<br />

Temp( i , j ) =(A(i , j-1 )+A(i, j+1 )+A(i+1, j ) +A( i-1 , j ) )*0.25<br />

enddo<br />

end do<br />

do i=2, n-1<br />

do j=2, n-1<br />

end do<br />

enddo<br />

end do<br />

A(i , j ) = Temp(i , j )<br />

i n teger n, number_o f_i terations, i , j , k<br />

pa rameter (n= 16)<br />

rea l A


8<br />

l'rocessor O<br />

SEND<br />

A(8, 2) . . A(8, 8)<br />

ro Processor I<br />

SEND<br />

A(2, 8) A(S, 8)<br />

ro Processor 2<br />

RECEIVE<br />

A(9, 2) .A(9, 8)<br />

ti·om Processor I<br />

RECEIVE<br />

A(2, 9) ... A(Il, 9)<br />

from Processor· 2<br />

Processor<br />

SEND<br />

A(9, 2) .. A(9, 8)<br />

to Processor 0<br />

SE:'-JD<br />

A(9, 8) . . A(lS, 8)<br />

ro Processor 3<br />

RECEIVE<br />

A(8, 2) ... A( 8, R)<br />

from Processor 0<br />

RECE IVE<br />

A(9, 9) ... A( I 5, 9)<br />

ri·om Processor 3<br />

Figure 4<br />

Compiler-generated Communication to r


We now describe some of the HPF language exten­<br />

sions used to manage parallel data .<br />

Distributing Data over Processors<br />

Data is distributed over processors by the<br />

DISTRI BUTE directive, the ALIGN directive, or<br />

the default distribution .<br />

The DISTRIBUTE Directive For parallel execution of<br />

array operations, each array must be divided in mem­<br />

ory, with each processor storing some portion of<br />

the array in its own local memory. Dividing the array<br />

into parts is known as distributing the array. The HPF<br />

DISTRIBUTE directive controls the distribution of<br />

arrays across each processor's local memory. ft does<br />

this by spcci�'ing a mapping pattern of data objects<br />

onto processors. Many mappings are possible; we illus­<br />

trate only a kw.<br />

Consider fi rst the case of a 16 X 16 arrav A in an<br />

environment with to ur processors. One possi ble speci­<br />

fication tor A is<br />

I hpf$<br />

rea l AC16, 16)<br />

distribute A(*, block)<br />

The asterisk ( *) tor the fi rst dimension of A means<br />

that the array elements are not distributed along<br />

the tirst (vertica l) axis. In other words, the elements<br />

in any given column are not divided among differ­<br />

ent processors, bur are assigned as a single block to<br />

one processor. This type of mapping is rekrred to as<br />

serial distribution. Figure 5 illustrates this distribution.<br />

The BLOCK keyword tor the second dimension<br />

means that for any given row, the array elements are<br />

distributed over each processor in large blocks. The<br />

blocks are of approximately equal size-i n this case,<br />

they are exactly equal-with each processor holding<br />

one block. As a result, A is broken into fo ur contigu­<br />

ous groups of columns, with each group assigned to<br />

a separate processor.<br />

Another possibility is a ( *, CYCLIC) distribution .<br />

As in (*, BLOCK), all the elements in each co lumn are<br />

assigned to one processor. The elements in any given<br />

row, however, are dealt out ro the processors in round­<br />

robin order, like playing cards dealt out to players<br />

around a table. \Vhen elements are distributed over n<br />

processors, each processor contains every 11th column,<br />

starting h·orn a ditkrent offset. Figure 6 shows the<br />

same array and processor arrangement, distributed<br />

CYCLIC instead ofl3LOCK.<br />

As these examples indicate, the distributions of the<br />

separate dimensions are independent.<br />

A (BLOCK, BLOCK) distribution, as in hgure 3,<br />

divides the array into large rectangles. In that ti gure,<br />

the array clements in any given column or any given<br />

row are divided into rwo large blocks: Processor 0 gets<br />

A( l :8, 1 :8), processor l gets A(9:16, l :8), processor 2<br />

gets A( 1 :8, 9:16), and processor 3 gets A(9:16,9 : 16).<br />

I<br />

I<br />

0<br />

Figure 5<br />

A (*, BLOCK) Disrriburion<br />

0<br />

2 3 0 2 3 0<br />

Figure 6<br />

A(*, CYCLIC) Distri burion<br />

2<br />

I<br />

II I<br />

I<br />

I<br />

2 3 0<br />

3<br />

2 3<br />

The ALIGN Directive The ALIGN directive is used to<br />

speci�' the mapping of arrays relative ro one another.<br />

Corresponding elements in aligned arrays are always<br />

mapped to the same processor; array operations<br />

between aligned arrays are in most cases more efficient<br />

thJn Jrray operations between arrays that are not<br />

known to be aligned .<br />

The most common use of ALIGN is to specifY that<br />

the corresponding clements of rwo or more Jrrays be<br />

mapped identically, as in the following example:<br />

DigitJI Tcchnic;\l Journal Vol. 7 No. 3 <strong>1995</strong> 9


10<br />

! hpf$ align A(i ) wi th B(i )<br />

This example specifies that the two arrays A and Bare<br />

always mapped the same way. More complex align­<br />

ments can also be specitled. For example:<br />

! hpf$ align E(i) wi th F(2*i-1 )<br />

In this example, the elements of fare aligned with the<br />

odd elements ofF In this case, 1:· can have at most half<br />

as many elements as F<br />

An array can be aligned with the interior of a larger<br />

array:<br />

real A( 12, 12)<br />

rea l B(16, 16)<br />

! hpf$ align A(i, j ) with B(i+2, j+2)<br />

In this example, the 12 X 12 array A is aligned with<br />

the interior of the 16 X 16 array B (see Figure 7). Each<br />

interior element of B is always stored on the same<br />

processor as the corresponding element of A.<br />

The Default Distribution Va riables that are not explic­<br />

itly distributed or aligned are given a default distribu­<br />

tion by the compiler. The default distribution is not<br />

specified by the language: dinerent compilers can<br />

choose different default distributions, usually based<br />

on constraints of the target architecture. In the DEC<br />

Fortran 90 language, an array or scalar with the default<br />

distri bution is completely replicated. This decision was<br />

made because the large arrays in the program are the<br />

significant ones that the programmer has to distri bute<br />

explicitly to get good performance. Any other arrays<br />

or scalars will be small and ge nerally will benetit from<br />

being replicated since their values will then be available<br />

everywhere. Of course, the progra mmer retains com­<br />

plete control and can specif}1 a ditre re nt distribution<br />

to r these arrays.<br />

Re plicated data is cheap to read but generally<br />

expensive to write. Programmers typically use repli­<br />

cated data fo r information rh:�r is computed infre­<br />

quently but used ofi:en.<br />

B<br />

Figure 7<br />

An Example of A.rrav Alignment<br />

A<br />

<strong>Digital</strong> Technic:d )ournJI Voi. 7 I o.3 <strong>1995</strong><br />

Data Mapping and Procedure Calls<br />

The distribution of arrays across processors introduces<br />

a new complication fo r procedure calls: the inte rface<br />

between the procedure and the calling program must<br />

take into accoum not only the type and size of the rel­<br />

evant objects but also their mapping across processors.<br />

The HPF language includes special forms of the<br />

ALIGN and DISTRIBUTE directives fo r procedure<br />

interbces . These allow tl1 e program to speci�' whether<br />

array arguments can be handled by the procedure as<br />

they are currently distributed, or whether (and how)<br />

they need to be redistributed across the processors.<br />

Expressing Parallel Comp utations<br />

Parallel computations in HPF can be identified in four<br />

ways:<br />

• fortran 90 array assignments<br />

• FORALL statements<br />

• The IN DEPENDENT directive, applied to DO<br />

loops and rORALL statements<br />

• Fortran 90 and HPF intrinsics and library fu nctions<br />

In addition, a compiler may be able to discover paral­<br />

lelism in other constructs. ln this section, we discuss<br />

the fi rst two of these paral lel constructions.<br />

Fortran 90 Array Assignment In Fortran 77, operations<br />

on whole arrays can be accomplished only through<br />

explicit DO loops that access array clements one at a<br />

time. Fortran 90 array assignment statements allow<br />

operations on entire arr:�ys to be expressed more simply.<br />

In Fortran 90, the usual intrinsic operations fo r<br />

scalars (arithmetic, comparison, and logical) can be<br />

applied to arrays, provided the arrays arc of the same<br />

shape. for example, if A, B, and C arc two-dimensional<br />

arrays of the same shape, the statement C = A + B<br />

assigns to each element of C a value equal to the sum<br />

of the corresponding clements of A and B.<br />

In more complex cases, this assignment syntax can<br />

have the eftect of drastically simplit),ing the code. For<br />

instance, consider the case of three-d imensional<br />

arrays, such as the arrays dimensioned in the fo llowing<br />

declaration:<br />

real D(10, 5 : 2 4, -5 :M), E(0:9, 20, M+6)<br />

In Fortr:�n 77 S)'ntax, an assignment to every ele­<br />

ment of D requires triple-nested loops such as rhe<br />

example shown in Figure 8.<br />

line:<br />

In Fortran 90, this code can be expressed in a single<br />

D = 2.5*D+E+2 .0<br />

The FORALL Statement The FORALL statement is an<br />

H PF extension to the American National Standards<br />

Institute (ANSI) Fortran 90 standard but has been<br />

included in the dratt Fortran 95 standard .


do i = 1 , 10<br />

do j = 5, 24<br />

do k = -5, M<br />

D(i, j, k)<br />

end do<br />

end do<br />

end do<br />

Figure 8<br />

An Example of'' Triple·nes[cd Loop<br />

FOIW...L is a generalized fo rm of Fortran 90 array<br />

assignment syntax that allows a wider variety of array<br />

assignments to be expressed . For example, the diago­<br />

nal of an array cannot be represented as a single<br />

Fortran 90 array section . Theretore, the assignment of<br />

a value to every element of the diagonal cannot be<br />

expressed in a single array assign ment statement. It<br />

can be expressed in a FOIW...L statement :<br />

rea l, dimension(n, n) A<br />

fora l l Ci = 1 : n) ACi, i ) = 1<br />

Although FORALL structures serve the same pur­<br />

pose as some DO loops do in Fortran 77, a FORALL<br />

structure is a parallel assignment statement, not a<br />

loop, and in many cases prod uces a difterent result<br />

fr om an analogous DO loop. For example, the<br />

FORALL statement<br />

fora l l ( i = 2:5) CCi, i ) CCi-1 , i-1 )<br />

applied to the matrix<br />

ll 0 0 0 0<br />

0 22 0 0 0<br />

c 0 0 33 0 0<br />

0 0 0 44 0<br />

0 0 0 0 55<br />

produces the to llowing result:<br />

On the other hand , the apparently similar DO loop<br />

do i = 2, 5<br />

C(i, i ) = C( i-1 , i-1 )<br />

end do<br />

produces<br />

c<br />

11 0<br />

0 11<br />

0 0<br />

0 0<br />

0 0<br />

0 0 0<br />

0 0 0<br />

ll 0 0<br />

0 ll 0<br />

0 0 11<br />

2 . 5 *D( i , j, k) + E(i-1 , j-4, k+6) + 2.0<br />

This happens because the DO loop iterations are per­<br />

to rmed sequentially, so that each successive element of<br />

the diago nal is updated betore it is used in the next<br />

iteration. In contrast, in the FORALL statement, all<br />

the diagonal elements are fetched and used betore any<br />

stores happen .<br />

The Ta rget Machine<br />

<strong>Digital</strong>'s DEC Fortran 90 compiler generates code<br />

to r clusters of Alpha processors running the <strong>Digital</strong><br />

UNIX operating system. These clusters can be separate<br />

Al pha workstations or servers connected by a fiber dis­<br />

tributed data interface (FDDI) or other network<br />

devices. (<strong>Digital</strong>'s high-speed GIGAswitch/FDDI sys­<br />

tem is particularly appropriate.") A shared-memory,<br />

symmetric multiprocessing (SMP) system like the<br />

AJphaServer 8400 system can also be used. In the case<br />

of an Sl'v!P system, the message-passing library uses<br />

shared memory as the message-passing medium; the<br />

generated code is otherwise identical. The same exe­<br />

cutable can run on a distributed-mernory cluster or an<br />

SM P shared -memory cluster without recompiling.<br />

DEC Fortran 90 programs use the execution envi­<br />

ronment provided by <strong>Digital</strong>'s Parallel Sothvare<br />

Environment (PSE), a companion prod uct -'" PSE<br />

is responsible fo r invoking the program on multiple<br />

processors and for performing the message passing<br />

req uested by the generated code.<br />

The Architecture of the Compiler<br />

Figure 9 illustrates the high-level an:hitecture of<br />

the compiler. The curved path is the path taken<br />

when wmpi lcr command-line switches are set to r<br />

compiling programs that will not execute in parallel ,<br />

or when the scoping unit being compiled is declared<br />

as EXTRINSIC(HPF _LOCAL).<br />

Figure 9 shows the fr ont end, tr�mstorm, middle<br />

end , and GEM back end components of the com piler.<br />

These components fu nction in the to llowing ways:<br />

• The front end parses the input code and produces<br />

an internal representation containing an abstract<br />

syntax tree and a symbol ta ble. It pedorms exten­<br />

sive semantic checking.'"<br />

Digit�! Tc( hnic;11 )oumal Vol. 7 No. 3 <strong>1995</strong> ll


12<br />

Figure 9<br />

Compiler C:om[10llents<br />

• The transform component pertorms the transtor­<br />

rnation from globai-HPF to cxplicit-SPMD fo rm .<br />

To do this, it localizes the add ressing of data, inserts<br />

communication where necessary, and distribures<br />

p


LOW ER ph;1sc implements rhis lw turning c:tch<br />

forrr.1n 90 :liT:l\' assign ment into an cqui1·;1k11t<br />

rORA.l.L statement (acrualh', into the dorrcc repre­<br />

sentation of one ). This unitorm rcprescnt:.1rion m c:1 ns<br />

rh:1t rhe compiler has t�1r tcll'er special cases ro consider<br />

than orhcn1·isc might be necess;1ry and leads ro no<br />

degr;1lbtion of the generJted code.<br />

DATA<br />

The DATA phase specities where dara lives. Pbcing<br />

and addressing data correctlv is one ofrhc major rasks<br />

ofrr:1 nstcmn . Ih: n: are :1 large number of possibilities:<br />

\V hcn :1 nluc is ;1\'ailable on c1 en· processor, it is<br />

S;1id ro be I'C1Jiicoted. When it is a1·aibblc on more than<br />

one hut nor :1 11 processors, it is said ro be ;wrtioll l '<br />

rej;/icotecl. for instance, a scalar m:ll' lil'c 011 onh' one<br />

processor, or on more than one processor. Tvpic:-rl h·, a<br />

scJI:Jr is replic1tcd-it lilTS on all processors. The rcpli­<br />

cnion of scalar d:.1u m;1kes ktches cheJp because e;Kh<br />

processor has a copy of the req uested \';1lue. Stores to<br />

replic:1red sc::� Llr data can be expensive, howcl'e r, if the<br />

value ro be stored has not been replicated. In rhar cJse,<br />

the value ro be stored must be senr to each processor.<br />

The s:1me consideration applies to Jr-r::�vs. Arravs<br />

mJv lx: replicJted, in ll'hich c1se c1ch processor h;1S a<br />

copv ofJn crnirc :1rrav; or arravs m:1v be p;1rti;1llv rc pl i ­<br />

carcd, in ll'hich c1se c::�ch element o f r hc arr.l\' is ::tl :lil­<br />

able on ;1 subset of the processors.<br />

hrrrl1crmorc, :l iTa\'S that :1re nor repl icated 111:11' be<br />

distributed Jcross the processors in se1crJI diftcrcn r<br />

t:lshions, as nplained abm'e , In t:K t, each dimension<br />

of each ;1 rrav ma\' be distributed indcpcndcmly of<br />

the other dimensions. The HPr m:�.pping direc tives,<br />

princip;1lil' ALIGN ;md DISTRI BUTE, give the pro­<br />

grammer the :r bi lity to specii}' completely how each<br />

dimension of each arrav i s bid ouc DATA uses the<br />

inhm11;1tion in these directives to construct ;1 11 internal<br />

description m data space of the i:11·our of each arrav.<br />

ITER<br />

The ITER ph


14<br />

ln our impkmentation, the CJIIer ped(mns all<br />

remapping. If remapping is necess:�n·, AR


ea l A(100, 20), 8(100, 20)<br />

1 h pf$ distribute A(cyc l i c , block), B(cy clic, block)<br />

(a) Code Fragment<br />

fora l l (i = 2 : 99)<br />

A(i , k) = B(i , k)<br />

end fora l l<br />

m = my_ processor ( )<br />

if k mod 5 = Lm/4J then<br />

do i = ( i f m mod 4 = 0 then 2 else 1 ) , ( i f m mod 4<br />

end i f<br />

A( i , Lk/5Jl = B(i, Lk/5Jl<br />

end do<br />

(h) Pseudocode Cennared for Code Fragmenr<br />

Figure 11<br />

Code Fragtncnr and Pseudocode GcnCLHCd r(>r Code Fr"


16<br />

If the :11T,11'S A and B arc bid o u r as in Figu re 12 md<br />

if 13 is to be assigned to A, the n JIT�11· clcmems LJ( 4 ) ,<br />

13( S ) , �1 11d /J( 6 ) , all of 11 h ich li1 c 011 p roccssm 6,<br />

shou ld be scm to processor 1 . Clc�1d1·, \\'c do 11ot 11 :1nt<br />

to gcncr�ltc three distinct messages t(Jr this. Thcrd(Jre,<br />

we collect these tilrce clements ;1 11d ge nn�Hc one mes­<br />

sage con taini ng :1ll three of them. This oamplc<br />

i nvolves full knowledge.<br />

Communic:�tions involving p�1rti


Figure 14<br />

Sh.1do11· Edges Fo r· c1 :\ccHcsr-ncighbor· Computation<br />

cor11munication involving even kss 0\'Crhead than<br />

general communicnion im·olving fu ll knowledge.<br />

• No local copying. If sh:1dow edges were not used ,<br />

then the t(lllowing stambrd processing would take<br />

pbce: �or c:1ch shitted-Jrr:Jy rdi.: rcncc on the right­<br />

hand side of the ''ssignmcnt , shift the entire :1rray;<br />

thcn idcmi�· that parr ol . . thc shir'tcd array that lives<br />

loc1ll\' 011 each processor and create a local tem po­<br />

r,m• to hold it. Some of that tcmporarv (the part<br />

reprcscmi ng our sh,ldo\\' edge ) II'Ould be moved in<br />

from :1 neighboring processor, and the rest of the<br />

tcmporarv \\'(lli ld be copied loc1llv ri·om the origi­<br />

nal arLl\'. Our processing el im inates the need fo r<br />

the local tcmpor,u·,· and tix rhc local co�w, which is<br />

substami,ll t!. >r large '' rravs.<br />

• Greater local in· of rdi.: rcncc. When the :�cn!al com­<br />

pur:�tion is pcdcmncd, grc.1tcr localin· of reference<br />

is ,1 chiC\ni because the sh,ldOII' edges (i.e., the<br />

rccci,·cd ,·a lues ) :11-c 110\\' p:�rt of the atTJ\', r:1ther<br />

than being '' tcmporan· somc\\·hcrc else in mcmon·.<br />

• Fcll'er mcss,1gcs. hn,1lly, the optimiz,nion also<br />

makes it possible fo r the compiler to sec that some<br />

messages m,1,. be combined into one message,<br />

rhnelw reducing the number of messages that<br />

must be sent. for insuncc, if the right-hand side<br />

of the '1ssignment st,Hc mellt in the above exam ple<br />

also co11t,1incd '1 term A( i + l ,.f + 1 ), C\'Cn though<br />

overlapping shJdo\\' edges and Jn additiona l<br />

shJdO\\' edge \\'ou ld now he in the diagonally adja­<br />

cent processor, no :�ddirional communicuion<br />

\\'ou ld need to be generated .<br />

Reductions<br />

The SUM intrinsic fu nction of t-:ortrJn 90 takes an<br />

arrav argument and returns rhc sum of all its elements.<br />

Altcrn;�tivclv, SUM c1n return an '1 rrav ll'hose rank is<br />

one less thJn the rJnk of its '1 rgumem, and eJch of<br />

\\'hose ,.,1lucs is the sum of the clements in the argu­<br />

ment along .1 line p'1rallcl to J spcciticd dimension.<br />

In either case, the rank of the result is less than that of<br />

the argument; thcrd(m:, SUM is rdi.: rred to ''s ,,<br />

reduction intrinsic. Fortran 90 includes J bmil\' of<br />

such reductions, :1nd HPF


18<br />

is repl icated or distributed , (2) the different distri­<br />

butions that uch �liTa\· dimension might ha\·c, a11d<br />

( 3 ) whether the n:ducri on is comp lete or p:�rti�1 1 (i.e.,<br />

\\'ith a DL\11 argument).<br />

Run -time Preprocessing of Irregular Data Accesses<br />

Run-time preprocessing of i rregu l ar data accesses 1s<br />

a popu l:lr teehni


is a slice through either the water or the atmosphere.<br />

Partial difte rcntial equations relate the variables.<br />

The model is implemented using a �l nite-diftcrence<br />

method that approximates th e partial differential<br />

equations at each of the mesh points.1' Mode ls based<br />

on partial difterenti�ll equations are ar the core of many<br />

simulations of physical phenomena; finite difkrcnce<br />

methods arc commonly used �or sol ving such models<br />

on com puters.<br />

The shallow-water program is a widclv quoted<br />

benchmark, partly because the program is small<br />

enough to examine and tune careful l �', vet it per�orms<br />

real computation representative of many scientitlc si m­<br />

ulations. Unlike SPEC and other be nchmarks, the<br />

source �or the shallow-water program is not controlled.<br />

The shal low-water benchmark was written in H PF<br />

and run i n p�1 ra llcl on workstation ��mns using PS E.<br />

There is no ex pl ici t message-passing code in the pro­<br />

gram. We modified the Fo rtran 90 version that<br />

Applied Parallel Research used �or its benchmark data.<br />

The f90 / H Pf version of the program t


20<br />

of heat tlow and elcctrostJtic or gravitational poten­<br />

tial. We have investigated a Poisson soh·cr using the<br />

conjugate-gradient algorithm. The code exercises<br />

both the nearest-neighbor optimizations and the<br />

inlining abilities of the DEC FortrJn 90 compiler.''<br />

Ta ble 3 gives the timings :111d speedup obtained<br />

on a 1000 X 1000 JrrJv. The h:1rdw:1re and sothvare<br />

contlgurations arc identical to those used for the<br />

shallow-water timings.<br />

Red-black Relaxation<br />

A common method of solving p:trtial difkrenti:tl<br />

equations is red-black relaxation -'" vVe used this<br />

method to solve the Poisson equation in J three­<br />

dimensional cube. We compare the p:1rallelization<br />

of this algorithm tor a distributed-memorv svstem<br />

(a cluster of Digita l Alpha workstations) wirl� P�rallcl<br />

Virtual Machine ( PVM ), which is an explicit message­<br />

passing li brary, and with HPF.l' These algorithms arc<br />

based on codes written by Klose, Wolton, and Lemke<br />

and made a\·ailable as pan of the suite of GENESIS<br />

distributed -memorv benchmarks.'"<br />

Table 4 gives the speedups obtained t(x both<br />

the HPF and PVM \'ersions of the program, which<br />

soh·es a 128 X 128 X 128 probJem, on a cluster of<br />

DEC 3000 Model 900 workstations connected bv an<br />

FDDI/GIGAswitch system. The speedups shown arc<br />

relative to DEC Fortran 77 code written for and run on<br />

a single processor. This table shows th:lt the HPF ,·er­<br />

sion perf(m11S somewhat better than the PVM version .<br />

There is a significant ditkrence in the complexitv ot·<br />

the programs, however. The PVM code is quite intri­<br />

cate, because it requires that the user be responsibJ e<br />

tor the block partitioning of the volume, and then tor<br />

explicitly copying boundary t:Kes between processors.<br />

By contrast, the HPF code is intuitive and tar more<br />

easilv maintained . The reader is encouraged to obtain<br />

the codes (as described abm·e) and compare them.<br />

Ta ble 3<br />

Ta ble 4<br />

Speedups of DEC Fortran 90/H PF<br />

and DEC Fortran 77/PVM on<br />

Red-black Code<br />

DEC Fortran 77<br />

DEC Fortran 77/PVM 7.01<br />

DEC Fortran 90/H PF 8.04<br />

,--- <strong>Number</strong> of Processors ,<br />

8 4 2 1<br />

3.73<br />

4. 10<br />

1.79<br />

1.95<br />

1.00<br />

1.05<br />

In conclusion, we have shown that important algo­<br />

rithms fa miliar to the scientitlc and technical commu­<br />

nity can be \\'ritten in HPF. HPF codes scale well to at<br />

le


References and Notes<br />

I. High Perfornuncc fortran Forum, "High Pc rtormJnce<br />

Fonran L1 ngu:1gc Specitic:�rion, Version 1.0,"<br />

Scient iji'i.: Pmg l'(.lllllllillp,, ,·ol. 2, no. I ( 1 993). Also<br />

Jvai!Jblc :1s Tcc bnicJI Report CR.PC-TR93300, Center<br />

tor ReseJrch on .Par;11lcl Computation, Rice Uni\'ersitv,<br />

Houston, Te x.; ;\11d via anonymous ftp ti·om<br />

tiL\JU.:s.ricc.edu in rhe dircctorv public/HPFF /d raft;<br />

version I .1 is rhe ti le hp(y l 1 .ps.<br />

2. C. Ko elbcl, D. Lo,·cmJn, R. Schreiber, G. Steele, Jr.,<br />

and M. Zoscl, The High Perj(;rmance Fonmn<br />

/-land hook (Cambridge, Mass.: J'vi iT Press, 1994 ).<br />

3. Dig ital J-tt;� h Perj'ormance Fortran 90 HPF and<br />

PSE Ma nual (Mavnard, Mass.: <strong>Digital</strong> Eq uipment<br />

Corporation, <strong>1995</strong> ).<br />

4. {)f:'C hll'irWI 90 La uguap,e !?ej'erence Manual (Maynard,<br />

i'vLlss.: <strong>Digital</strong> Equipment Corporation, 1994).<br />

5. E. Albert, K Knobc, j. Lukas, and G. Steele, Jr., "Compiling<br />

ForrrJl\ 8x Arrav Features fo r the Connection<br />

Machine Co1nputcr Svsn::m," S) 'mposium on Pam/lei<br />

Prog mmming L\perience tl'ith Applicatio ns.<br />

Lw·1p, 11agc>s. wnl ,)).'stems. ACM .sJG'PIAN. j ulv 19R8.<br />

6. K. Knobc, ]. Lukas, ;H \d G. Steele, Jr., " Massively Par­<br />

;\! lei Dat


22<br />

2S. B. Boulter, "Perr(mn:mce E,·,llucHion ofH l'J-' J·( )r Scien­<br />

ritic Compu ting," Pmc( ·et liii,U,S u/f ll:u,h l'crjiJf'l!7Cli/Ce<br />

Cumpu!ing one/ .\ i'/1/'(JJldiiJ..',. l.eclure . .\'u/es ill<br />

0JIIIjmlel' Scie7!Ce 91 il',Jl<br />

.md dei'L·Iopmcnr otrhc trcmst(mn componcm ofrhe IJ FC:<br />

�mrrcm 90 compikr. Bdcl!T joining Di);i!.ll in I 99 I , he 11 Js<br />

1111 ohed in the design ,\Jld del elo[m leJlt of compilers ;Jt<br />

C:oll11'·''-5, lnc., })l'ior ro rh.11, he 11 mkcd on COlllJ'i krs .md<br />

sotiw·.1re tools cH Rcll'rhcon Coq'. He holds '' H.S.r:. in<br />

cD!llJlUtn science and en!..!.im·eri ng from the l' ni,·ersit'l'<br />

of }\:J\n)I·JI;mia i J98.J. ) :\;ld ,\n ,\\\ Ill COJll})lJI'LT Sc'Jel;Ce<br />

ti·om 1\osron Uni1 crsm· ( 1990 ).<br />

\'ol. 7 :--lo .'\ 1'.19:i<br />

M. Regina Bolduc<br />

Rcgi n.1 l�olduc joined <strong>Digital</strong> in 1991; she is a principal<br />

so tiwc1rc engineer in rhe High l'ertc mnancc Computing<br />

CirouJ' 1\egin;l "·"s in1·oh cd in the de1·elopmem of the<br />

rrcmst(mn c1nd f)'()llf end conlpoJKilfS ofrbc DEC: �o1·n·.1n<br />

90 C0111J'ikr. l'ri())' to th is 11 ork, she 11 .1s ;1 :.cnior membLT<br />

Dfrhe tech11iol src1tLH C :ompa'>.s, Inc., 11·hcrc she 11 01·ked<br />

On the de.sign Cllld dcl·eJOJ'Il1CIH OrC0111piJers cl lld COJl) ['ikJ··<br />

gcner;ltor rools. 1\egi n;l recei1ed J lL\ . in mcHhcmcHics<br />

Ji·om Emm.111uel College in J 957.<br />

Jill Ann Diewald<br />

Jill Dic11;1 ld UlJIHihltted to rhe design and implcmcllfcl­<br />

fiOil ofrhe tr.msl(mn COll\J'OllC!H ofthe DEC Forrr;Jn 90<br />

compiler. She is cl princiJ'''I sofm·c\rc engineer in the 1:-l igh<br />

Pe rt()J·mcmce C:ompuring (;roup. Before joining Digit;l l<br />

in 199 I , Jill ll 'as cl rcchniul coordin;Hor clr CompJss,<br />

lnc., ll'lleJ-c she helped design :�m1 den:lol' compikrs c1nd<br />

cumpiln-relcHL'd tool.s. Prior ro rhJt JX>sition, she 11 o1"ked<br />

.H lnno\;Hilc S\'SteJns TL·clllliLJues and D�rc\ 1\esou rcc.s,<br />

[nc. on J'J'OgraJnJning hngu.1ges rhJr prm ide economic<br />

.m.1hsis, modeling, cmd d.l tclb�sL· c:1pc1biliries t(Jr rhe ti ncm­<br />

ciJi nurkeq,bcc. She hclS " B.S. in compurer science trom<br />

rhc l' ni,crsir1· of,\ lichigc1n<br />

Israel Gale<br />

lsr;\cl


Neil W. Johnson<br />

Rd(m: coming to Digiul in I 99 1, Neil johnson was a<br />

staff scicnrist :Jt Compass, Inc. He has more rhan 30 vears<br />

ofcxpnience in the dcvdopmenr of compilers, including<br />

work on the vc.:crorizarion and optimiLation phases and<br />

tools t( H· compiler devcloprnenr. As


24<br />

Design of <strong>Digital</strong>'s<br />

Parallel Software<br />

Environment<br />

<strong>Digital</strong>'s Pa ra llel Software Environ ment was<br />

designed to support the development and exe­<br />

cution of scalable pa rallel applications on clus­<br />

ters (fa rms) of distributed- and shared-memory<br />

Alpha processors running the <strong>Digital</strong> UNIX oper­<br />

ating system. PSE supports the parallel execu­<br />

tion of High Performance Fortran applications<br />

with message-passing libraries that meet the<br />

low-latency and high-bandwidth communica­<br />

tion requirements of efficient parallel comput­<br />

ing. It provides system management tools to<br />

create clusters for distributed para llel process­<br />

ing and development tools to debug and pro­<br />

file HPF programs. An extended version of dbx<br />

allows HPF-distributed arrays to be viewed,<br />

and a parallel profiler supports both program<br />

cou nter and interval sampling. PSE also supplies<br />

generic facilities required by other parallel lan­<br />

guages and systems.<br />

Dig,it:ll TcchnicJl ]uurnal<br />

I<br />

Edward G. Benson<br />

David C.P. LaFrance- Linden<br />

Richard A. Wa rren<br />

Santa Wiryaman<br />

Digir:t l's PJr�1llcl Sofr11·;1t-c Etll'i ronmcnr ( PS E) 11 as<br />

dcsig:ncd ro support the dc,·clopmenr and execution<br />

of scabble par:rllcJ :t pplic:t rions on clusters (brms) of<br />

distri buted- :md slun:d -mcmorv Alpha pr


tr:msmissio11 con trol prorocoljinrcrncr protocol<br />

(TCP/lP). Netll"orking tech nologies em r:mge h·om<br />

si mple Ethernet to �DDI, AT M, and MEMORY<br />

CHA\:;-.:EL. l\1 r�1 l l e l e\ecution is most dli cie nr ,,·hen<br />

the imercon nect technolog1· oftcrs hi gh - ba nd11·idth<br />

and lm1·-latencv com munications ro the user ar the<br />

process level. When bu ildi ng a cluster t(x p�1 ralle l pro­<br />

cessing, rhe bisectional ba ndwidth ofrhe communica­<br />

tions b bri c shou ld scale with the number of processors<br />

in the cluster. In practice , such a contiguration cu1 be<br />

:�c hieved lw building clusters using Alph:1 processors<br />

�u1d D igi r�1l's GIGAswirch/FDDf as components in �1<br />

m u lti stage s11·itch configuration.45 Fi gures l �md 2<br />

sh o11· tii'O oamples of PS E cl uster contigu r:nions.<br />

Although the design center tor PSE is :1 set ofm�1ehines<br />

connected by �� high - speed local area i nterconnect, a<br />

cluster can be constructed that incl udes remote<br />

111�1chincs con nected Lw a ll"ide :�rca llCtll"ork.<br />

PS E is �� col lection of tn


26<br />

Figure 2<br />

l'S E Multistage S\\ itch C:ontig;ur,uion<br />

Before dc1·eloping I'Sf. t(Jr use \\ith Hf'F prog1·,1ms,<br />

Digi tal considned r11 o major altem,ni' cs : the d istribu<br />

ted compu ti ng cn1·ironmcnt ( DC E) ,md l'VM.'''<br />

(At that time, the mess,Jgc-rJassing i ntcrhcc I JVl l'l J<br />

stand ard efl(Jrr \\'JS in progress. �)<br />

Although a good model t(Jr client-server :1 pplicnion<br />

dcpl ovment, DCL 1s desi gned fo r usc wi th I'CillOte C:l'U<br />

resources 1i1 procedure calls to libraries. This model<br />

is 1·en· difkrcnr ti·om the data-p,u·allel :1 11d mcss,lgcpassi<br />

ng nature of distributed parallel processing. Irs<br />

svnch ronous procedure call model requires the oten­<br />

si,·e use ofrh rc1ds. In ,,ddirion, DCE cont,lins '' si gnit�<br />

icant number of setup :�nd management Ll sks. For<br />

these reasons, we rejected the DCE environ ment .<br />

\'ol. 7 �"- 3 Jl)l)�<br />

Three m'1jm consider:�ti ons in out· choice to de1·clop<br />

PSE instead of using PVM \\'eJ'C stabilin·, ped(mllJilCc,<br />


the rr�1nsp�1rcnr us:�bilin· of J S\'lll lllctric multiproces­<br />

sor. EK ilirics such as :\fS (to moullt/sh�lrc ri le s1·srcms<br />

�1mong machines, in particular working directories )<br />

:llld IKt\Hlrk in r(mn arion scn·ice ( ::--.: Is) (also kno\\'n JS<br />

"vcllo\1' p:1gcs" an d mcd to share password ri les) arc<br />

ri·c quemlv used ro set up :1 common S\'Stc m cm·iron­<br />

rncnr. ln such �1 n cm·ironmc:nt, users em log into :m y<br />

m:�chinc �llld sec the same cnviron mcm. Other distrib­<br />

uted cnvirolllllC11tS such JS Load Sluring F:�ci lity<br />

( L


fKtors can help an applicuion choose the best set of<br />

members fo r parallel execution. This d\'llJ.mic inform:� ­<br />

rion is collected bv a d:�emon process (farmd). The<br />

farmd daemon process executes as :1 privileged (root )<br />

process on each cluster member :�nd liste ns to r requests<br />

on a well-known clusrcr-spcciflc UDP/lP port.<br />

Multiple cluster members ddincd in the configura­<br />

tion d:.uabasc :-tiT designated as lo::�d sen·e�·s. The load<br />

servers arc the central rcpositorv for the dvnJ.mic<br />

information for the entire cluster. Their farmd process<br />

periodicallv receives time-stamped upthtcs h· om the<br />

individual daemons. Applications querv the load<br />

servers for both static and dvnamic information .<br />

Applications do not themselves parse the database nor<br />

query the individu�ll farmd d�1emons running on each<br />

cluster member.<br />

Once PSE is inst::�llcd and configured, farmd is<br />

st:�rted each time the syste m is booted. Thc nJmc of<br />

the cluster that farmd \\'i ll service and the number of<br />

t�S E jobs (job slots) that \\'ill be :�I lowed ro run arc set.<br />

The inetd fa cility is used to restart farmd in res�)OilSc to<br />

UDP/IP connection requests, if farmd is nor run­<br />

ning _,., Usc of the inetd fJ cility to start farmd impro,·es<br />

the avaibbility of machines to run PSE applications by<br />

transparenrlv restarting farmd in the case of a bilure.<br />

As farmd daemons are st:�rted, they attempt to<br />

establish TC P/IP connections \\'ith their neighbors as<br />

ddincd bv the PSE configur


Figure 3<br />

PSE Usc<br />

COM PILATION<br />

SOURCE<br />

"MYPROG F90"<br />

�<br />

FORTRAN 90<br />

OPTIONS r--- - COMPILER<br />

(E G., -WSF 4)<br />

LIBRARIES:<br />

• FORTRAN<br />

·RUN-TIME<br />

• PSE<br />

-+<br />

COMMAND LINE<br />

SWITCHES AND<br />

ENVIRONMENT<br />

VARIABLES<br />

I<br />

�<br />

OBJECT FILE<br />

"MYPROG.U<br />

�<br />

STANDARD<br />

LINKER (LD)<br />

t<br />

EXECUTABLE<br />

"MYPROG.,<br />

�<br />

EXECUTION<br />

(E.G., SHELL)<br />

� �<br />

CONTROLLING<br />

PROCESS<br />

1- I-f---<br />

process re lationship between the io_managcr and peer<br />

processes. This relationship is used fo r exit status report­<br />

ing and process conrrol. It also enables or eases other<br />

activities, such as signal handling and propagation. Peer<br />

processes cre�ue com munication channels between<br />

themselves and perform standard !jO through a desig­<br />

nated peer. Stand:trd I/0 is fo rwarded to and fr om the<br />

controlling process through the io_manager. Figure 4<br />

shows a PSE application structure.<br />

Applica tion Initialization<br />

Prior to the execution of JllV user code, an initializa­<br />

tion routine executes automatically through fu nction­<br />

alit\' provi ded by the linker and loader. The<br />

initialization routine impkrnenrs both the controlling<br />

process fu nctions :m d the HPf-specitic peer initializa­<br />

tion. Because no explicit call is required , parallel HPF<br />

procedures can be used within non-HPF main pro­<br />

grams, :md proper initialization will occur. A simple<br />

HPF mJin program can also be used with PS E to start<br />

up and manage a task-parallel application that uses<br />

PVM or MPI tor message passing.<br />

In general, the controlling process places peer<br />

processes onto members of;l partition , although hand<br />

placement of indil'idual peers onto selected members<br />

PEER SPAWN<br />

AND CONTROL<br />

STATIC<br />

DATABASE<br />

(E.G , DNS)<br />

�<br />

LOAD SERVER<br />

I<br />

DYNAMIC INFORMA TION<br />

(E.G., LOAD)<br />

CLUSTER<br />

--- ---------- ---,<br />

.- ---<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

-- ------- -------,<br />

I<br />

HOST<br />

I<br />

I<br />

(CLUSTER MEMBER) I I I I<br />

I<br />

I<br />

! HOST<br />

(CLUSTER MEMBER) :<br />

I:<br />

- - - - - - - - - - - - - - - _I<br />

1- -------------- I<br />

I :<br />

I<br />

I<br />

I<br />

I<br />

HOST [SMP]<br />

(CLUSTER MEMBER) :<br />

L--------------..J<br />

! HOST<br />

(CLUSTER MEMBER)<br />

PARTITIONS<br />

------------ --------<br />

is possible. To achieve efficiency and f1 irness in map­<br />

ping a set of peers, the comrolling process consults<br />

with a load server for load-balancing information .<br />

Which members are used and the order in which they<br />

are used is based on each member's load average,<br />

number of CPUs, and number of available job slots.<br />

As an alternative, PS E may nup peer processes onto<br />

members based upon a user-selected mock of opera ­<br />

tion . In the dctault physical mode of operation, PSE<br />

maps one peer process per mem ber. In virtual mode,<br />

PS E a! loll'S more than one peer process per member,<br />

thereby cnJbling large virtual cl ustcrs. This is useful<br />

fo r developing and debuggi ng parJIIel programs on<br />

limited resources. Virtual clusters also i mprove appli­<br />

cation availability: when the requested number of peer<br />

processes is greater th an the :wailable set of partition<br />

members, applications continue to run; however, they<br />

may sufkr perf(mnance degradation.<br />

Application Peer Execution<br />

Each application peer process has an io_manager<br />

parent process that provides it with env iron ment<br />

initialization, exit value processing, l/0 burkri ng,<br />

signal to rwarding, and potemial scheduling priority<br />

and policy modification. Ra ther than include the<br />

<strong>Digital</strong> Technic:1l Jou rn:1l Vol . 7 No. 3 <strong>1995</strong> 29<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I<br />

I


30<br />

Figure 4<br />

KEY:<br />

CONTROLLING<br />

PROCESS<br />

PS E Ap1>l icuion Srn�mcesses exit ,,·irhour error and :tt<br />

�1pproxim;nch· the S


nonn�1l �lcti,·irY ll'ill cease after reeci1·ing the bsr nit<br />

noriricuion h·om an io_managcr.<br />

Peers rh�lt oit abnormalh· J11r the JpplicJtion.<br />

Consider one peer process that exits due ro �l segmen­<br />

tation bult and another that exits due to J tl oaring­<br />

poinr exception. There is no single exit value possi ble<br />

ti:>r the :1 pplication; PSE chooses the tirsr abnornJJI<br />

va lue it sees. �urrhcrmorc, as a result of error detec­<br />

tion in the comnHlllication Jibrarv, the other peer<br />

processes wi II exit ll'ith lost network connections. It is<br />

possible that the controlling process ll'ill sec an exit<br />

l':tlue ti: >r this cfkcr before it sees an exit ,.�1Im: ti:>r one<br />

ofrhc c1uses, resulting in r the core directo­<br />

ries cJn be set through an en1 ironmcnt vari�1blc.<br />

Issues<br />

Although l'S E Jchic,·es the srambrd UNIX look-andkel<br />

f(>r most 3pplication situations, complete trans­<br />

parency is nor achieved. f:or ex.�1mple, riming au<br />

applicarion-colltrolling process using the c-shell's<br />

built-in rime command, does not time user code or<br />

�) f'ol·ide meaningful statistics other rhJn the cbpscd<br />

ll'all clock rime to starr a p:lrallcl application and ro teJr<br />

it do11 n. Another situ:Jtion that highlights the p�lr:JIIcl<br />

n�1rure of I'S E occurs during applicnion debugging:<br />

multiple debug sessions arc StJrtcd lw running the<br />

:1 pplic::nion ll 'ith :1 debugger flag ra ther th�111 l11' using<br />

dbx direcrh·.<br />

Tools for HPF Progra mming<br />

The dc,·elopmellt model t( >r H Pf-b�1scd :1 pplicnions<br />

is �l rwo-srcp process. First, a serial I-'orrran 90 progrJm<br />

is wri tten, debugged , and optimized. Then it is paral­<br />

lelizcd with H PI-' directives and JgJin debugged and<br />

optimized . The de1·elopmenr tools supplied with PS E<br />

Jddrcss 11roriling and debuggi ng. Unlike most of PSI:::,<br />

ll'hich is l:1nguagc -independent, both rhe pprof protil­<br />

ing bcilitY and the "dbx in n ll'indoll's" debugging<br />

t:Kilin· �liT spccitic to HPF programming.<br />

Pro filing<br />

Sc,·eral issues in protlling par:�llcl HPF programs do<br />

not appl:' ro fortrJn progrr the execution of each evenr.<br />

This produces an accurate execution profile. El'cnts<br />

include rou tines, JtTa\' assignments, do loops, FORALL.<br />

constructs, messJge sends, and message rcceil'es.<br />

Because the cntn' ::1nd exit rimes Me recorded, rime<br />

spent executing other ei'Cllts within an ei'Cnt is<br />

included, which gi1'CS a hicrarchicJI profi le. To achiC\'C<br />

fi ne-resolution timings (single-digit na noseconds), the<br />

AJpha process cvcle coumcr is used to measure ti me.'''<br />

Analysis The pprof urilirv prol'ides many different<br />

ways to en mine and report on a large ser of protiling<br />

data ti·om a p3rallel program execution. Diftc rmr<br />

approaches include f( >eusi ng on routines, statements,<br />

or communicnions. In contrast, prof reports on proce­<br />

dures onlv. With pprof, the scope of the analvsis cJn be<br />

limited to a single peer process or cncompJss all Jppli­<br />

cation processes. The range of reports generated can be<br />

comprehensive or limited to a number of c1·enrs or<br />

Digir.1l Techniol )ournll Vol. 7 :-.Jo. 3 199S 31


�' perecnc1gc of time. Users em spccir\• their reports<br />

fi·om a combination of a nalysis, report tcm11:-tt, :� nd<br />

scoping options . By debult, rhe pprof u rilitv reports on<br />

routine-level acri,·itY a1·eraged across a ll peer pmcesscs,<br />

11·hich prm·ides an 0\nal l 1·ie11· of :-tppliCJtion beh;n ior.<br />

Parallel programs execute most eftl.cicnrl1· ll'hen<br />

there is minimum communicnion betll'ccn processes.<br />

The high-level, data par.1llel natu re of the HPf<br />

language reduces the 1·isibilitv of· comnu1 11icnion to<br />

the programmer. To make tuning easier, pprof 11·:-ts<br />

designed ll'ith the abilin· to tclCus tuning 011 communi­<br />

cation. Reports can be gc nc r;Hcd th


is implemented using J I"CCtor or' pointers to fu nctions<br />

II ithin J sh�1red Jibr:trl', pr01 ides rl e\ibi!in· to illtmd uce<br />

ne11· prorocols �1 11d media without ha,·ing to recompile<br />

or relink existing applications.<br />

Shared-memory Message Passing<br />

The usc of shared mcmorv as a mcssage-pass1ng<br />

medium :l i!OWS [()!' I'Cr V high pcrf()rlllancc because<br />

data docs 11or have to be copied. When design i ng<br />

shJred-memorv messaging, we looked :H a variety of<br />

interrcbted issues, includ i ng coordin�nion mecha­<br />

nisms, memorv-sharing strategies, and memorv con ­<br />

sumption. The usc or- locks (i.e., semaphores ) in the<br />

trJditional m:1nner to coordinate access ro shared­<br />

mcmon· segments pron:d problem atic. �or oample,<br />

clicnrs often req uest a message ti·om


34<br />

The UDPIIP option provides better bandwidth<br />

than the TCP liP with smaller messages and matches<br />

the TCP I r P bandwidth at large message size. The<br />

user-level latency reduction, however, was less than<br />

expected. The next t\vo sections discuss our investiga­<br />

tion into ways to optimize the latency ofUDPIIP and<br />

the performance of the message-passing options.<br />

Optimizing UDP/IP<br />

Our initial approach to improve latencv was to reex­<br />

amine the standard UDPilP code path within the<br />

<strong>Digital</strong> UNIX kernel fo r unnecessary overhead . Our<br />

idea was to create a faster path, optimized tor a<br />

UDPIIP over a local area network (LAN ) configura­<br />

tion by reducing numerous conditional checks in the<br />

path. Although this work yielded some improvement,<br />

it was not enough to justit)' supporting a deviation<br />

from the standard code path. An overhaul of the origi­<br />

nal code path would have been necessary for this<br />

approach to gain significant improvement in latency.<br />

UDP llP provides a general transport protocol,<br />

capable of runni ng across a range of network inter­<br />

faces. We realize the value in retaining the generality<br />

of UDPIIP. For optimal performance, however, we<br />

anticipate typical cluster configurations being con­<br />

structed using a high-perkmnance switched LAN<br />

technology such as the GIGAswitchiFDDI system.;<br />

In such configurations, the IP fa mily of protocols<br />

presents unnecessary protocol-processing overhead.<br />

A messaging system using a lower-level protocol, such<br />

as native FDDI, wou ld ofter better late ncy, but its<br />

implementation requires the usc of nonstandard mech­<br />

anisms to access the data link layer directly, which is less<br />

general and portable than a UDPIIP implementation.<br />

Based on the above observations, we designed a new<br />

protocol stack in the kernel, called UDP_prime, to<br />

coexist with the standard UDPilP stack. UDP_prime<br />

packets conform to the UDP I IP specirication.'9 To<br />

reduce the amount of per-packet processing and<br />

approach that of a lower-level protocol, UDP_ prime<br />

imposes several restrictions on its usc. These restric­<br />

tions optimize the typical switched LAt"J cluster config­<br />

urations. To retain the generality of UDPIIP,<br />

UDP_prime falls back to rbe standard UDPilP stack<br />

when these restrictions are not appucable.<br />

Restrictions on UDP_ prime<br />

The LAN nature of the cluster configuration imposes<br />

a restriction on UDP _prime. Each cl uster member has<br />

to be within the same JP subnet, which is directly<br />

accessib.le from any other member. With this restric­<br />

tion, routing decision and internet-to-hardware<br />

address resolution can be done once tor each peer­<br />

to-peer connection rather than on a per- packet basis.<br />

Per-packet UDP I IP checksum processing can also<br />

be eliminated, because intermediate routing is not<br />

<strong>Digital</strong> <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong><br />

involved and the data link cyclic redundancy check<br />

( CRC) is sufficient to guarantee error-tl-ee packets.<br />

The next restriction is the maximum length of rhe<br />

message. PSE message passing uses fixed-size bufters.<br />

UDP_prime restricts the maximum bufter size ro be<br />

the maximum transmission unit (MTU) ofrhe underly­<br />

ing network interface. This eliminates per-message IP<br />

fragmentation and defragmentation overhead. Since<br />

the messaging clients have to fi-agment the messages<br />

inro fi xed-size buffers at the higher layer, there is no<br />

need fo r rhe IP layer to perform fu rther fragmentation.<br />

One complication in our current implementation<br />

occurs when multiple peers are running on a single<br />

system while others are on remote systems. The<br />

default behavior for peers within a single system is<br />

to communicate across the loopback intert�Ke. In this<br />

situation, there are two MTU values, one fo r the net­<br />

work interface and one for the loopback interface .<br />

Our currenr implementation of UDP_prime does not<br />

allow communication over the loopback interface so<br />

that a single-size MTU can be used. rurrher studies<br />

need to be done to fi nd an optimal maximum buffer<br />

size, taking into account multiple MTU values, page<br />

alignment, and so forth.<br />

Based on the above restrictions, UDP_ prime opti ­<br />

mizes the per-packet processing overhead of sending a<br />

packet by constructing a UDP, IP, and data link packet<br />

header template for each peer ar initialization. Except<br />

for a few fields, the content of these headers is static<br />

with respect to a particular peer. UDP _prime defines a<br />

new IP option, IP _UDP _PRI ME, tor the setsockopt( )<br />

system call, to allow the messaging system to define<br />

the set of peers and their Internet addresses involved in<br />

the application execution."' The IP option processing,<br />

done prior to sending any message, makes routing<br />

decisions, performs Internet-to-hardware address res­<br />

olution, and tills in the static portion of the header<br />

fields. When sending a packet, UDP _prime simply<br />

copies the header template to the beginning of the<br />

packet, minimizing the per-packer processing over­<br />

head and increasing the likelihood of the templates<br />

being in the CPU cache. Several header fi elds, such as<br />

the IP identification, header checksum, and packet<br />

length fi elds, are then filled dynamical ly, and the com­<br />

plete packet is presented to the interface layer.<br />

UDP_ prime Packet Processing<br />

Since a UDP_prime packer is a UDPIIP packet, the<br />

standard UDP I IP receive processing can handle the<br />

packet and deliver it to the messaging client. To trig­<br />

ger the use of UDP_prime optimized receive process­<br />

ing, the sending system uses the type of service (TOS)<br />

field within the IP header to specif)' priority delivery of<br />

the packet." The priority delivery indication does not<br />

by itself uniquely differentiate bet\veen UD P _prime<br />

and UDPIIP packets, as any other IP packets can<br />

also have the TOS ti.eld set to priority. As a result, the


optimized receive processing has to check for the<br />

packer's adherence to the UD P _prime restrictions.<br />

Nonadherence to the restrictions reroutes the packet<br />

to the standard receive processing code.<br />

\tVhen a packet arrives at a network interface, the<br />

imertace posts a hardware interrupt, and the interface<br />

interrupt service routine processes the packer. The<br />

standard interrupt service routine deletes the data link<br />

header, and hands the packet over ro the netisr ke rnel<br />

thread ." Netisr clemultiplexes the packet based on<br />

the packet header contents and delivers it to the appli­<br />

cation's socket receive bufkr. Netisr, designed to be<br />

a general-purpose packet demultiplexer, runs at a low­<br />

interrupt priori ry level. The main reason fo r a thread­<br />

based demultiplexer is extensibility. New protocol<br />

stacks can be registered to the thread . Since there is<br />

no a priori knowledge of the execution and SMP lock­<br />

ing requirernents of these stacks, a thread-based low­<br />

interrupt priority demultiplexer is needed so that the<br />

network interrupt processing rime can be held to a<br />

minimum. The extensibi lity feature, however, intro­<br />

duces a context switch overhead.<br />

For UDP _prime, the packet header processing ti me<br />

on the receive path is almost a small constant. We<br />

modified the interface service routine to demultiplex<br />

the packet by processing the data link, IP, and UDP<br />

headers, and deliver the packet to the socket receive<br />

buffe r without handing it over to netisr. This short cir­<br />

cuit path is used only when the packet is a UDP / IP<br />

packet with no 1 P fi·agmentation and with priority<br />

delivery indication . If these cond itions are not met,<br />

the standard netisr path is chosen. The UDP_prime<br />

receive path eliminates the netisr context switch over­<br />

bead . This is a significant advantage, especially when<br />

the receiving application runs with a rea l-rime FIFO<br />

scheduling policy.<br />

SMP Synchronization<br />

One difficulty in designing the UDP_prime stack to<br />

run in parallel with the standard UDP/IP stack was<br />

in SMP synchronization .2-' The socket butle r structure<br />

is a critical section guarded by a complex lock.<br />

Requesting a complex lock in <strong>Digital</strong> UNIX blocks<br />

execution if the lock is ta ken. To prevent deadlocks,<br />

its use is prohibited at an elevated priority level, such<br />

as the case in the interrupt service routine. To work<br />

around this problem, a new spin lock was introduced<br />

in the short circuit path and in the socket layer where<br />

access to the socket buffe r needs to be synchronized.<br />

Performance<br />

To measure message-passing performance, we used<br />

two DEC 3000 Model 700 workstations connected by<br />

a GIGAswi tch/FDDI system using TURBOcbannel­<br />

based DEFTA fu ll-duplex FDDI adapters. Each work-<br />

station contained a 225-megahcrrz (MHz) Alpha<br />

21064 microprocessor and was running the <strong>Digital</strong><br />

UNIX version 3.0 operating system.<br />

Figure 6 shows the message-passing bandwidth fo r<br />

TC P/ IP, UDP/IP, and UDP_prime transports at dif­<br />

ferent message sizes. The bandwidth was measured at<br />

the message-passing application programmer interrace<br />

(API) level, raking into account al location and deallo­<br />

cation of each message buffer in addition to the data<br />

transmission. TCP/IP, UDP/IP and UDP_prime<br />

bandwidth peaks at approximately 95 megabits per<br />

second at a 4,224-byte message, approaching the<br />

FDDI peak bandwidth. UDP/IP approaches the peak<br />

bandwidth at a 1 ,4 00-byre message, and UDP _prime<br />

at a 1,024-byte message. Reaching the peak band­<br />

width using small messages is a measure of protocol<br />

processing efficiency.<br />

Figure 7 shows the minimum message-passi ng<br />

latency fo r TCP/ IP, UDP/IP, and UDP_prime<br />

transports at diffe rent message sizes. The latency was<br />

measured as balfofthe minimum rime to send a nl CS­<br />

sage and receive the same message looped by the<br />

receiver system over many iterations. The measure­<br />

ment made allowance fo r the allocation and deallo­<br />

carjon of each message bufter, in addition to the<br />

round-trip transmission .<br />

Compared to the TCP/IP option, UDP/IP has a<br />

slightly higher minimum latency. This is not expected,<br />

because the original goal ofrhe UDP/IP option was to<br />

reduce TCP/IP processing overhead . It is, however,<br />

encouraging to see only a slight degradation in latency<br />

when the reliable in-order delivery protocol is imple­<br />

mented at the library level. This prompted us to use<br />

the same protocol engine in the libra ry to r<br />

UDP_prime. At a very small m.essage si ze (4 bytes),<br />

100<br />

90<br />

80<br />

0 70<br />

z<br />

0 60<br />

(_)<br />

lli 50<br />

{7j<br />

1- 40<br />

m<br />

::;;; 30<br />

20<br />

10<br />

KEY:<br />

r - . /<br />

/<br />

/<br />

/<br />

/<br />

/<br />

/<br />

/<br />

...... - /<br />

I<br />

I<br />

I<br />

I<br />

I .<br />

.<br />

0 500 1 000 1500 2000 2500 3000 3500 4000 4500<br />

-- UDP _PRIME<br />

Figure 6<br />

TCP/IP<br />

UDP/IP<br />

Peer-to- Peer Bandwidth<br />

MESSAGE SIZE (BYTES)<br />

Digiral <strong>Technical</strong> Journal Vol. 7 No. 3 <strong>1995</strong> 35


36<br />

1000<br />

900<br />

(f) 800<br />

0<br />

z 700<br />

0<br />

� 600<br />

� 500<br />

a:<br />

� 400<br />

2 300<br />

KEY<br />

200<br />

100<br />

· /<br />

- �<br />

/<br />

/<br />

0 500 1000 1500 2000 2500 3000 3500 4000 4500<br />

-- UDP PRIME<br />

Figure 7<br />

TCPII P<br />

UDPIIP<br />

;\r[inimum Peer-to-Peer Larctll\.<br />

MESSAGE SIZE (BYTES)<br />

prorocol processing overhead domin;ltcs rile Lncncv.<br />

At rhis poinr, UDP_primc ll"


References<br />

I. <strong>Digital</strong> High Pe Jfrmnance Fo rtran 90. HPF a nd<br />

PSI;" /VIa mwl (M:�vnard , M:1ss.: Digiral Equipment<br />

Corporation, Orde r No. AA-2ATAA-Te , <strong>1995</strong> ).<br />

2. G. Be ll, "Sc:�lablc, Pclt"


Richard A. \Varren<br />

Richard W;uTcn is cl prin..:ipc1l software engi neer in the<br />

High Pedornunce Computi ng Group, ll'hcre his [1 rimclrl'<br />

n: sponsibilirY is the design Jnd dcl'clopmcnr of Digi tc1 l's<br />

pclr:JIIel softii':Jrc cm·ironmcnr. Since joining <strong>Digital</strong> in<br />

1977, Ri ch,mi hc1s conrrilJuted to PDI'- 11 SI'Stems dcl·cl­<br />

opmcnr and \'.-\ \ 32 -bi r shcHed-mctnon· tnul rip rocnsor<br />

designs. He h:1s c1lso been


An Overview of the<br />

Sequoia 2000 Project<br />

The Sequoia 2000 project is the joint effort<br />

of computer scientists, earth scientists, gov­<br />

ernment agencies, and industry partners to<br />

build a better computing environment for<br />

global change researchers. The objectives of<br />

this widely distributed project are to support<br />

high-performance 1/0 on terabyte data sets,<br />

to put all data in a database management<br />

system, and to provide improved visualization<br />

tools and high-speed networking. The partici­<br />

pants developed a four-level architecture to<br />

meet these objectives. Chief among the lessons<br />

learned is that the Sequoia 2000 system must<br />

be considered an end-to-end solution, with all<br />

pieces of the architecture working together.<br />

This paper describes the Sequoia 2000 project<br />

and its implementation efforts during the first<br />

th ree years. The research was sponsored by<br />

<strong>Digital</strong> Equipment Corporation.<br />

I Michael Stonebraker<br />

The purpose of the Sequoia 2000 project is to build a<br />

better computi ng environment to r global change<br />

researchers, hereafter referred to as Sequoia 2000<br />

clients. These researchers investigate issues such as<br />

global warming, ozone depletion, envi ron ment toxiti­<br />

cation, and species extinction and are members of<br />

earth science departments at universities and national<br />

laboratOries. A more detailed conception tor the proj­<br />

ect appears in the Sequoia 2000 technical report<br />

"Large Capacity Object Servers to Support Global<br />

Change Research."1<br />

The participants in the Seq uoia 2000 project are<br />

investigators of to ur types: ( 1) computer science<br />

researchers, (2) earth science researchers, ( 3) govern­<br />

ment age ncies, and ( 4) industry partners.<br />

Computer science researchers arc responsible fo r<br />

building a prototype environment that better serves<br />

the needs of the target clients. Participating in<br />

the Sequoia 2000 project are invesrigarors from the<br />

Computer Science Division at the University of<br />

California, Berkeley; the Computer Science Depart­<br />

ment at the Universi ty of Ca lifornia, San Diego; the<br />

School of Library and Information Studies at the<br />

University of California, Berke ley; and the San Diego<br />

Supercomputer Center.<br />

Earth science researchers must explain their needs<br />

to the computer science im'estigators and use the<br />

resu lting prototype environment to perform better<br />

earth science research. The Sequoia 2000 project<br />

comprises earth science investigators fr om the<br />

Department of Geography at the University of<br />

California, Santa Barbara; the Atmospheric Science<br />

Department at the University of California, Los<br />

Angeles (UCLA); the Climate Research Division at<br />

the Scripps Institution of Oceanography; and the<br />

Department of Earrh, Air, and Water at the University<br />

of Calitornia, Davis.<br />

To ensure that the resulting computer environment<br />

addresses the needs of the Sequoia 2000 clients, gov­<br />

ernment agencies that are affected by global change<br />

matters participate in the project. The responsibi l ity of<br />

these agencies is to steer Seq uoia 2000 research<br />

toward achieving solutions to their problems. The<br />

government agencies that participate are the State of<br />

Ca lifornia Department of Water Re sou rces ( DWR),<br />

Digir


40<br />

the SLnc of Cal itorni�1 Dcp�1rtment of forcsrn·, rh e<br />

Coord inated Environment Research L1boJ:�non·<br />

(CERL) or the United States Arnw, the N;uion�; l<br />

Aeronautics and Space AdJllJJlisn·:tti�n (\:AS,-\), rhc<br />

:\'ation�1l Occ.mic and Atnws�1hcric Adminisrr�Hion<br />

(�0,-\A), Jnd rhe United Sr:1res Geologic Sun·C\·<br />

(USCS).<br />

The task or' the ind ustr\' participants is to use<br />

the Seq u oia 2000 tech nol og\'


APPLICATIONS -<br />

t<br />

DATABASE<br />

MANAGEMENT<br />

SYSTEM<br />

NETWORK t<br />

Figure 1<br />

The Scquoi� 2000 Architecrun:<br />

The File System Layer<br />

FILE SYSTEMS t<br />

FOOTPRINT<br />

t<br />

STORAGE DEVICES<br />

On top of the fo otprint layer is the ti le system layer.<br />

Two ri le systems manage data in the Bigt-oot multilevel<br />

memory hierarchy. The fi rst t-i le system is Highlight,<br />

which extends the Log-structured File System ( LFS )<br />

pioneered ror disk devices bv Ousterhout and<br />

Rosenblum to tertiary memory.1··' The original LFS<br />

treats a disk device as a single continuous log onto<br />

which newlv written disk blocks are appended. Blocks<br />

are never overwritten, so a disk de,'ice can always be<br />

written sequentially. Hence, the LFS turns a random­<br />

write environ ment into a seq uential-write environ­<br />

ment. In particu lar problem areas, this may lead to<br />

much higher performance. Benchmark dar� support<br />

this conclusion 4 In add ition, the LFS can always iden­<br />

ti f:}• the last kw blocks that were written prior ro a t-i le<br />

S}'Stem fa ilure by tinding the end of the log at recovery<br />

time. File system repair is then very t-a st, because<br />

potentially damaged blocks arc easily ro und. This<br />

approach dift-ers 6-om conventional ti le svstem repair,<br />

where a laborious check of the disk must be performed<br />

tO ascertain disk integrity.<br />

Highlight extends the LFS to support tertiary mem­<br />

ory by adding a second log-structured t-i le system on<br />

top of the footpri nt layer. This ti le system also writes<br />

tertiary memory blocks seq uentially, thereby obtain­<br />

ing the pedcmmnce characteristics of the LFS. The<br />

Highlight ri le system adds migration and bookkeeping<br />

code that treats the disk LFS file system as a cache tor<br />

the te rtiary memory tile system. In summarv,<br />

Highlight should provide good performance t6r<br />

workloads that consist of mainly write operations.<br />

Since Sequoia 2000 clients want to archive vast<br />

amounts of data, the Highlight file system has the<br />

potential tor good performance in the Sequoia 2000<br />

environment.<br />

The second ti le system is Inversion.; Most DBMSs,<br />

including the one used fo r the Seq uoia 2000 project,<br />

support binary large objects (BLO Bs), which are<br />

arbitrary-length byte strings of variable length. Like<br />

several commercial systems, Sequoia's data manager<br />

POSTGRES stores large objects in a customized<br />

storage system directly on a raw storage device.6 As<br />

a result, it is a straightforward exercise to support con­<br />

ve ntional files on top of DBMS large objects. In this<br />

way, the fi·onr end turns every read or write operation<br />

into a query or an update, which is processed directly<br />

by the DBMS. Simulating t-i les on top of DBMS large<br />

objects has several advantages. First, DBMS services<br />

such as transaction management and security are auto­<br />

matically supported tor t-i les. In addition, novel charac­<br />

teristics of our next-generation DBMS, including time<br />

travel and an extensi ble type system tor all DBMS<br />

objects, are automatically available t-or tiles. Of course,<br />

the possi ble disadvantage of simulating files on top of<br />

a DBMS is poor performance . As reported by Olson ,<br />

Inversion performance is exceedingly good when large<br />

blocks of data are read and written, as is characteristic<br />

of the Seq uoia 2000 workload.;<br />

At the present ti me, Highlight is operational but<br />

very buggy. Inversion, on the other hand, is used to<br />

manage production data on Sequoia's Sony WORM<br />

jukebox. Unfortu nately, the reliability of the proto­<br />

type system has not met user expectations. Sequoia<br />

2000 cliems have a strong desire fo r commercial off<br />

the-shelf(COTS) software and are fr ustrated by docu­<br />

mentation gl itches, bugs, and crashes.<br />

As a result, the Sequoia 2000 project ream has also<br />

deployed two commercial t-i le systems, Epoch and<br />

AMASS. The Epoch file system is quite reliable but<br />

does not support either of Sequoia's large-capacity<br />

robots. Hence, it is used heavily but only fo r small d:�ta<br />

sets. The AMASS fi le system is just coming into pro­<br />

duction use fo r Sequoia's Merrum robot and replaces<br />

an earlier COTS syste m, which was unreliable. Given<br />

the experience of the Sequoia 2000 team with tertiary<br />

memory su pport, tertiary memory users should care­<br />

fu lly test all file system software .<br />

The DBMS Layer<br />

To meet Sequoia 2000 client needs, a DBMS<br />

must support spatial data such as points, lines, and<br />

polygons. In addition, the DBMS must support the<br />

large spatial arrays in which satellite imagery is natu­<br />

ra lly stored. These characteristics are nor met by pop­<br />

ular, general-purpose relational and object-oriented<br />

DB1\tlS s.7 The best fit to client needs is a special­<br />

purpose Geographic Information System (GIS) or<br />

a next-generation object-relational DBMS. Since it<br />

has one such object-relational system, namely<br />

<strong>Digital</strong> <strong>Technical</strong> Journal Vo1. 7 No. 3 <strong>1995</strong> 41


42<br />

POSTGRES, the Seq uoi::t 2000 project elected to<br />

tix us its DBJ'vi S efforts on this s1·srem.<br />

To make the POSTGRES DBMS suit::tb l e fo r<br />

Sequoia 2000 use, we req ui re �1 schcm�1 t( >l. :�II Seq u oia<br />

d�1t:1. This ciat:1base design process h:ts CH>Ivcd as :1<br />

cooper::ttive exercise between v:1 rio us d:tt�1b�1sc experts<br />

:tt Bcrkelev, the San Diego Su percomputer Center,<br />

CERL, and SAJC. The Sequoi:� schc111�1 is rhc collec·<br />

rion of metadata that describes the d�1ta stored in the<br />

POSTGRES DBMS on Bigt(>ot. SpccitiClilv, these<br />

mcrad:Ha comprise<br />

• A standard \'Ocabularv of terms ,,·irh �1grced·u pon<br />

definitions that arc used to descri be the (b tJ<br />

• A set of rvpcs, instances of 11 hich m�n store data<br />

"''lues<br />

• A hier:1rchicll collection of cl:�sscs tlut describe<br />


3. A browsi ng capability for te xtual intixmation<br />

4. A fa cility ro interrace the UCLA General Circula­<br />

tion Model (GCM ) to the POSTG RES/IIIustra<br />

SI'Stem<br />

5. A desktop ,·idcoconterencing or "picturephonc''<br />

tacilitl'<br />

For the off-the-shelf visualization tool, we have<br />

converged around the usc ofAVS and IDL fo r project<br />

activities. AVS has :m easy-to-use "hox�.:s-and-arrows"<br />

user interface, ll"hcrcas I DL has a more conventional<br />

linear programming notation. On the other hand,<br />

IDL has better tll"o-dimensional (2-D) graphics tea ­<br />

tllres. Both AVS and IDL a llo\\' the user to read and<br />

write ti le data. To connect to the DRMS, we have writ­<br />

ten an AVS-POSTGR.ES bridge. This program allows<br />

the user to construct an ad hoc POSTG R.ES q uerv and<br />

pipe the result into :m AYS boxes-and-arrows net11·ork.<br />

Seq uoia 2000 clients can use AYS fo r fu rther process­<br />

ing on any data retrieved fr om the DBMS. IDL is<br />

being interfaced to AVS by the vendor. Consequently,<br />

data retrieved fr om the database can be moved into<br />

IDL using AVS as an intermediarv. Now that \\ 'e have<br />

migrated to the Illustra DBMS, 1\'C arc consideri ng<br />

porting this AVS bridge to the II lustra :1pplicarion pro­<br />

gramming interface (API ).<br />

AVS has sorm: disadvantages


+4<br />

Illustr::t has Jl1 itnegy�ued document d�n�1 tvpe 11 ith<br />

capabi l i ti es si m ilar ro the extensi ons 11·c m:-tdc ro<br />

POSTGRES.<br />

A related Berkclt\' project is r(K uscd 011 digiti;ing<br />

all the Berkcl c\· Computer Science Techni c�1 l Rq)orts.<br />

This project uses �1 Mosaic cl ient to �1ccess �1 custom<br />

World Wide We b scnn ca lled D i e nst, 11·h ich stores<br />

technical report objects in a L�IX ri le SI'Stcm. J n a tC \\'<br />

months, 11·e e x pect to con1·crr D ienst to stmc objects<br />

in the Seq uoi:-t 2000 d;na b:-tse, rather th;1n in ti les.<br />

\Vhcn thi s SI'Ste m, nicknamed Datab:�sc Di e nst, is<br />

operati onal, M osaic/ Dienst sen·icc ll'i II he �1 1'r all textual objects in the Seq uoia schem;l.<br />

Our fourth thrust in the application laver is �1 bcil i t\'<br />

ro i ntcr bce rhe LIC:LA Ccncral C:in.:ul.1tiot1 Model<br />

(GC:'vl ) to the POSTCRES/lllusrra SI'Stc nl . Th is pro­<br />

gram is :1 "cbu pum p" bec1use ir pumps cbL1 our of<br />

the si mul;nion model :-tnd imo the DBMS. \Vc 11,1mcd<br />

rhc progra m "the big l ifr" attcr rhe DWR pu m p ing<br />

station that r:�i scs Northern C:tlit(m1ia 11 ;Hn m·er the<br />

Te hacha pi Mount;l ins inr.o Southern Cllit(m1i;l.<br />

Basicallv, the UC :l.A CC:M p rod uces :1 1ccrm or' sim­<br />

ulation outpur ,.�1riables r()r each ti me stq) of;l l en gt h1·<br />

ru n tor each tile in ;1 three -dimensional (3-D) grid of<br />

the at mosp here ;1 11d occ;ln. De pe ndi ng 011 the scal e<br />

of thc model, its resolution, and rhe clp;lbilin· of the<br />

serial or parallel m:-tchine on 11·hich the mode l is running,<br />

the CCL\ GC:Ivl c�111 prod uce ti·om 0. 1 to 10.0<br />

megabytes per se cond ( J\'l B/s ) ou tpu t. The pu rpose of<br />

rile big litr is ro inst�11l rhc outpu t dar�1 into :1 chr�1 b;1se<br />

in re:d ti me. UCLA sci t· n rists can then usc Objccr­<br />

K.n oll'led ge , Tiog�1, Tc c;ltc, AVS, or lDL ro ,·isualizc<br />

their simulation output. The big lirr 11 ill likcil h�11 c ro<br />

e x ploi t p ar: dldism in rhc d�Ha managn, if i t is req uired<br />

to keep up \\'ith rhe ncc u tion of the model on ;1 11L1s­<br />

si ,·c ll· parall el �1 rc hirccrurc .<br />

The rifrh applic1tion S\'Stem is a con krc ncing SI'S­<br />

tc m. Since Scq uoi:J 2000 is �1 distnburcd pmjcct, II'C<br />

lcarnccl e�1rlv that bcc-ro-L1cc meetings rh;H requ ired<br />

p;lrricip;m ts to tr;llcl to other si tes �1 nd el ectron ic mail<br />

\\'Cre not sufticic m ro keep proj ect members "·urking<br />

;ls J tc:am. Consequcml1·, 11·e pu rch�1scd umkrcncc<br />

room 1 idcocon krcncing eq u ip m ent t( >r each project<br />

sire. ·rhis tcc l1 11olog1 costs .lt)prmim�1tcll- S::iO,OOO per<br />

si te ;m d �• llo11 s m u l till'rc, 1·ideoconterencing te nds to be<br />

usee! fo r arra nged conkrcnccs and nor t


Sequoia 2000 as an End-to-End Problem<br />

The m�1jor lesson 11·c ha1 c le:m1ed ti·om the Scquoi�1<br />

2000 project is rh�1t manv issues being our clients c\n­<br />

not be isobrcd ro �l single Lwcr of the Scq uoi�1 2000<br />

�1 rc hirecrurc. This section describes three such end-toend<br />

problems: gu:1ranrecd dcli1-crv, :� bstLKts, :111l1<br />

COil\ 1XeSS10n.<br />

Guaranteed Delivery<br />

Clearlv, guar:� ntced delivery must be an cnd-ro-end<br />

colltracr. Suppose a Scquoi:� 2000 clicnr wishes ro l'isu ­<br />

:d izc a s11ccitic compurarion; t( >r oample, rhe client<br />

11·a1HS to obscr1·e H u rricanc And rc11· as it mm·cs ti·om<br />

the B�1h�1n1:1s ro florid:� ro Louisi.111�1. Spcciti c1li l·, the<br />

clienr 11·ishes to 1'isu:1lizc appropriate s�udlirc im:�gcrv at<br />

a resolution of 500 X 500 in 8-bit color at 10 ti·:unes<br />

per second . Hence, the client req uires 2.5 MB/s of<br />

b:mdwidrh ro his screen. The t()IJ011·i ng sce 1 urio mi ght<br />

be the co111�1utation steps that t�1 kc pbcc.<br />

TilL DHMS must run a q u ery to krch the s�ucllitc<br />

im:�gerv. The querv might require retuming �1 16-bir<br />

d�lta l'�lluc t(>r each pixel that will ultimarclv appear 011<br />

the screen . The DBMS must rhcrd(>JT agree to occure<br />

the quen· in such �1 II'JI' rhat ir guaranrecs ourpur<br />

.1t �1 r�1tc of 5.0 tv! B/s.<br />

The stor:�gc s1·stcm at the sencr will krch some<br />

number of 1/0 blocks ti·om scconchn· and/or tertian·<br />

memo!'\'. DB,\IS quen· opti mizers em accuratch- guess<br />

ho11· mam· blocks thc1· need to rod to s�Hist\· the<br />

qucr1'. The DBMS can then e1sih· generate �1 guar:l ntccd<br />

dcliiLTI' comrxt that the sror::�gc 111:1n�1gcr must<br />

sarist\·, rhus �1 1ioll'ing the DBMS ro sarist\· irs contract.<br />

The network must �1 gree to dc]ii'Cr 5.0 MIVs over<br />

the network link that connects the clienr to rhc server.<br />

The Sequoia 2000 nctll'ork soft ll'arc expects cxactlv<br />

rh is tvpc of contract request.<br />

The 1 isu:1liz�1tion package must


46<br />

then the DBi\rlS may hal'e to decompress the data to<br />

perform the search. If an application resides on the<br />

same machine as the storage 111


finally, the computer scientists and the earth scientists<br />

r-cJiizcd then the word "bcnchmJrk" has a difkrem<br />

meaning t()r cJch of the two groups of researchers. To<br />

cJrth scientists, a benchmark is a scenario, \\·hcreas to<br />

com puter scientists, a be nchmark is an abstract example.<br />

This \'ignctte was typical of the experience these<br />

two disciplines had trying to understa nd one Jnother.<br />

Fundamentally, this process is time-consuming, �md<br />

ample inrerJction time should be planned tc.>r Jny project<br />

that must dcJI with multiple disciplines.<br />

The ScquoiJ 2000 project participants made dkctivc<br />

usc of "converters." A converter is a person of one<br />

discipline \\'ho is planted directly in the research group<br />

of :mother discipline. Through intorrn�11 communication,<br />

this person serves JS an interpreter and translator<br />

fi:>r the other discipline. Corwerters arc encouraged lw<br />

the existence of a ti:>rmal exchange progr:un, whcreb\·<br />

ccntL11 Seq uoia 2000 resources pa\' the li\·ing opcnses<br />

of the exchange personnel.<br />

Lesson 4: Da tabase technology is a major advance for<br />

earth scientists.<br />

Our initial plan was to introduce database technology<br />

into the project with the expectation that the earth scientists<br />

\\'Ould pick ir up and use it. Unfo rtunately, they<br />

:1rc accustomed to dat:1 being in ti les and t-()lmd it vcrv<br />

difticult to make the transition to a database vic\\·. The<br />

earth scientists arc becoming increasing!\· aware of<br />

the inherent ad\'Jntages of DBMS technolot, '\'.<br />

In ad dition, we


4�<br />

the Seq uoi;1 2000 eftorr. As ;1 result, project man;lge­<br />

menr WJS pcrt(mned poorlv at best. In Jny t'ururc large<br />

project, this component should be �1 ddresscd sJtisbc­<br />

rorily up fi·ont bv project personnel.<br />

Lesson 6: Multicampus projects are extremely difficult<br />

to implement.<br />

Sequoia 2000 '' ork is ra king place in seve n difkrcnt<br />

orgJ niz:ttions wi thin the Uni1·crsitv ofCJiit(nnia cd u­<br />

cationJI S\'Stcm. There is a constJnt need to trJnsfer<br />

monel' and people among these org:1nizarions. Accom­<br />

plishing such molTS is :1 difti cult and slm1· process,<br />

hm,·e, er, because of the burc1ucrac1· ll'ithin rhe S\'S­<br />

rern. In Jddition, rhe personnel rules of the Uni,·ersitv<br />

are often in conflicr with the needs of the Sequoi;1<br />

2000 project. As a resul r, multi -i nsti w tion projects,<br />

where participants arc in difkrcnr and often distant<br />

locations, :1re ntrernelv difficult to earn· our.<br />

Status and Future Plans<br />

The Sequoia 2000 project is more rh:1n rhrce vcars old<br />

and h:1s nearlv accomplished irs objectives. We luve<br />

a common schema in place t()r rrs under 1\'J\' in rhc spatial at-c n:1.1" The<br />

intl·asrrucrurc is in place ro enable this sc hcm;1 to<br />

evolvc JS more data tvpes, uscr-dctined fu nctions, ;1lld<br />

operators an: included in the fu rurc.<br />

The combination of Objecr-K.noll'ledge, ll lustLl,<br />

Epoc h, and AMASS is prm ing robust and meers our<br />

clients' needs. Lastk, ,,.c IDI'C ample resources ro<br />

move our prototvpe into prod uction use at UCLA and<br />

Santa B:1rlx1r;1 during rhc next SCI'cral months.<br />

We arc ::1lso extending the scope of the prorotvpe in<br />

t\\'O difkrcnr directions. first, 11·c ll'i] l recruit Jddirional<br />

ea rrh scientists to utili 1.e our s1·srem. This ll'i II<br />

req uire extending our common schema to meet their<br />

needs and then inst::1 lling our suite ofsoft11'rucccdiug;<br />

o/ !he 7993 I.t.FL' /Jo/o L'n,!!, illeerin,�.; c!ilt/(>rence.<br />

Housron, Texas ( Febru:ll'l' 1993 ).<br />

9. ) . Hcllcrsrein and M. Sronehrmccedit tgs of /be 1091 tiC�! SJC.HO/) Cmtj(n'IICC.<br />

W:1 shingron, D.C. (rVb1· ]')93).


10. P. Koc lu.:var and L. Wanger, "Tecne: A Software<br />

P!Jrfonn tor Browsing and Visualizing Data ti·om<br />

Networked Data Sources," Digita l <strong>Technical</strong><br />

.foumal, vol. 7, no. 3 ( <strong>1995</strong>, this issue): 66-8 3.<br />

11. M. Stonebraker er al., "Tioga : PrO\·iding Dara Man­<br />

agement for Scientific Visu:�liz::nion Appl ications, "<br />

Proceedings Q/ the 199.3 VLDB Co n/erence. Dublin,<br />

Ireland (August 1993).<br />

12. A. vVoodruffet :� 1., "Zooming


so<br />

The Sequoia 2000<br />

Electronic Repository<br />

A major effort in the Sequoia 2000 project was to<br />

build a very large database of earth science infor­<br />

mation. Without providing the means for scien­<br />

tists to efficiently and effectively locate required<br />

information and to browse its contents, how­<br />

ever, this vast database would rapidly become<br />

unmanageable and eventually unusable. The<br />

Sequoia 2000 Electronic Repository addresses<br />

these problems through indexing and retrieval<br />

software that is incorporated into the POSTGRES<br />

database management system. The Electronic<br />

Repository effort involved the design of proba­<br />

bilistic indexing and retrieval for text documents<br />

in POSTGRES, and the development of algo­<br />

rithms for automatic georeferencing of text<br />

documents and segmentation of full texts<br />

into topica lly coherent segments for improved<br />

retrieval. Va rious graphical interfaces support<br />

these retrieval features.<br />

Dig;ir:1l Tcch11ical JounlJI Vo l. 7 :--:o . .'\ l


the research conducrcd in these r the POSTG RES database system,<br />

the GIPSY S\'Stcm f(>r automatic indexing of texts<br />

using geographic coordinates based on locations mentioned<br />

in the tnt, :1nd the TextTiling method f(>r<br />

improving access to fu ll-text documents.<br />

Indexing and Retrieval in the Electronic Repository<br />

The priman· engine f(>r information storage and<br />

retrieval in the Sequoia 2000 Electronic Repository<br />

is the POSTGIU:S ncxt-generarion darabase management<br />

system (DBMS).' POSTG RES is the core of<br />

rhe DBMS-ccntric Sequoia 2000 system design. All<br />

rhe data used in the projecr was stored in POSTGRES,<br />

including complex mulridimensional arravs of data,<br />

parial objecrs such as raster and vector maps, sarellite<br />

images, and sers of measuremenrs, as well as all the<br />

fu ll-rext documents available. The POSTG RES DBMS<br />

supports user-defined abstracr data types, user-defined<br />

ti.mcrions, a rules system, and man�' katures of objectoriented<br />

DBMSs, including inheritance and methods,<br />

through fu ncrions in both the querv l


52<br />

Figure 1<br />

KW_SOURCES<br />

KW_INDEX_FLAGS<br />

The Lassen POSTGRES Classes t()r Indexing ,md Their Linkages<br />

a particu!Jr term from the wn_index class. The<br />

POSTQUEL definition ohhis class is<br />

create kw_term_doc_rel (<br />

termid = int4, I* WordNet termid number *I<br />

synset = int4 , I* WordNet sense number *I<br />

docid = int4, I* document ID *I<br />

termfreq = int4) I* term frequency within<br />

the document *I<br />

The raw fi-equencv of occurrence of the term<br />

in the document ( lermji·eq) is included in the<br />

k.\1·_tenn_doc_rel tuple . This information is used in<br />

the retrieval process for calculating the probability of<br />

relevance for each document that contains the term.<br />

The kw doc index class stores information on indi­<br />

,·idual documents in the database. This information<br />

includes a unique documem identifier ( docid), the<br />

location ofthe document (the class, the attribute, and<br />

the tuple in which it is contained), and whether it is<br />

a simple attribute or a large object (with effectively<br />

unlimited size). The kw_doc_index class also main­<br />

tains additional statistical information, such as the<br />

number of unique terms fo und in the document. The<br />

POSTQUEL definition is as tol lo11 s:<br />

create kw_doc_index (<br />

docid = int4, I* document ID *I<br />

reloid = oid, I* oid of relation<br />

containing it *I<br />

attroid oid, I* attribute definition of<br />

attr containing it *I<br />

attrnum int2, I* attribute number of attr<br />

containing it *I<br />

tupleid oid, I* tuple oid of tuple<br />

containing it *I<br />

sourcetype = int4, I* type of object -- attribute<br />

or large obj ect *I<br />

doc_len = int4 , I* document length in words *I<br />

doc_ulen = int4) I* number of unique words in<br />

document *I<br />

<strong>Digital</strong> Tec hnical Journal<br />

Vol. 7 No. 3 <strong>1995</strong><br />

WN_INDEX<br />

The kw_sources class contains information about<br />

the classes and attributes indexed at the class ln·el, as<br />

,,·ell as statistics such as the number of items indexed<br />

from any given class. The fol lowing POSTQUEL<br />

statement defines this class:<br />

create kw_sources<br />

relname = charl6,<br />

relaid = oid,<br />

attrname = charl6,<br />

attroid oid,<br />

attrnum int2 ,<br />

I* name of indexed<br />

relation *I<br />

I* oid of indexed<br />

relation *I<br />

I* name of indexed<br />

attribute *I<br />

I* obj ect ID of indexed<br />

attribute *I<br />

I* number of indexed<br />

attribute *I<br />

attrtype = int4, I* attribute type -- large<br />

object or otherwise *I<br />

num_indexed = int4 , I* number of items<br />

indexed *I<br />

last tid = oid, I* oid and time for last *I<br />

last time abstime, I* tuple added *I<br />

tot_terms = int4, I* total terms from all<br />

items *I<br />

tot_uterms = int4, I* total unique terms from<br />

all items *I<br />

include_pat text , I* simple patterns to *I<br />

exc lude_pat text ) I* match for indexable<br />

I* items *I<br />

The other classes shown in Figure l relate to the<br />

indexing and retrieval processes. The Lassen prototvpe<br />

uses the POSTGRES rules svstem to perform such<br />

tasks as storing the elements of the bibliographic<br />

records in an appropriate normalized fo rm and to trig­<br />

ger the indexing daemon.<br />

Defining an attribute in the database as indexable<br />

tor information retrie1·al purposes (i.e., bv appending<br />

a new tuple to the kw_sources definition) creates a rule<br />

that appends the class name and attribute name to the


kw_index_tlags class whenever a new tuple is appended<br />

to the class. Another rule then starts the indexing<br />

process for the newly appended data . Figure 2 shows<br />

this trigger process.<br />

The indexing process extracts each unique keyword<br />

fr om the indexed attributes of the database and stores<br />

it along with pointers to its source document and its<br />

frequency of occurrence in kw_term_doc_rel. This<br />

process is shown in Figure 3. The indexing daemon<br />

and the rules system maintain other global frequency<br />

information. For example, the overall fr equency of<br />

occurrence of terms in the database and the total num­<br />

ber of indexed items are maintained fo r retrieval pro­<br />

cessing. The indexing daemon attempts to perform<br />

any outstanding indexing tasks before it shuts down. It<br />

also updates the kw _doc_index tuple fo r a given index­<br />

able class and attribute with a time stamp fo r the last<br />

item indexed (last_tid and last_time). This permits<br />

ongoing incremental indexing without having to<br />

reindex older tuples.<br />

Retrieval<br />

The prototype version of Lassen provides ranked<br />

retrieval of the documents indexed by the indexing<br />

daemon using a probabilistic retrieval algorithm. This<br />

algorithm estimates the probability of relevance fo r<br />

each docu ment based on statistical information on<br />

term usage in a user's natural language query and in<br />

the database. The algorithm used in the prototype is<br />

based on the staged logistic regression method.8<br />

A POSTGRES user-defined function invokes ranked<br />

retrieval processing. That is, fi-om a user's perspective,<br />

ranked retrieval is performed by a simple fu nction<br />

call (k\vsea rch) in a POSTQ UEL query language<br />

Figure 2<br />

DO NOTHING<br />

The Lassen Indexing Trigger Process<br />

RULE STARTS<br />

FUNCTION<br />

DAEMON_ TRIGGER<br />

statement. Information from the classes created and<br />

maintained by the indexing daemon are used to esti­<br />

mate the probability of relevance fo r each indexed doc­<br />

ument. (Note that the fu ll power of the POSTQUEL<br />

query language can also be used to perform conven­<br />

tional Boolean retrieval using the classes created by the<br />

indexing process and to combine the results of ranked<br />

retrieval with other search criteria.) Figure 4 shows the<br />

process involved in the probabilistic ranked retrieval<br />

from the repository database.<br />

The actual query to the Lassen ranked retrieval<br />

process consists simply of a natural language statement<br />

of the searcher's interests. The query goes through the<br />

Figure 3<br />

The Lassen Indexing Daemon Process<br />

<strong>Digital</strong> <strong>Technical</strong> Journal Vol. 7 No. 3 <strong>1995</strong> 53


54<br />

Figure 4<br />

The Lassen Rctrie,·a\ Process<br />

same processing steps as documems in the indexing<br />

process. The i ndivid ual words o f the query are<br />

extracred :1 11d located in the wn_indn dictionary<br />

(after re moving common words or "stopll'ords"). The<br />

tcrmids �o r matching words ti·om wn_i ndn arc then<br />

u sed to retrieve all the tuples in kw_tcrm_doc_rd that<br />

contain the te rm. For each u niq ue document identifier<br />

in this list ofruples, the marching k11 _doc_indn tuple<br />

is rctrinTd. \Vith the �i·eq uenc�· int(mllation contained<br />

in kw_term_d oc_rel and kll'_doc_indn, th e estimated<br />

probabi litY of relevance is calculated fo r e:�ch docu ­<br />

ment that con tains at least one term in common ll'i th<br />

the query. The formulae used in the calculation arc<br />

based on experiments with fu ll-text retrieval -" The<br />

basic equation fo r the probabilistic model used in<br />

Lassen states the fo llowing: The probability of the<br />

event that a document is relevant R. gin:n that there<br />

is a set of N"clues" associated with that docu ment, A1<br />

fo r i = l, 2, ... , 1\', is<br />

log O(RIA,, ... ,A,) = log O(R) + 2 )1og O(RIA,)<br />

Digit�! Technic�! Journal<br />

1= I<br />

- log O (R)], (l)<br />

Vol. 7 :--Jo. 3 J


;md queri es (wirh relevance judgments) from the<br />

TlPSTER intcmn:nion retrieval test collection -" The<br />

deri,·ation of rhis fc>rmula is given in "Probabilistic<br />

Re trie' al Rased on Staged Logistic Regression ."' The<br />

fu ll ren·i n·al cquJtion used f(x the prororl'pe ,·ersion of<br />

retrie,·al described in this section i s<br />

where<br />

log O(NIA1, ... ,A11) """ - 3.Sl<br />

·" ·"<br />

I<br />

374<br />

+ \ .lJ + d :Lx,11 + 0.330 :Lx.,,<br />

I I<br />

If<br />

-0.1937 :Lx"',] + o.0929M, ( 3 )<br />

I<br />

X111 1 is the quotiem of the number of ti mes the mth<br />

term occurs in the querv and the sum of the rot:d<br />

number ofte nns in the querv plus 35;<br />

X1112 is the logarithm of the quotient a rri ved at bv<br />

dividing the number of times the mth term occurs in<br />

the document by the sum of the tot;ll number of te nns<br />

in the document plus 80;<br />

Figure 5<br />

CALCULATE NUMBER OF<br />

TERMS IN COMMON<br />

BETWEEN QUERY AND<br />

DOCUMENT M<br />

SUM FREQUENCY OF<br />

TERM IN QUERY DIVIDED<br />

BY ALL TERM<br />

OCCURENCES PLUS A<br />

CONSTANT<br />

2-Xm. l<br />

CALCULATE LOGARITHM<br />

OF SUM OF NUMBER OF<br />

TIMES TERM OCCURS IN<br />

DOCUMENT DIVIDED BY<br />

TOTAL TERMS IN DOCUMENT<br />

PLUS A CONSTANT<br />

LXm.2<br />

CALCULATE LOGARITHM<br />

OF SUM OF NUMBER OF<br />

TIMES TERM OCCURS IN<br />

DATABASE DIVIDED BY<br />

TOTAL TERM OCCURENCES<br />

IN DATABASE<br />

�Xm.3<br />

YES<br />

x/J/.3 is the logarithm of the quotie llt arrived at by<br />

dividing the n u m ber of times the 111th term occu rs in<br />

the databJsc ( i .e., in all documents) by the total num­<br />

ber of terms in the collection;<br />

.ll is the number of terms held in common bv the<br />

querv and the document.<br />

Note that the .'v/2 term called f(>r in Equation 2 was<br />

not fo und to prol'ide any significmt difkrence in the<br />

results and w:1s omitted fr om Eq u ati on 3. The con ­<br />

stants 35 and 80, whi ch were used in X111 1 and X11,_2,<br />

are :1 rbitrarv but appear to offe r the best results 11·hen<br />

set to the a\·erage size of a quen· and the r the Staged Logistic Regression ProbJhilistic Ranking Process<br />

CALCULATE DOCUMENT<br />

PROBABILITY OF RELEVANCE<br />

P(R) = 1 I 1 + e "(- LOG O(R))<br />

CALCULATE DOCUMENT LOG<br />

ODDS OF RELEVANCE<br />

LOG O(R) = k1 + (S · [k2 • :LXm.l<br />

+ k3 · :


56<br />

and its unique identitier are stored in the k\\'_querv<br />

class . To see the results of the retrieval operation, the<br />

querv idenritier is used to 1·errieve the appropri,ue<br />

kw_retrieval tuples, r:mked in order according to the<br />

estimated probabilitY of reln·ance. The k11 _rerrie1·al<br />

and kl1 _quen· classes ha1 e the tollo11·ing POSTQUEL<br />

detinitions:<br />

create kw_query<br />

query_id = int4,<br />

query_user charl6,<br />

query_text = text )<br />

create kw_ret rieval<br />

query_id = int4 ,<br />

doc_id = int4 ,<br />

rel_oid = oid,<br />

attr_oid = oid,<br />

attr_num = int2,<br />

tuple_id = oid,<br />

/* ID number */<br />

/* POSTGRES user name */<br />

/* the actual query */<br />

!* link to the query */<br />

/* document ID number */<br />

/* location of doc */<br />

doc_len = int4 , /* size of document */<br />

doc_ma tch_terms = int4 , /* number of query terms<br />

in the document */<br />

doc_prob_rel = floatS ) /* estimated probability<br />

of relevance */<br />

The algorithm used tor ranked retrin·al in the<br />

Lassen prototvpe was rested


geograp hic position being referenced in the te xt. The<br />

actual index te rms assigned to a document are a ser of<br />

coordinate pol�·gons that descri be an area on rhe<br />

e:mh 's surface in a standard geographical projection<br />

system. Using coordinates instead of names for rhe<br />

place or geographic characteristic offers a number of<br />

advantages.<br />

• Uniqueness. Place names are not uniq ue, e.g.,<br />

Ve nice, California, and Ve nice, Italy, are not appar­<br />

ent!�· dit'tere nt ll'irhout the qualit\·ing larger region<br />

to differentiate them. Using coordinates removes<br />

th is ambiguitv.<br />

• Immunity to spatial boundary changes. Political<br />

boundaries change over time, lead ing to confiJsion<br />

about the precise area being re.ferred to. Coordi­<br />

nates do nor depend on political boundaries.<br />

• lmmunirv ro name changes. Geographic names<br />

change 01-er rime, making it diftlcult for a user to<br />

retrieve all information that has been written about<br />

an area during any extended rime period. Coord i­<br />

nates remove this ambiguity.<br />

• Immunity to spatial, naming, and spelling l'aria­<br />

rion. Names and rerms 1·an· nor onlv o1·er rime but<br />

also in contemporar�· usage . Geographic names<br />

1·ary in spelling over rime and by language. Areas of<br />

interest ro rhe user will often be given place names<br />

designated onlv in the context of a specific docu­<br />

ment or proje ct. Such va riations occur frequentlv<br />

for studies done in oceanic locations. Names associ­<br />

ated 11·irh these stud ies are unknoll'n to most users.<br />

Coordinates are not subject to these ki nds oherbal<br />

,·ariations.<br />

Indexing texts


58<br />

L titu<br />

a d<br />

e (mi ute )<br />

n s<br />

.... ..J<br />

1<br />

___ o _ i_ s p_l _ a y_M_ a _P __ _j ___<br />

Figure 6<br />

H_ i d<br />

Scr(cn h·om rhc L1sscn GcogLl}) hic Brm1 ocr<br />

_ e _ lc_o_n _s __ _J ___ _<br />

necessity, the result of


e lationships ti·om the Word Ner thesaurus, to<br />

include rhe ri: >llowing types ofrerms:7<br />

Sl'nOI1\'In\'<br />

. . .<br />

= S\'110111' 111<br />

kind -of reL1 rionsh ips<br />

= lwponym ( mJ pl c is :1 - ofrrec )<br />

@ : = lwpernvm (tree is :1 @ of mapl e)<br />

p:1rt-of rebrionsilips<br />

# mc rom·m (finger is a # of lund)<br />

% h o lonvm (hand is �1 % oHinge r)<br />

& c1 ido111'm (pine is a & ofshortleaf<br />

pine )<br />

3. Overlay polygons to estimate a pproximate loc:l ­<br />

rions. The objccr ivc of rhis step is ro combine the<br />

evidence �1ccumubted in rile preceding step and<br />

111kr a scr or· polvgons th:�t prm ides a reason:�bJc<br />

approximation of rhc geogr:1phical locations me n ­<br />

tioned in rile rcxr. E::tch gf!oj)hrasc. t/'e��bt. pull'gon<br />

ru plc em be represented �•s a thn . :c-dimcnsion�ll<br />

"extruded" polygon whose b�1se is in the plane of<br />

the x- ;m d z- axes and whose hcighr extends upw


Figure 7<br />

/7<br />

Overl;wing Poh·gons to Estimate Approximate Locations<br />

(a) The "weight" of a polygon, indicated by the<br />

vertical arrow, is interpreted as "thickness."<br />

(b) Two adjacent polygons do not affect each other;<br />

J.<br />

I<br />

I<br />

each is merely assigned its appropriate "thickness."<br />

(c) When one polygon subsumes another, their<br />

"thicknesses" in the area of overlap are summed.<br />

(d) When two polygons intersect, their "thicknesses"<br />

are summed in the area of overlap.<br />

60 <strong>Digital</strong> Techni(al journal Vo 1. 7 No. 3 !995


Figure 8<br />

Surface Plot Produced ri-om the State Water Project Text<br />

TextTiling: Enhancing Retrieval through<br />

Automatic Subtopic Identification<br />

Full-length documents have onlv recentlv become<br />

available on-line in large quantities, although technical<br />

abstracts, short newswire texts, and legal documents<br />

have been accessible for many years 23 The large major­<br />

itY ofon -line information has been bibliographic (e.g.,<br />

authors, titles, and abstracts) instead of the fu ll text of<br />

the document. For this reason, most information<br />

retriev:ll methods :1re better suited for accessing<br />

abstracts than for accessing longer documents. Part of<br />

the rcposirorv research was an exploration of ne\\'<br />

approaches to information retrieval particularlv suited<br />

to ti.l ll-length texts, such as those expected in the<br />

Sequoia 2000 datalnse.<br />

A problem \\'ith applying traditional information<br />

retrieval methods to full-length text documents is that<br />

the structure of full-length documents is quite differ­<br />

ent from that of abstracts. (In this paper, "ti.l ll-length<br />

document" refers to expositorv text of any length.<br />

Tvpical examples are a short magazine article and<br />

a 50-page technical report. We exclude documents<br />

composed of headlines, short advertisements, and any<br />

other disjointed texts of whatever length. We also<br />

assume that the document does not have detailed<br />

orthographically marked structure. Croft, Krovetz,<br />

and Tu rtle describe work that takes advantage of this<br />

ki nd of information.'; ) One way to view an expository<br />

texr is as a sequence of subtopics set against a backdrop<br />

of one or two main topics. A long text comprises many<br />

diffe rent subtopics that may be related to one another<br />

and to the backdrop in many diffe rent ways. The main<br />

topics of a text are discussed in its abstract, if one<br />

exists, but su btopics are usuallv not mentioned.<br />

Therefore, instead of querying against the entire<br />

content of a document, a user should be able to issue a<br />

query about a coherent subpart, or subtopic, of a ti.dl­<br />

length document, and that subtopic shou ld be specifi­<br />

able with respect to the document's main topic(s).<br />

Consider a Discouer magazine :�nicle about the<br />

Magellan space probe's exploration of Ve nus.';<br />

A reader divided this 23-paragraph article into the fo l­<br />

lowing segments with the labels shown, where the<br />

numbers indicate paragraph numbers:<br />

1-2 lntro to J'vlagellan space probe<br />

3-4 Intro to Venus<br />

5-7 Lack of craters<br />

8-1 1 Evidence of volcanic action<br />

12-15 Ri'n Stvx<br />

16-18 Crustal spreading<br />

19-2 1 Recent volcanism<br />

22-23 Future of Magellan<br />

Assume that the topic of volcanic acti,·itv is of<br />

interest to a user. Crucial to a system's decision to<br />

retrieve this document is the knowledge that a dense<br />

discussion of volcanic activity, rather than a passing ref­<br />

erence, appears. Since volcanism is not one of the<br />

text's two main topics, the number of references to<br />

this term will probably not dominate the statistics of<br />

term frequencv. On the other hand, document selec­<br />

tion should not necessarilv be based on the number of<br />

references ro the target terms.<br />

The goal should be to determine whether or not<br />

a relevant discussion of a concept or topic appears.<br />

A simple approach to distinguishing between a true<br />

discussion and a passing reference is to determine the<br />

locality of the references. In the computer science<br />

operating systems literature, locality refers to the fact<br />

that over rime, memory access patterns tend to con­<br />

centrate in localized clusters rather rhan be distributed<br />

evenly throughout memory. Similarly, in fu ll-length<br />

texts, the close proximity of members of a set of<br />

<strong>Digital</strong> <strong>Technical</strong> journal Vo l. 7 :--;o. 3 <strong>1995</strong> 61


62<br />

references to a particul ar concept is a good indiuror of<br />

topicality. For example, the term L'olcauislll occurs 5<br />

rim es in the Magellan article, the first fo ur insr;l nces of<br />

which occur in fo ur adj :\ Ce ll t paragraphs, ;l long ll'ith<br />

accompanvi ng discussion. In comrast, the rerm scieu­<br />

tists. which is not a ,·a lid sub topic, occurs 13 ri mes, dis­<br />

tributed somewhat e\-cnlv througho ut . Bv irs \'en·<br />

nature, a su btopic wi l l nor be discussed throughout an<br />

entire texr. Similarly, true subtopics are nor indiuted<br />

by only passing references. The term helll' do ucer<br />

occurs on l�· once , and its re lated terms arc con tined to<br />

the one sentence it appea rs in. As its usage is onh·<br />

a passing rckrence, belh· (Lwcing is not a true subropic<br />

of this text.<br />

Our solution to the problem of reuining v:did<br />

subtopical discu ssions while �H the same time ;woid­<br />

ing being to oled by passing rdcrcnces is to make<br />

usc of localirv information :m d to partition docu­<br />

ments according to their subto p ica l structure. 'This<br />

approach's capacitY tor i mp rO\'i ng a srallCLl rd informa­<br />

tion retrieval tasl< has been verified bv i nt(mn arion<br />

retrieval experiments using fu ll-texr rest collections<br />

from t he TIPSTER daubase .l(··27<br />

One wav to get an approximation of the subtopic<br />

structure is to break the document into paragr;lp hs, or<br />

fo r \'cry long documenrs, sections. In both c1scs, rhis<br />

entai l s using the orthogr;1phic marking supplied l1\' the<br />

author to determine topic boundrmance oF moti\·;Jted<br />

segmenta tion, i.e., segmentation that rdkc rs rile<br />

text's true underlyi ng su btopic structu re, \\'hich ofren<br />

spans pac1graph boundaries.<br />

To\\'ard rhis end, \\'C h :-�,·c dn·c loped To rTi l ing,<br />

a method f( Jr partitioning fu ll-length tnt docu mems<br />

into coherent mLtlriparal:'-r,lph u n its called tiks.'"·'" . .:o'<br />

TcxrTi l ing approximates rhe su btopi c structure of<br />

a docu rne m by using patterns ot'lcxical conlKCtivin· ro<br />

find coherem su bdiscussions. The l


SimibritY ber11·ecn blocks is ukularcd b1· the t(,l loll·lllg<br />

cosJ Jle meastu·c: Ci1 en two tex t blocks bl �1 11d !J2,<br />

cos (1?1 ,1?2) =<br />

"<br />

L u·, ,, 11', ,1<br />

I I<br />

I! "<br />

L"·i,,, L ll'j ,,,<br />

t J /'='"I<br />

11 here 1 ranges m·cr �1 11 the terms in the document, a nd<br />

t/'1 "1 is the tf.idfweighr �1ssigned to term tin block hl .<br />

Thus, iF the similaritv score ben1 een tii'O blocks is<br />

high, then nor only do the blocks have te rms in com­<br />

mon , bur the common terms �llT relatively rare with<br />

respect to the rest of the document. The evidence in<br />

the reverse is nor as conclusive . U' :-�djacenr blocks hJ1·e<br />

a lo\\' sirnilarin· measure, this docs not necess�1riil'<br />

J11C111 that the blocks cohere . 1 n pr:1crice, hm1·e\'er, this<br />

negJti1·e n·idence is often jusritied.<br />

The graph is then smoothed using a discrete con,·o­<br />

lurion" of the similarirv fu nction \\'ith the fu nction<br />

h1?. ( ), where<br />

I ii � �< - 1<br />

otherwise.<br />

The result is smoothed tiJrrhcr 11·ith a simple median<br />

smoothing algori thm ro elimit1


64<br />

5. D. BlI Ctl., and H. Tu rt l e , "lmcr.JCtilc<br />

Retr·ie' ,,) of Com pic .x Docu me n rs," lnjbrnw I ion J'm­<br />

cessillg u11d Jfol/{tgc>lllell/. ,·ol. 26, no. S ( 1 990 ):<br />

593-6 16.<br />

2S. A. Chaikin, " M agellan Pierces the Vc nusi:-tn Ve il,"<br />

/Ji.1crwe1: ,·ol. 13, no. I (Januarv 1992).<br />

26. M. Hearst .ltld C:. Plcl unt, "Su bropic Srru crmi ng f(lr<br />

Full-Length Doc u1ncn t Access," Proceedings of the<br />

Sixteenth Anmwl !nlenwliomt! ACH SIGIR Co nfer­<br />

ence on N.escorch ond Det ·elopmenl in !njonnation<br />

!(etrieml (S/GIR '93J. Pittsburgh, June 1993 (Nc"·<br />

York: Association fo r Computing MachinerY, 1993 ):<br />

59-68.<br />

27. ,\>!. H earst, "Context ;m d Stru cture in Automated<br />

Full-Te xt Information Access," Ph.D. dissertation,<br />

Re port :\'o. UCB/CSD-94/836 ( Berkele1·, Calif<br />

Uni,·crsin· of CllitcJrni;J , Bcrkclcv, Compu ter Science<br />

Di,·ision, 1994 ).<br />

28. M. Hearst, "Tc xrTili n g : A Quan ti tative Approach ro<br />

Discourse Segrnen rarion ," Sequoia 2000 Tcchnicll<br />

Repcm 93/24 (S2K-93-24) (Berkele1., Ca li f Uni,n­<br />

sit\' of C:liit(Jrnia, Bc rkele1·, 1993) (frp:/ /s2k­<br />

h:p .cs. bcrkcl c1·. ed u /pub/ sequoia/ tech- reports/ s2 k -9<br />

3-24/s2 k-93-24.ps ).<br />

29. 1Vl. H c1rsr, "Mulri-Par;lgJ·aph Segmentation of Expos­<br />

iron· TC \t," l;mceedings of the 77? irty-sccol!d<br />

J/e!:'lill,f.!. o/ the Assuciotion jar Co 111putctlional<br />

Li11p11s!ics. Los Cruces, ]\.' e\\' Mexico, June 1994.


30. G. Salton, AIIIOIIIlllic rex/ Processing· The Tran�jor-<br />

11/C/I io11. A nu!)•sis. ct/1(/ Ret rie!'al of Information by<br />

Computer ( RcJding, Mass.: Addison· Weskv, 1989).<br />

31. The authors arc gr�Hctii l ro Miclucl Braverman fo r<br />

proving that the smoothing algorithm is equivalent to<br />

this convolution .<br />

32. L. l'-1bincr and R. SchJti:r, f)ig ital Processill8 of<br />

Speech S(t;11als (Englewood Clifts, N.J.: Prcntiee­<br />

H�lll, Inc., 1978).<br />

Biographies<br />

Ray R. Larson<br />

Rav Larson is an AssociJtc Professor rmation Studies). He tcJches courses and conducts<br />

resean.:h on the design Jnd evaluation ofintormJtion<br />

retrievJI svstcms. Rav received his Ph .D. !Tom the Universirv<br />

ofCalit(>�nia. He is � member of the American Socictv tor .<br />

. .<br />

lntcxmation Science (AS IS), the Association for Comput·<br />

ing 1Yiachinery (ACM ), the IEEE Computer Society, the<br />

Ametican Association tc >r the Advancement of Science,<br />

and the American Librarv Association. He is the Associate<br />

Editor fo r IICM Tru l/sa�lions 011 fllj(>rmati


Tecate: A Software<br />

Platform for Browsing<br />

and Visualizing Data<br />

from Networked Data<br />

Sources<br />

Te cate is a new infrastructure on which applica­<br />

tions can be constructed that allow end users<br />

to browse for and then visualize data within<br />

networked data sources. This software platform<br />

capitalizes on the architectural strengths of cur­<br />

rent scientific visualization systems, network<br />

browsers like Netscape, database management<br />

system front ends, and virtual reality systems.<br />

Applications layered on top of Tecate are able<br />

to browse for information in databases man­<br />

aged by database management systems and for<br />

information contained in the World Wide Web.<br />

In addition, Tecate dynamically crafts user inter­<br />

faces and interactive visualizations of selected<br />

data sets with the aid of an intelligent system.<br />

This system automatically maps many kinds of<br />

data sets into a virtual world that can be explored<br />

directly by end users. In describing these virtual<br />

worlds, Tecate uses an interpretive language that<br />

is also capable of performing arbitrary compu­<br />

tations and mediating communications among<br />

different processes.<br />

66 <strong>Digital</strong> <strong>Technical</strong> Journal Vol. 7 No. 3 1':195<br />

I<br />

Peter D. Kochevar<br />

Leonard R. Wanger<br />

All people share the need to find and assimilate infor­<br />

mation. Data fi-om which information is created is<br />

increasi ngly avai lable electronically, and that data<br />

is becoming more and more accessible with the prolif�<br />

eration of computer networks. Therefore, the world<br />

is quickly becoming abstracted as a collection of net­<br />

worked data spaces, where a data space is a data source<br />

or repository whose access is control led by means of<br />

a well-defined software interface. Some examples<br />

of data spac


the tools provided by Tecate can be used to build<br />

applications of only modest capabilities.<br />

Historically, Tecate grew out of the Sequoia 2000<br />

project, which was initiated jointly by <strong>Digital</strong> Equip­<br />

ment Corporation and the University of California<br />

in 1991 . The primary purpose of the Sequoia 2000<br />

project was to develop inf(xmation systems that wou ld<br />

allow earth scientists to better study global envi­<br />

ronmental change. Sequoia 2000 participants needed<br />

to browse tor data sets on which ro test scientific<br />

hypotheses and then to interactively visualize the data<br />

sets once to und. The data can be quite varied in con­<br />

tent and strucnJre, ranging rrom text and images<br />

to time-varying, multidimensional, gridded or poly­<br />

hedral data sets. Such data may stream rrom many clif­<br />

fe rent sources, e.g., databases managed by a database<br />

management system, a running simulation of some<br />

physical process, or tl1e WWW. Therefore, a tool was<br />

required that could interface to any such source. To be<br />

of maximum use, though, the tool had to be easy<br />

to use so that the scientists th emselves cou ld make<br />

sophisticated data queries and then experiment with<br />

the query results using a wide variety of data visualiza­<br />

tjon tec hniques.<br />

Generalizing rrom its Sequoia 2000 roots, the<br />

design of Te cate is intended to achieve fo ur goals:<br />

l. lnrerface to general data spaces wherever they may<br />

reside.<br />

2 . Saliently visualize most kinds of data, e.g., scientific<br />

data and the listings in a telephone book.<br />

3. Dynamically craft user interfaces and interactive<br />

visualizations based on what data is selected, who is<br />

doing the visualizing, and why the user is exploring<br />

me data.<br />

4. AJJow end users to interact with elements in visual­<br />

izations as a means to query data spaces, to explore<br />

alternate ways of presenting information, and to<br />

make annotations.<br />

There arc systems available today that have some of<br />

these capabilities, but no one system possesses all fo ur.<br />

Data visualization systems such as AVS, Khoros, or<br />

Data Explorer are capable of visualizing scientific data;<br />

however, they are poor at interfacing to general data<br />

spaces, they provide only limited interactivit:y \vitl1ill<br />

visualizations themselves, and they require visualiza­<br />

tions to be crafted by hand by knowledgeable end<br />

users.'.l.-' Network browsers such as Netscape are good<br />

at fe tching data rrom certajn types of data spaces but<br />

are limited in the variety of data they can directly visu­<br />

alize without having to rely on external viewer pro­<br />

grams. Moreover, most network browsers offer a<br />

restricted type of interacrivity where only hyperlinks<br />

can be f()llowcd and text can be submitted through<br />

fo rms. Finally, rront ends to database management<br />

systems provide elaborate querying mechanisms fo r<br />

selecting data rrom a database, but they lack a sophisti­<br />

cated means fo r visualizing and further exploring<br />

query results.<br />

The Tecate architecture borrows from that of visu­<br />

alization systems, network browsers, and database<br />

management systems as well as rrom virtual reality sys­<br />

tems like Alice and the Minimal Reality Toolkit/<br />

Object Modeling Language (MR/OML)!·' One<br />

major contribution of the Tecate system is tJ1at it<br />

incorporates the architectural strengths of these<br />

systems into a coherent whole. In addition, Tecate<br />

possesses at least two novel features that are not fo und<br />

in other data visualization systems. One fe ature is<br />

Tecate's use of an interpretive language that can<br />

describe three-dimensional (3-D) virtual worlds. This<br />

language is more than a markup language in that it is<br />

capable of performjng arbitrary computations and<br />

fa cilitating communication among clifferent processes.<br />

The second novel component ofTecate is tJ1e presence<br />

of an expert system that automatically crafts interactive<br />

visualizations of data. This system is intended to make<br />

data space exploration easier to perform by having end<br />

users simply state their goals while leaving the detajls<br />

of implementing a visualization to anajn tl1osc goals to<br />

tJ1e expert system.<br />

The remajnder of the paper outlines Tecate's sys­<br />

tem model and architecture and then identifies and<br />

describes Te cate's major components. finally, the<br />

paper sketches Tecate's capabilities by discussing two<br />

simple applications mat have been implemented on top<br />

of me Tecate software fTamework. The first application<br />

is a tool fo r visualizing earth science dat.a rcsicling in<br />

a database managed by a database management system .<br />

The second app�cation is a We b browser that uses 3-0<br />

graphics as an underlying browsing paradigm ramer<br />

than depending solely on me mecliwn of hypertext.<br />

Tecate's System Model<br />

After presenting an overview ofTecate's system model,<br />

this section provides detans of me object model and the<br />

interpretive, object-oriented language used to describe<br />

virtual world objects.<br />

Overview<br />

From the standpoint of an applications programmer,<br />

Tecate is a distributed, object-oriented system. AU<br />

major components ofTecate, as weU as entities appear­<br />

ing in virtual worlds created by Tecate, are objects that<br />

communicate with one another by means of message<br />

passing. The majn focus within Tecate is on object­<br />

object interactions. These interactions occur primarily<br />

when objects send messages to one another. An object<br />

can also send a message to itself, which has the effect of<br />

making a local function cal l. Unlike with graphics<br />

systems such as Open Inventor, rendering is not a cen­<br />

traJ activity wi thin Tecate; rather it is just a side effect<br />

DigiraJ <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 67


68<br />

of object-object interactions -" In this sense, Tecate is<br />

like virtual reality programming systems such as Alice<br />

and MR/OML, although Tecate is far more fl exible.<br />

In the Tecate syste m, objects can create and destroy<br />

other objects and can alter the properties of existing<br />

objects on-the-fly. Sud1 capabilities make Tecate vcrv<br />

extensible and give it great power and fl exibility. These<br />

capabilities can also cause problems fo r applications<br />

programmers, however, if care is not taken 'vhen writ­<br />

ing programs. Presently, all of an object's properties<br />

are visible to all other objects, and hence those proper­<br />

tics can be manipulated fi-om outside the object. In the<br />

fi.1ture, some form of selective property hiding needs<br />

to be added so that designated properties of an object<br />

cannot be altered by other objects.<br />

A powerful teature ofTecatc is irs ability ro dynami­<br />

cally establish object-subobjecr relationships. This fea­<br />

ture provides a mechanism for bui ldi ng assemblies of<br />

parts similar to the mechanisms in classical hierarchical<br />

graphics systems like Don� or Open lnvcnror.7 This<br />

tcatun: also provides the capability of creating sets or<br />

aggregates of objects rhar share some trait, such as<br />

being highlighted. Te cate allows all objects within a set<br />

to be treated en masse by providing a means of se lec­<br />

tively broadcasting messages ro groups of objects.<br />

A message that is sent to an object can be fixwarded<br />

to all the object's subobjccrs. Thus, for example, one<br />

object can serve as a container fix all other objects that<br />

are highlighted; the highlighted objects arc merely sub­<br />

objects of the container. To unbighlight all highlighted<br />

objects, a single unhighlight message can be sent to the<br />

container object, which then fcmvards the message to<br />

al l its subobjccts. In general, an object can be the sub­<br />

object of any nu mber of other objects and thus simulta­<br />

neously be a member of many diffe rent sets.<br />

The handling of user input within Tecate is<br />

intended to appear the same as ordinary object-object<br />

interactions. Al l physical input devices that are known<br />

to Tecate have an agent object associated with them<br />

th at acts as a device handler. All objects that wish to<br />

be informed of a particular input event register with<br />

the appropriate agent. When an input event occurs,<br />

the agent sends all registered objects a message notit)r­<br />

ing them of the event. Complex events, such as the<br />

occurrence of event A and event B within a specified<br />

time period, can easily be defined by creating new han­<br />

dler objects. These handlers register to be informed of<br />

separate events but then, in turn, inform other objects<br />

of the events' conjunction.<br />

The Object Model<br />

Te cate uses an object model in which no distinction<br />

is made between classes and instances, as is done in<br />

languages like C++ .8 In Tecate, there is a single object<br />

creation operation called cloning. Any object in the<br />

system can serve as a prototype fi-om which a copy can<br />

be made through the clone operation. A clone inherits<br />

<strong>Digital</strong> T�chnical )ourn


compute-intensive behaviors or whose behavior execu­<br />

tions arc time-critical are generally implemented as<br />

resource objects. For instance, most Te cate objects that<br />

provide system services, such as rendering or database<br />

management, are implemented as resource objects.<br />

Objects populating virtual worlds that represent<br />

data fe atu res are implemented diffe rently than<br />

resource objects by using an interpretive program­<br />

ming language ca lled the Abstract Visualization Lan­<br />

guage (AVL). Such objects are called dynamic objects<br />

because they may be created, destroyed, and altered<br />

on-the-fly as a Tecate session unfolds. Nonetheless,<br />

the ability to dynamically add, remove, and alter object<br />

properties is nor solely endemic to dynamic objects.<br />

Resource properties may also bt: changed on -the-tly<br />

because resources are actually implemented with a<br />

dynamic object that interfaces to the portion of the<br />

resource that is implemented as an external process.<br />

The Abstract Visualization Language<br />

AV L is essential to the Tecate system; it is through AVL<br />

that applications programmers write applications that<br />

usc Tecate's ti.:arures.'' AVL is an interpretive, object­<br />

oriented programming language that is capable of<br />

performing arbitrary computations and facilitating<br />

communication among diffe rent processes. Through<br />

this language, applications programmers spcci f)� and<br />

manipulate object properties and invoke object lx :hav­<br />

iors by se nding messages fr om one object to another.<br />

AV L is a typelcss language that manipulates char­<br />

acter strings; it is based on the Tel embeddable<br />

command language.'" AVL extends Tel by adding<br />

objcct-oricnn.:d program ming support, 3-D graphics,<br />

and a more sophisticated event-handling mechanism.<br />

Although AVL is a proper supe:rser ofTcl, the relation ­<br />

ship between AVL and Tel is much like that between<br />

C and C++. By adding a small set of new constructs<br />

to Te l, the way applications programmns structure<br />

AVL progra ms differs markedly !Tom how they struc­<br />

ture Tel progra ms, just as the C+ + language exten­<br />

sions to C: greatly alter the C programming style.<br />

One usc of AV L is to describe virtual worlds that<br />

represent data sets. Through AV L, objects that popu­<br />

late these worlds can be assigned bt:haviors that are<br />

elicited through user interaction . �or instann.:, select­<br />

ing a 3-D icon can cause a Universal Resource Locator<br />

(UlU,) to be fo llowed out into the WVVVV . In this<br />

sense, AV L is somewhat like the Hypertext Markup<br />

Language ( HTM L) that underlies all Web browsers<br />

today, or, more fi ning, it is similar to the Virtual<br />

R.c:�liry Modeling Language (VRML) that has been<br />

proposed as a 3-D analog ofHTML.'' AVL does, how­<br />

ever, differ marked ly !Tom HTM L and VRNlL, which<br />

Jrc only markup l:�nguages. Bec:�use AVL is a fu ll­<br />

tl edged programming language that Ius sophisticated<br />

interaction h:�ndling built in, it is philosophically more<br />

similar to interpretive languages like Telescript,<br />

NewtonScript, and PostScript. "·"·'4 Like Te lescript,<br />

fo r instance, AVL programs can encode "smart<br />

agents" that can be sent across a network to perform<br />

user tasks at a remote machine, if an AVL interpreter<br />

resides there . Note, however, that in the present ver­<br />

sion of Tecate, there is no notion of security when<br />

arbitrary AVL code runs on a remote machine.<br />

AVL includes some additional commands that aug­<br />

ment the Te l instruction set, fo r instance, clone and<br />

delete. The clone command is the object creation com­<br />

mand within AVL, and the delete command is the com­<br />

plementary operation to delete objects fi-om the<br />

syste m. Object properties are specified and manipu­<br />

lated using the add command and deleted using rhe<br />

remove command . Behaviors in one object are initiated<br />

by another object using the send command, which<br />

specifies the behavior to invoke and the argu ments to<br />

be passed . Queries about object properties can be<br />

made using the inquire command. The which com­<br />

mand is used to determine where an object's properties<br />

are actually defined in light of Tecate's use of delega­<br />

tion to resolve property references. Finally, AVL pro­<br />

vides a rich set of matrix and vector operators that arc<br />

useful when positioning objects "�thin 3-D scenes.<br />

As an example of how AVL is used in practjce,<br />

Figure l depicts a code fTagmcnt similar to one that<br />

appears in the vV'vV'vV application described later in the<br />

paper. The code fi-agmenr creates a 3-D Web site icon<br />

that is positioned on a world map. The code begins<br />

with the definition of the Hyper/ink object fi-om which<br />

all Web site icons arc cloned. The Hyper/ink object is<br />

itself cloned !Tom the Visual object that is predefined<br />

by Tecate at system start-up. The Visual object con­<br />

tains properties that relate to the viewing of objects<br />

within scenes. h)f instance, objects that are cloned<br />

fi-om the Visual object inherit behaviors to rotate<br />

themselves and to change their color. To the proper­<br />

ties that are inherited from the Visual object, the<br />

Hyper/ink object adds the state variables uri and desc,<br />

which will be used to store respectively a URL and irs<br />

textual description. In addition, objects cloned !Tom<br />

the !-Iyperlink object will inherit the default appear­<br />

ance of a solid blue sphere having unit radius.<br />

The specification fo r the Hyper/ink object also<br />

defines three behaviors: ini/, openUrl, and showDesc.<br />

The inil behavior replaces the inil method inherited<br />

!Tom the Visual object. When an object cloned !Tom<br />

the Hyper/ink object receives an init message, it sets irs<br />

uri and desc state variables, positions itself within the<br />

scene whose name is given by the argument scene, and<br />

registers itself with the mouse handler agent to receive<br />

two events. When mouse bu tton l is depressed , the<br />

agent sends the object the open Uri message, which in<br />

turn requests the W'vVW Interface to fetch the data<br />

pointed to by the object's URL. Depressing button 2<br />

invokes d1e shOti'Desc message, causing the Web site<br />

URL and description to be displayed by a previously<br />

<strong>Digital</strong> Tt:x hnicJI /ournJI Vo l. 7 No. 3 <strong>1995</strong> 69


70<br />

# Define a prototype for all Web icons<br />

clone Hyperlink Visua l<br />

add Hyperlink<br />

}<br />

state {<br />

u r L ""<br />

}<br />

desc ""<br />

appearance {<br />

}<br />

shape {sphere}<br />

diffuseColor {0.0 0.0 1 . 0}<br />

repType {surface}<br />

behavior {<br />

# Ini tialize hyper link<br />

}<br />

init {url desc pos scene wi ndow} {<br />

}<br />

addstate url $url<br />

addstate desc $desc<br />

send [getselfJ move "add $pos "<br />

add $scene "subo bject [get self]"<br />

send $wi ndow addEvent "[getse lfJ {Button-1 {openUrl {}}} {Button-2 {showDesc {}}}"<br />

# Open the URL<br />

openUr l {} {send www fetch "[getstate url J"}<br />

# Display the desc ription<br />

showDesc {} {send me taViewer display "[getstate descJ''}<br />

# Ini tialize an informationa l Landscape<br />

clone scene Vi sua l<br />

clone window Viewer<br />

send wi ndow init {scene}<br />

# Create a Web site i con<br />

clone hlink Hyperlink<br />

send hlink init {"http: //www.sdsc . edu/ Home . html"<br />

"SDSC home page" "-2.3 -2 .0 1 . 0" scene wi ndow}<br />

# Use the SDSC mode l geometry<br />

add hlink {appea rance {shape {box}}}<br />

Figure 1<br />

An Implementation of a World Wide Web Icon in the Abstract Visualization Language<br />

defined interface widget ca lled the meta Viewer. The<br />

AVL command getself. which is used within the init<br />

behavior body, returns the name of the object on<br />

which the behavior was called, thus allowing applica­<br />

tions programmers to write generic behaviors. The<br />

other AVL commands, getstate and addstate, are<br />

shorthand fo r "get [getself) state ... " and "add [getself]<br />

{state . . . l."<br />

Once the HyJX>rlink object is defined, a scene, a dis­<br />

play window, and a Web site icon are created . The<br />

Tecate scene object is cloned fi-om the Visual object.<br />

The window object, cloned fi-om the predefined<br />

Viewe1· object, is the viewport into which the scene is<br />

to be rendered. Finally, blink is a We b site icon whose<br />

appearance difters from that which is inherited from<br />

the Hyper/ink object. Rather than being spherical, the<br />

shape of the blink icon is a unit cube.<br />

<strong>Digital</strong> Tcchnic:d Journal Vo l. 7 No. 3 <strong>1995</strong><br />

Tecate's Architecture<br />

The general structure of Tecate and how it relates to<br />

application progra ms is depicted in Figure 2. Tecate<br />

consists of a kernel, a set of basic system services, and a<br />

toolkit of predefined objects. The Tecate kernel, which<br />

is shown in Figure 3, is an object management system<br />

called the Abstract Visualization Machine; AVL is its<br />

native language. The Abstract Visualization Machine<br />

is responsible fo r creating, destroying, altering, ren­<br />

dering, and mediating communication between<br />

objects. The two major components of the Abstract<br />

Visualizarjon Machine are the Object Manager and<br />

the Rendering Engine.<br />

The Object Manager is the primary component of<br />

the Abstract Visualization Machine. It is responsible fo r<br />

interpreting AVL programs, managing a database of


APPUCA TIONS<br />

TECATE<br />

Figure 2<br />

The Tecate System and Its Rcbrionship to Application<br />

Programs<br />

objects, mediating communication between objects,<br />

and intcrbcing with input devices. The Object<br />

Manager is itself a resource object that is distinguished<br />

by the fact that all other resource objects are spawned<br />

fi-om this om: objecr. In addition, the Object Manager<br />

is responsible to r creating a distinguished dymmic<br />

object, called Root. fi-om which aU other dynamic<br />

objects can trace their heritage through prototype­<br />

clone relationships.<br />

The Object Manager is implemented on a simple,<br />

custom-built thread package. Each object within<br />

Tecate can be thought of as a process that has its own<br />

thread of control. Each thread can be implemented<br />

either as a lightweight process that shares the same<br />

machine context as the Object Manager's operating<br />

system process or as its own operating system process<br />

separate fi-om that of the Object Manager. Lightweight<br />

processes are so named because their use requires little<br />

system overhead, which enables thousands of such<br />

processes to be active at any given time. Within Tecate,<br />

dynamic objects arc implemented as lightweight<br />

Figure 3<br />

INTELLIGENT<br />

VISUALIZATION<br />

SYSTEM<br />

BIGRIVER<br />

RENDERING<br />

ENGINE<br />

processes, whereas resource objects arc implemented<br />

as heavyweight operating system processes, which may<br />

or may not be paired with a lightweight, adjunct<br />

process. A low-level function library is provided to<br />

handle the creation and destruction of threads and<br />

to handle interthread communication regardkss of<br />

how the threads are implemented.<br />

Closely allied with the Object Manager is the Ren ­<br />

dering Engine, which is a special resource object<br />

wholly contained \vi thin the Abstract Visualization<br />

Machine. The Rendering Engine is responsible fo r<br />

creating a graphical rendition of a virtual world that is<br />

specified by AVL programs interpreted by the Object<br />

Manager. When interpreting an AV L program, the<br />

Object Manager strips off appearance attributes of<br />

objects and sends appropriate messages to the Ren­<br />

dering Engine so that it can maintain a separate display<br />

list that represents a virtual world. Display lists are rep­<br />

resented as directed, acyclic graphs whose connectivity<br />

is determined by object-subobject relationships that<br />

are specified \vi thin AVL programs.<br />

The present Rendering Engine implementation<br />

uses the Don� graphics package running on a DEC<br />

3000 Model 500 workstation.7 The display lists that are<br />

created by invoking behaviors within the Rendering<br />

Engine are actually built up and maintained through<br />

Dorc. The set of messages that the Rendering Engine<br />

responds to represents an interface to a platform's<br />

graphics hardware that is indepe ndent of both the<br />

graphics package and the display device.<br />

Layered on top of the Abstract Visualization<br />

Machine arc Tecate's system services and the object<br />

toolkit. The system services consist of a col lection of<br />

resource objects that are automatically instantiated at<br />

system start-up. These resources include an expert<br />

system ca lled the Intelligent Visualization System,<br />

the Database Interface, the vVWW Interface , and a<br />

OBJECT MANAGER<br />

ABSTRACT VISUALIZATION MACHINE<br />

DATABASE<br />

INTERFACE<br />

www<br />

INTERFACE<br />

Derail ofTecare's Kernel (the Abstr:Kt Visualization Machine) and the System Services Provided by Tecate<br />

<strong>Digital</strong> TcchnicJI journal Vol. 7 No. 3 <strong>1995</strong> 71


72<br />

visualization programming system called BigRi,·cr.<br />

Figure 3 shows these resources in relationship to<br />

Tecate's kernel . Each resource is a Tecate object that<br />

has a number of predefined be haviors that can be use­<br />

ful to applications programmers. For instance, the<br />

W'vVW Interface has a be havior that fe tches a data tile<br />

referred to by a URL and then translates the file's con­<br />

tents into an appropriate AVL program.<br />

The tool kit within Tecate is a set of predefined<br />

dynamic objects that program mers can use to develop<br />

applications. These objects arc considered abstract<br />

objects in the sense that they arc not intended to be<br />

used directly. Rather, they serve as prototypes from<br />

which clones can be created. The rool kir consists of<br />

objects such as viewporrs, lights, and cameras that arc<br />

used to illuminate and render virtual worlds. The<br />

toolkit also contains a modest collection of 3-D usn<br />

interface widgets that can be used wirhin virtual<br />

worlds created by an applications programmer. These<br />

widgets include sliders, menus, icons, legends, and<br />

coordinate axes.<br />

One usefld object in the toolkit that aids in simulat­<br />

ing physical processes and helps in perfixming aoinu­<br />

tions is a clock. This object is an event generator thar<br />

signals every clock tick. If objects wish ro be informed<br />

of a clock pulse, those objects register rhemsch-cs with<br />

the clock object just like objects register themselves<br />

with input device agent objects. The dcbult clock<br />

object can be cloned , and each clone can be instanti­<br />

ated with a difkrcnr dock period down to J resolution<br />

of one millisecond. Any number of clocks em be tick­<br />

ing si multaneously during a Tcc:�tc session . Since new<br />

clocks can be created dynamically, and objects can reg­<br />

ister and unrcgister to be int(mncd of clock pulses<br />

on-rhe-fly, clocks can be used as timers and triggers,<br />

and as pacesetters.<br />

Application Resources<br />

Tcc:-tte's system services arc prcdctincd application<br />

resources that aid in interactivc:ly visualizing data. As<br />

mentioned previously, these objects include the I nrcl­<br />

ligcnr Visualization System, the Database I nrerbcc,<br />

the WVVW Interface, and rhe BigR.iver visualization<br />

programming system. In addition, :111 applications pro­<br />

grammer can casilv add new :-tpplicarion resources<br />

using tools provided with the base Tecate system. Such<br />

new resources can be built around either user-written<br />

programs or commercial oft�thc-shelf applic1tions.<br />

To create a new application resource, a programmer<br />

needs to provide a set oHuncrions that c:-tn be invoked<br />

by other Tecate objects. These fi.nKtions correspond<br />

to behaviors that are called when rhc resource recci,·cs<br />

a message from other objects. Tools arc provided<br />

to register the behaviors wirh Tecate and ro manage<br />

the communication bet\\'een �1 resource and other<br />

TecJte objects. 1'<br />

Vol. 7 :--Jo. 3 l \)


models, user tasks, and visualization techniques. The<br />

Planner functions by constructing a sentence within a<br />

dataflow language defined by a context-sensitive graph<br />

grammar. At each step in the construction of the sen­<br />

tence, rules in the knowledge base dictate which pro­<br />

ductions in the grammar are to be applied and when.<br />

Presently, the knowledge base is implemented using<br />

the Classic knowledge representation system; the<br />

Planner is implemented in CLOS.'u"<br />

BigRiver<br />

The dataflow program produced by the Inte lligent<br />

Visualization System is written in a scripting language<br />

that is interpreted by BigRivcr, a visualization pro­<br />

gramming system similar to AVS and Khoros.�.' hom<br />

a technical standpoint, BigRiver is not particularly<br />

innovative and will eventually be reimplementcd using<br />

some existing visualization system that has mon: fi.mc­<br />

tionality. The reason that BigRivcr was created from<br />

scratch was to better understand how existi ng visual­<br />

ization programming systems work and ro ovncome<br />

limitations within those systems. These limitations are<br />

their inability to bt: t:mbeddcd within other applica­<br />

tions, their lack of comprehensive data models, and<br />

their inability to work with user-supplied renderers.<br />

The latest generation of visualization programming<br />

syste ms, such as Data Explorer and AVS/ Express,<br />

overcome many of these limitations.,_,,.<br />

Like most of the existing visualization syste ms,<br />

BigRiver consists of a collection of proced ures called<br />

modules, each of which has a wdl-dcfined set of inputs<br />

and ou tputs. Fu nctional specifications fo r these mod ­<br />

ules represent some of the knowledge contained in the<br />

Intelligent Visualization System's knowledge base.<br />

Visualization scripts that are interpreted by BigRiver<br />

specif)' module parameter values and dictate how the<br />

outputs of chosen modules arc to be channeled into<br />

the inputs of others.<br />

BigRiver modules come in tlm:e varieties: l/0, data<br />

manipulators, and glyph generators. AJI modules use<br />

se lf-descri bing data tc>rmats based on fiber bund les.<br />

One t() rmat is used t( )r manipulation within memory;<br />

the other is an on -the-wire encoding intended tor<br />

transporting data across a network. An input mod ule<br />

is responsible to r converting data stored in the on-the­<br />

wire encoding into the in-memory fo rmat. The data<br />

manipulator mod ules transform tlber bundles of one<br />

in-memory fo rmat into those of another. The glyph<br />

generators take as input fiber bund les in the in-memorv<br />

t(xmat and produn: AVL programs that when necuted<br />

build virtual worlds containing objects th;lt represent<br />

fe atu res of selected data sets. A single dispby module<br />

takes as input AV L code and p:1sses it to the Abstract<br />

Visualization Machine. Bv means of the Rendering<br />

Engine, the Abstr:Jct Visualization Machine uses the<br />

appearance attributes of objects to create an image of<br />

a virtual world that contains the objects.<br />

The Database Interface<br />

The Database Interface provides the means to interact<br />

with a database management syste m, which in the cur­<br />

rent version of Tecate can be either POSTGRES or<br />

Illustra.'"·N Database queries, written in POSTQ UEL<br />

fo r POSTGRES-managed databases or in SQL for<br />

Illustra databases, are sent to the Database Interface by<br />

Tecate objects where they are passed to a database<br />

management system server for execution. The server<br />

returns the query results to the Database Interface,<br />

which then attempts to package them up as an on-the­<br />

wire encoding of a fiber bundle buff-ered on local disk.<br />

If the result is a set of tu ples in the standard format<br />

returned by POSTGRES or lllustra, the Database<br />

Interface pcrf(>rms the fiber bundle translation. For<br />

most otJ1er nonstandard results, the so-called bi nary<br />

large objects (RLOBs) of the database realm, the<br />

Database Interface cannot yet arbitrari ly perform the<br />

translation into the on-the-wire fiber bundle encod­<br />

ing. The only BLOBs that the Database Interface can<br />

deal with presently arc those that arc already encoded<br />

as on-the-wire fiber bundles. The difficult problem of<br />

automated data f(mnat translation was not addressed<br />

during Tecate's initial development, although the<br />

intent is to address this issue in the fu ture.<br />

Once query results arc buffered on disk, a descrip­<br />

tion of the fiber bundle and the location of the buffer<br />

are sent back to the object that made the query<br />

request of the Database I ntcrfacc. That object might<br />

then request the ]ntc lligcnt Visualization System to<br />

structure a virtual world whose image would appear<br />

on the display screen by way of BigRivcr and the<br />

Re ndering Engine. Objects in the vi rtual world can be<br />

given behaviors that arc elicited by user interactions.<br />

These behaviors might then result in further database<br />

queries and so on . Chains of events such as these pro­<br />

vide a means tor browsing databases through direct<br />

manipu lation of objects within a virtual world.<br />

The World Wide Web Interface<br />

The WWVV lntertace fi.mctions similarly to the<br />

Database I ntcrface but instead of accessing data in<br />

a database, the WWVV Interface provides access to data<br />

stored on the World Wide We b. Messages that contain<br />

U RLs arc passed to the W'vVW Interface, which then<br />

fe tches the data pointed to by the URLs. In retrieving<br />

data fr om the \Ve b, rhe vVWW lntertace uses the same<br />

CERN sothvJre libr:1rics used by Web browsers like<br />

Netscapc.<br />

Once a data ti le is fetched , the 'vV'vVW Interface<br />

attempts to translate irs contents into an AVL pro­<br />

gram, which is then passed to the Object Manager fo r<br />

interpretation. AVL either specifies the creation of<br />

a new virtual world that represents the data file's con­<br />

tents or specifies new objects that are to populate the<br />

current world being viewed. lf the fetc hed data file<br />

contains a stream of AV L code, the Vv'vVW Intcrbce<br />

Vo l. 7 No. 3 !99S 73


74<br />

merely fo nvard s the file to the Object Manager. If the<br />

file contains general data in the form of an on-the-wire<br />

encoding of a fiber bundle, the WWVV lntcrtacc<br />

appeals to the Intelligent Visualization System to<br />

structure an appropriate virntal world. If the data file<br />

contains a stream ofHTML code, the WWW Inrerbce<br />

invokes an internal translator that translates HT1Vl l.<br />

code into an equivalent AVL program, which is then<br />

interpreted by the Object Manager. This interpreter<br />

actually understands an extended version of HTM L<br />

that supports the direct embedding of AVL within<br />

HTM L documents. Through this mechanism, 3-D<br />

objects with which users can interact can be embedded<br />

directly into a hypertext We b page-something that<br />

fe w if any other We b browsers can do today.<br />

Example Applications<br />

Applications that browse the contents of data spaces<br />

and then interactively visualize selected results have<br />

the same overall structure. One browser application<br />

component acts as a data space interface, and through<br />

this interface queries arc posed, query results arc<br />

imported into the application, and data generated by<br />

the application is stored back into a data space. Once<br />

data has been imported into the Jpplication , a second<br />

component must map the data into some appropriate<br />

virtual world . Finally, J third component must manage<br />

any interactions that may take place between an end<br />

user and elements that popu late the virtual worlds that<br />

arc created.<br />

In creating an application using Tecate, the DatabJse<br />

Interface and the vVWW Interface represent resources<br />

that can be used to form the appl ication's data space<br />

interfJce. The mapping of dJta into a representative<br />

virtual world can utilize Tecate's Intelligent Visuali­<br />

zation System and the BigRiver visualization progrJm-<br />

Figure 4<br />

PLAN VISUALIZATION<br />

INTELLIGENT PLAN<br />

VISUALIZATION !-+--- -- --,<br />

SYSTEM<br />

BIG RIVER<br />

OBJECT<br />

MANAGER<br />

EXECUTE<br />

SCRIPT<br />

VISUALIZATION<br />

SET<br />

INTERPRET<br />

ABSTRACT<br />

VISUALIZATION<br />

LANGUAGE<br />

PARAMETERS<br />

Message Flow between Important Objn:rs in rhc Earrh Science Applic1rion<br />

<strong>Digital</strong> Tt:c hnical Journal Vol. 7 !'Jo. 3 I 995<br />

ming system. Finally, the manJgement of these worlds<br />

can take place through AVL programs that exercise the<br />

katures of Te cJte's Abstract Visualization Machine.<br />

The following two examples tl1at were implemented in<br />

AV L illustrate how Tecate can be used to create applica­<br />

tions that browse data spaces.<br />

Visualizing Data in a Database<br />

A simple example of an application that exploits<br />

Tecate's features is one that browses fo r earth science<br />

data in a database and then provides visualizations of<br />

that dJta. The initial user interface fo r this application is<br />

built using a collection of user interface \vidgers, where<br />

each widget is a Tecate dynamic object. Because the<br />

Tecate system docs not yet have a comprehensive 3-D<br />

widget set, some widgets sti ll rely on two-dimensional<br />

( 2-D) constructs provided by the Tk \vi dget set that<br />

is implemented on top of the Te l language:'"<br />

Figure 4 depicts the tlow of messages between some<br />

of the more important objects that are used within the<br />

application. One object is the Map Query Tool that<br />

is used to make certain graphical queries for earth<br />

science data sets whose geographical extents and rime<br />

stamps fa ll within user-specified constraints. The tool<br />

is built around a world map on which regions of inter­<br />

est can be specified (see figure 5). When a user marks<br />

a region of interest on the map and selects a temporal<br />

rJnge, a query message is sent to the Database<br />

lnrertace. The result of the query is returned to the<br />

Map Query Tool, which then torwards a description<br />

of the res ult to the Intelligent Visualization System.<br />

To structure an appropriate visualization, an inferred<br />

select task directive accompanies the re sult. The ensu­<br />

ing script produced by the Intel ligent Visualization<br />

System is executed by BigRiver, which produces a<br />

stream of AVL code that is sent to the Abstract<br />

Visualization Machine tor interpretation.<br />

QUERY<br />

QUERY<br />

RESULT<br />

QUERY<br />

QUERY<br />

RESULT<br />

DATABASE<br />

INTERFACE<br />

KEY:<br />

0 DYNAMIC OBJECT<br />

Cl RESOURCE OBJECT<br />

-+-- MESSAGE FLOW


Figure 5<br />

The Map Query Tool Sho\\'ing a Vtstdization of a Qucrv Res ult<br />

This AVL program creates a new vi rtu::d world that<br />

consists of a collection of 3-D objects. Each object acts<br />

as an icon that corresponds to one data set that was<br />

returned as the result of the initial query (sec hgure<br />

5). The Intelligent Visualization System also builds<br />

in rwo behaviors t(>r each icon . Depending on how<br />

a user selects an icon, either the mctadata associated<br />

with the data scr represented by the icon is dispiJvcd in<br />

a separate window or a query message is sent to the<br />

Database lnterbcc requesting the actual data. In<br />

the latter case, the Map Query Tool again t(>rwards<br />

r.he qucrv result to the I ntelligcnr Visualization System,<br />

and another virtual world contJining objects representing<br />

data kJturcs is created and displavcd with<br />

the aid of BigR.ivcr Jnd the Abstract Visualization<br />

Machine. In generJI, lhta exploration proceeds this<br />

way by creating and discarding virtual worlds based on<br />

interactions with objects that populate prior worlds.<br />

After selecting Jn icon to actuJIIy view the data associated<br />

with it, an end user is Jskcd bv the I ntelligcnt<br />

Visualization System to input a task spccitication using<br />

a Task Editor. Generally, data sets can be visualized in<br />

many diffl:rent ways. The Intelligent Visualization<br />

Svstem uses the task specification ro select the one<br />

1·isualization that best satisfies the stated task. After<br />

J task specification is entered, a visualization of the<br />

selected data set appears on the screen. The BigRiver<br />

dataflow program that the Intelligent Visualization<br />

System creates ro do that visualization can be edited by<br />

h:md by knowledgeable end users to override the decisions<br />

made by the system.<br />

Figure 6 shows a Task Editor and a visualization<br />

crafted by the Intelligent Visualization System after an<br />

end user selected a data-set icon. The visualization represents<br />

hydrological dara that consists of a collectjon of<br />

tuples, each corresponding to a set of measurements<br />

made at discrete geographical locations. Based on<br />

the task specification that the end user entered, the<br />

[ntelligenr Visualization System chose to map the data<br />

into a coordinate system that has axes that represent<br />

latitude, longitude, and elevation. Each sphere represents<br />

an individual measurement site, whose color is<br />

a fi.mction of the mean temperature. When an end user<br />

selects a sphere, the actual data values associated with<br />

Digirc1l <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 75


76<br />

Figure 6<br />

Task Editor Sho\\'ing a Visualizatio11 of Hnlrologiral DaLl<br />

the location represented 1)\' the sphere JJ-c displavcd .<br />

In addition, the Imclligc nr Visualiz;1tion Svstcm Juto­<br />

matically places into the l'irtual world ofrhe visu:tliza­<br />

tion ;1 color legend to help relat-e sphere colors to mea11<br />

tcmperatun: v:tlues.<br />

hgurc 7 depicts another 1 irrual world showing J<br />

visu:tlization ohbta-sct output h·om a regional clim;1te<br />

model program. The d a u set is ;J 3-D arrav indexed b1·<br />

l atitude, longitude, and cle1·ation . Fach arrav clement<br />

is a tuple thJt contains cloud densi tv, water content,<br />

and temperature I'Jiues. In this 1nst;1ncc, the end user<br />

enrered J task specincnion that st;ned that the spatial<br />

variation in te mperature was of primarv illlJlOrtance.<br />

The Inrelligent Visualization System responded lw<br />

specif-\·ing J l'isualizJtion that represemed the te mper­<br />

atun: datJ as an isosurbcc, i .e . , a surbce 11·hose poi nts<br />

all h:tve the s:trne 1·aluc f(Jr the temperature. Included<br />

in the 1·irtual world is ;l widget tklt can be used to<br />

change the isosurbcc I';Jiue and the field l';lriable th;n<br />

is being srudied .<br />

The isosurbcc widget that :tppc:trs in the l'isualiz;l ­<br />

tion shown in Figure 7 is ofspeci;ll interest bcc:�use of<br />

the way that it is implemented. Embedded in the tool<br />

is a slider that is used to change the isosurbcc 1·aluc. As<br />

with most sliders, the slider ,·:�luc indicator ;J utom;Jti­<br />

cal ly moves when ;J mouse button is held down while<br />

Digiral Tcdmiol )< n1rn�l<br />

poinring at one ofrhe slider ends. To achieve this sim­<br />

ple ;J nimation, Tecate's clock object is used. When rhe<br />

mouse burron is firs� depressed while the cursor is over<br />

a slider end, rbe slider indicator registers itself to be<br />

informed of clock ricks. From then on, at every clock<br />

rick, rhe indicator receives an update message fi-om the<br />

clock, at which time the indicator repositions itself and<br />

increments or d.eerernenrs the current slider value.<br />

VV hen the mouse burron is released, rhe slider sends<br />

a message to BigRiver indicating that a new isosurtace<br />

is ro be calculated and d isplayed. In addition, the slider<br />

indiutor unregisters itself f!·om the clock signali ng<br />

thJr it no longer is to receive the update messages. In<br />

general, applications em use this same clock mecha­<br />

nism to perf(mn more e la borate animations.<br />

A 3-D World Wide Web Browser<br />

In the Teute Web browser, exploration of the World<br />

Wide Web ami irs comcnrs occurs by placing an end<br />

user onto an informational landscape. This landscape<br />

is ;1 3-D 'irtual world whose appearance reflects the<br />

coment and the structure of a designated subset of the<br />

entire Web. Upon applicnion start-up, an end user<br />

is presented with an initial informational landscape<br />

that consists of a p!Jn;lr map of rhe earth embedded<br />

in a 3-D space, ;l S shown in Figure 8. In general, rhe


Figure 7<br />

Task Editor Sh owing a Visualization of Regional Climate Dat3, I nduding an lsosurtacc 3nd cl User Inrcrt , lcc Widger<br />

initial informational landscape can be any 3-D scene<br />

and does not have to be geographically based. For<br />

instance, an intormational landscape might be a virtual<br />

library where books on shelves serve as anchors fo r<br />

hyper! inks to diffe rent Web sites.<br />

In the present browser application, selected Web<br />

sites appear as 3- D icons on the world map. These<br />

icons arc positioned either in locations where Web<br />

servers physically reside or in locations referenced<br />

within Web documents (see Figure 8). A user places<br />

information that describes these sites into a database<br />

that serves as an elaboration of the hot list of current<br />

hypertext- based browsers. When the browser applica­<br />

tion is first started, it sends a query fo r the initial com­<br />

plement of Web sites to the Database Interface. The<br />

browser application then invokes a BigRiver script that<br />

visualizes the results by placing icons representing<br />

each site onto the world map.<br />

Suspended above the world map is a 3-D user inter­<br />

fa ce widget that is used to query a database of Web<br />

sites that are of interest to an end user (see Figure 8).<br />

This database, where the initial set of Web sites is<br />

stored, includes information such as URLs, keywords,<br />

geographical locations, and Web site types. Currently,<br />

Figure 8<br />

Te cate Web Browser Intrmarion3l L�1ndscapc Showing<br />

WW\V Sites Depicted as 3· D Icons on a Map of the World<br />

<strong>Digital</strong> Tc..:hni..:al Journal Vol. 7 �o. 3 J Y95 77


78<br />

individual users arc responsible tor maintaining their<br />

own databases by adding or removing \Ve b sire entries<br />

by hand. An automated me:ms to r building these LhtJ�<br />

bases can be easily added to the broll'ser appliurion so<br />

that Web intormation co uld be Jccu mulated based on<br />

where and when an end user rra\'cls on the Web.<br />

Duri ng a browsing session, rhc Web Querv Tool<br />

allows arbi trary SQL queries to be posed ro the<br />

database by an end user. ln addition, the Web Query<br />

Tool has provisions to allow pac kaged queries ro be<br />

initiated by a simp le click of a mouse button . ln both<br />

cases, queries are sent ro the Dat


Figure 10<br />

BIGRIVER<br />

INTERPRET<br />

ABSTRACT<br />

VISUALIZATION<br />

LANGUAGE<br />

EXECUTE SCR IPT<br />

INTERPRET<br />

OBJECT<br />

ABSTRACT<br />

MANAGER VISUALIZATION<br />

LANGUAGE<br />

KEY:<br />

0 DYNAMIC OBJECT<br />

c=J RESOURCE OBJECT<br />

- MESSAGE FLOW<br />

Message Flow between J mportanr Objects in the Web Browser<br />

Figure 11<br />

WEB QUERY )<br />

TOOL<br />

-<br />

DATA SET<br />

ICON<br />

J<br />

QUERY<br />

QUERY RESULT<br />

FETCH URL<br />

FETCH RESULT<br />

DATABASE<br />

INTERFACE<br />

www<br />

INTERFACE<br />

Results of a Tecate Browsing Session Showing a Hyperlink and a Forest of Pyramids That Re presents the User's Travels<br />

on the Web<br />

<strong>Digital</strong> <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 79


80<br />

are cloned !Tom the same HypCI·Jink prototype as the<br />

Web site icons. If another HTML document is<br />

retrieved by fo llowing a hypcrlink, tlut document<br />

is viewed on the base of :mother inverted pvramid<br />

whose apex rests on the selected text and so on (sec<br />

Figure l l ) . Rather than having to page back and t(xth<br />

between hypertext docu ments as with most hypl:rtot·<br />

based browsers, in Te cate, an end user needs only to<br />

move about the virtual world to gain an appropriate<br />

viewpoint from which to examine a desired document.<br />

Overall, as shown in figure I I, J browsing session<br />

with Tecate's vVe b browser results in a t(m::st of pvLl­<br />

midal structures that represent a pictorial historv of<br />

an end user's trave ls on the We b.<br />

Although Te cate's Web browser is capable of view­<br />

ing HTML documents, irs main purpose is not to<br />

emubte what can currently be done using hypertext­<br />

based browsers, albeit using 3-D. R.1 thcr, the new<br />

browser is intended to visualize primarilv more com­<br />

plex types of data. When d�1ra docs not consist of<br />

a stream ofHTML code, the WWW Interrace attempts<br />

ro visualize what was returned h·om the 'We b. These<br />

visualizations can take place in virtual worlds separate<br />

!Tom the intormationa.l landscape !Tom where the data<br />

Figure 12<br />

Example of a We b Document with Embedded 3-D Vi rtml World<br />

<strong>Digital</strong> T.:chniund in visualization systems, network browsers,<br />

datJbase ti·orlt ends, and virtual reality systems. As a<br />

first prototype, Tecate was created using a breadth-first<br />

development strategy. That is, developers deemed it<br />

esse ntiJI to first understand what components were<br />

needed to build a general data space exploration utility<br />

and then determine how those components interact.


The Tecate car demo<br />

<br />

<br />

The Teca te car demo<br />

<br />

# Globa l variables<br />

globa l TEC_W EB_ PARENT TEC WEB_W IN<br />

set path "/projects/s2k/sharedata"<br />

# Def ine car pa rt prototype<br />

clone CarPart Visua l<br />

add CarPart {<br />

state {ang le 10}<br />

appea rance {<br />

repType surface<br />

i n terpType surface<br />

behavior {<br />

around {args} {<br />

# Defi n e car body<br />

for {set i 0} {$i < 360} {incr i [getstate angle]} {<br />

clone car_body CarPart<br />

add car_body {<br />

appearance {<br />

send [getself] rotate "add 0 [getstate ang le] 0"<br />

replacematrix {rotate {0 .0 0.0 90 .0}}<br />

shape {AliasObj "$p ath/car_body . t ri"}<br />

# Def ine generic whee l<br />

clone whee l Ca rPa rt<br />

# Define car' s four wheels<br />

clone back_ right CarPart<br />

# Assemb le car<br />

clone whee ls CarPart<br />

add wheels {subobject {back_r ight back left front left front_ right}}<br />

clone car CarPart<br />

add car {<br />

appea rance {replacem atrix {translate {28.0 -8 .0 3.0} rotate {90 .0 90 .0 0.0}}}<br />

subobject {car_b ody wheels}<br />

add $TEC_W EB_P ARENT {subobject {car}}<br />

# Bind pick events to car<br />

send $TEC WEB WIN addEvent {wheel {Pi ck-Shift-Butto n-1 {rot_w h ee ls {}}}}<br />

send $TEC_W E B_W IN addEvent {wheel {Pi ck-Button-1 {around {}}}}<br />

send $TEC WEB WIN addEvent {car {Pi ck-But ton-1 {around {}}}}<br />

<br />

<br />

Bu tton-1 on car to rotate the car <br />

Bu tton-1 on a whee l to rotate the wheels <br />

Shift Button-1 on a wheel to change the whee ls <br />

<br />

<br />

<br />

<br />

Figure 13<br />

AV !. C:odc hnbeddt:d in rhc HTM!. l'c1gc tor rhc Wch Documcnr Exam ric<br />

])ig:it;\] Tcclmi(;ll jourml Sl


82<br />

This d n\:lopm enr srr�ncg;1· tLldcd off the fu ncrio11Jiirv<br />

of indi1·idu:�l components t(>r rhc co mplercncss of<br />

:I ri.dl\' ru n ning \'iSU�liization S\'StClll .<br />

In terms of �1Chic1·i ng irs design goals, rile Tcc,Hc<br />

eft(Jrt h�1s been modcratch· successti.d . TcGHc Gl ll nm1·<br />

prm·idc inrcrbccs to t\1 o kind of rb t�l sp�1ccs: the<br />

World vVidc We b �l lld thLlb�lSCS lllJ113gcd b\' rhe<br />

POSTCRES �m d lllustLl Lh t�1b�1sc management S\'S­<br />

tc ms. In addition, intnbccs to other dat�l sp�1ccs c�m<br />

be i m pl e men ted easih' l1\' CI'C :l ti ng nc\1 resou rce<br />

objects using the tools pro1·idcd lw Te cate . Much<br />

11 ork srill nccds to be done, hm1 c1'Cr For cx:�mple, rhc<br />

attciH.bnt dar:� translation problem must be satisbcto­<br />

rik soh'Cd ; data passing tlmJugh an intcrbcc rhat<br />

is stored in one fo rm�1r should be :H ltom:�tic;�llv um­<br />

l'trtcd i mo Tec: nc 's b1·orcd t(mna t :md 1·icc l'crs�l.<br />

When building ,·isu�lli;.�ni ons of data, Tccnc nm1<br />

undcrsr�mds dat�l that h�1s :1 s1Jecirlc conccpru�1l struc­<br />

ture, i 11 p:�rticular, arhi tr,u·1· scrs of tuples �m d mul ri­<br />

dimension:�! �1rr:11'S ll'hnc �l lT�l\' clements tlU\' be<br />

tuples. Although data t\'pcs from man1' di!'ll:rcm disci­<br />

plines possess such a stru cture, some tvpcs rc m�1in thJt<br />

dO ilOt, r(>r i nstan ce, cb t:l rh�H h�1s a latticc-likl· or poh·­<br />

hcd Lli srrucrure. F u rthcrmorc, Tee:� tc can noll' con­<br />

struct onh· crude \'isu;�li;�ltions ofrhc da ta ti'Pcs rlut it<br />

docs understand. The prim�1n reason t(Jr rhis shOI't­<br />

com ing is that the b�1sic mod ule set 11 irhin the<br />

BigRil'cr resource is incomplete, �md the knml'icdgc<br />

base 11·i thin the ln tc II igcm Visu�li iza tion Svstcm con­<br />

tains limited knoll' ledge of 1 isu:�lization tec hniques<br />

that can be used to t ra nsr( m n d


J I. G. Bell, A. p,,risi, ;md 1\.l. reset·, "The Virru;li J\cJiin·<br />

,\ Jlldelillg Lm gu;lgc Sfl�citlc;Hion," a,·,,il:lbk on the<br />

lnrcrllet ;H http:/ /lr!lli.ll·ircd.com ( :\o1·cmbcr I l)9.J. ).<br />

12. J. White, "Tcksnipt Tcchnolog1·: The �ound.Hioll t(lr<br />

the Fkcrronic lvbrkctplace," Cencral M�gic 11·hitc<br />

p•lJll'r (Sull lliY;llr, C1 lif.: Gener;ll ,\ hgic, Inc., 19{}4 ).<br />

13. ). ,\ kKcclun Jnd :\. Rhodes, l'n�!.!,llfllllllillg j(w ihe<br />

,\('tr/(J/1: Sojitntre net:elr Scicmitic 1);\Ll \'istdiz;Hion,"<br />

TechnicJI Rcporr (.;\\'L!-JlST- ')2- 10 ( \V;1.shington,<br />

D.C.: (;corge \V;lsh ingtnn Uni1usirl', 1\b rc h 1992 ).<br />

20. Z. Ahmed n ;l l., "An lmelligt· nr VisuJiiLation S1·stcn1<br />

fo r J-:;lrth Scirc•uu· r I ()94 ) .<br />

22. D. l�urkr ;uld ill l'cndln·, "A \'isulllizarion ,\ lodel lhsed<br />

on rhe iVLlthenJ;Jtics of fib,r Bund les," 0>111/!11/ers ill<br />

/'hl:.;ics. 1·ol. 3, no. 5 (Septcnlber/Onober I ():N)<br />

D. l�urlcr ;l \ld S. l�n·son , "\\:cror- bundlc C:bsscs form<br />

l'm,·ertiil Tool tc>r Scienritlc \' isuJJii'l.�rion," (),ntf!llfl!I'S<br />

ill 1'/Jy sics. 1·ol. 6, no. 6 ( Nm·cmbn/Decel1lber I ()92 ):<br />

576-5X4.<br />

24. 1\ . Hllher, B. Luc\s,.mJ �· . Collins, "A Dar;, ,\ lodel fo r<br />

Sciemitic VJsuJii;arion 11·ith Pn"·isions t( >r Rcgu iJr ;md<br />

lrrc gui;Jr Crids," l'mcel!dil l,!l.-' u/ !be l'isuolizuliuu<br />

C) I (,r !1/{aC:'IICt'( 19() I ) .<br />

25. [, Resnick ct ;JI., U. ISS!C f)e,cnjJtion o11d Nefi'rellce<br />

.\/(/ ////(// .Ji>r the (,'(JJ)//JI(J/1 use lmjJ/c ·nlt'/1{{//ion<br />

( ,\•l u rra\' Hill, :\.].: AT&T Bell L1borJtories, 1993 ).<br />

26. C. Steele, Jr., 0!11111/l>ll l!SV '!Zw l.ong iiO,!.!,m­<br />

purc.:r Center (SI)SC). Cmrenrlv, Peter is " 1·isiting scic.:nrisr<br />

;l[ tbc snsc. \\'here he kads rcseJrchc.:rs in dci ' Cioping<br />

intcr;Kti lt: tbra ,·isualizJtion s,·stt·ms. Petn joined Digir.1l<br />

in 1990 as ;l member of tht· \ Vorkstations Engineering<br />

(;roup. In e.1rlier 11·ork, he ,,·,1s � sott\l';lre t'll)!;inccr t( >r th


High-performance<br />

1/0 and Networking<br />

Software in Sequoia 2000<br />

The Sequoia 2000 project req uires a high-speed<br />

network and 1/0 software for the support of<br />

global change research. In addition, Sequoia<br />

distributed applications require the efficient<br />

movement of very large objects, from tens to<br />

hundreds of megabytes in size. The network<br />

architecture incorporates new designs and<br />

implementations of operating system 1/0 soft­<br />

ware. New methods provide significant per­<br />

formance improvements for transfers among<br />

devices and processes and between the two.<br />

These techniques reduce or eliminate costly mem­<br />

ory accesses, avoid unnecessary processing, and<br />

bypass system overheads to improve through­<br />

put and reduce latency.<br />

\'ol. 7 ;-..: " ·' 199S<br />

I<br />

Joseph Pasquale<br />

Eric W. Anderson<br />

Kevin Fall<br />

Jonathan S. Kay<br />

In the Sequoia 2000 project, 11 e �1ddressed rile [Jroh­<br />

lem of designing a distri buted computn SI'Stem th�H<br />

can efticientlv rerriel'l:, store, r conraincr shipping. (for a complete<br />

discussion of container shipping, see Re krence l.)<br />

Since conrainn shipping is a nell' design, this p:�per<br />

de1·orcs more sp:�ce to ir in re l:trion to the other sur­<br />

' c1·ed '' ork ( 11·hose der�1ikd descriptions m:l\' be fo und<br />

in Rdcret1ces 2 ro 9). In addi rion ro this 11 ork, '' e con­<br />

ducred orhn ncrll'ork studies �1s parr of the Sequoi�1<br />

2000 l)rojccr. These include 1-csearch on protocols to<br />

pro1·ide f)Crtormance guaranrecs .m d multic1sting . 10 , -<br />

To support a higJJ-pcrt(mn:lncc distributed comput­<br />

ing etwimnmenr in ll'bicb :lpplicuions em cftCetii'Cl\'<br />

manipul�ue large d:H:J objects, 1\'C 11-cre concerned 11·ith<br />

achin in g. high throughput during the rranster of these<br />

objccrs. The processes or de�· ices representing the d;1L1<br />

sources :md sinks m:ty all reside on the s'1mc work­<br />

St


(mu ltiple node case ). ln either c:�se , we w�1nn.:d appli­<br />

C1tions , be the�· c1rth science distributed computa­<br />

tions or collabor�nion tools i111·oh-ing multipoint<br />

video, ro 111�1kc fu ll use of the raw bandwidth prm·ided<br />

bv the underlving communicarion svsrem.<br />

In the multiple node case , the ra11 band11·idrh is<br />

h·om 45 ro 100 meg�1birs per second (Mb/s), bec:ll!se<br />

the Sequoi�1 2000 nemork used T3 links t(>r long­<br />

distJnce communication :md a ti ber distri buted data<br />

inrcrbce (FDD! ) tc>r loc1l area communicnion. ln rhe<br />

single node case, rhc l'JII' band11·idth is approx imJteh·<br />

100 megabnes �1cr second, since rhe workstation of<br />

choice w:�s one of the D ECstation 5000 series or the<br />

Alph�1-powcred DEC 3000 series, both of which use<br />

the TL! RBOchannel �1s the SI'Stcm hus.<br />

Our 11·ork foc used onh· on soft11·;m:: impro1·emct1ts,<br />

in particular how to achieve maximum svsrem sotT\\':tre<br />

perte>rnuncc gi1-cn the hardware we selected. In bet,<br />

we fo und that the throughput bottlenecks in the<br />

Seq uoi�1 distributed computing cm·ironment 11·ere<br />

indeed in the 11·orkst:1rion 's operating svstcm softii'Jre,<br />

and nor in the undnlving communication svstem<br />

hard11·�1re (e.g., nctll'ork links or the svstem bus) · . This<br />

problem is not limited to the Scqu(;iJ em ironment:<br />

gi1·cn modern high-speed workstations ( 100+ mil lions<br />

of instructions per second [mips 1) r<br />

.FDDI bo:�rds, both models DEF'lA, 11· hich supports<br />

both send and receive direct memory �Kccss (r:li'viA),<br />

:111d DEfZA, ll'hich supports onlv rccci1·c DlviA.<br />

The Data Transfer Problem<br />

Since a d�1r;1 sou rce or si nk mav he either a process or<br />

d e1·ice, and the operating SI'Stem gcncralh- pcrt())'lns<br />

the fu nction of' transtCrring data bct11·cen processes<br />

and devices, undcrst:1nding the hotrlenceks in these<br />

operating sysn:m data p:1 ths is kcv ro improving<br />

perfonn;1nec. These datJ paths gcncralh- imoh-c trJ ­<br />

�·ersing numerous la1·crs ofoper:ni ng s1·stetn softwa re .<br />

l n the case of netll'ork rr:1nskrs, the data paths �11-c<br />

ex tended bv lavers of ncnvork protocol sofTware.


To undnst;uld rhc pcd(nm;mcc p roblem 11·c \\'CIT<br />

tr�·ing ro sokc , consider a common c l i e m- se!Yc r i nrcr­<br />

acrion in which


device is ;� network �l liaptcr. In this p�1per, \\'C usc<br />

rhe term dolo lntll4.erJII"oblem to re kr to the problem<br />

of red ucing these ovc r hc;�d s ro achi e1·c h i gh through­<br />

pur between ;� source device Jlld a sink device, either<br />

of \\"hich can he ;� net11·ork aLhpter 11 ·i rh in �� single<br />

worksr:� ri on.<br />

Al though the data rr:mskr problem mal' also oi st in<br />

intermediate routers, ir docs so to :1 much lesser<br />

degree than with end-user workstations (assu ming<br />

modem router softwa re and h:�rd\\'�lrc te chnoJog,· ) .<br />

Thi s i s because of �� rourc r' s si mpli �i ed execution envi­<br />

ronmellt :�nd i rs red uced needs �() r transfers Kross<br />

m ultiple protected domains. Ho\\"CI"Cr, there is noth­<br />

ing rh�n prec ludes rhc :�ppJicnion of the tec h niq u es<br />

discussed in th is paper to rourer SOlT\\";Jre. In ri ct, since<br />

ll"e used gcnerJI-purposc workstations �or routers ro<br />

su pporr �� tlex i blc, mod i �i ablc rest bed �(Jr ex peri men­<br />

t�ltion ll"ith nc11· protocol s, our work ll"as also applied<br />

ro rourer software.<br />

In the nor three sections, ll"e descri be 1·arious<br />

�1 ppro:�ches to soil-ing rhe data rr:�nskr problem . Since<br />

data copving/rouc hing is a major sottware limitation<br />

in achie�·ing hi gh throughput, a1·oiding data co�wing/<br />

touching is a constant theme. tVl uch of our work<br />

im·olvcs fi nding \1'


hgs to cs_re�H.I , without el i ling cs_rnap. Unm�lpping<br />

is Jutomnic in cs_IHi tc , so if cs_unm:1p is not used,<br />

t\1'0 S\'Stem Cl iis JI"C Sti li sur"fi cicnt .<br />

As shol\'n in hgure 3, a pmcess reJds d


11rinen is lost, �1 ll'ri ti ng process must copl' ::uw d�1t�1<br />

th�1t ll'i\1 still be needed �1 frer the ll'rirc . �ortu narelv,<br />

applications thJt n101T great 1·o\umes of tbt�1 often<br />

h�11·e no t"u rthcr need t(Jr ir �1 fre r �1 IITite is co mpl eted .<br />

Implica tions of Virtual Memory Remapping<br />

In


YO<br />

communicnion (JPC:) 11·irhin the ULTR!X 1·e rsion<br />

4.2;� operatin g system on a DECstation 5000/200<br />

svstc m.' With the nc11· DEC OSI-jl i m plementation<br />

011 Alpha worksr.1tions, liT com 11ared the l/0 puformance<br />

ofconvcntionJI UNIX ljO to that ofcollt;lincr<br />

s hipping rix ;1 l'arictl' of 1/0 de� ices as we ll


USER<br />

SPACE<br />

APPLICATION LAY ER<br />

PROCESS<br />

Figure 5<br />

:\ Sf.Jlicc Cnl1lll'Cring a Source ro J Sink Dc1 icc<br />

de1·icc. If the l/0 bus and the dl'l ices support hardware<br />

streaming, the data p:1th is direcrlv over the bus,<br />

�ll'oiding Sl'stem memon· altogether. Although the<br />

process docs nor necess�1rily manipui


92<br />

been implemented 11 irhin the U LT !U :\ 1crsion 4.2J<br />

operating svstcm t(Jr the DEC SOOO serit:s worksr:�rions,<br />

:md 11 irhin DEC OSF/ 1 ITrsion 2.0 ti1r DJ-:C<br />

3000 series (Aipha-pmH·red ) 11 orkst:�rions, c;H.:h t(Jr<br />

a l i mited number ofdc:,·ices.<br />

Three pert(mnance c1 .1 luation studies of PPIO<br />

ha,·e been c:�rried our :Jllli :�re described in our earilj)Jpers.2·'<br />

' ThLI' indicate Cl'U :JI';liLlbilin· imprm es l) \'<br />

30 percent or more; and throughput and Lw:nc1·<br />

impro1·c b1· ;1 bctor of t11·o to three, depending on<br />

the speed of l/0 devices. (3 ener:111v, the l:�tencv ;md<br />

throughput pcrform.l !Ke imprm emenrs ofkt·�d bl·<br />

PPJO improve with bster 1/0 de1·iccs, indicating th,{t<br />

1'1'10 scales well with ne11· IjO dn ice rechnolog�·.<br />

Improving Network Software Throug hput<br />

Net11·ork 1/0 presents a Sflecia] 11roblcm in th;lt rhc<br />

complcxitl' of the abstraction layer (sec Figu re 2), .1<br />

sr:�ck of network protocols, is gencr:�lh- much greater<br />

than that tor other types of 1/0. In this sectio11, 11 c<br />

summari1.e the results of an ;mall'sis of m·crhe:�ds r(Jr<br />

an implcment:nion ofTCI'/IP 11c used in rhc Scquoi;l<br />

2000 project. The prim:11'1' bmrlcncck in :K hiel'itw<br />

high throughput communic:ltion ti x TC P /l l' is du�<br />

to data-touching opcr:nions: one expected culprit is<br />

d:lt 199�<br />

• ErrorChk: The caregon· of checks t(x user :1 11d SI'Srem<br />

errors, such :1s pJramcter checking on socket<br />

s1·stem c:111s<br />

• Other: This tin:1l cJtegorv of OI'Ct-llC;ld includes :1!1<br />

the opcr;nions th


Figure 6<br />

3000<br />

� 2500<br />

f=<br />

(.9<br />

z<br />

�- 2000<br />

w


4.2:� op eratin g S\'Stcm, ;md a 27 percent llnproYemciH<br />

over the implcment•Hion with the opti mi zed check­<br />

sum comput:�tion algoi·irbm .<br />

Of course, one must be ,.er,· c;lrcti.Ii ;l hout d l lli II,� o //{/ S1 :..;fL'IIIS<br />

f/C\ 1


5. ]. K�v and /. P�1sq u al c, "The lmf)orr�liKe of Non­<br />

D.1L1-Touching Processing Q,crhcads in TC I'/ IP,"<br />

f>mceediugs of' the ACl/ Com nlltllicatiuus Architec­<br />

fllrcs onrl l'mtucols Co11/('rence ISIG'CUl/.1/J.<br />

S.1n hJ11cisco (Sq,re mbcr 199�), P�'· 259-269.<br />

6. /. ](�,. �1nd /. 1'�1squ�Ic. "Mc�1surcmenr, Ano1lvsis,<br />

and lmpro\'C illCIH of LI DI'/ll' Throughpu t rc n rhe<br />

DECs r�1tion 5000," l'mceedinp,s c;j ' the I 'SI:NJX<br />

\fin fer 'f ('cb nolugr C!Jtlj(•rence. S:111 D ie go ()Jnuan·<br />

I99�). pp. 249-25H.<br />

7. J. KJ\' and J. 1'.1sq u �1 lc, "A Sum nun· of l'C:P /II'<br />

Net\\ orking Soriwarc Pedcm11JI\Ce fo r the DECst�tion<br />

5000," l'r()(.:net<br />

Afifirooc/J ( 13erkc lc,·, C1 lif.: lnrernatiotd Computer<br />

Science lnstiture, <strong>Technical</strong> Report TR-92-072, 1992 ).<br />

II. H. Zb�mg, D. Verma, Jnd D. FerrJri, "Design and<br />

lmplemenr.1rion of the 1\eJI-Tillll' Inrnnet Prorrnunce of<br />

lv kss;lge-P�ssing Using Restri cted Virtual Memor!'<br />

Rc maf1ping," Sojill'are-l'mcfice and F .. \pcrieuce.<br />

,·ol . 21, no. 3 (I 991 ): 251-267.<br />

26. P. Druschel and L. Peterson, "Fbuts: a High · b;n1dH'idth<br />

Cmss- Domc1in Tra 11sfer bcilin·," Pmccediug-; of<br />

the 14th ACil l S) 'lilj)(JS iwn l!il OfWi'Cifillf!, .'>)·steill l'riii­<br />

Cij;/es (SOSl'). AshL' illc, N.C. (December 1993 ), pp.<br />

189-202.<br />

27. J. !vlogul and A. Borg, "The Effect of Context S\\ itches<br />

on Cache J'ertorm�111(c," l'mceedin,r.;s of' the ASI'f.OS-<br />

1\ '( Apri l I 99 1 ), pp. 75-84.<br />

2tl. R. Bershad, T. Anderson, E. Lazo\\'ska, �111d H. Le\'\',<br />

" Li gh twei gh t Remote Procedure C:Jil," ACIV/<br />

Tra nsactions 011 Computer SJ'sfeills. H>l. 8, no. I<br />

(1990): 37-55.<br />

<strong>Digital</strong> Tl'l'hnical Journ.1l Vol. 7 :-Jo. 3 <strong>1995</strong> 95


29. 1'- l)ruschcl, M. Abbott, ,\.\. I'Jge ls, cl lld I .. Pcre1·son,<br />

"An,1il'sis of !jO Suhwsrc m Design t(n J'vl ulri mcdic1<br />

Wo rksr,nions," Proceerlill,�.\ n/ tbe /birr! !utcrurt­<br />

liourt! \\ ur/;sh (ljJ (J/f .\eitt ·ud! r111d Ojir rill' rou ting architecture


Call fo r Authors<br />

from <strong>Digital</strong> Press<br />

<strong>Digital</strong> Press is an imprint of Butterworth-Heinemann, a major international pub­<br />

lisher of professional books and a member of the Reed Elsevier group. <strong>Digital</strong><br />

Press is the authorized publisher fo r <strong>Digital</strong> Equipment Corporation: The two<br />

companies are working in partnership to identify and publish new books under the<br />

<strong>Digital</strong> Press imprint and create opportunities fo r authors to publish their work.<br />

<strong>Digital</strong> Press is committed to publishing high-quality books on a wide variety<br />

of subjects. We would like to hear from you ifyou are writing or thinking about<br />

writing a book.<br />

Contact: Mike Cash, <strong>Digital</strong> Press Manager, or<br />

Liz McCarthy, Assistant Editor<br />

DIGITAL PRESS<br />

313 Washi ngton Street<br />

Newton, MA 02158-1626<br />

U.S.A.<br />

Tel (617) 928-2649, Fax: (617) 928-2640<br />

E-mail : Mike.Cash@BHein.rel.co.uk or<br />

LizMc@world .std .com


ISSN 0898-901X<br />

Printed in U.S.A. EY-T8 38E-TJ/YS 12 14 18.0 Copvright © <strong>Digital</strong> Equipment Corporation. All Rights Resen·ed.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!