DTJ Volume 7 Number 3 1995 - Digital Technical Journals

Digital 

Technical 

Journal 


HIGH PERFORMANCE FORTRAN 

IN PARALLEL ENVIRONMENTS 

SEQUOIA 2000 RESEARCH 

Volume 7 Number 3 

1995

Editorial 

Jane C. Blake, Managing Editor 

Helen L. Patterson, Editor 

Kathleen M. Stetson, Editor 

Circulation 

Catherine M. Phillips, Administrator 

Dorothea B. Cassady, Secretary 

Production 

Terri Autieri, Production Editor 

Anne S. Katzeff, Typographer 

Peter R. Woodbury, Illustrator 

Advisory Board 

Samuel H. Fuller, Chairman 

Richard W Beane 

Donald Z. Harbert 

William R. Hawe 

Richard J. Hollingsworth 

Richard F. Lary 

Alan G. Nemeth 

Robert M. Supnik 

Cover Design 

The images on the front and back covers 

of this issue are different visualizations 

of the same data output from a regional 

climate simulation program run by Dr. 

John Roads of the Scripps Institution of 

Oceanography. The data depicted contain 

measures of temperature, liquid and 

gaseous water content, and wind vectors; 

the topography represented by the data 

is the western U.S. in January 1990. Providing 

earth scientists with the ability to 

visualize such data is one of the objectives 

of the Sequoia 2000 research project, a

joint effort of the University of California,

government agencies, and industry to build 

a computing environment for global change 

research. This issue presents papers on several 

major areas explored by Sequoia 2000

researchers, including an electronic repository,

networking, and visualization. 

The cover was designed by Lucinda O'Neill 

of Digital's Design Group. Special thanks go 

to Peter Kochevar for supplying the cover 

images. 

The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, Massachusetts 01460. Subscriptions to the Journal are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to dtj@digital.com. Single copies and back issues are available for $16.00 each by calling DECdirect at 1-800-DIGITAL (1-800-344-4825). Recent back issues of the Journal are also available on the Internet at http://www.digital.com/info/DTJ/home.html. Complete Digital Internet listings can be obtained by sending an electronic mail message to info@digital.com.

Digital employees may order subscriptions 

through Readers Choice by entering VTX

PROFILE at the system prompt. 

Comments on the content of any paper are 

welcomed and may be sent to the managing 

editor at the published-by or network address.

Copyright© 1995 Digital Equipment 

Corporation. Copying without fee is permitted 

provided that such copies are made 

for use in educational institutions by faculty 

members and are not distributed for commercial 

advantage. Abstracting with credit 

of Digital Equipment Corporation's authorship 

is permitted. All rights reserved. 

The information in the journal is subject 

to change without notice and should not 

be construed as a commitment by Digital 

Equipment Corporation or by the companies 

herein represented. Digital Equipment 

Corporation assumes no responsibility for 

any errors that may appear in the Journal. 

ISSN 0898-901X

Documentation Number EY-T838E-TJ 

Book production was done by Quantic 

Communications, Inc. 

The following are trademarks of Digital 

Equipment Corporation: Digital, the 

DIGITAL logo, Alpha Generation, 

AlphaServer, AlphaStation, DEC, DEC

OSF/1, DECstation, GIGAswitch,

TURBOchannel, and ULTRIX.

Dore is a registered trademark of Kubota 

Pacific Computer Inc. 

Exabyte is a registered trademark of Exabyte Corporation.

Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company.

IBM and SP2 are registered trademarks of International Business Machines Corporation.

Illustra is a registered trademark of Illustra Information Technologies, Inc.

Intel is a trademark of Intel Corporation.

MCI is a registered trademark of MCI Communications Corporation.

MEMORY CHANNEL is a trademark of Encore Computer Corporation.

Mosaic is a trademark of Mosaic

Netscape is a trademark of Netscape

NewtonScript is a trademark of Apple Computer, Inc.

NFS is a registered trademark of Sun Microsystems, Inc.

OpenGL is a registered trademark and Open Inventor is a trademark of Silicon Graphics, Inc.

PictureTel is a registered trademark of PictureTel Corporation.

PostScript is a registered trademark of Adobe Systems Inc.

SAIC is a registered trademark of Science Applications International Corporation.

Siemens is a registered trademark of Siemens Nixdorf Information Systems, Inc.

Sony is a registered trademark of Sony Corporation.

SPEC is a trademark of the Standard Performance Evaluation Council.

Telescript is a trademark of General Magic, Inc.

UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Ltd.

Contents

Foreword
Jean C. Bonney

HIGH PERFORMANCE FORTRAN IN PARALLEL ENVIRONMENTS

Compiling High Performance Fortran for Distributed-memory Systems
Jonathan Harris, John A. Bircsak, M. Regina Bolduc, Jill Ann Diewald, Israel Gale, Neil W. Johnson, Shin Lee, C. Alexander Nelson, and Carl D. Offner

Design of Digital's Parallel Software Environment
Edward G. Benson, David C. P. LaFrance-Linden, Richard A. Warren, and Santa Wiryaman

SEQUOIA 2000 RESEARCH

An Overview of the Sequoia 2000 Project
Michael Stonebraker

The Sequoia 2000 Electronic Repository
Ray R. Larson, Christian Plaunt, Allison G. Woodruff, and Marti Hearst

Tecate: A Software Platform for Browsing and Visualizing Data from Networked Data Sources
Peter D. Kochevar and Leonard R. Wanger

High-performance I/O and Networking Software in Sequoia 2000
Joseph Pasquale, Eric W. Anderson, Kevin Fall, and Jonathan S. Kay


Editor's 

Introduction 

Scientists have long been motivators for the development of powerful computing environments. Two sections in this issue of the Journal address the requirements of scientific and technical computing. The first, from Digital's High Performance Technical Computing Group, looks at compiler and development tools that accelerate performance in parallel environments. The second section looks to the future of computing; University of California and Digital researchers present their work on a large, distributed computing environment suited to the needs of earth scientists studying global changes such as ocean dynamics, global warming, and ozone depletion. Digital was an early industry sponsor and participant in this joint research project, called Sequoia 2000.

To support the writing of parallel programs for computationally intense environments, Digital has extended DEC Fortran 90 by implementing most of High Performance Fortran (HPF) version 1.1. After reviewing the syntactic features of Fortran 90 and HPF, Jonathan Harris et al. focus on the HPF compiler design and explain the optimizations it performs to improve interprocessor communication in a distributed-memory environment, specifically, in workstation clusters (farms) based on Digital's 64-bit Alpha microprocessors.

The run-time support for this distributed environment is the Parallel Software Environment (PSE). Ed Benson, David LaFrance-Linden, Rich Warren, and Santa Wiryaman describe the PSE product, which is layered on the UNIX operating system and includes tools for developing

Digital, for writing the Foreword to this issue.

Our next issue will feature papers on multimedia and UNIX clusters.

Jane C. Blake

Managing Editor

Foreword 

Jean C. Bonney 

Director, External Research

The Information Utility, the Information Highway, the Internet, the Infobahn, the Information Economy: these are the sound bytes of the 1990s. To make these concepts reality, a robust technology infrastructure is necessary. In 1990, Digital's research organization saw this need and set out to develop an experimental test bed that would examine assumptions and provide a basis for a technology edge in the '90s.

The resulting project was Sequoia 2000, a three-year research collaboration between Digital, campuses of the University of California, and several other industry and government organizations. The Sequoia 2000 vision is

Petabytes (i.e., trillions of bytes) of data in a distributed archive, transparently managed, and logically viewed over a high-speed network with isochronous capabilities via a host of tools

in other words, a big, fast, easy-to-use system.

Although the vision is still not reality today, our more than three years of participation in Sequoia 2000 research gave us the knowledge base we sought.

After a rigorous process of pro[...]rnia, Sequoia 2000 began in June 1991. The focus of the research was a high-speed, broad[...]formation collected from satellites. Once the data are collected and organized, there is the challenge of massive simulations, simulations that forecast world climate ten or even one hundred years from now. These were exactly the kinds of challenges the computer scientists needed.

Among the major results of three years of work on Sequoia 2000 was a set of product requirements for large data applications. These requirements have been validated through discussions with customers in financial, healthcare, and communications industries and in government. The requirements include

• A computing environment built on an object relational database, i.e., a data-centric computing system

• A database that handles a wide variety of nontraditional objects such as text, audio, video, graphics, and images

• Support for a variety of traditional databases and file systems

• The ability to perform necessary operations from computing environments that are intuitive and have the same look and feel; the interface to the environment should be generic, very high level, and easily tailored to the user application

• High-speed data migration between secondary and tertiary storage with the ability to handle very large data transfers

• Network bandwidth capable of handling image transmission across networks in an acceptable time frame with quality guarantees for the data

• High-quality remote visualization of any relevant data regardless of format; the user must be able to manipulate the visual data interactively

• Reliable, guaranteed delivery of data from tertiary storage to the desktop

Sequoia 2000 was also a catalyst for maturing the POSTGRES research database software to the point where it was ready for commercialization. The commercial version, Illustra, is available on Alpha platforms and is enjoying success in the banking industry and in geographic information system (GIS) applications, as well as in other government applications with massive data requirements. Illustra is also making inroads into the Internet where it is used by on-line services.

Yet another major result of Sequoia 2000 was a grant from the National Aeronautics and Space Administration (NASA) to develop an alternate architecture for the Earth Observing System Data and Information System (EOSDIS). EOSDIS will process the petabytes of real-time data from the Earth Observing System (EOS) satellites to be launched at the end of the decade. The alternate information architecture proposed by the University of California faculty was the Sequoia 2000 architecture. It will have a major influence on the EOSDIS project.

For the earth scientists, gains were made in simulation speeds and in access to large stores of organized data. These scientists used some of Digital's first Alpha workstation farms and software prototypes for their climate simulations. An eight-processor Alpha workstation farm provided a two-to-one price/performance advantage over the powerful, multimillion-dollar CRAY C90 machine. In another earth science application, scientists using Alpha and hierarchical storage systems could simulate two years' worth of climate data over the weekend without operator intervention; formerly, two months' worth of data took one day to simulate and required considerable operator intervention. Thus many more simulations could be processed in a fixed time and "time to discovery" was decreased considerably.

Now that we can look at Sequoia 2000 in retrospect, would we do such a project again? The answer is a resounding "yes" from all of us involved. It was a complex project that included 12 University of California faculty members, 2:1 graduate students, and 20 staff. Another 8 faculty members and students provided additional expertise. Four of Digital's engineers worked on site, and a variety of support personnel from other industry sponsors participated, including SAIC, the California Department of Water Resources, Hewlett-Packard, Metrum, United States Geological Survey (USGS), Hughes Application Information Services, and the Army Corps of Engineers.

But as is the case with such ambitious projects, there were unanticipated and difficult lessons for all to learn. To experiment with real-life test beds means considerably more than writing a rigorous set of hypotheses in a proposal. Michael Stonebraker, in his paper, notes a number of challenges we faced and the lessons learned. One of the issues that kept surfacing was the "grease and glue" for the infrastructure, that is, the interoperability of pieces of software and hardware that composed the end-to-end system. This remains a challenge that needs research if we are going to achieve the promised goals of internetworking. Another sticky point was scalability. On the one hand, it is difficult to build a very large networked system from scratch. On the other hand, as we slowly built the mass storage system to the point of minimal critical mass, we found that the current off-the-shelf technologies for mass storage were not ready to be put to use for our purposes.

So, yes, we believe the project was worthwhile, with some caveats.

Compiling High

Performance Fortran

for Distributed-memory

Systems

Digital's DEC Fortran 90 compiler implements most of High Performance Fortran version 1.1, a language for writing parallel programs. The compiler generates code for distributed-memory machines consisting of interconnected workstations or servers powered by Digital's Alpha microprocessors. The DEC Fortran 90 compiler efficiently implements the features of Fortran 90 and HPF that support parallelism. HPF programs compiled with Digital's compiler yield performance that scales linearly or even superlinearly on significant applications on both distributed-memory and shared-memory architectures.

Jonathan Harris

John A. Bircsak

M. Regina Bolduc

Jill Ann Diewald

Israel Gale

Neil W. Johnson

Shin Lee

C. Alexander Nelson

Carl D. Offner

High Performance Fortran (HPF) is a new programming language for writing parallel programs. It is based on the Fortran 90 language, with extensions that enable the programmer to specify how array operations can be divided among multiple processors for increased performance. In HPF, the program specifies only the pattern in which the data is divided among the processors; the compiler automates the low-level details of synchronization and communication of data between processors.

Digital's DEC Fortran 90 compiler is the first implementation of the full HPF version 1.1 language (except for transcriptive argument passing, dynamic remapping, and nested FORALL and WHERE constructs). The compiler was designed for a distributed-memory machine made up of a cluster (or farm) of workstations and/or servers powered by Digital's Alpha microprocessors.

In a distributed-memory machine, communication between processors must be kept to an absolute minimum, because communication across the network is enormously more time-consuming than any operation done locally. Digital's DEC Fortran 90 compiler includes a number of optimizations to minimize the cost of communication between processors.

This paper briefly reviews the features of Fortran 90 and HPF that support parallelism, describes how the compiler implements these features efficiently, and concludes with some recent performance results showing that HPF programs compiled with Digital's compiler yield performance that scales linearly or even superlinearly on significant applications on both distributed-memory and shared-memory architectures.

Historical Background

The desire to write parallel programs dates back to the 1950s, at least, and probably earlier. The mathematician John von Neumann, credited with the invention of the basic architecture of today's serial computers, also invented cellular automata, the precursor of today's massively parallel machines. The continuing motivation for parallelism is provided by the need to solve computationally intense problems in a reasonable time and

which range from collections of workstations connected by standard fiber-optic networks to tightly coupled CPUs with custom high-speed interconnection networks, are cheaper than single-processor systems with equivalent performance. In many cases, equivalent single-processor systems do not exist and could not be constructed with existing technology.

Historically, one of the difficulties with parallel machines has been writing parallel programs. The work of parallelizing a program was far from the original science being explored; it required programmers to keep track of a great deal of information unrelated to the actual computations; and it was done using ad hoc methods that were not portable to other machines.

The experience gained from this work, however, led to a consensus on a better way to write portable Fortran programs that would perform well on a variety of parallel machines. The High Perf

      integer n, number_of_iterations, i, j, k
      parameter (n=16)
      real A(n,n), Temp(n,n)
      ... (Initialize A, number_of_iterations)
      do k=1, number_of_iterations
c        Update non-edge elements only
         do i=2, n-1
            do j=2, n-1
               Temp(i,j) = (A(i,j-1)+A(i,j+1)+A(i+1,j)+A(i-1,j))*0.25
            enddo
         enddo
         do i=2, n-1
            do j=2, n-1
               A(i,j) = Temp(i,j)
            enddo
         enddo
      enddo

Figure 1
A Computation Expressed in Fortran 77

      integer n, number_of_iterations, i, j, k
      parameter (n=16)
      real A

Figure 2

Processor 0
   SEND     A(8, 2) ... A(8, 8)    to Processor 1
   SEND     A(2, 8) ... A(8, 8)    to Processor 2
   RECEIVE  A(9, 2) ... A(9, 8)    from Processor 1
   RECEIVE  A(2, 9) ... A(8, 9)    from Processor 2

Processor 1
   SEND     A(9, 2) ... A(9, 8)    to Processor 0
   SEND     A(9, 8) ... A(15, 8)   to Processor 3
   RECEIVE  A(8, 2) ... A(8, 8)    from Processor 0
   RECEIVE  A(9, 9) ... A(15, 9)   from Processor 3

Figure 4
Compiler-generated Communication for
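The send and receive lists in Figure 4 can be rederived by hand. The following Python sketch is an illustration only, not the compiler's algorithm: it assumes the 16 × 16 array, the four-processor (BLOCK, BLOCK) layout described later in this paper, the four-point update of Figure 1, and an owner-computes rule (each processor updates only the elements it stores). The names `owner` and `receives` are hypothetical.

```python
import math

N = 16      # array extent in each dimension
GRID = 2    # processors per dimension (2 x 2 = 4 processors)


def owner(i, j):
    """Processor holding A(i, j), 1-based, under (BLOCK, BLOCK).

    Assumes processors are numbered down the first dimension first,
    as in the layout listed in the text (processor 1 holds A(9:16, 1:8)).
    """
    block = math.ceil(N / GRID)
    return (i - 1) // block + GRID * ((j - 1) // block)


def receives(proc):
    """Map each source processor to the elements `proc` must receive."""
    needed = {}
    for i in range(2, N):            # non-edge elements only
        for j in range(2, N):
            if owner(i, j) != proc:
                continue             # assumed owner-computes rule
            for ni, nj in ((i, j - 1), (i, j + 1), (i + 1, j), (i - 1, j)):
                src = owner(ni, nj)
                if src != proc:
                    needed.setdefault(src, set()).add((ni, nj))
    return {src: sorted(elems) for src, elems in needed.items()}


for p in (0, 1):
    for src, elems in sorted(receives(p).items()):
        print(f"Processor {p} RECEIVE A{elems[0]} ... A{elems[-1]} from Processor {src}")
```

Under these assumptions the receive lists for processors 0 and 1 agree with the RECEIVE entries in Figure 4; the SEND entries are the same lists seen from the other side.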

We now describe some of the HPF language extensions used to manage parallel data.

Distributing Data over Processors

Data is distributed over processors by the DISTRIBUTE directive, the ALIGN directive, or the default distribution.

The DISTRIBUTE Directive

For parallel execution of array operations, each array must be divided in memory, with each processor storing some portion of the array in its own local memory. Dividing the array into parts is known as distributing the array. The HPF DISTRIBUTE directive controls the distribution of arrays across each processor's local memory. It does this by specifying a mapping pattern of data objects onto processors. Many mappings are possible; we illustrate only a few.

Consider first the case of a 16 × 16 array A in an environment with four processors. One possible specification for A is

      real A(16, 16)
!hpf$ distribute A(*, block)

The asterisk (*) for the first dimension of A means that the array elements are not distributed along the first (vertical) axis. In other words, the elements in any given column are not divided among different processors, but are assigned as a single block to one processor. This type of mapping is referred to as serial distribution. Figure 5 illustrates this distribution.

The BLOCK keyword for the second dimension means that for any given row, the array elements are distributed over each processor in large blocks. The blocks are of approximately equal size (in this case, they are exactly equal), with each processor holding one block. As a result, A is broken into four contiguous groups of columns, with each group assigned to a separate processor.

Another possibility is a (*, CYCLIC) distribution. As in (*, BLOCK), all the elements in each column are assigned to one processor. The elements in any given row, however, are dealt out to the processors in round-robin order, like playing cards dealt out to players around a table. When elements are distributed over n processors, each processor contains every nth column, starting from a different offset. Figure 6 shows the same array and processor arrangement, distributed CYCLIC instead of BLOCK.

As these examples indicate, the distributions of the separate dimensions are independent.
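The two column mappings can be written as small ownership functions. This Python sketch is illustrative rather than taken from the compiler: the function names are hypothetical, processors are numbered from 0 as in the figures, and BLOCK is assumed to use a block size of ceiling(extent / processors), which is how the HPF language defines it.

```python
import math


def block_owner(index, extent, nprocs):
    """Processor owning 1-based `index` along a BLOCK-distributed dimension."""
    block_size = math.ceil(extent / nprocs)
    return (index - 1) // block_size


def cyclic_owner(index, nprocs):
    """Processor owning 1-based `index` along a CYCLIC-distributed dimension."""
    return (index - 1) % nprocs


columns = range(1, 17)
print("(*, BLOCK) :", [block_owner(j, 16, 4) for j in columns])
print("(*, CYCLIC):", [cyclic_owner(j, 4) for j in columns])
```

For the 16-column array on four processors, the first line groups the columns into four contiguous runs of four, and the second deals them out 0, 1, 2, 3 repeatedly, matching the card-dealing description above.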

A (BLOCK, BLOCK) distribution, as in Figure 3, divides the array into large rectangles. In that figure, the array elements in any given column or any given row are divided into two large blocks: Processor 0 gets A(1:8, 1:8), processor 1 gets A(9:16, 1:8), processor 2 gets A(1:8, 9:16), and processor 3 gets A(9:16, 9:16).
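Because the two dimensions are distributed independently, the owner of an element under (BLOCK, BLOCK) can be computed one dimension at a time. The sketch below is a hedged illustration: the function name is hypothetical, and the processor numbering (down the first dimension, then across the second) is inferred from the four ranges listed above rather than stated as a rule by the paper.

```python
import math


def block_block_owner(i, j, n=16, grid_rows=2, grid_cols=2):
    """Processor holding A(i, j), 1-based, under (BLOCK, BLOCK)."""
    row_block = (i - 1) // math.ceil(n / grid_rows)
    col_block = (j - 1) // math.ceil(n / grid_cols)
    # Numbering inferred from the text: processor 1 is below processor 0,
    # processor 2 is to the right of processor 0.
    return row_block + grid_rows * col_block


for corner in ((1, 1), (9, 1), (1, 9), (9, 9)):
    print(f"A{corner} is stored on processor {block_block_owner(*corner)}")
```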

Figure 5
A (*, BLOCK) Distribution

Figure 6
A (*, CYCLIC) Distribution

The ALIGN Directive

The ALIGN directive is used to specify the mapping of arrays relative to one another. Corresponding elements in aligned arrays are always mapped to the same processor; array operations between aligned arrays are in most cases more efficient than array operations between arrays that are not known to be aligned.

The most common use of ALIGN is to specify that the corresponding elements of two or more arrays be mapped identically, as in the following example:

!hpf$ align A(i) with B(i)

This example specifies that the two arrays A and B are always mapped the same way. More complex alignments can also be specified. For example:

!hpf$ align E(i) with F(2*i-1)

In this example, the elements of E are aligned with the odd elements of F. In this case, E can have at most half as many elements as F.

An array can be aligned with the interior of a larger array:

      real A(12, 12)
      real B(16, 16)
!hpf$ align A(i, j) with B(i+2, j+2)

In this example, the 12 × 12 array A is aligned with the interior of the 16 × 16 array B (see Figure 7). Each interior element of B is always stored on the same processor as the corresponding element of A.
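An alignment does not place data by itself; it says that an element lives wherever its alignment target lives. The Python sketch below illustrates this for the two alignments above. The distributions of the targets are assumptions made only for the example (F is taken to have 16 elements distributed BLOCK over four processors, and B to be distributed (BLOCK, BLOCK) over a 2 × 2 arrangement), and all function names are hypothetical.

```python
import math


def block_owner(index, extent, nprocs):
    """Processor owning 1-based `index` along a BLOCK-distributed dimension."""
    return (index - 1) // math.ceil(extent / nprocs)


def owner_of_E(i, f_extent=16, nprocs=4):
    """E(i) is aligned with F(2*i-1), so it lives where F(2*i-1) lives."""
    return block_owner(2 * i - 1, f_extent, nprocs)


def owner_of_A(i, j, b_extent=16, grid=2):
    """A(i, j) is aligned with B(i+2, j+2), so it lives where that element lives."""
    row_block = block_owner(i + 2, b_extent, grid)
    col_block = block_owner(j + 2, b_extent, grid)
    return row_block + grid * col_block   # assumed processor numbering


print([owner_of_E(i) for i in range(1, 9)])   # E can have at most 8 elements
print(owner_of_A(1, 1), owner_of_A(12, 12))
```

With these assumed distributions, consecutive elements of E land two per processor, and the corners of A land on the same processors as the corresponding interior elements of B.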

The Default Distribution

Variables that are not explicitly distributed or aligned are given a default distribution by the compiler. The default distribution is not specified by the language: different compilers can choose different default distributions, usually based on constraints of the target architecture. In the DEC Fortran 90 language, an array or scalar with the default distribution is completely replicated. This decision was made because the large arrays in the program are the significant ones that the programmer has to distribute explicitly to get good performance. Any other arrays or scalars will be small and generally will benefit from being replicated since their values will then be available everywhere. Of course, the programmer retains complete control and can specify a different distribution for these arrays.

Replicated data is cheap to read but generally expensive to write. Programmers typically use replicated data for information that is computed infrequently but used often.

Figure 7
An Example of Array Alignment

Data Mapping and Procedure Calls

The distribution of arrays across processors introduces a new complication for procedure calls: the interface between the procedure and the calling program must take into account not only the type and size of the relevant objects but also their mapping across processors. The HPF language includes special forms of the ALIGN and DISTRIBUTE directives for procedure interfaces. These allow the program to specify whether array arguments can be handled by the procedure as they are currently distributed, or whether (and how) they need to be redistributed across the processors.

Expressing Parallel Computations

Parallel computations in HPF can be identified in four ways:

• Fortran 90 array assignments

• FORALL statements

• The INDEPENDENT directive, applied to DO loops and FORALL statements

• Fortran 90 and HPF intrinsics and library functions

In addition, a compiler may be able to discover parallelism in other constructs. In this section, we discuss the first two of these parallel constructions.

Fortran 90 Array Assignment

In Fortran 77, operations on whole arrays can be accomplished only through explicit DO loops that access array elements one at a time. Fortran 90 array assignment statements allow operations on entire arrays to be expressed more simply.

In Fortran 90, the usual intrinsic operations for scalars (arithmetic, comparison, and logical) can be applied to arrays, provided the arrays are of the same shape. For example, if A, B, and C are two-dimensional arrays of the same shape, the statement C = A + B assigns to each element of C a value equal to the sum of the corresponding elements of A and B.

In more complex cases, this assignment syntax can have the effect of drastically simplifying the code. For instance, consider the case of three-dimensional arrays, such as the arrays dimensioned in the following declaration:

      real D(10, 5:24, -5:M), E(0:9, 20, M+6)

In Fortran 77 syntax, an assignment to every element of D requires triple-nested loops such as the example shown in Figure 8.

In Fortran 90, this code can be expressed in a single line:

      D = 2.5*D+E+2.0
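The single-line form pairs elements of D and E by position, not by index value, which is why the loop form needs the index offsets i-1, j-4, and k+6. The Python sketch below checks that equivalence on small stand-in arrays. It is an illustration only: arrays are modelled as dictionaries keyed by Fortran-style index tuples, M is given a hypothetical value of 0, and the function names are invented for the example.

```python
from itertools import product

M = 0  # hypothetical value; the declaration leaves M symbolic

D_BOUNDS = [(1, 10), (5, 24), (-5, M)]     # real D(10, 5:24, -5:M)
E_BOUNDS = [(0, 9), (1, 20), (1, M + 6)]   # real E(0:9, 20, M+6)


def index_space(bounds):
    """All index tuples for the given (lower, upper) bounds, in a fixed order."""
    return list(product(*(range(lo, hi + 1) for lo, hi in bounds)))


def loop_version(D, E):
    """Triple-nested loop with explicit index offsets."""
    out = dict(D)
    for i in range(1, 11):
        for j in range(5, 25):
            for k in range(-5, M + 1):
                out[i, j, k] = 2.5 * D[i, j, k] + E[i - 1, j - 4, k + 6] + 2.0
    return out


def array_version(D, E):
    """Whole-array form: elements are paired by position within each array."""
    pairs = zip(index_space(D_BOUNDS), index_space(E_BOUNDS))
    return {d: 2.5 * D[d] + E[e] + 2.0 for d, e in pairs}


D = {idx: float(n) for n, idx in enumerate(index_space(D_BOUNDS))}
E = {idx: float(2 * n + 1) for n, idx in enumerate(index_space(E_BOUNDS))}
print(loop_version(D, E) == array_version(D, E))
```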

The FORALL Statement

The FORALL statement is an HPF extension to the American National Standards Institute (ANSI) Fortran 90 standard but has been included in the draft Fortran 95 standard.

      do i = 1, 10
         do j = 5, 24
            do k = -5, M
               D(i, j, k) = 2.5*D(i, j, k) + E(i-1, j-4, k+6) + 2.0
            end do
         end do
      end do

Figure 8
An Example of a Triple-nested Loop

FORALL is a generalized form of Fortran 90 array assignment syntax that allows a wider variety of array assignments to be expressed. For example, the diagonal of an array cannot be represented as a single Fortran 90 array section. Therefore, the assignment of a value to every element of the diagonal cannot be expressed in a single array assignment statement. It can be expressed in a FORALL statement:

      real, dimension(n, n) :: A
      forall (i = 1:n) A(i, i) = 1

Although FORALL structures serve the same purpose as some DO loops do in Fortran 77, a FORALL structure is a parallel assignment statement, not a loop, and in many cases produces a different result from an analogous DO loop. For example, the FORALL statement

      forall (i = 2:5) C(i, i) = C(i-1, i-1)

applied to the matrix

          11   0   0   0   0
           0  22   0   0   0
      C =  0   0  33   0   0
           0   0   0  44   0
           0   0   0   0  55

produces the following result:

          11   0   0   0   0
           0  11   0   0   0
      C =  0   0  22   0   0
           0   0   0  33   0
           0   0   0   0  44

On the other hand, the apparently similar DO loop

      do i = 2, 5
         C(i, i) = C(i-1, i-1)
      end do

produces

          11   0   0   0   0
           0  11   0   0   0
      C =  0   0  11   0   0
           0   0   0  11   0
           0   0   0   0  11

This happens because the DO loop iterations are performed sequentially, so that each successive element of the diagonal is updated before it is used in the next iteration. In contrast, in the FORALL statement, all the diagonal elements are fetched and used before any stores happen.
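The difference between the two constructs is a difference in evaluation order, which a few lines of Python can imitate. This is a sketch of the semantics only, with hypothetical function names; it says nothing about how the compiler implements FORALL.

```python
def diagonal_matrix(values):
    """Square matrix (list of rows) with `values` on the diagonal."""
    n = len(values)
    return [[values[r] if r == c else 0 for c in range(n)] for r in range(n)]


def forall_shift(C):
    """forall (i = 2:5) C(i, i) = C(i-1, i-1): fetch everything, then store."""
    fetched = {i: C[i - 2][i - 2] for i in range(2, 6)}   # 1-based i
    for i, value in fetched.items():
        C[i - 1][i - 1] = value
    return C


def do_shift(C):
    """do i = 2, 5: each iteration sees the stores of earlier iterations."""
    for i in range(2, 6):
        C[i - 1][i - 1] = C[i - 2][i - 2]
    return C


def diagonal(C):
    return [C[r][r] for r in range(len(C))]


start = [11, 22, 33, 44, 55]
print("FORALL :", diagonal(forall_shift(diagonal_matrix(start))))
print("DO loop:", diagonal(do_shift(diagonal_matrix(start))))
```

The first line shows the diagonal shifted down by one position; the second shows 11 propagated along the whole diagonal.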

The Target Machine

Digital's DEC Fortran 90 compiler generates code for clusters of Alpha processors running the Digital UNIX operating system. These clusters can be separate Alpha workstations or servers connected by a fiber distributed data interface (FDDI) or other network devices. (Digital's high-speed GIGAswitch/FDDI system is particularly appropriate.) A shared-memory, symmetric multiprocessing (SMP) system like the AlphaServer 8400 system can also be used. In the case of an SMP system, the message-passing library uses shared memory as the message-passing medium; the generated code is otherwise identical. The same executable can run on a distributed-memory cluster or an SMP shared-memory cluster without recompiling.

DEC Fortran 90 programs use the execution environment provided by Digital's Parallel Software Environment (PSE), a companion product. PSE is responsible for invoking the program on multiple processors and for performing the message passing requested by the generated code.

The Architecture of the Compiler

Figure 9 illustrates the high-level architecture of the compiler. The curved path is the path taken when compiler command-line switches are set for compiling programs that will not execute in parallel, or when the scoping unit being compiled is declared as EXTRINSIC(HPF_LOCAL).

Figure 9
Compiler Components

Figure 9 shows the front end, transform, middle end, and GEM back end components of the compiler. These components function in the following ways:

• The front end parses the input code and produces an internal representation containing an abstract syntax tree and a symbol table. It performs extensive semantic checking.

• The transform component performs the transformation from global-HPF to explicit-SPMD form. To do this, it localizes the addressing of data, inserts communication where necessary, and distributes p

LOWER phase implements this by turning each Fortran 90 array assignment into an equivalent FORALL statement (actually, into the dotree representation of one). This uniform representation means that the compiler has far fewer special cases to consider than otherwise might be necessary and leads to no degradation of the generated code.

DATA 

The DATA phase specifies where data lives. Placing
and addressing data correctly is one of the major tasks
of transform. There are a large number of possibilities:

When a value is available on every processor, it is
said to be replicated. When it is available on more than
one but not all processors, it is said to be partially
replicated. For instance, a scalar may live on only one
processor, or on more than one processor. Typically, a
scalar is replicated; it lives on all processors. The replication
of scalar data makes fetches cheap because each
processor has a copy of the requested value. Stores to
replicated scalar data can be expensive, however, if the
value to be stored has not been replicated. In that case,
the value to be stored must be sent to each processor.

The same consideration applies to arrays. Arrays
may be replicated, in which case each processor has a
copy of an entire array; or arrays may be partially replicated,
in which case each element of the array is available
on a subset of the processors.

Furthermore, arrays that are not replicated may be
distributed across the processors in several different
fashions, as explained above. In fact, each dimension
of each array may be distributed independently of
the other dimensions. The HPF mapping directives,
principally ALIGN and DISTRIBUTE, give the programmer
the ability to specify completely how each
dimension of each array is laid out. DATA uses the
information in these directives to construct an internal
description or data space of the layout of each array.
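The per-dimension mapping described above can be illustrated with a small sketch. It is not the compiler's internal data space representation; the function names, the 0-based processor numbering, and the 4 x 4 processor grid are assumptions made for this example only. It uses the usual HPF block size of ceiling(n/p) for BLOCK and round-robin assignment for CYCLIC.

```python
def block_owner(i, n, p):
    """Owner of 1-based index i when n elements are BLOCK-distributed over p processors."""
    block = -(-n // p)  # ceiling(n / p), the block size
    return (i - 1) // block


def cyclic_owner(i, p):
    """Owner of 1-based index i under a CYCLIC distribution over p processors."""
    return (i - 1) % p


def owner_2d(i, k, shape, grid, kinds):
    """Owner coordinates of element (i, k); each dimension is mapped independently."""
    coords = []
    for idx, n, p, kind in zip((i, k), shape, grid, kinds):
        if kind == "block":
            coords.append(block_owner(idx, n, p))
        else:
            coords.append(cyclic_owner(idx, p))
    return tuple(coords)


# A(100, 20) distributed (cyclic, block) over an assumed 4 x 4 processor grid
print(owner_2d(2, 7, shape=(100, 20), grid=(4, 4), kinds=("cyclic", "block")))
```

Because each dimension is handled on its own, adding a third dimension or another distribution kind would only change the per-dimension helper, not the combination step.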

ITER 

The ITER ph


In our implementation, the caller performs all
remapping. If remapping is necessary, AR

real A(100, 20), B(100, 20)
!hpf$ distribute A(cyclic, block), B(cyclic, block)

forall (i = 2:99)
  A(i, k) = B(i, k)
end forall

(a) Code Fragment

m = my_processor()
if k mod 5 = ⌊m/4⌋ then
  do i = (if m mod 4 = 0 then 2 else 1), (if m mod 4
    A(i, ⌊k/5⌋) = B(i, ⌊k/5⌋)
  end do
end if

(b) Pseudocode Generated for Code Fragment

Figure 11
Code Fragment and Pseudocode Generated for Code Fr


If the arrays A and B are laid out as in Figure 12 and
if B is to be assigned to A, then array elements B(4),
B(5), and B(6), all of which live on processor 6,
should be sent to processor 1. Clearly, we do not want
to generate three distinct messages for this. Therefore,
we collect these three elements and generate one message
containing all three of them. This example
involves full knowledge.
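The idea of collecting elements that share a source and a destination into one message can be sketched as follows. This is an illustration only, not the compiler's generated code; the function name and the tuple layout are assumptions.

```python
from collections import defaultdict


def aggregate_messages(transfers):
    """Group element transfers by (source, destination) so each pair sends one message.

    transfers: iterable of (element_name, source_processor, destination_processor).
    """
    messages = defaultdict(list)
    for element, src, dst in transfers:
        messages[(src, dst)].append(element)
    return dict(messages)


# The three elements from the example above: all live on processor 6, all go to processor 1
transfers = [("B(4)", 6, 1), ("B(5)", 6, 1), ("B(6)", 6, 1)]
messages = aggregate_messages(transfers)
print(messages)
```

With full knowledge, both sides can compute this grouping independently, so no extra handshake is needed to agree on the message contents.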

Communications involving parti

Figure 14
Shadow Edges for a Nearest-neighbor Computation

communication involving even less overhead than
general communication involving full knowledge.

• No local copying. If shadow edges were not used,
then the following standard processing would take
place: For each shifted-array reference on the right-hand
side of the assignment, shift the entire array;
then identify that part of the shifted array that lives
locally on each processor and create a local temporary
to hold it. Some of that temporary (the part
representing our shadow edge) would be moved in
from a neighboring processor, and the rest of the
temporary would be copied locally from the original
array. Our processing eliminates the need for
the local temporary and for the local copy, which is
substantial for large arrays.

• Greater locality of reference. When the actual computation
is performed, greater locality of reference
is achieved because the shadow edges (i.e., the
received values) are now part of the array, rather
than being a temporary somewhere else in memory.

• Fewer messages. Finally, the optimization also
makes it possible for the compiler to see that some
messages may be combined into one message,
thereby reducing the number of messages that
must be sent. For instance, if the right-hand side
of the assignment statement in the above example
also contained a term A(i+1, j+1), even though
overlapping shadow edges and an additional
shadow edge would now be in the diagonally adjacent
processor, no additional communication
would need to be generated.
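The following sketch illustrates the shadow-edge idea in one dimension. It simulates the processors with Python lists and is not the compiler's implementation; the function names, the equal block sizes, and the zero value used beyond the ends of the array are assumptions. Each local block is allocated with one extra cell on each side, the neighbors' edge values are stored directly into those cells, and the computation then reads only local storage.

```python
def distribute_with_shadow(data, p):
    """Split data into p equal blocks, each padded with one shadow cell per side."""
    n = len(data) // p
    return [[0.0] + data[r * n:(r + 1) * n] + [0.0] for r in range(p)]


def exchange_shadow_edges(blocks):
    """Fill shadow cells from neighbors: one 'message' per neighbor, no local copy."""
    for r, blk in enumerate(blocks):
        if r > 0:
            blk[0] = blocks[r - 1][-2]   # last owned cell of the left neighbor
        if r < len(blocks) - 1:
            blk[-1] = blocks[r + 1][1]   # first owned cell of the right neighbor


def neighbor_sum(blocks):
    """Compute B(i) = A(i-1) + A(i+1) using only local (owned + shadow) storage."""
    return [[blk[j - 1] + blk[j + 1] for j in range(1, len(blk) - 1)] for blk in blocks]


data = [float(x) for x in range(1, 9)]   # A = 1..8
blocks = distribute_with_shadow(data, p=2)
exchange_shadow_edges(blocks)
result = [v for blk in neighbor_sum(blocks) for v in blk]
print(result)
```

The received values sit next to the owned elements in the same list, which is the locality benefit described in the second bullet above.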

Reductions 

The SUM intrinsic function of Fortran 90 takes an
array argument and returns the sum of all its elements.
Alternatively, SUM can return an array whose rank is
one less than the rank of its argument, and each of
whose values is the sum of the elements in the argument
along a line parallel to a specified dimension.
In either case, the rank of the result is less than that of
the argument; therefore, SUM is referred to as a
reduction intrinsic. Fortran 90 includes a family of
such reductions, and HPF


is replicated or distributed, (2) the different distributions
that each array dimension might have, and
(3) whether the reduction is complete or partial (i.e.,
with a DIM argument).
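A minimal sketch of the two kinds of reduction follows. It is illustrative only and does not reproduce the compiler's run-time routines; the function names and the list-of-blocks stand-in for a distributed array are assumptions. A complete reduction over a distributed array can be done as local partial sums followed by a combination of one value per processor, and a partial reduction (with a DIM argument) collapses just one dimension.

```python
def distributed_sum(local_blocks):
    """Complete reduction: each processor sums its block, then partial sums are combined."""
    partial = [sum(block) for block in local_blocks]   # local work, no communication
    return sum(partial)                                # one small value per processor


def sum_along_dim(matrix, dim):
    """Partial reduction like SUM(A, DIM=dim) for a rank-2 array (dim is 1-based)."""
    if dim == 1:
        # collapse the first dimension: one value per column
        return [sum(col) for col in zip(*matrix)]
    # dim == 2: collapse the second dimension, one value per row
    return [sum(row) for row in matrix]


blocks = [[1, 2, 3], [4, 5, 6], [7, 8]]   # an array of 8 elements on 3 processors
total = distributed_sum(blocks)
a = [[1, 2, 3], [4, 5, 6]]                # a 2 x 3 array
print(total, sum_along_dim(a, 1), sum_along_dim(a, 2))
```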

Run -time Preprocessing of Irregular Data Accesses 

Run-time preprocessing of irregular data accesses is
a popular techni

is a slice through either the water or the atmosphere. 

Partial differential equations relate the variables.
The model is implemented using a finite-difference
method that approximates the partial differential
equations at each of the mesh points. Models based
on partial differential equations are at the core of many
simulations of physical phenomena; finite-difference
methods are commonly used for solving such models
on computers.

The shallow-water program is a widely quoted
benchmark, partly because the program is small
enough to examine and tune carefully, yet it performs
real computation representative of many scientific simulations.
Unlike SPEC and other benchmarks, the
source for the shallow-water program is not controlled.

The shallow-water benchmark was written in HPF
and run in parallel on workstation farms using PSE.
There is no explicit message-passing code in the program.
We modified the Fortran 90 version that
Applied Parallel Research used for its benchmark data.
The F90/HPF version of the program t


of heat flow and electrostatic or gravitational potential.
We have investigated a Poisson solver using the
conjugate-gradient algorithm. The code exercises
both the nearest-neighbor optimizations and the
inlining abilities of the DEC Fortran 90 compiler.

Table 3 gives the timings and speedup obtained
on a 1000 X 1000 array. The hardware and software
configurations are identical to those used for the
shallow-water timings.

Red-black Relaxation 

A common method of solving partial differential
equations is red-black relaxation. We used this
method to solve the Poisson equation in a three-dimensional
cube. We compare the parallelization
of this algorithm for a distributed-memory system
(a cluster of Digital Alpha workstations) with Parallel
Virtual Machine (PVM), which is an explicit message-passing
library, and with HPF. These algorithms are
based on codes written by Klose, Wolton, and Lemke
and made available as part of the suite of GENESIS
distributed-memory benchmarks.
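For readers unfamiliar with the method, the sketch below shows red-black relaxation on a small two-dimensional grid. It is not the GENESIS benchmark code, which is three-dimensional and written in Fortran; the grid size, boundary values, and function names are assumptions chosen to keep the example short. Grid points are colored like a checkerboard, and all points of one color are updated before the other color. Because each point depends only on neighbors of the opposite color, every update within one color is independent, which is what makes the method suitable for data-parallel execution.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for the 2-D Poisson equation laplace(u) = f."""
    n = len(u)
    for color in (0, 1):                  # 0 = red points, 1 = black points
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1] - h * h * f[i][j])


def solve(n=9, sweeps=500):
    """Relax a test problem whose exact solution is u(x, y) = x (boundary u = x, f = 0)."""
    h = 1.0 / (n - 1)
    u = [[(j * h if i in (0, n - 1) or j in (0, n - 1) else 0.0) for j in range(n)]
         for i in range(n)]
    f = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        red_black_sweep(u, f, h)
    return u, h


u, h = solve()
print(u[4][4])   # interior value at the grid center
```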

Table 4 gives the speedups obtained for both
the HPF and PVM versions of the program, which
solves a 128 X 128 X 128 problem, on a cluster of
DEC 3000 Model 900 workstations connected by an
FDDI/GIGAswitch system. The speedups shown are
relative to DEC Fortran 77 code written for and run on
a single processor. This table shows that the HPF version
performs somewhat better than the PVM version.
There is a significant difference in the complexity of
the programs, however. The PVM code is quite intricate,
because it requires that the user be responsible
for the block partitioning of the volume, and then for
explicitly copying boundary faces between processors.
By contrast, the HPF code is intuitive and far more
easily maintained. The reader is encouraged to obtain
the codes (as described above) and compare them.

Table 3

Table 4
Speedups of DEC Fortran 90/HPF and DEC Fortran 77/PVM on Red-black Code

                        Number of Processors
                        8       4       2       1
DEC Fortran 77                                  1.00
DEC Fortran 77/PVM      7.01    3.73    1.79
DEC Fortran 90/HPF      8.04    4.10    1.95    1.05

In conclusion, we have shown that important algorithms
familiar to the scientific and technical community
can be written in HPF. HPF codes scale well to at
le

References and Notes 

1. High Performance Fortran Forum, "High Performance
Fortran Language Specification, Version 1.0,"
Scientific Programming, vol. 2, no. 1 (1993). Also
available as Technical Report CRPC-TR93300, Center
for Research on Parallel Computation, Rice University,
Houston, Tex.; and via anonymous ftp from
titan.cs.rice.edu in the directory public/HPFF/draft;
version 1.1 is the file hpf_v11.ps.

2. C. Koelbel, D. Loveman, R. Schreiber, G. Steele, Jr.,
and M. Zosel, The High Performance Fortran
Handbook (Cambridge, Mass.: MIT Press, 1994).

3. Digital High Performance Fortran 90 HPF and
PSE Manual (Maynard, Mass.: Digital Equipment
Corporation, 1995).

4. DEC Fortran 90 Language Reference Manual (Maynard,
Mass.: Digital Equipment Corporation, 1994).

5. E. Albert, K. Knobe, J. Lukas, and G. Steele, Jr., "Compiling
Fortran 8x Array Features for the Connection
Machine Computer System," Symposium on Parallel
Programming: Experience with Applications,
Languages, and Systems, ACM SIGPLAN, July 1988.

6. K. Knobe, J. Lukas, and G. Steele, Jr., "Massively Parallel
Dat


25. B. Boulter, "Performance Evaluation of HPF for Scientific
Computing," Proceedings of High Performance
Computing and Networking, Lecture Notes in
Computer Science 919

and development of the transform component of the DEC
Fortran 90 compiler. Before joining Digital in 1991, he was
involved in the design and development of compilers at
Compass, Inc. Prior to that, he worked on compilers and
software tools at Raytheon Corp. He holds a B.S.E. in
computer science and engineering from the University

of }\:J\n)I·JI;mia i J98.J. ) :\;ld ,\n ,\\\ Ill COJll})lJI'LT Sc'Jel;Ce 

from Boston University (1990).


M. Regina Bolduc 

Regina Bolduc joined Digital in 1991; she is a principal
software engineer in the High Performance Computing
Group. Regina was involved in the development of the
transform and front end components of the DEC Fortran
90 compiler. Prior to this work, she was a senior member
of the technical staff at Compass, Inc., where she worked
on the design and development of compilers and compiler-generator
tools. Regina received a B.A. in mathematics
from Emmanuel College in 1957.

Jill Ann Diewald 

Jill Diewald contributed to the design and implementation
of the transform component of the DEC Fortran 90
compiler. She is a principal software engineer in the High
Performance Computing Group. Before joining Digital
in 1991, Jill was a technical coordinator at Compass,
Inc., where she helped design and develop compilers and
compiler-related tools. Prior to that position, she worked
at Innovative Systems Techniques and Data Resources,
Inc. on programming languages that provide economic
analysis, modeling, and database capabilities for the financial
marketplace. She has a B.S. in computer science from
the University of Michigan

Israel Gale 

Israel

Neil W. Johnson 

Before coming to Digital in 1991, Neil Johnson was a
staff scientist at Compass, Inc. He has more than 30 years
of experience in the development of compilers, including
work on the vectorization and optimization phases and
tools for compiler development. As


Design of Digital's 

Parallel Software 

Environment 

Digital's Parallel Software Environment was
designed to support the development and execution
of scalable parallel applications on clusters
(farms) of distributed- and shared-memory
Alpha processors running the Digital UNIX operating
system. PSE supports the parallel execution
of High Performance Fortran applications
with message-passing libraries that meet the
low-latency and high-bandwidth communication
requirements of efficient parallel computing.
It provides system management tools to
create clusters for distributed parallel processing
and development tools to debug and profile
HPF programs. An extended version of dbx
allows HPF-distributed arrays to be viewed,
and a parallel profiler supports both program
counter and interval sampling. PSE also supplies
generic facilities required by other parallel languages
and systems.


Edward G. Benson 

David C.P. LaFrance-Linden

Richard A. Warren

Santa Wiryaman 

Digital's Parallel Software Environment (PSE) was
designed to support the development and execution
of scalable parallel applications on clusters (farms) of
distributed- and shared-memory Alpha pr

transmission control protocol/internet protocol
(TCP/IP). Networking technologies can range from
simple Ethernet to FDDI, ATM, and MEMORY
CHANNEL. Parallel execution is most efficient when
the interconnect technology offers high-bandwidth
and low-latency communications to the user at the
process level. When building a cluster for parallel processing,
the bisectional bandwidth of the communications
fabric should scale with the number of processors
in the cluster. In practice, such a configuration can be
achieved by building clusters using Alpha processors
and Digital's GIGAswitch/FDDI as components in a
multistage switch configuration.4,5 Figures 1 and 2
show two examples of PSE cluster configurations.
Although the design center for PSE is a set of machines
connected by a high-speed local area interconnect, a
cluster can be constructed that includes remote
machines connected by a wide area network.

PSE is a collection of tn


Figure 2
PSE Multistage Switch Configuration

Before developing PSE for use with HPF programs,
Digital considered two major alternatives: the distributed
computing environment (DCE) and PVM.
(At that time, the message-passing interface [MPI]
standard effort was in progress.)

Although a good model for client-server application
deployment, DCE is designed for use with remote CPU
resources via procedure calls to libraries. This model
is very different from the data-parallel and message-passing
nature of distributed parallel processing. Its
synchronous procedure call model requires the extensive
use of threads. In addition, DCE contains a significant
number of setup and management tasks. For
these reasons, we rejected the DCE environment.


Three major considerations in our choice to develop
PSE instead of using PVM were stability, performance,

the transparent usability of a symmetric multiprocessor.
Facilities such as NFS (to mount/share file systems
among machines, in particular working directories)
and network information service (NIS) (also known as
"yellow pages" and used to share password files) are
frequently used to set up a common system environment.
In such an environment, users can log into any
machine and see the same environment. Other distributed
environments such as Load Sharing Facility
(L

factors can help an application choose the best set of
members for parallel execution. This dynamic information
is collected by a daemon process (farmd). The
farmd daemon process executes as a privileged (root)
process on each cluster member and listens for requests
on a well-known cluster-specific UDP/IP port.

Multiple cluster members defined in the configuration
database are designated as load servers. The load
servers are the central repository for the dynamic
information for the entire cluster. Their farmd process
periodically receives time-stamped updates from the
individual daemons. Applications query the load
servers for both static and dynamic information.
Applications do not themselves parse the database nor
query the individual farmd daemons running on each
cluster member.
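The role of a load server can be pictured with the sketch below. It is not farmd's actual data structure or protocol; the class name, field names, and host name are hypothetical. It shows the one property the text states: the server holds static information plus time-stamped dynamic updates from the members, and applications ask the server instead of the individual daemons.

```python
class LoadServer:
    """Keeps the newest time-stamped report from each cluster member."""

    def __init__(self, static_info):
        self.static_info = dict(static_info)   # e.g. {"host": {"cpus": 4}}
        self.dynamic_info = {}                 # host -> (timestamp, load_average)

    def update(self, host, timestamp, load_average):
        """Accept a report unless a newer one from the same host is already stored."""
        current = self.dynamic_info.get(host)
        if current is None or timestamp >= current[0]:
            self.dynamic_info[host] = (timestamp, load_average)

    def query(self, host):
        """Answer an application with static and dynamic information for one host."""
        timestamp, load = self.dynamic_info.get(host, (None, None))
        return {**self.static_info.get(host, {}), "load": load, "as_of": timestamp}


server = LoadServer({"alpha1": {"cpus": 4}})
server.update("alpha1", timestamp=100, load_average=0.5)
server.update("alpha1", timestamp=90, load_average=2.0)   # stale report, ignored
print(server.query("alpha1"))
```

Using the time stamp to discard out-of-order reports is an assumption of this sketch; the text says only that the updates are time-stamped.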

Once PSE is installed and configured, farmd is
started each time the system is booted. The name of
the cluster that farmd will service and the number of
PSE jobs (job slots) that will be allowed to run are set.
The inetd facility is used to restart farmd in response to
UDP/IP connection requests, if farmd is not running.
Use of the inetd facility to start farmd improves
the availability of machines to run PSE applications by
transparently restarting farmd in the case of a failure.

As farmd daemons are started, they attempt to
establish TCP/IP connections with their neighbors as
defined by the PSE configur

Figure 3
PSE Use

[Diagram, compilation and execution flow: source "MYPROG.F90"; Fortran 90 compiler with options (e.g., -WSF 4); object file "MYPROG.O"; libraries (Fortran, run-time, PSE); standard linker (ld); executable "MYPROG"; command line switches and environment variables; execution (e.g., shell); controlling process]

process relationship between the io_manager and peer
processes. This relationship is used for exit status reporting
and process control. It also enables or eases other
activities, such as signal handling and propagation. Peer
processes create communication channels between
themselves and perform standard I/O through a designated
peer. Standard I/O is forwarded to and from the
controlling process through the io_manager. Figure 4
shows a PSE application structure.

Application Initialization

Prior to the execution of any user code, an initialization
routine executes automatically through functionality
provided by the linker and loader. The
initialization routine implements both the controlling
process functions and the HPF-specific peer initialization.
Because no explicit call is required, parallel HPF
procedures can be used within non-HPF main programs,
and proper initialization will occur. A simple
HPF main program can also be used with PSE to start
up and manage a task-parallel application that uses
PVM or MPI for message passing.

In general, the controlling process places peer
processes onto members of a partition, although hand
placement of individual peers onto selected members
is possible. To achieve efficiency and fairness in mapping
a set of peers, the controlling process consults
with a load server for load-balancing information.
Which members are used and the order in which they
are used is based on each member's load average,
number of CPUs, and number of available job slots.

[Diagram labels: peer spawn and control; static database (e.g., DNS); load server; dynamic information (e.g., load); cluster; host (cluster member); host [SMP] (cluster member); partitions]

As an alternative, PSE may map peer processes onto
members based upon a user-selected mode of operation.
In the default physical mode of operation, PSE
maps one peer process per member. In virtual mode,
PSE allows more than one peer process per member,
thereby enabling large virtual clusters. This is useful
for developing and debugging parallel programs on
limited resources. Virtual clusters also improve application
availability: when the requested number of peer
processes is greater than the available set of partition
members, applications continue to run; however, they
may suffer performance degradation.
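The two placement modes can be sketched as follows. The text names the inputs (load average, number of CPUs, available job slots) but not how PSE combines them, so the ranking used here (load per CPU, lowest first, skipping members with no free slot) is an assumed heuristic, as are the function name, the dictionary keys, and the error raised in physical mode.

```python
def place_peers(members, num_peers, mode="physical"):
    """Map peer processes onto partition members.

    members: list of dicts with 'name', 'load', 'cpus', 'free_slots'.
    The ranking below (load per CPU, lowest first) is an assumed heuristic.
    """
    usable = [m for m in members if m["free_slots"] > 0]
    ranked = sorted(usable, key=lambda m: m["load"] / m["cpus"])
    if not ranked:
        raise ValueError("no partition member has a free job slot")
    if mode == "physical":
        if num_peers > len(ranked):
            raise ValueError("physical mode allows one peer per member")
        return [m["name"] for m in ranked[:num_peers]]
    # Virtual mode: wrap around the ranked list, so members may host several peers.
    return [ranked[i % len(ranked)]["name"] for i in range(num_peers)]


members = [
    {"name": "a", "load": 2.0, "cpus": 1, "free_slots": 1},
    {"name": "b", "load": 1.0, "cpus": 4, "free_slots": 2},
    {"name": "c", "load": 0.5, "cpus": 1, "free_slots": 0},
]
print(place_peers(members, 2))
print(place_peers(members, 5, mode="virtual"))
```

In virtual mode the request for five peers is satisfied on two members, which mirrors the availability point above: the application still runs, at the cost of sharing processors.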

Application Peer Execution 

Each application peer process has an io_manager
parent process that provides it with environment
initialization, exit value processing, I/O buffering,
signal forwarding, and potential scheduling priority
and policy modification. Rather than include the


Figure 4
PSE Application Structure

[Diagram key: controlling process]

processes exit without error and at
approximately the s

normal activity will cease after receiving the last exit
notification from an io_manager.

Peers rh�lt oit abnormalh· J11r the JpplicJtion. 

Consider one peer process that exits due to a segmentation
fault and another that exits due to a floating
point exception. There is no single exit value possible
for the application; PSE chooses the first abnormal
value it sees. Furthermore, as a result of error detection
in the communication library, the other peer
processes will exit with lost network connections. It is
possible that the controlling process will see an exit
value for this effect before it sees an exit value for one
of the causes, resulting in

r the core directories
can be set through an environment variable.

Issues 

Although PSE achieves the standard UNIX look-and-feel
for most application situations, complete transparency
is not achieved. For example, timing an
application-controlling process using the c-shell's
built-in time command does not time user code or
provide meaningful statistics other than the elapsed
wall clock time to start a parallel application and to tear
it down. Another situation that highlights the parallel
nature of PSE occurs during application debugging:
multiple debug sessions are started by running the
application with a debugger flag rather than by using
dbx directly.

Tools for HPF Programming

The development model for HPF-based applications
is a two-step process. First, a serial Fortran 90 program
is written, debugged, and optimized. Then it is parallelized
with HPF directives and again debugged and
optimized. The development tools supplied with PSE
address profiling and debugging. Unlike most of PSE,
which is language-independent, both the pprof profiling
facility and the "dbx in n windows" debugging
facility are specific to HPF programming.

Profiling

Several issues in profiling parallel HPF programs do
not apply to Fortran progr

r the execution of each event.
This produces an accurate execution profile. Events
include routines, array assignments, do loops, FORALL
constructs, message sends, and message receives.
Because the entry and exit times are recorded, time
spent executing other events within an event is
included, which gives a hierarchical profile. To achieve
fine-resolution timings (single-digit nanoseconds), the
Alpha process cycle counter is used to measure time.
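Interval profiling of this kind can be sketched in a few lines. This is not pprof's implementation; the class and method names are hypothetical, and Python's time.perf_counter_ns stands in for the Alpha cycle counter. Recording an entry time and an exit time for every event means a parent event's total naturally includes the time of the events nested inside it.

```python
import time


class IntervalProfiler:
    """Records entry and exit times per event; nested time is included in the parent."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock
        self.stack = []       # (event name, entry time) of events in progress
        self.inclusive = {}   # event name -> total inclusive time

    def enter(self, name):
        self.stack.append((name, self.clock()))

    def exit(self):
        name, start = self.stack.pop()
        self.inclusive[name] = self.inclusive.get(name, 0) + self.clock() - start


ticks = iter([0, 10, 40, 100])   # a fake clock, for a repeatable example
profiler = IntervalProfiler(clock=lambda: next(ticks))
profiler.enter("routine")        # t = 0
profiler.enter("message send")   # t = 10
profiler.exit()                  # t = 40: the send took 30
profiler.exit()                  # t = 100: the routine took 100, send included
print(profiler.inclusive)
```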

Analysis The pprof utility provides many different
ways to examine and report on a large set of profiling
data from a parallel program execution. Different
approaches include focusing on routines, statements,
or communications. In contrast, prof reports on procedures
only. With pprof, the scope of the analysis can be
limited to a single peer process or encompass all application
processes. The range of reports generated can be
comprehensive or limited to a number of events or
a percentage of time. Users can specify their reports
from a combination of analysis, report format, and
scoping options. By default, the pprof utility reports on
routine-level activity averaged across all peer processes,
which provides an overall view of application behavior.

Parallel programs execute most efficiently when
there is minimum communication between processes.
The high-level, data parallel nature of the HPF
language reduces the visibility of communication to
the programmer. To make tuning easier, pprof was
designed with the ability to focus tuning on communication.
Reports can be generated th

is implemented using a vector of pointers to functions
within a shared library, provides flexibility to introduce
new protocols and media without having to recompile
or relink existing applications.
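The "vector of pointers to functions" technique can be illustrated with a dispatch table. The sketch below uses a Python dictionary of callables in place of a C function-pointer vector in a shared library; every name in it is hypothetical and the send functions only return a description of what they would do. The point is that callers go through the table, so a new medium can be installed without touching them.

```python
def tcp_send(peer, data):
    return f"tcp -> {peer}: {len(data)} bytes"


def udp_send(peer, data):
    return f"udp -> {peer}: {len(data)} bytes"


def shm_send(peer, data):
    return f"shm -> {peer}: {len(data)} bytes"


# The table of entry points: one set of functions per message-passing medium.
TRANSPORTS = {
    "tcp": {"send": tcp_send},
    "udp": {"send": udp_send},
    "shm": {"send": shm_send},
}


def send(medium, peer, data):
    """Application code calls through the table and never names a transport directly."""
    return TRANSPORTS[medium]["send"](peer, data)


def register_transport(name, send_function):
    """A new medium is added by installing new entries; callers are unchanged."""
    TRANSPORTS[name] = {"send": send_function}


register_transport("atm", lambda peer, data: f"atm -> {peer}: {len(data)} bytes")
print(send("udp", 3, b"abcd"))
print(send("atm", 1, b"xy"))
```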

Shared-memory Message Passing 

The use of shared memory as a message-passing
medium allows for very high performance because
data does not have to be copied. When designing
shared-memory messaging, we looked at a variety of
interrelated issues, including coordination mechanisms,
memory-sharing strategies, and memory consumption.
The use of locks (i.e., semaphores) in the
traditional manner to coordinate access to shared
memory segments proved problematic. For example,
clients often request a message from


The UDP/IP option provides better bandwidth
than TCP/IP with smaller messages and matches
the TCP/IP bandwidth at large message size. The
user-level latency reduction, however, was less than
expected. The next two sections discuss our investigation
into ways to optimize the latency of UDP/IP and
the performance of the message-passing options.

Optimizing UDP/IP 

Our initial approach to improve latency was to reexamine
the standard UDP/IP code path within the
Digital UNIX kernel for unnecessary overhead. Our
idea was to create a faster path, optimized for a
UDP/IP over a local area network (LAN) configuration
by reducing numerous conditional checks in the
path. Although this work yielded some improvement,
it was not enough to justify supporting a deviation
from the standard code path. An overhaul of the original
code path would have been necessary for this
approach to gain significant improvement in latency.

UDP/IP provides a general transport protocol,
capable of running across a range of network interfaces.
We realize the value in retaining the generality
of UDP/IP. For optimal performance, however, we
anticipate typical cluster configurations being constructed
using a high-performance switched LAN
technology such as the GIGAswitch/FDDI system.
In such configurations, the IP family of protocols
presents unnecessary protocol-processing overhead.
A messaging system using a lower-level protocol, such
as native FDDI, would offer better latency, but its
implementation requires the use of nonstandard mechanisms
to access the data link layer directly, which is less
general and portable than a UDP/IP implementation.

Based on the above observations, we designed a new
protocol stack in the kernel, called UDP_prime, to
coexist with the standard UDP/IP stack. UDP_prime
packets conform to the UDP/IP specification. To
reduce the amount of per-packet processing and
approach that of a lower-level protocol, UDP_prime
imposes several restrictions on its use. These restrictions
optimize the typical switched LAN cluster configurations.
To retain the generality of UDP/IP,
UDP_prime falls back to the standard UDP/IP stack
when these restrictions are not applicable.

Restrictions on UDP_prime

The LAN nature of the cluster configuration imposes
a restriction on UDP_prime. Each cluster member has
to be within the same IP subnet, which is directly
accessible from any other member. With this restriction,
routing decision and internet-to-hardware
address resolution can be done once for each peer-to-peer
connection rather than on a per-packet basis.
Per-packet UDP/IP checksum processing can also
be eliminated, because intermediate routing is not
involved and the data link cyclic redundancy check
(CRC) is sufficient to guarantee error-free packets.
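The same-subnet restriction is easy to state as a check. The sketch below uses Python's standard ipaddress module and is only an illustration of the condition, not of how the kernel code tests it; the function name, the example addresses (taken from the documentation address ranges), and the prefix lengths are assumptions.

```python
import ipaddress


def all_on_one_subnet(addresses, prefix_length):
    """True when every address falls in the same IP subnet for the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_length}").network for addr in addresses
    }
    return len(networks) == 1


cluster = ["192.0.2.10", "192.0.2.11", "192.0.2.200"]
print(all_on_one_subnet(cluster, 24))   # all three share 192.0.2.0/24
print(all_on_one_subnet(cluster, 25))   # .200 falls in a different /25
```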

The next restriction is the maximum length of the
message. PSE message passing uses fixed-size buffers.
UDP_prime restricts the maximum buffer size to be
the maximum transmission unit (MTU) of the underlying
network interface. This eliminates per-message IP
fragmentation and defragmentation overhead. Since
the messaging clients have to fragment the messages
into fixed-size buffers at the higher layer, there is no
need for the IP layer to perform further fragmentation.
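The higher-layer fragmentation can be sketched as below. The function name is hypothetical, and the choice to subtract a 20-byte IPv4 header (no options) and an 8-byte UDP header from the MTU is an assumption of this sketch; the text says only that the buffer size is bounded by the MTU. The example uses 4352 bytes, the IP MTU commonly used on FDDI.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
UDP_HEADER_BYTES = 8


def split_message(message, mtu):
    """Cut a message into buffers that each fit in one MTU-sized UDP/IP packet."""
    payload = mtu - IP_HEADER_BYTES - UDP_HEADER_BYTES
    if payload <= 0:
        raise ValueError("MTU too small for UDP/IP headers")
    return [message[i:i + payload] for i in range(0, len(message), payload)]


buffers = split_message(bytes(10000), mtu=4352)
print([len(b) for b in buffers])
```

Because every buffer already fits in one packet, the IP layer never has to fragment or reassemble, which is the saving described above.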

One complication in our current implementation
occurs when multiple peers are running on a single
system while others are on remote systems. The
default behavior for peers within a single system is
to communicate across the loopback interface. In this
situation, there are two MTU values, one for the network
interface and one for the loopback interface.
Our current implementation of UDP_prime does not
allow communication over the loopback interface so
that a single-size MTU can be used. Further studies
need to be done to find an optimal maximum buffer
size, taking into account multiple MTU values, page
alignment, and so forth.

Based on the above restrictions, UDP_prime optimizes
the per-packet processing overhead of sending a
packet by constructing a UDP, IP, and data link packet
header template for each peer at initialization. Except
for a few fields, the content of these headers is static
with respect to a particular peer. UDP_prime defines a
new IP option, IP_UDP_PRIME, for the setsockopt()
system call, to allow the messaging system to define
the set of peers and their Internet addresses involved in
the application execution. The IP option processing,
done prior to sending any message, makes routing
decisions, performs Internet-to-hardware address resolution,
and fills in the static portion of the header
fields. When sending a packet, UDP_prime simply
copies the header template to the beginning of the
packet, minimizing the per-packet processing overhead
and increasing the likelihood of the templates
being in the CPU cache. Several header fields, such as
the IP identification, header checksum, and packet
length fields, are then filled dynamically, and the complete
packet is presented to the interface layer.
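The header-template technique can be shown in user-space Python, with the caveat that the real work happens inside the Digital UNIX kernel and also covers the data link header, which this sketch omits. The function names, the TTL of 64, and the addresses and ports are assumptions. The static IPv4 and UDP fields are packed once per peer; for each packet only the IP total length, IP identification, UDP length, and IP header checksum are filled in. The UDP checksum is left at zero, which in IPv4 means "not computed".

```python
import socket
import struct


def ip_checksum(header):
    """Standard Internet checksum (one's complement sum of 16-bit words)."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def make_template(src_ip, dst_ip, src_port, dst_port, tos=0):
    """Build the static part of the IPv4 + UDP headers once per peer."""
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, tos, 0,                 # version/IHL, TOS, total length (dynamic)
                     0, 0,                         # identification (dynamic), flags/offset
                     64, socket.IPPROTO_UDP, 0,    # TTL, protocol, checksum (dynamic)
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    udp = struct.pack("!HHHH", src_port, dst_port, 0, 0)  # length dynamic, checksum unused
    return ip + udp


def build_packet(template, ident, payload):
    """Copy the template, then fill in only the per-packet fields."""
    pkt = bytearray(template)
    struct.pack_into("!H", pkt, 2, len(template) + len(payload))   # IP total length
    struct.pack_into("!H", pkt, 4, ident)                          # IP identification
    struct.pack_into("!H", pkt, 24, 8 + len(payload))              # UDP length
    struct.pack_into("!H", pkt, 10, ip_checksum(bytes(pkt[:20])))  # IP header checksum
    return bytes(pkt) + payload


template = make_template("192.0.2.1", "192.0.2.2", 5000, 5001)
packet = build_packet(template, ident=7, payload=b"hello")
print(len(packet), ip_checksum(packet[:20]))
```

A receiver verifies the IP header by running the same checksum over all 20 bytes and expecting zero, which is what the final print shows.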

UDP_prime Packet Processing

Since a UDP_prime packet is a UDP/IP packet, the
standard UDP/IP receive processing can handle the
packet and deliver it to the messaging client. To trigger
the use of UDP_prime optimized receive processing,
the sending system uses the type of service (TOS)
field within the IP header to specify priority delivery of
the packet. The priority delivery indication does not
by itself uniquely differentiate between UDP_prime
and UDP/IP packets, as any other IP packets can
also have the TOS field set to priority. As a result, the
optimized receive processing has to check for the
packet's adherence to the UDP_prime restrictions.
Nonadherence to the restrictions reroutes the packet
to the standard receive processing code.

\tVhen a packet arrives at a network interface, the 

imertace posts a hardware interrupt, and the interface 

interrupt service routine processes the packer. The 

standard interrupt service routine deletes the data link 

header, and hands the packet over ro the netisr ke rnel 

thread ." Netisr clemultiplexes the packet based on 

the packet header contents and delivers it to the appli 

cation's socket receive bufkr. Netisr, designed to be 

a general-purpose packet demultiplexer, runs at a low 

interrupt priori ry level. The main reason fo r a thread 

based demultiplexer is extensibility. New protocol 

stacks can be registered to the thread . Since there is 

no a priori knowledge of the execution and SMP lock 

ing requirernents of these stacks, a thread-based low 

interrupt priority demultiplexer is needed so that the 

network interrupt processing rime can be held to a 

minimum. The extensibi lity feature, however, intro 

duces a context switch overhead. 

For UDP_prime, the packet header processing time on the receive path is small and nearly constant. We modified the interface service routine to demultiplex the packet by processing the data link, IP, and UDP headers, and deliver the packet to the socket receive buffer without handing it over to netisr. This short circuit path is used only when the packet is a UDP/IP packet with no IP fragmentation and with priority delivery indication. If these conditions are not met, the standard netisr path is chosen. The UDP_prime receive path eliminates the netisr context switch overhead. This is a significant advantage, especially when the receiving application runs with a real-time FIFO scheduling policy.
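The choice between the two receive paths can be sketched as a small predicate over the IP header. This is only an illustration of the three conditions named above (UDP protocol, no fragmentation, priority indication). The TOS value, the function names, and the requirement that the header carry no IP options are assumptions, and the further checks against the UDP_prime restrictions are not modeled.

```python
import struct

IPPROTO_UDP = 17
PRIORITY_TOS = 0x10   # hypothetical marker for "priority delivery"


def choose_receive_path(ip_header: bytes) -> str:
    """Return "short_circuit" or "netisr" for the given IPv4 header bytes."""
    if len(ip_header) < 20:
        return "netisr"
    ver_ihl, tos, _length, _ident, flags_frag, _ttl, proto = struct.unpack_from(
        "!BBHHHBB", ip_header
    )
    more_fragments = bool(flags_frag & 0x2000)
    fragment_offset = flags_frag & 0x1FFF
    plain_ipv4 = ver_ihl == 0x45   # assumption: no IP options on the fast path
    if (
        plain_ipv4
        and proto == IPPROTO_UDP
        and not more_fragments
        and fragment_offset == 0
        and tos & PRIORITY_TOS
    ):
        return "short_circuit"
    return "netisr"


def make_header(tos: int, flags_frag: int, proto: int) -> bytes:
    """Illustrative helper that packs a 20-byte IPv4 header for experiments."""
    return struct.pack(
        "!BBHHHBBH4s4s", 0x45, tos, 28, 1, flags_frag, 64, proto, 0,
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
```

Any packet that fails a single test falls back to the general path, so the fast path never has to handle the uncommon cases.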

SMP Synchronization 

One difficulty in designing the UDP_prime stack to run in parallel with the standard UDP/IP stack was in SMP synchronization. The socket buffer structure is a critical section guarded by a complex lock. Requesting a complex lock in Digital UNIX blocks execution if the lock is taken. To prevent deadlocks, its use is prohibited at an elevated priority level, such as the case in the interrupt service routine. To work around this problem, a new spin lock was introduced in the short circuit path and in the socket layer where access to the socket buffer needs to be synchronized.

Performance 

To measure message-passing performance, we used two DEC 3000 Model 700 workstations connected by a GIGAswitch/FDDI system using TURBOchannel-based DEFTA full-duplex FDDI adapters. Each workstation contained a 225-megahertz (MHz) Alpha 21064 microprocessor and was running the Digital UNIX version 3.0 operating system.

Figure 6 shows the message-passing bandwidth for TCP/IP, UDP/IP, and UDP_prime transports at different message sizes. The bandwidth was measured at the message-passing application programmer interface (API) level, taking into account allocation and deallocation of each message buffer in addition to the data transmission. TCP/IP, UDP/IP, and UDP_prime bandwidth peaks at approximately 95 megabits per second at a 4,224-byte message, approaching the FDDI peak bandwidth. UDP/IP approaches the peak bandwidth at a 1,400-byte message, and UDP_prime at a 1,024-byte message. Reaching the peak bandwidth using small messages is a measure of protocol processing efficiency.

Figure 7 shows the minimum message-passing latency for TCP/IP, UDP/IP, and UDP_prime transports at different message sizes. The latency was measured as half of the minimum time to send a message and receive the same message looped by the receiver system over many iterations. The measurement made allowance for the allocation and deallocation of each message buffer, in addition to the round-trip transmission.
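The latency definition above (half of the minimum round-trip time over many iterations) can be expressed as a short timing harness. This is a sketch rather than the authors' benchmark: the function name is hypothetical, and the caller is assumed to supply a round_trip callable that allocates a buffer, sends it, waits for the echoed message, and frees the buffer.

```python
import time
from typing import Callable


def min_one_way_latency(round_trip: Callable[[], None], iterations: int = 1000) -> float:
    """Return half of the fastest observed round trip, in seconds."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        round_trip()                      # send a message and receive its echo
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)         # keep the minimum, not the mean
    return best / 2.0
```

Taking the minimum rather than the mean filters out iterations disturbed by scheduling or unrelated traffic, which is why it suits a measure of protocol processing cost.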

Compared to the TCP/IP option, UDP/IP has a slightly higher minimum latency. This was not expected, because the original goal of the UDP/IP option was to reduce TCP/IP processing overhead. It is, however, encouraging to see only a slight degradation in latency when the reliable in-order delivery protocol is implemented at the library level. This prompted us to use the same protocol engine in the library for UDP_prime. At a very small message size (4 bytes),

Figure 6
Peer-to-Peer Bandwidth (plot of bandwidth versus message size in bytes for the UDP_prime, TCP/IP, and UDP/IP transports)

Digital Technical Journal Vol. 7 No. 3 1995 35

Figure 7
Minimum Peer-to-Peer Latency (plot of latency versus message size in bytes for the UDP_prime, TCP/IP, and UDP/IP transports)

protocol processing overhead dominates the latency. At this point, UDP_prime ...

References 

1. Digital High Performance Fortran 90 HPF and PSE Manual (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-2ATAA-TE, 1995).

2. G. Bell, "Scalable, Parallel ...

Richard A. Warren

Richard Warren is a principal software engineer in the High Performance Computing Group, where his primary responsibility is the design and development of Digital's parallel software environment. Since joining Digital in 1977, Richard has contributed to PDP-11 systems development and VAX 32-bit shared-memory multiprocessor designs. He has also been

An Overview of the 

Sequoia 2000 Project 

The Sequoia 2000 project is the joint effort of computer scientists, earth scientists, government agencies, and industry partners to build a better computing environment for global change researchers. The objectives of this widely distributed project are to support high-performance I/O on terabyte data sets, to put all data in a database management system, and to provide improved visualization tools and high-speed networking. The participants developed a four-level architecture to meet these objectives. Chief among the lessons learned is that the Sequoia 2000 system must be considered an end-to-end solution, with all pieces of the architecture working together. This paper describes the Sequoia 2000 project and its implementation efforts during the first three years. The research was sponsored by Digital Equipment Corporation.

Michael Stonebraker

The purpose of the Sequoia 2000 project is to build a better computing environment for global change researchers, hereafter referred to as Sequoia 2000 clients. These researchers investigate issues such as global warming, ozone depletion, environment toxification, and species extinction and are members of earth science departments at universities and national laboratories. A more detailed conception for the project appears in the Sequoia 2000 technical report "Large Capacity Object Servers to Support Global Change Research."1

The participants in the Sequoia 2000 project are investigators of four types: (1) computer science researchers, (2) earth science researchers, (3) government agencies, and (4) industry partners.

Computer science researchers are responsible for building a prototype environment that better serves the needs of the target clients. Participating in the Sequoia 2000 project are investigators from the Computer Science Division at the University of California, Berkeley; the Computer Science Department at the University of California, San Diego; the School of Library and Information Studies at the University of California, Berkeley; and the San Diego Supercomputer Center.

Earth science researchers must explain their needs to the computer science investigators and use the resulting prototype environment to perform better earth science research. The Sequoia 2000 project comprises earth science investigators from the Department of Geography at the University of California, Santa Barbara; the Atmospheric Science Department at the University of California, Los Angeles (UCLA); the Climate Research Division at the Scripps Institution of Oceanography; and the Department of Earth, Air, and Water at the University of California, Davis.

To ensure that the resulting computer environment addresses the needs of the Sequoia 2000 clients, government agencies that are affected by global change matters participate in the project. The responsibility of these agencies is to steer Sequoia 2000 research toward achieving solutions to their problems. The government agencies that participate are the State of California Department of Water Resources (DWR), the State of California Department of Forestry, the Coordinated Environment Research Laboratory (CERL) of the United States Army, the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), and the United States Geologic Survey (USGS).

The task of the industry participants is to use the Sequoia 2000 technology

Figure 1
The Sequoia 2000 Architecture (diagram with boxes labeled Applications, Database Management System, File Systems, Footprint, Storage Devices, and Network)

The File System Layer

On top of the footprint layer is the file system layer. Two file systems manage data in the Bigfoot multilevel memory hierarchy. The first file system is Highlight, which extends the Log-structured File System (LFS) pioneered for disk devices by Ousterhout and Rosenblum to tertiary memory. The original LFS treats a disk device as a single continuous log onto which newly written disk blocks are appended. Blocks are never overwritten, so a disk device can always be written sequentially. Hence, the LFS turns a random-write environment into a sequential-write environment. In particular problem areas, this may lead to much higher performance. Benchmark data support this conclusion.4 In addition, the LFS can always identify the last few blocks that were written prior to a file system failure by finding the end of the log at recovery time. File system repair is then very fast, because potentially damaged blocks are easily found. This approach differs from conventional file system repair, where a laborious check of the disk must be performed to ascertain disk integrity.
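The two LFS properties described above, append-only writes and recovery that inspects only the tail of the log, can be shown with a toy in-memory model. This is a sketch only: the class and method names are invented for illustration, and a real LFS also needs segment cleaning and on-disk checkpoints, which are not modeled here.

```python
class LogStructuredStore:
    """Toy model of a log-structured store: blocks are appended, never overwritten."""

    def __init__(self):
        self.log = []      # append-only list of (block_no, data) records
        self.index = {}    # block_no -> position of the newest record in the log

    def write(self, block_no: int, data: bytes) -> int:
        """Append a new version of the block and return its log position."""
        self.log.append((block_no, data))
        self.index[block_no] = len(self.log) - 1
        return len(self.log) - 1

    def read(self, block_no: int) -> bytes:
        """Return the newest version of the block."""
        return self.log[self.index[block_no]][1]

    def tail(self, count: int):
        """The last `count` records written: the only ones recovery must examine."""
        return self.log[-count:] if count > 0 else []

    def rebuild_index(self):
        """Replay the log in order so that later records supersede earlier ones."""
        self.index = {}
        for pos, (block_no, _data) in enumerate(self.log):
            self.index[block_no] = pos
```

Because every write lands at the end of the log, a random-write workload becomes a sequential one, and the records most likely to be damaged by a crash are exactly those returned by tail.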

Highlight extends the LFS to support tertiary memory by adding a second log-structured file system on top of the footprint layer. This file system also writes tertiary memory blocks sequentially, thereby obtaining the performance characteristics of the LFS. The Highlight file system adds migration and bookkeeping code that treats the disk LFS file system as a cache for the tertiary memory file system. In summary, Highlight should provide good performance for workloads that consist of mainly write operations. Since Sequoia 2000 clients want to archive vast amounts of data, the Highlight file system has the potential for good performance in the Sequoia 2000 environment.

The second file system is Inversion.5 Most DBMSs, including the one used for the Sequoia 2000 project, support binary large objects (BLOBs), which are byte strings of arbitrary length. Like several commercial systems, Sequoia's data manager POSTGRES stores large objects in a customized storage system directly on a raw storage device.6 As a result, it is a straightforward exercise to support conventional files on top of DBMS large objects. In this way, the front end turns every read or write operation into a query or an update, which is processed directly by the DBMS. Simulating files on top of DBMS large objects has several advantages. First, DBMS services such as transaction management and security are automatically supported for files. In addition, novel characteristics of our next-generation DBMS, including time travel and an extensible type system for all DBMS objects, are automatically available for files. Of course, the possible disadvantage of simulating files on top of a DBMS is poor performance. As reported by Olson, Inversion performance is exceedingly good when large blocks of data are read and written, as is characteristic of the Sequoia 2000 workload.5
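The idea of turning file reads and writes into queries and updates can be sketched with any DBMS that stores large objects. The example below uses SQLite from the Python standard library as a stand-in, since POSTGRES large-object calls are not shown in this paper. The table layout, class name, and method names are assumptions for illustration and do not describe how Inversion is implemented.

```python
import sqlite3


class LargeObjectFiles:
    """File-like reads and writes expressed as queries and updates on one table."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS large_object ("
            "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )

    def write(self, name: str, offset: int, data: bytes) -> None:
        """A write becomes an update executed inside a DBMS transaction."""
        with self.conn:   # commits on success, rolls back on error
            row = self.conn.execute(
                "SELECT data FROM large_object WHERE name = ?", (name,)
            ).fetchone()
            old = bytes(row[0]) if row else b""
            if len(old) < offset:
                old += b"\0" * (offset - len(old))   # zero-fill any gap
            new = old[:offset] + data + old[offset + len(data):]
            self.conn.execute(
                "INSERT OR REPLACE INTO large_object (name, data) VALUES (?, ?)",
                (name, new),
            )

    def read(self, name: str, offset: int, length: int) -> bytes:
        """A read becomes a query; SQL substr() positions are 1-based."""
        row = self.conn.execute(
            "SELECT substr(data, ?, ?) FROM large_object WHERE name = ?",
            (offset + 1, length, name),
        ).fetchone()
        return bytes(row[0]) if row else b""
```

Because each write runs in a transaction, the file inherits atomicity from the database without any file-system-specific recovery code, which is the first advantage the text describes.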

At the present time, Highlight is operational but very buggy. Inversion, on the other hand, is used to manage production data on Sequoia's Sony WORM jukebox. Unfortunately, the reliability of the prototype system has not met user expectations. Sequoia 2000 clients have a strong desire for commercial off-the-shelf (COTS) software and are frustrated by documentation glitches, bugs, and crashes.

As a result, the Sequoia 2000 project team has also deployed two commercial file systems, Epoch and AMASS. The Epoch file system is quite reliable but does not support either of Sequoia's large-capacity robots. Hence, it is used heavily but only for small data sets. The AMASS file system is just coming into production use for Sequoia's Metrum robot and replaces an earlier COTS system, which was unreliable. Given the experience of the Sequoia 2000 team with tertiary memory support, tertiary memory users should carefully test all file system software.

The DBMS Layer 

To meet Sequoia 2000 client needs, a DBMS must support spatial data such as points, lines, and polygons. In addition, the DBMS must support the large spatial arrays in which satellite imagery is naturally stored. These characteristics are not met by popular, general-purpose relational and object-oriented DBMSs.7 The best fit to client needs is a special-purpose Geographic Information System (GIS) or a next-generation object-relational DBMS. Since it has one such object-relational system, namely POSTGRES, the Sequoia 2000 project elected to focus its DBMS efforts on this system.

To make the POSTGRES DBMS suitable for Sequoia 2000 use, we require a schema for all Sequoia data. This database design process has evolved as a cooperative exercise between various database experts at Berkeley, the San Diego Supercomputer Center, CERL, and SAIC. The Sequoia schema is the collection of metadata that describes the data stored in the POSTGRES DBMS on Bigfoot. Specifically, these metadata comprise

• A standard vocabulary of terms with agreed-upon definitions that are used to describe the data

• A set of types, instances of which may store data values

• A hierarchical collection of classes that describe

3. A browsing capability for textual information

4. A facility to interface the UCLA General Circulation Model (GCM) to the POSTGRES/Illustra system

5. A desktop videoconferencing or "picturephone" facility

For the off-the-shelf visualization tool, we have converged around the use of AVS and IDL for project activities. AVS has an easy-to-use "boxes-and-arrows" user interface, whereas IDL has a more conventional linear programming notation. On the other hand, IDL has better two-dimensional (2-D) graphics features. Both AVS and IDL allow the user to read and write file data. To connect to the DBMS, we have written an AVS-POSTGRES bridge. This program allows the user to construct an ad hoc POSTGRES query and pipe the result into an AVS boxes-and-arrows network. Sequoia 2000 clients can use AVS for further processing on any data retrieved from the DBMS. IDL is being interfaced to AVS by the vendor. Consequently, data retrieved from the database can be moved into IDL using AVS as an intermediary. Now that we have migrated to the Illustra DBMS, we are considering porting this AVS bridge to the Illustra application programming interface (API).

AVS has some disadvantages

Illustra has an integrated document data type with capabilities similar to the extensions we made to POSTGRES.

A related Berkeley project is focused on digitizing all the Berkeley Computer Science Technical Reports. This project uses a Mosaic client to access a custom World Wide Web server called Dienst, which stores technical report objects in a UNIX file system. In a few months, we expect to convert Dienst to store objects in the Sequoia 2000 database, rather than in files. When this system, nicknamed Database Dienst, is operational, Mosaic/Dienst service will be available for all textual objects in the Sequoia schema.

Our fourth thrust in the application layer is a facility to interface the UCLA General Circulation Model (GCM) to the POSTGRES/Illustra system. This program is a "data pump" because it pumps data out of the simulation model and into the DBMS. We named the program "the big lift" after the DWR pumping station that raises Northern California water over the Tehachapi Mountains into Southern California.

Basically, the UCLA GCM produces a vector of simulation output variables for each time step of a lengthy run for each tile in a three-dimensional (3-D) grid of the atmosphere and ocean. Depending on the scale of the model, its resolution, and the capability of the serial or parallel machine on which the model is running, the UCLA GCM can produce from 0.1 to 10.0 megabytes per second (MB/s) of output. The purpose of the big lift is to install the output data into a database in real time. UCLA scientists can then use Object-Knowledge, Tioga, Tecate, AVS, or IDL to visualize their simulation output. The big lift will likely have to exploit parallelism in the data manager, if it is required to keep up with the execution of the model on a massively parallel architecture.

The fifth application system is a conferencing system. Since Sequoia 2000 is a distributed project, we learned early that face-to-face meetings that required participants to travel to other sites and electronic mail were not sufficient to keep project members working as a team. Consequently, we purchased conference room videoconferencing equipment for each project site. This technology costs approximately $50,000 per site and allows multi... videoconferencing tends to be used for arranged conferences and not

Sequoia 2000 as an End-to-End Problem 

The major lesson we have learned from the Sequoia 2000 project is that many issues facing our clients cannot be isolated to a single layer of the Sequoia 2000 architecture. This section describes three such end-to-end problems: guaranteed delivery, abstracts, and compression.

Guaranteed Delivery 

Clearly, guaranteed delivery must be an end-to-end contract. Suppose a Sequoia 2000 client wishes to visualize a specific computation; for example, the client wants to observe Hurricane Andrew as it moves from the Bahamas to Florida to Louisiana. Specifically, the client wishes to visualize appropriate satellite imagery at a resolution of 500 X 500 in 8-bit color at 10 frames per second. Hence, the client requires 2.5 MB/s of bandwidth to his screen. The following scenario might be the computation steps that take place.

The DBMS must run a query to fetch the satellite imagery. The query might require returning a 16-bit data value for each pixel that will ultimately appear on the screen. The DBMS must therefore agree to execute the query in such a way that it guarantees output at a rate of 5.0 MB/s.

The storage system at the server will fetch some number of I/O blocks from secondary and/or tertiary memory. DBMS query optimizers can accurately guess how many blocks they need to read to satisfy the query. The DBMS can then easily generate a guaranteed delivery contract that the storage manager must satisfy, thus allowing the DBMS to satisfy its contract.

The network must agree to deliver 5.0 MB/s over the network link that connects the client to the server. The Sequoia 2000 network software expects exactly this type of contract request.
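The bandwidth figures in this scenario follow from simple arithmetic, which the sketch below reproduces. The function names and the dictionary of per-layer rates are hypothetical, and 1 MB is taken as 1,000,000 bytes to match the 2.5 MB/s and 5.0 MB/s figures in the text.

```python
def required_rate_mb_per_s(width: int, height: int, bits_per_value: int,
                           frames_per_second: int) -> float:
    """Data rate needed to deliver one value per pixel for every frame."""
    bytes_per_frame = width * height * bits_per_value // 8
    return bytes_per_frame * frames_per_second / 1_000_000


def failing_layers(required: float, offered: dict) -> list:
    """Names of layers whose guaranteed rate falls short of the requirement."""
    return [layer for layer, rate in offered.items() if rate < required]


screen_rate = required_rate_mb_per_s(500, 500, 8, 10)    # 8-bit color on the screen
query_rate = required_rate_mb_per_s(500, 500, 16, 10)    # 16-bit values from the DBMS
```

Here screen_rate evaluates to 2.5 and query_rate to 5.0. A call such as failing_layers(query_rate, {"dbms": 5.0, "storage": 6.0, "network": 4.0}) returns ["network"], showing that a single weak layer breaks the end-to-end contract.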

The visualization package must

then the DBMS may have to decompress the data to perform the search. If an application resides on the same machine as the storage manager

Finally, the computer scientists and the earth scientists realized that the word "benchmark" has a different meaning for each of the two groups of researchers. To earth scientists, a benchmark is a scenario, whereas to computer scientists, a benchmark is an abstract example. This vignette was typical of the experience these two disciplines had trying to understand one another. Fundamentally, this process is time-consuming, and ample interaction time should be planned for any project that must deal with multiple disciplines.

The Sequoia 2000 project participants made effective use of "converters." A converter is a person of one discipline who is planted directly in the research group of another discipline. Through informal communication, this person serves as an interpreter and translator for the other discipline. Converters are encouraged by the existence of a formal exchange program, whereby central Sequoia 2000 resources pay the living expenses of the exchange personnel.

Lesson 4: Database technology is a major advance for earth scientists.

Our initial plan was to introduce database technology into the project with the expectation that the earth scientists would pick it up and use it. Unfortunately, they are accustomed to data being in files and found it very difficult to make the transition to a database view. The earth scientists are becoming increasingly aware of the inherent advantages of DBMS technology. In addition, we

the Sequoia 2000 effort. As a result, project management was performed poorly at best. In any future large project, this component should be addressed satisfactorily up front by project personnel.

Lesson 6: Multicampus projects are extremely difficult 

to implement. 

Sequoia 2000 work is taking place in seven different organizations within the University of California educational system. There is a constant need to transfer money and people among these organizations. Accomplishing such moves is a difficult and slow process, however, because of the bureaucracy within the system. In addition, the personnel rules of the University are often in conflict with the needs of the Sequoia 2000 project. As a result, multi-institution projects, where participants are in different and often distant locations, are extremely difficult to carry out.

Status and Future Plans 

The Sequoia 2000 project is more than three years old and has nearly accomplished its objectives. We have a common schema in place for ... under way in the spatial arena. The infrastructure is in place to enable this schema to evolve as more data types, user-defined functions, and operators are included in the future.

The combination of Object-Knowledge, Illustra, Epoch, and AMASS is proving robust and meets our clients' needs. Lastly, we have ample resources to move our prototype into production use at UCLA and Santa Barbara during the next several months.

We are also extending the scope of the prototype in two different directions. First, we will recruit additional earth scientists to utilize our system. This will require extending our common schema to meet their needs and then installing our suite of software

Proceedings of the 1993 IEEE Data Engineering Conference. Houston, Texas (February 1993).

9. J. Hellerstein and M. Stonebraker, ... Proceedings of the 1993 ACM SIGMOD Conference. Washington, D.C. (May 1993).

10. P. Kochevar and L. Wanger, "Tecate: A Software Platform for Browsing and Visualizing Data from Networked Data Sources," Digital Technical Journal, vol. 7, no. 3 (1995, this issue): 66-83.

11. M. Stonebraker et al., "Tioga: Providing Data Management for Scientific Visualization Applications," Proceedings of the 1993 VLDB Conference. Dublin, Ireland (August 1993).

12. A. Woodruff et al., "Zooming


The Sequoia 2000 

Electronic Repository 

A major effort in the Sequoia 2000 project was to build a very large database of earth science information. Without providing the means for scientists to efficiently and effectively locate required information and to browse its contents, however, this vast database would rapidly become unmanageable and eventually unusable. The Sequoia 2000 Electronic Repository addresses these problems through indexing and retrieval software that is incorporated into the POSTGRES database management system. The Electronic Repository effort involved the design of probabilistic indexing and retrieval for text documents in POSTGRES, and the development of algorithms for automatic georeferencing of text documents and segmentation of full texts into topically coherent segments for improved retrieval. Various graphical interfaces support these retrieval features.


the research conducted in these ... for the POSTGRES database system, the GIPSY system for automatic indexing of texts using geographic coordinates based on locations mentioned in the text, and the TextTiling method for improving access to full-text documents.

Indexing and Retrieval in the Electronic Repository 

The primary engine for information storage and retrieval in the Sequoia 2000 Electronic Repository is the POSTGRES next-generation database management system (DBMS). POSTGRES is the core of the DBMS-centric Sequoia 2000 system design. All the data used in the project was stored in POSTGRES, including complex multidimensional arrays of data, spatial objects such as raster and vector maps, satellite images, and sets of measurements, as well as all the full-text documents available. The POSTGRES DBMS supports user-defined abstract data types, user-defined functions, a rules system, and many features of object-oriented DBMSs, including inheritance and methods, through functions in both the query language

Figure 1
The Lassen POSTGRES Classes for Indexing and Their Linkages (diagram; surviving class labels include KW_SOURCES, KW_INDEX_FLAGS, and WN_INDEX)

a particular term from the wn_index class. The POSTQUEL definition of this class is

create kw_term_doc_rel (
    termid = int4,      /* WordNet termid number */
    synset = int4,      /* WordNet sense number */
    docid = int4,       /* document ID */
    termfreq = int4)    /* term frequency within the document */

The raw frequency of occurrence of the term in the document (termfreq) is included in the kw_term_doc_rel tuple. This information is used in the retrieval process for calculating the probability of relevance for each document that contains the term.

The kw_doc_index class stores information on individual documents in the database. This information includes a unique document identifier (docid), the location of the document (the class, the attribute, and the tuple in which it is contained), and whether it is a simple attribute or a large object (with effectively unlimited size). The kw_doc_index class also maintains additional statistical information, such as the number of unique terms found in the document. The POSTQUEL definition is as follows:

create kw_doc_index (
    docid = int4,         /* document ID */
    reloid = oid,         /* oid of relation containing it */
    attroid = oid,        /* attribute definition of attr containing it */
    attrnum = int2,       /* attribute number of attr containing it */
    tupleid = oid,        /* tuple oid of tuple containing it */
    sourcetype = int4,    /* type of object -- attribute or large object */
    doc_len = int4,       /* document length in words */
    doc_ulen = int4)      /* number of unique words in document */


The kw_sources class contains information about the classes and attributes indexed at the class level, as well as statistics such as the number of items indexed from any given class. The following POSTQUEL statement defines this class:

create kw_sources (
    relname = char16,       /* name of indexed relation */
    reloid = oid,           /* oid of indexed relation */
    attrname = char16,      /* name of indexed attribute */
    attroid = oid,          /* object ID of indexed attribute */
    attrnum = int2,         /* number of indexed attribute */
    attrtype = int4,        /* attribute type -- large object or otherwise */
    num_indexed = int4,     /* number of items indexed */
    last_tid = oid,         /* oid and time for last */
    last_time = abstime,    /* tuple added */
    tot_terms = int4,       /* total terms from all items */
    tot_uterms = int4,      /* total unique terms from all items */
    include_pat = text,     /* simple patterns to */
    exclude_pat = text)     /* match for indexable items */

The other classes shown in Figure 1 relate to the indexing and retrieval processes. The Lassen prototype uses the POSTGRES rules system to perform such tasks as storing the elements of the bibliographic records in an appropriate normalized form and to trigger the indexing daemon.

Defining an attribute in the database as indexable for information retrieval purposes (i.e., by appending a new tuple to the kw_sources definition) creates a rule that appends the class name and attribute name to the kw_index_flags class whenever a new tuple is appended to the class. Another rule then starts the indexing process for the newly appended data. Figure 2 shows this trigger process.

The indexing process extracts each unique keyword from the indexed attributes of the database and stores it along with pointers to its source document and its frequency of occurrence in kw_term_doc_rel. This process is shown in Figure 3. The indexing daemon and the rules system maintain other global frequency information. For example, the overall frequency of occurrence of terms in the database and the total number of indexed items are maintained for retrieval processing. The indexing daemon attempts to perform any outstanding indexing tasks before it shuts down. It also updates the kw_sources tuple for a given indexable class and attribute with a time stamp for the last item indexed (last_tid and last_time). This permits ongoing incremental indexing without having to reindex older tuples.
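The indexing steps described above can be sketched in a few lines. This is an in-memory illustration, not the Lassen daemon: the class name, the tiny stopword list, and the assignment of term identifiers on first sight are assumptions (Lassen looks terms up in the WordNet-based wn_index class), and the document length here counts only non-stopwords. A high-water mark plays the role of last_tid so that a second pass indexes only new documents.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}   # illustrative list only


def tokenize(text: str) -> list:
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


class KeywordIndex:
    def __init__(self):
        self.term_ids = {}       # word -> termid (stands in for wn_index)
        self.term_doc_rel = []   # (termid, docid, termfreq), like kw_term_doc_rel
        self.doc_index = {}      # docid -> (doc_len, doc_ulen), like kw_doc_index
        self.last_docid = 0      # high-water mark for incremental indexing

    def index_new(self, documents: dict) -> int:
        """Index only documents whose docid is above the mark; return how many."""
        added = 0
        for docid in sorted(documents):
            if docid <= self.last_docid:
                continue   # already indexed on an earlier pass
            counts = Counter(tokenize(documents[docid]))
            for word, freq in counts.items():
                termid = self.term_ids.setdefault(word, len(self.term_ids) + 1)
                self.term_doc_rel.append((termid, docid, freq))
            self.doc_index[docid] = (sum(counts.values()), len(counts))
            self.last_docid = docid
            added += 1
        return added
```

Calling index_new repeatedly with a growing document collection touches only the newly added documents, which mirrors the incremental behavior the time stamp makes possible.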

Retrieval 

The prototype version of Lassen provides ranked retrieval of the
documents indexed by the indexing daemon using a probabilistic
retrieval algorithm. This algorithm estimates the probability of
relevance for each document based on statistical information on term
usage in a user's natural language query and in the database. The
algorithm used in the prototype is based on the staged logistic
regression method.8

A POSTGRES user-defined function invokes ranked retrieval processing.
That is, from a user's perspective, ranked retrieval is performed by
a simple function call (kwsearch) in a POSTQUEL query language
statement. Information from the classes created and maintained by the
indexing daemon is used to estimate the probability of relevance for
each indexed document. (Note that the full power of the POSTQUEL
query language can also be used to perform conventional Boolean
retrieval using the classes created by the indexing process and to
combine the results of ranked retrieval with other search criteria.)
Figure 4 shows the process involved in the probabilistic ranked
retrieval from the repository database.

Figure 2
The Lassen Indexing Trigger Process
(Flowchart labels: DO NOTHING; RULE STARTS FUNCTION DAEMON_TRIGGER)

Figure 3
The Lassen Indexing Daemon Process

Figure 4
The Lassen Retrieval Process

The actual query to the Lassen ranked retrieval process consists
simply of a natural language statement of the searcher's interests.
The query goes through the same processing steps as documents in the
indexing process. The individual words of the query are extracted and
located in the wn_index dictionary (after removing common words or
"stopwords"). The termids for matching words from wn_index are then
used to retrieve all the tuples in kw_term_doc_rel that contain the
term. For each unique document identifier in this list of tuples, the
matching kw_doc_index tuple is retrieved. With the frequency
information contained in kw_term_doc_rel and kw_doc_index, the
estimated probability of relevance is calculated for each document
that contains at least one term in common with the query. The
formulae used in the calculation are based on experiments with
full-text retrieval. The basic equation for the probabilistic model
used in Lassen states the following: The probability of the event
that a document is relevant R, given that there is a set of N "clues"
associated with that document, A_i for i = 1, 2, ..., N, is

$$\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right] \qquad (1)$$

and queries (with relevance judgments) from the TIPSTER information
retrieval test collection. The derivation of this formula is given in
"Probabilistic Retrieval Based on Staged Logistic Regression."8 The
full retrieval equation used for the prototype version of retrieval
described in this section is

$$\log O(R \mid A_1, \ldots, A_M) \approx -3.51 + \frac{1}{\sqrt{M} + 1} \left[ 37.4 \sum_{m=1}^{M} X_{m1} + 0.330 \sum_{m=1}^{M} X_{m2} - 0.1937 \sum_{m=1}^{M} X_{m3} \right] + 0.0929\,M \qquad (3)$$

where

$X_{m1}$ is the quotient of the number of times the mth term occurs
in the query and the sum of the total number of terms in the query
plus 35;

$X_{m2}$ is the logarithm of the quotient arrived at by dividing the
number of times the mth term occurs in the document by the sum of the
total number of terms in the document plus 80;

$X_{m3}$ is the logarithm of the quotient arrived at by dividing the
number of times the mth term occurs in the database (i.e., in all
documents) by the total number of terms in the collection;

$M$ is the number of terms held in common by the query and the
document.
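As a rough illustration, Equation 3 and its variable definitions can be sketched in Python as follows. This is a sketch under stated assumptions, not the Lassen implementation: the function and parameter names are hypothetical, term frequencies are passed in as plain dictionaries, and the scaling factor 1/(sqrt(M) + 1) is one reading of a damaged equation in the source text. The conversion from log odds to probability follows the formula shown in Figure 5.

```python
import math


def log_odds_of_relevance(query_tf, query_len, doc_tf, doc_len,
                          coll_tf, coll_len):
    """Estimate log O(R) for one document, following Equation 3.

    query_tf, doc_tf, coll_tf map a term to its frequency in the query,
    the document, and the whole collection; the *_len arguments are the
    corresponding total term counts. Returns None if no term matches.
    """
    common = [t for t in query_tf if doc_tf.get(t, 0) > 0]
    m = len(common)                       # M: terms shared by query and document
    if m == 0:
        return None
    x1 = sum(query_tf[t] / (query_len + 35) for t in common)
    x2 = sum(math.log(doc_tf[t] / (doc_len + 80)) for t in common)
    x3 = sum(math.log(coll_tf[t] / coll_len) for t in common)
    scale = 1.0 / (math.sqrt(m) + 1.0)    # assumed reading of the garbled factor
    return (-3.51
            + scale * (37.4 * x1 + 0.330 * x2 - 0.1937 * x3)
            + 0.0929 * m)


def probability_of_relevance(log_odds):
    """Convert log odds to a probability: P(R) = 1 / (1 + e^(-log O(R)))."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Documents would then be ranked in decreasing order of `probability_of_relevance`; only documents sharing at least one term with the query receive a score.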

Note that the M² term called for in Equation 2 was not found to
provide any significant difference in the results and was omitted
from Equation 3. The constants 35 and 80, which were used in X_{m1}
and X_{m2}, are arbitrary but appear to offer the best results when
set to the average size of a query and the

Figure 5
The Staged Logistic Regression Probabilistic Ranking Process
(Flowchart steps: calculate M, the number of terms in common between
query and document; sum X_{m1}, X_{m2}, and X_{m3} over the matching
terms; calculate the document log odds of relevance, log O(R);
calculate the document probability of relevance,
P(R) = 1 / (1 + e^(-log O(R))).)


and its unique identifier are stored in the kw_query class. To see
the results of the retrieval operation, the query identifier is used
to retrieve the appropriate kw_retrieval tuples, ranked in order
according to the estimated probability of relevance. The kw_retrieval
and kw_query classes have the following POSTQUEL definitions:

create kw_query (
    query_id        = int4,     /* ID number */
    query_user      = char16,   /* POSTGRES user name */
    query_text      = text )    /* the actual query */

create kw_retrieval (
    query_id        = int4,     /* link to the query */
    doc_id          = int4,     /* document ID number */
    rel_oid         = oid,      /* location of doc */
    attr_oid        = oid,
    attr_num        = int2,
    tuple_id        = oid,
    doc_len         = int4,     /* size of document */
    doc_match_terms = int4,     /* number of query terms
                                   in the document */
    doc_prob_rel    = float8 )  /* estimated probability
                                   of relevance */

The algorithm used for ranked retrieval in the Lassen prototype was
tested

geographic position being referenced in the text. The actual index
terms assigned to a document are a set of coordinate polygons that
describe an area on the earth's surface in a standard geographical
projection system. Using coordinates instead of names for the place
or geographic characteristic offers a number of advantages.

• Uniqueness. Place names are not unique, e.g., Venice, California,
  and Venice, Italy, are not apparently different without the
  qualifying larger region to differentiate them. Using coordinates
  removes this ambiguity.

• Immunity to spatial boundary changes. Political boundaries change
  over time, leading to confusion about the precise area being
  referred to. Coordinates do not depend on political boundaries.

• Immunity to name changes. Geographic names change over time, making
  it difficult for a user to retrieve all information that has been
  written about an area during any extended time period. Coordinates
  remove this ambiguity.

• Immunity to spatial, naming, and spelling variation. Names and
  terms vary not only over time but also in contemporary usage.
  Geographic names vary in spelling over time and by language. Areas
  of interest to the user will often be given place names designated
  only in the context of a specific document or project. Such
  variations occur frequently for studies done in oceanic locations.
  Names associated with these studies are unknown to most users.
  Coordinates are not subject to these kinds of verbal variations.

Indexing texts


Figure 6
Screen from the Lassen Geographic Browser
(Screen labels: Latitude (minutes); Display Map; Hide Icons)

necessity, the result of

relationships from the WordNet thesaurus, to include the following
types of terms:7

synonymy
      = synonym
kind-of relationships
    ~ = hyponym (maple is a ~ of tree)
    @ = hypernym (tree is a @ of maple)
part-of relationships
    # = meronym (finger is a # of hand)
    % = holonym (hand is a % of finger)
    & = evidonym (pine is a & of shortleaf pine)

3. Overlay polygons to estimate approximate locations. The objective
   of this step is to combine the evidence accumulated in the
   preceding step and infer a set of polygons that provides a
   reasonable approximation of the geographical locations mentioned
   in the text. Each geophrase, weight, polygon tuple can be
   represented as a three-dimensional "extruded" polygon whose base
   is in the plane of the x- and z-axes and whose height extends upw

Figure 7
Overlaying Polygons to Estimate Approximate Locations

(a) The "weight" of a polygon, indicated by the vertical arrow, is
    interpreted as "thickness."
(b) Two adjacent polygons do not affect each other; each is merely
    assigned its appropriate "thickness."
(c) When one polygon subsumes another, their "thicknesses" in the
    area of overlap are summed.
(d) When two polygons intersect, their "thicknesses" are summed in
    the area of overlap.
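The summing of "thicknesses" described in Figure 7 can be illustrated with a small Python sketch. It is a simplification made for illustration, not the method used in the prototype: axis-aligned rectangles on a raster grid stand in for arbitrary coordinate polygons, and the function name and tuple layout are hypothetical.

```python
def overlay_thickness(weighted_rects, width, height):
    """Sum the weights ("thicknesses") of all rectangles covering each cell.

    Each rectangle is (x0, y0, x1, y1, weight) and covers the half-open
    cell ranges x0 <= x < x1 and y0 <= y < y1. Returns a height-by-width
    grid of accumulated weights.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1, weight in weighted_rects:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] += weight
    return grid
```

Cells covered by one rectangle keep that rectangle's weight, cells covered by two receive the sum, and untouched cells stay at zero, so the highest cells mark the most strongly supported locations.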


Figure 8
Surface Plot Produced from the State Water Project Text

TextTiling: Enhancing Retrieval through 

Automatic Subtopic Identification 

Full-length documents have only recently become available on-line in
large quantities, although technical abstracts, short newswire texts,
and legal documents have been accessible for many years.23 The large
majority of on-line information has been bibliographic (e.g.,
authors, titles, and abstracts) instead of the full text of the
document. For this reason, most information retrieval methods are
better suited for accessing abstracts than for accessing longer
documents. Part of the repository research was an exploration of new
approaches to information retrieval particularly suited to
full-length texts, such as those expected in the Sequoia 2000
database.

A problem with applying traditional information retrieval methods to
full-length text documents is that the structure of full-length
documents is quite different from that of abstracts. (In this paper,
"full-length document" refers to expository text of any length.
Typical examples are a short magazine article and a 50-page technical
report. We exclude documents composed of headlines, short
advertisements, and any other disjointed texts of whatever length. We
also assume that the document does not have detailed orthographically
marked structure. Croft, Krovetz, and Turtle describe work that takes
advantage of this kind of information.24) One way to view an
expository text is as a sequence of subtopics set against a backdrop
of one or two main topics. A long text comprises many different
subtopics that may be related to one another and to the backdrop in
many different ways. The main topics of a text are discussed in its
abstract, if one exists, but subtopics are usually not mentioned.

Therefore, instead of querying against the entire content of a
document, a user should be able to issue a query about a coherent
subpart, or subtopic, of a full-length document, and that subtopic
should be specifiable with respect to the document's main topic(s).

Consider a Discover magazine article about the Magellan space probe's
exploration of Venus.25 A reader divided this 23-paragraph article
into the following segments with the labels shown, where the numbers
indicate paragraph numbers:

 1-2    Intro to Magellan space probe
 3-4    Intro to Venus
 5-7    Lack of craters
 8-11   Evidence of volcanic action
12-15   River Styx
16-18   Crustal spreading
19-21   Recent volcanism
22-23   Future of Magellan

Assume that the topic of volcanic activity is of interest to a user.
Crucial to a system's decision to retrieve this document is the
knowledge that a dense discussion of volcanic activity, rather than a
passing reference, appears. Since volcanism is not one of the text's
two main topics, the number of references to this term will probably
not dominate the statistics of term frequency. On the other hand,
document selection should not necessarily be based on the number of
references to the target terms.

The goal should be to determine whether or not a relevant discussion
of a concept or topic appears. A simple approach to distinguishing
between a true discussion and a passing reference is to determine the
locality of the references. In the computer science operating systems
literature, locality refers to the fact that over time, memory access
patterns tend to concentrate in localized clusters rather than be
distributed evenly throughout memory. Similarly, in full-length
texts, the close proximity of members of a set of references to a
particular concept is a good indicator of topicality. For example,
the term volcanism occurs 5 times in the Magellan article, the first
four instances of which occur in four adjacent paragraphs, along with
accompanying discussion. In contrast, the term scientists, which is
not a valid subtopic, occurs 13 times, distributed somewhat evenly
throughout. By its very nature, a subtopic will not be discussed
throughout an entire text. Similarly, true subtopics are not
indicated by only passing references. The term belly dancer occurs
only once, and its related terms are confined to the one sentence it
appears in. As its usage is only a passing reference, belly dancing
is not a true subtopic of this text.
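The notion of locality can be made concrete with a short Python sketch. It is only an illustration of the idea and is not the authors' method (TextTiling takes a different approach): the window size, the threshold, and the paragraph positions used in the usage note are all hypothetical.

```python
def densest_run(paragraphs_with_term, window=4):
    """Largest number of occurrences falling within `window` consecutive paragraphs.

    `paragraphs_with_term` lists the paragraph number of each occurrence
    of a term (a paragraph number may repeat).
    """
    hits = sorted(paragraphs_with_term)
    best = 0
    start = 0
    for end in range(len(hits)):
        # Shrink the run until it spans fewer than `window` paragraphs.
        while hits[end] - hits[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_like_subtopic(paragraphs_with_term, window=4, min_hits=3):
    """Heuristic: a term marks a subtopic if its references cluster locally."""
    return densest_run(paragraphs_with_term, window) >= min_hits
```

With made-up positions, a term occurring in paragraphs 8, 9, 10, 11, and 20 clusters tightly and passes the check, a term that appears once does not, and a term spread evenly through every other paragraph never accumulates enough references inside one window.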

Our solution to the problem of retaining valid subtopical discussions
while at the same time avoiding being fooled by passing references is
to make use of locality information and to partition documents
according to their subtopical structure. This approach's capacity for
improving a standard information retrieval task has been verified by
information retrieval experiments using full-text test collections
from the TIPSTER database.26,27

One way to get an approximation of the subtopic structure is to break
the document into paragraphs, or for very long documents, sections.
In both cases, this entails using the orthographic marking supplied
by the author to determine topic bound... ...rmance of motivated
segmentation, i.e., segmentation that reflects the text's true
underlying subtopic structure, which often spans paragraph
boundaries.

Toward this end, we have developed TextTiling, a method for
partitioning full-length text documents into coherent multiparagraph
units called tiles. TextTiling approximates the subtopic structure of
a document by using patterns of lexical connectivity to find coherent
subdiscussions. The l

Similarity between blocks is calculated by the following cosine
measure: Given two text blocks $b_1$ and $b_2$,

$$\cos(b_1, b_2) = \frac{\sum_{t=1}^{n} w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_{t=1}^{n} w_{t,b_1}^{2} \; \sum_{t=1}^{n} w_{t,b_2}^{2}}}$$

where $t$ ranges over all the terms in the document, and $w_{t,b_1}$
is the tf.idf weight assigned to term $t$ in block $b_1$.

Thus, if the similarity score between two blocks is high, then not
only do the blocks have terms in common, but the common terms are
relatively rare with respect to the rest of the document. The
evidence in the reverse is not as conclusive. If adjacent blocks have
a low similarity measure, this does not necessarily mean that the
blocks do not cohere. In practice, however, this negative evidence is
often justified.
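The cosine measure between two blocks can be sketched in Python as follows. This is a minimal sketch, assuming each block has already been reduced to a dictionary mapping terms to their tf.idf weights; how those weights are computed is outside the sketch, and the function name is hypothetical.

```python
import math


def cosine(block1, block2):
    """Cosine similarity between two blocks given as {term: weight} dicts.

    Terms missing from a block are treated as having weight zero, so the
    sums effectively range over all terms in the document.
    """
    dot = sum(w * block2.get(t, 0.0) for t, w in block1.items())
    norm = math.sqrt(sum(w * w for w in block1.values()) *
                     sum(w * w for w in block2.values()))
    return dot / norm if norm else 0.0
```

Computing this score for each pair of adjacent blocks yields the sequence of similarity values from which subtopic boundaries are then inferred.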

The graph is then smoothed using a discrete convolution31 of the
similarity function with the function $h_k(\cdot)$, where

$$h_k(i) = \begin{cases} \dfrac{1}{k^{2}}\,(k - |i|), & |i| \le k - 1 \\ 0, & \text{otherwise.} \end{cases}$$

The result is smoothed further with a simple median smoothing
algorithm to elimin


24. W. B. Croft, R. Krovetz, and H. Turtle, "Interactive Retrieval of
    Complex Documents," Information Processing and Management,
    vol. 26, no. 5 (1990): 593-616.

25. A. Chaikin, "Magellan Pierces the Venusian Veil," Discover,
    vol. 13, no. 1 (January 1992).

26. M. Hearst and C. Plaunt, "Subtopic Structuring for Full-Length
    Document Access," Proceedings of the Sixteenth Annual
    International ACM SIGIR Conference on Research and Development in
    Information Retrieval (SIGIR '93), Pittsburgh, June 1993 (New
    York: Association for Computing Machinery, 1993): 59-68.

27. M. Hearst, "Context and Structure in Automated Full-Text
    Information Access," Ph.D. dissertation, Report No.
    UCB/CSD-94/836 (Berkeley, Calif.: University of California,
    Berkeley, Computer Science Division, 1994).

28. M. Hearst, "TextTiling: A Quantitative Approach to Discourse
    Segmentation," Sequoia 2000 Technical Report 93/24 (S2K-93-24)
    (Berkeley, Calif.: University of California, Berkeley, 1993)
    (ftp://s2k-ftp.cs.berkeley.edu/pub/sequoia/tech-reports/s2k-93-24/s2k-93-24.ps).

29. M. Hearst, "Multi-Paragraph Segmentation of Expository Text,"
    Proceedings of the Thirty-second Meeting of the Association for
    Computational Linguistics, Las Cruces, New Mexico, June 1994.

30. G. Salton, Automatic Text Processing: The Transformation,
    Analysis, and Retrieval of Information by Computer (Reading,
    Mass.: Addison-Wesley, 1989).

31. The authors are grateful to Michael Braverman for proving that
    the smoothing algorithm is equivalent to this convolution.

32. L. Rabiner and R. Schafer, Digital Processing of Speech Signals
    (Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1978).

Biographies 

Ray R. Larson 

Ray Larson is an Associate Professor ... rmation Studies). He teaches
courses and conducts research on the design and evaluation of
information retrieval systems. Ray received his Ph.D. from the
University of California. He is a member of the American Society for
Information Science (ASIS), the Association for Computing Machinery
(ACM), the IEEE Computer Society, the American Association for the
Advancement of Science, and the American Library Association. He is
the Associate Editor for ACM Transactions on Informati

Tecate: A Software 

Platform for Browsing 

and Visualizing Data 

from Networked Data 

Sources 

Tecate is a new infrastructure on which applications can be
constructed that allow end users to browse for and then visualize
data within networked data sources. This software platform
capitalizes on the architectural strengths of current scientific
visualization systems, network browsers like Netscape, database
management system front ends, and virtual reality systems.
Applications layered on top of Tecate are able to browse for
information in databases managed by database management systems and
for information contained in the World Wide Web. In addition, Tecate
dynamically crafts user interfaces and interactive visualizations of
selected data sets with the aid of an intelligent system. This system
automatically maps many kinds of data sets into a virtual world that
can be explored directly by end users. In describing these virtual
worlds, Tecate uses an interpretive language that is also capable of
performing arbitrary computations and mediating communications among
different processes.


Peter D. Kochevar 

Leonard R. Wanger 

All people share the need to find and assimilate information. Data
from which information is created is increasingly available
electronically, and that data is becoming more and more accessible
with the proliferation of computer networks. Therefore, the world is
quickly becoming abstracted as a collection of networked data spaces,
where a data space is a data source or repository whose access is
controlled by means of a well-defined software interface. Some
examples of data spac

the tools provided by Tecate can be used to build 

applications of only modest capabilities. 

Historically, Tecate grew out of the Sequoia 2000 project, which was
initiated jointly by Digital Equipment Corporation and the University
of California in 1991. The primary purpose of the Sequoia 2000
project was to develop information systems that would allow earth
scientists to better study global environmental change. Sequoia 2000
participants needed to browse for data sets on which to test
scientific hypotheses and then to interactively visualize the data
sets once found. The data can be quite varied in content and
structure, ranging from text and images to time-varying,
multidimensional, gridded or polyhedral data sets. Such data may
stream from many different sources, e.g., databases managed by a
database management system, a running simulation of some physical
process, or the WWW. Therefore, a tool was required that could
interface to any such source. To be of maximum use, though, the tool
had to be easy to use so that the scientists themselves could make
sophisticated data queries and then experiment with the query results
using a wide variety of data visualization techniques.

Generalizing from its Sequoia 2000 roots, the design of Tecate is
intended to achieve four goals:

1. Interface to general data spaces wherever they may reside.

2. Saliently visualize most kinds of data, e.g., scientific data and
   the listings in a telephone book.

3. Dynamically craft user interfaces and interactive visualizations
   based on what data is selected, who is doing the visualizing, and
   why the user is exploring the data.

4. Allow end users to interact with elements in visualizations as a
   means to query data spaces, to explore alternate ways of
   presenting information, and to make annotations.

There are systems available today that have some of these
capabilities, but no one system possesses all four. Data
visualization systems such as AVS, Khoros, or Data Explorer are
capable of visualizing scientific data; however, they are poor at
interfacing to general data spaces, they provide only limited
interactivity within visualizations themselves, and they require
visualizations to be crafted by hand by knowledgeable end users.1,2,3
Network browsers such as Netscape are good at fetching data from
certain types of data spaces but are limited in the variety of data
they can directly visualize without having to rely on external viewer
programs. Moreover, most network browsers offer a restricted type of
interactivity where only hyperlinks can be followed and text can be
submitted through forms. Finally, front ends to database management
systems provide elaborate querying mechanisms for selecting data from
a database, but they lack a sophisticated means for visualizing and
further exploring query results.

The Tecate architecture borrows from that of visualization systems,
network browsers, and database management systems as well as from
virtual reality systems like Alice and the Minimal Reality
Toolkit/Object Modeling Language (MR/OML).4,5 One major contribution
of the Tecate system is that it incorporates the architectural
strengths of these systems into a coherent whole. In addition, Tecate
possesses at least two novel features that are not found in other
data visualization systems. One feature is Tecate's use of an
interpretive language that can describe three-dimensional (3-D)
virtual worlds. This language is more than a markup language in that
it is capable of performing arbitrary computations and facilitating
communication among different processes. The second novel component
of Tecate is the presence of an expert system that automatically
crafts interactive visualizations of data. This system is intended to
make data space exploration easier to perform by having end users
simply state their goals while leaving the details of implementing a
visualization to attain those goals to the expert system.

The remainder of the paper outlines Tecate's system model and
architecture and then identifies and describes Tecate's major
components. Finally, the paper sketches Tecate's capabilities by
discussing two simple applications that have been implemented on top
of the Tecate software framework. The first application is a tool for
visualizing earth science data residing in a database managed by a
database management system. The second application is a Web browser
that uses 3-D graphics as an underlying browsing paradigm rather than
depending solely on the medium of hypertext.

Tecate's System Model 

After presenting an overview of Tecate's system model, this section
provides details of the object model and the interpretive,
object-oriented language used to describe virtual world objects.

Overview 

From the standpoint of an applications programmer, Tecate is a
distributed, object-oriented system. All major components of Tecate,
as well as entities appearing in virtual worlds created by Tecate,
are objects that communicate with one another by means of message
passing. The main focus within Tecate is on object-object
interactions. These interactions occur primarily when objects send
messages to one another. An object can also send a message to itself,
which has the effect of making a local function call. Unlike with
graphics systems such as Open Inventor, rendering is not a central
activity within Tecate; rather it is just a side effect of
object-object interactions.6 In this sense, Tecate is like virtual
reality programming systems such as Alice and MR/OML, although Tecate
is far more flexible.

In the Tecate system, objects can create and destroy other objects
and can alter the properties of existing objects on-the-fly. Such
capabilities make Tecate very extensible and give it great power and
flexibility. These capabilities can also cause problems for
applications programmers, however, if care is not taken when writing
programs. Presently, all of an object's properties are visible to all
other objects, and hence those properties can be manipulated from
outside the object. In the future, some form of selective property
hiding needs to be added so that designated properties of an object
cannot be altered by other objects.

A powerful feature of Tecate is its ability to dynamically establish
object-subobject relationships. This feature provides a mechanism for
building assemblies of parts similar to the mechanisms in classical
hierarchical graphics systems like Doré or Open Inventor.7 This
feature also provides the capability of creating sets or aggregates
of objects that share some trait, such as being highlighted. Tecate
allows all objects within a set to be treated en masse by providing a
means of selectively broadcasting messages to groups of objects. A
message that is sent to an object can be forwarded to all the
object's subobjects. Thus, for example, one object can serve as a
container for all other objects that are highlighted; the highlighted
objects are merely subobjects of the container. To unhighlight all
highlighted objects, a single unhighlight message can be sent to the
container object, which then forwards the message to all its
subobjects. In general, an object can be the subobject of any number
of other objects and thus simultaneously be a member of many
different sets.
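The container pattern described above can be sketched in Python. This is a loose analogy rather than Tecate or AVL code: the class names, the `on_` naming convention for behaviors, and the `forward` flag are all hypothetical, and the sketch assumes the object-subobject graph contains no cycles.

```python
class WorldObject:
    """Toy object with properties, behaviors, and a list of subobjects."""

    def __init__(self, name):
        self.name = name
        self.properties = {}
        self.subobjects = []

    def add_subobject(self, obj):
        # An object may be a subobject of any number of containers.
        if obj not in self.subobjects:
            self.subobjects.append(obj)

    def send(self, message, forward=False):
        """Invoke the behavior named by `message`, optionally forwarding it."""
        handler = getattr(self, "on_" + message, None)
        if handler is not None:
            handler()
        if forward:
            for sub in list(self.subobjects):
                sub.send(message, forward=True)


class Icon(WorldObject):
    def on_highlight(self):
        self.properties["highlighted"] = True

    def on_unhighlight(self):
        self.properties["highlighted"] = False
```

A container that has no `unhighlight` behavior of its own simply passes the message along, so one `send("unhighlight", forward=True)` to the container reaches every icon it holds, while each icon remains free to belong to other sets as well.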

The handling of user input within Tecate is intended to appear the
same as ordinary object-object interactions. All physical input
devices that are known to Tecate have an agent object associated with
them that acts as a device handler. All objects that wish to be
informed of a particular input event register with the appropriate
agent. When an input event occurs, the agent sends all registered
objects a message notifying them of the event. Complex events, such
as the occurrence of event A and event B within a specified time
period, can easily be defined by creating new handler objects. These
handlers register to be informed of separate events but then, in
turn, inform other objects of the events' conjunction.
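The agent and compound-handler arrangement can be sketched in Python as follows. This is an illustrative sketch, not Tecate's event mechanism: the class names, the `notify` method, and the use of plain numbers as time stamps are assumptions made for the example.

```python
class Agent:
    """Device handler: notifies every registered object of an input event."""

    def __init__(self):
        self.registered = []

    def register(self, obj):
        self.registered.append(obj)

    def fire(self, event, time):
        for obj in self.registered:
            obj.notify(event, time)


class Conjunction(Agent):
    """Handler that reports when two events occur within `period` of each other."""

    def __init__(self, event_a, event_b, period):
        super().__init__()
        self.names = (event_a, event_b)
        self.period = period
        self.last_seen = {}          # event name -> time it last occurred

    def notify(self, event, time):
        self.last_seen[event] = time
        a, b = (self.last_seen.get(name) for name in self.names)
        if a is not None and b is not None and abs(a - b) <= self.period:
            self.fire(self.names, time)   # inform objects of the conjunction


class Recorder:
    """Example listener that just remembers what it was told."""

    def __init__(self):
        self.events = []

    def notify(self, event, time):
        self.events.append((event, time))
```

The compound handler is itself an agent, so objects interested in the combined event register with it exactly as they would with a device agent.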

The Object Model 

Tecate uses an object model in which no distinction is made between
classes and instances, as is done in languages like C++.8 In Tecate,
there is a single object creation operation called cloning. Any
object in the system can serve as a prototype from which a copy can
be made through the clone operation. A clone inherits


compute-intensive behaviors or whose behavior executions are
time-critical are generally implemented as resource objects. For
instance, most Tecate objects that provide system services, such as
rendering or database management, are implemented as resource
objects.

Objects populating virtual worlds that represent data features are
implemented differently than resource objects by using an
interpretive programming language called the Abstract Visualization
Language (AVL). Such objects are called dynamic objects because they
may be created, destroyed, and altered on-the-fly as a Tecate session
unfolds. Nonetheless, the ability to dynamically add, remove, and
alter object properties is not solely endemic to dynamic objects.
Resource properties may also be changed on-the-fly because resources
are actually implemented with a dynamic object that interfaces to the
portion of the resource that is implemented as an external process.

The Abstract Visualization Language 

AVL is essential to the Tecate system; it is through AVL that
applications programmers write applications that use Tecate's
features.9 AVL is an interpretive, object-oriented programming
language that is capable of performing arbitrary computations and
facilitating communication among different processes. Through this
language, applications programmers specify and manipulate object
properties and invoke object behaviors by sending messages from one
object to another.

AVL is a typeless language that manipulates character strings; it is
based on the Tcl embeddable command language.10 AVL extends Tcl by
adding object-oriented programming support, 3-D graphics, and a more
sophisticated event-handling mechanism. Although AVL is a proper
superset of Tcl, the relationship between AVL and Tcl is much like
that between C and C++. Because AVL adds a small set of new
constructs to Tcl, the way applications programmers structure AVL
programs differs markedly from how they structure Tcl programs, just
as the C++ language extensions to C greatly alter the C programming
style.

One use of AVL is to describe virtual worlds that represent data
sets. Through AVL, objects that populate these worlds can be assigned
behaviors that are elicited through user interaction. For instance,
selecting a 3-D icon can cause a Universal Resource Locator (URL) to
be followed out into the WWW. In this sense, AVL is somewhat like the
Hypertext Markup Language (HTML) that underlies all Web browsers
today, or, more fitting, it is similar to the Virtual Reality
Modeling Language (VRML) that has been proposed as a 3-D analog of
HTML.11 AVL does, however, differ markedly from HTML and VRML, which
are only markup languages. Because AVL is a full-fledged programming
language that has sophisticated interaction handling built in, it is
philosophically more similar to interpretive languages like
Telescript, NewtonScript, and PostScript.12,13,14 Like Telescript,
for instance, AVL programs can encode "smart agents" that can be sent
across a network to perform user tasks at a remote machine, if an AVL
interpreter resides there. Note, however, that in the present version
of Tecate, there is no notion of security when arbitrary AVL code
runs on a remote machine.

AVL includes some additional commands that augment the Tcl instruction set, for instance, clone and delete. The clone command is the object creation command within AVL, and the delete command is the complementary operation to delete objects from the system. Object properties are specified and manipulated using the add command and deleted using the remove command. Behaviors in one object are initiated by another object using the send command, which specifies the behavior to invoke and the arguments to be passed. Queries about object properties can be made using the inquire command. The which command is used to determine where an object's properties are actually defined in light of Tecate's use of delegation to resolve property references. Finally, AVL provides a rich set of matrix and vector operators that are useful when positioning objects within 3-D scenes.
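The interplay of clone, add, remove, inquire, which, and send can be illustrated with a small delegation model. The following Python sketch is only an illustration under stated assumptions: the class ProtoObject and its method signatures are hypothetical and are not part of AVL or Tecate; they merely mimic the idea that a clone resolves property references by delegating to its prototype.

```python
class ProtoObject:
    """Hypothetical stand-in for a Tecate object; not the AVL implementation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # prototype this object was cloned from
        self.props = {}        # properties defined locally on this object

    def clone(self, name):
        # A clone starts empty and delegates unresolved lookups to its prototype.
        return ProtoObject(name, parent=self)

    def add(self, key, value):
        self.props[key] = value

    def remove(self, key):
        self.props.pop(key, None)

    def which(self, key):
        # Walk the prototype chain and report where the property is defined.
        obj = self
        while obj is not None:
            if key in obj.props:
                return obj
            obj = obj.parent
        return None

    def inquire(self, key):
        owner = self.which(key)
        if owner is None:
            raise KeyError(key)
        return owner.props[key]

    def send(self, behavior, *args):
        # In this sketch a behavior is simply a callable property.
        return self.inquire(behavior)(self, *args)


visual = ProtoObject("Visual")
visual.add("shape", "sphere")
visual.add("describe", lambda self: f"{self.name} is a {self.inquire('shape')}")

hlink = visual.clone("hlink")
hlink.add("shape", "box")  # overrides the inherited value locally
```

In this model, removing the local shape from hlink makes the inherited value visible again, which is the effect delegation is meant to provide.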

As an example of how AVL is used in practice, Figure 1 depicts a code fragment similar to one that appears in the WWW application described later in the paper. The code fragment creates a 3-D Web site icon that is positioned on a world map. The code begins with the definition of the Hyperlink object from which all Web site icons are cloned. The Hyperlink object is itself cloned from the Visual object that is predefined by Tecate at system start-up. The Visual object contains properties that relate to the viewing of objects within scenes. For instance, objects that are cloned from the Visual object inherit behaviors to rotate themselves and to change their color. To the properties that are inherited from the Visual object, the Hyperlink object adds the state variables url and desc, which will be used to store respectively a URL and its textual description. In addition, objects cloned from the Hyperlink object will inherit the default appearance of a solid blue sphere having unit radius.

The specification for the Hyperlink object also defines three behaviors: init, openUrl, and showDesc. The init behavior replaces the init method inherited from the Visual object. When an object cloned from the Hyperlink object receives an init message, it sets its url and desc state variables, positions itself within the scene whose name is given by the argument scene, and registers itself with the mouse handler agent to receive two events. When mouse button 1 is depressed, the agent sends the object the openUrl message, which in turn requests the WWW Interface to fetch the data pointed to by the object's URL. Depressing button 2 invokes the showDesc message, causing the Web site URL and description to be displayed by a previously

Digital Technical Journal Vol. 7 No. 3 1995


# Define a prototype for all Web icons
clone Hyperlink Visual
add Hyperlink {
    state {
        url ""
        desc ""
    }
    appearance {
        shape {sphere}
        diffuseColor {0.0 0.0 1.0}
        repType {surface}
    }
    behavior {
        # Initialize hyperlink
        init {url desc pos scene window} {
            addstate url $url
            addstate desc $desc
            send [getself] move "add $pos"
            add $scene "subobject [getself]"
            send $window addEvent "[getself] {Button-1 {openUrl {}}} {Button-2 {showDesc {}}}"
        }
        # Open the URL
        openUrl {} {send www fetch "[getstate url]"}
        # Display the description
        showDesc {} {send metaViewer display "[getstate desc]"}
    }
}

# Initialize an informational landscape
clone scene Visual
clone window Viewer
send window init {scene}

# Create a Web site icon
clone hlink Hyperlink
send hlink init {"http://www.sdsc.edu/Home.html"
    "SDSC home page" "-2.3 -2.0 1.0" scene window}

# Use the SDSC model geometry
add hlink {appearance {shape {box}}}

Figure 1 

An Implementation of a World Wide Web Icon in the Abstract Visualization Language 

defined interface widget called the metaViewer. The AVL command getself, which is used within the init behavior body, returns the name of the object on which the behavior was called, thus allowing applications programmers to write generic behaviors. The other AVL commands, getstate and addstate, are shorthand for "get [getself] state ..." and "add [getself] {state ...}."

Once the Hyperlink object is defined, a scene, a display window, and a Web site icon are created. The Tecate scene object is cloned from the Visual object. The window object, cloned from the predefined Viewer object, is the viewport into which the scene is to be rendered. Finally, hlink is a Web site icon whose appearance differs from that which is inherited from the Hyperlink object. Rather than being spherical, the shape of the hlink icon is a unit cube.


Tecate's Architecture 

The general structure of Tecate and how it relates to application programs is depicted in Figure 2. Tecate consists of a kernel, a set of basic system services, and a toolkit of predefined objects. The Tecate kernel, which is shown in Figure 3, is an object management system called the Abstract Visualization Machine; AVL is its native language. The Abstract Visualization Machine is responsible for creating, destroying, altering, rendering, and mediating communication between objects. The two major components of the Abstract Visualization Machine are the Object Manager and the Rendering Engine.

The Object Manager is the primary component of the Abstract Visualization Machine. It is responsible for interpreting AVL programs, managing a database of objects, mediating communication between objects, and interfacing with input devices. The Object Manager is itself a resource object that is distinguished by the fact that all other resource objects are spawned from this one object. In addition, the Object Manager is responsible for creating a distinguished dynamic object, called Root, from which all other dynamic objects can trace their heritage through prototype-clone relationships.

[Figure 2 diagram labels: APPLICATIONS, TECATE]

Figure 2

The Tecate System and Its Relationship to Application Programs

The Object Manager is implemented on a simple, custom-built thread package. Each object within Tecate can be thought of as a process that has its own thread of control. Each thread can be implemented either as a lightweight process that shares the same machine context as the Object Manager's operating system process or as its own operating system process separate from that of the Object Manager. Lightweight processes are so named because their use requires little system overhead, which enables thousands of such processes to be active at any given time. Within Tecate, dynamic objects are implemented as lightweight processes, whereas resource objects are implemented as heavyweight operating system processes, which may or may not be paired with a lightweight, adjunct process. A low-level function library is provided to handle the creation and destruction of threads and to handle interthread communication regardless of how the threads are implemented.

[Figure 3 diagram labels: INTELLIGENT VISUALIZATION SYSTEM, BIGRIVER, RENDERING ENGINE]

Closely allied with the Object Manager is the Rendering Engine, which is a special resource object wholly contained within the Abstract Visualization Machine. The Rendering Engine is responsible for creating a graphical rendition of a virtual world that is specified by AVL programs interpreted by the Object Manager. When interpreting an AVL program, the Object Manager strips off appearance attributes of objects and sends appropriate messages to the Rendering Engine so that it can maintain a separate display list that represents a virtual world. Display lists are represented as directed, acyclic graphs whose connectivity is determined by object-subobject relationships that are specified within AVL programs.
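As a rough illustration of a display list organized as a directed, acyclic graph, the Python sketch below walks object-subobject relationships depth first. It is a minimal sketch, not the Rendering Engine's actual data structure: the function render_order, the dictionary layout, and the choice to emit a shared subobject once per parent are all assumptions made for the example. The object names are borrowed from the car demo that appears later in the paper.

```python
def render_order(subobjects, root):
    """Return object names in depth-first order starting at root.

    subobjects maps an object name to the list of its subobject names.
    A shared subobject is emitted once per parent (an assumption of this
    sketch). Raises ValueError if the relationships contain a cycle.
    """
    order = []

    def visit(name, path):
        if name in path:
            raise ValueError(f"cycle through {name}")
        order.append(name)
        for child in subobjects.get(name, []):
            visit(child, path | {name})

    visit(root, frozenset())
    return order


scene = {
    "scene": ["car"],
    "car": ["car_body", "wheels"],
    "wheels": ["back_right", "back_left", "front_left", "front_right"],
}
order = render_order(scene, "scene")
```

The cycle check reflects the acyclic requirement stated above: a relationship that made an object its own descendant would be rejected rather than traversed forever.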

The present Rendering Engine implementation uses the Doré graphics package running on a DEC 3000 Model 500 workstation.7 The display lists that are created by invoking behaviors within the Rendering Engine are actually built up and maintained through Doré. The set of messages that the Rendering Engine responds to represents an interface to a platform's graphics hardware that is independent of both the graphics package and the display device.

Layered on top of the Abstract Visualization Machine are Tecate's system services and the object toolkit. The system services consist of a collection of resource objects that are automatically instantiated at system start-up. These resources include an expert system called the Intelligent Visualization System, the Database Interface, the WWW Interface, and a visualization programming system called BigRiver. Figure 3 shows these resources in relationship to Tecate's kernel. Each resource is a Tecate object that has a number of predefined behaviors that can be useful to applications programmers. For instance, the WWW Interface has a behavior that fetches a data file referred to by a URL and then translates the file's contents into an appropriate AVL program.

[Figure 3 diagram labels: OBJECT MANAGER, ABSTRACT VISUALIZATION MACHINE, DATABASE INTERFACE, WWW INTERFACE]

Figure 3

Detail of Tecate's Kernel (the Abstract Visualization Machine) and the System Services Provided by Tecate

The toolkit within Tecate is a set of predefined dynamic objects that programmers can use to develop applications. These objects are considered abstract objects in the sense that they are not intended to be used directly. Rather, they serve as prototypes from which clones can be created. The toolkit consists of objects such as viewports, lights, and cameras that are used to illuminate and render virtual worlds. The toolkit also contains a modest collection of 3-D user interface widgets that can be used within virtual worlds created by an applications programmer. These widgets include sliders, menus, icons, legends, and coordinate axes.

One useful object in the toolkit that aids in simulating physical processes and helps in performing animations is a clock. This object is an event generator that signals every clock tick. If objects wish to be informed of a clock pulse, those objects register themselves with the clock object just like objects register themselves with input device agent objects. The default clock object can be cloned, and each clone can be instantiated with a different clock period down to a resolution of one millisecond. Any number of clocks can be ticking simultaneously during a Tecate session. Since new clocks can be created dynamically, and objects can register and unregister to be informed of clock pulses on-the-fly, clocks can be used as timers and triggers, and as pacesetters.
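The register and unregister protocol described for the clock resembles a conventional observer arrangement. The Python sketch below illustrates that idea only; the classes Clock and OneShotTimer, their method names, and the manual tick call are hypothetical and do not reproduce Tecate's clock object, which delivers ticks as messages in real time.

```python
class Clock:
    """Hypothetical event generator; ticks are driven manually for clarity."""

    def __init__(self, period_ms=1):
        if period_ms < 1:
            raise ValueError("period is below the one-millisecond resolution")
        self.period_ms = period_ms
        self.listeners = []
        self.elapsed_ms = 0

    def register(self, listener):
        if listener not in self.listeners:
            self.listeners.append(listener)

    def unregister(self, listener):
        if listener in self.listeners:
            self.listeners.remove(listener)

    def tick(self):
        self.elapsed_ms += self.period_ms
        # Iterate over a copy because listeners may unregister themselves.
        for listener in list(self.listeners):
            listener(self)


class OneShotTimer:
    """Counts ticks and unregisters itself once the limit is reached."""

    def __init__(self, limit):
        self.count = 0
        self.limit = limit

    def __call__(self, clock):
        self.count += 1
        if self.count >= self.limit:
            clock.unregister(self)


clock = Clock(period_ms=50)
timer = OneShotTimer(limit=3)
clock.register(timer)
for _ in range(5):
    clock.tick()
```

Because the timer removes itself after its third tick, the last two ticks reach no listener, which mirrors the use of clocks as timers and triggers.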

Application Resources 

Tecate's system services are predefined application resources that aid in interactively visualizing data. As mentioned previously, these objects include the Intelligent Visualization System, the Database Interface, the WWW Interface, and the BigRiver visualization programming system. In addition, an applications programmer can easily add new application resources using tools provided with the base Tecate system. Such new resources can be built around either user-written programs or commercial off-the-shelf applications.

To create a new application resource, a programmer needs to provide a set of functions that can be invoked by other Tecate objects. These functions correspond to behaviors that are called when the resource receives a message from other objects. Tools are provided to register the behaviors with Tecate and to manage the communication between a resource and other Tecate objects.
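The mapping from registered functions to behaviors can be pictured as a dispatch table. The following Python sketch is an assumption-laden illustration: the Resource class, its register and receive methods, and the example "mean" behavior are invented here and are not the registration tools that ship with Tecate.

```python
class Resource:
    """Hypothetical application resource that maps behavior names to functions."""

    def __init__(self, name):
        self.name = name
        self._behaviors = {}

    def register(self, behavior, func):
        # Associate a behavior name with the function that implements it.
        self._behaviors[behavior] = func

    def receive(self, behavior, *args):
        # Called when another object sends this resource a message.
        if behavior not in self._behaviors:
            raise LookupError(f"{self.name} has no behavior {behavior!r}")
        return self._behaviors[behavior](*args)


stats = Resource("stats")
stats.register("mean", lambda values: sum(values) / len(values))
result = stats.receive("mean", [1, 2, 3])
```

A resource built around an existing program would register thin wrapper functions in the same way, so other objects only ever see behavior names.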


models, user tasks, and visualization techniques. The Planner functions by constructing a sentence within a dataflow language defined by a context-sensitive graph grammar. At each step in the construction of the sentence, rules in the knowledge base dictate which productions in the grammar are to be applied and when. Presently, the knowledge base is implemented using the Classic knowledge representation system; the Planner is implemented in CLOS.

BigRiver 

The dataflow program produced by the Intelligent Visualization System is written in a scripting language that is interpreted by BigRiver, a visualization programming system similar to AVS and Khoros. From a technical standpoint, BigRiver is not particularly innovative and will eventually be reimplemented using some existing visualization system that has more functionality. The reason that BigRiver was created from scratch was to better understand how existing visualization programming systems work and to overcome limitations within those systems. These limitations are their inability to be embedded within other applications, their lack of comprehensive data models, and their inability to work with user-supplied renderers. The latest generation of visualization programming systems, such as Data Explorer and AVS/Express, overcome many of these limitations.

Like most of the existing visualization systems, BigRiver consists of a collection of procedures called modules, each of which has a well-defined set of inputs and outputs. Functional specifications for these modules represent some of the knowledge contained in the Intelligent Visualization System's knowledge base. Visualization scripts that are interpreted by BigRiver specify module parameter values and dictate how the outputs of chosen modules are to be channeled into the inputs of others.
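A script that sets module parameters and wires outputs to inputs can be modeled as a small dataflow evaluator. The Python sketch below is a generic illustration, not BigRiver's scripting language: the function run_dataflow, the dictionary-based wiring format, and the three toy modules are assumptions, and the sketch presumes the wiring is acyclic.

```python
def run_dataflow(modules, wiring, params):
    """Evaluate each module once, after all of its upstream modules.

    modules: name -> function(inputs, **params) returning the module output
    wiring:  name -> {input port name: upstream module name}
    params:  name -> keyword parameters for that module
    Assumes the wiring contains no cycles.
    """
    results = {}

    def evaluate(name):
        if name not in results:
            inputs = {port: evaluate(src)
                      for port, src in wiring.get(name, {}).items()}
            results[name] = modules[name](inputs, **params.get(name, {}))
        return results[name]

    for name in modules:
        evaluate(name)
    return results


# Toy modules: a reader, a data manipulator, and a filter.
modules = {
    "read": lambda inputs, values: list(values),
    "scale": lambda inputs, factor: [factor * v for v in inputs["data"]],
    "threshold": lambda inputs, level: [v for v in inputs["data"] if v >= level],
}
wiring = {"scale": {"data": "read"}, "threshold": {"data": "scale"}}
params = {
    "read": {"values": [1, 2, 3]},
    "scale": {"factor": 10},
    "threshold": {"level": 20},
}
results = run_dataflow(modules, wiring, params)
```

Keeping parameters and wiring as data, separate from the module functions, is what lets a planner generate or edit the script without touching module code.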

BigRiver modules come in three varieties: I/O, data manipulators, and glyph generators. All modules use self-describing data formats based on fiber bundles. One format is used for manipulation within memory; the other is an on-the-wire encoding intended for transporting data across a network. An input module is responsible for converting data stored in the on-the-wire encoding into the in-memory format. The data manipulator modules transform fiber bundles of one in-memory format into those of another. The glyph generators take as input fiber bundles in the in-memory format and produce AVL programs that when executed build virtual worlds containing objects that represent features of selected data sets. A single display module takes as input AVL code and passes it to the Abstract Visualization Machine. By means of the Rendering Engine, the Abstract Visualization Machine uses the appearance attributes of objects to create an image of a virtual world that contains the objects.

The Database Interface 

The Database Interface provides the means to interact with a database management system, which in the current version of Tecate can be either POSTGRES or Illustra. Database queries, written in POSTQUEL for POSTGRES-managed databases or in SQL for Illustra databases, are sent to the Database Interface by Tecate objects where they are passed to a database management system server for execution. The server returns the query results to the Database Interface, which then attempts to package them up as an on-the-wire encoding of a fiber bundle buffered on local disk. If the result is a set of tuples in the standard format returned by POSTGRES or Illustra, the Database Interface performs the fiber bundle translation. For most other nonstandard results, the so-called binary large objects (BLOBs) of the database realm, the Database Interface cannot yet arbitrarily perform the translation into the on-the-wire fiber bundle encoding. The only BLOBs that the Database Interface can deal with presently are those that are already encoded as on-the-wire fiber bundles. The difficult problem of automated data format translation was not addressed during Tecate's initial development, although the intent is to address this issue in the future.

Once query results are buffered on disk, a description of the fiber bundle and the location of the buffer are sent back to the object that made the query request of the Database Interface. That object might then request the Intelligent Visualization System to structure a virtual world whose image would appear on the display screen by way of BigRiver and the Rendering Engine. Objects in the virtual world can be given behaviors that are elicited by user interactions. These behaviors might then result in further database queries and so on. Chains of events such as these provide a means for browsing databases through direct manipulation of objects within a virtual world.

The World Wide Web Interface 

The WWW Interface functions similarly to the Database Interface, but instead of accessing data in a database, the WWW Interface provides access to data stored on the World Wide Web. Messages that contain URLs are passed to the WWW Interface, which then fetches the data pointed to by the URLs. In retrieving data from the Web, the WWW Interface uses the same CERN software libraries used by Web browsers like Netscape.

Once a data file is fetched, the WWW Interface attempts to translate its contents into an AVL program, which is then passed to the Object Manager for interpretation. AVL either specifies the creation of a new virtual world that represents the data file's contents or specifies new objects that are to populate the current world being viewed. If the fetched data file contains a stream of AVL code, the WWW Interface merely forwards the file to the Object Manager. If the file contains general data in the form of an on-the-wire encoding of a fiber bundle, the WWW Interface appeals to the Intelligent Visualization System to structure an appropriate virtual world. If the data file contains a stream of HTML code, the WWW Interface invokes an internal translator that translates HTML code into an equivalent AVL program, which is then interpreted by the Object Manager. This interpreter actually understands an extended version of HTML that supports the direct embedding of AVL within HTML documents. Through this mechanism, 3-D objects with which users can interact can be embedded directly into a hypertext Web page, something that few if any other Web browsers can do today.
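The three cases described above amount to a dispatch on the kind of data fetched. The Python sketch below restates that routing under explicit assumptions: the function route_fetched_data, the string tags "avl", "fiber_bundle", and "html", and the destination names are hypothetical, and how the real WWW Interface recognizes each kind of file is not stated in the text.

```python
def route_fetched_data(kind, payload, translate_html):
    """Return (destination, payload) for a fetched data file.

    kind is a hypothetical tag for the file contents; translate_html is a
    caller-supplied function standing in for the internal HTML-to-AVL translator.
    """
    if kind == "avl":
        # AVL code is forwarded unchanged to the Object Manager.
        return ("object_manager", payload)
    if kind == "fiber_bundle":
        # General data is handed to the planner to structure a virtual world.
        return ("intelligent_visualization_system", payload)
    if kind == "html":
        # HTML is translated to AVL first, then interpreted.
        return ("object_manager", translate_html(payload))
    raise ValueError(f"unsupported data kind: {kind}")


def fake_translator(html):
    # Placeholder translator used only to exercise the routing logic.
    return "# AVL generated from: " + html


routed = route_fetched_data("html", "<h1>demo</h1>", fake_translator)
```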

Example Applications 

Applications that browse the contents of data spaces and then interactively visualize selected results have the same overall structure. One browser application component acts as a data space interface, and through this interface queries are posed, query results are imported into the application, and data generated by the application is stored back into a data space. Once data has been imported into the application, a second component must map the data into some appropriate virtual world. Finally, a third component must manage any interactions that may take place between an end user and elements that populate the virtual worlds that are created.

In creating an application using Tecate, the Database Interface and the WWW Interface represent resources that can be used to form the application's data space interface. The mapping of data into a representative virtual world can utilize Tecate's Intelligent Visualization System and the BigRiver visualization programming system. Finally, the management of these worlds can take place through AVL programs that exercise the features of Tecate's Abstract Visualization Machine. The following two examples that were implemented in AVL illustrate how Tecate can be used to create applications that browse data spaces.

[Figure 4 diagram labels: INTELLIGENT VISUALIZATION SYSTEM, BIGRIVER, OBJECT MANAGER, PLAN VISUALIZATION, EXECUTE SCRIPT, INTERPRET ABSTRACT VISUALIZATION LANGUAGE, SET VISUALIZATION PARAMETERS]

Figure 4

Message Flow between Important Objects in the Earth Science Application

Visualizing Data in a Database 

A simple example of an application that exploits Tecate's features is one that browses for earth science data in a database and then provides visualizations of that data. The initial user interface for this application is built using a collection of user interface widgets, where each widget is a Tecate dynamic object. Because the Tecate system does not yet have a comprehensive 3-D widget set, some widgets still rely on two-dimensional (2-D) constructs provided by the Tk widget set that is implemented on top of the Tcl language.

Figure 4 depicts the flow of messages between some of the more important objects that are used within the application. One object is the Map Query Tool that is used to make certain graphical queries for earth science data sets whose geographical extents and time stamps fall within user-specified constraints. The tool is built around a world map on which regions of interest can be specified (see Figure 5). When a user marks a region of interest on the map and selects a temporal range, a query message is sent to the Database Interface. The result of the query is returned to the Map Query Tool, which then forwards a description of the result to the Intelligent Visualization System. To structure an appropriate visualization, an inferred select task directive accompanies the result. The ensuing script produced by the Intelligent Visualization System is executed by BigRiver, which produces a stream of AVL code that is sent to the Abstract Visualization Machine for interpretation.

[Figure 4 diagram labels, continued: QUERY, QUERY RESULT, DATABASE INTERFACE. Key: dynamic object, resource object, message flow]

Figure 5 

The Map Query Tool Showing a Visualization of a Query Result

This AVL program creates a new virtual world that consists of a collection of 3-D objects. Each object acts as an icon that corresponds to one data set that was returned as the result of the initial query (see Figure 5). The Intelligent Visualization System also builds in two behaviors for each icon. Depending on how a user selects an icon, either the metadata associated with the data set represented by the icon is displayed in a separate window or a query message is sent to the Database Interface requesting the actual data. In the latter case, the Map Query Tool again forwards the query result to the Intelligent Visualization System, and another virtual world containing objects representing data features is created and displayed with the aid of BigRiver and the Abstract Visualization Machine. In general, data exploration proceeds this way by creating and discarding virtual worlds based on interactions with objects that populate prior worlds.

After selecting an icon to actually view the data associated with it, an end user is asked by the Intelligent Visualization System to input a task specification using a Task Editor. Generally, data sets can be visualized in many different ways. The Intelligent Visualization System uses the task specification to select the one visualization that best satisfies the stated task. After a task specification is entered, a visualization of the selected data set appears on the screen. The BigRiver dataflow program that the Intelligent Visualization System creates to do that visualization can be edited by hand by knowledgeable end users to override the decisions made by the system.

Figure 6 shows a Task Editor and a visualization crafted by the Intelligent Visualization System after an end user selected a data-set icon. The visualization represents hydrological data that consists of a collection of tuples, each corresponding to a set of measurements made at discrete geographical locations. Based on the task specification that the end user entered, the Intelligent Visualization System chose to map the data into a coordinate system that has axes that represent latitude, longitude, and elevation. Each sphere represents an individual measurement site, whose color is a function of the mean temperature. When an end user selects a sphere, the actual data values associated with the location represented by the sphere are displayed. In addition, the Intelligent Visualization System automatically places into the virtual world of the visualization a color legend to help relate sphere colors to mean temperature values.

Figure 6

Task Editor Showing a Visualization of Hydrological Data

Figure 7 depicts another virtual world showing a visualization of data set output from a regional climate model program. The data set is a 3-D array indexed by latitude, longitude, and elevation. Each array element is a tuple that contains cloud density, water content, and temperature values. In this instance, the end user entered a task specification that stated that the spatial variation in temperature was of primary importance. The Intelligent Visualization System responded by specifying a visualization that represented the temperature data as an isosurface, i.e., a surface whose points all have the same value for the temperature. Included in the virtual world is a widget that can be used to change the isosurface value and the field variable that is being studied.

The isosurface widget that appears in the visualization shown in Figure 7 is of special interest because of the way that it is implemented. Embedded in the tool is a slider that is used to change the isosurface value. As with most sliders, the slider value indicator automatically moves when a mouse button is held down while pointing at one of the slider ends. To achieve this simple animation, Tecate's clock object is used. When the mouse button is first depressed while the cursor is over a slider end, the slider indicator registers itself to be informed of clock ticks. From then on, at every clock tick, the indicator receives an update message from the clock, at which time the indicator repositions itself and increments or decrements the current slider value. When the mouse button is released, the slider sends a message to BigRiver indicating that a new isosurface is to be calculated and displayed. In addition, the slider indicator unregisters itself from the clock, signaling that it no longer is to receive the update messages. In general, applications can use this same clock mechanism to perform more elaborate animations.

A 3-D World Wide Web Browser 

In the Tecate Web browser, exploration of the World Wide Web and its contents occurs by placing an end user onto an informational landscape. This landscape is a 3-D virtual world whose appearance reflects the content and the structure of a designated subset of the entire Web. Upon application start-up, an end user is presented with an initial informational landscape that consists of a planar map of the earth embedded in a 3-D space, as shown in Figure 8. In general, the initial informational landscape can be any 3-D scene and does not have to be geographically based. For instance, an informational landscape might be a virtual library where books on shelves serve as anchors for hyperlinks to different Web sites.

Figure 7

Task Editor Showing a Visualization of Regional Climate Data, Including an Isosurface and a User Interface Widget

In the present browser application, selected Web sites appear as 3-D icons on the world map. These icons are positioned either in locations where Web servers physically reside or in locations referenced within Web documents (see Figure 8). A user places information that describes these sites into a database that serves as an elaboration of the hot list of current hypertext-based browsers. When the browser application is first started, it sends a query for the initial complement of Web sites to the Database Interface. The browser application then invokes a BigRiver script that visualizes the results by placing icons representing each site onto the world map.

Suspended above the world map is a 3-D user interface widget that is used to query a database of Web sites that are of interest to an end user (see Figure 8). This database, where the initial set of Web sites is stored, includes information such as URLs, keywords, geographical locations, and Web site types. Currently, individual users are responsible for maintaining their own databases by adding or removing Web site entries by hand. An automated means for building these databases can be easily added to the browser application so that Web information could be accumulated based on where and when an end user travels on the Web.

Figure 8

Tecate Web Browser Informational Landscape Showing WWW Sites Depicted as 3-D Icons on a Map of the World

During a browsing session, the Web Query Tool allows arbitrary SQL queries to be posed to the database by an end user. In addition, the Web Query Tool has provisions to allow packaged queries to be initiated by a simple click of a mouse button. In both cases, queries are sent to the Dat

[Figure 10 diagram labels: WEB QUERY TOOL, DATA SET ICON, BIGRIVER, OBJECT MANAGER, DATABASE INTERFACE, WWW INTERFACE, QUERY, QUERY RESULT, FETCH URL, FETCH RESULT, EXECUTE SCRIPT, INTERPRET ABSTRACT VISUALIZATION LANGUAGE. Key: dynamic object, resource object, message flow]

Figure 10

Message Flow between Important Objects in the Web Browser

Figure 11

Results of a Tecate Browsing Session Showing a Hyperlink and a Forest of Pyramids That Represents the User's Travels on the Web


are cloned from the same Hyperlink prototype as the Web site icons. If another HTML document is retrieved by following a hyperlink, that document is viewed on the base of another inverted pyramid whose apex rests on the selected text and so on (see Figure 11). Rather than having to page back and forth between hypertext documents as with most hypertext-based browsers, in Tecate, an end user needs only to move about the virtual world to gain an appropriate viewpoint from which to examine a desired document. Overall, as shown in Figure 11, a browsing session with Tecate's Web browser results in a forest of pyramidal structures that represent a pictorial history of an end user's travels on the Web.

Although Tecate's Web browser is capable of viewing HTML documents, its main purpose is not to emulate what can currently be done using hypertext-based browsers, albeit using 3-D. Rather, the new browser is intended to visualize primarily more complex types of data. When data does not consist of a stream of HTML code, the WWW Interface attempts to visualize what was returned from the Web. These visualizations can take place in virtual worlds separate from the informational landscape from where the data

Figure 12 

Example of a Web Document with Embedded 3-D Virtual World

[...] found in visualization systems, network browsers, database front ends, and virtual reality systems. As a first prototype, Tecate was created using a breadth-first development strategy. That is, developers deemed it essential to first understand what components were needed to build a general data space exploration utility and then determine how those components interact.

The Tecate car demo

# Global variables
global TEC_WEB_PARENT TEC_WEB_WIN
set path "/projects/s2k/sharedata"

# Define car part prototype
clone CarPart Visual
add CarPart {
state {angle 10}
appearance {
repType surface
interpType surface
behavior {
around {args} {

# Define car body
for {set i 0} {$i < 360} {incr i [getstate angle]} {
clone car_body CarPart
add car_body {
appearance {
send [getself] rotate "add 0 [getstate angle] 0"
replacematrix {rotate {0.0 0.0 90.0}}
shape {AliasObj "$path/car_body.tri"}

# Define generic wheel
clone wheel CarPart

# Define car's four wheels
clone back_right CarPart

# Assemble car
clone wheels CarPart
add wheels {subobject {back_right back_left front_left front_right}}
clone car CarPart
add car {
appearance {replacematrix {translate {28.0 -8.0 3.0} rotate {90.0 90.0 0.0}}}
subobject {car_body wheels}
add $TEC_WEB_PARENT {subobject {car}}

# Bind pick events to car
send $TEC_WEB_WIN addEvent {wheel {Pick-Shift-Button-1 {rot_wheels {}}}}
send $TEC_WEB_WIN addEvent {wheel {Pick-Button-1 {around {}}}}
send $TEC_WEB_WIN addEvent {car {Pick-Button-1 {around {}}}}

 

 

Button-1 on car to rotate the car

Button-1 on a wheel to rotate the wheels

Shift Button-1 on a wheel to change the wheels

 

 

 

 

Figure 13 

AVL Code Embedded in the HTML Page for the Web Document Example


This development strategy traded off the functionality of individual components for the completeness of a fully running visualization system.

In terms of achieving its design goals, the Tecate effort has been moderately successful. Tecate can now provide interfaces to two kinds of data spaces: the World Wide Web and databases managed by the POSTGRES and Illustra database management systems. In addition, interfaces to other data spaces can be implemented easily by creating new resource objects using the tools provided by Tecate. Much work still needs to be done, however. For example, the attendant data translation problem must be satisfactorily solved; data passing through an interface that is stored in one format should be automatically converted into Tecate's favored format and vice versa.
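One common way to approach such a translation problem is to route every external format through a single internal representation, so that each new format needs only one decoder and one encoder. The sketch below illustrates that idea in Python under stated assumptions: the internal form (a list of tuples), the two text formats, and the names `DECODERS`, `ENCODERS`, and `translate` are all hypothetical and are not taken from Tecate.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

# Hypothetical "favored" internal form: a list of tuples of strings.
Rows = List[Tuple[str, ...]]


def decode_csv(text: str) -> Rows:
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def encode_csv(rows: Rows) -> str:
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def decode_tsv(text: str) -> Rows:
    return [tuple(line.split("\t")) for line in text.splitlines()]


def encode_tsv(rows: Rows) -> str:
    return "".join("\t".join(row) + "\n" for row in rows)


# One decoder and one encoder per external format.
DECODERS: Dict[str, Callable[[str], Rows]] = {"csv": decode_csv, "tsv": decode_tsv}
ENCODERS: Dict[str, Callable[[Rows], str]] = {"csv": encode_csv, "tsv": encode_tsv}


def translate(text: str, source: str, target: str) -> str:
    """Convert between formats by passing through the internal form."""
    return ENCODERS[target](DECODERS[source](text))


print(translate("station,temp\nA,12\n", "csv", "tsv"))
```

With n formats this arrangement needs 2n converters instead of one converter per ordered pair of formats.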

When building visualizations of data, Tecate now understands data that has a specific conceptual structure, in particular, arbitrary sets of tuples and multidimensional arrays whose array elements may be tuples. Although data types from many different disciplines possess such a structure, some types remain that do not, for instance, data that has a lattice-like or polyhedral structure. Furthermore, Tecate can now construct only crude visualizations of the data types that it does understand. The primary reason for this shortcoming is that the basic module set within the BigRiver resource is incomplete, and the knowledge base within the Intelligent Visualization System contains limited knowledge of visualization techniques that can be used to transform d[...]


[...] Supercomputer Center (SDSC). Currently, Peter is a visiting scientist at the SDSC, where he leads researchers in developing interactive data visualization systems. Peter joined Digital in 1990 as a member of the Workstations Engineering Group. In earlier work, he was a software engineer for th[...]

High-performance I/O and Networking Software in Sequoia 2000

The Sequoia 2000 project requires a high-speed network and I/O software for the support of global change research. In addition, Sequoia distributed applications require the efficient movement of very large objects, from tens to hundreds of megabytes in size. The network architecture incorporates new designs and implementations of operating system I/O software. New methods provide significant performance improvements for transfers among devices and processes and between the two. These techniques reduce or eliminate costly memory accesses, avoid unnecessary processing, and bypass system overheads to improve throughput and reduce latency.


Joseph Pasquale 

Eric W. Anderson 

Kevin Fall 

Jonathan S. Kay 

In the Sequoia 2000 project, we addressed the problem of designing a distributed computer system that can efficiently retrieve, store, [...] container shipping. (For a complete discussion of container shipping, see Reference 1.) Since container shipping is a new design, this paper devotes more space to it in relation to the other surveyed work (whose detailed descriptions may be found in References 2 to 9). In addition to this work, we conducted other network studies as part of the Sequoia 2000 project. These include research on protocols to provide performance guarantees and multicasting.[10, ...]

To support a high-performance distributed computing environment in which applications can effectively manipulate large data objects, we were concerned with achieving high throughput during the transfer of these objects. The processes or devices representing the data sources and sinks may all reside on the same workst[...] (multiple node case). In either case, we wanted applications, be they earth science distributed computations or collaboration tools involving multipoint video, to make full use of the raw bandwidth provided by the underlying communication system.

In the multiple node case, the raw bandwidth is from 45 to 100 megabits per second (Mb/s), because the Sequoia 2000 network used T3 links for long-distance communication and a fiber distributed data interface (FDDI) for local area communication. In the single node case, the raw bandwidth is approximately 100 megabytes per second, since the workstation of choice was one of the DECstation 5000 series or the Alpha-powered DEC 3000 series, both of which use the TURBOchannel as the system bus.
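To put these raw bandwidths in perspective, the following sketch computes the ideal time to move a single 100-megabyte object over each path. It is a back-of-the-envelope calculation only: it assumes decimal units (1 megabyte = 10^6 bytes, 1 megabit = 10^6 bits) and ignores all protocol and software overhead, so the results are lower bounds. The function name `transfer_seconds` is illustrative.

```python
MEGABIT = 1_000_000    # bits (decimal units are an assumption)
MEGABYTE = 1_000_000   # bytes


def transfer_seconds(object_bytes: int, bits_per_second: int) -> float:
    """Ideal transfer time: object size in bits divided by the raw bit rate."""
    return object_bytes * 8 / bits_per_second


object_bytes = 100 * MEGABYTE  # one large object, within the range the paper cites

raw_rates = {
    "T3 link (45 Mb/s)": 45 * MEGABIT,
    "FDDI (100 Mb/s)": 100 * MEGABIT,
    "TURBOchannel (about 100 MB/s)": 100 * MEGABYTE * 8,
}

for name, rate in raw_rates.items():
    print(f"{name}: {transfer_seconds(object_bytes, rate):.1f} s")
```

Under these assumptions the object needs roughly 17.8 seconds on a T3 link, 8 seconds on FDDI, and 1 second across the system bus, which is why software overhead that keeps a workstation from reaching the raw rate matters for objects of this size.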

Our work focused only on software improvements, in particular how to achieve maximum system software performance given the hardware we selected. In fact, we found that the throughput bottlenecks in the Sequoia distributed computing environment were indeed in the workstation's operating system software, and not in the underlying communication system hardware (e.g., network links or the system bus). This problem is not limited to the Sequoia environment: given modern high-speed workstations (100+ millions of instructions per second [mips]) [...]

[...] FDDI boards, both models DEFTA, which supports both send and receive direct memory access (DMA), and DEFZA, which supports only receive DMA.

The Data Transfer Problem 

Since a data source or sink may be either a process or device, and the operating system generally performs the function of transferring data between processes and devices, understanding the bottlenecks in these operating system data paths is key to improving performance. These data paths generally involve traversing numerous layers of operating system software. In the case of network transfers, the data paths are extended by layers of network protocol software.

To understand the performance problem we were trying to solve, consider a common client-server interaction in which [...]

[...] device is a network adapter. In this paper, we use the term data transfer problem to refer to the problem of reducing these overheads to achieve high throughput between a source device and a sink device, either of which can be a network adapter within a single workstation.

Although the data transfer problem may also exist in intermediate routers, it does so to a much lesser degree than with end-user workstations (assuming modern router software and hardware technology). This is because of a router's simplified execution environment and its reduced needs for transfers across multiple protected domains. However, there is nothing that precludes the application of the techniques discussed in this paper to router software. In fact, since we used general-purpose workstations for routers to support a flexible, modifiable test bed for experimentation with new protocols, our work was also applied to router software.

In the next three sections, we describe various approaches to solving the data transfer problem. Since data copying/touching is a major software limitation in achieving high throughput, avoiding data copying/touching is a constant theme. Much of our work involves finding w[...]
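The difference between copying data and merely passing a reference to it can be seen even at the application level. The sketch below is a small Python analogy, not one of the kernel techniques described in this paper: slicing a `bytearray` allocates and copies, whereas slicing a `memoryview` shares the underlying buffer. The variable names are illustrative.

```python
payload = bytearray(b"x" * 1024)

# Slicing the bytearray allocates new storage and copies 512 bytes.
copied = bytes(payload[:512])

# Slicing a memoryview creates a new view of the same buffer; no bytes move.
view = memoryview(payload)[:512]

# A later change to the buffer is visible through the view but not the copy,
# which shows that the view never duplicated the data.
payload[0] = ord("y")
print(copied[0:1], bytes(view[0:1]))
```

The copy isolates the two parties from each other at the price of touching every byte; the view costs almost nothing but requires both parties to agree on who may modify the buffer.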

[...] flags to cs_read, without calling cs_map. Unmapping is automatic in cs_write, so if cs_unmap is not used, two system calls are still sufficient.

As shown in Figure 3, a process reads d[...]

[...] written is lost, a writing process must copy any data that will still be needed after the write. Fortunately, applications that move great volumes of data often have no further need for it after a write is completed.
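The rule that written data is lost to the writer can be modeled as a transfer of ownership. The sketch below is a user-level Python model written only to illustrate that rule; it is not the container shipping interface. The classes `Container` and `Channel` and their methods are hypothetical, and they stand in loosely for the cs_write and cs_read calls named in the text.

```python
from collections import deque


class Container:
    """Hypothetical model of a buffer whose ownership can be shipped away."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._owned = True

    def data(self) -> bytearray:
        if not self._owned:
            raise ValueError("container was shipped; data is no longer accessible")
        return self._data

    def _release(self) -> bytearray:
        """Give up the buffer; the previous owner loses access."""
        buffer = self.data()
        self._owned = False
        self._data = None
        return buffer


class Channel:
    """Hypothetical stand-in for the path between a writer and a reader."""

    def __init__(self):
        self._queue = deque()

    def write(self, container: Container) -> None:
        # Models cs_write: the buffer moves to the channel without a copy.
        self._queue.append(container._release())

    def read(self) -> Container:
        # Models cs_read: the reader becomes the owner of the same buffer.
        received = Container(b"")
        received._data = self._queue.popleft()
        return received


channel = Channel()
outgoing = Container(b"climate grid")
header = bytes(outgoing.data()[:7])  # copy what is still needed after the write
channel.write(outgoing)
incoming = channel.read()
print(header, bytes(incoming.data()))
```

Only the seven bytes the writer still needs are copied; the rest of the buffer changes hands without being touched.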

Implica tions of Virtual Memory Remapping 

In

[...] communication (IPC) within the ULTRIX version 4.2a operating system on a DECstation 5000/200 system. With the new DEC OSF/1 implementation on Alpha workstations, we compared the I/O performance of conventional UNIX I/O to that of container shipping for a variety of I/O devices as well[...]

Figure 5
A Splice Connecting a Source to a Sink Device

[...] device. If the I/O bus and the devices support hardware streaming, the data path is directly over the bus, avoiding system memory altogether. Although the process does not necessarily manipul[...]


[...] been implemented within the ULTRIX version 4.2a operating system for the DEC 5000 series workstations, and within DEC OSF/1 version 2.0 for DEC 3000 series (Alpha-powered) workstations, each for a limited number of devices.

Three performance evaluation studies of PPIO have been carried out and are described in our early papers.[2, ...] They indicate CPU availability improves by 30 percent or more; and throughput and latency improve by a factor of two to three, depending on the speed of I/O devices. Generally, the latency and throughput performance improvements offered by PPIO improve with faster I/O devices, indicating that PPIO scales well with new I/O device technology.

Improving Network Software Throughput

Network I/O presents a special problem in that the complexity of the abstraction layer (see Figure 2), a stack of network protocols, is generally much greater than that for other types of I/O. In this section, we summarize the results of an analysis of overheads for an implementation of TCP/IP we used in the Sequoia 2000 project. The primary bottleneck in achieving high throughput communication for TCP/IP is due to data-touching operations: one expected culprit is dat[...]

• ErrorChk: The category of checks for user and system errors, such as parameter checking on socket system calls

• Other: This final category of overhead includes all the operations th[...]

Figure 6 


[...] 4.2a operating system, and a 27 percent improvement over the implementation with the optimized checksum computation algorithm.
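Checksum computation is a data-touching operation because every byte of the packet must be read. As a point of reference, the sketch below implements the standard 16-bit one's-complement Internet checksum in the straightforward style described in RFC 1071. It is a plain reference version in Python and is not the optimized algorithm evaluated in this paper; the function name `internet_checksum` is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Each byte is read exactly once: this loop is the data-touching cost.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carries out of the low 16 bits back into the sum.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Worked example from RFC 1071: the folded sum is 0xddf2, so the checksum,
# its one's complement, is 0x220d.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(sample)))
```

The cost of this loop grows linearly with packet size, which is why it dominates for large packets and why it is a natural target for optimization.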

Of course, one must be very careful about [...]

5. J. Kay and J. Pasquale, "The Importance of Non-Data-Touching Processing Overheads in TCP/IP," Proceedings of the ACM Communications Architectures and Protocols Conference (SIGCOMM), San Francisco (September 1993), pp. 259-269.

6. J. Kay and J. Pasquale, "Measurement, Analysis, and Improvement of UDP/IP Throughput for the DECstation 5000," Proceedings of the USENIX Winter Technical Conference, San Diego (January 1993), pp. 249-258.

7. J. Kay and J. Pasquale, "A Summary of TCP/IP Networking Software Performance for the DECstation 5000," Pr[...]

[...] Approach (Berkeley, Calif.: International Computer Science Institute, Technical Report TR-92-072, 1992).

11. H. Zhang, D. Verma, and D. Ferrari, "Design and Implementation of the Real-Time Internet Pro[...]

[...] Performance of Message-Passing Using Restricted Virtual Memory Remapping," Software-Practice and Experience, vol. 21, no. 3 (1991): 251-267.

26. P. Druschel and L. Peterson, "Fbufs: A High-bandwidth Cross-Domain Transfer Facility," Proceedings of the 14th ACM Symposium on Operating Systems Principles (SOSP), Asheville, N.C. (December 1993), pp. 189-202.

27. J. Mogul and A. Borg, "The Effect of Context Switches on Cache Performance," Proceedings of the ASPLOS-IV (April 1991), pp. 75-84.

28. B. Bershad, T. Anderson, E. Lazowska, and H. Levy, "Lightweight Remote Procedure Call," ACM Transactions on Computer Systems, vol. 8, no. 1 (1990): 37-55.

29. P. Druschel, M. Abbott, M. Pagels, and L. Peterson, "Analysis of I/O Subsystem Design for Multimedia Workstations," Proceedings of the Third International Workshop on Network and Oper[...]

[...] routing architecture

Call for Authors
from Digital Press

Digital Press is an imprint of Butterworth-Heinemann, a major international publisher of professional books and a member of the Reed Elsevier group. Digital Press is the authorized publisher for Digital Equipment Corporation: The two companies are working in partnership to identify and publish new books under the Digital Press imprint and create opportunities for authors to publish their work.

Digital Press is committed to publishing high-quality books on a wide variety of subjects. We would like to hear from you if you are writing or thinking about writing a book.

Contact: Mike Cash, Digital Press Manager, or
Liz McCarthy, Assistant Editor

DIGITAL PRESS
313 Washington Street
Newton, MA 02158-1626
U.S.A.

Tel: (617) 928-2649, Fax: (617) 928-2640
E-mail: Mike.Cash@BHein.rel.co.uk or LizMc@world.std.com

ISSN 0898-901X 

Printed in U.S.A. EY-T8 38E-TJ/YS 12 14 18.0 Copyright © Digital Equipment Corporation. All Rights Reserved.
