DTJ Volume 7 Number 3 1995 - Digital Technical Journals
DTJ Volume 7 Number 3 1995 - Digital Technical Journals
DTJ Volume 7 Number 3 1995 - Digital Technical Journals
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Digital</strong><br />
<strong>Technical</strong><br />
Journal<br />
I<br />
HIGH PERFORMANCE FORTRAN<br />
IN PARALLEL ENVIRONMENTS<br />
SEQUOIA 2000 RESEARCH<br />
<strong>Volume</strong> 7 <strong>Number</strong> 3<br />
<strong>1995</strong>
Editorial<br />
Jane C. Blake, Managing Editor<br />
Helen L. Patterson, Editor<br />
Kathleen M. Stetson, Editor<br />
Circulation<br />
Catherine M. Phillips, Administrator<br />
Dorothea B. Cassady, Secretary<br />
Production<br />
Terri Autieri, Production Editor<br />
Anne S. Katzeff, Typographer<br />
Peter R. Woodbury, Illustrator<br />
Advisory Board<br />
Samuel H. Fuller, Chairman<br />
Richard W Beane<br />
Donald Z. Harbert<br />
William R. Hawe<br />
Richard J. Hollingsworth<br />
Richard F. Lary<br />
Alan G. Nemeth<br />
Robert M. Supnik<br />
Cover Design<br />
The images on the front and back covers<br />
of this issue are different visualizations<br />
of the same data output from a regional<br />
climate simulation program run by Dr.<br />
John Roads of the Scripps Institution of<br />
Oceanography. The data depicted contain<br />
measures of temperature, liquid and<br />
gaseous water content, and wind vectors;<br />
the topography represented by the data<br />
is the western U.S. in January 1990. Providing<br />
earth scientists with the ability to<br />
visualize such data is one of the objectives<br />
of the Sequoia 2000 research projecta<br />
joint eftort of the University of California,<br />
government agencies, and industry to build<br />
a computing environment for global change<br />
research. This issue presents papers on several<br />
major an::as explored by Sequoia 2000<br />
rese;uchers, including an electronic repository,<br />
networking, and visualization.<br />
The cover was designed by Lucinda O'Neill<br />
of <strong>Digital</strong>'s Design Group. Special thanks go<br />
to Peter Kochevar for supplying the cover<br />
images.<br />
The <strong>Digital</strong> <strong>Technical</strong> journal is a refereed<br />
journal published quarterly by <strong>Digital</strong><br />
Equipment Corporation, 30 Porter Road<br />
L)02/D lO, Littleton, Massachusetts 01460.<br />
Subscriptions ro rhejoumaiare $40.00<br />
(non-U.S. $60) for four issues and $75.00<br />
(non-U.S. $115) for eight issues and must<br />
be prepaid in U.S. funds. University and<br />
college protessors and Ph.D. srudems in<br />
the electrical engineering and computer<br />
science fields receive complimentary subscriptions<br />
upon request. Orders, inquiries,<br />
and address changes should be sent to the<br />
<strong>Digital</strong> Technica!journalat rhe publishedby<br />
address. Inquiries can also be sent electronically<br />
ro drj@digital.com. Single copies<br />
and back issues are available for $16.00 each<br />
by calling DECdirecr at 1-800-DIGITAL<br />
(1-800-344-4825). Recent back issuesofrhe<br />
journal are also available on the Internet at<br />
h rrp:/ jwww.digital.com/info/<strong>DTJ</strong>/home.<br />
hrml. Complete <strong>Digital</strong> Internet listings can<br />
be obtained by sending an electronic mail<br />
message ro info@digiral.com.<br />
<strong>Digital</strong> employees may order subscriptions<br />
through Readers Choice by enteringVTX<br />
PROFILE at the system prompt.<br />
Comments on the content of any paper are<br />
welcomed and may be sent to the managing<br />
editor at the pubhshed-by or network address.<br />
Copyright© <strong>1995</strong> <strong>Digital</strong> Equipment<br />
Corporation. Copying without fee is permitted<br />
provided that such copies are made<br />
for use in educational institutions by faculty<br />
members and are not distributed for commercial<br />
advantage. Abstracting with credit<br />
of <strong>Digital</strong> Equipment Corporation's authorship<br />
is permitted. All rights reserved.<br />
The information in the journal is subject<br />
to change without notice and should not<br />
be construed as a commitment by <strong>Digital</strong><br />
Equipment Corporation or by the companies<br />
herein represented. <strong>Digital</strong> Equipment<br />
Corporation assumes no responsibility for<br />
any errors that may appear in the Journal.<br />
JSSN 0898-901X<br />
Documentation <strong>Number</strong> EY-T838E-TJ<br />
Book production was done by Quantic<br />
Communications, Inc.<br />
The following are trademarks of <strong>Digital</strong><br />
Equipment Corporation: <strong>Digital</strong>, the<br />
DIGITAL logo, Alpha Generation,<br />
AJphaServer, AlphaSrarion, DEC, DEC<br />
OSF /1, DECstation, GIGAswirch,<br />
TURBOchannel, and ULTlUX.<br />
Dore is a registered trademark of Kubota<br />
Pacific Computer Inc.<br />
Ex a byte is a registered trademark of<br />
Exabyre Corporation.<br />
Hewlett-Packard and HP are registered<br />
trademarks of Hewlett-Packard Company<br />
JBM and SP2 are registered trademarks<br />
of! nternarional Business Machines<br />
Corporation.<br />
II lustra is a registered trademark of Illustra<br />
lntormation Technologies, Inc.<br />
Intel is a trademark oflnrel Corporation.<br />
MCf is a registered trademark of MCI<br />
Communications Corporation.<br />
MEMORY CHANNEL is a trademark<br />
of Encore Computer Corporation.<br />
Mosaic is a trademark of Mosaic<br />
Communications Corporation.<br />
Nerscape is a trademark ofNerscape<br />
Communications Corporation.<br />
NewronScripr is a trademark of Apple<br />
Computer, Inc.<br />
NFS is a registered trademark of Sun<br />
Microsysrems, Inc.<br />
OpenGL is a registered trademark and<br />
Open Inventor is a trademark of Silicon<br />
Graphics, Inc.<br />
Picture Tel is a registered trademark of<br />
PictureTcl Corporation.<br />
PostScript is a registered trademark of<br />
Adobe Systems Inc.<br />
SAIC is a registered trademark of Science<br />
Applications International Corporation.<br />
Siemens is ·a registered trademark of<br />
Siemens Nixdorflnformation Systems, Inc.<br />
Sony is a registered trademark of Sony<br />
Corporation.<br />
SPEC is a trademark of the Standard<br />
Performance Evaluation Council.<br />
Telescripr is a trademark of General Magic, Inc.<br />
UNIX is a registered trademark in the<br />
United Stares and other countries, licensed<br />
exclusively through X/Open Company Ltd.
Contents<br />
Foreword<br />
HIGH PERFORMANCE FORTRAN IN<br />
PARALLEL ENVIRONMENTS<br />
Compiling High Performance Fortran for<br />
Distributed-memory Systems<br />
Design of <strong>Digital</strong>'s Parallel Software Environment<br />
SEQUOIA 2000 RESEARCH<br />
An Overview of the Sequoia 2000 Project<br />
The Sequoia 2000 Electronic Repository<br />
Tecate: A Software Platform for Browsing and<br />
Visualizing Data from Networked Data Sources<br />
High-performance 1/0 and Networking Software<br />
in Sequoia 2000<br />
) can C. Bonnev<br />
Jonathan Harris, John A. Bircsak, M. Regina Bolduc,<br />
Jill Ann Diell'ald, Israel Gale, !\cil W. Johnson,<br />
Shin Lee, C. Alexander Kelson, :md Carl D. Offner<br />
Edward G. Benson, David C. P. La France- Linden,<br />
Richard A. Warn::n, and Sant�1 Wirvaman<br />
Michael Sroncbr,lker<br />
Ra\· R. Larson, Christian PLlunr,<br />
Allison G. Woodruff, and MJrri Hearst<br />
Peter D. Kochevar and L.eon�1rd R. Wanger<br />
joseph Pasquale, Eric W. Anderson, Kevin Fall, and<br />
Jonathan S. Kav<br />
<strong>Digital</strong> T�chnical )ounLll Vol.7 l'-:o.3 <strong>1995</strong><br />
3<br />
5<br />
24<br />
39<br />
50<br />
66<br />
84
2<br />
Editor's<br />
Introduction<br />
Scientists have long been moti,·ators<br />
tc>r the development of powert-l1l<br />
computing environments. Two<br />
sections in this issue of the.founw/<br />
address the requirements of scienrinc<br />
and technical computing. The tirst,<br />
fron1 <strong>Digital</strong>'s High Pertonnance<br />
<strong>Technical</strong> Computing Group, looks<br />
at compiler and development tools<br />
that accelerate performance in parallel<br />
environments. The second section<br />
looks ro the future of computing;<br />
University of California and <strong>Digital</strong><br />
researchers present their work on a<br />
large, distributed computing environmem<br />
suited to the needs of earth scientists<br />
studving global changes such<br />
as ocean dynamics, global warming,<br />
and ozone depletion. <strong>Digital</strong> w:�s an<br />
early industry sponsor and particip:mt<br />
in this joint research project, called<br />
Sequoia 2000.<br />
To support the writing of parallel<br />
programs tor computationally intense<br />
environments, <strong>Digital</strong> has extended<br />
DEC fortran 90 by implementing<br />
most of High Performance Fortran<br />
( H Pf) version 1.1. After reviewing<br />
the syntactic features of Fortran 90<br />
and HPF, Jonathan Harris et al. fc>eus<br />
on the HPF compiler design and<br />
expbin the optimizations it pertorms<br />
ro improve interprocessor communication<br />
in a distributed-memory environment,<br />
specincally, in workstation<br />
clusters (tarms) based on <strong>Digital</strong>'s<br />
64-bit Alpha microprocessors.<br />
The run-time support for this distributed<br />
environment is the Parallel<br />
Software Environment (PSE). Ed<br />
Benson, David Lafrance- Linden,<br />
Rich Warren, and Santa vVin'aman<br />
describe the PSE product, which is<br />
layered on the UNIX operating system<br />
and includes tools for developing<br />
Digit�l'lr ll'riting the Foreword ro tl1is<br />
ISSUC.<br />
Our next issue will feature papers<br />
on multimedia and UNIX clusters.<br />
);me C. Blake<br />
Mallrl_r�illg t.ditor
Foreword<br />
Jean C. Bonney<br />
Dirc>ci!J/: J::Yit'l"ll!li Nc>sem·cb<br />
The lntonmrion Utilitv, rhe<br />
Information Highw�l\', the Internet,<br />
rhe Infi.>bahn, rhe Information<br />
Eeononw-rhe sound lwtes ofrhe<br />
1990s. To 111�1ke these concepts<br />
reality, a robust technolob"' intra<br />
structure is necessary. In 1990,<br />
Digit�1l's research organization saw<br />
rbis need :md ser out to develop an<br />
exl1erimenral rest hed that would<br />
examine assumptions and provide a<br />
basis tr a technolob"' edge in the '90s.<br />
The resulting project was Sequoia<br />
2000, a three-vcar research collabora<br />
tion bet\veen <strong>Digital</strong>, campuses of the<br />
University ofCaliti.>rni�l, �111d several<br />
other industrv and governmem orga<br />
nizations. The Sequoia 2000 vision is<br />
l'c>!ahyles! i.e .. /ri/liuiiS uj'hylc>s)<br />
oj'dala in u dislrilmled orchi(lc.<br />
/runsparel!lir ma11a,t.;ed. aJtd<br />
lugicallv cieti'C'd ouer a hl;t.;h-speul<br />
nelnnd' u·i!h isuchmiiOIIS capahililies<br />
l'ia a hosl u/loo/s<br />
-in orher words, a big, bst, easy-ro<br />
use system.<br />
Although rhe vision is still not reality<br />
today, our more than three years<br />
of participation in Sequoia 2000<br />
research gave us rhe knowledge base<br />
we sought.<br />
AlTer a rigorous process of pro<br />
poSrnia, Sequoia 2000 began<br />
in June 1991. The t(JCus of the<br />
rcsean.:h was a high-speed, broad<br />
hrmation ...:ol<br />
lccred trom satellites. Once rhe data<br />
arc colle...:tcd :111d organized, there is<br />
rhe challenge of massive simulations,<br />
simulations that t(>rccast world climate<br />
ten or even one hundred yeJrs ti·om<br />
now. These were exactly the kinds<br />
of challenges the computer scientists<br />
needed .<br />
Among the m:1jor results of three<br />
ye�ns of work on Sequoia 2000 was<br />
a set of produ...:t requiremems t(>r<br />
large data applications. These require<br />
mems have been validated through<br />
discussions with customers in tinan<br />
cial, healthcare, and communications<br />
industries Jnd in governmenr. The<br />
requin:menrs include<br />
• A computing environment built<br />
on an object relational database,<br />
i.e., a thta-cenrric computing<br />
syste m<br />
• A tbtabase that handles a wide<br />
variety of nontraditional objects<br />
such as tor, audio, video, graph<br />
ics, and inuges<br />
• Support ror a variety ofrraditional<br />
databases and file systems<br />
• The ability to perform necessary<br />
operations ti·om computing<br />
environments that arc intuitive<br />
and have rhe same look and kcl;<br />
the imerface to the environment<br />
should be generic, very high level,<br />
and easily tailored to the user<br />
:lf-1plication<br />
• High -speed data migration<br />
bet\veen secondary and tertiary<br />
storage with the ability to handle<br />
very large data transfers<br />
• Net\vork bandwidth capable<br />
of handling image transmission<br />
across nervvorks in an acceptable<br />
rime hame with quality guarantees<br />
fc->r the dara<br />
• High-quality remote visualization<br />
of any relevant data regardless<br />
oftormat; the user must be able<br />
to manipulate the visual data<br />
interactively<br />
• Reliable, guaranteed , deli very<br />
of data ti·om tertiary storage to<br />
rhe desktop<br />
Sequoia 2000 was also a catalyst<br />
t(>r maturing the POSTGRES research<br />
database soft\vare to the point where<br />
it was ready tor ...:ommercialization.<br />
The commercial version, I !lustra,<br />
is available on Alpha platforms and<br />
is enjoying success in the banking<br />
industry and in geographic informa<br />
tion system (GIS) applications, as<br />
well as in other government applica<br />
tions with massive data requirements .<br />
Illusrra is Jlso making inroads into the<br />
Imernet where it is used by on-line<br />
services.<br />
Yet another major result of Sequoia<br />
2000 was a grant fi-om the National<br />
<strong>Digital</strong> Tcclmic�l Journ.1l Vol. 7 No.3 <strong>1995</strong> 3
4<br />
Aeronautics and Space Administra<br />
tion (l\:ASA) to develop an ;lltemare<br />
archirectu1·e t(Jr the Earth Obsen·ing<br />
Svstem i);ua and lntcmnarion S)·stem<br />
(EOSDIS). EOS[)JS will process the<br />
pctJbvres of re:�l-time dar:� ti·om<br />
the E:�rth Observing System ( EOS)<br />
satellites to be launched at the end<br />
of the dec1de. The altematc int(Jr<br />
marion :1rchirecrure proposed bv the<br />
Universitv ofCalit(Jrni:� bculn· II'Js<br />
. .<br />
the Sequoi;l 2000 ;Jrchirecrure. lr<br />
will have a m;ljor intluence on the<br />
EOSDIS project.<br />
fm the e;uth sciemisrs, g;lins<br />
1\'Cre made in simul;nion speeds :md<br />
111 access to Luge stores oforg:mized<br />
data. These sciemists used some oF<br />
<strong>Digital</strong>'s tirst Alpha workst;Hion Emm<br />
and sottw;lre prototi'JKS t(Jr their cli<br />
mate simulations. An eight-processor<br />
Alph:� ,,·orksr:�rion brm pnwided a<br />
t\\'O-to-onc price/ped(mn;mce Jlklll<br />
tage o1·er the pm1·erful, multimillion<br />
dollar C:RAY C90 machine. In :mother<br />
earth science apJ)Iiurion, sciemisrs<br />
usi11g Alpha and hicr;uchiul stor;lgc<br />
svstems could simui;He r11·o \'cHs'<br />
worth of climate d,ltJ over the 11 cck<br />
end without operator imcn'Cntion;<br />
tcmllcrh', t\I'O months' worth ohbt:-t<br />
rook one dav to simui;He :-tnd required<br />
considerable OJ)er;Hor imen·cntion.<br />
Thus m:�ny more simul:�rions could<br />
be processed in ;l tixed rime :-tnd<br />
"time to discoverv" W;ls decrc1sed<br />
considerablv.<br />
Now that ll'e em look ;lt Sequoi:-t<br />
2000 in retrospect, would II'C do<br />
such a project againl The ;111swcr<br />
is a resounding ''l·cs" ti·om ;!II of<br />
us im·oh·ed. It \\';ls a complex proj<br />
ect that included 12 Unil'lTsin· of<br />
Calitcm1ia bculn· members, 2:1 grad<br />
uate students, ;md 20 st:1ff Another<br />
8 hculn· members and students pm<br />
,·idcd additional expertise. Four of<br />
<strong>Digital</strong>'s engineers worked on site,<br />
;md ;l varien· of support personnel<br />
ti·om other industrv sponsors p::�rtici<br />
p:-tted, including SAlC, the C1li�clrni::�<br />
DepMtmenr ofvVater Resources,<br />
Hewlett-Packard, Metrum, United<br />
States CcologicaJ Survey (USGS),<br />
Hughes Application Int(mnation<br />
Sen·ices, ::�nd the Annv Corps of<br />
Engineers.<br />
But ;Js is the c1se with such ::�mbi<br />
tious projects, there \\'ere tlll:mtici<br />
p:ncd ;lnd difticulr lessons tclr ;111<br />
to leam. To experiment 11·ith real-<br />
lite rest beds means considerabh<br />
more rhan 11ri ring a rigorous set<br />
oFh1-porheses in a proposal. Michael<br />
Stonehr,lker, in his paper, notes a<br />
lllllllbcr of ch,lilenges \\"C t:Ked and<br />
the lessons lc:1rned. One ofthe issues<br />
rh::�t kqlt surbcing was the "grease<br />
;md glue" tell. the inti·astrucrure, that<br />
is, the inreroperabilin· of pieces of<br />
sofn1 are and hard11·are tlut composed<br />
the end-to-end S\'Stem.1'his remains<br />
,\ chal.lenge that needs rese;Jrch if we<br />
;li'C going to achieve the promised<br />
gcds ot'intcrnct\l'orking. Another<br />
stick\· point was scalability. On the<br />
one hand, it is difficult to build a l't:rv<br />
large networked svstem ti·orn scratch.<br />
On the orhcr hand, as we slowlv built<br />
the mass storage svsrem to the point<br />
oF minimal critical mass, liT t(nmd<br />
th:lt the current oft�thc-shelftech<br />
nologies t(Jr m.1ss storage were not<br />
I'CJch· to be put use tclr our purposes.<br />
So, ves, \\'C be lie1·e the project ll';ls<br />
11·orthll'hile 11·irh some Gll'e
Compiling High<br />
Performance Fortran<br />
for Distributedmemory<br />
Systems<br />
<strong>Digital</strong>'s DEC Fortran 90 compiler implements<br />
most of High Performance Fortran version 1.1,<br />
a language fo r writing parallel programs. The<br />
co mpiler generates code for distributed-memory<br />
machi nes consisting of interconnected work<br />
stations or servers powered by <strong>Digital</strong>'s Alpha<br />
microprocessors. The DEC Fortran 90 compiler<br />
efficiently implements the features of Fortran 90<br />
and HPF that support parallelism. HPF programs<br />
compiled with <strong>Digital</strong>'s compiler yield perfor<br />
mance that scales linearly or even superlinearly<br />
on significant applications on both distributed<br />
memory and shared-memory architectures.<br />
I<br />
Jonathan Harris<br />
John A. Bircsak<br />
M. Regina Bolduc<br />
Jill Ann Diewald<br />
Israel Gale<br />
Neil W. J olmson<br />
Shin Lee<br />
C. Alexandet· Nelson<br />
Carl D. Offner<br />
High Performance Fortran (HPF) is a new program<br />
ming language for writing parallel programs. It is<br />
based on the Fortran 90 language, with extensions<br />
that enable the programmer ro speci�r how array oper<br />
ations can be divided among multiple processors tor<br />
increased performance. In HPF, the program specitles<br />
only the pattern in which the data is divided among<br />
the processors; the compiler auromates the low-level<br />
details of synchronization and communication of data<br />
benvecn processors.<br />
<strong>Digital</strong>'s DEC Fortran 90 compiler is the tirsr imple<br />
mentation of the full H PF version l.l language<br />
(except for transcriptive argument passing, dynamic<br />
remapping, and nested FORALL and 'NHERE con<br />
structs). The compiler was designed tor a distributed<br />
memory machine made up of a cluster (or farm) of<br />
workstations and/or servers powered by <strong>Digital</strong>'s<br />
Alpha microprocessors.<br />
In a distributed-memory machine, communication<br />
between processors must be kept to an absolute mini<br />
mum, because communication across the nenvork is<br />
enormously more time-consuming than anv operation<br />
done locally. <strong>Digital</strong>'s DEC Fortran 90 compiler<br />
includes a number of optimizations to minimize the<br />
cost of communication benveen processors.<br />
This paper briefly reviews the teatures of Fortran 90<br />
and HPF that support parallelism, describes how the<br />
compiler implements these features efficiently, and<br />
concludes with some recent performance results<br />
showing that HPt programs compiled with <strong>Digital</strong>'s<br />
compiler yield pert(xmance that scales linearly or even<br />
superlinearly on significant applications on both<br />
distributed-memory and shared-memory architectures.<br />
Historical Background<br />
The desire to write parallel programs dates back to the<br />
1950s, at least, and probably earlier. The mathematician<br />
John von Neumann, credited with the invention of the<br />
basic architecture of today's serial computers, also<br />
invented cellular automata, the precursor of today's<br />
massively parallel machines. The continuing motiva<br />
tion tor parallelism is provided by the need to solve<br />
computationally intense problems in a reasonable time<br />
and
6<br />
which range ti·om collections of workstations con<br />
nected by standard ti ber-optic networks to rightlv cou<br />
pled CPUs with custom high-speed interconn�ction<br />
networks, are cheaper than single-processor svstems<br />
with equivalent performance . In many cases, �qu iva<br />
lent smgle-processor svstems do not exist and could<br />
not be constructed with existing technologv.<br />
Historically, one of the difficulties \\ : ith parallel<br />
machmes has been writing paral lel programs. The work<br />
of paralleli zing a program w�1s fa r fi·om the original sci<br />
ence be�ng explored; it required progra mmers to keep<br />
track of a great deal of inti:m11ation unrelated ro the<br />
actual computations; and it was done using ad hoc<br />
methods that were not portable to other machines.<br />
The experience gained fi·om this work however led<br />
'<br />
'<br />
to a consensus on a better way to write portable<br />
Fortran programs that would pcrtorm well on a varietv<br />
of paral lel machines. The High Pcrt
c<br />
i n teger n, number_o f_i terations, i , j,k<br />
parameter(n=16)<br />
real A( n,n), Temp(n,n)<br />
. . . (Ini t ialize A, numbe r_o f_i terations)<br />
do k=1 , number_o f_i terations<br />
Update non-edge eleme nts on ly<br />
do i=2, n-1<br />
Figure 1<br />
A Computation Expressed in fortran 77<br />
Figure 2<br />
do j=2, n-1<br />
Temp( i , j ) =(A(i , j-1 )+A(i, j+1 )+A(i+1, j ) +A( i-1 , j ) )*0.25<br />
enddo<br />
end do<br />
do i=2, n-1<br />
do j=2, n-1<br />
end do<br />
enddo<br />
end do<br />
A(i , j ) = Temp(i , j )<br />
i n teger n, number_o f_i terations, i , j , k<br />
pa rameter (n= 16)<br />
rea l A
8<br />
l'rocessor O<br />
SEND<br />
A(8, 2) . . A(8, 8)<br />
ro Processor I<br />
SEND<br />
A(2, 8) A(S, 8)<br />
ro Processor 2<br />
RECEIVE<br />
A(9, 2) .A(9, 8)<br />
ti·om Processor I<br />
RECEIVE<br />
A(2, 9) ... A(Il, 9)<br />
from Processor· 2<br />
Processor<br />
SEND<br />
A(9, 2) .. A(9, 8)<br />
to Processor 0<br />
SE:'-JD<br />
A(9, 8) . . A(lS, 8)<br />
ro Processor 3<br />
RECEIVE<br />
A(8, 2) ... A( 8, R)<br />
from Processor 0<br />
RECE IVE<br />
A(9, 9) ... A( I 5, 9)<br />
ri·om Processor 3<br />
Figure 4<br />
Compiler-generated Communication to r
We now describe some of the HPF language exten<br />
sions used to manage parallel data .<br />
Distributing Data over Processors<br />
Data is distributed over processors by the<br />
DISTRI BUTE directive, the ALIGN directive, or<br />
the default distribution .<br />
The DISTRIBUTE Directive For parallel execution of<br />
array operations, each array must be divided in mem<br />
ory, with each processor storing some portion of<br />
the array in its own local memory. Dividing the array<br />
into parts is known as distributing the array. The HPF<br />
DISTRIBUTE directive controls the distribution of<br />
arrays across each processor's local memory. ft does<br />
this by spcci�'ing a mapping pattern of data objects<br />
onto processors. Many mappings are possible; we illus<br />
trate only a kw.<br />
Consider fi rst the case of a 16 X 16 arrav A in an<br />
environment with to ur processors. One possi ble speci<br />
fication tor A is<br />
I hpf$<br />
rea l AC16, 16)<br />
distribute A(*, block)<br />
The asterisk ( *) tor the fi rst dimension of A means<br />
that the array elements are not distributed along<br />
the tirst (vertica l) axis. In other words, the elements<br />
in any given column are not divided among differ<br />
ent processors, bur are assigned as a single block to<br />
one processor. This type of mapping is rekrred to as<br />
serial distribution. Figure 5 illustrates this distribution.<br />
The BLOCK keyword tor the second dimension<br />
means that for any given row, the array elements are<br />
distributed over each processor in large blocks. The<br />
blocks are of approximately equal size-i n this case,<br />
they are exactly equal-with each processor holding<br />
one block. As a result, A is broken into fo ur contigu<br />
ous groups of columns, with each group assigned to<br />
a separate processor.<br />
Another possibility is a ( *, CYCLIC) distribution .<br />
As in (*, BLOCK), all the elements in each co lumn are<br />
assigned to one processor. The elements in any given<br />
row, however, are dealt out ro the processors in round<br />
robin order, like playing cards dealt out to players<br />
around a table. \Vhen elements are distributed over n<br />
processors, each processor contains every 11th column,<br />
starting h·orn a ditkrent offset. Figure 6 shows the<br />
same array and processor arrangement, distributed<br />
CYCLIC instead ofl3LOCK.<br />
As these examples indicate, the distributions of the<br />
separate dimensions are independent.<br />
A (BLOCK, BLOCK) distribution, as in hgure 3,<br />
divides the array into large rectangles. In that ti gure,<br />
the array clements in any given column or any given<br />
row are divided into rwo large blocks: Processor 0 gets<br />
A( l :8, 1 :8), processor l gets A(9:16, l :8), processor 2<br />
gets A( 1 :8, 9:16), and processor 3 gets A(9:16,9 : 16).<br />
I<br />
I<br />
0<br />
Figure 5<br />
A (*, BLOCK) Disrriburion<br />
0<br />
2 3 0 2 3 0<br />
Figure 6<br />
A(*, CYCLIC) Distri burion<br />
2<br />
I<br />
II I<br />
I<br />
I<br />
2 3 0<br />
3<br />
2 3<br />
The ALIGN Directive The ALIGN directive is used to<br />
speci�' the mapping of arrays relative ro one another.<br />
Corresponding elements in aligned arrays are always<br />
mapped to the same processor; array operations<br />
between aligned arrays are in most cases more efficient<br />
thJn Jrray operations between arrays that are not<br />
known to be aligned .<br />
The most common use of ALIGN is to specifY that<br />
the corresponding clements of rwo or more Jrrays be<br />
mapped identically, as in the following example:<br />
DigitJI Tcchnic;\l Journal Vol. 7 No. 3 <strong>1995</strong> 9
10<br />
! hpf$ align A(i ) wi th B(i )<br />
This example specifies that the two arrays A and Bare<br />
always mapped the same way. More complex align<br />
ments can also be specitled. For example:<br />
! hpf$ align E(i) wi th F(2*i-1 )<br />
In this example, the elements of fare aligned with the<br />
odd elements ofF In this case, 1:· can have at most half<br />
as many elements as F<br />
An array can be aligned with the interior of a larger<br />
array:<br />
real A( 12, 12)<br />
rea l B(16, 16)<br />
! hpf$ align A(i, j ) with B(i+2, j+2)<br />
In this example, the 12 X 12 array A is aligned with<br />
the interior of the 16 X 16 array B (see Figure 7). Each<br />
interior element of B is always stored on the same<br />
processor as the corresponding element of A.<br />
The Default Distribution Va riables that are not explic<br />
itly distributed or aligned are given a default distribu<br />
tion by the compiler. The default distribution is not<br />
specified by the language: dinerent compilers can<br />
choose different default distributions, usually based<br />
on constraints of the target architecture. In the DEC<br />
Fortran 90 language, an array or scalar with the default<br />
distri bution is completely replicated. This decision was<br />
made because the large arrays in the program are the<br />
significant ones that the programmer has to distri bute<br />
explicitly to get good performance. Any other arrays<br />
or scalars will be small and ge nerally will benetit from<br />
being replicated since their values will then be available<br />
everywhere. Of course, the progra mmer retains com<br />
plete control and can specif}1 a ditre re nt distribution<br />
to r these arrays.<br />
Re plicated data is cheap to read but generally<br />
expensive to write. Programmers typically use repli<br />
cated data fo r information rh:�r is computed infre<br />
quently but used ofi:en.<br />
B<br />
Figure 7<br />
An Example of A.rrav Alignment<br />
A<br />
<strong>Digital</strong> Technic:d )ournJI Voi. 7 I o.3 <strong>1995</strong><br />
Data Mapping and Procedure Calls<br />
The distribution of arrays across processors introduces<br />
a new complication fo r procedure calls: the inte rface<br />
between the procedure and the calling program must<br />
take into accoum not only the type and size of the rel<br />
evant objects but also their mapping across processors.<br />
The HPF language includes special forms of the<br />
ALIGN and DISTRIBUTE directives fo r procedure<br />
interbces . These allow tl1 e program to speci�' whether<br />
array arguments can be handled by the procedure as<br />
they are currently distributed, or whether (and how)<br />
they need to be redistributed across the processors.<br />
Expressing Parallel Comp utations<br />
Parallel computations in HPF can be identified in four<br />
ways:<br />
• fortran 90 array assignments<br />
• FORALL statements<br />
• The IN DEPENDENT directive, applied to DO<br />
loops and rORALL statements<br />
• Fortran 90 and HPF intrinsics and library fu nctions<br />
In addition, a compiler may be able to discover paral<br />
lelism in other constructs. ln this section, we discuss<br />
the fi rst two of these paral lel constructions.<br />
Fortran 90 Array Assignment In Fortran 77, operations<br />
on whole arrays can be accomplished only through<br />
explicit DO loops that access array clements one at a<br />
time. Fortran 90 array assignment statements allow<br />
operations on entire arr:�ys to be expressed more simply.<br />
In Fortran 90, the usual intrinsic operations fo r<br />
scalars (arithmetic, comparison, and logical) can be<br />
applied to arrays, provided the arrays arc of the same<br />
shape. for example, if A, B, and C arc two-dimensional<br />
arrays of the same shape, the statement C = A + B<br />
assigns to each element of C a value equal to the sum<br />
of the corresponding clements of A and B.<br />
In more complex cases, this assignment syntax can<br />
have the eftect of drastically simplit),ing the code. For<br />
instance, consider the case of three-d imensional<br />
arrays, such as the arrays dimensioned in the fo llowing<br />
declaration:<br />
real D(10, 5 : 2 4, -5 :M), E(0:9, 20, M+6)<br />
In Fortr:�n 77 S)'ntax, an assignment to every ele<br />
ment of D requires triple-nested loops such as rhe<br />
example shown in Figure 8.<br />
line:<br />
In Fortran 90, this code can be expressed in a single<br />
D = 2.5*D+E+2 .0<br />
The FORALL Statement The FORALL statement is an<br />
H PF extension to the American National Standards<br />
Institute (ANSI) Fortran 90 standard but has been<br />
included in the dratt Fortran 95 standard .
do i = 1 , 10<br />
do j = 5, 24<br />
do k = -5, M<br />
D(i, j, k)<br />
end do<br />
end do<br />
end do<br />
Figure 8<br />
An Example of'' Triple·nes[cd Loop<br />
FOIW...L is a generalized fo rm of Fortran 90 array<br />
assignment syntax that allows a wider variety of array<br />
assignments to be expressed . For example, the diago<br />
nal of an array cannot be represented as a single<br />
Fortran 90 array section . Theretore, the assignment of<br />
a value to every element of the diagonal cannot be<br />
expressed in a single array assign ment statement. It<br />
can be expressed in a FOIW...L statement :<br />
rea l, dimension(n, n) A<br />
fora l l Ci = 1 : n) ACi, i ) = 1<br />
Although FORALL structures serve the same pur<br />
pose as some DO loops do in Fortran 77, a FORALL<br />
structure is a parallel assignment statement, not a<br />
loop, and in many cases prod uces a difterent result<br />
fr om an analogous DO loop. For example, the<br />
FORALL statement<br />
fora l l ( i = 2:5) CCi, i ) CCi-1 , i-1 )<br />
applied to the matrix<br />
ll 0 0 0 0<br />
0 22 0 0 0<br />
c 0 0 33 0 0<br />
0 0 0 44 0<br />
0 0 0 0 55<br />
produces the to llowing result:<br />
On the other hand , the apparently similar DO loop<br />
do i = 2, 5<br />
C(i, i ) = C( i-1 , i-1 )<br />
end do<br />
produces<br />
c<br />
11 0<br />
0 11<br />
0 0<br />
0 0<br />
0 0<br />
0 0 0<br />
0 0 0<br />
ll 0 0<br />
0 ll 0<br />
0 0 11<br />
2 . 5 *D( i , j, k) + E(i-1 , j-4, k+6) + 2.0<br />
This happens because the DO loop iterations are per<br />
to rmed sequentially, so that each successive element of<br />
the diago nal is updated betore it is used in the next<br />
iteration. In contrast, in the FORALL statement, all<br />
the diagonal elements are fetched and used betore any<br />
stores happen .<br />
The Ta rget Machine<br />
<strong>Digital</strong>'s DEC Fortran 90 compiler generates code<br />
to r clusters of Alpha processors running the <strong>Digital</strong><br />
UNIX operating system. These clusters can be separate<br />
Al pha workstations or servers connected by a fiber dis<br />
tributed data interface (FDDI) or other network<br />
devices. (<strong>Digital</strong>'s high-speed GIGAswitch/FDDI sys<br />
tem is particularly appropriate.") A shared-memory,<br />
symmetric multiprocessing (SMP) system like the<br />
AJphaServer 8400 system can also be used. In the case<br />
of an Sl'v!P system, the message-passing library uses<br />
shared memory as the message-passing medium; the<br />
generated code is otherwise identical. The same exe<br />
cutable can run on a distributed-mernory cluster or an<br />
SM P shared -memory cluster without recompiling.<br />
DEC Fortran 90 programs use the execution envi<br />
ronment provided by <strong>Digital</strong>'s Parallel Sothvare<br />
Environment (PSE), a companion prod uct -'" PSE<br />
is responsible fo r invoking the program on multiple<br />
processors and for performing the message passing<br />
req uested by the generated code.<br />
The Architecture of the Compiler<br />
Figure 9 illustrates the high-level an:hitecture of<br />
the compiler. The curved path is the path taken<br />
when wmpi lcr command-line switches are set to r<br />
compiling programs that will not execute in parallel ,<br />
or when the scoping unit being compiled is declared<br />
as EXTRINSIC(HPF _LOCAL).<br />
Figure 9 shows the fr ont end, tr�mstorm, middle<br />
end , and GEM back end components of the com piler.<br />
These components fu nction in the to llowing ways:<br />
• The front end parses the input code and produces<br />
an internal representation containing an abstract<br />
syntax tree and a symbol ta ble. It pedorms exten<br />
sive semantic checking.'"<br />
Digit�! Tc( hnic;11 )oumal Vol. 7 No. 3 <strong>1995</strong> ll
12<br />
Figure 9<br />
Compiler C:om[10llents<br />
• The transform component pertorms the transtor<br />
rnation from globai-HPF to cxplicit-SPMD fo rm .<br />
To do this, it localizes the add ressing of data, inserts<br />
communication where necessary, and distribures<br />
p
LOW ER ph;1sc implements rhis lw turning c:tch<br />
forrr.1n 90 :liT:l\' assign ment into an cqui1·;1k11t<br />
rORA.l.L statement (acrualh', into the dorrcc repre<br />
sentation of one ). This unitorm rcprescnt:.1rion m c:1 ns<br />
rh:1t rhe compiler has t�1r tcll'er special cases ro consider<br />
than orhcn1·isc might be necess;1ry and leads ro no<br />
degr;1lbtion of the generJted code.<br />
DATA<br />
The DATA phase specities where dara lives. Pbcing<br />
and addressing data correctlv is one ofrhc major rasks<br />
ofrr:1 nstcmn . Ih: n: are :1 large number of possibilities:<br />
\V hcn :1 nluc is ;1\'ailable on c1 en· processor, it is<br />
S;1id ro be I'C1Jiicoted. When it is a1·aibblc on more than<br />
one hut nor :1 11 processors, it is said ro be ;wrtioll l '<br />
rej;/icotecl. for instance, a scalar m:ll' lil'c 011 onh' one<br />
processor, or on more than one processor. Tvpic:-rl h·, a<br />
scJI:Jr is replic1tcd-it lilTS on all processors. The rcpli<br />
cnion of scalar d:.1u m;1kes ktches cheJp because e;Kh<br />
processor has a copy of the req uested \';1lue. Stores to<br />
replic:1red sc::� Llr data can be expensive, howcl'e r, if the<br />
value ro be stored has not been replicated. In rhar cJse,<br />
the value ro be stored must be senr to each processor.<br />
The s:1me consideration applies to Jr-r::�vs. Arravs<br />
mJv lx: replicJted, in ll'hich c1se c1ch processor h;1S a<br />
copv ofJn crnirc :1rrav; or arravs m:1v be p;1rti;1llv rc pl i <br />
carcd, in ll'hich c1se c::�ch element o f r hc arr.l\' is ::tl :lil<br />
able on ;1 subset of the processors.<br />
hrrrl1crmorc, :l iTa\'S that :1re nor repl icated 111:11' be<br />
distributed Jcross the processors in se1crJI diftcrcn r<br />
t:lshions, as nplained abm'e , In t:K t, each dimension<br />
of each ;1 rrav ma\' be distributed indcpcndcmly of<br />
the other dimensions. The HPr m:�.pping direc tives,<br />
princip;1lil' ALIGN ;md DISTRI BUTE, give the pro<br />
grammer the :r bi lity to specii}' completely how each<br />
dimension of each arrav i s bid ouc DATA uses the<br />
inhm11;1tion in these directives to construct ;1 11 internal<br />
description m data space of the i:11·our of each arrav.<br />
ITER<br />
The ITER ph
14<br />
ln our impkmentation, the CJIIer ped(mns all<br />
remapping. If remapping is necess:�n·, AR
ea l A(100, 20), 8(100, 20)<br />
1 h pf$ distribute A(cyc l i c , block), B(cy clic, block)<br />
(a) Code Fragment<br />
fora l l (i = 2 : 99)<br />
A(i , k) = B(i , k)<br />
end fora l l<br />
m = my_ processor ( )<br />
if k mod 5 = Lm/4J then<br />
do i = ( i f m mod 4 = 0 then 2 else 1 ) , ( i f m mod 4<br />
end i f<br />
A( i , Lk/5Jl = B(i, Lk/5Jl<br />
end do<br />
(h) Pseudocode Cennared for Code Fragmenr<br />
Figure 11<br />
Code Fragtncnr and Pseudocode GcnCLHCd r(>r Code Fr"
16<br />
If the :11T,11'S A and B arc bid o u r as in Figu re 12 md<br />
if 13 is to be assigned to A, the n JIT�11· clcmems LJ( 4 ) ,<br />
13( S ) , �1 11d /J( 6 ) , all of 11 h ich li1 c 011 p roccssm 6,<br />
shou ld be scm to processor 1 . Clc�1d1·, \\'c do 11ot 11 :1nt<br />
to gcncr�ltc three distinct messages t(Jr this. Thcrd(Jre,<br />
we collect these tilrce clements ;1 11d ge nn�Hc one mes<br />
sage con taini ng :1ll three of them. This oamplc<br />
i nvolves full knowledge.<br />
Communic:�tions involving p�1rti
Figure 14<br />
Sh.1do11· Edges Fo r· c1 :\ccHcsr-ncighbor· Computation<br />
cor11munication involving even kss 0\'Crhead than<br />
general communicnion im·olving fu ll knowledge.<br />
• No local copying. If sh:1dow edges were not used ,<br />
then the t(lllowing stambrd processing would take<br />
pbce: �or c:1ch shitted-Jrr:Jy rdi.: rcncc on the right<br />
hand side of the ''ssignmcnt , shift the entire :1rray;<br />
thcn idcmi�· that parr ol . . thc shir'tcd array that lives<br />
loc1ll\' 011 each processor and create a local tem po<br />
r,m• to hold it. Some of that tcmporarv (the part<br />
reprcscmi ng our sh,ldo\\' edge ) II'Ould be moved in<br />
from :1 neighboring processor, and the rest of the<br />
tcmporarv \\'(lli ld be copied loc1llv ri·om the origi<br />
nal arLl\'. Our processing el im inates the need fo r<br />
the local tcmpor,u·,· and tix rhc local co�w, which is<br />
substami,ll t!. >r large '' rravs.<br />
• Greater local in· of rdi.: rcncc. When the :�cn!al com<br />
pur:�tion is pcdcmncd, grc.1tcr localin· of reference<br />
is ,1 chiC\ni because the sh,ldOII' edges (i.e., the<br />
rccci,·cd ,·a lues ) :11-c 110\\' p:�rt of the atTJ\', r:1ther<br />
than being '' tcmporan· somc\\·hcrc else in mcmon·.<br />
• Fcll'er mcss,1gcs. hn,1lly, the optimiz,nion also<br />
makes it possible fo r the compiler to sec that some<br />
messages m,1,. be combined into one message,<br />
rhnelw reducing the number of messages that<br />
must be sent. for insuncc, if the right-hand side<br />
of the '1ssignment st,Hc mellt in the above exam ple<br />
also co11t,1incd '1 term A( i + l ,.f + 1 ), C\'Cn though<br />
overlapping shJdo\\' edges and Jn additiona l<br />
shJdO\\' edge \\'ou ld now he in the diagonally adja<br />
cent processor, no :�ddirional communicuion<br />
\\'ou ld need to be generated .<br />
Reductions<br />
The SUM intrinsic fu nction of t-:ortrJn 90 takes an<br />
arrav argument and returns rhc sum of all its elements.<br />
Altcrn;�tivclv, SUM c1n return an '1 rrav ll'hose rank is<br />
one less thJn the rJnk of its '1 rgumem, and eJch of<br />
\\'hose ,.,1lucs is the sum of the clements in the argu<br />
ment along .1 line p'1rallcl to J spcciticd dimension.<br />
In either case, the rank of the result is less than that of<br />
the argument; thcrd(m:, SUM is rdi.: rred to ''s ,,<br />
reduction intrinsic. Fortran 90 includes J bmil\' of<br />
such reductions, :1nd HPF
18<br />
is repl icated or distributed , (2) the different distri<br />
butions that uch �liTa\· dimension might ha\·c, a11d<br />
( 3 ) whether the n:ducri on is comp lete or p:�rti�1 1 (i.e.,<br />
\\'ith a DL\11 argument).<br />
Run -time Preprocessing of Irregular Data Accesses<br />
Run-time preprocessing of i rregu l ar data accesses 1s<br />
a popu l:lr teehni
is a slice through either the water or the atmosphere.<br />
Partial difte rcntial equations relate the variables.<br />
The model is implemented using a �l nite-diftcrence<br />
method that approximates th e partial differential<br />
equations at each of the mesh points.1' Mode ls based<br />
on partial difterenti�ll equations are ar the core of many<br />
simulations of physical phenomena; finite difkrcnce<br />
methods arc commonly used �or sol ving such models<br />
on com puters.<br />
The shallow-water program is a widclv quoted<br />
benchmark, partly because the program is small<br />
enough to examine and tune careful l �', vet it per�orms<br />
real computation representative of many scientitlc si m<br />
ulations. Unlike SPEC and other be nchmarks, the<br />
source �or the shallow-water program is not controlled.<br />
The shal low-water benchmark was written in H PF<br />
and run i n p�1 ra llcl on workstation ��mns using PS E.<br />
There is no ex pl ici t message-passing code in the pro<br />
gram. We modified the Fo rtran 90 version that<br />
Applied Parallel Research used �or its benchmark data.<br />
The f90 / H Pf version of the program t
20<br />
of heat tlow and elcctrostJtic or gravitational poten<br />
tial. We have investigated a Poisson soh·cr using the<br />
conjugate-gradient algorithm. The code exercises<br />
both the nearest-neighbor optimizations and the<br />
inlining abilities of the DEC FortrJn 90 compiler.''<br />
Ta ble 3 gives the timings :111d speedup obtained<br />
on a 1000 X 1000 JrrJv. The h:1rdw:1re and sothvare<br />
contlgurations arc identical to those used for the<br />
shallow-water timings.<br />
Red-black Relaxation<br />
A common method of solving p:trtial difkrenti:tl<br />
equations is red-black relaxation -'" vVe used this<br />
method to solve the Poisson equation in J three<br />
dimensional cube. We compare the p:1rallelization<br />
of this algorithm tor a distributed-memorv svstem<br />
(a cluster of Digita l Alpha workstations) wirl� P�rallcl<br />
Virtual Machine ( PVM ), which is an explicit message<br />
passing li brary, and with HPF.l' These algorithms arc<br />
based on codes written by Klose, Wolton, and Lemke<br />
and made a\·ailable as pan of the suite of GENESIS<br />
distributed -memorv benchmarks.'"<br />
Table 4 gives the speedups obtained t(x both<br />
the HPF and PVM \'ersions of the program, which<br />
soh·es a 128 X 128 X 128 probJem, on a cluster of<br />
DEC 3000 Model 900 workstations connected bv an<br />
FDDI/GIGAswitch system. The speedups shown arc<br />
relative to DEC Fortran 77 code written for and run on<br />
a single processor. This table shows th:lt the HPF ,·er<br />
sion perf(m11S somewhat better than the PVM version .<br />
There is a significant ditkrence in the complexitv ot·<br />
the programs, however. The PVM code is quite intri<br />
cate, because it requires that the user be responsibJ e<br />
tor the block partitioning of the volume, and then tor<br />
explicitly copying boundary t:Kes between processors.<br />
By contrast, the HPF code is intuitive and tar more<br />
easilv maintained . The reader is encouraged to obtain<br />
the codes (as described abm·e) and compare them.<br />
Ta ble 3<br />
Ta ble 4<br />
Speedups of DEC Fortran 90/H PF<br />
and DEC Fortran 77/PVM on<br />
Red-black Code<br />
DEC Fortran 77<br />
DEC Fortran 77/PVM 7.01<br />
DEC Fortran 90/H PF 8.04<br />
,--- <strong>Number</strong> of Processors ,<br />
8 4 2 1<br />
3.73<br />
4. 10<br />
1.79<br />
1.95<br />
1.00<br />
1.05<br />
In conclusion, we have shown that important algo<br />
rithms fa miliar to the scientitlc and technical commu<br />
nity can be \\'ritten in HPF. HPF codes scale well to at<br />
le
References and Notes<br />
I. High Perfornuncc fortran Forum, "High Pc rtormJnce<br />
Fonran L1 ngu:1gc Specitic:�rion, Version 1.0,"<br />
Scient iji'i.: Pmg l'(.lllllllillp,, ,·ol. 2, no. I ( 1 993). Also<br />
Jvai!Jblc :1s Tcc bnicJI Report CR.PC-TR93300, Center<br />
tor ReseJrch on .Par;11lcl Computation, Rice Uni\'ersitv,<br />
Houston, Te x.; ;\11d via anonymous ftp ti·om<br />
tiL\JU.:s.ricc.edu in rhe dircctorv public/HPFF /d raft;<br />
version I .1 is rhe ti le hp(y l 1 .ps.<br />
2. C. Ko elbcl, D. Lo,·cmJn, R. Schreiber, G. Steele, Jr.,<br />
and M. Zoscl, The High Perj(;rmance Fonmn<br />
/-land hook (Cambridge, Mass.: J'vi iT Press, 1994 ).<br />
3. Dig ital J-tt;� h Perj'ormance Fortran 90 HPF and<br />
PSE Ma nual (Mavnard, Mass.: <strong>Digital</strong> Eq uipment<br />
Corporation, <strong>1995</strong> ).<br />
4. {)f:'C hll'irWI 90 La uguap,e !?ej'erence Manual (Maynard,<br />
i'vLlss.: <strong>Digital</strong> Equipment Corporation, 1994).<br />
5. E. Albert, K Knobc, j. Lukas, and G. Steele, Jr., "Compiling<br />
ForrrJl\ 8x Arrav Features fo r the Connection<br />
Machine Co1nputcr Svsn::m," S) 'mposium on Pam/lei<br />
Prog mmming L\perience tl'ith Applicatio ns.<br />
Lw·1p, 11agc>s. wnl ,)).'stems. ACM .sJG'PIAN. j ulv 19R8.<br />
6. K. Knobc, ]. Lukas, ;H \d G. Steele, Jr., " Massively Par<br />
;\! lei Dat
22<br />
2S. B. Boulter, "Perr(mn:mce E,·,llucHion ofH l'J-' J·( )r Scien<br />
ritic Compu ting," Pmc( ·et liii,U,S u/f ll:u,h l'crjiJf'l!7Cli/Ce<br />
Cumpu!ing one/ .\ i'/1/'(JJldiiJ..',. l.eclure . .\'u/es ill<br />
0JIIIjmlel' Scie7!Ce 91 il',Jl<br />
.md dei'L·Iopmcnr otrhc trcmst(mn componcm ofrhe IJ FC:<br />
�mrrcm 90 compikr. Bdcl!T joining Di);i!.ll in I 99 I , he 11 Js<br />
1111 ohed in the design ,\Jld del elo[m leJlt of compilers ;Jt<br />
C:oll11'·''-5, lnc., })l'ior ro rh.11, he 11 mkcd on COlllJ'i krs .md<br />
sotiw·.1re tools cH Rcll'rhcon Coq'. He holds '' H.S.r:. in<br />
cD!llJlUtn science and en!..!.im·eri ng from the l' ni,·ersit'l'<br />
of }\:J\n)I·JI;mia i J98.J. ) :\;ld ,\n ,\\\ Ill COJll})lJI'LT Sc'Jel;Ce<br />
ti·om 1\osron Uni1 crsm· ( 1990 ).<br />
\'ol. 7 :--lo .'\ 1'.19:i<br />
M. Regina Bolduc<br />
Rcgi n.1 l�olduc joined <strong>Digital</strong> in 1991; she is a principal<br />
so tiwc1rc engineer in rhe High l'ertc mnancc Computing<br />
CirouJ' 1\egin;l "·"s in1·oh cd in the de1·elopmem of the<br />
rrcmst(mn c1nd f)'()llf end conlpoJKilfS ofrbc DEC: �o1·n·.1n<br />
90 C0111J'ikr. l'ri())' to th is 11 ork, she 11 .1s ;1 :.cnior membLT<br />
Dfrhe tech11iol src1tLH C :ompa'>.s, Inc., 11·hcrc she 11 01·ked<br />
On the de.sign Cllld dcl·eJOJ'Il1CIH OrC0111piJers cl lld COJl) ['ikJ··<br />
gcner;ltor rools. 1\egi n;l recei1ed J lL\ . in mcHhcmcHics<br />
Ji·om Emm.111uel College in J 957.<br />
Jill Ann Diewald<br />
Jill Dic11;1 ld UlJIHihltted to rhe design and implcmcllfcl<br />
fiOil ofrhe tr.msl(mn COll\J'OllC!H ofthe DEC Forrr;Jn 90<br />
compiler. She is cl princiJ'''I sofm·c\rc engineer in the 1:-l igh<br />
Pe rt()J·mcmce C:ompuring (;roup. Before joining Digit;l l<br />
in 199 I , Jill ll 'as cl rcchniul coordin;Hor clr CompJss,<br />
lnc., ll'lleJ-c she helped design :�m1 den:lol' compikrs c1nd<br />
cumpiln-relcHL'd tool.s. Prior ro rhJt JX>sition, she 11 o1"ked<br />
.H lnno\;Hilc S\'SteJns TL·clllliLJues and D�rc\ 1\esou rcc.s,<br />
[nc. on J'J'OgraJnJning hngu.1ges rhJr prm ide economic<br />
.m.1hsis, modeling, cmd d.l tclb�sL· c:1pc1biliries t(Jr rhe ti ncm<br />
ciJi nurkeq,bcc. She hclS " B.S. in compurer science trom<br />
rhc l' ni,crsir1· of,\ lichigc1n<br />
Israel Gale<br />
lsr;\cl
Neil W. Johnson<br />
Rd(m: coming to Digiul in I 99 1, Neil johnson was a<br />
staff scicnrist :Jt Compass, Inc. He has more rhan 30 vears<br />
ofcxpnience in the dcvdopmenr of compilers, including<br />
work on the vc.:crorizarion and optimiLation phases and<br />
tools t( H· compiler devcloprnenr. As
24<br />
Design of <strong>Digital</strong>'s<br />
Parallel Software<br />
Environment<br />
<strong>Digital</strong>'s Pa ra llel Software Environ ment was<br />
designed to support the development and exe<br />
cution of scalable pa rallel applications on clus<br />
ters (fa rms) of distributed- and shared-memory<br />
Alpha processors running the <strong>Digital</strong> UNIX oper<br />
ating system. PSE supports the parallel execu<br />
tion of High Performance Fortran applications<br />
with message-passing libraries that meet the<br />
low-latency and high-bandwidth communica<br />
tion requirements of efficient parallel comput<br />
ing. It provides system management tools to<br />
create clusters for distributed para llel process<br />
ing and development tools to debug and pro<br />
file HPF programs. An extended version of dbx<br />
allows HPF-distributed arrays to be viewed,<br />
and a parallel profiler supports both program<br />
cou nter and interval sampling. PSE also supplies<br />
generic facilities required by other parallel lan<br />
guages and systems.<br />
Dig,it:ll TcchnicJl ]uurnal<br />
I<br />
Edward G. Benson<br />
David C.P. LaFrance- Linden<br />
Richard A. Wa rren<br />
Santa Wiryaman<br />
Digir:t l's PJr�1llcl Sofr11·;1t-c Etll'i ronmcnr ( PS E) 11 as<br />
dcsig:ncd ro support the dc,·clopmenr and execution<br />
of scabble par:rllcJ :t pplic:t rions on clusters (brms) of<br />
distri buted- :md slun:d -mcmorv Alpha pr
tr:msmissio11 con trol prorocoljinrcrncr protocol<br />
(TCP/lP). Netll"orking tech nologies em r:mge h·om<br />
si mple Ethernet to �DDI, AT M, and MEMORY<br />
CHA\:;-.:EL. l\1 r�1 l l e l e\ecution is most dli cie nr ,,·hen<br />
the imercon nect technolog1· oftcrs hi gh - ba nd11·idth<br />
and lm1·-latencv com munications ro the user ar the<br />
process level. When bu ildi ng a cluster t(x p�1 ralle l pro<br />
cessing, rhe bisectional ba ndwidth ofrhe communica<br />
tions b bri c shou ld scale with the number of processors<br />
in the cluster. In practice , such a contiguration cu1 be<br />
:�c hieved lw building clusters using Alph:1 processors<br />
�u1d D igi r�1l's GIGAswirch/FDDf as components in �1<br />
m u lti stage s11·itch configuration.45 Fi gures l �md 2<br />
sh o11· tii'O oamples of PS E cl uster contigu r:nions.<br />
Although the design center tor PSE is :1 set ofm�1ehines<br />
connected by �� high - speed local area i nterconnect, a<br />
cluster can be constructed that incl udes remote<br />
111�1chincs con nected Lw a ll"ide :�rca llCtll"ork.<br />
PS E is �� col lection of tn
26<br />
Figure 2<br />
l'S E Multistage S\\ itch C:ontig;ur,uion<br />
Before dc1·eloping I'Sf. t(Jr use \\ith Hf'F prog1·,1ms,<br />
Digi tal considned r11 o major altem,ni' cs : the d istribu<br />
ted compu ti ng cn1·ironmcnt ( DC E) ,md l'VM.'''<br />
(At that time, the mess,Jgc-rJassing i ntcrhcc I JVl l'l J<br />
stand ard efl(Jrr \\'JS in progress. �)<br />
Although a good model t(Jr client-server :1 pplicnion<br />
dcpl ovment, DCL 1s desi gned fo r usc wi th I'CillOte C:l'U<br />
resources 1i1 procedure calls to libraries. This model<br />
is 1·en· difkrcnr ti·om the data-p,u·allel :1 11d mcss,lgcpassi<br />
ng nature of distributed parallel processing. Irs<br />
svnch ronous procedure call model requires the oten<br />
si,·e use ofrh rc1ds. In ,,ddirion, DCE cont,lins '' si gnit�<br />
icant number of setup :�nd management Ll sks. For<br />
these reasons, we rejected the DCE environ ment .<br />
\'ol. 7 �"- 3 Jl)l)�<br />
Three m'1jm consider:�ti ons in out· choice to de1·clop<br />
PSE instead of using PVM \\'eJ'C stabilin·, ped(mllJilCc,<br />
the rr�1nsp�1rcnr us:�bilin· of J S\'lll lllctric multiproces<br />
sor. EK ilirics such as :\fS (to moullt/sh�lrc ri le s1·srcms<br />
�1mong machines, in particular working directories )<br />
:llld IKt\Hlrk in r(mn arion scn·ice ( ::--.: Is) (also kno\\'n JS<br />
"vcllo\1' p:1gcs" an d mcd to share password ri les) arc<br />
ri·c quemlv used ro set up :1 common S\'Stc m cm·iron<br />
rncnr. ln such �1 n cm·ironmc:nt, users em log into :m y<br />
m:�chinc �llld sec the same cnviron mcm. Other distrib<br />
uted cnvirolllllC11tS such JS Load Sluring F:�ci lity<br />
( L
fKtors can help an applicuion choose the best set of<br />
members fo r parallel execution. This d\'llJ.mic inform:� <br />
rion is collected bv a d:�emon process (farmd). The<br />
farmd daemon process executes as :1 privileged (root )<br />
process on each cluster member :�nd liste ns to r requests<br />
on a well-known clusrcr-spcciflc UDP/lP port.<br />
Multiple cluster members ddincd in the configura<br />
tion d:.uabasc :-tiT designated as lo::�d sen·e�·s. The load<br />
servers arc the central rcpositorv for the dvnJ.mic<br />
information for the entire cluster. Their farmd process<br />
periodicallv receives time-stamped upthtcs h· om the<br />
individual daemons. Applications querv the load<br />
servers for both static and dvnamic information .<br />
Applications do not themselves parse the database nor<br />
query the individu�ll farmd d�1emons running on each<br />
cluster member.<br />
Once PSE is inst::�llcd and configured, farmd is<br />
st:�rted each time the syste m is booted. Thc nJmc of<br />
the cluster that farmd \\'i ll service and the number of<br />
t�S E jobs (job slots) that \\'ill be :�I lowed ro run arc set.<br />
The inetd fa cility is used to restart farmd in res�)OilSc to<br />
UDP/IP connection requests, if farmd is nor run<br />
ning _,., Usc of the inetd fJ cility to start farmd impro,·es<br />
the avaibbility of machines to run PSE applications by<br />
transparenrlv restarting farmd in the case of a bilure.<br />
As farmd daemons are st:�rted, they attempt to<br />
establish TC P/IP connections \\'ith their neighbors as<br />
ddincd bv the PSE configur
Figure 3<br />
PSE Usc<br />
COM PILATION<br />
SOURCE<br />
"MYPROG F90"<br />
�<br />
FORTRAN 90<br />
OPTIONS r--- - COMPILER<br />
(E G., -WSF 4)<br />
LIBRARIES:<br />
• FORTRAN<br />
·RUN-TIME<br />
• PSE<br />
-+<br />
COMMAND LINE<br />
SWITCHES AND<br />
ENVIRONMENT<br />
VARIABLES<br />
I<br />
�<br />
OBJECT FILE<br />
"MYPROG.U<br />
�<br />
STANDARD<br />
LINKER (LD)<br />
t<br />
EXECUTABLE<br />
"MYPROG.,<br />
�<br />
EXECUTION<br />
(E.G., SHELL)<br />
� �<br />
CONTROLLING<br />
PROCESS<br />
1- I-f---<br />
process re lationship between the io_managcr and peer<br />
processes. This relationship is used fo r exit status report<br />
ing and process conrrol. It also enables or eases other<br />
activities, such as signal handling and propagation. Peer<br />
processes cre�ue com munication channels between<br />
themselves and perform standard !jO through a desig<br />
nated peer. Stand:trd I/0 is fo rwarded to and fr om the<br />
controlling process through the io_manager. Figure 4<br />
shows a PSE application structure.<br />
Applica tion Initialization<br />
Prior to the execution of JllV user code, an initializa<br />
tion routine executes automatically through fu nction<br />
alit\' provi ded by the linker and loader. The<br />
initialization routine impkrnenrs both the controlling<br />
process fu nctions :m d the HPf-specitic peer initializa<br />
tion. Because no explicit call is required , parallel HPF<br />
procedures can be used within non-HPF main pro<br />
grams, :md proper initialization will occur. A simple<br />
HPF mJin program can also be used with PS E to start<br />
up and manage a task-parallel application that uses<br />
PVM or MPI tor message passing.<br />
In general, the controlling process places peer<br />
processes onto members of;l partition , although hand<br />
placement of indil'idual peers onto selected members<br />
PEER SPAWN<br />
AND CONTROL<br />
STATIC<br />
DATABASE<br />
(E.G , DNS)<br />
�<br />
LOAD SERVER<br />
I<br />
DYNAMIC INFORMA TION<br />
(E.G., LOAD)<br />
CLUSTER<br />
--- ---------- ---,<br />
.- ---<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
-- ------- -------,<br />
I<br />
HOST<br />
I<br />
I<br />
(CLUSTER MEMBER) I I I I<br />
I<br />
I<br />
! HOST<br />
(CLUSTER MEMBER) :<br />
I:<br />
- - - - - - - - - - - - - - - _I<br />
1- -------------- I<br />
I :<br />
I<br />
I<br />
I<br />
I<br />
HOST [SMP]<br />
(CLUSTER MEMBER) :<br />
L--------------..J<br />
! HOST<br />
(CLUSTER MEMBER)<br />
PARTITIONS<br />
------------ --------<br />
is possible. To achieve efficiency and f1 irness in map<br />
ping a set of peers, the comrolling process consults<br />
with a load server for load-balancing information .<br />
Which members are used and the order in which they<br />
are used is based on each member's load average,<br />
number of CPUs, and number of available job slots.<br />
As an alternative, PS E may nup peer processes onto<br />
members based upon a user-selected mock of opera <br />
tion . In the dctault physical mode of operation, PSE<br />
maps one peer process per mem ber. In virtual mode,<br />
PS E a! loll'S more than one peer process per member,<br />
thereby cnJbling large virtual cl ustcrs. This is useful<br />
fo r developing and debuggi ng parJIIel programs on<br />
limited resources. Virtual clusters also i mprove appli<br />
cation availability: when the requested number of peer<br />
processes is greater th an the :wailable set of partition<br />
members, applications continue to run; however, they<br />
may sufkr perf(mnance degradation.<br />
Application Peer Execution<br />
Each application peer process has an io_manager<br />
parent process that provides it with env iron ment<br />
initialization, exit value processing, l/0 burkri ng,<br />
signal to rwarding, and potemial scheduling priority<br />
and policy modification. Ra ther than include the<br />
<strong>Digital</strong> Technic:1l Jou rn:1l Vol . 7 No. 3 <strong>1995</strong> 29<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I<br />
I
30<br />
Figure 4<br />
KEY:<br />
CONTROLLING<br />
PROCESS<br />
PS E Ap1>l icuion Srn�mcesses exit ,,·irhour error and :tt<br />
�1pproxim;nch· the S
nonn�1l �lcti,·irY ll'ill cease after reeci1·ing the bsr nit<br />
noriricuion h·om an io_managcr.<br />
Peers rh�lt oit abnormalh· J11r the JpplicJtion.<br />
Consider one peer process that exits due ro �l segmen<br />
tation bult and another that exits due to J tl oaring<br />
poinr exception. There is no single exit value possi ble<br />
ti:>r the :1 pplication; PSE chooses the tirsr abnornJJI<br />
va lue it sees. �urrhcrmorc, as a result of error detec<br />
tion in the comnHlllication Jibrarv, the other peer<br />
processes wi II exit ll'ith lost network connections. It is<br />
possible that the controlling process ll'ill sec an exit<br />
l':tlue ti: >r this cfkcr before it sees an exit ,.�1Im: ti:>r one<br />
ofrhc c1uses, resulting in r the core directo<br />
ries cJn be set through an en1 ironmcnt vari�1blc.<br />
Issues<br />
Although l'S E Jchic,·es the srambrd UNIX look-andkel<br />
f(>r most 3pplication situations, complete trans<br />
parency is nor achieved. f:or ex.�1mple, riming au<br />
applicarion-colltrolling process using the c-shell's<br />
built-in rime command, does not time user code or<br />
�) f'ol·ide meaningful statistics other rhJn the cbpscd<br />
ll'all clock rime to starr a p:lrallcl application and ro teJr<br />
it do11 n. Another situ:Jtion that highlights the p�lr:JIIcl<br />
n�1rure of I'S E occurs during applicnion debugging:<br />
multiple debug sessions arc StJrtcd lw running the<br />
:1 pplic::nion ll 'ith :1 debugger flag ra ther th�111 l11' using<br />
dbx direcrh·.<br />
Tools for HPF Progra mming<br />
The dc,·elopmellt model t( >r H Pf-b�1scd :1 pplicnions<br />
is �l rwo-srcp process. First, a serial I-'orrran 90 progrJm<br />
is wri tten, debugged , and optimized. Then it is paral<br />
lelizcd with H PI-' directives and JgJin debugged and<br />
optimized . The de1·elopmenr tools supplied with PS E<br />
Jddrcss 11roriling and debuggi ng. Unlike most of PSI:::,<br />
ll'hich is l:1nguagc -independent, both rhe pprof protil<br />
ing bcilitY and the "dbx in n ll'indoll's" debugging<br />
t:Kilin· �liT spccitic to HPF programming.<br />
Pro filing<br />
Sc,·eral issues in protlling par:�llcl HPF programs do<br />
not appl:' ro fortrJn progrr the execution of each evenr.<br />
This produces an accurate execution profile. El'cnts<br />
include rou tines, JtTa\' assignments, do loops, FORALL.<br />
constructs, messJge sends, and message rcceil'es.<br />
Because the cntn' ::1nd exit rimes Me recorded, rime<br />
spent executing other ei'Cllts within an ei'Cnt is<br />
included, which gi1'CS a hicrarchicJI profi le. To achiC\'C<br />
fi ne-resolution timings (single-digit na noseconds), the<br />
AJpha process cvcle coumcr is used to measure ti me.'''<br />
Analysis The pprof urilirv prol'ides many different<br />
ways to en mine and report on a large ser of protiling<br />
data ti·om a p3rallel program execution. Diftc rmr<br />
approaches include f( >eusi ng on routines, statements,<br />
or communicnions. In contrast, prof reports on proce<br />
dures onlv. With pprof, the scope of the analvsis cJn be<br />
limited to a single peer process or cncompJss all Jppli<br />
cation processes. The range of reports generated can be<br />
comprehensive or limited to a number of c1·enrs or<br />
Digir.1l Techniol )ournll Vol. 7 :-.Jo. 3 199S 31
�' perecnc1gc of time. Users em spccir\• their reports<br />
fi·om a combination of a nalysis, report tcm11:-tt, :� nd<br />
scoping options . By debult, rhe pprof u rilitv reports on<br />
routine-level acri,·itY a1·eraged across a ll peer pmcesscs,<br />
11·hich prm·ides an 0\nal l 1·ie11· of :-tppliCJtion beh;n ior.<br />
Parallel programs execute most eftl.cicnrl1· ll'hen<br />
there is minimum communicnion betll'ccn processes.<br />
The high-level, data par.1llel natu re of the HPf<br />
language reduces the 1·isibilitv of· comnu1 11icnion to<br />
the programmer. To make tuning easier, pprof 11·:-ts<br />
designed ll'ith the abilin· to tclCus tuning 011 communi<br />
cation. Reports can be gc nc r;Hcd th
is implemented using J I"CCtor or' pointers to fu nctions<br />
II ithin J sh�1red Jibr:trl', pr01 ides rl e\ibi!in· to illtmd uce<br />
ne11· prorocols �1 11d media without ha,·ing to recompile<br />
or relink existing applications.<br />
Shared-memory Message Passing<br />
The usc of shared mcmorv as a mcssage-pass1ng<br />
medium :l i!OWS [()!' I'Cr V high pcrf()rlllancc because<br />
data docs 11or have to be copied. When design i ng<br />
shJred-memorv messaging, we looked :H a variety of<br />
interrcbted issues, includ i ng coordin�nion mecha<br />
nisms, memorv-sharing strategies, and memorv con <br />
sumption. The usc or- locks (i.e., semaphores ) in the<br />
trJditional m:1nner to coordinate access ro shared<br />
mcmon· segments pron:d problem atic. �or oample,<br />
clicnrs often req uest a message ti·om
34<br />
The UDPIIP option provides better bandwidth<br />
than the TCP liP with smaller messages and matches<br />
the TCP I r P bandwidth at large message size. The<br />
user-level latency reduction, however, was less than<br />
expected. The next t\vo sections discuss our investiga<br />
tion into ways to optimize the latency ofUDPIIP and<br />
the performance of the message-passing options.<br />
Optimizing UDP/IP<br />
Our initial approach to improve latencv was to reex<br />
amine the standard UDPilP code path within the<br />
<strong>Digital</strong> UNIX kernel fo r unnecessary overhead . Our<br />
idea was to create a faster path, optimized tor a<br />
UDPIIP over a local area network (LAN ) configura<br />
tion by reducing numerous conditional checks in the<br />
path. Although this work yielded some improvement,<br />
it was not enough to justit)' supporting a deviation<br />
from the standard code path. An overhaul of the origi<br />
nal code path would have been necessary for this<br />
approach to gain significant improvement in latency.<br />
UDP llP provides a general transport protocol,<br />
capable of runni ng across a range of network inter<br />
faces. We realize the value in retaining the generality<br />
of UDPIIP. For optimal performance, however, we<br />
anticipate typical cluster configurations being con<br />
structed using a high-perkmnance switched LAN<br />
technology such as the GIGAswitchiFDDI system.;<br />
In such configurations, the IP fa mily of protocols<br />
presents unnecessary protocol-processing overhead.<br />
A messaging system using a lower-level protocol, such<br />
as native FDDI, wou ld ofter better late ncy, but its<br />
implementation requires the usc of nonstandard mech<br />
anisms to access the data link layer directly, which is less<br />
general and portable than a UDPIIP implementation.<br />
Based on the above observations, we designed a new<br />
protocol stack in the kernel, called UDP_prime, to<br />
coexist with the standard UDPilP stack. UDP_prime<br />
packets conform to the UDP I IP specirication.'9 To<br />
reduce the amount of per-packet processing and<br />
approach that of a lower-level protocol, UDP_ prime<br />
imposes several restrictions on its usc. These restric<br />
tions optimize the typical switched LAt"J cluster config<br />
urations. To retain the generality of UDPIIP,<br />
UDP_prime falls back to rbe standard UDPilP stack<br />
when these restrictions are not appucable.<br />
Restrictions on UDP_ prime<br />
The LAN nature of the cluster configuration imposes<br />
a restriction on UDP _prime. Each cl uster member has<br />
to be within the same JP subnet, which is directly<br />
accessib.le from any other member. With this restric<br />
tion, routing decision and internet-to-hardware<br />
address resolution can be done once tor each peer<br />
to-peer connection rather than on a per- packet basis.<br />
Per-packet UDP I IP checksum processing can also<br />
be eliminated, because intermediate routing is not<br />
<strong>Digital</strong> <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong><br />
involved and the data link cyclic redundancy check<br />
( CRC) is sufficient to guarantee error-tl-ee packets.<br />
The next restriction is the maximum length of rhe<br />
message. PSE message passing uses fixed-size bufters.<br />
UDP_prime restricts the maximum bufter size ro be<br />
the maximum transmission unit (MTU) ofrhe underly<br />
ing network interface. This eliminates per-message IP<br />
fragmentation and defragmentation overhead. Since<br />
the messaging clients have to fi-agment the messages<br />
inro fi xed-size buffers at the higher layer, there is no<br />
need fo r rhe IP layer to perform fu rther fragmentation.<br />
One complication in our current implementation<br />
occurs when multiple peers are running on a single<br />
system while others are on remote systems. The<br />
default behavior for peers within a single system is<br />
to communicate across the loopback intert�Ke. In this<br />
situation, there are two MTU values, one fo r the net<br />
work interface and one for the loopback interface .<br />
Our currenr implementation of UDP_prime does not<br />
allow communication over the loopback interface so<br />
that a single-size MTU can be used. rurrher studies<br />
need to be done to fi nd an optimal maximum buffer<br />
size, taking into account multiple MTU values, page<br />
alignment, and so forth.<br />
Based on the above restrictions, UDP_ prime opti <br />
mizes the per-packet processing overhead of sending a<br />
packet by constructing a UDP, IP, and data link packet<br />
header template for each peer ar initialization. Except<br />
for a few fields, the content of these headers is static<br />
with respect to a particular peer. UDP _prime defines a<br />
new IP option, IP _UDP _PRI ME, tor the setsockopt( )<br />
system call, to allow the messaging system to define<br />
the set of peers and their Internet addresses involved in<br />
the application execution."' The IP option processing,<br />
done prior to sending any message, makes routing<br />
decisions, performs Internet-to-hardware address res<br />
olution, and tills in the static portion of the header<br />
fields. When sending a packet, UDP _prime simply<br />
copies the header template to the beginning of the<br />
packet, minimizing the per-packer processing over<br />
head and increasing the likelihood of the templates<br />
being in the CPU cache. Several header fi elds, such as<br />
the IP identification, header checksum, and packet<br />
length fi elds, are then filled dynamical ly, and the com<br />
plete packet is presented to the interface layer.<br />
UDP_ prime Packet Processing<br />
Since a UDP_prime packer is a UDPIIP packet, the<br />
standard UDP I IP receive processing can handle the<br />
packet and deliver it to the messaging client. To trig<br />
ger the use of UDP_prime optimized receive process<br />
ing, the sending system uses the type of service (TOS)<br />
field within the IP header to specif)' priority delivery of<br />
the packet." The priority delivery indication does not<br />
by itself uniquely differentiate bet\veen UD P _prime<br />
and UDPIIP packets, as any other IP packets can<br />
also have the TOS ti.eld set to priority. As a result, the
optimized receive processing has to check for the<br />
packer's adherence to the UD P _prime restrictions.<br />
Nonadherence to the restrictions reroutes the packet<br />
to the standard receive processing code.<br />
\tVhen a packet arrives at a network interface, the<br />
imertace posts a hardware interrupt, and the interface<br />
interrupt service routine processes the packer. The<br />
standard interrupt service routine deletes the data link<br />
header, and hands the packet over ro the netisr ke rnel<br />
thread ." Netisr clemultiplexes the packet based on<br />
the packet header contents and delivers it to the appli<br />
cation's socket receive bufkr. Netisr, designed to be<br />
a general-purpose packet demultiplexer, runs at a low<br />
interrupt priori ry level. The main reason fo r a thread<br />
based demultiplexer is extensibility. New protocol<br />
stacks can be registered to the thread . Since there is<br />
no a priori knowledge of the execution and SMP lock<br />
ing requirernents of these stacks, a thread-based low<br />
interrupt priority demultiplexer is needed so that the<br />
network interrupt processing rime can be held to a<br />
minimum. The extensibi lity feature, however, intro<br />
duces a context switch overhead.<br />
For UDP _prime, the packet header processing ti me<br />
on the receive path is almost a small constant. We<br />
modified the interface service routine to demultiplex<br />
the packet by processing the data link, IP, and UDP<br />
headers, and deliver the packet to the socket receive<br />
buffe r without handing it over to netisr. This short cir<br />
cuit path is used only when the packet is a UDP / IP<br />
packet with no 1 P fi·agmentation and with priority<br />
delivery indication . If these cond itions are not met,<br />
the standard netisr path is chosen. The UDP_prime<br />
receive path eliminates the netisr context switch over<br />
bead . This is a significant advantage, especially when<br />
the receiving application runs with a rea l-rime FIFO<br />
scheduling policy.<br />
SMP Synchronization<br />
One difficulty in designing the UDP_prime stack to<br />
run in parallel with the standard UDP/IP stack was<br />
in SMP synchronization .2-' The socket butle r structure<br />
is a critical section guarded by a complex lock.<br />
Requesting a complex lock in <strong>Digital</strong> UNIX blocks<br />
execution if the lock is ta ken. To prevent deadlocks,<br />
its use is prohibited at an elevated priority level, such<br />
as the case in the interrupt service routine. To work<br />
around this problem, a new spin lock was introduced<br />
in the short circuit path and in the socket layer where<br />
access to the socket buffe r needs to be synchronized.<br />
Performance<br />
To measure message-passing performance, we used<br />
two DEC 3000 Model 700 workstations connected by<br />
a GIGAswi tch/FDDI system using TURBOcbannel<br />
based DEFTA fu ll-duplex FDDI adapters. Each work-<br />
station contained a 225-megahcrrz (MHz) Alpha<br />
21064 microprocessor and was running the <strong>Digital</strong><br />
UNIX version 3.0 operating system.<br />
Figure 6 shows the message-passing bandwidth fo r<br />
TC P/ IP, UDP/IP, and UDP_prime transports at dif<br />
ferent message sizes. The bandwidth was measured at<br />
the message-passing application programmer interrace<br />
(API) level, raking into account al location and deallo<br />
cation of each message buffer in addition to the data<br />
transmission. TCP/IP, UDP/IP and UDP_prime<br />
bandwidth peaks at approximately 95 megabits per<br />
second at a 4,224-byte message, approaching the<br />
FDDI peak bandwidth. UDP/IP approaches the peak<br />
bandwidth at a 1 ,4 00-byre message, and UDP _prime<br />
at a 1,024-byte message. Reaching the peak band<br />
width using small messages is a measure of protocol<br />
processing efficiency.<br />
Figure 7 shows the minimum message-passi ng<br />
latency fo r TCP/ IP, UDP/IP, and UDP_prime<br />
transports at diffe rent message sizes. The latency was<br />
measured as balfofthe minimum rime to send a nl CS<br />
sage and receive the same message looped by the<br />
receiver system over many iterations. The measure<br />
ment made allowance fo r the allocation and deallo<br />
carjon of each message bufter, in addition to the<br />
round-trip transmission .<br />
Compared to the TCP/IP option, UDP/IP has a<br />
slightly higher minimum latency. This is not expected,<br />
because the original goal ofrhe UDP/IP option was to<br />
reduce TCP/IP processing overhead . It is, however,<br />
encouraging to see only a slight degradation in latency<br />
when the reliable in-order delivery protocol is imple<br />
mented at the library level. This prompted us to use<br />
the same protocol engine in the libra ry to r<br />
UDP_prime. At a very small m.essage si ze (4 bytes),<br />
100<br />
90<br />
80<br />
0 70<br />
z<br />
0 60<br />
(_)<br />
lli 50<br />
{7j<br />
1- 40<br />
m<br />
::;;; 30<br />
20<br />
10<br />
KEY:<br />
r - . /<br />
/<br />
/<br />
/<br />
/<br />
/<br />
/<br />
/<br />
...... - /<br />
I<br />
I<br />
I<br />
I<br />
I .<br />
.<br />
0 500 1 000 1500 2000 2500 3000 3500 4000 4500<br />
-- UDP _PRIME<br />
Figure 6<br />
TCP/IP<br />
UDP/IP<br />
Peer-to- Peer Bandwidth<br />
MESSAGE SIZE (BYTES)<br />
Digiral <strong>Technical</strong> Journal Vol. 7 No. 3 <strong>1995</strong> 35
36<br />
1000<br />
900<br />
(f) 800<br />
0<br />
z 700<br />
0<br />
� 600<br />
� 500<br />
a:<br />
� 400<br />
2 300<br />
KEY<br />
200<br />
100<br />
· /<br />
- �<br />
/<br />
/<br />
0 500 1000 1500 2000 2500 3000 3500 4000 4500<br />
-- UDP PRIME<br />
Figure 7<br />
TCPII P<br />
UDPIIP<br />
;\r[inimum Peer-to-Peer Larctll\.<br />
MESSAGE SIZE (BYTES)<br />
prorocol processing overhead domin;ltcs rile Lncncv.<br />
At rhis poinr, UDP_primc ll"
References<br />
I. <strong>Digital</strong> High Pe Jfrmnance Fo rtran 90. HPF a nd<br />
PSI;" /VIa mwl (M:�vnard , M:1ss.: Digiral Equipment<br />
Corporation, Orde r No. AA-2ATAA-Te , <strong>1995</strong> ).<br />
2. G. Be ll, "Sc:�lablc, Pclt"
Richard A. \Varren<br />
Richard W;uTcn is cl prin..:ipc1l software engi neer in the<br />
High Pedornunce Computi ng Group, ll'hcre his [1 rimclrl'<br />
n: sponsibilirY is the design Jnd dcl'clopmcnr of Digi tc1 l's<br />
pclr:JIIel softii':Jrc cm·ironmcnr. Since joining <strong>Digital</strong> in<br />
1977, Ri ch,mi hc1s conrrilJuted to PDI'- 11 SI'Stems dcl·cl<br />
opmcnr and \'.-\ \ 32 -bi r shcHed-mctnon· tnul rip rocnsor<br />
designs. He h:1s c1lso been
An Overview of the<br />
Sequoia 2000 Project<br />
The Sequoia 2000 project is the joint effort<br />
of computer scientists, earth scientists, gov<br />
ernment agencies, and industry partners to<br />
build a better computing environment for<br />
global change researchers. The objectives of<br />
this widely distributed project are to support<br />
high-performance 1/0 on terabyte data sets,<br />
to put all data in a database management<br />
system, and to provide improved visualization<br />
tools and high-speed networking. The partici<br />
pants developed a four-level architecture to<br />
meet these objectives. Chief among the lessons<br />
learned is that the Sequoia 2000 system must<br />
be considered an end-to-end solution, with all<br />
pieces of the architecture working together.<br />
This paper describes the Sequoia 2000 project<br />
and its implementation efforts during the first<br />
th ree years. The research was sponsored by<br />
<strong>Digital</strong> Equipment Corporation.<br />
I Michael Stonebraker<br />
The purpose of the Sequoia 2000 project is to build a<br />
better computi ng environment to r global change<br />
researchers, hereafter referred to as Sequoia 2000<br />
clients. These researchers investigate issues such as<br />
global warming, ozone depletion, envi ron ment toxiti<br />
cation, and species extinction and are members of<br />
earth science departments at universities and national<br />
laboratOries. A more detailed conception tor the proj<br />
ect appears in the Sequoia 2000 technical report<br />
"Large Capacity Object Servers to Support Global<br />
Change Research."1<br />
The participants in the Seq uoia 2000 project are<br />
investigators of to ur types: ( 1) computer science<br />
researchers, (2) earth science researchers, ( 3) govern<br />
ment age ncies, and ( 4) industry partners.<br />
Computer science researchers arc responsible fo r<br />
building a prototype environment that better serves<br />
the needs of the target clients. Participating in<br />
the Sequoia 2000 project are invesrigarors from the<br />
Computer Science Division at the University of<br />
California, Berkeley; the Computer Science Depart<br />
ment at the Universi ty of Ca lifornia, San Diego; the<br />
School of Library and Information Studies at the<br />
University of California, Berke ley; and the San Diego<br />
Supercomputer Center.<br />
Earth science researchers must explain their needs<br />
to the computer science im'estigators and use the<br />
resu lting prototype environment to perform better<br />
earth science research. The Sequoia 2000 project<br />
comprises earth science investigators fr om the<br />
Department of Geography at the University of<br />
California, Santa Barbara; the Atmospheric Science<br />
Department at the University of California, Los<br />
Angeles (UCLA); the Climate Research Division at<br />
the Scripps Institution of Oceanography; and the<br />
Department of Earrh, Air, and Water at the University<br />
of Calitornia, Davis.<br />
To ensure that the resulting computer environment<br />
addresses the needs of the Sequoia 2000 clients, gov<br />
ernment agencies that are affected by global change<br />
matters participate in the project. The responsibi l ity of<br />
these agencies is to steer Seq uoia 2000 research<br />
toward achieving solutions to their problems. The<br />
government agencies that participate are the State of<br />
Ca lifornia Department of Water Re sou rces ( DWR),<br />
Digir
40<br />
the SLnc of Cal itorni�1 Dcp�1rtment of forcsrn·, rh e<br />
Coord inated Environment Research L1boJ:�non·<br />
(CERL) or the United States Arnw, the N;uion�; l<br />
Aeronautics and Space AdJllJJlisn·:tti�n (\:AS,-\), rhc<br />
:\'ation�1l Occ.mic and Atnws�1hcric Adminisrr�Hion<br />
(�0,-\A), Jnd rhe United Sr:1res Geologic Sun·C\·<br />
(USCS).<br />
The task or' the ind ustr\' participants is to use<br />
the Seq u oia 2000 tech nol og\'
APPLICATIONS -<br />
t<br />
DATABASE<br />
MANAGEMENT<br />
SYSTEM<br />
NETWORK t<br />
Figure 1<br />
The Scquoi� 2000 Architecrun:<br />
The File System Layer<br />
FILE SYSTEMS t<br />
FOOTPRINT<br />
t<br />
STORAGE DEVICES<br />
On top of the fo otprint layer is the ti le system layer.<br />
Two ri le systems manage data in the Bigt-oot multilevel<br />
memory hierarchy. The fi rst t-i le system is Highlight,<br />
which extends the Log-structured File System ( LFS )<br />
pioneered ror disk devices bv Ousterhout and<br />
Rosenblum to tertiary memory.1··' The original LFS<br />
treats a disk device as a single continuous log onto<br />
which newlv written disk blocks are appended. Blocks<br />
are never overwritten, so a disk de,'ice can always be<br />
written sequentially. Hence, the LFS turns a random<br />
write environ ment into a seq uential-write environ<br />
ment. In particu lar problem areas, this may lead to<br />
much higher performance. Benchmark dar� support<br />
this conclusion 4 In add ition, the LFS can always iden<br />
ti f:}• the last kw blocks that were written prior ro a t-i le<br />
S}'Stem fa ilure by tinding the end of the log at recovery<br />
time. File system repair is then very t-a st, because<br />
potentially damaged blocks arc easily ro und. This<br />
approach dift-ers 6-om conventional ti le svstem repair,<br />
where a laborious check of the disk must be performed<br />
tO ascertain disk integrity.<br />
Highlight extends the LFS to support tertiary mem<br />
ory by adding a second log-structured t-i le system on<br />
top of the footpri nt layer. This ti le system also writes<br />
tertiary memory blocks seq uentially, thereby obtain<br />
ing the pedcmmnce characteristics of the LFS. The<br />
Highlight ri le system adds migration and bookkeeping<br />
code that treats the disk LFS file system as a cache tor<br />
the te rtiary memory tile system. In summarv,<br />
Highlight should provide good performance t6r<br />
workloads that consist of mainly write operations.<br />
Since Sequoia 2000 clients want to archive vast<br />
amounts of data, the Highlight file system has the<br />
potential tor good performance in the Sequoia 2000<br />
environment.<br />
The second ti le system is Inversion.; Most DBMSs,<br />
including the one used fo r the Seq uoia 2000 project,<br />
support binary large objects (BLO Bs), which are<br />
arbitrary-length byte strings of variable length. Like<br />
several commercial systems, Sequoia's data manager<br />
POSTGRES stores large objects in a customized<br />
storage system directly on a raw storage device.6 As<br />
a result, it is a straightforward exercise to support con<br />
ve ntional files on top of DBMS large objects. In this<br />
way, the fi·onr end turns every read or write operation<br />
into a query or an update, which is processed directly<br />
by the DBMS. Simulating t-i les on top of DBMS large<br />
objects has several advantages. First, DBMS services<br />
such as transaction management and security are auto<br />
matically supported tor t-i les. In addition, novel charac<br />
teristics of our next-generation DBMS, including time<br />
travel and an extensi ble type system tor all DBMS<br />
objects, are automatically available t-or tiles. Of course,<br />
the possi ble disadvantage of simulating files on top of<br />
a DBMS is poor performance . As reported by Olson ,<br />
Inversion performance is exceedingly good when large<br />
blocks of data are read and written, as is characteristic<br />
of the Seq uoia 2000 workload.;<br />
At the present ti me, Highlight is operational but<br />
very buggy. Inversion, on the other hand, is used to<br />
manage production data on Sequoia's Sony WORM<br />
jukebox. Unfortu nately, the reliability of the proto<br />
type system has not met user expectations. Sequoia<br />
2000 cliems have a strong desire fo r commercial off<br />
the-shelf(COTS) software and are fr ustrated by docu<br />
mentation gl itches, bugs, and crashes.<br />
As a result, the Sequoia 2000 project ream has also<br />
deployed two commercial t-i le systems, Epoch and<br />
AMASS. The Epoch file system is quite reliable but<br />
does not support either of Sequoia's large-capacity<br />
robots. Hence, it is used heavily but only fo r small d:�ta<br />
sets. The AMASS fi le system is just coming into pro<br />
duction use fo r Sequoia's Merrum robot and replaces<br />
an earlier COTS syste m, which was unreliable. Given<br />
the experience of the Sequoia 2000 team with tertiary<br />
memory su pport, tertiary memory users should care<br />
fu lly test all file system software .<br />
The DBMS Layer<br />
To meet Sequoia 2000 client needs, a DBMS<br />
must support spatial data such as points, lines, and<br />
polygons. In addition, the DBMS must support the<br />
large spatial arrays in which satellite imagery is natu<br />
ra lly stored. These characteristics are nor met by pop<br />
ular, general-purpose relational and object-oriented<br />
DB1\tlS s.7 The best fit to client needs is a special<br />
purpose Geographic Information System (GIS) or<br />
a next-generation object-relational DBMS. Since it<br />
has one such object-relational system, namely<br />
<strong>Digital</strong> <strong>Technical</strong> Journal Vo1. 7 No. 3 <strong>1995</strong> 41
42<br />
POSTGRES, the Seq uoi::t 2000 project elected to<br />
tix us its DBJ'vi S efforts on this s1·srem.<br />
To make the POSTGRES DBMS suit::tb l e fo r<br />
Sequoia 2000 use, we req ui re �1 schcm�1 t( >l. :�II Seq u oia<br />
d�1t:1. This ciat:1base design process h:ts CH>Ivcd as :1<br />
cooper::ttive exercise between v:1 rio us d:tt�1b�1sc experts<br />
:tt Bcrkelev, the San Diego Su percomputer Center,<br />
CERL, and SAJC. The Sequoi:� schc111�1 is rhc collec·<br />
rion of metadata that describes the d�1ta stored in the<br />
POSTGRES DBMS on Bigt(>ot. SpccitiClilv, these<br />
mcrad:Ha comprise<br />
• A standard \'Ocabularv of terms ,,·irh �1grced·u pon<br />
definitions that arc used to descri be the (b tJ<br />
• A set of rvpcs, instances of 11 hich m�n store data<br />
"''lues<br />
• A hier:1rchicll collection of cl:�sscs tlut describe<br />
3. A browsi ng capability for te xtual intixmation<br />
4. A fa cility ro interrace the UCLA General Circula<br />
tion Model (GCM ) to the POSTG RES/IIIustra<br />
SI'Stem<br />
5. A desktop ,·idcoconterencing or "picturephonc''<br />
tacilitl'<br />
For the off-the-shelf visualization tool, we have<br />
converged around the usc ofAVS and IDL fo r project<br />
activities. AVS has :m easy-to-use "hox�.:s-and-arrows"<br />
user interface, ll"hcrcas I DL has a more conventional<br />
linear programming notation. On the other hand,<br />
IDL has better tll"o-dimensional (2-D) graphics tea <br />
tllres. Both AVS and IDL a llo\\' the user to read and<br />
write ti le data. To connect to the DRMS, we have writ<br />
ten an AVS-POSTGR.ES bridge. This program allows<br />
the user to construct an ad hoc POSTG R.ES q uerv and<br />
pipe the result into :m AYS boxes-and-arrows net11·ork.<br />
Seq uoia 2000 clients can use AYS fo r fu rther process<br />
ing on any data retrieved fr om the DBMS. IDL is<br />
being interfaced to AVS by the vendor. Consequently,<br />
data retrieved fr om the database can be moved into<br />
IDL using AVS as an intermediarv. Now that \\ 'e have<br />
migrated to the Illustra DBMS, 1\'C arc consideri ng<br />
porting this AVS bridge to the II lustra :1pplicarion pro<br />
gramming interface (API ).<br />
AVS has sorm: disadvantages
+4<br />
Illustr::t has Jl1 itnegy�ued document d�n�1 tvpe 11 ith<br />
capabi l i ti es si m ilar ro the extensi ons 11·c m:-tdc ro<br />
POSTGRES.<br />
A related Berkclt\' project is r(K uscd 011 digiti;ing<br />
all the Berkcl c\· Computer Science Techni c�1 l Rq)orts.<br />
This project uses �1 Mosaic cl ient to �1ccess �1 custom<br />
World Wide We b scnn ca lled D i e nst, 11·h ich stores<br />
technical report objects in a L�IX ri le SI'Stcm. J n a tC \\'<br />
months, 11·e e x pect to con1·crr D ienst to stmc objects<br />
in the Seq uoi:-t 2000 d;na b:-tse, rather th;1n in ti les.<br />
\Vhcn thi s SI'Ste m, nicknamed Datab:�sc Di e nst, is<br />
operati onal, M osaic/ Dienst sen·icc ll'i II he �1 1'r all textual objects in the Seq uoia schem;l.<br />
Our fourth thrust in the application laver is �1 bcil i t\'<br />
ro i ntcr bce rhe LIC:LA Ccncral C:in.:ul.1tiot1 Model<br />
(GC:'vl ) to the POSTCRES/lllusrra SI'Stc nl . Th is pro<br />
gram is :1 "cbu pum p" bec1use ir pumps cbL1 our of<br />
the si mul;nion model :-tnd imo the DBMS. \Vc 11,1mcd<br />
rhc progra m "the big l ifr" attcr rhe DWR pu m p ing<br />
station that r:�i scs Northern C:tlit(m1ia 11 ;Hn m·er the<br />
Te hacha pi Mount;l ins inr.o Southern Cllit(m1i;l.<br />
Basicallv, the UC :l.A CC:M p rod uces :1 1ccrm or' sim<br />
ulation outpur ,.�1riables r()r each ti me stq) of;l l en gt h1·<br />
ru n tor each tile in ;1 three -dimensional (3-D) grid of<br />
the at mosp here ;1 11d occ;ln. De pe ndi ng 011 the scal e<br />
of thc model, its resolution, and rhe clp;lbilin· of the<br />
serial or parallel m:-tchine on 11·hich the mode l is running,<br />
the CCL\ GC:Ivl c�111 prod uce ti·om 0. 1 to 10.0<br />
megabytes per se cond ( J\'l B/s ) ou tpu t. The pu rpose of<br />
rile big litr is ro inst�11l rhc outpu t dar�1 into :1 chr�1 b;1se<br />
in re:d ti me. UCLA sci t· n rists can then usc Objccr<br />
K.n oll'led ge , Tiog�1, Tc c;ltc, AVS, or lDL ro ,·isualizc<br />
their simulation output. The big lirr 11 ill likcil h�11 c ro<br />
e x ploi t p ar: dldism in rhc d�Ha managn, if i t is req uired<br />
to keep up \\'ith rhe ncc u tion of the model on ;1 11L1s<br />
si ,·c ll· parall el �1 rc hirccrurc .<br />
The rifrh applic1tion S\'Stem is a con krc ncing SI'S<br />
tc m. Since Scq uoi:J 2000 is �1 distnburcd pmjcct, II'C<br />
lcarnccl e�1rlv that bcc-ro-L1cc meetings rh;H requ ired<br />
p;lrricip;m ts to tr;llcl to other si tes �1 nd el ectron ic mail<br />
\\'Cre not sufticic m ro keep proj ect members "·urking<br />
;ls J tc:am. Consequcml1·, 11·e pu rch�1scd umkrcncc<br />
room 1 idcocon krcncing eq u ip m ent t( >r each project<br />
sire. ·rhis tcc l1 11olog1 costs .lt)prmim�1tcll- S::iO,OOO per<br />
si te ;m d �• llo11 s m u l till'rc, 1·ideoconterencing te nds to be<br />
usee! fo r arra nged conkrcnccs and nor t
Sequoia 2000 as an End-to-End Problem<br />
The m�1jor lesson 11·c ha1 c le:m1ed ti·om the Scquoi�1<br />
2000 project is rh�1t manv issues being our clients c\n<br />
not be isobrcd ro �l single Lwcr of the Scq uoi�1 2000<br />
�1 rc hirecrurc. This section describes three such end-toend<br />
problems: gu:1ranrecd dcli1-crv, :� bstLKts, :111l1<br />
COil\ 1XeSS10n.<br />
Guaranteed Delivery<br />
Clearlv, guar:� ntced delivery must be an cnd-ro-end<br />
colltracr. Suppose a Scquoi:� 2000 clicnr wishes ro l'isu <br />
:d izc a s11ccitic compurarion; t( >r oample, rhe client<br />
11·a1HS to obscr1·e H u rricanc And rc11· as it mm·cs ti·om<br />
the B�1h�1n1:1s ro florid:� ro Louisi.111�1. Spcciti c1li l·, the<br />
clienr 11·ishes to 1'isu:1lizc appropriate s�udlirc im:�gcrv at<br />
a resolution of 500 X 500 in 8-bit color at 10 ti·:unes<br />
per second . Hence, the client req uires 2.5 MB/s of<br />
b:mdwidrh ro his screen. The t()IJ011·i ng sce 1 urio mi ght<br />
be the co111�1utation steps that t�1 kc pbcc.<br />
TilL DHMS must run a q u ery to krch the s�ucllitc<br />
im:�gerv. The querv might require retuming �1 16-bir<br />
d�lta l'�lluc t(>r each pixel that will ultimarclv appear 011<br />
the screen . The DBMS must rhcrd(>JT agree to occure<br />
the quen· in such �1 II'JI' rhat ir guaranrecs ourpur<br />
.1t �1 r�1tc of 5.0 tv! B/s.<br />
The stor:�gc s1·stcm at the sencr will krch some<br />
number of 1/0 blocks ti·om scconchn· and/or tertian·<br />
memo!'\'. DB,\IS quen· opti mizers em accuratch- guess<br />
ho11· mam· blocks thc1· need to rod to s�Hist\· the<br />
qucr1'. The DBMS can then e1sih· generate �1 guar:l ntccd<br />
dcliiLTI' comrxt that the sror::�gc 111:1n�1gcr must<br />
sarist\·, rhus �1 1ioll'ing the DBMS ro sarist\· irs contract.<br />
The network must �1 gree to dc]ii'Cr 5.0 MIVs over<br />
the network link that connects the clienr to rhc server.<br />
The Sequoia 2000 nctll'ork soft ll'arc expects cxactlv<br />
rh is tvpc of contract request.<br />
The 1 isu:1liz�1tion package must
46<br />
then the DBi\rlS may hal'e to decompress the data to<br />
perform the search. If an application resides on the<br />
same machine as the storage 111
finally, the computer scientists and the earth scientists<br />
r-cJiizcd then the word "bcnchmJrk" has a difkrem<br />
meaning t()r cJch of the two groups of researchers. To<br />
cJrth scientists, a benchmark is a scenario, \\·hcreas to<br />
com puter scientists, a be nchmark is an abstract example.<br />
This \'ignctte was typical of the experience these<br />
two disciplines had trying to understa nd one Jnother.<br />
Fundamentally, this process is time-consuming, �md<br />
ample inrerJction time should be planned tc.>r Jny project<br />
that must dcJI with multiple disciplines.<br />
The ScquoiJ 2000 project participants made dkctivc<br />
usc of "converters." A converter is a person of one<br />
discipline \\'ho is planted directly in the research group<br />
of :mother discipline. Through intorrn�11 communication,<br />
this person serves JS an interpreter and translator<br />
fi:>r the other discipline. Corwerters arc encouraged lw<br />
the existence of a ti:>rmal exchange progr:un, whcreb\·<br />
ccntL11 Seq uoia 2000 resources pa\' the li\·ing opcnses<br />
of the exchange personnel.<br />
Lesson 4: Da tabase technology is a major advance for<br />
earth scientists.<br />
Our initial plan was to introduce database technology<br />
into the project with the expectation that the earth scientists<br />
\\'Ould pick ir up and use it. Unfo rtunately, they<br />
:1rc accustomed to dat:1 being in ti les and t-()lmd it vcrv<br />
difticult to make the transition to a database vic\\·. The<br />
earth scientists arc becoming increasing!\· aware of<br />
the inherent ad\'Jntages of DBMS technolot, '\'.<br />
In ad dition, we
4�<br />
the Seq uoi;1 2000 eftorr. As ;1 result, project man;lge<br />
menr WJS pcrt(mned poorlv at best. In Jny t'ururc large<br />
project, this component should be �1 ddresscd sJtisbc<br />
rorily up fi·ont bv project personnel.<br />
Lesson 6: Multicampus projects are extremely difficult<br />
to implement.<br />
Sequoia 2000 '' ork is ra king place in seve n difkrcnt<br />
orgJ niz:ttions wi thin the Uni1·crsitv ofCJiit(nnia cd u<br />
cationJI S\'Stcm. There is a constJnt need to trJnsfer<br />
monel' and people among these org:1nizarions. Accom<br />
plishing such molTS is :1 difti cult and slm1· process,<br />
hm,·e, er, because of the burc1ucrac1· ll'ithin rhe S\'S<br />
rern. In Jddition, rhe personnel rules of the Uni,·ersitv<br />
are often in conflicr with the needs of the Sequoi;1<br />
2000 project. As a resul r, multi -i nsti w tion projects,<br />
where participants arc in difkrcnr and often distant<br />
locations, :1re ntrernelv difficult to earn· our.<br />
Status and Future Plans<br />
The Sequoia 2000 project is more rh:1n rhrce vcars old<br />
and h:1s nearlv accomplished irs objectives. We luve<br />
a common schema in place t()r rrs under 1\'J\' in rhc spatial at-c n:1.1" The<br />
intl·asrrucrurc is in place ro enable this sc hcm;1 to<br />
evolvc JS more data tvpes, uscr-dctined fu nctions, ;1lld<br />
operators an: included in the fu rurc.<br />
The combination of Objecr-K.noll'ledge, ll lustLl,<br />
Epoc h, and AMASS is prm ing robust and meers our<br />
clients' needs. Lastk, ,,.c IDI'C ample resources ro<br />
move our prototvpe into prod uction use at UCLA and<br />
Santa B:1rlx1r;1 during rhc next SCI'cral months.<br />
We arc ::1lso extending the scope of the prorotvpe in<br />
t\\'O difkrcnr directions. first, 11·c ll'i] l recruit Jddirional<br />
ea rrh scientists to utili 1.e our s1·srem. This ll'i II<br />
req uire extending our common schema to meet their<br />
needs and then inst::1 lling our suite ofsoft11'rucccdiug;<br />
o/ !he 7993 I.t.FL' /Jo/o L'n,!!, illeerin,�.; c!ilt/(>rence.<br />
Housron, Texas ( Febru:ll'l' 1993 ).<br />
9. ) . Hcllcrsrein and M. Sronehrmccedit tgs of /be 1091 tiC�! SJC.HO/) Cmtj(n'IICC.<br />
W:1 shingron, D.C. (rVb1· ]')93).
10. P. Koc lu.:var and L. Wanger, "Tecne: A Software<br />
P!Jrfonn tor Browsing and Visualizing Data ti·om<br />
Networked Data Sources," Digita l <strong>Technical</strong><br />
.foumal, vol. 7, no. 3 ( <strong>1995</strong>, this issue): 66-8 3.<br />
11. M. Stonebraker er al., "Tioga : PrO\·iding Dara Man<br />
agement for Scientific Visu:�liz::nion Appl ications, "<br />
Proceedings Q/ the 199.3 VLDB Co n/erence. Dublin,<br />
Ireland (August 1993).<br />
12. A. vVoodruffet :� 1., "Zooming
so<br />
The Sequoia 2000<br />
Electronic Repository<br />
A major effort in the Sequoia 2000 project was to<br />
build a very large database of earth science infor<br />
mation. Without providing the means for scien<br />
tists to efficiently and effectively locate required<br />
information and to browse its contents, how<br />
ever, this vast database would rapidly become<br />
unmanageable and eventually unusable. The<br />
Sequoia 2000 Electronic Repository addresses<br />
these problems through indexing and retrieval<br />
software that is incorporated into the POSTGRES<br />
database management system. The Electronic<br />
Repository effort involved the design of proba<br />
bilistic indexing and retrieval for text documents<br />
in POSTGRES, and the development of algo<br />
rithms for automatic georeferencing of text<br />
documents and segmentation of full texts<br />
into topica lly coherent segments for improved<br />
retrieval. Va rious graphical interfaces support<br />
these retrieval features.<br />
Dig;ir:1l Tcch11ical JounlJI Vo l. 7 :--:o . .'\ l
the research conducrcd in these r the POSTG RES database system,<br />
the GIPSY S\'Stcm f(>r automatic indexing of texts<br />
using geographic coordinates based on locations mentioned<br />
in the tnt, :1nd the TextTiling method f(>r<br />
improving access to fu ll-text documents.<br />
Indexing and Retrieval in the Electronic Repository<br />
The priman· engine f(>r information storage and<br />
retrieval in the Sequoia 2000 Electronic Repository<br />
is the POSTGIU:S ncxt-generarion darabase management<br />
system (DBMS).' POSTG RES is the core of<br />
rhe DBMS-ccntric Sequoia 2000 system design. All<br />
rhe data used in the projecr was stored in POSTGRES,<br />
including complex mulridimensional arravs of data,<br />
parial objecrs such as raster and vector maps, sarellite<br />
images, and sers of measuremenrs, as well as all the<br />
fu ll-rext documents available. The POSTG RES DBMS<br />
supports user-defined abstracr data types, user-defined<br />
ti.mcrions, a rules system, and man�' katures of objectoriented<br />
DBMSs, including inheritance and methods,<br />
through fu ncrions in both the querv l
52<br />
Figure 1<br />
KW_SOURCES<br />
KW_INDEX_FLAGS<br />
The Lassen POSTGRES Classes t()r Indexing ,md Their Linkages<br />
a particu!Jr term from the wn_index class. The<br />
POSTQUEL definition ohhis class is<br />
create kw_term_doc_rel (<br />
termid = int4, I* WordNet termid number *I<br />
synset = int4 , I* WordNet sense number *I<br />
docid = int4, I* document ID *I<br />
termfreq = int4) I* term frequency within<br />
the document *I<br />
The raw fi-equencv of occurrence of the term<br />
in the document ( lermji·eq) is included in the<br />
k.\1·_tenn_doc_rel tuple . This information is used in<br />
the retrieval process for calculating the probability of<br />
relevance for each document that contains the term.<br />
The kw doc index class stores information on indi<br />
,·idual documents in the database. This information<br />
includes a unique documem identifier ( docid), the<br />
location ofthe document (the class, the attribute, and<br />
the tuple in which it is contained), and whether it is<br />
a simple attribute or a large object (with effectively<br />
unlimited size). The kw_doc_index class also main<br />
tains additional statistical information, such as the<br />
number of unique terms fo und in the document. The<br />
POSTQUEL definition is as tol lo11 s:<br />
create kw_doc_index (<br />
docid = int4, I* document ID *I<br />
reloid = oid, I* oid of relation<br />
containing it *I<br />
attroid oid, I* attribute definition of<br />
attr containing it *I<br />
attrnum int2, I* attribute number of attr<br />
containing it *I<br />
tupleid oid, I* tuple oid of tuple<br />
containing it *I<br />
sourcetype = int4, I* type of object -- attribute<br />
or large obj ect *I<br />
doc_len = int4 , I* document length in words *I<br />
doc_ulen = int4) I* number of unique words in<br />
document *I<br />
<strong>Digital</strong> Tec hnical Journal<br />
Vol. 7 No. 3 <strong>1995</strong><br />
WN_INDEX<br />
The kw_sources class contains information about<br />
the classes and attributes indexed at the class ln·el, as<br />
,,·ell as statistics such as the number of items indexed<br />
from any given class. The fol lowing POSTQUEL<br />
statement defines this class:<br />
create kw_sources<br />
relname = charl6,<br />
relaid = oid,<br />
attrname = charl6,<br />
attroid oid,<br />
attrnum int2 ,<br />
I* name of indexed<br />
relation *I<br />
I* oid of indexed<br />
relation *I<br />
I* name of indexed<br />
attribute *I<br />
I* obj ect ID of indexed<br />
attribute *I<br />
I* number of indexed<br />
attribute *I<br />
attrtype = int4, I* attribute type -- large<br />
object or otherwise *I<br />
num_indexed = int4 , I* number of items<br />
indexed *I<br />
last tid = oid, I* oid and time for last *I<br />
last time abstime, I* tuple added *I<br />
tot_terms = int4, I* total terms from all<br />
items *I<br />
tot_uterms = int4, I* total unique terms from<br />
all items *I<br />
include_pat text , I* simple patterns to *I<br />
exc lude_pat text ) I* match for indexable<br />
I* items *I<br />
The other classes shown in Figure l relate to the<br />
indexing and retrieval processes. The Lassen prototvpe<br />
uses the POSTGRES rules svstem to perform such<br />
tasks as storing the elements of the bibliographic<br />
records in an appropriate normalized fo rm and to trig<br />
ger the indexing daemon.<br />
Defining an attribute in the database as indexable<br />
tor information retrie1·al purposes (i.e., bv appending<br />
a new tuple to the kw_sources definition) creates a rule<br />
that appends the class name and attribute name to the
kw_index_tlags class whenever a new tuple is appended<br />
to the class. Another rule then starts the indexing<br />
process for the newly appended data . Figure 2 shows<br />
this trigger process.<br />
The indexing process extracts each unique keyword<br />
fr om the indexed attributes of the database and stores<br />
it along with pointers to its source document and its<br />
frequency of occurrence in kw_term_doc_rel. This<br />
process is shown in Figure 3. The indexing daemon<br />
and the rules system maintain other global frequency<br />
information. For example, the overall fr equency of<br />
occurrence of terms in the database and the total num<br />
ber of indexed items are maintained fo r retrieval pro<br />
cessing. The indexing daemon attempts to perform<br />
any outstanding indexing tasks before it shuts down. It<br />
also updates the kw _doc_index tuple fo r a given index<br />
able class and attribute with a time stamp fo r the last<br />
item indexed (last_tid and last_time). This permits<br />
ongoing incremental indexing without having to<br />
reindex older tuples.<br />
Retrieval<br />
The prototype version of Lassen provides ranked<br />
retrieval of the documents indexed by the indexing<br />
daemon using a probabilistic retrieval algorithm. This<br />
algorithm estimates the probability of relevance fo r<br />
each docu ment based on statistical information on<br />
term usage in a user's natural language query and in<br />
the database. The algorithm used in the prototype is<br />
based on the staged logistic regression method.8<br />
A POSTGRES user-defined function invokes ranked<br />
retrieval processing. That is, fi-om a user's perspective,<br />
ranked retrieval is performed by a simple fu nction<br />
call (k\vsea rch) in a POSTQ UEL query language<br />
Figure 2<br />
DO NOTHING<br />
The Lassen Indexing Trigger Process<br />
RULE STARTS<br />
FUNCTION<br />
DAEMON_ TRIGGER<br />
statement. Information from the classes created and<br />
maintained by the indexing daemon are used to esti<br />
mate the probability of relevance fo r each indexed doc<br />
ument. (Note that the fu ll power of the POSTQUEL<br />
query language can also be used to perform conven<br />
tional Boolean retrieval using the classes created by the<br />
indexing process and to combine the results of ranked<br />
retrieval with other search criteria.) Figure 4 shows the<br />
process involved in the probabilistic ranked retrieval<br />
from the repository database.<br />
The actual query to the Lassen ranked retrieval<br />
process consists simply of a natural language statement<br />
of the searcher's interests. The query goes through the<br />
Figure 3<br />
The Lassen Indexing Daemon Process<br />
<strong>Digital</strong> <strong>Technical</strong> Journal Vol. 7 No. 3 <strong>1995</strong> 53
54<br />
Figure 4<br />
The Lassen Rctrie,·a\ Process<br />
same processing steps as documems in the indexing<br />
process. The i ndivid ual words o f the query are<br />
extracred :1 11d located in the wn_indn dictionary<br />
(after re moving common words or "stopll'ords"). The<br />
tcrmids �o r matching words ti·om wn_i ndn arc then<br />
u sed to retrieve all the tuples in kw_tcrm_doc_rd that<br />
contain the te rm. For each u niq ue document identifier<br />
in this list ofruples, the marching k11 _doc_indn tuple<br />
is rctrinTd. \Vith the �i·eq uenc�· int(mllation contained<br />
in kw_term_d oc_rel and kll'_doc_indn, th e estimated<br />
probabi litY of relevance is calculated fo r e:�ch docu <br />
ment that con tains at least one term in common ll'i th<br />
the query. The formulae used in the calculation arc<br />
based on experiments with fu ll-text retrieval -" The<br />
basic equation fo r the probabilistic model used in<br />
Lassen states the fo llowing: The probability of the<br />
event that a document is relevant R. gin:n that there<br />
is a set of N"clues" associated with that docu ment, A1<br />
fo r i = l, 2, ... , 1\', is<br />
log O(RIA,, ... ,A,) = log O(R) + 2 )1og O(RIA,)<br />
Digit�! Technic�! Journal<br />
1= I<br />
- log O (R)], (l)<br />
Vol. 7 :--Jo. 3 J
;md queri es (wirh relevance judgments) from the<br />
TlPSTER intcmn:nion retrieval test collection -" The<br />
deri,·ation of rhis fc>rmula is given in "Probabilistic<br />
Re trie' al Rased on Staged Logistic Regression ."' The<br />
fu ll ren·i n·al cquJtion used f(x the prororl'pe ,·ersion of<br />
retrie,·al described in this section i s<br />
where<br />
log O(NIA1, ... ,A11) """ - 3.Sl<br />
·" ·"<br />
I<br />
374<br />
+ \ .lJ + d :Lx,11 + 0.330 :Lx.,,<br />
I I<br />
If<br />
-0.1937 :Lx"',] + o.0929M, ( 3 )<br />
I<br />
X111 1 is the quotiem of the number of ti mes the mth<br />
term occurs in the querv and the sum of the rot:d<br />
number ofte nns in the querv plus 35;<br />
X1112 is the logarithm of the quotient a rri ved at bv<br />
dividing the number of times the mth term occurs in<br />
the document by the sum of the tot;ll number of te nns<br />
in the document plus 80;<br />
Figure 5<br />
CALCULATE NUMBER OF<br />
TERMS IN COMMON<br />
BETWEEN QUERY AND<br />
DOCUMENT M<br />
SUM FREQUENCY OF<br />
TERM IN QUERY DIVIDED<br />
BY ALL TERM<br />
OCCURENCES PLUS A<br />
CONSTANT<br />
2-Xm. l<br />
CALCULATE LOGARITHM<br />
OF SUM OF NUMBER OF<br />
TIMES TERM OCCURS IN<br />
DOCUMENT DIVIDED BY<br />
TOTAL TERMS IN DOCUMENT<br />
PLUS A CONSTANT<br />
LXm.2<br />
CALCULATE LOGARITHM<br />
OF SUM OF NUMBER OF<br />
TIMES TERM OCCURS IN<br />
DATABASE DIVIDED BY<br />
TOTAL TERM OCCURENCES<br />
IN DATABASE<br />
�Xm.3<br />
YES<br />
x/J/.3 is the logarithm of the quotie llt arrived at by<br />
dividing the n u m ber of times the 111th term occu rs in<br />
the databJsc ( i .e., in all documents) by the total num<br />
ber of terms in the collection;<br />
.ll is the number of terms held in common bv the<br />
querv and the document.<br />
Note that the .'v/2 term called f(>r in Equation 2 was<br />
not fo und to prol'ide any significmt difkrence in the<br />
results and w:1s omitted fr om Eq u ati on 3. The con <br />
stants 35 and 80, whi ch were used in X111 1 and X11,_2,<br />
are :1 rbitrarv but appear to offe r the best results 11·hen<br />
set to the a\·erage size of a quen· and the r the Staged Logistic Regression ProbJhilistic Ranking Process<br />
CALCULATE DOCUMENT<br />
PROBABILITY OF RELEVANCE<br />
P(R) = 1 I 1 + e "(- LOG O(R))<br />
CALCULATE DOCUMENT LOG<br />
ODDS OF RELEVANCE<br />
LOG O(R) = k1 + (S · [k2 • :LXm.l<br />
+ k3 · :
56<br />
and its unique identitier are stored in the k\\'_querv<br />
class . To see the results of the retrieval operation, the<br />
querv idenritier is used to 1·errieve the appropri,ue<br />
kw_retrieval tuples, r:mked in order according to the<br />
estimated probabilitY of reln·ance. The k11 _rerrie1·al<br />
and kl1 _quen· classes ha1 e the tollo11·ing POSTQUEL<br />
detinitions:<br />
create kw_query<br />
query_id = int4,<br />
query_user charl6,<br />
query_text = text )<br />
create kw_ret rieval<br />
query_id = int4 ,<br />
doc_id = int4 ,<br />
rel_oid = oid,<br />
attr_oid = oid,<br />
attr_num = int2,<br />
tuple_id = oid,<br />
/* ID number */<br />
/* POSTGRES user name */<br />
/* the actual query */<br />
!* link to the query */<br />
/* document ID number */<br />
/* location of doc */<br />
doc_len = int4 , /* size of document */<br />
doc_ma tch_terms = int4 , /* number of query terms<br />
in the document */<br />
doc_prob_rel = floatS ) /* estimated probability<br />
of relevance */<br />
The algorithm used tor ranked retrin·al in the<br />
Lassen prototvpe was rested
geograp hic position being referenced in the te xt. The<br />
actual index te rms assigned to a document are a ser of<br />
coordinate pol�·gons that descri be an area on rhe<br />
e:mh 's surface in a standard geographical projection<br />
system. Using coordinates instead of names for rhe<br />
place or geographic characteristic offers a number of<br />
advantages.<br />
• Uniqueness. Place names are not uniq ue, e.g.,<br />
Ve nice, California, and Ve nice, Italy, are not appar<br />
ent!�· dit'tere nt ll'irhout the qualit\·ing larger region<br />
to differentiate them. Using coordinates removes<br />
th is ambiguitv.<br />
• Immunity to spatial boundary changes. Political<br />
boundaries change over time, lead ing to confiJsion<br />
about the precise area being re.ferred to. Coordi<br />
nates do nor depend on political boundaries.<br />
• lmmunirv ro name changes. Geographic names<br />
change 01-er rime, making it diftlcult for a user to<br />
retrieve all information that has been written about<br />
an area during any extended rime period. Coord i<br />
nates remove this ambiguity.<br />
• Immunity to spatial, naming, and spelling l'aria<br />
rion. Names and rerms 1·an· nor onlv o1·er rime but<br />
also in contemporar�· usage . Geographic names<br />
1·ary in spelling over rime and by language. Areas of<br />
interest ro rhe user will often be given place names<br />
designated onlv in the context of a specific docu<br />
ment or proje ct. Such va riations occur frequentlv<br />
for studies done in oceanic locations. Names associ<br />
ated 11·irh these stud ies are unknoll'n to most users.<br />
Coordinates are not subject to these ki nds oherbal<br />
,·ariations.<br />
Indexing texts
58<br />
L titu<br />
a d<br />
e (mi ute )<br />
n s<br />
.... ..J<br />
1<br />
___ o _ i_ s p_l _ a y_M_ a _P __ _j ___<br />
Figure 6<br />
H_ i d<br />
Scr(cn h·om rhc L1sscn GcogLl}) hic Brm1 ocr<br />
_ e _ lc_o_n _s __ _J ___ _<br />
necessity, the result of
e lationships ti·om the Word Ner thesaurus, to<br />
include rhe ri: >llowing types ofrerms:7<br />
Sl'nOI1\'In\'<br />
. . .<br />
= S\'110111' 111<br />
kind -of reL1 rionsh ips<br />
= lwponym ( mJ pl c is :1 - ofrrec )<br />
@ : = lwpernvm (tree is :1 @ of mapl e)<br />
p:1rt-of rebrionsilips<br />
# mc rom·m (finger is a # of lund)<br />
% h o lonvm (hand is �1 % oHinge r)<br />
& c1 ido111'm (pine is a & ofshortleaf<br />
pine )<br />
3. Overlay polygons to estimate a pproximate loc:l <br />
rions. The objccr ivc of rhis step is ro combine the<br />
evidence �1ccumubted in rile preceding step and<br />
111kr a scr or· polvgons th:�t prm ides a reason:�bJc<br />
approximation of rhc geogr:1phical locations me n <br />
tioned in rile rcxr. E::tch gf!oj)hrasc. t/'e��bt. pull'gon<br />
ru plc em be represented �•s a thn . :c-dimcnsion�ll<br />
"extruded" polygon whose b�1se is in the plane of<br />
the x- ;m d z- axes and whose hcighr extends upw
Figure 7<br />
/7<br />
Overl;wing Poh·gons to Estimate Approximate Locations<br />
(a) The "weight" of a polygon, indicated by the<br />
vertical arrow, is interpreted as "thickness."<br />
(b) Two adjacent polygons do not affect each other;<br />
J.<br />
I<br />
I<br />
each is merely assigned its appropriate "thickness."<br />
(c) When one polygon subsumes another, their<br />
"thicknesses" in the area of overlap are summed.<br />
(d) When two polygons intersect, their "thicknesses"<br />
are summed in the area of overlap.<br />
60 <strong>Digital</strong> Techni(al journal Vo 1. 7 No. 3 !995
Figure 8<br />
Surface Plot Produced ri-om the State Water Project Text<br />
TextTiling: Enhancing Retrieval through<br />
Automatic Subtopic Identification<br />
Full-length documents have onlv recentlv become<br />
available on-line in large quantities, although technical<br />
abstracts, short newswire texts, and legal documents<br />
have been accessible for many years 23 The large major<br />
itY ofon -line information has been bibliographic (e.g.,<br />
authors, titles, and abstracts) instead of the fu ll text of<br />
the document. For this reason, most information<br />
retriev:ll methods :1re better suited for accessing<br />
abstracts than for accessing longer documents. Part of<br />
the rcposirorv research was an exploration of ne\\'<br />
approaches to information retrieval particularlv suited<br />
to ti.l ll-length texts, such as those expected in the<br />
Sequoia 2000 datalnse.<br />
A problem \\'ith applying traditional information<br />
retrieval methods to full-length text documents is that<br />
the structure of full-length documents is quite differ<br />
ent from that of abstracts. (In this paper, "ti.l ll-length<br />
document" refers to expositorv text of any length.<br />
Tvpical examples are a short magazine article and<br />
a 50-page technical report. We exclude documents<br />
composed of headlines, short advertisements, and any<br />
other disjointed texts of whatever length. We also<br />
assume that the document does not have detailed<br />
orthographically marked structure. Croft, Krovetz,<br />
and Tu rtle describe work that takes advantage of this<br />
ki nd of information.'; ) One way to view an expository<br />
texr is as a sequence of subtopics set against a backdrop<br />
of one or two main topics. A long text comprises many<br />
diffe rent subtopics that may be related to one another<br />
and to the backdrop in many diffe rent ways. The main<br />
topics of a text are discussed in its abstract, if one<br />
exists, but su btopics are usuallv not mentioned.<br />
Therefore, instead of querying against the entire<br />
content of a document, a user should be able to issue a<br />
query about a coherent subpart, or subtopic, of a ti.dl<br />
length document, and that subtopic shou ld be specifi<br />
able with respect to the document's main topic(s).<br />
Consider a Discouer magazine :�nicle about the<br />
Magellan space probe's exploration of Ve nus.';<br />
A reader divided this 23-paragraph article into the fo l<br />
lowing segments with the labels shown, where the<br />
numbers indicate paragraph numbers:<br />
1-2 lntro to J'vlagellan space probe<br />
3-4 Intro to Venus<br />
5-7 Lack of craters<br />
8-1 1 Evidence of volcanic action<br />
12-15 Ri'n Stvx<br />
16-18 Crustal spreading<br />
19-2 1 Recent volcanism<br />
22-23 Future of Magellan<br />
Assume that the topic of volcanic acti,·itv is of<br />
interest to a user. Crucial to a system's decision to<br />
retrieve this document is the knowledge that a dense<br />
discussion of volcanic activity, rather than a passing ref<br />
erence, appears. Since volcanism is not one of the<br />
text's two main topics, the number of references to<br />
this term will probably not dominate the statistics of<br />
term frequencv. On the other hand, document selec<br />
tion should not necessarilv be based on the number of<br />
references ro the target terms.<br />
The goal should be to determine whether or not<br />
a relevant discussion of a concept or topic appears.<br />
A simple approach to distinguishing between a true<br />
discussion and a passing reference is to determine the<br />
locality of the references. In the computer science<br />
operating systems literature, locality refers to the fact<br />
that over rime, memory access patterns tend to con<br />
centrate in localized clusters rather rhan be distributed<br />
evenly throughout memory. Similarly, in fu ll-length<br />
texts, the close proximity of members of a set of<br />
<strong>Digital</strong> <strong>Technical</strong> journal Vo l. 7 :--;o. 3 <strong>1995</strong> 61
62<br />
references to a particul ar concept is a good indiuror of<br />
topicality. For example, the term L'olcauislll occurs 5<br />
rim es in the Magellan article, the first fo ur insr;l nces of<br />
which occur in fo ur adj :\ Ce ll t paragraphs, ;l long ll'ith<br />
accompanvi ng discussion. In comrast, the rerm scieu<br />
tists. which is not a ,·a lid sub topic, occurs 13 ri mes, dis<br />
tributed somewhat e\-cnlv througho ut . Bv irs \'en·<br />
nature, a su btopic wi l l nor be discussed throughout an<br />
entire texr. Similarly, true subtopics are nor indiuted<br />
by only passing references. The term helll' do ucer<br />
occurs on l�· once , and its re lated terms arc con tined to<br />
the one sentence it appea rs in. As its usage is onh·<br />
a passing rckrence, belh· (Lwcing is not a true subropic<br />
of this text.<br />
Our solution to the problem of reuining v:did<br />
subtopical discu ssions while �H the same time ;woid<br />
ing being to oled by passing rdcrcnces is to make<br />
usc of localirv information :m d to partition docu<br />
ments according to their subto p ica l structure. 'This<br />
approach's capacitY tor i mp rO\'i ng a srallCLl rd informa<br />
tion retrieval tasl< has been verified bv i nt(mn arion<br />
retrieval experiments using fu ll-texr rest collections<br />
from t he TIPSTER daubase .l(··27<br />
One wav to get an approximation of the subtopic<br />
structure is to break the document into paragr;lp hs, or<br />
fo r \'cry long documenrs, sections. In both c1scs, rhis<br />
entai l s using the orthogr;1phic marking supplied l1\' the<br />
author to determine topic boundrmance oF moti\·;Jted<br />
segmenta tion, i.e., segmentation that rdkc rs rile<br />
text's true underlyi ng su btopic structu re, \\'hich ofren<br />
spans pac1graph boundaries.<br />
To\\'ard rhis end, \\'C h :-�,·c dn·c loped To rTi l ing,<br />
a method f( Jr partitioning fu ll-length tnt docu mems<br />
into coherent mLtlriparal:'-r,lph u n its called tiks.'"·'" . .:o'<br />
TcxrTi l ing approximates rhe su btopi c structure of<br />
a docu rne m by using patterns ot'lcxical conlKCtivin· ro<br />
find coherem su bdiscussions. The l
SimibritY ber11·ecn blocks is ukularcd b1· the t(,l loll·lllg<br />
cosJ Jle meastu·c: Ci1 en two tex t blocks bl �1 11d !J2,<br />
cos (1?1 ,1?2) =<br />
"<br />
L u·, ,, 11', ,1<br />
I I<br />
I! "<br />
L"·i,,, L ll'j ,,,<br />
t J /'='"I<br />
11 here 1 ranges m·cr �1 11 the terms in the document, a nd<br />
t/'1 "1 is the tf.idfweighr �1ssigned to term tin block hl .<br />
Thus, iF the similaritv score ben1 een tii'O blocks is<br />
high, then nor only do the blocks have te rms in com<br />
mon , bur the common terms �llT relatively rare with<br />
respect to the rest of the document. The evidence in<br />
the reverse is nor as conclusive . U' :-�djacenr blocks hJ1·e<br />
a lo\\' sirnilarin· measure, this docs not necess�1riil'<br />
J11C111 that the blocks cohere . 1 n pr:1crice, hm1·e\'er, this<br />
negJti1·e n·idence is often jusritied.<br />
The graph is then smoothed using a discrete con,·o<br />
lurion" of the similarirv fu nction \\'ith the fu nction<br />
h1?. ( ), where<br />
I ii � �< - 1<br />
otherwise.<br />
The result is smoothed tiJrrhcr 11·ith a simple median<br />
smoothing algori thm ro elimit1
64<br />
5. D. BlI Ctl., and H. Tu rt l e , "lmcr.JCtilc<br />
Retr·ie' ,,) of Com pic .x Docu me n rs," lnjbrnw I ion J'm<br />
cessillg u11d Jfol/{tgc>lllell/. ,·ol. 26, no. S ( 1 990 ):<br />
593-6 16.<br />
2S. A. Chaikin, " M agellan Pierces the Vc nusi:-tn Ve il,"<br />
/Ji.1crwe1: ,·ol. 13, no. I (Januarv 1992).<br />
26. M. Hearst .ltld C:. Plcl unt, "Su bropic Srru crmi ng f(lr<br />
Full-Length Doc u1ncn t Access," Proceedings of the<br />
Sixteenth Anmwl !nlenwliomt! ACH SIGIR Co nfer<br />
ence on N.escorch ond Det ·elopmenl in !njonnation<br />
!(etrieml (S/GIR '93J. Pittsburgh, June 1993 (Nc"·<br />
York: Association fo r Computing MachinerY, 1993 ):<br />
59-68.<br />
27. ,\>!. H earst, "Context ;m d Stru cture in Automated<br />
Full-Te xt Information Access," Ph.D. dissertation,<br />
Re port :\'o. UCB/CSD-94/836 ( Berkele1·, Calif<br />
Uni,·crsin· of CllitcJrni;J , Bcrkclcv, Compu ter Science<br />
Di,·ision, 1994 ).<br />
28. M. Hearst, "Tc xrTili n g : A Quan ti tative Approach ro<br />
Discourse Segrnen rarion ," Sequoia 2000 Tcchnicll<br />
Repcm 93/24 (S2K-93-24) (Berkele1., Ca li f Uni,n<br />
sit\' of C:liit(Jrnia, Bc rkele1·, 1993) (frp:/ /s2k<br />
h:p .cs. bcrkcl c1·. ed u /pub/ sequoia/ tech- reports/ s2 k -9<br />
3-24/s2 k-93-24.ps ).<br />
29. 1Vl. H c1rsr, "Mulri-Par;lgJ·aph Segmentation of Expos<br />
iron· TC \t," l;mceedings of the 77? irty-sccol!d<br />
J/e!:'lill,f.!. o/ the Assuciotion jar Co 111putctlional<br />
Li11p11s!ics. Los Cruces, ]\.' e\\' Mexico, June 1994.
30. G. Salton, AIIIOIIIlllic rex/ Processing· The Tran�jor-<br />
11/C/I io11. A nu!)•sis. ct/1(/ Ret rie!'al of Information by<br />
Computer ( RcJding, Mass.: Addison· Weskv, 1989).<br />
31. The authors arc gr�Hctii l ro Miclucl Braverman fo r<br />
proving that the smoothing algorithm is equivalent to<br />
this convolution .<br />
32. L. l'-1bincr and R. SchJti:r, f)ig ital Processill8 of<br />
Speech S(t;11als (Englewood Clifts, N.J.: Prcntiee<br />
H�lll, Inc., 1978).<br />
Biographies<br />
Ray R. Larson<br />
Rav Larson is an AssociJtc Professor rmation Studies). He tcJches courses and conducts<br />
resean.:h on the design Jnd evaluation ofintormJtion<br />
retrievJI svstcms. Rav received his Ph .D. !Tom the Universirv<br />
ofCalit(>�nia. He is � member of the American Socictv tor .<br />
. .<br />
lntcxmation Science (AS IS), the Association for Comput·<br />
ing 1Yiachinery (ACM ), the IEEE Computer Society, the<br />
Ametican Association tc >r the Advancement of Science,<br />
and the American Librarv Association. He is the Associate<br />
Editor fo r IICM Tru l/sa�lions 011 fllj(>rmati
Tecate: A Software<br />
Platform for Browsing<br />
and Visualizing Data<br />
from Networked Data<br />
Sources<br />
Te cate is a new infrastructure on which applica<br />
tions can be constructed that allow end users<br />
to browse for and then visualize data within<br />
networked data sources. This software platform<br />
capitalizes on the architectural strengths of cur<br />
rent scientific visualization systems, network<br />
browsers like Netscape, database management<br />
system front ends, and virtual reality systems.<br />
Applications layered on top of Tecate are able<br />
to browse for information in databases man<br />
aged by database management systems and for<br />
information contained in the World Wide Web.<br />
In addition, Tecate dynamically crafts user inter<br />
faces and interactive visualizations of selected<br />
data sets with the aid of an intelligent system.<br />
This system automatically maps many kinds of<br />
data sets into a virtual world that can be explored<br />
directly by end users. In describing these virtual<br />
worlds, Tecate uses an interpretive language that<br />
is also capable of performing arbitrary compu<br />
tations and mediating communications among<br />
different processes.<br />
66 <strong>Digital</strong> <strong>Technical</strong> Journal Vol. 7 No. 3 1':195<br />
I<br />
Peter D. Kochevar<br />
Leonard R. Wanger<br />
All people share the need to find and assimilate infor<br />
mation. Data fi-om which information is created is<br />
increasi ngly avai lable electronically, and that data<br />
is becoming more and more accessible with the prolif�<br />
eration of computer networks. Therefore, the world<br />
is quickly becoming abstracted as a collection of net<br />
worked data spaces, where a data space is a data source<br />
or repository whose access is control led by means of<br />
a well-defined software interface. Some examples<br />
of data spac
the tools provided by Tecate can be used to build<br />
applications of only modest capabilities.<br />
Historically, Tecate grew out of the Sequoia 2000<br />
project, which was initiated jointly by <strong>Digital</strong> Equip<br />
ment Corporation and the University of California<br />
in 1991 . The primary purpose of the Sequoia 2000<br />
project was to develop inf(xmation systems that wou ld<br />
allow earth scientists to better study global envi<br />
ronmental change. Sequoia 2000 participants needed<br />
to browse tor data sets on which ro test scientific<br />
hypotheses and then to interactively visualize the data<br />
sets once to und. The data can be quite varied in con<br />
tent and strucnJre, ranging rrom text and images<br />
to time-varying, multidimensional, gridded or poly<br />
hedral data sets. Such data may stream rrom many clif<br />
fe rent sources, e.g., databases managed by a database<br />
management system, a running simulation of some<br />
physical process, or tl1e WWW. Therefore, a tool was<br />
required that could interface to any such source. To be<br />
of maximum use, though, the tool had to be easy<br />
to use so that the scientists th emselves cou ld make<br />
sophisticated data queries and then experiment with<br />
the query results using a wide variety of data visualiza<br />
tjon tec hniques.<br />
Generalizing rrom its Sequoia 2000 roots, the<br />
design of Te cate is intended to achieve fo ur goals:<br />
l. lnrerface to general data spaces wherever they may<br />
reside.<br />
2 . Saliently visualize most kinds of data, e.g., scientific<br />
data and the listings in a telephone book.<br />
3. Dynamically craft user interfaces and interactive<br />
visualizations based on what data is selected, who is<br />
doing the visualizing, and why the user is exploring<br />
me data.<br />
4. AJJow end users to interact with elements in visual<br />
izations as a means to query data spaces, to explore<br />
alternate ways of presenting information, and to<br />
make annotations.<br />
There arc systems available today that have some of<br />
these capabilities, but no one system possesses all fo ur.<br />
Data visualization systems such as AVS, Khoros, or<br />
Data Explorer are capable of visualizing scientific data;<br />
however, they are poor at interfacing to general data<br />
spaces, they provide only limited interactivit:y \vitl1ill<br />
visualizations themselves, and they require visualiza<br />
tions to be crafted by hand by knowledgeable end<br />
users.'.l.-' Network browsers such as Netscape are good<br />
at fe tching data rrom certajn types of data spaces but<br />
are limited in the variety of data they can directly visu<br />
alize without having to rely on external viewer pro<br />
grams. Moreover, most network browsers offer a<br />
restricted type of interacrivity where only hyperlinks<br />
can be f()llowcd and text can be submitted through<br />
fo rms. Finally, rront ends to database management<br />
systems provide elaborate querying mechanisms fo r<br />
selecting data rrom a database, but they lack a sophisti<br />
cated means fo r visualizing and further exploring<br />
query results.<br />
The Tecate architecture borrows from that of visu<br />
alization systems, network browsers, and database<br />
management systems as well as rrom virtual reality sys<br />
tems like Alice and the Minimal Reality Toolkit/<br />
Object Modeling Language (MR/OML)!·' One<br />
major contribution of the Tecate system is tJ1at it<br />
incorporates the architectural strengths of these<br />
systems into a coherent whole. In addition, Tecate<br />
possesses at least two novel features that are not fo und<br />
in other data visualization systems. One fe ature is<br />
Tecate's use of an interpretive language that can<br />
describe three-dimensional (3-D) virtual worlds. This<br />
language is more than a markup language in that it is<br />
capable of performjng arbitrary computations and<br />
fa cilitating communication among clifferent processes.<br />
The second novel component ofTecate is tJ1e presence<br />
of an expert system that automatically crafts interactive<br />
visualizations of data. This system is intended to make<br />
data space exploration easier to perform by having end<br />
users simply state their goals while leaving the detajls<br />
of implementing a visualization to anajn tl1osc goals to<br />
tJ1e expert system.<br />
The remajnder of the paper outlines Tecate's sys<br />
tem model and architecture and then identifies and<br />
describes Te cate's major components. finally, the<br />
paper sketches Tecate's capabilities by discussing two<br />
simple applications mat have been implemented on top<br />
of me Tecate software fTamework. The first application<br />
is a tool fo r visualizing earth science dat.a rcsicling in<br />
a database managed by a database management system .<br />
The second app�cation is a We b browser that uses 3-0<br />
graphics as an underlying browsing paradigm ramer<br />
than depending solely on me mecliwn of hypertext.<br />
Tecate's System Model<br />
After presenting an overview ofTecate's system model,<br />
this section provides detans of me object model and the<br />
interpretive, object-oriented language used to describe<br />
virtual world objects.<br />
Overview<br />
From the standpoint of an applications programmer,<br />
Tecate is a distributed, object-oriented system. AU<br />
major components ofTecate, as weU as entities appear<br />
ing in virtual worlds created by Tecate, are objects that<br />
communicate with one another by means of message<br />
passing. The majn focus within Tecate is on object<br />
object interactions. These interactions occur primarily<br />
when objects send messages to one another. An object<br />
can also send a message to itself, which has the effect of<br />
making a local function cal l. Unlike with graphics<br />
systems such as Open Inventor, rendering is not a cen<br />
traJ activity wi thin Tecate; rather it is just a side effect<br />
DigiraJ <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 67
68<br />
of object-object interactions -" In this sense, Tecate is<br />
like virtual reality programming systems such as Alice<br />
and MR/OML, although Tecate is far more fl exible.<br />
In the Tecate syste m, objects can create and destroy<br />
other objects and can alter the properties of existing<br />
objects on-the-fly. Sud1 capabilities make Tecate vcrv<br />
extensible and give it great power and fl exibility. These<br />
capabilities can also cause problems fo r applications<br />
programmers, however, if care is not taken 'vhen writ<br />
ing programs. Presently, all of an object's properties<br />
are visible to all other objects, and hence those proper<br />
tics can be manipulated fi-om outside the object. In the<br />
fi.1ture, some form of selective property hiding needs<br />
to be added so that designated properties of an object<br />
cannot be altered by other objects.<br />
A powerful teature ofTecatc is irs ability ro dynami<br />
cally establish object-subobjecr relationships. This fea<br />
ture provides a mechanism for bui ldi ng assemblies of<br />
parts similar to the mechanisms in classical hierarchical<br />
graphics systems like Don� or Open lnvcnror.7 This<br />
tcatun: also provides the capability of creating sets or<br />
aggregates of objects rhar share some trait, such as<br />
being highlighted. Te cate allows all objects within a set<br />
to be treated en masse by providing a means of se lec<br />
tively broadcasting messages ro groups of objects.<br />
A message that is sent to an object can be fixwarded<br />
to all the object's subobjccrs. Thus, for example, one<br />
object can serve as a container fix all other objects that<br />
are highlighted; the highlighted objects arc merely sub<br />
objects of the container. To unbighlight all highlighted<br />
objects, a single unhighlight message can be sent to the<br />
container object, which then fcmvards the message to<br />
al l its subobjccts. In general, an object can be the sub<br />
object of any nu mber of other objects and thus simulta<br />
neously be a member of many diffe rent sets.<br />
The handling of user input within Tecate is<br />
intended to appear the same as ordinary object-object<br />
interactions. Al l physical input devices that are known<br />
to Tecate have an agent object associated with them<br />
th at acts as a device handler. All objects that wish to<br />
be informed of a particular input event register with<br />
the appropriate agent. When an input event occurs,<br />
the agent sends all registered objects a message notit)r<br />
ing them of the event. Complex events, such as the<br />
occurrence of event A and event B within a specified<br />
time period, can easily be defined by creating new han<br />
dler objects. These handlers register to be informed of<br />
separate events but then, in turn, inform other objects<br />
of the events' conjunction.<br />
The Object Model<br />
Te cate uses an object model in which no distinction<br />
is made between classes and instances, as is done in<br />
languages like C++ .8 In Tecate, there is a single object<br />
creation operation called cloning. Any object in the<br />
system can serve as a prototype fi-om which a copy can<br />
be made through the clone operation. A clone inherits<br />
<strong>Digital</strong> T�chnical )ourn
compute-intensive behaviors or whose behavior execu<br />
tions arc time-critical are generally implemented as<br />
resource objects. For instance, most Te cate objects that<br />
provide system services, such as rendering or database<br />
management, are implemented as resource objects.<br />
Objects populating virtual worlds that represent<br />
data fe atu res are implemented diffe rently than<br />
resource objects by using an interpretive program<br />
ming language ca lled the Abstract Visualization Lan<br />
guage (AVL). Such objects are called dynamic objects<br />
because they may be created, destroyed, and altered<br />
on-the-fly as a Tecate session unfolds. Nonetheless,<br />
the ability to dynamically add, remove, and alter object<br />
properties is nor solely endemic to dynamic objects.<br />
Resource properties may also bt: changed on -the-tly<br />
because resources are actually implemented with a<br />
dynamic object that interfaces to the portion of the<br />
resource that is implemented as an external process.<br />
The Abstract Visualization Language<br />
AV L is essential to the Tecate system; it is through AVL<br />
that applications programmers write applications that<br />
usc Tecate's ti.:arures.'' AVL is an interpretive, object<br />
oriented programming language that is capable of<br />
performing arbitrary computations and facilitating<br />
communication among diffe rent processes. Through<br />
this language, applications programmers spcci f)� and<br />
manipulate object properties and invoke object lx :hav<br />
iors by se nding messages fr om one object to another.<br />
AV L is a typelcss language that manipulates char<br />
acter strings; it is based on the Tel embeddable<br />
command language.'" AVL extends Tel by adding<br />
objcct-oricnn.:d program ming support, 3-D graphics,<br />
and a more sophisticated event-handling mechanism.<br />
Although AVL is a proper supe:rser ofTcl, the relation <br />
ship between AVL and Tel is much like that between<br />
C and C++. By adding a small set of new constructs<br />
to Te l, the way applications programmns structure<br />
AVL progra ms differs markedly !Tom how they struc<br />
ture Tel progra ms, just as the C+ + language exten<br />
sions to C: greatly alter the C programming style.<br />
One usc of AV L is to describe virtual worlds that<br />
represent data sets. Through AV L, objects that popu<br />
late these worlds can be assigned bt:haviors that are<br />
elicited through user interaction . �or instann.:, select<br />
ing a 3-D icon can cause a Universal Resource Locator<br />
(UlU,) to be fo llowed out into the WVVVV . In this<br />
sense, AV L is somewhat like the Hypertext Markup<br />
Language ( HTM L) that underlies all Web browsers<br />
today, or, more fi ning, it is similar to the Virtual<br />
R.c:�liry Modeling Language (VRML) that has been<br />
proposed as a 3-D analog ofHTML.'' AVL does, how<br />
ever, differ marked ly !Tom HTM L and VRNlL, which<br />
Jrc only markup l:�nguages. Bec:�use AVL is a fu ll<br />
tl edged programming language that Ius sophisticated<br />
interaction h:�ndling built in, it is philosophically more<br />
similar to interpretive languages like Telescript,<br />
NewtonScript, and PostScript. "·"·'4 Like Te lescript,<br />
fo r instance, AVL programs can encode "smart<br />
agents" that can be sent across a network to perform<br />
user tasks at a remote machine, if an AVL interpreter<br />
resides there . Note, however, that in the present ver<br />
sion of Tecate, there is no notion of security when<br />
arbitrary AVL code runs on a remote machine.<br />
AVL includes some additional commands that aug<br />
ment the Te l instruction set, fo r instance, clone and<br />
delete. The clone command is the object creation com<br />
mand within AVL, and the delete command is the com<br />
plementary operation to delete objects fi-om the<br />
syste m. Object properties are specified and manipu<br />
lated using the add command and deleted using rhe<br />
remove command . Behaviors in one object are initiated<br />
by another object using the send command, which<br />
specifies the behavior to invoke and the argu ments to<br />
be passed . Queries about object properties can be<br />
made using the inquire command. The which com<br />
mand is used to determine where an object's properties<br />
are actually defined in light of Tecate's use of delega<br />
tion to resolve property references. Finally, AVL pro<br />
vides a rich set of matrix and vector operators that arc<br />
useful when positioning objects "�thin 3-D scenes.<br />
As an example of how AVL is used in practjce,<br />
Figure l depicts a code fTagmcnt similar to one that<br />
appears in the vV'vV'vV application described later in the<br />
paper. The code fi-agmenr creates a 3-D Web site icon<br />
that is positioned on a world map. The code begins<br />
with the definition of the Hyper/ink object fi-om which<br />
all Web site icons arc cloned. The Hyper/ink object is<br />
itself cloned !Tom the Visual object that is predefined<br />
by Tecate at system start-up. The Visual object con<br />
tains properties that relate to the viewing of objects<br />
within scenes. h)f instance, objects that are cloned<br />
fi-om the Visual object inherit behaviors to rotate<br />
themselves and to change their color. To the proper<br />
ties that are inherited from the Visual object, the<br />
Hyper/ink object adds the state variables uri and desc,<br />
which will be used to store respectively a URL and irs<br />
textual description. In addition, objects cloned !Tom<br />
the !-Iyperlink object will inherit the default appear<br />
ance of a solid blue sphere having unit radius.<br />
The specification fo r the Hyper/ink object also<br />
defines three behaviors: ini/, openUrl, and showDesc.<br />
The inil behavior replaces the inil method inherited<br />
!Tom the Visual object. When an object cloned !Tom<br />
the Hyper/ink object receives an init message, it sets irs<br />
uri and desc state variables, positions itself within the<br />
scene whose name is given by the argument scene, and<br />
registers itself with the mouse handler agent to receive<br />
two events. When mouse bu tton l is depressed , the<br />
agent sends the object the open Uri message, which in<br />
turn requests the W'vVW Interface to fetch the data<br />
pointed to by the object's URL. Depressing button 2<br />
invokes d1e shOti'Desc message, causing the Web site<br />
URL and description to be displayed by a previously<br />
<strong>Digital</strong> Tt:x hnicJI /ournJI Vo l. 7 No. 3 <strong>1995</strong> 69
70<br />
# Define a prototype for all Web icons<br />
clone Hyperlink Visua l<br />
add Hyperlink<br />
}<br />
state {<br />
u r L ""<br />
}<br />
desc ""<br />
appearance {<br />
}<br />
shape {sphere}<br />
diffuseColor {0.0 0.0 1 . 0}<br />
repType {surface}<br />
behavior {<br />
# Ini tialize hyper link<br />
}<br />
init {url desc pos scene wi ndow} {<br />
}<br />
addstate url $url<br />
addstate desc $desc<br />
send [getselfJ move "add $pos "<br />
add $scene "subo bject [get self]"<br />
send $wi ndow addEvent "[getse lfJ {Button-1 {openUrl {}}} {Button-2 {showDesc {}}}"<br />
# Open the URL<br />
openUr l {} {send www fetch "[getstate url J"}<br />
# Display the desc ription<br />
showDesc {} {send me taViewer display "[getstate descJ''}<br />
# Ini tialize an informationa l Landscape<br />
clone scene Vi sua l<br />
clone window Viewer<br />
send wi ndow init {scene}<br />
# Create a Web site i con<br />
clone hlink Hyperlink<br />
send hlink init {"http: //www.sdsc . edu/ Home . html"<br />
"SDSC home page" "-2.3 -2 .0 1 . 0" scene wi ndow}<br />
# Use the SDSC mode l geometry<br />
add hlink {appea rance {shape {box}}}<br />
Figure 1<br />
An Implementation of a World Wide Web Icon in the Abstract Visualization Language<br />
defined interface widget ca lled the meta Viewer. The<br />
AVL command getself. which is used within the init<br />
behavior body, returns the name of the object on<br />
which the behavior was called, thus allowing applica<br />
tions programmers to write generic behaviors. The<br />
other AVL commands, getstate and addstate, are<br />
shorthand fo r "get [getself) state ... " and "add [getself]<br />
{state . . . l."<br />
Once the HyJX>rlink object is defined, a scene, a dis<br />
play window, and a Web site icon are created . The<br />
Tecate scene object is cloned fi-om the Visual object.<br />
The window object, cloned fi-om the predefined<br />
Viewe1· object, is the viewport into which the scene is<br />
to be rendered. Finally, blink is a We b site icon whose<br />
appearance difters from that which is inherited from<br />
the Hyper/ink object. Rather than being spherical, the<br />
shape of the blink icon is a unit cube.<br />
<strong>Digital</strong> Tcchnic:d Journal Vo l. 7 No. 3 <strong>1995</strong><br />
Tecate's Architecture<br />
The general structure of Tecate and how it relates to<br />
application progra ms is depicted in Figure 2. Tecate<br />
consists of a kernel, a set of basic system services, and a<br />
toolkit of predefined objects. The Tecate kernel, which<br />
is shown in Figure 3, is an object management system<br />
called the Abstract Visualization Machine; AVL is its<br />
native language. The Abstract Visualization Machine<br />
is responsible fo r creating, destroying, altering, ren<br />
dering, and mediating communication between<br />
objects. The two major components of the Abstract<br />
Visualizarjon Machine are the Object Manager and<br />
the Rendering Engine.<br />
The Object Manager is the primary component of<br />
the Abstract Visualization Machine. It is responsible fo r<br />
interpreting AVL programs, managing a database of
APPUCA TIONS<br />
TECATE<br />
Figure 2<br />
The Tecate System and Its Rcbrionship to Application<br />
Programs<br />
objects, mediating communication between objects,<br />
and intcrbcing with input devices. The Object<br />
Manager is itself a resource object that is distinguished<br />
by the fact that all other resource objects are spawned<br />
fi-om this om: objecr. In addition, the Object Manager<br />
is responsible to r creating a distinguished dymmic<br />
object, called Root. fi-om which aU other dynamic<br />
objects can trace their heritage through prototype<br />
clone relationships.<br />
The Object Manager is implemented on a simple,<br />
custom-built thread package. Each object within<br />
Tecate can be thought of as a process that has its own<br />
thread of control. Each thread can be implemented<br />
either as a lightweight process that shares the same<br />
machine context as the Object Manager's operating<br />
system process or as its own operating system process<br />
separate fi-om that of the Object Manager. Lightweight<br />
processes are so named because their use requires little<br />
system overhead, which enables thousands of such<br />
processes to be active at any given time. Within Tecate,<br />
dynamic objects arc implemented as lightweight<br />
Figure 3<br />
INTELLIGENT<br />
VISUALIZATION<br />
SYSTEM<br />
BIGRIVER<br />
RENDERING<br />
ENGINE<br />
processes, whereas resource objects arc implemented<br />
as heavyweight operating system processes, which may<br />
or may not be paired with a lightweight, adjunct<br />
process. A low-level function library is provided to<br />
handle the creation and destruction of threads and<br />
to handle interthread communication regardkss of<br />
how the threads are implemented.<br />
Closely allied with the Object Manager is the Ren <br />
dering Engine, which is a special resource object<br />
wholly contained \vi thin the Abstract Visualization<br />
Machine. The Rendering Engine is responsible fo r<br />
creating a graphical rendition of a virtual world that is<br />
specified by AVL programs interpreted by the Object<br />
Manager. When interpreting an AV L program, the<br />
Object Manager strips off appearance attributes of<br />
objects and sends appropriate messages to the Ren<br />
dering Engine so that it can maintain a separate display<br />
list that represents a virtual world. Display lists are rep<br />
resented as directed, acyclic graphs whose connectivity<br />
is determined by object-subobject relationships that<br />
are specified \vi thin AVL programs.<br />
The present Rendering Engine implementation<br />
uses the Don� graphics package running on a DEC<br />
3000 Model 500 workstation.7 The display lists that are<br />
created by invoking behaviors within the Rendering<br />
Engine are actually built up and maintained through<br />
Dorc. The set of messages that the Rendering Engine<br />
responds to represents an interface to a platform's<br />
graphics hardware that is indepe ndent of both the<br />
graphics package and the display device.<br />
Layered on top of the Abstract Visualization<br />
Machine arc Tecate's system services and the object<br />
toolkit. The system services consist of a col lection of<br />
resource objects that are automatically instantiated at<br />
system start-up. These resources include an expert<br />
system ca lled the Intelligent Visualization System,<br />
the Database Interface, the vVWW Interface , and a<br />
OBJECT MANAGER<br />
ABSTRACT VISUALIZATION MACHINE<br />
DATABASE<br />
INTERFACE<br />
www<br />
INTERFACE<br />
Derail ofTecare's Kernel (the Abstr:Kt Visualization Machine) and the System Services Provided by Tecate<br />
<strong>Digital</strong> TcchnicJI journal Vol. 7 No. 3 <strong>1995</strong> 71
72<br />
visualization programming system called BigRi,·cr.<br />
Figure 3 shows these resources in relationship to<br />
Tecate's kernel . Each resource is a Tecate object that<br />
has a number of predefined be haviors that can be use<br />
ful to applications programmers. For instance, the<br />
W'vVW Interface has a be havior that fe tches a data tile<br />
referred to by a URL and then translates the file's con<br />
tents into an appropriate AVL program.<br />
The tool kit within Tecate is a set of predefined<br />
dynamic objects that program mers can use to develop<br />
applications. These objects arc considered abstract<br />
objects in the sense that they arc not intended to be<br />
used directly. Rather, they serve as prototypes from<br />
which clones can be created. The rool kir consists of<br />
objects such as viewporrs, lights, and cameras that arc<br />
used to illuminate and render virtual worlds. The<br />
toolkit also contains a modest collection of 3-D usn<br />
interface widgets that can be used wirhin virtual<br />
worlds created by an applications programmer. These<br />
widgets include sliders, menus, icons, legends, and<br />
coordinate axes.<br />
One usefld object in the toolkit that aids in simulat<br />
ing physical processes and helps in perfixming aoinu<br />
tions is a clock. This object is an event generator thar<br />
signals every clock tick. If objects wish ro be informed<br />
of a clock pulse, those objects register rhemsch-cs with<br />
the clock object just like objects register themselves<br />
with input device agent objects. The dcbult clock<br />
object can be cloned , and each clone can be instanti<br />
ated with a difkrcnr dock period down to J resolution<br />
of one millisecond. Any number of clocks em be tick<br />
ing si multaneously during a Tcc:�tc session . Since new<br />
clocks can be created dynamically, and objects can reg<br />
ister and unrcgister to be int(mncd of clock pulses<br />
on-rhe-fly, clocks can be used as timers and triggers,<br />
and as pacesetters.<br />
Application Resources<br />
Tcc:-tte's system services arc prcdctincd application<br />
resources that aid in interactivc:ly visualizing data. As<br />
mentioned previously, these objects include the I nrcl<br />
ligcnr Visualization System, the Database I nrerbcc,<br />
the WVVW Interface, and rhe BigR.iver visualization<br />
programming system. In addition, :111 applications pro<br />
grammer can casilv add new :-tpplicarion resources<br />
using tools provided with the base Tecate system. Such<br />
new resources can be built around either user-written<br />
programs or commercial oft�thc-shelf applic1tions.<br />
To create a new application resource, a programmer<br />
needs to provide a set oHuncrions that c:-tn be invoked<br />
by other Tecate objects. These fi.nKtions correspond<br />
to behaviors that are called when rhc resource recci,·cs<br />
a message from other objects. Tools arc provided<br />
to register the behaviors wirh Tecate and ro manage<br />
the communication bet\\'een �1 resource and other<br />
TecJte objects. 1'<br />
Vol. 7 :--Jo. 3 l \)
models, user tasks, and visualization techniques. The<br />
Planner functions by constructing a sentence within a<br />
dataflow language defined by a context-sensitive graph<br />
grammar. At each step in the construction of the sen<br />
tence, rules in the knowledge base dictate which pro<br />
ductions in the grammar are to be applied and when.<br />
Presently, the knowledge base is implemented using<br />
the Classic knowledge representation system; the<br />
Planner is implemented in CLOS.'u"<br />
BigRiver<br />
The dataflow program produced by the Inte lligent<br />
Visualization System is written in a scripting language<br />
that is interpreted by BigRivcr, a visualization pro<br />
gramming system similar to AVS and Khoros.�.' hom<br />
a technical standpoint, BigRiver is not particularly<br />
innovative and will eventually be reimplementcd using<br />
some existing visualization system that has mon: fi.mc<br />
tionality. The reason that BigRivcr was created from<br />
scratch was to better understand how existi ng visual<br />
ization programming systems work and ro ovncome<br />
limitations within those systems. These limitations are<br />
their inability to bt: t:mbeddcd within other applica<br />
tions, their lack of comprehensive data models, and<br />
their inability to work with user-supplied renderers.<br />
The latest generation of visualization programming<br />
syste ms, such as Data Explorer and AVS/ Express,<br />
overcome many of these limitations.,_,,.<br />
Like most of the existing visualization syste ms,<br />
BigRiver consists of a collection of proced ures called<br />
modules, each of which has a wdl-dcfined set of inputs<br />
and ou tputs. Fu nctional specifications fo r these mod <br />
ules represent some of the knowledge contained in the<br />
Intelligent Visualization System's knowledge base.<br />
Visualization scripts that are interpreted by BigRiver<br />
specif)' module parameter values and dictate how the<br />
outputs of chosen modules arc to be channeled into<br />
the inputs of others.<br />
BigRiver modules come in tlm:e varieties: l/0, data<br />
manipulators, and glyph generators. AJI modules use<br />
se lf-descri bing data tc>rmats based on fiber bund les.<br />
One t() rmat is used t( )r manipulation within memory;<br />
the other is an on -the-wire encoding intended tor<br />
transporting data across a network. An input mod ule<br />
is responsible to r converting data stored in the on-the<br />
wire encoding into the in-memory fo rmat. The data<br />
manipulator mod ules transform tlber bundles of one<br />
in-memory fo rmat into those of another. The glyph<br />
generators take as input fiber bund les in the in-memorv<br />
t(xmat and produn: AVL programs that when necuted<br />
build virtual worlds containing objects th;lt represent<br />
fe atu res of selected data sets. A single dispby module<br />
takes as input AV L code and p:1sses it to the Abstract<br />
Visualization Machine. Bv means of the Rendering<br />
Engine, the Abstr:Jct Visualization Machine uses the<br />
appearance attributes of objects to create an image of<br />
a virtual world that contains the objects.<br />
The Database Interface<br />
The Database Interface provides the means to interact<br />
with a database management syste m, which in the cur<br />
rent version of Tecate can be either POSTGRES or<br />
Illustra.'"·N Database queries, written in POSTQ UEL<br />
fo r POSTGRES-managed databases or in SQL for<br />
Illustra databases, are sent to the Database Interface by<br />
Tecate objects where they are passed to a database<br />
management system server for execution. The server<br />
returns the query results to the Database Interface,<br />
which then attempts to package them up as an on-the<br />
wire encoding of a fiber bundle buff-ered on local disk.<br />
If the result is a set of tu ples in the standard format<br />
returned by POSTGRES or lllustra, the Database<br />
Interface pcrf(>rms the fiber bundle translation. For<br />
most otJ1er nonstandard results, the so-called bi nary<br />
large objects (RLOBs) of the database realm, the<br />
Database Interface cannot yet arbitrari ly perform the<br />
translation into the on-the-wire fiber bundle encod<br />
ing. The only BLOBs that the Database Interface can<br />
deal with presently arc those that arc already encoded<br />
as on-the-wire fiber bundles. The difficult problem of<br />
automated data f(mnat translation was not addressed<br />
during Tecate's initial development, although the<br />
intent is to address this issue in the fu ture.<br />
Once query results arc buffered on disk, a descrip<br />
tion of the fiber bundle and the location of the buffer<br />
are sent back to the object that made the query<br />
request of the Database I ntcrfacc. That object might<br />
then request the ]ntc lligcnt Visualization System to<br />
structure a virtual world whose image would appear<br />
on the display screen by way of BigRivcr and the<br />
Re ndering Engine. Objects in the vi rtual world can be<br />
given behaviors that arc elicited by user interactions.<br />
These behaviors might then result in further database<br />
queries and so on . Chains of events such as these pro<br />
vide a means tor browsing databases through direct<br />
manipu lation of objects within a virtual world.<br />
The World Wide Web Interface<br />
The WWVV lntertace fi.mctions similarly to the<br />
Database I ntcrface but instead of accessing data in<br />
a database, the WWVV Interface provides access to data<br />
stored on the World Wide We b. Messages that contain<br />
U RLs arc passed to the W'vVW Interface, which then<br />
fe tches the data pointed to by the URLs. In retrieving<br />
data fr om the \Ve b, rhe vVWW lntertace uses the same<br />
CERN sothvJre libr:1rics used by Web browsers like<br />
Netscapc.<br />
Once a data ti le is fetched , the 'vV'vVW Interface<br />
attempts to translate irs contents into an AVL pro<br />
gram, which is then passed to the Object Manager fo r<br />
interpretation. AVL either specifies the creation of<br />
a new virtual world that represents the data file's con<br />
tents or specifies new objects that are to populate the<br />
current world being viewed. lf the fetc hed data file<br />
contains a stream of AV L code, the Vv'vVW Intcrbce<br />
Vo l. 7 No. 3 !99S 73
74<br />
merely fo nvard s the file to the Object Manager. If the<br />
file contains general data in the form of an on-the-wire<br />
encoding of a fiber bundle, the WWVV lntcrtacc<br />
appeals to the Intelligent Visualization System to<br />
structure an appropriate virntal world. If the data file<br />
contains a stream ofHTML code, the WWW Inrerbce<br />
invokes an internal translator that translates HT1Vl l.<br />
code into an equivalent AVL program, which is then<br />
interpreted by the Object Manager. This interpreter<br />
actually understands an extended version of HTM L<br />
that supports the direct embedding of AVL within<br />
HTM L documents. Through this mechanism, 3-D<br />
objects with which users can interact can be embedded<br />
directly into a hypertext We b page-something that<br />
fe w if any other We b browsers can do today.<br />
Example Applications<br />
Applications that browse the contents of data spaces<br />
and then interactively visualize selected results have<br />
the same overall structure. One browser application<br />
component acts as a data space interface, and through<br />
this interface queries arc posed, query results arc<br />
imported into the application, and data generated by<br />
the application is stored back into a data space. Once<br />
data has been imported into the Jpplication , a second<br />
component must map the data into some appropriate<br />
virtual world . Finally, J third component must manage<br />
any interactions that may take place between an end<br />
user and elements that popu late the virtual worlds that<br />
arc created.<br />
In creating an application using Tecate, the DatabJse<br />
Interface and the vVWW Interface represent resources<br />
that can be used to form the appl ication's data space<br />
interfJce. The mapping of dJta into a representative<br />
virtual world can utilize Tecate's Intelligent Visuali<br />
zation System and the BigRiver visualization progrJm-<br />
Figure 4<br />
PLAN VISUALIZATION<br />
INTELLIGENT PLAN<br />
VISUALIZATION !-+--- -- --,<br />
SYSTEM<br />
BIG RIVER<br />
OBJECT<br />
MANAGER<br />
EXECUTE<br />
SCRIPT<br />
VISUALIZATION<br />
SET<br />
INTERPRET<br />
ABSTRACT<br />
VISUALIZATION<br />
LANGUAGE<br />
PARAMETERS<br />
Message Flow between Important Objn:rs in rhc Earrh Science Applic1rion<br />
<strong>Digital</strong> Tt:c hnical Journal Vol. 7 !'Jo. 3 I 995<br />
ming system. Finally, the manJgement of these worlds<br />
can take place through AVL programs that exercise the<br />
katures of Te cJte's Abstract Visualization Machine.<br />
The following two examples tl1at were implemented in<br />
AV L illustrate how Tecate can be used to create applica<br />
tions that browse data spaces.<br />
Visualizing Data in a Database<br />
A simple example of an application that exploits<br />
Tecate's features is one that browses fo r earth science<br />
data in a database and then provides visualizations of<br />
that dJta. The initial user interface fo r this application is<br />
built using a collection of user interface \vidgers, where<br />
each widget is a Tecate dynamic object. Because the<br />
Tecate system docs not yet have a comprehensive 3-D<br />
widget set, some widgets sti ll rely on two-dimensional<br />
( 2-D) constructs provided by the Tk \vi dget set that<br />
is implemented on top of the Te l language:'"<br />
Figure 4 depicts the tlow of messages between some<br />
of the more important objects that are used within the<br />
application. One object is the Map Query Tool that<br />
is used to make certain graphical queries for earth<br />
science data sets whose geographical extents and rime<br />
stamps fa ll within user-specified constraints. The tool<br />
is built around a world map on which regions of inter<br />
est can be specified (see figure 5). When a user marks<br />
a region of interest on the map and selects a temporal<br />
rJnge, a query message is sent to the Database<br />
lnrertace. The result of the query is returned to the<br />
Map Query Tool, which then torwards a description<br />
of the res ult to the Intelligent Visualization System.<br />
To structure an appropriate visualization, an inferred<br />
select task directive accompanies the re sult. The ensu<br />
ing script produced by the Intel ligent Visualization<br />
System is executed by BigRiver, which produces a<br />
stream of AVL code that is sent to the Abstract<br />
Visualization Machine tor interpretation.<br />
QUERY<br />
QUERY<br />
RESULT<br />
QUERY<br />
QUERY<br />
RESULT<br />
DATABASE<br />
INTERFACE<br />
KEY:<br />
0 DYNAMIC OBJECT<br />
Cl RESOURCE OBJECT<br />
-+-- MESSAGE FLOW
Figure 5<br />
The Map Query Tool Sho\\'ing a Vtstdization of a Qucrv Res ult<br />
This AVL program creates a new vi rtu::d world that<br />
consists of a collection of 3-D objects. Each object acts<br />
as an icon that corresponds to one data set that was<br />
returned as the result of the initial query (sec hgure<br />
5). The Intelligent Visualization System also builds<br />
in rwo behaviors t(>r each icon . Depending on how<br />
a user selects an icon, either the mctadata associated<br />
with the data scr represented by the icon is dispiJvcd in<br />
a separate window or a query message is sent to the<br />
Database lnterbcc requesting the actual data. In<br />
the latter case, the Map Query Tool again t(>rwards<br />
r.he qucrv result to the I ntelligcnr Visualization System,<br />
and another virtual world contJining objects representing<br />
data kJturcs is created and displavcd with<br />
the aid of BigR.ivcr Jnd the Abstract Visualization<br />
Machine. In generJI, lhta exploration proceeds this<br />
way by creating and discarding virtual worlds based on<br />
interactions with objects that populate prior worlds.<br />
After selecting Jn icon to actuJIIy view the data associated<br />
with it, an end user is Jskcd bv the I ntelligcnt<br />
Visualization System to input a task spccitication using<br />
a Task Editor. Generally, data sets can be visualized in<br />
many diffl:rent ways. The Intelligent Visualization<br />
Svstem uses the task specification ro select the one<br />
1·isualization that best satisfies the stated task. After<br />
J task specification is entered, a visualization of the<br />
selected data set appears on the screen. The BigRiver<br />
dataflow program that the Intelligent Visualization<br />
System creates ro do that visualization can be edited by<br />
h:md by knowledgeable end users to override the decisions<br />
made by the system.<br />
Figure 6 shows a Task Editor and a visualization<br />
crafted by the Intelligent Visualization System after an<br />
end user selected a data-set icon. The visualization represents<br />
hydrological dara that consists of a collectjon of<br />
tuples, each corresponding to a set of measurements<br />
made at discrete geographical locations. Based on<br />
the task specification that the end user entered, the<br />
[ntelligenr Visualization System chose to map the data<br />
into a coordinate system that has axes that represent<br />
latitude, longitude, and elevation. Each sphere represents<br />
an individual measurement site, whose color is<br />
a fi.mction of the mean temperature. When an end user<br />
selects a sphere, the actual data values associated with<br />
Digirc1l <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 75
76<br />
Figure 6<br />
Task Editor Sho\\'ing a Visualizatio11 of Hnlrologiral DaLl<br />
the location represented 1)\' the sphere JJ-c displavcd .<br />
In addition, the Imclligc nr Visualiz;1tion Svstcm Juto<br />
matically places into the l'irtual world ofrhe visu:tliza<br />
tion ;1 color legend to help relat-e sphere colors to mea11<br />
tcmperatun: v:tlues.<br />
hgurc 7 depicts another 1 irrual world showing J<br />
visu:tlization ohbta-sct output h·om a regional clim;1te<br />
model program. The d a u set is ;J 3-D arrav indexed b1·<br />
l atitude, longitude, and cle1·ation . Fach arrav clement<br />
is a tuple thJt contains cloud densi tv, water content,<br />
and temperature I'Jiues. In this 1nst;1ncc, the end user<br />
enrered J task specincnion that st;ned that the spatial<br />
variation in te mperature was of primarv illlJlOrtance.<br />
The Inrelligent Visualization System responded lw<br />
specif-\·ing J l'isualizJtion that represemed the te mper<br />
atun: datJ as an isosurbcc, i .e . , a surbce 11·hose poi nts<br />
all h:tve the s:trne 1·aluc f(Jr the temperature. Included<br />
in the 1·irtual world is ;l widget tklt can be used to<br />
change the isosurbcc I';Jiue and the field l';lriable th;n<br />
is being srudied .<br />
The isosurbcc widget that :tppc:trs in the l'isualiz;l <br />
tion shown in Figure 7 is ofspeci;ll interest bcc:�use of<br />
the way that it is implemented. Embedded in the tool<br />
is a slider that is used to change the isosurbcc 1·aluc. As<br />
with most sliders, the slider ,·:�luc indicator ;J utom;Jti<br />
cal ly moves when ;J mouse button is held down while<br />
Digiral Tcdmiol )< n1rn�l<br />
poinring at one ofrhe slider ends. To achieve this sim<br />
ple ;J nimation, Tecate's clock object is used. When rhe<br />
mouse burron is firs� depressed while the cursor is over<br />
a slider end, rbe slider indicator registers itself to be<br />
informed of clock ricks. From then on, at every clock<br />
rick, rhe indicator receives an update message fi-om the<br />
clock, at which time the indicator repositions itself and<br />
increments or d.eerernenrs the current slider value.<br />
VV hen the mouse burron is released, rhe slider sends<br />
a message to BigRiver indicating that a new isosurtace<br />
is ro be calculated and d isplayed. In addition, the slider<br />
indiutor unregisters itself f!·om the clock signali ng<br />
thJr it no longer is to receive the update messages. In<br />
general, applications em use this same clock mecha<br />
nism to perf(mn more e la borate animations.<br />
A 3-D World Wide Web Browser<br />
In the Teute Web browser, exploration of the World<br />
Wide Web ami irs comcnrs occurs by placing an end<br />
user onto an informational landscape. This landscape<br />
is ;1 3-D 'irtual world whose appearance reflects the<br />
coment and the structure of a designated subset of the<br />
entire Web. Upon applicnion start-up, an end user<br />
is presented with an initial informational landscape<br />
that consists of a p!Jn;lr map of rhe earth embedded<br />
in a 3-D space, ;l S shown in Figure 8. In general, rhe
Figure 7<br />
Task Editor Sh owing a Visualization of Regional Climate Dat3, I nduding an lsosurtacc 3nd cl User Inrcrt , lcc Widger<br />
initial informational landscape can be any 3-D scene<br />
and does not have to be geographically based. For<br />
instance, an intormational landscape might be a virtual<br />
library where books on shelves serve as anchors fo r<br />
hyper! inks to diffe rent Web sites.<br />
In the present browser application, selected Web<br />
sites appear as 3- D icons on the world map. These<br />
icons arc positioned either in locations where Web<br />
servers physically reside or in locations referenced<br />
within Web documents (see Figure 8). A user places<br />
information that describes these sites into a database<br />
that serves as an elaboration of the hot list of current<br />
hypertext- based browsers. When the browser applica<br />
tion is first started, it sends a query fo r the initial com<br />
plement of Web sites to the Database Interface. The<br />
browser application then invokes a BigRiver script that<br />
visualizes the results by placing icons representing<br />
each site onto the world map.<br />
Suspended above the world map is a 3-D user inter<br />
fa ce widget that is used to query a database of Web<br />
sites that are of interest to an end user (see Figure 8).<br />
This database, where the initial set of Web sites is<br />
stored, includes information such as URLs, keywords,<br />
geographical locations, and Web site types. Currently,<br />
Figure 8<br />
Te cate Web Browser Intrmarion3l L�1ndscapc Showing<br />
WW\V Sites Depicted as 3· D Icons on a Map of the World<br />
<strong>Digital</strong> Tc..:hni..:al Journal Vol. 7 �o. 3 J Y95 77
78<br />
individual users arc responsible tor maintaining their<br />
own databases by adding or removing \Ve b sire entries<br />
by hand. An automated me:ms to r building these LhtJ�<br />
bases can be easily added to the broll'ser appliurion so<br />
that Web intormation co uld be Jccu mulated based on<br />
where and when an end user rra\'cls on the Web.<br />
Duri ng a browsing session, rhc Web Querv Tool<br />
allows arbi trary SQL queries to be posed ro the<br />
database by an end user. ln addition, the Web Query<br />
Tool has provisions to allow pac kaged queries ro be<br />
initiated by a simp le click of a mouse button . ln both<br />
cases, queries are sent ro the Dat
Figure 10<br />
BIGRIVER<br />
INTERPRET<br />
ABSTRACT<br />
VISUALIZATION<br />
LANGUAGE<br />
EXECUTE SCR IPT<br />
INTERPRET<br />
OBJECT<br />
ABSTRACT<br />
MANAGER VISUALIZATION<br />
LANGUAGE<br />
KEY:<br />
0 DYNAMIC OBJECT<br />
c=J RESOURCE OBJECT<br />
- MESSAGE FLOW<br />
Message Flow between J mportanr Objects in the Web Browser<br />
Figure 11<br />
WEB QUERY )<br />
TOOL<br />
-<br />
DATA SET<br />
ICON<br />
J<br />
QUERY<br />
QUERY RESULT<br />
FETCH URL<br />
FETCH RESULT<br />
DATABASE<br />
INTERFACE<br />
www<br />
INTERFACE<br />
Results of a Tecate Browsing Session Showing a Hyperlink and a Forest of Pyramids That Re presents the User's Travels<br />
on the Web<br />
<strong>Digital</strong> <strong>Technical</strong> journal Vol. 7 No. 3 <strong>1995</strong> 79
80<br />
are cloned !Tom the same HypCI·Jink prototype as the<br />
Web site icons. If another HTML document is<br />
retrieved by fo llowing a hypcrlink, tlut document<br />
is viewed on the base of :mother inverted pvramid<br />
whose apex rests on the selected text and so on (sec<br />
Figure l l ) . Rather than having to page back and t(xth<br />
between hypertext docu ments as with most hypl:rtot·<br />
based browsers, in Te cate, an end user needs only to<br />
move about the virtual world to gain an appropriate<br />
viewpoint from which to examine a desired document.<br />
Overall, as shown in figure I I, J browsing session<br />
with Tecate's vVe b browser results in a t(m::st of pvLl<br />
midal structures that represent a pictorial historv of<br />
an end user's trave ls on the We b.<br />
Although Te cate's Web browser is capable of view<br />
ing HTML documents, irs main purpose is not to<br />
emubte what can currently be done using hypertext<br />
based browsers, albeit using 3-D. R.1 thcr, the new<br />
browser is intended to visualize primarilv more com<br />
plex types of data. When d�1ra docs not consist of<br />
a stream ofHTML code, the WWW Interrace attempts<br />
ro visualize what was returned h·om the 'We b. These<br />
visualizations can take place in virtual worlds separate<br />
!Tom the intormationa.l landscape !Tom where the data<br />
Figure 12<br />
Example of a We b Document with Embedded 3-D Vi rtml World<br />
<strong>Digital</strong> T.:chniund in visualization systems, network browsers,<br />
datJbase ti·orlt ends, and virtual reality systems. As a<br />
first prototype, Tecate was created using a breadth-first<br />
development strategy. That is, developers deemed it<br />
esse ntiJI to first understand what components were<br />
needed to build a general data space exploration utility<br />
and then determine how those components interact.
The Tecate car demo<br />
<br />
<br />
The Teca te car demo<br />
<br />
# Globa l variables<br />
globa l TEC_W EB_ PARENT TEC WEB_W IN<br />
set path "/projects/s2k/sharedata"<br />
# Def ine car pa rt prototype<br />
clone CarPart Visua l<br />
add CarPart {<br />
state {ang le 10}<br />
appea rance {<br />
repType surface<br />
i n terpType surface<br />
behavior {<br />
around {args} {<br />
# Defi n e car body<br />
for {set i 0} {$i < 360} {incr i [getstate angle]} {<br />
clone car_body CarPart<br />
add car_body {<br />
appearance {<br />
send [getself] rotate "add 0 [getstate ang le] 0"<br />
replacematrix {rotate {0 .0 0.0 90 .0}}<br />
shape {AliasObj "$p ath/car_body . t ri"}<br />
# Def ine generic whee l<br />
clone whee l Ca rPa rt<br />
# Define car' s four wheels<br />
clone back_ right CarPart<br />
# Assemb le car<br />
clone whee ls CarPart<br />
add wheels {subobject {back_r ight back left front left front_ right}}<br />
clone car CarPart<br />
add car {<br />
appea rance {replacem atrix {translate {28.0 -8 .0 3.0} rotate {90 .0 90 .0 0.0}}}<br />
subobject {car_b ody wheels}<br />
add $TEC_W EB_P ARENT {subobject {car}}<br />
# Bind pick events to car<br />
send $TEC WEB WIN addEvent {wheel {Pi ck-Shift-Butto n-1 {rot_w h ee ls {}}}}<br />
send $TEC_W E B_W IN addEvent {wheel {Pi ck-Button-1 {around {}}}}<br />
send $TEC WEB WIN addEvent {car {Pi ck-But ton-1 {around {}}}}<br />
<br />
<br />
Bu tton-1 on car to rotate the car <br />
Bu tton-1 on a whee l to rotate the wheels <br />
Shift Button-1 on a wheel to change the whee ls <br />
<br />
<br />
<br />
<br />
Figure 13<br />
AV !. C:odc hnbeddt:d in rhc HTM!. l'c1gc tor rhc Wch Documcnr Exam ric<br />
])ig:it;\] Tcclmi(;ll jourml Sl
82<br />
This d n\:lopm enr srr�ncg;1· tLldcd off the fu ncrio11Jiirv<br />
of indi1·idu:�l components t(>r rhc co mplercncss of<br />
:I ri.dl\' ru n ning \'iSU�liization S\'StClll .<br />
In terms of �1Chic1·i ng irs design goals, rile Tcc,Hc<br />
eft(Jrt h�1s been modcratch· successti.d . TcGHc Gl ll nm1·<br />
prm·idc inrcrbccs to t\1 o kind of rb t�l sp�1ccs: the<br />
World vVidc We b �l lld thLlb�lSCS lllJ113gcd b\' rhe<br />
POSTCRES �m d lllustLl Lh t�1b�1sc management S\'S<br />
tc ms. In addition, intnbccs to other dat�l sp�1ccs c�m<br />
be i m pl e men ted easih' l1\' CI'C :l ti ng nc\1 resou rce<br />
objects using the tools pro1·idcd lw Te cate . Much<br />
11 ork srill nccds to be done, hm1 c1'Cr For cx:�mple, rhc<br />
attciH.bnt dar:� translation problem must be satisbcto<br />
rik soh'Cd ; data passing tlmJugh an intcrbcc rhat<br />
is stored in one fo rm�1r should be :H ltom:�tic;�llv um<br />
l'trtcd i mo Tec: nc 's b1·orcd t(mna t :md 1·icc l'crs�l.<br />
When building ,·isu�lli;.�ni ons of data, Tccnc nm1<br />
undcrsr�mds dat�l that h�1s :1 s1Jecirlc conccpru�1l struc<br />
ture, i 11 p:�rticular, arhi tr,u·1· scrs of tuples �m d mul ri<br />
dimension:�! �1rr:11'S ll'hnc �l lT�l\' clements tlU\' be<br />
tuples. Although data t\'pcs from man1' di!'ll:rcm disci<br />
plines possess such a stru cture, some tvpcs rc m�1in thJt<br />
dO ilOt, r(>r i nstan ce, cb t:l rh�H h�1s a latticc-likl· or poh·<br />
hcd Lli srrucrure. F u rthcrmorc, Tee:� tc can noll' con<br />
struct onh· crude \'isu;�li;�ltions ofrhc da ta ti'Pcs rlut it<br />
docs understand. The prim�1n reason t(Jr rhis shOI't<br />
com ing is that the b�1sic mod ule set 11 irhin the<br />
BigRil'cr resource is incomplete, �md the knml'icdgc<br />
base 11·i thin the ln tc II igcm Visu�li iza tion Svstcm con<br />
tains limited knoll' ledge of 1 isu:�lization tec hniques<br />
that can be used to t ra nsr( m n d
J I. G. Bell, A. p,,risi, ;md 1\.l. reset·, "The Virru;li J\cJiin·<br />
,\ Jlldelillg Lm gu;lgc Sfl�citlc;Hion," a,·,,il:lbk on the<br />
lnrcrllet ;H http:/ /lr!lli.ll·ircd.com ( :\o1·cmbcr I l)9.J. ).<br />
12. J. White, "Tcksnipt Tcchnolog1·: The �ound.Hioll t(lr<br />
the Fkcrronic lvbrkctplace," Cencral M�gic 11·hitc<br />
p•lJll'r (Sull lliY;llr, C1 lif.: Gener;ll ,\ hgic, Inc., 19{}4 ).<br />
13. ). ,\ kKcclun Jnd :\. Rhodes, l'n�!.!,llfllllllillg j(w ihe<br />
,\('tr/(J/1: Sojitntre net:elr Scicmitic 1);\Ll \'istdiz;Hion,"<br />
TechnicJI Rcporr (.;\\'L!-JlST- ')2- 10 ( \V;1.shington,<br />
D.C.: (;corge \V;lsh ingtnn Uni1usirl', 1\b rc h 1992 ).<br />
20. Z. Ahmed n ;l l., "An lmelligt· nr VisuJiiLation S1·stcn1<br />
fo r J-:;lrth Scirc•uu· r I ()94 ) .<br />
22. D. l�urkr ;uld ill l'cndln·, "A \'isulllizarion ,\ lodel lhsed<br />
on rhe iVLlthenJ;Jtics of fib,r Bund les," 0>111/!11/ers ill<br />
/'hl:.;ics. 1·ol. 3, no. 5 (Septcnlber/Onober I ():N)<br />
D. l�urlcr ;l \ld S. l�n·son , "\\:cror- bundlc C:bsscs form<br />
l'm,·ertiil Tool tc>r Scienritlc \' isuJJii'l.�rion," (),ntf!llfl!I'S<br />
ill 1'/Jy sics. 1·ol. 6, no. 6 ( Nm·cmbn/Decel1lber I ()92 ):<br />
576-5X4.<br />
24. 1\ . Hllher, B. Luc\s,.mJ �· . Collins, "A Dar;, ,\ lodel fo r<br />
Sciemitic VJsuJii;arion 11·ith Pn"·isions t( >r Rcgu iJr ;md<br />
lrrc gui;Jr Crids," l'mcel!dil l,!l.-' u/ !be l'isuolizuliuu<br />
C) I (,r !1/{aC:'IICt'( 19() I ) .<br />
25. [, Resnick ct ;JI., U. ISS!C f)e,cnjJtion o11d Nefi'rellce<br />
.\/(/ ////(// .Ji>r the (,'(JJ)//JI(J/1 use lmjJ/c ·nlt'/1{{//ion<br />
( ,\•l u rra\' Hill, :\.].: AT&T Bell L1borJtories, 1993 ).<br />
26. C. Steele, Jr., 0!11111/l>ll l!SV '!Zw l.ong iiO,!.!,m<br />
purc.:r Center (SI)SC). Cmrenrlv, Peter is " 1·isiting scic.:nrisr<br />
;l[ tbc snsc. \\'here he kads rcseJrchc.:rs in dci ' Cioping<br />
intcr;Kti lt: tbra ,·isualizJtion s,·stt·ms. Petn joined Digir.1l<br />
in 1990 as ;l member of tht· \ Vorkstations Engineering<br />
(;roup. In e.1rlier 11·ork, he ,,·,1s � sott\l';lre t'll)!;inccr t( >r th
High-performance<br />
1/0 and Networking<br />
Software in Sequoia 2000<br />
The Sequoia 2000 project req uires a high-speed<br />
network and 1/0 software for the support of<br />
global change research. In addition, Sequoia<br />
distributed applications require the efficient<br />
movement of very large objects, from tens to<br />
hundreds of megabytes in size. The network<br />
architecture incorporates new designs and<br />
implementations of operating system 1/0 soft<br />
ware. New methods provide significant per<br />
formance improvements for transfers among<br />
devices and processes and between the two.<br />
These techniques reduce or eliminate costly mem<br />
ory accesses, avoid unnecessary processing, and<br />
bypass system overheads to improve through<br />
put and reduce latency.<br />
\'ol. 7 ;-..: " ·' 199S<br />
I<br />
Joseph Pasquale<br />
Eric W. Anderson<br />
Kevin Fall<br />
Jonathan S. Kay<br />
In the Sequoia 2000 project, 11 e �1ddressed rile [Jroh<br />
lem of designing a distri buted computn SI'Stem th�H<br />
can efticientlv rerriel'l:, store, r conraincr shipping. (for a complete<br />
discussion of container shipping, see Re krence l.)<br />
Since conrainn shipping is a nell' design, this p:�per<br />
de1·orcs more sp:�ce to ir in re l:trion to the other sur<br />
' c1·ed '' ork ( 11·hose der�1ikd descriptions m:l\' be fo und<br />
in Rdcret1ces 2 ro 9). In addi rion ro this 11 ork, '' e con<br />
ducred orhn ncrll'ork studies �1s parr of the Sequoi�1<br />
2000 l)rojccr. These include 1-csearch on protocols to<br />
pro1·ide f)Crtormance guaranrecs .m d multic1sting . 10 , -<br />
To support a higJJ-pcrt(mn:lncc distributed comput<br />
ing etwimnmenr in ll'bicb :lpplicuions em cftCetii'Cl\'<br />
manipul�ue large d:H:J objects, 1\'C 11-cre concerned 11·ith<br />
achin in g. high throughput during the rranster of these<br />
objccrs. The processes or de�· ices representing the d;1L1<br />
sources :md sinks m:ty all reside on the s'1mc work<br />
St
(mu ltiple node case ). ln either c:�se , we w�1nn.:d appli<br />
C1tions , be the�· c1rth science distributed computa<br />
tions or collabor�nion tools i111·oh-ing multipoint<br />
video, ro 111�1kc fu ll use of the raw bandwidth prm·ided<br />
bv the underlving communicarion svsrem.<br />
In the multiple node case , the ra11 band11·idrh is<br />
h·om 45 ro 100 meg�1birs per second (Mb/s), bec:ll!se<br />
the Sequoi�1 2000 nemork used T3 links t(>r long<br />
distJnce communication :md a ti ber distri buted data<br />
inrcrbce (FDD! ) tc>r loc1l area communicnion. ln rhe<br />
single node case, rhc l'JII' band11·idth is approx imJteh·<br />
100 megabnes �1cr second, since rhe workstation of<br />
choice w:�s one of the D ECstation 5000 series or the<br />
Alph�1-powcred DEC 3000 series, both of which use<br />
the TL! RBOchannel �1s the SI'Stcm hus.<br />
Our 11·ork foc used onh· on soft11·;m:: impro1·emct1ts,<br />
in particular how to achieve maximum svsrem sotT\\':tre<br />
perte>rnuncc gi1-cn the hardware we selected. In bet,<br />
we fo und that the throughput bottlenecks in the<br />
Seq uoi�1 distributed computing cm·ironment 11·ere<br />
indeed in the 11·orkst:1rion 's operating svstcm softii'Jre,<br />
and nor in the undnlving communication svstem<br />
hard11·�1re (e.g., nctll'ork links or the svstem bus) · . This<br />
problem is not limited to the Scqu(;iJ em ironment:<br />
gi1·cn modern high-speed workstations ( 100+ mil lions<br />
of instructions per second [mips 1) r<br />
.FDDI bo:�rds, both models DEF'lA, 11· hich supports<br />
both send and receive direct memory �Kccss (r:li'viA),<br />
:111d DEfZA, ll'hich supports onlv rccci1·c DlviA.<br />
The Data Transfer Problem<br />
Since a d�1r;1 sou rce or si nk mav he either a process or<br />
d e1·ice, and the operating SI'Stem gcncralh- pcrt())'lns<br />
the fu nction of' transtCrring data bct11·cen processes<br />
and devices, undcrst:1nding the hotrlenceks in these<br />
operating sysn:m data p:1 ths is kcv ro improving<br />
perfonn;1nec. These datJ paths gcncralh- imoh-c trJ <br />
�·ersing numerous la1·crs ofoper:ni ng s1·stetn softwa re .<br />
l n the case of netll'ork rr:1nskrs, the data paths �11-c<br />
ex tended bv lavers of ncnvork protocol sofTware.
To undnst;uld rhc pcd(nm;mcc p roblem 11·c \\'CIT<br />
tr�·ing ro sokc , consider a common c l i e m- se!Yc r i nrcr<br />
acrion in which
device is ;� network �l liaptcr. In this p�1per, \\'C usc<br />
rhe term dolo lntll4.erJII"oblem to re kr to the problem<br />
of red ucing these ovc r hc;�d s ro achi e1·c h i gh through<br />
pur between ;� source device Jlld a sink device, either<br />
of \\"hich can he ;� net11·ork aLhpter 11 ·i rh in �� single<br />
worksr:� ri on.<br />
Al though the data rr:mskr problem mal' also oi st in<br />
intermediate routers, ir docs so to :1 much lesser<br />
degree than with end-user workstations (assu ming<br />
modem router softwa re and h:�rd\\'�lrc te chnoJog,· ) .<br />
Thi s i s because of �� rourc r' s si mpli �i ed execution envi<br />
ronmellt :�nd i rs red uced needs �() r transfers Kross<br />
m ultiple protected domains. Ho\\"CI"Cr, there is noth<br />
ing rh�n prec ludes rhc :�ppJicnion of the tec h niq u es<br />
discussed in th is paper to rourer SOlT\\";Jre. In ri ct, since<br />
ll"e used gcnerJI-purposc workstations �or routers ro<br />
su pporr �� tlex i blc, mod i �i ablc rest bed �(Jr ex peri men<br />
t�ltion ll"ith nc11· protocol s, our work ll"as also applied<br />
ro rourer software.<br />
In the nor three sections, ll"e descri be 1·arious<br />
�1 ppro:�ches to soil-ing rhe data rr:�nskr problem . Since<br />
data copving/rouc hing is a major sottware limitation<br />
in achie�·ing hi gh throughput, a1·oiding data co�wing/<br />
touching is a constant theme. tVl uch of our work<br />
im·olvcs fi nding \1'
hgs to cs_re�H.I , without el i ling cs_rnap. Unm�lpping<br />
is Jutomnic in cs_IHi tc , so if cs_unm:1p is not used,<br />
t\1'0 S\'Stem Cl iis JI"C Sti li sur"fi cicnt .<br />
As shol\'n in hgure 3, a pmcess reJds d
11rinen is lost, �1 ll'ri ti ng process must copl' ::uw d�1t�1<br />
th�1t ll'i\1 still be needed �1 frer the ll'rirc . �ortu narelv,<br />
applications thJt n101T great 1·o\umes of tbt�1 often<br />
h�11·e no t"u rthcr need t(Jr ir �1 fre r �1 IITite is co mpl eted .<br />
Implica tions of Virtual Memory Remapping<br />
In
YO<br />
communicnion (JPC:) 11·irhin the ULTR!X 1·e rsion<br />
4.2;� operatin g system on a DECstation 5000/200<br />
svstc m.' With the nc11· DEC OSI-jl i m plementation<br />
011 Alpha worksr.1tions, liT com 11ared the l/0 puformance<br />
ofconvcntionJI UNIX ljO to that ofcollt;lincr<br />
s hipping rix ;1 l'arictl' of 1/0 de� ices as we ll
USER<br />
SPACE<br />
APPLICATION LAY ER<br />
PROCESS<br />
Figure 5<br />
:\ Sf.Jlicc Cnl1lll'Cring a Source ro J Sink Dc1 icc<br />
de1·icc. If the l/0 bus and the dl'l ices support hardware<br />
streaming, the data p:1th is direcrlv over the bus,<br />
�ll'oiding Sl'stem memon· altogether. Although the<br />
process docs nor necess�1rily manipui
92<br />
been implemented 11 irhin the U LT !U :\ 1crsion 4.2J<br />
operating svstcm t(Jr the DEC SOOO serit:s worksr:�rions,<br />
:md 11 irhin DEC OSF/ 1 ITrsion 2.0 ti1r DJ-:C<br />
3000 series (Aipha-pmH·red ) 11 orkst:�rions, c;H.:h t(Jr<br />
a l i mited number ofdc:,·ices.<br />
Three pert(mnance c1 .1 luation studies of PPIO<br />
ha,·e been c:�rried our :Jllli :�re described in our earilj)Jpers.2·'<br />
' ThLI' indicate Cl'U :JI';liLlbilin· imprm es l) \'<br />
30 percent or more; and throughput and Lw:nc1·<br />
impro1·c b1· ;1 bctor of t11·o to three, depending on<br />
the speed of l/0 devices. (3 ener:111v, the l:�tencv ;md<br />
throughput pcrform.l !Ke imprm emenrs ofkt·�d bl·<br />
PPJO improve with bster 1/0 de1·iccs, indicating th,{t<br />
1'1'10 scales well with ne11· IjO dn ice rechnolog�·.<br />
Improving Network Software Throug hput<br />
Net11·ork 1/0 presents a Sflecia] 11roblcm in th;lt rhc<br />
complcxitl' of the abstraction layer (sec Figu re 2), .1<br />
sr:�ck of network protocols, is gencr:�lh- much greater<br />
than that tor other types of 1/0. In this sectio11, 11 c<br />
summari1.e the results of an ;mall'sis of m·crhe:�ds r(Jr<br />
an implcment:nion ofTCI'/IP 11c used in rhc Scquoi;l<br />
2000 project. The prim:11'1' bmrlcncck in :K hiel'itw<br />
high throughput communic:ltion ti x TC P /l l' is du�<br />
to data-touching opcr:nions: one expected culprit is<br />
d:lt 199�<br />
• ErrorChk: The caregon· of checks t(x user :1 11d SI'Srem<br />
errors, such :1s pJramcter checking on socket<br />
s1·stem c:111s<br />
• Other: This tin:1l cJtegorv of OI'Ct-llC;ld includes :1!1<br />
the opcr;nions th
Figure 6<br />
3000<br />
� 2500<br />
f=<br />
(.9<br />
z<br />
�- 2000<br />
w
4.2:� op eratin g S\'Stcm, ;md a 27 percent llnproYemciH<br />
over the implcment•Hion with the opti mi zed check<br />
sum comput:�tion algoi·irbm .<br />
Of course, one must be ,.er,· c;lrcti.Ii ;l hout d l lli II,� o //{/ S1 :..;fL'IIIS<br />
f/C\ 1
5. ]. K�v and /. P�1sq u al c, "The lmf)orr�liKe of Non<br />
D.1L1-Touching Processing Q,crhcads in TC I'/ IP,"<br />
f>mceediugs of' the ACl/ Com nlltllicatiuus Architec<br />
fllrcs onrl l'mtucols Co11/('rence ISIG'CUl/.1/J.<br />
S.1n hJ11cisco (Sq,re mbcr 199�), P�'· 259-269.<br />
6. /. ](�,. �1nd /. 1'�1squ�Ic. "Mc�1surcmenr, Ano1lvsis,<br />
and lmpro\'C illCIH of LI DI'/ll' Throughpu t rc n rhe<br />
DECs r�1tion 5000," l'mceedinp,s c;j ' the I 'SI:NJX<br />
\fin fer 'f ('cb nolugr C!Jtlj(•rence. S:111 D ie go ()Jnuan·<br />
I99�). pp. 249-25H.<br />
7. J. KJ\' and J. 1'.1sq u �1 lc, "A Sum nun· of l'C:P /II'<br />
Net\\ orking Soriwarc Pedcm11JI\Ce fo r the DECst�tion<br />
5000," l'r()(.:net<br />
Afifirooc/J ( 13erkc lc,·, C1 lif.: lnrernatiotd Computer<br />
Science lnstiture, <strong>Technical</strong> Report TR-92-072, 1992 ).<br />
II. H. Zb�mg, D. Verma, Jnd D. FerrJri, "Design and<br />
lmplemenr.1rion of the 1\eJI-Tillll' Inrnnet Prorrnunce of<br />
lv kss;lge-P�ssing Using Restri cted Virtual Memor!'<br />
Rc maf1ping," Sojill'are-l'mcfice and F .. \pcrieuce.<br />
,·ol . 21, no. 3 (I 991 ): 251-267.<br />
26. P. Druschel and L. Peterson, "Fbuts: a High · b;n1dH'idth<br />
Cmss- Domc1in Tra 11sfer bcilin·," Pmccediug-; of<br />
the 14th ACil l S) 'lilj)(JS iwn l!il OfWi'Cifillf!, .'>)·steill l'riii<br />
Cij;/es (SOSl'). AshL' illc, N.C. (December 1993 ), pp.<br />
189-202.<br />
27. J. !vlogul and A. Borg, "The Effect of Context S\\ itches<br />
on Cache J'ertorm�111(c," l'mceedin,r.;s of' the ASI'f.OS-<br />
1\ '( Apri l I 99 1 ), pp. 75-84.<br />
2tl. R. Bershad, T. Anderson, E. Lazo\\'ska, �111d H. Le\'\',<br />
" Li gh twei gh t Remote Procedure C:Jil," ACIV/<br />
Tra nsactions 011 Computer SJ'sfeills. H>l. 8, no. I<br />
(1990): 37-55.<br />
<strong>Digital</strong> Tl'l'hnical Journ.1l Vol. 7 :-Jo. 3 <strong>1995</strong> 95
29. 1'- l)ruschcl, M. Abbott, ,\.\. I'Jge ls, cl lld I .. Pcre1·son,<br />
"An,1il'sis of !jO Suhwsrc m Design t(n J'vl ulri mcdic1<br />
Wo rksr,nions," Proceerlill,�.\ n/ tbe /birr! !utcrurt<br />
liourt! \\ ur/;sh (ljJ (J/f .\eitt ·ud! r111d Ojir rill' rou ting architecture
Call fo r Authors<br />
from <strong>Digital</strong> Press<br />
<strong>Digital</strong> Press is an imprint of Butterworth-Heinemann, a major international pub<br />
lisher of professional books and a member of the Reed Elsevier group. <strong>Digital</strong><br />
Press is the authorized publisher fo r <strong>Digital</strong> Equipment Corporation: The two<br />
companies are working in partnership to identify and publish new books under the<br />
<strong>Digital</strong> Press imprint and create opportunities fo r authors to publish their work.<br />
<strong>Digital</strong> Press is committed to publishing high-quality books on a wide variety<br />
of subjects. We would like to hear from you ifyou are writing or thinking about<br />
writing a book.<br />
Contact: Mike Cash, <strong>Digital</strong> Press Manager, or<br />
Liz McCarthy, Assistant Editor<br />
DIGITAL PRESS<br />
313 Washi ngton Street<br />
Newton, MA 02158-1626<br />
U.S.A.<br />
Tel (617) 928-2649, Fax: (617) 928-2640<br />
E-mail : Mike.Cash@BHein.rel.co.uk or<br />
LizMc@world .std .com
ISSN 0898-901X<br />
Printed in U.S.A. EY-T8 38E-TJ/YS 12 14 18.0 Copvright © <strong>Digital</strong> Equipment Corporation. All Rights Resen·ed.