DTJ Volume 8 Number 2 1996 - Digital Technical Journals
<strong>Digital</strong><br />
<strong>Technical</strong><br />
Journal<br />
SPIRALOG LOG-STRUCTURED FILE SYSTEM<br />
OPENVMS FOR 64-BIT ADDRESSABLE<br />
VIRTUAL MEMORY<br />
HIGH-PERFORMANCE MESSAGE PASSING<br />
FOR CLUSTERS<br />
SPEECH RECOGNITION SOFTWARE<br />
<strong>Volume</strong> 8 <strong>Number</strong> 2<br />
<strong>1996</strong>
Editorial<br />
Jane C. Blake, Managing Editor<br />
Kathleen M. Stetson, Editor<br />
Helen L. Patterson, Editor<br />
Circulation<br />
Catherine M. Phillips, Administrator<br />
Dorothea B. Cassady, Secretary<br />
Production<br />
Terri Autieri, Production Editor<br />
Anne S. Katzeff, Typographer<br />
Peter R. Woodbury, Illustrator<br />
Advisory Board<br />
Samuel H. Fuller, Chairman<br />
Richard W. Beane<br />
Donald Z. Harbert<br />
William R. Hawe<br />
Richard J. Hollingsworth<br />
William A. Laing<br />
Richard F. Lary<br />
Alan G. Nemeth<br />
Pauline A. Nist<br />
Robert M. Supnik<br />
Cover Design<br />
<strong>Digital</strong>'s new Spiralog file system, a featured topic in this issue, supports full 64-bit system capability and fast backup and is integrated with the OpenVMS 64-bit version 7.0 operating system. The cover graphic captures the inspired character of the Spiralog design effort and illustrates a concept taken from University of California research in which the whole disk is treated as a single, sequential log and all file system modifications are appended to the tail of the log.<br />
The cover was designed by Lucinda O'Neill of <strong>Digital</strong>'s Design Group using images from PhotoDisc, Inc., copyright <strong>1996</strong>.<br />
The <strong>Digital</strong> Technical Journal is a refereed journal published quarterly by <strong>Digital</strong> Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, MA 01460.<br />
Subscriptions can be ordered by sending a check in U.S. funds (made payable to <strong>Digital</strong> Equipment Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. <strong>Digital</strong>'s customers may qualify for gift subscriptions and are encouraged to contact their account representatives. Single copies and back issues are available for $16.00 (non-U.S. $18) each and can be ordered by sending the requested issue's volume and number and a check to the published-by address. See the Further Readings section in the back of this issue for a complete listing. Recent issues are also available on the Internet at http://www.digital.com/info/dtj.<br />
<strong>Digital</strong> employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com or by entering VTX PROFILE at the system prompt.<br />
Inquiries, address changes, and complimentary subscription orders can be sent to the <strong>Digital</strong> Technical Journal at the published-by address or the electronic mail address, dtj@digital.com. Inquiries can also be made by calling the Journal office at 508-486-2538.<br />
Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or electronic mail address.<br />
Copyright © <strong>1996</strong> <strong>Digital</strong> Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of <strong>Digital</strong> Equipment Corporation's authorship is permitted.<br />
The information in the Journal is subject to change without notice and should not be construed as a commitment by <strong>Digital</strong> Equipment Corporation or by the companies herein represented. <strong>Digital</strong> Equipment Corporation assumes no responsibility for any errors that may appear in the Journal.<br />
ISSN 0898-901X<br />
Documentation <strong>Number</strong> EC-N6992-18<br />
Book production was done by Quantic Communications, Inc.<br />
The following are trademarks of <strong>Digital</strong> Equipment Corporation: AlphaServer, DEC, DECtalk, <strong>Digital</strong>, the DIGITAL logo, HSC, OpenVMS, PATHWORKS, POLYCENTER, RZ, TruCluster, VAX, and VAXcluster.<br />
BBN Hark is a trademark of Bolt Beranek and Newman Inc.<br />
Encore is a registered trademark and MEMORY CHANNEL is a trademark of Encore Computer Corporation.<br />
FAServer is a trademark of Network Appliance Corporation.<br />
Listen for Windows is a trademark of Verbex Voice Systems, Inc.<br />
Microsoft and Win32 are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.<br />
MIPSpro is a trademark of MIPS Technologies, Inc., a wholly owned subsidiary of Silicon Graphics, Inc.<br />
Netscape Navigator is a trademark of Netscape Communications Corporation.<br />
PAL is a registered trademark of Advanced Micro Devices, Inc.<br />
UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.<br />
VoiceAssist is a trademark of Creative Labs, Inc.<br />
X Window System is a trademark of the Massachusetts Institute of Technology.
Contents<br />
Foreword<br />
SPIRALOG LOG-STRUCTURED FILE SYSTEM<br />
Overview of the Spiralog File System<br />
Design of the Server for the Spiralog File System<br />
Designing a Fast, On-line Backup System for<br />
a Log-structured File System<br />
Integrating the Spiralog File System into the<br />
OpenVMS Operating System<br />
OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY<br />
Extending OpenVMS for 64-bit Addressable<br />
Virtual Memory<br />
The OpenVMS Mixed Pointer Size Environment<br />
Rich Marcello<br />
James E. Johnson and William A. Laing<br />
Christopher Whitaker, J. Stu
Editor's<br />
Introduction<br />
This past spring when we surveyed Journal subscribers, readers took the time to comment on the particular value of the issues featuring <strong>Digital</strong>'s 64-bit Alpha technology. The engineering described in those two issues continues, with ever higher levels of performance in Alpha microprocessors, servers, clusters, and systems software. This issue presents recent developments: a log-structured file system, called Spiralog; the OpenVMS operating system extended to take full advantage of 64-bit addressing; high-performance message passing for Alpha clusters; and speech recognition software for Alpha workstations.<br />
Spiralog is a wholly new clusterwide file system integrated with the new 64-bit OpenVMS version 7.0 operating system and is designed for high data availability and high performance. The first of four papers about Spiralog is written by Jim Johnson and Bill Laing, who introduce log-structured file system (LFS) concepts, the university research behind the design, and design innovations.<br />
Of interest to scientific users, the parallel-programming tool described next does not run on the OpenVMS Alpha system but instead on UNIX clusters connected with MEMORY CHANNEL technology. Jim Lawton, John Brosnan, Morgan Doyle, Seosamh Ó Riordáin, and Tim Reddin review the challenges in designing the TruCluster MEMORY CHANNEL Software product, which is a message-passing system intended for builders of parallel software libraries and implementers of parallel compilers. The product reduces communications latency to less than 10 microseconds in shared memory systems.<br />
Finally, Bernie Rozmovits presents the design of user interfaces for the <strong>Digital</strong> Speech Recognition Software (DSRS) product. Although DSRS is targeted for <strong>Digital</strong>'s Alpha workstations running UNIX, the implementers' work to ensure the product's ease of use can be generally applied.
Foreword<br />
Rich Marcello<br />
Vice President, OpenVMS Systems<br />
Software Group<br />
The papers you will read in this issue of the Journal describe how we in the OpenVMS engineering community set out to bring the OpenVMS operating system and our loyal customer base into the twenty-first century. The papers present both the development issues and the technical challenges faced by the engineers who delivered the OpenVMS operating system version 7.0 and the Spiralog file system, a new log-structured file system for OpenVMS.<br />
We are extremely proud of the results of these efforts. In December 1995 at U.S. Fall DECUS (<strong>Digital</strong> Equipment Computer Users Society), <strong>Digital</strong> announced OpenVMS version 7.0 and the Spiralog file system as part of a first wave of product deliveries for the OpenVMS Windows NT Affinity Program. OpenVMS version 7.0 provides the "unlimited high end" on which our customers can build their distributed computing environments and move toward the next millennium.<br />
The release of OpenVMS version 7.0 in January of this year represents the most significant engineering enhancement to the OpenVMS operating system since <strong>Digital</strong> released the VAXcluster system in 1983. OpenVMS version 7.0 extends the 32-bit architecture of OpenVMS to a 64-bit architecture.
In fact, Digital has two 64-bit operating systems with this power: the OpenVMS and the Digital UNIX operating systems.<br />
As noted above, Digital introduced the OpenVMS operating system with support for full 64-bit virtual addressing at the same time it introduced the Spiralog file system, in December 1995. The Spiralog design is based on the Sprite log-structured file system from the University of California, Berkeley. With its use of this log-structured approach, Spiralog offers major new performance features, including fast, application-consistent, on-line backup. Further, it is fully compatible with customers' existing Files-11 file systems, and applications that run on Files-11 will run on Spiralog with no modification. To deliver all of the features we felt were essential to meet the needs of our loyal customer base, the Spiralog team examined and resolved a number of technical issues. The papers in this issue describe some of the challenges they faced, including the decision to design a Files-11 file system emulation.<br />
The delivery of the OpenVMS version 7.0 operating system and the Spiralog file system are part of Digital's continued commitment to the OpenVMS customer base. These products represent the work of dedicated, talented engineering teams that have deployed state-of-the-art technology in products that will help our customers remain competitive for years to come.<br />
In the OpenVMS group as elsewhere in Digital, we are committed to excellence in the development and delivery of business computing solutions. We will continue to maintain and enhance a product portfolio that meets our customers' need for true 24-hour by 365-day access to their data, full integration with Microsoft Windows NT environments, and the full complement of network solutions and application software for today and well into the next millennium.<br />
Overview of the Spiralog<br />
File System<br />
The OpenVMS Alpha environment requires a file system that supports its full 64-bit capabilities. The Spiralog file system was developed to increase the capabilities of <strong>Digital</strong>'s Files-11 file system for OpenVMS. It incorporates ideas from a log-structured file system and an ordered write-back model. The Spiralog file system provides improvements in data availability, scaling of the amount of storage easily managed, support for very large volume sizes, support for applications that are either write-operation or file-system-operation intensive, and support for heterogeneous file system client types. The Spiralog technology, which matches or exceeds the reliability and device independence of the Files-11 system, was then integrated into the OpenVMS operating system.<br />
James E. Johnson<br />
William A. Laing<br />
<strong>Digital</strong>'s Spiralog product is a log-structured, clusterwide file system with integrated, on-line backup and restore capability and support for multiple file system personalities. It incorporates a number of recent ideas from the research community, including the log-structured file system (LFS) from the Sprite file system and the ordered write back from the Echo file system.1,2<br />
The Spiralog file system is fully integrated into the OpenVMS operating system, providing compatibility with the current OpenVMS file system, Files-11. It supports a coherent, clusterwide write-behind cache and provides high-performance, on-line backup and per-file and per-volume restore functions.<br />
In this paper, we first discuss the evolution of file systems and the requirements for many of the basic designs in the Spiralog file system. Next we describe the overall architecture of the Spiralog file system, identifying its major components and outlining their designs. Then we discuss the project's results: what worked well and what did not work so well. Finally, we present some conclusions and ideas for future work.<br />
Some of the major components, i.e., the backup and restore facility, the LFS server, and OpenVMS integration, are described in greater detail in companion papers in this issue.3-5<br />
The Evolution of File Systems<br />
File systems have existed throughout much of the history of computing. The need for libraries or services that help to manage the collection of data on long-term storage devices was recognized many years ago. The early support libraries have evolved into the file systems of today. During their evolution, they have responded to the industry's improved hardware capabilities and to users' increased expectations. Hardware has continued to decrease in price and improve in its price/performance ratio. Consequently, ever larger amounts of data are stored and manipulated by users in ever more sophisticated ways. As more and more data are stored on-line, the need to access that data 24 hours a day, 365 days a year has also escalated.<br />
Significant improvements to file systems have been made in the following areas:<br />
• Directory structures to ease locating data<br />
• Device independence of data access through the file system<br />
• Accessibility of the data to users on other systems<br />
• Availability of the data, despite either planned or unplanned service outages<br />
• Reliability of the stored data and the performance of the data access<br />
Requirements of the OpenVMS File System<br />
Since 1977, the OpenVMS operating system has offered a stable, robust file system known as Files-11. This file system is considered to be very successful in the areas of reliability and device independence. Recent customer feedback, however, indicated that the areas of data availability, scaling of the amount of storage easily managed, support for very large volume sizes, and support for heterogeneous file system client types were in need of improvement.<br />
The Spiralog project was initiated in response to customers' needs. We designed the Spiralog file system to match or somewhat exceed the Files-11 system in its reliability and device independence. The focus of the Spiralog project was on those areas that were due for improvement, notably:<br />
• Data availability, especially during planned operations, such as backup. If the storage device needs to be taken offline to perform a backup, even at a very high backup rate of 20 megabytes per second (MB/s), almost 14 hours are needed to back up 1 terabyte. This length of service outage is clearly unacceptable. More typical backup rates of 1 to 2 MB/s can take several days, which, of course, is not acceptable.<br />
• Greatly increased scaling in total amount of on-line storage, without greatly increasing the cost to manage that storage. For example, 1 terabyte of disk storage currently costs approximately $250,000, which is well within the budget of many large computing centers. However, the cost in staff and time to manage such amounts of storage can be many times that of the storage.6 The cost of storage continues to fall, while the cost of managing it continues to rise.<br />
• Effective scaling as more processing and storage resources become available. For example, OpenVMS Cluster systems allow processing power and storage capacity to be added incrementally. It is crucial that the software supporting the file system scale as the processing power, bandwidth to storage, and storage capacity increase.<br />
• Improved performance for applications that are either write-operation or file-system-operation intensive. As file system caches in main memory have increased in capacity, data reads and file system read operations have become satisfied more and more from the cache. At the same time, many applications write large amounts of data or create and manipulate large numbers of files. The use of redundant arrays of inexpensive disks (RAID) storage has increased the available bandwidth for data writes and file system writes. Most file system operations, on the other hand, are small writes and are spread across the disk at random, often negating the benefits of RAID storage.<br />
• Improved ability to transparently access the stored data across several dissimilar client types. Computing environments have become increasingly heterogeneous. Different client systems, such as the Windows or the UNIX operating system, store their files on and share their files with server systems such as the OpenVMS server. It has become necessary to support the syntax and semantics of several different file system personalities on a common file server.<br />
These needs were central to many design decisions we made for the Spiralog file system.<br />
The members of the Spiralog project evaluated much of the ongoing work in file systems, databases, and storage architectures. RAID storage makes high bandwidth available to disk storage, but it requires large writes to be effective. Databases have exploited logs and the grouping of writes together to minimize the number of disk I/Os and disk seeks required. Databases and transaction systems have also exploited the technique of copying the tail of the log to effect backups or data replication. The Sprite project at Berkeley had brought together a log-structured file system and RAID storage to good effect.1<br />
By drawing from the above ideas, particularly the insight of how a log structure could support on-line, high-performance backup, we began our development effort. We designed and built a distributed file system that made extensive use of the processor and memory near the application and used log-structured storage in the server.<br />
Spiralog File System Design<br />
The main execution stack of the Spiralog file system consists of three distinct layers. Figure 1 shows the overall structure. At the top, nearest the user, is the file system client layer. It consists of a number of file system personalities and the underlying personality-independent services, which we call the VPI.<br />
Figure 1<br />
Spiralog Structure Overview<br />
Two file system personalities dominate the Spiralog design. The F64 personality is an emulation of the Files-11 file system. The file system library (FSLIB) personality is an implementation of Microsoft's New Technology Advanced Server (NTAS) file services for use by the PATHWORKS for OpenVMS file server.<br />
The next layer, present on all systems, is the clerk<br />
layer. It supports a distributed cache and ordered write<br />
back to the LFS server, giving single-system semantics<br />
in a cluster configuration.<br />
The LFS server, the third layer, is present on all designated<br />
server systems. This component is responsible<br />
for maintaining the on-disk log structure; it includes<br />
the cleaner, and it is accessed by multiple clerks. Disks<br />
can be connected to more than one LFS server, but they are served only by one LFS server at a time. Transparent failover, from the point of view of the file system client layer, is achieved by cooperation between the clerks and the surviving LFS servers.<br />
The backup engine is present on a system with an<br />
active LFS server. It uses the LFS server to access the<br />
on-disk data, and it interfaces to the clerk to ensure<br />
that the backup or restore operations are consistent<br />
with the clerk's cache.<br />
Figure 2 shows a typical Spiralog cluster configuration. In this cluster, the clerks on nodes A and B are accessing the Spiralog volumes. Normally, they use the LFS server on node C to access their data. If node C should fail, the LFS server on node D would immediately provide access to the volumes. The clerks on nodes A and B would use the LFS server on node D, retrying all their outstanding operations. Neither user application would detect any failure. Once node C had recovered, it would become the standby LFS server.<br />
Figure 2<br />
Spiralog Cluster Configuration<br />
File System Client Design<br />
The file system client is responsible for the traditional file system functions. This layer provides files, directories, access arbitration, and file naming rules. It also provides the services that the user calls to access the file system.<br />
VPI Services Layer  The VPI layer provides an underlying primitive file system interface, based on the UNIX VFS switch. The VPI layer has two overall goals:<br />
1. To support multiple file system personalities<br />
2. To effectively scale to very large volumes of data and very large numbers of files<br />
To meet the first goal, the VPI layer provides<br />
• File names of 256 Unicode characters, with no reserved characters<br />
• No restriction on directory depth<br />
• Up to 255 sparse data streams per file, each with 64-bit addressing<br />
• Attributes with 255 Unicode character names, containing values of up to 1,024 bytes<br />
• Files and directories that are freely shared among file system personality modules<br />
To meet the second goal, the VPI layer provides<br />
• File identifiers stored as 64-bit integers<br />
• Directories through a B-tree, rather than a simple linear structure, for log(n) file name lookup time<br />
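The log(n) lookup claim can be illustrated with a small sketch. This is not Spiralog's actual on-disk B-tree; it is a hypothetical in-memory stand-in that keeps directory entries sorted so that a name probe costs O(log n) comparisons, which is the same asymptotic behavior the VPI layer gets from its B-tree directories (a B-tree additionally spreads the sorted keys over fixed-size on-disk nodes).

```python
import bisect

class Directory:
    """Toy directory: sorted file names with a parallel list of
    64-bit file identifiers. Lookup is a binary search, i.e.
    O(log n) probes, versus O(n) for a linear scan of entries."""

    def __init__(self):
        self._names = []   # sorted file names
        self._ids = []     # parallel list of file identifiers

    def insert(self, name, file_id):
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            raise FileExistsError(name)
        self._names.insert(i, name)
        self._ids.insert(i, file_id)

    def lookup(self, name):
        i = bisect.bisect_left(self._names, name)  # binary search
        if i < len(self._names) and self._names[i] == name:
            return self._ids[i]
        raise FileNotFoundError(name)
```

With a million entries, a lookup touches about 20 keys instead of scanning all of them; the on-disk B-tree achieves the same with a handful of node reads.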
The VPI layer is only a base for file system personalities. Therefore it requires that such personalities are trusted components of the operating system. Moreover, it requires them to implement file access security (although there is a convention for storing access control list information) and to perform all necessary cleanup when a process or image terminates.<br />
F64 File System Personality  As previously stated, the Spiralog product includes two file system personalities, F64 and FSLIB. The F64 personality provides a service that emulates the Files-11 file system. Its functions, services, available file attributes, and execution behaviors are similar to those in the Files-11 file system. Minor differences are isolated into areas that receive little use from most applications.<br />
For instance, the Spiralog file system supports the various Files-11 queued I/O ($QIO) parameters for returning file attribute information, because they are used implicitly or explicitly by most user applications.<br />
On the other hand, the hlcs-1 1 method of read i ng<br />
the ri le hear the ti le name, by<br />
- Re ading some dirccrorv entries (VPI)<br />
- Searching the dirccrorv buftl.:r r( )r the hie name<br />
- Exiting the l oop, ifrhc 1mtch is tcllll1d<br />
• Access the target ti le (VPl)<br />
• Read rhc ri le's attributes (VI'!)<br />
• Audit the hie open Htempt<br />
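The steps above can be sketched as a small loop over hypothetical VPI primitives. The names `read_dir_entries`, `access_file`, and `read_attributes` are stand-ins invented for illustration; the paper does not specify the actual VPI call signatures.

```python
def audit_open_attempt(name, success):
    """Stand-in for the personality's security audit step."""
    pass

def open_by_name(vpi, dir_id, name):
    """Sketch of an F64-style open-by-name composed from VPI calls.

    `vpi.read_dir_entries` is assumed to return a batch of
    (entry_name, file_id) pairs plus an opaque resume marker,
    or None for the marker when the directory is exhausted."""
    marker = None
    while True:
        entries, marker = vpi.read_dir_entries(dir_id, marker)
        for entry_name, file_id in entries:
            if entry_name == name:            # match found: exit the scan
                handle = vpi.access_file(file_id)
                attrs = vpi.read_attributes(handle)
                audit_open_attempt(name, success=True)
                return handle, attrs
        if marker is None:                    # no more directory entries
            audit_open_attempt(name, success=False)
            raise FileNotFoundError(name)
```

The point of the decomposition is that the personality, not the VPI layer, owns the name-matching policy and the audit, while the VPI layer supplies only the primitive directory and file operations.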
FSLIB File System Personality  The FSLIB file system personality is a specialized file system to support the PATHWORKS for OpenVMS file server. Its two major goals are to support the file names, attributes, and behaviors found in Microsoft's NTAS file access protocols, and to provide low run-time cost for processing NTAS file system requests.<br />
The PATHWORKS server implements a file service for personal computer (PC) clients layered on top of the Files-11 file system services. When NTAS service behaviors or attributes do not match those of Files-11, the PATHWORKS server has to emulate them. This can lead to checking security access permissions twice, mapping file names, and emulating file attributes. Many of these problems can be avoided if the VPI interface is used directly. For instance, because the FSLIB personality does not layer on top of a Files-11 personality, security access checks do not need to be performed twice. Furthermore, in a straightforward design, there is no need to map across different file naming conventions.<br />
Clerk Design<br />
The clerks are responsible for managing the caches, determining the order of writes out of the cache to the LFS server, and maintaining cache coherency within a cluster. The caches are write behind in a manner that preserves the order of dependent operations.<br />
The clerk-server protocol controls the transfer of data to and from stable storage. Data can be sent as a multiblock atomic write, and operations that change multiple data items, such as a file rename, can be made atomically. If a server fails during a request, the clerk treats the request as if it were lost and retries the request.<br />
The clerk-server protocol is idempotent. Idempotent operations can be applied repeatedly with no effects other than the desired one. Thus, after any number of server failures or server failovers, it is always safe to reissue an operation. Clerk-to-server write operations always leave the file system state consistent.<br />
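Why idempotence makes blind retry safe can be shown with a toy model. The server and its failure behavior below are invented for illustration; this is not the Spiralog clerk-server protocol, only the general retry-on-idempotent-write pattern it relies on.

```python
import random

class ToyServer:
    """Toy server whose write is idempotent: applying the same
    (key, value) write twice leaves the same state as applying it once.
    A failure may strike before or after the write takes effect, so a
    lost reply is indistinguishable from a lost request."""

    def __init__(self):
        self.state = {}

    def write(self, key, value):
        if random.random() < 0.5:          # simulated server failure
            if random.random() < 0.5:
                self.state[key] = value    # applied, but the reply is lost
            raise ConnectionError("server failed during request")
        self.state[key] = value
        return "ok"

def clerk_write(server, key, value, max_tries=100):
    """Clerk-side retry: treat any failure as a lost request and reissue.
    Idempotence guarantees the duplicate application is harmless."""
    for _ in range(max_tries):
        try:
            return server.write(key, value)
        except ConnectionError:
            continue                       # reissue the same operation
    raise RuntimeError("server unavailable")
```

Even when the first attempt actually took effect before the failure, the retry simply rewrites the same value, so the final state is the same as a single successful write.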
The clerk-clerk protocol protects the user data and file system metadata cached by the clerks. Cache coherency information, rather than the cached data itself, is exchanged between the clerks. The write-behind cache has two purposes:<br />
• To avoid writing data that is quickly deleted, such as temporary files<br />
• To combine multiple small writes into larger transfers<br />
The clerks order dependent writes from the cache to the server; consequently, other clerks never see "impossible" states, and related writes never overtake each other. For instance, the deletion of a file cannot happen before a rename that was previously issued to the same file. Related data writes are caused by a partial overwrite, or an explicit linking of operations passed into the clerk by the VPI layer, or an implicit linking due to the clerk-clerk coherency protocol.<br />
The ordering between writes is kept as a directed graph. As the clerks traverse these graphs, they issue the writes in order or collapse the graph when writes can be safely combined or eliminated.<br />
LFS Server Design<br />
The Spiralog file system uses a log-structured, on-disk format for storing data within a volume, yet presents a traditional, update-in-place file system to its users.<br />
Figure 3<br />
Spiralog Address Mapping<br />
Recently, log-structured file systems, such as Sprite, have been an area of active research.1<br />
Within the LFS server, support is provided for the log-structured, on-disk format and for mapping that format to an update-in-place model. Specifically, this component is responsible for<br />
• Mapping the incoming read and write operations from their simple address space to positions in an open-ended log<br />
• Mapping the open-ended log onto a finite amount of disk space<br />
• Reclaiming disk space by cleaning (garbage collecting) the obsolete (overwritten) sections of the log<br />
Figure 3 shows the various mapping layers in the Spiralog file system, including those handled by the LFS server.<br />
Incoming read and write operations are based on a single, large address space. Initially, the LFS server transforms the address ranges in the incoming operations into equivalent address ranges in an open-ended log. This log supports a very large, write-once address space.<br />
Digital Technical Journal Vol. 8 No. 2 1996
A read operation looks up its location in the open-ended log and proceeds. On the other hand, a write operation makes obsolete its current address range and appends its new value to the tail of the log.
In turn, locations in the open-ended log are then mapped into locations on the (finite-sized) disk. This additional mapping allows disk blocks to be reused once their original contents have become obsolete. Physically, the log is divided into log segments, each of which is 256 kilobytes (KB) in length. The log segment is used as the transfer unit for the backup engine. It is also used by the cleaner for reclaiming obsolete log space.
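The append/obsolete behavior and the segment accounting described above can be modeled in a few lines of Python. The 256-KB segment size comes from the text; the class and method names, and the cleanable-fraction threshold, are assumptions of this sketch.

```python
SEGMENT_SIZE = 256 * 1024  # 256-KB log segments, as in the text

class OpenEndedLog:
    """Toy model of an append-only log divided into fixed-size segments."""

    def __init__(self):
        self.tail = 0        # next free log address
        self.live = {}       # log address -> length of live extent
        self.obsolete = {}   # segment number -> obsolete byte count

    def append(self, length):
        """Append a new extent at the tail; returns its log address."""
        addr = self.tail
        self.tail += length
        self.live[addr] = length
        return addr

    def supersede(self, addr):
        """An overwrite makes the old extent at 'addr' obsolete."""
        length = self.live.pop(addr)
        seg = addr // SEGMENT_SIZE
        self.obsolete[seg] = self.obsolete.get(seg, 0) + length

    def cleanable_segments(self, threshold=0.5):
        """Segments whose obsolete fraction makes them worth cleaning."""
        full_segs = range(self.tail // SEGMENT_SIZE)
        return [s for s in full_segs
                if self.obsolete.get(s, 0) / SEGMENT_SIZE >= threshold]
```

Overwriting an extent never touches the old blocks in place; it only marks them obsolete, and the cleaner later reclaims segments that are mostly obsolete.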
We met our customers' need for high-performance, on-line backup save operations. We also met their needs for per-file and per-volume restores and an ongoing need for simplicity and reduction in operating costs.
To provide per-file restore capabilities, the backup utility and the LFS server ensure that the appropriate file header information is stored during the save operation. The saved file system data, including file headers, log mapping information, and user data, are stored in a file known as a saveset. Each saveset, regardless of the number of tapes it requires, represents a single save operation.
To reduce the complexity of file restore operations, the Spiralog file system provides an on-line saveset merge feature. This allows the system manager to merge several savesets, either full or incremental, to form a new, single saveset. With this feature, system managers can have a workable backup save plan that never calls for an on-line full backup, thus further reducing the load on their production systems. Also, this feature can be used to ensure that file restore operations can be accomplished with a small, bounded set of savesets.
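The semantics of merging full and incremental savesets can be sketched as a fold where the newest copy of each file wins. This toy model captures only that override behavior; real savesets also carry file headers and log mapping information, and deletions are not modeled here.

```python
def merge_savesets(savesets):
    """Merge full and incremental savesets into one equivalent saveset.

    'savesets' is ordered oldest to newest; each is a dict mapping a
    file identifier to its saved contents (an incremental saveset holds
    only the files changed since the previous save).  The newest copy of
    each file wins, so restoring from the merged saveset is equivalent
    to replaying the whole sequence.
    """
    merged = {}
    for saveset in savesets:
        merged.update(saveset)
    return merged

# A full save followed by two incrementals collapses to one saveset.
full = {"a.txt": "v1", "b.txt": "v1"}
inc1 = {"a.txt": "v2"}
inc2 = {"c.txt": "v1"}
print(merge_savesets([full, inc1, inc2]))
# → {'a.txt': 'v2', 'b.txt': 'v1', 'c.txt': 'v1'}
```

Because the merged saveset is itself a single saveset, a restore never needs more than a small, bounded set of inputs.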
The Spiralog backup facility is described in detail in this issue.3
Project Results<br />
The Spiralog file system contains a number of innovations in the areas of on-line backup, log-structured storage, clusterwide ordered write-behind caching, and multiple-file-system client support.
The use of log structuring as an on-disk format is very effective in supporting high-performance, on-line backup. The Spiralog file system retains the previously documented benefits of LFS, such as fast write performance that scales with the disk size and throughput that increases as large read caches are used to offset disk reads.1
It should also be noted that the Files-11 file system sets a high standard for data reliability and robustness. The Spiralog technology met this challenge very well: as a result of the idempotent protocol, the cluster failover design, and the recovery capability of the log, we encountered few data reliability problems during development.
In any large, complex project, many technical decisions are necessary to convert research technology into a product. In this section, we discuss why certain decisions were made during the development of the Spiralog subsystems.
VPI Results<br />
The VPI file system was generally successful in providing the underlying support necessary for different file system personalities. We found that it was possible to construct a set of primitive operations that could be used to build complex, user-level, file system operations.
By using these primitives, the Spiralog project members were able to successfully design two distinctly different personality modules. Neither was a functional superset of the other, and neither was layered on top of the other. However, there was an important second-order problem.
The FSLIB file system personality did not have a full mapping to the Files-11 file system. As a consequence, file management was rather difficult, because all the data management tools on the OpenVMS operating system assumed compliance with a Files-11, rather than a VPI, file system.
This problem led to the decision not to proceed with the original design for the FSLIB personality in version 1.0 of Spiralog. Instead, we developed an FSLIB file system personality that was fully compatible with the F64 personality, even when that compatibility forced us to accept an additional execution cost.
We also found an execution cost to the primitive VPI operations. Generally, there was little overhead for
are eligible for cleaning but have not yet been processed. These obsolete areas are also saved by the backup engine. Although they have no effect on the logical state of the log, they do require the backup engine to move more data to backup storage than might otherwise be necessary.
Second, the design initially prohibited the cleaner from running against a log with snapshots. Consequently, the cleaner was disabled during a save operation, which had the following effects: (1) The amount of available free space in the log was artificially depressed during a backup. (2) Once the backup was finished, the activated cleaner would discover that a great number of log segments were now eligible for cleaning. As a result, the cleaner underwent a sudden surge in cleaning activity soon after the backup had completed.
We addressed this problem by reducing the area of the log that was off limits to the cleaner to only the part that the backup engine would read. This limited snapshot window allowed more segments to remain eligible for cleaning, thus greatly alleviating the shortage of cleanable space during the backup and eliminating the postbackup cleaning surge. For an 8-gigabyte time-sharing volume, this change typically reduced the period of high cleaner activity from 40 seconds to less than one-half of a second.
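The limited snapshot window amounts to a simple eligibility filter: only segments the backup engine has yet to read are off limits. A minimal sketch, with function and parameter names assumed for illustration:

```python
def eligible_for_cleaning(candidates, backup_window):
    """Filter cleanable segments against a limited snapshot window.

    'candidates' are segment numbers the cleaner would like to process;
    'backup_window' is the (first, last) inclusive range of segments the
    backup engine has yet to read, or None when no backup is running.
    Only segments inside that window are off limits to the cleaner.
    """
    if backup_window is None:
        return list(candidates)
    first, last = backup_window
    return [seg for seg in candidates if seg < first or seg > last]
```

With the original design the window was effectively the whole log and nothing was cleanable during a save; shrinking it to the unread portion leaves most candidate segments eligible throughout the backup.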
We have not yet experimented with different cleaner algorithms. More work needs to be done in this area to see if the cleaning efficiency, cost, and interactions with backup can be improved.
The current mapping transformation from the incoming operation address space to locations in the open-ended log is more expensive in CPU time than we would like. More work is needed to optimize the code path.
Finally, the LFS server is generally successful at providing the appearance of a traditional, update-in-place file system. However, as the unused space in a volume nears zero, behaving with semantics that meet users' expectations in a log-structured file system proved more difficult than we had anticipated and required significant effort to correct.
The LFS server is described in much more detail in this issue.4
Table 2
Performance Comparison of the Backup Save Operation

File System        Elapsed Time
Spiralog
Files-11

Backup Performance Results
We took a new approach to the backup design in the Spiralog system, resulting in a very fast and very low impact backup that can be used to create consistent copies of the file system while applications are actively modifying data.
Table 3
Performance Comparison of the Individual File Restore Operation

File System        Elapsed Time (Minutes:Seconds)
Spiralog           01:06
Files-11           03:35
length is mitigated by continuing improvements in processor performance.
We learned a great deal during the Spiralog project and made the following observations:
• Volume full semantics and fine-tuning the cleaner were more complex than we anticipated and will require future refinement.
• A heavily layered architecture extends the CPU path length and the fan-out of procedure calls. We focused too much attention on reducing I/Os and not enough attention on reducing the resource usage of some critical code paths.
• Although elegant, the memory abstraction for the interface to the cache was not as good a fit to file system operations as we had expected. Furthermore, a block abstraction for the data space would have been more suitable.
In summary, the project team delivered a new file system for the OpenVMS operating system. The Spiralog file system offers single-system semantics in a cluster, is compatible with the current OpenVMS file system, and supports on-line backup.
Future Work<br />
During the Spiralog version 1.0 project, we pursued a number of new technologies and found four areas that warrant future work:
• Support is needed from storage and file-management tools for multiple, dissimilar file system personalities.
• The cleaner represents another area of ongoing innovation and complex dynamics. We believe a better understanding of these dynamics is needed, and design alternatives should be studied.
• The on-line backup engine, coupled with the log-structured file system technology, offers many areas for potential development. For instance, one area for investigation is continuous backup operation, either to a local backup device or to a remote replica.
• Finally, we do not believe the higher-than-expected code path length is intrinsic to the basic file system design. We expect to be working on this resource usage in the near future.
Acknowledgments<br />
We would like to take this opportunity to thank the many individuals who contributed to the Spiralog project. Don Harbert and Rich Marcello, OpenVMS vice presidents, supported this work over the lifetime of the project. Dan Doherty and Jack Fallon, the OpenVMS managers in Livingston, Scotland, had day-to-day management responsibility. Cathy Foley kept the project moving toward the goal of shipping. Janis Horn and Clare Wells, the product managers who helped us understand our customers' needs, were eloquent in explaining our project and goal to others. Near the end of the project, Yehia Beyh and Paul Mosteika gave us valuable testing support, without which the product would certainly be less stable than it is today. Finally, and not least, we would like to acknowledge the members of the development team: Alasdair Baird, Stuart Bayley, Rob Burke, Ian Compton, Chris Davies, Stuart Deans, Alan Dewar, Campbell Fraser, Russ Green, Peter Hancock, Steve Hirst, Jim Hogg, Mark Howell, Mike Johnson, Robert Landau, Douglas McLaggan, Rudi Martin, Conor Morrison, Julian Palmer, Judy Parsons, Ian Pattison, Alan Paxton, Nancy Phan, Kevin Porter, Alan Potter, Russell Robles, Chris Whitaker, and Rod Widdowson.
References<br />
1. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.
2. T. Mann, A. Birrell, A. Hisgen, C. Jerian, and G. Swart, "A Coherent Distributed File Cache with Directory Write-behind," Digital Systems Research Center, Research Report 103 (June 1993).
3. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.
4. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.
5. M. Howell and J. Palmer, "Integrating the Spiralog File System into the OpenVMS Operating System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 46-56.
6. R. Wrenn, "Why the Real Cost of Storage Is More Than $1/MB," presented at U.S. DECUS Symposium, St. Louis, Mo., June 3-6, 1996.
Biographies

James E. Johnson
Jim Johnson, a consulting software engineer, has been working for Digital since 1984. He was a member of the OpenVMS Engineering Group, where he contributed in several areas, including RMS, transaction processing services, the port of OpenVMS to the Alpha architecture, file systems, and system management. Jim recently joined the Internet Software Business Unit and is working on the application of X.500 directory services. Jim holds two patents on transaction commit protocol optimizations and maintains a keen interest in this area.
William A. Laing
Bill Laing, a corporate consulting engineer, is the technical director of the Internet Software Business Unit. Bill joined Digital in 1981; he worked in the United States for five years before transferring to Europe. During his career at Digital, Bill has worked on VMS systems performance analysis, VAXcluster design and development, operating systems development, and transaction processing. He was the technical director of OpenVMS engineering, the technical director for engineering in Europe, and most recently was focusing on software in the Technology and Architecture Group of the Computer Systems Division. Prior to joining Digital, Bill held research and teaching posts in operating systems at the University of Edinburgh, where he worked on the EMAS operating system. He was also part of the start-up of European Silicon Structures (ES2), an ambitious pan-European company. He holds undergraduate and postgraduate degrees in computer science from the University of Edinburgh.
Design of the Server for
the Spiralog File System
The Spiralog file system uses a log-structured, on-disk format inspired by the Sprite log-structured file system (LFS) from the University of California, Berkeley. Log-structured file systems promise a number of performance and functional benefits over conventional, update-in-place file systems, such as the Files-11 file system developed for the OpenVMS operating system or the FFS file system on the UNIX operating system. The Spiralog server combines log-structured technology with more traditional B-tree technology to provide a general server abstraction. The B-tree mapping mechanism uses write-ahead logging to give stability and recoverability guarantees. By combining write-ahead logging with a log-structured, on-disk format, the Spiralog server merges file system data and recovery log records into a single, sequential write stream.
Christopher Whitaker<br />
J. Stuart Bayley<br />
Rod D. W. Widdowson<br />
The goal of the Spiralog file system project team was to produce a high-performance, highly available, and robust file system with a high-performance, on-line backup capability for the OpenVMS Alpha operating system. The server component of the Spiralog file system is responsible for reading data from and writing data to persistent storage. It must provide fast write performance, scalability, and rapid recovery from system failures. In addition, the server must allow an on-line backup utility to copy a consistent snapshot of the file system to another location, while allowing normal file system operations to continue in parallel.
In this paper, we describe the log-structured file system (LFS) technology and its particular implementation in the Spiralog file system. We also describe the novel way in which the Spiralog server maps the log to provide a rich address space in which files and directories are constructed. Finally, we review some of the opportunities and challenges presented by the design we chose.
Background
All file systems must trade off performance against availability in different ways to provide the throughput required during normal operations and to protect data from corruption during system failures. Traditionally, file systems fall into two categories: careful write and check on recovery.
• Careful writing policies are designed to provide a fail-safe mechanism for the file system structures in the event of a system failure; however, they suffer from the need to serialize several I/Os during file system operations.
• Some file systems forego the need to serialize file system updates. After a system failure, however, they require a complete disk scan to reconstruct a consistent file system. This requirement becomes a problem as disk sizes increase.
Modern file systems such as Cedar, Episode, Microsoft's New Technology File System (NTFS), and Digital's POLYCENTER Advanced File System use logging to overcome the problems inherent in these two approaches.1,2 Logging file system metadata removes the need to serialize I/Os and allows a simple
and bounded mechanism for reconstructing the file system after a failure. Researchers at the University of California, Berkeley, took this process one stage further and treated the whole disk as a single, sequential log where all file system modifications are appended to the tail of the log.3
Log-structured file system technology is particularly appropriate to the Spiralog file system, because it is designed as a clusterwide file system. The server must support a large number of file system clerks, each of which may be reading and writing data to the disk. The clerks use large write-back caches to reduce the need to read data from the server. The caches also allow the clerks to buffer write requests destined for the server.
A log-structured design allows multiple concurrent writes to be grouped together into large, sequential I/Os to the disk. This I/O pattern reduces disk head movement during writing and allows the size of the writes to be matched to characteristics of the underlying disk. This is particularly beneficial for storage devices with redundant arrays of inexpensive disks (RAID).4
The use of a log-structured, on-disk format greatly simplifies the implementation of an on-line backup capability. Here, the challenge is to provide a consistent snapshot of the file system that can be copied to the backup media while normal operations continue to modify the file system. Because an LFS appends all data to the tail of a log, all data writes within the log are temporally ordered. A complete snapshot of the file system corresponds to the contents of the sequential log up to that point in time. The Spiralog design differs from the Sprite implementation in several ways, most notably in the mapping mechanism and the cleaner. In subsequent sections of this paper, we compare these differences with existing implementations where appropriate.
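The snapshot-as-log-prefix property can be demonstrated directly: because every write is appended, the state captured by a snapshot is just the replay of the log up to the snapshot point, and later appends cannot disturb it. A minimal sketch (function and variable names are illustrative):

```python
def snapshot_state(log_entries, snap_point):
    """Reconstruct the file state captured by a snapshot of the log.

    Because an LFS appends every write to the tail, the entries are
    temporally ordered; a consistent snapshot is simply the replay of
    the log up to 'snap_point'.  Each entry is (address, data); the
    newest write to an address wins.
    """
    state = {}
    for address, data in log_entries[:snap_point]:
        state[address] = data
    return state

log = [("a", 1), ("b", 2), ("a", 3)]
snap = snapshot_state(log, 2)      # snapshot taken after two writes
log.append(("b", 9))               # normal operation continues
assert snap == {"a": 1, "b": 2}    # the snapshot is unaffected
```

This is why an on-line backup can copy a consistent image while applications keep writing: the writes land beyond the snapshot point.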
Spiralog File System Server Architecture<br />
The Spiralog file system employs a client-server architecture. Each node in the cluster that mounts a Spiralog volume runs a file system clerk. The term clerk is used in this paper to distinguish the client component of the file system from clients of the file system as a whole. Clerks implement all the file functions associated with maintaining the file system state, with the exception of persistent storage of file system and user data. This latter responsibility falls on the Spiralog server. There is exactly one server for each volume, which must run on a node that has a direct connection to the disk containing the volume. This distribution of function, where the majority of file system processing takes place on the clerk, is similar to that of the Echo file system. The reasons for choosing this architecture are described in more detail in the paper "Overview of the Spiralog File System," elsewhere in this issue.
Spiralog clerks build files and directories in a structured address space called the file address space. This address space is internal to the file system and is only loosely related to that perceived by clients of the file system. The server provides an interface that allows the clerks to persistently map data to file space addresses. Internally, the server uses a logically infinite log structure, built on top of a physical disk, to store the file system data and the structures necessary to locate the data. Figure 1 shows the relationship between the clerks and the server and the relationships among the major components within the server.
Figure 1<br />
Server Architecture
The mapping layer is responsible for maintaining the mapping between the file address space used by the clerks and the address space of the log. The server directly supports the file address space so that it can make use of information about the relative performance sensitivity of parts of the address space that is implicit within its structure. Although this makes the mapping layer relatively complex, it reduces the complexity of the clerks and aids performance. The mapping layer is the primary point of contact with the server. Here, read and write requests from clerks are received and translated into operations on the log address space.
The log driver (LD) creates the illusion of an infinite log on top of the physical disk. The LD transforms read and write requests from the mapping layer that are cast in terms of a location in the log address space into read and write requests to physical addresses on the underlying disk. Hiding the implementation of the log from the mapping layer allows the organization of the log to be altered transparently to the mapping layer. For example, parts of the log can be migrated to other physical devices without involving the mapping layer.
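One way to picture the LD's translation is a per-segment placement table: the log address space grows without bound, while each segment is placed (or migrated) anywhere on the finite disks beneath it. The table-based indirection, the class and method names, and the 256-KB segment size used here are assumptions of this sketch, not a description of the actual LD internals.

```python
SEGMENT_SIZE = 256 * 1024  # assumed bytes per log segment

class LogDriver:
    """Toy log driver: maps an infinite log address space onto disks."""

    def __init__(self):
        self.placement = {}  # segment number -> (device, physical offset)

    def place_segment(self, segment, device, phys_offset):
        self.placement[segment] = (device, phys_offset)

    def translate(self, log_address):
        """Turn a log address into a (device, physical address) pair."""
        segment, offset = divmod(log_address, SEGMENT_SIZE)
        device, phys = self.placement[segment]
        return device, phys + offset

ld = LogDriver()
ld.place_segment(0, "disk0", 4096)
ld.place_segment(1, "disk0", 1_048_576)
assert ld.translate(10) == ("disk0", 4106)
# Migrating a segment to another device changes only the table;
# the mapping layer's log addresses are untouched.
ld.place_segment(1, "disk1", 0)
assert ld.translate(SEGMENT_SIZE + 10) == ("disk1", 10)
```

The point of the indirection is exactly the transparency the text describes: callers above the LD never see where a log address physically lives.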
Figure 2
Address Translation
Although the log exported by the LD layer is conceptually<br />
infinite, disks have a finite size. The cleaner<br />
is responsible for garbage collecting or coalescing free<br />
space within the log.<br />
Figure 2 shows the relationship between the various<br />
address spaces making up the Spiralog file system. In<br />
the next three sections, we examine each of the components<br />
of the server.<br />
Mapping Layer
The mapping layer implements the mapping between the file address space used by the file system clerks and the log address space maintained by the LD. It exports an interface to the clerks that they use to read data from locations in the file address space, to write new data to the file address space, and to specify which previously written data is no longer required. The interface also allows clerks to group sets of dependent writes into units that succeed or fail as if they were a single write. In this section, we introduce the file address space and describe the data structure used to map it. Then we explain the method used to handle clerk requests to modify the address space.
File Address Space
The file address space is a structured address space. At its highest level, it is divided into objects, each of which has a numeric object identifier (OID). An object may have any number of named cells associated with it and up to 2^64 - 1 streams. A named cell may contain a variable amount of data, but it is read and written as a single unit. A stream is a sequence of bytes that are addressed by their offset from the start of the stream, up to a maximum of 2^64 - 1. Fundamentally, there are two forms of addresses defined by the file address space: named addresses of the form <OID, cell-name> specify an individual named cell within an object, and numeric addresses of the form <OID, stream-id, start-offset, length> specify a sequence of length contiguous bytes in an individual stream belonging to an object.
The clerks use named cells and streams to build files and directories. In the Spiralog file system version 1.0, a file is represented by an object, a named cell containing its attributes, and a single stream that is used to store the file's data. A directory is represented by an object that contains a number of named cells. Each named cell represents a link in that directory and contains what a traditional file system refers to as a directory entry. Figure 3 shows how data files and directories are built from named cells and streams.
The mapping layer provides three principal operations for manipulating the file address space: read, write, and clear. The read operation allows a clerk to read the contents of a named cell, a contiguous range of bytes from a stream, or all the named cells for a particular object that fall into a specified search range. The write operation allows a clerk to write to a contiguous range of bytes in a stream or an individual named cell.
Figure 3
Data Files and Directories Built from Named Cells and Streams (a data file: an object with an attributes cell and a byte stream; a directory: an object with an attributes cell and named cells FRED.TXT, FRED.C, JIM.H, CHRIS.TXT, STU.C)
The clear operation allows a clerk to remove a named cell or a number of bytes from an object.
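The object/named-cell/stream model described above can be sketched as a small data structure. This is a toy in-memory model only; the class and method names are assumptions of the sketch, and persistence, OID allocation, and the search-range read are omitted.

```python
class FSObject:
    """Toy model of a file address space object: named cells + streams."""

    def __init__(self, oid):
        self.oid = oid
        self.cells = {}    # cell name -> contents (read/written whole)
        self.streams = {}  # stream id -> bytearray of byte-addressed data

    def write_cell(self, name, value):
        self.cells[name] = value

    def write_stream(self, stream, offset, data):
        buf = self.streams.setdefault(stream, bytearray())
        buf.extend(b"\0" * (offset + len(data) - len(buf)))  # grow if needed
        buf[offset:offset + len(data)] = data

    def read_stream(self, stream, offset, length):
        buf = self.streams.get(stream, bytearray())
        return bytes(buf[offset:offset + length])

    def clear_cell(self, name):
        del self.cells[name]

# A file: one object, an attributes cell, and a single data stream.
f = FSObject(oid=42)
f.write_cell("attributes", {"size": 5})
f.write_stream(1, 0, b"hello")
assert f.read_stream(1, 0, 5) == b"hello"

# A directory: an object whose named cells are its directory entries.
d = FSObject(oid=35)
d.write_cell("FRED.TXT", {"link": 42})
```

Note how the two file-system personalities need nothing beyond these primitives: a directory entry is just a named cell, and file data is just a byte range in a stream.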
Mapping the File Address Space
We looked at a variety of indexing structures for mapping the file address space onto the log address space. We chose a derivative of the B-tree for the following reasons. For a uniform address space, B-trees provide predictable worst-case access times because the tree is balanced across all the keys it maps. A B-tree scales well as the number of keys mapped increases. In other words, as more keys are added, the B-tree grows in width and in depth. Deep B-trees carry an obvious performance penalty, however, for a single tree that must map all the files stored in the log.
To solve this problem, we limited the keys for an object to a single B-tree leaf node. With this restriction, several small files can be accommodated in a single leaf node. Files with a large number of extents (or large directories) are supported by allowing individual streams to be spawned into subtrees. The subtrees are balanced across the keys within the subtree. An object can never span more than a single leaf node of the main B-tree; therefore, nonleaf nodes of the main B-tree only need to contain OIDs. This allows the main B-tree to be very compact. Figure 4 shows the relationship between the main B-tree and its subtrees.
Figure 4<br />
Mapping B-tree Structure
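The two-level scheme can be sketched as follows: an object's keys live inline in its main-tree leaf record until they outgrow a capacity limit, at which point they are spawned into a per-object subtree, while nonleaf levels of the main tree need only OIDs. The dictionaries here stand in for balanced tree nodes, and the capacity, names, and spawn rule are assumptions of this sketch rather than the server's actual parameters.

```python
LEAF_CAPACITY = 4  # assumed small capacity for illustration

class ObjectMap:
    """Main-tree leaf record for one object: either a handful of keys
    stored inline, or a reference to a spawned subtree."""

    def __init__(self):
        self.inline = {}     # key -> value while the object is small
        self.subtree = None  # spawned key map once the object grows

    def insert(self, key, value):
        if self.subtree is not None:
            self.subtree[key] = value
        else:
            self.inline[key] = value
            if len(self.inline) > LEAF_CAPACITY:
                # Spawn: move the object's keys into their own subtree,
                # balanced (in the real server) across that object's keys.
                self.subtree, self.inline = self.inline, {}

    def lookup(self, key):
        src = self.subtree if self.subtree is not None else self.inline
        return src[key]

class MainTree:
    """Nonleaf levels need only OIDs, keeping the main tree compact."""

    def __init__(self):
        self.objects = {}  # OID -> ObjectMap (one leaf record per object)

    def insert(self, oid, key, value):
        self.objects.setdefault(oid, ObjectMap()).insert(key, value)

    def lookup(self, oid, key):
        return self.objects[oid].lookup(key)
```

Several small objects therefore share a leaf, while a large file or directory pays for depth only within its own subtree.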
To reduce the time required to open a file, data for small extents and small named cells are stored directly in the leaf node that maps them. For larger extents (greater than one disk block in size in the current implementation), the data item is written into the log and a pointer to it is stored in the node. This pointer is an address in the log address space. Figure 5 illustrates how the B-tree maps a small file and a file with several large extents.
Processing Read Requests
The clerks submit read requests that may be for a sequence of bytes from a stream (reading data from a file), a single named cell (reading a file's attributes), or a set of named cells (reading directory contents). To fulfill a given read request, the server must consult the B-tree to translate from the address in the file address space supplied by the clerk to the position in the log address space where the data is stored. The extents making up a stream are created when the file data is written. If an application writes 8 kilobytes (KB) of data in 1-KB chunks, the B-tree would contain 8 extents, one for each 1-KB write. The server may need to collect data from several different parts of the log address space to fulfill a single read request.
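The extent-per-write behavior in the example above can be made concrete. The record layout and function names below are illustrative only; the point is that eight 1-KB writes yield eight extents, and a read must gather every extent overlapping its range.

```python
def record_write(extents, stream, offset, length, log_address):
    """Each clerk write creates one extent mapping a byte range of the
    stream to the log position where the data landed."""
    extents.append({"stream": stream, "offset": offset,
                    "length": length, "log_address": log_address})

# An application writing 8 KB in 1-KB chunks produces 8 extents,
# one per write, each pointing at a different place in the log.
extents = []
for i in range(8):
    record_write(extents, stream=1, offset=i * 1024, length=1024,
                 log_address=1_000_000 + i * 4096)

def extents_for_read(extents, offset, length):
    """A single read may have to gather data from several extents."""
    end = offset + length
    return [e for e in extents
            if e["offset"] < end and e["offset"] + e["length"] > offset]

assert len(extents_for_read(extents, 0, 8192)) == 8
assert len(extents_for_read(extents, 1500, 1000)) == 2
```

A read of the full 8 KB must therefore visit eight different log locations, which is why large read caches matter so much in this design.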
Read requests share access to the B-tree in much the same way as processes share access to the CPU of a multiprocessing computer system. Read requests
Figure 5
Mapping B-tree Detail (key: 42.1.0.50 denotes OID 42, stream 1, start offset 0, length 50)
arriving from clerks are placed in a first in, first out (FIFO) work queue and are started in order of their arrival. All operations on the B-tree are performed by a single worker thread in each volume. This avoids the need for heavyweight locking on individual nodes in the B-tree, which significantly reduces the complexity of the tree manipulation algorithms and removes the potential for deadlocks on tree nodes. This reduction in complexity comes at the cost of the design not scaling with the number of processors in a symmetric multiprocessing (SMP) system. So far, we have no evidence to show that this design decision represents a major performance limitation on the server.
The worker thread takes a request from the head<br />
of the work queue and traverses the B-tree until it<br />
reaches a leaf node that maps the address range of<br />
the read request. Upon reaching a leaf node, it may<br />
discover that the node contains<br />
• Records that map part or all of the address of the
read request to locations in the log, and/or<br />
• Records that map part or all of the address of the<br />
read request to data stored directly in the node,<br />
and/or<br />
• No records mapping part or all of the address of the<br />
read request<br />
Digital Technical Journal Vol. 8 No. 2 1996
Data that is stored in the node is simply copied
to the output buffer. When data is stored in the log,
the worker thread issues requests to the LD to read the
data into the output buffer. Once all the reads have
been issued, the original request is placed on a pending
queue until they complete; then the results are
returned to the clerk. When no data is stored for all or
part of the read request, the server zero-fills the corresponding
part of the output buffer.

The process described above is complicated by the
fact that the B-tree is itself stored in the log. The mapping
layer contains a node cache that ensures that commonly
referenced nodes are normally found in memory.
When the worker thread needs to traverse through a
tree node that is not in memory, it must arrange for the
node to be read into the cache. The address of the node
in the log is the value of the pointer to it from its parent
node. The worker thread uses this to issue a request to
the LD to read the node into a cache buffer. While the
node read request is in progress, the original clerk operation
is placed on a pending queue and the worker
thread proceeds to the next request on the work queue.
When the node is resident in memory, the pending read
request is placed back on the work queue to be
restarted. In this way, multiple read requests can be in
progress at any given time.
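The single worker thread with its FIFO work queue and pending queue can be sketched as a simple event loop. This is an illustrative model only; the structures are assumptions, and node-read completions are simulated synchronously:

```python
from collections import deque

# Sketch of the single-worker-thread scheme: a request needing a node that
# is not cached is parked on a pending queue while the node is "read" from
# the log, then requeued; the worker meanwhile serves later requests.

def run_server(requests, cache, log):
    work = deque(requests)      # FIFO work queue
    pending = []                # requests waiting on node reads
    results = []
    while work or pending:
        if not work:
            # Simulate node-read completions: pending requests restart.
            for req, node in pending:
                cache[node] = log[node]   # node now resident in the cache
                work.append(req)
            pending.clear()
            continue
        req = work.popleft()
        node = req["node"]
        if node in cache:
            results.append((req["id"], cache[node]))
        else:
            pending.append((req, node))   # park; worker moves on

    return results

log = {"A": "data-a", "B": "data-b"}
cache = {"A": "data-a"}               # node B is not yet resident
requests = [{"id": 1, "node": "A"},
            {"id": 2, "node": "B"},
            {"id": 3, "node": "A"}]
results = run_server(requests, cache, log)
```

Request 2 misses the cache and is parked, so request 3 completes before it, even though requests are always started in FIFO order.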
Processing Write Requests<br />
Write requests received by the server arrive in groups
consisting of a number of data items corresponding to
updates to noncontiguous addresses in the file address
space. Each group must be written as a single failure-atomic
unit, which means that all the parts of the write
request must be made stable or none of them must
become stable. Such groups of writes are called wunners
and are used by the clerk to encapsulate complex
file system operations.

Before the server can complete a wunner, that
is, before an acknowledgment can be sent back to
the clerk indicating that the wunner was successful,
the server must make two guarantees:

1. All parts of the wunner are stably stored in the log
so that the entire wunner is persistent in the event
of a system failure.

2. All data items described by the wunner are visible to
subsequent read requests.

The wunner is made persistent by writing each data
item to the log. Each data item is tagged with a log
record that identifies its corresponding file space
address. This allows the data to be recovered in the
event of a system failure. All individual writes are made
as part of a single compound atomic operation (CAO).
This method is provided by the LD layer to bracket
a set of writes that must be recovered as an atomic
unit. Once all the writes for the wunner have been
issued to the log, the mapping layer instructs the LD
layer to end (or commit) the CAO.<br />
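A minimal sketch of CAO bracketing, assuming an invented LD-layer-style interface (begin_cao, write, and commit_cao are our names, not the real API), shows why only committed groups survive a failure:

```python
# Sketch of a compound atomic operation (CAO): writes are tagged with a
# CAO id, and recovery keeps only records from CAOs that committed.

class Log:
    def __init__(self):
        self.records = []      # (cao_id, file_address, data)
        self.committed = set()
        self._next = 0

    def begin_cao(self):
        self._next += 1
        return self._next

    def write(self, cao, address, data):
        self.records.append((cao, address, data))

    def commit_cao(self, cao):
        self.committed.add(cao)

    def recover(self):
        """After a failure, only records from committed CAOs survive."""
        return [(a, d) for c, a, d in self.records if c in self.committed]

log = Log()
cao = log.begin_cao()
log.write(cao, ("file-42", 0), b"head")
log.write(cao, ("file-42", 512), b"tail")
log.commit_cao(cao)

# A second wunner whose CAO never commits (crash before the commit):
orphan = log.begin_cao()
log.write(orphan, ("file-7", 0), b"lost")
```

The orphaned write is present in the log but invisible after recovery, which is exactly the all-or-nothing guarantee the wunner requires.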
The wunner can be made visible to subsequent read
operations by updating the B-tree to reflect the location
of the new data. Unfortunately, this would cause
writes to incur a significant latency since updating the
B-tree involves traversing the B-tree and potentially
reading B-tree nodes into memory from the log.
Instead, the server completes a write operation before
the B-tree is updated. By doing this, however, it must
take additional steps to ensure that the data is visible to
subsequent read requests.
Before completing the wunner, the mapping layer
queues the B-tree updates resulting from writing the
wunner to the same FIFO work queue as read requests.
All items are queued atomically, that is, no other read
or write operation can be interleaved with the individual
wunner updates. In this way, the ordering between
the writes making up the wunner and subsequent read
or write operations is maintained. Work cannot begin
on a subsequent read request until work has started on
the B-tree updates ahead of it in the queue.

Once the B-tree updates have been queued to the
server work queue and the mapping layer has been
notified that the CAO for the writes has committed,
both of the guarantees that the server gives on write
completion hold. The data is persistent, and the writes
are visible to subsequent operations; therefore, the
server can send an acknowledgment back to the clerk.
Updating the B-tree
The worker thread processes a B-tree update request
in much the same way as a read request. The update
request traverses the B-tree until either it reaches the
node that maps the appropriate part of the file address
space, or it fails to find a node in memory.

Once the leaf node is reached, it is updated to point at
the location of the data in the log (if the data is to be
stored directly in the node, the data is copied into the
node). The node is now dirty in memory and must
be written to the log at some point. Rather than writing
the node immediately, the mapping layer writes a log
record describing the change, locks the node into the
cache, and places a flush operation for the node on
the mapping layer's flush queue. The flush operation
describes the location of the node in the tree and
records the need to write it to the log at some point
in the future.
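The deferred-write scheme can be sketched as follows; the structures and names are illustrative assumptions, not the Spiralog implementation:

```python
from collections import deque

# Sketch of deferring node writes: a change record goes to the log at once,
# while the dirty node is pinned in the cache and a flush operation is
# queued so the node itself is written later.

class Cache:
    def __init__(self):
        self.locked = set()

    def lock(self, name):
        self.locked.add(name)   # pin a dirty node in memory

def apply_update(node, key, log_address, log, cache, flush_queue):
    node["entries"][key] = log_address   # point the leaf at the new data
    log.append(("change-record", node["name"], key, log_address))
    cache.lock(node["name"])             # node is dirty: keep it resident
    flush_queue.append(node["name"])     # write it to the log eventually

log, cache, flush_queue = [], Cache(), deque()
leaf = {"name": "leaf-1", "entries": {}}
apply_update(leaf, ("file-42", 0), 9000, log, cache, flush_queue)
```

The update is durable as soon as the change record reaches the log; the node write itself can be batched and deferred.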
If, on its way to the leaf node, the write operation
reaches a node that is not in memory, the worker
thread arranges for it to be read from the log and the
write operation is placed on a pending queue as with a
read operation. Because the write has been acknowledged
to the clerk, the new data must remain visible
while the node is being read. If a read operation reaches the node with
records of stalled updates, it must check whether any<br />
of these records contains data that should be returned.<br />
The record contains either a pointer to the data in the
log or the actual data itself. If a read operation finds
a record that can satisfy all or part of the request, the
read request uses the information in the record to
fetch the data. This preserves the guarantee that the
clerk must see all data for which the write request has<br />
been acknowledged.<br />
Once the node is read in from the log, the stalled
updates are restarted. Each update removes its log
record from the node and recommences traversing the
B-tree from that point.<br />
Writing B-tree Nodes to the Log<br />
Writing nodes consumes bandwidth to the disk that<br />
might otherwise be used for writing or reading user<br />
data, so the server tries to avoid doing so until<br />
absolutely necessary. Two conditions make it necessary<br />
to begin writing nodes:<br />
1. There are a large number of dirty nodes in the<br />
cache.<br />
2. A checkpoint is in progress.<br />
In the first condition, most of the memory available
to the server has been given over to nodes that are
locked in memory and waiting to be written to the
log. Read and update operations begin to back up,
waiting for available memory to store nodes. In the
second condition, the LD has requested a checkpoint
in order to bound recovery time (see the section
Checkpointing later in this paper).
When either of these conditions occurs, the mapping
layer switches into flush mode, during which it only
writes nodes, until the condition is changed. In flush
mode, the worker thread processes flush operations
from the mapping layer's flush queue in depth order,
that is, starting with the nodes farthest from the root
of the B-tree. For each flush operation, it traverses the
B-tree until it finds the target node and its parent. The
target node is identified by the keys it maps and its
level. The level of a node is its distance from the leaf of
the B-tree (or subtree). Unlike its depth, which is its
distance from the root of the B-tree, a node's level does
not change as the B-tree grows and shrinks.
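The level/depth distinction can be illustrated with a toy tree; the dictionary encoding is an assumption made for the sketch:

```python
# Sketch of level vs. depth: a node's level (distance from the leaves)
# survives tree growth, while its depth (distance from the root) does not.

def level(node):
    # Leaves are level 0; an index node is one above its tallest child.
    if not node.get("children"):
        return 0
    return 1 + max(level(c) for c in node["children"])

def depth(node, target, d=0):
    # Distance from `node` (the root) down to `target`, or None.
    if node is target:
        return d
    for c in node.get("children", []):
        found = depth(c, target, d + 1)
        if found is not None:
            return found
    return None

leaf = {"children": []}
root = {"children": [leaf]}
l0, d0 = level(leaf), depth(root, leaf)

# The tree grows a new root; the leaf's depth changes, its level does not.
new_root = {"children": [root]}
l1, d1 = level(leaf), depth(new_root, leaf)
```

This is why a flush operation identifies its target by keys and level: those remain valid even if the tree has grown or shrunk since the operation was queued.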
Once it has reached its destination, the flush operation
writes out the target node and updates the parent
with the new log address. The modifications made to
the parent
Figure 7
Subcomponents of the LD Layer
The Segment Writer<br />
The segment writer is responsible for all I/Os to the
log. It groups together writes it receives from the mapping
layer into large, sequential I/Os where possible.
This increases write throughput, but at the potential
cost of increasing the latency of individual operations
when the disk is lightly loaded.
As shown in Figure 8, the segment writer is responsible
for the internal organization of segments written
to the disk. Segments are divided into two sections, a
data area and a much smaller commit record area.
Writing a piece of data requires two operations to the
segment at the tail of the log. First the data item is
written to the data area of the segment. Once this I/O
has completed successfully, a record describing that
data is written to the commit record area. Only when
the write to the commit record area is complete can
the original request be considered stable.
Figure 8
Organization of a Segment
The need for two writes to disk (potentially, with a
rotational delay between) to commit a single data
write is clearly a disadvantage. Normally, however, the
segment writer receives a set of related writes from
the mapping layer which are tagged as part of a single
CAO. Since the mapping layer is interested in the completion
of the whole CAO and not the writes within it,
the segment writer is able to buffer additions to the
commit record area in memory and then write them
with a single I/O. Under a normal write load, this
reduces the number of I/Os for a single data write to
very close to one.
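A back-of-envelope model of this amortization (our own, not from the paper) shows how the per-item cost approaches one I/O as the CAO grows:

```python
# Sketch of commit-record buffering: each data item costs one data-area
# I/O, but the commit records for a whole CAO are buffered and written
# together, so the per-item I/O count approaches one.

def ios_per_item(items_in_cao):
    data_ios = items_in_cao    # one data-area write per item
    commit_ios = 1             # one buffered commit-record write per CAO
    return (data_ios + commit_ios) / items_in_cao

single = ios_per_item(1)      # worst case: two I/Os for one write
batched = ios_per_item(16)    # a larger CAO: close to one I/O per write
```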
The boundary between the commit record area and
the data area is fixed. Inevitably, this wastes space in
either the commit record area or data area when the
other fills. Choosing a size for the commit record area
that minimizes this waste requires some care. After
analysis of segments that had been subjected to a typical
OpenVMS load, we chose 24 KB as the value for
the commit record area.
This segment organization permits the segment
writer to have complete control over the contents of
the commit record area, which allows the segment
writer to accomplish two important recovery tasks:
• Detect the end of the log<br />
• Detect multiblock write failure<br />
When physical segments are reused to extend the
log, they are not scrubbed and their commit record
areas contain stale (but comprehensible) records. The
recovery manager distinguishes stale records by sequence
number: the sequence number of a segment is an
attribute of the physical slot that is
assigned to it. The sequence number for a physical slot
is incremented each time the slot is reused, allowing<br />
the recovery manager to detect blocks that do not<br />
belong to the segment stored in the physical slot.<br />
The cost of resubstituting the stolen bytes is incurred<br />
only during recovery and cleaning, because this is<br />
the only time that the commit record area is read.<br />
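The sequence-number check can be sketched as follows; the record layout is an illustrative assumption:

```python
# Sketch of stale-record detection: every commit record carries the
# sequence number of the segment it was written for. Records whose number
# does not match the slot's current sequence number are stale leftovers
# from an earlier use of the slot.

def live_records(slot_sequence, commit_area):
    """Keep only commit records written for the current use of the slot."""
    return [rec for seq, rec in commit_area if seq == slot_sequence]

# A reused physical slot: sequence 3 now, but two records from its
# previous use (sequence 2) were never overwritten.
commit_area = [(3, "write-a"), (3, "write-b"), (2, "stale-x"), (2, "stale-y")]
recovered = live_records(3, commit_area)
```

The same check lets recovery find the end of the log: the first record that belongs to an older use of the slot marks where valid data stops.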
In hindsight, the partitioning of segments into data<br />
and commit areas was probably a mistake. A layout<br />
that intermingles the data and commit records and that<br />
allows them to be written in one I/O would offer better
latency at low throughput. If combined with careful<br />
writing, command tag queuing, and other optimizations<br />
becoming more prevalent in disk hardware and<br />
controllers, such an on-disk structure could offer significant<br />
improvements in latency and throughput.<br />
Cleaner<br />
The cleaner's job is to turn free space in segments in<br />
the log into empty, unassigned physical slots that can<br />
be used to extend the log. Areas of free space appear in
segments when the corresponding data decays; that is,<br />
it is either deleted or replaced.<br />
The cleaner rewrites the live data contained in partially
full segments. Essentially, the cleaner forces the
segments to decay completely. If the rate at which data
is written to the log matches the rate at which it is
deleted, segments eventually become empty of their
own accord. When the log is full (fullness depends on
the distribution of file longevity), it is necessary to
proactively clean segments. As the cleaner continues
to consume more of the disk bandwidth, performance
can be expected to decline. Our design goal was that
performance should be maintained up to a point at
which the log is 85 percent full. Beyond this, it was
acceptable for performance to degrade significantly.
Bytes Die Young
Recently written data is more likely to decay than old
data. Segments that were written a short time ago
are likely to decay further, after which the cost of
cleaning them will be less. In our design, the cleaner
selects candidate segments that were written some
time ago and are more likely to have undergone this
initial decay.
Mixing data cleaned from older segments with data
from the current stream of new writes is likely to produce
a segment that will need to be cleaned again once
the new data has undergone its initial decay. To avoid
mixing cleaned data and data from the current write
stream, the cleaner builds its output segments separately
and then passes them to the LD to be threaded in
at the tail of the log. This has two important benefits:
• The recovery information in the output segment is
minimal, consisting only of the self-describing tags
on the data. As a result, the cleaner is unlikely to
waste space in the data area by virtue of having filled
the commit record area.

• By constructing the output segment offline, the
cleaner has as much time as it needs to look for data
chunks that best fill the segment.
Remapping the Output Segment<br />
The data items contained in the cleaner's output segment
receive new addresses. The cleaner informs the
mapping layer of the change of location by submitting
a B-tree update operation for each piece of data it
copied. The mapping layer handles this update operation
in much the same way as it would a normal overwrite.
This update does have one special property:
the cleaner writes are conditional. In other words, the
mapping layer will update the B-tree to point to
the copy created by the cleaner as long as no change
has been made to the data since the cleaner took its
copy. This allows the cleaner to work asynchronously
to file system activity and avoids any locking protocol
between the cleaner and any other part of the Spiralog
file system.
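A minimal sketch of the conditional update, using a dictionary in place of the B-tree (an illustrative assumption, not the Spiralog structure):

```python
# Sketch of the cleaner's conditional remap: the B-tree is repointed at
# the cleaned copy only if the data has not moved since the cleaner read
# it; otherwise the stale cleaned copy is simply discarded.

def conditional_update(btree, key, old_address, new_address):
    """Apply the cleaner's remap unless an overwrite got there first."""
    if btree.get(key) == old_address:
        btree[key] = new_address
        return True
    return False   # data changed under the cleaner; drop the stale copy

btree = {("file-42", 0): 7000}

# No concurrent overwrite: the remap to the cleaned copy succeeds.
moved = conditional_update(btree, ("file-42", 0), 7000, 9500)

# An application overwrite moved this item to 8800 after the cleaner read
# it from 7100; the cleaner's late remap is rejected.
btree[("file-42", 512)] = 8800
applied = conditional_update(btree, ("file-42", 512), 7100, 9600)
```

Because a losing remap is harmless, no lock is needed between the cleaner and the rest of the file system.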
To avoid modifying the mapping layer directly, the
cleaner does not copy B-tree nodes to its output segment.<br />
Instead, it requests the mapping layer to flush<br />
the nodes that occur in its input segments (i.e., rewrite<br />
them to the tail of the log). This also avoids wasting<br />
space in the cleaner output segment on nodes that<br />
map data in the cleaner's input segments. These nodes<br />
are guaranteed to decay as soon as the cleaner's B-tree<br />
updates are processed.<br />
Figure 9 shows how the cleaner constructs an output
segment from a number of input segments. The cleaner
keeps selecting input segments until either the output
segment is full, or there are no more input segments.
Figure 9 also shows the set of operations that are generated
by the cleaner. In this example, the output segment
is filled with the contents of two full segments and part
of a third segment. This will cause the third input segment
to decay still further, and the remaining data and
B-tree nodes will be cleaned when that segment is
selected to create another output segment.
Cleaner Policies<br />
A set of heuristics governs the cleaner's operation.<br />
One of our fundamental design decisions was to separate
the cleaner policies from the mechanisms that
implement them.
When to clean?<br />
Our design explicitly avoids cleaning until it is<br />
required. This design appears to be a good match for
Figure 9
Cleaner Operation
a workload on the OpenVMS system. On our time-sharing
system, the cleaner was entirely inactive for the
first three months of 1996; although segments were
used and reused repeatedly, they always decayed
entirely to empty of their own accord. The trade-off
in avoiding cleaning is that although performance is
improved (no cleaner activity), the size of the full
savesnaps created by backup is increased. This is
because backup copies whole segments, regardless of
how much live data they contain.

When the cleaner is not running, the live data in the
volume tends to be distributed across a large number of
partially full segments. To avoid this problem, we have
added a control to allow the system manager to manually
start and stop the cleaner. Forcing the cleaner to
run before performing a full backup compacts the live
data in the log and reduces the size of the savesnap.
In normal operation, the cleaner will start cleaning
when the number of free segments available to extend
the log falls below a fixed threshold (300 in the current
implementation). In making this calculation, the
cleaner takes into account the amount of space in
the log that will be consumed by writing data currently
held in the clerks' write-behind caches. Thus, accepting
data into the cache causes the cleaner to "clear the way"
for the subsequent write request from the clerk.

When the cleaner starts, it is possible that the
amount of live data in the log is approaching
the capacity of the underlying disk, so the cleaner may
find nothing to do. It is more likely, however, that
there will be free space it can reclaim. Because the
cleaner works by forcing the data in its input segments
to decay by rewriting, it is important that it begins
work while free segments are available. Delaying the
decision to start cleaning could result in the cleaner
being unable to proceed.

A fixed number was chosen for the cleaning threshold
rather than one based on the size of the disk. The
size of the disk does not affect the urgency of cleaning
at any particular point in time. A more effective indicator
of urgency is the time taken for the disk to fill at the
maximum rate of writing. Writing to the log at 10 MB
per second will use 300 segments in about 8 seconds.
With hindsight, we realize that a threshold based on a
measurement of the speed of the disk might have been
a more appropriate choice.
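The 8-second figure is consistent with a segment size of roughly 256 KB; the following check is our own arithmetic, inferred from the numbers above rather than stated in the paper:

```python
# Back-of-envelope check of the cleaning threshold: at 10 MB/s, 300
# segments of about 256 KB each hold 8 seconds' worth of writes.

write_rate = 10 * 1024 * 1024                       # bytes per second
seconds_to_fill = (300 * 256 * 1024) / write_rate   # threshold / rate
```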
Input Segment Selection
The cleaner divides segments into four distinct groups:

1. Empty. These segments contain no live data and are
available to the LD to extend the log.

2. Noncleanable. These segments are not candidates
for cleaning for one of two reasons:

• The segment contains information that would
be required by the recovery manager in the event
of a system failure. Segments in this group are
always close to the tail of the log and therefore
likely to undergo further decay, making them
poor candidates for cleaning.

• The segment is part of a snapshot. The snapshot
represents a reference to the segment, so it cannot
be reused even though it may no longer contain
live data.
3. Preferred noncleanable. These segments have
recently experienced some natural decay. The supposition
is that they may decay further in the near
future and so are not good candidates for cleaning.

4. Cleanable. These segments have not decayed for
some time. Their stability makes them good candidates
for cleaning.

The transitions between the groups are illustrated in
Figure 10. It should be noted that the cleaner itself
does not have to execute to transfer segments into the
empty state.
The cleaner's job is to fill output segments, not to
empty input segments. Once it has been started, the
cleaner works to entirely fill one segment. When that
segment has been filled, it is threaded into the log;
if appropriate, the cleaner will then repeat the process
with a new output segment and a new set of input
segments. The cleaner will commit a partially full
output segment only under circumstances of extreme
resource depletion.

The cleaner fills the output segment by copying
chunks of data forward from segments taken from the
cleanable group. The members of this group are held
on a list sorted in order of emptiness. Thus, the first
cleaner cycle will always cause the greatest number of
segments to decay. As the output segment fills, the
smallest chunk of data in the segment at the head of
the cleanable list may be larger than the space left in
the output segment. In this case, the cleaner performs
a limited search down the cleanable list for segments
containing a suitable chunk. The required information
is kept in memory, so this is a reasonably cheap operation.
As each input segment is processed, the cleaner
Figure 10<br />
Segment States<br />
temporarily removes it from the cleanable list. This
allows the mapping layer to process the operations the
cleaner submitted to it and thereby cause decay
to occur before the cleaner again considers the segment
as a candidate for cleaning. As the volume fills,
the ratio between the number of segments in the
cleanable and preferred noncleanable groups is
adjusted so that the size of the preferred noncleanable
group is reduced and segments are inserted into the
cleanable list. If appropriate, a segment in the cleanable
list that experiences decay will be moved to the
preferred noncleanable list. The preferred noncleanable
list is kept in order of least recently decayed.
Hence, as it is emptied, the segments that are least
likely to experience further decay are moved to the
cleanable group.
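The filling behavior can be sketched with a simplified greedy model; the bounded search, data structures, and chunk sizes here are illustrative assumptions, not the exact Spiralog algorithm:

```python
# Sketch of filling an output segment: take chunks from the emptiest
# cleanable segments first, and when a chunk does not fit, keep searching
# a bounded number of segments down the list for one that does.

def fill_output(cleanable, capacity, search_limit=3):
    """cleanable: segments (lists of chunk sizes) sorted emptiest-first.
    Returns the chunks copied forward and the space left over."""
    space = capacity
    copied = []
    for segment in cleanable[:search_limit]:   # limited search down the list
        for chunk in list(segment):
            if chunk <= space:
                segment.remove(chunk)          # chunk moves to the output
                copied.append(chunk)
                space -= chunk
    return copied, space

segments = [[60, 30], [50], [25, 10, 5]]       # emptiest-first cleanable list
copied, left = fill_output(segments, capacity=100)
```

Here the 50-unit chunk at the head of the remaining list is too big, so the search continues and a 10-unit chunk from the third segment completes the output segment exactly.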
Recovery<br />
The goal of recovery of any file system is to rebuild the
file system state after a system failure. This section
describes how the server reconstructs state, both in
memory and in the log. It then describes checkpointing,
the mechanism by which the server bounds the
amount of time it takes to recover the file system state.
Recovery Process<br />
In normal operation, a single update to the server can<br />
be viewed as several stages:<br />
1. The user data is written to the log. It is tagged with
a self-identifying record that describes its position in
the file address space. A B-tree update operation is
generated that drives stage 2 of the update process.
2. The leaf nodes of the B-tree are modified in memory,
and corresponding change records are written
to the log to reflect the position of the new data.
A flush operation is generated and queued and then
starts stage 3.

3. The B-tree is written out level by level until the root
node has been rewritten. As one node is written to
the log, the parent of that node must be modified,
and a corresponding change record is written to the
log. As a parent node is changed, a further flush
operation is generated for the parent node and so
on up to the root node.
Stage 2 of this process, logging changes to the leaf
nodes of the B-tree, is actually redundant. The self-identifying
tags that are written with the user data are
sufficient to act as change records for the leaf nodes of
the B-tree. When we started to design the server, we
chose a simple implementation based on physiological
Figure 11
Stages of a Write Request
<strong>Digital</strong> TcchnicJI Joum.1l Vol. 8 No. 2 <strong>1996</strong><br />
write-ahead logging. As time progressed, we moved
more toward operational logging. The records written
in stage 2 are a holdover from the earlier implementation,
which we may remove in a future release of
the Spiralog file system.

At each stage of the process, a change record is written
to the log and an in-memory operation is generated
to drive the update through the next stage. In effect,
the change record describes the set of changes made
to an in-memory copy of a node and an in-memory
operation associated with that change.

Figure 11 shows the log and the in-memory work
queue at each stage of a write request.
After a system failure, the server's goal is to reconstruct
the file system state to the point of the last write
that was written to the log at the time of the crash.
This recovery process involves rebuilding, in memory,
those B-tree nodes that were dirty and generating any
operations that were outstanding when the system
failed. The outstanding operations can be scheduled in
the normal way to make the changes that they represent
permanent, thus avoiding the need to recover
them in the event of a future system failure. The recovery
process itself does not write to the log.

The mapping layer work queue and the flush lists
are rebuilt, and the nodes are fetched into memory by
reading the sequential log from the recovery start
position (see the section Checkpointing) to the end of
the log in a single pass.
The B-tree update operations are regenerated using
the self-identifying tag that was written with each
piece of data. When the recovery process finds a node,
a copy of the node is stored in memory. As log records
for node changes are read, they are attached to the
nodes in memory and a flush operation is generated
for the node. If a log record is read for a node that has
not yet been seen, the log record is attached to a placeholder
node that is marked as not-yet-seen. The recovery
process does not perform reads to fetch in nodes
that are not part of the recovery scan. Changes to
B-tree nodes are a consequence of operations that
happened earlier in the log; therefore, a B-tree node
Figure 12
Recovering a Log
log record has the effect of committing a prior modification.
Recovery uses this fact to throw away update
operations that have been committed; they no longer
need to be applied.

Figure 12 shows a log with change records and
B-tree nodes along with the in-memory state of the
B-tree node cache and the operations that are regenerated.
In this example, change record 1 for node A is
superseded or committed by the new version of node A
(node A'). The new copy of node C (node C') supersedes
change records 3 and 5. This example also shows
the effect of finding a log record without seeing a copy
of the node during recovery. The log record for node B
is attached to an in-memory version of the node that is
marked as not-yet-seen. The data record with self-identifying
tag Data 1 generates a B-tree update record that
is placed on the work queue for processing. As a final
pass, the recovery process generates the set of flush
operations that was outstanding when the system failed.
The set of flush requests is defined as the set of nodes in
the B-tree node cache that has log records attached
when the recovery scan is complete. In this case, flush
operations for nodes A' and B are generated.
The server guarantees that a node is never written to the log with uncommitted changes, which means that we only need to log redo records.9,16 In addition, when we see a node during the recovery scan, any log records that are attached to the previous version of the node in memory can be discarded.<br />
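The single-pass recovery scan described above can be sketched as follows. This is an illustrative model only; the record layout, node identifiers, and queue structure are invented for the sketch and are not Spiralog's actual data structures.<br />

```python
def recover(log_records):
    """Scan log records oldest-to-newest, rebuilding in-memory state.

    Each record is a (kind, node_id, payload) tuple, where kind is
    "node" (a full node copy), "change" (a redo record), or "data"
    (a self-identifying data record).
    """
    cache = {}       # node id -> {"seen": bool, "pending": [redo records]}
    work_queue = []  # B-tree update operations regenerated from data records

    for kind, node_id, payload in log_records:
        if kind == "node":
            # A newer copy of a node commits (supersedes) every change
            # record attached to the previous in-memory version.
            cache[node_id] = {"seen": True, "pending": []}
        elif kind == "change":
            # Attach the redo record; if no copy of the node has been
            # seen yet, create a placeholder marked not-yet-seen.
            entry = cache.setdefault(node_id, {"seen": False, "pending": []})
            entry["pending"].append(payload)
        elif kind == "data":
            # Self-identifying data generates a B-tree update operation.
            work_queue.append(("update", node_id, payload))

    # Final pass: any node that still has log records attached needs
    # a flush operation regenerated for it.
    flushes = [nid for nid, e in cache.items() if e["pending"]]
    return work_queue, flushes
```

Feeding this sketch the sequence from Figure 12 (change records 1-3, node A', change records 4 and 5, node C', then Data 1) yields flush requests for nodes A and B only, matching the example in the text.<br />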
<strong>Digital</strong> <strong>Technical</strong> Journal Vol. 8 No. 2 <strong>1996</strong> 27
Operations generated during recovery are posted to the work queues as they would be in normal running. Normal operation is not allowed to begin until the recovery pass has completed; however, when recovery reaches the end of the log, the server is able to service operations from clerks. Thus new requests from the clerk can be serviced, potentially in parallel with the operations that were generated by the recovery process.<br />
Log records are not applied to nodes during recovery for a number of reasons:<br />
• Less processing time is needed to scan the log, and therefore the server can start servicing new user requests sooner.<br />
• Recovery will not have seen copies of all nodes for which it has log records. To apply the log records, the B-tree node must be read from the log. This would result in random read requests during the sequential scan of the log, and again would result in a longer period before user requests could be serviced.<br />
• There may be a copy of the node later in the recovery scan. This would make the additional I/O operation redundant.<br />
Checkpointing<br />
As we have shown, recovering an LFS log is implemented by a single-pass sequential scan of all records in the log from the recovery start position to the tail of the log. This section defines a recovery start position and describes how it can be moved forward to reduce the amount of log that has to be scanned to recover the file system state.<br />
To reconstruct the in-memory state when a system crashed, recovery must see something in the log that represents each operation or change of state that was represented in memory but not yet made stable. This means that at time t, the recovery start position is defined as a point in the log after which all operations that are not stably stored have a log record associated with them. Operations obtain the association by scanning the log sequentially from the beginning to the end. The recovery position then becomes the start of the log, which has two important problems:<br />
1. In the worst case, it would be necessary to sequentially scan the entire log to perform recovery. For large disks, a sequential read of the entire log consumes a great deal of time.<br />
2. Recovery must process every log record written between the recovery start position and the end of the log. As a consequence, segments between the start of recovery and the end of the log cannot be cleaned and reused.<br />
To restrict the amount of time to recover the log and to allow segments to be released by cleaning, the recovery position must be moved forward from time to time, so that it is always close to the tail of the log.<br />
For logs that predominantly service read requests, the recovery time tends toward the limit. In these cases, it may be more appropriate to add timer-based checkpoints.<br />
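A minimal model of how a recovery start position could be tracked, assuming each unstable operation remembers the position of its earliest log record; the class and method names are invented for illustration and are not Spiralog's interfaces.<br />

```python
class CheckpointTracker:
    """Track the oldest log position that recovery must scan from.

    Every operation that is represented in memory but not yet stable
    records the position of its first log record; the recovery start
    position must sit at or before the oldest of these positions.
    """

    def __init__(self):
        self.unstable = {}  # op id -> log position of its first record

    def logged(self, op_id, position):
        # Remember only the earliest record for this operation.
        self.unstable.setdefault(op_id, position)

    def made_stable(self, op_id):
        # Once the operation's effects are stably stored, its log
        # records no longer constrain the recovery start position.
        self.unstable.pop(op_id, None)

    def recovery_start(self, tail):
        # With nothing outstanding, recovery can start at the tail,
        # so the scan (and the segments that must be retained) is
        # kept to a minimum.
        return min(self.unstable.values(), default=tail)
```

As operations become stable, the recovery start position advances toward the tail, shrinking both the recovery scan and the range of segments that the cleaner cannot yet reuse.<br />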
Managing Free Space<br />
A traditional, update-in-place file system overwrites superseded data by writing to the same physical location on disk. If, for example, … parts of the B-tree that are needed to make the file system structures consistent. This means that we can never have dirty B-tree nodes in memory that cannot be flushed to the log.<br />
The server must carefully manage the amount of free space in the log. It must provide two guarantees:<br />
1. A write will be accepted by the server only if there is sufficient free space in the log to hold the data and rewrite the mapping B-tree to describe it. This guarantee must hold regardless of how much space the cleaner may subsequently reclaim.<br />
2. At the higher levels of the file system, if an I/O operation is accepted, even if that operation is stored in the write-behind cache, the data will be written to the log. This guarantee holds except in the event of a system failure.<br />
The server provides these guarantees using the same mechanism. As shown in Figure 13, the free space and the reserved space in the log are modeled using an escrow function.17 The total number of blocks that contain live, valid data is maintained<br />
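The escrow-style space accounting might be modeled as below. This is a sketch under stated assumptions, not Spiralog's implementation: the overhead ratio and all names are invented, and it only illustrates guarantee 1, that a write is rejected unless free space covers the data plus a worst-case mapping B-tree rewrite.<br />

```python
import math


class SpaceEscrow:
    """Toy free-space accounting for a log-structured store."""

    def __init__(self, total_blocks, btree_overhead=0.1):
        self.total = total_blocks
        self.reserved = 0   # blocks promised to accepted writes
        self.live = 0       # blocks holding live, valid data
        self.btree_overhead = btree_overhead  # assumed worst-case ratio

    def free(self):
        return self.total - self.live - self.reserved

    def _cost(self, data_blocks):
        # Worst case: the data itself plus the mapping B-tree rewrite
        # needed to describe it.
        return data_blocks + math.ceil(data_blocks * self.btree_overhead)

    def try_reserve(self, data_blocks):
        need = self._cost(data_blocks)
        if need > self.free():
            return False    # reject: accepting would break guarantee 1
        self.reserved += need
        return True

    def committed(self, data_blocks, blocks_written):
        # When the write reaches the log, convert the reservation into
        # live data (the actual cost may be less than the worst case).
        self.reserved -= self._cost(data_blocks)
        self.live += blocks_written
```

The point of the escrow model is that reservations are debited pessimistically up front, so a write is never accepted that could later find the log full.<br />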
Compression<br />
Adding file compression to an update-in-place file<br />
system presents a particular problem: what to do when<br />
a data item is overwritten with a new version that does<br />
not compress to the same size. Since all updates take<br />
place at the tail of the log, an LFS avoids this problem<br />
entirely. In addition, the amount of space consumed by a data item is determined by its size and is not influenced by the cluster size of the disk. For this reason, an LFS does not need to employ file compaction to make efficient use of large disks or RAID sets.19<br />
Future Improvements<br />
The existing implementation can be improved in a<br />
number of areas, many of which involve resource consumption.<br />
The B-tree mapping mechanism, although<br />
general and flexible, has high CPU overheads and<br />
requires complex recovery algorithms. The segment<br />
layout needs to be revisited to remove the need for serialized<br />
I/Os when committing write operations and thus further reduce the write latency.<br />
For the Spiralog file system version 1.0, we chose to<br />
keep complete information about live data and data that<br />
was no longer valid for every segment in the log. This<br />
mechanism allows us to reduce the overhead of the<br />
cleaner; however, it does so at the expense of memory<br />
and disk space and consequently does not scale well to<br />
multi-terabyte disks.<br />
A Final Word<br />
Log structuring is a relatively new and exciting technology.<br />
Building <strong>Digital</strong>'s first product using this<br />
technology has been both a considerable challenge and<br />
a great deal of fun. Our experience during the construction<br />
of the Spiralog product has led us to believe<br />
that LFS technology has an important role to play in<br />
the future of file systems and storage management.<br />
Acknowledgments<br />
We would like to take this opportunity to acknowledge<br />
the contributions of the many individuals who<br />
helped during the design of the Spiralog server. Alan<br />
Paxton was responsible for initial investigations into<br />
LFS technology and laid the foundation for our understanding.<br />
Mike Johnson made a significant contribution<br />
to the cleaner design and was a key member of the<br />
team that built the final server. We are very grateful to<br />
colleagues who reviewed the design at various stages,<br />
in particular, Bill Laing, Dave Thiel, Andy Goldstein,<br />
and Dave Lomet. Finally, we would like to thank Jim Johnson and Cathy Foley for their continued loyalty,<br />
enthusiasm, and direction during what has been a long<br />
and sometimes hard journey.<br />
References<br />
1. D. Gifford, R. Needham, and M. Schroeder, "The Cedar File System," Communications of the ACM, vol. 31, no. 3 (March 1988).<br />
2. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the Winter 1992 USENIX Technical Conference (January 1992).<br />
3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696, University of California, Berkeley (June 1992).<br />
4. J. Ousterhout and F. Douglis, "Beating the I/O Bottleneck: The Case for Log-Structured File Systems," Operating Systems Review (January 1989).<br />
5. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.<br />
6. J. Ousterhout et al., "A Comparison of Logging and Clustering," Computer Science Department, University of California, Berkeley (March 1994).<br />
7. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the Winter 1993 USENIX Technical Conference (January 1993).<br />
8. W. de Jonge, F. Kaashoek, and W.-C. Hsieh, "The Logical Disk: A New Approach to Improving File Systems," ACM SIGOPS '93 (December 1993).<br />
9. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques (San Mateo, Calif.: Morgan Kaufmann Publishers, 1993), ISBN 1-55860-190-2.<br />
10. A. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart, "The Echo Distributed File System," Digital Systems Research Center, Research Report 111 (September 1993).<br />
11. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.<br />
12. A. Sweeney et al., "Scalability in the XFS File System," Proceedings of the Winter 1996 USENIX Technical Conference (January 1996).<br />
13. J. Kohl, "Highlight: Using a Log-structured File System for Tertiary Storage Management," USENIX Association Conference Proceedings (January 1993).<br />
14. M. Baker et al., "Measurements of a Distributed File System," Symposium on Operating System Principles (SOSP) 13 (October 1991).<br />
15. J. Ousterhout et al., "A Trace-driven Analysis of the UNIX 4.2 BSD File System," Symposium on Operating System Principles (SOSP) 10 (December 1985).<br />
16. D. Lomet and B. Salzberg, "Concurrency and Recovery for Index Trees," Digital Cambridge Research Laboratory, Technical Report (August 1991).<br />
17. P. O'Neil, "The Escrow Transactional Model," ACM Transactions on Database Systems, vol. 11 (December 1986).<br />
18. Volume Shadowing for OpenVMS AXP Version 6.1 (Maynard, Mass.: Digital Equipment Corp., 1994).<br />
19. M. Burrows et al., "On-line Data Compression in a Log-structured File System," Digital Systems Research Center, Research Report 85 (April 1992).<br />
Biographies<br />
Christopher Whitaker<br />
Chris Whitaker joined Digital in 1988 after receiving a B.Sc. Eng. (honours, 1st class) in computer science from the Imperial College of Science and Technology, University of London. He is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Chris was the team leader for the LFS server component of the Spiralog file system. Prior to this, Chris worked on the distributed transaction management services (DECdtm) for OpenVMS and the port of the OpenVMS record management services (RMS and RMS journaling) to Alpha.<br />
J. Stuart Bayley<br />
Stuart Bayley is a member of the OpenVMS File System Development Group, located near Edinburgh, Scotland. He joined Digital in 1990 and, prior to becoming a member of the Spiralog LFS server team, worked on OpenVMS DECdtm services and the OpenVMS XQP file system. Stuart graduated from King's College, University of London, with a B.Sc. (honours) in physics in 1986.<br />
Rod D. W. Widdowson<br />
Rod Widdowson received a B.Sc. (1984) and a Ph.D. (1987) in computer science from Edinburgh University. He joined Digital in 1990 and is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Rod worked on the implementation of LFS and cluster distribution components of the Spiralog file system. Prior to this, Rod worked on the port of the OpenVMS XQP file system to Alpha. Rod is a charter member of the British Computer Society.<br />
Designing a Fast,<br />
On-line Backup System<br />
for a Log-structured<br />
File System<br />
The Spiralog file system for the OpenVMS operating system incorporates a new technical approach to backing up data. The fast, low-impact backup can be used to create consistent copies of the file system while applications are actively modifying data. The Spiralog backup uses the log-structured file system to solve the backup problem. The physical on-disk structure allows data to be saved at near-maximum device throughput with little processing of data. The backup system achieves this level of performance without compromising functionality such as incremental backup or fast, selective restore.<br />
Russell J. Green<br />
Alasdair C. Baird<br />
J. Christopher Davies<br />
Most computer users want to be able to recover data lost through user error, software or media failure, or site disaster but are unwilling to devote system resources or downtime to make backup copies of the data. Furthermore, with the rapid growth in the use of data storage and the tendency to move systems toward complete utilization (i.e., 24-hour by 7-day operation), the practice of taking the system off line to back up data is no longer feasible.<br />
The Spiralog file system, an optional component of the OpenVMS Alpha operating system, incorporates a new approach to the backup process (called simply backup), resulting in a number of substantial customer benefits. By exploiting the features of log-structured storage, the backup system combines the advantages of two different traditional approaches to performing backup: the flexibility of file-based backup and the high performance of physically oriented backup.<br />
The design goal for the Spiralog backup system was to provide customers with a fast, application-consistent, on-line backup. In this paper, we explain the features of the Spiralog file system that helped achieve this goal and outline the design of the major backup functions, namely volume save, volume restore, file restore, and incremental management. We then present some performance results arrived at using Spiralog version 1.1. The paper concludes with a discussion of other design approaches and areas for future work.<br />
Background<br />
File system data may be lost for many reasons, including<br />
• User error-A user may mistakenly delete data.<br />
• Software failure-An application may execute incorrectly.<br />
• Media failure-The computing equipment may malfunction because of poor design, old age, etc.<br />
• Site disaster-Computing facilities may<br />
The ability to save backup copies of all or part of a file system's information in a form that allows it to be restored is essential to most customers who use computing resources. To understand the backup capability needed in the Spiralog file system, we spoke to a number of customers-five directly and several hundred through public forums. Each ran a different type of system in a distinct environment, ranging from research and development to finance on OpenVMS and other systems. Our survey revealed the following set of customer requirements for the Spiralog backup system:<br />
1. Backup copies of data must be consistent with respect to the applications that use the data.<br />
2. Data must be continuously available to applications. Downtime for the purpose of backup is unacceptable. A backup must copy all data of interest as it exists at an instant in time; however, the application should also be allowed to modify the data during the copying process. Performing backup in such a way as to satisfy these constraints is often called hot backup or on-line backup. Figure 1 illustrates how data inconsistency can occur during an on-line backup.<br />
3. The backup operations, particularly the save operation, must be fast. That is, copying data from the system or restoring data to the system must be accomplished in the time available.<br />
4. The backup system must allow an incremental backup operation, i.e., an operation that captures only the changes made to data since the last backup.<br />
The Spiralog backup team set out to design and implement a backup system that would meet the four customer requirements. The following section discusses the features of the implementation of a log-structured file system (LFS) that allowed us to use a new approach to performing backup. Note that throughout this paper we use disk to describe the<br />
Figure 1<br />
Example of an On-line Backup That Results in Inconsistent Data<br />
[Figure: The initial file contains two blocks. Backup starts and copies the first block. The application then rewrites the file. Backup proceeds and copies the second block. The resulting backup copy is corrupt because the first block is inconsistent with the latest rewritten file.]<br />
physical media used to store data and volume to describe the abstraction of the disk as presented by the Spiralog file system.<br />
Spiralog Features<br />
The Spiralog file system is an implementation of a log-structured file system. An LFS is characterized by the use of disk storage as a sequential, never-ending repository of data. We generally refer to this organization of data as a log. Johnson and Laing describe in detail the design of the Spiralog implementation of an LFS and how files are maintained in this implementation.1<br />
Some features unique to a log-structured file system are of particular interest in the design of a backup system.2-4 These features are<br />
• Segments, where a segment is the fundamental unit of storage<br />
• The no-overwrite nature of the system<br />
• The temporal ordering of on-disk data structures<br />
• The means by which files are constructed<br />
This section of the paper discusses the relevance of these features; a later section explains how these features are exploited in the backup design.<br />
Segments<br />
In this paper, the term segment refers to a logical entity that is uniquely identified and never overwritten. This definition is distinct from the physical storage of a segment. The only physical feature of interest to backup with regard to segments is that they are efficient to read in their entirety.<br />
Using log-structured storage in a file system allows efficient writing irrespective of the write patterns or load to the file system. All write operations are grouped in segment-sized chunks. The segment size is chosen to be sufficiently large that the time required to read or write the segment is significantly greater than the time required to access the segment, i.e., the time required for a head seek and rotational delay on a magnetic disk. All data (except the LFS homeblock and checkpoint information used to locate the end of the data log) is stored in segments, and all segments are known to the file system. From a backup point of view, this means that the entire contents of a volume can be copied by reading the segments. The segments are large enough to allow efficient reading, resulting in a near-maximum transfer rate of the device.<br />
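The amortization argument above can be made concrete with a back-of-envelope calculation. The seek time and transfer rate below are invented round numbers, not measurements of any particular device; the point is only that a larger segment spreads the fixed access cost over more bytes.<br />

```python
def effective_rate(segment_bytes, access_s=0.015, transfer_bps=5e6):
    """Achieved bytes/second, charging one fixed access per segment.

    access_s models head seek plus rotational delay; transfer_bps is
    the raw sequential transfer rate. Both are illustrative values.
    """
    return segment_bytes / (access_s + segment_bytes / transfer_bps)


small = effective_rate(8 * 1024)     # a small transfer is seek-dominated
large = effective_rate(256 * 1024)   # a large segment amortizes the seek
```

With these assumed numbers, the small transfer achieves only a fraction of the raw rate, while the large segment approaches it, which is why segment-sized I/O can run near the device maximum.<br />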
No Overwrite<br />
In a log-structured file system, in which the segments are never overwritten, all data is written to new, empty segments. Each new segment is given a segment identifier (segid) allocated in a monotonically increasing manner. At any point in time, the entire contents and state of a volume can be described in terms of a (checkpoint position, segment list) pair. At the physical level, a volume consists of a list of segments and a position within a segment that defines the end of the log.<br />
Rosenblum describes the concept of time travel, where an old state of the file system can be revisited by creating and maintaining a snapshot of the file system for future access.3 Allowing time travel in this way requires maintaining an old checkpoint and disabling the reuse of disk space by the cleaner. The cleaner is a mechanism used to reclaim disk space occupied by obsolete data in a log, i.e., disk space no longer referenced in the file system. The contents of a snapshot are independent of operations undertaken on the live version of the file system. Modifying or deleting a file affects only the live version of the file system (see Figure 2). Because of the no-overwrite nature of the LFS, previously written data remains unchanged.<br />
Other mechanisms specific to a particular backup algorithm have been developed to achieve on-line consistency. The snapshot model as described above allows a more general solution with respect to multiple concurrent backups and the choice of the save algorithm.<br />
A read-only version of the file system at an instant in time is precisely what is required for application consistency in on-line backup. This snapshot approach to attaining consistency in on-line backup has been used in other systems. As explained in the following sections, the Spiralog file system combines the snapshot technique with features of log-structured storage to obtain both on-line backup consistency and performance benefits for backup.<br />
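The snapshot idea can be shown with a toy model, assuming segments are a simple segid-to-data map: a snapshot freezes the current segment list, and later modifications only append new segments with higher segids, so the data a snapshot refers to is never disturbed. All structure here is invented for the sketch.<br />

```python
class LogVolume:
    """Toy no-overwrite log: segments are written once, never reused."""

    def __init__(self):
        self.segments = {}   # segid -> segment data
        self.next_segid = 0

    def write_segment(self, data):
        segid = self.next_segid   # segids increase monotonically
        self.next_segid += 1
        self.segments[segid] = data
        return segid

    def snapshot(self):
        # A snapshot is just the current list of segment ids; while it
        # exists, the cleaner must not reclaim these segments.
        return sorted(self.segments)


vol = LogVolume()
vol.write_segment("file v1")
snap = vol.snapshot()
vol.write_segment("file v2")  # "modifying" the file appends a new segment
```

After the second write, the snapshot still names only segment 0, whose contents are untouched; the live file system simply sees the newer segment as well.<br />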
Temporal Ordering<br />
As mentioned earlier, all data, i.e., user data and file system metadata (data that describes the user data in the file system), is stored in segments and there is no overwrite of segments. All on-disk data structures that refer to physical placement of data use pointers, namely (segid, offset) pairs, to describe the location of the data. Each (segid, offset) pair specifies the segment and where within that segment the data is stored. Together, these imply the following two properties of data structures, which are key features of an LFS:<br />
Figure 2<br />
Data Accessible to the Snapshot and to the Live File System<br />
[Figure: Older data is visible only to the snapshot, some data is shared by the snapshot and the live file system, and new live data written since the snapshot was taken belongs only to the live file system.]<br />
1. On-disk structure pointers, namely (segid, offset) pairs, are relatively time ordered. Specifically, data stored at (s2, o2) was written more recently than data stored at (s1, o1) if and only if s2 is greater than s1, or s2 equals s1 and o2 is greater than o1. Thus, new data would appear to the right in the data structure depicted in Figure 3.<br />
2. Any data structure that uses on-disk pointers stored within the segments (the mapping data structure implementing the LFS index) must be time ordered; that is, all pointers must refer to data written prior to the pointer. Referring again to Figure 3, only data structures that point to the left are valid.<br />
These properties of on-disk data structures are of interest when designing backup systems. Such data structures can be traversed so that segments are read in reverse time order. To understand this concept, consider the root of some on-disk data structure. This root must have been written after any of the data to which it refers (property 2). A data item that the root references must have been written before the root and so must have been stored in a segment with a segid less than or equal to that of the segment in which the root is stored (property 1). A similar inductive argument can be used to show that<br />
data contained in a file defined by a file identifier can be recovered, independent of how the file was created, without any knowledge of the file system structure. Consequently, together with the temporal ordering of data in an LFS, files can be recovered using an ordered linear scan of the segments of a volume, provided the on-disk data structures are traversed correctly. This mechanism allows efficient file restore from a sequence of segments. In particular, a set of files can be restored in a single pass of a saved volume stored on tape.<br />
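The two temporal-ordering properties can be expressed directly in code. The tuple representation is an assumption made for the sketch; the functions simply encode the comparison rule stated in property 1 and the pointer-validity rule of property 2.<br />

```python
def written_after(a, b):
    """Property 1: True if location a = (segid, offset) was written
    more recently than location b."""
    (s1, o1), (s2, o2) = b, a
    return s2 > s1 or (s2 == s1 and o2 > o1)


def pointer_is_valid(pointer_location, target_location):
    """Property 2: a valid on-disk pointer may only refer to data
    written before the pointer itself."""
    return written_after(pointer_location, target_location)
```

Because every valid pointer refers backward in the log, a traversal that starts from a root and follows pointers visits segments in non-increasing segid order, which is exactly the reverse-time-order scan the text describes.<br />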
Existing Approaches to Backup<br />
The design of the Spiralog backup attempts to combine the advantages of file-based backup tools such as Files-11 backup, UNIX tar, and Windows NT backup, and physical backup tools such as UNIX dd, Files-11 backup/PHYSICAL, and HSC backup (a controller-based backup for OpenVMS volumes).<br />
File-based Backup<br />
A file-based backup system has two main advantages: (1) the system can explicitly name files to be saved, and (2) the system can restore individual files. In this paper, the file or structure that contains the output data of a backup save operation is called a saveset. Individual file restore is achieved by scanning a saveset for the file and then recreating the file using the saved contents. Incremental file-based backup usually entails keeping a record of when the last backup was made (either on a per-file basis or on a per-volume basis) and copying only those files and directories that have been created or modified since a previous backup time.<br />
The penalty associated with these features of a file-based backup system is that of save performance. In effect, the backup system performs a considerable amount of work to lay out data in the saveset to allow simple restore. All files are segregated to a much greater extent than they are in the file system on-disk structure. The limiting factor in the performance of a file-based save operation is the rate at which data can be read from the source disk. Although there are some ways to improve performance, in the case of a volume that has a large number of files, read performance is always costly. Figure 4 illustrates the layouts of three different types of savesets.<br />
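The incremental selection rule for a file-based backup amounts to a timestamp filter; the file representation below is invented for illustration and stands in for whatever per-file or per-volume record a real tool keeps.<br />

```python
def incremental_candidates(files, last_backup_time):
    """Select files created or modified since the previous backup.

    files: iterable of (path, modified_time) pairs; last_backup_time
    is the recorded time of the previous backup.
    """
    return [path for path, mtime in files if mtime > last_backup_time]
```

A full backup is just the same selection with last_backup_time set before every file's modification time, so everything qualifies.<br />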
Physical Backup<br />
In contrast to the file-based approach to backup, a physical backup system copies the actual blocks of data on the source disk to a saveset. The backup system is able to read the disk optimally, which allows an implementation to achieve data throughput near the disk's maximum transfer rate. Physical backups typically allow neither individual file restore nor incremental<br />
Figure 4<br />
Layouts of Three Different Types of Saveset<br />
[Figure: In a physical backup saveset, blocks are laid out contiguously on tape; file restore is not possible without random access. In a file backup saveset, files are laid out contiguously on tape; to create this sort of saveset, files need to be read individually from disk, which generally means suboptimal disk access. In a Spiralog backup saveset, a directory (DIR) and segment table (SEGT) allow file restore from segments; segments are large enough to allow near-optimal disk access.]<br />
backup. The overhead required to include sufficient information for these features usually erodes the performance benefits offered by the physical copy. In addition, a physical backup usually requires that the entire volume be saved regardless of how much of the volume is used to store data.<br />
How Spiralog Backup Exploits the LFS<br />
Spiralog backup uses the snapshot mechanism to achieve on-line consistency for backup. This section describes how Spiralog attains high-performance backup with respect to the various save and restore operations.<br />
<strong>Volume</strong> Save Operation<br />
The save operation of Spiralog creates a snapshot and then physically copies it to a tape or disk structure called a savesnap. (This term is chosen to be different from saveset to emphasize that it holds a consistent snapshot of the data.) This physical copy operation allows high-performance data transfer with minimal processing.10 In addition, the temporal ordering of data stored by Spiralog means that this physical copy operation can also be an incremental operation.<br />
The savesnap is a file that contains, among other information, a list of segments exactly as they exist in the log. The structure of the savesnap allows the efficient implementation of volume restore and file restore (see Figure 5 and Figure 6).<br />
The steps of a full save operation are as follows:<br />
1. Create a snapshot and mount it. This mounted snapshot looks like a separate, read-only file system. Read information about the snapshot.<br />
Figure 5<br />
Savesnap Structure<br />
[Figure: A savesnap holds directory information, a physical savesnap record (fixed size for the entire savesnap), metadata segments in decreasing segid order, and zero padding, written in the direction the log is written.]<br />
Figure 6<br />
Correspondence between Segments on Disk and in the Savesnap<br />
[Figure: Used segments from the log (e.g., segids 105, 104, 102, 101) are written to the savesnap in the direction the tape is written; unused segments are skipped.]<br />
2. Write the header to the savesnap, including snapshot information<br />
disk from which the savesnap was made, providing the target container is large enough to hold the restored segments.<br />
4. Backup declares the volume restore as complete (no more segments will be written to the volume). Backup tells the file system how to mount the volume by supplying the snapshot checkpoint location.<br />
A Spiralog restore operation treats an incremental savesnap and all the preceding savesnaps upon which it depends as a single savesnap. For savesnaps other than the most recent savesnap (the base savesnap), the snapshot information and directory information are ignored. The sole purpose of these savesnaps is to provide segments to the base savesnap.<br />
To restore a volume from a set of incremental savesnaps, the Spiralog backup system performs steps 1 and 2 using the base savesnap. In step 3, the restore copies all the segments in the snapshot defined by the base savesnap to the target disk. (Note that there is a one-to-one correspondence between snapshots and savesnaps.) The savesnaps are processed in reverse chronological order. The contents of the segment table in the base savesnap define the list of segments in the snapshot to be restored. Although the volume restore operation copies all the segments in the base savesnap, not all segments in the savesnaps are copied. Volume restore provides high-performance recovery from volume failure or site disaster.<br />
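This incremental restore procedure can be sketched as a minimal model. The Savesnap structure, its field names, and the helper function below are illustrative assumptions, not Spiralog's actual on-disk formats:<br />

```python
from dataclasses import dataclass, field

# Hypothetical model of an incremental volume restore. The base (most
# recent) savesnap carries the segment table; older savesnaps only
# supply segments.

@dataclass
class Savesnap:
    segment_table: list                           # segids in the snapshot
    segments: dict = field(default_factory=dict)  # segid -> segment bytes

def restore_volume(base, older_savesnaps):
    """Restore the snapshot defined by the base savesnap.

    Savesnaps are processed in reverse chronological order; each one
    contributes only those listed segments not already restored, so
    segments present in an older savesnap but absent from the base
    segment table are never copied.
    """
    needed = set(base.segment_table)
    restored = {}
    for snap in [base] + older_savesnaps:  # newest first
        for segid, data in snap.segments.items():
            if segid in needed and segid not in restored:
                restored[segid] = data
    missing = needed - restored.keys()
    if missing:
        raise ValueError(f"segments missing from savesnap set: {missing}")
    return restored
```

In this model a segment rewritten after an older savesnap was taken is restored from the newer savesnap that holds it, which mirrors the reverse-chronological processing described above.<br />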
File Restore Operation<br />
The purpose of a file restore operation is to provide a fast and efficient way to retrieve a small number of files from a savesnap without performing a full volume restore. Typically, file restore is used to recover files that have been inadvertently deleted. To achieve high-performance file restore, we imposed the following requirements on the design:<br />
• A file restore session must process as few savesnaps as possible; it should skip savesnaps that do not contain data needed by the session.<br />
• When processing a savesnap, the file restore must scan the savesnap linearly, in a single pass.<br />
The file restore interface returns the mapping information relevant to a particular file identifier that is held within<br />
a given segment. The mapping information returned<br />
through this interface describes either mapping information held elsewhere or real file data. One characteristic of the log is that anything to which such mapping information points must occur earlier in the log, that is, in a subsequent saved segment. Recall property 2 of the LFS on-disk data structures. Consequently, the file<br />
restore session will progress through the savesnaps in<br />
the desired linear fashion provided that requests are<br />
presented to the interface in the correct order. The<br />
correct order is determined by the allocation of segids.<br />
Since segids increase monotonically over time, it is<br />
necessary only to ensure that requests are presented in<br />
a decreasing segid order.<br />
The file restore interface operates on an object called a context. The context is a tuple that contains a location in the log, namely (segid, offset), and a type field. When supplied with a file identifier and a context, the core function of the interface inspects the segment determined by the context and returns the set of contexts that enumerate all available mapping information for the file identifier held at the location given by the initial context.<br />
The type of context returned indicates one of the<br />
following situations:<br />
Figure 8<br />
File Restore Session in Progress<br />
• The location contains real file data.<br />
• The location given by the context holds more mapping information. In this case, the core function can be applied repeatedly to determine the precise location of the file's data.<br />
A work list of contexts in decreasing segid order drives the file restore process. The procedure for retrieving the data for a single file identifier is as follows. At the outset of the file restore operation, the work list holds a single context that identifies the root of the map (the tail of the log). As items are taken from the head of the list, the file restore must perform one of two actions. If the context is a pointer to real file data, then the file restore reads the data at that location. If the context holds the location of mapping information, then the core function must be applied to enumerate all possible further mapping information held there. The file restore operation places all returned contexts in the work list in the correct order prior to picking the next work item. This simple procedure, which is illustrated in Figure 8, continues until the work list is empty and all the file's data has been read.<br />
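The work-list procedure can be sketched with a priority queue keyed on decreasing segid. This is an illustrative model only; the context tuple (segid, offset, type) and the two callback functions are assumptions based on the description in the text, not the actual Spiralog interface:<br />

```python
import heapq

DATA, MAP = "data", "map"  # assumed context types

def restore_file(root_context, enumerate_mappings, read_data):
    """Restore one file by processing contexts in decreasing segid order.

    enumerate_mappings(ctx) returns the further contexts held at a map
    location; read_data(ctx) returns the file data stored at a data
    location. Both stand in for the savesnap reading interface.
    """
    # heapq is a min-heap, so key each entry on the negated segid
    work = [(-root_context[0], root_context)]
    pieces = []
    while work:
        _, ctx = heapq.heappop(work)
        if ctx[2] == DATA:
            pieces.append(read_data(ctx))        # real file data: read it
        else:
            for nxt in enumerate_mappings(ctx):  # mapping info: expand it
                heapq.heappush(work, (-nxt[0], nxt))
    return pieces
```

Because contexts always point earlier in the log (lower segids), every context pushed onto the heap sorts after the item that produced it, so the traversal visits segments in a single decreasing-segid pass.<br />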
To cope with more than one file, the file restore operation extends this procedure by converting the work list so that it associates a particular file identifier<br />
The shaded areas represent the file data to be restored and the file system metadata that needs to be accessed to retrieve that data. The restore session has thus far processed segment 478. Part A of the file has been recovered into the target file system. Parts B and C are still to come. After processing segment 478, the file restore visits the next known parts of the log, segments 69 and 59. Items that describe metadata in segment 69 and data in segment 59 will be on the work list. The next segment that the file restore will read is segment 69, so the session can skip the intervening segment (segment 195).<br />
with each context. File restore initializes the work list to hold a pointer to the root of the map (the tail of the log) for each file identifier to be restored. The effect is to interleave requests to read more than one file while maintaining the correct segid ordering.<br />
A further subtlety occurs when the context at the head of the work list is found to refer to a segment outside the current savesnap. The ordering imposed on the work list implies that all subsequent items of work must also be outside the current savesnap. This follows from the temporal ordering properties of LFS on-disk structures and the way in which incremental savesnaps are defined. When this situation occurs, the work list is saved. When the next savesnap is ready for processing, the file restore session can be restarted using the saved work list as the starting point.<br />
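The savesnap-boundary handling might look like the following sketch, where each work item also carries its file identifier. The entry layout and the expansion helper are assumptions for illustration:<br />

```python
import heapq

def process_savesnap(work, segids_in_savesnap, expand):
    """Process work items whose segid lies inside the current savesnap.

    work is a heap of (-segid, file_id, context) entries. Once the head
    of the list refers to a segment outside this savesnap, the temporal
    ordering guarantees that every later item does too, so the remaining
    heap is returned unchanged, to be resumed against the next savesnap.
    """
    processed = []
    while work:
        neg_segid, file_id, ctx = work[0]
        if -neg_segid not in segids_in_savesnap:
            break                            # save the work list for later
        heapq.heappop(work)
        for nxt in expand(file_id, ctx):     # may queue items for any file
            heapq.heappush(work, (-nxt[0], file_id, nxt))
        processed.append((file_id, ctx))
    return processed, work
```

Interleaving several files falls out naturally: items for all files share one heap, and the decreasing-segid key keeps the single linear pass over each savesnap.<br />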
During this step, the file restore writes the pieces of files to the target volume as they are read from the savesnap. Since the file restore process allocates file identifiers on a per-volume basis, restore must allocate new file identifiers in the target volume.<br />
Figure 9<br />
Merging Savesnaps<br />
BACKUPS: Monday, full; Wednesday, incremental; Friday, incremental<br />
Table 1<br />
Performance Comparison of the Spiralog and Files-11 Backup Save Operations<br />
Backup System: Elapsed Time (Minutes:seconds)<br />
Spiralog save: 05:20<br />
Files-11 backup: 10:14<br />
phases of the save operation. During the directory<br />
scan phase (typically up to 20 percent of the total<br />
elapsed save time), the only tape output is a compact<br />
representation of the volume directory graph. In comparison,<br />
the segment writing phase is usually bound by<br />
the tape throughput rate. In this configuration, the<br />
tape is the throughput bottleneck; its maximum raw<br />
data throughput is 1.25 MB/s (uncompressed).11<br />
Overall, the Spiralog volume save operation is nearly twice as fast as the Files-11 backup volume save operation in this type of computing environment. Note that the Spiralog savesnap is larger than the corresponding Files-11 saveset. The Spiralog savesnap is less efficient at holding user data than the packed per-file representation of the Files-11 saveset. In many cases, though, the higher performance of the Spiralog save operation more than outweighs this inefficiency, particularly when it is taken into account that the Spiralog save operation can be performed on-line.<br />
File Restore Performance<br />
To determine file restore performance, we measured how long it took to restore a single file from the savesets created in the save benchmark tests. The hardware and software configurations were identical to those used for the save measurements. We deleted a single 3-kilobyte (KB) file from the source volume and then restored the file. We repeated this operation nine times, each time measuring the time it took to restore the file. Table 2 shows the results.<br />
Merge three savesets to produce one<br />
new savesnap equivalent to a full<br />
savesnap taken on Friday.<br />
Table 1 (continued): Savesnap or Saveset Size (Megabytes); Throughput (Megabytes/second)<br />
Spiralog save: 339 MB; 1.05 MB/s<br />
Files-11 backup: 297 MB; 0.48 MB/s<br />
Table 2<br />
Performance Comparison of the Spiralog and Files-11 Individual File Restore Operations<br />
Backup System: Elapsed Time (Minutes:seconds)<br />
Spiralog file restore: 01:06<br />
Files-11 backup: 03:35<br />
The Spiralog backup system achieves such good performance for file restore by using its knowledge of the way the segments are laid out on tape. The file restore process needs to read only those segments required to restore the file; the restore skips the intervening segments using tape skip commands. In the example presented in Figure 8, the restore can skip segments 555 and 195. In contrast, a file-based backup such as Files-11 usually does not have accurate indexing information to minimize tape I/O. Spiralog's tape-skipping benefit is particularly noticeable when restoring small numbers of files from very large savesnaps; however, as shown in Table 2, even with small savesets, individual file restore using Spiralog backup is three times as fast as using Files-11.<br />
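The tape-skipping decision can be sketched as follows, using the segment numbers from Figure 8. The command representation is a simplification for illustration; real tape positioning commands differ:<br />

```python
def plan_tape_io(tape_segids, needed):
    """Return (action, value) commands for one linear pass over the tape,
    reading needed segments and merging runs of unneeded segments into
    single skip commands."""
    plan, to_skip = [], 0
    for segid in tape_segids:
        if segid in needed:
            if to_skip:
                plan.append(("skip", to_skip))
                to_skip = 0
            plan.append(("read", segid))
        else:
            to_skip += 1  # trailing unneeded segments never get a command
    return plan
```

Because the restore session knows in advance which segids it needs and in what order they sit on the tape, the plan can be computed before the pass begins.<br />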
Table 3 presents a comparison of the save performance and features of the Spiralog, file-based, and physical backup systems.<br />
Table 3<br />
Comparison of Spiralog, File-based, and Physical Backup Systems<br />
Save performance (the number of I/Os required to save the source volume):<br />
• Spiralog backup system: the number of I/Os is O(number of segments that contain live data) plus O(number of directories)<br />
• File-based backup system: the number of I/Os is O(number of files) I/Os to read the file data plus O(number of directories) I/Os<br />
• Physical backup system: the number of I/Os is O(size of the disk)<br />
File restore:<br />
• Spiralog backup system: yes<br />
• File-based backup system: yes<br />
• Physical backup system: no<br />
Volume restore:<br />
• Spiralog backup system: yes, fast<br />
• File-based backup system: yes<br />
• Physical backup system: yes, fast but limited to disks of the same size<br />
Incremental save:<br />
• Spiralog backup system: yes, physical<br />
• File-based backup system: yes, entire files that have changed<br />
• Physical backup system: no<br />
Note that this table uses "big oh" notation to bound a value. O(n), which is pronounced "order of n," means that the value represented is no greater than Cn for some constant C, regardless of the value of n. Informally, this means that O(n) can be thought of as some constant multiple of n.<br />
Other Approaches and Future Work<br />
This section outlines some other design options we considered for the Spiralog backup system. Our approach offers further possibilities in a number of areas. We describe some of the opportunities available.<br />
Backup and the Cleaner<br />
The benefits of the write performance gains in an LFS are attained at the cost of having to clean segments. An opportunity appears to exist in combining the cleaner and backup functions to reduce the amount of work done by either or both of these components; however, the aims of backup and the cleaner are quite different. Backup needs to read all segments written since a specific time (in the case of a full backup, since the birth of the volume). The cleaner needs to defragment the free space on the volume. This is done most efficiently by relocating data held in certain segments. These segments are those that are sufficiently empty to be worth scavenging for free space. The data in these segments should also be stable in the sense that the data is unlikely to be deleted or outdated immediately after relocation.<br />
The only real benefit that can be exacted by looking at these functions together is to clean some segments while performing backup. For example, once a segment has been read to copy to a savesnap, it can be cleaned. This approach is probably not a good one because it reduces system performance in the following ways: additional processing required in cleaning removes CPU and memory resources available to applications, and the cleaner generates write operations that reduce the backup read rate.<br />
There are two other areas in which backup and the cleaner mechanism interact that warrant further investigation.<br />
1. The save operation copies segments in their entirety. That is, the operation copies both "stale" (old) data and live data to a savesnap. The cost of extra storage media for this extraneous data is traded off against the performance penalty in trying to copy only live data. It appears that the file system should run the cleaner vigorously prior to a backup to minimize the stale data copied.<br />
2. Incremental savesnaps contain cleaned data. This means that an incremental savesnap contains a copy of data that already exists in one of the savesnaps on which it depends. This is an apparent waste of effort and storage space.<br />
It is best to undertake a full backup after a thorough cleaning of the volume. A single strategy for incremental backups is less easy to define. On one hand, the size of an incremental backup is increased if much cleaning is performed before the backup. On the other hand, restore operations from a large incremental backup (particularly selective file restores) are likely to be more efficient. The larger the incremental backup, the more data it contains. Consequently, the chance of restoring a single file from just the base savesnap increases with the size of the incremental backup. Studying the interactions between the backup and the cleaner may offer some insight into how to improve either or both of these components.<br />
A continuous backup system can take copies of segments from disk using policies similar to the cleaner. This is explored in Kohl's paper.12<br />
Separating the Backup Save Operation into a<br />
Snapshot and a Copy<br />
The design of the save operation involves the creation of a snapshot followed by the fast copy of the snapshot to some separate storage. The Spiralog version 1.1 implementation of the save operation combines these steps. A snapshot can exist only during a backup save operation.<br />
System administrators and applications have significantly more flexibility if the split in these two functions of backup is visible. The ability to create snapshots that can be mounted to look like read-only versions of a file system may eliminate the need for the large number of backups performed today. Indeed, some file systems offer this feature.6,7 The additional advantage that Spiralog offers is to allow the very efficient copying of individual snapshots to offline media.<br />
Improving the Consistency and Availability<br />
of On-line Backup<br />
There are a number of ways to improve application consistency<br />
data indexing. The Spiralog backup implementation provides an on-line backup save operation with significantly improved performance over existing offerings. Performance of individual file restore is also improved.<br />
Acknowledgments<br />
We would like to thank the following people whose<br />
efforts were vital in bringing the Spiralog backup system<br />
to fruition: Nancy Phan, who helped us develop<br />
the product and worked relentlessly to get it right;<br />
Judy Parsons, who helped us clarify, describe, and document our work; Clare Wells, who helped us focus on the real customer problems; Alan Paxton, who was involved in the early design ideas and later specification of some of the implementation; and, finally,<br />
Cathy Foley, our engineering manager, who supported<br />
us throughout the project.<br />
References<br />
1. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.<br />
2. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.<br />
3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696 (Berkeley, Calif.: University of California, Berkeley, 1992).<br />
4. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the USENIX Winter 1993 Technical Conference, San Diego, Calif. (January 1993).<br />
5. K. Walls, "File Backup System for Producing a Backup Copy of a File Which May Be Updated during Backup," U.S. Patent No. 5,163,148.<br />
6. D. Hitz, J. Lau, and M. Malcolm, "File System Design for an NFS File Server Appliance," Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, Calif. (January 1994).<br />
7. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the USENIX Winter 1992 Technical Conference, San Francisco, Calif. (January 1992).<br />
8. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.<br />
9. OpenVMS System Management Utilities Reference Manual A-L, Order No. M-PVSPC-TK (Maynard, Mass.: Digital Equipment Corporation, 1995).<br />
Alasdair C. Baird<br />
Alasdair Baird joined Digital in 1988 to work for the ULTRIX Engineering group in Reading, U.K. He is a senior software engineer and has been a member of Digital's OpenVMS Engineering group since 1991. He worked on the design of the Spiralog file system and then contributed to the Spiralog backup system, particularly the file restore component. Currently, he is involved in Spiralog<br />
Integrating the Spiralog<br />
File System into the<br />
OpenVMS Operating<br />
System<br />
Digital's Spiralog file system is a log-structured file system that makes extensive use of write-back caching. Its technology is substantially different from that of the traditional OpenVMS file system, known as Files-11. The integration of the Spiralog file system into the OpenVMS environment had to ensure that existing applications ran unchanged and at the same time had to expose the benefits of the new file system. Application compatibility was attained through an emulation of the existing Files-11 file system interface. The Spiralog file system provides an ordered write-behind cache that allows applications to control write order through the barrier primitive. This form of caching gives the benefits of write-back caching and protects data integrity.<br />
Mark A. Howell<br />
Julian M. Palmer<br />
The Spiralog file system is based on a log-structuring method that offers fast writes and a fast, on-line backup capability. The integration of the Spiralog file system into the OpenVMS operating system presented many challenges. Its programming interface and its extensive use of write-back caching were substantially different from those of the existing OpenVMS file system, known as Files-11.<br />
To encourage use of the Spiralog file system, we had to ensure that existing applications ran unchanged in the OpenVMS environment. A file system emulation layer provided the necessary compatibility by mapping the Files-11 file system interface onto the Spiralog file system. Before we could build the emulation layer, we needed to understand how these applications used the file system interface. The approach taken to understanding application requirements led to a file system emulation layer that exceeded the original compatibility expectations.<br />
The first part of this paper deals with the approach to integrating a new file system into the OpenVMS environment and preserving application compatibility. It describes the various levels at which the file system could have been integrated and the decision to emulate the low-level file system interface. Techniques such as tracing, source code scanning, and functional analysis of the Files-11 file system helped determine which features should be supported by the emulation.<br />
The Spiralog file system uses extensive write-back caching to gain performance over the write-through cache on the Files-11 file system. Applications have relied on the ordering of writes implied by write-through caching to maintain on-disk consistency in the event of system failures. The lack of ordering guarantees prevented the implementation of such careful write policies in write-back environments. The Spiralog file system uses a write-behind cache (introduced in the Echo file system) to allow applications to take advantage of write-back caching performance while preserving careful write policies. This feature is unique in a commercial file system. The second part of this paper describes the difficulties of integrating write-back caching into a write-through environment and how a write-behind cache addressed these problems.<br />
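The ordered write-behind idea can be illustrated with a toy model. The class and method names below are assumptions for the sketch; the real cache lives inside the file system and flushes asynchronously:<br />

```python
class WriteBehindCache:
    """Toy model of an ordered write-behind cache with a barrier primitive.

    Writes are buffered and may be made stable lazily, but a write issued
    after a barrier is never made stable before the writes issued ahead
    of that barrier.
    """
    def __init__(self, disk):
        self.disk = disk          # dict: block -> data (stands in for stable storage)
        self.groups = [[]]        # pending writes, partitioned by barriers

    def write(self, block, data):
        self.groups[-1].append((block, data))

    def barrier(self):
        # Start a new ordering group: later writes cannot be flushed
        # until every earlier group has reached the disk.
        if self.groups[-1]:
            self.groups.append([])

    def flush(self, n_groups=None):
        # Flush whole barrier-delimited groups, oldest first; a partial
        # flush may cover only a prefix of the pending groups.
        n = len(self.groups) if n_groups is None else n_groups
        for group in self.groups[:n]:
            for block, data in group:
                self.disk[block] = data
        del self.groups[:n]
        if not self.groups:
            self.groups = [[]]
```

Crash consistency follows because flushes always cover whole barrier-delimited groups in order, so the disk never holds a later write without the earlier writes it depends on.<br />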
Providing a Compatible File System Interface<br />
Application compatibility can be described in two ways: compatibility at the file system interface and compatibility of the on-disk structure. Since only specialized applications use knowledge of the on-disk structure and maintaining compatibility at the interface level is a feature of the OpenVMS system, the Spiralog file system preserves compatibility at the file system interface level only. In the section Files-11 and the Spiralog File System On-disk Structures, we give an overview of the major on-disk differences between the two file systems.<br />
The level of interface compatibility would have a large impact on how well users adopted the Spiralog file system. If data and applications could be moved to a Spiralog volume and run unchanged, the file system would be better accepted. The goal for the Spiralog file system was to achieve 100 percent interface compatibility for the majority of existing applications. The implementation of a log-structured file system, however, meant that certain features and operations of the Files-11 file system could not be supported.<br />
The OpenVMS operating system provides a number of file system interfaces that are called by applications. This section describes how we chose the most compatible file system interface. The OpenVMS operating system directly supports a system-level call interface (QIO) to the file system, which is an extremely complex interface. The QIO interface is very specific to the OpenVMS system and is difficult to map directly onto a modern file system interface. This interface is used infrequently by applications but is used extensively by OpenVMS utilities.<br />
OpenVMS File System Environment<br />
This section gives an overview of the general<br />
OpenVMS file system environment, and the existing<br />
Figure 1<br />
OpenVMS and the new Spiralog file system interfaces. To emulate the Files-11 file system, it was important to understand the way it is used by applications in the OpenVMS environment. A brief description of the Files-11 and the Spiralog file system interfaces gives an indication of the problems in mapping one interface onto the other. These problems are discussed later in the section Compatibility Problems.<br />
In the OpenVMS environment, applications interact with the file system through various interfaces, ranging from high-level language interfaces to direct file system calls. Figure 1 shows the organization of interfaces within the OpenVMS environment, including both the Spiralog and the Files-11 file systems. The following briefly describes the levels of interface to the file system.<br />
• High-level language (HLL) libraries. HLL libraries provide file system functions for high-level languages such as the Standard C library and FORTRAN I/O functions.<br />
• OpenVMS language-specific libraries. These libraries offer OpenVMS-specific file system functions at a high level. For example, lib$create_dir() creates a new directory with specific OpenVMS security attributes such as ownership.<br />
• Record Management Services. The OpenVMS Record Management Services (RMS) are a set of complex routines that form part of the OpenVMS kernel. These routines<br />
• Files-11 file system interface. The OpenVMS operating system has traditionally provided the Files-11 file system for applications. It provides a low-level file system interface so that applications can request file system operations from the kernel. Each file system call can be composed of multiple subcalls. These subcalls can be combined in numerous permutations to form a complex file system operation. The number of permutations of calls and subcalls makes the file system interface extremely difficult to understand and use.<br />
• File system emulation layer. This layer provides a compatible interface between the Spiralog file system and existing applications. Calls to export the new features available in the Spiralog file system are also included in this layer. An important new feature, the write-behind cache, is described in the section Overview of Caching.<br />
• The Spiralog file system interface. The Spiralog file system provides a generic file system interface. This interface was designed to provide a superset of the features that are typically available in file systems used in the UNIX operating system. File system emulation layers, such as the one written for Files-11, could also be written for many different file systems. Features that could not be provided generically, for example, the implementation of security policies, are implemented in the file system emulation layer.<br />
The Spiralog file system's interface is based on the Virtual File System (VFS), which provides a file system interface similar to those found on UNIX systems.7 Functions available are at a higher level than the Files-11 file system interface. For example, an atomic rename function is provided.<br />
Files-11 and the Spiralog File System<br />
On-disk Structures<br />
A major difference between the Files-11 and the Spiralog file systems is the way data is laid out on the disk. The Files-11 system is a conventional, update-in-place file system. Here, space is reserved for file data, and updates to that data are written back to the same location on the disk. Given this knowledge, applications could place data on Files-11 volumes to take advantage of the disk's geometry. For example, the Files-11 file system allows applications to place files on cylinder boundaries to reduce seek times.<br />
The Spiralog file system is a log-structured file system (LFS). The entire volume is treated as a continuous log with updates to files being appended to the tail of the log. In effect, files do not have a fixed home location on a volume. Updates to files, or cleaner activity, will change the location of data on a volume. Applications do not have to be concerned where their data is placed on the disk; LFS provides this mapping.<br />
Some applications call the Files-11 interface directly to access, for example, security information. Although few in number, those applications that do call the Files-11 file system directly are significant ones. If the only interface supported was RMS, then these utilities, such as SET FILE and OpenVMS Backup, would need significant modification. This class of utilities represents a large number of the OpenVMS utilities that maintain the file system.<br />
To provide support for the widest range of applications, we selected the low-level Files-11 interface for use by the file system emulation layer. By selecting this interface, we decreased the amount of work needed for its emulation. However, this gain was offset by the increased complexity in the interface emulation. Problems caused by this interface selection are described in the next section.<br />
Interface Compatibility<br />
Once the file system interface was selected, choices had to be made about the level of support provided by the emulation layer. Due to the nature of the log-structured file system, described in the section Files-11 and the Spiralog File System On-disk Structures, full compatibility of all features in the emulation layer was not possible. This section discusses some of the decisions made concerning interface compatibility.<br />
An initial decision w::s made to support documented<br />
low-level Fi les-l l calls through the emulation<br />
layer as ofi:en as possible. This would enable all<br />
well-behaved applications to run unchanged on the<br />
Spiralog tile system. Examples of well-behaved applications<br />
are those that make use of HLL library calls.<br />
The tollowing categories of access to the fi le system<br />
would not be supported:<br />
• Those directly accessing the disk without going<br />
through the file system<br />
• Those making usc of specitlc on-disk structure<br />
information<br />
• Those making use of undocumented ti le system<br />
katures<br />
A very small number of applications fe ll into these<br />
categories. Exampl es of applications that make use of<br />
on-disk structu re knowledge are the Open VMS boot<br />
code, disk structure analyzers, and disk dcti·agmenters.<br />
The majority of Open VMS applications make tile<br />
system calls through the 1'-LV\S interface. Using file system<br />
call-tracing techniques, described in the section<br />
Investigation Te chniques, a fu ll set of tile system calls<br />
made by RMS could be constructed. Afi:er analysis of<br />
this trace data, it was cle::r that IUv\ S used a small set<br />
of well-structu red calls to the low-level ti le system<br />
intertace. Further, detailed analysis of these calls<br />
showed that all RMS operations could be fu lly emulated<br />
on the Spiralog fi le system.<br />
The su pport of Open VMS ti le system utilities that<br />
made direct calls to the low-level Filcs-11 intertace was<br />
important if we were to minimize the amount of code<br />
change required in the Open VMS code b::se. Analysis<br />
of these utilities showed that the majority of them<br />
could be supported through tJ1e emulation layer.<br />
Ve ry rcw applications made use of katures of the<br />
Filcs-1 1 ti le system that could not be emulated . This<br />
enabled a high number of applications to run<br />
unchanged on the Spiralog file system.<br />
Table 1<br />
Categorization of File System Features<br />
<br />
Supported. The operation requested was completed, and a success status was returned.<br />
Examples: Requests to create a file or open a file.<br />
Notes: Most calls made by applications belong in the supported category.<br />
<br />
Ignored. The operation requested was ignored, and a success status was returned.<br />
Examples: A request to place a file in a specific position on the disk to improve performance.<br />
Notes: This type of feature is incompatible with a log-structured file system. It is very infrequently used and not available through HLL libraries. It could be safely ignored.<br />
<br />
Not supported. The operation requested was ignored, and a failure status was returned.<br />
Examples: A request to directly read the on-disk structure.<br />
Notes: This type of request is specific to the Files-11 file system and could be allowed to fail because the application would not work on the Spiralog file system. It is used only by a few specialized applications.<br />
<br />
Compatibility Problems<br />
This section describes some of the compatibility problems that we encountered in developing the emulation layer and how we resolved them.<br />
When considering the compatibility of the Spiralog file system with the Files-11 file system, we placed the features of the file system into three categories: supported, ignored, and not supported. Table 1 gives examples and descriptions of these categories. A feature was recategorized only if it could be supported but was not used, or if it could not be easily supported but was used by a wide range of applications.<br />
The majority of OpenVMS applications make supported file system calls. These applications will run as intended on the Spiralog file system. Few applications make calls that could be safely ignored. These applications would run successfully but could not make use of these features. Very few applications made calls that were not supported. Unfortunately, some of these applications were very important to the success of the Spiralog file system, for example, system management utilities that were optimized for the Files-11 system.<br />
Analysis of applications that made unsupported calls showed the following categories of use:<br />
• Those that accessed the file header, a structure used to store a file's attributes. This method was used to return multiple file attributes in one call. The supported mechanism involved an individual call for each attribute.<br />
This was solved by returning an emulated file header that contained the majority of information interesting to applications.<br />
• Those reading directory files. This method was used to perform fast directory scans. The supported mechanism involved a file system call for each name.<br />
This was solved by providing a bulk directory reading interface call. This call was similar to the getdirentries() call on the UNIX system and was straightforward to replace in applications that directly read directories.<br />
The OpenVMS Backup utility was an example of a system management utility that directly read directory files. The Backup utility was changed to use the directory reading call on Spiralog volumes.<br />
• Those accessing reserved files. The existing file system stores all its metadata in normal files that can be read by applications. These files are called reserved files and are created when a volume is initialized. No reserved files are created on a Spiralog volume, with the exception of the master file directory (MFD). Applications that read reserved files make specific use of on-disk structure information and are not supported with the Spiralog file system. The MFD is used as the root directory and for directory traversals. This file was virtually emulated. It appears in directory listings of a Spiralog volume and can be used to start a directory traversal, but it does not exist on the volume as a real file.<br />
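The difference between the per-name interface and the bulk directory reading call (analogous to getdirentries() on UNIX) can be sketched as follows; the function names are hypothetical, not the actual OpenVMS interfaces:<br />

```python
# Hypothetical sketch: per-name directory scanning versus a bulk read call.
DIRECTORY = ["login.com", "mail.mai", "notes.txt"]

def lookup_next(prev):
    """Per-name interface: one file system call per directory entry."""
    i = 0 if prev is None else DIRECTORY.index(prev) + 1
    return DIRECTORY[i] if i < len(DIRECTORY) else None

def read_directory_bulk():
    """Bulk interface: return all entries in a single call."""
    return list(DIRECTORY)

# Per-name scan: N+1 file system calls for N entries.
names, entry = [], lookup_next(None)
while entry is not None:
    names.append(entry)
    entry = lookup_next(entry)

# The bulk call returns the same result in one call.
assert names == read_directory_bulk()
```

Replacing the per-name loop with the single bulk call is the kind of localized change made to the Backup utility for Spiralog volumes.<br />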
Investigation Techniques<br />
This section describes the approach taken to investigate the interface and compatibility problems described earlier. It focused on understanding how applications called the file system and the semantics of the calls. A number of techniques were used in lieu of design documentation:<br />
• Surveys [...] asking the other maintainers about the impact of the Spiralog file system. More than 2,000 surveys were sent out, but fewer than 25 useful results were returned. Sadly, most application maintainers were not aware of how their product used the file system.<br />
• Automated application source code searching. Automated source code searching quickly checks a large amount of source code. This technique was most useful when analyzing file system calls made by the OpenVMS operating system or utilities. However, this does not work well when applications make dynamic calls to the file system.<br />
Table 2<br />
Verification of Compatibility<br />
<br />
Test Suite: Number of Tests<br />
RMS regression tests: ~500<br />
OpenVMS regression tests: ~100<br />
Files-11 compatibility tests: ~100<br />
C2 security test suite: ~50 discrete tests<br />
C language tests: ~2,000<br />
FORTRAN language tests: ~100<br />
Write-back caching provides very different semantics from the model of write-through caching used on the Files-11 file system. The goal of the Spiralog project members was to provide write-back caching in a way that was compatible with existing OpenVMS applications.<br />
This section compares write-through and write-back caching and shows how some important OpenVMS applications rely on write-through semantics to protect data from system failure. It describes the ordered write-back cache as introduced in the Echo file system and explains how this model of caching (known as write-behind caching) is particularly suited to the environment of OpenVMS Cluster systems and the Spiralog log-structured file system.<br />
Overview of Caching<br />
During the last few years, CPU performance improvements have continued to outpace performance improvements for disks. As a result, the I/O bottleneck has worsened rather than improved. One of the most successful techniques used to alleviate this problem is caching. Caching means holding a copy of data that has been recently read from, or written to, the disk in memory, giving applications access to that data at memory speeds rather than at disk speeds.<br />
Write-through and write-back caching are two different models frequently used in file systems.<br />
• Write-through caching. In a write-through cache, data read from the disk is stored in the in-memory cache. When data is written, a copy is placed in the cache, but the write request does not return until the data is on the disk. Write-through caches improve the performance of read requests but not write requests.<br />
• Write-back caching. A write-back cache improves the performance of write requests as well: the write request returns as soon as the data is in the cache, and the data is written to the disk some time later.<br />
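The contrast between the two models can be sketched schematically (an illustrative sketch only; real caches operate on disk blocks and fixed-size buffers):<br />

```python
# Schematic contrast of write-through and write-back caching.
class Disk:
    def __init__(self):
        self.blocks = {}
        self.writes = 0          # count of physical disk writes
    def put(self, key, value):
        self.blocks[key] = value
        self.writes += 1

class WriteThroughCache:
    def __init__(self, disk):
        self.disk, self.cache = disk, {}
    def write(self, key, value):
        self.cache[key] = value
        self.disk.put(key, value)   # request completes only after disk I/O

class WriteBackCache:
    def __init__(self, disk):
        self.disk, self.dirty = disk, {}
    def write(self, key, value):
        self.dirty[key] = value     # returns at memory speed; I/O deferred
    def flush(self):
        for key, value in self.dirty.items():
            self.disk.put(key, value)
        self.dirty.clear()

disk_wt, disk_wb = Disk(), Disk()
wt, wb = WriteThroughCache(disk_wt), WriteBackCache(disk_wb)
for i in range(100):                # 100 updates to the same block
    wt.write("block0", i)
    wb.write("block0", i)
wb.flush()
```

A hundred updates to one block cost a hundred disk writes under write-through but a single disk write under write-back, which is the performance difference the section goes on to quantify.<br />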
[Figure 2: Caching Models — compares the time to complete a write with no cache and with a write-back cache, on a scale of microseconds.]<br />
Caching is more important to the Spiralog file system than it is to conventional file systems. Log-structured file systems have inherently worse read performance than conventional, update-in-place file systems, due to the need to locate the data in the log. As described in another paper in this Journal, locating data in the log requires more disk I/Os than in an update-in-place file system.2 The Spiralog file system uses large read caches to offset this extra read cost.<br />
Careful Writing<br />
The Files-11 file system provides write-through semantics. Key OpenVMS applications such as transaction processing and the OpenVMS Record Management Services (RMS) have come to rely on the implicit ordering of write-through. They use a technique known as careful writing to prevent data corruption following a system failure.<br />
Careful writing allows an application to ensure that the data on the disk is never in an inconsistent or invalid state. This guarantee is achieved by controlling the order in which individual updates reach the disk.<br />
[Figure 3: Example of a Careful Write]<br />
Write-behind Caching<br />
A careful write policy relies totally on being able to control the order of writes to the disk. This cannot be achieved on a write-back cache because the write-back method does not preserve the order of write requests. Reordering writes in a write-back cache would risk corrupting the data that applications using careful writing were seeking to protect. This is unfortunate because the performance benefits of deferring the write to the disk are compatible with a careful write policy. Careful writing does not need to know when the data is written to the disk, only the order in which it is written.<br />
To allow these applications to gain the performance of the write-back cache but still protect their data on disk, the Spiralog file system uses a variation on write-back caching known as write-behind caching. Introduced in the Echo file system, write-behind caching is essentially write-back caching with ordering guarantees.3 The cache allows the application to specify which writes must be ordered and the order in which they must be written to the disk.<br />
This is achieved by providing the barrier primitive to applications. Barrier defines an order or dependency between write operations. For example, consider the diagram in Figure 4: here, writes are represented as a time-ordered queue, with later writes being added behind earlier ones. A barrier guarantees that all writes issued before it reach the disk before any write issued after it.<br />
[Figure 4]<br />
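A write-behind queue with a barrier primitive might be sketched like this (a hypothetical sketch; the actual Spiralog cache interfaces are not described at this level of detail):<br />

```python
# Sketch of write-behind: writes may be reordered for efficiency,
# but never across a barrier.
class WriteBehindQueue:
    def __init__(self):
        self.segments = [[]]          # groups of writes between barriers

    def write(self, block, data):
        self.segments[-1].append((block, data))

    def barrier(self):
        # All writes before the barrier must reach the disk before
        # any write issued after it.
        self.segments.append([])

    def flush(self):
        issued = []
        for segment in self.segments:
            # Within a segment the cache is free to reorder (here, by
            # block number, to batch adjacent blocks); across segments
            # the barrier ordering is preserved.
            issued.extend(sorted(segment))
        self.segments = [[]]
        return issued

q = WriteBehindQueue()
q.write(9, "new data A'")     # careful write: the data goes first...
q.barrier()
q.write(1, "pointer to A'")   # ...then the pointer that makes it live
order = q.flush()
```

Without the barrier, sorting by block number would issue block 1 before block 9 and break the careful write; with it, the data reaches the disk before the pointer that references it.<br />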
[Figure 5]<br />
closed. The semantics of full write-behind caching are best suited to applications that can easily regenerate lost data at any time. Temporary files from a compiler are a good example. Should the system fail, the compilation can be restarted without any loss of data.<br />
3. Flush-on-close caching policy. The flush-on-close policy provides a restricted level of write-behind caching for applications. Here, all updates to the file are treated as write-behind, but when the file is closed, all changes are forced to the disk. This gives the performance of write-behind but, in addition, provides a known point when the data is on the disk. This form of caching is particularly suitable for applications that can easily re-create data in the event of a system crash but need to know that data is on the disk at a specific time. For example, a mail store-and-forward system receiving an incoming message must know the data is on the disk when it acknowledges receipt of the message to the forwarder. Once the acknowledgment is sent, the message has been formally passed on, and the forwarder may delete its copy. In this example, the data need not be on the disk until that acknowledgment is sent, because that is the point at which the message receipt is committed. Should the system fail before the acknowledgment is sent, all dirty data in the cache would be lost. In that event, the sender can easily re-create the data by sending the message again.<br />
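The flush-on-close contract can be sketched as follows (an illustrative sketch with invented names, not the OpenVMS file interface):<br />

```python
# Sketch of the flush-on-close policy: updates are write-behind while the
# file is open, but close() forces every dirty block to the disk.
class Disk:
    def __init__(self):
        self.blocks = {}

class FlushOnCloseFile:
    def __init__(self, disk):
        self.disk, self.dirty, self.open = disk, {}, True

    def write(self, block, data):
        assert self.open
        self.dirty[block] = data             # deferred, memory-speed write

    def close(self):
        self.disk.blocks.update(self.dirty)  # known point: data is on disk
        self.dirty.clear()
        self.open = False

disk = Disk()
f = FlushOnCloseFile(disk)
f.write(0, "incoming mail message")
assert 0 not in disk.blocks   # not yet on disk: unsafe to acknowledge
f.close()                     # after close, receipt can be acknowledged
```

The mail example maps directly onto this: the store-and-forward system closes the file before sending its acknowledgment, so the committed message is guaranteed to be on the disk.<br />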
Figure 6 shows the results of a performance comparison of the three caching policies. The test was run on a dual-CPU DEC 7000 Alpha system with 384 megabytes of memory on a RAID-5 disk. The test repeated the following sequence for the different file sizes:<br />
1. Create and open a file of the required size and set its caching policy.<br />
2. Write data to the whole file in 1,024-byte I/Os.<br />
3. Close the file.<br />
4. Delete the file.<br />
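The measurement loop above can be sketched as a small benchmark harness (a hypothetical sketch against a toy in-memory file system; the original test used OpenVMS file services):<br />

```python
import time

class ToyFile:
    def __init__(self, fs, name):
        self.fs, self.name, self.data = fs, name, {}
    def write(self, offset, buf):
        self.data[offset] = buf
    def close(self):
        self.fs.files[self.name] = self.data

class ToyFS:
    def __init__(self):
        self.files = {}
    def create(self, name, policy):      # policy is accepted but unused here
        return ToyFile(self, name)
    def delete(self, name):
        self.files.pop(name, None)

def run_test(fs, sizes):
    """Time per operation for each file size, as in the Figure 6 test."""
    results = {}
    for size in sizes:
        start = time.perf_counter()
        f = fs.create("testfile", policy="write-behind")  # 1. create, set policy
        for offset in range(0, size, 1024):               # 2. 1,024-byte I/Os
            f.write(offset, b"\0" * 1024)
        f.close()                                         # 3. close
        fs.delete("testfile")                             # 4. delete
        ops = 3 + size // 1024            # file operations plus data I/Os
        results[size] = (time.perf_counter() - start) / ops
    return results

results = run_test(ToyFS(), [1024, 32768])
```

Dividing elapsed time by the operation count is what lets the leftmost points of the graph reflect file operations and the rightmost points reflect data I/Os.<br />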
With small files, the number of file operations (create, close, delete) dominates. The leftmost side of the graph therefore shows the time per operation for file operations. As the files increase in size, the data I/Os become prevalent. Hence, the rightmost side of Figure 6 is displaying the time per operation for data I/Os.<br />
Figure 6 clearly shows that an ordered write-behind cache provides the highest performance of the three caching models. For file operations, the write-behind cache is almost 30 percent faster than the write-through cache. Data operations are approximately three times faster than the corresponding operation with write-through caching.<br />
[Figure 6: Performance Comparison of Caching Policies — time per operation (seconds, 0.000 to 0.156) versus file size (bytes, 1,024 to 32,768) for the write-behind, flush-on-close, and write-through caches.]<br />
Summary and Conclusions<br />
The task of integrating a log-structured file system into the OpenVMS environment was a significant challenge for the Spiralog project members. Our approach of carefully determining the interface to emulate and the level of compatibility was important to ensure that the majority of applications worked unchanged.<br />
We have shown that an existing update-in-place file system can be replaced by a log-structured file system. Initial effort in the analysis of application usage furnished information on interface compatibility. Most file system operations can be provided through a file system emulation layer. Where necessary, new interfaces were provided for applications to replace their direct knowledge of the Files-11 file system.<br />
File system operation tracing and functional analysis of the Files-11 file system proved to be the most useful techniques to establish interface compatibility. Application compatibility far exceeds the level expected when the project was started. A majority of people use Spiralog file system volumes without noticing any change in their application's behavior.<br />
Careful write policies rely on the order of updates to the disk. Since write-back caches reorder write requests, applications using careful writing have been unable to take advantage of the significant improvements in write performance given by write-back caching. The Spiralog file system solves this problem by providing ordered write-back caching, known as write-behind. The write-behind cache allows applications to control the order of writes to the disk through a primitive called barrier.<br />
Using barriers, applications can build careful write policies on top of a write-behind cache, gaining all the performance of write-back caching without risking<br />
data integrity. A write-behind cache also allows the file system itself to gain write-back performance.<br />
Acknowledgments<br />
[...] for their help throughout the project, and Karen Howell, Morag Currie, and all those who helped with this paper.<br />
Extending OpenVMS<br />
for 64-bit Addressable<br />
Virtual Memory<br />
The OpenVMS operating system recently<br />
extended its 32-bit virtual address space to<br />
exploit the Alpha processor's 64-bit virtual<br />
addressing capacity while ensuring binary<br />
compatibility for 32-bit non privileged pro<br />
grams. This 64-bit technology is now available<br />
both to Open VMS users and to the operating<br />
system itself. Extending the virtual address<br />
space is a fundamental evolutionary step for<br />
the OpenVMS operating system, which has<br />
existed within the bounds of a 32-bit address<br />
space for nearly 20 years. We chose an asymmetric division of virtual address extension that<br />
allocates the majority of the address space to<br />
applications by minimizing the address space<br />
devoted to the kernel. Significant scaling issues<br />
arose with respect to the kernel that dictated<br />
a different approach to page table residency<br />
within the OpenVMS address space. The paper<br />
discusses key scaling issues, their solutions,<br />
and the resulting layout of the 64-bit virtual<br />
address space.<br />
Michael S. Harvey<br />
Leonard S. Szubowicz<br />
The OpenVMS Alpha operating system initially supported a 32-bit virtual address space that maximized compatibility for OpenVMS VAX users as they ported their applications from the VAX platform to the Alpha platform. Providing access to the 64-bit virtual memory capability defined by the Alpha architecture was always a goal for the OpenVMS operating system. An early consideration was the eventual use of this technology to enable a transition from a purely 32-bit-oriented context to a purely 64-bit-oriented native context. OpenVMS designers recognized that such a fundamental transition for the operating system, along with a 32-bit VAX compatibility mode support environment, would take a long time to implement and could seriously jeopardize the migration of applications from the VAX platform to the Alpha platform. A phased approach was called for, by which the operating system could evolve over time, allowing for quicker time-to-market for significant features and better, more timely support for binary compatibility.<br />
In 1989, a strategy emerged that defined two fundamental phases of OpenVMS Alpha development. Phase 1 would deliver the OpenVMS Alpha operating system initially with a virtual address space that faithfully replicated address space as it was defined by the VAX architecture. This familiar 32-bit environment would ease the migration of applications from the VAX platform to the Alpha platform and would ease the port of the operating system itself. Phase 1, the OpenVMS Alpha version 1.0 product, was delivered in 1992.1<br />
For Phase 2, the OpenVMS operating system would successfully exploit the 64-bit virtual address capacity of the Alpha architecture, laying the groundwork for further evolution of the OpenVMS system. In 1989, strategists predicted that Phase 2 could be delivered approximately three years after Phase 1. As planned, Phase 2 culminated in 1995 with the delivery of OpenVMS Alpha version 7.0, the first version of the OpenVMS operating system to support 64-bit virtual addressing.<br />
This paper discusses how the OpenVMS Alpha Operating System Development group extended the OpenVMS virtual address space to 64 bits. Topics covered include compatibility for existing applications, the options for extending the address space, the strategy for page table residency, and the final layout of the OpenVMS 64-bit virtual address space. In implementing support for 64-bit virtual addresses, designers maximized privileged code compatibility; the paper presents some key measures taken to this end.<br />
• The lower-addressed half (2 GB) of virtual address space is process private. This space contains [...] for a given process, process-permanent code, and various process-specific control cells.<br />
• The higher-addressed half (2 GB) of virtual address space is defined to be shared by all processes. This shared space is where the OpenVMS operating system kernel resides. Although the VAX architecture divides this space into a pair of separately named 1-GB regions (S0 space and S1 space), the OpenVMS Alpha operating system makes no material distinction between the two regions and refers to them collectively as S0/S1 space.<br />
[Figure 1: OpenVMS Alpha 32-bit Virtual Address Space — P0 and P1 space (process private, 2 GB) occupy addresses 00000000.00000000 through 00000000.7FFFFFFF; S0/S1 space (shared, 2 GB) occupies FFFFFFFF.80000000 through FFFFFFFF.FFFFFFFF; the addresses between them are unreachable with 32-bit pointers.]<br />
privileged code compatibility and ensuring faster time-to-market. However, analysis of this option showed that there were enough significant portions of the kernel requiring change that, in practice, very little additional privileged code compatibility, such as for drivers, would be achievable. Also, this option did not address certain important problems that are specific to shared space, such as limitations on the kernel's capacity to manage ever-larger, very large memory (VLM) systems in the future.<br />
We decided to pursue the option of a flat, superset 64-bit virtual address space that provided extensions for both the shared and the process-private portions of the space that a given process could reference. The new, extended process-private space, named P2 space, is adjacent to P1 space and extends toward higher virtual addresses.5 The new, extended shared space, named S2 space, is adjacent to S0/S1 space and extends toward lower virtual addresses. P2 and S2 spaces grow toward each other.<br />
A remaining design problem was to decide where P2 and S2 would meet in the address space layout. A simple approach would split the 64-bit address space exactly in half, symmetrically scaling up the design of the 32-bit address space already in place. (The address space is split in this way by the Digital UNIX operating system.3) This solution is easy to explain because, on the one hand, it extends the 32-bit convention that the most significant address bit can be treated [...]<br />
Virtual Address-to-Physical Address Translation<br />
The Alpha architecture allows an implementation to choose one of the following four page sizes: 8 kilobytes (KB), 16 KB, 32 KB, or 64 KB. The architecture also defines a multilevel, hierarchical page table structure for virtual address-to-physical address (VA-to-PA) translations. All OpenVMS Alpha platforms have implemented a page size of 8 KB and three levels in this page table structure. Although throughout this paper we assume a page size of 8 KB and three levels in the page table hierarchy, no loss of generality is incurred by this assumption.<br />
Figure 2 illustrates the VA-to-PA translation sequence using the multilevel page table structure.<br />
[Figure 2: Virtual Address-to-Physical Address Translation — a 64-bit virtual address comprises the sign extension of Segment 1 (bits 63 down to 43), the Segment 1, Segment 2, and Segment 3 bit fields, and a byte-within-page field; translation starts at the page table base register and walks the L1PT, L2PT, and L3PT levels.]<br />
1. The page table base register (PTBR) is a per-process pointer to the highest level (L1) of that process' page table structure. At the highest level is one 8-KB page (L1PT) that contains 1,024 page table entries (PTEs) of 8 bytes each. Each PTE at the highest page table level (that is, each L1PTE) maps a page table page at the next lower level in the translation hierarchy (the L2PTs).<br />
2. The Segment 1 bit field of a given virtual address is an index into the L1PT that selects a particular L1PTE, hence selecting a specific L2PT for the next stage of the translation.<br />
3. The Segment 2 bit field of the virtual address then indexes into that L2PT to select an L2PTE, hence selecting a specific L3PT for the next stage of the translation.<br />
4. The Segment 3 bit field of the virtual address then indexes into that L3PT to select an L3PTE, hence selecting a specific 8-KB code or data page.<br />
5. The byte-within-page bit field of the virtual address then selects a specific byte address in that page.<br />
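With an 8-KB page (13 byte-within-page bits) and 1,024 PTEs per page-table page (10 bits per segment field), the field extraction can be sketched directly; the field positions follow from the page-size arithmetic above:<br />

```python
# Decompose a 64-bit VA for 8-KB pages and a three-level page table:
# bits <12:0> byte-within-page, <22:13> Segment 3, <32:23> Segment 2,
# <42:33> Segment 1; bits <63:43> are the sign extension of Segment 1.
def va_fields(va):
    return {
        "seg1": (va >> 33) & 0x3FF,
        "seg2": (va >> 23) & 0x3FF,
        "seg3": (va >> 13) & 0x3FF,
        "byte": va & 0x1FFF,
    }

# The PT_Base address given later in the paper decomposes back into the
# self-map index 3FE (1,022) in Segment 1 with all other fields zero.
fields = va_fields(0xFFFFFFFC00000000)
```

Each 10-bit segment field indexes one of the 1,024 PTEs in a page-table page, mirroring steps 2 through 4 of the translation sequence.<br />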
An Alpha implementation may increase the page size and/or the number of levels in the page table hierarchy, thus mapping greater amounts of virtual space up to the full 64-bit amount. The assumed combination of 8-KB page size and three levels of page table allows the system to map up to 8 terabytes (TB) (i.e., 1,024 × 1,024 × 1,024 × 8 KB = 8 TB) of virtual memory for a single process.<br />
To map the entire 8-TB address space available to a single process requires up to 8 GB of PTEs (i.e., 1,024 × 1,024 × 1,024 × 8 bytes = 8 GB). This fact alone presents a serious sizing issue for the OpenVMS operating system. The 32-bit page table residency model that the OpenVMS operating system ported from the VAX platform to the Alpha platform does not have the capacity to support such large page tables.<br />
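The two sizing figures follow directly from the configuration constants and are easy to check:<br />

```python
# 8-KB pages, 1,024 PTEs per page-table page, three levels of page table.
PAGE = 8 * 1024
PTES_PER_PAGE = 1024
PTE_SIZE = 8

# Total virtual space mapped by one process' three-level page table:
mapped = PTES_PER_PAGE ** 3 * PAGE          # 1,024^3 x 8 KB = 8 TB

# PTE storage needed at the lowest level to map all of it:
pte_bytes = PTES_PER_PAGE ** 3 * PTE_SIZE   # 1,024^3 x 8 bytes = 8 GB
```

The 8 GB of potential PTEs is what breaks the 32-bit residency model discussed next: it exceeds the entire 32-bit address space.<br />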
Page Tables: 32-bit Residency Model<br />
We stated earlier that materializing a 32-bit virtual address space [...] system from the VAX [...]<br />
The effect of this self-mapping on the VA-to-PA translation sequence (shown in Figure 2) is subtle but important.<br />
• For those virtual addresses with a Segment 1 bit field value that selects the self-mapper L1PTE, step 2 of the VA-to-PA translation sequence reselects the L1PT as the effective L2PT (L2PT') for the next stage of the translation.<br />
• Step 3 indexes into L2PT' (the L1PT) using the Segment 2 bit field value to select an L3PT'.<br />
• Step 4 indexes into L3PT' (an L2PT) using the Segment 3 bit field value to select a specific data page.<br />
• Step 5 indexes into that data page (an L3PT) using the byte-within-page bit field of the virtual address to select a specific byte address within that page.<br />
When step 5 of the VA-to-PA translation sequence is finished, the final page being accessed is itself one of the level 3 page table pages, not a page that is mapped<br />
[Figure 4: Effect of Page Table Self-Map — the L1PT, reached through the PTBR, contains PTE #1022 holding the L1PT's own PFN; the L1PT maps a 64-bit addressable virtual address space of 1,024 8-GB sections, one of which is PT space. Key: PTBR = page table base register, PFN = page frame number, PTE = page table entry.]<br />
by a level 3 page table page. The self-map operation places the entire 8-GB page table structure at the end of the VA-to-PA translation sequence for a specific 8-GB portion of the process' address space. This virtual space that contains all of a process' potential page tables is called page table space (PT space).<br />
Figure 4 depicts the effect of self-mapping the page tables. On the left is the highest-level page table page containing a fixed number of PTEs. On the right is the virtual address space that is mapped by that page table page. The mapped address space consists of a collection of identically sized, contiguous address range sections, each one mapped by a PTE in the corresponding position in the highest-level page table page. (For clarity, lower levels of the page table structure are omitted from the figure.)<br />
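The self-map effect can be simulated with a toy three-level translator (the frame numbers are arbitrary; index 1,022 is the self-mapper, as in the text):<br />

```python
# Toy physical memory: frame number -> page, modeled as a dict mapping a
# segment-field index to the frame number of the next-level page.
SELF = 1022
L1_FRAME, L2_FRAME, L3_FRAME, DATA_FRAME = 100, 200, 300, 400

frames = {
    L1_FRAME: {SELF: L1_FRAME, 5: L2_FRAME},  # L1PTE #1022 maps the L1PT itself
    L2_FRAME: {7: L3_FRAME},
    L3_FRAME: {9: DATA_FRAME},
}

def translate(seg1, seg2, seg3, ptbr=L1_FRAME):
    frame = ptbr
    for index in (seg1, seg2, seg3):   # steps 2-4 of the translation sequence
        frame = frames[frame][index]
    return frame                       # frame of the final page accessed

# A normal address reaches a data page...
assert translate(5, 7, 9) == DATA_FRAME
# ...but an address whose Segment 1 field selects the self-mapper ends the
# translation one level early, on a level 3 page table page.
assert translate(SELF, 5, 7) == L3_FRAME
```

Iterating the trick, translate(SELF, SELF, 5) lands on an L2PT and translate(SELF, SELF, SELF) on the L1PT itself, which is exactly the positional effect the paper derives for L2_Base and L1_Base.<br />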
Notice that LlPTE # 1022 in Figure 4 was chosen to<br />
map the high-level p::ge table page that contains that<br />
PTE. (The reason f()r this particular choice will<br />
be explained in the next section. Theoretically, any one<br />
64-BIT ADDRESSABLE<br />
VIRTUAL ADDRESS SPACE<br />
PO/P1<br />
PT SPACE f<br />
00000000.00000000<br />
8-GB #0<br />
1,020 X 8 GB<br />
8-GB OW22<br />
-_-_ \ :,:::::3,,mm<br />
t-_-__ -_-_s-o;-_s_1 _
of the L1PTEs could have been chosen as the self-mapper.) The section of virtual memory mapped by the chosen L1PTE contains the entire set of page tables needed to map the available address space of a given process. This section of virtual memory is PT space, which is depicted on the right side of Figure 4 in the 1,022d 8-GB section in the materialized virtual address space.
The base address for this PT space incorporates the index of the chosen self-mapper L1PTE (1,022 = 3FE in hexadecimal) as follows (see Figure 2):
Segment 1 bit field = 3FE
Segment 2 bit field = 0
Segment 3 bit field = 0
Byte within page = 0,
which result in
VA = FFFFFFFC.00000000
(also known as PT_Base).
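This field arithmetic can be checked mechanically. The sketch below is ours, not from the paper; it assumes the Figure 2 layout described in the text: three 10-bit segment fields (1,024 8-byte PTEs per 8-KB page table page), a 13-bit byte-within-page field, and sign extension of the resulting 43-bit virtual address from its top bit.

```python
SEG_BITS = 10    # 8-KB page / 8-byte PTE = 1,024 entries per page table page
BYTE_BITS = 13   # byte-within-page field for an 8-KB page
VA_BITS = 3 * SEG_BITS + BYTE_BITS   # 43 implemented virtual address bits

def compose_va(seg1, seg2, seg3, byte):
    """Assemble a virtual address from its Figure 2 bit fields."""
    va = (seg1 << (2 * SEG_BITS + BYTE_BITS)) \
         | (seg2 << (SEG_BITS + BYTE_BITS)) \
         | (seg3 << BYTE_BITS) | byte
    if va & (1 << (VA_BITS - 1)):        # sign-extend to 64 bits
        va |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return va

pt_base = compose_va(0x3FE, 0, 0, 0)
print(f"PT_Base = {pt_base >> 32:08X}.{pt_base & 0xFFFFFFFF:08X}")
# PT_Base = FFFFFFFC.00000000
```

Setting only the Segment 1 field to 3FE and sign-extending reproduces the PT_Base value quoted in the text.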
Figure 5 illustrates the exact contents of PT space for a given process. One can observe the positional effect of choosing a particular high-level PTE to self-map the page tables even within PT space. In Figure 4, the choice of PTE for self-mapping not only places PT space as a whole in the 1,022d 8-GB section in virtual memory but also means that
Figure 5
Page Table Space
(Labels: PT_Base, L2_Base, L1_Base; 1,021 L2PTs and the L1PT within the 8-GB page table space, between the next-lower and next-higher 8-GB sections)
• The 1,022d grouping of the lowest-level page tables (L3PTs) within PT space is actually the collection of next-higher-level PTs (L2PTs) that map the other groupings of L3PTs, beginning at
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 0
Byte within page = 0,
which result in
VA = FFFFFFFD.FF000000
(also known as L2_Base).
• Within that block of L2PTs, the 1,022d L2PT is actually the next-higher-level page table that maps the L2PTs, namely, the L1PT. The L1PT begins at
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 3FE
Byte within page = 0,
which result in
VA = FFFFFFFD.FF7FC000
(also known as L1_Base).
• Within that L1PT, the 1,022d PTE is the one used for self-mapping these page tables. The address of the self-mapper L1PTE is
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 3FE
Byte within page = 3FE × 8,
which result in
VA = FFFFFFFD.FF7FDFF0.
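All three derived addresses fall out of the same field arithmetic. The sketch below is ours, under the same assumptions as before (10-bit segment fields, 13-bit byte-within-page field, sign extension from bit 42); the helper name is not from the paper.

```python
SEG_BITS, BYTE_BITS = 10, 13
VA_BITS = 3 * SEG_BITS + BYTE_BITS   # 43 implemented virtual address bits

def compose_va(seg1, seg2, seg3, byte):
    """Assemble a virtual address from its Figure 2 bit fields."""
    va = (seg1 << (2 * SEG_BITS + BYTE_BITS)) \
         | (seg2 << (SEG_BITS + BYTE_BITS)) \
         | (seg3 << BYTE_BITS) | byte
    if va & (1 << (VA_BITS - 1)):        # sign-extend to 64 bits
        va |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return va

assert compose_va(0x3FE, 0x3FE, 0, 0) == 0xFFFFFFFDFF000000          # L2_Base
assert compose_va(0x3FE, 0x3FE, 0x3FE, 0) == 0xFFFFFFFDFF7FC000      # L1_Base
assert compose_va(0x3FE, 0x3FE, 0x3FE, 0x3FE * 8) == 0xFFFFFFFDFF7FDFF0
```

Each additional 3FE field drops the self-map index one level deeper, ending at the self-mapper L1PTE itself.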
This positional correspondence within PT space is preserved should a different high-level PTE be chosen for self-mapping the page tables.
The properties inherent in this self-mapped page table are compelling.
• The amount of virtual memory reserved is exactly the amount required for mapping the page tables, regardless of page size or page table depth. Consider the segment-numbered bit fields of a given virtual address from Figure 2. Concatenated, these bit fields constitute the virtual page number (VPN) portion of a given virtual address. The total size of the PT space needed to map every VPN is the number of possible VPNs times 8 bytes, the size of a PTE. The total size of the address space mapped by that PT space is the number of possible VPNs times the page size. Factoring out the VPN multiplier, the ratio between these is the page size divided by 8, which is exactly the size of the Segment 1 bit field in the virtual address. Hence, all the space mapped by a single PTE at the highest level of page table is exactly the size required for mapping all the PTEs that could ever be needed to map the process' address space.
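That sizing argument can be spot-checked with the dimensions given elsewhere in the paper (8-KB pages, 8-byte PTEs, three translation levels); the arithmetic below is ours, not a figure from the text.

```python
page_size = 8 * 1024   # bytes
pte_size = 8           # bytes
entries = page_size // pte_size            # 1,024 PTEs per page table page

vpns = entries ** 3                        # every possible virtual page number
mapped_space = vpns * page_size            # VA mapped by all those VPNs: 8 TB
pt_space = vpns * pte_size                 # PTEs needed to map them all: 8 GB
one_l1pte_maps = entries ** 2 * page_size  # VA covered by a single L1PTE: 8 GB

assert mapped_space // pt_space == page_size // pte_size  # ratio = page size / 8
assert pt_space == one_l1pte_maps == 8 * 2**30
```

The final assertion is the bullet's conclusion: one highest-level PTE covers exactly the 8 GB that the full set of page tables occupies.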
• The mapping of PT space involves simply choosing one of the highest-level PTEs and forcing it to self-map.
• No additional system tuning or coding is required to accommodate a more widely implemented virtual address width in PT space. By definition of the self-map effect, the exact amount of virtual address space required will be available, no more and no less.
• It is easy to locate a given PTE. The address of a PTE becomes an efficient function of the address that the PTE maps. The function first clears the byte-within-page bit field of the subject virtual address and then shifts the remaining virtual address bits such that the Segment 1, 2, and 3 bit field values (Figure 2) now reside in the corresponding next-lower bit field positions. The function then writes (and sign extends if necessary) the vacated Segment 1 field with the index of the self-mapper PTE. The result is the address of the PTE that maps the original virtual address. Note that this algorithm also works for addresses within PT space, including that of the self-mapper PTE itself.
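The lookup function just described can be sketched directly, again under the Figure 2 assumptions (10-bit segments, 13-bit byte field, self-mapper index 3FE); the function name is ours.

```python
SELF_INDEX = 0x3FE   # index of the chosen self-mapper L1PTE
VA_BITS = 43         # implemented virtual address bits

def pte_address(va):
    """Return the address of the PTE that maps virtual address va."""
    va &= (1 << VA_BITS) - 1          # keep the implemented address bits
    vpn = va >> 13                    # clear byte-within-page; segments shift down
    addr = (SELF_INDEX << 33) | (vpn << 3)   # vacated Segment 1 := self-map index
    if addr & (1 << (VA_BITS - 1)):          # sign-extend to 64 bits
        addr |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return addr

assert pte_address(0) == 0xFFFFFFFC00000000                   # PT_Base maps VA 0
assert pte_address(0xFFFFFFFC00000000) == 0xFFFFFFFDFF000000  # works on PT space
assert pte_address(0xFFFFFFFDFF7FC000) == 0xFFFFFFFDFF7FDFF0  # finds the self-mapper
```

Applying the function to PT-space addresses simply walks one level higher, which is why it terminates at the self-mapper PTE.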
• Process page table residency in virtual memory is achieved without imposing on the capacity of shared space. That is, there is no longer a need to map the process page tables into shared space. Such a mapping would be redundant and wasteful.
OpenVMS 64-bit Virtual Address Space
With this page table residency strategy in hand, it became possible to finalize a 64-bit virtual address layout for the OpenVMS operating system. A self-mapper PTE had to be chosen. Consider again the highest level of page table in a given process' page table structure (Figure 4). The first PTE in that page table maps a section of virtual memory that includes P0 and P1 spaces. This PTE was therefore unavailable for use as a self-mapper. The last PTE in that page table maps a section of virtual memory that includes S0/S1 space. This PTE was also unavailable for self-mapping purposes.
All the intervening high-level PTEs were potential choices for self-mapping the page tables. To maximize the size of process-private space, the correct choice is the next-lower PTE than the one that maps the lowest address in shared space.
This choice is implemented as a boot-time algorithm. Bootstrap code first determines the size required for OpenVMS shared space, calculating the corresponding number of high-level PTEs. A sufficient number of PTEs to map that shared space are allocated from the high-order end of a given process' highest-level page table page. Then the next-lower PTE is allocated for self-mapping that process' page tables. All remaining lower PTEs are available for mapping process-private space. In practice, nearly all the PTEs are available, which means that on today's systems, almost 8 TB of process-private virtual memory is available to a given OpenVMS process.
Figure 6 presents the final 64-bit OpenVMS virtual address space layout. The portion with the lower addresses is entirely process-private. The higher-addressed portion is shared by all process address spaces. PT space is a region of virtual memory that lies between the P2 and S2 spaces for any given process and at the same virtual address for all processes.
Note that PT space itself consists of a process-private and a shared portion. Again, consider Figure 5. The highest-level page table page, L1PT, is process-private. It is pointed to by the PTBR. (When a process' context is loaded, or made active, the process' PTBR value is loaded from the process' hardware privileged context block into the PTBR register, thereby making current the page table structure pointed to by that PTBR and the process-private address space that it maps.)
Figure 6
OpenVMS Alpha 64-bit Virtual Address Space
(Layout, low to high: P0 space, 00000000.00000000 through 00000000.7FFFFFFF; P1 space from 00000000.80000000; P2 space; page table space; S2 space; S0/S1 space, FFFFFFFF.80000000 through FFFFFFFF.FFFFFFFF. The process-private/shared boundary passes through page table space. Note that this drawing is not to scale.)
All higher-addressed page tables in PT space are used to map shared space and are themselves shared. They are also adjacent to the shared space that they map. All page tables in PT space that reside at addresses lower than that of the L1PT are used to map process-private space. These page tables are process-private and are adjacent to the process-private space that they map. Hence, the end of the L1PT marks a universal boundary between the process-private portion and the shared portion of the entire virtual address space, serving to separate even the PTEs that map those portions. In Figure 6, the line passing through PT space illustrates this boundary.
A direct consequence of this design is that the process page tables have been privatized. That is, the portion of PT space that is process-private is currently active in virtual memory only when the owning process itself is currently active on the processor.
Fortunately, the majority of page table references occur while executing in the context of the owning process. Such references actually are enhanced by the privatization of the process page tables . . . addresses, whether they are process-private or shared.
The shared portion of PT space could serve now as the sole location for shared-space PTEs. Being redundant, the original SPT is eminently discardable; however, discarding the SPT would create a massive compatibility problem for device drivers with their many 32-bit SPT references. This area is one in which an opportunity exists to preserve a significant degree of privileged code compatibility.
Key Measures Taken to Maximize Privileged Code Compatibility
To implement 64-bit virtual address space support, we altered central sections of the OpenVMS Alpha kernel and many of its key data structures. We expected that such changes would require compensating or corresponding source changes in surrounding privileged components within the kernel, in device drivers, and in privileged layered products.
For example, the previous discussion seems to indicate that any privileged component that reads or writes PTEs would now need to use 64-bit-wide pointers instead of 32-bit pointers. Similarly, all system fork threads and interrupt service routines could no longer count on direct access to process-private PTEs without regard to which process happens to be current at the moment.
A number of factors exacerbated the impact of such changes. Since the OpenVMS Alpha operating system originated from the OpenVMS VAX operating system, significant portions of the OpenVMS Alpha operating system and its device drivers are still written in MACRO-32 code, a compiled language on the Alpha platform.1 Because MACRO-32 is an assembly-level style of programming language, we could not simply change the definitions and declarations of various types and rely on recompilation to handle the move from 32-bit to 64-bit pointers. Finally, there are well over 3,000 references to PTEs from MACRO-32 code modules in the OpenVMS Alpha source pool. We were thus faced with the prospect of visiting and potentially altering each of these 3,000 references. Moreover, we would need to follow the register lifetimes that resulted from each of these references to ensure that all address calculations and memory references were done using 64-bit operations. We expected that this process would be time-consuming and error prone and that it would have a significant negative impact on our completion date.
Once OpenVMS Alpha version 7.0 was available to users, those with device drivers and privileged code of their own would need to go through a similar effort. This would further delay wide use of the release. For all these reasons, we were well motivated to minimize the impact on privileged code. The next four sections discuss techniques that we used to overcome these obstacles.
Resolving the SPT Problem
A significant number of the PTE references in privileged code are to PTEs within the SPT. Device drivers often double-map the user's I/O buffer into S0/S1 space by allocating and appropriately initializing system page table entries (SPTEs). Another situation in which a driver manipulates SPTEs is in the substitution of a system buffer for a poorly aligned or noncontiguous user I/O buffer that prevents the buffer from being directly used with a particular device. Such code relies heavily on the system data cell MMG$GL_SPTBASE, which points to the SPT.
The new page table design completely obviates the need for a separate SPT. Given an 8-KB page size and 8 bytes per PTE, the entire 2-GB S0/S1 virtual address space range can be mapped by 2 MB of PTEs within PT space. Because S0/S1 resides at the highest addressable end of the 64-bit virtual address space, it is mapped by the highest 2 MB of PT space. The arcs on the left in Figure 7 illustrate this mapping. The PTEs in PT space that map S0/S1 are fully shared by all processes, but they must be referenced with 64-bit addresses.
This incompatibility is completely hidden by the creation of a 2-MB "SPT window" over the 2 MB in PT space (level 3 PTEs) that maps S0/S1 space. The SPT window is positioned at the highest addressable end of S0/S1 space. Therefore, an access through the SPT window only requires a 32-bit S0/S1 address and can obtain any of the PTEs in PT space that map S0/S1 space. The arcs on the right in Figure 7 illustrate this access path.
The SPT window is set up at system initialization time and consumes only the 2 KB of PTEs that are needed to map 2 MB. The system data cell MMG$GL_SPTBASE now points to the base of the SPT window, and all existing references to that data cell continue to function correctly without change.7
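The window's sizes follow directly from the stated page and PTE sizes; the arithmetic below is our check, not a figure from the paper.

```python
KB, MB, GB = 2**10, 2**20, 2**30
page_size, pte_size = 8 * KB, 8

spt_bytes = (2 * GB // page_size) * pte_size            # PTEs mapping all of S0/S1
window_pte_bytes = (spt_bytes // page_size) * pte_size  # PTEs mapping those 2 MB

assert spt_bytes == 2 * MB         # 2 GB of S0/S1 -> 2 MB of level 3 PTEs
assert window_pte_bytes == 2 * KB  # the SPT window itself costs only 2 KB of PTEs
```

Each division by the page size and multiplication by the PTE size drops three orders of binary magnitude, which is why the window overhead is negligible.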
Providing Cross-process PTE Access for Direct I/O
The self-mapping of the page tables is an elegant solution to the page table residency problem imposed by the preceding design. However, the self-mapped page tables present significant challenges of their own to the I/O subsystem and to many device drivers.
Typically, OpenVMS device drivers for mass storage, network, and other high-performance devices perform direct memory access (DMA) and what OpenVMS calls "direct I/O." These device drivers lock down into physical memory the virtual pages that contain the requester's I/O buffer. The I/O transfer is performed directly to those pages, after which the buffer pages are unlocked, hence the term "direct I/O."
Figure 7
System Page Table Window
(Labels: PAGE TABLE SPACE (8 GB); PTEs THAT MAP S0/S1 (2 MB); S2 (6 GB); S0/S1 (2 GB); SPT WINDOW (2 MB))
The virtual address of the buffer is not adequate for device drivers because much of the driver code runs in system context and not in the process context of the requester. Similarly, a process-specific virtual address is meaningless to most DMA devices, which typically can deal only with the physical addresses of the virtual pages spanned by the buffer.
For these reasons, when the I/O buffer is locked into memory, the OpenVMS I/O subsystem converts the virtual address of the requester's buffer into (1) the address of the PTE that maps the start of the buffer and (2) the byte offset within that page to the first byte of the buffer.
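That conversion is a mask and a shift. The sketch below combines it with the self-map PTE lookup described earlier; the constants and names are ours, under the same 8-KB page and 3FE self-mapper assumptions.

```python
SELF_INDEX = 0x3FE
PAGE_BITS, VA_BITS = 13, 43

def buffer_descriptor(buf_va):
    """Return (address of the PTE mapping the buffer's first page, byte offset)."""
    offset = buf_va & ((1 << PAGE_BITS) - 1)            # byte within the first page
    vpn = (buf_va & ((1 << VA_BITS) - 1)) >> PAGE_BITS
    pte = (SELF_INDEX << 33) | (vpn << 3)               # PTE address in PT space
    if pte & (1 << (VA_BITS - 1)):                      # sign-extend to 64 bits
        pte |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return pte, offset

pte, offset = buffer_descriptor(0x2A30)   # a hypothetical P0-space buffer address
assert offset == 0xA30
assert pte == 0xFFFFFFFC00000008          # PTE of the second 8-KB page of P0 space
```

Once this pair is computed, the buffer's own virtual address, and therefore its pointer width, no longer matters to the drivers downstream.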
Once the virtual address of the I/O buffer is converted to a PTE address, all references to that buffer are made using the PTE address. This remains the case even if this I/O request and I/O buffer are handed off from one driver to another. For example, the I/O request may be passed from the shadowing virtual disk driver to the small computer systems interface (SCSI) disk class driver to a port driver for a specific SCSI host adapter. Each of these drivers will rely solely on the PTE address and the byte offset and not on the virtual address of the I/O buffer.
Therefore, the number of virtual address bits the requester originally used to specify the address of the I/O buffer is irrelevant. What really matters is the number of address bits that the driver must use to reference a PTE.
These PTE addresses were always within the page tables within the balance set slots in shared S0/S1 space. With the introduction of the self-mapped page tables, a 64-bit address is required for accessing any PTE in PT space. Furthermore, the desired PTE is not accessible using this 64-bit address when the driver is no longer executing in the context of the original requester process. This is called a cross-process PTE access problem.
In most cases, this access problem is solved for direct I/O by copying the PTEs that map the I/O buffer when the I/O buffer is locked into physical memory. The PTEs in PT space are accessible at that point because the requester process context is required in order to lock the buffer. The PTEs are copied into the kernel's heap storage, and the 64-bit PT space address is replaced by the address of the PTE copies. Because the kernel's heap storage remains in S0/S1 space, the replacement address is a 32-bit address that is shared by all processes on the system.
This copy approach works because drivers do not need to modify the actual PTEs. Typically, this arrangement works well because the associated PTEs can fit into dedicated space within the I/O request packet data structure used by the OpenVMS operating system, and there is no measurable increase in CPU overhead to copy those PTEs.
If the I/O buffer is so large that its associated PTEs cannot fit within the I/O request packet, a separate kernel heap storage packet is allocated to hold the PTEs. If the I/O buffer is so large that the cost of copying all the PTEs is noticeable, a direct access path is created as follows:
• The L3PTEs that map the I/O buffer are locked into physical memory.
• Address space within S0/S1 space is allocated and mapped over the L3PTEs that were just locked down.
This establishes a 32-bit addressable shared-space window over the L3PTEs that map the I/O buffer.
The essential point is that one of these methods is selected and employed until the I/O is completed and the buffer is unlocked. Each method provides a 32-bit PTE address that the rest of the I/O subsystem can use transparently, as it has been accustomed to doing, without requiring numerous, complex source changes.
Use of Self-identifying Structures
To accommodate 64-bit user virtual addresses, a number of kernel data structures had to be expanded and changed. For example, asynchronous system trap (AST) control blocks, buffered I/O packets
do its job. This is an instance of the cross-process PTE access problem mentioned earlier.
The swapper is unable to directly access the page tables of the process being swapped because the swapper's own page tables are currently active in virtual memory. We solved this access problem by revising the swapper to temporarily "adopt" the page tables of the process being swapped. The swapper accomplishes this by temporarily changing its PTBR contents to point to the page table structure for the process being swapped instead of to the swapper's own page table structure. This change forces the PT space of the process being swapped to become active in virtual memory and therefore available to the swapper as it prepares the process to be swapped. Note that the swapper can make this temporary change because the swapper resides in shared space. The swapper does not vanish from virtual memory as the PTBR value is changed. Once the process has been prepared for swapping, the swapper restores its own PTBR value, thus relinquishing access to the target process' PT space contents.
Thus, it can be seen how privileged code with intimate knowledge of OpenVMS memory management mechanisms can be affected by the changes to support 64-bit virtual memory. Also evident is that the alterations needed to accommodate the 64-bit changes are relatively straightforward. Although the swapper has a higher-than-normal awareness of memory management internal workings, extending the swapper to accommodate the 64-bit changes was not particularly difficult.
Immediate Use of 64-bit Addressing by the OpenVMS Kernel
Page table residency was certainly the most pressing issue we faced with regard to the OpenVMS kernel as it evolved from a 32-bit to a 64-bit-capable operating system. Once implemented, 64-bit virtual addressing could be harnessed as an enabling technology for solving a number of other problems as well. This section briefly discusses some prominent examples that serve to illustrate how immediately useful 64-bit addressing became to the OpenVMS kernel.
Page Frame Number Database and Very Large Memory
The OpenVMS Alpha operating system maintains a database for managing individual, physical page frames of memory, i.e., page frame numbers. This database is stored in S0/S1 space. The size of this database grows linearly as the size of the physical memory grows. Future Alpha systems may include larger memory configurations as memory technology continues to evolve. The corresponding growth of the page frame number database for such systems could consume an unacceptably large portion of S0/S1 space, which has a maximum size of 2 GB. This design effectively restricts the maximum amount of physical memory that the OpenVMS operating system would be able to support in the future.
We chose to remove this potential restriction by relocating the page frame number database from S0/S1 to 64-bit addressable S2 space. There it can grow to support any physical memory size being considered for years to come.
Global Page Table
The OpenVMS operating system maintains a data structure in S0/S1 space called the global page table (GPT). This pseudo-page table maps memory objects called global sections. Multiple processes may map portions of their respective process-private address spaces to these global sections to achieve protected shared memory access for whatever applications they may be running.
With the advent of P2 space, one can easily anticipate a need for orders-of-magnitude-greater global section usage. This usage directly increases the size of the GPT, potentially reaching the point where the GPT consumes an unacceptably large portion of S0/S1 space. We chose to forestall this problem by relocating the GPT from S0/S1 to S2 space. This move allows the configuration of a GPT that is much larger than any that could ever be configured in S0/S1 space.
Summary
Although providing 64-bit support was a significant amount of work, the design of the OpenVMS operating system was readily scalable such that it could be achieved practically. First, we established a goal of strict binary compatibility for nonprivileged applications. We then designed a superset virtual address space that extended both process-private and shared spaces while preserving the 32-bit visible address space to ensure compatibility. To maximize the available space for process-private use, we chose an asymmetric style of address space layout. We privatized the process page tables, thereby eliminating their residency in shared space. The few page table accesses that occurred from outside the context of the owning process, which no longer worked after the privatization of the page tables, were addressed in various ways. A variety of ripple effects stemming from this design were readily solved within the kernel.
Solutions to other scaling problems related to the kernel were immediately possible with the advent of 64-bit virtual address space. Already mentioned was the complete removal of the process page tables from shared space. We also removed the global page table and the page frame number database from 32-bit addressable to 64-bit addressable shared space. The immediate net effect of these changes was significantly more room in S0/S1 space for configuring more kernel heap storage, more balance slots to be assigned to greater numbers of memory resident processes, etc. We further anticipate use of 64-bit addressable shared space to realize additional benefits of VLM, such as for caching massive amounts of file system data.
Providing 64-bit addressing capacity was a logical, evolutionary step for the OpenVMS operating system. Growing numbers of customers are demanding the additional virtual memory to help solve their problems in new ways and to achieve higher performance. This has been especially fruitful for database applications, with substantial performance improvements already proved possible by the use of 64-bit addressing on the Digital UNIX operating system. Similar results are expected on the OpenVMS system. With terabytes of virtual memory and many gigabytes of physical memory available, entire databases may be loaded into memory at once. Much of the I/O that otherwise would be necessary to access the database can be eliminated, thus allowing an application to improve performance by orders of magnitude, for example, to reduce query time from eight hours to five minutes. Such performance gains were difficult to achieve while the OpenVMS operating system was constrained to a 32-bit environment. With the advent of 64-bit addressing, OpenVMS users now have a powerful enabling technology available to solve their problems.
Acknowledgments
The work described in this paper was done by members of the OpenVMS Alpha Operating System Development group. Numerous contributors put in many long hours to ensure a well-considered design and a high-quality implementation. The authors particularly wish to acknowledge the following major contributors to this effort: Tom Benson, Richard Bishop, Walter Blaschuk, Nitin Karkhanis, Andy Kuehnel, Karen Noel, Phil Norwich, Margie Sherlock, Dave Wall, and Elinor Woods. Thanks also to members of the Alpha languages community who provided extended programming support for a 64-bit environment; to Wayne Cardoza, who helped shape the earliest notions of what could be accomplished; to Beverly Schultz, who provided strong, early encouragement for pursuing this project; and to Ron Higgins and Steve Noyes, for their spirited and unflagging support to the very end.
The following reviewers also deserve thanks for the invaluable comments they provided in helping to prepare this paper: Tom Benson, Cathy Foley, Clair Grant, Russ Green, Mark Howell, Karen Noel, Margie Sherlock, and Rod Widdowson.
References and Notes
1. N. Kronenberg, T. Benson, W. Cardoza, R. Jagannathan, and B. Thomas, "Porting OpenVMS from VAX to Alpha AXP," Digital Technical Journal, vol. 4, no. 4 (1992): 111-120.
2. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).
3. R. Sites and R. Witek, Alpha AXP Architecture Reference Manual, 2d ed. (Newton, Mass.: Digital Press, 1995).
4. Although an OpenVMS process may refer to P0 or P1 space using either 32-bit or 64-bit pointers, references to P2 space require 64-bit pointers. Applications may very well execute with mixed pointer sizes. (See reference 8 and D. Smith, "Adding 64-bit Pointer Support to a 32-bit Run-time Library," Digital Technical Journal, vol. 8, no. 2 [1996, this issue]: 83-95.) There is no notion of an application executing in either a 32-bit mode or a 64-bit mode.
5. Superset system services and language support were added to facilitate the manipulation of 64-bit addressable P2 space.8
6. This mechanism has been in place since OpenVMS Alpha version 1.0 to support virtual PTE fetches by the translation buffer miss handler in PALcode. (PALcode is the operating system-specific privileged architecture library that provides control over interrupts, exceptions, context switching, etc.3) In effect, this means that the OpenVMS page tables already existed in two virtual locations, namely, S0/S1 space and PT space.
7. The SPT window is more precisely only an S0/S1 PTE window. The PTEs that map S2 space
Biographies

Michael S. Harvey
Michael Harvey joined Digital in 1978 after receiving his B.S.C.S. from the University of Vermont. In 1984, as a member of the OpenVMS Engineering group, he participated in new processor support for VAX multiprocessor systems and helped develop OpenVMS symmetric multiprocessing (SMP) support for these systems. He received a patent for this work. Mike was an original member of the RISCy-VAX task force, which conceived and developed the Alpha architecture. Mike led the project that ported the OpenVMS Executive from the VAX to the Alpha platform and subsequently led the project that designed . . .
. . . the OpenVMS I/O engineering team, he joined Digital . . . the OpenVMS high-level language device driver project, contributed to the port of the OpenVMS operating system to the Alpha
The OpenVMS Mixed Pointer Size Environment

A central goal in the implementation of 64-bit addressing on the OpenVMS operating system was to provide upward-compatible support for applications that use the existing 32-bit address space. Another guiding principle was that mixed pointer sizes are likely to be the rule rather than the exception for applications that use 64-bit address space. These factors drove several key design decisions in the OpenVMS Calling Standard and programming interfaces, the DEC C language support, and the system services support. For example, self-identifying 64-bit descriptors were designed to ease development when mixed pointer sizes are used. DEC C support makes it easy to mix pointer sizes and to recompile for uniform 32- or 64-bit pointer sizes. OpenVMS system services remain fully upward compatible, with new services defined only where required or to enhance the usability of the huge 64-bit address space. This paper describes the approaches taken to support the mixed pointer size environment in these areas. The issues and rationale behind these OpenVMS and DEC C solutions are presented to encourage others who provide library interfaces to use a consistent programming interface approach.
Digital Technical Journal Vol. 8 No. 2 1996
Thomas R. Benson<br />
Karen L. Noel<br />
Richard E. Peterson<br />
Support for 64-bit virtual addressing on the OpenVMS Alpha operating system, version 7.0, has vastly increased the amount of virtual address space available to applications. The following sections explore the reasons for mixing pointer sizes.
OpenVMS and Language Support
Implementation choices that Digital made for this first release of the OpenVMS operating system that supports 64-bit virtual addressing will probably encourage mixed pointer size programming. These choices were driven largely by the need for absolute upward compatibility for existing programs and the goal of supporting large, dynamic data sets as the primary application for 64-bit addressing.
Dynamic Data  Only OpenVMS services support dynamic allocation of 64-bit address space. This mechanism most closely resembles the malloc and free functions for allocating and deallocating dynamic storage in the C programming language. Allocation of this type differs from static and stack storage in that explicit source statements are required to manage it. For static and stack storage, the system is allocating the memory on behalf of the application at image activation time. (Of course, the allocation may be extended during execution in the case of stack storage.) This allocation continues to be from 32-bit addressable space.
Two special cases of static allocation are worth mentioning. Linkage sections, which are program sections that contain routine linkage information, and code sections, which contain the executable instructions, do not differ substantially from preinitialized static storage. As a result, these sections also reside only in 32-bit addressable memory.
Upward-compatibility Constraints  The OpenVMS Alpha operating system is cautious to avoid using 64-bit memory freely where it may prevent upward compatibility for 32-bit applications. For example, the linkage section might seem to be a natural candidate for the OpenVMS system to allocate automatically in 64-bit memory. This allocation would essentially free more 32-bit addressable memory for application use; however, even if this were done only for applications relinked for new versions of the OpenVMS operating system, there is no guarantee that all object code treats linkage section addresses as 64 bits in width. A simple example is storing the address of a routine in a structure. Since a routine's address is the address of its procedure descriptor in the linkage section, moving the linkage section to 64-bit memory would cause code that stores this address in a 32-bit cell to fail.
Allocating the user stack in 64-bit space also appears to be a good opportunity to easily increase the amount of memory available to an application. Stack addresses are often more visible to application code than linkage section addresses are. For instance, a routine can easily allocate a local variable using temporary storage on the stack and pass the address of the variable to another routine. If the stack is moved to 64-bit space, this address quietly becomes a 64-bit address. If the called routine is not 64-bit capable, attempts to use the address will fail.
Focus on Services Required for Large Data Sets  Not all system services could be changed to support 64-bit addresses (i.e., promoted) in time for the first version of the OpenVMS operating system to support 64-bit addressing. With the mixed-pointer model in mind, we focused on those services that were likely to be required for large data sets. For example, to allow I/O directly to and from high memory, it was essential that the I/O queuing service, SYS$QIO, accept a 64-bit buffer address. Conversely, the SYS$TRNLNM service for translating a logical name did not need to be modified to accept 64-bit addresses. Its arguments include a logical name, a table name, and a vector that contains requests for information about the name. These are small data elements that are unlikely to require 64-bit addressing on their own. Of course, they may be part of some larger structure that resides in 64-bit space. In this case, they can easily be copied to or from 32-bit addressable memory.
System services are discussed further in the section OpenVMS System Services. The 32-bit address restriction on certain system services again emphasizes the importance of being able to logically separate large data set support from the rest of an application.
Limited Language Support  Another interface point that requires care when using 64-bit addressing is at calls between modules written in different programming languages. The OpenVMS Calling Standard traditionally makes it easy to mix languages in an application, but DEC C is the only high-level language to fully support 64-bit addresses in the first 64-bit-capable version of the OpenVMS operating system.2 The use of 64-bit addresses in mixed-language applications is possible, and data that contains 64-bit addresses may even be shared; however, references that actually use the data pointed to by these addresses need to be limited to DEC C code or assembly language. Mixed high-level language applications are certain to be mixed pointer size applications in this version of the operating system.
Support for 32-bit Libraries
Many applications rely on library packages to provide some aspect of their functionality. Typical examples include user interface packages, graphics libraries, and database utilities. Third-party libraries may or may not support 64-bit addresses. Applications that use these libraries will probably mix 32-bit and 64-bit pointer sizes and will therefore require an operating system that supports mixed pointer sizes.
Implications of Full 64-bit Conversion
For some applications, it may be desirable to mix pointer sizes to avoid the side effects of universal 64-bit address conversion. The approach of recompiling everything with 64-bit address widths is sometimes called "throwing the switch." An obvious implication of throwing the switch is that all pointer data doubles in size. For complex linked data structures, this can be a significant overall increase in size. Increasing the pointer size may also reveal hidden dependencies on pointer size being the same as integer size. If code accesses a cell as both a 32-bit integer and a 32-bit pointer, the code will no longer work if the pointer is enlarged. Thus, universal conversion may not be appropriate for every application.
structure-specific information. Typically, descriptors are used only for character strings, arrays, and complex data types such as packed decimal.
Descriptor types are by definition self-identifying by virtue of the type and class fields they contain. An obvious choice, therefore, for extending descriptors to handle 64-bit addresses would be to add new type constants for 64-bit data elements and extend the structure beyond the type fields to accommodate larger addresses and sizes. In practice, however, the address and length fields of descriptors are frequently used without accessing the type fields, particularly when a character string descriptor is expected.
As a result, a solution was sought that would yield a predictable failure, rather than incorrect results or data corruption, when a 64-bit descriptor is received by a routine that expects only the 32-bit form. The final design includes a separate 64-bit descriptor layout that contains two special fields at the same offsets as the length and address fields in the 32-bit descriptor. These fields are called MBO (must be one) and MBMO (must be minus one), respectively. The simplest versions of the 32-bit and 64-bit descriptors are illustrated in Figure 1. These self-identifying fields are key to ensuring robust interfaces in the mixed pointer size environment.
DEC C Language Support for Mixed Pointer Sizes
To support application programming in the mixed pointer size environment, some design work was required in the DEC C compiler. This section describes the rationale behind the final design.
It was clear that the compiler would have to provide a way for 32-bit and 64-bit pointers to coexist in the same regions of code. At the same time, customers and
internal users initially favored a simple command line switch when polled on potential compiler support for 64-bit address space. (At least one C compiler that supports 64-bit addressing, MIPSpro C, does so only through command line switches for setting pointer sizes.3) The motivation for using switches was to limit the source changes needed to take advantage of the additional address space, especially when portability to other platforms is desired. For cases in which mixing pointer sizes was unavoidable, something more flexible than a switch was needed.
Why Not _near and _far?
The most common suggestion for controlling individual pointer declarations was to adopt the _near and _far type qualifier syntax used in the PC environment in its transition from 16-bit to 32-bit addressing.4 While this idea has merit in that it has already been used elsewhere in C compilers and is familiar to PC software developers, we rejected this approach for the following reasons:
• The syntax is not standard.
• The syntax requires source code edits at each declaration to be affected.
• The syntax has become largely obsolete even in the PC domain with the acceptance of the flat 32-bit address space model offered by modern 386-minimum PC compilers and the Win32 programming interface.
• Because of the vast difference in scale in choosing between 16-bit or 32-bit pointers on a PC as compared to choosing between 32-bit or 64-bit pointers on an Alpha system, there would be no porting benefit in using the same keywords. No existing source code base would be able to port to the OpenVMS mixed pointer size environment more easily because of the presence of _near and _far qualifiers.
Pragma Support
The Digital UNIX C compiler had previously defined pragma preprocessing directives to control pointer sizes for slightly different reasons than those described for the OpenVMS system.5 By default, the Digital UNIX operating system offers a pure 64-bit addressing model. In some circumstances, however, it is desirable to be able to represent pointers in 32 bits to match externally imposed data layouts or, more rarely, to reduce the amount of memory used in representing pointer values. The Digital UNIX pointer_size pragmas work in conjunction with command line options and linker/loader features that limit memory use and map memory such that pointer values accessible to the C program can always be represented in 32 bits.
Since compatibility with the Digital UNIX compiler would have greater value if it met the needs of the OpenVMS platform, we evaluated the pragma-based
approach and decided to adopt it, propagating any necessary changes back to the UNIX platform to maintain compatibility. The decision to use pragmas to control pointer size addressed the major deficiencies of the _near and _far approach. In particular, the pragma directive is specified by ISO/ANSI C in such a way that using it does not compromise portability as the use of additional keywords can, because unrecognized pragmas are ignored. Furthermore, pragmas can easily be specified to apply to a range of source code rather than to an individual declaration. A number of DEC C pragmas, including the pointer size controls implemented on the UNIX system, provide the ability to save and restore the state of the pragma. This makes them convenient and safe to use to modify the pointer size within a particular region of code without disturbing the surrounding region. The state may easily be saved before changing it at the beginning of the region and then restored at the end.
Command Line Interaction
Pragmas fit in with the initial desire of prospective users to have a simple command line switch to indicate 64 bits. As with several other pragmas, we defined a command line qualifier (/pointer_size) to specify the initial state of the pragma before any instances are encountered in the text. Unlike other pragmas, though, we also use the same command line qualifier to enable or disable the action of the pragmas altogether. In this way, a default compilation of source code modified for 64-bit support behaves the same way that it would on a system that did not offer 64-bit support. That is, the pragmas are effectively ignored, with only an informational message produced.
This behavior was adopted for consistency with the Digital UNIX behavior and also to aid in the process of adding optional 64-bit support to existing portable 32-bit source code that might be compiled for an older system or with an older compiler. In this model, a compilation of new source code using an old command line produces behavior that is equivalent to the behavior produced using an older compiler or a compiler on another platform. With one notable exception, building an application that actually uses 64-bit addressing requires changing the command line.
The exception to the rule that existing 32-bit build procedures do not create 64-bit dependencies is a second form of the pragma, named required_pointer_size. This form contrasts with the form pointer_size in that it is always active regardless of command line qualifiers; otherwise, required_pointer_size and pointer_size are identical. The intent of this second pragma is to support writing source code that specifies or interfaces to services or libraries that can only work correctly with 64-bit pointers. An example of this code might be a header file that contains declarations for both 64-bit and 32-bit memory management services; the services must always be defined to accept and return the appropriate pointer size, regardless of the command line qualifier used in the compilation.
Pragma Usage
The use of pragmas to control pointer sizes within a range of source code fits well with the model of starting with a working 32-bit application and extending it to exploit 64-bit addressing with minimal source code edits. Programming interface and data structure declarations are typically packaged together in header files, and the primary manipulators of those data structures are often implemented together in modules.
One good approach for extending a 32-bit application would be to start with an initial analysis of memory usage measurements. The purpose of this analysis would be to produce a rough partitioning of routines and data structures into two categories: "32-bit sufficient" and "64-bit desirable." Next, 64-bit pointer pragmas could be used to enclose just the header files and source modules that correspond to the routines and data structures in the "64-bit desirable" category.
Allocation Function Mapping  The command line qualifier setting the default pointer size has an additional effect that simplifies the use of 64-bit address space. If an explicit pointer size is specified on the command line, the malloc function is mapped to a routine specific to the address space for that size. For example, _malloc64 is used for malloc when the default pointer size is 64 bits. This allows allocation of 64-bit address space without additional source changes. The source code may also call the size-specific versions of run-time routines explicitly, when compiled for mixed pointer sizes. These size-specific functions are available, however, only when the /pointer_size command line qualifier is used. See "Adding 64-bit Pointer Support to a 32-bit Run-time Library" in this issue for a discussion of other effects of 64-bit addressing on the C run-time library.
Header File Semantics  The treatment of pointer_size pragmas in and around header files (i.e., any source included by the #include preprocessing directive) deserves special mention. Programs typically include both private definition files and public or system-specific header files. In the latter case, it may not be desirable for definitions within the header files to be affected by the pointer_size pragmas or command line currently in effect. To prevent these definitions from being affected, the DEC C compiler searches for special prologue and epilogue header files when a #include directive is processed. These files may be used to establish a particular state for environmental pragmas, such as pointer_size, for all header files in the directory. This eliminates the need to modify either the individual header files or the source code that includes them.
The compiler creates a predefined macro called __INITIAL_POINTER_SIZE to indicate the initial pointer size as specified on the command line. This may be of particular use in header files to determine what pointer size should be used, if mixed pointer size support is desirable. Conditional compilation based on this macro's definition state can be used to set or override pointer size or to detect compilation by an older compiler lacking pointer-size support. If its value is zero, no /pointer_size qualifier was specified, which means that pointer_size pragmas do not take effect. If its value is 32 or 64, pointer_size pragmas do take effect, so it can be assumed that mixed pointer sizes are in use.
Code Example
In the simple code example shown in Figure 3, suppose that the routine proc1 is part of a library that has been only partially promoted to use 64-bit addresses. This function may receive either a 32-bit address or a 64-bit address in the argument_ptr parameter. To demonstrate the use of the new DEC C features, proc1 has been modified to copy this character string parameter from 64-bit space to 32-bit space when necessary, so that routines that proc1 subsequently calls need to deal with only 32-bit addresses.
The __INITIAL_POINTER_SIZE macro is used to determine if pointer_size pragmas will be effective and, hence, whether argument_ptr might be 64 bits in width. If it might be a 64-bit pointer, whose actual width is unknown in this example, the pointer's value is copied to a 32-bit-wide pointer. The pointer_size pragma is used to change the current pointer size to 32 bits to declare the temporary pointer. Next, the two pointer values are compared to determine if the original pointer fits in 32 bits. If the pointer does not fit, temporary storage in 32-bit addressable space is allocated, and the argument is copied there. Note that the example uses _malloc32 rather than malloc, because malloc would allocate 64-bit address space if the initial pointer size was 64 bits. At the end of the routine, the temporary space is freed, if necessary.
A type cast is used in the assignment from argument_ptr to temp_short_ptr, even though both variables are of type char *. Without this type cast, if argument_ptr is a 64-bit-wide pointer, the DEC C compiler would report a warning message because of the potential data loss when assigning from a 64-bit to a 32-bit pointer.
For other examples of pointer_size pragmas and the use of the __INITIAL_POINTER_SIZE macro, see Duane Smith's paper on 64-bit pointer support in run-time libraries.
OpenVMS System Services
The OpenVMS operating system provides a suite of services that perform a variety of basic operating system functions.7 Design work was required to maximize the utility of these routines in the new mixed pointer size environment. Issues that needed to be addressed included the following, which are discussed in subsequent sections:
• Several services pass pointers by reference and, hence, required an interface change.
• Because of resource constraints, not all system services could be promoted to handle 64-bit addresses in the first version of the 64-bit-capable OpenVMS operating system.
• Since the services provide mixed levels of support, it is important to indicate those that support 64-bit addresses and those that do not.
• Certain new services seemed desirable to improve the usability of 64-bit address space.
Services That Are 64-bit Friendly
Services that can be promoted to support 64-bit addresses without any interface change are called 64-bit friendly. If a service receives an address by reference, the service is typically not 64-bit friendly, and a separate
Figure 3
Code Example of Pointer_size Pragmas and the __INITIAL_POINTER_SIZE Macro

#include <stdlib.h>
#include <string.h>

void proc1(char *argument_ptr)
{
#if __INITIAL_POINTER_SIZE != 0
#pragma pointer_size save
#pragma pointer_size 32
    char *temp_short_ptr;
#pragma pointer_size restore

    temp_short_ptr = (char *)argument_ptr;
    if (temp_short_ptr != argument_ptr) {
        temp_short_ptr = _malloc32(strlen(argument_ptr) + 1);
        strcpy(temp_short_ptr, argument_ptr);
        argument_ptr = temp_short_ptr;
    }
    else {
        temp_short_ptr = 0;
    }
#endif

/*
    The actual body of proc1 is omitted.  Assume that it calls
    routines that operate on the data pointed to by argument_ptr
    and that the routines are not yet prepared to handle 64-bit
    addresses.
*/

#if __INITIAL_POINTER_SIZE != 0
    if (temp_short_ptr != 0)
        free(temp_short_ptr);
#endif
}
entry point is required to support 64-bit addresses. A single routine cannot distinguish whether the address at the specified location is 32 bits or 64 bits in width.
If a service does not receive or return an address by reference, the service is usually 64-bit friendly. Even descriptor arguments present no problem, because the 32- and 64-bit versions can be distinguished at run time. The majority of services fall into this category.
The services that are not 64-bit friendly include the entire suite of memory management system services, since they access address ranges passed by reference. Other such services include those that receive a 32-bit vector as an argument, which may include the address of a pointer as an element. A good example from this group is SYS$FAOL, which accepts a 32-bit vector argument for formatted output. For all these services, new interfaces were designed to accommodate 64-bit callers.
Promotion of Services<br />
The Open VMS project team explored the idea of pro·<br />
moting all system services to support 64 -bit addresses.<br />
Since the majority of OpcnVtvlS system service<br />
ro utines an: 1\'rittcn in the l'viAC R0-32 assembly language<br />
or the Bliss-32 programming language, the<br />
internals of the routines could not be promoted to<br />
handle 64-bit addresses without moditications. We<br />
could not take advantage of the throw-the-swirch<br />
approach, and we did not want to because many<br />
pointers used internally in the OpcnVMS operating<br />
system remain at 32 bits.<br />
We considered using 64-bit jacket routines to copy 64-bit arguments to the stack in 32-bit space, which would then call the 32-bit internal routine to perform the work. We rejected this approach for several reasons. First, the jackets would need to be custom written to ensure correct parameter semantics. There could not be a "common jacket" that could have saved time and lowered risk. Second, there would be an undesirable performance impact for 64-bit callers. Third, we decided that having a complete 64-bit system service suite was not essential for usable 64-bit support. We could define a subset that would meet the needs of 64-bit address space users, while lowering our risk and implementation costs.
The services selected for 64-bit support fall into four categories.
1. Memory management services.
2. Performance-critical services. This group includes services that are typically sensitive to the addition of even a few cycles of execution time. Requiring that a 64-bit address user do any additional work, such as copying data to 32-bit space, is undesirable. An example of this type of service is SYS$ENQ, which is used for queuing lock requests.
3. Design center services. The primary design center for 64-bit support was database applications. Database architects and consultants were polled to determine which services were most needed by their products. Many of these services, for example SYS$QIO for queuing I/O requests, were also in the performance-critical set.
4. Other useful basic services. This set includes services to ease the transition to 64 bits with minimal change to program structure. For example, the SYS$CMKRNL service accepts a routine address and a vector of 32-bit arguments
interfaces than the 32-bit-only SYS$CRMPSC. The original SYS$CRMPSC is still present so that existing code may function without change.
Some new feature requests were considered as part of the 64-bit effort, but, to maintain the focus of the release, these features were not implemented. The 64-bit memory management services were designed to more easily accommodate new features in the future. For example, the new services check the argument count for both too many and too few supplied arguments. In this way, new optional arguments can be added later to the end of the list without jeopardizing backward compatibility.
Virtual Regions
One new feature that was added to the suite of 64-bit memory management services is support for new entities called virtual regions. A virtual region is an address range that is reserved by a program for future dynamic allocation requests. The region is similar in concept to the program region (P0) and the control region (P1), which have long existed on the OpenVMS operating system. A virtual region differs from the program and control regions in that it may be defined by the user by calling a system service and may exist within P0, P1, or the new 64-bit addressable process-private space, P2. When a virtual region is created, a handle is returned that is subsequently used to identify the region in memory management requests.
Address space within virtual regions is allocated in the same manner as in the default P0, P1, and P2 regions, with allocation defined to expand space toward either ascending or descending addresses. As in the default regions, allocation is in multiples of pages. The OpenVMS operating system keeps track of the first free virtual address within the region. A region can be created such that address space is created automatically when a virtual reference is made within the region, just as the control region in P1 space expands automatically to accommodate user stack expansion.
When a virtual region is created within P0, P1, or P2, the remainder of that containing region decreases in size so that it does not overlap with the virtual region.
Virtual regions were added to the OpenVMS Alpha operating system along with the 64-bit addressing capability so that the huge expanse of 64-bit address space could be more easily managed. If a subsystem requires a large portion of virtually contiguous address space, the space can be reserved within P2 with little overhead. Other subsystems within the application cannot inadvertently interfere with the contiguity of this address space. They may create their own regions or create address space within one of the default regions.
Another advantage of using virtual regions is that they are the most efficient way to manage sparse address space within the 64-bit P2 space. Furthermore, no quotas are charged for the creation of a virtual region. The internal storage for the description of the region comes from process address space, which is the only resource used.
Summary
This paper presents the reasons behind the new OpenVMS mixed pointer size environment and the support added to allow programming within this environment. The discussion touches on some of the new support designed to simplify the use of the 64-bit address space.
The approaches discussed yielded full upward compatibility for 32-bit applications, while allowing other applications access to the huge 64-bit address space for data sets that require it. Promotion of all pointers to 64-bit width is not required to use 64-bit space; the mixed pointer size environment was considered paramount in all design decisions. A case study of adding 64-bit support to the C run-time library also appears in this issue of the Journal.
Acknowledgments
The authors wish to thank the other members of the 64-bit Alpha-L team who helped shape many of the ideas presented in this paper: Mark Arsenault, Gary Barton, Barbara Benton, Ron Brender, Ken Cowan, Mark Davis, Mike Harvey, Lon Hilde, Duane Smith, Cheryl Stocks, Lenny Szubowicz, and Ed Vogel.
References
1. M. Harvey and L. Szubowicz, "Extending OpenVMS for 64-bit Addressable Virtual Memory," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 57-71.
2. OpenVMS Calling Standard (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBBA-TE, 1995).
3. MIPSpro 64-Bit Porting and Transition Guide, Document No. 007-2391-002 (Mountain View, Calif.: Silicon Graphics, Inc., 1996).
4. C lclllf:!JWi
7. OpenVMS System Services Reference Manual: A-GETMSG (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBMA-TE, 1995) and OpenVMS System Services Reference Manual: GETQUI-Z (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBN-TE, 1995).
8. OpenVMS Alpha Guide to 64-Bit Addressing (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSIICA-TE, 1995).
9. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).
Biographies

Thomas R. Benson
A consulting engineer in the OpenVMS Engineering Group, Tom Benson is one of the developers of 64-bit addressing support. Tom joined Digital's VAX Basic project in 1979 after receiving B.S. and M.S. degrees in computer science from Syracuse University. After working on an optimizing compiler shell used by several VAX compilers, he joined the VMS Group where he led the VMS DECwindows FileView and Session Manager projects, and brought the Xlib graphics library to the VMS operating system. Tom holds three patents on the design of the VAX MACRO-32 compiler for Alpha and recently applied for two patents related to 64-bit addressing work.

Karen L. Noel
A principal engineer in the OpenVMS Engineering Group, Karen Noel is one of the developers of 64-bit
Adding 64-bit Pointer Support to a 32-bit Run-time Library

A key component of delivering 64-bit addressing on the OpenVMS Alpha operating system, version 7.0, is an enhanced C run-time library that allows application programmers to allocate and utilize 64-bit virtual memory from their C programs. This C run-time library includes modified programming interfaces and additional new interfaces yet ensures upward compatibility for existing applications. The same run-time library supports applications that use only 32-bit addresses, only 64-bit addresses, or a combination of both. Source code changes are not required to utilize 64-bit addresses, although recompilation is necessary. The new techniques used to analyze and modify the interfaces are not specific to the C run-time library and can serve as a guide for engineers who are enhancing their programming interfaces to support 64-bit pointers.
Duane A. Smith
The OpenVMS Alpha operating system, version 7.0, has extended the address space accessible to applications beyond the traditional 32-bit address space. This new address space is referred to as 64-bit virtual memory and requires a 64-bit pointer for memory access.1 The operating system has an additional set of new memory allocation routines that allows programs to allocate and release 64-bit memory. In OpenVMS Alpha version 7.0, this set of routines is the only mechanism available to acquire 64-bit memory.
For application programs to take advantage of these new OpenVMS programming interfaces, high-level programming languages such as C had to support 64-bit pointers. Both the C compiler and the C run-time library required changes to provide this support. The compiler needed to understand both 32-bit and 64-bit pointers, and the run-time library needed to accept and return such pointers.
The compiler has a new qualifier called /pointer_size, which sets the default pointer size for the compilation to either 32 bits or 64 bits. Also added to the compiler are pragmas (directives) that can be used within the source code to change the active pointer size. An application program is not required to compile each module using the same /pointer_size qualifier; some modules may use 32-bit pointers while others use 64-bit pointers. Benson, Noel, and Peterson describe these compiler enhancements.2 The DEC C User's Guide for OpenVMS Systems documents the qualifier and the pragmas.3
The C run-time library added 64-bit pointer support either by modifying the existing interface to a function or by adding a second interface to the same function. Public header files define the C run-time library interfaces. These header files contain the publicly accessible function prototypes and structure definitions. The library documentation and header files are shipped with the C compiler; the C run-time library ships with the operating system.
This paper discusses all phases of the enhancements to the C run-time library, from project concepts through the analysis, the design, and finally the implementation. The DEC C Run-Time Library Reference Manual for OpenVMS Systems contains user documentation regarding the changes.4
Digital Technical Journal Vol. 8 No. 2 1996
Starting the Project
We devoted the initial two months of the project to understanding the overall OpenVMS presentation of 64-bit addresses and deciding how to present 64-bit enhancements to customers. Representatives from OpenVMS engineering, the compiler team, the run-time library team, and the OpenVMS Calling Standard team met weekly with the goal of converging on the deliverables for OpenVMS Alpha version 7.0.
The project team was committed to preserving both source code compatibility and the upward compatibility aspects of shareable images on the OpenVMS operating system. Early discussions with application developers reinforced our belief that the OpenVMS system must allow applications to use 32-bit and 64-bit pointers within the same application. The team also agreed that for a mixed-pointer application to work effectively, a single run-time library would need to support both 32-bit and 64-bit pointers; however, there was no known precedent for designing such a library.
One implication of the decision to design a run-time library that supported 32-bit and 64-bit pointers was that the library could never return an unsolicited 64-bit pointer. Returning a 64-bit pointer to an application that was expecting a 32-bit pointer would result in the loss of one half of an address. Although typically this error would cause a hardware exception, the resulting address could be a valid address. Storing to or reading from such an address could result in incorrect behavior that would be difficult to detect.
The OpenVMS Calling Standard specifies that arguments passed to a function be 64-bit values. If a 32-bit address is passed to a function, the compiler-generated code sign extends the value into a 64-bit argument. After examining the library's interfaces, the team placed the functions into four categories (listed in order of added complexity to library users):
1. Not affected by the size of a pointer
2. Enhanced to accept both pointer sizes
3. Duplicated to have a 64-bit-specific interface
4. Restricted from using 64-bit pointers
One last point to come from the meetings was that many of the C run-time library interfaces are implemented by calling other OpenVMS images. For example, the Curses Screen Management interfaces make calls to the OpenVMS Screen Management (SMG) facility. It is important that the C run-time library defines the interfaces to support 64-bit addresses without looking at the implementation of the function. Consistency and completeness of the interface are more important than the complexity of the implementation. In the SMG example, if the C run-time library needs to make a copy of a string prior to passing the string to the SMG facility, this is what will be implemented.
Analyzing the Interfaces
The process of analyzing the interfaces began by creating a document that listed all the header files
in structures. If a FILE pointer is always a 32-bit pointer, structures that contain only FILE pointers are not affected by the choice of pointer size. We use this information when we look at the layout of structures and examine function prototypes that accept pointers to structures.
In this paper, structures that are always
An example of a parameter being restricted to a low memory address is the buffer being passed as the parameter to the function setbuf. The parameter defines this buffer to be used for I/O operations. The user expects to see this buffer change as I/O operations are performed on the file. If the run-time library made a copy of this buffer, the changes would
structure that is bound to low memory, such as a FILE or WINDOW pointer, the return value does not force separate entry points.
Certain functions, such as malloc, allocate memory on behalf of the calling program and return the address of that memory as the value of the function. These functions have two implementations: the 32-bit interface always allocates 32-bit virtual memory, and the 64-bit interface always allocates 64-bit virtual memory.
Many string and memory functions have return values that are relative to a parameter passed to the same routine. These addresses may be returned as high memory addresses if and only if the parameter is a high memory address.
The following is the function prototype for strcat, which is found in the header file <string.h>:

char *strcat(char *s1, const char *s2);

The strcat function appends the string pointed to by s2 to the string pointed to by s1. The return value is the address of the resulting string s1.
In this case, the size of the pointer in the return value is always the same as the size of the pointer passed as the first parameter. The C programming language has no way to reflect this. Since the function may return a 64-bit pointer, the strcat function must have two entry points.
As discussed earlier, the pointer size used for parameter s2 is not related to the returned pointer size. The C run-time library made this s2 argument 64-bit friendly by declaring it a 64-bit pointer. This declaration allows the application programmer to concatenate a string in high memory to one in low memory without altering the source code. The following strcat function statement shows this declaration:

char *strcat(char *s1, __char_ptr64 s2);

The data type __char_ptr64 is a 64-bit character pointer whose definition and use will be explained later in this paper.
High-level Design
The /pointer_size qualifier is available in those versions of the C compiler that support 64-bit pointers. The compiler has a predefined macro named __INITIAL_POINTER_SIZE whose value is based on the use of the /pointer_size qualifier. The macro accepts the following values:
• 0, which indicates that the /pointer_size qualifier is not used or is not available
• 32, which indicates that the /pointer_size qualifier is used and has a value of 32
• 64, which indicates that the /pointer_size qualifier is used and has a value of 64
The C run-time library header files conditionally compile based on the value of this predefined macro. A zero value indicates to the header files that the computing environment is purely 32-bit. The pointer-size-specific function prototypes are not defined. The user must use the /pointer_size qualifier to access 64-bit functionality. The choice of 32 or 64 determines the default pointer size.
The header files define two distinct types of declarations: those that have a single implementation and those that have pointer-size-specific implementations. The addresses passed or returned from functions that have a single implementation are either bound to low memory, restricted to low memory, or widened to accept a 64-bit pointer.
Those functions that have pointer-size-specific entry points have three function prototypes defined. Using malloc as an example, prototypes are created for the functions malloc, _malloc32, and _malloc64. The latter two prototypes are the pointer-size-specific prototypes and are defined only when the /pointer_size qualifier is used. The malloc prototype defaults to calling _malloc32 when the default pointer size is 32 bits. The malloc prototype defaults to calling _malloc64 when the default pointer size is 64 bits. Application programmers who mix pointer types use the /pointer_size qualifier to establish the default pointer size but can then use _malloc32 and _malloc64 explicitly to achieve nondefault behavior.
In addition to being enhanced to support 64-bit pointers, the C compiler has the added capability of detecting incorrect mixed-pointer usage. It is the function prototype found in the header files that tells the compiler exactly what to expect. When forming the external name for a function, the compiler checks whether the function is the 32-bit-specific entry point. If it is, the compiler forms the prefixed name by adding "decc$" to the beginning of the name but also removes the "_" and the "32." Consequently, the function name _malloc32 becomes decc$malloc, but the function name _wz32 does not change.
Implementation
To illustrate changes that needed to be made to the header files, we invented a single header file called <header.h>. This file, which is shown in Figure 1, illustrates the classes of problems faced by a developer who is adding support for 64-bit pointers. The functions defined in this header file are actual C run-time library functions.
Preparing the Header File
The first pass through <header.h> resulted in a number of changes in terms of formatting, commenting, and 64-bit support. Realizing that many modifications would be made to the header files, we considered readability a major goal for this release of these files.
The initial header files assumed a uniform pointer size of 32 bits for the OpenVMS operating system. During the first pass through <header.h>, we added pointer-size pragmas to ensure that the file saved the user's pointer size, set the pointer size to 32 bits, and then restored the user's pointer size at the end of the header.
Next we formatted <header.h> to show the various categories that the structures and functions fall into. The categories and the result of the first pass through <header.h> can be seen in Figure 2. For example, the function rand had no pointers in the function prototype and was immediately moved to the section "Functions that support 64-bit pointers."
Organizing <header.h> in this way gave us an accurate reading of how many more functions needed 64-bit support. If any of the sections became empty, we did not remove them.

Figure 1
Original Header File <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

int    execv(const char *, char *[]);
void   free(void *);
void   *malloc(size_t);
int    rand(void);
char   *strcat(char *, const char *);
char   *strerror(int);
size_t strlen(const char *);

#endif /* HEADER_LOADED */
Figure 2
First Pass through <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

/*
** Ensure that we begin with 32-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# if (__VMS_VER < 70000000)
#  error "Pointer size added in OpenVMS V7.0 for Alpha"
# endif
# pragma __pointer_size __save
# pragma __pointer_size 32
#endif

/*
** STRUCTURES NOT AFFECTED BY POINTERS
*/
#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

/*
** FUNCTIONS THAT NEED 64-BIT SUPPORT
*/
int    execv(const char *, char *[]);
void   free(void *);
void   *malloc(size_t);
char   *strcat(char *, const char *);
char   *strerror(int);
size_t strlen(const char *);

/*
** Create 32-bit header file typedefs.
*/

/*
** Create 64-bit header file typedefs.
*/

/*
** FUNCTIONS RESTRICTED FROM 64 BITS
*/

/*
** Change default to 64-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size 64
#endif

/*
** FUNCTIONS THAT SUPPORT 64-BIT POINTERS
*/
int rand(void);

/*
** Restore the user's pointer context.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif

#endif /* HEADER_LOADED */
We created a C run-time library private header file called <wide_types.src>. This header file has the appropriate pragmas to define 64-bit pointer types used within the C run-time library, as shown in Figure 3. This header file also contains the definitions of macros used in the implementations of the functions. Figure 4 shows the macros declared in <wide_types.src>.
Once a module includes the file <wide_types.src>, the compilation of that module changes to add the qualifier /pointer_size=32. This change improves the readability of the code because "char *" is read as a 32-bit character pointer, whereas 64-bit pointers use typedefs whose names begin with "__wide". The name of the new typedef is __wide_char_ptr, which is read as a 64-bit character pointer.
The C run-time library design also requires that the implementation of a function include all header files that define the function. This ensures that the implementation matches the header files as they are modified to support 64-bit pointers. For functions defined in multiple header files, this ensures that header files do not contradict each other.
Figure 3
Typedefs from <wide_types.src>

/*
** This include file defines all 32-bit and 64-bit data types used in
** the implementation of 64-bit addresses in the C run-time library.
**
** Those modules that are compiled with a 64-bit-capable compiler
** are required to enable pointer size with /POINTER_SIZE=32.
*/
#ifdef __INITIAL_POINTER_SIZE
# if (__INITIAL_POINTER_SIZE != 32)
#  error "This module must be compiled /pointer_size=32"
# endif
#endif

/*
** All interfaces that require 64-bit pointers must use one of
** the following definitions. When this header file is used on
** platforms not supporting 64-bit pointers, these definitions
** will define 32-bit pointers.
*/
#ifdef __INITIAL_POINTER_SIZE
# pragma __pointer_size __save
# pragma __pointer_size 64
#endif

typedef char        * __wide_char_ptr;
typedef const char  * __wide_const_char_ptr;
typedef int         * __wide_int_ptr;
typedef const int   * __wide_const_int_ptr;
typedef char       ** __wide_char_ptr_ptr;
typedef const char ** __wide_const_char_ptr_ptr;
typedef void        * __wide_void_ptr;
typedef const void  * __wide_const_void_ptr;

#include <curses.h>
typedef WINDOW * __wide_WINDOW_ptr;

#include <stddef.h>
typedef size_t * __wide_size_t_ptr;

/*
** Restore pointer size.
*/
#ifdef __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif
Figure 4
Macros from <wide_types.src>

/*
** Define macros that are used to determine pointer size and
** macros that will copy from high memory onto the stack.
*/
#ifdef __INITIAL_POINTER_SIZE
# include <builtins.h>
# define C$$IS_SHORT_ADDR(addr) \
    ((((__int64)(addr) << 32) >> 32) == (unsigned __int64)(addr))
# define C$$SHORT_ADDR_OF_STRING(addr) \
    (C$$IS_SHORT_ADDR(addr) ? (char *)(addr) \
      : (char *)strcpy(__ALLOCA(strlen(addr) + 1), (addr)))
# define C$$SHORT_ADDR_OF_STRUCT(addr) \
    (C$$IS_SHORT_ADDR(addr) ? (void *)(addr) \
      : (void *)memcpy(__ALLOCA(sizeof (*addr)), (addr), sizeof (*addr)))
# define C$$SHORT_ADDR_OF_MEMORY(addr, len) \
    (C$$IS_SHORT_ADDR(addr) ? (void *)(addr) \
      : (void *)memcpy(__ALLOCA(len), (addr), len))
#else
# define C$$IS_SHORT_ADDR(addr)             (1)
# define C$$SHORT_ADDR_OF_STRING(addr)      (addr)
# define C$$SHORT_ADDR_OF_STRUCT(addr)      (addr)
# define C$$SHORT_ADDR_OF_MEMORY(addr, len) (addr)
#endif
Implementing the strerror Return Pointer
The function strerror always returns a 32-bit pointer. The memory is allocated by the C run-time library for both 32-bit and 64-bit calling programs. As shown in Figure 5, we moved the function strerror into the section "Functions that support 64-bit pointers" of <header.h> to show that there are no restrictions on the use of this function.
The "Create 32-bit header file typedefs" section of <header.h> is in the 32-bit pointer section, where the bound-to-low-memory data structures are declared. The function returns a pointer to a character string. We, therefore, added typedefs for __char_ptr32 and __const_char_ptr32 while in a 32-bit pointer context. These declarations are protected with the definition of __CHAR_PTR32 to allow multiple header files to use the same naming convention. Declarations of the const form of the typedef are always made in the same conditional code since they usually are needed and using the same condition removes the need for a different protecting name.
The strerror function could have been implemented in <header.h> by placing the function in the 32-bit section, but that would have implied that the 32-bit pointer was a restriction that could be removed later. The pointer is not a restriction, and the strerror function fully supports 64-bit pointers.
The private header file typedefs are always declared starting with two underscores and ending in either "_ptr32" or "_ptr64." These typedefs are created only when the header file needs to be in a particular pointer-size mode while referring to a pointer of the other size. The return value of strerror is modified to use the typedef __char_ptr32.
Including the header file, which declares strerror, allows the compiler to verify that the arguments, return values, and pointer sizes are correct.
Widening the strlen Argument
The function strlen accepts a constant character pointer and returns an unsigned integer (size_t). Implementing full 64-bit support in strlen means changing the parameter to a 64-bit constant character pointer. If an application passes a 32-bit pointer to the strlen function, the compiler-generated code sign extends the pointer. The required header file modification is to simply move strlen from the section "Functions that need 64-bit support" to the section "Functions that support 64-bit pointers."
The steps necessary for the source code to support 64-bit addressing are as follows:
1. Ensure that the module includes header files that declare strlen.
Figure 5
Final Form of <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

/*
** Ensure that we begin with 32-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# if (__VMS_VER < 70000000)
#  error "Pointer size added in OpenVMS V7.0 for Alpha"
# endif
# pragma __pointer_size __save
# pragma __pointer_size 32
#endif

/*
** STRUCTURES NOT AFFECTED BY POINTERS
*/
#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

/*
** FUNCTIONS THAT NEED 64-BIT SUPPORT
*/

/*
** Create 32-bit header file typedefs.
*/
#ifndef __CHAR_PTR32
# define __CHAR_PTR32 1
typedef char * __char_ptr32;
typedef const char * __const_char_ptr32;
#endif

/*
** Create 64-bit header file typedefs.
*/
#ifndef __CHAR_PTR64
# define __CHAR_PTR64 1
# pragma __pointer_size 64
typedef char * __char_ptr64;
typedef const char * __const_char_ptr64;
# pragma __pointer_size 32
#endif

/*
** FUNCTIONS RESTRICTED FROM 64 BITS
*/
int execv(__const_char_ptr64, char *[]);

/*
** Change default to 64-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size 64
#endif

/*
** The following functions have interfaces of XXX, _XXX32,
** and _XXX64.
**
** The function strcat has two interfaces because the return
** argument is a pointer that is relative to the first argument.
**
** The malloc function returns either a 32-bit or a 64-bit
** memory address.
*/
#if __INITIAL_POINTER_SIZE == 32
# pragma __pointer_size 32
#endif

void *malloc(size_t __size);
char *strcat(char * __s1, __const_char_ptr64 __s2);

#if __INITIAL_POINTER_SIZE == 32
# pragma __pointer_size 64
#endif

#if __INITIAL_POINTER_SIZE && __VMS_VER >= 70000000
# pragma __pointer_size 32
void *_malloc32(size_t);
char *_strcat32(char * __s1, __const_char_ptr64 __s2);
# pragma __pointer_size 64
void *_malloc64(size_t);
char *_strcat64(char * __s1, const char * __s2);
#endif

/*
** FUNCTIONS THAT SUPPORT 64-BIT POINTERS
*/
void free(void * __ptr);
int rand(void);
size_t strlen(const char * __s);
__char_ptr32 strerror(int __errnum);

/*
** Restore the user's pointer context.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif

#endif /* HEADER_LOADED */
2. Add the following line of code to the top of the module: #include <wide_types.src>.
3. Change the declaration of the function to accept a __wide_const_char_ptr parameter instead of the previous const char * parameter.
4. Visually follow this argument through the code, looking for assignment statements. This particular function would be a simple loop. If local variables store this pointer, they must also be declared as __wide_const_char_ptr.
5. Compile the source code using the directive /warn=enable=maylosedata to have the compiler help detect pointer truncation.
6. Add a new test to the test system to exercise 64-bit pointers.
Restricting execv from High Memory
Examination of the execv function prototype showed that this function receives two arguments. The first argument is a pointer to the name of the file to start. The second argument represents the argv array that is to be passed to the child process. This array of pointers to null-terminated strings ends with a NULL pointer.
Initially, the execv function was to have had two implementations. The parameters passed to the execv function are used as the parameters to the main function of the child process being started. Because no assumptions could be made about that child process (in terms of support for 64-bit pointers), these parameters are restricted to low memory addresses.
To illustrate that the argv passing was a restriction, we place that prototype into the section "Functions restricted from 64 bits" of <header.h>. The first argument, the name of the file, did not need to have this restriction. The section "Create 64-bit header file typedefs" was enhanced to add the definition of __const_char_ptr64, which allows the prototypes to define a 64-bit pointer to constant characters while in either 32-bit or 64-bit context.
Returning a Relative Pointer in strcat<br />
The strcat function returns a pointer relative to its first<br />
argument. We looked at this function and determined<br />
that it required two entry points. In addition, we<br />
widened the second parameter, which is the address of<br />
the string to concatenate to the first, to allow the<br />
application to concatenate a 64-bit string to a 32-bit<br />
string without source code changes.<br />
Digital Technical Journal Vol. 8 No. 2 1996 93
Figure 5 shows the changes made to support functions<br />
that have pointer-size-specific entry points. The<br />
prototypes of functions XXX, _XXX32, and _XXX64<br />
begin in 64-bit pointer-size mode. Since the unmodified<br />
function name (strcat, XXX) is to be in the pointer<br />
size specified by the /pointer_size qualifier, the<br />
pointer size is changed from 64 bits to 32 bits if and<br />
only if the user has specified /pointer_size=32. At this<br />
point, we are not certain of the pointer size in effect.<br />
We know only that the size is the same as the size of<br />
the qualifier. The second argument to strcat uses the<br />
__const_char_ptr64 typedef in case we are in 32-bit<br />
pointer mode. Notice the declaration of _strcat64<br />
does not use this typedef because we are guaranteed<br />
to be in 64-bit pointer context. Figure 6 shows the<br />
implementation of both the 32-bit and the 64-bit<br />
strcat functions.<br />
The 64-bit malloc Function<br />
The implementation of multiple entry points was discussed<br />
and demonstrated in the strcat implementation.<br />
Although multiple entry points are typically added to<br />
avoid truncating pointers, functions such as memory<br />
allocation routines have newly defined behavior.<br />
The functions decc$malloc and decc$_malloc64<br />
use new support provided by the OpenVMS Alpha<br />
operating system for allocating, extending, and freeing<br />
64-bit virtual memory. The C run-time library utilizes<br />
this new functionality through the LIBRTL entry<br />
points. The LIBRTL group added new entry points for<br />
each of the existing memory management functions.<br />
The LIBRTL includes an additional second entry<br />
point for the free function. Since our implementation<br />
of the free function simply widens the pointer, we end<br />
up with a single C run-time library function that must<br />
choose which LIBRTL function to call.<br />
/*<br />
** STRCAT / _STRCAT64<br />
**<br />
** The 'strcat' function concatenates 's2', including the<br />
** terminating null character, to the end of 's1'.<br />
*/<br />
<br />
int free(__wide_void_ptr ptr) {<br />
    if (!(C$$IS_SHORT_ADDR(ptr)))<br />
        return (c$$_free64(ptr));<br />
    else<br />
        return (c$$_free32((void *) ptr));<br />
}<br />
<br />
Concluding Remarks<br />
The project took approximately seven person-months<br />
to complete. The work involved two months to determine<br />
what we wanted to do, one month to figure out<br />
how we were going to do it, and four person-months<br />
to modify, document, and test the software.<br />
During the initial two months, the technical leaders<br />
met on a weekly basis and discussed the overall<br />
approach to adding 64-bit pointers to the OpenVMS<br />
environment. Since I was the technical lead for the C<br />
run-time library project, this initial phase occupied<br />
between 25 and 50 percent of my time.<br />
The one month of detailed analysis and design consumed<br />
more than 90 percent of my time and resulted<br />
in a detailed document of approximately 100 pages.<br />
The document covered each of the 50 header files and<br />
500 function interfaces. The functions were grouped<br />
by type, based on the amount of work required to<br />
support 64-bit pointers.<br />
The first month of implementation occupied nearly<br />
all of my time, as I made several false starts. Once I<br />
worked out the final implementation technique, I<br />
completed at least two of each type of work. As coding<br />
deadlines approached, I taught two other engineers on<br />
my team how to add 64-bit pointer support, pointing<br />
out those functions already completed for reference.<br />
They came up to speed within one week. Together, we<br />
completed the work during the final month of the<br />
project.<br />
__wide_char_ptr _strcat64 (__wide_char_ptr s1, __wide_const_char_ptr s2)<br />
{<br />
    (void) _memcpy64((s1 + strlen(s1)), s2, (strlen(s2) + 1));<br />
    return (s1);<br />
}<br />
<br />
char * strcat32
Acknowledgments<br />
The author would like to acknowledge the others who<br />
contributed to the success of the C run-time library<br />
project. The engineers who helped with various<br />
aspects of the analysis, design, and implementation<br />
were Sandra Whitman, Brian McCarthy, Greg Tarsa,<br />
Marc Noel, Boris Gubenko, and Ken Cowan. Our<br />
writer, John Paolillo, worked countless hours documenting<br />
the changes we made to the library.<br />
References<br />
1. M. Harvey and L. Szubowicz, "Extending OpenVMS<br />
for 64-bit Addressable Virtual Memory," Digital<br />
Technical Journal, vol. 8, no. 2 (1996, this issue):<br />
57-71.<br />
2. T. Benson, K. Noel, and R. Peterson, "The OpenVMS<br />
Mixed Pointer Size Environment," Digital Technical<br />
Journal, vol. 8, no. 2 (1996, this issue): 72-82.<br />
3. DEC C User's Guide for OpenVMS Systems (Maynard,<br />
Mass.: Digital Equipment Corporation, 1995).<br />
4. DEC C Runtime Library Reference Manual for<br />
OpenVMS Systems (Maynard, Mass.: Digital Equipment<br />
Corporation, Order No. AA-PUNEE-TK, 1995).<br />
5. OpenVMS Calling Standard (Maynard, Mass.: Digital<br />
Equipment Corporation, Order No. AA-QSBRA-TE,<br />
1995).<br />
Biography<br />
Duane A. Smith<br />
As a consulting software engineer, Duane Smith is currently<br />
architect and project leader of the C run-time library for<br />
the OpenVMS VAX and Alpha platforms. He joined Digital<br />
in 1981 and has worked on a variety of projects, including<br />
the A-to-Z Datab
Building a High-performance<br />
Message-passing System for<br />
MEMORY CHANNEL Clusters<br />
The new MEMORY CHANNEL for PCI cluster<br />
interconnect technology developed by Digital<br />
(based on technology from Encore Computer<br />
Corporation) dramatically reduces the overhead<br />
involved in intermachine communication.<br />
Digital has designed a software system,<br />
the TruCluster MEMORY CHANNEL Software version<br />
1.4 product, that provides fast user-level<br />
access to the MEMORY CHANNEL network and<br />
can be used to implement a form of distributed<br />
shared memory. Using this product, Digital has<br />
built a low-level message-passing system that<br />
reduces the communications latency in a MEMORY<br />
CHANNEL cluster to less than 10 microseconds.<br />
This system can, in turn, be used to easily build<br />
the communications libraries that programmers<br />
use to parallelize scientific codes. Digital has<br />
demonstrated the successful use of this message-passing<br />
system by developing implementations<br />
of two of the most popular of these libraries,<br />
Parallel Virtual Machine (PVM) and Message<br />
Passing Interface (MPI).<br />
James V. Lawton<br />
John J. Brosnan<br />
Morgan P. Doyle<br />
Seosamh D. Ó Riordáin<br />
Timothy G. Reddin<br />
During the last few years, significant research and<br />
development has been undertaken in both academia<br />
and industry in an effort to reduce the cost of high-performance<br />
computing (HPC). The method most<br />
frequently used was to build parallel systems out of<br />
clusters of commodity workstations or servers that<br />
could be used as a virtual supercomputer.1 The motivation<br />
for this work was the tremendous gains that<br />
have been achieved in reduced instruction set computer<br />
(RISC) microprocessor performance during the<br />
last decade. Indeed, processor performance in today's<br />
workstations and servers often exceeds that of the individual<br />
processors in a tightly coupled supercomputer.<br />
However, traditional local area network (LAN) performance<br />
has not kept pace with microprocessor<br />
performance. LANs, such as fiber distributed data<br />
interface (FDDI), offer reasonable bandwidth, but since<br />
communication is generally carried out by means of<br />
traditional protocol stacks such as the user datagram<br />
protocol/internet protocol (UDP/IP) or the transmission<br />
control protocol/internet protocol (TCP/IP),<br />
software overhead is a major factor in message<br />
transfer time.2 This software overhead is not reduced<br />
by building faster LAN network hardware. Rather, a<br />
new approach is needed, one that bypasses the protocol<br />
stack while preserving sequencing, error detection,<br />
and protection.<br />
Much current research is devoted to reducing this<br />
communications overhead using specialized hardware<br />
and software. To this
Data may then be transferred between the machines<br />
using simple memory read and write operations, with<br />
no software overhead, essentially utilizing the full performance<br />
of the hardware. This approach is similar to<br />
the one used in the Princeton SHRIMP project, where<br />
this process is described as Virtual Memory-Mapped<br />
Communication (VMMC).7-10<br />
Figure 1 shows the relationship between the various<br />
components of our message-passing system. The first<br />
phase of our work involved designing a programming<br />
library and associated kernel components to provide<br />
protected, unprivileged access to the MEMORY<br />
CHANNEL network. Our objective in creating this<br />
library was to provide a facility much like the standard<br />
System V interprocess communication (IPC) shared<br />
memory functions available in UNIX implementations.<br />
Programmers could use the library to set up operations<br />
over the MEMORY CHANNEL interconnect, but they<br />
would not need to use the library functions for data<br />
transfer. In this way, performance could be maximized.<br />
This product, the TruCluster MEMORY CHANNEL<br />
Software, provides programmers with a simple, high-performance<br />
mechanism for building parallel systems.<br />
TruCluster MEMORY CHANNEL Software delivers<br />
the performance available from the MEMORY<br />
CHANNEL network directly to user applications but<br />
requires a programming style that is different from<br />
that required for shared memory. This different programming<br />
style is necessary because of the different<br />
access characteristics between local memory and memory<br />
on a remote node connected through a MEMORY<br />
CHANNEL network. To make programming with the<br />
MEMORY CHANNEL technology relatively simple<br />
while continuing to deliver the hardware performance,<br />
we built a library of primitive communications functions.<br />
This system, called Universal Message Passing<br />
(UMP), hides the details of MEMORY CHANNEL<br />
operations from the programmer and operates seamlessly<br />
over two transports (initially): shared memory<br />
and the MEMORY CHANNEL interconnect. This<br />
allows seamless growth from a symmetric multiprocessor<br />
(SMP) to a full MEMORY CHANNEL cluster.<br />
Development can be done on a workstation, while<br />
production work is done on the cluster. The UMP<br />
Figure 1<br />
Message-passing System Architecture<br />
[Figure: the PARALLEL APPLICATION sits on PVM or MPI, which sit on UMP;<br />
UMP runs over SHARED MEMORY, TRUCLUSTER MEMORY CHANNEL SOFTWARE,<br />
or OTHER TRANSPORT.]<br />
layer was designed from the beginning with performance<br />
considerations in mind, particularly with<br />
respect to minimizing the overhead involved in sending<br />
small messages.<br />
Two distributed memory models are predominantly<br />
used in high-performance computing today:<br />
1. Data parallel, which is used in High Performance<br />
Fortran (HPF).11 With this model, the programmer<br />
uses parallel language constructs to indicate to the<br />
compiler how to distribute data and what operations<br />
should be performed on it. The problem is<br />
assumed to be regular so that the compiler can use<br />
one of a number of data distribution algorithms.<br />
2. Message passing, which is used in Parallel Virtual<br />
Machine (PVM) and Message Passing Interface<br />
(MPI).14,15 In this approach, all messaging is performed<br />
explicitly, so the application programmer<br />
determines the data distribution algorithm, making<br />
this approach more suitable for irregular problems.<br />
It is not yet clear whether one of these approaches<br />
will predominate in the future or if both will continue<br />
to coexist. Digital has been working to provide competitive<br />
solutions for both approaches using MEMORY<br />
CHANNEL clusters. Digital's HPF work has been<br />
described in a previous issue of the Journal.16,17 This<br />
paper is primarily concerned with message passing.<br />
Building on the UMP layer, we constructed implementations<br />
of two common message-passing systems.<br />
The first, PVM, is a de facto standard
depending on the bandwidth of the I/O subsystem<br />
into which the adapter is plugged. Although the current<br />
MEMORY CHANNEL network is a shared bus, the<br />
plan for the next generation is to utilize a switched<br />
technology that will increase the aggregate bandwidth<br />
of the network beyond that of currently available<br />
switched LAN technologies. The latency (time to send<br />
a minimum-length message one way between two<br />
processes) is less than 5 microseconds (µs). The<br />
MEMORY CHANNEL network provides a communications<br />
medium with a low bit-error rate, on the order of<br />
10^-15. The probability of undetected errors occurring<br />
is so small (on the order of the undetected error rate of<br />
CPUs and memory subsystems) that it is essentially<br />
negligible. A MEMORY CHANNEL cluster consists of<br />
one or more PCI MEMORY CHANNEL adapters on<br />
each node and a hub connecting up to eight nodes.<br />
The MEMORY CHANNEL cluster supports a<br />
512-MB global address space into which each adapter,<br />
under operating system control, can map regions of<br />
local virtual address space. Figure 2 illustrates the<br />
MEMORY CHANNEL operation. Figure 2a shows<br />
transmission, and Figure 2b shows reception. A page<br />
table entry (PTE) is an entry in the system virtual-to-physical<br />
map that translates the virtual address of<br />
a page to the corresponding physical address. The<br />
MEMORY CHANNEL adapter contains a page control<br />
table (PCT) that indicates the pages mapped for reception.<br />
Thus, when a MEMORY CHANNEL network packet<br />
is received that corresponds to a page that is mapped<br />
for reception, the data is transferred directly to the<br />
appropriate page of physical memory by the system's<br />
DMA engine. In addition, any cache lines that refer to<br />
the updated page are invalidated.<br />
Subsequently, any writes to the mapped page of virtual<br />
memory on the first node result in corresponding<br />
writes to physical memory on the second node. This<br />
means th
Figure 2<br />
MEMORY CHANNEL Operation<br />
[Figure: (a) Transmission: a process executes a store instruction to a virtual<br />
address in the transmit region; the page table entry performs virtual-to-physical<br />
address translation to a physical address in PCI I/O space, and the MEMORY<br />
CHANNEL adapter's page control table forwards the data to a MEMORY CHANNEL<br />
address. (b) Reception: a process executes a load instruction from a virtual<br />
address in the receive region; incoming data is written through the page control<br />
table to physical memory, affected cache lines are invalidated, and the data is<br />
returned from physical memory.]<br />
<br />
Figure 3<br />
TruCluster MEMORY CHANNEL Software Architecture<br />
[Figure: in user space, applications call the TruCluster MEMORY CHANNEL API<br />
library; in kernel space, the library communicates with the TruCluster MEMORY<br />
CHANNEL kernel subsystem, which uses the low-level kernel MEMORY CHANNEL<br />
functions.]<br />
a set of simple building blocks that are analogous to the<br />
System V IPC facility in most UNIX implementations.<br />
The advantage is that while a very simple interface is<br />
provided initially, the interface can later be extended as<br />
required, without losing compatibility with applications<br />
based on the initial implementation. Table 1 details the<br />
MEMORY CHANNEL API library functions that the<br />
product provides. An important feature to note is that<br />
when a MEMORY CHANNEL region is allocated using<br />
TruCluster MEMORY CHANNEL Software, a key is<br />
specified that uniquely identifies this region in the cluster.<br />
Other processes anywhere in the cluster can attach<br />
to the same region using the same key; the collection of<br />
keys provides a clusterwide namespace.<br />
The MEMORY CHANNEL API library communicates<br />
with the kernel subsystem using kmodcall, a simple<br />
generic system call used to manage kernel<br />
subsystems. The library function constructs a command<br />
block containing the type of command (i.e.,<br />
Table 1<br />
TruCluster MEMORY CHANNEL API Library Functions<br />
<br />
imc_asalloc: Allocates a region of MEMORY CHANNEL address space of a specified size and permissions and<br />
with a user-supplied key; the ability to specify a key allows other cluster processes to rendezvous<br />
at the same region. The function returns to the user a clusterwide ID for this region.<br />
imc_asattach: Attaches an allocated MEMORY CHANNEL region to a process virtual address space. A region<br />
can be attached for transmission or reception, and in shared or exclusive mode. The user can also<br />
request that the page be attached in loopback mode, i.e., any writes will be reflected back to the<br />
current node so that if an appropriate reception mapping is in effect, the result of the writes can<br />
be seen locally. The virtual address of the mapped region is assigned by the kernel and returned<br />
to the user.<br />
imc_asdetach: Detaches an allocated MEMORY CHANNEL region from a process virtual address space.<br />
imc_asdealloc: Deallocates a region of MEMORY CHANNEL address space with a specified ID.<br />
imc_lkalloc: Allocates a set of clusterwide spinlocks. The user can specify a key and the required permissions.<br />
Normally, if a spinlock set exists, then this function just returns the ID of that lock set; otherwise<br />
it creates the set. If the user specifies that creation is to be exclusive, then failure will result if the<br />
spinlock set exists already. In addition, by specifying the IMC_CREATOR flag, the first spinlock in<br />
the set will be acquired. These two features prevent the occurrence of races in the allocation of<br />
spinlock sets across the cluster.<br />
imc_lkacquire: Acquires (locks) a spinlock in a specified spinlock set.<br />
imc_lkrelease: Releases (unlocks) a spinlock in a specified spinlock set.<br />
imc_lkdealloc: Deallocates a set of spinlocks.<br />
imc_rderrcnt: Reads the clusterwide MEMORY CHANNEL error count and returns the value to the user. This<br />
value is not guaranteed to be up-to-date for all nodes in the cluster. It can be used to construct<br />
an application-specific error-detection scheme.<br />
imc_ckerrcnt: Checks for outstanding MEMORY CHANNEL errors, i.e., errors that have not yet been reflected in<br />
the clusterwide MEMORY CHANNEL error count returned by imc_rderrcnt. This function checks<br />
each node in the cluster for any outstanding errors and updates the global error count accordingly.<br />
imc_kill: Sends a UNIX signal to a specified process on another node in the cluster.<br />
imc_gethosts: Returns the number of nodes currently in the cluster and their host names.<br />
which library function has been called) and associated information, such<br />
as virtual addresses.<br />
Similarly, for each set of spinlocks allocated on the<br />
cluster there is a cluster lock descriptor (CLD) that<br />
contains information describing the spinlock set,<br />
including its clusterwide lock ID, the number of spinlocks<br />
in the set, the key, permissions, creation time,<br />
and the UID and GID of the creating process. For an<br />
individual CLD, there is a host lock descriptor (HLD)<br />
for each node that is using the spinlock set. The HLD<br />
contains the cluster ID of the node and other node-specific<br />
information about the spinlock set. For a specific<br />
HLD, there is a process lock descriptor (PLD) for<br />
each process on that node that is using the spinlock<br />
Figure 4<br />
TruCluster MEMORY CHANNEL Kernel Data Structures<br />
[Figure: (a) Regions: cluster region descriptors (CRD 0, CRD 1) each point to a<br />
list of host region descriptors (HRDs) for the hosts mapping that region. (b)<br />
Locks: cluster lock descriptors (CLDs) point to host lock descriptors (HLDs),<br />
which in turn point to process lock descriptors (PLDs), e.g., PLD 0: PID 3346.<br />
Key: CLD = cluster lock descriptor; CRD = cluster region descriptor; HLD = host<br />
lock descriptor; HRD = host region descriptor; PLD = process lock descriptor;<br />
PRD = process region descriptor.]<br />
set. The PLD contains the PID of the process that created<br />
the spinlock set and any process-specific information<br />
about the spinlock set.<br />
All these cluster data structures are illustrated in<br />
Figure 4, in which two regions of MEMORY CHANNEL space have been<br />
allocated on the cluster. The first region, with descriptor<br />
CRD 0, is mapped on three nodes: host 4, host 6,<br />
and host 3. The diagram
Figure 5<br />
Command Relay Operation<br />
[Figure: a command invoked on host A is relayed to the other hosts in the<br />
cluster.]<br />
Given the usefulness of coherent allocations, it may<br />
seem unusual that we made this feature an option<br />
rather than the default. There are several reasons for<br />
this. With coherent allocations, the associated physical<br />
memory becomes nonpageable on all nodes within the<br />
cluster,
Performance<br />
The performance of TruCluster MEMORY CHANNEL<br />
Software on a pair of AlphaServer 4100 5/300<br />
machines is presented in Table 2. These measurements<br />
were made using version 1.5 MEMORY CHANNEL<br />
adapters. The bandwidth (64 MB/s)
extern long asm(const char *, ...);<br />
#pragma intrinsic (asm)<br />
#define mb() asm("mb")<br />
<br />
main()<br />
{<br />
    int status, i, locks = 4, temp, errors;<br />
    imc_asid_t region_id;                /* MC region ID */<br />
    imc_lkid_t lock_id;                  /* MC spinlock set ID */<br />
    typedef struct {                     /* Shared data structure */<br />
        volatile int processes;<br />
        volatile int pattern[2047];<br />
    } shared_region;<br />
    shared_region *region_read, *region_write;<br />
    caddr_t read_ptr = 0, write_ptr = 0;<br />
<br />
    /* Allocate a region of coherent MC address space and attach to */<br />
    /* process VA */<br />
    imc_asalloc(123, 8192, IMC_URW, IMC_COHERENT, &region_id);<br />
    imc_asattach(region_id, IMC_TRANSMIT, IMC_SHARED, IMC_LOOPBACK, &write_ptr);<br />
    imc_asattach(region_id, IMC_RECEIVE, IMC_SHARED, 0, &read_ptr);<br />
    region_write = (shared_region *)write_ptr;<br />
    region_read = (shared_region *)read_ptr;<br />
<br />
    /* Allocate a set of spinlocks and atomically acquire the first lock */<br />
    status = imc_lkalloc(456, &locks, IMC_LKU, IMC_CREATOR, &lock_id);<br />
    if (status == IMC_SUCCESS) {<br />
        region_write->processes = 0;     /* Initialize the global region */<br />
        errors = imc_rderrcnt();<br />
        do {<br />
            for (i = 0; i < 2047; i++)<br />
                region_write->pattern[i] = i;<br />
            i--;<br />
            mb();<br />
        } while (imc_ckerrcnt(&errors) || region_read->pattern[i] != i);<br />
        imc_lkrelease(lock_id, 0);<br />
    }<br />
    else if (status == IMC_EXISTS) {<br />
        imc_lkalloc(456, &locks, IMC_LKU, 0, &lock_id);<br />
        imc_lkacquire(lock_id, 0);<br />
        temp = region_read->processes + 1;  /* Increment the process counter */<br />
        errors = imc_rderrcnt();<br />
        do {<br />
            region_write->processes = temp;<br />
            mb();<br />
        } while (imc_ckerrcnt(&errors) || region_read->processes != temp);<br />
        imc_lkrelease(lock_id, 0);<br />
    }<br />
<br />
    temp = region_read->processes - 1;   /* Decrement the process counter */<br />
    errors = imc_rderrcnt();<br />
    do {<br />
        region_write->processes = temp;<br />
        mb();<br />
    } while (imc_ckerrcnt(&errors) || region_read->processes != temp);<br />
<br />
    imc_lkdealloc(lock_id);              /* Deallocate spinlock set */<br />
    imc_asdetach(region_id);             /* Detach shared region */<br />
    imc_asdealloc(region_id);            /* Deallocate MC address space */<br />
}<br />
<br />
Figure 6<br />
Programming with TruCluster MEMORY CHANNEL Software<br />
These goals placed some important constraints on<br />
the architecture of UMP, particularly with regard to<br />
performance. This meant that design decisions had<br />
to be constantly evaluated in terms of their performance<br />
impact. The initial design decision was to use a dedicated<br />
point-to-point circular buffer between every pair<br />
of processes. These buffers use producer and consumer<br />
indexes to control the reading and writing of buffer<br />
contents. The indexes can be modified only by the<br />
consumer and producer tasks and allow fully lockless<br />
operation of the buffers. Removing lock requirements<br />
eliminates not only the software costs associated with<br />
lock manipulation (in the initial implementation of<br />
TruCluster MEMORY CHANNEL Software, acquiring<br />
and releasing an uncontested spinlock takes approximately<br />
130 µs and 120 µs, respectively) but also the<br />
impact on processor performance associated with<br />
Load-locked/Store-conditional instruction sequences.<br />
Although this buffering style eliminates lock manipulation<br />
costs, it results in a quadratic demand for<br />
storage and can limit scalability. If there are N processes<br />
communicating using this method, N2<br />
buffers are required for full mesh communication.<br />
MEMORY CHANNEL address space is a relatively<br />
scarce resource that needs to be carefully husbanded.<br />
To manage the demand on cluster resources as fairly as<br />
possible, we decided to do the following:<br />
Figure 7<br />
Cluster Communication Using UMP<br />
• Allocate buffers sparsely, i.e., as required up to<br />
some default limit. Full N2 allocation would still be<br />
possible if the user increased the number of buffers.<br />
• Make the size of the buffers configurable.<br />
• Use lock-controlled single-writer, multiple-reader<br />
buffers to handle both the overflow from the N2<br />
buffer and fast multicast. One of these buffers,<br />
called outbufs, would be assigned to each process<br />
using UMP upon initialization.<br />
Note that while the channel buffers are logically<br />
point-to-point, they may be implemented physically as<br />
either point-to-point or broadcast. For example, in the<br />
first version of UMP, we used broadcast MEMORY<br />
CHANNEL mappings for the sake of simplicity. We<br />
with task 2 using UMP channel buffers in shared memory,<br />
shown as 1->2 and 2->1. Task 1 and task 3 communicate<br />
using UMP channel buffers in MEMORY<br />
CHANNEL space, shown as 1->3 and 3->1. Task 3 is<br />
reading a message from task 1 using an outbuf. The<br />
outbuf can be written only by task 1 but is mapped for<br />
transmission to all other cluster members. On node 2,<br />
the same region is mapped for reception. Access to<br />
each outbuf is controlled by a unique cluster spinlock.<br />
Our rationale for taking this approach is that a short<br />
software path is more appropriate for small messages<br />
because overhead dominates message transfer time,<br />
whereas the overhead of lock manipulation is a small<br />
component of message transfer time for large messages.<br />
We felt that this approach helped to control the<br />
use of cluster resources and maintained the lowest possible<br />
latency for short messages yet still accommodated<br />
large messages. Note that outbufs are still fixed-size<br />
buffers but are generally configured to be much larger<br />
than the N2 buffers.<br />
This approach worked for PVM because its message<br />
transfer semantics make it acceptable to fail a message<br />
send request due to buffer space restrictions<br />
(e.g., if both the N2 buffer and the outbuf are full).<br />
When we analyzed the requirements for MPI, however,<br />
we found that this approach was not possible. For<br />
this reason, we changed the design to use only the N2<br />
buffers. Instead of writing the message as a single<br />
operation, the message is streamed through the buffer<br />
in a series of fragments. Not only does this approach<br />
support arbitrarily large messages, but it also improves<br />
message bandwidth by allowing (and, for messages<br />
exceeding the available buffer capacity, requiring) the<br />
overlapped writing and reading of the message.<br />
Deadlock is avoided by using a background thread<br />
to write the message. Since overflow is now handled<br />
using the streaming N2 buffers, outbufs were not necessary<br />
to achieve the required level of performance for<br />
large messages and were not implemented. Outbufs<br />
are retained in the design to provide fast multicast<br />
messaging, even though in the current implementation<br />
they are not yet supported.<br />
Achieving the performance goals set for UMP was<br />
not easy. In addition to the buffer architecture<br />
described earlier, several other techniques were used.<br />
• No syscalls were allowed anywhere in the UMP<br />
messaging functions, so UMP runs completely in<br />
user space.<br />
• Calls to library routines and any expensive arithmetic<br />
operations were minimized.<br />
• Global state was cached in local memory wherever<br />
possible.<br />
• Careful attention was paid to data alignment issues,<br />
and all transfers are multiples of 32-bit data.<br />
At the programmer's level, UMP operation is based on duplex point-to-point links called<br />
channels, which correspond to the N2 buffers already described. A channel is a pair of<br />
unidirectional buffers used to provide two-way communication between a pair of process<br />
endpoints anywhere in the cluster. UMP provides functions to open a channel between a pair<br />
of tasks. While the resources are allocated by the first task to open the channel, the<br />
connection is not complete until the second task also opens the same channel. Once a channel<br />
has been opened by both sides, UMP functions can be used to send and receive messages on<br />
that channel. It is possible to direct UMP to use shared memory or MEMORY CHANNEL address<br />
space for the channel buffers, depending on the relative location of the associated processes.<br />
In addition, UMP provides a function to wait on any event (e.g., arrival of a message,<br />
creation or deletion of a channel). In total, UMP provides a dozen functions, which are<br />
listed in Table 3. Most of the functions relate to initialization, shutdown, and<br />
miscellaneous operations. Three functions establish the channel connection, and three<br />
functions perform all message communications.<br />
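The channel abstraction, a pair of unidirectional buffers giving duplex communication, can be sketched as follows. This is a toy model, not UMP: each one-way buffer holds a single message (the real buffers are larger and support streaming), and the two endpoints are numbered 0 and 1, each sending on its own direction and receiving on the other.

```c
#include <assert.h>
#include <string.h>

#define SLOT 128

/* One-way buffer: a single message slot keeps the sketch short. */
struct oneway { char data[SLOT]; int len; };

/* A channel is a pair of one-way buffers providing two-way
 * communication between endpoints 0 and 1. */
struct channel { struct oneway dir[2]; };

/* Endpoint `ep` sends on direction `ep`. */
static void chan_send(struct channel *c, int ep, const char *msg, int len) {
    memcpy(c->dir[ep].data, msg, len);
    c->dir[ep].len = len;
}

/* Endpoint `ep` receives from the opposite direction. */
static int chan_recv(struct channel *c, int ep, char *out) {
    struct oneway *b = &c->dir[1 - ep];
    memcpy(out, b->data, b->len);
    return b->len;
}
```

Because each direction has its own buffer, the two endpoints never contend for the same memory when sending simultaneously, which is the property the paper's N2 buffers rely on.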
UMP channels provide guaranteed error detection but not recovery. Through the use of<br />
TruCluster MEMORY CHANNEL Software error-checking routines, we were able to provide efficient<br />
error detection in UMP. We decided to let the higher layers implement error recovery. As a<br />
result, designers of higher layers can control the performance penalty they incur by<br />
specifying their own error recovery mechanisms, or, since reliability is high, can adopt<br />
a fail-on-error strategy.<br />
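The two higher-layer policies can be contrasted with a small sketch. Everything here is hypothetical (the function names and error-injection state are invented for illustration): the transport reports an error code but never retries, and the layer above chooses either to retry or to fail fast.

```c
#include <assert.h>

/* Stand-in for an error-detecting send: fails while the countdown
 * is positive, then succeeds. (Injected errors, for illustration.) */
typedef int (*send_fn)(int *fail_countdown);

static int flaky_send(int *fail_countdown) {
    if (*fail_countdown > 0) { (*fail_countdown)--; return -1; } /* error detected */
    return 0;                                                    /* delivered */
}

/* Policy 1: the higher layer supplies its own recovery by retrying. */
static int send_with_retry(send_fn f, int *state, int max_tries) {
    for (int i = 0; i < max_tries; i++)
        if (f(state) == 0) return 0;
    return -1;
}

/* Policy 2: fail-on-error; since errors are rare, just propagate. */
static int send_fail_fast(send_fn f, int *state) {
    return f(state);
}
```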
Performance<br />
UMP avoids any calls to the kernel and any copying of data across the kernel boundary.<br />
Messages are written directly into the reception buffer of the destination channel. Data is<br />
copied once from the user's buffer to physical memory on the destination node by the sending<br />
process. The receiving process then copies the data from local physical memory to the<br />
destination user's buffer. By comparison, the number of copies involved in a similar<br />
operation over a LAN using sockets is greater. In this case, the data has to be copied into<br />
the kernel, where the network driver uses DMA to copy it again into the memory of the network<br />
adapter. At this point the data is transmitted onto the LAN.<br />
The first version of UMP used one large shared region of MEMORY CHANNEL space to contain its<br />
channel buffers and a broadcast mapping to transmit this simultaneously to all nodes in the<br />
cluster. This version of UMP also used loopback to reflect transmissions back to the<br />
corresponding receive region on the sending node, which resulted in a loss of available<br />
bandwidth. Using our AlphaServer 2100 4/190 development machines, we measured<br />
Digital Technical Journal Vol. 8 No. 2 1996 107<br />
Table 3<br />
UMP API Functions<br />
ump_init: Initializes UMP and allocates the necessary resources.<br />
ump_exit: Shuts down UMP and deallocates any resources used by the calling process.<br />
ump_open: Opens a duplex channel between two endpoints over a given transport (shared memory or MEMORY CHANNEL). Channel endpoints are identified by user-supplied, 64-bit integer handles.<br />
ump_close: Closes a specified UMP channel, deallocating all resources assigned to that channel as necessary.<br />
ump_listen: Registers an endpoint for a channel over a specified transport. This can be used by a server process to wait on connections from clients with unknown handles. This function returns immediately, but the channel is created only when another task opens the channel. This can be detected using ump_wait.<br />
ump_wait: Waits for a UMP event to occur, either on one specified channel to this task or on all channels to this task.<br />
ump_read: Reads a message from a specified channel.<br />
ump_write: Writes a message to a specified channel. This function is blocking, i.e., it does not return until the complete message has been written to the channel.<br />
ump_nbread: Starts reading a message from a channel, i.e., it returns as soon as a specified amount of the message has been received, but not necessarily all the message.<br />
ump_nbwrite: Starts writing a message to a specified channel, i.e., it returns as soon as the write has started. A background thread will continue writing the message until it is completely transmitted.<br />
ump_mcast: Writes a message to a specified list of channels.<br />
ump_info: Returns UMP configuration and status information.<br />
• Latency: 11 μs (MEMORY CHANNEL), 4 μs (shared memory)<br />
• Bandwidth: 16 MB/s (MEMORY CHANNEL), 30 MB/s (shared memory)<br />
To increase bandwidth, we modified UMP to use transmit-only regions for its channel buffers,<br />
thus eliminating loopback. The performance measured for the revised UMP using the same<br />
machines was<br />
• Latency: 9 μs (MEMORY CHANNEL), 3 μs (shared memory)<br />
• Bandwidth: 23 MB/s (MEMORY CHANNEL), 32 MB/s (shared memory)<br />
Figure 8 shows the message transfer time and Figure 9 shows the bandwidth for various message<br />
sizes for the revised version of UMP using both blocking and nonblocking writes over shared<br />
memory and the MEMORY CHANNEL network. Using newer AlphaServer 4100 5/300 machines, which<br />
have a faster I/O subsystem than the older machines, and version 1.5 MEMORY CHANNEL adapters,<br />
the measured latency is 5.8 μs (MEMORY CHANNEL), 2 μs (shared memory). The peak bandwidth<br />
achieved is 61 MB/s (MEMORY CHANNEL), 75 MB/s (shared memory). In the nonblocking cases, the<br />
buffer size used was 256 kilobytes (KB) for shared memory and 32 KB for MEMORY CHANNEL.<br />
Further work is under way to improve the performance using shared memory as the transport.<br />
This work is ...<br />
[Figure 8 is a log-log plot of message transfer time in microseconds against message size<br />
from 10 to 1,000,000 bytes, for the four cases below.]<br />
KEY:<br />
UMP BLOCKING (SHARED MEMORY)<br />
UMP BLOCKING (MEMORY CHANNEL)<br />
UMP NONBLOCKING (SHARED MEMORY)<br />
UMP NONBLOCKING (MEMORY CHANNEL)<br />
Figure 8<br />
UMP Communications Performance: Transfer Time<br />
[Figure 9 plots bandwidth in MB/s (0 to 50) against message size from 200,000 to 1,000,000<br />
bytes, for the four cases below.]<br />
KEY:<br />
UMP BLOCKING (SHARED MEMORY)<br />
UMP BLOCKING (MEMORY CHANNEL)<br />
UMP NONBLOCKING (SHARED MEMORY)<br />
UMP NONBLOCKING (MEMORY CHANNEL)<br />
Figure 9<br />
UMP Communications Performance: Bandwidth<br />
PVM when using networks like Ethernet or FDDI. The high cost of communications for these<br />
systems means that only the more coarse-grained parallel applications have demonstrated<br />
performance improvements as a result of parallelization using PVM. Using the MEMORY CHANNEL<br />
cluster technology described earlier, we have implemented an optimized PVM that offers low<br />
latency and high-bandwidth communications. The PVM library and daemon use UMP to provide<br />
seamless communications over the MEMORY CHANNEL cluster.<br />
When we began to develop PVM for MEMORY CHANNEL clusters, we had one overriding goal: to use<br />
the hardware performance the MEMORY CHANNEL interconnect offers to provide a PVM with<br />
industry-leading communications performance, specifically with regard to latency. Initially,<br />
we set a target latency for PVM of less than 15 μs using shared memory and less than 30 μs<br />
using the MEMORY CHANNEL transport.<br />
Our first task was to build a prototype using the public-domain PVM implementation. We used<br />
an early prototype of the MEMORY CHANNEL system jointly developed by Digital and Encore. The<br />
prototype had a hardware latency of 4 μs. We modified the shared-memory version of PVM to use<br />
the prototype hardware and achieved a PVM latency of 60 μs. Profiling and straightforward<br />
code analysis revealed that most of the overhead was caused by<br />
• PVM's support for heterogeneity ...<br />
Figure 10<br />
[The figure shows a MEMORY CHANNEL cluster of three hosts. Hosts 1 and 2 each run a PVM<br />
daemon and PVM application processes; host 3 runs a PVM daemon, a PVM application, and a PVM<br />
gateway with a PVM3 daemon interface. Every process is built on the PVM API library over<br />
UNIX and UMP. Labeled paths A through G illustrate the interactions listed below.]<br />
KEY:<br />
A: A PVM application on host 1 performs local control functions using UNIX signals.<br />
B: A PVM application on host 1 communicates with another PVM task on the same host using UMP (via shared memory).<br />
C: A PVM application on host 1 communicates with another PVM task on a different host in the cluster (host 2) using UMP (via MEMORY CHANNEL).<br />
D: A PVM application on host 1 requires a control function (e.g., a signal) to be executed on another host in the cluster (host 3); it sends a request to a PVM daemon on host 3.<br />
E: The PVM daemon on host 3 executes the control function.<br />
F: A PVM application on host 1 sends a message to a PVM task on a host outside the MEMORY CHANNEL cluster; the message is routed to the PVM gateway task on host 3.<br />
G: The PVM gateway translates the cluster message into a form compatible with the external PVM implementation and forwards the message to the external task via IP sockets.<br />
Digital PVM Architecture<br />
gateway daemon. The PVM API library is a complete rewrite of the standard PVM version 3.3<br />
API, with which full functional compatibility is maintained. Emphasis has been placed on<br />
optimizing the performance of the most frequently used code paths. In addition, all data<br />
structures and data transfers have been optimized for the Alpha architecture. As stated<br />
earlier, the amount of message passing between tasks and the local daemon has been minimized<br />
by performing most operations in-line and communicating with the daemon only when absolutely<br />
necessary. Intermediate buffers are used for copying data between the user buffers. This is<br />
necessary because of the semantics of PVM, which allow operations on buffer contents before<br />
and after a message has been sent. The one exception to this is pvm_psend; in this case,<br />
data is copied directly since the user is not allowed to modify the send buffer.<br />
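The reason the intermediate copy is needed can be shown with a small simulation. This is not the PVM library itself; the two functions below merely mimic the contrasting semantics: a pvm_send-style path must snapshot the user buffer, because PVM lets the caller modify it immediately after the call, while a pvm_psend-style path may read the buffer in place, because the caller promises not to touch it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simulated "message in flight". */
static char *msg_data;
static const char *inplace_ptr;
static int msg_len;

/* pvm_send-style: the library copies the user buffer into an
 * intermediate buffer, so later changes to the user buffer cannot
 * corrupt the message. (Simulation, not the PVM library.) */
static void buffered_send(const char *buf, int len) {
    msg_data = malloc(len);
    memcpy(msg_data, buf, len);
    msg_len = len;
}

/* pvm_psend-style: no intermediate copy; the library may read the
 * user buffer in place because the caller must not modify it. */
static void psend(const char *buf, int len) {
    inplace_ptr = buf;
    msg_len = len;
}
```

After buffered_send, overwriting the user buffer leaves the in-flight message intact; after psend, the in-flight message aliases the user buffer, which is exactly why PVM forbids modifying it.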
The purpose of our PVM daemon is different from that of the daemon in the standard PVM<br />
package. Our daemon is designed to relay commands between different nodes in the PVM<br />
cluster. It exists solely to<br />
contributions of each of the group members, etc. The group functions are implemented<br />
separately from the other PVM messaging functions. They use a separate global structure<br />
(the group table) to manage PVM group data. Access to the group table is controlled by<br />
locks. Unlike other PVM implementations, there is no PVM group server, since all group<br />
operations can manipulate the group table directly.<br />
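A lock-protected group table of this kind can be sketched as follows. The structure and names are illustrative, not Digital PVM's actual internals: every task calls group_join directly, a mutex serializes access, and the first member of a group creates its entry, so no central group-server process is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_GROUPS 8
#define NAME_LEN 32

/* One entry per dynamic group. */
struct group { char name[NAME_LEN]; int nmembers; };

static struct group table[MAX_GROUPS];
static int ngroups = 0;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Join a group, creating it on first use; returns the caller's
 * instance number within the group (0, 1, 2, ...). */
int group_join(const char *name) {
    pthread_mutex_lock(&table_lock);
    struct group *g = NULL;
    for (int i = 0; i < ngroups; i++)
        if (strcmp(table[i].name, name) == 0) { g = &table[i]; break; }
    if (!g) {                          /* first member creates the entry */
        g = &table[ngroups++];
        strncpy(g->name, name, NAME_LEN - 1);
        g->nmembers = 0;
    }
    int inst = g->nmembers++;
    pthread_mutex_unlock(&table_lock);
    return inst;
}
```

Because the table lives in memory every task can reach, a join is one locked table update rather than a round trip to a server process.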
Performance<br />
Table 4 compares the communications latency achieved by various PVM implementations. As the<br />
table indicates, the latency between two machines with Digital PVM over a MEMORY CHANNEL<br />
transport is much less than the latency of the public-domain PVM implementation over shared<br />
memory, which validates our approach of removing support for heterogeneity from the critical<br />
performance path. ... The figures for Digital PVM latency over the MEMORY CHANNEL transport<br />
given in Reference 6 were obtained using an earlier version of Digital PVM. Since those<br />
results were measured, latency has been halved, mostly due to improvements in UMP<br />
performance ...<br />
Figure 14<br />
[(a) Initial Prototype: the MPI PORTABLE API LIBRARY sits on the MPICH ABSTRACT DEVICE,<br />
which sits on the MPICH CHANNEL INTERFACE, which sits on UMP over the SHARED MEMORY and<br />
MEMORY CHANNEL transports; the MPICH abstract device interface separates the portable<br />
library from the layers below.]<br />
[(b) Version 1.0 Implementation: the MPI PORTABLE API LIBRARY sits on the MPICH ABSTRACT<br />
DEVICE FRONT END, which sits directly on UMP over the SHARED MEMORY and MEMORY CHANNEL<br />
transports.]<br />
Digital MPI Architecture<br />
using MEMORY CHANNEL. By comparison, the unmodified MPICH achieves a peak bandwidth of<br />
24 MB/s using shared memory and 5.5 MB/s using TCP/IP over an FDDI LAN.<br />
Figure 17 shows the speedup demonstrated by an MPI application. The application is the<br />
Accelerated Strategic Computing Initiative (ASCI) benchmark SPPM, which solves a<br />
three-dimensional gas dynamics problem on a uniform Cartesian mesh. The same code was run<br />
using both Digital MPI and MPICH using TCP/IP. The hardware configuration was a two-node<br />
MEMORY CHANNEL cluster of AlphaServer 8400 5/350 machines, each with six CPUs. Digital MPI<br />
used shared memory and MEMORY CHANNEL transports, whereas MPICH used the Ethernet LAN<br />
connecting the machines. The maximum speedup<br />
obtained using Digital MPI was approximately 7, whereas for MPICH the maximum speedup was<br />
approximately 1.6.<br />
Table 5<br />
MPI Latency Comparison<br />
MPICH 1.0.10: Sockets FDDI, DEC 3000/800, 350 μs<br />
MPICH 1.0.10: Shared Memory, AlphaServer 2100 4/233, 30 μs<br />
Digital MPI V1.0: MEMORY CHANNEL 1.0, AlphaServer 2100 4/233, 16 μs<br />
Digital MPI V1.0: MEMORY CHANNEL 1.5, AlphaServer 4100 5/300, 6.9 μs<br />
Digital MPI V1.0: Shared Memory, AlphaServer 2100 4/233, 9.5 μs<br />
Digital MPI V1.0: Shared Memory, AlphaServer 4100 5/300, 5.2 μs<br />
Future Work<br />
We intend to continue refining the components described in this paper. The major change<br />
envisioned regarding the TruCluster MEMORY CHANNEL Software product is the addition of<br />
user-space spin locks, which should significantly reduce ... the use of an intermediate<br />
buffer ...<br />
... scientific applications that utilizes a new network technology to bypass the software<br />
overhead that limits the applicability of traditional networks. The performance of this<br />
system has been proven to be on a par with that of current supercomputer technology and has<br />
been achieved using commodity technology developed for Digital's commercial cluster<br />
products. The paper demonstrates the suitability of the MEMORY CHANNEL technology as a<br />
communications medium for scalable system development.<br />
7. M. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proceedings of the Twenty-first Annual International Symposium on Computer Architecture (April 1994): 142-153.<br />
8. M. Blumrich et al., "Two Virtual Memory Mapped Network Interface Designs," Proceedings of the Hot Interconnects II Symposium, Palo Alto, Calif. (August 1994): 134-142.<br />
9. L. Iftode et al., "Improving Release-Consistent Shared Virtual ..."<br />
... "High Performance Fortran for Distributed-memory Systems," Digital Technical Journal, vol. 7, no. 3 (1995): 5-23.<br />
17. E. Benson et al., "Design of Digital's Parallel Software Environment," Digital Technical Journal, vol. 7, no. 3 (1995): 24-38.<br />
18. In the first implementations, the PCI MEMORY CHANNEL network adapter places a limit of 128 MB on the amount of MEMORY CHANNEL space that can be allocated.<br />
19. TruCluster MEMORY CHANNEL Software Programmer's Manual (Maynard, Mass.: ...)<br />
James V. Lawton<br />
... implementing UMP and adding support for collective operations to Digital PVM. Before<br />
that, he worked on the characterization and optimization of customer scientific/technical<br />
benchmark codes and on various hardware and software design projects. Prior to coming to<br />
Digital, Jim contributed to the design of analog and digital motion control systems and<br />
sensors at the Inland Motor Division of Kollmorgen Corporation. Jim received a B.E. in<br />
electrical engineering (1982) and an M.Eng.Sc. (1985) from University College Cork, Ireland,<br />
where he wrote his thesis on the design of an electronic control system for variable<br />
reluctance motors. In addition to receiving the Hewlett-Packard (Ireland) Award for<br />
Innovation (1982), Jim holds one patent and has published several papers. He is a member<br />
of IEEE and ACM.<br />
John J. Brosnan<br />
John Brosnan is currently a principal engineer in the Technical Computing Group, where he is<br />
project leader for Digital PVM. In prior positions, he worked on the High Performance<br />
Fortran test suite and was a significant contributor to a variety of publishing technology<br />
products. John joined Digital after receiving his B.Eng. in electronic engineering in 1986<br />
from the University of Limerick, Ireland. He received his M.Eng. in computer systems in<br />
1994, also from the University of Limerick.<br />
Morgan P. Doyle<br />
In 1994, Morgan Doyle came to Digital to work on the High Performance Fortran test suite.<br />
Presently, he is an engineer in the Technical Computing Group. Early on, he contributed<br />
significantly to the design and development of the TruCluster MEMORY CHANNEL Software, and<br />
he is now responsible for its development. Morgan received his B.A.I. and B.A. in electronic<br />
engineering (1991) and his M.Sc. (1993) from Trinity College Dublin, Ireland.<br />
Seosamh D. Ó Riordáin<br />
Seosamh Ó Riordáin is an engineer in the Technical Computing Group, where he is currently<br />
working on Digital MPI and on enhancements to the UMP library. Previously, he contributed<br />
to the design and implementation of the TruCluster MEMORY CHANNEL Software. Seosamh joined<br />
Digital after receiving his B.Sc. (1991) and M.Sc. (1993) in computer science from the<br />
University of Limerick, Ireland.<br />
Timothy G. Reddin<br />
... for which he holds two patents, and the design of an I/O and communications controller.<br />
Tim also worked at Raytheon on the data communications subsystem for the NEXRAD distributed<br />
real-time Doppler weather radar system ... Tim is a member of the British Computer Society<br />
and is a Chartered Engineer.<br />
The Design of User<br />
Interfaces for Digital<br />
Speech Recognition<br />
Software<br />
Digital Speech Recognition Software (DSRS) adds a new mode of interaction between people and<br />
computers: speech. DSRS is a command and control application integrated with the UNIX<br />
desktop environment. It accepts user commands spoken into a microphone and converts them<br />
into keystrokes. The project goal for DSRS was to provide an easy-to-learn and easy-to-use<br />
computer-user interface that would be a powerful productivity tool. Making DSRS simple and<br />
natural to use was a challenging engineering problem in user interface design. Also<br />
challenging was the development of the part of the interface that communicates with the<br />
desktop and applications. DSRS designers had to solve timing-induced problems associated<br />
with entering keystrokes into applications at a rate much higher than that at which people<br />
type. The DSRS project clarifies the need to continue the development of improved speech<br />
integration with applications as speech recognition and text-to-speech technologies become<br />
a standard part of the modern desktop computer.<br />
Bernard A. Rozmovits<br />
In the 1960s and early 1970s, people controlled computers using toggle switches, punched<br />
cards, and punched paper tape. In the 1970s, the common control mechanism was the keyboard<br />
on teletypes and on video terminals. In the 1980s, with the advent of graphical user<br />
interfaces, people found that a new mode of interaction with the computer was useful. The<br />
concept of a pointer, the mouse, evolved. Its popularity grew such that the mouse is now a<br />
standard component of every modern computer. In the 1990s, the time is right to add yet<br />
another mode of interaction with the computer. As compute power grows each year, the<br />
boundary of the man-machine interface can move from interaction that is native to the<br />
computer toward communication that is ... performs some actions. The action might be to<br />
launch an application, for example, in response to the command "bring up calendar"; or to<br />
type something, for example, in response to "edit to-do list," to invoke emacs<br />
\files\projectA\todo.txt. DSRS not only houses the speech macro capability but also provides<br />
a user interface, a speech recognition engine, and interfaces to the X Window System.<br />
Following is a high-level description of how the software functions. Commands are spoken<br />
into a microphone, and the audio is captured and digitized. The first step in the processing<br />
is the speech analysis system, which provides a spectral representation ... for each match.<br />
... the match is acceptable, DSRS retrieves keystrokes associated with each utterance, and<br />
the keystrokes are then sent into the system's keyboard buffer or to the appropriate<br />
application. For instances of continuous speech recognition, a sentence is recognized and<br />
keystrokes are concatenated to represent the utterance. For example, for the utterance "five<br />
two times seven three four equals," the keys "52 * 734 =" would be delivered to the<br />
calculator application.<br />
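The word-to-keystroke concatenation step can be sketched directly. The table below is illustrative; DSRS actually reads these translations from its vocabulary files, and the names here are invented for the example. Each recognized word is looked up and its keystrokes appended, producing the string delivered to the application in focus.

```c
#include <assert.h>
#include <string.h>

/* Map each recognized word to the keystrokes it produces.
 * (Illustrative table; DSRS stores such translations in its
 * vocabulary files.) */
struct entry { const char *word; const char *keys; };

static const struct entry vocab[] = {
    { "five", "5" }, { "two", "2" }, { "times", "*" },
    { "seven", "7" }, { "three", "3" }, { "four", "4" },
    { "equals", "=" },
};

/* Translate a recognized utterance (an array of words) into one
 * keystroke string, concatenating the per-word translations. */
void utterance_to_keys(const char **words, int n, char *out) {
    out[0] = '\0';
    for (int i = 0; i < n; i++)
        for (size_t j = 0; j < sizeof vocab / sizeof vocab[0]; j++)
            if (strcmp(words[i], vocab[j].word) == 0) {
                strcat(out, vocab[j].keys);
                break;
            }
}
```

For the paper's example utterance, the concatenation yields the key sequence 5 2 * 7 3 4 = for the calculator.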
Although this concept seems simple, its implementation raised significant system integration<br />
issues and directly affected the user interface design, which was critical to the product's<br />
success. This paper specifically addresses the user interface and integration issues and<br />
concludes with a discussion of future directions for speech recognition products.<br />
Project Objective<br />
The objective of the DSRS project was to provide a useful but limited tool to users of<br />
Digital's Alpha workstations running the UNIX operating system. DSRS would be designed as a<br />
low-cost speech recognition application and would be provided at no cost to workstation<br />
users for a finite period of time.<br />
When the project began in 1994, a number of command and control speech recognition products<br />
for PCs already existed. These programs were aimed at end users and performed useful tasks<br />
"out of the box," that is, immediately upon start-up. They all came with built-in vocabulary<br />
for common applications and gave users the ability to add their own vocabulary.<br />
On UNIX systems, however, speech recognition products existed only in the form of<br />
programmable recognizers, such as BBN Hark software. Our objective was to build a speech<br />
recognition product for the UNIX workstation that had the characteristics of the PC<br />
recognizers, that is, one that would be functional immediately upon start-up and would allow<br />
the nonprogrammer end user to customize the product's vocabulary.<br />
We studied several speech recognition products, including Talk->To from Dragon Systems,<br />
Inc., VoiceAssist from Creative Labs, Voice Pilot from Microsoft, and Listen from Verbex.<br />
We decided to provide users with the following features as the most desirable in a command<br />
and control speech recognition product:<br />
• Intuitive, easy-to-use interface<br />
• Speaker-independent models that would eliminate the need for extensive training<br />
• Speaker ...<br />
[Figure 1 is a block diagram: a microphone feeds the audio card; digitized audio passes to<br />
the front-end processor*, which produces feature packets for recognition against the speech<br />
models*; the Training Manager is reached through the user interface.]<br />
* Denotes a component licensed from Dragon Systems, Inc.<br />
Figure 1<br />
DSRS Architectural Block Diagram<br />
words can create models using the Vocabulary Manager user interface ... and acts as the<br />
agent to control focus in response to users' speech commands.<br />
The Vocabulary Manager user-interface window displays the current hierarchy of the finite<br />
state grammar file. The Vocabulary Manager permits the addition, deletion, and modification<br />
of words or macros. Also in this window, the command-utterance to keystroke translations<br />
are displayed, created, or modified.<br />
In the Training Manager user interface, the user may train newly created words or phrases<br />
in the user vocabulary files and retrain, or adapt, the product-supplied, independent<br />
vocabulary.<br />
The DSRS Implementation<br />
As the design team gained experience with the DSRS prototypes, we refined user procedures<br />
and interfaces. This section describes the key functions the team developed to ensure the<br />
user-friendliness of the product, including the first-time setup, the Speech Manager, the<br />
Training Manager, and the Vocabulary Manager. ... the following critical controls:<br />
• Microphone on/off switch.<br />
• A VU (volume units) meter that gives real-time feedback to the audio signal being heard.<br />
A VU meter is a visual feedback device commonly used on devices such as tape decks. Users<br />
are generally very comfortable using them.<br />
Figure 2<br />
DSRS Speech Manager Window<br />
... The Active vocabulary words are specific to the application in focus and change<br />
dynamically as the current application changes. The vocabularies are designed in this way<br />
so that a user can speak commands both within ... in focus. This context also determines<br />
the current state of the Active word list.<br />
• The history frame displays the word, phrase, or sentence last heard by the recognizer.<br />
The history frame is set up as a button. When pressed, it drops down to reveal the last 20<br />
recognized utterances.<br />
• A menu that provides access to the management of user files, the Vocabulary Manager ...<br />
a good training user interface.<br />
• Strong, clear indications that utterances are processed. We added a series of boxes that<br />
march across the screen at each utterance.<br />
• A glimpse of upcoming words. A list of words is displayed on the user interface and<br />
moves ... for example, Word 4 of 21.<br />
• Option to pause, resume, and restart training.<br />
• Large, bold font display of the word to be spoken and a small prompt, "Please continue,"<br />
displayed when the system is waiting for input.<br />
• Automatic addition of repeated utterances that<br />
Figure 3<br />
Training Manager Window<br />
bold font to differentiate it from the other elements in the window. To train the words,<br />
the user repeats an utterance from one to six times. The user must speak at the proper<br />
times to make training a smooth and efficient process. DSRS manages the process by<br />
prompting the speaker with visual cues. Right below the word is a set of boxes that<br />
represent the repetitions. The boxes are checked off as utterances are processed, providing<br />
positive visual feedback to the speaker. When one word is complete, the next word to be<br />
trained is displayed and the process is repeated. When all the words in the list are<br />
trained, the user saves the files, and DSRS returns to the Speech Manager and its active<br />
mode with the microphone turned off.<br />
Vocabulary Manager<br />
The Vocabulary Manager, an example of which is shown in Figure 4, enables users to modify<br />
speech ...<br />
[Figure 4 shows the Vocabulary Manager window with File, Vocabulary, and Help menus. The<br />
left pane lists the vocabularies: Always Active Vocab..., C Programming Mode, Calculator...,<br />
Calendar..., Calendar & Diary..., DtTerminal..., Emacs..., Emacs extensions..., Personal<br />
Emacs ex..., File Manager..., Mail..., Netscape..., Sentences for Switch..., Speech<br />
Manager.... The right pane lists the words in the selected vocabulary, for example: abort,<br />
backward, backward char, backwards a page, beginning of buffer, beginning of line, beginning<br />
of word, center on line, control g, copy region, delete backward char, delete character,<br />
delete line, delete previous word, delete region, delete word, edit declare project summary,<br />
Emacs extensions..., end of buffer. The status line reads "Number of active words in<br />
selected Vocabulary: 64."]<br />
Figure 4<br />
Vocabulary Manager Window<br />
The simple logic of this special function enhances<br />
user productivity. Often workstation and PC screens<br />
are littered with windows or application icons and<br />
icon boxes through which the user must search.<br />
Speech control eliminates the steps between the user<br />
thinking "I want the calculator" and the application<br />
being presented in focus, ready to be used. The DSRS<br />
team created a function called FocusOrLaunch, which<br />
implements the behavior described above. The function<br />
is encoded into the FSG continuous-switching<br />
mode sentences in the Always Active vocabulary<br />
associated with the spoken commands "switch to<br />
&lt;application&gt;," "go to &lt;application&gt;,"<br />
and just plain "&lt;application&gt;."<br />
Applications like calculator and calendar are not<br />
likely to be needed in multiple instances. However,<br />
applications such as terminal emulator windows are.<br />
DSRS defines the specific phrase "bring up &lt;application&gt;"<br />
to explicitly launch a new instance of the application;<br />
that is, the phrase "bring up &lt;application&gt;" is tied<br />
to a function named Launch. The phrases "next<br />
&lt;application&gt;" and "previous &lt;application&gt;"<br />
were chosen for navigating between instances of the<br />
same application. DSRS remembers the previous<br />
state of the application. For<br />
122 Digital Technical Journal Vol. 8 No. 2 1996<br />
instance, if the calendar application is minimized when<br />
the user says "switch to calendar," the calendar<br />
window is restored.
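The focus-or-launch behavior described above can be sketched as follows. The window-system helpers here (`launch`, the `windows` table, and the minimized/focused flags) are hypothetical stand-ins for the real window-management interface, not the actual DSRS implementation.

```python
# Sketch of a FocusOrLaunch-style function: if an instance of the
# application already exists, restore it (if minimized) and give it
# focus; otherwise launch a new instance. The window table below is
# an illustrative stand-in for the real window system.

windows = {}   # app name -> {"minimized": bool, "focused": bool}

def launch(app):
    """Start a new instance of the application (the 'bring up' case)."""
    windows[app] = {"minimized": False, "focused": True}

def focus_or_launch(app):
    win = windows.get(app)
    if win is None:
        launch(app)                  # no instance exists: launch one
        return "launched"
    if win["minimized"]:
        win["minimized"] = False     # restore a minimized window first
    win["focused"] = True            # then bring it into focus
    return "focused"

# "switch to calendar" while the calendar is minimized restores
# and focuses the existing window rather than launching a new one.
windows["calendar"] = {"minimized": True, "focused": False}
result = focus_or_launch("calendar")
```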
Version 1.1 allows the<br />
user to launch an audio-based application that will<br />
record, such as a teleconferencing session. Version 1.1<br />
also includes a function that allows the user to play<br />
a "wave," or digitized audio clip. Audio cues may thus<br />
be played as part of speech macros. The "say" command<br />
invokes DECtalk text-to-speech functionality<br />
so that audio events can be spoken!<br />
Since speech recognition is a st
In the implementation of DSRS, we encountered<br />
one unexpected problem. When a nested menu<br />
mnemonic was invoked, the second character was lost.<br />
The sequence of events was as follows:<br />
• A spoken word was recognized, and keystrokes<br />
were sent to the keyboard buffer.<br />
• The first character, ALT + &lt;key&gt;, acted normally<br />
and caused a pop-up menu to display.<br />
• The menu remained on display, and the last key was<br />
lost.<br />
We determined that the second keystroke was being<br />
delivered to the application before the pop-up menu<br />
was displayed. Therefore, at the time the key was<br />
pressed, it did not yet have meaning to the application.<br />
It is apparent that such applications are written for<br />
a human reaction-based paradigm. DSRS, on the<br />
other hand, is typing on behalf of the user at computer<br />
speeds and is not waiting for the pop-up menu to<br />
display before entering the next key.<br />
To overcome this problem, we developed a synchronizing<br />
function. Normally the Vocabulary Manager<br />
notation to send an ALT + f followed by an<br />
x would be ALT + fx. This new synchronizing function<br />
was designated as sALT + fx. The synchronizing<br />
function sends the ALT + f and then monitors events<br />
for a map-notify message indicating that the pop-up<br />
menu has been written to the screen. The character<br />
following ALT + f is then sent, in this case, the x.<br />
The synchronizing function also has a watchdog timer<br />
to prevent a hang in the event that the menu is never displayed.<br />
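The synchronizing function's send-wait-send pattern can be sketched as an event-driven wait. The event queue and `send_key` callback below are illustrative assumptions, not the actual DSRS interface; the watchdog is modeled with a blocking-get timeout.

```python
# Sketch of an sALT+fx-style synchronizing function: send the first
# keystroke, block until a map-notify event reports the pop-up menu is
# on screen, then send the remaining characters. The watchdog timeout
# prevents a hang if the menu never appears. The event queue and
# send_key callback are hypothetical stand-ins for the window system.
import queue

def send_keys_synchronized(send_key, events, first, rest, timeout=2.0):
    send_key(first)                           # e.g. "ALT+f" opens the menu
    try:
        event = events.get(timeout=timeout)   # watchdog timer on the wait
    except queue.Empty:
        return False                          # menu never displayed; give up
    if event == "map-notify":                 # menu has been written to screen
        for ch in rest:
            send_key(ch)                      # now safe to send the 'x'
        return True
    return False

# Simulated use: the window system posts "map-notify" once the menu maps.
sent = []
events = queue.Queue()
events.put("map-notify")
ok = send_keys_synchronized(sent.append, events, "ALT+f", "x")
```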
… in today's applications. A user agent or secretary program<br />
that could process common requests delivered<br />
entirely by speech is not out of reach even with the<br />
technology available today, for example:<br />
User: What time is it?<br />
Computer: It is now 1:30 p.m.<br />
User: Do I have any meetings today?<br />
Computer: Staff meeting is ten o'clock to twelve<br />
o'clock in the corner conference room.<br />
Computer: Mike Jones is e
References<br />
1. L. Rabiner and B. Juang, Fundamentals of Speech<br />
Recognition (Englewood Cliffs, N.J.: Prentice-Hall,<br />
Inc., 1993): 45-46.<br />
2. C. Schmandt, Voice Communication with Computers:<br />
Conversational Systems (New York, N.Y.: Van Nostrand<br />
Reinhold, 1994): 144-145.<br />
3. K.F. Lee, Large-Vocabulary Speaker-Independent<br />
Continuous Speech Recognition: The SPHINX System<br />
(Pittsburgh, Pa.: Carnegie-Mellon University Computer<br />
Science Department, April 1988).<br />
4. W. Hallahan, "DECtalk Software: Text-to-Speech Technology<br />
and Implementation," Digital Technical<br />
Journal, vol. 7, no. 4 (1995): 5-19.<br />
5. G. Lowney, The Microsoft Windows Guidelines for<br />
Accessible Software Design (Redmond, Wash.:<br />
Microsoft Development Library, 1995): 3-4.<br />
6. N. Yankelovich, G. Levow, and M. Marx, "Designing<br />
SpeechActs: Issues in Speech User Interfaces," Proceedings<br />
of ACM Conference on Computer-Human<br />
Interaction (CHI) '95: Human Factors in Computing<br />
Systems: Mosaic of Creativity, Denver, Colo. (May<br />
1995): 369-376.<br />
Biography<br />
Bernard A. Rozmovits<br />
During his tenure at Digital, Bernie Rozmovits has worked<br />
on both sides of computer engineering: hardware and software.<br />
Currently he manages Speech Services in Digital's<br />
Light and Sound Software Group, which developed the<br />
user interfaces for Digital's Speech Recognition Software<br />
and also developed the DECtalk software product. Prior<br />
to joining this software effort, he focused on hardware<br />
engineering in the Computer Special Systems Group<br />
and was the architect for voice-processing platforms in<br />
the Image, Voice and Video Group. Bernie received a<br />
Diplome D'Etude Collegiale (DEC) from Dawson<br />
College, Montreal, Quebec, Canada, in 1974. He<br />
holds a patent entitled "Data Format for Packets of<br />
Information," U.S. Patent No. 5,317,719.<br />
Further Readings<br />
The Digital Technical Journal is a refereed, quarterly<br />
publication of papers that explore the foundations of<br />
Digital's products and technologies. Journal content<br />
is selected by the Journal Advisory Board, and papers<br />
are written by Digital's engineers and engineering<br />
partners. Engineers who would like to contribute a<br />
paper to the Journal should contact the Managing<br />
Editor, Jane Blake, at Jane.Blake@ljo.dec.com.<br />
Topics covered in previous issues of the<br />
Digital Technical Journal are as follows:<br />
Digital UNIX Clusters/Object Modification Tools/<br />
eXcursion for Windows Operating Systems/<br />
Network Directory Services<br />
Vol. 8, No. 1, 1996, EY-U025E-TJ<br />
Audio and Video Technologies/UNIX Available Servers/<br />
Real-time Debugging Tools<br />
Vol. 7, No. 4, 1995, EY-U002E-TJ<br />
High Performance Fortran in Parallel Environments/<br />
Sequoia 2000 Research<br />
Vol. 7, No. 3, 1995, EY-T838E-TJ<br />
(Available only on the Internet)<br />
Graphical Software Development/Systems Engineering<br />
Vol. 7, No. 2, 1995, EY-U001E-TJ<br />
Database Integration/Alpha Servers & Workstations/<br />
Alpha 21164 CPU<br />
Vol. 7, No. 1, 1995, EY-T135E-TJ<br />
(Available only on the Internet)<br />
RAID Array Controllers/Workflow Models/<br />
PC LAN and System Management Tools<br />
Vol. 6, No. 4, Fall 1994, EY-T118E-TJ<br />
AlphaServer Multiprocessing Systems/<br />
DEC OSF/1 Symmetric Multiprocessing/<br />
Scientific Computing Optimization for Alpha<br />
Vol. 6, No. 3, Summer 1994, EY-S799E-TJ<br />
Alpha AXP Partners - Cray, Raytheon, Kubota/<br />
DECchip 21071/21072 PCI Chip Sets/<br />
DLT2000 Tape Drive<br />
Vol. 6, No. 2, Spring 1994, EY-F947E-TJ<br />
High-performance Networking/<br />
OpenVMS AXP System Software/<br />
Alpha AXP PC Hardware<br />
Vol. 6, No. 1, Winter 1994, EY-Q011E-TJ<br />
Software Process and Quality<br />
Vol. 5, No. 4, Fall 1993, EY-P920E-DP<br />
Product Internationalization<br />
Vol. 5, No. 3, Summer 1993, EY-P986E-DP<br />
Multimedia/Application Control<br />
Vol. 5, No. 2, Spring 1993, EY-P963E-DP<br />
DECnet Open Networking<br />
Vol. 5, No. 1, Winter 1993, EY-M770E-DP<br />
Alpha AXP Architecture and Systems<br />
Vol. 4, No. 4, Special Issue 1992, EY-J886E-DP<br />
NVAX-microprocessor VAX Systems<br />
Vol. 4, No. 3, Summer 1992, EY-J884E-DP<br />
Semiconductor Technologies<br />
Vol. 4, No. 2, Spring 1992, EY-L521E-DP<br />
PATHWORKS: PC Integration Software<br />
Vol. 4, No. 1, Winter 1992, EY-J825E-DP<br />
Image Processing, Video Terminals, and<br />
Printer Technologies<br />
Vol. 3, No. 4, Fall 1991, EY-H889E-DP<br />
Availability in VAXcluster Systems/<br />
Network Performance and Adapters<br />
Vol. 3, No. 3, Summer 1991, EY-H890E-DP<br />
Fiber Distributed Data Interface<br />
Vol. 3, No. 2, Spring 1991, EY-H876E-DP<br />
Transaction Processing, Databases, and<br />
Fault-tolerant Systems<br />
Vol. 3, No. 1, Winter 1991, EY-F588E-DP<br />
VAX 9000 Series<br />
Vol. 2, No. 4, Fall 1990, EY-E762E-DP<br />
DECwindows Program<br />
Vol. 2, No. 3, Summer 1990, EY-E756E-DP<br />
VAX 6000 Model 400 System<br />
Vol. 2, No. 2, Spring 1990, EY-C197E-DP<br />
Compound Document Architecture<br />
Vol. 2, No. 1, Winter 1990, EY-C196E-DP<br />
ISSN 0898-901X<br />
Printed in U.S.A. EC-N6992-18/96 9 14 20.0 Copyright © <strong>Digital</strong> Equipment Corporation